Show Code Cells

StopWordsAndClusterVisualisation

Finding The Words - Stop Words and Cluster Visualisation


This isn't going to be a particularly long post, but I wanted to look into a couple of things I didn't cover properly in my recent look at stemming in clustering algorithms.

Last time I looked at how stemming appeared to improve the clustering of articles, but was only really able to list the cluster articles to show what each cluster was discussing. I wanted to find a better way of visualising the clusters themselves, so have taken a bit of a look at WordClouds here to show how they could be used to show the 'shape' of a group of docs, as seen by the algorithm.

I also wanted to make a very basic reference to the fact that some words are more important that others, and, in the absence of a true attention-related model (we'll get there), I wanted to show that focussing the algorithm on more meaningful words should improve performance:

So with all of that in mind, for this piece I'm going to look at the following:

  • Cluster Visualisation
    • Wordclouds as a means of visualising clusters
    • Qualitatively analysing some of our news clusters
  • Removing Low-Meaning Words
    • Stopwords
    • Measuring improvements in word frequency clustering

First things first...

I'm not going to rewrite all the code from scratch, sorry.

This blog is about building on what I've learnt, and frankly everyone should try to minimise the amount of work they need to do to get the results they want. This doesn't mean writing lazy code, because lazy means bad and bad always needs fixing in the long run... This means neat and efficient, and (hopefully) well-annotated for later.

For this post, I copied over the code from my previous work on Word Roots and Associations, and converted it into python modules. I'll access these throughout, but any new work, I'll write out in the cells of this notebook.

The modules and any other code I've written can always be accessed on Github.

lazy means bad and bad always needs fixing in the long run...

Getting stuck in

Finally, in the cells below I've run the functions to gather the latest news articles (using NewsAPI), scrape their contents (using BeautifulSoup and requests), preprocess them into lists of cleaned, individual words (using PyStemmer, among others), and performed simple k-means clustering on them (I'll try a different clustering approach soon, I promise):

In [1]:
import DataAccess
import Preprocessing
import Vectors
import Clustering
In [2]:
rawArticles = DataAccess.getArticles()
articles = Preprocessing.preprocessArticles(rawArticles)
stemmedArticles = Preprocessing.stemArticles(articles)
In [5]:
vectorisedStemmedArticles, stemmedVocabulary = Vectors.vectoriseArticles(stemmedArticles)
In [9]:
K=20
G=100
articleCentroidIds,centroids,performance = Clustering.kMeansCluster(vectorisedStemmedArticles,K,G)
0	(159, 19470)
1	(1, 19470)
2	(5, 19470)
3	(37, 19470)
4	(90, 19470)
5	(2, 19470)
6	(1, 19470)
7	(1, 19470)
8	(3, 19470)
9	(25, 19470)
10	(4, 19470)
11	(3, 19470)
12	(2, 19470)
13	(3, 19470)
14	(22, 19470)
15	(1, 19470)
16	(5, 19470)
17	(115, 19470)
18	(1, 19470)
19	(3, 19470)
Last Generation:	100

We're now in roughly the position we were towards the end of the last post.

Now onto the interesting stuff...

Cluster Visualisation

Word Clouds

I'd been trying to think of an intuitive way to show the shape of each cluster, rather than just listing the texts within it (which is ugly, boo). While I could measure the relatived distances within each cluster to plot them (and I still might) the most fitting visualisation of what each one actually said would have to be a bit less scientific. For this I think the best answer is to use word clouds.

I'm a little bit of a snob about visualising data, and word clouds very much fall into the "soft" end of that. There's nothing to really measure from them, it's quite hard to compare between them, and they only really look at one component of the document, word frequency.

However, word frequency is exactly what I'm interested in here, and I don't want to make hard measurements (yet) at the expense of any intuition of the contents of each cluster.

The package I'm using for this is called, reassuringly, wordcloud, and has the ability to generate sophisticated visualisations from the text provided. All I'm interested in however, is a simple blob of words, where the size of each word is equivalent to the number of times it appears in a cluster.

To install the module in your python environment, run the following pip command in a terminal:

pip install wordcloud

To present the wordcloud content, I'm using the Matplotlib library.

[Extra] Plotting Word Clouds

The cells below contain the function definitions I used to generate the wordclouds. I also built a couple of helper functions to manage the contents of the clusters.

In [61]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import numpy as np
In [123]:
"""returns all articles in a cluster"""
def ListClusterTexts(articles,articleCentroidIds,K) :
    return [articles[i] for i in range(articleCentroidIds.shape[0]) if articleCentroidIds[i]==K]

"""concatinates all tokenised article texts in an article cluster into a single pseudo-natural text"""
def ConcatinateClusterTexts(articles,articleCentroidIds,K) :
    clusterText = ''
    for article in ListClusterTexts(articles,articleCentroidIds,K) :
        clusterText+=' '.join(article)
    return clusterText

"""counts the number of articles in a cluster"""
def CountClusterArticles(articles,articleCentroidIds,K) :
    return len(ListClusterTexts(articles,articleCentroidIds,K))

"""creates a WordCloud object from natural text, which can be cast as an image or array of word frequencies"""
def CreateWordCloud(text) :
    #removes STOPWORDS from the chart to make more readable
    return WordCloud(stopwords=Preprocessing.stemText(STOPWORDS|{'endofsen','endofpar','said','say','will'}),
                     background_color="white",
                     width=500,
                     height=500                    ).generate(text)

"""converts natural text into a WordCloud object, and plots it using Matplotlib"""
def PlotWordCloud(text) :
    wordcloud = CreateWordCloud(text)
    # Display the generated image:
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.show()

"""takes an array of cluster IDs and converts it into an array of wordclouds from the text within each cluster"""
def PlotClusterWordCloudArray(articles,articleCentroidIds,Ks) :
    fig, axes = plt.subplots(Ks.shape[0], Ks.shape[1], figsize=(12,12))
    for i in range(Ks.shape[0]) :
        for j in range(Ks.shape[1]) :
            axes[i, j].imshow(CreateWordCloud(ConcatinateClusterTexts(articles,articleCentroidIds,Ks[i,j])))
            axes[i, j].axis("off")
            axes[i, j].set_title("Cluster "+str(Ks[i,j])+"; count="+str(CountClusterArticles(articles,articleCentroidIds,Ks[i,j])))

What's the word?

Using these functions I can generate an array of word clouds to help see what vocabulary the clusters contain. Based on the output from the clustering algorithm, we can see that some clusters have more texts than others. This amy affect the quality or visualisation of each cluster, so in the interest of transparency I've selected clusters (see the Ks array) with a variety of sizes.

In [127]:
Ks = np.array([[0, 3, 4], [9, 14, 17], [2, 10, 16]])

PlotClusterWordCloudArray(stemmedArticles,articleCentroidIds,Ks)

I think this is great!

The word clouds demonstrate the main vocabularies of each individual cluster. Clusters 14 and 17 are very strong clusters, in that they all share a very similar, small vocabulary. Unfortunately for us, while they are very similar, it is not for very interesting reasons (namely the BBC World Bulletin and all articles that have been protected by Captcha).

However, the vocabulary of the other clusters gives you some indication that the articles within them share a common theme or at least common language. Cluster 9 is clearly sports, Cluster 3 is crime and policing, Cluster 4 is probably media, and so on.

This doesn't map cleanly onto our expectations of individual stories, but we are only looking at vocabulary, and without paying any attention to which words are more or less significant. There are a few clusters that seem to be more exclusive (see the bottom row of the charts above), but there is still no indication that these represent specific new events.

the articles within them share a common theme or at least common language

Stopwords

Less is sometimes more

It's worth noting that I had to clean the cluster texts further in order to make these word clouds. The initial clouds contained many stopwords (such as "and", "then", "he", "she", "it"), that carry meaning within a sentence, or even an n-gram, but in the context of comparing vocabulary, are just additional noise that the algorithm has to work through.

While I removed the stop words from the results, I didn't remove them from the inputs, and because they are so frequent, this will have affected the clustering.

In the cells below I generate a new vocabulary by removing the stopwords, and create a new article matrix by removing the associated columns. Even though this only reduces the vocabulary by ~100 words, it has a significant effect on the makeup fo the clusters:

In [8]:
vectorisedStemmedArticles.shape
Out[8]:
(483, 19470)
In [103]:
stopWords = set(Preprocessing.stemText(STOPWORDS|{'endofsen','endofpar','said','say','will'}))
vocabStopWordIndexes = [i for i in range(len(stemmedVocabulary)) if stemmedVocabulary[i] in stopWords]
vocabGoWordIndexes = [i for i in range(len(stemmedVocabulary)) if stemmedVocabulary[i] not in stopWords]

goStemmedVocabulary = [stemmedVocabulary[i] for i in vocabGoWordIndexes]
vectorisedGoStemmedArticles = vectorisedStemmedArticles[:,vocabGoWordIndexes]

vectorisedGoStemmedArticles.shape
Out[103]:
(483, 19304)

Removing the stopwords reduced the vocabulary from 19470 to 19304, but it had a larger effect on the distribution of articles in the clusters generated by the k-Means algorithm. In the cell below, the resulting clusters each contain a more balanced number of articles than with the previous vocabulary.

This is not necessarily solely because we removed the stopwords, the k-means algorithm is by no means infallible and results can depend heavily on the starting conditions. That said, having run it a few times, the balance between the clusters remains relatively consistent:

In [104]:
K=20
G=100
goArticleCentroidIds,goCentroids,goPerformance = Clustering.kMeansCluster(vectorisedGoStemmedArticles,K,G)
0	(20, 19304)
1	(34, 19304)
2	(10, 19304)
3	(45, 19304)
4	(16, 19304)
5	(5, 19304)
6	(23, 19304)
7	(20, 19304)
8	(14, 19304)
9	(48, 19304)
10	(27, 19304)
11	(23, 19304)
12	(123, 19304)
13	(1, 19304)
14	(28, 19304)
15	(8, 19304)
16	(16, 19304)
17	(6, 19304)
18	(6, 19304)
19	(10, 19304)
Last Generation:	100

Having removed the stopwords and by plotting the resulting wordclouds, it's hard to see whether this balancing of cluster sizes improved the quality of the clusters themselves.

The language in the clusters still appears to be self-consistent, and the language between clusters appears to be distinct, and that's about all I can really say. that siad, is it just me or do the clusters feel more 'specific'? Brexit is a topic now, in a cluster of 27 articles, whereas before there was just a general sense of politics?

In [105]:
goKs = np.array([[0, 1, 2], [3, 4, 6], [9, 10, 11]])
PlotClusterWordCloudArray(stemmedArticles,goArticleCentroidIds,goKs)

Measuring Cluster Quality

In my first NLP post, I tinkered with plotting clusters on an array of 2D measures (distances from the centroid article, if I remember correctly) and generating quantitative measures of clustering. The latter would appear to be useful now, and now that I've started vectorising documents, doesn't require me to implement kernel tricks to perform:

In [137]:
#All vocabulary - between cluster mean distance 
Clustering.meanInterCentroidDistance(centroids)
Out[137]:
0.3208521342912205
In [139]:
#All vocabulary - within cluster mean distance 
Clustering.meanIntraCentroidDistance(centroids)
Out[139]:
0.00034530422084758826
In [134]:
#Filtered vocabulary - between cluster mean distance 
Clustering.meanInterCentroidDistance(goCentroids)
Out[134]:
0.5192691050900048
In [135]:
#Filtered vocabulary - within cluster mean distance 
Clustering.meanIntraCentroidDistance(goCentroids)
Out[135]:
0.00023574978402225204

From the measures above, the clusters generated using all vocabulary are, on average, less far apart and less concentrated than the the clusters generated using the vocabulary with the stopwords removed. This points to the conculsion that removing stopwords increases the quality of text clustering.

As I've talked about before, there are lots of problems with using the ratio of the inter-/intra-centroid distances, but this fits with intuition, and I'm inclined to believe it.

[Extra] Proving the point

A more scientific method, given the difference in cluster quality from one run of the k-means algorithm to the next, would be to perform clustering 30 times using each method, and compare the difference. I've run this analysis in the cells below, and you can see that difference made by removing stopwords makes a much bigger difference that the run-by-run variation in the k-means algorithm results:

In [237]:
K=20
G=10
T=30

def testClusteringQuality(K,G,T,vectorisedArticles) :
    results=[]
    for i in range(T) :
        _articleCentroidIds,_centroids,_performance = Clustering.kMeansCluster(vectorisedArticles,K,G)
        interCentroidDistance=Clustering.meanInterCentroidDistance(_centroids)
        intraCentroidDistance=Clustering.meanIntraCentroidDistance(_centroids)
        results.append([i,interCentroidDistance,intraCentroidDistance,intraCentroidDistance/interCentroidDistance])
    return np.vstack(results)

allVocabularyResults = testClusteringQuality(K,G,T,vectorisedStemmedArticles)
stopwordFilteredResults = testClusteringQuality(K,G,T,vectorisedGoStemmedArticles)
In [248]:
results=np.concatenate([allVocabularyResults[:,1:],stopwordFilteredResults[:,1:]],1)

labels=["All Vocab","No Stopwords"]

fig, axes = plt.subplots(1, 3, figsize=(14,3))
axes[0].boxplot(np.vstack([results[:,i] for i in [1,3]]).T, showfliers=False, labels=labels)
axes[0].set_title("Inter-Cluster Distance")
axes[1].boxplot(np.vstack([results[:,i] for i in [2,4]]).T, showfliers=False, labels=labels)
axes[1].set_title("Intra-Cluster Distance")
axes[2].boxplot(np.vstack([results[:,i] for i in [3,5]]).T, showfliers=False, labels=labels)
axes[2].set_title("Distance Ratio")

plt.show()

[Extra] Mean comparison of Distance Ratio

It's hardly necessary, but to really hammer home the point, I calculated the t statistic of the difference between the means below:

In [249]:
## All vocab
mu_1=np.mean(results[:,3])
sigma_1=np.std(results[:,3])
n_1=len(results[:,3])

## No stopwords
mu_2=np.mean(results[:,5])
sigma_2=np.std(results[:,5])
n_2=len(results[:,5])

## t stat
SE = np.sqrt((sigma_1/n_1) + (sigma_2/n_2))
t = (mu_1 - mu_2) / SE

print("t Statistic: "+str(t))
t Statistic: 216.5973370518503

Summary

So there you have it. We've really discuessed two main points here:

  • Word clouds are a good way of visualising the cluster but they aren't helpful for measuring cluster quality
  • The Inter-/Intra-Cluster distance ratio is a good way of measuring improvements in clustering, especially if run for a series of tests

This has been a little bit of an aside, but I want to focus on using Word Associations and Word Embeddings to improve cluster quality. Now that we have a reasonable means of visualising clusters and measuring cluster quality, the process will hopefully become a bit more scientific.

Thanks for taking the time to read this, I hope you've found it interesting. as always the code can be found on Github