

Finding The Words - Word Roots and Associations in Vectorised NLP


In memory of William Fairhurst, my Grandad (and proofreader) - 250 thousand miles to the moon, 93 million miles to the sun

Introduction

In a previous post on simple NLP using R, we developed an algorithm that used frequency analysis of the vocabulary in different texts to cluster those texts together. We looked at a couple of improvements on this theme, including using w-shingling to compare more complex word combinations rather than just vocabulary.

In this post I want to develop a couple of those ideas further, and address a few of the shortcomings of those approaches; namely that they are relatively lightweight, and attempt a rather superficial form of unsupervised learning.

Word Roots and Associations

An immediate downside to comparing complete words and vocabulary within texts is that your algorithm is at the mercy of the author: idioms, favoured words, spelling conventions (think 'colour' vs 'color'), spelling mistakes... If an algorithm doesn't know how to recognise such variations as similar (i.e. hasn't been explicitly coded, or taught, to do so), they all add up to a lot of noise in the signal you are trying to process. This can be a serious problem in condensed texts with very few words to go by, such as news articles.

This is an important point to consider when you're working on an NLP problem. Do I try to solve it by preprocessing the data before it reaches my algorithm? Or do I try an algorithm that's more robust to some of these issues, say one that focusses on strings of characters rather than complete words? I think both are viable options, depending on your goals, but for the purposes of this post I'm going to focus on the former. In short, because I'm more interested in document-level text comparison, a character-level algorithm is likely to be overkill. Added to this, I'm a big believer in a plug-and-play approach to programming, whereby the different components of an algorithm can be separated (see pre-processing), upgraded, replaced, generally messed with, forgotten, and even reintroduced at a later date, without affecting any code elsewhere in the project.


Variety is the Spice of Life

A lot of work has been published around the idea of cleaning or normalising individual words before they're fed into an algorithm. This is often referred to as stemming or lemmatizing the text, and consists of either simply removing suffixes until only the core of a word remains (stemming), or performing a more complex analysis over a large vocabulary in order to group the various forms of a word together (lemmatization).
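
If you want a quick side-by-side of the two approaches, NLTK bundles both a Porter stemmer and a WordNet-based lemmatizer. The sketch below is purely illustrative (I'll be using a different stemming library later, and you may need to run nltk.download('wordnet') first):

# contrast stemming (suffix stripping) with lemmatization (vocabulary lookup)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stem = PorterStemmer().stem
lemmatize = WordNetLemmatizer().lemmatize

print(stem('studies'), lemmatize('studies'))         # e.g. 'studi' vs 'study'
print(stem('better'), lemmatize('better', pos='a'))  # e.g. 'better' vs 'good'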

For the purposes of demonstration, I'm going to use the Porter Stemmer algorithm to clean up our vocabulary. The Porter Stemmer works by applying a set of rules, or heuristics, in a fixed order, so as to reduce as many words as possible to a 'correct' word stem. However, the English language is rather varied - a product of its many historical influences, and of England's more recent history of global trade, colonialism, and various forms of cultural osmosis (see 'pukka', 'tattoo', 'chit') - which means that such a set of rules will never be perfect.
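
To give a flavour of what those ordered rules look like, here's a toy sketch of (roughly) the first step of the Porter algorithm - the real thing applies many more rules, each with conditions on what remains of the word:

# a toy version of (roughly) Porter step 1a - purely illustrative, not the real algorithm
def porter_step_1a(word):
    if word.endswith('sses'):
        return word[:-2]    # caresses -> caress
    if word.endswith('ies'):
        return word[:-2]    # ponies -> poni
    if word.endswith('ss'):
        return word         # caress -> caress
    if word.endswith('s'):
        return word[:-1]    # cats -> cat
    return word

print([porter_step_1a(w) for w in ['caresses', 'ponies', 'caress', 'cats']])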

There are lots of different stemming algorithms, but this one is the most widely known, and has the advantage of being published in its own library: Snowball.

"We don't just borrow from other languages; English pursues them down alleyways, beats them unconscious, and rifles their pockets for loose grammar."

Getting to work

For the purposes of this blog, Python is the programming language of choice. Most of my previous NLP work has been in R; however, I'm keen to make use of the wide variety of tools in the Python community, especially libraries such as NumPy, Plotly, the Natural Language Toolkit (NLTK) and, later perhaps, Keras for deep learning.

While the Snowball library is included in NLTK, it is also available in a lighter Python wrapper called 'PyStemmer'. As with all of these third-party libraries, you'll have to install it in addition to your Python installation if you're following along at home: open a terminal (cmd on Windows), run pip install PyStemmer, and then import it into the relevant Python script as usual.

Believe it or not, this will also be one of my first attempts to build a Python notebook from scratch, so let me know if you find anything that's bad practice or poorly executed. I've been working with C#, a broadly similar language, for my whole career, and I've had plenty of experience reading, contemplating, and occasionally fixing other people's Python, but I've never had the joy of starting from a blank piece of screen!

Anyway, here goes... In the script below, we're going to import the Snowball/PyStemmer library containing the Porter Stemmer algorithm and apply it to a set of similar-sounding words. Note also that we have to choose the language that we're interested in. This might seem obvious, but it highlights something worth remembering: the algorithm uses a set of fixed rules based on the cases, tenses and plural forms of various words, and these will obviously change depending on the language or dialect used.

In [1]:
import numpy as np
import Stemmer as ps

# list available stemmers
print('Stemmer Languages available:\t',ps.algorithms())
Stemmer Languages available:	 ['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']
In [2]:
# create stemmer class
stemmer = ps.Stemmer('english')

# stemmer examples
rawWords = ['general','generalizing','generalization','generalise','generalising','generalisation']

print('Raw Word\t Stemmed Word')
for w in rawWords :
    print(w,'\t',stemmer.stemWord(w))
Raw Word	 Stemmed Word
general 	 general
generalizing 	 general
generalization 	 general
generalise 	 generalis
generalising 	 generalis
generalisation 	 generalis

... so it doesn't always work perfectly; see above, where it treats the British and American spellings as having separate word stems. This is because the stemming algorithm uses human-defined rules rather than, say, learning the relationships for itself. Clearly, there are only so many exceptions that can be accounted for, and the specific rules that would stem the British spelling 'generalise' to the stem 'general' are either too complicated to include or simply haven't been added yet.
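
One crude workaround (which I won't use in the rest of this post) is to normalise a handful of known British suffixes to their American forms before stemming. The hypothetical helper below handles the '-ise' family, but it would also mangle words like 'promise' or 'wise', so treat it as illustration only:

import re
import Stemmer as ps

stemmer = ps.Stemmer('english')

# crude British -> American suffix mapping, applied before stemming
def normaliseSpelling(word):
    word = re.sub(r'isation$', 'ization', word)
    word = re.sub(r'ising$', 'izing', word)
    word = re.sub(r'ise$', 'ize', word)
    return word

for w in ['generalise', 'generalising', 'generalisation']:
    print(w, '\t', stemmer.stemWord(normaliseSpelling(w)))
# each should now reduce to 'general', matching the American stems above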

[Extra] JSON, NewsAPI and building a news dataset

In [3]:
from newsapi import NewsApiClient

news = NewsApiClient(api_key='YOUR_API_KEY')  # replace with your own NewsAPI key

all_news = []
for i in range(1,11):
    all_news.append(news.get_everything(sources='bbc-news,the-verge,abc-news,ary news,associated press,wired,aftenposten,bbc news,bild,blasting news,bloomberg,business insider,engadget,google news,the verge',
                                        from_param='2019-02-20',
                                        to='2019-03-20',
                                        language='en',
                                        page_size=100,
                                        page=i))
In [4]:
import requests
from bs4 import BeautifulSoup as bs

def scrapeArticleUrl(url):
    page = requests.get(url)
    soup = bs(page.text, 'html.parser')

    for x in soup("script"): x.decompose()
    for x in soup("meta"): x.decompose()
    for x in soup("link"): x.decompose()
    for x in soup("span"): x.decompose()
    for x in soup("header"): x.decompose()
    for x in soup("nav"): x.decompose()
    for x in soup("li"): x.decompose()

    #print(soup)
    return [ p.get_text()+' endofpar ' for p in soup('p') if len(p.get_text().split(' '))>1]

rawArticles=[]
for news in all_news:
    rawArticles=rawArticles+news['articles']

# dict comprehension syntax - similar to usage in f#, or the foreach library in R
rawArticles = {i : rawArticles[i] for i in range(len(rawArticles))}

print(rawArticles[1])

rawContents = scrapeArticleUrl(rawArticles[1]['url'])
print(rawContents)
{'source': {'id': 'bbc-news', 'name': 'BBC News'}, 'author': None, 'title': 'England v Czech Republic: Go with gut instinct when choosing country, says Michael Keane', 'description': 'Players should "go with their instinct" when deciding which country to play for, England defender Michael Keane tells BBC Radio 5 Live.', 'url': 'https://www.bbc.co.uk/sport/football/47635822', 'urlToImage': 'https://ichef.bbci.co.uk/onesport/cps/624/cpsprodpb/13006/production/_106103877_keane_rex2.jpg', 'publishedAt': '2019-03-20T19:00:15Z', 'content': 'Declan Rice made three non-competitive appearances for the Republic of Ireland before choosing to play for England\r\nPlayers should "go with their instinct" when deciding which country to play for, England defender Michael Keane has told BBC Radio 5 Live.\r\nThe… [+1456 chars]'}
['Share this with endofpar ', 'Players should "go with their instinct" when deciding which country to play for, England defender Michael Keane has told BBC Radio 5 Live. endofpar ', "The Everton centre-back played for the Republic of Ireland at youth level, as did West Ham's Declan Rice, who has now declared his allegiance to England. endofpar ", 'Last week, manager Gareth Southgate declared that more than 50% of England under-16s have dual nationality. endofpar ', '"You\'ve got to go with where you feel like you belong," said Keane. endofpar ', '"I always thought I belonged with England and that\'s why I have always dreamed of playing for England.  endofpar ', '"When I was at Ireland, I wasn\'t good enough to play for England at that time. I was only young and small and still developing. I had in the back of my head that hopefully one day I could play for England. endofpar ', '"I can\'t really give them too much advice; I think you have just got to go on your gut feeling. Obviously you have to see how you\'re performing week in week out and where you think you could end up in your career." endofpar ', "In-form Crystal Palace defender Aaron Wan-Bissaka is one of those players with dual nationality who has yet to be given a senior cap by Southgate. DR Congo's manager Florent Ibenge told the Independent that he wanted the 21-year-old to commit to them. endofpar ", "Meanwhile, Southampton keeper Angus Gunn, who has been previously been called up to Southgate's squad, could also play for Scotland. His father Bryan, a former goalkeeper, played for Scotland in the early 1990s. endofpar ", 'England begin their Euro 2020 qualifying campaign against the Czech Republic on Friday followed by a match in Montenegro next Monday. endofpar ', 'Share this with endofpar ', 'Get latest scores and headlines sent straight to your phone, sign-up to our newsletter and learn where to find us on online. endofpar ', 'Find ways to get active endofpar ', 'How to get involved in just about any sport or activity endofpar ', 'Find a club, activity or sport near you endofpar ']
In [5]:
import re

def preprocessArticles(articles) :
    reEndOfSentence = re.compile('\\. ')
    reNonAlphaNumeric = re.compile('\'s|/\n/|/\t/|[\W]+ | ')
    processedArticles = {}
    for i,article in articles.items() :
        # collect contents
        contents = [str(article['description'])+' ']+scrapeArticleUrl(article['url'])
        # convert contents into words
        words = []
        for text in contents:
            # convert to lower case
            text = text.lower()
            # replace full stops with ENDOFSEN
            text = reEndOfSentence.sub(' endofsen ',text)
            # split text into individual words
            newWords = text.split(' ')
            # remove remaining non-alphanumerics
            newWords = [reNonAlphaNumeric.sub('',word) for word in newWords]
            words = words + [word for word in newWords if word!='']

        # combine into list of processed articles
        processedArticles[i] = words

    return processedArticles

def stemText(words) :
    return [stemmer.stemWord(word) for word in words]

def stemArticles(articles) :
    return {i:stemText(words) for i,words in articles.items()}


articles = preprocessArticles(rawArticles)
stemmedArticles = stemArticles(articles)

print('\nRaw Description:')
print(rawArticles[1]['description'])
print('\nPreprocessed Description:')
print(articles[1][1:20])
print('\nStemmed Description:')
print(stemmedArticles[1][1:20])
Raw Description:
Players should "go with their instinct" when deciding which country to play for, England defender Michael Keane tells BBC Radio 5 Live.

Preprocessed Description:
['should', '"go', 'with', 'their', 'instinct"', 'when', 'deciding', 'which', 'country', 'to', 'play', 'for,', 'england', 'defender', 'michael', 'keane', 'tells', 'bbc', 'radio']

Stemmed Description:
['should', '"go', 'with', 'their', 'instinct"', 'when', 'decid', 'which', 'countri', 'to', 'play', 'for,', 'england', 'defend', 'michael', 'kean', 'tell', 'bbc', 'radio']

Vectorised Representations

In my previous post, we built an algorithm that looked at the differences between articles in order to cluster them together. In that example, all we needed was a difference score between the vocabulary used in any two texts. This was calculated as a Jaccard Distance over the vocabulary and n-grams of the two articles being compared.

We never had to describe an article outside of a pairwise comparison. This simplified matters for us, but it did limit how the articles could be encoded.
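
For reference, the heart of that comparison boiled down to something like this (a sketch from memory - the original post also folded in w-shingles/n-grams):

# Jaccard distance between the vocabularies of two texts
def jaccardDistance(wordsA, wordsB):
    a, b = set(wordsA), set(wordsB)
    return 1 - len(a & b) / len(a | b)

print(jaccardDistance(['the', 'cat', 'sat'], ['the', 'dog', 'sat']))  # 0.5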

An alternative method is to build vectorised representations of the articles. In this example the process is very simple: we generate a vector describing the frequency distribution of the vocabulary used within a text, where each element of the vector is the frequency of one word within the article.

This is simple enough; however, because the same vector space needs to be used to represent all articles, there must be one element for each word in the English language, or at the very least for every word in the vocabulary of the entire corpus! This means that we will be dealing with an extremely high-dimensional vector space, which can cause significant computational overheads, and real statistical problems when the number of examples n is much smaller than the number of dimensions D.
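
One common way of keeping that overhead manageable - not something I'll need in this post, but worth knowing about - is to store the vectors sparsely, since any one article only touches a tiny fraction of the vocabulary. A minimal sketch using scipy:

import numpy as np
from scipy.sparse import csr_matrix

# toy vocabulary and a single short 'article'
vocabulary = {'cat': 0, 'sat': 1, 'mat': 2, 'dog': 3}
article = ['cat', 'sat', 'mat', 'cat']

# dense bag-of-words vector: one element per vocabulary word
dense = np.zeros(len(vocabulary))
for word in article:
    dense[vocabulary[word]] += 1

sparse = csr_matrix(dense)  # stores only the non-zero counts and their positions
print(dense)                # [2. 1. 1. 0.]
print(sparse.nnz)           # 3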

Too much of a good thing

It is worth noting that in the Jaccard Distance example, we effectively compared documents in the subspace that one or other of the two articles inhabited, which made the calculation of document similarity much more efficient. While it would be computationally expensive to re-vectorise our articles at every comparison just to reduce the overall vocabulary, we may be able to achieve something similar by clipping the vectors we are interested in. However, one of the main benefits of vectorisation is the ability to group all of your vectors into matrices, and benefit from the highly optimised linear algebra packages available in Python.
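
Just to make that clipping idea concrete (using the vectorisedArticles matrix built further down), comparing two articles only over the dimensions that either of them actually uses gives exactly the same cosine similarity for a fraction of the arithmetic - at the cost of giving up the single big matrix multiplication:

import numpy as np

# restrict the comparison to the dimensions either article uses; the zero
# elements contribute nothing to the dot product, so the result is unchanged
def clippedCosineSimilarity(vecA, vecB):
    support = np.union1d(np.nonzero(vecA)[0], np.nonzero(vecB)[0])
    a, b = vecA[support], vecB[support]
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# e.g. clippedCosineSimilarity(vectorisedArticles[0], vectorisedArticles[1])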

Finally, because stemming makes similar words match with each other, it should reduce the size of the overall vocabulary a little, but we will still find that our vectors have tens of thousands of dimensions.

See below for code to build a vocabulary from the corpus, and then vectorise individual texts according to that vocabulary:

In [6]:
def buildVocabulary(processedArticles):
    # generate list of all vocabulary
    vocabulary = []
    for i,article in processedArticles.items():
        vocabulary = vocabulary + article

    vocabulary = list(set(vocabulary))
    vocabulary.sort(key=str)

    #return as dictionary
    return {i:vocabulary[i] for i in range(len(vocabulary))}

def vectoriseText(vocabToIndexMap,article):
    # encode words as their index in the vocabulary (e.g. possibly 'aardvark' => 23)
    indexArray=[vocabToIndexMap[word] for word in article]
    # count how often each vocabulary word occurs, then normalise the vector to unit length
    frequencyVector=[float(np.sum([i==j for j in indexArray])) for i in range(len(vocabToIndexMap))]
    return frequencyVector/np.linalg.norm(frequencyVector)

def vectoriseArticles(processedArticles):
    # build vocabulary
    vocabulary = buildVocabulary(processedArticles)
    # create map of word to vocabulary index
    vocabToIndexMap={w:i for i,w in vocabulary.items()}
    # convert to 2d numpy array of vectors
    vectorisedArticles = np.vstack([vectoriseText(vocabToIndexMap,article) for i,article in processedArticles.items()])
    return vectorisedArticles, vocabulary


vectorisedArticles, vocabulary = vectoriseArticles(articles)
vectorisedStemmedArticles, stemmedVocabulary = vectoriseArticles(stemmedArticles)

print('Original Vocabulary:\t'+str(vectorisedArticles.shape))
print('Stemmed Vocabulary:\t'+str(vectorisedStemmedArticles.shape))
Original Vocabulary:	(1000, 32938)
Stemmed Vocabulary:	(1000, 26908)

The numbers above show the size of our matrices. Dimension 1, the number of texts, should equal 1000 in both cases, while dimension 2, the number of elements in each vector, will vary depending on the size of the vocabulary used in each case.

As expected, the original (untreated) vocabulary is significantly larger than the stemmed vocabulary.

Now, each of these sets of vectorisedArticles is a table, or matrix, with one row for every article and one column for every word in the vocabulary. The value in each cell, or element, reflects how often the word (column) appears in the article (row); note that because each row has been normalised to unit length, these values are scaled frequencies rather than raw counts.
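
As a quick sanity check (assuming the stemmed word 'england' made it through preprocessing, which it should have given the articles above), you can look a word up in the vocabulary and read off its value for a given article:

# map each stemmed word back to its column index, then inspect article 1
wordToIndex = {w: i for i, w in stemmedVocabulary.items()}
col = wordToIndex.get('england')  # .get in case the word is absent
if col is not None:
    print(vectorisedStemmedArticles[1, col])  # a normalised frequency, not a raw count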

[Extra] Performance Tuning

To give a measure of the performance gain from stemming - quite apart from its value in recognising similarity between related words - I've included timings of a dot product calculation, the same as used for Cosine Similarity calculations, between all article vector combinations (here as a single matrix multiplication).

In [8]:
import time

def singleCalculationTime(vectorisedArticles):
    timeStart=time.process_time()
    totalDistances=np.dot(vectorisedArticles,vectorisedArticles.T)
    timeEnd=time.process_time()
    return (timeEnd-timeStart)

print('Calculation Time, Original Vectors: '+str(singleCalculationTime(vectorisedArticles)))

print('Calculation Time, Stemmed Vectors: '+str(singleCalculationTime(vectorisedStemmedArticles)))
Calculation Time, Original Vectors: 2.375
Calculation Time, Stemmed Vectors: 1.828125

As you can see, using stemming reduces processing costs by ~20%.

There are alternative methods for reducing the vocabulary size that focus on eliminating over- (and under-) used vocabulary, but for every word removed, some information is lost. Below is an example of how winsorising your vocabulary will also improve performance; however, I won't be using this in the clustering algorithms.

In [9]:
def winsoriseWordVectors(vectorisedArticles,vocabulary):
    wordFrequency = np.sum(vectorisedArticles,axis=0)
    #print(wordFrequency.shape)
    #print(np.max(wordFrequency))
    #print(np.median(wordFrequency))
    wordFrequencyDeciles = np.percentile(wordFrequency,[i for i in range(100)], interpolation='higher')
    #print(wordFrequencyDeciles)
    columnIndeces=[i for i in range(wordFrequency.shape[0]) if wordFrequency[i]>wordFrequencyDeciles[20] and wordFrequency[i]<=wordFrequencyDeciles[80]]
    #print(columnIndeces)
    return vectorisedArticles[:,columnIndeces],{i:vocabulary[columnIndeces[i]] for i in range(len(columnIndeces))}

winsrVectorArticles,winsrVocabulary = winsoriseWordVectors(vectorisedArticles,vocabulary)
winsrVectorStemmedArticles,winsrStemmedVocabulary = winsoriseWordVectors(vectorisedStemmedArticles,stemmedVocabulary)

print('Winsorised Original Vocabulary:\t'+str(winsrVectorArticles.shape))
print('Calculation Time, Winsorised Original Vectors: '+str(singleCalculationTime(winsrVectorArticles)))
print('Winsorised Stemmed Vocabulary:\t'+str(winsrVectorStemmedArticles.shape))
print('Calculation Time, Winsorised Stemmed Vectors: '+str(singleCalculationTime(winsrVectorStemmedArticles)))
Winsorised Original Vocabulary:	(1000, 19741)
Calculation Time, Winsorised Original Vectors: 1.5625
Winsorised Stemmed Vocabulary:	(1000, 16055)
Calculation Time, Winsorised Stemmed Vectors: 1.1875

Vectorised k-Means

Below I've implemented a more conventional version of the k-means clustering algorithm than the one in my previous post. It uses the vectorised articles and a measure of Cosine Similarity to build the clusters.

In [10]:
def cosineSimilarity(vectorA,vectorB):
    vectorA = [float(element) for element in vectorA]
    vectorB = [float(element) for element in vectorB]
    return np.dot(vectorA,vectorB)/(np.linalg.norm(vectorA)*np.linalg.norm(vectorB))

def centroidDistances(centroids,article):
    return { k:cosineSimilarity(centroid,article) for k,centroid in centroids.items() }

def generateCentroid(vectorisedArticles):
    randId=np.random.randint(vectorisedArticles.shape[0], size=1)
    return vectorisedArticles[randId]

def kMeansCluster(vectorisedArticles,K,G):

    centroids = np.vstack([ generateCentroid(vectorisedArticles) for k in range(K) ])
    articleCentroidIds = []
    performance=[]
    g = 0
    centroidsChanged=True
    while (g<G and centroidsChanged):
        # generate simple cosine distance on normalised vectors between all articles and centroids
        centroidDistances=1-np.dot(vectorisedArticles,centroids.T)
        #if g==0: print(centroidDistances.shape)
        #if g==0: print(centroidDistances)

        # use numpy argmin to find index (column) of nearest centroid by article (row)
        articleCentroidIds=np.argmin(centroidDistances,axis=1)
        articleCentroidDistances=np.min(centroidDistances,axis=1)
        #if g==0: print(articleCentroidIds.shape)
        #if g==0: print(articleCentroidDistances)
        #if g==0: print(articleCentroidIds)

        # create new centroids by averaging positions of all article vectors in cluster
        newCentroids=[]
        for k in range(K):
            clusterMembers=vectorisedArticles[[i for i in range(articleCentroidIds.shape[0]) if articleCentroidIds[i]==k],:]
            if g==G-1: print(str(k)+'\t'+str(clusterMembers.shape))
            clusterSize = clusterMembers.shape[0]
            if clusterSize==0:
                newCentroid=generateCentroid(vectorisedArticles)
            else:
                newCentroid=np.mean(clusterMembers,axis=0)
            newCentroids.append(newCentroid)
        newCentroids=np.vstack(newCentroids)

        # update existing centroids
        #print(centroids-newCentroids)

        centroidsChanged = (np.sum(np.diag(np.dot(centroids,newCentroids.T))<1.0)>0)
        centroids=newCentroids
        g+=1

        interCentroidDistance = np.sum(np.sum(1-np.dot(newCentroids,newCentroids.T)))/(2*K*(K-1))
        intraCentroidDistance = np.mean(articleCentroidDistances)
        performance.append([g,interCentroidDistance,intraCentroidDistance,intraCentroidDistance/interCentroidDistance])

    print('Last Generation:\t'+str(g))
    return articleCentroidIds,centroids,np.vstack(performance)
In [22]:
K=20
G=100
articleCentroidIds,centroids,performance = kMeansCluster(vectorisedStemmedArticles,K,G)
0	(367, 26908)
1	(68, 26908)
2	(1, 26908)
3	(2, 26908)
4	(1, 26908)
5	(130, 26908)
6	(28, 26908)
7	(4, 26908)
8	(8, 26908)
9	(326, 26908)
10	(2, 26908)
11	(6, 26908)
12	(4, 26908)
13	(31, 26908)
14	(3, 26908)
15	(1, 26908)
16	(4, 26908)
17	(12, 26908)
18	(1, 26908)
19	(1, 26908)
Last Generation:	100
In [25]:
performance[1:30,]
Out[25]:
array([[ 2.        ,  0.26958994,  0.15649223,  0.58048244],
       [ 3.        ,  0.27398791,  0.18225335,  0.66518758],
       [ 4.        ,  0.32206336,  0.17600968,  0.5465064 ],
       [ 5.        ,  0.32889314,  0.20688346,  0.6290294 ],
       [ 6.        ,  0.33041174,  0.20312905,  0.61477552],
       [ 7.        ,  0.33265559,  0.20081128,  0.60366121],
       [ 8.        ,  0.33713284,  0.19980279,  0.59265301],
       [ 9.        ,  0.34198893,  0.20012372,  0.58517601],
       [10.        ,  0.34291885,  0.20026109,  0.58398974],
       [11.        ,  0.34296922,  0.1997434 ,  0.58239453],
       [12.        ,  0.34297589,  0.19966257,  0.58214752],
       [13.        ,  0.34298869,  0.19963127,  0.58203457],
       [14.        ,  0.34302356,  0.19962532,  0.58195805],
       [15.        ,  0.34299795,  0.1996156 ,  0.58197316],
       [16.        ,  0.34297178,  0.19961334,  0.58201097],
       [17.        ,  0.3429811 ,  0.19957092,  0.5818715 ],
       [18.        ,  0.34290837,  0.19955817,  0.58195771],
       [19.        ,  0.34291658,  0.19953042,  0.58186287],
       [20.        ,  0.34290465,  0.19949289,  0.58177365],
       [21.        ,  0.34298227,  0.19947395,  0.58158678],
       [22.        ,  0.34307008,  0.19946522,  0.58141248],
       [23.        ,  0.34312886,  0.19942319,  0.58119038],
       [24.        ,  0.34317665,  0.19941869,  0.58109631],
       [25.        ,  0.34317853,  0.19943807,  0.58114964],
       [26.        ,  0.34317853,  0.1995016 ,  0.58133476],
       [27.        ,  0.34317853,  0.1995016 ,  0.58133476],
       [28.        ,  0.34317853,  0.1995016 ,  0.58133476],
       [29.        ,  0.34317853,  0.1995016 ,  0.58133476],
       [30.        ,  0.34317853,  0.1995016 ,  0.58133476]])
In [32]:
k=6
for i in range(articleCentroidIds.shape[0]):
    if articleCentroidIds[i]==k:
        print(articles[i][1:10])
['should', '"go', 'with', 'their', 'instinct"', 'when', 'deciding', 'which', 'country']
['patel', 'will', 'captain', 'birmingham', 'bears', 'in', 'this', 'season', 't20']
['full-back', 'kieran', 'tierney', 'is', 'not', 'fit', 'for', 'scotland', 'european']
['dillashaw', 'says', 'he', 'has', 'vacated', 'the', 'ufc', 'bantamweight', 'title']
['tigers', 'back-row', 'forward', 'tommy', 'reffell', 'extends', 'his', 'contract', 'with']
['united', 'midfielder', 'calum', 'butcher', 'says', 'his', 'side', 'still', 'believe']
['hooker', 'tom', 'cruse', 'agrees', 'a', 'new', 'undisclosed-length', 'contract', 'extension']
['johanna', 'konta', 'says', 'playing', 'three', 'home', 'wta', 'grass-court', 'tournaments']
['full-back', 'kieran', 'trippier', 'hopes', 'returning', 'to', 'the', 'england', 'squad']
['striker', 'will', 'grigg', 'is', 'ruled', 'out', 'of', 'northern', 'ireland']
['warrington', 'is', 'open', 'to', 'a', 'rematch', 'against', 'carl', 'frampton']
["'currency", "artist'", 'ezeanyika', 'anthony', 'chibuzor', 'found', 'life', 'a', 'struggle']
['ex-england', 'seamer', 'harry', 'gurney', 'retires', 'from', 'championship', 'cricket', 'to']
['your', 'knowledge', '-', 'do', 'you', 'know', 'which', 'premier', 'league']
['united', 'defender', 'patrice', 'evra', 'says', '"only', 'god', 'can', 'judge']
['tokyo', 'olympic', 'and', 'paralympic', 'games', 'are', 'set', 'to', 'revolutionise']
['rockets', 'guard', 'james', 'harden', 'becomes', 'the', 'first', 'player', 'in']
['adam', 'yates', 'misses', 'out', 'on', 'the', 'tirreno-adriatico', 'title', 'by']
['coverage', 'of', 'saturday', 'national', 'league', 'game', 'between', 'barnet', 'and']
['world', 'heavyweight', 'champion', 'deontay', 'wilder', 'will', 'face', 'american', 'dominic']
['slam', 'winners', 'jonathan', 'davies,', 'ken', 'owens', 'and', 'rob', 'evans']
['manager', 'marco', 'silva', 'is', 'fined', '£12,000', 'by', 'the', 'football']
['retrospective', 'disciplinary', 'action', 'will', 'be', 'taken', 'against', 'any', 'rangers']
['will', 'be', 'without', 'injured', 'trio', 'ethan', 'ampadu,', 'sam', 'vokes']
['finn', 'russell', 'has', 'revealed', 'he', 'played', 'in', 'scotland', 'six']
['could', 'be', 'without', 'opener', 'dimuth', 'karunaratne', 'this', 'season', 'as']
['city', 'fa', 'cup', 'semi-final', 'against', 'brighton', 'will', 'be', 'played']
['sign', 'italy', 'centre', 'michele', 'campagnaro', 'from', 'premiership', 'rivals', 'wasps']

Just by manually investigating the clusters, we can see that they're of varying quality. Interestingly, some are clustered explicitly by story, and some are clustered by topic (the above group are all sports news articles). While I'm more interested in explicit story groupings, other forms of clustering are inevitable when using simple word frequency measures.

Using encodings that capture inter-word relationships, even simple n-grams, can improve this clustering, but it increases the complexity of the model by orders of magnitude. With a vocabulary of more than 30,000 words, a simple bi-gram vocabulary could contain close to 1,000,000,000 (30,000 x 30,000) possible elements, although in practice significantly fewer occur, thanks to the rules of grammar.
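
To get a rough feel for that growth with the corpus above (the exact numbers will depend on what NewsAPI returned on the day), you can compare the number of distinct stemmed words with the number of distinct adjacent-word pairs that actually occur:

# distinct stemmed words vs distinct adjacent-word bi-grams across the corpus
unigrams, bigrams = set(), set()
for words in stemmedArticles.values():
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

print(len(unigrams), 'distinct words')
print(len(bigrams), 'distinct bi-grams observed')
# far fewer than the len(unigrams)**2 pairs a full bi-gram vector space must allow for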

Conclusions

So we've taken the crux of the k-means clustering algorithm from my last post and converted it into a vectorised format. We've also looked at stemming algorithms to help improve the algorithm's ability to compare different texts, even when slightly different wording is used.

It's still not particularly powerful: despite the stemming logic, it still has no understanding of synonyms or of inter-word relationships.

The former could be improved by a better encoding of the text into vectors, for example using word embeddings such as Word2Vec. The latter could be improved by looking at ways of combining words, such as using n-grams or encoding individual sentences within each document. Unfortunately these will have to wait for a later post.

I hope you've found this post interesting! As always, the source code can be found on my Github page.

Thanks for reading!