A starting note: my work reading the data, cleaning it of profanity, tokenizing it, and generating n-grams was all done in Python. The Python code is attached at the bottom of this report as an appendix.

Profanity filtering and tokenization

Profanity filtering

The first step I undertook was reading in the entire corpus of raw data provided by SwiftKey and Johns Hopkins. The data came as three text documents containing entries from Twitter posts, news articles, and blog posts, respectively. Each document was raw text from its source, with new-line characters separating entries.

The next step was profanity filtering, and the first decision was how to deal with lines that contained profanity. One option was to simply remove every line containing a profane word; the other was a more involved approach, such as removing the profanity from any line in which it occurs and splitting that line into two new lines at the point of the occurrence. The biggest foreseeable issue with the simpler option was whether removing entire lines would discard too much of the data.

To filter for profanity I downloaded a list of profane words provided by a user on Stack Overflow (see the appendix for the URL) and wrote a function that searches a string of text for each term. I ran this function over every line of each text document and found that only a small percentage of lines actually contained profanity.
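As a small illustration of the matching rule used in the appendix code, a term is only flagged when it appears as a whole word, i.e., padded by a space on each side. The helper name containsTerm below is illustrative (the actual function, profanityFilter, is in the appendix), and the example uses an innocuous word pair to show why the padding matters:

def containsTerm(line, term):
    #flag only whole-word occurrences by requiring a space on each side of the term
    return (' ' + term + ' ') in line

print(containsTerm("we picked a grapefruit today", "grape"))  # False: substring only
print(containsTerm("we picked a grape today", "grape"))       # True: whole word

This simple rule avoids flagging clean words that merely contain a profane substring, at the cost of missing terms at the very start or end of a line or next to punctuation.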

Of the 2,360,148 lines in the Twitter data set only 43,267 contained profanity (1.8%). Of the 899,288 lines in the blogs data set only 9,911 contained profanity (1.1%). And of the 1,010,242 lines in the news data set only 3,552 contained profanity (0.4%).

Because of the small number of lines with profanity, I simply removed these lines from the data set.

Tokenization

My tokenization method was fairly simple. I first tokenized each line with the regular expression "\w+", which breaks the line into groups of one or more word characters. This was effective, but it broke apart common contractions formed with an apostrophe, such as "wasn't" and "I'm". I therefore changed the regular expression to "[\w']+" so that tokens may also contain apostrophes, keeping contractions intact.
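As a brief illustration of the difference between the two patterns (using NLTK's RegexpTokenizer, as in the appendix code):

from nltk.tokenize import RegexpTokenizer

line = "I'm sure it wasn't easy"

#first attempt: groups of one or more word characters, which splits contractions
print(RegexpTokenizer(r'\w+').tokenize(line))
#['I', 'm', 'sure', 'it', 'wasn', 't', 'easy']

#revised pattern: also allow apostrophes inside tokens, keeping contractions whole
print(RegexpTokenizer(r"[\w']+").tokenize(line))
#["I'm", 'sure', 'it', "wasn't", 'easy']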

N-Gram Tables

My next step was to create tables containing the frequency of each n-gram. Since we are tasked with using the preceding one, two, or three words to predict the next word, I needed to store n-grams of up to four words, and I created a separate table for each n-gram length. Python's dictionary data structure uses hashing, which allows for very fast lookups, so I could simply iterate over each line and over each n-gram within the line. For each n-gram I checked whether it already existed in the corresponding table; if it did, I incremented its count by one, and if not, I added a new entry for it. Due to memory limitations I used 5% of the total data to create my n-gram training tables. I intend to try other approaches in the coming weeks to incorporate more training data.
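In outline, the counting loop looks like the following condensed sketch for a single n-gram length (countNGrams is an illustrative name; the full nGramFrequencies function in the appendix builds all four tables in one pass):

def countNGrams(tokenizedLines, n):
    #count how often each n-gram (stored as a space-joined string) occurs
    counts = {}
    for line in tokenizedLines:
        for i in range(len(line) - n + 1):
            gram = ' '.join(line[i:i + n])
            if gram in counts:
                counts[gram] += 1
            else:
                counts[gram] = 1
    return counts

#e.g. bigram counts from a couple of toy lines
print(countNGrams([['thanks', 'for', 'the', 'follow'],
                   ['thanks', 'for', 'reading']], 2))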

Data distribution

One potentially useful feature of text data is that common words and phrases tend to be used with high frequency. Because of this, it may be possible to exclude a large portion of the n-grams while still covering most of the n-gram occurrences in the data. To investigate this idea I sorted each n-gram table from the most common to the least common n-gram and calculated the percentage of the total n-gram occurrences that each n-gram contributed. I then computed the percentage of total occurrences covered as I progressively increased the size of a subset of the n-grams, starting from the most common.
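Concretely, this coverage calculation amounts to a cumulative sum over the sorted frequencies. A minimal sketch (cumulativeCoverage is an illustrative helper, not part of the appendix code):

def cumulativeCoverage(gramCounts):
    #sort n-gram counts from most to least common, then compute the running
    #fraction of all occurrences covered as more unique n-grams are included
    counts = sorted(gramCounts.values(), reverse=True)
    total = float(sum(counts))
    coverage = []
    running = 0
    for c in counts:
        running += c
        coverage.append(running / total)
    return coverage

#coverage[k-1] is the fraction of occurrences covered by the k most common n-grams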

[Figure: cumulative percentage of n-gram occurrences covered as the number of unique n-grams included increases, shown for one- through four-grams.]

For the one-grams it appears that the majority of total occurrences can be covered by a relatively small number of terms. With the two-grams, roughly 75% of the data can be covered by less than 25% of the unique two-grams. The coverage gets worse for the three- and four-grams, however, and because the curves "straighten out", it appears that the majority of three- and four-grams occur only once. Because I intend to use a back-off model that first checks four-grams, then three-grams, then two-grams, and finally one-grams, at first glance it does not appear that I will be able to significantly reduce the size of my frequency tables. However, this is something I plan to test with a validation set.

Sanity Check

Next I printed the top five entries in each n-gram table to check that they are commonly written English words and phrases. For instance, in the one-gram table I would expect to see articles such as "a" and "the".

## [1] "Most Common One Grams"
OneGram Frequency
the 208370
to 134039
and 112591
a 111781
of 98315
## [1] "Most Common Two Grams"
TwoGram Frequency
of the 21028
in the 18784
to the 10241
for the 9538
on the 9258
## [1] "Most Common Three Grams"
ThreeGram Frequency
one of the 1441
a lot of 1326
to be a 940
going to be 876
Thanks for the 835
## [1] "Most Common Four Grams"
FourGram Frequency
the end of the 376
the rest of the 304
for the first time 272
at the end of 271
is going to be 225

Looks good!

Next Steps

Now that I have frequency tables, I intend to try two different methods for building a model that predicts the next word from the preceding one, two, or three words, and to use a validation set to choose between them based on accuracy and speed. The first method is back-off. Presuming there are three words preceding the word to be predicted, this method first looks at this three-gram, finds all four-grams in my frequency table that start with it, and predicts the next word from the four-gram that occurs most frequently. If this three-gram does not exist in the frequency table, the algorithm then takes the preceding two-gram and looks for the most frequent three-gram starting with it. If this two-gram cannot be found either, the process is repeated with the single word preceding the word to be predicted.
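A rough sketch of this back-off lookup, assuming the frequency tables are kept as dictionaries keyed by space-joined n-grams as in the appendix (the function name and the linear scan over the table are illustrative; a real implementation would index the tables by prefix):

def predictBackoff(context, fourGrams, threeGrams, twoGrams, fallback='the'):
    #context: list of the words preceding the word to predict (up to three)
    #try the longest available context first, backing off to shorter contexts
    for n, table in ((3, fourGrams), (2, threeGrams), (1, twoGrams)):
        if len(context) < n:
            continue
        prefix = ' '.join(context[-n:]) + ' '
        candidates = {gram: count for gram, count in table.items()
                      if gram.startswith(prefix)}
        if candidates:
            best = max(candidates, key=candidates.get)
            return best.split()[-1]
    return fallback  #e.g. the most common one-gram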

The second method is interpolation. This method looks at the one-gram, two-gram, and three-gram preceding the word to be predicted. Using the frequency tables, it finds all two-grams, three-grams, and four-grams that start with that one-gram, two-gram, and three-gram respectively, counts them, and assigns each next-word candidate a probability based on its frequency of occurrence. For instance, if the user types "Mary had a" and the table contains 4 occurrences of "little" and 1 occurrence of "lamb" following it, the algorithm ascribes an 80% probability to "little" under the three-word-context model. Once it has calculated the probability of each prediction under the one-, two-, and three-word-context models, it multiplies these by weights and combines them to find the highest-probability word. These weights will need to be trained on a validation set to find optimal values.
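A corresponding sketch of the interpolation step (again illustrative only; the weights dictionary stands in for the interpolation weights that would be tuned on the validation set):

def predictInterpolated(context, twoGrams, threeGrams, fourGrams, weights):
    #weights maps context length (1, 2, 3) to its interpolation weight
    tables = {1: twoGrams, 2: threeGrams, 3: fourGrams}
    scores = {}
    for n in (1, 2, 3):
        if len(context) < n:
            continue
        prefix = ' '.join(context[-n:]) + ' '
        matches = {gram.split()[-1]: count for gram, count in tables[n].items()
                   if gram.startswith(prefix)}
        total = float(sum(matches.values()))
        if total == 0:
            continue
        for word, count in matches.items():
            #weighted probability of this word under the n-word-context model
            scores[word] = scores.get(word, 0.0) + weights[n] * count / total
    return max(scores, key=scores.get) if scores else None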

Thanks for reading!

Appendix

Python Code to read and clean data, and generate n-gram tables

#list of profane words obtained from stack overflow
#http://stackoverflow.com/questions/3531746/what-s-a-good-python-profanity-filter-library
import os

os.chdir('/home/peter/Dropbox/Python/Coursera_Project')
fProfanity = open('Cleaned_Text_Files/profanity.txt')

areBad = fProfanity.read().splitlines()

fProfanity.close()

def profanityFilter(text, badWords):
    #return (True, word) for the first bad word found as a whole word in text,
    #otherwise return (False, '')
    for w in badWords:
        if (' ' + w + ' ') in text:
            return (True, w)

    return (False, '')

#read in data to lists

os.chdir('/home/peter/Dropbox/Python/Coursera_Project')
fTwitter = open('en_US/en_US.twitter.txt')
fBlogs = open('en_US/en_US.blogs.txt')
fNews = open('en_US/en_US.news.txt')

twitter = fTwitter.readlines()
blogs = fBlogs.readlines()
news = fNews.readlines()

fTwitter.close()
fBlogs.close()
fNews.close()

#loop over twitter, news, blogs files line by line
#check for any profane word preceded by and followed by ' '
#this avoids non-profane word strings with substrings that 
#are profanity from being flagged
#keep running total of the number of lines with profanity
#and mark indices with no profanity

twitterProf = 0
newsProf = 0
blogsProf = 0

twitterClean = []
newsClean = []
blogsClean = []

badWords = {}

i = 0

for l in twitter:
    tmp = profanityFilter(l, areBad)
    if tmp[0]:
        twitterProf += 1
        
        if tmp[1] in badWords:
            badWords[tmp[1]] += 1
        else:
            badWords[tmp[1]] = 1
    else:
        twitterClean.append(i)
    
    i += 1

i = 0

for l in blogs:
    tmp = profanityFilter(l, areBad)
    if tmp[0]:
        blogsProf += 1
        
        if tmp[1] in badWords:
            badWords[tmp[1]] += 1
        else:
            badWords[tmp[1]] = 1
    else:
        blogsClean.append(i)
    
    i += 1

i = 0

for l in news:
    tmp = profanityFilter(l, areBad)
    if tmp[0]:
        newsProf += 1
        
        if tmp[1] in badWords:
            badWords[tmp[1]] += 1
        else:
            badWords[tmp[1]] = 1  
    else:
        newsClean.append(i)
    
    i += 1


#check percent of lines in each data file with profanity
print(len(twitter), len(blogs), len(news))
print(twitterProf, blogsProf, newsProf)

print('twitter:', "{0:.3f}%".format(twitterProf * 1.0 / len(twitter) * 100))
print('blogs:', "{0:.3f}%".format(blogsProf * 1.0 / len(blogs) * 100))
print('news:', "{0:.3f}%".format(newsProf * 1.0 / len(news) * 100))

#create new lists only with indices that didn't get flagged for profanity

twitterCleaned = [twitter[i] for i in twitterClean]
blogsCleaned = [blogs[i] for i in blogsClean]
newsCleaned = [news[i] for i in newsClean]

#tokenize each line into words
#use regex [\w\']+ to separate each line into groups of one or more word characters,
#including apostrophes so contractions such as "wasn't", "O'Neill", etc. stay intact

from nltk.tokenize import RegexpTokenizer, word_tokenize

tokenizer = RegexpTokenizer(r'[\w\']+')

twitterTokens = []
blogsTokens = []
newsTokens = []

for l in twitterCleaned:
    twitterTokens.append(tokenizer.tokenize(l))

for l in blogsCleaned:
    blogsTokens.append(tokenizer.tokenize(l))
    
for l in newsCleaned:
    newsTokens.append(tokenizer.tokenize(l))

#generate training, validation, and test sets
#combine twitter, blogs, and news tokens into one list
#and randomly shuffle

import random

tokens = twitterTokens + blogsTokens + newsTokens

random.shuffle(tokens)

#then take first X% as training set, next y% as validation
#set and next y% as test set

trainPct = 0.10
testPct = 0.05

train = tokens[0:int(len(tokens) * trainPct)]

#Clear Raw Data and Intermediate Variables From Memory
twitter = None
blogs = None
news = None

twitterCleaned = None
blogsCleaned = None
newsCleaned = None

twitterTokens = None
blogsTokens = None
newsTokens = None

twitterTokensShuffled = None
blogsTokensShuffled = None
newsTokensShuffled = None

twitterTrain = None
blogsTrain = None
newsTrain = None

twitterTest = None
blogsTest = None
newsTest = None

def nGramFrequencies(tokens):
    #takes a list where each entry is a list of word tokens split from a line in a corpus
    #returns a list of dictionaries where the key is each n-gram and the value is the
    #frequency in the corpus
    
    oneGrams = {}
    twoGrams = {}
    threeGrams = {}
    fourGrams = {}
    
    for l in tokens:
        for i in range(len(l)):
            oneGram = l[i]

            if oneGram in oneGrams:
                oneGrams[oneGram] += 1
            else:
                oneGrams[oneGram] = 1

            if i > 0:
                twoGram = ' '.join(l[i-1:i+1])

                if twoGram in twoGrams:
                    twoGrams[twoGram] += 1
                else:
                    twoGrams[twoGram] = 1           

            if i > 1:
                threeGram = ' '.join(l[i-2:i+1])

                if threeGram in threeGrams:
                    threeGrams[threeGram] += 1
                else:
                    threeGrams[threeGram] = 1

            if i > 2:
                fourGram = ' '.join(l[i-3:i+1])

                if fourGram in fourGrams:
                    fourGrams[fourGram] += 1
                else:
                    fourGrams[fourGram] = 1

    return (oneGrams, twoGrams, threeGrams, fourGrams)

trainOneGrams, trainTwoGrams, trainThreeGrams, trainFourGrams = nGramFrequencies(train)
#testThreeGrams, testFourGrams = nGramFrequencies(test)

#convert dictionaries of N-Grams to lists

trainOneGrams = sorted(trainOneGrams.items(), key = lambda x: x[1], reverse = True)
trainTwoGrams = sorted(trainTwoGrams.items(), key = lambda x: x[1], reverse = True)
trainThreeGrams = sorted(trainThreeGrams.items(), key = lambda x: x[1], reverse = True)
trainFourGrams = sorted(trainFourGrams.items(), key = lambda x: x[1], reverse = True)

#sanity check: print the most common one-grams
print(trainOneGrams[:100])

#write n-gram tables
fNameOne = 'trainOneGrams.txt'
fNameTwo = 'trainTwoGrams.txt'
fNameThree = 'trainThreeGrams.txt'
fNameFour = 'trainFourGrams.txt'

with open(fNameOne, 'w') as f:
    f.writelines(','.join(str(j) for j in i) + '\n' for i in trainOneGrams)

with open(fNameTwo, 'w') as f:
    f.writelines(','.join(str(j) for j in i) + '\n' for i in trainTwoGrams)
    
with open(fNameThree, 'w') as f:
    f.writelines(','.join(str(j) for j in i) + '\n' for i in trainThreeGrams)
    
with open(fNameFour, 'w') as f:
    f.writelines(','.join(str(j) for j in i) + '\n' for i in trainFourGrams)
    