Data Science Capstone Project Milestone Report: Predicting the Next Word

Introduction

The goal of this project is to use Natural Language Processing to predict, given a phrase, the next word in that phrase. To perform this prediction, I’m using a data set provided by Swiftkey, which consists of text from news stories, personal blogs, and Twitter. By analyzing the text in this data set, I will create a natural language prediction model. This milestone report covers my initial progress.

First look at the data

My first step was to take a look at the three data sets. I loaded them and gathered some basic facts about their makeup. This table shows the number of lines in each data set, the number of characters in the shortest and longest lines, and the mean number of characters per line.

all.info
##         num_lines min_line max_line mean_line
## blogs      899288        1    40833    229.99
## twitter   2360148        2      140     68.68
## news      1010242        1    11384    201.16
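A minimal sketch of how these figures can be gathered (the file paths and the helper name line.info are assumptions here, not my exact script):

# Assumed paths to the English files from the Swiftkey download
files <- c(blogs   = "final/en_US/en_US.blogs.txt",
           twitter = "final/en_US/en_US.twitter.txt",
           news    = "final/en_US/en_US.news.txt")

line.info <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  chars <- nchar(lines)
  c(num_lines = length(lines),
    min_line  = min(chars),
    max_line  = max(chars),
    mean_line = round(mean(chars), 2))
}

all.info <- t(sapply(files, line.info))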

One thing this information told me is that I would need to be careful about how I managed memory when loading and processing the data.

Second look at the data

My next task was to clean up the data and see what there was to work with. I wanted to clean the data enough that the computer could pick out the actual words, which meant I had to do a good job of defining what a word was.

Since I’m only fluent in English, I used the English subset of the data.

I cleaned up the data by

  1. converting all the uppercase letters to lowercase. No point confusing “Cat” with “cat”.
  2. converting from UTF-8 to latin1, per a suggestion from classmates. This removes characters that are unlikely to be used for prediction and are also difficult to manage.
  3. choosing which characters to keep in the data set in order to make words. I chose to keep all the alphanumeric values, plus the punctuation marks
@ # $ '

The @ and # marks are used on Twitter as potentially meaningful parts of words, the $ appears as a word on its own ($), and the ’ is used in contractions.
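Put together, the cleaning steps amount to something like this (a minimal sketch; the function name clean.text and the exact regular expression are assumptions about the implementation):

clean.text <- function(x) {
  # 1. lowercase, so "Cat" and "cat" are counted as the same word
  x <- tolower(x)
  # 2. convert UTF-8 to latin1, dropping characters that won't convert
  x <- iconv(x, from = "UTF-8", to = "latin1", sub = "")
  # 3. keep alphanumerics plus @ # $ ' ; everything else becomes a space
  gsub("[^a-z0-9@#$']", " ", x)
}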

At this point, I decided to keep stopwords in to evaluate how well, if at all, they could be used as predictors.

With the text cleaned up, I broke the text into words, then added all the words into a single vector. I say that easily, but what it meant was that I read in a subset of a data set, cleaned it up, broke it into words, added it to my list, then moved on to the next subset, until I had one long list of all the words in my data sets, in order. I then took this list, gave each word a frequency count of 1, and aggregated those counts by word to build a unigram table.
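Sketched in code (the chunk size, and the use of table() for the count-and-aggregate step, are assumptions about the implementation):

build.unigrams <- function(path, chunk.size = 100000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  chunks <- list()
  repeat {
    lines <- readLines(con, n = chunk.size, encoding = "UTF-8", skipNul = TRUE)
    if (length(lines) == 0) break
    # clean the chunk, break it into words, drop empty tokens
    words <- unlist(strsplit(clean.text(lines), "\\s+"))
    chunks[[length(chunks) + 1]] <- table(words[words != ""])
  }
  # aggregate the per-chunk counts into a single unigram table
  counts <- unlist(lapply(chunks, function(tb) setNames(as.integer(tb), names(tb))))
  sort(tapply(counts, names(counts), sum), decreasing = TRUE)
}

Holding every chunk’s table in memory before aggregating is exactly the kind of memory pressure noted earlier, which is why the reading happens in subsets in the first place.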

Taking a look at my word list, I found that

  1. The total number of unique words is 1,062,747
  2. The highest frequency for a single word is 4,761,442
  3. Unsurprisingly, the minimum frequency for a single word is 1
  4. Perhaps also unsurprisingly, but very telling, is that the median frequency is 1
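Assuming unigrams is the combined named count vector from the step above, these figures are one-liners:

# unigrams: named integer vector of word frequencies over all three sets
length(unigrams)    # 1062747 unique words
max(unigrams)       # 4761442, the highest single-word frequency
min(unigrams)       # 1
median(unigrams)    # 1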

With the minimum frequency matching the median, at least half of my unique words appear only once, and those are more than likely not actual English words. I’ll need to do further work on cleaning up my data, and possibly trim the size of the data set based on frequency.

However, I still had a working set of words and frequencies. I decided to take a look at the words with a frequency of 100,000 or more, which gave me a subset of 120 unique words.
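The subset, and the histogram below, come from something like this (again a sketch, assuming the sorted unigrams vector from above):

common <- unigrams[unigrams >= 100000]
length(common)        # 120 unique words
hist(as.numeric(common), breaks = 50,
     main = "Words with frequency >= 100,000",
     xlab = "Word frequency")
names(common)[1:50]   # top fifty words; unigrams is sorted by frequency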

[Histogram: frequency distribution of the 120 words appearing 100,000 or more times]

The strongly skewed histogram shows that even among these common words, most frequencies cluster at the low end of the range, with a handful of words appearing far more often. The top fifty most frequently appearing words are:

the, to, and, a, of, i, in, for, is, that, it, you, on, with, was, my, at, be, this, have, are, as, but, he, we, not, from, so, me, they, all, will, by, or, said, just, your, his, an, about, out, up, one, what, if, like, when, has, can, who

which would mostly fall into the “stopwords” category, and may not be useful for prediction.

Prediction algorithm plans

My plans for building my prediction algorithm can be summarized as follows:

  1. Complete my n-gram set, so that I have unigrams, bigrams, trigrams, and quad-grams. Apply the Markov Assumption to see if matches can be found for the given phrase (in particular, for the latest words used in the phrase, since that should take the least overhead and give the strongest predictive value), and use the frequencies to determine the probability of one of them being a good predictor for the next word. A minimal sketch of this lookup appears after this list.
  2. Deal with other issues that have come out of looking at my existing unigram set.
  3. Deal with miscellaneous issues, such as whether profanity (particularly given Twitter) is useful to keep as a predictor, and if so, how to sanitize the output when profanity is the best prediction.
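As a first sketch of the lookup in step 1 (the table layout, the names, and the simple back-off order are assumptions; proper smoothing and back-off weighting come later):

# ngrams: assumed layout -- a list of data frames, one per n-gram order,
# each with columns prefix (preceding words, space-separated), word, freq.
predict.word <- function(phrase, ngrams, max.order = 4) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))  # assumes a non-empty phrase
  # Markov assumption: only the last few words matter. Try the longest
  # prefix we have an n-gram table for, then back off to shorter ones.
  for (n in min(max.order, length(words) + 1):2) {
    prefix  <- paste(tail(words, n - 1), collapse = " ")
    matches <- ngrams[[n]][ngrams[[n]]$prefix == prefix, ]
    if (nrow(matches) > 0) {
      # relative frequency stands in for the probability of each candidate
      return(matches$word[which.max(matches$freq)])
    }
  }
  # no match at any order: fall back to the most frequent unigram
  ngrams[[1]]$word[which.max(ngrams[[1]]$freq)]
}

# e.g. predict.word("thanks for the", ngrams)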

I’d better get to work.