The goal of this project is to use Natural Language Processing to predict, given a phrase, the next word in that phrase. To build this prediction, I’m using a data set provided by SwiftKey, which consists of text from news stories, personal blogs, and Twitter. By analyzing the text in this data set, I will create a natural language prediction model. This milestone reports my initial progress.
My first step was to take a look at the three data sets. I loaded them and gathered some basic facts about their makeup. This table shows the number of lines in each data set, the number of characters in the shortest and longest lines, and the average number of characters per line.
all.info
##         num_lines min_line max_line mean_line
## blogs      899288        1    40833    229.99
## twitter   2360148        2      140     68.68
## news      1010242        1    11384    201.16
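In outline, the computation behind this table looks something like the sketch below; the file paths and the choice of base-R functions here are just illustrative.

```r
# Sketch: basic line and character counts for each file.
# The paths are assumptions about where the SwiftKey files live.
files <- c(blogs   = "final/en_US/en_US.blogs.txt",
           twitter = "final/en_US/en_US.twitter.txt",
           news    = "final/en_US/en_US.news.txt")

all.info <- t(sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  chars <- nchar(lines)
  c(num_lines = length(lines),
    min_line  = min(chars),
    max_line  = max(chars),
    mean_line = round(mean(chars), 2))
}))
all.info
```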
One thing this information told me is that I would need to be careful about memory management when loading and working with the data.
My next task was to clean up the data and see what there was to work with. I wanted the text clean enough that the computer could pick out the actual words, which meant I had to do a good job of defining what a word was.
Since I’m only fluent in English, I used the English subset of the data.
I cleaned up the data by stripping out punctuation, keeping only four marks that can carry meaning:
@ # $ '
The @ and # marks are used on Twitter as potentially meaningful parts of words, the $ appears as its own word, and the ' is used in contractions.
At this point, I decided to keep stopwords in, so that I could evaluate how well, if at all, they could be used as predictors.
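In outline, the clean-up looks something like the sketch below. The lower-casing and the exact regular expression are placeholders rather than final rules; the four kept characters are the ones listed above.

```r
# Sketch of the clean-up: strip punctuation but keep @, #, $ and the apostrophe.
# Lower-casing, and dropping digits, are assumptions here, not settled rules.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z@#$' ]", " ", x)   # drop everything except letters, spaces,
                                     # and the four meaningful marks
  x <- gsub("\\s+", " ", x)          # collapse the whitespace left behind
  trimws(x)
}

clean_text("RT @user: winning the #contest -- it's true!")
```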
With the text cleaned up, I broke it into words and added them all to a single vector. I say that easily, but in practice it meant reading in a subset of a data set, cleaning it up, breaking it into words, and adding them to my list, then moving on to the next subset, until I had one long list of every word in my data sets, in order. I then gave each word a frequency count of 1 and aggregated those counts by word to build a unigram table.
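A sketch of that process, reusing `files` and `clean_text()` from the sketches above (the chunk size is an arbitrary choice):

```r
# Sketch of the unigram table build: accumulate words file by file, in chunks.
words <- character(0)
for (f in files) {
  con <- file(f, open = "r")
  repeat {
    lines <- readLines(con, n = 50000, encoding = "UTF-8", skipNul = TRUE)
    if (length(lines) == 0) break
    chunk <- unlist(strsplit(clean_text(lines), " ", fixed = TRUE))
    words <- c(words, chunk[chunk != ""])   # append this subset's words in order
  }
  close(con)
}

# Give each word a frequency count of 1, then aggregate by word.
unigrams <- aggregate(freq ~ word,
                      data = data.frame(word = words, freq = 1L),
                      FUN  = sum)
unigrams <- unigrams[order(unigrams$freq, decreasing = TRUE), ]
```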
Taking a look at my word list, I found that the minimum word frequency was equal to the median frequency.
With the minimum frequency matching the median, at least half of the distinct words in my data set appear only once, and more than likely many of them are not actual English words. I’ll need to do further work on cleaning up my data, and possibly trim its size based on frequency.
However, I did still have a working set of words and frequencies. I decided to take a look at the words with a frequency of 100,000 or more, which gave me a subset of 120 unique words.
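That subsetting step, and the histogram discussed next, look roughly like this sketch (assuming the `unigrams` table from above, with its `word` and `freq` columns):

```r
# Sketch: keep only words seen at least 100,000 times.
top_words <- unigrams[unigrams$freq >= 100000, ]
nrow(top_words)

hist(top_words$freq, breaks = 30,
     main = "Words appearing 100,000+ times",
     xlab = "frequency", ylab = "number of words")
```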
The histogram is strongly skewed: even within this subset, the bulk of these common words have frequencies below roughly 50,000, while a handful of the most frequent words appear far more often. The top fifty most frequently appearing words are:
the, to, and, a, of, i, in, for, is, that, it, you, on, with, was, my, at, be, this, have, are, as, but, he, we, not, from, so, me, they, all, will, by, or, said, just, your, his, an, about, out, up, one, what, if, like, when, has, can, who
which would mostly fall into the “stopwords” category, and may not be useful for prediction.
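For completeness, the top-fifty list above can be read straight off the frequency-sorted unigram table:

```r
# The unigram table is sorted by frequency above, so the top fifty are its head.
paste(head(unigrams$word, 50), collapse = ", ")
```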
My plans for building my prediction algorithm follow from this analysis: clean the data further, trim low-frequency words to keep memory use manageable, evaluate whether stopwords actually help prediction, and build frequency tables of word sequences, beyond this unigram table, that the model can use to predict the next word in a phrase.
I’d better get to work.