This milestone report is for the Coursera Data Science Capstone, with the eventual objective to build a next word prediction app. This report contains some exploratory analysis of the data, and a high-level description of the intended methodology for the eventual app. Credit to the authors of Text Mining with R for their helpful guide.
Our corpus for this assignment includes three documents - blog text, tweets, and news articles. First, I explored some basic features of each document in our corpus.
| Data set | Number of lines | Avg. Line Length |
|---|---|---|
| Blogs | 899,288 | 229.99 |
| News | 1,010,242 | 201.16 |
| 2,360,148 | 68.68 |
As expected, the tweets are the shortest due to the character limit and nature of the platform, while the blogs and news articles are both over 200 characters per line. We have a decent amount of data in the corpus to make a predictive application with.
Next, I sought to generate and understand how many unique words and n-grams we have. To do so, I tokenized words and n-grams, counted their frequency, and consolidated these counts for all the documents into a single corpus file. I retained stop words as they are essential to actual writing and our use case of next word prediction. For n-gram level below, I generated the count of unique n-grams, the average frequency of the n-grams, a “scatter histogram”, and a cumulative sum from most frequent to least frequent. In the interest of compute time, I used a 10% of the lines prior to tokenization as a representative sample, resulting in 426,967 lines from the total of nearly 4.3M (I wish I went even lower). I also did not conduct this analysis by source document (only for the whole corpus), though it is likely worth doing at some point.
From the table below, we can see the number of unique n-grams (count) increase dramatically, as there are more n+1 word combinations than n word combinations. As we would expect, this means the mean frequency for an n-gram to be much lower as it gets longer. We can also see the median n-gram frequency is 1 in all cases, indicating over half the n-grams at any level are singles. This will be reaffirmed in our plots later. These may have poor predictive power, and we may want to remove them later.
| gram.level | count | mean | median |
|---|---|---|---|
| 1 | 193411 | 53.098402 | 1 |
| 2 | 2757885 | 3.578936 | 1 |
| 3 | 6496958 | 1.457915 | 1 |
| 4 | 8264252 | 1.099062 | 1 |
Here, we see the cumulative distribution of the frequency of words, where the y axis is the cumulative number of words at that frequency or lower. As we would have expected from the table above, our distribution is almost completely skewed to 1. A log-scale might have been more appropriate for the plots below. I plan to generate these again with low-frequency n-grams removed
We can see here that though the vast majority of words occur very few times, and there isn’t much of a Pareto “80/20” rule here for us to exploit, which presents a modelling challenge. This is validated again in the plot below.
We now plot the cumulative sum of the frequency table we generated (please forgive the out of sync indexes on the plot). We see the pattern repeat here. The bad news is that there aren’t a few n-grams that account for the majority of word predictions. The good news is that there are a ton of n-grams that don’t have much predictive power, and, following this assignment, I’ll remove them and revisit these graphs. We do see a slight trend with longer tails on the distribution as the n-gram level rises, which is consistent with those n-grams being more unique.
With these n-grams identified, a well-known approach to next word prediction that is feasible is Kat’z Back-off model. This model looks at word choice as a Markov Chain, and assumes the that the next word choice is determined by the combination of words that precede it. Using the n-gram frequencies generated for this report, we can determine the conditional probablity of the last word in the n-gram occuring.
Then, we can match the words preceding our current state in the text to this conditional probability table, and select the most probable word that “completes” the n-gram. For example, if we have thus far written “I love to…” and we are trying to predict the fourth word, we can look at all 4-grams in our corpus that start with “I love to…”, and select the most frequent 4th word as our prediction. If this n-gram isn’t present, the model assumes we can “back off” and use tri-grams starting with “…love to…” and so on.
A few adjustments will have to be made as next steps, listed below:
My next step, after this additional data cleaning, is to split out the last word of each level of n-gram and compute the conditional probabilities for the of each word given the n-gram that precedes it.