Introduction

This milestone report is for the Coursera Data Science Capstone, with the eventual objective to build a next word prediction app. This report contains some exploratory analysis of the data, and a high-level description of the intended methodology for the eventual app. Credit to the authors of Text Mining with R for their helpful guide.

Exploratory analysis

Document files in the corpus

Our corpus for this assignment includes three documents - blog text, tweets, and news articles. First, I explored some basic features of each document in our corpus.

Data set	Number of lines	Avg. Line Length
Blogs	899,288	229.99
News	1,010,242	201.16
Twitter	2,360,148	68.68

As expected, the tweets are the shortest due to the character limit and nature of the platform, while the blogs and news articles are both over 200 characters per line. We have a decent amount of data in the corpus to make a predictive application with.

Words and n-grams

Next, I sought to generate and understand how many unique words and n-grams we have. To do so, I tokenized words and n-grams, counted their frequency, and consolidated these counts for all the documents into a single corpus file. I retained stop words as they are essential to actual writing and our use case of next word prediction. For n-gram level below, I generated the count of unique n-grams, the average frequency of the n-grams, a “scatter histogram”, and a cumulative sum from most frequent to least frequent. In the interest of compute time, I used a 10% of the lines prior to tokenization as a representative sample, resulting in 426,967 lines from the total of nearly 4.3M (I wish I went even lower). I also did not conduct this analysis by source document (only for the whole corpus), though it is likely worth doing at some point.

N-Gram Summary

From the table below, we can see the number of unique n-grams (count) increase dramatically, as there are more n+1 word combinations than n word combinations. As we would expect, this means the mean frequency for an n-gram to be much lower as it gets longer. We can also see the median n-gram frequency is 1 in all cases, indicating over half the n-grams at any level are singles. This will be reaffirmed in our plots later. These may have poor predictive power, and we may want to remove them later.

gram.level	count	mean	median
1	193411	53.098402	1
2	2757885	3.578936	1
3	6496958	1.457915	1
4	8264252	1.099062	1

Empirical Cumulative Distribution Function of Word Frequencies

Here, we see the cumulative distribution of the frequency of words, where the y axis is the cumulative number of words at that frequency or lower. As we would have expected from the table above, our distribution is almost completely skewed to 1. A log-scale might have been more appropriate for the plots below. I plan to generate these again with low-frequency n-grams removed

We can see here that though the vast majority of words occur very few times, and there isn’t much of a Pareto “80/20” rule here for us to exploit, which presents a modelling challenge. This is validated again in the plot below.

Cumulative Sum

We now plot the cumulative sum of the frequency table we generated (please forgive the out of sync indexes on the plot). We see the pattern repeat here. The bad news is that there aren’t a few n-grams that account for the majority of word predictions. The good news is that there are a ton of n-grams that don’t have much predictive power, and, following this assignment, I’ll remove them and revisit these graphs. We do see a slight trend with longer tails on the distribution as the n-gram level rises, which is consistent with those n-grams being more unique.

Intended approach

With these n-grams identified, a well-known approach to next word prediction that is feasible is Kat’z Back-off model. This model looks at word choice as a Markov Chain, and assumes the that the next word choice is determined by the combination of words that precede it. Using the n-gram frequencies generated for this report, we can determine the conditional probablity of the last word in the n-gram occuring.

Then, we can match the words preceding our current state in the text to this conditional probability table, and select the most probable word that “completes” the n-gram. For example, if we have thus far written “I love to…” and we are trying to predict the fourth word, we can look at all 4-grams in our corpus that start with “I love to…”, and select the most frequent 4th word as our prediction. If this n-gram isn’t present, the model assumes we can “back off” and use tri-grams starting with “…love to…” and so on.

Next steps

A few adjustments will have to be made as next steps, listed below:

Remove n-grams with a frequency of 1, or perhaps a those below a higher threshold, as these have low predictive power with our intended model
This method does not necessarily account for the fact that some unigrams are part of bigrams (or generally, n-grams are part of n+1-grams), and the Kneser-Ney smoothing technique corrects our conditional probabilities for this.
We want to omit other text, like non-ASCII characters, URLs, and curse words and foreign words, where present, to facilitate our use case. My tokenized counts will be filtered for these words.

My next step, after this additional data cleaning, is to split out the last word of each level of n-gram and compute the conditional probabilities for the of each word given the n-gram that precedes it.

Coursera Data Science Capstone - Milestone Report

Asher Zafar

January 7, 2017