Coursera Data Science Capstone Project

D. Bretheim
12/05/2016

Project Goals and Technical Approach

The Data Science Specialization Capstone project challenges students to design and build a public data product for next word prediction. The application must be able to capture user text input and then predict the most likely next word.

Given the need to balance model size (physical RAM footprint) against algorithm runtime, the goal was to keep both small enough to provide a responsive user experience while still achieving a reasonable level of accuracy.

The sampling strategy was designed to capture the maximum amount of word variation from the three raw data sources, given the memory limitations of my PC. The sources were randomly sampled, with heavier emphasis on Twitter than on Blogs and News.
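
A minimal sketch of this kind of line-level random sampling in R (the file paths and sampling rates here are illustrative, not the exact values used):

    set.seed(1234)
    sample_lines <- function(path, rate) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      lines[rbinom(length(lines), 1, rate) == 1]   # keep each line with probability = rate
    }
    # Heavier weight on Twitter than on Blogs and News
    twitter <- sample_lines("data/en_US.twitter.txt", 0.15)
    blogs   <- sample_lines("data/en_US.blogs.txt",   0.10)
    news    <- sample_lines("data/en_US.news.txt",    0.10)
    corpus_sample <- c(twitter, blogs, news)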

Pre-Processing:

  • Contractions were expanded into separate words for tokenization.
  • Stopwords were retained to avoid creating meaningless n-grams, even though doing so increases n-gram “redundancy”.
  • End-of-sentence indicators (periods, question marks, and exclamation points) were treated as tokens.
  • Other data preparation steps included removal of numbers and most non-alpha characters, whitespace compression, and conversion to lower case (a sketch of the pipeline follows below).
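
A rough sketch of these cleaning steps in R (the contraction map is abbreviated for illustration; the real pipeline handles many more cases):

    clean_text <- function(x) {
      x <- tolower(x)                                  # lower case
      x <- gsub("can't", "can not", x, fixed = TRUE)   # expand contractions
      x <- gsub("'ll",   " will",   x, fixed = TRUE)   # (abbreviated map)
      x <- gsub("[.?!]+", " <EOS> ", x)                # sentence ends become tokens
      x <- gsub("[0-9]+", " ", x)                      # remove numbers
      x <- gsub("[^a-z<> ]", " ", x)                   # remove most non-alpha chars
      gsub("\\s+", " ", trimws(x))                     # compress whitespace
    }
    clean_text("I can't go. Can you?")
    # "i can not go <EOS> can you <EOS>"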

N-Gram Creation

Pruning strategy:

  • Quadgrams and unigrams were pruned to those occurring more than 3 times.
  • Trigrams and bigrams were further pruned to retain only the top 4 completions for each (n-1)-gram prefix (see the sketch below).
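
The top-4 pruning can be expressed compactly with data.table; the toy table and column names (w1, w2, w3, freq) below are illustrative, not the actual schema:

    library(data.table)
    trigrams <- data.table(
      w1   = "i",
      w2   = "want",
      w3   = c("to", "a", "the", "it", "more"),
      freq = c(50, 12, 9, 7, 3)
    )
    # Keep only the top 4 completions for each (n-1)-gram prefix
    trigrams <- trigrams[order(-freq), head(.SD, 4), by = .(w1, w2)]
    # Frequency-based pruning, as applied to quadgrams and unigrams
    trigrams <- trigrams[freq > 3]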

Other Considerations:

  • Profanity was retained in the corpus, but filtered from the predictions.
  • An integer-based “dictionary” lookup is used instead of a text-based lookup, significantly reducing file size (a sketch follows below).
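
A minimal sketch of the integer “dictionary” idea (the names here are hypothetical): each word is stored once in a character vector, and the n-gram tables hold integer indices rather than repeated strings:

    vocab <- c("i", "want", "to", "go")                       # unigram dictionary
    word_to_id <- setNames(seq_along(vocab), vocab)
    bigram_ids <- c(word_to_id[["i"]], word_to_id[["want"]])  # 1L, 2L
    vocab[bigram_ids]                                         # map back: "i" "want"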

After considerable experimentation, a quadgram model was selected, based on the constraints of the free shinyapps.io hosting provided to Coursera students, as well as the processing limitations of my PC.

Final Model

The model is based on the so-called “Stupid Backoff” algorithm described in “Large Language Models in Machine Translation”, Brants et al., 2007. This approach was chosen for the following reasons:

  • Well suited to a large web-based corpus.
  • Relatively straightforward implementation.
  • Approximates the accuracy level of Kneser-Ney smoothing for large amounts of data.

The algorithm first searches the quadgrams for the last three words entered, then the trigrams for the last two words, and finally the bigrams for the last word. Matches are scored by applying the “Stupid Backoff” scoring function. If no matches are found, the model returns the four most frequent unigrams based on their relative frequency.
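
In Brants et al. (2007), an observed n-gram scores its relative frequency, and each backoff step multiplies the score of the shorter context by a fixed factor of 0.4. A minimal sketch of that scoring logic, using toy counts rather than the app's pruned integer tables:

    # Toy n-gram counts keyed by the space-joined n-gram (sketch only)
    counts <- c(
      "i want to go" = 5, "i want to" = 8,
      "want to go"   = 6, "want to"   = 12,
      "to go"        = 7, "to"        = 40,
      "go"           = 15
    )
    n_words <- 500   # assumed total word count, for the unigram fallback

    sb_score <- function(context, word, alpha = 0.4) {
      if (length(context) == 0) {
        return(counts[[word]] / n_words)            # unigram relative frequency
      }
      ngram  <- paste(c(context, word), collapse = " ")
      prefix <- paste(context, collapse = " ")
      if (!is.na(counts[ngram]) && !is.na(counts[prefix])) {
        counts[[ngram]] / counts[[prefix]]          # observed: relative frequency
      } else {
        alpha * sb_score(context[-1], word, alpha)  # back off and penalize
      }
    }

    sb_score(c("i", "want", "to"), "go")    # quadgram hit: 5/8
    sb_score(c("you", "want", "to"), "go")  # backs off: 0.4 * 6/12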

Application and Hosting

The “Next Word Prediction” application has the following functionality:

  • The user is provided with a text box where they can enter a word or phrase.
  • The model returns up to 5 next-word predictions in descending order of their “Stupid Backoff” scores.
  • If the predicted word is flagged as profanity, it will be masked.
  • If the model predicts a sentence termination (i.e., period, question mark, or exclamation point), it will be displayed.

The application outputs the following indicators of predictive performance:

  • The “Stupid Backoff” score value for each predicted word.
  • The n-gram associated with each predicted word.
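
Taken together, the interaction described above reduces to a small Shiny app. The sketch below uses a hypothetical predict_next() stub in place of the real n-gram lookup, so it is illustrative rather than the deployed code:

    library(shiny)

    # Hypothetical stand-in for the real n-gram lookup; it returns the
    # word, its score, and the n-gram used, matching the outputs above.
    predict_next <- function(phrase, n = 5) {
      data.frame(word  = c("the", "to", "a", "and", "of")[seq_len(n)],
                 score = round(0.4 ^ (seq_len(n) - 1), 3),
                 ngram = "unigram")
    }

    ui <- fluidPage(
      textInput("phrase", "Enter a phrase:"),
      tableOutput("predictions")
    )
    server <- function(input, output) {
      output$predictions <- renderTable({
        req(input$phrase)
        predict_next(input$phrase, n = 5)
      })
    }
    shinyApp(ui, server)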

The application is hosted on the Shinyapps.io platform. Click here to view.

I hope it meets your expectations.