D. Bretheim
12/05/2016
The Data Science Specialization Capstone project challenges students to design and build a public data product for next word prediction. The application must be able to capture user text input and then predict the most likely next word.
The goal was to balance model size (physical RAM) against algorithm runtime, so that the application responds quickly enough to provide a positive user experience while still achieving a reasonable level of accuracy.
The sampling strategy was designed to capture the maximum amount of word variation from the three raw data sources, given the memory limitations of my PC. The data sources were randomly sampled, with a heavier emphasis on Twitter versus Blogs and News.
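As a rough illustration of this kind of weighted random sampling, here is a minimal Python sketch. It is not the project's actual code: the sampling rates, function names, and seed handling are all hypothetical, since the exact proportions used are not stated here. The only property taken from the description above is that Twitter is sampled at a higher rate than Blogs and News.

```python
import random

# Hypothetical sampling rates (assumption): the text only says that
# Twitter was sampled more heavily than Blogs and News.
SAMPLE_RATES = {"twitter": 0.10, "blogs": 0.05, "news": 0.05}


def sample_lines(lines, rate, seed=0):
    """Keep each line independently with probability `rate`."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < rate]


def build_sample(sources, rates=SAMPLE_RATES, seed=0):
    """Sample every source at its own rate and pool the results.

    `sources` maps a source name to a list of text lines.
    Each source gets its own seed so the draws are reproducible
    but not identical across sources.
    """
    sample = []
    for offset, name in enumerate(sorted(sources)):
        sample.extend(sample_lines(sources[name], rates[name], seed + offset))
    return sample
```

Sampling line by line keeps memory use proportional to the sample rather than to the raw files, which matters when the corpus does not fit comfortably in RAM.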
Pre-Processing:
Pruning strategy:
Other Considerations:
After considerable experimentation, a quadgram model was selected, based on the constraints of the free Shinyapps.io hosting provided to Coursera students, as well as the processing limitations of my PC.
The model is based on the so-called “Stupid Backoff” algorithm described in “Large Language Models in Machine Translation”, Brants et al., 2007. This approach was chosen for the following reasons:
The algorithm searches the quadgrams for the last three words entered. Then the trigrams are searched for the last two words, followed by a search of the bigrams for the last word entered. Matches are scored by applying the “Stupid Backoff” scoring function. If no matches are found, the model returns the top four most likely unigrams based on their relative frequency.
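The search order described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the application's actual implementation: the function names and the toy corpus are hypothetical, the count of a context is approximated by the number of times it is followed by any word, and the backoff factor of 0.4 is the value suggested by Brants et al. (2007). A candidate word is scored as count(context + word) / count(context) at the highest order where it is seen, multiplied by 0.4 for each level backed off.

```python
from collections import Counter, defaultdict

ALPHA = 0.4  # backoff factor suggested by Brants et al. (2007)


def build_model(tokens, max_order=4):
    """Count unigrams and, for each context of 1..max_order-1 words,
    the words that follow it."""
    unigrams = Counter(tokens)
    followers = defaultdict(Counter)
    for n in range(2, max_order + 1):
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            followers[context][tokens[i + n - 1]] += 1
    return unigrams, dict(followers)


def predict(model, text, top_k=4, max_order=4):
    """Return up to `top_k` candidate next words, best first."""
    unigrams, followers = model
    words = text.lower().split()
    scores = {}
    penalty = 1.0
    # Quadgrams first (last three words), then trigrams, then bigrams.
    for n in range(max_order, 1, -1):
        if len(words) < n - 1:
            continue  # not enough input words for this order
        nexts = followers.get(tuple(words[-(n - 1):]))
        if nexts:
            seen = sum(nexts.values())  # approximates count(context)
            for word, count in nexts.items():
                # Keep the score from the highest order that saw the word.
                scores.setdefault(word, penalty * count / seen)
        penalty *= ALPHA  # each backoff step discounts the score
    if not scores:
        # No match at any order: fall back to unigram relative frequency.
        total = sum(unigrams.values())
        scores = {w: c / total for w, c in unigrams.items()}
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Toy corpus (hypothetical) for illustration only.
corpus = "the cat sat on the mat the cat sat on the sofa the cat ran".split()
model = build_model(corpus)
suggestions = predict(model, "the cat")
```

Because the scores are not normalized probabilities, no smoothing pass over the counts is needed, which keeps both the lookup tables and the runtime small.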
The “Next Word Prediction” application has the following functionality:
The application outputs the following indicators of predictive performance:
The application is hosted on the Shinyapps.io platform.
I hope it meets your expectations.