Data Science Specialization: Capstone Project
By: Kevin Markham
The goal of this project was to build an application that takes a phrase typed by the user and predicts the next word they “most likely” want to type.
The primary use case for this application is text messaging on mobile phones, in which successfully predicting the next word a user wants to type will save them from actually having to type that word, increasing their overall speed.
The data available for training the predictive model consists of millions of tweets, blog posts, and news articles in English. (Files in other languages were available but were not used.)
The first step in model training was learning all of the 2-grams (word pairs), 3-grams (word triplets), and 4-grams (word quadruplets) in about half of the training data, as well as their frequencies.
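As a rough illustration of this first step, the sketch below counts n-grams and their frequencies in a tiny made-up corpus. The function name `count_ngrams`, the lowercase-and-split tokenization, and the sample sentences are all assumptions for illustration; the project's actual text cleaning and tooling are not described here.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count every run of n consecutive words across the given sentences."""
    counts = Counter()
    for sentence in sentences:
        # Naive tokenization (an assumption): lowercase, split on whitespace.
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Hypothetical three-sentence corpus standing in for the real training data.
corpus = ["i love new york", "i love new music", "i love new york city"]

bigrams = count_ngrams(corpus, 2)
trigrams = count_ngrams(corpus, 3)
fourgrams = count_ngrams(corpus, 4)
```

Each counter maps a tuple of words to how often it occurred, for example `("i", "love")` in `bigrams`.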
Each 4-gram was then broken into a 3-gram (its first 3 words) and the final word. For each of the resulting 3-grams, the most common final word was calculated.
This process was repeated for the original set of 3-grams, producing a set of 2-grams and the most common next word for each 2-gram.
This process was also repeated for the original set of 2-grams, producing a set of single words and the most common next word for each.
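The splitting step described above can be sketched as follows. This is a minimal illustration, not the project's code: the helper `most_common_next_word`, the hand-written frequency tables, and the choice to merge all results into one dictionary keyed by word tuples are assumptions.

```python
from collections import Counter, defaultdict


def most_common_next_word(ngram_counts):
    """Map each prefix (all words but the last) to its most frequent final word."""
    by_prefix = defaultdict(Counter)
    for ngram, freq in ngram_counts.items():
        prefix, final_word = ngram[:-1], ngram[-1]
        by_prefix[prefix][final_word] += freq
    return {
        prefix: finals.most_common(1)[0][0]
        for prefix, finals in by_prefix.items()
    }


# Hypothetical frequency tables; real ones would come from the training data.
fourgram_counts = {
    ("i", "love", "new", "york"): 2,
    ("i", "love", "new", "music"): 1,
}
trigram_counts = {
    ("i", "love", "new"): 3,
    ("love", "new", "york"): 2,
    ("love", "new", "music"): 1,
}
bigram_counts = {
    ("i", "love"): 3,
    ("love", "new"): 3,
    ("new", "york"): 2,
    ("new", "music"): 1,
}

# Prefixes of different lengths never collide, so one table can hold them all.
lookup = {}
for counts in (bigram_counts, trigram_counts, fourgram_counts):
    lookup.update(most_common_next_word(counts))
```

After this runs, `lookup` holds one predicted word for every 1-word, 2-word, and 3-word prefix seen in the sample tables.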
When a user types a phrase into the application, the application quickly makes a single prediction for the next word. The prediction algorithm is simple:
Because all predictions are “pre-calculated” and stored in a lookup table, the application can respond very quickly: it only needs to check whether the previous 1, 2, and 3 words are present in the lookup table.
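A lookup of this kind might look like the sketch below. The text does not state the order in which the contexts are checked or what happens when nothing matches, so this version assumes the longest context is tried first and that `None` is returned when no context is found. The function name `predict_next_word` and the sample table are hypothetical.

```python
def predict_next_word(phrase, lookup, max_context=3):
    """Return the stored prediction for the longest matching context, else None."""
    words = phrase.lower().split()
    # Assumed order: try the last 3 words, then the last 2, then the last 1.
    for n in range(min(max_context, len(words)), 0, -1):
        context = tuple(words[-n:])
        if context in lookup:
            return lookup[context]
    return None  # assumed fallback when no context is in the table


# Hypothetical pre-calculated table keyed by tuples of 1 to 3 words.
sample_lookup = {
    ("i", "love", "new"): "york",
    ("love", "new"): "york",
    ("new",): "york",
    ("happy",): "birthday",
}

full_match = predict_next_word("I love new", sample_lookup)
backed_off = predict_next_word("so very happy", sample_lookup)
no_match = predict_next_word("xyzzy", sample_lookup)
```

Each call performs at most three dictionary lookups, which is why the response time does not depend on the size of the training data.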
There are many possible enhancements to the application: