- Fernandes Valdrich
Part of the Data Science Specialization offered by Johns Hopkins University on Coursera
Task: Build a prediction application to suggest the next word in a sentence based on user input.
How? A model coded in R was trained on over 7.1 million sentences drawn from tweets, blog posts and news articles (see the table below). The data was provided by the course instructor.
| Dataset | No. of Sentences (mil.) | No. of Words (mil.) | No. of unique words (thous.) |
|---|---|---|---|
| Tweets | 3.16 | 29.61 | 442 |
| Blogs | 2.17 | 36.89 | 403 |
| News | 1.78 | 33.59 | 330 |
The data was pre-processed to remove numbers and punctuation (preserving contractions). Links and email addresses were also removed.
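The cleaning steps above can be sketched in base R. This is only an illustration: `clean_text` is a hypothetical helper, the regular expressions are assumptions rather than the patterns actually used, and non-ASCII letters are simply dropped for brevity.

```r
# Hypothetical helper: the patterns below are illustrative assumptions.
clean_text <- function(x) {
  # Drop links (http://, https://, www.) up to the next whitespace
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)
  # Drop email addresses (any token containing an @)
  x <- gsub("\\S+@\\S+", " ", x, perl = TRUE)
  # Drop numbers
  x <- gsub("[0-9]+", " ", x)
  # Keep ASCII letters, spaces and apostrophes; everything else becomes a space
  x <- gsub("[^A-Za-z' ]", " ", x)
  # Remove apostrophes that are not inside a word, so contractions survive
  x <- gsub("(?<![A-Za-z])'|'(?![A-Za-z])", " ", x, perl = TRUE)
  # Collapse runs of whitespace
  trimws(gsub(" +", " ", x))
}

clean_text("Don't email me at a.b@mail.com, visit https://example.com 24/7!")
# "Don't email me at visit"
```

Links and email addresses are removed first because stripping punctuation beforehand would break them into ordinary-looking words.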
A Stupid Backoff model was trained on the data using n-grams of 2 to 5 words.
Only n-grams that occur more than once are considered when calculating the scores, and only n-grams that occur more than 3 times are offered as suggestions.
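As a rough illustration of Stupid Backoff scoring, the sketch below scores a candidate word against a toy table of made-up counts. The function name `stupid_backoff`, the count table and the backoff factor `alpha = 0.4` (the value commonly quoted for this method) are assumptions; the source does not state which factor was used.

```r
# Toy n-gram counts keyed by the n-gram text (made-up numbers).
counts <- c(
  "the" = 10, "cat" = 4, "sat" = 3, "dog" = 3,
  "the cat" = 4, "cat sat" = 3, "the dog" = 3, "dog sat" = 1,
  "the cat sat" = 3
)

# Assumed reading of the first threshold: drop n-grams seen only once.
kept <- counts[counts > 1]

# Hypothetical scorer: relative frequency at the longest matching order,
# multiplied by alpha for every word dropped from the context.
stupid_backoff <- function(word, context, counts, alpha = 0.4) {
  get_count <- function(key) {
    v <- unname(counts[key])
    if (is.na(v)) 0 else v
  }
  penalty <- 1
  while (length(context) > 0) {
    ctx <- paste(context, collapse = " ")
    n_full <- get_count(paste(ctx, word))
    if (n_full > 0) {
      return(penalty * n_full / get_count(ctx))
    }
    context <- context[-1]        # back off: drop the earliest context word
    penalty <- penalty * alpha
  }
  # No context left: fall back to the unigram relative frequency
  unigram_total <- sum(counts[!grepl(" ", names(counts), fixed = TRUE)])
  penalty * get_count(word) / unigram_total
}

stupid_backoff("sat", c("the", "cat"), kept)  # 3 / 4 = 0.75
stupid_backoff("sat", c("a", "cat"), kept)    # 0.4 * 3 / 4 = 0.3
```

The scores are not normalized probabilities, which is what keeps the method cheap: no discounting or normalization constants need to be stored.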
R packages such as quanteda, data.table and stringr were used to preprocess the data and perform the required calculations.
Space saving:
Only the 3 predictions with the highest scores are saved in the lookup table. This lets high-scoring lower-order n-grams be recommended more often while also reducing the size of the table.
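Keeping the top 3 predictions per context could look roughly like the following base R sketch. The column names (`prefix`, `word`, `score`) and the scores are made up for illustration, and the actual table layout may differ.

```r
# Toy candidate table: one row per (context, predicted word) pair.
candidates <- data.frame(
  prefix = c("i love", "thank you", "i love", "i love", "thank you",
             "i love", "i love"),
  word   = c("this", "so", "you", "the", "for", "my", "it"),
  score  = c(0.10, 0.30, 0.50, 0.12, 0.60, 0.08, 0.20),
  stringsAsFactors = FALSE
)

# Sort by context, then by descending score within each context
candidates <- candidates[order(candidates$prefix, -candidates$score), ]

# Rank rows within each context and keep only the first three
rank_in_prefix <- ave(candidates$score, candidates$prefix, FUN = seq_along)
top3 <- candidates[rank_in_prefix <= 3, ]

top3$word[top3$prefix == "i love"]
# "you" "it" "the"
```

In this toy example the table shrinks from 7 rows to 5; the saving grows with the number of candidates stored per context.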
Model 1. All words are known:
A straightforward Stupid Backoff model is used. When all the words are known, a fast binary-search-based subset can be used, which generates the result within 15 ms. This is the case for over 80% of the data.
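A keyed data.table is one way to get a binary-search subset in R. The sketch below assumes a lookup table with hypothetical columns `prefix`, `word` and `score`, and a hypothetical helper `suggest`; it shows only the lookup for a single context, not the full backoff chain.

```r
library(data.table)

# Toy lookup table (made-up scores)
lookup <- data.table(
  prefix = c("thank you", "thank you", "i love", "i love", "i love"),
  word   = c("for", "so", "you", "it", "the"),
  score  = c(0.60, 0.30, 0.50, 0.20, 0.12)
)

# Setting a key sorts the table by prefix; subsets on the key column
# then use binary search instead of scanning every row.
setkey(lookup, prefix)

# Hypothetical helper: return up to n words for an exact context match
suggest <- function(prefix_text, n = 3) {
  hits <- lookup[.(prefix_text), nomatch = NULL]
  head(hits[order(-score)], n)$word
}

suggest("i love")
# "you" "it" "the"
```

The exact match on the key is what requires every word in the context to be known: a context containing an unseen word has no key to search for and returns nothing.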
Model 2. At least one unknown word:
Unknown words are ignored and the suggestion is based on the remaining known words, whose positions are preserved. This is an attempt to incorporate the concept of context into the model. It takes about 465 ms to generate a suggestion using this method.
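One possible reading of this approach is sketched below: each unknown word is replaced by a one-word wildcard so that the known words keep their slots, and every stored context matching that pattern contributes to the suggestion. The helper `suggest_with_unknowns`, the toy table and the use of a regular-expression scan are all assumptions, not the method actually implemented.

```r
vocab <- c("thank", "you", "him", "very", "love", "much", "kindly", "well")

# Toy table of contexts and next words (made-up counts)
ngrams <- data.frame(
  prefix = c("thank you very", "thank him very", "thank you very",
             "love you very", "you very"),
  word   = c("much", "much", "kindly", "much", "well"),
  count  = c(9, 2, 4, 7, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: unknown words match any single word in the same slot
suggest_with_unknowns <- function(context, vocab, table) {
  slots <- ifelse(context %in% vocab, context, "[^ ]+")
  pattern <- paste0("^", paste(slots, collapse = " "), "$")
  hits <- table[grepl(pattern, table$prefix), ]
  if (nrow(hits) == 0) return(character(0))
  # Pool the counts of each candidate word across all matching contexts
  totals <- vapply(split(hits$count, hits$word), sum, numeric(1))
  names(sort(totals, decreasing = TRUE))
}

suggest_with_unknowns(c("thank", "zorp", "very"), vocab, ngrams)
# "much" "kindly"
```

A pattern scan like this has to examine every stored context rather than jumping to one key, which would be consistent with this path being much slower than the all-known case.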
The application is hosted on shinyapps.io.
The Suggester suggests the next most likely word.
Do give the Babbeler a try. It completes the sentence based on the output from the Suggester.
After entering your phrase in the provided box, hit the Enter key to generate the prediction off of