Marshall McQuillen
December 20, 2017
PredictR is my Shiny word prediction application for the Johns Hopkins University Data Science Specialization, offered through Coursera. Directions are provided on the main page of the application, shown below.
Stupid Backoff (the mathematical details of which can be found here) is an algorithm that relies on the Markov Property, which states that the probability of an event given its full history of previous events is roughly equal to the probability of the event given only its present state.
In terms of word prediction, this means that we can predict the next word in a sequence based on the previous few words with roughly the same accuracy as if we based our prediction on the entire preceding sentence.
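Written in the usual n-gram notation (where $w_i$ denotes the word being predicted and $n$ the order of the model; this is just my shorthand for the idea above), the assumption amounts to:

$$
P(w_i \mid w_1, \ldots, w_{i-1}) \;\approx\; P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
$$

That is, the model only conditions on the last $n - 1$ words rather than the entire preceding history.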
Stupid Backoff does precisely this. It looks at the previous n words, called an n-gram, checks the data set to see if there are any matches to that sequence and, if there are, chooses the next word with the highest count given that preceding sequence. If no matches are found, it goes “down” one level, to the (n - 1)-gram, and repeats the process.
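As a rough illustration of that back-off loop (not the exact code behind PredictR), the sketch below assumes the n-gram counts have already been pre-computed into a lookup table with `prefix`, `word` and `count` columns; the function and column names are hypothetical.

```r
# Illustrative back-off lookup. Assumes `ngram_counts` is a data frame with
# columns: prefix (preceding words, space-separated), word (candidate next
# word) and count (how often that pair appears in the corpus).
predict_next_word <- function(input, ngram_counts, max_n = 3) {
  tokens <- tolower(unlist(strsplit(trimws(input), "\\s+")))
  if (length(tokens) == 0) return(NA_character_)

  # Start with the longest prefix the model supports, then back off
  for (n in seq(min(max_n, length(tokens)), 1)) {
    prefix  <- paste(tail(tokens, n), collapse = " ")
    matches <- ngram_counts[ngram_counts$prefix == prefix, ]

    if (nrow(matches) > 0) {
      # A match at this level: return the continuation with the highest count
      return(matches$word[which.max(matches$count)])
    }
    # Otherwise drop to the (n - 1)-gram prefix and try again
  }

  NA_character_  # even the single-word prefix was unseen
}
```

The full Stupid Backoff scoring also applies a fixed discount (typically 0.4) each time it backs off when ranking candidates across levels; the sketch above simply returns the best continuation at the first level with a match, which is all the description above requires.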
After I built the blueprint for my model, I used the incredibly informative benchmark.R repo to test a few different models and sample sizes.
Each model employs the Stupid Backoff algorithm explained in the previous slide. I started with a 6-gram model using 25% of the data provided (randomly sampled), then compared it with a 4-gram model using 33% of the data, a 4-gram model using 25% of the data and, finally, a 5-gram model using 25% of the data. The results for 28,464 predictions on each model are shown below…
| Model | Sample Size | Overall Top-3 Score | Top-1 Precision | Top-3 Precision | Average Runtime | Memory Used |
|---|---|---|---|---|---|---|
| 6 Gram | 25% | 15.13% | 11.68% | 18.09% | 143.93 msec | 120.92 Mb |
| 4 Gram | 33% | 15.33% | 11.76% | 18.33% | 56.85 msec | 157.10 Mb |
| 4 Gram | 25% | 15.07% | 11.59% | 18.06% | 53.96 msec | 117.67 Mb |
| 5 Gram | 25% | 15.06% | 11.63% | 18.01% | 93.92 msec | 123.74 Mb |
From the table, we have clear evidence that the model does in fact satisfy the Markov Property from the previous slide (the Top-1 precision of the 6-gram model is actually slightly worse than that of the 4-gram model trained on 33% of the data). I decided to use the 4-gram model with a 25% sample of the data as my final model. In my opinion, a roughly 25% reduction in the memory used by the model is worth the 0.17 percentage point reduction in Top-1 precision.
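For completeness, here is the back-of-the-envelope arithmetic behind that trade-off, using the two 4-gram rows from the table above (plain R, purely illustrative):

```r
# Trade-off between the 4-gram / 33% model and the 4-gram / 25% model,
# using the figures reported in the benchmark table above
mem_33  <- 157.10   # memory used (Mb), 4-gram, 33% sample
mem_25  <- 117.67   # memory used (Mb), 4-gram, 25% sample
top1_33 <- 11.76    # Top-1 precision (%), 4-gram, 33% sample
top1_25 <- 11.59    # Top-1 precision (%), 4-gram, 25% sample

round((mem_33 - mem_25) / mem_33 * 100, 1)  # ~25.1% less memory
top1_33 - top1_25                           # 0.17 percentage points less precision
```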
“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful”
– George Box & Norman R. Draper, Empirical Model-Building and Response Surfaces
While language is clearly an ever-changing and highly dynamic domain for computational modeling, I can think of a few ways in which my model could be made “less wrong”:
- Larger database: I would be very interested to see how my model would perform if I had a machine powerful enough to process the entire corpus.
- User-specific learning: Since every person has an individual style of speaking (and therefore typing), it should be feasible to build functionality into the application that learns how an individual user types, theoretically making the model more accurate for that user the more they use it.
- Speed: Although an average runtime of 53.96 milliseconds seems relatively quick, over time it will become cumbersome to the user. Notably, on this Coursera forum, a few users boast average runtimes in the single-digit milliseconds. (Sincere applause to those users.)