Andrew Luyt
Dec 11, 2022
This application uses machine learning techniques to predict the next word in a sentence. The user can type in a sentence, click on predictions, or load a partial sentence from a list of examples.
A large 5-gram look-up table was constructed, containing the most common completion words along with their probabilities. Given an input, the table is searched, candidate completions are aggregated according to the selected smoothing algorithm, and the best completions are returned.
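The core query is a straightforward key-value look-up. Below is a minimal, runnable sketch of the idea, assuming a hashed environment keyed by the history words; the names and the toy data are illustrative, not the app's actual code:

lookup <- new.env(hash = TRUE)
assign("i want to go", data.frame(word = c("home", "out", "back"),
                                  prob = c(0.31, 0.12, 0.08)),
       envir = lookup)

predict_next <- function(input, k = 4) {
  # normalize and keep at most the last four words as the history key
  words <- strsplit(tolower(input), "\\s+")[[1]]
  key <- paste(tail(words, 4), collapse = " ")
  if (!exists(key, envir = lookup)) return(character(0))
  hits <- get(key, envir = lookup)
  head(hits$word[order(-hits$prob)], k)
}

predict_next("I want to go")   # "home" "out" "back"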
Load example sentences
Customize the number of predictions
Show word probabilities / scores
Omit common words like "the" or "and".
Advanced: Choose the n-gram smoothing algorithm. Simple interpolation and Google’s stupid backoff are available.
We trained our model on 3.4 million samples of text from news articles, Twitter, and blogs, comprising over 80 million words. The model uses 1- through 5-grams and supports two types of smoothing: simple interpolation and Google's stupid backoff. The web interface uses R's Shiny framework.
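Stupid backoff is the simpler of the two: when a long n-gram has never been seen, it shortens the context and applies a fixed penalty rather than redistributing probability mass. A minimal, runnable sketch over a toy count table; the real app does the same over its 1- to 5-gram tables, and 0.4 is the penalty suggested by Brants et al. (2007):

counts <- c("to the" = 50, "to" = 120, "the" = 300)
total_words <- 1000

ngram_count <- function(words) {
  key <- paste(words, collapse = " ")
  if (key %in% names(counts)) counts[[key]] else 0
}

stupid_backoff <- function(word, history, alpha = 0.4) {
  if (length(history) == 0)                 # unigram base case
    return(ngram_count(word) / total_words)
  num <- ngram_count(c(history, word))
  den <- ngram_count(history)
  if (num > 0 && den > 0) return(num / den)
  alpha * stupid_backoff(word, history[-1], alpha)   # drop the oldest word
}

stupid_backoff("the", c("went", "to"))   # unseen trigram: 0.4 * (50 / 120)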
Simple interpolation: 270 predictions per second, 31.6% accuracy
Stupid backoff: 476 predictions per second, 28.1% accuracy
Accuracy: We count a prediction as correct if the true next word appears among the model's top four predictions, as this reflects how the app is used. The common perplexity metric was inappropriate here since stupid backoff produces scores, not true probabilities, and we wanted to compare the two smoothing algorithms directly.
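In outline, the metric is just a top-4 hit rate over a held-out set. A sketch, where predict_next() stands in for whichever smoothing algorithm is being evaluated and the two vectors hold test histories and their true next words:

top4_accuracy <- function(histories, truths, k = 4) {
  hits <- mapply(function(h, w) w %in% predict_next(h, k), histories, truths)
  mean(hits)   # proportion of test cases with the true word in the top k
}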
Most CPU time is spent looking up n-gram histories to find learned completions, so heavy use was made of hash tables with constant-time look-up. Switching from a data.frame-based approach to hash tables produced a speed-up of over 100x.
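The change is easy to demonstrate in miniature: an R environment is a hash table with constant-time access, while subsetting a data.frame scans an entire column. A toy benchmark of the two approaches (the 100x figure above comes from the full n-gram tables, not this sketch):

keys <- sprintf("history_%d", 1:100000)
df <- data.frame(history = keys, completion = "word")

ht <- new.env(hash = TRUE, size = length(keys))
for (k in keys) assign(k, "word", envir = ht)

system.time(for (i in 1:1000) df$completion[df$history == "history_99999"])  # linear scans
system.time(for (i in 1:1000) get("history_99999", envir = ht))              # hashed look-ups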
Parameters for simple interpolated smoothing were tuned with a grid search:
> best.results(results)
    l1    l2     l3      l4      l5 proportion.correct
1 0.39 0.549 0.0488 0.01098 0.00122          0.3165915
2 0.63 0.333 0.0259 0.00666 0.00444          0.3163361
3 0.79 0.189 0.0147 0.00567 0.00063          0.3163361
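The weights l1 through l5 blend the probability estimates from the five n-gram orders into a single score, which is why each row above sums to 1. A sketch of the combination step, where the vector p stands in for the per-order estimates matched to l1 through l5:

lambda <- c(0.39, 0.549, 0.0488, 0.01098, 0.00122)   # best row above

interpolate <- function(p) {
  # p: the five per-order probability estimates for one candidate word
  sum(lambda * p)
}

interpolate(c(0.1, 0.05, 0.2, 0.0, 0.0))   # a single blended score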
Grid search on a large validation corpus is slow, so custom plotting functions and a random forest model were used to explore this 5-dimensional parameter space and identify the most promising regions in which to run the grid search.
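A sketch of that surrogate-model idea, assuming results is the data.frame of weight combinations and measured accuracies collected from the coarse search:

library(randomForest)

# fit a cheap surrogate for the expensive accuracy measurement
rf <- randomForest(proportion.correct ~ l1 + l2 + l3 + l4 + l5,
                   data = results, ntree = 500)

# score a much denser grid and inspect the most promising regions
dense <- expand.grid(l1 = seq(0.3, 0.8, by = 0.05),
                     l2 = seq(0.1, 0.6, by = 0.05),
                     l3 = 0.03, l4 = 0.01, l5 = 0.005)
dense$predicted <- predict(rf, dense)
head(dense[order(-dense$predicted), ])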