Toby Huang
July 5, 2020
Our goal is to predict the most likely next words that the user is going to type, based on the previous words that they have typed. The purpose is to help users type faster.
The Shiny app is at https://toby-huang.shinyapps.io/SwiftkeyWordPrediction/. Type a sentence and the app will try to predict the top ten next words. You can adjust the number of predictions using the slider.
The training data came from the SwiftKey corpus, which consists of English text from Twitter, blogs, and news.
          file_name number_of_lines
1 en_US.twitter.txt         2360148
2   en_US.blogs.txt          899288
3    en_US.news.txt         1010242
Since the corpus is very large, I randomly sampled 10% of the lines to train the model.
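The sampling step can be sketched roughly as follows. This is a minimal illustration, not the code used for the app: the helper name `sample_lines`, the seed, and the toy input are assumptions, and reading the actual corpus files is omitted.

```r
# Hypothetical helper: keep a random fraction of the lines in a character vector.
sample_lines <- function(lines, fraction = 0.10, seed = 42) {
  set.seed(seed)                                   # make the sample reproducible
  n_keep <- round(length(lines) * fraction)        # number of lines to keep
  lines[sort(sample.int(length(lines), n_keep))]   # sort() keeps the original line order
}

# Toy stand-in for the lines of one corpus file (assumed data, not the real corpus).
toy_lines <- paste("line", 1:1000)
sampled <- sample_lines(toy_lines)
length(sampled)  # 100
```

Sampling without replacement keeps every retained line unique, and fixing the seed makes the training subset repeatable between runs.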
The model is a Markov chain that predicts the most likely next words from the previous two words. If the previous two words aren't available, the model backs off to the previous word alone. If no words are available, it suggests the most common words overall. When the input text is empty, the very first suggestion is based on the most common first words. The probability distributions were smoothed using Kneser-Ney smoothing, and the word suggestions are ranked from highest to lowest probability.
getWords("Love that film and haven't seen it in quite some")
[1] "time" "of" "peopl" "more" "other"
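The back-off order described above (two previous words, then one, then the most common words, with a separate list of first words for empty input) can be illustrated with a small sketch. It is not the app's `getWords` implementation: the names `build_model`, `top_words`, and `predict_next` are hypothetical, the corpus is a toy example, and ranking uses raw counts rather than Kneser-Ney smoothed probabilities.

```r
# Collect first words, unigrams, and next-word lists keyed by 1- and 2-word contexts.
build_model <- function(sentences) {
  tokens <- strsplit(tolower(trimws(sentences)), "\\s+")
  first <- uni <- bi_key <- bi_next <- tri_key <- tri_next <- character(0)
  for (tk in tokens) {
    n <- length(tk)
    if (n == 0) next
    first <- c(first, tk[1])
    uni <- c(uni, tk)
    if (n >= 2) {
      bi_key  <- c(bi_key, tk[-n])                  # previous word
      bi_next <- c(bi_next, tk[-1])                 # word that followed it
    }
    if (n >= 3) {
      tri_key  <- c(tri_key, paste(tk[1:(n - 2)], tk[2:(n - 1)]))  # previous two words
      tri_next <- c(tri_next, tk[3:n])
    }
  }
  list(first = first, uni = uni,
       bi = split(bi_next, bi_key), tri = split(tri_next, tri_key))
}

# Rank candidate words by raw count (a stand-in for smoothed probabilities).
top_words <- function(words, k) {
  head(names(sort(table(words), decreasing = TRUE)), k)
}

predict_next <- function(model, text, k = 5) {
  tk <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  tk <- tk[nzchar(tk)]
  n <- length(tk)
  if (n == 0) return(top_words(model$first, k))     # empty input: common first words
  if (n >= 2) {
    hit <- model$tri[[paste(tk[n - 1], tk[n])]]     # try the previous two words
    if (!is.null(hit)) return(top_words(hit, k))
  }
  hit <- model$bi[[tk[n]]]                          # back off to the previous word
  if (!is.null(hit)) return(top_words(hit, k))
  top_words(model$uni, k)                           # back off to common words overall
}

corpus <- c("i love that film", "i love that song",
            "i love that film too", "you love this film")
model <- build_model(corpus)
predict_next(model, "we love that", k = 2)  # "film" "song"
```

A full implementation would replace the raw counts in `top_words` with smoothed probabilities, so that unseen continuations still receive some probability mass.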
On a representative test set of 20 phrases, the model's top ten suggestions contained the correct next word in 35% of cases (7 of 20).
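A top-k accuracy figure of this kind could be computed along the following lines. This is a sketch under assumptions: `top_k_accuracy` and `toy_predict` are hypothetical names, the phrases are made up, and the stand-in predictor ignores its context, so the result says nothing about the real model.

```r
# Hypothetical evaluator: share of phrases whose final word appears in the top k
# suggestions produced from the preceding words.
top_k_accuracy <- function(predict_fn, phrases, k = 10) {
  hits <- vapply(phrases, function(p) {
    words <- strsplit(trimws(p), "\\s+")[[1]]
    n <- length(words)
    if (n < 2) return(NA)                           # need a context and a target word
    context <- paste(words[-n], collapse = " ")
    tolower(words[n]) %in% predict_fn(context, k)
  }, logical(1))
  mean(hits, na.rm = TRUE)
}

# Stand-in predictor (assumption): always returns the same ranked list.
toy_predict <- function(context, k) head(c("the", "time", "of", "and"), k)

phrases <- c("quite some time", "one of the", "see you tomorrow", "a lot of")
top_k_accuracy(toy_predict, phrases, k = 3)  # 0.75
```

Passing the predictor as a function keeps the evaluator independent of any particular model, so the same phrases can be reused to compare versions.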
I can improve the word suggestions by, for example, restoring full word forms after stemming (the example above returns the stem "peopl") and filtering out profanity.
I can improve the model by using a Long Short-Term Memory (LSTM) network, which can retain the context of a sentence going back more than a few words. A key limitation of the Markov n-gram model is that its suggestions are based on the previous two words at most.
This model should still be useful for helping users type faster than their normal pace.