Smart Keyboard: Prediction of the next word

Daigo Tanaka (@DaigoTanaka)
April 21, 2015

The first 2 inputs “The Obama” may be followed by the word “administration”

…It would save 14 keystrokes if we could select the word instead of typing it.

We can suggest such a word by applying machine learning to text from news, blogs, and Twitter.

Try the demo by clicking here!

Type some words in the Input box. Word predictions appear in the Candidates box.
When a candidate is available, type “#1” as a shortcut to enter the word.
Change the number in the settings to set how many candidates are predicted.
Extra! You can tweet to the author from the app.

Unigram, bigram, trigram, and 4-gram language models are built from the source text. The n-grams in each model are sorted by count.
Given an input of up to 3 words, find the n-gram with the highest count and calculate the conditional probability. When “backing off” to a lower-order n-gram, multiply the probability by 0.4 for each order dropped:

\[ S(w_{i}|w_{i-(n-1)},...,w_{i-1}) = 0.4^{4-n}\frac{count(w_{i-(n-1)},...,w_{i-1}, w_{i})}{count(w_{i-(n-1)},...,w_{i-1})} \]

Sort the predictions by probability and display them from highest to lowest.
The full description can be found here
The source code is available here
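
The counting and back-off steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the app's actual code: the names `build_counts`, `predict`, `DISCOUNT`, and `MAX_ORDER` are hypothetical, n in the formula is taken to be the order of the matched n-gram, and each word keeps the score from the highest-order n-gram in which it was seen.

```python
from collections import Counter

DISCOUNT = 0.4   # back-off factor from the formula above
MAX_ORDER = 4    # unigrams up to 4-grams


def build_counts(tokens, max_order=MAX_ORDER):
    """Count every n-gram of order 1..max_order in a list of tokens."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def predict(counts, words, k=3, max_order=MAX_ORDER):
    """Return up to k (word, score) pairs, highest score first."""
    # Only the last max_order - 1 words can serve as context.
    context = tuple(words[-(max_order - 1):])
    total_unigrams = sum(c for g, c in counts.items() if len(g) == 1)
    scores = {}
    # Try the longest context first, then back off to shorter ones.
    for start in range(len(context) + 1):
        ctx = context[start:]
        n = len(ctx) + 1                      # order of the n-gram being matched
        denom = counts[ctx] if ctx else total_unigrams
        if denom == 0:
            continue                          # context never seen: back off
        weight = DISCOUNT ** (max_order - n)  # 0.4^(4-n) in the formula
        for gram, c in counts.items():
            if len(gram) == n and gram[:-1] == ctx:
                # Keep the score from the highest order that saw this word.
                scores.setdefault(gram[-1], weight * c / denom)
    # Sort by score (descending), breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

For example, `predict(build_counts(text.split()), ["the", "obama"])` returns the top three candidates for a lower-cased training string `text`. The linear scan over all n-grams keeps the sketch short; a real keyboard would index n-grams by their context instead.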

Figure: The distribution of n-gram frequencies. Counts are shown on a log10 scale.

Observation: The majority of the 4-gram and trigram entries appear only once.
Removing these one-time n-grams and approximating them with the (n-1)-gram reduced the data from 1.2GB to 257MB and made the computation much faster.
A word-dictionary hash table was also used to reduce the memory footprint and computation time (vs. querying by strings).
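
The two optimizations above might look roughly like the following sketch. It assumes the n-gram table is a `Counter` keyed by tuples of words; the function names and the `min_order=3` cutoff (prune trigrams and 4-grams only) are illustrative assumptions rather than the author's code. A pruned n-gram is simply absent from the table, so a lookup falls back to the lower-order model.

```python
from collections import Counter


def prune_singletons(counts, min_order=3):
    """Drop n-grams of order >= min_order that were seen only once."""
    return Counter({g: c for g, c in counts.items()
                    if len(g) < min_order or c > 1})


def build_vocab(counts):
    """Map each distinct word to a small integer ID."""
    vocab = {}
    for gram in counts:
        for word in gram:
            # Assign the next free ID the first time a word is seen.
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(counts, vocab):
    """Re-key the n-gram table by tuples of word IDs instead of strings."""
    return {tuple(vocab[w] for w in g): c for g, c in counts.items()}
```

Keying the table by integer tuples means each word's string is stored once in `vocab` rather than repeated in every n-gram that contains it.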