Smart Keyboard: Prediction of the next word

Daigo Tanaka (@DaigoTanaka)
April 21, 2015

Type faster with machine learning!

Motivating example:

The first 2 inputs “The Obama” may be followed by the word “administration”

…It would save 14 keystrokes if we could select the word instead of typing.

We can suggest such a word by applying machine learning to text from news, blogs, and Twitter.

How does the app work?

Try the demo by clicking here! (Screenshot of the app)

  • Type some words in the Input box. A word prediction appears in the Candidates box.
  • When a candidate is available, type “#1” as the shortcut to enter the word.
  • In the settings, choose how many candidates to predict.
  • Extra! You can tweet to the author from the app.

Prediction algorithm

  • Unigram, bigram, trigram, and 4-gram language models are built from the source text. Within each model, the n-grams are sorted by count.
  • Given an input of up to 3 words, find the matching n-grams with the highest counts and calculate the conditional probability of the next word. Each time the model backs off to a lower-order n-gram, scale the probability by a factor of 0.4:

\[ S(w_{i}|w_{i-(n-1)},...,w_{i-1}) = 0.4^{4-n}\frac{count(w_{i-(n-1)},...,w_{i-1}, w_{i})}{count(w_{i-(n-1)},...,w_{i-1})} \]

  • Sort the predictions by this discounted probability and display them from highest to lowest.
  • Full description can be found here
  • Source code is available here
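
The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the app's actual source: the names `build_counts` and `predict`, the tuple-keyed count table, and the use of the total unigram count as the denominator for n = 1 are all assumptions. It also scans the whole table on every lookup, which a real implementation would replace with an index.

```python
from collections import Counter

BACKOFF = 0.4   # discount factor from the formula above
MAX_N = 4       # highest n-gram order in the model


def build_counts(tokens, max_n=MAX_N):
    """Count every n-gram of order 1..max_n; keys are tuples of words."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def predict(counts, words, k=3, max_n=MAX_N):
    """Return up to k (word, score) pairs, best score first."""
    # Assumed denominator for unigrams: the total number of tokens.
    total = sum(c for gram, c in counts.items() if len(gram) == 1)
    scores = {}
    context = tuple(words[-(max_n - 1):])  # at most the last 3 words
    # Try the longest available context first, then back off.
    for n in range(min(len(context) + 1, max_n), 0, -1):
        ctx = context[len(context) - (n - 1):] if n > 1 else ()
        denom = counts[ctx] if ctx else total
        if denom == 0:
            continue  # context never seen: back off to a lower order
        for gram, c in counts.items():
            if len(gram) == n and gram[:-1] == ctx:
                # Keep the score from the highest order that saw the word.
                scores.setdefault(gram[-1], BACKOFF ** (max_n - n) * c / denom)
    # Sort by score (descending); ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

With a toy corpus in which “the obama” is followed by “administration” twice and “family” once, `predict(counts, ["the", "obama"])` ranks “administration” first, with a score of 0.4 × 2/3 because the match comes from the trigram model (n = 3).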

How I made it lighter and faster

Figure: The distribution of n-gram frequency. The count is on a log10 scale.

  • Observation: The majority of 4-gram and trigram entries appear only once.
  • Removing these one-time n-grams and approximating them with the (n-1)-gram model reduced the data from 1.2GB to 257MB and made the computation much faster.
  • A hash table that maps each word to a dictionary ID also reduced the memory footprint and computation time (vs. querying by strings).
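
The two optimizations above can be sketched as follows. This is an illustration under stated assumptions rather than the app's code: the function names are hypothetical, the count table is assumed to be keyed by tuples of words, and pruning is assumed to apply only to trigrams and 4-grams (`min_order=3`), since the slides do not say which orders were pruned.

```python
from collections import Counter


def prune_singletons(counts, min_order=3):
    """Drop n-grams of order >= min_order that were seen only once."""
    return Counter({gram: c for gram, c in counts.items()
                    if len(gram) < min_order or c > 1})


def build_vocab(counts):
    """Map each distinct word to a small integer ID (the word dictionary)."""
    vocab = {}
    for gram in counts:
        for word in gram:
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(counts, vocab):
    """Re-key the count table by tuples of word IDs instead of strings."""
    return {tuple(vocab[w] for w in gram): c for gram, c in counts.items()}
```

After pruning, a lookup for a removed 4-gram or trigram simply misses, so a backoff-style lookup falls through to the (n-1)-gram, which is the approximation described above. Encoding words as integers means each string is stored once in the dictionary, and the n-gram keys become short tuples of IDs.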