A typeahead prediction model suggests words as a user types to improve typing efficiency on constrained mobile devices. This is akin to products such as Swiftkey which is available for Android and iOS mobile devices.
Data Preparation
Aggregated text corpus of over 4.2 million lines from diverse sources including Twitter, News, and Blogs
Randomly assigned each line to a training or test set
Split lines into sentences and marked these boundaries
Trimmed excess whitespace
Kept only Latin alphanumerics, whitespace and ' (for contractions)
Transformed all letters to lower case
Replaced all numbers with a single numeric identifier; 1, 2900, 839929, etc replaced by '###'
Works well for predicting a phrase like '353 million', but not for a common expression like 'the 1980s'
N-Gram Model
Conditions the probability of the next word based on the previous 'N-1' words (context)
Phrase: 'We love you'
Context: 'We love'
Next Word: 'You'
The length of the context depends on 'N'
2-gram models consider the previous word only
3-gram model considers the previous 2 words
Probability of a phrase is the number of times the phrase occurred divided by the number of times the context occurred
p(We Love You) = #(We Love You) / #(We Love)
Katz Back-off
The language model combines a unigram, bigram, and trigram model. Why?
Models where 'N' is larger…
Account for greater context and tend to be more accurate
But, more frequently encounter n-grams not in the training data and thus have no basis to make a prediction
Using a Katz Back-off model balances the advantages and disadvantages of each
First consult the larger 'N' models (trigram) and only consult the lower order models, as needed
Application
The application provides a text box that makes suggestions to the user as she types
Multiple model diagnostics are updated in real-time
Figure 1: Top 5 suggestions
Figure 2: Cumulative accuracy for the phrase
Figure 3: Cumulative accuracy of each N-gram model
Figure 4: Cumulative suggestions for the phrase
By pressing the 'Random' button, an entire sentence is generated based on the model's understanding of the language