Type Ahead Prediction Model

Nick Allen - December 2014

A typeahead prediction model suggests words as a user types, improving typing efficiency on constrained mobile devices. It is akin to products such as SwiftKey, which is available for Android and iOS.

Data Preparation

Aggregated a text corpus of over 4.2 million lines from diverse sources including Twitter, news, and blogs
Randomly assigned each line to a training or test set
Split lines into sentences and marked these boundaries
Trimmed excess whitespace
Kept only Latin alphanumerics, whitespace and ' (for contractions)
Transformed all letters to lower case
Replaced every number with a single placeholder token; 1, 2900, 839929, etc. all become '###'
- This works well for predicting a phrase like '353 million', but not for a common expression like 'the 1980s'
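
The cleaning steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the author's pipeline: the function names, the naive sentence splitter, and the choice to treat '2,900' as a single number are all assumptions.

```python
import re

NUMBER_TOKEN = "###"  # the placeholder named in the list above

def split_sentences(line):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", line.strip()) if s]

def clean_sentence(sentence):
    text = sentence.lower()
    # Drop literal '#' first so hashtags are not confused with the placeholder
    text = text.replace("#", " ")
    # Replace each number, including forms like 2,900 or 3.5, with the placeholder
    text = re.sub(r"\d+(?:[.,]\d+)*", NUMBER_TOKEN, text)
    # Keep only Latin letters, whitespace, apostrophes, and the placeholder
    text = re.sub(r"[^a-z#'\s]", " ", text)
    # Trim and collapse excess whitespace
    return " ".join(text.split())

def prepare_line(line):
    """Return the cleaned sentences found in one raw line of the corpus."""
    cleaned = (clean_sentence(s) for s in split_sentences(line))
    return [c for c in cleaned if c]

print(prepare_line("We raised 353 million in 2,900 days! It's GREAT."))
# ['we raised ### million in ### days', "it's great"]
```

Each returned list element is one sentence, so the sentence boundaries are kept implicitly by the list structure.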

N-Gram Model

Conditions the probability of the next word on the previous 'N-1' words (the context)
- Phrase: 'We love you'
- Context: 'We love'
- Next word: 'you'
The length of the context depends on 'N'
- A 2-gram model considers the previous word only
- A 3-gram model considers the previous 2 words
The probability of the next word given its context is the number of times the full phrase occurred divided by the number of times the context occurred
- p(you | We love) = #(We love you) / #(We love)
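
The count ratio above can be computed directly from n-gram counts. The following is a minimal sketch on a made-up four-sentence corpus; the helper names (`count_ngrams`, `next_word_probability`) are hypothetical and not taken from the original project.

```python
from collections import Counter

def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(sentences, n):
    """Count every n-gram across a list of already-cleaned sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))
    return counts

def next_word_probability(word, context, counts_n, counts_context):
    """#(context + word) / #(context); 0.0 when the context was never seen."""
    context = tuple(context)
    context_count = counts_context[context]
    if context_count == 0:
        return 0.0
    return counts_n[context + (word,)] / context_count

# Toy corpus (illustrative data only)
sentences = ["we love you", "we love pizza", "we love you too", "they love you"]
trigrams = count_ngrams(sentences, 3)
bigrams = count_ngrams(sentences, 2)

# 'we love' occurs 3 times and 'we love you' occurs 2 times
p = next_word_probability("you", ("we", "love"), trigrams, bigrams)
print(round(p, 3))  # 0.667
```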

Katz Back-off

The language model combines unigram, bigram, and trigram models. Why?
Models where 'N' is larger…
- Account for greater context and tend to be more accurate
- But more frequently encounter contexts that are not in the training data, leaving no basis for a prediction
Using a Katz Back-off model balances the advantages and disadvantages of each
- First consult the highest-order model (trigram), and fall back to the lower-order models only as needed
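
The lookup order described above can be sketched as follows. Note that this is a simplified back-off, not full Katz back-off: a complete Katz model also discounts the observed counts and applies back-off weights so the probabilities still sum to one, and both are omitted here. The function names and the toy corpus are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=3):
    """For each order n, map a context tuple to a Counter of following words."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables

def suggest(tables, typed, k=5, max_n=3):
    """Return (order used, top-k words), trying the highest order first."""
    tokens = typed.lower().split()
    for n in range(max_n, 0, -1):
        if len(tokens) < n - 1:
            continue  # not enough typed words to form this context
        context = tuple(tokens[len(tokens) - (n - 1):])
        followers = tables[n].get(context)
        if followers:
            return n, [word for word, _ in followers.most_common(k)]
    return 0, []

# Toy corpus (illustrative data only)
sentences = ["we love you", "we love pizza", "we love you too", "they love you"]
tables = build_tables(sentences)

print(suggest(tables, "we love"))    # (3, ['you', 'pizza'])  trigram context seen
print(suggest(tables, "cats love"))  # (2, ['you', 'pizza'])  backs off to bigram
```

An input whose last word was never seen falls through to the unigram table, which simply returns the most frequent words overall.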

Application

The application provides a text box that makes suggestions to the user as she types
Multiple model diagnostics are updated in real time
- Figure 1: Top 5 suggestions
- Figure 2: Cumulative accuracy for the phrase
- Figure 3: Cumulative accuracy of each N-gram model
- Figure 4: Cumulative suggestions for the phrase
Pressing the 'Random' button generates an entire sentence based on the model's understanding of the language
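
One plausible way to generate such a sentence is to sample each next word in proportion to its trigram count until an end-of-sentence marker is drawn. The source does not describe how the 'Random' button works, so the sketch below is an assumption throughout, including the `<s>` and `</s>` boundary markers and the function names.

```python
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence-boundary markers

def build_trigram_table(sentences):
    """Map each two-word context to a Counter of the words that followed it."""
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = [START, START] + sentence.split() + [END]
        for i in range(2, len(tokens)):
            table[(tokens[i - 2], tokens[i - 1])][tokens[i]] += 1
    return table

def generate_sentence(table, rng, max_words=20):
    """Sample words one at a time, weighted by how often each followed the context."""
    words = [START, START]
    while len(words) - 2 < max_words:
        followers = table.get(tuple(words[-2:]))
        if not followers:
            break  # only reached when the table is empty
        choices, weights = zip(*followers.items())
        word = rng.choices(choices, weights=weights)[0]
        if word == END:
            break
        words.append(word)
    return " ".join(words[2:])

# Toy corpus (illustrative data only)
sentences = ["we love you", "we love pizza", "they love you too"]
table = build_trigram_table(sentences)
print(generate_sentence(table, random.Random(0)))
```

Passing a seeded `random.Random` keeps a run reproducible; the generated sentence can recombine training fragments, for example 'we love you too', which never appears verbatim in the toy corpus.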
