Type Ahead Prediction Model

Nick Allen - December 2014

A typeahead prediction model suggests words as a user types, improving typing efficiency on constrained mobile devices. It is akin to products such as SwiftKey, which is available for Android and iOS mobile devices.

Data Preparation

  • Aggregated text corpus of over 4.2 million lines from diverse sources including Twitter, News, and Blogs
  • Randomly assigned each line to a training or test set
  • Split lines into sentences and marked these boundaries
  • Trimmed excess whitespace
  • Kept only Latin alphanumerics, whitespace and ' (for contractions)
  • Transformed all letters to lower case
  • Replaced every number with a single numeric identifier: 1, 2900, 839929, etc. all become '###'
    • Works well for predicting a phrase like '353 million', but not for a common expression like 'the 1980s'
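
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the original preparation code: the function names are hypothetical, and splitting sentences on '.', '!' and '?' is a simplifying assumption.

```python
import re

NUM_TOKEN = "###"  # the single numeric identifier described above


def clean_sentence(text: str) -> str:
    """Normalize one sentence: lower case, restricted character set, '###' for numbers."""
    text = text.lower()
    # Keep only Latin alphanumerics, whitespace and apostrophes (for contractions).
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    # Replace every run of digits with the numeric identifier.
    text = re.sub(r"[0-9]+", NUM_TOKEN, text)
    # Collapse repeated whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(line: str) -> list[str]:
    """Split a line on '.', '!' or '?' (an assumption) and clean each sentence."""
    parts = re.split(r"[.!?]+", line)
    cleaned = (clean_sentence(part) for part in parts)
    return [sentence for sentence in cleaned if sentence]


print(clean_sentence("  It's  353 Million!! "))
# it's ### million
print(split_sentences("Hello there. The 1980s were GREAT!"))
# ['hello there', 'the ###s were great']
```

The second call shows the limitation noted above: 'the 1980s' turns into 'the ###s', which is unlikely to be a useful token.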

N-Gram Model

  • Conditions the probability of the next word based on the previous 'N-1' words (context)
    • Phrase: 'We love you'
    • Context: 'We love'
    • Next Word: 'you'
  • The length of the context depends on 'N'
    • A 2-gram model considers the previous word only
    • A 3-gram model considers the previous 2 words
  • Probability of the next word given its context is the number of times the full phrase occurred divided by the number of times the context occurred
    • p(you | we love) = #(we love you) / #(we love)
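
As a rough sketch of this count-based estimate, the snippet below counts n-grams in a toy corpus and divides the phrase count by the context count. The corpus and the function names are made up for illustration and are not taken from the project.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count every run of n consecutive words across the given sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def next_word_probability(context, word, phrase_counts, context_counts):
    """Estimate p(word | context) as #(context + word) / #(context)."""
    context = tuple(context)
    seen = context_counts[context]
    if seen == 0:
        return 0.0  # context never observed, so there is no basis for an estimate
    return phrase_counts[context + (word,)] / seen


# Toy corpus (illustrative only), already cleaned and lower-cased.
corpus = ["we love you", "we love pizza", "we love you too"]
trigrams = ngram_counts(corpus, 3)
bigrams = ngram_counts(corpus, 2)

p = next_word_probability(["we", "love"], "you", trigrams, bigrams)
print(round(p, 3))
# 0.667, because 'we love you' occurs twice and 'we love' three times
```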

Katz Back-off

  • The language model combines a unigram, bigram, and trigram model. Why?
  • Models where 'N' is larger…
    • Account for greater context and tend to be more accurate
    • But more frequently encounter contexts that never appeared in the training data, and thus have no basis to make a prediction
  • Using a Katz Back-off model balances the advantages and disadvantages of each
    • First consult the highest-order model (trigram) and fall back to the lower-order models (bigram, then unigram) only as needed
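
The back-off order described above can be illustrated with the simplified sketch below. Note the assumption: it only shows the lookup order (trigram, then bigram, then unigram) and ranks by raw counts. It omits the discounting and back-off weights that a full Katz back-off model uses to redistribute probability mass, and the function names and toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def build_models(sentences, max_n=3):
    """For each n in 1..max_n, map a context tuple to a Counter of next words."""
    models = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])  # empty tuple for the unigram model
                models[n][context][tokens[i + n - 1]] += 1
    return models


def suggest(models, typed, k=5):
    """Return up to k suggestions, consulting the highest-order model first."""
    tokens = typed.lower().split()
    suggestions = []
    for n in sorted(models, reverse=True):
        if n - 1 > len(tokens):
            continue  # not enough typed words to form this model's context
        context = tuple(tokens[len(tokens) - (n - 1):])
        for word, _count in models[n].get(context, Counter()).most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions


# Toy corpus (illustrative only).
corpus = ["we love you", "we love pizza", "we love you too", "i love music"]
models = build_models(corpus)

print(suggest(models, "we love", k=3))
# ['you', 'pizza', 'music']  (the first two from the trigram model, the third from the bigram model)
print(suggest(models, "they love", k=2))
# ['you', 'pizza']  (the trigram context is unseen, so the bigram model answers)
```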

Application

  • The application provides a text box that makes suggestions to the user as she types
  • Multiple model diagnostics are updated in real-time
    • Figure 1: Top 5 suggestions
    • Figure 2: Cumulative accuracy for the phrase
    • Figure 3: Cumulative accuracy of each N-gram model
    • Figure 4: Cumulative suggestions for the phrase
  • Pressing the 'Random' button generates an entire sentence based on the model's understanding of the language
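
One way such a random sentence could be produced is by repeatedly sampling the next word in proportion to how often it followed the current word. The sketch below is an assumption about the general approach, not the application's actual code: it uses a bigram model only, and the boundary markers '<s>' and '</s>' are hypothetical stand-ins for the sentence boundaries marked during data preparation.

```python
import random
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence boundary markers


def build_bigrams(sentences):
    """Map each word (or START) to a Counter of the words that followed it."""
    following = defaultdict(Counter)
    for sentence in sentences:
        tokens = [START] + sentence.split() + [END]
        for previous, current in zip(tokens, tokens[1:]):
            following[previous][current] += 1
    return following


def random_sentence(following, rng, max_words=20):
    """Sample words one at a time until END is drawn or max_words is reached."""
    words, current = [], START
    while len(words) < max_words:
        candidates = following.get(current)
        if not candidates:
            break  # nothing ever followed this word in the training data
        current = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        if current == END:
            break
        words.append(current)
    return " ".join(words)


# Toy corpus (illustrative only).
corpus = ["we love you", "we love pizza", "i love music"]
bigram_model = build_bigrams(corpus)
print(random_sentence(bigram_model, random.Random(2014)))
```

Passing a seeded random.Random instance keeps a run reproducible; the generated sentence depends on the seed and the corpus.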

Resources