Type Ahead Prediction Model
Nick Allen
December 2014
Data Preparation
- Merge the Twitter, News, and Blog data
- Randomly assign each line to a training or test set
- Split lines into sentences and mark sentence boundaries
- Keep only letters, numbers, whitespace, and apostrophes (for contractions)
- Lowercase all letters
- Replace all numbers with a single numeric identifier (###)
- Works well for phrases like '353 million' or '292 billion', but not for frequent numeric expressions like 'the 1980s', which becomes '###s'
- Trim excess whitespace (see the cleaning sketch after this list)
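The steps above might be implemented along these lines. A minimal Python sketch, assuming '###' as the numeric placeholder; the regexes and the function name are illustrative, not the project's actual code:

    import re

    def clean_line(line):
        """Normalize one raw line for n-gram training (illustrative sketch)."""
        text = line.lower()                        # lowercase all letters
        text = re.sub(r"[^a-z0-9\s']", " ", text)  # keep letters, numbers, whitespace, and '
        text = re.sub(r"[0-9]+", "###", text)      # replace every number with ###
        return re.sub(r"\s+", " ", text).strip()   # trim excess whitespace

    print(clean_line("They raised $353 Million in the 1980s!"))
    # -> "they raised ### million in the ###s"

Note the last token: '1980s' becomes '###s', the failure case mentioned above.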
N-Gram Model
- An N-gram model conditions the probability of a word on its context
- The context is the previous 'n-1' words
- A bigram model considers the previous word
- A trigram model considers the previous 2 words
- A 4-gram model considers the previous 3 words
- The conditional probability of a word is simply the number of times the full phrase (context plus word) was seen in the training data divided by the number of times the context alone was seen (see the counting sketch below)
- P(love | we) = count(we love) / count(we)
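As a concrete illustration, a toy bigram trainer in Python; the sample sentences and helper names are hypothetical, but the probability is exactly the count ratio above:

    from collections import Counter

    def train_bigram(sentences):
        """Count unigrams and bigrams from tokenized sentences."""
        unigrams, bigrams = Counter(), Counter()
        for tokens in sentences:
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
        return unigrams, bigrams

    def prob(word, context, unigrams, bigrams):
        """P(word | context) = count(context word) / count(context)."""
        return bigrams[(context, word)] / unigrams[context] if unigrams[context] else 0.0

    sents = [["we", "love", "data"], ["we", "love", "coffee"], ["we", "hate", "bugs"]]
    uni, bi = train_bigram(sents)
    print(prob("love", "we", uni, bi))  # count(we love) / count(we) = 2/3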
Multi-Level Model
- The model combines a unigram, bigram, and trigram model
- Larger 'N' models…
- Take into account a greater context and tend to be more accurate
- Have a much larger number of possible n-grams, so they more often encounter phrases absent from the training data and thus have no basis for a prediction
- A multi-level model balances the advantages and disadvantages of each
- First consult the largest 'N' model (the trigram) and back off to the lower-order models only as needed (see the sketch after this list)
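One simple way to realize this back-off order in Python; a sketch assuming each model maps a context tuple to a Counter of next-word counts (the data layout and function name are illustrative, not the author's implementation):

    from collections import Counter

    def suggest(context, models, k=5):
        """Return up to k next-word suggestions, consulting the highest-order model first."""
        # models is ordered [trigram, bigram, unigram]; each maps a context
        # tuple (empty for the unigram model) to a Counter of next words
        for order, model in zip((3, 2, 1), models):
            ctx = tuple(context[-(order - 1):]) if order > 1 else ()
            counts = model.get(ctx)
            if counts:  # context seen in training data: predict from this level
                return [w for w, _ in counts.most_common(k)]
        return []

    trigram = {("we", "love"): Counter({"data": 3, "coffee": 1})}
    bigram = {("love",): Counter({"data": 4, "coffee": 2, "bugs": 1})}
    unigram = {(): Counter({"the": 10, "we": 5})}
    print(suggest(["we", "love"], [trigram, bigram, unigram]))  # ['data', 'coffee']

This sketch discards lower-order counts whenever a higher-order context is found; schemes such as stupid backoff or interpolation instead blend the levels with weights.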
Application
- The application provides a text box for typing
- Predictions update automatically as the user types
- The most likely next word is shown directly under the text entry box
- Figure 1: The model's top 5 suggestions, along with the associated probability of each
- Figure 2: The model's cumulative accuracy in predicting each sub-phrase
- Figure 3: The cumulative suggestions of the model for the entire phrase
- Nodes marked with a caret (^) contain a component of the phrase, and connected nodes contain the model's suggestions for that component