Type Ahead Prediction Model
Nick Allen
December 2014
Data Preparation
- Merge the Twitter, News, and Blog data
- Randomly assign each line to a training or test set
- Split lines into sentences and mark sentence boundaries
- Keep only letters, numbers, whitespace, and apostrophes (for contractions)
- Lowercase all letters
- Replace all numbers with a single numeric identifier (###)
- Works well for phrases like '353 million' or '292 billion', but not for frequent numeric expressions like 'the 1980s', which becomes '###s'
- Trim excess whitespace (see the cleaning sketch after this list)
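The steps above might be implemented along these lines. A minimal Python sketch, assuming '###' as the numeric placeholder; the regexes and the function name are illustrative, not the project's actual code:

    import re

    def clean_line(line):
        """Normalize one raw line for n-gram training (illustrative sketch)."""
        text = line.lower()                        # lowercase all letters
        text = re.sub(r"[^a-z0-9\s']", " ", text)  # keep letters, numbers, whitespace, and '
        text = re.sub(r"[0-9]+", "###", text)      # replace every number with ###
        return re.sub(r"\s+", " ", text).strip()   # trim excess whitespace

    print(clean_line("They raised $353 Million in the 1980s!"))
    # -> "they raised ### million in the ###s"

Note the last token: '1980s' becomes '###s', the failure case mentioned above.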
N-Gram Model
- An N-gram model conditions the probability of a word on its context
- The context is the previous 'n-1' words
- A bigram model considers the previous word
- A trigram model considers the previous 2 words
- A 4-gram model considers the previous 3 words
- The conditional probability of a word is simply the number of times the full phrase (context plus word) was seen in the training data divided by the number of times the context alone was seen (see the counting sketch below)
- P(love | we) = count(we love) / count(we)
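As a concrete illustration, a toy bigram trainer in Python; the sample sentences and helper names are hypothetical, but the probability is exactly the count ratio above:

    from collections import Counter

    def train_bigram(sentences):
        """Count unigrams and bigrams from tokenized sentences."""
        unigrams, bigrams = Counter(), Counter()
        for tokens in sentences:
            unigrams.update(tokens)
            bigrams.update(zip(tokens, tokens[1:]))
        return unigrams, bigrams

    def prob(word, context, unigrams, bigrams):
        """P(word | context) = count(context word) / count(context)."""
        return bigrams[(context, word)] / unigrams[context] if unigrams[context] else 0.0

    sents = [["we", "love", "data"], ["we", "love", "coffee"], ["we", "hate", "bugs"]]
    uni, bi = train_bigram(sents)
    print(prob("love", "we", uni, bi))  # count(we love) / count(we) = 2/3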
Multi-Level Model
- The model combines a unigram, bigram, and trigram model
- Larger 'N' models…
- Take into account a greater context and tend to be more accurate
- Have a much larger number of possible n-grams, so they more often encounter phrases absent from the training data and thus have no basis for a prediction
- A multi-level model balances the advantages and disadvantages of each
- First consult the largest 'N' model (the trigram) and back off to the lower-order models only as needed (see the sketch after this list)
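One simple way to realize this back-off order in Python; a sketch assuming each model maps a context tuple to a Counter of next-word counts (the data layout and function name are illustrative, not the author's implementation):

    from collections import Counter

    def suggest(context, models, k=5):
        """Return up to k next-word suggestions, consulting the highest-order model first."""
        # models is ordered [trigram, bigram, unigram]; each maps a context
        # tuple (empty for the unigram model) to a Counter of next words
        for order, model in zip((3, 2, 1), models):
            ctx = tuple(context[-(order - 1):]) if order > 1 else ()
            counts = model.get(ctx)
            if counts:  # context seen in training data: predict from this level
                return [w for w, _ in counts.most_common(k)]
        return []

    trigram = {("we", "love"): Counter({"data": 3, "coffee": 1})}
    bigram = {("love",): Counter({"data": 4, "coffee": 2, "bugs": 1})}
    unigram = {(): Counter({"the": 10, "we": 5})}
    print(suggest(["we", "love"], [trigram, bigram, unigram]))  # ['data', 'coffee']

This sketch discards lower-order counts whenever a higher-order context is found; schemes such as stupid backoff or interpolation instead blend the levels with weights.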
Application
- The application provides a text box for typing
- Predictions update automatically as the user types
- The most likely next word is shown directly under the text entry box
- Figure 1: The model's top 5 suggestions, along with the associated probability of each
- Figure 2: The model's cumulative accuracy in predicting each sub-phrase
- Figure 3: The cumulative suggestions of the model for the entire phrase
- Nodes marked with a caret (^) contain a component of the phrase, and connected nodes contain the model's suggestions for that component