Predictive Text App Powered by an N-Gram Language Model
Trained on 4.3M lines of blogs, Twitter posts, and news articles
Vocabulary truncated to the 20,000 most frequent words
Predictive model constructed from the frequencies of (counting sketch below):
Single words encountered in the corpus (uni-grams)
Pairs of words (bi-grams)
Triplets of words (tri-grams)
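A minimal counting sketch, not the app's actual code; the tokenized-sentence input format and names here are assumptions:

```python
from collections import Counter

def count_ngrams(sentences):
    """Build uni-, bi- and tri-gram frequency tables from tokenized sentences."""
    unigrams, bigrams, trigrams = Counter(), Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)                               # single words
        bigrams.update(zip(tokens, tokens[1:]))               # word pairs
        trigrams.update(zip(tokens, tokens[1:], tokens[2:]))  # word triplets
    return unigrams, bigrams, trigrams

# e.g. count_ngrams([["i", "love", "new", "york", "STOP"]])
```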
Discounting
The discounting algorithm combines the tri-, bi- and uni-gram frequencies to calculate the probability of the next word.
Given two context words, P(next word) is calculated from the observed tri-gram frequencies, discounted by a value between 0 and 1.
The discount makes the calculated probabilities sum to < 1; the “missing” probability mass is denoted \( \alpha_1 \).
A second pass uses a single context word and the bi-gram frequencies, again discounted and scaled by \( \alpha_1 \). The resulting probability mass is < \( \alpha_1 \), with the remainder denoted \( \alpha_2 \).
Finally, the baseline uni-gram frequencies are scaled by \( \alpha_2 \) (no discount).
The combined probabilities sum to 1, and the top n words are offered to the user (see the back-off sketch below).
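A sketch of the back-off calculation described above, assuming the frequency tables from the counting sketch and a single fixed discount; the app's actual data structures and discount values may differ:

```python
def backoff_probs(w1, w2, unigrams, bigrams, trigrams, discount=0.5):
    """Discounted back-off over tri-, bi- and uni-gram frequencies."""
    probs = {}

    # Pass 1: discounted tri-gram probabilities for the context (w1, w2).
    tri_total = sum(c for (a, b, _), c in trigrams.items() if (a, b) == (w1, w2))
    if tri_total:
        for (a, b, w), c in trigrams.items():
            if (a, b) == (w1, w2):
                probs[w] = (c - discount) / tri_total
    alpha_1 = 1.0 - sum(probs.values())           # "missing" mass after pass 1

    # Pass 2: discounted bi-gram probabilities for context w2, scaled by alpha_1,
    # for words not already covered by an observed tri-gram.
    bi_total = sum(c for (a, _), c in bigrams.items() if a == w2)
    bi_probs = {}
    if bi_total:
        for (a, w), c in bigrams.items():
            if a == w2 and w not in probs:
                bi_probs[w] = alpha_1 * (c - discount) / bi_total
    alpha_2 = alpha_1 - sum(bi_probs.values())    # mass still unassigned
    probs.update(bi_probs)

    # Pass 3: baseline uni-gram frequencies (no discount), scaled by alpha_2,
    # over the words not yet covered, so the combined probabilities sum to 1.
    rest = {w: c for w, c in unigrams.items() if w not in probs}
    rest_total = sum(rest.values())
    if rest_total:
        for w, c in rest.items():
            probs[w] = alpha_2 * c / rest_total

    return probs  # take the top-n keys as the suggestions shown to the user
```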
Unknown Words and the Stop Token
Training set words outside the 20,000 most frequent are converted to a special unknown token (UNK).
The UNK token will never be offered as a prediction, but will be used as a context word when the previous one or two words typed by the user fall outside the vocabulary.
The end of each sentence is converted to a single STOP token; in predictive mode, the STOP token is never offered.
The STOP token is used to calculate Perplexity when evaluating model performance (a preprocessing sketch follows below).
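An illustrative preprocessing sketch; the function and variable names are assumptions, not the app's code:

```python
def preprocess(sentence, vocab):
    """Replace out-of-vocabulary words with UNK and terminate with STOP."""
    tokens = [w if w in vocab else "UNK" for w in sentence.lower().split()]
    return tokens + ["STOP"]

# e.g. preprocess("I love supercalifragilistic pizza", {"i", "love", "pizza"})
# -> ['i', 'love', 'UNK', 'pizza', 'STOP']
```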
Perplexity - Measuring Performance
Performance is evaluated on a test set using the standard measure, Perplexity; a lower score is better.
Perplexity \( = 2^{-l} \)
where \( l = \frac{1}{M} \sum \log_2 P(\text{sentence}) \)
\( M = \) # words in the test set
\( P(\text{sentence}) = \prod P(\text{each word} \mid \text{previous 0, 1 or 2 words}) \)
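A minimal sketch of the perplexity calculation, assuming a hypothetical sentence_prob(tokens) helper that returns P(sentence) under the back-off model; log base 2 matches the \( 2^{-l} \) definition:

```python
import math

def perplexity(test_sentences, sentence_prob):
    """Perplexity = 2^(-l), with l the average log2 sentence probability per word."""
    M = sum(len(tokens) for tokens in test_sentences)  # words in the test set
    log_prob = sum(math.log2(sentence_prob(tokens)) for tokens in test_sentences)
    return 2 ** (-log_prob / M)
```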