ExtraFingers

Predictive Text App Powered by an N-Gram Language Model

  • Trained on 4.3M lines of blogs, twitter posts, and news articles
  • Vocabulary truncated to 20,000 most frequent words
  • Predictive model constructed from frequencies of:
    • Single words encountered in corpus (uni-grams)
    • Pairs of words (bi-grams)
    • Triplets of words (tri-grams)

Discounting

  • The discounting algorithm combines the tri-, bi- and uni-gram frequencies to calculate the probability of the next word.
  • Given 2 words, P(next word) calculated using observed tri-gram frequencies, discounted by value between 0 and 1.
  • The discount causes the calculated probabilities to sum to < 1, with the “missing” probability mass denoted alpha_1.
  • 2nd pass calculation uses single context word and the bi-gram frequencies, (again discounted) scaled by alpha_1. Total probability mass < alpha_1, with remaining mass alpha_2.
  • Finally, the baseline uni-gram frequencies are scaled by alpha_2 (no discount).
  • All the above probabilities are combined and sum to 1. Top n words are offered to the user.

Unknown Words and the Stop Token

  • Training set words outside the 20,000 most frequent are converted to a special unknown token (UNK).
  • The UNK token will never be offered as a prediction, but will be used as a context word when the previous one or two words typed by the user fall outside the vocabulary.
  • The end of each sentence is converted to a single STOP token. In predictive mode, the stop token will not be offered.
  • The STOP token is used to calculate Perplexity when evaluating model performance.

Perplexity - Measuring Performance

  • A test set is used to evaluate performance, using the standard measure, Perplexity, where a lower score is better.
  • Perplexity \( ~ = 2^{(-l)} \)
    • where \( ~l = 1/M * \sum log(P(sentence))) \)
    • \( M =~ \) # words in test set
    • P(sentence) = \( \prod \) P(each word | previous 0, 1 or 2 words)
  • Minimum Perplexity Achieved: 208.5

  • For more information, see Arnaldo Pedro Figueira Figueira's NLP lectures from Columbia University.

The Extra Fingers App

  • automatic refocus onto text input after selection
  • configurable number of predictions based on 0, 1 or 2 words
  • click here to go to app Screen Shot