Text Prediction Algorithm

Alan Whitelock-Jones
22 Jan 2016

Coursera Data Science Capstone

Purpose

The prediction algorithm predicts the most likely next words and presents them as a list of buttons to choose from.

  • Speeds up typing
  • Number of predicted words is configurable (the demo uses 10)
  • Derived from a large training set
  • Can include new text and adapt to the user's favorite words

Efficiency

The aim is to maximise the benefit of the predictions while keeping memory use low and response time short.

  • Small memory footprint
  • Quick to recalculate new words
  • The recalculation does not prevent the user from typing

Algorithm

The application has a vocabulary of the 5000 most common words (taken from the training set, excluding profanities) and predicts the next word based on up to 9 previous words typed in the phrase.

  • After a sentence ender (. ? !) the next word is predicted from the 10 most common sentence starters.
  • Each word in the dictionary has a default case (Upper, Proper or Lower).
  • The prediction gives priority to the longest n-gram that fits the phrase typed so far, falling back to a shorter n-gram if there is no match, all the way down to single words.
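The fallback from longer to shorter n-grams can be sketched as below. This is a minimal illustration, not the application's actual code: the lookup structure (a named list keyed by the preceding words), the toy data, and the names `predict.next`, `ngrams`, `unigrams` and `sentence.starters` are all assumptions made for the example.

```r
# Hypothetical n-gram store: each name is a context (the preceding words joined
# by spaces); each value lists candidate next words, most frequent first.
ngrams <- list(
  "thank you very" = c("much"),
  "you very"       = c("much", "well"),
  "very"           = c("much", "good", "well")
)
unigrams          <- c("the", "to", "and")   # toy single-word fallback
sentence.starters <- c("I", "The", "It")     # toy stand-in for the top 10 starters

predict.next <- function(phrase, ngrams, unigrams, starters,
                         max.context = 9, size.predicted = 10) {
  phrase <- trimws(phrase)
  # An empty phrase or a sentence ender: predict from the sentence starters.
  if (phrase == "" || grepl("[.?!]$", phrase)) {
    return(head(starters, size.predicted))
  }
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  words <- tail(words, max.context)          # use at most 9 previous words
  # Try the longest context first, then drop the earliest word and retry.
  for (k in seq(length(words), 1)) {
    key <- paste(tail(words, k), collapse = " ")
    if (key %in% names(ngrams)) {
      return(head(ngrams[[key]], size.predicted))
    }
  }
  # No n-gram matched: fall back to the most common single words.
  head(unigrams, size.predicted)
}

predict.next("I want to thank you very", ngrams, unigrams, sentence.starters)
predict.next("It is very", ngrams, unigrams, sentence.starters)
predict.next("All done.", ngrams, unigrams, sentence.starters)
```

In this sketch the first matching context wins outright. Topping up the list from shorter n-grams when the longest match yields fewer than 10 words would be an alternative design.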

Word Completion

As well as predicting the next word, the algorithm shows matching words from the Vocabulary as soon as you start typing a word.

  • You can press a button to complete the partial word
  • This saves time (particularly on a phone with a screen keyboard)
  • Can be used to see what words are in the Vocabulary
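Word completion can be illustrated with a simple prefix match. The sketch below assumes the vocabulary is a character vector ordered from most to least frequent; the function name `complete.word` and the toy vocabulary are hypothetical.

```r
# Toy vocabulary, assumed to be sorted from most to least frequent.
vocab <- c("the", "to", "that", "this", "there", "they", "I")

complete.word <- function(partial, vocab, size.predicted = 10) {
  if (nchar(partial) == 0) return(character(0))
  # Case-insensitive comparison of the typed prefix with the start of each word.
  prefix <- tolower(partial)
  hits <- vocab[substr(tolower(vocab), 1, nchar(prefix)) == prefix]
  # Vocabulary order is preserved, so the most frequent matches come first.
  head(hits, size.predicted)
}

complete.word("th", vocab)
complete.word("Th", vocab, size.predicted = 3)
```

A linear scan is adequate for a 5000-word vocabulary; a sorted vector with binary search or a trie would be options for larger ones.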

Details

The parameter values that gave the best output were:

# Parameters: vocabulary size, n-gram table sizes, number of predictions shown
size.vocab     <- 5000
size.2.gram    <- 10000
size.3.gram    <- 5000
size.4.gram    <- 4000
size.5.gram    <- 3000
size.6.gram    <- 2000
size.7.gram    <- 1000
size.8.gram    <- 1000
size.9.gram    <- 1000
size.predicted <- 10

Further efficiency is gained by storing no more than 10 n-grams with the same first (n-1) words, since any beyond the top 10 would never appear in a prediction.
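The pruning step can be sketched as follows. The table layout (columns `context`, `nxt` and `count`), the toy counts, and the function name `prune.ngrams` are assumptions for illustration; the example keeps 2 rows per context so that the effect is visible on a small table.

```r
# Hypothetical n-gram table: the first (n-1) words (context), the final word
# (nxt) and the training-set count of the full n-gram.
trigrams <- data.frame(
  context = c("thank you", "thank you", "thank you", "of the", "of the"),
  nxt     = c("for", "very", "so", "year", "day"),
  count   = c(50, 30, 10, 40, 20),
  stringsAsFactors = FALSE
)

prune.ngrams <- function(tbl, keep = 10) {
  # Sort by context, then by descending count within each context.
  tbl <- tbl[order(tbl$context, -tbl$count), ]
  # Rank each row within its context (1 = most frequent).
  rank.in.context <- ave(seq_len(nrow(tbl)), tbl$context, FUN = seq_along)
  # Drop everything ranked below the cut-off.
  out <- tbl[rank.in.context <= keep, ]
  rownames(out) <- NULL
  out
}

pruned <- prune.ngrams(trigrams, keep = 2)
pruned
```

With `keep = 2` the least frequent continuation of "thank you" is dropped, while both continuations of "of the" are retained.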