JHU DS Capstone: Next Word Prediction

Scott D. Koenigsman
3/10/2017

N-gram Next Word Prediction

  • Next word prediction model is an N-gram Katz backoff model using Good-Turing discounting.
    • The model predicts from the last four words typed, ignoring profanity.
    • Look for a matching prefix in the 5-gram model and return the five highest-probability next-word predictions.
    • If too few matches are found, apply discounting and repeat for successively smaller n-grams (4, 3, 2).
    • If there are still too few matches, use 1-skip-2-grams to predict from the context while ignoring the last word typed (which may be an unknown word).
    • If all else fails, predict based on default common words.
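
The lookup order described above can be sketched as follows. This is a minimal illustration, not the app's actual code: the names predict, ngram_tables, skip_table, and defaults are hypothetical, each table is assumed to map a context to candidate words already sorted by discounted probability, and the sketch simply fills the remaining slots from shorter contexts without computing Katz backoff weights.

```python
def predict(words, ngram_tables, skip_table, defaults, k=5):
    """Return up to k next-word candidates, longest context first.

    ngram_tables: {n: {context_tuple: [(word, score), ...]}} (assumed layout)
    skip_table:   {word_two_back: [(word, score), ...]}      (assumed layout)
    defaults:     list of common fallback words
    """
    found = []

    def take(options):
        # Append unseen candidates until k slots are filled.
        for word in options:
            if word not in found and len(found) < k:
                found.append(word)

    # Try the 5-gram table first, then 4-, 3-, and 2-grams.
    for n in (5, 4, 3, 2):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough words typed for this order
        take(w for w, _ in ngram_tables.get(n, {}).get(context, []))
        if len(found) == k:
            return found

    # 1-skip-2-gram fallback: use the next-to-last word, ignore the last.
    if len(words) >= 2:
        take(w for w, _ in skip_table.get(words[-2], []))

    # Last resort: default common words.
    take(defaults)
    return found


tables = {3: {("a", "b"): [("c", 0.5)]},
          2: {("b",): [("c", 0.4), ("d", 0.3)]}}
skip = {"a": [("e", 0.2)]}
common = ["the", "to", "and", "a", "of"]
print(predict(["a", "b"], tables, skip, common))
print(predict(["a", "zzz"], tables, skip, common))
```

In the second call the last word is unknown, so the n-gram lookups find nothing and the first candidate comes from the skip-gram table.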

Corpus and Data Cleaning

  • Large English-language corpus of blogs, news, and Twitter text provided by SwiftKey.
  • Corpus sampled at 20% and split into 60/20/20 training/validation/test sets.
  • Punctuation, symbols, numbers, and profanity are removed from the sampled corpus.
  • Frequency tables are generated for n-grams of size 5, 4, 3, 2, and 1, and for 1-skip-2-grams.
  • Sparse entries (n-gram occurrence < 2) are removed to reduce the size of the model.
  • Good-Turing discount factor calculated for each n-gram.
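
As a rough sketch of the table-building steps above, the functions below count n-grams, build 1-skip-2-grams, prune sparse entries, and compute a Good-Turing discount per count. All function names are hypothetical. Two assumptions are made that the slides do not state: a 1-skip-2-gram pairs each word with the word exactly two positions later, and the discount uses the textbook form d_r = r*/r with r* = (r + 1) N_{r+1} / N_r, applied only to small counts and capped at 1.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens):
    """1-skip-2-grams: (word, word two positions later). Assumed definition."""
    return [(tokens[i], tokens[i + 2]) for i in range(len(tokens) - 2)]


def prune(counts, min_count=2):
    """Drop sparse entries whose count is below min_count."""
    return {gram: c for gram, c in counts.items() if c >= min_count}


def good_turing_discounts(counts, max_r=5):
    """Map each count r <= max_r to a discount d_r = r*/r.

    N_r is the number of distinct n-grams seen exactly r times.
    When N_r or N_{r+1} is zero the estimate is undefined, so this
    sketch leaves that count undiscounted (d_r = 1).
    """
    freq_of_freq = Counter(counts.values())
    discounts = {}
    for r in range(1, max_r + 1):
        n_r = freq_of_freq.get(r, 0)
        n_next = freq_of_freq.get(r + 1, 0)
        if n_r == 0 or n_next == 0:
            discounts[r] = 1.0
        else:
            r_star = (r + 1) * n_next / n_r
            discounts[r] = min(1.0, r_star / r)
    return discounts


tokens = "the cat sat on the mat and the cat sat".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(prune(bigram_counts))
print(skip_bigrams(["a", "b", "c", "d"]))
print(good_turing_discounts({"w": 1, "x": 1, "y": 1, "z": 1, "q": 2}))
```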

Model Performance

  • Accuracy is measured by generating all possible 5-grams from the validation/test sets and counting how often the last word of a 5-gram matches one of the top five predicted words.
  • The initial model, built using 5-gram, 4-gram, 3-gram, 2-gram, and 1-gram (default common words) Katz backoff with Good-Turing discounting, achieved a validation accuracy of 70.727%.
  • Using five predictions is a compromise between three (64.1% accuracy) and seven (74.6% accuracy).
  • If the last word is unknown, the model falls through all n-grams and predicts default words with no context.
    • Introducing 1-skip-2-grams to predict from the next-to-last word improves validation accuracy slightly, to 70.759%.
  • Final test set accuracy is 70.477%.
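
The accuracy measure described above can be sketched as a top-k hit rate over held-out 5-grams. The function name top_k_accuracy and the predict_fn callback are hypothetical; the sketch assumes the held-out text is already tokenized and that predict_fn takes a four-word context and returns a ranked list of candidate words.

```python
def top_k_accuracy(sentences, predict_fn, k=5):
    """Fraction of 5-grams whose last word is among the top-k predictions."""
    hits = 0
    total = 0
    for tokens in sentences:
        # Slide a 5-word window: 4 words of context, 1 target word.
        for i in range(len(tokens) - 4):
            context = tokens[i:i + 4]
            target = tokens[i + 4]
            if target in predict_fn(context)[:k]:
                hits += 1
            total += 1
    return hits / total if total else 0.0


# Toy predictor that always suggests the same words, for illustration only.
def always_e(context):
    return ["e", "x", "y"]


held_out = [["a", "b", "c", "d", "e", "f"]]
print(top_k_accuracy(held_out, always_e))
```

Changing k in such a function is one way to compare three, five, and seven predictions, as in the figures quoted above.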

Model Usage

  • Navigate to the Shiny app hosted on shinyapps.io.
  • Select the Model tab.
  • Start entering text in the input text box.
  • Predicted next words will appear in the prediction box.
  • At any point, select a predicted word:
    • The selected word will be added to the input.
    • Continue typing or select another predicted word.