JHU DS Capstone: Next Word Prediction
Scott D. Koenigsman
3/10/2017
N-gram Next Word Prediction
- Next word prediction model is an N-gram Katz backoff model using Good-Turing discounting.
- Model predicts based on the last four words typed, excluding profanity words.
- Look for matching prefix in 5-gram model and return 5 highest probability next word predictions.
- If insufficient matches found, perform discounting and repeat for successively smaller n-grams (4, 3, 2).
- If matches are still insufficient, use 1-skip-2-grams to predict from the next-to-last word, ignoring the last word typed (which may be an unknown word).
- If all else fails, predict based on default common words.
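The lookup order above can be sketched in a few lines of Python. This is a simplified illustration, not the app's actual code: the function name `predict_next`, the table layout, and the toy data are all assumptions, and the sketch omits the Good-Turing discounting and backoff weights, simply filling any remaining slots from shorter contexts.

```python
def predict_next(words, ngram_tables, skip_table, defaults, k=5):
    """Return up to k next-word candidates, trying the longest context first.

    ngram_tables: {prefix_length: {prefix_tuple: {next_word: probability}}}
    skip_table:   {next_to_last_word: {next_word: probability}}
    (Both layouts are hypothetical, chosen for illustration.)
    """
    preds = []

    def add(candidates):
        # Add candidates in descending probability order, skipping duplicates.
        for word, _ in sorted(candidates.items(), key=lambda kv: -kv[1]):
            if word not in preds and len(preds) < k:
                preds.append(word)

    # Prefix lengths 4, 3, 2, 1 correspond to the 5-, 4-, 3-, and 2-gram models.
    for n in (4, 3, 2, 1):
        if len(preds) >= k:
            break
        if len(words) >= n:
            add(ngram_tables.get(n, {}).get(tuple(words[-n:]), {}))

    # 1-skip-2-gram fallback: condition on the next-to-last word only.
    if len(preds) < k and len(words) >= 2:
        add(skip_table.get(words[-2], {}))

    # Last resort: pad with default common words.
    for word in defaults:
        if len(preds) >= k:
            break
        if word not in preds:
            preds.append(word)
    return preds


# Toy tables (made-up probabilities, for illustration only).
toy_tables = {
    2: {("a", "b"): {"c": 0.6, "d": 0.4}},
    1: {("b",): {"e": 0.5, "c": 0.3}},
}
toy_skip = {"a": {"f": 1.0}}
toy_defaults = ["the", "to", "and", "a", "of"]

known = predict_next(["a", "b"], toy_tables, toy_skip, toy_defaults)
unknown_last = predict_next(["a", "zzz"], toy_tables, toy_skip, toy_defaults)
```

In the second call the last word is unknown, so every n-gram lookup misses and the skip-gram table supplies the only context-based prediction before the defaults fill the remaining slots.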
Corpus and Data Cleaning
- Large English-language corpus of blogs, news, and Twitter posts provided by SwiftKey.
- Corpus sampled at 20% and split into 60/20/20 training/validation/test sets.
- Punctuation, symbols, numbers, and profanity are removed from the sampled corpus.
- Frequency tables are generated for n-grams of size 5, 4, 3, 2, and 1, and for 1-skip-2-grams.
- Sparse entries (n-gram occurrence count < 2) are removed to reduce the size of the model.
- Good-Turing discount factor calculated for each n-gram.
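The table-building steps above might look roughly like the following Python sketch. All function names, the lowercasing step, the placeholder profanity set, and the `max_count` cutoff are assumptions made for illustration. The discount uses the basic Good-Turing ratio d_c = (c + 1) N_{c+1} / (c N_c), where N_c is the number of distinct n-grams seen exactly c times; the slides do not say which variant the model uses.

```python
import re
from collections import Counter


def clean_tokens(text, profanity):
    """Lowercase (an assumption), strip punctuation/symbols/numbers, drop profanity."""
    text = re.sub(r"[^a-z'\s]", " ", text.lower())
    return [w for w in text.split() if w not in profanity]


def ngrams(tokens, n):
    """All contiguous n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens):
    """1-skip-2-grams: pair each word with the word two positions later."""
    return [(tokens[i], tokens[i + 2]) for i in range(len(tokens) - 2)]


def good_turing_discounts(counts, max_count=5):
    """Discount factor per count c, using d_c = (c + 1) * N_{c+1} / (c * N_c).

    Falls back to 1.0 (no discount) when N_c or N_{c+1} is zero.
    """
    freq_of_freq = Counter(counts.values())  # N_c
    discounts = {}
    for c in range(1, max_count + 1):
        n_c, n_c1 = freq_of_freq.get(c, 0), freq_of_freq.get(c + 1, 0)
        if n_c > 0 and n_c1 > 0:
            discounts[c] = (c + 1) * n_c1 / (c * n_c)
        else:
            discounts[c] = 1.0
    return discounts


def prune(counts, min_count=2):
    """Drop sparse entries seen fewer than min_count times."""
    return Counter({g: c for g, c in counts.items() if c >= min_count})


tokens = clean_tokens("A b, c d 42 e! A b", profanity={"darn"})  # placeholder word list
bigram_counts = Counter(ngrams(tokens, 2))
discounts = good_turing_discounts(bigram_counts)
pruned = prune(bigram_counts)
```

In this sketch the discounts are computed before pruning so that the count-1 frequency N_1 is still available; whether the original model does the same is not stated.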
Model Performance
- Accuracy is measured by generating all possible 5-grams from the validation/test sets and counting how often the last word of each 5-gram matches one of the top five predicted words.
- Initial model, a Katz backoff over 5-, 4-, 3-, 2-, and 1-grams (default common words) with Good-Turing discounting, achieved validation accuracy of 70.727%.
- Using five predictions is a compromise between three (64.1% accuracy) and seven (74.6% accuracy).
- If the last word is unknown, the model falls through all n-grams and predicts default words with no context.
- Introducing 1-skip-2-grams to predict from the next-to-last word improves validation accuracy slightly, to 70.759%.
- Final test set accuracy is 70.477%.
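The accuracy measure described above can be sketched as a top-k hit rate over sliding 5-grams. The function name `top_k_accuracy` and the toy predictor are hypothetical; any function that maps a four-word context to a ranked list of words could be plugged in.

```python
def top_k_accuracy(tokens, predict, k=5, n=5):
    """Fraction of n-grams whose last word is among the top k predictions.

    predict: callable taking the first n-1 words and returning a ranked list.
    """
    hits = total = 0
    for i in range(len(tokens) - n + 1):
        *context, target = tokens[i:i + n]
        total += 1
        if target in predict(context)[:k]:
            hits += 1
    return hits / total if total else 0.0


# Toy predictor that ignores its context (illustration only).
def toy_predict(context):
    return ["f", "e"]


sample = "a b c d e f".split()
acc_top1 = top_k_accuracy(sample, toy_predict, k=1)
acc_top2 = top_k_accuracy(sample, toy_predict, k=2)
```

Even in this toy case, accuracy rises as more predictions are allowed, which is the trade-off behind choosing five predictions rather than three or seven.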
Model Usage
- Navigate to the shiny app hosted on shinyapps.io.
- Select the Model tab.
- Start entering text in the input text box.
- Predicted next words will appear in the prediction box.
- At any point, select a predicted word:
  - The selected word will be added to the input.
  - Continue typing or select another predicted word.