Next Word Prediction

Leo Yang
2018-06-16

Model Highlights

  • We combine n-gram models with pairwise correlation of non-stop words to predict the next word;
  • The n-gram models use the previous one, two and three words as context;
  • The output probabilities from each model are weighted and combined to give a final score for each candidate word;
  • If the models do not give a prediction for the next word, we fall back to the most frequent words;
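The fallback step can be illustrated with a minimal sketch. It is written in Python with the standard library only, not in the R code behind the app, and the function name `predict_with_fallback`, its parameters, and the toy corpus are hypothetical choices made for illustration.

```python
from collections import Counter


def predict_with_fallback(scores, unigram_counts, top_k=10):
    """Return up to top_k (word, score) pairs, best first.

    scores: dict of candidate word -> combined model score (may be empty).
    unigram_counts: Counter of word frequencies in the training sample.
    """
    if scores:
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:top_k]
    # No model produced a candidate: fall back to the most frequent words,
    # scored by relative frequency (an assumption; the deck does not say
    # which score the fallback words receive).
    total = sum(unigram_counts.values())
    return [(w, c / total) for w, c in unigram_counts.most_common(top_k)]


# Toy corpus, for illustration only.
unigrams = Counter("the cat sat on the mat and the dog sat too".split())
fallback_result = predict_with_fallback({}, unigrams, top_k=2)
model_result = predict_with_fallback({"mat": 0.7, "sofa": 0.3}, unigrams, top_k=1)
```

With an empty score table the function returns the most frequent words; otherwise it returns the best-scored candidates.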

Data Preprocess

  • Packages Used:

We mostly use the tidyverse and tidytext packages for tokenization, filtering and aggregation. For pairwise word correlation within a sentence, we use pairwise_cor from the widyr package. The tm package is used for word stemming.

  • Data Preprocess:

Due to limited computer memory, we sample only ~40% of the complete data for modeling. N-grams are kept only if they consist of letters and apostrophes (') and occur above a frequency threshold, which varies with the size of the model. For pairwise correlation, we use tm::stemDocument to stem the words, which reduces the word space.
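The n-gram filter described above might look roughly like the following Python sketch. The helper `filter_ngrams`, the `min_count` parameter, and the example counts are hypothetical. The sketch also assumes lowercase tokens, a straight apostrophe, and a "greater than or equal" threshold, none of which the text specifies.

```python
import re
from collections import Counter

# Assumption: a token is acceptable if it consists only of lowercase
# ASCII letters and straight apostrophes.
TOKEN_OK = re.compile(r"[a-z']+")


def filter_ngrams(ngram_counts, min_count):
    """Keep n-grams whose tokens all pass TOKEN_OK and whose count >= min_count."""
    kept = Counter()
    for ngram, count in ngram_counts.items():
        tokens = ngram.split(" ")
        if count >= min_count and all(TOKEN_OK.fullmatch(t) for t in tokens):
            kept[ngram] = count
    return kept


# Made-up counts, for illustration only.
counts = Counter({"i don't": 12, "in 2018": 30, "see you": 3, "thank you": 8})
kept = filter_ngrams(counts, min_count=5)
```

Here "in 2018" is dropped because it contains digits and "see you" because it falls below the threshold. A larger model would use a higher `min_count` to keep the table small.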

Model Details

  • NGram Models:

We use tidytext::unnest_tokens for n-gram tokenization. For each n-gram model (one, two or three previous words), we apply the filters specified in the Data Preprocess section to remove undesired n-grams. We then count the n-grams and normalize the counts into prediction probabilities. The probabilities from the three n-gram models are added together with the weights shown below:

ngram      weight
bigram     1
trigram    5
fourgram   10
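As a rough illustration of how counting, normalizing and weighting could fit together, here is a small Python sketch. It assumes that counts are normalized per context (so each context's next-word probabilities sum to 1), which is one possible reading of "normalize". The helper names `train` and `score` and the toy sentence are hypothetical; the weights are the ones listed in the table.

```python
from collections import Counter, defaultdict

# Weights from the table, keyed by the number of previous words used:
# 1 -> bigram, 2 -> trigram, 3 -> fourgram.
WEIGHTS = {1: 1, 2: 5, 3: 10}


def train(tokens, context_len):
    """Map each context of context_len words to next-word probabilities."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_len):
        context = tuple(tokens[i:i + context_len])
        counts[context][tokens[i + context_len]] += 1
    model = {}
    for context, nxt in counts.items():
        total = sum(nxt.values())
        # Assumption: normalize within each context.
        model[context] = {w: c / total for w, c in nxt.items()}
    return model


def score(models, phrase_tokens):
    """Add up weight * probability over all n-gram models that match."""
    scores = defaultdict(float)
    for k, model in models.items():
        if len(phrase_tokens) < k:
            continue  # phrase too short for this model
        context = tuple(phrase_tokens[-k:])
        for word, p in model.get(context, {}).items():
            scores[word] += WEIGHTS[k] * p
    return dict(scores)


# Toy training text, for illustration only.
tokens = "i love you and i love cake and i like you".split()
models = {k: train(tokens, k) for k in WEIGHTS}
scores = score(models, "and i love".split())
```

The resulting values are scores rather than probabilities, since the weights do not sum to 1. Dividing by the sum of all scores would turn them into a distribution if one is needed for display.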
  • Pairwise Correlation Model:

To calculate meaningful word correlations within a sentence, we go through the following steps: 1) tokenize the texts into sentences; 2) remove stop words using the tidytext::stop_words collection; 3) stem the words using tm::stemDocument; 4) calculate the pairwise correlation using widyr::pairwise_cor.
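To make the correlation step concrete, the sketch below computes the phi coefficient between word pairs from their presence or absence in each sentence. This is what a Pearson correlation reduces to for binary data, and we assume it matches what widyr::pairwise_cor reports in this setting. The code is plain Python rather than the R pipeline, the stemming step is omitted, and `STOP_WORDS` is a tiny hypothetical stand-in for tidytext::stop_words.

```python
import math
from itertools import combinations

# Hypothetical miniature stop-word list (stand-in for tidytext::stop_words).
STOP_WORDS = {"the", "a", "is", "i", "and", "to", "of", "in"}


def pairwise_phi(sentences):
    """Phi coefficient for every word pair, based on sentence co-occurrence."""
    # Each sentence becomes the set of its non-stop words (no stemming here).
    docs = [set(s.lower().split()) - STOP_WORDS for s in sentences]
    n = len(docs)
    vocab = sorted(set().union(*docs))
    result = {}
    for a, b in combinations(vocab, 2):
        n11 = sum(1 for d in docs if a in d and b in d)  # both present
        na = sum(1 for d in docs if a in d)              # a present
        nb = sum(1 for d in docs if b in d)              # b present
        n10 = na - n11                                   # only a
        n01 = nb - n11                                   # only b
        n00 = n - n11 - n10 - n01                        # neither
        denom = math.sqrt(na * (n - na) * nb * (n - nb))
        if denom > 0:  # skip words that appear in every sentence
            result[(a, b)] = (n11 * n00 - n10 * n01) / denom
    return result


# Toy sentences, for illustration only.
sentences = [
    "the coffee is hot",
    "hot coffee and cake",
    "i like cake",
    "the tea is cold",
]
cor = pairwise_phi(sentences)
```

Words that always appear together, such as "coffee" and "hot" here, get a correlation of 1, while words that never share a sentence get a negative value.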

App Outline

App Screen Shot

The app contains four sections, as shown in the screenshot: an input field for entering the phrase; a Predict button that triggers the model; the next-word prediction, i.e. the word with the highest probability; and the top 10 suggestions with their corresponding probabilities. The model calculation typically takes from less than a second to a few seconds.