Next Word Prediction

Leo Yang
2018-06-16

Model Highlights

We combine ngram models with pairwise correlation between non-stop words to predict the next word;
For the ngram prediction, the previous one, two and three words are used;
The output probabilities from each model are combined as a weighted average to give a final score for each prediction;
If the models give no prediction for the next word, we fall back to the most frequent words;
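
The fallback rule can be sketched in base R as follows. This is a minimal illustration, not the app's actual code: the toy token vector and the names `fallback_words` and `predict_next` are assumptions made for the example.

```r
# Toy tokens standing in for the training corpus (hypothetical data).
tokens <- c("the", "cat", "sat", "on", "the", "mat", "and", "the", "dog", "sat")

# Most frequent words, used when the models return nothing.
unigram_freq <- sort(table(tokens), decreasing = TRUE)
fallback_words <- names(unigram_freq)[1:2]

# `scores` is a named numeric vector of combined model scores (may be empty).
predict_next <- function(scores, n = 2) {
  if (length(scores) == 0) {
    # No model produced a candidate: fall back to the most frequent words.
    return(fallback_words[seq_len(min(n, length(fallback_words)))])
  }
  ranked <- names(sort(scores, decreasing = TRUE))
  ranked[seq_len(min(n, length(ranked)))]
}

predict_next(numeric(0))                      # uses the fallback words
predict_next(c(say = 0.1, do = 0.9), n = 1)   # uses the model scores
```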

Data Preprocess

Packages Used:

We mostly use the tidyverse and tidytext packages for tokenization, filtering and aggregation. For pairwise word correlation within a sentence, we use pairwise_cor from the widyr package. The tm package is used for word stemming.

Data Preprocess:

Due to limited computer memory, we sample only ~40% of the complete data for modeling. We keep ngrams that consist only of letters and apostrophes (') and whose frequency is above a threshold; the threshold varies with the size of the model. For pairwise correlation, we use tm::stemDocument to stem the words, which reduces the word space.
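
The ngram filter described here might look like the sketch below. The toy corpus, the regular expression and the `min_count` threshold are assumptions for illustration; the actual thresholds are not stated in this document.

```r
library(dplyr)
library(tidytext)

# Toy corpus standing in for the sampled text (hypothetical data).
corpus <- tibble(text = c(
  "I don't know what to do",
  "I don't know 2 people here",
  "what to do next"
))

min_count <- 2  # assumed threshold, chosen only for this example

bigrams <- corpus %>%
  # unnest_tokens lowercases the text by default
  unnest_tokens(ngram, text, token = "ngrams", n = 2) %>%
  # keep ngrams made only of letters, apostrophes and the separating space
  filter(grepl("^[a-z' ]+$", ngram)) %>%
  count(ngram, name = "freq") %>%
  # drop ngrams that occur too rarely
  filter(freq >= min_count)

bigrams
```

Here the bigrams containing the digit "2" are removed by the character filter, and bigrams seen only once are removed by the frequency filter.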

Model Details

NGram Models:

We use tidytext::unnest_tokens for ngram tokenization. For each ngram model (one, two or three previous words), we apply the filters described in the Data Preprocess section to remove undesired ngrams. We then count the ngrams and normalize the counts into prediction probabilities. The probabilities from the three ngram models are added together with the weights shown below:

ngrams	weight
bigram	1
trigram	5
fourgram	10
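
One way to read the normalization and weighting is sketched below. The count tables, the prefix/next_word layout and the helper names (`to_prob`, `last_words`, `score_next_word`) are hypothetical; dividing by the sum of the weights is an assumption made so that the result is a weighted average.

```r
library(dplyr)

# Hypothetical ngram counts, already split into prefix and next word.
bigram_counts <- tibble(prefix = "to",
                        next_word = c("be", "do", "go"),
                        freq = c(6, 3, 1))
trigram_counts <- tibble(prefix = "what to",
                         next_word = c("do", "say"),
                         freq = c(3, 1))
fourgram_counts <- tibble(prefix = "know what to", next_word = "do", freq = 2)

# Normalize counts within each prefix so the probabilities sum to 1.
to_prob <- function(counts) {
  counts %>%
    group_by(prefix) %>%
    mutate(prob = freq / sum(freq)) %>%
    ungroup()
}

# Weights from the table above.
model_weights <- c(bigram = 1, trigram = 5, fourgram = 10)

# Last k words of the phrase, joined by a space.
last_words <- function(words, k) paste(tail(words, k), collapse = " ")

score_next_word <- function(phrase) {
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  bind_rows(
    to_prob(bigram_counts) %>%
      filter(prefix == last_words(words, 1)) %>%
      mutate(weight = model_weights[["bigram"]]),
    to_prob(trigram_counts) %>%
      filter(prefix == last_words(words, 2)) %>%
      mutate(weight = model_weights[["trigram"]]),
    to_prob(fourgram_counts) %>%
      filter(prefix == last_words(words, 3)) %>%
      mutate(weight = model_weights[["fourgram"]])
  ) %>%
    group_by(next_word) %>%
    # weighted sum of probabilities, scaled by the total weight (assumption)
    summarise(score = sum(weight * prob) / sum(model_weights), .groups = "drop") %>%
    arrange(desc(score))
}

score_next_word("I know what to")
```

With these toy counts, "do" is supported by all three models and therefore ranks first; words seen only by the bigram model receive much lower scores because of its smaller weight.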

Pairwise Correlation Model:

To calculate meaningful word correlations within a sentence, we go through the following steps: 1) tokenize the texts into sentences; 2) remove stop words using the tidytext::stop_words collection; 3) stem the words using tm::stemDocument; 4) calculate the pairwise correlation using widyr::pairwise_cor.
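
The four steps can be sketched as a single pipeline. The example text and the `sentence_id` column are assumptions for illustration, and the `distinct` call is an added safeguard rather than something stated above; the tm stemmer also needs the SnowballC package installed.

```r
library(dplyr)
library(tidytext)
library(widyr)

# Toy document standing in for the corpus (hypothetical data).
docs <- tibble(text = paste(
  "The dogs were barking loudly.",
  "A dog barked at the cat.",
  "The cat slept quietly."
))

sentence_words <- docs %>%
  # 1) tokenize the text into sentences and give each one an id
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  # split each sentence into words
  unnest_tokens(word, sentence) %>%
  # 2) remove stop words
  anti_join(stop_words, by = "word") %>%
  # 3) stem the remaining words
  mutate(word = tm::stemDocument(word)) %>%
  # keep one row per word per sentence (added safeguard)
  distinct(sentence_id, word)

# 4) correlation of word co-occurrence across sentences
word_cors <- sentence_words %>%
  pairwise_cor(word, sentence_id, sort = TRUE)

word_cors
```

In this toy example the stems "dog" and "bark" appear in exactly the same sentences, so they are perfectly correlated.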

App Outline

App Screen Shot

The app contains four sections, as shown on the left: an input field for entering the phrase; a predict button that triggers the model; the next word prediction, i.e. the word with the highest probability; and the top 10 suggestions with their corresponding probabilities. The model calculation typically takes from less than a second to a few seconds.