Next Word Prediction

Samer Alwan
April 16, 2019

Data Science Specialization
Capstone Project
www.coursera.org

Next Word App - Description

A Shiny application that predicts the next word from a text input, using an n-gram model with Kneser-Ney smoothing.
Link: https://sal2040.shinyapps.io/word_prediction/


Desktop performance:

  • RAM: 1.4 MB
  • Approx. system time: 0.2 sec

How the App Works

  • The App creates a prefix from the text input and generates all possible n-grams that start with that prefix.
  • It then queries data frames (loaded into RAM at startup) that were created in the model training phase. Each data frame contains the set of n-grams of one order (4, 3, 2, 1), their continuation and “simple” probabilities, and the corresponding lambdas.
  • Finally, the App computes the Kneser-Ney interpolated probability of each candidate n-gram, selects the 10 most probable ones, and returns their last words (see the sketch below).
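
A minimal sketch of this prediction step in R, with toy data and hypothetical column names (the real App works on much larger, integer-encoded tables):

```r
# Toy 4-gram table; column names and numbers are illustrative only.
ngrams4 <- data.frame(
  history   = c("at the end", "at the end", "at the end"),
  last_word = c("of", "and", "there"),
  prob      = c(0.42, 0.05, 0.01),   # discounted "simple" probability of the full n-gram
  lambda    = c(0.10, 0.10, 0.10),   # back-off weight of the history
  p_cont    = c(0.30, 0.20, 0.02),   # lower-order continuation probability
  stringsAsFactors = FALSE
)

predict_next <- function(input, ngrams, top_n = 10) {
  # prefix = last three words of the input
  prefix <- paste(tail(strsplit(tolower(input), "\\s+")[[1]], 3), collapse = " ")
  cand   <- ngrams[ngrams$history == prefix, ]
  # interpolated score: discounted probability + lambda * continuation probability
  cand$p_kn <- cand$prob + cand$lambda * cand$p_cont
  head(cand[order(-cand$p_kn), "last_word"], top_n)
}

predict_next("We met at the end", ngrams4)   # "of" "and" "there"
```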


Model Parameters and Optimization

Highest n-gram order:

  • 4-grams => prediction uses at most the 3 preceding words

Discount method:

  • Absolute discount (used in the interpolation formula shown below):
    \[ D_1 = 0.5 \] \[ D_{\ge 2} = 0.75 \]
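
For reference, the interpolated Kneser-Ney estimate that uses these discounts has the standard form (notation added here, not taken from the original slides):

\[ P_{KN}(w \mid h) = \frac{\max\big(c(h,w) - D_{c(h,w)},\, 0\big)}{\sum_{w'} c(h,w')} + \lambda(h)\, P_{KN}(w \mid h') \]

where the discount is D_1 for n-grams seen once and D_{\ge 2} otherwise, and the weight λ(h) redistributes the discounted mass over the lower-order model for the shortened history h'.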

Transformation to integers

  • The n-grams in loaded data frames are represented by integers. The App converts between strings and integers via a hash table that is loaded into RAM.
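
A minimal sketch of such a mapping in R, using an environment as the hash table (vocabulary and helper names are illustrative):

```r
vocab   <- c("the", "end", "of", "UNK")          # toy training vocabulary
word2id <- new.env(hash = TRUE)                  # environment used as a hash table
for (i in seq_along(vocab)) assign(vocab[i], i, envir = word2id)
id2word <- vocab                                 # reverse lookup is plain indexing

encode <- function(words) {
  vapply(words, function(w) {
    if (exists(w, envir = word2id, inherits = FALSE)) get(w, envir = word2id)
    else get("UNK", envir = word2id)             # unseen words map to the UNK id
  }, integer(1))
}

encode(c("the", "end", "penguin"))               # 1 2 4
id2word[encode(c("the", "end", "penguin"))]      # "the" "end" "UNK"
```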

Unknown words

  • All words from the training vocabulary with frequency = 1 were replaced by the “UNK” token (see the sketch below)
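
A small illustration of this step (toy tokens; the real vocabulary comes from the training corpus):

```r
tokens <- c("a", "rose", "is", "a", "rose", "is", "a", "daffodil")
freq   <- table(tokens)
vocab  <- names(freq[freq > 1])                          # keep words seen more than once
tokens_unk <- ifelse(tokens %in% vocab, tokens, "UNK")   # singletons become "UNK"
tokens_unk   # "a" "rose" "is" "a" "rose" "is" "a" "UNK"
```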

Pruning

  • Simple cut-off with cut-off frequency = 1
  • All n-grams from the training corpus were used during model training and pruned only at the end, so the probability estimates are based on the whole corpus. This is the so-called “hide” method (contrasted with the “remove” method in the sketch below).
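
A toy contrast between the two cut-off strategies (illustrative bigram counts):

```r
counts <- data.frame(ngram = c("a b", "a c", "a d"), n = c(5L, 1L, 1L))

# "Hide": estimate probabilities on the full table, prune afterwards
hide <- transform(counts, p = n / sum(n))   # denominator still includes the singletons (7)
hide <- hide[hide$n > 1, ]                  # p("a b") = 5/7

# "Remove": prune first, then estimate probabilities
rem <- counts[counts$n > 1, ]
rem <- transform(rem, p = n / sum(n))       # p("a b") = 5/5 = 1
```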

Model Selection I

Several versions of the model were compared on a development set. They differed in n-gram order, pruning cut-off method and discount factors:

Highest n-gram order:

  • 3-gram
  • 4-gram

Pruning cut-off method:

  • “Hide” = cut-off done after probability calculations
  • “Remove” = cut-off done before probability calculations
  • No pruning

Discount factor:

  • Basic:
    \[ D_1 = 0.5 \] \[ D_{\ge 2} = 0.75 \]
  • Modified (computed in the sketch below):
    \[ Y = \frac{n_1}{n_1+2n_2} \]
    \[ D_1 = 1-2Y\frac{n_2}{n_1} \]
    \[ D_2 = 2-3Y\frac{n_3}{n_2} \]
    \[ D_3 = 3-4Y\frac{n_4}{n_3} \]
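
The modified discounts follow directly from the counts-of-counts n1..n4 (the numbers of n-grams seen exactly 1, 2, 3 and 4 times); a sketch with toy values:

```r
# Toy counts-of-counts; the real values come from the training corpus.
n1 <- 200000; n2 <- 80000; n3 <- 45000; n4 <- 30000

Y  <- n1 / (n1 + 2 * n2)
D1 <- 1 - 2 * Y * n2 / n1
D2 <- 2 - 3 * Y * n3 / n2
D3 <- 3 - 4 * Y * n4 / n3
c(Y = Y, D1 = D1, D2 = D2, D3 = D3)
```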


Data set summary:

Type     No. of texts in corpus
Train    373,600
Dev      100 × 4,000
Test     400,000

Model Selection II

Performance of each model was measured by perplexity. To assess the significance of the differences in performance, the dev set was divided into 100 subsets and the perplexity of each model was calculated on each subset. This simulated a distribution of perplexities for each model; the significance of the differences was then measured with a paired t-test, and the normality of the distributions was checked with the Shapiro-Wilk test (see the sketch below).
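
A sketch of these tests in R on simulated per-subset perplexities (toy numbers; the real comparison uses the perplexities of two models on the same 100 dev subsets):

```r
set.seed(1)
ppl_a <- rnorm(100, mean = 110.6, sd = 4)   # per-subset perplexities of one model
ppl_b <- rnorm(100, mean = 125.7, sd = 4)   # the next-ranked model on the same subsets

shapiro.test(ppl_a)                         # normality of the perplexity distribution
t.test(ppl_a, ppl_b, paired = TRUE)         # paired t-test on the perplexity differences
```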

Model               Shapiro-Wilk p-value   Mean perplexity   Δ vs. next model   Paired t-test p-value
UNPRUNED_BASIC_4G   0.4934                 110.6234          -15.0483           1.75e-172
UNPRUNED_MOD_4G     0.5747                 125.6716           -4.9026           2.60e-119
HIDE_BASIC_4G       0.5207                 130.5743           -5.1068           3.21e-165
HIDE_MOD_4G         0.5798                 135.6810           -1.2578           3.04e-26
REMOVE_BASIC_4G     0.5021                 136.9388           -2.2620           1.46e-37
UNPRUNED_BASIC_3G   0.3521                 139.2009           -9.0935           1.44e-94
REMOVE_BASIC_3G     0.6195                 148.2943           -7.1126           4.09e-85
UNPRUNED_MOD_3G     0.4552                 155.4069           -6.5194           8.99e-124
HIDE_BASIC_3G       0.3624                 161.9264           -5.5459           7.20e-168
HIDE_MOD_3G         0.4129                 167.4723           NA                 NA

Even though the unpruned models generally perform better, their size reduces the App's performance significantly. Therefore, the 4-gram model with basic discounting and the “hide” cut-off method was selected.

Perplexity measured on the final test set: 130.8744