Next Word Prediction

Samer Alwan
April 16, 2019

Data Science Specialization
Capstone Project
www.coursera.org

Next Word App - Description

A Shiny application that predicts the next word from a text input, using an n-gram model with Kneser-Ney smoothing.
Link: https://sal2040.shinyapps.io/word_prediction/


Desktop performance:

  • RAM: 1.4 MB
  • Approx. system time: 0.2 sec

How the App Works

  • The App creates a prefix from the text input and generates all possible n-grams that start with that prefix.
  • It then queries data frames (loaded into RAM at startup) that were created in the model training phase. Each data frame contains the set of n-grams of one order (4, 3, 2, 1), their continuation and “simple” probabilities, and the corresponding lambdas.
  • Finally, the App computes the Kneser-Ney interpolated probability of each candidate n-gram, selects the 10 most probable ones, and returns their last words (see the sketch below).
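
A minimal sketch of this prediction step in R, with toy data and hypothetical column names (the real App works on much larger, integer-encoded tables):

```r
# Toy 4-gram table; column names and numbers are illustrative only.
ngrams4 <- data.frame(
  history   = c("at the end", "at the end", "at the end"),
  last_word = c("of", "and", "there"),
  prob      = c(0.42, 0.05, 0.01),   # discounted "simple" probability of the full n-gram
  lambda    = c(0.10, 0.10, 0.10),   # back-off weight of the history
  p_cont    = c(0.30, 0.20, 0.02),   # lower-order continuation probability
  stringsAsFactors = FALSE
)

predict_next <- function(input, ngrams, top_n = 10) {
  # prefix = last three words of the input
  prefix <- paste(tail(strsplit(tolower(input), "\\s+")[[1]], 3), collapse = " ")
  cand   <- ngrams[ngrams$history == prefix, ]
  # interpolated score: discounted probability + lambda * continuation probability
  cand$p_kn <- cand$prob + cand$lambda * cand$p_cont
  head(cand[order(-cand$p_kn), "last_word"], top_n)
}

predict_next("We met at the end", ngrams4)   # "of" "and" "there"
```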


Model Parameters and Optimization

Highest n-gram order:

  • 4-grams => prediction uses at most the 3 preceding words

Discount method:

  • Absolute discount (used in the interpolation formula shown below):
    \[ D_1 = 0.5 \] \[ D_{\ge 2} = 0.75 \]
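
For reference, the interpolated Kneser-Ney estimate that uses these discounts has the standard form (notation added here, not taken from the original slides):

\[ P_{KN}(w \mid h) = \frac{\max\big(c(h,w) - D_{c(h,w)},\, 0\big)}{\sum_{w'} c(h,w')} + \lambda(h)\, P_{KN}(w \mid h') \]

where the discount is D_1 for n-grams seen once and D_{\ge 2} otherwise, and the weight λ(h) redistributes the discounted mass over the lower-order model for the shortened history h'.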

Transformation to integers

  • The n-grams in loaded data frames are represented by integers. The App converts between strings and integers via a hash table that is loaded into RAM.
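
A minimal sketch of such a mapping in R, using an environment as the hash table (vocabulary and helper names are illustrative):

```r
vocab   <- c("the", "end", "of", "UNK")          # toy training vocabulary
word2id <- new.env(hash = TRUE)                  # environment used as a hash table
for (i in seq_along(vocab)) assign(vocab[i], i, envir = word2id)
id2word <- vocab                                 # reverse lookup is plain indexing

encode <- function(words) {
  vapply(words, function(w) {
    if (exists(w, envir = word2id, inherits = FALSE)) get(w, envir = word2id)
    else get("UNK", envir = word2id)             # unseen words map to the UNK id
  }, integer(1))
}

encode(c("the", "end", "penguin"))               # 1 2 4
id2word[encode(c("the", "end", "penguin"))]      # "the" "end" "UNK"
```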

Unknown words

  • All words from the training vocabulary with frequency = 1 were replaced by the “UNK” token (see the sketch below)
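
A small illustration of this step (toy tokens; the real vocabulary comes from the training corpus):

```r
tokens <- c("a", "rose", "is", "a", "rose", "is", "a", "daffodil")
freq   <- table(tokens)
vocab  <- names(freq[freq > 1])                          # keep words seen more than once
tokens_unk <- ifelse(tokens %in% vocab, tokens, "UNK")   # singletons become "UNK"
tokens_unk   # "a" "rose" "is" "a" "rose" "is" "a" "UNK"
```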

Pruning

  • Simple cut-off with cut-off frequency = 1
  • All n-grams from the training corpus were used during model training and pruned only at the end, so the probability estimates are based on the whole corpus. This is the so-called “hide” method (contrasted with the “remove” method in the sketch below).
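
A toy contrast between the two cut-off strategies (illustrative bigram counts):

```r
counts <- data.frame(ngram = c("a b", "a c", "a d"), n = c(5L, 1L, 1L))

# "Hide": estimate probabilities on the full table, prune afterwards
hide <- transform(counts, p = n / sum(n))   # denominator still includes the singletons (7)
hide <- hide[hide$n > 1, ]                  # p("a b") = 5/7

# "Remove": prune first, then estimate probabilities
rem <- counts[counts$n > 1, ]
rem <- transform(rem, p = n / sum(n))       # p("a b") = 5/5 = 1
```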

Model Selection I

Several versions of the model were compared on a development set. They differed in n-gram order, pruning cut-off method and discount factors:

Highest n-gram order:

  • 3-gram
  • 4-gram

Pruning cut-off method:

  • “Hide” = cut-off done after probability calculations
  • “Remove” = cut-off done before probability calculations
  • No pruning

Discount factor:

  • Basic:
    \[ D_1 = 0.5 \] \[ D_{\ge 2} = 0.75 \]
  • Modified (computed in the sketch below):
    \[ Y = \frac{n_1}{n_1+2n_2} \]
    \[ D_1 = 1-2Y\frac{n_2}{n_1} \]
    \[ D_2 = 2-3Y\frac{n_3}{n_2} \]
    \[ D_3 = 3-4Y\frac{n_4}{n_3} \]
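
The modified discounts follow directly from the counts-of-counts n1..n4 (the numbers of n-grams seen exactly 1, 2, 3 and 4 times); a sketch with toy values:

```r
# Toy counts-of-counts; the real values come from the training corpus.
n1 <- 200000; n2 <- 80000; n3 <- 45000; n4 <- 30000

Y  <- n1 / (n1 + 2 * n2)
D1 <- 1 - 2 * Y * n2 / n1
D2 <- 2 - 3 * Y * n3 / n2
D3 <- 3 - 4 * Y * n4 / n3
c(Y = Y, D1 = D1, D2 = D2, D3 = D3)
```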


Data set summary:

Type     No. of texts in corpus
Train    373,600
Dev      100 × 4,000
Test     400,000

Model Selection II

Performance of each model was measured by perplexity. To assess the significance of the differences in performance, the dev set was divided into 100 subsets and the perplexity of each model was calculated on each subset. This simulated a distribution of perplexities for each model; the significance of the differences was then measured with a paired t-test, and the normality of the distributions was checked with the Shapiro-Wilk test (see the sketch below).
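
A sketch of these tests in R on simulated per-subset perplexities (toy numbers; the real comparison uses the perplexities of two models on the same 100 dev subsets):

```r
set.seed(1)
ppl_a <- rnorm(100, mean = 110.6, sd = 4)   # per-subset perplexities of one model
ppl_b <- rnorm(100, mean = 125.7, sd = 4)   # the next-ranked model on the same subsets

shapiro.test(ppl_a)                         # normality of the perplexity distribution
t.test(ppl_a, ppl_b, paired = TRUE)         # paired t-test on the perplexity differences
```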

Model               Shapiro-Wilk p-value   Mean perplexity   Δ vs. next model   Paired t-test p-value
UNPRUNED_BASIC_4G   0.4934                 110.6234          -15.0483           1.75e-172
UNPRUNED_MOD_4G     0.5747                 125.6716           -4.9026           2.60e-119
HIDE_BASIC_4G       0.5207                 130.5743           -5.1068           3.21e-165
HIDE_MOD_4G         0.5798                 135.6810           -1.2578           3.04e-26
REMOVE_BASIC_4G     0.5021                 136.9388           -2.2620           1.46e-37
UNPRUNED_BASIC_3G   0.3521                 139.2009           -9.0935           1.44e-94
REMOVE_BASIC_3G     0.6195                 148.2943           -7.1126           4.09e-85
UNPRUNED_MOD_3G     0.4552                 155.4069           -6.5194           8.99e-124
HIDE_BASIC_3G       0.3624                 161.9264           -5.5459           7.20e-168
HIDE_MOD_3G         0.4129                 167.4723           NA                 NA

Even though the unpruned models generally perform better, their size reduces the App's performance significantly. Therefore, the 4-gram model with basic discounting and the “hide” cut-off method was selected.

Perplexity measured on the final test set: 130.8744