Next Word Prediction

Murtuza Attarwala
November 28, 2014

Model Description

  • Next word prediction is the task of suggesting the most probable word a user will type next.
  • This model uses an n-gram probability distribution to predict the next word.
  • The corpus is divided into training (70%) and test (30%) sets; a 10% random sample of the training corpus is used to build the model.
  • The training corpus is first cleaned and then tokenized into n-grams, with the largest n used by the model being 4 (a minimal tokenization sketch follows this list).
  • Kneser-Ney smoothing is used to reduce data sparsity by adjusting the probability distribution to cover unseen sequences.
  • The idea of Kneser-Ney smoothing is to optimize the calculation of lower-order n-gram probabilities in cases where the higher-order n-gram was not seen in the corpus; it is therefore a backoff smoothing algorithm.
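A minimal sketch of the cleaning and tokenization step, using base R only (the function name and cleaning rules below are illustrative assumptions, not the author's actual code):

```r
# Clean a line of text and tokenize it into n-grams of order 1 through 4.
make_ngrams <- function(line, max_n = 4) {
  clean  <- gsub("[^a-z0-9 ]", "", tolower(line))   # lower-case, drop punctuation
  tokens <- strsplit(trimws(clean), "\\s+")[[1]]
  ngrams <- list()
  for (n in 1:max_n) {
    if (length(tokens) < n) next
    starts      <- 1:(length(tokens) - n + 1)
    ngrams[[n]] <- sapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
  }
  ngrams
}

make_ngrams("The train corpus is first cleaned and then tokenized")
```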

Backoff Smoothing

  • The Good-Turing estimate is based on the assumption that the probability of unseen sequences can be estimated from $N_1$, the number of n-grams that occurred exactly once in the corpus.
  • Since the probability mass of unseen n-grams is now greater than 0, the probability mass of seen n-grams needs to be reduced so that the total probability sums to 1. This is achieved with

$$p_r = \frac{r^*}{N} \quad \text{where} \quad r^* = (r+1)\,\frac{N_{r+1}}{N_r}$$

  • Here $N_r$ is the number of n-grams that were seen exactly $r$ times in the corpus and $r^*$ is the adjusted frequency.
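As a quick illustration of the adjusted count $r^* = (r+1)\,N_{r+1}/N_r$, here is a short sketch with made-up frequency-of-frequency counts (the numbers are hypothetical, not from the project corpus):

```r
# N_r = number of distinct n-grams observed exactly r times (hypothetical values).
Nr <- c(`1` = 5000, `2` = 1800, `3` = 900, `4` = 500, `5` = 300)
r  <- 1:4
r_star <- (r + 1) * Nr[as.character(r + 1)] / Nr[as.character(r)]
r_star[1]   # 2 * 1800 / 5000 = 0.72, the discounted count for n-grams seen once
```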

Backoff Smoothing Contd.

  • Church et al. empirically show that the discount $(r - r^*)$ applied by the Good-Turing estimate is roughly constant for $r \ge 3$.
  • Ney et al. estimate the discount value $D$ from the total number of n-grams occurring exactly once ($n_1$) and twice ($n_2$): $D = \frac{n_1}{n_1 + 2 n_2}$
  • Based on this, $P_{\text{AbsoluteDiscount}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) - D}{c(w_{i-1})} + \lambda(w_{i-1})\,P(w)$
  • Kneser-Ney smoothing is a modified version of absolute discounting, where instead of the unigram probability $P(w)$ we use the continuation probability $P_{\text{continuation}}(w)$.
  • $P_{\text{continuation}}(w)$ answers the question "How likely is $w$ to appear as a novel continuation?", and $\lambda$ is a normalizing constant that redistributes the probability mass we have discounted.
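The slide does not spell these out, but the standard bigram Kneser-Ney definitions (as in Chen & Goodman) of the continuation probability and the normalizing constant are:

$$P_{\text{continuation}}(w) = \frac{\left|\{\, w' : c(w', w) > 0 \,\}\right|}{\left|\{\, (w', w'') : c(w', w'') > 0 \,\}\right|}, \qquad \lambda(w_{i-1}) = \frac{D}{c(w_{i-1})}\,\left|\{\, w : c(w_{i-1}, w) > 0 \,\}\right|$$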

Prediction Accuracy

  • The test corpus (30% of the corpus) is used to verify the prediction accuracy of the model.

  • The plot below shows the accuracy of the algorithm for different numbers of suggestions provided (a minimal evaluation sketch follows at the end of this slide).

(Plot: prediction accuracy vs. number of suggestions provided)

  • For a single suggestion the accuracy is about 13%.
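The accuracy measured here is top-k accuracy over held-out (prefix, next word) pairs. A minimal sketch of such an evaluation loop (an assumed helper, not the author's code) for any prediction function that returns k suggestions:

```r
# Fraction of test cases where the true next word appears among the
# k suggestions returned by predict_fn(prefix, k).
top_k_accuracy <- function(prefixes, next_words, predict_fn, k = 1) {
  hit <- mapply(function(p, w) w %in% predict_fn(p, k), prefixes, next_words)
  mean(hit)
}
```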

How to use the App

  • The app loads n-gram (n = 1-4) probability tables at startup, where each probability table maps an n-gram to its corresponding Kneser-Ney probability.
  • The algorithm accepts a string and uses its last words to look up the highest-order n-gram and predict the next word; the word with the highest KN probability is returned. When no exact match is found, the algorithm backs off to the (n-1)-gram table (a lookup sketch follows this list).
  • The user can also specify the maximum number of predictions wanted from the algorithm; these are returned sorted by their KN probability.
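A minimal sketch of this backoff lookup (the table layout, column names, and function name below are assumptions for illustration, not the app's actual code):

```r
# Each ngram_tab[[n]] is assumed to be a data.frame with columns:
#   prefix  - the first n-1 words of the n-gram
#   word    - the candidate next word
#   kn_prob - its Kneser-Ney probability
predict_next <- function(input, ngram_tab, max_pred = 3) {
  tokens <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  for (n in 4:2) {                              # try the 4-gram table first
    if (length(tokens) < n - 1) next
    prefix <- paste(tail(tokens, n - 1), collapse = " ")
    hits   <- ngram_tab[[n]][ngram_tab[[n]]$prefix == prefix, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$kn_prob), ]      # sort by KN probability
      return(head(hits$word, max_pred))
    }
  }
  # no prefix matched: fall back to the most probable unigrams
  uni <- ngram_tab[[1]][order(-ngram_tab[[1]]$kn_prob), ]
  head(uni$word, max_pred)
}
```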