Next Word Prediction App

Jose A. Arino
07 October 2016

The App predicts the next word of any incomplete phrase or sentence typed by the user.
There is no magic involved: predicting means finding the most likely words to continue the typed phrase. To do that, we apply an algorithm to a statistical distribution of text data.

The text data distribution (n-gram model)

We start with a sample of sentences from the HC Corpora, 10% of the full data set.
To estimate next-word probabilities, we need to work with word sequences. We chose sequences of 4 words or fewer. After processing the text with the “tm” package, we obtained a data set of 893,325 word sequence types, each with its frequency.
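
The counting step can be illustrated with a small sketch. It is written in Python for brevity, not with the “tm” package used for the App, and the whitespace tokenizer, the function name `count_ngrams` and the toy sentences are assumptions made only for this example.

```python
from collections import Counter

def count_ngrams(sentences, max_order=4):
    """Count every word sequence of length 1..max_order in the sentences."""
    counts = Counter()
    for sentence in sentences:
        # Assumption: lowercase and split on whitespace; the real cleaning
        # pipeline (punctuation, numbers, profanity, ...) is not shown here.
        tokens = sentence.lower().split()
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy data, invented for illustration only.
toy_sentences = ["the cat sat on the mat", "the cat ate"]
ngram_counts = count_ngrams(toy_sentences)
```

Each key of `ngram_counts` is one word sequence type and each value is its frequency, which is the shape of the data set described above.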

An algorithm takes as input the (n-1)-gram (3 words or fewer in our case) that ends the typed sentence. This step relies on the Markov assumption: looking at the most recent history is enough to estimate probabilities.
The n-grams whose preceding words match the input are selected.
For each end word in that selection, a conditional probability is computed. The end words and their conditional probabilities are the final output.
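
A minimal sketch of these three steps follows, assuming the n-gram counts are stored as a mapping from word tuples to frequencies. It uses the plain (unsmoothed) relative frequency as the conditional probability, so it is not the smoothed estimate the App finally uses. The name `candidates` and the toy counts are hypothetical.

```python
from collections import Counter

def candidates(counts, history, max_order=4):
    """Return {end_word: P(end_word | context)} from unsmoothed counts."""
    # Markov assumption: keep only the last max_order - 1 typed words.
    context = tuple(history[-(max_order - 1):])
    matches = {}
    context_total = 0
    for ngram, c in counts.items():
        # Select n-grams whose preceding words match the context exactly.
        if len(ngram) == len(context) + 1 and ngram[:-1] == context:
            matches[ngram[-1]] = c
            context_total += c
    # Relative frequency of each end word among the selected n-grams.
    return {w: c / context_total for w, c in matches.items()}

# Toy counts, invented for illustration only.
toy_counts = Counter({
    ("the", "cat", "sat"): 2,
    ("the", "cat", "ate"): 1,
    ("a", "cat", "sat"): 5,
})
probs = candidates(toy_counts, ["the", "cat"])
```

With these toy counts, only the two trigrams starting with "the cat" are selected, and "sat" gets twice the probability of "ate".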

The way the probability of each end word is obtained, including the treatment of sequences with zero counts (smoothing), defines each type of algorithm.
The Kneser-Ney (KN) algorithm has been selected. Despite its more complex smoothing method, it has slightly better precision and its results are true probabilities, not just scores.

After selecting the n-grams as above, an iterative weighted mean is computed, from the highest order down to the lowest.
The weights depend on a fixed discount factor, which takes a small amount of probability mass away from the MLE (maximum likelihood estimate) so that the combined result still sums to one.
The probability for the highest order is the conditional probability; for the lower orders, the so-called “continuation probabilities” are used.
The continuation probability is based on the number of distinct preceding word types that the end word follows.
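
To make the weighted mean concrete, here is a sketch of interpolated Kneser-Ney reduced to bigrams. The App works with sequences of up to 4 words and repeats the same step at each order; this simplification, the discount value of 0.75, the function name `kneser_ney_bigram` and the toy counts are all assumptions for illustration.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Build P(w | v) with interpolated Kneser-Ney smoothing (bigram case)."""
    context_totals = Counter()       # how often v occurs as a first word
    followers = defaultdict(set)     # distinct words seen after v
    predecessors = defaultdict(set)  # distinct words seen before w
    for (v, w), c in bigram_counts.items():
        context_totals[v] += c
        followers[v].add(w)
        predecessors[w].add(v)
    n_bigram_types = len(bigram_counts)

    def prob(w, v):
        # Continuation probability: share of bigram types that end in w.
        p_cont = len(predecessors[w]) / n_bigram_types
        total = context_totals[v]
        if total == 0:
            return p_cont  # unseen context: fall back to the lower order
        # Highest order: discounted relative frequency.
        discounted = max(bigram_counts.get((v, w), 0) - discount, 0) / total
        # Weight of the lower order: the mass removed by the discount.
        weight = discount * len(followers[v]) / total
        return discounted + weight * p_cont

    return prob

# Toy counts, invented for illustration only.
toy_bigrams = Counter({
    ("san", "francisco"): 5,
    ("new", "york"): 3,
    ("new", "car"): 1,
    ("red", "car"): 1,
    ("the", "car"): 1,
})
p = kneser_ney_bigram(toy_bigrams)
end_words = {w for (_, w) in toy_bigrams}
total_mass = sum(p(w, "new") for w in end_words)
```

In this toy example "francisco" is the most frequent end word overall, yet after "new" it scores below "car", because "francisco" follows only one word type while "car" follows three. The values for a given context sum to one, which is why the outputs can be read as probabilities.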

The App has a short execution time (which could be reduced further with continued development) and does not use much memory. These are key features, given its likely use on mobile devices. It also has good precision. Let's see the results of an evaluation based on an external test created at this site.

[Figure: results of the external test]

On the other hand, if we use a test set drawn from the same corpora as the training set, the precision is higher:
- top-1 precision = 15.08%
- top-3 precision = 24.57%
So, if the training set is of the same type as the text typed by the users, the precision increases. Taken to the limit, learning from the sentences typed by the users could be a future improvement.
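
For reference, top-k precision can be computed as sketched below: a test case counts as a hit when the word that actually followed appears among the first k predictions. The function `top_k_precision`, the stand-in predictor `toy_predict` and the test cases are hypothetical and do not reproduce the figures above.

```python
def top_k_precision(predict, test_cases, k):
    """Fraction of test cases whose true next word is in the top k predictions."""
    hits = sum(
        1 for history, actual in test_cases
        if actual in predict(history)[:k]
    )
    return hits / len(test_cases)

def toy_predict(history):
    # Stand-in for the real predictor: a ranked list keyed on the last word.
    table = {
        "cat": ["sat", "ate", "ran"],
        "dog": ["ran", "sat", "ate"],
    }
    return table.get(history[-1], [])

# Toy test set, invented for illustration only: (typed words, true next word).
toy_cases = [
    (["the", "cat"], "sat"),
    (["the", "dog"], "sat"),
    (["a", "cat"], "ran"),
    (["a", "bird"], "flew"),
]
top1 = top_k_precision(toy_predict, toy_cases, k=1)
top3 = top_k_precision(toy_predict, toy_cases, k=3)
```

Top-3 precision can never be lower than top-1 precision on the same test set, since every top-1 hit is also a top-3 hit.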

The App can be loaded from this link. It takes a few seconds.
You can type any incomplete phrase into the box and select the number of words for the prediction list.
After pressing Enter or the blue button, the results appear on the right side: a list of predicted next words. It contains the chosen number of options, with the words ordered by probability, from highest to lowest.