Next-Word Prediction

Bernie Pilgram, PhD
bernie.pilgram@gmail.com

10 December 2014

Executive Summary

We present a prototype ShinyApp for next-word prediction to facilitate typing on mobile devices: (1) Unigrams, bigrams, and trigrams (16 MB in all), generated from 500,000 lines of blog, news, and Twitter texts, are used. (2) Modified Kneser-Ney smoothing is applied for probability estimation. (3) Possible next words are obtained from the estimated probabilities. (4) The app predicts the next word and offers a choice of the most probable next words.

Introduction - The n-gram Language Model

In an n-gram model, the word history is truncated to the preceding n-1 words, i.e., the conditional probability is

p(w_i | w_1, …, w_{i-1}) = p(w_i | w_{i-n+1}, …, w_{i-1}),

which results in sequences of n variables, given here for n = 1, 2, and 3:
1) Unigrams p(w_i) (i.i.d. process)
2) Bigrams p(w_i | w_{i-1}) (Markov process)
3) Trigrams p(w_i | w_{i-2}, w_{i-1}) (second-order Markov process)
An n-gram can be represented as a V-ary branching tree structure for vocabulary size V. To illustrate how language branches, see the dendrogram, from blog-text lines, depicted in Figure 1.

Figure 1: Word dendrogram showing that the number of n-grams can become very large even for a very small vocabulary V with only a few words.
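
To make the n-gram orders above concrete, here is a minimal base-R sketch (toy sentences, hypothetical; the app's own preprocessing is more involved) that produces unigram, bigram, and trigram counts:

text  <- c("one of the best", "one of the few", "a lot of fun")  # toy corpus
words <- strsplit(tolower(text), "\\s+")                         # tokenise each line

ngrams <- function(w, n) {                                       # all n-grams of one line
  if (length(w) < n) return(character(0))
  sapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = " "))
}

unigrams <- table(unlist(lapply(words, ngrams, n = 1)))
bigrams  <- table(unlist(lapply(words, ngrams, n = 2)))
trigrams <- table(unlist(lapply(words, ngrams, n = 3)))
sort(trigrams, decreasing = TRUE)                                # trigram counts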

Prediction - Modified Kneser-Ney Smoothing

Problem

a) For a vocabulary of size V, V^n parameters are needed to describe all n-grams, which causes storage, memory, and estimation problems. Thus, only a subset of all possible n-grams is used.
b) N-gram probabilities are estimated from relative counts in a training corpus. Because the training data are finite, the count c of a word w_a may be 0. The estimated probability is then zero in the unigram case and undefined in the bigram and trigram cases.
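
For example, if a word w_a never occurs in a corpus of N tokens, the unigram estimate is c(w_a)/N = 0, and the bigram estimate p(w_b | w_a) = c(w_a, w_b)/c(w_a) = 0/0 is undefined.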

Smoothing as solution

Three options: (1) add counts, (2) interpolate the target n-gram with lower-order distributions, (3) backoff: take probability mass from seen events and give it to unseen events. We use modified Kneser-Ney (KN) smoothing, which applies (2), for probability estimation, e.g., for trigrams:

P_KN(w_i | w_{i-2}, w_{i-1}) =
    (c(w_{i-2}, w_{i-1}, w_i) - D(c(w_{i-2}, w_{i-1}, w_i))) / c(w_{i-2}, w_{i-1})
    + g(w_{i-2}, w_{i-1}) · P_KN(w_i | w_{i-1})

where g(w_{i-2}, w_{i-1}) is chosen such that the distributions sum to 1 and D(c(·)) allows smaller discounts for smaller counts.
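
The following R sketch transcribes just this combination step; the discount function D(·), the normalising weight g, and the lower-order probability P_KN(w_i | w_{i-1}) are assumed to be computed elsewhere, and the numbers in the example call are made up.

# Interpolated trigram probability as in the formula above; the max(., 0)
# guard keeps the discounted term non-negative for unseen trigrams.
pkn_trigram <- function(c_tri, c_hist, D, g, pkn_lower) {
  max(c_tri - D(c_tri), 0) / c_hist + g * pkn_lower
}

# Hypothetical counts: trigram seen 5 times, its two-word history 20 times,
# fixed discount 0.75, backoff weight 0.4, lower-order P_KN = 0.1.
pkn_trigram(c_tri = 5, c_hist = 20, D = function(c) 0.75, g = 0.4, pkn_lower = 0.1)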

Next-word prediction

Among the n-grams whose first words match the typed history, the word w_i with the highest probability P_KN yields the next-word prediction.
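
As a minimal illustration (hypothetical P_KN values for the candidates following some two-word history):

pkn <- c(on = 0.21, to = 0.18, with = 0.12, at = 0.09)  # hypothetical P_KN values
prediction <- names(which.max(pkn))                     # predicted next word: "on"
choices    <- sort(pkn, decreasing = TRUE)[1:4]         # most probable choices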

Prediction Results - Modified KN Smoothing

Results produced

a) Start with preprocessed unigrams, bigrams, and trigrams. Here is a trigram example with counts.

        X1  X2  X3 count
400123 one  of the  3155
16595    a lot  of  2759

b) Calculate the probabilities P_KN of possible next words and obtain the most probable next word together with a set of choices. Here is an example of the next-word prediction for the bigram “Get started”.

[1] "Get started on"

Here are the four most probable next words, including the predicted word “on”.

     A      B      C      D     
word to     on     the    with  
P    0.1953 0.1736 0.1694 0.1344

Next steps

a) Use a subset of encoded “clean” Google n-grams (incl. 4-grams) for improved prediction and memory-space optimization.
b) Implement the method in C#, C++, or Java for faster processing.

Conclusion

The use of Modified KN Smoothing yields good results but does not capture long-term language dependencies.

ShinyApp Implementation - How It Works

1) Utilized R-packages

a) The Shiny library; the ShinyIncubator library to isolate all reactives except the “Submit” button and to show processing progress with withProgress; and the memoise library for caching results (a caching sketch follows).
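
As a minimal sketch of the caching idea with memoise (predict_slowly is a hypothetical stand-in for the app's lookup, not its actual code):

library(memoise)

predict_slowly <- function(phrase) {   # hypothetical stand-in for the n-gram lookup
  Sys.sleep(2)
  "on"
}
predict_cached <- memoise(predict_slowly)

predict_cached("Get started")          # slow on the first call
predict_cached("Get started")          # instant, served from the cache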

2) Parameter settings

a) Input text field (default is “Get started”) for next-word prediction.
b) Number n of words to be predicted (default is 1, maximum is 5).
c) Number m of next possible word choices per prediction (default is 4, minimum is 2, maximum is 6).
d) Submit button to start processing (a UI sketch follows this list).
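
A minimal, hypothetical Shiny UI for these settings (widget names and labels are assumed, not taken from the app's source):

library(shiny)

ui <- fluidPage(
  textInput("phrase", "Text",                        value = "Get started"),
  numericInput("n",   "Number of words to predict",  value = 1, min = 1, max = 5),
  numericInput("m",   "Number of next-word choices", value = 4, min = 2, max = 6),
  submitButton("Submit")                             # all reactives wait for this
)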

3) Processing

a) Load unigrams (100,000), bigrams (780,000), and trigrams (680,000).
b) Search for trigrams whose first two words match the end of the input. Calculate probabilities using modified Kneser-Ney smoothing; use the word with the highest probability as the prediction and the highest-probability words as possible choices.
c) If no trigrams are found, back off to bigrams and repeat the procedure as in b).
d) If no bigrams are found, use unigrams.
e) Display the next-word prediction and the possible choices combined from trigrams, bigrams, and unigrams, together with their probabilities.
f) Iterate to predict more than one word (a sketch of steps b)-d) follows).
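
A minimal sketch of the backoff search in steps b)-d), with toy tables and simple relative-frequency scores standing in for the app's Kneser-Ney estimates:

trigrams <- data.frame(w1 = c("get", "get"), w2 = c("started", "started"),
                       w3 = c("on", "with"), count = c(12, 7),
                       stringsAsFactors = FALSE)
bigrams  <- data.frame(w1 = c("started", "started"), w2 = c("on", "to"),
                       count = c(30, 25), stringsAsFactors = FALSE)
unigrams <- data.frame(w1 = c("the", "to", "on"), count = c(500, 300, 100),
                       stringsAsFactors = FALSE)

predict_backoff <- function(phrase, m = 4) {
  w <- tolower(strsplit(phrase, "\\s+")[[1]])
  n <- length(w)
  hits <- trigrams[trigrams$w1 == w[n - 1] & trigrams$w2 == w[n], ]   # step b)
  cand <- hits$w3
  if (nrow(hits) == 0) {                                              # step c)
    hits <- bigrams[bigrams$w1 == w[n], ]
    cand <- hits$w2
  }
  if (nrow(hits) == 0) {                                              # step d)
    hits <- unigrams
    cand <- hits$w1
  }
  p   <- hits$count / sum(hits$count)
  ord <- order(p, decreasing = TRUE)
  list(prediction = cand[ord[1]],
       choices    = setNames(p[ord], cand[ord])[seq_len(min(m, nrow(hits)))])
}

predict_backoff("Get started")   # predicts "on", with "with" as the alternative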