Bernie Pilgram, PhD
bernie.pilgram@gmail.com
10 December 2014
We present a prototype Shiny app for next-word prediction to facilitate typing on mobile devices: (1) Unigrams, bigrams, and trigrams (16 MB in all), generated from 500,000 lines of blog, news, and Twitter text, are used. (2) Modified Kneser-Ney smoothing is applied for probability estimation. (3) Possible next words are obtained from the estimated probabilities. (4) The app predicts the next word and offers a choice of the most probable next words.
In an n-gram model, the conditional probability of a word given its history is approximated by truncating the history to the preceding n-1 words,

p(w_i | w_1, \ldots, w_{i-1}) \approx p(w_i | w_{i-n+1}, \ldots, w_{i-1}),

which results in sequences of n variables, given here for n = 1, 2, and 3:
1) Unigrams: p(w_i) (i.i.d. process)
2) Bigrams: p(w_i | w_{i-1}) (first-order Markov process)
3) Trigrams: p(w_i | w_{i-2}, w_{i-1}) (second-order Markov process)
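To make the three cases concrete, here is a minimal counting sketch in base R; the three-line toy corpus is purely illustrative and is not the data used by the app.

ngrams <- function(line, n) {
  w <- unlist(strsplit(tolower(line), "\\s+"))   # simple whitespace tokenizer
  if (length(w) < n) return(character(0))
  sapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = " "))
}
corpus   <- c("one of the best", "a lot of the time", "one of the few")  # toy lines
unigrams <- c(table(unlist(lapply(corpus, ngrams, n = 1))))   # named count vectors
bigrams  <- c(table(unlist(lapply(corpus, ngrams, n = 2))))
trigrams <- c(table(unlist(lapply(corpus, ngrams, n = 3))))
trigrams["one of the"]   # count 2 in this toy corpus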
An n-gram can be represented as a V-ary branching tree structure for vocabulary size V. To illustrate how language branches, see the dendrogram computed from blog-text lines in Figure 1.
Figure 1: Word dendrogram illustrating that the number of n-grams can become very large even for a very small vocabulary V of only a few words.
Problem
a) For a vocabulary of size V, V^n parameters are necessary to describe all n-grams, with storage, memory, and estimation problems as a consequence. Thus, only a subset of all possible n-grams is used.
b) N-gram probabilities are estimated from relative counts in a training corpus. Because the training data are finite, the count c of a word w_a may be 0. The estimated probability is then zero in the unigram case and, when w_a occurs in the conditioning history, undefined in the bigram and trigram cases (division by a zero count).
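A small illustration of b), reusing the toy count tables from the sketch above (not the app's data): the relative-count estimate breaks down for unseen words.

p_ml <- function(w_prev, w) {
  c_bi  <- bigrams[paste(w_prev, w)]     # NA if the bigram was never seen
  c_uni <- unigrams[w_prev]              # NA if the history word was never seen
  ifelse(is.na(c_bi), 0, c_bi) / c_uni
}
p_ml("of", "the")       # seen bigram: positive probability
p_ml("of", "zebra")     # unseen word: probability 0
p_ml("zebra", "the")    # unseen history: NA, i.e., undefined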
Smoothing as a solution
Three options: (1) add counts, (2) interpolate the target n-gram with lower-order distributions, (3) back off: take probability mass from seen events and give it to unseen events. We use modified Kneser-Ney (KN) smoothing, which applies option (2), for probability estimation, e.g., for trigrams:
P_{KN}(w_i | w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i) - D(c(w_{i-2}, w_{i-1}, w_i))}{c(w_{i-2}, w_{i-1})} + g(w_{i-2}, w_{i-1}) \, P_{KN}(w_i | w_{i-1}),

where g(w_{i-2}, w_{i-1}) is chosen such that the distributions sum to 1 and D(c(.)) allows smaller discounts for smaller counts.
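A simplified sketch of this interpolation on the toy tables above: a single fixed discount D stands in for the count-dependent discounts D(c(.)) of modified KN, and the lower-order term is a plain interpolated KN bigram rather than the full modified recursion.

# continuation probability of w: share of distinct bigram types that end in w
p_cont <- function(w) sum(endsWith(names(bigrams), paste0(" ", w))) / length(bigrams)

p_kn_bigram <- function(w1, w, D = 0.75) {
  c_bi  <- bigrams[paste(w1, w)];  c_bi <- ifelse(is.na(c_bi), 0, c_bi)
  c_uni <- unigrams[w1]
  if (is.na(c_uni) || c_uni == 0) return(p_cont(w))             # unseen history
  n_follow <- sum(startsWith(names(bigrams), paste0(w1, " ")))  # distinct continuations
  max(c_bi - D, 0) / c_uni + (D * n_follow / c_uni) * p_cont(w)
}

p_kn_trigram <- function(w2, w1, w, D = 0.75) {
  c_tri <- trigrams[paste(w2, w1, w)];  c_tri <- ifelse(is.na(c_tri), 0, c_tri)
  c_bi  <- bigrams[paste(w2, w1)]
  if (is.na(c_bi) || c_bi == 0) return(p_kn_bigram(w1, w, D))   # unseen history: bigram estimate
  n_follow <- sum(startsWith(names(trigrams), paste0(w2, " ", w1, " ")))
  max(c_tri - D, 0) / c_bi + (D * n_follow / c_bi) * p_kn_bigram(w1, w, D)
}

p_kn_trigram("one", "of", "the")   # e.g., P_KN(the | one, of) on the toy corpus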
Next-word prediction
Among the n-grams whose first words match the input, the word w_i with the highest probability P_{KN} is taken as the next-word prediction.
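In code this selection is just an argmax over the candidate probabilities; the values below are placeholders, not app output.

candidates <- c(on = 0.17, to = 0.16, the = 0.15)   # hypothetical P_KN values
prediction <- names(which.max(candidates))          # "on"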
Results produced
a) Start with preprocessed unigrams, bigrams, and trigrams. Here is a trigram example with counts:
        X1   X2   X3   count
400123  one  of   the   3155
16595   a    lot  of    2759
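A hedged sketch of this starting point; the file name and readRDS format are assumptions about how the precomputed tables might be stored, not necessarily what the app does.

trigrams_df <- readRDS("trigrams.rds")              # columns X1, X2, X3, count
head(trigrams_df[order(-trigrams_df$count), ], 2)   # most frequent trigrams, as above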
b) Calculate the probabilities P_{KN} of possible next words and obtain the most probable next word together with a set of choices. Here is an example of the next-word prediction for the bigram “Get started”.
[1] "Get started on"
Here are the four most probable next words, among them the predicted word “on”:
        A       B       C       D
word    to      on      the     with
P       0.1953  0.1736  0.1694  0.1344
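A sketch of how such a candidate table can be produced from the trigram data frame loaded above; candidates are ranked by relative counts here for brevity, whereas the app ranks by the KN estimate P_KN.

hits   <- subset(trigrams_df, X1 == "get" & X2 == "started")   # trigrams matching the input
hits$P <- hits$count / sum(hits$count)                         # simple relative counts
head(hits[order(-hits$P), c("X3", "P")], 4)                    # four most probable choices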
Next steps
a) Use a subset of encoded “clean” Google n-grams (including 4-grams) for improved prediction and reduced memory use.
b) Implement the method in C#, C++, or Java for faster processing.
Conclusion
The use of modified Kneser-Ney smoothing yields good results but does not capture long-range language dependencies.
1) Utilized R packages
a) The Shiny library; the ShinyIncubator library, used to isolate all reactive inputs except the “Submit” button (so that only Submit triggers recomputation) and to show processing progress with withProgress; and the memoise library for caching results (see the server-side sketch below).
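A minimal server-side sketch of these three patterns. It uses shiny::withProgress (the present-day home of that helper; the app itself relied on the ShinyIncubator version), and predict_next() is a placeholder for the n-gram lookup sketched under Processing below, assumed to return a small data frame of words and probabilities.

library(shiny)
library(memoise)

predict_next_cached <- memoise(function(text, m) predict_next(text, m))  # cache repeated queries

server <- function(input, output) {
  output$prediction <- renderText({
    input$submit                                      # re-run only when Submit is pressed
    withProgress(message = "Predicting next word...", value = 0.5, {
      res <- isolate(predict_next_cached(input$text, input$m))
      paste("Predicted next word:", res$word[1])
    })
  })
}

Combined with the UI sketch in the next subsection, shinyApp(ui, server) would run the app.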
2) Parameter settings
a) Input text field (default is “Get started”) for next-word prediction.
b) Number n of words to be predicted (default is “1”, maximum is “5”).
c) Number m of next-word choices offered for each prediction (default is “4”, minimum is “2”, maximum is “6”).
d) Submit button to start processing (a UI sketch follows this list).
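A hedged UI sketch of these controls; the widget choices and input IDs are assumptions, and the original layout may differ.

library(shiny)

ui <- fluidPage(
  textInput("text", "Enter text:", value = "Get started"),               # a)
  numericInput("n", "Words to predict:", value = 1, min = 1, max = 5),   # b)
  numericInput("m", "Number of choices:", value = 4, min = 2, max = 6),  # c)
  actionButton("submit", "Submit"),                                      # d)
  textOutput("prediction")
)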
3) Processing
a) Load unigrams (100,000), bigrams (780,000), and trigrams (680,000).
b) Search for trigrams whose first two words match the last two words of the input. Calculate probabilities using modified Kneser-Ney smoothing. Use the word with the highest probability as the prediction and the next most probable words as possible choices (a sketch of this backoff procedure follows the list).
c) If no trigrams are found, back off to bigrams and repeat the procedure as in b).
d) If no bigrams are found, fall back to unigrams.
e) Display the next-word prediction and the possible choices, combined from trigrams, bigrams, and unigrams, together with their probabilities.
f) Iterate to predict more than one word.
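A hedged sketch of steps b) through e). It assumes the n-grams are held in data frames laid out like the trigram example earlier (trigrams_df with columns X1, X2, X3, count, and analogous bigrams_df and unigrams_df); candidates are ranked by relative counts here, while the app ranks by P_KN.

predict_next <- function(text, m = 4) {
  w <- tail(unlist(strsplit(tolower(text), "\\s+")), 2)        # last (up to) two input words
  cand <- data.frame(word = character(0), count = numeric(0))
  if (length(w) == 2) {                                        # b) matching trigrams
    cand <- subset(trigrams_df, X1 == w[1] & X2 == w[2])[, c("X3", "count")]
  }
  if (nrow(cand) == 0) {                                       # c) back off to bigrams
    cand <- subset(bigrams_df, X1 == tail(w, 1))[, c("X2", "count")]
  }
  if (nrow(cand) == 0) {                                       # d) fall back to unigrams
    cand <- unigrams_df[, c("X1", "count")]
  }
  names(cand) <- c("word", "count")
  cand$P <- cand$count / sum(cand$count)                       # relative counts (app: P_KN)
  head(cand[order(-cand$P), c("word", "P")], m)                # e) prediction and choices
}

predict_next("Get started", m = 4)

Step f) would call this repeatedly, appending each predicted word to the input text.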