3D Next Word Predictor

An Interactive English Text Prediction Application

Ilias S. Konsoulas

General Description of 3D Next Word Predictor

  1. This application was developed with shiny and RStudio to fulfill the requirements of the Data Science Specialization Capstone Project offered by Coursera and SwiftKey.
  2. It predicts the next English word given a history of n previous words. For faster execution and memory economy, the maximum depth of the search history was set to 2, i.e. we predict the next word \(w_i\) using a context of the two previous words \(w_{i-1}\) and \(w_{i-2}\).
  3. We perform word prediction using three algorithms described in Chen & Goodman (1998) and Jurafsky & Martin (2008). Taking into account the context of the last two words, the app proposes the most probable next word based on the maximum estimated conditional probability \(P(w_i|w_{i-2},w_{i-1})\) (see the sketch after this list).
  4. The conditional probability estimators that I developed for this app are: Maximum Likelihood with Stupid Back-Off, Katz Smoothing, Kneser-Ney Smoothing, Absolute Discounting, and Jelinek-Mercer Smoothing, all described in full detail in Chen & Goodman (1998).
  5. However, in the final deployment, the Katz method was dropped due to its increased latency, and Absolute Discounting was abandoned because its results were very similar to those of interpolated Kneser-Ney smoothing.
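To illustrate the basic idea of predicting from a two-word context, the sketch below ranks candidate next words by the maximum-likelihood trigram probability \(P(w_i|w_{i-2},w_{i-1})\) over a toy count table. The object and column names (trigrams, w1, w2, w3, count) and the toy counts are assumptions made for this example, not the deployed app's code.

```r
# Illustrative maximum-likelihood trigram lookup (a sketch, not the app's code).
# Assumes a count data frame with columns w1, w2, w3 (words) and count (integer).
trigrams <- data.frame(
  w1 = c("i", "i", "new"), w2 = c("am", "am", "york"),
  w3 = c("happy", "here", "city"), count = c(3L, 1L, 5L),
  stringsAsFactors = FALSE
)

# P_ML(w_i | w_{i-2}, w_{i-1}) = c(w_{i-2}, w_{i-1}, w_i) / c(w_{i-2}, w_{i-1})
predict_ml <- function(h1, h2, n = 5) {
  ctx <- trigrams[trigrams$w1 == h1 & trigrams$w2 == h2, ]
  if (nrow(ctx) == 0) return(NULL)          # no trigram evidence for this history
  ctx$prob <- ctx$count / sum(ctx$count)    # normalise counts over the context
  head(ctx[order(-ctx$prob), c("w3", "prob")], n)
}

predict_ml("i", "am")   # most probable continuations of "i am"
```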

User Interface of 3D Next Word Predictor

  1. This app is called 3D Next Word Predictor because it uses three different N-gram models.
  2. In the top third of the app screen there is an input box where the user enters his/her English text.
  3. In the middle, the entered text is echoed, indicating that the app received the intended input.
  4. The bottom of the screen is divided into three columns, one per method, where the app provides:
    • The next word predicted by each method, accompanied by its conditional probability.
    • The next four most probable words accompanied by their conditional probabilities.
    • In total, the app provides 15 word predictions for each new word inserted.
  5. Usually, many of the suggested words are common to all three methods; a minimal shiny layout sketch follows below.
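The shiny skeleton below mirrors the three-part layout just described (input box, echoed text, three method columns). The widget IDs, column headings, and the empty server logic are placeholders for illustration, not the app's actual source.

```r
# Minimal shiny skeleton with the layout described above (illustrative only).
library(shiny)

ui <- fluidPage(
  textInput("user_text", "Enter your English text:", width = "100%"),  # top third: input box
  verbatimTextOutput("echo"),                                          # middle: echoed input
  fluidRow(                                                            # bottom: one column per method
    column(4, h4("Kneser-Ney"),      tableOutput("kn")),
    column(4, h4("Jelinek-Mercer"),  tableOutput("jm")),
    column(4, h4("Stupid Back-Off"), tableOutput("sbo"))
  )
)

server <- function(input, output) {
  output$echo <- renderText(input$user_text)
  # output$kn, output$jm and output$sbo would render the three prediction tables here.
}

shinyApp(ui, server)
```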

References

  1. Chen S., Goodman J., "An Empirical Study of Smoothing Techniques for Language Modeling", TR-10-98, Aug. 1998, Computer Science Group, Harvard University, Cambridge, Massachusetts.
  2. Jurafsky D., Martin J.H., "Speech and Language Processing", 2nd Ed., Prentice Hall, 2008.

Mathematics of Interpolated Smoothing Techniques

  1. Jelinek-Mercer Smoothing [ref.1] \[ P_{interp} (w_i|w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}} P_{ML}(w_i|w_{i-n+1}^{i-1}) + (1-\lambda_{w_{i-n+1}^{i-1}}) P_{interp}(w_i|w_{i-n+2}^{i-1})\] with fixed \(\lambda_{w_{i-n+1}^{i-1}} = 0.6\). Here, the nth-order smoothed model is defined recursively as a linear interpolation between the nth-order maximum likelihood model and the (n-1)th-order smoothed model.
  2. Kneser-Ney Smoothing [ref.1] \[P_{KN} (w_i|w_{i-n+1}^{i-1}) = \frac {\max\{c(w_{i-n+1}^i)-D,0\}}{\sum\nolimits_{w_i}c(w_{i-n+1}^i)} + \frac {D} {\sum\nolimits_{w_i}c(w_{i-n+1}^i)} N_{1+}(w_{i-n+1}^{i-1}\bullet) P_{KN}(w_i|w_{i-n+2}^{i-1})\] recursion ending with \(P_{KN}(w_i) = \frac {N_{1+}(\bullet w_i)} {N_{1+}(\bullet\bullet)}\) and \(D=\frac{n_1}{n_1+2n_2}\).
  3. Stupid Back-Off [ref.2] \(S(w_i|w_{i-n+1}^{i-1}) = \begin{cases} \frac{c(w_{i-n+1}^i)} {c(w_{i-n+1}^{i-1})} & \mbox{if } c(w_{i-n+1}^i)>0 \\ \lambda S(w_i|w_{i-n+2}^{i-1}) & \mbox{otherwise} \end{cases}\) recursion ending with \(S(w_i) = \frac {c(w_i)}{N}\) and fixed \(\lambda = 0.4\). An illustrative R sketch of these three estimators follows below.
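The three formulas above can be mirrored almost line for line in R. The sketch below is a minimal, illustrative implementation over toy N-gram count tables; the table names (uni, bi, tri), the helper functions, and the toy counts are assumptions made for this example and are not the deployed app's data structures or code.

```r
# Illustrative R sketch of the three estimators over toy N-gram count tables.
# Counts are stored as named numeric vectors keyed by space-separated N-grams.
uni <- c("i" = 4, "am" = 4, "happy" = 2, "here" = 1, "sam" = 1)
bi  <- c("i am" = 4, "am happy" = 2, "am here" = 1, "am sam" = 1)
tri <- c("i am happy" = 2, "i am here" = 1, "i am sam" = 1)
N   <- sum(uni)                                   # total word tokens

cnt      <- function(tab, key) { x <- tab[key]; if (is.na(x)) 0 else unname(x) }
tab_for  <- function(h) if (length(h) == 2) tri else bi   # table holding c(h, w)
hist_tab <- function(h) if (length(h) == 2) bi  else uni  # table holding c(h)

## 1. Jelinek-Mercer: interpolate the ML estimate with the lower-order model (lambda = 0.6)
jm <- function(w, h, lambda = 0.6) {
  if (length(h) == 0) return(cnt(uni, w) / N)
  hist <- paste(h, collapse = " ")
  c_hw <- cnt(tab_for(h), paste(hist, w))
  c_h  <- cnt(hist_tab(h), hist)
  p_ml <- if (c_h > 0) c_hw / c_h else 0
  lambda * p_ml + (1 - lambda) * jm(w, h[-1], lambda)
}

## 2. Interpolated Kneser-Ney with D = n1 / (n1 + 2 n2) computed per N-gram order
kn <- function(w, h) {
  if (length(h) == 0) {                           # P_KN(w) = N1+(. w) / N1+(. .)
    ends <- sapply(strsplit(names(bi), " "), `[`, 2)
    return(sum(ends == w) / length(bi))
  }
  tab  <- tab_for(h)
  hist <- paste(h, collapse = " ")
  ctx  <- tab[startsWith(names(tab), paste0(hist, " "))]   # all seen continuations of h
  if (length(ctx) == 0) return(kn(w, h[-1]))                # no evidence: fall to lower order
  D    <- sum(tab == 1) / (sum(tab == 1) + 2 * sum(tab == 2))
  c_h  <- sum(ctx)
  max(cnt(tab, paste(hist, w)) - D, 0) / c_h +
    D / c_h * length(ctx) * kn(w, h[-1])          # length(ctx) = N1+(h .)
}

## 3. Stupid Back-Off: relative frequency if seen, otherwise back off with lambda = 0.4
sbo <- function(w, h, lambda = 0.4) {
  if (length(h) == 0) return(cnt(uni, w) / N)
  hist <- paste(h, collapse = " ")
  c_hw <- cnt(tab_for(h), paste(hist, w))
  c_h  <- cnt(hist_tab(h), hist)
  if (c_hw > 0) c_hw / c_h else lambda * sbo(w, h[-1], lambda)
}

c(jm  = jm("happy", c("i", "am")),
  kn  = kn("happy", c("i", "am")),
  sbo = sbo("happy", c("i", "am")))               # scores for continuing "i am" with "happy"
```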

Training Data are the Knowledge Base of the 3D Next Word Predictor

  • For this app we have used training data (i.e. N-grams and their counts) extracted from the provided .txt files originating from the HC Corpora collection.
  • The training data consist of three data frames saved in .rds format, namely unigrams.rds, bigrams.rds and trigrams.rds.
  • Each data frame contains all N-gram terms (as factors) along with their frequency counts (as integers). For example, each row of the bigrams data frame has the format \([w_{i-1}, w_{i}, c(w_{i-1},w_{i})]\).
  • Prediction accuracy and execution speed depend heavily on the size of the training set. The training set used here was built from the English text corpus by randomly selecting 1% of its lines.
  • All conditional probabilities and model parameters are computed online. Therefore, a short latency (~2 sec per new prediction set) is usual; a sketch of loading and querying these tables follows below.
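The sketch below shows how such training tables might be loaded and queried at run time. readRDS() and the file names come from the bullets above, while the column names and the sample query are assumptions made for this example.

```r
# Illustrative loading and lookup of the training tables (column names are assumed).
unigrams <- readRDS("unigrams.rds")   # columns assumed: w1, count
bigrams  <- readRDS("bigrams.rds")    # columns assumed: w1, w2, count
trigrams <- readRDS("trigrams.rds")   # columns assumed: w1, w2, w3, count

# Candidate continuations of the last two words typed by the user,
# ranked by raw trigram count (the probabilities are then computed online from these counts).
candidates <- function(h1, h2, n = 5) {
  hits <- trigrams[trigrams$w1 == h1 & trigrams$w2 == h2, ]
  head(hits[order(-hits$count), ], n)
}

candidates("new", "york")   # e.g. the top-5 trigram continuations of "new york"
```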