WordPredict

Raj Maddali
Sep 12, 2017

The simple GUI overlays a complex framework of pre-processed data.
Corpora are first processed with quanteda to produce n-grams of orders 1 through 4.
N-grams are hashed to reduce the memory footprint.
Low-frequency n-grams are excluded for performance.
Modified Kneser-Ney probabilities are computed on the compressed n-gram tables.
Reference: Stanley F. Chen and Joshua Goodman, "An empirical study of smoothing techniques for language modeling," Computer Speech & Language, 1999.
Final predictions are unhashed back to their English words for presentation.
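
The pipeline above can be sketched roughly as follows. The app itself is built on quanteda in R; this Python version is only an illustration. The function names (`build_tables`, `decode`), the `min_count` threshold, and the use of a word-to-integer lookup in place of the hashing step are all assumptions, not details taken from the app.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_tables(sentences, max_n=4, min_count=2):
    """Count n-grams of orders 1..max_n, keyed by integer word IDs."""
    vocab = {}  # word -> integer ID (assumed stand-in for the hashing step)
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()
        ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
        for n in range(1, max_n + 1):
            counts[n].update(ngrams(ids, n))
    # Exclude low-frequency n-grams (hypothetical threshold: min_count).
    pruned = {
        n: Counter({g: c for g, c in table.items() if c >= min_count})
        for n, table in counts.items()
    }
    id_to_word = {i: w for w, i in vocab.items()}
    return pruned, vocab, id_to_word

def decode(gram, id_to_word):
    """Map an n-gram of integer IDs back to its words for display."""
    return " ".join(id_to_word[i] for i in gram)
```

Storing small integers instead of repeated strings is what shrinks the tables; `decode` is the reverse lookup applied only to the final predictions.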

The Kneser-Ney algorithm is used to compute next-word predictions.
\[ P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max(N_{1+}(* w_{i-n+1}^{i}) - D, 0)}{N_{1+}(* w_{i-n+1}^{i-1} *)} \]
\[ + \frac{D}{N_{1+}(* w_{i-n+1}^{i-1} *)} N_{1+}(w_{i-n+1}^{i-1} *) P_{KN}(w_i \mid w_{i-n+2}^{i-1}) \]
where
\[ N_{1+}(* w_{i-n+1}^{i}) = |\{w_{i-n} : C(w_{i-n}^{i}) > 0\}| \]
\[ N_{1+}(w_{i-n+1}^{i-1} *) = |\{w_i : C(w_{i-n+1}^{i}) > 0\}| \]
\[ N_{1+}(* w_{i-n+1}^{i-1} *) = |\{(w_{i-n}, w_i) : C(w_{i-n}^{i}) > 0\}| = \sum_{w_i} N_{1+}(* w_{i-n+1}^{i}) \]
Here \(D\) is the discount and \(C(\cdot)\) is the raw n-gram count.
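
As a rough illustration of the interpolation, here is a minimal Python sketch for the bigram case only. It follows the Chen and Goodman convention of raw counts at the top order and continuation counts for the unigram distribution, and it uses a single fixed discount of 0.75. The function name, the fixed discount, and the restriction to bigrams are assumptions made for the example; the app's modified Kneser-Ney tables are not reproduced here.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Return p(word, prev): interpolated Kneser-Ney bigram estimates.

    bigram_counts maps (prev, word) tuples to positive counts.
    """
    context_total = Counter()     # C(prev): total count of bigrams starting with prev
    followers = defaultdict(set)  # distinct words seen after prev  -> N1+(prev *)
    preceders = defaultdict(set)  # distinct words seen before word -> N1+(* word)
    for (prev, word), c in bigram_counts.items():
        context_total[prev] += c
        followers[prev].add(word)
        preceders[word].add(prev)
    n_bigram_types = len(bigram_counts)  # N1+(* *)

    def p(word, prev):
        # Continuation probability: in how many distinct contexts does word appear?
        p_cont = len(preceders.get(word, ())) / n_bigram_types
        total = context_total.get(prev, 0)
        if total == 0:
            return p_cont  # unseen context: use the lower-order estimate alone
        discounted = max(bigram_counts.get((prev, word), 0) - discount, 0) / total
        backoff_weight = discount * len(followers[prev]) / total
        return discounted + backoff_weight * p_cont

    return p
```

The mass removed by the discount is exactly what `backoff_weight` hands to the lower-order distribution, so the estimates for a given context still sum to one over the words that have been seen following some context.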
Additionally, a simple variant of back-off maximum likelihood estimation (MLE) is used to estimate missing words within the Kneser-Ney framework.
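
The exact back-off variant is not specified here, so the following Python sketch shows only one plausible reading: try the longest available context first and fall back to shorter ones until a match is found. The function name `predict_backoff`, the `tables` layout (order mapped to a dictionary of n-gram tuples and counts), and the `top_k` parameter are hypothetical.

```python
def predict_backoff(context, tables, max_n=4, top_k=3):
    """Rank next-word candidates by relative frequency, backing off to shorter contexts.

    tables[n] maps n-gram tuples to counts (assumed layout).
    """
    context = tuple(context)
    # Start from the highest order the context can support, stop before unigrams.
    for n in range(min(max_n, len(context) + 1), 1, -1):
        history = context[len(context) - (n - 1):]
        matches = {g[-1]: c for g, c in tables.get(n, {}).items() if g[:-1] == history}
        if matches:
            # MLE over the retained n-grams that share this history.
            total = sum(matches.values())
            ranked = sorted(matches.items(), key=lambda kv: (-kv[1], kv[0]))
            return [(word, c / total) for word, c in ranked[:top_k]]
    # No context matched at any order: fall back to unigram frequencies.
    unigrams = tables.get(1, {})
    total = sum(unigrams.values())
    ranked = sorted(unigrams.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(gram[0], c / total) for gram, c in ranked[:top_k]]
```

Because low-frequency n-grams have been pruned, the denominator here is the total over retained continuations rather than the full history count, which is one reason a sketch like this is only an approximation.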

Enter your sentence in the text area and press the Predict button.
The right side of the screen displays the predicted word along with a word cloud of other possibilities.
https://rajm.shinyapps.io/WordPredict/