Guido Gallopyn
4-26-2015
Text Prediction
This project was done entirely in R using the tm, RWeka, hash, dplyr and ggplot2 packages. (Excellent LM toolkits are available from CMU and SRI, but I preferred to dig into R text processing and NLP.)
An S3 object was developed for NGram language models with arbitrary N.
With a 3-character prefix, an auto-complete feature based on this model can predict the right next word about 2 out of 3 times.
A Shiny app demo was built to showcase the Text Predictor.
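As an illustration of what an S3 object for n-gram models with arbitrary N might look like, here is a minimal sketch in base R. The class name `ngram_lm`, its fields and the toy token vector are hypothetical and are not taken from the project code; the project's own object is built with the packages listed above and may be organised quite differently.

```r
# Constructor: count all n-grams of order 1..N in a token vector.
# `ngram_lm`, `tokens` and the field names are hypothetical.
ngram_lm <- function(tokens, N = 3) {
  counts <- lapply(seq_len(N), function(n) {
    if (length(tokens) < n) return(table(character(0)))
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    table(grams)
  })
  structure(list(N = N, counts = counts), class = "ngram_lm")
}

# S3 print method, dispatched by print() on objects of class "ngram_lm".
print.ngram_lm <- function(x, ...) {
  cat("N-gram language model, N =", x$N, "\n")
  for (n in seq_len(x$N)) {
    cat(" ", n, "-grams:", length(x$counts[[n]]), "\n")
  }
  invisible(x)
}

model <- ngram_lm(c("to", "be", "or", "not", "to", "be"), N = 2)
print(model)
```

Storing one count table per order keeps N a plain parameter of the constructor, which is one simple way to support arbitrary N.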
Text prediction is the task of selecting the word \( w_{i} \) with maximum probability \( p \), using the left context \( w_{i-n+1}... w_{i-1} \) and a noisy channel observation \( o \) of the word \( w_{i} \) as predictors.
Via Bayes' rule we get:
\[ \hat{w} = \underset{w_{i} \in Words} {argmax}~\left( p( o \mid w_{i} ) \cdot p( w_{i} \mid w_{i-n+1}... w_{i-1} ) \right) \]
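Let \( o \) be a \( prefix \) of \( w_{i} \), so that \( p( o \mid w_{i} ) \) is zero for words that do not start with the prefix and is treated as constant for those that do; then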
\[ \hat{w} = \underset{w_{i} \in \{w~:~isPrefix(prefix,w) , w \in Words\ \} } {argmax} ~p( w_{i} \mid w_{i-n+1}... w_{i-1} ) \]
\( p( w_{i} \mid w_{i-n+1}... w_{i-1}) \) : n-gram language model
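The prefix-constrained argmax can be sketched in a few lines of base R. This is a minimal illustration, not the project's implementation: the table `lm_table`, the function `predict_next` and all probabilities are made up, and the sketch does not handle unseen contexts (the source does not say how those are treated).

```r
# Toy model table: each row is a context (the previous words),
# a candidate next word, and p(word | context).
# All values are invented for illustration.
lm_table <- data.frame(
  context = c("i love", "i love", "i love", "i love"),
  word    = c("you", "this", "that", "the"),
  prob    = c(0.40, 0.15, 0.10, 0.20),
  stringsAsFactors = FALSE
)

# Return the top-k words that follow `context` and start with `prefix`,
# ranked by conditional probability (highest first).
# An empty prefix keeps every candidate.
predict_next <- function(lm, context, prefix = "", k = 5) {
  cand <- lm[lm$context == context & startsWith(lm$word, prefix), ]
  cand <- cand[order(-cand$prob), ]
  head(cand$word, k)
}

predict_next(lm_table, "i love")        # context only
predict_next(lm_table, "i love", "th")  # the prefix "th" rules out "you"
```

Restricting the candidate set before ranking mirrors the second formula: the prefix acts as a hard filter, and the n-gram probability alone decides the order.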
Note: observations other than word prefixes are also in common use, including keyboard swipes, pen strokes in handwriting recognition, word images in OCR, and voice in speech recognition.
As mentioned in the Milestone Report, the Capstone data was cleaned and tokenized with tm and RWeka.
The full text corpus (~100M words) was split into training (90%) and evaluation (10%) sets. The training corpus was then sub-sampled into corpora of 10, 3 and 1 million words; the final models were trained on the 10-million-word training corpus.
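A 90/10 split followed by sub-sampling could look like the sketch below. It assumes the corpus is held as a character vector with one line of text per element and that sampling is done per line; the source does not state the sampling unit, and `corpus_lines`, `subsample` and the seed are hypothetical.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Stand-in corpus: one element per line of text (hypothetical data).
corpus_lines <- sprintf("line %d of the corpus", 1:1000)

# 90% of the lines go to training, the remaining 10% to evaluation.
n <- length(corpus_lines)
train_idx <- sample(n, size = round(0.9 * n))
train_set <- corpus_lines[train_idx]
eval_set  <- corpus_lines[-train_idx]

# Draw a smaller training corpus by sampling lines without replacement.
subsample <- function(lines, fraction) {
  lines[sample(length(lines), size = round(fraction * length(lines)))]
}
train_small <- subsample(train_set, 0.1)

length(train_set)
length(eval_set)
length(train_small)
```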
Model building and evaluation experiments were run to measure coverage, perplexity, prediction accuracy, LM size and prediction speed as a function of the word-frequency threshold K, the n-gram order N and the training corpus size.
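The source does not spell out how coverage and perplexity were computed. The sketch below uses the standard definition of perplexity (the exponential of the negative mean log-probability of the evaluation words) and one common reading of coverage (the share of evaluation tokens found in the model vocabulary). The function names and toy inputs are hypothetical.

```r
# Perplexity from the model probabilities assigned to each evaluation word.
# Probabilities must be strictly positive, otherwise log() is undefined.
perplexity <- function(probs) {
  stopifnot(all(probs > 0))
  exp(-mean(log(probs)))
}

# Coverage: fraction of evaluation tokens present in the vocabulary.
coverage <- function(tokens, vocab) {
  mean(tokens %in% vocab)
}

# A model that is uniform over 8 words has perplexity 8.
perplexity(rep(1 / 8, 20))

# Two of the four tokens are in the vocabulary.
coverage(c("a", "b", "c", "d"), vocab = c("a", "b"))
```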
Finally, a “small”, a “medium” and a “large” language model were selected for use with the Shiny app.
| Model | N | K | uniGrams | biGrams | triGrams | Size (Mb) |
|---|---|---|---|---|---|---|
| small | 2 | 40 | 11784 | 19985 | 0 | 0.95 |
| medium | 3 | 20 | 17896 | 40661 | 16757 | 2.30 |
| large | 3 | 7 | 32285 | 110943 | 65020 | 5.90 |
Prediction accuracy (%) by word-prefix length (columns 0 to 4):
| Model | N | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|
| small | 2 | 9.30 | 27.80 | 44.40 | 60.30 | 67.90 |
| medium | 3 | 11.90 | 30.90 | 47.30 | 63.40 | 70.70 |
| large | 3 | 13.20 | 32.80 | 49.10 | 65.00 | 72.60 |
Note: a prefix length of zero means the model only uses the previous N-1 words.
Accuracy was measured on a 1% sub-sample of the evaluation corpus (~100k words) for word-prefix lengths from 0 to 4.
Ten bootstrap iterations were run to estimate the mean and the standard error of the prediction accuracy.
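One way to obtain a bootstrap mean and standard error is sketched below. It assumes the evaluation produces one logical outcome per word (TRUE when the top prediction is correct) and resamples those outcomes with replacement; the exact resampling scheme used in the project is not stated, and `hits` and `bootstrap_accuracy` are hypothetical.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

# Hypothetical per-word outcomes: TRUE when the top prediction was correct.
hits <- runif(2000) < 0.65

# Resample the outcomes with replacement `iterations` times and
# summarise the resulting accuracies by their mean and standard deviation.
# The standard deviation of the replicates is the bootstrap standard error.
bootstrap_accuracy <- function(hits, iterations = 10) {
  acc <- replicate(iterations, mean(sample(hits, replace = TRUE)))
  c(mean = mean(acc), se = sd(acc))
}

res <- bootstrap_accuracy(hits)
res
```

With only ten iterations the standard error is itself a rough estimate; raising `iterations` tightens it at the cost of run time.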
Observations and Conclusions
A Shiny app demo was built to showcase the Text Predictor.
It takes a phrase (multiple words) as input in a text box and outputs the top-5 predicted next words (best on top).
It provides an auto-complete demonstration (see screenshot) that simulates a mobile keyboard app, offering the top-5 predicted words as users type. This allows for an evaluation of the typing performance improvement and gives a good sense of the prediction speed.
It provides a glimpse under the hood, with summary statistics of the underlying language models.