Coursera Data Science Capstone

Guido Gallopyn
4-26-2015

Text Prediction

Overview

This project was done entirely in R using the tm, RWeka, hash, dplyr and ggplot2 packages. (Excellent LM toolkits are available from CMU and SRI, but I preferred to dig into R text processing and NLP.)
An S3 object was developed for n-gram language models with arbitrary N. It supports:
- Katz back-off model training
- Prediction with BestScore and NBestScore calculations
- Performance evaluation, including Coverage, Perplexity and prediction Accuracy
- ARPA LM format import/export
The implementation makes extensive use of the hash package to efficiently store and access n-gram counts, probabilities, back-off weights, following-word lists and a word trie data structure.
With a 3-character prefix, an auto-complete feature based on this model predicts the correct next word 2 out of 3 times.
A Shiny app demo was built to showcase the Text Predictor.
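
As an illustration of how stored probabilities and back-off weights combine at lookup time, here is a minimal sketch in base R. It is not the project's code: it uses plain environments as a stand-in for the hash package, the table contents are made-up toy values, and the function name ngram_logprob is hypothetical. It covers only the lookup step of a back-off model, not Katz discounting or training.

```r
# Toy ARPA-style tables (hypothetical values): log10 probabilities and
# log10 back-off weights, keyed by the space-joined n-gram.
logprob <- new.env(hash = TRUE)
backoff <- new.env(hash = TRUE)

assign("the",     -1.0, envir = logprob)
assign("cat",     -2.0, envir = logprob)
assign("dog",     -2.5, envir = logprob)
assign("the cat", -0.5, envir = logprob)
assign("the",     -0.3, envir = backoff)

# Back-off lookup: use the longest n-gram stored in the table; otherwise
# add the history's back-off weight and retry with a shorter history.
ngram_logprob <- function(history, word) {
  key <- paste(c(history, word), collapse = " ")
  if (exists(key, envir = logprob, inherits = FALSE)) {
    return(get(key, envir = logprob))
  }
  if (length(history) == 0) {
    return(-Inf)  # word is not in the vocabulary
  }
  hkey <- paste(history, collapse = " ")
  # A history without a stored weight backs off with weight 0 (log10 of 1).
  bow <- if (exists(hkey, envir = backoff, inherits = FALSE)) {
    get(hkey, envir = backoff)
  } else {
    0
  }
  bow + ngram_logprob(history[-1], word)
}

ngram_logprob("the", "cat")  # bi-gram found directly
ngram_logprob("the", "dog")  # backs off to the uni-gram "dog"
```

Keying the tables by the joined n-gram string keeps each lookup a single hash access, which is the property the hash package is used for here.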

Algorithm

Text prediction is the task of selecting the word $ w_{i} $ with maximum probability $ p $, using the left context $ w_{i-n+1}... w_{i-1} $ and a noisy-channel observation $ o $ of the word $ w_{i} $ as predictors.

Via Bayes Rule we get:

\[ \hat{w} = \underset{w_{i} \in Words} {argmax}~\left(~p( o \mid w_{i} ) \cdot p( w_{i} \mid w_{i-n+1}... w_{i-1} )~\right) \]

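Let $ o $ be a $ prefix $ of $ w_{i} $. If $ p( o \mid w_{i} ) $ is taken to be the same for every word that starts with the prefix and zero for every other word, then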

\[ \hat{w} = \underset{w_{i} \in \{w~:~isPrefix(prefix,w) , w \in Words\ \} } {argmax} ~p( w_{i} \mid w_{i-n+1}... w_{i-1} ) \]

$ p( w_{i} \mid w_{i-n+1}... w_{i-1}) $ : n-gram language model

Note: observations other than word prefixes are in common use, including keyboard swipes, pen strokes in handwriting recognition, word images in OCR, and voice in speech recognition.
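
The prefix-constrained argmax can be sketched in a few lines of base R. This is only an illustration: the probabilities are invented, the history is assumed to be fixed already, and predict_words is a hypothetical helper name, not a function from the project.

```r
# Hypothetical conditional probabilities p(w | history) for one fixed history.
cond_prob <- c(the = 0.05, time = 0.20, team = 0.10, day = 0.30, test = 0.15)

# Return the n most probable words that start with the typed prefix.
# An empty prefix keeps every word, i.e. prediction from the history alone.
predict_words <- function(cond_prob, prefix = "", n = 1) {
  candidates <- cond_prob[startsWith(names(cond_prob), prefix)]
  ranked <- sort(candidates, decreasing = TRUE)
  names(head(ranked, n))
}

predict_words(cond_prob)               # best word with no prefix typed
predict_words(cond_prob, "t")          # best word starting with "t"
predict_words(cond_prob, "te", n = 5)  # up to 5 best words starting with "te"
```

Filtering before ranking mirrors the restricted argmax: words that do not match the prefix are never scored against the candidates that do.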

Language Model Training

As mentioned in the Milestone Report, the Capstone data was cleaned and tokenized with tm and RWeka.
The full text corpus (~100M words) was split into training (90%) and evaluation (10%) sets. The training corpus was then sub-sampled into 10, 3 and 1 million word corpora; the final models were trained on the 10 million word training corpus.
Model building and evaluation experiments were run to measure Coverage, Perplexity, Prediction Accuracy, LM size and prediction speed as a function of word-frequency threshold K, n-gram order N and training corpus size.
Finally, a “small”, a “medium” and a “large” language model were selected for use with the Shiny app:

	Model	N	K	uniGrams	biGrams	triGrams	Size.Mb
1	small	2	40	11784	19985	0	0.95
2	medium	3	20	17896	40661	16757	2.30
3	large	3	7	32285	110943	65020	5.90
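
Two of the measures named in this section, Coverage and Perplexity, together with the word-frequency threshold K, can be illustrated with a small base R sketch. The corpora below are toy data, the helper names are hypothetical, and the threshold is assumed to be inclusive (count >= K); the project may define these details differently.

```r
# Toy corpora (hypothetical); the real ones hold millions of words.
train_tokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat")
eval_tokens  <- c("the", "dog", "sat", "on", "the", "cat")

# Vocabulary: words seen at least K times in training.
# Assumption: the threshold is inclusive (count >= K).
build_vocab <- function(tokens, K) {
  counts <- table(tokens)
  names(counts)[as.vector(counts) >= K]
}

# Coverage: share of evaluation tokens found in the vocabulary.
coverage <- function(tokens, vocab) mean(tokens %in% vocab)

# Perplexity from per-word log10 probabilities: 10^(-mean(log10 p)).
perplexity <- function(log10_probs) 10^(-mean(log10_probs))

vocab <- build_vocab(train_tokens, K = 2)  # keeps "cat" and "the"
coverage(eval_tokens, vocab)               # 3 of the 6 tokens are covered
perplexity(log10(c(0.25, 0.25, 0.25)))     # uniform p = 1/4 gives 4
```

Raising K shrinks the vocabulary and the model size at the cost of coverage, which is the trade-off the three selected models span.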

Performance Evaluation

[Figure: Accuracy plot]

Prediction accuracy (%) by word-prefix length (0 to 4 characters):

	Model	N	0	1	2	3	4
1	small	2	9.30	27.80	44.40	60.30	67.90
2	medium	3	11.90	30.90	47.30	63.40	70.70
3	large	3	13.20	32.80	49.10	65.00	72.60

Note: a prefix length of zero means the model uses only the previous N-1 words.

Accuracy was measured on a 1% sub-sample of the evaluation corpus (~100k words) for word-prefix lengths from 0 to 4.
10 bootstrap iterations were run to estimate the mean and standard error of the prediction accuracy.

Observations and Conclusions

- The best model for all prefix lengths is the large tri-gram model.
- Using only the tri-gram word history (prefix length 0), the best model reaches only ~13% accuracy, a ~4% absolute increase over the small model.
- Each observed letter increases accuracy by roughly 16 to 20 percentage points, up to 3 letters; the fourth letter adds only about 7 to 8 points.
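
The bootstrap estimate of the accuracy mean and standard error used in this evaluation can be sketched as follows. The resampling scheme is an assumption (per-word outcomes resampled with replacement); the outcome vector is synthetic and the names are hypothetical.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Synthetic per-word outcomes: TRUE when the top prediction was correct.
correct <- rep(c(TRUE, FALSE), times = c(650, 350))

# One bootstrap iteration: resample the outcomes with replacement and
# compute the accuracy of the resample.
boot_accuracy <- function(correct) mean(sample(correct, replace = TRUE))

# 10 iterations, matching the number used in the evaluation.
acc <- replicate(10, boot_accuracy(correct))

mean_acc <- mean(acc)  # bootstrap estimate of the accuracy
se_acc   <- sd(acc)    # spread of the bootstrap estimates = standard error
```

With only 10 iterations the standard error itself is a rough estimate; more iterations would stabilize it at the cost of evaluation time.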

Show-Case Application

Shiny App Screen Shot

A Shiny app demo was built to showcase the Text Predictor.

It takes a phrase (multiple words) as input in a text box and outputs the top 5 predicted next words (best on top).
It provides an auto-complete demonstration (see screenshot) that simulates a mobile keyboard app, offering the top 5 predicted words as the user types. This allows an evaluation of the typing performance improvement and gives a good sense of the prediction speed.
It provides a glimpse under the hood, with summary statistics of the underlying language models.