Coursera Data Science Capstone

Guido Gallopyn
4-26-2015

Text Prediction

Overview

  • This project was done entirely in R, using the tm, RWeka, hash, dplyr and ggplot2 packages. (Excellent LM toolkits are available from CMU and SRI, but I preferred to dig into R text processing and NLP.)

  • An S3 object was developed for NGram language models with arbitrary N.

  • With a 3-character prefix, an auto-complete feature based on this model predicts the right next word about 2 out of 3 times.

  • A Shiny app demo was built to showcase the Text Predictor.

Algorithm

Text prediction is the task of selecting the word \( w_{i} \) with maximum probability \( p \), using the left context \( w_{i-n+1}... w_{i-1} \) and a noisy channel observation \( o \) of the word \( w_{i} \) as predictors.

Via Bayes Rule we get:

\[ \hat{w} = \underset{w_{i} \in Words} {argmax}~\left(~p( o \mid w_{i} ) \cdot p( w_{i} \mid w_{i-n+1}... w_{i-1} )~\right) \]
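
The intermediate step can be spelled out as follows. This is a sketch under the usual noisy-channel assumption, not stated explicitly here, that the observation \( o \) depends only on the word \( w_{i} \) and not on the left context; \( h \) is shorthand introduced for this step only and stands for the context \( w_{i-n+1}... w_{i-1} \).

\[ \underset{w_{i} \in Words} {argmax}~p( w_{i} \mid o, h ) = \underset{w_{i} \in Words} {argmax}~\frac{p( o \mid w_{i}, h ) \cdot p( w_{i} \mid h )}{p( o \mid h )} = \underset{w_{i} \in Words} {argmax}~p( o \mid w_{i} ) \cdot p( w_{i} \mid h ) \]

The denominator \( p( o \mid h ) \) does not depend on \( w_{i} \), so it drops out of the maximization.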

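Let \( o \) be a \( prefix \) of \( w_{i} \), so that \( p( o \mid w_{i} ) \) is nonzero only for words that start with the prefix. Treating it as constant over those words, the maximization reduces to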

\[ \hat{w} = \underset{w_{i} \in \{w~:~isPrefix(prefix,w) , w \in Words\ \} } {argmax} ~p( w_{i} \mid w_{i-n+1}... w_{i-1} ) \]

\( p( w_{i} \mid w_{i-n+1}... w_{i-1}) \) : n-gram language model
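
The prefix-constrained argmax can be illustrated with a minimal Python sketch. The project itself was written in R, so this is not the S3 implementation described here: the function names `train_bigrams` and `predict`, the toy corpus, and the choice of a plain maximum-likelihood bigram model without smoothing or back-off are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(tokens):
    """Count bigrams: counts[prev][word] = times `word` followed `prev`."""
    counts = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        counts[prev][word] += 1
    return counts

def predict(counts, context, prefix="", k=5):
    """Return up to k (word, probability) pairs for words starting with
    `prefix`, ranked by the maximum-likelihood estimate p(word | previous word)."""
    followers = counts.get(context[-1], Counter()) if context else Counter()
    total = sum(followers.values())
    # Restrict the candidate set to words that match the observed prefix.
    candidates = [(w, c / total) for w, c in followers.items()
                  if w.startswith(prefix)]
    # Highest probability first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda wc: (-wc[1], wc[0]))
    return candidates[:k]

# Toy corpus, made up for this example.
tokens = "the cat sat on the mat and the cat ran to the car".split()
model = train_bigrams(tokens)

print(predict(model, ["on", "the"]))        # no prefix: history only
print(predict(model, ["on", "the"], "ca"))  # two letters observed
```

The probabilities are not renormalized over the prefix-matching words, since a constant factor does not change the ranking. An empty prefix corresponds to prefix length zero, where only the word history is used.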

Note: observations other than word prefixes are also in common use, including keyboard swipes, pen strokes in handwriting recognition, word images in OCR, and voice in speech recognition.

Language Model Training

  • As mentioned in the Milestone Report, the Capstone data was cleaned and tokenized with tm and RWeka

  • The full text corpus (~100M words) was split into training (90%) and evaluation (10%) sets. The training corpus was then sub-sampled into corpora of 10, 3 and 1 million words. The final models were trained on the 10 million word training corpus.

  • Model building and evaluation experiments were run to measure coverage, perplexity, prediction accuracy, LM size and prediction speed as a function of the word-frequency threshold K, the n-gram order N and the training corpus size.

  • Finally, “small”, “medium” and “large” language models were selected for use with the Shiny app.

  Model   N   K  uniGrams  biGrams  triGrams  Size.Mb
1 small   2  40     11784    19985         0     0.95
2 medium  3  20     17896    40661     16757     2.30
3 large   3   7     32285   110943     65020     5.90
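
How a word-frequency threshold interacts with n-gram counting can be sketched in Python. The exact way K is applied is not spelled out here, so this sketch assumes K is the minimum number of occurrences a word needs to enter the vocabulary, with rarer words mapped to a placeholder token. The names `UNK`, `build_vocab`, `count_ngrams` and `coverage`, and the reading of coverage as the fraction of in-vocabulary tokens, are likewise assumptions.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary words

def build_vocab(tokens, k):
    """Keep the words that occur at least k times in the training tokens."""
    freq = Counter(tokens)
    return {w for w, c in freq.items() if c >= k}

def count_ngrams(tokens, vocab, n):
    """Count all n-grams of order 1..n after mapping rare words to UNK."""
    mapped = [t if t in vocab else UNK for t in tokens]
    counts = {order: Counter() for order in range(1, n + 1)}
    for order in range(1, n + 1):
        for i in range(len(mapped) - order + 1):
            counts[order][tuple(mapped[i:i + order])] += 1
    return counts

def coverage(tokens, vocab):
    """Fraction of tokens that are in the vocabulary."""
    return sum(t in vocab for t in tokens) / len(tokens)

# Toy data, made up for this example.
tokens = "a b a c a b d".split()
vocab = build_vocab(tokens, k=2)
counts = count_ngrams(tokens, vocab, n=2)
print(sorted(vocab), len(counts[1]), len(counts[2]))
print(round(coverage(tokens, vocab), 3))
```

Under this assumption, raising K shrinks the vocabulary and therefore the number of distinct n-grams at every order, which matches the direction of the size trade-off in the table above.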

Performance Evaluation

[Figure: Accuracy plot]

Prediction accuracy (%) by word-prefix length (columns 0 to 4):

  Model   N      0      1      2      3      4
1 small   2   9.30  27.80  44.40  60.30  67.90
2 medium  3  11.90  30.90  47.30  63.40  70.70
3 large   3  13.20  32.80  49.10  65.00  72.60

Note: a prefix length of zero means the model uses only the previous N-1 words.

  • Accuracy was measured on a 1% sub-sample of the evaluation corpus (~ 100k words) for word-prefix lengths from 0 to 4

  • 10 bootstrap iterations were run to estimate a mean and a standard error of the prediction accuracy.

  • Observations and Conclusions

    • The best model for all prefix lengths is the large trigram model.
    • Using only the word history (prefix length zero), the best model reaches only ~13% accuracy, an absolute gain of ~4 percentage points over the small model.
    • Each of the first 3 observed letters increases accuracy by roughly 16 to 20 percentage points; after that the gain slows down, with the fourth letter adding only ~7 to 8 points.
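
The bootstrap estimate of accuracy and its standard error can be sketched in Python. The resampling scheme used in the experiments is not detailed here, so this sketch assumes the simplest variant: per-word outcomes (1 for a correct prediction, 0 otherwise) are resampled with replacement, and the spread of the resampled accuracies is taken as the standard error. The function name `bootstrap_accuracy` and the example data are made up.

```python
import random
import statistics

def bootstrap_accuracy(correct, iterations=10, seed=0):
    """Resample per-word outcomes with replacement and return
    (mean accuracy, standard error) over the bootstrap iterations."""
    rng = random.Random(seed)
    n = len(correct)
    estimates = []
    for _ in range(iterations):
        sample = rng.choices(correct, k=n)  # n draws with replacement
        estimates.append(sum(sample) / n)
    # The standard deviation of the bootstrap estimates approximates
    # the standard error of the accuracy.
    return statistics.mean(estimates), statistics.stdev(estimates)

# Made-up outcomes: 650 correct predictions out of 1000.
outcomes = [1] * 650 + [0] * 350
mean_acc, std_err = bootstrap_accuracy(outcomes)
print(f"accuracy = {mean_acc:.3f} +/- {std_err:.3f}")
```

With only 10 iterations the standard error itself is a rough estimate; more iterations would stabilize it at the cost of run time.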

Show-Case Application

[Figure: Shiny app screenshot]

A Shiny app demo was built to showcase the Text Predictor.

  • It takes a phrase (multiple words) as input in a text box and outputs the 5 most likely next words (best on top).

  • It provides an auto-complete demonstration (see screenshot) that simulates a mobile keyboard app, showing the top 5 predicted words as the user types. This allows for an evaluation of the typing performance improvement and gives a good sense of the prediction speed.

  • it provides a glimpse under the hood, with summary statistics of the underlying language models.