Next Word Prediction App

Dave Hurst
12-Dec-2014

https://dsdaveh.shinyapps.io/ShinyText/
Type in a phrase and leave off the last word (e.g. “It would make me feel better if you opened the”), then click 'Predict'
The phrase is processed and redisplayed to the right
After the first word, the app tries to predict each next word
– Green words are correct guesses
– Red words were not the first guess
The score is the total correct guesses divided by the number of guesses
The best 5 predictions for the final word are shown in the bar chart
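
The scoring rule described above can be sketched in a few lines of R. This is only an illustration: `phrase_score` and the stand-in predictor `always_the` are hypothetical names, not functions from the app.

```r
# Hypothetical sketch: predict every word after the first and count the hits.
phrase_score <- function(words, predict_fn) {
  stopifnot(length(words) >= 2)
  hits <- logical(length(words) - 1)
  for (i in 2:length(words)) {
    guess <- predict_fn(words[seq_len(i - 1)])  # predict from the preceding words
    hits[i - 1] <- identical(guess, words[i])   # TRUE = green, FALSE = red
  }
  # Score = correct guesses / number of guesses
  list(hits = hits, score = sum(hits) / length(hits))
}

# Stand-in predictor (an assumption for illustration): always guesses "the".
always_the <- function(context) "the"

res <- phrase_score(c("open", "the", "door", "the"), always_the)
res$score
```

Here two of the three guesses match, so the score is 2/3.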


The prediction app uses N-grams with a simple backoff strategy.
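
A minimal sketch of what a simple backoff lookup might look like, assuming the N-gram data is available as a named list from context to ranked candidates. The names `ngram_table`, `predict_backoff`, and `fallback` are hypothetical, and the app's actual data structure (`pmat.s`) may differ.

```r
# Hypothetical lookup table: context (space-joined words) -> candidate next
# words, ordered from most to least probable.
ngram_table <- list(
  "opened the" = c("door", "window"),
  "the"        = c("first", "same")
)

# Simple backoff: try the longest context first; on a miss, drop the leading
# word and retry. If nothing matches, return a fixed fallback term.
predict_backoff <- function(words, tbl, fallback, max_n = 5) {
  context <- utils::tail(words, max_n)
  while (length(context) > 0) {
    key <- paste(context, collapse = " ")
    candidates <- tbl[[key]]
    if (!is.null(candidates)) return(candidates[1])
    context <- context[-1]
  }
  fallback
}

guess <- predict_backoff(c("you", "opened", "the"), ngram_table, fallback = "the")
guess
```

The 3-gram "you opened the" is not in the table, so the lookup backs off to "opened the" and returns its top candidate.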

Data Preparation Steps:

Sample data from the provided texts (10% was used)
Data cleansing (punctuation, case, etc.)
Create Term Document Matrices for 1- to 6-grams
Create probability matrices for 1- to 5-grams
- row = N-grams, columns = known terms
- cell value is probability that term will follow the N-gram
Keep the 10 highest probabilities for each term
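
The probability-matrix and top-k steps can be illustrated on a toy example with 1-gram contexts in base R. This is a sketch under simplifying assumptions (a tiny token vector, no Term Document Matrix); `tokens`, `pmat`, and `top_k` are hypothetical names, not the app's objects.

```r
tokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran")

# Pair each word (a 1-gram context) with the word that follows it.
context <- head(tokens, -1)
nxt     <- tail(tokens, -1)

# Rows = N-grams (here 1-grams), columns = known terms, cells = counts.
counts <- unclass(table(context, nxt))

# Row-wise probabilities: P(term follows N-gram).
pmat <- prop.table(counts, margin = 1)

# Keep only the k most probable following terms for each row.
top_k <- function(pmat, k) {
  out <- lapply(rownames(pmat), function(r) {
    p <- pmat[r, ]
    head(sort(p[p > 0], decreasing = TRUE), k)
  })
  names(out) <- rownames(pmat)
  out
}

top <- top_k(pmat, k = 1)
top$the
```

In this toy corpus "the" is followed by "cat" twice and "mat" once, so the retained entry for "the" is "cat" with probability 2/3.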

Prediction Steps:

load(file=ptm.squashed.file)
phrase <- tcorpus[[1]]$content[56]
phrase
plotPhraseScore( pmat.s, phrase)

[1] "Behold the graphic design equivalent of saying \"white Hispanic\" :"


Memory and CPU limitations in processing text
- Solution: sample the text
- Limit number of terms and n-grams collected

MAX_NGRAMS <- 100000
MAX_TERMS  <- 15000

Saved matrices are too large to host on the Shiny server
- Solution: keep only the top few probabilities for each term (requires less than 1% of the original storage)

object_size(pmat.s)  #n-gram data [1-5]

35.1 MB

The app currently performs poorly, correctly predicting mainly prepositions
- Possible improvement: consolidate infrequent terms
- Possible improvement: use a word association algorithm and combine it with N-grams in an ensemble model
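
One way the idea of consolidating infrequent terms could be realized is to map every term below a frequency threshold to a single placeholder token before building N-grams. The sketch below is an assumption about how that might look, not the app's code; `consolidate_rare`, `min_count`, and the `<UNK>` placeholder are hypothetical.

```r
# Hypothetical sketch: replace terms seen fewer than min_count times
# with a shared placeholder token.
consolidate_rare <- function(tokens, min_count = 2, placeholder = "<UNK>") {
  freq <- table(tokens)
  rare <- names(freq)[freq < min_count]
  tokens[tokens %in% rare] <- placeholder
  tokens
}

cleaned <- consolidate_rare(c("the", "cat", "the", "zebra", "cat", "the"))
cleaned
```

Only "zebra" occurs once here, so it is the only term replaced; the frequent terms are left unchanged.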