Next Word Prediction App

Dave Hurst
12-Dec-2014

The App

  • https://dsdaveh.shinyapps.io/ShinyText/
  • Type in a phrase and leave off the last word, e.g. “It would make me feel better if you opened the”, then click 'Predict'
  • The phrase is processed and redisplayed to the right
  • After the first word, the app tries to predict each next word
    – Green words are correct guesses
    – Red words were not the first guess
  • The score is the total correct guesses divided by the number of guesses
  • The best 5 predictions for the final word are shown in the bar chart
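
The scoring rule above can be illustrated with a small sketch. The helper name `phrase_score` and the example words are hypothetical and are not taken from the app's source; the sketch only shows the arithmetic of correct guesses divided by total guesses.

```r
# Hypothetical helper: fraction of words for which the top prediction was right.
phrase_score <- function(actual, predicted) {
  stopifnot(length(actual) == length(predicted))
  sum(actual == predicted) / length(actual)
}

# Words after the first word of the phrase, and a made-up first guess for each.
actual    <- c("would", "make", "me", "feel", "better")
predicted <- c("is",    "make", "me", "like", "better")

score <- phrase_score(actual, predicted)  # 3 correct guesses out of 5
```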

The Method

The prediction app uses N-grams with a simple backoff strategy.

Data Preparation Steps:

  • Sample data from the provided texts (10% was used)
  • Data cleansing (punctuation, case, etc.)
  • Create Term Document Matrices for 1- to 6-grams
  • Create probability matrices for 1- to 5-grams
    • row = N-grams, columns = known terms
    • cell value is probability that term will follow the N-gram
  • Extract the 10 highest probabilities for each term
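
The preparation steps above can be sketched in base R on a toy corpus. This is a minimal illustration, not the app's code: the slides describe Term Document Matrices, whereas this sketch swaps in a plain count table, and the names `ngram_pairs` and `top_k_probs` are hypothetical.

```r
# Toy corpus standing in for the sampled, cleaned text.
tokens <- strsplit("the cat sat on the mat the cat ate the fish", " ")[[1]]

# Pair every n-word prefix with the term that follows it.
ngram_pairs <- function(tokens, n) {
  stopifnot(length(tokens) > n)
  idx <- seq_len(length(tokens) - n)
  prefix <- vapply(idx, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                   character(1))
  data.frame(prefix = prefix, term = tokens[idx + n], stringsAsFactors = FALSE)
}

# Row-normalise the counts into P(term | prefix), then keep the k largest per prefix.
top_k_probs <- function(pairs, k = 10) {
  probs <- prop.table(table(pairs$prefix, pairs$term), margin = 1)
  out <- lapply(rownames(probs), function(p) {
    row <- setNames(as.vector(probs[p, ]), colnames(probs))
    head(sort(row[row > 0], decreasing = TRUE), k)
  })
  setNames(out, rownames(probs))
}

bigram_top <- top_k_probs(ngram_pairs(tokens, 1), k = 10)
bigram_top[["the"]]   # "cat" ranks first with probability 0.5
```

Storing only the top few entries per prefix, rather than a full row of the matrix, is what keeps the saved object small.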

The Method (cont'd)

Prediction Steps:

  • Apply data preparation to an input phrase
  • Find the longest N-gram that exists in the stored matrices
  • Find the term with the highest probability
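
A minimal sketch of the backoff lookup described above, assuming the stored probabilities are available as a named list keyed by prefix. The `model` contents, the function name `predict_next`, and the `fallback` word are hypothetical; the app's stored object (`pmat.s`) may be organised differently.

```r
# Hypothetical lookup table: prefix -> probabilities of the next term.
model <- list(
  "feel better" = c(soon = 0.6, now = 0.4),
  "better"      = c(than = 0.7, soon = 0.3)
)

predict_next <- function(phrase, model, max_n = 5, fallback = "the") {
  # Same kind of cleaning as in data preparation: lower case, strip punctuation.
  cleaned <- gsub("[[:punct:]]", "", tolower(trimws(phrase)))
  words <- strsplit(cleaned, "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(fallback)

  # Try the longest prefix first, then back off by dropping the leading word.
  for (n in min(max_n, length(words)):1) {
    prefix <- paste(tail(words, n), collapse = " ")
    probs <- model[[prefix]]
    if (!is.null(probs)) return(names(which.max(probs)))
  }
  fallback  # no stored N-gram matched
}

predict_next("I hope you feel better", model)  # matches the 2-gram "feel better"
predict_next("Much better", model)             # backs off to the 1-gram "better"
```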

load(file=ptm.squashed.file)
phrase <- tcorpus[[1]]$content[56]
phrase
plotPhraseScore( pmat.s, phrase)
[1] "Behold the graphic design equivalent of saying \"white Hispanic\" :"

Challenges

  • Memory and CPU limitations in processing Text
    • Solution: sample the text
    • Limit the number of terms and n-grams collected
MAX_NGRAMS <- 100000
MAX_TERMS  <- 15000
  • Saved matrices are too large to host on the Shiny server
    • Solution: keep only the top few probabilities for each term (required less than 1% of the original storage)
object_size(pmat.s)  #n-gram data [1-5]
35.1 MB
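
One way a term cap such as `MAX_TERMS` could be applied is to keep only the most frequent terms. The slides do not say how the limit was enforced, so selection by frequency is an assumption here, and the cap is set to 2 only to suit the toy data.

```r
# Toy token vector; the real cap in the slides is MAX_TERMS <- 15000.
tokens <- strsplit("the cat sat on the mat the cat ate the fish", " ")[[1]]
MAX_TERMS <- 2

# Assumption: rank terms by frequency and keep the MAX_TERMS most common.
freq <- sort(table(tokens), decreasing = TRUE)
keep <- names(head(freq, MAX_TERMS))

# Share of all tokens still covered by the reduced vocabulary.
coverage <- mean(tokens %in% keep)
```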

Observations and Next Steps

  • The app currently performs poorly, catching mainly prepositions
    • consolidate infrequent terms
    • use a word association algorithm and combine it with N-grams in an ensemble model