App to Predict the next word

Piyush Neupane
06/01/2016

Coursera Data Science Capstone project

What does the App do?

This App does the following:

- takes any word or phrase as input and attempts to predict the next word in the sequence.
- outputs the 10 words with the largest Backoff Score, sorted by their likelihood.
- shows the user input broken down into tokens, as used by the App.

Access the App here: https://piyush.shinyapps.io/shinyapp/

The App might take some time to load initially. Once loaded, it should respond quickly.

Layout of the App

Here is how the App looks: https://piyush.shinyapps.io/shinyapp/

Next, we will talk about how the Prediction algorithm was built.

Step 1: Create n-gram models

The corpus provided in the course was used, and a 20% sample (about 700k records) was extracted from it.
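The sampling step can be sketched as follows. This is a minimal illustration in Python, not the App's own code (the App is a Shiny app and its sampling code is not shown here). The function name `sample_lines`, the fixed seed, and the stand-in corpus are all assumptions made for the example.

```python
import random

def sample_lines(lines, fraction=0.2, seed=42):
    """Keep each line independently with probability `fraction`.

    The seed is an assumption; fixing it makes the sample reproducible.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

# Stand-in for the real corpus: one record per line.
corpus = [f"record {i}" for i in range(10_000)]
sample = sample_lines(corpus)
print(len(sample), "of", len(corpus), "records kept")
```

Sampling line by line keeps the memory footprint small, because the full corpus never has to be tokenized before it is reduced.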

For the English tweets, blogs and news text in the corpus:
- only alphabetic characters were kept (all punctuation and numeric characters were removed)
- profanity was removed
- common stopwords such as 'a', 'the', 'I', etc. were removed
- extraneous spaces were removed
- five-grams, four-grams, trigrams, bigrams and unigrams were computed
- for each n-gram, the last word was set aside as the potential predicted word, and the remaining words were used to match against the user input
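The cleaning and n-gram steps above can be sketched roughly as follows. This is an illustrative Python version under stated assumptions, not the App's actual code: the tiny `STOPWORDS` and `PROFANITY` sets are hypothetical placeholders, lowercasing is assumed (the text does not say whether case was folded), and the helper names `clean` and `ngram_table` are invented for the example.

```python
import re
from collections import Counter

# Hypothetical placeholder lists; the App's real lists are not given.
STOPWORDS = {"a", "the", "i"}
PROFANITY = {"darn"}

def clean(text):
    """Lowercase (assumed), keep letters only, drop stopwords and profanity.

    Splitting on whitespace also removes extraneous spaces.
    """
    letters_only = re.sub(r"[^a-z ]", " ", text.lower())
    return [w for w in letters_only.split()
            if w not in STOPWORDS and w not in PROFANITY]

def ngram_table(token_lists, n):
    """Count n-grams, keyed as (context words, last word).

    The last word is the candidate prediction; the context is what
    gets matched against the user input.
    """
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            counts[(tuple(gram[:-1]), gram[-1])] += 1
    return counts

docs = ["I love the New York Yankees!", "We love New York pizza, 2 slices."]
tokens = [clean(d) for d in docs]
bigrams = ngram_table(tokens, 2)
print(tokens[0])                      # ['love', 'new', 'york', 'yankees']
print(bigrams[(("new",), "york")])    # 2
```

Storing each n-gram as a (context, last word) pair means a prediction is a plain lookup on the context, which is the split described in the last bullet.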

Step 2: Predict the next word

Given the size of the dataset, the Stupid Backoff algorithm was used. It is less resource-intensive than alternatives such as Katz's back-off model or Kneser-Ney smoothing, and on large datasets its prediction quality is comparable to that of those more intensive models.

The algorithm starts with five-gram Stupid Backoff.
The user input is broken down into tokens and compared against the five-gram table. If the token sequence is not found, the algorithm backs off to the four-gram table, and so on, defaulting to unigrams. The Backoff score of a candidate word is its relative frequency after the matched token sequence, multiplied by a fixed discount factor for each back-off step.
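A minimal Python sketch of this back-off lookup is given below. It follows the standard Stupid Backoff definition (Brants et al., 2007) rather than the App's own code, which is not shown here. The discount `ALPHA = 0.4` is the value suggested in that paper and is an assumption about the App; the helper names `build_tables`, `context_count`, and `predict` are hypothetical.

```python
from collections import Counter, defaultdict

ALPHA = 0.4   # assumed discount per back-off step (value from Brants et al.)
MAX_N = 5     # start from five-grams, as described above

def build_tables(token_lists, max_n=MAX_N):
    """tables[n][context][word] = count of the n-gram `context + (word,)`."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for tokens in token_lists:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                tables[n][gram[:-1]][gram[-1]] += 1
    return tables

def context_count(tables, context):
    """Count of the context itself; for the empty context, the corpus size."""
    if not context:
        return sum(tables[1][()].values())
    return tables[len(context)][context[:-1]][context[-1]]

def predict(tables, tokens, top=10, alpha=ALPHA, max_n=MAX_N):
    """Rank candidate next words, trying the longest context first."""
    scores = {}
    discount = 1.0
    for n in range(max_n, 0, -1):
        if n - 1 > len(tokens):
            continue  # input too short for this order; not a back-off step
        context = tuple(tokens[len(tokens) - (n - 1):])
        followers = tables[n].get(context)
        if followers:
            total = context_count(tables, context)
            for word, count in followers.items():
                # Keep the score from the highest order that saw the word.
                scores.setdefault(word, discount * count / total)
        discount *= alpha
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]

docs = [["love", "new", "york", "yankees"],
        ["we", "love", "new", "york", "pizza"],
        ["new", "york", "pizza"]]
tables = build_tables(docs)
print(predict(tables, ["visit", "new", "york"], top=2))
```

In this toy example "visit new york" never occurs, so the four-gram lookup fails and the trigram context "new york" is used with one discount applied: "pizza" scores 0.4 × 2/3 and "yankees" scores 0.4 × 1/3. The scores are relative frequencies with a discount, not normalized probabilities, which is what keeps the method cheap.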

References

A Collection of NLP notes: https://gist.github.com/ttezel/4138642

Coursera Stanford Natural Language Processing: https://www.coursera.org/course/nlp

Speech and Language Processing. Daniel Jurafsky & James H. Martin. https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf

Basic Text Mining in R: https://rstudio-pubs-static.s3.amazonaws.com/31867_8236987cf0a8444e962ccd2aec46d9c3.html