Shiny App: NLP Word Prediction

Shaun D.
April 2018

COURSERA - Data Science Specialization Capstone


Link to R Shiny App: NLP Word Prediction using Katz's Back-off Model

1. Summary

This Shiny app is designed to provide an interface to the materials developed under the Johns Hopkins/Coursera Data Science Capstone course. The underlying code implements a natural language processing approach to predict the most likely next word for a given input string, based on outcomes learned from a supplied corpus of text drawn from Twitter, blogs, and news sources.

2. Katz's Back-off Model

Katz back-off is a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram. It accomplishes this estimation by "backing off" to models with smaller histories under certain conditions. By doing so, the model with the most reliable information about a given history is used, which yields better results.

The back-off approach matches the longest applicable n-gram model (quadgram, trigram, or bigram) and uses the maximum likelihood estimate of the conditional probability of the next word in the series, given by:

\[ P(w_n \mid w_1 \ldots w_{n-1}) = \frac{c(w_1 \ldots w_n)}{c(w_1 \ldots w_{n-1})} \]
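As an illustration, below is a minimal sketch in R of this back-off cascade. The table and function names (`predict_next_word`, `quadgrams`, `trigrams`, `bigrams`, `unigrams`) are assumptions for illustration, not the app's actual code; the tables are assumed to be named numeric vectors mapping space-separated n-grams to corpus counts.

```r
# Illustrative sketch of a Katz-style back-off lookup (not the app's source).
# Each table is a named numeric vector: names are space-separated n-grams,
# values are corpus counts.

predict_next_word <- function(input, quadgrams, trigrams, bigrams, unigrams) {
  tokens <- tolower(unlist(strsplit(trimws(input), "\\s+")))
  n <- length(tokens)

  # Try the longest available history first (3 words -> quadgram table),
  # then back off to shorter histories.
  for (k in c(3, 2, 1)) {
    if (n < k) next
    history <- paste(tail(tokens, k), collapse = " ")
    tbl  <- list(bigrams, trigrams, quadgrams)[[k]]
    # Candidate n-grams whose first k words match the observed history.
    hits <- tbl[startsWith(names(tbl), paste0(history, " "))]
    if (length(hits) > 0) {
      # Since c(w1..wn-1) is fixed across candidates, the n-gram with the
      # highest count also has the highest MLE conditional probability.
      best <- names(hits)[which.max(hits)]
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  # No history matched at any order: fall back to the most frequent unigram.
  names(unigrams)[which.max(unigrams)]
}
```

Note that the denominator \(c(w_1 \ldots w_{n-1})\) is constant for a fixed history, so selecting the highest-count n-gram is equivalent to selecting the highest-MLE candidate.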

3. Data

As noted above, the model's parameters were derived from a large corpus of text, which included the following sources (a sketch of how n-gram tables might be built from such files follows the list):

  • News Sources

  • Blog Postings

  • Twitter Feeds
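The sketch below shows one plausible way to turn these files into the n-gram count tables used above. The file names and the 10% sampling fraction are illustrative assumptions, not the app's actual configuration, and the tokenization here is deliberately crude (it also ignores line boundaries when forming n-grams).

```r
# Illustrative sketch: building n-gram count tables from the corpus files.
# File names and the sampling fraction are assumptions.

build_ngram_tables <- function(files = c("en_US.twitter.txt",
                                         "en_US.blogs.txt",
                                         "en_US.news.txt"),
                               sample_frac = 0.1) {
  lines <- unlist(lapply(files, readLines, warn = FALSE))
  # Sample a fraction of the corpus to keep the tables manageable.
  lines <- sample(lines, floor(length(lines) * sample_frac))

  # Basic cleaning: lower-case, keep only letters, apostrophes, and spaces.
  text   <- tolower(lines)
  text   <- gsub("[^a-z' ]", " ", text)
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens <- tokens[tokens != ""]

  # Count n-grams of order n as space-separated keys.
  ngram_counts <- function(n) {
    if (length(tokens) < n) return(integer(0))
    grams <- vapply(seq_len(length(tokens) - n + 1),
                    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                    character(1))
    sort(table(grams), decreasing = TRUE)
  }
  lapply(setNames(1:4, c("unigrams", "bigrams", "trigrams", "quadgrams")),
         ngram_counts)
}
```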

4. Instructions

Input:

To apply the model using the provided interface, enter a character, string, or sentence fragment.

Output:

For each quadgram, trigram, and bigram in turn, the app calculates the conditional probability of a word given its history, proportional to the maximum likelihood estimate (MLE) for that n-gram. If the n-gram is unobserved, the back-off model instead uses the conditional probability of the (n - 1)-gram. The app then returns the word with the highest likelihood for the matched n-gram; in effect, this approach predicts the next word most likely to occur.
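For context, a minimal Shiny sketch of how such an interface could be wired is shown below. The widget names and the call to `predict_next_word()` (as sketched in Section 2) are illustrative assumptions, not the deployed app's source; the n-gram tables are assumed to be loaded into the session.

```r
# Minimal Shiny wiring for the predictor (illustrative sketch only).
library(shiny)

ui <- fluidPage(
  titlePanel("NLP Word Prediction"),
  textInput("phrase", "Enter a word, phrase, or sentence fragment:"),
  strong(textOutput("prediction"))
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)  # wait until the user has typed something
    # predict_next_word() is the back-off lookup sketched in Section 2.
    predict_next_word(input$phrase, quadgrams, trigrams, bigrams, unigrams)
  })
}

shinyApp(ui, server)
```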

References


  1. Katz's Back-off Model. Wikipedia.
    https://en.wikipedia.org/wiki/Katz%27s_back-off_model

  2. Speech and Language Processing. Daniel Jurafsky & James H. Martin. Ch. 4, N-Grams.
    https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf

  3. JHU/Coursera Data Science Capstone Course.
    https://www.coursera.org/learn/data-science-project


© Shaun D., 2018