Shaun D.
April 2018
This Shiny app is designed to provide an API to the materials developed under the Johns Hopkins/Coursera Data Science Capstone Course. The underlying code implements a natural language processing approach to predict the most likely next word for a given input string, based on learned outcomes modelled from a supplied corpus of data derived from Twitter, blogs, and news sources.
Katz back-off is a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram. It accomplishes this estimation by "backing off" to models with smaller histories under certain conditions. In this way, the model with the most reliable information about a given history is the one used, which yields better results.
The back-off approach will match an appropriate n-gram model (bigram, trigram, or quadgram) and use the maximum likelihood estimate to assess the conditional probability of the next word in the series, given by:
\[ P(w_n \mid w_1 \ldots w_{n-1}) = \frac{c(w_1 \ldots w_n)}{c(w_1 \ldots w_{n-1})} \]
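As an illustration of the MLE above, the sketch below (not the app's actual code; the count tables and function name are hypothetical) computes the conditional probability of a next word from raw n-gram counts:

```r
# Minimal sketch, assuming count tables keyed by space-joined n-grams,
# e.g. ngram_counts[["a case of"]] holds c(w1 ... wn).
mle_prob <- function(ngram_counts, history_counts, ngram, history) {
  c_full <- ngram_counts[[ngram]]      # c(w1 ... wn)
  c_hist <- history_counts[[history]]  # c(w1 ... w_{n-1})
  if (is.null(c_full) || is.null(c_hist) || c_hist == 0) return(0)
  c_full / c_hist
}

trigram_counts <- list("a case of" = 12)
bigram_counts  <- list("a case"    = 30)
mle_prob(trigram_counts, bigram_counts, "a case of", "a case")  # 0.4
```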
The underlying code implements the natural language processing approach detailed above to predict the most likely next word for a given input string, based on learned outcomes modelled from a supplied corpus of data. Model parameters were derived from a large corpus of text, which included the following sources (a sketch of how n-gram counts might be built from such a corpus follows the list):
News Sources
Blog Postings
Twitter Feeds
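As noted above, the following sketch illustrates one way n-gram frequency tables might be derived from raw text lines. The function name and the simple tokenization scheme are assumptions for illustration, not the course solution itself:

```r
# Count n-grams per line so no n-gram spans a line (sentence) boundary.
build_ngram_counts <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    tokens <- unlist(strsplit(tolower(line), "[^a-z']+"))  # crude tokenizer
    tokens <- tokens[tokens != ""]
    if (length(tokens) < n) return(character(0))
    vapply(seq_len(length(tokens) - n + 1), function(i) {
      paste(tokens[i:(i + n - 1)], collapse = " ")
    }, character(1))
  }))
  table(grams)  # frequency of each observed n-gram
}

corpus <- c("the cat sat on the mat", "the cat ran")
build_ngram_counts(corpus, 2)  # bigram frequencies; "the cat" appears twice
```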
To apply the model using the provided API, enter a character, string, or sentence fragment.
For each quadgram, trigram, and bigram in turn, the app will calculate the conditional probability of a word given its history, proportional to the maximum likelihood estimate (MLE) of that n-gram. If no matching n-gram is found, the back-off model will instead set the conditional probability equal to the back-off conditional probability of the (n - 1)-gram. The app will then return the word with the highest likelihood for the matched n-gram. In effect, this approach will predict the next word most likely to occur.
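A minimal sketch of that back-off chain appears below, assuming hypothetical count tables keyed by space-joined histories (R >= 4.0, so data.frame strings stay character); the deployed app's implementation may differ:

```r
# Try the quadgram model first, then back off to trigrams, then bigrams.
# counts_by_order[[n]] is assumed to be a data frame of observed n-grams
# with columns: history (n - 1 words), word (continuation), count.
predict_next <- function(input, counts_by_order) {
  words <- unlist(strsplit(tolower(input), "[^a-z']+"))
  words <- words[words != ""]
  for (n in 4:2) {
    cand <- counts_by_order[[n]]
    if (is.null(cand) || length(words) < n - 1) next   # back off
    history <- paste(tail(words, n - 1), collapse = " ")
    hits <- cand[cand$history == history, ]
    if (nrow(hits) > 0) {
      return(hits$word[which.max(hits$count)])  # highest-count continuation = MLE
    }
  }
  NA_character_  # no n-gram matched the input history
}

# Hypothetical tables for illustration: index n holds the n-gram table.
counts_by_order <- list(
  NULL,                                                    # unigrams unused here
  data.frame(history = "case",   word = "of", count = 9),  # bigrams
  data.frame(history = "a case", word = "of", count = 5),  # trigrams
  NULL                                                     # no quadgrams observed
)
predict_next("it was a case", counts_by_order)  # "of" (via the trigram "a case of")
```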
Katz's Back-off Model. Wikipedia.
https://en.wikipedia.org/wiki/Katz%27s_back-off_model
Speech and Language Processing. Daniel Jurafsky & James H. Martin. Ch. 4, N-Grams.
https://lagunita.stanford.edu/c4x/Engineering/CS-224N/asset/slp4.pdf
JHU/Coursera Data Science Capstone Course.
https://www.coursera.org/learn/data-science-project