Data Science Capstone: Predictive Text Modeling using R

Hans P.
9-02-2020

Project Summary

The Data Science Capstone was the final course in the Data Science Specialization offered by Johns Hopkins University on Coursera. The purpose of this course was to create a predictive text model using R and deploy it with a Shiny app.

My App:

Allows the user to input an incomplete sentence in a text box
Predicts the most likely next word in the sentence
Returns that word along with a word cloud showing it and the other likely candidates

You can try the app yourself on shinyapps.io

Screenshot of the app

How it works

This app uses an n-gram modeling approach similar to a Katz back-off model.
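For reference, the textbook form of Katz back-off is sketched below. This is the standard definition, not necessarily what the app computes: the app is only described as "similar", so the discount d, the back-off weight α, and the count threshold k are part of the general model and are not taken from this project.

```latex
P_{bo}(w_i \mid w_{i-n+1} \cdots w_{i-1}) =
\begin{cases}
d_{w_{i-n+1} \cdots w_i} \, \dfrac{C(w_{i-n+1} \cdots w_i)}{C(w_{i-n+1} \cdots w_{i-1})}
  & \text{if } C(w_{i-n+1} \cdots w_i) > k \\[1ex]
\alpha_{w_{i-n+1} \cdots w_{i-1}} \, P_{bo}(w_i \mid w_{i-n+2} \cdots w_{i-1})
  & \text{otherwise}
\end{cases}
```

Here C(·) is the number of times a word sequence appears in the corpus. When the full n-gram has been seen often enough, its discounted relative frequency is used; otherwise the model "backs off" to the shorter context.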

Steps

Tokenize the input string into individual words
Create bi-, tri-, and four-grams from the last 2, 3, and 4 words of the input string
Try to match each of these against the 2-, 3-, and 4-grams in the corpus, and return the 'next' word in each case
Apply weights to the matched n-grams based on their length and how common they are in the corpus
Rank the predicted words and return them to the user
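The steps above can be sketched in base R roughly as follows. This is a minimal illustration, not the app's actual code: the toy corpus, the function names (tokenize, build_table, predict_next), and the weighting scheme (longer matches weighted more heavily, multiplied by relative frequency) are all assumptions made for the example.

```r
# Toy corpus standing in for the Twitter/news/blog data (hypothetical).
corpus <- c(
  "thanks for the follow",
  "thanks for the help",
  "thanks for the follow back",
  "looking forward to the weekend",
  "for the first time"
)

# Step 1: lowercase, replace non-letters with spaces, split on whitespace.
tokenize <- function(text) {
  text <- tolower(gsub("[^[:alpha:][:space:]']", " ", text))
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words[nzchar(words)]
}

# Count (context, next word) pairs, where the context is the n words
# that immediately precede the next word.
build_table <- function(docs, n) {
  rows <- lapply(docs, function(doc) {
    w <- tokenize(doc)
    if (length(w) <= n) return(NULL)
    idx <- seq_len(length(w) - n)
    data.frame(
      context = vapply(idx, function(i) paste(w[i:(i + n - 1)], collapse = " "),
                       character(1)),
      nextword = w[idx + n],
      stringsAsFactors = FALSE
    )
  })
  pairs <- do.call(rbind, rows)
  if (is.null(pairs)) {
    return(data.frame(context = character(0), nextword = character(0),
                      count = numeric(0), stringsAsFactors = FALSE))
  }
  pairs$count <- 1
  aggregate(count ~ context + nextword, data = pairs, FUN = sum)
}

# Steps 2-5: look up the last 2, 3, and 4 input words, weight each match by
# context length (assumed weights) times relative frequency, then rank.
predict_next <- function(input, tables,
                         length_weight = c(`2` = 1, `3` = 2, `4` = 4)) {
  words <- tokenize(input)
  scored <- lapply(names(tables), function(n_chr) {
    n <- as.integer(n_chr)
    if (length(words) < n) return(NULL)
    ctx <- paste(tail(words, n), collapse = " ")
    tbl <- tables[[n_chr]]
    hits <- tbl[tbl$context == ctx, , drop = FALSE]
    if (nrow(hits) == 0) return(NULL)
    data.frame(
      word = as.character(hits$nextword),
      score = length_weight[[n_chr]] * hits$count / sum(hits$count),
      stringsAsFactors = FALSE
    )
  })
  scored <- do.call(rbind, scored)
  if (is.null(scored)) return(character(0))
  totals <- aggregate(score ~ word, data = scored, FUN = sum)
  as.character(totals$word[order(-totals$score, totals$word)])
}

tables <- lapply(c(`2` = 2, `3` = 3, `4` = 4),
                 function(n) build_table(corpus, n))
predict_next("Thanks for the", tables)
```

On this toy corpus the ranked candidates come out as "follow", "help", "first". The first element is the predicted next word, and the full ranked list is the kind of input a word cloud could be drawn from. Unlike true Katz back-off, this sketch sums weighted scores across all matching context lengths and does no discounting.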

Some Limitations of the App

The app runs quickly and returns reasonable results in most cases. However, there are some limitations that should be addressed before it would be commercially viable.

The corpus used for modeling was built from tweets, news articles, and blogs. The app may not work as well for texting or writing emails, since the style of language there is quite different.
My model is kind of “stupid”. It will return good candidate words that fit the grammar and general context, but will rarely return anything too complex.
My model sometimes makes the error described in the discussion section of this Wikipedia article on Katz's back-off model.