Coursera Capstone Project

Amanda Salvesen
06/18/2017

The following slides present a text prediction application created for the Coursera Data Science Specialization Capstone Project in June 2017.

Text Prediction Capstone Project

The text prediction application draws on a large corpus of blogs, news articles, and tweets to quickly and accurately predict the next word based on a given phrase.

To build this application, the designer:

  • Ingested a representative sample of the raw data, which can be downloaded from this link.
  • Cleaned the data by removing profanity, numbers, punctuation, and excess white space, and by converting the text to lower case.
  • Tokenized the text into unigrams, bigrams, trigrams, and quadgrams (groupings of one, two, three, and four words) and stored their counts in frequency dictionaries.
  • Developed a predictive model and tested its speed and accuracy.
  • Created a Shiny app to deploy this predictive model to users.
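
The cleaning and tokenizing steps above can be sketched roughly as follows. This is an illustration in Python, not the R code behind the app: the function names `clean_text` and `ngram_counts` and the `PROFANITY` placeholder list are hypothetical, and the regular expression assumes plain ASCII English text.

```python
import re
from collections import Counter

# Hypothetical placeholder list; the word list used by the actual app is not given here.
PROFANITY = {"darn", "heck"}

def clean_text(text):
    """Lower-case the text, drop digits and punctuation, remove listed words,
    and collapse runs of white space into single spaces."""
    text = text.lower()
    text = text.replace("'", "")            # "don't" becomes "dont" (a simplification)
    text = re.sub(r"[^a-z\s]", " ", text)   # keep only ASCII letters and white space
    words = [w for w in text.split() if w not in PROFANITY]
    return " ".join(words)

def ngram_counts(lines, n):
    """Count every run of n consecutive words across the cleaned lines."""
    counts = Counter()
    for line in lines:
        words = clean_text(line).split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

if __name__ == "__main__":
    sample = ["I like tea.", "I like coffee, too!"]
    print(ngram_counts(sample, 2).most_common(3))
```

Calling `ngram_counts` with `n` set to 1 through 4 would produce the four frequency dictionaries; n-grams are counted within a line only, so they never span two separate documents.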

The Predictive Model

The predictive function first cleans the input so that it matches the format of the loaded bigram, trigram, and quadgram frequency dictionaries. The model then uses these dictionaries to predict the next word. It starts by looking for a match on the last three words of the input in the quadgram dictionary. If no quadgram matches, or the user provides fewer than three words, the model backs off to the trigram dictionary, then to the bigram dictionary in the same way, and finally to the most common unigram.
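
A minimal sketch of this backoff lookup is shown below, in Python rather than the app's R. It assumes a hypothetical layout in which `tables` maps each n-gram order (4, 3, 2) to a dictionary from context words to follower counts; the name `predict_next` and that layout are illustrative, and input cleaning is reduced here to lower-casing and splitting on white space.

```python
def predict_next(phrase, tables, fallback):
    """Return the most frequent follower of the longest matching context.

    tables:   {order: {context tuple: {next word: count}}} for orders 4, 3, 2
              (an assumed layout, not the app's actual data structure)
    fallback: the most common unigram, returned when nothing matches
    """
    words = phrase.lower().split()
    for n in (4, 3, 2):                 # quadgrams first, then back off
        k = n - 1                       # number of context words needed
        if len(words) < k:
            continue                    # too little input for this order
        followers = tables.get(n, {}).get(tuple(words[-k:]))
        if followers:
            return max(followers, key=followers.get)
    return fallback

if __name__ == "__main__":
    # Toy counts, made up purely for illustration.
    tables = {
        4: {("one", "of", "the"): {"most": 5, "best": 3}},
        3: {("of", "the"): {"year": 4}},
        2: {("the",): {"same": 9}},
    }
    print(predict_next("It is one of the", tables, "the"))
    print(predict_next("end of the", tables, "the"))
```

In the toy data, the first call matches a quadgram directly, while the second finds no quadgram for "end of the" and falls through to the trigram context "of the".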

Click for further information on backoff models and their implementation in R.

Using the Shiny App

To use the application, the user enters any number of words in the data entry box (red) and clicks “Go!”. The app then displays the predicted word in the results box (green). The user can also view the cleaned input that was passed to the model.

Prediction App Usage

Visit https://asalvesen.shinyapps.io/Capstone/ to try it yourself!

For more information...