Coursera Data Science Capstone Project: text prediction model

Marc Boulet
2018-02-05

Purpose

The purpose of the Coursera JHU Data Science Specialization capstone project is to build an app that predicts the next word based on a user's text input.

This app uses a text prediction model with the following features:

- Efficient, iterative generation of predicted words
- Real-time retrieval and display of predicted words
- User option to view predicted word statistics (frequency and probability)
- User option to display 1-7 predicted words

Data source and preparation

The source content for the prediction app was obtained from heliohost.org on September 30, 2016, with profanity filtered out.
The text prediction model uses 100% of the corpus, but filters out low-frequency word sequences so that the data is compact enough to fit into a Shiny app.
The corpus was tokenized (or split) into n-grams, or word chunks.
For instance, the sentence “many miles to go before I sleep” would be divided into multiple n-grams:
- bigram: “many miles”
- trigram: “many miles to”
- 4-gram: “many miles to go”
- 5-gram: “many miles to go before”
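
As a rough illustration of the tokenization and low-frequency filtering described above, here is a minimal Python sketch. It is not the app's actual code: the function names `ngrams` and `count_ngrams` and the `min_count` threshold are hypothetical, and splitting on whitespace is a simplifying assumption. Note that a sentence yields every n-gram of a given length, not only the leading one listed above.

```python
from collections import Counter

def ngrams(text, n):
    """Return all n-grams (as tuples of words) from a whitespace-split text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def count_ngrams(sentences, n, min_count=1):
    """Count n-grams over all sentences and drop those seen fewer than
    min_count times (hypothetical threshold, used to keep the table small)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}

sentence = "many miles to go before I sleep"
bigrams = ngrams(sentence, 2)    # first entry: ('many', 'miles')
trigrams = ngrams(sentence, 3)   # first entry: ('many', 'miles', 'to')
```

Raising `min_count` trades coverage for a smaller lookup table, which is the same compromise the low-frequency filter makes.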

Algorithm

The text prediction model ranks candidate next words from these n-grams by maximum likelihood estimation (MLE), that is, by how often each word followed the input text in the corpus.
The statistics used to rank the predictions (frequency and probability) can be viewed by selecting the Full output type in the app.
If no predictions are available for the input, the model falls back on a Stupid Backoff algorithm, which removes one word at a time from the input text until the requested number of predicted words can be displayed.
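
The ranking and backoff steps can be sketched as follows. This is an illustration under stated assumptions rather than the app's implementation: `build_model` and `predict` are hypothetical names, dropping the earliest word at each backoff step is assumed, and the fixed discount factor that Stupid Backoff is usually described with is omitted because the text above does not mention one.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_n=5):
    """Map each context (tuple of preceding words) to counts of the next word."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                model[context][words[i + n - 1]] += 1
    return model

def predict(model, text, k=3, max_n=5):
    """Return up to k (word, frequency, probability) triples.

    Probability is the MLE estimate count(context + word) / count(context).
    If the current context gives too few words, back off by dropping the
    earliest word (assumed behaviour) and try the shorter context.
    """
    context = tuple(text.lower().split()[-(max_n - 1):])
    results, seen = [], set()
    while context and len(results) < k:
        followers = model.get(context)
        if followers:
            total = sum(followers.values())
            for word, count in followers.most_common():
                if word not in seen:
                    results.append((word, count, count / total))
                    seen.add(word)
                if len(results) == k:
                    break
        context = context[1:]  # back off to a shorter context
    return results

corpus = ["many miles to go", "many miles to run", "many miles to go"]
model = build_model(corpus)
top = predict(model, "many miles to", k=2)
```

Here `k` plays the role of the app's slider for the number of predicted words, and the frequency and probability in each triple correspond to the statistics shown in the full output.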

App description

The app is designed for ease of use. Simply enter some text into the Input sentence box and the predicted words will appear on the right.

There are a few options to customize the text prediction model:

- simple output, which shows the predicted word(s) only
- full output, which also shows the frequency and probability of the predicted word(s)
- a slider, which adjusts the number of predicted words shown

For more information, click on the Documentation button.