Coursera Capstone Project

S. Phillips
7/2016

Project Goal

Create a Shiny app that performs next-word predictions for the Coursera Data Science Specialization Capstone project.

How to use

The application can be launched by clicking here. After a short initialization routine, begin typing in the Text Input box; word predictions render automatically in the lower panel.

Preparing the Text Data

The quanteda package was used to build a corpus, from which non-useful text was stripped.

The corpus was then loaded into document-feature matrices and ultimately converted into data frames of n-grams.

Sample sizes from each text file were analyzed to find the point beyond which additional data no longer improved prediction accuracy enough to justify the added processing time.
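
As a rough illustration of what a table of n-gram counts looks like, here is a minimal Python sketch. It is not the app's R/quanteda code; the `tokenize` and `ngram_counts` helpers and the sample sentences are made up for this example, and the cleaning rule (lowercase, letters and apostrophes only) is an assumption.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes (assumed cleaning rule)."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(lines, n):
    """Count n-grams as tuples of words, never crossing line boundaries."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical sample text, for illustration only.
sample = ["I like green eggs.", "I like ham, and I like eggs!"]
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)
print(bigrams.most_common(1))
```

Each key is an n-gram and each value is its frequency, which is the same information a data frame of n-grams with a count column would hold.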

The Prediction Algorithm

The following steps were used to balance response time with accuracy.

Good-Turing prediction with back-off was chosen for its efficiency.
N-grams were converted to data frames and stored as text files, which each Shiny session loads at initialization (about 10 seconds).
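
The back-off idea can be sketched as follows. This is a simplified Python illustration, not the app's implementation: Good-Turing discounting is left out and candidates are ranked by raw counts, and the `build_tables` and `predict` names and the tiny count tables are hypothetical. Each result carries a label naming the n-gram order that produced it.

```python
from collections import Counter, defaultdict

def build_tables(counts_by_order):
    """For each order n, map a context (first n-1 words) to a Counter of next words."""
    tables = {}
    for n, counts in counts_by_order.items():
        table = defaultdict(Counter)
        for gram, c in counts.items():
            table[gram[:-1]][gram[-1]] += c
        tables[n] = table
    return tables

def predict(words, tables, k=3):
    """Try the longest context first; back off to shorter ones until something matches."""
    for n in sorted(tables, reverse=True):
        context = tuple(words[-(n - 1):]) if n > 1 else ()
        if len(context) != n - 1:
            continue  # not enough typed words for this order
        candidates = tables[n].get(context)
        if candidates:
            return [(w, f"{n}-gram") for w, _ in candidates.most_common(k)]
    return []

# Hypothetical counts, for illustration only.
counts = {
    1: Counter({("the",): 5, ("of",): 3}),
    2: Counter({("thank", "you"): 4, ("thank", "goodness"): 1}),
    3: Counter({("happy", "new", "year"): 2}),
}
tables = build_tables(counts)
print(predict(["happy", "new"], tables))   # matched by the trigram table
print(predict(["say", "thank"], tables))   # backs off to the bigram table
print(predict(["zebra"], tables))          # backs off to the most common unigrams
```

Stopping at the first order that yields candidates keeps each lookup cheap, which is one way to trade a little accuracy for response time.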

Further information about the data is at http://www.rpubs.com/sydphi/188926

Additional Features

The algorithm also uses partial string matching to help complete the word currently being typed.
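
One simple form of partial matching is prefix completion against a word-frequency table. The Python sketch below assumes that interpretation; the source does not say which matching rule the app uses, and `complete_word` and the `vocab` counts are invented for illustration.

```python
def complete_word(partial, unigram_counts, k=3):
    """Return up to k known words starting with `partial`, most frequent first."""
    partial = partial.lower()
    matches = [(w, c) for w, c in unigram_counts.items() if w.startswith(partial)]
    # Sort by descending count, then alphabetically so ties are deterministic.
    matches.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in matches[:k]]

# Hypothetical word frequencies, for illustration only.
vocab = {"the": 50, "they": 12, "then": 12, "there": 20, "apple": 3}
print(complete_word("the", vocab))
```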

A second column in the app indicates what method rendered each result.

Thanks for trying my Shiny prediction app!