Coursera Data Science Capstone Project

Greg Bennett
June 14, 2018

Intro

The objective of the Coursera Data Science Specialization Capstone project was to build a working predictive text model. The model was built with data from a corpus called HC Corpora (https://www.corpora.heliohost.org).

Algorithm Development

The algorithm that predicts the next word in a user-entered text string is based on a classic N-gram model. [2] Maximum Likelihood Estimates (MLE) of unigram, bigram, and trigram probabilities were computed from a cleaned subset of the blogs, Twitter, and news Internet files.
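
As a rough illustration of the MLE step, the sketch below counts N-grams in a tiny made-up sentence and divides each count by the count of its preceding context. It is written in Python for brevity and is not the project's own code; the toy sentence and the helper names `ngram_counts` and `mle` are assumptions introduced only for this example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle(ngrams, contexts):
    """MLE of P(last word | preceding words) = count(n-gram) / count(context)."""
    return {gram: c / contexts[gram[:-1]] for gram, c in ngrams.items()}

# Toy corpus (an assumption); the real model used cleaned blog, Twitter, and news text.
tokens = "i like green eggs and i like ham".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

# A unigram has no context, so its MLE is simply count / total number of tokens.
unigram_mle = {gram: c / len(tokens) for gram, c in unigrams.items()}
bigram_mle = mle(bigrams, unigrams)    # P(w2 | w1)
trigram_mle = mle(trigrams, bigrams)   # P(w3 | w1, w2)
```

In this toy corpus "like" always follows "i", so `bigram_mle[("i", "like")]` is 1.0, while "like" is followed equally often by "green" and "ham", giving each a probability of 0.5.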

The Shiny Application

…An application was developed that accepts a phrase as input, suggests word completions from the unigrams, and predicts the most likely next word using a linear interpolation of trigram, bigram, and unigram probabilities. The web-based application can be found here.
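
The two behaviors described above, word completion from unigrams and next-word prediction by linear interpolation, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Shiny application's code: the toy corpus, the weights in `LAMBDAS`, the `<s>` padding token, and the function names `complete`, `interpolated_prob`, and `predict_next` are all hypothetical.

```python
from collections import Counter

# Assumed interpolation weights for trigram, bigram, and unigram terms; they sum to 1.
LAMBDAS = (0.6, 0.3, 0.1)

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def complete(prefix, uni):
    """Suggest the most frequent known word that starts with the typed prefix."""
    matches = [(c, w) for (w,), c in uni.items() if w.startswith(prefix)]
    return max(matches)[1] if matches else None

def interpolated_prob(word, context, uni, bi, tri, total):
    """Weighted sum of the trigram, bigram, and unigram MLE probabilities."""
    w1, w2 = context
    p_uni = uni[(word,)] / total
    p_bi = bi[(w2, word)] / uni[(w2,)] if uni[(w2,)] else 0.0
    p_tri = tri[(w1, w2, word)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    l3, l2, l1 = LAMBDAS
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

def predict_next(phrase, uni, bi, tri, total):
    """Return the vocabulary word with the highest interpolated probability."""
    words = phrase.lower().split()
    # Pad short phrases so there are always two context words.
    context = tuple((["<s>", "<s>"] + words)[-2:])
    vocab = sorted(w for (w,) in uni)
    return max(vocab, key=lambda w: interpolated_prob(w, context, uni, bi, tri, total))

# Toy corpus (an assumption) standing in for the cleaned training data.
tokens = "i like green eggs and i like ham and i like green tea".split()
uni, bi, tri = (count_ngrams(tokens, n) for n in (1, 2, 3))
total = len(tokens)

suggestion = complete("gr", uni)                       # completion of a partial word
prediction = predict_next("I like", uni, bi, tri, total)  # next word after a phrase
```

Because the unigram term is never zero for a known word, the interpolated score still ranks candidates when the phrase's trigram or bigram context was never seen in the training data.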