Coursera Data Science Capstone Project

Greg Bennett
June 14, 2018

Intro

The objective of the Coursera Data Science Specialization Capstone project was to build a working predictive text model. The data used in the model came from the HC Corpora corpus (https://www.corpora.heliohost.org).


Algorithm Development

The algorithm that predicts the next word in a user-entered text string is based on a classic N-gram model. [2] Maximum Likelihood Estimates (MLE) for unigrams, bigrams, and trigrams were computed from a cleaned subset of the blog, Twitter, and news Internet files.
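As a rough illustration of the MLE step described above, the sketch below counts unigrams, bigrams, and trigrams and divides each count by the count of its context. It is written in Python for illustration only and is not the project's own code. The function names `ngram_counts` and `mle_probability` and the toy sentence are assumptions made for the example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens as a tuple."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

def mle_probability(ngram, counts):
    """MLE of the last word given the words before it: count(ngram) / count(context)."""
    n = len(ngram)
    if n == 1:
        # Unigram: relative frequency over all tokens.
        return counts[1][ngram] / sum(counts[1].values())
    context_count = counts[n - 1][ngram[:-1]]
    # An unseen context gets probability 0 under plain MLE.
    return counts[n][ngram] / context_count if context_count else 0.0

# Toy corpus standing in for the cleaned blog, Twitter, and news text (hypothetical data).
tokens = "the cat sat on the mat the cat ran".split()
counts = {n: ngram_counts(tokens, n) for n in (1, 2, 3)}

p_the = mle_probability(("the",), counts)                      # "the" is 3 of 9 tokens
p_cat_given_the = mle_probability(("the", "cat"), counts)       # "cat" follows 2 of 3 "the"
p_sat_given_the_cat = mle_probability(("the", "cat", "sat"), counts)  # 1 of 2 "the cat"
```

Plain MLE assigns zero probability to any N-gram that never appeared in the sample, which is one reason to combine the three orders rather than rely on trigrams alone.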

https://en.wikipedia.org/wiki/N-gram
https://en.wikipedia.org/wiki/Part-of-speech_tagging

The Shiny Application

…An application was developed that accepts a phrase as input, suggests word completions from the unigrams, and predicts the most likely next word by linear interpolation of the trigram, bigram, and unigram probabilities. The web-based application can be found here.
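The two behaviors described above, prefix-based word completion and next-word prediction by linear interpolation, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the application's actual code: the weights in `LAMBDAS` are made-up values that sum to 1 (the source does not give the weights used), and the function names and toy corpus are hypothetical.

```python
from collections import Counter

# Assumed interpolation weights for (trigram, bigram, unigram); they must sum to 1.
LAMBDAS = (0.6, 0.3, 0.1)

def build_counts(tokens):
    """Map each order n in {1, 2, 3} to a Counter of n-gram tuples."""
    return {n: Counter(zip(*(tokens[i:] for i in range(n)))) for n in (1, 2, 3)}

def interpolated_probability(word, context, counts, lambdas=LAMBDAS):
    """P(word | w1 w2) = l3 * P_tri + l2 * P_bi + l1 * P_uni, each term an MLE."""
    w1, w2 = context
    total = sum(counts[1].values())
    p_uni = counts[1][(word,)] / total
    bi_ctx = counts[1][(w2,)]
    p_bi = counts[2][(w2, word)] / bi_ctx if bi_ctx else 0.0
    tri_ctx = counts[2][(w1, w2)]
    p_tri = counts[3][(w1, w2, word)] / tri_ctx if tri_ctx else 0.0
    l3, l2, l1 = lambdas
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

def predict_next(context, counts):
    """Return the vocabulary word with the highest interpolated probability."""
    vocab = [w for (w,) in counts[1]]
    return max(vocab, key=lambda w: interpolated_probability(w, context, counts))

def complete_word(prefix, counts, k=3):
    """Suggest up to k unigrams starting with prefix, most frequent first."""
    matches = [(w, c) for (w,), c in counts[1].items() if w.startswith(prefix)]
    return [w for w, _ in sorted(matches, key=lambda wc: (-wc[1], wc[0]))[:k]]

# Toy corpus (hypothetical data) in place of the real N-gram tables.
tokens = "the cat sat on the mat the cat ran on the mat".split()
counts = build_counts(tokens)

next_word = predict_next(("on", "the"), counts)
suggestions = complete_word("ma", counts)
```

Because the unigram term is never zero for a word in the vocabulary, every known word receives some probability even when its trigram and bigram were not observed.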