Greg Bennett
June 14, 2018
The objective of the Coursera Data Science Specialization Capstone project was to build a working predictive text model. The data used to build the model came from the HC Corpora corpus (https://www.corpora.heliohost.org).
The algorithm developed to predict the next word in a user-entered text string is based on a classic N-gram model [2]. Maximum Likelihood Estimates (MLE) for unigrams, bigrams, and trigrams were computed from a cleaned subset of the blog, Twitter, and news Internet files.
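As an illustration of the MLE step, the following is a minimal Python sketch, not the project's actual code. It assumes the text has already been cleaned and split into tokens; the function names `ngram_counts` and `mle_probabilities` and the toy sentence are hypothetical. The estimate for an N-gram is its count divided by the count of its history (the first N-1 words).

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_probabilities(tokens, n):
    """MLE of P(last word | preceding n-1 words) for every observed n-gram."""
    grams = ngram_counts(tokens, n)
    if n == 1:
        # Unigram MLE: relative frequency of each word
        total = sum(grams.values())
        return {gram: count / total for gram, count in grams.items()}
    # Denominator: how often each history is followed by any word
    histories = Counter()
    for gram, count in grams.items():
        histories[gram[:-1]] += count
    return {gram: count / histories[gram[:-1]] for gram, count in grams.items()}

# Toy token stream standing in for the cleaned corpus sample (assumption)
tokens = "the cat sat on the mat the cat ran".split()
unigram = mle_probabilities(tokens, 1)
bigram = mle_probabilities(tokens, 2)
trigram = mle_probabilities(tokens, 3)
```

In this toy sample, "the" is followed by "cat" in two of its three occurrences, so the bigram estimate for ("the", "cat") is 2/3.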
…An application was developed that accepts a phrase as input, suggests word completions from the unigrams, and predicts the most likely next word using linear interpolation of the trigram, bigram, and unigram estimates. The web-based application can be found here.
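The two behaviors described above, word completion and interpolated next-word prediction, can be sketched in Python as follows. This is a hedged illustration rather than the application's implementation: the function names, the toy corpus, and especially the interpolation weights (0.6, 0.3, 0.1) are assumptions, since the weights actually used are not stated here.

```python
from collections import Counter

def build_counts(tokens):
    """Raw unigram, bigram, and trigram counts from a token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri

def interpolated_prob(word, w1, w2, uni, bi, tri, lambdas=(0.6, 0.3, 0.1)):
    """P(word | w1 w2) = l3 * P_tri + l2 * P_bi + l1 * P_uni.

    The lambda weights are illustrative assumptions and must sum to 1.
    Denominators use plain history counts, a common approximation.
    """
    l3, l2, l1 = lambdas
    p_uni = uni[word] / sum(uni.values())
    p_bi = bi[(w2, word)] / uni[w2] if uni[w2] else 0.0
    p_tri = tri[(w1, w2, word)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

def predict_next(phrase, uni, bi, tri):
    """Return the vocabulary word with the highest interpolated probability."""
    words = phrase.lower().split()
    w1 = words[-2] if len(words) >= 2 else None
    w2 = words[-1] if words else None
    return max(uni, key=lambda w: interpolated_prob(w, w1, w2, uni, bi, tri))

def complete_word(prefix, uni, k=3):
    """Suggest up to k completions of a partial word, most frequent first."""
    matches = [w for w, _ in uni.most_common() if w.startswith(prefix)]
    return matches[:k]

# Toy token stream standing in for the cleaned corpus sample (assumption)
tokens = "the cat sat on the mat the cat ran".split()
uni, bi, tri = build_counts(tokens)
next_word = predict_next("on the", uni, bi, tri)
completions = complete_word("ca", uni)
```

Because the unigram term is always nonzero for words in the vocabulary, the interpolated score still ranks candidates when the trigram or bigram history was never observed, which is the main reason to interpolate rather than rely on the trigram estimate alone.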