Gary Martin
1/1/2017
Developing a predictive algorithm for the Data Science Specialization Capstone gives a student the chance to work on a real-world data project and apply the knowledge gained over the course of the program.
This project was carried out in conjunction with SwiftKey in the area of natural language processing. The data was provided by HC Corpora (www.corpora.heliohost.org) and consisted of a large body of text that could be analyzed to determine patterns.
Through arduous trial and error, and many difficulties with the sheer size of the data, an algorithm was developed using the N-gram model. Maximum Likelihood Estimates (MLE) for unigrams, bigrams, and trigrams were computed and tested to determine their predictive ability.
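The counting step behind those estimates can be sketched as follows. This is a minimal Python illustration, not the code used in the project; the function names (`ngrams`, `mle_tables`, `mle_prob`) and the toy sentence are assumptions made for the example. The MLE of a word given its history is simply the count of the full N-gram divided by the count of the history.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mle_tables(tokens):
    """Count unigrams, bigrams, and trigrams; keys are 1, 2, and 3."""
    return {n: Counter(ngrams(tokens, n)) for n in (1, 2, 3)}

def mle_prob(counts, history, word):
    """MLE of P(word | history) = count(history + word) / count(history).

    `history` is a tuple of 0, 1, or 2 preceding words.
    Returns 0.0 when the history was never observed.
    """
    n = len(history) + 1
    if n == 1:
        total = sum(counts[1].values())   # total number of tokens
    else:
        total = counts[n - 1][history]    # how often the history occurred
    if total == 0:
        return 0.0
    return counts[n][history + (word,)] / total

# Toy corpus (hypothetical, for illustration only)
tokens = "the cat sat on the mat the cat ran".split()
counts = mle_tables(tokens)

p_unigram = mle_prob(counts, (), "the")               # 3 of 9 tokens
p_bigram = mle_prob(counts, ("the",), "cat")          # 2 of 3 "the ..." bigrams
p_trigram = mle_prob(counts, ("the", "cat"), "sat")   # 1 of 2 "the cat ..." trigrams
```

Note that any trigram absent from the training text gets probability zero under plain MLE, which is one plausible source of accuracy problems on unseen phrases.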
After initial accuracy issues, Jelinek-Mercer smoothing was employed to combine the probabilities of the unigrams, bigrams, and trigrams. The algorithm developed to predict the next word in a user-entered text string was based on a classic N-gram model [2]. Using a subset of cleaned data from blog, Twitter, and news Internet files, Maximum Likelihood Estimates (MLE) for unigrams, bigrams, and trigrams were computed. Part-of-speech tagging was also employed to provide default predictions by part of speech. Many past students appear to have used a “bad word” filter. I decided against this, because predictive accuracy suffers when a filter of this type is applied.
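Jelinek-Mercer smoothing is a linear interpolation of the N-gram estimates: P(w | w1 w2) = l1 * P(w) + l2 * P(w | w2) + l3 * P(w | w1 w2), with weights that sum to 1. The sketch below shows the idea in Python under stated assumptions. It is not the project's code; the weights (0.2, 0.3, 0.5), the function names, and the toy sentence are all hypothetical. In practice the weights would be tuned on held-out data.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def interpolated_prob(word, w1, w2, uni, bi, tri, lambdas=(0.2, 0.3, 0.5)):
    """Jelinek-Mercer estimate of P(word | w1 w2).

    The lambda weights are illustrative assumptions, not values from the project.
    Each component falls back to 0.0 when its history was never seen.
    """
    l1, l2, l3 = lambdas
    total = sum(uni.values())
    p_uni = uni[(word,)] / total if total else 0.0
    p_bi = bi[(w2, word)] / uni[(w2,)] if uni[(w2,)] else 0.0
    p_tri = tri[(w1, w2, word)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

def predict_next(w1, w2, uni, bi, tri, lambdas=(0.2, 0.3, 0.5)):
    """Return the vocabulary word with the highest interpolated probability."""
    return max(
        (w for (w,) in uni),
        key=lambda w: interpolated_prob(w, w1, w2, uni, bi, tri, lambdas),
    )

# Toy corpus (hypothetical, for illustration only)
tokens = "the cat sat on the mat the cat sat down".split()
uni, bi, tri = (count_ngrams(tokens, n) for n in (1, 2, 3))

best = predict_next("the", "cat", uni, bi, tri)
```

Because the unigram term is always present, a word still receives a small nonzero probability even when neither the trigram nor the bigram context was seen in training.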
Using this model, a Shiny application was built that accepts a phrase as input, suggests words from the unigrams to complete the word being typed, and predicts the next word based on the trigrams, bigrams, and unigrams. The web-based application can be found here.
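The word-completion step can be illustrated with a simple prefix lookup over the unigram counts. This is a sketch in Python rather than the Shiny application's own code, and the function name `complete_word` and the toy sentence are assumptions: it returns the most frequent known word that starts with the characters typed so far.

```python
from collections import Counter

def complete_word(prefix, unigram_counts):
    """Return the most frequent known word starting with `prefix`, or None."""
    candidates = [
        (word, count)
        for word, count in unigram_counts.items()
        if word.startswith(prefix)
    ]
    if not candidates:
        return None
    # Pick the candidate with the highest count
    return max(candidates, key=lambda wc: wc[1])[0]

# Toy unigram counts (hypothetical, for illustration only)
unigram_counts = Counter("the cat sat on the mat the cat sat down".split())

suggestion = complete_word("th", unigram_counts)
no_match = complete_word("x", unigram_counts)
```

A linear scan like this is fine for a small vocabulary; with a large unigram table, a sorted list or prefix tree would likely be a better fit.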
This application, while not practical for everyday use, demonstrates the predictive ability of the code well. The user simply types some text, without punctuation, into the input box. As the user types, the text is echoed in the box below along with a suggested word completion. The predicted next word in the phrase is shown at the bottom.