Word Prediction: Data Science Capstone Project

Quinton Barrington

Introduction

This Capstone project from Coursera and Johns Hopkins University (JHU) showcases the student's ability to create a usable public data product. For this final class project, JHU partnered with SwiftKey (http://swiftkey.com/en/) to apply data science in the area of natural language processing.

The objective of this project was to build a working predictive text model. The data used in the model came from a corpus called HC Corpora (www.corpora.heliohost.org). A corpus is a body of text, usually containing a large number of sentences. [1]

[1] http://desilinguist.org/pdf/crossroads.pdf

Algorithm Development

The algorithm developed to predict the next word in a user-entered text string is based on a classic N-gram model. [2] Using a subset of cleaned data from blog, Twitter, and news Internet files, maximum likelihood estimates (MLE) of the unigram, bigram, and trigram probabilities were computed. To improve accuracy, Jelinek-Mercer smoothing was used, linearly interpolating the trigram, bigram, and unigram probabilities. [3] Where interpolation failed, part-of-speech tagging (POST) was employed to provide default predictions by part of speech. [4] Suggested word completion was based on the unigrams. A profanity filter based on Google's bad words list was also applied to all output. [5]

[2] http://en.wikipedia.org/wiki/N-gram
[3] http://www.ee.columbia.edu/~stanchen/papers/h015l.final.pdf
[4] http://en.wikipedia.org/wiki/Part-of-speech_tagging
[5] https://badwordslist.googlecode.com/files/badwords.txt
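
The interpolation step can be illustrated with a small sketch. It is written in Python for illustration only and is not the project's code. The function names, the toy sentence, and the weights (0.6, 0.3, 0.1) are assumptions made for this example; in practice the weights would be tuned on held-out data. The sketch counts n-grams, forms maximum likelihood estimates, and scores each candidate word as a weighted sum of its trigram, bigram, and unigram probabilities.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens, keyed by tuple."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def mle(counts, context_counts, ngram):
    """Maximum likelihood estimate: count(ngram) / count(its context)."""
    denom = context_counts.get(ngram[:-1], 0)
    return counts.get(ngram, 0) / denom if denom else 0.0


def interpolated_prob(w1, w2, w3, uni, bi, tri, total, lambdas=(0.6, 0.3, 0.1)):
    """Jelinek-Mercer style estimate of P(w3 | w1, w2).

    The lambdas are illustrative weights (trigram, bigram, unigram)
    that sum to 1; they are assumptions, not values from the project.
    """
    l3, l2, l1 = lambdas
    p_tri = mle(tri, bi, (w1, w2, w3))
    p_bi = mle(bi, uni, (w2, w3))
    p_uni = uni.get((w3,), 0) / total
    return l3 * p_tri + l2 * p_bi + l1 * p_uni


def predict_next(w1, w2, uni, bi, tri, total):
    """Return the vocabulary word with the highest interpolated probability."""
    vocab = [w for (w,) in uni]
    return max(vocab, key=lambda w: interpolated_prob(w1, w2, w, uni, bi, tri, total))


# Toy corpus standing in for the cleaned blog, Twitter, and news text.
tokens = "the cat sat on the mat the cat sat on the rug the cat ran".split()
uni, bi, tri = (ngram_counts(tokens, n) for n in (1, 2, 3))
print(predict_next("the", "cat", uni, bi, tri, len(tokens)))
```

Because the unigram term is never zero for a word in the vocabulary, every known word receives some probability even when its trigram and bigram were never observed, which is the point of the smoothing.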

The Shiny Application

Using the algorithm, a Shiny (http://shiny.rstudio.com/) application was developed that accepts a phrase as input, suggests a word completion from the unigrams, and predicts the most likely next word based on linear interpolation of the trigram, bigram, and unigram probabilities. The web-based application is available at https://templar32.shinyapps.io/Capstone/.
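
One plausible way to suggest a word completion from unigrams is to return the most frequent known word that starts with the characters typed so far. The sketch below is written in Python and makes that assumption; the function name `complete_word` and the toy counts are hypothetical, and the application itself may select completions differently.

```python
from collections import Counter


def complete_word(prefix, unigram_counts):
    """Return the most frequent word starting with prefix, or None if no word matches."""
    candidates = {w: c for w, c in unigram_counts.items() if w.startswith(prefix)}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Toy unigram counts standing in for counts taken from the corpus.
counts = Counter("the then there the them the then".split())
print(complete_word("ther", counts))
```

A linear scan is enough for a small vocabulary; for a large one, a sorted list or a trie would avoid checking every word on each keystroke.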

Using the Application

The application is easy to use. The user begins by typing some text, without punctuation, in the supplied input box. As the user types, the text is echoed in the field below, along with a suggested word completion. The predicted next word in the phrase is shown at the bottom of the screen.

This application has many educational and commercial uses. The files used by the application occupy only about twenty-four kilobytes, which makes it easy to download and use.