Johns Hopkins Coursera Data Science Capstone

Raphael Cóbe
April 24th, 2016

Introduction

The Coursera Data Science Specialization Capstone project from Johns Hopkins University (JHU) allows students to create a usable public data product that can show their skills to potential employers. For this iteration of the class, JHU partnered with SwiftKey (http://swiftkey.com/en/) to apply data science in the area of natural language processing.

The goal of the capstone project is to create a predictive text model using a large text corpus of documents as training data. The data used in the model came from a corpus called HC Corpora (www.corpora.heliohost.org). Natural language processing techniques will be used to perform the analysis and build the predictive model.

Algorithm Development

The algorithm developed to predict the next word is based on the N-gram model and was trained on a subset of cleaned data from blog, Twitter, and news text files collected from the Internet.

Our proposed N-gram model used N values of 1, 2, and 3, namely unigrams, bigrams, and trigrams. For each of the N-grams we calculated the maximum likelihood estimate (MLE) of its probability.
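The MLE of an N-gram is its count divided by the count of its history (the first N-1 words). The project itself used RWeka in R; the Python sketch below only illustrates the calculation on a toy corpus, and the function names ngrams and mle_table are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mle_table(tokens, n):
    """Map each n-gram to its MLE: count(n-gram) / count(its history).

    For unigrams the history is empty, so the denominator is the
    total number of tokens.
    """
    counts = Counter(ngrams(tokens, n))
    if n == 1:
        total = sum(counts.values())
        return {gram: c / total for gram, c in counts.items()}
    # Sum the counts of all n-grams that share the same history.
    history_counts = Counter()
    for gram, c in counts.items():
        history_counts[gram[:-1]] += c
    return {gram: c / history_counts[gram[:-1]] for gram, c in counts.items()}

# Toy corpus, for illustration only (not the project data).
tokens = "the cat sat on the mat the cat ran".split()
unigram_mle = mle_table(tokens, 1)
bigram_mle = mle_table(tokens, 2)
print(bigram_mle[("the", "cat")])  # estimate of P(cat | the)
```

In this toy corpus "the" is followed by "cat" twice and by "mat" once, so the estimate of P(cat | the) is 2/3.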

During the development of the project we were forced to work with only a slice of the available data, since the API chosen for constructing the N-gram models (RWeka) is very memory-intensive.

The Back-off Model

Although the corpus was vast, we still had to deal with data sparsity, in the sense that there are more unseen word sequences than seen ones. To cope with this issue we adopted Katz's back-off model.

With this model, we can back off a prediction to models with less history. For example, if the probability of seeing a trigram x y z is 0 (the trigram was never seen), we can back off to the bigram model by discarding x and calculating the probability P(z|y).
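The back-off step can be sketched in Python as follows. This is a simplified illustration, not the project's implementation: it tries the longest history first and falls back to shorter ones, but it leaves out the discounting and back-off weights that a full Katz model uses. The names build_counts and predict_next are hypothetical.

```python
from collections import Counter, defaultdict

def build_counts(tokens, max_n=3):
    """For each order n, map a history tuple to a Counter of next words."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            history = tuple(tokens[i:i + n - 1])
            tables[n][history][tokens[i + n - 1]] += 1
    return tables

def predict_next(tables, context, max_n=3):
    """Return the most frequent next word, backing off to shorter histories."""
    for n in range(max_n, 0, -1):
        history = tuple(context[-(n - 1):]) if n > 1 else ()
        if len(history) < n - 1:
            continue  # not enough context for this order
        followers = tables[n].get(history)
        if followers:
            return followers.most_common(1)[0][0]
    return None  # empty model

# Toy corpus, for illustration only (not the project data).
tokens = "the cat sat on the mat the cat ran on the grass".split()
tables = build_counts(tokens)
seen = predict_next(tables, ["sat", "on"])        # trigram history was seen
backed_off = predict_next(tables, ["dog", "on"])  # falls back to P(z | on)
```

Here the history "dog on" never occurs in the corpus, so the second call discards "dog" and predicts from the bigram history "on" alone.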

An alternative worth trying is a smoothing method such as Jelinek-Mercer, which uses a form of interpolation between the N-gram models.
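Jelinek-Mercer smoothing can be illustrated as a weighted sum of the unigram, bigram, and trigram estimates, with weights that sum to 1. The sketch below is an assumption-laden example rather than part of the project: the weights (0.1, 0.3, 0.6) are arbitrary placeholders that would normally be tuned on held-out data, and the function names are hypothetical.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count all contiguous n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def interpolated_prob(tokens, x, y, z, lambdas=(0.1, 0.3, 0.6)):
    """Estimate P(z | x y) as l1*P(z) + l2*P(z | y) + l3*P(z | x y).

    The lambdas are illustrative placeholders and must sum to 1.
    Counts are recomputed on every call to keep the sketch short.
    """
    l1, l2, l3 = lambdas
    uni, bi, tri = (count_ngrams(tokens, n) for n in (1, 2, 3))
    p1 = uni[(z,)] / len(tokens) if tokens else 0.0
    p2 = bi[(y, z)] / uni[(y,)] if uni[(y,)] else 0.0
    p3 = tri[(x, y, z)] / bi[(x, y)] if bi[(x, y)] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3

# Toy corpus, for illustration only (not the project data).
tokens = "the cat sat on the mat the cat ran".split()
p_seen = interpolated_prob(tokens, "the", "cat", "sat")
p_unseen = interpolated_prob(tokens, "sat", "the", "mat")  # trigram never seen
```

Unlike pure back-off, the unseen trigram "sat the mat" still receives a non-zero probability here, because the bigram and unigram terms always contribute.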

Using the Application

Using the application is straightforward, as can be seen in the screenshot below. The user simply types some text, without punctuation, into the supplied input box. A prediction is made once the space bar has been pressed and nothing further has been typed for a short time.

[Screenshot of the application]