Data Science Capstone

Desire De Waele
September 20th, 2016

This presentation introduces a word prediction application, how it was developed, and how to use it. It is the capstone project for the Data Science Specialization on Coursera, taught by Johns Hopkins University professors in collaboration with SwiftKey.

Introduction

The goal of the capstone project is to build an application that predicts the next word for any text input. It should mimic the widely used SwiftKey word predictor found on many smartphones.

The predictor is implemented from scratch, starting from a corpus of tweets, news items and blogs and applying the skills learned in the specialization. The data is retrieved, cleaned and explored, and then an algorithm and a Shiny application are developed.

In the next slides, I discuss the algorithm and the application.

The Algorithm: 3 parts

  1. Generate counts and probabilities for every n-gram, i.e. every sequence of n consecutive words, in the training corpus. The probabilities are computed with Kneser-Ney smoothing, widely regarded as one of the most effective smoothing methods for n-gram models.

  2. Create a back-off model by looking at n-grams with high probability. As soon as a given input does not appear in any n-gram, or the n-gram does not surpass a given probability threshold, we back off to the lower-order (n-1)-grams and look there.

  3. Parameter optimization. The n-gram order to start looking in and the probability threshold are two parameters that are optimized on a validation set. Ultimately, the most efficient model is used.
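
The first two steps can be illustrated with a minimal Python sketch. This is not the code behind the application: Python is used purely for illustration, the function names (ngram_counts, build_table, predict) and the toy corpus are hypothetical, and plain relative frequencies stand in for Kneser-Ney smoothing to keep the example short. The parameters start_n and threshold correspond to the two parameters tuned in step 3.

```python
from collections import Counter, defaultdict


def ngram_counts(tokens, max_n):
    """Count every n-gram of order 1..max_n, keyed by its tuple of words."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def build_table(counts):
    """Map each context (tuple of preceding words) to {next_word: probability}.

    Assumption: unsmoothed relative frequencies, NOT Kneser-Ney smoothing.
    """
    totals = Counter()
    for gram, c in counts.items():
        if len(gram) >= 2:
            totals[gram[:-1]] += c
    table = defaultdict(dict)
    for gram, c in counts.items():
        if len(gram) >= 2:
            table[gram[:-1]][gram[-1]] = c / totals[gram[:-1]]
    return table


def predict(text, table, unigrams, start_n=3, threshold=0.0, k=3):
    """Return up to k next-word candidates, most probable first."""
    words = text.lower().split()
    # Try the highest order first, then back off to lower orders.
    for n in range(start_n, 1, -1):
        if len(words) < n - 1:
            continue
        context = tuple(words[len(words) - (n - 1):])
        candidates = {w: p for w, p in table.get(context, {}).items()
                      if p > threshold}
        if candidates:
            # Sort by descending probability; break ties alphabetically.
            return sorted(candidates, key=lambda w: (-candidates[w], w))[:k]
    # Nothing matched at any order: fall back to the most frequent words.
    return sorted(unigrams, key=lambda w: (-unigrams[w], w))[:k]


# Toy corpus, for illustration only.
tokens = "the cat sat on the mat and the cat ran".split()
counts = ngram_counts(tokens, max_n=3)
table = build_table(counts)
unigrams = {gram[0]: c for gram, c in counts.items() if len(gram) == 1}

print(predict("on the", table, unigrams))                  # trigram match
print(predict("saw the", table, unigrams, threshold=0.5))  # backs off to bigrams
print(predict("a dog", table, unigrams))                   # falls back to unigrams
```

In this sketch, raising threshold makes the model back off more often, while raising start_n lets it use longer contexts when they exist. Step 3 would amount to scoring different combinations of these two values on held-out text.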

The Application

The link: https://dezwirey.shinyapps.io/10-Capstone-Data-Science/. The application is deployed using Shiny, an accessible tool that lets users interact with data.

Using the application is very simple. Just start typing some text, and the three best predictions are displayed, just as on your smartphone. The most probable option is shown first.

Please do not compare this with Google's predictor.

Additional Information

The first two links below give a thorough overview of how the data was processed and how the application was built.

Report of the exploratory data analysis: http://rpubs.com/Dezwirey/205739

Report of the model and predictor building: http://rpubs.com/Dezwirey/211935

GitHub repository with all the code: https://github.com/dezwier/10-Capstone-Data-Science

The data science specialization: https://www.coursera.org/specializations/jhu-data-science