Coursera Data Science Capstone Project - Next Word Predictor

Andrew Weston
7/13/2016

Project Overview

The goal of this project was to create a Shiny web application that lets the user type in one or more words and predicts the most likely next word.

We used Natural Language Processing (NLP) techniques and large samples of text from Twitter, news media, and blogs to train our model.

The training text can be obtained here:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Data processing and preparation

We read in a random sample of the training text and removed punctuation, URLs, and other noise. This step is documented in detail in the milestone report available here.
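As an illustration only (the exact code is in the milestone report), a sampling and cleaning step along these lines could look like the R sketch below; the file paths, sample fraction, and helper names are assumptions for this example.

```r
# Sketch of the sampling and cleaning step, assuming the unzipped
# Coursera-SwiftKey English files live in a local "final/en_US" folder.
set.seed(2016)

sample_file <- function(path, fraction = 0.05) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, size = floor(length(lines) * fraction))
}

clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("http\\S+|www\\.\\S+", " ", x, perl = TRUE)  # strip URLs
  x <- gsub("[^a-z' ]", " ", x)                          # strip punctuation and digits
  x <- gsub("\\s+", " ", x)                              # collapse whitespace
  trimws(x)
}

corpus <- clean_text(unlist(lapply(
  c("final/en_US/en_US.twitter.txt",
    "final/en_US/en_US.blogs.txt",
    "final/en_US/en_US.news.txt"),
  sample_file
)))
```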

We then analyzed word frequency, including sequences of 2, 3, 4, and 5 words (called n-grams). For example, “the black cat” is a 3-gram; if the user typed “the black”, the application might suggest “cat” because “the black cat” is a common 3-gram.
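To make the n-gram counting concrete, here is a minimal sketch that builds on the `corpus` vector from the previous example; the `count_ngrams` helper is invented for illustration and is not necessarily how the counts were actually produced.

```r
# Count n-grams of length n in a character vector of cleaned lines.
count_ngrams <- function(lines, n) {
  tokens <- strsplit(lines, " ", fixed = TRUE)
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)   # frequency table, most common first
}

bigrams  <- count_ngrams(corpus, 2)
trigrams <- count_ngrams(corpus, 3)
head(trigrams)   # the most frequent 3-grams in the sample
```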

Constructing the application

Once the common n-grams were computed, we stored them (with their frequency counts) for fast lookup in the application. This means the processing step does not have to be repeated and the app can look up candidate words quickly. It also means the app does not learn anything new from what the user types (which would be nice, but considerably more complicated).
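One reasonable way to store the tables, sketched below with data.table and RDS files (the app's actual storage format may differ), is to key each table on its prefix, i.e. the n-gram minus its final word, so the Shiny server can load the files once at startup and answer each request with a keyed lookup.

```r
library(data.table)

# Turn a named frequency table (from count_ngrams above) into a keyed lookup table.
make_lookup <- function(ngram_counts) {
  dt <- data.table(ngram = names(ngram_counts),
                   count = as.integer(ngram_counts))
  dt[, prefix := sub(" [^ ]+$", "", ngram)]   # everything but the last word
  dt[, word   := sub("^.* ", "", ngram)]      # the last word
  setkey(dt, prefix)                          # keyed lookups are fast
  dt
}

trigram_dt <- make_lookup(trigrams)
saveRDS(trigram_dt, "trigrams.rds")   # the app loads this once, not per request

# Inside the app, prediction reduces to a keyed subset and a sort:
lookup <- readRDS("trigrams.rds")
lookup["the black", nomatch = 0][order(-count)][1:3]
```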

Technical Details (algorithm)

The application uses the Stupid Backoff NLP algorithm. If the user types three words, it looks for a 4-gram that begins with those three words. If it cannot find one, it backs off and looks for a 3-gram beginning with the last two words, then a 2-gram beginning with the last word, and so on.
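A hedged sketch of such a backoff lookup is shown below, assuming keyed tables built as in the previous sketch (the names `fourgram_dt`, `trigram_dt`, and `bigram_dt` are assumptions) and the conventional Stupid Backoff discount of 0.4; the app's exact constants and fallback behaviour may differ.

```r
library(data.table)

# tables: keyed data.tables with columns prefix, word, count,
# ordered from the longest n-grams (4-grams) down to 2-grams.
predict_next <- function(input, tables, alpha = 0.4) {
  words   <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  penalty <- 1
  for (i in seq_along(tables)) {
    context_len <- length(tables) - i + 1      # 3-word context, then 2, then 1
    if (length(words) < context_len) next
    prefix <- paste(tail(words, context_len), collapse = " ")
    hits <- tables[[i]][prefix, nomatch = 0]
    if (nrow(hits) > 0) {
      hits <- hits[order(-count)]
      top  <- head(hits, 3)
      return(data.frame(word  = top$word,
                        score = penalty * top$count / sum(hits$count)))
    }
    penalty <- penalty * alpha                 # discount each time we back off
  }
  data.frame(word = "the", score = 0)          # crude fallback when nothing matches
}

predict_next("the black", list(fourgram_dt, trigram_dt, bigram_dt))
```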

Overall, the app quite often predicts words that do not lead to valid sentences, and it predicts “the” much of the time. It would be more coherent if it took sentence structure (i.e. parts of speech) into account.