Ben Bray
12/14/2014
This Word Prediction App, developed for the Johns Hopkins Data Science Specialization, uses a model trained on a corpus of millions of U.S. English documents to predict text. Given a few words of text, the application returns its best predictions for the next word.
The model was tested on 94,054 previously unseen example entries; its first predicted result was the correct answer 20.92 percent of the time.
The probability of a word occurring is estimated from the preceding “n-gram” (a sequence of n words). If the n-gram “I am going” appears in the corpus 10 times, and it is followed by the word “to” in 9 of those occurrences, the probability of “to” being the next word is estimated as 90%.
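As a minimal sketch of that count-based estimate, assuming a small hypothetical frequency table of (context, next word, count) rows; the real tables are built in the pipeline linked below:

```r
# Toy frequency table mirroring the "I am going" example above.
ngrams <- data.frame(
  context = c("i am going", "i am going"),
  word    = c("to", "home"),
  count   = c(9, 1),
  stringsAsFactors = FALSE
)

# P(word | context) = count(context, word) / count(context)
next_word_prob <- function(tbl, ctx, w) {
  total <- sum(tbl$count[tbl$context == ctx])
  hit   <- tbl$count[tbl$context == ctx & tbl$word == w]
  if (total == 0 || length(hit) == 0) return(0)
  hit / total
}

next_word_prob(ngrams, "i am going", "to")  # 0.9
```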
The model uses an “Interpolation” method. Three probabilities for a candidate word are calculated as described above, based on the previous 3, 2, and 1 words. Each of these probabilities is given a weight (a “Lambda”), and the weighted probabilities are summed to give a final score. The weights can be tuned using a number of techniques, discussed further below. If there are no candidate words, the model falls back to suggesting the word “the”.
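Here is a self-contained sketch of the interpolation step. The lookup table, function names, and probability values are illustrative, not the app's actual data structures:

```r
# Illustrative pre-computed probabilities for the word "to" after
# contexts of decreasing length.
probs <- data.frame(
  context = c("i am going", "am going", "going"),
  word    = "to",
  p       = c(0.90, 0.80, 0.40),
  stringsAsFactors = FALSE
)

lookup <- function(ctx, w, tbl) {
  hit <- tbl$p[tbl$context == ctx & tbl$word == w]
  if (length(hit) == 0) 0 else hit
}

# Weighted sum of the 3-, 2-, and 1-word-context probabilities.
score_word <- function(w, prev, tbl, lambdas = c(0.6, 0.3, 0.1)) {
  lambdas[1] * lookup(paste(tail(prev, 3), collapse = " "), w, tbl) +
    lambdas[2] * lookup(paste(tail(prev, 2), collapse = " "), w, tbl) +
    lambdas[3] * lookup(tail(prev, 1), w, tbl)
}

score_word("to", c("i", "am", "going"), probs)
# 0.6*0.9 + 0.3*0.8 + 0.1*0.4 = 0.82
```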
When building the model, sentence structure was taken into account. This allows the model to use the beginnings and ends of sentences predictively, and it prevents the model from trying to predict a word using words from a previous sentence, which carry little predictive power.
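A sketch of that boundary handling, padding each sentence with marker tokens so n-grams never span two sentences; the `<s>`/`</s>` names are an assumption, not necessarily what the app uses:

```r
# Pad each sentence so 4-grams can "see" sentence starts and ends,
# but never cross into a neighboring sentence.
pad_sentence <- function(words, n = 4) {
  c(rep("<s>", n - 1), words, "</s>")
}

pad_sentence(c("i", "am", "going", "to", "the", "store"))
# "<s>" "<s>" "<s>" "i" "am" "going" "to" "the" "store" "</s>"
```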
The model also abstracts certain kinds of text into tags, for example dates or money amounts. This allows frequent patterns to be recognized even when the literal text varies.
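A rough sketch of that abstraction with base-R regexes; the tag names and patterns here are illustrative, and the app's actual rules may differ:

```r
abstract_tags <- function(text) {
  # Dollar amounts like $1,250.00 become a <money> tag.
  text <- gsub("\\$[0-9,]+(\\.[0-9]{2})?", "<money>", text)
  # Dates like 12/14/2014 become a <date> tag.
  text <- gsub("\\b[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}\\b", "<date>", text)
  text
}

abstract_tags("I paid $1,250.00 on 12/14/2014")
# "I paid <money> on <date>"
```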
For compression reasons, and to exclude likely-irrelevant data, n-grams that occurred fewer than ten times were removed from the model.
The model was created using about 2.5 million documents, randomly selected from a corpus of text pulled from Twitter, blogs, and news sites.
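A sketch of the sampling step, assuming the capstone dataset's three en_US text files (file names from the course materials) and the 60/20/20 split described under Links To Code below:

```r
set.seed(1)
docs <- unlist(lapply(
  c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt"),
  readLines, encoding = "UTF-8", skipNul = TRUE
))

# Random subset, then an approximate 60/20/20 split into
# training / held-out / test sets.
docs   <- sample(docs, 2.5e6)
labels <- sample(c("train", "heldout", "test"), length(docs),
                 replace = TRUE, prob = c(0.6, 0.2, 0.2))
sets   <- split(docs, labels)
```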
Ideally, the “Lambda” interpolation weights would have been trained, perhaps using something like a neural network algorithm. I did not do this; instead I simply tried out a few settings and used the one with the best results (a sketch of that evaluation appears after the table). Surprisingly, the results did not vary much.
| Data Set | Lambda Settings (3-, 2-, 1-word context) | Trials | Success Rate |
|---|---|---|---|
| Held Out | 1/3 1/3 1/3 | 46927 | 0.2064483 |
| Held Out | .5 .25 .25 | 46927 | 0.2050419 |
| Held Out | .6 .3 .1 | 46927 | 0.2087924 |
| Test Set | .6 .3 .1 | 94054 | 0.2092096 |
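The success rates above come from a top-1 accuracy check of this general shape; `predict_next` is a stub stand-in here, since the real predictor ranks words with the interpolated model:

```r
# Stub stand-in for the model's prediction function; the real one
# returns candidate words ranked by interpolated score.
predict_next <- function(prev, lambdas) "the"

held_out <- data.frame(
  prev   = c("i am going", "see you"),
  actual = c("to", "soon"),
  stringsAsFactors = FALSE
)

# Top-1 accuracy: fraction of examples where the first prediction is right.
evaluate_lambdas <- function(lambdas, data) {
  hits <- mapply(function(p, a) predict_next(p, lambdas)[1] == a,
                 data$prev, data$actual)
  mean(hits)
}

grid <- list(c(1/3, 1/3, 1/3), c(0.5, 0.25, 0.25), c(0.6, 0.3, 0.1))
sapply(grid, evaluate_lambdas, data = held_out)
```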
The size of the model can be greatly reduced by removing n-grams that don't occur very often. Because I was worried about the amount of memory available on Shiny Apps, I may have cut mine down a bit too far, requiring a minimum of ten occurrences for an n-gram to be included. It may have been worth experimenting with this threshold to see if a larger model improved performance.
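The pruning itself is a one-line filter over the frequency table, sketched here with toy counts:

```r
ngrams <- data.frame(
  ngram = c("i am going to", "am going to the", "extremely rare phrase x"),
  count = c(412, 187, 3),
  stringsAsFactors = FALSE
)

min_count <- 10                                  # cutoff used for the app
pruned    <- ngrams[ngrams$count >= min_count, ]  # drops the 3-count row
```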
Links To Code
- Separate into Training/Held Out/Test Sets (60/20/20)
- Create 4-Gram Frequency Tables, Compile into Single Table: regex cleanup, break into sentences, tokenize, aggregate 4-gram counts (see the sketch after this list).
- Create Model from 4-Gram Frequency Table: remove unknown words and profanity, calculate probabilities, prune/compress.
- Prepare Test/CV Sets, Evaluate Lambda Training, Summarize: train Lambda parameters on the Held Out set, apply to the Test set.
- Application Code: Shiny server.R, Shiny ui.R
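As a rough illustration of the 4-gram frequency-table step above, here is a minimal counter; the whitespace tokenizer is a simplification of the actual regex cleanup:

```r
count_4grams <- function(sentences) {
  grams <- unlist(lapply(strsplit(tolower(sentences), "\\s+"), function(w) {
    if (length(w) < 4) return(character(0))
    sapply(seq_len(length(w) - 3),
           function(i) paste(w[i:(i + 3)], collapse = " "))
  }))
  # Tabulate identical 4-grams into a frequency table.
  as.data.frame(table(ngram = grams), stringsAsFactors = FALSE)
}

count_4grams(c("i am going to the store", "i am going to sleep"))
# "i am going to" appears twice; all other 4-grams once.
```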