Executive Summary

The word prediction project has been started and an initial assessment of the task completed. This report describes the overall goals of the project, the magnitude of the challenge, and some initial results.

The overall goal of the project is to use a large body of collected text to predict the next word a user will type. This will be tested by creating a web application that allows a user to enter a fragment of a sentence and then see what the engine predicts.

If successful, this kind of prediction tool is useful for mobile keyboards, where presenting a small set of likely words allows people to type word by word rather than character by character.

The project is roughly halfway complete:

[Figure: flowchart of project progress]

The data sources (containing a large body of text from various sources) have been acquired. The text has been broken into words, URLs, and (for Twitter data) hashtags and user IDs. From these words I have calculated some basic statistics and then built a “model” of the data that can be used for predictions. This model is based on “n-grams”, which are simply some number (the “n”) of words that appear in sequence in text. For my project, I have decided to work with 2- through 5-grams (that is, 2 to 5 word sequences).
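
To illustrate, here is a minimal sketch in base R of how such n-grams can be generated (it assumes simple letters-only tokenization; the real pipeline also handles URLs, hashtags, and user IDs separately, and the function name is just for illustration):

# Build the n-grams for one piece of text: lowercase it, split it into
# words, and paste together every run of n consecutive words.
make.ngrams <- function(text, n) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[words != ""]
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}
# make.ngrams("thanks so much for the follow", 3) yields
# "thanks so much", "so much for", "much for the", "for the follow"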

These sequences will be used to make the final prediction program. This document describes how those sequences were generated, what I discovered along the way, and some initial conclusions. A subsequent document will provide the final conclusions when the project is complete.

Exploratory Analysis

In this initial phase of the project I have a few simple goals I want to accomplish:

  1. Get the data in, look at it, and see if there’s anything fundamentally different about how the various types of data (Twitter, blogs, news) are structured that I need to worry about. For example, Twitter is full of @users and #hashtags that are not likely to appear in blogs and news reports.
  2. Get some sense of the dimensions of the challenge. I want to know how hard this is going to be, both from the sheer volume of data and from any unexpected complexity.
  3. Start some initial processing of the data (breaking things into words) and see what we can tell from that. Then look at pairs of words, and so on, to see what more there is to learn.

For the purposes of this analysis, I’m going to work with a subset of the data amounting to about 1% of the full dataset. The full dataset is quite large:

##    Source Size_In_Bytes   Lines Estimated_Words Sample_Percent_Unique
## 1 Twitter     316037344 2360148        29824826                 8.761
## 2   Blogs     260564320  899288        38272038                 7.832
## 3    News     261759048 1010242        34968573                 9.364

(The word counts are estimated by extrapolating from our small random sample, which keeps compute times manageable at this point.)
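
For reference, here is a minimal sketch of how such a sample and word-count estimate can be produced (the file name and the 1% sampling rate are illustrative):

# Read one source file, keep roughly 1% of its lines at random, and
# extrapolate the total word count from the sample.
lines <- readLines("en_US.twitter.txt", skipNul = TRUE)
keep <- rbinom(length(lines), 1, 0.01) == 1
sample.lines <- lines[keep]
sample.words <- sum(lengths(strsplit(sample.lines, "\\s+")))
estimated.words <- sample.words / 0.01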

Word Frequencies

A predictive model is going to favor words that appear more frequently in the text. What are the typical frequencies of words?

[Figure: distribution of word frequencies (chunk wordfreqsB)]

We can see that the distribution is heavily right-skewed: most words appear very infrequently in the text, and only a few words (e.g., “the”, “a”) appear very frequently.
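
Continuing from the sampling sketch above (reusing sample.lines), the frequency table behind this kind of plot can be computed as follows:

# Count how often each word appears in the sampled text.
words <- unlist(strsplit(tolower(sample.lines), "[^a-z']+"))
words <- words[words != ""]
word.freq <- sort(table(words), decreasing = TRUE)
head(word.freq)                    # a handful of very common words
quantile(as.integer(word.freq),    # most words sit far down the tail
         c(0.5, 0.9, 0.99, 1))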

n-gram Frequencies

Although the raw frequency of words is interesting, it isn’t a very useful basis for predictions (the next word would always be “the”!). Instead, we want to use the preceding word or words to predict the next one. To build that prediction model we need to create a set of “n-grams”, each containing the word(s) leading up to the word we want to predict. Because this represents a combination of words, we have to decide how many preceding words to predict on: using too few or too many results in bad predictions. For this milestone step, we’re looking at bigrams (two words in sequence), trigrams (three words), and 4-grams.
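
Each n-gram then splits into a predictor part (the leading n-1 words) and the word it predicts (the final word). A minimal sketch of that split (the function name is illustrative, not the project’s actual code):

# Split an n-gram into the predictor prefix (all but the last word)
# and the predicted word (the last word).
split.ngram <- function(ngram) {
  words <- strsplit(ngram, "\\s+")[[1]]
  list(prefix = paste(head(words, -1), collapse = " "),
       prediction = tail(words, 1))
}
# split.ngram("for the follow") gives prefix "for the", prediction "follow"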

There are 2116501 n-grams generated from our subset, and a total of 1246843 unique (n-1)-grams that hold our predictors. Here’s the frequency distribution of the n-grams.

[Figure: distribution of n-gram frequencies (chunk ngrams)]

As you can see, there is a severe right skew to the data (the absolute values are not important). When a 4-gram, a 3-gram, and a 2-gram all predict the same word, the longer n-grams add nothing and can be removed.

Here’s a quick test:

# Current number of ngrams
ngrams.ngramsLen()
## [1] 2116501
# Remove all "useless" ngrams
ngrams.prune()
# New number of ngrams for making predictions
ngrams.ngramsLen()
## [1] 528955
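
The pruning rule can be sketched as follows; this is an illustration of the idea rather than the project’s actual prune() implementation, and it assumes the model is stored as a data frame with one row per n-gram (a prefix column with the leading words and a prediction column with the chosen final word):

# An n-gram is redundant when dropping the first word of its prefix
# leaves a shorter prefix that already predicts the same word.
prune.redundant <- function(model) {
  shorter <- sub("^\\S+\\s+", "", model$prefix)
  backoff <- model$prediction[match(shorter, model$prefix)]
  redundant <- shorter != model$prefix &   # 2-gram prefixes have nothing shorter to fall back on
    !is.na(backoff) & backoff == model$prediction
  model[!redundant, ]
}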

Winnowing down the prediction model

Given the large size of the data set, I will have to be selective about which predictions I keep. I filtered the data by keeping only phrases that occurred at least twice, at least 5 times, or at least 10 times, to try to build a manageable set of data.

Here are the number of n-grams that remained at each level after processing the full corpus:

##      >1      5+     10+ 
## 7000088 1173900  468467
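
A threshold filter of this kind can be sketched as follows (ngram.strings is an assumed character vector with one entry per n-gram occurrence in the corpus):

# Count each distinct n-gram, then see how many survive each cutoff.
ngram.counts <- table(ngram.strings)
sapply(c(2, 5, 10), function(k) sum(ngram.counts >= k))
# Keep, for example, only the n-grams seen at least 5 times
kept.ngrams <- names(ngram.counts)[ngram.counts >= 5]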

Tuning the threshold gives me an opportunity to get the “best” predictions within performance limits. This leaves me ready to build the final prediction engine and optimize it for performance on the web.

Next steps

With a set of optimized n-grams in place, my next step is to build a prediction tool that uses those n-grams to guess the next word in a sentence fragment. The tool needs a UI design and a prediction engine design, and must then be tested and deployed to the web.