The word prediction project is underway and an initial assessment of the task has been completed. This report describes the goals of the overall project, the magnitude of the challenge, and some initial results.
The overall goal of the project is to use a large body of collected text to predict the next word a user will type. This will be tested by creating a web application that lets a user enter a sentence fragment and then see what the engine predicts.
If successful, this kind of prediction tool is useful for mobile keyboards, where presenting a small set of likely words lets people type word by word rather than character by character.
The project is roughly halfway complete:
The data sources, containing a large body of text from Twitter, blogs, and news articles, have been acquired. The text has been broken into words, URLs, and (for the Twitter data) hashtags and user ids. From these words I have calculated some basic statistics and then built a “model” of the data that can be used for predictions. This model is based on “n-grams”, which are simply some number (the “n”) of words that appear in sequence in text. For my project, I have decided to work with 2- through 5-grams (that is, 2 to 5 word sequences).
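As a rough illustration of the tokenization step, the sketch below (not the project’s actual code) lower-cases a line of text and pulls out URLs, hashtags, user ids, and plain words; the function name and regular expression are illustrative choices of my own.

# Illustrative tokenizer: URLs, #hashtags, @user ids, and plain words
tokenize_line <- function(line) {
  line <- tolower(line)
  pattern <- "https?://[^[:space:]]+|[#@][[:alnum:]_]+|[[:alpha:]']+"
  regmatches(line, gregexpr(pattern, line))[[1]]
}

tokenize_line("Check https://example.com @user #nlp don't miss it!")
## [1] "check" "https://example.com" "@user" "#nlp" "don't" "miss" "it"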
These sequences will be used to make the final prediction program. This document describes how those sequences were generated, what I discovered along the way, and some initial conclusions. A subsequent document will provide the final conclusions when the project is complete.
In this initial phase of the project I have a few simple goals: sample the data, tokenize it, examine the word and n-gram frequency distributions, and reduce the n-gram set to a size suitable for a prediction engine.
For the purposes of this analysis, I’m going to work with a subset of the data amounting to about 1% of the full dataset. The full dataset is quite large:
## Source Size_In_Bytes Lines Estimated_Words Sample_Percent_Unique
## 1 Twitter 316037344 2360148 29824826 8.761
## 2 Blogs 260564320 899288 38272038 7.832
## 3 News 261759048 1010242 34968573 9.364
(We estimate the number of words by extrapolating from our small random sample to keep compute times manageable at this point.)
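To make the sampling and extrapolation concrete, here is a minimal sketch of how a roughly 1% sample can be drawn and the full word count estimated from it; the file name and sampling rate shown are assumptions for illustration, not the exact code used.

# Draw ~1% of lines at random and extrapolate the word count (illustrative)
set.seed(1234)
lines <- readLines("en_US.twitter.txt", skipNul = TRUE)
sample_lines <- lines[rbinom(length(lines), 1, 0.01) == 1]

# Count words in the sample, then scale up to estimate the full file
words_in_sample <- sum(lengths(strsplit(sample_lines, "[[:space:]]+")))
estimated_words <- words_in_sample * length(lines) / length(sample_lines)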
A predictive model is going to favor words that appear more frequently in the text. What are the typical frequencies of words?
We can see that the distribution is skewed heavily to the right – most words appear very infrequently in the text, and only a few words (e.g., “the”, “a”) appear very frequently.
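A small sketch of how such a frequency table can be produced from the sample (reusing the illustrative sample_lines and tokenize_line from the sketches above):

# Tabulate word frequencies in the sample and sort from most to least common
tokens <- unlist(lapply(sample_lines, tokenize_line))
word_freq <- sort(table(tokens), decreasing = TRUE)
head(word_freq)                  # a few very common words ("the", "to", "a", ...)
summary(as.integer(word_freq))   # but most words occur only a handful of times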
Although the raw frequency of words is interesting, it isn’t a very useful basis for predictions (the next word would always be “the”!). Instead, we want to use the preceding word or words to predict the next one. To build that prediction model we need to create a set of “n-grams”, each consisting of a word and the words leading up to it. Because this represents a combination of words, we have to decide how many preceding words to predict on: too few or too many results in bad predictions. For this milestone step, we’re looking at bigrams (two words in sequence), trigrams (three words), and 4-grams.
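For illustration, n-grams can be built by sliding a window of n tokens across each line; the helper below is a sketch under that assumption, not the code used to produce the counts that follow.

# Build the n-grams of a token vector by sliding an n-word window over it
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}

make_ngrams(c("i", "want", "to", "go", "home"), 3)
## [1] "i want to"  "want to go" "to go home"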
There are 2116501 n-grams generated from our subset, and a total of 1246843 unique (n-1)-grams that hold our predictors. Here’s the frequency distribution of the n-grams.
As you can see, there’s a severe right skew to the data (the absolute values are not important). When a 4-gram, a 3-gram, and a 2-gram all predict the same word, we can remove the longer n-grams, since the shorter one already gives the same prediction.
Here’s a quick test:
# Current number of ngrams
ngrams.ngramsLen()
## [1] 2116501
# Remove all "useless" ngrams
ngrams.prune()
# New number of ngrams for making predictions
ngrams.ngramsLen()
## [1] 528955
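To show the idea behind the pruning (not the actual ngrams.prune() implementation), the following sketch assumes the model stores one predicted word per prefix and drops any entry whose prediction is already reached through the shorter prefix obtained by trimming the first word:

# Sketch: drop an n-gram when the shorter, trimmed prefix predicts the same word
prune_model <- function(model) {
  trimmed <- sub("^\\S+\\s*", "", model$prefix)           # drop the first word
  backoff <- model$prediction[match(trimmed, model$prefix)]
  keep <- is.na(backoff) | backoff != model$prediction     # keep only "useful" rows
  model[keep, ]
}

model <- data.frame(
  prefix     = c("want to go", "to go", "go",   "want to eat"),
  prediction = c("home",       "home",  "home", "out"),
  stringsAsFactors = FALSE
)
prune_model(model)
## keeps "go" -> "home" and "want to eat" -> "out"; the longer "go" prefixes are dropped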
Given the large size of the data set, I will have to be selective about which predictions I keep. I filtered the data based on whether a phrase occurred at least twice, at least 5 times, or at least 10 times, to try to build a manageable set of data.
Here are the numbers of n-grams that remained at each threshold after processing the full corpus:
## 2+ 5+ 10+
## 7000088 1173900 468467
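As a sketch of the threshold filtering (reusing the illustrative helpers above; the real pipeline works over the full corpus rather than this small sample):

# Count every 2- to 4-gram in the sample, then see how many clear each threshold
ngram_counts <- table(unlist(lapply(sample_lines, function(l) {
  toks <- tokenize_line(l)
  c(make_ngrams(toks, 2), make_ngrams(toks, 3), make_ngrams(toks, 4))
})))

sapply(c(2, 5, 10), function(k) sum(ngram_counts >= k))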
Tuning the threshold gives me an opportunity to get the “best” predictions within performance limits. This leaves me ready to build the final prediction engine and optimize it for performance on the web.
With a set of optimized n-grams in place, my next step is to build a prediction tool that uses those n-grams to guess the next word in a sentence fragment. The tool needs a UI design and a prediction engine design, and then it must be tested and deployed to the web.