Ben Bray
12/14/2014
This Word Prediction App, developed for the Johns Hopkins Data Science Specialization, uses a model trained on a corpus of millions of U.S. English documents to predict text. Given a few words of text, the application returns its best predictions for the next word.
The model was tested on 94,054 previously unseen example entries; its first predicted result was the correct answer 20.92 percent of the time.
The probability of a word occurring is estimated from the preceding “n-gram” (a sequence of n words). If the n-gram “I am going” appears in the corpus 10 times, and it is followed by the word “to” in 9 of those occurrences, the probability of “to” being the next word is estimated as 90%.
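As a minimal sketch of that count-based estimate, assuming a small hypothetical frequency table of (context, next word, count) rows; the real tables are built in the pipeline linked below:

```r
# Toy frequency table mirroring the "I am going" example above.
ngrams <- data.frame(
  context = c("i am going", "i am going"),
  word    = c("to", "home"),
  count   = c(9, 1),
  stringsAsFactors = FALSE
)

# P(word | context) = count(context, word) / count(context)
next_word_prob <- function(tbl, ctx, w) {
  total <- sum(tbl$count[tbl$context == ctx])
  hit   <- tbl$count[tbl$context == ctx & tbl$word == w]
  if (total == 0 || length(hit) == 0) return(0)
  hit / total
}

next_word_prob(ngrams, "i am going", "to")  # 0.9
```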
The model uses an “Interpolation” method. Three probabilities for a candidate word are calculated as described above, based on the previous 3, 2, and 1 words. Each of these probabilities is given a weight (a “Lambda”), and the weighted probabilities are summed to give a final score. The weights can be tuned using a number of techniques, discussed further below. If there are no candidate words, the model falls back to suggesting the word “the”.
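Here is a self-contained sketch of the interpolation step. The lookup table, function names, and probability values are illustrative, not the app's actual data structures:

```r
# Illustrative pre-computed probabilities for the word "to" after
# contexts of decreasing length.
probs <- data.frame(
  context = c("i am going", "am going", "going"),
  word    = "to",
  p       = c(0.90, 0.80, 0.40),
  stringsAsFactors = FALSE
)

lookup <- function(ctx, w, tbl) {
  hit <- tbl$p[tbl$context == ctx & tbl$word == w]
  if (length(hit) == 0) 0 else hit
}

# Weighted sum of the 3-, 2-, and 1-word-context probabilities.
score_word <- function(w, prev, tbl, lambdas = c(0.6, 0.3, 0.1)) {
  lambdas[1] * lookup(paste(tail(prev, 3), collapse = " "), w, tbl) +
    lambdas[2] * lookup(paste(tail(prev, 2), collapse = " "), w, tbl) +
    lambdas[3] * lookup(tail(prev, 1), w, tbl)
}

score_word("to", c("i", "am", "going"), probs)
# 0.6*0.9 + 0.3*0.8 + 0.1*0.4 = 0.82
```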
When building the model, sentence structure was taken into account. This allows the model to use the beginnings and ends of sentences predictively, and it prevents the model from trying to predict a word using words from a previous sentence, which carry little predictive power.
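A sketch of that boundary handling, padding each sentence with marker tokens so n-grams never span two sentences; the `<s>`/`</s>` names are an assumption, not necessarily what the app uses:

```r
# Pad each sentence so 4-grams can "see" sentence starts and ends,
# but never cross into a neighboring sentence.
pad_sentence <- function(words, n = 4) {
  c(rep("<s>", n - 1), words, "</s>")
}

pad_sentence(c("i", "am", "going", "to", "the", "store"))
# "<s>" "<s>" "<s>" "i" "am" "going" "to" "the" "store" "</s>"
```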
The model also abstracts certain kinds of text into tags, for example dates or money amounts. This allows frequent patterns to be recognized even when the literal text varies.
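A rough sketch of that abstraction with base-R regexes; the tag names and patterns here are illustrative, and the app's actual rules may differ:

```r
abstract_tags <- function(text) {
  # Dollar amounts like $1,250.00 become a <money> tag.
  text <- gsub("\\$[0-9,]+(\\.[0-9]{2})?", "<money>", text)
  # Dates like 12/14/2014 become a <date> tag.
  text <- gsub("\\b[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}\\b", "<date>", text)
  text
}

abstract_tags("I paid $1,250.00 on 12/14/2014")
# "I paid <money> on <date>"
```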
For compression reasons, and to exclude likely-irrelevant data, n-grams that occurred fewer than ten times were removed from the model.
The model was created using about 2.5 million documents, randomly selected from a corpus of text pulled from Twitter, blogs, and news sites.
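A sketch of the sampling step, assuming the capstone dataset's three en_US text files (file names from the course materials) and the 60/20/20 split described under Links To Code below:

```r
set.seed(1)
docs <- unlist(lapply(
  c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt"),
  readLines, encoding = "UTF-8", skipNul = TRUE
))

# Random subset, then an approximate 60/20/20 split into
# training / held-out / test sets.
docs   <- sample(docs, 2.5e6)
labels <- sample(c("train", "heldout", "test"), length(docs),
                 replace = TRUE, prob = c(0.6, 0.2, 0.2))
sets   <- split(docs, labels)
```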
Ideally, the “Lambda” interpolation weights would have been trained, perhaps using something like a neural network algorithm. I did not do this; instead I simply tried out a few settings and used the one with the best results (a sketch of that evaluation appears after the table). Surprisingly, the results did not vary much.
| Data Set | Lambda Settings (3-, 2-, 1-word context) | Trials | Success Rate |
|---|---|---|---|
| Held Out | 1/3 1/3 1/3 | 46927 | 0.2064483 |
| Held Out | .5 .25 .25 | 46927 | 0.2050419 |
| Held Out | .6 .3 .1 | 46927 | 0.2087924 |
| Test Set | .6 .3 .1 | 94054 | 0.2092096 |
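The success rates above come from a top-1 accuracy check of this general shape; `predict_next` is a stub stand-in here, since the real predictor ranks words with the interpolated model:

```r
# Stub stand-in for the model's prediction function; the real one
# returns candidate words ranked by interpolated score.
predict_next <- function(prev, lambdas) "the"

held_out <- data.frame(
  prev   = c("i am going", "see you"),
  actual = c("to", "soon"),
  stringsAsFactors = FALSE
)

# Top-1 accuracy: fraction of examples where the first prediction is right.
evaluate_lambdas <- function(lambdas, data) {
  hits <- mapply(function(p, a) predict_next(p, lambdas)[1] == a,
                 data$prev, data$actual)
  mean(hits)
}

grid <- list(c(1/3, 1/3, 1/3), c(0.5, 0.25, 0.25), c(0.6, 0.3, 0.1))
sapply(grid, evaluate_lambdas, data = held_out)
```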
The size of the model can be greatly reduced by removing n-grams that don't occur very often. Because I was worried about the amount of memory available on Shiny Apps, I may have cut mine down a bit too far, requiring a minimum of ten occurrences for an n-gram to be included. It may have been worth experimenting with this threshold to see if a larger model improved performance.
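The pruning itself is a one-line filter over the frequency table, sketched here with toy counts:

```r
ngrams <- data.frame(
  ngram = c("i am going to", "am going to the", "extremely rare phrase x"),
  count = c(412, 187, 3),
  stringsAsFactors = FALSE
)

min_count <- 10                                  # cutoff used for the app
pruned    <- ngrams[ngrams$count >= min_count, ]  # drops the 3-count row
```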
Links To Code
- Separate into Training/Held Out/Test Sets (60/20/20)
- Create 4-Gram Frequency Tables, Compile into Single Table: regex cleanup, break into sentences, tokenize, aggregate 4-gram counts (see the sketch after this list).
- Create Model from 4-Gram Frequency Table: remove unknown words and profanity, calculate probabilities, prune/compress.
- Prepare Test/CV Sets, Evaluate Lambda Training, Summarize: train Lambda parameters on the Held Out set, apply to the Test set.
- Application Code: Shiny server.R, Shiny ui.R
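As a rough illustration of the 4-gram frequency-table step above, here is a minimal counter; the whitespace tokenizer is a simplification of the actual regex cleanup:

```r
count_4grams <- function(sentences) {
  grams <- unlist(lapply(strsplit(tolower(sentences), "\\s+"), function(w) {
    if (length(w) < 4) return(character(0))
    sapply(seq_len(length(w) - 3),
           function(i) paste(w[i:(i + 3)], collapse = " "))
  }))
  # Tabulate identical 4-grams into a frequency table.
  as.data.frame(table(ngram = grams), stringsAsFactors = FALSE)
}

count_4grams(c("i am going to the store", "i am going to sleep"))
# "i am going to" appears twice; all other 4-grams once.
```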