Gabe Rudy
2016-01-23
Predicting the next word requires building a model that uses preprocessed word frequencies and a method to integrate multiple individual predictions into a single selected next word.
The model was derived from a corpus of ~500 MB of blog, news, and Twitter text.
| File | #Lines | #Words | Size |
|---|---|---|---|
| en_US.blogs.txt | 899288 | 37334690 | 200M |
| en_US.news.txt | 1010242 | 34372720 | 196M |
| en_US.twitter.txt | 2360148 | 30374206 | 159M |
From the corpus, di-, tri-, tetra-, and penta-grams were extracted and stored as key/value records of the form prefix, next_word, count.

Note: I had to fork/contribute to RcppLevelDB to support reading back sorted key/value pairs.
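As a rough illustration (not the project's actual pipeline code), prefix/next_word/count records for a given n can be built in base R along these lines; the tokenize() helper and the in-memory table() are assumptions standing in for the real preprocessing and the LevelDB store:

```r
# Sketch of n-gram counting; the real model stores these (prefix, next_word, count)
# records as sorted key/value pairs in LevelDB via RcppLevelDB.
tokenize <- function(line) {
  toks <- strsplit(gsub("[^a-z' ]", " ", tolower(line)), "\\s+")[[1]]
  toks[nzchar(toks)]                                  # drop empty tokens
}

count_ngrams <- function(lines, n) {
  stopifnot(n >= 2)                                   # di-grams and up
  keys <- character(0)
  for (line in lines) {
    toks <- tokenize(line)
    if (length(toks) < n) next
    idx    <- seq_len(length(toks) - n + 1)
    prefix <- vapply(idx, function(i) paste(toks[i:(i + n - 2)], collapse = " "), "")
    keys   <- c(keys, paste(prefix, toks[idx + n - 1], sep = "\t"))  # "prefix\tnext_word"
  }
  table(keys)                                         # count per (prefix, next_word) pair
}

# Example: count_ngrams(c("the quick brown fox", "the quick red fox"), 3)
# yields counts keyed like "the quick\tbrown".
```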
For a given input phrase, the tokenized version is used to predict the next word as follows:
- Each n-gram database for which enough prefix tokens are available is queried.
- The matching next_word values and their counts are retrieved.
- The penta-gram counts are given a weight of 1, and the remaining counts are weighted as \[ 0.6^{5 - n} \] where \(n\) is the n-gram length.
- The weighted counts are combined per candidate word to select the single predicted next word.
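A minimal sketch of that weighted back-off scoring, assuming a hypothetical lookup(prefix, n) function in place of the actual RcppLevelDB query (expected to return a data frame with next_word and count columns, or NULL when nothing matches):

```r
# Weighted back-off scoring sketch: query each n-gram store the prefix allows,
# weight its counts by 0.6^(5 - n), and pick the highest-scoring candidate word.
predict_next <- function(tokens, lookup, max_n = 5, alpha = 0.6) {
  scores <- numeric(0)
  for (n in 2:max_n) {
    if (length(tokens) < n - 1) break                 # not enough prefix tokens
    prefix <- paste(tail(tokens, n - 1), collapse = " ")
    hits   <- lookup(prefix, n)
    if (is.null(hits) || nrow(hits) == 0) next
    w <- alpha ^ (max_n - n)                          # penta-gram weight = 1
    for (j in seq_len(nrow(hits))) {
      word <- as.character(hits$next_word[j])
      prev <- if (is.na(scores[word])) 0 else scores[word]
      scores[word] <- prev + w * hits$count[j]
    }
  }
  if (length(scores) == 0) return(NA_character_)
  names(scores)[which.max(scores)]                    # highest combined score wins
}
```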
Novel test sets were extracted from the sentences used in the quizzes, a news article, and tweets containing #today. For each word in each line, the preceding portion of the line was passed to the next-word predictor.
Correct predictions were counted using only the di-gram database and then after adding each additional n-gram database to the model. An accuracy of ~20% was achieved on the quiz sentences, but ~15% is more typical on bulk text.
| File | #Preds | Di-Gram Only | Up to Tri-Gram | Up to Tetra-Gram | Up to Penta-Gram |
|---|---|---|---|---|---|
| quizes_sentances.txt | 282 | 41 | 56 | 57 | 57 |
| news_charm | 382 | 48 | 64 | 66 | 66 |
| twitter_today | 277 | 33 | 39 | 39 | 41 |
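A rough illustration of how such per-word accuracy counts can be produced, assuming the tokenize() and predict_next() helpers sketched above:

```r
# Count how often the predictor's top word matches the actual next word.
evaluate_lines <- function(lines, lookup) {
  correct <- 0L
  total   <- 0L
  for (line in lines) {
    toks <- tokenize(line)
    for (i in seq_along(toks)[-1]) {                  # predict every word after the first
      pred    <- predict_next(toks[seq_len(i - 1)], lookup)
      correct <- correct + as.integer(identical(pred, toks[i]))
      total   <- total + 1L
    }
  }
  c(correct = correct, total = total, accuracy = correct / total)
}
```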
Open the Word Predictor Shiny App and follow the directions: enter text in the input line, and the predicted next word is displayed below it.
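For reference, a minimal Shiny skeleton of that interaction (hypothetical, not the app's actual source; the predictor is stubbed so the skeleton runs on its own):

```r
library(shiny)

# The real app wires this input to the back-off predictor described above;
# predict_next_word() is only a placeholder here.
predict_next_word <- function(phrase) {
  if (!nzchar(trimws(phrase))) "" else "the"          # stub prediction
}

ui <- fluidPage(
  textInput("phrase", "Enter text:"),
  textOutput("next_word")
)

server <- function(input, output) {
  output$next_word <- renderText(predict_next_word(input$phrase))
}

shinyApp(ui, server)
```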