Word Predictor

Gabe Rudy
2016-01-23

Method

Predicting the next word requires a model built from preprocessed word frequencies and a method to integrate multiple individual predictions into a single selected next word.

This model was derived from a corpus of ~500MB of blog, news, and Twitter text.

Files              #Lines   #Words    Size
en_US.blogs.txt    899288   37334690  200M
en_US.news.txt     1010242  34372720  196M
en_US.twitter.txt  2360148  30374206  159M

Model Building

From the corpus, the di-, tri-, tetra-, and penta-grams were extracted and processed as follows:

  • Insert each observed n-gram into a LevelDB key/value store as the key, with a value of 1
  • If the n-gram already exists, increment its value so that it counts the observations of that n-gram
  • Read back the ordered n-gram key/value store, keeping only n-grams with counts > 1 for di- and tri-grams and > 2 for tetra- and penta-grams
  • Split the last token off each n-gram and store the result in a SQLite database with attributes prefix, next_word, count

Note: I had to fork/contribute to RcppLevelDB to support reading back sorted key/value pairs.
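
The model-building steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the original implementation: an in-memory Counter stands in for the LevelDB store, the whitespace tokenizer is a placeholder, and the table names (ngram2 ... ngram5) are hypothetical. Only the pruning thresholds and the prefix, next_word, count attributes come from the description.

```python
import sqlite3
from collections import Counter

# An n-gram is kept only if its count exceeds this value:
# > 1 for di- and tri-grams, > 2 for tetra- and penta-grams.
MIN_COUNT_EXCLUSIVE = {2: 1, 3: 1, 4: 2, 5: 2}


def count_ngrams(lines, n):
    """Count every n-gram of length n (a Counter stands in for LevelDB)."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()  # assumed tokenizer, not the original
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def build_table(conn, lines, n):
    """Prune rare n-grams and store (prefix, next_word, count) rows."""
    table = f"ngram{n}"  # hypothetical table name
    conn.execute(
        f"CREATE TABLE {table} (prefix TEXT, next_word TEXT, count INTEGER)"
    )
    rows = []
    # Iterate in key order, mirroring the sorted read-back from the store.
    for ngram, count in sorted(count_ngrams(lines, n).items()):
        if count > MIN_COUNT_EXCLUSIVE[n]:
            # Split the last token off: "a b c" -> prefix "a b", next word "c".
            prefix, _, next_word = ngram.rpartition(" ")
            rows.append((prefix, next_word, count))
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)
    conn.execute(f"CREATE INDEX idx_{table} ON {table} (prefix)")
    return len(rows)


# Toy corpus for illustration only.
corpus = ["the cat sat on the mat", "the cat sat on the sofa", "the cat ran"]
conn = sqlite3.connect(":memory:")
for n in (2, 3, 4, 5):
    build_table(conn, corpus, n)
```

Indexing each table on prefix keeps the lookup at prediction time to a single indexed query per n-gram length.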

Next Word Prediction

For a given input phrase, the tokenized version is used to predict the next word as follows:

  • Each n-gram database is queried if enough input prefix tokens are available.
  • The next_word values and counts are retrieved. Penta-gram counts are given a weight of 1, and the remaining counts are weighted as \[ 0.6^{5-gramlength} \]
  • Di-grams have a fixed, unweighted count of 2
  • The predicted word is the one with the greatest total weighted count, or "the" if no predictions were made.
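
A minimal sketch of the scoring rule above, written in Python for illustration. Several details are assumptions rather than the original implementation: the n-gram tables are plain dictionaries instead of SQLite databases, the function and constant names are hypothetical, and "a fixed unweighted count of 2" is read as every di-gram candidate contributing a flat score of 2 regardless of its observed count.

```python
from collections import defaultdict

DECAY = 0.6                # base of the weight 0.6^(5 - gramlength)
DIGRAM_FIXED_SCORE = 2.0   # assumed reading of the fixed di-gram count
FALLBACK_WORD = "the"      # returned when no table yields a candidate


def predict_next(tokens, tables):
    """Pick the next word with the greatest total weighted count.

    tables maps an n-gram length n to {prefix string: {next_word: count}}.
    """
    scores = defaultdict(float)
    for n in (5, 4, 3, 2):
        k = n - 1  # number of prefix tokens an n-gram lookup needs
        if len(tokens) < k or n not in tables:
            continue  # not enough input tokens (or no table) for this n
        prefix = " ".join(tokens[-k:])
        for word, count in tables[n].get(prefix, {}).items():
            if n == 2:
                scores[word] += DIGRAM_FIXED_SCORE
            else:
                # Weight is 1 for penta-grams, 0.6 for tetra, 0.36 for tri.
                scores[word] += count * DECAY ** (5 - n)
    if not scores:
        return FALLBACK_WORD
    return max(scores, key=scores.get)


# Toy tables for illustration only.
tables = {
    2: {"the": {"cat": 3, "mat": 5}},
    3: {"on the": {"mat": 4, "sofa": 2}},
}
print(predict_next("sat on the".split(), tables))  # mat
print(predict_next(["zebra"], tables))             # the
```

In the first call, "mat" scores 4 × 0.36 = 1.44 from the tri-gram table plus the flat 2 from the di-gram table, beating "cat" (2) and "sofa" (0.72). The second call finds no matching prefix and falls back to "the".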

Evaluation

Novel test sets were extracted from the sentences used in the quizzes, a news article and tweets with #today. For each word in each line, the preceding portion of the line was passed to the next word predictor.
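
The evaluation loop described above might look like the following Python sketch. The function name, the whitespace tokenizer, and the choice to start predicting at the second word of each line (so that every prediction has at least one preceding token) are assumptions made for illustration.

```python
def evaluate(lines, predict):
    """Return (predictions made, correct predictions) over all lines.

    predict is any callable that takes the preceding tokens and
    returns a single predicted next word.
    """
    n_preds = 0
    n_correct = 0
    for line in lines:
        tokens = line.lower().split()  # assumed tokenizer
        for i in range(1, len(tokens)):
            n_preds += 1
            # Pass the preceding portion of the line to the predictor.
            if predict(tokens[:i]) == tokens[i]:
                n_correct += 1
    return n_preds, n_correct


# Trivial stand-in predictor that always answers "the".
def always_the(tokens):
    return "the"


n_preds, n_correct = evaluate(["on the mat", "the cat"], always_the)
print(n_preds, n_correct)  # 3 1
```

Swapping always_the for the real predictor, restricted to different sets of n-gram databases, would give the per-model counts of correct predictions.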

Correct predictions were counted using only the di-gram database, and then again after adding each larger n-gram database to the model. Accuracy of about 20% was achieved on the quiz sentences, but on bulk text roughly 15% is more common.

Files                 #Preds  Di-Gram Only  Up to Tri-Gram  Up to Tetra-Gram  Up to Penta-Gram
quizes_sentances.txt  282     41            56              57                57
news_charm            382     48            64              66                66
twitter_today         277     33            39              39                41
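
To relate the counts in the table to the accuracy figures quoted above, the following Python sketch converts each count of correct predictions to a percentage of the predictions made. The numbers are copied from the table; the variable names are illustrative.

```python
# (number of predictions, correct predictions for each cumulative model:
#  di-gram only, up to tri-gram, up to tetra-gram, up to penta-gram)
results = {
    "quizes_sentances.txt": (282, [41, 56, 57, 57]),
    "news_charm": (382, [48, 64, 66, 66]),
    "twitter_today": (277, [33, 39, 39, 41]),
}

accuracy = {
    name: [round(100 * c / n_preds, 1) for c in correct]
    for name, (n_preds, correct) in results.items()
}

for name, rates in accuracy.items():
    print(name, rates)
```

With all four databases in use this works out to about 20.2% on the quiz sentences, 17.3% on the news article, and 14.8% on the tweets. Most of the gain comes from adding tri-grams; the tetra- and penta-gram databases add only a handful of correct predictions.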

Usage

Word Predictor

Open the Word Predictor Shiny App and follow the directions by entering text in the input line. The predicted next word is displayed below the input.