Gabe Rudy
2016-01-23
Predicting the next word requires building a model that uses preprocessed word frequencies and a method to integrate multiple individual predictions into a single selected next word.
The model was derived from a corpus of ~500 MB of blog, news, and Twitter text.
| File | #Lines | #Words | Size |
|---|---|---|---|
| en_US.blogs.txt | 899288 | 37334690 | 200M |
| en_US.news.txt | 1010242 | 34372720 | 196M |
| en_US.twitter.txt | 2360148 | 30374206 | 159M |
From the corpus, di-, tri-, tetra-, and penta-grams were extracted and stored as key/value records of the form prefix, next_word, count.

Note: I had to fork/contribute to RcppLevelDB to support reading back sorted key/value pairs.
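As a rough illustration (not the project's actual pipeline code), prefix/next_word/count records for a given n can be built in base R along these lines; the tokenize() helper and the in-memory table() are assumptions standing in for the real preprocessing and the LevelDB store:

```r
# Sketch of n-gram counting; the real model stores these (prefix, next_word, count)
# records as sorted key/value pairs in LevelDB via RcppLevelDB.
tokenize <- function(line) {
  toks <- strsplit(gsub("[^a-z' ]", " ", tolower(line)), "\\s+")[[1]]
  toks[nzchar(toks)]                                  # drop empty tokens
}

count_ngrams <- function(lines, n) {
  stopifnot(n >= 2)                                   # di-grams and up
  keys <- character(0)
  for (line in lines) {
    toks <- tokenize(line)
    if (length(toks) < n) next
    idx    <- seq_len(length(toks) - n + 1)
    prefix <- vapply(idx, function(i) paste(toks[i:(i + n - 2)], collapse = " "), "")
    keys   <- c(keys, paste(prefix, toks[idx + n - 1], sep = "\t"))  # "prefix\tnext_word"
  }
  table(keys)                                         # count per (prefix, next_word) pair
}

# Example: count_ngrams(c("the quick brown fox", "the quick red fox"), 3)
# yields counts keyed like "the quick\tbrown".
```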
For a given input phrase, the tokenized version is used to predict the next word as follows:
- Each n-gram database for which enough prefix tokens are available is queried.
- The matching next_word values and their counts are retrieved.
- The penta-gram counts are given a weight of 1, and the remaining counts are weighted as \[ 0.6^{5 - n} \] where \(n\) is the n-gram length.
- The weighted counts are combined per candidate word to select the single predicted next word.
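A minimal sketch of that weighted back-off scoring, assuming a hypothetical lookup(prefix, n) function in place of the actual RcppLevelDB query (expected to return a data frame with next_word and count columns, or NULL when nothing matches):

```r
# Weighted back-off scoring sketch: query each n-gram store the prefix allows,
# weight its counts by 0.6^(5 - n), and pick the highest-scoring candidate word.
predict_next <- function(tokens, lookup, max_n = 5, alpha = 0.6) {
  scores <- numeric(0)
  for (n in 2:max_n) {
    if (length(tokens) < n - 1) break                 # not enough prefix tokens
    prefix <- paste(tail(tokens, n - 1), collapse = " ")
    hits   <- lookup(prefix, n)
    if (is.null(hits) || nrow(hits) == 0) next
    w <- alpha ^ (max_n - n)                          # penta-gram weight = 1
    for (j in seq_len(nrow(hits))) {
      word <- as.character(hits$next_word[j])
      prev <- if (is.na(scores[word])) 0 else scores[word]
      scores[word] <- prev + w * hits$count[j]
    }
  }
  if (length(scores) == 0) return(NA_character_)
  names(scores)[which.max(scores)]                    # highest combined score wins
}
```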
Novel test sets were extracted from the sentences used in the quizzes, a news article, and tweets containing #today. For each word in each line, the preceding portion of the line was passed to the next-word predictor.
Correct predictions were counted using only the di-gram database and then after adding each additional n-gram database to the model. An accuracy of ~20% was achieved on the quiz sentences, but ~15% is more typical on bulk text.
| File | #Preds | Di-Gram Only | Up to Tri-Gram | Up to Tetra-Gram | Up to Penta-Gram |
|---|---|---|---|---|---|
| quizes_sentances.txt | 282 | 41 | 56 | 57 | 57 |
| news_charm | 382 | 48 | 64 | 66 | 66 |
| twitter_today | 277 | 33 | 39 | 39 | 41 |
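A rough illustration of how such per-word accuracy counts can be produced, assuming the tokenize() and predict_next() helpers sketched above:

```r
# Count how often the predictor's top word matches the actual next word.
evaluate_lines <- function(lines, lookup) {
  correct <- 0L
  total   <- 0L
  for (line in lines) {
    toks <- tokenize(line)
    for (i in seq_along(toks)[-1]) {                  # predict every word after the first
      pred    <- predict_next(toks[seq_len(i - 1)], lookup)
      correct <- correct + as.integer(identical(pred, toks[i]))
      total   <- total + 1L
    }
  }
  c(correct = correct, total = total, accuracy = correct / total)
}
```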
Open the Word Predictor Shiny App and follow the directions: enter text in the input line, and the predicted next word is displayed below it.
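For reference, a minimal Shiny skeleton of that interaction (hypothetical, not the app's actual source; the predictor is stubbed so the skeleton runs on its own):

```r
library(shiny)

# The real app wires this input to the back-off predictor described above;
# predict_next_word() is only a placeholder here.
predict_next_word <- function(phrase) {
  if (!nzchar(trimws(phrase))) "" else "the"          # stub prediction
}

ui <- fluidPage(
  textInput("phrase", "Enter text:"),
  textOutput("next_word")
)

server <- function(input, output) {
  output$next_word <- renderText(predict_next_word(input$phrase))
}

shinyApp(ui, server)
```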