Wordchomp

is fast, scalable, and uses a "chewing" method to save memory.

Go to the Shiny app here. Enter text in the textbox.

Wordchomp takes user input and dynamically returns a ranked table of up to 20 expected next words; relative confidence in each prediction; and a wordcloud illustrating those expectations.

Algorithm

Stupid backoff based on 100% of the dataset, using 4grams and trimming out words that appear less than 6 times. Words were stemmed before building the model.

Optimized for speed and memory use with good accuracy. The accuracy is lower than what's achievable with less trimming, no stemming, and larger ngrams; but the dataset used here is only about 7.5 MB! This lets it run on even old and underpowered phones.

Algorithm details

All ngrams are stored in a single data table, indexed by frequency. Goes down the table finding entries that start with the input. If it finds none, it chews off the first word of the input string and tries again.

Because a ngram is always more common than the corresponding n+1gram, it doesn't match entries with too many words. Because there is a trailing space in the search string, it doesn't match entries with too few words. This means there's no need to explicitly switch ngram tables or store an n column!

Performance

Results on Hernán Foffani's benchmark, using 100% of blogs and tweets:

Overall top-3 score: 15.17 %

Overall top-1 precision: 11.66 %

Overall top-3 precision: 18.29 %

Average runtime: 84.93 msec

Number of predictions: 28464

Total memory used: 109.76 MB

Leaving in words that appear 2-5 times and using non-stemmed words increases top-3 score and top-3 precision by about 1%, but triples runtime and memory use.