Word Prediction

Alexey Nepomnyashchiy
August 23, 2015

What’s the problem, doc?

Word prediction is one of the most common problems in natural language processing.

  • Basic approach: predict the most probable next word from the last several words (see the sketch after this list).
  • Problem: too many words, too many possible combinations, too many writing styles.
  • Solution: use suitable text sources and apply smoothing methods.
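
For illustration, here is a minimal sketch of the basic approach (not the project's actual code): count n-grams and suggest the most frequent continuation of the current context. The toy corpus and the trigram order are assumptions.

    from collections import Counter

    # Toy trigram model: count how often each word follows a two-word
    # context, then suggest the most frequent continuation.
    corpus = "the cat sat on the mat the cat sat on the sofa".split()
    counts = Counter(zip(corpus, corpus[1:], corpus[2:]))

    def predict(w1, w2):
        """Return the most frequent word seen after the bigram (w1, w2)."""
        candidates = {w3: c for (a, b, w3), c in counts.items()
                      if (a, b) == (w1, w2)}
        return max(candidates, key=candidates.get) if candidates else None

    print(predict("sat", "on"))  # -> 'the'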

How was it done?

  • Three corpora of ~20 MB each: samples from news, blogs, and Twitter messages (diversity!).
  • Preprocessing: the corpora were split into sentences; profanity, hashtags, and punctuation were filtered out; URLs, numbers, dates, etc. were replaced with special tokens (sketched after this list).
  • Words that occur only once were replaced with the UNK token.
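
A rough sketch of such a preprocessing pipeline follows. The token names (<URL>, <NUM>, <DATE>, <UNK>), the regular expressions, and the placeholder profanity list are all illustrative assumptions, not the project's actual rules.

    import re
    from collections import Counter

    PROFANITY = {"badword"}  # placeholder; the real filter list isn't given

    def preprocess(text):
        """Turn raw text into per-sentence token lists, mirroring the steps above."""
        text = text.lower()
        text = re.sub(r"https?://\S+", " <URL> ", text)          # URLs -> token
        text = re.sub(r"\d+[./-]\d+[./-]\d+", " <DATE> ", text)  # crude date pattern
        text = re.sub(r"\d+", " <NUM> ", text)                   # numbers -> token
        result = []
        for s in re.split(r"[.!?]+", text):                      # sentence split
            s = re.sub(r"#\w+", " ", s)                          # drop hashtags
            s = re.sub(r"[^\w<>]+", " ", s)                      # drop punctuation
            tokens = [t for t in s.split() if t not in PROFANITY]
            if tokens:
                result.append(tokens)
        return result

    def replace_singletons(sentences):
        """Replace words seen only once in the whole corpus with <UNK>."""
        freq = Counter(t for sent in sentences for t in sent)
        return [[t if freq[t] > 1 else "<UNK>" for t in sent]
                for sent in sentences]

    print(replace_singletons(preprocess("I like NLP. Visit http://example.com")))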

More details

  • Two smoothing methods: absolute discounting and interpolated modified Kneser-Ney (the current state of the art); absolute discounting is sketched after this list.

  • Probabilities were calculated for 1- to 5-grams, but when exporting the data model, 3- to 5-grams that occurred only once in the corpora were filtered out.

  • Only the most frequent continuation of each context was exported, so exactly one word is suggested.

  • Query words that aren't found among the unigrams are substituted with UNK.
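
Of the two methods, absolute discounting is the simpler one to illustrate. Below is a sketch of its interpolated bigram form: every observed count is reduced by a constant d, and the freed probability mass is redistributed over the unigram distribution. The project smoothed up to 5-grams; the bigram order, the toy corpus, and d = 0.75 here are assumptions. Modified Kneser-Ney refines this scheme with continuation counts and multiple discount values.

    from collections import Counter

    def absolute_discounting(tokens, d=0.75):
        """Bigram probabilities with interpolated absolute discounting."""
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        total = len(tokens)
        # Number of distinct words observed after each history word.
        followers = Counter(h for (h, _) in bigrams)

        def prob(h, w):
            # Discount every observed bigram count by d ...
            discounted = max(bigrams[(h, w)] - d, 0) / unigrams[h]
            # ... and give the freed mass to the unigram distribution.
            backoff_weight = d * followers[h] / unigrams[h]
            return discounted + backoff_weight * unigrams[w] / total

        return prob

    p = absolute_discounting("the cat sat on the mat the cat ran".split())
    print(p("the", "cat"))  # higher than p("the", "ran"), an unseen bigram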

So what?

  • The algorithm is very straightforward, but it produces sensible suggestions.
  • Accuracy was traded for size (the uncompressed .CSV files occupy ~22 MB per smoothing method) and speed (0.001-0.03 s per lookup on an iMac Core i5); a lookup sketch follows this list.
  • The much more sophisticated modified Kneser-Ney performs about as well as simple absolute discounting.
  • Using a corpus derived from the same source where the application is meant to be deployed (e.g. tweets for message suggestion) may improve accuracy far more than a complicated algorithm.
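
To make the lookup step concrete, here is a hypothetical sketch of querying the exported model. The CSV layout (a "context" column and a "word" column) and the file name are assumptions, not the project's actual export format.

    import csv

    def load_model(path):
        """Map each stored context to its single best continuation."""
        with open(path, newline="") as f:
            return {row["context"]: row["word"] for row in csv.DictReader(f)}

    def suggest(model, text, max_order=5):
        """Back off from the longest stored context (4 words) to shorter ones."""
        words = text.lower().split()
        for n in range(min(max_order - 1, len(words)), 0, -1):
            context = " ".join(words[-n:])
            if context in model:
                return model[context]
        return None

    # model = load_model("kneser_ney_model.csv")  # hypothetical file name
    # print(suggest(model, "thanks for the"))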

Try it out!