Marek Ostaszewski
9/12/2017
The problem
Users need accurate systems that suggest the next word as they write.
This greatly speeds up typing, especially on mobile devices.
Source data
Ngram corpora
Three models were considered for next word prediction using the constructed ngram corpora:
The following steps were taken to ensure the algorithm runs smoothly:
With the optimizations above, the three corpora take 14.1 MB, 14.5 MB and 11 MB of disk space (39.6 MB in total), and 142.6 MB, 141 MB and 110 MB of memory (393.6 MB in total).
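The optimization steps themselves are not listed here; as a rough illustration of the kind of compact storage involved, below is a minimal R sketch that keeps an ngram corpus as a keyed data.table of prefix counts and saves it to disk with compression. The function names, column names and tokenization are assumptions, not the actual implementation.

```r
# Minimal sketch (not the project's code): store prefix -> next-word counts
# in a data.table keyed by prefix for fast lookup, and save it compressed.
library(data.table)

build_ngram_table <- function(tokens, n = 3) {
  # slide a window of length n over the token stream
  idx <- seq_len(length(tokens) - n + 1)
  dt <- data.table(
    prefix    = vapply(idx, function(i) paste(tokens[i:(i + n - 2)], collapse = " "), ""),
    next_word = tokens[idx + n - 1]
  )
  # aggregate duplicates into counts; key by prefix for binary-search lookup
  dt <- dt[, .(count = .N), by = .(prefix, next_word)]
  setkey(dt, prefix)
  dt
}

predict_next <- function(dt, prefix, k = 3) {
  # return up to k most frequent continuations of the given prefix
  head(dt[.(prefix)][order(-count), next_word], k)
}

tokens <- c("i", "want", "to", "go", "to", "the", "store", "to", "buy", "milk")
tri <- build_ngram_table(tokens, n = 3)
predict_next(tri, "go to")
# saveRDS(tri, "trigrams.rds", compress = "xz")  # compressed on-disk storage
```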
The accuracy of the algorithm was evaluated as follows:
                        Test set: Blogs   Test set: News   Test set: Twitter
Ngram corpus: Blogs     0.276             0.243            0.245
Ngram corpus: News      0.247             0.299            0.222
Ngram corpus: Twitter   0.220             0.214            0.297
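The exact accuracy metric is not spelled out above; a plausible reading is the fraction of held-out next words that the model predicts correctly. A minimal sketch under that assumption follows; the test-set layout and the predictor interface are hypothetical.

```r
# Sketch of a possible accuracy measure: the fraction of held-out ngrams
# whose true next word appears among the model's top-k suggestions.
# The prefix / next-word vectors and predict_fun interface are assumptions.
evaluate_accuracy <- function(predict_fun, test_prefixes, test_next_words, k = 3) {
  hits <- mapply(function(p, w) w %in% predict_fun(p, k),
                 test_prefixes, test_next_words)
  mean(hits)
}

# Example, reusing the hypothetical predict_next()/tri objects sketched earlier:
# evaluate_accuracy(function(p, k) predict_next(tri, p, k),
#                   test_prefixes  = c("go to", "want to"),
#                   test_next_words = c("the", "go"))
```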
In all cases the best-performing algorithm was Kneser-Ney smoothing (see the sketch below).
Context is an important factor for accuracy: ngram corpora built from a given type of source text perform best on their matching test sets.
Runtime performance evaluation is available directly via the Shiny app (see next slide).
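For reference, a minimal sketch of interpolated Kneser-Ney smoothing for bigrams (absolute discounting plus a continuation-probability term). The bigram table layout and the discount value are illustrative assumptions, not the project's actual code.

```r
# Sketch of interpolated Kneser-Ney for bigrams; 'bigrams' is assumed to be a
# data.frame with one row per bigram type: (w1, w2, count).
kneser_ney_prob <- function(bigrams, w1, w2, d = 0.75) {
  c_w1      <- sum(bigrams$count[bigrams$w1 == w1])                   # count of history w1
  c_w1w2    <- sum(bigrams$count[bigrams$w1 == w1 & bigrams$w2 == w2])
  n_follow  <- sum(bigrams$w1 == w1)                                  # distinct words after w1
  n_precede <- sum(bigrams$w2 == w2)                                  # distinct histories before w2
  n_bigrams <- nrow(bigrams)                                          # distinct bigram types
  if (c_w1 == 0) return(n_precede / n_bigrams)                        # unseen history: continuation prob
  lambda <- d * n_follow / c_w1                                       # interpolation weight
  max(c_w1w2 - d, 0) / c_w1 + lambda * n_precede / n_bigrams
}

bigrams <- data.frame(
  w1    = c("go", "go", "want", "to", "to"),
  w2    = c("to", "home", "to", "the", "buy"),
  count = c(5, 2, 3, 4, 1)
)
kneser_ney_prob(bigrams, "go", "to")
```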
The Shiny app with the prediction algorithm is available here.
Three constructed ngram corpora and three methods are available for prediction
Predictions are proposed while typing; three buttons allow quickly adding the most probable next words to the text field
The slider controls the number of predicted words
When the Profile runtime checkbox is ticked, Rprof() is run during the prediction and its results are displayed for each query (see the sketch below)
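A rough sketch of how the Profile runtime checkbox could wrap a prediction call with Rprof(); the widget IDs and the predict_words() function are hypothetical placeholders, not the app's actual source.

```r
# Sketch: toggle the sampling profiler around a (hypothetical) prediction call
# and show summaryRprof() output next to the predictions.
library(shiny)

ui <- fluidPage(
  textInput("text", "Type your text"),
  sliderInput("n_words", "Number of predicted words", min = 1, max = 5, value = 3),
  checkboxInput("profile", "Profile runtime", value = FALSE),
  textOutput("predictions"),
  verbatimTextOutput("profile_out")
)

server <- function(input, output) {
  result <- reactive({
    if (isTRUE(input$profile)) {
      prof_file <- tempfile()
      Rprof(prof_file)                                   # start sampling profiler
      preds <- predict_words(input$text, input$n_words)  # hypothetical predictor
      Rprof(NULL)                                        # stop profiling
      list(preds = preds, profile = summaryRprof(prof_file)$by.self)
    } else {
      list(preds = predict_words(input$text, input$n_words), profile = NULL)
    }
  })
  output$predictions <- renderText(paste(result()$preds, collapse = " | "))
  output$profile_out <- renderPrint(result()$profile)
}

# shinyApp(ui, server)
```

Note that Rprof() is a sampling profiler, so a query that returns faster than the sampling interval may yield an empty profile; that is a limitation of this sketch rather than a statement about the app's behaviour.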