Next Likely Word

Donald Hescht
31-Dec-2017

Coursera
Johns Hopkins University
Data Science Specialization
Capstone Project

Executive Summary

This challenging Data Science Specialization capstone project produces a next-word predictor, similar to SwiftKey. I implemented the Stupid Back Off model, which scores candidate words with a fixed discount applied at each N-Gram back-off, thereby favoring frequently occurring word combinations. The training text came from Twitter, blog, and news feeds. The user application is hosted on a Shiny server, allowing global user access.

Model Explanation

  • Stupid Back Off Model
    Prototyped Kneser-Ney interpolation to smooth probabilities, but after comparing it with Stupid Back Off (SBO) decided that SBO was much simpler to maintain and provided adequate accuracy. Its coding effort and size were roughly half those of Kneser-Ney, which makes future maintenance easier. The back-off discount was 0.4 per level of back-off; a minimal scoring sketch appears after this list.
  • 5 N-Gram Model
    Built the model starting with 3-grams and moved up to 5-grams to improve accuracy (the construction of the count tables is sketched after this list).
  • Misspelling Correction
    Added a uni-gram back-off to recover from misspelled or out-of-vocabulary words.
  • Hash Key
    I converted the N-Gram prefix from the W_W_W_W format to a 32-bit hash key using the “digest” library. This significantly improved speed (635 ms to 28 ms) and reduced memory (130 MB to 76 MB), with a small reduction in top-3 precision (22% to 20%). A hashing sketch appears after this list.
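
Below is a minimal sketch of the Stupid Back Off scoring described above. The table layout is an assumption: ngram_counts is a hypothetical list of named count vectors, one per N-Gram order, keyed by the underscore-joined gram; the function name is illustrative rather than the exact project code.

    # Score one candidate completion with Stupid Back Off.
    # 'phrase' is the prefix plus the candidate word, joined with "_",
    # e.g. "i_love_new_york"; 'ngram_counts[[k]]' maps "w1_..._wk" -> count.
    sbo_score <- function(phrase, ngram_counts, lambda = 0.4) {
      words <- strsplit(phrase, "_")[[1]]
      n <- length(words)
      if (n >= 2) {
        for (k in n:2) {
          gram   <- paste(tail(words, k), collapse = "_")
          prefix <- paste(head(tail(words, k), k - 1), collapse = "_")
          num <- ngram_counts[[k]][gram]
          den <- ngram_counts[[k - 1]][prefix]
          if (!is.na(num) && !is.na(den) && den > 0) {
            # one 0.4 discount for every level backed off from the highest order
            return(unname(lambda^(n - k) * num / den))
          }
        }
      }
      # final uni-gram back-off: relative frequency of the candidate word itself
      uni <- ngram_counts[[1]][tail(words, 1)]
      unname(lambda^(n - 1) * ifelse(is.na(uni), 0, uni) / sum(ngram_counts[[1]]))
    }

The predictor can then rank candidate words by this score and return the top suggestions.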
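
The 1- through 5-gram count tables can be built with quanteda roughly as follows; train_corpus and the cleaning options are assumptions standing in for the project's actual preprocessing of the Twitter, blog, and news text.

    library(quanteda)

    # 'train_corpus' stands in for the cleaned training corpus
    toks <- tokens(train_corpus, remove_punct = TRUE, remove_numbers = TRUE)

    # one named count vector per order: "w1_w2_..._wk" -> frequency
    ngram_counts <- lapply(1:5, function(k) {
      grams <- tokens_ngrams(toks, n = k, concatenator = "_")
      colSums(dfm(grams))
    })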
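
For the hash key, one way to reproduce the idea with the digest package is sketched below. The "xxhash32" algorithm and the helper name are assumptions (any of digest's 32-bit algorithms would serve), and the project may store the key as an integer rather than a hex string.

    library(digest)

    # Map a "W_W_W_W" prefix to a 32-bit hash key (an 8-character hex string).
    prefix_key <- function(prefix) {
      digest(prefix, algo = "xxhash32", serialize = FALSE)
    }

    prefix_key("i_love_new_york")  # e.g. used as the lookup key in the N-Gram tables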

Closing Thoughts

Next Versions

  • Add skip-grams to capture wider context (research underway)
  • Use a pruned common vocabulary file (uni-gram) to filter and reduce the data (see the sketch after this list)
  • Improve the hash by using C++ and 64-bit integers rather than a character vector
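
One possible shape for the pruned-vocabulary step mentioned above, reusing the hypothetical ngram_counts tables from the earlier sketches:

    # Keep only uni-grams seen at least 'min_count' times, then drop any
    # higher-order N-Gram containing a word outside that pruned vocabulary.
    prune_vocabulary <- function(ngram_counts, min_count = 5) {
      keep <- names(ngram_counts[[1]])[ngram_counts[[1]] >= min_count]
      lapply(ngram_counts, function(tbl) {
        ok <- vapply(strsplit(names(tbl), "_"),
                     function(w) all(w %in% keep), logical(1))
        tbl[ok]
      })
    }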

Acknowledgements

  • Thanks to Phil Ferriere (April 2016) for ideas
  • Thanks to quanteda for the awesome work on corpora and N-Grams
  • The data for the project is here

How to Use

User Screen