Next Likely Word

Donald Hescht
31-Dec-2017

Coursera
Johns Hopkins University
Data Science Specialization
Capstone Project

Executive Summary

This challenging Data Science Specialization capstone project produces a next-word predictor, similar to SwiftKey. I implemented the Stupid Back Off model, which scores candidate words with a fixed discount applied at each N-Gram back-off, thereby favoring frequently occurring word combinations. The training text came from Twitter, blog, and news feeds. The user application is hosted on a Shiny server, allowing global user access.

Model Explanation

  • Stupid Back Off Model
    Prototyped Kneser-Ney interpolation to smooth probabilities, but after comparing it with Stupid Back Off (SBO) decided that SBO was much simpler to maintain and provided adequate accuracy. Its coding effort and size were roughly half those of Kneser-Ney, which makes future maintenance easier. The back-off discount was 0.4 per level of back-off; a minimal scoring sketch appears after this list.
  • 5 N-Gram Model
    Built the model starting with 3-grams and moved up to 5-grams to improve accuracy (the construction of the count tables is sketched after this list).
  • Misspelling Correction
    Added a uni-gram back-off to recover from misspelled or out-of-vocabulary words.
  • Hash Key
    I converted the N-Gram prefix from the W_W_W_W format to a 32-bit hash key using the “digest” library. This significantly improved speed (635 ms to 28 ms) and reduced memory (130 MB to 76 MB), with a small reduction in top-3 precision (22% to 20%). A hashing sketch appears after this list.
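
Below is a minimal sketch of the Stupid Back Off scoring described above. The table layout is an assumption: ngram_counts is a hypothetical list of named count vectors, one per N-Gram order, keyed by the underscore-joined gram; the function name is illustrative rather than the exact project code.

    # Score one candidate completion with Stupid Back Off.
    # 'phrase' is the prefix plus the candidate word, joined with "_",
    # e.g. "i_love_new_york"; 'ngram_counts[[k]]' maps "w1_..._wk" -> count.
    sbo_score <- function(phrase, ngram_counts, lambda = 0.4) {
      words <- strsplit(phrase, "_")[[1]]
      n <- length(words)
      if (n >= 2) {
        for (k in n:2) {
          gram   <- paste(tail(words, k), collapse = "_")
          prefix <- paste(head(tail(words, k), k - 1), collapse = "_")
          num <- ngram_counts[[k]][gram]
          den <- ngram_counts[[k - 1]][prefix]
          if (!is.na(num) && !is.na(den) && den > 0) {
            # one 0.4 discount for every level backed off from the highest order
            return(unname(lambda^(n - k) * num / den))
          }
        }
      }
      # final uni-gram back-off: relative frequency of the candidate word itself
      uni <- ngram_counts[[1]][tail(words, 1)]
      unname(lambda^(n - 1) * ifelse(is.na(uni), 0, uni) / sum(ngram_counts[[1]]))
    }

The predictor can then rank candidate words by this score and return the top suggestions.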
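
The 1- through 5-gram count tables can be built with quanteda roughly as follows; train_corpus and the cleaning options are assumptions standing in for the project's actual preprocessing of the Twitter, blog, and news text.

    library(quanteda)

    # 'train_corpus' stands in for the cleaned training corpus
    toks <- tokens(train_corpus, remove_punct = TRUE, remove_numbers = TRUE)

    # one named count vector per order: "w1_w2_..._wk" -> frequency
    ngram_counts <- lapply(1:5, function(k) {
      grams <- tokens_ngrams(toks, n = k, concatenator = "_")
      colSums(dfm(grams))
    })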
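
For the hash key, one way to reproduce the idea with the digest package is sketched below. The "xxhash32" algorithm and the helper name are assumptions (any of digest's 32-bit algorithms would serve), and the project may store the key as an integer rather than a hex string.

    library(digest)

    # Map a "W_W_W_W" prefix to a 32-bit hash key (an 8-character hex string).
    prefix_key <- function(prefix) {
      digest(prefix, algo = "xxhash32", serialize = FALSE)
    }

    prefix_key("i_love_new_york")  # e.g. used as the lookup key in the N-Gram tables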

Closing Thoughts

Next Versions

  • Add skip-grams to capture wider context (research underway)
  • Use a pruned common vocabulary file (uni-gram) to filter and reduce the data (see the sketch after this list)
  • Improve the hash by using C++ and 64-bit integers rather than a character vector
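
One possible shape for the pruned-vocabulary step mentioned above, reusing the hypothetical ngram_counts tables from the earlier sketches:

    # Keep only uni-grams seen at least 'min_count' times, then drop any
    # higher-order N-Gram containing a word outside that pruned vocabulary.
    prune_vocabulary <- function(ngram_counts, min_count = 5) {
      keep <- names(ngram_counts[[1]])[ngram_counts[[1]] >= min_count]
      lapply(ngram_counts, function(tbl) {
        ok <- vapply(strsplit(names(tbl), "_"),
                     function(w) all(w %in% keep), logical(1))
        tbl[ok]
      })
    }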

Acknowledgements

  • Thanks to Phil Ferriere (April 2016) for ideas
  • Thanks to quanteda for the awesome work on corpora and N-Grams
  • The data for the project is here

How to Use

User Screen