NLPTextPrediction: In Fulfillment of the Coursera Data Science Capstone Project

Jeff Gross
2016-12-29

Textual Next-Word Prediction: Hybrid Approach

This Data Science Capstone project leverages several innovative techniques learned in prior courses and through research for this project:

  • Memory- and CPU-efficient 'reactive' Shiny.io web app using pre-built data.table objects
  • Like modern spell checkers, the prediction is a SET of words ranked in decreasing order of probability
  • Multi-level Katz backoff (5-gram down to Bigram) with all levels evaluated
  • Search is augmented with a simple Bag of Words (BoW) search that re-uses the same data.table objects

Design Decisions

  • Sampling and sentence-parsing of source documents into Train and Test files was done with seeded pseudo-random Unix scripts
  • DocumentTermMatrix construction and text normalization started with Weka, but memory problems eventually led to the tm package
  • Transforms include lower casing, stemming, and removal of numbers and punctuation. Stopwords were retained for ngram matching but removed for Bag of Words searches
  • The indexing feature of data.table promised better performance than Markov chain objects
  • BoW searches cannot use indexing, but a non-consuming (lookahead) regex “AND” search is quite fast
  • Web UI provides a choice between a simplified list of predicted words and a more detailed history of the prediction process
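
The non-consuming regex “AND” search mentioned above can be illustrated with lookahead assertions. The following is a minimal sketch in Python rather than the project's R code; the function name `bow_and_pattern` and the sample `pre_grams` list are hypothetical and chosen only for illustration.

```python
import re

def bow_and_pattern(words):
    """Build a regex that matches only when every word occurs, in any order.

    Each word becomes a lookahead, which tests for the word without
    consuming any text, so the lookaheads combine as a logical AND.
    """
    lookaheads = "".join(rf"(?=.*\b{re.escape(w)}\b)" for w in words)
    return re.compile(lookaheads)

# Hypothetical predictor strings standing in for the pre_gram column.
pre_grams = ["happy new year to", "new car for the", "year of the happy"]

# Stopwords are assumed to have been removed from the query already.
pattern = bow_and_pattern(["happy", "year"])
matches = [g for g in pre_grams if pattern.search(g)]
```

Because word order is ignored, both "happy new year to" and "year of the happy" match, while "new car for the" does not. Every row has to be scanned, which is why this kind of search cannot benefit from a keyed index.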

Prediction Algorithm

  • A test harness (Step 4 in github) was used to select and split a sentence from the Training data, and the next word was compared to the prediction
  • data.table objects were built containing the predictor text (“pre_gram”) and the predicted text (“post_gram”) along with the Bayesian probability and the overall post_gram probability (for tie resolution)
  • Tuning consisted of manually optimizing the sequence of the fallbacks. The final sequence is: 5gram, 4gram, Trigram, 5BoW, 4BoW, Bigram
  • Predicted probabilities include a weighting factor of 0.4^fallback_level, a typical approximation of the Katz backoff approach
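
The weighting described above can be sketched as follows. This is an illustrative Python sketch, not the project's R implementation: the function `predict`, the nested-dictionary layout of the tables, and the rule of keeping the highest weighted score per word are all assumptions, and the BoW levels are omitted for brevity.

```python
BACKOFF = 0.4  # weighting factor stated in the text

def predict(tokens, levels, top_k=3):
    """Rank candidate next words across all fallback levels.

    levels: list of (n, table) pairs ordered from the first level tried
    to the last fallback; each table maps a pre_gram tuple of n-1 words
    to a dict of {post_gram: probability}.
    """
    scores = {}
    for fallback_level, (n, table) in enumerate(levels):
        pre_gram = tuple(tokens[-(n - 1):])
        for word, prob in table.get(pre_gram, {}).items():
            weighted = prob * BACKOFF ** fallback_level
            # Assumption: a word keeps its best weighted score.
            if weighted > scores.get(word, 0.0):
                scores[word] = weighted
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]

# Hypothetical toy tables: a trigram level followed by a bigram fallback.
levels = [
    (3, {("new", "year"): {"eve": 0.5, "resolution": 0.3}}),
    (2, {("year",): {"old": 0.6, "eve": 0.2}}),
]
result = predict(["happy", "new", "year"], levels)
```

Here the trigram level contributes "eve" and "resolution" at full weight, while the bigram level's "old" is scaled by 0.4, so the ranking is "eve", "resolution", "old".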

Web App and Source Code