Data Science Capstone

Yiu Chung Wong
12-Jan-2018

The app is here

The Prediction Application

Overview

  • Predicts the next English word based on the preceding words
  • Each predicted word is assigned a probability score
  • The app outputs a list of probable words and a wordcloud

How to use

  1. Enter a sentence into the text field (four or more words for better accuracy)
  2. Voila!

The algorithm

  • The app employs a simple back-off algorithm, "Stupid Backoff" (Brants, Popat, Xu, Och, & Dean, 2007)
  • The corpus is used to construct a 5-gram model
  • The algorithm first looks for matching 5-grams in the 5-gram database
  • It then recursively backs off to the lower-order n-gram databases to look for additional matches
  • Finally, it looks for the most frequent words in the unigram database
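
The back-off lookup described above can be sketched in a few lines of Python. This is only an illustration, not the app's actual code: the names `build_tables` and `predict`, the toy corpus, and the in-memory dictionaries standing in for the n-gram databases are all assumptions. The back-off factor of 0.4 is the value suggested by Brants et al. (2007); the slides do not say which value the app uses.

```python
from collections import Counter, defaultdict

ALPHA = 0.4  # back-off factor from Brants et al. (2007); assumed, not taken from the app


def build_tables(tokens, max_n=5):
    """Count every n-gram (n = 1..max_n): tables[n][context][word] = count."""
    tables = {n: defaultdict(Counter) for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            tables[n][tuple(context)][word] += 1
    return tables


def predict(tables, words, max_n=5, top=5):
    """Score candidate next words, trying the highest-order table first."""
    scores = {}
    penalty = 1.0
    for n in range(max_n, 0, -1):
        context = tuple(words[-(n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # input is too short for this order; no penalty applied
        followers = tables[n].get(context)
        if followers:
            total = sum(followers.values())
            for word, count in followers.items():
                # Keep the score from the highest order that saw this word.
                scores.setdefault(word, penalty * count / total)
        penalty *= ALPHA  # every step down to a lower order is discounted
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Toy corpus, purely for illustration.
corpus = "the cat sat on the mat the cat sat on the sofa the cat ran".split()
tables = build_tables(corpus)
print(predict(tables, ["the", "cat", "sat", "on", "the"]))
```

Words found in the 5-gram table keep their full relative frequency, while words that only appear in lower-order tables are discounted once per back-off step, so longer matches rank first.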

Performance

  • The application takes advantage of pre-computation
  • That is, the probabilities and scores of all word combinations in the corpus are computed in advance, so a prediction is a simple lookup with no calculation at run time

  • Since we only care about the predictions with the highest probabilities, only a handful of n-grams need to be kept in each database below the 5-gram level: each of these databases holds at most k entries that share the same starting words (k is set to 5 in this application, an arbitrary choice)

  • A performance benchmark can be found here
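
The pre-computation and pruning idea can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the app's implementation: the function name `precompute`, the input format (a dictionary mapping each n-gram tuple to its count), and the use of relative frequency as the score are all hypothetical.

```python
from collections import Counter, defaultdict
import heapq

K = 5  # candidates kept per context; the slides describe this value as arbitrary


def precompute(ngram_counts, k=K):
    """Turn raw n-gram counts into a ready-to-read lookup table.

    ngram_counts maps a tuple of words (a full n-gram) to its count.
    Returns {context: [(word, score), ...]} with at most k entries
    per context, best first.
    """
    by_context = defaultdict(Counter)
    for (*context, word), count in ngram_counts.items():
        by_context[tuple(context)][word] = count

    table = {}
    for context, followers in by_context.items():
        # Scores use the total over all followers, computed before pruning.
        total = sum(followers.values())
        best = heapq.nsmallest(
            k, followers.items(), key=lambda kv: (-kv[1], kv[0])
        )
        table[context] = [(word, count / total) for word, count in best]
    return table


# Hypothetical bigram counts, purely for illustration.
counts = {
    ("the", "cat"): 3,
    ("the", "dog"): 2,
    ("the", "mat"): 1,
    ("the", "sofa"): 1,
    ("a", "cat"): 1,
}
lookup = precompute(counts, k=2)
print(lookup[("the",)])
```

Because scores are computed before the low-ranked entries are dropped, pruning shrinks the table without changing the scores of the candidates that remain, and a prediction at run time is a single dictionary lookup.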