Data Science Capstone
Yiu Chung Wong
12-Jan-2018
The Prediction Application
Overview
- Predicts the next English word based on the preceding words
- Each predicted word is assigned a probability score
- The app outputs a list of probable words and a wordcloud
How to use
- Enter a sentence into the text field (four or more words for better accuracy)
- Voila!
The algorithm
- The app employs the "Stupid Backoff" algorithm (Brants, Popat, Xu, Och, & Dean, 2007)
- The corpus is used to construct a 5-gram model
- The algorithm first looks for matching 5-grams in the 5-gram database
- It then recursively backs off to the lower-order n-gram databases to look for additional matches
- Finally, it falls back to the most frequent words in the unigram database
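The steps above can be sketched in a few lines of Python. This is only an illustration, not the app's actual code: the names `build_model`, `predict`, and `ALPHA` are hypothetical, the 0.4 penalty is the value suggested by Brants et al. (2007) and is assumed here, and the toy corpus stands in for the real one.

```python
from collections import Counter, defaultdict

ALPHA = 0.4  # back-off penalty from Brants et al. (2007); assumed, not taken from the app


def build_model(tokens, max_order=5):
    """Map each order n to {context tuple: Counter of next words}."""
    model = {n: defaultdict(Counter) for n in range(1, max_order + 1)}
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i:i + n]
            model[n][tuple(context)][word] += 1
    return model


def predict(model, words, max_order=5, top=5):
    """Score next-word candidates, backing off from the longest usable context."""
    scores = {}
    penalty = 1.0
    # Start at the highest order the input can support (n-1 context words).
    for n in range(min(max_order, len(words) + 1), 0, -1):
        context = tuple(words[len(words) - (n - 1):])  # empty tuple for unigrams
        followers = model[n].get(context)
        if followers:
            # Number of times this context was followed by any word.
            total = sum(followers.values())
            for word, count in followers.items():
                # Keep the score from the highest order that saw the word.
                scores.setdefault(word, penalty * count / total)
        penalty *= ALPHA  # each back-off step is discounted
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top]


corpus = "the cat sat on the mat the cat sat on the sofa".split()
model = build_model(corpus)
print(predict(model, ["the", "cat", "sat", "on", "the"]))
```

Words found at a higher order keep that score; lower orders only contribute words not seen yet, with a smaller weight for every back-off step.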
Performance
- The application takes advantage of pre-computation
- The probabilities and scores of all word combinations in the corpus are computed in advance, so a prediction is a simple lookup with no calculation needed
- Since only the predictions with the highest probabilities matter, each database below the 5-gram keeps only a handful of n-grams: at most k entries share the same starting features (k = 5 in this application; the value is arbitrary)
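As a rough sketch of this pruning step, assuming the n-gram counts are held in a dictionary keyed by word tuples (the function name `prune_top_k` and the sample counts are made up for illustration):

```python
from collections import Counter, defaultdict


def prune_top_k(ngram_counts, k=5):
    """For each context, keep only the k most frequent next words with their scores."""
    by_context = defaultdict(Counter)
    for ngram, count in ngram_counts.items():
        *context, word = ngram
        by_context[tuple(context)][word] = count

    table = {}
    for context, followers in by_context.items():
        # Scores use the full total, so pruning does not change them.
        total = sum(followers.values())
        table[context] = [
            (word, count / total) for word, count in followers.most_common(k)
        ]
    return table


# Hypothetical trigram counts, pruned to the top 2 entries per context.
counts = {
    ("i", "like", "tea"): 3,
    ("i", "like", "coffee"): 5,
    ("i", "like", "milk"): 1,
    ("you", "like", "tea"): 2,
}
table = prune_top_k(counts, k=2)
print(table[("i", "like")])
```

At prediction time the app would then only need to look up the context in such a table; nothing is recomputed.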
A performance benchmark can be found here