JHU Coursera Capstone NLP Project
Chris Harris
09/01/2017
Presenting the all-new Text Predictor 3000! This is the finest app on the market in predictive text analytics. It has been trained on several corpora (news, blogs, and Twitter), so that once you input some text it offers a prediction for the word that follows, along with four runners-up for further insight.
This application implements a Katz-style backoff model, specifically the simpler “stupid backoff” variant (though there's nothing stupid about it), with \( \alpha=0.4 \). For example, given 2 words, we look up the top 3-grams that begin with those 2 words and the top 2-grams that begin with the second word. Each candidate 3-gram is scored by its relative frequency; we also “back off” and score each candidate 2-gram by its relative frequency multiplied by the discount \( \alpha \). Strictly speaking these are scores rather than normalized probabilities, but they rank the candidates just as well. It's that easy!!
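The scoring step described above can be sketched in a few lines. This is a minimal illustration in Python, not the app's actual code: the function names (`build_counts`, `score_candidates`, `predict`) and the toy sentence are assumptions made up for the example, and only the discount value 0.4 comes from the text.

```python
from collections import Counter

ALPHA = 0.4  # backoff discount stated in the text


def build_counts(tokens):
    """Count 1-, 2-, and 3-grams in a list of tokens (hypothetical helper)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return unigrams, bigrams, trigrams


def score_candidates(w1, w2, unigrams, bigrams, trigrams, alpha=ALPHA):
    """Score every word that could follow (w1, w2) using stupid backoff."""
    scores = {}
    # Backed-off level: 2-grams starting with w2, discounted by alpha.
    if unigrams[w2] > 0:
        for (a, b), count in bigrams.items():
            if a == w2:
                scores[b] = alpha * count / unigrams[w2]
    # Top level: 3-grams starting with (w1, w2). A word seen at this level
    # keeps its undiscounted relative frequency and replaces the backed-off score.
    if bigrams[(w1, w2)] > 0:
        for (a, b, c), count in trigrams.items():
            if (a, b) == (w1, w2):
                scores[c] = count / bigrams[(w1, w2)]
    return scores


def predict(w1, w2, counts, k=5):
    """Return the k best next words, highest score first (ties broken alphabetically)."""
    scores = score_candidates(w1, w2, *counts)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy corpus, invented purely for illustration.
tokens = "i love new york i love new shoes i love new york we need new ideas".split()
counts = build_counts(tokens)
top_words = predict("love", "new", counts)
```

In this toy corpus, "york" and "shoes" are scored from 3-grams, while "ideas" only ever follows "new" in a 2-gram, so it receives the discounted score and ranks last.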
Exploratory data analysis was used to strike the right balance between memory and accuracy: we keep only the top 2-grams and 3-grams, enough for about 50-60% coverage. We also store these n-grams as environments backed by hash tables for faster lookup. See the code here.
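As a rough sketch of that trade-off, the snippet below keeps the most frequent n-grams until they reach a target coverage and then indexes them by prefix in a hash table. It assumes "coverage" means the share of all n-gram occurrences accounted for by the kept n-grams, which the text does not spell out. A Python dict stands in for the hashed environments mentioned above, and the names `ngrams_for_coverage` and `build_lookup` are hypothetical.

```python
from collections import Counter


def ngrams_for_coverage(counts, target=0.5):
    """Keep the most frequent n-grams until they cover `target` of all occurrences."""
    total = sum(counts.values())
    kept, running = {}, 0
    for gram, count in counts.most_common():
        if running >= target * total:
            break
        kept[gram] = count
        running += count
    return kept


def build_lookup(kept):
    """Index kept n-grams by prefix: prefix tuple -> list of (next_word, count)."""
    table = {}
    for gram, count in kept.items():
        table.setdefault(gram[:-1], []).append((gram[-1], count))
    return table


# Invented 2-gram counts, for illustration only.
bigram_counts = Counter({("a", "b"): 5, ("a", "c"): 3, ("b", "c"): 1, ("c", "d"): 1})
kept = ngrams_for_coverage(bigram_counts, target=0.8)
lookup = build_lookup(kept)
followers = lookup.get(("a",), [])  # constant-time lookup by prefix
```

Dropping the long tail of rare n-grams is where the memory savings come from: here two of the four 2-grams already account for 80% of all occurrences.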
Manual testing suggests the app most often gets the correct result when the following word is a stop word (at a rate of 1-2 times out of 10, not so different from a human). In cases where the following word is not a stop word, choosing the “Remove stop words” option often produces very plausible candidates, with the right answer often appearing within the top 5 results.
The app is available here. Enter some text and push the “submit” button. The app will return the top five predicted values. The first word should be considered the application's official prediction for the next word.
The app provides various options:
Feel free to experiment with the various options. As the writing style differs considerably between news, Twitter, and blogs, we wanted to let the user customize the results.