Simple Word Prediction App

Coursera Data Science Capstone

Craig Covey
10/6/2016

Summary

The objective of the capstone project is to build a predictive text model. A familiar example of a predictive text model is the set of three word suggestions offered by a smartphone's keyboard. One of the best third-party smartphone keyboard apps is SwiftKey.

Prediction Model

A predictive model takes a series of words or a phrase as input and predicts the most likely next word. The most common method uses n-grams. An n-gram is a contiguous sequence of n words, and an n-gram model counts how often each one occurs in a corpus (body of text). For example, the 2-grams of the sentence “The cow jumps over the moon” are “the cow”, “cow jumps”, “jumps over”, “over the”, and “the moon”, each with a count of how many times that two-word combination occurs. Using the counts of every n-gram in a corpus, one can predict the next word of a given phrase.
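The counting step described above can be sketched in a few lines of Python. This is only an illustration: the app itself may be built differently, and the helper name `ngrams` and the lowercase, whitespace-based tokenization are assumptions made for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The cow jumps over the moon"
tokens = sentence.lower().split()  # assumed tokenization: lowercase, split on spaces

# Count how many times each two-word combination occurs.
bigram_counts = Counter(ngrams(tokens, 2))

for gram, count in bigram_counts.items():
    print(" ".join(gram), count)
```

In this one-sentence example every 2-gram occurs once; on a real corpus the counts differ widely, and those differences are what drive the prediction.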

One issue with n-gram models is that they cannot predict the next word of a phrase that does not appear in the corpus. A popular solution to this problem is an additional algorithm called Stupid Backoff. Stupid Backoff was created by Google and “is inexpensive to train on large data sets and approaches the quality of Kneser-Ney Smoothing as the amount of training data increases.” Stupid Backoff starts with the highest-order n-gram constructed; if no exact match is found, it drops the first word of the phrase, backs off to the (n-1)-gram, and scores again, with the score discounted by a fixed factor at each backoff step. This continues until all n-gram orders have been used. The final result is a ranked list of the most likely next words.
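A minimal sketch of the backoff idea follows, assuming a toy corpus and a backoff factor of 0.4 (the value suggested in the original Stupid Backoff paper). The function names `build_counts` and `predict` are hypothetical, and the linear scan over all counts is chosen for readability rather than speed; it is not the app's actual implementation.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor, as suggested in the Stupid Backoff paper

def build_counts(tokens, max_n):
    """Count every n-gram of order 1..max_n (hypothetical helper)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict(counts, total_tokens, phrase, max_n, top_k=3):
    """Score candidate next words, backing off to shorter contexts."""
    words = phrase.lower().split()
    context = tuple(words[-(max_n - 1):]) if max_n > 1 else ()
    scores = {}
    penalty = 1.0
    while True:
        k = len(context)
        # Denominator: count of the context, or corpus size for unigrams.
        denom = counts.get(context, 0) if k else total_tokens
        if denom > 0:
            for gram, c in counts.items():
                if len(gram) == k + 1 and gram[:k] == context:
                    # Keep the score from the longest context that saw the word.
                    scores.setdefault(gram[-1], penalty * c / denom)
        if k == 0:
            break
        context = context[1:]  # drop the first word and back off
        penalty *= ALPHA       # discount every additional backoff step
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

corpus = "the cow jumps over the moon the cow eats grass".split()
counts = build_counts(corpus, max_n=3)
print(predict(counts, len(corpus), "the cow", max_n=3))
print(predict(counts, len(corpus), "purple moon", max_n=3))
```

The second call shows the point of backing off: “purple moon” never occurs in the toy corpus, so the sketch falls back to the 1-word context “moon” and still returns candidates.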

Application

Features

English 5-gram model with the Stupid Backoff algorithm, built from a 10% sample of the corpus
Removed numbers, symbols, punctuation, and non-English words from the corpus before processing
Removed all n-grams with a frequency of 1 (reduces size for app)
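The preprocessing steps listed above could look roughly like the following sketch. Everything here is an assumption made for illustration: the helper names, the fixed random seed, and especially the filter for non-English words, which is approximated by keeping only tokens made of the letters a-z (with an optional apostrophe) rather than by a dictionary lookup.

```python
import random
import re
import string
from collections import Counter

# Assumed token pattern: ASCII letters with an optional apostrophe part ("don't").
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)?")

def sample_lines(lines, fraction=0.10, seed=1):
    """Keep roughly `fraction` of the lines, chosen at random."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def clean_line(line):
    """Lowercase, strip surrounding punctuation, drop numbers and symbols."""
    tokens = []
    for raw in line.lower().split():
        word = raw.strip(string.punctuation)
        if TOKEN.fullmatch(word):
            tokens.append(word)
    return tokens

def prune(counts, min_count=2):
    """Drop n-grams seen fewer than min_count times (frequency 1 by default)."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})

tokens = clean_line("The cow, the COW & 42 cows: café!")
bigrams = Counter(zip(tokens, tokens[1:]))
print(tokens)
print(prune(bigrams))
```

Because n-grams that occur only once make up a large share of the distinct entries in a typical corpus, pruning them is a cheap way to shrink the lookup tables the app has to load.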

Instructions

Enter a phrase into the Input textbox
Click the Predict button to see details for the Stupid Backoff prediction scores

The Simple Word Prediction App can be found here

Future Enhancements

Split corpus by sentence instead of by line
Remove whole sentences from the corpus that contain bad words, numbers as words, or acronyms, instead of removing just the word
Store n-grams in a SQL database (increases speed)
Increase the percent of corpus documents processed
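As a rough illustration of the SQL database idea, the sketch below stores n-grams in an in-memory SQLite table keyed by context, so a lookup becomes a single indexed query. The table layout, the function name `next_words`, and the sample rows are all made up for the example; whether this is actually faster depends on the data size and the hosting environment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would be used in a real app
conn.execute(
    """
    CREATE TABLE ngrams (
        context TEXT NOT NULL,     -- the words preceding the prediction
        word    TEXT NOT NULL,     -- the candidate next word
        freq    INTEGER NOT NULL,  -- how often context + word was seen
        PRIMARY KEY (context, word)
    )
    """
)

# Made-up counts, for illustration only.
rows = [("the cow", "jumps", 3), ("the cow", "eats", 2), ("cow", "jumps", 4)]
conn.executemany("INSERT INTO ngrams VALUES (?, ?, ?)", rows)
conn.commit()

def next_words(conn, context, limit=3):
    """Return the most frequent next words for a context (hypothetical helper)."""
    cur = conn.execute(
        "SELECT word, freq FROM ngrams WHERE context = ? "
        "ORDER BY freq DESC, word LIMIT ?",
        (context, limit),
    )
    return cur.fetchall()

print(next_words(conn, "the cow"))
```

The primary key on (context, word) doubles as the index that serves the lookup, so no separate index is needed in this layout.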

Appendix

Download the corpus from HC Corpora (news articles, blogs, and tweets)
Exploratory Data Analysis can be found here
Coursera Data Science Capstone site can be found here