Data Science Capstone Project - Next Word Prediction

author: sammyds

date: Dec 14, 2014

The Application

This application predicts the next word of an English phrase drawn from Twitter or news articles.
The application is built on a 4-gram language model trained on a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details.
The corpus has data for four languages. This application only covers the US English data (en_US).

The Algorithm

Load the raw data from the corpus and clean up unwanted characters.
Build 4-gram tokens.
Remove sparse tokens: tokens with fewer than 5 occurrences.
Build a language model using the maximum likelihood estimates of the n-gram probabilities. This results in a lookup table that maps each key (the first n-1 words of an n-gram, so three words for the 4-gram model) to the predicted next word, which is the continuation with the highest probability.
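
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the application's actual code: the function names `clean` and `build_model`, the cleanup rule (lowercase, keep letters and apostrophes), and the toy corpus are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

MIN_COUNT = 5  # n-grams seen fewer than 5 times are treated as sparse


def clean(text):
    """Lowercase and keep only letters and apostrophes (assumed cleanup rule)."""
    text = re.sub(r"[^a-z' ]+", " ", text.lower())
    return text.split()


def build_model(lines, n=4, min_count=MIN_COUNT):
    """Map each (n-1)-word key to its most likely next word."""
    counts = Counter()
    for line in lines:
        words = clean(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1

    # Group the surviving n-grams by their (n-1)-word prefix.
    by_key = defaultdict(Counter)
    for gram, c in counts.items():
        if c >= min_count:
            by_key[gram[:-1]][gram[-1]] += c

    # MLE: P(w | key) = count(key, w) / count(key). The denominator is the
    # same for every w, so the most probable w is the one with the largest count.
    return {key: nxt.most_common(1)[0][0] for key, nxt in by_key.items()}


# Toy corpus (hypothetical data, for illustration only).
corpus = (["thanks for the follow"] * 6
          + ["thanks for the support"] * 5
          + ["thanks for the love"] * 2)
model = build_model(corpus)
print(model[("thanks", "for", "the")])  # prints "follow"
```

Storing only the top continuation per key keeps the lookup table small, at the cost of discarding the alternative predictions.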

The Algorithm (continued)

Given an input phrase, the same cleanup rules described above are applied, and the last three words are used as the key to look up in the language model.
If a match is found, the predicted word is returned. If not, a back-off strategy is used: the lookup is retried with the last two words, then with the last word, until a match is found.
If no match is found, the unigram with the highest number of occurrences is returned.
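
The back-off lookup can be sketched as follows. This is an illustration under stated assumptions rather than the application's implementation: it assumes one lookup table per key length, held in a hypothetical `tables` dictionary, and it replaces the full cleanup rules with simple lowercasing and whitespace splitting. The table contents are made up.

```python
def predict(phrase, tables, default_word):
    """Try the longest key first, then back off to shorter keys."""
    words = phrase.lower().split()  # simplified stand-in for the cleanup rules
    for k in (3, 2, 1):
        if len(words) < k:
            continue  # phrase too short for a key of this length
        key = tuple(words[-k:])
        table = tables.get(k, {})
        if key in table:
            return table[key]
    # Nothing matched at any length: fall back to the most frequent unigram.
    return default_word


# Hypothetical lookup tables keyed by the number of words in the key.
tables = {
    3: {("thanks", "for", "the"): "follow"},
    2: {("for", "the"): "first"},
    1: {("the",): "same"},
}

print(predict("Thanks for the", tables, "the"))   # prints "follow"
print(predict("waiting for the", tables, "the"))  # prints "first"
print(predict("hello world", tables, "the"))      # prints "the"
```

The second call shows the back-off at work: the three-word key is unknown, so the two-word key "for the" supplies the prediction.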

How it works

Open the application at https://sammyds.shinyapps.io/TextPrediction/
Get a phrase from Twitter or news articles in English
Type the phrase in the input box on the left marked “Type input phrase”, leaving out the last word of the phrase
Press “Enter” or click the “Submit” button
The predicted next word will appear on the right-hand side.