Data Science Capstone Project - Next Word Prediction

author: sammyds

date: Dec 14, 2014

The Application

  • This application predicts the next word after a user submits a phrase drawn from Twitter or news articles in English

  • The application consists of a 4-gram language model that was built from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details.

  • The corpus has data for four languages. This application covers only the US English data (en_US).

The Algorithm

  • Load the raw data from the corpus and clean up unwanted characters.

  • Build 4-gram tokens

  • Remove sparse tokens, i.e. tokens with fewer than 5 occurrences

  • Build a language model using the maximum likelihood estimates of the n-gram probabilities. The result is a lookup table that maps each key (a phrase of n-1 words) to the predicted next word, i.e. the word with the highest probability for that key.
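
The steps above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the application's actual code: the function names clean and build_model, the cleanup rule (lowercase, keep only letters, apostrophes and spaces), and the choice of Python are all hypothetical. For a fixed key, the maximum likelihood estimate of a next word is its count divided by the total count for that key, so the most probable next word is simply the one with the highest count.

```python
import re
from collections import Counter


def clean(text):
    """Lowercase the text and keep only letters, apostrophes and spaces.

    Assumed cleanup rule for illustration; the real rules are not given.
    """
    text = re.sub(r"[^a-z' ]+", " ", text.lower())
    return text.split()


def build_model(lines, n=4, min_count=5):
    """Return a lookup table mapping each (n-1)-word key to its predicted next word."""
    counts = Counter()
    for line in lines:
        tokens = clean(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1

    # Drop sparse n-grams (fewer than min_count occurrences).
    kept = {gram: c for gram, c in counts.items() if c >= min_count}

    # Keep, for every key, the next word with the highest count
    # (equivalently, the highest maximum likelihood probability).
    best = {}
    for gram, c in kept.items():
        key, word = gram[:-1], gram[-1]
        if key not in best or c > best[key][1]:
            best[key] = (word, c)
    return {key: word for key, (word, _) in best.items()}
```

For example, if "thanks for the follow" occurs 6 times and "thanks for the help" occurs 5 times, the table maps ("thanks", "for", "the") to "follow", while any 4-gram seen fewer than 5 times is absent from the table.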

The Algorithm (continued)

  • Given an input phrase, the same cleanup rules described above are applied, and the last n = 3 words are used as the key to look up in the language model.

  • If a match is found, the predicted word is returned. If not, a back-off strategy is used: the lookup is repeated with the last two words, then with the last word alone, until a match is found.

  • If no match is found at any level, the unigram (single word) with the highest number of occurrences is returned.
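
A minimal sketch of this back-off lookup in Python follows. It rests on assumptions: the function name predict, the tables argument (one lookup table per key length), and the top_unigram argument are hypothetical, and lowercasing plus whitespace splitting stands in for the real cleanup rules, which are not spelled out here.

```python
def predict(phrase, tables, top_unigram, max_context=3):
    """Return the predicted next word using a simple back-off lookup.

    tables      -- dict mapping a key length k to a dict {k-word tuple: next word}
    top_unigram -- the most frequent single word, returned as the last resort
    """
    tokens = phrase.lower().split()  # stand-in for the real cleanup rules

    # Try the longest available key first, then shorter and shorter keys.
    for k in range(min(max_context, len(tokens)), 0, -1):
        key = tuple(tokens[-k:])
        word = tables.get(k, {}).get(key)
        if word is not None:
            return word

    # Nothing matched at any level: fall back to the most frequent word.
    return top_unigram
```

Usage, with made-up tables: given tables = {3: {("thanks", "for", "the"): "follow"}, 2: {("for", "the"): "first"}}, the call predict("waiting for the", tables, "the") finds no three-word match, backs off to the key ("for", "the"), and returns "first".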

How it works

  • Load the application at https://sammyds.shinyapps.io/TextPrediction/

  • Get a phrase from Twitter or news articles in English

  • Type the phrase in the input box on the left marked “Type input phrase”, leaving out the last word of the phrase

  • Press “Enter” or click the “Submit” button

  • The predicted next word will appear on the right-hand side.