Coursera Data Science Capstone: Next Word Prediction Algorithm

David Risius
Thu Apr 28 22:31:03 2016

Background

Objective: Build a predictive text model like those used by SwiftKey that predicts the next word from the preceding one, two, or three words. This is done by:

  • Cleaning and analyzing a large corpus of text documents.
  • Building and sampling from a predictive text model.
  • Building a predictive text product in Shiny.

The Data: Three English text files were used to analyze and build the predictive text model (loading and word counts are sketched after this list).

  • A blog file consisting of over 38 million words.
  • A Twitter file consisting of over 31 million words.
  • A news file consisting of 2.7 million words.
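
The files come with the SwiftKey dataset used in the capstone. A minimal sketch of loading them and computing the word counts is below; the file names (e.g. en_US.blogs.txt) follow the dataset's conventions and are an assumption here, and skipNul = TRUE guards against truncated reads on files containing embedded nul characters.

    # Minimal sketch: load the three corpus files and count words.
    # File names are the conventional SwiftKey dataset names (assumed).
    files <- c(blogs   = "en_US.blogs.txt",
               twitter = "en_US.twitter.txt",
               news    = "en_US.news.txt")

    word_counts <- sapply(files, function(f) {
      lines  <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
      tokens <- strsplit(lines, "\\s+")
      # Count non-empty tokens so stray whitespace is not counted as a word
      sum(vapply(tokens, function(t) sum(nzchar(t)), integer(1)))
    })
    word_counts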

The Model

  1. Maximum likelihood estimation (MLE) to predict the next word from the previous one, two, or three words.
  2. Katz back-off model: a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram, backing off to shorter n-grams when a longer one has not been observed (see the sketch after this list).
  3. Try the model here: https://risiud.shinyapps.io/WordPredictApp/
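
Under MLE, the probability of a next word is the observed trigram count divided by the matching bigram count, i.e. P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2). Below is a minimal sketch of that lookup with a simplified back-off chain, assuming hypothetical count tables (trigrams, bigrams, unigrams with columns w1, w2, w3, count); full Katz back-off additionally discounts counts and redistributes the leftover probability mass, which is omitted here.

    # Minimal sketch: MLE lookup with simplified back-off.
    # Assumes data frames of n-gram counts, e.g.
    #   trigrams: columns w1, w2, w3, count
    #   bigrams:  columns w1, w2, count
    #   unigrams: columns w1, count
    predict_next <- function(w1, w2, trigrams, bigrams, unigrams) {
      # Trigram level: P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2);
      # the denominator is constant across candidates, so argmax of count suffices
      hits <- trigrams[trigrams$w1 == w1 & trigrams$w2 == w2, ]
      if (nrow(hits) > 0) return(hits$w3[which.max(hits$count)])
      # Back off to bigrams conditioned on the last word only
      hits <- bigrams[bigrams$w1 == w2, ]
      if (nrow(hits) > 0) return(hits$w2[which.max(hits$count)])
      # Final fallback: the most frequent unigram
      unigrams$w1[which.max(unigrams$count)]
    }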

Methodology

  1. Explore the Data
  2. Clean the Data (sketched after this list)
    • Remove numbers, punctuation, and profanity
    • Build n-grams
    • Make test and training sets
  3. Build the Model
    • Frequency files of 1-, 2-, 3-, and 4-grams
    • Maximum likelihood estimator
    • Katz back-off model
  4. Test the Model
    • Using the test set, check the accuracy of the model
    • The model predicted the next word with 65 percent accuracy
  5. Deploy the Model using shinyapps.io
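
Steps 2 and 3 can be sketched in base R as follows; the profanity list, file name, and 80/20 split are illustrative assumptions, and a production version would typically use a text-mining package such as tm or quanteda instead of hand-rolled tokenization.

    # Minimal sketch: cleaning, train/test split, and n-gram counting.
    profanity <- c("badword1", "badword2")   # hypothetical placeholder list

    clean_text <- function(lines) {
      lines <- tolower(lines)
      lines <- gsub("[0-9]+", " ", lines)        # remove numbers
      gsub("[[:punct:]]+", " ", lines)           # remove punctuation
    }

    build_ngrams <- function(lines, n) {
      tokens <- unlist(strsplit(clean_text(lines), "\\s+"))
      tokens <- tokens[nzchar(tokens) & !(tokens %in% profanity)]
      if (length(tokens) < n) return(table(character(0)))
      grams <- vapply(seq_len(length(tokens) - n + 1),
                      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                      character(1))
      sort(table(grams), decreasing = TRUE)      # frequency table, most common first
    }

    # 80/20 train/test split, then bigram counts from the training lines
    set.seed(123)
    lines <- readLines("en_US.blogs.txt", skipNul = TRUE)
    train <- sample(length(lines), floor(0.8 * length(lines)))
    bigram_counts <- build_ngrams(lines[train], n = 2)
    head(bigram_counts)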

Application

[Screenshot of the application]

Find the application here: https://risiud.shinyapps.io/WordPredictApp/