Coursera Data Science Capstone: Next Word Prediction Algorithm

David Risius
Thu Apr 28 22:31:03 2016

Background

Objective: Build a predictive text model like those used by SwiftKey that predicts the next word from the preceding one, two, or three words. This is done by:

  • Cleaning and analyzing a large corpus of text documents.
  • Building and sampling from a predictive text model.
  • Building a predictive text product in Shiny.

The Data: Three English text files were used to analyze and build the predictive text model (loading and word counts are sketched after this list).

  • A blog file consisting of over 38 million words.
  • A Twitter file consisting of over 31 million words.
  • A news file consisting of 2.7 million words.
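
The files come with the SwiftKey dataset used in the capstone. A minimal sketch of loading them and computing the word counts is below; the file names (e.g. en_US.blogs.txt) follow the dataset's conventions and are an assumption here, and skipNul = TRUE guards against truncated reads on files containing embedded nul characters.

    # Minimal sketch: load the three corpus files and count words.
    # File names are the conventional SwiftKey dataset names (assumed).
    files <- c(blogs   = "en_US.blogs.txt",
               twitter = "en_US.twitter.txt",
               news    = "en_US.news.txt")

    word_counts <- sapply(files, function(f) {
      lines  <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
      tokens <- strsplit(lines, "\\s+")
      # Count non-empty tokens so stray whitespace is not counted as a word
      sum(vapply(tokens, function(t) sum(nzchar(t)), integer(1)))
    })
    word_counts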

The Model

  1. Maximum likelihood estimation (MLE) to predict the next word from the previous one, two, or three words.
  2. Katz back-off model: a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram, backing off to shorter n-grams when a longer one has not been observed (see the sketch after this list).
  3. Try the model here: https://risiud.shinyapps.io/WordPredictApp/
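
Under MLE, the probability of a next word is the observed trigram count divided by the matching bigram count, i.e. P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2). Below is a minimal sketch of that lookup with a simplified back-off chain, assuming hypothetical count tables (trigrams, bigrams, unigrams with columns w1, w2, w3, count); full Katz back-off additionally discounts counts and redistributes the leftover probability mass, which is omitted here.

    # Minimal sketch: MLE lookup with simplified back-off.
    # Assumes data frames of n-gram counts, e.g.
    #   trigrams: columns w1, w2, w3, count
    #   bigrams:  columns w1, w2, count
    #   unigrams: columns w1, count
    predict_next <- function(w1, w2, trigrams, bigrams, unigrams) {
      # Trigram level: P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2);
      # the denominator is constant across candidates, so argmax of count suffices
      hits <- trigrams[trigrams$w1 == w1 & trigrams$w2 == w2, ]
      if (nrow(hits) > 0) return(hits$w3[which.max(hits$count)])
      # Back off to bigrams conditioned on the last word only
      hits <- bigrams[bigrams$w1 == w2, ]
      if (nrow(hits) > 0) return(hits$w2[which.max(hits$count)])
      # Final fallback: the most frequent unigram
      unigrams$w1[which.max(unigrams$count)]
    }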

Methodology

  1. Explore the Data
  2. Clean the Data (sketched after this list)
    • Remove numbers, punctuation, and profanity
    • Build n-grams
    • Make test and training sets
  3. Build the Model
    • Frequency files of 1-, 2-, 3-, and 4-grams
    • Maximum likelihood estimator
    • Katz back-off model
  4. Test the Model
    • Using the test set, check the accuracy of the model
    • The model predicted the next word with 65 percent accuracy
  5. Deploy the Model using shinyapps.io
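
Steps 2 and 3 can be sketched in base R as follows; the profanity list, file name, and 80/20 split are illustrative assumptions, and a production version would typically use a text-mining package such as tm or quanteda instead of hand-rolled tokenization.

    # Minimal sketch: cleaning, train/test split, and n-gram counting.
    profanity <- c("badword1", "badword2")   # hypothetical placeholder list

    clean_text <- function(lines) {
      lines <- tolower(lines)
      lines <- gsub("[0-9]+", " ", lines)        # remove numbers
      gsub("[[:punct:]]+", " ", lines)           # remove punctuation
    }

    build_ngrams <- function(lines, n) {
      tokens <- unlist(strsplit(clean_text(lines), "\\s+"))
      tokens <- tokens[nzchar(tokens) & !(tokens %in% profanity)]
      if (length(tokens) < n) return(table(character(0)))
      grams <- vapply(seq_len(length(tokens) - n + 1),
                      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                      character(1))
      sort(table(grams), decreasing = TRUE)      # frequency table, most common first
    }

    # 80/20 train/test split, then bigram counts from the training lines
    set.seed(123)
    lines <- readLines("en_US.blogs.txt", skipNul = TRUE)
    train <- sample(length(lines), floor(0.8 * length(lines)))
    bigram_counts <- build_ngrams(lines[train], n = 2)
    head(bigram_counts)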

Application

[Screenshot of the application]

Find the application here: https://risiud.shinyapps.io/WordPredictApp/