Shiny App to Predict the Next Word - Data Science Capstone Project

Dec. 14, 2014

Introduction

The goal of this capstone project is to analyze a large corpus of text documents to discover the structure in the data and how words are put together, and then to build a prediction application that uses n-gram models to predict the next word in a phrase entered by a user. Several steps were performed, including:

Removing numbers, punctuation, extra whitespace, and strange characters, and converting all text to lower case.
Building n-gram models (4-gram, 3-gram, 2-gram, and unigram).
Building a Shiny application.
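
As an illustration of the cleaning step listed above, here is a minimal sketch in Python. It is not the project's actual code, which was written in R. The function name `clean_text` is hypothetical, and keeping apostrophes is an assumption (made because contractions such as "can't" appear in the n-gram table).

```python
import re

def clean_text(text: str) -> str:
    """Lower-case text and strip digits, punctuation, and extra whitespace."""
    text = text.lower()
    # Assumption: map curly apostrophes to ASCII so contractions survive.
    text = text.replace("\u2019", "'")
    # Replace everything except ASCII letters, apostrophes, and whitespace
    # (this drops digits, punctuation, and non-ASCII "strange" characters).
    text = re.sub(r"[^a-z'\s]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Can\u2019t WAIT  to see 2 movies!!"))
```

Replacing unwanted characters with a space rather than deleting them avoids gluing two neighbouring words together when they were separated only by punctuation.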

Implementation of N-gram Models

Due to limited resources, a 10% random sample was drawn from each file. After the cleaning process, n-gram models were built. Each n-gram model was converted to a data.table for fast binary search. For example, the 4-gram data.table looks like:

         n1   n2    n3 pred freq
7806    are  you going   to 1242
8883     at  the   end   of 1155
8999     at  the  same time 1155
14664 can't wait    to  see 2653
23684   for  the first time 1673

Here n1, n2, and n3 are the first three words of each 4-gram, and pred is the predicted word. This implementation provided both high accuracy (more than 30%) and fast prediction.
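
To make the table layout concrete, the following Python sketch counts n-grams and indexes them by their leading words. It is only an analogue of the keyed data.table described above, not the original R code. The helper names `ngram_counts` and `build_lookup`, the toy corpus, and the choice to keep a single best prediction per prefix are all assumptions made for illustration.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count every n-gram (as a tuple of words) across cleaned sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def build_lookup(counts):
    """Map each prefix (all words but the last) to its most frequent next word."""
    best = {}
    for gram, freq in counts.items():
        prefix, pred = gram[:-1], gram[-1]
        # Keep only the highest-frequency continuation for each prefix.
        if prefix not in best or freq > best[prefix][1]:
            best[prefix] = (pred, freq)
    return best

# Toy corpus, invented for this example only.
corpus = [
    "at the end of the day",
    "at the end of the road",
    "at the same time",
]
four_grams = build_lookup(ngram_counts(corpus, 4))
print(four_grams[("at", "the", "end")])  # ('of', 2)
```

A dictionary keyed on the prefix tuple plays the role that the key columns n1, n2, and n3 play in the data.table: a prediction is a single lookup rather than a scan over all rows.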

Prediction Algorithm

The prediction is based on the Katz back-off algorithm. The steps taken to predict the next word are as follows:

If the input contains three or more words, the 4-gram model is used first; otherwise, a lower-order model is used.
If the 4-gram model produces no prediction, back off to the 3-gram model.
Similarly, if the 3-gram model produces no prediction, back off to the 2-gram model.
Finally, if the 2-gram model produces no prediction, predict the most frequent word in the unigram model, which is “the”.
If the user enters no words, the algorithm returns the word “the”.
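
The steps above can be sketched in Python as follows. This follows only the lookup order listed here and does not compute the discounted probabilities that a full Katz back-off would use. The function name `predict_next` and the layout of `models` are hypothetical. The 4-gram entries are taken from the table shown earlier, while the lower-order entries are invented for the example.

```python
def predict_next(phrase, models, default="the"):
    """Back off from the 4-gram model to lower orders until a match is found.

    `models` maps a prefix length (3, 2, 1) to a dict from prefix tuple to
    predicted word; this layout is an assumption made for the sketch.
    """
    words = phrase.lower().split()
    # Start with the longest prefix the input allows, at most three words.
    for k in range(min(3, len(words)), 0, -1):
        prefix = tuple(words[-k:])
        pred = models.get(k, {}).get(prefix)
        if pred is not None:
            return pred
    # Empty input, or no model matched: return the most frequent unigram.
    return default

models = {
    3: {("are", "you", "going"): "to", ("can't", "wait", "to"): "see"},
    2: {("wait", "to"): "see"},   # hypothetical 3-gram entry
    1: {("same",): "time"},       # hypothetical 2-gram entry
}

print(predict_next("are you going", models))
print(predict_next("hardly wait to", models))
print(predict_next("", models))
```

The first call is answered by the 4-gram table, the second falls through to the 3-gram table because its three-word prefix is unknown, and the third returns the default word because the input is empty.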

The Shiny Application

The application starts by predicting the most frequent word, “the”. After the user enters a phrase and clicks “Predict next word”, the application displays the user's input and the predicted next word.

[Screenshot of the Shiny app]