N-Gram Predictor

Erik White
2/10/2018

The Application

The N-gram Predictor is a text-analysis application that accepts a phrase or sequence of words entered by the user and predicts a word that is likely to follow.

The application is hosted on shinyapps.io by RStudio and can be found below:

https://ejwhite90.shinyapps.io/n-gram_predictor_ver2/

User Guide

The application may take a few seconds to load initially. Once fully rendered, it automatically presents a default example that invites the user to “Give it a try”.

Application usage is as simple as:

1. Typing or copying a sequence of words into the User Input text box
2. Clicking the orange “Predict Next Word!” button
3. Seeing your predicted word in the blue prediction box

The Algorithm - Pre-Processing

The model that drives the predictions was trained on three datasets containing primarily English vocabulary, provided by the Johns Hopkins University Data Science Specialization on Coursera. The full datasets can be found here:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

A random 25% sample of the lines from the English datasets was selected for training. These lines were cleaned and then processed into .rds files containing the relative frequencies of every unigram identified in the training set. The processing step was repeated three more times to write .rds files containing the relative frequencies of all bigrams, trigrams, and quadgrams in the cleaned training set. The .rds files were structured so that a b-tree search could locate entries quickly and efficiently.
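The sampling and counting steps can be sketched as follows. This is only an illustration in Python, not the application's own code (which presumably runs in R, given the Shiny hosting and .rds files). The cleaning rules, the function names, and the fixed random seed are all assumptions, since the actual cleaning steps are not listed here.

```python
import random
import re
from collections import Counter


def clean(line):
    """Lowercase a line and keep only letters, apostrophes and spaces.

    These cleaning rules are an assumption made for the sketch.
    """
    line = re.sub(r"[^a-z' ]+", " ", line.lower())
    return line.split()


def sample_lines(lines, fraction=0.25, seed=42):
    """Randomly select `fraction` of the lines (seed fixed for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(lines, int(len(lines) * fraction))


def ngram_relative_frequencies(lines, n):
    """Return {n-gram tuple: relative frequency} over all cleaned lines."""
    counts = Counter()
    for line in lines:
        tokens = clean(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}
```

Calling `ngram_relative_frequencies(training_lines, n)` once each for n = 1, 2, 3, and 4 would produce the four tables described above, which could then be written to disk.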

The Algorithm - Live-Processing

Because most of the heavy lifting is done in the pre-processing stage, relatively little processing power is needed when a user clicks the “Predict Next Word!” button.

First, the application applies to the user input the same cleaning steps that were applied to the training data.

Next, the application traverses the b-tree structure of .rds files to search for any quadgrams that start with the last three tokens from the user input.

If the application finds a match, then the final word of the most frequently occurring matching quadgram in the training set is used as the prediction, and that word is returned in the blue prediction box.
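The quadgram lookup described above can be illustrated with an ordered search. The sketch below is in Python and uses a binary search over a sorted list as a simple stand-in for the b-tree structure; it is not the application's actual storage format. The table contents, frequencies, and function name are hypothetical.

```python
from bisect import bisect_left

# Hypothetical quadgram table: (w1, w2, w3, w4, relative frequency),
# kept sorted so that rows sharing a three-word prefix are adjacent.
QUADGRAMS = sorted([
    ("at", "the", "end", "of", 0.0009),
    ("thanks", "for", "the", "follow", 0.0012),
    ("thanks", "for", "the", "rt", 0.0007),
    ("thanks", "for", "the", "support", 0.0003),
])


def predict_from_quadgrams(tokens, table=QUADGRAMS):
    """Return the fourth word of the most frequent matching quadgram, or None."""
    if len(tokens) < 3:
        return None  # input too short for a quadgram search
    prefix = tuple(tokens[-3:])
    # Binary search for the first row whose leading words are >= the prefix.
    i = bisect_left(table, prefix)
    best_word, best_freq = None, 0.0
    # Scan the contiguous run of rows that start with the prefix.
    while i < len(table) and table[i][:3] == prefix:
        if table[i][4] > best_freq:
            best_word, best_freq = table[i][3], table[i][4]
        i += 1
    return best_word
```

Because the rows are sorted, all candidates for a given prefix sit next to each other, so only a small slice of the table is scanned after the initial search.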

If no matches are found in the quadgram dataset (or if the user input wasn't long enough to warrant a quadgram search), then the application backs off to search the trigrams for a match, using only the last two tokens from the user input. If no matches are found in the trigrams, then the process is repeated with the bigrams, using only the last token.

In the event that we've encountered a new word that was not included in the training set, the above process will not return a match. Since our algorithm doesn't know how to interpret the new word, we provide the user with the most common unigram that we found in our dataset, “the”.
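The full back-off sequence, from quadgrams down to the unigram fallback, can be sketched as below. This is a minimal Python illustration under stated assumptions: the tables are plain dictionaries keyed by prefix rather than .rds files, and every entry and frequency in them is made up for the example.

```python
# Hypothetical lookup tables, keyed by prefix length:
# prefix tuple -> {candidate next word: relative frequency}.
TABLES = {
    3: {("thanks", "for", "the"): {"follow": 0.0012, "rt": 0.0007}},   # quadgrams
    2: {("for", "the"): {"first": 0.0020, "follow": 0.0010}},          # trigrams
    1: {("the",): {"same": 0.0030, "first": 0.0020}},                  # bigrams
}
MOST_COMMON_UNIGRAM = "the"


def predict_next_word(tokens, tables=TABLES, fallback=MOST_COMMON_UNIGRAM):
    """Try the longest prefix first, then back off to shorter ones."""
    for prefix_len in (3, 2, 1):
        if len(tokens) < prefix_len:
            continue  # input too short for this n-gram order
        prefix = tuple(tokens[-prefix_len:])
        candidates = tables[prefix_len].get(prefix)
        if candidates:
            # Most frequent continuation for this prefix.
            return max(candidates, key=candidates.get)
    # No match at any order, e.g. the last word is unseen.
    return fallback
```

For example, `predict_next_word(["waiting", "for", "the"])` finds no quadgram match in these sample tables and falls through to the trigram table, while an unseen word such as `["zzyzx"]` falls all the way through to “the”.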