Katie Martins
February 4, 2018
The goal of this project was to create a Shiny app that takes as input the first few words of a sentence and predicts the next word.
The training dataset was built by sampling 50% of the blog corpus and 40% each of the news and twitter corpora. The sampled text was then tokenized into n-grams, from unigrams up to 5-grams.
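As a rough illustration of the tokenization step, the sketch below builds an n-gram frequency table in base R. The function name, the splitting regex, and the column names are my assumptions for illustration, not the app's actual code.

```r
# Sketch: build a frequency table of n-grams from a character vector of lines.
# All function and column names here are illustrative, not from the original app.
build_ngrams <- function(lines, n) {
  grams <- lapply(lines, function(line) {
    words <- unlist(strsplit(tolower(line), "[^a-z']+"))
    words <- words[words != ""]
    if (length(words) < n) return(character(0))
    # Slide a window of length n over the words of this line.
    vapply(seq_len(length(words) - n + 1),
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  counts <- table(unlist(grams))
  data.frame(ngram = names(counts), count = as.integer(counts),
             stringsAsFactors = FALSE)
}
```

The 5-gram table, for example, would be built with `build_ngrams(training_lines, 5)`; each n-gram can then be split into a prefix and a final word for fast lookup.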
The algorithm starts by taking the last four words of the user input and searching for a match among the 5-grams in the model. If a match is found, the score for each candidate next word is the number of times that word follows the 4-gram input divided by the total number of times the 4-gram input occurs. The algorithm then backs off to 4-grams and computes the scores analogously, multiplied by a discount factor of 0.4. The backoff continues through trigrams and bigrams, with a further factor of 0.4 applied at each lower n-gram level, so candidates found only at lower orders are progressively penalized (the "stupid backoff" scheme). The predicted next word is the candidate with the highest score.
If no 5-gram match is found, the algorithm searches 4-grams, then trigrams, then bigrams. If no match is found at any n-gram level, the predicted next word is simply the most frequent unigram.
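The scoring and backoff just described might look roughly like the sketch below. It follows one reasonable reading of the procedure, in which candidates from all levels are pooled and each candidate keeps its score from the highest-order match. It also assumes each n-gram table has been split into a `prefix` column (the first n−1 words) and a `word` column (the final word), stored in a list `tables` indexed by prefix length; these names are assumptions, not the app's actual code.

```r
# Sketch of the backoff scoring. `tables[[4]]` holds the 5-grams split into
# prefix/word/count, `tables[[3]]` the 4-grams, and so on; `unigrams` is a
# data frame of single-word counts. All names are illustrative.
predict_next <- function(input_words, tables, unigrams, top_n = 5) {
  scores <- numeric(0)
  discount <- 1
  for (n in 4:1) {  # prefix length: 4 words down to 1 word
    if (length(input_words) >= n) {
      prefix <- paste(tail(input_words, n), collapse = " ")
      matches <- tables[[n]][tables[[n]]$prefix == prefix, ]
      if (nrow(matches) > 0) {
        # Score = discounted relative frequency of each continuation.
        s <- discount * matches$count / sum(matches$count)
        names(s) <- matches$word
        # Keep each candidate's score from the highest-order match only.
        s <- s[!(names(s) %in% names(scores))]
        scores <- c(scores, s)
      }
    }
    discount <- discount * 0.4  # apply the 0.4 discount per backoff level
  }
  if (length(scores) == 0) {
    # No match at any level: fall back to the most frequent unigram.
    return(unigrams$word[which.max(unigrams$count)])
  }
  names(sort(scores, decreasing = TRUE))[seq_len(min(top_n, length(scores)))]
}
```

Returning the top five ranked candidates, rather than just the best one, also supports the top-five accuracy check described below.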
Model accuracy was assessed on 1,000 lines from each corpus (news, twitter, and blogs); these lines were held out and not used in building the model. The first four words of each line were used as input, and the predicted next word was compared to the actual fifth word.
In about 15% of cases, the model's top prediction matched the actual next word.
In about 30% of cases, the actual next word was among the top five predictions.
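A minimal sketch of this evaluation loop, reusing the hypothetical `predict_next` from the previous sketch (none of these names come from the original test code):

```r
# Sketch: compute top-1 and top-5 accuracy over held-out lines, assuming
# predict_next() returns up to five ranked candidate words.
evaluate <- function(test_lines, tables, unigrams) {
  top1 <- 0; top5 <- 0; n_tested <- 0
  for (line in test_lines) {
    words <- unlist(strsplit(tolower(line), "[^a-z']+"))
    words <- words[words != ""]
    if (length(words) < 5) next  # need four input words plus the true next word
    preds <- predict_next(words[1:4], tables, unigrams)
    actual <- words[5]
    n_tested <- n_tested + 1
    if (preds[1] == actual) top1 <- top1 + 1
    if (actual %in% preds) top5 <- top5 + 1
  }
  c(top1_accuracy = top1 / n_tested, top5_accuracy = top5 / n_tested)
}
```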