Data Science Capstone

Aram Sethian
04/20/2019

Overview

The purpose of this project was to build a natural language processing model that predicts the next word for any phrase a user enters. The general process by which I achieved this was as follows:

  • Generating unigrams, bigrams, and trigrams from the provided corpus of news articles, tweets, and blog posts.
  • Cleaning noise from the data, such as capitalization, punctuation, numbers, special characters, and stop words that do not contribute to the substantive elements of an input (a cleaning sketch follows this list).
  • Implementing a Katz Back-Off algorithm to assign probabilities to all observed and unobserved n-grams for a given input.
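
As an illustration of the cleaning step, the sketch below uses base R regular expressions. It is only a sketch under assumed choices; the function name and the exact rules are assumptions, and the stop-word removal mentioned above is not shown here.

    # Minimal cleaning sketch (assumed helper, not the project's exact code):
    # lower-case the text, then strip URLs, numbers, punctuation, and extra whitespace.
    clean_text <- function(x) {
      x <- tolower(x)                                        # normalize capitalization
      x <- gsub("http\\S+|www\\.\\S+", " ", x, perl = TRUE)  # drop URLs
      x <- gsub("[0-9]+", " ", x)                            # drop numbers
      x <- gsub("[^a-z' ]", " ", x)                          # drop punctuation and special characters
      x <- gsub("\\s+", " ", x)                              # collapse repeated whitespace
      trimws(x)
    }

    clean_text("Check out http://example.com - 2 GREAT tips!!")
    # [1] "check out great tips"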

The Corpus

  • We were provided 3 text files consisting of blog posts, news articles, and tweets from Twitter. This totaled 594.2 MB of raw text data as the main corpus for the project.
  • The size of the corpus immediately presented the first problem, computational resources: n-grams built from the complete corpus would produce over 5 GB of data and extremely long processing times for a single prediction.
  • However, I found that a random sample of just 10% of each text source provided adequate coverage, with the Katz Back-Off algorithm filling in the gaps for unobserved phrases (a sampling sketch follows this list).
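
The sampling step can be sketched as follows. The file names, seed, and sampling code are assumptions for illustration, not the project's exact code.

    # Draw a ~10% random sample of lines from each source file.
    set.seed(1234)
    sample_lines <- function(path, frac = 0.10) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      lines[rbinom(length(lines), size = 1, prob = frac) == 1]
    }

    corpus_sample <- c(sample_lines("en_US.blogs.txt"),
                       sample_lines("en_US.news.txt"),
                       sample_lines("en_US.twitter.txt"))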

N-grams

  • I found the fastest n-gram tokenization to be the 'ngram' package by Christian Heckendorf. (https://www.rdocumentation.org/packages/ngram/versions/3.0.4)
  • Even a 10% random sample of the corpus produced 185,000 unigrams, 2.53 million bigrams, and 4 million trigrams (a tokenization sketch follows this list).
  • Loading these n-grams takes up a manageable 536 MB of memory.
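
The sketch below shows one way to build an n-gram frequency table with the 'ngram' package, continuing from the corpus_sample vector in the sampling sketch above; the object names are illustrative assumptions.

    library(ngram)

    # Join the sampled lines into one string and normalize it
    # (concatenate() and preprocess() are exported by the 'ngram' package).
    text <- preprocess(concatenate(corpus_sample),
                       case = "lower", remove.punct = TRUE)

    # Tokenize into bigrams and extract a frequency table
    # with columns: ngrams, freq, prop.
    bi <- ngram(text, n = 2)
    bigram_table <- get.phrasetable(bi)
    head(bigram_table)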

Katz Back-Off

  • The real workhorse is the Katz Back-Off algorithm, which estimates the conditional probability of a word given its history in the n-grams.
  • For this, I relied heavily on a number of functions produced by Michael Szczepaniak to implement this algorithm in R. (https://rpubs.com/mszczepaniak/predictkbo3model)
  • The general operation consists of discounting observed n-gram counts to reserve a portion of the probability mass for unobserved n-grams, and then distributing that reserved mass among the unobserved n-grams in proportion to the frequencies of their lower-order n-grams (a sketch follows this list).
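
A minimal bigram-only sketch of this idea, using a fixed absolute discount, is shown below. It is illustrative only; the project's actual implementation follows the functions linked above, and the data frame columns (word/freq and w1/w2/freq) are assumptions.

    # Illustrative bigram Katz Back-Off with a fixed discount.
    # unigrams: data frame with columns word, freq
    # bigrams:  data frame with columns w1, w2, freq
    kbo_bigram_probs <- function(prev_word, unigrams, bigrams, discount = 0.5) {
      obs <- bigrams[bigrams$w1 == prev_word, ]   # bigrams observed after prev_word
      denom <- sum(obs$freq)

      # Discounted probabilities for observed continuations
      p_obs <- setNames((obs$freq - discount) / denom, obs$w2)

      # Left-over probability mass created by the discounting
      alpha <- 1 - sum(p_obs)

      # Back off: spread alpha over unobserved words by unigram frequency
      unobs <- unigrams[!(unigrams$word %in% obs$w2), ]
      p_unobs <- setNames(alpha * unobs$freq / sum(unobs$freq), unobs$word)

      sort(c(p_obs, p_unobs), decreasing = TRUE)
    }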

The Shiny Application

  • The web application was built with Shiny. The text input passes a character string to my 'nextword' function, which loads the previously compiled n-gram files and outputs the top 5 most likely predictions for the given phrase (a structural sketch follows this list).
  • The probabilities are based on the frequencies of observed n-grams, as well as on the probabilities the Katz Back-Off assigns to unobserved n-grams.
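
The overall structure of the app can be sketched as below. The UI layout and the nextword() call are assumptions based on the description above, not the deployed code.

    library(shiny)

    ui <- fluidPage(
      textInput("phrase", "Enter a phrase:"),
      tableOutput("predictions")
    )

    server <- function(input, output, session) {
      output$predictions <- renderTable({
        req(input$phrase)
        # nextword() is the prediction function described above: it loads the
        # precompiled n-gram tables and returns the top 5 candidate words.
        nextword(input$phrase)
      })
    }

    shinyApp(ui, server)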