Data Science Capstone

Aram Sethian
04/20/2019

Overview

The purpose of this project was to build a natural language processing model that predicts the next word for any phrase a user enters. The general process by which I achieved this was as follows:

  • Generating unigrams, bigrams, and trigrams from the provided corpus of news articles, tweets, and blog posts.
  • Cleaning noise from the data, such as capitalization, punctuation, numbers, special characters, and stop words that do not contribute to the substantive elements of an input (a cleaning sketch follows this list).
  • Implementing a Katz Back-Off algorithm to assign probabilities to all observed and unobserved n-grams for a given input.
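
As an illustration of the cleaning step, the sketch below uses base R regular expressions. It is only a sketch under assumed choices; the function name and the exact rules are assumptions, and the stop-word removal mentioned above is not shown here.

    # Minimal cleaning sketch (assumed helper, not the project's exact code):
    # lower-case the text, then strip URLs, numbers, punctuation, and extra whitespace.
    clean_text <- function(x) {
      x <- tolower(x)                                        # normalize capitalization
      x <- gsub("http\\S+|www\\.\\S+", " ", x, perl = TRUE)  # drop URLs
      x <- gsub("[0-9]+", " ", x)                            # drop numbers
      x <- gsub("[^a-z' ]", " ", x)                          # drop punctuation and special characters
      x <- gsub("\\s+", " ", x)                              # collapse repeated whitespace
      trimws(x)
    }

    clean_text("Check out http://example.com - 2 GREAT tips!!")
    # [1] "check out great tips"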

The Corpus

  • We were provided 3 text files consisting of blog posts, news articles, and tweets from Twitter. This totaled 594.2 MB of raw text data as the main corpus for the project.
  • The size of the corpus immediately presented the first problem, computational resources: n-grams built from the complete corpus would produce over 5 GB of data and extremely long processing times for a single prediction.
  • However, I found that a random sample of just 10% of each text source provided adequate coverage, with the Katz Back-Off algorithm filling in the gaps for unobserved phrases (a sampling sketch follows this list).
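
The sampling step can be sketched as follows. The file names, seed, and sampling code are assumptions for illustration, not the project's exact code.

    # Draw a ~10% random sample of lines from each source file.
    set.seed(1234)
    sample_lines <- function(path, frac = 0.10) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      lines[rbinom(length(lines), size = 1, prob = frac) == 1]
    }

    corpus_sample <- c(sample_lines("en_US.blogs.txt"),
                       sample_lines("en_US.news.txt"),
                       sample_lines("en_US.twitter.txt"))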

N-grams

  • I found the fastest n-gram tokenization to be the 'ngram' package by Christian Heckendorf. (https://www.rdocumentation.org/packages/ngram/versions/3.0.4)
  • Even a 10% random sample of the corpus produced 185,000 unigrams, 2.53 million bigrams, and 4 million trigrams (a tokenization sketch follows this list).
  • Loading these n-grams takes up a manageable 536 MB of memory.
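
The sketch below shows one way to build an n-gram frequency table with the 'ngram' package, continuing from the corpus_sample vector in the sampling sketch above; the object names are illustrative assumptions.

    library(ngram)

    # Join the sampled lines into one string and normalize it
    # (concatenate() and preprocess() are exported by the 'ngram' package).
    text <- preprocess(concatenate(corpus_sample),
                       case = "lower", remove.punct = TRUE)

    # Tokenize into bigrams and extract a frequency table
    # with columns: ngrams, freq, prop.
    bi <- ngram(text, n = 2)
    bigram_table <- get.phrasetable(bi)
    head(bigram_table)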

Katz Back-Off

  • The real workhorse is the Katz Back-Off algorithm, which estimates the conditional probability of a word given its history in the n-grams.
  • For this, I relied heavily on a number of functions produced by Michael Szczepaniak to implement this algorithm in R. (https://rpubs.com/mszczepaniak/predictkbo3model)
  • The general operation consists of discounting observed n-gram counts to reserve a portion of the probability mass for unobserved n-grams, and then distributing that reserved mass among the unobserved n-grams in proportion to the frequencies of their lower-order n-grams (a sketch follows this list).
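
A minimal bigram-only sketch of this idea, using a fixed absolute discount, is shown below. It is illustrative only; the project's actual implementation follows the functions linked above, and the data frame columns (word/freq and w1/w2/freq) are assumptions.

    # Illustrative bigram Katz Back-Off with a fixed discount.
    # unigrams: data frame with columns word, freq
    # bigrams:  data frame with columns w1, w2, freq
    kbo_bigram_probs <- function(prev_word, unigrams, bigrams, discount = 0.5) {
      obs <- bigrams[bigrams$w1 == prev_word, ]   # bigrams observed after prev_word
      denom <- sum(obs$freq)

      # Discounted probabilities for observed continuations
      p_obs <- setNames((obs$freq - discount) / denom, obs$w2)

      # Left-over probability mass created by the discounting
      alpha <- 1 - sum(p_obs)

      # Back off: spread alpha over unobserved words by unigram frequency
      unobs <- unigrams[!(unigrams$word %in% obs$w2), ]
      p_unobs <- setNames(alpha * unobs$freq / sum(unobs$freq), unobs$word)

      sort(c(p_obs, p_unobs), decreasing = TRUE)
    }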

The Shiny Application

  • The web application was built with Shiny. The text input passes a character string to my 'nextword' function, which loads the previously compiled n-gram files and outputs the top 5 most likely predictions for the given phrase (a structural sketch follows this list).
  • The probabilities are based on the frequencies of observed n-grams, as well as on the probabilities the Katz Back-Off assigns to unobserved n-grams.
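
The overall structure of the app can be sketched as below. The UI layout and the nextword() call are assumptions based on the description above, not the deployed code.

    library(shiny)

    ui <- fluidPage(
      textInput("phrase", "Enter a phrase:"),
      tableOutput("predictions")
    )

    server <- function(input, output, session) {
      output$predictions <- renderTable({
        req(input$phrase)
        # nextword() is the prediction function described above: it loads the
        # precompiled n-gram tables and returns the top 5 candidate words.
        nextword(input$phrase)
      })
    }

    shinyApp(ui, server)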