Capstone Project

Jonathan Kunze
Nov 24, 2017

Data Science Specialization

Interactive demonstration: https://pyropenguin.shinyapps.io/Capstone/

Problem

The purpose of this project is to build a predictive text model: the user enters a word or phrase, and the model predicts the next word.

A large dataset called the HC Corpora was provided. It covers four locales (English, German, Russian, and Finnish). Within each locale, text from blogs, news articles, and Twitter was aggregated.

A Shiny app was developed to demonstrate the model and showcase its predictive functionality.

Predictive Model

The model is a variant of the Stupid Backoff algorithm described by Brants et al. (2007). This algorithm is somewhat faster, albeit slightly less accurate, than other predictive text algorithms such as Kneser-Ney smoothing or Katz backoff.

The dataset was parsed, cleaned, and tokenized into n-grams of orders 2 through 5. A SQLite database containing the n-grams and their frequencies was constructed. Because this is a simple demonstration, the distributed implementation described in the paper was unnecessary.
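
The n-gram extraction and storage step can be sketched as follows. This is a minimal illustration in Python rather than the project's own code: the tokenizer rule, the table layout (`ngrams` with columns `n`, `context`, `word`, `freq`), and the function names are all assumptions made for the example.

```python
import re
import sqlite3
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes (assumed cleaning rule)."""
    return re.findall(r"[a-z']+", text.lower())


def count_ngrams(tokens, n):
    """Count every contiguous window of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def build_ngram_db(lines, orders=(2, 3, 4, 5)):
    """Store (order, context, next word, frequency) rows in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ngrams (n INTEGER, context TEXT, word TEXT, freq INTEGER, "
        "PRIMARY KEY (n, context, word))"
    )
    for n in orders:
        counts = Counter()
        for line in lines:
            counts.update(count_ngrams(tokenize(line), n))
        # Split each n-gram into its leading context and its final word.
        rows = [(n, " ".join(gram[:-1]), gram[-1], freq) for gram, freq in counts.items()]
        conn.executemany("INSERT INTO ngrams VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn
```

Storing the context and the final word in separate columns means a lookup for a typed phrase is a single indexed query on `(n, context)`.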

Through empirical testing, an alpha (backoff factor) value of 0.001 was found to work best with the provided dataset. Because each backoff step multiplies a candidate's score by alpha, this value gives significant weight to higher-order n-grams.
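
To illustrate how alpha enters the scoring, here is a hedged Python sketch of Stupid Backoff ranking. It is not the project's implementation. It assumes the n-gram counts are held in a plain dictionary keyed by word tuples, uses the total count of a context's followers as the denominator, and the names `build_index` and `predict` are hypothetical.

```python
from collections import defaultdict

ALPHA = 0.001  # backoff factor reported in the text


def build_index(ngram_counts):
    """Group counts as context tuple -> {next word: frequency}."""
    index = defaultdict(dict)
    for gram, freq in ngram_counts.items():
        index[gram[:-1]][gram[-1]] = freq
    return index


def predict(index, words, top_k=3, alpha=ALPHA, max_context=4):
    """Rank candidate next words by Stupid Backoff score."""
    scores = {}
    context = tuple(words[-max_context:])  # a 5-gram has at most 4 context words
    penalty = 1.0
    while context:
        followers = index.get(context, {})
        total = sum(followers.values())
        for word, freq in followers.items():
            # Only the longest context that has seen the word contributes its score.
            scores.setdefault(word, penalty * freq / total)
        context = context[1:]  # back off: drop the earliest word
        penalty *= alpha       # each backoff step multiplies the score by alpha
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_k]]
```

With alpha at 0.001, a word found only after one backoff step keeps a thousandth of its relative frequency, so it outranks a word seen with the full context only when that word's relative frequency is itself below 0.001.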

Model Accuracy

A benchmarking tool was used to determine model accuracy and performance.

A baseline that predicts the three most common words in the English language received an overall top-three score of 6.64%, with an average runtime of 0.14 msec per iteration.

The model implemented here received a top-three score of 18.02%, with an average runtime of 4.88 seconds per iteration. While slower than the baseline, it offers a nearly threefold improvement in accuracy.
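
A benchmark of this kind can be sketched in Python as below. The sketch assumes the top-three score is simply the fraction of test cases in which the true next word appears among the first three predictions; the benchmarking tool actually used may weight ranks differently. The function names and the three baseline words are illustrative assumptions, not taken from the project.

```python
import time


def benchmark(predictor, test_cases):
    """Return (top-three accuracy, average seconds per prediction)."""
    if not test_cases:
        raise ValueError("test_cases must not be empty")
    hits = 0
    start = time.perf_counter()
    for context, actual in test_cases:
        # A hit is counted when the true next word is among the first three guesses.
        if actual in predictor(context)[:3]:
            hits += 1
    elapsed = time.perf_counter() - start
    return hits / len(test_cases), elapsed / len(test_cases)


def baseline_predictor(context):
    """Ignore the context and always return three fixed, very common words."""
    return ["the", "to", "and"]  # assumed word choice; the source does not list them
```

Passing the model's prediction function in place of `baseline_predictor` yields the two numbers compared above: accuracy and time per iteration.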

The accuracy could be improved with a larger database and a more complex algorithm, but given the space and processing constraints of the server, this performance is acceptable.

Usage

[Figure: plot generated by code chunk unnamed-chunk-1]