Data Science Capstone Final Project

Cobus Nel
22 August 2015

Overview

This presentation summarizes the final project completed to fulfill the requirements of the Johns Hopkins Data Science Specialization on Coursera.

The goal of the project was to create a data product that showcases a text prediction algorithm and provides a user interface to the algorithm and its data.

The prediction application can be found here.

Unique Features

The product delivers the following unique features:

  • High-speed ngram creation in Python, combining on-disk data with in-memory indexes.
  • Ngram tables optimized for speed, size and accuracy.
  • Lightweight R deployment. Most of the logic is built into the tables.
  • Custom, optimized data structures that minimize the in-memory footprint by means of hash table look-ups.
  • A custom back-off algorithm was developed for this implementation.
  • Does not rely on specialized external R libraries.

Components

The following infrastructure components were used to develop the product:

  • Python programming language for integration and data processing.
  • Python Natural Language Toolkit (NLTK) Library.
  • Custom Python library for integrating data cleansing and constructing ngram tables.
  • Custom R data structures and hash tables for word lookups that allow for a memory footprint of 21 MB.
  • Model View Controller (MVC) user interface utilising plain R, Shiny and Shiny Dashboard.
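
As a rough illustration of what integer-based ngram tables can look like, the sketch below maps each word to an integer ID and counts ngrams as tuples of those IDs. It is not the project's custom library: the function names build_vocab and count_ngrams and the toy sentence are assumptions made for this example only.

```python
from collections import Counter


def build_vocab(tokens):
    """Map each distinct word to a small integer ID, in first-seen order."""
    vocab = {}
    for word in tokens:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def count_ngrams(tokens, vocab, n):
    """Count n-grams as tuples of integer IDs rather than strings."""
    ids = [vocab[word] for word in tokens]
    return Counter(tuple(ids[i:i + n]) for i in range(len(ids) - n + 1))


# Toy corpus (hypothetical); a real corpus would be cleaned and tokenized first.
tokens = "the cat sat on the mat the cat ran".split()
vocab = build_vocab(tokens)
bigrams = count_ngrams(tokens, vocab, 2)
```

Storing IDs instead of strings means each word is held once in the vocabulary, while the ngram tables themselves contain only integers.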

Approach

The following approach was taken in completing the project:

  • Data was cleaned using the Python Natural Language Toolkit.
  • Bigrams, trigrams and quadgrams were extracted and stored in integer tables.
  • All ngram tables were optimized for size by means of careful pruning.
  • Words are looked up using hash tables to minimize memory footprint.
  • When a sentence is entered, the application first consults the quadgram table and then follows a back-off process until the best match is found.
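
The lookup order in the last step can be sketched as a plain back-off loop. This is a simplified illustration in Python, not the custom back-off algorithm used in the product (which runs in R): the predict function, the layout of the tables and the example entries are all hypothetical, and words are kept as strings here for readability.

```python
def predict(words, tables):
    """Return (next_word, ngram_level) using simple back-off.

    `tables` maps an ngram order (4, 3 or 2) to a dict from a context
    tuple of n - 1 words to the predicted next word. Level 0 means that
    no table contained a matching context.
    """
    for n in (4, 3, 2):
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue  # not enough words typed for this ngram order
        hit = tables.get(n, {}).get(context)
        if hit is not None:
            return hit, n
    return None, 0


# Hypothetical, hand-written tables for illustration only.
tables = {
    4: {("thanks", "for", "the"): "follow"},
    3: {("for", "the"): "first"},
    2: {("the",): "same"},
}

result_quad = predict("thanks for the".split(), tables)
result_tri = predict("waiting for the".split(), tables)
result_bi = predict(["the"], tables)
result_none = predict("hello world".split(), tables)
```

Returning the ngram order alongside the prediction makes it straightforward to report which level produced the match.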

Instructions

Load the user interface from here. Enter your words in the text box provided. The prediction is performed each time the space-bar is pressed.

User Interface

Ngram probability, execution time and the ngram level used are displayed below the prediction.