Data Science Capstone Final Project

Cobus Nel
22 August 2015

Overview

This presentation summarizes the final project completed to fulfil the requirements of the Johns Hopkins Data Science specialization on Coursera.

The goal of the project was to create a data product that showcases a text prediction algorithm and provides a user interface to the algorithm and its data.

The prediction application can be found here.

Unique Features

The product delivers the following unique features:

  • High-speed ngram creation in Python, combining on-disk data with in-memory indexes.
  • Ngram tables optimized for speed, size and accuracy.
  • Lightweight R deployment (most of the logic is built into the tables).
  • Custom, optimized data structures that minimize the in-memory footprint by means of hash table lookups.
  • A custom back-off algorithm developed for this implementation.
  • No reliance on specialized external R libraries.

Components

The following infrastructure components were used to develop the product:

  • Python programming language (integration and data processing)
  • Python NLTK (Language processing) Library
  • Custom Python library developed for integrating data cleansing and constructing ngram tables
  • Custom-developed R data structures and hash tables (word lookups) that allow for a small memory footprint (21MB in memory)
  • Model View Controller (MVC) user interface utilising plain R, Shiny and Shiny Dashboard
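The idea behind the integer ngram tables and word lookups can be illustrated with a minimal Python sketch. This is not the project's library: the helper names (build_vocab, count_ngrams, prune), the sample sentence and the pruning threshold are all assumptions made for illustration.

```python
from collections import Counter

def build_vocab(tokens):
    """Map each distinct word to a small integer id (hypothetical helper)."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def count_ngrams(tokens, vocab, n):
    """Count ngrams of length n, stored as tuples of integer word ids."""
    ids = [vocab[t] for t in tokens]
    return Counter(tuple(ids[i:i + n]) for i in range(len(ids) - n + 1))

def prune(counts, min_count):
    """Drop rare ngrams to shrink the table (the threshold is an assumption)."""
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy corpus, invented for this example.
tokens = "the cat sat on the mat and the cat sat down".split()
vocab = build_vocab(tokens)
bigrams = prune(count_ngrams(tokens, vocab, 2), min_count=2)
print(bigrams)  # {(0, 1): 2, (1, 2): 2}
```

Storing integer ids instead of strings keeps each table row small, and the word-to-id dictionary is the only place where the actual text has to be held.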

Approach

The following approach was taken in completing the project:

  • Data was cleaned using the Python Natural Language toolkit.
  • Bigrams, trigrams and quadgrams were extracted and stored in integer tables.
  • All ngram tables were optimized for size by means of careful pruning.
  • Words are looked up using hash tables to minimize memory footprint.
  • When a sentence is entered, the application will first consult a quadgram table and then follow a back-off process until the best match is found.
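The last step can be sketched as a plain back-off lookup in Python. The custom back-off algorithm itself is not described here, so this is only a generic illustration: the function name predict, the table layout (context ids mapped to a best next word and its probability) and all the sample values are assumptions.

```python
def predict(sentence, vocab, tables):
    """Return (word, probability, ngram level) using simple back-off.

    tables maps ngram order n to {context id tuple: (next word id, probability)}.
    This layout is a hypothetical one chosen for the example.
    """
    id_to_word = {i: w for w, i in vocab.items()}
    ids = [vocab.get(w) for w in sentence.lower().split()]
    for n in (4, 3, 2):  # quadgram first, then back off
        context = tuple(ids[-(n - 1):])
        if len(context) < n - 1 or None in context:
            continue  # too few words, or an unknown word in the context
        hit = tables.get(n, {}).get(context)
        if hit is not None:
            next_id, prob = hit
            return id_to_word[next_id], prob, n
    return None, 0.0, 0  # no match at any level

# Toy data, invented for this example.
vocab = {"thanks": 0, "for": 1, "the": 2, "follow": 3, "of": 4, "year": 5}
tables = {
    4: {(0, 1, 2): (3, 0.5)},
    3: {(1, 2): (3, 0.3)},
    2: {(2,): (5, 0.1)},
}

print(predict("thanks for the", vocab, tables))  # ('follow', 0.5, 4)
print(predict("ask for the", vocab, tables))     # ('follow', 0.3, 3)
print(predict("of the", vocab, tables))          # ('year', 0.1, 2)
```

Returning the ngram level alongside the prediction makes it easy to show which table produced the match.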

Instructions

Load the user interface from here. Enter your words in the text box provided. The prediction is performed each time the space-bar is pressed.

User Interface

Ngram probability, execution time and the ngram level used are displayed below the prediction.