Data Science Capstone Project

Steven D Rankine
23 August 2015

Goal

The goal of the Johns Hopkins University Data Science capstone project on Coursera is to build a predictive model for user text input.

Data

The training dataset was provided by the SwiftKey Corporation in the form of three corpora extracted from blogs, news feeds and Twitter feeds.

Model

The prediction model is based on character-level analysis for single-word phrases and word-level analysis for multi-word phrases. A Shiny App was created to demonstrate the prediction model.

Algorithm

Twenty-five batches of random samples (7000) were taken from each of the three corpora. This collection of samples was used to create a frequency-optimized term-document matrix (TDM) containing terms up to 3-grams.
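
The sampling and counting step can be illustrated with a minimal Python sketch. It is not the author's implementation (the project itself is a Shiny App), and it rests on several assumptions: that each sample of 7000 is a set of text lines, that the "documents" of the TDM are the three corpora, and that "frequency-optimized" means dropping terms below a count threshold. The names sample_batches, ngram_counts, build_tdm and min_count are hypothetical.

```python
import random
from collections import Counter

def sample_batches(lines, n_batches=25, batch_size=7000, seed=0):
    """Draw n_batches random samples of up to batch_size lines each."""
    rng = random.Random(seed)
    size = min(batch_size, len(lines))
    return [rng.sample(lines, size) for _ in range(n_batches)]

def ngram_counts(lines, max_n=3):
    """Count every 1-gram up to max_n-gram across the given lines."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

def build_tdm(corpora, min_count=1, **sample_kwargs):
    """Map each corpus name to its n-gram frequency table.

    Assumption: terms seen fewer than min_count times are dropped
    to keep the table small.
    """
    tdm = {}
    for name, lines in corpora.items():
        counts = Counter()
        for batch in sample_batches(lines, **sample_kwargs):
            counts.update(ngram_counts(batch))
        tdm[name] = {t: c for t, c in counts.items() if c >= min_count}
    return tdm

# Toy corpora standing in for the blog, news and Twitter data.
toy = {
    "blog": ["i love data science", "i love coffee"],
    "news": ["the market fell today"],
    "twitter": ["love this so much"],
}
tdm = build_tdm(toy, n_batches=2, batch_size=2)
```

Keeping one frequency table per corpus makes it straightforward to restrict a later query to a single typing context.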

Queries against the TDM are based on the user's input, the typing context (e.g. blog, news, or Twitter), and the maximum number of matches to search for.

Based on those inputs, the algorithm returns a data frame containing the most frequently occurring predictions for a given input phrase.
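
A query of this kind might look like the following Python sketch. It is an illustration under assumptions rather than the app's actual code: the function predict and the toy table blog are hypothetical, single-word input is assumed to be completed by prefix matching, multi-word input is assumed to be matched on its last two words, and a list of (term, count) rows stands in for the data frame.

```python
def predict(freqs, phrase, max_matches=3):
    """Return up to max_matches (term, count) rows, most frequent first.

    freqs maps an n-gram string to its count for one typing context.
    """
    words = phrase.lower().split()
    if not words:
        return []
    if len(words) == 1:
        # Single word: character-level match that completes the typed prefix.
        hits = [(t, c) for t, c in freqs.items()
                if " " not in t and t.startswith(words[0]) and t != words[0]]
    else:
        # Several words: word-level match on the last two words,
        # since the table holds terms only up to 3-grams.
        context = " ".join(words[-2:]) + " "
        hits = [(t[len(context):], c) for t, c in freqs.items()
                if t.startswith(context) and " " not in t[len(context):]]
    # Highest count first; ties broken alphabetically.
    hits.sort(key=lambda row: (-row[1], row[0]))
    return hits[:max_matches]

# Toy frequency table for a single typing context.
blog = {"data": 5, "day": 9, "date": 2, "i love": 6,
        "i love data": 4, "i love coffee": 3, "love data": 4}

print(predict(blog, "da"))
print(predict(blog, "i love", max_matches=1))
```

Choosing the typing context amounts to passing the frequency table for that context, and max_matches plays the role of the search scope.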

Application & Usage

Link to the application: https://sdr4w.shinyapps.io/final
The left side panel is the input section with the three input options: Typing context, Search scope, and Phrase Input.
The right side panel has the three output options: Best Prediction, List of Alternatives, and the raw output from the prediction model.
The prediction model starts as soon as text is entered into the phrase input field.

Summary

The accuracy of my prediction model was directly limited by the following constraints:

Inadequate computing resources on development system
Shiny.io limits on data uploads

On the other hand, the computational speed of the model could be improved if the following were implemented:

Upgraded hardware platform running the Shiny App
Incorporation of Hidden Markov techniques within the prediction model