This is the Capstone Project for the Johns Hopkins University Data Science Specialization, a ten-course certificate.

Joe Larson
06-13-2017

A PowerPoint promoting a text prediction product

Overview

The capstone project is an effort to create a product using the skills learned during the 10 courses, displaying methods taught during the data science MOOC education.

The product is deployed on Shiny as the final product site, with a text predictor as the minimum viable product (MVP) on offer. The product can be explored here: (https://machinelearner452.shinyapps.io/jhu_capstone_project/). The product MVP PowerPoint can be found here: (http://rpubs.com/machinelearner452/). The user interface is simple: as the user types text into the “Input Section”, the product displays the predicted next word in two display areas, a table and a word cloud, with a splash of user-selectable input for choosing how many words appear in each.
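As an illustration of that wiring, a minimal Shiny sketch follows; the widget IDs, labels, and the predict_next_word() helper are assumptions, not the app's actual source (a matching sketch of the helper appears later in this write-up).

    library(shiny)

    ui <- fluidPage(
      titlePanel("Next Word Prediction"),
      textInput("userText", "Input Section", value = ""),
      sliderInput("wordCount", "Words to display", min = 1, max = 20, value = 5),
      tableOutput("predictionTable"),   # predicted words as a table
      plotOutput("wordCloud")           # the same words as a word cloud
    )

    server <- function(input, output) {
      # predict_next_word() is a hypothetical stand-in for the app's
      # n-gram lookup; it returns a data frame of words and frequencies.
      predictions <- reactive(predict_next_word(input$userText, input$wordCount))
      output$predictionTable <- renderTable(predictions())
      output$wordCloud <- renderPlot(
        wordcloud::wordcloud(predictions()$nextword, predictions()$freq)
      )
    }

    shinyApp(ui, server)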

Discovery and Torture

The prediction is based on a set of n-grams (1, 2, 3 and 4 words) created by preprocessing the supplied reference material: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. After the raw data was downloaded successfully, a means of creating n-grams had to be selected. Many developers reported success with a variety of packages; TM was tried first, causing many delays and side tracks. Because the selected sample size was a key factor in a winning solution, reaching this minimal stage was the first critical milestone. A 10% sample size was chosen, some 500K+ lines, which was then broken into 2K-line chunks for n-gram creation. Splicing these 300+ pieces back together was undertaken. From this effort came a key learning: watch the videos and read the discussion boards (at least twice!), use the suggested R packages, and let the power of sparse matrices come shining through.
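For reference, a minimal sketch of the download-and-sample step under the numbers quoted above (a 10% sample of the combined English sources); the file paths follow the layout of the zip archive, and the seed is an arbitrary assumption for reproducibility.

    # Sketch: acquire the corpus and draw a 10% sample of lines.
    url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
    if (!file.exists("Coursera-SwiftKey.zip")) {
      download.file(url, "Coursera-SwiftKey.zip", mode = "wb")
      unzip("Coursera-SwiftKey.zip")
    }

    files <- c("final/en_US/en_US.blogs.txt",
               "final/en_US/en_US.news.txt",
               "final/en_US/en_US.twitter.txt")
    lines <- unlist(lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE))

    set.seed(452)  # arbitrary seed, for reproducibility only
    sampled <- sample(lines, length(lines) %/% 10)  # keep roughly 10%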

Obtaining a Model and Implementation

These texts were initially tidied by converting to lowercase and removing punctuation, numbers, special and Unicode characters, and selected profanity. The cleaned text was then sampled and broken into n-grams of length 1, 2, 3 and 4 words using the “quanteda” package in R.
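A minimal sketch of that cleaning and n-gram step with quanteda, assuming sampled holds the sampled lines from above and profanity is a character vector of terms to drop:

    library(quanteda)

    # Tidy: drop punctuation, numbers and symbols, lowercase, and
    # remove a user-supplied profanity list.
    toks <- tokens(sampled, remove_punct = TRUE, remove_numbers = TRUE,
                   remove_symbols = TRUE)
    toks <- tokens_tolower(toks)
    toks <- tokens_remove(toks, pattern = profanity)

    # Build n-grams of length 1-4 and count them via sparse
    # document-feature matrices.
    ngram_counts <- lapply(1:4, function(n) {
      grams <- tokens_ngrams(toks, n = n, concatenator = " ")
      colSums(dfm(grams))
    })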

The app uses the stored n-grams of length 1, 2, 3 and 4 words, with all four tables combined into one. As the user enters text, this table is searched, mutated and sorted against the entered text, and a function returns the best possible answers based on the supplied reference material.
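A sketch of that lookup as a base-R function, assuming the combined table is a data frame with prefix, nextword and freq columns (the column names are assumptions, not the app's actual schema):

    # Return the top candidates whose prefix matches the entered phrase,
    # sorted by observed frequency.
    lookup_next <- function(ngrams, phrase, n_words = 5) {
      hits <- ngrams[ngrams$prefix == phrase, ]
      hits <- hits[order(hits$freq, decreasing = TRUE), ]
      head(hits[, c("nextword", "freq")], n_words)
    }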

A “simple” model was used: review the last 1-3 words entered by the user and suggest a fourth word (NextWord) based on the preprocessed data set.
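For example, the last one to three typed words can be extracted with a small helper like this (a sketch; the app's actual handling of user input may differ):

    # Return the last k words of the user's input, cleaned the same way
    # as the training text.
    last_words <- function(text, k) {
      words <- strsplit(tolower(gsub("[[:punct:][:digit:]]", "", text)), "\\s+")[[1]]
      words <- words[words != ""]
      tail(words, k)
    }

    last_words("How are you doing today", 3)  # "you" "doing" "today"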

Next Phase in the MVP Efforts

This fourth word, or “Next Word”, is returned as the prediction table. If no match is found, the process backs off: the last two typed words are checked against the 3-word phrases, and finally the last word alone is checked against the predictors. In all cases, a match stops the search until new or additional text is entered. Until text is entered, the program uses the phrase “waiting for word cloud to” as a bootstrap.
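Taken together, the back-off described above amounts to something like the following sketch, reusing the hypothetical lookup_next() and last_words() helpers from earlier; the fallback row mirrors the bootstrap phrase.

    # Try the longest available prefix first (3 words), then back off one
    # word at a time; the first non-empty result wins.
    predict_next_word <- function(text, n_words = 5) {
      # `ngrams` is assumed to be the combined table loaded at startup.
      for (k in 3:1) {
        phrase <- paste(last_words(text, k), collapse = " ")
        hits <- lookup_next(ngrams, phrase, n_words)
        if (nrow(hits) > 0) return(hits)
      }
      # No match (or nothing typed yet): fall back on the bootstrap phrase.
      data.frame(nextword = strsplit("waiting for word cloud to", " ")[[1]],
                 freq = 1)
    }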

A steep learning curve was certainly part of this process. First, to ensure the learnings are truly understood, the code will be analyzed for refinement opportunities and better style, along with the use of packages such as SVM or other fuzzy-logic packages to increase the speed and accuracy of the predictor. Second, other paths, such as part-of-speech tagging and a user-defined dictionary, will be reviewed.