Capstone Project: CrystalBall

Scott Jacobs
12/31/16

Executive Summary

I've built a proof-of-concept text prediction engine based on a large data set from SwiftKey. I've deployed it as a web app to show how easy and useful it can be to prototype data products.

Key Takeaways:

Building routines for cleaning and processing text is relatively easy, and the results are reproducible
Large bodies of text can be condensed into n-gram frequency tables and stored as small, portable objects called sparse matrices
Web apps can be deployed quickly and shut down again as needed

A Routine for Cleaning Text Data & Creating N-Grams

First, any natural language processing project needs some amount of data preparation. For this project, a specific cleaning routine was created to meet the project's objectives. It is reusable and can be adapted for future projects.
Second, using tidytext and the tidyverse, we can build robust pipelines that turn clean text into n-gram frequency tables.
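The two steps above can be sketched in a few lines. The project itself uses R with tidytext and the tidyverse; the Python below is only an illustrative stand-in, and the function names clean_text and ngram_counts, as well as the specific cleaning rules (lowercasing, keeping only letters and apostrophes), are assumptions rather than the project's actual routine.

```python
import re
from collections import Counter

def clean_text(text):
    """Lowercase the text, replace anything that is not a letter,
    apostrophe or whitespace with a space, then split into tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z'\s]", " ", text)
    return text.split()

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens.
    Note: this simple version lets n-grams cross sentence boundaries."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

tokens = clean_text("The cat sat. The cat ran!")
bigrams = ngram_counts(tokens, 2)
print(bigrams.most_common(3))
```

The resulting counter is the n-gram frequency table: each key is a tuple of n words and each value is how often that sequence occurred.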

Generating Frequency Tables For NLP Prediction

The base R function xtabs generates a contingency table of frequency counts.
From that large table object we can create a much smaller object called a sparse matrix, which stores only the nonzero counts.
These lightweight objects are very useful when deployed via a web app in a storage- and speed-constrained environment.
Using Laplace smoothing, we eliminate zero probabilities in our matrix, which ensures that a word is very nearly always predicted.
Predicting the next word in a phrase or sentence is as easy as looking up the last one or two words in the aforementioned sparse matrix, then identifying the highest-probability word that follows.
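As a rough illustration of the lookup described above, the sketch below stores bigram counts in a nested dictionary that keeps only nonzero entries (standing in for the sparse matrix), applies add-one Laplace smoothing, and returns the most probable next word. It is written in Python rather than the project's R, and the names build_table, laplace_prob and predict_next are hypothetical. It also assumes a single preceding word; the two-word case would work the same way with a pair of words as the key.

```python
from collections import Counter, defaultdict

def build_table(tokens):
    """Sparse bigram table: table[prev][next] = count, nonzero entries only."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def laplace_prob(table, vocab, prev, word):
    """Add-one smoothing: (count(prev, word) + 1) / (count(prev) + |vocab|)."""
    row = table.get(prev, Counter())
    return (row[word] + 1) / (sum(row.values()) + len(vocab))

def predict_next(table, vocab, prev):
    """Return the word with the highest smoothed probability after prev.
    Ties (for example, an unseen prev) resolve to the alphabetically first word."""
    return max(sorted(vocab), key=lambda w: laplace_prob(table, vocab, prev, w))

tokens = ["the", "cat", "sat", "the", "cat", "ran", "the", "dog", "sat"]
vocab = set(tokens)
table = build_table(tokens)
print(predict_next(table, vocab, "the"))
```

Because every word receives a count of at least one, a previously unseen input still yields a prediction, although in that case all words are equally likely and the choice is arbitrary. A real deployment would probably fall back to overall word frequency instead.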

Evaluating Performance

Random samples were drawn from a holdout test set, and a prediction was generated for each.
Performance can be evaluated in terms of accuracy as well as speed and storage. While the predictive model does not possess great accuracy (~15%), it is fast and lightweight.
Most importantly, it met our objectives for this proof of concept.
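The accuracy check described above could look roughly like the following. This is a Python sketch rather than the project's R code, and the function name top1_accuracy, the (input, actual next word) pair format, and the toy predictor are all assumptions made for illustration. It counts a prediction as correct only when it exactly matches the word that actually followed.

```python
import random

def top1_accuracy(predict, holdout_pairs, sample_size, seed=0):
    """Draw a random sample of (input, actual_next_word) pairs from the
    holdout set and return the share where the prediction matches."""
    if not holdout_pairs:
        raise ValueError("holdout set is empty")
    rng = random.Random(seed)
    sample = rng.sample(holdout_pairs, min(sample_size, len(holdout_pairs)))
    hits = sum(1 for prev, actual in sample if predict(prev) == actual)
    return hits / len(sample)

# Toy predictor and holdout data, for illustration only.
lookup = {"the": "cat", "cat": "sat"}
holdout = [("the", "cat"), ("the", "dog"), ("cat", "sat"), ("cat", "ran")]
accuracy = top1_accuracy(lambda prev: lookup.get(prev), holdout, sample_size=4)
print(accuracy)
```

Fixing the random seed keeps the sample, and therefore the reported accuracy, reproducible between runs.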

Demonstration of Use

A word or phrase is entered into the box
A prediction is generated automatically, and a plot showing other likely next words also appears. This can be helpful if your word was not predicted or if the predicted word did not occur with any great frequency, given the input phrase.
http://scolocoder.shinyapps.io

Conclusions

R gives us the tools to create prototypes quickly and efficiently
Web app frameworks such as Shiny give us the opportunity to deliver a reactive experience with prototypes at minimal cost
If we plan on doing NLP going forward, we should dedicate computing resources to alleviate the storage and speed constraints