Text Prediction App

Josh Morel
2016-10-02

Data Science Capstone Project

Anyone not living under a rock can tell you texting has become a key means of communication
Unfortunately, unless you're thumb-ninja digital native, communicating ideas through texts can be slow and error-prone
Just try thumbing out a thoughtful, fully-formed sentence when you've been typing with ten fingers all your life!

Natural Language Processing, or NLP for short, combined with modern computing can solve the slow thumb problem by lending a predictive helping hand
Consider: you want to share some love and send a “you're my best friend” message
You COULD thumb “you're my” and then… letter-by-letter… b e s t [space] f r i e n d
But with a predictive aid, only two pushes are required to complete those words calculated as very likely to follow the partially formed intention

Diagram

Use English text input data from blogs, twitter and news data source from the HC Corpora
Design language model program with the “Stupid Back-off” algorithm (truly smart) as described by Brant, et al (1)
Develop the model building pipeline with Python scripts chained through a Bash script (FYI - the original used Hadoop/MapReduce)
Deploy an R Shiny App hosted on shiny.io at https://josh-morel.shinyapps.io/text_prediction/ (repeated… just in case the previous link didn't work!)

[1]: T. Brants, et al. (2007). Large language models in machine translation. [Online] Available: http://www.aclweb.org/anthology/D07-1090.pdf

A means to evaluate relatative performance, from both a accuracy & computing, perspective, considering:

The model's maximum order (5-gram, 6-gram, 7-gram?)
Percent of source data to use (I used 40% but with the Stupid Backoff approach any amount is possible)
The vocabulary cut-off at which to map a word to the “unknown” token
Additional or alternate data sources
Handling of special characters and patterns, for example Twitter emojis