Data Science Capstone Project

Beecher Adams
January 28, 2018

The data science capstone project involves designing a predictive text model
A user enters one or more words and the model predicts possible next words
Natural Language Processing (NLP) techniques are used
The model is “trained” on a large sample of text from news articles, Twitter feeds, and blogs

Studied the basics of Natural Language Processing and text mining
Specifically, learned how to use the Text Mining (tm) and ngram packages
Data cleaning used tm to remove “bad words”, extra white space, and numbers, along with a modified punctuation-removal step
Random sampling (rbinom) of the data sets was used to keep file sizes and run times reasonable
Used ngram to compute the 1, 2, 3, 4, and 5 word ngrams
Because ngram lacked a way to return the computed ngrams as a usable data frame, wrote a custom parsing routine
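
The preparation steps above (cleaning, random sampling, and ngram counting) can be sketched as follows. This is a minimal Python illustration, not the R code used in the project: the function names, the one-word `BAD_WORDS` list, the choice to keep apostrophes, and the `keep_prob` parameter are all assumptions made for the example. The per-line coin flip plays the role that rbinom plays in the text.

```python
import random
import re
from collections import Counter

# Hypothetical placeholder; the project's actual "bad word" list is not given.
BAD_WORDS = {"darn"}


def clean_line(line, bad_words=BAD_WORDS):
    """Lower-case a line, strip numbers and punctuation, and drop bad words.

    Apostrophes are kept so contractions survive (an assumption about what
    the "modified" punctuation removal might do).
    """
    line = line.lower()
    line = re.sub(r"\d+", " ", line)        # remove numbers
    line = re.sub(r"[^a-z' ]+", " ", line)  # remove other punctuation and symbols
    # split() also collapses runs of white space
    return [tok for tok in line.split() if tok not in bad_words]


def sample_lines(lines, keep_prob, seed=0):
    """Keep each line independently with probability keep_prob."""
    rng = random.Random(seed)
    return [ln for ln in lines if rng.random() < keep_prob]


def count_ngrams(token_lines, max_n=5):
    """Return {n: Counter mapping n-word tuples to counts} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in token_lines:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


if __name__ == "__main__":
    raw = ["I love 2 cats!", "I love dogs, darn it."]
    tokens = [clean_line(ln) for ln in sample_lines(raw, keep_prob=1.0)]
    counts = count_ngrams(tokens)
    print(counts[2].most_common(3))
```

Keeping the counts in one table per ngram order, keyed by word tuples, avoids a separate parsing pass: the counts are already in a form that can be looked up or converted to rows.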

The product is implemented as a Shiny app
After the user enters text, the product predicts up to 3 possible next words, ranked by their likelihood score
The product includes a prediction model built from 1 through 5 word ngrams
For entries longer than 5 words, it only considers the last 5 words
If no match is found for the largest word count, it backs off one word at a time until a match is found
The score for each predicted word is computed using Stupid Backoff smoothing with alpha value 0.4
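
The backoff scoring described above can be sketched as follows. This is a hedged Python illustration of Stupid Backoff, not the app's actual R code: a word seen after the full context scores count(context + word) / count(context), and each time the context is shortened by one word the score is multiplied by alpha = 0.4. The function name, the `counts` layout (one table per ngram order, keyed by word tuples), and the tie-breaking rule are assumptions. The sketch also assumes that a model with ngrams up to order `max_n` can condition on at most `max_n - 1` preceding words.

```python
from collections import Counter

ALPHA = 0.4  # backoff penalty quoted in the text


def predict_next(counts, text, max_n=5, top_k=3, alpha=ALPHA):
    """Return up to top_k (word, score) pairs, highest score first.

    counts maps ngram order -> Counter of word tuples. Each candidate word
    keeps the score from the longest context in which it was seen.
    """
    words = text.lower().split()
    context = tuple(words[-(max_n - 1):]) if max_n > 1 else ()
    scores = {}
    penalty = 1.0
    while True:
        table = counts.get(len(context) + 1, {})
        if context:
            denom = counts.get(len(context), {}).get(context, 0)
        else:
            denom = sum(table.values())  # total unigram count
        if denom > 0:
            # Linear scan for clarity; a real app would index by context.
            for gram, c in table.items():
                if gram[:-1] == context and gram[-1] not in scores:
                    scores[gram[-1]] = penalty * c / denom
        if not context:
            break
        context = context[1:]  # back off: drop the oldest word
        penalty *= alpha
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    toy = {
        1: Counter({("i",): 3, ("love",): 2, ("cats",): 1, ("dogs",): 1, ("it",): 1}),
        2: Counter({("i", "love"): 2, ("love", "cats"): 1, ("love", "dogs"): 1, ("dogs", "it"): 1}),
        3: Counter({("i", "love", "cats"): 1, ("i", "love", "dogs"): 1}),
    }
    print(predict_next(toy, "I love", max_n=3))
```

Unlike a strict "stop at the first match" rule, this version keeps backing off so that lower-order candidates can fill the remaining slots when the longest context yields fewer than three words.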

Model performance was tested using benchmark.r as referenced in the course Discussion Forum
- Overall top-3 score: 9.55 %
- Overall top-1 precision: 7.22 %
- Overall top-3 precision: 11.51 %
- Avg runtime: 61.29 msec, Total memory: 115.19 MB
The results could likely be improved by training on a larger sample of the input text
The main obstacle to using larger data sets was the custom routine that parses the ngram output into a usable data frame
That routine took well over a day of computation for the sample sizes I used
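
For readers unfamiliar with the precision figures above, top-k precision can be sketched as the share of test cases whose true next word appears among the first k predictions. This Python snippet is an illustration under that assumed definition only; it does not reproduce benchmark.r, whose test data and "top-3 score" formula are not described here. The names `top_k_precision`, `predict`, and `test_cases` are hypothetical.

```python
def top_k_precision(predict, test_cases, k):
    """Fraction of cases whose true next word is in the first k predictions.

    predict    -- callable taking a context string, returning ranked words
    test_cases -- list of (context, true_next_word) pairs
    """
    if not test_cases:
        return 0.0
    hits = sum(1 for context, truth in test_cases
               if truth in predict(context)[:k])
    return hits / len(test_cases)


if __name__ == "__main__":
    # Stub predictor standing in for a real model (hypothetical data).
    table = {"i love": ["cats", "dogs", "you"], "thank": ["you", "god", "goodness"]}
    stub = lambda ctx: table.get(ctx, [])
    cases = [("i love", "dogs"), ("thank", "you"), ("see you", "later")]
    print(top_k_precision(stub, cases, k=1))
    print(top_k_precision(stub, cases, k=3))
```

By construction, top-3 precision is never lower than top-1 precision on the same test set, which matches the ordering of the two figures reported above.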