Data Science Capstone Project
Beecher Adams
January 28, 2018
Project Background
- The data science capstone project involves designing a predictive text model
- A user enters one or more words and the model predicts possible next words
- Natural Language Processing (NLP) techniques are used
- The model is “trained” on a large set of sample text from news articles, Twitter feeds, and blogs
Approach Taken
- Study and learn the basics of Natural Language Processing and Text Mining
- Specifically learn how to use the Text Mining (tm) and ngram packages
- Data cleaning used tm to remove “bad words”, extra white space, and numbers, and to apply a modified punctuation removal
- Random sampling (rbinom) of the data sets was used to keep file sizes and performance reasonable
- Used ngram to compute the 1-, 2-, 3-, 4-, and 5-word n-grams
- Because ngram lacked a way to return the computed n-grams as a usable data frame, I wrote a custom parsing routine
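
The sampling, cleaning, and n-gram counting steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's R code (which used tm, ngram, and rbinom). The function names, the exact cleaning rules, and the choice to keep apostrophes are all assumptions made for the example.

```python
import random
import re
from collections import Counter


def sample_lines(lines, keep_prob, seed=0):
    """Keep each line independently with probability keep_prob
    (one coin flip per line; assumed to mirror an rbinom-style sample)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < keep_prob]


def clean(line, bad_words=frozenset()):
    """Lowercase, drop digits and most punctuation, collapse white space,
    and remove any word listed in bad_words. Returns a list of tokens."""
    line = line.lower()
    line = re.sub(r"[0-9]+", " ", line)
    # Assumption: apostrophes are kept so contractions like "don't" survive.
    line = re.sub(r"[^a-z' ]+", " ", line)
    return [w for w in line.split() if w not in bad_words]


def count_ngrams(token_lists, max_n=5):
    """Return {n: Counter mapping n-word tuples to counts} for n = 1..max_n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in token_lists:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts
```

Storing the counts in dictionaries keyed by word tuples gives a directly queryable table, which is the role the custom parsing routine played for the ngram output.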
Overview of the Product
- The product is implemented as a Shiny app
- After the user enters text, the product predicts up to 3 possible next words, ranked by their likelihood score
- The product includes a prediction model built from 1- through 5-word n-grams
- For entries longer than 5 words it only considers the last 5 words
- If no match is found for the largest word count, it backs off one word at a time until a match is found
- The score for each predicted word is computed using Stupid Backoff smoothing with an alpha value of 0.4
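
A minimal sketch of Stupid Backoff scoring, in Python for illustration only: a word seen after the current context scores count(context + word) / count(context); otherwise the leftmost context word is dropped and the score is multiplied by alpha. The `predict` function and the layout of `counts` are hypothetical, and the Shiny app's actual logic may differ (for example, it may stop at the first n-gram order that yields a match). Because a 5-gram table can condition on at most four preceding words, the sketch uses the last `max_n - 1` words of the input.

```python
from collections import Counter


def predict(counts, text, k=3, alpha=0.4, max_n=5):
    """Rank next-word candidates with Stupid Backoff.

    counts: {n: Counter mapping n-word tuples to counts} for n = 1..max_n
            (a hypothetical layout chosen for this sketch).
    Returns up to k (word, score) pairs, best first.
    """
    words = text.lower().split()
    words = words[max(0, len(words) - (max_n - 1)):]
    total_unigrams = sum(counts.get(1, {}).values())
    scores = {}
    penalty = 1.0
    # Longest context first; each back-off step drops the leftmost word.
    for ctx_len in range(len(words), -1, -1):
        context = tuple(words[len(words) - ctx_len:])
        if ctx_len == 0:
            denom = total_unigrams
        else:
            denom = counts.get(ctx_len, {}).get(context, 0)
        if denom > 0:
            # Linear scan for clarity; a real table would index by context.
            for gram, c in counts.get(ctx_len + 1, {}).items():
                if gram[:-1] == context and gram[-1] not in scores:
                    scores[gram[-1]] = penalty * c / denom
        penalty *= alpha
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]
```

A word keeps the score from the highest n-gram order in which it appears, so lower-order matches only fill in candidates that the longer contexts did not produce.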
Performance Test Results
- Model performance was tested using benchmark.r as referenced in the course Discussion Forum
- Overall top-3 score: 9.55 %
- Overall top-1 precision: 7.22 %
- Overall top-3 precision: 11.51 %
- Avg runtime: 61.29 msec, Total memory: 115.19 MB
- The results could be improved by using a larger sample of the input text
- The limiting factor I encountered in using larger data sets was the custom routine that parses the ngram output into a usable data frame
- It took well over a day of computation for the sample sizes I used
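
For readers unfamiliar with the top-1 and top-3 figures above, a generic top-k precision can be computed as the share of test cases whose true next word appears among the first k predictions. The Python sketch below is an assumption about the general idea only; it is not benchmark.r, whose exact definitions (including the separate top-3 "score") may differ. `top_k_precision` and `predict_fn` are hypothetical names.

```python
def top_k_precision(predict_fn, cases, k):
    """Fraction of (context, true_next_word) cases whose true word is
    among the first k words returned by predict_fn(context).

    predict_fn is assumed to return a list of words, best guess first.
    """
    if not cases:
        return 0.0
    hits = sum(1 for context, truth in cases if truth in predict_fn(context)[:k])
    return hits / len(cases)
```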