Data Science Capstone Project

Beecher Adams
January 28, 2018

Project Background

  • The data science capstone project involves designing a predictive text model
  • A user enters one or more words and the model predicts possible next words
  • Natural Language Processing (NLP) techniques are used
  • The model is trained on a large sample of text from news articles, Twitter feeds, and blogs

Approach Taken

  • Study and learn the basics of Natural Language Processing and Text Mining
  • In particular, learn to use the tm (Text Mining) and ngram packages
  • Data cleaning with tm removed “bad words”, extra white space, and numbers, and applied a modified form of punctuation removal
  • Random sampling (rbinom) of the data sets kept file sizes and run times reasonable
  • Used ngram to compute the 1-, 2-, 3-, 4-, and 5-word n-grams
  • Because ngram lacked a way to return the computed n-grams as a usable data frame, I wrote a custom parsing routine
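
The sampling and counting steps above can be sketched as follows. This is a minimal illustration in Python, not the R code behind the project: the function names `sample_lines` and `ngram_counts`, the seed, and the toy corpus are all hypothetical, and the text is assumed to be already cleaned so that splitting on white space yields words.

```python
import random
from collections import Counter

def sample_lines(lines, keep_prob, seed=2018):
    """Keep each line independently with probability keep_prob.

    This mirrors the idea of a Bernoulli draw per line (what rbinom with
    size 1 does in R); the seed value is an arbitrary assumption.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < keep_prob]

def ngram_counts(lines, max_n=5):
    """Count every 1- to max_n-word n-gram; returns {n: Counter of word tuples}."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        words = line.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts

# Toy corpus standing in for the sampled news, Twitter, and blog text.
corpus = ["the cat sat on the mat", "the cat ate the fish"]
counts = ngram_counts(corpus)
print(counts[2].most_common(1))
```

Storing each n-gram as a tuple keyed in a counter gives a lookup table directly, which is the role the custom parsing routine plays for the ngram package output.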

Overview of the Product

  • The product is implemented as a Shiny app
  • After the user enters text, the product predicts up to 3 possible next words, ranked by their likelihood score
  • The prediction model uses 1- through 5-word n-grams
  • For entries longer than 5 words, it considers only the last 5 words
  • If no match is found at the longest n-gram length, it backs off one word at a time until a match is found
  • The score for each predicted word is computed using Stupid Backoff smoothing with an alpha value of 0.4
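
The back-off and scoring described above can be sketched as follows. This is a hedged Python illustration, not the Shiny app's R code: `count_ngrams` and `predict` are hypothetical names, and the toy text is made up. Two assumptions of this sketch: it uses the last four words as context, since a 5-word n-gram holds four context words plus the predicted word (how the app maps its five-word window onto its tables is not shown in the slides), and it scores candidates at every n-gram length rather than stopping at the first match, so that a penalized shorter match can still be ranked. Each word takes its score from the longest context in which it was seen: count(context + word) / count(context), multiplied by 0.4 for every word dropped from the context.

```python
from collections import Counter

ALPHA = 0.4      # back-off factor quoted in the slides
MAX_ORDER = 5    # longest n-gram in the model

def count_ngrams(text, max_order=MAX_ORDER):
    """Hypothetical helper: map each word tuple of length 1..max_order to its count."""
    words = text.lower().split()
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def predict(text, counts, top_k=3):
    """Return up to top_k (word, score) pairs ranked by Stupid Backoff score."""
    total_words = sum(c for gram, c in counts.items() if len(gram) == 1)
    context = tuple(text.lower().split())[-(MAX_ORDER - 1):]
    scores = {}
    penalty = 1.0
    while True:
        # Denominator: count of the context itself, or corpus size for unigrams.
        denom = counts.get(context, 0) if context else total_words
        if denom > 0:
            for gram, c in counts.items():
                if len(gram) == len(context) + 1 and gram[:-1] == context:
                    # setdefault keeps the score from the longest matching context.
                    scores.setdefault(gram[-1], penalty * c / denom)
        if not context:
            break
        context = context[1:]   # drop the oldest word
        penalty *= ALPHA        # each back-off step costs a factor of alpha
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]

counts = count_ngrams(
    "the cat sat on the mat the cat ate the fish the cat sat down"
)
print(predict("the cat", counts))
print(predict("dog sat", counts))
```

The linear scan over all n-grams keeps the sketch short; a real lookup would index the counts by context to stay within an interactive response time.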

Performance Test Results

  • Model performance was tested using benchmark.r as referenced in the course Discussion Forum
    • Overall top-3 score: 9.55 %
    • Overall top-1 precision: 7.22 %
    • Overall top-3 precision: 11.51 %
    • Avg runtime: 61.29 msec, Total memory: 115.19 MB
  • The results could be improved by using a larger sample of the input text
  • The limiting factor for larger data sets was the custom routine that parses the ngram output into a usable data frame
  • That parsing took well over a day of computation for the sample sizes I used
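
For readers unfamiliar with the precision figures above, here is one common way a top-k precision can be computed. This Python sketch assumes the usual definition (the share of test cases whose true next word appears among the first k predictions); it is not the code of benchmark.r, whose exact scoring is not reproduced here, and the function name and sample data are hypothetical.

```python
def top_k_precision(predictions, truths, k):
    """Share of cases whose true next word is among the first k predicted words."""
    if not truths:
        return 0.0
    hits = sum(
        1 for preds, truth in zip(predictions, truths) if truth in preds[:k]
    )
    return hits / len(truths)

# Made-up predictions (ranked best first) and the words that actually followed.
predicted = [["sat", "ate", "the"], ["down", "on", "the"], ["the", "cat", "sat"]]
actual = ["ate", "down", "fish"]

print(top_k_precision(predicted, actual, 1))
print(top_k_precision(predicted, actual, 3))
```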