Data Science Capstone Natural Language Processing Application

Sheryl Thurston Rosenbaum
April 17, 2016

Data Product Development

As the final step in the Data Science Capstone, this presentation describes the final product: a predictive text model that predicts the next word of a phrase. The product was developed by:

  • Downloading the corpus of blog, news, and Twitter text provided by the class.
  • Processing the data sets with the R text-mining package tm, using the PCorpus function to hold the corpus on physical disk rather than in RAM (see the sketch after this list).
  • Using the filehash package to update and maintain a permanent corpus database.
  • Preparing and cleaning the database for statistical modeling through several transformations.
  • Creating n-grams with the tau package to characterize word combinations (uni-, bi-, and tri-grams).
  • Applying statistical modeling (Good-Turing smoothing and Katz backoff) when raw n-gram counts do not provide sufficient accuracy.
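A minimal sketch of the disk-backed corpus setup with tm and filehash; the data directory and database file names are assumptions, not taken from the original project:

```r
library(tm)
library(filehash)

# Build a permanent, disk-backed corpus so the raw blog/news/Twitter text
# never has to sit entirely in RAM (directory and db name are illustrative).
docs <- PCorpus(DirSource("data/en_US"),
                readerControl = list(language = "en"),
                dbControl = list(dbName = "corpus.db", dbType = "DB1"))
```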

Data Cleaning and Storage Techniques

Data cleaning included removing extra whitespace, punctuation, numbers, and web URLs, retaining apostrophes, and converting non-ASCII characters to replacement words. Swear words were retained since they are part of English communication. All elements of natural language that might appear in everyday phrasing were retained.
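A minimal sketch of those transformations using tm's built-in functions, assuming the disk-backed corpus `docs` from the previous step; the URL pattern is an illustrative assumption:

```r
library(tm)

# Remove web URLs (regular expression is an illustrative assumption)
removeURLs <- content_transformer(function(x) gsub("http\\S+|www\\.\\S+", " ", x))
docs <- tm_map(docs, removeURLs)

# Remove numbers and punctuation, keeping apostrophes inside words
# (argument available in recent tm versions)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation, preserve_intra_word_contractions = TRUE)

# Collapse runs of whitespace
docs <- tm_map(docs, stripWhitespace)
```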

The character of the different types of data sets (blogs, tweets, and news) was retained so that the prediction model can be used in any application with similar language.

Loops were run to create n-grams by processing the text in chunks and appending the results. The filehash package was used to store the resulting data tables on disk, minimizing RAM usage and maximizing scalability.
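A minimal sketch of that chunked loop; the input file, chunk size, and database name are assumptions for illustration:

```r
library(tau)
library(filehash)

db    <- dbInit("ngrams.db", type = "DB1")          # disk-backed store for count tables
lines <- readLines("clean_corpus.txt")               # assumed cleaned text, one document per line
chunk_size     <- 50000
trigram_counts <- integer(0)

for (start in seq(1, length(lines), by = chunk_size)) {
  chunk  <- lines[start:min(start + chunk_size - 1, length(lines))]
  counts <- unclass(textcnt(chunk, method = "string", n = 3L))  # trigram counts for this chunk

  # Append: add to counts of trigrams already seen, then bind the new ones
  common <- intersect(names(trigram_counts), names(counts))
  trigram_counts[common] <- trigram_counts[common] + counts[common]
  trigram_counts <- c(trigram_counts, counts[setdiff(names(counts), common)])
}

dbInsert(db, "trigrams", trigram_counts)   # persist to disk, keeping RAM usage low
```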

Data Modeling Techniques

Statistical Modeling

Modeling is based on the techniques we learned in the specialization course. The full corpus is split into a training set, a development test set, and a test set. The objective of the modeling phase is to balance accuracy, speed, and scalability. These factors are tuned during testing to optimize the model.
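A minimal sketch of that split; the 80/10/10 proportions and file name are illustrative assumptions:

```r
set.seed(2016)
lines <- readLines("clean_corpus.txt")   # assumed cleaned text, one document per line
n     <- length(lines)
idx   <- sample(n)                       # random permutation of line indices

train <- lines[idx[1:floor(0.8 * n)]]
dev   <- lines[idx[(floor(0.8 * n) + 1):floor(0.9 * n)]]
test  <- lines[idx[(floor(0.9 * n) + 1):n]]
```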

Predictive modeling

The strategy selected was to build an n-gram model with Good-Turing smoothing. N-gram modeling uses Markovian techniques: regardless of the length of the phrase, a prediction can be based on a trigram, bigram, or unigram.
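A minimal sketch of that backoff-style lookup; the table layout (named lists keyed by context) and names are illustrative, not the app's actual code:

```r
# Try the trigram table keyed on the last two words, fall back to the bigram
# table keyed on the last word, then to the single most frequent unigram.
predict_next <- function(phrase, tri, bi, uni) {
  w <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  if (length(w) == 2) {
    hit <- tri[[paste(w, collapse = " ")]]
    if (!is.null(hit)) return(hit)
  }
  hit <- bi[[tail(w, 1)]]
  if (!is.null(hit)) return(hit)
  uni
}

# Toy tables for illustration only
tri <- list("for the" = "ride")
bi  <- list("the"     = "best")
uni <- "the"
predict_next("thanks for the", tri, bi, uni)   # -> "ride"
```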

The tau package in the R programming environment tokenizes the text and builds n-grams at the n = 1, 2, and 3 levels. This kept the data set small.
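The three levels can be built with the same tau call; a sketch, assuming the training set `train` from the split above:

```r
library(tau)

# Word n-gram counts at the unigram, bigram, and trigram levels
uni_counts <- textcnt(train, method = "string", n = 1L)
bi_counts  <- textcnt(train, method = "string", n = 2L)
tri_counts <- textcnt(train, method = "string", n = 3L)
```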

N-grams appearing five or more times were assumed reliable, so no smoothing was applied to them. The Good-Turing method is advantageous because it also assigns probability to n-grams that do not appear in the data set: it re-estimates the probability of low-count n-grams by looking at how many n-grams have the next-higher count. Katz backoff helped find options by, for example, backing off from a bigram to a unigram when the bigram was not found.
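A minimal sketch of that simple Good-Turing adjustment (an illustrative function, not the app's code): for a count c below the threshold, the adjusted count is c* = (c + 1) * N(c+1) / N(c), where N(c) is the number of distinct n-grams seen exactly c times.

```r
good_turing <- function(counts, threshold = 5) {
  freq_of_freq <- table(counts)            # N(c) for each observed count c
  adjusted <- counts
  for (c in 1:(threshold - 1)) {
    Nc  <- freq_of_freq[as.character(c)]
    Nc1 <- freq_of_freq[as.character(c + 1)]
    if (!is.na(Nc) && !is.na(Nc1)) {
      adjusted[counts == c] <- (c + 1) * Nc1 / Nc   # c* = (c + 1) * N(c+1) / N(c)
    }
  }
  adjusted                                  # counts >= threshold are left unchanged
}

# e.g. adjusted_tri <- good_turing(unclass(tri_counts))
```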

The Data Product

The final product is fairly simple and accomplishes the goal of next-word prediction with good accuracy.

Just enter your phrase at the “Enter phrase” prompt on the left, and the predicted next word will appear on the right.

Please have a try!

https://sherylperil.shinyapps.io/NLP_App/