Natural Language Capstone - Final Project

John McConnell
08/22/2017

The Capstone Project

Executive Summary

The Capstone project of the Data Science Specialization provided three data files: one with news articles, one with blog posts, and one with Twitter tweets. This natural language data was analyzed and used to build a model that predicts the next word in an input phrase.

The three files were quite large: approximately 550 MB in total, with over 4 million lines and over 100 million words. Due to this size, a sample of 22,000 lines from each file was created for training.
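A rough sketch of that sampling step is below; the file names match the capstone dataset, but the seed, helper name, and output file are illustrative assumptions.

    # Minimal sampling sketch; seed and helper name are illustrative.
    set.seed(1234)

    sample_lines <- function(path, n = 22000) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      sample(lines, n)
    }

    files <- c("en_US.news.txt", "en_US.blogs.txt", "en_US.twitter.txt")
    training <- unlist(lapply(files, sample_lines))
    writeLines(training, "training_sample.txt")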

Additionally, the input data was untidy and required cleanup prior to processing. This cleanup included removal of special characters, numbers, punctuation, and extra whitespace, as well as conversion to lower case.
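A sketch of that cleanup using tm's standard transformations follows; the exact order of steps is an assumption, and "training" is the sampled text from the previous step.

    library(tm)

    # Illustrative cleanup pipeline; the exact order of steps is assumed.
    corpus <- VCorpus(VectorSource(training))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removePunctuation)
    # Drop any remaining special characters, leaving only letters and spaces
    corpus <- tm_map(corpus, content_transformer(function(x) gsub("[^a-z ]", " ", x)))
    corpus <- tm_map(corpus, stripWhitespace)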

Once preprocessing was complete, a model was created to predict the next word given an input string.

The Model

My model read in the data files and used the tm, NLP, and RWeka packages to create n-grams:

  • bigrams - two-word phrases (e.g. “of the”)
  • trigrams - three-word phrases (e.g. “one of the”)
  • quadgrams - four-word phrases (e.g. “the end of the”)
  • fivegrams - five-word phrases (e.g. “at the end of the”)
  • sixgrams - six-word phrases (e.g. “at the end of the day”)

In computing the n-grams, the frequency of each n-gram was also produced, from which the probability of each phrase was derived. The n-grams (phrases and frequencies) were written out to RData files.
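A sketch of that n-gram construction is below. RWeka's NGramTokenizer and tm's TermDocumentMatrix do the real work; the helper function, variable names, and output file name are my own assumptions.

    library(tm)
    library(RWeka)

    # Build a frequency/probability table of n-grams from the cleaned corpus.
    # Converting the TDM to a dense matrix is fine for a 22,000-line sample,
    # though it would not scale to the full data set.
    make_ngrams <- function(corpus, n) {
      tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
      tdm <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer))
      freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
      data.frame(ngram = names(freq), freq = freq,
                 prob = freq / sum(freq), row.names = NULL)
    }

    ngram_tables <- lapply(2:6, function(n) make_ngrams(corpus, n))
    names(ngram_tables) <- c("bigrams", "trigrams", "quadgrams",
                             "fivegrams", "sixgrams")
    save(ngram_tables, file = "ngrams.RData")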

Shiny App

A Shiny app was developed that parses an input string and predicts the next word using a back-off method: the longest matching n-gram is tried first, and the context is shortened one word at a time (n-1) until a matching n-gram is found.
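A minimal sketch of that back-off lookup follows, assuming the list of n-gram tables saved above (bigrams through sixgrams, each sorted by descending frequency); the function name and fallback word are illustrative.

    # Back-off prediction sketch. `ngram_tables` holds the bigram..sixgram
    # tables at indices 1..5, each sorted by descending frequency.
    predict_next <- function(phrase, ngram_tables) {
      words <- strsplit(tolower(phrase), "\\s+")[[1]]
      # Try the longest usable context first (5 words -> sixgram table),
      # then back off one word at a time.
      for (n in rev(seq_len(min(length(words), 5)))) {
        context <- paste(tail(words, n), collapse = " ")
        tbl <- ngram_tables[[n]]  # table of (n+1)-grams
        hits <- tbl[startsWith(as.character(tbl$ngram), paste0(context, " ")), ]
        if (nrow(hits) > 0) {
          # Most frequent match is first; return its final word
          return(tail(strsplit(as.character(hits$ngram[1]), " ")[[1]], 1))
        }
      }
      "the"  # fallback when nothing matches
    }

    # e.g. predict_next("at the end of the", ngram_tables) should return "day"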

The response time was quick and the model was fairly accurate. The goal was to return a prediction based on the largest matching n-gram. Click a link below to see a plot of the top 10 n-grams.

Below is a screenshot of the app:

[Screenshot of the Shiny app]

To access the Shiny app, please visit: