Data Science Capstone Milestone Report

Summary

This report describes the exploratory analysis of the data provided for the Coursera Data Science Capstone Project. The data files were read into R, and summary characteristics and plots of the data are presented. The report concludes with a plan for a text prediction application that will be hosted on www.shinyapps.io.

Exploratory Analysis

The data consists of three files containing samples of Twitter, blog, and news text, as indicated in the following table:

File	Size (Bytes)	Lines	Words
en_US.twitter.txt	167105338	2360148	30373832
en_US.blogs.txt	210160014	899288	37334441
en_US.news.txt	205811889	1010242	34372598
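
Figures like those in the table above can be gathered with a short script. The following is a minimal Python sketch for illustration only (the report's own processing was done in R). The function name `file_stats` is hypothetical, and counting whitespace-separated tokens as words is an assumption, so totals may differ from the table depending on how words are tokenized.

```python
import os


def file_stats(path, encoding="utf-8"):
    """Return (size_in_bytes, line_count, word_count) for a text file."""
    size_bytes = os.path.getsize(path)
    lines = 0
    words = 0
    # errors="replace" keeps counting even if a byte is not valid in the encoding
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())  # whitespace-delimited tokens
    return size_bytes, lines, words
```

Reading line by line keeps memory use low, which matters for files of around 200 MB each.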

The files were read into R, and basic text cleaning was done to remove profanity, stop words, punctuation, and numbers. The Text Mining Package (tm) was used for most of the text processing. Cleaning reduced the number of lines and words, as indicated in the following table:

Text	Lines	Words
twitter	2360113	20602310
blogs	898479	14935514
news	77223	1201591
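
The cleaning steps described above (removing profanity, stop words, punctuation, and numbers) can be sketched as follows. This is an illustrative Python version, not the R code used for the report. The word lists are tiny hypothetical placeholders, and dropping lines that end up empty is an assumption, shown as one way a line count can shrink during cleaning.

```python
import re

# Hypothetical, tiny word lists for illustration only; a real run would load
# fuller stop-word and profanity lists from files.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}
PROFANITY = {"darn", "heck"}


def clean_line(line, stop_words=STOP_WORDS, profanity=PROFANITY):
    """Lowercase a line, strip numbers and punctuation, drop unwanted words."""
    text = line.lower()
    text = re.sub(r"['’]", "", text)       # join contractions: don't -> dont
    text = re.sub(r"[^a-z\s]", " ", text)  # drop digits, punctuation, non-ASCII
    tokens = text.split()
    return [t for t in tokens if t not in stop_words and t not in profanity]


def clean_corpus(lines):
    """Clean every line and discard lines that end up with no words."""
    cleaned = (clean_line(line) for line in lines)
    return [tokens for tokens in cleaned if tokens]
```

For example, `clean_line("The 3 cats, and the dog!")` keeps only `cats` and `dog`.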

It is interesting to note the relative effect of the data cleaning on the different types of text. The Twitter text retains a large share of its words (about 68%), the blogs retain about 40%, and the news text shrinks far more, to about 3.5% of its original words and under 8% of its original lines. The counts of the individual words remaining in the three texts are shown in the following graphs. Note that each distribution has a very long tail: a few words occur very often and most words occur rarely. This kind of distribution is common for word counts in text.

Plot of word count in Blogs

Plot of word count in News

Plot of word count in Twitter
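
The word counts behind plots like these, and the long tail they show, can be examined with a simple frequency table. The sketch below is illustrative Python rather than the report's R code. The helper names `word_frequencies` and `coverage_size` are hypothetical, and the input is assumed to be a list of already-cleaned token lists.

```python
from collections import Counter


def word_frequencies(token_lines):
    """Count how often each word occurs across tokenized lines."""
    counts = Counter()
    for tokens in token_lines:
        counts.update(tokens)
    return counts


def coverage_size(counts, fraction):
    """Number of most frequent distinct words needed to cover
    the given fraction of all word occurrences."""
    total = sum(counts.values())
    running = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= fraction * total:
            return rank
    return len(counts)
```

With a long-tailed distribution, `coverage_size(counts, 0.5)` is typically a small fraction of `len(counts)`, which suggests that rare words could be pruned to save memory at a limited cost in coverage.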

Plan

Now that the data is available, the next step is to develop the prediction application in Shiny. The idea is to build a model from a percentage of the text data (commonly called a corpus), using most of it for training and holding back a smaller amount for testing. The most common word prediction model uses the concept of N-Grams. An N-Gram is a sequence of N consecutive words found in the training corpus, and an N-Gram model uses the first N-1 words of the sequence to predict the Nth. The 2-Gram or Bigram model uses one preceding word: given a specific word, there is a derived probability for each word that could follow it. A 3-Gram or Trigram model uses the combination of the first and second words to predict the third word. A 4-Gram or Quadrigram model constrains the word selection even more, which should give a more accurate prediction when a matching 4-Gram exists.

There are several techniques for dealing with situations where the corpus has no matching N-Gram. The one I will use is called the Backoff Method: when an N-Gram is not matched, we back off and try to match a shorter (N-1)-Gram, and so on until a word can be predicted.
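
The N-Gram lookup with backoff described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the planned Shiny implementation. The names `build_ngram_tables` and `predict_next` are hypothetical, the model simply returns the most frequent follower of the longest matching context, and it omits the discounting used by more refined schemes such as Katz backoff.

```python
from collections import Counter, defaultdict


def build_ngram_tables(token_lines, max_n=3):
    """Map each context (a tuple of 0 to max_n - 1 words) to a Counter
    of the words that followed it in the training lines."""
    tables = defaultdict(Counter)
    for tokens in token_lines:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                next_word = tokens[i + n - 1]
                tables[context][next_word] += 1
    return tables


def predict_next(tables, words, max_n=3):
    """Predict the next word, backing off to shorter contexts when needed."""
    for context_len in range(max_n - 1, -1, -1):
        context = tuple(words[-context_len:]) if context_len else ()
        if len(context) < context_len:  # not enough words typed yet
            continue
        followers = tables.get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None  # nothing was seen in training at all
```

With `max_n=3`, the lookup tries the last two words first, then the last word alone, and finally falls back to the most frequent single word in the training data.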

Once the model is developed, I will use the test data set to determine how well it performs. A measure called Perplexity is used to evaluate the model: it reflects how well the model predicts held-out text, and lower values are better. There are also challenges in making the application small enough to fit on the hosted Shiny server, so using the host memory effectively will be necessary. A reasonable response time is important, given that this application is meant to simulate the word prediction found on a smartphone or tablet. Trade-offs in the model will therefore likely have to be made, and I will use the perplexity measurement to determine how much impact those trade-offs have. One other interesting idea is to use the user’s input to add to the corpus and adjust the model on the fly. That will be a bonus once the application framework is in place.
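
To make the perplexity measurement concrete, the sketch below computes it for a simple bigram model on held-out text. It is an illustrative Python example, not the evaluation code for the planned application. The function names are hypothetical, and add-one smoothing is an assumption chosen so that unseen word pairs never receive zero probability. Perplexity is the exponential of the average negative log probability per word.

```python
import math
from collections import Counter


def train_bigram(token_lines):
    """Collect context counts, bigram counts, and the vocabulary.
    The marker <s> stands for the start of each line."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for tokens in token_lines:
        padded = ["<s>"] + list(tokens)
        vocab.update(tokens)
        for prev, word in zip(padded, padded[1:]):
            contexts[prev] += 1
            bigrams[(prev, word)] += 1
    return contexts, bigrams, vocab


def perplexity(model, test_lines):
    """Perplexity of an add-one smoothed bigram model on held-out lines."""
    contexts, bigrams, vocab = model
    vocab_size = len(vocab) + 1  # one extra class for unseen words
    log_prob_sum = 0.0
    word_count = 0
    for tokens in test_lines:
        padded = ["<s>"] + list(tokens)
        for prev, word in zip(padded, padded[1:]):
            prob = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
            log_prob_sum += math.log(prob)
            word_count += 1
    if word_count == 0:
        raise ValueError("test_lines contains no words")
    return math.exp(-log_prob_sum / word_count)
```

Comparing this value before and after a change, such as pruning rare N-Grams to save memory, is one way to quantify how much a trade-off costs in predictive quality.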