This report documents the exploratory analysis of the three en_US text files that will eventually feed the prediction algorithm and data product.
I have the following three text files stored locally for ease of use; they were downloaded from the Coursera Capstone Project data:
- en_US.twitter.txt
- en_US.blogs.txt
- en_US.news.txt
It is assumed that the data has been downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and unzipped.
Once that is done, I read the text files in from my local folder.
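A minimal sketch of the download-and-read step. The `final/en_US/` path is my assumption about where the zip unpacks, and `skipNul` is an optional safeguard against embedded nulls in these files:

```r
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip <- "Coursera-SwiftKey.zip"
if (!file.exists(zip)) download.file(url, zip)
if (!dir.exists("final")) unzip(zip)  # assumed to unpack into final/en_US/

# Read each file; skipNul avoids warnings from embedded null characters.
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```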
To get a sense of what the data looks like, I determine the number of lines, characters, and words for each of the three datasets (twitter, blogs, and news), along with some basic statistics on the number of words per line (min, mean, and max).
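One way to compute these figures, assuming the stringi package for word counts (the variable names here are illustrative):

```r
library(stringi)

# Per-line word counts for each dataset.
wpl <- lapply(list(blogs = blogs, news = news, twitter = twitter),
              stri_count_words)

summary_df <- data.frame(
  Dataset  = names(wpl),
  Lines    = sapply(list(blogs, news, twitter), length),
  Chars    = sapply(list(blogs, news, twitter), function(x) sum(nchar(x))),
  Words    = sapply(wpl, sum),
  WPL_Min  = sapply(wpl, min),
  WPL_Mean = sapply(wpl, mean),
  WPL_Max  = sapply(wpl, max)
)
summary_df
```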
```
##   Dataset   Lines     Chars    Words WPL_Min WPL_Mean WPL_Max
## 1   blogs  899288 206824382 37570839       0 41.75107    6726
## 2    news   77259  15639408  2651432       1 34.61779    1123
## 3 twitter 2360148 162096241 30451170       1 12.75065      47
```
As we can see above, blogs tend to have the most words per line and tweets the fewest, which is what we would expect given Twitter's character limit. The files themselves are very large, so to improve processing time I will create a sample of each file; this should also allow the model and Shiny application to run in a shorter amount of time.
I first remove all non-English characters and then compile a sample dataset composed of 1% of each of the three original datasets.
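A sketch of that sampling step. I am assuming `iconv`'s ASCII conversion as the non-English filter and a fixed seed for reproducibility:

```r
set.seed(1234)  # so the sample is reproducible

# Drop characters outside the ASCII range (a simple proxy for non-English text).
blogs   <- iconv(blogs,   "UTF-8", "ASCII", sub = "")
news    <- iconv(news,    "UTF-8", "ASCII", sub = "")
twitter <- iconv(twitter, "UTF-8", "ASCII", sub = "")

# Keep 1% of the lines from each dataset and combine them.
sample_data <- c(sample(blogs,   round(length(blogs)   * 0.01)),
                 sample(news,    round(length(news)    * 0.01)),
                 sample(twitter, round(length(twitter) * 0.01)))
```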
Next, I use functions from the tm package to build and clean the corpus that will be analyzed. After building the corpus, I convert everything to lower case, remove punctuation and numbers, strip whitespace, and convert the documents to plain text.
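A sketch of those cleaning steps using the standard tm transformations:

```r
library(tm)

# Build a corpus from the sample, then apply the cleaning steps described above.
corpus <- VCorpus(VectorSource(sample_data))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
```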
I use the RWeka package to construct functions that tokenize the sample and build matrices of unigrams, bigrams, and trigrams.
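A sketch of the tokenizers and matrices, using the usual `NGramTokenizer`/`Weka_control` pattern:

```r
library(RWeka)

# Tokenizer functions for 1-, 2-, and 3-grams.
unigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram_tok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Term-document matrices built with each tokenizer.
unigram_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = unigram_tok))
bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tok))
trigram_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tok))
```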
Then I find the frequency of terms in each of these three matrices and construct data frames of these frequencies.

### Calculate frequency of n-grams
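A sketch of that computation, assuming `slam::row_sums` to total term counts across documents (`get_freq` is an illustrative helper name):

```r
# Sum each term's count across all documents in the matrix.
get_freq <- function(tdm) {
  freq <- slam::row_sums(tdm)
  data.frame(word = names(freq), frequency = freq)
}

unigram_freq <- get_freq(unigram_tdm)
bigram_freq  <- get_freq(bigram_tdm)
trigram_freq <- get_freq(trigram_tdm)
head(trigram_freq)
```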
```
##                    word frequency
## a couple of a couple of        77
## a lot of       a lot of       156
## all of the   all of the        72
## as well as   as well as        79
## at the end   at the end        53
## be able to   be able to        97
```
Lastly, I write a function to plot n-gram frequencies and use it to plot the 20 most frequent unigrams, bigrams, and trigrams.
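A sketch of that plotting function, assuming ggplot2 (`plot_ngram` is an illustrative name):

```r
library(ggplot2)

# Plot the n most frequent terms from one of the frequency data frames.
plot_ngram <- function(freq_df, title, n = 20) {
  top <- head(freq_df[order(-freq_df$frequency), ], n)
  ggplot(top, aes(x = reorder(word, frequency), y = frequency)) +
    geom_col() +
    coord_flip() +  # horizontal bars keep long n-grams readable
    labs(title = title, x = NULL, y = "Frequency")
}

plot_ngram(unigram_freq, "Top 20 Unigrams")
plot_ngram(bigram_freq,  "Top 20 Bigrams")
plot_ngram(trigram_freq, "Top 20 Trigrams")
```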
The next steps are to build the predictive algorithm and deploy the Shiny app. Briefly, the plan is to add a profanity filter, using a file of foul words, and then compare the cleaned data against the n-gram frequency tables. There is a second approach I want to try as well: removing the spaces between words and cutting the result into short segments, so that common phrases can be identified as a single token. Both algorithms will be based on frequency, as sketched below.
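As a rough illustration of how a frequency-based lookup could work (`predict_next` is hypothetical, not the final algorithm): match the last two words typed against the trigram table and return the most frequent completion.

```r
# Hypothetical sketch of frequency-based next-word prediction.
predict_next <- function(phrase, trigram_freq) {
  words  <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  prefix <- paste0("^", paste(words, collapse = " "), " ")
  hits   <- trigram_freq[grepl(prefix, trigram_freq$word), ]
  if (nrow(hits) == 0) return(NA_character_)  # no match; a real model would back off
  best <- hits$word[which.max(hits$frequency)]
  tail(strsplit(best, " ")[[1]], 1)  # last word of the best trigram
}
```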
I will build the UI of the Shiny app, which will consist of a text input box that allows a user to enter a word or phrase.
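A minimal sketch of that UI, reusing the hypothetical `predict_next` from above:

```r
library(shiny)

# Text input plus a place to display the predicted next word.
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a word or phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)  # wait until the user has typed something
    predict_next(input$phrase, trigram_freq)
  })
}

shinyApp(ui, server)
```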