This report documents the exploratory analysis of the three en_US text files that will eventually feed the prediction algorithm and data product.
I have the following three text files stored locally for ease of use; they were downloaded from the Coursera Capstone Project data:
- en_US.twitter.txt
- en_US.blogs.txt
- en_US.news.txt
It is assumed that the data has been downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and unzipped.
Once that is done, I read the text files in from my local folder.
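A minimal sketch of the download-and-read step. The `final/en_US/` path is my assumption about where the zip unpacks, and `skipNul` is an optional safeguard against embedded nulls in these files:

```r
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip <- "Coursera-SwiftKey.zip"
if (!file.exists(zip)) download.file(url, zip)
if (!dir.exists("final")) unzip(zip)  # assumed to unpack into final/en_US/

# Read each file; skipNul avoids warnings from embedded null characters.
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```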
To get a sense of what the data looks like, I determine the number of lines, characters, and words for each of the three datasets (twitter, blogs, and news), along with some basic statistics on the number of words per line (min, mean, and max).
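One way to compute these figures, assuming the stringi package for word counts (the variable names here are illustrative):

```r
library(stringi)

# Per-line word counts for each dataset.
wpl <- lapply(list(blogs = blogs, news = news, twitter = twitter),
              stri_count_words)

summary_df <- data.frame(
  Dataset  = names(wpl),
  Lines    = sapply(list(blogs, news, twitter), length),
  Chars    = sapply(list(blogs, news, twitter), function(x) sum(nchar(x))),
  Words    = sapply(wpl, sum),
  WPL_Min  = sapply(wpl, min),
  WPL_Mean = sapply(wpl, mean),
  WPL_Max  = sapply(wpl, max)
)
summary_df
```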
```
##   Dataset   Lines     Chars    Words WPL_Min WPL_Mean WPL_Max
## 1   blogs  899288 206824382 37570839       0 41.75107    6726
## 2    news   77259  15639408  2651432       1 34.61779    1123
## 3 twitter 2360148 162096241 30451170       1 12.75065      47
```
As we can see above, blogs tend to have the most words per line and tweets the fewest, which is what we would expect given Twitter's character limit. The files themselves are very large, so to improve processing time I will create a sample of each file; this should also allow the model and Shiny application to run in a shorter amount of time.
I first remove all non-English characters and then compile a sample dataset composed of 1% of each of the three original datasets.
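A sketch of that sampling step. I am assuming `iconv`'s ASCII conversion as the non-English filter and a fixed seed for reproducibility:

```r
set.seed(1234)  # so the sample is reproducible

# Drop characters outside the ASCII range (a simple proxy for non-English text).
blogs   <- iconv(blogs,   "UTF-8", "ASCII", sub = "")
news    <- iconv(news,    "UTF-8", "ASCII", sub = "")
twitter <- iconv(twitter, "UTF-8", "ASCII", sub = "")

# Keep 1% of the lines from each dataset and combine them.
sample_data <- c(sample(blogs,   round(length(blogs)   * 0.01)),
                 sample(news,    round(length(news)    * 0.01)),
                 sample(twitter, round(length(twitter) * 0.01)))
```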
Next, I use functions from the tm package to build and clean the corpus that will be analyzed. After building the corpus, I convert everything to lower case, remove punctuation and numbers, strip whitespace, and convert the documents to plain text.
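A sketch of those cleaning steps using the standard tm transformations:

```r
library(tm)

# Build a corpus from the sample, then apply the cleaning steps described above.
corpus <- VCorpus(VectorSource(sample_data))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
```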
I use the RWeka package to construct functions that tokenize the sample and build matrices of unigrams, bigrams, and trigrams.
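A sketch of the tokenizers and matrices, using the usual `NGramTokenizer`/`Weka_control` pattern:

```r
library(RWeka)

# Tokenizer functions for 1-, 2-, and 3-grams.
unigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram_tok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Term-document matrices built with each tokenizer.
unigram_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = unigram_tok))
bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tok))
trigram_tdm <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tok))
```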
Then I find the frequency of terms in each of these three matrices and construct data frames of these frequencies.

### Calculate frequency of n-grams
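A sketch of that computation, assuming `slam::row_sums` to total term counts across documents (`get_freq` is an illustrative helper name):

```r
# Sum each term's count across all documents in the matrix.
get_freq <- function(tdm) {
  freq <- slam::row_sums(tdm)
  data.frame(word = names(freq), frequency = freq)
}

unigram_freq <- get_freq(unigram_tdm)
bigram_freq  <- get_freq(bigram_tdm)
trigram_freq <- get_freq(trigram_tdm)
head(trigram_freq)
```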
```
##                    word frequency
## a couple of a couple of        77
## a lot of       a lot of       156
## all of the   all of the        72
## as well as   as well as        79
## at the end   at the end        53
## be able to   be able to        97
```
Lastly, I write a function to plot n-gram frequencies and use it to plot the 20 most frequent unigrams, bigrams, and trigrams.
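A sketch of that plotting function, assuming ggplot2 (`plot_ngram` is an illustrative name):

```r
library(ggplot2)

# Plot the n most frequent terms from one of the frequency data frames.
plot_ngram <- function(freq_df, title, n = 20) {
  top <- head(freq_df[order(-freq_df$frequency), ], n)
  ggplot(top, aes(x = reorder(word, frequency), y = frequency)) +
    geom_col() +
    coord_flip() +  # horizontal bars keep long n-grams readable
    labs(title = title, x = NULL, y = "Frequency")
}

plot_ngram(unigram_freq, "Top 20 Unigrams")
plot_ngram(bigram_freq,  "Top 20 Bigrams")
plot_ngram(trigram_freq, "Top 20 Trigrams")
```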
The next steps are to build the predictive algorithm and deploy the Shiny app. Briefly, the plan is to add a profanity filter, using a file of foul words, and then compare the cleaned data against the n-gram frequency tables. There is a second approach I want to try as well: removing the spaces between words and cutting the result into short segments, so that common phrases can be identified as a single token. Both algorithms will be based on frequency, as sketched below.
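As a rough illustration of how a frequency-based lookup could work (`predict_next` is hypothetical, not the final algorithm): match the last two words typed against the trigram table and return the most frequent completion.

```r
# Hypothetical sketch of frequency-based next-word prediction.
predict_next <- function(phrase, trigram_freq) {
  words  <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  prefix <- paste0("^", paste(words, collapse = " "), " ")
  hits   <- trigram_freq[grepl(prefix, trigram_freq$word), ]
  if (nrow(hits) == 0) return(NA_character_)  # no match; a real model would back off
  best <- hits$word[which.max(hits$frequency)]
  tail(strsplit(best, " ")[[1]], 1)  # last word of the best trigram
}
```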
I will build the UI of the Shiny app, which will consist of a text input box that allows a user to enter a word or phrase.
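A minimal sketch of that UI, reusing the hypothetical `predict_next` from above:

```r
library(shiny)

# Text input plus a place to display the predicted next word.
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a word or phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)  # wait until the user has typed something
    predict_next(input$phrase, trigram_freq)
  })
}

shinyApp(ui, server)
```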