Exploratory Analysis

An exploratory analysis of the three documents en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt, with basic summaries of word counts and line counts, and histograms that show basic properties of the data set. First we need to load the basic libraries that are going to be used in the analysis.

Loading the libraries

library(ggplot2)   ## Library used to make plots
library(tm)        ## Library used for text mining and tokenization of the data
library(xtable)    ## Library that produces tables
library(SnowballC) ## Library used for word stemming

Loading the three data sets

The three data sets were provided by SwiftKey, the company that is helping develop the Capstone Project.

connection <- file("en_US.blogs.txt", open = "r")
blog <- readLines(connection)
close(connection)

connection <- file("en_US.news.txt", open = "r")
news <- readLines(connection)
close(connection)

connection <- file("en_US.twitter.txt", open = "r")
twitter <- readLines(connection)
close(connection)

sizeblog <- object.size(blog)       ## size of each data set in memory
sizenews <- object.size(news)
sizetwitter <- object.size(twitter)
name_of_file <- c("Blog", "News", "Twitter")
size_of_file_bytes <- c(sizeblog, sizenews, sizetwitter)
df <- cbind(name_of_file, size_of_file_bytes)
(df)
     name_of_file size_of_file_bytes
[1,] "Blog"       "260564320"       
[2,] "News"       "316037344"       
[3,] "Twitter"    "316037344"       

The table shows that the biggest file is en_US.twitter.txt, followed by en_US.news.txt, and the smallest file is en_US.blogs.txt.
Now we proceed to find the line counts of the three files.

Line Counts

blogLC <- length(blog)
newsLC <- length(news)
twitterLC <- length(twitter)

Name <- c("Blog", "News", "Twitter")
LineCount <- c(blogLC, newsLC, twitterLC)
df <- cbind(Name, LineCount)
(df)
     Name      LineCount
[1,] "Blog"    "899288" 
[2,] "News"    "1010242"
[3,] "Twitter" "2360148"

Since the data sets are so big, the algorithm will be built from three samples of the original data. This is how the three samples were built:

        set.seed(1234)
        blog <- sample(blog, 5000, replace = TRUE)
        news <- sample(news, 5000, replace = TRUE)
        twitter <- sample(twitter, 5000, replace = TRUE)

We see that the en_US.twitter.txt file has the highest line count, which is consistent with it being the biggest file according to the byte table shown above.

Cleaning the data set

  1. Turn the whole document into lower case.
  2. Remove punctuation.
  3. Remove numbers.
## Collapse punctuation and repeated spaces into single spaces
blog <- gsub('[[:punct:] ]+', ' ', blog)
news <- gsub('[[:punct:] ]+', ' ', news)
twitter <- gsub('[[:punct:] ]+', ' ', twitter)
todo <- c(blog, news, twitter)  ## combined sample of all three sources

## Build a corpus for each sample and for the combined sample
blog <- VCorpus(VectorSource(blog))
news <- VCorpus(VectorSource(news))
twitter <- VCorpus(VectorSource(twitter))
todo <- VCorpus(VectorSource(todo))

## Lower-case everything
blog <- tm_map(blog, content_transformer(tolower))
news <- tm_map(news, content_transformer(tolower))
twitter <- tm_map(twitter, content_transformer(tolower))
todo <- tm_map(todo, content_transformer(tolower))

## Remove any remaining punctuation
blog <- tm_map(blog, removePunctuation)
news <- tm_map(news, removePunctuation)
twitter <- tm_map(twitter, removePunctuation)
todo <- tm_map(todo, removePunctuation)



## Remove numbers
blog <- tm_map(blog, removeNumbers)
news <- tm_map(news, removeNumbers)
twitter <- tm_map(twitter, removeNumbers)
todo <- tm_map(todo, removeNumbers)
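
SnowballC was loaded at the top but is not used in the steps above; if word stemming were wanted, it could be applied with tm's stemDocument transformation. This is only an optional sketch, and todoStemmed is a name introduced here for illustration that is not used in the summaries below:

todoStemmed <- tm_map(todo, stemDocument) ## optional stemming of the combined corpus via SnowballC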

## Document-term matrices: one row per document, one column per term
frequenciesblog <- DocumentTermMatrix(blog)
frequenciesnews <- DocumentTermMatrix(news)
frequenciestwitter <- DocumentTermMatrix(twitter)
frequenciestodo <- DocumentTermMatrix(todo)
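
The summaries below describe how often each term occurs in its sample. One way to compute such summaries, assuming they are based on the per-term totals taken from the document-term matrices, is the following sketch:

summary(colSums(as.matrix(frequenciesblog)))    ## per-term counts, blog sample
summary(colSums(as.matrix(frequenciesnews)))    ## per-term counts, news sample
summary(colSums(as.matrix(frequenciestwitter))) ## per-term counts, twitter sample
summary(colSums(as.matrix(frequenciestodo)))    ## per-term counts, combined sample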

The blog data set has a mean of 8.256 occurrences per word, 75% of the words occur 4 times or fewer, and the most frequent word appears 10360 times.

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    1.000     1.000     1.000     8.256     4.000 10360.000 

The news data set has a mean of 7.133 occurrences per word; again 75% of the words occur 4 times or fewer, and the most frequent word appears 9763 times.

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   1.000    1.000    1.000    7.133    4.000 9763.000 

The twitter data set has a mean of 5.031 occurrences per word; here 75% of the words occur 2 times or fewer, and the most frequent word appears 1953 times.

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
   1.000    1.000    1.000    5.031    2.000 1953.000 

As observed, all of the data is highly skewed: most words are repeated only a handful of times. Putting all of the data files together might help make the data less skewed.

The combined data set behaves like the three individual files: it is still highly skewed.
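
Finally, since ggplot2 is already loaded, a histogram of the term frequencies in the combined sample can illustrate this skewness. This is only a sketch; freqtodo and freqdf are names introduced here for illustration:

freqtodo <- colSums(as.matrix(frequenciestodo)) ## total count of each term in the combined sample
freqdf <- data.frame(count = freqtodo)
ggplot(freqdf, aes(x = count)) +
        geom_histogram(bins = 50) +
        scale_x_log10() +
        labs(x = "Times a word appears (log scale)",
             y = "Number of words",
             title = "Word frequency distribution, combined sample")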