An exploratory analysis of the three documents en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt, with basic summaries of word counts, line counts and histograms that show basic properties of the data set. First we load the basic libraries that are used in the analysis.
library(ggplot2) ## Library used to make plots
library(tm) ## Library used to make tokenization of data
library(xtable) ## Library that produces tables
library(SnowballC) ## Library used for word stemming
The three data sets were provided by SwiftKey, the company that is helping to develop the Capstone Project.
connection <- file("en_US.blogs.txt", open = "r")
blog <- readLines(connection)
close(connection)
connection <- file("en_US.news.txt", open = "r")
news <- readLines(connection)
close(connection)
connection <- file("en_US.twitter.txt", open = "r")
twitter <- readLines(connection)
close(connection)
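Depending on the platform, readLines() may warn about embedded nul characters or an incomplete final line in some of these files. A common workaround, shown here only as a sketch and not part of the original run, is to open the file in binary mode and skip nul characters:
connection <- file("en_US.twitter.txt", open = "rb") ## binary mode avoids the incomplete-final-line warning
twitter <- readLines(connection, encoding = "UTF-8", skipNul = TRUE) ## skipNul = TRUE drops embedded nul characters
close(connection)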
sizeblog <- object.size(blog)
sizenews <- object.size(news)
sizetwitter <- object.size(twitter)
name_of_file <- c("Blog", "News", "Twitter")
size_of_file_bytes <- c(sizeblog, sizenews, sizetwitter)
df <- cbind(name_of_file, size_of_file_bytes)
(df)
name_of_file size_of_file_bytes
[1,] "Blog" "260564320"
[2,] "News" "316037344"
[3,] "Twitter" "316037344"
The table shows that the biggest file is en_US.twitter.txt, followed by en_US.news.txt, and the smallest file is en_US.blogs.txt.
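For readability, the object sizes can also be printed in megabytes with the units argument of format(); this line is only an illustration and was not part of the original analysis:
format(sizeblog, units = "Mb") ## 260564320 bytes is roughly 248.5 Mb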
Now we proceed to find the line count of the three files.
blogLC <- length(blog)
newsLC <- length(news)
twitterLC <- length(twitter)
Name <- c("Blog", "News", "Twitter")
LineCount <- c(blogLC, newsLC, twitterLC)
df <- cbind(Name, LineCount)
(df)
Name LineCount
[1,] "Blog" "899288"
[2,] "News" "1010242"
[3,] "Twitter" "2360148"
We see that the en_US.twitter.txt file has the highest line count, which makes sense since it is also the biggest file according to the byte table shown above.
Since the data set is so big, the algorithm will be built with three samples from the original data. This is how the three samples were built:
set.seed(1234)
blog <- sample(blog, 5000, replace = TRUE)
news <- sample(news, 5000, replace = TRUE)
twitter <- sample(twitter, 5000, replace = TRUE)
blog <- gsub('[[:punct:] ]+', ' ', blog) ## collapse punctuation and repeated spaces into a single space
news <- gsub('[[:punct:] ]+', ' ', news)
twitter <- gsub('[[:punct:] ]+', ' ', twitter)
todo <- c(blog, news, twitter) ## combine the three samples ("todo" = "all")
blog<-VCorpus(VectorSource(blog))
news<-VCorpus(VectorSource(news))
twitter<-VCorpus(VectorSource(twitter))
todo<-VCorpus(VectorSource(todo))
blog<-tm_map(blog, content_transformer(tolower))
news<-tm_map(news, content_transformer(tolower))
twitter<-tm_map(twitter,content_transformer(tolower))
todo<-tm_map(todo,content_transformer(tolower))
blog<-tm_map(blog, removePunctuation)
news<-tm_map(news, removePunctuation)
twitter<-tm_map(twitter,removePunctuation)
todo<-tm_map(todo,removePunctuation)
blog<-tm_map(blog, removeNumbers)
news<-tm_map(news, removeNumbers)
twitter <- tm_map(twitter, removeNumbers)
todo <- tm_map(todo, removeNumbers)
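To check that the cleaning worked as intended, a cleaned document can be inspected directly; this is an illustrative check rather than part of the original analysis:
writeLines(as.character(blog[[1]])) ## show the first cleaned blog document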
frequenciesblog <- DocumentTermMatrix(blog)
frequenciesnews <- DocumentTermMatrix(news)
frequenciestwitter <- DocumentTermMatrix(twitter)
frequenciestodo <- DocumentTermMatrix(todo)
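The summaries below were presumably produced from the frequency of each term in the corresponding document-term matrix. One way to compute them, sketched here under the assumption that the matrices fit in memory once converted with as.matrix() (the freq* names are introduced only for illustration), is:
freqblog <- colSums(as.matrix(frequenciesblog)) ## total occurrences of each term in the blog sample
freqnews <- colSums(as.matrix(frequenciesnews))
freqtwitter <- colSums(as.matrix(frequenciestwitter))
summary(freqblog)
summary(freqnews)
summary(freqtwitter)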
The blog sample has a mean of 8.26 occurrences per word, 75% of the words occur 4 times or fewer, and the most repeated word appears 10,360 times.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 1.000 8.256 4.000 10360.000
The news sample has a mean of 7.133 occurrences per word; again 75% of the words occur 4 times or fewer, and the most repeated word appears 9,763 times.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 1.000 7.133 4.000 9763.000
The twitter sample has a mean of 5.031 occurrences per word, 75% of the words occur 2 times or fewer, and the most repeated word appears 1,953 times.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 1.000 5.031 2.000 1953.000
As observed, all of the data is highly skewed: most of the words are repeated only a handful of times, so putting all of the data files together might help and make the data less skewed.
The combined collection behaves like the data of the three individual files, and it is still highly skewed.
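ggplot2 was loaded at the beginning but no plot appears above; as a sketch of the histograms mentioned in the introduction, the term frequencies of the combined corpus could be plotted on a log scale (the freqtodo and freqdf names are introduced here only for illustration):
freqtodo <- colSums(as.matrix(frequenciestodo)) ## occurrences of each term in the combined corpus
freqdf <- data.frame(frequency = freqtodo)
ggplot(freqdf, aes(x = frequency)) +
  geom_histogram(bins = 50) + ## most terms fall in the lowest bins
  scale_x_log10() + ## log scale because the frequencies are highly skewed
  labs(x = "Occurrences per term (log scale)", y = "Number of terms")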