An exploratory analysis of the three documents en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt, with basic summaries of word counts, line counts and histograms that show basic properties of the data set. First we load the basic libraries that are used in the analysis.
library(ggplot2) ## Library used to make plots
library(tm) ## Library used to make tokenization of data
library(xtable) ## Library that produces tables
library(SnowballC) ## Library used for word stemming
The three data sets were provided by SwiftKey, the company that is helping to develop the Capstone Project.
connection <- file("en_US.blogs.txt", open = "r")
blog <- readLines(connection)
close(connection)
connection <- file("en_US.news.txt", open = "r")
news <- readLines(connection)
close(connection)
connection <- file("en_US.twitter.txt", open = "r")
twitter <- readLines(connection)
close(connection)
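Depending on the platform, readLines() may warn about embedded nul characters or an incomplete final line in some of these files. A common workaround, shown here only as a sketch and not part of the original run, is to open the file in binary mode and skip nul characters:
connection <- file("en_US.twitter.txt", open = "rb") ## binary mode avoids the incomplete-final-line warning
twitter <- readLines(connection, encoding = "UTF-8", skipNul = TRUE) ## skipNul = TRUE drops embedded nul characters
close(connection)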
sizeblog <- object.size(blog)
sizenews <- object.size(news)
sizetwitter <- object.size(twitter)
name_of_file <- c("Blog", "News", "Twitter")
size_of_file_bytes <- c(sizeblog, sizenews, sizetwitter)
df <- cbind(name_of_file, size_of_file_bytes)
(df)
name_of_file size_of_file_bytes
[1,] "Blog" "260564320"
[2,] "News" "316037344"
[3,] "Twitter" "316037344"
The table shows that the biggest file is en_US.twitter.txt, followed by en_US.news.txt, and the smallest file is en_US.blogs.txt.
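For readability, the object sizes can also be printed in megabytes with the units argument of format(); this line is only an illustration and was not part of the original analysis:
format(sizeblog, units = "Mb") ## 260564320 bytes is roughly 248.5 Mb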
Now we proceed to find the line count of the three files.
blogLC <- length(blog)
newsLC <- length(news)
twitterLC <- length(twitter)
Name <- c("Blog", "News", "Twitter")
LineCount <- c(blogLC, newsLC, twitterLC)
df <- cbind(Name, LineCount)
(df)
Name LineCount
[1,] "Blog" "899288"
[2,] "News" "1010242"
[3,] "Twitter" "2360148"
We see that the en_US.twitter.txt file has the highest line count, which makes sense since it is also the biggest file according to the byte table shown above.
Since the data set is so big, the algorithm will be built with three samples from the original data. This is how the three samples were built:
set.seed(1234)
blog <- sample(blog, 5000, replace = TRUE)
news <- sample(news, 5000, replace = TRUE)
twitter <- sample(twitter, 5000, replace = TRUE)
blog <- gsub('[[:punct:] ]+', ' ', blog) ## collapse punctuation and repeated spaces into a single space
news <- gsub('[[:punct:] ]+', ' ', news)
twitter <- gsub('[[:punct:] ]+', ' ', twitter)
todo <- c(blog, news, twitter) ## combine the three samples ("todo" = "all")
blog<-VCorpus(VectorSource(blog))
news<-VCorpus(VectorSource(news))
twitter<-VCorpus(VectorSource(twitter))
todo<-VCorpus(VectorSource(todo))
blog<-tm_map(blog, content_transformer(tolower))
news<-tm_map(news, content_transformer(tolower))
twitter<-tm_map(twitter,content_transformer(tolower))
todo<-tm_map(todo,content_transformer(tolower))
blog<-tm_map(blog, removePunctuation)
news<-tm_map(news, removePunctuation)
twitter<-tm_map(twitter,removePunctuation)
todo<-tm_map(todo,removePunctuation)
blog<-tm_map(blog, removeNumbers)
news<-tm_map(news, removeNumbers)
twitter <- tm_map(twitter, removeNumbers)
todo <- tm_map(todo, removeNumbers)
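To check that the cleaning worked as intended, a cleaned document can be inspected directly; this is an illustrative check rather than part of the original analysis:
writeLines(as.character(blog[[1]])) ## show the first cleaned blog document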
frequenciesblog <- DocumentTermMatrix(blog)
frequenciesnews <- DocumentTermMatrix(news)
frequenciestwitter <- DocumentTermMatrix(twitter)
frequenciestodo <- DocumentTermMatrix(todo)
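The summaries below were presumably produced from the frequency of each term in the corresponding document-term matrix. One way to compute them, sketched here under the assumption that the matrices fit in memory once converted with as.matrix() (the freq* names are introduced only for illustration), is:
freqblog <- colSums(as.matrix(frequenciesblog)) ## total occurrences of each term in the blog sample
freqnews <- colSums(as.matrix(frequenciesnews))
freqtwitter <- colSums(as.matrix(frequenciestwitter))
summary(freqblog)
summary(freqnews)
summary(freqtwitter)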
The blog sample has a mean of 8.26 occurrences per word, 75% of the words occur 4 times or fewer, and the most repeated word appears 10,360 times.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 1.000 8.256 4.000 10360.000
The news sample has a mean of 7.133 occurrences per word; again 75% of the words occur 4 times or fewer, and the most repeated word appears 9,763 times.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 1.000 7.133 4.000 9763.000
The twitter sample has a mean of 5.031 occurrences per word, 75% of the words occur 2 times or fewer, and the most repeated word appears 1,953 times.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 1.000 5.031 2.000 1953.000
As observed, all of the data is highly skewed: most of the words are repeated only a handful of times, so putting all of the data files together might help and make the data less skewed.
The combined collection behaves like the data of the three individual files, and it is still highly skewed.
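ggplot2 was loaded at the beginning but no plot appears above; as a sketch of the histograms mentioned in the introduction, the term frequencies of the combined corpus could be plotted on a log scale (the freqtodo and freqdf names are introduced here only for illustration):
freqtodo <- colSums(as.matrix(frequenciestodo)) ## occurrences of each term in the combined corpus
freqdf <- data.frame(frequency = freqtodo)
ggplot(freqdf, aes(x = frequency)) +
  geom_histogram(bins = 50) + ## most terms fall in the lowest bins
  scale_x_log10() + ## log scale because the frequencies are highly skewed
  labs(x = "Occurrences per term (log scale)", y = "Number of terms")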