Synopsis
The goal of this milestone is simply to show that we have become familiar with the data and are on track to create the prediction algorithm. This document explains the major features identified in the data so far and the plans for building the prediction algorithm and Shiny app, in a way that would be understandable to a non-data-scientist manager.
Content
Demonstrate that you’ve downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings amassed so far.
Get feedback on your plans for creating a prediction algorithm and Shiny app.
Loading Libraries
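Depending on the machine, RWeka can run out of Java heap while tokenizing. The optional setting below raises the limit; it must run before rJava and RWeka are loaded, and the 2 GB figure is an assumption about the available memory.
options(java.parameters = "-Xmx2g")  # assumption: roughly 2 GB of heap can be given to Java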
library(ggplot2)
library(knitr)
library(NLP)
library(stringi)
library(tm)
library(rJava)
library(RWeka)
set.seed(1234)
Loading the Data
We load all of the downloaded files into the environment.
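If the text files are not yet on disk, a minimal download sketch is shown below. The URL and the folder layout under data/en_US are assumptions based on the course dataset; adjust the paths if the archive extracts to a differently named subfolder.
# Assumed dataset URL and layout; verify before running
data.url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("data/en_US/en_US.blogs.txt")) {
    download.file(data.url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
    unzip("Coursera-SwiftKey.zip", exdir = "data")  # may extract to a different subfolder name
}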
blogs.file <- readLines("data/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news.file <- readLines("data/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter.file <- readLines("data/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
Data Summary
Files.Summary <- data.frame(
    File_Name = c("blogs.file", "news.file", "twitter.file"),
    File_Size = sapply(list(blogs.file, news.file, twitter.file),
                       function(x) format(object.size(x), "MB")),
    # Lines, LinesNEmpty, Chars and CharsNWhite from stringi, plus a word count
    t(rbind(sapply(list(blogs.file, news.file, twitter.file), stri_stats_general),
            WordCount = sapply(list(blogs.file, news.file, twitter.file), stri_stats_latex)[4, ])),
    MaxLineLength = sapply(list(blogs.file, news.file, twitter.file), function(x) max(nchar(x)))
)
The summary statistics are shown below.
kable(Files.Summary)
| File_Name    | File_Size | Lines   | LinesNEmpty | Chars     | CharsNWhite | WordCount | MaxLineLength |
|--------------|-----------|---------|-------------|-----------|-------------|-----------|---------------|
| blogs.file   | 255.4 Mb  | 899288  | 899288      | 206824382 | 170389539   | 37570839  | 40833         |
| news.file    | 19.8 Mb   | 77259   | 77259       | 15639408  | 13072698    | 2651432   | 5760          |
| twitter.file | 319 Mb    | 2360148 | 2360148     | 162096241 | 134082806   | 30451170  | 140           |
At this point, for a deeper exploration, we need to sample the files and reduce the input data so that the review is quicker and cleaner. In this case we take 3% of each file.
For the blogs file:
blogs.sample <- sample(blogs.file, floor(length(blogs.file) * 0.03))
For the news file:
news.sample <- sample(news.file, floor(length(news.file) * 0.03))
For the twitter file:
twitter.sample <- sample(twitter.file, floor(length(twitter.file) * 0.03))
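The three calls above repeat the same pattern; they could equally be expressed with a small helper (hypothetical, shown for reference only):
sample_lines <- function(lines, fraction = 0.03) {
    sample(lines, size = floor(length(lines) * fraction))
}
# e.g. blogs.sample <- sample_lines(blogs.file)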
This gives us new statistics for the samples.
Files.Summary <- data.frame(
    File_Name = c("blogs.sample", "news.sample", "twitter.sample"),
    File_Size = sapply(list(blogs.sample, news.sample, twitter.sample),
                       function(x) format(object.size(x), "MB")),
    # Lines, LinesNEmpty, Chars and CharsNWhite from stringi, plus a word count
    t(rbind(sapply(list(blogs.sample, news.sample, twitter.sample), stri_stats_general),
            WordCount = sapply(list(blogs.sample, news.sample, twitter.sample), stri_stats_latex)[4, ])),
    MaxLineLength = sapply(list(blogs.sample, news.sample, twitter.sample), function(x) max(nchar(x)))
)
kable(Files.Summary)
| File_Name      | File_Size | Lines | LinesNEmpty | Chars   | CharsNWhite | WordCount | MaxLineLength |
|----------------|-----------|-------|-------------|---------|-------------|-----------|---------------|
| blogs.sample   | 7.7 Mb    | 26978 | 26978       | 6227319 | 5129497     | 1132931   | 6886          |
| news.sample    | 0.6 Mb    | 2317  | 2317        | 465647  | 389156      | 78854     | 1150          |
| twitter.sample | 9.7 Mb    | 70804 | 70804       | 4857378 | 4018238     | 912327    | 140           |
Data Cleaning
Now it is time to remove undesired content such as invalid characters, empty lines, and punctuation so that the corpus contains normalized words.
First of all, let's combine all the samples into a single data set.
data.sample <- c(blogs.sample, news.sample, twitter.sample)
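Optionally, the combined sample can be written to disk so the corpus build can be reproduced later without re-sampling (the data/sample path below is an assumption):
dir.create("data/sample", showWarnings = FALSE, recursive = TRUE)
writeLines(data.sample, "data/sample/en_US.sample.txt")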
For this preparation we are going to use functions from the tm package. Harmonizing:
corpus <- VCorpus(VectorSource(data.sample))
corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case; content_transformer() keeps each element a valid TextDocument
corpus <- tm_map(corpus, removePunctuation)             # drop punctuation signs
corpus <- tm_map(corpus, removeNumbers)                 # drop digits
corpus <- tm_map(corpus, stripWhitespace)               # collapse repeated whitespace
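Two optional extra cleaning steps, not part of the pipeline above, could further normalize the sample: dropping URLs and replacing non-ASCII characters. This is a minimal sketch using tm's content_transformer; it is not run here (and would ideally be applied before removePunctuation), so the results below stay unchanged.
# Sketch only: extra transformers for a future iteration of the cleaning step
dropURLs <- content_transformer(function(x) gsub("http\\S+|www\\.\\S+", " ", x))
toASCII  <- content_transformer(function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = " "))
# corpus <- tm_map(corpus, dropURLs)
# corpus <- tm_map(corpus, toASCII)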
Distributions and frequencies
Now it is time to compute the distributions for the n-gram models. For that purpose, the sample data must be tokenized into groups of 1, 2, and 3 words.
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
matrix_1 <- TermDocumentMatrix(corpus, control = list(tokenize = unigram))
matrix_2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigram))
matrix_3 <- TermDocumentMatrix(corpus, control = list(tokenize = trigram))
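As a quick sanity check, the tokenizers defined above can be applied to a toy sentence (illustrative output shown as a comment):
bigram("the quick brown fox")
# expected: "the quick"  "quick brown"  "brown fox"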
Now that the term matrices are built, the next job is to count the number of appearances of each term in the three models; we keep only terms that occur at least 50 times.
unifreq <- findFreqTerms(matrix_1,lowfreq = 50)
bifreq <- findFreqTerms(matrix_2,lowfreq=50)
trifreq <- findFreqTerms(matrix_3,lowfreq=50)
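The three frequency tables below all follow the same pattern, so the logic could be captured in one hypothetical helper (shown here for reference only; the explicit blocks are kept below):
term_frequencies <- function(tdm, terms, top = 30) {
    freq <- rowSums(as.matrix(tdm[terms, ]))
    freq <- data.frame(word = names(freq), frequency = freq)
    head(freq[order(-freq$frequency), ], top)
}
# e.g. unidistribution <- term_frequencies(matrix_1, unifreq)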
For the unigram model:
unidistribution <- rowSums(as.matrix(matrix_1[unifreq,]))
unidistribution <- data.frame(word=names(unidistribution), frequency=unidistribution)
unidistribution <- unidistribution[order(-unidistribution$frequency),][1:30,]
ggplot(data=unidistribution, aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "orange", colour = "black") + ggtitle("Words Unigram distribution") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))

For the bigram model:
bidistribution <- rowSums(as.matrix(matrix_2[bifreq,]))
bidistribution <- data.frame(word=names(bidistribution), frequency=bidistribution)
bidistribution <- bidistribution[order(-bidistribution$frequency),][1:30,]
ggplot(data=bidistribution, aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "red", colour = "black") + ggtitle("Words Bigram distribution") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))

For the trigram model:
tridistribution <- rowSums(as.matrix(matrix_3[trifreq,]))
tridistribution <- data.frame(word=names(tridistribution), frequency=tridistribution)
tridistribution <- tridistribution[order(-tridistribution$frequency),][1:30,]
ggplot(data=tridistribution, aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "green", colour = "black") + ggtitle("Words Trigram distribution") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))

Next steps
After analyzing the source data through these exploratory tasks, the n-gram approach looks suitable for grouping words and capturing their contextual proximity.
Now it is time to design the prediction algorithm and build a Shiny application to support it.
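As a rough illustration of the planned direction (not the final algorithm), the n-gram frequency tables above could already drive a simple back-off prediction: look for the longest matching prefix in the trigram table, fall back to the bigram table, and finally to the most frequent unigram. The function below is a hypothetical sketch over the small top-30 tables built in this report.
# Illustrative back-off sketch using the top-30 n-gram tables from this report
predict_next <- function(phrase) {
    words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
    if (length(words) == 2) {
        hits <- tridistribution[grepl(paste0("^", words[1], " ", words[2], " "), tridistribution$word), ]
        if (nrow(hits) > 0) return(sub(".* ", "", hits$word[which.max(hits$frequency)]))
    }
    if (length(words) >= 1) {
        hits <- bidistribution[grepl(paste0("^", tail(words, 1), " "), bidistribution$word), ]
        if (nrow(hits) > 0) return(sub(".* ", "", hits$word[which.max(hits$frequency)]))
    }
    as.character(unidistribution$word[which.max(unidistribution$frequency)])  # final fallback
}
# e.g. predict_next("thanks for the")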