Synopsis
The goal of this milestone is simply to show that we have become familiar with the data and are on track to create the prediction algorithm. This document explains the major features identified in the data so far and the plans for building the prediction algorithm and Shiny app, in a way that would be understandable to a non-data-scientist manager.
Content
Demonstrate that you’ve downloaded the data and have successfully loaded it in.
Create a basic report of summary statistics about the data sets.
Report any interesting findings amassed so far.
Get feedback on your plans for creating a prediction algorithm and Shiny app.
Loading Libraries
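Depending on the machine, RWeka can run out of Java heap while tokenizing. The optional setting below raises the limit; it must run before rJava and RWeka are loaded, and the 2 GB figure is an assumption about the available memory.
options(java.parameters = "-Xmx2g")  # assumption: roughly 2 GB of heap can be given to Java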
library(ggplot2)
library(knitr)
library(NLP)
library(stringi)
library(tm)
library(rJava)
library(RWeka)
set.seed(1234)
Loading the Data
We load all of the downloaded files into the environment.
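If the text files are not yet on disk, a minimal download sketch is shown below. The URL and the folder layout under data/en_US are assumptions based on the course dataset; adjust the paths if the archive extracts to a differently named subfolder.
# Assumed dataset URL and layout; verify before running
data.url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("data/en_US/en_US.blogs.txt")) {
    download.file(data.url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
    unzip("Coursera-SwiftKey.zip", exdir = "data")  # may extract to a different subfolder name
}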
blogs.file <- readLines("data/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news.file <- readLines("data/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter.file <- readLines("data/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
Data Summary
Files.Summary <- data.frame(
    File_Name = c("blogs.file", "news.file", "twitter.file"),
    File_Size = sapply(list(blogs.file, news.file, twitter.file),
                       function(x) format(object.size(x), "MB")),
    # Lines, LinesNEmpty, Chars and CharsNWhite from stringi, plus a word count
    t(rbind(sapply(list(blogs.file, news.file, twitter.file), stri_stats_general),
            WordCount = sapply(list(blogs.file, news.file, twitter.file), stri_stats_latex)[4, ])),
    MaxLineLength = sapply(list(blogs.file, news.file, twitter.file), function(x) max(nchar(x)))
)
The summary statistics are shown below.
kable(Files.Summary)
| File_Name    | File_Size | Lines   | LinesNEmpty | Chars     | CharsNWhite | WordCount | MaxLineLength |
|--------------|-----------|---------|-------------|-----------|-------------|-----------|---------------|
| blogs.file   | 255.4 Mb  | 899288  | 899288      | 206824382 | 170389539   | 37570839  | 40833         |
| news.file    | 19.8 Mb   | 77259   | 77259       | 15639408  | 13072698    | 2651432   | 5760          |
| twitter.file | 319 Mb    | 2360148 | 2360148     | 162096241 | 134082806   | 30451170  | 140           |
At this point, for a deeper exploration, we need to sample the files and reduce the input data so that the review is quicker and cleaner. In this case we take 3% of each file.
For the blogs file:
blogs.sample <- sample(blogs.file, floor(length(blogs.file) * 0.03))
For the news file:
news.sample <- sample(news.file, floor(length(news.file) * 0.03))
For the twitter file:
twitter.sample <- sample(twitter.file, floor(length(twitter.file) * 0.03))
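The three calls above repeat the same pattern; they could equally be expressed with a small helper (hypothetical, shown for reference only):
sample_lines <- function(lines, fraction = 0.03) {
    sample(lines, size = floor(length(lines) * fraction))
}
# e.g. blogs.sample <- sample_lines(blogs.file)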
This gives us new statistics for the samples.
Files.Summary <- data.frame(
    File_Name = c("blogs.sample", "news.sample", "twitter.sample"),
    File_Size = sapply(list(blogs.sample, news.sample, twitter.sample),
                       function(x) format(object.size(x), "MB")),
    # Lines, LinesNEmpty, Chars and CharsNWhite from stringi, plus a word count
    t(rbind(sapply(list(blogs.sample, news.sample, twitter.sample), stri_stats_general),
            WordCount = sapply(list(blogs.sample, news.sample, twitter.sample), stri_stats_latex)[4, ])),
    MaxLineLength = sapply(list(blogs.sample, news.sample, twitter.sample), function(x) max(nchar(x)))
)
kable(Files.Summary)
| File_Name      | File_Size | Lines | LinesNEmpty | Chars   | CharsNWhite | WordCount | MaxLineLength |
|----------------|-----------|-------|-------------|---------|-------------|-----------|---------------|
| blogs.sample   | 7.7 Mb    | 26978 | 26978       | 6227319 | 5129497     | 1132931   | 6886          |
| news.sample    | 0.6 Mb    | 2317  | 2317        | 465647  | 389156      | 78854     | 1150          |
| twitter.sample | 9.7 Mb    | 70804 | 70804       | 4857378 | 4018238     | 912327    | 140           |
Data Cleaning
Now it is time to remove undesired content such as invalid characters, empty lines, and punctuation so that the corpus contains normalized words.
First of all, let's combine all the samples into a single data set.
data.sample <- c(blogs.sample, news.sample, twitter.sample)
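Optionally, the combined sample can be written to disk so the corpus build can be reproduced later without re-sampling (the data/sample path below is an assumption):
dir.create("data/sample", showWarnings = FALSE, recursive = TRUE)
writeLines(data.sample, "data/sample/en_US.sample.txt")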
For this preparation we are going to use functions from the tm package. Harmonizing:
corpus <- VCorpus(VectorSource(data.sample))
corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case; content_transformer() keeps each element a valid TextDocument
corpus <- tm_map(corpus, removePunctuation)             # drop punctuation signs
corpus <- tm_map(corpus, removeNumbers)                 # drop digits
corpus <- tm_map(corpus, stripWhitespace)               # collapse repeated whitespace
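Two optional extra cleaning steps, not part of the pipeline above, could further normalize the sample: dropping URLs and replacing non-ASCII characters. This is a minimal sketch using tm's content_transformer; it is not run here (and would ideally be applied before removePunctuation), so the results below stay unchanged.
# Sketch only: extra transformers for a future iteration of the cleaning step
dropURLs <- content_transformer(function(x) gsub("http\\S+|www\\.\\S+", " ", x))
toASCII  <- content_transformer(function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = " "))
# corpus <- tm_map(corpus, dropURLs)
# corpus <- tm_map(corpus, toASCII)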
Distributions and frequencies
Now it is time to compute the distributions for the n-gram models. For that purpose, the sample data must be tokenized into groups of 1, 2, and 3 words.
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
matrix_1 <- TermDocumentMatrix(corpus, control = list(tokenize = unigram))
matrix_2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigram))
matrix_3 <- TermDocumentMatrix(corpus, control = list(tokenize = trigram))
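As a quick sanity check, the tokenizers defined above can be applied to a toy sentence (illustrative output shown as a comment):
bigram("the quick brown fox")
# expected: "the quick"  "quick brown"  "brown fox"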
Now that the term matrices are built, the next job is to count the number of appearances of each term in the three models; we keep only terms that occur at least 50 times.
unifreq <- findFreqTerms(matrix_1,lowfreq = 50)
bifreq <- findFreqTerms(matrix_2,lowfreq=50)
trifreq <- findFreqTerms(matrix_3,lowfreq=50)
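The three frequency tables below all follow the same pattern, so the logic could be captured in one hypothetical helper (shown here for reference only; the explicit blocks are kept below):
term_frequencies <- function(tdm, terms, top = 30) {
    freq <- rowSums(as.matrix(tdm[terms, ]))
    freq <- data.frame(word = names(freq), frequency = freq)
    head(freq[order(-freq$frequency), ], top)
}
# e.g. unidistribution <- term_frequencies(matrix_1, unifreq)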
For the unigram model:
unidistribution <- rowSums(as.matrix(matrix_1[unifreq,]))
unidistribution <- data.frame(word=names(unidistribution), frequency=unidistribution)
unidistribution <- unidistribution[order(-unidistribution$frequency),][1:30,]
ggplot(data=unidistribution, aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "orange", colour = "black") + ggtitle("Words Unigram distribution") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))

For the bigram model:
bidistribution <- rowSums(as.matrix(matrix_2[bifreq,]))
bidistribution <- data.frame(word=names(bidistribution), frequency=bidistribution)
bidistribution <- bidistribution[order(-bidistribution$frequency),][1:30,]
ggplot(data=bidistribution, aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "red", colour = "black") + ggtitle("Words Bigram distribution") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))

For the trigram model:
tridistribution <- rowSums(as.matrix(matrix_3[trifreq,]))
tridistribution <- data.frame(word=names(tridistribution), frequency=tridistribution)
tridistribution <- tridistribution[order(-tridistribution$frequency),][1:30,]
ggplot(data=tridistribution, aes(x = reorder(word, -frequency), y = frequency)) +
geom_bar(stat = "identity", fill = "green", colour = "black") + ggtitle("Words Trigram distribution") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))

Next steps
After analyzing the source data through these exploratory tasks, the n-gram approach looks suitable for grouping words and capturing their contextual proximity.
Now it is time to design the prediction algorithm and build a Shiny application to support it.
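As a rough illustration of the planned direction (not the final algorithm), the n-gram frequency tables above could already drive a simple back-off prediction: look for the longest matching prefix in the trigram table, fall back to the bigram table, and finally to the most frequent unigram. The function below is a hypothetical sketch over the small top-30 tables built in this report.
# Illustrative back-off sketch using the top-30 n-gram tables from this report
predict_next <- function(phrase) {
    words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
    if (length(words) == 2) {
        hits <- tridistribution[grepl(paste0("^", words[1], " ", words[2], " "), tridistribution$word), ]
        if (nrow(hits) > 0) return(sub(".* ", "", hits$word[which.max(hits$frequency)]))
    }
    if (length(words) >= 1) {
        hits <- bidistribution[grepl(paste0("^", tail(words, 1), " "), bidistribution$word), ]
        if (nrow(hits) > 0) return(sub(".* ", "", hits$word[which.max(hits$frequency)]))
    }
    as.character(unidistribution$word[which.max(unidistribution$frequency)])  # final fallback
}
# e.g. predict_next("thanks for the")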