---
title: "Capstone - Assignment1"
author: "cy ting"
date: "April 25, 2016"
output: html_document
---
This assignment provides an exploratory overview of three datasets: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. It reports basic summary statistics and plots of the most frequent n-grams, along with a word cloud that gives a high-level view of the sampled text.
Five R libraries must be loaded before the exploratory analysis can be performed on the three datasets.
library(tm)
library(ggplot2)
library(wordcloud)
library(SnowballC)
library(RWeka)
blog<-readLines("./dataset/en_US/en_US.blogs.txt",encoding="UTF-8", skipNul = TRUE)
twitter<-readLines("./dataset/en_US/en_US.twitter.txt",encoding="UTF-8", skipNul = TRUE)
news<-readLines("./dataset/en_US/en_US.news.txt",encoding="UTF-8", skipNul = TRUE)
dt.blogs<-iconv(blog,from='latin1',to ='ASCII',sub="")
dt.twitter<-iconv(twitter,from='latin1',to ='ASCII', sub="")
dt.news<-iconv(news,from='latin1',to ='ASCII',sub="")
Number of lines in each dataset
paste("Twitter =" , length(dt.twitter))
paste("Blogs =" , length(dt.blogs))
paste("News =" , length(dt.news))
## [1] "Twitter = 2360148"
## [1] "Blogs = 899288"
## [1] "News = 77259"
Approximate file size of each dataset in MB
paste("Twitter dataset =", round(file.info("./dataset/en_US/en_US.twitter.txt")$size /1024 /1000, digits = 0), "MB")
paste("Blogs dataset =", round(file.info("./dataset/en_US/en_US.blogs.txt")$size /1024 /1000, digits = 0), "MB")
paste("News dataset =", round(file.info("./dataset/en_US/en_US.news.txt")$size /1024 /1000, digits = 0), "MB")
## [1] "Twitter dataset = 163 MB"
## [1] "Blogs dataset = 205 MB"
## [1] "News dataset = 201 MB"
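The expressions above divide by 1024 and then by 1000, which falls between the decimal megabyte (1000^2 bytes) and the binary mebibyte (1024^2 bytes). A minimal sketch of a consistent conversion follows. `bytes_to_mb` is a hypothetical helper and `example_bytes` is an illustrative value, not a figure from the datasets.

```r
# Hypothetical helper: convert a byte count to decimal megabytes (MB)
# or, with binary = TRUE, to binary mebibytes (MiB).
bytes_to_mb <- function(bytes, binary = FALSE) {
  divisor <- if (binary) 1024^2 else 1000^2
  bytes / divisor
}

# Illustrative byte count, not taken from the datasets.
example_bytes <- 200 * 1000^2

bytes_to_mb(example_bytes)                 # decimal megabytes
bytes_to_mb(example_bytes, binary = TRUE)  # binary mebibytes
```

For a real file, the byte count could come from `file.size()` or `file.info()$size`, as in the code above.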
Word count before any preprocessing
dt.twitter.WordCount <- sum(sapply(gregexpr("\\S+", dt.twitter), length))
paste("Twitter dataset =", dt.twitter.WordCount)
dt.blogs.WordCount <- sum(sapply(gregexpr("\\S+", dt.blogs), length))
paste("Blogs dataset =", dt.blogs.WordCount)
dt.news.WordCount <- sum(sapply(gregexpr("\\S+", dt.news), length))
paste("News dataset =", dt.news.WordCount)
## [1] "Twitter dataset = 30341030"
## [1] "Blogs dataset = 37272703"
## [1] "News dataset = 2639044"
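One caveat with the `gregexpr` approach above: when a line contains no match, `gregexpr` returns `-1`, which is a vector of length one, so an empty line is counted as one word. The sketch below shows an alternative that counts empty lines as zero. `count_words` is a hypothetical helper and `toy` is made-up input, not data from the three files.

```r
# Hypothetical helper: count whitespace-separated tokens per line,
# treating empty or blank lines as zero words.
count_words <- function(lines) {
  tokens <- strsplit(trimws(lines), "\\s+")
  sum(vapply(tokens, function(tok) sum(nzchar(tok)), integer(1)))
}

# Made-up input with one empty line and irregular spacing.
toy <- c("the quick brown fox", "", "  two   words ")

# Approach used in the report: the empty line contributes 1.
sum(sapply(gregexpr("\\S+", toy), length))  # 7

# Helper above: the empty line contributes 0.
count_words(toy)                            # 6
```

On the full datasets the difference equals the number of lines with no non-whitespace characters, which may well be small.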
For this assignment, a random sample of 10,000 lines is drawn from each dataset. The three samples are then combined so that a corpus can be created with the tm package.
# Draw 10,000 lines from each dataset, indexing each sample from its own source vector
dt.twitter.sample <- dt.twitter[sample(length(dt.twitter), 10000, replace = FALSE)]
dt.blogs.sample <- dt.blogs[sample(length(dt.blogs), 10000, replace = FALSE)]
dt.news.sample <- dt.news[sample(length(dt.news), 10000, replace = FALSE)]
# Concatenate into a single character vector for VectorSource()
combined.sample <- c(dt.twitter.sample, dt.blogs.sample, dt.news.sample)
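Because `sample()` is random, the sampled lines differ between runs unless a seed is set. The sketch below illustrates reproducible per-source sampling on toy vectors. The seed value, the `toy.*` vectors, and the `sample_lines` helper are all assumptions made for illustration and do not appear in the original analysis.

```r
set.seed(2016)  # arbitrary seed, chosen only for illustration

# Toy stand-ins for the three cleaned datasets.
toy.twitter <- paste("tweet", 1:50)
toy.blogs   <- paste("blog", 1:40)
toy.news    <- paste("news", 1:30)

# Hypothetical helper: draw n lines without replacement from one vector.
sample_lines <- function(lines, n) {
  lines[sample(length(lines), n, replace = FALSE)]
}

# Each sample is drawn from its own source, then concatenated.
toy.sample <- c(sample_lines(toy.twitter, 5),
                sample_lines(toy.blogs, 5),
                sample_lines(toy.news, 5))
length(toy.sample)
```

Calling `set.seed()` with the same value before sampling yields the same lines on every run, which makes the downstream counts repeatable.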
In this assignment, no stemming or stopword removal is performed.
myCorpus<-VCorpus(VectorSource(combined.sample))
myCorpus<-tm_map(myCorpus,content_transformer(tolower))
myCorpus<-tm_map(myCorpus,removePunctuation)
myCorpus<-tm_map(myCorpus,removeNumbers)
myCorpus<-tm_map(myCorpus,stripWhitespace)
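To see what the four cleaning steps do to a single line, they can be approximated in base R. This is only a sketch: the regular expressions are assumed to mirror the default behaviour of the tm transformations, and `clean_text` is a hypothetical helper, not part of tm.

```r
# Hypothetical helper approximating the tm_map steps above in base R.
clean_text <- function(x) {
  x <- tolower(x)                    # lower-case
  x <- gsub("[[:punct:]]+", "", x)   # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)   # drop digits
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
  x
}

# Made-up input line.
clean_text("Hello,   World! It's 2016.")
# "hello world its " (collapsing whitespace leaves a trailing space)
```

Note that removing punctuation turns "it's" into "its", so contractions merge with other words in the later n-gram counts.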
High-level overview of words in the combined sampled dataset.
wordcloud(myCorpus, max.words = 200, random.order = FALSE,rot.per=0.35, use.r.layout=FALSE,colors=brewer.pal(8, "Dark2"))
The code below tokenizes the corpus into n-grams and builds a data frame of frequencies for one-grams, two-grams, and three-grams. Each data frame is then sorted in decreasing order of frequency so that the top 10 n-grams can be extracted.
# Extract the cleaned text of every document into a one-column data frame
df<-data.frame(text=unlist(sapply(myCorpus,`[`,"content")), stringsAsFactors=FALSE)
# NGramTokenizer expects a character vector, so pass the text column rather than the data frame
one.gram<-NGramTokenizer(df$text,Weka_control(min=1,max=1))
df.one.gram<-data.frame(table(one.gram))
df.one.gram<-df.one.gram[order(df.one.gram$Freq, decreasing=TRUE), ]
two.gram<-NGramTokenizer(df$text,Weka_control(min=2,max=2))
df.two.gram<-data.frame(table(two.gram))
df.two.gram<-df.two.gram[order(df.two.gram$Freq, decreasing=TRUE), ]
three.gram<-NGramTokenizer(df$text,Weka_control(min=3,max=3))
df.three.gram<-data.frame(table(three.gram))
df.three.gram<-df.three.gram[order(df.three.gram$Freq, decreasing=TRUE), ]
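The tokenize, tabulate, and sort pattern above can be illustrated without RWeka on a few made-up lines. The sketch below is a base-R stand-in and not the RWeka implementation. `ngrams` is a hypothetical helper that splits on whitespace only, and `toy` is invented input.

```r
# Hypothetical helper: all n-grams of one line, splitting on whitespace.
ngrams <- function(line, n) {
  words <- strsplit(trimws(line), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Made-up lines, already lower-cased and free of punctuation.
toy <- c("thanks for the follow", "thanks for the help", "in the end")

# Tokenize each line separately so no n-gram spans two lines.
bigrams <- unlist(lapply(toy, ngrams, n = 2))

# Tabulate and sort by decreasing frequency, as in the report.
freq <- as.data.frame(table(bigrams), stringsAsFactors = FALSE)
freq <- freq[order(freq$Freq, decreasing = TRUE), ]
head(freq, 3)
```

In this toy input, "for the" and "thanks for" each occur twice and every other two-gram occurs once.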
Graph for top 10 one-gram
df.one.gram.top10 <- df.one.gram[1:10,]
ggplot(df.one.gram.top10, aes(one.gram, Freq, fill=one.gram)) +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=0, hjust=1),
        legend.position="none")
Graph for top 10 two-gram
df.two.gram.top10 <- df.two.gram[1:10,]
ggplot(df.two.gram.top10, aes(two.gram, Freq, fill=two.gram)) +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=90, hjust=1),
        legend.position="none")
Graph for top 10 three-gram
df.three.gram.top10 <- df.three.gram[1:10,]
ggplot(df.three.gram.top10, aes(three.gram, Freq, fill=three.gram)) +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=90, hjust=1),
        legend.position="none")
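ggplot2 draws a discrete x axis in factor-level order, and `table()` produces alphabetically sorted levels, so the bars in the plots above are arranged alphabetically rather than by frequency. One possible way to order them by frequency is to reset the factor levels before plotting. The sketch below uses a made-up data frame `toy` with invented counts.

```r
# Made-up frequency table; the counts are illustrative only.
toy <- data.frame(term = c("and", "for", "the"),
                  Freq = c(30, 10, 50),
                  stringsAsFactors = FALSE)

# Set the factor levels in decreasing order of frequency.
toy$term <- factor(toy$term,
                   levels = toy$term[order(toy$Freq, decreasing = TRUE)])
levels(toy$term)  # "the" "and" "for"

# With ggplot2 loaded, the bars would then follow this order:
# ggplot(toy, aes(term, Freq, fill = term)) + geom_bar(stat = "identity")
```

The same re-levelling could be applied to the `one.gram`, `two.gram`, and `three.gram` columns of the top-10 data frames.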
Below are the findings from the exploratory analysis of the dataset.
1. In the word cloud, stopwords such as “and”, “for”, and “the” are the most prominent words.
2. In the one-gram analysis, “i”, “for”, and “the” have higher frequencies than other words.
3. In the two-gram analysis, “for the”, “in the”, and “of the” have higher frequencies than other two-grams.
4. In the three-gram analysis, “thanks for the” appears most often.
The next task is to develop a predictive model and a Shiny application.