---
title: "Capstone - Assignment1"
author: "cy ting"
date: "April 25, 2016"
output: html_document
---

Introduction

This assignment provides an exploratory overview of three datasets: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. It reports basic summary statistics, plots the most frequent n-grams, and includes a word cloud to give a high-level view of the data.

Loading R Libraries

Five R libraries are loaded before the exploratory analysis is performed on the three datasets.

library(tm)
library(ggplot2)
library(wordcloud)
library(SnowballC)
library(RWeka)

Loading and Cleaning the Datasets

blog<-readLines("./dataset/en_US/en_US.blogs.txt",encoding="UTF-8", skipNul = TRUE)
twitter<-readLines("./dataset/en_US/en_US.twitter.txt",encoding="UTF-8", skipNul = TRUE)
news<-readLines("./dataset/en_US/en_US.news.txt",encoding="UTF-8", skipNul = TRUE)

dt.blogs<-iconv(blog,from='latin1',to ='ASCII',sub="")
dt.twitter<-iconv(twitter,from='latin1',to ='ASCII', sub="")
dt.news<-iconv(news,from='latin1',to ='ASCII',sub="")
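The loading step above can be tried on a small file first. The sketch below is illustrative only: the temporary file and its three lines are made up, and opening the connection in binary mode ("rb") is an assumption, a common precaution against text-mode reads stopping early on unusual control characters on some platforms. It also converts from UTF-8, the encoding the lines were read with, rather than from latin1.

```r
# Write a tiny UTF-8 test file (hypothetical content, not the real data).
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, "wb")
writeLines(enc2utf8(c("plain", "caf\u00e9", "last")), con, useBytes = TRUE)
close(con)

# Read through a binary-mode connection, then close it explicitly.
con <- file(tmp, "rb")
lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Characters with no ASCII equivalent are dropped because sub = "".
ascii <- iconv(lines, from = "UTF-8", to = "ASCII", sub = "")
length(lines)
ascii
```

Comparing `length(lines)` against an independent line count (for example from a text editor or `wc -l`) is a quick way to confirm that a file was read completely.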

Basic Statistics of the Datasets

The length of datasets

paste("Twitter =" , length(dt.twitter))
paste("Blogs =" , length(dt.blogs))
paste("News =" , length(dt.news))
## [1] "Twitter = 2360148"
## [1] "Blogs = 899288"
## [1] "News = 77259"

The file size in MB

paste("Twitter dataset =", round(file.info("./dataset/en_US/en_US.twitter.txt")$size /1024 /1000, digits = 0), "MB")
paste("Blogs dataset =", round(file.info("./dataset/en_US/en_US.blogs.txt")$size /1024 /1000, digits = 0), "MB")
paste("News dataset =", round(file.info("./dataset/en_US/en_US.news.txt")$size /1024 /1000, digits = 0), "MB")
## [1] "Twitter dataset = 163 MB"
## [1] "Blogs dataset = 205 MB"
## [1] "News dataset = 201 MB"
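Note that the code above divides by 1024 and then by 1000, which mixes the binary and decimal conventions, so the reported values fall between the two. A minimal sketch of a consistent conversion follows; `bytes_to_mb` is a hypothetical helper and the byte count is an invented example, not one of the dataset sizes.

```r
# Hypothetical helper (not part of the original report): bytes to megabytes.
# binary = FALSE uses decimal MB (1000^2 bytes); binary = TRUE uses MiB (1024^2 bytes).
bytes_to_mb <- function(bytes, binary = FALSE) {
  divisor <- if (binary) 1024^2 else 1000^2
  bytes / divisor
}

size_bytes <- 5e6                        # made-up example size
bytes_to_mb(size_bytes)                  # decimal megabytes
bytes_to_mb(size_bytes, binary = TRUE)   # mebibytes
size_bytes / 1024 / 1000                 # the mixed divisor used in the report
```

In practice `file.info(path)$size` would be passed as `bytes`; whichever convention is chosen, stating it next to the figures avoids ambiguity.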

Word count before any preprocessing

# Count whitespace-delimited tokens in each dataset
dt.twitter.WordCount <- sum(sapply(gregexpr("\\S+", dt.twitter), length))
paste("Twitter dataset = ", dt.twitter.WordCount)
dt.blogs.WordCount <- sum(sapply(gregexpr("\\S+", dt.blogs), length))
paste("Blogs dataset = ", dt.blogs.WordCount)
dt.news.WordCount <- sum(sapply(gregexpr("\\S+", dt.news), length))
paste("News dataset = ", dt.news.WordCount)
## [1] "Twitter dataset =  30341030"
## [1] "Blogs dataset =  37272703"
## [1] "News dataset =  2639044"
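One caveat with the counting approach above: `gregexpr` returns a single `-1` for a line with no match, so `length` counts an empty line as one word. The sketch below shows the effect on a made-up vector and one possible adjustment; `count_words` is a hypothetical helper, not part of the original analysis.

```r
toy <- c("thanks for the follow", "", "  two words ")

# Approach used in the report: an empty line still contributes 1.
naive <- sum(sapply(gregexpr("\\S+", toy), length))

# Hypothetical alternative: count only real match positions (> 0).
count_words <- function(x) {
  matches <- gregexpr("\\S+", x)
  sum(vapply(matches, function(m) sum(m > 0), integer(1)))
}

naive             # 7
count_words(toy)  # 6
```

If lines emptied by the ASCII conversion are rare, the difference from the reported totals is small.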

Preparing the sample dataset

In this assignment, a random sample of 10,000 lines is drawn from each dataset. The three samples are then combined so that a single corpus can be created with the tm package.

# Draw 10,000 random lines from each dataset; each sample indexes its own source vector
dt.twitter.sample <- dt.twitter[sample(1:length(dt.twitter), 10000, replace = FALSE)]
dt.blogs.sample <- dt.blogs[sample(1:length(dt.blogs), 10000, replace = FALSE)]
dt.news.sample <- dt.news[sample(1:length(dt.news), 10000, replace = FALSE)]

# Concatenate the three samples into one character vector (one document per line)
combined.sample <- c(dt.twitter.sample, dt.blogs.sample, dt.news.sample)
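Because the sampling is random, the n-gram counts change from run to run unless a seed is fixed. The following is a minimal sketch on a made-up vector; the seed value and the `toy` data are arbitrary assumptions chosen for illustration.

```r
set.seed(2016)                 # arbitrary seed, for reproducibility only
toy <- paste("line", 1:100)    # stand-in for one of the datasets

# sample(n, k) draws k distinct indices from 1..n (replace = FALSE is the default)
toy.sample <- toy[sample(length(toy), 10)]

# Re-seeding reproduces exactly the same sample
set.seed(2016)
toy.sample.again <- toy[sample(length(toy), 10)]
identical(toy.sample, toy.sample.again)
```

Calling `set.seed` once before the three `sample` calls in the report would make the whole analysis repeatable.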

Create the corpus, clean the data

In this assignment, neither stemming nor stopword removal is performed.

myCorpus<-VCorpus(VectorSource(combined.sample))
myCorpus<-tm_map(myCorpus,content_transformer(tolower))
myCorpus<-tm_map(myCorpus,removePunctuation)
myCorpus<-tm_map(myCorpus,removeNumbers)
myCorpus<-tm_map(myCorpus,stripWhitespace)
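To see what the four cleaning steps do to a single line, the sketch below approximates them with base R regular expressions. It is an illustration under the assumption that the tm transformations behave like these `gsub` calls with their default settings; `clean_line` is a hypothetical helper and the input string is made up.

```r
# Hypothetical base-R approximation of the tm cleaning pipeline above.
clean_line <- function(x) {
  x <- tolower(x)                      # lower-case everything
  x <- gsub("[[:punct:]]+", "", x)     # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)     # drop digits
  gsub("[[:space:]]+", " ", x)         # collapse runs of whitespace to one space
}

clean_line("Hello,  World! 123")
```

Note that whitespace is collapsed rather than trimmed, so a trailing space can remain after numbers or punctuation at the end of a line are removed.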

Word Cloud for one word

High-level overview of words in the combined sampled dataset.

wordcloud(myCorpus, max.words = 200, random.order = FALSE,rot.per=0.35, use.r.layout=FALSE,colors=brewer.pal(8, "Dark2"))

Extracting one-gram, two-gram, and three-gram

The code below tokenizes the corpus into one-grams, two-grams, and three-grams and builds a frequency data frame for each. Each data frame is then sorted in decreasing order of frequency so that the top 10 n-grams can be extracted.

df<-data.frame(text=unlist(sapply(myCorpus,`[`,"content")), stringsAsFactors=F)

one.gram<-NGramTokenizer(df,Weka_control(min=1,max=1))
df.one.gram<-data.frame(table(one.gram))
df.one.gram<-df.one.gram[order(df.one.gram$Freq, decreasing=TRUE), ]

two.gram<-NGramTokenizer(df,Weka_control(min=2,max=2))
df.two.gram<-data.frame(table(two.gram))
df.two.gram<-df.two.gram[order(df.two.gram$Freq, decreasing=TRUE), ]

three.gram<-NGramTokenizer(df,Weka_control(min=3,max=3))
df.three.gram<-data.frame(table(three.gram))
df.three.gram<-df.three.gram[order(df.three.gram$Freq, decreasing=TRUE), ]
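For readers unfamiliar with n-gram tokenization, the idea can be sketched in base R: slide a window of n words over the text and count each distinct window. This is not the RWeka implementation; `make_ngrams` is a hypothetical helper and the sentence is made up.

```r
# Hypothetical helper: all runs of n consecutive words, joined by spaces.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- strsplit("thanks for the follow thanks for the love", " ")[[1]]
tri <- make_ngrams(tokens, 3)

# Frequency table sorted in decreasing order, as in the report
tri.freq <- sort(table(tri), decreasing = TRUE)
names(tri.freq)[1]   # most frequent three-gram
tri.freq[[1]]        # its count
```

The same function with `n = 1` or `n = 2` gives the one-gram and two-gram counts.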

Graphs for one-gram, two-gram, and three-gram

Graph for top 10 one-gram

df.one.gram.top10 <- df.one.gram[1:10,]

ggplot(df.one.gram.top10, aes(one.gram, Freq, fill=one.gram)) + 
     geom_bar(stat="identity")+ 
     theme(axis.text.x=element_text(angle=0, hjust=1), 
           legend.position="none") 

Graph for top 10 two-gram

df.two.gram.top10 <- df.two.gram[1:10,]

ggplot(df.two.gram.top10, aes(two.gram, Freq, fill=two.gram)) + 
     geom_bar(stat="identity")+ 
     theme(axis.text.x=element_text(angle=90, hjust=1), 
           legend.position="none") 

Graph for top 10 three-gram

df.three.gram.top10 <- df.three.gram[1:10,]

ggplot(df.three.gram.top10, aes(three.gram, Freq, fill=three.gram)) + 
     geom_bar(stat="identity")+ 
     theme(axis.text.x=element_text(angle=90, hjust=1), 
           legend.position="none") 
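In the plots above, ggplot2 orders the bars by factor level, which is alphabetical by default, not by frequency. One possible adjustment is to reorder the factor levels before plotting. The sketch below uses a made-up data frame; the terms and counts are illustrative, not results from the datasets.

```r
# Made-up top-n table standing in for df.one.gram.top10
top <- data.frame(term = factor(c("the", "i", "for")),
                  Freq = c(50, 30, 40))

# Order factor levels by decreasing frequency
top$term <- reorder(top$term, -top$Freq)
levels(top$term)

# Build the plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(top, ggplot2::aes(term, Freq, fill = term)) +
    ggplot2::geom_bar(stat = "identity") +
    ggplot2::theme(legend.position = "none")
}
```

With the levels reordered, the tallest bar appears first, which makes the ranking easier to read.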

Findings

Below are the findings from exploratory analysis of the dataset.

  1. In the word cloud and the top 10 lists, stopwords such as “and”, “for”, and “the” appear most often.

  2. In the one-gram analysis, “i”, “for”, and “the” have higher frequencies than other words.

  3. In the two-gram analysis, “for the”, “in the”, and “of the” have higher frequencies than other two-grams.

  4. In the three-gram analysis, “thanks for the” appears most often.

Predictive Model and Shiny App development

The next task is to develop a predictive model and a Shiny application.