This report gives a brief summary of the text data studied so far, which was crawled from the web. Three sources of data are in use: US blogs, US news, and US Twitter. We explain the exploratory analysis performed on the three text data sets and point out some interesting findings.
Three files are used in this project: “en_US.blogs.txt”, “en_US.news.txt”, and “en_US.twitter.txt”.
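As a reference, a minimal sketch of how the line counts below can be obtained (and stored in Line_count for later use), assuming the three files sit in the working directory:
# sketch: count the lines of each file (assumes files are in the working directory)
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
Line_count <- sapply(files, function(f) {
  con <- file(f, "r")
  n <- length(readLines(con))   # reading millions of lines per file is slow
  close(con)
  n
})
data.frame(Item = files, Line_count = unname(Line_count))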
## Warning in readLines(con): line 167155 appears to contain an embedded nul
## Warning in readLines(con): line 268547 appears to contain an embedded nul
## Warning in readLines(con): line 1274086 appears to contain an embedded nul
## Warning in readLines(con): line 1759032 appears to contain an embedded nul
##                Item Line_count
## 1   en_US.blogs.txt     899288
## 2    en_US.news.txt    1010242
## 3 en_US.twitter.txt    2360148
According to the table above, the data sets are huge, so we randomly sample them to speed up the summary. To do so, we generate a binomial (0/1) indicator sequence and keep only the lines whose indicator equals 1.
# function to subsample a file: keep line i iff indicator[i] == 1
resample <- function(File, indicator) {
  j <- 1
  box <- rep("", sum(indicator))
  con <- file(File, "r")
  for (i in 1:length(indicator)) {
    if (indicator[i] == 1) {
      box[j] <- readLines(con, 1)
      j <- j + 1
    } else {
      # read and discard the line to advance the connection
      temp <- readLines(con, 1)
      rm(temp)
    }
  }
  close(con)
  return(box)
}
# flip a biased coin (p = 0.01) to decide whether to keep each line
set.seed(777)
cBlogs <- rbinom(Line_count[1], 1, 0.01)
cNews <- rbinom(Line_count[2], 1, 0.01)
cTwi <- rbinom(Line_count[3], 1, 0.01)
# resample the 3 data sets
blogs <- resample("en_US.blogs.txt", cBlogs)
news <- resample("en_US.news.txt", cNews)
twitter <- resample("en_US.twitter.txt", cTwi)
## Warning in readLines(con, 1): line 1 appears to contain an embedded nul
## Warning in readLines(con, 1): line 1 appears to contain an embedded nul
## Warning in readLines(con, 1): line 1 appears to contain an embedded nul
## Warning in readLines(con, 1): line 1 appears to contain an embedded nul
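As a quick sanity check, each sample should contain roughly 1% of the lines of its source file (a sketch; the exact counts depend on the seed):
# each sample should be roughly 1% of the source line count
length(blogs)
length(news)
length(twitter)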
To carry out the analysis, we convert each sampled data set to a corpus object using the “tm” package, which is designed for natural language processing, and then clean the data: convert all words to lower case, remove English stop words, remove punctuation, remove numbers, and strip extra whitespace. Finally, we convert each corpus to a DocumentTermMatrix object for further analysis.
library(tm)
## Loading required package: NLP
# function to clean a corpus and convert it to a DocumentTermMatrix
clean <- function(temp) {
  temp <- tm_map(temp, content_transformer(tolower))       # lower-case everything
  temp <- tm_map(temp, removeWords, stopwords("english"))  # drop English stop words
  temp <- tm_map(temp, removePunctuation)
  temp <- tm_map(temp, removeNumbers)
  temp <- tm_map(temp, stripWhitespace)
  dtm <- DocumentTermMatrix(temp)
  rm(temp)
  return(dtm)
}
blogs1 <- Corpus(VectorSource(blogs))
Blogs <- clean(blogs1) ; rm(blogs, blogs1)
news1 <- Corpus(VectorSource(news))
News <- clean(news1) ; rm(news, news1)
twitter1 <- Corpus(VectorSource(twitter))
Twitter <- clean(twitter1) ; rm(twitter, twitter1)
We look into the three data sets separately to see whether each type shows a different pattern. Twitter, blogs, and news are written in different tones, so we may want to build different word prediction models for different usage contexts. The exploratory analysis gives us a basic idea of how to segment the users.
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked from 'package:NLP':
##
## annotate
# blogs: top-20 word frequencies and word cloud
freq <- sort(colSums(as.matrix(Blogs)), decreasing = TRUE)
wordFre <- data.frame(word = names(freq), freq = freq)
top20 <- wordFre[1:20, ]
qplot(x = word, y = freq, data = top20)
library(wordcloud)
## Loading required package: RColorBrewer
set.seed(555)
wordcloud(names(freq), freq, min.freq = top20$freq[20], colors = brewer.pal(6, "Dark2"))
rm(freq, wordFre, top20, Blogs)
# news: top-20 word frequencies and word cloud
freq <- sort(colSums(as.matrix(News)), decreasing = TRUE)
wordFre <- data.frame(word = names(freq), freq = freq)
top20 <- wordFre[1:20, ]
qplot(x = word, y = freq, data = top20)
set.seed(555)
wordcloud(names(freq), freq, min.freq = top20$freq[20], colors = brewer.pal(6, "Dark2"))
rm(freq, wordFre, top20, News)
# twitter: top-20 word frequencies and word cloud
freq <- sort(colSums(as.matrix(Twitter)), decreasing = TRUE)
wordFre <- data.frame(word = names(freq), freq = freq)
top20 <- wordFre[1:20, ]
qplot(x = word, y = freq, data = top20)
set.seed(555)
wordcloud(names(freq), freq, min.freq = top20$freq[20], colors = brewer.pal(6, "Dark2"))
rm(freq, wordFre, top20, Twitter)