The aims of this milestone report are to:
1. Demonstrate that the data has been downloaded and loaded
2. Create a basic report of summary statistics about the data sets
3. Report any interesting findings
4. Get feedback
Running the Linux commands wc and du from within R, we can see that:
1. “en_US.blogs.txt” has 899,288 lines, 210,160,014 characters and is 200Mb in size
2. “en_US.news.txt” has 1,010,242 lines, 205,811,889 characters and is 197Mb in size
3. “en_US.twitter.txt” has 2,360,148 lines, 167,105,338 characters and is 159Mb in size
For simplicity, only the code for the blogs file is shown.
setwd("~/JHU_DataScience/capstone_project/final/en_US/")
system('wc -lc en_US.blogs.txt')
system('du -ch en_US.blogs.txt')
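The same numbers can be double-checked in base R if needed (a sketch only; sum(nchar()) counts characters rather than bytes, so it will not exactly match wc -c, and reading a ~200Mb file this way is slow):
blogs<-readLines("en_US.blogs.txt",encoding="UTF-8",skipNul=TRUE)
length(blogs)                 #number of lines
sum(nchar(blogs))             #number of characters (approximate vs. wc -c)
file.size("en_US.blogs.txt")  #size in bytes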
For exploratory purposes, I sample a subset of the data, about 20% of the lines, by running Perl one-liners from within R.
setwd("~/JHU_DataScience/capstone_project/final/en_US/")
system("perl -ne 'print if (rand() < .20)' en_US.blogs.txt > en_US.blogs.subset.txt")
system("perl -ne 'print if (rand() < .20)' en_US.news.txt > en_US.news.subset.txt")
system("perl -ne 'print if (rand() < .20)' en_US.twitter.txt > en_US.twitter.subset.txt")
A word cloud is a good way to explore the most frequent words in each file (blogs, news and Twitter). To do this, we first need to create a corpus and a term-document matrix.
Create corpus and generate term-document matrix
setwd("~/JHU_DataScience/capstone_project/final/en_US/")
library(tm)
if(!file.exists("subset")) {dir.create("subset")}
system("mv *.subset.txt ./subset/")
filepath<-"~/JHU_DataScience/capstone_project/final/en_US/subset/"
#Create corpus
en_US<-VCorpus(DirSource(filepath,encoding="UTF-8"),
readerControl=list(language="en_US"))
#Remove numbers and punctuation, change to lower case, remove stop words and profanity, and strip whitespace
en_US<-tm_map(en_US,removeNumbers)
en_US<-tm_map(en_US,removePunctuation)
en_US<-tm_map(en_US,content_transformer(tolower))
en_US<-tm_map(en_US,removeWords,stopwords(kind = "en"))
profanity<-read.table("./profanity.txt",header=F,sep="\n")
profanity<-profanity$V1
en_US<-tm_map(en_US,removeWords,profanity)
en_US<-tm_map(en_US,stripWhitespace)
#Generate term-document matrix
termDocMat<-TermDocumentMatrix(en_US)
termDocMat<-as.matrix(termDocMat)
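As a quick sanity check (not part of the original analysis), the matrix can be inspected to confirm that the three subset files became three document columns:
dim(termDocMat)       #terms x documents
colnames(termDocMat)  #should list the blogs, news and twitter subset files
head(sort(rowSums(termDocMat),decreasing=TRUE),10)  #top 10 terms across all three files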
The first word cloud is for blogs. The most frequent words are “one”, “will”, “just”, “like”, “can”, “time”.
The 931 most frequent words cover 50% of all word instances.
The 15435 most frequent words cover 90% of all word instances.
library(wordcloud)
library(RColorBrewer)
termFreq<-sort(termDocMat[,1],decreasing=T)
termFreqDF<-data.frame(terms=names(termFreq),freq=termFreq)
wordcloud(termFreqDF$terms,termFreqDF$freq,max.words=100,scale=c(2.5,0.5),random.order = F,rot.per=0.45, use.r.layout=FALSE, colors=brewer.pal(8,"Dark2"))
cumu_freq<-cumsum(termFreqDF$freq)
table(cumu_freq/tail(cumu_freq,n=1)>0.5)
##
## FALSE TRUE
## 934 328971
table(cumu_freq/tail(cumu_freq,n=1)>0.9)
##
## FALSE TRUE
## 15442 314463
The second word cloud is for news. The most common word is “said”. This makes sense, as it is typical to quote someone as a news source with sentences like “John Doe said something”.
The 1170 most frequent words cover 50% of all word instances.
The 15944 most frequent words cover 90% of all word instances.
termFreq<-sort(termDocMat[,2],decreasing=T)
termFreqDF<-data.frame(terms=names(termFreq),freq=termFreq)
wordcloud(termFreqDF$terms,termFreqDF$freq,max.words=100,scale=c(2.5,0.5),random.order = F,rot.per=0.45, use.r.layout=FALSE, colors=brewer.pal(8,"Dark2"))
cumu_freq<-cumsum(termFreqDF$freq)
table(cumu_freq/tail(cumu_freq,n=1)>0.5)
##
## FALSE TRUE
## 1174 328731
table(cumu_freq/tail(cumu_freq,n=1)>0.9)
##
## FALSE TRUE
## 16033 313872
The third word cloud is for Twitter. The most frequent words are “just”, “like”, “get”, “good”, “love”, “dont”, “great”. The word “LOL” is also quite common.
The 520 most frequent words cover 50% of all word instances.
The 12619 most frequent words cover 90% of all word instances.
This suggests that Twitter texts may have a simpler structure than news or blogs, since a much smaller set of the most frequent words covers 50% and 90% of all word instances.
termFreq<-sort(termDocMat[,3],decreasing=T)
termFreqDF<-data.frame(terms=names(termFreq),freq=termFreq)
wordcloud(termFreqDF$terms,termFreqDF$freq,max.words=100,scale=c(2.5,0.5),random.order = F,rot.per=0.45, use.r.layout=FALSE, colors=brewer.pal(8,"Dark2"))
cumu_freq<-cumsum(termFreqDF$freq)
table(cumu_freq/tail(cumu_freq,n=1)>0.5)
##
## FALSE TRUE
## 519 329386
table(cumu_freq/tail(cumu_freq,n=1)>0.9)
##
## FALSE TRUE
## 12623 317282
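The 50% and 90% coverage counts reported above can also be computed directly with a small helper function (a sketch; coverage_count is a name introduced here, and the result may differ by one from the table() counts depending on how the term that crosses the threshold is handled):
coverage_count<-function(freqs,p){
  freq<-sort(freqs,decreasing=TRUE)
  which(cumsum(freq)/sum(freq)>=p)[1]  #number of top terms needed to reach coverage p
}
sapply(1:3,function(i) coverage_count(termDocMat[,i],0.5))  #blogs, news, twitter
sapply(1:3,function(i) coverage_count(termDocMat[,i],0.9))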
Creating n-grams is a first step towards building predictive models. It gives a first look at which words are frequently used together. The word cloud analysis suggests that blogs, news and Twitter have different most frequent words (unigrams), so bi- and tri-grams are also generated separately for each of them.
options(mc.cores=1)
require(RWeka)
#Define uni-, bi- and tri-gram tokenizers and build the corresponding term-document matrices
UnigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 1))}
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
TrigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))}
tdm1 <- TermDocumentMatrix(en_US,control = list(tokenize=UnigramTokenizer))
tdm2 <- TermDocumentMatrix(en_US,control = list(tokenize=BigramTokenizer))
tdm3 <- TermDocumentMatrix(en_US,control = list(tokenize=TrigramTokenizer))
Most frequent bi- and tri-grams in blogs
gram2Freq<-as.matrix(tdm2)[,1]
gram2FreqDF<-data.frame(bigrams = names(gram2Freq), freq=gram2Freq)
gram2FreqDF<-gram2FreqDF[order(-gram2FreqDF$freq),]
gram3Freq<-as.matrix(tdm3)[,1]
gram3FreqDF<-data.frame(trigrams = names(gram3Freq), freq=gram3Freq)
gram3FreqDF<-gram3FreqDF[order(-gram3FreqDF$freq),]
library(ggplot2)
source("~/bin/R_scripts/multiplot.R")
gram2FreqDF$bigrams<-factor(gram2FreqDF$bigrams,levels=gram2FreqDF$bigrams[order(-gram2FreqDF$freq)],ordered=T)
p1 <- ggplot(gram2FreqDF[1:20,],aes(x=bigrams,y=freq)) + geom_bar(stat="identity",fill="red") + xlab("Bigrams") + ylab("Frequency") + theme(axis.text.x = element_text(angle = 90, hjust = 1,size=10))
gram3FreqDF$trigrams<-factor(gram3FreqDF$trigrams,levels=gram3FreqDF$trigrams[order(-gram3FreqDF$freq)],ordered=T)
p2 <- ggplot(gram3FreqDF[1:20,],aes(x=trigrams,y=freq)) + geom_bar(stat="identity",fill="orange") + xlab("Trigrams") + ylab("Frequency") + theme(axis.text.x = element_text(angle = 90, hjust = 1,size=10))
multiplot(p1,p2)
Most frequent bi- and tri-grams in news
The code is the same as for the blogs data, using column 2 of the term-document matrices.
Most frequent bi- and tri-grams in twitter
The code is the same as for the blogs data, using column 3 of the term-document matrices.
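To avoid pasting the same block three times, the plotting steps could be wrapped in a small function of the document column (a sketch; plot_ngrams is a hypothetical helper, not code I actually ran):
plot_ngrams <- function(doc, top_n = 20) {
  #doc: document column in tdm2/tdm3 (1 = blogs, 2 = news, 3 = twitter)
  freq2 <- sort(as.matrix(tdm2)[, doc], decreasing = TRUE)[1:top_n]
  freq3 <- sort(as.matrix(tdm3)[, doc], decreasing = TRUE)[1:top_n]
  df2 <- data.frame(ngram = factor(names(freq2), levels = names(freq2)), freq = freq2)
  df3 <- data.frame(ngram = factor(names(freq3), levels = names(freq3)), freq = freq3)
  p1 <- ggplot(df2, aes(x = ngram, y = freq)) + geom_bar(stat = "identity", fill = "red") +
    xlab("Bigrams") + ylab("Frequency") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 10))
  p2 <- ggplot(df3, aes(x = ngram, y = freq)) + geom_bar(stat = "identity", fill = "orange") +
    xlab("Trigrams") + ylab("Frequency") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 10))
  multiplot(p1, p2)
}
plot_ngrams(2)  #news
plot_ngrams(3)  #twitter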
The next steps will be to build a prediction model and a Shiny app around it. To achieve this, I will carefully consider these issues:
1. Pre-compute and save/load the term-document matrices, possibly outside of R using a Bash or Python script; exploratory analysis on 20% of the data is already very time consuming (see the sketch after this list).
2. Balance predictive performance against computing resources, especially the limitations of the Shiny server.
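For issue 1, one simple option is to serialize the pre-computed matrices once and reload them when the Shiny app starts (a sketch using base R; the file names are placeholders):
saveRDS(tdm2,"bigrams_tdm.rds")   #pre-compute once (slow)
saveRDS(tdm3,"trigrams_tdm.rds")
#...later, e.g. at the top of server.R:
tdm2<-readRDS("bigrams_tdm.rds")  #fast to load
tdm3<-readRDS("trigrams_tdm.rds")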