The aims of this milestone report are:
1. Download and load the data
2. Create a basic report of summary statistics about the data sets
3. Report any interesting findings
4. Get feedback

Basic summary of the files

Running Linux commands (wc and du) from within R, we can see that:
1. “en_US.blogs.txt” has 899,288 lines, 210,160,014 characters and is 200 MB in size
2. “en_US.news.txt” has 1,010,242 lines, 205,811,889 characters and is 197 MB in size
3. “en_US.twitter.txt” has 2,360,148 lines, 167,105,338 characters and is 159 MB in size

For simplicity, only the code for the blogs file is shown.

setwd("~/JHU_DataScience/capstone_project/final/en_US/")
system('wc -lc en_US.blogs.txt')   #line and character (byte) counts
system('du -ch en_US.blogs.txt')   #file size on disk
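
On a system without wc and du, roughly equivalent numbers can be obtained in base R. This is a minimal sketch (counts may differ slightly from wc depending on encoding and line endings):

#Approximate equivalents of wc -lc and du in base R
blogs<-readLines("en_US.blogs.txt",encoding="UTF-8",skipNul=TRUE)
length(blogs)                        #number of lines
sum(nchar(blogs,type="bytes"))       #number of characters (bytes), excluding newlines
file.size("en_US.blogs.txt")/1024^2  #file size in MB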

For exploratory purposes, I sample a subset of the data (about 20% of the lines) by running Perl one-liners from within R.

setwd("~/JHU_DataScience/capstone_project/final/en_US/")
system("perl -ne 'print if (rand() < .20)' en_US.blogs.txt > en_US.blogs.subset.txt") 
system("perl -ne 'print if (rand() < .20)' en_US.news.txt > en_US.news.subset.txt")
system("perl -ne 'print if (rand() < .20)' en_US.twitter.txt > en_US.twitter.subset.txt")  

Create corpus and wordcloud

A word cloud is a good way to explore the most frequent words in each file (blogs, news and Twitter). To do this, we first need to create a corpus and a term-document matrix.

Create corpus and generate term-document matrix

setwd("~/JHU_DataScience/capstone_project/final/en_US/")
library(tm)
library(wordcloud)      #used below for the word clouds
library(RColorBrewer)   #for brewer.pal()
if(!file.exists("subset")) {dir.create("subset")}
system("mv *.subset.txt ./subset/")
filepath<-"~/JHU_DataScience/capstone_project/final/en_US/subset/"
#Create corpus
en_US<-VCorpus(DirSource(filepath,encoding="UTF-8"),
               readerControl=list(language="en_US"))
#Remove numbers and punctuation, change to lower case, remove stopwords and profanity, and strip whitespace
en_US<-tm_map(en_US,removeNumbers)
en_US<-tm_map(en_US,removePunctuation)
en_US<-tm_map(en_US,content_transformer(tolower))
en_US<-tm_map(en_US,removeWords,stopwords(kind = "en"))
profanity<-read.table("./profanity.txt",header=F,sep="\n")
profanity<-profanity$V1
en_US<-tm_map(en_US,removeWords,profanity)
en_US<-tm_map(en_US,stripWhitespace)

#Generate term-document matrix 
termDocMat<-TermDocumentMatrix(en_US)
termDocMat<-as.matrix(termDocMat)
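
As a quick sanity check (purely illustrative), the matrix dimensions and column sums show how many distinct terms survived cleaning and how many word instances each file contributes:

dim(termDocMat)      #number of distinct terms x number of documents (3)
colSums(termDocMat)  #total word instances per file (blogs, news, twitter)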

The first word cloud is for blogs. The most frequent words are “one”, “will”, “just”, “like”, “can”, “time”.
The 931 most frequent words cover 50% of all word instances.
The 15,435 most frequent words cover 90% of all word instances.

#Word cloud of the 100 most frequent terms in blogs (column 1 of the matrix)
termFreq<-sort(termDocMat[,1],decreasing=T)
termFreqDF<-data.frame(terms=names(termFreq),freq=termFreq)
wordcloud(termFreqDF$terms,termFreqDF$freq,max.words=100,scale=c(2.5,0.5),random.order = F,rot.per=0.45, use.r.layout=FALSE, colors=brewer.pal(8,"Dark2"))

cumu_freq<-cumsum(termFreqDF$freq)
table(cumu_freq/tail(cumu_freq,n=1)>0.5)
## 
##  FALSE   TRUE 
##    934 328971
table(cumu_freq/tail(cumu_freq,n=1)>0.9)
## 
##  FALSE   TRUE 
##  15442 314463
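
The same coverage numbers can also be computed directly with a small helper function; this is a hypothetical convenience, not part of the original analysis:

#How many of the most frequent terms cover a fraction p of all word instances?
coverage<-function(freq,p){
  freq<-sort(freq,decreasing=TRUE)
  which(cumsum(freq)/sum(freq)>=p)[1]
}
coverage(termDocMat[,1],0.5)  #blogs, 50% coverage
coverage(termDocMat[,1],0.9)  #blogs, 90% coverage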

The second word cloud is for news. The most common word is “said”. This makes sense, as it is typical to quote a news source with sentences like “John Doe said something”.
The 1,170 most frequent words cover 50% of all word instances.
The 15,944 most frequent words cover 90% of all word instances.

termFreq<-sort(termDocMat[,2],decreasing=T)
termFreqDF<-data.frame(terms=names(termFreq),freq=termFreq)
wordcloud(termFreqDF$terms,termFreqDF$freq,max.words=100,scale=c(2.5,0.5),random.order = F,rot.per=0.45, use.r.layout=FALSE, colors=brewer.pal(8,"Dark2"))

cumu_freq<-cumsum(termFreqDF$freq)
table(cumu_freq/tail(cumu_freq,n=1)>0.5)
## 
##  FALSE   TRUE 
##   1174 328731
table(cumu_freq/tail(cumu_freq,n=1)>0.9)
## 
##  FALSE   TRUE 
##  16033 313872

The third word cloud is for Twitter. The most frequent words are “just”, “like”, “get”, “good”, “love”, “dont”, “great”. The word “LOL” is also quite common.
The 520 most frequent words cover 50% of all word instances.
The 12,619 most frequent words cover 90% of all word instances.
This suggests that Twitter texts may have a simpler structure than news or blogs, since a much smaller set of the most frequent words covers 50% and 90% of all word instances.

termFreq<-sort(termDocMat[,3],decreasing=T)
termFreqDF<-data.frame(terms=names(termFreq),freq=termFreq)
wordcloud(termFreqDF$terms,termFreqDF$freq,max.words=100,scale=c(2.5,0.5),random.order = F,rot.per=0.45, use.r.layout=FALSE, colors=brewer.pal(8,"Dark2"))

cumu_freq<-cumsum(termFreqDF$freq)
table(cumu_freq/tail(cumu_freq,n=1)>0.5)
## 
##  FALSE   TRUE 
##    519 329386
table(cumu_freq/tail(cumu_freq,n=1)>0.9)
## 
##  FALSE   TRUE 
##  12623 317282

Tokenization and bi- and tri-grams

Creating n-grams is a first step toward building predictive models. It gives a first look at which words are frequently used together. The word cloud analysis suggests that blogs, news and Twitter have different most frequent words (unigrams), so bi- and tri-grams are generated separately for each.

options(mc.cores=1)  #use a single core; RWeka tokenizers can conflict with tm's parallel processing
require(RWeka)
#Define uni-, bi- and tri-gram tokenizers and build a term-document matrix for each
UnigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 1))}
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
TrigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))}
tdm1 <- TermDocumentMatrix(en_US,control = list(tokenize=UnigramTokenizer))
tdm2 <- TermDocumentMatrix(en_US,control = list(tokenize=BigramTokenizer))
tdm3 <- TermDocumentMatrix(en_US,control = list(tokenize=TrigramTokenizer))
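
Before extracting full frequency tables, tm's findFreqTerms() gives a quick peek at very common n-grams (the cutoffs below are arbitrary):

findFreqTerms(tdm2,lowfreq=1000)  #bigrams occurring at least 1000 times across the corpus
findFreqTerms(tdm3,lowfreq=200)   #trigrams occurring at least 200 times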

Most frequent bi- and tri-grams in blogs

#Bigram frequencies for blogs (column 1)
gram2Freq<-as.matrix(tdm2)[,1]
gram2FreqDF<-data.frame(bigrams=names(gram2Freq),freq=gram2Freq)
gram2FreqDF<-gram2FreqDF[order(-gram2FreqDF$freq),]
#Trigram frequencies for blogs (column 1)
gram3Freq<-as.matrix(tdm3)[,1]
gram3FreqDF<-data.frame(trigrams=names(gram3Freq),freq=gram3Freq)
gram3FreqDF<-gram3FreqDF[order(-gram3FreqDF$freq),]

library(ggplot2)
source("~/bin/R_scripts/multiplot.R")
gram2FreqDF$bigrams<-factor(gram2FreqDF$bigrams,levels=gram2FreqDF$bigrams[order(-gram2FreqDF$freq)],ordered=T)
p1 <- ggplot(gram2FreqDF[1:20,],aes(x=bigrams,y=freq)) + geom_bar(stat="identity",fill="red") + xlab("Bigrams") + ylab("Frequency") + theme(axis.text.x = element_text(angle = 90, hjust = 1,size=10))

gram3FreqDF$trigrams<-factor(gram3FreqDF$trigrams,levels=gram3FreqDF$trigrams[order(-gram3FreqDF$freq)],ordered=T)
p2 <- ggplot(gram3FreqDF[1:20,],aes(x=trigrams,y=freq)) + geom_bar(stat="identity",fill="orange") + xlab("Trigrams") + ylab("Frequency") + theme(axis.text.x = element_text(angle = 90, hjust = 1,size=10))

multiplot(p1,p2)
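
If the sourced multiplot.R helper is not available, gridExtra::grid.arrange() can produce the same stacked layout (an alternative, not what was used here):

library(gridExtra)
grid.arrange(p1,p2,ncol=1)  #stack the bigram and trigram plots vertically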

Most frequent bi- and tri-grams in news

The same code as for the blogs data is used, with column 2 of the term-document matrices.

Most frequent bi- and tri-grams in twitter

The same code as for the blogs data is used, with column 3 of the term-document matrices. A wrapper function that avoids this repetition is sketched below.
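
A hedged sketch of such a wrapper, assuming column 1 is blogs, 2 is news and 3 is Twitter in tdm2 and tdm3 (the function name is hypothetical):

#Hypothetical wrapper: plot the top-20 bigrams and trigrams for document column i
plotNgrams<-function(i){
  bi<-sort(as.matrix(tdm2)[,i],decreasing=TRUE)[1:20]
  tri<-sort(as.matrix(tdm3)[,i],decreasing=TRUE)[1:20]
  biDF<-data.frame(ngram=factor(names(bi),levels=names(bi)),freq=bi)
  triDF<-data.frame(ngram=factor(names(tri),levels=names(tri)),freq=tri)
  p1<-ggplot(biDF,aes(x=ngram,y=freq))+geom_bar(stat="identity",fill="red")+
    xlab("Bigrams")+ylab("Frequency")+theme(axis.text.x=element_text(angle=90,hjust=1,size=10))
  p2<-ggplot(triDF,aes(x=ngram,y=freq))+geom_bar(stat="identity",fill="orange")+
    xlab("Trigrams")+ylab("Frequency")+theme(axis.text.x=element_text(angle=90,hjust=1,size=10))
  multiplot(p1,p2)
}
plotNgrams(2)  #news
plotNgrams(3)  #twitter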

Next steps

The next steps will be to build a prediction model and a Shiny app around it. To achieve this, I will carefully consider these issues:
1. Pre-compute and save/load the term-document matrices, possibly outside of R using a Bash or Python script (a simple R-based sketch is shown below). Exploratory analysis on 20% of the data is already very time consuming.
2. Balance predictive performance against computing resources, especially the limitations of the Shiny server.
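
For item 1, a simple R-based sketch (the file names here are hypothetical) is to pre-compute the n-gram matrices once and serialize them, so the Shiny app only needs a fast readRDS() at startup:

#Pre-compute once, offline
saveRDS(as.matrix(tdm2),"bigram_tdm.rds")
saveRDS(as.matrix(tdm3),"trigram_tdm.rds")
#In the Shiny app, load the pre-computed objects instead of rebuilding the corpus
bigramTDM<-readRDS("bigram_tdm.rds")
trigramTDM<-readRDS("trigram_tdm.rds")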