The aims of this milestone report are to:
1. Download and load the data
2. Create a basic report of summary statistics about the data sets
3. Report any interesting findings
4. Get feedback
Running Linux commands (wc and du) from within R, we can see that:
1. “en_US.blogs.txt” has 899,288 lines, 210,160,014 characters and is 200Mb in size
2. “en_US.news.txt” has 1,010,242 lines, 205,811,889 characters and is 197Mb in size
3. “en_US.twitter.txt” has 2,360,148 lines, 167,105,338 characters and is 159Mb in size
For simplicity, only the code for the blogs file is shown.
setwd("~/JHU_DataScience/capstone_project/final/en_US/")
system('wc -lc en_US.blogs.txt')   #line and character counts
system('du -ch en_US.blogs.txt')   #file size on disk
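If needed, the same counts can be captured into an R object rather than just printed to the console; a minimal sketch using system(..., intern = TRUE), assuming the working directory is set as above:
files <- c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt")
stats <- t(sapply(files, function(f) {
  out <- system(paste("wc -lc", f), intern=TRUE)        #e.g. " 899288 210160014 en_US.blogs.txt"
  as.numeric(strsplit(trimws(out), "\\s+")[[1]][1:2])   #keep line and character counts
}))
colnames(stats) <- c("lines","characters")
stats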
For exploratory purposes, I sample a subset of the data (about 20% of the lines in each file) by running a Perl one-liner from within R.
setwd("~/JHU_DataScience/capstone_project/final/en_US/")
system("perl -ne 'print if (rand() < .20)' en_US.blogs.txt > en_US.blogs.subset.txt")
system("perl -ne 'print if (rand() < .20)' en_US.news.txt > en_US.news.subset.txt")
system("perl -ne 'print if (rand() < .20)' en_US.twitter.txt > en_US.twitter.subset.txt")
A word cloud is a good way to explore the most frequent words in each file (blogs, news and Twitter). To do this, we first need to create a corpus and a term-document matrix.
Create corpus and generate term-document matrix
setwd("~/JHU_DataScience/capstone_project/final/en_US/")
library(tm)
if(!file.exists("subset")) {dir.create("subset")}
system("mv *.subset.txt ./subset/")
filepath<-"~/JHU_DataScience/capstone_project/final/en_US/subset/"
#Create corpus
en_US<-VCorpus(DirSource(filepath,encoding="UTF-8"),
readerControl=list(language="en_US"))
#Remove numbers and punctuation, convert to lower case, remove stop words and profanity, and strip whitespace
en_US<-tm_map(en_US,removeNumbers)
en_US<-tm_map(en_US,removePunctuation)
en_US<-tm_map(en_US,content_transformer(tolower))
en_US<-tm_map(en_US,removeWords,stopwords(kind = "en"))
profanity<-readLines("./profanity.txt")   #one profane word per line
en_US<-tm_map(en_US,removeWords,profanity)
en_US<-tm_map(en_US,stripWhitespace)
#Generate term-document matrix
termDocMat<-TermDocumentMatrix(en_US)
termDocMat<-as.matrix(termDocMat)
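Converting the full term-document matrix to a dense matrix with as.matrix() can use a lot of memory on larger samples; before doing so, its dimensions and frequent terms can be checked with tm helpers (a sketch; the frequency threshold is illustrative):
tdm <- TermDocumentMatrix(en_US)
dim(tdm)                                 #number of terms x number of documents (3 here)
head(findFreqTerms(tdm, lowfreq=1000))   #terms occurring at least 1000 times overall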
The first word cloud is for the blogs data. The most frequent words are “one”, “will”, “just”, “like”, “can” and “time”.
library(wordcloud)      #for wordcloud()
library(RColorBrewer)   #for brewer.pal()
termFreq<-sort(termDocMat[,1],decreasing=T)
termFreqDF<-data.frame(terms=names(termFreq),freq=termFreq)
wordcloud(termFreqDF$terms,termFreqDF$freq,max.words=100,scale=c(2.5,0.5),random.order = F,rot.per=0.45, use.r.layout=FALSE, colors=brewer.pal(8,"Dark2"))
The second word cloud is for the news data. The most common word is “said”. This makes sense, as news articles typically quote sources with sentences like “John Doe said something”.
termFreq<-sort(termDocMat[,2],decreasing=T)
termFreqDF<-data.frame(terms=names(termFreq),freq=termFreq)
wordcloud(termFreqDF$terms,termFreqDF$freq,max.words=100,scale=c(2.5,0.5),random.order = F,rot.per=0.45, use.r.layout=FALSE, colors=brewer.pal(8,"Dark2"))
The third word cloud is for the Twitter data. The most frequent words are “just”, “like”, “get”, “good”, “love”, “dont” and “great”. The word “lol” is also quite common.
termFreq<-sort(termDocMat[,3],decreasing=T)
termFreqDF<-data.frame(terms=names(termFreq),freq=termFreq)
wordcloud(termFreqDF$terms,termFreqDF$freq,max.words=100,scale=c(2.5,0.5),random.order = F,rot.per=0.45, use.r.layout=FALSE, colors=brewer.pal(8,"Dark2"))
Creating n-grams is a first step toward building predictive models. It gives a first look at which words are frequently used together. The word cloud analysis suggests that the blogs, news and Twitter data have different most frequent words (unigrams), so bi- and tri-grams are generated separately for each of them.
options(mc.cores=1)   #use a single core to avoid parallel issues between tm and RWeka
library(RWeka)
#Define uni-, bi- and tri-gram tokenizers and build a term-document matrix for each
UnigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 1))}
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
TrigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))}
tdm1 <- TermDocumentMatrix(en_US,control = list(tokenize=UnigramTokenizer))
tdm2 <- TermDocumentMatrix(en_US,control = list(tokenize=BigramTokenizer))
tdm3 <- TermDocumentMatrix(en_US,control = list(tokenize=TrigramTokenizer))
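The unigram matrix (tdm1) mirrors the word-cloud counts above; as a quick sanity check, the most frequent unigrams across all three documents can be listed (a sketch):
uniFreq <- sort(rowSums(as.matrix(tdm1)), decreasing=TRUE)
head(uniFreq, 20)   #20 most frequent unigrams across blogs, news and twitter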
Most frequent bi- and tri-grams in blogs
gram2Freq<-as.matrix(tdm2)[,1]
gram2FreqDF<-data.frame(bigrams = names(gram2Freq), freq=gram2Freq)
gram2FreqDF<-gram2FreqDF[order(-gram2FreqDF$freq),]
gram3Freq<-as.matrix(tdm3)[,1]
gram3FreqDF<-data.frame(trigrams = names(gram3Freq), freq=gram3Freq)
gram3FreqDF<-gram3FreqDF[order(-gram3FreqDF$freq),]
library(ggplot2)
source("~/bin/R_scripts/multiplot.R")
gram2FreqDF$bigrams<-factor(gram2FreqDF$bigrams,levels=gram2FreqDF$bigrams[order(-gram2FreqDF$freq)],ordered=T)
p1 <- ggplot(gram2FreqDF[1:20,],aes(x=bigrams,y=freq)) + geom_bar(stat="identity",fill="red") + xlab("Bigrams") + ylab("Frequency") + theme(axis.text.x = element_text(angle = 90, hjust = 1,size=10))
gram3FreqDF$trigrams<-factor(gram3FreqDF$trigrams,levels=gram3FreqDF$trigrams[order(-gram3FreqDF$freq)],ordered=T)
p2 <- ggplot(gram3FreqDF[1:20,],aes(x=trigrams,y=freq)) + geom_bar(stat="identity",fill="orange") + xlab("Trigrams") + ylab("Frequency") + theme(axis.text.x = element_text(angle = 90, hjust = 1,size=10))
multiplot(p1,p2)
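multiplot() comes from a local helper script; if that script is not available, gridExtra provides an equivalent layout (an alternative to what is used here):
library(gridExtra)
grid.arrange(p1, p2, ncol=1)   #stack the bigram and trigram plots vertically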
Most frequent bi- and tri-grams in news
The code is the same as for the blogs data, using the second column of the term-document matrices (tdm2 and tdm3).
Most frequent bi- and tri-grams in Twitter
The code is the same as for the blogs data, using the third column of the term-document matrices.
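Since the news and Twitter analyses only change the document column, the plotting code could be wrapped in a small helper; plotNgrams below is a hypothetical function sketched for illustration, not part of the code actually run:
plotNgrams <- function(tdm, docIndex, label, fill="red", topN=20) {
  freq <- sort(as.matrix(tdm)[,docIndex], decreasing=TRUE)[1:topN]
  df <- data.frame(ngram=factor(names(freq), levels=names(freq)), freq=freq)
  ggplot(df, aes(x=ngram, y=freq)) +
    geom_bar(stat="identity", fill=fill) +
    xlab(label) + ylab("Frequency") +
    theme(axis.text.x=element_text(angle=90, hjust=1, size=10))
}
#e.g. top bigrams and trigrams in the news document (column 2)
multiplot(plotNgrams(tdm2, 2, "Bigrams"), plotNgrams(tdm3, 2, "Trigrams", fill="orange"))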
The next steps will be to build a prediction model and a Shiny app around it. To achieve this, I will carefully consider these issues:
1. Pre-compute and save/load the term-document matrices, possibly outside of R using a Bash script or Python; exploratory analysis on 20% of the data is already very time consuming (see the sketch after this list).
2. Balance predictive performance against computing resources, especially the limitations of the Shiny server.
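For the first point, one option within R is to pre-compute the n-gram frequency tables once, serialize them, and have the Shiny app only load the results; a minimal sketch (file names are illustrative):
#Pre-compute once, offline
saveRDS(gram2FreqDF, "blogs_bigrams.rds")
saveRDS(gram3FreqDF, "blogs_trigrams.rds")
#In the Shiny app, load the saved table instead of rebuilding the corpus
gram2FreqDF <- readRDS("blogs_bigrams.rds")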