The aims of this milestone report are to:
1. Download and load the data
2. Create a basic report of summary statistics about the data sets
3. Report any interesting findings
4. Get feedback
Running Linux commands (wc and du) from within R, we can see that:
1. “en_US.blogs.txt” has 899,288 lines, 210,160,014 characters and is 200Mb in size
2. “en_US.news.txt” has 1,010,242 lines, 205,811,889 characters and is 197Mb in size
3. “en_US.twitter.txt” has 2,360,148 lines, 167,105,338 characters and is 159Mb in size
For simplicity, only the code for the blogs file is shown.
setwd("~/JHU_DataScience/capstone_project/final/en_US/")
system('wc -lc en_US.blogs.txt')   #line and character counts
system('du -ch en_US.blogs.txt')   #file size on disk
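If needed, the same counts can be captured into an R object rather than just printed to the console; a minimal sketch using system(..., intern = TRUE), assuming the working directory is set as above:
files <- c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt")
stats <- t(sapply(files, function(f) {
  out <- system(paste("wc -lc", f), intern=TRUE)        #e.g. " 899288 210160014 en_US.blogs.txt"
  as.numeric(strsplit(trimws(out), "\\s+")[[1]][1:2])   #keep line and character counts
}))
colnames(stats) <- c("lines","characters")
stats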
For exploratory purposes, I sample a subset of the data (about 20% of the lines in each file) by running a Perl one-liner from within R.
setwd("~/JHU_DataScience/capstone_project/final/en_US/")
system("perl -ne 'print if (rand() < .20)' en_US.blogs.txt > en_US.blogs.subset.txt")
system("perl -ne 'print if (rand() < .20)' en_US.news.txt > en_US.news.subset.txt")
system("perl -ne 'print if (rand() < .20)' en_US.twitter.txt > en_US.twitter.subset.txt")
A word cloud is a good way to explore the most frequent words in each file (blogs, news and Twitter). To do this, we first need to create a corpus and a term-document matrix.
Create corpus and generate term-document matrix
setwd("~/JHU_DataScience/capstone_project/final/en_US/")
library(tm)
if(!file.exists("subset")) {dir.create("subset")}
system("mv *.subset.txt ./subset/")
filepath<-"~/JHU_DataScience/capstone_project/final/en_US/subset/"
#Create corpus
en_US<-VCorpus(DirSource(filepath,encoding="UTF-8"),
readerControl=list(language="en_US"))
#Remove numbers and punctuation, convert to lower case, remove stop words and profanity, and strip whitespace
en_US<-tm_map(en_US,removeNumbers)
en_US<-tm_map(en_US,removePunctuation)
en_US<-tm_map(en_US,content_transformer(tolower))
en_US<-tm_map(en_US,removeWords,stopwords(kind = "en"))
profanity<-readLines("./profanity.txt")   #one profane word per line
en_US<-tm_map(en_US,removeWords,profanity)
en_US<-tm_map(en_US,stripWhitespace)
#Generate term-document matrix
termDocMat<-TermDocumentMatrix(en_US)
termDocMat<-as.matrix(termDocMat)
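Converting the full term-document matrix to a dense matrix with as.matrix() can use a lot of memory on larger samples; before doing so, its dimensions and frequent terms can be checked with tm helpers (a sketch; the frequency threshold is illustrative):
tdm <- TermDocumentMatrix(en_US)
dim(tdm)                                 #number of terms x number of documents (3 here)
head(findFreqTerms(tdm, lowfreq=1000))   #terms occurring at least 1000 times overall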
The first word cloud is for the blogs data. The most frequent words are “one”, “will”, “just”, “like”, “can” and “time”.
library(wordcloud)      #for wordcloud()
library(RColorBrewer)   #for brewer.pal()
termFreq<-sort(termDocMat[,1],decreasing=T)
termFreqDF<-data.frame(terms=names(termFreq),freq=termFreq)
wordcloud(termFreqDF$terms,termFreqDF$freq,max.words=100,scale=c(2.5,0.5),random.order = F,rot.per=0.45, use.r.layout=FALSE, colors=brewer.pal(8,"Dark2"))
The second word cloud is for the news data. The most common word is “said”. This makes sense, as news articles typically quote sources with sentences like “John Doe said something”.
termFreq<-sort(termDocMat[,2],decreasing=T)
termFreqDF<-data.frame(terms=names(termFreq),freq=termFreq)
wordcloud(termFreqDF$terms,termFreqDF$freq,max.words=100,scale=c(2.5,0.5),random.order = F,rot.per=0.45, use.r.layout=FALSE, colors=brewer.pal(8,"Dark2"))
The third word cloud is for the Twitter data. The most frequent words are “just”, “like”, “get”, “good”, “love”, “dont” and “great”. The word “lol” is also quite common.
termFreq<-sort(termDocMat[,3],decreasing=T)
termFreqDF<-data.frame(terms=names(termFreq),freq=termFreq)
wordcloud(termFreqDF$terms,termFreqDF$freq,max.words=100,scale=c(2.5,0.5),random.order = F,rot.per=0.45, use.r.layout=FALSE, colors=brewer.pal(8,"Dark2"))
Creating n-grams is a first step toward building predictive models. It gives a first look at which words are frequently used together. The word cloud analysis suggests that the blogs, news and Twitter data have different most frequent words (unigrams), so bi- and tri-grams are generated separately for each of them.
options(mc.cores=1)   #use a single core to avoid parallel issues between tm and RWeka
library(RWeka)
#Define uni-, bi- and tri-gram tokenizers and build a term-document matrix for each
UnigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 1))}
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
TrigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))}
tdm1 <- TermDocumentMatrix(en_US,control = list(tokenize=UnigramTokenizer))
tdm2 <- TermDocumentMatrix(en_US,control = list(tokenize=BigramTokenizer))
tdm3 <- TermDocumentMatrix(en_US,control = list(tokenize=TrigramTokenizer))
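The unigram matrix (tdm1) mirrors the word-cloud counts above; as a quick sanity check, the most frequent unigrams across all three documents can be listed (a sketch):
uniFreq <- sort(rowSums(as.matrix(tdm1)), decreasing=TRUE)
head(uniFreq, 20)   #20 most frequent unigrams across blogs, news and twitter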
Most frequent bi- and tri-grams in blogs
gram2Freq<-as.matrix(tdm2)[,1]
gram2FreqDF<-data.frame(bigrams = names(gram2Freq), freq=gram2Freq)
gram2FreqDF<-gram2FreqDF[order(-gram2FreqDF$freq),]
gram3Freq<-as.matrix(tdm3)[,1]
gram3FreqDF<-data.frame(trigrams = names(gram3Freq), freq=gram3Freq)
gram3FreqDF<-gram3FreqDF[order(-gram3FreqDF$freq),]
library(ggplot2)
source("~/bin/R_scripts/multiplot.R")
gram2FreqDF$bigrams<-factor(gram2FreqDF$bigrams,levels=gram2FreqDF$bigrams[order(-gram2FreqDF$freq)],ordered=T)
p1 <- ggplot(gram2FreqDF[1:20,],aes(x=bigrams,y=freq)) + geom_bar(stat="identity",fill="red") + xlab("Bigrams") + ylab("Frequency") + theme(axis.text.x = element_text(angle = 90, hjust = 1,size=10))
gram3FreqDF$trigrams<-factor(gram3FreqDF$trigrams,levels=gram3FreqDF$trigrams[order(-gram3FreqDF$freq)],ordered=T)
p2 <- ggplot(gram3FreqDF[1:20,],aes(x=trigrams,y=freq)) + geom_bar(stat="identity",fill="orange") + xlab("Trigrams") + ylab("Frequency") + theme(axis.text.x = element_text(angle = 90, hjust = 1,size=10))
multiplot(p1,p2)
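multiplot() comes from a local helper script; if that script is not available, gridExtra provides an equivalent layout (an alternative to what is used here):
library(gridExtra)
grid.arrange(p1, p2, ncol=1)   #stack the bigram and trigram plots vertically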
Most frequent bi- and tri-grams in news
The code is the same as for the blogs data, using the second column of the term-document matrices (tdm2 and tdm3).
Most frequent bi- and tri-grams in Twitter
The code is the same as for the blogs data, using the third column of the term-document matrices.
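Since the news and Twitter analyses only change the document column, the plotting code could be wrapped in a small helper; plotNgrams below is a hypothetical function sketched for illustration, not part of the code actually run:
plotNgrams <- function(tdm, docIndex, label, fill="red", topN=20) {
  freq <- sort(as.matrix(tdm)[,docIndex], decreasing=TRUE)[1:topN]
  df <- data.frame(ngram=factor(names(freq), levels=names(freq)), freq=freq)
  ggplot(df, aes(x=ngram, y=freq)) +
    geom_bar(stat="identity", fill=fill) +
    xlab(label) + ylab("Frequency") +
    theme(axis.text.x=element_text(angle=90, hjust=1, size=10))
}
#e.g. top bigrams and trigrams in the news document (column 2)
multiplot(plotNgrams(tdm2, 2, "Bigrams"), plotNgrams(tdm3, 2, "Trigrams", fill="orange"))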
The next steps will be to build a prediction model and a Shiny app around it. To achieve this, I will carefully consider these issues:
1. Pre-compute and save/load the term-document matrices, possibly outside of R using a Bash script or Python; exploratory analysis on 20% of the data is already very time consuming (see the sketch after this list).
2. Balance predictive performance against computing resources, especially the limitations of the Shiny server.
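For the first point, one option within R is to pre-compute the n-gram frequency tables once, serialize them, and have the Shiny app only load the results; a minimal sketch (file names are illustrative):
#Pre-compute once, offline
saveRDS(gram2FreqDF, "blogs_bigrams.rds")
saveRDS(gram3FreqDF, "blogs_trigrams.rds")
#In the Shiny app, load the saved table instead of rebuilding the corpus
gram2FreqDF <- readRDS("blogs_bigrams.rds")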