This is the milestone report for the Data Science Capstone project from the Coursera Data Science Specialization. The objectives of this report are to load the three given data sets, summarize them, and explore the frequency distribution of single words (unigrams), 2-grams, and 3-grams.
Before loading the data, let’s check the file sizes and line counts from a bash shell.
file sizes (bytes)
167105338 en_US.twitter.txt
205811889 en_US.news.txt
210160014 en_US.blogs.txt
line counts
899288 en_US.blogs.txt
1010242 en_US.news.txt
2360148 en_US.twitter.txt
4269678 total
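The same kind of summary can also be produced from within R. The following is a minimal sketch, not the commands used for the numbers above: `summarize_file` is a hypothetical helper name, and it reads the whole file into memory, so it will be slower than the shell tools on files of this size.

```r
# Hypothetical helper: size in bytes plus line and word counts for one text file.
summarize_file <- function(path) {
  lines <- readLines(path, skipNul = TRUE, warn = FALSE)
  # Split every line on runs of whitespace and keep only the non-empty pieces.
  pieces <- unlist(strsplit(lines, "\\s+"))
  data.frame(
    file  = basename(path),
    bytes = file.size(path),
    lines = length(lines),
    words = sum(nzchar(pieces)),
    stringsAsFactors = FALSE
  )
}

# Usage on a small temporary file (a stand-in for the real data files).
tmp <- tempfile(fileext = ".txt")
writeLines(c("hello world", "", "one two three"), tmp)
res <- summarize_file(tmp)
print(res)
```

Applying the helper to each of the three files and combining the rows with `rbind()` would give one summary table for the whole data set.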
setwd("D:/Capstone/final/en_US")
blog <- readLines("en_US.blogs.txt",skipNul = TRUE, warn = TRUE)
news <- readLines("en_US.news.txt",skipNul = TRUE, warn = TRUE)
## Warning in readLines("en_US.news.txt", skipNul = TRUE, warn = TRUE):
## incomplete final line found on 'en_US.news.txt'
twitter <- readLines("en_US.twitter.txt",skipNul = TRUE, warn = TRUE)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
library(NLP)
## Warning: package 'NLP' was built under R version 3.5.2
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(tm)
## Warning: package 'tm' was built under R version 3.5.3
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.5.3
Because these data sets are huge, we work with a sample subset for this project. From each file, I randomly select 1000 entries as the data source; the original data are deleted afterwards to free memory.
set.seed(100)
sample_size = 1000
sample_blog <- blog[sample(1:length(blog),sample_size)]
sample_news <- news[sample(1:length(news),sample_size)]
sample_twitter <- twitter[sample(1:length(twitter),sample_size)]
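The three sampling calls follow the same pattern, which could be wrapped in a small helper. This is only a sketch: `sample_lines` is a hypothetical name, and capping the sample at the vector length is an added safeguard that the code above does not have. The toy vector stands in for the real data.

```r
# Hypothetical helper: draw up to `n` entries from `x` without replacement.
sample_lines <- function(x, n) {
  n <- min(n, length(x))        # assumption: cap n so sample.int() cannot fail on short inputs
  x[sample.int(length(x), n)]   # sample.int() also behaves sensibly when x is empty
}

toy <- paste("line", 1:20)

set.seed(100)
first_draw <- sample_lines(toy, 5)

set.seed(100)
second_draw <- sample_lines(toy, 5)

identical(first_draw, second_draw)  # TRUE: the same seed reproduces the same sample
```

Setting the seed immediately before sampling is what makes the report reproducible; changing the seed or the order of the calls gives a different subset.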
Examining the first few lines of each sample:
head(sample_blog)
## [1] "So. Jeff has a talk with the monkey and tries to explain to him that he needed to have courage to eat the zucchini. He needed to look at it like Super Man would look at kryptonite and ATTACK the zucchini! The boy took a couple of quick breaths and ran back into the kitchen, determined to beat the dreaded green yuck. A few minutes later he came out, triumphant! Good job, monkey! You did it!"
## [2] "or feel free to email me at"
## [3] "Part 1: My Very First Competition"
## [4] "4.Tribute My Ass"
## [5] "Something Shiny Syndrome."
## [6] "Stop mumbling. (any suggestions from my speech-pathologist pals?) Yeah, didn't do so well with this one. Maybe I need some therapy or shock treatments."
head(sample_twitter)
## [1] "because Scott Walker is a lying ASS"
## [2] "Banana Republic didn't have what I wanted, so I tried God-Forsaken Hellhole."
## [3] "LGBT Civil Rights March/Rally DC. Check facebook messages. Thx Wanda! Woot!"
## [4] "hey where's you get your face? The toilet store?"
## [5] "Regardless of the final score, this team has proven their worth. I'm crying. What a game!!!"
## [6] "Dylan, Nathan, ean, & Anthony comin over. :)"
head(sample_news)
## [1] "(916) 985-2675"
## [2] "Anyone who stands in line for Social Security disability benefits learns certain truths. The system is slow. It's wasteful.And it's often cruel."
## [3] "The subpoena comes ahead of a hearing next week in which Bernanke is scheduled to testify."
## [4] "Jesse Reese, 147-yard seventh hole at Morgan Creek, 3-hybrid"
## [5] "\"Make sure that she stays hydrated,\" I texted from the corner of our New York newsroom. \"Maybe some ginger ale. Is it bad diary?\""
## [6] "\"Obviously I’m glad to hear they’re not pursuing this,\" he said."
Then combine all three samples and remove the originals:
sample_data <- c(sample_blog, sample_news, sample_twitter) # one character vector; rbind() would build a 3 x 1000 matrix
rm(blog,news,twitter)
I clean the data with the following rules:
1. remove punctuation
2. collapse extra whitespace
3. discard numbers, since they are irrelevant to our analysis
4. convert everything to lowercase
Clean the data using tm_map:
mycorpus <- VCorpus(VectorSource(sample_data))
changetospace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
mycorpus <- tm_map(mycorpus, changetospace, "/|@|\\|") # replace /, @ and | with spaces before punctuation is removed
mycorpus <- tm_map(mycorpus, content_transformer(tolower)) # convert to lowercase
mycorpus <- tm_map(mycorpus, removePunctuation) # remove punctuation
mycorpus <- tm_map(mycorpus, removeNumbers) # remove numbers
mycorpus <- tm_map(mycorpus, stripWhitespace) # collapse multiple whitespace, done last
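To see what these rules do to a single string, here is a base-R sketch that does not need tm. `clean_text` is a hypothetical helper, not part of the pipeline above. It replaces the separator characters first, so that "and/or" becomes two words, and it also trims leading and trailing spaces, which is an added assumption.

```r
# Hypothetical base-R version of the cleaning rules, applied to a character vector.
clean_text <- function(x) {
  x <- gsub("/|@|\\|", " ", x)       # separators to spaces, before punctuation is dropped
  x <- tolower(x)                     # convert to lowercase
  x <- gsub("[[:punct:]]+", "", x)    # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)    # remove numbers
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
  trimws(x)                           # assumption: also trim the ends
}

cleaned <- clean_text("Email me@home, and/or call 555-1234 NOW!")
print(cleaned)
# "email me home and or call now"
```

The order matters: if punctuation were removed first, "and/or" would be glued together as "andor", and collapsing whitespace before removing numbers would leave double spaces behind.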
To build the n-gram matrices, we use NGramTokenizer from the RWeka package. In this project, we analyze 1-grams, 2-grams, and 3-grams; the corresponding term-document matrices are called “oneGM”, “twoGM”, and “threeGM”, respectively.
# Tokenizers that emit only n-grams of one fixed size
uniGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
biGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
triGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Term-document matrices: one row per n-gram, one column per document
oneGM <- TermDocumentMatrix(mycorpus, control = list(tokenize = uniGramTokenizer))
twoGM <- TermDocumentMatrix(mycorpus, control = list(tokenize = biGramTokenizer))
threeGM <- TermDocumentMatrix(mycorpus, control = list(tokenize = triGramTokenizer))
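For readers unfamiliar with n-grams, the idea can be sketched in a few lines of base R. This is an illustration of the concept only and not how RWeka implements it; `make_ngrams` is a hypothetical helper that slides a window of `n` words over one cleaned string.

```r
# Hypothetical helper: build word n-grams from one cleaned string.
make_ngrams <- function(text, n) {
  tokens <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  if (length(tokens) < n) return(character(0))   # too short to form any n-gram
  starts <- seq_len(length(tokens) - n + 1)      # first word of each window
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- make_ngrams("thanks for the follow", 2)
print(bigrams)
# "thanks for" "for the" "the follow"
```

A string of k words yields k - n + 1 n-grams, which is why the trigram matrix has far fewer repeated terms than the unigram matrix for the same sample.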
Unigram frequency
freqTerms <- findFreqTerms(oneGM, lowfreq = 200)
termFreq <- rowSums(as.matrix(oneGM[freqTerms,]))
termFreq <- data.frame(unigram=names(termFreq), frequency=termFreq)
g1 <- ggplot(termFreq, aes(x=reorder(unigram, frequency), y=frequency)) +
  geom_bar(stat = "identity") + coord_flip() +
  theme(legend.title=element_blank()) +
  xlab("Unigram") + ylab("Frequency") +
  labs(title = "Top unigrams by frequency")
print(g1)
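The frequency table behind such a plot can be illustrated on a toy example with base R alone. This is a sketch under stated assumptions: the `tokens` vector is made up and stands in for the tokenized corpus, and the report itself computes the counts from the term-document matrix with `rowSums(as.matrix(...))`. That conversion builds a dense matrix, so with a much larger sample a sparse row sum such as `slam::row_sums` may be needed instead.

```r
# Toy input: a handful of unigrams standing in for the tokenized corpus.
tokens <- c("the", "to", "the", "and", "the", "to", "a")

# Count each distinct token and sort from most to least frequent.
counts <- sort(table(tokens), decreasing = TRUE)

top <- data.frame(
  unigram   = names(counts),
  frequency = as.integer(counts),
  stringsAsFactors = FALSE
)
print(head(top, 2))   # the two most frequent tokens: "the" (3) and "to" (2)
```

A data frame of this shape, one row per term with its frequency, is what the ggplot call above takes as input.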
Bigram frequency
freqTerms <- findFreqTerms(twoGM, lowfreq = 70)
termFreq <- rowSums(as.matrix(twoGM[freqTerms,]))
termFreq <- data.frame(bigram=names(termFreq), frequency=termFreq)
g2 <- ggplot(termFreq, aes(x=reorder(bigram, frequency), y=frequency)) +
  geom_bar(stat = "identity") + coord_flip() +
  theme(legend.title=element_blank()) +
  xlab("Bigram") + ylab("Frequency") +
  labs(title = "Top bigrams by frequency")
print(g2)
Trigram frequency
freqTerms <- findFreqTerms(threeGM, lowfreq = 10)
termFreq <- rowSums(as.matrix(threeGM[freqTerms,]))
termFreq <- data.frame(trigram=names(termFreq), frequency=termFreq)
g3 <- ggplot(termFreq, aes(x=reorder(trigram, frequency), y=frequency)) +
  geom_bar(stat = "identity") + coord_flip() +
  theme(legend.title=element_blank()) +
  xlab("Trigram") + ylab("Frequency") +
  labs(title = "Top trigrams by frequency")
print(g3)