This is the milestone report for the Data Science Capstone project in the Coursera Data Science Specialization. The objectives of this report are to load the three given data sets, summarize them, and explore the frequency distributions of single words as well as 2-gram and 3-gram sequences.

0. Examine Data

Before loading the data, let's check the file sizes and line counts in a bash shell.

file sizes (bytes)
167105338 en_US.twitter.txt
205811889 en_US.news.txt
210160014 en_US.blogs.txt

line counts
899288 en_US.blogs.txt
1010242 en_US.news.txt
2360148 en_US.twitter.txt
4269678 total
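The same checks can also be done from within R, without loading the full files into memory first (a small optional sketch, assuming the three files sit in the working directory; note that the line-count step does read each file):

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
file.size(files)   # sizes in bytes, reported without reading the files
sapply(files, function(f) length(readLines(f, skipNul = TRUE)))   # line counts; this reads each file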

1. Load Data and necessary packages

setwd("D:/Capstone/final/en_US")
blog <- readLines("en_US.blogs.txt",skipNul = TRUE, warn = TRUE)
news <- readLines("en_US.news.txt",skipNul = TRUE, warn = TRUE)
## Warning in readLines("en_US.news.txt", skipNul = TRUE, warn = TRUE):
## incomplete final line found on 'en_US.news.txt'
twitter <- readLines("en_US.twitter.txt",skipNul = TRUE, warn = TRUE)

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
library(NLP)
## Warning: package 'NLP' was built under R version 3.5.2
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(tm)
## Warning: package 'tm' was built under R version 3.5.3
library(RWeka)
## Warning: package 'RWeka' was built under R version 3.5.3

2. Data Sampling

Because these data sets are huge, we need to work with a sample subset for this project. From each file, I randomly select 1,000 entries as my data source and then remove the original data to free memory.

set.seed(100)
sample_size = 1000

sample_blog <- blog[sample(1:length(blog),sample_size)]
sample_news <- news[sample(1:length(news),sample_size)]
sample_twitter <- twitter[sample(1:length(twitter),sample_size)]

Examining the first few lines of each data set:

head(sample_blog)
## [1] "So. Jeff has a talk with the monkey and tries to explain to him that he needed to have courage to eat the zucchini. He needed to look at it like Super Man would look at kryptonite and ATTACK the zucchini! The boy took a couple of quick breaths and ran back into the kitchen, determined to beat the dreaded green yuck. A few minutes later he came out, triumphant! Good job, monkey! You did it!"
## [2] "or feel free to email me at"                                                                                                                                                                                                                                                                                                                                                                             
## [3] "Part 1: My Very First Competition"                                                                                                                                                                                                                                                                                                                                                                       
## [4] "4.Tribute My Ass"                                                                                                                                                                                                                                                                                                                                                                                        
## [5] "Something Shiny Syndrome."                                                                                                                                                                                                                                                                                                                                                                               
## [6] "Stop mumbling. (any suggestions from my speech-pathologist pals?) Yeah, didn't do so well with this one. Maybe I need some therapy or shock treatments."
head(sample_twitter)
## [1] "because Scott Walker is a lying ASS"                                                        
## [2] "Banana Republic didn't have what I wanted, so I tried God-Forsaken Hellhole."               
## [3] "LGBT Civil Rights March/Rally DC. Check facebook messages. Thx Wanda! Woot!"                
## [4] "hey where's you get your face? The toilet store?"                                           
## [5] "Regardless of the final score, this team has proven their worth. I'm crying. What a game!!!"
## [6] "Dylan, Nathan, ean, & Anthony comin over. :)"
head(sample_news)
## [1] "(916) 985-2675"                                                                                                                                  
## [2] "Anyone who stands in line for Social Security disability benefits learns certain truths. The system is slow. It's wasteful.And it's often cruel."
## [3] "The subpoena comes ahead of a hearing next week in which Bernanke is scheduled to testify."                                                      
## [4] "• Jesse Reese, 147-yard seventh hole at Morgan Creek, 3-hybrid"                                                                                 
## [5] "\"Make sure that she stays hydrated,\" I texted from the corner of our New York newsroom. \"Maybe some ginger ale. Is it bad diary?\""           
## [6] "\"Obviously Iâ\200\231m glad to hear theyâ\200\231re not pursuing this,\" he said."

Then combine the three samples and remove the originals:

sample_data<-rbind(sample_blog,sample_news,sample_twitter)
rm(blog,news,twitter)
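Note that rm() only removes the bindings; an explicit gc() call (an optional extra step, not part of the original script) asks R to actually reclaim the memory held by the full data sets:

gc()   # trigger garbage collection so the memory used by the full files is released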

3. Clean Data

I clean the data with the following rules: (1) remove punctuation; (2) collapse extra whitespace; (3) discard numbers, since they are irrelevant to our analysis; (4) convert everything to lowercase; and (5) replace the characters /, @, and | with spaces.

Clean the data using tm_map:

mycorpus<-VCorpus(VectorSource(sample_data))
mycorpus <- tm_map(mycorpus, content_transformer(tolower)) # convert to lowercase
mycorpus <- tm_map(mycorpus, removePunctuation) # remove punctuation
mycorpus <- tm_map(mycorpus, removeNumbers) # remove numbers
mycorpus <- tm_map(mycorpus, stripWhitespace) # remove multiple whitespace
changetospace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
mycorpus <- tm_map(mycorpus, changetospace, "/|@|\\|")
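As a quick sanity check (an optional step; document 1 is just an arbitrary example), one can compare a raw sampled line with its cleaned counterpart:

sample_data[1]                 # original text of the first sampled document
as.character(mycorpus[[1]])    # the same document after the tm_map transformations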

4. Tokenize the sentences

We use NGramTokenizer from the RWeka package for this task. In this project we analyze 1-grams, 2-grams, and 3-grams; the resulting term-document matrices are named "oneGM", "twoGM", and "threeGM", respectively.

uniGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
biGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
triGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
oneGM <- TermDocumentMatrix(mycorpus, control = list(tokenize = uniGramTokenizer))
twoGM <- TermDocumentMatrix(mycorpus, control = list(tokenize = biGramTokenizer))
threeGM <- TermDocumentMatrix(mycorpus, control = list(tokenize = triGramTokenizer))
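Before plotting, a quick look at the matrix dimensions (an optional check) confirms that each tokenizer produced a term-document matrix with one row per distinct n-gram and one column per document:

dim(oneGM)     # distinct unigrams x documents
dim(twoGM)     # distinct bigrams x documents
dim(threeGM)   # distinct trigrams x documents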

5. Generate n-gram histograms

Unigram frequency

freqTerms <- findFreqTerms(oneGM, lowfreq = 200)
termFreq <- rowSums(as.matrix(oneGM[freqTerms,]))
termFreq <- data.frame(unigram=names(termFreq), frequency=termFreq)

g1 <- ggplot(termFreq, aes(x=reorder(unigram, frequency), y=frequency)) +
    geom_bar(stat = "identity") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Unigram") + ylab("Frequency") +
    labs(title = "Top unigrams by frequency")
print(g1)

Bigram frequency

freqTerms <- findFreqTerms(twoGM, lowfreq = 70)
termFreq <- rowSums(as.matrix(twoGM[freqTerms,]))
termFreq <- data.frame(bigram=names(termFreq), frequency=termFreq)

g2 <- ggplot(termFreq, aes(x=reorder(bigram, frequency), y=frequency)) +
    geom_bar(stat = "identity") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Bigram") + ylab("Frequency") +
    labs(title = "Top bigrams by frequency")
print(g2)

Trigram frequency

freqTerms <- findFreqTerms(threeGM, lowfreq = 10)
termFreq <- rowSums(as.matrix(threeGM[freqTerms,]))
termFreq <- data.frame(trigram=names(termFreq), frequency=termFreq)

g3 <- ggplot(termFreq, aes(x=reorder(trigram, frequency), y=frequency)) +
    geom_bar(stat = "identity") +  coord_flip() +
    theme(legend.title=element_blank()) +
    xlab("Trigram") + ylab("Frequency") +
    labs(title = "Top trigrams by frequency")
print(g3)
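The three plotting blocks above differ only in the term-document matrix, the frequency cutoff, and the axis label, so they could be folded into a small helper function (an optional refactor sketch, not part of the original analysis):

plot_ngram_freq <- function(tdm, lowfreq, label) {
    terms <- findFreqTerms(tdm, lowfreq = lowfreq)      # n-grams at or above the cutoff
    freq  <- rowSums(as.matrix(tdm[terms, ]))           # total count of each n-gram
    df    <- data.frame(ngram = names(freq), frequency = freq)
    ggplot(df, aes(x = reorder(ngram, frequency), y = frequency)) +
        geom_bar(stat = "identity") + coord_flip() +
        xlab(label) + ylab("Frequency") +
        labs(title = paste0("Top ", tolower(label), "s by frequency"))
}
print(plot_ngram_freq(twoGM, 70, "Bigram"))   # reproduces the bigram plot above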