Coursera Data Science Capstone (Week 2)

The goal of this capstone is to mimic the experience of being a data scientist by applying the data science techniques learned across all nine specialization courses to create a data product and presentation for SwiftKey.

For Week 2, the main objectives are to build a sample corpus, construct the 2-gram and 3-gram term-document matrices, and perform exploratory analysis on the words. The data can be downloaded from

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Three working files are extracted from the zip file (a download/extraction sketch follows the list):

  1. “en_US.blogs.txt”
  2. “en_US.news.txt”
  3. “en_US.twitter.txt”
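
For reproducibility, the download and extraction step can also be scripted. A minimal sketch is shown below; it assumes the zip file is saved into the working directory, and the local file name is an assumption.

#Download and unzip the dataset (sketch only)
zip.url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip.file <- "Coursera-SwiftKey.zip"
if (!file.exists(zip.file)) download.file(zip.url, zip.file, mode = "wb")
unzip(zip.file)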

Data Preparation

The libraries chosen to begin with are listed below:

library(NLP)
library(tm)
library(stringi)
library(RWeka)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
options(mc.cores=1)

As the original data files (blogs, news and Twitter) are extremely large, a smaller sample is used to study the data. 10% of the lines of each file (blogs, news and Twitter) are sampled to create the sample corpus, as sketched below.
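
A minimal sampling sketch is shown below. The input/output file paths and the rbinom-based line selection are assumptions; any approach that keeps roughly 10% of the lines of each file would do, and set.seed keeps the sample reproducible.

#Sample ~10% of the lines of each file into the Sample folder
set.seed(1234)
sampleFile <- function(infile, outfile, rate = 0.1) {
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  keep <- rbinom(length(lines), 1, rate) == 1
  writeLines(lines[keep], outfile)
}
sampleFile("en_US.blogs.txt", "Sample/sample.blogs.txt")
sampleFile("en_US.news.txt", "Sample/sample.news.txt")
sampleFile("en_US.twitter.txt", "Sample/sample.twitter.txt")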

The corpus is then generated from the sample files.

#Create the corpus from the sample data
corpus.folder <- "C:/Users/ShockShockWest/Documents/My Project/Capstone/Sample"
corpus <- VCorpus(DirSource(corpus.folder, encoding = "UTF-8"))
#Load the list of profanity words to be filtered out later
profanity <- readLines("C:/Users/ShockShockWest/Documents/My Project/Capstone/profanity.csv")

The summary of the newly created sample corpus is shown below:

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 21797232
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 1574699
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 18657729

Corpus Transformation

As observed, the corpus contains numerous characters, words, numbers and punctuation marks that are not relevant to the prediction exercise. Therefore, a few functions are created to transform/clean the corpus before the actual analysis is performed. The transformation is performed with tm_map and cleans the following:

  1. Web URLs
  2. Punctuation and apostrophes
  3. Non-ASCII characters
  4. Repeated letters
  5. Upper-case characters (converted to lower case)
  6. Numbers
  7. Profanity
  8. Common English stop words

#Create functions to transform the data
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)              #strip web URLs
removeSign <- function(x) gsub("[[:punct:]]", "", x)                  #strip punctuation
removeNum <- function(x) gsub("[[:digit:]]", "", x)                   #strip digits
removeapo <- function(x) gsub("'", "", x)                             #strip apostrophes
removeNonASCII <- function(x) iconv(x, "latin1", "ASCII", sub = "")   #drop non-ASCII characters
removerepeat <- function(x) gsub("([[:alpha:]])\\1{2,}", "\\1\\1", x) #collapse letters repeated 3+ times to 2
toLowerCase <- function(x) sapply(x, tolower)                         #convert to lower case
removeSpace <- function(x) gsub("\\s+", " ", x)                       #collapse multiple spaces
removeTh <- function(x) gsub(" th", "", x)                            #remove occurrences of " th"

#Transform the corpus
corpus <- tm_map(corpus, content_transformer(removeapo))      #remove apostrophes
corpus <- tm_map(corpus, content_transformer(removeNum))      #remove numbers
corpus <- tm_map(corpus, content_transformer(removeURL))      #remove web URLs
corpus <- tm_map(corpus, content_transformer(removeSign))     #remove punctuation
corpus <- tm_map(corpus, content_transformer(removeNonASCII)) #remove non-ASCII characters
corpus <- tm_map(corpus, content_transformer(toLowerCase))    #convert upper case to lower case
corpus <- tm_map(corpus, content_transformer(removerepeat))   #collapse repeated letters within words
corpus <- tm_map(corpus, content_transformer(removeSpace))    #remove multiple spaces
corpus <- tm_map(corpus, removeWords, stopwords("english"))   #remove common English stop words
corpus <- tm_map(corpus, removeWords, profanity)              #remove profanity
corpus <- tm_map(corpus, content_transformer(removeTh))       #remove occurrences of " th" left after cleaning

The summary of the sample corpus after transformation/cleaning is shown below:

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 16581419
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 1265733
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 14970049

Build Term Document Matrix

Now that the sample corpus is ready, it is tokenized (using NGramTokenizer from RWeka for the bigrams and trigrams) into three categories: unigram, bigram and trigram, to further analyze the word frequencies.

1-Gram

A 1-gram (unigram) is a single word from the corpus.

dtm <- TermDocumentMatrix(corpus)
wordMatrix <- as.data.frame(as.matrix(dtm))
v <- sort(rowSums(wordMatrix), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
plotd <- d[1:20, ]

Top 10 unigrams

##      word  freq
## just just 24237
## will will 22683
## like like 21703
## one   one 20525
## get   get 19489
## can   can 17778
## time time 16048
## now   now 15958
## love love 14789
## day   day 14788

2-Gram

A 2-gram (bigram) is a contiguous sequence of two words from the corpus.

bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigram))
wordMatrix2 <- as.data.frame(as.matrix(dtm2))
v2 <- sort(rowSums(wordMatrix2), decreasing = TRUE)
d2 <- data.frame(word = names(v2), freq = v2)
plotd2 <- d2[1:20, ]

Top 10 bigrams

##                            word freq
## im sure                 im sure 1822
## right now             right now 1811
## last night           last night 1804
## cant wait             cant wait 1631
## looking forward looking forward 1615
## dont know             dont know 1302
## feel like             feel like 1261
## next week             next week 1197
## mister rogers     mister rogers 1170
## im going               im going 1154

3-Gram

A 3-gram (trigram) is a contiguous sequence of three words from the corpus.

trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm3 <- TermDocumentMatrix(corpus, control = list(tokenize = trigram))
wordMatrix3 <- as.data.frame(as.matrix(dtm3))
v3 <- sort(rowSums(wordMatrix3), decreasing = TRUE)
d3 <- data.frame(word = names(v3), freq = v3)
plotd3 <- d3[1:20, ]

Top 10 trigrams

##                          word freq
## boy big sword   boy big sword  468
## little boy big little boy big  468
## new york city   new york city  373
## let us know       let us know  348
## im pretty sure im pretty sure  333
## cant wait see   cant wait see  293
## im sure will     im sure will  274
## id love tell     id love tell  246
## u know clap       u know clap  246
## go night night go night night  242

Generate Word Clouds and ggplot2 Plots

Word clouds and ggplot2 plots are generated to better illustrate the word frequencies in each n-gram category. The top 100 terms of each category are shown in the word clouds and the top 20 in the plots; a sketch of the plotting code for the unigrams is included below.

1-gram Word Cloud and Plot
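
Below is a minimal sketch of how the 1-gram word cloud and plot could be produced from the d and plotd data frames built earlier. The colour palette, axis labels and plot title are assumptions; the equivalent plots for the bigrams and trigrams use d2/plotd2 and d3/plotd3.

#Word cloud of the top 100 unigrams
wordcloud(words = d$word, freq = d$freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))

#Bar plot of the top 20 unigrams
ggplot(plotd, aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  labs(x = "Word", y = "Frequency", title = "Top 20 Unigrams") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))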

2-gram Word Cloud and Plot

3-gram Word Cloud and Plot

What’s Next?

  1. Create a larger corpus from the original blogs, news and Twitter data and tokenize it (2-gram and 3-gram). The corpus could be split into multiple parts, with the 2-gram and 3-gram matrices built independently and then recombined into one.
  2. Create a prediction algorithm that compares the user's input against the 2-gram and 3-gram matrices (a minimal lookup sketch is shown after this list).
  3. Optimize the code for faster processing.
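
As a starting point for item 2, the sketch below illustrates the intended lookup against the bigram and trigram frequency tables (d2 and d3) built above. The function name predictNextWord, the regular-expression matching on the term strings and the simple trigram-then-bigram back-off are assumptions for illustration, not the final algorithm.

#Suggest the next word from the n-gram frequency tables (illustrative sketch)
predictNextWord <- function(input, d2, d3, n = 3) {
  words <- tolower(unlist(strsplit(trimws(input), "\\s+")))
  k <- length(words)
  if (k >= 2) {
    #try the trigram table first, using the last two words as the prefix
    prefix <- paste(words[(k - 1):k], collapse = " ")
    hits <- d3[grepl(paste0("^", prefix, " "), d3$word), ]
    if (nrow(hits) > 0) return(head(sub(paste0("^", prefix, " "), "", hits$word), n))
  }
  #back off to the bigram table, using only the last word
  prefix <- words[k]
  hits <- d2[grepl(paste0("^", prefix, " "), d2$word), ]
  head(sub(paste0("^", prefix, " "), "", hits$word), n)
}

#Example: predictNextWord("cant wait", d2, d3) should return terms such as "see"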