Coursera Data Science Capstone

The goal of this capstone is to mimic the experience of being a data scientist by applying the data science techniques learned across all 9 courses of the specialization to create a data product and a presentation for SwiftKey.

The main objective is to understand the problem, acquire the data, and understand the type of data we are dealing with. The data can be downloaded from

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Three working files are extracted from the zip archive:

  1. “en_US.blogs.txt”
  2. “en_US.news.txt”
  3. “en_US.twitter.txt”

Data preparation 1

Several libraries are chosen to begin with and are loaded as below:

library(magrittr)      # pipe operators
library(NLP)           # natural language processing infrastructure
library(tm)            # text mining framework
library(stringi)       # fast string processing
library(RWeka)         # NGramTokenizer for n-gram tokenization
library(ggplot2)       # plotting
library(wordcloud)     # word cloud generation
library(RColorBrewer)  # colour palettes for the plots
options(mc.cores=1)    # use a single core (avoids parallel-processing issues with RWeka/tm)

The data is read and stored:

blogfile<- "C:\\Users\\cpangb\\Documents\\Capstone\\Corpus\\en_US.blogs.txt"
newsfile<- "C:\\Users\\cpangb\\Documents\\Capstone\\Corpus\\en_US.news.txt"
twitterfile<- "C:\\Users\\cpangb\\Documents\\Capstone\\Corpus\\en_US.twitter.txt"

blog.line<-readLines(blogfile,encoding="UTF-8", skipNul = TRUE)
news.line<-readLines(newsfile,encoding="UTF-8", skipNul = TRUE)
twitter.line<-readLines(twitterfile,encoding="UTF-8", skipNul = TRUE)

Understanding the data (preliminary)

Count the words on each line of the data:

blog.word.count<-stri_count_words(blog.line)
news.word.count<-stri_count_words(news.line)
twitter.word.count<-stri_count_words(twitter.line)

Blog

Produce a preliminary summary of the blog data.
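These figures were presumably produced with calls along the following lines (a sketch reusing the objects created above; the same calls apply to the news and twitter data):

length(blog.line)         # number of lines
sum(blog.word.count)      # total number of words
summary(blog.word.count)  # distribution of words per line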

Number of lines:

## [1] 899288

Number of words:

## [1] 37546246

Summary of word counts per line:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00

News

Produce a preliminary summary of the news data.

Number of lines:

## [1] 77259

Number of words:

## [1] 2674536

Summary of word counts per line:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.62   46.00 1123.00

Twitter

Produce a preliminary summary of the twitter data.

Number of lines:

## [1] 2360148

Number of words:

## [1] 30093410

Summary of word counts per line:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00

Data Processing

Split the lines into words:

blog.word<-unlist(strsplit(blog.line," "))
news.word<-unlist(strsplit(news.line," "))
twitter.word<-unlist(strsplit(twitter.line," "))

Find the whitespace, punctuation, non-ASCII words and words containing numbers:

blog.blankspace<-sum(stri_count(blog.line,regex="\\p{Space}"))
news.blankspace<-sum(stri_count(news.line,regex="\\p{Space}"))
twitter.blankspace<-sum(stri_count(twitter.line,regex="\\p{Space}"))
blog.punc<-sum(stri_count(blog.line,regex="\\p{Punct}"))
news.punc<-sum(stri_count(news.line,regex="\\p{Punct}"))
twitter.punc<-sum(stri_count(twitter.line,regex="\\p{Punct}"))
blog.nonEnglish <- length(blog.word[stri_enc_isascii(unlist(blog.word))==FALSE])
news.nonEnglish <- length(news.word[stri_enc_isascii(unlist(news.word))==FALSE])
twitter.nonEnglish <- length(twitter.word[stri_enc_isascii(unlist(twitter.word))==FALSE])
blog.number<-length(blog.word[stri_detect_regex(blog.word,"[:digit:]")==TRUE])
news.number<-length(news.word[stri_detect_regex(news.word,"[:digit:]")==TRUE])
twitter.number<-length(twitter.word[stri_detect_regex(twitter.word,"[:digit:]")==TRUE])

Understanding of the Data

Blog

Analysis of the blog data

The 25%, 50% and 95% percentiles of the words-per-line distribution are as below:

## 25% 50% 95% 
##   9  28 126
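These percentiles and the top-10 word list below were presumably obtained with calls along these lines (a sketch reusing blog.word.count and blog.word from above; the same calls apply to the news and twitter data):

quantile(blog.word.count, probs = c(0.25, 0.50, 0.95))  # words-per-line percentiles
head(sort(table(blog.word), decreasing = TRUE), 10)     # top 10 most frequent words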

Number of lines:

## [1] 899288

Number of words:

## [1] 37334131

Top 10 words:

## 
##     the      to     and      of       a       I      in    that      is 
## 1659151 1043878 1015714  862906  857102  738534  540436  421628  412438 
##     for 
##  337156

Note that the most frequent words observed are common English stop words.

Number of whitespace characters:

## [1] 36434843

Number of punctuation characters:

## [1] 6536746

Number of non-ASCII words:

## [1] 716174

Number of words containing digits:

## [1] 411373

News

Analysis of the news data

The 25%, 50% and 95% percentiles of the words-per-line distribution are as below:

## 25% 50% 95% 
##  19  32  74

Number of lines:

## [1] 77259

Number of words:

## [1] 2643969

Top 10 words:

## 
##    the     to    and      a     of     in    for   that     is     on 
## 131810  68417  65167  63401  58675  47526  25498  23916  21232  19198

Note that the most frequent words observed are common English stop words.

Number of whitespace characters:

## [1] 2566710

Number of punctuation characters:

## [1] 533196

Number of non-ASCII words:

## [1] 22587

Number of words containing digits:

## [1] 64181

Twitter

Analysis of the Twitter data

The 25%, 50% and 95% percentiles of the words-per-line distribution are as below:

## 25% 50% 95% 
##   7  12  25

Number of lines:

## [1] 2360148

Number of words:

## [1] 30373583

Top 10 words:

## 
##    the     to      I      a    you    and    for     of     in     is 
## 837023 761902 604531 572691 416377 397642 368422 349367 348815 329396

Note that the most frequent words observed are common English stop words.

Number of whitespace characters:

## [1] 28013435

Number of punctuation characters:

## [1] 7877048

Number of non-ASCII words:

## [1] 114774

Number of words containing digits:

## [1] 505709

Data preparation 2

As the original data files (blogs, news and twitter) are extremely large, a small sample will be generated to study the data. 10% of the contents of each data set (blogs, news and twitter) will be sampled to create the sample corpus.
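A minimal sketch of how the 10% sample could be drawn and written into the Sample folder used below (the rbinom-based sampling, the seed and the file names are illustrative assumptions, not the exact code used):

set.seed(1234)  # assumed seed, for reproducible sampling
sample.blog    <- blog.line[rbinom(length(blog.line), 1, 0.10) == 1]
sample.news    <- news.line[rbinom(length(news.line), 1, 0.10) == 1]
sample.twitter <- twitter.line[rbinom(length(twitter.line), 1, 0.10) == 1]
dir.create("Sample", showWarnings = FALSE)
writeLines(sample.blog,    "Sample/sample.blog.txt")     # file names are illustrative
writeLines(sample.news,    "Sample/sample.news.txt")
writeLines(sample.twitter, "Sample/sample.twitter.txt")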

The corpus is then generated from the sample created.

#Create the Corpus from the sample data
corpus.folder<-"Sample"
corpus<-VCorpus(DirSource(corpus.folder,encoding="UTF-8"))
profanity<-readLines("C:\\Users\\cpangb\\Documents\\Capstone\\Corpus\\profanity.csv")

The summary of the sample corpus created is shown below.

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 22046034
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 1582247
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 18734352

Corpus Transformation

As observed, there are numerous characters, words, numbers and punctuation marks that are not relevant to the prediction exercise. Therefore, a few functions are created to transform/clean the corpus before the actual analysis can be performed. The transformation is performed using tm_map and it includes cleaning the below:

  1. Web address URLs
  2. Metadata
  3. Non-ASCII words
  4. Repeated letters
  5. Upper case characters
  6. Numbers
  7. Profanity
  8. Common English stop words

#Create functions to transform the data
removeURL<-function(x) gsub("http[[:alnum:]]*","",x)
removeSign<-function(x) gsub("[[:punct:]]","",x)
removeNum<-function(x) gsub("[[:digit:]]","",x)
removeapo<-function(x) gsub("'","",x)
removeNonASCII<-function(x) iconv(x, "latin1", "ASCII", sub="")
removerepeat<- function(x) gsub("([[:alpha:]])\\1{2,}", "\\1\\1", x)
toLowerCase <- function(x) sapply(x,tolower)
removeSpace<-function(x) gsub("\\s+"," ",x)
removeTh<-function(x) gsub(" th ", " ",x) #replace with a space so adjacent words are not merged

#Transform the corpus
corpus<-tm_map(corpus,content_transformer(removeapo)) #remove apostrophes
corpus<-tm_map(corpus,content_transformer(removeNum)) #remove numbers
corpus<-tm_map(corpus,content_transformer(removeURL)) #remove web URLs
corpus<-tm_map(corpus,content_transformer(removeSign)) #remove punctuation
corpus<-tm_map(corpus,content_transformer(removeNonASCII)) #remove non-ASCII characters
corpus<-tm_map(corpus,content_transformer(toLowerCase)) #convert uppercase to lowercase
corpus<-tm_map(corpus,content_transformer(removerepeat)) #collapse letters repeated three or more times to two
corpus<-tm_map(corpus,content_transformer(removeSpace)) #collapse multiple spaces
corpus<-tm_map(corpus,removeWords,stopwords("english")) #remove common English stop words
corpus<-tm_map(corpus,removeWords,profanity) #remove profanity
corpus<-tm_map(corpus,content_transformer(removeTh)) #remove stray standalone "th" tokens

The summary of the sample corpus after transformation/cleaning is shown below.

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 15060033
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 1129424
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 11813642

Build Term Document Matrix

Now that the sample corpus is ready, it is tokenized using NGramTokenizer into three categories, unigram, bigram and trigram, to further analyse the frequency of the words.
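The unigram tokenization follows the same pattern as the bigram and trigram code below; a minimal sketch (the object names dtm1, d1 and plotd1 are illustrative, not taken from the report):

unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
dtm1<-TermDocumentMatrix(corpus,control = list(tokenize = unigram))
wordMatrix1 <- as.data.frame(as.matrix(dtm1))
v1 <- sort(rowSums(wordMatrix1),decreasing=TRUE)
d1 <- data.frame(word = names(v1),freq=v1)
plotd1<-d1[1:20,]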

2-Gram

A 2-gram is a contiguous sequence of two words from the corpus.

bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm2<-TermDocumentMatrix(corpus,control = list(tokenize = bigram))
wordMatrix2 <- as.data.frame((as.matrix(dtm2))) 
v2 <- sort(rowSums(wordMatrix2),decreasing=TRUE)
d2 <- data.frame(word = names(v2),freq=v2)
plotd2<-d2[1:20,]

Top 10 bigrams:

##                            word freq
## im sure                 im sure 1833
## right now             right now 1821
## last night           last night 1804
## cant wait             cant wait 1631
## looking forward looking forward 1615
## feel like             feel like 1373
## dont know             dont know 1302
## dont think           dont think 1251
## next week             next week 1210
## mister rogers     mister rogers 1170

3-Gram

A 3-gram is a contiguous sequence of three words from the corpus.

trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm3<-TermDocumentMatrix(corpus,control = list(tokenize = trigram))
wordMatrix3 <- as.data.frame((as.matrix(dtm3))) 
v3 <- sort(rowSums(wordMatrix3),decreasing=TRUE)
d3 <- data.frame(word = names(v3),freq=v3)
plotd3<-d3[1:20,]

Top 10 trigrams:

##                          word freq
## boy big sword   boy big sword  468
## little boy big little boy big  468
## new york city   new york city  373
## let us know       let us know  348
## im pretty sure im pretty sure  333
## cant wait see   cant wait see  293
## im sure will     im sure will  274
## id love tell     id love tell  246
## u know clap       u know clap  246
## go night night go night night  242

Understanding the data (Advanced)

Word clouds and ggplot2 plots are generated to better illustrate the frequency of the words in each n-gram category. The top 100 words, 2-gram terms and 3-gram terms are shown in the word clouds and plots.

2-gram Word Cloud and Plot
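A minimal sketch of how the 2-gram word cloud and bar plot can be generated from the d2 and plotd2 objects created above (the exact plotting options used in the report may differ; the 1-gram and 3-gram versions are analogous):

wordcloud(words = d2$word, freq = d2$freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
ggplot(plotd2, aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "2-gram", y = "Frequency")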

3-gram Word Cloud and Plot

What’s Next?

  1. Create a larger corpus from the original blogs, news and twitter data and tokenize it (2-gram and 3-gram). It may be possible to split the corpus into multiple parts, build the 2-gram and 3-gram matrices independently, and then recombine them into one.
  2. Create a prediction algorithm by comparing the input to the 2-gram and 3-gram matrices (a minimal lookup sketch follows this list).
  3. Optimize the code to allow faster processing.
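As an illustration of item 2, a minimal sketch of the kind of lookup the prediction algorithm could perform against the d2 and d3 frequency tables built above, backing off from the trigram table to the bigram table (predictNextWord is a hypothetical helper, not the final algorithm):

#Illustrative backoff lookup against the n-gram frequency tables (not the final algorithm)
predictNextWord <- function(input, d2, d3) {
  words <- tolower(unlist(strsplit(input, "\\s+")))
  n <- length(words)
  if (n == 0) return(NA_character_)
  if (n >= 2) {
    #try the trigram table first: match on the last two words of the input
    prefix <- paste(words[n-1], words[n])
    hits <- d3[grepl(paste0("^", prefix, " "), d3$word), ]
    if (nrow(hits) > 0) return(sub(paste0("^", prefix, " "), "", hits$word[1]))
  }
  #back off to the bigram table: match on the last word only
  prefix <- words[n]
  hits <- d2[grepl(paste0("^", prefix, " "), d2$word), ]
  if (nrow(hits) > 0) return(sub(paste0("^", prefix, " "), "", hits$word[1]))
  NA_character_
}

predictNextWord("cant wait", d2, d3)  #expected to return "see" given the trigram counts above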