The goal of this capstone is to mimic the experience of a data scientist by applying the data science techniques learned across all nine specialization courses to create a data product and a presentation for SwiftKey.
The main objective is to understand the problem, acquire the data, and understand the type of data we are dealing with. The data can be downloaded from
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Three working files are extracted from the zip file: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
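A minimal sketch of the download and extraction step, assuming the archive is saved to the working directory (the destination file name is an illustrative choice):
#Download the zip archive and extract its contents (illustrative; working directory assumed)
zipurl<-"https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if(!file.exists("Coursera-SwiftKey.zip")){
  download.file(zipurl,destfile="Coursera-SwiftKey.zip",mode="wb")
}
unzip("Coursera-SwiftKey.zip")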
Several libraries are loaded to begin with:
library(magrittr)
library(NLP)
library(tm)
library(stringi)
library(RWeka)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
options(mc.cores=1)
The data files are read and stored:
blogfile<- "C:\\Users\\cpangb\\Documents\\Capstone\\Corpus\\en_US.blogs.txt"
newsfile<- "C:\\Users\\cpangb\\Documents\\Capstone\\Corpus\\en_US.news.txt"
twitterfile<- "C:\\Users\\cpangb\\Documents\\Capstone\\Corpus\\en_US.twitter.txt"
blog.line<-readLines(blogfile,encoding="UTF-8", skipNul = TRUE)
news.line<-readLines(newsfile,encoding="UTF-8", skipNul = TRUE)
twitter.line<-readLines(twitterfile,encoding="UTF-8", skipNul = TRUE)
Count the words on each line of the data:
blog.word.count<-stri_count_words(blog.line)
news.word.count<-stri_count_words(news.line)
twitter.word.count<-stri_count_words(twitter.line)
Produce a preliminary summary of the blog data:
Number of lines:
## [1] 899288
Number of words:
## [1] 37546246
Summary of word counts per line:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 28.00 41.75 60.00 6726.00
Produce a preliminary summary of the news data:
Number of lines:
## [1] 77259
Number of words:
## [1] 2674536
Summary of word counts per line:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 19.00 32.00 34.62 46.00 1123.00
Produce a preliminary summary of the Twitter data:
Number of lines:
## [1] 2360148
Number of words:
## [1] 30093410
Summary of word counts per line:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 12.00 12.75 18.00 47.00
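For reference, a minimal sketch of how the figures above can be obtained, shown for the blog file (the news and Twitter files follow the same pattern):
#Number of lines, total word count and per-line word count distribution (blog shown)
length(blog.line)
sum(blog.word.count)
summary(blog.word.count)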
Split the lines into words:
blog.word<-unlist(strsplit(blog.line," "))
news.word<-unlist(strsplit(news.line," "))
twitter.word<-unlist(strsplit(twitter.line," "))
Count the blank spaces, punctuation, non-ASCII words and words containing numbers:
blog.blankspace<-sum(stri_count(blog.line,regex="\\p{Space}"))
news.blankspace<-sum(stri_count(news.line,regex="\\p{Space}"))
twitter.blankspace<-sum(stri_count(twitter.line,regex="\\p{Space}"))
blog.punc<-sum(stri_count(blog.line,regex="\\p{Punct}"))
news.punc<-sum(stri_count(news.line,regex="\\p{Punct}"))
twitter.punc<-sum(stri_count(twitter.line,regex="\\p{Punct}"))
blog.nonEnglish <- length(blog.word[stri_enc_isascii(blog.word)==FALSE])
news.nonEnglish <- length(news.word[stri_enc_isascii(news.word)==FALSE])
twitter.nonEnglish <- length(twitter.word[stri_enc_isascii(twitter.word)==FALSE])
blog.number<-length(blog.word[stri_detect_regex(blog.word,"[:digit:]")==TRUE])
news.number<-length(news.word[stri_detect_regex(news.word,"[:digit:]")==TRUE])
twitter.number<-length(twitter.word[stri_detect_regex(twitter.word,"[:digit:]")==TRUE])
Analysis of the blog data
The 25th, 50th and 95th percentiles of the words-per-line counts are as below:
## 25% 50% 95%
## 9 28 126
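A sketch of how these percentiles can be computed from the per-line word counts (the same call applies to the news and Twitter counts):
#25th, 50th and 95th percentiles of words per line (blog shown)
quantile(blog.word.count, probs=c(0.25,0.50,0.95))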
Number of lines:
## [1] 899288
Number of words:
## [1] 37334131
Top 10 words:
##
## the to and of a I in that is
## 1659151 1043878 1015714 862906 857102 738534 540436 421628 412438
## for
## 337156
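The top 10 words can be obtained by tabulating the split words; the exact call used in this report is not shown, so the following is an assumed sketch:
#Ten most frequent words in the blog data
head(sort(table(blog.word),decreasing=TRUE),10)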
Note that the most frequently occurring words are common English stop words.
Number of blank spaces:
## [1] 36434843
Number of punctuation characters:
## [1] 6536746
Number of non-ASCII words:
## [1] 716174
Number of words containing digits:
## [1] 411373
Analysis of the news data
The 25th, 50th and 95th percentiles of the words-per-line counts are as below:
## 25% 50% 95%
## 19 32 74
Number of lines:
## [1] 77259
Number of words:
## [1] 2643969
Top 10 words:
##
## the to and a of in for that is on
## 131810 68417 65167 63401 58675 47526 25498 23916 21232 19198
Note that the most frequently occurring words are common English stop words.
Number of blank spaces:
## [1] 2566710
Number of punctuation characters:
## [1] 533196
Number of non-ASCII words:
## [1] 22587
Number of words containing digits:
## [1] 64181
Analysis of the Twitter data
The 25th, 50th and 95th percentiles of the words-per-line counts are as below:
## 25% 50% 95%
## 7 12 25
Number of lines:
## [1] 2360148
Number of words:
## [1] 30373583
Top 10 words:
##
## the to I a you and for of in is
## 837023 761902 604531 572691 416377 397642 368422 349367 348815 329396
Note that the most frequently occurring words are common English stop words.
Number of blank spaces:
## [1] 28013435
Number of punctuation characters:
## [1] 7877048
Number of non-ASCII words:
## [1] 114774
Number of words containing digits:
## [1] 505709
As the original data files (blogs, news and Twitter) are extremely large, a smaller sample will be generated to study the data: 10% of the contents of each file will be sampled to create the sample corpus.
The corpus will then be built from the sample created.
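A minimal sketch of the sampling step, writing the samples into a Sample folder that the corpus is built from in the next chunk (the seed and output file names are illustrative assumptions):
#Sample 10% of each file and write the samples to the Sample folder (illustrative)
set.seed(1234)
dir.create("Sample",showWarnings=FALSE)
writeLines(sample(blog.line,round(length(blog.line)*0.10)),"Sample/en_US.blogs.sample.txt")
writeLines(sample(news.line,round(length(news.line)*0.10)),"Sample/en_US.news.sample.txt")
writeLines(sample(twitter.line,round(length(twitter.line)*0.10)),"Sample/en_US.twitter.sample.txt")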
#Create the Corpus from the sample data
corpus.folder<-"Sample"
corpus<-VCorpus(DirSource(corpus.folder,encoding="UTF-8"))
profanity<-readLines("C:\\Users\\cpangb\\Documents\\Capstone\\Corpus\\profanity.csv")
A summary of the sample corpus created is shown below.
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 22046034
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1582247
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 18734352
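The display above was most likely produced with tm's inspect() function; a minimal sketch:
#Print corpus metadata and per-document character counts
inspect(corpus)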
As observed, there are numerous characters, words, numbers and punctuation marks that are not relevant to the prediction exercise. Therefore, a few functions are created to transform and clean the corpus before the actual analysis can be performed. The transformation is performed using tm_map and includes removing apostrophes, numbers, URLs, punctuation, non-ASCII characters, repeated letters, extra whitespace, English stop words, profanity and leftover "th" tokens, as well as converting text to lowercase:
#Create functions to transform the data
removeURL<-function(x) gsub("http[[:alnum:]]*","",x) #strip URLs starting with http
removeSign<-function(x) gsub("[[:punct:]]","",x) #strip punctuation
removeNum<-function(x) gsub("[[:digit:]]","",x) #strip digits
removeapo<-function(x) gsub("'","",x) #strip apostrophes
removeNonASCII<-function(x) iconv(x, "latin1", "ASCII", sub="") #drop non-ASCII characters
removerepeat<- function(x) gsub("([[:alpha:]])\\1{2,}", "\\1\\1", x) #collapse letters repeated three or more times to two
toLowerCase <- function(x) sapply(x,tolower) #convert to lowercase
removeSpace<-function(x) gsub("\\s+"," ",x) #collapse multiple spaces
removeTh<-function(x) gsub(" th ", " ",x) #replace stray "th" tokens (left after digit removal) with a space
#Transform the corpus
corpus<-tm_map(corpus,content_transformer(removeapo)) #remove apostrophes
corpus<-tm_map(corpus,content_transformer(removeNum)) #remove numbers
corpus<-tm_map(corpus,content_transformer(removeURL)) #remove web URLs
corpus<-tm_map(corpus,content_transformer(removeSign)) #remove punctuation
corpus<-tm_map(corpus,content_transformer(removeNonASCII)) #remove non-ASCII characters
corpus<-tm_map(corpus,content_transformer(toLowerCase)) #convert uppercase to lowercase
corpus<-tm_map(corpus,content_transformer(removerepeat)) #collapse repeated letters within words
corpus<-tm_map(corpus,content_transformer(removeSpace)) #remove multiple spaces
corpus<-tm_map(corpus,removeWords,stopwords("english")) #remove common English stop words
corpus<-tm_map(corpus,removeWords,profanity) #remove profanity
corpus<-tm_map(corpus,content_transformer(removeTh)) #remove leftover "th" tokens
A summary of the sample corpus after transformation and cleaning is shown below.
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 15060033
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1129424
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 11813642
Now that the sample corpus is ready, it is tokenized using NGramTokenizer into three categories, unigram, bigram and trigram, to further analyze the frequency of the words.
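The unigram chunk is not reproduced in this report; a sketch following the same pattern as the bigram and trigram chunks below might look like this (the object names dtm1, d1 and plotd1 are illustrative):
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
dtm1<-TermDocumentMatrix(corpus,control = list(tokenize = unigram))
wordMatrix1 <- as.data.frame((as.matrix(dtm1)))
v1 <- sort(rowSums(wordMatrix1),decreasing=TRUE)
d1 <- data.frame(word = names(v1),freq=v1)
plotd1<-d1[1:20,]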
A 2-gram (bigram) is a contiguous sequence of two words from the corpus.
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm2<-TermDocumentMatrix(corpus,control = list(tokenize = bigram))
wordMatrix2 <- as.data.frame((as.matrix(dtm2)))
v2 <- sort(rowSums(wordMatrix2),decreasing=TRUE)
d2 <- data.frame(word = names(v2),freq=v2)
plotd2<-d2[1:20,]
Top 10 bigrams:
## word freq
## im sure im sure 1833
## right now right now 1821
## last night last night 1804
## cant wait cant wait 1631
## looking forward looking forward 1615
## feel like feel like 1373
## dont know dont know 1302
## dont think dont think 1251
## next week next week 1210
## mister rogers mister rogers 1170
A 3-gram (trigram) is a contiguous sequence of three words from the corpus.
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm3<-TermDocumentMatrix(corpus,control = list(tokenize = trigram))
wordMatrix3 <- as.data.frame((as.matrix(dtm3)))
v3 <- sort(rowSums(wordMatrix3),decreasing=TRUE)
d3 <- data.frame(word = names(v3),freq=v3)
plotd3<-d3[1:20,]
Top 10 trigrams:
## word freq
## boy big sword boy big sword 468
## little boy big little boy big 468
## new york city new york city 373
## let us know let us know 348
## im pretty sure im pretty sure 333
## cant wait see cant wait see 293
## im sure will im sure will 274
## id love tell id love tell 246
## u know clap u know clap 246
## go night night go night night 242
Word clouds and ggplot2 bar charts are generated to better illustrate the frequency of the words in each n-gram category. The top 100 words, 2-grams and 3-grams are shown in the word clouds and plots.
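A minimal sketch of how such plots could be generated from the bigram frequency table built above (the palette, labels and top-20 cut-off are illustrative choices):
#Word cloud of the 100 most frequent bigrams
wordcloud(words=d2$word,freq=d2$freq,max.words=100,random.order=FALSE,colors=brewer.pal(8,"Dark2"))
#Bar chart of the top 20 bigrams
ggplot(plotd2,aes(x=reorder(word,-freq),y=freq))+
  geom_bar(stat="identity")+
  labs(x="Bigram",y="Frequency",title="Top 20 bigrams")+
  theme(axis.text.x=element_text(angle=45,hjust=1))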