The goal of this capstone is to mimic the experience of being a data scientist by applying the data science techniques learned across all nine specialization courses to create a data product and a presentation for SwiftKey.
For Week 2, the main objective is to build a sample corpus, construct the 2-gram and 3-gram term-document matrices, and perform exploratory analysis on the words. The data can be downloaded from
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The zip file is extracted, and three working files are used: the Blogs, News and Twitter text files.
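A minimal sketch of the download and extraction step is shown below; the archive file name, destination folder and the if-exists guard are assumptions rather than the exact code used in this report.
# Download and extract the SwiftKey dataset (assumed file/folder names)
data.url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip.file <- "Coursera-SwiftKey.zip"
if (!file.exists(zip.file)) {
  download.file(data.url, destfile = zip.file, mode = "wb")
}
unzip(zip.file, exdir = ".")  # the English files are located under final/en_US/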
The libraries chosen to begin with are listed below:
library(NLP)          # base NLP infrastructure required by tm
library(tm)           # text mining framework: corpus, transformations, term-document matrices
library(stringi)      # fast string processing
library(RWeka)        # NGramTokenizer for bigrams and trigrams
library(ggplot2)      # frequency plots
library(wordcloud)    # word clouds
library(RColorBrewer) # colour palettes for the word clouds
options(mc.cores=1)   # avoid parallel tm_map issues with the RWeka tokenizers
As the original data files (Blogs, News and Twitter) are extremely large, a small sample is generated to study the data: 10% of the lines of each file are sampled to create the sample corpus.
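A minimal sketch of this sampling step is shown below; the input paths, output file names and random seed are assumptions, not the exact code used in this report.
# Sample 10% of the lines of each raw file and write the result to the Sample folder
set.seed(1234)
sampleFile <- function(infile, outfile, p = 0.1) {
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  keep <- sample(length(lines), size = round(p * length(lines)))
  writeLines(lines[keep], outfile)
}
sampleFile("final/en_US/en_US.blogs.txt", "Sample/blogs_sample.txt")
sampleFile("final/en_US/en_US.news.txt", "Sample/news_sample.txt")
sampleFile("final/en_US/en_US.twitter.txt", "Sample/twitter_sample.txt")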
The corpus is then generated from the sample files created above.
#Create the corpus from the sample data
corpus.folder <- "C:/Users/ShockShockWest/Documents/My Project/Capstone/Sample"
corpus <- VCorpus(DirSource(corpus.folder, encoding = "UTF-8"))
profanity <- readLines("C:/Users/ShockShockWest/Documents/My Project/Capstone/profanity.csv") # list of profanity terms to be removed later
The summary of the sample corpus is shown below:
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 21797232
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1574699
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 18657729
As observed, the corpus contains numerous characters, words, numbers and punctuation marks that are not relevant to the prediction exercise. Therefore, a few functions are created to transform and clean the corpus before the actual analysis is performed. The transformation is applied with tm_map and covers the following cleaning steps:
#Create functions to transform the data
removeURL <- function(x) gsub("http\\S*", "", x)                      # drop "http..." up to the next whitespace
removeSign <- function(x) gsub("[[:punct:]]", "", x)                  # drop punctuation
removeNum <- function(x) gsub("[[:digit:]]", "", x)                   # drop digits
removeapo <- function(x) gsub("'", "", x)                             # drop apostrophes
removeNonASCII <- function(x) iconv(x, "latin1", "ASCII", sub = "")   # drop non-ASCII characters
removerepeat <- function(x) gsub("([[:alpha:]])\\1{2,}", "\\1\\1", x) # collapse letters repeated 3+ times to 2
toLowerCase <- function(x) tolower(x)                                 # convert to lowercase
removeSpace <- function(x) gsub("\\s+", " ", x)                       # collapse multiple spaces
removeTh <- function(x) gsub(" th\\b", "", x)                         # drop the stray "th" left after digits are removed (e.g. "5th")
#Transform the corpus
corpus <- tm_map(corpus, content_transformer(removeapo))      # remove apostrophes
corpus <- tm_map(corpus, content_transformer(removeNum))      # remove numbers
corpus <- tm_map(corpus, content_transformer(removeURL))      # remove web URLs
corpus <- tm_map(corpus, content_transformer(removeSign))     # remove punctuation
corpus <- tm_map(corpus, content_transformer(removeNonASCII)) # remove non-ASCII characters
corpus <- tm_map(corpus, content_transformer(toLowerCase))    # convert uppercase to lowercase
corpus <- tm_map(corpus, content_transformer(removerepeat))   # collapse repeated letters within words
corpus <- tm_map(corpus, content_transformer(removeSpace))    # collapse multiple spaces
corpus <- tm_map(corpus, removeWords, stopwords("english"))   # remove common English stop words
corpus <- tm_map(corpus, removeWords, profanity)              # remove profanity words
corpus <- tm_map(corpus, content_transformer(removeTh))       # remove leftover "th" from ordinals
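As a quick sanity check (not part of the original analysis), the first few lines of a cleaned document can be inspected to confirm the transformations behaved as expected:
# Peek at the first few cleaned lines of the first document
head(content(corpus[[1]]), 3)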
The summary of the sample corpus after transformation/cleaning is shown below:
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 16581419
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1265733
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 14970049
Now that the sample corpus is ready, it is tokenized into three categories: unigrams, bigrams and trigrams (the latter two via RWeka's NGramTokenizer), to further analyze the frequency of the words.
A 1-gram (unigram) is a single word from the corpus.
dtm <- TermDocumentMatrix(corpus)
wordMatrix <- as.data.frame(as.matrix(dtm))
v <- sort(rowSums(wordMatrix), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
plotd <- d[1:20, ]
Top 10 unigrams:
## word freq
## just just 24237
## will will 22683
## like like 21703
## one one 20525
## get get 19489
## can can 17778
## time time 16048
## now now 15958
## love love 14789
## day day 14788
A 2-gram (bigram) is a contiguous sequence of two words from the corpus.
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm2 <- TermDocumentMatrix(corpus, control = list(tokenize = bigram))
wordMatrix2 <- as.data.frame(as.matrix(dtm2))
v2 <- sort(rowSums(wordMatrix2), decreasing = TRUE)
d2 <- data.frame(word = names(v2), freq = v2)
plotd2 <- d2[1:20, ]
Top 10 bigrams:
## word freq
## im sure im sure 1822
## right now right now 1811
## last night last night 1804
## cant wait cant wait 1631
## looking forward looking forward 1615
## dont know dont know 1302
## feel like feel like 1261
## next week next week 1197
## mister rogers mister rogers 1170
## im going im going 1154
A 3-gram (trigram) is a contiguous sequence of three words from the corpus.
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm3 <- TermDocumentMatrix(corpus, control = list(tokenize = trigram))
wordMatrix3 <- as.data.frame(as.matrix(dtm3))
v3 <- sort(rowSums(wordMatrix3), decreasing = TRUE)
d3 <- data.frame(word = names(v3), freq = v3)
plotd3 <- d3[1:20, ]
Top 10 trigrams:
## word freq
## boy big sword boy big sword 468
## little boy big little boy big 468
## new york city new york city 373
## let us know let us know 348
## im pretty sure im pretty sure 333
## cant wait see cant wait see 293
## im sure will im sure will 274
## id love tell id love tell 246
## u know clap u know clap 246
## go night night go night night 242
Word clouds and ggplot2 bar charts are generated to better illustrate the most frequent terms in each n-gram category. The top 100 unigrams, bigrams and trigrams are shown in the word clouds, and the top 20 in the frequency plots.
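A minimal sketch of the plotting code is shown below for the unigram case; the same pattern applies to d2/plotd2 and d3/plotd3, and the colour palette, titles and axis labels are assumptions.
# Word cloud of the top 100 unigrams
wordcloud(words = d$word[1:100], freq = d$freq[1:100],
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
# Bar chart of the top 20 unigrams
ggplot(plotd, aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  labs(x = "Word", y = "Frequency", title = "Top 20 unigrams") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))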