This milestone report is part of the Data Science Specialization Capstone by Johns Hopkins University on Coursera. To make the project possible, the course creators have partnered with SwiftKey, a company that builds a smart keyboard that makes it easier for people to type on their mobile devices. The final goal of the project is to develop a similar smart keyboard with the help of various predictive models, which means exploring the area of natural language processing in depth. The aim of this report is to demonstrate the first essential skills required to tackle this objective: loading and processing the large text data sets and carrying out exploratory analysis of their main features. Finally, we will outline the next steps required to create the prediction algorithm and deploy it within a user-friendly Shiny web application.
The document consists of the following parts:
Loading the data
Exploratory data analysis
Processing the data and creating corpora
Tokenization
Visualisation
Key findings
Next steps
The data set for this project is available at the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. To proceed with the tasks in scope of this project, let's first download the data and unzip it into a separate directory.
fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download the archive only if it is not already present, then extract it
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(fileURL, destfile = "Coursera-SwiftKey.zip", method = "curl")
}
unzip("Coursera-SwiftKey.zip")
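As a quick optional sanity check (assuming the archive extracted into the final/ directory as expected), we can list the English files and their sizes:
# List the extracted English files and their sizes in megabytes
en_files <- list.files("final/EN_US", full.names = TRUE)
data.frame(file = basename(en_files),
           size_MB = round(file.info(en_files)$size / 1024^2))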
This project requires several R packages. Having installed them, let's load them all up front.
library(ggplot2) # data sets visualisation
library(stringi) # string manipulation
library(tm) # main package used in text mining
library(wordcloud) # for creation of word clouds
The source data consists of four folders, one per language. For the purposes of our analysis we will use the en_US data sets. To load the full data sets into R, we'll use the readLines() function:
twitter_lines <- readLines("final/EN_US/en_US.twitter.txt")
blogs_lines <- readLines("final/EN_US/en_US.blogs.txt")
news_lines <- readLines("final/EN_US/en_US.news.txt")
all_lines <- c(twitter_lines, blogs_lines, news_lines)
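Note: on some platforms readLines() may warn about embedded nulls or an incomplete final line for the news file and stop early. If that happens, a hedged workaround is to open the connection in binary mode and skip the null characters:
# Workaround if the news file is truncated by readLines() in text mode
con <- file("final/EN_US/en_US.news.txt", open = "rb")
news_lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)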
In this section, we'll perform basic exploratory analysis of all three data sets, including the length of lines in characters, the number of lines and, finally, the number of words in each data set.
# Shortest non-empty line, average line length and longest line (in characters)
shortest <- Inf; longest <- 0; total <- 0; cnt <- 0
for (i in 1:length(twitter_lines)) {
    length <- nchar(twitter_lines[i])
    if (length > longest) longest <- length
    if (length != 0 && length < shortest) shortest <- length
    if (length > 0) cnt <- cnt + 1    # count non-empty lines
    total <- total + length
}
shortest; round(total / cnt); longest
## [1] 2
## [1] 69
## [1] 213
Having run the same code for the other data sets, we get the following statistics for line length in characters. Note that the shortest line has only 1 character, while the longest is almost 41,000 characters long.
| Length of line | Twitter | Blogs | News |
|---|---|---|---|
| Shortest line | 2 | 1 | 2 |
| Average line | 69 | 232 | 203 |
| Longest line | 213 | 40835 | 5760 |
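The same figures can be reproduced for all three data sets at once; a compact sketch (the helper name line_length_stats is ours):
# Shortest non-empty, average and longest line length for each data set
line_length_stats <- function(lines) {
    len <- nchar(lines)
    c(shortest = min(len[len > 0]),
      average = round(mean(len[len > 0])),
      longest = max(len))
}
sapply(list(Twitter = twitter_lines, Blogs = blogs_lines, News = news_lines),
       line_length_stats)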
total_raw <- lapply(list(twitter_lines, blogs_lines, news_lines), stri_count_words)
stats <- data.frame(
    data_set = c("Twitter", "Blogs", "News"),
    t(rbind(sapply(list(twitter_lines, blogs_lines, news_lines), stri_stats_general),
            total_words = sapply(list(twitter_lines, blogs_lines, news_lines), stri_stats_latex)[4, ])),
    all_data = rbind(summary(total_raw[[1]]), summary(total_raw[[2]]), summary(total_raw[[3]]))
)
print(stats)
## data_set Lines LinesNEmpty Chars CharsNWhite total_words
## 1 Twitter 2360148 2360148 162384825 134370864 30556524
## 2 Blogs 899288 899288 208361438 171926076 37746231
## 3 News 77259 77259 15683765 13117038 2661443
## all_data.Min. all_data.1st.Qu. all_data.Median all_data.Mean
## 1 1 7 12 12.79289
## 2 0 9 29 42.29050
## 3 1 19 32 34.81225
## all_data.3rd.Qu. all_data.Max.
## 1 18 47
## 2 60 6725
## 3 46 1123
To start with, as we deal with a fairly large amount of data (approx. 556 MB for the three English data sets), the tasks of the project can be resource-intensive. We will assume that it is enough to restrict the data to a sample of lines: this might be a reasonably accurate approximation of the results that could have been obtained using all the data. We'll start data processing by loading only the first 1,000 lines of each of the three data sets (a randomly drawn sample, sketched after the code below, would serve the same purpose).
con <- file("final/EN_US/en_US.twitter.txt", 'r')
twitter_lines <- readLines(con, n = 1000)
close(con)
con <- file("final/EN_US/en_US.blogs.txt", 'r')
blogs_lines <- readLines(con, n = 1000)
close(con)
con <- file("final/EN_US/en_US.news.txt", 'r')
news_lines<- readLines(con, n = 1000)
close(con)
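As an alternative to taking the first 1,000 lines, one could draw a random sample instead. A minimal sketch, which would replace the chunk above and assumes the full vectors from the exploratory section are still loaded:
# Draw a reproducible random sample of 1,000 lines from each data set
set.seed(1234)
twitter_lines <- sample(twitter_lines, 1000)
blogs_lines <- sample(blogs_lines, 1000)
news_lines <- sample(news_lines, 1000)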
Next, we’ll create data corpora.
en_twitter <- VCorpus(VectorSource(twitter_lines))
en_blogs <- VCorpus(VectorSource(blogs_lines))
en_news <- VCorpus(VectorSource(news_lines))
Having created the corpora, we can proceed with data cleansing and processing. It will be carried out in the following order:
Removal of punctuation
Removal of numbers
Conversion to lower case
Removal of extra whitespace
en_twitter<- tm_map(en_twitter, removePunctuation)
en_blogs <- tm_map(en_blogs, removePunctuation)
en_news <- tm_map(en_news, removePunctuation)
en_twitter <- tm_map(en_twitter, removeNumbers)
en_blogs <- tm_map(en_blogs, removeNumbers)
en_news <- tm_map(en_news, removeNumbers)
# content_transformer() keeps the documents as PlainTextDocuments while lower-casing,
# so no separate conversion back to plain-text documents is needed afterwards
en_twitter <- tm_map(en_twitter, content_transformer(tolower))
en_blogs <- tm_map(en_blogs, content_transformer(tolower))
en_news <- tm_map(en_news, content_transformer(tolower))
en_twitter <- tm_map(en_twitter, stripWhitespace)
en_blogs <- tm_map(en_blogs, stripWhitespace)
en_news <- tm_map(en_news, stripWhitespace)
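Since the three corpora go through identical steps, the same pipeline could also be wrapped in a small helper and applied to each corpus in turn (a sketch; the name clean_corpus is ours):
# Apply the full cleaning pipeline to a single corpus
clean_corpus <- function(corpus) {
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, content_transformer(tolower))
    tm_map(corpus, stripWhitespace)
}
# Usage: en_twitter <- clean_corpus(VCorpus(VectorSource(twitter_lines)))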
As a first step, we'll perform basic preparation for tokenization: generation of the document-term matrices. A DTM, by a widely accepted definition, is a matrix with documents as rows and terms as columns, whose elements are the counts (or weights) of the terms in the documents. Additionally, we remove sparse terms. These operations are carried out for each data set.
twitter_dtm <- DocumentTermMatrix(en_twitter)
twitter_dtm <- removeSparseTerms(twitter_dtm, 0.999)
blogs_dtm <- DocumentTermMatrix(en_blogs)
blogs_dtm <- removeSparseTerms(blogs_dtm, 0.999)
news_dtm <- DocumentTermMatrix(en_news)
news_dtm <- removeSparseTerms(news_dtm, 0.999)
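To get a feel for the structure of a DTM, we can inspect a small corner of it (an optional check, not part of the original analysis):
# A few documents (rows) and terms (columns) of the Twitter DTM
inspect(twitter_dtm[1:5, 1:5])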
In the next chunk of code, we'll identify the most frequent terms in each data set.
twitter_freq <- colSums(as.matrix(twitter_dtm))
twitter_ordered <- order(twitter_freq)
blogs_freq <- colSums(as.matrix(blogs_dtm))
blogs_ordered <- order(blogs_freq)
news_freq <- colSums(as.matrix(news_dtm))
news_ordered <- order(news_freq)
twitter_freq[tail(twitter_ordered,20)]
## but can one with was what like not all have its are just this your
## 50 50 50 52 55 56 57 57 58 58 58 62 62 69 75
## that for and you the
## 103 144 180 215 417
blogs_freq[tail(blogs_ordered,20)]
## just what like they about one from all are not but have
## 128 128 137 137 138 141 149 156 165 173 222 244
## this you was with for that and the
## 266 290 317 361 396 540 1229 2046
news_freq[tail(news_ordered,20)]
## more you who will has they its not are from have his but was with
## 101 104 105 105 108 112 129 133 139 139 139 145 173 204 251
## said that for and the
## 259 332 386 795 1874
In Natural Language Processing (NLP), tokenization refers to breaking up a sequence of strings into pieces, called tokens, such as words, keywords, phrases and symbols. Tokens can be individual words, phrases or even whole sentences. Next, we'll define functions for extracting 2-gram and 3-gram word structures from the cleaned text corpora and use them in the subsequent analysis.
# Tokenizers that split a document into 2-gram and 3-gram tokens
two_gram_tokenizer <- function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
three_gram_tokenizer <- function(x)
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
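To illustrate what ngrams() produces, here is a toy example on a hand-made token vector (not project data):
# ngrams() returns every run of n consecutive tokens, which we paste back together
toy_tokens <- c("to", "be", "or", "not", "to", "be")
unlist(lapply(ngrams(toy_tokens, 2), paste, collapse = " "), use.names = FALSE)
# expected: "to be" "be or" "or not" "not to" "to be"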
In the following chunks, we again generate DTMs, remove sparse terms and organize the data by frequency, this time for 2-grams and 3-grams respectively.
#Document term matrices for 2-grams for each data set
dtm_twitter_two <- DocumentTermMatrix(en_twitter, control = list(tokenize = two_gram_tokenizer))
dtm_twitter_two_removed <- removeSparseTerms(dtm_twitter_two, 0.999)
freq_twitter_two <- colSums(as.matrix(dtm_twitter_two_removed)); order_twitter_two <- order(freq_twitter_two)
dtm_blogs_two <- DocumentTermMatrix(en_blogs, control = list(tokenize = two_gram_tokenizer))
dtm_blogs_two_removed <- removeSparseTerms(dtm_blogs_two, 0.999)
freq_blogs_two <- colSums(as.matrix(dtm_blogs_two_removed)); order_blogs_two <- order(freq_blogs_two)
dtm_news_two <- DocumentTermMatrix(en_news, control = list(tokenize = two_gram_tokenizer))
dtm_news_two_removed <- removeSparseTerms(dtm_news_two, 0.999)
freq_news_two <- colSums(as.matrix(dtm_news_two_removed)); order_news_two <- order(freq_news_two)
#Document term matrices for 3-grams for each data set
dtm_twitter_three <- DocumentTermMatrix(en_twitter, control = list(tokenize = three_gram_tokenizer))
dtm_twitter_three_removed <- removeSparseTerms(dtm_twitter_three, 0.999)
freq_twitter_three <- colSums(as.matrix(dtm_twitter_three_removed)); order_twitter_three <- order(freq_twitter_three)
dtm_blogs_three <- DocumentTermMatrix(en_blogs, control = list(tokenize = three_gram_tokenizer))
dtm_blogs_three_removed <- removeSparseTerms(dtm_blogs_three, 0.999)
freq_blogs_three <- colSums(as.matrix(dtm_blogs_three_removed)); order_blogs_three <- order(freq_blogs_three)
dtm_news_three <- DocumentTermMatrix(en_news, control = list(tokenize = three_gram_tokenizer))
dtm_news_three_removed <- removeSparseTerms(dtm_news_three, 0.999)
freq_news_three <- colSums(as.matrix(dtm_news_three_removed)); order_news_three <- order(freq_news_three)
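The most frequent 3-grams can then be listed in the same way as the single terms above, for instance:
# Top 10 3-grams in the Twitter sample
freq_twitter_three[tail(order_twitter_three, 10)]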
In the following plots we show the 2-grams that appear most frequently in each data set.
ggplot(subset(data.frame(word=names(freq_twitter_two), freq=freq_twitter_two), freq_twitter_two>12), aes(word, freq)) +
geom_bar(stat="identity", color = "blue", fill = "darkblue") +
ggtitle("Twitter data set")
ggplot(subset(data.frame(word=names(freq_blogs_two), freq=freq_blogs_two) , freq_blogs_two>50), aes(word, freq)) +
geom_bar(stat="identity", color = "magenta", fill = "red") +
ggtitle("Blogs data set")
ggplot(subset(data.frame(word=names(freq_news_two), freq=freq_news_two), freq_news_two>30), aes(word, freq)) +
geom_bar(stat="identity", color = "black", fill = "green") +
ggtitle("News data set")
A word cloud can also be a very useful tool when you need to highlight the most frequently occurring terms in a text with a quick visualization. In the following section, we'll create word clouds of the top 40 3-grams in each of the three data sets.
wordcloud(names(freq_twitter_three), freq_twitter_three, max.words=40, scale=c(3, .5), colors=brewer.pal(6, "Accent"))
wordcloud(names(freq_blogs_three), freq_blogs_three, max.words=40, scale=c(3,.3), colors=brewer.pal(7, "Set2"))
wordcloud(names(freq_news_three), freq_news_three, max.words=40, scale=c(2,.1), colors=brewer.pal(5, "Paired"))
It took less than a minute to load and process all the scripts in this exercise (on a reasonably powerful laptop). We significantly restricted the size of the data by loading only small chunks, to keep the resource requirements manageable. In the future, to handle the large data sets more efficiently without losing prediction quality, it might be reasonable to apply parallel computing.
Additionally, I would suggest storing the large amounts of data in a local database, such as SQLite, MS SQL Server or MS Access.
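For example, the n-gram frequency tables could be persisted in SQLite; a minimal sketch, assuming the RSQLite package (not used in this report) is installed:
library(RSQLite)
# Store the Twitter 2-gram frequencies in a local SQLite database
con <- dbConnect(SQLite(), "ngrams.db")
dbWriteTable(con, "twitter_bigrams",
             data.frame(ngram = names(freq_twitter_two), freq = freq_twitter_two),
             overwrite = TRUE)
dbDisconnect(con)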
As was noted before, the data are very heterogeneous: line lengths range from a single character up to roughly 41,000 characters, and the sizes of the data sets themselves also differ considerably.
Stop words and swear words were not removed. In some cases, it might also be advisable to filter out profanity (depending on the type and purpose of the application and on who the final user is).
Among the most frequent words we find articles, pronouns, particles, auxiliary verbs and prepositions.
In the next steps of the capstone project, we'll dig deeper into predictive modelling. We might base it on n-gram tokenization, explore text mining techniques such as "Bag of Words", apply sentiment analysis to improve predictions, and so on. Having settled on a prediction model, we will continue the project by designing and developing a Shiny application.
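As a rough illustration of the direction (not the final model), the 2-gram frequencies already computed could drive a naive next-word guess; the helper predict_next below is hypothetical:
# Guess the next word as the second half of the most frequent 2-gram
# starting with the given word (illustrative only)
predict_next <- function(word, bigram_freq) {
    hits <- bigram_freq[startsWith(names(bigram_freq), paste0(word, " "))]
    if (length(hits) == 0) return(NA_character_)
    sub("^\\S+\\s+", "", names(which.max(hits)))
}
predict_next("for", freq_twitter_two)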
The session information is attached below:
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Russian_Russia.1251 LC_CTYPE=Russian_Russia.1251
## [3] LC_MONETARY=Russian_Russia.1251 LC_NUMERIC=C
## [5] LC_TIME=Russian_Russia.1251
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] wordcloud_2.6 RColorBrewer_1.1-2 tm_0.7-5
## [4] NLP_0.2-0 stringi_1.1.7 ggplot2_3.0.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.17 pillar_1.2.3 compiler_3.5.0 plyr_1.8.4
## [5] bindr_0.1.1 tools_3.5.0 digest_0.6.15 evaluate_0.10.1
## [9] tibble_1.4.2 gtable_0.2.0 pkgconfig_2.0.1 rlang_0.2.1
## [13] yaml_2.1.19 parallel_3.5.0 bindrcpp_0.2.2 withr_2.1.2
## [17] dplyr_0.7.5 stringr_1.3.1 knitr_1.20 xml2_1.2.0
## [21] rprojroot_1.3-2 grid_3.5.0 tidyselect_0.2.4 glue_1.2.0
## [25] R6_2.2.2 rmarkdown_1.10 purrr_0.2.5 magrittr_1.5
## [29] backports_1.1.2 scales_1.0.0 htmltools_0.3.6 assertthat_0.2.0
## [33] colorspace_1.3-2 labeling_0.3 lazyeval_0.2.1 munsell_0.5.0
## [37] slam_0.1-43