1 GETTING AND CLEANING THE DATA
INTRODUCTION
This report is part of the Johns Hopkins Data Science Specialization capstone. The assignment requires analysing several files of common texts in order to model typical word usage and to develop an algorithm that suggests the next word to the user, much as SwiftKey's software does.
Source of the data: Content archived from heliohost.org on September 30, 2016 and retrieved via Wayback Machine on April 24, 2017.
The data was collected by a web crawler, which gives us access to the published text. The zip contains several languages: 1) German, 2) English (United States), 3) Finnish and 4) Russian, but I am going to process only the English files. In any case, everything developed in this assignment can be applied to the other languages.
Reading the English files:
# download and unzip of the files
fileName <- "Coursera-SwiftKey.zip"
if (file.exists(fileName)) file.remove(fileName)
## [1] TRUE
unlink("Final", recursive = TRUE)
url.file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url.file, destfile="Coursera-SwiftKey.zip")
unzip(fileName)
# Using read_lines from the readr package, which is very efficient at reading large text files
txtdatatmp <- tbl_df(read_lines(file="./final/en_US/en_US.blogs.txt"))
txtdatatmp2 <- tbl_df(read_lines(file="./final/en_US/en_US.news.txt"))
txtdatatmp3 <- tbl_df(read_lines(file="./final/en_US/en_US.twitter.txt"))
names(txtdatatmp) <- c("phrases")
names(txtdatatmp2) <- c("phrases")
names(txtdatatmp3) <- c("phrases")
# Combine the three files using dplyr's bind_rows, an efficient alternative to rbind
txtdata.all <- bind_rows(txtdatatmp, txtdatatmp2, txtdatatmp3)
names(txtdata.all) <- c("phrases")
# I'm selecting a sample of the phrases available
# txtdata <- txtdata.all
txtdata <- sample_n(txtdata.all,100000)
names(txtdata) <- c("phrases")
attach(txtdata)
Summary of the data sets:
blogs.size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
resume <- data.table(
  source = c("blogs", "news", "twitter"),
  size.MB = as.character(c(round(blogs.size, digits = 1), round(news.size, digits = 1), round(twitter.size, digits = 1))),
  num.lines = c(length(txtdatatmp$phrases), length(txtdatatmp2$phrases), length(txtdatatmp3$phrases)))
kable(resume)

| source  | size.MB | num.lines |
|---|---|---|
| blogs   | 200.4   | 899288    |
| news    | 196.3   | 1010242   |
| twitter | 159.4   | 2360148   |
I'm going to use the tm package to clean the data: removing URLs, numbers and punctuation, converting to lower case, removing English stopwords, stripping whitespace, and stemming.
corpora <- VCorpus(VectorSource(paste(unlist(txtdata), collapse =" ")))
removeURL <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)
corpora <- tm_map(corpora, content_transformer(removeURL))
corpora <- tm_map(corpora, removeNumbers)
corpora <- tm_map(corpora, removePunctuation)
corpora <- tm_map(corpora, content_transformer(tolower))
corpora <- tm_map(corpora, removeWords, stopwords("en"))
corpora <- tm_map(corpora, stripWhitespace)
corpora <- tm_map(corpora, stemDocument, language="en")
Creating the term-document matrix and inspecting it:
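The code that builds the matrix is not shown in the report; a minimal sketch, assuming the result is stored in dtm.data (the object the frequency table and Zipf fit below refer to):
# Build a document-term matrix from the cleaned corpus and inspect it
dtm.data <- DocumentTermMatrix(corpora)
inspect(dtm.data)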
## <<DocumentTermMatrix (documents: 1, terms: 4668)>>
## Non-/sparse entries: 4668/0
## Sparsity : 0%
## Maximal term length: 31
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs can day get just like make one said time will
## 1 52 51 80 84 72 55 71 81 61 83
2 EXPLORATORY DATA ANALYSIS
Zipf's law is an empirical law, formulated using mathematical statistics, stating that many types of data studied in the physical and social sciences can be approximated by a Zipfian distribution, one of a family of related discrete power-law probability distributions. On a log-log plot of word frequency against rank, the observed frequencies fall close to a straight line, and the fitted slope of about -0.78 shown below is in the same ballpark as the slope of -1 that Zipf's law predicts.
## (Intercept) x
## 6.3840979 -0.7842895
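The coefficient names above, (Intercept) and x, match what tm's Zipf_plot() returns, so the fit was plausibly produced with a call along these lines (an assumption; the original call is not shown):
# Zipf_plot() plots log frequency against log rank for the matrix and
# returns the coefficients of the fitted straight line
Zipf_plot(dtm.data)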
Most frequent words
FreqMat <- data.frame(ST = rownames(t(dtm.data)),
                      Freq = rowSums(t(dtm.data)),
                      row.names = NULL)
FreqMat <- FreqMat[order(FreqMat$Freq, decreasing=TRUE),]
ranking<- FreqMat[1:20,]
ranking$ST <- factor(ranking$ST, levels=ranking$ST)
fig <- ggplot(ranking, aes(x = ST, y = Freq)) + geom_bar(stat = "identity")
fig <- fig + xlab("Word in Corpus")
fig <- fig + ylab("Word Count")
fig <- fig + theme_economist()
fig <- fig + coord_flip()
print(fig)
Word cloud
wordcloud(words = FreqMat$ST, freq = FreqMat$Freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Topics model
| Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 |
|---|---|---|---|---|---|---|---|
| said | just | said | said | get | will | get | like |
| like | one | like | one | just | get | time | year |
| just | get | just | just | like | can | one | will |
| make | will | time | time | say | said | love | time |
| can | think | get | will | will | one | make | say |
| day | said | know | make | know | love | just | said |
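The code behind the topic table is not shown; a minimal sketch using the topicmodels package (attached in the appendix), assuming a hypothetical document-term matrix dtm.phrases built with one row per phrase rather than the single concatenated document used above:
library(topicmodels)

# Drop empty rows, fit an 8-topic LDA model (8 matches the table above)
# and list the six most probable terms per topic; dtm.phrases is an assumed
# per-phrase DocumentTermMatrix, not an object from the code above
dtm.phrases <- dtm.phrases[slam::row_sums(dtm.phrases) > 0, ]
lda.fit <- LDA(dtm.phrases, k = 8, control = list(seed = 1234))
terms(lda.fit, 6)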
3 MODELING
Identifying appropriate tokens such as words, punctuation, and numbers, and writing functions that take text as input and return a tokenized version of it. The helpers below compute term frequencies from a term-document matrix and define bi-, tri- and quad-gram tokenizers based on RWeka's NGramTokenizer.
getFreq <- function(tdm) {
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
return(data.frame(word = names(freq), freq = freq))
}
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
quadgram <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
makePlot <- function(data) {
ggplot(data[1:30,], aes(reorder(word, -freq), freq)) +
labs(x="", y = "Frequency") +
theme_economist() +
coord_flip()+
theme(axis.text.x = element_text(size = 6, hjust = 1)) +
geom_bar(stat = "identity", fill = I("grey50"))
}
Bi-gram
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpora, control = list(tokenize = bigram)), 0.9999))
makePlot(freq2)
wordcloud(words = freq2$word, freq = freq2$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Tri-gram
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpora, control = list(tokenize = trigram)), 0.9999))
makePlot(freq3)
wordcloud(words = freq3$word, freq = freq3$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Quad-gram
freq4 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpora, control = list(tokenize = quadgram)), 0.9999))
makePlot(freq4)
wordcloud(words = freq4$word, freq = freq4$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
4 NEXT STEPS
This is the initial exploratory data analysis and modeling for the capstone project. Processing time matters: although the data are not very large, the number of bigram and trigram combinations makes the computation CPU-heavy, so it is important to select an efficient model and efficient functions for the final capstone model. The initial idea is to use the bigram and trigram frequencies to predict the next word, as sketched below.
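As an illustration of that idea (not the final capstone model), a minimal "stupid backoff" sketch built on the freq3 and freq2 tables computed above, assuming each has a word column holding the n-gram as a single space-separated string and a freq column with its count:
# Predict the next word: look up the last two words in the trigram table,
# and back off to the bigram table if no trigram starts with that prefix
predict_next <- function(phrase, freq3, freq2) {
  tokens <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(tokens)
  if (n >= 2) {
    prefix <- paste(tokens[n - 1], tokens[n])
    grams <- as.character(freq3$word)
    hits <- which(startsWith(grams, paste0(prefix, " ")))
    if (length(hits) > 0) {
      best <- hits[which.max(freq3$freq[hits])]
      return(sub(".* ", "", grams[best]))
    }
  }
  grams <- as.character(freq2$word)
  hits <- which(startsWith(grams, paste0(tokens[n], " ")))
  if (length(hits) > 0) {
    best <- hits[which.max(freq2$freq[hits])]
    return(sub(".* ", "", grams[best]))
  }
  NA_character_
}

# Example: suggest the word most likely to follow "one of the"
predict_next("one of the", freq3, freq2)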
5 APPENDIX. ENVIRONMENT USED
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DT_0.10 lda_1.4.2 topicmodels_0.2-9
## [4] wordcloud_2.6 RColorBrewer_1.1-2 RWeka_0.4-41
## [7] SnowballC_0.6.0 tm_0.7-7 NLP_0.2-0
## [10] GGally_1.4.0 factoextra_1.0.6 plotly_4.9.1
## [13] knitr_1.26 randomForest_4.6-14 rattle_5.3.0
## [16] rpart.plot_3.0.8 rpart_4.1-15 caret_6.0-84
## [19] lattice_0.20-38 rlist_0.4.6.1 ggthemes_4.2.0
## [22] R.utils_2.9.2 R.oo_1.23.0 R.methodsS3_1.7.1
## [25] data.table_1.12.6 reshape2_1.4.3 reshape_0.8.8
## [28] ggcorrplot_0.1.3 ggplot2_3.2.1 dplyr_0.8.3
## [31] stringr_1.4.0 readr_1.3.1
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-142 lubridate_1.7.4 httr_1.4.1 tools_3.6.1
## [5] backports_1.1.5 R6_2.4.1 lazyeval_0.2.2 colorspace_1.4-1
## [9] nnet_7.3-12 withr_2.1.2 tidyselect_0.2.5 compiler_3.6.1
## [13] xml2_1.2.2 labeling_0.3 slam_0.1-47 scales_1.1.0
## [17] digest_0.6.23 rmarkdown_1.18 pkgconfig_2.0.3 htmltools_0.4.0
## [21] highr_0.8 RWekajars_3.9.3-2 htmlwidgets_1.5.1 rlang_0.4.2
## [25] prettydoc_0.3.1 farver_2.0.1 generics_0.0.2 jsonlite_1.6
## [29] ModelMetrics_1.2.2 magrittr_1.5 modeltools_0.2-22 Matrix_1.2-18
## [33] Rcpp_1.0.3 munsell_0.5.0 lifecycle_0.1.0 stringi_1.4.3
## [37] yaml_2.2.0 MASS_7.3-51.4 plyr_1.8.4 recipes_0.1.7
## [41] grid_3.6.1 parallel_3.6.1 ggrepel_0.8.1 crayon_1.3.4
## [45] splines_3.6.1 hms_0.5.2 zeallot_0.1.0 pillar_1.4.2
## [49] codetools_0.2-16 stats4_3.6.1 glue_1.3.1 evaluate_0.14
## [53] vctrs_0.2.0 foreach_1.4.7 gtable_0.3.0 purrr_0.3.3
## [57] tidyr_1.0.0 assertthat_0.2.1 xfun_0.11 gower_0.2.1
## [61] prodlim_2019.11.13 class_7.3-15 survival_3.1-8 viridisLite_0.3.0
## [65] timeDate_3043.102 tibble_2.1.3 rJava_0.9-11 iterators_1.0.12
## [69] lava_1.6.6 ipred_0.9-9