1 GETTING AND CLEANING THE DATA
INTRODUCTION
This report is part of the Johns Hopkins Data Science Specialization capstone. The assignment requires analysing several files of common texts in order to model typical word usage and to develop an algorithm that suggests the next word to the user, much as SwiftKey's software does.
Source of the data: Content archived from heliohost.org on September 30, 2016 and retrieved via Wayback Machine on April 24, 2017.
The data was collected by a web crawler, which gives us access to the published text. The zip contains several languages: 1) German, 2) English (United States), 3) Finnish and 4) Russian, but I am going to process only the English files. In any case, everything developed in this assignment can be applied to the other languages.
Reading the English files:
# download and unzip of the files
fileName <- "Coursera-SwiftKey.zip"
if (file.exists(fileName)) file.remove(fileName)
## [1] TRUE
unlink("Final", recursive = TRUE)
url.file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url.file, destfile="Coursera-SwiftKey.zip")
unzip(fileName)
# Using read_lines from the readr package, which is very efficient at reading large text files
txtdatatmp <- tbl_df(read_lines(file="./final/en_US/en_US.blogs.txt"))
txtdatatmp2 <- tbl_df(read_lines(file="./final/en_US/en_US.news.txt"))
txtdatatmp3 <- tbl_df(read_lines(file="./final/en_US/en_US.twitter.txt"))
names(txtdatatmp) <- c("phrases")
names(txtdatatmp2) <- c("phrases")
names(txtdatatmp3) <- c("phrases")
# Combine the three files using dplyr's bind_rows, an efficient alternative to rbind
txtdata.all <- bind_rows(txtdatatmp, txtdatatmp2, txtdatatmp3)
names(txtdata.all) <- c("phrases")
# I'm selecting a sample of the phrases available
# txtdata <- txtdata.all
txtdata <- sample_n(txtdata.all,100000)
names(txtdata) <- c("phrases")
attach(txtdata)
Summary of the data sets:
blogs.size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("final/en_US/en_US.news.txt")$size / 1024 ^ 2
twitter.size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 ^ 2
resume <- data.table(
  source = c("blogs", "news", "twitter"),
  size.MB = as.character(c(round(blogs.size, digits = 1), round(news.size, digits = 1), round(twitter.size, digits = 1))),
  num.lines = c(length(txtdatatmp$phrases), length(txtdatatmp2$phrases), length(txtdatatmp3$phrases)))
kable(resume)

| source  | size.MB | num.lines |
|---|---|---|
| blogs   | 200.4   | 899288    |
| news    | 196.3   | 1010242   |
| twitter | 159.4   | 2360148   |
I'm going to use the tm package to clean the data: removing URLs, numbers and punctuation, converting to lower case, removing English stopwords, stripping whitespace, and stemming.
corpora <- VCorpus(VectorSource(paste(unlist(txtdata), collapse =" ")))
removeURL <- function(x) gsub("http[[:alnum:][:punct:]]*", "", x)
corpora <- tm_map(corpora, content_transformer(removeURL))
corpora <- tm_map(corpora, removeNumbers)
corpora <- tm_map(corpora, removePunctuation)
corpora <- tm_map(corpora, content_transformer(tolower))
corpora <- tm_map(corpora, removeWords, stopwords("en"))
corpora <- tm_map(corpora, stripWhitespace)
corpora <- tm_map(corpora, stemDocument, language="en")
Creating the term-document matrix and inspecting it:
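The code that builds the matrix is not shown in the report; a minimal sketch, assuming the result is stored in dtm.data (the object the frequency table and Zipf fit below refer to):
# Build a document-term matrix from the cleaned corpus and inspect it
dtm.data <- DocumentTermMatrix(corpora)
inspect(dtm.data)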
## <<DocumentTermMatrix (documents: 1, terms: 4668)>>
## Non-/sparse entries: 4668/0
## Sparsity : 0%
## Maximal term length: 31
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs can day get just like make one said time will
## 1 52 51 80 84 72 55 71 81 61 83
2 EXPLORATORY DATA ANALYSIS
Zipf's law is an empirical law, formulated using mathematical statistics, stating that many types of data studied in the physical and social sciences can be approximated by a Zipfian distribution, one of a family of related discrete power-law probability distributions. On a log-log plot of word frequency against rank, the observed frequencies fall close to a straight line, and the fitted slope of about -0.78 shown below is in the same ballpark as the slope of -1 that Zipf's law predicts.
## (Intercept) x
## 6.3840979 -0.7842895
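The coefficient names above, (Intercept) and x, match what tm's Zipf_plot() returns, so the fit was plausibly produced with a call along these lines (an assumption; the original call is not shown):
# Zipf_plot() plots log frequency against log rank for the matrix and
# returns the coefficients of the fitted straight line
Zipf_plot(dtm.data)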
Most frequent words
FreqMat <- data.frame(ST = rownames(t(dtm.data)),
                      Freq = rowSums(t(dtm.data)),
                      row.names = NULL)
FreqMat <- FreqMat[order(FreqMat$Freq, decreasing=TRUE),]
ranking<- FreqMat[1:20,]
ranking$ST <- factor(ranking$ST, levels=ranking$ST)
fig <- ggplot(ranking, aes(x = ST, y = Freq)) + geom_bar(stat = "identity")
fig <- fig + xlab("Word in Corpus")
fig <- fig + ylab("Word Count")
fig <- fig + theme_economist()
fig <- fig + coord_flip()
print(fig)
Word cloud
wordcloud(words = FreqMat$ST, freq = FreqMat$Freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Topics model
| Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 | Topic 6 | Topic 7 | Topic 8 |
|---|---|---|---|---|---|---|---|
| said | just | said | said | get | will | get | like |
| like | one | like | one | just | get | time | year |
| just | get | just | just | like | can | one | will |
| make | will | time | time | say | said | love | time |
| can | think | get | will | will | one | make | say |
| day | said | know | make | know | love | just | said |
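The code behind the topic table is not shown; a minimal sketch using the topicmodels package (attached in the appendix), assuming a hypothetical document-term matrix dtm.phrases built with one row per phrase rather than the single concatenated document used above:
library(topicmodels)

# Drop empty rows, fit an 8-topic LDA model (8 matches the table above)
# and list the six most probable terms per topic; dtm.phrases is an assumed
# per-phrase DocumentTermMatrix, not an object from the code above
dtm.phrases <- dtm.phrases[slam::row_sums(dtm.phrases) > 0, ]
lda.fit <- LDA(dtm.phrases, k = 8, control = list(seed = 1234))
terms(lda.fit, 6)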
3 MODELING
Identifying appropriate tokens such as words, punctuation, and numbers, and writing functions that take text as input and return a tokenized version of it. The helpers below compute term frequencies from a term-document matrix and define bi-, tri- and quad-gram tokenizers based on RWeka's NGramTokenizer.
getFreq <- function(tdm) {
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
return(data.frame(word = names(freq), freq = freq))
}
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
quadgram <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
makePlot <- function(data) {
ggplot(data[1:30,], aes(reorder(word, -freq), freq)) +
labs(x="", y = "Frequency") +
theme_economist() +
coord_flip()+
theme(axis.text.x = element_text(size = 6, hjust = 1)) +
geom_bar(stat = "identity", fill = I("grey50"))
}
Bi-gram
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpora, control = list(tokenize = bigram)), 0.9999))
makePlot(freq2)
wordcloud(words = freq2$word, freq = freq2$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Tri-gram
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpora, control = list(tokenize = trigram)), 0.9999))
makePlot(freq3)
wordcloud(words = freq3$word, freq = freq3$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
Quad-gram
freq4 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpora, control = list(tokenize = quadgram)), 0.9999))
makePlot(freq4)
wordcloud(words = freq4$word, freq = freq4$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
4 NEXT STEPS
This is the initial exploratory data analysis and modeling for the capstone project. Processing time matters: although the data are not very large, the number of bigram and trigram combinations makes the computation CPU-heavy, so it is important to select an efficient model and efficient functions for the final capstone model. The initial idea is to use the bigram and trigram frequencies to predict the next word, as sketched below.
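As an illustration of that idea (not the final capstone model), a minimal "stupid backoff" sketch built on the freq3 and freq2 tables computed above, assuming each has a word column holding the n-gram as a single space-separated string and a freq column with its count:
# Predict the next word: look up the last two words in the trigram table,
# and back off to the bigram table if no trigram starts with that prefix
predict_next <- function(phrase, freq3, freq2) {
  tokens <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(tokens)
  if (n >= 2) {
    prefix <- paste(tokens[n - 1], tokens[n])
    grams <- as.character(freq3$word)
    hits <- which(startsWith(grams, paste0(prefix, " ")))
    if (length(hits) > 0) {
      best <- hits[which.max(freq3$freq[hits])]
      return(sub(".* ", "", grams[best]))
    }
  }
  grams <- as.character(freq2$word)
  hits <- which(startsWith(grams, paste0(tokens[n], " ")))
  if (length(hits) > 0) {
    best <- hits[which.max(freq2$freq[hits])]
    return(sub(".* ", "", grams[best]))
  }
  NA_character_
}

# Example: suggest the word most likely to follow "one of the"
predict_next("one of the", freq3, freq2)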
5 APPENDIX. ENVIRONMENT USED
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] es_ES.UTF-8/es_ES.UTF-8/es_ES.UTF-8/C/es_ES.UTF-8/es_ES.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DT_0.10 lda_1.4.2 topicmodels_0.2-9
## [4] wordcloud_2.6 RColorBrewer_1.1-2 RWeka_0.4-41
## [7] SnowballC_0.6.0 tm_0.7-7 NLP_0.2-0
## [10] GGally_1.4.0 factoextra_1.0.6 plotly_4.9.1
## [13] knitr_1.26 randomForest_4.6-14 rattle_5.3.0
## [16] rpart.plot_3.0.8 rpart_4.1-15 caret_6.0-84
## [19] lattice_0.20-38 rlist_0.4.6.1 ggthemes_4.2.0
## [22] R.utils_2.9.2 R.oo_1.23.0 R.methodsS3_1.7.1
## [25] data.table_1.12.6 reshape2_1.4.3 reshape_0.8.8
## [28] ggcorrplot_0.1.3 ggplot2_3.2.1 dplyr_0.8.3
## [31] stringr_1.4.0 readr_1.3.1
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-142 lubridate_1.7.4 httr_1.4.1 tools_3.6.1
## [5] backports_1.1.5 R6_2.4.1 lazyeval_0.2.2 colorspace_1.4-1
## [9] nnet_7.3-12 withr_2.1.2 tidyselect_0.2.5 compiler_3.6.1
## [13] xml2_1.2.2 labeling_0.3 slam_0.1-47 scales_1.1.0
## [17] digest_0.6.23 rmarkdown_1.18 pkgconfig_2.0.3 htmltools_0.4.0
## [21] highr_0.8 RWekajars_3.9.3-2 htmlwidgets_1.5.1 rlang_0.4.2
## [25] prettydoc_0.3.1 farver_2.0.1 generics_0.0.2 jsonlite_1.6
## [29] ModelMetrics_1.2.2 magrittr_1.5 modeltools_0.2-22 Matrix_1.2-18
## [33] Rcpp_1.0.3 munsell_0.5.0 lifecycle_0.1.0 stringi_1.4.3
## [37] yaml_2.2.0 MASS_7.3-51.4 plyr_1.8.4 recipes_0.1.7
## [41] grid_3.6.1 parallel_3.6.1 ggrepel_0.8.1 crayon_1.3.4
## [45] splines_3.6.1 hms_0.5.2 zeallot_0.1.0 pillar_1.4.2
## [49] codetools_0.2-16 stats4_3.6.1 glue_1.3.1 evaluate_0.14
## [53] vctrs_0.2.0 foreach_1.4.7 gtable_0.3.0 purrr_0.3.3
## [57] tidyr_1.0.0 assertthat_0.2.1 xfun_0.11 gower_0.2.1
## [61] prodlim_2019.11.13 class_7.3-15 survival_3.1-8 viridisLite_0.3.0
## [65] timeDate_3043.102 tibble_2.1.3 rJava_0.9-11 iterators_1.0.12
## [69] lava_1.6.6 ipred_0.9-9