This is the first part of the Capstone project for Coursera's Data Science and Data Science: Statistics and Machine Learning Specializations.
The final goal of the project is to develop an application that predicts the next word given some text input. The exploratory data analysis provided in this report serves as preparation for building a predictive model and a data product based on a predictive algorithm.
The training data set, containing a sample corpus of texts in several languages, can be downloaded here. This project focuses on the English texts in the data set, which come from blogs, news and Twitter.
library(R.utils); library(readr); library(data.table)
library(knitr); library(ngram); library(dplyr)
library(quanteda); library(stringi); library(tidytext)
library(ggplot2); library(plotly); library(tidyr)
library(wordcloud2); library(tidyverse); library(wordcloud)
if(!dir.exists("./data")) dir.create("./data")
if(!dir.exists("./data/1225_DS-CS-w2_WordsEDA")) dir.create("./data/1225_DS-CS-w2_WordsEDA")
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dest <- "./data/1225_DS-CS-w2_WordsEDA/Coursera-SwiftKey.zip"
if(!file.exists(dest)){download.file(
url = url, destfile = dest, method = "curl")}
if(!file.exists("./data/1225_DS-CS-w2_WordsEDA/final")) {unzip(dest, exdir = "./data/1225_DS-CS-w2_WordsEDA")}Within each file (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt), every line is an extract from a single post/ article/ tweet. Here is some key summaries of the three files: size, line count, word count, average number of words as well as characters per line, and length of the longest line in characters:
dir <- "./data/1225_DS-CS-w2_WordsEDA/final/en_US"
# read one source file and return its text lines plus the file size in MB
svar <- function(name) {
  path <- paste0(dir,"/en_US.",name,".txt")
  svar <- read_lines(path, skip_empty_rows = TRUE)
  size.MB <- file.info(path)$size/2^20
  list(svar, size.MB)
}
# compute line/word/character summaries for a character vector of texts
summ <- function(svar) {
  nlines <- length(svar)
  nwords <- wordcount(svar)
  nchars <- sum(nchar(svar))
  wordpl <- nwords/nlines              # average words per line
  charpl <- nchars/nlines              # average characters per line
  maxline <- max(sapply(svar, nchar))  # longest line in characters
  dt <- data.table(nlines, nwords, wordpl, charpl, maxline, nchars)
  dt
}
blogs<-svar("blogs")
twitter<-svar("twitter")
news<-svar("news")
siblogs<-blogs[[2]]; blogs<-blogs[[1]]
sitwitter <- twitter[[2]]; twitter <- twitter[[1]]
sinews <- news[[2]]; news <- news[[1]]
sblogs <- cbind(name="en_US.blogs.txt", size= siblogs, summ(blogs))
stwitter <- cbind(name="en_US.twitter.txt", size= sitwitter, summ(twitter))
snews <- cbind(name="en_US.news.txt", size= sinews, summ(news))
sdata <- data.table(name="aggregated data", size= siblogs+sitwitter+sinews,
nlines=sblogs$nlines+stwitter$nlines+snews$nlines,
nwords=sblogs$nwords+stwitter$nwords+snews$nwords,
wordpl = (sblogs$nwords+stwitter$nwords+snews$nwords)/(
sblogs$nlines+stwitter$nlines + snews$nlines),
charpl = (sblogs$nchars+stwitter$nchars+snews$nchars)/(
sblogs$nlines+stwitter$nlines + snews$nlines),
maxline = max(sblogs$maxline, stwitter$maxline, snews$maxline))
sfull<- rbind(sblogs[,1:7],stwitter[,1:7],snews[,1:7], sdata)
kable(sfull, digits = 4, caption = "Table 1: Summary of the original data",
col.names = c("source","file size (MB)","line count","word count",
"avg words/line","avg chars/line", "longest line (chars)"))| source | file size (MB) | line count | word count | avg words/line | avg chars/line | longest line (chars) |
|---|---|---|---|---|---|---|
| en_US.blogs.txt | 200.4242 | 899288 | 37334131 | 41.5152 | 229.9870 | 40833 |
| en_US.twitter.txt | 159.3641 | 2360148 | 30373543 | 12.8693 | 68.6805 | 140 |
| en_US.news.txt | 196.2775 | 1010242 | 34372530 | 34.0241 | 201.1628 | 11384 |
| aggregated data | 556.0658 | 4269678 | 102080204 | 23.9082 | 134.0016 | 40833 |
The original files are quite large: \(0.899\) million blog records, \(1.01\) million news items, and \(2.36\) million tweets. The line counts differ, but each file contains roughly \(30\)-\(37\) million words in total. The blogs and news data look similar in average number of words and characters per line (blogs are slightly longer); Twitter lines, as expected, are much shorter.
The longest line in the en_US.blogs.txt file contains 40833 characters, while the longest line in en_US.twitter.txt contains the expected 140 characters, matching Twitter's character limit at the time (content archived from heliohost.org on September 30, 2016).
The next task is to perform exploratory analysis of the data to get familiar with it and to understand the underlying features and relationships. However, processing the original files at their full size pushed up against R's memory limits and ran slowly. To speed up the analysis, \(15\%\) of the lines from each file are sampled for the purpose of this report.
15% sample from each source
blogs <- tibble(text = blogs)
news <- tibble(text = news)
twitter <- tibble(text = twitter)
set.seed(123)
rate <- 0.15
blogs.sample <- blogs %>% slice_sample(., n=nrow(blogs)*rate)
twitter.sample <- twitter %>% slice_sample(., n=nrow(twitter)*rate)
news.sample <- news %>% slice_sample(., n=nrow(news)*rate)
data.sample <- bind_rows(mutate(blogs.sample, source="blogs"),
mutate(twitter.sample, source= "twitter"),
mutate(news.sample, source="news"))
data.sample$source <- as.factor(data.sample$source)
Sampled data summary (analogous to that for the original data):
sblogs <- cbind(name="blogs", summ(blogs.sample$text))
stwitter <- cbind(name="twitter", summ(twitter.sample$text))
snews <- cbind(name="news", summ(news.sample$text))
sdata <- cbind(name="aggregated sample", summ(data.sample$text))
sfull<- rbind(sblogs[,1:6],stwitter[,1:6],snews[,1:6], sdata[,1:6])
kable(sfull, digits = 4, caption = "Table 2: Summary of the sampled data",
col.names = c("source", "line count","word count",
"avg words/line","avg chars/line", "longest line (chars)"))| source | line count | word count | avg words/line | avg chars/line | longest line (chars) |
|---|---|---|---|---|---|
| blogs | 134893 | 5592164 | 41.4563 | 229.6845 | 12409 |
| twitter | 354022 | 4559255 | 12.8785 | 68.7258 | 140 |
| news | 151536 | 5142591 | 33.9364 | 200.6525 | 2042 |
| aggregated sample | 640451 | 15294010 | 23.8801 | 133.8422 | 12409 |
rm(blogs, twitter, news, siblogs, sitwitter, sinews, sblogs, stwitter, snews,
   sfull, dir, rate)
The sample used for further analysis consists of 640451 lines and 15294010 words. The other characteristics are similar to those of the original data (except for the longest line, which is not surprising).
Cleaning is kept separate from tokenizing so that the unnest_tokens function can afterwards be used for both words (unigrams) and n-grams.
First, the text needs to be split into sentences, since the end of one sentence should probably not serve as a predictor for the start of the next (as in finish sentence_start sentence); a sketch of this step is shown after the cleaning code below. Then, pre-process each sentence:
Remove URLs (anything containing ://), hashtags (#...) and mentions (@...); keep apostrophes, since e.g. im/ i'm and dont/ don't are distinct words \(=>\) don't remove apostrophes.
sent.tidy <- chartr("’‘`´", "''''", data.sample$text) # unify apostrophes
sent.tidy <- gsub("[[:blank:]]#[^[:blank:]]*", " ", sent.tidy, perl = T) #hashs
sent.tidy <- gsub("[[:blank:]]@[^[:blank:]]*", " ", sent.tidy, perl = T) #at signs (@)
sent.tidy <- gsub("(https?)?://[^[:blank:]]*", " ", sent.tidy, perl = T) #urls
sent.tidy <- gsub("[^[:alnum:]']", " ", sent.tidy, perl = T) # non-alpha/numerical
sent.tidy <- gsub("[[:blank:]]+'([[:alnum:][:blank:]]+)'[[:blank:]]+", " \\1 ",
sent.tidy, perl = T) # surr apostrophes
sent.tidy <- stri_trim_both(sent.tidy) # surr blanks
sent.tidy <- gsub("[[:blank:][:digit:]+[:blank:]]", " ", sent.tidy, perl = T) # digits (note: this character class strips every digit, '+' or blank, not only standalone numbers)
sent.tidy<- gsub("[[:blank:]]{2,}", " ", sent.tidy, perl = T) # condense blanks
sent.tidy <- trimws(sent.tidy) # leading/ trailing blanks
sent.tidy <- tolower(sent.tidy)
data.sample$text <- sent.tidy
if(!file.exists("./data/1225_DS-CS-w2_WordsEDA/sample"))
{write_csv(data.sample, "./data/1225_DS-CS-w2_WordsEDA/sample.csv")}NOTE: this cleanup is not the most reliable one, since maybe not all URLs have been stripped, or something is falsely matched by the URL pattern, the texts can still contain typos, repeated letters etc. Errors left should be marginal though, and not hurt a future predictive model too much.
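The splitting into sentences mentioned above is not actually performed in the cleaning chunk. A minimal sketch of how it could be done, assuming quanteda's corpus_reshape() and applied to the raw sample before the cleaning above (sentence detection relies on the punctuation that the cleaning removes); the sent.corpus object is hypothetical and not used elsewhere in this report:
# sketch only: split the raw sampled texts into one-sentence documents
sent.corpus <- corpus(data.sample)                           # the 'text' column becomes the corpus text
sent.corpus <- corpus_reshape(sent.corpus, to = "sentences") # one document per sentence
# the cleaning steps above could then be applied to texts(sent.corpus)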
Several additional preprocessing options could be considered, each with its pros and cons:
Profanity removal
Stop-words removal
Stemming
In this report, the analysis is performed without any of these options (no stemming, no profanity or stop-words removal), while functions implementing them are provided (Appendix: rmoptions).
The effect of stop-words removal is also demonstrated and briefly discussed in Appendix: rmstops.
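For a quick illustration of what these options do, here is a toy example (the sentence and the profanity list are made up for demonstration; tokens(), tokens_remove(), tokens_wordstem() and stopwords() are quanteda functions):
toks.demo <- tokens("the cats are running faster than the dogs")
tokens_remove(toks.demo, stopwords("english"))       # stop-words removal keeps: "cats" "running" "faster" "dogs"
tokens_remove(toks.demo, c("badword1", "badword2"))  # profanity removal works the same way, given a word list
tokens_wordstem(toks.demo)                           # stemming: "cats" -> "cat", "running" -> "run", "dogs" -> "dog"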
The next step is to tokenize the data, that is, to separate it into smaller units such as words or phrases: n-grams (contiguous sequences of n items). An n-gram of size 1 is referred to as a unigram (a single word), size 2 is a bigram, and size 3 is a trigram.
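As a toy illustration (the sentence is made up; tokens_ngrams() joins the items with "_" by default):
toy <- tokens("thanks for the follow")
tokens_ngrams(toy, 1)  # unigrams: "thanks" "for" "the" "follow"
tokens_ngrams(toy, 2)  # bigrams:  "thanks_for" "for_the" "the_follow"
tokens_ngrams(toy, 3)  # trigrams: "thanks_for_the" "for_the_follow"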
# data.sample <- read_csv("./data/1225_DS-CS-w2_WordsEDA/sample.csv")
# words.tidy<- data.sample%>% unnest_tokens(word,text) - just example: all words (w\o numbers(?) & apostrophes), count: nrow(words.tidy)
data.corpus<- corpus(data.sample)
toks1<- tokens(data.corpus) # to look at a whole sentence: data.corpus[["text640451"]]/ toks1[["text640451"]], count: sum(ntoken(data.corpus))/ sum(ntoken(toks1))
# toks1<- toksrp(toks1) # remove profanities
# toks1 <- toksrs(toks1) # remove stop-words
toks2 <- tokens_ngrams(toks1, 2)
toks3 <- tokens_ngrams(toks1, 3)
astr<- dfm(toks1, groups = docvars(toks1, "source")) # document feature matrix (for plots)
astr2<- dfm(toks2, groups = docvars(toks1, "source"))
astr3<- dfm(toks3, groups = docvars(toks1, "source"))
# astr<- stemm(astr); astr2<- stemm(astr2); astr3<- stemm(astr3) # stemming
Just take a look at what the cleaned data looks like:
data.sample
# A tibble: 640,451 x 2
text source
<chr> <fct>
1 the bruschetta however missed the mark instead of manageable two bite… blogs
2 walden pond mt rainier big sur everglades and so forth blogs
3 despite laws banning cell phones while driving and increased awarenes… blogs
4 ghosts and goblins blogs
5 now i can write in specific post information for each day of the week… blogs
6 but trying to pin photos to muslin walls would be a bit too tricky blogs
7 she and rosso had been fruiting around because they are bored and pen… blogs
8 lastly has anyone seen the new harry potter movie if you're planning … blogs
9 while i generally enjoyed this movie there were a few things that did… blogs
10 accessories martha stewart floral border punch marvy notched corner p… blogs
# … with 640,441 more rows
Each entry is on a separate line, and there are no uppercase letters, extra characters, numbers or punctuation left, just apostrophes.
Coverage of Corpus by unigrams
There are 201550 unique words in the cleaned corpus. Count their frequency, and then see how many of them are required to cover \(50\%-80\%-90\%\) of the whole sample.
# frequency table with cumulative coverage, plus the number of n-grams needed for 50/80/90% coverage
tstat<- function(dfm) {
tstat<-textstat_frequency(dfm)
tstat <- tibble(tstat) %>%
transmute(ngram=feature, frequency, nwords=1:nrow(tstat),
coverage = cumsum(frequency)/sum(frequency))
cover50 <- min(which(tstat$coverage>=0.5))
cover80 <- min(which(tstat$coverage>=0.8))
cover90 <- min(which(tstat$coverage>=0.9))
list(tstat, c(cover50, cover80, cover90))
}
tstat1 <- tstat(astr); cover1 <- tstat1[[2]]; tstat1<-tstat1[[1]]
ggplot(tstat1, aes(x=nwords, y=coverage)) +
geom_line(colour="cornflowerblue", size=1.3) +
geom_vline(aes(xintercept=cover1[1],colour="50%"),
linetype="longdash", size=1.1) +
geom_vline(aes(xintercept=cover1[2],colour="80%"),
linetype="longdash", size=1.1) +
geom_vline(aes(xintercept=cover1[3],colour="90%"),
linetype="longdash", size=1.1) +
scale_color_manual(name=NULL, values=c(`50%`="brown", `80%`="purple",
`90%`="violet"))+
scale_x_continuous(limits =c(NA, 10000))+
scale_y_continuous(labels = scales::percent) +
theme(axis.line = element_line(size = 3, colour = "grey80")) +
labs(x = "words count", y = "cumulative %" ) +
ggtitle("Corpus coverage by words (unigrams)")Only 142 words (0.07\(\%\)) are required to fill 50% of the sample corpus. It takes 2304 (1.14\(\%\)) words for 80% coverage, and 7094 (3.52\(\%\)) - for 90% coverage.
Frequency
The higher the frequency of words or word combinations in the corpus, the higher the probability that a user will enter them in a future application. So, visualize (interactively: move your mouse over the cloud) the most common words in the data sample (word size/color represents frequency):
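As a rough illustration of this idea, the unigram frequency table tstat1 computed above can be turned into naive (maximum-likelihood) probability estimates; the prob column is introduced only for this sketch and not used elsewhere:
tstat1 %>%
  mutate(prob = frequency / sum(frequency)) %>%  # relative frequency as a naive probability estimate
  select(ngram, frequency, prob) %>%
  slice_head(n = 5)                              # the five words a user is most likely to type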
top1.100<- textstat_frequency(astr, n=100)
wordcloud2(top1.100, size = 1.5, backgroundColor = "lightsteelblue")
The most frequent words are ‘the’, ‘to’, ‘and’, ‘a’, ‘of’, which is not surprising since stop-words haven’t been removed.
Look now at the frequency of the 25 most common words in the context of groups (blogs, news, twitter):
gfreq<- function(dfm) {
top.25<- textstat_frequency(dfm, n=25, groups = docnames(dfm))
colnames(top.25)[5]<- "source"
ggplot(top.25, aes(reorder(feature, frequency), frequency,
fill=source)) +
geom_bar(stat = "identity", position = "dodge") +
coord_flip() +
labs(x = "most frequent n-grams", y = "n-gram frequency")+
facet_wrap(~ source) +
theme(axis.line = element_line(size = 3, colour = "grey80"),
axis.text.x=element_text(angle=20, vjust=1, hjust=0))
}
gfreq(astr)
The top-word frequencies look broadly similar across sources, particularly for news and blogs.
Coverage
tstat2 <- tstat(astr2); cover2 <- tstat2[[2]]; tstat2<-tstat2[[1]]
tstat3 <- tstat(astr3); cover3 <- tstat3[[2]]; tstat3<-tstat3[[1]]
ngrams <- c("unigrams", "bigrams", "trigrams")
total <- c(sum(astr), sum(astr2), sum(astr3))
uni <- c(ncol(astr), ncol(astr2), ncol(astr3))
cover50 <- round(c(100*cover1[1]/ncol(astr), 100*cover2[1]/ncol(astr2),
100*cover3[1]/ncol(astr3)),2)
cover80 <- round(c(100*cover1[2]/ncol(astr), 100*cover2[2]/ncol(astr2),
100*cover3[2]/ncol(astr3)),2)
cover90 <- round(c(100*cover1[3]/ncol(astr), 100*cover2[3]/ncol(astr2),
100*cover3[3]/ncol(astr3)),2)
tab<- cbind(ngrams, total, uni, cover50, cover80, cover90)
kable(tab, caption = "Table 3: n-grams comparison",
col.names = c("n-grams", "total","unique", "cover50, %", "cover80, %",
"cover90, %"))| n-grams | total | unique | cover50, % | cover80, % | cover90, % |
|---|---|---|---|---|---|
| unigrams | 15104877 | 201550 | 0.07 | 1.14 | 3.52 |
| bigrams | 14464677 | 3491872 | 1.1 | 22.07 | 58.58 |
| trigrams | 13828228 | 8880303 | 22.14 | 68.86 | 84.43 |
The total number of n-grams decreases from unigrams to trigrams, while the number of unique n-grams increases. The share of unique n-grams required for coverage grows significantly: covering 90% of all unigram occurrences requires only 3.52\(\%\) of the unique unigrams, whereas for trigrams it requires 84.43\(\%\) of them, i.e. almost all.
Explore the frequency of bigrams/ trigrams in total, and in the context of groups (blogs, news, twitter).
Bigrams frequency
top2.100<- textstat_frequency(astr2, n=150)
wordcloud(top2.100$feature, top2.100$frequency, scale=c(4,.6),
colors=brewer.pal(8, "Dark2"))
gfreq(astr2)
Trigrams frequency
top3.100<- textstat_frequency(astr3, n=70)
wordcloud(top3.100$feature, top3.100$frequency, scale=c(3,.4),
colors=brewer.pal(8, "Dark2"))
gfreq(astr3)
As n increases, so do both the diversity of n-grams and the differences between the sources (blogs, news, twitter).
For building an accurate and fast predictive model, it seems sensible to rely on the n-gram frequency tables explored above, among other things.
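As an illustration of where this analysis leads, a hypothetical predict_next() helper (a sketch only, not part of the report's code) could look up the most frequent trigram continuation of the last two words in the tstat3 table built above and back off to the bigram table tstat2. It relies on textstat_frequency() returning n-grams sorted by decreasing frequency and on tokens_ngrams() joining words with "_":
# sketch of a simple frequency-based back-off predictor (illustration only)
predict_next <- function(prev, bi = tstat2, tri = tstat3) {
  w <- tolower(strsplit(trimws(prev), "[[:blank:]]+")[[1]])
  n <- length(w)
  if (n >= 2) {                                              # try trigrams: prefix "w1_w2_"
    hit <- tri$ngram[startsWith(tri$ngram, paste0(w[n - 1], "_", w[n], "_"))]
    if (length(hit) > 0) return(sub(".*_", "", hit[1]))
  }
  hit <- bi$ngram[startsWith(bi$ngram, paste0(w[n], "_"))]   # back off to bigrams: prefix "w_"
  if (length(hit) > 0) return(sub(".*_", "", hit[1]))
  "the"                                                      # fall back to the most frequent unigram
}
# predict_next("one of")   # returns the most frequent word observed after "one of"
A real application would likely also prune rare n-grams (as the coverage figures above suggest) and use a more compact storage format, but the frequency tables already contain the core information.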
Appendix: rmoptions
Ways of filtering profanities/ stop-words, and stemming
# load profanity file
loader <- function() {
if(!dir.exists("./data")) dir.create("./data")
if(!dir.exists("./data/1225_DS-CS-w2_WordsEDA")) dir.create("./data/1225_DS-CS-w2_WordsEDA")
if(!dir.exists("./data/1225_DS-CS-w2_WordsEDA/ignore"))
dir.create("./data/1225_DS-CS-w2_WordsEDA/ignore")
url <- "https://www.freewebheaders.com/download/files/full-list-of-bad-words_csv-file_2021_01_18.zip"
dest <- "./data/1225_DS-CS-w2_WordsEDA/full-list-of-bad-words_csv-file_2021_01_18.zip"
if(!file.exists(dest)){download.file(
url = url, destfile = dest, method = "curl")}
if(!file.exists("./data/1225_DS-CS-w2_WordsEDA/ignore/full-list-of-bad-words_csv-file_2021_01_18.csv")) {
unzip(dest, exdir = "./data/1225_DS-CS-w2_WordsEDA/ignore")}
####### another option #######: a publicly kept profanity list from https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
# url <- "https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/5faf2ba42d7b1c0977169ec3611df25a3c08eb13/en"
# dest <- "./data/1225_DS-CS-w2_WordsEDA/ignore"
# if(!file.exists(dest)){download.file(url = url, destfile = dest, method = "curl")}
#######
}
# remove profanities
toksrp <- function(toks){
library(readr); library(quanteda)
loader()
ignore <- read_csv(
"./data/1225_DS-CS-w2_WordsEDA/ignore/full-list-of-bad-words_csv-file_2021_01_18.csv",
col_names = FALSE)
####### with another option #######:
# ignore <- read_tsv("./data/1225_DS-CS-w2_WordsEDA/ignore", col_names = FALSE)
#######
ignore <- ignore$X1
toksrp <- tokens_remove(toks, ignore)
toksrp
}
# remove stop-words
toksrs <- function(toks) {
library(quanteda)
toksrs <- tokens_remove(toks, stopwords("english"))
toksrs
}
# stemming
stemm <- function(dfm) {
library(quanteda)
stemm <- dfm(dfm, stem=TRUE)
stemm
}
Appendix: rmstops
Stop-words removal in brief
The most common words except for stop-words:
toks01 <- toksrs(toks1)
toks02 <- tokens_ngrams(toks01, 2)
astr0<- dfm(toks01, groups = docvars(toks01, "source"))
astr02<- dfm(toks02, groups = docvars(toks01, "source"))
top01.100<- textstat_frequency(astr0, n=120)
wordcloud(top01.100$feature, top01.100$frequency, scale=c(3,.6),
colors=brewer.pal(8, "Dark2"))
Frequency of the 25 most common bigrams (excluding stop-words) in the context of groups:
gfreq(astr02)
For comparison with the variant where stop-words are kept:
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] wordcloud_2.6 RColorBrewer_1.1-2 forcats_0.5.0 stringr_1.4.0
[5] purrr_0.3.4 tibble_3.0.4 tidyverse_1.3.0 wordcloud2_0.2.1
[9] tidyr_1.1.2 plotly_4.9.2.1 ggplot2_3.3.2 tidytext_0.2.6
[13] stringi_1.5.3 quanteda_2.1.2 dplyr_1.0.2 ngram_3.0.4
[17] knitr_1.30 data.table_1.13.4 readr_1.4.0 R.utils_2.10.1
[21] R.oo_1.24.0 R.methodsS3_1.8.1
loaded via a namespace (and not attached):
[1] httr_1.4.2 jsonlite_1.7.2 viridisLite_0.3.0
[4] modelr_0.1.8 RcppParallel_5.0.2 assertthat_0.2.1
[7] highr_0.8 cellranger_1.1.0 yaml_2.2.1
[10] pillar_1.4.7 backports_1.2.1 lattice_0.20-41
[13] glue_1.4.2 digest_0.6.27 rvest_0.3.6
[16] colorspace_2.0-0 htmltools_0.5.1.1 Matrix_1.2-18
[19] pkgconfig_2.0.3 ISOcodes_2020.12.04 broom_0.7.2
[22] haven_2.3.1 scales_1.1.1 farver_2.0.3
[25] generics_0.1.0 usethis_2.0.0 ellipsis_0.3.1
[28] withr_2.3.0 lazyeval_0.2.2 cli_2.2.0
[31] magrittr_2.0.1 crayon_1.3.4 readxl_1.3.1
[34] evaluate_0.14 stopwords_2.1 tokenizers_0.2.1
[37] janeaustenr_0.1.5 fs_1.5.0 fansi_0.4.1
[40] SnowballC_0.7.0 xml2_1.3.2 tools_4.0.3
[43] hms_0.5.3 lifecycle_0.2.0 munsell_0.5.0
[46] reprex_0.3.0 compiler_4.0.3 rlang_0.4.9
[49] grid_4.0.3 rstudioapi_0.13 htmlwidgets_1.5.2
[52] labeling_0.4.2 rmarkdown_2.5 gtable_0.3.0
[55] DBI_1.1.0 R6_2.5.0 lubridate_1.7.9.2
[58] utf8_1.1.4 fastmatch_1.1-0 Rcpp_1.0.5
[61] vctrs_0.3.5 dbplyr_2.0.0 tidyselect_1.1.0
[64] xfun_0.19