People are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices.
This course will start with analyzing a large corpus of text documents to discover the structure in the data and how words are put together; it covers cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, you will use the knowledge you gained in Data Products to build a predictive text product.
In this exercise, you will use the English database but may consider three other databases in German, Russian and Finnish.
This document covers the following steps:
+ Input the data: blogs, news, twitter
+ Input the profanity reference
+ Sample the data to work with a smaller data set
+ Process the data to clean the text (blogs, news, twitter)
+ Obtain text frequencies
+ Plot the top frequencies
+ Summary table of the union and intersection of the data sets
list.of.packages <- c('ggplot2','ngram','NLP','openNLP','RWeka','tm',
'formatR','knitr','kableExtra','VennDiagram')
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages,repos = "http://cran.us.r-project.org")
suppressWarnings( suppressMessages( library(ggplot2) ))
suppressWarnings( suppressMessages( library(formatR) ))
suppressWarnings( suppressMessages( library(knitr) ))
suppressWarnings( suppressMessages( library(kableExtra) ))
suppressWarnings( suppressMessages( library(VennDiagram) ))
suppressWarnings( suppressMessages( library(tm) ))
suppressWarnings( suppressMessages( library(ngram) ))
suppressWarnings( suppressMessages( library(RWeka) ))
suppressWarnings( suppressMessages( library(openNLP) ))
# suppressWarnings( suppressMessages( library(NLP) )) # loaded by tm
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
lzip <- "Coursera-SwiftKey.zip"
if (!file.exists(lzip)) {
download.file(url, destfile = lzip)
unzip(lzip)
}
en_dir <- "./final/en_US"
list.files(en_dir, pattern = "\\.txt$")
> [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
~15 seconds
url <- "./final/en_US/en_US.blogs.txt"
system.time(blogs <- readLines(url, skipNul = T))
print(paste("File size", url, file.info(url)$size, sep = ": "))
print(paste("Number of lines", length(blogs), sep = ": "))
> user system elapsed
> 13.98 0.50 15.68
> [1] "File size: ./final/en_US/en_US.blogs.txt: 210160014"
> [1] "Number of lines: 899288"
url <- "./final/en_US/en_US.news.txt"
system.time(news <- readLines(url, skipNul = T))
print(paste("File size", url, file.info(url)$size, sep = ": "))
print(paste("Number of lines", length(news), sep = ": "))
> user system elapsed
> 1.29 0.01 1.33
> [1] "File size: ./final/en_US/en_US.news.txt: 205811889"
> [1] "Number of lines: 77259"
~15 seconds
url <- "./final/en_US/en_US.twitter.txt"
system.time(twtr <- readLines(url, skipNul = T))
print(paste("File size", url, file.info(url)$size, sep = ": "))
print(paste("Number of lines", length(twtr), sep = ": "))
> user system elapsed
> 12.97 0.33 13.43
> [1] "File size: ./final/en_US/en_US.twitter.txt: 167105338"
> [1] "Number of lines: 2360148"
url <- "https://www.freewebheaders.com/wordpress/wp-content/uploads/full-list-of-bad-words-csv-file_2018_03_26_26.zip"
lzip <- "./final/en_US/full-list-of-bad-words-csv-file_2018_03_26_26.zip"
if (!file.exists(lzip)) {
download.file(url, destfile = lzip)
unzip(lzip)
}
url <- "full-list-of-bad-words-csv-file_2018_03_26.csv"
dfprof <- read.csv(url, header = F)
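A quick sanity check (illustrative; the exact numbers depend on the downloaded list and on how read.csv parses it):
dim(dfprof)          # how many terms (rows) were loaded
head(dfprof$V1, 3)   # first few entries of the word column used later for filtering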
The exploration proceeds in four parts: each data set individually (blogs, news, twitter), then combined. Why? The project is aimed at a text-completion product, and context may matter in developing the model. We want to see how the data sets differ and then consider the impact of aggregating them.
Because the use case is text entry on mobile, usage patterns from texting and similar sources are favored.
This suggests keeping punctuation, contractions, and stopwords ("to be or not to be").
There is a trade-off between data set size, model performance, and resource (CPU and memory) usage. To do exploratory analysis with a low system burden, sample the data.
The sample is skewed to favor the expected use cases, which also keeps memory usage lower.
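Before sampling, a small illustrative check of how much of each corpus the sample sizes below represent (assumes the full blogs, news, and twtr vectors are still in memory):
# Approximate sampled fraction of each source, in percent
round(100 * c(blogs = 10000/length(blogs), news = 10000/length(news), twitter = 25000/length(twtr)), 2)
# set.seed(1234)  # optional: uncomment for reproducible samples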
vblogs <- sample(blogs, 10000, replace = FALSE)
vnews <- sample(news, 10000, replace = FALSE)
vtwtr <- sample(twtr, 25000, replace = FALSE)
rm(blogs)
rm(news)
rm(twtr)
gc()
> used (Mb) gc trigger (Mb) max used (Mb)
> Ncells 760868 40.7 3763080 201.0 4232425 226.1
> Vcells 4076266 31.1 67607294 515.9 68453289 522.3
Note: only stopwords shorter than 5 characters are removed; longer stopwords are kept.
dfsw <- data.frame(sw = stopwords("english"))
shortstopwords <- as.vector(dfsw[apply(dfsw, 2, nchar)[, 1] < 5, ])
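For reference, an illustrative look at what this selection actually removes (the exact set depends on tm's English stopword list):
length(shortstopwords)      # number of short stopwords that will be filtered out
head(shortstopwords, 10)    # e.g. "i", "me", "my", "we", ...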
set_encoding <- function(inp) {
t <- iconv(inp, "UTF-8", "ASCII", sub = "")
return(t)
}
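A minimal usage example with a made-up string (not from the corpus), showing that non-ASCII characters are simply dropped:
set_encoding("naïve café")   # expected: "nave caf" - the accented letters are removed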
Note: preprocess this way for speed
input_preprocess <- function(inp) {
    t <- tolower(inp)
    t <- gsub("[[:punct:]]", "", t)  # strip punctuation
    t <- gsub("[[:digit:]]", "", t)  # strip digits
    t <- gsub("\\s+", " ", t)        # collapse runs of whitespace to a single space
    x <- unlist(strsplit(t, " "))    # split into individual words
    x <- x[nchar(x) > 1]             # drop single characters and empty strings
    x <- x[!x %in% dfprof$V1]        # drop profanity
    x <- x[!x %in% shortstopwords]   # drop short stopwords
    t <- paste(x, collapse = " ")
    return(t)
}
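A minimal usage example with a hypothetical sentence; the exact output depends on the profanity and stopword lists:
input_preprocess("I can't believe it's already 2019!")
# expected: roughly "cant believe already" - lowercased, punctuation and digits stripped,
# single characters and short stopwords ("i", "its") removed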
Not used: an alternative, per-line version kept for reference.
input_preprocess2 <- function(inp) {
    t <- tolower(inp)
    t <- gsub("[[:punct:]]", "", t)  # strip punctuation
    t <- gsub("[[:digit:]]", "", t)  # strip digits
    t <- gsub("\\s+", " ", t)        # collapse runs of whitespace to a single space
    t <- lapply(t, function(t) {
        ulxs <- unlist(strsplit(t, " "))
        ulxs <- ulxs[!ulxs %in% dfprof$V1]       # drop profanity
        ulxs <- ulxs[!ulxs %in% shortstopwords]  # drop short stopwords
        paste(ulxs, collapse = " ")
    })
    return(t)  # note: returns a list with one element per input line
}
Plot routine
gram_bar_plot <- function(dfin, ztitle, zxlab) {
    p <- ggplot(subset(dfin, Freq > dfin[20, ]$Freq),
                aes(x = reorder(grams, Freq), y = Freq)) +
        geom_bar(stat = "identity", colour = "grey", alpha = 0.8) +
        coord_flip() +
        ggtitle(ztitle) + xlab(zxlab) + ylab("Freq") +
        geom_text(aes(label = Freq), color = "blue", size = 3, hjust = -0.1) +
        theme(axis.text.x = element_text(angle = 0, hjust = 1))
    p
}
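Once the frequency tables below exist, the routine can be called, for example, as follows (shown commented out because dng1, dng2, and dng3 are built further down; plot output is not reproduced in this text):
# gram_bar_plot(dng1, "Blogs: top unigrams", "unigram")
# gram_bar_plot(dng2, "Blogs: top bigrams", "bigram")
# gram_bar_plot(dng3, "Blogs: top trigrams", "trigram")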
tt <- set_encoding(vblogs)
ppt <- input_preprocess(tt)
Note: the ngram package (ngram_asweka) is much faster for this tokenization.
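An optional, illustrative timing comparison against RWeka's NGramTokenizer (variable names here are only for the comparison; results vary by machine and sample):
system.time(bg_ngram <- ngram_asweka(ppt, min = 2, max = 2))                   # ngram package
system.time(bg_rweka <- NGramTokenizer(ppt, Weka_control(min = 2, max = 2)))   # RWeka
length(bg_ngram); length(bg_rweka)   # both approaches should return the same set of bigrams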
ng1 <- ngram_asweka(ppt, min = 1, max = 1)
ng2 <- ngram_asweka(ppt, min = 2, max = 2)
ng3 <- ngram_asweka(ppt, min = 3, max = 3)
dng1 <- data.frame(sort(table(ng1), decreasing = T))
dng2 <- data.frame(sort(table(ng2), decreasing = T))
dng3 <- data.frame(sort(table(ng3), decreasing = T))
names(dng1) <- c("grams", "Freq")
names(dng2) <- c("grams", "Freq")
names(dng3) <- c("grams", "Freq")
dng1_b <- dng1
d_dng1 <- dist(dng1[1:20, 2], "euclidean")
hc <- hclust(d_dng1, method = "single")
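The clustering object is not plotted in this report; if desired, a dendrogram of the top-20 unigram frequencies could be drawn as follows (illustrative):
plot(hc, labels = as.character(dng1[1:20, 1]), main = "Blogs: clustering of top-20 unigram frequencies", xlab = "", sub = "")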
tt <- set_encoding(vnews)
ppt <- input_preprocess(tt)
ng1 <- ngram_asweka(ppt, min = 1, max = 1)
ng2 <- ngram_asweka(ppt, min = 2, max = 2)
ng3 <- ngram_asweka(ppt, min = 3, max = 3)
dng1 <- data.frame(sort(table(ng1), decreasing = T))
dng2 <- data.frame(sort(table(ng2), decreasing = T))
dng3 <- data.frame(sort(table(ng3), decreasing = T))
names(dng1) <- c("grams", "Freq")
names(dng2) <- c("grams", "Freq")
names(dng3) <- c("grams", "Freq")
dng1_n <- dng1
d_dng1 <- dist(dng1[1:20, 2], "euclidean")
hc <- hclust(d_dng1, method = "single")
tt <- set_encoding(vtwtr)
ppt <- input_preprocess(tt)
ng1 <- ngram_asweka(ppt, min = 1, max = 1)
ng2 <- ngram_asweka(ppt, min = 2, max = 2)
ng3 <- ngram_asweka(ppt, min = 3, max = 3)
dng1 <- data.frame(sort(table(ng1), decreasing = T))
dng2 <- data.frame(sort(table(ng2), decreasing = T))
dng3 <- data.frame(sort(table(ng3), decreasing = T))
names(dng1) <- c("grams", "Freq")
names(dng2) <- c("grams", "Freq")
names(dng3) <- c("grams", "Freq")
dng1_t <- dng1
d_dng1 <- dist(dng1[1:20, 2], "euclidean")
hc <- hclust(d_dng1, method = "single")
Merge the data
Note: not sure about the merge; later we may want to retain the source (blogs, news, twitter) as metadata, as sketched below.
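One possible way (not used in this report; corpus_df is a hypothetical name) to retain the source of each sampled line for later modeling:
corpus_df <- data.frame(text = c(vblogs, vnews, vtwtr),
                        source = rep(c("blogs", "news", "twitter"),
                                     times = c(length(vblogs), length(vnews), length(vtwtr))),
                        stringsAsFactors = FALSE)
table(corpus_df$source)   # 10000 blogs, 10000 news, 25000 twitter lines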
all <- c(vblogs, vnews, vtwtr)  # concatenate the samples; c() avoids paste()'s element-wise recycling across unequal lengths
tt <- set_encoding(all)
ppt <- input_preprocess(tt)
ng1 <- ngram_asweka(ppt, min = 1, max = 1)
ng2 <- ngram_asweka(ppt, min = 2, max = 2)
ng3 <- ngram_asweka(ppt, min = 3, max = 3)
dng1 <- data.frame(sort(table(ng1), decreasing = T))
dng2 <- data.frame(sort(table(ng2), decreasing = T))
dng3 <- data.frame(sort(table(ng3), decreasing = T))
names(dng1) <- c("grams", "Freq")
names(dng2) <- c("grams", "Freq")
names(dng3) <- c("grams", "Freq")
dng1_c <- dng1
d_dng1 <- dist(dng1[1:20, 2], "euclidean")
hc <- hclust(d_dng1, method = "single")
union_bn <- length(union(dng1_b$grams, dng1_n$grams))
union_bt <- length(union(dng1_b$grams, dng1_t$grams))
union_nt <- length(union(dng1_n$grams, dng1_t$grams))
union_bnt <- length(union(union(dng1_b$grams, dng1_n$grams), dng1_t$grams))
intrs_bn <- length(intersect(dng1_b$grams, dng1_n$grams))
intrs_bt <- length(intersect(dng1_b$grams, dng1_t$grams))
intrs_nt <- length(intersect(dng1_n$grams, dng1_t$grams))
intrs_bnt <- length(intersect(intersect(dng1_b$grams, dng1_n$grams), dng1_t$grams))
uniq_b <- dim(dng1_b)[1]
uniq_n <- dim(dng1_n)[1]
uniq_t <- dim(dng1_t)[1]
uniq_c <- dim(dng1_c)[1]
uniq_c2 <- uniq_b + uniq_n + uniq_t - (intrs_bn + intrs_bt + intrs_nt) + intrs_bnt
sd_bn <- uniq_b + uniq_n - intrs_bn
sd_bt <- uniq_b + uniq_t - intrs_bt
sd_nt <- uniq_n + uniq_t - intrs_nt
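A quick consistency check (illustrative): by inclusion-exclusion, |A union B| = |A| + |B| - |A intersect B|, so the sums above should reproduce the union counts computed directly:
stopifnot(sd_bn == union_bn, sd_bt == union_bt, sd_nt == union_nt)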
dfk1 <- data.frame(
    Srce            = c("blogs", "news", "twitter", "combined", "comb_check"),
    Unique_words    = c(uniq_b, uniq_n, uniq_t, uniq_c, uniq_c2),
    Sep             = c(" ", " ", " ", " ", " "),
    Combination     = c("blogs_news", "blogs_twitter", "news_twitter", "combined", " "),
    Union_Words     = c(union_bn, union_bt, union_nt, union_bnt, 0),
    Intersect_Words = c(intrs_bn, intrs_bt, intrs_nt, intrs_bnt, 0))
kable(dfk1, format = "html") %>% kable_styling(full_width = F, position = "left")
| Srce | Unique_words | Sep | Combination | Union_Words | Intersect_Words |
|---|---|---|---|---|---|
| blogs | 31350 |  | blogs_news | 46639 | 15048 |
| news | 30337 |  | blogs_twitter | 45393 | 12644 |
| twitter | 26687 |  | news_twitter | 44787 | 12237 |
| combined | 58420 |  | combined | 58420 | 9975 |
| comb_check | 58420 |  |  | 0 | 0 |
grid.newpage()
draw.triple.venn(area1 = uniq_b, area2 = uniq_n, area3 = uniq_t,
    n12 = intrs_bn, n23 = intrs_nt, n13 = intrs_bt, n123 = intrs_bnt,
    category = c("blogs", "news", "twitter"), lty = "blank",
    fill = c("blue", "yellow", "green"))
> [Venn diagram: overlap of unique words among the blogs, news, and twitter samples]
I had a concern that one or another of the text data sets (blogs, news, twitter) would be more applicable to the project. This first analysis indicates that there is no special benefit to one data set over another. Combining the data may be beneficial overall.
In general > https://hub.packtpub.com/9-useful-r-packages-for-nlp-text-mining/
From CRAN tm package
> https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
Intro to text mining in R > http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know
Walkthrough of process > https://rstudio-pubs-static.s3.amazonaws.com/265713_cbef910aee7642dc8b62996e38d2825d.html
> https://rpubs.com/williamsurles/316682
Encoding
> https://stackoverflow.com/questions/35639317/r-how-to-remove-very-special-characters-in-strings
Profanity elimination
Google: ‘database of english profanity’
> Top hit: https://www.freewebheaders.com/wordpress/wp-content/uploads/full-list-of-bad-words-csv-file_2018_03_26_26.zip
> https://stackoverflow.com/questions/35790652/removing-words-featured-in-character-vector-from-string
END