Introduction

People are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices.

This course starts with analyzing a large corpus of text documents to discover the structure in the data and how words are put together, then moves on to cleaning and analyzing text data and building and sampling from a predictive text model. Finally, you will use the knowledge you gained in data products to build a predictive text product.

Milestone One

In this exercise, you will use the English database but may consider three other databases in German, Russian and Finnish.

Tasks to accomplish I

  1. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers, and writing a function that takes a file as input and returns a tokenized version of it (a minimal sketch follows this list).
  2. Profanity filtering - removing profanity and other words you do not want to predict.
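
A minimal sketch of both tasks, assuming a plain-text input file and a character vector of profane terms (the names tokenize_file and bad_words are illustrative, not part of the course materials):

# Minimal sketch: read a file, split it into lower-case word tokens,
# and drop profane terms. bad_words is a hypothetical character vector.
tokenize_file <- function(path, bad_words = character(0)) {
    lines  <- readLines(path, skipNul = TRUE)
    tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))  # keep letters and apostrophes
    tokens <- tokens[nchar(tokens) > 0]                     # drop empty strings
    tokens[!tokens %in% bad_words]                          # profanity filtering
}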

Tasks to accomplish II

  1. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
  2. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Overview

This document:
+ Input the data: blogs, news, twitter
+ Input the profanity reference
+ Sample the data to work with a smaller data set
+ Process the data to clean the text (blogs, news, twitter)
+ Obtain text frequencies
+ Plot the top frequencies
+ Summary table of union and intersection of the data sets

Bookkeeping

# Install any packages that are not already present
list.of.packages <- c('ggplot2','ngram','NLP','openNLP','RWeka','tm',
                      'formatR','knitr','kableExtra','VennDiagram')
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if (length(new.packages)) install.packages(new.packages, repos = "http://cran.us.r-project.org")
suppressWarnings( suppressMessages( library(ggplot2) )) 
suppressWarnings( suppressMessages( library(formatR) )) 
suppressWarnings( suppressMessages( library(knitr) ))  
suppressWarnings( suppressMessages( library(kableExtra) )) 
suppressWarnings( suppressMessages( library(VennDiagram) )) 

suppressWarnings( suppressMessages( library(tm) )) 
suppressWarnings( suppressMessages( library(ngram) ))
suppressWarnings( suppressMessages( library(RWeka) ))
suppressWarnings( suppressMessages( library(openNLP) ))
# suppressWarnings( suppressMessages( library(NLP) ))  # loaded by tm

Obtain the data

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
lzip <- "Coursera-SwiftKey.zip"
if (!file.exists(lzip)) {
    download.file(url, destfile = lzip)
    unzip(lzip)
}

List Files

en_dir <- "./final/en_US"
list.files(en_dir, "*.txt")
> [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

Blogs

~15 seconds

url <- "./final/en_US/en_US.blogs.txt"
system.time(blogs <- readLines(url, skipNul = T))
print(paste("File size", url, file.info(url)$size, sep = ": "))
print(paste("Number of lines", length(blogs), sep = ": "))
>    user  system elapsed 
>   13.98    0.50   15.68 
> [1] "File size: ./final/en_US/en_US.blogs.txt: 210160014"
> [1] "Number of lines: 899288"

News

url <- "./final/en_US/en_US.news.txt"
system.time(news <- readLines(url, skipNul = T))
print(paste("File size", url, file.info(url)$size, sep = ": "))
print(paste("Number of lines", length(news), sep = ": "))
>    user  system elapsed 
>    1.29    0.01    1.33 
> [1] "File size: ./final/en_US/en_US.news.txt: 205811889"
> [1] "Number of lines: 77259"

Twitter

~15 seconds

url <- "./final/en_US/en_US.twitter.txt"
system.time(twtr <- readLines(url, skipNul = T))
print(paste("File size", url, file.info(url)$size, sep = ": "))
print(paste("Number of lines", length(twtr), sep = ": "))
>    user  system elapsed 
>   12.97    0.33   13.43 
> [1] "File size: ./final/en_US/en_US.twitter.txt: 167105338"
> [1] "Number of lines: 2360148"

Profanity Reference

url <- "https://www.freewebheaders.com/wordpress/wp-content/uploads/full-list-of-bad-words-csv-file_2018_03_26_26.zip"
lzip <- "./final/en_US/full-list-of-bad-words-csv-file_2018_03_26_26.zip"
if (!file.exists(lzip)) {
    download.file(url, destfile = lzip)
    unzip(lzip)
}

Input Profanity

url <- "full-list-of-bad-words-csv-file_2018_03_26.csv"
dfprof <- read.csv(url, header = F)

Exploratory Analysis

The analysis is in four parts: each data set individually (blogs, news, twitter), then all three combined. Why? The project is building a text-completion product, and context may matter in developing the model. We want to see how the data sets differ and then consider the impact of aggregating them.

Data Aggregation and Scrub

Because the use case is text entry on mobile devices, the model should favor the kind of language used in texting and similar writing. This argues for keeping punctuation, contractions, and stopwords (“to be or not to be”).
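
As one illustration of that choice, a contraction-preserving clean-up could look like the hypothetical helper below; it is not the preprocessing actually used later in this report, which strips all punctuation:

# Hypothetical helper: drop punctuation except apostrophes so that
# contractions such as "don't" survive the clean-up.
keep_contractions <- function(x) {
    x <- tolower(x)
    x <- gsub("[^a-z' ]", " ", x)    # keep letters, apostrophes, spaces
    gsub("\\s+", " ", trimws(x))     # trim and collapse whitespace
}
keep_contractions("Don't stop -- it's fine!")   # "don't stop it's fine"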

Sample the data

There is a trade-off between data set size, model performance, and resource usage (CPU and memory). To do the exploratory analysis with a low system burden, sample the data.

The sample is skewed toward the expected use cases (more twitter than blogs or news), which also keeps memory usage lower.

# Take random samples (without replacement), skewed toward twitter, then free the full data sets
vblogs <- sample(blogs, 10000, replace = FALSE)
vnews <- sample(news, 10000, replace = FALSE)
vtwtr <- sample(twtr, 25000, replace = FALSE)
rm(blogs)
rm(news)
rm(twtr)
gc()
>           used (Mb) gc trigger  (Mb) max used  (Mb)
> Ncells  760868 40.7    3763080 201.0  4232425 226.1
> Vcells 4076266 31.1   67607294 515.9 68453289 522.3

Common routines

Note: only stopwords fewer than five characters long are removed.

# English stopwords shorter than five characters; these are the only stopwords removed
dfsw <- data.frame(sw = stopwords("english"))
shortstopwords <- as.vector(dfsw[apply(dfsw, 2, nchar)[, 1] < 5, ])
# Drop non-ASCII characters (curly quotes, emoji, etc.)
set_encoding <- function(inp) {
    t <- iconv(inp, "UTF-8", "ASCII", sub = "")
    return(t)
}

Note: preprocess this way for speed

input_preprocess <- function(inp) {
    t <- tolower(inp)
    t <- gsub("[[:punct:]]", "", t)   # drop punctuation
    t <- gsub("[[:digit:]]", "", t)   # drop digits
    t <- gsub("  ", " ", t)           # collapse double spaces
    x <- unlist(strsplit(t, " "))     # split into word tokens
    x <- x[nchar(x) > 1]              # drop empty and single-character tokens
    x <- x[!x %in% dfprof$V1]         # profanity filtering
    x <- x[!x %in% shortstopwords]    # drop short stopwords
    t <- paste(x, collapse = " ")
    return(t)
}

Not used: an alternative version that preprocesses each line separately

input_preprocess2 <- function(inp) {
    t <- tolower(inp)
    t <- gsub("[[:punct:]]", "", t)
    t <- gsub("[[:digit:]]", "", t)
    t <- gsub("  ", " ", t)
    t <- lapply(t, function(t) {
        ulxs <- unlist(strsplit(t, " "))
        ulxs <- ulxs[!ulxs %in% dfprof$V1]
        ulxs <- ulxs[!ulxs %in% shortstopwords]
        paste(ulxs, collapse = " ")
    })
    return(t)
}

Plot routine

gram_bar_plot <- function(dfin, ztitle, zxlab) {
    
    p <- ggplot(subset(dfin, Freq > dfin[20, ]$Freq), aes(x = reorder(grams, 
        Freq), y = Freq)) + geom_bar(stat = "identity", colour = "grey", alpha = 0.8) + 
        coord_flip() + ggtitle(ztitle) + xlab(zxlab) + ylab("Freq") + geom_text(aes(label = Freq), 
        color = "blue", size = 3, hjust = -0.1) + theme(axis.text.x = element_text(angle = 0, 
        hjust = 1))
    p
}

Processing

Blogs Processing

tt <- set_encoding(vblogs)
ppt <- input_preprocess(tt)

Note: the ngram package (ngram_asweka) is much faster for this step.

ng1 <- ngram_asweka(ppt, min = 1, max = 1)   # unigrams
ng2 <- ngram_asweka(ppt, min = 2, max = 2)   # bigrams
ng3 <- ngram_asweka(ppt, min = 3, max = 3)   # trigrams
dng1 <- data.frame(sort(table(ng1), decreasing = T))
dng2 <- data.frame(sort(table(ng2), decreasing = T))
dng3 <- data.frame(sort(table(ng3), decreasing = T))
names(dng1) <- c("grams", "Freq")
names(dng2) <- c("grams", "Freq")
names(dng3) <- c("grams", "Freq")
dng1_b <- dng1   # keep the blog unigram table for the commonality section
d_dng1 <- dist(dng1[1:20, 2], "euclidean")
hc <- hclust(d_dng1, method = "single")

Plot Blogs Corpus
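
The rendered figures are not reproduced here; each one comes from a call to the plot routine defined above, for example (for the blog unigram table kept as dng1_b):

gram_bar_plot(dng1_b, "Blogs: top unigram frequencies", "unigram")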

News Processing

tt <- set_encoding(vnews)
ppt <- input_preprocess(tt)
ng1 <- ngram_asweka(ppt, min = 1, max = 1)
ng2 <- ngram_asweka(ppt, min = 2, max = 2)
ng3 <- ngram_asweka(ppt, min = 3, max = 3)
dng1 <- data.frame(sort(table(ng1), decreasing = T))
dng2 <- data.frame(sort(table(ng2), decreasing = T))
dng3 <- data.frame(sort(table(ng3), decreasing = T))
names(dng1) <- c("grams", "Freq")
names(dng2) <- c("grams", "Freq")
names(dng3) <- c("grams", "Freq")
dng1_n <- dng1
d_dng1 <- dist(dng1[1:20, 2], "euclidean")
hc <- hclust(d_dng1, method = "single")

Plot News Corpus

Twitter Processing

tt <- set_encoding(vtwtr)
ppt <- input_preprocess(tt)
ng1 <- ngram_asweka(ppt, min = 1, max = 1)
ng2 <- ngram_asweka(ppt, min = 2, max = 2)
ng3 <- ngram_asweka(ppt, min = 3, max = 3)
dng1 <- data.frame(sort(table(ng1), decreasing = T))
dng2 <- data.frame(sort(table(ng2), decreasing = T))
dng3 <- data.frame(sort(table(ng3), decreasing = T))
names(dng1) <- c("grams", "Freq")
names(dng2) <- c("grams", "Freq")
names(dng3) <- c("grams", "Freq")
dng1_t <- dng1
d_dng1 <- dist(dng1[1:20, 2], "euclidean")
hc <- hclust(d_dng1, method = "single")

Plot Twitter Corpus

Combined Preprocessing

Merge the data.
Note: not sure about this merge; later we may want to keep the source (blogs/news/twitter) as metadata, as sketched after the code below.

all <- c(vblogs, vnews, vtwtr)   # combine the samples; c() avoids paste() recycling the shorter vectors
tt <- set_encoding(all)
ppt <- input_preprocess(tt)
ng1 <- ngram_asweka(ppt, min = 1, max = 1)
ng2 <- ngram_asweka(ppt, min = 2, max = 2)
ng3 <- ngram_asweka(ppt, min = 3, max = 3)
dng1 <- data.frame(sort(table(ng1), decreasing = T))
dng2 <- data.frame(sort(table(ng2), decreasing = T))
dng3 <- data.frame(sort(table(ng3), decreasing = T))
names(dng1) <- c("grams", "Freq")
names(dng2) <- c("grams", "Freq")
names(dng3) <- c("grams", "Freq")
dng1_c <- dng1
d_dng1 <- dist(dng1[1:20, 2], "euclidean")
hc <- hclust(d_dng1, method = "single")
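
If the source of each line matters later (as the note above suggests), one option is to carry a source label alongside each sampled line before combining; a sketch, not used here (df_all is an illustrative name):

# Keep the origin of each sampled line as metadata so the combined
# corpus can still be split by source later.
df_all <- rbind(
    data.frame(source = "blogs",   text = vblogs, stringsAsFactors = FALSE),
    data.frame(source = "news",    text = vnews,  stringsAsFactors = FALSE),
    data.frame(source = "twitter", text = vtwtr,  stringsAsFactors = FALSE))
table(df_all$source)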

Plot Combined Corpus

Commonality and Coverage

Generate the numbers

union_bn <- length(union(dng1_b$grams, dng1_n$grams))
union_bt <- length(union(dng1_b$grams, dng1_t$grams))
union_nt <- length(union(dng1_n$grams, dng1_t$grams))
union_bnt <- length(union(union(dng1_b$grams, dng1_n$grams), dng1_t$grams))

intrs_bn <- length(intersect(dng1_b$grams, dng1_n$grams))
intrs_bt <- length(intersect(dng1_b$grams, dng1_t$grams))
intrs_nt <- length(intersect(dng1_n$grams, dng1_t$grams))
intrs_bnt <- length(intersect(intersect(dng1_b$grams, dng1_n$grams), dng1_t$grams))

uniq_b <- dim(dng1_b)[1]
uniq_n <- dim(dng1_n)[1]
uniq_t <- dim(dng1_t)[1]
uniq_c <- dim(dng1_c)[1]
# Inclusion-exclusion estimate of the combined vocabulary size
uniq_c2 <- uniq_b + uniq_n + uniq_t - (intrs_bn + intrs_bt + intrs_nt) + intrs_bnt

# Pairwise union sizes recomputed from the counts (should match union_* above)
sd_bn <- uniq_b + uniq_n - intrs_bn
sd_bt <- uniq_b + uniq_t - intrs_bt
sd_nt <- uniq_n + uniq_t - intrs_nt
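
Since the combined vocabulary was also counted directly, a quick sanity check (assuming the objects above) is that the direct count matches the inclusion-exclusion estimate; both appear as 58420 in the table below:

# The directly counted combined vocabulary should equal the
# inclusion-exclusion estimate.
stopifnot(uniq_c == uniq_c2)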

Summary of Text Set Commonality

dfk1 <- data.frame(
    Srce = c("blogs", "news", "twitter", "combined", "comb_check"),
    Unique_words = c(uniq_b, uniq_n, uniq_t, uniq_c, uniq_c2),
    Sep = c(" ", " ", " ", " ", " "),
    Combination = c("blogs_news", "blogs_twitter", "news_twitter", "combined", " "),
    Union_Words = c(union_bn, union_bt, union_nt, union_bnt, 0),
    Intersect_Words = c(intrs_bn, intrs_bt, intrs_nt, intrs_bnt, 0))

kable(dfk1, format = "html") %>% kable_styling(full_width = F, position = "left")
Srce         Unique_words   Sep   Combination     Union_Words   Intersect_Words
blogs               31350         blogs_news            46639             15048
news                30337         blogs_twitter         45393             12644
twitter             26687         news_twitter          44787             12237
combined            58420         combined              58420              9975
comb_check          58420                                   0                 0

Venn Diagram

grid.newpage()
# n12/n23/n13 follow the area order: blogs-news, news-twitter, blogs-twitter
draw.triple.venn(area1 = uniq_b, area2 = uniq_n, area3 = uniq_t,
    n12 = intrs_bn, n23 = intrs_nt, n13 = intrs_bt, n123 = intrs_bnt,
    category = c("blogs", "news", "twitter"),
    lty = "blank", fill = c("blue", "yellow", "green"))

(Figure: Venn diagram of unique-word overlap among the blogs, news, and twitter samples.)

Conclusion

I had a concern that one or another of the text data sets (blogs, news, twitter) would be more applicable to the project. This first analysis indicates that there is no special benefit to one data set over another. Combining the data may be beneficial overall.

Next Step

  • Construct a model.
  • Run the model.
  • Make some predictions of the third word in a sequence.