Exploratory Analysis

In this exploratory analysis the three provided datasets (obtained from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) are investigated. The structures contain textual data. The following R packages are used for the analysis.

suppressPackageStartupMessages({
    library(knitr, quietly = T)
    library(tidyverse, quietly = T)
    library(data.table, quietly = T)
    library(quanteda, quietly = T)
    library(stm, quietly = T)
    library(stringr, quietly = T)
    library(ggplot2, quietly = T)
    library(ggalt, quietly = T)
    library(gridExtra, quietly = T)
})

First, data is downloaded and read into the R workspace. For a first overview the number of tokens, which are coherent sequences of characters separated by whitespace (e.g. words), are counted for each of the items in the three provided datasets. An item is considered one distinct text in the datasets such as a tweet or a news article. Also, the number of characters in each item are counted.

if (!file.exists("raw_data/Coursera-SwiftKey.zip"))
    download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  dest = "raw_data/Coursera-SwiftKey.zip")

txt.blog <- readLines(unz("raw_data/Coursera-SwiftKey.zip", "final/en_US/en_US.blogs.txt"))
txt.news <- readLines(unz("raw_data/Coursera-SwiftKey.zip", "final/en_US/en_US.news.txt"))
txt.twit <- readLines(unz("raw_data/Coursera-SwiftKey.zip", "final/en_US/en_US.twitter.txt"))

dt <- data.table(doc = c(rep("blog", times = length(txt.blog)),
                         rep("news", times = length(txt.news)),
                         rep("twit", times = length(txt.twit))),
                 txt = c(txt.blog, txt.news, txt.twit))

# count tokens that consist of at least one letter of each item
dt[, token.count := vapply(str_split(txt, 
                                     pattern = " |\\n|\\t"), 
                           FUN = function(x) {length(grep(pattern = "[a-zA-Z]+", 
                                                          x, 
                                                          value = T))}, 
                           FUN.VALUE = numeric(1))]

# count characters of each item
dt[, chr.count := nchar(txt)]

Next, the determined counts of tokens and characters are summerized for each of the three datasets. The three datasets are refered to as doc (“documents”), i.e. “blog”, “news” and “twit” respectively. After the two summary tables, another summary is shown to investigate the appearance of tokens that comprise only non-letter characters such as numbers.

# summary of tokens count
kable(dt[, .(items = .N, 
             total.tokens = sum(token.count), 
             mean.n_tokens = mean(token.count), 
             sd.n_tokens = sd(token.count), 
             max.n_tokens = max(token.count),
             min.n_tokens = min(token.count)), 
         by = doc], digits = 2, caption = "Token summary:")

Token summary:
doc	items	total.tokens	mean.n_tokens	sd.n_tokens	max.n_tokens	min.n_tokens
blog	899288	36815824	40.94	45.90	6327	0
news	1010242	33468015	33.13	22.18	1370	0
twit	2360148	29354754	12.44	6.72	47	1

# summary of character count
kable(dt[, .(total.chr = sum(chr.count),
             mean.n_chr = mean(chr.count),
             sd.n_chr = sd(chr.count),
             max.n_chr = max(chr.count),
             min.n_chr = min(chr.count)),
         by = doc], digits = 2, caption = "Character summary:")

Character summary:
doc	total.chr	mean.n_chr	sd.n_chr	max.n_chr	min.n_chr
blog	206824505	229.99	258.66	40833	1
news	203223159	201.16	133.22	11384	1
twit	162096031	68.68	37.23	140	2

# have a look at items that do not contain letters
kable(dt[!grepl("[a-zA-Z]+", txt), 
         .(items = .N, 
           total.chr = sum(chr.count),
           mean.n_chr = mean(chr.count),
           sd.n_chr = sd(chr.count),
           max.n_chr = max(chr.count),
           min.n_chr = min(chr.count)), 
         by = doc], digits = 2, caption = "Summary of items not containing letters:")

Summary of items not containing letters:
doc	items	total.chr	mean.n_chr	sd.n_chr	max.n_chr	min.n_chr
blog	967	18121	18.74	36.59	365	1
news	777	8050	10.36	6.11	84	1

From the first table (“token summary”) the number of lines (= number of items) it is obtained that

blog contains 899288 lines
news conrains 1010242 lines
twit conrains 2360148 lines

The findings are visualized in the following plots, also showing the overall distributions:

g1 <- ggplot(dt, aes(x = token.count)) + 
    geom_histogram(aes(y=..density..), binwidth = 5) + 
    geom_density(col = "red") +
    xlim(0,200) +
    labs(x = "count of tokens consisting of at least one letter") + 
    facet_wrap(~doc, nrow = 1, scales = "free") 
g2 <- ggplot(dt, aes(x = chr.count)) + 
    geom_histogram(aes(y=..density..), binwidth = 15) + 
    geom_density(col = "red") +
    xlim(0, 1000) + 
    labs(x = "number of characters") + 
    facet_wrap(~doc, nrow = 1, scales = "free")
grid.arrange(g1, g2, top = "tokens and character distributions by document type")

g3 <- ggplot(dt, aes(x = token.count)) + 
    geom_histogram(binwidth = 5) + 
    labs(x = "length of tokens consisting of at least one letter") +
    xlim(0, 200)
g4 <- ggplot(dt, aes(x = chr.count)) + 
    geom_histogram(binwidth = 15) + 
    labs(x = "number of characters in tokens consisting of at least one letter") + 
    xlim(0, 1000)
grid.arrange(g3, g4, top = "tokens and character distributions overall")

As expected, items in the twit dataset on average contain less tokens (words) than items in the other two doc-types (news and blog) which is a direct consequence of the character limitation of tweets. Interestingly, the distribution form of the number of characters per items is rather similar to the distribution form of the number of tokens per items.

Next, a sample of 15000 items is drawn per doc-type to get an rough idea of the textual data. The top 20 features of the sample are plotted. Then, the similarity between the three doc-types within the sample is estimated and provided in a table, showing the correlation between the each other.

set.seed(1337)

dt.smpl <- dt[, .SD[sample(.N, 15000)], by = doc]
crp     <- corpus(dt.smpl$txt, docvars = data.frame(doc = dt.smpl$doc))

df <- dfm(crp, remove_punct = T, remove_symbols = T)

df %>% 
    dfm_remove(stopwords("en")) %>%
    textstat_frequency(n = 20) %>%
    ggplot(mapping = aes(x = reorder(feature, -frequency), y = frequency)) +
        geom_bar(stat="identity") + 
        labs(x = "feature") + 
        theme(axis.text.x = element_text(angle = 90))

# similarity
doc.sim <- df %>%
    dfm_remove(stopwords("en")) %>%
    dfm_group(groups = "doc") %>%
    textstat_simil(margin = "document")

kable(doc.sim)

document1	document2	correlation
blog	news	0.7442165
blog	twit	0.8285023
news	twit	0.5791797

To explore the topics present in the sample, the Structural Topic Model using semi-collapsed variational EM from the stm package is estimated for each doc-type with \(K = 10\) (number of topics to search). The 10 topics per doc-type are then plotted.

# topics per group
stm.fit <- list()
par(mfrow = c(3,1))
for (i in unique(docvars(df, "doc"))) {
    stm.fit[[i]] <- df %>% 
        dfm_subset(subset = (docvars(., "doc") == i)) %>%
        dfm_subset(subset = (1:nrow(.) %in% sample(nrow(.), size = 2))) %>%
        dfm_remove(stopwords("en")) %>%
        dfm_trim(min_termfreq = 1) %>%
        stm(K = 10, verbose = F)
    plot(stm.fit[[i]], main = paste("topics in", gsub(pattern = "twit",
                                                      replacement = "twitter",
                                                      i)))
}

Finally, the first three n-grams \((n = 1:3)\) are estimated within each doc-type of the sample. Since the n-gram approach will be used to predict the next words of a given sequence of words later in the project, the stopwords are not removed as these are essential for the structure of any given sentence. Hence the n-gram with \(n = 1\) is not identical to the top-feature plot above, as stopwords were not taken into account in the top-feature plot.

# n-grams
tkn <- tokens(crp, remove_punct = T, remove_symbols = T, remove_url = T)
grm <- list()
for (i in 1:3) {
    grm[[i]] <- tokens_ngrams(tkn, n = i + 1) %>%
        dfm(stem = T) %>%
        textstat_frequency() %>%
        top_n(20, frequency) %>%
        ggplot(mapping = aes(x = reorder(feature, frequency), y = frequency)) + 
            geom_col() +
            coord_flip() +
            labs(x = paste("n-gram, n =", i))
    grm[[i]] <- ggplotGrob(grm[[i]])
}
grid::grid.newpage()
grid::grid.draw(do.call("rbind", grm))

The strategy for creating a lean and fast word prediction algorithm during this capstone project is to use a combination of various n-grams (e.g. \(n = 1:4\)). The dataset will be split in a training and a test set. The n-grams weights will be optimized to create a high in-sample accuracy, assuming the out-of-sample accuracy will scale accordingly. Stopwords will not be removed for the n-gram creation as stopwords are an important feature of any given sentence. Numbers and punctuation will be replaced with placeholders.

The limitation of the model is of course the prediction performance for out-of-sample texts that are rather different from the training set.

Exploratory Analysis

PhilippB

2021-03-18