In this exploratory analysis the three provided datasets (obtained from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip) are investigated. The structures contain textual data. The following R packages are used for the analysis.
suppressPackageStartupMessages({
library(knitr, quietly = T)
library(tidyverse, quietly = T)
library(data.table, quietly = T)
library(quanteda, quietly = T)
library(stm, quietly = T)
library(stringr, quietly = T)
library(ggplot2, quietly = T)
library(ggalt, quietly = T)
library(gridExtra, quietly = T)
})
First, data is downloaded and read into the R workspace. For a first overview the number of tokens, which are coherent sequences of characters separated by whitespace (e.g. words), are counted for each of the items in the three provided datasets. An item is considered one distinct text in the datasets such as a tweet or a news article. Also, the number of characters in each item are counted.
if (!file.exists("raw_data/Coursera-SwiftKey.zip"))
download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
dest = "raw_data/Coursera-SwiftKey.zip")
txt.blog <- readLines(unz("raw_data/Coursera-SwiftKey.zip", "final/en_US/en_US.blogs.txt"))
txt.news <- readLines(unz("raw_data/Coursera-SwiftKey.zip", "final/en_US/en_US.news.txt"))
txt.twit <- readLines(unz("raw_data/Coursera-SwiftKey.zip", "final/en_US/en_US.twitter.txt"))
dt <- data.table(doc = c(rep("blog", times = length(txt.blog)),
rep("news", times = length(txt.news)),
rep("twit", times = length(txt.twit))),
txt = c(txt.blog, txt.news, txt.twit))
# count tokens that consist of at least one letter of each item
dt[, token.count := vapply(str_split(txt,
pattern = " |\\n|\\t"),
FUN = function(x) {length(grep(pattern = "[a-zA-Z]+",
x,
value = T))},
FUN.VALUE = numeric(1))]
# count characters of each item
dt[, chr.count := nchar(txt)]
Next, the determined counts of tokens and characters are summerized for each of the three datasets. The three datasets are refered to as doc (“documents”), i.e. “blog”, “news” and “twit” respectively. After the two summary tables, another summary is shown to investigate the appearance of tokens that comprise only non-letter characters such as numbers.
# summary of tokens count
kable(dt[, .(items = .N,
total.tokens = sum(token.count),
mean.n_tokens = mean(token.count),
sd.n_tokens = sd(token.count),
max.n_tokens = max(token.count),
min.n_tokens = min(token.count)),
by = doc], digits = 2, caption = "Token summary:")
| doc | items | total.tokens | mean.n_tokens | sd.n_tokens | max.n_tokens | min.n_tokens |
|---|---|---|---|---|---|---|
| blog | 899288 | 36815824 | 40.94 | 45.90 | 6327 | 0 |
| news | 1010242 | 33468015 | 33.13 | 22.18 | 1370 | 0 |
| twit | 2360148 | 29354754 | 12.44 | 6.72 | 47 | 1 |
# summary of character count
kable(dt[, .(total.chr = sum(chr.count),
mean.n_chr = mean(chr.count),
sd.n_chr = sd(chr.count),
max.n_chr = max(chr.count),
min.n_chr = min(chr.count)),
by = doc], digits = 2, caption = "Character summary:")
| doc | total.chr | mean.n_chr | sd.n_chr | max.n_chr | min.n_chr |
|---|---|---|---|---|---|
| blog | 206824505 | 229.99 | 258.66 | 40833 | 1 |
| news | 203223159 | 201.16 | 133.22 | 11384 | 1 |
| twit | 162096031 | 68.68 | 37.23 | 140 | 2 |
# have a look at items that do not contain letters
kable(dt[!grepl("[a-zA-Z]+", txt),
.(items = .N,
total.chr = sum(chr.count),
mean.n_chr = mean(chr.count),
sd.n_chr = sd(chr.count),
max.n_chr = max(chr.count),
min.n_chr = min(chr.count)),
by = doc], digits = 2, caption = "Summary of items not containing letters:")
| doc | items | total.chr | mean.n_chr | sd.n_chr | max.n_chr | min.n_chr |
|---|---|---|---|---|---|---|
| blog | 967 | 18121 | 18.74 | 36.59 | 365 | 1 |
| news | 777 | 8050 | 10.36 | 6.11 | 84 | 1 |
From the first table (“token summary”) the number of lines (= number of items) it is obtained that
The findings are visualized in the following plots, also showing the overall distributions:
g1 <- ggplot(dt, aes(x = token.count)) +
geom_histogram(aes(y=..density..), binwidth = 5) +
geom_density(col = "red") +
xlim(0,200) +
labs(x = "count of tokens consisting of at least one letter") +
facet_wrap(~doc, nrow = 1, scales = "free")
g2 <- ggplot(dt, aes(x = chr.count)) +
geom_histogram(aes(y=..density..), binwidth = 15) +
geom_density(col = "red") +
xlim(0, 1000) +
labs(x = "number of characters") +
facet_wrap(~doc, nrow = 1, scales = "free")
grid.arrange(g1, g2, top = "tokens and character distributions by document type")
g3 <- ggplot(dt, aes(x = token.count)) +
geom_histogram(binwidth = 5) +
labs(x = "length of tokens consisting of at least one letter") +
xlim(0, 200)
g4 <- ggplot(dt, aes(x = chr.count)) +
geom_histogram(binwidth = 15) +
labs(x = "number of characters in tokens consisting of at least one letter") +
xlim(0, 1000)
grid.arrange(g3, g4, top = "tokens and character distributions overall")
As expected, items in the twit dataset on average contain less tokens (words) than items in the other two doc-types (news and blog) which is a direct consequence of the character limitation of tweets. Interestingly, the distribution form of the number of characters per items is rather similar to the distribution form of the number of tokens per items.
Next, a sample of 15000 items is drawn per doc-type to get an rough idea of the textual data. The top 20 features of the sample are plotted. Then, the similarity between the three doc-types within the sample is estimated and provided in a table, showing the correlation between the each other.
set.seed(1337)
dt.smpl <- dt[, .SD[sample(.N, 15000)], by = doc]
crp <- corpus(dt.smpl$txt, docvars = data.frame(doc = dt.smpl$doc))
df <- dfm(crp, remove_punct = T, remove_symbols = T)
df %>%
dfm_remove(stopwords("en")) %>%
textstat_frequency(n = 20) %>%
ggplot(mapping = aes(x = reorder(feature, -frequency), y = frequency)) +
geom_bar(stat="identity") +
labs(x = "feature") +
theme(axis.text.x = element_text(angle = 90))
# similarity
doc.sim <- df %>%
dfm_remove(stopwords("en")) %>%
dfm_group(groups = "doc") %>%
textstat_simil(margin = "document")
kable(doc.sim)
| document1 | document2 | correlation |
|---|---|---|
| blog | news | 0.7442165 |
| blog | twit | 0.8285023 |
| news | twit | 0.5791797 |
To explore the topics present in the sample, the Structural Topic Model using semi-collapsed variational EM from the stm package is estimated for each doc-type with \(K = 10\) (number of topics to search). The 10 topics per doc-type are then plotted.
# topics per group
stm.fit <- list()
par(mfrow = c(3,1))
for (i in unique(docvars(df, "doc"))) {
stm.fit[[i]] <- df %>%
dfm_subset(subset = (docvars(., "doc") == i)) %>%
dfm_subset(subset = (1:nrow(.) %in% sample(nrow(.), size = 2))) %>%
dfm_remove(stopwords("en")) %>%
dfm_trim(min_termfreq = 1) %>%
stm(K = 10, verbose = F)
plot(stm.fit[[i]], main = paste("topics in", gsub(pattern = "twit",
replacement = "twitter",
i)))
}
Finally, the first three n-grams \((n = 1:3)\) are estimated within each doc-type of the sample. Since the n-gram approach will be used to predict the next words of a given sequence of words later in the project, the stopwords are not removed as these are essential for the structure of any given sentence. Hence the n-gram with \(n = 1\) is not identical to the top-feature plot above, as stopwords were not taken into account in the top-feature plot.
# n-grams
tkn <- tokens(crp, remove_punct = T, remove_symbols = T, remove_url = T)
grm <- list()
for (i in 1:3) {
grm[[i]] <- tokens_ngrams(tkn, n = i + 1) %>%
dfm(stem = T) %>%
textstat_frequency() %>%
top_n(20, frequency) %>%
ggplot(mapping = aes(x = reorder(feature, frequency), y = frequency)) +
geom_col() +
coord_flip() +
labs(x = paste("n-gram, n =", i))
grm[[i]] <- ggplotGrob(grm[[i]])
}
grid::grid.newpage()
grid::grid.draw(do.call("rbind", grm))
The strategy for creating a lean and fast word prediction algorithm during this capstone project is to use a combination of various n-grams (e.g. \(n = 1:4\)). The dataset will be split in a training and a test set. The n-grams weights will be optimized to create a high in-sample accuracy, assuming the out-of-sample accuracy will scale accordingly. Stopwords will not be removed for the n-gram creation as stopwords are an important feature of any given sentence. Numbers and punctuation will be replaced with placeholders.
The limitation of the model is of course the prediction performance for out-of-sample texts that are rather different from the training set.