This is the milestone report for the Johns Hopkins Data Science Capstone. The end goal of the project is a Shiny web app that predicts the next word a user is likely to type, trained on a large collection of English text from blogs, news articles and Twitter.
This report covers the first stage of that work. In plain terms, it shows that I have:
No data-science background is needed to read it — the technical code is hidden by default (use the Code buttons on the right to reveal any of it).
The data comes from the course’s SwiftKey dataset. It contains text in four languages; I use only the three English files:
en_US.blogs.txt: text from blog posts
en_US.news.txt: text from news articles
en_US.twitter.txt: tweets

# Run once to fetch and unzip the data (not re-run on every knit).
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)
download.file(url, "data/raw/Coursera-SwiftKey.zip", mode = "wb")
unzip("data/raw/Coursera-SwiftKey.zip", exdir = "data/raw")

# Packages used by the code in this report.
library(tidyverse)
library(tidytext)
library(scales)

data_dir <- "data/raw/final/en_US"
files <- c(Blogs = "en_US.blogs.txt",
News = "en_US.news.txt",
Twitter = "en_US.twitter.txt")
paths <- file.path(data_dir, files)
# Read a file and return it as a clean (valid-UTF-8) character vector of lines.
read_clean <- function(path) {
  con <- file(path, open = "r", encoding = "UTF-8")
  txt <- readLines(con, skipNul = TRUE, warn = FALSE)
  close(con)
  # Drop any bytes that are not valid UTF-8.
  iconv(txt, from = "UTF-8", to = "UTF-8", sub = "")
}
corpora <- lapply(paths, read_clean)
names(corpora) <- names(files)

For each file I record its size on disk, the number of lines, the total number of words, and the length of the longest single line.
summary_tbl <- tibble(
File = names(files),
`Size (MB)` = round(file.info(paths)$size / 2^20, 1),
Lines = map_int(corpora, length),
Words = map_dbl(corpora, ~ sum(str_count(.x, "\\S+"))),
`Mean words / line` = round(Words / Lines, 1),
`Longest line (chars)` = map_int(corpora, ~ max(nchar(.x)))
)
knitr::kable(
summary_tbl,
format.args = list(big.mark = ","),
caption = "Table 1. Summary of the three English corpora."
)

| File | Size (MB) | Lines | Words | Mean words / line | Longest line (chars) |
|---|---|---|---|---|---|
| Blogs | 200.4 | 899,288 | 37,334,131 | 41.5 | 40,833 |
| News | 196.3 | 1,010,206 | 34,371,031 | 34.0 | 11,384 |
| Twitter | 159.4 | 2,360,148 | 30,373,583 | 12.9 | 140 |
All three files are large: together about 556 MB and roughly 4.27 million lines. Twitter has by far the most lines (each tweet is short), while blogs and news have fewer but much longer lines.
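The Words column counts every run of non-whitespace characters as one word. The following is a minimal sketch of that counting rule on made-up lines. The helper name `count_words` and the toy data are assumptions for illustration, and base R is used only so the snippet runs without extra packages.

```r
# Made-up lines for illustration (not from the SwiftKey data).
toy_lines <- c("Hello, world!", "next word prediction", "")

# Hypothetical helper: count runs of non-whitespace characters per line.
count_words <- function(x) {
  lengths(regmatches(x, gregexpr("\\S+", x)))
}

words_per_line <- count_words(toy_lines)

# The same kind of per-file summary as in Table 1, on the toy data.
toy_summary <- c(
  lines   = length(toy_lines),
  words   = sum(words_per_line),
  longest = max(nchar(toy_lines))
)
print(toy_summary)
```

Note that punctuation attached to a word ("Hello,") is counted as part of that word, so this rule gives a quick size estimate rather than a linguistically exact token count.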
Because the files are so large, training and exploring on all of them at once is slow and memory-hungry. Following standard practice, I take a random 2% sample of lines from each file and combine them. This is more than enough to reveal the structure of the language while keeping the analysis fast and reproducible.
samp <- imap_dfr(corpora, function(lines, src) {
  # Keep each line independently with probability 0.02.
  keep <- rbinom(length(lines), size = 1, prob = 0.02) == 1
  tibble(source = src, text = lines[keep])
})
# Free the full corpora from memory now that we have the sample.
rm(corpora); invisible(gc())
sample_size <- nrow(samp)

The combined sample has 85,058 lines of text.
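As a sanity check on that figure, the sketch below works out how many lines a 2% coin-flip sample should contain, using the line counts from Table 1. It also shows how fixing a random seed makes such a draw repeatable. The seed value 2024 and the helper `draw` are assumptions for illustration; the report's own seed, if any, is not shown here.

```r
# Line counts taken from Table 1.
lines_per_file <- c(Blogs = 899288, News = 1010206, Twitter = 2360148)
p <- 0.02

n_total  <- sum(lines_per_file)            # total lines across the three files
expected <- n_total * p                    # mean of a Binomial(n_total, p)
std_dev  <- sqrt(n_total * p * (1 - p))    # its standard deviation
observed <- 85058                          # sample size reported above

# How many standard deviations the observed size is from the expected size.
z <- (observed - expected) / std_dev
cat("expected:", round(expected), " sd:", round(std_dev), " z:", round(z, 2), "\n")

# Hypothetical helper: the same seed gives the same keep/drop decisions.
draw <- function(seed, n = 1000) {
  set.seed(seed)
  rbinom(n, size = 1, prob = p) == 1
}
same_draw <- identical(draw(2024), draw(2024))
```

The observed sample size sits close to the expected value of roughly 85,400 lines, which is what independent 2% coin flips should produce.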
A first useful view is how long the lines are in each source. Twitter lines are short because tweets were capped at 140 characters at the time, while blog and news lines run longer.
samp %>%
mutate(words = str_count(text, "\\S+")) %>%
filter(words <= 60) %>%
ggplot(aes(words, fill = source)) +
geom_histogram(binwidth = 2, show.legend = FALSE) +
facet_wrap(~ source, scales = "free_y") +
labs(title = "Distribution of words per line, by source",
x = "Words per line", y = "Number of lines") +
theme_minimal()

I tokenise the sampled text into single words (unigrams), word pairs (bigrams) and word triples (trigrams), dropping tokens that contain digits. N-grams are the foundation of the prediction model: to guess the next word, the model looks at the last one or two words typed and asks “what most often comes next?”
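To make the “what most often comes next?” idea concrete, here is a minimal bigram sketch in base R. The toy sentences and the helper `predict_next` are made up for illustration and are not the model that will ship in the app; it only shows the counting idea on a tiny scale.

```r
# Toy corpus (made up for illustration; not from the SwiftKey data).
toy <- c("i want to go home", "i want to eat", "i want a coffee")

# Split each line into lower-case words.
tokens <- strsplit(tolower(toy), "\\s+")

# Build (previous word, next word) pairs within each line.
pairs <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Hypothetical helper: return the most frequent follower of `word`,
# or NA if the word was never seen as a first element of a pair.
predict_next <- function(word) {
  followers <- pairs$nxt[pairs$prev == word]
  if (length(followers) == 0) return(NA_character_)
  counts <- table(followers)
  names(counts)[which.max(counts)]
}

predict_next("want")   # "to" follows "want" twice, "a" only once
predict_next("zebra")  # unseen word, so no prediction
```

The unseen-word case is why a real model usually needs a fallback, such as dropping from trigrams to bigrams to the most common single words.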
unigrams <- samp %>%
unnest_tokens(word, text) %>%
filter(!str_detect(word, "[0-9]")) %>%
count(word, sort = TRUE)
bigrams <- samp %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
filter(!is.na(bigram), !str_detect(bigram, "[0-9]")) %>%
count(bigram, sort = TRUE)
trigrams <- samp %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
filter(!is.na(trigram), !str_detect(trigram, "[0-9]")) %>%
count(trigram, sort = TRUE)

top_plot <- function(df, col, title, fill) {
  df %>%
    slice_max(n, n = 15) %>%
    mutate(term = reorder(.data[[col]], n)) %>%
    ggplot(aes(n, term)) +
    geom_col(fill = fill) +
    scale_x_continuous(labels = comma) +
    labs(title = title, x = "Frequency", y = NULL) +
    theme_minimal()
}

As expected, the list is dominated by common English function words (“the”, “to”, “and”, …). These carry little meaning on their own but are essential for predicting fluent text, so I keep them in.
A small number of words accounts for most of the text. The chart below shows the cumulative coverage: how much of all word usage is explained as we add more of the most frequent unique words.
coverage <- unigrams %>%
mutate(rank = row_number(),
cum_cov = cumsum(n) / sum(n))
n50 <- coverage %>% filter(cum_cov >= 0.50) %>% slice(1) %>% pull(rank)
n90 <- coverage %>% filter(cum_cov >= 0.90) %>% slice(1) %>% pull(rank)
total_unique <- nrow(unigrams)
ggplot(coverage, aes(rank, cum_cov)) +
geom_line(colour = "#2c7fb8", linewidth = 1) +
geom_hline(yintercept = c(0.5, 0.9), linetype = "dashed", colour = "grey50") +
scale_y_continuous(labels = percent) +
scale_x_continuous(labels = comma) +
labs(title = "Cumulative word coverage",
x = "Number of unique words (most frequent first)",
y = "Share of all words covered") +
theme_minimal()

Only about 144 unique words are needed to cover 50% of all word usage, and about 7,284 to cover 90%, out of roughly 75,798 unique words in the sample. This matters a lot for the app: a relatively small dictionary can handle the vast majority of what users type, which keeps the model small and fast.
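The coverage figures come from a running total over words sorted by frequency. The sketch below shows that calculation on a made-up five-word vocabulary; the counts and the helper `words_needed` are assumptions for illustration.

```r
# Made-up word counts for illustration (100 words of text in total).
counts <- c(the = 50, to = 20, and = 15, cat = 10, sat = 5)

# Most frequent first, then the running share of all word usage.
counts  <- sort(counts, decreasing = TRUE)
cum_cov <- cumsum(counts) / sum(counts)

# Hypothetical helper: smallest number of top words reaching a target share.
words_needed <- function(target) {
  unname(which(cum_cov >= target)[1])
}

words_needed(0.5)  # 1: "the" alone covers half of the toy text
words_needed(0.9)  # 4: the top four words cover 95%
```

In this toy example one word out of five reaches 50% coverage, the same long-tail pattern that lets a small dictionary cover most real usage.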
Based on this exploration, my plan is:
Feedback I’d welcome: whether 2% sampling is enough, and the right trade-off between model size and prediction accuracy for the hosted app.
## R version 4.6.0 (2026-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_Guyana.utf8 LC_CTYPE=English_Guyana.utf8
## [3] LC_MONETARY=English_Guyana.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Guyana.utf8
##
## time zone: America/Guyana
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scales_1.4.0 tidytext_0.4.3 lubridate_1.9.5 forcats_1.0.1
## [5] stringr_1.6.0 dplyr_1.2.1 purrr_1.2.2 readr_2.2.0
## [9] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.3 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-5 gtable_0.3.6 jsonlite_2.0.0 janeaustenr_1.0.0
## [5] compiler_4.6.0 Rcpp_1.1.1-1.1 tidyselect_1.2.1 jquerylib_0.1.4
## [9] yaml_2.3.12 fastmap_1.2.0 lattice_0.22-9 R6_2.6.1
## [13] labeling_0.4.3 SnowballC_0.7.1 generics_0.1.4 knitr_1.51
## [17] bslib_0.11.0 pillar_1.11.1 RColorBrewer_1.1-3 tzdb_0.5.0
## [21] tokenizers_0.3.0 rlang_1.2.0 stringi_1.8.7 cachem_1.1.0
## [25] xfun_0.58 sass_0.4.10 S7_0.2.2 otel_0.2.0
## [29] timechange_0.4.0 cli_3.6.6 withr_3.0.2 magrittr_2.0.5
## [33] digest_0.6.39 grid_4.6.0 hms_1.1.4 lifecycle_1.0.5
## [37] vctrs_0.7.3 evaluate_1.0.5 glue_1.8.1 farver_2.1.2
## [41] rmarkdown_2.31 tools_4.6.0 pkgconfig_2.0.3 htmltools_0.5.9