This milestone report summarizes exploratory data analysis (EDA) on the English-language training text from the SwiftKey data bundle (blogs, news, and Twitter). The goals are to characterize the basic scale and structure of each source: file size, line counts, token counts per line, and simple lexical patterns.
The analysis reads partial samples from each file so the document knits in reasonable time on a laptop; full-file summaries that do not require loading every line (for example, total line counts via the shell on Unix-like systems) are computed when available.
Three plain-text corpora are read from Coursera-SwiftKey.zip at the project root (paths inside the archive: final/en_US/en_US.blogs.txt, final/en_US/en_US.news.txt, and final/en_US/en_US.twitter.txt). If the zip file is absent, the same files are read from an extracted final/en_US/ folder.
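The resolution itself is handled by resolve_swiftkey_en() from the sourced modules. A minimal sketch of the zip-or-folder logic it implies is below; the helper name and return shape here are our assumptions, and the real function may differ.

```r
# Hypothetical sketch of zip-or-folder resolution; the real
# resolve_swiftkey_en() in R/source_milestone_modules.R may differ.
resolve_swiftkey_en_sketch <- function(root = ".") {
  zip_path <- file.path(root, "Coursera-SwiftKey.zip")
  members  <- file.path("final/en_US",
                        c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
  if (file.exists(zip_path)) {
    # Read straight from the archive; unz() can open one member as a connection.
    list(mode = "zip", zip = zip_path, paths = members)
  } else {
    # Fall back to an already-extracted final/en_US/ folder.
    list(mode = "dir", zip = NA_character_, paths = file.path(root, members))
  }
}
```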
| File | Description |
|---|---|
| en_US.blogs.txt | Personal blog posts |
| en_US.news.txt | News articles |
| en_US.twitter.txt | Twitter-like short messages |
```r
library(dplyr)
library(ggplot2)
library(scales)
library(knitr)

source("R/source_milestone_modules.R", encoding = "UTF-8")
swiftkey <- resolve_swiftkey_en()
assert_swiftkey_data_available(swiftkey)

file_sizes_mb <- corpus_file_inventory_from_spec(swiftkey)
```
```r
kable(
  file_sizes_mb %>% select(source, path, size_mb, lines_total),
  digits = 2,
  caption = "English corpus: size as stored (zip entry or file on disk) and line totals (Unix: wc on files; zip members via unzip streaming)."
)
```

| source | path | size_mb | lines_total |
|---|---|---|---|
| blogs | Coursera-SwiftKey.zip::final/en_US/en_US.blogs.txt | 200.42 | NA |
| news | Coursera-SwiftKey.zip::final/en_US/en_US.news.txt | 196.28 | NA |
| twitter | Coursera-SwiftKey.zip::final/en_US/en_US.twitter.txt | 159.36 | NA |
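As the caption notes, full line totals can be computed without loading every line at once. A hedged sketch of that idea (the helper name is ours; the report's module may use `wc -l` via the shell instead):

```r
# Hypothetical helper: count lines by streaming a plain file or a zip
# member in chunks, so nothing is loaded fully into memory.
# On Unix, `wc -l` gives the same total for files already on disk.
count_lines_streaming <- function(path, zip = NULL, chunk = 100000L) {
  con <- if (is.null(zip)) file(path, "r") else unz(zip, path)
  if (!is.null(zip)) open(con, "r")
  on.exit(close(con))
  total <- 0L
  repeat {
    got <- length(readLines(con, n = chunk, warn = FALSE))
    total <- total + got
    if (got < chunk) break
  }
  total
}

# e.g. count_lines_streaming("final/en_US/en_US.blogs.txt",
#                            zip = "Coursera-SwiftKey.zip")
```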
```r
# `samples` holds the head-sampled lines per source (first N lines of each
# file), produced earlier by the sourced sampling helpers.
sum_tab <- summarize_line_samples(samples)
kable(
  sum_tab,
  digits = 2,
  caption = "Per-source summaries on the head sample (first N lines per file; see sampling note)."
)
```

| source | lines_sampled | mean_words | median_words | sd_words | mean_chars | q90_words | pct_empty |
|---|---|---|---|---|---|---|---|
| blogs | 60000 | 41.36 | 28 | 44.40 | 229.07 | 95 | 0 |
| news | 60000 | 34.19 | 31 | 23.07 | 202.21 | 61 | 0 |
| twitter | 60000 | 12.85 | 12 | 6.91 | 68.54 | 23 | 0 |
The Twitter file tends to produce shorter lines (fewer words and characters per line) than blogs and news, which aligns with length constraints and informal style on microblogging platforms versus longer-form writing.
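For reference, here is a minimal sketch of the kind of per-line computation behind the table above. summarize_line_samples() comes from the sourced modules and may differ; the sketch assumes `samples` is a data frame with `source` and `line` columns, both assumptions of ours.

```r
# Hypothetical reimplementation sketch; the column names `source` and
# `line` are assumptions about the shape of `samples`.
count_words <- function(x) {
  m <- gregexpr("\\S+", x)   # runs of non-whitespace
  vapply(m, function(g) if (g[1] == -1L) 0L else length(g), integer(1))
}

summarize_line_samples_sketch <- function(samples) {
  samples %>%
    mutate(words = count_words(line), chars = nchar(line)) %>%
    group_by(source) %>%
    summarise(
      lines_sampled = n(),
      mean_words    = mean(words),
      median_words  = median(words),
      sd_words      = sd(words),
      mean_chars    = mean(chars),
      q90_words     = unname(quantile(words, 0.90)),
      pct_empty     = 100 * mean(chars == 0)
    )
}
```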
Figure: distribution of words per line by source (sampled lines).
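A hedged sketch of how such a figure could be drawn from `samples` (same assumed columns, plus count_words() from the sketch above):

```r
# Sketch of the words-per-line distribution plot; assumes `samples` has
# `source` and `line` columns (our assumption, not the report's API).
samples %>%
  mutate(words = count_words(line)) %>%
  ggplot(aes(x = words, fill = source)) +
  geom_histogram(bins = 60, show.legend = FALSE) +
  scale_x_continuous(labels = comma) +   # scales::comma for readable ticks
  facet_wrap(~ source, scales = "free_y") +
  labs(x = "Words per line", y = "Sampled lines")
```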
The corpora are large; here we tokenize sampled lines into lowercase word tokens (letters and apostrophes kept) and inspect the most frequent types.
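A minimal sketch of that tokenization follows; the report's actual tokenizer lives in the sourced modules, and the helper name here is ours.

```r
# Lowercase word tokens: runs of letters and apostrophes; everything else
# acts as a separator. Returns the n_top most frequent types.
top_word_types <- function(lines, n_top = 20L) {
  lowered <- tolower(lines)
  tokens  <- unlist(regmatches(lowered, gregexpr("[a-z']+", lowered)))
  head(sort(table(tokens), decreasing = TRUE), n_top)
}
```

For example, top_word_types(samples$line[samples$source == "twitter"]) would list the most frequent types in the Twitter sample, again assuming the `samples` shape used above.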
High-frequency items are dominated by function words (articles, pronouns, prepositions). For prediction modeling, these frequent contexts matter for smooth phrase completions; rare and offensive tokens should still be handled deliberately when cleaning and filtering training statistics.
Sampling note: head samples are read with readLines, taking the first N lines of each file (or zip member) so the document knits quickly on a laptop.
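A sketch of that head sampling under the same zip-or-folder assumptions as above (helper name and defaults are ours):

```r
# Hypothetical head sampler: first n lines from a file on disk or from a
# zip member, mirroring the "first N lines per file" note above.
read_head_sample <- function(path, n = 60000L, zip = NULL) {
  con <- if (is.null(zip)) file(path, "r") else unz(zip, path)
  if (!is.null(zip)) open(con, "r")
  on.exit(close(con))
  readLines(con, n = n, warn = FALSE, encoding = "UTF-8")
}
```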