Synopsis

This milestone report summarizes exploratory data analysis (EDA) on the English-language training text from the SwiftKey data bundle (blogs, news, and Twitter). The goal is to characterize the basic scale and structure of each source: file size, line counts, token counts per line, and simple lexical patterns.

The analysis reads partial samples from each file so the document knits in reasonable time on a laptop; full-file summaries that do not require loading every line (for example, total line counts via the shell on Unix-like systems) are computed when available.
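
As a rough illustration of the latter point, a total line count can be obtained without holding a whole file in memory, either by shelling out to wc -l on Unix-like systems or by reading the file in fixed-size chunks. The sketch below is illustrative only and is not the helper used in this report.

# Count lines without loading the full file: use `wc -l` when available,
# otherwise read the file in chunks and accumulate the count.
count_lines <- function(path, chunk_size = 100000L) {
  if (.Platform$OS.type == "unix" && nzchar(Sys.which("wc"))) {
    out <- system2("wc", c("-l", shQuote(path)), stdout = TRUE)
    return(as.integer(strsplit(trimws(out), "\\s+")[[1]][1]))
  }
  con <- file(path, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  n <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE, skipNul = TRUE)
    if (length(chunk) == 0L) break
    n <- n + length(chunk)
  }
  n
}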

Data sources and loading

Three plain-text corpora are read from Coursera-SwiftKey.zip at the project root (paths inside the archive: final/en_US/en_US.blogs.txt, final/en_US/en_US.news.txt, final/en_US/en_US.twitter.txt). If the zip file is absent, the same files are read from an extracted final/en_US/ folder.
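
The resolution logic lives in the sourced milestone modules; a minimal sketch of the fallback described above might look as follows, assuming the default archive and folder locations (the function and argument names here are illustrative, not the report's helpers).

# Open one corpus file either as a zip member or from the extracted folder.
open_corpus_connection <- function(name,
                                   zip_path = "Coursera-SwiftKey.zip",
                                   dir_path = file.path("final", "en_US")) {
  if (file.exists(zip_path)) {
    unz(zip_path, file.path("final", "en_US", name))  # stream the zip member
  } else {
    file(file.path(dir_path, name), encoding = "UTF-8")
  }
}

# Example: peek at a few blog lines.
con <- open_corpus_connection("en_US.blogs.txt")
readLines(con, n = 3L, warn = FALSE, skipNul = TRUE)
close(con)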

File                Description
en_US.blogs.txt     Personal blog posts
en_US.news.txt      News articles
en_US.twitter.txt   Twitter-like short messages
library(dplyr)
library(ggplot2)
library(scales)
library(knitr)

source("R/source_milestone_modules.R", encoding = "UTF-8")

swiftkey <- resolve_swiftkey_en()
assert_swiftkey_data_available(swiftkey)
# The number of lines sampled per source can be adjusted here.
sample_lines_per_file <- 60000L
file_sizes_mb <- corpus_file_inventory_from_spec(swiftkey)

kable(
  file_sizes_mb %>% select(source, path, size_mb, lines_total),
  digits = 2,
  caption = "English corpus: size as stored (zip entry or file on disk) and line totals (Unix: wc on files; zip members via unzip streaming)."
)
English corpus: size as stored (zip entry or file on disk) and line totals (Unix: wc on files; zip members via unzip streaming).

source    path                                                    size_mb   lines_total
blogs     Coursera-SwiftKey.zip::final/en_US/en_US.blogs.txt      200.42    NA
news      Coursera-SwiftKey.zip::final/en_US/en_US.news.txt       196.28    NA
twitter   Coursera-SwiftKey.zip::final/en_US/en_US.twitter.txt    159.36    NA
samples <- assemble_head_samples(swiftkey, sample_lines_per_file)

Summary statistics (sampled lines)

sum_tab <- summarize_line_samples(samples)

kable(
  sum_tab,
  digits = 2,
  caption = "Per-source summaries on the head sample (first N lines per file; see sampling note)."
)
Per-source summaries on the head sample (first N lines per file; see sampling note).

source    lines_sampled   mean_words   median_words   sd_words   mean_chars   q90_words   pct_empty
blogs             60000        41.36             28      44.40       229.07          95           0
news              60000        34.19             31      23.07       202.21          61           0
twitter           60000        12.85             12       6.91        68.54          23           0
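
The table is produced by summarize_line_samples(); a rough dplyr sketch of comparable per-line statistics is shown below. It assumes samples has one row per sampled line with columns source and text (those column names, and the use of stringr, are assumptions for illustration).

library(stringr)  # for str_count(); not loaded elsewhere in this report

# Per-line word and character counts on the sampled lines (illustrative).
line_stats <- samples %>%
  mutate(
    words = str_count(text, "\\S+"),  # whitespace-delimited word count
    chars = nchar(text)
  )

line_stats %>%
  group_by(source) %>%
  summarise(
    lines_sampled = n(),
    mean_words    = mean(words),
    median_words  = median(words),
    sd_words      = sd(words),
    mean_chars    = mean(chars),
    q90_words     = unname(quantile(words, 0.9)),
    pct_empty     = 100 * mean(nchar(trimws(text)) == 0)
  )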

The Twitter sample tends to have shorter lines (fewer words and characters per line) than blogs and news, consistent with the length constraints and informal style of microblogging compared with longer-form writing.

Distribution of words per line

plot_words_per_line_density(samples)
Distribution of words per line by source (sampled lines).

plot_words_per_line_box(samples)
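
The plotting helpers wrap ggplot2; a minimal sketch of a comparable density plot, reusing the illustrative line_stats frame from the summary sketch above, could look like this (the report's helpers may style it differently).

# Density of words per line on a log scale, excluding empty lines.
ggplot(filter(line_stats, words > 0),
       aes(x = words, colour = source, fill = source)) +
  geom_density(alpha = 0.25) +
  scale_x_log10(labels = scales::comma) +
  labs(
    title = "Words per line by source (sampled lines)",
    x = "Words per line (log scale)",
    y = "Density"
  )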

Characters per line

plot_chars_per_line_density(samples)
Characters per line by source (sampled).

Simple lexical exploration (sample only)

The corpora are large; here we tokenize the sampled lines into lowercase word tokens (keeping letters and apostrophes) and inspect the most frequent types.
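
The report's helper top_tokens_by_source(), called below, performs this step; as a rough sketch, such a tokenizer can be written with base R regular expressions (the samples column names are again assumed for illustration).

# Lowercase each line and keep runs of letters and apostrophes as tokens.
tokenize_simple <- function(lines) {
  lowered <- tolower(lines)
  unlist(regmatches(lowered, gregexpr("[a-z']+", lowered)), use.names = FALSE)
}

# Example: the 15 most frequent token types in the blog sample.
blog_tokens <- tokenize_simple(samples$text[samples$source == "blogs"])
head(sort(table(blog_tokens), decreasing = TRUE), 15L)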

top_tokens <- top_tokens_by_source(samples, k = 15L)
plot_top_tokens_faceted(top_tokens)

High-frequency items are dominated by function words (articles, pronouns, prepositions). For prediction modeling, these frequent contexts are important for producing fluent phrase completions; rare and offensive tokens still need to be handled deliberately when cleaning the data and filtering training statistics.

Limitations of this EDA

  • Sampling: Statistics on words per line and token frequencies use the first 60000 lines of each file (and token plots subset further). Tail behavior and rare phenomena may differ in unseen portions of the files.
  • Encoding: Text is read as UTF-8; occasional malformed bytes may be skipped by readLines.
  • Token definition: The regex tokenizer is intentionally simple; it does not normalize hashtags, URLs, or punctuation-rich entities the way a production tokenizer might.