Synopsis

This milestone report summarizes exploratory data analysis (EDA) on the English-language training text from the SwiftKey data bundle (blogs, news, and Twitter). The goal is to characterize the basic scale and structure of each source: file size, line counts, token counts per line, and simple lexical patterns.

The analysis reads partial samples from each file so the document knits in reasonable time on a laptop; full-file summaries that do not require loading every line (for example, total line counts via the shell on Unix-like systems) are computed when available.
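
As a rough illustration of the latter point, a total line count can be obtained without holding a whole file in memory, either by shelling out to wc -l on Unix-like systems or by reading the file in fixed-size chunks. The sketch below is illustrative only and is not the helper used in this report.

# Count lines without loading the full file: use `wc -l` when available,
# otherwise read the file in chunks and accumulate the count.
count_lines <- function(path, chunk_size = 100000L) {
  if (.Platform$OS.type == "unix" && nzchar(Sys.which("wc"))) {
    out <- system2("wc", c("-l", shQuote(path)), stdout = TRUE)
    return(as.integer(strsplit(trimws(out), "\\s+")[[1]][1]))
  }
  con <- file(path, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  n <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE, skipNul = TRUE)
    if (length(chunk) == 0L) break
    n <- n + length(chunk)
  }
  n
}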

Data sources and loading

Three plain-text corpora are read from Coursera-SwiftKey.zip at the project root (paths inside the archive: final/en_US/en_US.blogs.txt, final/en_US/en_US.news.txt, final/en_US/en_US.twitter.txt). If the zip file is absent, the same files are read from an extracted final/en_US/ folder.
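
The resolution logic lives in the sourced milestone modules; a minimal sketch of the fallback described above might look as follows, assuming the default archive and folder locations (the function and argument names here are illustrative, not the report's helpers).

# Open one corpus file either as a zip member or from the extracted folder.
open_corpus_connection <- function(name,
                                   zip_path = "Coursera-SwiftKey.zip",
                                   dir_path = file.path("final", "en_US")) {
  if (file.exists(zip_path)) {
    unz(zip_path, file.path("final", "en_US", name))  # stream the zip member
  } else {
    file(file.path(dir_path, name), encoding = "UTF-8")
  }
}

# Example: peek at a few blog lines.
con <- open_corpus_connection("en_US.blogs.txt")
readLines(con, n = 3L, warn = FALSE, skipNul = TRUE)
close(con)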

File                Description
en_US.blogs.txt     Personal blog posts
en_US.news.txt      News articles
en_US.twitter.txt   Twitter-like short messages
library(dplyr)
library(ggplot2)
library(scales)
library(knitr)

source("R/source_milestone_modules.R", encoding = "UTF-8")

swiftkey <- resolve_swiftkey_en()
assert_swiftkey_data_available(swiftkey)
# The number of lines sampled per source can be adjusted here.
sample_lines_per_file <- 60000L
file_sizes_mb <- corpus_file_inventory_from_spec(swiftkey)

kable(
  file_sizes_mb %>% select(source, path, size_mb, lines_total),
  digits = 2,
  caption = "English corpus: size as stored (zip entry or file on disk) and line totals (Unix: wc on files; zip members via unzip streaming)."
)
English corpus: size as stored (zip entry or file on disk) and line totals (Unix: wc on files; zip members via unzip streaming).

source    path                                                    size_mb   lines_total
blogs     Coursera-SwiftKey.zip::final/en_US/en_US.blogs.txt      200.42    NA
news      Coursera-SwiftKey.zip::final/en_US/en_US.news.txt       196.28    NA
twitter   Coursera-SwiftKey.zip::final/en_US/en_US.twitter.txt    159.36    NA
samples <- assemble_head_samples(swiftkey, sample_lines_per_file)

Summary statistics (sampled lines)

sum_tab <- summarize_line_samples(samples)

kable(
  sum_tab,
  digits = 2,
  caption = "Per-source summaries on the head sample (first N lines per file; see sampling note)."
)
Per-source summaries on the head sample (first N lines per file; see sampling note).

source    lines_sampled   mean_words   median_words   sd_words   mean_chars   q90_words   pct_empty
blogs             60000        41.36             28      44.40       229.07          95           0
news              60000        34.19             31      23.07       202.21          61           0
twitter           60000        12.85             12       6.91        68.54          23           0
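
The table is produced by summarize_line_samples(); a rough dplyr sketch of comparable per-line statistics is shown below. It assumes samples has one row per sampled line with columns source and text (those column names, and the use of stringr, are assumptions for illustration).

library(stringr)  # for str_count(); not loaded elsewhere in this report

# Per-line word and character counts on the sampled lines (illustrative).
line_stats <- samples %>%
  mutate(
    words = str_count(text, "\\S+"),  # whitespace-delimited word count
    chars = nchar(text)
  )

line_stats %>%
  group_by(source) %>%
  summarise(
    lines_sampled = n(),
    mean_words    = mean(words),
    median_words  = median(words),
    sd_words      = sd(words),
    mean_chars    = mean(chars),
    q90_words     = unname(quantile(words, 0.9)),
    pct_empty     = 100 * mean(nchar(trimws(text)) == 0)
  )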

The Twitter sample tends to have shorter lines (fewer words and characters per line) than blogs and news, consistent with the length constraints and informal style of microblogging compared with longer-form writing.

Distribution of words per line

plot_words_per_line_density(samples)
Distribution of words per line by source (sampled lines).

plot_words_per_line_box(samples)
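
The plotting helpers wrap ggplot2; a minimal sketch of a comparable density plot, reusing the illustrative line_stats frame from the summary sketch above, could look like this (the report's helpers may style it differently).

# Density of words per line on a log scale, excluding empty lines.
ggplot(filter(line_stats, words > 0),
       aes(x = words, colour = source, fill = source)) +
  geom_density(alpha = 0.25) +
  scale_x_log10(labels = scales::comma) +
  labs(
    title = "Words per line by source (sampled lines)",
    x = "Words per line (log scale)",
    y = "Density"
  )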

Characters per line

plot_chars_per_line_density(samples)
Characters per line by source (sampled).

Simple lexical exploration (sample only)

The corpora are large; here we tokenize the sampled lines into lowercase word tokens (keeping letters and apostrophes) and inspect the most frequent types.
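
The report's helper top_tokens_by_source(), called below, performs this step; as a rough sketch, such a tokenizer can be written with base R regular expressions (the samples column names are again assumed for illustration).

# Lowercase each line and keep runs of letters and apostrophes as tokens.
tokenize_simple <- function(lines) {
  lowered <- tolower(lines)
  unlist(regmatches(lowered, gregexpr("[a-z']+", lowered)), use.names = FALSE)
}

# Example: the 15 most frequent token types in the blog sample.
blog_tokens <- tokenize_simple(samples$text[samples$source == "blogs"])
head(sort(table(blog_tokens), decreasing = TRUE), 15L)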

top_tokens <- top_tokens_by_source(samples, k = 15L)
plot_top_tokens_faceted(top_tokens)

High-frequency items are dominated by function words (articles, pronouns, prepositions). For prediction modeling, these frequent contexts are important for producing fluent phrase completions; rare and offensive tokens still need to be handled deliberately when cleaning the data and filtering training statistics.

Limitations of this EDA

  • Sampling: Statistics on words per line and token frequencies use the first 60000 lines of each file (and token plots subset further). Tail behavior and rare phenomena may differ in unseen portions of the files.
  • Encoding: Text is read as UTF-8; occasional malformed bytes may be skipped by readLines.
  • Token definition: The regex tokenizer is intentionally simple; it does not normalize hashtags, URLs, or punctuation-rich entities the way a production tokenizer might.