This report presents an exploratory data analysis (EDA) of the HC Corpora English dataset, provided as part of the Johns Hopkins Data Science Capstone project on Coursera. The dataset consists of text drawn from three internet sources: blog posts, news articles, and Twitter posts.
The goals of this milestone are to:
Key findings:
The raw dataset comprises three plain-text files. We use
file.info() to retrieve sizes without loading any content
into memory.
files <- c(
Blogs = "en_US.blogs.txt",
News = "en_US.news.txt",
Twitter = "en_US.twitter.txt"
)
full_paths <- file.path(DATA_PATH, files)
file_info <- tibble(
Source = names(files),
Filename = unname(files),
`Size (MB)` = round(file.info(full_paths)$size / 1e6, 1)
)
kable(file_info, caption = "Raw Data File Sizes") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
column_spec(1, bold = TRUE)

| Source | Filename | Size (MB) |
|---|---|---|
| Blogs | en_US.blogs.txt | 210.2 |
| News | en_US.news.txt | 205.8 |
| Twitter | en_US.twitter.txt | 167.1 |
R.utils::countLines() counts line endings without
loading the file into RAM — critical for files this large.
line_counts <- sapply(full_paths, countLines)
line_count_df <- tibble(
Source = names(files),
`Total Lines` = formatC(as.integer(line_counts), format = "d", big.mark = ",")
)
kable(line_count_df, caption = "Total Line Counts (Full Corpus)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
column_spec(1, bold = TRUE)

| Source | Total Lines |
|---|---|
| Blogs | 899,288 |
| News | 1,010,242 |
| Twitter | 2,360,148 |
Loading all three files entirely would require 2–4 GB of RAM and over 20 minutes of processing time. Instead, we draw a reproducible random sample:
The first 50,000 lines of each file are read, and 20,000 of them are kept at random (set.seed(12345) for reproducibility).
This is standard practice in corpus linguistics. Natural language frequency distributions stabilise quickly: a 20,000-line sample captures the dominant patterns of a 900,000-line file with high fidelity.
read_sample <- function(filepath, read_n = 50000, keep_n = 20000) {
con <- file(filepath, open = "rb")
lines <- readLines(con, n = read_n, encoding = "UTF-8", skipNul = TRUE)
close(con)
sample(lines, size = min(keep_n, length(lines)))
}
blogs_sample <- read_sample(file.path(DATA_PATH, "en_US.blogs.txt"))
news_sample <- read_sample(file.path(DATA_PATH, "en_US.news.txt"))
twitter_sample <- read_sample(file.path(DATA_PATH, "en_US.twitter.txt"))
cat("Sample sizes — Blogs:", length(blogs_sample),
"| News:", length(news_sample),
"| Twitter:", length(twitter_sample))
## Sample sizes — Blogs: 20000 | News: 20000 | Twitter: 20000
The table below summarises key metrics computed on the sampled data.
compute_stats <- function(lines, source_name) {
word_counts <- str_count(lines, "\\S+")
char_counts <- nchar(lines, type = "chars")
tibble(
Source = source_name,
`Lines Sampled` = formatC(length(lines), format = "d", big.mark = ","),
`Total Words` = formatC(sum(word_counts), format = "d", big.mark = ","),
`Avg Words / Line` = round(mean(word_counts), 1),
`Median Words / Line` = round(median(word_counts), 1),
`Avg Chars / Line` = round(mean(char_counts), 1)
)
}
stats_table <- bind_rows(
compute_stats(blogs_sample, "Blogs"),
compute_stats(news_sample, "News"),
compute_stats(twitter_sample, "Twitter")
)
kable(stats_table,
caption = "Summary Statistics from Sampled Data",
align = c("l","r","r","r","r","r")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) %>%
column_spec(1, bold = TRUE)

| Source | Lines Sampled | Total Words | Avg Words / Line | Median Words / Line | Avg Chars / Line |
|---|---|---|---|---|---|
| Blogs | 20,000 | 826,420 | 41.3 | 28 | 228.9 |
| News | 20,000 | 684,760 | 34.2 | 31 | 202.3 |
| Twitter | 20,000 | 256,134 | 12.8 | 12 | 68.4 |
Observations:
We combine all samples into a single tidy data frame and tokenize
using the tidytext package.
corpus_df <- bind_rows(
tibble(text = blogs_sample, source = "Blogs"),
tibble(text = news_sample, source = "News"),
tibble(text = twitter_sample, source = "Twitter")
) %>%
mutate(line_id = row_number())
After removing common stop words (e.g., “the”, “a”, “is”) to surface meaningful vocabulary:
unigrams <- corpus_df %>%
unnest_tokens(word, text) %>%
anti_join(stop_words, by = "word") %>%
filter(str_detect(word, "^[a-z']{2,}$")) %>%
count(source, word, sort = TRUE)
top_unigrams <- unigrams %>%
group_by(source) %>%
slice_max(n, n = 20, with_ties = FALSE) %>%
ungroup()
top_unigrams %>%
mutate(word = reorder_within(word, n, source)) %>%
ggplot(aes(x = word, y = n, fill = source)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ source, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_fill_brewer(palette = "Set2") +
scale_y_continuous(labels = comma) +
labs(
title = "Top 20 Most Frequent Words by Source",
subtitle = "Stop words removed; alphabetic tokens only",
x = NULL,
y = "Frequency in Sample"
) +
theme_minimal(base_size = 11) +
theme(strip.text = element_text(face = "bold", size = 12))
Two-word sequences reveal common phrases that single words cannot capture.
bigrams <- corpus_df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, into = c("word1", "word2"), sep = " ") %>%
filter(
!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
str_detect(word1, "^[a-z']{2,}$"),
str_detect(word2, "^[a-z']{2,}$")
) %>%
unite(bigram, word1, word2, sep = " ") %>%
count(source, bigram, sort = TRUE)
top_bigrams <- bigrams %>%
group_by(source) %>%
slice_max(n, n = 15, with_ties = FALSE) %>%
ungroup()
top_bigrams %>%
mutate(bigram = reorder_within(bigram, n, source)) %>%
ggplot(aes(x = bigram, y = n, fill = source)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ source, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_fill_brewer(palette = "Pastel1") +
labs(
title = "Top 15 Bigrams by Source",
subtitle = "Two-word phrases with stop words removed",
x = NULL,
y = "Frequency in Sample"
) +
theme_minimal(base_size = 11) +
theme(strip.text = element_text(face = "bold", size = 12))
trigrams <- corpus_df %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
filter(!is.na(trigram)) %>%  # lines with fewer than three words yield NA
count(source, trigram, sort = TRUE) %>%
group_by(source) %>%
slice_max(n, n = 15, with_ties = FALSE) %>%
ungroup()
trigrams %>%
mutate(trigram = reorder_within(trigram, n, source)) %>%
ggplot(aes(x = trigram, y = n, fill = source)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ source, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
scale_fill_brewer(palette = "Set1") +
labs(
title = "Top 15 Trigrams by Source",
subtitle = "Three-word phrases (including stop words)",
x = NULL,
y = "Frequency in Sample"
) +
theme_minimal(base_size = 10) +
theme(strip.text = element_text(face = "bold", size = 12))
Most English words are 3–8 characters long. Twitter skews slightly shorter due to abbreviations and informal language.
corpus_df %>%
unnest_tokens(word, text) %>%
filter(str_detect(word, "^[a-z]+$")) %>%
mutate(word_len = nchar(word)) %>%
filter(word_len >= 1, word_len <= 20) %>%
ggplot(aes(x = word_len, fill = source)) +
geom_histogram(binwidth = 1, color = "white", alpha = 0.85) +
facet_wrap(~ source, nrow = 1) +
scale_fill_brewer(palette = "Set2") +
scale_y_continuous(labels = comma) +
labs(
title = "Distribution of Word Lengths by Source",
subtitle = "Most English words are 3–8 characters long",
x = "Word Length (number of characters)",
y = "Count"
) +
theme_minimal(base_size = 11) +
theme(legend.position = "none",
strip.text = element_text(face = "bold"))
corpus_df %>%
mutate(line_len = nchar(text)) %>%
filter(line_len > 0, line_len <= 600) %>%
ggplot(aes(x = line_len, fill = source, color = source)) +
geom_density(alpha = 0.45, linewidth = 0.9) +
scale_fill_brewer(palette = "Set2") +
scale_color_brewer(palette = "Set2") +
scale_x_continuous(labels = comma) +
labs(
title = "Line Length Distribution by Source",
subtitle = "Twitter clusters at short lengths; blogs spread widely",
x = "Characters per Line",
y = "Density",
fill = "Source",
color = "Source"
) +
theme_minimal(base_size = 11)
How many unique words do we need to cover X% of all word usage? This is one of the most important questions for designing an efficient prediction model.
all_unigrams <- corpus_df %>%
unnest_tokens(word, text) %>%
filter(str_detect(word, "^[a-z']{2,}$")) %>%
count(word, sort = TRUE) %>%
mutate(
rank = row_number(),
cumulative_pct = cumsum(n) / sum(n) * 100
)
cover_50 <- all_unigrams %>% filter(cumulative_pct >= 50) %>% slice(1) %>% pull(rank)
cover_90 <- all_unigrams %>% filter(cumulative_pct >= 90) %>% slice(1) %>% pull(rank)
all_unigrams %>%
filter(rank <= 15000) %>%
ggplot(aes(x = rank, y = cumulative_pct)) +
geom_line(color = "#2C7BB6", linewidth = 1) +
geom_hline(yintercept = 50, linetype = "dashed", color = "#D7191C", linewidth = 0.8) +
geom_hline(yintercept = 90, linetype = "dashed", color = "#1A9641", linewidth = 0.8) +
geom_vline(xintercept = cover_50, linetype = "dotted", color = "#D7191C") +
geom_vline(xintercept = cover_90, linetype = "dotted", color = "#1A9641") +
annotate("label",
x = cover_50 + 700, y = 43,
label = paste0("50% coverage\ntop ",
formatC(cover_50, format = "d", big.mark = ","), " words"),
color = "#D7191C", size = 3.5, fill = "white") +
annotate("label",
x = cover_90 + 700, y = 83,
label = paste0("90% coverage\ntop ",
formatC(cover_90, format = "d", big.mark = ","), " words"),
color = "#1A9641", size = 3.5, fill = "white") +
labs(
title = "Vocabulary Coverage Curve (Zipf's Law)",
subtitle = "A small vocabulary covers the vast majority of all word usage",
x = "Number of Unique Words (ranked by frequency)",
y = "Cumulative % of All Word Instances"
) +
theme_minimal(base_size = 11)
Interpretation: The top 162 unique words account for 50% of all word instances in the sample, and the top 7,598 words cover 90%. This heavy-tailed behaviour (Zipf’s Law) is observed across human languages and has a direct practical consequence: we can prune our prediction model’s vocabulary aggressively without significantly hurting accuracy.
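The coverage figures quoted above come down to a cumulative sum over frequency-ranked counts. As a minimal worked sketch in base R, the helper `words_for_coverage()` and the toy counts below are hypothetical illustrations, not values from the corpus:

```r
# Smallest number of top-ranked words whose counts reach `target`
# (a share between 0 and 1) of all word instances.
words_for_coverage <- function(counts, target) {
  sorted <- sort(counts, decreasing = TRUE)
  cum_share <- cumsum(sorted) / sum(sorted)
  unname(which(cum_share >= target)[1])
}

# Toy counts (made up): cumulative shares are 0.50, 0.70, 0.85, 0.95, 1.00
toy_counts <- c(the = 50, of = 20, cat = 15, sat = 10, mat = 5)

half   <- words_for_coverage(toy_counts, 0.5)  # 1 word reaches 50%
ninety <- words_for_coverage(toy_counts, 0.9)  # 4 words reach 90%
print(c(half = half, ninety = ninety))
```

Applied to a real frequency table, the same function gives a cut-off rank below which words could be mapped to an unknown-word token; whether that pruning preserves accuracy would still need to be checked on held-out data.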
A visual summary of the most frequent content words across all three sources.
wc_data <- all_unigrams %>%
anti_join(stop_words, by = "word") %>%
filter(str_detect(word, "^[a-z]{3,}$")) %>%
slice_max(n, n = 200) %>%
select(word, freq = n)
wordcloud2(data = wc_data,
size = 0.55,
color = "random-dark",
backgroundColor = "white",
rotateRatio = 0.3)
The corpus was sourced from the open web and contains a small fraction of non-English tokens mixed into the English text. For the EDA, these are removed by restricting tokens to ^[a-z']+$ (ASCII lowercase letters and apostrophes only).
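To make the effect of this filter concrete, here is a small base R sketch. The token vector is hypothetical, and it uses `grepl(..., perl = TRUE)` rather than the `str_detect()` calls used elsewhere in the report, so that `[a-z]` is interpreted as the ASCII range regardless of locale:

```r
# Hypothetical tokens, already lowercased (not taken from the corpus).
# Non-ASCII characters are written as \u escapes to keep the source ASCII-only.
tokens <- c("don't", "caf\u00e9", "hello", "123", "na\u00efve", "world",
            "\u65e5\u672c\u8a9e")

# Keep only tokens made entirely of ASCII lowercase letters and apostrophes
is_ascii_word <- grepl("^[a-z']+$", tokens, perl = TRUE)
kept <- tokens[is_ascii_word]
print(kept)
```

Note the trade-off: accented English borrowings such as "café" are dropped along with genuinely foreign tokens, which is acceptable for an exploratory pass but may deserve a gentler rule in production.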
The production preprocessing pipeline will:
Convert curly quotes (“ ”) to straight quotes (")
The prediction app is intended for general audiences. Profanity will be handled by:
the lexicon R package’s profanity_alvarez dataset
The next-word predictor will use a trigram language model with Stupid Backoff (Brants et al., 2007). This approach is:
Algorithm (plain English):
Given the user’s typed text, look at the last 2 words. Search the trigram frequency table for phrases that start with those 2 words. If good matches exist, return the top 3 completions. If not, fall back to the last 1 word and the bigram table (with a small score penalty). If still no match, fall back to the most common words in the corpus.
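The cascade described above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the final implementation: the three toy count tables and the function names are hypothetical, the backoff factor 0.4 is the value suggested by Brants et al. (2007), and scores are normalised over the stored continuations of a prefix rather than over the prefix's own corpus count.

```r
# Toy n-gram count tables (made-up data, for illustration only)
toy_trigrams <- data.frame(
  prefix = c("i love", "i love", "thank you"),
  word   = c("you", "it", "so"),
  n      = c(8, 2, 5),
  stringsAsFactors = FALSE
)
toy_bigrams <- data.frame(
  prefix = c("love", "love", "you", "you"),
  word   = c("you", "this", "are", "so"),
  n      = c(10, 4, 6, 5),
  stringsAsFactors = FALSE
)
toy_unigrams <- data.frame(
  word = c("the", "to", "and", "you"),
  n    = c(100, 80, 70, 40),
  stringsAsFactors = FALSE
)

ALPHA <- 0.4  # backoff penalty applied once per backoff step

# Score all stored continuations of `prefix`; NULL when the prefix is unseen
score_prefix <- function(tbl, prefix, penalty) {
  hits <- tbl[tbl$prefix == prefix, , drop = FALSE]
  if (nrow(hits) == 0) return(NULL)
  data.frame(word = hits$word, score = penalty * hits$n / sum(hits$n),
             stringsAsFactors = FALSE)
}

# Return up to k predicted next words for the typed text
predict_next <- function(text, k = 3) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  words <- words[nzchar(words)]

  result <- NULL
  if (length(words) >= 2) {            # try the trigram table first
    result <- score_prefix(toy_trigrams,
                           paste(tail(words, 2), collapse = " "), 1)
  }
  if (is.null(result) && length(words) >= 1) {  # back off to bigrams
    result <- score_prefix(toy_bigrams, tail(words, 1), ALPHA)
  }
  if (is.null(result)) {               # final fallback: most common words
    result <- data.frame(word = toy_unigrams$word,
                         score = ALPHA^2 * toy_unigrams$n / sum(toy_unigrams$n),
                         stringsAsFactors = FALSE)
  }
  result <- result[order(-result$score), , drop = FALSE]
  head(result$word, k)
}

print(predict_next("I love"))          # trigram hit
print(predict_next("we really love"))  # backs off to the bigram table
print(predict_next("xyzzy"))           # backs off to the unigram table
```

With keyed data.table objects the row filter in `score_prefix()` would become a binary-search lookup, but the control flow would stay the same.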
Implementation steps:
data.table objects keyed on prefix words for fast lookup
saveRDS(..., compress = "xz"); target < 50 MB
Evaluation:
| Metric | Description |
|---|---|
| Perplexity | Lower = better language model (measured on held-out 10% test set) |
| Top-1 accuracy | % of times the correct next word is the #1 prediction |
| Top-3 accuracy | % of times the correct next word appears in top 3 predictions |
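The two accuracy metrics in the table reduce to a membership test per held-out case. A minimal sketch, assuming predictions are stored as a list of ranked character vectors; the function name and the example data are hypothetical:

```r
# Share of cases in which the true next word is among the first k predictions
top_k_accuracy <- function(predictions, truth, k) {
  hits <- mapply(function(pred, actual) actual %in% head(pred, k),
                 predictions, truth)
  mean(hits)
}

# Made-up ranked predictions for four held-out contexts
preds <- list(
  c("you", "it", "the"),
  c("day", "time", "year"),
  c("of", "to", "in"),
  c("be", "have", "do")
)
truth <- c("you", "time", "at", "do")

top1 <- top_k_accuracy(preds, truth, k = 1)  # only the first case hits: 0.25
top3 <- top_k_accuracy(preds, truth, k = 3)  # three of four cases hit: 0.75
print(c(top1 = top1, top3 = top3))
```

Perplexity is left out of this sketch because it needs a normalised probability for each held-out word, which raw backoff scores do not provide directly.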
The app will present a minimal, mobile-friendly interface:
| Component | Choice | Reason |
|---|---|---|
| Framework | Shiny + shinythemes | Standard R web framework; easy deployment |
| Reactivity | debounce() (300 ms delay) | Avoids prediction on every single keystroke |
| Model loading | readRDS() at server startup | Model persists in memory; fast per-request prediction |
| Input pipeline | lowercase → strip punctuation → last 2 words | Matches model training preprocessing |
| Deployment | shinyapps.io free tier | Sufficient for course submission (~25 active hours/month) |
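The input pipeline row can be illustrated with a short base R helper. This is a sketch only: `prepare_input()` is a hypothetical name, and keeping apostrophes is an assumption made here to mirror the token pattern used in the EDA.

```r
# Normalise typed text and return its last `n` words
prepare_input <- function(text, n = 2) {
  cleaned <- tolower(text)
  # Replace everything except ASCII letters, apostrophes and spaces with a space
  cleaned <- gsub("[^a-z' ]", " ", cleaned, perl = TRUE)
  words <- strsplit(trimws(cleaned), " +")[[1]]
  words <- words[nzchar(words)]
  tail(words, n)
}

print(prepare_input("Hello, World! How are YOU"))  # last two words, lowercased
print(prepare_input("Thanks"))                     # fewer than two words available
```

In the app, the character vector returned here would be the lookup key for the trigram table (two words) or the bigram table (one word).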
This exploratory analysis of the HC Corpora English dataset reveals several patterns that directly inform the design of the prediction system:
Scale is manageable with sampling. The roughly 583 MB corpus contains millions of lines, but a 20,000-line sample per source captures the dominant statistical patterns effectively.
Source type matters. Twitter’s short, informal text differs substantially from blog and news prose. Training on all three sources will help the model handle the variety of inputs a real user might type.
Zipf’s Law enables efficient models. In the sample, a vocabulary of about 7,600 words covers 90% of word instances. Pruning rare n-grams allows us to build a fast, compact model without sacrificing meaningful accuracy.
N-gram models are well-suited to this task. The trigram frequency distributions are rich enough to support a Stupid Backoff model with good coverage for common phrases.
Next steps: (1) Train the full n-gram model on the complete corpus, (2) evaluate perplexity and top-k accuracy on a held-out test set, (3) build and deploy the Shiny prediction app to shinyapps.io.
Report generated with R 4.5.3.