This report presents an exploratory analysis of the HC Corpora English dataset as the first milestone in building a predictive text model. The corpus draws from three public sources: personal blogs, news articles, and tweets. It gives a broad cross-section of informal and semi-formal English writing. The analysis covers basic file statistics, word frequency distributions, n-gram frequencies, and vocabulary coverage, and concludes with an outline of the planned prediction algorithm.
library(dplyr)     # %>%, count(), mutate(), slice_head()
library(tibble)    # tibble()
library(tidytext)  # unnest_tokens()
library(ggplot2)   # plotting
library(scales)    # comma()
library(knitr)     # kable()

blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", warn = FALSE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", warn = FALSE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", warn = FALSE)

summary_tbl <- tibble(
  File = c("en_US.blogs", "en_US.news", "en_US.twitter"),
  Lines = formatC(c(length(blogs), length(news), length(twitter)), format = "d", big.mark = ","),
  Max_chars = formatC(c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter))), format = "d", big.mark = ","),
  Size_MB = round(c(
    file.size("en_US.blogs.txt"),
    file.size("en_US.news.txt"),
    file.size("en_US.twitter.txt")
  ) / 1e6, 1)
)
kable(summary_tbl,
  col.names = c("Source", "Line Count", "Max Line Length (chars)", "File Size (MB)"),
  align = c("l", "r", "r", "r"))

| Source | Line Count | Max Line Length (chars) | File Size (MB) |
|---|---|---|---|
| en_US.blogs | 899,288 | 40,833 | 210.2 |
| en_US.news | 77,259 | 5,760 | 205.8 |
| en_US.twitter | 2,360,148 | 140 | 167.1 |
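The blogs file is the largest on disk and contains the longest individual entries, the longest of which runs to 40,833 characters. Twitter has the highest line count but the shortest lines (at most 140 characters), reflecting the platform’s character limit. The news file reports far fewer lines than its size would suggest (about 206 MB across 77,259 lines). This may indicate that readLines() stopped early at an embedded control character, which is known to happen when this file is read in text mode on some platforms, so the count should be verified by reading through a binary-mode connection.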
The full corpus is too large for in-memory n-gram analysis, so each line is kept independently with probability 0.05, giving an approximately 5% random sample of each source. The three samples are then combined.
set.seed(42)
p <- 0.05
blogs_sample <- blogs[as.logical(rbinom(length(blogs), 1, p))]
news_sample <- news[as.logical(rbinom(length(news), 1, p))]
twitter_sample <- twitter[as.logical(rbinom(length(twitter), 1, p))]
corpus_sample <- c(blogs_sample, news_sample, twitter_sample)
cat("Sample size:", formatC(length(corpus_sample), format = "d", big.mark = ","), "lines\n")

## Sample size: 166,780 lines
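As a quick sanity check on the sampling step, the observed sample size can be compared with what independent sampling at p = 0.05 would be expected to give. The sketch below uses only the line counts and sample size reported above. The variable names are illustrative and not part of the original analysis.

```r
# Line counts as reported in the summary table
n_lines <- c(blogs = 899288, news = 77259, twitter = 2360148)
p <- 0.05

n_total <- sum(n_lines)
expected <- n_total * p                  # mean of Binomial(n_total, p)
std_dev <- sqrt(n_total * p * (1 - p))   # standard deviation of the same
observed <- 166780                       # sample size reported above

# Distance of the observed size from the expected size, in standard deviations
z <- (observed - expected) / std_dev
cat("Expected:", round(expected), " SD:", round(std_dev), " z:", round(z, 2), "\n")
```

The observed size falls well within one standard deviation of the expected value (about 166,835 lines), which is what one would hope to see if the sampling code behaves as intended.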
df <- tibble(text = corpus_sample)
unigrams <- df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

unigrams %>%
  slice_head(n = 20) %>%
  mutate(highlight = row_number() <= 3) %>%
  ggplot(aes(x = reorder(word, n), y = n, fill = highlight)) +
  geom_col(width = 0.7, show.legend = FALSE) +
  scale_fill_manual(values = c("FALSE" = "#457b9d", "TRUE" = "#e63946")) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.05))) +
  coord_flip() +
  labs(
    title = "Top 20 Most Frequent Words",
    subtitle = "Stop words dominate, consistent with Zipf's law",
    x = NULL,
    y = "Count"
  ) +
  report_theme  # theme object assumed to be defined in the report's setup chunk

The five most frequent words (the, to, and, a, i) together account for over 13% of all word instances. The distribution is strongly right-skewed, consistent with Zipf’s law: the most frequent word (“the”, ~148,000 occurrences) appears roughly 1.5× as often as the second (“to”, ~96,000).
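One way to make the Zipf's law claim more concrete is to regress log frequency on log rank: under an idealised Zipf distribution the slope is close to -1. The following is a minimal sketch under stated assumptions. The helper name `zipf_slope` is hypothetical, and the counts are synthetic, generated only so the example is self-contained.

```r
# Estimate the Zipf exponent as the slope of log(frequency) on log(rank).
zipf_slope <- function(counts) {
  counts <- sort(counts[counts > 0], decreasing = TRUE)
  rnk <- seq_along(counts)
  fit <- lm(log(counts) ~ log(rnk))
  unname(coef(fit)[2])
}

# Synthetic, idealised counts (frequency proportional to 1 / rank); NOT corpus data.
toy_counts <- round(150000 / (1:1000))
slope <- zipf_slope(toy_counts)
print(slope)  # close to -1 for an ideal Zipf distribution
```

Applied to the report's data, the call would be `zipf_slope(unigrams$n)`. Real corpora usually deviate from -1 at both the head and the tail of the distribution, so the slope is best read as a rough summary rather than a test.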
How many unique words are needed to account for a given share of all word instances in the sampled corpus?
unigrams <- unigrams %>%
  mutate(cumulative = cumsum(n) / sum(n))
cover_50 <- sum(unigrams$cumulative <= 0.50)
cover_90 <- sum(unigrams$cumulative <= 0.90)

tibble(
  Coverage = c("50%", "90%"),
  Unique_words_needed = formatC(c(cover_50, cover_90), format = "d", big.mark = ",")
) %>%
  kable(col.names = c("Coverage Target", "Unique Words Needed"), align = c("l", "r"))

| Coverage Target | Unique Words Needed |
|---|---|
| 50% | 131 |
| 90% | 6,861 |
Just 131 unique words cover 50% of all word instances in the sampled corpus. Reaching 90% requires 6,861 words, a roughly 52× increase in vocabulary for the remaining 40 percentage points of coverage. This long tail of rare words informs the frequency-pruning strategy for the final model.
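The coverage calculation can be wrapped in a small reusable helper. The sketch below is illustrative: `words_for_coverage` is a hypothetical name and the counts are toy data. It returns the smallest number of words whose cumulative share reaches the target, which can differ by one from counting the words whose cumulative share is still at or below the target.

```r
# Smallest number of top-ranked words needed to reach a coverage target.
words_for_coverage <- function(counts, target) {
  counts <- sort(counts, decreasing = TRUE)
  cumulative <- cumsum(counts) / sum(counts)
  which(cumulative >= target)[1]
}

# Toy counts (NOT corpus data); cumulative shares are 0.50, 0.80, 0.90, 0.95, 1.00
toy_counts <- c(50, 30, 10, 5, 5)

need_60 <- words_for_coverage(toy_counts, 0.60)  # two words are needed to pass 60%
need_85 <- words_for_coverage(toy_counts, 0.85)  # three words are needed to pass 85%

# Counting shares at or below the target gives one fewer word in this case
below_60 <- sum(cumsum(toy_counts) / sum(toy_counts) <= 0.60)
print(c(need_60 = need_60, need_85 = need_85, below_60 = below_60))
```

With a vocabulary as large as the one in the sampled corpus, a difference of one word is negligible, so either definition supports the same conclusion about the long tail.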
bigrams <- df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE) %>%
  filter(!is.na(bigram))  # single-word lines yield NA bigrams

bigrams %>%
  slice_head(n = 20) %>%
  mutate(highlight = row_number() <= 3) %>%
  ggplot(aes(x = reorder(bigram, n), y = n, fill = highlight)) +
  geom_col(width = 0.7, show.legend = FALSE) +
  scale_fill_manual(values = c("FALSE" = "#2a9d8f", "TRUE" = "#e63946")) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.05))) +
  coord_flip() +
  labs(
    title = "Top 20 Bigrams",
    subtitle = "Most frequent two-word sequences in the sampled corpus",
    x = NULL,
    y = "Count"
  ) +
  report_theme

The top bigrams are all stop-word pairs: of the (~13,100), in the (~12,300), and for the (~7,000). Content-bearing bigrams appear further down the frequency ranking.
trigrams <- df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE) %>%
  filter(!is.na(trigram))  # lines with fewer than three tokens yield NA trigrams

trigrams %>%
  slice_head(n = 20) %>%
  mutate(highlight = row_number() <= 3) %>%
  ggplot(aes(x = reorder(trigram, n), y = n, fill = highlight)) +
  geom_col(width = 0.7, show.legend = FALSE) +
  scale_fill_manual(values = c("FALSE" = "#e9c46a", "TRUE" = "#e63946")) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.05))) +
  coord_flip() +
  labs(
    title = "Top 20 Trigrams",
    subtitle = "Most frequent three-word sequences in the sampled corpus",
    x = NULL,
    y = "Count"
  ) +
  report_theme

Trigram counts drop sharply: the top trigrams (thanks for the, one of the, a lot of) reach only ~1,000–1,200 occurrences, compared with roughly 13,000 for the top bigram and about 148,000 for the top unigram. Lines with fewer than three tokens produce NA trigrams, which would otherwise rank first. The filter() call above removes them before plotting, and they will likewise be excluded before modelling. This sparsity motivates the backoff strategy described below.
The final model will be a Stupid Backoff n-gram model operating on pre-built frequency tables, stored as .rds files to keep the Shiny app within shinyapps.io memory limits.

The primary constraints are response time (target: under one second per prediction) and memory footprint. Frequency pruning and pre-indexing by the last word in each n-gram are the two main tools for meeting both targets.
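To illustrate the planned approach, here is a minimal sketch of Stupid Backoff prediction over toy frequency tables. Everything in it is an assumption made for illustration: the table layout (a `prefix` column and a separate next-word column), the function names `top_k` and `predict_next`, the toy counts, and the backoff factor of 0.4, which is the value commonly used with this method. It is not the report's final implementation.

```r
# Toy frequency tables (hypothetical data, NOT from the corpus).
# Each n-gram is stored as a context prefix plus the word that follows it.
trigram_tbl <- data.frame(
  prefix = c("a lot", "a lot", "one of"),
  word = c("of", "more", "the"),
  n = c(8, 2, 9),
  stringsAsFactors = FALSE
)
bigram_tbl <- data.frame(
  prefix = c("lot", "of", "of"),
  word = c("of", "the", "a"),
  n = c(9, 12, 4),
  stringsAsFactors = FALSE
)
unigram_tbl <- data.frame(
  word = c("the", "of", "a"),
  n = c(50, 30, 20),
  stringsAsFactors = FALSE
)

# Return the k highest-scoring candidate words.
top_k <- function(word, score, k) {
  ord <- order(score, decreasing = TRUE)
  head(data.frame(word = word[ord], score = score[ord], stringsAsFactors = FALSE), k)
}

predict_next <- function(input, k = 3, alpha = 0.4) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  penalty <- 1
  # Try the longest available context first, then back off to shorter ones.
  for (ctx_len in 2:1) {
    if (length(words) < ctx_len) next
    ctx <- paste(tail(words, ctx_len), collapse = " ")
    tbl <- if (ctx_len == 2) trigram_tbl else bigram_tbl
    hits <- tbl[tbl$prefix == ctx, ]
    if (nrow(hits) > 0) {
      # Relative frequency within the context; the context count is
      # approximated by the sum of the matching n-gram counts.
      return(top_k(hits$word, penalty * hits$n / sum(hits$n), k))
    }
    penalty <- penalty * alpha  # each failed lookup discounts the score
  }
  # No context matched: fall back to overall word frequency.
  top_k(unigram_tbl$word, penalty * unigram_tbl$n / sum(unigram_tbl$n), k)
}

print(predict_next("a lot"))    # matched at the trigram level
print(predict_next("kind of"))  # backs off to the bigram level
```

Because the function returns as soon as one level produces candidates, the discount only affects the reported scores, not the ranking within a level. Storing the prefix in its own column also means each lookup is a single equality filter, which could later be replaced by a keyed or hashed index to meet the response-time target.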