JHU Data Science Capstone Milestone Report

Overview

This report presents an exploratory analysis of the HC Corpora English dataset as the first milestone in building a predictive text model. The corpus draws from three public sources: personal blogs, news articles, and tweets. It gives a broad cross-section of informal and semi-formal English writing. The analysis covers basic file statistics, word frequency distributions, n-gram frequencies, and vocabulary coverage, and concludes with an outline of the planned prediction algorithm.

1. Loading the Data

blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", warn = FALSE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", warn = FALSE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", warn = FALSE)
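
On some platforms, notably Windows, a text-mode read can stop early at an embedded end-of-file control character (0x1A), which would understate line counts. Whether that applies to these files has not been verified here. As a minimal sketch under that assumption, the helper below reads through a binary-mode connection instead; read_lines_binary is a hypothetical name, not part of the original analysis.

```r
# Hypothetical helper: read a text file through a binary-mode connection so
# that embedded control characters do not end the read early.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # skipNul = TRUE drops embedded NUL bytes instead of truncating the line
  readLines(con, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
}

# Possible usage (assumes the file sits in the working directory):
# news <- read_lines_binary("en_US.news.txt")
```

Comparing length() from both read methods is a quick way to tell whether any lines were lost.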

2. Basic Summary Statistics

summary_tbl <- tibble(
  File = c("en_US.blogs", "en_US.news", "en_US.twitter"),
  Lines = formatC(c(length(blogs), length(news), length(twitter)), format = "d", big.mark = ","),
  Max_chars = formatC(c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter))), format = "d", big.mark = ","),
  Size_MB = round(c(
    file.size("en_US.blogs.txt"),
    file.size("en_US.news.txt"),
    file.size("en_US.twitter.txt")
  ) / 1e6, 1)
)

kable(summary_tbl,
      col.names = c("Source", "Line Count", "Max Line Length (chars)", "File Size (MB)"),
      align = c("l", "r", "r", "r"))

Source	Line Count	Max Line Length (chars)	File Size (MB)
en_US.blogs	899,288	40,833	210.2
en_US.news	77,259	5,760	205.8
en_US.twitter	2,360,148	140	167.1

The blogs file is the largest on disk and contains the longest single entry, at 40,833 characters. Twitter has the highest line count but the shortest lines, reflecting the platform’s 140-character limit.

3. Sampling

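The full corpus is too large for in-memory n-gram analysis. Each line is therefore kept independently with probability 0.05, which gives a random sample of roughly 5% from each source, and the three samples are combined.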

set.seed(42)  # fixed seed so the sample is reproducible
p <- 0.05     # probability of keeping each line

blogs_sample <- blogs[as.logical(rbinom(length(blogs), 1, p))]
news_sample <- news[as.logical(rbinom(length(news), 1, p))]
twitter_sample <- twitter[as.logical(rbinom(length(twitter), 1, p))]

corpus_sample <- c(blogs_sample, news_sample, twitter_sample)
cat("Sample size:", formatC(length(corpus_sample), format = "d", big.mark = ","), "lines\n")

## Sample size: 166,780 lines
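
Because rbinom() keeps each line independently, the sample size is itself a binomial draw, not exactly 5% of the corpus. As a rough sanity check, the sketch below compares the reported size with its expectation, using the line counts from the summary table. The variable names are illustrative only.

```r
# Line counts copied from the summary table in section 2
line_counts <- c(blogs = 899288, news = 77259, twitter = 2360148)
p <- 0.05

total_lines <- sum(line_counts)

# Mean and standard deviation of a Binomial(total_lines, p) sample size
expected_size <- total_lines * p
sd_size <- sqrt(total_lines * p * (1 - p))

# Sample size reported above
observed_size <- 166780
z <- (observed_size - expected_size) / sd_size

cat("Expected:", round(expected_size), " Observed:", observed_size,
    " z-score:", round(z, 2), "\n")
```

The expected size is about 166,835 lines, so the observed 166,780 falls well within one standard deviation.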

4. Word Frequency (Unigrams)

df <- tibble(text = corpus_sample)

unigrams <- df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

unigrams %>%
  slice_head(n = 20) %>%
  mutate(highlight = row_number() <= 3) %>%
  ggplot(aes(x = reorder(word, n), y = n, fill = highlight)) +
  geom_col(width = 0.7, show.legend = FALSE) +
  scale_fill_manual(values = c("FALSE" = "#457b9d", "TRUE" = "#e63946")) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.05))) +
  coord_flip() +
  labs(
    title = "Top 20 Most Frequent Words",
    subtitle = "Stop words dominate, consistent with Zipf's law",
    x = NULL,  # suppress the default "reorder(word, n)" axis title
    y = "Count"
  ) +
  report_theme

The top five words (the, to, and, a, i) together account for over 13% of all word instances. The distribution is strongly right-skewed, consistent with Zipf’s law: the most frequent word (“the”, ~148,000 occurrences) appears roughly 1.5× as often as the second (“to”, ~96,000).
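
The two summary figures quoted above, the share of the top k words and the ratio between the first two ranks, can be computed directly from a vector of counts. The sketch below uses base R and a toy vector. Only the first two counts echo the rounded corpus figures (in thousands); the rest are hypothetical, as are the helper names top_k_share and rank_ratio.

```r
# Toy counts: "the" and "to" mirror the rounded figures above (in thousands);
# all other values are hypothetical and chosen only for illustration.
word_counts <- c(the = 148, to = 96, and = 80, a = 78, i = 70,
                 of = 65, that = 40, it = 23)

# Share of all word instances covered by the k most frequent words
top_k_share <- function(counts, k) {
  sorted <- sort(counts, decreasing = TRUE)
  sum(head(sorted, k)) / sum(sorted)
}

# Ratio of the most frequent word's count to the second most frequent
rank_ratio <- function(counts) {
  sorted <- sort(counts, decreasing = TRUE)
  unname(sorted[1] / sorted[2])
}

top_k_share(word_counts, 5)
rank_ratio(word_counts)
```

Under an idealised Zipf law with exponent 1 the rank-1 to rank-2 ratio would be 2, so a ratio near 1.5 indicates a somewhat flatter head than the textbook case.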

5. Vocabulary Coverage

How many unique words are needed to account for a given share of all word instances in the sampled corpus?

unigrams <- unigrams %>%
  mutate(cumulative = cumsum(n) / sum(n))

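# Number of top-ranked words whose cumulative share is still at or below the target;
# the word that crosses the threshold is not counted, so each figure may be one short
cover_50 <- sum(unigrams$cumulative <= 0.50)
cover_90 <- sum(unigrams$cumulative <= 0.90)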

tibble(
  Coverage = c("50%", "90%"),
  Unique_words_needed = formatC(c(cover_50, cover_90), format = "d", big.mark = ",")
) %>%
  kable(col.names = c("Coverage Target", "Unique Words Needed"), align = c("l", "r"))

Coverage Target	Unique Words Needed
50%	131
90%	6,861

Just 131 unique words cover about 50% of all word instances in the sampled corpus. Reaching 90% requires 6,861 words, roughly 52× as many, to gain the remaining 40 percentage points. This long tail of rare words informs the frequency-pruning strategy for the final model.
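
An alternative way to express the coverage question is to find the first rank at which the cumulative share reaches the target. The base R sketch below does this on toy counts; words_needed is a hypothetical helper name. Because it counts the word that crosses the threshold, its result can be one higher than a count based on the <= comparison used above.

```r
# Smallest number of top-ranked words whose cumulative share reaches `target`
words_needed <- function(counts, target) {
  sorted <- sort(counts, decreasing = TRUE)
  cumulative <- cumsum(sorted) / sum(sorted)
  which(cumulative >= target)[1]
}

# Toy counts (hypothetical): cumulative shares are 0.50, 0.80, 0.90, 0.95, 1.00
toy_counts <- c(50, 30, 10, 5, 5)

words_needed(toy_counts, 0.45)
words_needed(toy_counts, 0.85)
```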

6. Bigram Frequencies

bigrams <- df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%  # one-word lines yield NA bigrams
  count(bigram, sort = TRUE)

bigrams %>%
  slice_head(n = 20) %>%
  mutate(highlight = row_number() <= 3) %>%
  ggplot(aes(x = reorder(bigram, n), y = n, fill = highlight)) +
  geom_col(width = 0.7, show.legend = FALSE) +
  scale_fill_manual(values = c("FALSE" = "#2a9d8f", "TRUE" = "#e63946")) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.05))) +
  coord_flip() +
  labs(
    title = "Top 20 Bigrams",
    subtitle = "Most frequent two-word sequences in the sampled corpus",
    x = NULL,  # suppress the default "reorder(bigram, n)" axis title
    y = "Count"
  ) +
  report_theme

The top bigrams are all stop-word pairs: of the (~13,100), in the (~12,300), and for the (~7,000). Meaningful content bigrams appear further down the frequency ranking.

7. Trigram Frequencies

trigrams <- df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE) %>%
  filter(!is.na(trigram))

trigrams %>%
  slice_head(n = 20) %>%
  mutate(highlight = row_number() <= 3) %>%
  ggplot(aes(x = reorder(trigram, n), y = n, fill = highlight)) +
  geom_col(width = 0.7, show.legend = FALSE) +
  scale_fill_manual(values = c("FALSE" = "#e9c46a", "TRUE" = "#e63946")) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.05))) +
  coord_flip() +
  labs(
    title = "Top 20 Trigrams",
    subtitle = "Most frequent three-word sequences in the sampled corpus",
    x = NULL,  # suppress the default "reorder(trigram, n)" axis title
    y = "Count"
  ) +
  report_theme

Trigram counts drop sharply. The top trigrams (thanks for the, one of the, a lot of) reach only ~1,000–1,200 occurrences, compared with roughly 13,000 for the top bigram and about 148,000 for the top unigram. Lines with fewer than three words produce NA trigrams, a tokenisation artefact; the filter step in the code above removes them, so they do not enter the ranking. This sparsity motivates the backoff strategy described below.
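
To make the tokenisation behaviour concrete, the base R sketch below builds n-grams from a single line with a sliding window. It is a simplified stand-in for the tidytext tokeniser, not a reimplementation: make_ngrams is a hypothetical helper, its splitting rule is an assumption, and it returns an empty vector for short lines where the tokeniser used above reports NA.

```r
# Hypothetical helper: split a line into lower-case words and return all
# n-grams of length n formed by a sliding window.
make_ngrams <- function(line, n) {
  words <- strsplit(tolower(line), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  # A line with fewer than n words cannot form any n-gram
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

make_ngrams("Thanks for the follow", 3)
make_ngrams("hi there", 3)
```

A line of w words contributes w - n + 1 n-grams, which is one reason higher-order counts are spread so thinly.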

8. Prediction Algorithm Plan

The final model will be a stupid backoff n-gram model operating on pre-built frequency tables:

Unigram, bigram, and trigram tables are built from the full sampled corpus after profanity removal
Given the last two typed words, the model first looks for trigrams that begin with them; if none match, or only one word has been typed, it falls back to bigrams beginning with the last word, and then to the most frequent unigrams
Tables are stored as compressed .rds files to keep the Shiny app within shinyapps.io memory limits
N-grams appearing only once are pruned, which is expected to shrink the tables substantially with little loss in prediction quality
The app returns the top 3 candidate next words ranked by frequency
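
The lookup order in the list above can be sketched as follows. This is a simplified illustration, not the final implementation: it ranks candidates by raw count within each n-gram order and fills any remaining slots from the next lower order, whereas the original Stupid Backoff formulation scores candidates by relative frequency with a fixed discount at each backoff step. The table layout (prefix, word, n), the toy counts, and the name predict_next are all assumptions.

```r
# Toy frequency tables (hypothetical counts, for illustration only)
trigram_tbl <- data.frame(prefix = c("one of", "one of", "a lot"),
                          word = c("the", "my", "of"),
                          n = c(120, 30, 200), stringsAsFactors = FALSE)
bigram_tbl <- data.frame(prefix = c("of", "of", "lot"),
                         word = c("the", "a", "of"),
                         n = c(1300, 400, 250), stringsAsFactors = FALSE)
unigram_tbl <- data.frame(word = c("the", "to", "and", "a"),
                          n = c(1480, 960, 800, 780), stringsAsFactors = FALSE)

predict_next <- function(input, k = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]

  # Words following `key` in `tbl`, most frequent first
  top_words <- function(tbl, key) {
    hits <- tbl[tbl$prefix == key, ]
    hits$word[order(-hits$n)]
  }

  candidates <- character(0)
  # 1. Trigram lookup on the last two words
  if (length(words) >= 2) {
    candidates <- top_words(trigram_tbl, paste(tail(words, 2), collapse = " "))
  }
  # 2. Back off to bigrams keyed by the last word
  if (length(candidates) < k && length(words) >= 1) {
    candidates <- c(candidates, top_words(bigram_tbl, tail(words, 1)))
  }
  # 3. Back off to the most frequent unigrams
  if (length(candidates) < k) {
    candidates <- c(candidates, unigram_tbl$word[order(-unigram_tbl$n)])
  }
  head(unique(candidates), k)
}

predict_next("one of")
predict_next("something unseen")
```

Filling leftover slots from lower orders keeps the return value at k words even when the higher-order match is sparse.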

The primary constraints are response time (target: under one second per prediction) and memory footprint. Frequency pruning and pre-indexing each n-gram by its prefix (the words that precede the predicted word) are the two main tools for meeting both targets.
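
As a minimal sketch of how pruning, indexing, and compressed storage might fit together, the base R example below drops singletons from a toy table, builds a named list keyed by the words that precede the predicted word (an assumption about how the lookup would be organised), and round-trips it through a compressed .rds file. The table contents and variable names are hypothetical.

```r
# Toy bigram table (hypothetical counts)
ngram_tbl <- data.frame(
  prefix = c("of", "of", "in", "lot"),
  word = c("the", "a", "the", "of"),
  n = c(1300, 1, 1200, 250),
  stringsAsFactors = FALSE
)

# Prune n-grams seen only once
pruned <- ngram_tbl[ngram_tbl$n > 1, ]

# Sort by descending count within each prefix, then index by prefix so a
# lookup is a single named-list access
pruned <- pruned[order(pruned$prefix, -pruned$n), ]
index <- split(pruned$word, pruned$prefix)

# Store and reload as a compressed .rds file
path <- tempfile(fileext = ".rds")
saveRDS(index, path, compress = "xz")
index_loaded <- readRDS(path)

# Time a single lookup
elapsed <- system.time(result <- index_loaded[["of"]])[["elapsed"]]
cat("Candidates after 'of':", result, " elapsed (s):", elapsed, "\n")
```

An unseen prefix returns NULL from the list, which gives a natural trigger for backing off to the next lower order.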