Introduction

This milestone report presents an exploratory analysis of the HC Corpora dataset provided for the JHU Data Science Capstone. The end goal is a next-word prediction app, built on an N-gram backoff model and deployed with Shiny.

1. Loading the Data

library(stringr)
library(ggplot2)
library(dplyr)
library(tidytext)

# Folder holding the three en_US text files
data_dir <- "C:/Users/bizsu/OneDrive/Desktop/en_US"

# Read each file into a character vector, one element per line
blogs   <- readLines(file.path(data_dir, "en_US.blogs.txt"),   warn = FALSE)
news    <- readLines(file.path(data_dir, "en_US.news.txt"),    warn = FALSE)
twitter <- readLines(file.path(data_dir, "en_US.twitter.txt"), warn = FALSE)

cat("Data loaded successfully!")
## Data loaded successfully!

2. Basic Summary Statistics

summary_table <- data.frame(
  File    = c("Blogs", "News", "Twitter"),
  Size_MB = round(c(
    file.size("C:/Users/bizsu/OneDrive/Desktop/en_US/en_US.blogs.txt"),
    file.size("C:/Users/bizsu/OneDrive/Desktop/en_US/en_US.news.txt"),
    file.size("C:/Users/bizsu/OneDrive/Desktop/en_US/en_US.twitter.txt")
  ) / 1024^2, 2),
  Lines   = c(length(blogs), length(news), length(twitter)),
  Words   = c(sum(str_count(blogs, "\\S+")),
              sum(str_count(news, "\\S+")),
              sum(str_count(twitter, "\\S+")))
)
knitr::kable(summary_table,
             col.names = c("File", "Size (MB)", "Line Count", "Word Count"),
             caption = "Summary Statistics of HC Corpora Dataset")
Summary Statistics of HC Corpora Dataset

File      Size (MB)   Line Count   Word Count
Blogs        200.42       899288     37334131
News         196.28      1010206     34371031
Twitter      159.36      2360148     30373543

3. Words Per Line Distribution

set.seed(42)
sample_blogs   <- sample(blogs,   5000)
sample_news    <- sample(news,    5000)
sample_twitter <- sample(twitter, 5000)

wpl_df <- data.frame(
  Words  = c(str_count(sample_blogs, "\\S+"),
             str_count(sample_news, "\\S+"),
             str_count(sample_twitter, "\\S+")),
  Source = rep(c("Blogs", "News", "Twitter"), each = 5000)
)

ggplot(wpl_df, aes(x = Words, fill = Source)) +
  geom_histogram(binwidth = 5, alpha = 0.7, position = "dodge") +
  xlim(0, 150) +  # lines longer than 150 words are dropped, hence the warnings below
  labs(title = "Words Per Line Distribution",
       x = "Number of Words", y = "Frequency") +
  theme_minimal()
## Warning: Removed 145 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_bar()`).

4. Most Frequent Words

sample_corpus <- data.frame(
  text   = c(sample_blogs, sample_news, sample_twitter),
  source = rep(c("Blogs", "News", "Twitter"), each = 5000)
)

# Tokenize into words, drop stop words, and keep the 20 most frequent
word_counts <- sample_corpus %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  head(20)

ggplot(word_counts, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "#2c7bb6") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words",
       x = "Word", y = "Frequency") +
  theme_minimal()
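
The same tokenizing pipeline extends from single words to word pairs, which the planned N-gram model relies on. The following is a minimal sketch on a toy data frame rather than the real corpus; the names `toy` and `bigram_counts` and the two example sentences are illustrative assumptions, not part of the analysis above.

```r
library(dplyr)
library(tidytext)

# Toy input (assumption): two short lines standing in for the sampled corpus
toy <- data.frame(text = c("the cat sat on the mat",
                           "the cat ate the fish"))

# Split each line into overlapping two-word sequences and count them
bigram_counts <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

print(head(bigram_counts, 3))
```

Changing `n = 2` to `3` or `4` would give trigram and quadgram counts in the same way.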

5. Interesting Findings
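
One finding follows directly from the summary table in section 2: Twitter has by far the most lines but the fewest words, so its lines are much shorter than those in blogs or news. A small worked calculation, using only the line and word counts reported in that table (the variable names are illustrative), gives roughly 41.5 words per line for blogs, 34.0 for news and 12.9 for Twitter.

```r
# Figures copied from the summary table in section 2
line_count <- c(Blogs = 899288,   News = 1010206,  Twitter = 2360148)
word_count <- c(Blogs = 37334131, News = 34371031, Twitter = 30373543)

# Average number of words per line for each source
avg_words_per_line <- round(word_count / line_count, 1)
print(avg_words_per_line)
```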

6. Plan for Prediction Algorithm and Shiny App

Algorithm

  • Clean and tokenize the text using the tidytext package
  • Build an N-gram backoff model from bigrams, trigrams and quadgrams
  • Store the n-gram tables as .rds files for fast loading
  • Use the Stupid Backoff strategy for unseen phrases
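
The steps above can be sketched in base R. This is a rough illustration under stated assumptions, not the final implementation: `count_ngrams` and `predict_next` are hypothetical helpers, the three-line corpus is made up, only orders 1 to 3 are built to keep the example small, and the backoff factor `alpha = 0.4` is the commonly cited default for Stupid Backoff. Candidates are scored with the longest matching context first, and each step down to a shorter context multiplies the score by `alpha`.

```r
# Hypothetical helper: named vector of counts for all n-grams of order n.
# Base R only; the report plans to use tidytext for this step.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(tolower(lines), "[^a-z']+"), function(tok) {
    tok <- tok[nzchar(tok)]
    if (length(tok) < n) return(character(0))
    vapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}

# Hypothetical helper: Stupid Backoff scores for the next word.
# tables[[n]] must hold the n-gram counts of order n.
predict_next <- function(phrase, tables, alpha = 0.4, top = 5) {
  words <- strsplit(tolower(phrase), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  scores <- numeric(0)
  penalty <- 1
  for (n in seq(min(length(tables), length(words) + 1), 1)) {
    if (n == 1) {
      # Lowest order: relative unigram frequency
      s <- tables[[1]] / sum(tables[[1]])
    } else {
      ctx <- paste(tail(words, n - 1), collapse = " ")
      keys <- as.character(names(tables[[n]]))
      hit <- tables[[n]][startsWith(keys, paste0(ctx, " "))]
      # count(context + word) / count(context)
      s <- hit / unname(tables[[n - 1]][ctx])
      names(s) <- sub(".* ", "", names(hit))
    }
    # Keep only words not already scored at a higher order
    s <- s[!names(s) %in% names(scores)]
    scores <- c(scores, penalty * s)
    penalty <- penalty * alpha
  }
  head(sort(scores, decreasing = TRUE), top)
}

# Toy corpus (assumption) standing in for the cleaned training text
toy_lines <- c("the cat sat on the mat",
               "the cat ate the fish",
               "the dog sat on the rug")
tables <- lapply(1:3, function(n) count_ngrams(toy_lines, n))

print(predict_next("the cat", tables))
```

In the real app the list of tables could be written once with `saveRDS()` and read back with `readRDS()` at startup, so the counts are not recomputed on every launch.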

Shiny App

  • A simple text input box where the user types a phrase
  • A "Predict Next Word" button to trigger the prediction
  • Display of the top predicted word and the top 5 candidates
  • Deployment on shinyapps.io for public access
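
A skeleton of the interface described above might look like the following. It is only a sketch: the input and output IDs (`phrase`, `go`, `best`, `candidates`) are illustrative, and `predict_next` here is a dummy stand-in that returns fixed words so the example runs without a trained model.

```r
library(shiny)

# Dummy stand-in (assumption): the real app would call the n-gram model here
predict_next <- function(phrase, top = 5) {
  head(c("the", "to", "and", "a", "of"), top)
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  actionButton("go", "Predict Next Word"),
  h4("Top prediction"),
  textOutput("best"),
  h4("Top 5 candidates"),
  textOutput("candidates")
)

server <- function(input, output, session) {
  # Recompute predictions only when the button is clicked
  preds <- eventReactive(input$go, predict_next(input$phrase, top = 5))
  output$best       <- renderText(preds()[1])
  output$candidates <- renderText(paste(preds(), collapse = ", "))
}

# shinyApp(ui, server)  # uncomment to launch the app locally
```

Tying the prediction to the button through `eventReactive()` means the model is queried once per click rather than on every keystroke.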

Conclusion

The HC Corpora dataset is rich and diverse, covering blogs, news and Twitter. The exploratory analysis shows clear differences across sources: Twitter has by far the most lines (2,360,148) but the fewest words (30,373,543), while blogs have the fewest lines and the most words. An N-gram backoff model is a practical and efficient approach to next-word prediction that does not require deep learning. The final Shiny app will provide a clean, fast prediction interface for English phrases.