Data Science Capstone - Milestone Report

Introduction

This milestone report presents an exploratory analysis of the HC Corpora dataset provided for the JHU Data Science Capstone. The end goal is a next-word prediction app, built on an N-gram backoff model and deployed with Shiny.

1. Loading the Data

library(stringr)
library(ggplot2)
library(dplyr)
library(tidytext)

# Directory holding the en_US files; adjust to the local path.
data_dir <- "C:/Users/bizsu/OneDrive/Desktop/en_US"

blogs   <- readLines(file.path(data_dir, "en_US.blogs.txt"),   warn = FALSE)
news    <- readLines(file.path(data_dir, "en_US.news.txt"),    warn = FALSE)
twitter <- readLines(file.path(data_dir, "en_US.twitter.txt"), warn = FALSE)

cat("Data loaded successfully!")

## Data loaded successfully!

2. Basic Summary Statistics

summary_table <- data.frame(
  File    = c("Blogs", "News", "Twitter"),
  Size_MB = round(c(
    file.size("C:/Users/bizsu/OneDrive/Desktop/en_US/en_US.blogs.txt"),
    file.size("C:/Users/bizsu/OneDrive/Desktop/en_US/en_US.news.txt"),
    file.size("C:/Users/bizsu/OneDrive/Desktop/en_US/en_US.twitter.txt")
  ) / 1024^2, 2),
  Lines   = c(length(blogs), length(news), length(twitter)),
  Words   = c(sum(str_count(blogs, "\\S+")),
              sum(str_count(news, "\\S+")),
              sum(str_count(twitter, "\\S+")))
)
knitr::kable(summary_table,
             col.names = c("File", "Size (MB)", "Line Count", "Word Count"),
             caption = "Summary Statistics of HC Corpora Dataset")

Summary Statistics of HC Corpora Dataset

File      Size (MB)   Line Count   Word Count
Blogs     200.42      899288       37334131
News      196.28      1010206      34371031
Twitter   159.36      2360148      30373543

3. Words Per Line Distribution

set.seed(42)
sample_blogs   <- sample(blogs,   5000)
sample_news    <- sample(news,    5000)
sample_twitter <- sample(twitter, 5000)

wpl_df <- data.frame(
  Words  = c(str_count(sample_blogs, "\\S+"),
             str_count(sample_news, "\\S+"),
             str_count(sample_twitter, "\\S+")),
  Source = rep(c("Blogs", "News", "Twitter"), each = 5000)
)

ggplot(wpl_df, aes(x = Words, fill = Source)) +
  geom_histogram(binwidth = 5, alpha = 0.7, position = "dodge") +
  # Zoom with coord_cartesian() rather than xlim(): lines longer than 150 words
  # stay in the data instead of being dropped with a warning.
  coord_cartesian(xlim = c(0, 150)) +
  labs(title = "Words Per Line Distribution",
       x = "Number of Words", y = "Frequency") +
  theme_minimal()

4. Most Frequent Words

sample_corpus <- data.frame(
  text   = c(sample_blogs, sample_news, sample_twitter),
  source = rep(c("Blogs", "News", "Twitter"), each = 5000)
)

word_counts <- sample_corpus %>%
  unnest_tokens(word, text) %>%          # one lowercase word per row
  anti_join(stop_words, by = "word") %>% # drop stop words; explicit key avoids the join message
  count(word, sort = TRUE) %>%
  head(20)

ggplot(word_counts, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "#2c7bb6") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words",
       x = "Word", y = "Frequency") +
  theme_minimal()

5. Interesting Findings

Twitter has the shortest lines, averaging about 12.9 words per line (word count divided by line count in the summary table).
Blogs have the longest lines, averaging about 41.5 words per line.
News falls in between, averaging about 34.0 words per line.
The most frequent words in the combined sample, after removing stop words, include: people, time, day, love, life.
Twitter data contains more informal language and abbreviations than the other two sources.
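
The per-line averages follow directly from the line and word counts in the summary table. A minimal check, with the counts copied by hand from that table (the data frame name `counts` is only illustrative):

```r
# Line and word counts copied from the summary table in Section 2.
counts <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010206, 2360148),
  Words = c(37334131, 34371031, 30373543)
)

# Average words per line = total words / total lines, rounded to one decimal.
counts$Words_Per_Line <- round(counts$Words / counts$Lines, 1)
print(counts)
```

These are averages over the full files, so they need not match the 5000-line samples used for the histogram exactly.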

6. Plan for Prediction Algorithm and Shiny App

Algorithm

Build an N-gram backoff model from bigrams, trigrams and quadgrams
Clean and tokenize the text with the tidytext package
Store the n-gram tables as .rds files for fast loading
Use the Stupid Backoff strategy to fall back to shorter n-grams for unseen phrases
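
The plan above could be sketched roughly as follows. This is an illustration under stated assumptions, not the final implementation: the toy corpus, the helper names (`tokenize`, `count_ngrams`, `predict_next`) and the backoff factor `alpha = 0.4` are all assumptions, and base R is used instead of tidytext so that the sketch runs without extra packages.

```r
# Toy corpus (hypothetical); the real model would use the sampled HC Corpora text.
corpus <- c("i love data science", "i love data analysis",
            "i love new york", "we love data science")

# Split each line into lowercase, whitespace-separated tokens.
tokenize <- function(x) strsplit(tolower(x), "\\s+")

# Count all n-grams of order n; returns a named integer vector keyed by "w1 w2 ...".
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(tokenize(lines), function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}

# tables[[n]] holds the n-gram counts for n = 1 (unigrams) to 4 (quadgrams).
tables <- lapply(1:4, function(n) count_ngrams(corpus, n))

# Round-trip through an .rds file, as the plan suggests for fast loading.
rds_path <- file.path(tempdir(), "ngram_tables.rds")
saveRDS(tables, rds_path)
tables <- readRDS(rds_path)

# Stupid Backoff: score = count(context + word) / count(context) at the longest
# context where the word was seen; each step down to a shorter context
# multiplies the score by alpha (assumed 0.4 here).
predict_next <- function(phrase, tables, alpha = 0.4, top = 5) {
  words   <- tokenize(phrase)[[1]]
  max_ctx <- min(length(words), length(tables) - 1)
  scores  <- numeric(0)
  penalty <- 1
  for (k in max_ctx:0) {                       # k = number of context words
    tab <- tables[[k + 1]]
    if (k == 0) {
      cand <- penalty * tab / sum(tab)         # unigram relative frequency
    } else {
      ctx    <- paste(tail(words, k), collapse = " ")
      prefix <- paste0(ctx, " ")
      hits   <- tab[startsWith(names(tab), prefix)]
      cand   <- numeric(0)
      if (length(hits) > 0) {
        cand <- penalty * hits / tables[[k]][[ctx]]
        names(cand) <- sub(prefix, "", names(hits), fixed = TRUE)
      }
    }
    # Keep only words not already scored at a longer context.
    scores  <- c(scores, cand[setdiff(names(cand), names(scores))])
    penalty <- penalty * alpha
  }
  head(sort(scores, decreasing = TRUE), top)
}

pred <- predict_next("i love", tables)
print(pred)
```

Scores from Stupid Backoff are not normalized probabilities, which is acceptable here because only the ranking of candidates matters. For the full corpus, prefix matching over a named vector would likely be too slow, so a keyed lookup table would be a natural replacement.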

Shiny App

A simple text input box where the user types a phrase
A "Predict Next Word" button to trigger the prediction
Display of the top predicted word and the top 5 candidates
Deployment on shinyapps.io for public access
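
A minimal sketch of how this interface might be laid out in Shiny is shown below. The input and output IDs and the `predict_candidates()` helper are hypothetical placeholders; in the real app that helper would call the backoff model rather than return a fixed list.

```r
library(shiny)

# Hypothetical placeholder: the real app would query the n-gram backoff model.
predict_candidates <- function(phrase, top = 5) {
  head(c("the", "to", "and", "a", "of"), top)
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  actionButton("go", "Predict Next Word"),
  h4("Top prediction"),
  textOutput("top_word"),
  h4("Top 5 candidates"),
  tableOutput("candidates")
)

server <- function(input, output, session) {
  # Recompute candidates only when the button is clicked.
  candidates <- eventReactive(input$go, {
    predict_candidates(input$phrase, top = 5)
  })
  output$top_word   <- renderText(candidates()[1])
  output$candidates <- renderTable(
    data.frame(Rank = seq_along(candidates()), Word = candidates())
  )
}

# Build the app object; call runApp(app) to launch it locally.
app <- shinyApp(ui, server)
```

Using `eventReactive()` ties the prediction to the button click, so the model is not queried on every keystroke. Deployment to shinyapps.io is usually done with `rsconnect::deployApp()` once an account has been configured.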

Conclusion

The HC Corpora dataset is large and varied, covering blogs, news and Twitter. The exploratory analysis shows clear differences in line length across the three sources, with Twitter lines far shorter than blog or news lines. An N-gram backoff model is a practical and efficient approach to next-word prediction that does not require deep learning. The final Shiny app is intended to provide a clean, fast prediction interface for English phrases.