This milestone report presents an exploratory analysis of the HC Corpora dataset provided for the JHU Data Science Capstone. The end goal is a next-word prediction app built on an N-gram backoff model and deployed with Shiny.
library(stringr)
library(ggplot2)
library(dplyr)
library(tidytext)
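# Read each corpus file line by line, declaring the text as UTF-8
# so that non-ASCII characters are not misread on Windows.
blogs <- readLines("C:/Users/bizsu/OneDrive/Desktop/en_US/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("C:/Users/bizsu/OneDrive/Desktop/en_US/en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("C:/Users/bizsu/OneDrive/Desktop/en_US/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")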
cat("Data loaded successfully!")
## Data loaded successfully!
# Per-file summary: size on disk, number of lines, and number of
# whitespace-delimited words.
summary_table <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Size_MB = round(c(
    file.size("C:/Users/bizsu/OneDrive/Desktop/en_US/en_US.blogs.txt"),
    file.size("C:/Users/bizsu/OneDrive/Desktop/en_US/en_US.news.txt"),
    file.size("C:/Users/bizsu/OneDrive/Desktop/en_US/en_US.twitter.txt")
  ) / 1024^2, 2),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(str_count(blogs, "\\S+")),
            sum(str_count(news, "\\S+")),
            sum(str_count(twitter, "\\S+")))
)
knitr::kable(summary_table,
             col.names = c("File", "Size (MB)", "Line Count", "Word Count"),
             caption = "Summary Statistics of HC Corpora Dataset")
| File | Size (MB) | Line Count | Word Count |
|---|---|---|---|
| Blogs | 200.42 | 899288 | 37334131 |
| News | 196.28 | 1010206 | 34371031 |
| Twitter | 159.36 | 2360148 | 30373543 |
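Dividing the word counts by the line counts gives a rough average line length per source. The snippet below is a small sketch that copies the figures from the summary table into a new data frame; the names `summary_stats` and `Words_Per_Line` are introduced here for illustration and are not part of the original analysis.

```r
# Figures copied from the summary table above (rows: Blogs, News, Twitter).
summary_stats <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010206, 2360148),
  Words = c(37334131, 34371031, 30373543)
)

# Average number of whitespace-delimited words per line, to one decimal place.
summary_stats$Words_Per_Line <- round(summary_stats$Words / summary_stats$Lines, 1)

print(summary_stats)
```

Twitter contributes by far the most lines but the fewest words, so its lines come out much shorter on average than those from blogs or news.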
# Draw a reproducible random sample of 5,000 lines from each source.
set.seed(42)
sample_blogs <- sample(blogs, 5000)
sample_news <- sample(news, 5000)
sample_twitter <- sample(twitter, 5000)

# Words per sampled line, labelled by source.
wpl_df <- data.frame(
  Words = c(str_count(sample_blogs, "\\S+"),
            str_count(sample_news, "\\S+"),
            str_count(sample_twitter, "\\S+")),
  Source = rep(c("Blogs", "News", "Twitter"), each = 5000)
)
# Zoom the x-axis to 0-150 words with coord_cartesian(), which changes the
# view without dropping longer lines before binning (xlim() removes them
# and triggers "Removed rows" warnings).
ggplot(wpl_df, aes(x = Words, fill = Source)) +
  geom_histogram(binwidth = 5, alpha = 0.7, position = "dodge") +
  coord_cartesian(xlim = c(0, 150)) +
  labs(title = "Words Per Line Distribution",
       x = "Number of Words", y = "Frequency") +
  theme_minimal()
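# Combine the three samples into one data frame, one line of text per row.
sample_corpus <- data.frame(
  text = c(sample_blogs, sample_news, sample_twitter),
  source = rep(c("Blogs", "News", "Twitter"), each = 5000)
)

# Tokenize into words, drop stop words, and keep the 20 most frequent terms.
# Naming the join column explicitly avoids the "Joining with" message.
word_counts <- sample_corpus %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  head(20)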
ggplot(word_counts, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "#2c7bb6") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words",
       x = "Word", y = "Frequency") +
  theme_minimal()
The HC Corpora dataset is large and varied, covering blogs, news, and Twitter. The exploratory analysis shows clear differences in writing style across sources: Twitter contributes the most lines but the fewest words, averaging about 13 words per line against roughly 42 for blogs and 34 for news. An N-gram backoff model is a practical and efficient approach to next-word prediction that does not require deep learning. The final Shiny app will provide a clean, fast prediction interface for English phrases.
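To make the backoff idea concrete, here is a minimal sketch in base R. It is not the final model: it uses a hypothetical five-line toy corpus, and the helper names (`tokenize`, `build_model`, `predict_next`) are invented for illustration. It implements the simplest form of backoff, without discounting or smoothing: look for the most frequent continuation of the longest matching context, and fall back to a shorter context when nothing matches.

```r
# Hypothetical toy corpus; a real model would be trained on the sampled text.
toy_corpus <- c(
  "i love data science",
  "you love data science",
  "i love data analysis",
  "science is fun",
  "i love new york"
)

# Split each line into lowercase word tokens, dropping empty strings.
tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z']+")
  lapply(tokens, function(t) t[nzchar(t)])
}

# Build frequency tables for n-grams of order 1 up to max_n.
build_model <- function(lines, max_n = 3) {
  token_list <- tokenize(lines)
  lapply(seq_len(max_n), function(n) {
    grams <- as.character(unlist(lapply(token_list, function(t) {
      if (length(t) < n) return(character(0))
      vapply(seq_len(length(t) - n + 1),
             function(i) paste(t[i:(i + n - 1)], collapse = " "),
             character(1))
    })))
    table(grams)
  })
}

# Predict the next word: try the longest usable context first, then back off.
predict_next <- function(model, phrase) {
  words <- tokenize(phrase)[[1]]
  for (n in seq(min(length(model), length(words) + 1), 1)) {
    counts <- model[[n]]
    if (n == 1) {
      # Last resort: the most frequent single word in the corpus.
      return(names(counts)[which.max(counts)])
    }
    # Context = the last n - 1 words of the phrase.
    prefix <- paste0(paste(tail(words, n - 1), collapse = " "), " ")
    hits <- counts[startsWith(names(counts), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(substring(best, nchar(prefix) + 1))
    }
  }
}

model <- build_model(toy_corpus)
predict_next(model, "they love data")  # matched at the trigram level
predict_next(model, "big data")        # backs off to bigrams
predict_next(model, "hello world")     # backs off to unigrams
```

On this toy corpus the three calls return "science", "science", and "love" respectively. A production version would likely store the n-gram tables in a keyed lookup structure and weight the backed-off estimates, but the control flow would stay the same.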