The goal of this Milestone Report is to demonstrate progress on the Coursera Data Science Capstone project. The final deliverable will be a Shiny application powered by an NLP-based text prediction algorithm trained on a large corpus of English text.
This report covers:
The dataset is the HC Corpora (provided by SwiftKey) and consists of three English-language text files:
| File | Source |
|---|---|
| en_US.blogs.txt | Blog posts |
| en_US.news.txt | News articles |
| en_US.twitter.txt | Tweets |
# Install any missing packages before loading
pkgs <- c("tidyverse", "tidytext", "scales", "knitr", "kableExtra", "stringi", "wordcloud", "RColorBrewer")
new_pkgs <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(new_pkgs)) install.packages(new_pkgs, repos = "https://cran.rstudio.com/")
library(tidyverse)
library(tidytext)
library(scales)
library(knitr)
library(kableExtra)
library(stringi)
library(wordcloud)
library(RColorBrewer)

Note: Update the `data_path` variable below to the folder where your HC Corpora files are saved.
data_path <- "C:/Users/Dr Junaid/Downloads/Coursera-SwiftKey/final/en_US/" # <-- Change this to your actual path
blogs_raw <- readLines(con = paste0(data_path, "en_US.blogs.txt"),
encoding = "UTF-8", skipNul = TRUE)
news_raw <- readLines(con = paste0(data_path, "en_US.news.txt"),
encoding = "UTF-8", skipNul = TRUE)
twitter_raw <- readLines(con = paste0(data_path, "en_US.twitter.txt"),
encoding = "UTF-8", skipNul = TRUE)
cat("Files loaded successfully!\n")

## Files loaded successfully!
## Blogs lines : 899288
## News lines : 1010206
## Twitter lines : 2360148
file_summary <- tibble(
File = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
`Size (MB)`= round(c(
file.size(paste0(data_path, "en_US.blogs.txt")),
file.size(paste0(data_path, "en_US.news.txt")),
file.size(paste0(data_path, "en_US.twitter.txt"))
) / 1024^2, 1),
`Line Count` = format(c(
length(blogs_raw), length(news_raw), length(twitter_raw)
), big.mark = ","),
`Word Count` = format(c(
sum(stri_count_words(blogs_raw)),
sum(stri_count_words(news_raw)),
sum(stri_count_words(twitter_raw))
), big.mark = ","),
`Avg Words / Line` = round(c(
mean(stri_count_words(blogs_raw), na.rm = TRUE),
mean(stri_count_words(news_raw), na.rm = TRUE),
mean(stri_count_words(twitter_raw), na.rm = TRUE)
), 1),
`Max Words / Line` = c(
max(stri_count_words(blogs_raw), na.rm = TRUE),
max(stri_count_words(news_raw), na.rm = TRUE),
max(stri_count_words(twitter_raw), na.rm = TRUE)
)
)
kable(file_summary, caption = "Table 1: Summary of HC Corpora Files") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)

| File | Size (MB) | Line Count | Word Count | Avg Words / Line | Max Words / Line |
|---|---|---|---|---|---|
| en_US.blogs.txt | 200.4 | 899,288 | 37,546,806 | 41.8 | 6726 |
| en_US.news.txt | 196.3 | 1,010,206 | 34,761,151 | 34.4 | 1796 |
| en_US.twitter.txt | 159.4 | 2,360,148 | 30,096,690 | 12.8 | 47 |
Key observations:
Processing all lines is computationally expensive. We draw a random 1% sample from each source for EDA.
set.seed(2024)
sample_pct <- 0.01
blogs_sample <- sample(blogs_raw, size = floor(length(blogs_raw) * sample_pct))
news_sample <- sample(news_raw, size = floor(length(news_raw) * sample_pct))
twitter_sample <- sample(twitter_raw, size = floor(length(twitter_raw) * sample_pct))
# Combine into a labelled tibble
corpus <- tibble(
source = rep(c("Blogs", "News", "Twitter"),
times = c(length(blogs_sample),
length(news_sample),
length(twitter_sample))),
text = c(blogs_sample, news_sample, twitter_sample)
)
cat("Sample sizes — Blogs:", length(blogs_sample),
"| News:", length(news_sample),
"| Twitter:", length(twitter_sample), "\n")

## Sample sizes — Blogs: 8992 | News: 10102 | Twitter: 23601
corpus <- corpus %>%
mutate(word_count = stri_count_words(text))
ggplot(corpus, aes(x = word_count, fill = source)) +
geom_histogram(bins = 50, colour = "white", alpha = 0.85) +
facet_wrap(~source, scales = "free_y") +
scale_fill_brewer(palette = "Set2") +
scale_x_continuous(labels = comma) +
labs(
title = "Figure 1: Distribution of Words per Line by Source",
x = "Words per Line",
y = "Count",
fill = "Source"
) +
theme_minimal(base_size = 13) +
theme(legend.position = "none",
strip.text = element_text(face = "bold"))

Findings: Twitter lines cluster tightly at low word counts (< 30 words), while blogs show a heavy right tail, with some entries exceeding 500 words.
stop_words_custom <- stop_words # using tidytext's built-in English stop words
tokens_raw <- corpus %>%
unnest_tokens(word, text) %>%
filter(!str_detect(word, "^[0-9]+$"), # remove pure numbers
nchar(word) > 1) # remove single characters
tokens_clean <- tokens_raw %>%
anti_join(stop_words_custom, by = "word")

top_words <- tokens_clean %>%
count(source, word, sort = TRUE) %>%
group_by(source) %>%
slice_max(n, n = 20) %>%
ungroup()
ggplot(top_words,
aes(x = reorder_within(word, n, source), y = n, fill = source)) +
geom_col(show.legend = FALSE) +
facet_wrap(~source, scales = "free") +
scale_x_reordered() +
scale_fill_brewer(palette = "Set2") +
scale_y_continuous(labels = comma) +
coord_flip() +
labs(
title = "Figure 2: Top 20 Words per Source (Stop Words Removed)",
x = NULL,
y = "Frequency"
) +
theme_minimal(base_size = 12) +
theme(strip.text = element_text(face = "bold"))

bigrams <- corpus %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
filter(!is.na(bigram)) %>% # lines with fewer than two words yield NA
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!str_detect(word1, "^[0-9]+$"),
!str_detect(word2, "^[0-9]+$")) %>%
unite(bigram, word1, word2, sep = " ") %>%
count(source, bigram, sort = TRUE) %>%
group_by(source) %>%
slice_max(n, n = 15) %>%
ungroup()
ggplot(bigrams,
aes(x = reorder_within(bigram, n, source), y = n, fill = source)) +
geom_col(show.legend = FALSE) +
facet_wrap(~source, scales = "free") +
scale_x_reordered() +
scale_fill_brewer(palette = "Dark2") +
coord_flip() +
labs(
title = "Figure 3: Top 15 Bigrams per Source (Stop Words Removed)",
x = NULL,
y = "Frequency"
) +
theme_minimal(base_size = 12) +
theme(strip.text = element_text(face = "bold"))

trigrams <- corpus %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
filter(!is.na(trigram)) %>% # lines with fewer than three words yield NA
count(source, trigram, sort = TRUE) %>%
group_by(source) %>%
slice_max(n, n = 10) %>%
ungroup()
ggplot(trigrams,
aes(x = reorder_within(trigram, n, source), y = n, fill = source)) +
geom_col(show.legend = FALSE) +
facet_wrap(~source, scales = "free") +
scale_x_reordered() +
scale_fill_brewer(palette = "Set1") +
coord_flip() +
labs(
title = "Figure 4: Top 10 Trigrams per Source",
x = NULL,
y = "Frequency"
) +
theme_minimal(base_size = 12) +
theme(strip.text = element_text(face = "bold"))

word_freq <- tokens_clean %>%
count(word, sort = TRUE) %>%
filter(n >= 5)
set.seed(42)
wordcloud(
words = word_freq$word,
freq = word_freq$n,
max.words = 150,
random.order = FALSE,
rot.per = 0.25,
colors = brewer.pal(8, "Dark2")
)
title("Figure 5: Word Cloud — Combined Corpus Sample")

A key question for the prediction model is: how many unique words cover X% of all word instances?
all_word_freq <- tokens_raw %>% # stop words are kept for coverage; pure numbers and one-character tokens (e.g. "a", "i") were already dropped
count(word, sort = TRUE) %>%
mutate(
cumulative_freq = cumsum(n),
coverage = cumulative_freq / sum(n)
)
cover_50 <- which(all_word_freq$coverage >= 0.50)[1]
cover_90 <- which(all_word_freq$coverage >= 0.90)[1]
ggplot(all_word_freq %>% slice(1:5000),
aes(x = seq_along(word), y = coverage)) +
geom_line(colour = "#2196F3", linewidth = 1) +
geom_hline(yintercept = c(0.5, 0.9), linetype = "dashed", colour = "red") +
annotate("text", x = 200, y = 0.52, label = "50% coverage", colour = "red", size = 4) +
annotate("text", x = 500, y = 0.92, label = "90% coverage", colour = "red", size = 4) +
scale_y_continuous(labels = percent_format()) +
scale_x_continuous(labels = comma) +
labs(
title = "Figure 6: Word Coverage vs. Vocabulary Size",
x = "Number of Unique Words (ranked by frequency)",
y = "Cumulative Coverage"
) +
theme_minimal(base_size = 13)

cov_tbl <- tibble(
`Coverage Target` = c("50%", "90%"),
`Unique Words Needed` = format(c(cover_50, cover_90), big.mark = ",")
)
kable(cov_tbl, caption = "Table 2: Vocabulary Size for Coverage Targets") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Coverage Target | Unique Words Needed |
|---|---|
| 50% | 171 |
| 90% | 7,824 |
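This insight will drive vocabulary pruning in the prediction model: in the 1% sample, the 171 most frequent words account for half of all word instances and 7,824 words account for 90%, so retaining only the most frequent words significantly reduces model size while maintaining high coverage.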
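One way the pruning idea could look in practice is sketched below in base R. The helper names `prune_vocab` and `replace_oov`, the `"<unk>"` placeholder, and the toy token vector are all illustrative assumptions, not part of the analysis above. The sketch keeps the smallest set of top-ranked words that reaches a coverage target and maps everything else to the placeholder.

```r
# Hypothetical helper: return the smallest set of top-ranked words whose
# cumulative frequency reaches the `target` share of all word instances.
prune_vocab <- function(words, target = 0.90) {
  freq <- sort(table(words), decreasing = TRUE)
  coverage <- cumsum(as.numeric(freq)) / sum(freq)
  n_keep <- which(coverage >= target)[1]
  names(freq)[seq_len(n_keep)]
}

# Hypothetical helper: replace out-of-vocabulary tokens with a placeholder.
# "<unk>" is an assumed convention, not something the report prescribes.
replace_oov <- function(words, vocab, unk = "<unk>") {
  ifelse(words %in% vocab, words, unk)
}

# Toy data for illustration only
toy_tokens <- c("the", "the", "the", "the", "cat", "cat", "cat",
                "sat", "sat", "mat")
vocab  <- prune_vocab(toy_tokens, target = 0.80)
pruned <- replace_oov(toy_tokens, vocab)
print(vocab)
print(pruned)
```

On the real corpus the same idea would be applied to the `word` column of `tokens_raw`, with the n-gram tables built after the replacement so that rare words share a single placeholder entry.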
The word-prediction model will be built using an N-gram back-off approach:
| Step | Action |
|---|---|
| 1 | Build unigram, bigram, trigram, and 4-gram frequency tables from the full corpus |
| 2 | Apply Kneser-Ney smoothing or Stupid Back-off to handle unseen n-grams |
| 3 | Given the last 1–3 words typed, look up the most probable next word |
| 4 | Return the top-3 predictions ranked by probability |
| 5 | Prune the vocabulary to reduce memory: keep words covering ≥ 90% of corpus |
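Steps 1 to 4 of the plan can be sketched in base R as follows. This is a minimal illustration of Stupid Back-off under stated assumptions, not the final model: the function names `count_ngrams` and `predict_next`, the toy sentences, and the back-off factor `lambda = 0.4` (a commonly quoted default) are all choices made for the example, and each count table is assumed to be non-empty.

```r
# Count n-grams of order `n`. `sentences` is a list of character vectors,
# one vector of lower-case words per sentence (hypothetical input format).
count_ngrams <- function(sentences, n) {
  grams <- unlist(lapply(sentences, function(w) {
    if (length(w) < n) return(character(0))
    idx <- seq_len(length(w) - n + 1)
    vapply(idx, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.numeric(tab), names(tab))
}

# Score candidate next words, starting from the longest available context.
# A word found only after backing off j times is discounted by lambda^j.
predict_next <- function(history, tables, top_k = 3, lambda = 0.4) {
  history <- tail(history, length(tables) - 1)
  scores <- numeric(0)
  penalty <- 1
  for (k in length(history):0) {
    cand <- numeric(0)
    if (k == 0) {
      cand <- tables[[1]] / sum(tables[[1]])      # unigram relative frequency
    } else {
      prefix <- paste(tail(history, k), collapse = " ")
      higher <- tables[[k + 1]]
      hits <- higher[startsWith(names(higher), paste0(prefix, " "))]
      if (length(hits) > 0) {
        # count(context + word) / count(context); last word is the candidate
        cand <- setNames(hits / tables[[k]][[prefix]],
                         sub(".* ", "", names(hits)))
      }
    }
    # Keep only words not already scored at a higher order
    cand <- penalty * cand[!names(cand) %in% names(scores)]
    scores <- c(scores, cand)
    penalty <- penalty * lambda
  }
  head(sort(scores, decreasing = TRUE), top_k)
}

# Toy corpus for illustration only
sentences <- list(
  c("i", "love", "data", "science"),
  c("i", "love", "data", "analysis"),
  c("i", "love", "r"),
  c("we", "love", "data", "science")
)
tables <- lapply(1:3, function(n) count_ngrams(sentences, n))
print(predict_next(c("i", "love"), tables))
```

With the toy data, the context "i love" is resolved at the trigram level for "data" and "r", and the third suggestion comes from the discounted unigram table. Scanning n-gram names with `startsWith` is fine for a sketch but would likely be too slow at full scale; a keyed lookup table is the more probable design for the Shiny app.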
The Shiny application will feature:
This report demonstrates that the HC Corpora data has been successfully loaded and explored. The three sources (blogs, news, Twitter) show distinct linguistic patterns. The n-gram frequency analysis confirms the feasibility of building a practical next-word prediction model. The planned Stupid Back-off / Kneser-Ney model with vocabulary pruning will balance accuracy, speed, and memory efficiency.