1 Introduction

The goal of this Milestone Report is to demonstrate progress on the Coursera Data Science Capstone project. The final deliverable will be a Shiny application powered by an NLP-based text prediction algorithm trained on a large corpus of English text.

This report covers:

  1. Loading and summarising the three raw text files
  2. Basic exploratory data analysis (word counts, line counts, token frequencies)
  3. Visualisations of important corpus features
  4. Plans for building the prediction algorithm and Shiny app

2 Data Overview

The dataset is the HC Corpora (provided by SwiftKey) and consists of three English-language text files:

File                Source
en_US.blogs.txt     Blog posts
en_US.news.txt      News articles
en_US.twitter.txt   Tweets

2.1 Loading Required Packages

# Install any missing packages before loading
pkgs <- c("tidyverse", "tidytext", "scales", "knitr", "kableExtra", "stringi", "wordcloud", "RColorBrewer")
new_pkgs <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(new_pkgs)) install.packages(new_pkgs, repos = "https://cran.rstudio.com/")

library(tidyverse)
library(tidytext)
library(scales)
library(knitr)
library(kableExtra)
library(stringi)
library(wordcloud)
library(RColorBrewer)

2.2 Reading the Data

Note: Update the data_path variable below to the folder where your HC Corpora files are saved.

data_path <- "C:/Users/Dr Junaid/Downloads/Coursera-SwiftKey/final/en_US/"   # <-- Change this to your actual path

blogs_raw   <- readLines(con = paste0(data_path, "en_US.blogs.txt"),
                         encoding = "UTF-8", skipNul = TRUE)
news_raw    <- readLines(con = paste0(data_path, "en_US.news.txt"),
                         encoding = "UTF-8", skipNul = TRUE)
twitter_raw <- readLines(con = paste0(data_path, "en_US.twitter.txt"),
                         encoding = "UTF-8", skipNul = TRUE)

cat("Files loaded successfully!\n")
## Files loaded successfully!
cat("Blogs lines   :", length(blogs_raw), "\n")
## Blogs lines   : 899288
cat("News lines    :", length(news_raw),  "\n")
## News lines    : 1010206
cat("Twitter lines :", length(twitter_raw), "\n")
## Twitter lines : 2360148
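
If a corpus file contains a stray control character (for example 0x1A), a text-mode read on Windows may stop early and return fewer lines than the file holds. A minimal sketch of a more defensive reader follows; it opens the file through a binary-mode connection. The helper name `read_corpus_file` and the temporary demo file are assumptions made for illustration and are not part of the original analysis.

```r
# Hypothetical helper: read one text file through a binary-mode connection
read_corpus_file <- function(path) {
  con <- file(path, open = "rb")   # "rb" = read, binary mode
  on.exit(close(con))              # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small self-contained demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_corpus_file(tmp)
print(length(demo_lines))
```

With the real data, the call would look like `read_corpus_file(paste0(data_path, "en_US.news.txt"))`.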

3 File-Level Summary Statistics

# Count words per line once per source and reuse the counts for every column
file_names  <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
word_counts <- list(
  stri_count_words(blogs_raw),
  stri_count_words(news_raw),
  stri_count_words(twitter_raw)
)

file_summary <- tibble(
  File               = file_names,
  `Size (MB)`        = round(file.size(paste0(data_path, file_names)) / 1024^2, 1),
  `Line Count`       = format(lengths(word_counts), big.mark = ","),
  `Word Count`       = format(sapply(word_counts, sum), big.mark = ","),
  `Avg Words / Line` = round(sapply(word_counts, mean, na.rm = TRUE), 1),
  `Max Words / Line` = sapply(word_counts, max, na.rm = TRUE)
)

kable(file_summary, caption = "Table 1: Summary of HC Corpora Files") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)
Table 1: Summary of HC Corpora Files
File                Size (MB)   Line Count   Word Count   Avg Words / Line   Max Words / Line
en_US.blogs.txt         200.4      899,288   37,546,806               41.8               6726
en_US.news.txt          196.3    1,010,206   34,761,151               34.4               1796
en_US.twitter.txt       159.4    2,360,148   30,096,690               12.8                 47

Key observations:

  • Blog entries are the longest on average (41.8 words per line) and include the longest single entry (6,726 words), reflecting longer-form writing.
  • Twitter entries are the shortest, averaging 12.8 words per line with a maximum of 47, consistent with the 140-character tweet limit in force when the corpus was collected.
  • News entries fall between the two, averaging 34.4 words per line.

4 Sampling for Efficient Analysis

Processing all 4.27 million lines is computationally expensive, so we draw a random 1% sample from each source for the exploratory analysis.

set.seed(2024)
sample_pct <- 0.01

blogs_sample   <- sample(blogs_raw,   size = floor(length(blogs_raw)   * sample_pct))
news_sample    <- sample(news_raw,    size = floor(length(news_raw)    * sample_pct))
twitter_sample <- sample(twitter_raw, size = floor(length(twitter_raw) * sample_pct))

# Combine into a labelled tibble
corpus <- tibble(
  source = rep(c("Blogs", "News", "Twitter"),
               times = c(length(blogs_sample),
                         length(news_sample),
                         length(twitter_sample))),
  text   = c(blogs_sample, news_sample, twitter_sample)
)

cat("Sample sizes — Blogs:", length(blogs_sample),
    "| News:", length(news_sample),
    "| Twitter:", length(twitter_sample), "\n")
## Sample sizes — Blogs: 8992 | News: 10102 | Twitter: 23601

5 Exploratory Data Analysis

5.1 Distribution of Words per Line

corpus <- corpus %>%
  mutate(word_count = stri_count_words(text))

ggplot(corpus, aes(x = word_count, fill = source)) +
  geom_histogram(bins = 50, colour = "white", alpha = 0.85) +
  facet_wrap(~source, scales = "free_y") +
  scale_fill_brewer(palette = "Set2") +
  scale_x_continuous(labels = comma) +
  labs(
    title = "Figure 1: Distribution of Words per Line by Source",
    x     = "Words per Line",
    y     = "Count",
    fill  = "Source"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none",
        strip.text = element_text(face = "bold"))

Findings: Twitter lines cluster tightly at low word counts (< 30 words), while blogs show a heavy right tail, with some entries exceeding 500 words.
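
The contrast described above can also be checked numerically rather than read off the histogram. The following is a minimal sketch that uses a small made-up data frame (`toy`) in place of the real `corpus` tibble; the numbers in it are invented for illustration only.

```r
# Made-up word counts standing in for corpus$word_count (illustration only)
toy <- data.frame(
  source     = c("Blogs", "Blogs", "Blogs", "Twitter", "Twitter", "Twitter"),
  word_count = c(10, 40, 520, 5, 12, 25)
)

# Median and maximum words per line for each source
by_source <- split(toy$word_count, toy$source)
medians   <- sapply(by_source, median)
maxima    <- sapply(by_source, max)

print(rbind(median = medians, max = maxima))
```

On the sampled data, the same calls would take `corpus$word_count` and `corpus$source` instead of the toy columns.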


5.2 Tokenisation & Unigram Frequencies

stop_words_custom <- stop_words  # using tidytext's built-in English stop words

tokens_raw <- corpus %>%
  unnest_tokens(word, text) %>%
  filter(!str_detect(word, "^[0-9]+$"),   # remove pure numbers
         nchar(word) > 1)                 # remove single characters

tokens_clean <- tokens_raw %>%
  anti_join(stop_words_custom, by = "word")

5.2.1 Top 20 Words (with stop words removed)

top_words <- tokens_clean %>%
  count(source, word, sort = TRUE) %>%
  group_by(source) %>%
  slice_max(n, n = 20, with_ties = FALSE) %>%   # exactly 20 rows per source, even when counts tie
  ungroup()

ggplot(top_words,
       aes(x = reorder_within(word, n, source), y = n, fill = source)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~source, scales = "free") +
  scale_x_reordered() +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(labels = comma) +
  coord_flip() +
  labs(
    title = "Figure 2: Top 20 Words per Source (Stop Words Removed)",
    x     = NULL,
    y     = "Frequency"
  ) +
  theme_minimal(base_size = 12) +
  theme(strip.text = element_text(face = "bold"))


5.3 Bigram Frequencies

bigrams <- corpus %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !str_detect(word1, "^[0-9]+$"),
         !str_detect(word2, "^[0-9]+$")) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  count(source, bigram, sort = TRUE) %>%
  group_by(source) %>%
  slice_max(n, n = 15, with_ties = FALSE) %>%   # exactly 15 rows per source, even when counts tie
  ungroup()

ggplot(bigrams,
       aes(x = reorder_within(bigram, n, source), y = n, fill = source)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~source, scales = "free") +
  scale_x_reordered() +
  scale_fill_brewer(palette = "Dark2") +
  coord_flip() +
  labs(
    title = "Figure 3: Top 15 Bigrams per Source (Stop Words Removed)",
    x     = NULL,
    y     = "Frequency"
  ) +
  theme_minimal(base_size = 12) +
  theme(strip.text = element_text(face = "bold"))


5.4 Trigram Frequencies

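trigrams <- corpus %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%                   # lines with fewer than 3 words yield NA
  count(source, trigram, sort = TRUE) %>%
  group_by(source) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%   # exactly 10 rows per source, even when counts tie
  ungroup()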

ggplot(trigrams,
       aes(x = reorder_within(trigram, n, source), y = n, fill = source)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~source, scales = "free") +
  scale_x_reordered() +
  scale_fill_brewer(palette = "Set1") +
  coord_flip() +
  labs(
    title = "Figure 4: Top 10 Trigrams per Source",
    x     = NULL,
    y     = "Frequency"
  ) +
  theme_minimal(base_size = 12) +
  theme(strip.text = element_text(face = "bold"))


5.5 Word Cloud — Combined Corpus

word_freq <- tokens_clean %>%
  count(word, sort = TRUE) %>%
  filter(n >= 5)

set.seed(42)
wordcloud(
  words  = word_freq$word,
  freq   = word_freq$n,
  max.words  = 150,
  random.order = FALSE,
  rot.per    = 0.25,
  colors     = brewer.pal(8, "Dark2")
)
title("Figure 5: Word Cloud — Combined Corpus Sample")


5.6 Coverage: How Many Words Needed?

A key question for the prediction model is: how many unique words cover X% of all word instances?

# tokens_raw keeps stop words, but pure numbers and single-character tokens
# (including "a" and "i") were already filtered out in Section 5.2
all_word_freq <- tokens_raw %>%
  count(word, sort = TRUE) %>%
  mutate(
    cumulative_freq = cumsum(n),
    coverage        = cumulative_freq / sum(n)
  )

cover_50 <- which(all_word_freq$coverage >= 0.50)[1]
cover_90 <- which(all_word_freq$coverage >= 0.90)[1]

ggplot(all_word_freq %>% slice(1:5000),
       aes(x = seq_along(word), y = coverage)) +
  geom_line(colour = "#2196F3", linewidth = 1) +
  geom_hline(yintercept = c(0.5, 0.9), linetype = "dashed", colour = "red") +
  annotate("text", x = 200, y = 0.52, label = "50% coverage", colour = "red", size = 4) +
  annotate("text", x = 500, y = 0.92, label = "90% coverage", colour = "red", size = 4) +
  scale_y_continuous(labels = percent_format()) +
  scale_x_continuous(labels = comma) +
  labs(
    title = "Figure 6: Word Coverage vs. Vocabulary Size",
    x     = "Number of Unique Words (ranked by frequency)",
    y     = "Cumulative Coverage"
  ) +
  theme_minimal(base_size = 13)

cov_tbl <- tibble(
  `Coverage Target` = c("50%", "90%"),
  `Unique Words Needed` = format(c(cover_50, cover_90), big.mark = ",")
)
kable(cov_tbl, caption = "Table 2: Vocabulary Size for Coverage Targets") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 2: Vocabulary Size for Coverage Targets
Coverage Target   Unique Words Needed
50%                               171
90%                             7,824

This result will guide vocabulary pruning in the prediction model: keeping only the roughly 7,800 most frequent words still covers 90% of word instances in the sample, which reduces model size considerably.
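
The coverage calculation can be wrapped in a small reusable function. This is a sketch in base R; the function name `words_for_coverage` and the toy counts are assumptions made for illustration, not part of the original analysis.

```r
# Number of top-ranked words needed to reach a target share of all tokens
words_for_coverage <- function(counts, target) {
  sorted   <- sort(counts, decreasing = TRUE)
  coverage <- cumsum(sorted) / sum(sorted)
  unname(which(coverage >= target)[1])
}

# Toy frequency table: 100 tokens in total
toy_counts <- c(the = 50, of = 20, and = 15, cat = 10, zebra = 5)

n_50 <- words_for_coverage(toy_counts, 0.50)  # 1: "the" alone is 50 of 100 tokens
n_90 <- words_for_coverage(toy_counts, 0.90)  # 4: 50 + 20 + 15 + 10 = 95 of 100
print(c(n_50, n_90))
```

Applied to the report's data, `words_for_coverage(all_word_freq$n, 0.90)` should give the same figure as `cover_90`.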


6 Interesting Findings

  1. Vocabulary richness varies by source. Blogs appear to use a richer, more varied vocabulary than tweets, which rely on common slang and abbreviations.
  2. Stop words dominate. The most frequent tokens are function words (e.g., the, and, is), and the 171 most frequent words alone account for 50% of all tokens, which motivates careful handling of stop words during model training.
  3. High coverage with few words. In the 1% sample, 171 unique words cover 50% of word instances and 7,824 cover 90%, which is consistent with a power-law (Zipfian) frequency distribution.
  4. Source-specific language. Twitter contains more informal language, contractions, and hashtags; news uses formal vocabulary; blogs sit in between.
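
One way to probe the Zipfian claim in finding 3 is to regress log frequency on log rank; a slope close to -1 is the classic Zipf pattern. The sketch below uses synthetic counts (`toy_n`) constructed to follow that law exactly, so it only illustrates the method and says nothing about the corpus itself.

```r
# Synthetic counts proportional to 1 / rank (an idealised Zipf distribution)
toy_n     <- 1000 / (1:200)
word_rank <- seq_along(toy_n)

# Fit log(frequency) = intercept + slope * log(rank)
fit   <- lm(log(toy_n) ~ log(word_rank))
slope <- unname(coef(fit)[2])

print(round(slope, 3))
```

On the report's data, `all_word_freq$n` (already sorted by frequency) could be substituted for `toy_n`.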

7 Plan for Prediction Algorithm & Shiny App

7.1 Prediction Algorithm

The word-prediction model will be built using an N-gram back-off approach:

Step   Action
1      Build unigram, bigram, trigram, and 4-gram frequency tables from the full corpus
2      Apply Kneser-Ney smoothing or Stupid Back-off to handle unseen n-grams
3      Given the last 1–3 words typed, look up the most probable next word
4      Return the top-3 predictions ranked by probability
5      Prune the vocabulary to reduce memory: keep words covering ≥ 90% of corpus
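
The steps above can be sketched with Stupid Back-off on toy data. This is an illustration under stated assumptions rather than the planned implementation: the n-gram counts are invented, the model stops at trigrams instead of 4-grams, the back-off factor `lambda = 0.4` is the value commonly used with this method, and the names `count_of`, `sbo_score`, and `predict_next` are hypothetical.

```r
# Invented n-gram counts; in practice these come from the corpus frequency tables
unigrams <- c(the = 5, cat = 3, sat = 2, mat = 1)
bigrams  <- c("the cat" = 3, "the mat" = 1, "cat sat" = 2)
trigrams <- c("the cat sat" = 2)

ngram_counts <- c(unigrams, bigrams, trigrams)  # one lookup table keyed by n-gram string
total_words  <- sum(unigrams)

# Count of an n-gram, or 0 if it was never seen
count_of <- function(key) {
  if (key %in% names(ngram_counts)) ngram_counts[[key]] else 0
}

# Stupid Back-off score of `word` given a vector of preceding words
sbo_score <- function(context, word, lambda = 0.4) {
  if (length(context) == 0) return(count_of(word) / total_words)
  ctx_key  <- paste(context, collapse = " ")
  full_key <- paste(ctx_key, word)
  if (count_of(full_key) > 0) {
    return(count_of(full_key) / count_of(ctx_key))
  }
  # Unseen n-gram: drop the oldest context word and discount the score
  lambda * sbo_score(context[-1], word, lambda)
}

# Top-n next-word candidates for a typed phrase
predict_next <- function(phrase, n = 3) {
  words   <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  context <- tail(words, 2)   # trigram model: at most two context words
  scores  <- sapply(names(unigrams), function(w) sbo_score(context, w))
  names(head(sort(scores, decreasing = TRUE), n))
}

print(predict_next("the cat"))
```

Stupid Back-off produces scores rather than normalised probabilities, which is acceptable here because only the ranking of candidates matters.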

7.2 Shiny App

The Shiny application will feature:

  • Text input box — user types a phrase
  • Real-time predictions — top-3 next-word suggestions appear instantly
  • One-click completion — click a suggestion to append it to the text
  • Settings panel — toggle profanity filter, choose prediction aggressiveness
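
A minimal skeleton of such an app might look like the following. It is a sketch only: `predict_next` is a hypothetical placeholder that always returns the same three words, the input IDs (`phrase`, `pick1` to `pick3`) are invented for illustration, and the settings panel is omitted.

```r
library(shiny)

# Hypothetical stand-in for the real model: always returns the same words
predict_next <- function(phrase, n = 3) {
  head(c("the", "to", "and"), n)
}

ui <- fluidPage(
  titlePanel("Next-word prediction (sketch)"),
  textInput("phrase", "Type a phrase:", value = ""),
  # One button per suggestion; the server fills in the labels
  actionButton("pick1", ""),
  actionButton("pick2", ""),
  actionButton("pick3", "")
)

server <- function(input, output, session) {
  suggestions <- reactive(predict_next(input$phrase))

  # Refresh the button labels whenever the suggestions change
  observe({
    s <- suggestions()
    for (i in seq_along(s)) {
      updateActionButton(session, paste0("pick", i), label = s[i])
    }
  })

  # Clicking a button appends that suggestion to the text box
  lapply(1:3, function(i) {
    observeEvent(input[[paste0("pick", i)]], {
      new_text <- trimws(paste(input$phrase, suggestions()[i]))
      updateTextInput(session, "phrase", value = new_text)
    })
  })
}

app <- shinyApp(ui, server)
# In an interactive session, start the app with: runApp(app)
```

Keeping the model behind a single function makes it straightforward to swap the placeholder for the trained back-off model later.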

7.3 Next Steps


8 Conclusion

This report shows that the HC Corpora data has been loaded and explored successfully. The three sources (blogs, news, Twitter) show distinct linguistic patterns, and the n-gram frequency analysis supports the feasibility of a practical next-word prediction model. The planned n-gram back-off model (Stupid Back-off or Kneser-Ney smoothing) with vocabulary pruning is intended to balance accuracy, speed, and memory use.