1 Overview

This report summarises an exploratory analysis of the HC Corpora / SwiftKey dataset for the Coursera Data Science Capstone. The goal is to build a next-word prediction algorithm and deploy it as a Shiny web app, similar to the word suggestions on a smartphone keyboard.

This document covers:

Data loading and basic file statistics
Word and line distributions
Frequency analysis of unigrams, bigrams, and trigrams
Plans for the prediction model and Shiny app

2 Setup: Install & Load Packages

packages <- c("stringi", "tidytext", "tidyr", "dplyr",
              "ggplot2", "knitr", "kableExtra", "stopwords")

installed_pkgs <- rownames(installed.packages())
to_install     <- packages[!packages %in% installed_pkgs]
if (length(to_install)) install.packages(to_install, repos = "https://cran.rstudio.com/")

invisible(lapply(packages, library, character.only = TRUE))

3 Data Loading

# ── Set your path here ────────────────────────────────────────────────────────
DATA_DIR <- "C:/Users/vkpar/Downloads/Coursera-SwiftKey/final/en_US/"   # <- change to your local path

read_safe <- function(fname) {
  readLines(file.path(DATA_DIR, fname), encoding = "UTF-8", skipNul = TRUE)
}

blogs   <- read_safe("en_US.blogs.txt")
news    <- read_safe("en_US.news.txt")
twitter <- read_safe("en_US.twitter.txt")

cat("Loaded -- blogs:", length(blogs),
    "| news:", length(news),
    "| twitter:", length(twitter), "lines\n")

## Loaded -- blogs: 899288 | news: 1010206 | twitter: 2360148 lines

4 Basic Summary Statistics

file_summary <- function(lines, label) {
  word_counts <- stringi::stri_count_words(lines)
  data.frame(
    Source             = label,
    Lines              = format(length(lines),     big.mark = ","),
    Words              = format(sum(word_counts),  big.mark = ","),
    Chars              = format(sum(nchar(lines)), big.mark = ","),
    Max_Line_Length    = format(max(nchar(lines)), big.mark = ","),
    Avg_Words_Per_Line = round(mean(word_counts), 1)
  )
}

summary_df <- rbind(
  file_summary(blogs,   "Blogs"),
  file_summary(news,    "News"),
  file_summary(twitter, "Twitter")
)

kable(summary_df,
      caption = "Table 1 - Corpus Summary Statistics",
      align   = "lrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Table 1 - Corpus Summary Statistics
Source	Lines	Words	Chars	Max_Line_Length	Avg_Words_Per_Line
Blogs	899,288	37,546,806	206,824,505	40,833	41.8
News	1,010,206	34,761,151	203,214,543	11,384	34.4
Twitter	2,360,148	30,096,690	162,096,241	140	12.8

Key observations:

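Blogs have the most words per line on average (41.8), consistent with long-form prose.
Twitter has the most lines but the shortest entries: no line exceeds 140 characters, the tweet length limit when the corpus was collected.
News sits between the two in both style and average line length (34.4 words).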

5 Line Length Distribution

len_df <- data.frame(
  source = c(rep("Blogs",   length(blogs)),
             rep("News",    length(news)),
             rep("Twitter", length(twitter))),
  len    = c(nchar(blogs), nchar(news), nchar(twitter))
)

ggplot(len_df, aes(x = len, fill = source)) +
  geom_histogram(bins = 60, alpha = 0.8) +
  facet_wrap(~source, scales = "free") +
  scale_fill_manual(values = c("#0984e3", "#00b894", "#e17055")) +
  labs(title = "Distribution of Line Lengths by Source",
       x = "Characters per Line", y = "Count") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

rm(len_df); gc()

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  6736951 359.8   10317100  551.0   9294192  496.4
## Vcells 99629407 760.2  183458216 1399.7 182582246 1393.0

6 Sampling the Corpus

Building n-gram tables from the full corpus (about 4.3 million lines) is too memory-intensive for exploratory work, so we draw a random 0.5% sample of lines from each source for the n-gram analysis.

set.seed(2024)
SAMPLE_PCT <- 0.005   # 0.5% -- safe for 8GB RAM machines

sample_lines <- c(
  sample(blogs,   size = floor(length(blogs)   * SAMPLE_PCT)),
  sample(news,    size = floor(length(news)    * SAMPLE_PCT)),
  sample(twitter, size = floor(length(twitter) * SAMPLE_PCT))
)

# Free originals immediately
rm(blogs, news, twitter); gc()

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  2534499 135.4    8253680  440.8   9294192  496.4
## Vcells 15343936 117.1  146766573 1119.8 182582246 1393.0

cat("Sample size:", length(sample_lines), "lines\n")

## Sample size: 21347 lines
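
The sampling above needs all three files in memory before any lines are drawn. If that is a constraint, one alternative is to sample while reading, keeping each line with probability SAMPLE_PCT. The following is a minimal sketch under stated assumptions, not the method used in this report: `sample_file()` and `chunk_size` are hypothetical names, and the resulting sample size is random rather than exactly 0.5% of the lines.

```r
# Hypothetical helper: Bernoulli-sample lines from a text file in chunks,
# so the whole file never has to sit in memory at once.
sample_file <- function(path, pct, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size, encoding = "UTF-8",
                       skipNul = TRUE, warn = FALSE)
    if (length(chunk) == 0L) break          # end of file
    keep <- rbinom(length(chunk), size = 1L, prob = pct) == 1L
    kept <- c(kept, chunk[keep])
  }
  kept
}

# Demo on a temporary file that stands in for one of the corpus files
set.seed(2024)
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:5000), tmp)
demo_sample <- sample_file(tmp, pct = 0.05, chunk_size = 1000L)
cat("Kept", length(demo_sample), "of 5000 lines\n")
unlink(tmp)
```

With pct = 0.05 the demo keeps roughly 250 of the 5,000 lines; the exact number depends on the seed.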

7 Text Cleaning

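clean_text <- function(x) {
  x <- tolower(x)
  # Remove URLs before stripping punctuation
  x <- gsub("http[s]?://\\S+|www\\.\\S+", " ", x)
  # Replace everything except letters and apostrophes with a space
  # (avoids relying on "\\s" inside a bracket expression, which is not portable)
  x <- gsub("[^a-z']", " ", x)
  # Collapse runs of whitespace and trim the ends
  x <- gsub("\\s+", " ", x)
  trimws(x)
}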

cleaned_lines <- clean_text(sample_lines)
rm(sample_lines); gc()

##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  2534429 135.4    8253680 440.8   9294192  496.4
## Vcells 15327402 117.0  117413259 895.8 182582246 1393.0

tidy_df <- data.frame(
  line = seq_along(cleaned_lines),
  text = cleaned_lines,
  stringsAsFactors = FALSE
)

rm(cleaned_lines); gc()

##            used  (Mb) gc trigger  (Mb)  max used   (Mb)
## Ncells  2534454 135.4    8253680 440.8   9294192  496.4
## Vcells 15327449 117.0   93930608 716.7 182582246 1393.0

cat("Tidy data frame ready:", nrow(tidy_df), "rows\n")

## Tidy data frame ready: 21347 rows

8 Unigram Analysis

stop_words_en <- stopwords::stopwords("en")

unigram_df <- tidy_df %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words_en, nchar(word) > 1) %>%
  count(word, sort = TRUE)

cat("Unique words (after stopword removal):", nrow(unigram_df), "\n")

## Unique words (after stopword removal): 35180

ggplot(head(unigram_df, 30),
       aes(x = reorder(word, n), y = n, fill = n)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "#74b9ff", high = "#0984e3") +
  labs(title = "Top 30 Most Frequent Words (stopwords removed)",
       x = NULL, y = "Count") +
  theme_minimal(base_size = 13)

9 Bigram Analysis

bigram_df <- tidy_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("w1", "w2"), sep = " ") %>%
  filter(!w1 %in% stop_words_en,
         !w2 %in% stop_words_en,
         !is.na(w1), !is.na(w2)) %>%
  unite(bigram, w1, w2, sep = " ") %>%
  count(bigram, sort = TRUE)

ggplot(head(bigram_df, 20),
       aes(x = reorder(bigram, n), y = n, fill = n)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "#55efc4", high = "#00b894") +
  labs(title = "Top 20 Bigrams", x = NULL, y = "Count") +
  theme_minimal(base_size = 13)

10 Trigram Analysis

trigram_df <- tidy_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE)

ggplot(head(trigram_df, 20),
       aes(x = reorder(trigram, n), y = n, fill = n)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "#fd79a8", high = "#e84393") +
  labs(title = "Top 20 Trigrams", x = NULL, y = "Count") +
  theme_minimal(base_size = 13)

11 Word Coverage Analysis

How many unique words are needed to cover X% of all word instances?

unigram_all <- tidy_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  mutate(cum_pct = cumsum(n) / sum(n) * 100,
         rank    = row_number())

cover_50 <- unigram_all$rank[which(unigram_all$cum_pct >= 50)[1]]
cover_90 <- unigram_all$rank[which(unigram_all$cum_pct >= 90)[1]]

cat("Words needed for 50% coverage:", cover_50, "\n")

## Words needed for 50% coverage: 141

cat("Words needed for 90% coverage:", cover_90, "\n")

## Words needed for 90% coverage: 6631

ggplot(unigram_all[1:min(8000, nrow(unigram_all)), ],
       aes(x = rank, y = cum_pct)) +
  geom_line(color = "#6c5ce7", linewidth = 1) +
  geom_hline(yintercept = c(50, 90),
             linetype = "dashed",
             color    = c("#e17055", "#d63031")) +
  annotate("text", x = cover_50 + 200, y = 47,
           label = paste0(cover_50, " words -> 50%"),
           color = "#e17055", size = 4) +
  annotate("text", x = cover_90 + 200, y = 87,
           label = paste0(cover_90, " words -> 90%"),
           color = "#d63031", size = 4) +
  labs(title = "Cumulative Word Coverage Curve",
       x = "Unique Words (ranked by frequency)",
       y = "Cumulative Coverage (%)") +
  theme_minimal(base_size = 13)

In this 0.5% sample, the 6,631 most frequent words cover 90% of all word occurrences, which suggests the vocabulary can be pruned aggressively with only a modest loss of coverage.
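
The coverage calculation can be checked by hand on a toy example. The sketch below uses made-up word counts (an assumption for illustration only, not corpus data) and the same cumulative-percentage logic as the code above; `words_for()` is a hypothetical helper name.

```r
# Made-up word counts, for illustration only
counts <- c(the = 50, of = 20, and = 15, cat = 10, sat = 5)

# Rank words by frequency and accumulate their share of all occurrences
counts  <- sort(counts, decreasing = TRUE)
cum_pct <- cumsum(counts) / sum(counts) * 100   # 50, 70, 85, 95, 100

# Smallest number of top-ranked words whose cumulative share reaches target
words_for <- function(target) unname(which(cum_pct >= target)[1])

words_for(50)   # 1: "the" alone covers half of all occurrences
words_for(90)   # 4: the top four words are needed to reach 90%
```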

12 Key Findings

Table 2 - Summary of Key EDA Findings
Finding	Detail
Corpus scale	About 102 million words (roughly 572 million characters) across blogs, news, and Twitter
Source diversity	Three distinct writing styles improve model generalisation
50% coverage	141 unique words cover half of all word occurrences in the sample
90% coverage	6,631 unique words cover 90% of word occurrences in the sample, which allows a compact model
Bigram signal	Common 2-word phrases are strong predictors of the next word
Style variation	Twitter is short and informal; blogs are long-form — both are valuable training signals

13 Plan: Prediction Algorithm

The next-word predictor will use a Stupid Backoff n-gram model:

Build n-gram tables (unigram through 4-gram) from the full corpus using data.table.
Back-off logic: given the last 3 words typed, search the 4-grams first; if there is no match, fall back to trigrams, then bigrams, then unigrams, discounting the score by a fixed factor at each fall-back step.
Vocabulary pruning: drop n-grams with frequency < 3, which is estimated to cut model size by ~80%.
Profanity filtering: strip offensive words before building the tables.
Output: return the top 3 predicted next words ranked by score.
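
The steps above can be sketched in base R on a toy corpus. This is an illustration under stated assumptions, not the final implementation: the plan calls for data.table and the full corpus, whereas the sketch uses named vectors and five made-up lines. The names `count_ngrams()`, `prune_table()` and `predict_next()` are hypothetical, and the back-off factor of 0.4 is the value commonly used with Stupid Backoff.

```r
# Toy corpus (an assumption for illustration; real tables would be built
# from the cleaned full corpus).
corpus <- c("i love new york", "i love new york", "i love new shoes",
            "i love you", "new york is big")

# Count n-grams of order n; returns a named integer vector.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1L),
           function(i) paste(w[i:(i + n - 1L)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}

# Pruning step from the plan: keep n-grams seen at least min_count times.
prune_table <- function(tab, min_count = 3L) tab[tab >= min_count]

# min_count = 1 keeps everything here, because the toy corpus is tiny.
tables <- lapply(1:4, function(n) prune_table(count_ngrams(corpus, n), 1L))

# Score candidates from the longest available context down to unigrams.
# Each word keeps the score from the highest order at which it was seen.
predict_next <- function(input, tables, k = 3L, alpha = 0.4) {
  words   <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  max_ctx <- min(length(words), length(tables) - 1L)
  scores  <- numeric(0)
  for (m in max_ctx:0) {                 # m = number of context words used
    penalty <- alpha^(max_ctx - m)       # one factor of alpha per back-off
    if (m == 0L) {
      s <- penalty * tables[[1]] / sum(tables[[1]])
    } else {
      context <- paste(tail(words, m), collapse = " ")
      tab     <- tables[[m + 1L]]
      if (length(tab) == 0L) next
      hits <- tab[startsWith(names(tab), paste0(context, " "))]
      if (length(hits) == 0L) next
      s <- penalty * hits / tables[[m]][[context]]
      names(s) <- sub(".* ", "", names(hits))   # keep only the last word
    }
    scores <- c(scores, s[!names(s) %in% names(scores)])
  }
  head(sort(scores, decreasing = TRUE), k)
}

predict_next("i love new", tables)     # 4-gram match: york 2/3, shoes 1/3
predict_next("we adore new", tables)   # backs off twice to the bigram table
```

In the second call no 4-gram or trigram matches, so the bigram scores are multiplied by 0.4 twice: "york" scores 0.16 × 3/4 = 0.12. Scoring every level rather than stopping at the first match is one way to make sure three suggestions are always available.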

14 Plan: Shiny App

The app will work like a smartphone keyboard:

Text input: the user types a partial sentence
Prediction buttons: the top 3 next-word suggestions appear as the user types
Click to accept: tapping a suggestion appends it to the input
Speed: pre-loaded, compressed .rds n-gram tables, with a target response time under 200 ms
Deployment: hosted on shinyapps.io
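
The pre-loading idea can be sketched with base R alone. The table below is a made-up stand-in for a pruned n-gram table, and the file path is temporary; both are assumptions for illustration. In a Shiny app, the `readRDS()` call would sit outside the server function so that it runs once when the app starts rather than once per user session.

```r
# Made-up stand-in for a pruned bigram table
bigram_tab <- c("new york" = 3L, "new shoes" = 1L, "love new" = 3L)

# Compress once, offline, before deployment
rds_path <- tempfile(fileext = ".rds")
saveRDS(bigram_tab, rds_path, compress = "xz")

# Load once at app start-up (outside the server function in a Shiny app)
loaded_tab <- readRDS(rds_path)

# Time a single prefix lookup against the in-memory table
elapsed <- system.time(
  hits <- loaded_tab[startsWith(names(loaded_tab), "new ")]
)[["elapsed"]]

cat("Matches:", length(hits), "| lookup time (s):", elapsed, "\n")
unlink(rds_path)
```

Timing on a three-row table says nothing about the real tables; the same `system.time()` pattern would need to be rerun on the full pruned tables to check the 200 ms target.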

15 Next Steps

Build full n-gram tables from 100% of the corpus
Implement and benchmark the back-off algorithm
Evaluate accuracy on a held-out test set
Build and deploy the Shiny app
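
One possible way to run the held-out evaluation is top-3 accuracy: the share of test positions where the true next word is among the three suggestions. The sketch below is an assumption about how this could be measured, not a committed design; `make_pairs()`, `top_k_accuracy()` and the fixed-answer `baseline_fn()` are hypothetical, and the two test lines are made up.

```r
# Turn each held-out line into (context, target) pairs: every word after
# the first is a target, and the words before it form the context.
make_pairs <- function(lines) {
  pairs <- lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
    if (length(w) < 2L) return(NULL)
    data.frame(
      context = vapply(2:length(w),
                       function(i) paste(w[1:(i - 1L)], collapse = " "),
                       character(1)),
      target  = w[-1],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pairs)
}

# Share of pairs whose target is among the first k suggestions.
top_k_accuracy <- function(pairs, predict_fn, k = 3L) {
  hits <- mapply(function(ctx, tgt) tgt %in% head(predict_fn(ctx), k),
                 pairs$context, pairs$target)
  mean(hits)
}

# Stand-in predictor that ignores the context; a real run would plug in
# the back-off model here.
baseline_fn <- function(context) c("the", "to", "and")

held_out <- c("i went to the shop", "cats and dogs")
pairs    <- make_pairs(held_out)
top_k_accuracy(pairs, baseline_fn)   # 3 of 6 targets hit: 0.5
```

A context-free baseline like this gives a floor against which the back-off model can be compared.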

Report generated with R Markdown · Coursera Data Science Capstone

NLP Capstone: Milestone Report

Exploratory Data Analysis of the SwiftKey Corpus

Vikas Parmar

2026-03-23