Required Packages

library(tidyverse)       # dplyr, ggplot2, stringr
library(tidytext)        # unnest_tokens for n-grams
library(quanteda)        # fast corpus & tokenisation
library(quanteda.textstats) # textstat_frequency
library(stringi)         # string statistics
library(knitr)           # kable tables
library(scales)          # axis formatting
library(wordcloud)       # word cloud
library(RColorBrewer)    # colour palettes

Introduction

This milestone report is part of the Johns Hopkins Data Science Capstone project in partnership with SwiftKey. The ultimate goal is to build a predictive text application - similar to the autocomplete feature on a smartphone keyboard - that suggests the next word a user is likely to type.

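This report covers loading the three source files, basic summary statistics, sampling and pre-processing, an exploratory analysis of word and n-gram frequencies and vocabulary coverage, and the plans for the prediction algorithm and the Shiny app.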


Data Loading

The dataset provided is a corpus of English text collected from three sources: blogs, news articles, and Twitter. The raw files are large, so we load them carefully and work with a random sample for exploratory analysis.

data_dir     <- "C:/Users/anam.shaikh/OneDrive - YouGov Services Limited/R Training/Statistics Training"
blogs_path   <- file.path(data_dir, "en_US.blogs.txt")
news_path    <- file.path(data_dir, "en_US.news.txt")
twitter_path <- file.path(data_dir, "en_US.twitter.txt")

blogs   <- readLines(blogs_path,   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(news_path,    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitter_path, encoding = "UTF-8", skipNul = TRUE)

cat("Files loaded successfully.\n")
## Files loaded successfully.
cat("Blogs lines   :", length(blogs),   "\n")
## Blogs lines   : 899288
cat("News lines    :", length(news),    "\n")
## News lines    : 1010206
cat("Twitter lines :", length(twitter), "\n")
## Twitter lines : 2360148
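One caveat when reading large text files on Windows: a text-mode connection can treat certain control characters as end-of-file, which may silently shorten what `readLines()` returns. Whether that affects these particular files is an assumption, not something the output above shows. A minimal sketch of a binary-mode reader follows; `read_lines_binary` is a hypothetical helper name, and the temporary file only stands in for a real corpus file.

```r
# Hypothetical helper: read a text file through a binary-mode connection.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")   # "rb" = read, binary mode
  on.exit(close(con))              # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small self-contained demonstration on a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

Comparing `length()` of the result against the line count from an independent tool is a cheap way to confirm that nothing was dropped.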

Basic Summary Statistics

# Word counts
blogs_words   <- sum(stri_count_words(blogs),   na.rm = TRUE)
news_words    <- sum(stri_count_words(news),     na.rm = TRUE)
twitter_words <- sum(stri_count_words(twitter),  na.rm = TRUE)

# Longest line (characters)
blogs_max   <- max(nchar(blogs))
news_max    <- max(nchar(news))
twitter_max <- max(nchar(twitter))

# File sizes on disk (MB)
blogs_size   <- round(file.info(blogs_path)$size   / 1e6, 1)
news_size    <- round(file.info(news_path)$size    / 1e6, 1)
twitter_size <- round(file.info(twitter_path)$size / 1e6, 1)

summary_df <- data.frame(
  Source        = c("Blogs", "News", "Twitter"),
  `File Size (MB)` = c(blogs_size, news_size, twitter_size),
  `Line Count`     = formatC(c(length(blogs), length(news), length(twitter)),
                              format = "d", big.mark = ","),
  `Word Count`     = formatC(c(blogs_words, news_words, twitter_words),
                              format = "d", big.mark = ","),
  `Longest Line`   = formatC(c(blogs_max, news_max, twitter_max),
                              format = "d", big.mark = ","),
  check.names = FALSE
)

kable(summary_df,
      caption = "Table 1: Summary statistics for the three corpus files",
      align   = c("l", "r", "r", "r", "r"))
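Table 1: Summary statistics for the three corpus files

| Source  | File Size (MB) | Line Count | Word Count | Longest Line |
|:--------|---------------:|-----------:|-----------:|-------------:|
| Blogs   |          210.2 |    899,288 | 37,546,806 |       40,833 |
| News    |          205.8 |  1,010,206 | 34,761,151 |       11,384 |
| Twitter |          167.1 |  2,360,148 | 30,096,690 |          140 |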
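The figures in Table 1 also imply an average line length for each source and a total word count for the corpus. The short calculation below derives both from the table values alone; the vector names are illustrative.

```r
# Line and word counts copied from Table 1.
line_count <- c(Blogs = 899288,   News = 1010206,  Twitter = 2360148)
word_count <- c(Blogs = 37546806, News = 34761151, Twitter = 30096690)

# Average words per line for each source, rounded to one decimal place.
avg_words_per_line <- round(word_count / line_count, 1)
avg_words_per_line

# Total words across the three files.
total_words <- sum(word_count)
format(total_words, big.mark = ",")
```

On these numbers, blog and news lines are several times longer than tweets on average, and the corpus totals a little over 100 million words.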

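Key observations:

  • Twitter has the most lines (2,360,148) but the fewest words, and no line is longer than 140 characters.
  • Blogs have the fewest lines but the most words and the longest single line (40,833 characters).
  • The three files are of comparable size on disk, between roughly 167 MB and 210 MB.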


Sampling the Data

The full corpus contains just over 100 million words - too large to process interactively. We draw a random 1 % sample of the lines from each source and combine them into a single corpus for analysis. Working from a small random sample is common practice in exploratory NLP work.

set.seed(2024)

sample_pct <- 0.01   # change to 0.05 for a richer (but slower) sample

sample_blogs   <- sample(blogs,   size = round(length(blogs)   * sample_pct))
sample_news    <- sample(news,    size = round(length(news)    * sample_pct))
sample_twitter <- sample(twitter, size = round(length(twitter) * sample_pct))

combined <- c(sample_blogs, sample_news, sample_twitter)

cat("Sample sizes — Blogs:", length(sample_blogs),
    "| News:", length(sample_news),
    "| Twitter:", length(sample_twitter), "\n")
## Sample sizes — Blogs: 8993 | News: 10102 | Twitter: 23601
cat("Combined sample lines:", length(combined), "\n")
## Combined sample lines: 42696

Text Pre-Processing

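We clean the text in two stages. First, stringi strips non-ASCII characters and URLs and trims surrounding whitespace. Then quanteda removes punctuation, numbers, and symbols during tokenisation and converts tokens to lower case when the document-feature matrix is built. Stop words are intentionally kept because they are crucial for predicting the next word in natural language (e.g. “I want to ___”).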

# Remove non-ASCII characters (emojis, accented letters; this also deletes
# curly quotes and apostrophes)
combined_clean <- stri_replace_all_regex(combined, "[^\\p{ASCII}]", "")
# Remove URLs
combined_clean <- stri_replace_all_regex(combined_clean,
                    "http[s]?://\\S+|www\\.\\S+", "")
# Trim leading and trailing whitespace
combined_clean <- stri_trim_both(combined_clean)
combined_clean <- combined_clean[nchar(combined_clean) > 0]  # drop empty lines

# Build quanteda corpus
qcorp <- corpus(combined_clean)

cat("Cleaned corpus documents:", format(ndoc(qcorp), big.mark = ","), "\n")
## Cleaned corpus documents: 42,694
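Because the non-ASCII filter also deletes curly apostrophes, a contraction typed with a curly apostrophe loses it entirely. The sketch below shows one possible ordering that converts curly single quotes to straight ones first and also collapses the double spaces left behind by removed URLs. `clean_line` is a hypothetical helper, not the function used in this report, and the sample sentence is made up.

```r
library(stringi)

# Hypothetical variant of the cleaning steps, applied in a different order.
clean_line <- function(x) {
  x <- stri_replace_all_regex(x, "[\u2018\u2019]", "'")             # curly -> straight apostrophe
  x <- stri_replace_all_regex(x, "http[s]?://\\S+|www\\.\\S+", "")  # drop URLs
  x <- stri_replace_all_regex(x, "[^\\p{ASCII}]", "")               # drop remaining non-ASCII
  x <- stri_replace_all_regex(x, "\\s+", " ")                       # collapse runs of whitespace
  stri_trim_both(x)                                                 # trim both ends
}

cleaned <- clean_line("I don\u2019t know \U0001F600 see https://example.com now")
cleaned
```

Keeping the apostrophe means contractions stay distinct from unrelated words that happen to share the same letters, which matters once n-gram counts are built.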

Exploratory Analysis

Word (Unigram) Frequencies

toks <- tokens(qcorp,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE,
               remove_url     = TRUE)

# Unigram frequency
uni_dfm <- dfm(toks)
uni_df  <- textstat_frequency(uni_dfm, n = 25) %>%
             as.data.frame()

ggplot(uni_df, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col(fill = "#2980b9") +
  coord_flip() +
  scale_y_continuous(labels = comma) +
  labs(
    title   = "Figure 1: Top 25 Most Frequent Words (Unigrams)",
    x       = NULL,
    y       = "Frequency",
    caption = "Source: 1% random sample of SwiftKey corpus"
  ) +
  theme_minimal(base_size = 12)

The most common words are function/stop words (the, and, to, a…). This is expected and important - they form the backbone of most sentences.

Bigram Frequencies

bi_toks <- tokens_ngrams(toks, n = 2)
bi_dfm  <- dfm(bi_toks)
bi_df   <- textstat_frequency(bi_dfm, n = 20) %>%
             as.data.frame() %>%
             mutate(feature = stri_replace_all_fixed(feature, "_", " "))

ggplot(bi_df, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col(fill = "#27ae60") +
  coord_flip() +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Figure 2: Top 20 Most Frequent Bigrams (2-word phrases)",
    x     = NULL,
    y     = "Frequency"
  ) +
  theme_minimal(base_size = 12)
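For readers unfamiliar with n-grams, the idea behind `tokens_ngrams()` can be illustrated in a few lines of base R: slide a window of n tokens across one document and join each window with an underscore. `make_ngrams` below is a hypothetical stand-in written for illustration, not quanteda's implementation.

```r
# Build all n-grams of order n from a single tokenised document.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))   # document too short
  starts <- seq_len(length(tokens) - n + 1)      # first index of each window
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

toy_tokens <- c("i", "want", "to", "go")
bigrams  <- make_ngrams(toy_tokens, 2)
trigrams <- make_ngrams(toy_tokens, 3)
bigrams
trigrams
```

Because the window is applied one document at a time, n-grams never span two different lines of the corpus.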

Trigram Frequencies

tri_toks <- tokens_ngrams(toks, n = 3)
tri_dfm  <- dfm(tri_toks)
tri_df   <- textstat_frequency(tri_dfm, n = 20) %>%
              as.data.frame() %>%
              mutate(feature = stri_replace_all_fixed(feature, "_", " "))

ggplot(tri_df, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col(fill = "#8e44ad") +
  coord_flip() +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Figure 3: Top 20 Most Frequent Trigrams (3-word phrases)",
    x     = NULL,
    y     = "Frequency"
  ) +
  theme_minimal(base_size = 12)

Word Cloud

# Top 150 words excluding very common stop words for a more interesting cloud
toks_no_stop <- tokens_remove(toks, pattern = stopwords("en"))
wc_dfm <- dfm(toks_no_stop)
wc_df  <- textstat_frequency(wc_dfm, n = 150) %>% as.data.frame()

set.seed(42)
wordcloud(words  = wc_df$feature,
          freq   = wc_df$frequency,
          min.freq    = 2,
          max.words   = 150,
          random.order = FALSE,
          colors       = brewer.pal(8, "Dark2"),
          scale        = c(4, 0.5))
title("Figure 4: Word Cloud (stop words removed)")

Word Count Distribution by Source

# Separate name so the word-cloud data frame (wc_df) is not overwritten
line_wc_df <- data.frame(
  source   = c(rep("Blogs",   length(sample_blogs)),
               rep("News",    length(sample_news)),
               rep("Twitter", length(sample_twitter))),
  word_cnt = c(stri_count_words(sample_blogs),
               stri_count_words(sample_news),
               stri_count_words(sample_twitter))
)

ggplot(line_wc_df, aes(x = word_cnt, fill = source)) +
  geom_histogram(bins = 60, alpha = 0.7, position = "identity") +
  facet_wrap(~source, scales = "free_y") +
  scale_fill_manual(values = c("#2980b9","#27ae60","#e67e22")) +
  labs(
    title = "Figure 5: Distribution of Words per Line by Source",
    x     = "Words per Line",
    y     = "Count"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Vocabulary Coverage

How many unique words are needed to cover most of the corpus? This directly affects how large the final prediction model will be.

freq_all  <- textstat_frequency(uni_dfm) %>% as.data.frame()
total_tok <- sum(freq_all$frequency)
cum_cov   <- cumsum(freq_all$frequency) / total_tok

n_50 <- which(cum_cov >= 0.50)[1]
n_90 <- which(cum_cov >= 0.90)[1]
n_95 <- which(cum_cov >= 0.95)[1]

kable(
  data.frame(
    `Coverage Target` = c("50 %", "90 %", "95 %"),
    `Unique Words Needed` = format(c(n_50, n_90, n_95), big.mark = ","),
    check.names = FALSE
  ),
  align   = c("l","r"),
  caption = "Table 2: Unique words required to reach coverage targets"
)
Table 2: Unique words required to reach coverage targets

| Coverage Target | Unique Words Needed |
|:----------------|--------------------:|
| 50 %            |                 144 |
| 90 %            |               7,852 |
| 95 %            |              18,073 |

n_plot <- min(30000, length(cum_cov))
cov_plot_df <- data.frame(
  rank     = seq_len(n_plot),
  coverage = cum_cov[seq_len(n_plot)]
)

ggplot(cov_plot_df, aes(x = rank, y = coverage)) +
  geom_line(colour = "#c0392b", linewidth = 1) +
  geom_hline(yintercept = c(0.50, 0.90, 0.95),
             linetype = "dashed", colour = "grey50") +
  annotate("text", x = n_plot * 0.6, y = 0.52, label = "50 % coverage") +
  annotate("text", x = n_plot * 0.6, y = 0.92, label = "90 % coverage") +
  annotate("text", x = n_plot * 0.6, y = 0.97, label = "95 % coverage") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title = "Figure 6: Vocabulary Coverage Curve",
    x     = "Number of Unique Words (ranked by frequency)",
    y     = "Cumulative % of All Tokens Covered"
  ) +
  theme_minimal(base_size = 12)
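The coverage calculation is easier to follow on a toy vocabulary. The sketch below repeats the same steps (sort by frequency, take the cumulative share, find the first rank that reaches the target) on four made-up word counts; `words_needed` is a hypothetical helper name.

```r
# Made-up word counts for illustration only.
toy_counts <- c(the = 5, cat = 3, sat = 1, mat = 1)

# Number of most-frequent words needed to cover at least `target` of all tokens.
words_needed <- function(counts, target) {
  sorted  <- sort(counts, decreasing = TRUE)   # most frequent first
  cum_cov <- cumsum(sorted) / sum(sorted)      # cumulative share of tokens
  unname(which(cum_cov >= target)[1])          # first rank reaching the target
}

words_needed(toy_counts, 0.50)
words_needed(toy_counts, 0.75)
```

Here the single most frequent word already covers half of the ten tokens, and two words are enough for 75 %, which mirrors the steep start of the curve in Figure 6.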

Key Findings

| Finding | Implication for the Model |
|:--------|:--------------------------|
| A small set of very common words accounts for 50 %+ of all tokens | A compact vocabulary can still cover most predictions |
| Twitter lines average ~13 words; blog lines average ~42 words | Sentence-level context will vary by source |
| Many rare words appear only once (hapax legomena) | These can be replaced by an <UNK> token to reduce model size |
| Profanity and non-English words exist in the corpus | A profanity filter and language detection step are needed |
| Bigrams and trigrams show clear, meaningful phrases | N-gram models should yield useful predictions |
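The <UNK> replacement mentioned in the table can be sketched in base R as follows. The token vector is made up, and treating every word that occurs exactly once as unknown is one possible rule, assumed here for illustration.

```r
# Made-up token stream for illustration.
toy_tokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat")

counts <- table(toy_tokens)                 # frequency of each distinct word
hapax  <- names(counts)[counts == 1]        # words seen exactly once

# Replace every hapax with the placeholder token.
tokens_unk <- ifelse(toy_tokens %in% hapax, "<UNK>", toy_tokens)
tokens_unk

# Share of the vocabulary made up of hapax legomena.
hapax_share <- length(hapax) / length(counts)
hapax_share
```

After the replacement the vocabulary shrinks from five distinct words to three (the, cat, <UNK>), which is the size reduction the finding refers to.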

Plan: Prediction Algorithm

The prediction model will follow a Stupid Backoff approach (with Kneser-Ney smoothing as a possible alternative) using pre-computed n-gram tables:

  1. Build n-gram tables (unigrams through tetragrams) from the full cleaned corpus, storing each n-gram with its count and relative-frequency score.
  2. At prediction time:
    • Take the last 3 words typed by the user.
    • Look up matching 4-grams → return the highest-scoring next word.
    • If no 4-gram match is found, back off to trigrams, then bigrams, then unigrams, discounting the score at each step.
  3. Optimisation: Store only n-grams that appear ≥ 2 times to keep the model small enough for a web app.
  4. Output: Return the top 3 predicted next words with their scores (Stupid Backoff scores are relative frequencies, not normalised probabilities).

This approach is fast (simple table look-up), interpretable, and well-suited to a lightweight Shiny deployment.
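The look-up-and-back-off procedure can be sketched in base R on a toy corpus. Everything below is illustrative: the four short documents are made up, `count_ngrams` and `predict_next` are hypothetical helper names, and the back-off factor of 0.4 is the value commonly quoted for Stupid Backoff, assumed here rather than tuned. No count-based pruning is applied.

```r
# Toy corpus: each element is one already-tokenised document (made-up data).
docs <- list(
  c("i", "want", "to", "go", "home"),
  c("i", "want", "to", "sleep"),
  c("i", "want", "to", "go", "out"),
  c("you", "want", "a", "break")
)

# Count n-grams of order n without crossing document boundaries.
# Returns a named integer vector; names are space-separated n-grams.
count_ngrams <- function(n, docs) {
  grams <- unlist(lapply(docs, function(tok) {
    if (length(tok) < n) return(character(0))
    vapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}

tables <- lapply(1:4, count_ngrams, docs = docs)  # tables[[n]] holds n-gram counts

# Score candidate next words, backing off from the longest usable context.
predict_next <- function(context, tables, k = 3, alpha = 0.4) {
  max_ctx <- min(length(context), length(tables) - 1)
  scores  <- numeric(0)
  for (m in max_ctx:0) {                       # m = number of context words used
    tab <- tables[[m + 1]]
    if (m == 0) {
      cand <- tab / sum(tab)                   # unigram relative frequency
    } else {
      prefix <- paste(tail(context, m), collapse = " ")
      hits   <- tab[startsWith(names(tab), paste0(prefix, " "))]
      if (length(hits) == 0) next              # context unseen: back off further
      cand <- hits / tables[[m]][[prefix]]     # count(context + word) / count(context)
      names(cand) <- sub(".* ", "", names(hits))  # keep only the predicted word
    }
    cand   <- cand * alpha^(max_ctx - m)       # discount once per back-off step
    new    <- setdiff(names(cand), names(scores))  # keep the highest-order score
    scores <- c(scores, cand[new])
  }
  head(sort(scores, decreasing = TRUE), k)
}

top3 <- predict_next(c("i", "want", "to"), tables)
top3
```

On this toy data, "go" and "sleep" come straight from the 4-gram table, while the third suggestion is only reached after backing off to unigrams and therefore carries a much smaller score. In the real model the tables would be pre-computed once and stored, so prediction is only the look-up part.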


Plan: Shiny App

The final deliverable will be a Shiny web application with:

| Feature | Description |
|:--------|:------------|
| Text input box | User types a sentence; predictions update in real time |
| Next-word suggestions | Top 3 predicted words shown as clickable buttons |
| One-click insertion | Clicking a suggestion appends it to the input |
| Source selector (optional) | Filter predictions by blogs / news / Twitter style |
| About tab | Brief explanation of the model for non-technical users |

The app will be deployed on shinyapps.io so that it is accessible in any browser without installation.
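A bare-bones skeleton of the first three features might look like the sketch below. It is a starting point under stated assumptions, not the final app: `predict_stub` is a hypothetical placeholder that always returns the same three words, the input and output IDs are arbitrary, and the source selector and About tab are left out.

```r
library(shiny)

# Hypothetical placeholder for the real model: always returns three words.
predict_stub <- function(text) c("the", "to", "and")

ui <- fluidPage(
  textInput("txt", "Type a sentence:"),
  uiOutput("suggestions")            # suggestion buttons are rendered here
)

server <- function(input, output, session) {
  # Recompute predictions whenever the text changes.
  preds <- reactive(predict_stub(input$txt))

  # One button per predicted word.
  output$suggestions <- renderUI({
    lapply(seq_along(preds()), function(i) {
      actionButton(paste0("sugg_", i), preds()[i])
    })
  })

  # Clicking a button appends its word to the text box.
  lapply(1:3, function(i) {
    observeEvent(input[[paste0("sugg_", i)]], {
      updateTextInput(session, "txt",
                      value = trimws(paste(input$txt, preds()[i])))
    })
  })
}

# shinyApp(ui, server)  # uncomment to launch the app locally
```

Swapping `predict_stub` for the real prediction function is the only change needed to connect the interface to the model.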


Conclusion

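This report has demonstrated that:

  • the three source files can be loaded, summarised, and sampled for interactive analysis;
  • in the 1 % sample, 144 words cover 50 % of all tokens, while about 18,000 are needed for 95 % coverage;
  • the most frequent bigrams and trigrams form meaningful phrases, which supports an n-gram prediction model.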

The next steps are to build the full n-gram model, tune it for speed and accuracy, and wrap it in a polished Shiny interface.


Report prepared for the Johns Hopkins / Coursera Data Science Specialisation Capstone - Week 2 Milestone.