Required Packages

library(tidyverse)       # dplyr, ggplot2, stringr
library(tidytext)        # unnest_tokens for n-grams
library(quanteda)        # fast corpus & tokenisation
library(quanteda.textstats) # textstat_frequency
library(stringi)         # string statistics
library(knitr)           # kable tables
library(scales)          # axis formatting
library(wordcloud)       # word cloud
library(RColorBrewer)    # colour palettes

Introduction

This milestone report is part of the Johns Hopkins Data Science Capstone project in partnership with SwiftKey. The ultimate goal is to build a predictive text application, similar to the autocomplete feature on a smartphone keyboard, that suggests the next word a user is likely to type.

This report covers:

Successful loading of the raw SwiftKey corpus
Basic summary statistics (line counts, word counts, file sizes)
Exploratory analysis with visualisations (word and n-gram frequencies)
Key findings from the data
An outline of the planned prediction algorithm and Shiny app

Data Loading

The dataset provided is a corpus of English text collected from three sources: blogs, news articles, and Twitter. The raw files are large, so we load them carefully and work with a random sample for exploratory analysis.

# Folder holding the three raw corpus files (adjust to your own machine)
data_dir <- "C:/Users/anam.shaikh/OneDrive - YouGov Services Limited/R Training/Statistics Training"

blogs_path   <- file.path(data_dir, "en_US.blogs.txt")
news_path    <- file.path(data_dir, "en_US.news.txt")
twitter_path <- file.path(data_dir, "en_US.twitter.txt")

blogs   <- readLines(blogs_path,   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(news_path,    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitter_path, encoding = "UTF-8", skipNul = TRUE)

cat("Files loaded successfully.\n")

## Files loaded successfully.

cat("Blogs lines   :", length(blogs),   "\n")

## Blogs lines   : 899288

cat("News lines    :", length(news),    "\n")

## News lines    : 1010206

cat("Twitter lines :", length(twitter), "\n")

## Twitter lines : 2360148
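
One possible hardening of the load step, offered as a sketch rather than the method used in this report, is to open each file through a binary-mode connection before calling readLines(). On Windows, text-mode reads have been reported to stop early at an embedded Ctrl-Z (0x1A) character, so comparing line counts between the two modes is a cheap sanity check. The helper name read_lines_binary and the temporary demo file are hypothetical.

```r
# Hypothetical helper: read a text file through a binary-mode connection.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")       # "rb" = read, binary mode
  on.exit(close(con), add = TRUE)      # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Self-contained demo on a temporary file (not the SwiftKey data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
unlink(tmp)
```

If the count returned this way matches the text-mode count for a file, the original read was complete.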

Basic Summary Statistics

# Word counts
blogs_words   <- sum(stri_count_words(blogs),   na.rm = TRUE)
news_words    <- sum(stri_count_words(news),     na.rm = TRUE)
twitter_words <- sum(stri_count_words(twitter),  na.rm = TRUE)

# Longest line (characters)
blogs_max   <- max(nchar(blogs))
news_max    <- max(nchar(news))
twitter_max <- max(nchar(twitter))

# File sizes on disk (MB)
blogs_size   <- round(file.info(blogs_path)$size   / 1e6, 1)
news_size    <- round(file.info(news_path)$size    / 1e6, 1)
twitter_size <- round(file.info(twitter_path)$size / 1e6, 1)

summary_df <- data.frame(
  Source        = c("Blogs", "News", "Twitter"),
  `File Size (MB)` = c(blogs_size, news_size, twitter_size),
  `Line Count`     = formatC(c(length(blogs), length(news), length(twitter)),
                              format = "d", big.mark = ","),
  `Word Count`     = formatC(c(blogs_words, news_words, twitter_words),
                              format = "d", big.mark = ","),
  `Longest Line`   = formatC(c(blogs_max, news_max, twitter_max),
                              format = "d", big.mark = ","),
  check.names = FALSE
)

kable(summary_df,
      caption = "Table 1: Summary statistics for the three corpus files",
      align   = c("l", "r", "r", "r", "r"))

Table 1: Summary statistics for the three corpus files
Source	File Size (MB)	Line Count	Word Count	Longest Line
Blogs	210.2	899,288	37,546,806	40,833
News	205.8	1,010,206	34,761,151	11,384
Twitter	167.1	2,360,148	30,096,690	140

Key observations:

The blogs file is the largest by file size and has the longest individual lines, reflecting long-form writing.
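Twitter has the most lines, but each line is short: the longest is 140 characters, consistent with Twitter's original 140-character limit.
News sits between the two: it has more lines than blogs but fewer words, and its longest line (11,384 characters) is far shorter than the longest blog line (40,833).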
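
The averages behind these observations can be derived directly from Table 1. The short sketch below copies the line and word counts from the table (the data frame name stats is only illustrative) and computes words per line for each source, along with the total word count.

```r
# Figures copied from Table 1
stats <- data.frame(
  source = c("Blogs", "News", "Twitter"),
  lines  = c(899288, 1010206, 2360148),
  words  = c(37546806, 34761151, 30096690)
)

stats$words_per_line <- round(stats$words / stats$lines, 1)
total_words <- sum(stats$words)

stats          # words_per_line: 41.8 (Blogs), 34.4 (News), 12.8 (Twitter)
total_words    # 102404647
```

Blog and news lines are therefore roughly three times as long as tweets on average, and the three files together hold a little over 100 million words.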

Sampling the Data

The full corpus contains just over 100 million words (Table 1), too many to process interactively. We draw a random 1 % sample of lines from each source and combine them into a single corpus for analysis; sampling like this is common in exploratory NLP work.

set.seed(2024)

sample_pct <- 0.01   # change to 0.05 for a richer (but slower) sample

sample_blogs   <- sample(blogs,   size = round(length(blogs)   * sample_pct))
sample_news    <- sample(news,    size = round(length(news)    * sample_pct))
sample_twitter <- sample(twitter, size = round(length(twitter) * sample_pct))

combined <- c(sample_blogs, sample_news, sample_twitter)

cat("Sample sizes — Blogs:", length(sample_blogs),
    "| News:", length(sample_news),
    "| Twitter:", length(sample_twitter), "\n")

## Sample sizes — Blogs: 8993 | News: 10102 | Twitter: 23601

cat("Combined sample lines:", length(combined), "\n")

## Combined sample lines: 42696
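
The three sample() calls above follow the same pattern, so they could be wrapped in a small helper. The version below is a sketch only: sample_lines is a hypothetical name, and unlike the report's code it resets the seed on every call, which makes each draw reproducible on its own.

```r
# Hypothetical helper: draw a reproducible random fraction of a character vector.
sample_lines <- function(x, pct, seed = 2024) {
  stopifnot(pct > 0, pct <= 1)
  set.seed(seed)
  sample(x, size = round(length(x) * pct))
}

toy <- sprintf("line %d", 1:1000)
s1  <- sample_lines(toy, 0.01)
s2  <- sample_lines(toy, 0.01)

length(s1)          # 10 lines: 1 % of 1,000
identical(s1, s2)   # TRUE: same seed, same sample
```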

Text Pre-Processing

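We clean the text with stringi and quanteda: non-ASCII characters and URLs are stripped first, then punctuation, numbers, and symbols are removed during tokenisation. Lower-casing happens when the document-feature matrix is built, since dfm() lower-cases tokens by default. Stop words are intentionally kept because they are crucial for predicting the next word in natural language (e.g. “I want to ___”).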

# Remove non-ASCII (emojis, foreign characters)
combined_clean <- stri_replace_all_regex(combined, "[^\\p{ASCII}]", "")
# Remove URLs
combined_clean <- stri_replace_all_regex(combined_clean,
                    "http[s]?://\\S+|www\\.\\S+", "")
# Trim leading and trailing whitespace (interior whitespace is left as-is)
combined_clean <- stri_trim_both(combined_clean)
combined_clean <- combined_clean[nchar(combined_clean) > 0]  # drop empty lines

# Build quanteda corpus
qcorp <- corpus(combined_clean)

cat("Cleaned corpus documents:", format(ndoc(qcorp), big.mark = ","), "\n")

## Cleaned corpus documents: 42,694
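
To see what the cleaning steps do to individual lines, the sketch below applies the same three stringi calls to a few made-up strings. The wrapper clean_lines and the toy input are illustrative assumptions, not part of the report's pipeline.

```r
library(stringi)

# Hypothetical wrapper around the cleaning steps used above
clean_lines <- function(x) {
  x <- stri_replace_all_regex(x, "[^\\p{ASCII}]", "")               # drop non-ASCII
  x <- stri_replace_all_regex(x, "http[s]?://\\S+|www\\.\\S+", "")  # drop URLs
  x <- stri_trim_both(x)                                            # trim the ends
  x[nchar(x) > 0]                                                   # drop empty lines
}

toy <- c("Visit https://example.com now!",
         "caf\u00e9 \u2615 time",   # accented letter plus a symbol
         "   ",
         "www.example.org")

cleaned <- clean_lines(toy)
cleaned
```

Two of the four toy lines survive. Note that the accented letter is deleted rather than transliterated ("café" becomes "caf") and that the double spaces left behind by removed items are not collapsed; neither matters much for tokenisation, but both are worth knowing when inspecting cleaned text.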

Exploratory Analysis

Word (Unigram) Frequencies

toks <- tokens(qcorp,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE,
               remove_url     = TRUE)

# Unigram frequency
uni_dfm <- dfm(toks)
uni_df  <- textstat_frequency(uni_dfm, n = 25) %>%
             as.data.frame()

ggplot(uni_df, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col(fill = "#2980b9") +
  coord_flip() +
  scale_y_continuous(labels = comma) +
  labs(
    title   = "Figure 1: Top 25 Most Frequent Words (Unigrams)",
    x       = NULL,
    y       = "Frequency",
    caption = "Source: 1% random sample of SwiftKey corpus"
  ) +
  theme_minimal(base_size = 12)

The most common words are function/stop words (the, and, to, a…). This is expected and important: they form the backbone of most sentences.

Bigram Frequencies

bi_toks <- tokens_ngrams(toks, n = 2)
bi_dfm  <- dfm(bi_toks)
bi_df   <- textstat_frequency(bi_dfm, n = 20) %>%
             as.data.frame() %>%
             mutate(feature = stri_replace_all_fixed(feature, "_", " "))

ggplot(bi_df, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col(fill = "#27ae60") +
  coord_flip() +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Figure 2: Top 20 Most Frequent Bigrams (2-word phrases)",
    x     = NULL,
    y     = "Frequency"
  ) +
  theme_minimal(base_size = 12)

Trigram Frequencies

tri_toks <- tokens_ngrams(toks, n = 3)
tri_dfm  <- dfm(tri_toks)
tri_df   <- textstat_frequency(tri_dfm, n = 20) %>%
              as.data.frame() %>%
              mutate(feature = stri_replace_all_fixed(feature, "_", " "))

ggplot(tri_df, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col(fill = "#8e44ad") +
  coord_flip() +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Figure 3: Top 20 Most Frequent Trigrams (3-word phrases)",
    x     = NULL,
    y     = "Frequency"
  ) +
  theme_minimal(base_size = 12)
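
For readers unfamiliar with n-grams, the following base-R sketch shows how a sliding window over a token vector produces them. The helper make_ngrams is hypothetical; it only mirrors the underscore-joined output format of tokens_ngrams() and is not how quanteda implements it.

```r
# Hypothetical base-R helper: slide a window of width n over the tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

toy_tokens <- c("i", "want", "to", "go", "home")
bigrams    <- make_ngrams(toy_tokens, 2)
trigrams   <- make_ngrams(toy_tokens, 3)

bigrams    # "i_want" "want_to" "to_go" "go_home"
trigrams   # "i_want_to" "want_to_go" "to_go_home"
```

Because punctuation is removed before the n-grams are built, a window formed this way can span a sentence boundary, which is worth keeping in mind when reading the most frequent phrases.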

Word Cloud

# Top 150 words excluding very common stop words for a more interesting cloud
toks_no_stop <- tokens_remove(toks, pattern = stopwords("en"))
wc_dfm <- dfm(toks_no_stop)
wc_df  <- textstat_frequency(wc_dfm, n = 150) %>% as.data.frame()

set.seed(42)
wordcloud(words  = wc_df$feature,
          freq   = wc_df$frequency,
          min.freq    = 2,
          max.words   = 150,
          random.order = FALSE,
          colors       = brewer.pal(8, "Dark2"),
          scale        = c(4, 0.5))
title("Figure 4: Word Cloud (stop words removed)")

Word Count Distribution by Source

# Named line_wc_df so the word-cloud frequency table (wc_df) is not overwritten
line_wc_df <- data.frame(
  source   = c(rep("Blogs",   length(sample_blogs)),
               rep("News",    length(sample_news)),
               rep("Twitter", length(sample_twitter))),
  word_cnt = c(stri_count_words(sample_blogs),
               stri_count_words(sample_news),
               stri_count_words(sample_twitter))
)

ggplot(line_wc_df, aes(x = word_cnt, fill = source)) +
  geom_histogram(bins = 60, alpha = 0.7, position = "identity") +
  facet_wrap(~source, scales = "free_y") +
  scale_fill_manual(values = c("#2980b9","#27ae60","#e67e22")) +
  labs(
    title = "Figure 5: Distribution of Words per Line by Source",
    x     = "Words per Line",
    y     = "Count"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

Vocabulary Coverage

How many unique words are needed to cover most of the corpus? This directly affects how large the final prediction model will be.

freq_all  <- textstat_frequency(uni_dfm) %>% as.data.frame()
total_tok <- sum(freq_all$frequency)
cum_cov   <- cumsum(freq_all$frequency) / total_tok

n_50 <- which(cum_cov >= 0.50)[1]
n_90 <- which(cum_cov >= 0.90)[1]
n_95 <- which(cum_cov >= 0.95)[1]

kable(
  data.frame(
    `Coverage Target` = c("50 %", "90 %", "95 %"),
    `Unique Words Needed` = format(c(n_50, n_90, n_95), big.mark = ","),
    check.names = FALSE
  ),
  align   = c("l","r"),
  caption = "Table 2: Unique words required to reach coverage targets"
)

Table 2: Unique words required to reach coverage targets
Coverage Target	Unique Words Needed
50 %	144
90 %	7,852
95 %	18,073

n_plot <- min(30000, length(cum_cov))
cov_plot_df <- data.frame(
  rank     = seq_len(n_plot),
  coverage = cum_cov[seq_len(n_plot)]
)

ggplot(cov_plot_df, aes(x = rank, y = coverage)) +
  geom_line(colour = "#c0392b", linewidth = 1) +
  geom_hline(yintercept = c(0.50, 0.90, 0.95),
             linetype = "dashed", colour = "grey50") +
  annotate("text", x = n_plot * 0.6, y = 0.52, label = "50 % coverage") +
  annotate("text", x = n_plot * 0.6, y = 0.92, label = "90 % coverage") +
  annotate("text", x = n_plot * 0.6, y = 0.97, label = "95 % coverage") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title = "Figure 6: Vocabulary Coverage Curve",
    x     = "Number of Unique Words (ranked by frequency)",
    y     = "Cumulative % of All Tokens Covered"
  ) +
  theme_minimal(base_size = 12)
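
The coverage calculation is easier to follow on a tiny example. The sketch below uses an invented six-word frequency table (all values are made up) and applies the same cumsum() and which() logic as the code above.

```r
# Toy frequency table, already sorted from most to least frequent
toy_freq <- c(the = 40, to = 25, and = 15, a = 12, cat = 5, zebra = 3)

toy_cov  <- cumsum(toy_freq) / sum(toy_freq)   # cumulative share of all tokens
n_50_toy <- which(toy_cov >= 0.50)[1]          # first rank reaching 50 %
n_90_toy <- which(toy_cov >= 0.90)[1]          # first rank reaching 90 %

round(toy_cov, 2)                 # 0.40 0.65 0.80 0.92 0.97 1.00
unname(c(n_50_toy, n_90_toy))     # 2 4
```

In this toy vocabulary two words cover half of the tokens and four cover 90 %, the same pattern of steeply diminishing returns that Table 2 shows at full scale.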

Key Findings

Finding	Implication for the Model
The 144 most frequent words account for 50 % of all tokens in the sample, and 7,852 words reach 90 % (Table 2)	A compact vocabulary can still cover most predictions
Twitter lines average ~13 words; blog lines average ~42 words (from Table 1)	Sentence-level context will vary by source
Many rare words appear only once (hapax legomena)	These can be replaced by an `<UNK>` token to reduce model size
Profanity and non-English words exist in the corpus	A profanity filter and language detection step are needed
Bigrams and trigrams show clear, meaningful phrases	N-gram models should yield useful predictions
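
As a sketch of the `<UNK>` idea in the table above, the base-R snippet below replaces every word that occurs exactly once in a toy token stream. The token vector and variable names are illustrative; the real pipeline would apply the same idea to the cleaned corpus, possibly with a different frequency threshold.

```r
# Toy token stream; in practice this would come from the cleaned corpus
toy_tokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat")

counts     <- table(toy_tokens)
hapaxes    <- names(counts)[counts == 1]     # words seen exactly once
unk_tokens <- ifelse(toy_tokens %in% hapaxes, "<UNK>", toy_tokens)

unk_tokens
length(unique(toy_tokens))   # 5 distinct words before
length(unique(unk_tokens))   # 3 distinct tokens after
```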

Plan: Prediction Algorithm

The prediction model will follow a Stupid Backoff approach (with Kneser-Ney smoothing as a possible refinement) using pre-computed n-gram tables:

Build n-gram tables (unigrams through 4-grams) from the full cleaned corpus, storing each n-gram with its count and relative-frequency score.
At prediction time:
- Take the last 3 words typed by the user.
- Look up matching 4-grams → return the highest-scoring next word.
- If no 4-gram match is found, back off to trigrams, then bigrams, then unigrams, discounting the score at each step.
Optimisation: Store only n-grams that appear ≥ 2 times to keep the model small enough for a web app.
Output: Return the top 3 predicted next words with their scores. Stupid Backoff scores are discounted relative frequencies rather than normalised probabilities, so they rank candidates but do not sum to 1.

This approach is fast (simple table look-up), interpretable, and well-suited to a lightweight Shiny deployment.
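
The look-up and back-off logic described above can be sketched in a few lines of base R. This is a toy illustration under stated assumptions, not the final implementation: the function names build_counts and predict_next, the 13-word corpus, and the use of trigrams instead of 4-grams are all invented for the example, and the back-off factor of 0.4 is the value commonly used with Stupid Backoff.

```r
# Toy Stupid Backoff predictor in base R (illustrative names and data).

build_counts <- function(tokens, max_n = 3) {
  stopifnot(length(tokens) >= max_n)
  lapply(seq_len(max_n), function(n) {
    starts <- seq_len(length(tokens) - n + 1)
    grams  <- vapply(starts,
                     function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                     character(1))
    tab <- table(grams)
    setNames(as.numeric(tab), names(tab))   # named vector: n-gram -> count
  })
}

predict_next <- function(context, counts, top_k = 3, alpha = 0.4) {
  context <- tail(context, length(counts) - 1)  # keep at most max_n - 1 words
  steps   <- 0                                  # number of back-offs so far
  while (length(context) > 0) {
    ctx    <- paste(context, collapse = " ")
    prefix <- paste0(ctx, " ")
    grams  <- counts[[length(context) + 1]]
    hits   <- grams[startsWith(names(grams), prefix)]
    if (length(hits) > 0) {
      # score = alpha^steps * count(context + word) / count(context)
      score <- alpha^steps * hits / counts[[length(context)]][[ctx]]
      names(score) <- substring(names(hits), nchar(prefix) + 1)
      return(head(sort(score, decreasing = TRUE), top_k))
    }
    context <- context[-1]   # drop the oldest word and retry
    steps   <- steps + 1
  }
  # No context matched: fall back to discounted unigram frequencies
  uni <- counts[[1]]
  head(sort(alpha^steps * uni / sum(uni), decreasing = TRUE), top_k)
}

toy    <- strsplit("i want to go home i want to eat i want a nap", " ")[[1]]
counts <- build_counts(toy, max_n = 3)

p_seen    <- predict_next(c("i", "want"), counts)     # trigram match
p_backoff <- predict_next(c("you", "want"), counts)   # backs off to bigrams
p_seen
p_backoff
```

For the seen context "i want", the top suggestion is "to" with a score of 2/3, because two of the three occurrences of "i want" are followed by "to". For "you want" there is no matching trigram, so the predictor backs off to bigrams starting with "want" and multiplies the score by 0.4, giving "to" a score of about 0.27. A production version would replace the linear prefix scan with keyed tables so that each look-up stays fast on the full corpus.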

Plan: Shiny App

The final deliverable will be a Shiny web application with:

Feature	Description
Text input box	User types a sentence; predictions update in real time
Next-word suggestions	Top 3 predicted words shown as clickable buttons
One-click insertion	Clicking a suggestion appends it to the input
Source selector (optional)	Filter predictions by blogs / news / Twitter style
About tab	Brief explanation of the model for non-technical users

The app will be deployed on shinyapps.io so that it is accessible in any browser without installation.

Conclusion

This report has demonstrated that:

The SwiftKey corpus has been successfully downloaded and loaded into R.
Basic statistics show the corpus contains just over 100 million words across three very different text styles.
Exploratory plots reveal clear word-frequency patterns that n-gram models can exploit.
A practical, efficient prediction pipeline has been designed and is ready for implementation.

The next steps are to build the full n-gram model, tune it for speed and accuracy, and wrap it in a polished Shiny interface.

Report prepared for the Johns Hopkins / Coursera Data Science Specialisation Capstone - Week 2 Milestone.

Milestone_Report

Anam Shaikh

2026-06-24