library(tidyverse) # dplyr, ggplot2, stringr
library(tidytext) # unnest_tokens for n-grams
library(quanteda) # fast corpus & tokenisation
library(quanteda.textstats) # textstat_frequency
library(stringi) # string statistics
library(knitr) # kable tables
library(scales) # axis formatting
library(wordcloud) # word cloud
library(RColorBrewer) # colour palettes
This milestone report is part of the Johns Hopkins Data Science Capstone project in partnership with SwiftKey. The ultimate goal is to build a predictive text application - similar to the autocomplete feature on a smartphone keyboard - that suggests the next word a user is likely to type.
This report covers:
The dataset provided is a corpus of English text collected from three sources: blogs, news articles, and Twitter. The raw files are large, so we load them carefully and work with a random sample for exploratory analysis.
blogs_path <- "C:/Users/anam.shaikh/OneDrive - YouGov Services Limited/R Training/Statistics Training/en_US.blogs.txt"
news_path <- "C:/Users/anam.shaikh/OneDrive - YouGov Services Limited/R Training/Statistics Training/en_US.news.txt"
twitter_path <- "C:/Users/anam.shaikh/OneDrive - YouGov Services Limited/R Training/Statistics Training/en_US.twitter.txt"
blogs <- readLines(blogs_path, encoding = "UTF-8", skipNul = TRUE)
news <- readLines(news_path, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitter_path, encoding = "UTF-8", skipNul = TRUE)
cat("Files loaded successfully.\n")
## Files loaded successfully.
cat("Blogs lines :", length(blogs), "\n")
## Blogs lines : 899288
cat("News lines :", length(news), "\n")
## News lines : 1010206
cat("Twitter lines :", length(twitter), "\n")
## Twitter lines : 2360148
# Word counts
blogs_words <- sum(stri_count_words(blogs), na.rm = TRUE)
news_words <- sum(stri_count_words(news), na.rm = TRUE)
twitter_words <- sum(stri_count_words(twitter), na.rm = TRUE)
# Longest line (characters)
blogs_max <- max(nchar(blogs))
news_max <- max(nchar(news))
twitter_max <- max(nchar(twitter))
# File sizes on disk (MB)
blogs_size <- round(file.info(blogs_path)$size / 1e6, 1)
news_size <- round(file.info(news_path)$size / 1e6, 1)
twitter_size <- round(file.info(twitter_path)$size / 1e6, 1)
summary_df <- data.frame(
Source = c("Blogs", "News", "Twitter"),
`File Size (MB)` = c(blogs_size, news_size, twitter_size),
`Line Count` = formatC(c(length(blogs), length(news), length(twitter)),
format = "d", big.mark = ","),
`Word Count` = formatC(c(blogs_words, news_words, twitter_words),
format = "d", big.mark = ","),
`Longest Line` = formatC(c(blogs_max, news_max, twitter_max),
format = "d", big.mark = ","),
check.names = FALSE
)
kable(summary_df,
caption = "Table 1: Summary statistics for the three corpus files",
align = c("l", "r", "r", "r", "r"))
| Source | File Size (MB) | Line Count | Word Count | Longest Line |
|---|---|---|---|---|
| Blogs | 210.2 | 899,288 | 37,546,806 | 40,833 |
| News | 205.8 | 1,010,206 | 34,761,151 | 11,384 |
| Twitter | 167.1 | 2,360,148 | 30,096,690 | 140 |
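The totals in Table 1 can be turned into average line lengths with a little arithmetic. The sketch below simply re-types the line and word counts from the table (the vector names are illustrative, not part of the original analysis) and divides one by the other.

```r
# Figures copied from Table 1
lines_n <- c(Blogs = 899288, News = 1010206, Twitter = 2360148)
words_n <- c(Blogs = 37546806, News = 34761151, Twitter = 30096690)

# Mean words per line for each source, rounded to one decimal place
avg_words <- round(words_n / lines_n, 1)

# Total word count across the three files
total_words <- sum(words_n)

print(avg_words)
print(format(total_words, big.mark = ","))
```

On these figures, blog lines are roughly three times as long as tweets, and the three files together hold a little over 100 million words.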
Key observations:
The full corpus contains just over 100 million words - too large to process interactively. We draw a random 1 % sample of lines from each source and combine them into a single corpus for analysis. Working from a small random sample is a common approach in exploratory NLP work.
set.seed(2024)
sample_pct <- 0.01 # change to 0.05 for a richer (but slower) sample
sample_blogs <- sample(blogs, size = round(length(blogs) * sample_pct))
sample_news <- sample(news, size = round(length(news) * sample_pct))
sample_twitter <- sample(twitter, size = round(length(twitter) * sample_pct))
combined <- c(sample_blogs, sample_news, sample_twitter)
cat("Sample sizes — Blogs:", length(sample_blogs),
"| News:", length(sample_news),
"| Twitter:", length(sample_twitter), "\n")
## Sample sizes — Blogs: 8993 | News: 10102 | Twitter: 23601
cat("Combined sample lines:", length(combined), "\n")
## Combined sample lines: 42696
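The same sampling step can be wrapped in a small helper, which makes it easy to check that a fixed seed reproduces the sample. This is a minimal sketch on toy data: `sample_lines()` and the `toy` vector are hypothetical names introduced here, not part of the report's code.

```r
# Hypothetical helper (not in the original report): draw a fixed fraction
# of lines from a character vector, without replacement.
sample_lines <- function(x, pct) {
  sample(x, size = round(length(x) * pct))
}

# Toy stand-in for one of the corpus files
toy <- paste("line", 1:1000)

set.seed(2024)
first_draw <- sample_lines(toy, 0.01)

set.seed(2024)
second_draw <- sample_lines(toy, 0.01)

length(first_draw)                  # 10 lines, i.e. 1 % of 1000
identical(first_draw, second_draw)  # TRUE: same seed, same sample
```

Resetting the seed immediately before each draw is what makes the two samples identical. In the report a single `set.seed()` call precedes all three draws, so the result is reproducible as long as the draws run in the same order.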
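We clean the text in two stages. First, stringi strips non-ASCII characters and URLs and trims surrounding whitespace. Then quanteda tokenises the text, removing punctuation, numbers, and symbols; tokens are lower-cased when the document-feature matrix is built. Stop words are intentionally kept because they are crucial for predicting the next word in natural language (e.g. “I want to ___”).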
# Remove non-ASCII (emojis, foreign characters)
combined_clean <- stri_replace_all_regex(combined, "[^\\p{ASCII}]", "")
# Remove URLs
combined_clean <- stri_replace_all_regex(combined_clean,
"http[s]?://\\S+|www\\.\\S+", "")
# Trim leading and trailing whitespace
combined_clean <- stri_trim_both(combined_clean)
combined_clean <- combined_clean[nchar(combined_clean) > 0] # drop empty lines
# Build quanteda corpus
qcorp <- corpus(combined_clean)
cat("Cleaned corpus documents:", format(ndoc(qcorp), big.mark = ","), "\n")
## Cleaned corpus documents: 42,694
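To see what these cleaning steps do to individual lines, the sketch below applies the same stringi calls to four invented strings. The input vector is an assumption made purely for illustration.

```r
library(stringi)

# Toy input, invented for illustration (\u escapes keep this source ASCII-only)
raw <- c("Check https://example.com now",
         "caf\u00e9 time",
         "don\u2019t stop",
         "\u2764")

# Same steps as the report: strip non-ASCII, strip URLs, trim the ends
clean <- stri_replace_all_regex(raw, "[^\\p{ASCII}]", "")
clean <- stri_replace_all_regex(clean, "http[s]?://\\S+|www\\.\\S+", "")
clean <- stri_trim_both(clean)

# Drop lines that are now empty (the emoji-only line)
clean <- clean[nchar(clean) > 0]

print(clean)
```

Two side effects are worth noting. Removing the curly apostrophe turns "don’t" into "dont", so contractions typed with typographic quotes are merged into a single token. Removing a URL from the middle of a line leaves a double space behind, which should be harmless here because the tokeniser splits on whitespace.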
toks <- tokens(qcorp,
remove_punct = TRUE,
remove_numbers = TRUE,
remove_symbols = TRUE,
remove_url = TRUE)
# Unigram frequency
uni_dfm <- dfm(toks)
uni_df <- textstat_frequency(uni_dfm, n = 25) %>%
as.data.frame()
ggplot(uni_df, aes(x = reorder(feature, frequency), y = frequency)) +
geom_col(fill = "#2980b9") +
coord_flip() +
scale_y_continuous(labels = comma) +
labs(
title = "Figure 1: Top 25 Most Frequent Words (Unigrams)",
x = NULL,
y = "Frequency",
caption = "Source: 1% random sample of SwiftKey corpus"
) +
theme_minimal(base_size = 12)
The most common words are function/stop words (the, and, to, a…). This is expected and important - they form the backbone of most sentences.
bi_toks <- tokens_ngrams(toks, n = 2)
bi_dfm <- dfm(bi_toks)
bi_df <- textstat_frequency(bi_dfm, n = 20) %>%
as.data.frame() %>%
mutate(feature = stri_replace_all_fixed(feature, "_", " "))
ggplot(bi_df, aes(x = reorder(feature, frequency), y = frequency)) +
geom_col(fill = "#27ae60") +
coord_flip() +
scale_y_continuous(labels = comma) +
labs(
title = "Figure 2: Top 20 Most Frequent Bigrams (2-word phrases)",
x = NULL,
y = "Frequency"
) +
theme_minimal(base_size = 12)
tri_toks <- tokens_ngrams(toks, n = 3)
tri_dfm <- dfm(tri_toks)
tri_df <- textstat_frequency(tri_dfm, n = 20) %>%
as.data.frame() %>%
mutate(feature = stri_replace_all_fixed(feature, "_", " "))
ggplot(tri_df, aes(x = reorder(feature, frequency), y = frequency)) +
geom_col(fill = "#8e44ad") +
coord_flip() +
scale_y_continuous(labels = comma) +
labs(
title = "Figure 3: Top 20 Most Frequent Trigrams (3-word phrases)",
x = NULL,
y = "Frequency"
) +
theme_minimal(base_size = 12)
# Top 150 words excluding very common stop words for a more interesting cloud
toks_no_stop <- tokens_remove(toks, pattern = stopwords("en"))
wc_dfm <- dfm(toks_no_stop)
wc_df <- textstat_frequency(wc_dfm, n = 150) %>% as.data.frame()
set.seed(42)
wordcloud(words = wc_df$feature,
freq = wc_df$frequency,
min.freq = 2,
max.words = 150,
random.order = FALSE,
colors = brewer.pal(8, "Dark2"),
scale = c(4, 0.5))
title("Figure 4: Word Cloud (stop words removed)")
# Words per line for each sampled source (kept separate from the word-cloud table)
len_df <- data.frame(
source = c(rep("Blogs", length(sample_blogs)),
rep("News", length(sample_news)),
rep("Twitter", length(sample_twitter))),
word_cnt = c(stri_count_words(sample_blogs),
stri_count_words(sample_news),
stri_count_words(sample_twitter))
)
ggplot(len_df, aes(x = word_cnt, fill = source)) +
geom_histogram(bins = 60, alpha = 0.7, position = "identity") +
facet_wrap(~source, scales = "free_y") +
scale_fill_manual(values = c("#2980b9","#27ae60","#e67e22")) +
labs(
title = "Figure 5: Distribution of Words per Line by Source",
x = "Words per Line",
y = "Count"
) +
theme_minimal(base_size = 12) +
theme(legend.position = "none")
How many unique words are needed to cover most of the corpus? This directly affects how large the final prediction model will be.
freq_all <- textstat_frequency(uni_dfm) %>% as.data.frame()
total_tok <- sum(freq_all$frequency)
cum_cov <- cumsum(freq_all$frequency) / total_tok
n_50 <- which(cum_cov >= 0.50)[1]
n_90 <- which(cum_cov >= 0.90)[1]
n_95 <- which(cum_cov >= 0.95)[1]
kable(
data.frame(
`Coverage Target` = c("50 %", "90 %", "95 %"),
`Unique Words Needed` = format(c(n_50, n_90, n_95), big.mark = ","),
check.names = FALSE
),
align = c("l","r"),
caption = "Table 2: Unique words required to reach coverage targets"
)
| Coverage Target | Unique Words Needed |
|---|---|
| 50 % | 144 |
| 90 % | 7,852 |
| 95 % | 18,073 |
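The coverage calculation is easier to follow on a handful of words. The sketch below uses invented counts (the `counts` vector and the `words_needed()` helper are illustrative names) and applies the same cumulative-sum logic as the code above.

```r
# Toy word counts, invented for illustration
counts <- c(the = 50, to = 20, and = 15, cat = 10, zebra = 5)

# Sort from most to least frequent, then accumulate the share of all tokens
counts  <- sort(counts, decreasing = TRUE)
cum_cov <- cumsum(counts) / sum(counts)

# Smallest number of word types whose cumulative share reaches a target
words_needed <- function(target) which(cum_cov >= target)[1]

words_needed(0.50)  # 1: "the" alone is 50 % of the tokens
words_needed(0.90)  # 4: the top four words reach 95 %
```

Because word frequencies are so skewed, each extra percentage point of coverage costs far more vocabulary than the last, which is the pattern Table 2 shows for the real sample.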
n_plot <- min(30000, length(cum_cov))
cov_plot_df <- data.frame(
rank = seq_len(n_plot),
coverage = cum_cov[seq_len(n_plot)]
)
ggplot(cov_plot_df, aes(x = rank, y = coverage)) +
geom_line(colour = "#c0392b", linewidth = 1) +
geom_hline(yintercept = c(0.50, 0.90, 0.95),
linetype = "dashed", colour = "grey50") +
annotate("text", x = n_plot * 0.6, y = 0.52, label = "50 % coverage") +
annotate("text", x = n_plot * 0.6, y = 0.92, label = "90 % coverage") +
annotate("text", x = n_plot * 0.6, y = 0.97, label = "95 % coverage") +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = percent_format(accuracy = 1)) +
labs(
title = "Figure 6: Vocabulary Coverage Curve",
x = "Number of Unique Words (ranked by frequency)",
y = "Cumulative % of All Tokens Covered"
) +
theme_minimal(base_size = 12)
| Finding | Implication for the Model |
|---|---|
| The 144 most frequent words account for 50 % of all tokens in the sample | A compact vocabulary can still cover most predictions |
| Twitter lines average about 13 words; blog lines average about 42 words | Sentence-level context will vary by source |
| Many rare words appear only once (hapax legomena) | These can be replaced by an <UNK> token to reduce model size |
| Profanity and non-English words exist in the corpus | A profanity filter and language detection step are needed |
| Bigrams and trigrams show clear, meaningful phrases | N-gram models should yield useful predictions |
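The `<UNK>` replacement mentioned in the table can be sketched in a few lines of base R. The token vector is invented for illustration, and a real pipeline would more likely do this on the quanteda tokens object; this version only shows the idea.

```r
# Toy token vector, invented for illustration
tokens_vec <- c("i", "want", "to", "go", "to", "zanzibar", "i", "want", "tea")

# Count each word type and find those seen exactly once (hapax legomena)
freq  <- table(tokens_vec)
hapax <- names(freq)[freq == 1]

# Swap every hapax for a single placeholder token
tokens_unk <- ifelse(tokens_vec %in% hapax, "<UNK>", tokens_vec)

length(unique(tokens_vec))  # 6 word types before
length(unique(tokens_unk))  # 4 word types after
```

The vocabulary shrinks while the token count stays the same, so n-gram tables built afterwards have fewer distinct rows to store.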
The prediction model will follow a Stupid Backoff (or optionally Kneser-Ney smoothing) approach using pre-computed n-gram tables:
This approach is fast (simple table look-up), interpretable, and well-suited to a lightweight Shiny deployment.
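As a rough illustration of how Stupid Backoff scores candidates, here is a minimal base-R sketch on a three-sentence toy corpus. Everything in it is an assumption made for illustration: the corpus, the function names (`ngram_counts`, `sb_score`, `predict_next`), and the back-off factor of 0.4, which is the value commonly used with this method. The final model would presumably read pre-computed quanteda n-gram tables instead.

```r
# Toy training text, invented for illustration
corpus_txt <- c("i want to go home", "i want to eat", "you want to go")

# Count all n-grams of order n; returns a named integer vector
ngram_counts <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.vector(tab), names(tab))
}

uni <- ngram_counts(corpus_txt, 1)
bi  <- ngram_counts(corpus_txt, 2)
tri <- ngram_counts(corpus_txt, 3)

# Look up a count, treating unseen n-grams as zero
count_of <- function(tab, key) if (key %in% names(tab)) tab[[key]] else 0

# Stupid Backoff score of `word` given the preceding words `ctx`
sb_score <- function(word, ctx, alpha = 0.4) {
  ctx <- tail(ctx, 2)  # trigram model: use at most two words of context
  if (length(ctx) == 0) return(count_of(uni, word) / sum(uni))
  tab_hi <- if (length(ctx) == 2) tri else bi
  tab_lo <- if (length(ctx) == 2) bi else uni
  num <- count_of(tab_hi, paste(c(ctx, word), collapse = " "))
  den <- count_of(tab_lo, paste(ctx, collapse = " "))
  if (num > 0) return(num / den)
  alpha * sb_score(word, ctx[-1], alpha)  # back off to a shorter context
}

# Score every known word and return the k best candidates
predict_next <- function(text, k = 3) {
  ctx <- strsplit(tolower(text), " ", fixed = TRUE)[[1]]
  scores <- vapply(names(uni), sb_score, numeric(1), ctx = ctx)
  names(sort(scores, decreasing = TRUE))[seq_len(min(k, length(scores)))]
}

predict_next("i want to")  # "go" and "eat" rank first and second
```

In this sketch "go" scores 2/3 and "eat" 1/3 directly from the trigram table, while words never seen after "want to" fall back to shorter contexts and are discounted by 0.4 at each step. The scores are relative rankings rather than true probabilities, which is what keeps the method cheap.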
The final deliverable will be a Shiny web application with:
| Feature | Description |
|---|---|
| Text input box | User types a sentence; predictions update in real time |
| Next-word suggestions | Top 3 predicted words shown as clickable buttons |
| One-click insertion | Clicking a suggestion appends it to the input |
| Source selector (optional) | Filter predictions by blogs / news / Twitter style |
| About tab | Brief explanation of the model for non-technical users |
The app will be deployed on shinyapps.io so that it is accessible in any browser without installation.
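A skeleton of such an app might look like the sketch below. It is only an outline under stated assumptions: `predict_stub()` is a hypothetical placeholder that always returns the same three words, standing in for the real model, and the input IDs (`txt`, `s1` to `s3`) are invented here.

```r
library(shiny)

# Hypothetical placeholder for the real model: always suggests the same words
predict_stub <- function(text) c("the", "to", "and")

ui <- fluidPage(
  textInput("txt", "Type a sentence"),
  actionButton("s1", ""),
  actionButton("s2", ""),
  actionButton("s3", "")
)

server <- function(input, output, session) {
  # Recompute the suggestions whenever the text changes
  suggestions <- reactive(predict_stub(input$txt))

  # Show the current suggestions as the button labels
  observe({
    s <- suggestions()
    for (i in 1:3) updateActionButton(session, paste0("s", i), label = s[i])
  })

  # Clicking a button appends its word to the text box
  lapply(1:3, function(i) {
    observeEvent(input[[paste0("s", i)]], {
      updateTextInput(session, "txt",
                      value = trimws(paste(input$txt, suggestions()[i])))
    })
  })
}

# Uncomment to launch the app locally:
# shinyApp(ui, server)
```

Keeping the predictor behind a single function means the placeholder can later be swapped for the n-gram model without touching the interface code.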
This report has demonstrated that:
The next steps are to build the full n-gram model, tune it for speed and accuracy, and wrap it in a polished Shiny interface.
Report prepared for the Johns Hopkins / Coursera Data Science Specialization Capstone - Week 2 Milestone.