This report is an early checkpoint in a project to build a predictive text application, similar to the keyboard suggestions you see on a smartphone. The end goal is a Shiny web app where a user types a phrase and the app predicts the most likely next word.
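This document covers three things, in plain terms: confirming that the data loaded correctly, summarizing what each text source looks like, and outlining the plan for the prediction model and app.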
No predictive model is built yet at this stage — this is purely about understanding the raw material we’ll be working with.
The data comes from a corpus of text collected from three everyday sources: blogs, news articles, and Twitter posts. We’re working with the English (US) subset. The code below downloads the data once (skipping the download if it’s already present) and unzips it.
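# Packages used throughout this report
library(stringi)  # stri_count_words(), stri_split_boundaries()
library(ggplot2)  # plots
library(knitr)    # kable()
data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"
data_dir <- "final"
# Download the archive only if it is not already present
if (!file.exists(zip_file)) {
  download.file(data_url, destfile = zip_file, mode = "wb")
}
# Unzip only if the extracted folder does not exist yet
if (!dir.exists(data_dir)) {
  unzip(zip_file)
}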
blogs_path <- file.path(data_dir, "en_US", "en_US.blogs.txt")
news_path <- file.path(data_dir, "en_US", "en_US.news.txt")
twitter_path <- file.path(data_dir, "en_US", "en_US.twitter.txt")
blogs <- readLines(blogs_path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
news <- readLines(news_path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
twitter <- readLines(twitter_path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
All three files have now been read into R as character vectors, one element per line of text. The first part of the analysis confirms this loaded correctly.
The table below shows, for each source: the file size, the number of lines, the total word count, and the average number of words per line.
file_size_mb <- function(path) round(file.info(path)$size / 1024^2, 1)
word_count <- function(x) sum(stri_count_words(x))
summary_df <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  `File Size (MB)` = c(file_size_mb(blogs_path), file_size_mb(news_path), file_size_mb(twitter_path)),
  `Lines` = c(length(blogs), length(news), length(twitter)),
  `Total Words` = c(word_count(blogs), word_count(news), word_count(twitter)),
  check.names = FALSE
)
summary_df$`Avg Words / Line` <- round(summary_df$`Total Words` / summary_df$Lines, 1)
kable(summary_df, caption = "Summary statistics for the three English (US) text sources")
| Source | File Size (MB) | Lines | Total Words | Avg Words / Line |
|---|---|---|---|---|
| Blogs | 200.4 | 899288 | 37546806 | 41.8 |
| News | 196.3 | 1010206 | 34761151 | 34.4 |
| Twitter | 159.4 | 2360148 | 30096690 | 12.8 |
What this tells us, in plain terms: all three files are large. Each contains roughly 0.9 to 2.4 million lines and 30 to 38 million words, for just over 100 million words combined. Twitter posts are short by design (the old 140-character limit shows up clearly in the low words-per-line figure), while blog and news entries run much longer per line on average.
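Words per line is only an indirect view of the 140-character limit; characters per line is a more direct check. The sketch below uses base R's `nchar()` on a small made-up vector. The helper name `line_length_summary` and the toy lines are illustrative assumptions, not part of the original analysis; the same function could be applied to `blogs`, `news`, and `twitter`.

```r
# Hypothetical helper: basic per-line character statistics for one source.
line_length_summary <- function(x) {
  # allowNA = TRUE returns NA instead of failing on badly encoded lines
  chars <- nchar(x, type = "chars", allowNA = TRUE)
  c(lines       = length(x),
    empty_lines = sum(chars == 0, na.rm = TRUE),
    max_chars   = max(chars, na.rm = TRUE))
}

# Toy data standing in for a real source (assumption: not taken from the corpus)
toy_lines <- c("short tweet", "", "a somewhat longer line of blog text")
line_length_summary(toy_lines)
```

If the Twitter file behaves as described above, `max_chars` for `twitter` would be expected to come out at or close to 140, while blogs and news should show far longer maximum lines.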
Because averages can hide a lot of variation, it helps to look at the full distribution of how many words appear per line in each source.
set.seed(1234)
sample_size <- 20000
sample_lengths <- function(x, n) {
  s <- sample(x, min(n, length(x)))
  stri_count_words(s)
}
plot_df <- data.frame(
  Words = c(sample_lengths(blogs, sample_size),
            sample_lengths(news, sample_size),
            sample_lengths(twitter, sample_size)),
  Source = factor(c(rep("Blogs", min(sample_size, length(blogs))),
                    rep("News", min(sample_size, length(news))),
                    rep("Twitter", min(sample_size, length(twitter)))),
                  levels = c("Blogs", "News", "Twitter"))
)
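ggplot(plot_df, aes(x = Words, fill = Source)) +
  geom_histogram(binwidth = 5, alpha = 0.7, position = "identity") +
  facet_wrap(~ Source, scales = "free_y") +
  # Zoom with coord_cartesian() rather than xlim(): xlim() discards out-of-range
  # values (and bars that cross the limits, such as the first bin) instead of zooming
  coord_cartesian(xlim = c(0, 100)) +
  labs(title = "Words per line by source (random sample)",
       x = "Words in line", y = "Number of lines") +
  theme_minimal() +
  theme(legend.position = "none")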
Twitter is tightly clustered at low word counts, while blogs and news show a much wider, longer-tailed spread — consistent with longer-form writing.
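A few quantiles can put numbers on what the histogram shows. The following is a minimal sketch under stated assumptions: `words_per_line` and `spread_summary` are hypothetical helpers, the input is toy data, and words are counted by splitting on whitespace in base R, which can differ slightly from `stri_count_words()` on real text.

```r
# Whitespace-based word count per line (assumption: a simple base-R stand-in
# for stri_count_words(), so counts can differ slightly on real text)
words_per_line <- function(x) lengths(strsplit(trimws(x), "\\s+"))

# Hypothetical helper: median, 90th percentile and maximum words per line
spread_summary <- function(x) {
  n <- words_per_line(x)
  c(median = median(n),
    p90    = unname(quantile(n, 0.9)),
    max    = max(n))
}

# Toy data (assumption: invented for illustration)
toy_tweets <- c("good morning", "see you soon", "thanks for the follow")
spread_summary(toy_tweets)
```

Applied to each source, a large gap between the median and the 90th percentile would correspond to the long right tail seen for blogs and news.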
These files are large enough that processing the full text for every step of exploration would be slow and isn’t necessary for understanding the data’s structure. So for the rest of this exploratory analysis (and for early model-building), we draw a random sample of lines from each source and combine them into one working dataset. The final model will later be validated against a larger held-out portion of the data.
set.seed(2024)
sample_pct <- 0.02 # 2% sample for exploration speed
sample_lines <- function(x, pct) {
  x[sample(seq_along(x), size = floor(length(x) * pct))]
}
sample_text <- c(
  sample_lines(blogs, sample_pct),
  sample_lines(news, sample_pct),
  sample_lines(twitter, sample_pct)
)
# Basic cleaning: lowercase, strip URLs, keep only a-z, apostrophes and spaces, collapse whitespace
clean_text <- sample_text
clean_text <- tolower(clean_text)
clean_text <- gsub("http[[:alnum:][:punct:]]*", " ", clean_text)  # URLs
clean_text <- gsub("[^a-z' ]", " ", clean_text)                    # punctuation, digits, non-ASCII characters
clean_text <- gsub("\\s+", " ", clean_text)                        # runs of whitespace
clean_text <- trimws(clean_text)
This gives us a manageable working sample of 85,391 lines of text, cleaned of URLs, punctuation, and numbers so that word patterns can be analyzed clearly.
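To see what the cleaning steps do to an individual line, the sketch below wraps the same four substitutions in a function and applies it to two made-up strings. The wrapper name `clean_line` and the example strings are assumptions for illustration only.

```r
# Hypothetical wrapper around the cleaning steps used in this report
clean_line <- function(x) {
  x <- tolower(x)
  x <- gsub("http[[:alnum:][:punct:]]*", " ", x)  # drop URLs
  x <- gsub("[^a-z' ]", " ", x)                    # keep only a-z, ' and spaces
  x <- gsub("\\s+", " ", x)                        # collapse whitespace
  trimws(x)
}

clean_line("Check THIS out: http://example.com/a?b=1 -- I can't wait, 2 days!!")
# "check this out i can't wait days"

# A typographic apostrophe (U+2019) is not in the allowed set, so it becomes a space
clean_line("don\u2019t stop")
# "don t stop"
```

The second example shows a side effect worth keeping in mind: contractions typed with a curly apostrophe are split into two tokens. If that turns out to matter for prediction, one option would be to map U+2019 to a plain apostrophe before cleaning; that is a suggestion, not something the current pipeline does.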
words <- unlist(stri_split_boundaries(clean_text, type = "word", skip_word_none = TRUE))
words <- words[words != "" & !grepl("^\\s*$", words)]
word_freq <- as.data.frame(table(words), stringsAsFactors = FALSE)
names(word_freq) <- c("word", "freq")
word_freq <- word_freq[order(-word_freq$freq), ]
top20 <- head(word_freq, 20)
ggplot(top20, aes(x = reorder(word, freq), y = freq)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "20 most frequent single words", x = "", y = "Frequency") +
  theme_minimal()
The most frequent words are common function words such as “the,” “to,” and “and.” This is normal for any large body of natural English text, and it serves as a useful sanity check on the loading and cleaning steps.
Looking at single words only tells part of the story. To eventually predict the next word, we need to understand which words tend to follow other words. We look next at pairs (bigrams) and triples (trigrams) of consecutive words.
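# Build all n-grams of size n from each cleaned line (n-grams never span two lines)
get_ngrams <- function(text, n) {
  tokens_list <- stri_split_boundaries(text, type = "word", skip_word_none = TRUE)
  ngram_list <- lapply(tokens_list, function(tok) {
    tok <- tok[tok != ""]
    # Lines shorter than n words contribute no n-grams
    if (length(tok) < n) return(character(0))
    # Slide a window of n tokens across the line
    vapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  unlist(ngram_list)
}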
bigrams <- get_ngrams(clean_text, 2)
trigrams <- get_ngrams(clean_text, 3)
bigram_freq <- sort(table(bigrams), decreasing = TRUE)[1:20]
trigram_freq <- sort(table(trigrams), decreasing = TRUE)[1:20]
bigram_df <- data.frame(ngram = names(bigram_freq), freq = as.numeric(bigram_freq))
trigram_df <- data.frame(ngram = names(trigram_freq), freq = as.numeric(trigram_freq))
ggplot(bigram_df, aes(x = reorder(ngram, freq), y = freq)) +
  geom_col(fill = "darkorange") +
  coord_flip() +
  labs(title = "20 most frequent two-word phrases (bigrams)", x = "", y = "Frequency") +
  theme_minimal()
ggplot(trigram_df, aes(x = reorder(ngram, freq), y = freq)) +
  geom_col(fill = "seagreen") +
  coord_flip() +
  labs(title = "20 most frequent three-word phrases (trigrams)", x = "", y = "Frequency") +
  theme_minimal()
Key takeaway: bigrams and trigrams already reveal natural phrase patterns (“a lot of,” “one of the,” “thanks for the”). This is exactly the kind of pattern a next-word prediction model relies on — given the last word or two someone has typed, certain following words are far more likely than others.
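To make the sliding-window idea concrete, here is a small base-R sketch applied to a single toy sentence. `toy_ngrams` is a hypothetical helper that splits on whitespace, which is a simplification of the `stringi` word-boundary splitting used in the report.

```r
# Hypothetical helper: all n-grams of size n from one line, split on whitespace
toy_ngrams <- function(line, n) {
  tok <- strsplit(trimws(line), "\\s+")[[1]]
  tok <- tok[nzchar(tok)]
  if (length(tok) < n) return(character(0))  # line too short for any n-gram
  vapply(seq_len(length(tok) - n + 1),
         function(i) paste(tok[i:(i + n - 1)], collapse = " "),
         character(1))
}

toy_ngrams("thanks for the follow", 2)
# "thanks for" "for the" "the follow"

toy_ngrams("thanks for the follow", 3)
# "thanks for the" "for the follow"
```

A line of k words yields k - n + 1 n-grams, so a four-word line gives three bigrams and two trigrams, and a line shorter than n words gives none.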
One practical question for building an app is: how many unique words do we actually need to “know” to cover most of what people write? Covering 100% of all words ever used would require an enormous dictionary, but covering the majority of everyday usage requires far fewer.
word_freq <- word_freq[order(-word_freq$freq), ]
cum_coverage <- cumsum(word_freq$freq) / sum(word_freq$freq)
coverage_df <- data.frame(
  n_words = seq_along(cum_coverage),
  coverage = cum_coverage
)
# Find how many unique words are needed for 50% and 90% coverage
n50 <- which(cum_coverage >= 0.50)[1]
n90 <- which(cum_coverage >= 0.90)[1]
ggplot(coverage_df[1:min(10000, nrow(coverage_df)), ], aes(x = n_words, y = coverage)) +
  geom_line(color = "purple", linewidth = 1) +
  geom_hline(yintercept = c(0.5, 0.9), linetype = "dashed", color = "grey40") +
  labs(title = "Cumulative word coverage vs. vocabulary size",
       x = "Number of unique words (most frequent first)",
       y = "Fraction of all word instances covered") +
  theme_minimal()
In this sample, just 141 unique words cover 50% of all the words people actually use, and 6,998 unique words cover 90%. This is a reassuring sign: it means the prediction model doesn’t need to store every word in the English language — a relatively compact dictionary, focused on the most common words and phrases, will cover the overwhelming majority of real usage.
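The coverage figures come from a simple calculation: sort words by frequency, take the running total, and divide by the overall total. The worked example below repeats that arithmetic on five invented word counts so each step can be checked by hand. `words_needed` and `toy_freq` are illustrative names, not part of the original code.

```r
# Hypothetical helper: smallest number of top words whose counts reach `target` coverage
words_needed <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  coverage <- cumsum(freq) / sum(freq)  # running share of all word instances
  unname(which(coverage >= target)[1])
}

# Toy counts (assumption: invented; 100 word instances in total)
toy_freq <- c(the = 50, to = 20, and = 15, of = 10, cat = 5)

words_needed(toy_freq, 0.50)  # 1: "the" alone accounts for 50 of 100 instances
words_needed(toy_freq, 0.90)  # 4: 50 + 20 + 15 + 10 = 95 of 100 instances
```

The same skewed shape, with a handful of words carrying most of the mass, is what lets 141 words reach 50% coverage in the real sample.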
In plain, non-technical terms, here is the plan going forward:
Build an n-gram model. Using the word-pair (bigram) and word-triple (trigram) patterns explored above, the model will look at the last one or two words a user has typed and predict the most likely next word, based on what showed up most often in this training data.
Handle the “unseen phrase” problem. Not every possible phrase will appear in the training data. When the model encounters a phrase it hasn’t seen, it will “back off” — first trying to match on the last two words, then just the last word, and finally falling back to the most common words overall if nothing else matches.
Balance accuracy with speed and size. A model that stores every single phrase ever seen would be very large and slow. Based on the coverage analysis above, the model will be trimmed to focus on the most frequent and useful word patterns, keeping the app fast and responsive without sacrificing much accuracy.
Build the Shiny app. The final deliverable will be a simple web page with a text box. As the user types, the app will show its best guess(es) for the next word, updating in real time — much like the predictive text feature on a phone keyboard.
Test and refine. The model will be checked against text it hasn’t seen before (held out from training) to make sure it generalizes well and isn’t just memorizing the sample.
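The back-off step in the plan above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the final model: the corpus is five invented lines, tokenization is a plain whitespace split, and every function name (`tokenize`, `count_ngrams`, `best_after`, `predict_next`) is hypothetical. It picks the single most frequent continuation at each level and applies no smoothing or pruning.

```r
# Toy training lines (assumption: invented for illustration, not from the corpus)
toy_corpus <- c(
  "thanks for the follow",
  "thanks for the follow",
  "thanks for the support",
  "one of the best",
  "a lot of fun"
)

# Lowercase and split on whitespace (a simplification of the report's tokenization)
tokenize <- function(line) {
  tok <- strsplit(trimws(tolower(line)), "\\s+")[[1]]
  tok[nzchar(tok)]
}

# Count every n-gram of size n across all lines
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    tok <- tokenize(line)
    if (length(tok) < n) return(character(0))
    vapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  table(grams)
}

# Last word of the most frequent n-gram that starts with `prefix`, or NA if none
best_after <- function(counts, prefix) {
  if (length(counts) == 0) return(NA_character_)
  hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  sub(".* ", "", names(hits)[which.max(hits)])
}

# Try the last two words, then the last word, then the most common word overall
predict_next <- function(phrase, uni, bi, tri) {
  tok <- tokenize(phrase)
  k <- length(tok)
  guess <- NA_character_
  if (k >= 2) guess <- best_after(tri, paste(tok[k - 1], tok[k]))
  if (is.na(guess) && k >= 1) guess <- best_after(bi, tok[k])
  if (is.na(guess)) guess <- names(uni)[which.max(uni)]
  guess
}

uni <- count_ngrams(toy_corpus, 1)
bi  <- count_ngrams(toy_corpus, 2)
tri <- count_ngrams(toy_corpus, 3)

predict_next("thanks for", uni, bi, tri)   # "the"    (trigram match)
predict_next("enjoy the", uni, bi, tri)    # "follow" (backs off to the bigram table)
predict_next("hello world", uni, bi, tri)  # "the"    (backs off to the most common word)
```

In the real app the lookup tables would be built once from the training sample and trimmed to the most frequent entries, so that a prediction is a table lookup rather than a scan of the corpus.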
Feedback on this approach is welcome before proceeding to model-building.