The goal of this project is to build a predictive text model, similar to the one used in SwiftKey smart keyboards. When a user types a word, the algorithm will predict the most likely next word.
This milestone report outlines the initial exploratory data analysis (EDA) of the English text corpora provided for training: blogs, news, and Twitter feeds. The objectives are to understand the distribution of words and the frequency of phrases (n-grams), and to lay out a clear, non-technical roadmap for the final predictive application.
The raw data consists of three large text files (en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt). Because each file is hundreds of megabytes in size, processing them in their entirety is inefficient and unnecessary for initial exploration.
Instead, a systematic sampling approach is used: every 100th line of each file is kept, which gives a 1% sample spread evenly across the whole file. The sample is then cleaned by lowercasing the text and removing punctuation, symbols, numbers, URLs, and standard English stop words, so that the content-bearing vocabulary stands out.
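# Load libraries and download the data if it is not already present
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
pacman::p_load(data.table, quanteda, ggplot2)
if (!file.exists("Coursera-SwiftKey.zip")) {
  options(timeout = 600)  # allow up to 10 minutes for the large download
  download.file(
    url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
    destfile = "Coursera-SwiftKey.zip",
    mode = "wb"  # write binary, so the archive is not altered on Windows
  )
  unzip("Coursera-SwiftKey.zip")
}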
# Define paths
en_path <- "./final/en_US"
files <- list.files(en_path, full.names = TRUE)
names(files) <- basename(files)
# Systematic sampling function: keep every `step`-th line (every 100th at a 1% rate)
sample_systematic <- function(file_path, sample_rate = 0.01, chunk_size = 50000) {
  step <- floor(1 / sample_rate)
  con <- file(file_path, "rb")  # binary mode, read in chunks to limit memory use
  on.exit(close(con))
  sampled <- list()  # one character vector per chunk, combined at the end
  total_read <- 0
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE, encoding = "UTF-8")
    if (length(chunk) == 0) break
    # Line numbers of this chunk within the whole file (1-based)
    line_no <- total_read + seq_along(chunk)
    sampled[[length(sampled) + 1]] <- chunk[line_no %% step == 0]
    total_read <- total_read + length(chunk)
  }
  list(lines = as.character(unlist(sampled)), total_lines = total_read)
}
# Extract and process data
results <- list()
for (f in files) {
  file_name <- basename(f)
  sample_data <- sample_systematic(f, sample_rate = 0.01)
  # Tokenize with quanteda, lowercase, and drop English stop words
  corp <- corpus(sample_data$lines)
  toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE,
                 remove_numbers = TRUE, remove_url = TRUE) |>
    tokens_tolower() |>
    tokens_select(pattern = stopwords("en"), selection = "remove")
  # Flatten the tokens of all sampled lines into one character vector
  words <- as.character(toks)
  freq_table <- sort(table(words), decreasing = TRUE)
  results[[file_name]] <- list(
    total_lines_original = sample_data$total_lines,
    sample_lines = length(sample_data$lines),
    total_words_sample = length(words),
    frequencies = freq_table
  )
}
Before building a model, it is crucial to understand the size and scope of our data. The table below displays the total lines in the original files, the number of lines in our 1% sample, the number of words remaining in that sample after cleaning, and the most frequent remaining word in each source.
| Source File | Total Lines (Raw) | Lines (1% Sample) | Word Count (Sample) | Top Word |
|---|---|---|---|---|
| en_US.blogs.txt | 899288 | 8992 | 189707 | one |
| en_US.news.txt | 1010242 | 10102 | 195776 | said |
| en_US.twitter.txt | 2360148 | 23601 | 167635 | just |
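
A summary table like the one above could be assembled from the `results` list along the following lines. This is a minimal sketch and not necessarily the code used to render the report: the helper name `summarise_results`, the extra `words_per_line` column, and the toy `demo_results` values are all hypothetical and only mirror the structure built in the processing loop.

```r
# Hypothetical helper: flatten a `results`-style list into one row per source file.
summarise_results <- function(results) {
  data.frame(
    source_file  = names(results),
    total_lines  = vapply(results, function(r) as.numeric(r$total_lines_original), numeric(1)),
    sample_lines = vapply(results, function(r) as.numeric(r$sample_lines), numeric(1)),
    sample_words = vapply(results, function(r) as.numeric(r$total_words_sample), numeric(1)),
    # `frequencies` is sorted in decreasing order, so its first name is the top word
    top_word     = vapply(results, function(r) names(r$frequencies)[1], character(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy input with made-up numbers (NOT the real corpus statistics)
demo_results <- list(
  "a.txt" = list(
    total_lines_original = 1000, sample_lines = 10, total_words_sample = 50,
    frequencies = sort(table(c("one", "one", "two")), decreasing = TRUE)
  ),
  "b.txt" = list(
    total_lines_original = 2000, sample_lines = 20, total_words_sample = 60,
    frequencies = sort(table(c("said", "said", "said", "new")), decreasing = TRUE)
  )
)

summary_df <- summarise_results(demo_results)
# Average number of cleaned words per sampled line, a rough measure of entry length
summary_df$words_per_line <- round(summary_df$sample_words / summary_df$sample_lines, 1)
print(summary_df)
```

Applied to the real `results` object, the same helper would return one row per source file, which can then be passed to a table-rendering function of choice.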
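Counting the words reveals differences in vocabulary across the three sources. The most frequent word after cleaning is "one" in blogs, "said" in news (consistent with reported speech), and "just" in Twitter, where the language tends to be more conversational. Twitter entries are also much shorter: the sample averages about 7 words per line after cleaning, compared with about 21 for blogs and 19 for news.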
Single-word frequencies alone are not enough to predict the next word. We must also look at word pairs (bigrams) and triplets (trigrams). The algorithm will use the frequency of these word combinations to guess what the user will type next.
(Note: in the final model, stop words will be retained, as they are essential for natural sentence formation.)
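
To make the idea concrete, the sketch below counts bigrams and trigrams in a toy sentence and looks up the most frequent continuation of a given prefix. It is an illustration in base R under simplifying assumptions, not the final model: the helpers `make_ngrams` and `predict_next` and the text in `toy_text` are hypothetical, stop words are kept as noted above, and the prediction is a plain most-frequent-match lookup. On the real corpus, quanteda's `tokens_ngrams()` could generate the n-grams instead of the hand-written helper.

```r
# Hypothetical helper: build all n-grams (as space-separated strings) from a word vector.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "), character(1))
}

# Hypothetical helper: return the word that most often follows `prefix`
# in a table of n-gram counts, or NA if the prefix was never observed.
predict_next <- function(prefix, ngram_counts) {
  # The trailing space ensures that only whole-word prefixes match
  hits <- ngram_counts[startsWith(names(ngram_counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  best <- names(hits)[which.max(hits)]
  sub("^.* ", "", best)  # keep only the last word of the winning n-gram
}

# Toy text (made up for illustration); stop words are deliberately retained
toy_text <- "i want to go home i want to sleep i want a coffee"
words <- strsplit(tolower(toy_text), " +")[[1]]

bigram_counts  <- table(make_ngrams(words, 2))
trigram_counts <- table(make_ngrams(words, 3))

print(sort(bigram_counts, decreasing = TRUE))
print(predict_next("want", bigram_counts))      # continuation after one word
print(predict_next("i want", trigram_counts))   # continuation after two words
```

In this toy text, "want to" occurs twice and "want a" once, so both lookups return "to". A prefix that never appears returns `NA`, which is the case the final application will need a fallback for.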
Moving forward, the strategy to build the predictive text product involves the following steps: