This report outlines the exploratory data analysis of the SwiftKey dataset provided for the Data Science Capstone project. The goal of this milestone is to demonstrate successful data loading, provide basic summary statistics (word and line counts), explore the frequencies of single words and short word sequences (n-grams), and outline a plan for building a predictive text algorithm and Shiny application. The analysis is written to be easily understood by non-technical stakeholders.
The dataset consists of three text files sourced from US English blogs, news sites, and Twitter. We first load the data and calculate basic statistics including file size, total lines, and total words.
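# Load the packages used by the code in this report
library(stringi)   # stri_count_words()
library(knitr)     # kable()
library(dplyr)     # %>%, count(), tibble()
library(tidytext)  # unnest_tokens()
library(ggplot2)   # plots
# Define file paths (data_dir is assumed to be set beforehand to the folder holding the en_US files)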
path_blogs <- file.path(data_dir, "en_US.blogs.txt")
path_news <- file.path(data_dir, "en_US.news.txt")
path_twitter <- file.path(data_dir, "en_US.twitter.txt")
# Read the text lines
blogs <- readLines(path_blogs, encoding = "UTF-8", skipNul = TRUE)
news <- readLines(path_news, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(path_twitter, encoding = "UTF-8", skipNul = TRUE)
# Calculate File Sizes (in Megabytes)
size_blogs <- file.info(path_blogs)$size / 1024^2
size_news <- file.info(path_news)$size / 1024^2
size_twitter <- file.info(path_twitter)$size / 1024^2
# Calculate Word Counts using stringi for speed
words_blogs <- sum(stri_count_words(blogs))
words_news <- sum(stri_count_words(news))
words_twitter <- sum(stri_count_words(twitter))
# Create a summary table
summary_table <- data.frame(
Source = c("Blogs", "News", "Twitter"),
File_Size_MB = round(c(size_blogs, size_news, size_twitter), 2),
Line_Count = c(length(blogs), length(news), length(twitter)),
Word_Count = c(words_blogs, words_news, words_twitter)
)
kable(summary_table, format.args = list(big.mark = ","),
caption = "Table 1: Basic Data Summary of the SwiftKey Corpora")
| Source | File_Size_MB | Line_Count | Word_Count |
|---|---|---|---|
| Blogs | 200.42 | 899,288 | 37,546,250 |
| News | 196.28 | 1,010,242 | 34,762,395 |
| Twitter | 159.36 | 2,360,148 | 30,093,413 |

Key Finding: The Twitter file contains the most lines, which is expected due to character limits on the platform. However, the Blogs file contains the highest overall word count, indicating longer, more complex sentence structures.
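One way to make this finding concrete is to divide each word count by its line count. The sketch below does that with the figures copied from Table 1. The names `line_stats` and `Words_Per_Line` are introduced here for illustration only. A line is a whole blog entry, news excerpt, or tweet, so this is a rough measure of entry length rather than sentence length.

```r
# Figures copied from Table 1 (line_stats is an illustrative name)
line_stats <- data.frame(
  Source     = c("Blogs", "News", "Twitter"),
  Line_Count = c(899288, 1010242, 2360148),
  Word_Count = c(37546250, 34762395, 30093413)
)

# Average number of words per line, rounded to one decimal place
line_stats$Words_Per_Line <- round(line_stats$Word_Count / line_stats$Line_Count, 1)

print(line_stats)
```

This works out to roughly 42 words per line for blogs, 34 for news, and 13 for Twitter.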
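Because the raw data contains more than 100 million words in total, processing the entire dataset is computationally expensive. To conduct our exploratory analysis efficiently, we take a random 1% sample of the lines in each file. In the code shown in this report, the text is converted to lowercase and stripped of punctuation during tokenization, which is the default behaviour of `unnest_tokens`. Removing numbers and other special characters is not covered by those defaults and requires an additional cleaning step.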
set.seed(12345) # Set seed for reproducibility
# Take a 1% sample of each file
sample_pct <- 0.01
sample_data <- c(
sample(blogs, length(blogs) * sample_pct),
sample(news, length(news) * sample_pct),
sample(twitter, length(twitter) * sample_pct)
)
# Convert to a data frame for tidytext processing
text_df <- tibble(line = seq_along(sample_data), text = sample_data)
# Free the memory held by the full corpora (invisible() suppresses the gc() report)
rm(blogs, news, twitter)
invisible(gc())
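If an explicit cleaning pass is wanted before tokenization, it could look roughly like the sketch below. This is an assumption about how such a step might be written, not the code used to produce the figures in this report. `clean_text` is a hypothetical helper that uses base R only. It keeps lowercase ASCII letters, apostrophes, and spaces, so accented characters and emoji are dropped along with digits and punctuation.

```r
# Hypothetical cleaning helper (not part of the original analysis)
clean_text <- function(x) {
  x <- tolower(x)                       # lowercase everything
  x <- gsub("[\u2018\u2019]", "'", x)   # straighten curly apostrophes
  x <- gsub("[^a-z' ]", " ", x)         # replace digits, punctuation, symbols, non-ASCII
  x <- gsub(" +", " ", x)               # collapse repeated spaces
  trimws(x)                             # trim leading and trailing spaces
}

clean_text(c("Hello, World! 123", "Don\u2019t STOP 4 me!"))
# "hello world" "don't stop me"
```

Keeping apostrophes preserves contractions such as "don't", which matter for next-word prediction.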
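To build a predictive text model, we need to understand how often words and short word sequences occur. We therefore break the text down into “n-grams”: contiguous sequences of n words. A unigram is a single word, a bigram is a pair of adjacent words, and a trigram is a run of three adjacent words.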
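As a small illustration of what an n-gram is, the sketch below splits one sentence into unigrams, bigrams, and trigrams using base R. `make_ngrams` is a hypothetical helper written for this example. The analysis itself relies on `unnest_tokens`, which also handles punctuation.

```r
# Hypothetical helper: split a sentence into overlapping n-word sequences
make_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), " +")[[1]]
  words <- words[nzchar(words)]                 # drop empty pieces
  if (length(words) < n) return(character(0))   # sentence too short for this n
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

sentence <- "Thanks for the follow"
make_ngrams(sentence, 1)  # "thanks" "for" "the" "follow"
make_ngrams(sentence, 2)  # "thanks for" "for the" "the follow"
make_ngrams(sentence, 3)  # "thanks for the" "for the follow"
```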
# Count single words (unnest_tokens lowercases and strips punctuation by default)
unigrams <- text_df %>%
unnest_tokens(word, text) %>%
count(word, sort = TRUE)
# Plot the top 15 Unigrams
unigrams %>%
top_n(15, n) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_col(fill = "#2c7fb8") +
coord_flip() +
labs(title = "Top 15 Most Frequent Words (Unigrams)",
x = "Word", y = "Frequency Count") +
theme_minimal()
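# Count word pairs; lines with fewer than two words yield NA and are dropped
bigrams <- text_df %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
filter(!is.na(bigram)) %>%
count(bigram, sort = TRUE)
# Plot the top 15 Bigrams
bigrams %>%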
top_n(15, n) %>%
mutate(bigram = reorder(bigram, n)) %>%
ggplot(aes(x = bigram, y = n)) +
geom_col(fill = "#238b45") +
coord_flip() +
labs(title = "Top 15 Most Frequent Word Pairs (Bigrams)",
x = "Bigram", y = "Frequency Count") +
theme_minimal()
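# Count three-word sequences; lines with fewer than three words yield NA and are dropped
trigrams <- text_df %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
filter(!is.na(trigram)) %>%
count(trigram, sort = TRUE)
# Plot the top 15 Trigrams
trigrams %>%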
top_n(15, n) %>%
mutate(trigram = reorder(trigram, n)) %>%
ggplot(aes(x = trigram, y = n)) +
geom_col(fill = "#d95f02") +
coord_flip() +
labs(title = "Top 15 Most Frequent 3-Word Phrases (Trigrams)",
x = "Trigram", y = "Frequency Count") +
theme_minimal()
Key Finding: The most frequent words and phrases are overwhelmingly “stop words” (e.g., “the”, “and”, “of”, “in the”). While these are often removed in standard text mining, we must keep them for this project because a text prediction app needs to accurately predict standard grammar, including transition words.
Based on this exploratory analysis, the strategy for the final data product is as follows:
N-Gram Frequency Matrices: We will build larger, cleaned frequency tables of 2-gram, 3-gram, and 4-gram sequences using a slightly larger sample of the data.
Pruning for Performance: To ensure the Shiny app loads quickly and respects memory limits, we will drop n-grams that only appear a very small number of times (e.g., frequencies <= 2).
The Prediction Algorithm: We will implement either Katz’s back-off model or the simpler “Stupid Backoff” scheme. Once a user has typed at least two words, the algorithm will look up the last two words in the trigram table. If no match is found, it will back off to the bigram table using only the last word, and finally to the unigram table (predicting the most common words, such as “the”) if that word is unknown as well. The same rule extends to the 4-gram table when three or more words are available.
Shiny Application: The final product will be an interactive web app featuring a simple text input box. As the user types, the app will display the three most likely next words in real time, simulating a mobile phone keyboard.
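The pruning and back-off steps in this plan can be sketched with a few toy tables. Everything below is an assumption made for illustration: the counts are made up, and the names `toy_tri`, `toy_bi`, `toy_uni`, `prune`, `top_words`, and `predict_next` are hypothetical. This is a plain lookup-style back-off that returns candidates from the highest-order table with a match. It does not apply the discounting of Katz's model or the scoring factor of Stupid Backoff.

```r
# Toy frequency tables with made-up counts (illustration only).
# 'context' holds the preceding word(s); 'word' is the word that follows.
toy_tri <- data.frame(
  context = c("i love", "i love", "i love", "thanks for"),
  word    = c("you", "it", "cheese", "the"),
  n       = c(40, 25, 1, 60),
  stringsAsFactors = FALSE
)
toy_bi <- data.frame(
  context = c("love", "for", "for"),
  word    = c("you", "the", "a"),
  n       = c(90, 120, 45),
  stringsAsFactors = FALSE
)
toy_uni <- data.frame(
  word = c("the", "to", "and", "a"),
  n    = c(500, 300, 250, 200),
  stringsAsFactors = FALSE
)

# Pruning: drop n-grams seen max_rare times or fewer
prune <- function(tbl, max_rare = 2) tbl[tbl$n > max_rare, , drop = FALSE]
toy_tri <- prune(toy_tri)   # removes "i love cheese" (seen once)

# Return up to k following words for a context, most frequent first
top_words <- function(tbl, ctx, k) {
  hits <- tbl[tbl$context == ctx, , drop = FALSE]
  head(hits$word[order(-hits$n)], k)
}

predict_next <- function(input, k = 3) {
  words <- strsplit(tolower(trimws(input)), " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) >= 2) {   # try the trigram table first
    out <- top_words(toy_tri, paste(tail(words, 2), collapse = " "), k)
    if (length(out) > 0) return(out)
  }
  if (length(words) >= 1) {   # back off to the bigram table
    out <- top_words(toy_bi, tail(words, 1), k)
    if (length(out) > 0) return(out)
  }
  head(toy_uni$word[order(-toy_uni$n)], k)   # last resort: most common words
}

predict_next("I love")       # "you" "it"
predict_next("waiting for")  # "the" "a"
predict_next("xyzzy")        # "the" "to" "and"
```

Splitting each n-gram into a context and a following word keeps every lookup to a simple equality filter, which is one possible way to keep the app responsive.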