The goal of this milestone report is to present an exploratory data analysis of the SwiftKey dataset, which will be used to build a natural language processing (NLP) prediction algorithm and a corresponding Shiny application. This report outlines the data ingestion process, basic summary statistics, exploratory plots, and the proposed plan for the final predictive model. The analysis is presented in a manner accessible to a non-technical audience.
The dataset used for this project is the HC Corpora dataset. It contains text from three different sources (blogs, news, and Twitter) in multiple languages. For this project, we focus exclusively on the English (en_US) datasets.
# Define URL and file paths
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dest_file <- "Coursera-SwiftKey.zip"
# Download and unzip the dataset if not already present
if (!file.exists(dest_file)) {
  download.file(url, destfile = dest_file)
  unzip(dest_file)
}
Due to the large size of the dataset and strict memory constraints, we read only the first 1,000 lines of each text file. This is a convenience sample rather than a random one, so it may not be fully representative of each file, but it allows us to perform exploratory data analysis quickly and efficiently.
# Read a small sample of the datasets to preserve memory (1,000 lines each)
sample_size <- 1000
blogs_file <- file("final/en_US/en_US.blogs.txt", "r")
blogs_data <- readLines(blogs_file, n = sample_size, encoding = "UTF-8", skipNul = TRUE)
close(blogs_file)
news_file <- file("final/en_US/en_US.news.txt", "r")
news_data <- readLines(news_file, n = sample_size, encoding = "UTF-8", skipNul = TRUE)
close(news_file)
twitter_file <- file("final/en_US/en_US.twitter.txt", "r")
twitter_data <- readLines(twitter_file, n = sample_size, encoding = "UTF-8", skipNul = TRUE)
close(twitter_file)
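The reads above take the leading lines of each file. If a uniformly random sample were preferred, one possible approach is sketched below. The helper name `sample_lines`, its `seed` argument, and the temporary demo file are assumptions made for illustration and are not part of the original analysis. Note that this version loads the whole file into memory once, which may not suit a tightly constrained environment.

```r
# Hypothetical helper (not from the original report): draw a uniform random
# sample of n lines from a text file.
sample_lines <- function(path, n, seed = 1234) {
  con <- file(path, "r")
  on.exit(close(con))
  # Reads every line, so the full file must fit in memory once
  all_lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  set.seed(seed)  # fixed seed so the sample is reproducible
  all_lines[sample.int(length(all_lines), min(n, length(all_lines)))]
}

# Self-contained demonstration on a small temporary file
demo_path <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:50), demo_path)
demo_sample <- sample_lines(demo_path, n = 10)
length(demo_sample)
```

For the real corpus, the same call could be pointed at a path such as "final/en_US/en_US.blogs.txt" with `n = sample_size`.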
We analyze the fundamental characteristics of our sampled datasets (1,000 lines each) to understand word distributions, alongside the total file sizes on disk.
library(stringi)
library(knitr)
# Calculate full file sizes in MB on disk
blogs_size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024^2
news_size <- file.info("final/en_US/en_US.news.txt")$size / 1024^2
twitter_size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024^2
# Calculate word counts for our samples
blogs_words <- sum(stri_count_words(blogs_data))
news_words <- sum(stri_count_words(news_data))
twitter_words <- sum(stri_count_words(twitter_data))
# Create a summary table
summary_table <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Total_File_Size_MB = round(c(blogs_size, news_size, twitter_size), 2),
  Sampled_Lines = c(sample_size, sample_size, sample_size),
  Sampled_Word_Count = format(c(blogs_words, news_words, twitter_words), big.mark = ",")
)
kable(summary_table, caption = "Summary Statistics (1,000 Line Sample)")
| Source | Total_File_Size_MB | Sampled_Lines | Sampled_Word_Count |
|---|---|---|---|
| Blogs | 200.42 | 1000 | 42,168 |
| News | 196.28 | 1000 | 34,041 |
| Twitter | 159.36 | 1000 | 12,724 |
Observations:
* The full datasets are quite large on disk (around 160-200 MB each).
* Even within a fixed 1,000 line sample, Blogs contain more than three times as many words as Twitter (42,168 versus 12,724), highlighting the short, constrained nature of tweets compared to long-form blog posts.
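The difference between sources is easier to see as an average number of words per line. The short calculation below reuses the word counts from the summary table; the variable names are chosen here for illustration.

```r
# Word counts copied from the summary table above
sampled_words <- c(Blogs = 42168, News = 34041, Twitter = 12724)
sampled_lines <- 1000

# Average number of words per sampled line, by source
words_per_line <- sampled_words / sampled_lines
round(words_per_line, 1)
```

This gives roughly 42 words per blog line, 34 per news line, and 13 per tweet.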
We combine our samples into a single data frame and use the tidytext package to prepare the text. Its tokenizer, unnest_tokens(), converts text to lowercase and strips punctuation by default, so no separate cleaning step is needed before tokenization.
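As a rough illustration of what lowercasing and punctuation removal do to a line of text, here is a base R sketch. The function `normalize_text` is a hypothetical stand-in written for this example; it only approximates the tokenizer used by tidytext and will differ on edge cases such as abbreviations and URLs.

```r
# Hypothetical helper: an approximation of basic text normalization.
normalize_text <- function(x) {
  x <- tolower(x)
  # Replace everything except letters, digits, whitespace and apostrophes
  x <- gsub("[^[:alnum:][:space:]']", " ", x)
  # Split on runs of whitespace; returns one character vector per input line
  strsplit(trimws(x), "\\s+")
}

normalize_text("Hello, World!")
```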
library(tibble)
library(dplyr)
library(tidyr)
library(tidytext)
library(ggplot2)
combined_sample <- c(blogs_data, news_data, twitter_data)
# Clean up memory
rm(blogs_data, news_data, twitter_data)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2217862 118.5 4388524 234.4 3001088 160.3
## Vcells 3931769 30.0 8388608 64.0 6088782 46.5
# Convert to a dataframe for tidytext
text_df <- tibble(line = seq_along(combined_sample), text = combined_sample)
In Natural Language Processing, an n-gram is a contiguous sequence of n items from a given sample of text. Understanding the frequency of these sequences is the foundation of building a predictive text model.
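To make the definition concrete, the sketch below builds n-grams from a short list of words using base R only. The helper `make_ngrams` is a hypothetical function written for this illustration; the analysis itself relies on tidytext for this step.

```r
# Hypothetical helper: slide a window of n words across a word vector
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

words <- c("the", "quick", "brown", "fox")
make_ngrams(words, 2)  # "the quick" "quick brown" "brown fox"
make_ngrams(words, 3)  # "the quick brown" "quick brown fox"
```

A sentence of k words yields k - n + 1 n-grams, which is why very short lines contribute no bigrams or trigrams.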
We extract the single words and visualize the top 20 most frequent ones.
unigrams <- text_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)
unigrams %>%
  head(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Unigrams", x = "Word", y = "Frequency") +
  theme_minimal()
# Free memory
rm(unigrams)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2496017 133.4 4388524 234.4 4388524 234.4
## Vcells 4449909 34.0 10146329 77.5 7569997 57.8
The most common unigrams are “stop words” such as “the”, “to”, “and”, and “a”. Some applications remove these, but a text prediction application must keep them: users type them constantly, so a model that never suggests them would miss many of the most likely next words.
Next, we look at the most common pairs of consecutive words.
bigrams <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  drop_na() %>%
  count(bigram, sort = TRUE)
bigrams %>%
  head(20) %>%
  mutate(bigram = reorder(bigram, n)) %>%
  ggplot(aes(x = bigram, y = n)) +
  geom_col(fill = "coral") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Bigrams", x = "Bigram", y = "Frequency") +
  theme_minimal()
# Free memory
rm(bigrams)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2501749 133.7 4388524 234.4 4388524 234.4
## Vcells 4464589 34.1 10146329 77.5 7569997 57.8
Finally, we examine the most common combinations of three words.
trigrams <- text_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  drop_na() %>%
  count(trigram, sort = TRUE)
trigrams %>%
  head(20) %>%
  mutate(trigram = reorder(trigram, n)) %>%
  ggplot(aes(x = trigram, y = n)) +
  geom_col(fill = "mediumseagreen") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Trigrams", x = "Trigram", y = "Frequency") +
  theme_minimal()
# Free memory
rm(trigrams)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2501763 133.7 4388524 234.4 4388524 234.4
## Vcells 4530275 34.6 10146329 77.5 7569997 57.8
These visualizations highlight the predictable structure of the English language. Common phrases like “one of the” and “a lot of” dominate the trigram frequencies.
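This predictability is what a frequency-based predictor exploits. The toy sketch below, in base R, suggests the word that most often follows a given word in a tiny made-up corpus. The corpus, the `pairs` table, and the `predict_next` function are all assumptions made for illustration; this is not the final model, which has not yet been built.

```r
# Tiny made-up corpus, for illustration only
toy_text <- c("one of the best", "one of the few", "a lot of the time")
tokens <- strsplit(toy_text, " ")

# Build (previous word, next word) pairs within each line
pairs <- do.call(rbind, lapply(tokens, function(w) {
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Hypothetical predictor: return the most frequent follower of `word`,
# or NA when the word was never seen as a predecessor
predict_next <- function(word, pairs) {
  candidates <- pairs$nxt[pairs$prev == word]
  if (length(candidates) == 0) return(NA_character_)
  names(sort(table(candidates), decreasing = TRUE))[1]
}

predict_next("of", pairs)
```

In this toy corpus "of" is always followed by "the", so that is the suggestion returned. An unseen word returns NA, which is the case a real model would need to handle, for example by falling back to shorter n-grams.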
The exploratory analysis confirms that frequency-based modeling (n-grams) is a viable approach. Based on these findings, the next steps for creating the predictive algorithm and Shiny app are: