This report presents the exploratory data analysis (EDA) for the Data Science Capstone project in partnership with SwiftKey. The ultimate goal of this project is to build a predictive text application, similar to the smart keyboards used on mobile devices.
In this milestone report, I demonstrate that the datasets load successfully, report basic summary statistics (line and word counts), explore the frequencies of common words and phrases (n-grams) with plots, and outline the strategy for the final predictive algorithm and Shiny application. The report is written concisely for stakeholders who are not data scientists.
I begin by loading the three US English text datasets provided for this project: Blogs, News, and Twitter. The basic statistics are calculated on the full datasets first, before any sampling takes place.
# Load necessary libraries for data manipulation and visualization
library(stringi)
library(ggplot2)
library(knitr)
library(dplyr)
library(tidytext)
# Define file paths
blogs_file <- "/Users/liwenhe/final/en_US/en_US.blogs.txt"
news_file <- "/Users/liwenhe/final/en_US/en_US.news.txt"
twitter_file <- "/Users/liwenhe/final/en_US/en_US.twitter.txt"
# Read data into memory (skipNul = TRUE drops embedded NUL bytes;
# warn = FALSE suppresses warnings such as a missing final newline)
blogs <- readLines(blogs_file, skipNul = TRUE, warn = FALSE)
news <- readLines(news_file, skipNul = TRUE, warn = FALSE)
twitter <- readLines(twitter_file, skipNul = TRUE, warn = FALSE)
# Calculate file sizes in Megabytes (MB)
size_blogs <- file.info(blogs_file)$size / 1024^2
size_news <- file.info(news_file)$size / 1024^2
size_twitter <- file.info(twitter_file)$size / 1024^2
# Calculate line counts
lines_blogs <- length(blogs)
lines_news <- length(news)
lines_twitter <- length(twitter)
# Calculate word counts using the stringi package
words_blogs <- sum(stri_count_words(blogs))
words_news <- sum(stri_count_words(news))
words_twitter <- sum(stri_count_words(twitter))
# Create a summary data frame
summary_table <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  File_Size_MB = round(c(size_blogs, size_news, size_twitter), 2),
  Line_Count = c(lines_blogs, lines_news, lines_twitter),
  Word_Count = c(words_blogs, words_news, words_twitter)
)
# Display the table cleanly
kable(summary_table, caption = "Basic Summary Statistics of the US English Datasets")
| Dataset | File_Size_MB | Line_Count | Word_Count |
|---|---|---|---|
| Blogs | 200.42 | 899288 | 37546250 |
| News | 196.28 | 1010242 | 34762395 |
| Twitter | 159.36 | 2360148 | 30093413 |
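The three per-file calculations above repeat the same steps, so they could be folded into a single helper. The following is a minimal sketch, not the code used for this report: the function name file_summary is hypothetical, and it is exercised on a small temporary file rather than on the real corpus.

```r
library(stringi)

# Hypothetical helper (an assumption, not part of the original analysis):
# returns one summary row for a single text file.
file_summary <- function(path, label) {
  txt <- readLines(path, skipNul = TRUE, warn = FALSE)
  data.frame(
    Dataset      = label,
    File_Size_MB = round(file.info(path)$size / 1024^2, 2),
    Line_Count   = length(txt),
    Word_Count   = sum(stri_count_words(txt))
  )
}

# Demonstration on a throwaway file containing two lines and five words
tmp <- tempfile(fileext = ".txt")
writeLines(c("Hello world", "one two three"), tmp)
demo_summary <- file_summary(tmp, "Demo")
print(demo_summary)
```

Note that this helper discards the text after counting, whereas the analysis here keeps blogs, news, and twitter in memory because they are sampled in the next step.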
Because the datasets are very large (over 4 million lines combined), I will use a 1% random sample of each dataset for the exploratory analysis. This makes it possible to study the distributions without exhausting the computer’s memory.
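To make the scale of the sample concrete, the arithmetic can be checked directly from the line counts in the summary table. This is a small sketch that assumes the fractional sample sizes are truncated to whole numbers of lines; the variable names are illustrative only.

```r
# Line counts copied from the summary table
line_counts <- c(Blogs = 899288, News = 1010242, Twitter = 2360148)
sample_pct  <- 0.01

total_lines  <- sum(line_counts)                 # 4269678 lines in total
sample_sizes <- floor(line_counts * sample_pct)  # 8992, 10102, 23601
total_sample <- sum(sample_sizes)                # 42695 lines in the sample

print(sample_sizes)
print(total_sample)
```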
I will clean the data by converting it to lowercase and removing punctuation (the unnest_tokens() function from tidytext does both by default), and then split it into single words (unigrams), two-word phrases (bigrams), and three-word phrases (trigrams).
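The effect of this cleaning and tokenization can be seen on a toy sentence. The sketch below uses an invented input line, not text from the corpus, and relies on the default behavior of unnest_tokens().

```r
library(dplyr)
library(tidytext)

# Toy input, invented for illustration only
toy <- tibble(line = 1, text = "The quick, brown Fox!")

# Unigrams: lowercased, with punctuation stripped by default
toy_words <- toy %>% unnest_tokens(word, text)
toy_words$word
# "the" "quick" "brown" "fox"

# Bigrams: overlapping two-word windows over the cleaned tokens
toy_bigrams <- toy %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
toy_bigrams$bigram
# "the quick" "quick brown" "brown fox"

# Trigrams: overlapping three-word windows
toy_trigrams <- toy %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)
toy_trigrams$trigram
# "the quick brown" "quick brown fox"
```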
# Set seed for reproducibility and create a 1% sample
set.seed(1234)
sample_pct <- 0.01
combined_sample <- c(sample(blogs, floor(length(blogs) * sample_pct)),
                     sample(news, floor(length(news) * sample_pct)),
                     sample(twitter, floor(length(twitter) * sample_pct)))
# Convert to a data frame format required for tidytext
text_df <- tibble(line = seq_along(combined_sample), text = combined_sample)
# Clean up memory by removing the massive original datasets
rm(blogs, news, twitter)
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 2231946 119.2 8124664 434 NA 6542776 349.5
## Vcells 8804469 67.2 91224953 696 16384 103674216 791.0
# Extract and count top 15 unigrams
unigrams <- text_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  slice(1:15)
# Plot Unigrams
ggplot(unigrams, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 15 Most Frequent Single Words", x = "Word", y = "Frequency") +
  theme_minimal()
# Extract and count top 15 bigrams
bigrams <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE) %>%
  slice(1:15)
# Plot Bigrams
ggplot(bigrams, aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "darkorange") +
  coord_flip() +
  labs(title = "Top 15 Most Frequent Two-Word Phrases", x = "Bigram", y = "Frequency") +
  theme_minimal()
# Extract and count top 15 trigrams
trigrams <- text_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE) %>%
  slice(1:15)
# Plot Trigrams
ggplot(trigrams, aes(x = reorder(trigram, n), y = n)) +
  geom_col(fill = "seagreen") +
  coord_flip() +
  labs(title = "Top 15 Most Frequent Three-Word Phrases", x = "Trigram", y = "Frequency") +
  theme_minimal()
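The bigram and trigram steps above share the same structure, so one possible refactoring is a pair of small helpers. This is a sketch under stated assumptions: the names top_ngrams and plot_ngrams, and their parameters size and k, are hypothetical and do not appear in the original analysis. The toy data frame is invented for the demonstration.

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Hypothetical helper: count the k most frequent n-grams of a given size.
# Expects a data frame with a character column named `text`.
top_ngrams <- function(df, size = 2, k = 15) {
  df %>%
    unnest_tokens(ngram, text, token = "ngrams", n = size) %>%
    filter(!is.na(ngram)) %>%   # lines shorter than `size` words yield NA
    count(ngram, sort = TRUE) %>%
    slice_head(n = k)
}

# Hypothetical helper: horizontal bar chart in the style used in this report.
plot_ngrams <- function(counts, fill, title, xlab) {
  ggplot(counts, aes(x = reorder(ngram, n), y = n)) +
    geom_col(fill = fill) +
    coord_flip() +
    labs(title = title, x = xlab, y = "Frequency") +
    theme_minimal()
}

# Toy data, invented for illustration only
toy_df <- tibble(
  line = 1:3,
  text = c("thanks for the follow", "thanks for the help", "for the win")
)

toy_counts <- top_ngrams(toy_df, size = 2, k = 3)
p <- plot_ngrams(toy_counts, "darkorange", "Top Bigrams (toy data)", "Bigram")
print(toy_counts)
```

With the real sample, a call such as top_ngrams(text_df, size = 3) would be expected to reproduce the trigram counts above, although the result column is named ngram rather than trigram.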