1. Executive Summary

This report presents the exploratory data analysis (EDA) for the Data Science Capstone project in partnership with SwiftKey. The ultimate goal of this project is to build a predictive text application, similar to the smart keyboards used on mobile devices.

In this milestone report, I demonstrate that the dataset has been loaded successfully, present basic summary statistics (line and word counts), explore the frequencies of common words and phrases (n-grams) with plots, and outline the strategy for the final prediction algorithm and Shiny application. The report is written concisely for stakeholders who are not data scientists.


2. Data Loading and Basic Summaries

I begin by loading the three US English text datasets provided for this project: Blogs, News, and Twitter. I first calculate basic statistics on the full datasets; the n-gram analysis in Section 3 works on a sample to keep processing manageable.

# Load necessary libraries for data manipulation and visualization
library(stringi)
library(ggplot2)
library(knitr)
library(dplyr)
library(tidytext)

# Define file paths
blogs_file <- "/Users/liwenhe/final/en_US/en_US.blogs.txt"
news_file <- "/Users/liwenhe/final/en_US/en_US.news.txt"
twitter_file <- "/Users/liwenhe/final/en_US/en_US.twitter.txt"

# Read data into memory (skipNul = TRUE drops embedded NUL bytes; warn = FALSE silences readLines warnings)
blogs <- readLines(blogs_file, skipNul = TRUE, warn = FALSE)
news <- readLines(news_file, skipNul = TRUE, warn = FALSE)
twitter <- readLines(twitter_file, skipNul = TRUE, warn = FALSE)

# Calculate file sizes in Megabytes (MB)
size_blogs <- file.info(blogs_file)$size / 1024^2
size_news <- file.info(news_file)$size / 1024^2
size_twitter <- file.info(twitter_file)$size / 1024^2

# Calculate line counts
lines_blogs <- length(blogs)
lines_news <- length(news)
lines_twitter <- length(twitter)

# Calculate word counts using the stringi package
words_blogs <- sum(stri_count_words(blogs))
words_news <- sum(stri_count_words(news))
words_twitter <- sum(stri_count_words(twitter))

# Create a summary data frame
summary_table <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  File_Size_MB = round(c(size_blogs, size_news, size_twitter), 2),
  Line_Count = c(lines_blogs, lines_news, lines_twitter),
  Word_Count = c(words_blogs, words_news, words_twitter)
)

# Display the table cleanly
kable(summary_table, caption = "Basic Summary Statistics of the US English Datasets")

Basic Summary Statistics of the US English Datasets

Dataset   File_Size_MB   Line_Count   Word_Count
Blogs           200.42       899288     37546250
News            196.28      1010242     34762395
Twitter         159.36      2360148     30093413
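
One quick way to read the table is to divide each word count by its line count. The sketch below re-enters the figures reported above by hand rather than recomputing them from the files; the column name Words_Per_Line is my own label and does not appear in the original analysis.

```r
# Figures copied from the summary table above (not recomputed from the files)
summary_table <- data.frame(
  Dataset    = c("Blogs", "News", "Twitter"),
  Line_Count = c(899288, 1010242, 2360148),
  Word_Count = c(37546250, 34762395, 30093413)
)

# Average number of words per line, rounded to one decimal place
summary_table$Words_Per_Line <- round(
  summary_table$Word_Count / summary_table$Line_Count, 1
)

print(summary_table[, c("Dataset", "Words_Per_Line")])
```

Blog and news entries average roughly 42 and 34 words per line, while tweets average about 13, which is consistent with the short-message format of Twitter.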

3. Exploratory Data Analysis: Word Frequencies (N-Grams)

Because the datasets are large (over 4 million lines combined), I use a 1% random sample of each file for the exploratory analysis. This shows the shape of the distributions without exhausting the computer’s memory.

I clean the data by converting it to lowercase and removing punctuation, then split it into single words (unigrams), two-word phrases (bigrams), and three-word phrases (trigrams). The unnest_tokens() function from tidytext performs the lowercasing and punctuation removal by default as it tokenizes, so no separate cleaning step appears in the code.
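
To make the cleaning and splitting concrete, here is a minimal sketch on a single made-up sentence (not taken from the datasets). It assumes the dplyr and tidytext packages are installed; the object names toy, toy_unigrams, and toy_bigrams are for illustration only.

```r
library(dplyr)     # provides tibble() and the %>% pipe
library(tidytext)  # provides unnest_tokens()

# A made-up one-line example (not taken from the datasets)
toy <- tibble(line = 1L, text = "The quick, brown Fox jumps!")

# Unigrams: lowercased, with punctuation stripped by default
toy_unigrams <- toy %>% unnest_tokens(word, text)
toy_unigrams$word
# "the" "quick" "brown" "fox" "jumps"

# Bigrams: overlapping two-word windows over the same cleaned words
toy_bigrams <- toy %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
toy_bigrams$bigram
# "the quick" "quick brown" "brown fox" "fox jumps"
```

Because the windows overlap, a line with five words yields four bigrams and three trigrams.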

# Set seed for reproducibility and create a 1% sample
set.seed(1234)
sample_pct <- 0.01

# Draw a 1% sample of lines from each source
# (floor() makes the truncation to a whole number of lines explicit)
combined_sample <- c(sample(blogs, floor(length(blogs) * sample_pct)),
                     sample(news, floor(length(news) * sample_pct)),
                     sample(twitter, floor(length(twitter) * sample_pct)))

# Convert to a data frame format required for tidytext
text_df <- tibble(line = seq_along(combined_sample), text = combined_sample)

# Clean up memory by removing the massive original datasets
rm(blogs, news, twitter)
gc()
##           used  (Mb) gc trigger (Mb) limit (Mb)  max used  (Mb)
## Ncells 2231946 119.2    8124664  434         NA   6542776 349.5
## Vcells 8804469  67.2   91224953  696      16384 103674216 791.0
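
The approach above loads each full file before sampling. Where memory is tighter, one possible alternative is to sample while reading. The sketch below is an assumption on my part, not the method used in this report: sample_lines() is a hypothetical helper that reads a file in chunks and keeps each line with probability p. It therefore returns approximately, not exactly, a fraction p of the lines. The demonstration uses a small temporary file as a stand-in for the real datasets.

```r
# Hypothetical helper: keep each line with probability `p` while reading
# the file in chunks, so the full file is never held in memory at once.
sample_lines <- function(path, p = 0.01, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size, skipNul = TRUE, warn = FALSE)
    if (length(chunk) == 0) break          # end of file reached
    keep <- rbinom(length(chunk), size = 1, prob = p) == 1
    kept <- c(kept, chunk[keep])
  }
  kept
}

# Demonstration on a small temporary file (stand-in for the real datasets)
tmp <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:1000), tmp)

set.seed(1234)
demo_sample <- sample_lines(tmp, p = 0.1, chunk_size = 100L)
length(demo_sample)  # near 100, but varies: each line is an independent draw
unlink(tmp)
```

The trade-off is that the sample size is random rather than fixed, which is usually acceptable for exploratory work.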

Top Unigrams (Single Words)

# Extract and count top 15 unigrams
unigrams <- text_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  slice(1:15)

# Plot Unigrams
ggplot(unigrams, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 15 Most Frequent Single Words", x = "Word", y = "Frequency") +
  theme_minimal()

Top Bigrams (Two-Word Phrases)

# Extract and count top 15 bigrams
bigrams <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE) %>%
  slice(1:15)

# Plot Bigrams
ggplot(bigrams, aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "darkorange") +
  coord_flip() +
  labs(title = "Top 15 Most Frequent Two-Word Phrases", x = "Bigram", y = "Frequency") +
  theme_minimal()

Top Trigrams (Three-Word Phrases)

# Extract and count top 15 trigrams
trigrams <- text_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE) %>%
  slice(1:15)

# Plot Trigrams
ggplot(trigrams, aes(x = reorder(trigram, n), y = n)) +
  geom_col(fill = "seagreen") +
  coord_flip() +
  labs(title = "Top 15 Most Frequent Three-Word Phrases", x = "Trigram", y = "Frequency") +
  theme_minimal()


4. Interesting Findings and Future Plans

Interesting Findings:

  1. Stop Words Dominate: As expected, the most frequent unigrams, bigrams, and trigrams are dominated by common English “stop words” (e.g., “the”, “and”, “of the”). Stop words are often removed in text classification tasks (such as sentiment analysis), but a predictive text keyboard must keep them so that its suggestions read naturally and grammatically.
  2. Memory Management is Critical: Text data requires significant RAM. Processing even a 1% sample takes noticeable time and memory, highlighting the need for optimization, efficient data structures, and aggressive memory management when building the final application.
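
One way to make the memory point concrete is to measure an object's footprint with object.size() before and after trimming it. The sketch below uses made-up counts, and the min_count threshold for dropping rare n-grams is my assumption for illustration, not a decision taken in this report.

```r
# Toy n-gram counts (made-up values, for illustration only)
ngram_counts <- data.frame(
  ngram = c("of the", "in the", "purple walrus", "zebra crossing"),
  n     = c(120L, 95L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Assumption: n-grams seen fewer than `min_count` times are dropped
min_count <- 2L
pruned <- ngram_counts[ngram_counts$n >= min_count, ]

# Compare the in-memory footprint before and after pruning
size_before <- object.size(ngram_counts)
size_after  <- object.size(pruned)
format(size_before, units = "Kb")
format(size_after, units = "Kb")
```

On real n-gram tables, rare entries typically make up a large share of the rows, so a check like this helps judge how much a given threshold would save.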

Plan for the Prediction Algorithm and Shiny App:

  1. Algorithm Development: I plan to build a standard N-gram language model (specifically a Trigram model). The algorithm will look at the last two words typed by the user to calculate the probability of the next word.
  2. Handling Unseen Phrases (Backoff Strategy): If the last two words typed have not been seen together in the training dataset, the algorithm will use a “backoff” strategy: it will shorten the context to the last single word and use two-word phrase counts to find the most probable next word. This is a simplified relative of the Katz Backoff model, which additionally discounts observed counts to reserve probability for unseen phrases.
  3. App Deployment: The final product will be deployed via a Shiny App with a clean, responsive interface. It will feature a text input box where the user types a phrase, and the app will output the top one to three predicted next words, simulating the experience of the SwiftKey mobile keyboard. I will prioritize speed and low memory usage to ensure a smooth user experience.
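
The trigram-with-backoff idea in the plan above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions, not the final implementation: the corpus is made up, the helper names (make_ngrams, count_ngrams, best_match, predict_next) are hypothetical, raw counts stand in for probabilities (for a fixed context they rank candidates the same way), and the final fallback to the most frequent single word is my own addition.

```r
# Toy training corpus (made-up sentences, not from the datasets)
corpus <- c("thanks for the follow",
            "thanks for the help",
            "thanks for the follow",
            "one of the best")
tokens <- strsplit(tolower(corpus), " ", fixed = TRUE)

# Hypothetical helper: all n-word windows in one tokenized line
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency tables, sorted so the most frequent n-gram comes first
count_ngrams <- function(n) {
  sort(table(unlist(lapply(tokens, make_ngrams, n = n))), decreasing = TRUE)
}
uni <- count_ngrams(1)
bi  <- count_ngrams(2)
tri <- count_ngrams(3)

# Most frequent continuation of `context`, or NA if the context is unseen
best_match <- function(counts, context) {
  prefix <- paste0(context, " ")
  hits <- counts[startsWith(names(counts), prefix)]
  if (length(hits) == 0) return(NA_character_)
  sub(prefix, "", names(hits)[1], fixed = TRUE)
}

predict_next <- function(phrase) {
  words <- strsplit(tolower(trimws(phrase)), " +")[[1]]
  k <- length(words)
  # 1) Try the last two words against the trigram table
  if (k >= 2) {
    guess <- best_match(tri, paste(words[k - 1], words[k]))
    if (!is.na(guess)) return(guess)
  }
  # 2) Back off to the last word against the bigram table
  if (k >= 1) {
    guess <- best_match(bi, words[k])
    if (!is.na(guess)) return(guess)
  }
  # 3) Final fallback (my assumption): the most frequent single word
  names(uni)[1]
}

predict_next("thanks for the")  # trigram hit on "for the"
predict_next("enjoy the")       # "enjoy the" unseen, backs off to "the"
predict_next("hello world")     # nothing matches, falls back to top unigram
```

On this toy corpus the three calls return "follow", "follow", and "the" respectively. A production version would store the tables in a keyed structure for fast lookup and would return several ranked candidates rather than one.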