Executive Summary

This report outlines the exploratory data analysis of the SwiftKey dataset provided for the Data Science Capstone project. The goal of this milestone is to demonstrate successful data loading, provide basic summary statistics (file sizes, line counts, and word counts), explore the frequencies of words and short word sequences (n-grams), and outline a plan for building a predictive text algorithm and Shiny application. The analysis is written to be easily understood by non-technical stakeholders.

1. Data Loading and Basic Summaries

The dataset consists of three text files sourced from US English blogs, news sites, and Twitter. We first load the data and calculate basic statistics including file size, total lines, and total words.

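# Load the packages used throughout this report
library(stringi)   # stri_count_words()
library(knitr)     # kable()
library(dplyr)     # %>%, count(), mutate(), filter()
library(tibble)    # tibble()
library(tidytext)  # unnest_tokens()
library(ggplot2)   # plotting

# Define file paths
# (data_dir is assumed to be defined earlier as the folder holding the en_US files)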
path_blogs   <- file.path(data_dir, "en_US.blogs.txt")
path_news    <- file.path(data_dir, "en_US.news.txt")
path_twitter <- file.path(data_dir, "en_US.twitter.txt")

# Read the text lines
blogs   <- readLines(path_blogs, encoding = "UTF-8", skipNul = TRUE)
news    <- readLines(path_news, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(path_twitter, encoding = "UTF-8", skipNul = TRUE)

# Calculate File Sizes (in Megabytes)
size_blogs   <- file.info(path_blogs)$size / 1024^2
size_news    <- file.info(path_news)$size / 1024^2
size_twitter <- file.info(path_twitter)$size / 1024^2

# Calculate Word Counts using stringi for speed
words_blogs   <- sum(stri_count_words(blogs))
words_news    <- sum(stri_count_words(news))
words_twitter <- sum(stri_count_words(twitter))

# Create a summary table
summary_table <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  File_Size_MB = round(c(size_blogs, size_news, size_twitter), 2),
  Line_Count = c(length(blogs), length(news), length(twitter)),
  Word_Count = c(words_blogs, words_news, words_twitter)
)

kable(summary_table, format.args = list(big.mark = ","), 
      caption = "Table 1: Basic Data Summary of the SwiftKey Corpora")

Table 1: Basic Data Summary of the SwiftKey Corpora

Source    File_Size_MB   Line_Count   Word_Count
Blogs           200.42      899,288   37,546,250
News            196.28    1,010,242   34,762,395
Twitter         159.36    2,360,148   30,093,413

Key Finding: The Twitter file contains the most lines but the fewest words, as expected given the platform's character limit. The Blogs file has the fewest lines but the highest word count, so its entries are the longest on average.
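
The Key Finding can be checked directly from the figures in Table 1. The sketch below re-enters those counts by hand (the vector names are illustrative and not part of the original analysis) and computes the average number of words per line for each source.

```r
# Counts copied from Table 1; the variable names are illustrative
line_count <- c(Blogs = 899288, News = 1010242, Twitter = 2360148)
word_count <- c(Blogs = 37546250, News = 34762395, Twitter = 30093413)

# Average number of words per line for each source, to one decimal place
words_per_line <- round(word_count / line_count, 1)
print(words_per_line)
```

Blogs come out at roughly 42 words per line, against about 34 for News and about 13 for Twitter.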

2. Data Sampling and Cleaning

Because the raw data contains over 100 million words in total, processing the entire dataset is computationally expensive. To conduct our exploratory analysis efficiently, we take a random 1% sample of the lines in each file. Cleaning is applied during tokenization in Section 3: unnest_tokens() converts the text to lowercase and strips punctuation by default. Removing numbers and other special characters requires an explicit extra step.

set.seed(12345) # Set seed for reproducibility

# Take a 1% sample of each file
sample_pct <- 0.01
sample_data <- c(
  sample(blogs, floor(length(blogs) * sample_pct)),
  sample(news, floor(length(news) * sample_pct)),
  sample(twitter, floor(length(twitter) * sample_pct))
)

# Convert to a data frame for tidytext processing
text_df <- tibble(line = seq_along(sample_data), text = sample_data)

# Clean memory
rm(blogs, news, twitter); gc()
##           used (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells 1841113 98.4    7177943  383.4   6142698  328.1
## Vcells 8118771 62.0  170090505 1297.7 212236301 1619.3
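
The cleaning described at the start of this section can also be made explicit in base R before tokenization. This is a minimal sketch, not the code used for this report: clean_text is a hypothetical helper, and keeping apostrophes (so that contractions such as "it's" survive) is an assumption.

```r
# Hypothetical helper: lowercase the text and keep only letters,
# apostrophes, and spaces. Digits, punctuation, emoji, and accented
# letters are all replaced by a space.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]+", " ", x)  # drop everything outside a-z, ', and space
  x <- gsub("\\s+", " ", x)       # collapse runs of whitespace
  trimws(x)                       # remove leading and trailing spaces
}

print(clean_text("Hello, World! It's 2024 :)"))
```

In the pipeline above, it could be applied to sample_data before the tibble is built. One limitation is that curly apostrophes are removed along with other non-ASCII characters, so they would need to be normalized first.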

3. Exploratory Data Analysis (N-Grams)

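To build a predictive text model, we need to understand how often words and word sequences occur. We break the text down into “n-grams”: contiguous sequences of n words, such as single words (unigrams), word pairs (bigrams), and three-word phrases (trigrams).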
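
To make the idea concrete, the sketch below builds n-grams from a single sentence using base R only. make_ngrams is a hypothetical helper written for illustration; the analysis itself relies on unnest_tokens() from tidytext, shown in the code that follows.

```r
# Hypothetical helper: split a sentence on whitespace and return every
# run of n consecutive words as one string.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  if (length(words) < n) return(character(0))  # sentence too short
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

sentence <- "thanks for the follow"
print(make_ngrams(sentence, 1))  # "thanks" "for" "the" "follow"
print(make_ngrams(sentence, 2))  # "thanks for" "for the" "the follow"
print(make_ngrams(sentence, 3))  # "thanks for the" "for the follow"
```

A sentence of k words yields k - n + 1 n-grams, which is why very short lines contribute no bigrams or trigrams.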

Top Unigrams (Single Words)

unigrams <- text_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

# Plot the top 15 Unigrams
unigrams %>%
  slice_max(n, n = 15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col(fill = "#2c7fb8") +
  coord_flip() +
  labs(title = "Top 15 Most Frequent Words (Unigrams)",
       x = "Word", y = "Frequency Count") +
  theme_minimal()

Top Bigrams (Two-Word Combinations)

bigrams <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE)

# Plot the top 15 Bigrams
bigrams %>%
  slice_max(n, n = 15) %>%
  mutate(bigram = reorder(bigram, n)) %>%
  ggplot(aes(x = bigram, y = n)) +
  geom_col(fill = "#238b45") +
  coord_flip() +
  labs(title = "Top 15 Most Frequent Word Pairs (Bigrams)",
       x = "Bigram", y = "Frequency Count") +
  theme_minimal()

Top Trigrams (Three-Word Combinations)

trigrams <- text_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE)

# Plot the top 15 Trigrams
trigrams %>%
  slice_max(n, n = 15) %>%
  mutate(trigram = reorder(trigram, n)) %>%
  ggplot(aes(x = trigram, y = n)) +
  geom_col(fill = "#d95f02") +
  coord_flip() +
  labs(title = "Top 15 Most Frequent 3-Word Phrases (Trigrams)",
       x = "Trigram", y = "Frequency Count") +
  theme_minimal()

Key Finding: The most frequent words and phrases are overwhelmingly “stop words” or combinations of them (e.g., “the”, “and”, “of”, “in the”). Standard text mining often removes these, but we keep them for this project: a text prediction app must suggest the common function words that hold ordinary sentences together.

4. Future Plan: Prediction Algorithm & Shiny App

Based on this exploratory analysis, the strategy for the final data product is as follows: