Data Science Capstone: Milestone Report

Executive Summary

The goal of this milestone report is to present an exploratory data analysis of the SwiftKey dataset, which will be used to build a natural language processing (NLP) prediction algorithm and a corresponding Shiny application. This report outlines the data ingestion process, basic summary statistics, exploratory plots, and the proposed plan for the final predictive model. The analysis is presented in a manner accessible to a non-technical audience.

1. Data Acquisition and Loading

The dataset used for this project is the HC Corpora dataset. It contains text from three different sources: blogs, news, and Twitter, in multiple languages. For this project, we focus exclusively on the English (en_US) datasets.

# Define URL and file paths
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dest_file <- "Coursera-SwiftKey.zip"

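# Download the archive if it is not already present
if (!file.exists(dest_file)) {
    download.file(url, destfile = dest_file, mode = "wb")  # binary mode keeps the zip intact
}

# Unzip only if the extracted folder is missing
if (!dir.exists("final")) {
    unzip(dest_file)
}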

Because the files are large and memory is limited, we read only the first 1,000 lines of each text file. This is a convenience sample rather than a random one, so the figures below are indicative only, but it keeps the exploratory analysis fast.

# Read the first 1,000 lines of each file to conserve memory
sample_size <- 1000

blogs_file <- file("final/en_US/en_US.blogs.txt", "r")
blogs_data <- readLines(blogs_file, n = sample_size, encoding = "UTF-8", skipNul = TRUE)
close(blogs_file)

news_file <- file("final/en_US/en_US.news.txt", "r")
news_data <- readLines(news_file, n = sample_size, encoding = "UTF-8", skipNul = TRUE)
close(news_file)

twitter_file <- file("final/en_US/en_US.twitter.txt", "r")
twitter_data <- readLines(twitter_file, n = sample_size, encoding = "UTF-8", skipNul = TRUE)
close(twitter_file)
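
The calls above take the first 1,000 lines of each file. If a random sample is preferred, one possible approach is sketched below. The helper name `sample_lines`, the seed value, and the temporary demonstration file are assumptions made for illustration and are not part of the original analysis. Note that this sketch reads the whole file before sampling, which may not suit a tight memory budget.

```r
# Hypothetical helper (not from the original report): draw n random lines from a file.
sample_lines <- function(path, n, seed = 1234) {
  con <- file(path, "rb")              # binary mode guards against stray control characters
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  set.seed(seed)                       # fixed seed so the sample is reproducible
  lines[sample.int(length(lines), min(n, length(lines)))]
}

# Demonstration on a small temporary file standing in for a corpus file
tmp <- tempfile(fileext = ".txt")
writeLines(paste("line", 1:50), tmp)
random_sample <- sample_lines(tmp, n = 5)
random_sample
```

For the real data, the same call could be pointed at a path such as "final/en_US/en_US.blogs.txt" with n = sample_size.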

2. Basic Summary Statistics

We analyze the fundamental characteristics of our sampled datasets (1,000 lines each) to understand word distributions, alongside the total file sizes on disk.

library(stringi)
library(knitr)

# Calculate full file sizes in MB on disk
blogs_size <- file.info("final/en_US/en_US.blogs.txt")$size / 1024^2
news_size <- file.info("final/en_US/en_US.news.txt")$size / 1024^2
twitter_size <- file.info("final/en_US/en_US.twitter.txt")$size / 1024^2

# Calculate word counts for our samples
blogs_words <- sum(stri_count_words(blogs_data))
news_words <- sum(stri_count_words(news_data))
twitter_words <- sum(stri_count_words(twitter_data))

# Create a summary table
summary_table <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Total_File_Size_MB = round(c(blogs_size, news_size, twitter_size), 2),
  Sampled_Lines = c(sample_size, sample_size, sample_size),
  Sampled_Word_Count = format(c(blogs_words, news_words, twitter_words), big.mark=",")
)

kable(summary_table, caption = "Summary Statistics (1,000 Line Sample)")

Summary Statistics (1,000 Line Sample)
Source	Total_File_Size_MB	Sampled_Lines	Sampled_Word_Count
Blogs	200.42	1000	42,168
News	196.28	1000	34,041
Twitter	159.36	1000	12,724

Observations:
* The full datasets are large on disk (roughly 160 to 200 MB each).
* Within the same 1,000-line sample, Blogs contain more than three times as many words as Twitter (42,168 vs. 12,724), reflecting the short, length-limited nature of tweets compared with long-form blog posts.
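
The difference between sources can be made concrete by dividing each sampled word count by the 1,000 sampled lines. The short sketch below copies the counts from the summary table; the variable names are chosen here for illustration only.

```r
# Word counts copied from the summary table above
sampled_words <- c(Blogs = 42168, News = 34041, Twitter = 12724)
sampled_lines <- 1000

# Average number of words per line in each sample
words_per_line <- round(sampled_words / sampled_lines, 1)
words_per_line

# How many times more words the blog sample holds than the Twitter sample
blogs_to_twitter <- round(sampled_words[["Blogs"]] / sampled_words[["Twitter"]], 2)
blogs_to_twitter
```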

3. Data Cleaning

We combine our samples into a single data frame for the tidytext package. The cleaning itself (converting text to lowercase and removing punctuation) is applied by unnest_tokens() by default when the text is tokenized in the next section.

library(tibble)
library(dplyr)
library(tidyr)
library(tidytext)
library(ggplot2)

combined_sample <- c(blogs_data, news_data, twitter_data)

# Clean up memory
rm(blogs_data, news_data, twitter_data)
gc()

##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2217862 118.5    4388524 234.4  3001088 160.3
## Vcells 3931769  30.0    8388608  64.0  6088782  46.5

# Convert to a data frame for tidytext (seq_along() is safe even for an empty vector)
text_df <- tibble(line = seq_along(combined_sample), text = combined_sample)
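
The lowercasing and punctuation removal described in this section are handled by unnest_tokens() through its default settings. The minimal sketch below shows this on a made-up sentence; `demo_df` and `demo_tokens` are hypothetical names used only for this illustration.

```r
library(tibble)
library(tidytext)

# A made-up sentence with mixed case and punctuation
demo_df <- tibble(line = 1L, text = "Hello, World! This is GREAT.")

# Default word tokenization: lowercases the text and strips punctuation
demo_tokens <- unnest_tokens(demo_df, word, text)
demo_tokens$word
```

Passing to_lower = FALSE to unnest_tokens() would keep the original capitalization if that were ever needed.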

4. Exploratory Data Analysis: N-Grams

In Natural Language Processing, an n-gram is a contiguous sequence of n items from a given sample of text. Understanding the frequency of these sequences is the foundation of building a predictive text model.
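
To make the definition concrete, the base R sketch below lists the n-grams of a short sentence. The function `make_ngrams` is a hypothetical helper written for this illustration; the analysis itself relies on unnest_tokens() from tidytext, and the simple splitting rule here is only an approximation of that tokenizer.

```r
# Hypothetical helper: split a sentence into words and return its n-grams
make_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]  # split on anything that is not a letter or apostrophe
  words <- words[nzchar(words)]                      # drop empty strings left by the split
  if (length(words) < n) return(character(0))        # too few words for an n-gram of this order
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

make_ngrams("The quick brown fox", 2)
# returns "the quick" "quick brown" "brown fox"
```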

Most Frequent Unigrams (Single Words)

We extract the single words and visualize the top 20 most frequent ones.

unigrams <- text_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

unigrams %>%
  head(20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Unigrams", x = "Word", y = "Frequency") +
  theme_minimal()

# Free memory
rm(unigrams)
gc()

##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2496017 133.4    4388524 234.4  4388524 234.4
## Vcells 4449909  34.0   10146329  77.5  7569997  57.8

The most common unigrams are “stop words” such as “the”, “to”, “and”, and “a”. Some applications remove these, but a text prediction application must keep them because users type them constantly.

Most Frequent Bigrams (Two-Word Phrases)

Next, we look at the most common pairs of consecutive words.

bigrams <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  drop_na() %>%
  count(bigram, sort = TRUE)

bigrams %>%
  head(20) %>%
  mutate(bigram = reorder(bigram, n)) %>%
  ggplot(aes(x = bigram, y = n)) +
  geom_col(fill = "coral") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Bigrams", x = "Bigram", y = "Frequency") +
  theme_minimal()

# Free memory
rm(bigrams)
gc()

##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2501749 133.7    4388524 234.4  4388524 234.4
## Vcells 4464589  34.1   10146329  77.5  7569997  57.8

Most Frequent Trigrams (Three-Word Phrases)

Finally, we examine the most common combinations of three words.

trigrams <- text_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  drop_na() %>%
  count(trigram, sort = TRUE)

trigrams %>%
  head(20) %>%
  mutate(trigram = reorder(trigram, n)) %>%
  ggplot(aes(x = trigram, y = n)) +
  geom_col(fill = "mediumseagreen") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Trigrams", x = "Trigram", y = "Frequency") +
  theme_minimal()

# Free memory
rm(trigrams)
gc()

##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 2501763 133.7    4388524 234.4  4388524 234.4
## Vcells 4530275  34.6   10146329  77.5  7569997  57.8

These visualizations highlight the predictable structure of the English language. Common phrases like “one of the” and “a lot of” dominate the trigram frequencies.

5. Goals for Prediction Algorithm and Shiny App

The exploratory analysis suggests that frequency-based modeling (n-grams) is a viable approach. Based on these findings, the next steps for creating the predictive algorithm and Shiny app are:

* Build an N-Gram Language Model: We will compute probabilities for unigrams, bigrams, trigrams, and potentially quadgrams across a larger training set.
* Implement a Backoff Model: We will use a back-off strategy such as Katz’s back-off model. When the algorithm tries to predict the next word, it will first look for the highest-order n-gram (e.g., quadgram) that matches the user’s input. If no match is found, it will “back off” to a lower-order n-gram (e.g., trigram, then bigram) until a prediction can be made. Katz’s method additionally discounts the counts of observed n-grams so that some probability mass is reserved for the lower-order estimates.
* Optimize for Performance: Memory management and response time are critical for a web application. The n-gram frequency tables will be pruned (e.g., by dropping very low-frequency terms) and stored efficiently (potentially as data.tables or SQLite databases) to keep lookups fast.
* Develop the Shiny App: The final product will be a user-friendly Shiny web application. It will feature a text input box where the user can type a phrase. As the user types, the app will display the top 1-3 predicted words, simulating a real-world smartphone keyboard experience.
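
The back-off lookup described in the steps above can be sketched in base R as follows. This is a simplified back-off on relative frequencies, without the discounting that full Katz back-off requires, and it is not the final implementation. The toy corpus and the names `count_ngrams`, `tables`, and `predict_next` are assumptions made for illustration.

```r
# Toy corpus standing in for the training data (an assumption for illustration)
corpus <- c("i want to go home",
            "i want to eat now",
            "we want to go out",
            "they like to eat")

# Count n-grams of order n; split each into its context and its final word
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(tolower(lines), "\\s+"), function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  gram <- names(tab)
  data.frame(context = sub(" ?[^ ]+$", "", gram),  # everything before the last word
             nxt     = sub("^.* ", "", gram),      # the last word
             n       = as.integer(tab),
             stringsAsFactors = FALSE)
}

# One table per order: unigrams, bigrams, trigrams
tables <- lapply(1:3, function(n) count_ngrams(corpus, n))

# Try the highest order first, then back off to lower orders
predict_next <- function(input, tables, top = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  for (ord in rev(seq_along(tables))) {
    k <- ord - 1                                   # number of context words for this order
    if (length(words) < k) next
    context <- paste(tail(words, k), collapse = " ")
    hits <- tables[[ord]][tables[[ord]]$context == context, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$n, hits$nxt), ]     # most frequent first, ties alphabetical
      hits$prob <- hits$n / sum(hits$n)            # relative frequency within this context
      return(head(hits[, c("nxt", "prob")], top))
    }
  }
  NULL
}

predict_next("I want to", tables)   # matched at the trigram level
predict_next("please go", tables)   # no trigram match, backs off to bigrams
```

In this sketch an input with no matching context at all falls through to the unigram table, so the most frequent word overall is returned as a last resort.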