This report presents the initial exploratory data analysis (EDA) for the SwiftKey Capstone project. The goal is to build a text prediction algorithm and a Shiny app that suggests the next word given a phrase. This milestone demonstrates data loading, basic summaries, interesting findings, and plans for the final app.
The data come from the HC Corpora and include English text from blogs, news, and Twitter.
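If the files are not already on disk, they can be downloaded and unpacked first. This is a minimal sketch; the URL is assumed to be the standard Coursera-SwiftKey archive location.

# Download and unzip the corpus if it is not already present
# (URL assumed to be the standard Coursera-SwiftKey archive)
zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("final/en_US")) {
  download.file(zip_url, "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}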
# Load required packages
library(stringr); library(knitr); library(dplyr)
library(tidytext); library(ggplot2)

# Paths
folder <- "final/en_US"
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

# Read in data
blogs <- readLines(file.path(folder, files[1]), warn = FALSE, encoding = "UTF-8")
news <- readLines(file.path(folder, files[2]), warn = FALSE, encoding = "UTF-8")
twitter <- readLines(file.path(folder, files[3]), warn = FALSE, encoding = "UTF-8")
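Some lines contain malformed bytes, emoticons, and control symbols. One optional cleanup step (a sketch, not required for the summaries below) is to drop anything that cannot be represented in ASCII:

# Optional: drop characters that cannot be represented in ASCII
# (reduces problems with emoticons and malformed bytes later on)
blogs   <- iconv(blogs,   from = "UTF-8", to = "ASCII", sub = "")
news    <- iconv(news,    from = "UTF-8", to = "ASCII", sub = "")
twitter <- iconv(twitter, from = "UTF-8", to = "ASCII", sub = "")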
We can look at line counts, word counts, and file sizes.
file_stats <- data.frame(
File = files,
Lines = sapply(list(blogs, news, twitter), length),
Words = sapply(list(blogs, news, twitter), function(x) sum(str_count(x, "\\S+"))),
Size_MB = round(file.info(file.path(folder, files))$size / (1024^2), 2)
)
kable(file_stats, caption = "Summary of the three datasets")
Table: Summary of the three datasets

| File | Lines | Words | Size_MB |
|---|---|---|---|
| en_US.blogs.txt | 899288 | 37334131 | 200.42 |
| en_US.news.txt | 1010242 | 34372530 | 196.28 |
| en_US.twitter.txt | 2360148 | 30373543 | 159.36 |
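Per-line length statistics give a sense of how the sources differ. The sketch below (the column names are illustrative) reuses the vectors read above.

# Average and maximum words per line for each source
line_stats <- data.frame(
  File = files,
  Mean_Words_Per_Line = sapply(list(blogs, news, twitter),
                               function(x) round(mean(str_count(x, "\\S+")), 1)),
  Max_Words_Per_Line = sapply(list(blogs, news, twitter),
                              function(x) max(str_count(x, "\\S+")))
)
kable(line_stats, caption = "Per-line length statistics by source")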
To keep computation manageable, we sample 10,000 lines from each source.
set.seed(123)
sample_size <- 10000
sample_text <- c(
sample(blogs, sample_size),
sample(news, sample_size),
sample(twitter, sample_size)
)
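The full corpora are no longer needed once the sample is drawn, so they can be removed to free memory:

# Free the memory used by the full corpora
rm(blogs, news, twitter)
invisible(gc())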
Tokenize the sampled text into words, removing common English stop words before counting.
tokens <- tibble(text = sample_text) %>%
unnest_tokens(word, text) %>%
filter(!word %in% stop_words$word) %>%
count(word, sort = TRUE)
head(tokens, 10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 time 1878
## 2 people 1327
## 3 day 1316
## 4 love 1080
## 5 1 829
## 6 2 819
## 7 life 765
## 8 3 733
## 9 home 694
## 10 week 603
Note that bare digits (e.g., 1, 2, 3) survive stop-word removal; they are candidates for removal in the cleaning step planned before model building.
tokens %>%
top_n(20, n) %>%
ggplot(aes(x = reorder(word, n), y = n)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Most Frequent Words", x = "Word", y = "Frequency")
The distribution shows that a small number of words account for most of the text, consistent with Zipf's Law.
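A quick visual check of Zipf's Law is a log-log plot of word frequency against rank. The sketch below reuses the tokens table, which is already sorted by frequency.

# Zipf's Law check: frequency vs. rank on log-log scales
tokens %>%
  mutate(rank = row_number()) %>%
  ggplot(aes(x = rank, y = n)) +
  geom_line(color = "steelblue") +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Word Frequency vs. Rank (log-log)", x = "Rank", y = "Frequency")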
We can look at common bigrams (2-grams) and trigrams (3-grams).
bigrams <- tibble(text = sample_text) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
count(bigram, sort = TRUE)
trigrams <- tibble(text = sample_text) %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
count(trigram, sort = TRUE)
head(bigrams, 10)
## # A tibble: 10 × 2
## bigram n
## <chr> <int>
## 1 of the 4129
## 2 in the 3910
## 3 to the 1964
## 4 on the 1741
## 5 for the 1704
## 6 to be 1440
## 7 and the 1262
## 8 at the 1171
## 9 in a 1059
## 10 with the 1013
head(trigrams, 10)
## # A tibble: 10 × 2
## trigram n
## <chr> <int>
## 1 <NA> 881
## 2 one of the 335
## 3 a lot of 264
## 4 the end of 169
## 5 to be a 151
## 6 out of the 138
## 7 some of the 138
## 8 as well as 137
## 9 going to be 137
## 10 it was a 131
The NA entry at the top arises from sampled lines with fewer than three words, which produce no trigram; these rows can simply be dropped before building the prediction tables.
We can estimate how many unique words are needed to cover 50% and 90% of all word instances in the sample (after stop-word removal).
tokens <- tokens %>% mutate(pct = n / sum(n), cum_pct = cumsum(pct))
n50 <- which(tokens$cum_pct >= 0.5)[1]
n90 <- which(tokens$cum_pct >= 0.9)[1]
cat("Words to cover 50%:", n50, "\n")
## Words to cover 50%: 1716
cat("Words to cover 90%:", n90, "\n")
## Words to cover 90%: 18541
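The same information can be shown as a cumulative coverage curve (a sketch reusing the cum_pct column computed above):

# Cumulative coverage of word instances by the most frequent words
tokens %>%
  mutate(rank = row_number()) %>%
  ggplot(aes(x = rank, y = cum_pct)) +
  geom_line(color = "steelblue") +
  geom_hline(yintercept = c(0.5, 0.9), linetype = "dashed") +
  labs(title = "Cumulative Word Coverage",
       x = "Number of unique words (ranked by frequency)",
       y = "Cumulative share of word instances")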
- The word frequency distribution is highly skewed: a small set of words dominates usage.
- Twitter has the most lines but the shortest text per entry.
- Blogs contain longer sentences and a richer vocabulary.
- The text includes many non-standard words, emoticons, and abbreviations that must be cleaned before modeling.
The next phase will involve:

- Building an n-gram prediction model using Katz Backoff or Stupid Backoff (a minimal illustrative sketch follows below).
- Predicting the most likely next word from the previous one to three words.
- A Shiny app in which the user enters a phrase and the top predicted next words are displayed.
- Further text cleaning (removing profanity and punctuation).
- Stemming/lemmatization to reduce redundancy.
- Caching results for faster prediction.
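To make the planned approach concrete, here is a minimal Stupid Backoff-style lookup built on the bigram and trigram tables computed earlier. The helper name predict_next and the 0.4 backoff factor are illustrative choices for this sketch, not the final implementation.

# Illustrative Stupid Backoff-style lookup (not the final model).
# Try the trigram table first; if the two-word prefix is unseen,
# back off to the bigram table with a fixed 0.4 discount.
library(tidyr)

tri_split <- trigrams %>%
  filter(!is.na(trigram)) %>%
  separate(trigram, into = c("w1", "w2", "w3"), sep = " ")

bi_split <- bigrams %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, into = c("w1", "w2"), sep = " ")

predict_next <- function(phrase, k = 3) {
  words <- str_split(str_to_lower(phrase), "\\s+")[[1]]
  last_two <- tail(words, 2)
  # Trigram candidates: continuations observed after the last two words,
  # scored by their relative counts within the matched prefix
  hits <- tri_split %>%
    filter(w1 == last_two[1], w2 == last_two[2]) %>%
    mutate(score = n / sum(n)) %>%
    select(word = w3, score)
  # Back off to bigrams when the trigram prefix was never seen
  if (nrow(hits) == 0) {
    hits <- bi_split %>%
      filter(w1 == tail(words, 1)) %>%
      mutate(score = 0.4 * n / sum(n)) %>%
      select(word = w2, score)
  }
  head(hits, k)
}

predict_next("one of")

A fuller model would continue backing off to unigram frequencies and handle words outside the sampled vocabulary.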
This exploratory analysis confirms that the data were successfully loaded, sampled, and tokenized. The next step is to develop the prediction algorithm and the interactive Shiny application; the findings above guide which preprocessing steps and n-gram sizes will be most effective.