Introduction

This report presents the initial exploratory data analysis (EDA) for the SwiftKey Capstone project. The goal is to build a text prediction algorithm and a Shiny app that suggests the next word given a phrase. This milestone demonstrates data loading, basic summaries, interesting findings, and plans for the final app.

Data Loading

The data come from the HC Corpora and include English text from blogs, news, and Twitter.

# Libraries used throughout the analysis
library(stringr)    # str_count
library(dplyr)      # data manipulation and %>%
library(tidytext)   # unnest_tokens, stop_words
library(ggplot2)    # plots
library(knitr)      # kable

# Paths
folder <- "final/en_US"
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

# Read in the three files as UTF-8 text
blogs   <- readLines(file.path(folder, files[1]), warn = FALSE, encoding = "UTF-8")
news    <- readLines(file.path(folder, files[2]), warn = FALSE, encoding = "UTF-8")
twitter <- readLines(file.path(folder, files[3]), warn = FALSE, encoding = "UTF-8")
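
If readLines() stops early on some platforms (for example because of embedded control characters in a file), a common workaround is to open the file as a binary connection; the snippet below is a sketch of that fallback, not a step required in every setup.

# Optional fallback: read via a binary connection if readLines() truncates a file
con <- file(file.path(folder, files[2]), open = "rb")
news <- readLines(con, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
close(con)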

Basic Summaries

We can look at line counts, word counts, and file sizes.

file_stats <- data.frame(
  File = files,
  Lines = sapply(list(blogs, news, twitter), length),
  Words = sapply(list(blogs, news, twitter), function(x) sum(str_count(x, "\\S+"))),
  Size_MB = round(sapply(file.path(folder, files), function(f) file.info(f)$size / (1024^2)), 2)
)

kable(file_stats, caption = "Summary of the three datasets")
Summary of the three datasets

File                 Lines      Words   Size_MB
en_US.blogs.txt     899288   37334131    200.42
en_US.news.txt     1010242   34372530    196.28
en_US.twitter.txt  2360148   30373543    159.36
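
One quantity not shown directly in the table is the average number of words per line, which can be derived from the counts above (roughly 42 for blogs, 34 for news, and 13 for Twitter):

# Average words per line, derived from the table above
round(file_stats$Words / file_stats$Lines, 1)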

Sampling and Tokenization

To keep computation manageable, we sample 10,000 lines from each of the three sources.

set.seed(123)
sample_size <- 10000
sample_text <- c(
  sample(blogs, sample_size),
  sample(news, sample_size),
  sample(twitter, sample_size)
)
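
Since only the sampled lines are used from this point on, the full corpora can optionally be dropped to free memory (a housekeeping step, not required for the analysis):

# Optional: release the full corpora now that the sample is in hand
rm(blogs, news, twitter)
gc()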

Tokenize the sampled text into words and remove common English stop words.

tokens <- tibble(text = sample_text) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  count(word, sort = TRUE)

head(tokens, 10)
## # A tibble: 10 × 2
##    word       n
##    <chr>  <int>
##  1 time    1878
##  2 people  1327
##  3 day     1316
##  4 love    1080
##  5 1        829
##  6 2        819
##  7 life     765
##  8 3        733
##  9 home     694
## 10 week     603
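
Several of the top tokens are standalone digits ("1", "2", "3"). If we wanted to exclude them, one possible extra filter (a sketch, not part of the pipeline above) is a simple regular-expression check:

# Possible additional cleaning step: drop purely numeric tokens
tokens_no_numbers <- tokens %>%
  filter(!str_detect(word, "^[0-9]+$"))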

Word Frequency Distribution

tokens %>%
  top_n(20, n) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Most Frequent Words", x = "Word", y = "Frequency")

The distribution shows that a small number of words accounts for most of the text, which is consistent with Zipf's Law.
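
One quick way to visualize this is a log-log plot of frequency against rank; under Zipf's Law the points fall roughly on a straight line with negative slope. A sketch using the tokens table defined above:

# Sketch: frequency vs. rank on log-log axes to check the Zipf-like pattern
tokens %>%
  mutate(rank = row_number()) %>%
  ggplot(aes(x = rank, y = n)) +
  geom_line(color = "steelblue") +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Word Frequency vs. Rank (log-log scale)",
       x = "Rank", y = "Frequency")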

N-gram Analysis

We can look at common bigrams (2-grams) and trigrams (3-grams).

bigrams <- tibble(text = sample_text) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

trigrams <- tibble(text = sample_text) %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE)

head(bigrams, 10)
## # A tibble: 10 × 2
##    bigram       n
##    <chr>    <int>
##  1 of the    4129
##  2 in the    3910
##  3 to the    1964
##  4 on the    1741
##  5 for the   1704
##  6 to be     1440
##  7 and the   1262
##  8 at the    1171
##  9 in a      1059
## 10 with the  1013
head(trigrams, 10)
## # A tibble: 10 × 2
##    trigram         n
##    <chr>       <int>
##  1 <NA>          881
##  2 one of the    335
##  3 a lot of      264
##  4 the end of    169
##  5 to be a       151
##  6 out of the    138
##  7 some of the   138
##  8 as well as    137
##  9 going to be   137
## 10 it was a      131
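
The <NA> entry at the top of the trigram table comes from sampled lines with fewer than three words, for which unnest_tokens() returns NA. These rows would typically be dropped before building frequency tables, for example:

# Drop NA rows produced by lines too short to form an n-gram
bigrams  <- bigrams  %>% filter(!is.na(bigram))
trigrams <- trigrams %>% filter(!is.na(trigram))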

Coverage Analysis

We can estimate how many unique words are needed to cover 50% and 90% of all word instances in the stop-word-filtered sample.

tokens <- tokens %>% mutate(pct = n / sum(n), cum_pct = cumsum(pct))
n50 <- which.min(abs(tokens$cum_pct - 0.5))
n90 <- which.min(abs(tokens$cum_pct - 0.9))
cat("Words to cover 50%:", n50, "\n")
## Words to cover 50%: 1716
cat("Words to cover 90%:", n90, "\n")
## Words to cover 90%: 18541
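
The same calculation generalizes to any coverage threshold. A small helper function (a sketch reusing the cum_pct column computed above) returns the number of unique words needed to reach a given share of word instances:

# Number of unique words needed to reach a given coverage share
words_for_coverage <- function(tok, target) {
  which(tok$cum_pct >= target)[1]
}
words_for_coverage(tokens, 0.5)
words_for_coverage(tokens, 0.9)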

Interesting Findings

  - The Twitter file has the most lines (about 2.36 million) but the fewest total words, so its lines are much shorter on average than those in the blogs and news files.
  - Even after removing stop words, several of the most frequent tokens are standalone numbers ("1", "2", "3"), suggesting numeric tokens should be filtered or handled explicitly during preprocessing.
  - Word frequencies are highly skewed: roughly 1,716 unique words cover 50% of word instances in the sample, while about 18,541 are needed to reach 90%, consistent with Zipf's Law.
  - The most common bigrams and trigrams are dominated by function-word patterns such as "of the" and "one of the", which is useful for n-gram-based prediction.
  - The trigram table contains an <NA> entry from sampled lines with fewer than three words; these rows need to be removed before modeling.

Next Steps: Prediction Algorithm and Shiny App

The next phase will involve:

  1. Building a predictive model using n-gram frequency tables (a rough sketch follows this list).
  2. Developing a Shiny app where the user types a phrase and the app suggests the most likely next word.
  3. Optimizing the model's size and speed (for example, choosing preprocessing steps and n-gram sizes) so it runs responsively inside the app.
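
As a rough illustration of the n-gram lookup idea (a sketch only, using the bigram table built above; predict_next is a hypothetical helper, not the final implementation):

library(tidyr)   # for separate()

# Sketch: suggest the most frequent words observed after a given word.
# A fuller model would consult trigram counts first and back off to bigrams.
predict_next <- function(prev_word, bigram_counts, top = 3) {
  bigram_counts %>%
    separate(bigram, into = c("w1", "w2"), sep = " ") %>%
    filter(w1 == tolower(prev_word)) %>%
    slice_head(n = top) %>%
    pull(w2)
}

predict_next("of", bigrams)   # "the" should appear among the suggestions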

Conclusion

This exploratory analysis confirms the data were successfully loaded, cleaned, and tokenized. The next step is to develop the prediction algorithm and interactive Shiny application. The findings here guide which preprocessing steps and n-gram sizes will be most effective.