This report presents the initial exploratory data analysis (EDA) for the SwiftKey Capstone project. The goal is to build a text prediction algorithm and a Shiny app that suggests the next word given a phrase. This milestone demonstrates data loading, basic summaries, interesting findings, and plans for the final app.
The data come from the HC Corpora and include English text from blogs, news, and Twitter.
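If the files are not already on disk, they can be downloaded and unpacked first. This is a minimal sketch; the URL is assumed to be the standard Coursera-SwiftKey archive location.

# Download and unzip the corpus if it is not already present
# (URL assumed to be the standard Coursera-SwiftKey archive)
zip_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("final/en_US")) {
  download.file(zip_url, "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}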
# Load required packages
library(stringr); library(knitr); library(dplyr)
library(tidytext); library(ggplot2)

# Paths
folder <- "final/en_US"
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

# Read in data
blogs <- readLines(file.path(folder, files[1]), warn = FALSE, encoding = "UTF-8")
news <- readLines(file.path(folder, files[2]), warn = FALSE, encoding = "UTF-8")
twitter <- readLines(file.path(folder, files[3]), warn = FALSE, encoding = "UTF-8")
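Some lines contain malformed bytes, emoticons, and control symbols. One optional cleanup step (a sketch, not required for the summaries below) is to drop anything that cannot be represented in ASCII:

# Optional: drop characters that cannot be represented in ASCII
# (reduces problems with emoticons and malformed bytes later on)
blogs   <- iconv(blogs,   from = "UTF-8", to = "ASCII", sub = "")
news    <- iconv(news,    from = "UTF-8", to = "ASCII", sub = "")
twitter <- iconv(twitter, from = "UTF-8", to = "ASCII", sub = "")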
We can look at line counts, word counts, and file sizes.
file_stats <- data.frame(
File = files,
Lines = sapply(list(blogs, news, twitter), length),
Words = sapply(list(blogs, news, twitter), function(x) sum(str_count(x, "\\S+"))),
Size_MB = round(file.info(file.path(folder, files))$size / (1024^2), 2)
)
kable(file_stats, caption = "Summary of the three datasets")
Table: Summary of the three datasets

| File | Lines | Words | Size_MB |
|---|---|---|---|
| en_US.blogs.txt | 899288 | 37334131 | 200.42 |
| en_US.news.txt | 1010242 | 34372530 | 196.28 |
| en_US.twitter.txt | 2360148 | 30373543 | 159.36 |
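Per-line length statistics give a sense of how the sources differ. The sketch below (the column names are illustrative) reuses the vectors read above.

# Average and maximum words per line for each source
line_stats <- data.frame(
  File = files,
  Mean_Words_Per_Line = sapply(list(blogs, news, twitter),
                               function(x) round(mean(str_count(x, "\\S+")), 1)),
  Max_Words_Per_Line = sapply(list(blogs, news, twitter),
                              function(x) max(str_count(x, "\\S+")))
)
kable(line_stats, caption = "Per-line length statistics by source")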
To keep computation manageable, we sample 10,000 lines from each source.
set.seed(123)
sample_size <- 10000
sample_text <- c(
sample(blogs, sample_size),
sample(news, sample_size),
sample(twitter, sample_size)
)
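The full corpora are no longer needed once the sample is drawn, so they can be removed to free memory:

# Free the memory used by the full corpora
rm(blogs, news, twitter)
invisible(gc())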
Tokenize the sampled text into words, removing common English stop words before counting.
tokens <- tibble(text = sample_text) %>%
unnest_tokens(word, text) %>%
filter(!word %in% stop_words$word) %>%
count(word, sort = TRUE)
head(tokens, 10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 time 1878
## 2 people 1327
## 3 day 1316
## 4 love 1080
## 5 1 829
## 6 2 819
## 7 life 765
## 8 3 733
## 9 home 694
## 10 week 603
Note that bare digits (e.g., 1, 2, 3) survive stop-word removal; they are candidates for removal in the cleaning step planned before model building.
tokens %>%
top_n(20, n) %>%
ggplot(aes(x = reorder(word, n), y = n)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Most Frequent Words", x = "Word", y = "Frequency")
The distribution shows that a small number of words account for most of the text, consistent with Zipf's Law.
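A quick visual check of Zipf's Law is a log-log plot of word frequency against rank. The sketch below reuses the tokens table, which is already sorted by frequency.

# Zipf's Law check: frequency vs. rank on log-log scales
tokens %>%
  mutate(rank = row_number()) %>%
  ggplot(aes(x = rank, y = n)) +
  geom_line(color = "steelblue") +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Word Frequency vs. Rank (log-log)", x = "Rank", y = "Frequency")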
We can look at common bigrams (2-grams) and trigrams (3-grams).
bigrams <- tibble(text = sample_text) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
count(bigram, sort = TRUE)
trigrams <- tibble(text = sample_text) %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
count(trigram, sort = TRUE)
head(bigrams, 10)
## # A tibble: 10 × 2
## bigram n
## <chr> <int>
## 1 of the 4129
## 2 in the 3910
## 3 to the 1964
## 4 on the 1741
## 5 for the 1704
## 6 to be 1440
## 7 and the 1262
## 8 at the 1171
## 9 in a 1059
## 10 with the 1013
head(trigrams, 10)
## # A tibble: 10 × 2
## trigram n
## <chr> <int>
## 1 <NA> 881
## 2 one of the 335
## 3 a lot of 264
## 4 the end of 169
## 5 to be a 151
## 6 out of the 138
## 7 some of the 138
## 8 as well as 137
## 9 going to be 137
## 10 it was a 131
The NA entry at the top arises from sampled lines with fewer than three words, which produce no trigram; these rows can simply be dropped before building the prediction tables.
We can estimate how many unique words are needed to cover 50% and 90% of all word instances in the sample (after stop-word removal).
tokens <- tokens %>% mutate(pct = n / sum(n), cum_pct = cumsum(pct))
n50 <- which(tokens$cum_pct >= 0.5)[1]
n90 <- which(tokens$cum_pct >= 0.9)[1]
cat("Words to cover 50%:", n50, "\n")
## Words to cover 50%: 1716
cat("Words to cover 90%:", n90, "\n")
## Words to cover 90%: 18541
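The same information can be shown as a cumulative coverage curve (a sketch reusing the cum_pct column computed above):

# Cumulative coverage of word instances by the most frequent words
tokens %>%
  mutate(rank = row_number()) %>%
  ggplot(aes(x = rank, y = cum_pct)) +
  geom_line(color = "steelblue") +
  geom_hline(yintercept = c(0.5, 0.9), linetype = "dashed") +
  labs(title = "Cumulative Word Coverage",
       x = "Number of unique words (ranked by frequency)",
       y = "Cumulative share of word instances")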
- The word frequency distribution is highly skewed: a small set of words dominates usage.
- Twitter has the most lines but the shortest text per entry.
- Blogs contain longer sentences and a richer vocabulary.
- The text includes many non-standard words, emoticons, and abbreviations that must be cleaned before modeling.
The next phase will involve:

- Building an n-gram prediction model using Katz Backoff or Stupid Backoff (a minimal illustrative sketch follows below).
- Predicting the most likely next word from the previous one to three words.
- A Shiny app in which the user enters a phrase and the top predicted next words are displayed.
- Further text cleaning (removing profanity and punctuation).
- Stemming/lemmatization to reduce redundancy.
- Caching results for faster prediction.
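To make the planned approach concrete, here is a minimal Stupid Backoff-style lookup built on the bigram and trigram tables computed earlier. The helper name predict_next and the 0.4 backoff factor are illustrative choices for this sketch, not the final implementation.

# Illustrative Stupid Backoff-style lookup (not the final model).
# Try the trigram table first; if the two-word prefix is unseen,
# back off to the bigram table with a fixed 0.4 discount.
library(tidyr)

tri_split <- trigrams %>%
  filter(!is.na(trigram)) %>%
  separate(trigram, into = c("w1", "w2", "w3"), sep = " ")

bi_split <- bigrams %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, into = c("w1", "w2"), sep = " ")

predict_next <- function(phrase, k = 3) {
  words <- str_split(str_to_lower(phrase), "\\s+")[[1]]
  last_two <- tail(words, 2)
  # Trigram candidates: continuations observed after the last two words,
  # scored by their relative counts within the matched prefix
  hits <- tri_split %>%
    filter(w1 == last_two[1], w2 == last_two[2]) %>%
    mutate(score = n / sum(n)) %>%
    select(word = w3, score)
  # Back off to bigrams when the trigram prefix was never seen
  if (nrow(hits) == 0) {
    hits <- bi_split %>%
      filter(w1 == tail(words, 1)) %>%
      mutate(score = 0.4 * n / sum(n)) %>%
      select(word = w2, score)
  }
  head(hits, k)
}

predict_next("one of")

A fuller model would continue backing off to unigram frequencies and handle words outside the sampled vocabulary.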
This exploratory analysis confirms that the data were successfully loaded, sampled, and tokenized. The next step is to develop the prediction algorithm and the interactive Shiny application; the findings above guide which preprocessing steps and n-gram sizes will be most effective.