NLP Capstone: Exploratory Data Analysis & Prediction Model Plan

1. Overview

This report summarizes the exploratory data analysis (EDA) performed on the HC Corpora English text dataset as part of the Data Science Capstone. The ultimate goal is to build a next-word prediction application — similar to the autocomplete feature on a smartphone keyboard. This report covers:

Loading and summarizing the data
Key statistical findings
Plans for the prediction algorithm and Shiny app

2. About the Data

The dataset comes from the HC Corpora project, which collected text from three publicly available sources in four languages (English, German, Finnish, Russian). For this project, we focus on English (en_US), which contains three files:

File	Source
`en_US.blogs.txt`	Long-form blog posts
`en_US.news.txt`	News articles
`en_US.twitter.txt`	Tweets (short, informal text)

3. Setup: Libraries

library(tidyverse)
library(tidytext)
library(stringr)

4. File Summary Statistics

setwd("C:/Users/Linh Ngoc Tran/Downloads/Coursera-SwiftKey/final/en_US")

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

file_info <- data.frame(
  file    = basename(files),
  size_MB = round(file.size(files) / 1e6, 1),
  lines   = NA_integer_,
  words   = NA_integer_
)

for (i in seq_along(files)) {
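  # Binary mode ("rb"): on Windows a text-mode read can stop early at an
  # embedded end-of-file control character, silently undercounting lines.
  con <- file(files[i], "rb")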
  lines <- readLines(con, warn = FALSE)
  close(con)
  file_info$lines[i] <- length(lines)
  file_info$words[i] <- sum(str_count(lines, "\\S+"))
}

knitr::kable(file_info,
             col.names = c("File", "Size (MB)", "Lines", "Words"),
             format.args = list(big.mark = ","),
             caption = "Table 1: Summary of en_US data files")

Table 1: Summary of en_US data files
File	Size (MB)	Lines	Words
en_US.blogs.txt	210.2	899,288	37,334,131
en_US.news.txt	205.8	77,259	2,643,969
en_US.twitter.txt	167.1	2,360,148	30,373,543

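Key takeaway: These are very large files, about 583 MB of raw text combined. As counted here, Table 1 totals roughly 3.3 million lines and 70 million words. The news row is probably an undercount: 2.6 million words in a 205.8 MB file works out to about 78 bytes per word, against roughly 5.5 bytes per word for blogs and Twitter. That is the pattern a read that stopped early would produce (on Windows, a text-mode connection can end at an embedded end-of-file control character), so the full corpus is likely larger than Table 1 shows. Because of the size, we work with a 1% random sample throughout this analysis.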

5. Sampling the Data

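To keep computation manageable, we keep each line of each file independently with probability 0.01, using a biased coin flip (rbinom) with a fixed seed for reproducibility. The result is a much smaller working dataset whose size is close to, but not exactly, 1% of each file, and which preserves the relative line counts of the three sources.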

sample_file <- function(filepath, sample_prob = 0.01, seed = 42) {
  set.seed(seed)
  con   <- file(filepath, "rb")  # binary mode, so the read cannot stop early at an embedded EOF character
  lines <- readLines(con, warn = FALSE)
  close(con)
  keep  <- rbinom(length(lines), size = 1, prob = sample_prob) == 1
  lines[keep]
}

blogs_sample   <- sample_file("en_US.blogs.txt",   0.01)
news_sample    <- sample_file("en_US.news.txt",    0.01)
twitter_sample <- sample_file("en_US.twitter.txt", 0.01)

all_sample <- c(blogs_sample, news_sample, twitter_sample)
writeLines(all_sample, "sample_corpus.txt")

cat("Total lines sampled:", length(all_sample), "\n")

## Total lines sampled: 33198
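
As a rough sanity check on the sample size, the expected number of sampled lines follows from the line counts in Table 1 and the 1% keep probability. The sketch below treats the total as a single binomial count, which assumes every line is kept independently; the variable names are illustrative and not part of the original analysis.

```r
# Line counts as reported in Table 1
n_lines <- c(blogs = 899288, news = 77259, twitter = 2360148)
p <- 0.01                                      # keep probability used by sample_file()

expected <- sum(n_lines) * p                   # mean of a Binomial(N, p) count
sd_total <- sqrt(sum(n_lines) * p * (1 - p))   # its standard deviation
observed <- 33198                              # total reported above

z <- (observed - expected) / sd_total          # distance from the mean in SD units
cat("Expected:", round(expected), " SD:", round(sd_total), " z:", round(z, 2), "\n")
```

The reported total of 33,198 lines falls within one standard deviation of the expected value, so the sample size is in line with a 1% draw from the line counts in Table 1.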

6. Quick Data Facts

# Longest line in each file
max_blogs   <- max(nchar(readLines("en_US.blogs.txt",   warn = FALSE)))
max_news    <- max(nchar(readLines("en_US.news.txt",    warn = FALSE)))
max_twitter <- max(nchar(readLines("en_US.twitter.txt", warn = FALSE)))

cat("Longest line - Blogs:  ", max_blogs,   "\n")

## Longest line - Blogs:   40833

cat("Longest line - News:   ", max_news,    "\n")

## Longest line - News:    5760

cat("Longest line - Twitter:", max_twitter, "\n")

## Longest line - Twitter: 144

twitter_lines <- readLines("en_US.twitter.txt", warn = FALSE)

love_count <- sum(grepl("love", twitter_lines))
hate_count <- sum(grepl("hate", twitter_lines))

cat("Love count:", love_count, "\n")

## Love count: 90956

cat("Hate count:", hate_count, "\n")

## Hate count: 22138

cat("Ratio (love/hate):", round(love_count / hate_count, 2), "\n")

## Ratio (love/hate): 4.11

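Interesting finding: Lines containing “love” outnumber lines containing “hate” by about 4 to 1 in the Twitter file. Note that grepl() counts matching lines rather than occurrences, and it matches case-sensitive substrings, so “lovely” and “gloves” count toward love and “whatever” counts toward hate. The ratio suggests that positive wording is more common than negative wording in this dataset, but it is not a sentiment measurement.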
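
To illustrate how much a substring match can differ from a whole-word match, here is a minimal sketch on a few made-up tweets. The toy data and variable names are assumptions for illustration only; the counts say nothing about what the real Twitter file would give.

```r
# Toy tweets (made up) chosen to show where substring matching overcounts
tweets <- c(
  "I love this song",
  "lost my gloves, whatever",
  "lovely day",
  "I hate Mondays"
)

# Substring match, as in the analysis above
love_sub <- sum(grepl("love", tweets))   # also hits "gloves" and "lovely"
hate_sub <- sum(grepl("hate", tweets))   # also hits "whatever"

# Whole-word, case-insensitive match using word boundaries
love_word <- sum(grepl("\\blove\\b", tweets, ignore.case = TRUE, perl = TRUE))
hate_word <- sum(grepl("\\bhate\\b", tweets, ignore.case = TRUE, perl = TRUE))

cat("Substring:  love =", love_sub,  " hate =", hate_sub,  "\n")
cat("Whole word: love =", love_word, " hate =", hate_word, "\n")
```

Applying the word-boundary pattern to twitter_lines would give a ratio that refers to the words themselves rather than to any string containing them.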

7. Text Cleaning & Tokenization

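Before analysis, text is cleaned by converting to lowercase, replacing every character other than a-z, whitespace, and the straight apostrophe with a space (which removes punctuation, digits, and non-ASCII characters), and collapsing extra whitespace. Note that the tokenization steps in the following sections call unnest_tokens() on the raw lines rather than on cleaned_lines; by default that function lowercases the text and strips punctuation itself.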

lines <- readLines("sample_corpus.txt", warn = FALSE)

clean_text <- function(text) {
  text %>%
    str_to_lower() %>%
    str_replace_all("[^a-z\\s']", " ") %>%
    str_replace_all("\\s+", " ") %>%
    str_trim()
}

cleaned_lines <- clean_text(lines)
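
To see what the cleaning step keeps and drops, the sketch below applies the same three transformations to two made-up strings. The function name clean_text_demo and the example strings are illustrative assumptions; the function is written without the pipe so that it depends only on stringr.

```r
library(stringr)

# Same steps as clean_text(), written as sequential assignments
clean_text_demo <- function(text) {
  text <- str_to_lower(text)
  text <- str_replace_all(text, "[^a-z\\s']", " ")  # keep a-z, whitespace, straight apostrophe
  text <- str_replace_all(text, "\\s+", " ")        # collapse runs of whitespace
  str_trim(text)
}

examples <- c(
  "RT @user: I can't WAIT for 2024!!  #excited",
  "don\u2019t stop"   # curly apostrophe (U+2019), common in blog and news text
)

cleaned <- clean_text_demo(examples)
print(cleaned)
```

The second example shows a side effect worth knowing about: a curly apostrophe is not the straight apostrophe the pattern keeps, so a contraction typed with one is split into two tokens.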

8. Word Frequency Distribution

8.1 Top 20 Most Frequent Words (with stop words removed)

unigrams <- tibble(text = lines) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  filter(!str_detect(word, "^\\d+$")) %>%
  count(word, sort = TRUE)

unigrams %>%
  top_n(20, n) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words (Stop Words Removed)",
       x = "Word", y = "Count") +
  theme_minimal()

Finding: The most common meaningful words are social and emotional: love, life, day, people, feel. This reflects the heavy Twitter component of the sample (tweets make up about 71% of the lines counted in Table 1). Tokens like rt (retweet) are Twitter-specific noise that will be cleaned before model training.

9. N-gram Analysis

N-grams are sequences of N consecutive words. They are the foundation of our prediction model.
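
As a concrete illustration of the definition, the following base R sketch slides a window of length n over a vector of tokens. The helper make_ngrams is hypothetical and only for illustration; the analysis itself relies on unnest_tokens() for this step.

```r
# Build all n-grams from a character vector of tokens (illustrative helper)
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))   # too few tokens for one n-gram
  starts <- seq_len(length(tokens) - n + 1)      # start index of every window
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- c("i", "want", "to", "go")
bigrams  <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A sentence of k tokens yields k - n + 1 n-grams, which is why higher-order tables contain many distinct entries, most of them rare.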

9.1 Top 20 Bigrams (2-word pairs)

bigrams_df <- tibble(text = lines) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE)

bigrams_df %>%
  top_n(20, n) %>%
  ggplot(aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Bigrams",
       x = "Bigram", y = "Count") +
  theme_minimal()

9.2 Top 20 Trigrams (3-word phrases)

trigrams_df <- tibble(text = lines) %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE)

trigrams_df %>%
  top_n(20, n) %>%
  ggplot(aes(x = reorder(trigram, n), y = n)) +
  geom_col(fill = "darkred") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Trigrams",
       x = "Trigram", y = "Count") +
  theme_minimal()

Finding: Bigrams are dominated by prepositional glue phrases (of the, in the). Trigrams are more meaningful and natural-sounding: one of the, i want to, can't wait to — exactly the kind of patterns a prediction model can leverage.

10. Word Coverage (Zipf’s Law)

A key question for building an efficient model is: how many unique words do you actually need?

unigrams_all <- tibble(text = lines) %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  mutate(
    cumulative   = cumsum(n),
    total        = sum(n),
    coverage_pct = cumulative / total * 100,
    rank         = row_number()
  )

cover_50 <- min(which(unigrams_all$coverage_pct >= 50))
cover_90 <- min(which(unigrams_all$coverage_pct >= 90))

cat("Words needed for 50% coverage:", cover_50, "\n")

## Words needed for 50% coverage: 132

cat("Words needed for 90% coverage:", cover_90, "\n")

## Words needed for 90% coverage: 6571

unigrams_all %>%
  filter(rank <= 10000) %>%
  ggplot(aes(x = rank, y = coverage_pct)) +
  geom_line(color = "steelblue", linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_hline(yintercept = 50, linetype = "dashed", color = "red") +
  geom_hline(yintercept = 90, linetype = "dashed", color = "orange") +
  annotate("text", x = 500,  y = 52, label = "50% coverage", color = "red") +
  annotate("text", x = 500,  y = 92, label = "90% coverage", color = "orange") +
  labs(title = "Word Coverage Curve (Zipf's Law)",
       x = "Number of Unique Words (ranked by frequency)",
       y = "% of Total Words Covered") +
  theme_minimal()

Coverage Target	Unique Words Needed
50% of all text	132 words
90% of all text	6,571 words

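This pattern is consistent with Zipf’s law, under which a word’s frequency is roughly inversely proportional to its frequency rank, so a small share of the vocabulary accounts for most of the running text. In this sample, 132 words cover half of all tokens, and 6,571 words (about 16% of the roughly 40,670 distinct words) cover 90%. A prediction model can therefore stay compact by concentrating on the most common words and phrases.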
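
The coverage calculation can be reduced to a small helper that works on any vector of word counts. The sketch below uses made-up counts; the function name words_for_coverage and the toy vocabulary are assumptions for illustration.

```r
# Smallest number of top-ranked words whose counts reach a target share of all tokens
words_for_coverage <- function(counts, target) {
  counts   <- sort(counts, decreasing = TRUE)   # rank words by frequency
  coverage <- cumsum(counts) / sum(counts)      # cumulative share of tokens
  unname(which(coverage >= target)[1])          # first rank reaching the target
}

# Made-up counts for a six-word vocabulary (100 tokens in total)
toy_counts <- c(the = 45, to = 25, and = 12, love = 10, day = 5, zebra = 3)

n_50 <- words_for_coverage(toy_counts, 0.50)
n_90 <- words_for_coverage(toy_counts, 0.90)
cat("Words for 50%:", n_50, " Words for 90%:", n_90, "\n")
```

Applied to a table like unigrams_all, the same function could be used to compare coverage at several targets without repeating the min(which(...)) expression.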

11. Profanity Filtering

library(lexicon)

profanity_list <- lexicon::profanity_alvarez

tokens_all <- tibble(text = lines) %>%
  unnest_tokens(word, text)

tokens_clean <- tokens_all %>%
  filter(!word %in% profanity_list)

cat("Tokens before filtering:", nrow(tokens_all),   "\n")

## Tokens before filtering: 699672

cat("Tokens after filtering: ", nrow(tokens_clean), "\n")

## Tokens after filtering:  698295

cat("Profane tokens removed: ", nrow(tokens_all) - nrow(tokens_clean), "\n")

## Profane tokens removed:  1377

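Only about 0.2% of tokens (1,377 of 699,672) matched the profanity list. This is a small but important cleanup step that keeps listed offensive words out of the app's suggestions; words missing from the list, such as deliberate misspellings, would still pass through.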

12. Plan for Prediction Algorithm & Shiny App

12.1 Algorithm: Stupid Backoff N-gram Model

The prediction model will use a backoff approach:

Given the last 3 words typed → look up matching 4-grams
If no match → back off to last 2 words → look up trigrams
If no match → back off to last 1 word → look up bigrams
If still no match → return most common unigrams

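This approach is simple and fast, and it always returns a suggestion: when the typed context was never seen in the training data, the model falls back to shorter contexts and finally to the most common unigrams. It cannot suggest a word that never appeared in training. As published, Stupid Backoff also multiplies the score by a fixed factor (0.4 in the original paper) at each backoff step so that candidates from different n-gram orders can be ranked together; the fall-through rule above is the simplest form of the same idea.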
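
The lookup order described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the final implementation: the table layout (one data frame per context length, with context, word, and n columns), the function name predict_next, and all counts are made up for the example.

```r
# Toy lookup tables keyed by context length; all counts are made up
ngram_tables <- list(
  "3" = data.frame(context = c("i want to", "i want to"),
                   word = c("go", "be"), n = c(5, 3), stringsAsFactors = FALSE),
  "2" = data.frame(context = c("want to", "going to"),
                   word = c("go", "be"), n = c(9, 7), stringsAsFactors = FALSE),
  "1" = data.frame(context = c("to", "to", "the"),
                   word = c("be", "the", "same"), n = c(40, 30, 12),
                   stringsAsFactors = FALSE)
)
top_unigrams <- c("the", "to", "and")   # last-resort suggestions

predict_next <- function(text, tables, fallback, k = 3) {
  tokens <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  for (ctx_len in 3:1) {                       # longest context first
    if (length(tokens) < ctx_len) next         # not enough words typed yet
    ctx  <- paste(tail(tokens, ctx_len), collapse = " ")
    tab  <- tables[[as.character(ctx_len)]]
    hits <- tab[tab$context == ctx, ]
    if (nrow(hits) > 0) {                      # match found: stop backing off
      hits <- hits[order(-hits$n), ]
      return(head(hits$word, k))
    }
  }
  head(fallback, k)                            # no context matched at any length
}

print(predict_next("I want to", ngram_tables, top_unigrams))
print(predict_next("we are going to", ngram_tables, top_unigrams))
print(predict_next("zebra", ngram_tables, top_unigrams))
```

Storing the context and the predicted word in separate columns keeps each lookup to a single equality filter; for tables of the size reported in this document, a keyed structure such as a data.table or a named list would likely be faster.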

12.2 Memory Optimization

By removing n-grams that appear only once, we dramatically reduce model size:

Table	Raw Count	After Trimming	Reduction
Bigrams	311,837	62,145	80%
Trigrams	532,244	38,531	93%
Quadgrams	579,731	10,731	98%

The trimmed model fits comfortably in memory for a Shiny app deployment.
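
The reduction percentages in the table can be reproduced from the raw and trimmed counts, and the trimming rule itself is a one-line filter. In the sketch below, the counts are copied from the table above, while the toy data frame and variable names are illustrative assumptions.

```r
# Counts copied from the table above
raw     <- c(bigrams = 311837, trigrams = 532244, quadgrams = 579731)
trimmed <- c(bigrams = 62145,  trigrams = 38531,  quadgrams = 10731)

reduction_pct <- round((1 - trimmed / raw) * 100)
print(reduction_pct)

# The trimming rule on a made-up frequency table: drop n-grams seen only once
toy <- data.frame(ngram = c("of the", "in the", "purple zebra"),
                  n = c(12, 9, 1), stringsAsFactors = FALSE)
pruned <- toy[toy$n > 1, ]
print(pruned)
```

The steep rise in reduction with n-gram order reflects how many longer phrases occur exactly once in a sample of this size.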

12.3 Shiny App Plan

The app will:
- Accept a text input from the user
- Predict the top 3 most likely next words in real time
- Display predictions as clickable suggestion buttons (like a mobile keyboard)
- Run fast enough for interactive use on shinyapps.io

13. Summary of Key Findings

Finding	Detail
Dataset size	About 583 MB across 3 files; 3.3M lines and 70M words as counted in Table 1 (news figures look truncated)
Sample used	1% random sample (~33,000 lines)
Vocabulary size	~40,670 unique words in sample
50% coverage	Just 132 words needed
90% coverage	6,571 words needed
Love/Hate ratio	Lines containing “love” outnumber lines containing “hate” about 4 to 1 on Twitter
Profanity rate	~0.2% of tokens removed
Model approach	Stupid Backoff with quadgram → trigram → bigram → unigram