NLP_Capstone_EDA.knit

Introduction

The goal of this project is to build a next-word prediction application, similar to smartphone keyboard autocomplete. This report summarises the exploratory analysis of the HC Corpora training data (the en_US blogs, news, and Twitter files) and outlines the plan for the prediction algorithm and Shiny app.

Data Overview

summary_df <- data.frame(
  Source    = c("Blogs", "News", "Twitter"),
  Size_MB   = c(201, 197, 160),
  Lines     = c(899288, 1010242, 2360148),
  Words     = c(37334117, 34365936, 30373559),
  Avg_Words = c(41.5, 34.0, 12.9),
  Max_Words = c(6630, 1792, 47)
)
knitr::kable(summary_df, format.args = list(big.mark = ","),
             caption = "Summary statistics for en_US corpus files")

Summary statistics for en_US corpus files
Source	Size_MB	Lines	Words	Avg_Words	Max_Words
Blogs	201	899,288	37,334,117	41.5	6,630
News	197	1,010,242	34,365,936	34.0	1,792
Twitter	160	2,360,148	30,373,559	12.9	47
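
The figures in the table are entered directly in the chunk above. As a rough sketch of how such per-file statistics could be derived, the helper below summarises a character vector of lines. The function name `summarise_lines`, the whitespace tokeniser, and the toy input are assumptions made for illustration, not the report's actual counting code; with the real data the lines would presumably come from `readLines()` on each corpus file.

```r
# Hypothetical helper (not from the report): summarise a character vector
# of lines. Words are approximated as whitespace-separated tokens.
summarise_lines <- function(lines) {
  # Split each line on runs of whitespace and count the non-empty tokens
  tokens <- strsplit(trimws(lines), "\\s+")
  words_per_line <- vapply(tokens, function(x) sum(nzchar(x)), integer(1))
  data.frame(
    Lines     = length(lines),
    Words     = sum(words_per_line),
    Avg_Words = round(mean(words_per_line), 1),
    Max_Words = max(words_per_line)
  )
}

# Toy input standing in for one corpus file
toy_lines <- c("the quick brown fox", "hello world", "a")
toy_summary <- summarise_lines(toy_lines)
```

A different tokeniser (for example one that strips punctuation) would give slightly different word counts, so the exact numbers depend on that choice.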

Corpus Size

library(ggplot2)
df_long <- data.frame(
  Source = rep(c("Blogs","News","Twitter"), 2),
  Metric = c(rep("Lines (M)", 3), rep("Words (M)", 3)),
  Value  = c(0.90, 1.01, 2.36, 37.3, 34.4, 30.4)
)
ggplot(df_long, aes(x=Source, y=Value, fill=Metric)) +
  geom_bar(stat="identity", position="dodge") +
  labs(title="Corpus Size by Source", y="Count (millions)") +
  theme_minimal()

Word Frequency

freq_df <- data.frame(
  Source     = c("Blogs","News","Twitter"),
  Vocab_Size = c(66065, 63557, 37451),
  Words_50pct = c(105, 190, 125),
  Words_90pct = c(6095, 7579, 4955)
)
knitr::kable(freq_df, format.args=list(big.mark=","),
             caption = "Vocabulary and coverage statistics (50k-line sample)")

Vocabulary and coverage statistics (50k-line sample)
Source	Vocab_Size	Words_50pct	Words_90pct
Blogs	66,065	105	6,095
News	63,557	190	7,579
Twitter	37,451	125	4,955
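
The Words_50pct and Words_90pct columns count how many of the most frequent words are needed to reach a given share of all word occurrences. A minimal sketch of that calculation follows; the function name `words_for_coverage` and the toy token vector are illustrative assumptions, and tokenisation is assumed to have happened upstream.

```r
# Number of distinct words needed to cover `target` (a share between 0 and 1)
# of all tokens. `tokens` is a character vector of word tokens.
words_for_coverage <- function(tokens, target) {
  freq <- sort(table(tokens), decreasing = TRUE)        # most frequent first
  cum_share <- cumsum(as.numeric(freq)) / length(tokens)
  which(cum_share >= target)[1]                         # first rank reaching target
}

# Toy example: 10 tokens, 4 distinct words
toy_tokens <- c(rep("the", 5), rep("of", 3), "cat", "dog")
n50 <- words_for_coverage(toy_tokens, 0.5)   # "the" alone covers 50%
n90 <- words_for_coverage(toy_tokens, 0.9)   # "the", "of" and one more word
```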

Key Findings

Roughly 100 to 200 words cover 50% of all word occurrences in each of the three sources (105 for blogs, 190 for news, 125 for Twitter), consistent with Zipf’s law.
About 5,000 to 7,600 words cover 90%, so a compact vocabulary can handle most everyday language.
Twitter differs noticeably: the first-person pronoun “I” ranks among its top 3 words, reflecting its conversational nature.
The most common bigrams are preposition-plus-article pairs: “of the”, “in the”, “to the”.

Prediction Algorithm Plan

The model will use an n-gram backoff approach:

Build unigram, bigram, trigram, and 4-gram frequency tables from cleaned text.
For a given input, look up the 4-gram whose first three words match the end of the input. If none is found, back off to trigram → bigram → unigram.
Apply smoothing so that unseen word combinations still receive a non-zero score.
Store the frequency tables as compressed R objects for fast loading in Shiny.
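
The backoff lookup described in the steps above can be sketched in base R as follows. This is an illustration under stated assumptions, not the planned implementation: the function names `build_ngrams` and `predict_next`, the toy corpus, and the data-frame layout are all hypothetical, and the sketch does plain count-based backoff with no smoothing or discounting, since the plan does not yet name a smoothing method.

```r
# Build a frequency table of n-grams from a list of token vectors.
# Returns a data frame with columns: context (first n-1 words),
# next_word (the n-th word), and count.
build_ngrams <- function(token_list, n) {
  grams <- unlist(lapply(token_list, function(tok) {
    if (length(tok) < n) return(character(0))
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts, function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- table(grams)
  parts <- strsplit(names(counts), " ", fixed = TRUE)
  data.frame(
    context   = vapply(parts, function(p) paste(p[-n], collapse = " "),
                       character(1)),
    next_word = vapply(parts, function(p) p[n], character(1)),
    count     = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# Try the highest-order table first, then back off to shorter contexts.
# `tables[[n]]` must hold the n-gram table; returns up to k candidate words.
predict_next <- function(input, tables, k = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  for (n in length(tables):1) {
    m <- n - 1                                  # context length for order n
    if (length(words) < m) next                 # input too short for this order
    ctx <- paste(tail(words, m), collapse = " ")
    hits <- tables[[n]][tables[[n]]$context == ctx, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$count), ]        # most frequent first
      return(head(hits$next_word, k))
    }
  }
  character(0)
}

# Toy corpus (already cleaned and tokenised; an assumption for the example)
corpus <- list(
  c("i", "want", "to", "go", "home"),
  c("i", "want", "to", "go", "out"),
  c("i", "want", "to", "eat"),
  c("you", "want", "to", "go", "home")
)
tables <- lapply(1:4, function(n) build_ngrams(corpus, n))

seen   <- predict_next("I want to", tables)     # matched in the 4-gram table
unseen <- predict_next("they need to", tables)  # backs off to the bigram table
```

Scanning a data frame on every lookup is fine for a toy corpus; with real tables a keyed structure (for example a hashed environment or an indexed table) would likely be needed to keep lookups fast.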

Shiny App Plan

A text input box where users type a phrase.
The app displays the top 3 predicted next words as clickable buttons.
A compact footprint so the app runs on shinyapps.io (under 1 GB RAM, under 100 ms per prediction).
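
One way to check the footprint and latency targets before deployment is sketched below. The named-list model, its contents, and the choice of "xz" compression are assumptions made for the example; the real tables would be far larger and may use a different structure.

```r
# Toy stand-in for a trained model: a named list mapping a context to its
# top predicted words (hypothetical structure, for illustration only).
toy_model <- list(
  "i want to" = c("go", "eat", "be"),
  "of the"    = c("year", "day", "world")
)

# Save as a compressed R object and reload it, as the app might at startup
model_path <- tempfile(fileext = ".rds")
saveRDS(toy_model, model_path, compress = "xz")
loaded <- readRDS(model_path)

# In-memory size in MB and elapsed time of one lookup in milliseconds
size_mb    <- as.numeric(object.size(loaded)) / 1024^2
elapsed_ms <- system.time(pred <- loaded[["i want to"]])[["elapsed"]] * 1000
```

Comparing `size_mb` and `elapsed_ms` against the 1 GB and 100 ms targets on the full tables gives an early signal of whether the n-gram tables need pruning.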