Introduction

The goal of this project is to build a next-word prediction application, similar to autocomplete on a smartphone keyboard. This report summarises the exploratory analysis of the HC Corpora training data and outlines the plan for the prediction algorithm and the Shiny app.

Data Overview

summary_df <- data.frame(
  Source    = c("Blogs", "News", "Twitter"),
  Size_MB   = c(201, 197, 160),
  Lines     = c(899288, 1010242, 2360148),
  Words     = c(37334117, 34365936, 30373559),
  Avg_Words = c(41.5, 34.0, 12.9),
  Max_Words = c(6630, 1792, 47)
)
knitr::kable(summary_df, format.args = list(big.mark = ","),
             caption = "Summary statistics for en_US corpus files")
Summary statistics for en_US corpus files

Source   Size_MB  Lines      Words       Avg_Words  Max_Words
Blogs    201      899,288    37,334,117  41.5       6,630
News     197      1,010,242  34,365,936  34.0       1,792
Twitter  160      2,360,148  30,373,559  12.9       47
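
Line and word counts like those in the table can be derived directly from the raw lines of each file. The following is a minimal sketch, not the code used for this report: it assumes words are separated by whitespace (the report does not state which tokeniser was used), and `corpus_stats` is a hypothetical helper name.

```r
# Hypothetical helper: line and word statistics for one source.
# Assumption: a "word" is any run of non-whitespace characters;
# the original analysis may have tokenised differently.
corpus_stats <- function(lines) {
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    Lines     = length(lines),
    Words     = sum(words_per_line),
    Avg_Words = round(mean(words_per_line), 1),
    Max_Words = max(words_per_line)
  )
}

# Toy input standing in for the lines of one corpus file
toy <- c("the quick brown fox", "hello world", "a b c d e f")
stats <- corpus_stats(toy)
print(stats)
```

In practice the input would be the character vector returned by `readLines()` for one of the corpus files, and the per-source rows would be bound together to form a table like the one above.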

Corpus Size

library(ggplot2)

# Long format: one row per source and metric, values in millions
df_long <- data.frame(
  Source = rep(c("Blogs", "News", "Twitter"), 2),
  Metric = c(rep("Lines (M)", 3), rep("Words (M)", 3)),
  Value  = c(0.90, 1.01, 2.36, 37.3, 34.4, 30.4)
)

# Side-by-side bars for lines and words per source;
# geom_col() is the shorthand for geom_bar(stat = "identity")
ggplot(df_long, aes(x = Source, y = Value, fill = Metric)) +
  geom_col(position = "dodge") +
  labs(title = "Corpus Size by Source", y = "Count (millions)") +
  theme_minimal()

Word Frequency

freq_df <- data.frame(
  Source     = c("Blogs","News","Twitter"),
  Vocab_Size = c(66065, 63557, 37451),
  Words_50pct = c(105, 190, 125),
  Words_90pct = c(6095, 7579, 4955)
)
knitr::kable(freq_df, format.args=list(big.mark=","),
             caption = "Vocabulary and coverage statistics (50k-line sample)")
Vocabulary and coverage statistics (50k-line sample)

Source   Vocab_Size  Words_50pct  Words_90pct
Blogs    66,065      105          6,095
News     63,557      190          7,579
Twitter  37,451      125          4,955
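
The coverage columns are read here as the number of distinct words, taken in order of decreasing frequency, needed to account for 50% or 90% of all word occurrences. A minimal sketch of that calculation follows; the interpretation is an assumption, and `words_for_coverage` is a hypothetical helper, not a function from the report.

```r
# Hypothetical helper: how many of the most frequent distinct words
# are needed to cover a given share of all word occurrences.
words_for_coverage <- function(tokens, target) {
  freq <- sort(table(tokens), decreasing = TRUE)
  cum_share <- cumsum(as.numeric(freq)) / length(tokens)
  # Index of the first word at which the cumulative share reaches the target
  which(cum_share >= target)[1]
}

# Toy token vector: "the" x4, "of" x3, "a" x2, "cat" x1, "dog" x1
tokens <- c(rep("the", 4), rep("of", 3), rep("a", 2), "cat", "dog")
n50 <- words_for_coverage(tokens, 0.5)
n90 <- words_for_coverage(tokens, 0.9)
```

On the toy vector, two words are enough for 50% coverage and four are needed for 90%, which mirrors the pattern in the table: a small number of very frequent words accounts for half of the text, while the long tail requires far more.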

Key Findings

Prediction Algorithm Plan

The model will use an n-gram backoff approach:

  1. Build unigram, bigram, trigram, and 4-gram frequency tables from cleaned text.
  2. For a given input, look up the matching 4-gram. If not found, back off to trigram → bigram → unigram.
  3. Apply smoothing to handle unseen word combinations.
  4. Store as compressed R objects for fast loading in Shiny.
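
Steps 1 and 2 above can be sketched in base R as follows. This is an illustrative sketch under stated assumptions, not the planned implementation: it uses plain backoff with no smoothing or backoff weights (step 3 is omitted), tokenises on whitespace, and the helper names `build_ngrams` and `predict_next` are hypothetical. The prefix scan with `startsWith()` is chosen for readability and would likely be too slow for the full corpus.

```r
# Step 1 (sketch): count n-grams of order n, most frequent first.
build_ngrams <- function(sentences, n) {
  grams <- unlist(lapply(strsplit(tolower(sentences), "\\s+"), function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

# Step 2 (sketch): try the highest order first, then back off.
# `tables` is a list where tables[[n]] holds the n-gram counts.
predict_next <- function(input, tables) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  for (n in length(tables):2) {
    k <- n - 1                      # context length for this order
    if (length(words) < k) next     # not enough input words
    context <- paste(tail(words, k), collapse = " ")
    grams <- names(tables[[n]])
    hits <- grams[startsWith(grams, paste0(context, " "))]
    if (length(hits) > 0) {
      # Tables are sorted by count, so the first hit is the most frequent
      return(sub(".* ", "", hits[1]))
    }
  }
  names(tables[[1]])[1]             # final fallback: most frequent unigram
}

# Toy corpus standing in for the cleaned training text
corpus <- c("i want to go home", "i want to go out", "i want to eat",
            "you want to go home", "thanks to you")
tables <- lapply(1:4, function(n) build_ngrams(corpus, n))

p_4gram   <- predict_next("i want to", tables)     # found at the 4-gram level
p_bigram  <- predict_next("they need to", tables)  # backs off to bigrams
p_unigram <- predict_next("zebra", tables)         # backs off to unigrams
```

For step 4, a list like `tables` could be written with `saveRDS()`, which compresses by default, and read back with `readRDS()` when the app starts.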

Shiny App Plan