The goal of this project is to build a next-word prediction application, similar to the autocomplete on a smartphone keyboard. This report summarises an exploratory analysis of the HC Corpora training data and outlines the plan for the prediction algorithm and the Shiny app.
summary_df <- data.frame(
  Source    = c("Blogs", "News", "Twitter"),
  Size_MB   = c(201, 197, 160),
  Lines     = c(899288, 1010242, 2360148),
  Words     = c(37334117, 34365936, 30373559),
  Avg_Words = c(41.5, 34.0, 12.9),
  Max_Words = c(6630, 1792, 47)
)
knitr::kable(summary_df, format.args = list(big.mark = ","),
             caption = "Summary statistics for en_US corpus files")
| Source | Size_MB | Lines | Words | Avg_Words | Max_Words |
|---|---|---|---|---|---|
| Blogs | 201 | 899,288 | 37,334,117 | 41.5 | 6,630 |
| News | 197 | 1,010,242 | 34,365,936 | 34.0 | 1,792 |
| Twitter | 160 | 2,360,148 | 30,373,559 | 12.9 | 47 |
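The report does not show how the per-file figures in the table above were computed. As a minimal sketch, assuming words are simply whitespace-delimited tokens, counts of this kind could be derived from a character vector of lines in base R. The helper name `summarise_lines` and the toy input are hypothetical, not part of the original analysis.

```r
# Hypothetical helper (not from the original report): line and word counts
# for a character vector, treating any run of whitespace as a word separator.
summarise_lines <- function(lines) {
  # trimws() avoids counting an empty leading token on indented lines
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    Lines     = length(lines),
    Words     = sum(words_per_line),
    Avg_Words = round(mean(words_per_line), 1),
    Max_Words = max(words_per_line)
  )
}

# Toy input standing in for the output of readLines() on a corpus file
toy <- c("the quick brown fox", "hello world", "  one more short line ")
toy_stats <- summarise_lines(toy)
print(toy_stats)
```

On the real files the same function would be applied to the result of `readLines()`; a proper tokenizer would give somewhat different word counts than plain whitespace splitting.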
library(ggplot2)
df_long <- data.frame(
  Source = rep(c("Blogs", "News", "Twitter"), 2),
  Metric = c(rep("Lines (M)", 3), rep("Words (M)", 3)),
  Value  = c(0.90, 1.01, 2.36, 37.3, 34.4, 30.4)
)
ggplot(df_long, aes(x = Source, y = Value, fill = Metric)) +
  geom_col(position = "dodge") +
  labs(title = "Corpus Size by Source", y = "Count (millions)") +
  theme_minimal()
freq_df <- data.frame(
  Source      = c("Blogs", "News", "Twitter"),
  Vocab_Size  = c(66065, 63557, 37451),
  Words_50pct = c(105, 190, 125),
  Words_90pct = c(6095, 7579, 4955)
)
knitr::kable(freq_df, format.args = list(big.mark = ","),
             caption = "Vocabulary and coverage statistics (50k-line sample)")
| Source | Vocab_Size | Words_50pct | Words_90pct |
|---|---|---|---|
| Blogs | 66,065 | 105 | 6,095 |
| News | 63,557 | 190 | 7,579 |
| Twitter | 37,451 | 125 | 4,955 |
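The `Words_50pct` and `Words_90pct` columns read as the number of distinct words, taken in descending order of frequency, needed to cover 50% and 90% of all word occurrences in the sample. The sketch below illustrates that calculation under this interpretation; the function name `words_for_coverage` and the toy token vector are hypothetical, and the original tokenization is not shown in the report.

```r
# Hypothetical helper (not from the original report): how many of the most
# frequent word types are needed to cover a `target` share of all tokens.
words_for_coverage <- function(tokens, target) {
  freq <- sort(table(tokens), decreasing = TRUE)          # type frequencies
  cum_share <- cumsum(as.numeric(freq)) / length(tokens)  # cumulative coverage
  which(cum_share >= target)[1]                           # first rank reaching target
}

# Toy token vector: "the" x5, "of" x3, "cat" x1, "dog" x1
toy_tokens <- c(rep("the", 5), rep("of", 3), "cat", "dog")
cover_50 <- words_for_coverage(toy_tokens, 0.5)
cover_90 <- words_for_coverage(toy_tokens, 0.9)
```

In the toy data a single word already accounts for half of the tokens, which mirrors the pattern in the table: a few hundred words cover half of each sample, while thousands are needed to reach 90%.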
The model will use an n-gram backoff approach: