Milestone Report Narrative and Strategy

The exploratory analysis of the SwiftKey corpus – focusing on the top 10 most frequent unigrams, bigrams, and trigrams in the blogs, news, and twitter datasets – already reveals the core patterns needed for a highly effective next-word prediction algorithm.

These top 10 n-grams are not just the most common phrases; they represent the vast majority of everyday language use in each source. For example, bigram classics such as “of the,” “in the,” “to be,” and source-specific gems like “I don’t” (twitter) or “said the” (news) dominate real-world text. Similarly, the top trigrams (“at the end of,” “one of the most,” “as well as the”) capture the fixed expressions people write repeatedly.

Prediction Strategy (built directly on these findings):

Create three lightweight frequency tables from the processed data:
- a trigram table (highest context)
- a bigram table (fallback)
- a unigram table (final fallback)
Implement Stupid Backoff (Katz backoff works too, but Stupid is faster and sufficient):
- Given the last two words typed by the user, look for matching trigrams in the top-frequency table.
- If none exist or count is zero → divide the trigram score by ~0.4 and fall back to the corresponding bigram.
- If still no match → fall back to the most frequent unigrams.
Return the top 3–5 candidate words ranked by (backed-off) score.

Because the top 10 n-grams already account for a surprisingly large proportion of actual text (often >15–20 % of all bigrams/trigrams), this simple highest-order-first approach yields excellent real-world accuracy while remaining blazing-fast and tiny in memory – easily under 5 MB even with millions of n-grams stored. This makes it ideal for deployment in a Shiny app where low latency is critical.

In short: the top 10 n-grams you are seeing right now are the foundation of the final prediction engine – no complex neural networks required.

Top N-gram Visualizations

Here are the key frequency plots from the three corpora (blogs, news, twitter):

Top 10 Words

Top 10 most frequent words across sources

Top 10 Bigrams

Top 10 most common bigrams

Top 10 Trigrams

Top 10 most common trigrams

Summary Metrics

Metric	Value	Source
SwiftKey Corpus Summary Statistics
Comparison of blogs, news, and twitter datasets
Source File	NA	blogs
Lines (raw)	899,288	blogs
Lines (sampled)	899,288	blogs
Words (raw)	38,309,620	blogs
Words (cleaned)	NA	blogs
Characters (cleaned)	NA	blogs
Unique words	294,203	blogs
Avg words per line	NA	blogs
Source File	NA	news
Lines (raw)	77,259	news
Lines (sampled)	77,259	news
Words (raw)	2,741,594	news
Words (cleaned)	1,509,739	news
Characters (cleaned)	10,759,049	news
Unique words	79,099	news
Avg words per line	20	news
Source File	NA	twitter
Lines (raw)	2,360,148	twitter
Lines (sampled)	2,360,148	twitter
Words (raw)	31,003,501	twitter
Words (cleaned)	NA	twitter
Characters (cleaned)	NA	twitter
Unique words	329,493	twitter
Avg words per line	NA	twitter
Source: Processed en_US corpora (2025)

Capstone Project Milestone Report

Anthony Acaldo

2025-12-02