Milestone Report Narrative and Strategy

The exploratory analysis of the SwiftKey corpus – focusing on the top 10 most frequent unigrams, bigrams, and trigrams in the blogs, news, and twitter datasets – already reveals the core patterns needed for a highly effective next-word prediction algorithm.

These top 10 n-grams are not just the most common phrases; they represent the vast majority of everyday language use in each source. For example, bigram classics such as “of the,” “in the,” “to be,” and source-specific gems like “I don’t” (twitter) or “said the” (news) dominate real-world text. Similarly, the top trigrams (“at the end of,” “one of the most,” “as well as the”) capture the fixed expressions people write repeatedly.

Prediction Strategy (built directly on these findings):

  1. Create three lightweight frequency tables from the processed data:
    • a trigram table (highest context)
    • a bigram table (fallback)
    • a unigram table (final fallback)
  2. Implement Stupid Backoff (Katz backoff works too, but Stupid is faster and sufficient):
    • Given the last two words typed by the user, look for matching trigrams in the top-frequency table.
    • If none exist or count is zero → divide the trigram score by ~0.4 and fall back to the corresponding bigram.
    • If still no match → fall back to the most frequent unigrams.
  3. Return the top 3–5 candidate words ranked by (backed-off) score.

Because the top 10 n-grams already account for a surprisingly large proportion of actual text (often >15–20 % of all bigrams/trigrams), this simple highest-order-first approach yields excellent real-world accuracy while remaining blazing-fast and tiny in memory – easily under 5 MB even with millions of n-grams stored. This makes it ideal for deployment in a Shiny app where low latency is critical.

In short: the top 10 n-grams you are seeing right now are the foundation of the final prediction engine – no complex neural networks required.

Top N-gram Visualizations

Here are the key frequency plots from the three corpora (blogs, news, twitter):

Top 10 Words

Top 10 most frequent words across sources
Top 10 most frequent words across sources

Top 10 Bigrams

Top 10 most common bigrams
Top 10 most common bigrams

Top 10 Trigrams

Top 10 most common trigrams
Top 10 most common trigrams

Summary Metrics

SwiftKey Corpus Summary Statistics
Comparison of blogs, news, and twitter datasets
Metric Value Source
Source File NA blogs
Lines (raw) 899,288 blogs
Lines (sampled) 899,288 blogs
Words (raw) 38,309,620 blogs
Words (cleaned) NA blogs
Characters (cleaned) NA blogs
Unique words 294,203 blogs
Avg words per line NA blogs
Source File NA news
Lines (raw) 77,259 news
Lines (sampled) 77,259 news
Words (raw) 2,741,594 news
Words (cleaned) 1,509,739 news
Characters (cleaned) 10,759,049 news
Unique words 79,099 news
Avg words per line 20 news
Source File NA twitter
Lines (raw) 2,360,148 twitter
Lines (sampled) 2,360,148 twitter
Words (raw) 31,003,501 twitter
Words (cleaned) NA twitter
Characters (cleaned) NA twitter
Unique words 329,493 twitter
Avg words per line NA twitter
Source: Processed en_US corpora (2025)