SwiftKey Next-Word Prediction

Rahul Vijayraghavan
March 2026

Johns Hopkins Data Science Capstone — Course 10

Live App: https://rahulvijay97.shinyapps.io/NextWordPredictor/

The Challenge

Predicting the next word is harder than it looks

  • The English vocabulary has 170,000+ words — an enormous search space
  • Context matters: “I'm going to the ___” → bank? store? doctor?
  • The corpus spans 558 MB of blogs, news articles, and tweets
  • Most 3-word sequences are never seen in training data (data sparsity)

Even a 50,000-word vocabulary allows 50,000^3 ≈ 1.25 × 10^14 possible trigrams, yet a 60M-word corpus observes only ~5M of them. The model must always return an answer, even for unseen phrases.

Solution: Train on the full corpus, then back off gracefully using a frequency-weighted fallback chain.

The Algorithm: Stupid Backoff

Trigram model with Stupid Backoff (Brants et al., 2007)

Given user input "the cat sat":
  1. Look up ("cat", "sat") → w3  in trigrams   [score = count]
  2. If < 3 found: fall back to ("sat") → w3 in bigrams  [score = count × 0.4]
  3. If still < 3: return top unigrams            [score = count × 0.16]
  Always return top 3 by score — de-duplicated.
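
The three steps above can be sketched in base R as follows. This is a minimal illustration, not the app's actual code: the toy tables and their counts are invented, score_level is a hypothetical helper, and the tokenizer is a simplifying assumption. Following Brants et al. (2007), each level is scored by relative frequency rather than raw count, with the context count approximated by the sum over the matching rows. Within a single level this gives the same ranking as the raw counts shown above.

```r
# Toy n-gram tables with invented counts, for illustration only
unigrams <- data.frame(w = c("the", "to", "and", "on"),
                       n = c(50, 30, 20, 10), stringsAsFactors = FALSE)
bigrams  <- data.frame(w1 = c("sat", "sat", "sat", "cat"),
                       w  = c("on", "down", "in", "sat"),
                       n  = c(8, 4, 2, 6), stringsAsFactors = FALSE)
trigrams <- data.frame(w1 = c("cat", "cat"), w2 = c("sat", "sat"),
                       w  = c("on", "by"),
                       n  = c(3, 1), stringsAsFactors = FALSE)

ALPHA <- 0.4  # back-off factor; 0.4^2 = 0.16 at the unigram level

# Hypothetical helper: sort one level by count and attach a score equal to
# discount * relative frequency among the matching rows
score_level <- function(rows, discount) {
  rows <- rows[order(-rows$n), ]
  data.frame(w = rows$w,
             score = discount * rows$n / sum(rows$n),
             stringsAsFactors = FALSE)
}

predict_next_word <- function(phrase, n_suggestions = 3) {
  # Assumed tokenizer: lower-case, split on anything but letters/apostrophes
  words <- strsplit(tolower(phrase), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  k  <- length(words)
  w2 <- if (k >= 1) words[k] else ""
  w1 <- if (k >= 2) words[k - 1] else ""

  # Candidates in back-off order: trigram, then bigram, then unigram
  cand <- rbind(
    score_level(trigrams[trigrams$w1 == w1 & trigrams$w2 == w2, ], 1),
    score_level(bigrams[bigrams$w1 == w2, ], ALPHA),
    score_level(unigrams, ALPHA^2)
  )
  # Keep the highest-order occurrence of each word, then take the first few
  cand <- cand[!duplicated(cand$w), ]
  head(cand$w, n_suggestions)
}

predict_next_word("the cat sat")  # "on" "by" "down"
predict_next_word("zzz qqq")      # "the" "to" "and"
```

All three levels are computed eagerly here for brevity. Because candidates are concatenated in back-off order before de-duplication, bigram and unigram words only fill the slots that the trigram level leaves empty.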

Model trained on the full HC Corpora English dataset:

Table       Rows kept   Compressed size
Unigrams    50,000      ~3 MB
Bigrams     500,000     ~20 MB
Trigrams    500,000     ~28 MB
Total                   < 55 MB

Zipf's Law: the top 50,000 words cover > 99% of all word instances. Rare n-grams (count < 2) are pruned — cutting the tables ~65% with no accuracy loss.
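
Both ideas in this note, vocabulary coverage and pruning of rare n-grams, can be checked with a few lines of base R. The sketch below runs on a nine-token toy sentence, so its numbers say nothing about the real corpus; coverage_top_k is a hypothetical helper name, and the threshold of 2 mirrors the "count < 2" rule stated above.

```r
# Toy token stream standing in for the cleaned corpus
tokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat", "ran")

# Share of all word instances covered by the k most frequent words
coverage_top_k <- function(counts, k) {
  sorted <- sort(as.vector(counts), decreasing = TRUE)
  sum(head(sorted, k)) / sum(sorted)
}

unigram_counts <- table(tokens)
cov2 <- coverage_top_k(unigram_counts, 2)   # "the" (3) + "cat" (2) of 9 tokens

# Count bigrams by pasting each token to its successor
bigram_counts <- table(paste(head(tokens, -1), tail(tokens, -1)))

# Prune n-grams seen fewer than 2 times
min_count <- 2
pruned <- bigram_counts[bigram_counts >= min_count]

# Fraction of distinct bigrams removed by pruning
removed_share <- 1 - length(pruned) / length(bigram_counts)
```

On real data the same two quantities (coverage at k = 50,000 and the share of rows removed) would be the ones to compare against the figures quoted above.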

The App: SwiftKey Predictor

Features

  • Predictions update 400 ms after you stop typing — no submit needed
  • “Predict Next Word” button for instant on-demand prediction
  • Click any of the 3 prediction badges to append that word to your phrase
  • Always returns an answer: empty or out-of-vocabulary (OOV) input falls back to the most frequent English words

# Under the hood: one function call per debounced update
predict_next_word("I went to the", n_suggestions = 3)
# [1] "store"  "hospital"  "gym"
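
The 400 ms delay in the feature list is a debounce, which Shiny provides through its debounce() function. The sketch below shows one way such wiring could look; it is an assumption about the structure, not the deployed app. The input and output ids (phrase, suggestions) are invented, and predict_next_word is replaced by a stub that returns fixed words.

```r
library(shiny)

# Hypothetical stand-in for the real predictor: always returns fixed words
predict_next_word <- function(phrase, n_suggestions = 3) {
  head(c("the", "to", "and"), n_suggestions)
}

ui <- fluidPage(
  textInput("phrase", "Type a phrase"),
  textOutput("suggestions")
)

server <- function(input, output, session) {
  # Pass the text on only after it has been unchanged for 400 ms
  typed <- debounce(reactive(input$phrase), 400)

  output$suggestions <- renderText({
    paste(predict_next_word(typed()), collapse = " | ")
  })
}

# shinyApp(ui, server)  # uncomment to launch locally
```

With this arrangement the predictor runs once per pause in typing rather than once per keystroke, which keeps the lookup count low on slow connections.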

Tech stack: R · Shiny · shinythemes (flatly) · data.table · stringr

Results & Try It Live

Accuracy on a 10% held-out test set (standard n-gram baseline):

Metric            Score
Top-1 accuracy    ~15–18%
Top-3 accuracy    ~25–32%
Prediction time   < 10 ms

Competitive with published trigram baselines; fast enough for real-time UX.
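
For readers who want to reproduce metrics of this kind, top-k accuracy is the share of held-out cases whose true next word appears among the first k suggestions. The sketch below is a hedged illustration: top_k_accuracy and toy_predict are hypothetical names, the four test cases are invented, and the stub predictor stands in for the real model.

```r
# Share of test cases whose true next word is among the top-k suggestions
top_k_accuracy <- function(predict_fn, contexts, targets, k) {
  hits <- vapply(seq_along(contexts), function(i) {
    targets[i] %in% head(predict_fn(contexts[i]), k)
  }, logical(1))
  mean(hits)
}

# Stub predictor with fixed outputs, standing in for the trained model
toy_predict <- function(phrase) {
  if (grepl("new$", phrase)) c("year", "york", "one") else c("the", "to", "and")
}

# Invented held-out cases: context phrase and the word that actually followed
contexts <- c("happy new", "i live in new", "thanks for all", "when i was")
targets  <- c("year", "york", "the", "a")

top1 <- top_k_accuracy(toy_predict, contexts, targets, k = 1)  # 0.5
top3 <- top_k_accuracy(toy_predict, contexts, targets, k = 3)  # 0.75

# Elapsed seconds for 1,000 calls equals average milliseconds per call
ms_per_call <- system.time(
  for (i in 1:1000) toy_predict("happy new")
)[["elapsed"]]
```

Timing over many calls rather than a single one smooths out timer resolution, which matters when individual lookups take well under a millisecond.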

Open the app and test these five phrases:

  1. “I want to go to the”
  2. “Happy new”
  3. “The president of the United”
  4. “Thanks for all the”
  5. “When I was a”

Live app: https://rahulvijay97.shinyapps.io/NextWordPredictor/

Source code available on GitHub upon request.