SwiftKey Next-Word Prediction

Rahul Vijayraghavan
March 2026

Johns Hopkins Data Science Capstone — Course 10

Live App: https://rahulvijay97.shinyapps.io/NextWordPredictor/

The Challenge

Predicting the next word is harder than it looks

The English vocabulary has 170,000+ words — an enormous search space
Context matters: “I'm going to the ___” → bank? store? doctor?
The corpus spans 558 MB of blogs, news articles, and tweets
Most 3-word sequences are never seen in training data (data sparsity)

A 170,000-word vocabulary allows ~4.9 × 10^15 (170,000³) possible trigrams, yet a 60M-word corpus observes only ~5M. The model must always return an answer, even for unseen phrases.

Solution: Train on the full corpus, then back off gracefully using a frequency-weighted fallback chain.

The Algorithm: Stupid Backoff

Trigram model with Stupid Backoff (Brants et al., 2007)

Given user input "the cat sat":
  1. Look up ("cat", "sat") → w3  in trigrams   [score = count]
  2. If < 3 found: fall back to ("sat") → w3 in bigrams  [score = count × 0.4]
  3. If still < 3: return top unigrams            [score = count × 0.16]
  Always return top 3 by score — de-duplicated.
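
The three steps above can be sketched in a few lines of base R. This is a minimal illustration, not the app's actual code: the function name predict_sketch, the lambda argument, and the toy tables with their counts are all hypothetical, and base R data frames stand in for the data.table lookups the app uses.

```r
# Toy n-gram tables (hypothetical counts, for illustration only).
unigrams <- data.frame(word = c("the", "to", "and", "mat"),
                       count = c(100, 80, 60, 5),
                       stringsAsFactors = FALSE)
bigrams <- data.frame(w1 = c("sat", "sat", "sat"),
                      w2 = c("on", "down", "in"),
                      count = c(120, 60, 30),
                      stringsAsFactors = FALSE)
trigrams <- data.frame(w1 = c("cat", "cat"),
                       w2 = c("sat", "sat"),
                       w3 = c("on", "down"),
                       count = c(50, 20),
                       stringsAsFactors = FALSE)

predict_sketch <- function(phrase, n_suggestions = 3, lambda = 0.4) {
  # Lower-case, trim, and split the phrase on whitespace.
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)
  cand <- character(0)
  score <- numeric(0)

  # Step 1: trigram lookup on the last two words, score = count.
  if (n >= 2) {
    hit <- trigrams[trigrams$w1 == words[n - 1] & trigrams$w2 == words[n], ]
    cand <- c(cand, hit$w3)
    score <- c(score, hit$count)
  }
  # Step 2: bigram lookup on the last word, score = count * 0.4.
  if (n >= 1 && length(unique(cand)) < n_suggestions) {
    hit <- bigrams[bigrams$w1 == words[n], ]
    cand <- c(cand, hit$w2)
    score <- c(score, hit$count * lambda)
  }
  # Step 3: unigram fallback, score = count * 0.16.
  if (length(unique(cand)) < n_suggestions) {
    cand <- c(cand, unigrams$word)
    score <- c(score, unigrams$count * lambda^2)
  }
  # Rank by score; unique() keeps the best-scoring copy of each word.
  ranked <- cand[order(score, decreasing = TRUE)]
  head(unique(ranked), n_suggestions)
}

predict_sketch("the cat sat")  # "on" "down" "in"
predict_sketch("")             # "the" "to" "and" (unigram fallback)
```

Note that Brants et al. score each candidate by relative frequency (its count divided by the count of its context) before applying the 0.4 factor. The sketch keeps the raw-count scoring stated above.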

Model trained on the full HC Corpora English dataset:

Table	Rows kept	Compressed size
Unigrams	50,000	~3 MB
Bigrams	500,000	~20 MB
Trigrams	500,000	~28 MB
Total		< 55 MB

Zipf's Law: the top 50,000 words cover > 99% of all word instances. Rare n-grams (count < 2) are pruned — cutting the tables ~65% with no accuracy loss.
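
Both ideas, vocabulary coverage and count-based pruning, can be checked with a short base R sketch. The helper names coverage_at and prune_ngrams and the toy counts are assumptions made for illustration; the real tables would be far larger and the app itself uses data.table.

```r
# Hypothetical word counts (they sum to 100 to keep the arithmetic easy).
word_counts <- c(the = 50, of = 25, and = 15, cat = 6, sat = 3, zygote = 1)

# Share of all word instances covered by the k most frequent words.
coverage_at <- function(counts, k) {
  sorted <- sort(counts, decreasing = TRUE)
  sum(head(sorted, k)) / sum(sorted)
}

# Drop n-grams seen fewer than min_count times, then keep the top max_rows.
prune_ngrams <- function(tbl, min_count = 2, max_rows = Inf) {
  kept <- tbl[tbl$count >= min_count, , drop = FALSE]
  kept <- kept[order(kept$count, decreasing = TRUE), , drop = FALSE]
  kept[seq_len(min(nrow(kept), max_rows)), , drop = FALSE]
}

toy_bigrams <- data.frame(w1 = c("of", "in", "cat", "sat", "zygote"),
                          w2 = c("the", "the", "sat", "on", "divides"),
                          count = c(40, 30, 4, 2, 1),
                          stringsAsFactors = FALSE)

coverage_at(word_counts, 3)                          # 0.9
pruned <- prune_ngrams(toy_bigrams, min_count = 2, max_rows = 3)
nrow(pruned)                                         # 3
```

Running coverage_at on the real unigram counts with k = 50,000 would be one way to confirm the coverage figure quoted above.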

The App: SwiftKey Predictor

Features

Predictions update 400 ms after you stop typing — no submit needed
“Predict Next Word” button for instant on-demand prediction
Click any of the 3 prediction badges to append that word to your phrase
Always returns an answer — empty or OOV input falls back to top English words

# Under the hood: one function call each time the input settles (400 ms after the last keystroke)
predict_next_word("I went to the", n_suggestions = 3)
# [1] "store"  "hospital"  "gym"
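
One plausible way to get the 400 ms delay is Shiny's debounce(), which holds back a reactive until it has stopped changing for the given number of milliseconds. The sketch below is an assumption about the wiring, not the app's source: the input and output IDs are invented, and predict_stub is a hypothetical stand-in for predict_next_word so the example runs without the trained tables.

```r
library(shiny)

# Hypothetical stand-in for the real predictor.
predict_stub <- function(phrase, n_suggestions = 3) {
  head(c("the", "to", "and"), n_suggestions)
}

ui <- fluidPage(
  textInput("phrase", "Type a phrase"),
  textOutput("suggestions")
)

server <- function(input, output, session) {
  typed   <- reactive(input$phrase)
  settled <- debounce(typed, 400)  # updates 400 ms after the last change

  output$suggestions <- renderText({
    paste(predict_stub(settled(), n_suggestions = 3), collapse = "  ")
  })
}

# Build the app object; call runApp(app) to launch it.
app <- shinyApp(ui, server)
```

Debouncing means the predictor runs once per pause in typing rather than once per character, which keeps the UI responsive on slow connections.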

Tech stack: R · Shiny · shinythemes (flatly) · data.table · stringr

Try it live: https://rahulvijay97.shinyapps.io/NextWordPredictor/

Results & Try It Live

Accuracy on a 10% held-out test set (standard n-gram baseline):

Metric	Score
Top-1 accuracy	~15 – 18%
Top-3 accuracy	~25 – 32%
Prediction time	< 10 ms

Competitive with published trigram baselines; fast enough for real-time UX.
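
Top-1 and top-3 accuracy like the figures in the table can be computed by checking whether the true next word appears among the first k suggestions. The sketch below assumes a helper named top_k_accuracy and a toy predictor with made-up suggestions; none of the words shown are actual app output.

```r
# Fraction of test cases whose true next word is among the top k suggestions.
top_k_accuracy <- function(predict_fn, contexts, targets, k = 3) {
  hits <- mapply(function(ctx, target) {
    target %in% head(predict_fn(ctx), k)
  }, contexts, targets)
  mean(hits)
}

# Hypothetical predictor backed by a fixed lookup (illustration only).
toy_predict <- function(ctx) {
  lookup <- list("happy new" = c("year", "years", "day"),
                 "thanks for all the" = c("support", "love", "help"),
                 "when i was a" = c("kid", "child", "little"))
  preds <- lookup[[tolower(ctx)]]
  if (is.null(preds)) c("the", "to", "and") else preds
}

contexts <- c("Happy new", "Thanks for all the",
              "When I was a", "I want to go to the")
targets  <- c("year", "love", "teenager", "beach")

top1 <- top_k_accuracy(toy_predict, contexts, targets, k = 1)  # 0.25
top3 <- top_k_accuracy(toy_predict, contexts, targets, k = 3)  # 0.5
```

On the real held-out set, contexts would be the leading words of each test n-gram and targets the word that actually followed.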

Open the app and test these five phrases:

“I want to go to the”
“Happy new”
“The president of the United”
“Thanks for all the”
“When I was a”

Live app: https://rahulvijay97.shinyapps.io/NextWordPredictor/

Source code available on GitHub upon request.