Overview
102M words in training corpus
~8ms average prediction time
4-gram maximum n-gram order (up to 3 words of context)
3 sources: blogs, news, Twitter
The Problem & The Data
Smartphone keyboards need to predict your next word
instantly — before you finish typing. The challenge: build a
model that is both accurate and fast enough to work in real time.
Training Data (HC Corpora)
| Source | Lines | Words |
|---|---|---|
| Blogs | 899K | 37.3M |
| News | 1.01M | 34.4M |
| Twitter | 2.36M | 30.4M |
Key Insight: Zipf’s Law
Just ~150 words cover 50% of all text.
Only ~7,000 words cover 90% of all text.
This means a compact model can handle nearly all everyday language.
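The coverage figures above come from counting how many of the most frequent words are needed to account for a given share of all tokens. A minimal Python sketch of that calculation follows (the app itself is built in R; the function name `words_needed_for_coverage` and the toy corpus are assumptions for illustration, not the project's code or data):

```python
from collections import Counter

def words_needed_for_coverage(tokens, target):
    """Return how many of the most frequent distinct words cover
    at least `target` (a fraction, e.g. 0.5) of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    # Walk the vocabulary from most to least frequent, accumulating tokens.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)

# Toy corpus for illustration only; the slide's figures come from the real corpus.
toy = "the cat sat on the mat and the dog sat on the rug".split()
half = words_needed_for_coverage(toy, 0.5)
most = words_needed_for_coverage(toy, 0.9)
print(half, most)
```

On a real corpus the same function would be called with the full token list and targets of 0.5 and 0.9.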
The Algorithm: Stupid Backoff
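A fast, practical backoff scheme for n-gram models, introduced by Google
researchers for large-scale machine translation. No complex probability
normalisation, just simple score weighting: each backoff step multiplies
the score by a fixed factor of 0.4.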
- Step 1 (Clean input): Lowercase the text and strip URLs, numbers, and punctuation. Extract the last 1–3 words as context.
- Step 2 (4-gram lookup): Search the quadgram table for the last 3 words. Score matching predictions at 1.0×.
- Step 3 (Trigram backoff): If there is no quadgram match, search the trigram table for the last 2 words. Score at 0.4×.
- Step 4 (Bigram backoff): If there is no trigram match, search the bigram table for the last word. Score at 0.16×.
- Step 5 (Unigram fallback): If nothing matches, fall back to the most frequent words, so there is always an answer. Score at 0.064×.
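The five steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the app's R code: the names `clean`, `predict`, `toy_tables`, and `toy_unigrams` are hypothetical, the base scores are assumed to be precomputed relative frequencies, and the cleaning regex keeps only ASCII letters and apostrophes.

```python
import re

ALPHA = 0.4  # backoff factor implied by the 1.0 / 0.4 / 0.16 / 0.064 weights

def clean(text):
    """Step 1: lowercase, strip URLs, numbers and punctuation; return tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z' ]+", " ", text)              # drop digits and punctuation
    return text.split()

def predict(text, tables, unigrams, k=5):
    """Steps 2-5: try the longest context first and back off until a match.

    tables[n] maps a context tuple of n words to {next_word: base_score};
    unigrams is a list of (word, base_score) sorted by score, descending.
    """
    tokens = clean(text)
    for n in (3, 2, 1):  # quadgram, trigram, bigram lookups
        if len(tokens) < n:
            continue
        candidates = tables.get(n, {}).get(tuple(tokens[-n:]))
        if candidates:
            weight = ALPHA ** (3 - n)  # 1.0, 0.4, 0.16
            ranked = sorted(candidates.items(), key=lambda kv: -kv[1])
            return [(word, weight * score) for word, score in ranked[:k]]
    weight = ALPHA ** 3  # 0.064 for the unigram fallback
    return [(word, weight * score) for word, score in unigrams[:k]]

# Toy tables with made-up scores, for illustration only.
toy_tables = {
    3: {("would", "mean", "the"): {"world": 0.6, "same": 0.2}},
    2: {("to", "the"): {"store": 0.5, "park": 0.3}},
    1: {("the",): {"same": 0.1}},
}
toy_unigrams = [("the", 0.05), ("to", 0.03)]

print(predict("It would mean the", toy_tables, toy_unigrams))
print(predict("Go to the", toy_tables, toy_unigrams))
print(predict("zzz", toy_tables, toy_unigrams))
```

The first call matches at the quadgram level, the second backs off to the trigram table, and the third finds no context match and falls through to the unigram list, which is why a prediction is always returned.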
n-gram pruning: min frequency = 3
no probability normalisation
O(1) hash lookup
~8MB model size
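Pruning and hash lookup fit together naturally: count every n-gram, discard the rare ones, and key what remains by its context so that retrieving candidates is a single dictionary access. A rough Python sketch, assuming pre-tokenised sentences (the function name `build_table` and the toy data are hypothetical, and the real app stores its tables in R):

```python
from collections import Counter, defaultdict

MIN_FREQ = 3  # pruning threshold quoted on the slide

def build_table(sentences, n, min_freq=MIN_FREQ):
    """Count n-grams, drop those seen fewer than min_freq times, and index
    the survivors by context: {context_tuple: {next_word: count}}."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    table = defaultdict(dict)
    for gram, freq in counts.items():
        if freq >= min_freq:  # pruning step
            table[gram[:-1]][gram[-1]] = freq
    return dict(table)

# Toy data for illustration: "love cake" occurs once and is pruned.
sentences = [["i", "love", "you"]] * 3 + [["i", "love", "cake"]]
bigrams = build_table(sentences, 2)
print(bigrams)
print(bigrams.get(("love",), {}))  # average-case O(1) dictionary lookup
```

Dropping n-grams seen fewer than three times removes the long tail of one-off combinations, which is where most of the size reduction in a count-based model comes from.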
The Shiny App
How to Use It
- Type any English phrase in the text box
- Click “Predict” (or use the example phrases)
- The top prediction is highlighted in green; click it to append it to your phrase
- Up to 5 alternatives are shown; click any of them to continue building your sentence
Live Predictions
Input phrase: “I want to go to the”
★ store
Alternatives: park, movies, beach, gym
Input phrase: “It would mean the”
★ world
Alternatives: same, most, difference, end
∞ phrase length supported
100% coverage (always predicts)
Why This Matters
What We Built
- Processed 102M+ words from three real-world text sources
- Built a production-ready backoff model in R with sub-10ms response times
- Deployed as a live Shiny app, accessible to anyone with no installation required
- Model is compact (<10MB) and deployable on free-tier infrastructure
Next Steps
- Kneser-Ney smoothing for better probability estimates
- Larger training sample (full corpus vs. 180K sentence sample)
- Perplexity benchmarking on held-out test set
- User personalisation: adapt the model to individual typing patterns