Overview

  • 102M words in training corpus
  • ~8ms average prediction time
  • 4-gram maximum n-gram order (up to 3 words of context)
  • 3 sources: blogs, news, Twitter

The Problem & The Data

Smartphone keyboards need to predict your next word instantly — before you finish typing. The challenge: build a model that is both accurate and fast enough to work in real time.

Training Data (HC Corpora)

Source     Lines    Words
Blogs      899K     37.3M
News       1.01M    34.4M
Twitter    2.36M    30.4M

Key Insight: Zipf’s Law

The ~150 most frequent words account for 50% of all word occurrences in the text.

About 7,000 words account for 90%.

This means a compact model can handle nearly all everyday language.
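
Coverage figures like these can be checked with a cumulative-frequency count. The following is a minimal Python sketch (the app itself is built in R); the function name `words_needed_for_coverage` and the toy token list are illustrative assumptions, not part of the project.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target):
    """Return how many of the most frequent word types are needed so that
    their occurrences make up at least `target` (0..1) of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    # Walk the vocabulary from most to least frequent.
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)


# Toy example (hypothetical data, not the HC Corpora).
toy_tokens = "the cat sat on the mat and the dog sat on the rug".split()
print(words_needed_for_coverage(toy_tokens, 0.5))  # prints 3
```

Run on the real corpus with targets of 0.5 and 0.9, a function like this would yield the kind of vocabulary sizes quoted above.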


The Algorithm: Stupid Backoff

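A fast, practical scoring scheme for n-gram models, introduced by researchers at Google for very large language models. It skips probability normalisation: each time the model backs off to a shorter context, the score is simply multiplied by a fixed weight.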
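
For reference, the scoring rule as it is commonly stated for Stupid Backoff is sketched below. Here $f(\cdot)$ is an n-gram's count in the corpus, $N$ is the total number of tokens, and $\alpha$ is the backoff factor. The weights listed in the steps below (0.4×, 0.16×, 0.064×) are consistent with $\alpha = 0.4$. The slides do not say whether the app scores with relative frequencies or raw counts, so treat this as the textbook form rather than the app's exact formula.

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
\alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{f(w_i)}{N}
```

Because the scores are not normalised, they rank candidates but do not sum to 1 and should not be read as probabilities.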

  • Step 1: Clean input. Lowercase, strip URLs, numbers, and punctuation. Extract the last 1–3 words as context.
  • Step 2: 4-gram lookup. Search the quadgram table for the last 3 words. Score matching predictions at 1.0×.
  • Step 3: Trigram backoff. If there is no quadgram match, search the trigram table for the last 2 words. Score at 0.4×.
  • Step 4: Bigram backoff. If there is no trigram match, search the bigram table for the last word. Score at 0.16×.
  • Step 5: Unigram fallback. If nothing matches, fall back to the most frequent words so there is always an answer. Score at 0.064×.
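
The steps above can be sketched roughly as follows. This is an illustrative Python version, not the app's R code: the table layout (`tables`, `unigrams`), the function names, the regular expressions, and the toy scores are all assumptions made for the example. It returns candidates from the first level that matches, as the steps describe; the real app may combine levels differently.

```python
import re

# Backoff factor; 0.4 matches the 0.4x / 0.16x / 0.064x weights in the steps.
ALPHA = 0.4


def clean(text):
    """Step 1: lowercase, strip URLs, numbers, and punctuation; return tokens."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only a-z and whitespace
    return text.split()


def predict(phrase, tables, unigrams, k=5):
    """Steps 2-5: try the longest context first, then back off.

    `tables` maps context length (3, 2, 1) to a dict of
    context tuple -> {next_word: score}; `unigrams` is {word: score}.
    Both are hypothetical structures chosen for this sketch.
    """
    tokens = clean(phrase)
    for n in (3, 2, 1):
        if len(tokens) < n:
            continue  # not enough words for this context length
        candidates = tables.get(n, {}).get(tuple(tokens[-n:]))
        if candidates:
            weight = ALPHA ** (3 - n)  # 1.0, 0.4, 0.16
            ranked = sorted(candidates.items(), key=lambda kv: -kv[1])
            return [(word, weight * score) for word, score in ranked[:k]]
    # Unigram fallback guarantees an answer.
    ranked = sorted(unigrams.items(), key=lambda kv: -kv[1])
    return [(word, ALPHA ** 3 * score) for word, score in ranked[:k]]


# Toy tables with made-up scores.
tables = {
    3: {("go", "to", "the"): {"store": 0.5, "park": 0.3}},
    2: {("mean", "the"): {"world": 0.6}},
    1: {("the",): {"same": 0.2}},
}
unigrams = {"the": 0.05, "to": 0.03}

print(predict("I want to go to the", tables, unigrams))
```

With these toy tables, the first phrase is answered from the quadgram level, while "It would mean the" has no quadgram match and is answered from the trigram level at 0.4×.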

  • n-gram pruning
  • min frequency = 3
  • no probability normalisation
  • O(1) hash lookup
  • ~8MB model size
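
The pruning and hash-lookup properties listed above can be illustrated with a short Python sketch. It assumes that "min frequency = 3" means n-grams seen fewer than 3 times are discarded, and it uses a plain dictionary keyed by context for average constant-time lookup. The name `build_table` and the toy sentences are hypothetical; the app's actual R data structures are not described in the slides.

```python
from collections import Counter, defaultdict

MIN_FREQ = 3  # pruning threshold quoted above (interpreted as "keep count >= 3")


def build_table(sentences, n, min_freq=MIN_FREQ):
    """Count n-grams over tokenised sentences, drop those seen fewer than
    `min_freq` times, and index the rest by their (n-1)-word context."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1

    table = defaultdict(dict)
    for gram, freq in counts.items():
        if freq >= min_freq:
            # Context tuple -> {next word: count}; dict lookup is O(1) on average.
            table[gram[:-1]][gram[-1]] = freq
    return dict(table)


# Toy corpus: "love cake" appears only twice, so that bigram is pruned.
sentences = [["i", "love", "you"]] * 3 + [["i", "love", "cake"]] * 2
bigrams = build_table(sentences, 2)
print(bigrams)
```

Pruning rare n-grams is what keeps the tables small: most distinct n-grams in a large corpus occur only once or twice, so dropping them removes many entries while losing few useful predictions.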


The Shiny App

How to Use It

  • Type any English phrase in the text box
  • Click “Predict” (or use the example phrases)
  • The top prediction is highlighted in green — click it to append to your phrase
  • Up to 5 alternatives are shown — click any to continue building your sentence

Live Predictions

Input phrase: “I want to go to the”
Top prediction: store
Alternatives: park, movies, beach, gym

Input phrase: “It would mean the”
Top prediction: world
Alternatives: same, most, difference, end

  • <10ms prediction latency
  • 5 alternatives shown
  • phrase length supported
  • 100% coverage (always predicts)
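
A latency figure like the one above can be obtained by timing repeated calls and averaging. The Python sketch below shows one simple way to do that; `mean_latency_ms` and the stand-in `dummy_predict` are hypothetical names for illustration, and the slides do not describe how the app's own timings were measured.

```python
import time


def mean_latency_ms(fn, inputs, repeats=100):
    """Average wall-clock time per call of `fn`, in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        for x in inputs:
            fn(x)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (repeats * len(inputs))


# Stand-in predictor (hypothetical); replace with the real prediction function.
def dummy_predict(phrase):
    return phrase.split()[-1:]


print(f"{mean_latency_ms(dummy_predict, ['i want to go to the']):.4f} ms")
```

Averaging over many calls and several different phrases smooths out timer noise and one-off warm-up effects.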

Why This Matters

What We Built

  • Processed 102M+ words from three real-world text sources
  • Built a production-ready backoff model in R with sub-10ms response times
  • Deployed as a live Shiny app — accessible to anyone, no installation required
  • Model is compact (<10MB) and deployable on free-tier infrastructure

Next Steps

  • Kneser-Ney smoothing for better probability estimates
  • Larger training sample (the full corpus rather than the current 180K-sentence sample)
  • Perplexity benchmarking on held-out test set
  • User personalisation — adapt model to individual typing patterns