Next Word Predictor

Overview

102M

words in training corpus

~8ms

average prediction time

4-gram

maximum context window

3 sources

blogs, news, Twitter

The Problem & The Data

Smartphone keyboards need to predict your next word instantly — before you finish typing. The challenge: build a model that is both accurate and fast enough to work in real time.

Training Data (HC Corpora)

Source	Lines	Words
Blogs	899K	37.3M
News	1.01M	34.4M
Twitter	2.36M	30.4M

Key Insight: Zipf’s Law

Just ~150 words cover 50% of all text.

Only ~7,000 words cover 90% of all text.

This means a compact model can handle nearly all everyday language.

The Algorithm: Stupid Backoff

A fast, practical extension of n-gram models used in production at Google. No complex probability normalisation — just smart score weighting.

Step 1 — Clean input: Lowercase, strip URLs, numbers, and punctuation. Extract the last 1–3 words as context.
Step 2 — 4-gram lookup: Search the quadgram table for the last 3 words. Score matching predictions at 1.0×.
Step 3 — Trigram backoff: If no quadgram match, search trigram table for last 2 words. Score at 0.4×.
Step 4 — Bigram backoff: Search bigram table for last word. Score at 0.16×.
Step 5 — Unigram fallback: Always have an answer — fall back to most frequent words. Score at 0.064×.

n-gram pruning min frequency = 3 no probability normalisation O(1) hash lookup ~8MB model size

The Shiny App

How to Use It

Type any English phrase in the text box
Click “Predict” (or use the example phrases)
The top prediction is highlighted in green — click it to append to your phrase
Up to 5 alternatives are shown — click any to continue building your sentence

Live Predictions

Input phrase:

“I want to go to the”

★ store

park movies beach gym

Input phrase:

“It would mean the”

★ world

same most difference end

<10ms

prediction latency

5

alternatives shown

∞

phrase length supported

100%

coverage (always predicts)

Why This Matters

What We Built

Processed 102M+ words from three real-world text sources
Built a production-ready backoff model in R with sub-10ms response times
Deployed as a live Shiny app — accessible to anyone, no installation required
Model is compact (<10MB) and deployable on free-tier infrastructure

Next Steps

Kneser-Ney smoothing for better probability estimates
Larger training sample (full corpus vs. 180K sentence sample)
Perplexity benchmarking on held-out test set
User personalisation — adapt model to individual typing patterns