Next Word Predictor

June 30, 2026

The Problem

Typing on mobile and web is slow, and users want smart, fast word suggestions as they type.
Most predictive keyboards rely on huge external datasets and heavy infrastructure.
We needed a solution that is:
- Fast — instant predictions, no lag
- Lightweight — small footprint, easy to deploy anywhere
- Self-contained — no external data dependencies or downloads at runtime

Our app solves this with a Stupid Backoff n-gram model that ships fully self-contained.

We use the Stupid Backoff algorithm (Brants et al., 2007), a simple but effective scoring scheme designed for very large n-gram language models:

- Build n-gram frequency tables (unigrams through 4-grams) from the training text.
- To predict the next word, look up the last 3 words of the input as a prefix in the 4-gram table.
- If no match is found, back off to the 3-gram, then 2-gram, then unigram table, multiplying the score by 0.4 at each step down.
- Return the highest-scoring candidate word.
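
The steps above can be sketched in a few lines. This is an illustrative Python sketch, not the app's actual code: the names `build_tables` and `predict`, the whitespace tokenizer, and the single context-to-counts table layout are all assumptions made for the example.

```python
from collections import Counter, defaultdict

ALPHA = 0.4      # backoff weight suggested by Brants et al. (2007)
MAX_ORDER = 4    # unigrams through 4-grams, as described above


def build_tables(text, max_order=MAX_ORDER):
    """Map each context tuple (0 to max_order - 1 words) to a Counter of next words."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    tables = defaultdict(Counter)
    for i, word in enumerate(tokens):
        # k is the context length; k = 0 gives the unigram table under key ().
        for k in range(min(i, max_order - 1) + 1):
            tables[tuple(tokens[i - k:i])][word] += 1
    return tables


def predict(tables, phrase, max_order=MAX_ORDER, alpha=ALPHA):
    """Return (word, score) using the longest context that has any match."""
    tokens = phrase.lower().split()
    n = min(len(tokens), max_order - 1)
    context = tuple(tokens[len(tokens) - n:])  # last n words of the input
    weight = 1.0
    while True:
        followers = tables.get(context)
        if followers:
            word, count = followers.most_common(1)[0]
            # Relative frequency of the word after this context, discounted
            # by alpha once per backoff step taken so far.
            return word, weight * count / sum(followers.values())
        if not context:
            return None, 0.0  # only reachable when the training text is empty
        context = context[1:]  # drop the oldest word and back off one order
        weight *= alpha
```

With tables built once from a training string, `predict(tables, "some typed phrase")` returns the best candidate and its score. An input whose words were never seen falls through to the unigram table, so a word is returned whenever the training text is non-empty.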

Why this works well:

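- No complex probability smoothing needed: candidates are ranked by relative-frequency scores rather than normalized probabilities
- Scales efficiently: Brants et al. introduced it at Google for language models trained on up to 2 trillion tokens
- Always returns a prediction, even for unseen phrases, thanks to the unigram fallback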
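
For reference, the scoring rule can be written as follows. The notation follows Brants et al. (2007): f is a raw count from the frequency tables, N is the total number of training tokens, and alpha is the backoff weight (0.4 here). The result S is a score, not a probability, since it is never normalized.

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
  \dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
  \alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{f(w_i)}{N}, \quad \alpha = 0.4
```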

A simple, distraction-free Shiny interface:

Key features:

- Clean, single-screen UI with no learning curve
- Predictions return in well under a second
- Entire model (corpus + n-gram tables) is built at app startup, with zero external file dependencies, so it is easy to redeploy anywhere

- Novel angle: most capstone submissions ship a large external corpus file; ours embeds and builds the model entirely in-app, making it trivially portable and easy to maintain.
- Robust: the multi-level backoff with unigram fallback guarantees a prediction for any input, including short or unusual phrases.
- Production-minded: small footprint, fast cold start, no dependency management headaches when redeploying or handing off to another engineer.
- Extensible: the n-gram tables can easily be swapped for a larger corpus (e.g., the full SwiftKey dataset) without changing the app’s architecture.

Live app: [insert your shinyapps.io link here]

Next steps to take this further:

- Train on a larger, more diverse corpus (news, blogs, social media) for broader vocabulary coverage
- Add top-3 suggestions instead of a single word
- Add Kneser-Ney smoothing as a comparison baseline against Stupid Backoff
- Optimize n-gram lookup with a hash-based index for larger-scale deployment
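
As a rough illustration of two of these steps, the Python sketch below stores the counts in a hash-based index (a dictionary keyed by context tuple) and returns the top k candidates instead of one. The function names `build_index` and `top_k` and the whitespace tokenizer are hypothetical; the planned implementation may differ.

```python
import heapq
from collections import Counter, defaultdict


def build_index(text, max_order=4):
    """Hash-based index: context tuple -> Counter of the words that follow it."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    index = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for k in range(min(i, max_order - 1) + 1):
            index[tuple(tokens[i - k:i])][word] += 1
    return index


def top_k(index, phrase, k=3, max_order=4, alpha=0.4):
    """Return up to k (word, score) pairs, best first, using Stupid Backoff scores."""
    tokens = phrase.lower().split()
    n = min(len(tokens), max_order - 1)
    context = tuple(tokens[len(tokens) - n:])
    scores = {}
    weight = 1.0
    while True:
        followers = index.get(context, {})  # O(1) average-case dictionary lookup
        total = sum(followers.values())
        for word, count in followers.items():
            # setdefault keeps the score from the longest context in which
            # the word was seen; shorter contexts only add new candidates.
            scores.setdefault(word, weight * count / total)
        if not context:
            break
        context = context[1:]  # back off one order
        weight *= alpha
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

Unlike a single-word predictor that stops at the first matching order, this version keeps walking down to the unigram table so that lower-order candidates can fill the remaining suggestion slots with discounted scores.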

Thank you — happy to answer questions.