June 30, 2026

The Problem

  • Typing on mobile and web is slow, and users want smart, fast word suggestions as they type.
  • Most predictive keyboards rely on huge external datasets and heavy infrastructure.
  • We needed a solution that is:
    • Fast — instant predictions, no lag
    • Lightweight — small footprint, easy to deploy anywhere
    • Self-contained — no external data dependencies or downloads at runtime

Our app solves this with a Stupid Backoff n-gram model that ships fully self-contained.

The Algorithm: Stupid Backoff

We use the Stupid Backoff algorithm (Brants et al., 2007), a simple scoring scheme designed for very large n-gram language models. It ranks candidates by relative frequency instead of computing normalized probabilities:

  1. Build n-gram frequency tables (unigrams through 4-grams) from training text.
  2. To predict the next word, take the last 3 words of the input as context and look them up in the 4-gram table.
  3. If no match is found, back off to the 3-gram, then 2-gram, then unigram table, multiplying the score by a fixed factor of 0.4 at each step down.
  4. Return the highest-scoring candidate word.
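
The four steps above can be sketched in a few lines. This is a minimal illustration in Python, not the app's own Shiny/R code; the function names (build_tables, predict), the whitespace tokenizer, and the toy corpus are all assumptions made for the example. It follows the steps as listed: it returns the best candidate at the highest order that has a match, rather than scoring every word in the vocabulary.

```python
from collections import Counter, defaultdict

BACKOFF = 0.4   # fixed back-off factor from Brants et al. (2007)
MAX_ORDER = 4   # unigrams through 4-grams, as in the steps above


def build_tables(sentences):
    """Count n-grams of order 1..MAX_ORDER, keyed by their context tuple."""
    # tables[n][context] is a Counter of the words seen right after `context`.
    # Hypothetical layout; the real app may store its tables differently.
    tables = {n: defaultdict(Counter) for n in range(1, MAX_ORDER + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: whitespace tokenizer
        for n in range(1, MAX_ORDER + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables


def predict(tables, text):
    """Return (word, score) from the longest context that has been seen."""
    tokens = text.lower().split()
    weight = 1.0
    # Start at the highest order the input is long enough to support.
    start = min(MAX_ORDER, len(tokens) + 1)
    for n in range(start, 0, -1):
        context = tuple(tokens[len(tokens) - (n - 1):])
        if context in tables[n]:
            followers = tables[n][context]
            word, count = followers.most_common(1)[0]
            # Relative frequency among continuations of this context,
            # scaled by 0.4 for every order we had to back off.
            return word, weight * count / sum(followers.values())
        weight *= BACKOFF
    return None, 0.0  # only reached when the training text is empty


# Toy corpus, invented for illustration only.
tables = build_tables([
    "i can not wait to see you",
    "i can not wait to see you again",
    "wait to see the show",
])
print(predict(tables, "I can not wait to see"))
```

Keying each table by its context tuple means one dictionary lookup per order, so a prediction costs at most four lookups regardless of corpus size.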

Why this works well:

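  • No complex probability smoothing needed: scores are discounted relative frequencies, not normalized probabilities
  • Scales efficiently: introduced at Google for machine-translation language models trained on very large corpora (Brants et al., 2007)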
  • Always returns a prediction, even for unseen phrases, thanks to the unigram fallback
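
For reference, the scoring rule behind these points, as given in Brants et al. (2007), can be written as follows. Here f(·) is a raw n-gram count, N is the total number of tokens in the training text, and α is the fixed back-off factor (0.4 in this app). The notation is the paper's, not something defined by the app.

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
  \dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
  \alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{f(w_i)}{N}, \quad \alpha = 0.4
```

Because the recursion always bottoms out at the unigram score S(w_i), every word in the vocabulary gets a nonzero score, which is why a prediction is always available. The scores do not sum to 1, so they should be read as rankings rather than probabilities.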

The App

A simple, distraction-free Shiny interface:

  1. Type a partial sentence into the text box (e.g. “I can not wait to see”)
  2. Click “Predict Next Word”
  3. The model instantly returns its top predicted word

Key features:

  • Clean, single-screen UI — no learning curve
  • Predictions return in well under a second
  • Entire model (corpus + n-gram tables) is built at app startup — zero external file dependencies, so it is easy to redeploy anywhere

Why It’s a Good Bet

  • Novel angle: most capstone submissions ship a large external corpus file; ours embeds and builds the model entirely in-app, making it trivially portable and easy to maintain.
  • Robust: the multi-level backoff with unigram fallback guarantees a prediction for any input, including short or unusual phrases.
  • Production-minded: small footprint, fast cold start, no dependency management headaches when redeploying or handing off to another engineer.
  • Extensible: the embedded training text can be swapped for a larger corpus (e.g. the full SwiftKey dataset) and the n-gram tables rebuilt without changing the app’s architecture.

Try It / Next Steps

Live app: [insert your shinyapps.io link here]

Next steps to take this further:

  • Train on a larger, more diverse corpus (news, blogs, social media) for broader vocabulary coverage
  • Add top-3 suggestions instead of a single word
  • Implement Kneser-Ney smoothing as a comparison baseline for Stupid Backoff
  • Optimize n-gram lookup with a hash-based index for larger-scale deployment
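
As a rough idea of how the top-3 and hash-index items could fit together, here is a minimal Python sketch (again not the app's R code). The function name top_k and the index layout, a dictionary mapping a context tuple to a Counter of next words with the empty tuple holding unigram counts, are assumptions for illustration. It fills the suggestion list from the longest context first and tops it up from shorter contexts, which is a simplification of full Stupid Backoff scoring.

```python
from collections import Counter


def top_k(index, context, k=3):
    """Return up to k distinct next-word suggestions for `context`.

    `index` is a hypothetical hash-based lookup: a dict mapping a context
    tuple to a Counter of next words; the empty tuple holds unigram counts.
    """
    suggestions = []
    # Try the full context first, then drop the oldest word each round.
    for start in range(len(context) + 1):
        followers = index.get(tuple(context[start:]))
        if not followers:
            continue
        for word, _ in followers.most_common():
            if word not in suggestions:
                suggestions.append(word)
            if len(suggestions) == k:
                return suggestions
    return suggestions


# Toy index, invented for illustration only.
index = {
    ("to", "see"): Counter({"you": 2}),
    ("see",): Counter({"you": 3, "the": 2}),
    (): Counter({"the": 10, "to": 5, "see": 4}),
}
print(top_k(index, ["to", "see"]))
```

Each context is resolved with a single dictionary lookup, so the cost per prediction stays constant as the tables grow; the caller is expected to pass at most the last 3 words.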

Thank you — happy to answer questions.