Next Word Predictor

Data Science Capstone
April 2026

A fast, lightweight text prediction app built with R and Shiny

Overview

Problem: Predict the next word a user will type, given a phrase.

Solution: An N-gram language model trained on real-world English text.

  • Trained on blogs, news, and Twitter data from the SwiftKey corpus
  • Uses 4-gram, 3-gram, and 2-gram frequency tables
  • Implements the Stupid Backoff algorithm for robust prediction
  • Deployed as an interactive Shiny web app
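A minimal sketch of how n-gram frequency tables like these could be built in base R. The helper names (`clean_tokens`, `ngram_counts`) and the cleaning rules are assumptions for illustration, not the app's actual preprocessing.

```r
# Hypothetical helper: lower-case the text and keep only letters and apostrophes.
clean_tokens <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z' ]", " ", text)
  tokens <- strsplit(text, " +")[[1]]
  tokens[nzchar(tokens)]
}

# Hypothetical helper: count every n-gram in a character vector of lines.
ngram_counts <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    tok <- clean_tokens(line)
    if (length(tok) < n) return(character(0))   # line too short for this order
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts,
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  counts <- setNames(as.integer(tab), names(tab))
  sort(counts, decreasing = TRUE)               # most frequent n-grams first
}

# Example on two made-up lines
bigrams <- ngram_counts(c("The cat sat.", "The cat ran!"), 2)
```

Running the same function with `n = 3` and `n = 4` would give the higher-order tables; a real pipeline would likely also prune rare n-grams to keep the model small.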

Algorithm

How Stupid Backoff works:

  1. Clean and tokenise the input phrase
  2. Look up the last 3 words in the 4-gram table
  3. If no match, back off to 3-grams (last 2 words)
  4. If still no match, back off to 2-grams (last word)
  5. Multiply the score by 0.4 at each backoff step
  6. Return the highest-scoring candidates (the app lists the top 5)
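The steps above can be sketched in base R as follows. This is an illustration under stated assumptions: the table layout (prefix, word, count), the function name `predict_next`, and the toy counts are all made up, and the denominator is approximated by the total count of n-grams that share the prefix.

```r
# Toy frequency tables with made-up counts; each row maps a prefix to a next word.
tables <- list(
  "4" = data.frame(prefix = c("thanks for the", "thanks for the"),
                   word   = c("follow", "help"),
                   count  = c(6, 4)),
  "3" = data.frame(prefix = c("for the", "for the", "of the"),
                   word   = c("first", "follow", "year"),
                   count  = c(5, 3, 2)),
  "2" = data.frame(prefix = c("the", "the", "the"),
                   word   = c("same", "first", "best"),
                   count  = c(5, 3, 2))
)

predict_next <- function(phrase, tables, lambda = 0.4, top_n = 5) {
  # Step 1: clean and tokenise
  tokens <- strsplit(tolower(trimws(phrase)), "[^a-z']+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  penalty <- 1
  # Steps 2-4: try 4-grams, then 3-grams, then 2-grams
  for (n in c(4, 3, 2)) {
    k <- n - 1
    if (length(tokens) < k) next              # not enough context for this order
    prefix <- paste(tail(tokens, k), collapse = " ")
    tab <- tables[[as.character(n)]]
    hits <- tab[tab$prefix == prefix, , drop = FALSE]
    if (nrow(hits) > 0) {
      # Relative frequency, discounted once per backoff step (step 5)
      hits$score <- penalty * hits$count / sum(hits$count)
      hits <- hits[order(-hits$score), c("word", "score")]
      return(head(hits, top_n))               # step 6
    }
    penalty <- penalty * lambda
  }
  data.frame(word = character(0), score = numeric(0))
}

predict_next("Waiting for the", tables)
```

With these toy tables, "Waiting for the" misses the 4-gram table and falls back to the 3-gram prefix "for the", so the top candidate is "first" with score 0.4 × 5/8 = 0.25.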

Why Stupid Backoff?

  • Simple, fast, and effective for large corpora
  • Returns relative scores, not normalised probabilities, so it skips the discounting and normalisation that Kneser-Ney requires
  • Shown to approach Kneser-Ney quality on very large corpora (Brants et al., 2007)
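For reference, the scoring rule as defined by Brants et al. (2007) can be written as below, where f(·) is a raw n-gram count and α is the backoff factor (0.4 here). The result is a score S rather than a probability, which is why no normalisation is needed.

```latex
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[1ex]
\alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\qquad \alpha = 0.4
```

In the original formulation the recursion ends at unigrams with S(w_i) = f(w_i) / N, where N is the corpus size; the model described here stops at the 2-gram table.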

The App

[App screenshot]

Features:

  • Text input box — type any English phrase
  • Instant prediction — click “Predict” for the next word
  • Top 5 results — see ranked alternatives with scores
  • Lightweight — loads in seconds, runs on shinyapps.io

Try it: [shinyapps.io link]
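A minimal Shiny sketch of an interface with these features. The input and output IDs and the `predict_top5` placeholder are hypothetical; the placeholder returns fixed, made-up scores and would be replaced by the real model lookup.

```r
library(shiny)

# Placeholder predictor (hypothetical): returns fixed, made-up scores.
predict_top5 <- function(phrase) {
  data.frame(word  = c("the", "to", "and", "a", "of"),
             score = c(0.50, 0.20, 0.15, 0.10, 0.05))
}

ui <- fluidPage(
  titlePanel("Next Word Predictor"),
  textInput("phrase", "Type an English phrase:"),
  actionButton("predict", "Predict"),
  tableOutput("results")
)

server <- function(input, output, session) {
  # Recompute only when the button is clicked, not on every keystroke.
  results <- eventReactive(input$predict, {
    req(nzchar(trimws(input$phrase)))   # ignore empty input
    head(predict_top5(input$phrase), 5)
  })
  output$results <- renderTable(results())
}

app <- shinyApp(ui, server)   # launch locally with runApp(app)
```

Tying the prediction to the button through `eventReactive` keeps the server idle while the user is still typing.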

Performance & Future Work

Model stats:

  • Trained on ~300,000 lines (a 10% sample of the corpus)
  • Model size: < 30 MB (compressed RDS files)
  • Response time: < 1 second per prediction
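A sketch of how figures like these could be checked, assuming a reproducible random sample and a compressed RDS file. The helper `sample_lines`, the seed, and the stand-in corpus are illustrative only; none of the numbers it produces relate to the real model.

```r
# Hypothetical helper: draw a reproducible fraction of the input lines.
sample_lines <- function(lines, frac = 0.1, seed = 2026) {
  set.seed(seed)
  lines[sample.int(length(lines), size = round(frac * length(lines)))]
}

# Stand-in data; the real input would be the blogs, news, and Twitter files.
corpus <- sprintf("line number %d of a made-up corpus", 1:1000)
train  <- sample_lines(corpus)

# Save as a compressed RDS file and measure its size on disk in MB.
path <- tempfile(fileext = ".rds")
saveRDS(train, path, compress = TRUE)
size_mb <- file.size(path) / 1024^2

# Time an operation in seconds; swap in a real prediction call to check latency.
elapsed <- system.time(readRDS(path))[["elapsed"]]
```

Fixing the seed makes the 10% sample repeatable, so model size and timings stay comparable between runs.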

Future improvements:

  • Increase training sample size for better coverage
  • Add 5-gram support for longer context
  • Implement interpolated Kneser-Ney smoothing
  • Add auto-complete (predict while typing)
  • Support multiple languages