Next Word Predictor

Data Science Capstone
April 2026

A fast, lightweight text prediction app built with R and Shiny

Overview

Problem: Predict the next word a user will type, given a phrase.

Solution: An N-gram language model trained on real-world English text.

Trained on blogs, news, and Twitter data from the SwiftKey corpus
Uses 4-gram, 3-gram, and 2-gram frequency tables
Implements the Stupid Backoff algorithm for robust prediction
Deployed as an interactive Shiny web app
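
As a rough illustration of how such frequency tables might be built, here is a base-R sketch. The helper names (`tokenise`, `build_ngram_table`) and the `prefix` / `word` / `freq` column layout are assumptions made for this example; the deck does not show the app's actual preprocessing code.

```r
# Hypothetical helpers for illustration only; not the app's actual code.

# Lower-case the text, replace anything that is not a letter, apostrophe
# or space with a space, then split on whitespace.
tokenise <- function(text) {
  text <- gsub("[^a-z' ]", " ", tolower(text))
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens[nzchar(tokens)]
}

# Count every n-gram of order n in a character vector of lines.
# Returns a data frame: prefix (first n-1 words), word (last word), freq.
build_ngram_table <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    tok <- tokenise(line)
    if (length(tok) < n) return(character(0))
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts,
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  if (length(grams) == 0) {
    return(data.frame(prefix = character(0), word = character(0),
                      freq = integer(0), stringsAsFactors = FALSE))
  }
  counts <- table(grams)
  parts <- strsplit(names(counts), " ", fixed = TRUE)
  out <- data.frame(
    prefix = vapply(parts, function(w) paste(w[-n], collapse = " "), character(1)),
    word   = vapply(parts, function(w) w[n], character(1)),
    freq   = as.integer(counts),
    stringsAsFactors = FALSE
  )
  # Most frequent n-grams first
  out <- out[order(-out$freq), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Toy input standing in for the sampled corpus
lines <- c("I love data science", "I love R", "data science is fun")
bigrams <- build_ngram_table(lines, 2)
# "i love" and "data science" each occur twice; the other four bigrams once
```

Calling the same function with `n = 3` and `n = 4` would give the other two tables. On a corpus of realistic size, a dedicated tokenisation package would likely be much faster than this loop.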

Algorithm

How Stupid Backoff works:

Clean and tokenise the input phrase
Look up the last 3 words in the 4-gram table
If no match, back off to 3-grams (last 2 words)
If still no match, back off to 2-grams (last word)
Each backoff level multiplies the score by 0.4
Return the highest-scoring candidate word
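
The steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the app's actual code: the function names, the `prefix` / `word` / `freq` table layout, and the toy counts are all hypothetical, and the context count is approximated by summing the frequencies of the matching rows.

```r
# Minimal Stupid Backoff sketch. All names and the table layout are
# assumptions for illustration.

clean_tokens <- function(phrase) {
  phrase <- gsub("[^a-z' ]", " ", tolower(phrase))
  tok <- strsplit(phrase, "\\s+")[[1]]
  tok[nzchar(tok)]
}

# `tables` is a list keyed by the number of context words:
# "3" = 4-gram table, "2" = 3-gram table, "1" = 2-gram table.
predict_next <- function(phrase, tables, lambda = 0.4, top_n = 5) {
  tok <- clean_tokens(phrase)
  penalty <- 1
  for (k in c(3, 2, 1)) {
    if (length(tok) < k) next            # not enough words for this order
    context <- paste(tail(tok, k), collapse = " ")
    tbl <- tables[[as.character(k)]]
    hits <- tbl[tbl$prefix == context, , drop = FALSE]
    if (nrow(hits) > 0) {
      # Relative frequency within the context, discounted once per backoff
      out <- data.frame(word = hits$word,
                        score = penalty * hits$freq / sum(hits$freq),
                        stringsAsFactors = FALSE)
      out <- out[order(-out$score), , drop = FALSE]
      rownames(out) <- NULL
      return(head(out, top_n))
    }
    penalty <- penalty * lambda          # lookup failed: back off
  }
  data.frame(word = character(0), score = numeric(0),
             stringsAsFactors = FALSE)
}

# Toy tables with made-up counts
tables <- list(
  "3" = data.frame(prefix = "thanks for the", word = c("follow", "help"),
                   freq = c(3, 1), stringsAsFactors = FALSE),
  "2" = data.frame(prefix = "for the", word = c("first", "follow"),
                   freq = c(5, 2), stringsAsFactors = FALSE),
  "1" = data.frame(prefix = c("the", "the", "of"),
                   word = c("same", "first", "the"),
                   freq = c(6, 4, 9), stringsAsFactors = FALSE)
)

direct  <- predict_next("Thanks for the", tables)   # 4-gram hit: follow 0.75, help 0.25
backoff <- predict_next("waiting for the", tables)  # backs off once; top word is "first"
```

Because the function returns at the first level that has a match, the 0.4 factor scales every candidate at that level equally. It affects the reported score, not the ranking.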

Why Stupid Backoff?

Simple, fast, and effective for large corpora
No expensive normalisation step (unlike Kneser-Ney)
Shown to approach Kneser-Ney quality on very large corpora (Brants et al., 2007)
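
For reference, the scoring rule as usually stated for Stupid Backoff is given below. Here f is a raw n-gram count and α is the backoff factor (0.4 in this deck). The symbols are notation chosen for this note. The scores are relative frequencies, not normalised probabilities, which is why no normalisation pass is needed.

```latex
$$
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[1ex]
\alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\qquad \alpha = 0.4
$$
```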

The App

[App screenshot]

Features:

Text input box — type any English phrase
Instant prediction — click “Predict” for the next word
Top 5 results — see ranked alternatives with scores
Lightweight — loads in seconds, runs on shinyapps.io
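
A minimal sketch of how these features could be wired together in Shiny is shown below. The input and output IDs (`phrase`, `go`, `top5`) and the `predict_stub()` function are hypothetical stand-ins; the real app would call its trained model instead of returning fixed placeholder rows.

```r
# Sketch only: IDs and predict_stub() are assumptions, not the app's code.
library(shiny)

# Placeholder for the real model: returns fixed candidates with dummy scores.
predict_stub <- function(phrase) {
  data.frame(word  = c("the", "a", "to", "and", "of", "in"),
             score = c(0.30, 0.20, 0.15, 0.10, 0.05, 0.02),
             stringsAsFactors = FALSE)
}

ui <- fluidPage(
  titlePanel("Next Word Predictor"),
  textInput("phrase", "Type an English phrase:"),
  actionButton("go", "Predict"),
  tableOutput("top5")
)

server <- function(input, output, session) {
  # Recompute only when the Predict button is clicked
  results <- eventReactive(input$go, predict_stub(input$phrase))
  # Show the five highest-ranked candidates with their scores
  output$top5 <- renderTable(head(results(), 5))
}

app <- shinyApp(ui, server)
# Launch locally with: runApp(app)
```

Using `eventReactive()` on the button keeps the prediction from re-running on every keystroke, which matches the click-to-predict behaviour described above.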

Try it: [shinyapps.io link]

Performance & Future Work

Model stats:

Trained on ~300,000 lines (10% sample)
Model size: < 30 MB (compressed RDS files)
Response time: < 1 second per prediction
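
One way such figures could be checked is sketched below: save a table as a compressed RDS file, read it back, and time a lookup. The table contents and the file name are made up for illustration, and `compress = "xz"` is only one of the options `saveRDS()` accepts. The deck does not say which setting the app uses.

```r
# Sketch: write a frequency table to compressed RDS, reload it, time a lookup.
# Table contents and file name are illustrative assumptions.
ngrams <- data.frame(
  prefix = c("thanks for the", "thanks for the", "one of the"),
  word   = c("follow", "help", "most"),
  freq   = c(3L, 1L, 7L),
  stringsAsFactors = FALSE
)

path <- file.path(tempdir(), "fourgrams.rds")
saveRDS(ngrams, path, compress = "xz")

size_mb <- file.size(path) / 1024^2   # on-disk size in MB
loaded  <- readRDS(path)

# Wall-clock seconds for a single prefix lookup
elapsed <- system.time(
  hits <- loaded[loaded$prefix == "thanks for the", ]
)[["elapsed"]]
```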

Future improvements:

Increase training sample size for better coverage
Add 5-gram support for longer context
Implement interpolated Kneser-Ney smoothing
Add auto-complete (predict while typing)
Support multiple languages