Next Word Predictor

2026-06-25

The Problem

Typing full sentences is slow
Mobile keyboards already suggest the next word
Goal: Build the same thing from scratch using real data

Given a phrase, predict the most likely next word.

How the Model Works

Step 1 — Training data

3 sources: Blogs, News, Twitter (~555 MB, ~70M words)
5% random sample used for speed (~14M words after cleaning)
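
The sampling step can be sketched as follows. This is an illustration in Python, not the app's R code; the function name `sample_lines`, the fixed seed, and the stand-in corpus are assumptions made for the example.

```python
import random

def sample_lines(lines, fraction=0.05, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Stand-in for the blog, news and tweet lines (hypothetical data)
corpus = [f"line {i}" for i in range(10_000)]
sample = sample_lines(corpus)
print(len(sample), "of", len(corpus), "lines kept")
```

Sampling whole lines rather than single words keeps word order intact, which the n-gram counts in the next step depend on.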

Step 2 — N-gram tables

Count how often each word sequence appears in the sample
Build lookup tables: bigram (2-word), trigram (3-word), quadgram (4-word)
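
A minimal sketch of the counting step, written in Python rather than the app's R. The helper name `ngram_counts` and the toy token list are assumptions for illustration only.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy input; the real tables are built from the cleaned sample
tokens = "thanks for the help and thanks for the follow".split()
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
quadgrams = ngram_counts(tokens, 4)

print(bigrams[("thanks", "for")])          # 2
print(trigrams[("thanks", "for", "the")])  # 2
```

In practice the counts would likely be taken one line or sentence at a time, so that sequences do not run across sentence boundaries.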

Step 3 — Backoff prediction

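Input: "thanks for the"
  → Look up the last 3 words in the quadgram table
  → No match? Look up the last 2 words in the trigram table
  → No match? Look up the last word in the bigram table
  → Return the top 3 most frequent next words from the first table that matches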
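
The backoff lookup above can be sketched like this. It is a simplified Python illustration, not the app's R implementation; `build_tables`, `predict`, and the four toy sentences are assumptions made for the example.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=4):
    """Map each (n-1)-word context to a Counter of next words, for n = 2..max_n."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in tables:
            for i in range(len(tokens) - n + 1):
                context, nxt = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
                tables[n][context][nxt] += 1
    return tables

def predict(tables, phrase, k=3):
    """Try the longest context first and back off to shorter ones."""
    tokens = phrase.lower().split()
    for n in sorted(tables, reverse=True):   # 4, then 3, then 2
        if len(tokens) < n - 1:
            continue                          # phrase too short for this table
        context = tuple(tokens[-(n - 1):])
        matches = tables[n].get(context)
        if matches:
            return [word for word, _ in matches.most_common(k)]
    return []                                 # no table knows the last word

tables = build_tables([
    "thanks for the help",
    "thanks for the help today",
    "thanks for the follow",
    "see you at the party",
])
print(predict(tables, "thanks for the"))  # ['help', 'follow']  (quadgram hit)
print(predict(tables, "look at the"))     # ['party']           (trigram hit)
print(predict(tables, "xyzzy"))           # []                  (no hit at all)
```

The last call shows that a plain backoff returns nothing when even the final word is unseen. To always return a prediction, a final fallback such as a fixed list of the most common words would be needed; that fallback is an assumption here and is not shown.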

Predictive Performance

Metric	Value
Training words	~14 million
Bigram pairs	~500,000
Trigram sequences	~300,000
Quadgram sequences	~200,000
Predictions returned	Up to 3
Response time	< 1 second

Backoff ensures a prediction is always returned, even for rare phrases.

The App

How to use it:

Type any word or phrase in the text box
Use the slider to choose 1, 2, or 3 suggestions
Predicted next words appear instantly below

Features:

Handles contractions, mixed case, punctuation
Filters profanity automatically
Falls back gracefully for unknown words
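
One way the input handling and profanity filter could work, sketched in Python rather than the app's R. The cleaning rules, the function names, and the placeholder `BLOCKLIST` are assumptions; the app's actual rules and word list are not shown in this deck.

```python
import re

# Placeholder words only; a real filter would load a full profanity list
BLOCKLIST = {"darn", "heck"}

def normalize(text):
    """Lowercase the text and keep only words, preserving in-word apostrophes."""
    text = text.lower().replace("\u2019", "'")   # curly apostrophe to straight
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text)

def filter_suggestions(words, blocklist=BLOCKLIST):
    """Drop any suggested word that appears in the blocklist."""
    return [w for w in words if w not in blocklist]

print(normalize("Don't STOP,  believing!!"))       # ["don't", 'stop', 'believing']
print(filter_suggestions(["day", "heck", "way"]))  # ['day', 'way']
```

Applying the same `normalize` step to both the training text and the typed phrase keeps the lookup keys consistent.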

Live app: shinyapps.io/next-word-prediction-shiny-app-swadhwa

Summary

Data	Blogs + News + Twitter
Model	N-gram backoff (2 / 3 / 4-gram)
Output	Top 3 next-word predictions
Speed	Under 1 second
Built with	R + Shiny

Simple, fast, and always returns a prediction.

Created by Sumit Wadhwa