2026-06-25

The Problem

  • Typing full sentences is slow
  • Mobile keyboards already suggest the next word
  • Goal: Build a next-word predictor from scratch using real text data

Given a phrase, predict the most likely next word.

How the Model Works

Step 1 — Training data

  • 3 sources: Blogs, News, Twitter (~555 MB, ~70M words)
  • 5% random sample used for speed (~14M words after cleaning)
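
A minimal sketch of the sampling step, written in Python purely for illustration (the app itself is built in R). The function name `sample_lines`, the seed value, and the stand-in corpus are assumptions and do not come from the project; the sketch only shows how a reproducible 5% random sample of lines could be drawn.

```python
import random

def sample_lines(lines, fraction=0.05, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the same sample is drawn every run
    return [line for line in lines if rng.random() < fraction]

# Stand-in for the real blog, news, and Twitter lines (hypothetical data).
corpus = [f"line {i}" for i in range(10_000)]
sample = sample_lines(corpus)
print(len(sample))  # roughly 5% of the 10,000 input lines
```

Sampling whole lines rather than single words keeps each sentence intact, so the word sequences counted later are ones that really occurred.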

Step 2 — N-gram tables

  • Counts how often word sequences appear together
  • Builds lookup tables: bigram (2-word), trigram (3-word), quadgram (4-word)
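
As a rough illustration of what these tables hold, the Python sketch below counts every run of n consecutive words in a few toy sentences. The helper name `ngram_counts` and the sentences are assumptions made for the example, not the project's code or data.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count every run of n consecutive words within each sentence."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        # A sentence with fewer than n words contributes nothing.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

sentences = ["thanks for the help", "thanks for the follow", "for the record"]
bigrams = ngram_counts(sentences, 2)
trigrams = ngram_counts(sentences, 3)
quadgrams = ngram_counts(sentences, 4)

print(bigrams[("for", "the")])             # 3
print(trigrams[("thanks", "for", "the")])  # 2
print(len(quadgrams))                      # 2
```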

Step 3 — Backoff prediction

Input: "thanks for the"
  → Look up the last 3 words in the quadgram table first
  → No match? Try the last 2 words in the trigram table
  → No match? Try the last word in the bigram table
  → Return the top 3 most frequent matches
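
The backoff lookup can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the app's R implementation: `build_tables` and `predict` are hypothetical names, the toy sentences are made up, and the sketch returns an empty list when even the last word is unseen, so a real app would need its own default for that case.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=4):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for sentence in sentences:
        words = sentence.lower().split()
        for n in tables:
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                next_word = words[i + n - 1]
                tables[n][context][next_word] += 1
    return tables

def predict(phrase, tables, k=3):
    """Try the longest context first, then back off to shorter ones."""
    words = phrase.lower().split()
    for n in sorted(tables, reverse=True):      # 4, then 3, then 2
        context = tuple(words[-(n - 1):])
        # Skip this table if the phrase is too short to fill the context.
        if len(context) == n - 1 and context in tables[n]:
            return [w for w, _ in tables[n][context].most_common(k)]
    return []  # nothing matched, not even the last word

sentences = [
    "thanks for the help",
    "thanks for the help",
    "thanks for the follow",
    "for the record",
]
tables = build_tables(sentences)

print(predict("thanks for the", tables))   # quadgram hit: ['help', 'follow']
print(predict("waiting for the", tables))  # backs off to the trigram table
print(predict("unseen", tables))           # []
```

Storing the tables keyed by context means each lookup is a single dictionary access per table, which is what keeps prediction fast.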

Predictive Performance

Metric                 Value
Training words         ~14 million
Bigram pairs           ~500,000
Trigram sequences      ~300,000
Quadgram sequences     ~200,000
Predictions returned   Up to 3
Response time          < 1 second

Backoff ensures a prediction is always returned, even for rare phrases.

The App

How to use it:

  1. Type any word or phrase in the text box
  2. Use the slider to choose 1, 2, or 3 suggestions
  3. Predicted next words appear instantly below

Features:

  • Handles contractions, mixed case, punctuation
  • Filters profanity automatically
  • Falls back gracefully for unknown words
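
One way the input cleaning could work is sketched below in Python, for illustration only. The function name `clean`, the regular expression, and the placeholder blocklist are assumptions; the app's actual rules and profanity list are not shown in these slides, and it may filter predictions rather than input tokens.

```python
import re

# Placeholder entries only; a real profanity list would be loaded from a file.
BLOCKLIST = {"darn", "heck"}

def clean(text, blocklist=BLOCKLIST):
    """Lowercase, keep apostrophes inside words, drop other punctuation and blocked words."""
    text = text.lower().replace("\u2019", "'")        # unify curly and straight apostrophes
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text)  # words, contractions kept whole
    return [t for t in tokens if t not in blocklist]

print(clean("Thanks, I DON'T know... what the heck?!"))
# ['thanks', 'i', "don't", 'know', 'what', 'the']
```

Applying the same cleaning to both the training text and the typed phrase matters: the lookup only matches when both sides are normalized the same way.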

Live app: shinyapps.io/next-word-prediction-shiny-app-swadhwa

Summary

Data         Blogs + News + Twitter
Model        N-gram backoff (2 / 3 / 4-gram)
Output       Top 3 next-word predictions
Speed        Under 1 second
Built with   R + Shiny

Simple, fast, and always returns a prediction.

Created by Sumit Wadhwa