Next Word Predictor: Data Science Capstone

Vikas Parmar
April 2026

Slide 1: The Problem & Motivation

Why does next-word prediction matter?

  • 📱 Mobile users type ~40 words per minute (wpm), versus ~80 wpm on a desktop keyboard
  • Predictive text is the #1 feature improving mobile typing UX
  • Used in: keyboards, search engines, email autocomplete, chatbots

The Challenge

Given a sequence of words typed by a user, predict the single most likely next word, quickly and accurately.

Dataset Used

  • HC Corpora provided by Coursera / SwiftKey
  • 3 sources: Twitter (2.4M lines) · News (1.0M) · Blogs (0.9M)
  • Sampled 5% (~220K sentences) for memory efficiency
  • After cleaning: ~180K unique sentences used for training
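
The 5% sampling step can be sketched as follows. This is a minimal Python illustration, not the project's own code (the deck points to R tooling such as tidytext); the function name `sample_lines`, the seed value, and the in-memory list standing in for the corpus files are all assumptions.

```python
import random

def sample_lines(lines, rate=0.05, seed=42):
    """Keep each line independently with probability `rate` (Bernoulli sampling)."""
    rng = random.Random(seed)  # fixed seed so the same sample is drawn on every run
    return [line for line in lines if rng.random() < rate]

# Toy stand-in for a corpus file: 10,000 numbered lines.
corpus = [f"line {i}" for i in range(10_000)]
sample = sample_lines(corpus)
print(len(sample))  # roughly 5% of 10,000; the exact count depends on the seed
```

Sampling line by line works on a stream, so the full corpus never has to sit in memory; the trade-off is that the sample size is only approximately 5%.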

Slide 2: The Algorithm (Stupid Backoff N-gram)

Why N-grams?

N-grams are fast, interpretable, and effective for next-word prediction without requiring GPUs.

Stupid Backoff (Brants et al., 2007)

Input: "I want to go to the"
  → Try 4-gram: match "go to the" → predict "store" ✓
  → If no match, try 3-gram: "to the" → predict "store"
  → If no match, try 2-gram: "the" → predict "same"
  → If no match, return top unigram: "the"
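
The lookup order above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the app's code: `TABLES`, `TOP_UNIGRAM`, and `predict_next` are hypothetical names, and the table entries simply mirror the example on this slide.

```python
# Toy prefix -> next_word tables, keyed by context length
# (3 words of context = 4-gram table). Entries mirror the slide's example.
TABLES = {
    3: {("go", "to", "the"): "store"},
    2: {("to", "the"): "store"},
    1: {("the",): "same"},
}
TOP_UNIGRAM = "the"

def predict_next(phrase, tables=TABLES, fallback=TOP_UNIGRAM):
    """Return (word, context_length) using the longest context that matches."""
    tokens = phrase.lower().split()
    for n in (3, 2, 1):                      # longest context first
        if len(tokens) < n:
            continue                         # phrase too short for this level
        word = tables[n].get(tuple(tokens[-n:]))
        if word is not None:
            return word, n
    return fallback, 0                       # nothing matched: top unigram

print(predict_next("I want to go to the"))   # ('store', 3)
print(predict_next("close to the"))          # ('store', 2)
print(predict_next("xyzzy"))                 # ('the', 0)
```

The second value in the result records which level answered, which is handy when checking how often the model has to back off.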

Scoring Formula

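Level           Score
4-gram match    freq(w4 | w1, w2, w3)
3-gram match    0.4 × freq(w3 | w1, w2)
2-gram match    0.4² × freq(w2 | w1)
Fallback        top unigram probability

Here freq(wn | w1, ..., wn-1) is the relative frequency count(w1 ... wn) / count(w1 ... wn-1), with words numbered within each n-gram.

The backoff factor λ = 0.4 matches the value Brants et al. used for large corpora.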
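
To make the scoring concrete, here is a small Python sketch of the Stupid Backoff score computed from raw counts. Only the 0.4 factor and the backoff order come from the slide; the toy corpus and the names `build_counts` and `score` are hypothetical. Following the recursive definition in Brants et al., the unigram fallback is also multiplied by 0.4 once per level backed off.

```python
from collections import Counter

LAMBDA = 0.4  # backoff factor from the slide

def build_counts(sentences, max_n=4):
    """Count every n-gram (as a tuple of words) for n = 1..max_n."""
    counts = Counter()
    for s in sentences:
        tokens = s.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def score(word, context, counts, total_tokens):
    """Stupid Backoff score of `word` following `context` (up to 3 words used)."""
    penalty = 1.0
    context = tuple(context[-3:])
    while context:
        joint = counts[context + (word,)]
        if joint > 0:
            return penalty * joint / counts[context]
        context = context[1:]   # drop the oldest word and back off
        penalty *= LAMBDA
    return penalty * counts[(word,)] / total_tokens  # unigram level

# Toy corpus, invented for illustration.
sentences = ["i want to go to the store", "we went to the store", "go to the beach"]
counts = build_counts(sentences)
total = sum(c for gram, c in counts.items() if len(gram) == 1)

print(score("store", ("go", "to", "the"), counts, total))    # 4-gram hit: 1/2
print(score("beach", ("went", "to", "the"), counts, total))  # one backoff: 0.4 * 1/3
print(score("want", ("xyz",), counts, total))                # unigram: 0.4 * 1/16
```

These scores are not normalized probabilities; they are only meant for ranking candidate words against each other.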

Slide 3: Data Pipeline & Model Size

Processing Pipeline

Raw Corpus (4.3M lines)
    ↓ Sample 5%
    ↓ Lowercase · Remove URLs, @mentions, #hashtags
    ↓ Remove punctuation, normalize whitespace
    ↓ Tokenize into N-grams (tidytext)
    ↓ Count frequencies · Prune (freq < 2)
    ↓ Split into prefix → next_word tables
    ↓ Save as compressed .rds files
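
The cleaning, counting, pruning, and splitting steps can be sketched as below. The actual pipeline uses tidytext in R, so this Python version is only an approximation of the same ideas; the regular expressions, the choice to keep apostrophes, and all function names are assumptions.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"[@#]\w+")        # @mentions and #hashtags
PUNCT_RE = re.compile(r"[^\w\s']")     # assumption: apostrophes are kept ("it's")

def clean(text):
    """Lowercase, strip URLs/mentions/hashtags/punctuation, normalize whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)       # URLs go first, before punctuation is removed
    text = TAG_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    return " ".join(text.split())

def ngram_counts(sentences, n):
    """Count n-grams (tuples of n words) over cleaned sentences."""
    counts = Counter()
    for s in sentences:
        tokens = s.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts

def prune(counts, min_freq=2):
    """Drop n-grams seen fewer than `min_freq` times."""
    return {gram: c for gram, c in counts.items() if c >= min_freq}

def split_prefix(counts):
    """Map prefix -> [(next_word, freq), ...], most frequent first."""
    table = {}
    for gram, c in counts.items():
        table.setdefault(gram[:-1], []).append((gram[-1], c))
    for candidates in table.values():
        candidates.sort(key=lambda wc: (-wc[1], wc[0]))
    return table

# Toy input, invented for illustration.
raw = ["Go to the store! http://t.co/abc", "go to the STORE @me", "Go to the beach #sun"]
cleaned = [clean(line) for line in raw]
pruned = prune(ngram_counts(cleaned, 3))
table = split_prefix(pruned)
print(table)
```

Pruning singletons is what keeps the tables small: in this toy input the trigram "to the beach" appears once and is dropped, so only "store" survives as a continuation of "to the".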

N-gram Table Sizes (after pruning)

Table        Rows     File Size
Unigrams     ~50K     ~0.5 MB
Bigrams      ~400K    ~4 MB
Trigrams     ~600K    ~6 MB
Quadgrams    ~500K    ~5 MB
Total                 ~16 MB

✅ Fits comfortably within ShinyApps.io 1 GB free tier

Slide 4: The Shiny App (Demo & Usage)

Live App: https://YOUR_NAME.shinyapps.io/nextword-predictor

How to use it:

  1. Type any English phrase in the text box
  2. Press “Predict →” or hit Enter
  3. The app returns:
    • 🎯 Top prediction (most likely next word)
    • 💡 Alternative suggestions (up to 4 more)
    • 📝 Full phrase preview with the predicted word
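
The three outputs listed above map naturally onto a small result structure. The sketch below is a hypothetical Python illustration, not the Shiny app's code: `build_response` is an invented name, and every candidate word other than "store" is made up for the example.

```python
def build_response(phrase, ranked_words, max_alternatives=4):
    """Package a ranked candidate list into the three fields the app displays."""
    if not ranked_words:
        raise ValueError("ranked_words must contain at least one candidate")
    top, *rest = ranked_words
    return {
        "top": top,                               # most likely next word
        "alternatives": rest[:max_alternatives],  # up to 4 more suggestions
        "preview": f"{phrase.strip()} {top}",     # phrase with the prediction appended
    }

result = build_response(
    "I want to go to the ",
    ["store", "gym", "beach", "park", "mall", "bank"],  # invented ranking
)
print(result["preview"])       # I want to go to the store
print(result["alternatives"])  # ['gym', 'beach', 'park', 'mall']
```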

Test Results on 5 Real Phrases

Phrase                     Prediction   ✓
“I want to go to the”      store        ✅
“The weather outside is”   cold         ✅
“Happy birthday to”        you          ✅
“I love you so”            much         ✅
“Let me know what you”     think        ✅

Response time: < 500 ms on shinyapps.io free tier
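
A check like the one in the table can be automated with a small harness. The Python sketch below is an assumption about how such a test could look, not the procedure actually used: `evaluate` is a hypothetical helper, and `stub_predict` is a lookup table standing in for the real model, so its perfect score demonstrates nothing about model quality.

```python
import time

def evaluate(predict, cases):
    """Return (accuracy, worst_latency_ms) over (phrase, expected_word) pairs."""
    hits, worst_ms = 0, 0.0
    for phrase, expected in cases:
        start = time.perf_counter()
        guess = predict(phrase)
        worst_ms = max(worst_ms, (time.perf_counter() - start) * 1000)
        hits += guess == expected
    return hits / len(cases), worst_ms

# The five phrases from the table above.
cases = [
    ("I want to go to the", "store"),
    ("The weather outside is", "cold"),
    ("Happy birthday to", "you"),
    ("I love you so", "much"),
    ("Let me know what you", "think"),
]

# Stand-in predictor: replace with a call into the real model.
stub_predict = dict(cases).get

accuracy, worst_ms = evaluate(stub_predict, cases)
print(f"accuracy={accuracy:.0%}")
```

This times only the function call itself; a figure measured against the deployed app would also include network and server overhead.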

Slide 5: Results, Limitations & Future Work

What works well

  • ✅ Handles arbitrary English input phrases
  • ✅ Graceful backoff: always returns a prediction
  • ✅ Fast: sub-second response time
  • ✅ Lightweight: ~16 MB model, no GPU required
  • ✅ Clean UI with example phrases built in

Limitations

  • ❌ No semantic understanding (pure frequency-based)
  • ❌ Out-of-vocabulary words match no stored prefix, so prediction falls back to shorter contexts or the top unigram
  • ❌ 5% sample limits coverage of rare phrases

Future Improvements

Upgrade                   Benefit
Kneser-Ney smoothing      Better probability estimates for rare words
LSTM / Transformer        True semantic context
Larger sample (20-50%)    Better coverage
User feedback loop        Personalized predictions

Try it now → https://YOUR_NAME.shinyapps.io/nextword-predictor
Source code → https://github.com/YOUR_NAME/nextword-predictor