Vikas Parmar
April 2026
Why does next-word prediction matter?
The Challenge
Dataset Used
Why N-grams?
N-grams are fast, interpretable, and effective for next-word prediction without requiring GPUs.
Stupid Backoff (Brants et al., 2007)
Input: "I want to go to the"
→ Try 4-gram: match "go to the" → predict "store" ✓
→ If no match, try 3-gram: "to the" → predict "store"
→ If no match, try 2-gram: "the" → predict "same"
→ If no match, return the top unigram: "the"
Scoring Formula
| Level | Score |
|---|---|
| 4-gram match | freq(w4 \| w1 w2 w3) |
| 3-gram match | 0.4 × freq(w3 \| w1 w2) |
| 2-gram match | 0.4² × freq(w2 \| w1) |
| Fallback | top unigram probability |
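A minimal sketch of this backoff in R, assuming each pruned table is a data frame with columns `prefix`, `next_word`, and `freq` (these names, and `predict_next()` itself, are illustrative, not the app's actual code):

```r
# Hypothetical schema: each table has columns prefix, next_word, freq.
predict_next <- function(phrase, quadgrams, trigrams, bigrams, unigrams) {
  words  <- strsplit(tolower(phrase), "\\s+")[[1]]
  tables <- list(quadgrams, trigrams, bigrams)
  plens  <- c(3, 2, 1)                        # prefix length at each level
  for (i in seq_along(tables)) {
    if (length(words) < plens[i]) next
    prefix <- paste(tail(words, plens[i]), collapse = " ")
    hits   <- tables[[i]][tables[[i]]$prefix == prefix, ]
    if (nrow(hits) > 0) {
      # Stupid Backoff score: 0.4^(levels backed off) * relative frequency.
      # The 0.4 factor only changes rankings when candidates from several
      # levels are merged; with first-match-wins it is kept for clarity.
      hits$score <- 0.4^(i - 1) * hits$freq / sum(hits$freq)
      return(hits$next_word[which.max(hits$score)])
    }
  }
  unigrams$next_word[which.max(unigrams$freq)]  # fallback: most frequent word
}
```

On the example above, `predict_next("I want to go to the", ...)` returns "store" whenever the pruned quadgram table contains the "go to the" prefix.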
Processing Pipeline
Raw Corpus (4.3M lines)
↓ Sample 5%
↓ Lowercase · Remove URLs, @mentions, #hashtags
↓ Remove punctuation, normalize whitespace
↓ Tokenize into N-grams (tidytext)
↓ Count frequencies · Prune (freq < 2)
↓ Split into prefix → next_word tables
↓ Save as compressed .rds files
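A condensed sketch of the quadgram stage of this pipeline, assuming the sampled corpus already sits in a character vector `lines` (object and file names are illustrative):

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)

# Clean: lowercase, strip URLs/@mentions/#hashtags and punctuation, squeeze spaces
clean <- tibble(text = lines) %>%
  mutate(text = str_to_lower(text),
         text = str_remove_all(text, "https?://\\S+|@\\w+|#\\w+"),
         text = str_remove_all(text, "[[:punct:]]"),
         text = str_squish(text))

# Tokenize into 4-grams, count, prune singletons, split into prefix/next_word
quadgrams <- clean %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 4) %>%
  filter(!is.na(ngram)) %>%                 # lines shorter than 4 tokens yield NA
  count(ngram, name = "freq") %>%
  filter(freq >= 2) %>%                     # prune: drop freq < 2
  separate(ngram, into = c("w1", "w2", "w3", "w4"), sep = " ") %>%
  mutate(prefix = paste(w1, w2, w3), next_word = w4) %>%
  select(prefix, next_word, freq)

saveRDS(quadgrams, "quadgrams.rds", compress = "xz")  # compact on-disk format
```

The bigram and trigram tables follow the same pattern with `n = 2` and `n = 3`.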
N-gram Table Sizes (after pruning)
| Table | Rows | File Size |
|---|---|---|
| Unigrams | ~50K | ~0.5 MB |
| Bigrams | ~400K | ~4 MB |
| Trigrams | ~600K | ~6 MB |
| Quadgrams | ~500K | ~5 MB |
| Total | ~1.55M | ~16 MB |
✓ Fits comfortably within the ShinyApps.io 1 GB free tier
Live App: https://YOUR_NAME.shinyapps.io/nextword-predictor
How to use it: enter a phrase and the app returns its predicted next word.
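A minimal sketch of how such a Shiny front end could wire the input box to the predictor, reusing the hypothetical `predict_next()` from above (illustrative, not the deployed app's source):

```r
library(shiny)

# Tables are loaded once at startup from the compressed .rds files
quadgrams <- readRDS("quadgrams.rds"); trigrams <- readRDS("trigrams.rds")
bigrams   <- readRDS("bigrams.rds");   unigrams <- readRDS("unigrams.rds")

ui <- fluidPage(
  textInput("phrase", "Type a phrase:", placeholder = "I want to go to the"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)   # wait until the user has typed something
    predict_next(input$phrase, quadgrams, trigrams, bigrams, unigrams)
  })
}

shinyApp(ui, server)
```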
Test Results on 5 Real Phrases
| Phrase | Prediction | Correct |
|---|---|---|
| "I want to go to the" | store | ✓ |
| "The weather outside is" | cold | ✓ |
| "Happy birthday to" | you | ✓ |
| "I love you so" | much | ✓ |
| "Let me know what you" | think | ✓ |
Response time: < 500 ms on the shinyapps.io free tier
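The spot-check can be reproduced locally with a loop like the following, again assuming the `predict_next()` sketch and loaded tables (timings vary with hardware):

```r
phrases <- c("I want to go to the", "The weather outside is",
             "Happy birthday to",   "I love you so",
             "Let me know what you")

for (p in phrases) {
  t0   <- Sys.time()
  pred <- predict_next(p, quadgrams, trigrams, bigrams, unigrams)
  ms   <- as.numeric(difftime(Sys.time(), t0, units = "secs")) * 1000
  cat(sprintf("%-24s -> %-6s (%.0f ms)\n", p, pred, ms))
}
```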
What works well
Limitations
Future Improvements
| Upgrade | Benefit |
|---|---|
| Kneser-Ney smoothing | Better probability estimates for rare and unseen words |
| LSTM / Transformer | True semantic context |
| Larger sample (20-50%) | Better coverage |
| User feedback loop | Personalized predictions |