Vikas Parmar
April 2026
Why does next-word prediction matter?
The Challenge
Dataset Used
Why N-grams?
N-grams are fast, interpretable, and effective for next-word prediction without requiring GPUs.
Stupid Backoff (Brants et al., 2007)
Input: "I want to go to the"
→ Try 4-gram: match "go to the" → predict "store" ✓
→ If no match, try 3-gram: "to the" → predict "store"
→ If no match, try 2-gram: "the" → predict "same"
→ If no match, return the top unigram: "the"
Scoring Formula
| Level | Score |
|---|---|
| 4-gram match | freq(w4 \| w1 w2 w3) |
| 3-gram match | 0.4 × freq(w3 \| w1 w2) |
| 2-gram match | 0.4² × freq(w2 \| w1) |
| Fallback | top unigram probability |
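A minimal sketch of this backoff in R, assuming each pruned table is a data frame with columns `prefix`, `next_word`, and `freq` (these names, and `predict_next()` itself, are illustrative, not the app's actual code):

```r
# Hypothetical schema: each table has columns prefix, next_word, freq.
predict_next <- function(phrase, quadgrams, trigrams, bigrams, unigrams) {
  words  <- strsplit(tolower(phrase), "\\s+")[[1]]
  tables <- list(quadgrams, trigrams, bigrams)
  plens  <- c(3, 2, 1)                        # prefix length at each level
  for (i in seq_along(tables)) {
    if (length(words) < plens[i]) next
    prefix <- paste(tail(words, plens[i]), collapse = " ")
    hits   <- tables[[i]][tables[[i]]$prefix == prefix, ]
    if (nrow(hits) > 0) {
      # Stupid Backoff score: 0.4^(levels backed off) * relative frequency.
      # The 0.4 factor only changes rankings when candidates from several
      # levels are merged; with first-match-wins it is kept for clarity.
      hits$score <- 0.4^(i - 1) * hits$freq / sum(hits$freq)
      return(hits$next_word[which.max(hits$score)])
    }
  }
  unigrams$next_word[which.max(unigrams$freq)]  # fallback: most frequent word
}
```

On the example above, `predict_next("I want to go to the", ...)` returns "store" whenever the pruned quadgram table contains the "go to the" prefix.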
Processing Pipeline
Raw Corpus (4.3M lines)
↓ Sample 5%
↓ Lowercase · Remove URLs, @mentions, #hashtags
↓ Remove punctuation, normalize whitespace
↓ Tokenize into N-grams (tidytext)
↓ Count frequencies · Prune (freq < 2)
↓ Split into prefix → next_word tables
↓ Save as compressed .rds files
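A condensed sketch of the quadgram stage of this pipeline, assuming the sampled corpus already sits in a character vector `lines` (object and file names are illustrative):

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)

# Clean: lowercase, strip URLs/@mentions/#hashtags and punctuation, squeeze spaces
clean <- tibble(text = lines) %>%
  mutate(text = str_to_lower(text),
         text = str_remove_all(text, "https?://\\S+|@\\w+|#\\w+"),
         text = str_remove_all(text, "[[:punct:]]"),
         text = str_squish(text))

# Tokenize into 4-grams, count, prune singletons, split into prefix/next_word
quadgrams <- clean %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 4) %>%
  filter(!is.na(ngram)) %>%                 # lines shorter than 4 tokens yield NA
  count(ngram, name = "freq") %>%
  filter(freq >= 2) %>%                     # prune: drop freq < 2
  separate(ngram, into = c("w1", "w2", "w3", "w4"), sep = " ") %>%
  mutate(prefix = paste(w1, w2, w3), next_word = w4) %>%
  select(prefix, next_word, freq)

saveRDS(quadgrams, "quadgrams.rds", compress = "xz")  # compact on-disk format
```

The bigram and trigram tables follow the same pattern with `n = 2` and `n = 3`.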
N-gram Table Sizes (after pruning)
| Table | Rows | File Size |
|---|---|---|
| Unigrams | ~50K | ~0.5 MB |
| Bigrams | ~400K | ~4 MB |
| Trigrams | ~600K | ~6 MB |
| Quadgrams | ~500K | ~5 MB |
| Total | ~1.55M | ~16 MB |
✓ Fits comfortably within the ShinyApps.io 1 GB free tier
Live App: https://YOUR_NAME.shinyapps.io/nextword-predictor
How to use it: enter a phrase and the app returns its predicted next word.
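A minimal sketch of how such a Shiny front end could wire the input box to the predictor, reusing the hypothetical `predict_next()` from above (illustrative, not the deployed app's source):

```r
library(shiny)

# Tables are loaded once at startup from the compressed .rds files
quadgrams <- readRDS("quadgrams.rds"); trigrams <- readRDS("trigrams.rds")
bigrams   <- readRDS("bigrams.rds");   unigrams <- readRDS("unigrams.rds")

ui <- fluidPage(
  textInput("phrase", "Type a phrase:", placeholder = "I want to go to the"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)   # wait until the user has typed something
    predict_next(input$phrase, quadgrams, trigrams, bigrams, unigrams)
  })
}

shinyApp(ui, server)
```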
Test Results on 5 Real Phrases
| Phrase | Prediction | Correct |
|---|---|---|
| "I want to go to the" | store | ✓ |
| "The weather outside is" | cold | ✓ |
| "Happy birthday to" | you | ✓ |
| "I love you so" | much | ✓ |
| "Let me know what you" | think | ✓ |
Response time: < 500 ms on the shinyapps.io free tier
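The spot-check can be reproduced locally with a loop like the following, again assuming the `predict_next()` sketch and loaded tables (timings vary with hardware):

```r
phrases <- c("I want to go to the", "The weather outside is",
             "Happy birthday to",   "I love you so",
             "Let me know what you")

for (p in phrases) {
  t0   <- Sys.time()
  pred <- predict_next(p, quadgrams, trigrams, bigrams, unigrams)
  ms   <- as.numeric(difftime(Sys.time(), t0, units = "secs")) * 1000
  cat(sprintf("%-24s -> %-6s (%.0f ms)\n", p, pred, ms))
}
```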
What works well
Limitations
Future Improvements
| Upgrade | Benefit |
|---|---|
| Kneser-Ney smoothing | Better probability estimates for rare and unseen words |
| LSTM / Transformer | True semantic context |
| Larger sample (20-50%) | Better coverage |
| User feedback loop | Personalized predictions |