Rahul Vijayraghavan
March 2026
Johns Hopkins Data Science Capstone — Course 10
Live App: https://rahulvijay97.shinyapps.io/NextWordPredictor/
Predicting the next word is harder than it looks
Out of ~10 billion possible trigrams, a 60M-word corpus observes only ~5M. The model must always return an answer — even for unseen phrases.
Solution: Train on the full corpus, then back off gracefully using a frequency-weighted fallback chain.
Trigram model with Stupid Backoff (Brants et al., 2007)
Given user input "the cat sat":
1. Look up ("cat", "sat") → w3 in trigrams [score = count]
2. If < 3 found: fall back to ("sat") → w3 in bigrams [score = count × 0.4]
3. If still < 3: return top unigrams [score = count × 0.16, i.e. 0.4 × 0.4]
Always return top 3 by score — de-duplicated.
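The three steps above can be sketched in base R as follows. This is a minimal illustration, not the app's actual code: the function name `backoff_sketch`, the toy count tables, and the rule that a duplicated word keeps its highest-order score are all assumptions made for the example. The deployed app uses data.table, which this sketch avoids so that it runs without extra packages.

```r
# Toy count tables with made-up numbers, for illustration only.
trigrams <- data.frame(
  w1 = c("cat", "cat"), w2 = c("sat", "sat"), w3 = c("on", "down"),
  count = c(12, 5), stringsAsFactors = FALSE
)
bigrams <- data.frame(
  w1 = c("sat", "sat", "sat"), w2 = c("on", "in", "down"),
  count = c(40, 20, 9), stringsAsFactors = FALSE
)
unigrams <- data.frame(
  w = c("the", "to", "and"), count = c(1000, 800, 700),
  stringsAsFactors = FALSE
)

# Hypothetical helper: w1 and w2 are the last two words typed by the user.
backoff_sketch <- function(w1, w2, n = 3, lambda = 0.4) {
  # Step 1: trigram matches, scored by raw count.
  tri <- trigrams[trigrams$w1 == w1 & trigrams$w2 == w2, ]
  cand <- data.frame(word = tri$w3, score = tri$count, stringsAsFactors = FALSE)

  # Step 2: fewer than n candidates, so add bigram matches at count * lambda.
  if (nrow(cand) < n) {
    bi <- bigrams[bigrams$w1 == w2, ]
    cand <- rbind(cand, data.frame(word = bi$w2, score = bi$count * lambda,
                                   stringsAsFactors = FALSE))
    cand <- cand[!duplicated(cand$word), ]  # keep the higher-order entry
  }

  # Step 3: still fewer than n, so add top unigrams at count * lambda^2.
  if (nrow(cand) < n) {
    cand <- rbind(cand, data.frame(word = unigrams$w,
                                   score = unigrams$count * lambda^2,
                                   stringsAsFactors = FALSE))
    cand <- cand[!duplicated(cand$word), ]
  }

  # Return the n best-scoring words.
  cand <- cand[order(-cand$score), ]
  head(cand$word, n)
}

backoff_sketch("cat", "sat")  # "on" "in" "down"
backoff_sketch("foo", "bar")  # "the" "to" "and"
```

In the first call the bigram candidate "in" (20 × 0.4 = 8) outranks the trigram candidate "down" (5), because raw counts from different n-gram orders share one score scale. Brants et al. score with relative frequencies rather than raw counts, which avoids that effect at the cost of one extra division per candidate.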
Model trained on the full HC Corpora English dataset:
| Table | Rows kept | Compressed size |
|---|---|---|
| Unigrams | 50,000 | ~3 MB |
| Bigrams | 500,000 | ~20 MB |
| Trigrams | 500,000 | ~28 MB |
| Total | 1,050,000 | < 55 MB |
Zipf's Law: the top 50,000 words cover > 99% of all word instances. Rare n-grams (count < 2, i.e. seen only once) are pruned, cutting table size by ~65% with no loss in accuracy.
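Both ideas, vocabulary coverage and pruning, reduce to a few lines of base R. The sketch below uses hypothetical counts and hypothetical helper names (`coverage`, `prune`); it shows the calculation only and does not reproduce the figures quoted above.

```r
# Hypothetical word counts with a rough Zipf-like decay (sum = 1000).
word_counts <- c(the = 500, of = 250, and = 170, cat = 50, sat = 20, zyzzyva = 10)

# Share of all word instances covered by the k most frequent words.
coverage <- function(counts, k) {
  sorted <- sort(counts, decreasing = TRUE)
  sum(head(sorted, k)) / sum(sorted)
}
coverage(word_counts, 3)  # 920 / 1000 = 0.92

# Drop n-grams seen fewer than min_count times (min_count = 2 removes singletons).
prune <- function(tbl, min_count = 2) {
  tbl[tbl$count >= min_count, , drop = FALSE]
}

toy_trigrams <- data.frame(
  ngram = c("a b c", "d e f", "g h i"),
  count = c(5, 1, 2),
  stringsAsFactors = FALSE
)
nrow(prune(toy_trigrams))  # 2 rows survive; "d e f" is dropped
```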
Features
```r
# Under the hood: one function call per keystroke
predict_next_word("I went to the", n_suggestions = 3)
# [1] "store" "hospital" "gym"
```
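Before any lookup, the raw input has to be reduced to the last two words, since a trigram model conditions only on those. A minimal sketch of that step, assuming lowercase-only tokens and a hypothetical helper name `last_two_words`; the app lists stringr in its stack, while this version sticks to base R so it runs without extra packages:

```r
# Hypothetical helper: reduce raw input to its last two lowercase tokens.
last_two_words <- function(text) {
  cleaned <- tolower(text)
  cleaned <- gsub("[^a-z' ]", " ", cleaned)        # keep letters, apostrophes, spaces
  tokens <- strsplit(trimws(cleaned), " +")[[1]]   # split on runs of spaces
  tokens <- tokens[tokens != ""]
  tail(tokens, 2)
}

last_two_words("I went to the")  # "to" "the"
last_two_words("Hello, World!")  # "hello" "world"
```

A one-word input returns a single token and an empty input returns none, so a caller would go straight to the bigram or unigram step in those cases.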
Tech stack: R · Shiny · shinythemes (flatly) · data.table · stringr
Accuracy on a 10% held-out test set (standard n-gram baseline):
| Metric | Score |
|---|---|
| Top-1 accuracy | ~15 – 18% |
| Top-3 accuracy | ~25 – 32% |
| Prediction time | < 10 ms |
Competitive with published trigram baselines; fast enough for real-time UX.
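Top-k accuracy is the fraction of held-out cases in which the true next word appears among the first k suggestions. The sketch below shows one way to compute it. The function `top_k_accuracy`, the stand-in predictor, and the four test cases are hypothetical and unrelated to the scores reported above.

```r
# Fraction of test cases whose true next word is among the top-k predictions.
top_k_accuracy <- function(predict_fn, contexts, targets, k = 3) {
  hits <- mapply(function(ctx, target) {
    target %in% head(predict_fn(ctx), k)
  }, contexts, targets)
  mean(hits)
}

# Stand-in predictor (hypothetical): ignores the context entirely.
dummy_predict <- function(ctx) c("the", "to", "and")

contexts <- c("i went to", "the cat sat", "thanks for", "see you")
targets  <- c("the", "on", "the", "and")

top_k_accuracy(dummy_predict, contexts, targets, k = 1)  # 0.5
top_k_accuracy(dummy_predict, contexts, targets, k = 3)  # 0.75
```

Wrapping the same call in `system.time()` gives a rough per-prediction latency once the elapsed time is divided by the number of test cases.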
Open the app and test these five phrases:
Live app: https://rahulvijay97.shinyapps.io/NextWordPredictor/
Source code available on GitHub upon request.