Data Science Student | Johns Hopkins Capstone
June 2026
https://akshaisuresh.shinyapps.io/Capstone_Project/
What it does: Predicts your next word as you type, just like the
autocomplete bar on a smartphone keyboard, powered entirely by data science.
How to use it:
Works on desktop and mobile browsers. No login required.
Given input "I want to ___", the model:
| Step | Action | Score |
|---|---|---|
| 1 | Look up quadgrams starting with "I want to" | freq(quad)/freq(tri) |
| 2 | No match → try trigrams starting with "want to" | × 0.4 |
| 3 | No match → try bigrams starting with "to" | × 0.4² |
| 4 | No match → top unigrams | × 0.4³ |
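
The back-off steps in the table can be sketched in a few lines. This is a minimal Python illustration, not the app's R code: the function names `build_counts` and `predict`, the whitespace tokenisation, and the linear scan over n-grams are all assumptions made for brevity. The 0.4 penalty per back-off step comes from the table; the scheme resembles what is often called "stupid backoff".

```python
from collections import Counter

ALPHA = 0.4  # penalty applied once per back-off step (value from the table)


def build_counts(sentences, max_n=4):
    """Count every n-gram of order 1..max_n in already-tokenised sentences."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def predict(text, counts, max_n=4, top_k=3):
    """Return up to top_k (word, score, order) triples from the highest
    n-gram order that has a match for the end of `text`."""
    tokens = text.lower().split()  # hypothetical tokeniser: lowercase + split
    for n in range(max_n, 0, -1):
        context = tuple(tokens[-(n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # not enough typed words for this order
        # Denominator: frequency of the context, or all unigrams at order 1.
        denom = counts[n - 1][context] if n > 1 else sum(counts[1].values())
        if denom == 0:
            continue  # context never seen: back off to a shorter one
        penalty = ALPHA ** (max_n - n)
        candidates = [
            (gram[-1], penalty * c / denom, n)
            for gram, c in counts[n].items()
            if gram[:-1] == context
        ]
        if candidates:
            # Highest score first; ties broken alphabetically for stability.
            candidates.sort(key=lambda t: (-t[1], t[0]))
            return candidates[:top_k]
    return []


# Toy corpus, invented for this example.
sentences = [
    ["i", "want", "to", "go"],
    ["i", "want", "to", "eat"],
    ["i", "want", "to", "go"],
    ["we", "need", "to", "sleep"],
]
counts = build_counts(sentences)
print(predict("I want to", counts))     # quadgram match: "go" ranked above "eat"
print(predict("they need to", counts))  # backs off once, to the trigram "need to sleep"
```

Returning the order alongside each word is what lets a UI show which n-gram level produced the prediction. A production version would index n-grams by context instead of scanning every entry.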
| Source | Lines | Words | Style |
|---|---|---|---|
| Blogs | 899K | 37M | Long-form, personal |
| News | 1.01M | 34M | Formal, structured |
| Twitter | 2.36M | 30M | Short, conversational |
Training used a 10% random sample (seed = 42) for speed and memory.
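
A seeded sample like this is straightforward to reproduce. The sketch below is a Python illustration only; it assumes each line is kept independently with probability 0.10, which may differ from how the app's R code drew its sample. The function name `sample_lines` is hypothetical.

```python
import random


def sample_lines(lines, fraction=0.10, seed=42):
    """Keep each line independently with probability `fraction`.

    A private Random instance seeded with `seed` makes the selection
    repeatable without touching the global random state.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


corpus = [f"line {i}" for i in range(10_000)]  # stand-in for the real text files
sample = sample_lines(corpus)
print(len(sample))  # close to 1,000; identical on every run because of the seed
```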
127 words → 50% of all text covered
6,694 words → 90% of all text covered
This means a vocabulary of ~10,000 words covers more than 90% of what a user will type. Rarer words are pruned without meaningfully hurting accuracy.
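
Coverage figures of this kind come from sorting words by frequency and accumulating their counts until a target share of all tokens is reached. A minimal Python sketch follows; `words_needed_for_coverage` is a hypothetical name and the toy token list is invented for the example.

```python
from collections import Counter


def words_needed_for_coverage(tokens, target):
    """Smallest number of most-frequent distinct words whose combined
    count reaches at least `target` (0..1) of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= target * total:
            return rank
    return len(counts)


tokens = ["the"] * 5 + ["a"] * 3 + ["cat", "dog"]
print(words_needed_for_coverage(tokens, 0.5))  # 1: "the" alone is half the tokens
print(words_needed_for_coverage(tokens, 0.8))  # 2: "the" and "a" make up 8 of 10
```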
"Feels like SwiftKey but in a browser. Start typing any news headline or tweet; by the third word, predictions are already on target."
Test phrases (try these):
"the president of the" → United States
"happy new" → year
"thanks for" → the / your / sharing
"looking forward to" → seeing / hearing / working

| Metric | Result |
|---|---|
| Top-1 accuracy | ~15–17% |
| Top-3 accuracy | ~30–35% |
| Avg prediction time | < 5 ms |
| Model RAM footprint | ~150 MB |
| Training data | ~102M words |
Benchmarked on 5,000 held-out Twitter sentences.
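
Top-1 and top-3 accuracy can be measured by checking whether the true next word appears among the first k predictions. The Python sketch below assumes a predictor that returns a ranked list of words for a context; `top_k_accuracy`, the lookup table, and the examples are invented for illustration and are not the benchmark data.

```python
def top_k_accuracy(examples, predictor, k):
    """Share of (context, next_word) pairs whose true next word is among
    the first k words returned by predictor(context)."""
    if not examples:
        return 0.0
    hits = sum(1 for context, target in examples if target in predictor(context)[:k])
    return hits / len(examples)


# Toy predictor backed by a fixed lookup table (hypothetical data).
table = {
    "happy new": ["year", "day", "home"],
    "thanks for": ["the", "your", "sharing"],
}


def toy_predictor(context):
    return table.get(context, [])


examples = [
    ("happy new", "year"),
    ("thanks for", "sharing"),
    ("thanks for", "nothing"),
    ("unknown context", "word"),
]
print(top_k_accuracy(examples, toy_predictor, 1))  # 0.25
print(top_k_accuracy(examples, toy_predictor, 3))  # 0.5
```

For a held-out sentence, each position after the first word yields one (context, next word) pair, so a single sentence contributes several test cases.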
✓ Fully deployed: live URL, no setup needed
✓ Robust: always returns a prediction, falling back to top unigrams when nothing else matches
✓ Transparent: shows which n-gram order fired
✓ Extensible: swap in Kneser-Ney or a neural LM with zero UI changes
✓ Open source R: reproducible, documented, ready to scale
Built with R · data.table · Shiny · tidytext
Data: HC Corpora (SwiftKey / Johns Hopkins)