Next Word Predictor
========================================================
author: Tulsidai Singh
date: `r format(Sys.Date(), "%B %d, %Y")`
autosize: true
Every time you type, your phone guesses the next word.
Behind that guess is a language model trained on millions of sentences that learns which words tend to follow others.
This app does exactly that. It was built entirely in R from a 10% sample of the HC Corpora English dataset (~900,000 lines of blogs, news, and tweets).
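As a sketch of the preprocessing step, the frequency tables can be built with tidytext; the file name, column names, and pruning threshold below are illustrative, not the app's exact code.

```r
library(dplyr)
library(tidytext)

# Read a sampled corpus and count trigrams (file/column names are illustrative).
lines_df <- tibble(text = readLines("en_US.sample.txt", warn = FALSE))

trigrams <- lines_df |>
  unnest_tokens(ngram, text, token = "ngrams", n = 3) |>
  filter(!is.na(ngram)) |>           # lines shorter than n yield NA
  count(ngram, sort = TRUE) |>
  filter(n > 1)                      # prune singleton n-grams to shrink the tables
```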
Why does this matter?
The model is a Stupid Backoff n-gram model: fast, interpretable, and effective at this scale.
Why Stupid Backoff?
- No complex smoothing is required.
- Predictions run in milliseconds on pre-built frequency tables.
- Singleton n-grams are pruned to reduce the memory footprint.
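A minimal sketch of the Stupid Backoff scoring logic (Brants et al., 2007). The table layout (w1/w2/w3/count columns) and the helper name predict_next are assumptions for illustration; relative frequencies over the matching rows stand in for exact prefix counts.

```r
# Sketch of Stupid Backoff (table layout and function name are illustrative).
# Score = count(prefix + word) / count(prefix); each backoff step multiplies
# the score by a fixed discount lambda (0.4 in Brants et al., 2007).
predict_next <- function(prefix, trigrams, bigrams, unigrams, lambda = 0.4) {
  words <- tail(strsplit(tolower(prefix), "\\s+")[[1]], 2)

  # Try the trigram table first, matching on the last two words typed.
  if (length(words) == 2) {
    hits <- trigrams[trigrams$w1 == words[1] & trigrams$w2 == words[2], ]
    if (nrow(hits) > 0) {
      hits$score <- hits$count / sum(hits$count)  # relative frequency
      return(head(hits[order(-hits$score), c("w3", "score")], 3))
    }
  }

  # Back off to bigrams on the last word, discounting by lambda.
  hits <- bigrams[bigrams$w1 == tail(words, 1), ]
  if (nrow(hits) > 0) {
    hits$score <- lambda * hits$count / sum(hits$count)
    return(head(hits[order(-hits$score), c("w2", "score")], 3))
  }

  # Final fallback: the most frequent unigrams, discounted twice.
  unigrams$score <- lambda^2 * unigrams$count / sum(unigrams$count)
  head(unigrams[order(-unigrams$score), c("w1", "score")], 3)
}
```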
Corpus: 10% sample of HC Corpora English (~900K lines)
| Metric | Value |
|---|---|
| Vocabulary size | ~150,000 unique words |
| 50% word coverage | 131 unique words |
| 90% word coverage | 6,861 unique words |
| Top bigram | "of the" (26,000+ occurrences) |
| Prediction latency | < 1 second |
The sharp coverage cliff (131 → 6,861 words for 50% → 90%) confirms Zipf’s law and justifies aggressive singleton pruning without meaningful accuracy loss.
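For illustration, these coverage figures come from a cumulative-frequency pass over the sorted unigram counts; a sketch, assuming a word_counts data frame with word and count columns:

```r
# Sketch: how many distinct words are needed to cover a share of all tokens?
# Assumes word_counts is a data frame with `word` and `count` columns.
coverage_words <- function(word_counts, target = 0.5) {
  sorted <- word_counts[order(-word_counts$count), ]
  cum_share <- cumsum(sorted$count) / sum(sorted$count)
  which(cum_share >= target)[1]  # rank of the first word reaching the target
}

# coverage_words(word_counts, 0.5)  # ~131 words on this corpus
# coverage_words(word_counts, 0.9)  # ~6,861 words
```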
Live at tulsidai.shinyapps.io/en_US
Example output for “arctic monkeys this”: **weekend**, **time**, **year**
Key takeaways

Next Word Predictor demonstrates that a lightweight n-gram model built entirely in R can deliver fast, reasonable text predictions with no external dependencies.

Built with R · shiny · tidytext · shinyapps.io
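For completeness, a minimal sketch of how such a predictor could be wired into a Shiny app; the UI layout is illustrative, and predict_next refers to the hypothetical scorer sketched earlier:

```r
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)
    # predict_next() and the frequency tables are the sketches shown earlier.
    top <- predict_next(input$phrase, trigrams, bigrams, unigrams)
    paste(top[[1]], collapse = ", ")
  })
}

shinyApp(ui, server)
```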