Next Word Predictor

author: Tulsidai Singh
date: r format(Sys.Date(), "%B %d, %Y")
autosize: true

The Problem

Every time you type, your phone guesses the next word.

Behind that guess is a language model trained on millions of sentences that learns which words tend to follow others.

This app does exactly that. It is built entirely in R from a 10% sample of the HC Corpora English dataset (~900,000 lines of blogs, news, and tweets).
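A minimal sketch of how a reproducible 10% sample could be drawn, assuming the corpus has already been read into a character vector of lines. The function name `sample_lines`, the seed, and the stand-in corpus are hypothetical, not taken from the app's source.

```r
# Draw a reproducible random sample of lines from a character vector.
# `fraction` and `seed` are illustrative defaults, not the app's settings.
sample_lines <- function(lines, fraction = 0.10, seed = 1234) {
  set.seed(seed)  # fixed seed so the same sample can be regenerated
  keep <- sample(length(lines), size = round(length(lines) * fraction))
  lines[sort(keep)]  # keep the sampled lines in their original order
}

# Stand-in corpus; in practice `corpus` would come from readLines()
# on the blog, news, and tweet files.
corpus  <- sprintf("line %d", 1:1000)
sampled <- sample_lines(corpus)
length(sampled)
```

Fixing the seed means the frequency tables can be rebuilt from exactly the same lines later.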

Why does this matter?

How It Works

The model is a stupid backoff n-gram model: fast, interpretable, and effective at this scale.

At prediction time
1. Take the last 2 words → look up matching trigrams
2. No trigram found → fall back to the last word → look up bigrams
3. No bigram found → return the most frequent unigrams
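The three steps above can be sketched in base R as follows. This is an illustration under stated assumptions, not the app's actual code: the table layout (`context`, `word`, `n`), the helper names `top_words` and `predict_next`, and all counts are invented for the example.

```r
# Toy frequency tables with made-up counts; a real model would build
# these from the corpus. `context` holds the preceding word(s).
trigrams <- data.frame(
  context = c("see you", "see you", "thank you"),
  word    = c("soon", "later", "for"),
  n       = c(12, 9, 20),
  stringsAsFactors = FALSE
)
bigrams <- data.frame(
  context = c("you", "you", "this", "this", "this"),
  word    = c("are", "can", "weekend", "time", "year"),
  n       = c(50, 30, 15, 12, 10),
  stringsAsFactors = FALSE
)
unigrams <- data.frame(
  word = c("the", "to", "and"),
  n    = c(500, 300, 280),
  stringsAsFactors = FALSE
)

# Return up to k words for one context, most frequent first.
top_words <- function(tbl, ctx, k) {
  hits <- tbl[tbl$context == ctx, , drop = FALSE]
  hits <- hits[order(-hits$n), , drop = FALSE]
  head(hits$word, k)
}

predict_next <- function(phrase, k = 3) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  n_words <- length(words)

  # Step 1: last two words -> trigram lookup
  if (n_words >= 2) {
    out <- top_words(trigrams, paste(words[n_words - 1], words[n_words]), k)
    if (length(out) > 0) return(out)
  }
  # Step 2: last word -> bigram lookup
  if (n_words >= 1) {
    out <- top_words(bigrams, words[n_words], k)
    if (length(out) > 0) return(out)
  }
  # Step 3: nothing matched -> most frequent unigrams
  head(unigrams$word[order(-unigrams$n)], k)
}

predict_next("I will see you")       # trigram hit
predict_next("arctic monkeys this")  # falls back to bigrams
predict_next("zzz qqq")              # falls back to unigrams
```

This sketch returns the first level that matches. Stupid backoff as usually described also scores lower-order matches with a fixed discount (0.4 is the commonly cited value), which matters only if candidates from different levels are ranked together.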

Why stupid backoff?

  • No complex smoothing required
  • Runs in milliseconds on pre-built frequency tables
  • Singleton n-grams are pruned to reduce the memory footprint
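A rough sketch of how a frequency table might be pre-built and singletons pruned. It uses base R only to stay self-contained; the app lists tidytext among its tools, so its real pipeline likely differs. `count_ngrams`, `prune_singletons`, the tokenizing rule, and the sample lines are all assumptions made for illustration.

```r
# Count n-grams of order `n` in a character vector of lines.
# Tokenizing rule (an assumption): lowercase, split on anything
# that is not a letter or an apostrophe.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    words <- strsplit(tolower(line), "[^a-z']+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- table(grams)
  data.frame(ngram = as.character(names(counts)),
             n     = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Drop n-grams that occur only once.
prune_singletons <- function(freq) freq[freq$n > 1, , drop = FALSE]

lines  <- c("see you soon", "see you later", "see you soon my friend")
bigram_counts <- count_ngrams(lines, 2)
pruned <- prune_singletons(bigram_counts)
nrow(bigram_counts)  # distinct bigrams before pruning
nrow(pruned)         # distinct bigrams after pruning
```

Because most distinct n-grams in natural text occur only once, dropping them shrinks the tables considerably while removing little of the total count mass.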

Model Performance

Corpus: 10% sample of HC Corpora English (~900K lines)

| Metric | Value |
|---|---|
| Vocabulary size | ~150,000 unique words |
| 50% word coverage | 131 unique words |
| 90% word coverage | 6,861 unique words |
| Top bigram | "of the" (26,000+ occurrences) |
| Prediction latency | < 1 second |

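The steep jump in vocabulary needed (131 words for 50% coverage, 6,861 for 90%) is consistent with Zipf’s law: a small set of very frequent words accounts for most of the text. This supports aggressive singleton pruning with little expected loss in accuracy.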
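Coverage figures of this kind can be computed from a vector of word counts. The sketch below is illustrative only: `words_for_coverage` is a hypothetical helper, and the Zipf-like toy counts are generated, not taken from the corpus.

```r
# Number of distinct words, taken from most to least frequent,
# needed to cover at least `target` share of all word occurrences.
words_for_coverage <- function(counts, target) {
  sorted    <- sort(counts, decreasing = TRUE)
  cum_share <- cumsum(sorted) / sum(sorted)
  which(cum_share >= target)[1]
}

# Toy Zipf-like counts: the word at rank r occurs about 1000 / r times.
toy_counts <- round(1000 / seq_len(1000))
w50 <- words_for_coverage(toy_counts, 0.5)
w90 <- words_for_coverage(toy_counts, 0.9)
c(w50 = w50, w90 = w90)
```

On counts shaped like this, the 90% figure comes out many times larger than the 50% figure, the same pattern as in the table above.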

The App

Live at tulsidai.shinyapps.io/en_US

How to use it
1. Type any phrase into the text box
2. Press Predict
3. The top 3 predicted next words appear instantly, with the most likely prediction highlighted in pink

Example output for “arctic monkeys this”: weekend, time, year

Summary

Next Word Predictor demonstrates that a lightweight n-gram model built entirely in R can deliver fast, reasonable text predictions with no dependencies outside the R ecosystem.

Key takeaways

  • Trained on ~900K lines of real English text
  • Trigram → bigram → unigram backoff chain
  • Sub-second predictions, minimal memory footprint
  • Clean, accessible UI deployable on any device

Built with R · shiny · tidytext · shinyapps.io