Next-Word Prediction Explorer

Dominique Mühlbauer

2025-07-04

1. Next-Word Prediction: The Opportunity

2. How It Works

  1. Preprocessing & N-grams (see the first sketch after this list)
    • Corpus tokenized, lemmatized, and stop words removed
    • 1–4-grams built with frequency counts and cached on disk
  2. Interpolated Kneser-Ney Smoothing (see the formula after this list)
    • Subtracts a fixed discount (D = 0.75) from observed n-gram counts
    • Backs off across the 4→1 gram levels, interpolating with learned λ weights
  3. On-Demand, Indexed Storage (see the lookup sketch after this list)
    • Model persisted as Parquet with dictionary encoding
    • Arrow predicate pushdown reads only the needed contexts
    • Memoisation caches repeated lookups
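
A minimal sketch of the preprocessing step, assuming the quanteda package; the file and object names (corpus.txt, ngrams.rds) are illustrative, and lemmatization is only indicated in a comment, since it would require an external lemma lexicon via tokens_replace().

    # Tokenize, normalise, and count 1–4-grams (sketch; names are illustrative).
    library(quanteda)

    corpus_text <- readLines("corpus.txt")            # hypothetical input file

    toks <- tokens(corpus_text, remove_punct = TRUE, remove_numbers = TRUE)
    toks <- tokens_tolower(toks)
    toks <- tokens_remove(toks, stopwords("en"))      # stop-word removal
    # Lemmatization would go here, e.g. tokens_replace() with a lemma lexicon.

    # Build 1- to 4-gram frequency tables and cache them on disk.
    ngram_counts <- lapply(1:4, function(n) {
      ng <- tokens_ngrams(toks, n = n, concatenator = " ")
      sort(colSums(dfm(ng)), decreasing = TRUE)       # named count vector
    })
    saveRDS(ngram_counts, "ngrams.rds")               # disk cache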
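
For reference, one standard formulation of the interpolated Kneser-Ney recursion; in this closed form the λ weights follow directly from the discount, so the app's "learned" weights may be estimated differently:

    P_{\mathrm{KN}}(w \mid h) =
      \frac{\max\bigl(c(h\,w) - D,\, 0\bigr)}{c(h)}
      + \lambda(h)\, P_{\mathrm{KN}}(w \mid h'),
    \qquad
    \lambda(h) = \frac{D \cdot N_{1+}(h\,\bullet)}{c(h)}

Here h is the observed context (up to three preceding words), h' is h with its oldest word dropped, and N_{1+}(h •) is the number of distinct words seen after h; at the lower orders, raw counts c(·) are replaced by continuation counts.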
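
A sketch of the on-demand lookup path, assuming the arrow, dplyr, and memoise packages; the Parquet path and the column names context and prob are illustrative stand-ins for the app's actual schema.

    library(arrow)
    library(dplyr)
    library(memoise)

    # Writing with dictionary encoding (done once, offline):
    # write_parquet(model_df, "model/ngrams.parquet", use_dictionary = TRUE)

    # Open the persisted model lazily; nothing is read into memory yet.
    ngrams <- open_dataset("model/ngrams.parquet")

    lookup_raw <- function(ctx, k = 5) {
      ngrams |>
        filter(context == ctx) |>   # predicate pushed down to the Parquet scan
        arrange(desc(prob)) |>
        head(k) |>
        collect()                   # only the matching rows are materialised
    }

    # Memoisation: repeated lookups for the same context hit an in-memory cache.
    lookup <- memoise(lookup_raw)

    lookup("thank you for")         # first call scans the Parquet file
    lookup("thank you for")         # second call is answered from the cache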

3. Predictive Performance

4. Live Demo of the Shiny App

  1. Enter text in the sidebar; the last token is highlighted in real time.
  2. Adjust “Max n-gram order” to see trade-offs between context depth and speed.
  3. View top-k suggestions in the table and bar chart.
  4. Toggle to “Word Cloud” for a visual glimpse of candidate probabilities.
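
To make the demo steps concrete, here is a skeletal version of how such a Shiny UI could be wired up; all input IDs and the predict_next() helper are hypothetical stand-ins for the real app's code.

    library(shiny)

    ui <- fluidPage(
      sidebarLayout(
        sidebarPanel(
          textInput("phrase", "Enter text:"),
          sliderInput("order", "Max n-gram order", min = 1, max = 4, value = 4),
          radioButtons("view", "View as", choices = c("Table", "Word Cloud"))
        ),
        mainPanel(tableOutput("suggestions"))
      )
    )

    server <- function(input, output, session) {
      output$suggestions <- renderTable({
        req(input$phrase)
        # predict_next() is a hypothetical scoring helper returning the
        # top-k candidate words with their probabilities.
        predict_next(input$phrase, max_order = input$order, k = 5)
      })
    }

    shinyApp(ui, server)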