Next-Word Prediction Explorer

Dominique Mühlbauer

2025-07-04

1. Next-Word Prediction: The Opportunity

2. How It Works

  1. Preprocessing & N-grams (see the first sketch after this list)
    • Corpus tokenized, lemmatized, and stop words removed
    • 1–4-grams built with frequency counts and cached on disk
  2. Interpolated Kneser-Ney Smoothing (see the formula after this list)
    • Subtracts a fixed discount (D = 0.75) from observed n-gram counts
    • Backs off across the 4→1 gram levels, interpolating with learned λ weights
  3. On-Demand, Indexed Storage (see the lookup sketch after this list)
    • Model persisted as Parquet with dictionary encoding
    • Arrow predicate pushdown reads only the needed contexts
    • Memoisation caches repeated lookups
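
A minimal sketch of the preprocessing step, assuming the quanteda package; the file and object names (corpus.txt, ngrams.rds) are illustrative, and lemmatization is only indicated in a comment, since it would require an external lemma lexicon via tokens_replace().

    # Tokenize, normalise, and count 1–4-grams (sketch; names are illustrative).
    library(quanteda)

    corpus_text <- readLines("corpus.txt")            # hypothetical input file

    toks <- tokens(corpus_text, remove_punct = TRUE, remove_numbers = TRUE)
    toks <- tokens_tolower(toks)
    toks <- tokens_remove(toks, stopwords("en"))      # stop-word removal
    # Lemmatization would go here, e.g. tokens_replace() with a lemma lexicon.

    # Build 1- to 4-gram frequency tables and cache them on disk.
    ngram_counts <- lapply(1:4, function(n) {
      ng <- tokens_ngrams(toks, n = n, concatenator = " ")
      sort(colSums(dfm(ng)), decreasing = TRUE)       # named count vector
    })
    saveRDS(ngram_counts, "ngrams.rds")               # disk cache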
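
For reference, one standard formulation of the interpolated Kneser-Ney recursion; in this closed form the λ weights follow directly from the discount, so the app's "learned" weights may be estimated differently:

    P_{\mathrm{KN}}(w \mid h) =
      \frac{\max\bigl(c(h\,w) - D,\, 0\bigr)}{c(h)}
      + \lambda(h)\, P_{\mathrm{KN}}(w \mid h'),
    \qquad
    \lambda(h) = \frac{D \cdot N_{1+}(h\,\bullet)}{c(h)}

Here h is the observed context (up to three preceding words), h' is h with its oldest word dropped, and N_{1+}(h •) is the number of distinct words seen after h; at the lower orders, raw counts c(·) are replaced by continuation counts.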
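
A sketch of the on-demand lookup path, assuming the arrow, dplyr, and memoise packages; the Parquet path and the column names context and prob are illustrative stand-ins for the app's actual schema.

    library(arrow)
    library(dplyr)
    library(memoise)

    # Writing with dictionary encoding (done once, offline):
    # write_parquet(model_df, "model/ngrams.parquet", use_dictionary = TRUE)

    # Open the persisted model lazily; nothing is read into memory yet.
    ngrams <- open_dataset("model/ngrams.parquet")

    lookup_raw <- function(ctx, k = 5) {
      ngrams |>
        filter(context == ctx) |>   # predicate pushed down to the Parquet scan
        arrange(desc(prob)) |>
        head(k) |>
        collect()                   # only the matching rows are materialised
    }

    # Memoisation: repeated lookups for the same context hit an in-memory cache.
    lookup <- memoise(lookup_raw)

    lookup("thank you for")         # first call scans the Parquet file
    lookup("thank you for")         # second call is answered from the cache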

3. Predictive Performance

4. Live Demo of the Shiny App

  1. Enter text in the sidebar; the last token is highlighted in real time.
  2. Adjust “Max n-gram order” to see trade-offs between context depth and speed.
  3. View top-k suggestions in the table and bar chart.
  4. Toggle to “Word Cloud” for a visual glimpse of candidate probabilities.
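
To make the demo steps concrete, here is a skeletal version of how such a Shiny UI could be wired up; all input IDs and the predict_next() helper are hypothetical stand-ins for the real app's code.

    library(shiny)

    ui <- fluidPage(
      sidebarLayout(
        sidebarPanel(
          textInput("phrase", "Enter text:"),
          sliderInput("order", "Max n-gram order", min = 1, max = 4, value = 4),
          radioButtons("view", "View as", choices = c("Table", "Word Cloud"))
        ),
        mainPanel(tableOutput("suggestions"))
      )
    )

    server <- function(input, output, session) {
      output$suggestions <- renderTable({
        req(input$phrase)
        # predict_next() is a hypothetical scoring helper returning the
        # top-k candidate words with their probabilities.
        predict_next(input$phrase, max_order = input$order, k = 5)
      })
    }

    shinyApp(ui, server)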