Next-Word Predictor
Next-word prediction (Katz backoff) — aseemanand.shinyapps.io/Next-Word-Predictor
This app takes one or more words as input and predicts the next word (ranked candidates with scores and plots).
Building Next-Word Predictor entailed the full pipeline, from corpus exploration and sampling through n-gram table construction and Katz backoff prediction to accuracy evaluation (model_accuracy_report.Rmd).

Key R tooling in this repository includes data.table (n-gram tables and speed), shiny / shinythemes, ggplot2, and report workflows with rmarkdown (milestone EDA uses dplyr and ggplot2).
The prediction model uses the Katz back-off algorithm: it estimates P(next word | context) by combining evidence from higher-order n-grams and falling back to shorter histories when contexts are rare or missing (Katz, 1987).
Katz, S. M. (1987). Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3), 400–401.
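As a rough illustration of the backoff idea (not the repository's actual code), here is a minimal bigram-only sketch in R with data.table; the toy tables, column names, and the katz_bigram helper are assumptions made for this example.

```r
library(data.table)

# Toy n-gram tables; the real app builds much larger ones from the corpus.
# 'w1' is the context word, 'w2' the continuation, 'count' the corpus count.
unigrams <- data.table(w = c("the", "cat", "sat", "mat", "dog"),
                       count = c(50, 10, 8, 5, 7))
bigrams  <- data.table(w1 = c("the", "the", "cat", "sat"),
                       w2 = c("cat", "dog", "sat", "on"),
                       count = c(6, 3, 5, 4))

# Katz-style backoff at the bigram level with absolute discount D:
# observed continuations keep (count - D) / count(context); the mass
# removed by the discount is spread over words unseen after the context,
# in proportion to their unigram counts.
katz_bigram <- function(ctx, D = 0.75) {
  seen <- bigrams[w1 == ctx]
  if (nrow(seen) == 0L) {
    # Unseen context: back off all the way to the unigram distribution.
    return(unigrams[, .(word = w, prob = count / sum(count))][order(-prob)])
  }
  ctx_count  <- sum(seen$count)
  seen_probs <- seen[, .(word = w2, prob = (count - D) / ctx_count)]

  alpha  <- D * nrow(seen) / ctx_count          # leftover probability mass
  unseen <- unigrams[!w %in% seen$w2]
  unseen_probs <- unseen[, .(word = w, prob = alpha * count / sum(count))]

  rbindlist(list(seen_probs, unseen_probs))[order(-prob)]
}

katz_bigram("the")   # ranked candidates following the word "the"
```

Seen continuations keep their discounted relative frequencies, and the probability mass freed by the discount D is redistributed over unseen words in proportion to their unigram counts; D plays the same role as the discount parameter the app exposes in its UI.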
To obtain reasonable accuracy with acceptable response times in the browser, the implementation makes standard tradeoffs between model size and coverage:
- Pruning: the shipped model keeps only higher-order n-gram types with count ≥ 3 (full vs pruned training is compared in model_accuracy_report.Rmd); unigrams stay for vocabulary and backoff endpoints. A toy sketch of this pruning follows the list.
- Discount D: exposed in the Shiny UI as “Katz absolute discount D” so we can see sensitivity to the standard Katz discounting parameter (default 0.75 on the live app).
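To make the pruning concrete, here is a minimal data.table sketch; the ngrams table and its columns are hypothetical, not the repository's actual objects.

```r
library(data.table)

# Hypothetical raw n-gram counts: 'n' is the n-gram order.
ngrams <- data.table(
  n     = c(1L, 1L, 2L, 2L, 3L, 3L),
  gram  = c("the", "cat", "the cat", "cat sat", "the cat sat", "cat sat on"),
  count = c(50L, 10L, 2L, 6L, 1L, 4L)
)

# Keep every unigram (vocabulary and backoff endpoints), but drop rare
# higher-order n-gram types (count < 3) to shrink the shipped model.
pruned <- ngrams[n == 1L | count >= 3L]
```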
Input handling: the text is not segmented at sentence punctuation (. ! ?); the whole box is treated as one phrase.

Instructions: type one or more words and the app returns ranked next-word candidates with scores and plots.
Output (as on the live site): ranked candidate next words with scores, plus a bar plot of the displayed top-K.
Footer on the app: Scores are Katz backoff probabilities over an approximate candidate set (continuations seen in training plus frequent unigrams); bars renormalize across the displayed top-K for readability.
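That renormalization is purely a display step: the top-K scores are rescaled so the plotted bars sum to 1. A minimal sketch with a hypothetical topk table:

```r
library(data.table)

# Hypothetical top-K candidates with raw Katz backoff scores.
topk <- data.table(word  = c("cat", "dog", "mat"),
                   score = c(0.12, 0.05, 0.03))

# Rescale for plotting only; the underlying backoff probabilities are unchanged.
topk[, display := score / sum(score)]
```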
Reports:
- model_accuracy_report.Rmd — rank-based accuracy for top-K predictions (top-1 gets full credit, graded decay by rank, zero if the true word is absent from the list), plus a comparison of full vs pruned training and timing; a toy version of the credit scheme is sketched after this list.
- capstone_milestone_report.Rmd — corpus EDA (scale, sampling, lexical summaries).
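The rank-based credit can be sketched as a small R function; the report's exact decay is not spelled out here, so the 1/rank decay below is an assumed stand-in rather than the report's formula.

```r
# Credit for one prediction: full credit at rank 1, decaying credit further
# down the list, zero if the true word is missing from the top-K candidates.
# The 1/rank decay is an assumption made for illustration.
rank_credit <- function(truth, candidates) {
  r <- match(truth, candidates)
  if (is.na(r)) 0 else 1 / r
}

rank_credit("cat", c("cat", "dog", "mat"))   # 1
rank_credit("mat", c("cat", "dog", "mat"))   # ~0.33
rank_credit("hat", c("cat", "dog", "mat"))   # 0
```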
Try it: Next-Word-Predictor (aseemanand.shinyapps.io/Next-Word-Predictor)