2026-05-27
Product Overview and Data
- Built an interactive Shiny app that predicts the next word from a user-entered English phrase
- Final model: Stupid Backoff
- Sampled 30% of the blogs, news, and Twitter text to reduce memory use and processing time
- Split data into train, validation, and test sets using an 80% / 10% / 10% split
- Cleaned text and created unigram, bigram, trigram, and fourgram frequency tables
- Converted raw text into context-target training data using a 3-word sliding window
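The data steps above can be sketched in a few lines. This is a minimal Python illustration, not the app's own code: the function names, the fixed seed, and the reading of "3-word sliding window" as three context words followed by one target word are all assumptions.

```python
import random
from collections import Counter

def sample_and_split(lines, percent=30, seed=42):
    """Sample `percent`% of the lines, then split them 80/10/10 (hypothetical helper)."""
    rng = random.Random(seed)                      # fixed seed is an assumption, for repeatability
    sampled = rng.sample(lines, len(lines) * percent // 100)
    a, b = len(sampled) * 8 // 10, len(sampled) * 9 // 10
    return sampled[:a], sampled[a:b], sampled[b:]  # train, validation, test

def ngram_counts(tokens, n):
    """Frequency table of every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def context_target_pairs(tokens, window=3):
    """Slide over the tokens: `window` context words, then the word that follows."""
    return [(tuple(tokens[i:i + window]), tokens[i + window])
            for i in range(len(tokens) - window)]

train, validation, test = sample_and_split([f"line {i}" for i in range(100)])

tokens = "thanks for the follow and thanks for the retweet".split()
tables = {n: ngram_counts(tokens, n) for n in range(1, 5)}  # unigram .. fourgram
pairs = context_target_pairs(tokens)
```

Here `tables[2][("thanks", "for")]` is 2, and the first pair maps the context `("thanks", "for", "the")` to the target `"follow"`.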
Algorithm: Stupid Backoff
- The app uses a Stupid Backoff language model
- It first searches for the longest available context using fourgrams
- If no match is found, it backs off to trigram, bigram, and then unigram models
- Backoff scores are adjusted using an alpha penalty
- Score = n-gram probability × alpha penalty
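A minimal Python sketch of this scoring scheme follows, for illustration only. It assumes alpha = 0.4 (a commonly used value; the slides do not state the one used), applies the penalty once per backoff step, and scores each candidate word at the longest context in which it was seen. The names `build_tables` and `predict` are hypothetical, and the app's actual implementation may differ.

```python
from collections import Counter

ALPHA = 0.4  # assumed penalty value; not stated in the source

def build_tables(sentences, max_n=4):
    """Count n-grams of order 1..max_n over whitespace-tokenised sentences."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        toks = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                tables[n][tuple(toks[i:i + n])] += 1
    return tables

def predict(phrase, tables, k=5, alpha=ALPHA):
    """Return up to k next-word suggestions, best first."""
    words = phrase.lower().split()
    start = min(max(tables), len(words) + 1)  # longest order the phrase can support
    scores, penalty = {}, 1.0
    for n in range(start, 0, -1):
        context = tuple(words[len(words) - (n - 1):])
        # Denominator: count of the context, or total tokens at the unigram level
        total = tables[n - 1][context] if n > 1 else sum(tables[1].values())
        if total:
            # Linear scan keeps the sketch short; a real model would index by context
            for gram, count in tables[n].items():
                if gram[:-1] == context:
                    # Keep the score from the longest context that saw this word
                    scores.setdefault(gram[-1], penalty * count / total)
        penalty *= alpha  # each backoff step is penalised once more
    return sorted(scores, key=scores.get, reverse=True)[:k]

tables = build_tables([
    "thanks for the follow",
    "thanks for the follow",
    "thanks for the retweet",
    "see you at the game",
])
suggestions = predict("thanks for the", tables)
```

With this toy corpus, "follow" and "retweet" are scored from the fourgram table, "game" only appears after backing off to bigrams, and the remaining slots are filled from unigram frequencies.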
Model Evaluation and Selection
- Compared Simple Backoff, Stupid Backoff, Laplace-smoothed Backoff, and Naive Bayes
- Models were evaluated on the validation and test sets
- Metrics included top-k accuracy, runtime, and model object size
- Stupid Backoff achieved the highest test top-k accuracy: 30.60%
- Naive Bayes was fastest and smallest, but Stupid Backoff was selected for better prediction accuracy
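Top-k accuracy can be computed as the share of held-out phrases whose true next word appears among the first k suggestions. The Python sketch below illustrates the metric with a toy predictor and made-up phrases; `top_k_accuracy`, `toy_predict`, and the sample data are hypothetical and do not reproduce the reported results.

```python
def top_k_accuracy(predict_fn, pairs, k=5):
    """Fraction of (context, target) pairs whose target is in the first k predictions."""
    if not pairs:
        return 0.0
    hits = sum(target in predict_fn(context)[:k] for context, target in pairs)
    return hits / len(pairs)

def toy_predict(context):
    """Stand-in for a trained model: returns a ranked list of candidate words."""
    lookup = {
        "thanks for the": ["follow", "retweet", "support"],
        "see you at the": ["game", "party"],
    }
    return lookup.get(context, ["the", "to", "and"])

held_out = [
    ("thanks for the", "retweet"),
    ("see you at the", "beach"),
    ("happy mothers", "day"),
    ("i love", "the"),
]

acc_top3 = top_k_accuracy(toy_predict, held_out, k=3)
acc_top1 = top_k_accuracy(toy_predict, held_out, k=1)
```

The same function can be applied to each candidate model on the same held-out pairs, so that accuracy is compared like for like.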
App Experience and Conclusion
- Users type an English phrase into the text box
- The app displays the top 5 predicted next words; the first suggestion is the highest-ranked prediction
- Users can click a suggested word to add it to the input phrase and continue writing
- Five Twitter/news-style test phrases were entered with the final word removed; the app returned predictions for all five
- The app provides a fast and interactive next-word prediction experience using a compact language model