2026-05-27

Product Overview and Data

  • Built an interactive Shiny app that predicts the next word from a user-entered English phrase
  • Final model: Stupid Backoff
  • Sampled 30% of the blogs, news, and Twitter text to reduce memory use and processing time
  • Split data into train, validation, and test sets using an 80% / 10% / 10% split
  • Cleaned text and created unigram, bigram, trigram, and fourgram frequency tables
  • Converted raw text into context-target training data using a 3-word sliding window
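
The preparation steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the app's actual code: the function names (`sample_lines`, `split_lines`, `tokenize`, `ngram_counts`, `context_target_pairs`), the whitespace tokenizer, and the fixed seed are all assumptions made for the example.

```python
import random
from collections import Counter


def sample_lines(lines, fraction=0.3, seed=42):
    """Draw a random subset of lines (30% by default)."""
    size = round(len(lines) * fraction)
    return random.Random(seed).sample(list(lines), size)


def split_lines(lines, seed=42):
    """Shuffle lines, then split them 80/10/10 into train, validation, test."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def tokenize(line):
    """Assumed cleaning step: lowercase and split on whitespace."""
    return line.lower().split()


def ngram_counts(token_lists, n):
    """Frequency table for n-grams of a single order n."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def context_target_pairs(tokens, window=3):
    """Slide a window over the tokens: `window` context words, then the next word."""
    return [
        (tuple(tokens[i:i + window]), tokens[i + window])
        for i in range(len(tokens) - window)
    ]
```

For example, `context_target_pairs(tokenize("Thanks for the follow back"))` yields two pairs: `("thanks", "for", "the")` with target `"follow"`, and `("for", "the", "follow")` with target `"back"`.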

Algorithm: Stupid Backoff

  • The app uses a Stupid Backoff language model
  • It first searches for the longest available context using fourgrams
  • If no match is found, it backs off to trigram, bigram, and then unigram models
  • Each backoff step multiplies the score by a fixed alpha penalty
  • Score = n-gram relative frequency × alpha penalty (one factor of alpha per backoff step)
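
The backoff procedure above can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the app's implementation: `build_tables` and `predict` are hypothetical names, the alpha value of 0.4 is the one commonly suggested for Stupid Backoff rather than a value taken from these slides, and the linear scan over each table is chosen for readability, not speed.

```python
from collections import Counter

ALPHA = 0.4  # assumed penalty value; the slides do not state the one used


def build_tables(sentences, max_n=4):
    """Count n-grams of every order 1..max_n from tokenized sentences."""
    tables = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                tables[n][tuple(tokens[i:i + n])] += 1
    return tables


def predict(tables, phrase, k=5, alpha=ALPHA, max_n=4):
    """Return up to k (word, score) pairs, best first."""
    tokens = phrase.lower().split()
    total_unigrams = sum(tables[1].values())
    scores = {}
    penalty = 1.0
    # Start from the longest context the phrase can supply, then back off.
    for n in range(min(max_n, len(tokens) + 1), 0, -1):
        context = tuple(tokens[len(tokens) - (n - 1):])
        denom = tables[n - 1][context] if n > 1 else total_unigrams
        if denom > 0:
            for gram, count in tables[n].items():
                word = gram[-1]
                # A word keeps the score from the highest order where it was seen.
                if gram[:-1] == context and word not in scores:
                    scores[word] = penalty * count / denom
        penalty *= alpha  # one extra factor of alpha per backoff step
    ranked = sorted(scores.items(), key=lambda item: -item[1])
    return ranked[:k]
```

The scores are relative frequencies scaled by powers of alpha, so they are useful for ranking candidates but do not sum to one like true probabilities.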

Model Evaluation and Selection

  • Compared Simple Backoff, Stupid Backoff, Laplace-smoothed Backoff, and Naive Bayes
  • Evaluation used validation and test datasets
  • Metrics included top-k accuracy, runtime, and model object size
  • Stupid Backoff achieved the highest test top-k accuracy: 30.60%
  • Naive Bayes was the fastest and smallest model, but Stupid Backoff was selected for its higher prediction accuracy
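
The three metrics above could be computed along these lines. This is a hedged Python sketch: `top_k_accuracy` and `evaluate` are hypothetical helpers, `k=5` is an assumed default (the slides do not state which k was used), and serialized size via `pickle` is only one possible way to measure model object size.

```python
import pickle
import time


def top_k_accuracy(predict_fn, pairs, k=5):
    """Fraction of (context, target) pairs whose target is in the top-k predictions."""
    if not pairs:
        return 0.0
    hits = sum(
        1 for context, target in pairs if target in predict_fn(context)[:k]
    )
    return hits / len(pairs)


def evaluate(predict_fn, model_obj, pairs, k=5):
    """Collect accuracy, wall-clock runtime, and serialized model size."""
    start = time.perf_counter()
    accuracy = top_k_accuracy(predict_fn, pairs, k)
    runtime = time.perf_counter() - start
    size_bytes = len(pickle.dumps(model_obj))
    return {
        "top_k_accuracy": accuracy,
        "runtime_s": runtime,
        "size_bytes": size_bytes,
    }
```

Here `predict_fn` is assumed to take a context string and return a ranked list of candidate words, so the same harness can be applied to each of the compared models.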

App Experience and Conclusion

  • Users type an English phrase into the text box
  • The app displays the top 5 predicted next words; the first suggestion is the highest-ranked prediction
  • Users can click a suggested word to add it to the input phrase and continue writing
  • Five Twitter/news-style test phrases were entered with the final word removed; the app returned predictions for all five
  • The app provides a fast and interactive next-word prediction experience using a compact language model