1. Project Overview

2. Modeling Approach

  • Preprocessing:
    • Sampled ~1% from each source
    • Converted text to lowercase; removed punctuation and extra whitespace
  • Modeling:
    • Constructed unigram, bigram, trigram tables
    • Selected the 5,000 words with the highest TF-IDF scores as the vocabulary
    • Used a backoff model: trigram → bigram → top TF-IDF word
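
The preprocessing and table-building steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's own code (the app itself runs in Shiny). All function names are hypothetical, and the TF-IDF scoring (one document per source, smoothed IDF, best score per word) is an assumption, since the slides do not say how the scores were computed.

```python
import math
import random
import re
from collections import Counter


def sample_lines(lines, fraction=0.01, seed=42):
    """Keep roughly `fraction` of the lines (the slides sample ~1% per source)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def tokenize(text):
    """Lowercase, drop punctuation, and split on runs of whitespace."""
    text = text.lower().replace("’", "'")      # normalize curly apostrophes
    text = re.sub(r"[^a-z0-9'\s]", " ", text)  # keep apostrophes for contractions
    return text.split()


def build_tables(lines):
    """Count unigrams, bigrams, and trigrams, never crossing line boundaries."""
    tables = {n: Counter() for n in (1, 2, 3)}
    for line in lines:
        tokens = tokenize(line)
        for n in tables:
            # zip over n shifted copies yields every run of n consecutive tokens
            tables[n].update(zip(*(tokens[i:] for i in range(n))))
    return tables


def top_tfidf_words(docs, k=5000):
    """Rank words by their best TF-IDF score across documents.

    Assumption: each source is one document (a list of tokens) and the
    IDF is smoothed so that words present in every source still score > 0.
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    best = {}
    for doc in docs:
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
            best[word] = max(best.get(word, 0.0), tf * idf)
    return sorted(best, key=best.get, reverse=True)[:k]


# Toy usage with made-up text standing in for the real sources.
sources = {
    "source_a": ["The cat sat on the mat.", "The cat ran!"],
    "source_b": ["A dog sat on the log."],
}
tables = build_tables(line for lines in sources.values() for line in lines)
docs = [[tok for line in lines for tok in tokenize(line)] for lines in sources.values()]
vocab = top_tfidf_words(docs, k=5)
```

Counting n-grams per line rather than over one concatenated stream avoids creating n-grams that span unrelated sentences.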

3. Prediction Function

Example outputs from our model:

Input Phrase                                       | Predicted Word
I’d give anything to see arctic                    | monkeys
When you breathe I want to be                      | air
Talking to your mom has the same                   | effect
I like how the same people are in Adam Sandler’s   | movies
  • Predictions combine n-gram frequency with a match on the preceding words
  • Lookups are fast enough for real-time use in Shiny
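
A minimal sketch of how a trigram → bigram → fallback lookup of this kind might work, written in Python for illustration only (the deployed app is a Shiny app, and its actual code is not shown in these slides). The function names, the toy corpus, and the `fallback_words` list standing in for the top TF-IDF words are all hypothetical.

```python
from collections import Counter, defaultdict


def build_next_word_tables(sentences):
    """Map each 2-word and 1-word context to a Counter of the words that follow it."""
    after_two = defaultdict(Counter)
    after_one = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(1, len(tokens)):
            after_one[tokens[i - 1]][tokens[i]] += 1
            if i >= 2:
                after_two[(tokens[i - 2], tokens[i - 1])][tokens[i]] += 1
    return after_two, after_one


def predict_next(phrase, after_two, after_one, fallback_words):
    """Back off from trigram context to bigram context to a fixed fallback word."""
    tokens = phrase.lower().split()
    # 1. Trigram level: look up the last two words of the input.
    if len(tokens) >= 2:
        candidates = after_two.get(tuple(tokens[-2:]))
        if candidates:
            return candidates.most_common(1)[0][0]
    # 2. Bigram level: look up the last word only.
    if tokens:
        candidates = after_one.get(tokens[-1])
        if candidates:
            return candidates.most_common(1)[0][0]
    # 3. Nothing matched: return the top-ranked fallback word, if any.
    return fallback_words[0] if fallback_words else None


# Toy corpus, made up for this example.
corpus = [
    "i want to be free",
    "i want to be air",
    "i want to be air",
    "to see you",
]
after_two, after_one = build_next_word_tables(corpus)
fallback_words = ["the"]  # stand-in for the top TF-IDF words

print(predict_next("When you breathe I want to be", after_two, after_one, fallback_words))
print(predict_next("please see", after_two, after_one, fallback_words))
print(predict_next("hello world", after_two, after_one, fallback_words))
```

Each prediction costs at most two dictionary lookups, which is consistent with the real-time use described above.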

4. Shiny App Summary

5. Highlights & Conclusion

✅ Fast + lightweight model
✅ TF-IDF restricts vocabulary to high-signal words
✅ Handles unseen inputs via backoff fallback
✅ Shiny app loads fast and is interactive
✅ Deployed successfully and peer-review ready

Thanks for reviewing!