Introduction
- Built as part of Coursera Data Science Capstone
- Predicts the next word in a user-entered phrase
- Uses the blogs text dataset
How the Algorithm Works
- Text cleaned and tokenized using tidytext
- Created unigram, bigram, and trigram frequency tables
- Uses an n-gram model with backoff:
- If no trigram matches → back off to bigram → back off to unigram
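The table-building step above can be sketched as follows. This is a minimal illustration, not the app's actual code: the three-sentence `corpus` is a stand-in for the blogs text, and the column names (`w1`, `w2`, `w3`, `freq`) are assumptions chosen for this example.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy corpus standing in for the blogs text (hypothetical sample data)
corpus <- tibble(text = c(
  "I love data science",
  "I love data analysis",
  "I love coffee"
))

# Unigrams: unnest_tokens() lowercases and strips punctuation by default
unigrams <- corpus %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE, name = "freq")

# Bigrams: split each two-word token into context word and next word
bigrams <- corpus %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(ngram)) %>%
  separate(ngram, into = c("w1", "w2"), sep = " ") %>%
  count(w1, w2, sort = TRUE, name = "freq")

# Trigrams: two context words plus the word that follows them
trigrams <- corpus %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(ngram)) %>%
  separate(ngram, into = c("w1", "w2", "w3"), sep = " ") %>%
  count(w1, w2, w3, sort = TRUE, name = "freq")
```

Splitting each n-gram into one column per word keeps the later lookup a plain equality filter on the context columns.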
Prediction Model
- Efficient and memory-optimized: n-gram tables stored as RDS files
- Fast lookup using dplyr filtering and slice_max
- Handles unknown inputs with fallback strategy
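A minimal sketch of how the lookup and fallback could fit together, under stated assumptions: the small tables below are made-up data, `predict_next` is a hypothetical function name, and the RDS round trip through a temporary file only illustrates the storage step. The deployed app may differ.

```r
library(dplyr)

# Hypothetical frequency tables (made-up counts for illustration)
unigrams <- tibble(word = c("the", "data", "love"), freq = c(10, 4, 3))
bigrams  <- tibble(w1 = c("love", "love"), w2 = c("coffee", "data"),
                   freq = c(5, 2))
trigrams <- tibble(w1 = "i", w2 = "love", w3 = "data", freq = 2)

# Round-trip one table through RDS, as an app would do at build and start-up
rds_path <- tempfile(fileext = ".rds")
saveRDS(trigrams, rds_path)
trigrams <- readRDS(rds_path)

# Hypothetical helper: predict the next word with trigram > bigram > unigram backoff
predict_next <- function(phrase) {
  tokens <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  k <- length(tokens)

  # Try the trigram table using the last two words as context
  if (k >= 2) {
    prev2 <- tokens[k - 1]
    prev1 <- tokens[k]
    hit <- trigrams %>%
      filter(w1 == prev2, w2 == prev1) %>%
      slice_max(freq, n = 1, with_ties = FALSE)
    if (nrow(hit) > 0) return(hit$w3)
  }

  # Back off to the bigram table using only the last word
  if (k >= 1) {
    prev1 <- tokens[k]
    hit <- bigrams %>%
      filter(w1 == prev1) %>%
      slice_max(freq, n = 1, with_ties = FALSE)
    if (nrow(hit) > 0) return(hit$w2)
  }

  # Final fallback: the most frequent single word
  unigrams %>%
    slice_max(freq, n = 1, with_ties = FALSE) %>%
    pull(word)
}

predict_next("I love")      # "data"   (trigram match)
predict_next("we love")     # "coffee" (bigram backoff)
predict_next("zzz qqq")     # "the"    (unigram fallback)
```

Setting `with_ties = FALSE` makes `slice_max` return exactly one row, so the function always yields a single prediction even when two candidates share the top frequency.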
Shiny App Demo
Summary
- Accurate, fast predictions with minimal resources
- Real-time prediction from cleaned social and web text
- Could be extended into mobile keyboards or chat assistants