Overview
- Goal: Predict the next word from a user-typed phrase
- Built as an interactive Shiny web app
- Real-time prediction using N-gram models
- Useful for mobile typing suggestions, chatbots, etc.
Data Source & Cleaning
- Dataset: HC Corpora (Blogs, News, Twitter)
- Steps:
  - Removed punctuation, numbers, stopwords
  - Converted to lowercase
  - Tokenized into N-grams (uni-, bi-, tri-)
  - Sampled data (~5%) for efficiency
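The cleaning steps above can be sketched roughly as follows. This is a minimal base R illustration, not the app's actual code: the helper names (`sample_lines`, `clean_text`, `make_ngrams`) and the tiny stopword list are assumptions made for the example.

```r
# Hypothetical helpers illustrating the cleaning steps; not the app's real code.

# Tiny stand-in stopword list (an assumption; the real list is not given).
stop_words_demo <- c("the", "a", "an", "of", "to")

# Keep a random fraction of the input lines (about 5% by default).
sample_lines <- function(lines, frac = 0.05) {
  size <- max(1, round(frac * length(lines)))
  lines[sample.int(length(lines), size)]
}

# Lowercase, strip punctuation and digits, split into words, drop stopwords.
clean_text <- function(x, stopwords = stop_words_demo) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:digit:]]", "", x)
  words <- strsplit(x, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  words[!words %in% stopwords]
}

# Build all n-grams of order n from a vector of words.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

words <- clean_text("The 2 cats sat on the mat, happily!")
make_ngrams(words, 2)
# "cats sat" "sat on" "on mat" "mat happily"
```

Sampling before cleaning keeps the expensive steps cheap; a call such as `sample_lines(readLines(path))` would do that, where `path` is a placeholder for one of the corpus files.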
Prediction Algorithm
- Built N-gram language models
- Used Stupid Backoff strategy:
  - Try trigram → if not found, back off to bigram → then unigram
  - Fast, simple, works well with sparse data
- Implemented in R with dplyr, stringr, and tidytext
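A rough sketch of the trigram → bigram → unigram cascade, in base R on a toy corpus. The function names and the corpus are made up for illustration. The 0.4 penalty per backoff step is the value suggested in the original Stupid Backoff paper (Brants et al., 2007); whether the app uses the same value is an assumption.

```r
# Toy corpus (illustrative only); the app would use counts from HC Corpora.
corpus <- c("i love new york", "i love new shoes",
            "i love pizza", "new york is big")

# Count all n-grams of order n; returns a named integer vector.
count_ngrams <- function(sentences, n) {
  grams <- unlist(lapply(strsplit(sentences, " +"), function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  setNames(as.vector(tab), names(tab))
}

uni <- count_ngrams(corpus, 1)
bi  <- count_ngrams(corpus, 2)
tri <- count_ngrams(corpus, 3)

# Most frequent n-gram that starts with `prefix`, or NULL if there is none.
best_continuation <- function(prefix, counts) {
  hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(NULL)
  top <- unname(which.max(hits))
  list(word = sub(".* ", "", names(hits)[top]), count = hits[[top]])
}

# Try the trigram table, then bigram, then unigram; each backoff step
# multiplies the relative-frequency score by alpha (assumed 0.4).
predict_next <- function(phrase, alpha = 0.4) {
  words <- strsplit(tolower(trimws(phrase)), " +")[[1]]
  k <- length(words)
  if (k >= 2) {
    ctx <- paste(words[k - 1], words[k])
    hit <- best_continuation(ctx, tri)
    if (!is.null(hit))
      return(list(word = hit$word, score = hit$count / bi[[ctx]], order = 3))
  }
  if (k >= 1) {
    ctx <- words[k]
    hit <- best_continuation(ctx, bi)
    if (!is.null(hit))
      return(list(word = hit$word, score = alpha * hit$count / uni[[ctx]], order = 2))
  }
  top <- unname(which.max(uni))
  list(word = names(uni)[top], score = alpha^2 * uni[[top]] / sum(uni), order = 1)
}

predict_next("i love")$word   # "new"  (trigram match)
predict_next("in new")$word   # "york" (backed off to bigram)
```

The scores are relative frequencies rather than true probabilities, which is what makes the method cheap: no normalization or discounting is needed.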
The Shiny App
Reflection
- Works well for many common phrases
- Could improve with:
  - Smarter backoff (e.g. Kneser-Ney)
  - Contextual models (e.g. RNN, transformers)
- Great experience building full NLP pipeline!