Project Overview
- Objective: Build a Shiny app that predicts the next word in a sentence.
- Data Source: English text corpora from blogs, news, and Twitter (SwiftKey dataset).
- Tools: R, Shiny, stringr, tm, tidytext
- Final Product: A lightweight app using a basic N-gram model.
Data Processing
- Downloaded and unzipped the corpus files.
- Sampled a small portion of the corpus to keep memory use and processing time manageable.
- Cleaned the text:
  - Converted to lowercase; removed punctuation, numbers, and profanity.
- Tokenized into:
  - Bigrams (2-word sequences)
  - Trigrams (3-word sequences)
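The cleaning and tokenizing steps above can be sketched in base R. This is a minimal illustration, not the project's actual code: the function names, the sampling rate, and the one-word profanity list are hypothetical, and base R stands in for the tm/tidytext calls the project lists.

```r
# Hypothetical helper: keep a random fraction (`rate`) of the input lines.
sample_lines <- function(lines, rate) {
  lines[sample.int(length(lines), ceiling(rate * length(lines)))]
}

# Hypothetical helper: lowercase, strip everything except letters and spaces
# (this removes punctuation and numbers), split into words, drop profanity.
tokenize_clean <- function(lines, profanity = character(0)) {
  lines <- tolower(lines)
  lines <- gsub("[^a-z ]+", " ", lines)
  tokens <- strsplit(trimws(lines), " +")
  lapply(tokens, function(w) w[nzchar(w) & !(w %in% profanity)])
}

# Build all n-word sequences from one vector of words.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy input; the real corpus would be the sampled blog, news, and Twitter lines.
lines  <- c("I love you, 2 much!", "I love R.", "You love darn data")
tokens <- tokenize_clean(lines, profanity = "darn")

bigrams  <- unlist(lapply(tokens, make_ngrams, n = 2))
trigrams <- unlist(lapply(tokens, make_ngrams, n = 3))

# Frequency tables, most common n-gram first.
bigram_counts  <- sort(table(bigrams), decreasing = TRUE)
trigram_counts <- sort(table(trigrams), decreasing = TRUE)
```

Note that dropping a profane word this way joins its neighbours into one n-gram ("love data" in the toy input); splitting the line at the removed word would be an alternative.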
Prediction Algorithm
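- N-gram backoff strategy:
  - If the last two input words start a known trigram: predict its third word.
  - If not, fall back to bigrams that start with the last input word.
  - If that also fails, return the most frequent word.
- Example:
  Input: "I love" → trigram "i love you" matches → predict "you".
  If no trigram starts with "i love": try bigrams starting with "love", then fall back to the most frequent word.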
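A minimal base R sketch of this backoff lookup, assuming the n-gram frequencies are stored as named numeric vectors. The counts below are made-up toy values, and `best_completion` and `predict_next` are hypothetical names, not the app's actual functions.

```r
# Toy frequency tables (hypothetical values).
trigram_counts <- c("i love you" = 3, "i love r" = 1)
bigram_counts  <- c("love data" = 4, "you are" = 2)
unigram_counts <- c("the" = 10, "you" = 7)

# Last word of the most frequent n-gram that starts with `prefix`,
# or NA if no n-gram starts with it.
best_completion <- function(counts, prefix) {
  hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  best <- names(hits)[which.max(hits)]
  sub(".* ", "", best)
}

predict_next <- function(input) {
  # Apply the same cleaning as the training text: letters only, lowercase.
  cleaned <- tolower(gsub("[^A-Za-z ]+", " ", input))
  words <- strsplit(trimws(cleaned), " +")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)

  # 1. Trigram lookup on the last two words.
  if (n >= 2) {
    guess <- best_completion(trigram_counts, paste(words[n - 1], words[n]))
    if (!is.na(guess)) return(guess)
  }
  # 2. Back off to a bigram lookup on the last word.
  if (n >= 1) {
    guess <- best_completion(bigram_counts, words[n])
    if (!is.na(guess)) return(guess)
  }
  # 3. Last resort: the most frequent single word.
  names(unigram_counts)[which.max(unigram_counts)]
}

predict_next("I love")       # "you"  (trigram match)
predict_next("we love")      # "data" (bigram fallback)
predict_next("hello world")  # "the"  (most frequent word)
```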
Shiny App Overview
Future Improvements
- Improve prediction accuracy with:
  - Smoothing techniques (e.g., Katz backoff)
  - Larger training sample
  - POS tagging or deep learning
- Add top 3 predictions
- Mobile optimization
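As a rough illustration of the "top 3 predictions" idea, the lookup could return the k most frequent completions instead of only the best one. This is a sketch under assumptions: the counts are made-up toy values and `top_predictions` is a hypothetical name.

```r
# Toy trigram counts (hypothetical values).
trigram_counts <- c("i love you" = 5, "i love r" = 3,
                    "i love it" = 2, "i love data" = 1)

# Return the last words of the k most frequent n-grams starting with `prefix`.
top_predictions <- function(counts, prefix, k = 3) {
  hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(character(0))
  hits <- sort(hits, decreasing = TRUE)
  sub(".* ", "", names(head(hits, k)))
}

top_predictions(trigram_counts, "i love")  # "you" "r" "it"
```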