Introduction
- This project demonstrates a next-word prediction app.
- Users input a phrase, and the app predicts the most likely next word.
- Built as part of the Data Science Capstone Project using the SwiftKey dataset.
- Goal: Mimic real-world mobile keyboard prediction.
Data and Preprocessing
- Data sourced from blogs, news, and Twitter (en_US).
- Preprocessing steps:
  - Lowercasing; punctuation and number removal.
  - Profanity filtering.
  - Tokenization and n-gram creation (unigrams through trigrams).
- The corpus was sampled to reduce computational load.
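The preprocessing steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the app's own code (the app is built with Shiny). The function names, the tiny `BLOCKED_WORDS` list, and the sampling approach are all assumptions made for the example.

```python
import random
import re
from collections import Counter

# Hypothetical placeholder list; a real profanity filter would load a longer word list.
BLOCKED_WORDS = {"darn", "heck"}


def sample_lines(lines, fraction, seed=0):
    """Keep each line with probability `fraction` (assumed sampling scheme)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def tokenize(text):
    """Lowercase, drop punctuation and digits, split on whitespace, filter blocked words."""
    text = text.lower().replace("'", "")      # "don't" becomes "dont"
    text = re.sub(r"[^a-z\s]", " ", text)     # anything that is not a-z or whitespace becomes a space
    return [tok for tok in text.split() if tok not in BLOCKED_WORDS]


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


tokens = tokenize("Hello, World! 42 darn cats")
# tokens is ['hello', 'world', 'cats']
bigrams = ngram_counts(tokens, 2)
```

Note that the regular expression also discards non-ASCII letters, which is a simplification that may or may not match the real pipeline.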
Algorithm Overview
- Uses n-gram language modeling (mainly bigram & trigram).
- Stupid backoff algorithm:
  - Try the trigram → back off to the bigram → back off to the unigram.
- Predictions are ranked by frequency of occurrence.
- Fast, lightweight, and interpretable.
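A minimal sketch of stupid backoff over trigram, bigram, and unigram counts is shown below. It is written in Python for illustration and is not the app's implementation. The discount `ALPHA = 0.4` is the value commonly quoted for stupid backoff and is an assumption here, as are the function names `build_counts` and `predict`.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff discount; the slides do not state the value used


def build_counts(tokens, max_n=3):
    """Map each order n (1..max_n) to a Counter of n-gram tuples."""
    return {
        n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in range(1, max_n + 1)
    }


def predict(counts, phrase_tokens, top_k=3):
    """Rank next-word candidates by stupid-backoff score."""
    max_n = max(counts)
    start_n = min(max_n, len(phrase_tokens) + 1)  # short phrases start at a lower order
    scores = {}
    discount = 1.0
    for n in range(start_n, 0, -1):
        # The context is the last n-1 words of the phrase (empty for unigrams).
        context = tuple(phrase_tokens[len(phrase_tokens) - (n - 1):])
        if n == 1:
            context_count = sum(counts[1].values())
        else:
            context_count = counts[n - 1][context]
        if context_count:
            for gram, count in counts[n].items():
                # A word keeps the score from the highest order at which it was seen.
                if gram[:-1] == context and gram[-1] not in scores:
                    scores[gram[-1]] = discount * count / context_count
        discount *= ALPHA  # each step down in order is penalized
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


corpus = "i love data science and i love data analysis and i love cats".split()
counts = build_counts(corpus)
suggestions = predict(counts, ["i", "love"], top_k=2)
# suggestions is ['data', 'cats']
```

Scanning every n-gram on each call keeps the sketch short. A real app would more likely index the counts by context so that lookups stay fast.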
Shiny App: How It Works
Final Notes
- The app demonstrates a basic NLP pipeline.
- Can be expanded with:
  - Deep learning (e.g., LSTMs or transformers)
  - User personalization
- Clean UI for non-technical users.
Thank you!