Text Predictor App
Data Science Capstone: Final Project
2025-12-30
2. Project Overview & Data Pipeline
This project is the culmination of the nine-course JHU Data Science Specialization, transforming raw corpora into a predictive product (Tasks 1-3).
- Foundation: Applied The Data Scientist’s Toolbox and R Programming to manage 100 MB of Twitter, blog, and news data.
- Cleaning: Used Getting and Cleaning Data techniques to tokenize, remove profanity, and handle “s” possessives (Task 2).
- Exploration: Conducted Exploratory Data Analysis to visualize n-gram frequencies (Task 3).
- Methodology: All steps are documented via Reproducible Research standards to ensure transparency.
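The cleaning and exploration steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's R code: the function names `clean_tokens` and `ngram_counts`, the two-word `PROFANITY` set, and the exact regular expressions are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stand-in for a real profanity list.
PROFANITY = {"darn", "heck"}

def clean_tokens(text, banned=PROFANITY):
    """Lowercase, strip possessive 's, tokenize, and drop banned words."""
    text = text.lower().replace("\u2019", "'")   # normalize curly apostrophes
    text = re.sub(r"'s\b", "", text)             # "john's" -> "john" (also turns "it's" into "it")
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    return [t for t in tokens if t not in banned]

def ngram_counts(tokens, n):
    """Count every contiguous window of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = clean_tokens("John's dog ate the darn cat\u2019s food")
bigrams = ngram_counts(tokens, 2)
```

The resulting `Counter` objects can be passed straight to a plotting routine for the frequency charts. Note that dropping a banned word makes its neighbors adjacent, so a real pipeline might instead split the sentence at that point.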
3. The Algorithm (v14) & Inference
The app uses a Stupid Backoff model with an IDF-weighted skip-gram approach to keep predictions contextually relevant (Tasks 4-5).
- Statistical Inference: Used probability distributions to estimate the likelihood of unseen word sequences.
- Regression & ML: Applied principles from Regression Models and Practical Machine Learning to weight n-gram scores.
- Innovation: Penalizes “generic” words using \(\text{Score} = \frac{n}{\log(\text{GlobalFreq} + 1)}\).
- The Skip-Gram: If a stopword is detected, the model “skips” back to the anchor verb.
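A minimal Python sketch of how these pieces might fit together is shown here; it is not the app's actual v14 code. Several things are assumptions: \(n\) in the score is read as the count of the matched n-gram, `ALPHA = 0.4` is the discount commonly used with Stupid Backoff, the `STOPWORDS` set is a toy list, and the skip step simply pairs the word before a trailing stopword with the word two positions later. The names `predict` and `penalized` are hypothetical. Unlike textbook Stupid Backoff, which scores by relative frequency, this sketch scores with the penalized formula from the slide.

```python
import math
from collections import Counter

ALPHA = 0.4                      # common Stupid Backoff discount; the app's value is not stated
STOPWORDS = {"to", "the", "a"}   # hypothetical, deliberately tiny

def ngrams(tokens, n):
    """All contiguous n-token windows, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def penalized(count, global_freq):
    """Slide formula, reading n as the matched n-gram count (an assumption)."""
    return count / math.log(global_freq + 1)

def predict(context, tokens, max_n=3, top_k=3):
    """Rank next words; counts are rebuilt on each call for brevity only."""
    unigrams = Counter(tokens)
    scores, discount = {}, 1.0
    for n in range(max_n, 1, -1):
        if len(context) < n - 1:
            continue                       # context too short for this order
        prefix = tuple(context[-(n - 1):])
        for gram, c in Counter(ngrams(tokens, n)).items():
            word = gram[-1]
            # Only back off for words the longer context did not already score.
            if gram[:-1] == prefix and word not in scores:
                scores[word] = discount * penalized(c, unigrams[word])
        discount *= ALPHA                  # each backoff step costs a factor ALPHA
    # Skip step: if the context ends in a stopword, pair the word before it
    # with the word two positions later in the corpus (a skip-bigram).
    if len(context) >= 2 and context[-1] in STOPWORDS:
        anchor = context[-2]
        skips = Counter((tokens[i], tokens[i + 2]) for i in range(len(tokens) - 2))
        for (first, word), c in skips.items():
            if first == anchor and word not in scores:
                scores[word] = discount * penalized(c, unigrams[word])
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]

corpus = "i want to eat i want to sleep i want a nap you like to eat".split()
print(predict(["want", "to"], corpus))  # ['sleep', 'eat', 'nap']
```

In this toy corpus both "eat" and "sleep" follow "want to" once, but "eat" is more frequent overall, so the penalty ranks "sleep" first. "nap" is reachable only through the skip-bigram from the anchor "want".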
4. Developing the Product
Building on the Developing Data Products course, I created the Text Predictor App (Link).
- Interactive Design: The app features a modern smartphone frame with a “Tap-to-Complete” keyboard functionality.
- Customization: A dynamic slider allows users to adjust prediction depth in real-time.
- Visual Analytics: Interactive word clouds let users explore trends in the data.
- Efficiency: The model is optimized for sub-second response times, demonstrating a product ready for real-world deployment.
5. Conclusion
This Capstone successfully executed all 7 Tasks prescribed in the course.
- The User Experience: Users enjoy a seamless, “smart” typing experience that feels native to modern mobile devices.
- Functionality: The IDF weighting makes the predictions feel uniquely “human” compared to standard maximum-likelihood models.
I have demonstrated the ability to take a complex problem, apply rigorous statistical methods, and deliver a polished, user-facing product.
Thank you!