Text Predictor App

Data Science Capstone: Final Project

Emma Schelhase

2025-12-30

2. Project Overview & Data Pipeline

This project is the culmination of the 9-course JHU Data Science specialization, turning raw text corpora into a predictive product (Tasks 1-3).

Foundation: Applied the Data Scientist’s Toolbox and R Programming to manage 100MB of Twitter, Blog, and News data.
Cleaning: Used Getting and Cleaning Data techniques to tokenize, remove profanity, and handle “s” possessives (Task 2).
Exploration: Conducted Exploratory Data Analysis to visualize n-gram frequencies (Task 3).
Methodology: All steps are documented via Reproducible Research standards to ensure transparency.
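The cleaning and exploration steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's own code: the `PROFANITY` set is a hypothetical placeholder, the function names are invented for the example, and "handling" possessives is assumed here to mean stripping a trailing 's (which also shortens contractions such as "she's").

```python
import re
from collections import Counter

# Hypothetical placeholder; a real filter would load a full word list from a file.
PROFANITY = {"darn", "heck"}

def tokenize(text):
    """Lowercase the text, strip possessive 's, and return clean word tokens."""
    text = text.lower().replace("\u2019", "'")   # normalize curly apostrophes
    text = re.sub(r"'s\b", "", text)             # assumption: "emma's" -> "emma"
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)
    return [t for t in tokens if t not in PROFANITY]

def ngram_counts(lines, n):
    """Count n-grams (as tuples of tokens) across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```

Calling `ngram_counts(lines, 2).most_common(20)` would give the kind of frequency table that an exploratory bar chart of top bigrams is built from.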

3. The Algorithm (v14) & Inference

The app uses a Stupid Backoff model combined with an IDF-weighted skip-gram approach to favor contextually relevant predictions (Tasks 4-5).
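For readers unfamiliar with Stupid Backoff, the core idea can be sketched as below. This is a generic Python illustration of the standard technique, not the app's v14 code: the backoff factor of 0.4 is the value commonly used in the literature and is an assumption here, and `build_counts` and `stupid_backoff` are names invented for the example.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the app's actual value is not stated

def build_counts(tokens, max_n=3):
    """Count every n-gram of order 1..max_n, keyed by a tuple of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def stupid_backoff(word, context, counts, total):
    """Score `word` following `context`, backing off to shorter contexts.

    `total` is the number of tokens in the corpus. The result is a relative
    score, not a normalized probability.
    """
    penalty = 1.0
    context = tuple(context)
    while context:
        seen = counts.get(context + (word,), 0)
        if seen > 0:
            return penalty * seen / counts[context]
        context = context[1:]   # drop the oldest word and retry
        penalty *= ALPHA        # each backoff step is discounted
    return penalty * counts.get((word,), 0) / total
```

Because the scores are never renormalized, the method stays cheap to compute, which is the usual reason for choosing it over smoothed probability models on large corpora.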

Statistical Inference: Used probability distributions to estimate the likelihood of unseen word sequences.
Regression & ML: Applied principles from Regression Models and Practical Machine Learning to weight n-gram scores.
Innovation: Penalized “generic” words by scoring candidates with \(Score = \frac{n}{\log(GlobalFreq + 1)}\), so that globally frequent words need stronger evidence to rank highly.
The Skip-Gram: If a stopword is detected, the model “skips” back to the anchor verb.
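The scoring formula and the skip step can be illustrated with a small sketch. Several things here are assumptions rather than details from the app: \(n\) is taken to be the candidate's matched n-gram count, \(GlobalFreq\) its corpus-wide frequency, the anchor is approximated as the nearest preceding non-stopword (no part-of-speech tagging is attempted), and the `STOPWORDS` set and function names are hypothetical.

```python
import math

# Illustrative subset only; a real stopword list would be much longer.
STOPWORDS = {"the", "a", "an", "to", "of", "in", "on", "and"}

def idf_score(n, global_freq):
    """Score = n / log(GlobalFreq + 1); unseen words get 0 to avoid dividing by zero."""
    if global_freq < 1:
        return 0.0
    return n / math.log(global_freq + 1)

def anchor_word(context):
    """Walk backwards past stopwords and return the nearest content word, if any."""
    for token in reversed(context):
        if token not in STOPWORDS:
            return token
    return None

def rank(candidates, global_freq):
    """Order candidate words best-first; `candidates` maps word -> matched count n."""
    return sorted(
        candidates,
        key=lambda w: idf_score(candidates[w], global_freq.get(w, 0)),
        reverse=True,
    )
```

With these definitions, a candidate such as "beach" with a lower raw count can outrank "the" once the latter's very high global frequency is taken into account, which is the intended penalty on generic words.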

4. Developing the Product

Building on the Developing Data Products course, I created the Text Predictor App (Link).

Interactive Design: The app features a modern smartphone frame with a “Tap-to-Complete” keyboard functionality.
Customization: A dynamic slider allows users to adjust prediction depth in real time.
Visual Analytics: Interactive word clouds let users explore trends in the data.
Efficiency: The model is optimized for sub-second response times, demonstrating a product ready for real-world deployment.

5. Conclusion

This Capstone completed all 7 Tasks prescribed in the course.

The User Experience: Users enjoy a seamless, “smart” typing experience that feels native to modern mobile devices.
Functionality: The IDF weighting makes the predictions feel more “human” than those of standard maximum-likelihood models.

I have demonstrated the ability to take a complex problem, apply rigorous statistical methods, and deliver a polished, user-facing product.

Thank you!