Data Science Specialization Capstone: Prediction

2026-01-20

Goal

Predict the Next Word from a Given String

Collaboration between Johns Hopkins University and SwiftKey
Objective: Build a functioning predictive text model
Data: HC Corpora (English only)

Data & Cleaning

Sampled 1,000,000 lines from Twitter, Blogs, and News datasets
Cleaned data by:
- Removing non-ASCII characters (emojis)
- Converting to lowercase
- Removing contractions, punctuation, numbers, profanity, and extra whitespace
Tokenized data to create MLE n-grams (up to 6-grams)
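
The cleaning and tokenizing steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's actual code: the names `PROFANITY`, `clean_line`, and `count_ngrams` are hypothetical, the profanity list is a placeholder, and "removing contractions" is assumed to mean dropping the whole contracted token.

```python
import re
from collections import Counter

# Hypothetical placeholder; the real profanity list is not given in the source.
PROFANITY = {"darn", "heck"}


def clean_line(line):
    """Apply the listed cleaning steps to one line and return its tokens."""
    # Drop non-ASCII characters such as emojis.
    line = line.encode("ascii", "ignore").decode("ascii")
    line = line.lower()
    # Assumption: a contraction like "don't" is removed as a whole token.
    line = re.sub(r"\b\w+'\w+\b", " ", line)
    # Replace punctuation and digits with spaces.
    line = re.sub(r"[^a-z\s]", " ", line)
    # split() also collapses runs of extra whitespace.
    return [token for token in line.split() if token not in PROFANITY]


def count_ngrams(lines, max_n=6):
    """Count every n-gram from unigrams up to max_n-grams."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean_line(line)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


counts = count_ngrams(["The cat sat!", "The cat ran 2 miles"], max_n=3)
```

Keying each n-gram as a tuple of tokens keeps the counts for different orders in separate tables, which makes it simple to look up a shorter n-gram later.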

Predictive Model

Built Maximum Likelihood Estimation (MLE) matrices
Used a back-off model for prediction: when the longest available context has no match, fall back to a shorter one
Output: Top 3 predicted words for user input
Showing three predictions rather than one increases the chance that the intended word is among the suggestions
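
A minimal sketch of how MLE scoring and back-off could fit together is shown below. The source does not say which back-off variant was used, so this assumes a simple scheme: take candidates from the longest matching context, then fill any remaining slots from shorter contexts. The MLE score for a word is count(context + word) / count(context). The names `build_table` and `predict` are hypothetical, and Python is used only for illustration.

```python
from collections import Counter, defaultdict


def build_table(sentences, max_n=6):
    """Map each context tuple to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                table[tuple(context)][word] += 1
    return table


def predict(table, text, k=3, max_n=6):
    """Return up to k (word, MLE probability) pairs, longest context first."""
    tokens = text.lower().split()
    predictions = []
    seen = set()
    # Try the longest usable context (max_n - 1 words), then back off
    # one word at a time down to the empty context (plain unigrams).
    for size in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - size:])
        followers = table.get(context)
        if not followers:
            continue
        total = sum(followers.values())
        for word, count in followers.most_common():
            if word in seen:
                continue
            seen.add(word)
            predictions.append((word, count / total))
            if len(predictions) == k:
                return predictions
    return predictions


table = build_table(["the cat sat on the mat", "the cat ran", "the dog sat"])
top_three = predict(table, "purple cat")
```

In this toy example the context "purple cat" never occurs, so the sketch backs off to "cat" and then to unigram frequencies to fill the third slot. Probabilities taken from different context lengths are not directly comparable; a production model would usually apply a discount or a smoothing method when backing off.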

Shiny Application

Hosted at: Capstone Prediction App
Features:
- Clickable predicted words to append to input
- Instant predictions
UI preview: