2026-01-20

Goal

Predict the Next Word from a Given String

  • Collaboration between Johns Hopkins University and SwiftKey
  • Objective: Build a functioning predictive text model
  • Data: HC Corpora (English only)

Data & Cleaning

  • Sampled 1,000,000 lines from Twitter, Blogs, and News datasets
  • Cleaned data by:
    • Removing non-ASCII characters (e.g., emojis)
    • Converting to lowercase
    • Removing contractions, punctuation, numbers, profanity, and extra whitespace
  • Tokenized the cleaned text into n-grams (up to 6-grams) for MLE
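
The cleaning and tokenization steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's actual code: the function names (`clean_line`, `ngrams`, `count_ngrams`) and the tiny `PROFANITY` set are hypothetical, and "removing contractions" is assumed here to mean dropping the contracted word entirely.

```python
import re
from collections import Counter

# Hypothetical placeholder; the profanity list used in the project is not given.
PROFANITY = {"darn", "heck"}

def clean_line(line, profanity=PROFANITY):
    """Clean one line and return its tokens, following the listed steps."""
    # Drop non-ASCII characters such as emojis.
    text = line.encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    # Assumption: a contraction like "don't" is removed as a whole word.
    text = re.sub(r"\b\w+'\w+\b", " ", text)
    # Replace punctuation and digits with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() also collapses any extra whitespace.
    return [t for t in text.split() if t not in profanity]

def ngrams(tokens, n):
    """Return all contiguous n-token tuples in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, max_n=6):
    """Count n-grams of every order from 1 to max_n across all lines."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean_line(line)
        for n in range(1, max_n + 1):
            counts[n].update(ngrams(tokens, n))
    return counts
```

Counting each order separately keeps the later probability lookup simple, at the cost of more memory for the higher orders.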

Predictive Model

  • Built Maximum Likelihood Estimation (MLE) matrices
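  • Used a back-off model for prediction: when the longest matching n-gram has no continuation, fall back to the next shorter one
  • Output: top 3 predicted words for the user's input
  • Showing three predictions rather than one increases the chance that the intended word is among them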
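
A minimal Python sketch of MLE estimates with a simple back-off lookup is shown below. It is one plausible reading of the bullets above, not the app's implementation: `build_model` and `predict` are hypothetical names, the model is stored as a dictionary rather than matrices, and the back-off is the plain "use the longest context that has any match" variant with no discounting.

```python
from collections import Counter, defaultdict

def build_model(token_lines, max_n=6):
    """Map each context tuple (0 to max_n - 1 words) to next-word counts."""
    model = defaultdict(Counter)
    for tokens in token_lines:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                model[tuple(context)][word] += 1
    return model

def predict(model, text, k=3, max_n=6):
    """Return up to k (word, probability) pairs for the next word."""
    tokens = text.lower().split()
    # Try the longest usable context first, then back off to shorter ones.
    # A context of size 0 is the empty tuple, i.e. plain unigram frequencies.
    for size in range(min(max_n - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - size:])
        followers = model.get(context)
        if followers:
            total = sum(followers.values())
            # MLE: P(word | context) = count(context, word) / count(context)
            return [(w, c / total) for w, c in followers.most_common(k)]
    return []
```

For example, with a model built from tokenized lines, `predict(model, "i like")` first looks for words seen after "i like" and only falls back to words seen after "like" (and then to overall word frequency) when no longer match exists.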

Shiny Application

  • Hosted at: Capstone Prediction App
  • Features:
    • Clickable predicted words to append to input
    • Instant predictions
  • UI preview: