JHU Data Science Capstone

SwiftKey Text Prediction Model

Paddy McPhillips

Project Overview

  • Objective: Build an NLP text prediction model in R (similar to mobile keyboard suggestions)
  • Data: SwiftKey corpus with ~4 million lines from Blogs, News, and Twitter
  • Challenge: Balance prediction accuracy vs. computational efficiency
  • Deliverable: End-to-end pipeline from raw data to deployed Shiny application

Modeling Approach

  • Algorithm: Stupid Backoff with pre-computed n-gram lookup tables
  • N-grams: Unigrams, bigrams, trigrams, and quadgrams (model < 10MB)
  • Process: Extract the last 1-3 words → search the 4-gram table first → back off to the 3-gram, then the 2-gram table (see the sketch after this list)
  • Output: Return the most frequent completions that match the context
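A minimal sketch of the backoff lookup described above, assuming the pre-computed tables are keyed data.table objects; the names ngram2/ngram3/ngram4/unigrams and the context/prediction/freq columns are assumptions for illustration, not the project's actual implementation:

```r
library(data.table)

# Hypothetical lookup tables: one keyed data.table per n-gram order, each with
# columns context (preceding words), prediction (next word) and freq (count),
# e.g. a 4-gram row: context = "one of the", prediction = "most", freq = 312.
predict_next_word <- function(phrase, ngram2, ngram3, ngram4, unigrams, top_n = 3) {
  words  <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
  n      <- length(words)
  tables <- list(ngram2, ngram3, ngram4)   # tables[[k]] expects a k-word context

  # Search the 4-gram table first (3-word context), then back off to shorter contexts
  for (k in 3:1) {
    if (n < k) next
    context <- paste(tail(words, k), collapse = " ")
    hits    <- tables[[k]][context, nomatch = 0L]   # keyed lookup (setkey(..., context))
    if (nrow(hits) > 0) {
      return(head(hits[order(-freq)]$prediction, top_n))
    }
  }

  # No context matched at any level: fall back to the most frequent unigrams
  head(unigrams[order(-freq)]$prediction, top_n)
}
```

Because the search stops at the first n-gram level that matches, the 0.4 discount usually applied per backoff step in Stupid Backoff would not change the ranking here; it only matters if candidates from several levels are mixed.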

Model Evaluation

  • Top-3 Accuracy: 21%, within the typical 15-25% range for n-gram models
  • Limitation: Data sparsity in higher-order n-grams
  • Perplexity: Measures model “surprise” at test data (lower = better)
  • Formula: \(\text{Perplexity} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(\text{word}_i \mid \text{context}_i)}\)
  • Current Result: ~4000 (could improve with a higher sampling rate); a sketch of the calculation follows this list
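A minimal sketch of the perplexity calculation on held-out text, assuming a hypothetical score_word(context, word) helper that returns the model's estimate of P(word | context), with a small floor probability so unseen words do not produce infinite values:

```r
# Perplexity over held-out text: 2^(-mean of log2 word probabilities).
# score_word() is a hypothetical helper returning the model's estimate of
# P(word | context); unseen words are floored so log2() stays finite.
perplexity <- function(test_sentences, score_word, floor_prob = 1e-6) {
  log_probs <- numeric(0)
  for (sentence in test_sentences) {
    words <- tolower(unlist(strsplit(trimws(sentence), "\\s+")))
    if (length(words) < 2) next
    for (i in 2:length(words)) {
      context   <- paste(words[seq_len(i - 1)], collapse = " ")
      p         <- max(score_word(context, words[i]), floor_prob)
      log_probs <- c(log_probs, log2(p))
    }
  }
  2^(-mean(log_probs))
}
```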

Shiny Application

  • UI Component: Text input box, clickable prediction buttons, confidence scores
  • Server Component: Loads n-gram tables at startup, runs predict_next_word() reactively
  • Features: Real-time predictions and a debug trace showing which n-gram level matched (a minimal app skeleton follows this list)
  • Live Demo: tsunamimor.shinyapps.io/JHU_Data_Science_Capstone_SwiftKey_Text_Prediction
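A minimal skeleton of the app structure described above, assuming the n-gram tables are saved in ngrams.rds and predict_next_word() is defined as in the earlier sketch (both names are assumptions):

```r
library(shiny)

# N-gram tables are loaded once at startup, not per request; ngrams.rds and
# predict_next_word() are assumed names carried over from the earlier sketch.
ngrams <- readRDS("ngrams.rds")

ui <- fluidPage(
  titlePanel("SwiftKey Text Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  uiOutput("predictions")
)

server <- function(input, output, session) {
  output$predictions <- renderUI({
    req(nchar(trimws(input$phrase)) > 0)
    words <- predict_next_word(input$phrase,
                               ngrams$bi, ngrams$tri, ngrams$quad, ngrams$uni)
    # One clickable button per predicted word
    tagList(lapply(words, function(w) actionButton(paste0("btn_", w), label = w)))
  })
}

shinyApp(ui, server)
```

Wiring the button clicks back into the text box, and adding the confidence scores and debug trace, would layer observeEvent handlers on top of this skeleton.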

Summary and Future Work

  • Achieved: Working text prediction system with proper NLP evaluation metrics
  • Deployed: Interactive Shiny application demonstrating the model
  • Improvements: Increase the sampling rate, add backoff weighting (e.g. the standard 0.4 Stupid Backoff discount), and implement Kneser-Ney smoothing