JHU Data Science Capstone
SwiftKey Text Prediction Model
Project Overview
- Objective: Build an NLP text prediction model in R (similar to mobile keyboard next-word suggestions)
- Data: SwiftKey corpus with ~4 million lines from Blogs, News, and Twitter
- Challenge: Balance prediction accuracy vs. computational efficiency
- Deliverable: End-to-end pipeline from raw data to deployed Shiny application
Modeling Approach
- Algorithm: Stupid Backoff with pre-computed n-gram lookup tables
- N-grams: Unigrams, bigrams, trigrams, and quadgrams (model < 10MB)
- Process: Extract the last 1-3 words of the input → search the 4-gram table first → back off to the 3-gram, then the 2-gram table → fall back to top unigrams if nothing matches (see the sketch after this list)
- Output: Return the most frequent completions that match the longest available context
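A minimal sketch of the backoff lookup follows. It assumes the pre-computed tables are `data.table` objects named `ngrams_4`, `ngrams_3`, `ngrams_2`, and `ngrams_1` with columns `prefix` (space-separated context), `word`, and `freq`; these names are illustrative, not the actual app objects.

```r
library(data.table)

# Illustrative Stupid Backoff lookup over pre-computed n-gram tables.
# Assumed columns: prefix (context words), word (candidate), freq (count).
predict_next_word <- function(input, n = 3) {
  tokens <- tolower(unlist(strsplit(trimws(input), "\\s+")))
  tables <- list(`3` = ngrams_4, `2` = ngrams_3, `1` = ngrams_2)

  # Try the longest available context (3 words) first, then back off
  for (k in 3:1) {
    if (length(tokens) >= k) {
      ctx  <- paste(tail(tokens, k), collapse = " ")
      hits <- tables[[as.character(k)]][prefix == ctx][order(-freq)]
      if (nrow(hits) > 0) return(head(hits$word, n))
    }
  }

  # No context matched: fall back to the most frequent unigrams
  head(ngrams_1[order(-freq)]$word, n)
}
```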
Model Evaluation
- Top-3 Accuracy: 21% (typical for n-gram models: 15-25%)
- Limitation: Data sparsity in higher-order n-grams
- Perplexity: Measures model “surprise” at test data (lower = better)
- Formula: \(Perplexity = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(word_i | context_i)}\)
- Current Result: ~4,000 on held-out text (could improve with a higher sampling rate); a worked computation appears below
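The formula above can be evaluated directly from the per-word probabilities the model assigns on held-out text. A minimal sketch, assuming a hypothetical vector `probs` of those probabilities (unseen words must be smoothed to a non-zero probability first, or log2(0) makes the result infinite):

```r
# Perplexity from a vector of per-word probabilities on held-out text
perplexity <- function(probs) {
  2 ^ (-mean(log2(probs)))
}

# Sanity check: a model assigning every word probability 1/4000
# has a perplexity of exactly 4000
perplexity(rep(1 / 4000, 100))
#> [1] 4000
```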
Shiny Application
- UI Component: Text input box, clickable prediction buttons, confidence scores
- Server Component: Loads n-gram tables at startup and runs predict_next_word() reactively (a minimal sketch follows this list)
- Features: Real-time predictions, debug trace showing n-gram level matched
- Live Demo: tsunamimor.shinyapps.io/JHU_Data_Science_Capstone_SwiftKey_Text_Prediction
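A minimal sketch of the app structure, assuming the `predict_next_word()` function from the modeling step and an `.rds` file of n-gram tables; the object names, file path, and button handling are illustrative rather than the deployed code.

```r
library(shiny)

# Loaded once at startup so predictions stay fast
ngrams <- readRDS("ngram_tables.rds")

ui <- fluidPage(
  titlePanel("SwiftKey Text Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  uiOutput("predictions")
)

server <- function(input, output) {
  # Re-run the prediction whenever the typed text changes
  preds <- reactive({
    req(nzchar(input$phrase))
    predict_next_word(input$phrase, n = 3)
  })

  # Render the top-3 predictions as clickable buttons
  output$predictions <- renderUI({
    tagList(lapply(preds(), function(w) {
      actionButton(paste0("btn_", w), label = w)
    }))
  })
}

shinyApp(ui, server)
```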
Summary and Future Work
- Achieved: A working text prediction system evaluated with standard NLP metrics (top-3 accuracy, perplexity)
- Deployed: Interactive Shiny application demonstrating the model
- Improvements: Increase the sampling rate, add backoff weighting (e.g., the standard 0.4 penalty per backoff step), and implement Kneser-Ney smoothing