JHU Data Science Capstone
SwiftKey Text Prediction Model
Project Overview
- Objective: Build an NLP text prediction model in R (similar to mobile keyboard next-word suggestions)
- Data: SwiftKey corpus with ~4 million lines from Blogs, News, and Twitter
- Challenge: Balance prediction accuracy vs. computational efficiency
- Deliverable: End-to-end pipeline from raw data to deployed Shiny application
Modeling Approach
- Algorithm: Stupid Backoff with pre-computed n-gram lookup tables
- N-grams: Unigrams, bigrams, trigrams, and quadgrams (model < 10MB)
- Process: Extract the last 1-3 words of the input → search the 4-gram table first → back off to the 3-gram, then the 2-gram table → fall back to top unigrams if nothing matches (see the sketch after this list)
- Output: Return the most frequent completions that match the longest available context
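A minimal sketch of the backoff lookup follows. It assumes the pre-computed tables are `data.table` objects named `ngrams_4`, `ngrams_3`, `ngrams_2`, and `ngrams_1` with columns `prefix` (space-separated context), `word`, and `freq`; these names are illustrative, not the actual app objects.

```r
library(data.table)

# Illustrative Stupid Backoff lookup over pre-computed n-gram tables.
# Assumed columns: prefix (context words), word (candidate), freq (count).
predict_next_word <- function(input, n = 3) {
  tokens <- tolower(unlist(strsplit(trimws(input), "\\s+")))
  tables <- list(`3` = ngrams_4, `2` = ngrams_3, `1` = ngrams_2)

  # Try the longest available context (3 words) first, then back off
  for (k in 3:1) {
    if (length(tokens) >= k) {
      ctx  <- paste(tail(tokens, k), collapse = " ")
      hits <- tables[[as.character(k)]][prefix == ctx][order(-freq)]
      if (nrow(hits) > 0) return(head(hits$word, n))
    }
  }

  # No context matched: fall back to the most frequent unigrams
  head(ngrams_1[order(-freq)]$word, n)
}
```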
Model Evaluation
- Top-3 Accuracy: 21% (typical for n-gram models: 15-25%)
- Limitation: Data sparsity in higher-order n-grams
- Perplexity: Measures model “surprise” at test data (lower = better)
- Formula: \(Perplexity = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(word_i | context_i)}\)
- Current Result: ~4,000 on held-out text (could improve with a higher sampling rate); a worked computation appears below
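The formula above can be evaluated directly from the per-word probabilities the model assigns on held-out text. A minimal sketch, assuming a hypothetical vector `probs` of those probabilities (unseen words must be smoothed to a non-zero probability first, or log2(0) makes the result infinite):

```r
# Perplexity from a vector of per-word probabilities on held-out text
perplexity <- function(probs) {
  2 ^ (-mean(log2(probs)))
}

# Sanity check: a model assigning every word probability 1/4000
# has a perplexity of exactly 4000
perplexity(rep(1 / 4000, 100))
#> [1] 4000
```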
Shiny Application
- UI Component: Text input box, clickable prediction buttons, confidence scores
- Server Component: Loads n-gram tables at startup and runs predict_next_word() reactively (a minimal sketch follows this list)
- Features: Real-time predictions, debug trace showing n-gram level matched
- Live Demo: tsunamimor.shinyapps.io/JHU_Data_Science_Capstone_SwiftKey_Text_Prediction
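A minimal sketch of the app structure, assuming the `predict_next_word()` function from the modeling step and an `.rds` file of n-gram tables; the object names, file path, and button handling are illustrative rather than the deployed code.

```r
library(shiny)

# Loaded once at startup so predictions stay fast
ngrams <- readRDS("ngram_tables.rds")

ui <- fluidPage(
  titlePanel("SwiftKey Text Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  uiOutput("predictions")
)

server <- function(input, output) {
  # Re-run the prediction whenever the typed text changes
  preds <- reactive({
    req(nzchar(input$phrase))
    predict_next_word(input$phrase, n = 3)
  })

  # Render the top-3 predictions as clickable buttons
  output$predictions <- renderUI({
    tagList(lapply(preds(), function(w) {
      actionButton(paste0("btn_", w), label = w)
    }))
  })
}

shinyApp(ui, server)
```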
Summary and Future Work
- Achieved: A working text prediction system evaluated with standard NLP metrics (top-3 accuracy, perplexity)
- Deployed: Interactive Shiny application demonstrating the model
- Improvements: Increase the sampling rate, add backoff weighting (e.g., the standard 0.4 penalty per backoff step), and implement Kneser-Ney smoothing