Final Deck

The Challenge

Goal: Build a predictive text application that suggests the next word as users type

Real-world applications:
- Mobile keyboard suggestions
- Search query completion
- Writing assistants
- Accessibility tools

Our approach: N-gram statistical language model with backoff strategy

The Data Pipeline

Corpus: English text from blogs, news articles, and Twitter

Processing steps:

1. Sampling & cleaning: Extract representative samples, remove noise
2. Tokenization: Split text into word sequences, filter profanity
3. N-gram generation: Build unigrams, bigrams, and trigrams with frequency counts
4. Aggregation & optimization: Combine counts, trim to top predictions per context

Result: Compact lookup tables (~838k trigrams, ~325k bigrams) for fast predictions
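The processing steps above can be sketched in a few lines. This is a minimal illustration in Python, not the app's actual R and data.table code: the function names, the tiny sample corpus, the one-word blocked list, and the small `top_k` values are all assumptions chosen for the example (the deck itself states a top-K of 100 per context).

```python
import re
from collections import Counter, defaultdict

# Hypothetical placeholder; a real filter would load a full profanity list.
BLOCKED_WORDS = {"darn"}

def tokenize(line):
    """Lowercase a line, keep alphabetic words (apostrophes allowed), drop blocked words."""
    words = re.findall(r"[a-z']+", line.lower())
    return [w for w in words if w not in BLOCKED_WORDS]

def count_ngrams(lines, n):
    """Count n-grams of order n within each line (n-grams never span lines)."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def build_table(counts, top_k):
    """Map each context (all words but the last) to its top_k next words by count."""
    by_context = defaultdict(list)
    for ngram, freq in counts.items():
        by_context[ngram[:-1]].append((ngram[-1], freq))
    table = {}
    for context, candidates in by_context.items():
        # Sort by descending count, then alphabetically so ties are deterministic.
        candidates.sort(key=lambda pair: (-pair[1], pair[0]))
        table[context] = [word for word, _ in candidates[:top_k]]
    return table

# Tiny illustrative corpus (assumed, not taken from the real data).
sample = [
    "I love you",
    "I love the weather",
    "I love you so much",
    "Darn, the weather is bad",
]
trigram_table = build_table(count_ngrams(sample, 3), top_k=2)
bigram_table = build_table(count_ngrams(sample, 2), top_k=2)
unigram_table = build_table(count_ngrams(sample, 1), top_k=3)
```

Trimming each context to its top candidates is what keeps the tables compact: contexts with many rare continuations store only the few words likely to be suggested.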

The Algorithm

Backoff N-gram Model:

Input: "I love"
├─ Try trigram lookup → [you, the, my, to, that]
│
├─ If no trigram match, try bigram on last word
│
└─ If no bigram match, return top unigrams [the, to, and, a, of]

Key features:
- Prioritizes longer context (trigrams) for better accuracy
- Falls back gracefully to shorter context when needed
- Top-K pruning (100 predictions per context) balances memory and coverage
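The backoff logic in the diagram can be written as a short function. The sketch below is in Python for illustration only (the app itself is built in R); `predict_next` and the toy tables are hypothetical names, and the bigram entries for "weather" are invented for the example. The trigram and unigram lists are the ones shown in the diagram.

```python
def predict_next(text, trigrams, bigrams, unigrams, k=5):
    """Return (level, suggestions) using trigram -> bigram -> unigram backoff."""
    words = text.lower().split()
    # Trigram level: the context is the last two words typed.
    if len(words) >= 2:
        hits = trigrams.get(tuple(words[-2:]))
        if hits:
            return "trigram", hits[:k]
    # Bigram level: the context is the last word only.
    if words:
        hits = bigrams.get((words[-1],))
        if hits:
            return "bigram", hits[:k]
    # Unigram level: no context matched, so return the most frequent words.
    return "unigram", unigrams[:k]

# Toy lookup tables keyed by context tuple (illustrative, not the real tables).
TRIGRAMS = {("i", "love"): ["you", "the", "my", "to", "that"]}
BIGRAMS = {("weather",): ["is", "was", "forecast"]}  # assumed entries
UNIGRAMS = ["the", "to", "and", "a", "of"]

level, suggestions = predict_next("I love", TRIGRAMS, BIGRAMS, UNIGRAMS)
```

Returning the matched level alongside the suggestions is one way to support a debug view that shows which n-gram order produced the result.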

The Shiny Application

Live demo: [Your shinyapps.io URL here]

How to use:
1. Type a word or phrase in the text box (e.g., “I love”, “the weather”)
2. Click the “Predict Next Word” button
3. View the ranked predictions (response in under 1 second)
4. Adjust the number of suggestions (1-10) or enable debug mode to see algorithm details

User experience:
- Fast & responsive: Predictions appear immediately with no lag
- Context-aware: Different inputs yield different predictions dynamically
- Transparent: Debug mode reveals which n-gram level provided the match

Technical implementation: R + data.table for millisecond lookups, Shiny for interactive UI, deployed on shinyapps.io

Results & Future Work

What makes this implementation strong:
- Complete end-to-end pipeline: From raw text corpus to deployed web application
- Memory-efficient design: Handles 1M+ n-grams with fast lookups via data.table
- Production-ready: Robust error handling, graceful fallbacks, public deployment
- Scalable architecture: Easy to expand corpus, add n-gram levels, or refine algorithm

Future enhancements:
- Expand training corpus for broader vocabulary coverage
- Implement 4-grams for longer context windows
- Add Kneser-Ney smoothing for improved probability estimates
- Incorporate part-of-speech tagging for grammatical accuracy

Key takeaway: The project demonstrates a full data science workflow, from data acquisition through model deployment, and delivers a functional product ready for real-world use.