2026-02-17
Goal: Predict the next word from an input phrase, similar to the suggestions on smartphone keyboards.
Deliverables:
- Deployed Shiny app (next-word prediction)
- Lightweight, responsive model suitable for deployment
- Reproducible workflow and documentation
Typing on mobile devices is slow and error-prone.
A predictive keyboard on mobile phones speeds up typing by suggesting the next word from context, for example:
Input: “I went to the”
Predictions: gym, store, restaurant
Objective: Build a predictive text model and deploy it as a Shiny application, as the Capstone project of the JHU Data Science Specialization on Coursera.
Training data (HC Corpora, English):
- Blogs, News, Twitter (millions of lines)
Key finding:
- Sources differ substantially: Twitter lines are short, while blog lines can be extremely long.
- This motivates:
  - Sampling for training efficiency
  - Consistent cleaning and tokenization
  - Compact model storage for fast runtime
Example quiz check result: Twitter love/hate line ratio ≈ 5
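A minimal sketch of this exploratory step, assuming the standard HC Corpora layout (final/en_US/en_US.*.txt) and an illustrative 5% sampling rate; the file paths, seed, and rate are assumptions, not necessarily what the final model used:

```r
# Assumed HC Corpora layout; adjust paths to your download
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)

# Quiz-style check: lines containing "love" divided by lines containing "hate"
sum(grepl("love", twitter)) / sum(grepl("hate", twitter))

# Sample a small fraction of each source to keep training tractable
set.seed(42)
frac   <- 0.05  # illustrative rate
corpus <- c(sample(twitter, round(length(twitter) * frac)),
            sample(blogs,   round(length(blogs)   * frac)),
            sample(news,    round(length(news)    * frac)))
```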
Model: Frequency-based n-gram language model
- 2-grams, 3-grams, 4-grams built from cleaned text
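A base-R sketch of the counting step, assuming the sampled `corpus` vector from the sketch above; a tokenization package (e.g., quanteda or tidytext) could do the same more efficiently:

```r
# Basic cleaning: lower-case, keep only letters, apostrophes, and spaces
clean  <- tolower(corpus)
clean  <- gsub("[^a-z' ]", " ", clean)
clean  <- gsub("\\s+", " ", trimws(clean))
tokens <- strsplit(clean, " ")

# Count all n-grams of a given order across the cleaned lines
count_ngrams <- function(tokens, n) {
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)
}

bigrams   <- count_ngrams(tokens, 2)
trigrams  <- count_ngrams(tokens, 3)
fourgrams <- count_ngrams(tokens, 4)
```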
Backoff prediction strategy (see the sketch after these steps):
1. Use last 3 words → search 4-grams
2. If not found → last 2 words → search 3-grams
3. If not found → last 1 word → search 2-grams
4. If not found → fallback to a common default (e.g., “the”)
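A sketch of that backoff lookup. It assumes the count tables from the previous sketch and reshapes them into prefix → next-word tables; the deployed app's actual data structures may differ:

```r
# Turn a named count table ("w1 w2 w3" -> count) into a prefix/next-word table
as_lookup <- function(gram_table) {
  parts <- strsplit(names(gram_table), " ")
  data.frame(
    prefix = sapply(parts, function(w) paste(head(w, -1), collapse = " ")),
    word   = sapply(parts, function(w) tail(w, 1)),
    count  = as.integer(gram_table),
    stringsAsFactors = FALSE
  )
}

# Highest-order table first: 4-grams, then 3-grams, then 2-grams
lookups <- list(as_lookup(fourgrams), as_lookup(trigrams), as_lookup(bigrams))

predict_next_word <- function(phrase, lookups, k = 3) {
  # Clean the input the same way as the training text
  words <- strsplit(trimws(gsub("[^a-z' ]", " ", tolower(phrase))), "\\s+")[[1]]
  words <- words[nzchar(words)]
  for (i in seq_along(lookups)) {
    n_prefix <- 4 - i                      # 3, then 2, then 1 context words
    if (length(words) < n_prefix) next
    prefix <- paste(tail(words, n_prefix), collapse = " ")
    hits   <- lookups[[i]][lookups[[i]]$prefix == prefix, ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$count), ]
      return(head(hits$word, k))
    }
  }
  "the"  # step 4: common-default fallback
}

predict_next_word("I went to the", lookups)
```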
Pruning: Remove rare n-grams to reduce model size and improve latency.
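A one-line illustration of pruning on the lookup tables above; the cutoff of one occurrence is an assumed example rather than the threshold used in the app:

```r
# Drop n-grams observed only once to shrink the model (illustrative cutoff)
lookups <- lapply(lookups, function(tbl) tbl[tbl$count > 1, ])
```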
Shiny app behavior:
- User inputs a phrase
- App returns the top 3 predicted next words
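A minimal Shiny sketch of that behavior, reusing the hypothetical `predict_next_word()` and `lookups` from the earlier sketches; the deployed app's actual code and UI may differ:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Predictor"),
  textInput("phrase", "Enter a phrase:", value = "I went to the"),
  verbatimTextOutput("predictions")
)

server <- function(input, output) {
  output$predictions <- renderText({
    req(nzchar(input$phrase))
    paste(predict_next_word(input$phrase, lookups), collapse = ", ")
  })
}

shinyApp(ui, server)
```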
Example outputs from the model:
- “i love” → you, the, it
- “thank you for” → the, your, following
- “going to the” → movies, beach, gym
Links:
- Shiny app: https://chidemannie.shinyapps.io/Swiftkey_Next_Word_Predictor/
- GitHub repo: https://github.com/chidemannie/swiftkey-capstone
- RPubs EDA report: https://rpubs.com/chidemannie/1398186
Future improvements:
- Smoothing (e.g., Laplace/Katz)
- Profanity filtering
- Stronger tokenization rules
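For reference, the standard Laplace (add-one) form of a smoothed bigram probability, with $V$ the vocabulary size, would be (not part of the current model):

$$P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$$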