SwiftKey NLP: Next Word Predictor

Data Science Student | Johns Hopkins Capstone
June 2026

Slide 1: The App

🔗 Try it now:

https://akshaisuresh.shinyapps.io/Capstone_Project/


What it does: Predicts your next word as you type, just like the autocomplete bar on a smartphone keyboard, powered entirely by data science.


How to use it:

  1. Go to the app URL above
  2. Type any English phrase into the text box (e.g. “I want to”)
  3. Click Predict, or pause for ~0.5 s and the predictions refresh on their own
  4. Click a suggestion to append it and keep going

Works on desktop and mobile browsers. No login required.

Slide 2: The Algorithm

Stupid Backoff N-gram Model (Brants et al., 2007)

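Given input “I want to ___”, the model:

| Step | Action | Score |
|------|--------|-------|
| 1 | Look up quadgrams starting with “I want to” | freq(quad) / freq(tri) |
| 2 | No match → try trigrams starting with “want to” | freq(tri) / freq(bi) × 0.4 |
| 3 | No match → try bigrams starting with “to” | freq(bi) / freq(uni) × 0.4² |
| 4 | No match → top unigrams | freq(uni) / total words × 0.4³ |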
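The back-off steps can be sketched in a few lines. This is a minimal illustration in Python, not the app's R code: the names `count_ngrams`, `predict`, and `ALPHA` are hypothetical, and the sketch assumes each order conditions on the last n-1 words of the input and that the first order with any match wins.

```python
from collections import Counter

ALPHA = 0.4  # back-off discount applied once per order dropped


def count_ngrams(tokens, max_n=4):
    """Count every n-gram of order 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def predict(counts, text, k=3, max_n=4):
    """Return up to k (word, score, order) tuples from the highest order that matches."""
    words = text.lower().split()
    vocab = [gram[0] for gram in counts if len(gram) == 1]
    total = sum(counts[(w,)] for w in vocab)
    for n in range(max_n, 0, -1):
        context = tuple(words[-(n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # input is too short for this order
        penalty = ALPHA ** (max_n - n)
        denom = counts[context] if context else total
        scored = [
            (w, penalty * counts[context + (w,)] / denom, n)
            for w in vocab
            if counts[context + (w,)] > 0
        ]
        if scored:
            # Highest score first; ties broken alphabetically for stable output.
            scored.sort(key=lambda t: (-t[1], t[0]))
            return scored[:k]
    return []
```

With counts built from a toy corpus, `predict(counts, "I want to")` answers from the quadgram table, while an unseen phrase drops to lower orders and picks up a factor of 0.4 per step. The third element of each tuple is the order that fired.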


Why this works

  • Pre-computed frequency tables → < 5 ms per prediction
  • min_freq = 3 pruning → ~150 MB RAM (fits in the shinyapps.io free tier)
  • Handles any input: unseen phrases fall back to the top unigrams, so the result is never empty
  • Built from a ~102-million-word corpus (10% sampled for training) across 3 real-world text styles
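The pruning step is simple to illustrate. A minimal Python sketch, assuming the threshold is inclusive (n-grams seen exactly 3 times are kept); the function name `prune` is hypothetical and the app's R implementation may differ.

```python
from collections import Counter


def prune(counts, min_freq=3):
    """Keep only n-grams observed at least min_freq times (assumed inclusive)."""
    return {gram: c for gram, c in counts.items() if c >= min_freq}


# Toy example: the rare bigram is dropped, the frequent entries survive.
toy_counts = Counter({("a",): 5, ("a", "b"): 3, ("b", "c"): 2})
pruned = prune(toy_counts)
```

Because most n-grams in a large corpus occur only once or twice, a low threshold like this removes the bulk of the table rows while keeping the entries that drive most predictions.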

Slide 3: The Data

HC Corpora: English Training Set

| Source | Lines | Words | Style |
|--------|-------|-------|-------|
| Blogs | 899K | 37M | Long-form, personal |
| News | 1.01M | 34M | Formal, structured |
| Twitter | 2.36M | 30M | Short, conversational |

Training used a 10% random sample (seed = 42) for speed and memory.
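A seeded line sample of this kind might look as follows. This is a Python sketch under assumptions: `sample_lines` is a hypothetical helper, each line is kept independently with probability 0.10, and the seed only makes the Python run repeatable (it will not reproduce the draws of R's `set.seed(42)`).

```python
import random


def sample_lines(lines, fraction=0.10, seed=42):
    """Keep each line independently with probability `fraction`, reproducibly."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return [line for line in lines if rng.random() < fraction]
```

Fixing the seed means the same lines are selected on every run, so the frequency tables can be rebuilt exactly.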


Zipf's Law: Why the model stays small

  127 words  →  50% of all text covered
6,694 words  →  90% of all text covered

This means a vocabulary of ~10,000 words handles nearly everything a user will type. The rest is pruned without meaningfully hurting accuracy.
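Coverage figures like these come from sorting words by frequency and accumulating counts until the target share is reached. A minimal Python sketch, with `words_needed` as a hypothetical name and a toy token list standing in for the corpus:

```python
from collections import Counter


def words_needed(tokens, coverage):
    """Number of distinct words, most frequent first, needed to cover a share of all tokens."""
    counts = Counter(tokens)
    target = coverage * sum(counts.values())
    running = 0
    for rank, (_, c) in enumerate(counts.most_common(), start=1):
        running += c
        if running >= target:
            return rank
    return len(counts)


# Toy corpus of 100 tokens with a steep frequency drop-off.
toy_tokens = ["the"] * 50 + ["of"] * 30 + ["cat"] * 15 + ["zebra"] * 5
half = words_needed(toy_tokens, 0.5)
ninety = words_needed(toy_tokens, 0.9)
```

In the toy corpus one word already covers half the tokens and three words cover 90%, the same long-tail shape that lets the real vocabulary be cut to roughly 10,000 words.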

Slide 4: App Features & Experience

Designed for speed and clarity

  • Real-time predictions update as you type
  • Colour-coded buttons: top suggestion highlighted
  • N-gram badge shows whether a quadgram/trigram/bigram fired
  • Confidence bars: visual ranking of candidates
  • Undo button: remove the last appended word
  • N-gram Explorer tab: browse frequency tables interactively


User experience

“Feels like SwiftKey, but in a browser. Start typing any news headline or tweet; by the third word, predictions are already on target.”

Test phrases (try these):

  • "the president of the" → United States
  • "happy new" → year
  • "thanks for" → the / your / sharing
  • "looking forward to" → seeing / hearing / working

Slide 5: Results & Why It Works

Model Performance

| Metric | Result |
|--------|--------|
| Top-1 accuracy | ~15–17% |
| Top-3 accuracy | ~30–35% |
| Avg prediction time | < 5 ms |
| Model RAM footprint | ~150 MB |
| Source corpus | ~102M words (10% sampled for training) |

Benchmarked on 5,000 held-out Twitter sentences.
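Top-k accuracy can be measured along these lines. The deck does not say how the benchmark was run, so this Python sketch assumes every word after the first in each held-out sentence is a prediction target; `top_k_accuracy` and `predict_fn` are hypothetical names, and `predict_fn` stands for any function that maps a text prefix to a ranked list of candidate words.

```python
def top_k_accuracy(predict_fn, sentences, k=3):
    """Share of next-word targets that appear among the top k predictions."""
    hits = 0
    total = 0
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(1, len(words)):
            prefix = " ".join(words[:i])
            guesses = predict_fn(prefix)[:k]
            if words[i] in guesses:
                hits += 1
            total += 1
    return hits / total if total else 0.0
```

Calling it with `k=1` and `k=3` on the same sentences gives the two accuracy rows of the table; the gap between them shows how often the right word is offered but not ranked first.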


What makes this stand out

✅ Fully deployed: live URL, no setup needed

✅ Robust: predicts for any input, never fails

✅ Transparent: shows which n-gram order fired

✅ Extensible: swap in Kneser-Ney or a neural LM with zero UI changes

✅ Open source R: reproducible, documented, ready to scale


Built with R · data.table · Shiny · tidytext
Data: HC Corpora (SwiftKey / Johns Hopkins)