Next Word Predictor

2025-10-25

Introduction

For my Johns Hopkins Data Science Capstone, I built a Next Word Predictor: a Shiny app that guesses the next word in a phrase, much like a phone keyboard.

- Built with R, Shiny, tidyverse, tidytext
- Trained on the SwiftKey English corpus (blogs, news, Twitter)

Data & Cleaning

- Lowercased text
- Removed punctuation, numbers, extra spaces
- Tokenized into trigrams (3-word sequences) using tidytext
- Used a 15,000-line random sample (fits in 8 GB RAM)
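
The cleaning steps above can be sketched in base R. This is a minimal illustration, not the project's actual code: the helper name `clean_text` and the three sample lines are assumptions, and punctuation is simply deleted (so "don't" would become "dont"), which may differ from how the original pipeline handled it.

```r
# Hypothetical helper; the name and exact rules are illustrative assumptions.
clean_text <- function(x) {
  x <- tolower(x)                    # lowercase
  x <- gsub("[[:punct:]]+", "", x)   # drop punctuation
  x <- gsub("[[:digit:]]+", "", x)   # drop numbers
  x <- gsub("\\s+", " ", x)          # collapse runs of whitespace
  trimws(x)                          # strip leading/trailing spaces
}

# Made-up stand-in for the corpus lines.
raw_lines <- c("I LOVE coding, so much!!", "Call me at 5pm...   OK?", "Thank you :)")

# The project drew a 15,000-line random sample; here we draw 2 of 3 lines.
set.seed(42)
sampled <- sample(raw_lines, size = 2)

cleaned <- clean_text(raw_lines)
print(cleaned)
```

Sampling before cleaning keeps memory use low, since only the sampled lines are ever processed.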

Example: “i love coding so much” → “i love coding”, “love coding so”, “coding so much”
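
To make the sliding window explicit, here is a small base R sketch that reproduces the example. The function name `make_trigrams` is hypothetical. The project itself used tidytext, where `unnest_tokens()` with `token = "ngrams", n = 3` does the same job on a data frame.

```r
# Hypothetical helper: split one cleaned line into overlapping 3-word windows.
make_trigrams <- function(text) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]          # ignore empty tokens
  n <- length(words)
  if (n < 3) return(character(0))        # too short for any trigram
  paste(words[1:(n - 2)], words[2:(n - 1)], words[3:n])
}

trigrams <- make_trigrams("i love coding so much")
print(trigrams)
```

A line of n words yields n - 2 trigrams, so lines with fewer than three words contribute nothing.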

Model

Trigram frequency model: given a two-word prefix, pick the most frequent third word.

Steps:
1. Count (word1, word2, word3) combos
2. Compute prob = n / sum(n) within each (word1, word2) pair
3. Return top word
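
The three steps can be sketched in base R on a toy table. The data below is made up for illustration, and `predict_next` is a hypothetical name; the original was built with tidyverse tools, so treat this as one possible shape of the computation rather than the project's code.

```r
# Toy trigram observations (made-up data, one row per occurrence).
trigrams <- data.frame(
  word1 = c("i", "i", "i", "how", "how", "how"),
  word2 = c("love", "love", "love", "are", "are", "are"),
  word3 = c("you", "you", "coding", "you", "you", "things"),
  stringsAsFactors = FALSE
)

# Step 1: count each (word1, word2, word3) combination.
counts <- aggregate(n ~ word1 + word2 + word3,
                    data = cbind(trigrams, n = 1L), FUN = sum)

# Step 2: prob = n / sum(n) within each (word1, word2) prefix.
counts$prob <- counts$n / ave(counts$n, counts$word1, counts$word2, FUN = sum)

# Step 3: return the most probable third word, or NA for an unseen prefix.
predict_next <- function(model, w1, w2) {
  hits <- model[model$word1 == w1 & model$word2 == w2, ]
  if (nrow(hits) == 0) return(NA_character_)
  as.character(hits$word3[which.max(hits$prob)])
}

print(predict_next(counts, "i", "love"))
```

Because the probabilities are normalised per prefix, picking the highest probability is the same as picking the highest count; the probability column mainly helps when showing ranked suggestions.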

Examples:
- i love → you
- how are → you
- one of → the

The Shiny App

How it works:
1. User enters a phrase
2. Text is cleaned like training data
3. App searches trigram model
4. Top predictions shown
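
A rough sketch of that flow is below. Everything in it is illustrative: the toy `model` table, the `predict_top` helper, and the input/output ids are assumptions, and so is the choice to look up only the last two words of the phrase. The Shiny part is guarded so the script still runs where shiny is not installed or the session is not interactive.

```r
# Toy stand-in for the trained trigram table (made-up probabilities).
model <- data.frame(
  word1 = c("i", "i", "thank"),
  word2 = c("love", "love", "you"),
  word3 = c("you", "coding", "for"),
  prob  = c(0.7, 0.3, 1.0),
  stringsAsFactors = FALSE
)

# Steps 2-4: clean the input, take its last two words, rank the matches.
predict_top <- function(phrase, model, k = 3) {
  phrase <- gsub("[[:punct:][:digit:]]+", "", tolower(phrase))
  phrase <- trimws(gsub("\\s+", " ", phrase))
  words <- strsplit(phrase, " ", fixed = TRUE)[[1]]
  if (length(words) < 2) return(character(0))   # need a two-word prefix
  prefix <- tail(words, 2)
  hits <- model[model$word1 == prefix[1] & model$word2 == prefix[2], ]
  head(hits$word3[order(-hits$prob)], k)
}

# Step 1 and the display: a minimal reactive UI around predict_top().
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  ui <- shiny::fluidPage(
    shiny::textInput("phrase", "Enter a phrase"),
    shiny::verbatimTextOutput("prediction")
  )
  server <- function(input, output, session) {
    output$prediction <- shiny::renderText({
      paste(predict_top(input$phrase, model), collapse = ", ")
    })
  }
  shiny::runApp(shiny::shinyApp(ui, server))
}
```

Keeping the lookup in a plain function makes it testable without launching the app.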

Live app:
https://saanienaqvi.shinyapps.io/NextWordPredictor/

Try: “i love”, “how are”, “thank you”.

Wrap-Up & Next Steps

Demonstrates NLP basics: tokenization, counts, probabilities, reactive UI.

Future ideas:
- Back-off to bigrams/unigrams
- Larger training set with data.table or quanteda
- Small UI tweaks for mobile
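
As a rough idea of what the back-off item could look like: try the trigram table first, fall back to bigrams on the last word, then to the most common unigram. This is a plain fallback with no discounting or score weighting, and the tables and names (`tri`, `bi`, `uni`, `predict_backoff`) are made up for the sketch.

```r
# Hypothetical toy count tables; real ones would come from the corpus.
tri <- data.frame(word1 = "i", word2 = "love", word3 = "you", n = 5L,
                  stringsAsFactors = FALSE)
bi  <- data.frame(word1 = c("love", "are"), word2 = c("you", "you"),
                  n = c(7L, 9L), stringsAsFactors = FALSE)
uni <- data.frame(word = c("the", "you"), n = c(100L, 40L),
                  stringsAsFactors = FALSE)

# Most frequent candidate, or NULL when there are no candidates.
top <- function(words, n) if (length(words)) words[which.max(n)] else NULL

predict_backoff <- function(w1, w2) {
  t_hits <- tri[tri$word1 == w1 & tri$word2 == w2, ]
  b_hits <- bi[bi$word1 == w2, ]
  guess <- top(t_hits$word3, t_hits$n)                       # trigram level
  if (is.null(guess)) guess <- top(b_hits$word2, b_hits$n)   # bigram level
  if (is.null(guess)) guess <- top(uni$word, uni$n)          # unigram level
  guess
}

print(predict_backoff("we", "are"))
```

With this fallback the predictor always returns something, even for prefixes that never appeared in the training sample.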