2025-10-25

Introduction

For my Johns Hopkins Data Science Capstone, I built a Next Word Predictor — a Shiny app that guesses the next word in a phrase (like your phone keyboard).

  • Built with R, Shiny, tidyverse, tidytext
  • Trained on the SwiftKey English corpus (blogs, news, Twitter)

Data & Cleaning

  • Lowercased text
  • Removed punctuation, numbers, extra spaces
  • Tokenized into trigrams (3-word sequences) using tidytext
  • Used a 15,000-line random sample (fits in 8 GB RAM)

Example: “i love coding so much” →
• “i love coding”, “love coding so”, “coding so much”
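
A minimal sketch of the cleaning and tokenizing steps above, assuming dplyr, stringr and tidytext are installed. The one-row `lines` tibble and the `clean_text` helper are illustrative names, not the app's actual code.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical input: one row per line of text (stand-in for the sampled corpus).
lines <- tibble(text = "I love coding so much!")

# Illustrative helper mirroring the cleaning bullets above.
clean_text <- function(x) {
  x %>%
    str_to_lower() %>%                  # lowercase
    str_remove_all("[[:punct:]]") %>%   # drop punctuation
    str_remove_all("[[:digit:]]") %>%   # drop numbers
    str_squish()                        # collapse extra spaces
}

trigrams <- lines %>%
  mutate(text = clean_text(text)) %>%
  # unnest_tokens() also lowercases by default; the explicit cleaning
  # above just makes each step visible.
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram))   # lines with fewer than three words yield NA

trigrams$trigram
# "i love coding"  "love coding so"  "coding so much"
```

On the real data, `lines` would hold the 15,000 sampled lines instead of a single sentence.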

Model

Trigram frequency model: given a two-word prefix, pick the most frequent third word.

Steps:
1. Count (word1, word2, word3) combos
2. Compute prob = n / sum(n) within each (word1, word2) pair
3. Return top word

Examples:
- i love → you
- how are → you
- one of → the
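
The three steps can be sketched roughly as follows, assuming dplyr and tidyr. The toy `trigrams` table and the `predict_next` function are hypothetical names for illustration; the app's own code may differ.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy trigrams; the real table would come from the sampled corpus.
trigrams <- tibble(trigram = c("i love you", "i love you", "i love coding",
                               "how are you", "one of the"))

# Steps 1-2: count each (word1, word2, word3) combo, then turn the counts
# into probabilities within each (word1, word2) prefix.
model <- trigrams %>%
  separate(trigram, into = c("word1", "word2", "word3"), sep = " ") %>%
  count(word1, word2, word3) %>%
  group_by(word1, word2) %>%
  mutate(prob = n / sum(n)) %>%
  ungroup()

# Step 3: return the most probable third word for a two-word prefix,
# or NA when the prefix was never seen.
predict_next <- function(model, w1, w2) {
  hits <- model %>%
    filter(word1 == w1, word2 == w2) %>%
    slice_max(prob, n = 1, with_ties = FALSE)
  if (nrow(hits) == 0) NA_character_ else hits$word3
}

predict_next(model, "i", "love")   # "you" on this toy data (2 of 3 counts)
```

An unseen prefix returns NA here, which is the gap a back-off model would fill.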

The Shiny App

Wrap-Up & Next Steps

The project demonstrates NLP basics: tokenization, n-gram counts, conditional probabilities, and a reactive UI.

Future ideas:
- Back-off to bigrams/unigrams
- Larger training set with data.table or quanteda
- Small UI tweaks for mobile