2025-12-02

Next-Word Prediction App

This project demonstrates a predictive text model trained on a sample of the SwiftKey English corpus. The model is deployed as a Shiny application that predicts the next word in a user-typed phrase.

The app showcases:
  • A backoff n-gram model (trigram → bigram → unigram)
  • Efficient preprocessing applied to large text data
  • An intuitive, real-time prediction interface

Data Preparation & Preprocessing

Data Source
  • SwiftKey English datasets (blogs, news, Twitter)
  • ~3 million lines of text; a sampled subset (1%) used for training
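As a rough illustration of the 1% sampling step, the sketch below keeps each line independently with a fixed probability. It is written in Python and is not the project's own code; the function name sample_lines, the seed, and the stand-in corpus are assumptions made for the example.

```python
import random


def sample_lines(lines, rate=0.01, seed=42):
    """Keep each line independently with probability `rate`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < rate]


# Stand-in for the raw text lines (hypothetical data, not the SwiftKey corpus).
corpus = [f"line {i}" for i in range(100_000)]
training_sample = sample_lines(corpus)
print(len(training_sample))  # close to 1,000; the exact count depends on the seed
```

Fixing the seed makes the training subset reproducible from run to run.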

Cleaning Steps
  • Convert to lowercase
  • Remove punctuation, numbers, and symbols
  • Tokenize text using quanteda
  • Generate unigrams, bigrams, trigrams
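The cleaning steps can be sketched in a few lines. This is a simplified Python stand-in, not the quanteda pipeline the project uses: the regex-based clean_tokens and the count_ngrams helper are hypothetical names, and the rule "keep ASCII letters and in-word apostrophes" is an assumption that may differ from quanteda's tokenizer.

```python
import re
from collections import Counter


def clean_tokens(text):
    """Lowercase, drop punctuation/numbers/symbols, split on whitespace."""
    text = text.lower()
    # Assumption: keep only a-z and apostrophes; everything else becomes a space.
    text = re.sub(r"[^a-z'\s]", " ", text)
    tokens = (tok.strip("'") for tok in text.split())
    return [tok for tok in tokens if tok]


def count_ngrams(lines, n):
    """Count every run of n consecutive tokens, line by line."""
    counts = Counter()
    for line in lines:
        toks = clean_tokens(line)
        for i in range(len(toks) - n + 1):
            counts[tuple(toks[i:i + n])] += 1
    return counts


lines = ["I am here.", "I am there, 2 days late!"]
print(clean_tokens(lines[1]))            # ['i', 'am', 'there', 'days', 'late']
print(count_ngrams(lines, 2)[("i", "am")])  # 2
```

N-grams are counted within a line only, so no n-gram spans two unrelated lines.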

Model Reduction (for deployment)
  • Keep only top 5,000 unigrams
  • Keep top 20,000 bigrams & trigrams
  • Save trimmed model as ngram_model.RData (lightweight for Shiny)
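Trimming a frequency table to its most common entries might look like the Python sketch below. The helper name keep_top and the toy counts are assumptions; in the project the cutoffs would be 5,000 for unigrams and 20,000 for bigrams and trigrams.

```python
from collections import Counter


def keep_top(counts, k):
    """Return a new Counter holding only the k most frequent entries."""
    return Counter(dict(counts.most_common(k)))


# Toy bigram table (hypothetical counts).
bigram_counts = Counter({("of", "the"): 9, ("in", "the"): 7, ("at", "noon"): 1})
trimmed = keep_top(bigram_counts, 2)
print(sorted(trimmed.items()))
```

Rare n-grams contribute little to the top prediction, so dropping them shrinks the saved model at a small cost in coverage.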

Prediction Algorithm (Backoff Model)

  1. Trigram match
    • Extract the last two words of the user’s phrase
    • If a matching trigram exists, return the most frequent next word
  2. Bigram match (backoff)
    • If no trigram match, use the last word to search bigrams
    • Return the most likely next word
  3. Unigram fallback
    • If no bigram match, return the most common word in the corpus
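The three steps above can be sketched as a single lookup function. This is a minimal Python illustration under assumed data structures (counters keyed by context), not the app's actual implementation; build_model and predict_next are hypothetical names.

```python
from collections import Counter, defaultdict


def build_model(token_lists):
    """Count unigrams, plus next-word counts for 1-word and 2-word contexts."""
    unigrams = Counter()
    bigrams = defaultdict(Counter)   # last word -> Counter of next words
    trigrams = defaultdict(Counter)  # (last two words) -> Counter of next words
    for toks in token_lists:
        unigrams.update(toks)
        for a, b in zip(toks, toks[1:]):
            bigrams[a][b] += 1
        for a, b, c in zip(toks, toks[1:], toks[2:]):
            trigrams[(a, b)][c] += 1
    return unigrams, bigrams, trigrams


def predict_next(tokens, unigrams, bigrams, trigrams):
    """Trigram match, then bigram backoff, then unigram fallback."""
    if len(tokens) >= 2:
        followers = trigrams.get(tuple(tokens[-2:]))
        if followers:
            return followers.most_common(1)[0][0]
    if tokens:
        followers = bigrams.get(tokens[-1])
        if followers:
            return followers.most_common(1)[0][0]
    if unigrams:
        return unigrams.most_common(1)[0][0]
    return None  # empty model: nothing to suggest


sentences = [
    "i am going to the store".split(),
    "i am going to the park".split(),
    "i am going to be late".split(),
    "the cat and the dog".split(),
]
model = build_model(sentences)
print(predict_next("i am going to".split(), *model))  # the
```

Using .get for the lookups avoids inserting empty entries into the defaultdict tables when a context is unseen.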

An n-gram backoff model is fast, interpretable, well suited to next-word prediction, and reliable even with small datasets.

Using the Shiny App

App Link:
https://sleepystitch.shinyapps.io/ShinyAppPredictor/

Instructions
  1. Type a phrase into the text box (e.g., “I am going to”)
  2. Click “Predict”
  3. App returns:
    • The most likely next word
    • The top 3 possible next words based on the model

Behind the Scenes
  • Loads a pre-computed n-gram model
  • Cleans user text the same way as training data
  • Applies the trigram → bigram → unigram backoff logic
  • Produces predictions instantly
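The request path (clean the input, then rank candidates) could look roughly like this Python sketch. The toy tables stand in for the model loaded from ngram_model.RData, and the way the top 3 list is filled, by topping up from lower-order tables when the matched level has fewer than three candidates, is an assumption; the source does not say how the app builds that list.

```python
import re
from collections import Counter

# Hypothetical toy tables; the real app loads its model from ngram_model.RData.
TRIGRAMS = {("going", "to"): Counter({"be": 5, "the": 3})}
BIGRAMS = {"to": Counter({"the": 9, "be": 7, "get": 4})}
UNIGRAMS = Counter({"the": 50, "to": 30, "and": 20, "a": 10})


def clean_tokens(text):
    """Apply the same cleaning as training: lowercase, letters only, split."""
    text = re.sub(r"[^a-z'\s]", " ", text.lower())
    tokens = (tok.strip("'") for tok in text.split())
    return [tok for tok in tokens if tok]


def suggest(phrase, k=3):
    """Return up to k candidate next words, highest-order matches first."""
    tokens = clean_tokens(phrase)
    sources = []
    if len(tokens) >= 2:
        sources.append(TRIGRAMS.get(tuple(tokens[-2:]), Counter()))
    if tokens:
        sources.append(BIGRAMS.get(tokens[-1], Counter()))
    sources.append(UNIGRAMS)

    ranked = []
    for counts in sources:
        for word, _ in counts.most_common():
            if word not in ranked:
                ranked.append(word)
            if len(ranked) == k:
                return ranked
    return ranked


print(suggest("I am going to"))  # ['be', 'the', 'get']
```

The first element of the list plays the role of the most likely next word; the full list corresponds to the top 3 display.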

Summary & User Experience

Strengths of the App
  • Fast, lightweight model optimized for deployment
  • Simple, intuitive interface
  • Predicts next word in real time
  • Built using reproducible text mining techniques

User Experience
  • Immediate, smooth predictions
  • Easy to understand and interact with
  • Demonstrates the core idea behind predictive typing tools

Future Enhancements
  • Train on a larger sample for higher accuracy
  • Add probability scores
  • Incorporate smoothing or neural language models
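One possible shape for the probability-score enhancement is sketched below in Python: relative frequencies within the matched context, with optional add-alpha (Laplace) smoothing. This is only an illustration of the idea, not a planned design; the function name and parameters are hypothetical.

```python
from collections import Counter


def next_word_probabilities(followers, k=3, alpha=0.0, vocab_size=0):
    """Score the top-k next words as (count + alpha) / (total + alpha * V)."""
    total = sum(followers.values()) + alpha * vocab_size
    if total == 0:
        return []
    return [(word, (count + alpha) / total)
            for word, count in followers.most_common(k)]


# Hypothetical next-word counts for one context.
followers = Counter({"the": 3, "be": 1})
print(next_word_probabilities(followers))
# [('the', 0.75), ('be', 0.25)]
print(next_word_probabilities(followers, alpha=1.0, vocab_size=4))
# [('the', 0.5), ('be', 0.25)]
```

With alpha set to 0 the scores are plain maximum-likelihood estimates; a positive alpha reserves some probability mass for words never seen after the context.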

Thank you!