Overview

  • Goal: Predict the next word from a user-typed phrase
  • Built with Shiny as an interactive NLP app
  • Real-time prediction using N-gram models
  • Useful for mobile typing suggestions, chatbots, etc.

Data Source & Cleaning

  • Dataset: HC Corpora (Blogs, News, Twitter)
  • Steps:
    • Removed punctuation, numbers, stopwords
    • Converted to lowercase
    • Tokenized into N-grams (unigrams, bigrams, trigrams)
  • Sampled about 5% of the data for efficiency
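
The cleaning and tokenizing steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the app's R code (which uses dplyr, stringr, and tidytext). The function names, the tiny stopword list, and the fixed random seed are all assumptions made for the example.

```python
import random
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "to", "and", "is", "in"}

def sample_lines(lines, fraction=0.05, seed=42):
    """Return a reproducible random sample of roughly `fraction` of the lines."""
    rng = random.Random(seed)
    k = max(1, int(len(lines) * fraction))
    return rng.sample(lines, k)

def clean(text):
    """Lowercase, replace punctuation and digits with spaces, drop stopwords."""
    text = text.lower()
    # Keep only ASCII letters and whitespace (so "don't" becomes "don", "t").
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(lines, max_n=3):
    """Count 1- to max_n-grams line by line, so no n-gram spans two lines."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = clean(line)
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts
```

Calling `count_ngrams(sample_lines(corpus_lines))` on a list of text lines would give one frequency table per N-gram order; `corpus_lines` is a placeholder name for the loaded corpus.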

Prediction Algorithm

  • Built N-gram language models
  • Used Stupid Backoff strategy:
    • Try trigram → if not found, back off to bigram → then unigram
  • Fast, simple, and copes with sparse data, since unseen N-grams fall back to shorter contexts
  • Implemented in R with dplyr, stringr, tidytext
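
A minimal sketch of the backoff lookup described above, written in Python for illustration and not taken from the app's R code. It assumes the counts are stored as one `Counter` per N-gram order, and it uses a discount of 0.4 per backoff step, the value suggested in the original Stupid Backoff paper. The helper names and the toy sentences are hypothetical.

```python
from collections import Counter

ALPHA = 0.4  # per-step backoff discount (assumed; value from the original paper)

def build_counts(sentences, max_n=3):
    """Count 1- to max_n-grams over pre-tokenized sentences."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in counts:
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def stupid_backoff_scores(context, counts, max_n=3):
    """Score candidate next words; longer matching contexts take precedence."""
    context = tuple(context)[-(max_n - 1):] if max_n > 1 else ()
    scores, penalty = {}, 1.0
    while True:
        n = len(context) + 1
        # Denominator: count of the context itself, or total tokens for unigrams.
        denom = counts[n - 1][context] if context else sum(counts[1].values())
        if denom > 0:
            # Linear scan for clarity; a real app would index n-grams by prefix.
            for gram, c in counts[n].items():
                if gram[:-1] == context and gram[-1] not in scores:
                    scores[gram[-1]] = penalty * c / denom
        if not context:
            break
        context = context[1:]  # back off: drop the oldest word
        penalty *= ALPHA
    return scores

def predict_next(tokens, counts, k=3):
    """Return the k highest-scoring next words (ties broken alphabetically)."""
    scores = stupid_backoff_scores(tokens, counts)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

toy = [
    ["i", "love", "new", "york"],
    ["we", "love", "new", "york"],
    ["i", "love", "new", "shoes"],
    ["i", "love", "you"],
]
toy_counts = build_counts(toy)
print(predict_next(["love", "new"], toy_counts, k=2))     # trigram match
print(predict_next(["really", "love"], toy_counts, k=2))  # backs off to bigram
```

The scores are relative frequencies, not normalized probabilities, which is what keeps the method cheap: no smoothing pass is needed over the count tables.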

The Shiny App

Reflection

  • Works well for many common phrases
  • Could improve with:
    • Smarter smoothing (e.g. Kneser-Ney)
    • Contextual models (e.g. RNNs, transformers)
  • Great experience building a full NLP pipeline!