
Slide 1: How the Model Works

  • Objective: Build a predictive text model for efficient word suggestions.

  • Approach:

  1. Uses n-gram modeling (unigram, bigram, trigram) for next-word prediction.
  2. Implements a Stupid Backoff algorithm for smoothing and handling unseen n-grams.
  • Pipeline:
  1. Input Tokenization: Cleans and tokenizes the text.
  2. Search for n-grams: Matches bigrams and trigrams from a pre-trained frequency dataset.
  3. Fallback: Falls back to unigrams if higher-order matches fail (see the backoff sketch below).
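
A minimal sketch of this lookup cascade, assuming pre-built frequency tables `trigrams` and `bigrams` (data frames with `prefix`, `word`, and `count` columns) and `unigrams` (with `word` and `count`); all names are illustrative, not the app's actual code:

```r
# Illustrative backoff lookup: try trigrams, then bigrams, then unigrams.
# (Full Stupid Backoff also multiplies backed-off scores by lambda ~ 0.4;
# that penalty does not change the ranking within a single order, so this
# sketch simply returns from the highest order that matches.)
predict_next <- function(input, trigrams, bigrams, unigrams, k = 3) {
  tokens <- tolower(strsplit(trimws(input), "\\s+")[[1]])
  n <- length(tokens)

  # 1. Trigram lookup: condition on the last two words.
  if (n >= 2) {
    hits <- trigrams[trigrams$prefix == paste(tokens[n - 1], tokens[n]), ]
    if (nrow(hits) > 0) return(head(hits$word[order(-hits$count)], k))
  }

  # 2. Bigram lookup: condition on the last word.
  if (n >= 1) {
    hits <- bigrams[bigrams$prefix == tokens[n], ]
    if (nrow(hits) > 0) return(head(hits$word[order(-hits$count)], k))
  }

  # 3. Fallback: the k most frequent unigrams.
  head(unigrams$word[order(-unigrams$count)], k)
}
```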

Slide 2: Prediction

  • Data:
    • Corpus: HC Corpora (blogs, news, tweets; 10,000 lines sampled per file).
    • Preprocessing: Lowercased and tokenized; punctuation, symbols, numbers, URLs, and profanity removed (see the cleaning sketch below).
  • Modeling:
    • N-grams: Predicts the next word from the last one or two words; trigram matches proved rare in practice (0 in every Slide 3 example), so bigrams carry most predictions.
    • Backoff: Stupid Backoff cascades from trigrams to bigrams, then to unigrams when no higher-order match is found.
    • Profanity Filtering: Ensures clean predictions.
  • Link: https://yuemin2025.shinyapps.io/TextPredictionApp/
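
A sketch of the cleaning pass listed above, under the assumption that `profanity` is a character vector of banned words; the app's actual tokenizer may differ:

```r
# Illustrative cleaning/tokenization matching the preprocessing steps above.
clean_text <- function(x, profanity = character(0)) {
  x <- tolower(x)
  x <- gsub("http\\S+|www\\.\\S+", " ", x)  # strip URLs
  x <- gsub("[0-9]+", " ", x)               # strip numbers
  x <- gsub("[^a-z' ]", " ", x)             # strip punctuation and symbols
  tokens <- unlist(strsplit(x, "\\s+"))
  tokens <- tokens[nzchar(tokens)]
  tokens[!tokens %in% profanity]            # profanity filter
}

clean_text("Check https://example.com, it's #1!")
#> "check" "it's"
```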

Slide 3: Predictive Performance

  • Sample Output (Trigram/Bigram values give the number of matches found at each order):
    • Input: “this is a” | Predicted: “said, will, one” (Trigram: 0, Bigram: 0)
    • Input: “hello world” | Predicted: “war, series, peace” (Trigram: 0, Bigram: 66)
    • Input: “the quick brown” | Predicted: “sugar, eyes, ale” (Trigram: 0, Bigram: 10)
  • Performance Metrics (top-k computation sketched below):
    • Total Predictions: 957
    • Top-1 Accuracy: 0.03866249; Top-2 Accuracy: 0.05015674; Top-3 Accuracy: 0.05956113
    • Perplexity: 1.359654
    • Runtime: 115.4759 s
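
For context, a sketch of how the top-k figures can be computed, assuming `preds` is a list of ranked prediction vectors and `truth` the vector of actual next words (both names are hypothetical):

```r
# Top-k accuracy: fraction of cases where the true next word appears
# among the first k ranked predictions.
top_k_accuracy <- function(preds, truth, k) {
  mean(mapply(function(p, t) t %in% head(p, k), preds, truth))
}

# top_k_accuracy(preds, truth, 1)  # Top-1
# top_k_accuracy(preds, truth, 3)  # Top-3
```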

Slide 4: Improved Predictive Performance

  • Performance Metrics (MRR computation sketched below):
    • Total Predictions: 957
    • Top-1 Accuracy: 0.03866249; Top-2 Accuracy: 0.04284222; Top-3 Accuracy: 0.04806688
    • Mean Reciprocal Rank (MRR): 0.04754441
    • Runtime: 107.0584 s (down from 115.4759 s on Slide 3)
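
MRR averages the reciprocal rank of the true word within each ranked prediction list, counting 0 when it is absent; a sketch using the same hypothetical `preds`/`truth` structure as on Slide 3:

```r
# Mean Reciprocal Rank: 1/position of the true word in the ranked list,
# or 0 if the word is not among the predictions.
mrr <- function(preds, truth) {
  mean(mapply(function(p, t) {
    r <- match(t, p)
    if (is.na(r)) 0 else 1 / r
  }, preds, truth))
}
```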