
Slide 1: How the Model Works

  • Objective: Build a predictive text model for efficient word suggestions.

  • Approach:

  1. Uses n-gram modeling (unigram, bigram, trigram) for next-word prediction.
  2. Implements a Stupid Backoff algorithm for smoothing and handling unseen n-grams.
  • Pipeline:
  1. Input Tokenization: Cleans and tokenizes the text.
  2. Search for n-grams: Matches bigrams and trigrams from a pre-trained frequency dataset.
  3. Fallback: Falls back to unigrams if higher-order matches fail (see the backoff sketch below).
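
A minimal sketch of this lookup cascade, assuming pre-built frequency tables `trigrams` and `bigrams` (data frames with `prefix`, `word`, and `count` columns) and `unigrams` (with `word` and `count`); all names are illustrative, not the app's actual code:

```r
# Illustrative backoff lookup: try trigrams, then bigrams, then unigrams.
# (Full Stupid Backoff also multiplies backed-off scores by lambda ~ 0.4;
# that penalty does not change the ranking within a single order, so this
# sketch simply returns from the highest order that matches.)
predict_next <- function(input, trigrams, bigrams, unigrams, k = 3) {
  tokens <- tolower(strsplit(trimws(input), "\\s+")[[1]])
  n <- length(tokens)

  # 1. Trigram lookup: condition on the last two words.
  if (n >= 2) {
    hits <- trigrams[trigrams$prefix == paste(tokens[n - 1], tokens[n]), ]
    if (nrow(hits) > 0) return(head(hits$word[order(-hits$count)], k))
  }

  # 2. Bigram lookup: condition on the last word.
  if (n >= 1) {
    hits <- bigrams[bigrams$prefix == tokens[n], ]
    if (nrow(hits) > 0) return(head(hits$word[order(-hits$count)], k))
  }

  # 3. Fallback: the k most frequent unigrams.
  head(unigrams$word[order(-unigrams$count)], k)
}
```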

Slide 2: Prediction

  • Data:
    • Corpus: HC Corpora (blogs, news, tweets; 10,000 lines sampled per file).
    • Preprocessing: Lowercased and tokenized; punctuation, symbols, numbers, URLs, and profanity removed (see the cleaning sketch below).
  • Modeling:
    • N-grams: Predicts the next word from the last one or two words; trigram matches proved rare in practice (0 in every Slide 3 example), so bigrams carry most predictions.
    • Backoff: Stupid Backoff cascades from trigrams to bigrams, then to unigrams when no higher-order match is found.
    • Profanity Filtering: Ensures clean predictions.
  • Link: https://yuemin2025.shinyapps.io/TextPredictionApp/
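
A sketch of the cleaning pass listed above, under the assumption that `profanity` is a character vector of banned words; the app's actual tokenizer may differ:

```r
# Illustrative cleaning/tokenization matching the preprocessing steps above.
clean_text <- function(x, profanity = character(0)) {
  x <- tolower(x)
  x <- gsub("http\\S+|www\\.\\S+", " ", x)  # strip URLs
  x <- gsub("[0-9]+", " ", x)               # strip numbers
  x <- gsub("[^a-z' ]", " ", x)             # strip punctuation and symbols
  tokens <- unlist(strsplit(x, "\\s+"))
  tokens <- tokens[nzchar(tokens)]
  tokens[!tokens %in% profanity]            # profanity filter
}

clean_text("Check https://example.com, it's #1!")
#> "check" "it's"
```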

Slide 3: Predictive Performance

  • Sample Output (Trigram/Bigram values give the number of matches found at each order):
    • Input: “this is a” | Predicted: “said, will, one” (Trigram: 0, Bigram: 0)
    • Input: “hello world” | Predicted: “war, series, peace” (Trigram: 0, Bigram: 66)
    • Input: “the quick brown” | Predicted: “sugar, eyes, ale” (Trigram: 0, Bigram: 10)
  • Performance Metrics (top-k computation sketched below):
    • Total Predictions: 957
    • Top-1 Accuracy: 0.03866249; Top-2 Accuracy: 0.05015674; Top-3 Accuracy: 0.05956113
    • Perplexity: 1.359654
    • Runtime: 115.4759 s
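
For context, a sketch of how the top-k figures can be computed, assuming `preds` is a list of ranked prediction vectors and `truth` the vector of actual next words (both names are hypothetical):

```r
# Top-k accuracy: fraction of cases where the true next word appears
# among the first k ranked predictions.
top_k_accuracy <- function(preds, truth, k) {
  mean(mapply(function(p, t) t %in% head(p, k), preds, truth))
}

# top_k_accuracy(preds, truth, 1)  # Top-1
# top_k_accuracy(preds, truth, 3)  # Top-3
```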

Slide 4: Improved Predictive Performance

  • Performance Metrics (MRR computation sketched below):
    • Total Predictions: 957
    • Top-1 Accuracy: 0.03866249; Top-2 Accuracy: 0.04284222; Top-3 Accuracy: 0.04806688
    • Mean Reciprocal Rank (MRR): 0.04754441
    • Runtime: 107.0584 s (down from 115.4759 s on Slide 3)
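
MRR averages the reciprocal rank of the true word within each ranked prediction list, counting 0 when it is absent; a sketch using the same hypothetical `preds`/`truth` structure as on Slide 3:

```r
# Mean Reciprocal Rank: 1/position of the true word in the ranked list,
# or 0 if the word is not among the predictions.
mrr <- function(preds, truth) {
  mean(mapply(function(p, t) {
    r <- match(t, p)
    if (is.na(r)) 0 else 1 / r
  }, preds, truth))
}
```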