Text Prediction Algorithm - Data Science Capstone project

Next-Word Prediction Using N-gram Language Model

Dejan Dojcinovic

2026-03-10

Algorithm Description

5-gram Language Model with Stupid Backoff

  • Predicts next word based on previous 4 words
  • Backoff: 5-gram → 4-gram → 3-gram → 2-gram → 1-gram
  • Multiply the score by a penalty (α = 0.4) at each backoff step
  • Return top 5 predictions with scores
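
The backoff scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the application's actual code: the function names `train` and `predict`, the in-memory `Counter` table, and the toy tokenized input are all assumptions made for the example. Only the model order (5), the penalty (α = 0.4), and the top-5 output come from the slide.

```python
from collections import Counter

ALPHA = 0.4      # backoff penalty from the slide
MAX_ORDER = 5    # 5-gram model: at most 4 words of context

def train(sentences, max_order=MAX_ORDER):
    """Count every n-gram of order 1..max_order in tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def predict(counts, context, top_k=5, max_order=MAX_ORDER):
    """Return up to top_k (word, score, ngram_order) tuples via Stupid Backoff."""
    context = tuple(context[-(max_order - 1):])
    total_unigrams = sum(c for g, c in counts.items() if len(g) == 1)
    scores = {}
    penalty = 1.0
    while True:
        # Denominator: count of the context, or corpus size at the unigram level.
        denom = counts[context] if context else total_unigrams
        if denom > 0:
            n = len(context) + 1
            # Linear scan for clarity; a real table would be indexed by context.
            for gram, c in counts.items():
                if len(gram) == n and gram[:-1] == context:
                    word = gram[-1]
                    # Keep the score from the highest order that saw this word.
                    if word not in scores:
                        scores[word] = (penalty * c / denom, n)
        if not context:
            break
        context = context[1:]   # drop the oldest word
        penalty *= ALPHA        # apply the penalty once per backoff step
    ranked = sorted(scores.items(), key=lambda kv: -kv[1][0])
    return [(w, s, n) for w, (s, n) in ranked[:top_k]]
```

Note that the scores are relative frequencies scaled by α, not normalized probabilities, which is why the method is called "stupid" backoff.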

Model Statistics:

  • Size: 33 MB | Vocabulary: 50,000 words
  • N-grams: 3.7 million | Training: 13M sentences
  • Response: ~3 ms

Application Description

Interactive Shiny Application

  • Real-Time Prediction: Enter text, get instant predictions with confidence scores

  • Top 5 Results: View likely words with visual bars

  • Performance Metrics: See n-gram level and response time

  • Clean Predictions: Filters stopwords, returns meaningful words only

Instructions on How to Use

1. Enter Text - Type a phrase (e.g., “I’m going to”)

2. Click Predict - Press the button or pick an example sentence

3. View Results - Top 5 predictions with confidence percentages

4. Explore Details - Check n-gram level, response time, model stats

How It Functions

Process: Clean text → Extract context → Search tables → Apply backoff → Filter → Return predictions
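
The steps around the table search can be sketched as follows. This is a hedged illustration under stated assumptions: the function names, the cleaning rules, and the tiny `STOPWORDS` set are hypothetical and are not taken from the application. The n-gram search and backoff are represented by a `lookup` callable that the caller supplies.

```python
import re

# Illustrative subset only; the application's actual stopword list is not given.
STOPWORDS = {"the", "a", "an", "of", "to", "and"}

def clean_text(text):
    """Lowercase, normalize curly apostrophes, keep only letters and apostrophes."""
    text = text.lower().replace("\u2019", "'")
    text = re.sub(r"[^a-z' ]+", " ", text)
    return text.split()

def extract_context(tokens, max_words=4):
    """Keep the last 4 words, the most a 5-gram model can use."""
    return tokens[-max_words:]

def filter_candidates(candidates, top_k=5):
    """Drop stopwords from ranked (word, score) pairs and keep the top_k."""
    kept = [(w, s) for w, s in candidates if w not in STOPWORDS]
    return kept[:top_k]

def predict_next(text, lookup):
    """lookup(context) stands in for the n-gram search with backoff."""
    context = extract_context(clean_text(text))
    return filter_candidates(lookup(context))
```

For example, `clean_text("I’m going to")` yields `["i'm", "going", "to"]`, which is then passed to the lookup as context.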

Training Data: 50% OpenSubtitles, 20% Movies, 15% Twitter, 10% Blogs, 5% News

Evaluation:

  • 90% accuracy (multiple choice)
  • 100% meaningful predictions
  • 20% accuracy (open prediction)
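
The slide does not define how the two accuracy figures were computed. One plausible reading, sketched here purely as an assumption, is that open prediction counts a hit when the true next word appears among the top-k predictions, and multiple choice counts a hit when the true word scores highest among a given set of options. The function names and the `predict` and `score` callables are hypothetical.

```python
def open_accuracy(predict, examples, top_k=1):
    """examples: (context, target) pairs; predict(context) -> ranked list of words."""
    hits = sum(1 for context, target in examples
               if target in predict(context)[:top_k])
    return hits / len(examples)

def multiple_choice_accuracy(score, examples):
    """examples: (context, choices, target); score(context, word) -> float."""
    hits = sum(1 for context, choices, target in examples
               if max(choices, key=lambda w: score(context, w)) == target)
    return hits / len(examples)
```

Under this reading, multiple choice is the easier task because the model only has to rank a handful of candidates rather than the whole vocabulary, which would be consistent with the gap between the two figures.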