2025-12-31

Project Overview

This project develops a predictive text system based on statistical language modeling, designed to suggest the next word in a sequence efficiently and accurately.

The system uses n-gram models (unigrams, bigrams, and trigrams) combined with a backoff strategy and Laplace smoothing to handle unseen word combinations. It is implemented in R, with an interactive Shiny web application for real-time predictions.

Key Features

  • Processes large-scale text data from Twitter, Blogs, and News
  • Implements unigram, bigram, and trigram models
  • Uses Laplace smoothing to assign nonzero probability to rare or unseen word combinations
  • Applies an intelligent backoff mechanism for robust predictions
  • Provides real-time predictions via a user-friendly interface
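
The unigram, bigram, and trigram models listed above all rest on frequency counts of word sequences. The following is a minimal base-R sketch of that counting step, not the project's actual preprocessing: the helper names `tokenize` and `count_ngrams`, the cleaning rules, and the sample sentence are all assumptions made for illustration.

```r
# Hypothetical helpers for illustration; the project's real cleaning
# pipeline is not shown in the source.

# Lowercase the text, replace everything except letters, apostrophes and
# spaces with a space, then split on runs of spaces.
tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z' ]", " ", text)
  tokens <- strsplit(text, " +")[[1]]
  tokens[tokens != ""]
}

# Count every run of n consecutive tokens, most frequent first.
count_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(table(character(0)))
  starts <- seq_len(length(tokens) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  sort(table(grams), decreasing = TRUE)
}

# Assumed sample sentence, not taken from the Twitter, Blogs, or News data.
tokens   <- tokenize("I love R. I love data and I love Shiny!")
unigrams <- count_ngrams(tokens, 1)
bigrams  <- count_ngrams(tokens, 2)
trigrams <- count_ngrams(tokens, 3)
```

On a corpus of realistic size, a vectorised or package-based tokenizer would likely be needed; the loop here is kept simple so the counting logic stays visible.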

Statistical Modeling

Basic N-gram Probability

The conditional probability of a word \(w_n\) given the previous word \(w_{n-1}\) is defined as:

\[ P(w_n \mid w_{n-1}) = \frac{\text{count}(w_{n-1}, w_n)}{\text{count}(w_{n-1})} \]

Laplace Smoothing

To avoid zero probabilities for unseen word pairs, Laplace smoothing is applied:

\[ P(w_n \mid w_{n-1}) = \frac{\text{count}(w_{n-1}, w_n) + 1}{\text{count}(w_{n-1}) + V} \]

where:

- \(V\) is the vocabulary size
- The constant \(+1\) prevents zero probabilities
- The denominator is adjusted to preserve probability mass
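
To make the smoothed formula concrete, here is a small base-R sketch that evaluates it on toy counts. The function name `laplace_bigram_prob` and the counts are assumptions for illustration only, and \(V\) is taken to be the number of distinct words in the unigram table.

```r
# Toy counts, assumed for illustration; not taken from the project's corpus.
unigram_counts <- c(i = 3, love = 3, r = 1, data = 1)
bigram_counts  <- c("i love" = 3, "love r" = 1, "love data" = 2)

# Laplace-smoothed P(word | prev) = (count(prev, word) + 1) / (count(prev) + V)
laplace_bigram_prob <- function(prev, word, bigram_counts, unigram_counts) {
  V <- length(unigram_counts)  # vocabulary size
  pair <- paste(prev, word)
  pair_count <- if (pair %in% names(bigram_counts)) bigram_counts[[pair]] else 0
  prev_count <- if (prev %in% names(unigram_counts)) unigram_counts[[prev]] else 0
  (pair_count + 1) / (prev_count + V)
}

# Seen pair: (1 + 1) / (3 + 4)
p_seen <- laplace_bigram_prob("love", "r", bigram_counts, unigram_counts)

# Unseen pair still gets nonzero probability: (0 + 1) / (3 + 4)
p_unseen <- laplace_bigram_prob("love", "i", bigram_counts, unigram_counts)

# Summing over the whole vocabulary for one context gives 1 here, because
# count("love") equals the total count of bigrams that start with "love".
p_total <- sum(vapply(
  names(unigram_counts),
  function(w) laplace_bigram_prob("love", w, bigram_counts, unigram_counts),
  numeric(1)
))
```

The last check illustrates what "preserving probability mass" means: the \(V\) added to the denominator balances the \(+1\) added to each of the \(V\) possible next words.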

Backoff Strategy

- Primary: Try trigram model (last 2 words)
- Secondary: Fall back to bigram model (last word)
- Tertiary: Use unigram model (most frequent word)
- Final: Hard-coded default (“the”)
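
The four backoff tiers can be sketched in base R as follows. This is one possible reading of the strategy, not the project's code: `predict_next`, `best_match`, and the toy count vectors are hypothetical, and n-grams are assumed to be stored as named count vectors keyed by space-separated words.

```r
# Hypothetical sketch of the backoff tiers; names and data are assumptions.
predict_next <- function(input, trigrams, bigrams, unigrams, default = "the") {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[words != ""]
  n <- length(words)

  # Return the most frequent continuation of `context`, or NULL if none exists.
  best_match <- function(counts, context) {
    nm <- names(counts)
    if (is.null(nm)) return(NULL)
    prefix <- paste0(context, " ")
    hits <- counts[startsWith(nm, prefix)]
    if (length(hits) == 0) return(NULL)
    best <- names(hits)[which.max(hits)]
    substring(best, nchar(prefix) + 1)  # drop the context, keep the next word
  }

  # Primary: trigram model on the last two words
  if (n >= 2) {
    guess <- best_match(trigrams, paste(words[n - 1], words[n]))
    if (!is.null(guess)) return(guess)
  }
  # Secondary: bigram model on the last word
  if (n >= 1) {
    guess <- best_match(bigrams, words[n])
    if (!is.null(guess)) return(guess)
  }
  # Tertiary: most frequent unigram
  if (length(unigrams) > 0) return(names(unigrams)[which.max(unigrams)])
  # Final: hard-coded default
  default
}

# Toy counts, assumed for illustration only.
trigrams <- c("i love data" = 2, "i love r" = 1)
bigrams  <- c("love data" = 2, "love r" = 1, "of the" = 4)
unigrams <- c(to = 10, love = 3, data = 2)

from_trigram <- predict_next("I love", trigrams, bigrams, unigrams)
from_bigram  <- predict_next("lots of", trigrams, bigrams, unigrams)
from_unigram <- predict_next("hello world", trigrams, bigrams, unigrams)
from_default <- predict_next("hello", c(), c(), c())
```

The sketch ranks candidates by raw count. Within a single context this gives the same ordering as the Laplace-smoothed probability, since every candidate shares the same denominator.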

Shiny Application

Interactive Interface

- Real-time Prediction: Updates as user types
- Multiple Display Views: Prediction, candidates, model info
- Visual Analytics: Bar charts for top candidates
- Performance Metrics: Calculation speed and probability

Application Features

- Clean, responsive UI with thematic styling
- Tab-based navigation for information layers
- Dynamic candidate list with adjustable length
- Model statistics and technical details
- Educational components explaining methodology

Results and Applications

Prediction Performance

- Accuracy: Contextually appropriate word suggestions
- Speed: Real-time response (< 0.5 seconds)
- Robustness: Handles various input patterns
- Scalability: Processes inputs of varying lengths

Practical Applications

- Text Autocompletion: Assistive typing for mobile devices
- Writing Assistance: Next-word suggestions for content creation
- Language Learning: Exposure to common word patterns
- Accessibility: Aid for users with typing difficulties