Last project

2025-12-31

Project Overview

This project develops a predictive text system based on statistical language modeling, designed to suggest the next word in a sequence efficiently and accurately.

The system uses n-gram models (unigrams, bigrams, and trigrams) combined with a backoff strategy and Laplace smoothing to handle unseen word combinations. It is implemented in R, with an interactive Shiny web application for real-time predictions.

Key Features

- Processes large-scale text data from Twitter, Blogs, and News
- Implements unigram, bigram, and trigram models
- Uses Laplace smoothing to manage rare or unseen word combinations
- Applies an intelligent backoff mechanism for robust predictions
- Provides real-time predictions via a user-friendly interface
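As an illustration of how the unigram, bigram, and trigram counts might be built, here is a minimal base R sketch. The toy corpus and the helper names `tokenize`, `make_ngrams`, and `count_ngrams` are assumptions for this example, not the project's actual code, and the cleaning rules are deliberately simple.

```r
# Toy corpus standing in for the Twitter, Blogs, and News text (hypothetical data).
corpus <- c("I love data science",
            "I love R programming",
            "data science is fun")

# Lowercase, drop everything except letters and spaces, split on spaces.
tokenize <- function(line) {
  cleaned <- gsub("[^a-z ]", "", tolower(line))
  words <- strsplit(cleaned, " +")[[1]]
  words[words != ""]
}

# Build all n-grams of order n from one token vector.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams across all lines, most frequent first.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(l) make_ngrams(tokenize(l), n)))
  sort(table(grams), decreasing = TRUE)
}

unigrams <- count_ngrams(corpus, 1)
bigrams  <- count_ngrams(corpus, 2)
trigrams <- count_ngrams(corpus, 3)
```

N-grams are built per line here, so no n-gram spans two separate documents. A full-size corpus would likely need a dedicated text-mining package rather than this loop-based approach.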

Statistical Modeling

Basic N-gram Probability

The conditional probability of a word \(w_n\) given the previous word \(w_{n-1}\) is defined as:

\[ P(w_n \mid w_{n-1}) = \frac{\text{count}(w_{n-1}, w_n)}{\text{count}(w_{n-1})} \]

Laplace Smoothing

To avoid zero probabilities for unseen word pairs, Laplace smoothing is applied:

\[ P(w_n \mid w_{n-1}) = \frac{\text{count}(w_{n-1}, w_n) + 1}{\text{count}(w_{n-1}) + V} \]

where:

- \(V\) is the vocabulary size
- The constant \(+1\) prevents zero probabilities
- Adding \(V\) to the denominator keeps the smoothed probabilities summing to 1 over the vocabulary
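The smoothed estimate can be checked numerically with a small base R sketch. The counts below are made up for illustration, and `laplace_prob` is a hypothetical helper name, not a function from the project.

```r
# Hypothetical counts for illustration; not taken from the project data.
unigram_counts <- c(i = 2, love = 2, data = 2, science = 2,
                    r = 1, programming = 1, is = 1, fun = 1)
bigram_counts <- c("i love" = 2, "love data" = 1, "love r" = 1,
                   "data science" = 2, "r programming" = 1,
                   "science is" = 1, "is fun" = 1)

# Laplace-smoothed P(word | prev) = (count(prev, word) + 1) / (count(prev) + V).
laplace_prob <- function(prev, word, unigram_counts, bigram_counts) {
  V <- length(unigram_counts)  # vocabulary size
  pair <- paste(prev, word)
  pair_count <- if (pair %in% names(bigram_counts)) bigram_counts[[pair]] else 0
  prev_count <- if (prev %in% names(unigram_counts)) unigram_counts[[prev]] else 0
  (pair_count + 1) / (prev_count + V)
}

# Seen pair: (1 + 1) / (2 + 8). Unseen pair: (0 + 1) / (2 + 8).
p_seen   <- laplace_prob("love", "data", unigram_counts, bigram_counts)
p_unseen <- laplace_prob("love", "fun",  unigram_counts, bigram_counts)
```

With these toy counts the unseen pair "love fun" still receives a non-zero probability, and summing `laplace_prob("love", w, ...)` over all eight vocabulary words gives 1.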

Backoff Strategy

- Primary: Try trigram model (last 2 words)
- Secondary: Fall back to bigram model (last word)
- Tertiary: Use unigram model (most frequent word)
- Final: Hard-coded default (“the”)
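The four tiers above could be sketched in base R as follows. The lookup tables and the function name `predict_next` are assumptions for this example. A real implementation would derive the tables from the n-gram counts and might rank several candidates rather than return a single word.

```r
# Hypothetical lookup tables: each maps a context to its most likely next word.
trigram_best <- c("i love" = "data", "data science" = "is")
bigram_best  <- c("love" = "data", "science" = "is", "is" = "fun")
unigram_best <- "i"  # assumed most frequent word in the toy data

predict_next <- function(text, tri = trigram_best, bi = bigram_best,
                         uni = unigram_best) {
  tokens <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  tokens <- tokens[tokens != ""]
  n <- length(tokens)

  # Primary: trigram model keyed on the last two words.
  if (n >= 2) {
    key <- paste(tokens[n - 1], tokens[n])
    if (key %in% names(tri)) return(tri[[key]])
  }
  # Secondary: bigram model keyed on the last word.
  if (n >= 1 && tokens[n] %in% names(bi)) {
    return(bi[[tokens[n]]])
  }
  # Tertiary: most frequent unigram, if one is available.
  if (length(uni) > 0) return(uni[[1]])
  # Final: hard-coded default.
  "the"
}
```

For example, `predict_next("I love")` is answered by the trigram table, `predict_next("it is")` falls back to the bigram table, and an input with no known context falls through to the unigram tier. The default "the" is only reached when no unigram is available.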

Shiny Application

Interactive Interface

- Real-time Prediction: Updates as user types
- Multiple Display Views: Prediction, candidates, model info
- Visual Analytics: Bar charts for top candidates
- Performance Metrics: Calculation speed and probability

Application Features

- Clean, responsive UI with thematic styling
- Tab-based navigation for information layers
- Dynamic candidate list with adjustable length
- Model statistics and technical details
- Educational components explaining methodology
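A minimal Shiny sketch of this kind of interface is shown below, assuming the `shiny` package is installed. The function `top_candidates` is a hypothetical stand-in that returns made-up scores and ignores its input. The layout, input IDs, and tab names are likewise assumptions, not the project's actual application.

```r
library(shiny)

# Hypothetical stand-in for the project's model: returns up to k candidate
# words with made-up scores, ignoring the input text.
top_candidates <- function(text, k) {
  candidates <- data.frame(word = c("the", "to", "and", "a", "of"),
                           score = c(0.30, 0.25, 0.20, 0.15, 0.10))
  head(candidates, k)
}

ui <- fluidPage(
  titlePanel("Next-word prediction (sketch)"),
  sidebarLayout(
    sidebarPanel(
      textInput("text", "Type a phrase:"),
      sliderInput("k", "Number of candidates:", min = 1, max = 5, value = 3)
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("Prediction", textOutput("best")),
        tabPanel("Candidates", plotOutput("chart"))
      )
    )
  )
)

server <- function(input, output) {
  # Re-evaluated whenever the text or the slider value changes.
  candidates <- reactive(top_candidates(input$text, input$k))

  output$best <- renderText(candidates()$word[1])
  output$chart <- renderPlot(
    barplot(candidates()$score, names.arg = candidates()$word,
            ylab = "Score")
  )
}

app <- shinyApp(ui, server)
```

Calling `runApp(app)` in an interactive session would launch the sketch. The reactive expression is what makes the prediction and the bar chart update as the user types or moves the slider.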

Results and Applications

Prediction Performance

- Accuracy: Contextually appropriate word suggestions
- Speed: Real-time response (< 0.5 seconds)
- Robustness: Handles various input patterns
- Scalability: Processes inputs of varying lengths
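One way the response-time target might be checked is to time individual prediction calls with `system.time`. The sketch below uses a hypothetical stand-in predictor, `predict_stub`, so the timings it produces say nothing about the real model.

```r
# Hypothetical stand-in predictor; replace with the real prediction function.
predict_stub <- function(text) {
  lookup <- c("i love" = "data", "love" = "data")
  key <- tolower(trimws(text))
  if (key %in% names(lookup)) lookup[[key]] else "the"
}

inputs <- c("I love", "love", "something unseen")

# Time each call separately and keep the elapsed seconds per input.
timings <- vapply(inputs, function(x) {
  system.time(predict_stub(x))[["elapsed"]]
}, numeric(1))

# Compare the slowest call against the 0.5 second target.
worst_case <- max(timings)
meets_target <- worst_case < 0.5
```

Single calls are often too fast for `system.time` to resolve, so repeating each call many times and averaging would give a steadier estimate.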

Practical Applications

- Text Autocompletion: Assistive typing for mobile devices
- Writing Assistance: Next-word suggestions for content creation
- Language Learning: Exposure to common word patterns
- Accessibility: Aid for users with typing difficulties