2025-08-13

Slide 1: Project Overview

Next Word Prediction Application

Objective: Build a predictive text application using natural language processing

Data Sources:

- English blogs corpus (200 MB)
- English news corpus (196 MB)
- English Twitter corpus (159 MB)

Deliverables:

- Prediction algorithm using n-gram models
- Interactive Shiny web application
- User-friendly text input interface

Slide 2: Algorithm Description

N-Gram Backoff Model

The prediction algorithm uses a hierarchical backoff approach:

  1. 4-Gram Model: Uses last 3 words to predict the 4th
    • Most specific, highest accuracy when data available
  2. 3-Gram Model: Uses last 2 words to predict the 3rd
    • Fallback when the 4-gram model has insufficient data
  3. 2-Gram Model: Uses last word to predict the 2nd
    • Less context, but broader coverage of general patterns
  4. Unigram Model: Returns most frequent words
    • Final fallback for unknown contexts

Key Features:

- Frequency-based ranking
- Smoothing for unseen combinations
- Efficient data.table lookups
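
For illustration, a minimal sketch of this backoff in R with data.table; the names ngram_tables, prefix, word, and freq are assumptions for the example, not the app's actual code:

library(data.table)

# ngram_tables: list indexed by order n; each table has columns
# prefix (the n-1 preceding words), word (the candidate), freq (its count),
# and is keyed by prefix. ngram_tables[[1]] holds plain unigram counts.
predict_next_word <- function(words, ngram_tables, top_n = 5) {
  for (n in 4:2) {
    if (length(words) >= n - 1) {
      prefix_str <- paste(tail(words, n - 1), collapse = " ")
      hits <- ngram_tables[[n]][.(prefix_str), nomatch = NULL]  # keyed lookup
      if (nrow(hits) > 0) {
        return(head(hits[order(-freq), word], top_n))  # rank by frequency
      }
    }
  }
  head(ngram_tables[[1]][order(-freq), word], top_n)   # unigram fallback
}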

Slide 3: Application Description

Shiny Web Application

User Interface Components:

- Text input area for user sentences
- “Predict Next Word” button
- Results display with top 5 predictions
- Algorithm explanation sidebar

Functionality:

- Real-time text processing
- Instant prediction generation
- User-friendly instructions
- Educational content about the algorithm

Technical Implementation:

- R Shiny framework
- quanteda for text processing
- data.table for fast lookups
- Responsive design
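
To make the wiring concrete, here is a minimal sketch of such an app, reusing the hypothetical predict_next_word() from Slide 2; widget IDs and layout are invented for this example, not the app's actual code:

library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  sidebarLayout(
    sidebarPanel(
      helpText("Type a phrase, click the button, and the app suggests",
               "the five most likely next words from an n-gram model.")
    ),
    mainPanel(
      textAreaInput("user_text", "Your sentence:", placeholder = "I love to read"),
      actionButton("predict_btn", "Predict Next Word"),
      tableOutput("predictions")
    )
  )
)

server <- function(input, output) {
  output$predictions <- renderTable({
    input$predict_btn                                  # re-run on each click
    words <- isolate(strsplit(tolower(trimws(input$user_text)), "\\s+")[[1]])
    req(length(words) > 0)
    preds <- predict_next_word(words, ngram_tables)
    data.frame(rank = seq_along(preds), word = preds)
  })
}

shinyApp(ui, server)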

Slide 4: How to Use the App

Step-by-Step Instructions

  1. Launch the Application:

    library(shiny)
    runApp("app.R")
  2. Enter Your Text:

    • Type a sentence in the text input box
    • Example: “I love to read”
  3. Get Predictions:

    • Click “Predict Next Word” button
    • View top 5 suggested next words
    • Select the most appropriate option
  4. Understand Results:

    • Predictions ranked by frequency
    • Based on training data patterns
    • Context-aware suggestions

Example Usage:

- Input: “The weather is”
- Predictions: “nice”, “good”, “bad”, “sunny”, “cold”
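
In code, this interaction reduces to a single call to the prediction helper (predict_next_word and ngram_tables are from the hypothetical sketch on Slide 2; the output is illustrative, not a recorded run):

predict_next_word(c("the", "weather", "is"), ngram_tables)
# [1] "nice"  "good"  "bad"   "sunny" "cold"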

Slide 5: Technical Implementation

Data Processing Pipeline

Training Phase:

# Load and clean the text data
library(quanteda)

corp <- corpus(raw_text)   # raw_text: character vector of blog/news/Twitter documents
toks <- tokens(corp, remove_punct = TRUE,
               remove_numbers = TRUE)

# Build n-gram frequency tables (build_ngrams is the app's helper, sketched below)
ngrams <- build_ngrams(toks)
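
Since build_ngrams is the app's own helper, here is a minimal sketch of one way to write it, assuming quanteda tokens in and a keyed data.table out (all names are illustrative):

library(data.table)
library(quanteda)

build_ngrams <- function(toks, n) {
  # Form n-grams joined by spaces ("the weather is"), then count them
  ng <- tokens_ngrams(toks, n = n, concatenator = " ")
  dt <- data.table(ngram = unlist(ng, use.names = FALSE))[, .(freq = .N), by = ngram]
  # For n >= 2, split each n-gram into an (n-1)-word prefix and the final word
  dt[, prefix := sub(" \\S+$", "", ngram)]
  dt[, word := sub("^.* ", "", ngram)]
  setkey(dt, prefix)   # keyed so prediction-time lookups are binary searches
  dt[, .(prefix, word, freq)]
}

This would be called once per order at training time, e.g. lapply(2:4, function(n) build_ngrams(toks, n)); unigrams need only the raw frequency counts.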

Prediction Phase:

# Clean the user input the same way as the training data
input_toks <- tokens(tolower(user_text), remove_punct = TRUE,
                     remove_numbers = TRUE)

# Apply the backoff algorithm (predict_next_word is the app's helper)
predictions <- predict_next_word(as.character(input_toks), ngrams)

Performance Optimizations:

- data.table for fast keyed lookups
- Pre-computed frequency tables
- Efficient string matching
- Memory-optimized data structures
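
To make the lookup claim concrete: once a table is keyed, data.table resolves each prefix by binary search instead of scanning every row (the table below is illustrative):

library(data.table)

tri <- data.table(prefix = c("the weather is", "the weather was"),
                  word   = c("nice", "bad"),
                  freq   = c(42L, 17L))
setkey(tri, prefix)          # sort once; later lookups are O(log n)

tri[.("the weather is")]     # binary-search join, no full table scan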

Future Enhancements:

- Larger training datasets
- Advanced smoothing techniques
- Context-aware predictions
- User feedback integration