26 March, 2025

Overview

  • This project is part of the Data Science Capstone, the final course in the Coursera Data Science Specialization. The goal is to analyze a large text corpus, identify common word and phrase patterns, and build a next-word prediction system.
  • Key Components:
    • Text Mining: Examining word frequencies and relationships within a large corpus.
    • Predictive Modeling: Constructing n-gram models for next-word prediction.
    • Interactive App: Designing a user-friendly Shiny application for real-time predictions.

Methodology

  1. Data Preparation:
    • Removed inappropriate language and tokenized text for analysis.
  2. Exploratory Analysis:
    • Investigated word and phrase frequencies to identify common patterns.
  3. Model Development:
    • Built n-gram models of every order from 2 to 7, so a prediction can draw on up to six preceding words of context.
  4. Prediction Algorithm:
    • Used Katz’s back-off model for word prediction.
    • The system checks n-grams from the highest order to the lowest (7-gram down to 2-gram) and predicts from the first order that has a match.
    • If no order has a match, it defaults to the most common word (“the”).
    • Reduced model size and lookup time by excluding rare n-grams (those occurring fewer than 5 times).
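
The back-off lookup described above can be sketched in a few lines. This is only an illustration in Python, not the project's actual code: the function names (tokenize, build_tables, predict), the regex tokenizer, and the `banned` word set are all assumptions made for the example. It ranks candidates by raw frequency and leaves out the probability discounting that a full Katz back-off model applies.

```python
import re
from collections import Counter, defaultdict


def tokenize(text, banned=frozenset()):
    """Lowercase the text, split it into word tokens, drop tokens in `banned`.

    `banned` stands in for a profanity list (hypothetical; none is bundled here).
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in banned]


def build_tables(sentences, max_n=7, min_count=5, banned=frozenset()):
    """Map each context (a tuple of 1 to max_n - 1 words) to next-word counts."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence, banned)
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1

    tables = defaultdict(Counter)
    for gram, count in counts.items():
        if count >= min_count:  # exclude n-grams seen fewer than min_count times
            tables[gram[:-1]][gram[-1]] = count
    return tables


def predict(tables, query, k=3, max_n=7, default="the"):
    """Return up to k next-word suggestions, most frequent first."""
    tokens = tokenize(query)
    # Start with the longest usable context, then back off to shorter ones.
    for ctx_len in range(min(max_n - 1, len(tokens)), 0, -1):
        context = tuple(tokens[-ctx_len:])
        if context in tables:
            return [word for word, _ in tables[context].most_common(k)]
    return [default]  # no n-gram of any order matched


# Toy corpus; each sentence is repeated so that its n-grams survive pruning.
corpus = ["i love new york"] * 5 + ["i love new shoes"] * 6 + ["i love you"] * 2
tables = build_tables(corpus)
print(predict(tables, "I love new", k=2))  # ['shoes', 'york']
print(predict(tables, "brand new", k=1))   # ['shoes'] after backing off to "new"
print(predict(tables, "zebra"))            # ['the']
```

In the toy corpus, "love you" occurs only twice, so it is pruned and the query "love you" falls through to the default word. The parameter `k` plays the role of the adjustable number of suggestions.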

Key Outcomes

  • Detailed documentation on data processing and model construction is available on GitHub.
  • Interactive Prediction Tool:
    • Access the live Shiny app here.
    • Features:
      1. Text query entered by the user.
      2. Adjustable number of word suggestions.
    • Output: Predictions ranked by usage frequency.

Sources & Tools