Project Overview

  • Objective: Build a Shiny app that predicts the next word in a sentence.
  • Data Source: English text corpora from blogs, news, and Twitter (SwiftKey dataset).
  • Tools: R, Shiny, stringr, tm, tidytext
  • Final Product: A lightweight app using a basic N-gram model.

Data Processing

  • Downloaded and unzipped the corpus files.
  • Sampled a small portion of each file to keep processing time and memory use manageable.
  • Cleaned the text:
    • Converted to lowercase; removed punctuation, numbers, and profanity.
  • Tokenized into:
    • Bigrams (2-word sequences)
    • Trigrams (3-word sequences)
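
The cleaning and tokenizing steps above could look roughly like the following. This is a minimal sketch in base R, not the project's actual code: it does not use the tm or tidytext packages listed under Tools, and the names clean_text, make_ngrams, and the one-word profanity list are hypothetical placeholders.

```r
# Sketch only: base R stand-ins for the cleaning and tokenizing steps.
# clean_text, make_ngrams, and the profanity list are hypothetical names.

clean_text <- function(lines, profanity = character(0)) {
  x <- tolower(lines)
  x <- gsub("'", "", x)            # "don't" becomes "dont"
  x <- gsub("[^a-z ]", " ", x)     # drop punctuation, digits, other symbols
  lapply(strsplit(x, " +"), function(w) {
    w <- w[nzchar(w)]              # discard empty tokens
    w[!(w %in% profanity)]         # discard listed words
  })
}

# Count every run of n consecutive words, most frequent first.
make_ngrams <- function(tokens, n) {
  grams <- lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts,
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  sort(table(as.character(unlist(grams))), decreasing = TRUE)
}

sample_lines <- c("I love you!", "I love R 100%.", "You darn love R")
tokens   <- clean_text(sample_lines, profanity = c("darn"))
bigrams  <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)

bigrams["i love"]      # count 2
trigrams["i love you"] # count 1
```

Returning a sorted frequency table keeps the later lookup simple: the most common continuation of a prefix is the first matching entry.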

Prediction Algorithm

  • N-gram backoff strategy:
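    • If the last two words of the input start a known trigram, predict that trigram's third word.
    • If not, fall back to bigrams that start with the last word.
    • If that also fails, return the most frequent word in the sample.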
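  • Example:
    Input: "I love" → look for trigrams starting with "I love", e.g. "I love you", and predict "you".
    If none is found: look for bigrams starting with "love", e.g. "love you"; if that also fails, return the most frequent word.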
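
The backoff order above can be sketched in base R as follows. The counts, the fallback word "the", and the function names best_completion and predict_next are all made up for illustration; the real app would build its tables from the sampled corpus.

```r
# Sketch only: toy counts and hypothetical names, not the app's real data.
trigrams <- c("i love you" = 3, "i love r" = 1)
bigrams  <- c("love you" = 4, "thank you" = 2)
top_word <- "the"   # assumed most frequent word, for illustration

# Last word of the most frequent n-gram starting with `prefix`, or NA.
best_completion <- function(counts, prefix) {
  hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  best <- names(hits)[which.max(hits)]
  sub(".* ", "", best)   # keep only the final word
}

predict_next <- function(input) {
  w <- strsplit(tolower(trimws(input)), " +")[[1]]
  w <- w[nzchar(w)]
  n <- length(w)
  if (n >= 2) {          # 1. trigram lookup on the last two words
    guess <- best_completion(trigrams, paste(w[n - 1], w[n]))
    if (!is.na(guess)) return(guess)
  }
  if (n >= 1) {          # 2. bigram lookup on the last word
    guess <- best_completion(bigrams, w[n])
    if (!is.na(guess)) return(guess)
  }
  top_word               # 3. fall back to the most frequent word
}

predict_next("I love")       # "you" (trigram match)
predict_next("we love")      # "you" (bigram fallback)
predict_next("many thanks")  # "the" (most frequent word)
```

This version picks the single highest count at each level and applies no discounting, so it is plain backoff rather than Katz backoff.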

Shiny App Overview

Future Improvements

  • Improve prediction accuracy with:
    • Smoothing techniques (e.g., Katz backoff)
    • Larger training sample
    • POS tagging or deep learning
  • Show the top 3 predictions instead of a single word
  • Mobile optimization