Introduction

  • This project demonstrates a next-word prediction app.
  • Users input a phrase, and the app predicts the most likely next word.
  • Built as part of the Data Science Capstone Project using the SwiftKey dataset.
  • Goal: Mimic real-world mobile keyboard prediction.

Data and Preprocessing

  • Data sourced from blogs, news, and Twitter (en_US).
  • Steps taken:
    • Lowercasing, punctuation & number removal.
    • Profanity filtering.
    • Tokenization and n-gram (unigram to trigram) creation.
  • A random sample of the corpus was used to reduce computational load.
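
The preprocessing steps above can be sketched roughly as follows. This is a minimal Python illustration, not the app's actual code: the function names (`sample_lines`, `preprocess`, `ngrams`, `count_ngrams`), the sampling fraction, and the tiny `PROFANITY` set are all assumptions made for the example, since the slides do not name the word list or sample size used.

```python
import random
import re
from collections import Counter

# Hypothetical placeholder list; the slides do not say which profanity list was used.
PROFANITY = {"darn", "heck"}


def sample_lines(lines, fraction, seed=42):
    """Return a reproducible random subset of the input lines."""
    rng = random.Random(seed)
    k = int(len(lines) * fraction)
    return rng.sample(lines, k)


def preprocess(text):
    """Lowercase, strip punctuation and digits, drop profane tokens, tokenize."""
    text = text.lower()
    # Keep only letters and whitespace. This also removes apostrophes,
    # which is a simplification ("don't" becomes "don" and "t").
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok not in PROFANITY]


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, max_n=3):
    """Count unigrams up to max_n-grams over an iterable of raw text lines."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        tokens = preprocess(line)
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts
```

For example, `count_ngrams(["I love data", "I love R"])` yields a bigram table in which `("i", "love")` has a count of 2.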

Algorithm Overview

  • Uses n-gram language modeling (mainly bigram & trigram).
  • Stupid backoff algorithm:
    • Try trigram → back off to bigram → back off to unigram.
  • Predictions ranked by frequency of occurrence.
  • Fast, lightweight, and interpretable.
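
The back-off idea above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the app's implementation: `build_counts` and `predict_next` are hypothetical names, and the discount `ALPHA = 0.4` is the value commonly quoted for Stupid Backoff, not one confirmed by these slides. The linear scan over n-gram counts is kept for readability; a real app would more likely index n-grams by their context.

```python
from collections import Counter

ALPHA = 0.4  # assumed back-off discount; not stated in the slides


def build_counts(sentences, max_n=3):
    """Count n-grams (as tuples) of order 1..max_n over tokenized sentences."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in sentences:
        for n in counts:
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def _rank(scored, top_k):
    """Sort by descending score, breaking ties alphabetically."""
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:top_k]


def predict_next(counts, phrase_tokens, top_k=3, alpha=ALPHA):
    """Return up to top_k (word, score) pairs using Stupid Backoff scoring."""
    # Start at the highest order the phrase can support (trigram needs 2 words).
    start = min(max(counts), len(phrase_tokens) + 1)
    penalty = 1.0
    for n in range(start, 1, -1):
        context = tuple(phrase_tokens[-(n - 1):])
        candidates = {
            gram[-1]: c for gram, c in counts[n].items() if gram[:-1] == context
        }
        if candidates:
            context_count = counts[n - 1][context]
            scored = [(w, penalty * c / context_count) for w, c in candidates.items()]
            return _rank(scored, top_k)
        penalty *= alpha  # discount each time we back off to a shorter context
    # Last resort: most frequent unigrams, scored by relative frequency.
    total = sum(counts[1].values())
    scored = [(gram[0], penalty * c / total) for gram, c in counts[1].items()]
    return _rank(scored, top_k)
```

With counts built from a handful of tokenized sentences, a call such as `predict_next(counts, ["i", "love"])` tries the trigram context first and falls through to shorter contexts only when no match exists.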

Shiny App: How It Works

Final Notes

  • The app demonstrates a basic NLP pipeline.
  • Can be expanded with:
    • Deep learning (e.g., LSTMs or transformers)
    • User personalization
  • Clean UI for non-technical users.
  • [Insert RPubs slide link if needed]

Thank you!