Introduction

  • Built as part of the Coursera Data Science Capstone
  • Predicts the next word in a user-entered phrase
  • Uses the blogs text dataset

How the Algorithm Works

  • Text is cleaned and tokenized with tidytext
  • Unigram, bigram, and trigram frequency tables are built from the tokens
  • Prediction uses an n-gram model with backoff:
    • If no trigram matches → back off to the bigram table → then to the unigram table
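
The table-building step above can be sketched as follows. This is a minimal illustration, not the app's actual code: the three-sentence corpus stands in for the blogs dataset, and the column names (w1, w2, nxt) are assumptions chosen for this example.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy corpus standing in for the blogs dataset (hypothetical sample text)
corpus <- tibble(text = c(
  "I love data science",
  "I love data analysis",
  "I love coffee"
))

# Unigram counts; unnest_tokens() lowercases and strips punctuation by default
unigrams <- corpus %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

# Bigram counts, split into a one-word context (w1) and the following word (nxt)
bigrams <- corpus %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, c("w1", "nxt"), sep = " ") %>%
  count(w1, nxt, sort = TRUE)

# Trigram counts, split into a two-word context (w1, w2) and the following word
trigrams <- corpus %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  separate(trigram, c("w1", "w2", "nxt"), sep = " ") %>%
  count(w1, w2, nxt, sort = TRUE)
```

Splitting each n-gram into context columns and a next-word column up front means a later lookup only has to filter on the context and rank by the count column n.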

Prediction Model

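  • Frequency tables are precomputed and stored as RDS files, so the app loads compact tables instead of raw text
  • Fast lookup using dplyr filter() and slice_max()
  • Handles unseen input by falling back to shorter n-grams, ending with the most frequent unigram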
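
A lookup along these lines might look like the sketch below. It is an illustration under stated assumptions rather than the app's implementation: the toy tables, their column names (w1, w2, nxt, word, n), and the predict_next() helper are all hypothetical, and input cleaning is reduced to lowercasing and splitting on whitespace.

```r
library(dplyr)

# Hypothetical toy frequency tables; column names are assumptions for this sketch
trigrams <- tribble(
  ~w1, ~w2,    ~nxt,     ~n,
  "i", "love", "data",   2,
  "i", "love", "coffee", 1
)
bigrams <- tribble(
  ~w1,    ~nxt,      ~n,
  "love", "data",    2,
  "love", "coffee",  1,
  "data", "science", 3
)
unigrams <- tribble(
  ~word,  ~n,
  "the",  5,
  "love", 3,
  "data", 2
)

# Round trip through RDS, as a deployed app could do with precomputed tables
path <- tempfile(fileext = ".rds")
saveRDS(trigrams, path)
trigrams <- readRDS(path)

# Hypothetical helper: try the trigram table, then bigram, then unigram
predict_next <- function(phrase, k = 1) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  m <- length(words)

  if (m >= 2) {
    hit <- trigrams %>%
      filter(w1 == words[m - 1], w2 == words[m]) %>%
      slice_max(n, n = k, with_ties = FALSE)
    if (nrow(hit) > 0) return(hit$nxt)
  }
  if (m >= 1) {
    hit <- bigrams %>%
      filter(w1 == words[m]) %>%
      slice_max(n, n = k, with_ties = FALSE)
    if (nrow(hit) > 0) return(hit$nxt)
  }
  # Nothing matched (or empty input): return the most frequent unigram(s)
  unigrams %>%
    slice_max(n, n = k, with_ties = FALSE) %>%
    pull(word)
}

predict_next("I love")       # trigram hit: "data"
predict_next("really love")  # bigram backoff: "data"
predict_next("xyzzy")        # unigram fallback: "the"
```

This is plain backoff by highest count, with no discounting or smoothing; with_ties = FALSE keeps the result at exactly k rows when counts are tied.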

Shiny App Demo

Summary

  • Accurate, fast predictions with minimal resources
  • Real-time prediction from a model built on cleaned blog text
  • Could be extended into mobile keyboards or chat assistants