Project Overview

Develop a text application that predicts the next word using N-gram language models.

Dataset Sources

  • Blogs
  • News
  • Twitter

Objective

Provide fast and accurate next-word predictions through an interactive Shiny application.

Exploratory Data Analysis

Dataset Summary

Source     Lines
Blogs      899,288
News       1,010,206
Twitter    2,360,148

Key Findings

  • Word frequencies follow Zipf’s Law.
  • Frequent bigrams and trigrams improve prediction accuracy.
  • A small share of the vocabulary covers most of the text.
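
The coverage finding can be checked by counting how many of the most frequent words are needed to reach a given share of all word occurrences. The following is a minimal Python sketch on a toy corpus; the function name `coverage_vocab_size` and the sample text are illustrative assumptions, not part of the project code.

```python
from collections import Counter

def coverage_vocab_size(tokens, target=0.5):
    """Number of most-frequent word types needed to cover `target` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    # Walk the vocabulary from most to least frequent, accumulating coverage.
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return rank
    return len(counts)

# Toy corpus (hypothetical): 13 tokens, 8 distinct words.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
half = coverage_vocab_size(tokens, 0.5)
ninety = coverage_vocab_size(tokens, 0.9)
print(half, ninety)
```

On a Zipf-distributed corpus the count returned for 50% coverage is typically a very small part of the full vocabulary, which is what makes pruning rare N-grams attractive.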

Prediction Algorithm

Model

  • Unigram
  • Bigram
  • Trigram
  • Backoff Strategy

Flow

Input → Trigram → Bigram → Unigram → Prediction
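
The flow above can be sketched as a lookup that tries the longest context first and falls back to shorter ones. This is a simplified Python illustration that assumes plain counts with no smoothing or discounting; `train`, `predict`, and the toy sentences are hypothetical names and data, not the application's actual implementation.

```python
from collections import Counter, defaultdict

def train(sentences):
    """Count next words for contexts of length 0 (unigram), 1 (bigram) and 2 (trigram)."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            for n in range(3):
                if i >= n:
                    context = tuple(words[i - n:i])  # the n words before `word`
                    model[context][word] += 1
    return model

def predict(model, phrase):
    """Return the most frequent next word, backing off trigram -> bigram -> unigram."""
    words = phrase.lower().split()
    for n in (2, 1, 0):
        if len(words) < n:
            continue
        context = tuple(words[len(words) - n:])  # last n words; empty tuple when n == 0
        candidates = model.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # only reached when the model is empty

# Toy training data (hypothetical).
model = train([
    "one of the best",
    "one of the few",
    "going to be late",
    "thank you for coming",
    "thank you for the help",
])
print(predict(model, "one of"))     # trigram context found
print(predict(model, "hello you"))  # unseen pair, backs off to the bigram context
print(predict(model, "xyz"))        # unseen word, backs off to the unigram counts
```

Storing every order in one dictionary keyed by context tuples keeps the lookup to at most three hash accesses per prediction.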

Shiny Application

Features

  • The user enters a phrase.
  • The app predicts the next word.
  • The response is returned in real time.

Example

  • one of → the
  • going to → be
  • thank you → for

Results and Conclusion

Performance

  • Accuracy: 100%
  • Runtime: < 0.01 sec
  • Model Size: 8.56 MB
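
One way figures like these could be measured is top-1 next-word accuracy and mean time per prediction over held-out sentences. The Python sketch below is an assumption about a possible evaluation setup, not the procedure behind the numbers above; `evaluate`, `toy_predict`, and the held-out sentences are hypothetical.

```python
import time

def evaluate(predict_fn, sentences):
    """Return (top-1 accuracy, mean seconds per prediction) over every word position."""
    hits = total = 0
    elapsed = 0.0
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(1, len(words)):
            start = time.perf_counter()
            guess = predict_fn(" ".join(words[:i]))  # predict word i from the words before it
            elapsed += time.perf_counter() - start
            hits += guess == words[i]
            total += 1
    if total == 0:
        return 0.0, 0.0
    return hits / total, elapsed / total

def toy_predict(phrase):
    """Hypothetical stand-in predictor: looks up the last word in a fixed table."""
    table = {"of": "the", "to": "be", "you": "for"}
    return table.get(phrase.split()[-1])

held_out = ["one of the", "thank you very"]  # hypothetical held-out text
accuracy, mean_seconds = evaluate(toy_predict, held_out)
print(f"accuracy={accuracy:.2f}, mean time={mean_seconds:.6f} s")
```

Scoring on text that was excluded from training matters here, because accuracy measured on phrases the model has already seen will overstate real-world performance.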

Future Work

  • 4-gram models
  • Better smoothing
  • Larger datasets