Capstone Final Project – Next Word Predictor

Project Overview

Objective: Build a Shiny app that predicts the next word in a sentence.
Data Source: English text corpora from blogs, news, and Twitter (SwiftKey dataset).
Tools: R, Shiny, stringr, tm, tidytext
Final product: A lightweight app using a basic N-gram model.

Data Processing

Downloaded and unzipped the corpus files.
Sampled a small portion of each file to keep processing time and memory use manageable.
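
The sampling step could look roughly like the sketch below. The helper name `sample_lines`, the 5% fraction, and the fixed seed are illustrative assumptions; the actual sampling rate used for the project is not stated here. A temporary file stands in for a corpus file so the example runs on its own.

```r
# Sketch: keep a random fraction of the lines in one text file.
# `sample_lines`, `fraction`, and `seed` are hypothetical names and values.
sample_lines <- function(path, fraction = 0.01, seed = 42) {
  con <- file(path, open = "r", encoding = "UTF-8")
  on.exit(close(con))
  lines <- readLines(con, warn = FALSE, skipNul = TRUE)
  if (length(lines) == 0) return(character(0))
  set.seed(seed)                                   # reproducible sample
  n_keep <- max(1, floor(length(lines) * fraction))
  lines[sort(sample.int(length(lines), n_keep))]   # keep original order
}

# Usage with a throwaway file standing in for a corpus file.
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:1000), tmp)
sampled <- sample_lines(tmp, fraction = 0.05)
length(sampled)
```

Fixing the seed makes the sample reproducible, which helps when comparing model changes on the same subset.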
Cleaned the text:
- Converted to lowercase; removed punctuation, numbers, and profanity.
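
A minimal base R sketch of these cleaning steps follows. The function name `clean_text` and the `profanity` argument are hypothetical, and the word list is a placeholder; the project's Tools list mentions `tm` and `stringr`, which may have been used instead.

```r
# Sketch of the cleaning step using base R regular expressions.
# `clean_text` and `profanity` are hypothetical; the real word list is not shown.
clean_text <- function(x, profanity = character(0)) {
  x <- tolower(x)                           # lowercase
  x <- gsub("[[:punct:]]+", " ", x)         # remove punctuation
  x <- gsub("[[:digit:]]+", " ", x)         # remove numbers
  if (length(profanity) > 0) {
    # assumes the list contains plain words without regex metacharacters
    pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE) # remove listed words
  }
  x <- gsub("\\s+", " ", x)                 # collapse repeated whitespace
  trimws(x)
}

cleaned <- clean_text("I LOVE 2 darn Cats!!", profanity = "darn")
cleaned
```

Replacing removed characters with a space and collapsing whitespace afterwards avoids gluing neighbouring words together.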
Tokenized into:
- Bigrams (2-word)
- Trigrams (3-word)
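
The tokenization could be sketched as below. The Tools list includes tidytext, whose `unnest_tokens()` supports `token = "ngrams"`; this sketch uses base R only so that it runs without extra packages. The helper names `make_ngrams` and `count_ngrams` and the toy sentences are assumptions for illustration.

```r
# Sketch: build n-gram frequency tables from cleaned lines, base R only.
# `make_ngrams` and `count_ngrams` are hypothetical helper names.
make_ngrams <- function(line, n) {
  words <- strsplit(line, "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

count_ngrams <- function(lines, n) {
  grams <- as.character(unlist(lapply(lines, make_ngrams, n = n)))
  counts <- sort(table(grams), decreasing = TRUE)  # most frequent first
  data.frame(ngram = names(counts),
             freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy input standing in for the cleaned corpus sample.
toy_lines <- c("i love you", "i love you so much", "i love cats")
bigram_counts  <- count_ngrams(toy_lines, 2)
trigram_counts <- count_ngrams(toy_lines, 3)
head(trigram_counts)
```

N-grams are built per line, so no n-gram spans two unrelated sentences from different lines.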

Prediction Algorithm

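N-gram backoff strategy:
- If the last two words of the input start a known trigram, predict that trigram's most frequent third word.
- If not, fall back to bigrams and match on the last word only.
- If not, return the most frequent word overall.
Example:
Input: "I love" → the trigram "I love you" matches, so predict "you".
If no trigram starts with "I love": look up bigrams starting with "love" (e.g., "love you"); if none exist, return the most frequent word.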
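
The backoff lookup can be sketched as follows. The toy frequency tables, the helper names `best_match` and `predict_next`, and the fallback word "the" are all assumptions for illustration, not the app's actual data or code.

```r
# Sketch of the trigram -> bigram -> most-frequent-word backoff.
# All tables and names below are hypothetical toy data.
toy_trigrams <- data.frame(
  prefix = c("i love", "i love", "love you"),
  word   = c("you", "cats", "so"),
  freq   = c(2, 1, 1),
  stringsAsFactors = FALSE)
toy_bigrams <- data.frame(
  prefix = c("love", "you", "thank"),
  word   = c("you", "so", "you"),
  freq   = c(2, 1, 3),
  stringsAsFactors = FALSE)
top_word <- "the"  # assumed most frequent single word

# Most frequent continuation for a prefix, or NA if the prefix is unseen.
best_match <- function(tbl, prefix) {
  hits <- tbl[tbl$prefix == prefix, ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$word[which.max(hits$freq)]
}

predict_next <- function(input) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)
  if (n >= 2) {  # try the last two words against the trigram table
    guess <- best_match(toy_trigrams, paste(words[(n - 1):n], collapse = " "))
    if (!is.na(guess)) return(guess)
  }
  if (n >= 1) {  # back off to the last word against the bigram table
    guess <- best_match(toy_bigrams, words[n])
    if (!is.na(guess)) return(guess)
  }
  top_word       # nothing matched: most frequent word
}

predict_next("I love")          # trigram match on "i love"
predict_next("we really love")  # no trigram; bigram match on "love"
predict_next("hello world")     # no match; falls back to top_word
```

Storing each n-gram as a prefix plus a final word keeps the lookup to a simple equality filter on the prefix column.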

Shiny App Overview

URL: https://nimmi1994.shinyapps.io/NextWordPredictor/
UI:
- Text input for user phrase
- “Predict” button
Output:
- Shows most likely next word
Simple interface with a fast response time.
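
A minimal sketch of an app with this layout is shown below. It is not the deployed app's source: the input and output IDs (`phrase`, `predict`, `next_word`) are assumptions, and `predict_next` is a placeholder that would be replaced by the n-gram backoff lookup.

```r
library(shiny)

# Placeholder predictor (hypothetical); a real app would call the
# n-gram backoff lookup here instead of returning a fixed word.
predict_next <- function(input_text) {
  if (!nzchar(trimws(input_text))) "" else "the"
}

ui <- fluidPage(
  titlePanel("Next Word Predictor"),
  textInput("phrase", "Enter a phrase:"),
  actionButton("predict", "Predict"),
  h4("Predicted next word:"),
  textOutput("next_word")
)

server <- function(input, output, session) {
  # eventReactive recomputes only when the Predict button is clicked.
  prediction <- eventReactive(input$predict, {
    predict_next(input$phrase)
  })
  output$next_word <- renderText(prediction())
}

app <- shinyApp(ui, server)
# In an interactive session, launch with: runApp(app)
```

Tying the prediction to the button click, rather than to every keystroke, keeps the server from recomputing while the user is still typing.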

Future Improvements

Improve prediction accuracy with:
- Smoothing techniques (e.g., Katz backoff)
- Larger training sample
- POS tagging or deep learning
Show the top 3 predictions instead of a single word
Mobile optimization
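
As a rough sketch of the top 3 predictions idea, the lookup could return the k most frequent continuations rather than only the best one. The function name `top_k_words` and the toy table are hypothetical.

```r
# Sketch: return the k most frequent continuations for a prefix.
# `top_k_words` and `toy_trigrams` are hypothetical illustration names.
toy_trigrams <- data.frame(
  prefix = c("i love", "i love", "i love", "i love", "love you"),
  word   = c("you", "cats", "it", "pizza", "so"),
  freq   = c(5, 2, 3, 1, 4),
  stringsAsFactors = FALSE)

top_k_words <- function(tbl, prefix, k = 3) {
  hits <- tbl[tbl$prefix == prefix, ]
  hits <- hits[order(hits$freq, decreasing = TRUE), ]  # most frequent first
  head(hits$word, k)  # fewer than k words if the prefix is rare
}

top3 <- top_k_words(toy_trigrams, "i love")
top3
```

An unseen prefix yields an empty character vector, which a backoff step could then fill from the next lower n-gram order.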