Capstone Project

2025-11-21

1. Project Goal

Build a next-word prediction model similar to mobile phone keyboards.
Use publicly available text from blogs, news, and Twitter (HC Corpora).
Deploy a simple, easy-to-use Shiny web application.
Show that the model can give a reasonable next-word guess for typical English phrases.

2. Data and Preprocessing

Data sources

English blogs
English news articles
English Twitter messages

Preprocessing steps

Sampled a subset of lines to keep the model lightweight.
Converted text to lowercase and removed punctuation and symbols.
Split text into words and built:
- Single-word counts (unigrams)
- Two-word sequences (bigrams)
- Three-word sequences (trigrams)
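
The preprocessing steps above can be sketched in a few lines. This is an illustrative Python sketch, not the project's actual code: the function names `tokenize` and `count_ngrams`, the sample lines, and the choice to replace every non-letter character with a space are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(line):
    """Lowercase a line, replace non-letter characters with spaces, split into words."""
    # Assumption: digits and apostrophes are treated like punctuation here.
    cleaned = re.sub(r"[^a-z\s]", " ", line.lower())
    return cleaned.split()

def count_ngrams(lines, n):
    """Count every run of n consecutive words; n-grams never span two lines."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Hypothetical sample standing in for the sampled corpus lines.
sample_lines = [
    "Thanks for the follow!",
    "Thanks for the great day.",
]

unigrams = count_ngrams(sample_lines, 1)
bigrams = count_ngrams(sample_lines, 2)
trigrams = count_ngrams(sample_lines, 3)
```

Each table maps a tuple of words to its count, so `trigrams[("thanks", "for", "the")]` gives the number of times that three-word sequence appeared in the sample.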

3. Prediction Algorithm (N-gram Backoff)

Core idea

Use recent words typed by the user to guess the most likely next word.
Predictions are based on how often short word sequences occur in the training data.

Backoff strategy

Take the last two words of the input and search the trigram table.
If no trigram match is found, take only the last word and search the bigram table.
If there is still no match, fall back to the most frequent single word overall.
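
The three backoff steps can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the app's implementation: the toy counts are made up, and the names `best_continuation` and `predict_next` are hypothetical.

```python
from collections import Counter

# Toy n-gram tables with made-up counts, for illustration only.
unigrams = Counter({"the": 5, "for": 3, "thanks": 2, "follow": 1})
bigrams = Counter({("for", "the"): 3, ("the", "follow"): 1, ("the", "day"): 2})
trigrams = Counter({
    ("thanks", "for", "the"): 2,
    ("for", "the", "follow"): 1,
    ("for", "the", "day"): 2,
})

def best_continuation(table, context):
    """Return the most frequent word that follows `context` in `table`, or None."""
    candidates = {
        ngram[-1]: count
        for ngram, count in table.items()
        if ngram[:-1] == context
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

def predict_next(phrase):
    """Trigram lookup first, then bigram, then the most frequent single word."""
    words = phrase.lower().split()
    if len(words) >= 2:
        guess = best_continuation(trigrams, tuple(words[-2:]))
        if guess is not None:
            return guess
    if len(words) >= 1:
        guess = best_continuation(bigrams, (words[-1],))
        if guess is not None:
            return guess
    # Final fallback: the single most frequent word overall.
    return unigrams.most_common(1)[0][0]
```

With these toy tables, `predict_next("thanks for the")` is answered by the trigram table, `predict_next("see the")` backs off to the bigram table, and `predict_next("hello world")` falls through to the most frequent single word. The linear scan in `best_continuation` keeps the sketch short; a real table would be indexed by context.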

4. Shiny App: User Experience

How the app works

The user types a phrase, and the app returns its predicted next word.

Key strengths

Very easy to use: no configuration needed.
Response is fast because the model is pre-computed and lightweight.
Works for a wide range of everyday phrases drawn from blogs, news, and social media.
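
One way a pre-computed model can keep responses fast is to reduce each n-gram table ahead of time to a single best next word per context, so that a prediction becomes one dictionary lookup. The Python sketch below illustrates that idea only; how the app actually stores its tables is not described here, and `build_lookup` and the toy counts are hypothetical.

```python
from collections import Counter

def build_lookup(ngram_counts):
    """Collapse n-gram counts to {context: most frequent next word}."""
    best = {}
    for ngram, count in ngram_counts.items():
        context, word = ngram[:-1], ngram[-1]
        # Keep only the highest-count continuation seen for each context.
        if context not in best or count > best[context][1]:
            best[context] = (word, count)
    return {context: word for context, (word, _) in best.items()}

# Toy trigram counts, made up for illustration.
trigram_counts = Counter({
    ("for", "the", "day"): 2,
    ("for", "the", "follow"): 1,
    ("thanks", "for", "the"): 2,
})

# Built once, offline; at prediction time only a dictionary lookup is needed.
trigram_lookup = build_lookup(trigram_counts)
```

The lookup table is also smaller than the full count table, since it keeps one entry per context rather than one per n-gram.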