2025-11-21
1. Project Goal
- Build a next-word prediction model similar to those used by mobile phone keyboards.
- Use publicly available text from blogs, news, and Twitter (HC Corpora).
- Deploy a simple, easy-to-use Shiny web application.
- Show that the model can give a reasonable next-word guess for typical English phrases.
2. Data and Preprocessing
Data sources
- English blogs
- English news articles
- English Twitter messages
Preprocessing steps
- Sampled a subset of lines to keep the model lightweight.
- Converted text to lowercase and removed punctuation and symbols.
- Split text into words and built:
  - Single-word counts (unigrams)
  - Two-word sequences (bigrams)
  - Three-word sequences (trigrams)
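The preprocessing steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the code behind the app; the function names (sample_lines, tokenize, ngram_counts), the sampling fraction, and the choice to also keep digits are all assumptions made for the example.

```python
import random
import re
from collections import Counter


def sample_lines(lines, fraction, seed=42):
    """Keep roughly `fraction` of the lines (hypothetical helper; seed is arbitrary)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def tokenize(line):
    """Lowercase a line, strip punctuation and symbols, and split into words.

    Assumption: anything that is not a letter, digit, or whitespace is removed.
    """
    cleaned = re.sub(r"[^a-z0-9\s]", "", line.lower())
    return cleaned.split()


def ngram_counts(lines, n):
    """Count every run of n consecutive words within each line."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        # zip over n shifted copies of the word list yields n-word tuples
        counts.update(zip(*(words[i:] for i in range(n))))
    return counts


corpus = ["The cat sat.", "The cat ran!"]
unigrams = ngram_counts(corpus, 1)
bigrams = ngram_counts(corpus, 2)
trigrams = ngram_counts(corpus, 3)
print(bigrams[("the", "cat")])  # prints 2
```

Counting within each line, rather than across line boundaries, keeps unrelated sentences from contributing spurious word sequences; whether the original project did the same is not stated.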
3. Prediction Algorithm (N-gram Backoff)
Core idea
- Use recent words typed by the user to guess the most likely next word.
- Predictions are based on how often short word sequences occur in the training data.
Backoff strategy
- Take the last two words of the input and search the trigram table; if they match, predict the most frequent third word.
- If no trigram match is found, take the last word alone and search the bigram table.
- If there is still no match, fall back to the most frequent single word overall.
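A minimal sketch of this backoff lookup, written in Python for illustration only. The names build_tables and predict_next and the toy training sentence are hypothetical, and the sketch assumes the tables map each context to a count of the words that followed it; the app's actual data structures are not described here.

```python
from collections import Counter, defaultdict


def build_tables(tokens):
    """Build trigram, bigram, and unigram tables from a list of words."""
    tri = defaultdict(Counter)  # (word1, word2) -> counts of the next word
    bi = defaultdict(Counter)   # word1 -> counts of the next word
    uni = Counter(tokens)       # overall word counts
    for a, b in zip(tokens, tokens[1:]):
        bi[a][b] += 1
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        tri[(a, b)][c] += 1
    return tri, bi, uni


def predict_next(phrase, tri, bi, uni):
    """Return a next-word guess, backing off trigram -> bigram -> unigram.

    Assumes the tables were built from a non-empty training text.
    """
    words = phrase.lower().split()
    # Step 1: last two words against the trigram table
    if len(words) >= 2 and (words[-2], words[-1]) in tri:
        return tri[(words[-2], words[-1])].most_common(1)[0][0]
    # Step 2: last word against the bigram table
    if words and words[-1] in bi:
        return bi[words[-1]].most_common(1)[0][0]
    # Step 3: most frequent single word overall
    return uni.most_common(1)[0][0]


tokens = "the cat sat on the mat the cat ran to the door".split()
tri, bi, uni = build_tables(tokens)
print(predict_next("on the", tri, bi, uni))       # trigram match: mat
print(predict_next("dog ran", tri, bi, uni))      # bigram fallback: to
print(predict_next("hello world", tri, bi, uni))  # unigram fallback: the
```

Note that "on the" yields "mat" from the trigram table even though "cat" is the most common word after "the" on its own, which is the point of trying the longer context first.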
4. Shiny App: User Experience
How the app works
- The user types a phrase and the app returns the predicted next word.
Key strengths
- Very easy to use: no configuration needed.
- Response is fast because the model is pre-computed and lightweight.
- Works for a wide range of everyday phrases drawn from blogs, news, and social media.