2025-11-21

1. Project Goal

  • Build a next-word prediction model similar to mobile phone keyboards.
  • Use publicly available text from blogs, news, and Twitter (HC Corpora).
  • Deploy a simple, easy-to-use Shiny web application.
  • Show that the model can give a reasonable next-word guess for typical English phrases.

2. Data and Preprocessing

Data sources

  • English blogs
  • English news articles
  • English Twitter messages

Preprocessing steps

  • Sampled a subset of lines to keep the model lightweight.
  • Converted text to lowercase and removed punctuation and symbols.
  • Split text into words and built:
    • Single-word counts (unigrams)
    • Two-word sequences (bigrams)
    • Three-word sequences (trigrams)
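
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual code: the function names `tokenize` and `ngram_counts`, the three sample lines, and the choice to drop digits along with punctuation are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(line):
    """Lowercase a line, delete everything except letters and whitespace, split into words.

    Assumption: digits are removed together with punctuation and symbols.
    """
    cleaned = re.sub(r"[^a-z\s]", "", line.lower())
    return cleaned.split()

def ngram_counts(lines, n):
    """Count every run of n consecutive words within each line."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Hypothetical sample standing in for the sampled blog, news, and Twitter lines.
sample = ["Thanks for the follow!", "Thanks for the help.", "See you at the game"]

unigrams = ngram_counts(sample, 1)
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)

print(unigrams[("the",)])                   # 3
print(bigrams[("for", "the")])              # 2
print(trigrams[("thanks", "for", "the")])   # 2
```

N-grams are counted within a line only, so a sequence never spans two unrelated documents.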

3. Prediction Algorithm (N-gram Backoff)

Core idea

  • Use recent words typed by the user to guess the most likely next word.
  • Predictions are based on how often short word sequences occur in the training data.

Backoff strategy

  1. Take the last two words of the input and search the trigram table.
  2. If no trigram match is found, take the last word only and search the bigram table.
  3. If there is still no match, fall back to the most frequent single word overall.
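
The three backoff steps can be sketched as below. This is an illustrative Python version under stated assumptions, not the app's implementation: `predict_next` is a hypothetical name, the count tables are tiny hand-made examples, and the input is assumed to be normalized the same way as the training text.

```python
def predict_next(phrase, trigrams, bigrams, unigrams):
    """Return the most frequent continuation, backing off trigram -> bigram -> unigram."""
    # Assumption: the phrase contains no punctuation; only lowercasing is applied here.
    words = phrase.lower().split()

    # Step 1: look up the last two words in the trigram table.
    if len(words) >= 2:
        context = tuple(words[-2:])
        matches = {k[2]: c for k, c in trigrams.items() if k[:2] == context}
        if matches:
            return max(matches, key=matches.get)

    # Step 2: look up the last word in the bigram table.
    if words:
        matches = {k[1]: c for k, c in bigrams.items() if k[0] == words[-1]}
        if matches:
            return max(matches, key=matches.get)

    # Step 3: fall back to the most frequent single word overall.
    return max(unigrams, key=unigrams.get)[0]

# Hypothetical count tables, keyed by word tuples.
trigrams = {("thanks", "for", "the"): 2, ("for", "the", "follow"): 2, ("for", "the", "help"): 1}
bigrams = {("the", "game"): 3, ("the", "follow"): 2, ("at", "the"): 1}
unigrams = {("the",): 5, ("for",): 3, ("game",): 3}

print(predict_next("thanks for the", trigrams, bigrams, unigrams))  # follow (trigram match)
print(predict_next("see you at the", trigrams, bigrams, unigrams))  # game (bigram backoff)
print(predict_next("hello zebra", trigrams, bigrams, unigrams))     # the (unigram fallback)
```

Scanning the whole table on every call keeps the sketch short; a table indexed by context would make each lookup a single dictionary access.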

4. Shiny App: User Experience

How the app works

  • The user types a phrase and the app returns a predicted next word.

Key strengths

  • Very easy to use: no configuration needed.
  • Response is fast because the model is pre-computed and lightweight.
  • Works for a wide range of everyday phrases drawn from blogs, news, and social media.