2025-12-14

1. Goal and key decisions

Core modeling choices

  • Word-level trigram language model
  • Stupid backoff: trigram → bigram → unigram (with backoff factor \(\lambda\))
    • rather than additive smoothing or Kneser–Ney to keep the model simple and fast
    • the large corpus size makes sophisticated discounting less critical for this project
  • <UNK> used internally for Out-Of-Vocabulary words, but never shown as a suggestion

UX choices

  • Display-time stopword down-weighting
    • stopwords are kept in the model’s vocabulary, but their scores are down-weighted at display time so that more content words surface in the suggestions
  • Regex fallback over a large (high sampling-ratio) Twitter sample for rare phrases

2. Data and preprocessing

Preprocessing pipeline

I used the R tm package, following the preprocessing ideas presented in the Jurafsky & Manning slides. The pipeline itself (sketched in code after the list):

  • Lowercasing
  • Remove / blank out: URLs, @handles, emails
  • Remove numbers and punctuation
  • Whitespace normalization
  • Token filter: ^[a-z]+$
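
A minimal sketch of how that pipeline could look with tm; the helper `blank_pattern`, the corpus object, and the exact regexes are illustrative assumptions, not the project’s verbatim code:

```r
## Cleaning sketch with tm; tm_map passes extra args through to the transformer.
library(tm)

blank_pattern <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))     # lowercasing
  corpus <- tm_map(corpus, blank_pattern, "https?://\\S+")   # URLs
  corpus <- tm_map(corpus, blank_pattern, "@\\w+")           # @handles
  corpus <- tm_map(corpus, blank_pattern, "\\S+@\\S+")       # emails
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

## Token filter applied after tokenization: keep purely alphabetic tokens only
keep_token <- function(tok) grepl("^[a-z]+$", tok)
```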

Vocabulary + unknowns

  • Keep tokens with frequency ≥ k (I used k = 2)
  • Map remaining tokens → <UNK> before n-gram counting (sketched below)
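
A sketch of the frequency cut and <UNK> mapping, assuming `tokens` is a character vector of cleaned tokens; the helper names are illustrative:

```r
## Keep tokens seen at least k times; everything else becomes <UNK>.
build_vocab <- function(tokens, k = 2) {
  freq <- table(tokens)
  names(freq)[freq >= k]
}

map_unk <- function(tokens, vocab) {
  ifelse(tokens %in% vocab, tokens, "<UNK>")
}

## Usage (illustrative):
## vocab  <- build_vocab(all_tokens, k = 2)
## tokens <- map_unk(all_tokens, vocab)
```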

3. Prediction algorithm

Given the cleaned input tokens \(w_1, \dots, w_{n-1}\), predict \(w_n\).

1) Trigram (preferred)

\[ score(w \mid w_{n-2}, w_{n-1}) = \frac{c(w_{n-2}, w_{n-1}, w)}{c(w_{n-2}, w_{n-1})} \]

2) Back off to bigram if trigram unseen

\[ score(w \mid w_{n-1}) = \lambda \cdot \frac{c(w_{n-1}, w)}{c(w_{n-1})} \]

3) Back off to unigram if bigram unseen

\[ score(w) = \lambda^2 \cdot \frac{c(w)}{N} \]

where \(N\) is the total number of tokens in the corpus.

Output: top-k candidates by score (with <UNK> removed from suggestions).
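
A compact sketch of that cascade. The count tables `uni`, `bi`, `tri` (plain data frames of n-gram counts) and the default \(\lambda = 0.4\) (the value usually cited for stupid backoff) are assumptions for illustration; the real storage format and factor may differ:

```r
## Stupid-backoff scoring sketch. Assumed count tables:
##   uni: word, n        bi: w1, word, n        tri: w1, w2, word, n
predict_next <- function(w1, w2, tri, bi, uni, lambda = 0.4, k = 5) {
  hits <- tri[tri$w1 == w1 & tri$w2 == w2, ]
  if (nrow(hits) > 0) {
    scores <- data.frame(word = hits$word,
                         score = hits$n / sum(hits$n),
                         method = "trigram")
  } else {
    hits <- bi[bi$w1 == w2, ]
    if (nrow(hits) > 0) {
      scores <- data.frame(word = hits$word,
                           score = lambda * hits$n / sum(hits$n),
                           method = "bigram")
    } else {
      scores <- data.frame(word = uni$word,
                           score = lambda^2 * uni$n / sum(uni$n),
                           method = "unigram")
    }
  }
  scores <- scores[scores$word != "<UNK>", ]   # never suggest <UNK>
  head(scores[order(-scores$score), ], k)
}
```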

4. “Best match” optimization

Why: many intuitive completions (e.g., “case of ___”) are rare and can be missing from the training sample, so the language model backs off and suggests generic continuations such as of → the/his/….

Fallback idea

  • Keep a large sample of cleaned Twitter lines
  • Build a regex from the user’s suffix (last ~6–10 tokens)
  • Find matching lines and extract the next token after the suffix
  • Rank candidates by how often they were extracted, then merge them with the list produced by the language model (see the sketch below)
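
A rough sketch of that fallback, assuming `twitter_lines` is the pre-cleaned Twitter sample as a character vector; the 6-token window and function name are illustrative:

```r
## Find lines containing the user's suffix and tally the token that follows it.
regex_fallback <- function(input_tokens, twitter_lines, max_tokens = 6, k = 5) {
  suffix  <- tail(input_tokens, max_tokens)
  pattern <- paste0("\\b", paste(suffix, collapse = " "), " ([a-z]+)")
  m        <- regmatches(twitter_lines, regexpr(pattern, twitter_lines))
  next_tok <- sub(pattern, "\\1", m)            # extract the captured next token
  if (length(next_tok) == 0) return(NULL)
  ranked <- sort(table(next_tok), decreasing = TRUE)
  data.frame(word   = names(head(ranked, k)),
             score  = as.integer(head(ranked, k)),
             method = "regex")
}
```

The cleaned tokens only ever match ^[a-z]+$, so the suffix can be pasted into a regex without escaping.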

When fallback triggers

  • when the language model ends up in unigram-only mode, or
  • top-k is dominated by stopwords / low-utility suggestions
    • the intent is to improve user experience
    • in practice, it’s triggered rarely – refer to the milestone observations about coverage.

5. Shiny app: how it works and deployment

UI

  1. Input text box
  2. Submit button – prediction runs on click
  3. Output: top-k next-token suggestions + method used
  4. Additional inputs act as tuning parameters: number of suggestions, stopword penalty, and whether to enable the regex fallback (UI sketch below)
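
A minimal UI sketch along those lines; the input IDs and labels are illustrative, not the deployed app’s exact code:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("text", "Type a phrase:"),
  actionButton("go", "Submit"),
  numericInput("k", "Number of suggestions", value = 5, min = 1, max = 10),
  sliderInput("stop_penalty", "Stopword penalty", min = 0, max = 1, value = 0.5),
  checkboxInput("use_regex", "Enable regex fallback", value = TRUE),
  tableOutput("suggestions"),   # top-k next-token suggestions
  textOutput("method")          # which method produced them
)
```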

On Submit

  • Clean + tokenize input – apply the same rules as for model training
  • Run trigram → bigram → unigram scoring
  • Apply display ranking (stopword downweight)
  • If needed, run the regex fallback and merge the results (server-side sketch below)
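
A server-side sketch of this flow, reusing the illustrative helpers from earlier (`predict_next`, `regex_fallback`) plus an assumed `clean_tokens()` that applies the training-time cleaning rules; the count tables and Twitter sample are assumed to be loaded at startup, the penalty is applied as a multiplier, and the merge policy and trigger (unigram-only) are one possible choice rather than the app’s exact logic:

```r
server <- function(input, output, session) {
  result <- eventReactive(input$go, {
    toks <- clean_tokens(input$text)             # same cleaning rules as training
    n    <- length(toks)
    w1   <- if (n >= 2) toks[n - 1] else "<UNK>"
    w2   <- if (n >= 1) toks[n]     else "<UNK>"
    preds <- predict_next(w1, w2, tri, bi, uni, k = input$k)

    # display-time stopword down-weighting
    is_stop <- preds$word %in% tm::stopwords("en")
    preds$score[is_stop] <- preds$score[is_stop] * input$stop_penalty
    preds <- preds[order(-preds$score), ]

    # regex fallback when the model degraded to unigram-only suggestions
    if (input$use_regex && all(preds$method == "unigram")) {
      fb <- regex_fallback(toks, twitter_lines, k = input$k)
      if (!is.null(fb)) preds <- rbind(fb, preds)   # simple merge: fallback first
    }
    head(preds[!duplicated(preds$word), ], input$k)
  })
  output$suggestions <- renderTable(result())
  output$method      <- renderText(paste(unique(result()$method), collapse = ", "))
}
```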

Deployment

  1. Built model offline – stored as model.rds
  2. Published to shinyapps.io through rsconnect::deployApp()
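
For completeness, a sketch of those two steps; the app name and the exact contents bundled into model.rds are assumptions:

```r
## Offline, once: serialize the n-gram tables
saveRDS(list(uni = uni, bi = bi, tri = tri), "model.rds")

## In the app (e.g. global.R): load the model at startup
model <- readRDS("model.rds")

## Publish the app directory to shinyapps.io
rsconnect::deployApp(appDir = ".", appName = "next-word-predictor")
```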