2025-12-14

1. Goal and key decisions

Core modeling choices

  • Word-level trigram language model
  • Stupid backoff: trigram → bigram → unigram (with backoff factor \(\lambda\))
    • rather than additive smoothing or Kneser–Ney to keep the model simple and fast
    • the large corpus size makes sophisticated discounting less critical for this project
  • <UNK> used internally for Out-Of-Vocabulary words, but never shown as a suggestion

UX choices

  • Display-time stopword down-weighting
    • stopwords are kept in the model’s vocabulary, but their scores are down-weighted at display time so that more content words surface in the suggestions
  • Regex fallback over a large (high sampling-ratio) Twitter sample for rare phrases

2. Data and preprocessing

Preprocessing pipeline

I used the R tm package, following the preprocessing ideas presented in the Jurafsky & Manning slides. The pipeline itself (sketched in code after the list):

  • Lowercasing
  • Remove / blank out: URLs, @handles, emails
  • Remove numbers and punctuation
  • Whitespace normalization
  • Token filter: ^[a-z]+$
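
A minimal sketch of how that pipeline could look with tm; the helper `blank_pattern`, the corpus object, and the exact regexes are illustrative assumptions, not the project’s verbatim code:

```r
## Cleaning sketch with tm; tm_map passes extra args through to the transformer.
library(tm)

blank_pattern <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))     # lowercasing
  corpus <- tm_map(corpus, blank_pattern, "https?://\\S+")   # URLs
  corpus <- tm_map(corpus, blank_pattern, "@\\w+")           # @handles
  corpus <- tm_map(corpus, blank_pattern, "\\S+@\\S+")       # emails
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus
}

## Token filter applied after tokenization: keep purely alphabetic tokens only
keep_token <- function(tok) grepl("^[a-z]+$", tok)
```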

Vocabulary + unknowns

  • Keep tokens with frequency ≥ k (I used k = 2)
  • Map remaining tokens → <UNK> before n-gram counting (sketched below)
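
A sketch of the frequency cut and <UNK> mapping, assuming `tokens` is a character vector of cleaned tokens; the helper names are illustrative:

```r
## Keep tokens seen at least k times; everything else becomes <UNK>.
build_vocab <- function(tokens, k = 2) {
  freq <- table(tokens)
  names(freq)[freq >= k]
}

map_unk <- function(tokens, vocab) {
  ifelse(tokens %in% vocab, tokens, "<UNK>")
}

## Usage (illustrative):
## vocab  <- build_vocab(all_tokens, k = 2)
## tokens <- map_unk(all_tokens, vocab)
```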

3. Prediction algorithm

Given the cleaned input tokens \(w_1, \dots, w_{n-1}\), predict \(w_n\).

1) Trigram (preferred)

\[ score(w \mid w_{n-2}, w_{n-1}) = \frac{c(w_{n-2}, w_{n-1}, w)}{c(w_{n-2}, w_{n-1})} \]

2) Back off to bigram if trigram unseen

\[ score(w \mid w_{n-1}) = \lambda \cdot \frac{c(w_{n-1}, w)}{c(w_{n-1})} \]

3) Back off to unigram if bigram unseen

\[ score(w) = \lambda^2 \cdot \frac{c(w)}{N} \]

where \(N\) is the total number of tokens in the corpus.

Output: top-k candidates by score (with <UNK> removed from suggestions).
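
A compact sketch of that cascade. The count tables `uni`, `bi`, `tri` (plain data frames of n-gram counts) and the default \(\lambda = 0.4\) (the value usually cited for stupid backoff) are assumptions for illustration; the real storage format and factor may differ:

```r
## Stupid-backoff scoring sketch. Assumed count tables:
##   uni: word, n        bi: w1, word, n        tri: w1, w2, word, n
predict_next <- function(w1, w2, tri, bi, uni, lambda = 0.4, k = 5) {
  hits <- tri[tri$w1 == w1 & tri$w2 == w2, ]
  if (nrow(hits) > 0) {
    scores <- data.frame(word = hits$word,
                         score = hits$n / sum(hits$n),
                         method = "trigram")
  } else {
    hits <- bi[bi$w1 == w2, ]
    if (nrow(hits) > 0) {
      scores <- data.frame(word = hits$word,
                           score = lambda * hits$n / sum(hits$n),
                           method = "bigram")
    } else {
      scores <- data.frame(word = uni$word,
                           score = lambda^2 * uni$n / sum(uni$n),
                           method = "unigram")
    }
  }
  scores <- scores[scores$word != "<UNK>", ]   # never suggest <UNK>
  head(scores[order(-scores$score), ], k)
}
```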

4. “Best match” optimization

Why: many intuitive completions (e.g., “case of ___”) are rare and can be missing from the training sample, so the language model backs off and suggests generic continuations such as of → the/his/….

Fallback idea

  • Keep a large sample of cleaned Twitter lines
  • Build a regex from the user’s suffix (last ~6–10 tokens)
  • Find matching lines and extract the next token after the suffix
  • Rank candidates by how often they were extracted, then merge them with the list produced by the language model (see the sketch below)
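
A rough sketch of that fallback, assuming `twitter_lines` is the pre-cleaned Twitter sample as a character vector; the 6-token window and function name are illustrative:

```r
## Find lines containing the user's suffix and tally the token that follows it.
regex_fallback <- function(input_tokens, twitter_lines, max_tokens = 6, k = 5) {
  suffix  <- tail(input_tokens, max_tokens)
  pattern <- paste0("\\b", paste(suffix, collapse = " "), " ([a-z]+)")
  m        <- regmatches(twitter_lines, regexpr(pattern, twitter_lines))
  next_tok <- sub(pattern, "\\1", m)            # extract the captured next token
  if (length(next_tok) == 0) return(NULL)
  ranked <- sort(table(next_tok), decreasing = TRUE)
  data.frame(word   = names(head(ranked, k)),
             score  = as.integer(head(ranked, k)),
             method = "regex")
}
```

The cleaned tokens only ever match ^[a-z]+$, so the suffix can be pasted into a regex without escaping.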

When fallback triggers

  • when the language model ends up in unigram-only mode, or
  • top-k is dominated by stopwords / low-utility suggestions
    • the intent is to improve user experience
    • in practice, it’s triggered rarely – refer to the milestone observations about coverage.

5. Shiny app: how it works and deployment

UI

  1. Input text box
  2. Submit button – prediction runs on click
  3. Output: top-k next-token suggestions + method used
  4. Additional inputs act as tuning parameters: number of suggestions, stopword penalty, and whether to enable the regex fallback (UI sketch below)
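
A minimal UI sketch along those lines; the input IDs and labels are illustrative, not the deployed app’s exact code:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("text", "Type a phrase:"),
  actionButton("go", "Submit"),
  numericInput("k", "Number of suggestions", value = 5, min = 1, max = 10),
  sliderInput("stop_penalty", "Stopword penalty", min = 0, max = 1, value = 0.5),
  checkboxInput("use_regex", "Enable regex fallback", value = TRUE),
  tableOutput("suggestions"),   # top-k next-token suggestions
  textOutput("method")          # which method produced them
)
```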

On Submit

  • Clean + tokenize input – apply the same rules as for model training
  • Run trigram → bigram → unigram scoring
  • Apply display ranking (stopword downweight)
  • If needed, run the regex fallback and merge the results (server-side sketch below)
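
A server-side sketch of this flow, reusing the illustrative helpers from earlier (`predict_next`, `regex_fallback`) plus an assumed `clean_tokens()` that applies the training-time cleaning rules; the count tables and Twitter sample are assumed to be loaded at startup, the penalty is applied as a multiplier, and the merge policy and trigger (unigram-only) are one possible choice rather than the app’s exact logic:

```r
server <- function(input, output, session) {
  result <- eventReactive(input$go, {
    toks <- clean_tokens(input$text)             # same cleaning rules as training
    n    <- length(toks)
    w1   <- if (n >= 2) toks[n - 1] else "<UNK>"
    w2   <- if (n >= 1) toks[n]     else "<UNK>"
    preds <- predict_next(w1, w2, tri, bi, uni, k = input$k)

    # display-time stopword down-weighting
    is_stop <- preds$word %in% tm::stopwords("en")
    preds$score[is_stop] <- preds$score[is_stop] * input$stop_penalty
    preds <- preds[order(-preds$score), ]

    # regex fallback when the model degraded to unigram-only suggestions
    if (input$use_regex && all(preds$method == "unigram")) {
      fb <- regex_fallback(toks, twitter_lines, k = input$k)
      if (!is.null(fb)) preds <- rbind(fb, preds)   # simple merge: fallback first
    }
    head(preds[!duplicated(preds$word), ], input$k)
  })
  output$suggestions <- renderTable(result())
  output$method      <- renderText(paste(unique(result()$method), collapse = ", "))
}
```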

Deployment

  1. Built model offline – stored as model.rds
  2. Published to shinyapps.io through rsconnect::deployApp()
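
For completeness, a sketch of those two steps; the app name and the exact contents bundled into model.rds are assumptions:

```r
## Offline, once: serialize the n-gram tables
saveRDS(list(uni = uni, bi = bi, tri = tri), "model.rds")

## In the app (e.g. global.R): load the model at startup
model <- readRDS("model.rds")

## Publish the app directory to shinyapps.io
rsconnect::deployApp(appDir = ".", appName = "next-word-predictor")
```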