Core modeling choices
- Word-level trigram language model
- Stupid backoff: trigram → bigram → unigram, applying the backoff factor \(\lambda\) at each step (see the sketch after this list)
  - chosen over additive smoothing or Kneser–Ney to keep the model simple and fast
  - the large corpus makes sophisticated discounting less critical for this project
- `<UNK>` is used internally for out-of-vocabulary (OOV) words, but never shown as a suggestion
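
A minimal sketch of the scoring rule, assuming the n-gram counts live in plain `Counter` objects keyed by word tuples (the project's actual data structures and \(\lambda\) value are not specified here; 0.4 follows Brants et al., 2007). `suggest` also shows the `<UNK>` filter:

```python
from collections import Counter

LAMBDA = 0.4  # assumed backoff factor; the project's actual value is not stated

def stupid_backoff_score(w, context, unigrams, bigrams, trigrams, total):
    """Relative-frequency score for candidate `w` given up to two context words.

    Stupid backoff returns unnormalized scores, not probabilities. As a
    simplification, lambda is applied once per level skipped from the trigram.
    """
    if len(context) >= 2 and trigrams[(context[-2], context[-1], w)] > 0:
        return trigrams[(context[-2], context[-1], w)] / bigrams[(context[-2], context[-1])]
    if len(context) >= 1 and bigrams[(context[-1], w)] > 0:
        return LAMBDA * bigrams[(context[-1], w)] / unigrams[(context[-1],)]
    return LAMBDA * LAMBDA * unigrams[(w,)] / total  # unigram fallback

def suggest(context, vocab, unigrams, bigrams, trigrams, total, k=3):
    """Top-k suggestions; <UNK> absorbs OOV mass but is never surfaced."""
    scored = [(w, stupid_backoff_score(w, context, unigrams, bigrams, trigrams, total))
              for w in vocab if w != "<UNK>"]
    return sorted(scored, key=lambda t: -t[1])[:k]

# Toy usage with hypothetical counts:
trigrams = Counter({("i", "love", "you"): 3})
bigrams = Counter({("i", "love"): 5, ("love", "you"): 4})
unigrams = Counter({("i",): 10, ("love",): 6, ("you",): 5, ("<UNK>",): 2})
print(suggest(("i", "love"), {"i", "love", "you", "<UNK>"},
              unigrams, bigrams, trigrams, total=23))
```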
UX choices
- Display-time stopword down-weighting (see the sketch after this list)
  - stopwords are kept in the model's vocabulary, but their scores are down-weighted at display time to surface more content words
- Regex fallback over a high-sampling-ratio Twitter sample for rare phrases the n-gram model cannot score (sketch below)
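
A sketch of the display-time re-ranking, assuming scored `(word, score)` pairs from the model; the `STOPWORDS` set and the 0.2 penalty are illustrative, not values from the project:

```python
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}  # illustrative subset
STOPWORD_PENALTY = 0.2  # assumed multiplier, not the project's actual value

def rerank_for_display(scored):
    """Down-weight stopword scores after modeling so content words surface first.

    `scored` is a list of (word, score) pairs from the language model;
    the model's counts and vocabulary are left untouched.
    """
    adjusted = [(w, s * STOPWORD_PENALTY if w in STOPWORDS else s)
                for (w, s) in scored]
    return sorted(adjusted, key=lambda t: -t[1])
```

Applying the penalty at display time rather than dropping stopwords from training keeps the n-gram context counts intact, which is why they stay in the vocabulary.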
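
One way the regex fallback could look, as a hedged sketch: scan a pre-loaded tweet sample for the literal context phrase and tally the word that follows each hit. `regex_fallback` and `twitter_lines` are hypothetical names; the project's actual fallback interface is not specified here.

```python
import re
from collections import Counter

def regex_fallback(context_text, twitter_lines, k=3):
    """Last-resort lookup when the n-gram model has no useful candidates.

    Matches the exact context phrase and captures the next word token.
    """
    pattern = re.compile(r"\b" + re.escape(context_text) + r"\s+(\w+)",
                         re.IGNORECASE)
    follow = Counter()
    for line in twitter_lines:
        follow.update(m.group(1).lower() for m in pattern.finditer(line))
    return [w for w, _ in follow.most_common(k)]

# e.g. regex_fallback("happy new", tweets) would typically yield ["year", ...]
```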