2026-02-03

Suggestive

For this project, a corpus consisting of Twitter content, blog posts, and news articles was analyzed to develop a next-word text prediction model.

Project Overview
1. Quantification
2. Modeling
3. Testing and Scoring
4. Implementation

Note: Some links in this presentation reveal more information.
Hover the mouse pointer over the highlighted text to see the extra details.
Avoid clicking those links, as doing so may reload the presentation.

Quantification

  • Corpora split into training, test, and validation sets (80/10/10)
  • Training lines cleaned and tokenized
  • Quanteda package used to extract n-grams (for n in 1:5); see the first sketch after this list
  • n-grams appearing more than once written to frequency files
  • Interesting Word Association model; see the second sketch after this list
    • Full dictionary built to capture every word’s frequency
    • Interesting words are not-too-rare and not-too-common
    • Association n-gram frequencies counted irrespective of word order
      • Interesting words that appear together in any order
      • Repeated for interesting pairs, triples, and quads
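
A minimal sketch of the n-gram extraction step, assuming the cleaned training lines live in a hypothetical train.txt and that the more-than-once cutoff is applied at write time (tokens(), tokens_tolower(), tokens_ngrams(), and dfm() are quanteda's real API; the file names are illustrative):

    library(quanteda)

    # Tokenize the training lines (train.txt is a hypothetical path)
    lines <- readLines("train.txt")
    toks  <- tokens(lines, remove_punct = TRUE, remove_numbers = TRUE,
                    remove_symbols = TRUE)
    toks  <- tokens_tolower(toks)

    # Extract n-grams for n in 1:5; keep those appearing more than once
    for (n in 1:5) {
      grams <- tokens_ngrams(toks, n = n, concatenator = " ")
      freq  <- sort(colSums(dfm(grams)), decreasing = TRUE)
      freq  <- freq[freq > 1]
      write.csv(data.frame(ngram = names(freq), count = as.integer(freq)),
                sprintf("ngram_%d.csv", n), row.names = FALSE)
    }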
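
And a sketch of the interesting-word association counts, reusing toks from above; the frequency cutoffs (5 and 5000) are placeholders for whatever "not-too-rare, not-too-common" thresholds were actually chosen:

    # Interesting words fall between rarity and ubiquity (cutoffs assumed)
    word_freq   <- colSums(dfm(toks))
    interesting <- names(word_freq[word_freq >= 5 & word_freq <= 5000])

    # Count co-occurring interesting pairs irrespective of word order by
    # alpha-sorting each pair into one key; triples and quads work the same
    # way with combn(w, 3) and combn(w, 4)
    pairs <- unlist(lapply(as.list(toks), function(words) {
      w <- sort(unique(words[words %in% interesting]))
      if (length(w) < 2) return(character(0))
      combn(w, 2, paste, collapse = " ")
    }))
    pair_freq <- sort(table(pairs), decreasing = TRUE)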

Modeling

  • Offensive words removed using a bad-word list
  • Simple Backoff (SBO); see the lookup sketch after this list
    • Given a history (the last n-1 words), what is the most likely next word?
    • Model size reduced using cutoffs
      • Keep only the top 10 nextword options per history
    • Stored as CSV; the top n-gram tables (n in 2:5) total ~96 MB on disk
  • Word Association:
    • Created interesting word tuples (pairs, triples, and quads)
    • Stored each permutation under an alpha-sorted “history” key, as in the Quantification sketch
  • Katz Back-Off (KBO); see the formula after this list
    • Discounted P(nextword) used when the full n-gram history has been seen
    • For an unseen n-gram history, back off to the shorter (n-1)-gram history and weight P(nextword) by alpha
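
A minimal sketch of the SBO lookup, assuming the frequency files were reshaped into per-n data frames with columns history, nextword, and count, already pruned to the top 10 next words per history (the table layout and the function name are assumptions):

    # tables[[n]] holds (n+1)-gram rows keyed by an n-word history
    sbo_predict <- function(phrase, tables, top = 10) {
      words <- strsplit(tolower(phrase), "\\s+")[[1]]
      for (n in rev(seq_along(tables))) {    # try the longest history first
        if (length(words) < n) next
        key  <- paste(tail(words, n), collapse = " ")
        hits <- tables[[n]][tables[[n]]$history == key, ]
        if (nrow(hits) > 0) {                # history seen: rank its next words
          return(head(hits$nextword[order(-hits$count)], top))
        }
      }
      character(0)                           # no history matched at any length
    }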
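
For reference, the standard Katz back-off recursion that the KBO bullets describe, where h is the n-gram history, h' drops its oldest word, c(.) is a training count, d is the (typically count-dependent) discount, and alpha(h) is the left-over probability mass:

    P_{\mathrm{KBO}}(w \mid h) =
      \begin{cases}
        d \cdot c(h\,w) / c(h)                      & \text{if } c(h\,w) > 0 \\
        \alpha(h) \cdot P_{\mathrm{KBO}}(w \mid h') & \text{otherwise}
      \end{cases}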

Testing and Scoring

Implementation

  • Suggestive can be tested here: https://parseljc.shinyapps.io/Suggestive/
    • Uses the SBO 4-gram model, which gave the best combination of speed, accuracy, and model size
  • The app has two modes (the user can toggle between them); see the sketch after this list
    • Manual - type a phrase, then click the Predict button to get the top ten predictions
    • Auto - the top ten predictions update as you type
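
A minimal sketch of the two-mode toggle in Shiny; predict_next() stands in for the SBO lookup sketched earlier, and all control names here are hypothetical:

    library(shiny)

    ui <- fluidPage(
      textInput("phrase", "Enter a phrase:"),
      radioButtons("mode", "Mode:", c("Manual", "Auto")),
      actionButton("predict", "Predict"),
      tableOutput("top10")
    )

    server <- function(input, output) {
      # Auto mode: recompute on every keystroke
      live <- reactive(predict_next(input$phrase, top = 10))
      # Manual mode: recompute only when the Predict button is clicked
      clicked <- eventReactive(input$predict, live())
      output$top10 <- renderTable({
        data.frame(prediction = if (input$mode == "Auto") live() else clicked())
      })
    }

    shinyApp(ui, server)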