2026-02-03

Suggestive

For this project, a corpus consisting of Twitter content, blog posts, and news articles was analyzed to develop a next-word text prediction model.

Project Overview
1. Quantification
2. Modeling
3. Testing and Scoring
4. Implementation

Note: Some links in this presentation reveal more information.
Hover the mouse pointer over the highlighted text to see the extra details.
Avoid clicking those links, as doing so may reload the presentation.

Quantification

  • Corpora split into training, test, and validation sets (80/10/10)
  • Training lines cleaned and tokenized
  • Quanteda package used to extract n-grams (for n in 1:5); see the first sketch after this list
  • n-grams appearing more than once written to frequency files
  • Interesting Word Association model; see the second sketch after this list
    • Full dictionary built to capture every word’s frequency
    • Interesting words are not-too-rare and not-too-common
    • Association n-gram frequencies counted irrespective of word order
      • Interesting words that appear together in any order
      • Repeated for interesting pairs, triples, and quads
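
A minimal sketch of the n-gram extraction step, assuming the cleaned training lines live in a hypothetical train.txt and that the more-than-once cutoff is applied at write time (tokens(), tokens_tolower(), tokens_ngrams(), and dfm() are quanteda's real API; the file names are illustrative):

    library(quanteda)

    # Tokenize the training lines (train.txt is a hypothetical path)
    lines <- readLines("train.txt")
    toks  <- tokens(lines, remove_punct = TRUE, remove_numbers = TRUE,
                    remove_symbols = TRUE)
    toks  <- tokens_tolower(toks)

    # Extract n-grams for n in 1:5; keep those appearing more than once
    for (n in 1:5) {
      grams <- tokens_ngrams(toks, n = n, concatenator = " ")
      freq  <- sort(colSums(dfm(grams)), decreasing = TRUE)
      freq  <- freq[freq > 1]
      write.csv(data.frame(ngram = names(freq), count = as.integer(freq)),
                sprintf("ngram_%d.csv", n), row.names = FALSE)
    }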
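
And a sketch of the interesting-word association counts, reusing toks from above; the frequency cutoffs (5 and 5000) are placeholders for whatever "not-too-rare, not-too-common" thresholds were actually chosen:

    # Interesting words fall between rarity and ubiquity (cutoffs assumed)
    word_freq   <- colSums(dfm(toks))
    interesting <- names(word_freq[word_freq >= 5 & word_freq <= 5000])

    # Count co-occurring interesting pairs irrespective of word order by
    # alpha-sorting each pair into one key; triples and quads work the same
    # way with combn(w, 3) and combn(w, 4)
    pairs <- unlist(lapply(as.list(toks), function(words) {
      w <- sort(unique(words[words %in% interesting]))
      if (length(w) < 2) return(character(0))
      combn(w, 2, paste, collapse = " ")
    }))
    pair_freq <- sort(table(pairs), decreasing = TRUE)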

Modeling

  • Offensive words removed using a bad-word list
  • Simple Backoff (SBO); see the lookup sketch after this list
    • Given a history (the last n-1 words), what is the most likely next word?
    • Model size reduced using cutoffs
      • Keep only the top 10 nextword options per history
    • Stored as CSV; the top n-gram tables (n in 2:5) total ~96 MB on disk
  • Word Association:
    • Created interesting word tuples (pairs, triples, and quads)
    • Stored each permutation under an alpha-sorted “history” key, as in the Quantification sketch
  • Katz Back-Off (KBO); see the formula after this list
    • Discounted P(nextword) used when the full n-gram history has been seen
    • For an unseen n-gram history, back off to the shorter (n-1)-gram history and weight P(nextword) by alpha
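
A minimal sketch of the SBO lookup, assuming the frequency files were reshaped into per-n data frames with columns history, nextword, and count, already pruned to the top 10 next words per history (the table layout and the function name are assumptions):

    # tables[[n]] holds (n+1)-gram rows keyed by an n-word history
    sbo_predict <- function(phrase, tables, top = 10) {
      words <- strsplit(tolower(phrase), "\\s+")[[1]]
      for (n in rev(seq_along(tables))) {    # try the longest history first
        if (length(words) < n) next
        key  <- paste(tail(words, n), collapse = " ")
        hits <- tables[[n]][tables[[n]]$history == key, ]
        if (nrow(hits) > 0) {                # history seen: rank its next words
          return(head(hits$nextword[order(-hits$count)], top))
        }
      }
      character(0)                           # no history matched at any length
    }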
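
For reference, the standard Katz back-off recursion that the KBO bullets describe, where h is the n-gram history, h' drops its oldest word, c(.) is a training count, d is the (typically count-dependent) discount, and alpha(h) is the left-over probability mass:

    P_{\mathrm{KBO}}(w \mid h) =
      \begin{cases}
        d \cdot c(h\,w) / c(h)                      & \text{if } c(h\,w) > 0 \\
        \alpha(h) \cdot P_{\mathrm{KBO}}(w \mid h') & \text{otherwise}
      \end{cases}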

Testing and Scoring

Implementation

  • Suggestive can be tested here: https://parseljc.shinyapps.io/Suggestive/
    • Uses the SBO 4-gram model, which gave the best combination of speed, accuracy, and model size
  • The app has two modes (the user can toggle between them); see the sketch after this list
    • Manual - type a phrase, then click the Predict button to get the top ten predictions
    • Auto - the top ten predictions update as you type
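
A minimal sketch of the two-mode toggle in Shiny; predict_next() stands in for the SBO lookup sketched earlier, and all control names here are hypothetical:

    library(shiny)

    ui <- fluidPage(
      textInput("phrase", "Enter a phrase:"),
      radioButtons("mode", "Mode:", c("Manual", "Auto")),
      actionButton("predict", "Predict"),
      tableOutput("top10")
    )

    server <- function(input, output) {
      # Auto mode: recompute on every keystroke
      live <- reactive(predict_next(input$phrase, top = 10))
      # Manual mode: recompute only when the Predict button is clicked
      clicked <- eventReactive(input$predict, live())
      output$top10 <- renderTable({
        data.frame(prediction = if (input$mode == "Auto") live() else clicked())
      })
    }

    shinyApp(ui, server)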