2021-11-29

Predictive modelling choices

  • Next-word suggestions are based on n-gram frequencies computed from a random sample of the input corpora.

  • Profanities and the most common French and Spanish words are filtered out automatically.

  • 2-grams to 4-grams are ordered by frequency and split “(n-1)+1” into “input + next word”, with a minimum number of occurrences that depends on chain length.

  • 2-grams are complemented by a list of synonyms, and longer word chains by a list of common expressions.

  • Finally, the Input | Next Word pairs are gathered in a 2-column database ordered by likelihood.
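
The steps above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the function names, the `MIN_COUNT` thresholds and the back-off from longer to shorter inputs are all assumptions, and the synonym and common-expression lists are left out.

```python
from collections import Counter, defaultdict

# Hypothetical minimum occurrences per n-gram length (the real values are not given).
MIN_COUNT = {2: 2, 3: 2, 4: 1}

def ngrams(tokens, n):
    """Return all contiguous n-word chains in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_table(sentences, banned=frozenset(), min_count=MIN_COUNT):
    """Map each (n-1)-word input to its next words, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in min_count:
            for gram in ngrams(tokens, n):
                # Skip chains containing a filtered word (profanity, foreign word).
                if not banned.intersection(gram):
                    counts[gram] += 1
    table = defaultdict(list)
    # most_common() yields n-grams by descending count, so each list ends up ranked.
    for gram, count in counts.most_common():
        if count >= min_count[len(gram)]:
            table[" ".join(gram[:-1])].append(gram[-1])
    return dict(table)

def suggest(table, text, k=3):
    """Return up to k next words, trying the longest known input first (assumed back-off)."""
    tokens = text.lower().split()
    for size in range(min(3, len(tokens)), 0, -1):
        key = " ".join(tokens[-size:])
        if key in table:
            return table[key][:k]
    return []
```

Calling `suggest(build_table(sample_sentences), "some typed text")` would then return the ranked candidates for the last one to three words typed, or an empty list when no input matches.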

Performance indicators

Our ranked Input | Next Word database weighs 834 KB. Below are some simple performance measures on random samples from a test set.

Input     Accuracy* (%)   Avg. Response Time (s)
1-word    13              0.4
2-word    84              0.2
3-word    72              0.1
4-word    50              0.2

* Accuracy is measured as the percentage of exact responses among the top 1000 (input+1)-grams from the test samples.
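
One way such a measurement could be set up is sketched below. It assumes that an "exact response" means the predictor's first suggestion equals the word that actually follows the input; `evaluate` and `predict` are hypothetical names, with `predict` standing in for any function that takes the input text and returns one word (or `None`).

```python
import time
from collections import Counter

def evaluate(predict, test_sentences, input_len, top=1000):
    """Return (accuracy in %, average response time in s) for inputs of input_len words."""
    n = input_len + 1
    counts = Counter()
    for sentence in test_sentences:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    # Keep the most frequent (input+1)-grams of the test sample as test cases.
    cases = [gram for gram, _ in counts.most_common(top)]
    if not cases:
        return 0.0, 0.0
    hits = 0
    start = time.perf_counter()
    for gram in cases:
        # Exact response (assumed): the prediction equals the true last word.
        if predict(" ".join(gram[:-1])) == gram[-1]:
            hits += 1
    elapsed = time.perf_counter() - start
    return 100.0 * hits / len(cases), elapsed / len(cases)
```

Running it once per input length (1 to 4 words) would give one row of the table per call.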

Illustration

Try it here