Coursera Data Science Capstone

2021-11-29

Predictive modelling choices

Next-word suggestion based on frequency counts of n-grams from a random sample of the input corpora.
Automatic filtering of profanities and of the most common French and Spanish words.
2-grams to 4-grams are ranked by frequency and split as “(n-1)+1” words, i.e. “input + next word”, with a minimum number of occurrences that depends on chain length.
2-grams are complemented by a list of synonyms, and longer word chains by a list of common expressions.
Finally, the Input | Next Word pairs are gathered in a two-column database ordered by likelihood.
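
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind the app: the function names (`build_table`, `suggest`), the minimum counts in `MIN_COUNT`, and the back-off from longer to shorter inputs are all hypothetical, and the synonym and common-expression lists are left out.

```python
from collections import Counter, defaultdict

# Hypothetical minimum counts per n-gram length (the real thresholds are not given).
MIN_COUNT = {2: 2, 3: 1, 4: 1}

def build_table(sentences, blocked=frozenset(), min_count=MIN_COUNT):
    """Return {input: [next words, most frequent first]} from tokenised sentences."""
    counts = Counter()
    for tokens in sentences:
        # Simplification: filtered words (profanities, foreign words) are dropped
        # before counting, so n-grams may span the position of a removed word.
        tokens = [t for t in tokens if t not in blocked]
        for n in (2, 3, 4):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1

    table = defaultdict(list)
    # most_common() yields n-grams by descending count, so each list of
    # next words ends up ordered by observed frequency.
    for gram, count in counts.most_common():
        if count >= min_count[len(gram)]:
            table[" ".join(gram[:-1])].append(gram[-1])  # (n-1) words + 1 word
    return dict(table)

def suggest(table, text, k=3):
    """Assumed look-up: try the last 3 words, then 2, then 1."""
    words = text.lower().split()
    for n in (3, 2, 1):
        if len(words) >= n:
            key = " ".join(words[-n:])
            if key in table:
                return table[key][:k]
    return []

sample = [["i", "love", "you"], ["i", "love", "you"], ["i", "love", "cats"]]
table = build_table(sample)
print(suggest(table, "I love"))
```

Storing only the input and its ranked next words, rather than the raw counts, is what keeps such a table small.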

Our ranked Input | Next Word database weighs 834 KB; below are some simple performance measures on random samples from a test set.

* Accuracy is measured as the percentage of exact responses among the top 1000 (input+1)-grams from the test samples.
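
One possible reading of this accuracy measure, sketched in Python: take the most frequent n-grams in a test sample, feed the first n-1 words to the model, and count a hit when the first suggestion equals the actual last word. The function name `top_ngram_accuracy` and the dictionary layout of `table` are assumptions made for the illustration, and the toy data below is not from the project.

```python
from collections import Counter

def top_ngram_accuracy(table, test_sentences, n=2, top=1000):
    """Percentage of the `top` most frequent test n-grams whose last word
    equals the first suggestion stored for the preceding n-1 words."""
    counts = Counter()
    for tokens in test_sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1

    cases = counts.most_common(top)
    if not cases:
        return 0.0

    hits = 0
    for gram, _ in cases:
        predicted = table.get(" ".join(gram[:-1]), [])
        # Exact response: the top-ranked suggestion is the true next word.
        if predicted and predicted[0] == gram[-1]:
            hits += 1
    return 100.0 * hits / len(cases)

# Toy example with a hand-written table (hypothetical data).
table = {"i": ["love"], "love": ["you"]}
test_set = [["i", "love", "you"], ["i", "love", "cats"]]
print(round(top_ngram_accuracy(table, test_set, n=2), 1))
```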

A light implementation here:

And the code for the complete version: