Cleverson
November 2, 2019
This App, in the Natural Language Processing (NLP) field, tested two algorithms for predicting the user's next word. The training dataset was a 1% sample (about 700K words) of a collection of roughly 70M words of US English entries from social media (Twitter and blogs) and news sources. N-gram tables (unigrams, bigrams, and trigrams) and their individual frequencies were computed; a rough counting sketch is shown below.
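As an illustration of this preprocessing step, the Python sketch below counts unigram, bigram, and trigram frequencies from tokenized sentences. The `count_ngrams` helper and the toy `corpus` are hypothetical stand-ins, not the App's actual code or data.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count n-grams (as word tuples) across a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical toy corpus standing in for the 1% training sample.
corpus = [
    "the cat sat on the mat".split(),
    "the cat ate the fish".split(),
]
unigrams = count_ngrams(corpus, 1)
bigrams  = count_ngrams(corpus, 2)
trigrams = count_ngrams(corpus, 3)
print(bigrams[("the", "cat")])  # frequency of the bigram "the cat" -> 2
```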
The two predictive algorithms tested by the App work as follows:

Katz Backoff with Good-Turing: It consults the most detailed model first, which in this App is the trigram model, and backs off to a lower-order model if that fails. If the trigram is reliable, meaning it has a high count, the trigram estimate is used; otherwise the algorithm backs off to the bigram model, and it keeps backing off until it reaches a model with some counts. The higher-order probabilities are discounted, and the reserved probability mass is redistributed to the lower-order n-grams. For words never seen in training, the probability mass assigned to words that occurred only once is computed and distributed among them; this is the Good-Turing smoothing technique.
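A minimal Python sketch of the backoff idea, reusing the hypothetical count tables from the counting sketch above: it tries the trigram table first, then backs off to bigrams and unigrams, and it also shows the Good-Turing estimate of the mass reserved for unseen words. The discounting and redistribution of higher-order probabilities is simplified away, so this is only an illustration, not the App's implementation.

```python
from collections import Counter

def predict_backoff(context, trigrams, bigrams, unigrams):
    """Pick the next word by backing off from trigram to bigram to unigram counts.

    `context` is a tuple of the last two words typed. This sketch uses raw
    counts and omits the Good-Turing discounting/redistribution step.
    """
    # Trigram level: words seen following the full two-word context.
    candidates = Counter({w3: c for (w1, w2, w3), c in trigrams.items()
                          if (w1, w2) == context})
    if candidates:
        return candidates.most_common(1)[0][0]
    # Back off to the bigram level: condition on the last word only.
    candidates = Counter({w2: c for (w1, w2), c in bigrams.items()
                          if w1 == context[-1]})
    if candidates:
        return candidates.most_common(1)[0][0]
    # Back off to the unigram level: the most frequent word overall.
    return unigrams.most_common(1)[0][0][0]

def good_turing_unseen_mass(unigrams):
    """Good-Turing estimate of the total probability mass reserved for unseen
    words: the share of tokens belonging to words seen exactly once."""
    singletons = sum(1 for count in unigrams.values() if count == 1)
    total_tokens = sum(unigrams.values())
    return singletons / total_tokens

# Reuses the count tables built in the counting sketch above.
print(predict_backoff(("the", "cat"), trigrams, bigrams, unigrams))
print(good_turing_unseen_mass(unigrams))
```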
Interpolation with Modified Kneser-Ney: Interpolation makes use of both higher- and lower-order n-grams by reallocating some probability mass from the higher-order models down to the unigram model. The discounted raw probability of an n-gram is linearly interpolated with the smoothed probability of the (n-1)-gram, so the interpolation behaves much like a backoff. The lower-order probability of a word is based on the number of distinct contexts the word follows, and a discounting amount is subtracted from each observed n-gram count; the mass removed by discounting is then redistributed across the n-gram probabilities that share the same context.
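A simplified Python sketch of interpolated Kneser-Ney at the bigram level, again reusing the hypothetical count tables from above. It applies a single fixed discount rather than the count-dependent discounts of the modified variant, so it illustrates the idea rather than reproducing the App's code.

```python
def kneser_ney_bigram_prob(word, prev, bigrams, unigrams, discount=0.75):
    """Interpolated Kneser-Ney P(word | prev) at the bigram level.

    Simplified: one fixed discount instead of the several count-dependent
    discounts used by the "modified" variant.
    """
    # Continuation probability: in how many distinct contexts does `word`
    # appear as the second element, out of all distinct bigram types?
    continuation = sum(1 for (_, w2) in bigrams if w2 == word) / len(bigrams)

    prev_count = unigrams.get((prev,), 0)
    if prev_count == 0:
        # Unseen context: rely on the continuation probability alone.
        return continuation

    # Discounted higher-order (bigram) probability.
    discounted = max(bigrams.get((prev, word), 0) - discount, 0) / prev_count

    # Interpolation weight: the mass removed by discounting `prev`'s bigrams,
    # spread over the lower-order distribution.
    follower_types = sum(1 for (w1, _) in bigrams if w1 == prev)
    lam = discount * follower_types / prev_count

    return discounted + lam * continuation

# Reuses the count tables from the counting sketch above.
print(kneser_ney_bigram_prob("cat", "the", bigrams, unigrams))
```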
The final App uses only the Modified Kneser-Ney model.
Have fun…
To access the App: <https://cleversonsch.shinyapps.io/nextwordApp/>