Cleverson
November 2, 2019
This App, in the Natural Language Processing (NLP) field, tested two algorithms for predicting the user's next word. The training dataset was a 1% sample (about 700K words) of a collection of roughly 70M words of US English entries from social media (Twitter and blogs) and news sources. N-gram tables (unigrams, bigrams, and trigrams) and their individual frequencies were computed; a rough counting sketch is shown below.
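As an illustration of this preprocessing step, the Python sketch below counts unigram, bigram, and trigram frequencies from tokenized sentences. The `count_ngrams` helper and the toy `corpus` are hypothetical stand-ins, not the App's actual code or data.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count n-grams (as word tuples) across a list of tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical toy corpus standing in for the 1% training sample.
corpus = [
    "the cat sat on the mat".split(),
    "the cat ate the fish".split(),
]
unigrams = count_ngrams(corpus, 1)
bigrams  = count_ngrams(corpus, 2)
trigrams = count_ngrams(corpus, 3)
print(bigrams[("the", "cat")])  # frequency of the bigram "the cat" -> 2
```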
The two predictive algorithms tested by the App work as follows:

Katz Backoff with Good-Turing: It consults the most detailed model first, which in this App is the trigram model, and backs off to a lower-order model if that fails. If the trigram is reliable, meaning it has a high count, the trigram estimate is used; otherwise the algorithm backs off to the bigram model, and it keeps backing off until it reaches a model with some counts. The higher-order probabilities are discounted, and the reserved probability mass is redistributed to the lower-order n-grams. For words never seen in training, the probability mass assigned to words that occurred only once is computed and distributed among them; this is the Good-Turing smoothing technique.
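A minimal Python sketch of the backoff idea, reusing the hypothetical count tables from the counting sketch above: it tries the trigram table first, then backs off to bigrams and unigrams, and it also shows the Good-Turing estimate of the mass reserved for unseen words. The discounting and redistribution of higher-order probabilities is simplified away, so this is only an illustration, not the App's implementation.

```python
from collections import Counter

def predict_backoff(context, trigrams, bigrams, unigrams):
    """Pick the next word by backing off from trigram to bigram to unigram counts.

    `context` is a tuple of the last two words typed. This sketch uses raw
    counts and omits the Good-Turing discounting/redistribution step.
    """
    # Trigram level: words seen following the full two-word context.
    candidates = Counter({w3: c for (w1, w2, w3), c in trigrams.items()
                          if (w1, w2) == context})
    if candidates:
        return candidates.most_common(1)[0][0]
    # Back off to the bigram level: condition on the last word only.
    candidates = Counter({w2: c for (w1, w2), c in bigrams.items()
                          if w1 == context[-1]})
    if candidates:
        return candidates.most_common(1)[0][0]
    # Back off to the unigram level: the most frequent word overall.
    return unigrams.most_common(1)[0][0][0]

def good_turing_unseen_mass(unigrams):
    """Good-Turing estimate of the total probability mass reserved for unseen
    words: the share of tokens belonging to words seen exactly once."""
    singletons = sum(1 for count in unigrams.values() if count == 1)
    total_tokens = sum(unigrams.values())
    return singletons / total_tokens

# Reuses the count tables built in the counting sketch above.
print(predict_backoff(("the", "cat"), trigrams, bigrams, unigrams))
print(good_turing_unseen_mass(unigrams))
```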
Interpolation with Modified Kneser-Ney: Interpolation makes use of both higher- and lower-order n-grams by reallocating some probability mass from the higher-order models down to the unigram model. The discounted raw probability of an n-gram is linearly interpolated with the smoothed probability of the (n-1)-gram, so the interpolation behaves much like a backoff. The lower-order probability of a word is based on the number of distinct contexts the word follows, and a discounting amount is subtracted from each observed n-gram count; the mass removed by discounting is then redistributed across the n-gram probabilities that share the same context.
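A simplified Python sketch of interpolated Kneser-Ney at the bigram level, again reusing the hypothetical count tables from above. It applies a single fixed discount rather than the count-dependent discounts of the modified variant, so it illustrates the idea rather than reproducing the App's code.

```python
def kneser_ney_bigram_prob(word, prev, bigrams, unigrams, discount=0.75):
    """Interpolated Kneser-Ney P(word | prev) at the bigram level.

    Simplified: one fixed discount instead of the several count-dependent
    discounts used by the "modified" variant.
    """
    # Continuation probability: in how many distinct contexts does `word`
    # appear as the second element, out of all distinct bigram types?
    continuation = sum(1 for (_, w2) in bigrams if w2 == word) / len(bigrams)

    prev_count = unigrams.get((prev,), 0)
    if prev_count == 0:
        # Unseen context: rely on the continuation probability alone.
        return continuation

    # Discounted higher-order (bigram) probability.
    discounted = max(bigrams.get((prev, word), 0) - discount, 0) / prev_count

    # Interpolation weight: the mass removed by discounting `prev`'s bigrams,
    # spread over the lower-order distribution.
    follower_types = sum(1 for (w1, _) in bigrams if w1 == prev)
    lam = discount * follower_types / prev_count

    return discounted + lam * continuation

# Reuses the count tables from the counting sketch above.
print(kneser_ney_bigram_prob("cat", "the", bigrams, unigrams))
```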
The final App uses only the Modified Kneser-Ney model.
Have fun…
To access the App: <https://cleversonsch.shinyapps.io/nextwordApp/>