Product Presentation: N-Gram Word Prediction

Konstantin Mingoulin
March 17, 2018

Provide up to 5 suggestions
To ensure accuracy and relevance, up to 4 preceding words are used to predict the next one

Sample data from 3 corpora: news, blogs and twitter
Clean up and stem the combined corpus
Create a term matrix that contains n-grams of 2 to 5 words
A function takes a line of text and predicts the next word from the longest available context: it starts with the 4 preceding words, then falls back to 3, and so on down to 1. The input does not need to be stemmed
The function outputs the 5 most likely next words, ranked by frequency of occurrence in the corpus. Results go through stem completion, which maps each stem to its most prevalent full form in the same combined corpus (not stemmed)
If no matches are found, the function returns “no match”
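
The back-off lookup described in the steps above can be sketched roughly as follows. This is a minimal Python illustration, not the product's actual code: the names tokenize, build_table and predict are hypothetical, the tiny sample corpus is made up for the example, and stemming and stem completion are left out to keep the sketch short.

```python
from collections import Counter, defaultdict

MAX_CONTEXT = 4  # up to 4 preceding words, as in the presentation
TOP_K = 5        # up to 5 suggestions

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for the real clean-up step)."""
    return text.lower().split()

def build_table(lines, max_context=MAX_CONTEXT):
    """Map each context of 1..max_context words to a Counter of the words that follow it.

    A context of n words plus its next word is an (n+1)-gram, so contexts of
    1 to 4 words correspond to the 2- to 5-grams in the term matrix.
    """
    table = defaultdict(Counter)
    for line in lines:
        tokens = tokenize(line)
        for n in range(1, max_context + 1):
            for i in range(len(tokens) - n):
                context = tuple(tokens[i:i + n])
                table[context][tokens[i + n]] += 1
    return table

def predict(table, text, max_context=MAX_CONTEXT, top_k=TOP_K):
    """Try the longest context first, then back off one word at a time."""
    tokens = tokenize(text)
    for n in range(min(max_context, len(tokens)), 0, -1):
        context = tuple(tokens[-n:])
        if context in table:
            # Rank candidate next words by how often they followed this context.
            return [word for word, _ in table[context].most_common(top_k)]
    return "no match"

# Hypothetical three-line corpus, for illustration only.
sample = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "the cat ran away",
]
table = build_table(sample)
print(predict(table, "I saw the cat"))
print(predict(table, "zebra"))
```

In this sketch the input "I saw the cat" has no match for its 4-word or 3-word context, so the lookup backs off to the 2-word context "the cat" and ranks the words that followed it. An input with no matching context at any length yields “no match”.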

Note: if you continue typing, suggestions appear automatically; there is no need to click “Predict”