Predictionary

Michael Baldassaro
7/23/2018

About Predictionary

Predictionary is designed to predict the next word from an input phrase based on n-gram (unigram, bigram, trigram and quadgram) tables.
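
As a rough illustration of what an n-gram table holds, the following Python sketch counts every run of n consecutive tokens for n = 1 to 4. It is not the app's own code: the function name `ngram_table` and the toy sentence are assumptions made for the example.

```python
from collections import Counter

def ngram_table(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy input standing in for the cleaned corpus (assumption).
tokens = "the cat sat on the mat the cat ran".split()

# One table per order: unigram, bigram, trigram, quadgram.
tables = {n: ngram_table(tokens, n) for n in range(1, 5)}

print(tables[2].most_common(1))  # most frequent bigram in the toy input
```

Each table maps a tuple of words to how often that sequence occurred, which is the raw material a smoothing method turns into probabilities.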

The corpus used to generate the n-gram tables is composed of blog posts, news articles and tweets provided by SwiftKey via Coursera.

The combined corpus contained over 4 million lines of text. For speed and efficiency, a 5% sample (200,000+ lines) of the corpus was used to generate the n-gram tables used for prediction.
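
A 5% line sample could be drawn along these lines. This is a minimal Python sketch, assuming sampling without replacement and a fixed seed for reproducibility; neither detail is stated above, and `sample_lines` is a hypothetical name.

```python
import random

def sample_lines(lines, fraction=0.05, seed=42):
    """Return a random subset of lines (without replacement).

    The seed and the without-replacement choice are assumptions.
    """
    rng = random.Random(seed)
    k = int(len(lines) * fraction)
    return rng.sample(lines, k)

# Stand-in for the combined blog, news and tweet lines.
corpus = [f"line {i}" for i in range(1000)]
subset = sample_lines(corpus)
print(len(subset))  # 5% of 1000 lines
```

Fixing the seed means the same subset, and so the same n-gram tables, can be regenerated later.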

The data was processed to remove numbers, punctuation, URLs, Twitter-specific features (such as mentions and hashtags), symbols, and stopwords.
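
The cleaning steps might look roughly like the Python sketch below. It is an illustration only: the regular expressions, the lowercasing step, the tiny stopword set and the name `clean` are all assumptions, not the app's actual preprocessing.

```python
import re

# Illustrative subset only; a real stopword list is much longer (assumption).
STOPWORDS = {"a", "an", "the", "is", "at", "to", "of", "and"}

def clean(text):
    """Strip URLs, Twitter features, numbers, punctuation, symbols and stopwords."""
    text = text.lower()                                  # lowercasing is an assumption
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs first, before punctuation
    text = re.sub(r"\brt\b|[@#]\w+", " ", text)          # retweet marker, mentions, hashtags
    text = text.replace("'", "")                         # keep contractions as one token
    text = re.sub(r"[^a-z\s]", " ", text)                # numbers, punctuation, symbols
    return [w for w in text.split() if w not in STOPWORDS]

cleaned = clean("RT @bob: Check https://t.co/abc the 3 BEST cats!! #wow")
print(cleaned)
```

URLs are removed before punctuation so that a link is dropped whole rather than broken into stray word fragments.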

About the Algorithm

To construct a language model, a modified Kneser-Ney smoothing approach was used to assign probabilities to unknown n-grams. This process entails:

-Discounting: removing a fixed probability mass derived from the maximum likelihood estimate and reassigning it to out-of-corpus n-grams

-Interpolation-backoff: recursively combining probabilities from lower-order models to assign probabilities to higher-order n-grams

-Contextualization: predicting the likelihood of a word based on the different contexts in which it appears
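
The three ideas above can be sketched for the bigram case in Python. This is the basic interpolated Kneser-Ney estimate with a single discount `d`, not the modified variant with count-dependent discounts, and it is not the app's code; `kneser_ney_bigram`, `d=0.75` and the toy sentence are assumptions.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    """Return prob(w, v) = P_KN(w | v) for a bigram model (illustrative)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_total = Counter()        # how often v is followed by anything
    followers = defaultdict(set)     # distinct words seen after v
    preceders = defaultdict(set)     # distinct words seen before w
    for (v, w), c in bigrams.items():
        context_total[v] += c
        followers[v].add(w)
        preceders[w].add(v)
    n_types = len(bigrams)           # number of distinct bigrams

    def prob(w, v):
        # Contextualization: how many different contexts does w complete?
        p_cont = len(preceders.get(w, ())) / n_types
        if context_total[v] == 0:
            return p_cont            # unseen context: lower-order estimate only
        # Discounting: subtract d from the observed count.
        discounted = max(bigrams[(v, w)] - d, 0) / context_total[v]
        # Interpolation: hand the freed mass to the lower-order estimate.
        lam = d * len(followers.get(v, ())) / context_total[v]
        return discounted + lam * p_cont

    return prob

tokens = "the cat sat on the mat the cat ran".split()
prob = kneser_ney_bigram(tokens)
print(prob("cat", "the"), prob("ran", "the"))
```

Here "ran" never follows "the" in the toy input, yet it still receives a small probability from the mass freed by discounting, and the probabilities over the vocabulary sum to 1 for a seen context.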

How It Works

The interpolation-backoff method first looks for the next word in the higher-order n-gram tables and then backs off to lower-order tables, whose probabilities are also combined into the estimates for higher-order n-grams.

Thus, if you input a string of text, it will return predictions for the next word, working from the highest-order n-gram table down to the lowest.

predictNext('Michael you are')
        pred         pkn ngram
 1:  awesome 0.007737087     3
 2:    ready 0.005354313     3
 3:   coming 0.004575207     3
 4:   living 0.004448141     3
 5:  amazing 0.004277860     3
 6:  playing 0.004125545     3
 7: expected 0.003848106     2
 8:   coming 0.002712964     2
 9:     free 0.002283685     2
10:   pretty 0.002168034     2
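
The high-to-low lookup can be sketched in Python as follows. This is a simplified illustration, not the app's `predictNext`: the hypothetical `predict_next` scores candidates by raw relative frequency rather than a Kneser-Ney probability, and the toy training text is an assumption. Unlike the output above, where "coming" appears at two orders, the sketch keeps only the highest-order hit for each word.

```python
from collections import Counter

def build_tables(tokens, max_n=4):
    """Count n-grams for orders 2..max_n (hypothetical helper)."""
    return {n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
            for n in range(2, max_n + 1)}

def predict_next(phrase, tables, k=10):
    """Return up to k (word, score, order) tuples, highest order first."""
    words = phrase.lower().split()
    results, seen = [], set()
    for n in sorted(tables, reverse=True):       # quadgram, trigram, bigram
        context = tuple(words[-(n - 1):])
        if len(context) < n - 1:
            continue                             # phrase too short for this order
        matches = Counter({g[-1]: c for g, c in tables[n].items()
                           if g[:-1] == context})
        total = sum(matches.values())
        for word, count in matches.most_common():
            if word not in seen:                 # keep the highest-order hit only
                seen.add(word)
                results.append((word, count / total, n))
    return results[:k]

tokens = "you are awesome you are ready you are awesome we are free".split()
preds = predict_next("Michael you are", build_tables(tokens))
for word, score, order in preds:
    print(word, round(score, 3), order)
```

No quadgram in the toy text begins with "michael you are", so the lookup falls through to the trigram table and then to the bigram table, which contributes the one candidate not already found.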

How to Use It

Go to https://mbaldassaro.shinyapps.io/predictionary/ and type some text into the “Enter Text” box. Below the text, 10 suggested words will appear as well as a “Next Word Probability” box that shows the predicted probability of each word based on frequencies drawn from the n-gram tables.