Michael Baldassaro
7/23/2018
Predictionary is designed to predict the next word from an input phrase based on n-gram (unigram, bigram, trigram and quadgram) tables.
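To make the idea of an n-gram table concrete, here is a minimal Python sketch that counts every run of n consecutive words in a toy sentence. It is an illustration only, not the app's own code; the function name `ngram_table` and the sample sentence are assumptions made for the example.

```python
from collections import Counter

def ngram_table(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy input; the real tables are built from the sampled corpora.
tokens = "you are awesome and you are ready".split()

# One table per order: unigram (1) through quadgram (4).
tables = {n: ngram_table(tokens, n) for n in range(1, 5)}

for gram, count in tables[2].most_common(3):
    print(" ".join(gram), count)
```

Each table maps a tuple of words to the number of times that sequence occurs, which is the raw material a smoothing method turns into probabilities.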
The corpora used to generate the n-gram tables are composed of blog posts, news articles and tweets provided by SwiftKey via Coursera.
The combined corpora contained over 4 million lines of text. For speed and efficiency, a 5% sample (200,000+ lines) was used to generate the n-gram tables used for prediction.
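A 5% line sample can be drawn in a few lines. The sketch below is in Python and assumes simple random sampling without replacement with a fixed seed; the write-up does not say how the sample was actually drawn, and `sample_lines` is a hypothetical name.

```python
import random

def sample_lines(lines, fraction=0.05, seed=42):
    """Return a reproducible random sample of the given lines.

    Assumption: sampling without replacement; the seed is arbitrary.
    """
    rng = random.Random(seed)
    k = int(len(lines) * fraction)
    return rng.sample(lines, k)

# Stand-in for the combined corpora.
lines = [f"line {i}" for i in range(1000)]
subset = sample_lines(lines)
print(len(subset))
```

Fixing the seed keeps the n-gram tables reproducible between runs.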
The data was processed to remove numbers, punctuation, URLs, Twitter-specific features, symbols, and stopwords.
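The cleaning steps listed above could look roughly like the following Python sketch. The regular expressions, the tiny stopword set, and the name `clean_line` are all assumptions for illustration; the actual pipeline and stopword list are not given in this write-up.

```python
import re

# Tiny illustrative stopword set; a real list would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}

def clean_line(text):
    """Lowercase a line and strip the features named in the text."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)                # mentions and hashtags
    text = re.sub(r"\brt\b", " ", text)                 # retweet marker
    text = text.replace("'", "")                        # keep contractions whole
    text = re.sub(r"[^a-z\s]", " ", text)               # numbers, punctuation, symbols
    return [w for w in text.split() if w not in STOPWORDS]

print(clean_line("RT @bob: Check https://example.com the 3 BEST tips!!! #winning"))
```

URLs and Twitter features are removed before punctuation so that their fragments do not survive as stray words.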
To construct a language model, a modified Kneser-Ney smoothing approach was used to assign probabilities to unknown n-grams. This process entails:
-Discounting: removing a fixed probability mass derived from the maximum likelihood estimate of observed n-grams and reassigning it to out-of-corpus n-grams
-Interpolation-backoff: recursively combining probabilities from lower-order models to assign probabilities to higher-order n-grams
-Contextualization: estimating the likelihood of a word from the number of different contexts in which it appears
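The three ideas above can be sketched for the simplest case, a bigram model. The Python code below implements standard interpolated Kneser-Ney for bigrams, not the modified variant used by the app; the discount of 0.75 and the name `kneser_ney_bigram` are assumptions made for the example.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, discount=0.75):
    """Return prob(w, v) = P(w | v) under interpolated Kneser-Ney (bigrams only)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_total = Counter()      # how often v is followed by any word
    followers = defaultdict(set)   # distinct words seen after v
    preceders = defaultdict(set)   # distinct words seen before w
    for (v, w), c in bigrams.items():
        context_total[v] += c
        followers[v].add(w)
        preceders[w].add(v)
    n_types = len(bigrams)         # number of distinct bigram types

    def prob(w, v):
        # Contextualization: how many different contexts does w complete?
        p_cont = len(preceders.get(w, ())) / n_types
        if context_total[v] == 0:
            return p_cont          # unseen context: lower-order estimate only
        # Discounting: subtract a fixed amount from each observed count.
        discounted = max(bigrams[(v, w)] - discount, 0) / context_total[v]
        # Interpolation: hand the freed mass to the lower-order estimate.
        weight = discount * len(followers[v]) / context_total[v]
        return discounted + weight * p_cont

    return prob

tokens = "you are awesome and you are ready".split()
prob = kneser_ney_bigram(tokens)
print(prob("awesome", "are"), prob("you", "are"))
```

A word never seen after "are" still receives a nonzero probability, which is the point of the discounting step. Higher-order models apply the same recursion one level up.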
The interpolation-backoff method first looks for the next word in the higher-order n-gram tables and then in the lower-order tables, whose probabilities are also used to assign probabilities to the higher-order n-grams.
Thus, if you input a string of text, the app returns predictions for the next word, drawn first from the higher-order n-gram tables and then from the lower-order ones.
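The high-to-low lookup can be sketched as follows. This Python version ranks candidates by raw counts rather than Kneser-Ney probabilities and is only an analogue of the app's function; `build_tables`, `predict_next`, and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=4):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for n in tables:
        for i in range(len(tokens) - n + 1):
            *context, nxt = tokens[i:i + n]
            tables[n][tuple(context)][nxt] += 1
    return tables

def predict_next(phrase, tables, k=10):
    """Return up to k (word, count, n) candidates, longest context first."""
    words = phrase.lower().split()
    results = []
    for n in sorted(tables, reverse=True):
        if len(words) < n - 1:
            continue                      # phrase too short for this order
        context = tuple(words[-(n - 1):])
        followers = tables[n].get(context, Counter())
        for word, count in followers.most_common():
            results.append((word, count, n))
    return results[:k]

tokens = "you are awesome and you are ready and you are awesome".split()
tables = build_tables(tokens)
for row in predict_next("Michael you are", tables):
    print(row)
```

Because "michael you are" never occurs in the toy corpus, the quadgram table contributes nothing and the candidates come from the trigram and then the bigram table. A word can therefore appear more than once, at different orders.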
predictNext('Michael you are')
        pred         pkn ngram
 1:  awesome 0.007737087     3
 2:    ready 0.005354313     3
 3:   coming 0.004575207     3
 4:   living 0.004448141     3
 5:  amazing 0.004277860     3
 6:  playing 0.004125545     3
 7: expected 0.003848106     2
 8:   coming 0.002712964     2
 9:     free 0.002283685     2
10:   pretty 0.002168034     2
Go to https://mbaldassaro.shinyapps.io/predictionary/ and type some text into the “Enter Text” box. Below the text, 10 suggested words will appear as well as a “Next Word Probability” box that shows the predicted probability of each word based on frequencies drawn from the n-gram tables.