The Amazing Zoltar!

Motivation: The Amazing Zoltar is a “character” from the 1988 movie Big.
Zoltar, a “magic wish machine” at a carnival, granted a young boy's wish to be “big”.
Our studio is going to remake Big and make a HUNDRED MILLION dollars!

Image of Zoltar

In our reboot, Zoltar will not just grant wishes, he will predict the future as well!

A web application was written in R using Shiny, a framework for building R-driven web apps, and is hosted on ShinyApps.io

Instructions at the top of the page invite users to enter text into a text box above the image of Zoltar
A submit button triggers the prediction algorithm (described below)
The text is passed to a series of functions
User-entered text is split into words, with punctuation removed and letters converted to lowercase
The algorithm then returns its prediction of the next word, which is displayed in Zoltar's mouth
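
The text-cleaning step can be sketched in a few lines of base R. This is a minimal illustration, not the app's actual code: the function names `tokenize` and `last_words` are hypothetical, and keeping apostrophes (so that "don't" stays one word) is an assumption.

```r
# Hypothetical helper: turn raw user input into lowercase word tokens.
tokenize <- function(text) {
  text <- tolower(text)
  # Replace anything that is not a letter, digit, apostrophe, or space with a space
  text <- gsub("[^a-z0-9' ]+", " ", text)
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words[nzchar(words)]  # drop any empty strings
}

# Hypothetical helper: keep only the last n words, which are all the predictor needs
last_words <- function(words, n = 2) tail(words, n)

last_words(tokenize("I wish I were BIG!"))
```

Only the tail of the input matters to the predictor, so the rest of the sentence can be discarded right after cleaning.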

What follows is a semi-technical description of the algorithm
R “data.table” structures were created to hold three things:
- Sentence fragments of one or two words
- Predicted next word
- The “smoothed” value for the combination of the predicted word given the previous word(s)
The “data.tables” were filled with sentence fragments and their frequencies
The training text came from Twitter, blog posts, and news articles
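
To make the table layout concrete, here is a rough sketch of how fragments and their frequencies could be counted. It uses a plain base-R data frame rather than data.table so it runs without extra packages, and the function name `count_fragments`, the column names, and the three toy sentences are all made up for illustration.

```r
# Hypothetical helper: count how often each word follows each k-word fragment.
# `sentences` is a list of character vectors, one vector of words per sentence.
count_fragments <- function(sentences, k) {
  rows <- do.call(rbind, lapply(sentences, function(w) {
    if (length(w) <= k) return(NULL)  # sentence too short to yield a fragment
    idx <- seq_len(length(w) - k)
    data.frame(
      context = vapply(idx, function(i) paste(w[i:(i + k - 1)], collapse = " "),
                       character(1)),
      nxt = w[idx + k],
      stringsAsFactors = FALSE
    )
  }))
  # Add a count of 1 per occurrence, then sum the counts per (context, next word) pair
  out <- aggregate(freq ~ context + nxt, data = cbind(rows, freq = 1L), FUN = sum)
  out[order(out$context, -out$freq), ]
}

# Toy training text (invented, not the real corpus)
sentences <- list(
  c("i", "wish", "i", "were", "big"),
  c("i", "wish", "you", "well"),
  c("i", "wish", "i", "could")
)

pairs   <- count_fragments(sentences, 1)  # one-word fragment + next word
triples <- count_fragments(sentences, 2)  # two-word fragment + next word
print(triples)
```

The smoothed value would be added as a third column once the raw frequencies are in place.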

Fragments of one, two and three words were used to train the predictive model
The last two words of user-entered text are used to try to predict the next word
- A “data.table” is searched for the two-word phrase
- If the phrase is found, a prediction is returned based on the “smoothed” value of the next word occurring after the previous words
- If the phrase is not found, the first word in the phrase is dropped, and another “data.table” is searched for the remaining word
- If the word is not found, the most frequent word found in the training data is returned
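
The fallback steps above can be sketched as follows. This is an illustration under stated assumptions, not the app's code: the two lookup tables hold invented scores, the fallback word "the" is a placeholder for whatever the most frequent training word turned out to be, and picking the single highest-scoring row is an assumed reading of "based on the smoothed value".

```r
# Toy lookup tables with made-up smoothed scores (illustration only)
two_word_tbl <- data.frame(
  context = c("i wish", "i wish", "wish i"),
  nxt     = c("i", "you", "were"),
  score   = c(0.55, 0.30, 0.40),
  stringsAsFactors = FALSE
)
one_word_tbl <- data.frame(
  context = c("wish", "were", "were"),
  nxt     = c("i", "big", "here"),
  score   = c(0.60, 0.45, 0.20),
  stringsAsFactors = FALSE
)
most_frequent_word <- "the"  # assumed placeholder for the top training word

# Hypothetical helper: best-scoring next word for a context, or NA if the context is unseen
best_match <- function(tbl, context) {
  hits <- tbl[tbl$context == context, ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$nxt[which.max(hits$score)]
}

# Try the two-word phrase, then the last word alone, then the global fallback
predict_next <- function(words) {
  if (length(words) >= 2) {
    guess <- best_match(two_word_tbl, paste(tail(words, 2), collapse = " "))
    if (!is.na(guess)) return(guess)
  }
  if (length(words) >= 1) {
    guess <- best_match(one_word_tbl, tail(words, 1))
    if (!is.na(guess)) return(guess)
  }
  most_frequent_word
}

predict_next(c("i", "wish"))        # found in the two-word table
predict_next(c("you", "were"))      # falls back to the one-word table
predict_next(c("purple", "zebra"))  # falls back to the most frequent word
```

Because every path ends in a return value, Zoltar always has something to say, even for input he has never seen.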

Smoothed values are based on how often a phrase was seen in the training data, compared to other phrases that share the same first word(s) but end in a different next word (the one we're trying to predict)
Smoothing is necessary to account for word combinations that were never seen in the training data
The algorithm uses a process called “Kneser-Ney” smoothing, which works well for word-sequence models like this one
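
For the curious, here is a simplified sketch of the idea behind Kneser-Ney smoothing, reduced to two-word phrases. It is not the production code: the counts are invented, the function name `kn_score` is hypothetical, and the discount of 0.75 and the handling of an unseen previous word are assumptions. A small amount is shaved off every observed count, and the freed-up share is handed to words in proportion to how many different words they have been seen to follow.

```r
# Toy counts of "previous word -> next word" pairs (one row per distinct pair)
pair_counts <- data.frame(
  prev = c("i", "i", "i", "wish", "wish", "were"),
  nxt  = c("wish", "were", "could", "i", "you", "big"),
  freq = c(3, 1, 1, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: interpolated Kneser-Ney score for `nxt` following `prev`.
kn_score <- function(prev, nxt, counts, discount = 0.75) {
  # Continuation probability: share of distinct pairs that end in `nxt`
  p_cont <- sum(counts$nxt == nxt) / nrow(counts)
  seen <- counts[counts$prev == prev, ]
  if (nrow(seen) == 0) return(p_cont)  # assumed behaviour for an unseen previous word
  total <- sum(seen$freq)
  pair  <- sum(seen$freq[seen$nxt == nxt])
  discounted <- max(pair - discount, 0) / total  # observed count, minus the discount
  leftover   <- discount * nrow(seen) / total    # probability mass freed by discounting
  discounted + leftover * p_cont
}

kn_score("i", "wish", pair_counts)  # pair seen often: high score
kn_score("i", "big", pair_counts)   # pair never seen: small but non-zero score
```

The second call is the point of smoothing: "i big" never appears in the toy data, yet it still gets a non-zero score instead of being ruled out entirely.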
I wrote the code myself! (How about a raise?)