The Amazing Zoltar!

  • Motivation: The Amazing Zoltar is a “character” from the 1988 movie Big.
  • Zoltar, a “magic wish machine” at a carnival, granted a young boy's wish to be “big”.
  • Our studio is going to remake Big and make a HUNDRED MILLION dollars!

Image of Zoltar

  • In our reboot, Zoltar will not just grant wishes; he will predict the future as well!

The Application: Zoltar

The web application was written in R using Shiny, a framework for building R-driven web apps, and is hosted on ShinyApps.io

  • Instructions at the top of the page invite users to enter text into a text box above the image of Zoltar
  • A submit button triggers the prediction algorithm described on the following slides
  • The text is passed to a series of functions
  • User-entered text is split into words, with unnecessary punctuation removed and letters converted to lowercase
  • The algorithm then returns a prediction of the next word, which is displayed in Zoltar's mouth
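
The text-cleaning step above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the app's R code; the function name `tokenize` and the rule of keeping in-word apostrophes are assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    # Assumption: a word is a run of letters, optionally joined by straight
    # apostrophes (as in "don't"). Everything else acts as a separator.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

print(tokenize("I wish I were... BIG!"))
# prints ['i', 'wish', 'i', 'were', 'big']
```

Only the last two words of the result are needed for the prediction step.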

The Algorithm: How Zoltar Predicts the Future

  • What follows is a semi-technical description of the algorithm
  • R “data.table” structures were created to hold three things:
    • Sentence fragments of one or two words
    • Predicted next word
    • The “smoothed” value for the combination of the predicted word given the previous word(s)
  • The “data.tables” were filled with sentence fragments and their frequencies
  • The training text came from Twitter posts, blog posts, and news articles
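
Filling the tables with fragments and their frequencies amounts to counting runs of consecutive words. The sketch below shows one way to do that in Python, using a `Counter` in place of an R data.table. The helper `count_ngrams` and the two-sentence corpus are hypothetical and stand in for the real training text.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count every run of n consecutive words across tokenized sentences."""
    counts = Counter()
    for words in sentences:
        # Sentences shorter than n words contribute nothing.
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Hypothetical toy corpus, already split into lowercase words.
corpus = [
    ["i", "wish", "i", "were", "big"],
    ["i", "wish", "you", "were", "here"],
]

unigrams = count_ngrams(corpus, 1)  # one-word fragments
bigrams = count_ngrams(corpus, 2)   # one context word + next word
trigrams = count_ngrams(corpus, 3)  # two context words + next word
```

Here `bigrams[("i", "wish")]` is 2 because the phrase appears once in each sentence.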

The Algorithm (cont'd)

  • Fragments of one, two and three words were used to train the predictive model
  • The last two words of user-entered text are used to try to predict the next word
    • A “data.table” is searched for the two-word phrase
    • If the phrase is found, a prediction is returned based on the “smoothed” value of the next word occurring after the previous words
    • If the phrase is not found, its first word is dropped, and a second “data.table” is searched for the remaining word
    • If the word is not found, the most frequent word found in the training data is returned
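
The lookup steps above can be sketched as a short back-off function. This is an illustrative Python version, not the app's R code. The table layout (context mapped to candidate words with smoothed scores), the function name `predict_next`, the toy scores, and the choice to return the highest-scoring candidate are all assumptions.

```python
def predict_next(words, trigram_table, bigram_table, default_word):
    """Try the last two words, then the last word, then a default."""
    if len(words) >= 2:
        candidates = trigram_table.get(tuple(words[-2:]))
        if candidates:
            # Assumption: pick the candidate with the highest smoothed score.
            return max(candidates, key=candidates.get)
    if words:
        candidates = bigram_table.get((words[-1],))
        if candidates:
            return max(candidates, key=candidates.get)
    # Nothing matched: fall back to the most frequent training word.
    return default_word

# Hypothetical toy tables: context -> {next word: smoothed score}.
trigram_table = {("i", "wish"): {"i": 0.6, "you": 0.4}}
bigram_table = {
    ("wish",): {"i": 0.55, "you": 0.45},
    ("were",): {"big": 0.7, "here": 0.3},
}

print(predict_next(["i", "wish"], trigram_table, bigram_table, "the"))
# prints i
print(predict_next(["they", "were"], trigram_table, bigram_table, "the"))
# prints big
```

The second call misses the two-word table and succeeds on the one-word table. Input such as "hello world" misses both tables and returns the default.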

The Algorithm - Smoothing

  • Smoothed values reflect how often a phrase was seen in the training data relative to other phrases that share the same first word(s) but end in a different next word (the one we're trying to predict)
  • Smoothing is necessary to account for word combinations that were never seen in the training data
  • The algorithm uses a process called “Kneser-Ney” smoothing, which is widely used for word-sequence prediction models like this one
  • I wrote the code myself! (How about a raise?)
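
For readers curious about the arithmetic, here is a minimal Python sketch of interpolated Kneser-Ney smoothing for two-word phrases. It is not the code used in the app. The function name, the toy counts, and the discount of 0.75 are assumptions chosen for illustration. The three-word case applies the same idea one level up.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Return p(word, prev): interpolated Kneser-Ney bigram probability."""
    context_total = Counter()     # how often prev starts any bigram
    followers = defaultdict(set)  # distinct words seen after prev
    preceders = defaultdict(set)  # distinct words seen before word
    for (prev, word), count in bigram_counts.items():
        context_total[prev] += count
        followers[prev].add(word)
        preceders[word].add(prev)
    bigram_types = len(bigram_counts)

    def prob(word, prev):
        # Continuation probability: in how many distinct contexts does
        # `word` appear, out of all distinct bigrams?
        continuation = len(preceders.get(word, ())) / bigram_types
        total = context_total[prev]
        if total == 0:
            return continuation  # unseen context: continuation only
        seen = max(bigram_counts.get((prev, word), 0) - discount, 0) / total
        # Probability mass removed by discounting, spread over all words.
        backoff_weight = discount * len(followers.get(prev, ())) / total
        return seen + backoff_weight * continuation

    return prob

# Hypothetical toy counts: (previous word, next word) -> frequency.
bigram_counts = {
    ("i", "wish"): 2, ("wish", "i"): 1, ("wish", "you"): 1,
    ("i", "were"): 1, ("you", "were"): 1,
    ("were", "big"): 1, ("were", "here"): 1,
}
p = kneser_ney_bigram(bigram_counts)
vocab = {word for (_, word) in bigram_counts}
```

With these counts, `p("were", "wish")` is greater than zero even though "wish were" never occurs, and the probabilities of all words in `vocab` after "wish" sum to 1.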