Capstone - Predict Next Word

Kevin Carhart
January 7, 2019

Preparation

I did not want to use sampling, so I broke the corpus into 20 pieces and processed each piece separately, up to a point. I converted the 20 segments into unigrams, bigrams and trigrams using quanteda. The issue then was that I wanted to remove low-frequency ngrams, but pruning the frequency-1 ngrams from each of my arbitrary segments would have cut too much data, since an ngram that appears only once in a given segment may have a frequency higher than 1 overall. After several steps I worked this out and had ngram tables ready for an algorithm to address.
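As a rough sketch of that workflow (the helper name count_ngrams and the segments list are assumptions of this example, not the app's actual code), the counts from each segment can be tallied first and pruned only after they have been summed across all 20 segments:

```r
library(quanteda)
library(data.table)

# Hypothetical helper: count ngrams of order n for one corpus segment.
count_ngrams <- function(segment_text, n) {
  toks   <- tokens(segment_text, remove_punct = TRUE, remove_numbers = TRUE)
  ngrams <- tokens_ngrams(toks, n = n, concatenator = " ")
  d      <- dfm(ngrams)
  data.table(ngram = featnames(d), freq = colSums(d))
}

# 'segments' is assumed to be a list of 20 character vectors.
# Count each segment separately, then sum the counts across segments
# BEFORE pruning, so an ngram that appears once in several segments
# is not discarded prematurely.
per_segment <- lapply(segments, count_ngrams, n = 3)
trigrams    <- rbindlist(per_segment)[, .(freq = sum(freq)), by = ngram]

# Only now is it safe to drop the low-frequency ngrams.
trigrams <- trigrams[freq > 1]
```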

A description of the algorithm used to make the prediction

Using sqldf and data.table, the algorithm assembles four small SQL result tables of ngrams: observed trigrams, unobserved trigrams, observed bigrams and unobserved bigrams. The two “observed” result sets are found using Maximum Likelihood Estimates. It would be possible to stop there and get something about as good as a crude (and fun) Markov Chainer toy. However, I drew upon a lovely and invaluable guide to Katz Backoff written by Michael Szczepaniak in order to make the model better than plain MLE. The concept is that since the unseen ngrams are underrepresented and the seen ngrams are overrepresented, you can steal some probability mass from the observed and apply it to the unobserved. (Katz designed one scheme; there are other approaches, such as the Stupid Backoff of Brants et al.) The way this is turned into a tangible result set is by divvying up the stolen mass among all possible prediction words according to their relative weightings in the n-1 gram. Please refer to the Stanford NLP lectures by Dan Jurafsky for more detail about probability mass redistribution.
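To make the redistribution concrete, here is a minimal data.table sketch of a single backoff level. It uses a fixed absolute discount (gamma) rather than the Good-Turing discounts of the full Katz formulation, and the object names (trigrams and bigrams with ngram/freq columns, plus the user's two-word prefix) are assumptions of this example rather than the app's actual structures.

```r
library(data.table)

gamma <- 0.5   # fixed absolute discount (the real Katz model uses Good-Turing)

# Observed trigrams that start with the user's bigram prefix:
# discounted MLE  q(w | prefix) = (count(prefix w) - gamma) / count(prefix)
obs <- trigrams[startsWith(ngram, paste0(prefix, " "))]
prefix_count <- bigrams[ngram == prefix, freq]
obs[, `:=`(word = sub(".* ", "", ngram),
           prob = (freq - gamma) / prefix_count)]

# The mass "stolen" from the observed trigrams...
alpha <- 1 - sum(obs$prob)

# ...is divvied up among the unobserved tail words in proportion to
# their weight in the n-1 gram, i.e. bigrams starting with the last
# word of the prefix.
w2   <- sub(".* ", "", prefix)
back <- bigrams[startsWith(ngram, paste0(w2, " "))]
back[, word := sub(".* ", "", ngram)]
unobs <- back[!(word %in% obs$word)]
unobs[, prob := alpha * freq / sum(freq)]
```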

Now I have four result sets, some of which may be empty. I used a SQL 'UNION' and 'ORDER BY' to assemble them into one big table, ordered from richest to most barren results. The backing off happens passively: a result set with zero rows simply contributes nothing to the UNION, so when you select the top three rows you are selecting from however far down the list the query had to back off to find three rows. If necessary, it backs off all the way to hardcoded articles such as “the.”
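Schematically, the query looks something like the sketch below; the table names and the rank column are illustrative assumptions rather than the app's actual schema. An empty table contributes no rows, so LIMIT 3 automatically reaches as far down the ordering as it has to.

```r
library(sqldf)

# obs_tri, unobs_tri, obs_bi, unobs_bi and fallback are assumed to be
# data frames with 'word' and 'prob' columns, ordered here from the
# richest result set to the most barren.
top3 <- sqldf("
  SELECT word, prob, 1 AS rank FROM obs_tri
  UNION ALL
  SELECT word, prob, 2 AS rank FROM unobs_tri
  UNION ALL
  SELECT word, prob, 3 AS rank FROM obs_bi
  UNION ALL
  SELECT word, prob, 4 AS rank FROM unobs_bi
  UNION ALL
  SELECT word, prob, 5 AS rank FROM fallback   -- hardcoded articles such as 'the'
  ORDER BY rank, prob DESC
  LIMIT 3
")
```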

A description of the Shiny app and how to use it

When you enter a series of words in the box, the algorithm described above is triggered after a moment's delay. The predicted words appear in clickable bubbles. You can add a word to the end of your phrase by clicking its bubble, and the prediction algorithm will begin to churn again. The 1st, 2nd and 3rd best predictions are the left, middle and right bubbles, respectively. Thank you, Enrique Estrada and Rebecca Kotula, for the clickable-words idea. Thank you, Len Greski, for saying “use quanteda, data.table and sqldf.”
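The interaction can be sketched in a few lines of Shiny; this is a minimal illustration rather than the deployed app, and predict_next() is an assumed stand-in for the backoff algorithm above, returning the three best words, best first.

```r
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  uiOutput("bubbles")
)

server <- function(input, output, session) {
  # Wait a moment after typing stops before running the prediction.
  phrase <- debounce(reactive(input$phrase), 500)

  preds <- reactive(predict_next(phrase()))   # assumed helper

  # Render the three predictions as clickable bubbles, best on the left.
  output$bubbles <- renderUI({
    lapply(seq_along(preds()), function(i)
      actionButton(paste0("word", i), preds()[i]))
  })

  # Clicking a bubble appends that word, which triggers prediction again.
  lapply(1:3, function(i)
    observeEvent(input[[paste0("word", i)]], {
      updateTextInput(session, "phrase",
                      value = paste(input$phrase, preds()[i]))
    }))
}

shinyApp(ui, server)
```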

A description of Serendipity Mode

While some users want to use the app for pragmatic reasons, the “best” prediction may be in the eye of the beholder for others. If you click the Serendipity Mode radio button, you will be presented with three random words. This could be a way of getting more colorful, concrete nouns and adjectives into the mix. Many of these interesting, low-frequency words have been cut out of the ngrams as a tradeoff for speed. A lot of popular mobile-device apps are toys and games, so we may want to consider use cases involving imagination. If the users love it, they'll think we are geniuses.
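In server terms the mode is just a branch; this is a hedged sketch in which the input name mode, the vocab vector and predict_next() are assumptions of the example, not the app's actual identifiers.

```r
# Choose between the backoff predictions and three random words,
# depending on the Serendipity Mode radio button.
words_to_show <- reactive({
  if (input$mode == "Serendipity") {
    sample(vocab, 3)           # three random words from the retained vocabulary
  } else {
    predict_next(phrase())     # the backoff predictions described above
  }
})
```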