Predict Next Word

Kevin Carhart - January 8, 2019

Data Scrubbing and Preparation


The early period is already described in the Milestone Report and won't be repeated here. Some work went into removing low-frequency grams: ideally you would have quanteda generate counts over as much data as possible, but resource limits make that difficult, so the rare grams had to go. Eventually I got past this and had unigram, bigram and trigram tables ready for an algorithm to work with.
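
As a rough illustration of that preparation step, here is a minimal sketch of how the n-gram count tables might be built with quanteda and data.table and then pruned of low-frequency grams. The corpus_sample variable, the column names and the frequency cutoff are stand-ins, not the exact values used in the project.

    library(quanteda)
    library(data.table)

    # corpus_sample is assumed to be a character vector of cleaned training text
    toks <- tokens(corpus_sample, remove_punct = TRUE, remove_numbers = TRUE)

    make_ngram_dt <- function(toks, n, min_count = 4) {
      dfm_n <- dfm(tokens_ngrams(toks, n = n, concatenator = " "))
      dt <- data.table(ngram = featnames(dfm_n), freq = colSums(dfm_n))
      dt[freq >= min_count]   # drop low-frequency grams to keep the tables small
    }

    unigrams <- make_ngram_dt(toks, 1)
    bigrams  <- make_ngram_dt(toks, 2)
    trigrams <- make_ngram_dt(toks, 3)
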
Description of the algorithm used to make predictions

Using sqldf and data.table, the algorithm assembles four small data.tables: observed trigrams, unobserved trigrams, observed bigrams and unobserved bigrams. The two “observed” result sets are scored with Maximum Likelihood Estimates. It would be possible to stop there and get something about as good as a crude (and fun) Markov-chainer toy. However, I drew on a lovely and invaluable guide to Katz Backoff written by Michael Szczepaniak in order to make the model better than plain MLE.
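
For concreteness, the observed-trigram piece might look something like the sketch below, assuming the trigrams and bigrams tables built earlier (ngram and freq columns); the function and column names are illustrative, not the project's actual code.

    library(data.table)

    observed_trigram_mle <- function(prefix, trigrams, bigrams) {
      # prefix is the last two words of the user's phrase, e.g. "a case"
      obs <- trigrams[startsWith(ngram, paste0(prefix, " "))]
      if (nrow(obs) == 0L) return(obs)
      prefix_count <- bigrams[ngram == prefix, freq]
      obs[, word := sub(".* ", "", ngram)]     # last token is the candidate word
      obs[, prob := freq / prefix_count]       # MLE: count(w1 w2 w3) / count(w1 w2)
      obs[order(-prob)]
    }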

The concept here is that since the unseen n-grams are underrepresented and the seen n-grams are overrepresented, you can steal some probability mass from the observed and apply it to the unobserved. This is turned into a tangible result set by divvying up the stolen mass among all possible prediction words according to their relative weightings in the (n-1)-gram counts.
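
The sketch below shows the flavor of that redistribution using a fixed absolute discount; Szczepaniak's guide (and Katz proper) derives the discount from Good-Turing counts, so treat this as a simplified illustration. Here obs is assumed to hold the observed completions from the previous sketch and unobs_lower the lower-order rows for candidate words not seen after the prefix, both with word and freq columns.

    library(data.table)

    redistribute_mass <- function(obs, unobs_lower, discount = 0.5) {
      obs   <- copy(obs)
      unobs <- copy(unobs_lower)
      total <- sum(obs$freq)
      obs[, prob := (freq - discount) / total]    # discounted MLE for the observed words
      alpha <- nrow(obs) * discount / total       # total probability mass "stolen"
      unobs[, prob := alpha * freq / sum(freq)]   # divvied up by relative weighting
      rbindlist(list(obs[, .(word, prob)], unobs[, .(word, prob)]))[order(-prob)]
    }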

A few moments later, I have four result sets, some of which may be empty. A SQL UNION and ORDER BY assemble these tables into one big table, ordered from the richest results to the most barren. The backing off happens passively: a table with 0 rows simply contributes nothing to the UNION, so when you select the top three rows you are selecting from however many UNION'ed layers were needed to reach the third row. If necessary, the model backs off all the way to hardcoded articles such as “the.”
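
A rough sketch of that layered query, assuming the four result sets are data frames named obs_tri, unobs_tri, obs_bi and unobs_bi, each with word and prob columns plus a tier column marking how rich the source is (1 = observed trigrams through 4 = unobserved bigrams). The names and the tier device are illustrative; UNION ALL is used here simply to keep every layer's rows.

    library(sqldf)

    # hardcoded fallback so the app always has three words to offer
    articles <- data.frame(word = c("the", "a", "and"), prob = 0, tier = 5)

    # an empty result set contributes no rows, so the backoff is passive
    top3 <- sqldf("
      SELECT word, prob, tier FROM obs_tri
      UNION ALL SELECT word, prob, tier FROM unobs_tri
      UNION ALL SELECT word, prob, tier FROM obs_bi
      UNION ALL SELECT word, prob, tier FROM unobs_bi
      UNION ALL SELECT word, prob, tier FROM articles
      ORDER BY tier ASC, prob DESC
      LIMIT 3")
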
Description of the Shiny App and how to use it

When you enter a series of words in the textarea, the algorithm described above is triggered through Shiny reactivity after a moment's delay. The predicted words appear in clickable bubbles: the 1st, 2nd and 3rd best predictions are the left, middle and right bubbles. Clicking a bubble appends that word to the end of your phrase, and the prediction algorithm begins to churn again.

You can also resize the textarea by dragging the corner.
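
A stripped-down sketch of how that wiring could look in Shiny, assuming a predict_next(phrase) function that returns the top three candidate words; the input ids, the 800 ms delay and the layout are placeholders rather than the app's actual code.

    library(shiny)

    ui <- fluidPage(
      textAreaInput("phrase", "Type a phrase:", resize = "both"),
      uiOutput("bubbles")
    )

    server <- function(input, output, session) {
      # wait a moment after the last keystroke before predicting
      phrase_delayed <- debounce(reactive(input$phrase), 800)
      preds <- reactive(predict_next(phrase_delayed()))

      # one clickable bubble per prediction: best, second best, third best
      output$bubbles <- renderUI({
        tagList(lapply(seq_along(preds()), function(i) {
          actionButton(paste0("pick", i), preds()[i])
        }))
      })

      # clicking a bubble appends that word and the prediction churns again
      lapply(1:3, function(i) {
        observeEvent(input[[paste0("pick", i)]], {
          updateTextAreaInput(session, "phrase",
                              value = paste(input$phrase, preds()[i]))
        })
      })
    }

    shinyApp(ui, server)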

I would like to acknowledge Enrique Estrada and Rebecca Kotula for the clickable-words idea.

Bonus: Serendipity Mode



While some users want to use the app for pragmatic reasons, the “best” prediction may be in the eye of the beholder for others. If you click the Serendipity Mode radio button, you will be presented with three random words. This could be a way of getting more colorful, concrete nouns and adjectives into the mix. Many of these interesting, low-frequency words have been cut out of the ngrams as a tradeoff for speed. A lot of popular mobile-device apps are toys and games, so we may want to consider use cases involving imagination. A user who finds our suggestions emotionally rich is (perhaps) going to think the app has mystical powers beyond what it actually has.
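
Inside the server function, the switch might be no more than a branch like the one below, continuing the names from the Shiny sketch above (phrase_delayed, predict_next) and assuming a radio button with id mode plus the unigram table built earlier; again, these names are illustrative.

    # serve three random unigrams instead of the backoff model's picks
    preds <- reactive({
      if (input$mode == "serendipity") {
        sample(unigrams$word, 3)
      } else {
        predict_next(phrase_delayed())
      }
    })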