Using sqldf and data.table, the algorithm assembles four small data.tables: observed trigrams, unobserved trigrams, observed bigrams, and unobserved bigrams. The two “observed” result sets are scored with Maximum Likelihood Estimates (MLE). It would be possible to stop there and end up with something about as good as a crude (and fun) Markov Chainer toy. However, I drew upon a lovely and invaluable guide to Katz Backoff written by Michael Szczepaniak to make the model better than plain MLE.
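To make the MLE step concrete, here is a minimal sketch of how an observed-trigram layer might be scored with data.table. The table name, column names, and toy counts are illustrative, not the post's actual code.

```r
library(data.table)

# Hypothetical trigram counts table with columns w1, w2, w3, count
trigrams <- data.table(
  w1 = c("i", "i", "i"), w2 = c("love", "love", "hate"),
  w3 = c("you", "cats", "mondays"), count = c(7L, 3L, 5L)
)

# MLE for observed trigrams with the prefix "i love":
# count(w1 w2 w3) / count(w1 w2 *)
observed_tri <- trigrams[w1 == "i" & w2 == "love"]
observed_tri[, prob := count / sum(count)]
observed_tri[order(-prob)]
```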
The concept here is that the unseen n-grams are underrepresented and the seen n-grams are overrepresented, so you can steal some probability mass from the observed n-grams and apply it to the unobserved ones. The way this becomes a tangible result set is by divvying up the stolen mass among all possible prediction words based on their relative weights in the (n-1)-gram.
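Here is a rough sketch of that redistribution, assuming a fixed absolute discount `d`; the discount value, column names, and numbers are illustrative, and the actual scheme in Szczepaniak's guide (and in my code) may differ in detail.

```r
library(data.table)

d <- 0.5  # mass subtracted from each observed trigram count

# Observed trigrams for the prefix "i love": discounted MLE
obs <- data.table(w3 = c("you", "cats"), count = c(7L, 3L))
obs[, prob := (count - d) / sum(count)]

# The "stolen" mass left over for unobserved completions
alpha <- 1 - sum(obs$prob)

# Bigram-level weights for candidate tail words that never followed "i love"
unobs <- data.table(w3 = c("mondays", "pizza"), bigram_prob = c(0.02, 0.01))

# Divvy up alpha in proportion to those backed-off bigram weights
unobs[, prob := alpha * bigram_prob / sum(bigram_prob)]
```

With these toy numbers, the observed completions keep 0.90 of the mass and the remaining 0.10 is split two-to-one between the unseen candidates.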
A few moments later, I have four result sets, some of which may be empty. I used a SQL 'UNION' and 'ORDER BY' to assemble these tables into one big table, ordered from the richest results to the most barren. The backing off happens passively: a table with zero rows contributes nothing to the UNION, so when you select the top three rows, you are selecting from however many UNIONed layers it takes to fill them. If necessary, it backs off all the way to hardcoded articles such as “the.”
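Below is a minimal sketch of that passive fall-through with sqldf, assuming each layer already carries a precomputed prob column and a rank marking how rich it is (1 = richest, higher = more barren); the layer names and values are made up for illustration.

```r
library(sqldf)

observed_tri   <- data.frame(word = c("you", "cats"), prob = c(0.65, 0.25), rank = 1)
unobserved_tri <- data.frame(word = character(), prob = numeric(), rank = numeric())  # empty layer
observed_bi    <- data.frame(word = "mondays", prob = 0.10, rank = 3)

# Empty layers contribute no rows, so the top three rows fall through to
# whichever layers have something to offer.
sqldf("
  SELECT word, prob, rank FROM observed_tri
  UNION ALL
  SELECT word, prob, rank FROM unobserved_tri
  UNION ALL
  SELECT word, prob, rank FROM observed_bi
  ORDER BY rank, prob DESC
  LIMIT 3
")
```

Because the empty unobserved-trigram layer simply vanishes from the UNION, the LIMIT reaches down into the bigram layer without any explicit branching logic.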