What shall the next word... ... ... ... ... ... ...be? Ask Cassandra AI!

Bojan Zunar
26th November 2018

The Algorithm

English corpora were tokenised into bigrams, trigram, and fourgrams. For each n-gram, tokens were sorted by the number of occurrences. When user types in a phrase, the algorithm takes last three words, searches for fourgrams that start with the exact phrase, and outputs the last word of five most common fourgrams it finds. If it finds less than five such fourgrams, it takes the last two words and searches most common trigrams that start with them, etc. This approach is known as 'stupid backoff'. If typed combination of words is not found, the algorithm shows the five most common words in the English language.

As human languages are systems of rare events, i.e. most word combinations occur rarely, tokens that occur only once may be too unique to help us predict what the user wishes to type. Thus, we may not need all the tokens and so can save on RAM while speeding up the algorithm.

Thus, we looked at how many tokens occur only once and how they impact accuracy.

Are all the tokens needed?

The figure summarises tokenisation results. For each n-gram (each plotted in its facet), we see how many tokens were found and how many times they occur in the corpora. As expected, most tokens occur only once (they represent rare events in the language), and if we remove them, the dataset will become much more manageable. Moreover, we can also remove all but the five most common tokens that differ only in the last word. This approach also saves resources.

plot of chunk unnamed-chunk-1

How accurate is the algorithm?

We looked at different combinations of n-gram datasets to see which one will give the most accurate prediction on independent test corpus. We also calculated fivegrams but did not use them in the final algorithm as they consumed additional memory but did not improve the accuracy. Fourgrams that went on to trigrams and then to bigrams (4+3+2) were the most successful, and we opted for the version that included even bigrams occurring only once (4+3+2 all), as they would give unique predictions even for rare words. We also investigated would the additional words improve the accuracy. Saving more than five most likely words did not help much. Each extra word increased accuracy by less than 2%.

plot of chunk unnamed-chunk-2

Cassandra AI

The final algorithm, Cassandra AI, takes less than 500 MB RAM and suggests top five likely words in less than a second. It is available at:

https://bzunar.shinyapps.io/cassandra/

One just needs to start typing into the input box. Suggested words will appear in the table (1: the most likely word, 5: 5th most likeley word). If you click on the word, the word will be added to the input box. Why not try it?

Thank you for your attention!