Josh Oberman
July 2016
- Stupid backoff is a smoothing method that has been shown to be effective when drawing on word counts from large corpora
- Essentially, the model starts by searching for a count at the largest n-gram level possible to generate the next word in the sequence; if no match is found at that level, it backs off to the (n-1)-gram level. E.g., if our input phrase is “the cat is on the mat” and there is no quadgram beginning with “on the mat”, the model instead searches for a trigram beginning with “the mat”; if there is no trigram beginning with “the mat”, it searches for a bigram beginning with “mat”. If no bigram begins with the most recent word, the model falls back to the unigram counts and returns the most frequent word overall (a sketch of this lookup follows)
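A minimal sketch of this backoff search, assuming the pre-processed counts are named integer vectors (`quadgrams`, `trigrams`, `bigrams`, `unigrams`) whose names are space-separated n-grams; the function name and vector names are illustrative, not necessarily the app's actual code:

```r
predict_next <- function(phrase, quadgrams, trigrams, bigrams, unigrams) {
  words   <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  tables  <- list(quadgrams, trigrams, bigrams)
  ctx_len <- c(3, 2, 1)                        # words of context at each level

  for (i in seq_along(tables)) {
    n <- ctx_len[i]
    if (length(words) < n) next
    context <- paste(tail(words, n), collapse = " ")
    # keep only n-grams whose first n words match the context exactly
    hits <- tables[[i]][startsWith(names(tables[[i]]), paste0(context, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]     # highest-count match wins
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  names(unigrams)[which.max(unigrams)]         # fall back to top unigram
}

# predict_next("the cat is on the mat", quadgrams, trigrams, bigrams, unigrams)
```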
- The algorithm used here might better be described as “stupid stupid backoff”. Stupid backoff typically weights the (n-1)-gram predictions by a constant alpha at each backoff step. However, for the purposes of building a text engine it made sense to skip this, since the weighting only matters when comparing probabilities across different n-gram levels, and this model always returns the best match from a single level
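For reference, the standard stupid backoff score from Brants et al. (2007), where the constant alpha (typically around 0.4) comes from, looks roughly like this:

```latex
S(w_i \mid w_{i-k+1}^{\,i-1}) =
\begin{cases}
  \dfrac{\mathrm{count}(w_{i-k+1}^{\,i})}{\mathrm{count}(w_{i-k+1}^{\,i-1})}
      & \text{if } \mathrm{count}(w_{i-k+1}^{\,i}) > 0,\\[1.5ex]
  \alpha \, S(w_i \mid w_{i-k+2}^{\,i-1}) & \text{otherwise.}
\end{cases}
```

Because alpha multiplies every candidate at a given backoff level by the same constant, it never changes which word scores highest within that level, so dropping it is harmless when only one level's candidates are ever compared.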
- Word frequencies at different n-gram levels are loaded into the app environment from .RData files as pre-processed named integer vectors of counts. These vectors are relatively large and slow to search; smarter indexing or storing the data in an external database would likely improve performance (one possibility is sketched below)
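One possible speed-up, sketched below as an assumption rather than what the app currently does: pre-index each n-gram count vector (n >= 2) by its context, keeping only the single best continuation per context in a hashed environment, so a lookup becomes a constant-time fetch instead of a linear scan over names. File and function names here are illustrative:

```r
# Sketch only: index a named count vector by context so the best
# continuation for any context can be fetched in constant time.
index_by_context <- function(counts) {
  parts   <- strsplit(names(counts), " ")
  context <- vapply(parts, function(p) paste(head(p, -1), collapse = " "), "")
  nxt     <- vapply(parts, function(p) tail(p, 1), "")
  ord     <- order(counts, decreasing = TRUE)   # best continuations first
  keep    <- ord[!duplicated(context[ord])]     # one winner per context
  idx     <- new.env(hash = TRUE)
  for (i in keep) idx[[context[i]]] <- nxt[i]
  idx
}

# load("quadgrams.RData")            # assumed file name; provides `quadgrams`
# quad_idx <- index_by_context(quadgrams)
# quad_idx[["on the mat"]]           # constant-time best next word
```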
- The Shiny app was made to be minimal, as if it were a text engine on a phone
- The user inputs some text in the side panel, and a progress bar indicates that the next word is being generated
- After a new word is generated, the user can either generate another word based on the updated phrase or input a new phrase in the side panel and restart the process (a sketch of this layout follows)
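A rough sketch of that layout, assuming a `predict_next(phrase)` wrapper around the backoff lookup above; widget IDs and labels are illustrative:

```r
library(shiny)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("phrase", "Enter a phrase:"),
      actionButton("go", "Predict next word")
    ),
    mainPanel(textOutput("result"))
  )
)

server <- function(input, output, session) {
  state <- reactiveValues(text = "")

  # each click appends one predicted word to the current phrase
  observeEvent(input$go, {
    withProgress(message = "Generating next word...", value = 0.5, {
      base <- if (nzchar(state$text)) state$text else input$phrase
      state$text <- paste(base, predict_next(base))
    })
  })

  # typing a new phrase in the side panel restarts the process
  observeEvent(input$phrase, { state$text <- "" })

  output$result <- renderText(state$text)
}

shinyApp(ui, server)
```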