A Simple Text Prediction App Using R Shiny

Josh Oberman
July 2016

Introduction

The goal of this project was to implement a reasonably accurate and performant text prediction engine by mining a corpus of written natural language from the internet.
The data comes from blogs, Twitter, and news websites.
N-gram analysis and “stupid backoff” smoothing were used for prediction. The search for the next word in a sequence starts at the quadgram level.

Stupid Backoff

-Stupid backoff is a smoothing method that has been shown to be effective when drawing on word counts from large corpora.

-Essentially, the model starts by searching for a count at the largest n-gram level possible to generate the next word in the sequence. If no sequence is found at that n-gram level, it searches the “n-1”-gram level. For example, if the input phrase is “the cat is on the mat” and there is no quadgram beginning with “on the mat”, the model instead searches for a trigram beginning with “the mat”; if there is no trigram beginning with “the mat”, it searches for any bigram beginning with “mat”. If no bigrams exist that contain the most rece
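The backoff search described above can be sketched in a few lines of R. This is a minimal illustration, not the app's actual code: the toy count vectors, the function names `next_word` and `predict_backoff`, and the choice to return `NA` when nothing matches at any level are all assumptions made for the example.

```r
# Hypothetical toy counts; the real app loads much larger vectors.
quadgrams <- c("sat on the mat" = 3L, "on the table is" = 2L)
trigrams  <- c("on the mat" = 5L, "the mat was" = 4L, "the mat is" = 1L)
bigrams   <- c("the mat" = 9L, "mat was" = 4L, "mat and" = 6L)
tables    <- list(quadgrams, trigrams, bigrams)

# Last word of the most frequent n-gram that starts with `prefix`,
# or NA if no n-gram in `counts` starts with it.
next_word <- function(prefix, counts) {
  hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  sub("^.* ", "", names(hits)[which.max(hits)])
}

# Try 3 words of context (quadgrams), then 2 (trigrams), then 1 (bigrams).
predict_backoff <- function(phrase, tables) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  for (k in 3:1) {
    if (length(words) < k) next
    prefix <- paste(tail(words, k), collapse = " ")
    guess  <- next_word(prefix, tables[[4 - k]])
    if (!is.na(guess)) return(guess)
  }
  NA_character_  # assumption: no fallback when every level misses
}

predict_backoff("the cat is on the mat", tables)  # "was" (found at the trigram level)
predict_backoff("sit on a mat", tables)           # "and" (found at the bigram level)
```

With these toy counts, the first call finds no quadgram beginning with “on the mat” and falls back to the trigrams beginning with “the mat”, which mirrors the example above.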

Drawbacks

-The algorithm used could be better described as “stupid stupid backoff”. Stupid backoff typically weights the (n-1)-gram predictions by some constant alpha. However, for the purposes of building a text prediction engine it made sense not to do this, since the weighting is only relevant when comparing probabilities from different n-gram levels, and this model only draws its prediction from the highest n-gram level that has a match.
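For comparison, here is a rough sketch of the alpha-weighted scoring that this app leaves out. The counts are made up, the function name `sb_score` and the `total` token count are hypothetical, and alpha = 0.4 is simply the commonly cited value; none of this is taken from the app itself.

```r
# Hypothetical toy counts mixing unigram, bigram and trigram entries.
counts <- c("the mat" = 10L, "the mat was" = 4L, "the mat is" = 1L,
            "mat" = 20L, "mat was" = 5L, "mat and" = 12L)

# Stupid backoff score of `word` following `context` (a character vector of words).
# If the full n-gram was seen, use its relative frequency; otherwise drop the
# oldest context word and multiply the lower-order score by alpha.
sb_score <- function(word, context, counts, alpha = 0.4, total = 1000L) {
  if (length(context) == 0) {
    # Unigram base case: relative frequency in a corpus of `total` tokens.
    n <- counts[word]
    return(if (is.na(n)) 0 else unname(n) / total)
  }
  ctx  <- paste(context, collapse = " ")
  full <- paste(ctx, word)
  if (!is.na(counts[full])) {
    return(unname(counts[full] / counts[ctx]))
  }
  alpha * sb_score(word, context[-1], counts, alpha, total)
}

sb_score("was", c("the", "mat"), counts)  # 4 / 10 = 0.4, seen as a trigram
sb_score("and", c("the", "mat"), counts)  # 0.4 * (12 / 20) = 0.24, backed off to the bigram
```

Without the alpha discount, “and” (0.6 at the bigram level) would outrank “was” (0.4 at the trigram level). Because the app only takes candidates from the highest level that matches, it would pick “was” here either way.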

-Word frequencies at the different n-gram levels are loaded into the app environment from .RData files as pre-processed named integer vectors of counts. These vectors are relatively large and slow to search. Implementing more advanced sorting methods, or storing the data in an external database, could likely improve performance.
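One possible way to speed up the search, sketched here under assumptions, is to build a prefix index once so that each lookup no longer scans every name. The toy vector and the function names `scan_lookup`, `build_index` and `index_lookup` are hypothetical, and the performance benefit is not measured here.

```r
# Hypothetical toy trigram counts standing in for one of the loaded vectors.
trigrams <- c("on the mat" = 5L, "the mat was" = 4L,
              "the mat is" = 1L, "the cat sat" = 7L)

# Linear scan: checks every name on every lookup.
scan_lookup <- function(prefix, counts) {
  hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  sub("^.* ", "", names(hits)[which.max(hits)])
}

# One-time index: map each prefix to its single most frequent next word.
build_index <- function(counts) {
  prefix <- sub(" [^ ]+$", "", names(counts))  # everything before the last word
  last   <- sub("^.* ", "", names(counts))     # the last word
  ord    <- order(counts, decreasing = TRUE)
  keep   <- ord[!duplicated(prefix[ord])]      # highest-count entry per prefix
  index  <- new.env(hash = TRUE)
  for (i in keep) assign(prefix[i], last[i], envir = index)
  index
}

# Hashed lookup: no scan over the full vector.
index_lookup <- function(prefix, index) {
  if (exists(prefix, envir = index, inherits = FALSE)) {
    get(prefix, envir = index)
  } else {
    NA_character_
  }
}

index <- build_index(trigrams)
index_lookup("the mat", index)  # "was", same answer as scan_lookup("the mat", trigrams)
```

The index keeps only the top next word per prefix, so it trades flexibility (for example, returning several candidates) for a smaller structure and a constant-time lookup.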

Shiny App

-The Shiny app was made to be minimal, as if it were a text prediction engine on a phone.

-The user inputs some text in the side panel, and a progress bar indicates that the next word is being generated.

-After generating a new word, the user has the option of generating another word based on the updated phrase, or of entering a new phrase in the side panel and restarting the process.
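The interaction described above might look roughly like the following Shiny skeleton. It is only a sketch: the input and output IDs, the button labels, and the placeholder `predict_next` function are assumptions, and the real app's layout and prediction code may differ.

```r
library(shiny)

# Placeholder predictor (hypothetical); a real app would call its n-gram lookup here.
predict_next <- function(phrase) "word"

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("phrase", "Enter a phrase"),
      actionButton("go", "Predict next word"),
      actionButton("more", "Predict another word")
    ),
    mainPanel(textOutput("result"))
  )
)

server <- function(input, output, session) {
  current <- reactiveVal("")  # the phrase including any predicted words

  # Append one predicted word to `phrase`, showing a progress bar meanwhile.
  extend <- function(phrase) {
    withProgress(message = "Generating next word...", value = 0.5, {
      current(trimws(paste(phrase, predict_next(phrase))))
    })
  }

  observeEvent(input$go, extend(input$phrase))  # restart from the side panel text
  observeEvent(input$more, extend(current()))   # continue from the updated phrase
  output$result <- renderText(current())
}

if (interactive()) shinyApp(ui, server)
```

Keeping the running phrase in a single `reactiveVal` lets both buttons share one code path: one restarts from the typed text, the other continues from the phrase already on screen.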