Alexander Spray
2016-10-08
This application, available at the link below, presents a simple n-gram text prediction algorithm: given the last n-1 words of an input string, it predicts the most common nth word observed in the training corpus.
https://roastmoose.shinyapps.io/textPred/
The user string is first tested for existence in the highest-order stored n-gram space (in this case, a 3-gram model). If the string does not exist there, we recursively trim it until we find a matching lower-order n-gram and generate a prediction from that restricted model. The model is fairly large (1 GB), but execution time is short (median < 1 s). This compromise balances the size of the model, its accuracy, and the time to fit and evaluate it.
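As a rough illustration of this back-off lookup (not the app's actual code: the table layout, column names, sample counts, and fallback word are all assumptions for this sketch), the scheme can be expressed in R with data.table as:

```r
library(data.table)

# Toy n-gram tables keyed by context length: 2 preceding words for
# the 3-gram model, 1 for the 2-gram model. Contents, column names,
# and the fallback word are assumptions for this sketch.
ngrams <- list(
  `2` = data.table(context = c("want to", "want a"),
                   word    = c("go", "drink"),
                   count   = c(42L, 17L)),
  `1` = data.table(context = c("to", "a"),
                   word    = c("be", "lot"),
                   count   = c(900L, 300L))
)
for (tbl in ngrams) setkey(tbl, context)  # enable binary search

predict_word <- function(input) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  if (length(words) == 0) return("the")            # empty input fallback
  # Back off: try the longest stored context first, then trim it
  for (k in min(2, length(words)):1) {
    ctx  <- paste(tail(words, k), collapse = " ")
    hits <- ngrams[[as.character(k)]][.(ctx), nomatch = 0L]
    if (nrow(hits) > 0)
      return(hits[which.max(count), word])         # most frequent nth word
  }
  "the"                                            # context never seen
}

predict_word("honestly i want to")  # "go"
```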
The training corpus is very large, and its sheer size posed problems in fitting the model described on the previous slide. The corpus includes some 70 million unique words for the English dataset alone, so some care was necessary in fitting the model. The data.table package provides data structures with efficient (keyed) search, which keeps both model fitting and model evaluation fast.
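To see why keyed search matters at this scale, here is a small self-contained timing sketch (the table size and key values are illustrative) contrasting a plain data.frame subset with a keyed data.table lookup:

```r
library(data.table)

# Contrast lookup cost: subsetting a data.frame scans every row,
# while a keyed data.table uses binary search on the sorted key.
n  <- 1e6
df <- data.frame(context = sprintf("ctx%07d", seq_len(n)),
                 count   = 1L, stringsAsFactors = FALSE)
dt <- as.data.table(df)
setkey(dt, context)  # sort once up front, then search in O(log n)

system.time(df[df$context == "ctx0500000", ])  # linear scan, O(n)
system.time(dt[.("ctx0500000")])               # keyed binary search
```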
Time to fit under data.frame was impractical: more than 24 hours for a 40% sample.
The model was found to predict the next word correctly approximately 20% of the time.
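For context, an accuracy figure of this kind typically comes from a hold-out test: withhold phrases from training, predict the final word of each from its preceding words, and count exact matches. A minimal sketch, reusing predict_word() from the earlier example (test_phrases is an assumed character vector of held-out sentences):

```r
# Hold-out evaluation sketch; accuracy is exact top-1 next-word matches.
evaluate <- function(test_phrases) {
  correct <- 0L
  total   <- 0L
  for (s in test_phrases) {
    w <- strsplit(tolower(s), "\\s+")[[1]]
    if (length(w) < 2) next                    # need a context and a target
    ctx    <- paste(head(w, -1), collapse = " ")
    target <- tail(w, 1)
    correct <- correct + (predict_word(ctx) == target)
    total   <- total + 1L
  }
  correct / total
}

evaluate(c("i want to go", "want a drink"))  # fraction predicted correctly
```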
Due to time constraints, many possible improvements to the application remain unimplemented. These include, but are not limited to: