Using the “sbo” package in R, a model was built with the package's Stupid Back-off smoothing. The model was then tested on a sub-sample of the original data. Its initial perplexity was over 300. After testing and adjusting the smoothing parameters, the perplexity fell below 300, and the model can both predict the next word (offering three candidates) and report a probability score for the most likely word:
## Next-word text predictor from Stupid Back-off N-gram model
##
## Order (N): 3
## Dictionary size: 1685 words
## Back-off penalization (lambda): 0.4
## Maximum number of predictions (L): 3
##
## See ?predict.sbo_predictor for usage help.
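As a rough illustration of the perplexity metric mentioned above, the sketch below computes perplexity from the probabilities a model assigned to each held-out word. The function name `perplexity` and the toy input are assumptions made for this example. The text does not say how the reported figures were computed, and raw Stupid Back-off scores are not normalized probabilities, so this is only the textbook definition.

```r
# Perplexity of a held-out word sequence, given the probability the model
# assigned to each word that actually occurred.
# `perplexity` is a hypothetical helper, not part of the sbo package.
perplexity <- function(probs) {
  stopifnot(is.numeric(probs), length(probs) > 0,
            all(probs > 0), all(probs <= 1))
  # exp of the average negative log-probability per word
  exp(-mean(log(probs)))
}

# Toy input: a model that gives every observed word probability 1/300
# has a perplexity of roughly 300.
uniform_probs <- rep(1 / 300, times = 50)
print(perplexity(uniform_probs))
```

Lower values mean the model is, on average, less surprised by the test words.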
The best-performing model uses N = 5, which gives it more context. The model is saved as an .rda file, which allows for quick loading and prediction. The model can be updated by running the R scripts in the repository.
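The train, save, reload, and predict cycle described above might look roughly like the following. This is a minimal sketch, not the repository's actual scripts: it assumes the example corpus bundled with sbo (`sbo::twitter_train`) in place of the project's data, a hypothetical file name, and the parameter values shown in the printed summary (N = 3, lambda = 0.4, L = 3). The dictionary coverage of 75% is also an assumption.

```r
library(sbo)

# Assumption: the package's bundled example corpus stands in for the
# project's own training data.
train_corpus <- sbo::twitter_train

# Build a prediction table. It is a plain R object, so it can be written
# to disk and turned back into a predictor later.
pred_table <- sbo_predtable(
  object = train_corpus,
  N = 3,                          # order of the n-gram model
  dict = target ~ 0.75,           # assumed: dictionary covering 75% of the corpus
  .preprocess = sbo::preprocess,  # default text clean-up shipped with sbo
  EOS = ".?!:;",                  # end-of-sentence characters
  lambda = 0.4,                   # back-off penalization
  L = 3L,                         # number of predictions per input
  filtered = "<UNK>"              # never predict the unknown-word token
)

# Hypothetical file name; a temporary directory keeps the sketch self-contained.
model_file <- file.path(tempdir(), "sbo_predtable.rda")
save(pred_table, file = model_file)

# In a later session: reload the table and rebuild the predictor from it.
load(model_file)
predictor <- sbo_predictor(pred_table)

# Ask for the three most likely next words.
preds <- predict(predictor, "i love")
print(preds)
```

Raising `N` to 5 in the call above would give the larger-context variant, at the cost of a bigger table.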