12/12/2020

What is it?

This application predicts the next word for a given input string, using n-grams with the Stupid Backoff algorithm. It is a proof of concept designed to run quickly with minimal hardware requirements, which makes it well suited to mobile or embedded use.

How does it work?

The application takes the input string and uses the Stupid Backoff algorithm (Brants et al. 2007) with a pre-computed n-gram frequency table to score candidate next words, then shows the top 5 suggestions. For example, if I were to input “raining cats and”, it would predict “dogs” with a high score. This is because, within a given order, a candidate scores higher the more often its n-gram occurs relative to the other n-grams that share the same context. If no matching n-gram is found at the current order, the algorithm backs off (hence the name): it drops the earliest word of the context, looks up the remaining words in the next lower-order table, and multiplies the resulting score by a penalty factor, alpha.
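
For reference, the Stupid Backoff score as given in Brants et al. (2007) can be written as follows. Here f is an n-gram count from the frequency table, w_{i-k+1}^{i-1} is the context of k-1 preceding words, N is the total number of tokens in the training data, and alpha is the backoff penalty (0.4 in the paper). This is the published form of the algorithm; the application may differ in details.

$$
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[2ex]
\alpha \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
\qquad
S(w_i) = \frac{f(w_i)}{N}
$$

Note that S is a score, not a probability: the values are not normalized to sum to 1.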

What features does it have?

-As previously mentioned, it is extremely lightweight. It can be compiled and trained on a system with only 2GB of RAM, and run with as little as 100MB. This is possible because the backoff model is very simple: at prediction time it only needs lookups in the pre-computed frequency table, yet it remains effective.

-The alpha penalty for backing off to a lower-order n-gram can be adjusted with a slider.

-You can apply a penalty weight to the scores of stopwords.
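
To make the scoring and the two adjustable parameters concrete, here is a minimal Python sketch. It is not the application's code: the function names (build_counts, predict), the stopword list, the stopword_weight parameter and the toy corpus are all made up for illustration, and the default alpha of 0.4 is the value used by Brants et al. (2007).

```python
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def build_counts(sentences, max_order=3):
    """Count every n-gram of order 1..max_order in whitespace-tokenized text."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def predict(text, counts, max_order=3, alpha=0.4, stopword_weight=1.0, top_k=5):
    """Return the top_k (word, score) pairs for the word following `text`."""
    tokens = text.lower().split()
    total = sum(c for gram, c in counts.items() if len(gram) == 1)
    scores = {}
    penalty = 1.0  # becomes alpha, alpha**2, ... with each backoff step
    # Start with the longest usable context, then drop its earliest word.
    for ctx_len in range(min(max_order - 1, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - ctx_len:])
        denom = counts[context] if context else total
        if denom > 0:
            # A full scan keeps the sketch short; a real table would be indexed.
            for gram, c in counts.items():
                if len(gram) == ctx_len + 1 and gram[:ctx_len] == context:
                    # setdefault keeps the score from the highest order seen.
                    scores.setdefault(gram[-1], penalty * c / denom)
        penalty *= alpha
    # Scale stopword scores by the chosen weight (1.0 leaves them unchanged).
    ranked = {
        w: s * stopword_weight if w in STOPWORDS else s for w, s in scores.items()
    }
    return sorted(ranked.items(), key=lambda item: item[1], reverse=True)[:top_k]


corpus = [
    "it is raining cats and dogs",
    "raining cats and dogs again",
    "cats and mice",
]
counts = build_counts(corpus)
print(predict("raining cats and", counts))
print(predict("cats", counts, stopword_weight=0.0))
```

In this toy corpus, “dogs” ranks first for “raining cats and” because the trigram “cats and dogs” occurs twice out of the three times the context “cats and” appears. Lowering stopword_weight pushes words such as “and” down the ranking, and lowering alpha makes suggestions that were only found after backing off less competitive.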

How can it be improved?

-Due to hardware constraints, this application was compiled using a small subset of the training data. The code is portable: it gathers all of the datasets and libraries it needs, even on a fresh machine, so it can easily be moved to a more powerful cloud machine. From there, simply raising the sampling rate parameter should produce much better results.

-A more complex model, such as an LSTM, would allow more context and meaning to flow from earlier tokens to the end of the sentence, allowing for more accurate predictions.

Thanks for reading! Check it out at the link below:

https://julesv3rne.shinyapps.io/Data-Science-Capstone/