Pablo Rojo
October 22nd, 2021
The Shiny application is available at https://pajarom.shinyapps.io/DataProduct/.
This application is based primarily on an n-gram stupid backoff algorithm, combined with a language model that attempts to identify words that are commonly used together.
Syntactic correctness is provided primarily by the stupid backoff algorithm, since it takes into account the order of the words within each n-gram. The formula used to estimate the probability of each word is:
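In its standard formulation, stupid backoff scores a candidate word w after a context as count(context + w) / count(context) when that n-gram was observed, and otherwise as alpha times the score of w after the context with its oldest word dropped, ending at the unigram relative frequency. The result is a relative score rather than a normalized probability. The minimal sketch below illustrates that standard formulation only; the function names, the toy corpus, and the backoff factor alpha = 0.4 (the value commonly used with this method) are assumptions, not details taken from this application.

```python
from collections import Counter

ALPHA = 0.4  # assumed backoff factor; the commonly used default


def build_counts(tokens, max_n=3):
    """Count every n-gram of order 1..max_n in a list of tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def score(word, context, counts, total_unigrams, max_n=3):
    """Stupid backoff score of `word` following the words in `context`."""
    # Keep at most max_n - 1 context words, the longest history the counts cover.
    context = tuple(context)[-(max_n - 1):] if max_n > 1 else ()
    factor = 1.0
    while context:
        seen = counts.get(context + (word,), 0)
        if seen > 0:
            return factor * seen / counts[context]
        context = context[1:]  # drop the oldest word and back off
        factor *= ALPHA
    # No context left: fall back to the unigram relative frequency.
    return factor * counts.get((word,), 0) / total_unigrams


def predict(context, counts, vocabulary, total_unigrams, top=3):
    """Return the `top` highest-scoring words, ties broken alphabetically."""
    ranked = sorted(
        vocabulary,
        key=lambda w: (-score(w, context, counts, total_unigrams), w),
    )
    return ranked[:top]


tokens = "i like green eggs and i like ham and i love ham".split()
counts = build_counts(tokens)
vocabulary = sorted(set(tokens))
suggestions = predict(("i",), counts, vocabulary, len(tokens), top=2)
```

Because every candidate is scored against the same context, the ranking is what matters for prediction; the absolute values are not comparable across contexts.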
The resulting model, built from only 1% of the available data, is small (~10 MB), but creating it is CPU-intensive due to:
To provide semantic context for the predictions, a language model was built that identifies words that commonly appear together, regardless of their order. The main limitation we faced was that the model size grew geometrically, so we needed several techniques to reduce it from 1 GB to less than 100 MB:
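As an illustration of the general idea only, order-independent co-occurrence counting can be sketched as below. The function names, the sentence-level window, and the `min_count` cutoff are assumptions chosen for the example; they are not the techniques actually used in this application. The cutoff shows one generic way to limit size: the number of possible word pairs grows roughly with the square of the vocabulary, and most pairs occur only once.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences, min_count=2):
    """Count unordered word pairs per sentence and keep only frequent pairs."""
    pairs = Counter()
    for sentence in sentences:
        # Sorting the distinct words makes (a, b) and (b, a) the same key.
        words = sorted(set(sentence.lower().split()))
        pairs.update(combinations(words, 2))
    # Pruning rare pairs (assumed cutoff) keeps the table small.
    return {pair: n for pair, n in pairs.items() if n >= min_count}


def related_words(word, pairs):
    """Return words that co-occur with `word`, most frequent first."""
    hits = Counter()
    for (a, b), n in pairs.items():
        if a == word:
            hits[b] += n
        elif b == word:
            hits[a] += n
    return [w for w, _ in sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))]


sentences = ["coffee and milk", "milk and coffee please", "tea and milk"]
pairs = cooccurrence_counts(sentences, min_count=2)
milk_neighbours = related_words("milk", pairs)
```

In a predictor, such neighbours could be used to re-rank the n-gram candidates toward words that fit the topic of the sentence typed so far.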
This algorithm is far from complete or optimal. It is just a proof of concept in the area of Natural Language Processing.
During the testing performed for Quizzes 2 and 3, several limitations were identified:
Send us your feedback!