Patrick Machado
18 Aug 2019
This is a pitch for a Shiny application that uses Natural Language Processing to predict the next word of a phrase typed by the user.
It was constructed as the final project of the Coursera Data Science Specialization.
The data used to train the algorithm was downloaded from here. This large volume of text was cleaned and sampled to build the quadgrams used by the application, which were then classified and aggregated into five features:
More details about the dataset can be found in this RPubs presentation.
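As a rough illustration of how quadgrams can be counted from cleaned text, here is a minimal Python sketch. It is not the app's code: the `clean` and `count_quadgrams` helpers, the cleaning rules, and the two sample lines are all assumptions made for this example.

```python
import re
from collections import Counter

def clean(text):
    """Hypothetical cleaning step: lowercase, keep letters and apostrophes."""
    text = text.lower()
    text = re.sub(r"[^a-z' ]+", " ", text)
    return text.split()

def count_quadgrams(lines):
    """Count every run of four consecutive words in each line."""
    counts = Counter()
    for line in lines:
        tokens = clean(line)
        for i in range(len(tokens) - 3):
            counts[tuple(tokens[i:i + 4])] += 1
    return counts

# Made-up sample lines, for illustration only.
sample = ["Thanks for the follow!", "thanks for the follow and the retweet"]
quadgrams = count_quadgrams(sample)
```

In each quadgram, the first three words act as the lookup key and the fourth is the candidate prediction; the count serves as its frequency.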
For prediction, a modified version of the back-off model was implemented. The algorithm works as follows:
The user input is cleaned with the same function that was applied to the training data, and the three rightmost words are selected.
These words are searched for in the app database, and if there is a match, the predicted word is returned.
If no exact match is found, the app database is filtered, when possible, by each of the three input words, and the predicted word with the greatest quadgram frequency is returned.
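The steps above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the app's actual implementation: the table layout `(word1, word2, word3, prediction, frequency)`, the `predict_next` name, the sample rows, and the choice to apply the filters from the most recent word backwards are all hypothetical. The cleaning step is reduced to lowercasing and splitting, since the app's cleaning function is not shown.

```python
def predict_next(phrase, table):
    """Return a predicted next word for `phrase`, or None if the table is empty.

    `table` is a list of (word1, word2, word3, prediction, frequency) rows;
    this layout is an assumption made for the sketch.
    """
    # Simplified cleaning: lowercase, split, keep the three rightmost words.
    words = phrase.lower().split()[-3:]
    # Left-pad short inputs so positions still line up with the table columns.
    w1, w2, w3 = [None] * (3 - len(words)) + words

    # Step 1: exact match on all three words.
    exact = [r for r in table if (r[0], r[1], r[2]) == (w1, w2, w3)]
    if exact:
        return max(exact, key=lambda r: r[4])[3]

    # Step 2: filter by each input word only when the filter leaves rows
    # ("when possible"); the order, last word first, is an assumption.
    candidates = table
    for pos, word in ((2, w3), (1, w2), (0, w1)):
        narrowed = [r for r in candidates if r[pos] == word]
        if narrowed:
            candidates = narrowed

    # Step 3: return the prediction with the greatest quadgram frequency.
    return max(candidates, key=lambda r: r[4])[3] if candidates else None

# Made-up rows, for illustration only.
table = [
    ("thanks", "for", "the", "follow", 12),
    ("thanks", "for", "the", "support", 7),
    ("one", "of", "the", "best", 9),
    ("at", "the", "same", "time", 15),
]

exact_hit = predict_next("Thanks for the", table)
backoff_hit = predict_next("many of the", table)
fallback_hit = predict_next("hello world", table)
```

With this design, an input that matches none of the filters still gets an answer: the candidate set stays the whole table, so the overall most frequent prediction is returned.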
Thanks and enjoy typing!