Angus Macdonald
28th February 2020
The goal of this project is to build a Shiny app that takes a string of characters and predicts what the next word should be.
The following slides will describe:
The first step was to create a text corpus on which to base the predictions. This took data from different sources and combined them into several n-gram tables to be used in the app.
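As a rough illustration of what "combining text into n-grams" means, here is a minimal sketch in Python. It is not the code used to build the app's corpus: the function names (`tokenize`, `ngram_counts`), the simple punctuation stripping, and the sample sentences are all assumptions made for this example.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,!?;:\"'()") for w in text.lower().split())
    return [w for w in words if w]

def ngram_counts(texts, n):
    """Count every run of n consecutive words across all texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Hypothetical sample lines standing in for the real source data.
sample_texts = [
    "Thanks for the follow!",
    "Thanks for the update.",
    "Looking forward to the weekend",
]

# One frequency table each for bigrams, trigrams and quadgrams.
tables = {n: ngram_counts(sample_texts, n) for n in (2, 3, 4)}

print(tables[3][("thanks", "for", "the")])  # 2
```

Each table simply records how often a sequence of words appeared, which is the information the prediction step needs later on.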
This project used the back-off algorithm commonly used in NLP (Natural Language Processing). This algorithm does the following:
– Takes the input and passes it into the function as a character (not a very fun one).
– Searches the “quadgram” table to see if it can find the most likely next word.
– If the input has fewer words than the “quadgram” requires, the input is then compared to the “trigram” table to see if it can find the most likely word.
– This goes on until the word count is 1 word and the “bigram” table is used to find the most likely next word.
– This produces an output estimate for the given input.
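The steps above can be sketched roughly as follows. This is a simplified Python illustration, not the app's actual implementation: `build_model`, `predict_next`, and the sample sentences are hypothetical, and the sketch assumes that the search also backs off to a shorter context when the longer one was never seen in the corpus.

```python
from collections import Counter, defaultdict

def build_model(texts, max_n=4):
    """Map each context (the first n-1 words of an n-gram) to counts of the word that follows."""
    model = defaultdict(Counter)
    for text in texts:
        tokens = text.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model

def predict_next(model, text, max_n=4):
    """Try the longest available context first, then back off to shorter ones."""
    tokens = text.lower().split()
    # A quadgram needs 3 words of context, a trigram 2, a bigram 1.
    for k in range(min(max_n - 1, len(tokens)), 0, -1):
        context = tuple(tokens[-k:])
        if context in model:
            # Return the most frequent word seen after this context.
            return model[context].most_common(1)[0][0]
    return None  # nothing matched, even at the bigram level

# Hypothetical toy corpus for demonstration only.
texts = [
    "i want to go home",
    "i want to go out",
    "i want to go home now",
    "you have to be there",
]
model = build_model(texts)

print(predict_next(model, "I want to go"))     # home  (quadgram match)
print(predict_next(model, "they need to be"))  # there (backs off to the bigram-trigram levels)
print(predict_next(model, "to"))               # go    (only one word, so bigram level)
```

Starting from the longest context means the most specific evidence is used whenever it exists, with the shorter n-grams acting as a fallback.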
The App is set up in two panels:
– The guide panel
– The App
The guide panel gives an intuitive explanation of where the user inputs are required and where the outputs appear.
In essence, this app would work even better with a larger corpus of data. I implore anyone with a PC strong enough to handle such processing to give it a shot and see what they can find!
The main issue was the processing power required to create the corpus. Machine learning and NLP require large amounts of RAM and processing power, both to create the corpus and to run the prediction algorithm.
As such, the corpus used in this project is smaller than most, given the limited RAM and processing power of the laptop this project was performed on.
Machine learning and other aspects of programming hold efficiency at the core, and this little prediction engine encapsulates the idea that, even on underpowered hardware, prediction and NLP are still possible!