Ankit Upadhyay
November 28, 2020
This is a word prediction tool to help us determine the next word in a sentence.
The goal of this exercise is to create a product to highlight the prediction algorithm that you have built and to provide an interface that can be accessed by others. For this project you must submit:
A Shiny app that takes as input a phrase (multiple words) in a text box input and outputs a prediction of the next word.
A slide deck consisting of no more than 5 slides created with R Studio Presenter (https://support.rstudio.com/hc/en-us/articles/200486468-Authoring-R-Presentations) pitching your algorithm and app as if you were presenting to your boss or an investor.
N-grams (sequences of n consecutive words) are used to estimate the most likely next word, which requires the sentences to be preprocessed correctly first.
Significant data cleaning has been done (converting to lower case; removing punctuation marks, numbers, and non-printable characters), and lines from the Twitter, news, and blog files (English versions only) have been processed.
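The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration in Python, not the app's own R code. The function name `clean_line` is hypothetical, and the decisions to keep only ASCII printable characters and to replace removed characters with spaces are assumptions made for the example.

```python
import re
import string

def clean_line(line: str) -> str:
    """Lower-case a line and strip punctuation, digits and non-printable characters."""
    line = line.lower()
    # Drop non-printable characters (assumption: only ASCII printable text is kept).
    line = "".join(ch for ch in line if ch in string.printable)
    # Replace digits and punctuation marks with spaces.
    line = re.sub(r"[0-9]", " ", line)
    line = re.sub(r"[%s]" % re.escape(string.punctuation), " ", line)
    # Collapse the runs of whitespace left behind by the removals.
    return " ".join(line.split())

print(clean_line("Hello, World! 123 times"))
```

Replacing punctuation with spaces splits contractions such as "don't" into two tokens; a real pipeline might choose to delete apostrophes instead.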
After data cleaning, the next word is predicted based on the “Stupid Backoff” algorithm.
First, the algorithm tries to match the first three words of a 4-gram with the last three words of the user-provided sentence to give a prediction.
If no 4-gram matches, the algorithm backs off to 3-grams and tries to match the first two words of a 3-gram with the last two words of the user-provided sentence.
If no 3-gram matches, the algorithm backs off to 2-grams and tries to match the first word of a 2-gram with the last word of the user-provided sentence.
If no 2-gram matches, the algorithm suggests the most frequent 1-gram (single word) as the next word.
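The back-off steps above can be sketched as below. This is a minimal illustration in Python under stated assumptions, not the app's R implementation: the names `build_tables` and `predict_next` are hypothetical, the n-gram tables are built from a toy corpus, and the first n-gram order with a match wins outright, as the steps describe. It does not apply the score discounting used in the full Stupid Backoff scheme.

```python
from collections import Counter, defaultdict

def build_tables(sentences, max_n=4):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    unigrams = Counter()
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                tables[n][context][words[i + n - 1]] += 1
    return tables, unigrams

def predict_next(text, tables, unigrams, max_n=4):
    """Try the longest context first, then back off to shorter ones."""
    words = text.lower().split()
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # input too short for this n-gram order
        context = tuple(words[-(n - 1):])
        candidates = tables[n].get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    # No 2-gram matched: fall back to the most frequent single word.
    return unigrams.most_common(1)[0][0]

corpus = [
    "i want to go home",
    "i want to eat now",
    "i want to go out",
    "you have to go home",
]
tables, unigrams = build_tables(corpus)
print(predict_next("I want to", tables, unigrams))      # matched by a 4-gram
print(predict_next("we need to go", tables, unigrams))  # backs off to a 3-gram
```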
In general, by pruning the n-gram database and using a word-integer hash table, our application keeps memory usage low, which in turn makes predictions faster.
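One way to picture the pruning and the word-integer hash table is the sketch below. It is an illustration in Python under stated assumptions, not the app's actual R code: the names `build_vocab` and `encode_and_prune` are hypothetical, and the `min_count` threshold is an assumed pruning rule, since the write-up does not say how the database was pruned.

```python
from collections import Counter

def build_vocab(words):
    """Assign each distinct word a small integer ID (a word-integer hash table)."""
    vocab = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def encode_and_prune(ngram_counts, vocab, min_count=2):
    """Replace words with integer IDs and drop n-grams seen fewer than min_count times."""
    return {
        tuple(vocab[w] for w in ngram): count
        for ngram, count in ngram_counts.items()
        if count >= min_count
    }

words = "i want to go home i want to eat".split()
bigram_counts = Counter(zip(words, words[1:]))
vocab = build_vocab(words)
compact = encode_and_prune(bigram_counts, vocab)
print(compact)
```

Storing integer tuples instead of repeated strings, and discarding rare n-grams, shrinks the lookup tables that have to be held in memory.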
App can be found here: https://ankit-techspace.shinyapps.io/NLP_Word_Prediction/