Rui Wang
04-04-2020
This word prediction application uses the stupid backoff n-gram algorithm. A conventional backoff setup uses the trigram model as its base layer: the input text pattern is first looked up in the trigram model. If the trigram model has no evidence for the input pattern (a zero count), the model backs off to the lower-order bigram model, and then to the unigram model.
Because the model has to be uploaded to and served from the free Shiny server, I reduced the size of the training data so that the model fits on the server, which should also shorten the prediction waiting time.
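One simple way to shrink a training corpus is to keep a random fraction of its lines. The sketch below is only an illustration in Python (the app itself runs on Shiny); the function name `sample_lines`, the sampling fraction, and the seed are hypothetical, and the text above does not say how the data was actually reduced.

```python
import random


def sample_lines(lines, fraction, seed=2020):
    """Keep a random subset of lines; the fixed seed makes the subset reproducible.

    `fraction` and `seed` are illustrative assumptions, not values from the app.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    # Keep at least one line when the corpus is not empty.
    keep = max(1, round(len(lines) * fraction)) if lines else 0
    return rng.sample(lines, keep)


if __name__ == "__main__":
    corpus = [f"line {i}" for i in range(1000)]
    subset = sample_lines(corpus, fraction=0.05)
    print(len(subset), "lines kept out of", len(corpus))
```

A smaller sample means fewer n-grams to store and search, at the cost of less evidence for rare word patterns.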
Because the training data was cut, I added a four-gram model to the stupid backoff algorithm as the base layer in place of the original trigram model, which could potentially make the predictions more accurate.
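The backoff idea with a four-gram base layer can be sketched as follows. This is a minimal illustration in Python, not the code behind the app: the function names, the whitespace tokenization, and the toy corpus are assumptions, and the discount of 0.4 is the value commonly used with stupid backoff rather than a setting confirmed for this model. Each word is scored by its relative frequency after the longest matching context, and every step down to a shorter context multiplies the score by the discount.

```python
from collections import Counter

ALPHA = 0.4      # backoff discount (assumed; commonly used with stupid backoff)
MAX_ORDER = 4    # four-gram base layer


def build_counts(sentences, max_order=MAX_ORDER):
    """Count every n-gram of order 1..max_order; keys are tuples of words."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(1, max_order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def predict_next(text, counts, max_order=MAX_ORDER, top_k=3):
    """Return up to top_k candidate next words ranked by stupid backoff score."""
    words = text.lower().split()
    total_unigrams = sum(c for gram, c in counts.items() if len(gram) == 1)
    scores = {}
    discount = 1.0
    # Start with the longest usable context (3 words for a four-gram),
    # then back off to shorter contexts, ending with plain unigrams.
    for ctx_len in range(min(max_order - 1, len(words)), -1, -1):
        context = tuple(words[len(words) - ctx_len:])
        denom = counts[context] if ctx_len else total_unigrams
        if denom > 0:
            # Linear scan for clarity; a real model would index by context.
            for gram, c in counts.items():
                if len(gram) == ctx_len + 1 and gram[:-1] == context:
                    # Keep the score from the highest order that saw the word.
                    scores.setdefault(gram[-1], discount * c / denom)
        discount *= ALPHA
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


if __name__ == "__main__":
    toy_corpus = [
        "i love new york",
        "i love new ideas",
        "i love new york city",
        "you love old books",
    ]
    print(predict_next("i love new", build_counts(toy_corpus)))
```

The scores are not probabilities (they do not sum to one), which is what makes the method cheap: no smoothing or normalization step is needed, only counts and a fixed discount.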
A basic introduction to the word prediction app sits on the left side of the application. It includes important background information and instructions for using the app.
The core part of the application sits on the right side of the page. (On a mobile device, it should sit in the lower part of the webpage.)
Users can type the text for prediction directly into the blank input area and then hit the update text button to get the next-word prediction. After the button is hit, the Shiny backend server automatically calculates the results using the optimized stupid backoff model and gives users the three best choices for the next word.
For more detailed information about the stupid backoff algorithm and n-gram theory, please check Dan Jurafsky and James H. Martin's book Speech and Language Processing.
You can also visit my GitHub repository to get the code I used for building the model and the data I used for training it. I will upload all relevant files to the repository later; feel free to use and modify them.
For more advanced word or sentence generation, I would recommend deep learning, which is likely to be more effective in both model training and word generation.