The application has a navigation bar with two panels: Predict App (the main panel) and Documentation.
This image shows the navigation bar and the main panel.
The user enters a short text in the Text input box and clicks the Predict next word button.
Four predicted words will be displayed in the text output box.
The words are listed in order of relevance, with the most likely prediction first.
Algorithm
The Stupid Backoff algorithm was implemented as follows:
The text input is cleaned and transformed for analysis.
The last 3 words of the text input are searched for in the 4-gram file.
If they are found, the final words of the most frequent matching 4-grams become the predicted words.
If there are fewer than four predicted words, the app searches for the last 2 words of the text input in the 3-gram file.
If there are still fewer than four predicted words, the app searches for the last word of the text input in the 2-gram file.
If there are still fewer than four predicted words at the end, the app searches for the last word of the text input in the most frequent 2-grams from the Corpus of Contemporary American English (COCA).
The R code of the algorithm is in the “Documentation” tab of the app.
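The lookup steps above can be sketched in a few lines. The app itself is written in R, so the Python below is only an illustration: the function names (`clean`, `predict`), the table layout, the cleaning rule, and the sample counts are all assumptions, not the app's code.

```python
import re
from collections import Counter

def clean(text):
    """Lowercase the text and keep runs of letters/apostrophes (assumed cleaning rule)."""
    return re.findall(r"[a-z']+", text.lower())

def predict(text, tables, fallback, k=4):
    """Return up to k next-word candidates, backing off from 4-grams to 2-grams.

    tables maps a context length (3, 2, 1) to {context tuple: Counter of next words}.
    fallback is a 2-gram table from a second source, tried last.
    """
    words = clean(text)
    predictions = []
    lookups = [(n, tables.get(n, {})) for n in (3, 2, 1)] + [(1, fallback)]
    for n, table in lookups:
        if len(words) < n:
            continue  # not enough input words for this n-gram order
        context = tuple(words[-n:])
        # Most frequent continuations first; skip words already predicted.
        for word, _count in table.get(context, Counter()).most_common():
            if word not in predictions:
                predictions.append(word)
            if len(predictions) == k:
                return predictions
    return predictions

# Hypothetical counts, for illustration only.
tables = {
    3: {("thanks", "for", "the"): Counter({"follow": 5, "help": 3})},
    2: {("for", "the"): Counter({"first": 9, "follow": 4})},
    1: {("the",): Counter({"same": 20, "first": 7})},
}
fallback = {("the",): Counter({"other": 50})}

result = predict("Thanks for the", tables, fallback)
print(result)
```

With these sample tables, the 4-gram lookup supplies two words, the 3-gram lookup adds one more, and the 2-gram lookup fills the fourth slot, so the fallback table is never consulted.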
Necessary improvements
The application needs the following enhancements:
Use n-gram probabilities to improve accuracy.
Implement a smoothing method to handle n-grams with zero probability.
Use a higher percentage of the text for the training set.
Improve the data cleaning and transformation steps, and the layout of the panels.
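As a rough idea of what the first enhancement could look like, here is a sketch of Stupid Backoff scoring with relative frequencies, written in Python rather than the app's R. The function name `score`, the `counts` layout, and the sample counts are hypothetical. The backoff factor 0.4 is the value commonly used with Stupid Backoff. The scores are not normalized probabilities, and a word that never appears in the data still scores zero, which is the case a smoothing method would need to address.

```python
ALPHA = 0.4  # backoff factor commonly used with Stupid Backoff

def score(word, context, counts, total_unigrams):
    """Stupid Backoff score of `word` following the tuple `context`.

    counts maps n-gram tuples of any length to their frequency. It is assumed
    to be consistent: if an n-gram is present, its context is present too.
    """
    if not context:
        # Base case: unigram relative frequency.
        return counts.get((word,), 0) / total_unigrams
    ngram_count = counts.get(context + (word,), 0)
    if ngram_count > 0:
        return ngram_count / counts[context]
    # Unseen n-gram: drop the oldest context word and discount the score.
    return ALPHA * score(word, context[1:], counts, total_unigrams)

# Hypothetical counts, for illustration only.
counts = {
    ("for", "the", "follow"): 4, ("for", "the"): 10,
    ("the", "first"): 7, ("the", "follow"): 4, ("the",): 40,
    ("first",): 12, ("follow",): 6, ("same",): 5,
}
total_unigrams = 100

context = ("for", "the")
candidates = ["same", "first", "follow"]
ranked = sorted(candidates,
                key=lambda w: score(w, context, counts, total_unigrams),
                reverse=True)
print(ranked)
```

Ranking by score instead of by raw frequency within each n-gram order lets candidates found at different orders be compared on one scale.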