- Application Summary
- Development
- Notes
9/21/2019
This is a simple gadget that shows how to build a text predictor using an n-gram algorithm. The concept is based on Markov chain theory, which states that you don’t have to know the whole history of a sentence to predict the next word; only the last 2, 3, or 4 words (states) are sufficient.
Upon prediction, the application takes the three most probable words and places them on push buttons, where the user can select the best word or type their own. Each newly typed word triggers a new prediction, which again fills the three push buttons.
We build simple n-gram tables from an available corpus: we tokenize the text and store the individual words, in sequences of 2, 3, and 4, as rows. We then calculate the probability of each sequence so that we can pick the most likely candidate when deciding the next word.
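The table-building step described above can be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the function names `tokenize` and `build_ngram_table`, the regular-expression tokenizer, and the tiny sample corpus are all assumptions made for the example. It also assumes that an n-gram row uses the first n-1 words as the context and the last word as the prediction, with probabilities estimated as simple relative frequencies.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def build_ngram_table(tokens, n):
    """Map each (n-1)-word context to a dict of {next_word: probability}."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        next_word = tokens[i + n - 1]
        counts[context][next_word] += 1

    # Turn raw counts into relative frequencies per context.
    table = {}
    for context, followers in counts.items():
        total = sum(followers.values())
        table[context] = {word: c / total for word, c in followers.items()}
    return table


# Hypothetical toy corpus, used only to illustrate the idea.
corpus = "the cat sat on the mat and the cat ran"
tokens = tokenize(corpus)
tables = {n: build_ngram_table(tokens, n) for n in (2, 3, 4)}
```

In this toy corpus, "cat" follows "the" in two of its three occurrences, so `tables[2][("the",)]["cat"]` is 2/3. A real corpus would likely need a more careful tokenizer (punctuation, numbers, sentence boundaries) than the one assumed here.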
During prediction, we first try to match the best 4-gram; if none is available, we drop to the 3-gram model and then to the 2-gram model.
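The fallback from the 4-gram to the 3-gram and then the 2-gram model might look something like the sketch below. The function name `predict_next`, the layout of `tables` (a dict keyed by n, mapping context tuples to word probabilities), and the hand-written probabilities are assumptions for illustration only. The sketch returns candidates from the first model that has a match and does not mix in words from lower-order models, which is one possible reading of the description above.

```python
def predict_next(words, tables, top_k=3, max_n=4):
    """Return up to top_k candidate next words, trying the longest n-gram first."""
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # not enough typed words to form this context
        context = tuple(words[-(n - 1):])
        followers = tables.get(n, {}).get(context)
        if followers:
            # Highest probability first; ties are broken alphabetically.
            ranked = sorted(followers, key=lambda w: (-followers[w], w))
            return ranked[:top_k]
    return []  # no model knows this context


# Hypothetical hand-written tables, only to demonstrate the backoff order.
tables = {
    4: {("i", "want", "to"): {"go": 0.5, "be": 0.3, "see": 0.2}},
    3: {("want", "to"): {"go": 0.4, "be": 0.3, "eat": 0.2, "see": 0.1}},
    2: {("to",): {"the": 0.5, "be": 0.3, "go": 0.2}},
}

print(predict_next(["i", "want", "to"], tables))    # 4-gram match
print(predict_next(["you", "want", "to"], tables))  # falls back to the 3-gram
print(predict_next(["to"], tables))                 # only the 2-gram applies
```

The three returned words correspond to the three push buttons. When no table contains the context, the sketch returns an empty list; an application would need some default for that case, such as the most frequent words overall.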
It is important to limit the size of the corpus; a sample of possibly 10% is sufficient to preserve the speed of the application. This can be increased depending on the user's requirements.
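One way to take such a sample is to draw a random fraction of the corpus lines before building the tables. The sketch below is an assumption about how this could be done, not the application's actual method: the function name `sample_lines`, the line-based sampling unit, and the fixed seed are all hypothetical choices.

```python
import random


def sample_lines(lines, fraction=0.10, seed=42):
    """Return a reproducible random subset containing about `fraction` of the lines."""
    if not lines:
        return []
    rng = random.Random(seed)  # fixed seed so the same sample is drawn each run
    k = max(1, int(len(lines) * fraction))
    return rng.sample(lines, k)


# Hypothetical stand-in for a corpus that has been read line by line.
corpus_lines = [f"sentence number {i}" for i in range(200)]
subset = sample_lines(corpus_lines)  # 20 of the 200 lines
```

Raising `fraction` trades speed for coverage: larger samples give the tables more contexts to match but make them bigger and slower to build.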