Text Predictor

This is a simple gadget that shows how to build a text predictor using n-gram models. The main idea is based on the Markov assumption: you don't have to know the whole history of a sentence to predict the next word. A short window of the last few words (states), as captured by 2-, 3-, or 4-grams, is sufficient. To do this, we build simple n-gram tables from an available corpus: the text is tokenized into individual words, and every sequence of 2, 3, or 4 consecutive words is entered as a row. We then calculate the probabilities of those sequences so that we can pick the most likely next word.
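The table-building step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the names `tokenize`, `build_ngram_table`, and `next_word_probabilities` are hypothetical, the tokenizer is a plain whitespace split, and the toy corpus is made up. An n-gram of order n is stored as a context of n-1 words plus a count for each word that followed it.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase and split on whitespace.
    # A real tokenizer would also deal with punctuation and sentence breaks.
    return text.lower().split()


def build_ngram_table(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that followed it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        next_word = tokens[i + n - 1]
        table[context][next_word] += 1
    return table


def next_word_probabilities(table, context):
    """Relative frequency of each next word given the context."""
    counts = table.get(tuple(context))
    if not counts:
        return {}
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Toy corpus, for illustration only.
tokens = tokenize("the cat sat on the mat and the cat ate the fish")
trigrams = build_ngram_table(tokens, 3)
probs = next_word_probabilities(trigrams, ["the", "cat"])
print(probs)  # {'sat': 0.5, 'ate': 0.5}
```

The probabilities here are plain relative frequencies (maximum-likelihood estimates); smoothing is left out to keep the sketch short.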

During prediction, we first try to match the input against the 4-gram model; if no match is available, we back off to the 3-gram model and then to the 2-gram model.
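A rough sketch of this fallback is shown below. It assumes the tables are kept in a dictionary keyed by n-gram order and that the most frequent continuation is returned; the names `build_tables` and `predict_next` and the toy sentence are hypothetical and only serve the illustration.

```python
from collections import Counter, defaultdict


def build_tables(tokens, orders=(4, 3, 2)):
    """Build one context -> next-word count table per n-gram order."""
    tables = {}
    for n in orders:
        table = defaultdict(Counter)
        for i in range(len(tokens) - n + 1):
            table[tuple(tokens[i:i + n - 1])][tokens[i + n - 1]] += 1
        tables[n] = table
    return tables


def predict_next(tables, words):
    """Try the highest-order model first, then drop to lower orders."""
    for n in sorted(tables, reverse=True):
        k = n - 1  # an n-gram model uses the last n-1 words as context
        if len(words) < k:
            continue
        counts = tables[n].get(tuple(words[-k:]))
        if counts:
            # Return the most frequent continuation and the order that matched.
            return counts.most_common(1)[0][0], n
    return None, None


# Toy corpus, for illustration only.
tokens = "we like to read books and they like to read books but you like to write".split()
tables = build_tables(tokens)

print(predict_next(tables, ["we", "like", "to"]))  # ('read', 4)
print(predict_next(tables, ["i", "like", "to"]))   # ('read', 3)
print(predict_next(tables, ["want", "to"]))        # ('read', 2)
print(predict_next(tables, ["xyz"]))               # (None, None)
```

Returning the matched order alongside the word makes it easy to see how often the predictor had to fall back to a shorter context.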

It is important to limit the size of the corpus here: a sample of about 10% is possibly sufficient and preserves the speed of the application. The sample size can be increased depending on the application.
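One way to take such a sample is to draw a fixed fraction of the corpus lines at random before building the tables. The sketch below assumes the corpus is already loaded as a list of lines; the function name `sample_lines`, the fixed seed, and the generated placeholder lines are illustrative assumptions, not part of the original application.

```python
import random


def sample_lines(lines, fraction=0.10, seed=42):
    """Return a reproducible random subset of the lines, in original order."""
    if not lines:
        return []
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    k = max(1, int(len(lines) * fraction))
    indices = sorted(rng.sample(range(len(lines)), k))
    return [lines[i] for i in indices]


# Placeholder corpus of 1000 lines, for illustration only.
corpus_lines = [f"line {i}" for i in range(1000)]
subset = sample_lines(corpus_lines)
print(len(subset))  # 100
```

Raising `fraction` trades speed and memory for coverage of rarer word sequences.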