Partha Majumdar
10-December-2014
The strategy is explained in non-technical language as far as possible.
For any given set of words (forming a sentence), the words are broken into consecutive 2-word sets, 3-word sets, 4-word sets and so on, up to 8-word sets. If the given sentence has fewer than 8 words, then sets are prepared up to the maximum number of words provided. If the given sentence has more than 8 words, then only the last 8 words are considered.
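The splitting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `word_sets`, the constant `MAX_WORDS`, and splitting on whitespace are hypothetical choices and are not taken from the application itself.

```python
MAX_WORDS = 8  # longest word set considered, as stated in the text


def word_sets(sentence, max_words=MAX_WORDS):
    """Return the consecutive word sets of each size (2 up to max_words).

    Only the last `max_words` words of the sentence are used.
    Splitting on whitespace is an assumption made for this sketch.
    """
    words = sentence.split()[-max_words:]
    sets_by_size = {}
    for size in range(2, len(words) + 1):
        # Slide a window of `size` words across the sentence.
        sets_by_size[size] = [
            tuple(words[i:i + size]) for i in range(len(words) - size + 1)
        ]
    return sets_by_size


example = word_sets("the quick brown fox")
for size, sets in example.items():
    print(size, sets)
```

A 4-word sentence yields 2-word, 3-word and 4-word sets only, which matches the rule for sentences with fewer than 8 words.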
Suppose 5 words are provided as input.
We search our database for the same combination of 5 words within the 6-word sets. If it is found, we take the 6th word as the next possible word. If more than one 6th word is found, we take the word with the highest probability as found in the text.
If the word set is not found, we search our database for the combination of 4 words within the 5-word sets, and so on…
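The search described above resembles a back-off lookup, and a minimal sketch of that idea follows. Several things here are assumptions made for illustration: the function name `predict_next`, the layout of `tables` (context length, then context tuple, then candidate words with probabilities), and the choice to shorten the context by dropping its oldest word. The toy probabilities are invented and do not come from the real data.

```python
def predict_next(words, tables, max_context=8):
    """Return the most probable next word, backing off to shorter contexts.

    `tables[n]` is assumed to map an n-word context tuple to a dict of
    {next_word: probability}. Returns None when nothing matches.
    """
    context = tuple(words[-max_context:])
    while context:
        candidates = tables.get(len(context), {}).get(context)
        if candidates:
            # Several continuations may exist; keep the most probable one.
            return max(candidates, key=candidates.get)
        context = context[1:]  # assumed back-off: drop the oldest word
    return None


# Toy lookup tables with made-up probabilities, for illustration only.
toy_tables = {
    2: {("quick", "brown"): {"fox": 0.75, "dog": 0.25}},
    1: {("brown",): {"bear": 1.0}},
}

print(predict_next(["the", "quick", "brown"], toy_tables))
print(predict_next(["a", "brown"], toy_tables))
```

In the first call no 3-word context exists, so the lookup falls back to the 2-word context. In the second call it falls back once more to a single word.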
Create n-grams from the provided text.
For example, 2-gram tokens are combinations of 2 consecutive words as found in the text.
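As a rough illustration of n-gram creation, the sketch below counts every run of `n` consecutive words in a piece of text. The function name `ngram_counts` and the whitespace tokenisation are assumptions for this example and may differ from what the application does.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count each run of n consecutive words in `text`."""
    words = text.split()  # whitespace tokenisation is an assumption
    return Counter(
        tuple(words[i:i + n]) for i in range(len(words) - n + 1)
    )


bigrams = ngram_counts("to be or not to be", 2)
for phrase, count in bigrams.most_common():
    print(" ".join(phrase), count)
```

Here the 2-gram "to be" occurs twice, while every other 2-gram occurs once.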
As the provided text is very large, we sampled 10% of it to form the n-grams.
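One way such a sample could be drawn is sketched below. The text does not say how the sampling was done, so sampling whole lines at random, the fixed seed, and the name `sample_lines` are all assumptions made for illustration.

```python
import random


def sample_lines(lines, fraction=0.10, seed=2014):
    """Return a random sample containing `fraction` of the given lines.

    Sampling by line and using a fixed seed are assumptions; the seed
    only makes the sketch repeatable.
    """
    rng = random.Random(seed)
    sample_size = int(len(lines) * fraction)
    return rng.sample(lines, sample_size)


corpus = [f"line {i}" for i in range(1000)]
subset = sample_lines(corpus)
print(len(subset))
```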
Establish probabilities for the various phrases for the different contexts.
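A simple way to obtain such probabilities is the relative frequency of each final word given the words before it. The sketch below assumes that estimate; the text does not state which estimate the application uses. The name `next_word_probabilities` and the toy counts are hypothetical.

```python
from collections import Counter, defaultdict


def next_word_probabilities(counts):
    """Turn n-gram counts into P(last word | preceding words).

    Uses a plain relative-frequency estimate, which is an assumption.
    """
    context_totals = Counter()
    for ngram, count in counts.items():
        context_totals[ngram[:-1]] += count

    table = defaultdict(dict)
    for ngram, count in counts.items():
        context = ngram[:-1]
        table[context][ngram[-1]] = count / context_totals[context]
    return dict(table)


# Invented counts, for illustration only.
toy_counts = Counter({
    ("quick", "brown", "fox"): 3,
    ("quick", "brown", "dog"): 1,
})
probs = next_word_probabilities(toy_counts)
print(probs[("quick", "brown")])
```

With these toy counts, "fox" follows "quick brown" in 3 of 4 cases, so it gets probability 0.75 and would be the predicted word.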
Before the n-grams were created, the text was cleaned.
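The cleaning steps are not listed here, so the sketch below only shows what a typical clean-up might look like. Lower-casing, removing everything except letters and apostrophes, and collapsing whitespace are assumptions, as is the name `clean_text`.

```python
import re


def clean_text(text):
    """Apply an assumed set of cleaning steps before building n-grams."""
    text = text.lower()
    # Replace anything that is not a letter, apostrophe or whitespace.
    text = re.sub(r"[^a-z'\s]", " ", text)
    # Collapse runs of whitespace left behind by the previous step.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Hello, World!  It's 2014."))
```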
We show the comparison of Phrase Count versus Probabilities for the different n-grams created from the provided text.
The application can be accessed at the following links.
http://partha6369.shinyapps.io/PredictText
or
The application has only one control which can be altered.
In the text box on the left-hand side of the screen, enter any text.
On the right, below the label “Predicted Text”, the predicted word will appear.