To use: type in an incomplete phrase of text and press the Predict button
The three most likely next words will then show up on the screen
Algorithm
The prediction model uses 4-grams, 3-grams and 2-grams tokenized from cleaned training data
The Maximum Likelihood Estimate (MLE) of the next probable word following the input phrase is calculated from the occurrences of the matching n-gram phrases in the training data, backing off from longer to shorter n-grams
Interpolation of the MLEs for matching 4-grams (if any), 3-grams (if any) and 2-grams (if any) is used
If no matching n-grams are found, the model outputs “it”
When no matching n-grams exist, the model simply predicts the 3 most common words; if fewer than 3 words are available, the missing ones are replaced with NA.
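
The prediction steps above can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the function names count_ngrams and predict, the interpolation weights in WEIGHTS, and the top_words fallback list are all assumptions introduced here, and None stands in for NA.

```python
from collections import Counter, defaultdict

# Hypothetical interpolation weights for 4-, 3- and 2-gram estimates
# (assumed values; the source does not state the weights used).
WEIGHTS = {4: 0.5, 3: 0.3, 2: 0.2}


def count_ngrams(sentences, orders=(2, 3, 4)):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    tables = {n: defaultdict(Counter) for n in orders}
    for sentence in sentences:
        tokens = sentence.split()
        for n in orders:
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[n][context][tokens[i + n - 1]] += 1
    return tables


def predict(phrase, tables, top_words, k=3):
    """Return k next-word guesses, padded with top_words and then None (NA)."""
    tokens = phrase.lower().split()
    scores = Counter()
    for n, followers_by_context in tables.items():
        if len(tokens) < n - 1:
            continue  # phrase too short for this n-gram order
        context = tuple(tokens[-(n - 1):])
        followers = followers_by_context.get(context)
        if not followers:
            continue  # no matching n-gram of this order
        total = sum(followers.values())
        for word, count in followers.items():
            # MLE = count(context + word) / count(context), weighted by order.
            scores[word] += WEIGHTS[n] * count / total
    # Highest score first; ties broken alphabetically for stable output.
    ranked = [w for w, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]
    for word in top_words:
        if len(ranked) >= k:
            break
        if word not in ranked:
            ranked.append(word)
    ranked += [None] * (k - len(ranked))
    return ranked[:k]


corpus = ["i want to go home", "i want to go out",
          "i want to eat", "you want to go home"]
tables = count_ngrams(corpus)
print(predict("i want to", tables, top_words=[]))
print(predict("zzz", tables, top_words=["the", "to"]))
```

In this sketch every matching n-gram order contributes to a word's score, so longer matches simply carry more weight; a strict back-off would instead stop at the longest order that has a match.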
Accuracy and Memory Usage
Accuracy: 15.5%
Memory usage: 27 MB
Average prediction time: 60 s
Data Used
Only the final/en_US data files with blog, news and twitter data were used
A 5% random sample of lines was taken from each of the blogs, news and twitter data files (to ensure similar coverage of all genres), and the samples were combined
Preprocessing cleaned the text by converting it to lower case and removing numbers, various special characters and punctuation.
To reduce memory usage, 3-grams and 4-grams occurring only once and 2-grams occurring fewer than 3 times were omitted.
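
The data preparation described above could look something like the sketch below. It is only an illustration under stated assumptions: the helper names (sample_lines, clean_text, count_ngrams, prune), the fixed random seed, and the exact cleaning rule (keep letters and spaces only) are hypothetical and not taken from the original project.

```python
import random
import re
from collections import Counter

# Minimum counts kept per n-gram order, following the thresholds in the text:
# 2-grams need at least 3 occurrences, 3- and 4-grams at least 2.
MIN_COUNT = {2: 3, 3: 2, 4: 2}


def sample_lines(lines, fraction=0.05, seed=42):
    """Take a random sample of lines; the seed is an assumed value for repeatability."""
    rng = random.Random(seed)
    return rng.sample(lines, int(len(lines) * fraction))


def clean_text(line):
    """Lower-case a line and drop numbers, punctuation and special characters."""
    line = line.lower()
    line = re.sub(r"[^a-z\s]+", " ", line)  # assumed rule: keep letters and whitespace only
    return re.sub(r"\s+", " ", line).strip()


def count_ngrams(lines, n):
    """Count every run of n consecutive words across the cleaned lines."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


def prune(counts, n):
    """Drop rare n-grams to reduce memory usage."""
    return Counter({gram: c for gram, c in counts.items() if c >= MIN_COUNT[n]})


lines = [clean_text(s) for s in ["A b c!", "a B c 42", "a b, c", "a b d"]]
bigrams = prune(count_ngrams(lines, 2), 2)
trigrams = prune(count_ngrams(lines, 3), 3)
print(bigrams)
print(trigrams)
```

Sampling each source file separately with sample_lines and then concatenating the results would keep the three genres in similar proportion, as the text describes.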