Next Word Prediction

Avnish
20-04-2020

A text prediction app to predict the next word following a phrase of input text
Prediction model uses n-grams( 2-, 3- and 4-grams) developed from training data
The application is hosted on shinyapps.io at: https://avnishsingh.shinyapps.io/NextWordPrediction/
To use - type in an incomplete phrase of text and press the Predict button
next three possible word which will follow will show up on the screen

The prediction model uses 4-grams, 3-grams and 2-grams tokenized from cleaned training data
The Maximum Likelihood Estimate (MLE) of the next probable word following the input phrase is calculated using back off of occurrences of the matching n-gram phrases in the training data
Interpolation of the MLEs for matching 4-grams (if any), 3-grams (if any) and 2-grams (if any) is used -If no match n-grams found then it will output “it”
In case no matching n-grams exist, the model simply predicts the 3 most common words which will follow if there aren't 3 words then it will replace it with NA.

Only the final/en_US data files with blog, news and twitter data were used
5% random sample of lines from each of the blogs, news and twitter data files was taken (to ensure similar coverage of all genres) and combined
Preprocessing included cleaning text - lower case, removed numbers, various special characters and punctuation.
To reduce memory usage, 3- and 4- grams occurring only once, 2-grams occurring less than 3 times were omitted.