English text data from three sources (blogs, Twitter, and news) were used in this project. The data were sampled, preprocessed, and tokenized using well-known R text-mining packages.
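A minimal sketch of the sampling and tokenization step is shown below. It assumes the tidytext and dplyr packages and illustrative file names and sampling fraction; the project's actual package choices and parameters may differ.

```r
library(dplyr)
library(tidytext)

set.seed(123)

# Read a text file and keep a random fraction of its lines
read_sample <- function(path, frac = 0.05) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, size = floor(length(lines) * frac))
}

corpus <- tibble(
  text = c(read_sample("en_US.blogs.txt"),
           read_sample("en_US.twitter.txt"),
           read_sample("en_US.news.txt"))
)

# Build an n-gram frequency table (here: 5-grams)
ngrams_5 <- corpus %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 5) %>%
  count(ngram, sort = TRUE)
```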
A next-word prediction model was developed by optimizing for both accuracy and efficiency. A simple backoff model handles word sequences that may not appear in the corpora: if the 5-grams (based on the last 4 words) yield no suggestion, the model backs off to 4-grams, then 3-grams, then 2-grams. If none of these match, it returns the highest-probability 1-gram, "the". The model also returns a second and a third guess.
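The sketch below illustrates this backoff lookup. It assumes pre-built frequency tables (ngrams_5 through ngrams_2) with columns prefix, next_word, and n; the table and column names are illustrative, not the project's actual objects.

```r
library(dplyr)

predict_next <- function(input, top = 3) {
  words <- unlist(strsplit(tolower(trimws(input)), "\\s+"))
  tables <- list(`5` = ngrams_5, `4` = ngrams_4, `3` = ngrams_3, `2` = ngrams_2)

  # Try the longest context first (last 4 words), then back off
  for (k in c(5, 4, 3, 2)) {
    ctx_len <- k - 1
    if (length(words) < ctx_len) next
    ctx <- paste(tail(words, ctx_len), collapse = " ")
    hits <- tables[[as.character(k)]] %>%
      filter(prefix == ctx) %>%
      arrange(desc(n)) %>%
      head(top)
    if (nrow(hits) > 0) return(hits$next_word)
  }
  "the"  # fall back to the most frequent unigram
}
```

The first, second, and third guesses correspond to the top three next words found at the longest matching n-gram order.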
An interactive Shiny app was developed and deployed to shinyapps.io. As the user enters text, it predicts the next word together with a second and third guess. It also displays plots and data for the n-grams selected by the user.
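A rough skeleton of the Shiny UI/server pair, assuming the predict_next() sketch above; widget names are illustrative and the plotting/data panels are omitted.

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("guesses")
)

server <- function(input, output, session) {
  output$guesses <- renderText({
    req(input$phrase)
    # Show the top prediction plus the 2nd and 3rd guesses
    paste(predict_next(input$phrase, top = 3), collapse = " | ")
  })
}

shinyApp(ui, server)
```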