This project involves developing an algorithm that predicts the next word in real time as the user types.
The algorithm learns from a collection of news articles, blog posts, and Twitter data.
The output is provided through a Shiny app, where the user can enter a word or sentence and the predicted next word is shown in real time.
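A minimal sketch of how such a Shiny front end could be wired up is shown below; the predict_next_word() helper here is a hypothetical placeholder standing in for the trained model described in the later sections.

```r
library(shiny)

# Hypothetical placeholder for the trained model: returns a fixed word so the
# app runs on its own; the real prediction model is described later
predict_next_word <- function(phrase) {
  if (nchar(trimws(phrase)) == 0) return("")
  "the"
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a word or sentence:"),
  h4("Predicted next word:"),
  verbatimTextOutput("prediction")
)

server <- function(input, output) {
  # Re-run the prediction whenever the text input changes
  output$prediction <- renderText(predict_next_word(input$phrase))
}

shinyApp(ui = ui, server = server)
```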
Preparing data from documents
As the first step, the text documents (blog, news, and Twitter data) are read as input
They are cleaned: converted to lower case, with numbers, punctuation, and profanities removed (see the sketch after the package list below)
This normalizes the words and prepares the corpus
R packages used: quanteda, tm, RWeka, wordcloud (visualization)
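As an illustration, the cleaning step might look roughly like the following using the tm package; the file names and the profanity list here are assumptions, not the project's actual inputs.

```r
library(tm)

# Assumed file names for the three sources; replace with the actual paths
lines <- c(readLines("en_US.blogs.txt",   skipNul = TRUE),
           readLines("en_US.news.txt",    skipNul = TRUE),
           readLines("en_US.twitter.txt", skipNul = TRUE))

# Hypothetical profanity list; in practice a published word list would be used
profanity <- c("badword1", "badword2")

corpus <- VCorpus(VectorSource(lines))

# Normalize: lower case, then strip numbers, punctuation, and profanities
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, profanity)
corpus <- tm_map(corpus, stripWhitespace)
```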
Model Algorithm
The concept of Markov chains is used to generate word chains of various lengths (n-grams); n-grams of up to length 4 are generated
These are trained using a Naive Bayes algorithm, with the last word of each n-gram as the predicted word
Laplace smoothing is included in the training model, so that word chains that do not appear in the training corpus still receive a non-zero probability (see the sketch below)
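A simplified sketch of the n-gram counting and add-one (Laplace) smoothing, using quanteda, is shown below. The clean_text vector is assumed to hold the cleaned text from the previous step, and the laplace_prob() helper is a hypothetical illustration of the smoothing on bigrams only.

```r
library(quanteda)

# `clean_text` is assumed to be the cleaned character vector from the previous step
toks <- tokens(clean_text)

# Count n-grams of length 1 to 4, with the words in each n-gram joined by a space
ngram_counts <- lapply(1:4, function(n) {
  colSums(dfm(tokens_ngrams(toks, n = n, concatenator = " ")))
})

vocab_size <- length(ngram_counts[[1]])

# Add-one (Laplace) smoothed probability of `word` following a one-word `prefix`;
# unseen bigrams get a small non-zero probability instead of zero
laplace_prob <- function(prefix, word) {
  bigram_count <- ngram_counts[[2]][paste(prefix, word)]
  prefix_count <- ngram_counts[[1]][prefix]
  bigram_count <- ifelse(is.na(bigram_count), 0, bigram_count)
  prefix_count <- ifelse(is.na(prefix_count), 0, prefix_count)
  unname((bigram_count + 1) / (prefix_count + vocab_size))
}
```

For example, laplace_prob("new", "york") would return a smoothed estimate of P(york | new) even if that bigram never appeared in the training corpus.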
While taking real-time input from the user, the entered words go through the same cleaning process described in the previous slides, and the trained model then predicts the next possible word
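One possible sketch of this prediction step is shown below: it applies a comparable normalization to the user's phrase and then looks up the most frequent continuation in the stored n-gram tables, backing off from 4-grams to shorter chains. This is a simplified frequency-count back-off rather than the full Naive Bayes scoring described above, and it reuses the ngram_counts object from the previous sketch.

```r
# Simplified prediction sketch: clean the input like the training text, then
# look up the most frequent continuation, backing off from 4-grams to bigrams
predict_next_word <- function(phrase) {
  words <- tolower(gsub("[[:punct:][:digit:]]", "", phrase))
  words <- strsplit(trimws(words), "\\s+")[[1]]

  for (n in 4:2) {
    if (length(words) >= n - 1) {
      prefix  <- paste(tail(words, n - 1), collapse = " ")
      matches <- ngram_counts[[n]][startsWith(names(ngram_counts[[n]]),
                                              paste0(prefix, " "))]
      if (length(matches) > 0) {
        best <- names(which.max(matches))
        return(tail(strsplit(best, " ")[[1]], 1))  # last word of the best n-gram
      }
    }
  }
  names(which.max(ngram_counts[[1]]))  # fall back to the most common unigram
}
```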