Sriram Vadlamani
20 September, 2016
For this project, a corpus of documents was provided, drawn from news, blogs and Twitter. The corpus was too large to analyze in full on a typical developer laptop.
During the course of the project, the files were cleaned, analyzed and sampled.
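The cleaning and sampling step can be illustrated with a minimal Python sketch. The exact cleaning rules and sampling rate used in the project are not stated here, so the function names `sample_lines` and `clean`, the fixed seed, and the rule of keeping only letters and apostrophes are all assumptions made for illustration.

```python
import random

def sample_lines(lines, fraction, seed=2016):
    """Keep each line independently with probability `fraction`.

    The seed is an arbitrary, assumed value; it only makes the sample repeatable.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def clean(line):
    """Lowercase a line and keep only letters, apostrophes and spaces.

    This is an assumed cleaning rule, not necessarily the one used in the project.
    """
    kept = "".join(ch if ch.isalpha() or ch in "' " else " " for ch in line.lower())
    # Collapse runs of whitespace left behind by removed characters.
    return " ".join(kept.split())
```

Sampling before cleaning keeps the expensive text processing limited to the lines that will actually be used, which matters when the full corpus does not fit comfortably in memory.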
Initially, stop words were removed, as many natural language processing methods suggest. However, it soon became clear that stop words are needed for good predictions, because the model must be able to predict stop words as well.
The prediction model is based on Markov chains.
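As a rough illustration of the Markov chain idea, the sketch below counts how often each word follows another and predicts the most frequent follower. It is a first-order toy version in Python, not the project's actual code; the names `build_bigram_model` and `predict_next` and the three-sentence corpus are hypothetical.

```python
from collections import Counter, defaultdict

def build_bigram_model(sentences):
    """Count word-to-next-word transitions (a first-order Markov chain)."""
    transitions = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            transitions[current][following] += 1
    return transitions

def predict_next(transitions, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = transitions.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical toy corpus, for illustration only.
toy_corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a dog sat on the rug",
]
toy_model = build_bigram_model(toy_corpus)
print(predict_next(toy_model, "the"))  # "cat" follows "the" most often here
print(predict_next(toy_model, "sat"))  # the stop word "on" is a valid prediction
```

Note that the second prediction is itself a stop word, which is why stop words have to stay in the training text.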
The Shiny app built along with this presentation takes text as input. The text you enter must come from the given corpus. If you type one word in the text box, the next word is predicted for you.
As you type more words, the prediction takes into account the previously typed words (up to 4).
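One way to use up to 4 previous words is to try the longest available context first and fall back to shorter ones when it has not been seen. The Python sketch below shows that idea under stated assumptions: the back-off strategy, the names `build_model` and `predict`, and the toy corpus are illustrative and are not taken from the app itself.

```python
from collections import Counter, defaultdict

MAX_CONTEXT = 4  # number of previous words considered, as described above

def build_model(sentences, max_context=MAX_CONTEXT):
    """Map every context of 1..max_context words to counts of the word that follows."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(1, len(words)):
            for n in range(1, max_context + 1):
                if i - n < 0:
                    break  # not enough preceding words for a context of length n
                model[tuple(words[i - n:i])][words[i]] += 1
    return model

def predict(model, text, max_context=MAX_CONTEXT):
    """Predict the next word, trying the longest context first (assumed back-off)."""
    words = text.lower().split()
    for n in range(min(max_context, len(words)), 0, -1):
        counts = model.get(tuple(words[-n:]))
        if counts:
            return counts.most_common(1)[0][0]
    return None  # nothing typed, or the last word never appeared in the corpus

# Hypothetical toy corpus, for illustration only.
toy_corpus = [
    "i want to go home",
    "i want to go home",
    "we have to go to the store",
]
toy_model = build_model(toy_corpus)
print(predict(toy_model, "go"))             # one word of context
print(predict(toy_model, "we have to go"))  # four words of context
```

With only "go" typed, the sketch picks the most common follower overall; with "we have to go" typed, the longer context narrows the choice to what followed that exact phrase.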