Next Word Prediction

Sriram Vadlamani
20 September, 2016

Methodology

For this project, a corpus of documents from three sources was provided: news, blogs and Twitter. The corpus was too large to analyze in full on a developer laptop.

During the course of the project, the files were cleaned, analyzed and sampled.

Files were parsed, lower-cased and pre-processed.
Files were randomly sampled.
N-grams (bi-grams, tri-grams and quad-grams) were ranked.
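
The three steps above can be sketched as follows. This is a minimal illustration in Python, not the code used in the project: the function names (`tokenize`, `sample_lines`, `ranked_ngrams`), the tokenizing rule, and the toy corpus are all assumptions, and ranking is assumed to mean ordering by frequency.

```python
import random
import re
from collections import Counter


def tokenize(line):
    """Lowercase a line and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", line.lower())


def sample_lines(lines, fraction, seed=0):
    """Keep each line independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def ranked_ngrams(lines, n):
    """Count every n-gram in the lines and return them most frequent first."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts.most_common()


# Toy corpus standing in for the news, blogs and Twitter files.
corpus = ["I love New York", "i love new shoes", "We love New York"]
sample = sample_lines(corpus, fraction=1.0)  # keep everything in this tiny example
bigrams = ranked_ngrams(sample, 2)
```

On a corpus too large for one machine, a small `fraction` keeps the n-gram tables manageable; the fixed `seed` only makes the sample reproducible.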

Challenges

Initially, stop words were removed, as suggested in many natural language processing methods. However, it soon became clear that stop words are needed for good predictions, because the model should be able to predict stop words as well.

Algorithms used

The algorithm used for this prediction model is a Markov chain.
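
As a rough illustration of the idea, a first-order Markov chain treats the current word as the state and estimates the probability of each next word from bi-gram counts. The sketch below is in Python and makes its own assumptions: `build_transitions` and `next_word_probabilities` are hypothetical names, and the training sentences are invented.

```python
from collections import Counter, defaultdict


def build_transitions(sentences):
    """Count how often each word is followed by each other word."""
    transitions = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            transitions[current][following] += 1
    return transitions


def next_word_probabilities(transitions, word):
    """Turn the follower counts of `word` into transition probabilities."""
    counts = transitions.get(word.lower())
    if not counts:
        return {}  # word never seen as a state
    total = sum(counts.values())
    return {follower: count / total for follower, count in counts.items()}


# Invented training data, for illustration only.
training = ["the cat sat", "the cat ran", "the dog sat"]
transitions = build_transitions(training)
probs = next_word_probabilities(transitions, "the")
prediction = max(probs, key=probs.get)  # most probable next word
```

Here "the" is followed by "cat" twice and "dog" once, so the chain assigns "cat" a probability of 2/3 and predicts it.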

Shiny App

The Shiny app built along with this presentation takes text as input. The text you enter has to come from the given corpus. If you type one word in the text box, the next word is predicted for you.

As you type more words, the prediction takes into account the previously typed words (up to 4).
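
One common way to use a variable number of previous words is to look up the longest available context first and fall back to shorter ones. The Python sketch below shows that back-off idea under stated assumptions: it is not the app's code, `build_model` and `predict_next` are hypothetical names, the sentences are invented, and `max_order` is a parameter. With `max_order=4` (quad-grams) this sketch matches at most three preceding words.

```python
from collections import Counter, defaultdict


def build_model(sentences, max_order=4):
    """Map each context of 1 to max_order - 1 words to a Counter of next words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(2, max_order + 1):
            for i in range(len(words) - n + 1):
                context = tuple(words[i:i + n - 1])
                model[context][words[i + n - 1]] += 1
    return model


def predict_next(model, text, max_order=4):
    """Try the longest context first, then back off to shorter ones."""
    words = text.lower().split()
    for size in range(min(max_order - 1, len(words)), 0, -1):
        context = tuple(words[-size:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None  # no context of any length was seen in training


# Invented sentences, for illustration only.
sentences = [
    "thank you very much",
    "thank you for coming",
    "thank you very much indeed",
    "see you soon",
]
model = build_model(sentences)
one_word = predict_next(model, "you")
three_words = predict_next(model, "thank you very")
backed_off = predict_next(model, "hello see you")
```

The last call shows the fallback: the three-word context "hello see you" was never seen, so the lookup drops to "see you". Returning `None` for unseen input mirrors the requirement that the entered text come from the corpus.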