November 24, 2017

Next Word Prediction in R

This project was done as part of the Capstone project offered by Johns Hopkins University on Coursera.org.

Natural language processing is highly relevant, and challenging, in today's era of heavy reliance on the IoT (Internet of Things). The course provided a large amount of text data collected from Twitter, blogs, and news. Such a collection of texts is called a corpus.

A language model is a model that computes either the probability of a sequence of words or the probability of the nth word given the previous (n-1) words. The probability of a sequence of words W consisting of w1, w2, …, wn can be determined using various models (the chain-rule factorization they all approximate is sketched after the list below). In the following two slides, I discuss the two modeling methods I implemented to predict the next word:

  • Markov chain
  • Kneser-Ney
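For concreteness, the textbook chain-rule factorization that both methods approximate (standard notation, not taken from the original slides) is:

```latex
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
\;\approx\; \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})
\quad \text{(trigram Markov assumption)}
```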

Markov Chain Modeling

Under the Markov assumption, the probability of the next word is computed by considering only the last few words in the sequence, instead of the entire sequence. This is the basis for the MLE (maximum likelihood estimate). Consider the example sentence: "The first argument can be a list of data". Instead of considering all 9 words to predict the 10th word, typically only the last few words are considered. Perhaps only 'of data' (called a bigram, a sequence of 2 words) or 'list of data' (called a trigram, a sequence of 3 words) is used to predict the next word.
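To make the MLE idea concrete, here is a minimal sketch in base R using the toy sentence above; the names (predict_mle, bigram_counts) are my own illustrative choices, not this project's actual code:

```r
# Minimal sketch of maximum-likelihood next-word prediction in base R.
# Assumes `words` is a character vector of tokens from the corpus.
words <- c("the", "first", "argument", "can", "be", "a", "list", "of", "data")

# Bigram counts: how often each word follows the previous one.
bigrams <- paste(head(words, -1), tail(words, -1))
bigram_counts <- table(bigrams)

# MLE prediction: given a context string, pick the most frequent continuation.
predict_mle <- function(context, counts) {
  matches <- counts[startsWith(names(counts), paste0(context, " "))]
  if (length(matches) == 0) return(NA_character_)         # unseen context
  tail(strsplit(names(which.max(matches)), " ")[[1]], 1)  # last word of best n-gram
}

predict_mle("of", bigram_counts)  # "data" in this toy corpus
```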

The advantage of this method is its computational simplicity. The disadvantage is that it assigns a probability of zero to unseen n-grams (please see the references on the last slide for more details).

In my code I used at most trigram prediction. If this did not give a likely prediction, I backed off to bigram and then unigram prediction to obtain the most likely next word.
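A sketch of that back-off logic, reusing the predict_mle helper from the sketch above and assuming trigram_counts and unigram_counts tables built the same way (all names illustrative, not the project's actual code):

```r
# Back off from trigram to bigram to unigram, returning the first hit.
# `context` holds the preceding words, e.g. c("a", "list", "of").
predict_backoff <- function(context, trigram_counts, bigram_counts, unigram_counts) {
  n <- length(context)
  if (n >= 2) {
    hit <- predict_mle(paste(context[n - 1], context[n]), trigram_counts)
    if (!is.na(hit)) return(hit)   # trigram match: last two words as context
  }
  if (n >= 1) {
    hit <- predict_mle(context[n], bigram_counts)
    if (!is.na(hit)) return(hit)   # bigram match: last word as context
  }
  names(which.max(unigram_counts)) # unigram fallback: most frequent word overall
}
```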

Kneser-Ney Modeling

Kneser-Ney modeling depends on a concept called discounting. A part of the probability mass from the observed, higher-count n-grams is discounted and redistributed to n-grams that would otherwise have zero probability. This smooths the probability distribution. In Kneser-Ney, the lower-order factor (a continuation probability rather than a plain maximum-likelihood estimate) becomes significant when no higher-order matches are found.
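In standard notation, the textbook interpolated Kneser-Ney formula for bigrams (not copied from this project's code) is:

```latex
P_{\mathrm{KN}}(w_i \mid w_{i-1})
  = \frac{\max\!\left(c(w_{i-1} w_i) - d,\; 0\right)}{c(w_{i-1})}
  + \lambda(w_{i-1})\, P_{\mathrm{cont}}(w_i),
\qquad
P_{\mathrm{cont}}(w_i)
  = \frac{\left|\{\, w' : c(w' w_i) > 0 \,\}\right|}
         {\left|\{\, (w', w'') : c(w' w'') > 0 \,\}\right|}
```

Here d is the discount subtracted from each observed bigram count, and the weight λ(w_{i-1}) is chosen so the probabilities sum to 1. The continuation probability rewards words that follow many different contexts, not merely frequent words.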

In my Shiny app, a comparison between the two models is shown. A confidence score is also reported, based on the square root of the n-gram count.
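My guess at the shape of such a count-based score, purely illustrative (the scaling constant is arbitrary and not from the app):

```r
# Map an n-gram count to a bounded confidence score; the scaling constant
# is arbitrary and just illustrates the square-root-of-count idea.
confidence <- function(count, scale = 10) min(1, sqrt(count) / scale)
confidence(25)  # 0.5
```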

Shiny App

My Shiny App and references