Predicting Next Word

Levi Brackman
June 5th 2016

The Task

This slide deck is part of the Data Science Capstone, which included the following:

  • Clean blog, Twitter and news data
  • Create predictive models that will predict the next word of a phrase or given word
  • Create a data product in the form of a Shiny app that takes an input and replies with a predicted next word.

Algorithm used

The algorithm I used is stupid backoff with a few modifications. It detects how many words are entered in the input box and gives results accordingly.

If, for example, only one word is entered in the input box, the algorithm makes a very simple prediction: it looks at every occurrence of that word in the corpus and returns the most likely next word. If there is more than one word, we look for the two words in the trigram corpus and see if we can find the next word.

We then take the second word and look for it in the bigrams to see if we can find a word that follows it in that corpus. If neither lookup finds a match, we simply return the most common words in the entire corpus. If matches are found, we rank them by how probable it is that each word would be the next word.
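
The lookup order described above can be sketched in Python. This is only an illustration under stated assumptions, not the code behind the app: the names build_counts and predict_next are hypothetical, the input is assumed to be lower-cased and split on whitespace, the last two words are used when more than two are entered, and the bigram lookup is assumed to run only when the trigram lookup finds nothing.

```python
from collections import Counter


def build_counts(tokens):
    """Count unigrams, bigrams and trigrams in a list of tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return unigrams, bigrams, trigrams


def predict_next(phrase, unigrams, bigrams, trigrams, top_n=3):
    """Return up to top_n candidate next words for the phrase."""
    words = phrase.lower().split()
    candidates = Counter()

    # Two or more words: look for the last two words in the trigram counts.
    if len(words) >= 2:
        w1, w2 = words[-2], words[-1]
        for (a, b, c), n in trigrams.items():
            if (a, b) == (w1, w2):
                candidates[c] += n

    # Nothing found (or only one word given): look for the last word
    # in the bigram counts.
    if not candidates and words:
        last = words[-1]
        for (a, b), n in bigrams.items():
            if a == last:
                candidates[b] += n

    # Still nothing: fall back to the most common words in the corpus.
    if not candidates:
        candidates = unigrams

    return [word for word, _ in candidates.most_common(top_n)]


# Toy corpus, made up purely for illustration.
toy_tokens = "the cat sat on the mat the cat ate the fish".split()
toy_unigrams, toy_bigrams, toy_trigrams = build_counts(toy_tokens)
print(predict_next("on the", toy_unigrams, toy_bigrams, toy_trigrams, top_n=1))
```

Here candidates are ranked by raw counts. Within a single lookup every candidate shares the same denominator, so this gives the same order as ranking by the probabilities described on the next slide.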

How the Probabilities Work

I will use the words “barack obama” as an example. The probability that word i-1 (barack) is followed by word i (obama) = [Num times we saw word i-1 (barack) followed by word i (obama)] / [Num times we saw word i-1 (barack)]. When we are looking for the next word after an input such as obama, we look at the number of times we saw obama followed by a given word / the number of times we saw obama. We then multiply that by 0.4 as suggested by stupid backoff.

In a trigram (e.g. barack hussein obama) it works as follows: the probability that we saw word i-2 (barack) followed by word i-1 (hussein) followed by word i (obama) = [Num times we saw the three words in order (barack hussein obama)] / [Num times we saw word i-2 (barack) followed by word i-1 (hussein)]. I multiply that by 0.4. I then ranked the probabilities and returned the top matches, with the number of matches chosen by the user in the app.
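
As a rough worked example of the two ratios above, the following Python sketch computes the bigram and trigram scores from count tables and ranks the results. The function names and all counts are hypothetical, invented only to make the arithmetic concrete. Following the text, the 0.4 multiplier is applied to every score; in stupid backoff as usually described, the 0.4 factor is applied only when backing off to a shorter n-gram, so treat this as a sketch of the description here rather than a reference implementation.

```python
BACKOFF_WEIGHT = 0.4  # the multiplier mentioned in the text


def bigram_score(prev_word, word, unigram_counts, bigram_counts):
    """0.4 * count(prev_word word) / count(prev_word)."""
    seen_prev = unigram_counts.get(prev_word, 0)
    if seen_prev == 0:
        return 0.0
    return BACKOFF_WEIGHT * bigram_counts.get((prev_word, word), 0) / seen_prev


def trigram_score(w1, w2, word, bigram_counts, trigram_counts):
    """0.4 * count(w1 w2 word) / count(w1 w2)."""
    seen_pair = bigram_counts.get((w1, w2), 0)
    if seen_pair == 0:
        return 0.0
    return BACKOFF_WEIGHT * trigram_counts.get((w1, w2, word), 0) / seen_pair


def rank(scores, top_n):
    """Return the top_n words, highest score first."""
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical counts, for illustration only.
unigram_counts = {"barack": 10}
bigram_counts = {("barack", "obama"): 8, ("barack", "hussein"): 2}
trigram_counts = {("barack", "hussein", "obama"): 2}

# Score each candidate next word after "barack", then keep the best one.
scores = {
    word: bigram_score("barack", word, unigram_counts, bigram_counts)
    for word in ("obama", "hussein")
}
print(rank(scores, top_n=1))
```

With these made-up counts, the score for obama after barack is 0.4 * 8 / 10, and the score for obama after barack hussein is 0.4 * 2 / 2.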

How the Probabilities Work Cont.

The app uses the stupid backoff algorithm, but I have modified it slightly so that it works more efficiently in the app. In addition, I found that the simple unigram model was very effective without having to use sophisticated probabilities.

How the App Works

This app is incredibly simple. You enter a word or a phrase in the input box, choose how many of the top matches you want to see, and click submit. The predicted next words then appear as a string of words. The app can be found here: https://levi.shinyapps.io/prediction/