Markov Chains

J Faleiro
Jul 18 2016

Predicting Sequencing of Events using Sample Data and Markov Chains

This Case Study

On this case study we leverage a corpus of about XXX words, extracted from different media:

  • Twitter: 160Mb, 30M words
  • Blogs: 200Mb, 37M words
  • News: 196Mb, 34M words

On this case study we develop a model to predict the next word you will type, in real-time, based on Markov Chains.

This model is based on the assumption that the probability of a word depends on previous words already in a sentence (a Markov assumption). Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.[1]

Process

Challenges

  • Relativelly large amount of data, given the available platform for this exercise
  • Required real time response, interactive utilization

Method

  • Random sampling of the original data, to about 30% of the original data
  • Parsing of n-grams, and assembly of markov chains. For each sentence \( s \) we calculate the conditional probability of word \( w \) by approximation to

    \( P(w|s) = C(paste(s,w)) / C(s) \)

    where \( P(A|B) \) is a conditional probability of A given B and \( paste(s_1, s_2, ..., s_n) \) is a string concatenation function

Process (2)

  • This conditional probability provides the fitness for each word given previous words already in the sentence
  • Arrange each dataset for a N-gram N separatelly, and depending on the size of sentence, we select the few small items with greater fitness on that dataset
  • These items are presented to the user, that can then either continue typing, or select one of the alternatives already presented.

Findings

  • There is a substantial difference in the structure of the sentence and as a consequence sample error level depending on the source of the data.
  • Increasing N for n-grams greater than 4 did not change the sample error level.
  • Increasing the sampling rate reduced substantially the sample error level.
  • For samples rates above 30% we saw degradation in performance without a noticeable change in the sample error level

Conclusion

Future Work

  • Add filtering for words that are not appropriate
  • Look into alternatives that would consider semantics of words within a sentence as additional features

Go ahead, give it a try

https://jfaleiro.shinyapps.io/markovchains/

(the app takes about 30 seconds to load, please be patient)

References