Markov Models

Introduction

Andrey Markov
1856 - 1922

Andrey Markov was a Russian mathematician, that through his exploration of the weak law of large numbers, fathered what are now known as Markov Models.

A Markov Model is a stochastic model used to examine randomly changing systems. In these systems, it is assumed that future states depend only on the current state, not on the events that occurred before it, thus exhibiting the Markov Property.

There are four common Markov Models:

Markov Chains

Hidden Markov Models

Markov Decision Processes
Partially Observable Markov Decision Processes

The simplest way to illustrate the Markov Property is with the first common Markov Model: Markov Chains

Markov Chains

Specifications

Formally a Markov Chain is Specified by:

Set of States:
\(Q=\{q_1,q_2,...q_n\}\)
Vector of prior probabilities:
\(\Pi = \{\pi_1, \pi_2,...\pi_n\}\) \(where\) \(a_{ij} = P[S_0 = q_i]\)
Matrix of transition proabilities:
\(A = \{a_{ij}\},i = [1...n],j = [1...n]\) \(where\) \(a_{ij} = P[S_t=q_j|S_{t-1} = q_i]\)

Markov Chains Satisfy the Following Properties:

Probability axioms, i.e., sum of all probabilites should be equal to 1.
\(\sum_i \pi_i = 1\) \(and\) \(\sum_j a_{ij} = 1\)
Markov Property:
\(P[S_t = q_j|S_{t-1}=q_i]\)

Markov Chain Example

Weather Example

Set of States: Sunny\((q_1)\), Cloudy\((q_2)\), and Rainy\((q_3)\)
Vector of Probabilies: In this case \(\pi\) or the current weather forecast
Transition Matrix: \(A\), in this case based on historical observation

What is the forecast for the day after tomorrow?

\(A * \pi = \left[\begin{array}{rrr}0.8 & 0.2 & 0.3 \\0.1 & 0.6 & 0.2 \\0.1 & 0.2 & 0.5\end{array}\right]*\left[\begin{array} 00.75 \\ 0.2 \\0.05\end{array}\right] = \left[\begin{array}{rrr}0.655 \\ 0.205 \\ 0.140 \end{array}\right]\)

What are the chances it is going to be 3 sunny days followed by 2 rainy days?

\(Q = s,s,s,r,r\)
\(P(Q) = \pi_1(q_{11})(q_{11})(q_{13})(q_{33}) = (0.75)(0.8)(0.8)(0.3)(0.5) = 0.072\)

How likely is it to get 4 sunny days in a row?

\(P(d_i) = a^{d-1}_{ii}(1-a_{ii}) = 0.8^3(1-0.8) = 0.1024\)

Hidden Markov Model

Central Issues with Hidden Markov Models (HMM):

Evaluation Problem:
Model \((\theta)\) & seq. of visible symbols \((V^T)\). Need to find \(p(V^T | \theta_i)\). Use Bayes Rule \(p(V^t |\theta_i) = (p(V^T|\theta_i)p(\theta))/p(V^T)\)

Learning Problem - Baum-Welch Algorithm:
Many combinations. Use training data & specific # of hidden states to train model faster. need to estimate the Transition \((a_{ij})\) & Emission \((b_{jk})\) Probabilities using the training sequences.
Expectation Maximization \((EM)\) algorithm will be used to estimate the Transition \((a_{ij})\) & Emission \((b_{jk})\) Probabilities.

Decoding Problem - Viterbi Algorithm:
model \((\theta)\) to predict hidden states \(W^T\) which generated the visible sequence \(V^T\)

Baum-Welch

Overview

Baum Welch (Forward-Backward) Algorithm

The Baum–Welch algorithm uses the well known EM (Expectation Maximization) algorithm to find the maximum likelihood estimate of the parameters of a hidden Markov model given a set of observed feature vectors.

Overview

Set of M Hidden States (\(S^M\))
Sequence of T observations (\(V^T\)),
Transition Probability Matrix (\(A\)) - (\(a_{12}\) = probability of transition \(s_1\) to \(s_2\))

\(A = \begin{bmatrix}a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}\)

Emission Probability Matrix (\(B\)) - (\(b_{12}\) = probability while in \(s_1\) of emitting \(v_2\))

\(B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \\b_{31} & b_{32} \end{bmatrix}\)

Initial Probability Distribution (\(\Pi\)) - (probability of starting in \(s_i\))

\(\Pi = \bigg[\frac{1}{3}\frac{1}{3}\frac{1}{3}\bigg]\)

Data Used in Following Code Examples

#---set up: Transition Prob, Emission Prob, & Init Dist matrices---
a = matrix(c(0.54, 0.49, 0.46, 0.51),nrow = 2,ncol = 2)
b = matrix(c(0.16, 0.25, 0.26, 0.28, 0.58, 0.47),nrow = 2,ncol = 3)
initial_distribution = c(1/2, 1/2)

a

##      [,1] [,2]
## [1,] 0.54 0.46
## [2,] 0.49 0.51

##      [,1] [,2] [,3]
## [1,] 0.16 0.26 0.58
## [2,] 0.25 0.28 0.47

initial_distribution

## [1] 0.5 0.5

Forward Algorithm

Overview

Forward Algorithm R Code:

data = read.csv("hmm_ex_data_r.csv")

#---set up: Transition Prob, Emission Prob, & Init Dist matrices---
a = matrix(c(0.54, 0.49, 0.46, 0.51),nrow = 2,ncol = 2)
b = matrix(c(0.16, 0.25, 0.26, 0.28, 0.58, 0.47),nrow = 2,ncol = 3)
initial_distribution = c(1/2, 1/2)

#---function to implement forward algorithm-------
forward = function(v, a, b, initial_distribution){
  T = length(v)                                 # length of time series
  m = nrow(a)                                   # no. hidden states
  alpha = matrix(0, T, m)                       # set up alpha matrix
  alpha[1, ] = initial_distribution*b[, v[1]]   # initial value
  for(t in 2:T){                                # calculate remaining values
    tmp = alpha[t-1, ] %*% a
    alpha[t, ] = tmp * b[, v[t]]
  }
  return(alpha)
}

#---call forward algorithm-----------------------
alpha = forward(data$Visible,a,b,initial_distribution)

print("alpha:")

## [1] "alpha:"

head(alpha)

##             [,1]        [,2]
## [1,] 0.080000000 0.125000000
## [2,] 0.027157000 0.028154000
## [3,] 0.016506939 0.012619857
## [4,] 0.008756537 0.006593780
## [5,] 0.004616500 0.003473692
## [6,] 0.002433111 0.001830731

tail(alpha)

##                 [,1]          [,2]
## [495,] 5.123287e-220 5.131635e-220
## [496,] 1.373080e-220 1.392677e-220
## [497,] 8.258473e-221 6.306845e-221
## [498,] 4.378959e-221 3.297233e-221
## [499,] 1.034873e-221 1.034855e-221
## [500,] 6.182280e-222 4.717943e-222

Exponential decay of probabilities through time series

Backwards Algorithm

Overview

Backwards Algorithm R Code:

data = read.csv("hmm_ex_data_r.csv")

#---set up: Transition Prob & Emission Prob matrices------
a = matrix(c(0.54, 0.49, 0.46, 0.51),nrow = 2,ncol = 2)
b = matrix(c(0.16, 0.25, 0.26, 0.28, 0.58, 0.47),nrow = 2,ncol = 3)

#---BACKWARD Algorithm Function---------------------------
backward = function(V, A, B){
  T = length(V)
  m = nrow(A)
  beta = matrix(1, T, m)
  
  for(t in (T-1):1){
    tmp = as.matrix(beta[t+1, ] * B[, V[t+1]])
    beta[t, ] = t(A %*% tmp)
  }
  return(beta)
}
#---call Backward Algorithm Function----------------------
beta = backward(data$Visible,a,b)

head(beta)

##               [,1]          [,2]
## [1,] 5.306946e-221 5.323733e-221
## [2,] 1.981733e-220 1.960087e-220
## [3,] 3.760130e-220 3.719059e-220
## [4,] 7.134450e-220 7.056523e-220
## [5,] 1.353687e-219 1.338901e-219
## [6,] 2.568478e-219 2.540423e-219

tail(beta)

##              [,1]       [,2]
## [495,] 0.01061245 0.01064604
## [496,] 0.03962865 0.03919713
## [497,] 0.07516995 0.07440065
## [498,] 0.14180608 0.14225848
## [499,] 0.52940000 0.52390000
## [500,] 1.00000000 1.00000000

Baum Welch (Forward-Backward) Algorithm

Overview

Baum Welch Algorithm R Code:

forward = function(v, a, b, initial_distribution){
  T = length(v)
  M = nrow(a)
  alpha = matrix(0, T, M)
  alpha[1, ] = initial_distribution*b[, v[1]]
  for(t in 2:T){
    tmp = alpha[t-1, ] %*% a
    alpha[t, ] = tmp * b[, v[t]]}
  return(alpha)}
 
backward = function(v, a, b){
  T = length(v)
  M = nrow(a)
  beta = matrix(1, T, M)
  for(t in (T-1):1){
    tmp = as.matrix(beta[t+1, ] * b[, v[t+1]])
    beta[t, ] = t(a %*% tmp)}
  return(beta)}
 
BaumWelch = function(v, a, b, initial_distribution, n.iter = 100){
  for(i in 1:n.iter){
    T = length(v)
    M = nrow(a)
    K=ncol(b)
    alpha = forward(v, a, b, initial_distribution)
    beta = backward(v, a, b)
    xi = array(0, dim=c(M, M, T-1))
    
    for(t in 1:T-1){
      denominator = ((alpha[t,] %*% a) * b[,v[t+1]]) %*% matrix(beta[t+1,]) 
      for(s in 1:M){
        numerator = alpha[t,s] * a[s,] * b[,v[t+1]] * beta[t+1,]
        xi[s,,t]=numerator/as.vector(denominator)
      }
    }
    xi.all.t = rowSums(xi, dims = 2)
    a = xi.all.t/rowSums(xi.all.t)
    
    gamma = apply(xi, c(1, 3), sum)  
    gamma = cbind(gamma, colSums(xi[, , T-1]))
    for(l in 1:K){
      b[, l] = rowSums(gamma[, which(v==l)])
    }
    b = b/rowSums(b)
  }
  return(list(a = a, b = b, initial_distribution = initial_distribution))
}
 
data = read.csv("hmm_ex_data_r.csv")
 
M=2; K=3
A = matrix(1, M, M)
A = A/rowSums(A)
B = matrix(1:6, M, K)
B = B/rowSums(B)
initial_distribution = c(1/2, 1/2)

(myout = BaumWelch(data$Visible, A, B, initial_distribution, n.iter = 100))

## $a
##           [,1]      [,2]
## [1,] 0.5381634 0.4618366
## [2,] 0.4866444 0.5133556
## 
## $b
##           [,1]      [,2]      [,3]
## [1,] 0.1627751 0.2625807 0.5746441
## [2,] 0.2514996 0.2778097 0.4706907
## 
## $initial_distribution
## [1] 0.5 0.5

Checking output against hmm() R package results:

library(HMM)
hmm =initHMM(c("A", "B"), c(1, 2, 3), 
             startProbs = initial_distribution,
             transProbs = A, emissionProbs = B)

true.out = baumWelch(hmm, data$Visible, maxIterations=100, pseudoCount=0)
true.out$hmm

## $States
## [1] "A" "B"
## 
## $Symbols
## [1] 1 2 3
## 
## $startProbs
##   A   B 
## 0.5 0.5 
## 
## $transProbs
##     to
## from         A         B
##    A 0.5381634 0.4618366
##    B 0.4866444 0.5133556
## 
## $emissionProbs
##       symbols
## states         1         2         3
##      A 0.1627751 0.2625807 0.5746441
##      B 0.2514996 0.2778097 0.4706907

N.B.:
- The \(T\)’th data is put into the \(\gamma\) since \(\xi\)’s length is \(T-1\).
- Using \(\xi\) to derive \(\gamma\)
- The indicator function has been implemented using “b[,1]=rowSums(gamma[, which(v==1)])”

The Baum Welch is used to estimate Transition and Emission Matrices. The next step involving Viterbi algoithn is used to make state predictions from testing data.

Viterbi

Viterbi with simulated data example

Observed states A, B with Hidden States L,R. Viterbi finds the most likely path

library(HMM)

# Initialise HMM
hmm = initHMM(c("A","B"), c("L","R"), transProbs=matrix(c(.6,.4,.4,.6),2),
emissionProbs=matrix(c(.6,.4,.4,.6),2))
hmm

## $States
## [1] "A" "B"
## 
## $Symbols
## [1] "L" "R"
## 
## $startProbs
##   A   B 
## 0.5 0.5 
## 
## $transProbs
##     to
## from   A   B
##    A 0.6 0.4
##    B 0.4 0.6
## 
## $emissionProbs
##       symbols
## states   L   R
##      A 0.6 0.4
##      B 0.4 0.6

# Sequence of observations
observations = c("L","L","R","R")

# Calculate Viterbi path
vp = viterbi(hmm,observations)
vp

## [1] "A" "A" "B" "B"

Dishonest Casino

Explore a classic example of HMM of an imaginary dishonest casino and demo the aphid Hidden Markov Package

2 dice, one fair and one weighted with probability of rolling 6 at .5 and .1 for rest

# set states 
states <- c("Begin", "Fair", "Loaded")
# dice results 1 through 6
residues <- paste(1:6)
### Define transition probability matrix A
A <- matrix(c(0, 0, 0, 0.99, 0.95, 0.1, 0.01, 0.05, 0.9), nrow = 3)
dimnames(A) <- list(from = states, to = states)
### Define emission probability matrix E
E <- matrix(c(rep(1/6, 6), rep(1/10, 5), 1/2), nrow = 2, byrow = TRUE)
dimnames(E) <- list(states = states[-1], residues = residues)

The plot.HMM method depicts the transition probabilities as weighted lines, and emission probabilities as horizontal bars.

### Create the HMM object
x <- structure(list(A = A, E = E), class = "HMM")
### Plot the model
plot(x, textexp = 1.5)
### Add the transition probabilities as text
text(x = 0.02, y = 0.5, labels = "0.95")
text(x = 0.51, y = 0.5, labels = "0.90")
text(x = 0.5, y = 0.9, labels = "0.05")
text(x = 0.5, y = 0.1, labels = "0.10")

6’s appearing at a higher rate suggesting the use of the loaded dice

# Load the casino dataset
data(casino)
rolls <- as.numeric(casino)
# roll results
qplot(rolls, ylab = 'count',main = 'Dice Roll Results')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The viterbi algorithm allows us to discover the most likely sequence of hidden states (when the dice switched). Comparison below.

# The actual path is stored in the names attribute of the sequence
actual <- c("F", "L")[match(names(casino), c("Fair", "Loaded"))]
### Find the predicted path
vit1 <- Viterbi(x, casino)
predicted <- c("F", "L")[vit1$path + 1]

actual[1:50]

##  [1] "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F"
## [18] "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F"
## [35] "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "L" "L" "L" "L" "L"

predicted[1:50]

##  [1] "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F"
## [18] "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F"
## [35] "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "L" "L"

We can also calculate the full and posterior probabilities of the sequence given the model using the forward and backward algorithms:

casino.post <- posterior(x, casino)
plot(1:300, seq(0, 1, length.out = 300), type = "n", xlab = "Roll number",
     ylab = "Posterior probability of dice being fair", main = 'Fairness Probabilities')
starts <- which(c("L", actual) == "F" & c(actual, "F") == "L")
ends <- which(c("F", actual) == "L" & c(actual, "L") == "F") - 1
for(i in 1:6) rect(starts[i], 0, ends[i], 1, col = "grey", border = NA)
lines(1:300, casino.post[1, ])

Weather Problem

Data Preprocessing

Data

The data used has been created from two datasets:

Google mobility data: https://www.google.com/covid19/mobility/
NOAA climate data from Chicago O’Hare: https://www.ncdc.noaa.gov/

The dataset consists of 90 days of observed weather and park mobility from March 3, 2020 to June 1, 2020. These date were chosen based on the availability of the Google’s mobility data.

weather_mob <- read.csv("weather_mobility.csv")

#pull out spring dates only
spring <- weather_mob[1:90,]

States

Define 3 states:

Clear (below 53 degrees and no precipitation)
Rainy (below 53 degrees with precipitation)
Warm (53 degrees or above)

Add the states to the dataset.

#find rainy days
spring$rain <- ifelse(spring$prcp > 0, 1, 0)

#classify each observation as a specific state: rain, clear, or warm
spring$markov_states <- ifelse(spring$rain == 1 & spring$tmax < 53, "rain", 
                               ifelse(spring$rain == 0 & spring$tmax < 53, "clear", "warm"))

Observations

Park mobility observations are taken from the Google Mobility data. The data compares mobility for the report date to the baseline day. Calculated for the report date (unless there are gaps) and reported as a positive or negative percentage. The baseline day is the median value from the 5‑week period Jan 3 – Feb 6, 2020.

Bin the mobility data to create 3 types of observations:

Low Mobility (less than -10)
Moderate Mobility (-10 to 40)
High Mobility (> 40)

plot(spring$parks, main = "Mobility Observation Bins",  ylab = "Mobility Score")
abline(h = c(-10, 40), col = "red")

#bin mobilty data as low, moderate, high mobility
spring$park_obs <- ifelse(spring$parks < -10, "L",
                          ifelse(spring$parks >= -10 & spring$parks < 40, "M", "H"))

Part 1: Define Structure of HMM

Overview
In part 1, the hidden states are known for the estimation of the transition and emissions matrix as we generated the example. Part 2 will detail use of Baum-Welch to estimate these matrices when the hidden states are not known.

There are 5 components we need to define to run our HMM model:

Hidden states
Set of observations
Transition matrix
Emissions Matrix
Initial Probability Distribution

Hidden States

In a Hidden Markov Model, the state at time \(t\) is unknown. The state will emit an observation, \(v\). In this example, \(M\) is the number of hidden states. The set of Hidden States is \(S\).

Thus, \(S^M =\) {clear, rainy, warm}

#define state: 
S <- c('clear','rain', 'warm')

Transition Probability Matrix

The transition matrix is the probability of changing from one state to another state. It is represented as a \(M\times M\) matrix where the rows sum to 1. \(a_{ij}\) is simply the probability of state \(i\) changing to state \(j\) at \(t+1\).

The transition probability matrix here:

\((\hat A) = \begin{bmatrix}p(clear|clear)&p(clear|rainy)&p(clear|warm) \\p(rainy|clear)&p(rainy|rainy)&p(rainy|warm)\\p(warm|clear)&p(warm|rainy)&p(warm|warm)\\\end{bmatrix}\)

A simple example of a transition matrix with two states time step to 10:

spring$markov_states[1:10]

##  [1] "clear" "warm"  "clear" "warm"  "warm"  "warm"  "clear" "clear"
##  [9] "warm"  "clear"

The following transitions occur:

clear -> clear 1x

clear -> warm 3x

warm -> clear 3x

warm -> warm 2x

Our expected transition matrix for this simple example is:

\(A = \begin{bmatrix}.25&.75 \\.6&.4\\\end{bmatrix}\)

Compare to output using markovchainFit function.

simple_A <- markovchainFit(spring$markov_states[1:10])
simple_A$estimate

## MLE Fit 
##  A  2 - dimensional discrete Markov Chain defined by the following states: 
##  clear, warm 
##  The transition matrix  (by rows)  is defined as follows: 
##       clear warm
## clear  0.25 0.75
## warm   0.60 0.40

Repeat the same process on all 90 states for the estimated transition matrix.

#find transition matrix using markovchain library
transition <- markovchainFit(data=spring$markov_states)
transition$estimate

## MLE Fit 
##  A  3 - dimensional discrete Markov Chain defined by the following states: 
##  clear, rain, warm 
##  The transition matrix  (by rows)  is defined as follows: 
##           clear       rain      warm
## clear 0.1428571 0.42857143 0.4285714
## rain  0.4545455 0.09090909 0.4545455
## warm  0.2105263 0.01754386 0.7719298

A <- rbind(c(transition$estimate[1]), (transition$estimate[2]), (transition$estimate[3]))

#check rows sum to 1
apply(A, 1, sum)

## [1] 1 1 1

Sequence of Observations

At each time step \(t\) the system will emit an observation: low, moderate, or high mobility at the parks. This creates a sequence of observations \(V^T\).

Key Point: each observation is dependent on the state that generated it, not on the neighboring observations.

#H: High - L: Low - M: Moderate (alphabetical order)
V_labels <- c('H', 'L','M') 
  
V <- spring$park_obs

Emission Probability Matrix

The emissions probability matrix contains the probability of emitting symbol \(k\) given state \(j\). It is represented by a \(M\times C\) matrix that rows also sum to 1.

Our observation symbols are L, M, H. One observation will emit from a state at each time step. The estimated emissions matrix will be as follows:

\((\hat B) = \begin{bmatrix}p(high|clear)&p(low|clear)&p(moderate|clear) \\p(high|rainy)&p(low|rainy)&p(moderate|rainy)\\p(high|warm)&p(low|warm)&p(moderate|warm)\\\end{bmatrix}\)

To find the emission’s matrix from the observed data, find the number of times each state is observed for a specific observation type and divide by total number of observations for the state.

#Emissions probabilities
#contingency table
e_table <- table(spring$markov_states, spring$park_obs)
e_table

##        
##          H  L  M
##   clear  1  8 12
##   rain   0 11  0
##   warm  15 10 33

#probabilities
clear <- e_table[1,]/sum(e_table[1,])
rain <- e_table[2,]/sum(e_table[2,])
warm <- e_table[3,]/sum(e_table[3,])

#make sure order is same as S vector above
B <- rbind(clear, rain, warm)
B

##                H         L         M
## clear 0.04761905 0.3809524 0.5714286
## rain  0.00000000 1.0000000 0.0000000
## warm  0.25862069 0.1724138 0.5689655

Initial Probability Distribution

The initial state is the state \(s\) at \(t = 1\). Estimate the probability of starting at a particular state from the probability of being in a state in the observed data.

table(spring$markov_states)/length(spring$markov_states)

## 
##     clear      rain      warm 
## 0.2333333 0.1222222 0.6444444

pi <- rbind(c(.26, .14, .6))

\(\pi = \begin{bmatrix}.26&.14&.60\\\end{bmatrix}\)

Run HMM Model

# Initialise HMM
hmm = initHMM(S, V_labels, startProbs = pi, transProbs = A, emissionProbs = B)
print(hmm)

## $States
## [1] "clear" "rain"  "warm" 
## 
## $Symbols
## [1] "H" "L" "M"
## 
## $startProbs
## clear  rain  warm 
##  0.26  0.14  0.60 
## 
## $transProbs
##        to
## from        clear       rain      warm
##   clear 0.1428571 0.42857143 0.4285714
##   rain  0.4545455 0.09090909 0.4545455
##   warm  0.2105263 0.01754386 0.7719298
## 
## $emissionProbs
##        symbols
## states           H         L         M
##   clear 0.04761905 0.3809524 0.5714286
##   rain  0.00000000 1.0000000 0.0000000
##   warm  0.25862069 0.1724138 0.5689655

Evaluation Problem

Our model is defined as \(\theta\) and our sequence of observations \(\ V^T\)

Given \(\theta\) and \(V^T\), estimate \(p(V^T|\theta)\)

In other words, let us calculate the probability our model generated a particular sequence \(V^T\).

Forward Algorithm

The forward algorithm is one efficient way to solve the evaluation problem as it derives the probability of the next step on the computed probability of the current step. Goal of forward algorithm compute the joint distribution on \(S_k\)given \(x_{1:k}\)

Simple example: HMM model is \(\theta\). What is the probability the set of observations \((M,M,M)\) came from model \(\theta\) \(p(V^T|\theta)\).

#take first 3 observations
V_sample <- c("M", "M", "M")
 
forward = function(v, a, b, initial_distribution){
  
  T = length(v)
  m = nrow(a)
  alpha = matrix(0, T, m)
  
  alpha[1, ] = initial_distribution*b[, v[1]]
  
  for(t in 2:T){
    tmp = alpha[t-1, ] %*% a
    alpha[t, ] = tmp * b[, v[t]]
  }
  return(alpha)
}
 
answer <- forward(V_sample,A,B,pi)
answer

##            [,1] [,2]      [,3]
## [1,] 0.14857143    0 0.3413793
## [2,] 0.05319647    0 0.1861623
## [3,] 0.02673803    0 0.0947343

sum(answer[3,])

## [1] 0.1214723

Decoding Problem

The decoding problem finds the most probable hidden state at each time step. In other words, what is the most probable path given our set of observations?

Viterbi

Find most likely states based on A and B estimates from observed data. use the viterbi algorithm.

#HMM most likely states
HMM_states <- viterbi(hmm, V)

#compare model to actual
cbind(HMM_states, spring$markov_states)[1:15,]

##       HMM_states        
##  [1,] "warm"     "clear"
##  [2,] "warm"     "warm" 
##  [3,] "warm"     "clear"
##  [4,] "warm"     "warm" 
##  [5,] "warm"     "warm" 
##  [6,] "warm"     "warm" 
##  [7,] "warm"     "clear"
##  [8,] "warm"     "clear"
##  [9,] "warm"     "warm" 
## [10,] "clear"    "clear"
## [11,] "rain"     "rain" 
## [12,] "clear"    "clear"
## [13,] "rain"     "rain" 
## [14,] "clear"    "clear"
## [15,] "rain"     "rain"

#how'd we do? % of states correct
sum(HMM_states == spring$markov_states)/length(spring$markov_states)

## [1] 0.7222222

plot(HMM_states == spring$markov_states, ylab = "Correct States", main = "Performance - Hidden States Known")

Part 2: Estimate A and B

Learning Problem

Hidden Markov Models are generally an unsupervised learning process where the number of hidden states are unknown and only the observed symbols are visible.

Baum-Welch

The Baum-Welch algorithm (forward/backward algorithm) is a special case of Expectation Maximization that will estimate the Transition and Emission probabilities.

#update H,L,M to 1,2,3 for function. 
spring$park_obs <- ifelse(spring$parks < -10, 1,
                          ifelse(spring$parks >= -10 & spring$parks < 40, 2, 3))

#backward
backward = function(V, A, B){
  T = length(V)
  m = nrow(A)
  beta = matrix(1, T, m)
  
  for(t in (T-1):1){
    tmp = as.matrix(beta[t+1, ] * B[, V[t+1]])
    beta[t, ] = t(A %*% tmp)
  }
  return(beta)
}

#BaumWelch
BaumWelch = function(v, a, b, initial_distribution, n.iter = 100){
  
  for(i in 1:n.iter){
    T = length(v)
    M = nrow(a)
    K=ncol(b)
    alpha = forward(v, a, b, initial_distribution)
    beta = backward(v, a, b)
    xi = array(0, dim=c(M, M, T-1))
    
    for(t in 1:T-1){
      denominator = ((alpha[t,] %*% a) * b[,v[t+1]]) %*% matrix(beta[t+1,]) 
      for(s in 1:M){
        numerator = alpha[t,s] * a[s,] * b[,v[t+1]] * beta[t+1,]
        xi[s,,t]=numerator/as.vector(denominator)
      }
    }
    
    
    xi.all.t = rowSums(xi, dims = 2)
    a = xi.all.t/rowSums(xi.all.t)
    
    gamma = apply(xi, c(1, 3), sum)  
    gamma = cbind(gamma, colSums(xi[, , T-1]))
    for(l in 1:K){
      b[, l] = rowSums(gamma[, which(v==l)])
    }
    b = b/rowSums(b)
    
  }
  return(list(a = a, b = b, initial_distribution = initial_distribution))
}

M=3; K=3
A = matrix(1, M, M)
A = A/rowSums(A)
B = matrix(1:6, M, K)
B = B/rowSums(B)
initial_distribution = c(1/3,1/3, 1/3)

(output = BaumWelch(spring$park_obs, A, B, initial_distribution, n.iter = 100))

## $a
##           [,1]       [,2]       [,3]
## [1,] 0.3953952 0.53133439 0.07327042
## [2,] 0.7546056 0.20945732 0.03593709
## [3,] 0.1042512 0.06582564 0.82992313
## 
## $b
##            [,1]      [,2]       [,3]
## [1,] 0.02238586 0.7390644 0.23854972
## [2,] 0.30672318 0.4771669 0.21610996
## [3,] 0.86278023 0.1130348 0.02418494
## 
## $initial_distribution
## [1] 0.3333333 0.3333333 0.3333333

Find most likely states based on A and B matrix estimated from Baum-Welch.

pi <- c(1/3,1/3,1/3)
S <- c("rain", "clear","warm")
hmm2 = initHMM(S, V_labels, startProbs = pi, transProbs = output$a, emissionProbs = output$b)

print(hmm2)

## $States
## [1] "rain"  "clear" "warm" 
## 
## $Symbols
## [1] "H" "L" "M"
## 
## $startProbs
##      rain     clear      warm 
## 0.3333333 0.3333333 0.3333333 
## 
## $transProbs
##        to
## from         rain      clear       warm
##   rain  0.3953952 0.53133439 0.07327042
##   clear 0.7546056 0.20945732 0.03593709
##   warm  0.1042512 0.06582564 0.82992313
## 
## $emissionProbs
##        symbols
## states           H         L          M
##   rain  0.02238586 0.7390644 0.23854972
##   clear 0.30672318 0.4771669 0.21610996
##   warm  0.86278023 0.1130348 0.02418494

HMM_states2 <- viterbi(hmm2, V)

#compare model to actual
cbind(HMM_states2, spring$markov_states)[1:15,]

##       HMM_states2        
##  [1,] "rain"      "clear"
##  [2,] "clear"     "warm" 
##  [3,] "rain"      "clear"
##  [4,] "clear"     "warm" 
##  [5,] "clear"     "warm" 
##  [6,] "rain"      "warm" 
##  [7,] "clear"     "clear"
##  [8,] "rain"      "clear"
##  [9,] "clear"     "warm" 
## [10,] "rain"      "clear"
## [11,] "rain"      "rain" 
## [12,] "clear"     "clear"
## [13,] "rain"      "rain" 
## [14,] "clear"     "clear"
## [15,] "rain"      "rain"

#how'd we do? % of states correct
sum(HMM_states2 == spring$markov_states)/length(spring$markov_states)

## [1] 0.2333333

plot(HMM_states2 == spring$markov_states, ylab = "Correct States", main = "Performance - Baum-Welch Estimates")

Conclusion

Future Work

Markov Models

Something became apparent very quickly when investigating Markov Models: there is no shortage of information about them or applications for them. We only investigated two of the four common Markov models, and Markov Chains were only used as a rudimentary introduction to Hidden Markov Models. Future studies will include:

Markov Decision Processes

Partially Observable Markov Decision Processes

Model Improvement
Also a common refrain of modelers: “If we had more time we would have tried to improve the model.” This exercise is no different. Andrew L. did some impressive work getting a real world Hidden Markov Model up and running with only self study to guide him. These models have unlimited applications and an expertise in creating and refining them would be valuable.

Available Libraries

This is far from an exhaustive list of Markov and Hidden Markov packages in R. Packages range from simple to complex. Many of the latter packages were designed to address specific features that other packages are missing.

Markovchain
- Discrete time Markov Chain models
HMM
- Standard, general purpose Hidden Markov package. From our point of view the most widely used Hidden Markov package
MSM
- Multi state models, including Markov and Hidden Markov
- created for modeling chronic diseases
Aphid
- created for bio sequence analysis
- Profile HMM <- allows for transition and emission matrix to change at any time position.
DepmixS4
- a flexible framework for fitting dependent mixture models for a large variety of response distributions

Sources

https://web.stanford.edu/~jurafsky/slp3/A.pdf
http://www.adeveloperdiary.com/data-science/machine-learning/implement-viterbi-algorithm-in-hidden-markov-model-using-python-and-r/</li?
mathematicalmonk youtube

Markov Models

Fliege | Guardi | Leonard | Poole

Introduction

Markov Chains

Specifications

Markov Chain Example

Hidden Markov Model

Baum-Welch

Overview

Forward Algorithm

Backwards Algorithm

Baum Welch (Forward-Backward) Algorithm

Viterbi

Dishonest Casino

Weather Problem

Data Preprocessing

Part 1: Define Structure of HMM

Run HMM Model

Part 2: Estimate A and B

Conclusion