Markov Reward Processes

Author

Arqam Patel

Markov Reward Process

  • Markov chain + rewards

Definition:

Tuple (S, P, R, y):

  • S is set of states

  • P is the Transition Probability Matrix of the markov chain

  • R is the reward function R(s_t = s) = E[ r_t| s_t = s]

  • y is the discount factor

There are no actions in an MRP.

  • If finite no. of states, then we can represent R as a vector (one scalar fro each state).

Reward function can either be defined over the present state or a tuple of present state and next state (we use former).

# Defining our own Markov Reward Process:

S = c(1,2,3)
P = matrix( c(0.5, 0.5, 0,
              0.25, 0.5, 0.25,
              0, 0.5, 0.5), 3, 3, byrow = TRUE)

R = matrix(c(1, 2, 1), 3, 1)
y = 0.5

Horizon

Number of timesteps in an episode. (can be infinity)

Return

Discounted sum of rewards from time step t to horizon.

G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} ...

State Value function

V(s) = E[ G_t | s_t = s]

Discount factor

Reflects that conventionally, earlier reward is valued more.

We can assume it to be 1 for finite horizon MRPs.

Computing value

Monte Carlo method

  • Simulate the Markov chain a large number of times and calculate G_t for each instance of the state.

  • Average it to get V(s).

    Does not incorporate additional information about the MRP structure like the Markov assumption.

Analytic solution

We can observe

V(s) = R(s) + \gamma \sum P(s'|s) V(s')

Since we can write these down in terms of vectors:

Solving further, we get:

V = (I- \gamma P)^{-1} R

As long as this matrix is invertible (it usually is). we can compute this analytically. However, it takes O(n3) time.

I = diag(1, 3, 3)
V1 = solve(I- y*P) %*% R

print(V1)
     [,1]
[1,]  2.5
[2,]  3.5
[3,]  2.5

Dynamic programming approach

Algorithm:

  1. Initialise V_0(s) = 0.
  2. For k = 1 until convergence:
    • For all s in S:

      V_k(s) = R(s) + \gamma \sum_{s' \in S} P(s'|s) V_{k-1}(s')

Computational complexity : O(|S|^2) for each iteration.

n <- 3

#initialise V = 0
V2 = numeric(n)
V2_prev = numeric(n)

k = 1
while( k ==1 | norm(V2 - V2_prev, type = "2") > 1e-4 ){
  V2_prev = V2
  V2 = R + y* P %*% V2_prev
  
  k = k+1
}

print(V2)
         [,1]
[1,] 2.499954
[2,] 3.499954
[3,] 2.499954

References:

CS234 Stanford