Markov Reward Processes

Author

Arqam Patel

Markov Reward Process

Markov chain + rewards

Definition:

Tuple (S, P, R, y):

S is set of states
P is the Transition Probability Matrix of the markov chain
R is the reward function R(s_t = s) = E[ r_t| s_t = s]
y is the discount factor

There are no actions in an MRP.

If finite no. of states, then we can represent R as a vector (one scalar fro each state).

Reward function can either be defined over the present state or a tuple of present state and next state (we use former).

# Defining our own Markov Reward Process:

S = c(1,2,3)
P = matrix( c(0.5, 0.5, 0,
              0.25, 0.5, 0.25,
              0, 0.5, 0.5), 3, 3, byrow = TRUE)

R = matrix(c(1, 2, 1), 3, 1)
y = 0.5

Horizon

Number of timesteps in an episode. (can be infinity)

Return

Discounted sum of rewards from time step t to horizon.

G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} ...

State Value function

V(s) = E[ G_t | s_t = s]

Discount factor

Reflects that conventionally, earlier reward is valued more.

We can assume it to be 1 for finite horizon MRPs.

Computing value

Monte Carlo method

Simulate the Markov chain a large number of times and calculate G_t for each instance of the state.
Average it to get V(s).

Does not incorporate additional information about the MRP structure like the Markov assumption.

Analytic solution

We can observe

V(s) = R(s) + \gamma \sum P(s'|s) V(s')

Since we can write these down in terms of vectors:

Solving further, we get:

V = (I- \gamma P)^{-1} R

As long as this matrix is invertible (it usually is). we can compute this analytically. However, it takes O(n3) time.

I = diag(1, 3, 3)
V1 = solve(I- y*P) %*% R

print(V1)

     [,1]
[1,]  2.5
[2,]  3.5
[3,]  2.5

Dynamic programming approach

Algorithm:

Initialise V_0(s) = 0.
For k = 1 until convergence:
- For all s in S:
  
  V_k(s) = R(s) + \gamma \sum_{s' \in S} P(s'|s) V_{k-1}(s')

Computational complexity : O(|S|^2) for each iteration.

n <- 3

#initialise V = 0
V2 = numeric(n)
V2_prev = numeric(n)

k = 1
while( k ==1 | norm(V2 - V2_prev, type = "2") > 1e-4 ){
  V2_prev = V2
  V2 = R + y* P %*% V2_prev
  
  k = k+1
}

print(V2)

         [,1]
[1,] 2.499954
[2,] 3.499954
[3,] 2.499954

References:

CS234 Stanford