# Defining our own Markov Reward Process:
S = c(1,2,3)
P = matrix(c(0.5,  0.5,  0,
             0.25, 0.5,  0.25,
             0,    0.5,  0.5), 3, 3, byrow = TRUE)
R = matrix(c(1, 2, 1), 3, 1)
y = 0.5
Markov Reward Processes
Markov Reward Process
- Markov chain + rewards
Definition:
Tuple (S, P, R, \gamma):
S is the set of states
P is the transition probability matrix of the Markov chain
R is the reward function, R(s_t = s) = E[r_t | s_t = s]
\gamma is the discount factor (written as y in the code)
There are no actions in an MRP.
- If there is a finite number of states, we can represent R as a vector (one scalar for each state).
The reward function can be defined either over the present state alone or over the (present state, next state) pair; we use the former.
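Each row of P is a probability distribution over next states, so a quick sanity check on the P defined at the top is:
# each row of the transition matrix should sum to 1
rowSums(P)  # should print 1 1 1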
Horizon
Number of timesteps in an episode (can be infinite).
Return
Discounted sum of rewards from time step t to horizon.
G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots
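For example, using the y = 0.5 defined above and a made-up four-step reward sequence (illustrative values, not drawn from the MRP above), the return is just a discounted sum:
# return for an example reward sequence r_t, r_{t+1}, ...
rewards = c(1, 2, 1, 2)
G_t = sum(y^(0:(length(rewards) - 1)) * rewards)
print(G_t)  # 1 + 0.5*2 + 0.25*1 + 0.125*2 = 2.5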
State Value function
V(s) = E[ G_t | s_t = s]
Discount factor
Reflects the convention that earlier rewards are valued more than later ones.
We can set it to 1 for finite-horizon MRPs.
Computing value
Monte Carlo method
Simulate the Markov chain a large number of times, compute G_t for each episode starting from state s, and average these returns to get an estimate of V(s).
Does not exploit additional structure of the MRP, such as the Markov property.
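A rough sketch of this in R, reusing S, P, R, y from the top; the episode count and truncation horizon are arbitrary choices (with y = 0.5 the truncation error is negligible), not something fixed by the method:
# estimate V(s) for each start state by averaging sampled returns
set.seed(1)
n_episodes = 5000
horizon = 30
V_mc = sapply(S, function(s0) {
  returns = replicate(n_episodes, {
    s = s0
    G = 0
    for (t in 0:(horizon - 1)) {
      G = G + y^t * R[s, 1]            # accumulate discounted reward at the current state
      s = sample(S, 1, prob = P[s, ])  # transition to s' ~ P(.|s)
    }
    G
  })
  mean(returns)                         # average return approximates V(s0)
})
print(V_mc)  # should be roughly 2.5 3.5 2.5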
Analytic solution
We can observe
V(s) = R(s) + \gamma \sum_{s' \in S} P(s'|s) V(s')
Since we can write these down in terms of vectors and matrices, this becomes
V = R + \gamma P V
Solving for V, we get:
V = (I - \gamma P)^{-1} R
As long as (I - \gamma P) is invertible (it usually is), we can compute this analytically. However, it takes O(|S|^3) time.
I = diag(1, 3, 3)
V1 = solve(I- y*P) %*% R
print(V1)
     [,1]
[1,]  2.5
[2,]  3.5
[3,]  2.5
Dynamic programming approach
Algorithm:
- Initialise V_0(s) = 0.
- For k = 1 until convergence:
For all s in S:
V_k(s) = R(s) + \gamma \sum_{s' \in S} P(s'|s) V_{k-1}(s')
Computational complexity: O(|S|^2) per iteration.
n = 3
# initialise V_0 = 0
V2 = numeric(n)
V2_prev = numeric(n)
k = 1
# iterate V_k = R + y * P V_{k-1} until successive iterates are close
while (k == 1 || norm(V2 - V2_prev, type = "2") > 1e-4) {
  V2_prev = V2
  V2 = R + y * P %*% V2_prev
  k = k + 1
}
print(V2)
         [,1]
[1,] 2.499954
[2,] 3.499954
[3,] 2.499954
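A quick sanity check: the iterative and analytic solutions should agree to within the stopping tolerance.
# difference between the dynamic programming and analytic answers
max(abs(V1 - V2))  # on the order of 1e-5 here, consistent with the outputs above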