# Mars rover MDP example
S <- c(1,2,3,4,5,6,7)
A <- c("a1", "a2")
P <- array(0, dim = c(7, 7,2))
R <- c(1, 0, 0, 0, 0, 0, 10 )
y <- 0.5
# a1: TryLeft
P[,,1] <- rbind(c(1,0,0,0,0,0,0) , cbind(diag(1,6,6), numeric(6)))
# a2: TryRight
P[,,2] <- rbind( cbind(numeric(6), diag(1,6,6)), c(0,0,0,0,0,0,1))
Markov Decision Processes With Model
MDP
- MRP + actions (a decision variable we get to choose)
An alternative conception is to view it as a graph with states, chance nodes, transition probabilities, and rewards.
Definition
Tuple (S, A, P, R, y)
S: set of states s
A: set of actions a
P or T: Transition or dynamics model for each action \(T(s,a, s') = P(s'|s,a) = \mathbb{P}( s_{t+1} = s' | s_t = s, a_t = a )\).
R: Reward function (function of both state and action) \(R(s, a) = \mathbb{E}[r_t | s_t = s, a_t = a]\)
Alternatively, \(\mathcal{R}(s, a, s' ) = \mathbb{E}[r_t | s_t = s, a_t = a, s_{t+1} = s']\)
y: Discount factor
Policies
- Mapping (or conditional distribution) from state to action
- May be stochastic
\[ \pi (a|s) = \mathbb{P}(a_t = a | s_t = s) \]
An MDP combined with a fixed policy is effectively an MRP (the policy pins down the action distribution in every state).
More precisely, it is the MRP: (S, \(R^\pi\) , \(P^\pi\) , y), where:
\[ R^\pi(s) = \sum_{a \in A} \pi(a|s) R(s,a) \]
The new reward function for a state is the weighted sum of the rewards of the action-state pairs (whose probabilities are specified by the policy).
\[ P^\pi(s' | s) = \sum_{a \in A} \pi(a|s)P(s'|s,a) \]
# Compute TPM given policy and dynamics model
TPM <- function(pi, P){
  # pi: n_states x n_actions, pi[s, a] = pi(a|s)
  # P:  n_states x n_states x n_actions
  n_states <- dim(P)[1]
  n_actions <- dim(P)[3]
  ans <- matrix(0, n_states, n_states)
  for(i in 1:n_actions){
    # weight each row of P[,,i] by pi(a_i | s) and accumulate
    ans <- ans + P[,,i] * matrix(rep(pi[,i], n_states), n_states, n_states)
  }
  return(ans)
}
The dynamics model under a policy (i.e. the probability of transitioning from one state to another) is the sum, over actions, of the probability of reaching the new state from the current one under each action, weighted by the probability of that action given the current state (as specified by the policy).
Policy evaluation
Iterative algorithm (Bellman Backup for a particular policy)
Algorithm:
- Initialise \(V_0^\pi(s) = 0\).
- For k = 1 until convergence:
For all s in S:
\[ V^\pi_k(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi(s)) V^\pi_{k-1}(s') \]
Same as with MRPs, except that we index it with the particular policy.
# Computes value for given policy
policy_eval <- function(pi, P, R, y){
  # R is treated as a state-reward vector here (rewards don't depend on the action)
  tpm <- TPM(pi, P)
  n <- dim(P)[1]
  # initialise V = 0
  V <- matrix(0, n, 1)
  V_prev <- matrix(0, n, 1)
  k <- 1
  while( k == 1 || norm(V - V_prev, type = "2") > 1e-6 ){
    V_prev <- V
    V <- R + y * tpm %*% V_prev
    k <- k + 1
  }
  return(V)
}
Example 1
# Consider policy pi(s) = a1 for all s, y = 0
# Compute the value for this policy
policy <- matrix(rep(c(1,0), 7), 7, 2, byrow = TRUE)
policy_eval(policy, P, R, 0)
     [,1]
[1,]    1
[2,]    0
[3,]    0
[4,]    0
[5,]    0
[6,]    0
[7,]   10
Example 2
# Single action: from states 1-6, stay put or move right with probability 0.5 each;
# state 7 is absorbing
P2 <- array(0, dim = c(7, 7, 1))
P2[,,1] <- rbind(cbind(diag(0.5, 6, 6), numeric(6)) + cbind(numeric(6), diag(0.5, 6, 6)), c(0,0,0,0,0,0,1) )
policy2 <- matrix(1, 7, 1)  # always take the single available action
policy_eval(policy2, P2, R, 0.5)
            [,1]
[1,]  1.36076788
[2,]  0.08230423
[3,]  0.24691328
[4,]  0.74074044
[5,]  2.22222192
[6,]  6.66666637
[7,] 19.99999970
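As a sanity check on the last entry: under P2, state 7 is absorbing with reward 10, so its value satisfies a one-variable Bellman equation,
\[ V(7) = 10 + 0.5\, V(7) \implies V(7) = 20, \]
which matches the \(\approx 20\) in the output (the loop stops once successive iterates differ by less than 1e-6, so it doesn't land on exactly 20).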
Bellman Backup Notation
\[ B^\pi V (s) = R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s'|s) V(s') \]
Policy evaluation is repeated application of the backup operator:
\[ V^\pi = B^\pi B^\pi \cdots B^\pi V \]
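A minimal sketch of this operator view in R, reusing the TPM helper above and assuming state-only rewards as in the rover example (the name bellman_backup_pi and the fixed 100 sweeps are my own choices, not from the lecture):
# One application of B^pi: (B^pi V)(s) = R^pi(s) + y * sum_s' P^pi(s'|s) V(s')
bellman_backup_pi <- function(V, tpm, R, y){
  R + y * tpm %*% V
}
tpm2 <- TPM(policy2, P2)  # P^pi for the Example 2 policy
V <- matrix(0, 7, 1)
for (k in 1:100) V <- bellman_backup_pi(V, tpm2, R, 0.5)
V  # converges to the same values as policy_eval(policy2, P2, R, 0.5)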
MDP Control
We aim to compute the optimal policy:
\[ \pi^*(s) = \arg\max_{\pi} V^\pi(s) \]
There exists a unique optimal value function.
Number of deterministic policies
For each state, we can choose |A| actions.
\[ |A| ^{|S|} \]
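For the rover at the top (|A| = 2, |S| = 7), a quick check in R:
length(A) ^ length(S)  # 2^7 = 128 deterministic policies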
The optimal policy may not be unique (two different policies can be tied on value).
Infinite horizon MDP
The optimal policy in an infinite horizon MDP is:
- deterministic
- stationary (time invariant)
Bellman equation
\[V^\pi(s) = R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s'|s) V^\pi(s')\]
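Since the Bellman equation is linear in \(V^\pi\), a small MDP plus policy can also be cross-checked by solving the linear system directly; this direct solve isn't part of the iterative scheme above, just a sanity check against Example 2:
# Solve (I - y * P^pi) V = R directly for the Example 2 MRP (state-only rewards, so R^pi = R)
solve(diag(7) - 0.5 * TPM(policy2, P2), R)  # agrees with policy_eval(policy2, P2, R, 0.5) up to the 1e-6 tolerance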
Bellman Backup Operator
\[ BV(s) = \max_a \left[ R(s,a) + \gamma \sum_{s' \in S} P(s'|s, a) V(s') \right] \]
Policy search
Duh. There’s a finite number of policies. Just search the policy space, right?
Very inefficient: that’s up to \(|A|^{|S|}\) policies to evaluate.
Policy iteration
Iteratively improve the policy to reach optimum policy.
State-Action Value for policy
\[ Q^\pi(s, a) = R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V^\pi(s') \]
Value of the state s if I immediately take action a, then follow the policy \(\pi\).
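A small helper to compute \(Q^\pi\) for every (s, a) pair in R, assuming state-only rewards as in the rover example (so R(s, a) = R(s)); the name Q_values is my own, not from the lecture:
# Q^pi(s,a) = R(s,a) + y * sum_s' P(s'|s,a) V^pi(s')
# V_pi: n_states x 1 value of the current policy; returns n_states x n_actions
Q_values <- function(V_pi, P, R, y){
  n_actions <- dim(P)[3]
  sapply(1:n_actions, function(a) R + y * P[,,a] %*% V_pi)
}
V_a1 <- policy_eval(policy, P, R, y)  # all-a1 policy from Example 1, evaluated at y = 0.5
Q_values(V_a1, P, R, y)               # columns: a1 (TryLeft), a2 (TryRight)
Taking the greedy (row-wise maximising) action in each state is exactly the policy improvement step described next.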
Policy Improvement
We have some policy \(\pi_k\) that prescribes some action \(a_j\) for the state s. But, if the Q function is maximised at action \(a_m\) instead, it means that a better policy would recommend action \(a_m\) from the current state.
So for our next policy \(\pi_{k+1}\) , we take the action \(a_m\) when in state s.
Policy Iteration Algorithm
- Initialize \(\pi_0(s)\) randomly for all s.
- While (\(||\pi_i - \pi_{i-1} ||> 0\))
Compute \(V^{\pi_i}\) using policy evaluation techniques.
Policy improvement step: \(\pi_{i+1}(s) \leftarrow \arg\max_{a \in A} Q^{\pi_i}(s,a)\) for all s in S
If the policy stops changing, it’s reached the optimum value and will never change again.
Maximum number of iterations = \(|A|^{|S|}\).
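Putting evaluation and improvement together, a sketch of policy iteration in R that reuses policy_eval and the Q_values helper above (function and variable names are mine; still assuming state-only rewards):
policy_iteration <- function(P, R, y){
  n_states  <- dim(P)[1]
  n_actions <- dim(P)[3]
  pi_idx <- rep(1, n_states)              # start with action 1 in every state
  repeat {
    # one-hot encode the deterministic policy as an n_states x n_actions matrix
    pi_mat <- matrix(0, n_states, n_actions)
    pi_mat[cbind(1:n_states, pi_idx)] <- 1
    V <- policy_eval(pi_mat, P, R, y)     # policy evaluation
    Q <- Q_values(V, P, R, y)             # Q^pi for all (s, a)
    pi_new <- apply(Q, 1, which.max)      # greedy policy improvement
    if (all(pi_new == pi_idx)) break      # policy unchanged => done
    pi_idx <- pi_new
  }
  list(policy = pi_idx, V = V)
}
policy_iteration(P, R, 0.5)  # rover example from the top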
Value Iteration
Alternative approach to policy iteration.
State-action value (Q value)
\[ Q(s, a) = \sum_{s'} T(s, a, s') (\mathcal{R}(s, a, s') + \gamma V(s') ) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \]
Algorithm
- Initialise \(V_0(s) = 0\) for all s.
- Until convergence or end of horizon:
For each s:
Compute Q(s,a) for all (s,a) pairs ( R(s,a), P(s'|s,a) and \(V_k(s')\) are known )
Assign:
\[V_{k+1}(s) = \max_a Q(s,a) = \max_a[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_{k}(s')]\]
Or, viewed in terms of Bellman Backup Operator
\[ BV(s) = \max_a [R(s,a) + \gamma \sum_{s' \in S} P(s'|s, a) V(s')] \]
Update the policy too:
\[ \pi_{k+1}(s) = \arg\max_a [R(s,a) + \gamma \sum_{s' \in S} P(s'|s, a) V_{k}(s')]\]
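A sketch of value iteration in R for the rover setup, again assuming state-only rewards (the value_iteration name and the 1e-6 stopping tolerance are my own choices, mirroring policy_eval above):
value_iteration <- function(P, R, y, tol = 1e-6){
  n_states  <- dim(P)[1]
  n_actions <- dim(P)[3]
  V <- matrix(0, n_states, 1)
  repeat {
    # Q(s,a) = R(s) + y * sum_s' P(s'|s,a) V_k(s') for every action
    Q <- sapply(1:n_actions, function(a) R + y * P[,,a] %*% V)
    V_new <- matrix(apply(Q, 1, max), n_states, 1)  # V_{k+1}(s) = max_a Q(s,a)
    if (norm(V_new - V, type = "2") < tol) break
    V <- V_new
  }
  list(V = V_new, policy = apply(Q, 1, which.max))  # greedy policy from the last backup
}
value_iteration(P, R, 0.5)  # rover example from the top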
Doubt: There’s something about the “backing up” that I don’t yet fully get. I understand it goes stepwise kinda but still.
Python playground where I copy code for an OOP MDP model and value iteration from the example in the CS221 lectures, then adapt it to a very similar problem proposed in a Computerphile video.
I like the CS221 way of thinking about it as a graph a bit more than the abstract CS234 approach. That preference also shows in the OOP representation (which I recall being used in Karpathy’s Micrograd too, for the gradient computational graph, so I'm taking it as common practice for representing graphs).