Markov Decision Processes With Model

Author

Arqam Patel

MDP

  • MRP + actions (variable)

An alternative conception is to imagine it as a graph with states, chance nodes, transition probabilities, and rewards.

Definition

Tuple (S, A, P, R, γ)

  • S: set of states s

  • A: set of actions a

  • P or T: Transition or dynamics model for each action \(T(s,a, s') = P(s'|s,a) = \mathbb{P}( s_{t+1} = s' | s_t = s, a_t = a )\).

  • R: Reward function (a function of both state and action) \(R(s, a) = \mathbb{E}[r_t | s_t = s, a_t = a]\)

    Alternatively, \(\mathcal{R}(s, a, s') = \mathbb{E}[r_t | s_t = s, a_t = a, s_{t+1} = s']\)

  • γ: Discount factor (written y in the code below)

# Mars rover MDP example

S <- c(1, 2, 3, 4, 5, 6, 7)        # states
A <- c("a1", "a2")                 # actions

P <- array(0, dim = c(7, 7, 2))    # dynamics: current state x next state x action
R <- c(1, 0, 0, 0, 0, 0, 10)       # rewards: +1 in state 1, +10 in state 7
y <- 0.5                           # discount factor

# a1: TryLeft
P[,,1] <- rbind(c(1,0,0,0,0,0,0) , cbind(diag(1,6,6), numeric(6)))

# a2: TryRight
P[,,2] <- rbind( cbind(numeric(6), diag(1,6,6)), c(0,0,0,0,0,0,1))
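
As a quick check on the dynamics model, the transition probabilities out of each state should sum to 1 for every action:

# Sanity check: each row of each action's transition matrix sums to 1
apply(P, c(1, 3), sum)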

Policies

  • Mapping (or conditional distribution) from state to action
  • May be stochastic

\[ \pi (a|s) = \mathbb{P}(a_t = a | s_t = s) \]

An MDP with a fixed policy is effectively an MRP (since the policy pins down the action distribution in each state).

More precisely, it is the MRP (S, \(R^\pi\), \(P^\pi\), γ), where:

\[ R^\pi(s) = \sum_{a \in A} \pi(a|s) R(s,a) \]

The induced reward for a state is the weighted sum of the rewards of the state-action pairs, with the weights given by the policy's action probabilities.

\[ P^\pi(s' | s) = \sum_{a \in A} \pi(a|s)P(s'|s,a) \]

# Compute TPM given policy and dynamics model

TPM <- function(pi, P){
  # pi: n_states x n_actions policy (pi[s, a] = probability of action a in state s)
  # P: n_states x n_states x n_actions dynamics model
  
  n_states <- dim(P)[1]
  n_actions <- dim(P)[3]
  
  ans <- matrix(0, n_states, n_states)
  for(i in 1:n_actions){
    ans <- ans + P[,,i] * matrix(rep(pi[,i], n_states), n_states, n_states)
  }
  return(ans)
}

The induced transition probability from one state to another is the sum, over actions, of the probability of reaching the new state from the current one under that action, weighted by the probability the policy assigns to that action in the current state.
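
For completeness, here is a small sketch of the induced reward \(R^\pi\) as well. The function name reward_pi and the assumption that rewards are given as an n_states x n_actions matrix R_sa are my own; in the Mars rover example the reward depends only on the state, so the vector R can be used directly as \(R^\pi\).

# Sketch: induced reward vector R^pi(s) = sum_a pi(a|s) R(s, a)
# (assumes R_sa is an n_states x n_actions matrix of per-(state, action) rewards)
reward_pi <- function(pi, R_sa){
  rowSums(pi * R_sa)
}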

Policy evaluation

Iterative algorithm (Bellman Backup for a particular policy)

Algorithm:

  1. Initialise \(V_0^\pi(s) = 0\).
  2. For k = 1 until convergence:
    • For all s in S:

      \[ V^\pi_k(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi(s)) V^\pi_{k-1}(s') \]

This is the same backup as for MRPs, except that everything is indexed by the particular policy.

# Computes value for given policy

policy_eval <- function(pi, P, R, y){
  # pi: n_states x n_actions policy, P: n_states x n_states x n_actions dynamics
  # R: reward vector over states (R^pi), y: discount factor
  tpm <- TPM(pi, P)
  n <- dim(P)[1]

  # initialise V = 0
  V <- matrix(0, n, 1)
  V_prev <- matrix(0, n, 1)

  k <- 1
  while(k == 1 || norm(V - V_prev, type = "2") > 1e-6){
    V_prev <- V
    V <- R + y * tpm %*% V_prev   # Bellman backup under the fixed policy
    k <- k + 1
  }

  return(V)
}

Example 1

# Consider the policy pi(s) = a1 for all s, with y = 0
# Compute the value of this policy

policy <- matrix(rep(c(1,0), 7), 7, 2, byrow = TRUE)   # each row is (1, 0): always choose a1

policy_eval(policy, P, R, 0)
     [,1]
[1,]    1
[2,]    0
[3,]    0
[4,]    0
[5,]    0
[6,]    0
[7,]   10

Example 2

# Single action: from states 1-6, stay with prob 0.5 or move right with prob 0.5;
# state 7 is absorbing
P2 <- array(0, dim = c(7, 7, 1))
P2[,,1] <- rbind(cbind(diag(0.5, 6, 6), numeric(6)) + cbind(numeric(6), diag(0.5, 6, 6)), c(0,0,0,0,0,0,1))

policy2 <- matrix(1, 7, 1)   # only one action, taken with probability 1 in every state

policy_eval(policy2, P2, R, 0.5)
            [,1]
[1,]  1.36076788
[2,]  0.08230423
[3,]  0.24691328
[4,]  0.74074044
[5,]  2.22222192
[6,]  6.66666637
[7,] 19.99999970

Bellman Backup Notation

\[ B^\pi V (s) = R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s'|s) V(s') \]

Policy evaluation

\[ V^\pi = B^\pi B^\pi \dots B^\pi V \]

Repeatedly applying \(B^\pi\) to any initial value function converges to \(V^\pi\).
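
Concretely, the fixed point of \(B^\pi\) can also be computed in closed form as \(V^\pi = (I - \gamma P^\pi)^{-1} R^\pi\). A quick check against Example 2 (variable names here are my own; since the reward there depends only on the state, \(R^\pi = R\)):

# Closed-form fixed point vs. iterative policy evaluation for Example 2
tpm2 <- TPM(policy2, P2)
V_exact <- solve(diag(7) - 0.5 * tpm2, R)
max(abs(policy_eval(policy2, P2, R, 0.5) - V_exact))   # ~0 (within the 1e-6 stopping tolerance)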

MDP Control

We aim to compute the optimal policy:

\[ \pi^*(s) = \arg\max_{\pi} V^\pi(s) \]

There exists a unique optimal value function.

Number of deterministic policies

For each state, we can choose any of the |A| actions, so the number of deterministic policies is

\[ |A| ^{|S|} \]

For the Mars rover (7 states, 2 actions) that is \(2^7 = 128\). The optimal policy may not be unique: several policies can be tied for the optimal value.

Infinite horizon MDP

In an infinite horizon MDP, there always exists an optimal policy that is:

  • deterministic

  • stationary (time invariant)

Bellman equation

\[V^\pi(s) = R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s'|s) V^\pi(s')\]

Bellman Backup Operator

\[ BV(s) = \max_a [R(s,a) + \gamma \sum_{s' \in S} P(s'|s, a) V(s')] \]

Policy iteration

Iteratively improve the policy until it converges to an optimal policy.

State-Action Value for policy

\[ Q^\pi(s, a) = R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) V^\pi(s') \]

The value of being in state s if I take action a immediately and then follow the policy \(\pi\).

Policy Improvement

We have some policy \(\pi_k\) that prescribes some action \(a_j\) for the state s. But, if the Q function is maximised at action \(a_m\) instead, it means that a better policy would recommend action \(a_m\) from the current state.

So for our next policy \(\pi_{k+1}\) , we take the action \(a_m\) when in state s.
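
A minimal sketch of these two steps in R: q_values computes \(Q^\pi(s,a)\) for every pair from a value vector, and greedy_policy returns the deterministic policy that picks the maximising action in each state. Both function names, and the assumption that rewards come as an n_states x n_actions matrix R_sa, are my own (for the Mars rover, the reward doesn't depend on the action, so R_sa <- cbind(R, R) works).

# Q(s, a) = R(s, a) + y * sum_{s'} P(s'|s, a) V(s')
q_values <- function(R_sa, P, V, y){
  n_actions <- dim(P)[3]
  Q <- matrix(0, dim(P)[1], n_actions)
  for(i in 1:n_actions){
    Q[,i] <- R_sa[,i] + y * P[,,i] %*% V
  }
  Q
}

# Greedy improvement: indicator matrix whose row s is 1 at argmax_a Q(s, a)
greedy_policy <- function(Q){
  best <- max.col(Q, ties.method = "first")
  diag(ncol(Q))[best, , drop = FALSE]
}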

Policy Iteration Algorithm

  1. Initialize \(\pi_0(s)\) randomly for all s.
  2. While (\(||\pi_i - \pi_{i-1} ||> 0\))
    • Compute \(V^{\pi_i}\) using policy evaluation techniques.

    • Policy improvement step: \(\pi_{i+1}(s) \leftarrow \arg\max_{a \in A} Q^{\pi_i}(s,a)\)

If the policy stops changing, it has converged to an optimal policy and will never change again.

Maximum number of iterations = \(|A|^{|S|}\).
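
Putting the pieces together, a sketch of the full loop for the Mars rover MDP, assuming the q_values and greedy_policy helpers sketched above (policy_iteration is my own name for it):

policy_iteration <- function(R_sa, P, y){
  n_states <- dim(P)[1]
  n_actions <- dim(P)[3]
  pi <- diag(n_actions)[sample(n_actions, n_states, replace = TRUE), ]   # random initial policy
  repeat{
    V <- policy_eval(pi, P, rowSums(pi * R_sa), y)      # policy evaluation
    pi_new <- greedy_policy(q_values(R_sa, P, V, y))    # policy improvement
    if(all(pi_new == pi)) break                         # policy stopped changing
    pi <- pi_new
  }
  list(policy = pi, value = V)
}

policy_iteration(cbind(R, R), P, 0.5)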

Value Iteration

An alternative to policy iteration.

State-action value (Q value)

\[ Q(s, a) = \sum_{s'} T(s, a, s') (\mathcal{R}(s, a, s') + \gamma V(s')) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \]

Algorithm

  1. Initialise \(V_0(s) = 0\) for all s.
  2. Until convergence or end of horizon:
    • For each s:

      • Compute Q(s,a) for all (s,a) pairs (R(s,a), P(s'|s,a) and \(V_k(s')\) are known)

      • Assign:

        \[V_{k+1}(s) = \max_a Q(s,a) = \max_a[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_{k}(s')]\]

        Or, viewed in terms of Bellman Backup Operator

        \[ BV(s) = \max_a [R(s,a) + \gamma \sum_{s' \in S} P(s'|s, a) V(s')] \]

    • Update the policy too:

      \[ \pi_{k+1}(s) = \arg\max_a [R(s,a) + \gamma \sum_{s' \in S} P(s'|s, a) V_k(s')]\]
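
Below is a minimal sketch of this loop for the Mars rover MDP, reusing the q_values and greedy_policy helpers from the policy iteration section (value_iteration and the 1e-6 tolerance are my own choices):

value_iteration <- function(R_sa, P, y, tol = 1e-6){
  n_states <- dim(P)[1]
  V <- matrix(0, n_states, 1)                        # V_0 = 0
  repeat{
    Q <- q_values(R_sa, P, V, y)                     # Q(s, a) for all pairs
    V_new <- matrix(apply(Q, 1, max), n_states, 1)   # V_{k+1}(s) = max_a Q(s, a)
    if(max(abs(V_new - V)) < tol) break
    V <- V_new
  }
  list(value = V_new, policy = greedy_policy(Q))     # greedy policy w.r.t. the final values
}

value_iteration(cbind(R, R), P, 0.5)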

Doubt: there's something about the "backing up" that I don't yet fully get. I understand it proceeds stepwise, but the intuition isn't fully clear to me yet.

Python playground where I copy the code for an OOP MDP model and value iteration from the example in the CS221 lectures and then adapt it to a very similar problem proposed in a Computerphile video.

I like the CS221 way of thinking about it as a graph a bit more than the abstract CS234 approach. This also shows in the OOP approach (which I recall being used in Karpathy's Micrograd too, for the gradient computation graph, so I take it to be common practice for representing graphs).

References

  1. CIS 522 UPenn
  2. CS234 Stanford
  3. CS221 Stanford
  4. D2L.ai