1 Model-Based vs Model-Free

In a model-free framework we don't have an explicit transition or reward function, so we need to estimate values by interacting with the environment.

  • Q-Learning: value-based method
  • Policy Gradient: policy-based method
  • Actor-Critic: value- and policy-based method

In model-based reinforcement learning we have an explicit model of the transition (T) and reward (R) functions. We can plan with the model and increase sample efficiency (once we have the model, we can generate data without interacting with the environment). The price is increased complexity.

  • Policy Iteration
  • Value Iteration

2 Maze Example

Imagine the following maze with the current policy:

If we estimate all the transition probability distributions, we can use the Bellman optimality equation to plan:

\[V^*(s) = \max_a \Big(R(s,a) + \gamma \sum_{s'}\Pr(s' \mid s,a)\, V^*(s')\Big)\]

ModelBasedRL(s)
\(\qquad\) Repeat
\(\qquad \qquad\) Select and execute \(a\)
\(\qquad \qquad\) Observe \(s'\) and \(r\)
\(\qquad \qquad\) Update counts: \(n(s,a) \leftarrow n(s,a) + 1\) and \(n(s,a,s') \leftarrow n(s,a,s') + 1\)
\(\qquad \qquad\) Update transition: \(\Pr(s'|s,a) \leftarrow \frac{n(s,a,s')}{n(s,a)} \;\; \forall s'\)
\(\qquad \qquad\) Update reward: \(R(s,a) \leftarrow \frac{r+(n(s,a)-1)R(s,a)}{n(s,a)}\)
\(\qquad \qquad\) Solve: \(V^*(s) = \max_a \big(R(s,a) + \gamma \sum_{s'}\Pr(s' | s,a)\, V^*(s')\big)\)
\(\qquad \qquad\) \(s \leftarrow s'\)
\(\qquad\) Until convergence of \(V^*\)
\(\qquad\) Return \(V^*\)
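As a concrete illustration, here is a minimal tabular sketch of this loop in Python. The `env` interface (`reset`/`step`), the epsilon-greedy action selection, and the episode structure are illustrative assumptions; the notes only specify the count, transition, reward, and Bellman updates.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """Solve V*(s) = max_a [ R(s,a) + gamma * sum_s' T[s,a,s'] V*(s') ]."""
    V = np.zeros(T.shape[0])
    while True:
        Q = R + gamma * T @ V            # shape (S, A); T @ V sums over s'
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def select_action(V, T, R, s, gamma=0.95, eps=0.1):
    """Epsilon-greedy action with respect to the current model and V."""
    if np.random.rand() < eps:
        return np.random.randint(R.shape[1])
    return int(np.argmax(R[s] + gamma * T[s] @ V))

def model_based_rl(env, n_states, n_actions, episodes=100, gamma=0.95):
    n_sa = np.zeros((n_states, n_actions))
    n_sas = np.zeros((n_states, n_actions, n_states))
    R = np.zeros((n_states, n_actions))
    T = np.full((n_states, n_actions, n_states), 1.0 / n_states)
    V = np.zeros(n_states)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = select_action(V, T, R, s, gamma)
            s2, r, done = env.step(a)             # assumed env interface
            n_sa[s, a] += 1
            n_sas[s, a, s2] += 1
            T[s, a] = n_sas[s, a] / n_sa[s, a]    # Pr(s'|s,a) = n(s,a,s') / n(s,a)
            R[s, a] += (r - R[s, a]) / n_sa[s, a] # running average of observed rewards
            V = value_iteration(T, R, gamma)      # "Solve" step
            s = s2
    return V
```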

The algorithm above works for simple models, but if we are dealing with more complex problems (Atari games or robot control) we can use function approximators for the reward and transition models (see the sketch after this list):

  • Linear: Gaussian
  • Non-linear:
    • Stochastic: Gaussian process
    • Deterministic: neural network
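As a concrete illustration of the deterministic neural-network option, here is a minimal PyTorch sketch of transition and reward models trained by squared error, matching the gradient updates used in the next section. The dimensions, architecture, and optimizer settings are illustrative assumptions, not from the notes.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2   # illustrative sizes

# Deterministic models: T(s,a) -> predicted next state, R(s,a) -> predicted reward
T_model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                        nn.Linear(64, state_dim))
R_model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                        nn.Linear(64, 1))
opt = torch.optim.Adam(list(T_model.parameters()) + list(R_model.parameters()), lr=1e-3)

def model_update(s, a, r, s2):
    """One gradient step on (T(s,a) - s')^2 and (R(s,a) - r)^2 for a real transition."""
    sa = torch.cat([s, a], dim=-1)
    loss = ((T_model(sa) - s2) ** 2).mean() + ((R_model(sa) - r) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```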

3 Partial Planning

Once we are dealing with a complex model parameterized by a neural network or a Gaussian process, it is not obvious how to accomplish the planning task. With such models, fully optimizing the policy or value function at each time step is intractable.

  • Consider partial planning instead:
    • A few steps of Q-learning (or another algorithm)
    • A few steps of policy gradient

Both are model-free approaches, so we can collect data by interacting with the environment, but we can also use the model to simulate additional experience.

ModelBasedRL(s)
\(\qquad\) Repeat
\(\qquad \qquad\) Select and execute \(a\), observe \(s'\) and \(r\)
\(\qquad \qquad\) Update transition: \(W_T \leftarrow W_T - \alpha_T (T(s,a)-s') \nabla_{W_T}T(s,a)\)
\(\qquad \qquad\) Update reward: \(W_R \leftarrow W_R - \alpha_R (R(s,a)-r) \nabla_{W_R}R(s,a)\)
\(\qquad \qquad\) Repeat N times:
\(\qquad \qquad \qquad\) Sample \(\hat{s},\hat{a}\) arbitrarily
\(\qquad \qquad \qquad\) \(\delta \leftarrow R(\hat{s},\hat{a}) + \gamma \max_{\hat{a}'} Q(T(\hat{s},\hat{a}),\hat{a}') - Q(\hat{s},\hat{a})\)
\(\qquad \qquad \qquad\) Update \(Q\): \(W_Q \leftarrow W_Q - \alpha_Q \delta \nabla_{W_Q} Q(\hat{s},\hat{a})\)
\(\qquad \qquad\) \(s \leftarrow s'\)
\(\qquad\) Until convergence of \(Q\)
\(\qquad\) Return \(Q\) (to extract the policy)
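A hedged sketch of the inner planning loop under the same assumptions: a few Q-learning steps on transitions imagined with learned transition and reward networks (such as the `T_model`/`R_model` sketched earlier). The Q-network, optimizer, and the way \(\hat{s},\hat{a}\) are sampled are illustrative choices, not prescribed by the notes.

```python
import torch
import torch.nn.functional as F

def planning_steps(Q_net, q_opt, T_model, R_model, state_dim, action_dim,
                   n_steps=10, gamma=0.99):
    """A few steps of Q-learning on imagined transitions from the learned model."""
    for _ in range(n_steps):
        s_hat = torch.randn(1, state_dim)            # arbitrary imagined state
        a_hat = torch.randint(action_dim, (1,))      # arbitrary action index
        a_onehot = F.one_hot(a_hat, action_dim).float()
        sa = torch.cat([s_hat, a_onehot], dim=-1)
        with torch.no_grad():                        # model rollout: no gradients
            s_next = T_model(sa)
            r_hat = R_model(sa)
            target = r_hat + gamma * Q_net(s_next).max(dim=-1, keepdim=True).values
        q_sa = Q_net(s_hat).gather(1, a_hat.view(1, 1))   # Q(s_hat, a_hat)
        loss = ((q_sa - target) ** 2).mean()              # squared TD error delta
        q_opt.zero_grad(); loss.backward(); q_opt.step()
```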

We are assuming a deterministic transition model \(T(\cdot)\), so \(T(s,a)\) already gives the entire expectation over next states. If we assume a stochastic model, we would instead minimize a cross-entropy loss between the predicted distribution and the observed transitions.

Note the hat notation in the inner loop: \(\hat{s}\) and \(\hat{a}\) are not the current state and action. When we update the Q-function we use states and actions passed through our model, which together summarize our previous experience. The benefit is that, if the model is good, it generalizes well beyond the actual experience gathered so far; if it is not, the Q-function converges very slowly.

4 Planning vs Replay Buffer

Difference: instead of updating the Q-function with samples from a replay buffer (real interaction samples), we generate samples using the model (imagined or simulated experience).

With a replay buffer we sample stored transitions instead of sampling arbitrarily, and the buffer keeps all previous experiences explicitly. A model-based approach stores that experience implicitly: the learned model summarizes all the previous transitions sampled from the unknown true dynamics.

It is simple to work with a replay buffer, but we cannot generalize beyond the samples stored in it. The sketch below contrasts the two ways of producing training transitions.
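This is a minimal sketch of the distinction, assuming simple callables `T` and `R` for the learned model and a `sample_state_action` helper; all of these names are hypothetical.

```python
import random

replay_buffer = []          # list of (s, a, r, s2) tuples from real interaction

def sample_real(batch_size=32):
    """Replay buffer: re-use stored real transitions."""
    return random.sample(replay_buffer, min(batch_size, len(replay_buffer)))

def sample_imagined(sample_state_action, T, R, batch_size=32):
    """Model: generate ("imagine") new transitions from arbitrary (s, a) pairs."""
    batch = []
    for _ in range(batch_size):
        s, a = sample_state_action()
        batch.append((s, a, R(s, a), T(s, a)))
    return batch
```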

Can we use both?

5 Dyna

Dyna is an algorithm that learns an explicit transition and/or reward model and plans over that model, while also learning directly from real experience.

Dyna-Q(s)
\(\qquad\) Repeat
\(\qquad \qquad\) Select and execute \(a\), observe \(s'\) and \(r\)
\(\qquad \qquad\) Update transition: \(W_T \leftarrow W_T - \alpha_T (T(s,a)-s') \nabla_{W_T}T(s,a)\)
\(\qquad \qquad\) Update reward: \(W_R \leftarrow W_R - \alpha_R (R(s,a)-r) \nabla_{W_R}R(s,a)\)
\(\qquad \qquad\) \(\delta \leftarrow R(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a)\)
\(\qquad \qquad\) Update \(Q\): \(W_Q \leftarrow W_Q - \alpha_Q \delta \nabla_{W_Q} Q(s,a)\)
\(\qquad \qquad\) Repeat N times:
\(\qquad \qquad \qquad\) Sample \(\hat{s},\hat{a}\) arbitrarily
\(\qquad \qquad \qquad\) \(\delta \leftarrow R(\hat{s},\hat{a}) + \gamma \max_{\hat{a}'} Q(T(\hat{s},\hat{a}),\hat{a}') - Q(\hat{s},\hat{a})\)
\(\qquad \qquad \qquad\) Update \(Q\): \(W_Q \leftarrow W_Q - \alpha_Q \delta \nabla_{W_Q} Q(\hat{s},\hat{a})\)
\(\qquad \qquad\) \(s \leftarrow s'\)
\(\qquad\) Until convergence of \(Q\)
\(\qquad\) Return \(Q\) (to extract the policy)

The inner loop solves the MDP (planning over the model), while the outer updates learn directly from real experience. A hedged tabular sketch follows.
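The sketch below is the classic tabular Dyna-Q variant with a table-lookup model, rather than the neural-network version in the pseudocode above; `env` with `reset`/`step` is an assumed interface and the hyperparameters are illustrative.

```python
import numpy as np

def dyna_q(env, n_states, n_actions, episodes=200,
           alpha=0.1, gamma=0.95, eps=0.1, n_planning=20):
    Q = np.zeros((n_states, n_actions))
    model = {}                                  # (s, a) -> (r, s'): learned table model
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
            s2, r, done = env.step(a)
            # direct RL update from real experience
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            # model learning
            model[(s, a)] = (r, s2)
            # planning: N updates from imagined experience sampled from the model
            for _ in range(n_planning):
                (sh, ah), (rh, sh2) = list(model.items())[rng.integers(len(model))]
                Q[sh, ah] += alpha * (rh + gamma * np.max(Q[sh2]) - Q[sh, ah])
            s = s2
    return Q
```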

7 AlphaGo Zero

It’s a beautiful piece of work that trains an agent for the game of Go through pure self-play, without any human knowledge except the rules of the game. The method is fairly simple compared to previous DeepMind papers, and AlphaGo Zero ends up convincingly beating AlphaGo (which was trained on expert games and beat the best human Go players).

7.1 Neural Network

There is a neural network at the core of the algorithm. The network \(f_{\theta}\) is parameterized by \(\theta\) and takes as input the raw board state \(s\) together with its history (the previous 7 board frames). Its outputs are a continuous value of the board state (from the current player's perspective), \(v_{\theta}(s) \in [-1,1]\), and a policy \(\vec{p}_{\theta}(s)\), a probability vector over all valid moves \((p_a = \Pr(a|s))\). A minimal two-headed sketch follows.
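This is a hedged PyTorch sketch of such a two-headed network, using a small fully connected body over a flattened board instead of the paper's residual convolutional tower; all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    """Minimal sketch of f_theta: board (with history) -> (p, v)."""
    def __init__(self, n_inputs, n_actions, n_hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.ReLU())
        self.policy_head = nn.Linear(n_hidden, n_actions)   # logits over moves
        self.value_head = nn.Linear(n_hidden, 1)             # scalar value head

    def forward(self, s):
        h = self.body(s)
        p = F.softmax(self.policy_head(h), dim=-1)           # probability vector p_theta(s)
        v = torch.tanh(self.value_head(h)).squeeze(-1)       # value v_theta(s) in [-1, 1]
        return p, v
```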

This network is trained using self-play, and the training examples take the form of triplets \((s_t, \vec{\pi}_t, z_t)\). Here, \(\vec{\pi}_t\) is an improved estimate of the policy at state \(s_t\) (produced by MCTS), and \(z_t \in \{-1,1\}\) is the final outcome of the game from that player's perspective. The loss function used to train the network is:

\[l = \sum_{t}\Big((v_{\theta}(s_t)-z_t)^2 - \vec{\pi}_t \cdot \log \vec{p}_{\theta}(s_t)\Big)\]
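A hedged PyTorch sketch of this loss over a batch of triplets (weight regularization and masking of invalid moves are omitted):

```python
import torch

def alphazero_loss(p, v, pi, z):
    """p: (B, A) predicted move probabilities, v: (B,) predicted values,
    pi: (B, A) MCTS search probabilities, z: (B,) game outcomes in {-1, +1}."""
    value_loss = torch.mean((v - z) ** 2)                                    # (v_theta(s_t) - z_t)^2
    policy_loss = -torch.mean(torch.sum(pi * torch.log(p + 1e-8), dim=-1))   # -pi_t . log p_theta(s_t)
    return value_loss + policy_loss
```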

After a self-play step (computed by Monte Carlo Tree Search), the next action is sampled from the improved distribution \(\vec{\pi}_t\) to reach \(s_{t+1}\). In parallel, the neural network keeps training on past triplets \((s_t, \vec{\pi}_t, z_t)\) produced by MCTS.

7.2 Monte Carlo Tree Search

The neural network provides an estimate of the policy, \(\vec{p}_\theta\). During the training phase we wish to improve this estimate using Monte Carlo Tree Search. A board configuration is represented as a node in the search tree, and a directed edge \(i \rightarrow j\) exists if a valid action moves the board from configuration \(i\) to \(j\).

For the search tree we maintain three quantities at each node:

  • \(Q(s,a)\): the Q-value of taking action \(a\) in state \(s\).
  • \(N(s,a)\): the number of times we took action \(a\) in state \(s\).
  • \(P(s,\cdot) = \vec{p}_\theta(s)\): the prior probability of each action from state \(s\), given by the network's policy.

The upper confidence bound on the Q-value is given by:

\[U(s,a) = Q(s,a) + c \cdot P(s,a) \cdot \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}\]

A single simulation proceeds as follows. We compute the action \(a\) that maximizes \(U(s,a)\). If the resulting node \(s'\) has already been visited (it exists in the tree), we apply the same recurrence at \(s'\). If it does not exist, we add the new node to the tree and use the network to compute \(\vec{p}_\theta(s')\) and its value \(v(s')\), initializing \(Q(s',a)\) and \(N(s',a)\) for all available actions. Then we propagate \(v(s')\) up along the path of the current simulation and update all \(Q(s,a)\) on that path. If we reach a terminal state, we propagate the actual outcome instead (+1 for a win, -1 for a loss). A hedged sketch of one simulation follows.
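This sketch stores the tree statistics in dictionaries keyed by hashable states; the `game` helpers (`valid_moves`, `next_state`, `terminal_value`) and the `net(s)` call returning \((\vec{p}, v)\) are assumed interfaces, not part of the notes.

```python
import math

Q, N, P = {}, {}, {}    # Q-values, visit counts, network priors
C_PUCT = 1.0            # exploration constant c

def search(s, game, net):
    """One simulation from state s; returns the value of s from the viewpoint
    of the player who moved into s (hence the sign flips for alternating play)."""
    outcome = game.terminal_value(s)                 # None if s is not terminal
    if outcome is not None:
        return -outcome                              # propagate the actual game result
    if s not in P:                                   # leaf: expand with the network
        P[s], v = net(s)
        for a in game.valid_moves(s):
            Q[(s, a)], N[(s, a)] = 0.0, 0
        return -v
    # pick the action maximizing the upper confidence bound U(s, a)
    total = sum(N[(s, a)] for a in game.valid_moves(s))
    best_a = max(game.valid_moves(s),
                 key=lambda a: Q[(s, a)]
                 + C_PUCT * P[s][a] * math.sqrt(total + 1e-8) / (1 + N[(s, a)]))
    v = search(game.next_state(s, best_a), game, net)
    # back up: running mean of Q and visit count for the chosen edge
    Q[(s, best_a)] = (N[(s, best_a)] * Q[(s, best_a)] + v) / (N[(s, best_a)] + 1)
    N[(s, best_a)] += 1
    return -v
```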

During self-play, after we perform MCTS we pick the next move based on the improved policy \(\vec{\pi}(s)\).

Search probabilities \(\pi\) are proportional to \(N^{\frac{1}{\tau}}\), where \(N\) is the visit count of each move from the root state and \(\tau\) is a parameter controlling temperature.
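A small sketch of turning root visit counts into \(\vec{\pi}\) with temperature \(\tau\):

```python
import numpy as np

def search_probabilities(counts, tau=1.0):
    """pi(a) proportional to N(a)^(1/tau); tau -> 0 picks the most-visited move."""
    counts = np.asarray(counts, dtype=float)
    if tau == 0:
        pi = np.zeros_like(counts)
        pi[np.argmax(counts)] = 1.0
        return pi
    scaled = counts ** (1.0 / tau)
    return scaled / scaled.sum()
```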

Final Considerations:

  • History Of States
  • Temperature: AlphaGo Zero uses \(\tau =1\) (simply the normalized counts) for the first 30 moves of each game, and then sets it to an infinitesimal value (picking the move with the maximum counts).
  • Symmetry: The Go board is invariant to rotation and reflection. When MCTS reaches a leaf node, the current neural network is called with a reflected or rotated version of the board to exploit this symmetry. In general, this can be extended to other games using symmetries that hold for the game.
  • Asynchronous: The neural network queries are batched and each search thread is locked until evaluation completes. In addition, the 3 main processes: self-play, neural network training and comparison between old and new networks are all done in parallel.
  • NN design