1 Model-Based vs Model-Free

In a model-free framework we don't have an explicit transition or reward function, so we need to estimate values by interacting with the environment.

  • Q-Learning: value-based method
  • Policy Gradient: policy-based method
  • Actor-Critic: value- and policy-based method

In model-based reinforcement learning we have an explicit model of the transition (T) and reward (R) functions. We can plan with the model and increase sample efficiency (once we have the model, we can generate data without interacting with the environment). The price is increased complexity.

  • Policy Iteration
  • Value Iteration

2 Maze Example

Imagine the following maze with the current policy:

If we estimate all the transition probability distributions, we can use the Bellman optimality equation to plan:

\[V^*(s) = \max_a \Big(R(s,a) + \gamma \sum_{s'}\Pr(s' \mid s,a)\, V^*(s')\Big)\]

ModelBasedRL(s)
\(\qquad\) Repeat
\(\qquad \qquad\) Select and execute \(a\)
\(\qquad \qquad\) Observe \(s'\) and \(r\)
\(\qquad \qquad\) Update counts: \(n(s,a) \leftarrow n(s,a) + 1\) and \(n(s,a,s') \leftarrow n(s,a,s') + 1\)
\(\qquad \qquad\) Update transition: \(\Pr(s'|s,a) \leftarrow \frac{n(s,a,s')}{n(s,a)} \;\; \forall s'\)
\(\qquad \qquad\) Update reward: \(R(s,a) \leftarrow \frac{r+(n(s,a)-1)R(s,a)}{n(s,a)}\)
\(\qquad \qquad\) Solve: \(V^*(s) = \max_a \big(R(s,a) + \gamma \sum_{s'}\Pr(s' | s,a)\, V^*(s')\big)\)
\(\qquad \qquad\) \(s \leftarrow s'\)
\(\qquad\) Until convergence of \(V^*\)
\(\qquad\) Return \(V^*\)
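As a concrete illustration, here is a minimal tabular sketch of this loop in Python. The `env` interface (`reset`/`step`), the epsilon-greedy action selection, and the episode structure are illustrative assumptions; the notes only specify the count, transition, reward, and Bellman updates.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-6):
    """Solve V*(s) = max_a [ R(s,a) + gamma * sum_s' T[s,a,s'] V*(s') ]."""
    V = np.zeros(T.shape[0])
    while True:
        Q = R + gamma * T @ V            # shape (S, A); T @ V sums over s'
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def select_action(V, T, R, s, gamma=0.95, eps=0.1):
    """Epsilon-greedy action with respect to the current model and V."""
    if np.random.rand() < eps:
        return np.random.randint(R.shape[1])
    return int(np.argmax(R[s] + gamma * T[s] @ V))

def model_based_rl(env, n_states, n_actions, episodes=100, gamma=0.95):
    n_sa = np.zeros((n_states, n_actions))
    n_sas = np.zeros((n_states, n_actions, n_states))
    R = np.zeros((n_states, n_actions))
    T = np.full((n_states, n_actions, n_states), 1.0 / n_states)
    V = np.zeros(n_states)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = select_action(V, T, R, s, gamma)
            s2, r, done = env.step(a)             # assumed env interface
            n_sa[s, a] += 1
            n_sas[s, a, s2] += 1
            T[s, a] = n_sas[s, a] / n_sa[s, a]    # Pr(s'|s,a) = n(s,a,s') / n(s,a)
            R[s, a] += (r - R[s, a]) / n_sa[s, a] # running average of observed rewards
            V = value_iteration(T, R, gamma)      # "Solve" step
            s = s2
    return V
```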

The algorithm above works for simple models, but if we are dealing with more complex problems (Atari games or robot control) we can use function approximators for the reward and transition models (see the sketch after this list):

  • Linear: Gaussian
  • Non-linear:
    • Stochastic: Gaussian process
    • Deterministic: neural network
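As a concrete illustration of the deterministic neural-network option, here is a minimal PyTorch sketch of transition and reward models trained by squared error, matching the gradient updates used in the next section. The dimensions, architecture, and optimizer settings are illustrative assumptions, not from the notes.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 2   # illustrative sizes

# Deterministic models: T(s,a) -> predicted next state, R(s,a) -> predicted reward
T_model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                        nn.Linear(64, state_dim))
R_model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                        nn.Linear(64, 1))
opt = torch.optim.Adam(list(T_model.parameters()) + list(R_model.parameters()), lr=1e-3)

def model_update(s, a, r, s2):
    """One gradient step on (T(s,a) - s')^2 and (R(s,a) - r)^2 for a real transition."""
    sa = torch.cat([s, a], dim=-1)
    loss = ((T_model(sa) - s2) ** 2).mean() + ((R_model(sa) - r) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```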

3 Partial Planning

Once we are dealing with a complex model parameterized by a neural network or a Gaussian process, it is not obvious how to accomplish the planning task. With such models, fully optimizing the policy or value function at each time step is intractable.

  • Consider partial planning instead:
    • A few steps of Q-learning (or another algorithm)
    • A few steps of policy gradient

Both are model-free approaches, so we can collect data by interacting with the environment, but we can also use the model to simulate additional experience.

ModelBasedRL(s)
\(\qquad\) Repeat
\(\qquad \qquad\) Select and execute \(a\), observe \(s'\) and \(r\)
\(\qquad \qquad\) Update transition: \(W_T \leftarrow W_T - \alpha_T (T(s,a)-s') \nabla_{W_T}T(s,a)\)
\(\qquad \qquad\) Update reward: \(W_R \leftarrow W_R - \alpha_R (R(s,a)-r) \nabla_{W_R}R(s,a)\)
\(\qquad \qquad\) Repeat N times:
\(\qquad \qquad \qquad\) Sample \(\hat{s},\hat{a}\) arbitrarily
\(\qquad \qquad \qquad\) \(\delta \leftarrow R(\hat{s},\hat{a}) + \gamma \max_{\hat{a}'} Q(T(\hat{s},\hat{a}),\hat{a}') - Q(\hat{s},\hat{a})\)
\(\qquad \qquad \qquad\) Update \(Q\): \(W_Q \leftarrow W_Q - \alpha_Q \delta \nabla_{W_Q} Q(\hat{s},\hat{a})\)
\(\qquad \qquad\) \(s \leftarrow s'\)
\(\qquad\) Until convergence of \(Q\)
\(\qquad\) Return \(Q\) (to extract the policy)
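A hedged sketch of the inner planning loop under the same assumptions: a few Q-learning steps on transitions imagined with learned transition and reward networks (such as the `T_model`/`R_model` sketched earlier). The Q-network, optimizer, and the way \(\hat{s},\hat{a}\) are sampled are illustrative choices, not prescribed by the notes.

```python
import torch
import torch.nn.functional as F

def planning_steps(Q_net, q_opt, T_model, R_model, state_dim, action_dim,
                   n_steps=10, gamma=0.99):
    """A few steps of Q-learning on imagined transitions from the learned model."""
    for _ in range(n_steps):
        s_hat = torch.randn(1, state_dim)            # arbitrary imagined state
        a_hat = torch.randint(action_dim, (1,))      # arbitrary action index
        a_onehot = F.one_hot(a_hat, action_dim).float()
        sa = torch.cat([s_hat, a_onehot], dim=-1)
        with torch.no_grad():                        # model rollout: no gradients
            s_next = T_model(sa)
            r_hat = R_model(sa)
            target = r_hat + gamma * Q_net(s_next).max(dim=-1, keepdim=True).values
        q_sa = Q_net(s_hat).gather(1, a_hat.view(1, 1))   # Q(s_hat, a_hat)
        loss = ((q_sa - target) ** 2).mean()              # squared TD error delta
        q_opt.zero_grad(); loss.backward(); q_opt.step()
```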

We are assuming a deterministic transition model \(T(\cdot)\), so \(T(s,a)\) already gives the entire expectation over next states. If we assume a stochastic model, we would instead minimize a cross-entropy loss between the predicted distribution and the observed transitions.

Note the hat notation in the inner loop: \(\hat{s}\) and \(\hat{a}\) are not the current state and action. When we update the Q-function we use states and actions passed through our model, which together summarize our previous experience. The benefit is that, if the model is good, it generalizes well beyond the actual experience gathered so far; if it is not, the Q-function converges very slowly.

4 Planning vs Replay Buffer

Difference: instead of updating the Q-function with samples from a replay buffer (real interaction samples), we generate samples using the model (imagined or simulated experience).

With a replay buffer we sample stored transitions instead of sampling arbitrarily, and the buffer keeps all previous experiences explicitly. A model-based approach stores that experience implicitly: the learned model summarizes all the previous transitions sampled from the unknown true dynamics.

It is simple to work with a replay buffer, but we cannot generalize beyond the samples stored in it. The sketch below contrasts the two ways of producing training transitions.
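This is a minimal sketch of the distinction, assuming simple callables `T` and `R` for the learned model and a `sample_state_action` helper; all of these names are hypothetical.

```python
import random

replay_buffer = []          # list of (s, a, r, s2) tuples from real interaction

def sample_real(batch_size=32):
    """Replay buffer: re-use stored real transitions."""
    return random.sample(replay_buffer, min(batch_size, len(replay_buffer)))

def sample_imagined(sample_state_action, T, R, batch_size=32):
    """Model: generate ("imagine") new transitions from arbitrary (s, a) pairs."""
    batch = []
    for _ in range(batch_size):
        s, a = sample_state_action()
        batch.append((s, a, R(s, a), T(s, a)))
    return batch
```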

Can we use both?

5 Dyna

Dyna is an algorithm that learns an explicit transition and/or reward model and plans over that model, while also learning directly from real experience.

Dyna-Q(s)
\(\qquad\) Repeat
\(\qquad \qquad\) Select and execute \(a\), observe \(s'\) and \(r\)
\(\qquad \qquad\) Update transition: \(W_T \leftarrow W_T - \alpha_T (T(s,a)-s') \nabla_{W_T}T(s,a)\)
\(\qquad \qquad\) Update reward: \(W_R \leftarrow W_R - \alpha_R (R(s,a)-r) \nabla_{W_R}R(s,a)\)
\(\qquad \qquad\) \(\delta \leftarrow R(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a)\)
\(\qquad \qquad\) Update \(Q\): \(W_Q \leftarrow W_Q - \alpha_Q \delta \nabla_{W_Q} Q(s,a)\)
\(\qquad \qquad\) Repeat N times:
\(\qquad \qquad \qquad\) Sample \(\hat{s},\hat{a}\) arbitrarily
\(\qquad \qquad \qquad\) \(\delta \leftarrow R(\hat{s},\hat{a}) + \gamma \max_{\hat{a}'} Q(T(\hat{s},\hat{a}),\hat{a}') - Q(\hat{s},\hat{a})\)
\(\qquad \qquad \qquad\) Update \(Q\): \(W_Q \leftarrow W_Q - \alpha_Q \delta \nabla_{W_Q} Q(\hat{s},\hat{a})\)
\(\qquad \qquad\) \(s \leftarrow s'\)
\(\qquad\) Until convergence of \(Q\)
\(\qquad\) Return \(Q\) (to extract the policy)

The inner loop solves the MDP (planning over the model), while the outer updates learn directly from real experience. A hedged tabular sketch follows.
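The sketch below is the classic tabular Dyna-Q variant with a table-lookup model, rather than the neural-network version in the pseudocode above; `env` with `reset`/`step` is an assumed interface and the hyperparameters are illustrative.

```python
import numpy as np

def dyna_q(env, n_states, n_actions, episodes=200,
           alpha=0.1, gamma=0.95, eps=0.1, n_planning=20):
    Q = np.zeros((n_states, n_actions))
    model = {}                                  # (s, a) -> (r, s'): learned table model
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
            s2, r, done = env.step(a)
            # direct RL update from real experience
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            # model learning
            model[(s, a)] = (r, s2)
            # planning: N updates from imagined experience sampled from the model
            for _ in range(n_planning):
                (sh, ah), (rh, sh2) = list(model.items())[rng.integers(len(model))]
                Q[sh, ah] += alpha * (rh + gamma * np.max(Q[sh2]) - Q[sh, ah])
            s = s2
    return Q
```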

7 AlphaGo Zero

It’s a beautiful piece of work that trains an agent for the game of Go through pure self-play, without any human knowledge except the rules of the game. The method is fairly simple compared to previous DeepMind papers, and AlphaGo Zero ends up convincingly beating AlphaGo (which was trained on expert games and beat the best human Go players).

7.1 Neural Network

There is a neural network at the core of the algorithm. The network \(f_{\theta}\) is parameterized by \(\theta\) and takes as input the raw board state \(s\) together with its history (the previous 7 board frames). Its outputs are a continuous value of the board state (from the current player's perspective), \(v_{\theta}(s) \in [-1,1]\), and a policy \(\vec{p}_{\theta}(s)\), a probability vector over all valid moves \((p_a = \Pr(a|s))\). A minimal two-headed sketch follows.
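This is a hedged PyTorch sketch of such a two-headed network, using a small fully connected body over a flattened board instead of the paper's residual convolutional tower; all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    """Minimal sketch of f_theta: board (with history) -> (p, v)."""
    def __init__(self, n_inputs, n_actions, n_hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_inputs, n_hidden), nn.ReLU())
        self.policy_head = nn.Linear(n_hidden, n_actions)   # logits over moves
        self.value_head = nn.Linear(n_hidden, 1)             # scalar value head

    def forward(self, s):
        h = self.body(s)
        p = F.softmax(self.policy_head(h), dim=-1)           # probability vector p_theta(s)
        v = torch.tanh(self.value_head(h)).squeeze(-1)       # value v_theta(s) in [-1, 1]
        return p, v
```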

This network is trained using self-play, and the training examples take the form of triplets \((s_t, \vec{\pi}_t, z_t)\). Here, \(\vec{\pi}_t\) is an improved estimate of the policy at state \(s_t\) (produced by MCTS), and \(z_t \in \{-1,1\}\) is the final outcome of the game from that player's perspective. The loss function used to train the network is:

\[l = \sum_{t}\Big((v_{\theta}(s_t)-z_t)^2 - \vec{\pi}_t \cdot \log \vec{p}_{\theta}(s_t)\Big)\]
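A hedged PyTorch sketch of this loss over a batch of triplets (weight regularization and masking of invalid moves are omitted):

```python
import torch

def alphazero_loss(p, v, pi, z):
    """p: (B, A) predicted move probabilities, v: (B,) predicted values,
    pi: (B, A) MCTS search probabilities, z: (B,) game outcomes in {-1, +1}."""
    value_loss = torch.mean((v - z) ** 2)                                    # (v_theta(s_t) - z_t)^2
    policy_loss = -torch.mean(torch.sum(pi * torch.log(p + 1e-8), dim=-1))   # -pi_t . log p_theta(s_t)
    return value_loss + policy_loss
```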

After a self-play step (computed by Monte Carlo Tree Search), the next action is sampled from the improved distribution \(\vec{\pi}_t\) to reach \(s_{t+1}\). In parallel, the neural network keeps training on past triplets \((s_t, \vec{\pi}_t, z_t)\) produced by MCTS.

7.2 Monte Carlo Tree Search

The neural network provides an estimate of the policy, \(\vec{p}_\theta\). During the training phase we wish to improve this estimate using Monte Carlo Tree Search. A board configuration is represented as a node in the search tree, and a directed edge \(i \rightarrow j\) exists if a valid action moves the board from configuration \(i\) to \(j\).

For the search tree we maintain three quantities at each node:

  • \(Q(s,a)\): the Q-value of taking action \(a\) in state \(s\).
  • \(N(s,a)\): the number of times we took action \(a\) in state \(s\).
  • \(P(s,\cdot) = \vec{p}_\theta(s)\): the prior probability of each action from state \(s\), given by the network's policy.

The upper confidence bound on the Q-value is given by:

\[U(s,a) = Q(s,a) + c \cdot P(s,a) \cdot \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}\]

A single simulation proceeds as follows. We compute the action \(a\) that maximizes \(U(s,a)\). If the resulting node \(s'\) has already been visited (it exists in the tree), we apply the same recurrence at \(s'\). If it does not exist, we add the new node to the tree and use the network to compute \(\vec{p}_\theta(s')\) and its value \(v(s')\), initializing \(Q(s',a)\) and \(N(s',a)\) for all available actions. Then we propagate \(v(s')\) up along the path of the current simulation and update all \(Q(s,a)\) on that path. If we reach a terminal state, we propagate the actual outcome instead (+1 for a win, -1 for a loss). A hedged sketch of one simulation follows.
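This sketch stores the tree statistics in dictionaries keyed by hashable states; the `game` helpers (`valid_moves`, `next_state`, `terminal_value`) and the `net(s)` call returning \((\vec{p}, v)\) are assumed interfaces, not part of the notes.

```python
import math

Q, N, P = {}, {}, {}    # Q-values, visit counts, network priors
C_PUCT = 1.0            # exploration constant c

def search(s, game, net):
    """One simulation from state s; returns the value of s from the viewpoint
    of the player who moved into s (hence the sign flips for alternating play)."""
    outcome = game.terminal_value(s)                 # None if s is not terminal
    if outcome is not None:
        return -outcome                              # propagate the actual game result
    if s not in P:                                   # leaf: expand with the network
        P[s], v = net(s)
        for a in game.valid_moves(s):
            Q[(s, a)], N[(s, a)] = 0.0, 0
        return -v
    # pick the action maximizing the upper confidence bound U(s, a)
    total = sum(N[(s, a)] for a in game.valid_moves(s))
    best_a = max(game.valid_moves(s),
                 key=lambda a: Q[(s, a)]
                 + C_PUCT * P[s][a] * math.sqrt(total + 1e-8) / (1 + N[(s, a)]))
    v = search(game.next_state(s, best_a), game, net)
    # back up: running mean of Q and visit count for the chosen edge
    Q[(s, best_a)] = (N[(s, best_a)] * Q[(s, best_a)] + v) / (N[(s, best_a)] + 1)
    N[(s, best_a)] += 1
    return -v
```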

During self-play, after we perform MCTS we pick the next move based on the improved policy \(\vec{\pi}(s)\).

Search probabilities \(\pi\) are proportional to \(N^{\frac{1}{\tau}}\), where \(N\) is the visit count of each move from the root state and \(\tau\) is a parameter controlling temperature.
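A small sketch of turning root visit counts into \(\vec{\pi}\) with temperature \(\tau\):

```python
import numpy as np

def search_probabilities(counts, tau=1.0):
    """pi(a) proportional to N(a)^(1/tau); tau -> 0 picks the most-visited move."""
    counts = np.asarray(counts, dtype=float)
    if tau == 0:
        pi = np.zeros_like(counts)
        pi[np.argmax(counts)] = 1.0
        return pi
    scaled = counts ** (1.0 / tau)
    return scaled / scaled.sum()
```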

Final Considerations:

  • History Of States
  • Temperature: AlphaGo Zero uses \(\tau =1\) (simply the normalized counts) for the first 30 moves of each game, and then sets it to an infinitesimal value (picking the move with the maximum counts).
  • Symmetry: The Go board is invariant to rotation and reflection. When MCTS reaches a leaf node, the current neural network is called with a reflected or rotated version of the board to exploit this symmetry. In general, this can be extended to other games using symmetries that hold for the game.
  • Asynchronous: The neural network queries are batched and each search thread is locked until evaluation completes. In addition, the 3 main processes: self-play, neural network training and comparison between old and new networks are all done in parallel.
  • NN design