class: center, middle, inverse, title-slide

# Interpretable Deep Reinforcement Learning:
## A Model-Based Approach
### Mauro Alejandro Montenegro Meza
### Network and Data Science Laboratory (C.I.C.)
### 2021-02-18

---
class: split-10 white with-border

.row.bg-white[.content.center[
# .black[Review of Black-Box Explanation Methods]
<br>
]]

.row.bg-gray[.content.split-three[
.column[.content.center[
# Model Explanation
<br>
.black[
## * Decision Trees
## * Decision Rules
## * Feature Importance]
]]
.column.bg-blue-gray[.content.center[
# Model <br> Inspection
<br>
.black[
## * PDP
## * Sensitivity Analysis
## * Optimization]
]]
.column[.content.center[
# Outcome Explanation
<br>
.black[
## * Saliency Masks
## * Prototype Selection]
]]
]]

---
class: split-10 white with-border with-thick-border border-black

.row.bg-white[.content.center[
# .black[Model-Based vs Model-Free]
<br>
]]

.row.bg-gray[.content.split-40[
.row.bg-gray[.content.split-two[
.column.bg-gray[.content.center[
## Model-Free
### * Estimate values by interacting with the environment
### * Learning
]]
.column.bg-white[.content.center[
## .black[Model-Based]
### .blue[* Explicit transition and reward <br> function]
### .blue[* Planning]
]]
]]
.row.bg-white[.content.center[
<img src="pictures/Model_Based.png" width="60%" />
]]
]]

---
class: split-10 with-border border-black

.row[.content.center[
# Explainability through Modeling
<br>
]]

.row[.content.split-three[
.row[.content.vmiddle.center[
### * Model-based RL may be an important element of explainability, since it allows the agent to communicate not only its goals, but also the way it intends to achieve them.
]]
.row[.content.vmiddle.center[
### * While learning a single policy is good for one task, if you can predict the dynamics of the environment, you can generalize those insights to multiple tasks.
]]
.row[.content.vmiddle.center[
### * Having a model means you can determine some degree of model uncertainty, so that you can gauge how confident you should be about the resulting decision process.
]]
]]

---
class: center, middle

## “The next big step forward in AI will be systems that actually understand their worlds. The world is only accessed through the lens of experience, so to understand the world means to be able to predict and control your experience, your sense data, with some accuracy and flexibility. In other words, understanding means forming a predictive model of the world and using it to get what you want. This is model-based reinforcement learning.”

--------------------------------

## *Richard S. Sutton*

---
class: split-10 with-border border-black

.row[.content.center[
# Integration of Planning and Learning
<br>
]]

.row.bg-white[.content.center.vmiddle[
<div class="figure">
<img src="pictures/MB_Frameworks.png" alt="Model Based Frameworks" width="90%" />
<p class="caption">Model Based Frameworks</p>
</div>
]]

---
class: split-10 white with-border

.row.bg-white[.content.center[
# .black[At which state should planning start?]
<br>
]]

.row.bg-gray[.content.split-three[
.column[.content.center[
# Visited
<br>
.black[
## Plan only from previously visited states]
]]
.column.bg-blue-gray[.content.center[
# Prioritized
<br>
.black[
## Order reachable states by relevance (sketch on the next slide)]
]]
.column[.content.center[
# Current
<br>
.black[
## Find a better solution in the region where we are currently operating]
]]
]]
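---

# Prioritized Planning: A Sketch

A minimal, hypothetical sketch (in the spirit of prioritized sweeping, not taken from the reviewed papers) of planning that starts from the state-action pairs whose backed-up value changed the most. The inputs `model`, `Q`, `predecessors`, `seeds`, and `actions` are assumed data structures, not a specific library API.

```python
# Hypothetical prioritized planning: replan first where the value changed most.
import heapq
import itertools

def prioritized_planning(model, Q, predecessors, seeds, actions,
                         budget=10, gamma=0.95, alpha=0.1, theta=1e-3):
    """model[(s, a)] -> (reward, next_state) learned from real experience;
    Q[(s, a)] -> current value estimate (defined for all state-action pairs);
    predecessors[s] -> (s', a') pairs the model predicts lead into s;
    seeds -> state-action pairs touched by the last real transition."""
    counter = itertools.count()              # tie-breaker so the heap never compares states

    def td_error(s, a):
        r, s2 = model[(s, a)]                # one query to the learned model
        return r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)]

    heap = [(-abs(td_error(s, a)), next(counter), (s, a)) for (s, a) in seeds]
    heapq.heapify(heap)                      # max-priority queue via negated keys

    for _ in range(budget):                  # planning budget: number of planning updates
        if not heap:
            break
        _, _, (s, a) = heapq.heappop(heap)
        Q[(s, a)] += alpha * td_error(s, a)          # planning update from the model
        for (sp, ap) in predecessors.get(s, ()):     # queue relevant predecessors
            p = abs(td_error(sp, ap))
            if p > theta:
                heapq.heappush(heap, (-p, next(counter), (sp, ap)))
    return Q
```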
---
class: split-10 border-black with-thick-border bg-white

.row[.content.center.bg-white[
# Budget for Planning
]]

.row[.content.split-two[
.column[.content.bg-white.center[
## When to start planning?

### Dyna plans N times before each action!
<img src="pictures/Dyna.png" width="90%" />
### PILCO collects data before planning
<img src="pictures/PILCO.png" width="90%" />
]]
.column[.content.center.bg-white[
## How much time to spend planning?
<br><br><br>
### Dyna makes N iterations with a budget of 1, while AlphaGo Zero makes one iteration of self-play with a budget of ~200!
<img src="pictures/Alpha_Go_Zero.png" width="100%" />
### The budget is a measure of how many times we use the model (a Dyna-style sketch appears in the appendix slide at the end).
]]
]]

---

<img src="pictures/B_D_MB.png" width="100%" />

---
class: split-10 with-border border-black with-thick-border

.row[.content.center[
# How to Plan?
<br>
]]

.row[.content.split-three[
.row[.content.bg-teal[
### Type:
### Discrete planning: Monte Carlo Tree Search (MCTS), minimax search.
### Differential planning: differentiate through a neural network or a Gaussian process (GP).
<br><br>
]]
.row[.content.bg-blue[
### Depth and Breadth:
### A model is by definition reversible, and we are now free to choose and adaptively balance the breadth ( `\(b\)` ) and depth ( `\(d\)` ) of the plan.
### Breadth: Dyna (b = 1), MCTS (b = adaptive), Dynamic Programming (b = full). Depth: Dyna (d = 1), MCTS (d = adaptive), PILCO (d = full).
]]
.row[.content.bg-indigo[
### Uncertainty:
### Data-close planning: ensure that the planning iterations stay close to regions where we have actually observed data (Dyna and Guided Policy Search).
### Propagate uncertainty: explicitly estimate model uncertainty, which allows us to plan robustly over long horizons (parametric and particle-based methods).
]]
]]

---
class: split-10 border-black with-thick-border bg-white

.row[.content.center[
# Integrating Planning and Learning!
]]

.row[.content.split-two[
.column[.content.center.vmiddle[
## Planning input from learned functions (b)
<br>
## Planning update for policy or value update (c)
<br>
## Planning output for action selection in the real environment (d)
<br>
]]
.column[.content.center.vmiddle[
<img src="pictures/PL_Loop.png" width="100%" />
]]
]]

---
class: split-three with-border border-black

.column[.content.center[
## Planning input from learned functions using priors
-----
<br><br><br>
### .blue[Value priors:]
### Using bootstrapping
<img src="pictures/bootstrapping.png" width="100%" />
<br><br><br>
### .blue[Policy priors:]
### UCB term used in MCTS: `\(U(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}\)` (sketch on the next slide)
]]
.column[.content.center[
## Planning update for policy or value
-----
### .blue[Value update:]
### State-action value estimate at the root (depends on the back-up policy); the loss is usually MSE or cross-entropy.
### .blue[Policy update:]
<img src="pictures/AGZ_training.png" width="100%" />
]]
.column[.content.center[
## Planning output for action selection in the real environment
-----
### .blue[Plan over learning:]
### Model Predictive Control selects the greedy action at every step.
<br>
### .blue[Value of Perfect Information (VPI):] estimates from the model which exploratory action has the highest potential to change the greedy policy.
]]
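---

# Policy Priors in MCTS: A Sketch

A minimal, hypothetical sketch of the prior-weighted exploration bonus from the previous slide: the bonus grows with the policy prior `\(P(s, a)\)` and shrinks with the visit count `\(N(s, a)\)`. The `c_puct` constant and the dictionary inputs are assumptions for illustration; full PUCT (as in AlphaGo Zero) additionally scales the bonus by the square root of the parent visit count.

```python
# Hypothetical prior-weighted action selection at one MCTS node.
def select_action(Q, P, N, actions, c_puct=1.0):
    """Q: mean value per action; P: policy prior; N: visit counts."""
    def score(a):
        u = c_puct * P[a] / (1 + N[a])   # U(s, a) ∝ P(s, a) / (1 + N(s, a))
        return Q[a] + u                  # exploit (Q) + explore (U)
    return max(actions, key=score)

# Example: the prior steers search toward promising but rarely visited actions.
actions = ["left", "right"]
Q = {"left": 0.5, "right": 0.5}
P = {"left": 0.8, "right": 0.2}
N = {"left": 3, "right": 3}
print(select_action(Q, P, N, actions))   # -> "left"
```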
---
class: split-10 with-border border-black

.row[.content[
# Model-Free Planning
]]

.row[.content[
## .blue[Network structure inspired by planning:]
### Embeds planning priors in the neural network architecture: base networks represent the dynamics, reward, value and/or back-up functions. These base networks are chained together (like planning steps) to form a large network, whose output is trained on an outer supervised loss, such as predicting the value at the root.
<img src="pictures/MuZero.png" width="90%" />
]]

---
class: split-10 with-border border-black

.row[.content[
# Model-Free Planning
]]

.row[.content[
## .blue[Networks that learn planning operations:]
### Why not learn the planning algorithm itself? This approach is trained in a supervised fashion, so we require knowledge of the optimal policy or value during training (usually obtained first with a model-free RL algorithm). Effectively, we first need to solve the task in order to train the model-free planner, which can then hopefully generalize to other tasks.
<img src="pictures/MCTSNet.png" width="90%" />
]]

---
class: split-10 with-border border-black

.row[.content[
# Model-Free Planning
]]

.row[.content[
## .blue[Black-box recurrent networks with planning-like characteristics:]
### Specifies an RNN and lets the optimization figure out how to use it for planning. Imagination-Augmented Agents (I2A) make roll-outs in a recurrent transition model and feed its raw output (representation) directly into a policy and value network; policy and value are then optimized end-to-end, leaving it up to the optimization how to use the transition model.
<img src="pictures/I2A.png" width="85%" />
]]
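---

# Appendix: A Dyna-Style Planning Loop

A minimal, hypothetical Dyna-Q-style sketch of the planning budget discussed earlier: after every real step the agent updates a tabular model and then spends a budget of N planning updates using only that model. The environment interface (`env.reset()` returning a state, `env.step(a)` returning `(next_state, reward, done)`) is an assumption for illustration.

```python
# Hypothetical Dyna-Q-style loop: learn a model from real experience,
# then spend a planning budget of N model-based updates before the next action.
import random
from collections import defaultdict

def dyna_q(env, actions, episodes=100, N=10, gamma=0.95, alpha=0.1, eps=0.1):
    Q = defaultdict(float)   # state-action values
    model = {}               # (s, a) -> (reward, next_state) from real steps
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # act in the real environment (epsilon-greedy on Q)
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda b: Q[(s, b)]))
            s2, r, done = env.step(a)
            # learning: update the value estimate and the model from the real transition
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
            model[(s, a)] = (r, s2)
            # planning: N extra value updates that use only the learned model
            for _ in range(N):
                ps, pa = random.choice(list(model))
                pr, ps2 = model[(ps, pa)]
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in actions) - Q[(ps, pa)])
            s = s2
    return Q
```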