7 May 2017

Roadmap

  • Bayesian Deep Learning
  • Probabilistic Programming with Edward
  • Dropout as a Bayesian Approximation in Deep Neural Networks
  • Experiments

Bayesian Deep Learning

Deep Learning

  • State of the art in a wide range of applications: computer vision, speech recognition, natural language processing etc.

  • Neural network architectures: Feedforward, Recurrent, Convolutional…

  • Deeper neural networks and massive datasets: stochastic gradient descent, backpropagation, dropout

  • Deep Reinforcement Learning: represent Q-value functions, which estimate quality of an agent's next action, as deep neural networks

Drawbacks of Standard Deep Learning

  • Neural networks compute point estimates
  • Neural networks make overly confident decisions about the correct class, prediction or action
  • Neural networks are prone to overfitting
  • Neural networks have many hyperparameters that may require specific tuning

Probabilistic Machine Learning

Aim: Use probability theory to express all forms of uncertainty

  • Uncertainty estimates needed in forecasting, decision making, learning from limited and noisy data

  • AI Safety: medicine, engineering, finance, etc.

  • The probabilistic approach provides a way to reason about uncertainty and learn from data

Bayesian Neural Networks

A Bayesian neural network is a neural network with a prior distribution on the weights

  • Accounts for uncertainty in weights
  • Propagates this into uncertainty about predictions
  • More robust against overfitting: randomly sampling over network weights as a cheap form of model averaging
  • Can infer network hyperparameters by marginalising them out of the posterior distribution

Bayesian Neural Networks

  • Data \(\mathcal{D} = \{\mathbf{X}_i, y_i \}_{i=1}^{N}\) for input data point \(\mathbf{X}_i \in \mathbb{R}^d\) and output \(y_i \in \mathbb{R}\).

  • Parameters: network weights \(\mathbf{w}\) and biases \(b\).

  • For Bayesian neural networks normal priors are often used: \(p(\mathbf{w}) = \mathcal{N}(\mu, \Sigma)\)

Bayesian Neural Networks

  • Posterior: \(p(\mathbf{w} \vert \mathbf{X},y) = \frac{\displaystyle{ p(y \vert \mathbf{X}, \mathbf{w} )p(\mathbf{w}) }} { \displaystyle{p(y\vert \mathbf{X})} }\)

  • Predictive distribution of output \(y^*\) given a new input \(\mathbf{x}^*\): \[p(y^* \vert \mathbf{x}^*, \mathbf{X}, y) = \int p(y^* \vert \mathbf{x}^*, \mathbf{w})p(\mathbf{w} \vert \mathbf{X}, y) d\mathbf{w}\]

Bayesian Neural Networks

  • But exact inference is intractable!
  • Need to approximate the weight posterior \(p(\mathbf{w} \vert \mathbf{X},y)\)
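
As a rough illustration of why an approximate posterior is useful: if samples \(\mathbf{w}_t\) can be drawn from some approximation \(q(\mathbf{w})\) to the weight posterior, the predictive integral can be estimated by simple Monte Carlo. The symbols \(q\), \(T\) and \(\mathbf{w}_t\) are introduced here for illustration only.

\[p(y^* \vert \mathbf{x}^*, \mathbf{X}, y) \approx \frac{1}{T} \sum_{t=1}^{T} p(y^* \vert \mathbf{x}^*, \mathbf{w}_t), \qquad \mathbf{w}_t \sim q(\mathbf{w}) \approx p(\mathbf{w} \vert \mathbf{X}, y)\]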

History of Bayesian Neural Networks

Approximate Inference in Bayesian Neural Networks

  • Laplace Approximation - David MacKay (1992)
  • Minimum Description Length - Hinton & van Camp (1993)
  • Hamiltonian Monte Carlo - Radford Neal (1995)
  • Ensemble Learning - Barber & Bishop (1998)
  • Methods not scalable for modern applications and massive data sets
See Zoubin Ghahramani's NIPS 2016 keynote address on 'History of Bayesian Neural Networks': https://www.youtube.com/watch?v=FD8l2vPU5FY

Modern Revival: Bayesian Deep Learning

NIPS 2016: bayesiandeeplearning.org

Modern Revival: Bayesian Deep Learning

  • Practical Variational Inference for Neural Networks, A Graves, NIPS 2011.

  • Auto-Encoding Variational Bayes, D P. Kingma, M Welling, CoRR abs/1312.6114, 2013.

  • Stochastic Backpropagation and Approximate Inference in Deep Generative Models, D Rezende, S Mohamed, and D Wierstra, ICML 2014.

  • Weight uncertainty in neural networks, C Blundell, J Cornebise, K Kavukcuoglu, and D Wierstra, ICML 2015.

  • Probabilistic backpropagation for scalable learning of Bayesian neural networks, J M Hernandez-Lobato, R P Adams, ICML 2015

  • Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, Y Gal, Z Ghahramani, ICML 2016

See Yarin Gal's PhD thesis: Uncertainty in Deep Learning, University of Cambridge (2016)

Probabilistic Programming with Edward

Probabilistic Programming

Overview

  • Represent probabilistic models as programs that generate samples

  • Automate inference of unobserved variables in the model conditioned on observed program output

  • Probabilistic Programming Languages (PPLs): Church, Venture, Anglican, Figaro, WebPPL, Stan, PyMC3, Edward etc.

  • Many PPLs treat inference as a black box abstracted away from model

  • Separation of model and inference: harder to represent some recent advances, e.g. Generative Adversarial Networks

  • Expressivity versus Scalability trade-off

Probabilistic Programming

"All sufficiently complex PPLs eventually reimplement probabilistic LISP."

Daniel Roy, University of Toronto

Edward

edwardlib.org

  • Edward is a thin platform over TensorFlow

  • Edward = TensorFlow + Random Variables + Inference Algorithms

A brief diversion into TensorFlow

  • TensorFlow is a computational graph framework

  • Graph nodes are operations: arithmetic, looping, indexing etc.

  • Graph edges are tensors representing data communicated between nodes

  • Do everything in the computational graph: scale, distribute, GPU etc.

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, White Paper, M Abadi et al (2015)

Edward

Models

Edward's Random Variables are class objects with methods: log_prob(), sample(), summary statistics etc.

How to embed random variables into the computational graph?

  • A random variable \(\mathbf{x}\) wraps a TensorFlow tensor \(\mathbf{x^*}\), where \(\mathbf{x^*} \sim p(\mathbf{x})\) is a sample from the random variable object
  • Use this to define probabilistic models
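
A minimal NumPy sketch of the idea, assuming a hypothetical class NormalRV that is not part of Edward: the object carries its distribution parameters and methods, and also wraps one concrete sample that downstream computations can use like an ordinary array.

```python
import numpy as np


class NormalRV:
    """Toy stand-in for a random variable object (not Edward's implementation)."""

    def __init__(self, loc, scale, seed=0):
        self.loc = np.asarray(loc, dtype=float)
        self.scale = np.asarray(scale, dtype=float)
        rng = np.random.default_rng(seed)
        # The wrapped value x* ~ p(x): a single draw stored with the object.
        self.value = self.loc + self.scale * rng.standard_normal(self.loc.shape)

    def log_prob(self, x):
        """Elementwise log density of Normal(loc, scale) at x."""
        z = (np.asarray(x, dtype=float) - self.loc) / self.scale
        return -0.5 * z**2 - np.log(self.scale) - 0.5 * np.log(2.0 * np.pi)

    def mean(self):
        """Summary statistic: the mean of the distribution."""
        return self.loc


x = NormalRV(loc=np.zeros(3), scale=np.ones(3))
print(x.value.shape)              # the wrapped sample has the variable's shape
print(x.log_prob(np.zeros(3)))    # log density evaluated at the mean
```

In Edward the wrapped sample is a TensorFlow tensor living in the computational graph rather than a NumPy array; the sketch only mirrors the structure.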

Edward

Inference

  • Edward's Inference algorithms define a TensorFlow computational graph

  • Inference can be written as a collection of separate inference programs

  • Leverage TensorFlow optimisers, e.g. ADAM for Variational Inference; gradient structure for Hamiltonian Monte Carlo

Edward

Expressivity and Scalability

  • Allows composition of Models and of Inference methods for greater expressivity

  • Allows data subsampling for greater scalability

Edward Philosophy

George Edward Pelham Box's loop as a Design Principle

1.    Build a model of the science
2.    Infer the model given data
3.    Criticize the model given data (Posterior Predictive Checks)
Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei. 2016. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787.
Dustin Tran, Matthew D. Hoffman, Rif A. Saurous, Eugene Brevdo, Kevin Murphy, and David M. Blei. 2017. Deep Probabilistic Programming. International Conference on Learning Representations.

2-Layer Neural Network in TensorFlow

import tensorflow as tf

# Two tanh hidden layers followed by a linear output layer
def neural_network(x, W_0, W_1, W_2, b_0, b_1, b_2):
  h = tf.tanh(tf.matmul(x, W_0) + b_0)
  h = tf.tanh(tf.matmul(h, W_1) + b_1)
  h = tf.matmul(h, W_2) + b_2
  return tf.reshape(h, [-1])

2-Layer Neural Network in TensorFlow + Edward

import tensorflow as tf
from edward.models import Normal

def neural_network(x, W_0, W_1, W_2, b_0, b_1, b_2):
  h = tf.tanh(tf.matmul(x, W_0) + b_0)
  h = tf.tanh(tf.matmul(h, W_1) + b_1)
  h = tf.matmul(h, W_2) + b_2
  return tf.reshape(h, [-1])

# D is the input dimension and n_hidden the number of units per hidden layer.
# Standard normal priors over every weight matrix and bias vector:
W_0 = Normal(loc=tf.zeros([D, n_hidden]), scale=tf.ones([D, n_hidden]))
W_1 = Normal(loc=tf.zeros([n_hidden, n_hidden]), scale=tf.ones([n_hidden, n_hidden]))
W_2 = Normal(loc=tf.zeros([n_hidden, 1]), scale=tf.ones([n_hidden, 1]))
b_0 = Normal(loc=tf.zeros(n_hidden), scale=tf.ones(n_hidden))
b_1 = Normal(loc=tf.zeros(n_hidden), scale=tf.ones(n_hidden))
b_2 = Normal(loc=tf.zeros(1), scale=tf.ones(1))
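
To get a feel for what placing priors on the weights means, here is a NumPy sketch (deliberately avoiding TensorFlow and Edward) that draws whole sets of weights from independent standard normal priors and pushes the same inputs through the network. The sizes N and D and the helper sample_prior are assumptions made for illustration.

```python
import numpy as np


def neural_network(x, W_0, W_1, W_2, b_0, b_1, b_2):
    """NumPy mirror of the network: two tanh hidden layers, linear output."""
    h = np.tanh(x @ W_0 + b_0)
    h = np.tanh(h @ W_1 + b_1)
    h = h @ W_2 + b_2
    return h.reshape(-1)


rng = np.random.default_rng(0)
N, D, n_hidden = 5, 3, 50          # illustrative sizes
x = rng.standard_normal((N, D))


def sample_prior():
    """One draw of all weights and biases from independent N(0, 1) priors."""
    shapes = [(D, n_hidden), (n_hidden, n_hidden), (n_hidden, 1),
              (n_hidden,), (n_hidden,), (1,)]
    return [rng.standard_normal(s) for s in shapes]


# Each prior draw defines a different function; evaluate 10 of them on x.
draws = np.stack([neural_network(x, *sample_prior()) for _ in range(10)])
print(draws.shape)                 # (10, 5): 10 weight draws, 5 inputs
print(draws.std(axis=0))           # spread of outputs across weight draws
```

The spread across draws is the prior uncertainty about the network output; inference replaces the prior draws with draws from an (approximate) posterior.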

Inference in Edward

  • Let x and y be observed random variables, and x_train and y_train be the actually observed data
  • Let w_0, w_1, b_0 and b_1 be latent variables in the model
  • Let qw_0, qw_1, qb_0 and qb_1 be random variables defined to approximate the posterior
  • Bind model latent random variables to posterior random variables: {w_0: qw_0, b_0: qb_0, w_1: qw_1, b_1: qb_1}
  • Bind observed random variable to the training data: {x: x_train, y: y_train}

Inference in Edward

  • Select some inference algorithm (e.g. Variational Inference, Monte Carlo)
inference = ed.Inference(latent_vars={w_0: qw_0, b_0: qb_0, 
                                      w_1: qw_1, b_1: qb_1}, 
                         data={x: x_train, y: y_train})
  • This abstraction can represent a broad class of inference algorithms

Variational Inference

Recasting Bayesian inference as optimisation

  • Approximate the posterior \(p(\mathbf{w} \vert \mathbf{X, Y})\) with a simpler distribution from a variational family of distributions \(q_{\theta}(\mathbf{w})\) parameterised by \(\theta\)

  • Minimise the Kullback-Leibler (KL) divergence between approximate posterior \(q_{\theta}(\mathbf{w})\) and the actual posterior \(p(\mathbf{w} \vert \mathbf{X, Y})\) w.r.t. \(\theta\)

  • Mean Field approximation: factorise the approximating posterior \(q_{\theta}(\mathbf{w})\) into independent distributions for each weight layer

  • But minimising the KL divergence directly is hard, because of the intractable evidence term \(\log p(\mathbf{Y} \vert \mathbf{X})\)!
  • Minimising the KL divergence = maximising the Evidence Lower BOund (ELBO) w.r.t. \(\theta\)
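
The equivalence between minimising the KL divergence and maximising the ELBO follows from a standard identity, written here in the notation of this slide (the label \(\text{ELBO}(\theta)\) is shorthand added for illustration):

\[\log p(\mathbf{Y} \vert \mathbf{X}) = \underbrace{\mathbb{E}_{q_{\theta}(\mathbf{w})}[\log p(\mathbf{Y} \vert \mathbf{X}, \mathbf{w})] - \text{KL}(q_{\theta}(\mathbf{w}) \,\|\, p(\mathbf{w}))}_{\text{ELBO}(\theta)} + \text{KL}(q_{\theta}(\mathbf{w}) \,\|\, p(\mathbf{w} \vert \mathbf{X}, \mathbf{Y}))\]

The left-hand side does not depend on \(\theta\), so raising the ELBO must lower the KL divergence to the true posterior by the same amount. Since that KL divergence is non-negative, the ELBO is a lower bound on the log evidence.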

Variational Inference

Evidence Lower Bound

\[\text{ELBO} = \underbrace{\mathbb{E}_{q_{\theta}}[\log p(\mathbf{x} \vert \mathbf{z})]}_{\text{Expected Log Likelihood}} - \underbrace{ \text{KL}( q_{\theta}(\mathbf{z}) \,\|\, p(\mathbf{z})) }_{\text{Prior KL}}\]

  • Expected Log Likelihood: measures how well samples from approximate posterior \(q_{\theta}(\mathbf{z})\) explain data \(\mathbf{x}\) (reconstruction cost)
  • Prior KL: ensures the explanation of the data doesn't deviate too far from our prior beliefs (penalises complexity)
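
A toy check of the two terms, assuming a one-dimensional model chosen purely because everything is available in closed form: prior z ~ N(0, 1), likelihood x | z ~ N(z, 1), and a normal q with mean m and standard deviation s. None of these choices come from the slides.

```python
import numpy as np

x = 1.0  # a single observed data point (illustrative)


def kl_normal_to_std_normal(m, s):
    """Prior KL: KL( N(m, s^2) || N(0, 1) ) in closed form."""
    return 0.5 * (s**2 + m**2 - 1.0 - np.log(s**2))


def expected_log_lik(m, s, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E_q[log p(x | z)] with z ~ q = N(m, s^2)."""
    rng = np.random.default_rng(seed)
    z = m + s * rng.standard_normal(n_samples)
    log_lik = -0.5 * np.log(2.0 * np.pi) - 0.5 * (x - z) ** 2
    return log_lik.mean()


def elbo(m, s):
    return expected_log_lik(m, s) - kl_normal_to_std_normal(m, s)


# Exact log evidence for this model: marginally x ~ N(0, 2).
log_evidence = -0.5 * np.log(2.0 * np.pi * 2.0) - x**2 / 4.0

elbo_exact_q = elbo(0.5, np.sqrt(0.5))   # q equal to the true posterior N(x/2, 1/2)
elbo_poor_q = elbo(-1.0, 1.0)            # a deliberately poor q
print(log_evidence, elbo_exact_q, elbo_poor_q)
```

When q matches the true posterior the ELBO should agree with the log evidence up to Monte Carlo error; a poor q gives a strictly lower value.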

Variational Inference

Estimating the gradient

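  • Optimise the ELBO: need an unbiased estimator of its gradient
  • For normal distributions, the Prior KL term can be computed analytically; the expected log likelihood term is estimated by Monte Carlo sampling from \(q_{\theta}(\mathbf{z})\)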
  • No assumptions made about underlying model: 'black box' approach for scalable learning of complex models
  • Problem: high variance of Monte Carlo estimator of the gradient!
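
The variance problem can be seen in a toy setting. The sketch below assumes q(z) = N(m, 1) and a stand-in integrand f(z) = z^2 (both hypothetical), and compares the model-agnostic score-function estimator with a reparameterisation estimator, which needs f to be differentiable. It is not how Edward implements either estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1.0                      # variational parameter: q(z) = N(m, 1)
n_samples = 200_000
eps = rng.standard_normal(n_samples)
z = m + eps                  # samples from q


def f(z):
    return z**2              # toy stand-in for the ELBO integrand


# Target: d/dm E_q[f(z)] = d/dm (m^2 + 1) = 2m.

# Score-function ("black box") estimator: f(z) * d/dm log q(z) = f(z) * (z - m).
score_terms = f(z) * (z - m)

# Reparameterisation estimator: differentiate f(m + eps) w.r.t. m, giving 2 * (m + eps).
reparam_terms = 2.0 * (m + eps)

print(score_terms.mean(), reparam_terms.mean())   # both estimate 2m
print(score_terms.var(), reparam_terms.var())     # per-sample variances
```

Both estimators are unbiased for the same gradient, but the score-function terms have a much larger per-sample variance here, which is why variance reduction matters for black box methods.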

Black Box Variational Inference with Edward

  • Edward minimises the KL objective (equivalently, maximises the ELBO) by automatically selecting from a variety of black box inference techniques

  • Edward applies variance reduction to the Monte Carlo gradient estimator

  • Edward computes the Prior KL automatically when \(p(\mathbf{z})\) and \(q_{\theta}(\mathbf{z})\) are Normal

  • Allows Black Box Variational Inference to be implemented in a wide variety of models

Dropout as a Bayesian Approximation

Dropout

  • Deep neural networks: many parameters; very prone to overfitting; require large data sets

  • Dropout is an empirical technique used to avoid overfitting in neural networks

  • Dropout multiplies hidden activations by Bernoulli distributed random variables which take the value 1 with probability \(p\) and 0 otherwise

  • Randomly "drop out" hidden units and their connections during training time to prevent hidden units from co-adapting too much

  • Often yields significant improvements over other regularisation methods

  • Dropout: A Simple Way to Prevent Neural Networks from Overfitting - Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov - J. Machine Learning Research (2014)
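
A minimal NumPy sketch of the masking step, following the convention used here that \(p\) is the probability of keeping a unit. The function name dropout and the activation matrix h are illustrative.

```python
import numpy as np


def dropout(h, p, rng):
    """Multiply activations by Bernoulli(p) variables; p is the keep probability."""
    mask = rng.binomial(1, p, size=h.shape)   # 1 with probability p, else 0
    return h * mask, mask


rng = np.random.default_rng(0)
h = np.ones((1000, 50))                        # hypothetical hidden activations
h_dropped, mask = dropout(h, p=0.8, rng=rng)
print(mask.mean())                             # fraction of units kept, near 0.8
```

Many implementations also rescale the surviving activations by 1/p so that their expected value is unchanged; that step is left out to match the description above.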

Dropout Objective

\[\mathcal{L}_{\text{dropout}} = \underbrace{ \frac{1}{N} \sum_{i=1}^{N} E(\mathbf{y}_{i}, \mathbf{\hat{y}}_{i})}_{\text{Loss function}} + \underbrace{ \lambda \sum_{i=1}^{L} (\|\mathbf{W}_{i} \|^2 + \| \mathbf{b}_{i} \|^2 )}_{L_2 \text{ regularisation with weight decay } \lambda}\]
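
The objective can be evaluated directly. This sketch assumes the loss \(E\) is the squared error, which the slide leaves unspecified, and uses made-up targets, predictions and weights.

```python
import numpy as np


def dropout_objective(y, y_hat, weights, biases, lam):
    """Mean loss over N data points plus an L2 penalty summed over L layers.

    Assumes E is the squared error; y_hat comes from a forward pass with dropout.
    """
    data_term = np.mean(np.sum((y - y_hat) ** 2, axis=-1))
    l2_term = lam * sum(np.sum(W**2) + np.sum(b**2)
                        for W, b in zip(weights, biases))
    return data_term + l2_term


y = np.array([[1.0], [2.0]])
y_hat = np.array([[1.5], [2.0]])              # hypothetical network outputs
weights = [np.ones((2, 2)), np.ones((2, 1))]  # L = 2 layers
biases = [np.zeros(2), np.ones(1)]
loss = dropout_objective(y, y_hat, weights, biases, lam=0.1)
print(loss)
```

Here the data term is 0.125 and the penalty is 0.1 times 7, so the two parts of the objective can be checked by hand.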

How is Dropout in non-probabilistic neural nets connected to variational inference in Bayesian neural nets?

  • Gal & Ghahramani: Reparameterise the approximate variational distribution \(q_{\theta}(\mathbf{w})\) to be non-Gaussian (Bernoulli)

  • Factorise the distribution for each neural network weight row in each weight matrix

  • Now compare dropout objective to ELBO objective using this reparameterised \(q_{\theta}(\mathbf{w})\)

Dropout as a Bayesian Approximation

  • Gal & Ghahramani: Approximate ELBO and Dropout NN objectives are identical (given this choice of approximating reparameterised posterior \(q_{\theta}(\mathbf{w})\) and normal priors over network weights)
  • Optimising any neural network with dropout is equivalent to a form of approximate Bayesian inference
  • A network trained with dropout already is a Bayesian Neural Network!

Dropout as a Bayesian Approximation

Practical Application

  • Gal and Ghahramani's method (MC Dropout) requires applying dropout at every weight layer at test time

  • For input \(\mathbf{x}^{*}\) the predictive distribution for output \(\mathbf{y}^*\) is \[q(\mathbf{y}^{*}\vert \mathbf{x}^{*}) = \int p(\mathbf{y}^{*}\vert \mathbf{x}^{*}, \mathbf{w}) q(\mathbf{w} \vert \mathbf{X, Y} ) \, d\mathbf{w}\]

  • MC Dropout averages over \(T\) forward passes through the network at test time

  • To estimate the predictive mean and predictive uncertainty, simply collect the results of stochastic forward passes through an existing neural network already trained with dropout!
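
A NumPy sketch of the procedure, assuming a network with a single hidden layer and random weights standing in for trained ones. The function names, sizes and keep probability are illustrative, and the sample variance below leaves out the model precision term that Gal & Ghahramani add to their predictive variance.

```python
import numpy as np


def stochastic_forward(x, W_0, b_0, W_1, b_1, p, rng):
    """One forward pass with dropout left switched on (p = keep probability)."""
    h = np.tanh(x @ W_0 + b_0)
    h = h * rng.binomial(1, p, size=h.shape)   # Bernoulli mask on hidden units
    return (h @ W_1 + b_1).reshape(-1)


def mc_dropout_predict(x, params, p, T, seed=0):
    """Predictive mean and sample variance from T stochastic forward passes."""
    rng = np.random.default_rng(seed)
    passes = np.stack([stochastic_forward(x, *params, p, rng) for _ in range(T)])
    return passes.mean(axis=0), passes.var(axis=0)


# Hypothetical "already trained" weights; random here purely for illustration.
init = np.random.default_rng(1)
params = (init.standard_normal((3, 50)), np.zeros(50),
          init.standard_normal((50, 1)), np.zeros(1))
x_star = init.standard_normal((4, 3))

mean, var = mc_dropout_predict(x_star, params, p=0.9, T=100)
print(mean.shape, var.shape)   # (4,) (4,)
```

Setting p=1.0 switches dropout off, so every pass is identical and the variance collapses to zero: the point estimate of a standard network is recovered.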

MC Dropout Experiments

Highly competitive results for

  • Convolutional Neural Networks: Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference, Y Gal, Z Ghahramani, ICLR 2016

  • Recurrent Neural Networks: A Theoretically Grounded Application of Dropout in Recurrent Neural Networks, Y Gal, Z Ghahramani, NIPS 2016

  • Reinforcement Learning: Improving PILCO with Bayesian Neural Network Dynamics Models, Y Gal, R McAllister, C Rasmussen, ICML 2016

http://mlg.eng.cam.ac.uk/yarin/publications.html

Experiments

MC Dropout versus Black Box Variational Inference

  • Feedforward Neural Networks with 2 hidden layers and 50 units per layer

  • Same benchmark data sets and architecture as Hernandez-Lobato & Adams (2015) and Gal & Ghahramani (2016)

Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks, J M Hernandez-Lobato, R P Adams, ICML 2015
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, Y Gal, Z Ghahramani, ICML 2016

MC Dropout

Results

Model-Specific versus Black Box

  • Black Box Variational Inference: aims for maximum flexibility and model independence

  • Edward uses mean field VI: factorises the approximating posterior \(q(\mathbf{w})\) into independent distributions for each latent variable: \(q(\mathbf{w}) = \prod_{i=1}^{l} q_i(w_i)\)

  • Mean Field VI cannot capture posterior dependencies between network weight layers
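
The cost of ignoring dependencies can be illustrated on a two-dimensional Gaussian, used here as a stand-in for a correlated posterior (the correlation 0.9 is an arbitrary choice). For a Gaussian target, the factorised Gaussian minimising KL(q || p) is known in closed form.

```python
import numpy as np

rho = 0.9
cov = np.array([[1.0, rho],
                [rho, 1.0]])          # "true" posterior covariance (illustrative)
precision = np.linalg.inv(cov)

# Standard result: the optimal factorised Gaussian has variances 1 / precision_ii
# and, by construction, no off-diagonal (correlation) terms.
mean_field_var = 1.0 / np.diag(precision)

print(np.diag(cov))        # true marginal variances
print(mean_field_var)      # mean-field variances, equal to 1 - rho**2 here
```

The factorised approximation both drops the correlation and underestimates the marginal variances, and the effect grows as the correlation gets stronger.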

Model-Specific versus Black Box

  • MC Dropout: joint distribution over each layer's weights is multimodal (mixtures of delta peaks)

  • For Bayesian Neural Networks, including correlations between layers improves results: see Louizos & Welling (2016)

Current Research in Variational Inference

  • Better variational posterior approximations: Normalising Flows, Hierarchical Variational Models etc.
    [Rezende & Mohamed 2015, Ranganath, Tran & Blei 2016, Louizos & Welling 2016]
  • Lower variance ELBO estimators:
    [G Roeder et al 2017, C Naesseth et al 2017, J Sakaya et al 2017]

Thanks for listening!

References

  • Structured and Efficient Variational Deep Learning with Matrix Posteriors - C Louizos & M Welling. arXiv:1603.04733 (2016)

  • Multiplicative Normalizing Flows for Variational Bayesian Neural Networks - C Louizos & M Welling. arXiv:1703.01961 (2017)

  • Hierarchical variational models, R. Ranganath, D. Tran, and D. M. Blei. ICML 2016.

  • Black box Variational Inference. R. Ranganath, S. Gerrish, and D. M. Blei. Artificial Intelligence and Statistics, 2014.

  • Variational inference with normalizing flows, D J Rezende, S Mohamed, ICML 2015.

  • Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms - C Naesseth, F Ruiz, S Linderman, D M Blei (2017)

  • Importance Sampled Stochastic Optimization for Variational Inference, J Sakaya, A Klami, arXiv:1704.05786 (2017)

  • Sticking the Landing: An Asymptotically Zero-Variance Gradient Estimator for Variational Inference, G Roeder, Y Wu, D Duvenaud (2017)

  • Edward: A library for probabilistic modeling, inference, and criticism, D Tran, A Kucukelbir, A B Dieng, M Rudolph, D Liang, and D M Blei. arXiv:1610.09787.

  • Deep Probabilistic Programming. D Tran, M Hoffman, R A Saurous, E Brevdo, K Murphy, and D M Blei, ICLR 2017

  • Practical Variational Inference for Neural Networks, A Graves, NIPS 2011.

  • Auto-Encoding Variational Bayes, D P. Kingma, M Welling, CoRR abs/1312.6114, 2013.

  • Stochastic Backpropagation and Approximate Inference in Deep Generative Models, D J Rezende, S Mohamed, and D Wierstra, ICML 2014.

  • Weight uncertainty in neural networks, C Blundell, J Cornebise, K Kavukcuoglu, and D Wierstra, ICML 2015.

  • Probabilistic backpropagation for scalable learning of Bayesian neural networks, J M Hernandez-Lobato, R P Adams, ICML 2015

  • Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning, Y Gal, Z Ghahramani, ICML 2016

  • Uncertainty in Deep Learning, Y Gal, PhD Thesis, University of Cambridge (2016)

  • Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference, Y Gal, Z Ghahramani, ICLR 2016

  • A Theoretically Grounded Application of Dropout in Recurrent Neural Networks, Y Gal, Z Ghahramani, NIPS 2016

  • Improving PILCO with Bayesian Neural Network Dynamics Models, Y Gal, R McAllister, C Rasmussen, ICML 2016