class: center, middle, inverse, title-slide # Obtaining Uncertainty Estimates from Neural Networks Using TensorFlow Probability ### Sigrid Keydana ### Predictive Analytics World 2019 --- class: inverse, middle, center # Neural networks are awesome ... --- # Deep learning yields excellent results in e.g. <br /> - Image recognition (classification, segmentation, object detection) - Natural language processing (language models, translation, generation) - Domain modeling (embeddings) - Entity generation ("fakes"): Faces, text ... - Tasks involving tabular data (e.g., recommender systems) - Playing games (deep reinforcement learning) --- class: inverse, middle, center # So what are we missing? --- # Networks output _point estimates_ <br /> - A single numeric prediction - A single segmentation mask - A single translation - A single embedding <br /> But wait ... aren't there probabilities in there, _somewhere_? --- # Name that animal <figure> <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/p1.png"/> <figcaption><br />Image: Ben Tubby [CC BY 2.0 (https://creativecommons.org/licenses/by/2.0)] https://upload.wikimedia.org/wikipedia/commons/3/30/Falkland_Islands_Penguins_35.jpg</figcaption> </figure> Let's ask the network ... --- # What network? ... TensorFlow, but from R! <figure> <img src="https://raw.githubusercontent.com/skeydan/uncertainty_neural_networks/master/dlwr.jpg" style = "float:left;" width= "35%"> </figure> <figure> <img src="https://raw.githubusercontent.com/skeydan/uncertainty_neural_networks/master/r-tf.png" style = "float:right;" width= "45%"> </figure> --- # R-TensorFlow <br /> - [R bindings to TensorFlow](https://tensorflow.rstudio.com/) (subsumes Keras) - TensorFlow-related infrastructure... - [Dataset preprocessing pipeline](https://github.com/rstudio/tfdatasets) - [TensorFlow datasets](https://github.com/rstudio/tfds) - [tfhub](https://github.com/rstudio/tfhub) - and more ... - [_r-tensorflow_ ecosystem](https://github.com/r-tensorflow): - Model implementations, e.g. U-Net, WaveNet ... - User-contributed packages for automation, tuning, ... - Blog: https://blogs.rstudio.com/tensorflow/ --- # With R-Keras, let's ask a pretrained model... ```r library(tensorflow) library(keras) model <- application_mobilenet_v2() probs <- model %>% predict(image) imagenet_decode_predictions(probs) ``` <br /> ``` class_name class_description score 1 n02056570 king_penguin 0.9899597168 2 n01847000 drake 0.0011793933 3 n01798484 prairie_chicken 0.0002387235 4 n02058221 albatross 0.0002117234 5 n02071294 killer_whale 0.0001432021 ``` --- # Let's do another one <figure> <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/p2.png"/> <figcaption><br />Image: M. Murphy [Public domain]<br /> https://upload.wikimedia.org/wikipedia/commons/2/22/RoyalPenguins3.JPG</figcaption> </figure> --- # So what is that? ```r probs <- model %>% predict(image) imagenet_decode_predictions(probs) ``` <br /> ``` class_name class_description score 1 n02051845 pelican 0.24614049 2 n02009912 American_egret 0.18564136 3 n02058221 albatross 0.06848499 4 n02012849 crane 0.04572001 5 n02009229 little_blue_heron 0.03902744 ``` <br /> ### Hm... what really is that _score_? --- # Stepping back: What's a neural network? <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/playground.png" width = "70%"/> --- # Zooming in: Weights and activations <figure> <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/perceptron.png" width = "60%"> <figcaption>Source: https://en.wikipedia.org/wiki/Perceptron</figcaption> </figure> --- # Activation functions in regression ```r model <- keras_model_sequential() %>% layer_dense(units = 32, activation = "relu", input_shape = 7) %>% # default activation = linear layer_dense(units = 1) ``` <figure> <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/relu.png" width = "60%"> <figcaption><b>ReLU</b> activation</figcaption> </figure> --- # Sigmoid activation for binary classification ```r model <- keras_model_sequential() %>% layer_dense(units = 32, activation = "relu", input_shape = 7) %>% layer_dense(units = 1, activation = "sigmoid") ``` <figure> <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/sigmoid.png" width = "60%"> <figcaption><b>Sigmoid</b> activation</figcaption> </figure> --- # Softmax activation for multiple classification ```r model <- keras_model_sequential() %>% layer_dense(units = 32, activation = "relu", input_shape = 7) %>% layer_dense(units = 10, activation = "softmax") ``` <br /> <figure> <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/softmax_pre.png" style = "float:left;" width= "50%"> </figure> <figure> <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/softmax_post.png" style = "float:right;" width= "50%"> </figure> Before and after __softmax__ activation. --- # So our "probability" is just a maximum likelihood estimate ... <br /> ... how can we make this probabilistic? --- class: inverse, middle, center # Indirect ways: ensembling (model averaging), dropout ... --- # Dropout uncertainty (Gal & Ghahramani, 2016) <sup>1</sup> <br /> - _dropout_: stochastic removal of units in hidden (or input) layers - habitually used as a regularization measure during training - idea: use at test time and average predictions <br /> ## Question: How to choose the dropout rate? .footnote[ [1] Gal and Ghahramani (2016): Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning ] --- # Dropout uncertainty, next level (Kendall & Gal, 2017) <sup>1</sup><sup>2</sup> <br /> - let the network learn the dropout rate - let the network learn the variance (account for _heteroscedasticity_) - independently calculate _epistemic_ and _aleatoric_ uncertainty .footnote[ [1] Kendall & Gal (2017): What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? [2] Gal, Hron, & Kendall (2017): Concrete Dropout. ] --- # Types of uncertainty: aleatoric vs. epistemic<sup>1</sup> - aleatoric: - variation due to measurement process, irreducible - "known unknowns" - epistemic: - insufficiency in the model; reducible by getting more data - "unknown unknowns" .footnote[ [1] Hora, S. (1996) Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management. ] --- # Yet more terminology? How does this matter? <br /> - Circle in what you don't know ... - get more data - try other models / model families - try getting more context - Deal with unavoidable risk (safety margins) - include complementary methods - keep in mind in subsequent decision-making --- # Assessing both types of uncertainty, the Gal et al. way <br /> - __Aleatoric__: Have network output not just predictive mean, but also _predictive variance_ - Learning different variances for different data points! - __Epistemic__: Use Monte Carlo sampling to calculate variance of predictions (that are inherently probabilistic due to use of dropout at test time) - Making use of learned dropout rate --- class: inverse, middle, center ## This is nice, but computationally intensive, and, well, _indirect_ ... --- class: inverse, middle, center # Enter the direct way: Bayesian networks! --- # The idea: Put distributions over the network's weights <figure> <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/weight_uncertainty.png" width = "120%"> <figcaption>Blundell et al. (2015): Weight Uncertainty in Neural Networks</figcaption> </figure> --- # The idea itself is all but new... <br /> - McKay (1992): A practical Bayesian framework for backprop networks - Hinton and van Camp (1993): Keeping neural networks simple by minimizing the description length of the weights - R. Neal (1996): Bayesian learning for neural networks <br /> ## ... but gained momentum rather recently: - Graves (2011): Practical Variational Inference for Neural Networks - Kingma et al. (2015): Variational dropout and the local reparameterization trick - Blundell et al. (2015): Weight uncertainty in neural networks --- # So how can we do this? <br /> [TensorFlow Probability](https://www.tensorflow.org/probability/) (Python library on top of TensorFlow) [tfprobability](https://rstudio.github.io/tfprobability/) (R package) <br /> ```r install.packages("tfprobability") library(tfprobability) install_tfprobability() ``` --- # tfprobability (1): Basic building blocks #### Distributions ```r d <- tfd_binomial(total_count = 7, probs = 0.3) d %>% tfd_mean() #> tf.Tensor(2.1000001, shape=(), dtype=float32) d %>% tfd_variance() #> tf.Tensor(1.47, shape=(), dtype=float32) d %>% tfd_log_prob(2.3) #> tf.Tensor(-1.1914139, shape=(), dtype=float32) ``` #### Bijectors ```r b <- tfb_affine_scalar(shift = 3.33, scale = 0.5) x <- c(100, 1000, 10000) b %>% tfb_forward(x) #> tf.Tensor([ 53.33 503.33 5003.33], shape=(3,), dtype=float32) ``` --- # tfprobability (2): Higher-level modules <br /> - Keras layers - Markov Chain Monte Carlo (Hamiltonian Monte Carlo, NUTS) - Variational inference - State space models - GLMs --- class: inverse, middle, center # How does that help us? --- # With TFP, neural network layers can be _distributions_ <br /> A network that has a multivariate normal distribution as output ```r model <- keras_model_sequential() %>% layer_dense(units = params_size_multivariate_normal_tri_l(d)) %>% layer_multivariate_normal_tri_l(event_size = d) log_loss <- function (y, model) - (model %>% tfd_log_prob(y)) model %>% compile(optimizer = "adam", loss = log_loss) model %>% fit( x, y, batch_size = 100, epochs = 1, steps_per_epoch = 10 ) ``` --- # Back to uncertainty... <br /> - If our network can output a distribution, we can have it learn the distribution's parameters - Choosing e.g. a normal distribution, we can train the network to learn the actual _spread_ in the data - Thus, we have found a way to estimate _aleatoric_ uncertainty --- # Example: Aleatoric uncertainty ```r model <- keras_model_sequential() %>% layer_dense(units = 8, activation = "relu") %>% layer_dense(units = 2, activation = "linear") %>% layer_distribution_lambda(function(x) tfd_normal(loc = x[, 1, drop = FALSE], scale = 1e-3 + tf$math$softplus(x[, 2, drop = FALSE]) ) ) negloglik <- function(y, model) - (model %>% tfd_log_prob(y)) model %>% compile( optimizer = optimizer_adam(lr = 0.01), loss = negloglik ) model %>% fit(x, y, epochs = 1000) ``` --- # The network has learned about variance in the data ```r yhat <- model(tf$constant(x_test)) mean <- yhat %>% tfd_mean() sd <- yhat %>% tfd_stddev() ``` <figure> <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/aleatoric.png" width = "80%"> </figure> --- # How about epistemic uncertainty? Use _tfprobability_ variational layers: - layer_dense_variational - layer_dense_flipout - layer_dense_reparameterization - layer_dense_local_reparameterization - layer_conv_nd_flipout (n = [1..3]) - layer_conv_nd_flipout (n = [1..3]) - layer_conv_nd_flipout (n = [1..3]) --- # Posterior weights are computed using variational inference <figure> <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/blei_vi.png" width = "80%"> <figcaption>Source: <a href="https://media.nips.cc/Conferences/2016/Slides/6199-Slides.pdf">David Blei, Rajesh Ranganath, Shakir Mohamed: Variational Inference:Foundations and Modern Methods. NIPS 2016 Tutorial·December 5, 2016.</a></figcaption> </figure> --- # Example: Epistemic uncertainty ```r model <- keras_model_sequential() %>% layer_dense_variational( units = 1, make_posterior_fn = posterior_mean_field, make_prior_fn = prior_trainable, kl_weight = 1 / n ) %>% layer_distribution_lambda( function(x) tfd_normal(loc = x, scale = 1) ) negloglik <- function(y, model) - (model %>% tfd_log_prob(y)) model %>% compile( optimizer = optimizer_adam(lr = 0.1), loss = negloglik ) model %>% fit(x, y, epochs = 1000) ``` --- # Now every prediction uses different _weight samples_ ```r yhats <- purrr::map(1:100, function(x) model(tf$constant(x_test))) ``` <figure> <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/epistemic.png" width = "80%"> </figure> --- # Aleatoric, epistemic ... why not have both? ```r model <- keras_model_sequential() %>% layer_dense_variational( units = 2, ) %>% layer_distribution_lambda(function(x) tfd_normal(loc = x[, 1, drop = FALSE], scale = 1e-3 + tf$math$softplus(0.01 * x[, 2, drop = FALSE]) ) ) yhats <- purrr::map(1:100, function(x) model(tf$constant(x_test))) means <- purrr::map(yhats, purrr::compose(as.matrix, tfd_mean)) %>% abind::abind() sds <- purrr::map(yhats, purrr::compose(as.matrix, tfd_stddev)) %>% abind::abind() ``` --- # Main challenge now is how to display ... Each line is one draw from the posterior weights; each line has its own standard deviation. <figure> <img src="https://raw.githubusercontent.com/skeydan/whyR2019/master/both.png" width = "70%"> </figure> More background: https://blogs.rstudio.com/tensorflow/posts/2019-06-05-uncertainty-estimates-tfprobability/ --- class: inverse, middle, center # Case study: Smartphone activity detection --- # Non-probabilistic cassification Nick Strayer, Classifying physical activity from smartphone data [https://blogs.rstudio.com/tensorflow/posts/2018-07-17-activity-detection/](https://blogs.rstudio.com/tensorflow/posts/2018-07-17-activity-detection/) <br > <figure> <img src="https://raw.githubusercontent.com/skeydan/uncertainty_neural_networks/master/measurements.png" width = "110%"> </figure> --- # Original confusion matrix <figure> <img src="https://raw.githubusercontent.com/skeydan/uncertainty_neural_networks/master/confusion_matrix.png" width = "80%"> </figure> --- # Goal: Port model to TensorFlow Probability <br /> - replace normal conv layers by _layer_conv_1d_flipout_ - replace deterministic output layer by _layer_one_hot_categorical_ - training objective: ELBO: minimize: - KL divergence between a (standard) multivariate normal prior and the approximate posterior - negative log probability of the data under the approximate posterior --- # Probabilistic model ```r model <- keras_model_sequential() model %>% layer_conv_1d_flipout( filters = 12, kernel_size = 3, activation = "relu", kernel_divergence_fn = kl_div ) %>% #... #... layer_dense_flipout( num_classes, kernel_divergence_fn = kl_div, ) %>% layer_one_hot_categorical(event_size = num_classes) nll <- function(y, model) - (model %>% tfd_log_prob(y)) ``` --- # Loss curve <figure> <img src="https://raw.githubusercontent.com/skeydan/uncertainty_neural_networks/master/losscurve.png" width = "110%"> </figure> --- # Uncertainties - __maxprob__: single-run highest per-obs mean of categorical distribution - __maxprob_sd__: standard deviation associated with single-run highest per-obs mean of categorical distribution - __maxprob_sd_mc__: standard deviation of means obtained from 100 prediction runs (weights are probabilistic!) ``` obs maxprob maxprob_sd maxprob_mc_sd predicted truth correct 1 1 0.997 0.0561 0.175 STAND_TO_SIT STAND_TO_S… TRUE 2 2 0.653 0.476 0.235 STAND_TO_SIT STAND_TO_S… TRUE 3 3 0.987 0.112 0.121 STAND_TO_SIT STAND_TO_S… TRUE 4 4 0.960 0.195 0.130 STAND_TO_SIT STAND_TO_S… TRUE 5 5 0.987 0.115 0.325 STAND_TO_SIT STAND_TO_S… TRUE 6 6 0.568 0.495 0.265 STAND_TO_SIT STAND_TO_S… TRUE 7 7 0.701 0.458 0.320 STAND_TO_SIT STAND_TO_S… TRUE 8 8 0.982 0.134 0.0651 STAND_TO_SIT STAND_TO_S… TRUE 9 9 0.589 0.492 0.235 STAND_TO_SIT STAND_TO_S… TRUE 10 10 0.546 0.498 0.114 STAND_TO_SIT STAND_TO_S… TRUE 11 11 0.798 0.401 0.220 STAND_TO_SIT STAND_TO_S… TRUE ... ``` --- # Are wrong classifications associated with higher uncertainty? ``` correct count avg_mean avg_sd avg_mc_sd FALSE 17 0.673 0.420 0.229 TRUE 52 0.814 0.314 0.178 ``` ``` truth predicted avg_mean avg_sd avg_mc_sd correct SIT_TO_LIE SIT_TO_LIE 0.904 0.240 0.136 TRUE LIE_TO_SIT LIE_TO_STAND 0.532 0.498 0.153 FALSE SIT_TO_STAND SIT_TO_STAND 0.904 0.238 0.165 TRUE STAND_TO_LIE STAND_TO_LIE 0.833 0.297 0.179 TRUE LIE_TO_STAND LIE_TO_STAND 0.651 0.470 0.191 TRUE STAND_TO_SIT STAND_TO_SIT 0.814 0.290 0.194 TRUE LIE_TO_SIT LIE_TO_SIT 0.682 0.448 0.196 TRUE LIE_TO_STAND LIE_TO_SIT 0.639 0.454 0.213 FALSE SIT_TO_LIE STAND_TO_SIT 0.692 0.462 0.240 FALSE SIT_TO_LIE STAND_TO_LIE 0.738 0.379 0.265 FALSE STAND_TO_LIE SIT_TO_LIE 0.813 0.282 0.299 FALSE ``` --- # Where to from here? <br /> - TensorFlow Probability is under rapid development - Follow the TensorFlow for R blog for updates: https://blogs.rstudio.com/tensorflow/ <br /> ## Thanks for your attention!