Sampling the Imaginary

Using Bayesian Models for Understanding and Prediction
February 14, 2024

Using Bayesian Models for Understanding

Posterior Distributions

https://observablehq.com/embed/63e653806da0caed@933?cell=*

Why sample from the posterior?

Binomial model is 1-dimensional and relatively simple.
Thus, can compute posterior explicitly (even analytically!)
Imagine a model that has 10 parameters and suppose we wanted a resolution of 100 grid points per parameter.

Discuss: How many values would we have in our grid approximation?

Answer: \(100^{10}\)

How to sample from the posterior

Compute posterior

Suppose we flip the world 9 times and get 6 waters.

p_grid <- seq( from=0, to=1, length.out=1000 )
prob_p <- rep( 1, 1000 )
prob_data <- dbinom( 6, size=9, prob=p_grid )
posterior <- prob_data * prob_p
posterior <- posterior / sum(posterior)

How to sample from the posterior

Sample from posterior

We can use the sample command to draw random parameters from the posterior in proportion to their probability.

samples <- sample( p_grid, prob=posterior, size=1e4, replace=TRUE )

How to sample from the posterior

Sample from posterior

We can use the sample command to draw random parameters from the posterior in proportion to their probability.

How to sample from the posterior

Sample from posterior: Plot sample values

plot( samples )

How to sample from the posterior

Sample from posterior: Plot sample values

Add the maximum a posteriori (MAP) estimate (i.e. 6/9).

MAP is similar to maximum likelihood estimate (MLE).

How to sample from the posterior

Sample from posterior: Plot sample distribution

library(rethinking)
dens( samples )

Note: Only need to call library(rethinking) once per R session!

How to sample from the posterior

Sample from posterior: Plot sample distribution

Add analytical posterior to check.

Summarizing posterior: Interval estimates

Intervals of defined boundaries

https://observablehq.com/embed/63e653806da0caed@933?cell=*

Summarizing posterior: Interval estimates

Intervals of defined boundaries

If we have the grid-approximate posterior (posterior), we can calculate Pr[A \(\leq\) p \(\leq\) B]. For example, if A = 0 and B = 0.5:

(area <- sum( posterior[ 0 <= p_grid & p_grid <= 0.5 ] ))

[1] 0.1718746

So, Pr[p \(\leq\) 0.5] = 0.1718746.

Summarizing posterior: Interval estimates

Intervals of defined boundaries

Pr[p \(\leq\) 0.5] = 0.1718746

Summarizing posterior: Interval estimates

Intervals of defined boundaries

If we have samples from the posterior (samples), we can calculate Pr[A \(\leq\) p \(\leq\) B] as well. For example, if A = 0 and B = 0.5:

sum( samples <= 0.5 ) / 1e4

[1] 0.1725

Since samples appear in proportion to their probability, we only need to count the number of samples less than 0.5 and divide by the sample size.

Summarizing posterior: Interval estimates

Intervals of defined boundaries

(area <- sum( 0.6 <= samples & samples <= 0.7 ) / 1e4)

[1] 0.2617

Pr[0.6 \(\leq\) p \(\leq\) 0.7] = 0.2617

Summarizing: Interval estimates

Intervals of defined mass

Definition: An interval [A,B] with posterior probability mass \(x\) is called a (100\(x\))% credible interval or (100\(x\))% compatibility interval, given by Pr[A \(\leq p \leq\) B] = \(x\).

3 special types:

Quantiles: When A = 0, B is called the \(x\) quantile.
Percential Intervals (PI): Equal probability mass in each tail.
Highest Posterior Density Interval (HPDI): Narrowest interval containing specified probability mass.

Summarizing: Interval estimates

Intervals of defined mass

https://observablehq.com/embed/63e653806da0caed@933?cell=*

Summarizing: Interval estimates

Intervals of defined mass

From samples, we can compute quantiles using the quantile function in R.

quantile( samples, 0.8 )

      80% 
0.7627628

Summarizing: Interval estimates

Intervals of defined mass

We can also compute percentile intervals. For example, the 80% PI is:

quantile( samples, c(0.1, 0.9) )

      10%       90% 
0.4474474 0.8128128

Summarizing: Interval estimates

Highest Posterior Density Interval (HPDI): Narrowest interval containing specified probability mass

Let’s look at different data.

https://observablehq.com/embed/63e653806da0caed@933?cell=*

Summarizing: Interval estimates

Generating posterior and samples

p_grid <- seq( from=0, to=1, length.out=1000 )
prob_p <- rep( 1, 1000 )
prob_data <- dbinom( 3, size=3, prob=p_grid )
posterior <- prob_data * prob_p
posterior <- posterior / sum(posterior)
samples <- sample( p_grid, size=1e4, replace=TRUE, prob=posterior )

Summarizing: Interval estimates

Take a peak at samples

plot( samples )

Summarizing: Interval estimates

Take a peak at distribution

dens( samples )

Summarizing: Interval estimates

Compute 50% HPDI

HPDI( samples , prob=0.5 )

     |0.5      0.5| 
0.8378378 0.9989990

Summarizing: Interval estimates

Code summary using samples

p-th quantile:

quantile( samples, p )

100*p-th Percentile Intervals (PI):

quantile( samples, c( (1-p)/2, (1+p)/2 ) )

100*p-th HPDI Interval:

HPDI( samples, p )

Summarizing posterior: Point estimates

https://observablehq.com/embed/63e653806da0caed@933?cell=*

Summarizing: Point estimates

Definition: The maximum a posteriori (MAP) estimate is the value of the random variable that maximizes the posterior distribution.

Definition: The median is 0.5 quantile (50% percentile).

Definition: The mean is the average of the posterior P, written as \[ \int p\mathrm{P}(p)\,dp. \]

Summarizing: Point estimates

Computing MAP estimate from posterior

Useful in Homework #4.

p_grid[ which.max(posterior) ]

[1] 1

In general, we will not use this technique. We will instead compute estimates from samples from the posterior.

Summarizing: Point estimates

Code summary using samples

MAP estimate:

chainmode( samples, adj=0.01 )

[1] 0.97785

Mean:

mean( samples )

[1] 0.7998607

Median:

median( samples )

[1] 0.8388388

Summarizing posterior: Final thoughts

In Bayesian Inference, the entire posterior distribution is your estimate! It includes all of the uncertainty in your inference.
Summarizing using intervals or point estimates severely limits the power of Bayesian and introduces certainty where none exists (i.e. throws away uncertainty).
Best practice is to use entire posterior when possible.

Using Bayesian Models for Prediction

Training the Model for Prediction

When using the model for inference, the posterior probability is the object of interest (distribution; point and interval estimates).
When using the model for prediction, the simulated samples from the trained model is of interest.
The trained model is simply the statistical model with the posterior probability trained from the data.

Unconstrained Model Parameters

Before training

import {aq, op} from "@uwdata/arquero" // JavaScript dplyr
import {vl} from "@vega/vega-lite-api-v5" // JavaScript ggplot2

ss = require('simple-statistics')

viewof binomial_par = Inputs.range(
  [0, 1], 
  {value: 0.7, step: 0.1, label: "Prop"}
)
N = 7

dtd = aq.table({
  x: Array(N+1).fill().map((element, index) => index),
  y: ss.binomialDistribution(N, binomial_par)
})

vl.markBar().data(dtd).encode(
  vl.x().fieldN('x').scale({domain: dtd.array('x')}).axis({title: "Number of successes"}),
  vl.y().fieldQ('y').axis({title: "Probability"})
)
.width(600)
.height(400)
.render();

Constrained Model Parameters

After training

Let’s go back to the original model with 6 waters in 9 tosses, assuming a uniform prior. The resulting posterior distribution from the trained model is:

Predicting Future Data

Definition: A Posterior Predictive Distribution is an average data distribution, averaged over values of the posterior.

\[ \mathrm{Pr}[x \, | \, \mathrm{posterior}] = \int_{0}^{1} \mathrm{Pr}[p \, | \, \mathrm{data}]\left(\begin{array}{c}n \\ x\end{array}\right)p^{x}(1-p)^{n-x}\, dp \]

For fixed \(x\), this is a weighted average using the posterior probabilities as weights. Note that:

\[ \int_{0}^{1} Pr[p \, | \, \mathrm{data}]\, dp = 1. \]

Predicting Future Data

Posterior Predictive Distributions

Predicting Future Data

Posterior Predictive Distributions

https://observablehq.com/embed/894758716da3f579@1064?cell=*

Predicting Future Data

Posterior Predictive Distributions

Question: How to sample from a posterior predictive distribution?

Two step process:

Step 1. Sample a parameter from the posterior using sample:

(p <- sample(p_grid, prob=posterior, size=1, replace=TRUE))

[1] 0.7537538

Step 2. Use this parameter to sample from a binomial distribution using rbinom:

(data <- rbinom(n=1, size=9, prob=p))

[1] 8

Note: must specify desired data type, i.e. number of trials \(n\) (e.g. \(n=9\) in code above).

Predicting Future Data

Posterior Predictive Distributions

For multiple samples, can use the previous samples vector (recreating for clarity):

samples <- sample( p_grid, prob=posterior, size=1e4, replace=TRUE )
data <- rbinom( n=length(samples), size=9, prob=samples )

Let’s take a look at the resulting approximate posterior predictive distribution:

simplehist(data)

Posterior Predictive Check

It is usually a good idea to check to see how well the model predicts the data you used to train it. It SHOULD be a good fit, as long as

the prior is not too constrained, and
the Bayesian inference machine was coded correctly.

Posterior Predictive Check

So how likely is 6 waters out of 9 (our training data) according to the posterior predictive distribution?

Prediction intervals

Discuss: Where do I predict 89% of future data will be?

Definition: Prediction intervals are percentile intervals of a posterior predictive distribution.

quantile(data, c((1-0.89)/2, (1+0.89)/2))

 5.5% 94.5% 
    3     8