Sampling the Imaginary

By: Richard McElreath

1| Posterior Inference

Example: Let’s suppose there is a blood test that correctly identifies vampirism 95% of the time. Pr(+|vampire) = 0.95. It is a very accurate test, nearly always catching real life vampires. It makes mistakes in the form of false positives, 1% of the time, Pr(+|mortal)= 0.01. Finally, vampires are really rare, Pr(vampire) = 0.001. Suppose someone tests positive for vampirism. What is the probability that they are a bloodsucking immortal?

\[ \color{#00B9BE}{\Pr(vampire\mid +)} = \frac{ \color{#CA75FD}{\Pr(+\mid vampire)} \color{#78A900}{\Pr(vampire)} }{ \color{#F0726A}{\Pr(+)} } \]


Pr(+) = Pr(+|vampire) Pr(vampire) + Pr(+|mortal)(1-Pr(vampire))


positive_vampire <- 0.95
vampire <- 0.001
positive_mortal <- 0.01

positive <- positive_vampire*vampire+positive_mortal*(1-vampire)

vampire_positive <- (positive_vampire*vampire)/positive

vampire_positive*100
## [1] 8.683729

There is an 8.7% chance that the suspect is actually a vampire.

McElreath mentions he is not a fan of the example above for Bayesian Inference because it is distinguished by a broad view of probability and not by the use of Bayes Theorem. Frequencies of events? Bayes. Theoretical parameters? Bayesian Inference! And when you are just plugging numbers into Bayes’ Theorem it can make inference seem harder than it needs to be.

The posterior distribution is a probability distribution. Like all probability distributions, we can take samples from it and when taking samples from the posterior distribution – those samples are parameters!

This chapter teaches the basic skills for working with samples from the posterior distribution. The integrals of typical Bayesian context is the total probability of some interval – but once you have the samples from the probability distribution, it becomes just a matter of counting values in the interval. An empirical attack on the model posterior allows the scientist to ask and answer more questions about the model, without relying on a mathematician. For this reason, it is easier and more intuitive to work with samples from the posterior than to work with probabilities and integrals directly.

2| Samples from GAP

Grid Approximated Posterior (GAP)

p_grid <- seq(0, 1, length.out=1000) #make a grid with 1000 numbers that fall between 0 and 1. These values are the parameter values

probability_p <- rep(1, 1000) #create a list of 1000 ones! 

probability_data <- dbinom(6, 9, p_grid)#random generation for the binomial distribution dbinom()

#in teh dbinom above, the 6 is how many Waters we have, the 9 is how many times we spun the globe and wrote down a value, and the probability_grid is the probability! It's giving a ton of probablility values like in our grid search. 

posterior <- probability_data*probability_p

posterior <- posterior/sum(posterior) #posterior contains the posterior probabilities

samples <- sample(p_grid, prob=posterior, size=1e4, replace=T) #integers, prob is the weighted probability

plot(samples)

dens(samples)

3| Sampling to Summarize

Once your model produces a posterior distribution, the models work is done. Some common questions you can ask about your posterior distribution include:

  • How much posterior probability lies below some parameter value?

  • How much posterior probability lies between two parameter values?

  • Which parameter value marks the lower 5% of the posterior probability?

  • Which range of parameter values contains 90% of the posterior distribution?

  • Which parameter value has highest posterior probability?

AKA: intervals of defined boundaries, probability mass, and point estimates.

Boundaries

Question: What proportion of water is less than 0.5?

sum(posterior[p_grid < 0.5])
## [1] 0.1718746

Answer: About 17% of the posterior probability is below 0.5.

sum(samples < 0.5)/length(samples)
## [1] 0.174

This gets you about the same answer as sum(posterior[p_grid < 0.5]).

Question: How much of the probability lies between 0.5 and 0.75?

sum(samples > 0.5 & samples < 0.75)/length(samples)
## [1] 0.6022
lower <- 0.5
upper <- 0.75

d <- density(samples)
df <- data.frame(x = d$x, y = d$y)
df$in_interval <- df$x >= lower & df$x <= upper

p1 <- ggplot(df, aes(x, y)) +
  geom_line() +
  geom_area(data = subset(df, in_interval), fill="#E863E3", alpha = 0.6) +
  geom_vline(xintercept = c(lower, upper), linetype = 2) +
  labs(title = "50-75%",
       x = "proportion of water (p)", y = "Density")

Answer: About 61% of the posterior probability lies between 0.5 and 0.75.

Probability mass

It is most common to see scientific journals reporting intervals of defined mass, usually known as a confidence interval but since we are working with posterior probabilities it is called a credible interval. This textbook will refer to it as a

Compatibility Interval

Compatibility intervals indicate the range of parameter values compatible with the model and data.

Question: What are the boundaries of the lower 80% posterior probability?

quantile(samples, 0.8)
##       80% 
## 0.7597598
lowers <- 0.76

d <- density(samples)
df <- data.frame(x = d$x, y = d$y)
df$in_interval <- df$x <= lowers

p2 <- ggplot(df, aes(x, y)) +
  geom_line() +
  geom_area(data = subset(df, in_interval), fill="#8BD0D0", alpha = 0.6) +
  geom_vline(xintercept = c(lowers), linetype = 2) +
  labs(title = "80%",
       x = "proportion of water (p)", y = "Density")

Answer: About 76% of the posterior probability lies within the 80th percentile.

Question: What proportion of of the probability lies between 10th and 90th percentiles?

quantile(samples, c(0.1, 0.9))
##       10%       90% 
## 0.4474474 0.8138138
ups <- .808
lows <- .45

d <- density(samples)
df <- data.frame(x = d$x, y = d$y)
df$in_interval <- df$x >= lows & df$x <= ups

p3 <- ggplot(df, aes(x, y)) +
  geom_line() +
  geom_area(data = subset(df, in_interval), fill="#009FF7", alpha = 0.6) +
  geom_vline(xintercept = c(lows, ups), linetype = 2) +
  labs(title = "10-90%",
       x = "proportion of water (p)", y = "Density")

Answer: 0.81-0.45 = 0.36, About 36% of the posterior proportions lie between the 10th and 90th percentiles.

Equal Probability Mass

In the examples above, equal probability mass has been applied to each tail. Let’s see what happens when we don’t. This new posterior is consistent with having observed 3 W out of 3 tosses with a uniform (flat) prior. It is highly skewed, and has its max value at the boundary, p=1.

p_grid <- seq(0,1, length.out = 1000)

prior <- rep(1, 1000)

likelihood <- dbinom(3, 3, p_grid)

posterior <- likelihood*prior

posterior <- posterior/sum(posterior)

samples <- sample(p_grid, size=1000, replace = T, prob = posterior)

Question: What is the 50th percentile compatibility interval?

PI(samples, prob=0.5)
##       25%       75% 
## 0.7044545 0.9269269
lower <- 0.72
upper <- 0.93

d <- density(samples)
df <- data.frame(x = d$x, y = d$y)
df$in_interval <- df$x >= lower & df$x <= upper

p4 <- ggplot(df, aes(x, y)) +
  geom_line() +
  geom_area(data = subset(df, in_interval), fill="#DA8700", alpha = 0.6) +
  geom_vline(xintercept = c(lower, upper), linetype = 2) +
  labs(title = "50th% PI",
       x = "proportion of water (p)", y = "Density")

Answer: 0.72 - 0.93 is the 25th and 75th percentiles.

Question: What if you want to weight the tails differently because the data are skewed?

Answer: Use HPDI to get an interval that best represents the parameter values. The HPDI interval captures the parameters with the highest posterior probability, as well as being noticeably narrower.

HPDI(samples, 0.5)
##      |0.5      0.5| 
## 0.8348348 0.9979980
lower <- 0.72
upper <- 0.93

d <- density(samples)
df <- data.frame(x = d$x, y = d$y)
df$in_interval <- df$x >= lower & df$x <= upper

p5 <- ggplot(df, aes(x, y)) +
  geom_line() +
  geom_area(data = subset(df, in_interval), fill="#78A900", alpha = 0.6) +
  geom_vline(xintercept = c(lower, upper), linetype = 2) +
  labs(title = "50th% HPDI",
       x = "proportion of water (p)", y = "Density")

Answer: 0.82 - 0.99 is the 25th and 75th percentiles with the HPDI function.

Most of the time these two functions will be nearly identical. They were only so different because of how skewed the data we had were. When the posterior is bell-shaped, it doesn’t matter which function you use, PI or HPDI. HDPI is more computationally intensive and suffers from greater simulation variance (it is sensitive to how many samples you draw from the posterior)

Point estimates

The final common summary task for the posterior is to produce point estimates of some kind. Most of the time this summary is unnecessary and even at times, harmful.

Here is an example to consider. Suppose again the globe toss of 3 W out of 3 tosses. In the example below we will consider 3 possibilities:

  • Maximum posterior (MAP)

Point estimate: The mode

p_grid[which.max(posterior)]
## [1] 1
#or if you have samples already extracted from teh posterior you can use this code:

chainmode(samples, adj=0.01)
## [1] 0.9882442

Point estimate: The mean

mean(samples)
## [1] 0.7972683

Point estimate: The median

median(samples)
## [1] 0.8358358

If you must report a point estimate then you should do so using a loss function. The loss function you use is different depending on the point estimate you are reporting.

The loss is proportional to the distance of your decision to the true value. The point estimate that minimizes expected loss is actually the median.

Example Suppose we choose p = 0.05, then our expected loss is:

sum(posterior*abs(0.5-p_grid)) #this code computes the average loss where each loss is weighted by the posterior distribution. 
## [1] 0.3128752

There is a trick to computing the loss by using sapply function.

loss <- sapply(p_grid, function(d) sum(posterior*abs(d-p_grid)))

#now to find the parameter value that minimizes the loss: 

minLoss <- p_grid[which.min(loss)]

minLoss
## [1] 0.8408408
#and try median(sample) for comparison. (see above)

lowss <- minLoss   # your loss-minimizer on the p_grid

d <- density(samples)
df <- data.frame(x = d$x, y = d$y)


p6 <- ggplot(df, aes(x, y)) +
  geom_line() +
  geom_area(data = subset(df, x <= lowss), fill = "#ACD62A") +
  geom_area(data = subset(df, x >= lowss), fill = "#238188") +
  geom_vline(xintercept = lowss, color="#451C6D", linewidth=2, linetype=2) +
  labs(title = paste0("Median= ", round(lowss, 3)),
       x = "proportion of water (p)", y = "Density")+
  theme_classic()

The posterior median that splits the posterior density in half is 0.84.

The loss functions:

  • Absolute Value: used for the MEDIAN

  • Quadratic: used for the MEAN

loss_mean <- sapply(p_grid, function(d) sum(posterior*(d-p_grid)^2))

#now to find the parameter value that minimizes the loss: 

meanmin <- p_grid[which.min(loss_mean)]

meanmin
## [1] 0.8008008
p7 <- ggplot(df, aes(x, y)) +
  geom_line() +
  geom_area(data = subset(df, x <= meanmin), fill = "#EFABA6") +
  geom_area(data = subset(df, x >= meanmin), fill = "#DCD6D6") +
  geom_vline(xintercept = meanmin, color="#7C6D7E", linewidth=2, linetype=2) +
  labs(title = paste0("Mean= ", round(meanmin, 3)),
       x = "proportion of water (p)", y = "Density")+
  theme_classic()

Posterior Distributions

p1+p2+p3+p4+p5+p6+p7

4| Sampling to simulate prediction

In the globe tossing model, the dummy data arises from the binomial likelihood.

\(Pr(W|N, p) = \frac{N!}{W!(N-W)!}p^W(1-p)^{N-W}\)

Where \(W\) is the observed water count and \(N\) is the number of tosses. Suppose that \(N=2\) and let’s suppose p = 0.7. There are only three possible observations: 0 water landings, 1, water landing, or 2 water landings.

dbinom(0:2, 2, 0.7)
## [1] 0.09 0.42 0.49

If you toss the globe 2 times and Pr(W) = 0.7, then there is a 9% chance of observing 0 water, a 42% chance of observing 1 water, and a 49% chance of 2 waters.

Simulate observations

A| Dummy data

Using the probabilities above, let’s simulate some observations! You can use sample to do this again, or you can use r-base functions because they are quick and easy.

rbinom(10, size=2, prob=0.7) #generate 10 random data points from the binomial distribution
##  [1] 2 2 2 1 2 1 2 1 2 0
#let's generate a bunch! 

dummy_water <- rbinom(100000, 2, 0.7)

table(dummy_water)/100000
## dummy_water
##       0       1       2 
## 0.08959 0.42116 0.48925

Okay, let’s do different size and probabilities.

par(mfrow=c(2, 3))

dummy_w <- rbinom(100000, 9, 0.7)

simplehist(dummy_w, col="pink", xlab="dummy water count (size=9, prob=0.7)")
dummy_w_smallsamp <- rbinom(100000, 2, 0.7)

simplehist(dummy_w_smallsamp, col="#C183CC", xlab="dummy water count (size=2, prob=0.7)")


dummy_w_bigsamp <- rbinom(100000, 100, 0.7)

simplehist(dummy_w_bigsamp, col="#2EA5C9", xlab="dummy water count (size=100, prob=0.7)")

dummy_w_prob1 <- rbinom(100000, 100, 0.2)

simplehist(dummy_w_prob1, col="#8085A1", xlab="dummy water count (size=100, prob=0.2)")

dummy_w_prob2 <- rbinom(100000, 100, 0.99)

simplehist(dummy_w_prob2, col="#A4D336", xlab="dummy water count (size=100, prob=0.99)")



dummy_w_prob3 <- rbinom(100000, 20, 0.5)

simplehist(dummy_w_prob3, col="#DC343D", xlab="dummy water count (size=20, prob=0.5)")

B| Model checking

Here you ensure your model fitting worked correctly and you evaluate the adequacy of a model for some purpose.

C| Software check

Ask how well the model reproduces the data used to educate it. If there is no correspondence, the software is likely to blame. Implied predictions here are called, retrodictions.

Model adequacy

We need to combine sampling of simulated observations with sampling of the posterior distribution. We also want to carry our uncertainty forward while evaluating the implied predictions. To get the posterior predictive distribution we compute the sampling distribution of outcomes at each value of p and then average all of these predicted distributions together using the posterior probabilities of each value of p. 

To simulate predicted observations for a single value of p, say p=0.6, use rbinom to generate random binomial samples.

w <- rbinom(10000, 9, 0.6)

This generates 10,000 simulated predictions of 9 globe tosses assuming p=0.6. These predictions are stored in the vector as water counts. So the theoretical minimum is zero and the theoretical maximum is nine. You can use simplehist(w) to get a clean histogram of your simulated outcomes.

simplehist(w)

All you have to do to carry our uncertainty foreward is:

#replace the prob with posterior samples

head(samples)
## [1] 0.9069069 0.7487487 0.8978979 0.8708709 0.3793794 0.9639640
w <- rbinom(10000, 9, samples)

simplehist(w)

5| Practice

Problems 1-6 are using the samples from teh posterior distribution for the globe tossing example. This code will give you a specific set of samples, so you can check your answers.

p_grid <- seq( from=0 , to=1 , length.out=1000 )
prior <- rep( 1 , 1000 )
likelihood <- dbinom( 6 , size=9 , prob=p_grid )
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)
set.seed(100)
samples <- sample( p_grid , prob=posterior , size=1e4 , replace=TRUE )

3E1. How much posterior probability lies below p = 0.2? 3E2. How much posterior probability lies above p = 0.8? 3E3. How much posterior probability lies between p = 0.2 and p = 0.8? 3E4. 20% of the posterior probability lies below which value of p? 3E5. 20% of the posterior probability lies above which value of p? 3E6. Which values of p contain the narrowest interval equal to 66% of the posterior probability? 3E7. Which values of p contain 66% of the posterior probability, assuming equal posterior probability both below and above the interval?

#3e1
sum(samples < 0.2)/10000
## [1] 4e-04
#3e2
sum(samples > 0.8)/10000
## [1] 0.1116
#3e3
sum(samples > 0.2 & samples < 0.8)
## [1] 8880
#3e4
quantile(samples, 0.2)
##       20% 
## 0.5185185
sum(samples<0.52)/10000
## [1] 0.204
#3e5
quantile(samples, 0.8)
##       80% 
## 0.7557558
sum(samples>0.76)/10000
## [1] 0.1907
#3e6
HPDI(samples, 0.66)
##     |0.66     0.66| 
## 0.5085085 0.7737738
#3e7
PI(samples, prob=0.66)
##       17%       83% 
## 0.5025025 0.7697698

Medium.

3M1. Suppose the globe tossing data had turned out to be 8 water in 15 tosses. Construct the posterior distribution, using grid approximation. Use the same flat prior as before.

p_grid <- seq( from=0 , to=1 , length.out=1000)

prior <- rep( 1 , 1000)

likelihood <- dbinom( 8 , size=15 , prob=p_grid)

posterior <- likelihood * prior

posterior <- posterior / sum(posterior)

plot(posterior~p_grid, type="l")

3M2. Draw 10,000 samples from the grid approximation from above. Then use the samples to calculate the 90% HPDI for p.

samples <- sample( p_grid , prob=posterior , size=10000 , replace=TRUE)

#compute 90% hpdi

HPDI(samples, 0.9)
##      |0.9      0.9| 
## 0.3293293 0.7167167

3M3. Construct a posterior predictive check for this model and data. This means simulate the distribution of samples, averaging over the posterior uncertainty in p. What is the probability of observing 8 water in 15 tosses?

#using the same sample vector as 3m2

w <- rbinom(10000, 15, samples)

#compute proportion of these simulated observations that match 8

sum(w==8)/10000
## [1] 0.1444
simplehist(w)

3M4. Using the posterior distribution constructed from the new (8/15) data, now calculate the probability of observing 6 water in 9 tosses.

w <- rbinom(10000, 9, samples)
sum(w==6)/10000
## [1] 0.1751

3M5. Start over at 3M1, but now use a prior that is zero below p = 0.5 and a constant above p = 0.5. This corresponds to prior information that a majority of the Earth’s surface is water. Repeat each problem above and compare the inferences. What difference does the better prior make? If it helps, compare inferences (using both priors) to the true value p = 0.7.

p_grid <- seq( from=0 , to=1 , length.out=1000)

prior <- ifelse( p_grid < 0.5 , 0 , 1 )

likelihood <- dbinom(8 , size=15 , prob=p_grid)

posterior <- likelihood * prior

posterior <- posterior / sum(posterior)

samples <- sample( p_grid , prob=posterior , size=1e4 , replace=TRUE )

plot( posterior ~ p_grid , type="l" )

HPDI( samples , prob=0.9 )
##      |0.9      0.9| 
## 0.5005005 0.7117117
w <- rbinom(10000 , size=15 , prob=samples )
simplehist(w)