By: Richard McElreath
Example: Let’s suppose there is a blood test that correctly
identifies vampirism 95% of the time. Pr(+|vampire) = 0.95.
It is a very accurate test, nearly always catching real life vampires.
It makes mistakes in the form of false positives, 1% of the time,
Pr(+|mortal)= 0.01. Finally, vampires are really rare,
Pr(vampire) = 0.001. Suppose someone tests positive for
vampirism. What is the probability that they are a bloodsucking
immortal?
\[ \color{#00B9BE}{\Pr(vampire\mid +)} = \frac{ \color{#CA75FD}{\Pr(+\mid vampire)} \color{#78A900}{\Pr(vampire)} }{ \color{#F0726A}{\Pr(+)} } \]
Pr(+)
= Pr(+|vampire) Pr(vampire) +
Pr(+|mortal)(1-Pr(vampire))
positive_vampire <- 0.95
vampire <- 0.001
positive_mortal <- 0.01
positive <- positive_vampire*vampire+positive_mortal*(1-vampire)
vampire_positive <- (positive_vampire*vampire)/positive
vampire_positive*100
## [1] 8.683729
There is an 8.7% chance that the suspect is actually a vampire.
McElreath mentions he is not a fan of the example above for Bayesian Inference because it is distinguished by a broad view of probability and not by the use of Bayes Theorem. Frequencies of events? Bayes. Theoretical parameters? Bayesian Inference! And when you are just plugging numbers into Bayes’ Theorem it can make inference seem harder than it needs to be.
The posterior distribution is a probability distribution. Like all probability distributions, we can take samples from it and when taking samples from the posterior distribution – those samples are parameters!
This chapter teaches the basic skills for working with samples from the posterior distribution. The integrals of typical Bayesian context is the total probability of some interval – but once you have the samples from the probability distribution, it becomes just a matter of counting values in the interval. An empirical attack on the model posterior allows the scientist to ask and answer more questions about the model, without relying on a mathematician. For this reason, it is easier and more intuitive to work with samples from the posterior than to work with probabilities and integrals directly.
Grid Approximated Posterior (GAP)
p_grid <- seq(0, 1, length.out=1000) #make a grid with 1000 numbers that fall between 0 and 1. These values are the parameter values
probability_p <- rep(1, 1000) #create a list of 1000 ones!
probability_data <- dbinom(6, 9, p_grid)#random generation for the binomial distribution dbinom()
#in teh dbinom above, the 6 is how many Waters we have, the 9 is how many times we spun the globe and wrote down a value, and the probability_grid is the probability! It's giving a ton of probablility values like in our grid search.
posterior <- probability_data*probability_p
posterior <- posterior/sum(posterior) #posterior contains the posterior probabilities
samples <- sample(p_grid, prob=posterior, size=1e4, replace=T) #integers, prob is the weighted probability
plot(samples)
dens(samples)
Once your model produces a posterior distribution, the models work is done. Some common questions you can ask about your posterior distribution include:
How much posterior probability lies below some parameter value?
How much posterior probability lies between two parameter values?
Which parameter value marks the lower 5% of the posterior probability?
Which range of parameter values contains 90% of the posterior distribution?
Which parameter value has highest posterior probability?
AKA: intervals of defined boundaries, probability mass, and point estimates.
Question: What proportion of water is less than 0.5?
sum(posterior[p_grid < 0.5])
## [1] 0.1718746
Answer: About 17% of the posterior probability is below 0.5.
sum(samples < 0.5)/length(samples)
## [1] 0.174
This gets you about the same answer as
sum(posterior[p_grid < 0.5]).
Question: How much of the probability lies between 0.5 and 0.75?
sum(samples > 0.5 & samples < 0.75)/length(samples)
## [1] 0.6022
lower <- 0.5
upper <- 0.75
d <- density(samples)
df <- data.frame(x = d$x, y = d$y)
df$in_interval <- df$x >= lower & df$x <= upper
p1 <- ggplot(df, aes(x, y)) +
geom_line() +
geom_area(data = subset(df, in_interval), fill="#E863E3", alpha = 0.6) +
geom_vline(xintercept = c(lower, upper), linetype = 2) +
labs(title = "50-75%",
x = "proportion of water (p)", y = "Density")
Answer: About 61% of the posterior probability lies between 0.5 and 0.75.
It is most common to see scientific journals reporting intervals of defined mass, usually known as a confidence interval but since we are working with posterior probabilities it is called a credible interval. This textbook will refer to it as a
Compatibility Interval
Compatibility intervals indicate the range of parameter values compatible with the model and data.
Question: What are the boundaries of the lower 80% posterior probability?
quantile(samples, 0.8)
## 80%
## 0.7597598
lowers <- 0.76
d <- density(samples)
df <- data.frame(x = d$x, y = d$y)
df$in_interval <- df$x <= lowers
p2 <- ggplot(df, aes(x, y)) +
geom_line() +
geom_area(data = subset(df, in_interval), fill="#8BD0D0", alpha = 0.6) +
geom_vline(xintercept = c(lowers), linetype = 2) +
labs(title = "80%",
x = "proportion of water (p)", y = "Density")
Answer: About 76% of the posterior probability lies within the 80th percentile.
Question: What proportion of of the probability lies between 10th and 90th percentiles?
quantile(samples, c(0.1, 0.9))
## 10% 90%
## 0.4474474 0.8138138
ups <- .808
lows <- .45
d <- density(samples)
df <- data.frame(x = d$x, y = d$y)
df$in_interval <- df$x >= lows & df$x <= ups
p3 <- ggplot(df, aes(x, y)) +
geom_line() +
geom_area(data = subset(df, in_interval), fill="#009FF7", alpha = 0.6) +
geom_vline(xintercept = c(lows, ups), linetype = 2) +
labs(title = "10-90%",
x = "proportion of water (p)", y = "Density")
Answer: 0.81-0.45 = 0.36, About 36% of the posterior proportions lie between the 10th and 90th percentiles.
Equal Probability Mass
In the examples above, equal probability mass has been applied to each tail. Let’s see what happens when we don’t. This new posterior is consistent with having observed 3 W out of 3 tosses with a uniform (flat) prior. It is highly skewed, and has its max value at the boundary, p=1.
p_grid <- seq(0,1, length.out = 1000)
prior <- rep(1, 1000)
likelihood <- dbinom(3, 3, p_grid)
posterior <- likelihood*prior
posterior <- posterior/sum(posterior)
samples <- sample(p_grid, size=1000, replace = T, prob = posterior)
Question: What is the 50th percentile compatibility interval?
PI(samples, prob=0.5)
## 25% 75%
## 0.7044545 0.9269269
lower <- 0.72
upper <- 0.93
d <- density(samples)
df <- data.frame(x = d$x, y = d$y)
df$in_interval <- df$x >= lower & df$x <= upper
p4 <- ggplot(df, aes(x, y)) +
geom_line() +
geom_area(data = subset(df, in_interval), fill="#DA8700", alpha = 0.6) +
geom_vline(xintercept = c(lower, upper), linetype = 2) +
labs(title = "50th% PI",
x = "proportion of water (p)", y = "Density")
Answer: 0.72 - 0.93 is the 25th and 75th percentiles.
Question: What if you want to weight the tails differently because the data are skewed?
Answer: Use HPDI to get an interval that best represents
the parameter values. The HPDI interval captures the
parameters with the highest posterior probability, as well as being
noticeably narrower.
HPDI(samples, 0.5)
## |0.5 0.5|
## 0.8348348 0.9979980
lower <- 0.72
upper <- 0.93
d <- density(samples)
df <- data.frame(x = d$x, y = d$y)
df$in_interval <- df$x >= lower & df$x <= upper
p5 <- ggplot(df, aes(x, y)) +
geom_line() +
geom_area(data = subset(df, in_interval), fill="#78A900", alpha = 0.6) +
geom_vline(xintercept = c(lower, upper), linetype = 2) +
labs(title = "50th% HPDI",
x = "proportion of water (p)", y = "Density")
Answer: 0.82 - 0.99 is the 25th and 75th percentiles with the HPDI function.
Most of the time these two functions will be nearly identical. They
were only so different because of how skewed the data we had were. When
the posterior is bell-shaped, it doesn’t matter which function you use,
PI or HPDI. HDPI is more
computationally intensive and suffers from greater simulation variance
(it is sensitive to how many samples you draw from the posterior)
The final common summary task for the posterior is to produce point estimates of some kind. Most of the time this summary is unnecessary and even at times, harmful.
Here is an example to consider. Suppose again the globe toss of 3 W out of 3 tosses. In the example below we will consider 3 possibilities:
Point estimate: The mode
p_grid[which.max(posterior)]
## [1] 1
#or if you have samples already extracted from teh posterior you can use this code:
chainmode(samples, adj=0.01)
## [1] 0.9882442
Point estimate: The mean
mean(samples)
## [1] 0.7972683
Point estimate: The median
median(samples)
## [1] 0.8358358
If you must report a point estimate then you should do so using a loss function. The loss function you use is different depending on the point estimate you are reporting.
The loss is proportional to the distance of your decision to the true value. The point estimate that minimizes expected loss is actually the median.
Example Suppose we choose p = 0.05,
then our expected loss is:
sum(posterior*abs(0.5-p_grid)) #this code computes the average loss where each loss is weighted by the posterior distribution.
## [1] 0.3128752
There is a trick to computing the loss by using sapply
function.
loss <- sapply(p_grid, function(d) sum(posterior*abs(d-p_grid)))
#now to find the parameter value that minimizes the loss:
minLoss <- p_grid[which.min(loss)]
minLoss
## [1] 0.8408408
#and try median(sample) for comparison. (see above)
lowss <- minLoss # your loss-minimizer on the p_grid
d <- density(samples)
df <- data.frame(x = d$x, y = d$y)
p6 <- ggplot(df, aes(x, y)) +
geom_line() +
geom_area(data = subset(df, x <= lowss), fill = "#ACD62A") +
geom_area(data = subset(df, x >= lowss), fill = "#238188") +
geom_vline(xintercept = lowss, color="#451C6D", linewidth=2, linetype=2) +
labs(title = paste0("Median= ", round(lowss, 3)),
x = "proportion of water (p)", y = "Density")+
theme_classic()
The posterior median that splits the posterior density in half is 0.84.
The loss functions:
Absolute Value: used for the MEDIAN
Quadratic: used for the MEAN
loss_mean <- sapply(p_grid, function(d) sum(posterior*(d-p_grid)^2))
#now to find the parameter value that minimizes the loss:
meanmin <- p_grid[which.min(loss_mean)]
meanmin
## [1] 0.8008008
p7 <- ggplot(df, aes(x, y)) +
geom_line() +
geom_area(data = subset(df, x <= meanmin), fill = "#EFABA6") +
geom_area(data = subset(df, x >= meanmin), fill = "#DCD6D6") +
geom_vline(xintercept = meanmin, color="#7C6D7E", linewidth=2, linetype=2) +
labs(title = paste0("Mean= ", round(meanmin, 3)),
x = "proportion of water (p)", y = "Density")+
theme_classic()
Posterior Distributions
p1+p2+p3+p4+p5+p6+p7
In the globe tossing model, the dummy data arises from the binomial likelihood.
\(Pr(W|N, p) = \frac{N!}{W!(N-W)!}p^W(1-p)^{N-W}\)
Where \(W\) is the observed water count and \(N\) is the number of tosses. Suppose that \(N=2\) and let’s suppose p = 0.7. There are only three possible observations: 0 water landings, 1, water landing, or 2 water landings.
dbinom(0:2, 2, 0.7)
## [1] 0.09 0.42 0.49
If you toss the globe 2 times and Pr(W) = 0.7, then there is a 9% chance of observing 0 water, a 42% chance of observing 1 water, and a 49% chance of 2 waters.
Using the probabilities above, let’s simulate some observations! You
can use sample to do this again, or you can use r-base
functions because they are quick and easy.
rbinom(10, size=2, prob=0.7) #generate 10 random data points from the binomial distribution
## [1] 2 2 2 1 2 1 2 1 2 0
#let's generate a bunch!
dummy_water <- rbinom(100000, 2, 0.7)
table(dummy_water)/100000
## dummy_water
## 0 1 2
## 0.08959 0.42116 0.48925
Okay, let’s do different size and probabilities.
par(mfrow=c(2, 3))
dummy_w <- rbinom(100000, 9, 0.7)
simplehist(dummy_w, col="pink", xlab="dummy water count (size=9, prob=0.7)")
dummy_w_smallsamp <- rbinom(100000, 2, 0.7)
simplehist(dummy_w_smallsamp, col="#C183CC", xlab="dummy water count (size=2, prob=0.7)")
dummy_w_bigsamp <- rbinom(100000, 100, 0.7)
simplehist(dummy_w_bigsamp, col="#2EA5C9", xlab="dummy water count (size=100, prob=0.7)")
dummy_w_prob1 <- rbinom(100000, 100, 0.2)
simplehist(dummy_w_prob1, col="#8085A1", xlab="dummy water count (size=100, prob=0.2)")
dummy_w_prob2 <- rbinom(100000, 100, 0.99)
simplehist(dummy_w_prob2, col="#A4D336", xlab="dummy water count (size=100, prob=0.99)")
dummy_w_prob3 <- rbinom(100000, 20, 0.5)
simplehist(dummy_w_prob3, col="#DC343D", xlab="dummy water count (size=20, prob=0.5)")
Here you ensure your model fitting worked correctly and you evaluate the adequacy of a model for some purpose.
Ask how well the model reproduces the data used to educate it. If there is no correspondence, the software is likely to blame. Implied predictions here are called, retrodictions.
We need to combine sampling of simulated observations with sampling of the posterior distribution. We also want to carry our uncertainty forward while evaluating the implied predictions. To get the posterior predictive distribution we compute the sampling distribution of outcomes at each value of p and then average all of these predicted distributions together using the posterior probabilities of each value of p.
To simulate predicted observations for a single value of p, say
p=0.6, use rbinom to generate random binomial samples.
w <- rbinom(10000, 9, 0.6)
This generates 10,000 simulated predictions of 9 globe tosses
assuming p=0.6. These predictions are stored in the vector as water
counts. So the theoretical minimum is zero and the theoretical maximum
is nine. You can use simplehist(w) to get a clean histogram
of your simulated outcomes.
simplehist(w)
All you have to do to carry our uncertainty foreward is:
#replace the prob with posterior samples
head(samples)
## [1] 0.9069069 0.7487487 0.8978979 0.8708709 0.3793794 0.9639640
w <- rbinom(10000, 9, samples)
simplehist(w)
Problems 1-6 are using the samples from teh posterior distribution for the globe tossing example. This code will give you a specific set of samples, so you can check your answers.
p_grid <- seq( from=0 , to=1 , length.out=1000 )
prior <- rep( 1 , 1000 )
likelihood <- dbinom( 6 , size=9 , prob=p_grid )
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)
set.seed(100)
samples <- sample( p_grid , prob=posterior , size=1e4 , replace=TRUE )
3E1. How much posterior probability lies below p = 0.2? 3E2. How much posterior probability lies above p = 0.8? 3E3. How much posterior probability lies between p = 0.2 and p = 0.8? 3E4. 20% of the posterior probability lies below which value of p? 3E5. 20% of the posterior probability lies above which value of p? 3E6. Which values of p contain the narrowest interval equal to 66% of the posterior probability? 3E7. Which values of p contain 66% of the posterior probability, assuming equal posterior probability both below and above the interval?
#3e1
sum(samples < 0.2)/10000
## [1] 4e-04
#3e2
sum(samples > 0.8)/10000
## [1] 0.1116
#3e3
sum(samples > 0.2 & samples < 0.8)
## [1] 8880
#3e4
quantile(samples, 0.2)
## 20%
## 0.5185185
sum(samples<0.52)/10000
## [1] 0.204
#3e5
quantile(samples, 0.8)
## 80%
## 0.7557558
sum(samples>0.76)/10000
## [1] 0.1907
#3e6
HPDI(samples, 0.66)
## |0.66 0.66|
## 0.5085085 0.7737738
#3e7
PI(samples, prob=0.66)
## 17% 83%
## 0.5025025 0.7697698
Medium.
3M1. Suppose the globe tossing data had turned out to be 8 water in 15 tosses. Construct the posterior distribution, using grid approximation. Use the same flat prior as before.
p_grid <- seq( from=0 , to=1 , length.out=1000)
prior <- rep( 1 , 1000)
likelihood <- dbinom( 8 , size=15 , prob=p_grid)
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)
plot(posterior~p_grid, type="l")
3M2. Draw 10,000 samples from the grid approximation from above. Then use the samples to calculate the 90% HPDI for p.
samples <- sample( p_grid , prob=posterior , size=10000 , replace=TRUE)
#compute 90% hpdi
HPDI(samples, 0.9)
## |0.9 0.9|
## 0.3293293 0.7167167
3M3. Construct a posterior predictive check for this model and data. This means simulate the distribution of samples, averaging over the posterior uncertainty in p. What is the probability of observing 8 water in 15 tosses?
#using the same sample vector as 3m2
w <- rbinom(10000, 15, samples)
#compute proportion of these simulated observations that match 8
sum(w==8)/10000
## [1] 0.1444
simplehist(w)
3M4. Using the posterior distribution constructed from the new (8/15) data, now calculate the probability of observing 6 water in 9 tosses.
w <- rbinom(10000, 9, samples)
sum(w==6)/10000
## [1] 0.1751
3M5. Start over at 3M1, but now use a prior that is zero below p = 0.5 and a constant above p = 0.5. This corresponds to prior information that a majority of the Earth’s surface is water. Repeat each problem above and compare the inferences. What difference does the better prior make? If it helps, compare inferences (using both priors) to the true value p = 0.7.
p_grid <- seq( from=0 , to=1 , length.out=1000)
prior <- ifelse( p_grid < 0.5 , 0 , 1 )
likelihood <- dbinom(8 , size=15 , prob=p_grid)
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)
samples <- sample( p_grid , prob=posterior , size=1e4 , replace=TRUE )
plot( posterior ~ p_grid , type="l" )
HPDI( samples , prob=0.9 )
## |0.9 0.9|
## 0.5005005 0.7117117
w <- rbinom(10000 , size=15 , prob=samples )
simplehist(w)