Here we will perform Bayesian inference of the probability of heads based on coin tosses, using different algorithms: first uniform (naive) sampling, then importance sampling, and finally Markov chain Monte Carlo (MCMC).
We have some data providing the results of 300 coin tosses. We would like to estimate how fair the coin is, i.e. what is the probability of getting heads (heads are noted 1 below).
data = c(1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1)
sum(data)
## [1] 190
From our data, we see that we obtained 190 heads out of 300 tosses.
All coin tosses are assumed to be independent tosses of the same coin, which always has the same probability of returning heads. We want to perform Bayesian inference, and therefore need a prior. For inference, we will use different sampling algorithms.
We model the coin tosses as independent draws from a Bernoulli distribution with parameter \(\theta\) (it's always the same coin, so \(\theta\) is the same for every toss). The probability of observing 190 heads out of 300 tosses then follows a Binomial distribution. Thus we can write the likelihood function as follows:
# Function to compute the likelihood P(D|M)
# We use the binomial formula.
likelihood <- function(numHeads, numTosses, parameter) {
  lk <- choose(numTosses, numHeads) * parameter^numHeads * (1-parameter)^(numTosses-numHeads)
  return(lk)
}
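As a quick sanity check, we can evaluate this function for a fair coin (\(\theta = 0.5\)); this returns about 9.77e-07, the same Binomial probability we will reuse when computing Bayes factors at the end of this document:
# Probability of observing 190 heads out of 300 tosses with a fair coin
likelihood(sum(data), length(data), 0.5)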
Once the model generating the data is defined, we can compute the Maximum Likelihood (ML) estimate. In our case we can get it analytically: it is simply the ratio of heads over all tosses:
sum(data)/length(data)
## [1] 0.6333333
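We can also cross-check this analytical result numerically, for instance by maximizing the likelihood function with R's `optimize` (a quick sketch, not needed for the rest of the analysis):
# Numerical maximization of the likelihood over [0, 1];
# the result should be very close to the analytical value 190/300
optimize(function(p) likelihood(sum(data), length(data), p), interval=c(0,1), maximum=TRUE)$maximum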
We need to put a prior probability on the fairness of the coin, i.e. on the parameter \(\theta\). For this, a Beta distribution seems appropriate, as it is a continuous distribution on \([0,1]\). Let's display Beta distributions with various parameter values.
x <- seq(0, 1, length=100)
colors <- c("red", "purple", "blue", "darkgreen", "gold")
shape1s = c(1,2,4,1,0.2)
shape2s = c(1,2,4,4,0.2)
labels <- c("a=1, b=1", "a=2, b=2", "a=4, b=4", "a=1, b=4", "a=0.2, b=0.2")
plot(x, dbeta(x, shape1=shape1s[1], shape2=shape2s[1]), type="l", xlab="x value",
     ylab="Density", main="Comparison of beta Distributions", lwd=2, col=colors[1], ylim=c(0,7), lty=2)
for (i in 2:5) {
  lines(x, dbeta(x, shape1=shape1s[i], shape2=shape2s[i]), lwd=2, col=colors[i], lty=2)
}
legend("topleft", labels, lwd=2, col=colors, ncol=2, lty=2)
a=4
b=4
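We choose a = b = 4, the weakly informative prior centered on fairness (blue curve above). As a quick check, its mean and standard deviation follow from the standard Beta formulas:
# Prior mean a/(a+b) and standard deviation sqrt(a*b/((a+b)^2*(a+b+1)));
# for a = b = 4 this gives mean 0.5 and standard deviation 1/6
a/(a+b)
sqrt(a*b/((a+b)^2*(a+b+1)))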
Thus we can write a prior function:
# Function to compute the prior P(M)
prior <- function(parameter, shape1, shape2) {
  return(dbeta(parameter, shape1=shape1, shape2=shape2))
}
The posterior probability is: \[ P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)} \] We want to sample from \(P(\theta|D)\), which in principle requires computing \(P(D)\). But \(P(D)\) is the same for all parameter values, so we can drop it and compute an un-normalized posterior.
# Function to compute the un-normalized posterior P(D|M) * P(M)
unnormalized_posterior <- function(numHeads, numTosses, parameter, shape1, shape2) {
  return(likelihood(numHeads, numTosses, parameter) * prior(parameter, shape1, shape2))
}
We start with naive sampling: we draw values of our parameter uniformly between 0 and 1 and compute the un-normalized posterior probability for each of them.
naiveSampler <- function(numHeads, numTosses, shape1, shape2, number_iterations) {
  likelihoods <- c()
  values <- c()
  posteriors <- c()
  for (i in 1:number_iterations) {
    value <- runif(1)
    values <- c(values, value)
    likelihoods <- c(likelihoods, likelihood(numHeads, numTosses, value))
    posteriors <- c(posteriors, unnormalized_posterior(numHeads, numTosses, value, shape1, shape2))
  }
  return(list("values"=values, "posteriors"=posteriors, "likelihoods"=likelihoods))
}
Let’s run this naive sampling algorithm:
# Number of samples for the naive algorithm
num_samples = 20000
naive_result <- naiveSampler(sum(data), length(data), a, b, num_samples)
Let’s plot the posterior probability of the different values for the probabilities of heads that have been sampled. We also plot the Maximum Likelihood value, i.e. the ratio of the number of heads over the total number of tosses.
plot(naive_result$values, naive_result$posteriors, xlab="Probability of Heads", ylab="Likelihood or unnormalized posterior probability", pch=20, main ="Naive sampling")
points(naive_result$values, naive_result$likelihoods, col=rgb(0,1,0,0.5), pch=20)
legend("topleft", legend=c("Unnormalized posterior probability", "Likelihood"), col=c("black", rgb(0,1,0,0.5)), pch=20)
abline(v=sum(data)/length(data), col="red", lwd=2)
Our Bayesian estimate is the posterior-weighted average of the samples.
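Explicitly, writing \(\tilde{P}(\theta|D) = P(D|\theta)P(\theta)\) for the un-normalized posterior, this self-normalized estimate is: \[ \hat{\theta} = \frac{\sum_i \theta_i \, \tilde{P}(\theta_i|D)}{\sum_i \tilde{P}(\theta_i|D)} \] In R: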
sum(naive_result$values* naive_result$posteriors)/sum(naive_result$posteriors)
## [1] 0.6296099
The two estimates differ a little. This is expected: since the amount of data is limited (only 300 tosses), the prior probability, which we chose to be a bit informative, pulls the estimate slightly towards fairness (0.5).
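Because the Beta prior is conjugate to the Binomial likelihood (this falls out of the integral we compute in the Bayes factor section below), the posterior is itself a Beta(k+a, n-k+b) distribution, so we can compare the sampled estimate with the exact posterior mean:
# Exact posterior mean of a Beta(k+a, n-k+b) distribution: (k+a)/(n+a+b)
# Here, 194/308 = 0.6299, very close to the sampled estimate above
(sum(data) + a) / (length(data) + a + b)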
Let’s compare the number of samples drawn between \(0.6\) and \(0.65\), and the number of samples drawn between \(0.05\) and \(0.1\):
length(which(naive_result[[1]]>0.6 & naive_result[[1]]<0.65 ))
## [1] 1023
length(which(naive_result[[1]]>0.05 & naive_result[[1]]<0.1 ))
## [1] 1024
The two counts are comparable, which is expected, because we performed a uniform sampling of the interval \([0,1]\). A graphical way to reach the same conclusion is to plot the histogram of the values we have sampled:
hist(naive_result[[1]], main="Naive sampling", xlab="Probability of Heads", ylab="Number of samples per bin")
abline(v=sum(data)/length(data), col="red", lwd=2)
In our case, where the space to sample is bounded and has only one dimension, that is not a problem. But in high-dimensional cases, or when parameter values range over large spaces, we could miss the values of high probability, and waste time and resources on values of low probability. A better sampling approach is importance sampling.
With importance sampling, we sample from a “helper” distribution which we assume is shaped similarly to the distribution we want to sample from. Then, we reweight the samples to correct for the fact that the helper distribution is not uniform. In our case, we can use a Beta distribution weakly centered on fairness (purple curve above, a=b=2) as the helper to sample from the posterior.
# Number of samples for the importance sampling algorithm
num_samples = 20000
# Function to draw one sample from the beta(2,2) helper distribution
drawFromBeta_2_2 <- function() {
  return(rbeta(n=1, shape1=2, shape2=2))
}
# Density of the helper distribution at a given value; it is used
# as the denominator of the importance weight
computeImportanceWeightFromBeta_2_2 <- function(value) {
  return(dbeta(value, shape1=2, shape2=2))
}
importanceSampler <- function(numHeads, numTosses, shape1, shape2, number_iterations) {
  likelihoods <- c()
  values <- c()
  posteriors <- c()
  importance_weights <- c()
  for (i in 1:number_iterations) {
    value <- drawFromBeta_2_2()
    values <- c(values, value)
    likelihoods <- c(likelihoods, likelihood(numHeads, numTosses, value))
    posteriors <- c(posteriors, unnormalized_posterior(numHeads, numTosses, value, shape1, shape2))
    importance_weights <- c(importance_weights, computeImportanceWeightFromBeta_2_2(value))
  }
  return(list(values, posteriors, likelihoods, importance_weights))
}
Let’s run the importance sampler and have a look at the distribution we get:
importance_result <- importanceSampler(sum(data), length(data), a, b, num_samples)
Let’s plot the different values for the probabilities of heads that have been sampled. We also plot the Maximum Likelihood value, i.e. the ratio of heads over the total number of tosses.
plot(importance_result[[1]], importance_result[[2]], xlab="Probability of Heads", ylab="Likelihood or unnormalized posterior probability", pch=20, main ="Importance sampling")
points(importance_result[[1]], importance_result[[3]], col=rgb(0,1,0,0.5), pch=20)
legend("topleft", legend=c("Unnormalized posterior probability", "Likelihood"), col=c("black", rgb(0,1,0,0.5)), pch=20)
abline(v=sum(data)/length(data), col="red", lwd=2)
The picture looks the same whether we sampled naively or by importance sampling. However, importance sampling should have sampled different parts of the space with different intensities:
hist(importance_result[[1]], main="Importance sampling", xlab="Probability of Heads", ylab="Number of samples per bin")
abline(v=sum(data)/length(data), col="red", lwd=2)
Because we have sampled according to the helper beta(2,2) distribution, the samples are not uniformly distributed: we have sampled more often in \([0.6, 0.65]\) than in \([0.05, 0.1]\).
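We can verify this with the same counts we computed for the naive sampler; with the beta(2,2) helper, the first count should now be noticeably larger than the second:
# Count importance samples in [0.6, 0.65] versus [0.05, 0.1]
length(which(importance_result[[1]]>0.6 & importance_result[[1]]<0.65))
length(which(importance_result[[1]]>0.05 & importance_result[[1]]<0.1))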
Now let’s use the importance weights, which will reweigh the number of times we have sampled each value:
library(plotrix)
weighted.hist(x=importance_result[[1]], w= importance_result[[2]] / importance_result[[4]] , main="Importance sampling, after correction", xlab="Probability of Heads", ylab="Number of samples per bin")
abline(v=sum(data)/length(data), col="red", lwd=2)
Thanks to the reweighting trick, our final distribution really is a good approximation to the posterior probability distribution.
Let’s see if our reweighted sample is centered around the high posterior probability region:
sum(importance_result[[1]] * importance_result[[2]] / importance_result[[4]])/sum(importance_result[[2]] / importance_result[[4]])
## [1] 0.6296993
Thanks to importance sampling, after reweighting, we can obtain a distribution of parameter values that corresponds to the posterior probability distribution.
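A common diagnostic for importance sampling, added here as an aside, is the effective sample size (ESS): it estimates how many independent draws from the posterior our weighted sample is worth. If the helper distribution is badly chosen, the ESS is much smaller than the number of samples.
# Effective sample size (Kish's formula) of the importance-weighted sample
w <- importance_result[[2]] / importance_result[[4]]
sum(w)^2 / sum(w^2)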
We now build an MCMC chain to estimate the probability of heads for this coin. The model (prior, likelihood and un-normalized posterior probability) has been defined above; here we implement a Metropolis MCMC inference mechanism.
# Number of iterations for the MCMC algorithm
num_iterations = 20000
# Function to propose a new parameter value, randomly drawn between 0 and 1
propose_new_parameter_value <- function() {
  return(runif(1))
}
# Function to run Metropolis MCMC inference
MetropolisMCMC <- function(numHeads, numTosses, shape1, shape2, number_iterations) {
  # Start the chain from a random parameter value
  current_parameter_value <- propose_new_parameter_value()
  record_parameter <- c(current_parameter_value)
  #print(paste("Initial parameter value for the MCMC: ", current_parameter_value, sep=""))
  current_posterior <- unnormalized_posterior(numHeads, numTosses, current_parameter_value, shape1, shape2)
  #print(paste("Initial probability of the model: ", current_posterior, sep=""))
  record_posterior <- c(current_posterior)
  # We also record the likelihood:
  current_likelihood <- likelihood(numHeads, numTosses, current_parameter_value)
  record_likelihood <- c(current_likelihood)
  for (i in 1:number_iterations) {
    acceptance_threshold <- runif(1)
    proposed_parameter_value <- propose_new_parameter_value()
    proposed_posterior <- unnormalized_posterior(numHeads, numTosses, proposed_parameter_value, shape1, shape2)
    # Metropolis rule: accept the proposal with probability min(1, proposed/current)
    if (proposed_posterior / current_posterior > acceptance_threshold) {
      current_parameter_value <- proposed_parameter_value
      current_posterior <- proposed_posterior
      current_likelihood <- likelihood(numHeads, numTosses, current_parameter_value)
    }
    record_parameter <- c(record_parameter, current_parameter_value)
    record_posterior <- c(record_posterior, current_posterior)
    record_likelihood <- c(record_likelihood, current_likelihood)
  }
  return(list(record_parameter, record_posterior, record_likelihood))
}
result = MetropolisMCMC(sum(data), length(data), a, b, num_iterations)
plot(result[[2]], main="Unnormalized posterior probability", ylab="Probability", xlab="Iteration", t="l")
The MCMC chain more often samples parameter values that have high posterior probability than parameter values that have low posterior probability.
plot(result[[1]], main="Probability of Heads", ylab="Probability", xlab="Iteration", t="l")
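A simple diagnostic we can add for this chain is the acceptance rate, i.e. the proportion of iterations in which the proposed value was accepted; with an independent uniform proposal and a concentrated posterior, a fairly low rate is expected:
# Fraction of iterations where the chain moved to a new value
# (a move to an identical value would be missed, but this is essentially
# impossible with a continuous proposal)
mean(diff(result[[1]]) != 0)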
We compare the posterior distribution, which combines the prior and the likelihood, to the prior probability distribution alone.
hist(result[[1]], main="Posterior distribution of the probability of Heads", ylab="Density", xlab="Probability of Heads", nc=80, freq=F, xlim=c(0,1))
lines(density(result[[1]]), col=colors[3], lwd=2)
lines(x, dbeta(x, shape1=shape1s[3], shape2=shape2s[3]), lwd=2, col=colors[3], lty=2)
legend("topleft", c("Posterior", "Prior"), col=c(colors[3], colors[3]), lwd=c(2,2), lty=c(1,2))
The posterior distribution is quite different from the prior distribution, which indicates that we have learned something from the analysis of our data.
To test if my coin is fair, I can use the sample from the posterior distribution obtained from the MCMC. If the coin were fair, the sampled probability of heads should be above 0.5 about as often as below 0.5. Let's count:
above = length(which(result[[1]]>0.5)) / length(result[[1]])
below = 1-above
print(paste("The probability that my coin is favoring Heads is ", above, ", whereas it is ", below, " that it is favoring Tails.", sep=""))
## [1] "The probability that my coin is favoring Heads is 0.99990000499975, whereas it is 9.99950002500416e-05 that it is favoring Tails."
My coin does not seem to be fair.
The ML estimate is simply the ratio of heads over all tosses:
sum(data)/length(data)
## [1] 0.6333333
Our Bayesian estimate would be the median of the samples:
median(result[[1]])
## [1] 0.6298248
The two estimates differ a little. This is expected: since the amount of data is limited (only 300 tosses), the somewhat informative prior pulls our estimate slightly towards fairness.
To further investigate the impact of the prior, we can change the prior distribution and see if our estimate changes drastically. To choose a prior that really differs from the one we chose above, we use the distribution plotted in gold above, where both shape parameters of the Beta distribution are 0.2.
result_02 = MetropolisMCMC(sum(data), length(data), 0.2, 0.2, num_iterations)
We plot in dashed lines the prior distributions, and in solid lines the posterior distributions.
plot(density(result[[1]]), col=colors[3], lwd=2, main="Posterior distribution of the probability of Heads", ylab="Density", xlab="Probability of Heads", ylim=c(0,24), xlim=c(0,1))
lines(x, dbeta(x, shape1=shape1s[3], shape2=shape2s[3]), lwd=2, col=colors[3], lty=2)
lines(density(result_02[[1]]), col=colors[5], lwd=2)
lines(x, dbeta(x, shape1=shape1s[5], shape2=shape2s[5]), lwd=2, col=colors[5], lty=2)
legend("topleft", legend=c("Posterior, a=4, b=4","Prior, a=4, b=4", "Posterior, a=0.2, b=0.2","Prior, a=0.2, b=0.2"), col=c(colors[3], colors[3], colors[5], colors[5]), lty=c(1,2,1,2), lwd=c(2,2,2,2), ncol=2)
Let’s look at summary statistics, first for the first prior (shapes = 4):
summary(result[[1]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2489 0.6119 0.6298 0.6301 0.6481 0.7239
and the second prior (shapes=0.2):
summary(result_02[[1]])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2542 0.6151 0.6339 0.6342 0.6533 0.7406
In this case, the prior distribution has little impact on the posterior distribution of the parameter: it does not affect our assessment of the fairness of the coin.
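To quantify this, we can also compare 95% credible intervals under the two priors; both should exclude 0.5, in line with the fairness test above:
# 95% credible intervals of the probability of heads under the two priors
quantile(result[[1]], probs=c(0.025, 0.975))
quantile(result_02[[1]], probs=c(0.025, 0.975))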
One way to explicitly test hypotheses about the fairness of the coin is to use a Bayes factor. Here, we will compute the Bayes factor between a model where we fix the probability of heads to 0.5, i.e. a model that states that the coin is fair, and a model where we use our weak Beta(4,4) prior. This latter model is much more agnostic about whether the coin is fair or not. The Bayes factor is computed as the ratio of the two marginal likelihoods.
To obtain the marginal likelihood of the model with the prior, we have to integrate over the whole prior distribution. We will do this in two ways: exactly, using maths, and numerically, using the harmonic mean. The harmonic mean is the simplest way to obtain the marginal likelihood numerically, and it is consistent: given an infinite number of samples, it would converge to the correct value. The problem with this estimator is that it can have infinite variance, which may lead to wildly inaccurate estimates.
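Concretely, the harmonic mean estimator uses the likelihoods of the \(N\) parameter values sampled from the posterior, as our MCMC provides: \[ \hat{P}(D) = \left( \frac{1}{N} \sum_{i=1}^{N} \frac{1}{P(D|\theta_i)} \right)^{-1}, \qquad \theta_i \sim P(\theta|D) \]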
The Bayes factor we want to compute is between the model with unknown probability of heads, and the model with probability of heads at 0.5. \[ BF = \frac{Marginal~Likelihood~model~with~unknown~probability}{Marginal~Likelihood~model~with~probability~0.5} = \frac{P_{M1}(D)}{P_{M2}(D)} \\ P_{M1}(D) = \int_0^1\frac{1}{B(\alpha, \beta)}p^{\alpha-1}(1-p)^{\beta-1} \binom{n}{k} p^k(1-p)^{n-k} dp \\ P_{M1}(D) = \frac{1}{B(\alpha, \beta)} \binom{n}{k} \int_0^1 p^{\alpha-1}(1-p)^{\beta-1} p^k(1-p)^{n-k} dp \\ P_{M1}(D) = \frac{1}{B(\alpha, \beta)} \binom{n}{k} \int_0^1 p^{k+\alpha-1}(1-p)^{n+\beta-k-1} dp \\ \] Here we can recognize that the integrand looks a lot like a Beta density: \[ P_{M1}(D) = \frac{B(k+\alpha, n+\beta-k)}{B(\alpha, \beta)} \binom{n}{k} \int_0^1 \frac{p^{k+\alpha-1}(1-p)^{n+\beta-k-1} } {B(k+\alpha, n+\beta-k)} dp \]
The integral on the right has become the integral of a Beta density between 0 and 1, which means its value is 1!
\[ P_{M1}(D) = \frac{B(k+\alpha, n+\beta-k)}{B(\alpha, \beta)} \binom{n}{k} \] Then we can compute \(P_{M2}(D)\). Here all the probability mass is on the parameter value \(0.5\), so the marginal likelihood is just the Binomial probability of the counts given \(p = 0.5\):
\[ P_{M2}(D) = \binom{n}{k} \frac{1}{2^n} \]
Thus, we have the following Bayes Factor:
\[ BF = \frac{P_{M1}(D)}{P_{M2}(D)} = {2^n} \frac{B(k+\alpha, n+\beta-k)}{B(\alpha, \beta)} \] Its value is:
With our prior \(\alpha=\beta=4\), \(n=300\) and \(k=190\): \[ BF = {2^{300}} \frac{B(190+4, 300+4-190)}{B(4, 4)}={2^{300}} \frac{B(194, 114)}{B(4, 4)} \]
According to the analytical approach, the Bayes Factor is:
BF <- 2^300 * beta(194, 114) / beta(4, 4)
print(BF)
## [1] 5937.446
There seems to be “very strong” support (cf. https://en.wikipedia.org/wiki/Bayes_factor) in favor of the model with the somewhat informative prior over the hypothesis that puts all probability mass on \(0.5\).
Now let’s compute the harmonic mean estimate. The marginal likelihood of the fair-coin model requires no sampling: as derived above, it is simply the Binomial likelihood at \(p = 0.5\):
P_M2 <- likelihood(sum(data), length(data), 0.5)
print(P_M2)
## [1] 9.772302e-07
As written above, the harmonic mean estimator can be unstable, so it is good to run it several times. Let's do thirty replicates.
P_M1 = c()
for (i in 1:30) {
  result = MetropolisMCMC(sum(data), length(data), 4, 4, num_iterations)
  # Harmonic mean of the likelihoods sampled by the MCMC
  P_M1 = c(P_M1, 1/mean(1/result[[3]]))
}
Now we can compute the Bayes factor in favor of the model with unknown probability of Heads:
summary(P_M1 / P_M2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 3487 5912 17140
Let’s plot the results and compare the analytical estimate to our 30 replicates:
hist(P_M1/P_M2, xlab="Numerical estimate using the harmonic mean", freq=T, nc=15)
abline(v=BF, col="red", lwd=2)
legend("topleft", col="red", legend="Analytical estimate", lwd=2)
As we can see, the numerical estimate using the harmonic mean is quite variable. Other numerical estimators have been proposed, including Thermodynamic Integration and Stepping Stone sampling. However, whenever a mathematical solution cannot be obtained, it remains difficult and computationally intensive to obtain good estimates of Bayes factors.