Here we will perform Bayesian inference of the probability of heads based on coin tosses. We will use different algorithms: first naive (uniform) sampling, then importance sampling, and finally Markov chain Monte Carlo (MCMC).

We have some data providing the results of 300 coin tosses. We would like to estimate how fair the coin is, i.e. what is the probability of getting heads (heads are noted 1 below).

data = c(1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1,1,0,1,0,0,1,1,1,0,0,1,1,1,0,1,1,1,0,0,1,1,0,1,1,0,1,1,0,1,1)
sum(data)
## [1] 190

From our data, we see that we obtained 190 heads out of 300 tosses.

We build a probabilistic model of coin tossing.

All coin tosses are supposed to be independent tosses of the same coin, which always has the same probability of landing heads. We want to perform Bayesian inference, therefore we need a prior. For inference, we will be using different sampling algorithms.

Defining the likelihood

We model the coin tosses as independent draws from a Bernoulli distribution with parameter \(\theta\) (it’s always the same coin, so we consider that \(\theta\) is the same for every toss; in the code, \(\theta\) is named parameter). By doing so, the probability of observing 190 heads out of 300 tosses can be computed according to a Binomial distribution. Thus we can write the likelihood function as follows:

# Function to compute the likelihood P(D|M)
# We use the binomial formula.
likelihood <- function(numHeads, numTosses, parameter){
    lk <- choose(numTosses, numHeads) * parameter^numHeads * (1-parameter)^(numTosses-numHeads)
    return(lk)
}
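As a quick check, we can evaluate this likelihood at the fair-coin value \(0.5\) (this exact value will reappear below when we compute marginal likelihoods):

likelihood(sum(data), length(data), 0.5)
## [1] 9.772302e-07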

Once I have defined the model generating the data, I can compute the Maximum Likelihood estimate. In our case we can get it analytically: it is simply the ratio of heads over all tosses, as the short derivation below shows.
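Setting the derivative of the log-likelihood to zero, with \(k\) heads out of \(n\) tosses:

\[ \frac{d}{d\theta} \log L(\theta) = \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0 \implies \hat{\theta} = \frac{k}{n} \]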

sum(data)/length(data)
## [1] 0.6333333

Defining a prior

We need to put some prior probability on the fairness of the coin, i.e. on the parameter \(\theta\). For this, a beta distribution seems appropriate, as it is a continuous distribution between 0 and 1. Let’s display beta distributions with various shape parameter values.

x <- seq(0, 1, length=100)
colors <- c("red", "purple", "blue", "darkgreen", "gold")
shape1s = c(1,2,4,1,0.2)
shape2s = c(1,2,4,4,0.2)
labels <- c("a=1, b=1", "a=2, b=2", "a=4, b=4", "a=1, b=4", "a=0.2, b=0.2")

plot(x, dbeta(x, shape1=shape1s[1], shape2=shape2s[1]), type="l", xlab="x value",
  ylab="Density", main="Comparison of beta Distributions", lwd=2, col=colors[1], ylim=c(0,7), lty=2)

for (i in 2:5){
  lines(x, dbeta(x, shape1=shape1s[i], shape2=shape2s[i]), lwd=2, col=colors[i], lty=2)
}
legend("topleft", labels, lwd=2, col=colors, ncol=2, lty=2)

We choose to use a=4, b=4 (blue). This is a somewhat informative prior that favors a fair coin, given our past experience with coins: all symmetric Beta(a,a) distributions are centered on 0.5, and larger values of a concentrate more mass around 0.5.

a=4
b=4

Thus we can write a prior function:

# Function to compute the prior P(M)
prior <- function(parameter, shape1, shape2){
    return(dbeta(parameter, shape1=shape1, shape2=shape2))
}
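For example, the density this prior puts on a perfectly fair coin is \(0.5^3 \times 0.5^3 / B(4,4) = 140 \times 0.5^6 = 2.1875\):

prior(0.5, a, b)
## [1] 2.1875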

Computing the posterior probability

The posterior probability is: \[ P(\theta|D) = P(D|\theta)P(\theta)/P(D) \] We want to sample from \(P(\theta|D)\), which in principle requires computing \(P(D)\). But \(P(D)\) is the same for all parameter values, so we drop it and work with an un-normalized posterior.

# Function to compute the un-normalized posterior P(D|M) * P(M)
unnormalized_posterior <- function(numHeads, numTosses, parameter, shape1, shape2) {
    return (likelihood(numHeads, numTosses, parameter) * prior(parameter, shape1, shape2))
}

Inference of the fairness of the coin using naive sampling

We randomly sample values of our parameter and compute the posterior probability for those values.

naiveSampler <- function (numHeads, numTosses, shape1, shape2, number_iterations) {
  likelihoods <- c()
  values <- c()
  posteriors <- c()
  for (i in 1:number_iterations) {
    value <- runif(1)
    values <- c(values, value)
    likelihoods <- c(likelihoods, likelihood(numHeads, numTosses, value))
    posteriors <- c(posteriors, unnormalized_posterior(numHeads, numTosses, value, shape1, shape2))
  }
  return (list("values"=values, "posteriors"=posteriors, "likelihoods"=likelihoods))
}

Let’s run this naive sampling algorithm:

# Number of samples for the naive algorithm
num_samples = 20000

naive_result <- naiveSampler(sum(data), length(data), a, b, num_samples)

Let’s plot the un-normalized posterior probability of the sampled values of the probability of heads. We also plot the Maximum Likelihood value, i.e. the ratio of the number of heads over the total number of tosses.

plot(naive_result$values, naive_result$posteriors, xlab="Probability of Heads", ylab="Likelihood or unnormalized posterior probability", pch=20, main ="Naive sampling")
points(naive_result$values, naive_result$likelihoods, col=rgb(0,1,0,0.5), pch=20)
legend("topleft", legend=c("Unnormalized posterior probability", "Likelihood"), col=c("black", rgb(0,1,0,0.5)), pch=20)

abline(v=sum(data)/length(data), col="red", lwd=2)

Is our Bayesian estimate comparable to the ML estimate of the probability of heads?

Our Bayesian estimate would be the posterior-weighted average of the sampled values:

sum(naive_result$values* naive_result$posteriors)/sum(naive_result$posteriors)
## [1] 0.6296099

The two seem to differ a little bit. This is expected: since the amount of data is limited (only 300 tosses), the prior, which we chose to be somewhat informative, still carries some weight and pulls the estimate towards fairness.
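In fact, this conjugate model has an exact answer we can check against: a Beta(a, b) prior combined with a Binomial likelihood yields a Beta(a+k, b+n-k) posterior (this is derived explicitly in the Bayes factor section below), whose mean is \((a+k)/(a+b+n)\):

(a + sum(data)) / (a + b + length(data))
## [1] 0.6298701

The sampled estimate is indeed very close to this exact posterior mean.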

How many samples were in the high probability interval?

Let’s compare the number of samples drawn between \(0.6\) and \(0.65\), and the number of samples drawn between \(0.05\) and \(0.1\):

length(which(naive_result[[1]]>0.6 & naive_result[[1]]<0.65 ))
## [1] 1023
length(which(naive_result[[1]]>0.05 & naive_result[[1]]<0.1 ))
## [1] 1024

The two values are comparable, which is expected, because we performed a uniform sampling of the interval \([0,1]\). A graphical way to reach the same conclusion is to plot the histogram of the values we have sampled:

hist(naive_result[[1]], main="Naive sampling", xlab="Probability of Heads", ylab="Number of samples per bin")
abline(v=sum(data)/length(data), col="red", lwd=2)

In our case, where the space to sample is limited and has only one parameter, that is not a problem. But in high dimensional cases, or when parameter values range over large spaces, we could have missed the values of high probability, and wasted time and resources on values of low probability. A better sampling approach is importance sampling.

Inference of the fairness of the coin using importance sampling

With importance sampling, we sample from a “helper” distribution which we assume is shaped similarly to the distribution we want to sample from. We then reweight the samples to correct for the fact that they were drawn from the helper distribution rather than uniformly. In our case, we can use a Beta distribution weakly centered on fairness (purple curve above, a=b=2) as helper to sample from the posterior.
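More formally, if the samples \(\theta_i\) are drawn from a helper density \(q\), posterior expectations are estimated with self-normalized importance weights: \[ \hat{E}[\theta|D] = \frac{\sum_i w_i \theta_i}{\sum_i w_i}, \qquad w_i = \frac{\tilde{P}(\theta_i|D)}{q(\theta_i)} \] where \(\tilde{P}\) denotes the un-normalized posterior and, here, \(q\) is the Beta(2,2) density. This is exactly the reweighting performed by the code below.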

# Number of samples for the importance sampling algorithm
num_samples = 20000

drawFromBeta_2_2 <- function() {
  return(rbeta(n=1, shape1=2, shape2=2))
}

computeImportanceWeightFromBeta_2_2 <- function(value) {
  return(dbeta(value, shape1=2, shape2=2))
}

importanceSampler <- function (numHeads, numTosses, shape1, shape2, number_iterations) {
  likelihoods <- c()
  values <- c()
  posteriors <- c()
  importance_weights <- c()
  for (i in 1:number_iterations) {
    value <- drawFromBeta_2_2()
    values <- c(values, value)
    likelihoods <- c(likelihoods, likelihood(numHeads, numTosses, value))
    posteriors <- c(posteriors, unnormalized_posterior(numHeads, numTosses, value, shape1, shape2))
    importance_weights <- c(importance_weights, computeImportanceWeightFromBeta_2_2(value))
  }
  return (list(values, posteriors, likelihoods, importance_weights))
}

Let’s have a look at the distribution we get with importance sampling:

importance_result <- importanceSampler(sum(data), length(data), a, b, num_samples)

Let’s plot the different values for the probabilities of heads that have been sampled. We also plot the Maximum Likelihood value, i.e. the ratio of heads over the total number of tosses.

plot(importance_result[[1]], importance_result[[2]], xlab="Probability of Heads", ylab="Likelihood or unnormalized posterior probability", pch=20, main ="Importance sampling")
points(importance_result[[1]], importance_result[[3]], col=rgb(0,1,0,0.5), pch=20)
legend("topleft", legend=c("Unnormalized posterior probability", "Likelihood"), col=c("black", rgb(0,1,0,0.5)), pch=20)
abline(v=sum(data)/length(data), col="red", lwd=2)

The likelihood and posterior values look the same as with naive sampling. However, importance sampling should have distributed its samples differently across the parameter space:

hist(importance_result[[1]], main="Importance sampling", xlab="Probability of Heads", ylab="Number of samples per bin")
abline(v=sum(data)/length(data), col="red", lwd=2)

Because we have sampled according to the helper Beta(2,2) distribution, we have a non-uniform distribution of the samples, and we have sampled more often in \([0.6, 0.65]\) than in \([0.05, 0.1]\).

Now let’s use the importance weights, which will reweigh the number of times we have sampled each value:

library(plotrix)
weighted.hist(x=importance_result[[1]],  w= importance_result[[2]] / importance_result[[4]] , main="Importance sampling, after correction", xlab="Probability of Heads", ylab="Number of samples per bin")
abline(v=sum(data)/length(data), col="red", lwd=2)

Thanks to the reweighting trick, our final distribution really is a good approximation to the posterior probability distribution.

Let’s see if our reweighted sample is centered around the high posterior probability region:

sum(importance_result[[1]] * importance_result[[2]] / importance_result[[4]])/sum(importance_result[[2]] / importance_result[[4]])
## [1] 0.6296993

Thanks to importance sampling, after reweighting, we obtain a distribution of parameter values that corresponds to the posterior probability distribution; the reweighted mean again agrees closely with the exact posterior mean computed above.

Inference of the fairness of the coin using MCMC

We build an MCMC chain to estimate the probability of heads for this coin. We reuse the model defined above, with the prior, the likelihood and the un-normalized posterior probability, and we implement a Metropolis MCMC inference mechanism.
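At each iteration, Metropolis MCMC proposes a new value \(\theta'\) and accepts it with probability \[ \alpha = \min\left(1, \frac{\tilde{P}(\theta'|D)}{\tilde{P}(\theta|D)}\right) \] where \(\tilde{P}\) is the un-normalized posterior; otherwise the chain stays at the current value \(\theta\). Because the proposal below draws \(\theta'\) uniformly on \([0,1]\), independently of \(\theta\), its density is constant and no Hastings correction is needed in the ratio.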

Implementing the MCMC algorithm

# Number of iterations for the MCMC algorithm
num_iterations = 20000

# Function to propose a new parameter value, randomly drawn between 0 and 1
propose_new_parameter_value <-function() {
    return (runif(1))
}

# Function to run Metropolis MCMC inference
MetropolisMCMC <- function (numHeads, numTosses, shape1, shape2, number_iterations) {
    current_parameter_value <- propose_new_parameter_value()
    record_parameter <- c()
    record_parameter <- c(record_parameter,current_parameter_value)
    #print(paste("Initial parameter value for the MCMC: ", current_parameter_value, sep=""))
    current_posterior <- unnormalized_posterior(numHeads, numTosses, current_parameter_value, shape1, shape2)
    #print(paste("Initial probability of the model: ", current_posterior, sep=""))
    record_posterior <- c()
    record_posterior <- c(record_posterior, current_posterior)
    # We also record the likelihood:
    current_likelihood <- likelihood(numHeads, numTosses, current_parameter_value)
    record_likelihood <- c()
    record_likelihood <- c(record_likelihood, current_likelihood)
    for (i in 1:number_iterations) {
        acceptance_threshold <- runif(1)
        proposed_parameter_value <- propose_new_parameter_value()
        proposed_posterior <- unnormalized_posterior(numHeads, numTosses, proposed_parameter_value, shape1, shape2)
        # Accept with probability min(1, posterior ratio): comparing the ratio
        # to a Uniform(0,1) draw implements exactly this Metropolis rule
        if (proposed_posterior / current_posterior > acceptance_threshold) {
            current_parameter_value <- proposed_parameter_value
            current_posterior <- proposed_posterior
            current_likelihood <- likelihood(numHeads, numTosses, current_parameter_value)
        }
        record_parameter <- c(record_parameter, current_parameter_value)
        record_posterior <- c(record_posterior, current_posterior)
        record_likelihood <- c(record_likelihood, current_likelihood)
    }
    return (list(record_parameter, record_posterior, record_likelihood))
}

Running the MCMC

result = MetropolisMCMC(sum(data), length(data), a, b, num_iterations)

Analysis of the output

plot(result[[2]], main="Unnormalized posterior probability", ylab="Probability", xlab="Iteration", t="l")

The MCMC chain more often samples parameter values that have high posterior probability than parameter values that have low posterior probability.

plot(result[[1]], main="Probability of Heads", ylab="Probability", xlab="Iteration", t="l")

Posterior distribution of the probability of heads

We compare the posterior distribution, which combines the prior and the likelihood, to the prior probability distribution alone.

hist(result[[1]], main="Posterior distribution of the probability of Heads", ylab="Density", xlab="Probability of Heads", nc=80, freq=F, xlim=c(0,1))
lines(density(result[[1]]), col=colors[3], lwd=2)
lines(x, dbeta(x, shape1=shape1s[3], shape2=shape2s[3]), lwd=2, col=colors[3], lty=2)
legend("topleft", c("Posterior", "Prior"), col=c(colors[3], colors[3]), lwd=c(2,2), lty=c(1,2))

The posterior distribution is quite different from the prior distribution, which indicates that we have learned something from the analysis of our data.
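We can also summarize the posterior with a quantile-based 95% credible interval (the exact bounds vary from run to run, since we did not fix a random seed):

# 95% credible interval for the probability of heads
quantile(result[[1]], probs=c(0.025, 0.975))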

Is my coin fair?

To test if my coin is fair, I can benefit from the sample from the posterior distribution obtained with the MCMC. If the coin is fair, the sampled probability of heads should be above 0.5 about as often as it is below 0.5. Let’s count:

above = length(which(result[[1]]>0.5)) / length(result[[1]])
below = 1-above
print(paste("The probability that my coin is favoring Heads is ", above, ", whereas it is ", below, " that it is favoring Tails.", sep=""))
## [1] "The probability that my coin is favoring Heads is 0.99990000499975, whereas it is 9.99950002500416e-05 that it is favoring Tails."

My coin does not seem to be fair.

Is our Bayesian estimate comparable to the ML estimate of the probability of heads?

The ML estimate is simply the ratio of heads over all tosses:

sum(data)/length(data)
## [1] 0.6333333

Our Bayesian estimate would be the median of the samples:

median(result[[1]])
## [1] 0.6298248

The two seem to differ a little bit. This is expected: since the amount of data is limited (only 300 tosses), the prior, which we chose to be somewhat informative, still carries some weight and pulls our estimate towards fairness.

Investigating the impact of the prior

To further investigate the impact of the prior, we can change the prior distribution and see if our estimate changes drastically. To choose a prior that really differs from the one we chose above, we use the distribution plotted in gold above, where the two shape parameters of the Beta distribution are 0.2.

result_02 = MetropolisMCMC(sum(data), length(data), 0.2, 0.2, num_iterations)

Comparison of the posterior distributions

We plot in dashed lines the prior distributions, and in solid lines the posterior distributions.

plot(density(result[[1]]), col=colors[3], lwd=2, main="Posterior distribution of the probability of Heads", ylab="Density", xlab="Probability of Heads", ylim=c(0,24), xlim=c(0,1))
lines(x, dbeta(x, shape1=shape1s[3], shape2=shape2s[3]), lwd=2, col=colors[3], lty=2)
lines(density(result_02[[1]]), col=colors[5], lwd=2)
lines(x, dbeta(x, shape1=shape1s[5], shape2=shape2s[5]), lwd=2, col=colors[5], lty=2)
legend("topleft", legend=c("Posterior, a=4, b=4","Prior, a=4, b=4", "Posterior, a=0.2, b=0.2","Prior, a=0.2, b=0.2"), col=c(colors[3], colors[3], colors[5], colors[5]), lty=c(1,2,1,2), lwd=c(2,2,2,2), ncol=2)

Let’s look at summary statistics, first for the first prior (shapes = 4):

summary(result[[1]])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2489  0.6119  0.6298  0.6301  0.6481  0.7239

and the second prior (shapes=0.2):

summary((result_02[[1]]))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2542  0.6151  0.6339  0.6342  0.6533  0.7406

In this case, the prior distribution has little impact on the posterior distribution of the parameter: it does not affect our assessment of the fairness of the coin.

Bayes factors for comparing models

One way to explicitly test hypotheses about the fairness of the coin is to use a Bayes factor. Here, we will compute the Bayes factor between a model where we fix the probability of Heads to 0.5, i.e. a model that states that the coin is fair, and a model where we use our somewhat informative Beta(4,4) prior from above. This latter model is much less committal about whether the coin is fair or not. The Bayes factor is computed as the ratio of the two marginal likelihoods.

To obtain a marginal likelihood, we have to integrate the likelihood over the whole prior distribution. We will do this in two ways: exactly, using maths, and numerically, using the harmonic mean. The harmonic mean is the simplest way to obtain the marginal likelihood numerically, and it is consistent: given an infinite number of samples, it would converge to the correct value. The problem with this estimator is that it can have infinite variance, which may lead to wildly inaccurate estimates.
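Concretely, the harmonic mean estimator of the marginal likelihood averages the reciprocals of the likelihoods of the parameter values sampled from the posterior: \[ \hat{P}(D) = \left( \frac{1}{N} \sum_{i=1}^{N} \frac{1}{P(D|\theta_i)} \right)^{-1}, \qquad \theta_i \sim P(\theta|D) \] This is what we will compute below from the likelihoods recorded along the MCMC chain.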

Mathematical computation of the Bayes factor

The Bayes factor we want to compute is between the model with unknown probability of heads, and the model with probability of heads at 0.5. \[ BF = \frac{Marginal~Likelihood~model~with~unknown~probability}{Marginal~Likelihood~model~with~probability~0.5} = \frac{P_{M1}(D)}{P_{M2}(D)} \\ P_{M1}(D) = \int_0^1\frac{1}{B(\alpha, \beta)}p^{\alpha-1}(1-p)^{\beta-1} \binom{n}{k} p^k(1-p)^{n-k} dp \\ P_{M1}(D) = \frac{1}{B(\alpha, \beta)} \binom{n}{k} \int_0^1 p^{\alpha-1}(1-p)^{\beta-1} p^k(1-p)^{n-k} dp \\ P_{M1}(D) = \frac{1}{B(\alpha, \beta)} \binom{n}{k} \int_0^1 p^{k+\alpha-1}(1-p)^{n+\beta-k-1} dp \\ \] Here we can recognize that the integral could look a lot like a Beta distribution: \[ P_{M1}(D) = \frac{B(k+\alpha, n+\beta-k)}{B(\alpha, \beta)} \binom{n}{k} \int_0^1 \frac{p^{k+\alpha-1}(1-p)^{n+\beta-k-1} } {B(k+\alpha, n+\beta-k)} dp \]

The integral on the right has become the integral of a Beta density between 0 and 1, which means its value is 1!

\[ P_{M1}(D) = \frac{B(k+\alpha, n+\beta-k)}{B(\alpha, \beta)} \binom{n}{k} \] Then we can compute \(P_{M2}(D)\). Here all the probability mass is on the parameter value \(0.5\), so the marginal likelihood is just the Binomial probability of the counts given \(p=0.5\):

\[ P_{M2}(D) = \binom{n}{k} \frac{1}{2^n} \]

Thus, we have the following Bayes Factor:

\[ BF = \frac{P_{M1}(D)}{P_{M2}(D)} = {2^n} \frac{B(k+\alpha, n+\beta-k)}{B(\alpha, \beta)} \] Its value is:

\[ BF = {2^{300}} \frac{B(190+4, 300+4-190)}{B(4, 4)}={2^{300}} \frac{B(194, 114)}{B(4, 4)} \]

According to the analytical approach, the Bayes Factor is:

BF <- 2^300 * beta(194,114)/beta(4,4)
print(BF)
## [1] 5937.358
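Since \(2^{300}\) is a huge number and \(B(194, 114)\) a tiny one, a numerically safer route (useful for larger datasets, where the direct computation would overflow or underflow) is to work in log space with base R's lbeta function; this returns the same value as above:

# Same Bayes factor, computed in log space
log_BF <- 300*log(2) + lbeta(194, 114) - lbeta(4, 4)
exp(log_BF)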

There seems to be “very strong” support (cf. https://en.wikipedia.org/wiki/Bayes_factor) in favor of the model with the somewhat informative prior over the hypothesis that puts all probability mass on \(0.5\).

Numerical approach using the harmonic mean

Let’s compute the harmonic mean.

Marginal likelihood according to the fair coin \(P_{M2}(D)\)

We have seen the expression above:

P_M2 <- likelihood(sum(data), length(data), 0.5)
print(P_M2)
## [1] 9.772302e-07

Marginal likelihood according to the coin with unknown probability \(P_{M1}(D)\)

As written above, the harmonic mean estimator can be unstable, so it is good to run the estimator several times. Let’s do thirty replicates.

P_M1 <- c()
for (i in 1:30) {
  # Each replicate runs a fresh MCMC under the Beta(4,4) prior
  result <- MetropolisMCMC(sum(data), length(data), 4, 4, num_iterations)
  # Harmonic mean of the likelihoods sampled by the chain
  P_M1 <- c(P_M1, 1/mean(1/result[[3]]))
}

Now we can compute the Bayes factor in favor of the model with unknown probability of Heads:

summary(P_M1 / P_M2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0    3487    5912   17140
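Part of this variability is numerical: a single very small likelihood sampled by a chain (for instance near its random starting point) dominates mean(1/likelihood) and drives the estimate towards 0, and if a recorded likelihood underflows to exactly 0 the estimate collapses to 0. A log-sum-exp version is somewhat safer (a sketch, applied to the last replicate; a full fix would record log-likelihoods inside the sampler):

# Harmonic mean computed in log space via the log-sum-exp trick
log_lik <- log(result[[3]])  # still -Inf if a likelihood underflowed to 0
neg_log_lik <- -log_lik
m <- max(neg_log_lik)
log_P_M1 <- -(m + log(mean(exp(neg_log_lik - m))))
exp(log_P_M1)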

Let’s plot the results and compare the analytical estimate to our 30 replicates:

hist(P_M1/P_M2, xlab="Numerical estimate using the harmonic mean", freq=T, nc=15)
abline(v=BF, col="red", lwd=2)
legend("topleft", col="red", legend="Analytical estimate", lwd=2)

As we can see, the numerical estimate using the harmonic mean is quite variable. Other numerical estimators have been proposed, including thermodynamic integration and stepping-stone sampling. However, whenever a mathematical solution cannot be obtained, it remains difficult and time-intensive to obtain good estimates of Bayes factors.