This post is about binomial probability and how to use it in Bayesian data analysis. It is one of the simplest examples of Bayesian data analysis, simple enough that an exact mathematical analysis can be used.
The Bernoulli distribution gives the probability of one of two mutually exclusive outcomes (for example getting a head or a tail when flipping a coin). This is also why it is called a binomial probability: there are only two (bi-) possible outcomes. The Bernoulli distribution has the following formula:
\[ P(\gamma|\theta)=\theta^{\gamma} (1-\theta)^{(1-\gamma)} \]
where:
\(\gamma\) - the value of the discrete outcome (for example a head or a tail when flipping a coin); the quantity whose probability is being computed
\(\theta\) - the probability of a particular outcome (for example the probability of heads)
This formula gives the probability of each of the two discrete values of \(\gamma\) for any fixed value of \(\theta\). If we believe (or have data suggesting) that the coin is fair (\(\theta=0.5\)), then the probability of a head is \(P(\gamma=1|\theta=0.5)=0.5^{1}(1-0.5)^{(1-1)}=0.5\) and the probability of a tail is \(P(\gamma=0|\theta=0.5)=0.5^{0} (1-0.5)^{(1-0)}=0.5\).
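As a quick sanity check, here is a small sketch in R that evaluates the formula directly (the values are only for illustration); note that dbinom() with size = 1 gives the same Bernoulli probabilities:
# Evaluate the Bernoulli formula directly for a fair coin (theta = 0.5)
theta = 0.5
y = c(1, 0)                    # y stands for gamma above: 1 = head, 0 = tail
theta^y * (1 - theta)^(1 - y)  # 0.5 0.5
# Same result from base R: a binomial with size = 1 is a Bernoulli
dbinom(y, size = 1, prob = theta)  # 0.5 0.5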
We can also think of this function a little differently: instead of fixing \(\theta\) and asking about \(\gamma\), we can use it to estimate the parameter \(\theta\) given observed values of \(\gamma\). Viewed this way it is called the Bernoulli likelihood function. For a set of independent observations it becomes:
\[ P(\{\gamma_i\}|\theta)=\theta^{z} (1-\theta)^{(N-z)} \]
where:
\(\theta\) - the parameter whose value is to be estimated; it gives the probability of the binary outcome
N - number of observations (for example number of coin flips)
z - number of positive (or negative) outcomes (for example number of heads)
Now this is a likelihood function: a function that tells us, for each possible value of the parameter, how probable the observed data are.
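To see how the likelihood behaves as a function of \(\theta\), here is a small sketch (N and z are made-up values, only for illustration) that evaluates \(\theta^{z}(1-\theta)^{(N-z)}\) over a grid of \(\theta\) values; the likelihood peaks at the observed proportion z/N:
# Bernoulli likelihood of theta for N = 10 flips with z = 7 heads
N = 10
z = 7
theta = seq(0, 1, 0.01)
likelihood = theta^z * (1 - theta)^(N - z)
# The likelihood is highest at the observed proportion z/N
theta[which.max(likelihood)]  # 0.7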
But to use Bayesian data analysis via an exact mathematical analysis, we also need to express our prior beliefs (a probability density over \(\theta\)) in a form similar to the Bernoulli likelihood: \(\theta^a(1-\theta)^b\). One distribution that fits this description is the beta distribution:
\[ p(\theta|a,b)=beta(\theta|a,b)=\frac {\theta^{(a-1)}(1-\theta)^{(b-1)}}{B(a,b)} \]
where \(B(a,b)\) is a normalizing factor that makes the area under the beta density equal 1; in other words, \(B(a,b)\) is the beta function. It is important not to confuse the beta function with the beta distribution. The beta distribution is defined only on the interval [0,1], and the values \(a\) and \(b\) must be positive. The beta function is not a function of \(\theta\), because \(\theta\) has been integrated out: \(B(a,b)=\int_0^1 \theta^{(a-1)}(1-\theta)^{(b-1)}\,d\theta\) (I am not going into details; for a closer look see the references).
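In R the beta function is available as beta() in base R. Here is a quick sketch (the values a = 2 and b = 3 are only for illustration) checking that it really equals the integral of the unnormalized density:
# B(a, b) equals the integral of theta^(a-1) * (1-theta)^(b-1) over [0, 1]
a = 2
b = 3
beta(a, b)  # 1/12 = 0.0833...
integrate(function(theta) theta^(a - 1) * (1 - theta)^(b - 1), lower = 0, upper = 1)$value
Now let's have a closer look at how the shape of the beta distribution changes for different values of \(a\) and \(b\):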
# Compute beta densities on a grid of theta for different combinations of a and b
i = 1
df_list = list()
for(a in c(0.1, 1, 2, 3, 4)){
  for(b in c(0.1, 1, 2, 3, 4)){
    x = seq(0, 1, 0.01)
    y = dbeta(x, shape1=a, shape2=b)
    df_list[[i]] = data.frame(x=x, y=y,
                              group = paste("b =", b, ", a =", a))
    i = i + 1
  }
}

library(ggplot2)
library(dplyr)
df = bind_rows(df_list)  # rbind_all() has been removed from dplyr

# Plot one density per (a, b) combination
ggplot(df, aes(x, y)) +
  geom_area(fill="lightblue") +
  facet_wrap(~ group) +
  coord_cartesian(ylim=c(0,4)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.text.y = element_blank()) +
  xlab("Theta") +
  ylab("Probability density")
The plots might look like a weirdly shaped moving animal, but the idea is quite simple. In the coin flipping example we can think of the parameters \(a\) and \(b\) as the number of heads and tails seen so far. If we use a beta density as the prior, then the plots above show different prior beliefs we might have about a certain coin. If we don't have much previous information, the density is flat (as if we had seen one head and one tail): all \(\theta\) values are equally plausible. If we have some more previous information about the coin, we might think it is a relatively fair coin; this is the case of having seen 4 heads and 4 tails. If we had observed 100 heads and 100 tails, we would be much more certain about the prior, and the density would be quite narrow. Or we might have prior information that the coin is a trick coin (it almost always lands heads or almost always tails, but we don't know which); this is the case in the first plot (\(b\)=0.1, \(a\)=0.1). So we can use the beta distribution to quantify our prior knowledge about the coin.
The mean of the beta distribution is \(\mu=a/(a+b)\) and the mode is \(\omega=(a-1)/(a+b-2)\) for \(a\)>1 and \(b\)>1. The spread of the distribution (how wide the “bump” is) is related to the concentration parameter \(k=a+b\): the bigger \(a\) and \(b\) are, the bigger the concentration and the more certain we are about the prior. In the coin example, \(k\) is like the number of flips we have already seen. If we rearrange the formulas we get:
\[a=\mu k\] \[b=(1-\mu)k\] \[a=\omega(k-2)+1 \quad for\quad k>2\] \[b=(1-\omega)(k-2)+1 \quad for \quad k>2\]
One more way to compute parameters \(a\) and \(b\) is using mean \(\mu\) and standard deviation \(\sigma\): \[a=\mu \left(\frac{\mu(1-\mu)} {\sigma ^2}-1\right)\] \[b=(1-\mu)\left(\frac{\mu(1-\mu)}{\sigma ^2}-1 \right)\]
It is worth noting that the standard deviation should be used with care. The standard deviation of the uniform distribution (\(a\)=1, \(b\)=1) is about 0.28867; asking for a standard deviation larger than that no longer gives a single bump, but a density that piles up near 0 and 1.
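Here is a small sketch turning these formulas into R helper functions (the function names are my own, chosen only for illustration):
# Convert mode + concentration, or mean + standard deviation, into beta shape parameters
betaParamsFromModeK = function(omega, k) {
  stopifnot(k > 2, omega >= 0, omega <= 1)
  list(a = omega * (k - 2) + 1, b = (1 - omega) * (k - 2) + 1)
}
betaParamsFromMeanSD = function(mu, sigma) {
  # sigma^2 must be smaller than mu*(1-mu), otherwise a and b would be negative
  stopifnot(sigma^2 < mu * (1 - mu))
  common = mu * (1 - mu) / sigma^2 - 1
  list(a = mu * common, b = (1 - mu) * common)
}
betaParamsFromModeK(omega = 0.5, k = 100)    # a = 50, b = 50
betaParamsFromMeanSD(mu = 0.5, sigma = 0.1)  # a = 12, b = 12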
Why would we ever need these formulas? Because if our prior knowledge is expressed in terms of a mean (or mode) and a concentration, they give us the parameters \(a\) and \(b\) that software needs to compute the beta density, as in the sketch above. It is also worth noting that the mode is often more meaningful to use than the mean:
# Example beta density with its mean (red) and mode (dark blue) marked
a = 5.6
b = 1.4
x = seq(0, 1, 0.01)
y = dbeta(x, shape1=a, shape2=b)
density = data.frame(x=x, y=y)
mean = a/(a+b)        # mu = a/(a+b)
mode = (a-1)/(a+b-2)  # omega = (a-1)/(a+b-2)

library(ggplot2)
ggplot(density, aes(x, y)) +
  geom_area(fill="lightblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        axis.text.y = element_blank()) +
  geom_vline(xintercept = mean, colour="red") +
  geom_vline(xintercept = mode, colour="darkblue") +
  xlab("Theta") +
  ylab("Probability density")
In the plot above, the red line shows the mean and the dark blue line shows the mode. Because the distribution is skewed, the mode really is the most probable (most frequent) value; in skewed distributions the mean does not have that interpretation.
So far we have a beta prior and a Bernoulli likelihood, but what about the posterior? By Bayes' rule the posterior is proportional to the likelihood times the prior, and because both have the form \(\theta^{\dots}(1-\theta)^{\dots}\), the product is again a beta distribution: \[ p(\theta|z,N) \propto \theta^{z}(1-\theta)^{(N-z)}\,\theta^{(a-1)}(1-\theta)^{(b-1)} = \theta^{(z+a-1)}(1-\theta)^{(N-z+b-1)}, \] so the posterior is \(beta(\theta|a+z, b+N-z)\). This is why the beta distribution is called a conjugate prior for the Bernoulli likelihood. Let's work through an example. We have some prior knowledge:
# Function for computing prior, likelihood and posterior densities on a grid of theta
bayesBeta = function(k, mode, N, z) {
  # convert mode and concentration k into beta shape parameters
  a = mode*(k-2) + 1
  b = (1-mode)*(k-2) + 1
  Theta = seq(0, 1, length=1000)
  # prior: beta(theta | a, b), normalized over the grid
  prior = dbeta(Theta, a, b)
  prior = prior/sum(prior)
  data = data.frame(Theta, prior)
  data$prior_cumsum = cumsum(data$prior)
  # mark the central 95% (equal-tailed) interval
  data$prior_CI = ifelse(data$prior_cumsum < 0.025 | data$prior_cumsum > 0.975,
                         "outside CI", "inside CI")
  # Bernoulli likelihood of z heads in N flips, normalized over the grid
  data$likelihood = Theta^z * (1-Theta)^(N-z)
  data$likelihood = data$likelihood/sum(data$likelihood)
  data$likelihood_cumsum = cumsum(data$likelihood)
  data$likelihood_CI = ifelse(data$likelihood_cumsum < 0.025 | data$likelihood_cumsum > 0.975,
                              "outside CI", "inside CI")
  # posterior: beta(theta | a+z, b+N-z), thanks to conjugacy
  data$posterior = dbeta(Theta, a+z, b+N-z)
  data$posterior = data$posterior/sum(data$posterior)
  data$posterior_cumsum = cumsum(data$posterior)
  data$posterior_CI = ifelse(data$posterior_cumsum < 0.025 | data$posterior_cumsum > 0.975,
                             "outside CI", "inside CI")
  data
}
# Function for plotting the densities
plotting = function(data, ..., title) {
  library(ggplot2)
  ggplot(data, ...) +
    geom_bar(stat="identity", alpha=0.5) +
    theme_minimal() +
    theme(axis.text.y = element_blank()) +
    ylab("Probability density") +
    ggtitle(title) +
    labs(fill='95% credible interval')  # Bayesian interval, so "credible" rather than "confidence"
}
#compute prior, likelihood and posterior
data=bayesBeta(k=100, mode=0.5, N=10, z=2)
plotting(data=data, aes(x=Theta, y=prior, fill=prior_CI), title="Prior")
We have some knowledge that the coin is fair; maybe we have flipped this coin (or another coin from the same mint) before. Because k=100, it is as if we had flipped it about 100 times before our current experiment. Now let's say we flip it 10 times; our likelihood is the following:
plotting(data=data, aes(x=Theta, y=likelihood, fill=likelihood_CI), title="Likelihood")
To make it a little more interesting, this time only 2 heads came up in 10 flips. We might start to think that this coin is biased. Our posterior is:
plotting(data=data, aes(x=Theta, y=posterior, fill=posterior_CI), title="Posterior")
Because our prior is pretty certain (as if we had flipped the coin about 100 times and got 50 heads) and the likelihood is pretty wide (we flipped it only 10 times), the prior doesn't change very much and \(\theta=0.5\) is still very plausible. It is more likely that getting only two heads was just a random fluctuation (and the coin is not biased).
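We can double-check this with the conjugacy result from above: with k = 100 and mode = 0.5 the prior is Beta(50, 50), so:
# Posterior for the strong-prior case: Beta(50 + 2, 50 + 10 - 2) = Beta(52, 58)
qbeta(c(0.025, 0.975), 52, 58)  # 95% interval still contains theta = 0.5
(52 - 1) / (52 + 58 - 2)        # posterior mode, still close to 0.5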
Now let's see a different case, where we have only flipped a coin from the same mint 10 times before, but this time we flip it 100 times (and get only 20 heads):
data=bayesBeta(k=10, mode=0.5, N=100, z=20)
plotting(data=data, aes(x=Theta, y=prior, fill=prior_CI), title="Prior")
plotting(data=data, aes(x=Theta, y=likelihood, fill=likelihood_CI), title="Likelihood")
plotting(data=data, aes(x=Theta, y=posterior, fill=posterior_CI), title="Posterior")
Now the tables have turned. Our prior was not very "strong", but our likelihood is pretty narrow because there is more data in it than in the prior. So the prior does not change the posterior much: it is more likely that our prior belief was wrong than that the likelihood is far from the true value.
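The same check for this case (with k = 10 and mode = 0.5 the prior is Beta(5, 5)) shows how different the result is:
# Posterior for the weak-prior case: Beta(5 + 20, 5 + 100 - 20) = Beta(25, 85)
qbeta(c(0.025, 0.975), 25, 85)  # 95% interval no longer contains theta = 0.5
(25 - 1) / (25 + 85 - 2)        # posterior mode, close to the observed 20/100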
I hope you got some idea of what the beta distribution is and how to use it in Bayesian data analysis. If you didn't, then I suggest you write your own article about it; it will help.
Reference: Kruschke, John K. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan, 2nd Edition.