When a model contains multiple parameters and allows hierarchical reparameterization, the Bayesian approach means working with the joint distribution of the parameters.
The hierarchical structure allows breaking the joint distribution into marginal distributions for the hyper-parameters (\(p(\omega), \ p(\omega \mid y)\)) and conditional distributions for the main parameters (\(p(\theta \mid \omega), \ p(\theta \mid \omega, y)\)).
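For example, by the chain rule the joint posterior factors into exactly these pieces:
\[ p(\theta, \omega \mid y) = p(\theta \mid \omega, y)\, p(\omega \mid y) \]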
The goal of the experiment is to estimate the conversion rate based on a trial of a marketing campaign with 100 participants.
Step 1. Define the model: the data are generated by a binomial distribution with parameter \(\theta\) \[ y_i \sim Binom(\theta) \ \text{or} \ Binom(s_i,\theta)\]
Step 2. Define the prior distribution for parameter \(\theta\) \[ \theta \sim Beta(\alpha, \beta)\]
Interpretation of the model: each member of the population has an individual conversion rate, i.e. the conversion rates come from a population described by a beta distribution.
Define the prior distribution with mode \(\omega\) = 5% and concentration \(\kappa\) = 100.
Use the formulas to calculate the parameters of the beta distribution from the mode and concentration: \[ \alpha = \omega(\kappa - 2) + 1, \ \beta = (1-\omega)(\kappa - 2) + 1, \ \kappa > 2\]
omega <- .05                  # prior mode
kappa <- 100                  # prior concentration
a <- omega*(kappa-2)+1        # shape1 of the beta prior
b <- (1-omega)*(kappa-2)+1    # shape2 of the beta prior
c(a=a,b=b)
## a b
## 5.9 94.1
Plot the prior density and find its mode.
x <- seq(from = 0, to = 1, by = .001)              # grid of theta values
priorBetaDens <- dbeta(x, shape1 = a, shape2 = b)  # prior density on the grid
(x.max <- x[which.max(priorBetaDens)])             # grid point with maximum density
## [1] 0.05
plot(x, priorBetaDens, type = 'l', lwd = 3)
abline(v = x.max)
Step 3. Obtain the sample
The sample is a sequence of 100 observations of zeros (no conversion) and ones (conversion). But it is enough to know the sufficient statistics: the number of successes \(k\) and the total number of observations \(s\).
(smple<-c(rep(0,times=92),rep(1,times=8)))  # 92 failures followed by 8 successes
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
(k<-sum(smple))
## [1] 8
(s<-length(smple))
## [1] 100
Step 4. Find the posterior distribution
Since the beta distribution is conjugate to the binomial likelihood, the posterior distribution is also beta.
Proof.
\[\begin{align*} p(\theta \mid y) &\propto p(y\mid\theta)\, p(\theta) \\ &= {s \choose k} \theta^k (1-\theta)^{s-k} \frac{1}{B(\alpha,\beta)} \theta^{\alpha-1} (1-\theta)^{\beta-1} \\ &\propto \theta^{\alpha+k-1} (1-\theta)^{\beta+s-k-1} \end{align*}\]
where \(k = \sum^s_{i=1} y_i\) is the number of successes and \(s\) is the total number of observations (in the general case \(y_i \sim Binom(s_i, \theta)\), \(s = \sum_i s_i\)).
Hence, we can calculate the parameters of the posterior distribution using
\[ \alpha_{posterior} = k + \alpha_{prior} \ \text{,} \ \beta_{posterior} = s - k + \beta_{prior}\]
postA <- k+a
postB <- s-k+b
c(postA=postA,postB=postB)
## postA postB
## 13.9 186.1
posteriorBetaDens <- dbeta(x,shape1=postA,shape2=postB)
The posterior distribution has a new mode that can be calculated by the formula \(\omega=\frac{\alpha - 1}{\alpha + \beta - 2}\) for \(\alpha, \beta>1\), or found directly from the plot of the density:
x[which.max(posteriorBetaDens)]
## [1] 0.065
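As a check, compute the mode directly from the formula using the posterior parameters (postMode is a new variable name introduced here for illustration):

(postMode <- (postA - 1)/(postA + postB - 2))
## [1] 0.06515152

The small discrepancy with the grid value 0.065 is just the 0.001 resolution of the grid.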
Plot prior and posterior distributions.
plot(x, priorBetaDens,type='l', lwd=3,col='blue',ylim=c(0,25), xlim=c(0,.2), ylab="Density of Theta")
lines(x,posteriorBetaDens,lwd=3,col='orange')
legend("topright",legend=c("prior","posterior"),col=c("blue","orange"),lwd=3)
Step 5. Make conclusions
The posterior distribution contains all information about the population of conversion rates.
The Bayesian point estimate of the conversion rate is the mean of the posterior distribution:
\[ E[X_B] = \mu = \frac{\alpha}{\alpha+\beta} \]
\[ V[X_B] = \sigma^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \]
Calculate the posterior mean:
(muPosterior <- postA/(postA+postB))
## [1] 0.0695
Calculate the posterior variance of the conversion rate:
(varPosterior<-postA*postB/(postA+postB)^2/(postA+postB+1))
## [1] 0.00032174
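The posterior standard deviation expresses this uncertainty on the same scale as the rate itself (sdPosterior is a new variable name introduced here for illustration):

(sdPosterior <- sqrt(varPosterior))
## [1] 0.01793711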
The new concentration is the sum of the posterior parameters:
(kappaPosterior<-postA+postB)
## [1] 200
The simplest hierarchical model: an excerpt from the example in Section 5.2 of [G].
Data \(y_{ij}\) are generated by a binomial model with parameter \(\theta_j\): patient \(i\) treated in hospital \(j\) has a binary success outcome. The parameters \(\theta_j\) are drawn from a prior distribution depending on hyper-parameter \(\phi\). So, the hierarchy is \(\phi \rightarrow \theta \rightarrow y\).
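A minimal R sketch simulating this hierarchy; the hyper-parameter values and sample sizes below are hypothetical illustrations, not taken from [G]:

set.seed(1)
nHospitals <- 8
nPatients <- 50                                  # patients per hospital (assumed)
phiA <- 2; phiB <- 18                            # hypothetical hyper-parameters phi
theta <- rbeta(nHospitals, phiA, phiB)           # theta_j drawn from the prior
y <- rbinom(nHospitals, size = nPatients, prob = theta)  # successes per hospital
rbind(theta = round(theta, 3), successes = y)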
For 8 selected states, the recorded divorce rates per 1,000 population in 1981 are: \(y=[y_1,y_2,y_3,y_4,y_5,y_6,y_7,y_8].\)
What can we say about \(y_8\)?
There is no information that allows distinguishing this observation from the others, so we have to assume that the joint prior distribution is exchangeable.
A prior distribution for \(y\) can be a beta or another distribution restricted to \([0,1]\) (for rates expressed as proportions rather than per 1,000). Unless we are familiar with U.S. divorce statistics, we should select a vague prior distribution.
Now 7 out of 8 states are randomly selected and those observations are given to us:
y.7<-c(5.8,6.6,7.8,5.6,7.0,7.1,5.4)
Again, what can we say about the remaining \(8^{th}\) state?
A reasonable posterior predictive distribution would be based on the sample, centered around 6.5, with most of its mass concentrated between 5.7 and 7.
summary(y.7)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.400 5.700 6.600 6.471 7.050 7.800
Putting the remaining state rate at any position within the sample does not change the joint distribution. The data are still exchangeable.
The next piece of prior information is: the 8 states are the Mountain states, i.e. Arizona, Colorado, Idaho, Montana, Nevada, New Mexico, Utah, Wyoming, selected in random order.
We still don’t know which divorce rate corresponds to which state.
With this information, before the 7 state rates are observed, the data should still be modeled exchangeably.
But the prior for all 8 numbers must be different: such considerations may result in wider tails of the prior distribution to allow for outliers.
After observing the rates for the 7 randomly selected states we notice that the numbers are close together.
It might mean that the missing state is either Utah or Nevada: Utah's large Mormon population makes an unusually low divorce rate plausible, while Nevada's lenient divorce laws make an unusually high rate plausible.
If that is true the posterior distribution might be bi-modal or even tri-modal.
But the joint distribution on all 8 states is still exchangeable.
Now, if we learn that the missing state is Nevada, we cannot think of the data as exchangeable even before the 7 state rates are revealed, since there is information distinguishing \(y_8\) from the other states: we expect it to be larger than the others.
After observing the 7 states we would expect the posterior distribution to have the property that \(p(y_8 > \text{max} (y_1,...,y_7) \mid y_1,...,y_7)\) is large.
Indeed, the divorce rate per 1,000 population in Nevada in 1981 was \(13.9\).
The model description that was considered earlier was:
\[ y_i \sim Binom(\theta) \]
\[ \theta \sim Beta(\alpha, \beta)\]
Parameters (\(\alpha, \beta\)) were transformed into mode and concentration for \(\alpha, \beta > 1\):
\[ \omega = \frac{\alpha - 1}{\alpha + \beta - 2} \]
\[ \kappa = \alpha + \beta \]
And the reverse transformation is
\[ \alpha = \omega(\kappa - 2) + 1\]
\[ \beta = (1 - \omega)(\kappa - 2) + 1, \ \kappa > 2\]
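As a quick check that these transformations are mutually inverse, recover the mode and concentration from the prior parameters computed earlier (\(a = 5.9\), \(b = 94.1\)):

c(omega = (a - 1)/(a + b - 2), kappa = a + b)
##  omega  kappa 
##   0.05 100.00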
Then parameters \(\omega, \kappa\) were assumed to be constants fixed from previous knowledge. The posterior distribution was estimated as \[ p(\theta \mid y) \propto Beta(\alpha + k, s - k + \beta)\text{,} \]
where \(s\) and \(k\) were the observed number of Bernoulli experiments and the number of successes in them.
To make the next step and create a hierarchical model, assume that the concentration \(\kappa\) is still constant but the mode \(\omega\) is random; so we need to define a prior and find the posterior distribution for it.
Let the prior distribution for \(\omega\) be \(Beta(A_{\omega},B_{\omega})\), where \(A_{\omega}\) and \(B_{\omega}\) are constants.
In such a case, a model with two parameters could be:
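A sketch of the resulting two-level model, combining the reparameterization above with the new prior on \(\omega\) (with \(\kappa\) still fixed):
\[ y_i \sim Binom(\theta) \]
\[ \theta \sim Beta\big(\omega(\kappa - 2) + 1, \ (1 - \omega)(\kappa - 2) + 1\big) \]
\[ \omega \sim Beta(A_{\omega}, B_{\omega}) \]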