1 Summary of the approach

  • When model contains multiple parameters and allows hierarchical reparameterization, Bayesian approach means working with joint distribution of parameters.

  • Hierarchical structure allows breaking joint distribution into marginal distributions for hyper-parameters (p\((\omega), \ p(\omega \mid y)\)) and conditional distributions for main parameters (\(p(\theta \mid \omega), \ p(\theta \mid \omega,y)\))

2 Review of binomial model

The goal of the experiment is estimation of conversion rate based on trial of marketing campaign with 100 participants.

Step 1. Define model: data are generated by binomial distribution with parameter \(\theta\) \[ y_i \sim Binom(\theta) \ \text{or} \ Binom(s_i,\theta)\]

Step 2. Define prior distribution for parameter \(\theta\) \[ \theta \sim Beta(\alpha, \beta)\]

Interpretation of the model: each member of the population has individual conversion rate, i.e. conversion rates from a population with beta distribution.
Define prior distribution with mode \(\omega\) = 5% and concentration \(\kappa\) = 100.

Use formulas to calculate parameters of beta distribution from mode and concentration: \[ \alpha = \omega(\kappa - 2) + 1, \ \beta = (1-\omega)(\kappa - 2), \ \kappa > 2\]

omega <- .05
kappa <- 100
a <- omega*(kappa-2)+1
b <- (1-omega)*(kappa-2)+1
c(a=a,b=b)
##    a    b 
##  5.9 94.1

Plot the prior density and find its mode.

x <- seq(from = 0, to = 1, by = .001)
priorBetaDens <- dbeta(x,shape1 = a, shape2 = b)
(x.max <- x[which.max(priorBetaDens)])
## [1] 0.05
plot(x, priorBetaDens, type = 'l', lwd = 3)
abline(v = x.max)

Step 3. Obtain the sample

Sample is a sequence of 100 observations of zeros (no conversion) or ones (conversion). But it is enough to know sufficient statistics: number of successes k and total number of observations s.

(smple<-c(rep(0,times=92),rep(1,times=8)))
##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
(k<-sum(smple))
## [1] 8
(s<-length(smple))
## [1] 100

Step 4. Find posterior distribution

Since beta distribution is conjugate with binomial posterior distribution is also beta.

Proof.

\[\begin{align*} p(\theta \mid y) &\propto p(y\mid\theta).p(\theta) \\ &= {s \choose y_i} \theta^k (1-\theta)^{s-k} \frac{1}{B(\alpha,\beta)} \theta^{\alpha-1} (1-\theta)^{\beta-1} \\ &\propto \theta^{\alpha+k-1} (1-\theta)^{\beta+s-k-1} \end{align*}\]

actually where, \(k = \sum^s_{i=1} y_i\) and \(S = \sum^s_{i=1} s_i\)

Hence, we can calculate parameters of posterior distribution using

\[ \alpha_{posterior} = k + \alpha_{prior} \ \text{,} \ \beta_{posterior} = s - k + \beta_{prior}\]

postA <- k+a
postB <- s-k+b
c(postA=postA,postB=postB)
## postA postB 
##  13.9 186.1
posteriorBetaDens <- dbeta(x,shape1=postA,shape2=postB)

The posterior distribution has new mode that can be calculated by formula \(\omega=\frac{\alpha - 1}{\alpha + \beta - 2}\) for \(\alpha, \beta>1\), or directly from the plot of the density:

x[which.max(posteriorBetaDens)]
## [1] 0.065

Plot prior and posterior distributions.

plot(x, priorBetaDens,type='l', lwd=3,col='blue',ylim=c(0,25), xlim=c(0,.2), ylab="Density of Theta")
lines(x,posteriorBetaDens,lwd=3,col='orange')
legend("topright",legend=c("prior","posterior"),col=c("blue","orange"),lwd=3)

Step 5. Make conclusions

Posterior distribution contains all information about the population of conversion rates.
Point Bayesian estimate of conversion rate is the mean of the posterior distribution:

\[ E[X_B] = \mu = \frac{\alpha}{\alpha+\beta} \]

\[ V[X_B] = \sigma^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \]

Calculate posterior mean.

(muPosterior=postA/(postA+postB))
## [1] 0.0695

Calculate posterior variance of conversion rate:

(varPosterior<-postA*postB/(postA+postB)^2/(postA+postB+1))
## [1] 0.00032174

The new concentration

(kappaPosterior<-postA+postB)
## [1] 200

3 Exchangeability and sampling

Simplest hierarchical model, excerpt from Example in section 5.2 of [G].

Data \(y_{ij}\) are generated by binomial model with parameter \(\theta_j\): patient \(i\) treated in hospital \(j\) with binary success. Parameters \(\theta_j\) are populated from prior distribution depending on parameter \(\phi\). So, the hierarchy is \(\phi \rightarrow \theta \rightarrow y\).

3.1 Data

For selected 8 states recorded divorce rates per 1,000 population in 1981 are: \(y=[y_1,y_2,y_3,y_4,y_5,y_6,y_7,y_8].\)

What can we say about \(y_8\)?

There is no information that allows distinguishing this observation from others, so we have to assume that the joint prior distribution is exchangeable.

Prior distribution for y can be selected as beta or other distribution restricted to \([0,1]\). Unless we are familiar with the U.S. divorce statistics we should select a vague prior distribution.

3.2 Data on 7 states out of 8

Now 7 out of 8 states are randomly selected and those observations are given to us:

y.7<-c(5.8,6.6,7.8,5.6,7.0,7.1,5.4)

Again, what can we say about the remaining \(8^{th}\) state?

A reasonable predictive posterior distribution would be based on the sample and centered around 6.5 with most of its mass concentrated between 5.7 and 7.

summary(y.7)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.400   5.700   6.600   6.471   7.050   7.800

Putting the remaining state rate at any position within the sample does not change the joint distribution. The data are still exchangeable.

3.3 Known group of states

The next additional prior information is: the 8 states are the Mountain states, i.e. Arizona, Colorado, Idaho, Montana, Nevada, New Mexico, Utah, Wyoming selected in random order.
We still don’t know which divorce rate corresponds to which state.

With this information before the 7 state rates are observed the data should still be modeled exchangeably.

But the prior for all 8 numbers must be different:

  • Utah has large Mormon population which makes divorce rate lower than other states
  • Nevada has more liberal divorce laws which makes the rate higher than in other states

Such consideration may result in a wider tails of prior distribution to allow for outliers.

After observing the rates for random 7 states we notice that the numbers are close together.
It might mean that the missing state is either Utah or Nevada.
If that is true the posterior distribution might be bi-modal or even tri-modal.
But the joint distribution on all 8 states is still exchangeable.

3.4 Known missing state

Now, if we learn that the missing state is Nevada, we cannot think of data as exchangeable even before the 7 state rates are revealed, since there is information distinguishing \(y_8\) from other states: we expect it to be larger than others.

After observing the 7 states we would expect the posterior distribution to have the property that \(p(y_8 > \text{max} (y_1,...,y_7) \mid y_1,...,y_7)\) is large.

Indeed the divorce rate per 1,000 population in Nevada in 1981 was \(13.9\),

4 Binomial Model with hyperparameters

The model description that was considered earlier was:

\[ y_i \sim Binom(\theta) \]

\[ \theta \sim Beta(\alpha, \beta)\]

Parameters (\(\alpha, \beta\)) were transformed into mode and concentration for \(\alpha, \beta > 1\):

\[ \omega = \frac{\alpha - 1}{\alpha + \beta - 2} \]

\[ \kappa = \alpha + \beta \]

And the reverse transformation is

\[ \alpha = \omega(\kappa - 2) + 1\]

\[ \beta = (1 - \omega)(\kappa - 2) + 1, \ \kappa > 2\]

Then parameters \(\omega, \kappa\) were assumed constants fixed from previous knowledge. Posterior distribution was estimated as \[ p(\theta \mid y) \propto Beta(\alpha + k, s - k + \beta)\text{,} \]

where (\(s, k\)) were the observed number of Bernoulli experiments and the number of successes in them.

To make the next step and create a hierarchical model assume that concentration \(\kappa\) is still constant, but mode \(\omega\) is random, so, we need to define a prior and find posterior distribution for it.

4.1 High concentration, uncertain hyperparameter

Let the prior distribution for \(\omega\) be \(Beta(A_{\omega},B_{\omega})\), where \(A_{\omega}\) and \(B_{\omega}\) are constants.

A model with two parameters could in such case be:

4.2 Low concentration, more certain hyperparameter

5 References

  • Adaptive from lecture note UC Bayesian Methods
  • [G]elman, A., et al. (2013). Bayesian Data Analysis, Third Edition, 3rd Edition, CRC Press.
  • [K]ruschke, J. K. (2015). Doing Bayesian data analysis : a tutorial with R, JAGS, and Stan. Amsterdam, Academic Press is an imprint of Elsevier.