Note: Some of the solutions presented here have been reverse engineered from here.

domain          <- seq(-7,10,.02)
dens            <- 0.5*dnorm(domain,1,2) + 0.5*dnorm(domain,2,2)
plot (domain, dens, ylim=c(0,1.1*max(dens)),
type="l", xlab="y", ylab="", xaxs="i",
yaxs="i", yaxt="n", bty="n", cex=2)


p.theta.1       <- function(sigma) {
  res1          <- (0.5*dnorm(1,1,sigma)) /(sum(0.5*dnorm(1,c(1,2),sigma)))
  return(res1)
  }

Evaluating this function at \(\sigma = 2\) we get 0.5312. Note: even though we are summing over a discrete set of values of \(\theta\), \(y\) lives in a continuous space, so we evaluate the density function at \(y=1\) rather than point probabilities.

Table 1: Posterior probability of \(\theta = 1\), as a function of \(\sigma\)

| \(\sigma\) | \(p(\theta = 1 \mid y = 1)\) |
|------|--------|
| 0.25 | 0.9997 |
| 0.5  | 0.8808 |
| 1    | 0.6225 |
| 2    | 0.5312 |
| 4    | 0.5078 |
| 8    | 0.5020 |
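The numbers in the table can be reproduced (up to rounding) with the p.theta.1 function defined above:

# Evaluate the posterior probability of theta = 1 at each sigma in Table 1
sigmas          <- c(0.25, 0.5, 1, 2, 4, 8)
round(sapply(sigmas, p.theta.1), 4)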

Let’s define the following events:
* \(A:\) The participant chose the right box at the beginning.
* \(B:\) The host opens a particular box, among the unchosen ones, that turns out to be empty.
* \(C:\) Among the unchosen boxes the host chooses an empty box.

And let’s compute the probability of each of these events.
\[ \begin{align*} Pr(A) &= 1/3\\ Pr(C) &= 1/2\\ Pr(B) &= Pr(B|A)Pr(A) + Pr(B|\neg A)Pr(\neg A) = (1/2)*(1/3) + Pr(B|\neg A)*(2/3)\\ &= 1/6 + 2/3*(Pr(B|\neg A,C)Pr(C) + Pr(B|\neg A,\neg C)Pr(\neg C)) \\ &= 1/6 + 2/3*(1*(1/2) + 0*(1/2)) = 1/2 \end{align*} \] Using Bayes’ theorem, the probability of having chosen the right box from the beginning, conditional on an unchosen box being revealed as a losing one, is:

\[ Pr(A|B) = \frac{Pr(A)Pr(B|A)}{Pr(B)} = \frac{(1/3)*(1/2)}{1/2}=\frac{1}{3}\]

The participant’s chances are not equal across the remaining boxes! She is worse off staying with her original choice (a 33% probability of winning, rather than the 50% a naive update would suggest!).
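As a sanity check (my own addition, assuming the standard protocol in which the host deliberately opens an empty, unchosen box), a quick Monte Carlo simulation recovers the 1/3 probability of winning by staying, and the complementary 2/3 for switching:

# Monte Carlo check of the three-box problem
set.seed(123)
sims    <- 1e5
prize   <- sample(1:3, sims, replace=TRUE)        # box with the prize
choice  <- sample(1:3, sims, replace=TRUE)        # participant's initial choice
host    <- sapply(1:sims, function(s) {
  empty <- setdiff(1:3, c(prize[s], choice[s]))   # empty, unchosen boxes
  if (length(empty) == 1) empty else sample(empty, 1)
})
mean(choice == prize)                                   # staying wins  ~ 1/3
mean(sapply(1:sims, function(s)
  setdiff(1:3, c(choice[s], host[s])) == prize[s]))     # switching wins ~ 2/3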

More generally, if there were \(n\) boxes in total and \(i\) boxes were revealed, the wrong way of updating the probabilities (\(1/(n-i)\)) and the Bayesian update (\(\frac{i+n*(n-1-i)}{n*(n-i)*(n-i-1)}\)) differ significantly as \(i \rightarrow n\). For example, the following graph plots both probabilities of winning in a contest with 50 boxes as the host opens \(i\) boxes.
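A minimal sketch (my own code, not necessarily the one behind the original figure) that plots both expressions for \(n = 50\):

# Naive update 1/(n-i) vs. Bayesian update for a contest with n = 50 boxes
n         <- 50
i         <- 0:(n-2)                              # boxes opened by the host
naive     <- 1/(n - i)
bayes     <- (i + n*(n - 1 - i)) / (n*(n - i)*(n - i - 1))
plot (i, naive, type="l", lty=2, ylim=c(0,1),
      xlab="Boxes opened by the host (i)", ylab="Probability of winning")
lines(i, bayes)
legend("topleft", legend=c("Naive update", "Bayesian update"), lty=c(2,1))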


Looking at the graph, it seems that the advantages of thinking in a Bayesian fashion are certainly parameter-specific. Also notice that the player here follows a “stubborn” strategy; I suspect that if she changed boxes in an optimal way the improvement in her chances would be slightly smaller. Maybe that is the reason why we don’t think in a Bayesian fashion all the time.


domain <- seq(0,1,.01)
dens = apply(sapply(0:2,function(x) choose(10,x)*dbeta(domain,4+x,4+10-x)),1,sum)
plot(domain, dens, type="l")


Note: a good reminder of the main conjugacy relationships can be found here

#Data:
school.id       <- LETTERS[1:8]
effect          <- c(28,8 ,-3,7 ,-1,1 ,18,12)
se.effect       <- c(15,10,16,11,9 ,11,10,18) 

pool.est        <- sum(effect*se.effect^-2)/sum(se.effect^-2)
pool.var        <- sum(se.effect^-2)^-1
pool.ci         <- c(-1.96,1.96)*pool.var^.5 + pool.est

The pooled estimated effect and variance are 7.69 and 16.58, with a 95% CI of [-0.3, 15.67].

Posterior simulation under the hierarchical model
Using the identity:
\[ \begin{align} p(\theta,\mu,\tau|y) = p(\tau|y)p(\mu|\tau,y)p(\theta|\mu,\tau,y) \end{align} \] And the results from BDA in equation 5.17, 5.20 and 5.21 we code the joint posterior:

# Eqn 5.17 of BDA3
post.theta.j    <- function(mu,tau,j) (effect[j]/(se.effect[j]^2) + mu/(tau^2)) / (1/(se.effect[j]^2) + 1/(tau^2)) 
post.v.theta.j  <- function(tau,j) 1 / (1/(se.effect[j]^2) + 1/(tau^2)) 
# Eqn 5.20 of BDA3
post.mu.hat     <- function(tau) sum(effect*1/(se.effect^2 +tau^2))/sum(1/(se.effect^2 +tau^2))
post.v.mu       <- function(tau) (sum(1/(se.effect^2 +tau^2)))^-1

# Eqn 5.21 of BDA3
marginal.tau    <- function(tau) {
  hyper.prior(tau) * (post.v.mu(tau)^.5) * prod(((se.effect^2 + tau^2)^(-1/2)) * 
                  exp(-((effect - post.mu.hat(tau))^2) / (2 * (se.effect^2 + tau^2))))
}

#testing alternative
marginal.tau1   <- function(tau) {
    marg.post   <- 1
        for (i in 1:length(effect)) {
      marg.post   <- ((se.effect[i]^2 + tau^2)^(-1/2)) * 
                 exp(-((effect[i] - post.mu.hat(tau))^2) / (2 * (se.effect[i]^2 + tau^2))) *
                 marg.post    
        return(hyper.prior(tau) * (post.v.mu(tau)^.5) * marg.post)
    }
} 

Define a hyper-prior and draw 1000 samples from each distribution (for all 8 schools).

set.seed(142857)
samps           <- 1000 

hyper.prior     <-  function(tau) 1
tau.grid        <-  seq(0.001,40, length=samps)
pdf.tau         <-  sapply(tau.grid,function(x) marginal.tau(x))
pdf.tau         <-  pdf.tau/sum(pdf.tau)

s.tau           <- sample(tau.grid,samps,prob=pdf.tau, replace=TRUE)
s.mu            <- sapply(s.tau,function(x) rnorm(1,post.mu.hat(x),(post.v.mu(x))^0.5))
s.theta         <- NULL 
for (j in 1:length(school.id)) {
  s.theta[[j]]         <- sapply(1:samps, 
                            function(x) 
                            rnorm(1,
                                  post.theta.j(s.mu[x],s.tau[x],j),
                                  (post.v.theta.j(s.tau[x],j))^0.5
                                  ) )
  }
par(mfrow=c(1,1))  
par(mar = rep(2, 4))
plot(tau.grid,pdf.tau, type="l", main="Figure 5.5 from BDA3", xlab=expression(tau), ylab="Density")


The sampling method in BDA3 suggests applying the inverse-CDF method to the posterior of \(\tau\). I don’t do this for two reasons: (i) I’m not sure the posterior has a closed-form solution for its inverse, and (ii) given that I already have the density over a grid, I can draw directly from that distribution using the sample command (which leads me to think that this command applies the inverse method, but I need to check with Susan).
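For reference, a minimal sketch of what the inverse-CDF approach would look like on the same grid (my own addition, reusing tau.grid, pdf.tau and samps from above):

# Inverse-CDF sampling on the grid: draw u ~ U(0,1) and take the first grid
# point whose cumulative probability reaches u
cdf.tau         <- cumsum(pdf.tau)
s.tau.inv       <- tau.grid[sapply(runif(samps), function(u) which(cdf.tau >= u)[1])]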

s.theta         <- matrix(unlist(s.theta), ncol = 8, byrow = FALSE)
s.theta.sort    <- apply(s.theta, 2, sort)
p               <- t( apply(s.theta.sort, 2, function(x) quantile(x,c(.025,.25,.5, .75, .975),type=1)) )
p               <- round(p,3) 

Table 5.3 from BDA3:

| School | 2.5% | 25% | median | 75% | 97.5% |
|---|---|---|---|---|---|
| A | -1.517 | 6.275 | 10.086 | 15.274 | 33.036 |
| B | -4.568 | 3.624 | 7.682 | 11.594 | 20.251 |
| C | -10.038 | 1.954 | 6.699 | 10.697 | 19.491 |
| D | -5.786 | 3.827 | 7.539 | 11.633 | 20.098 |
| E | -8.444 | 1.178 | 5.309 | 9.039 | 16.515 |
| F | -8.304 | 2.325 | 6.373 | 10.336 | 18.247 |
| G | -0.422 | 6.026 | 10.192 | 14.357 | 25.733 |
| H | -6.794 | 4.223 | 8.296 | 12.752 | 25.081 |

Here we reproduce figure 5.8 (with the same problems as above)

par(mfrow=c(1,2))  
par(mar = rep(2, 4))
domain           <- c(-20,60)
hist(s.theta[,1], breaks=10, xlab="Effect in School A", main="", xlim=domain)
hist(apply(s.theta,1,max), breaks=10, xlab="Largest Effect", main="", xlim=domain)
title(main="Figure 5.8 from BDA3")


This last figure (“largest effect”) is a good example of one of the main advantages of a fully Bayesian hierarchical model: once we have correctly simulated the posterior, we can test all kinds of complicated hypotheses.

aux1            <- apply(s.theta,1,max)              # largest effect in each draw
best            <- apply(1*(s.theta==aux1), 2,mean)  # share of draws in which each school attains the maximum

Table 2: Probability that each coaching program is the best among the eight schools

| School | Probability of having the best coaching program |
|---|---|
| A | 0.231 |
| B | 0.113 |
| C | 0.094 |
| D | 0.093 |
| E | 0.065 |
| F | 0.068 |
| G | 0.21 |
| H | 0.126 |

# p[j,k]: fraction of draws in which school j's effect exceeds school k's
p               <- sapply(1:8,function(y) sapply(1:8,function(x) mean(1*(s.theta[,x]>s.theta[,y]))))

Table 2: Probability that school \(j\) (row) has a better program than school \(k\) (column)

| School \(j\) / School \(k\) | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| A | 0 | 0.632 | 0.672 | 0.626 | 0.744 | 0.707 | 0.5 | 0.613 |
| B | 0.368 | 0 | 0.578 | 0.512 | 0.611 | 0.555 | 0.377 | 0.468 |
| C | 0.328 | 0.422 | 0 | 0.443 | 0.557 | 0.505 | 0.318 | 0.421 |
| D | 0.374 | 0.488 | 0.557 | 0 | 0.622 | 0.574 | 0.368 | 0.467 |
| E | 0.256 | 0.389 | 0.443 | 0.378 | 0 | 0.445 | 0.268 | 0.346 |
| F | 0.293 | 0.445 | 0.495 | 0.426 | 0.555 | 0 | 0.32 | 0.417 |
| G | 0.5 | 0.623 | 0.682 | 0.632 | 0.732 | 0.68 | 0 | 0.59 |
| H | 0.387 | 0.532 | 0.579 | 0.533 | 0.654 | 0.583 | 0.41 | 0 |

Right way to do it:
\[ \begin{align} p(\theta_{j}>max_{i\neq j}\{\theta_{i}\}) &= \int \prod_{i\neq j} p(\theta_{j}>\theta_{i}) \phi(\theta_{j}|y_{j},\sigma_{j})d\theta_{j} \\ &= \int \prod_{i\neq j} \Phi\left(\frac{\theta_{j} - \theta_{i}}{\sigma_{i}}\right) \phi(\theta_{j}|y_{j},\sigma_{j})d\theta_{j} \end{align} \]

This integral has to be solved numerically:

set.seed(142857)
best            <-  sapply(1:8,
                           function(y) mean(sapply(1:1000   ,
                           function(x) prod(pnorm(( rnorm( 1,effect[y],se.effect[y] ) - effect[-y]) / se.effect[-y] )))
                           )
                           )
# Ad-hoc normalization:
best            <- best/sum(best)  

Table 2.1: Probability that each coaching program is the best among the eight schools (with \(\tau = \infty\))

| School | Probability of having the best coaching program |
|---|---|
| A | 0.5599 |
| B | 0.033 |
| C | 0.0265 |
| D | 0.0364 |
| E | 0.0034 |
| F | 0.0136 |
| G | 0.1615 |
| H | 0.1656 |

The following table presents the different values for the expression above:

p               <- sapply(1:8,function(x) 
                   sapply(1:8,function(y) 
                  pnorm( q = 0, mean = (effect[x] - effect[y]) / sqrt(se.effect[x]^2 + se.effect[y]^2) , 
                  sd = 1 )
                  ) )  
# Force all elements on the diagonal to zero (before this step the diagonal equals pnorm(0) = 0.5). 
p               <- p - .5 * diag(8)

Table 2.1: Probability that school \(j\) (row) has a better program than school \(k\) (column), with \(\tau = \infty\)

| School \(j\) / School \(k\) | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| A | 0 | 0.8664 | 0.9212 | 0.8705 | 0.9513 | 0.9267 | 0.7105 | 0.7527 |
| B | 0.1336 | 0 | 0.7201 | 0.5268 | 0.7482 | 0.6811 | 0.2398 | 0.423 |
| C | 0.0788 | 0.2799 | 0 | 0.3033 | 0.4566 | 0.4184 | 0.1329 | 0.2667 |
| D | 0.1295 | 0.4732 | 0.6967 | 0 | 0.7132 | 0.6501 | 0.2297 | 0.4063 |
| E | 0.0487 | 0.2518 | 0.5434 | 0.2868 | 0 | 0.444 | 0.0789 | 0.2591 |
| F | 0.0733 | 0.3189 | 0.5816 | 0.3499 | 0.556 | 0 | 0.1264 | 0.301 |
| G | 0.2895 | 0.7602 | 0.8671 | 0.7703 | 0.9211 | 0.8736 | 0 | 0.6146 |
| H | 0.2473 | 0.577 | 0.7333 | 0.5937 | 0.7409 | 0.699 | 0.3854 | 0 |

#Load data
library(dplyr)         # for select()
library(nleqslv)       # for nleqslv(), used below to solve the moment-matching equations
data             <- read.csv("C:/Users/fhocesde/Documents/tutorials/Bayesian/bicycles.txt", header=T, sep = "")
data             <- data[1,]
y                <- as.numeric(select(data,b1:b10))
n                <- as.numeric(select(data, b1:b10)+select(data,v1:v10))

\[ \begin{align} p(\theta,\alpha,\beta|y) &\propto p(\alpha, \beta)p(\theta|\alpha, \beta)p(y|\theta,\alpha, \beta)\\ p(\theta,\alpha,\beta|y) &\propto p(\alpha, \beta)\prod^{J}_{j=1} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta_{j}^{\alpha - 1} (1 - \theta_{j})^{\beta - 1}\prod^{J}_{j=1} \theta_{j}^{y_{j}} (1 - \theta_{j})^{n_{j}-y_{j}}\label{rat.joint.post1} \end{align} \]

Now we compute the posterior marginal of \((\alpha,\beta)\). Given that we do have a closed form solution in step 2, we compute the ratio of (\ref{rat.joint.post1}) and (\ref{rat.cond.post.theta1}). \[ \begin{align} p(\alpha,\beta|y) &\propto p(\alpha, \beta)\prod^{J}_{j=1} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \frac{\Gamma(\alpha+y_{j})\Gamma(\beta+n_{j}-y_{j})}{\Gamma(\alpha + \beta +n_{j})} \label{rat.marg.post.phi1} \end{align} \]

\[ \begin{align} \hat{\mu} &= 0.1961 = \frac{\hat{\alpha_{0}}}{\hat{\alpha_{0}}+\hat{\beta_{0}}}\\ \hat{\sigma^2} &= 0.0111 = \frac{\hat{\alpha_{0}}\hat{\beta_{0}}}{(\hat{\alpha_{0}}+\hat{\beta_{0}})^{2}(\hat{\alpha_{0}}+\hat{\beta_{0}}+1)} \end{align} \]

Solving for \((\hat{\alpha_{0}},\hat{\beta_{0}})\):

#Here 'x' represents alpha and beta
dslnex          <- function(x) {
    z           <- numeric(2)
    z[1]        <- x[1]/(x[1]+x[2]) - mean(y/n)
    z[2]        <- x[1]*x[2]/(((x[1]+x[2])^2)*(x[1]+x[2]+1)) - sd(y/n)^2
    z
}

sol1            <- nleqslv(c(1,1), dslnex) 
res1            <- paste("(",round(sol1$x[1],1), ",", round(sol1$x[2],1), ")",sep="")

We get: \((\hat{\alpha_{0}},\hat{\beta_{0}}) = (2.6,10.6)\).

We center the grid (approximately) around that initial estimate and expand the grid to cover up to a factor of 4 of each parameter. The result is plotted in the following figure:

rat.marg.post.phi =  function(alpha, beta) {
  post    = 1
  #notice the censoring in n
  for (i in 1:length(y)) {
    if (n[i] > 100) n[i] = 100
    post  = post * ( ( ( gamma(alpha + beta) ) / ( gamma(alpha) * gamma(beta) ) ) * 
            ( ( gamma(alpha + y[i] ) * gamma(beta + n[i] - y[i]) )/( gamma(alpha + beta + n[i]) ) ) )
  }
  # The hyper prior is defined below
  rat.hyper.prior(alpha,beta) * post
}

rat.hyper.prior   = function(alpha,beta) {
  alpha*beta*(alpha + beta)^(-5/2)
}


v1        = seq(log(sol1$x[1]/sol1$x[2])*1.5,log(sol1$x[1]/sol1$x[2])/1.5,length.out =151)
v2        = seq(log(sol1$x[1]+sol1$x[2])/1.5,log(sol1$x[1]+sol1$x[2])*1.5,length.out =151)
beta      = exp(v2)/(exp(v1)+1)
alpha     = exp(v2+v1)/(exp(v1)+1)

post.dens = outer(alpha,beta,function(x1,x2) log(rat.marg.post.phi(x1, x2)) )
post.dens = exp(post.dens - max(post.dens))
post.dens = post.dens/sum(post.dens)

contours <- seq(min(post.dens), max(post.dens) , length=10)
contour(v1, v2, post.dens,levels=contours, xlab=expression(log(alpha/beta)), ylab=expression(log(alpha+beta)), xlim=c(min(v1),max(v1)) , ylim=c(min(v2),max(v2)), drawlabels=FALSE, main="Contour plot of joint posterior")


Adjust the grid and repeat:

v1        = seq(log(sol1$x[1]/sol1$x[2])*1.5,log(sol1$x[1]/sol1$x[2])/1.5,length.out =151)
v2        = seq(log(sol1$x[1]+sol1$x[2])/3,log(sol1$x[1]+sol1$x[2])*1.5,length.out =151)
beta      = exp(v2)/(exp(v1)+1)
alpha     = exp(v2+v1)/(exp(v1)+1)

post.dens = outer(alpha,beta,function(x1,x2) log(rat.marg.post.phi(x1, x2)) )
post.dens = exp(post.dens - max(post.dens))
post.dens = post.dens/sum(post.dens)

contours <- seq(min(post.dens), max(post.dens) , length=10)
contour(v1, v2, post.dens,levels=contours, xlab=expression(log(alpha/beta)), ylab=expression(log(alpha+beta)), xlim=c(min(v1),max(v1)) , ylim=c(min(v2),max(v2)), drawlabels=FALSE, main="Contour plot of joint posterior")


Draw samples \((\alpha^{s}, \beta^{s})\) from \(p(\alpha,\beta|y)\) (finally!). Here we repeat the procedure used in section 3.(v) of the book replication document.

samps       = 1000
v1.dens     = apply(post.dens ,1, sum)
s.v1        = sample(v1,samps, replace=TRUE, prob = v1.dens)

#Select the row of the joint density corresponding to a specific value of v1 (p(v2|v1))
cond.v2     = function(x) {
  post.dens[which(v1 == s.v1[x]),]
}
#Sample a value of v2 according to the conditional probability above
s.v2        = sapply(1:samps,function(x) sample(v2,1,replace=TRUE,prob=cond.v2(x)))

#Add a uniform random jitter centered at zero with width equal to the grid spacing. This will make the simulation draws more continuous. Plot the sampled values.  
grid.v1     = v1[2]-v1[1]
grid.v2     = v2[2]-v2[1]
s.v2        = s.v2 + runif(length(s.v2),-grid.v2/2,grid.v2/2)
s.v1        = s.v1 + runif(length(s.v1),-grid.v1/2,grid.v1/2)
plot(s.v1, s.v2, xlab=expression(log(alpha/beta)^s), ylab=expression(log(alpha+beta)^s), xlim=c(min(v1),max(v1)) , ylim=c(min(v2),max(v2)), main="Scatter Plot of Sample Draws of log(alpha/beta) and log(alpha+beta)")


By applying the inverse of the transformation we recover the marginal distribution of the original hyper-parameters.

s.beta      = exp(s.v2)/(exp(s.v1)+1)
s.alpha     = exp(s.v2+s.v1)/(exp(s.v1)+1)
theta.dist  = sapply(1:10, function(x) rbeta(1000,s.alpha+y[x], s.beta + n[x] - y[x]))
theta.dist  = apply(theta.dist,2,sort)
plot(0:600/1000,0:600/1000, type="l", xlab="Observed rate",ylab="95% CI and median of posterior")
jitter.x    = y/n + runif(length(y),-0.01,0.01)
points(jitter.x,theta.dist[500,])
segments(jitter.x,theta.dist[25,], jitter.x,theta.dist[975,] )
title(main="Posterior Distribution of Bike Rates for all 10 streets")


The estimated proportions are almost the same as the raw proportions (no shrinkage).
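A quick check of that claim (my own addition), comparing the raw rates with the posterior medians computed above:

# Raw rates vs. posterior medians; the two columns are nearly identical
round(cbind(raw = y/n, post.median = theta.dist[500, ]), 3)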

* 5.13d  We generate 1000 draws from a \(Beta(\alpha^{s},\beta^{s})\) where the parameters come from the draws obtained above:
s.theta         <- rbeta(1000, shape1 =s.alpha , shape2 = s.beta)   
CI.num          <- round(s.theta[order(s.theta)][c(25,975)],2)
CI.str          <- paste("(" , CI.num[1] , "," , CI.num[2] , ")")

The resulting 95% posterior interval is \((0.01, 0.48)\).


Replicating sections of the book

1 - Replicating example from pg 66-67

In Ch. 3 of BDA they present a first simulation exercise (pg 66-67): from a sample of 66 observations with mean 26.2 and standard deviation 10.8, they simulate the posterior of the mean assuming a normal likelihood and a noninformative prior (where \(\sigma^{2}|y\sim Inv-\chi^{2}(n-1,s^{2})\) and \(\mu|\sigma^{2},y \sim N(\bar{y},\sigma^{2}/n)\)).

# Data from the example: 66 observations with mean 26.2 and sd 10.8
n             = 66
ybar          = 26.2
se            = 10.8
# Draw 1000 random numbers from an Inverse chi-squared, then mu given sigma^2
f.sim.sigma   = function(draws) (n-1)*(se^2)/rchisq(draws,n-1)
f.sim.norm    = function(draws) rnorm(draws,ybar,(f.sim.sigma(draws)/n)^.5)
sim.norm      = f.sim.norm(1000)
ci.mu         = sim.norm[order(sim.norm)][c(25,975)]
print(ci.mu)
## [1] 23.64 28.84
# And now repeating the simulation exercise R = 100 times:
R             = 100
R.sim.norm    = sapply(1:R, function(x) f.sim.norm(1000))
ci.mu         = t(apply(R.sim.norm,2,function(x) x[order(x)][c(25,975)]))
print(c(median_low=median(ci.mu[,1]), median_high=median(ci.mu[,2]), p10_low =quantile(ci.mu[,1],c(.1),type=1), p90_up =quantile(ci.mu[,2],c(.9),type=1) ))
##  median_low median_high p10_low.10%  p90_up.90% 
##       23.52       28.88       23.38       29.02

2 - Simulating values from a posterior normal with \(\sigma^{2}\) and \(\mu\) unknown and conjugate prior.

The following is not an exercise in the book (that I’m aware of), but I thought it would be interesting to work on this. After going over the model derived in pg 67-69 of BDA3, I will simulate values from the joint posterior (and hopefully leave the code flexible enough to play later on with the hyper-parameters).

With the prior density: \[ \begin{align} p(\mu,\sigma^{2}) &\propto \sigma^{-1}(\sigma^{2})^{-(\nu_{0}/2 + 1)}exp\left ( -\frac{1}{2\sigma^{2}} [\nu_{0}\sigma^{2}_{0} + \kappa_{0}(\mu_{0} - \mu)^{2} ]\right ) \end{align} \] And a likelihood: \[ \begin{align} p(y|\mu,\sigma^{2}) &\propto \sigma^{-n}exp\left ( -\frac{1}{2\sigma^{2}} [(n-1)s^{2} + n(\bar{y} - \mu)^{2} ]\right ) \end{align} \]

We obtain the following joint posterior: \[ \begin{align} p(\mu,\sigma^{2}|y) &\propto \sigma^{-1}(\sigma^{2})^{-(\nu_{0}/2 + 1)}exp\left ( -\frac{1}{2\sigma^{2}} [\nu_{0}\sigma^{2}_{0} + \kappa_{0}(\mu_{0} - \mu)^{2} ]\right ) \times \nonumber \\ &\times (\sigma^{2})^{-n/2}exp\left ( -\frac{1}{2\sigma^{2}} [(n-1)s^{2} + n(\bar{y} - \mu)^{2} ]\right ) \nonumber\\ &= N-Inv-\chi^{2}(\mu_n,\sigma^{2}_{n}/\kappa_{n};\nu_{n},\sigma^{2}_{n})\\ \mu_{n} &= \frac{\kappa_{0}}{\kappa_{0} + n} \mu_{0} + \frac{n}{\kappa_{0} + n}\bar{y} \nonumber\\ \kappa_{n} &= \kappa_{0} + n \nonumber\\ \nu_{n} &= \nu_{0} + n \nonumber\\ \nu_{n}\sigma_{n}^{2} &= \nu_{0}\sigma_{0}^{2} + (n-1)s^{2} + \frac{\kappa_{0}n}{\kappa_{0} + n} (\bar{y} - \mu_0)^{2}. \nonumber \end{align} \]

Using the fact that \(p(\mu,\sigma^{2}|y) = p(\mu|\sigma^{2},y)p(\sigma^{2}|y)\) with:
\[ \begin{align} \mu|\sigma^{2},y &\sim N(\mu_{n},\sigma^{2}/\kappa_{n}) \label{post.mu}\\ \sigma^{2}|y &\sim Inv-\chi^{2}(\nu_{n},\sigma^{2}_{n}) \label{post.sig}\\ \end{align} \] The simulation algorithm is as follows:
* Draw a random number from the marginal posterior of \(\sigma^{2}\) (equation \ref{post.sig})
* Using that value of \(\sigma^{2}\), draw a random number from the conditional posterior of \(\mu\) (equation \ref{post.mu})

The following code implements this algorithm.

# Hyperparameters
p.mu_0      = 0
p.sigma2_0  = 1
p.nu_0      = 4
p.kappa_0   = 3

# Data
n           = 10
y_bar       = 1.3
s           = 2

# Simulation
mu_n        = ((p.kappa_0)/(p.kappa_0 + n)) * p.mu_0 + ((n)/(p.kappa_0 + n)) * y_bar
kappa_n     = p.kappa_0 + n
nu_n        = p.nu_0 + n
sigma2_n    =  (p.nu_0*p.sigma2_0 + (n-1)*s^{2} + ((p.kappa_0*n)/(p.kappa_0 + n)) *  (y_bar - p.mu_0)^{2})/nu_n

set.seed(142857)
# Draw sigma^2 from its Inv-chi^2 marginal posterior, then mu from its conditional posterior
f.mu.post  = function(draws) {
  sigma2.cond = (nu_n)*(sigma2_n)/rchisq(draws,nu_n)
  mu.cond     = rnorm(draws,mu_n,(sigma2.cond/(kappa_n))^.5)
  return(mu.cond)
  }
sim.mu.post= f.mu.post(1000)
plot(density(sim.mu.post), main = "Draws from the marginal of the mean", col="red")
lines(density(mu_n +((sigma2_n/kappa_n)^.5)*rt(1000,nu_n)))
legend("topleft",col = c("red","black") ,legend=c("Simulated","Analytical"),lty = c(1,1)) 


The plot above should really be in 3D (\(\sigma^{2}, \mu, p(\sigma^{2}, \mu|y)\)), but I still don’t know how to do that in R.
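One possible way to do it (my own sketch, reusing mu_n, kappa_n, nu_n and sigma2_n from above; the grid ranges are chosen by eye) is to evaluate the unnormalized \(N\)-\(Inv\)-\(\chi^{2}\) density on a grid and pass it to persp() or contour():

# Sketch: 3D view of the joint posterior p(mu, sigma^2 | y)
joint.post  = function(mu, sigma2) {
  (sigma2^(-(nu_n/2 + 1) - 1/2)) *
    exp(-(nu_n*sigma2_n + kappa_n*(mu - mu_n)^2) / (2*sigma2))
}
mu.grid     = seq(-1, 3, length.out = 100)
sig2.grid   = seq(0.5, 8, length.out = 100)
dens.grid   = outer(mu.grid, sig2.grid, joint.post)
persp(mu.grid, sig2.grid, dens.grid, theta = 30, phi = 30,
      xlab = "mu", ylab = "sigma^2", zlab = "density")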

It also happens that for this particular problem there is a closed-form solution for the marginal of \(\mu\) (remember that \(p(\mu|y) = \int p(\mu|\sigma^{2},y) p(\sigma^{2}|y)d\sigma^{2}\)):

\[ \begin{align} \mu|y &\sim t_{\nu_{n}}(\mu|\mu_{n}, \sigma^2_{n}/\kappa_{n}) \end{align} \]

I’m still confused by the following: why is it that to simulate the marginal of \(\mu\) (\(\mu|y\)) I just need to draw from the conditional posterior of the mean (equation \ref{post.mu})? More generally, my confusion is this: why is it that to simulate \(h(x) = \int g(x) dF(x)\) I just need to draw \(x^{s}\) from \(dF(x)\) and evaluate \(g(x^{s})\)? I know from MC simulation that by the LLN \(E_{dF}[g(x)] = \int g(x) dF(x) \approx \sum_{s}g(x^{s})/S\), but I’m having trouble seeing why this applies to all points in the distribution (and not just the mean).

My own explanation so far: as MC can be used to approximate the mean using the LLN, the same technique can be used to approximate the whole distribution using CLT.
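A sketch of a more direct argument (my own note): if \(\sigma^{2s}\) is drawn from \(p(\sigma^{2}|y)\) and then \(\mu^{s}\) from \(p(\mu|\sigma^{2s},y)\), the pair \((\mu^{s},\sigma^{2s})\) is a draw from the joint posterior, so for any set \(A\):

\[ \begin{align} Pr(\mu^{s} \in A) &= \int \left[\int_{A} p(\mu|\sigma^{2},y) d\mu \right] p(\sigma^{2}|y) d\sigma^{2} = \int_{A} \left[\int p(\mu|\sigma^{2},y) p(\sigma^{2}|y) d\sigma^{2}\right] d\mu = \int_{A} p(\mu|y) d\mu \nonumber \end{align} \]

That is, the simulated values of \(\mu\) follow the marginal posterior exactly, so every probability statement about them (not just their mean) is correct.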


Simulating numbers from a Dirichlet distribution

draws       = 1000
#These are the shape parameters of the gammas used to generate the Dirichlet draws. 
gamma.scale = c(728,538,138)
aux1        = sapply(gamma.scale,function(x) rgamma(draws,x))
aux1        = aux1 / apply(aux1,1,sum)   # normalize each draw so the components sum to 1

3 - Replicating Simulation Example 3.7 from BDA3 (pg 74+)

We are interested in modeling the dose-response of a certain drug (\(x_{i}\)) on the number of deaths (\(y_{i}\)) in a group of trial animals (\(n_{i}\)). We have 4 observations. Defining \(\theta_{i}\) as the true death rate for each dosage, we can model this phenomenon using a binomial distribution (\(y_{i} \sim Bin(n_{i},\theta_{i})\)). To model the dose-response relationship we start by looking at \(\theta_{i} = \alpha +\beta x_{i}\), but we realize that this model predicts values out of range (\(\theta\), as a probability, has to be between 0 and 1). Hence we apply the logit transformation, \(log(\theta_{i}/(1-\theta_{i})) = \alpha + \beta x_{i}\). This implies the following likelihood.

\[ \begin{align} p(y_{i}|\alpha,\beta,n_{i},x_{i}) \propto [logit^{-1}(\alpha +\beta x_{i})]^{y_i}[1-logit^{-1}(\alpha +\beta x_{i})]^{n_{i}- y_{i}} \end{align} \]

Question for Susan: why is it that in the likelihood above we do not multiply by the Jacobian of the transformation? Using a non-informative prior (\(p(\alpha,\beta)\propto 1\)) we get that the posterior has the form: \[ \begin{align} p(\alpha,\beta|y,n,x) \propto \prod_{i=1}^{k} p(y_{i}|\alpha,\beta,n_{i},x_{i}) \label{post.1} \end{align} \]

First we compute, as a rough approximation of the parameters, the ML estimates.

# Data
x       = c(-0.86, -0.30, -0.05,  0.73)
y       = c(0,1,3,5)
n       = rep(5,4)

# ML estimation
mlfit <- glm( cbind(y, n-y) ~ x, family = binomial)
mlfit$coefficients = round(mlfit$coefficients*1000)/1000
res1 = paste(paste("(",mlfit$coefficients[1], sep=""), paste(mlfit$coefficients[2],")", sep=""), sep=",")

With the estimates \((\hat{\alpha},\hat{\beta}) = (0.847,7.749)\).

Now we will simulate values of \(\alpha\) and \(\beta\) from the posterior distribution. The approach is as follows.

    (i) Build a grid for all possible values of \(\alpha\) and \(\beta\). Although this can be the whole real line, we use the ML estimates and some trial and error to restrict the space to \(\alpha \in [-5,10]\) and \(\beta \in [-10,40]\).
alpha     = seq(-5,10,.1)
beta      = seq(-10,40,1)
    (ii) Compute the posterior density (equation \ref{post.1}) over the whole grid and normalize it to 1.
library(boot)    # provides inv.logit()
post.a    = function(a,b,x,y,n) {
  post    = 1
  for (i in 1:length(y)) {
    post  = post * ( ((inv.logit(a+b*x[i]))^y[i])*((1-inv.logit(a+b*x[i]))^(n[i]-y[i])) )
  }
  post
}

post.dens = outer(alpha,beta,function(x1,x2) post.a(x1,x2,x,y,n))
post.dens = post.dens/sum(post.dens)
    (iii) Inspect the density using a contour plot (here we are looking for some indication that we are covering all of the relevant domain).
contours <- seq(min(post.dens), max(post.dens) , length=10)
par(mar = rep(2, 4))
contour(alpha, beta, post.dens,levels=contours, xlab=expression(alpha), ylab=expression(beta), xlim=c(min(alpha),max(alpha)) , ylim=c(min(beta),max(beta)), drawlabels=FALSE, main="Contour plot of joint posterior")


    (iv) Compute the marginal posterior of \(\alpha\) by summing over all \(\beta\) for each value of \(\alpha\). Notice that so far we had only computed the probability distribution of \(\alpha\) and \(\beta\); it is not until this point that we are able to do things like compute the mean, median and CIs of these parameters.
samps       = 1000
alpha.dens  = apply(post.dens ,1, sum)
    (v) For \(s = 1 \dots 1000\):
        (a) Draw \(\alpha^{s}\) from its marginal posterior.
s.alpha     = sample(alpha,samps, replace=TRUE, prob = alpha.dens)
        (b) Draw \(\beta^{s}\) from \(p(\beta|\alpha,y)\) for each value of \(\alpha^{s}\).
#Select the row of the joint density corresponding to a specific value of alpha (p(beta|alpha))
cond.beta   = function(x) {
  post.dens[which(alpha == s.alpha[x]),]
}
#Sample a value of beta according to the conditional probability above
s.beta      = sapply(1:samps,function(x) sample(beta,1,replace=TRUE,prob=cond.beta(x)))
    (vi) To each sampled value of \(\alpha\) and \(\beta\) add a uniform random jitter centered at zero with width equal to the grid spacing. This will make the simulation draws more continuous. Plot the sampled values.
s.beta      = s.beta + runif(length(s.beta),-.5,.5)
s.alpha     = s.alpha + runif(length(s.alpha),-.05,.05)
plot(s.alpha, s.beta, xlab=expression(alpha^s), ylab=expression(beta^s), xlim=c(min(alpha),max(alpha)) , ylim=c(min(beta),max(beta)), main="Scatter Plot of Sample Draws of alpha and beta")


The final result is two vectors, \(\{\alpha^{s}\}\) and \(\{\beta^{s}\}\), that represent the posterior distribution of each parameter.
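As a quick use of these draws (my own addition, following the LD50 summary that BDA3 computes for this example), posterior quantiles of the parameters and of the LD50 \(= -\alpha/\beta\) can be read off directly:

# Posterior summaries from the simulated draws
round(quantile(s.alpha, c(.025, .5, .975)), 2)
round(quantile(s.beta , c(.025, .5, .975)), 2)
# LD50 (dose with 50% death probability), restricted to draws with beta > 0
ld50        = -s.alpha[s.beta > 0] / s.beta[s.beta > 0]
round(quantile(ld50, c(.025, .5, .975)), 2)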

4 - Description with my own words, of the fully Bayesian analysis of conjugate hierarchical models described in section 5.3 of BDA (pg 108 - 113)

The overall goal of hierarchical models is to describe a statistical (uncertain) phenomenon in which multiple parameters are involved and a dependence structure exists among them.

The analysis described here builds on what we just did in the example above. For this purpose we enumerate the steps followed, now in a more general notation.

  1. Compute the joint posterior probability distribution of all the parameters conditional on the data and a prior distribution of those parameters (in the example above we use a non-informative prior).
  2. Compute the marginal posterior of one parameter (\(p(\theta_{2}|y)\)) by summing the joint posterior over all the possible values of the other parameter (\(\theta_{1}\)).
  3. Compute the marginal of the other parameter (\(p(\theta_{1}|y)\)) using the following identity:

\[ \begin{align} p(\theta_{1}|y) = \int p(\theta_{1}|\theta_{2},y) p(\theta_{2}|y) d\theta_{2} \approx p(\theta_{1}|\theta^{s}_{2},y) \end{align} \] Where \(\theta^{s}_{2}\) is a random draw using \(p(\theta_{2}|y)\) (the justification for this last step is in section 2 of this document).

Now we will distinguish between two sets of parameters: the original parameters of interest, still represented by \(\theta\), and the parameters of the prior distribution, or hyperparameters, represented by \(\phi\). These two sets of parameters fully describe the probabilistic process and have a joint distribution function:

\[ \begin{align} p(\phi,\theta) = p(\phi)p(\theta|\phi) \label{joint} \end{align} \]

And a joint posterior distribution: \[ \begin{align} p(\phi,\theta|y) &\propto p(\theta,\phi) p(y|\theta,\phi) \nonumber\\ &= p(\phi)p(\theta|\phi)p(y|\theta) \label{joint.post} \end{align} \] Where in the last line we used the assumption that \(p(y|\theta,\phi)\) depends on \(\phi\) only through \(\theta\). We call \(p(\phi)\) the hyperprior distribution, \(p(\theta|\phi)\) the population distribution, and the usual suspect \(p(y|\theta)\) is the likelihood, or sampling, distribution.

As always our goal is to make inference about \(\theta\) (and maybe \(\phi\) also?). As in the exercise before, our end result will be a set of matrices \((\{\phi^{s}\}), (\{\theta^{s}\})\) with values for the parameters that follow the posterior joint distribution. To get this result we take the following steps:

  1. Compute the conditional joint posterior \(p(\phi,\theta|y)\). Choosing the right hyper-prior distribution is a whole topic on itself and will be addressed later.
  2. Compute the marginal posterior of \(\theta\), conditional on \(\phi\), \(p(\theta|\phi,y)\). This conditional posterior of \(\theta\) can be obtained analytically in conjugate models: for a given value of \(\phi\), it is simply the posterior of the non-hierarchical Bayesian case.
  3. Compute the posterior marginal of \(\phi\). When step 2 has a closed form solution we can compute the marginal of \(\phi\) as:
    \[ \begin{align} p(\phi|y) &= \frac{p(\theta,\phi |y)}{p(\theta|\phi,y)} \label{marg.post.phi} \end{align} \]

    In the absence of a closed-form solution for step 2, we can compute the marginal by integrating the joint over all the possible values of \(\theta\).
  4. Draw samples \(\phi^{s}\) from \(p(\phi|y)\). Notice that \(\phi\) can have multiple components. If it has more than one element we follow the procedure described at the beginning of this section.
  5. For each draw of \(\phi^{s}\), draw a sample of \(\theta\) from \(p(\theta|\phi^{s},y)\). This allows us to fully characterize the parameter of interest (our goal, remember?)

Application

We have data from \(j = 1 \dots J, J = 71\) experiments where, in each experiment, \(n_{j}\) rats were exposed to a drug and \(y_{j}\) of them presented tumors afterwards (think also of something similar to the unemployment rate in different states, with \(n_{j}\) people in the active labor force and \(y_{j}\) unemployed). The results from the experiments are assumed to follow a binomial distribution (\(y_{j} \sim Bin(n_{j},\theta_{j})\)), and the parameters \(\theta_{j}\) are assumed to follow a beta prior distribution (\(\theta_{j} \sim Beta(\alpha,\beta)\)); this last assumption is made to take advantage of conjugacy. With all these elements we can now follow the steps described above.

  1. Compute the conditional joint posterior \(p(\phi=(\alpha,\beta),\theta = \{\theta_{j}\}_{j=1}^{J}|y)\). \[ \begin{align} p(\theta,\alpha,\beta|y) &\propto p(\alpha, \beta)p(\theta|\alpha, \beta)p(y|\theta,\alpha, \beta) \nonumber\\ p(\theta,\alpha,\beta|y) &\propto p(\alpha, \beta)\prod^{J}_{j=1} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta_{j}^{\alpha - 1} (1 - \theta_{j})^{\beta - 1}\prod^{J}_{j=1} \theta_{j}^{y_{j}} (1 - \theta_{j})^{n_{j}-y_{j}}\label{rat.joint.post} \end{align} \]

  2. Compute the marginal posterior of \(\theta\), conditional on \(\alpha, \beta\). For the beta-binomial case we have that, given the hyper-parameters, each \(\theta_{j}\) has a posterior distribution \(Beta(\alpha + y_{j}, \beta +n_{j} - y_{j})\). Assuming exchangeability: \[ \begin{align} p(\theta|\alpha,\beta,y) &= \prod^{J}_{j=1} \frac{\Gamma(\alpha + \beta +n_{j})}{\Gamma(\alpha+y_{j})\Gamma(\beta+n_{j}-y_{j})} \theta_{j}^{\alpha+y_{j} - 1} (1 - \theta_{j})^{\beta+n_{j}-y_{j} - 1}\label{rat.cond.post.theta} \end{align} \]

  3. Compute the posterior marginal of \((\alpha,\beta)\). Given that we do have a closed form solution in step 2, we compute the ratio of (\ref{rat.joint.post}) and (\ref{rat.cond.post.theta}). \[ \begin{align} p(\alpha,\beta|y) &\propto p(\alpha, \beta)\prod^{J}_{j=1} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \frac{\Gamma(\alpha+y_{j})\Gamma(\beta+n_{j}-y_{j})}{\Gamma(\alpha + \beta +n_{j})} \label{rat.marg.post.phi} \end{align} \]

And here is the code for this function:

rat.marg.post.phi =  function(alpha, beta) {
  post    = 1
  for (i in 1:length(y)) {
    post  = post * ( ( ( gamma(alpha + beta) )/( gamma(alpha)*gamma(beta) ) ) * ( ( gamma(alpha + y[i] )*gamma(beta + n[i] - y[i]) )/( gamma(alpha + beta + n[i]) ) ) )
  }
  # The hyper prior is defined below
  rat.hyper.prior(alpha,beta) * post
}

Note: we assume a hyper-prior of the form \(p(\alpha,\beta)\propto (\alpha + \beta)^{-5/2}\). So far I do not understand very well the whole discussion about proper and improper priors. Need to review Ch10 of DeGroot’s book

  4. Draw samples \((\alpha^{s}, \beta^{s})\) from \(p(\alpha,\beta|y)\). Before drawing the samples, the authors apply the following transformation to the parameter space: \((\alpha, \beta) \rightarrow (log(\frac{\alpha}{\beta}), log(\alpha+\beta))\). I still don’t know why they do this, nor why the transformation is applied to the hyper-prior only. This step requires an important number of sub-steps:
    4.1. Compute the (prior) probability distribution of the transformed variables. Here is where we multiply by the Jacobian of the inverse transformation. The transformation method states that if \(u\sim p_{u}(u)\) and there is a 1:1 function of \(u\), \(v = f(u)\), then \(v\sim p_{u}(f^{-1}(v))|J|\), where \(J\) is the Jacobian of \(f^{-1}(v)\) with respect to \(v\). For this case we have \(v_{1} = f_{1}(\alpha,\beta) = log(\alpha/\beta)\) and \(v_{2} = f_{2}(\alpha,\beta) = log(\alpha + \beta)\), with inverse \(f^{-1}_{1}(v_{1},v_{2}) = exp(v_{1}+v_{2})/(1+exp(v_{1}))\) and \(f^{-1}_{2}(v_{1},v_{2}) = exp(v_{2})/(1+exp(v_{1}))\). Hence the Jacobian is:

\[ J = \begin{bmatrix} \frac{\partial f^{-1}_{1}}{\partial v_{1}} & \frac{\partial f^{-1}_{1}}{\partial v_{2}} \\[0.3em] \frac{\partial f^{-1}_{2}}{\partial v_{1}} & \frac{\partial f^{-1}_{2}}{\partial v_{2}} \end{bmatrix}, \qquad |J| = \alpha \beta \]
Hence, using the transformation method we have that \(p(log(\frac{\alpha}{\beta}), log(\alpha+\beta)) \propto \alpha \beta(\alpha + \beta)^{-5/2}\). And here is the code for this function:

data      = read.csv("C:/Users/fhocesde/Documents/tutorials/Bayesian/rats.csv", header=T, sep = " ")
y         = data[,1]
n         = data[,2]
rat.hyper.prior   = function(alpha,beta) {
  alpha*beta*(alpha + beta)^(-5/2)
}
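A quick numerical sanity check of the Jacobian (my own addition), using finite differences of the inverse transformation at an arbitrary point:

# |J| should equal alpha*beta at any point (v1, v2)
f.inv     = function(v) c(exp(v[1]+v[2])/(1+exp(v[1])),   # alpha
                          exp(v[2])     /(1+exp(v[1])))   # beta
v0        = c(-1.8, 2.3); h = 1e-6
J.num     = cbind((f.inv(v0 + c(h,0)) - f.inv(v0))/h,
                  (f.inv(v0 + c(0,h)) - f.inv(v0))/h)
c(det(J.num), prod(f.inv(v0)))    # the two numbers should (nearly) agree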

4.2. Identify the relevant domain for the transformed variables and its counterpart in the original variables. To do so we start by computing the parameters \(\alpha\) and \(\beta\) that would match the sample mean and standard deviation of all 71 experiments with a beta prior distribution:

\[ \begin{align} \hat{\mu} &= 0.1381 = \frac{\hat{\alpha_{0}}}{\hat{\alpha_{0}}+\hat{\beta_{0}}}\\ \hat{\sigma^2} &= 0.0109 = \frac{\hat{\alpha_{0}}\hat{\beta_{0}}}{(\hat{\alpha_{0}}+\hat{\beta_{0}})^{2}(\hat{\alpha_{0}}+\hat{\beta_{0}}+1)} \end{align} \]

Solving for \((\hat{\alpha_{0}},\hat{\beta_{0}})\):

dslnex          <- function(x) {
    z           <- numeric(2)
    z[1]        <- x[1]/(x[1]+x[2]) - mean(y/n)
    z[2]        <- x[1]*x[2]/(((x[1]+x[2])^2)*(x[1]+x[2]+1)) - sd(y/n)^2
    z
}

sol1            <- nleqslv(c(1,1), dslnex) 
res1            <- paste("(",round(sol1$x[1],1), ",", round(sol1$x[2],1), ")",sep="")

We get: \((\hat{\alpha_{0}},\hat{\beta_{0}}) = (1.4,8.6)\).

We center the grid (approximately) around that initial estimate and expand the grid to cover up to a factor of 4 of each parameter. The result is plotted in the following figure:

v1        = seq(log(sol1$x[1]/sol1$x[2])*2,log(sol1$x[1]/sol1$x[2])/2,length.out =151)
v2        = seq(log(sol1$x[1]+sol1$x[2])/2,log(sol1$x[1]+sol1$x[2])*2,length.out =151)
beta      = exp(v2)/(exp(v1)+1)
alpha     = exp(v2+v1)/(exp(v1)+1)

post.dens = outer(alpha,beta,function(x1,x2) log(rat.marg.post.phi(x1, x2)) )
post.dens = exp(post.dens - max(post.dens))
post.dens = post.dens/sum(post.dens)

contours <- seq(min(post.dens), max(post.dens) , length=10)
contour(v1, v2, post.dens,levels=contours, xlab=expression(log(alpha/beta)), ylab=expression(log(alpha+beta)), xlim=c(min(v1),max(v1)) , ylim=c(min(v2),max(v2)), drawlabels=FALSE, main="Contour plot of joint posterior")


4.3. Recalculate the range of the grid such that it includes all of the density

v1        = seq(-2.3,-1.3,length.out =151)
# The book sets the range of "v2" up to 5, but the gamma function gives me trouble (overflow) after 4.8
v2        = seq(1 , 4.5,length.out =151)
beta      = exp(v2)/(exp(v1)+1)
alpha     = exp(v2+v1)/(exp(v1)+1)

post.dens = outer(alpha,beta,function(x1,x2) log(rat.marg.post.phi(x1, x2)) )
post.dens = exp(post.dens - max(post.dens))
post.dens = post.dens/sum(post.dens)

contours <- seq(min(post.dens), max(post.dens) , length=10)
contour(v1, v2, post.dens,levels=contours, xlab=expression(log(alpha/beta)), ylab=expression(log(alpha+beta)), xlim=c(min(v1),max(v1)) , ylim=c(min(v2),max(v2)), drawlabels=FALSE, main="Contour plot of joint posterior")  


4.4. Draw samples \((\alpha^{s}, \beta^{s})\) from \(p(\alpha,\beta|y)\) (finally!). Here we repeat the procedure used in section 3.(v) of this document.

samps       = 1000
v1.dens     = apply(post.dens ,1, sum)
s.v1        = sample(v1,samps, replace=TRUE, prob = v1.dens)

#Select the row of the joint density corresponding to a specific value of v1 (p(v2|v1))
cond.v2     = function(x) {
  post.dens[which(v1 == s.v1[x]),]
}
#Sample a value of v2 according to the conditional probability above
s.v2        = sapply(1:samps,function(x) sample(v2,1,replace=TRUE,prob=cond.v2(x)))

#Add a uniform random jitter centered at zero with width equal to the grid spacing. This will make the simulation draws more continuous. Plot the sampled values.  
grid.v1     = v1[2]-v1[1]
grid.v2     = v2[2]-v2[1]
s.v2        = s.v2 + runif(length(s.v2),-grid.v2/2,grid.v2/2)
s.v1        = s.v1 + runif(length(s.v1),-grid.v1/2,grid.v1/2)
plot(s.v1, s.v2, xlab=expression(log(alpha/beta)^s), ylab=expression(log(alpha+beta)^s), xlim=c(min(v1),max(v1)) , ylim=c(min(v2),max(v2)), main="Scatter Plot of Sample Draws of log(alpha/beta) and log(alpha+beta)")


Note: these two figures do not match their counterparts in the textbook exactly (5.3a and 5.3b in BDA3), but so far I have not been able to detect my mistake.

4.5 By applying the inverse of the transformation we recover the marginal distribution of the original hyper-parameters.

s.beta      = exp(s.v2)/(exp(s.v1)+1)
s.alpha     = exp(s.v2+s.v1)/(exp(s.v1)+1)
  5. For each draw of \(\phi^{s}\), draw a sample of \(\theta\) from \(p(\theta|\phi^{s},y)\)
s.beta      = exp(s.v2)/(exp(s.v1)+1)
s.alpha     = exp(s.v2+s.v1)/(exp(s.v1)+1)
theta.dist  = sapply(1:71, function(x) rbeta(1000,s.alpha+y[x], s.beta + n[x] - y[x]))
theta.dist  = apply(theta.dist,2,sort)
plot(0:450/1000,0:450/1000, type="l", xlab="Observed rate",ylab="95% CI and median of posterior")
jitter.x    = y/n + runif(length(y),-0.01,0.01)
points(jitter.x,theta.dist[500,])
segments(jitter.x,theta.dist[25,], jitter.x,theta.dist[975,] )
title(main="Posterior Distribution of Tumor Rates for all 71 Experiments \n (Remember the goal? This is it!)")


5 - Replicating section 5.5: Experiments in eight schools (normal model) [Without Stan]

#Data:
school.id       <- LETTERS[1:8]
effect          <- c(28,8,-3,7,-1,1,18,12)
se.effect       <- c(15,10,16,11,9,11,10,18) 

pool.est        <- sum(effect*se.effect^-2)/sum(se.effect^-2)
pool.var        <- sum(se.effect^-2)^-1
pool.ci         <- c(-1.96,1.96)*pool.var^.5 + pool.est

The pooled estimated effect and variance are 7.69 and 16.58, with a 95% CI of [-0.3, 15.67].

Posterior simulation under the hierarchical model
Using the identity:
\[ \begin{align} p(\theta,\mu,\tau|y) = p(\tau|y)p(\mu|\tau,y)p(\theta|\mu,\tau,y) \end{align} \] And the results from BDA in equation 5.17, 5.20 and 5.21 we code the joint posterior:

# Eqn 5.17 of BDA3
post.theta.j    <- function(mu,tau,j) (effect[j]/(se.effect[j]^2) + mu/(tau^2)) / (1/(se.effect[j]^2) + 1/(tau^2)) 
post.v.theta.j  <- function(tau,j) 1/(1/(se.effect[j]^2) + 1/(tau^2)) 
# Eqn 5.20 of BDA3
post.mu.hat     <- function(tau) sum(effect*1/(se.effect^2 +tau^2))/sum(1/(se.effect^2 +tau^2))
post.v.mu       <- function(tau) (sum(1/(se.effect^2 +tau^2)))^-1

# Eqn 5.21 of BDA3
marginal.tau     <- function(tau) {
  hyper.prior(tau)*(post.v.mu(tau)^.5)*prod(((se.effect^2 + tau^2)^(-.5)) * 
                  exp(-((effect - post.mu.hat(tau))^2)/(2*(se.effect^2 + tau^2))))
}

Define a hyper-prior and draw 200 samples from each distribution (for all 8 schools).

samps           <- 200 

hyper.prior     <-  function(tau) 1
tau.grid        <-  seq(0.001,30, length=samps)
pdf.tau         <-  sapply(tau.grid,function(x) marginal.tau(x))
pdf.tau         <-  pdf.tau/sum(pdf.tau)

plot(tau.grid,pdf.tau, type="l", main="Figure 5.5 from BDA3", xlab=expression(tau), ylab="Density")


The sampling method in BDA3 suggests applying the inverse-CDF method to the posterior of \(\tau\). I don’t do this for two reasons: (i) I’m not sure the posterior has a closed-form solution for its inverse, and (ii) given that I already have the density over a grid, I can draw directly from that distribution using the sample command (which leads me to think that this command applies the inverse method, but I need to check with Susan).

# Sampling
s.tau           <- sample(tau.grid,samps,prob=pdf.tau, replace=TRUE)
s.mu            <- sapply(s.tau,function(x) rnorm(1,post.mu.hat(x),(post.v.mu(x))^0.5))
s.theta         <- NULL 
for (j in 1:length(school.id)) {
  s.theta[[j]]         <- sapply(1:samps, 
                            function(x) 
                            rnorm(1,
                                  post.theta.j(s.mu[x],s.tau[x],j),
                                  (post.v.theta.j(s.tau[x],j))^0.5
                                  ) )
  }

The following figures replicate the figures on pg 122 of BDA. Before doing the plots we need to ‘average over \(\mu\)’:

\[ \begin{align} E(\theta_{j}|\tau,y) &= E_{\mu}\left[E(\theta_{j}|\tau,y,\mu)|\tau,y\right] \nonumber \\ &= E_{\mu}\left[\frac{ \frac{1}{\sigma_{j}^{2}}y_{j} + \frac{1}{\tau^{2}}\mu }{\frac{1}{\sigma_{j}^{2}} + \frac{1}{\tau^{2} } }|\tau,y\right] = \frac{ \frac{1}{\sigma_{j}^{2}}y_{j} + \frac{1}{\tau^{2}}\hat{\mu} }{\frac{1}{\sigma_{j}^{2}} + \frac{1}{\tau^{2} } }\\ \nonumber \\ Var(\theta_{j}|\tau,y) &= E_{\mu}\left[Var(\theta_{j}|\tau,y,\mu)|\tau,y\right] + Var_{\mu}\left[E(\theta_{j}|\tau,y,\mu)|\tau,y\right] \nonumber \\ &= \frac{1}{\frac{1}{\sigma_{j}^{2}} + \frac{1}{\tau^{2} }} + V_{\mu}\left(\frac{\frac{1}{\tau^{2}}}{\frac{1}{\sigma_{j}^{2}} + \frac{1}{\tau^{2} }}\right) \end{align} \]

Where \(V_{\mu}\) and \(\hat{\mu}\) correspond to the expressions defined in Eq 5.20 of BDA3. Below is the code and plot of both equations.

post.theta.j.no.mu     <- function(tau,j) post.theta.j(post.mu.hat(tau),tau,j)  
post.se.theta.j.no.mu  <- function(tau,j) sqrt( (post.v.theta.j(tau,j)) * (1+post.v.mu(tau)*tau^(-2)) )

plot( tau.grid,sapply(tau.grid, function(x) post.theta.j.no.mu(x,1)), type="l", ylim=c(-5,30), xlab="", ylab="")
lines(tau.grid,sapply(tau.grid, function(x) post.theta.j.no.mu(x,2)))
lines(tau.grid,sapply(tau.grid, function(x) post.theta.j.no.mu(x,3)))
lines(tau.grid,sapply(tau.grid, function(x) post.theta.j.no.mu(x,4)))
lines(tau.grid,sapply(tau.grid, function(x) post.theta.j.no.mu(x,5)))
lines(tau.grid,sapply(tau.grid, function(x) post.theta.j.no.mu(x,6)))
lines(tau.grid,sapply(tau.grid, function(x) post.theta.j.no.mu(x,7)))
lines(tau.grid,sapply(tau.grid, function(x) post.theta.j.no.mu(x,8)))
title(main="Figure 5.6 from BDA3", xlab=expression(tau), ylab="Estimated treatment effect")


plot(tau.grid,sapply(tau.grid, function(x)  post.se.theta.j.no.mu(x,1)), type="l", ylim=c(0,20), xlab="", ylab="")
lines(tau.grid,sapply(tau.grid, function(x) post.se.theta.j.no.mu(x,2)))
lines(tau.grid,sapply(tau.grid, function(x) post.se.theta.j.no.mu(x,3)))
lines(tau.grid,sapply(tau.grid, function(x) post.se.theta.j.no.mu(x,4)))
lines(tau.grid,sapply(tau.grid, function(x) post.se.theta.j.no.mu(x,5)))
lines(tau.grid,sapply(tau.grid, function(x) post.se.theta.j.no.mu(x,6)))
lines(tau.grid,sapply(tau.grid, function(x) post.se.theta.j.no.mu(x,7)))
lines(tau.grid,sapply(tau.grid, function(x) post.se.theta.j.no.mu(x,8)))
title(main="Figure 5.7 from BDA3", xlab=expression(tau), ylab="Posterior Standard Deviation")


s.theta         <- matrix(unlist(s.theta), ncol = 8, byrow = FALSE)
s.theta.sort    <- apply(s.theta, 2, sort)
p               <- t( apply(s.theta.sort, 2, function(x) quantile(x,c(.025,.25,.5, .75, .975),type=1)) )
p               <- round(p,3)

Table 5.3 from BDA3:

| School | 2.5% | 25% | median | 75% | 97.5% |
|---|---|---|---|---|---|
| A | -0.959 | 5.126 | 8.799 | 14.701 | 31.639 |
| B | -4.707 | 3.321 | 7.599 | 11.384 | 20.985 |
| C | -15.041 | 2.191 | 6.602 | 10.44 | 17.38 |
| D | -9.037 | 2.942 | 6.87 | 11.321 | 20.173 |
| E | -8.086 | 2.217 | 6.21 | 9.562 | 18.036 |
| F | -6.893 | 2.292 | 6.263 | 10.17 | 19.596 |
| G | -1.023 | 5.646 | 9.295 | 14.107 | 29.198 |
| H | -7.684 | 4.477 | 7.525 | 12.355 | 24.877 |

Here we reproduce figure 5.8 (with the same problems as above)

par(mfrow=c(1,2))  
domain           <- c(-20,60)
hist(s.theta[,1], breaks=10, xlab="Effect in School A", main="", xlim=domain)
hist(apply(s.theta,1,max), breaks=10, xlab="Largest Effect", main="", xlim=domain)
title(main="Figure 5.8 from BDA3")


This last figure (“largest effect”) is a good example of one of the main advantages of a fully Bayesian hierarchical model: once we have correctly simulated the posterior, we can test all kinds of complicated hypotheses.
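Two quick examples of such hypotheses (my own additions), using the draws obtained above:

# Pr(school A's effect exceeds school C's), and Pr(the largest effect is above 28)
mean(s.theta[,1] > s.theta[,3])
mean(apply(s.theta, 1, max) > 28)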

6 - Replicating section 5.5: Experiments in eight schools (normal model) [With Stan]

Appendix C in BDA3

7 - Replicating example of section 11.2: Metropolis sampling from a bivariate normal [Without Stan]

Here we follow the steps in p278 to simulate draws from a bivariate normal \(N(0,I_{2})\).

  1. Define a starting point \(\theta^{0}\), for which \(p(\theta^{0}|y)>0\):
set.seed(142857)
library(mvtnorm)
theta.0         <- c(-2.5, 2.5)

Our starting point is \((-2.5, 2.5)\), the upper-left-most square dot in figure 11.1a.

  2. For \(t=1,2,...:\)
    (a) Sample a proposal \(\theta^{*}\) at time \(t\) from \(J_{t}(\theta^{*}|\theta^{t-1})\). For this example the jumping distribution is \(N(\theta^{*}|\theta^{t-1},0.2^{2}I_{2})\):
theta.star      <- function(theta.t_1) {
  n.p         <- length(theta.t_1) 
  # I would like to learn a way to draw from a multivariate normal without a black-box function. 
  rmvnorm(1, mean = theta.t_1, sigma = 0.2^2 * diag(n.p))  
} 
    (b) Calculate the ratio of densities,
    \[ \begin{align} r &= \frac{p(\theta^{*}|y)}{p(\theta^{t-1}|y)} \label{r.dens} \end{align} \]
r.dens          <- function(theta.t_1, theta.star) {
  n.p         <- length(theta.t_1) 
  dmvnorm(theta.star, mean = rep(0, n.p), sigma = diag(n.p)) /
  dmvnorm(theta.t_1 , mean = rep(0, n.p), sigma = diag(n.p))  
  } 
    (c) Set \[ \begin{align} \theta^{t} &= \left\{ \begin{array}{l l} \theta^{*} & \quad \text{with probability $min(r,1)$}\\ \theta^{t-1} & \quad \text{otherwise.} \end{array} \right. \nonumber \end{align} \]
theta.t         <- function(theta.t_1, theta.star, r) {
  prob.aux      <- runif(1) 
  if (prob.aux <= min(c(r,1)))  
    theta.star
  else
    theta.t_1
}
  3. Execute:
theta           <- function(sims,theta.0) {
  s.theta       <- matrix(NA,sims,2) 
  s.theta[1,]   <- t(as.matrix(theta.0))

  for (t in 2:sims) {
    s.theta.star<- theta.star(s.theta[t-1,])
    r           <- r.dens(s.theta[t-1,], s.theta.star)
    s.theta[t,] <- theta.t(s.theta[t-1,], s.theta.star, r) 
  }
return(s.theta)
}
par(mfrow=c(1,3))  
par(mar = rep(2,4))

# 11.1a
s.theta.1       <- theta(50,theta.0)
s.theta.2       <- theta(50,c(2.5,2.5))
s.theta.3       <- theta(50,c(2.5,-2.5))
s.theta.4       <- theta(50,c(-2.5,-2.5))
s.theta.5       <- theta(50,c(0,0))
plot(s.theta.1, xlim=c(-4,4), ylim=c(-4,4), type="l", sub = "50 simulations")
points(theta.0[1], theta.0[2], pch=15)
lines(s.theta.2)
points(2.5, 2.5, pch=15)
lines(s.theta.3)
points(2.5, -2.5, pch=15)
lines(s.theta.4)
points(-2.5, -2.5, pch=15)
lines(s.theta.5)
points(0, 0, pch=15)

# 11.1b
s.theta.1       <- theta(1000, theta.0)
s.theta.2       <- theta(1000, c(2.5,2.5))
s.theta.3       <- theta(1000, c(2.5,-2.5))
s.theta.4       <- theta(1000, c(-2.5,-2.5))
s.theta.5       <- theta(1000, c(0,0))
plot(s.theta.1, xlim=c(-4,4), ylim=c(-4,4), type="l",  sub = "1000 simulations", main="Figure 11.1 From BDA3")
points(theta.0[1], theta.0[2], pch=15)
lines(s.theta.2)
points(2.5, 2.5, pch=15)
lines(s.theta.3)
points(2.5, -2.5, pch=15)
lines(s.theta.4)
points(-2.5, -2.5, pch=15)
lines(s.theta.5)
points(0, 0, pch=15)

s.theta.1       <- s.theta.1[501:1000,] + matrix(runif(1000)/20, nrow = 500, ncol = 2)
s.theta.2       <- s.theta.2[501:1000,] + matrix(runif(1000)/20, nrow = 500, ncol = 2)
s.theta.3       <- s.theta.3[501:1000,] + matrix(runif(1000)/20, nrow = 500, ncol = 2)
s.theta.4       <- s.theta.4[501:1000,] + matrix(runif(1000)/20, nrow = 500, ncol = 2)
s.theta.5       <- s.theta.5[501:1000,] + matrix(runif(1000)/20, nrow = 500, ncol = 2)

# 11.1c
plot(s.theta.1, xlim=c(-4,4), ylim=c(-4,4), type="p", pch = 20, cex =.1, sub = "Second half of 1000 simulations")
points(s.theta.2, pch = 20, cex =.1)
points(s.theta.3, pch = 20, cex =.1)
points(s.theta.4, pch = 20, cex =.1)
points(s.theta.5, pch = 20, cex =.1)


8 - Replicating example of section 11.6: Gibbs sampling for a hierarchical normal model (Coagulation experiment data) [Without Stan]

#Data
id              <- rep(LETTERS[1:4],c(4,6,6,8))
y               <- c(62,60,63,59,63,67,71,64,65,66,68,66,71,67,68,68,56,62,60,61,63,64,63,59)  
J               <- length(unique(id))
n               <- length(y)

The Model
Data \(y_{ij}, i= 1,..., n_{j} , j= 1,..,J\) are iid within each of \(J\) groups with \(y_{ij} \sim N(\theta_{j}, \sigma^{2})\). The prior distribution of \(\theta_{j}\) is assumed to be normal with hyperparameters \(\mu, \tau\) (\(\theta_{j} \sim N(\mu, \tau^{2})\)). \(\sigma\) is assumed to be unknown (confusing!) and the hyperprior is assumed to be uniform over \((\mu, log(\sigma), \tau)\), which implies \(p(\mu, log(\sigma), log(\tau)) \propto \tau\). The joint posterior density for all the parameters is:

\[ \begin{align} p(\theta,\mu, log(\sigma), log(\tau) | y) \propto \tau \prod^{J}_{j=1} N(\theta_{j}|\mu, \tau^{2}) \prod^{J}_{j=1} \prod^{n_{j}}_{i=1} N(y_{ij}|\theta_{j}, \sigma^{2}) \label{norm.norm4gibs} \end{align} \]

Starting points
For each of 10 chains, select starting values of \(\theta_{j}\) at random from the \(y_{ij}\) sample of the corresponding group, and take \(\mu\) to be the average of the starting values of \(\theta_{j}\).

set.seed(142857)
theta.0         <- sapply(1:4,function(x) sample(y[id==LETTERS[x]],10, replace=TRUE))
mu.0            <- apply(theta.0, 1,mean) 

[Here I follow the exact reverse order that is proposed in the book, going from step 4 to 1 instead of 1 to 4. It is not clear to me how to do it otherwise].

4. Draw from conditional posterior of \(\tau^{2}\) using: \[ \begin{align} \hat{\tau}^{2} &= \frac{1}{J-1}\sum_{j=1}^{J}(\theta_{j} - \mu)^{2} \end{align} \]

tau.hat.2       <- function(theta) {
  mu            <- mean(theta)
  tau.2         <- ( 1/(J-1) ) * sum((theta - mu)^2)
  return(tau.2)
} 

And the fact that: \[ \begin{align} \tau^{2}|\theta,\mu,\sigma, y \sim Inv-\chi^2(J-1,\hat{\tau}^{2}) \end{align} \]

We can draw samples for \(\tau^{2}\):

s.tau.post      <- function(theta) {
  tau.2         <- tau.hat.2(theta) 
  tau.cond      <- (J - 1) * (tau.2)/rchisq(1,J-1)
  return(tau.cond)
  }

3. Draw from the conditional posterior of \(\sigma^{2}\) using:
\[ \begin{align} \hat{\sigma}^{2} &= \frac{1}{n}\sum_{j=1}^{J}\sum_{i=1}^{n_{j}}(y_{ij} - \theta_{j})^{2} \end{align} \]

f.sigma.hat.2   <- function(theta) {
  sigma.hat.2   <-  sapply(1:4, function(x) (y[id==LETTERS[x]] - theta[x])^2)  
  sigma.hat.2   <-  (1/n) * sum(unlist(sigma.hat.2))
  return(sigma.hat.2)
}

And the fact that:

\[ \begin{align} \sigma^{2}|\theta,\mu,\tau,y \sim Inv-\chi^2(n,\hat{\sigma}^{2}) \end{align} \] We can draw samples for \(\sigma^{2}\):

s.sigma.post      <-  function(theta) {
  sigma2.hat      <- f.sigma.hat.2(theta) 
  sigma2.post      <-  (n) * (sigma2.hat)/rchisq(1,n)
  return(sigma2.post)
  }

2. Draw from the conditional posterior of \(\mu\) using:
\[ \begin{align} \hat{\mu} &= \frac{1}{J} \sum_{j=1}^{J}\theta_{j} \end{align} \]

mu.hat          <- function(theta) {
  mean(theta)
} 

And the fact that:

\[ \begin{align} \mu|\theta,\sigma,\tau,y \sim N(\hat{\mu}, \tau^{2}/J) \end{align} \]

We can draw values for \(\mu\)

s.mu        <- function(theta,tau2) {
  mu.hat    <- mu.hat(theta)
  rnorm(1,mu.hat,sqrt(tau2/J))
}

1. Finally, we can draw values for \(\theta\), using the fact that:
\[ \begin{align} \theta_{j}|\mu,\sigma,\tau,y \sim N(\hat{\theta_{j}}, V_{\theta_{j}}) \end{align} \]
With: \[ \begin{align} \hat{\theta_{j}} &= \frac{ \frac{1}{\tau^{2}}\mu + \frac{n_{j}}{\sigma^{2}}\bar{y}_{\cdot j} }{\frac{1}{\tau^{2}} + \frac{n_{j}}{\sigma^{2}}} \nonumber \\ \\ V_{\theta_{j}} &= \frac{1}{\frac{1}{\tau^{2}} + \frac{n_{j}}{\sigma^{2}}} \nonumber \\ \end{align} \]

theta.hat.j     <- function(j,mu,sigma2,tau2) {
  y.bar.j       <- mean(y[id==LETTERS[j]])
  n.j           <- length(y[id==LETTERS[j]])
  ( (1/tau2) * mu + (n.j/sigma2) * y.bar.j  ) / ( (1/tau2) + (n.j/sigma2)  )
} 

V.theta.hat.j   <- function(j,mu,sigma2,tau2) {
  n.j           <- length(y[id==LETTERS[j]])
  ( 1 ) / ( (1/tau2) + (n.j/sigma2) )
} 

s.theta         <- function(mu,sigma2,tau2) {
  theta         <- NULL 
  for (j in 1:J) {
    t.hat       <- theta.hat.j(j,mu,sigma2,tau2)   
    v.t.hat     <- V.theta.hat.j(j,mu,sigma2,tau2)
    theta[j]    <- rnorm(1,t.hat,sqrt(v.t.hat)) 
  }
  return(theta)
}

mcmc.gibbs      <- function(chain) {
  res1          <- as.list(NULL) 
  sims          <- 200
  param         <- 7
  s.param       <- matrix(NA, ncol = param, nrow = sims )
  colnames(s.param)<-  c("theta1", "theta2", "theta3", "theta4", "mu", "sigma2", "tau2")
  s.param[1,1:4]<- theta.0[chain,]
  s.param[1,7]  <- s.tau.post(theta.0[chain,])
  s.param[1,6]  <- s.sigma.post(theta.0[chain,])
  s.param[1,5]  <- s.mu(theta.0[chain,],s.param[1,7])
   
  for (s in 2:sims) {
    s.param[s,1:4]<- s.theta(s.param[s-1,5],s.param[s-1,6],s.param[s-1,7])
    s.param[s,7]  <- s.tau.post(s.param[s,1:4])
    s.param[s,6]  <- s.sigma.post(s.param[s,1:4])
    s.param[s,5]  <- s.mu(s.param[s,1:4],s.param[s,7])
  }
return(s.param)
#Warm-up
}
s.param           <- lapply(1:10,function(x) mcmc.gibbs(x))
s.param.1         <- s.param[[1]][101:200, ] 

#Transform the variances into standard deviations.
s.param.1[,6:7]   <- sqrt(s.param.1[,6:7] )

t(apply(s.param.1,2, function(x) quantile(x, c(.025,.25,.5,.75,.975))))
##          2.5%    25%    50%    75%  97.5%
## theta1 58.539 60.668 61.290 62.136 63.551
## theta2 64.097 65.187 65.966 66.521 67.694
## theta3 65.779 67.124 67.719 68.310 69.505
## theta4 59.654 60.640 61.285 61.767 62.598
## mu     58.590 62.809 63.768 65.404 69.947
## sigma2  1.851  2.169  2.395  2.694  3.328
## tau2    1.825  2.777  3.833  6.389 17.368
r.hat             <- function(arg1) {
  # arg1: an (n draws) x (m chains) matrix of simulations for one parameter
  n               <- dim(arg1)[1]
  m               <- dim(arg1)[2]
  # Between-chain (B) and within-chain (W) variances, as in BDA3 Sec. 11.4
  B               <- (n/(m-1)) * sum( ( apply(arg1,2,mean) - mean(arg1) )^2 )
  W               <- (1/m) * (1/(n-1)) * sum( sweep(arg1, 2, apply(arg1,2,mean))^2 )  
  sqrt( (n-1)/(n) + B/(W*n) )
}

 r.hat(sapply(1:10,function(x) s.param[[x]][,4] ))
## [1] 1.032

Simulate 200 draws for each of the 10 chains
Note: Problems with \(\sigma\) and \(\tau\), need to review.
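A natural next check (my own addition) is to compute \(\hat{R}\) for every parameter across the 10 chains, discarding the first half of each chain as warm-up; large values for \(\sigma\) and \(\tau\) would point to the problem noted above:

# R-hat for all 7 parameters (second half of each of the 10 chains)
r.hat.all        <- sapply(1:7, function(p)
                      r.hat(sapply(1:10, function(x) s.param[[x]][101:200, p])))
names(r.hat.all) <- colnames(s.param[[1]])
round(r.hat.all, 3)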