Note: Some of the solutions presented here have been reverse engineered from here.
1.1
1a: \[ \begin{align} p(y) &= \frac{1}{2} \left( p(y| \theta = 1) + p(y| \theta = 2) \right) \nonumber\\ &= \frac{1}{2} \left( N(y|1,2^{2}) + N(y|2,2^{2}) \right) \end{align} \]
domain <- seq(-7,10,.02)
dens <- 0.5*dnorm(domain,1,2) + 0.5*dnorm(domain,2,2)
plot (domain, dens, ylim=c(0,1.1*max(dens)),
type="l", xlab="y", ylab="", xaxs="i",
yaxs="i", yaxt="n", bty="n", cex=2)
p.theta.1 <- function(sigma) {
res1 <- (0.5*dnorm(1,1,sigma)) /(sum(0.5*dnorm(1,c(1,2),sigma)))
return(res1)
}
Evaluating the last expression (using the normal densities at \(y=1\)) we get: 0.5312. Note: even though we are adding a “discrete” number of probabilities, we are still in continuous space (conditioning on \(y=1\)), so the probabilities should be evaluated with the density function, not the cumulative distribution function.
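The values in Table 1 can be reproduced by evaluating p.theta.1 over the grid of \(\sigma\) values (a small usage sketch added here):
sigma.grid <- c(0.25, 0.5, 1, 2, 4, 8)
# Posterior probability that theta = 1 given y = 1, for each sigma
round(sapply(sigma.grid, p.theta.1), 4)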
Table 1: Posterior probability of \(\theta = 1\), as a function of \(\sigma\)
\(\sigma\) | \(p(\theta = 1 | y = 1 )\) |
---|---|
0.25 | 0.9997 |
0.5 | 0.8808 |
1 | 0.6225 |
2 | 0.5312 |
4 | 0.5078 |
8 | 0.502 |
Let’s define the following events:
* \(A:\) The participant chose the right box at the beginning.
* \(B:\) The host opens a particular box, among the unchosen ones, such that it is empty.
* \(C:\) Among the unchosen boxes, the host chooses an empty box.
And let’s compute the probabilities of each of these events.
\[
\begin{align*}
Pr(A) &= 1/3\\
Pr(C) &= 1/2\\
Pr(B) &= Pr(B|A)Pr(A) + Pr(B|\neg A)Pr(\neg A) = (1/2)*(1/3) + Pr(B|\neg A)*(2/3)\\
&= 1/6 + 2/3*(Pr(B|\neg A,C)Pr(C) + Pr(B|\neg A,\neg C)Pr(\neg C)) \\
&= 1/6 + 2/3*(1*(1/2) + 0*(1/2)) = 1/2
\end{align*}
\] Using Bayes’ theorem, the probability of having chosen the right box from the beginning, conditional on an unchosen box being revealed as a losing one, is:
\[ Pr(A|B) = \frac{Pr(A)Pr(B|A)}{Pr(B)} = \frac{(1/3)*(1/2)}{1/2}=\frac{1}{3}\]
The participant’s chances are not equal across remaining boxes! She is worse off staying with her original choice (33% probability of winning instead of 50%!).
More generally, if there were \(n\) boxes in total and \(i\) boxes were revealed, the wrong way of updating the probabilities (\(1/(n-i)\)) and the Bayesian update (\(\frac{i+n(n-1-i)}{n(n-i)(n-i-1)}\)) differ significantly as \(i \rightarrow n\). For example, the following graph plots both probabilities of winning in a contest with 50 boxes as the host opens \(i\) boxes.
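A sketch of code that could produce such a graph from the two expressions above (the plotting details here are my own):
n.boxes <- 50
i.open <- 1:(n.boxes - 2)
# Naive update: equal probability over the remaining boxes
p.naive <- 1 / (n.boxes - i.open)
# Bayesian update, using the expression above
p.bayes <- (i.open + n.boxes*(n.boxes - 1 - i.open)) /
  (n.boxes*(n.boxes - i.open)*(n.boxes - i.open - 1))
plot(i.open, p.bayes, type="l", ylim=c(0,1),
     xlab="Boxes opened by the host (i)", ylab="Probability of winning")
lines(i.open, p.naive, lty=2)
legend("topleft", legend=c("Bayesian update", "Naive update"), lty=c(1,2))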
Looking at the graph, it seems that the advantages of thinking in a Bayesian fashion are certainly parameter-specific. Also notice that the player here chooses a “stubborn” strategy; I suspect that if she changes boxes in an optimal way the improvement in her chances will be slightly smaller. Maybe that is the reason why we don’t think in a Bayesian fashion all the time.
domain <- seq(0,1,.01)
dens = apply(sapply(0:2,function(x) choose(10,x)*dbeta(domain,4+x,4+10-x)),1,sum)
plot(domain, dens, type="l")
2.14
2.14a Deriving the posterior for a normal likelihood with known variance, unknown mean, and using a normal prior. Slide 15 here
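For reference, the standard conjugate result (which the cited slide presumably summarizes): with \(y_{1},\dots,y_{n} \sim N(\theta,\sigma^{2})\), \(\sigma^{2}\) known, and prior \(\theta \sim N(\mu_{0},\tau_{0}^{2})\), the posterior is again normal:
\[
\begin{align}
\theta|y \sim N(\mu_{n},\tau_{n}^{2}), \quad \mu_{n} = \frac{\frac{1}{\tau_{0}^{2}}\mu_{0} + \frac{n}{\sigma^{2}}\bar{y}}{\frac{1}{\tau_{0}^{2}} + \frac{n}{\sigma^{2}}}, \quad \frac{1}{\tau_{n}^{2}} = \frac{1}{\tau_{0}^{2}} + \frac{n}{\sigma^{2}}
\end{align}
\]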
Note: a good reminder of the main conjugacy relationships can be found here
#Data:
school.id <- LETTERS[1:8]
effect <- c(28,8 ,-3,7 ,-1,1 ,18,12)
se.effect <- c(15,10,16,11,9 ,11,10,18)
pool.est <- sum(effect*se.effect^-2)/sum(se.effect^-2)
pool.var <- sum(se.effect^-2)^-1
pool.ci <- c(-1.96,1.96)*pool.var^.5 + pool.est
The pooled estimated effect and variance are 7.69 and 16.58, with a 95% CI of [-0.3, 15.67].
Posterior simulation under the hierarchical model
Using the identity:
\[
\begin{align}
p(\theta,\mu,\tau|y) = p(\tau|y)p(\mu|\tau,y)p(\theta|\mu,\tau,y)
\end{align}
\] And the results from BDA in equation 5.17, 5.20 and 5.21 we code the joint posterior:
# Eqn 5.17 of BDA3
post.theta.j <- function(mu,tau,j) (effect[j]/(se.effect[j]^2) + mu/(tau^2)) / (1/(se.effect[j]^2) + 1/(tau^2))
post.v.theta.j <- function(tau,j) 1 / (1/(se.effect[j]^2) + 1/(tau^2))
# Eqn 5.20 of BDA3
post.mu.hat <- function(tau) sum(effect*1/(se.effect^2 +tau^2))/sum(1/(se.effect^2 +tau^2))
post.v.mu <- function(tau) (sum(1/(se.effect^2 +tau^2)))^-1
# Eqn 5.21 of BDA3
marginal.tau <- function(tau) {
hyper.prior(tau) * (post.v.mu(tau)^.5) * prod(((se.effect^2 + tau^2)^(-1/2)) *
exp(-((effect - post.mu.hat(tau))^2) / (2 * (se.effect^2 + tau^2))))
}
# Testing an alternative (loop-based) implementation
marginal.tau1 <- function(tau) {
marg.post <- 1
for (i in 1:length(effect)) {
marg.post <- ((se.effect[i]^2 + tau^2)^(-1/2)) *
exp(-((effect[i] - post.mu.hat(tau))^2) / (2 * (se.effect[i]^2 + tau^2))) *
marg.post
}
# Return only after the loop has gone over all eight schools
return(hyper.prior(tau) * (post.v.mu(tau)^.5) * marg.post)
}
Define a hyper-prior and draw 1000 samples from each distribution (for all 8 schools).
set.seed(142857)
samps <- 1000
hyper.prior <- function(tau) 1
tau.grid <- seq(0.001,40, length=samps)
pdf.tau <- sapply(tau.grid,function(x) marginal.tau(x))
pdf.tau <- pdf.tau/sum(pdf.tau)
s.tau <- sample(tau.grid,samps,prob=pdf.tau, replace=TRUE)
s.mu <- sapply(s.tau,function(x) rnorm(1,post.mu.hat(x),(post.v.mu(x))^0.5))
s.theta <- NULL
for (j in 1:length(school.id)) {
s.theta[[j]] <- sapply(1:samps,
function(x)
rnorm(1,
post.theta.j(s.mu[x],s.tau[x],j),
(post.v.theta.j(s.tau[x],j))^0.5
) )
}
par(mfrow=c(1,1))
par(mar = rep(2, 4))
plot(tau.grid,pdf.tau, type="l", main="Figure 5.5 from BDA3", xlab=expression(tau), ylab="Density")
The sampling method in BDA3 suggests applying the inverse-CDF method to the posterior of \(\tau\). I don’t do this for two reasons: (i) I’m not sure the posterior has a closed-form solution for its inverse, and (ii) given that I already have the density on a grid, I can draw from that distribution directly using the sample command (which leads me to think that this command applies the inverse method, but I need to check with Susan).
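For completeness, here is a sketch (my own illustration, not the book’s code) of what the grid-based inverse-CDF approach could look like: accumulate the normalized density into a discrete CDF and invert it with uniform draws.
cdf.tau <- cumsum(pdf.tau)                      # discrete CDF over tau.grid
u <- runif(samps)                               # uniform draws to invert
s.tau.inv <- tau.grid[findInterval(u, cdf.tau) + 1]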
s.theta <- matrix(unlist(s.theta), ncol = 8, byrow = FALSE)
s.theta.sort <- apply(s.theta, 2, sort)
p <- t( apply(s.theta.sort, 2, function(x) quantile(x,c(.025,.25,.5, .75, .975),type=1)) )
p <- round(p,3)
Table 5.3 from BDA3:
School | 2.5% | 25% | median | 75% | 97.5% |
---|---|---|---|---|---|
A | -1.517 | 6.275 | 10.086 | 15.274 | 33.036 |
B | -4.568 | 3.624 | 7.682 | 11.594 | 20.251 |
C | -10.038 | 1.954 | 6.699 | 10.697 | 19.491 |
D | -5.786 | 3.827 | 7.539 | 11.633 | 20.098 |
E | -8.444 | 1.178 | 5.309 | 9.039 | 16.515 |
F | -8.304 | 2.325 | 6.373 | 10.336 | 18.247 |
G | -0.422 | 6.026 | 10.192 | 14.357 | 25.733 |
H | -6.794 | 4.223 | 8.296 | 12.752 | 25.081 |
Here we reproduce figure 5.8 (with the same problems as above)
par(mfrow=c(1,2))
par(mar = rep(2, 4))
domain <- c(-20,60)
hist(s.theta[,1], breaks=10, xlab="Effect in School A", main="", xlim=domain)
hist(apply(s.theta,1,max), breaks=10, xlab="Largest Effect", main="", xlim=domain)
title(main="Figure 5.8 from BDA3")
This last figure (“largest effect”) is a good example of one of the main advantages of a fully Bayesian hierarchical model: once we have correctly simulated the posterior, we can test all kinds of complicated hypotheses.
aux1 <- apply(s.theta,1,max)
best <- apply(1*(s.theta==aux1), 2,mean)
Table 2: Probability that each coaching program is the best among the eight schools
School | Probability of having the best coaching program |
---|---|
A | 0.231 |
B | 0.113 |
C | 0.094 |
D | 0.093 |
E | 0.065 |
F | 0.068 |
G | 0.21 |
H | 0.126 |
p <- sapply(1:8,function(y) sapply(1:8,function(x) mean(1*(s.theta[,x]>s.theta[,y]))))
Table 2: Probability that school \(j\) (row) has a better program than school \(k\) (column)
School \(j\)/School \(k\) | A | B | C | D | E | F | G | H |
---|---|---|---|---|---|---|---|---|
A | 0 | 0.632 | 0.672 | 0.626 | 0.744 | 0.707 | 0.5 | 0.613 |
B | 0.368 | 0 | 0.578 | 0.512 | 0.611 | 0.555 | 0.377 | 0.468 |
C | 0.328 | 0.422 | 0 | 0.443 | 0.557 | 0.505 | 0.318 | 0.421 |
D | 0.374 | 0.488 | 0.557 | 0 | 0.622 | 0.574 | 0.368 | 0.467 |
E | 0.256 | 0.389 | 0.443 | 0.378 | 0 | 0.445 | 0.268 | 0.346 |
F | 0.293 | 0.445 | 0.495 | 0.426 | 0.555 | 0 | 0.32 | 0.417 |
G | 0.5 | 0.623 | 0.682 | 0.632 | 0.732 | 0.68 | 0 | 0.59 |
H | 0.387 | 0.532 | 0.579 | 0.533 | 0.654 | 0.583 | 0.41 | 0 |
Right way to do it:
\[
\begin{align}
p(\theta_{j}>\max_{i\neq j}\{\theta_{i}\}) &= \int \prod_{i\neq j} p(\theta_{j}>\theta_{i}) \phi(\theta_{j}|y_{j},\sigma_{j})d\theta_{j} \\
&= \int \prod_{i\neq j} \Phi\left(\frac{\theta_{j} - \theta_{i}}{\sigma_{i}}\right) \phi(\theta_{j}|y_{j},\sigma_{j})d\theta_{j}
\end{align}
\]
This integral has to be solved numerically:
set.seed(142857)
best <- sapply(1:8,
function(y) mean(sapply(1:1000 ,
function(x) prod(pnorm(( rnorm( 1,effect[y],se.effect[y] ) - effect[-y]) / se.effect[-y] )))
)
)
# Ad-hoc normalization:
best <- best/sum(best)
Table 2.1: Probability that each coaching program is the best among the eight schools (with \(\tau = \infty\))
School | Probability of having the best coaching program |
---|---|
A | 0.5599 |
B | 0.033 |
C | 0.0265 |
D | 0.0364 |
E | 0.0034 |
F | 0.0136 |
G | 0.1615 |
H | 0.1656 |
The following table presents the different values for the expression above:
p <- sapply(1:8,function(x)
sapply(1:8,function(y)
pnorm( q = 0, mean = (effect[x] - effect[y]) / sqrt(se.effect[x]^2 + se.effect[y]^2) ,
sd = 1 )
) )
# Force all elements on the diagonal to zero.
p <- p - .5 * diag(8)
Table 2.1: Probability that school \(j\) (row) has a better program than school \(k\) (column), with \(\tau = \infty\)
School \(j\)/School \(k\) | A | B | C | D | E | F | G | H |
---|---|---|---|---|---|---|---|---|
A | 0 | 0.8664 | 0.9212 | 0.8705 | 0.9513 | 0.9267 | 0.7105 | 0.7527 |
B | 0.1336 | 0 | 0.7201 | 0.5268 | 0.7482 | 0.6811 | 0.2398 | 0.423 |
C | 0.0788 | 0.2799 | 0 | 0.3033 | 0.4566 | 0.4184 | 0.1329 | 0.2667 |
D | 0.1295 | 0.4732 | 0.6967 | 0 | 0.7132 | 0.6501 | 0.2297 | 0.4063 |
E | 0.0487 | 0.2518 | 0.5434 | 0.2868 | 0 | 0.444 | 0.0789 | 0.2591 |
F | 0.0733 | 0.3189 | 0.5816 | 0.3499 | 0.556 | 0 | 0.1264 | 0.301 |
G | 0.2895 | 0.7602 | 0.8671 | 0.7703 | 0.9211 | 0.8736 | 0 | 0.6146 |
H | 0.2473 | 0.577 | 0.7333 | 0.5937 | 0.7409 | 0.699 | 0.3854 | 0 |
5.3c The difference between the closed-form solutions (5.3b) and the Bayesian analysis (5.3a) is that the latter presents less extreme probability estimates (shrinkage).
5.3d If \(\tau = 0\), then all effects are the same, so the comparison probabilities degenerate to 0 or 1 for all schools (every school is simultaneously the largest and the smallest effect).
5.13 - Bicycles
# Load data (select() below comes from the dplyr package)
library(dplyr)
data <- read.csv("C:/Users/fhocesde/Documents/tutorials/Bayesian/bicycles.txt", header=T, sep = "")
data <- data[1,]
y <- as.numeric(select(data,b1:b10))
n <- as.numeric(select(data, b1:b10)+select(data,v1:v10))
\[ \begin{align} p(\theta,\alpha,\beta|y) &\propto p(\alpha, \beta)p(\theta|\alpha, \beta)p(y|\theta,\alpha, \beta)\\ p(\theta,\alpha,\beta|y) &\propto p(\alpha, \beta)\prod^{J}_{j=1} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta_{j}^{\alpha - 1} (1 - \theta_{j})^{\beta - 1}\prod^{J}_{j=1} \theta_{j}^{y_{j}} (1 - \theta_{j})^{n_{j}-y_{j}}\label{rat.joint.post1} \end{align} \]
Now we compute the posterior marginal of \((\alpha,\beta)\). Given that we do have a closed form solution in step 2, we compute the ratio of (\ref{rat.joint.post1}) and (\ref{rat.cond.post.theta1}). \[ \begin{align} p(\alpha,\beta|y) &\propto p(\alpha, \beta)\prod^{J}_{j=1} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \frac{\Gamma(\alpha+y_{j})\Gamma(\beta+n_{j}-y_{j})}{\Gamma(\alpha + \beta +n_{j})} \label{rat.marg.post.phi1} \end{align} \]
\[ \begin{align} \hat{\mu} &= 0.1961 = \frac{\hat{\alpha_{0}}}{\hat{\alpha_{0}}+\hat{\beta_{0}}}\\ \hat{\sigma^2} &= 0.0111 = \frac{\hat{\alpha_{0}}\hat{\beta_{0}}}{(\hat{\alpha_{0}}+\hat{\beta_{0}})^{2}(\hat{\alpha_{0}}+\hat{\beta_{0}}+1)} \end{align} \]
Solving for \((\hat{\alpha_{0}},\hat{\beta_{0}})\):
# Here 'x' represents (alpha, beta); nleqslv() below requires the nleqslv package
library(nleqslv)
dslnex <- function(x) {
z <- numeric(2)
z[1] <- x[1]/(x[1]+x[2]) - mean(y/n)
z[2] <- x[1]*x[2]/(((x[1]+x[2])^2)*(x[1]+x[2]+1)) - sd(y/n)^2
z
}
sol1 <- nleqslv(c(1,1), dslnex)
res1 <- paste("(",round(sol1$x[1],1), ",", round(sol1$x[2],1), ")",sep="")
We get: \((\hat{\alpha_{0}},\hat{\beta_{0}}) = (2.6,10.6)\).
We center the grid (approximately) around that initial estimate and expand the grid to cover up to a factor of 4 of each parameter. The result is plotted in the following figure:
rat.marg.post.phi = function(alpha, beta) {
post = 1
#notice the censoring in n
for (i in 1:length(y)) {
if (n[i] > 100) n[i] = 100
post = post * ( ( ( gamma(alpha + beta) ) / ( gamma(alpha) * gamma(beta) ) ) *
( ( gamma(alpha + y[i] ) * gamma(beta + n[i] - y[i]) )/( gamma(alpha + beta + n[i]) ) ) )
}
# The hyper prior is defined below
rat.hyper.prior(alpha,beta) * post
}
rat.hyper.prior = function(alpha,beta) {
alpha*beta*(alpha + beta)^(-5/2)
}
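# Sketch of an alternative (my own addition): gamma() overflows for large arguments
# (gamma(172) is already Inf in double precision), so the same unnormalized marginal
# can be computed more stably on the log scale with lgamma().
rat.marg.post.phi.log = function(alpha, beta) {
  log.post = log(rat.hyper.prior(alpha, beta))
  for (i in 1:length(y)) {
    log.post = log.post + lgamma(alpha + beta) - lgamma(alpha) - lgamma(beta) +
      lgamma(alpha + y[i]) + lgamma(beta + n[i] - y[i]) - lgamma(alpha + beta + n[i])
  }
  log.post
}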
v1 = seq(log(sol1$x[1]/sol1$x[2])*1.5,log(sol1$x[1]/sol1$x[2])/1.5,length.out =151)
v2 = seq(log(sol1$x[1]+sol1$x[2])/1.5,log(sol1$x[1]+sol1$x[2])*1.5,length.out =151)
beta = exp(v2)/(exp(v1)+1)
alpha = exp(v2+v1)/(exp(v1)+1)
post.dens = outer(alpha,beta,function(x1,x2) log(rat.marg.post.phi(x1, x2)) )
post.dens = exp(post.dens - max(post.dens))
post.dens = post.dens/sum(post.dens)
contours <- seq(min(post.dens), max(post.dens) , length=10)
contour(v1, v2, post.dens,levels=contours, xlab=expression(log(alpha/beta)), ylab=expression(log(alpha+beta)), xlim=c(min(v1),max(v1)) , ylim=c(min(v2),max(v2)), drawlabels=FALSE, main="Contour plot of joint posterior")
Adjust the grid and repeat:
v1 = seq(log(sol1$x[1]/sol1$x[2])*1.5,log(sol1$x[1]/sol1$x[2])/1.5,length.out =151)
v2 = seq(log(sol1$x[1]+sol1$x[2])/3,log(sol1$x[1]+sol1$x[2])*1.5,length.out =151)
beta = exp(v2)/(exp(v1)+1)
alpha = exp(v2+v1)/(exp(v1)+1)
post.dens = outer(alpha,beta,function(x1,x2) log(rat.marg.post.phi(x1, x2)) )
post.dens = exp(post.dens - max(post.dens))
post.dens = post.dens/sum(post.dens)
contours <- seq(min(post.dens), max(post.dens) , length=10)
contour(v1, v2, post.dens,levels=contours, xlab=expression(log(alpha/beta)), ylab=expression(log(alpha+beta)), xlim=c(min(v1),max(v1)) , ylim=c(min(v2),max(v2)), drawlabels=FALSE, main="Contour plot of joint posterior")
Draw samples \((\alpha^{s}, \beta^{s})\) from \(p(\alpha,\beta|y)\) (finally!). Here we repeat the procedure used in section 3.(v) of the book replication document.
samps = 1000
v1.dens = apply(post.dens ,1, sum)
s.v1 = sample(v1,samps, replace=TRUE, prob = v1.dens)
# Select the row of the joint density corresponding to a specific value of v1, i.e. p(v2|v1)
cond.v2 = function(x) {
post.dens[which(v1 == s.v1[x]),]
}
# Sample a value of v2 according to the conditional probability above
s.v2 = sapply(1:samps,function(x) sample(v2,1,replace=TRUE,prob=cond.v2(x)))
# Add a uniform random jitter centered at zero with width equal to the grid spacing. This makes the simulation draws more continuous. Plot the sampled values.
grid.v1 = v1[2]-v1[1]
grid.v2 = v2[2]-v2[1]
s.v2 = s.v2 + runif(length(s.v2),-grid.v2/2,grid.v2/2)
s.v1 = s.v1 + runif(length(s.v1),-grid.v1/2,grid.v1/2)
plot(s.v1, s.v2, xlab=expression(log(alpha/beta)^s), ylab=expression(log(alpha+beta)^s), xlim=c(min(v1),max(v1)) , ylim=c(min(v2),max(v2)), main="Scatter Plot of Sample Draws of log(alpha/beta) and log(alpha+beta)")
By applying the inverse of the transformation we recover the marginal distribution of the original hyper-parameters.
s.beta = exp(s.v2)/(exp(s.v1)+1)
s.alpha = exp(s.v2+s.v1)/(exp(s.v1)+1)
theta.dist = sapply(1:10, function(x) rbeta(1000,s.alpha+y[x], s.beta + n[x] - y[x]))
theta.dist = apply(theta.dist,2,sort)
plot(0:600/1000,0:600/1000, type="l", xlab="Observed rate",ylab="95% CI and median of posterior")
jitter.x = y/n + runif(length(y),-0.01,0.01)
points(jitter.x,theta.dist[500,])
segments(jitter.x,theta.dist[25,], jitter.x,theta.dist[975,] )
title(main="Posterior Distribution of Bike Rates for all 10 streets")
The estimated proportions are almost the same as the raw proportions (no shrinkage).
5.13d We generate 1000 draws from a \(Beta(\alpha^{s},\beta^{s})\), where the parameters come from the draws obtained above:
s.theta <- rbeta(1000, shape1 =s.alpha , shape2 = s.beta)
CI.num <- round(s.theta[order(s.theta)][c(25,975)],2)
CI.str <- paste("(" , CI.num[1] , "," , CI.num[2] , ")")
The 95% posterior interval for the underlying rate \(\theta\) of a new street is \((0.01, 0.48)\).
5.13e If a new street opens with 100 vehicles per day, the interval above translates into between 1 and 48 bicycles with 95% posterior probability. This interval is not very informative, as it covers almost the whole range of observed bike rates.
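A fuller posterior predictive check (a sketch with my own variable names) would also add the binomial sampling variation: draw \(\tilde{y} \sim Bin(100, \theta^{s})\) for each simulated \(\theta^{s}\) and take the quantiles of those counts.
new.n <- 100
s.pred <- rbinom(length(s.theta), size = new.n, prob = s.theta)
quantile(s.pred, c(0.025, 0.975))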
5.13f The beta assumption might not have been so reasonable as the posterior estimates did not show much shrinkage.
5.14
(To do)
10.1
(To do)
10.4a
(To do)
11.2
(To do)
11.3
(To do)
11.6a
(To do)
12
Install and run examples in STAN [DONE]
14
(To do)
Suggestion: Run a logit regression using a frequentist (through ML) and Bayesian (through GS or MH) and compare the results.
15
(To do)
Suggestion: Run a full HLM example and compare with a fixed effect regression.
In Ch3 of BDA they present a first simulation exercise (pg 66-67) where, from a sample of 66 observations with mean 26.2 and standard deviation 10.8, they simulate the posterior for the mean assuming a normal likelihood and an uninformative prior (where \(\sigma^{2}|y\sim Inv-\chi^{2}(n-1,s^{2})\) and \(\mu|\sigma^{2},y \sim N(\bar{y},\sigma^{2}/n)\)).
# Values from the example: n = 66 observations, mean 26.2, standard deviation 10.8
n <- 66; ybar <- 26.2; se <- 10.8
# Draw random numbers from the scaled inverse chi-squared posterior of sigma^2 ...
f.sim.sigma = function(draws) (n-1)*(se^2)/rchisq(draws,n-1)
# ... and, given each sigma^2 draw, from the conditional posterior of mu: N(ybar, sigma^2/n)
f.sim.norm = function(draws) rnorm(draws,ybar,(f.sim.sigma(draws)/n)^.5)
sim.norm = f.sim.norm(1000)
ci.mu = sim.norm[order(sim.norm)][c(25,975)]
print(ci.mu)
## [1] 23.64 28.84
# And now repeating the simulation exercise R = 100 times:
R = 100
R.sim.norm = sapply(1:R, function(x) f.sim.norm(1000))
ci.mu = t(apply(R.sim.norm,2,function(x) x[order(x)][c(25,975)]))
print(c(median_low=median(ci.mu[,1]), median_high=median(ci.mu[,2]), p10_low =quantile(ci.mu[,1],c(.1),type=1), p90_up =quantile(ci.mu[,2],c(.9),type=1) ))
## median_low median_high p10_low.10% p90_up.90%
## 23.52 28.88 23.38 29.02
The following is not an exercise in the book (that I’m aware of), but I thought it would be interesting to work on it. After going over the model derived in pg 67-69 of BDA3, I will simulate values from the joint posterior (and hopefully leave the code flexible enough to play later with the hyper-parameters).
With the prior density: \[ \begin{align} p(\mu,\sigma^{2}) &\propto \sigma^{-1}(\sigma^{2})^{-(\nu_{0}/2 + 1)}exp\left ( -\frac{1}{2\sigma^{2}} [\nu_{0}\sigma^{2}_{0} + \kappa_{0}(\mu_{0} - \mu)^{2} ]\right ) \end{align} \] And a likelihood: \[ \begin{align} p(y|\mu,\sigma^{2}) &\propto \sigma^{-n}exp\left ( -\frac{1}{2\sigma^{2}} [(n-1)s^{2} + n(\bar{y} - \mu)^{2} ]\right ) \end{align} \]
We obtain the following joint posterior: \[ \begin{align} p(\mu,\sigma^{2}|y) &\propto \sigma^{-1}(\sigma^{2})^{-(\nu_{0}/2 + 1)}exp\left ( -\frac{1}{2\sigma^{2}} [\nu_{0}\sigma^{2}_{0} + \kappa_{0}(\mu_{0} - \mu)^{2} ]\right ) \times \nonumber \\ &\times (\sigma^{2})^{-n/2}exp\left ( -\frac{1}{2\sigma^{2}} [(n-1)s^{2} + n(\bar{y} - \mu)^{2} ]\right ) \nonumber\\ &= N-Inv-\chi^{2}(\mu_n,\sigma^{2}_{n}/\kappa_{n};\nu_{n},\sigma^{2}_{n})\\ \mu_{n} &= \frac{\kappa_{0}}{\kappa_{0} + n} \mu_{0} + \frac{n}{\kappa_{0} + n}\bar{y} \nonumber\\ \kappa_{n} &= \kappa_{0} + n \nonumber\\ \nu_{n} &= \nu_{0} + n \nonumber\\ \nu_{n}\sigma_{n}^{2} &= \nu_{0}\sigma_{0}^{2} + (n-1)s^{2} + \frac{\kappa_{0}n}{\kappa_{0} + n} (\bar{y} - \mu_0)^{2}. \nonumber \end{align} \]
Using the fact that \(p(\mu,\sigma^{2}|y) = p(\mu|\sigma^{2},y)p(\sigma^{2}|y)\) with:
\[
\begin{align}
\mu|\sigma^{2},y &\sim N(\mu_{n},\sigma^{2}/\kappa_{n}) \label{post.mu}\\
\sigma^{2}|y &\sim Inv-\chi^{2}(\nu_{n},\sigma^{2}_{n}) \label{post.sig}
\end{align}
\] The simulation algorithm is as follows:
* Draw a random number from the marginal posterior of \(\sigma^{2}\) (equation \ref{post.sig})
* Using that value of \(\sigma^{2}\), draw a random number from the conditional posterior of \(\mu\) (equation \ref{post.mu})
The following code implements this algorithm.
# Hyperparameters
p.mu_0 = 0
p.sigma2_0 = 1
p.nu_0 = 4
p.kappa_0 = 3
# Data
n = 10
y_bar = 1.3
s = 2
# Simulation
mu_n = ((p.kappa_0)/(p.kappa_0 + n)) * p.mu_0 + ((n)/(p.kappa_0 + n)) * y_bar
kappa_n = p.kappa_0 + n
nu_n = p.nu_0 + n
sigma2_n = (p.nu_0*p.sigma2_0 + (n-1)*s^{2} + ((p.kappa_0*n)/(p.kappa_0 + n)) * (y_bar - p.mu_0)^{2})/nu_n
set.seed(142857)
# Draw 1000 random numbers from an inverse chi-squared
f.mu.post = function(draws) {
sigma2.cond = (nu_n)*(sigma2_n)/rchisq(draws,nu_n)
mu.cond = rnorm(draws,mu_n,(sigma2.cond/(kappa_n))^.5)
return(mu.cond)
}
sim.mu.post= f.mu.post(1000)
plot(density(sim.mu.post), main = "Draws from the marginal of the mean", col="red")
lines(density(mu_n +((sigma2_n/kappa_n)^.5)*rt(1000,nu_n)))
legend("topleft",col = c("red","black") ,legend=c("Simulated","Analytical"),lty = c(1,1))
The plot above should be in 3D (\(\sigma^{2}, \mu, p(\sigma^{2}, \mu|y)\)), but I still don’t know how to do that in R.
It also happens that for this particular problem there is a closed-form solution for the marginal of \(\mu\) (remember that \(p(\mu|y) = \int p(\mu|\sigma^{2},y) p(\sigma^{2}|y)d\sigma^{2}\)):
\[ \begin{align} \mu|y &\sim t_{\nu_{n}}(\mu|\mu_{n}, \sigma^2_{n}/\kappa_{n}) \end{align} \]
I’m still confused by the following: why is it that to simulate the marginal of \(\mu\) (i.e. \(\mu|y\)) I just need to draw from the conditional posterior of the mean (equation \ref{post.mu})? More generally, my confusion is this: why is it that to simulate \(h(x) = \int g(x) dF(x)\) I just need to draw \(x^{s}\) from \(dF(x)\) and evaluate \(g(x^{s})\)? I know from MC simulation that by the LLN \(E_{dF}[g(x)] = \int g(x) dF(x) \approx \sum_{s}g(x^{s})/S\), but I’m having trouble seeing why this applies to all points in the distribution (and not just the mean).
My own explanation so far: as MC can be used to approximate the mean using the LLN, the same technique can be used to approximate the whole distribution using CLT.
Simulating numbers from a Dirichlet distribution
draws = 1000
# These are the parameters of the gammas used in generating the final Dirichlet draws.
gamma.scale = c(728,538,138)
aux1 = sapply(gamma.scale,function(x) rgamma(draws,x))
# Normalize each row so it sums to one; each row is then one Dirichlet draw
aux1 = aux1 / apply(aux1,1,sum)
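As a quick usage example (my own addition): each row of aux1 is now a draw from the Dirichlet, so any function of the proportions, e.g. the difference between the first two components, can be summarized directly from the draws.
diff.12 = aux1[,1] - aux1[,2]
quantile(diff.12, c(0.025, 0.5, 0.975))
# Posterior probability that the first proportion exceeds the second
mean(diff.12 > 0)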
We are interested in modeling the dose-response of a certain drug (\(x_{i}\)) on the number of deaths (\(y_{i}\)) in groups of trial animals (\(n_{i}\)). We have 4 observations. Defining \(\theta_{i}\) as the true death rate for each dosage, we can model this phenomenon using a binomial distribution (\(y_{i} \sim Bin(n_{i},\theta_{i})\)). To model the dose-response relationship we start by looking at \(\theta_{i} = \alpha +\beta x_{i}\), but we realize that this model predicts values out of range (\(\theta\), as a probability, has to be between 0 and 1). Hence we apply the logit transformation (\(log(\theta_{i}/(1-\theta_{i}))\)). This implies the following likelihood.
\[ \begin{align} p(y_{i}|\alpha,\beta,n_{i},x_{i}) \propto [logit^{-1}(\alpha +\beta x_{i})]^{y_i}[1-logit^{-1}(\alpha +\beta x_{i})]^{n_{i}- y_{i}} \end{align} \]
Question for Susan: why is it that in the likelihood above we do not multiply by the Jacobian of the transformation? Using a non-informative prior (\(p(\alpha,\beta)\propto 1\)) we get that the posterior has the form: \[ \begin{align} p(\alpha,\beta|y,n,x) \propto \prod_{i=1}^{k} p(y_{i}|\alpha,\beta,n_{i},x_{i}) \label{post.1} \end{align} \]
First we compute, as a rough approximation of the parameters, the ML estimates.
# Data
x = c(-0.86, -0.30, -0.05, 0.73)
y = c(0,1,3,5)
n = rep(5,4)
# ML estimation
mlfit <- glm( cbind(y, n-y) ~ x, family = binomial)
mlfit$coefficients = round(mlfit$coefficients*1000)/1000
res1 = paste(paste("(",mlfit$coefficients[1], sep=""), paste(mlfit$coefficients[2],")", sep=""), sep=",")
With the estimates \((\hat{\alpha},\hat{\beta}) = (0.847,7.749)\).
Now we will simulate values of \(\alpha\) and \(\beta\) from the posterior distribution. The approach is as follows.
alpha = seq(-5,10,.1)
beta = seq(-10,40,1)
# inv.logit() below is the logistic function 1/(1+exp(-x)), available in e.g. the 'boot' package
post.a = function(a,b,x,y,n) {
post = 1
for (i in 1:length(y)) {
post = post * ( ((inv.logit(a+b*x[i]))^y[i])*((1-inv.logit(a+b*x[i]))^(n[i]-y[i])) )
}
post
}
post.dens = outer(alpha,beta,function(x1,x2) post.a(x1,x2,x,y,n))
post.dens = post.dens/sum(post.dens)
contours <- seq(min(post.dens), max(post.dens) , length=10)
par(mar = rep(2, 4))
contour(alpha, beta, post.dens,levels=contours, xlab=expression(alpha), ylab=expression(beta), xlim=c(min(alpha),max(alpha)) , ylim=c(min(beta),max(beta)), drawlabels=FALSE, main="Contour plot of joint posterior")
samps = 1000
alpha.dens = apply(post.dens ,1, sum)
s.alpha = sample(alpha,samps, replace=TRUE, prob = alpha.dens)
# Select the row of the joint density corresponding to a specific value of alpha, i.e. p(beta|alpha)
cond.beta = function(x) {
post.dens[which(alpha == s.alpha[x]),]
}
# Sample a value of beta according to the conditional probability above
s.beta = sapply(1:samps,function(x) sample(beta,1,replace=TRUE,prob=cond.beta(x)))
s.beta = s.beta + runif(length(s.beta),-.5,.5)
s.alpha = s.alpha + runif(length(s.alpha),-.05,.05)
plot(s.alpha, s.beta, xlab=expression(alpha^s), ylab=expression(beta^s), xlim=c(min(alpha),max(alpha)) , ylim=c(min(beta),max(beta)), main="Scatter Plot of Sample Draws of alpha and beta")
The final result is two vectors, \(\{\alpha^{s}\}\) and \(\{\beta^{s}\}\), that represent draws from the joint posterior distribution of the parameters.
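With these draws, any function of \((\alpha,\beta)\) can be summarized directly. For example, a sketch (my own code) of the posterior of the dose at which the death rate is 50%, \(-\alpha/\beta\), keeping only draws with \(\beta > 0\):
ld50 = -s.alpha[s.beta > 0] / s.beta[s.beta > 0]
quantile(ld50, c(0.025, 0.5, 0.975))
hist(ld50, breaks = 20, main = "Posterior of the dose with a 50% death rate", xlab = "x")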
The overall goal of hierarchical models is to describe a statistical (uncertain) phenomenon in which multiple parameters are involved and a dependence structure exists among them.
The analysis described here builds on what we just did in the example above. For this purpose we enumerate the steps followed, now in a more general notation.
\[ \begin{align} p(\theta_{1}|y) = \int p(\theta_{1}|\theta_{2},y) p(\theta_{2}|y) d\theta_{2} \approx p(\theta_{1}|\theta^{s}_{2},y) \end{align} \] Where \(\theta^{s}_{2}\) is a random draw using \(p(\theta_{2}|y)\) (the justification for this last step is in section 2 of this document).
Now we will distinguish between two sets of parameters: the original parameters of interest, still represented by \(\theta\), and the parameters of the prior distribution, or hyperparameters, represented by \(\phi\). These two sets of parameters fully describe the probabilistic process and have a joint distribution function:
\[ \begin{align} p(\phi,\theta) = p(\phi)p(\theta|\phi) \label{joint} \end{align} \]
And a joint posterior distribution: \[ \begin{align} p(\phi,\theta|y) &\propto p(\theta,\phi) p(y|\theta,\phi) \nonumber\\ &= p(\phi)p(\theta|\phi)p(y|\theta) \label{joint.post} \end{align} \] Where in the last line we used the assumption that \(p(y|\theta,\phi)\) depends on \(\phi\) only through \(\theta\). We call \(p(\phi)\) the hyperprior distribution, \(p(\theta|\phi)\) the population distribution, and the usual suspect \(p(y|\theta)\) is the likelihood, or sampling, distribution.
As always our goal is to make inference about \(\theta\) (and maybe \(\phi\) also?). As in the exercise before, our end result will be a set of matrices \((\{\phi^{s}\}), (\{\theta^{s}\})\) with values for the parameters that follow the posterior joint distribution. To get this result we take the following steps:
1. Write down the joint posterior \(p(\phi,\theta|y)\), as in (\ref{joint.post}).
2. Compute the conditional posterior of the parameters given the hyperparameters, \(p(\theta|\phi,y)\).
3. Compute the marginal posterior of the hyperparameters, \(p(\phi|y)\). When step 2 has a closed-form solution this can be obtained as the ratio of the joint and the conditional posterior; in the absence of a closed-form solution for step 2, we can compute the marginal by integrating the joint over all the possible values of \(\theta\).
4. Draw samples \(\phi^{s}\) from \(p(\phi|y)\). Notice that \(\phi\) can have multiple components. If it has more than one element we follow the procedure described at the beginning of this section.
5. For each draw of \(\phi^{s}\), draw a sample of \(\theta\) from \(p(\theta|\phi^{s},y)\). This allows us to fully characterize the parameters of interest (our goal, remember?).
Application
We have data from \(j = 1 \dots J\), \(J = 71\) experiments where, in each experiment, \(n_{j}\) rats were exposed to a drug and \(y_{j}\) of them presented tumors afterwards (think also of something similar to the unemployment rate in different states, with \(n_{j}\) people in the active labor force and \(y_{j}\) unemployed). The results of the experiments are assumed to follow a binomial distribution (\(y_{j} \sim Bin(n_{j},\theta_{j})\)), and the parameters \(\theta_{j}\) are assumed to follow a beta prior distribution (\(\theta_{j} \sim Beta(\alpha,\beta)\)); this last assumption is made to take advantage of conjugacy. With all these elements we can now follow the steps described above.
Compute the conditional joint posterior \(p(\phi=(\alpha,\beta),\theta = \{\theta_{j}\}_{j=1}^{J}|y)\). \[ \begin{align} p(\theta,\alpha,\beta|y) &\propto p(\alpha, \beta)p(\theta|\alpha, \beta)p(y|\theta,\alpha, \beta) \nonumber\\ p(\theta,\alpha,\beta|y) &\propto p(\alpha, \beta)\prod^{J}_{j=1} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta_{j}^{\alpha - 1} (1 - \theta_{j})^{\beta - 1}\prod^{J}_{j=1} \theta_{j}^{y_{j}} (1 - \theta_{j})^{n_{j}-y_{j}}\label{rat.joint.post} \end{align} \]
Compute the marginal posterior of \(\theta\), conditional on \(\alpha, \beta\). For the beta-binomial case we have that, given the hyper-parameters, each \(\theta_{j}\) has a posterior distribution \(Beta(\alpha + y_{j}, \beta +n_{j} - y_{j})\). Assuming exchangeability: \[ \begin{align} p(\theta|\alpha,\beta,y) &= \prod^{J}_{j=1} \frac{\Gamma(\alpha + \beta +n_{j})}{\Gamma(\alpha+y_{j})\Gamma(\beta+n_{j}-y_{j})} \theta_{j}^{\alpha+y_{j} - 1} (1 - \theta_{j})^{\beta+n_{j}-y_{j} - 1}\label{rat.cond.post.theta} \end{align} \]
Compute the posterior marginal of \((\alpha,\beta)\). Given that we do have a closed form solution in step 2, we compute the ratio of (\ref{rat.joint.post}) and (\ref{rat.cond.post.theta}). \[ \begin{align} p(\alpha,\beta|y) &\propto p(\alpha, \beta)\prod^{J}_{j=1} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \frac{\Gamma(\alpha+y_{j})\Gamma(\beta+n_{j}-y_{j})}{\Gamma(\alpha + \beta +n_{j})} \label{rat.marg.post.phi} \end{align} \]
And here is the code for this function:
rat.marg.post.phi = function(alpha, beta) {
post = 1
for (i in 1:length(y)) {
post = post * ( ( ( gamma(alpha + beta) )/( gamma(alpha)*gamma(beta) ) ) * ( ( gamma(alpha + y[i] )*gamma(beta + n[i] - y[i]) )/( gamma(alpha + beta + n[i]) ) ) )
}
# The hyper prior is defined below
rat.hyper.prior(alpha,beta) * post
}
Note: we assume a hyper-prior of the form \(p(\alpha,\beta)\propto (\alpha + \beta)^{-5/2}\). So far I do not understand very well the whole discussion about proper and improper priors; I need to review Ch10 of DeGroot’s book.
\[
\det J = \det \begin{bmatrix}
\frac{\partial f^{-1}_{1}}{\partial v_{1}} & \frac{\partial f^{-1}_{1}}{\partial v_{2}} \\[0.3em]
\frac{\partial f^{-1}_{2}}{\partial v_{1}} & \frac{\partial f^{-1}_{2}}{\partial v_{2}}
\end{bmatrix}
= - \alpha \beta
\]
Hence, using the transformation method (with the absolute value of the Jacobian determinant) we have that \(p(log(\frac{\alpha}{\beta}), log(\alpha+\beta)) \propto \alpha \beta(\alpha + \beta)^{-5/2}\). And here is the code for this function:
data = read.csv("C:/Users/fhocesde/Documents/tutorials/Bayesian/rats.csv", header=T, sep = " ")
y = data[,1]
n = data[,2]
rat.hyper.prior = function(alpha,beta) {
alpha*beta*(alpha + beta)^(-5/2)
}
4.2. Identify the relevant domain for the transformed variables and its counterpart for the original variables. To do so we start by computing the parameters \(\alpha\) and \(\beta\) that would match the sample mean and standard deviation of all 71 experiments with a beta prior distribution:
\[ \begin{align} \hat{\mu} &= 0.1381 = \frac{\hat{\alpha_{0}}}{\hat{\alpha_{0}}+\hat{\beta_{0}}}\\ \hat{\sigma^2} &= 0.0109 = \frac{\hat{\alpha_{0}}\hat{\beta_{0}}}{(\hat{\alpha_{0}}+\hat{\beta_{0}})^{2}(\hat{\alpha_{0}}+\hat{\beta_{0}}+1)} \end{align} \]
Solving for \((\hat{\alpha_{0}},\hat{\beta_{0}})\):
dslnex <- function(x) {
z <- numeric(2)
z[1] <- x[1]/(x[1]+x[2]) - mean(y/n)
z[2] <- x[1]*x[2]/(((x[1]+x[2])^2)*(x[1]+x[2]+1)) - sd(y/n)^2
z
}
sol1 <- nleqslv(c(1,1), dslnex)
res1 <- paste("(",round(sol1$x[1],1), ",", round(sol1$x[2],1), ")",sep="")
We get: \((\hat{\alpha_{0}},\hat{\beta_{0}}) = (1.4,8.6)\).
We center the grid (approximately) around that initial estimate and expand the grid to cover up to a factor of 4 of each parameter. The result is plotted in the following figure:
v1 = seq(log(sol1$x[1]/sol1$x[2])*2,log(sol1$x[1]/sol1$x[2])/2,length.out =151)
v2 = seq(log(sol1$x[1]+sol1$x[2])/2,log(sol1$x[1]+sol1$x[2])*2,length.out =151)
beta = exp(v2)/(exp(v1)+1)
alpha = exp(v2+v1)/(exp(v1)+1)
post.dens = outer(alpha,beta,function(x1,x2) log(rat.marg.post.phi(x1, x2)) )
post.dens = exp(post.dens - max(post.dens))
post.dens = post.dens/sum(post.dens)
contours <- seq(min(post.dens), max(post.dens) , length=10)
contour(v1, v2, post.dens,levels=contours, xlab=expression(log(alpha/beta)), ylab=expression(log(alpha+beta)), xlim=c(min(v1),max(v1)) , ylim=c(min(v2),max(v2)), drawlabels=FALSE, main="Contour plot of joint posterior")
4.3. Recalculate the range of the grid so that it includes all of the density.
v1 = seq(-2.3,-1.3,length.out =151)
# The book sets the range of "v2" up to 5, but my gamma function calculator gives me trouble after 4.8
v2 = seq(1 , 4.5,length.out =151)
beta = exp(v2)/(exp(v1)+1)
alpha = exp(v2+v1)/(exp(v1)+1)
post.dens = outer(alpha,beta,function(x1,x2) log(rat.marg.post.phi(x1, x2)) )
post.dens = exp(post.dens - max(post.dens))
post.dens = post.dens/sum(post.dens)
contours <- seq(min(post.dens), max(post.dens) , length=10)
contour(v1, v2, post.dens,levels=contours, xlab=expression(log(alpha/beta)), ylab=expression(log(alpha+beta)), xlim=c(min(v1),max(v1)) , ylim=c(min(v2),max(v2)), drawlabels=FALSE, main="Contour plot of joint posterior")
4.4. Draw samples \((\alpha^{s}, \beta^{s})\) from \(p(\alpha,\beta|y)\) (finally!). Here we repeat the procedure used in section 3.(v) of this document.
samps = 1000
v1.dens = apply(post.dens ,1, sum)
s.v1 = sample(v1,samps, replace=TRUE, prob = v1.dens)
# Select the row of the joint density corresponding to a specific value of v1, i.e. p(v2|v1)
cond.v2 = function(x) {
post.dens[which(v1 == s.v1[x]),]
}
# Sample a value of v2 according to the conditional probability above
s.v2 = sapply(1:samps,function(x) sample(v2,1,replace=TRUE,prob=cond.v2(x)))
# Add a uniform random jitter centered at zero with width equal to the grid spacing. This makes the simulation draws more continuous. Plot the sampled values.
grid.v1 = v1[2]-v1[1]
grid.v2 = v2[2]-v2[1]
s.v2 = s.v2 + runif(length(s.v2),-grid.v2/2,grid.v2/2)
s.v1 = s.v1 + runif(length(s.v1),-grid.v1/2,grid.v1/2)
plot(s.v1, s.v2, xlab=expression(log(alpha/beta)^s), ylab=expression(log(alpha+beta)^s), xlim=c(min(v1),max(v1)) , ylim=c(min(v2),max(v2)), main="Scatter Plot of Sample Draws of log(alpha/beta) and log(alpha+beta)")
Note: these two figures do not exactly match their counterparts in the textbook (5.3a and 5.3b in BDA3), but so far I have not been able to detect my mistake.
4.5 By applying the inverse of the transformation we recover the marginal distribution of the original hyper-parameters.
s.beta = exp(s.v2)/(exp(s.v1)+1)
s.alpha = exp(s.v2+s.v1)/(exp(s.v1)+1)
theta.dist = sapply(1:71, function(x) rbeta(1000,s.alpha+y[x], s.beta + n[x] - y[x]))
theta.dist = apply(theta.dist,2,sort)
plot(0:450/1000,0:450/1000, type="l", xlab="Observed rate",ylab="95% CI and median of posterior")
jitter.x = y/n + runif(length(y),-0.01,0.01)
points(jitter.x,theta.dist[500,])
segments(jitter.x,theta.dist[25,], jitter.x,theta.dist[975,] )
title(main="Posterior Distribution of Tumor Rates for all 71 Experiments \n (Remember the goal? This is it!)")
#Data:
school.id <- LETTERS[1:8]
effect <- c(28,8,-3,7,-1,1,18,12)
se.effect <- c(15,10,16,11,9,11,10,18)
pool.est <- sum(effect*se.effect^-2)/sum(se.effect^-2)
pool.var <- sum(se.effect^-2)^-1
pool.ci <- c(-1.96,1.96)*pool.var^.5 + pool.est
The pooled estimated effect and variance are 7.69 and 16.58, with a 95% CI of [-0.3, 15.67].
Posterior simulation under the hierarchical model
Using the identity:
\[
\begin{align}
p(\theta,\mu,\tau|y) = p(\tau|y)p(\mu|\tau,y)p(\theta|\mu,\tau,y)
\end{align}
\] And the results from BDA in equation 5.17, 5.20 and 5.21 we code the joint posterior:
# Eqn 5.17 of BDA3
post.theta.j <- function(mu,tau,j) (effect[j]/(se.effect[j]^2) + mu/(tau^2)) / (1/(se.effect[j]^2) + 1/(tau^2))
post.v.theta.j <- function(tau,j) 1/(1/(se.effect[j]^2) + 1/(tau^2))
# Eqn 5.20 of BDA3
post.mu.hat <- function(tau) sum(effect*1/(se.effect^2 +tau^2))/sum(1/(se.effect^2 +tau^2))
post.v.mu <- function(tau) (sum(1/(se.effect^2 +tau^2)))^-1
# Eqn 5.21 of BDA3
marginal.tau <- function(tau) {
hyper.prior(tau)*(post.v.mu(tau)^.5)*prod(((se.effect^2 + tau^2)^(-.5)) *
exp(-((effect - post.mu.hat(tau))^2)/(2*(se.effect^2 + tau^2))))
}
Define a hyper-prior and draw 200 samples from each distribution (for all 8 schools).
samps <- 200
hyper.prior <- function(tau) 1
tau.grid <- seq(0.001,30, length=samps)
pdf.tau <- sapply(tau.grid,function(x) marginal.tau(x))
pdf.tau <- pdf.tau/sum(pdf.tau)
plot(tau.grid,pdf.tau, type="l", main="Figure 5.5 from BDA3", xlab=expression(tau), ylab="Density")
The sampling method in BDA3 suggests applying the inverse-CDF method to the posterior of \(\tau\). I don’t do this for two reasons: (i) I’m not sure the posterior has a closed-form solution for its inverse, and (ii) given that I already have the density on a grid, I can draw from that distribution directly using the sample command (which leads me to think that this command applies the inverse method, but I need to check with Susan).
# Sampling
s.tau <- sample(tau.grid,samps,prob=pdf.tau, replace=TRUE)
s.mu <- sapply(s.tau,function(x) rnorm(1,post.mu.hat(x),(post.v.mu(x))^0.5))
s.theta <- NULL
for (j in 1:length(school.id)) {
s.theta[[j]] <- sapply(1:samps,
function(x)
rnorm(1,
post.theta.j(s.mu[x],s.tau[x],j),
(post.v.theta.j(s.tau[x],j))^0.5
) )
}
The following figures replicate the figures on pg 122 of BDA. Before doing the plots we need to ‘average over \(\mu\)’:
\[ \begin{align} E(\theta_{j}|\tau,y) &= E_{\mu}\left[E(\theta_{j}|\tau,y,\mu)|\tau,y\right] \nonumber \\ &= E_{\mu}\left[\frac{ \frac{1}{\sigma_{j}^{2}}y_{j} + \frac{1}{\tau^{2}}\mu }{\frac{1}{\sigma_{j}^{2}} + \frac{1}{\tau^{2} } }|\tau,y\right] = \frac{ \frac{1}{\sigma_{j}^{2}}y_{j} + \frac{1}{\tau^{2}}\hat{\mu} }{\frac{1}{\sigma_{j}^{2}} + \frac{1}{\tau^{2} } }\\ \nonumber \\ Var(\theta_{j}|\tau,y) &= E_{\mu}\left[Var(\theta_{j}|\tau,y,\mu)|\tau,y\right] + Var_{\mu}\left[E(\theta_{j}|\tau,y,\mu)|\tau,y\right] \nonumber \\ &= \frac{1}{\frac{1}{\sigma_{j}^{2}} + \frac{1}{\tau^{2} }} + V_{\mu}\left(\frac{\frac{1}{\tau^{2}}}{\frac{1}{\sigma_{j}^{2}} + \frac{1}{\tau^{2} }}\right) \end{align} \]
Where \(V_{\mu}\) and \(\hat{\mu}\) correspond to the expressions defined in Eq 5.20 of BDA3. Below is the code and plot of both equations.
post.theta.j.no.mu <- function(tau,j) post.theta.j(post.mu.hat(tau),tau,j)
post.se.theta.j.no.mu <- function(tau,j) sqrt( (post.v.theta.j(tau,j)) * (1+post.v.mu(tau)*tau^(-2)) )
plot(tau.grid, sapply(tau.grid, function(x) post.theta.j.no.mu(x,1)), type="l", ylim=c(-5,30), xlab="", ylab="")
# Add the remaining seven schools as lines
for (j in 2:8) lines(tau.grid, sapply(tau.grid, function(x) post.theta.j.no.mu(x,j)))
title(main="Figure 5.6 from BDA3", xlab=expression(tau), ylab="Estimated treatment effect")
plot(tau.grid, sapply(tau.grid, function(x) post.se.theta.j.no.mu(x,1)), type="l", ylim=c(0,20), xlab="", ylab="")
# Add the remaining seven schools as lines
for (j in 2:8) lines(tau.grid, sapply(tau.grid, function(x) post.se.theta.j.no.mu(x,j)))
title(main="Figure 5.7 from BDA3", xlab=expression(tau), ylab="Posterior Standard Deviation")
s.theta <- matrix(unlist(s.theta), ncol = 8, byrow = FALSE)
s.theta.sort <- apply(s.theta, 2, sort)
p <- t( apply(s.theta.sort, 2, function(x) quantile(x,c(.025,.25,.5, .75, .975),type=1)) )
p <- round(p,3)
Table 5.3 from BDA3:
School | 2.5% | 25% | median | 75% | 97.5% |
---|---|---|---|---|---|
A | -0.959 | 5.126 | 8.799 | 14.701 | 31.639 |
B | -4.707 | 3.321 | 7.599 | 11.384 | 20.985 |
C | -15.041 | 2.191 | 6.602 | 10.44 | 17.38 |
D | -9.037 | 2.942 | 6.87 | 11.321 | 20.173 |
E | -8.086 | 2.217 | 6.21 | 9.562 | 18.036 |
F | -6.893 | 2.292 | 6.263 | 10.17 | 19.596 |
G | -1.023 | 5.646 | 9.295 | 14.107 | 29.198 |
H | -7.684 | 4.477 | 7.525 | 12.355 | 24.877 |
Here we reproduce figure 5.8 (with the same problems as above)
par(mfrow=c(1,2))
domain <- c(-20,60)
hist(s.theta[,1], breaks=10, xlab="Effect in School A", main="", xlim=domain)
hist(apply(s.theta,1,max), breaks=10, xlab="Largest Effect", main="", xlim=domain)
title(main="Figure 5.8 from BDA3")
This last figure (“largest effect”) is a good example of one of the main advantages of a fully Bayesian hierarchical model: once we have correctly simulated the posterior, we can test all kinds of complicated hypotheses.
Appendix C in BDA3
Here we follow the steps in p278 to simulate draws from a bivariate normal \(N(0,I_{2})\).
set.seed(142857)
library(mvtnorm)
theta.0 <- c(-2.5, 2.5)
Our starting point is \((-2.5, 2.5)\), the upper left most square dot from figure 11.1a.
theta.star <- function(theta.t_1) {
n.p <- length(theta.t_1)
# I would like to learn a way to draw from a multivariate normal without a black-box function.
rmvnorm(1, mean = theta.t_1, sigma = 0.2^2 * diag(n.p))
}
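# A sketch (my own addition) of one way to draw from N(mu, Sigma) without a packaged
# sampler: factor Sigma = L %*% t(L) with the Cholesky decomposition and set mu + L %*% z
# with z ~ N(0, I); this linear transform has the desired mean and covariance.
rmvnorm.chol <- function(n, mean, sigma) {
  L <- t(chol(sigma))                                   # lower-triangular factor
  z <- matrix(rnorm(n * length(mean)), nrow = length(mean))
  t(mean + L %*% z)                                     # n x length(mean) matrix of draws
}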
r.dens <- function(theta.t_1, theta.star) {
n.p <- length(theta.t_1)
dmvnorm(theta.star, mean = rep(0, n.p), sigma = diag(n.p)) /
dmvnorm(theta.t_1 , mean = rep(0, n.p), sigma = diag(n.p))
}
theta.t <- function(theta.t_1, theta.star, r) {
prob.aux <- runif(1)
if (prob.aux <= min(c(r,1)))
theta.star
else
theta.t_1
}
theta <- function(sims,theta.0) {
s.theta <- matrix(NA,sims,2)
s.theta[1,] <- t(as.matrix(theta.0))
for (t in 2:sims) {
s.theta.star<- theta.star(s.theta[t-1,])
r <- r.dens(s.theta[t-1,], s.theta.star)
s.theta[t,] <- theta.t(s.theta[t-1,], s.theta.star, r)
}
return(s.theta)
}
par(mfrow=c(1,3))
par(mar = rep(2,4))
# 11.1a
s.theta.1 <- theta(50,theta.0)
s.theta.2 <- theta(50,c(2.5,2.5))
s.theta.3 <- theta(50,c(2.5,-2.5))
s.theta.4 <- theta(50,c(-2.5,-2.5))
s.theta.5 <- theta(50,c(0,0))
plot(s.theta.1, xlim=c(-4,4), ylim=c(-4,4), type="l", sub = "50 simulations")
points(theta.0[1], theta.0[2], pch=15)
lines(s.theta.2)
points(2.5, 2.5, pch=15)
lines(s.theta.3)
points(2.5, -2.5, pch=15)
lines(s.theta.4)
points(-2.5, -2.5, pch=15)
lines(s.theta.5)
points(0, 0, pch=15)
# 11.1b
s.theta.1 <- theta(1000, theta.0)
s.theta.2 <- theta(1000, c(2.5,2.5))
s.theta.3 <- theta(1000, c(2.5,-2.5))
s.theta.4 <- theta(1000, c(-2.5,-2.5))
s.theta.5 <- theta(1000, c(0,0))
plot(s.theta.1, xlim=c(-4,4), ylim=c(-4,4), type="l", sub = "1000 simulations", main="Figure 11.1 From BDA3")
points(theta.0[1], theta.0[2], pch=15)
lines(s.theta.2)
points(2.5, 2.5, pch=15)
lines(s.theta.3)
points(2.5, -2.5, pch=15)
lines(s.theta.4)
points(-2.5, -2.5, pch=15)
lines(s.theta.5)
points(0, 0, pch=15)
s.theta.1 <- s.theta.1[501:1000,] + matrix(runif(1000)/20, nrow = 500, ncol = 2)
s.theta.2 <- s.theta.2[501:1000,] + matrix(runif(1000)/20, nrow = 500, ncol = 2)
s.theta.3 <- s.theta.3[501:1000,] + matrix(runif(1000)/20, nrow = 500, ncol = 2)
s.theta.4 <- s.theta.4[501:1000,] + matrix(runif(1000)/20, nrow = 500, ncol = 2)
s.theta.5 <- s.theta.5[501:1000,] + matrix(runif(1000)/20, nrow = 500, ncol = 2)
# 11.1c
plot(s.theta.1, xlim=c(-4,4), ylim=c(-4,4), type="p", pch = 20, cex =.1, sub = "Second half of 1000 simulations")
points(s.theta.2, pch = 20, cex =.1)
points(s.theta.3, pch = 20, cex =.1)
points(s.theta.4, pch = 20, cex =.1)
points(s.theta.5, pch = 20, cex =.1)
#Data
id <- rep(LETTERS[1:4],c(4,6,6,8))
y <- c(62,60,63,59,63,67,71,64,65,66,68,66,71,67,68,68,56,62,60,61,63,64,63,59)
J <- length(unique(id))
n <- length(y)
The Model
Data \(y_{ij}, i= 1,..., n_{j}, j= 1,..,J\) are iid within each of \(J\) groups, with \(y_{ij} \sim N(\theta_{j}, \sigma^{2})\). The prior distribution of \(\theta_{j}\) is assumed to be normal with hyperparameters \(\mu, \tau\) (\(\theta_{j} \sim N(\mu, \tau^{2})\)). \(\sigma\) is assumed to be unknown (confusing!) and the hyperprior is assumed to be uniform over \((\mu, log(\sigma), \tau)\), which implies \(p(\mu, log(\sigma), log(\tau)) \propto \tau\). The joint posterior density for all the parameters is:
\[ \begin{align} p(\theta,\mu, log(\sigma), log(\tau) | y) \propto \tau \prod^{J}_{j=1} N(\theta_{j}|\mu, \tau^{2}) \prod^{J}_{j=1} \prod^{n_{j}}_{i=1} N(y_{ij}|\theta_{j}, \sigma^{2}) \label{norm.norm4gibs} \end{align} \]
Starting points
Select 10 starting points for each \(\theta_{j}\) by sampling randomly (with replacement) from the \(y_{ij}\) in group \(j\), and take \(\mu\) to be the average of the starting \(\theta_{j}\) values.
set.seed(142857)
theta.0 <- sapply(1:4,function(x) sample(y[id==LETTERS[x]],10, replace=TRUE))
mu.0 <- apply(theta.0, 1,mean)
[Here I follow the exact reverse order that is proposed in the book, going from step 4 to 1 instead of 1 to 4. It is not clear to me how to do it otherwise].
4. Draw from the conditional posterior of \(\tau^{2}\), using: \[ \begin{align} \hat{\tau}^{2} &= \frac{1}{J-1}\sum_{j=1}^{J}(\theta_{j} - \mu)^{2} \end{align} \]
tau.hat.2 <- function(theta) {
mu <- mean(theta)
tau.2 <- ( 1/(J-1) ) * sum((theta - mu)^2)
return(tau.2)
}
And the fact that: \[ \begin{align} \tau^{2}|\theta,\mu,\sigma, y \sim Inv-\chi^2(J-1,\hat{\tau}^{2}) \end{align} \]
We can draw samples for \(\tau^{2}\):
s.tau.post <- function(theta) {
tau.2 <- tau.hat.2(theta)
tau.cond <- (J - 1) * (tau.2)/rchisq(1,J-1)
return(tau.cond)
}
3. Draw from the conditional posterior of \(\sigma^{2}\), using:
\[
\begin{align}
\hat{\sigma}^{2} &= \frac{1}{n}\sum_{j=1}^{J}\sum_{i=1}^{n_{j}}(y_{ij} - \theta_{j})^{2}
\end{align}
\]
f.sigma.hat.2 <- function(theta) {
sigma.hat.2 <- sapply(1:4, function(x) (y[id==LETTERS[x]] - theta[x])^2)
sigma.hat.2 <- (1/n) * sum(unlist(sigma.hat.2))
return(sigma.hat.2)
}
And the fact that:
\[ \begin{align} \sigma^{2}|\theta,\mu,\tau,y \sim Inv-\chi^2(n,\hat{\sigma}^{2}) \end{align} \] We can draw samples for \(\sigma^{2}\):
s.sigma.post <- function(theta) {
sigma2.hat <- f.sigma.hat.2(theta)
sigma2.post <- (n) * (sigma2.hat)/rchisq(1,n)
return(sigma2.post)
}
2. Draw from the conditional posterior of \(\mu\), using:
\[
\begin{align}
\hat{\mu} &= \frac{1}{J} \sum_{j=1}^{J}\theta_{j}
\end{align}
\]
mu.hat <- function(theta) {
mean(theta)
}
And the fact that:
\[ \begin{align} \mu|\theta,\sigma,\tau,y \sim N(\hat{\mu}, \tau^{2}/J) \end{align} \]
We can draw values for \(\mu\)
s.mu <- function(theta,tau2) {
mu.hat <- mu.hat(theta)
rnorm(1,mu.hat,sqrt(tau2/J))
}
1. Finally, we can draw values for \(\theta\), using the fact that:
\[
\begin{align}
\theta_{j}|\mu,\sigma,\tau,y \sim N(\hat{\theta_{j}}, V_{\theta_{j}})
\end{align}
\]
With: \[
\begin{align}
\hat{\theta_{j}} &= \frac{ \frac{1}{\tau^{2}}\mu + \frac{n_{j}}{\sigma^{2}}\bar{y_{\dot j}} }{\frac{1}{\tau^{2}} + \frac{n_{j}}{\sigma^{2}}} \nonumber \\
\\
V_{\theta_{j}} &= \frac{1}{\frac{1}{\tau^{2}} + \frac{n_{j}}{\sigma^{2}}} \nonumber \\
\end{align}
\]
theta.hat.j <- function(j,mu,sigma2,tau2) {
y.bar.j <- mean(y[id==LETTERS[j]])
n.j <- length(y[id==LETTERS[j]])
( (1/tau2) * mu + (n.j/sigma2) * y.bar.j ) / ( (1/tau2) + (n.j/sigma2) )
}
V.theta.hat.j <- function(j,mu,sigma2,tau2) {
n.j <- length(y[id==LETTERS[j]])
( 1 ) / ( (1/tau2) + (n.j/sigma2) )
}
s.theta <- function(mu,sigma2,tau2) {
theta <- NULL
for (j in 1:J) {
t.hat <- theta.hat.j(j,mu,sigma2,tau2)
v.t.hat <- V.theta.hat.j(j,mu,sigma2,tau2)
theta[j] <- rnorm(1,t.hat,sqrt(v.t.hat))
}
return(theta)
}
mcmc.gibbs <- function(chain) {
res1 <- as.list(NULL)
sims <- 200
param <- 7
s.param <- matrix(NA, ncol = param, nrow = sims )
colnames(s.param)<- c("theta1", "theta2", "theta3", "theta4", "mu", "sigma2", "tau2")
s.param[1,1:4]<- theta.0[chain,]
s.param[1,7] <- s.tau.post(theta.0[chain,])
s.param[1,6] <- s.sigma.post(theta.0[chain,])
s.param[1,5] <- s.mu(theta.0[chain,],s.param[1,7])
for (s in 2:sims) {
s.param[s,1:4]<- s.theta(s.param[s-1,5],s.param[s-1,6],s.param[s-1,7])
s.param[s,7] <- s.tau.post(s.param[s,1:4])
s.param[s,6] <- s.sigma.post(s.param[s,1:4])
s.param[s,5] <- s.mu(s.param[s,1:4],s.param[s,7])
}
return(s.param)
}
s.param <- lapply(1:10,function(x) mcmc.gibbs(x))
# Warm-up: keep only the second half (draws 101 to 200) of the chain
s.param.1 <- s.param[[1]][101:200, ]
# Transform the variances into standard deviations.
s.param.1[,6:7] <- sqrt(s.param.1[,6:7] )
t(apply(s.param.1,2, function(x) quantile(x, c(.025,.25,.5,.75,.975))))
## 2.5% 25% 50% 75% 97.5%
## theta1 58.539 60.668 61.290 62.136 63.551
## theta2 64.097 65.187 65.966 66.521 67.694
## theta3 65.779 67.124 67.719 68.310 69.505
## theta4 59.654 60.640 61.285 61.767 62.598
## mu 58.590 62.809 63.768 65.404 69.947
## sigma2 1.851 2.169 2.395 2.694 3.328
## tau2 1.825 2.777 3.833 6.389 17.368
r.hat <- function(arg1) {
n <- dim(arg1)[1]
m <- dim(arg1)[2]
B <- (n/(m-1)) * sum( ( apply(arg1,2,mean) - mean(arg1) )^2 )
W <- (1/m) * (1/(n-1)) * sum( (arg1 - apply(arg1,2,mean))^2 )
sqrt( (n-1)/(n) + B/(W*n) )
}
r.hat(sapply(1:10,function(x) s.param[[x]][,4] ))
## [1] 1.032
We simulated 200 draws for each of the 10 chains.
Note: Problems with \(\sigma\) and \(\tau\), need to review.
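As a small extension (my own code), the same \(\hat{R}\) check can be run over all seven parameters at once, which makes it easier to see which ones are problematic:
sapply(1:7, function(p) r.hat(sapply(1:10,function(x) s.param[[x]][,p] )))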