We will select a diffuse bivariate normal prior for \(p(\alpha, \beta)\).
\[ \alpha \sim N(0, 10^2)\\ \beta \sim N(0, 10^2)\\ Corr(\alpha, \beta) = -0.5 \implies \\ (\alpha,\beta) \sim N_2(\mu_0, \Sigma_0)\\ \mu_0 = (0,0)^T\\ [\Sigma_0]_{11} = 10^2\\ [\Sigma_0]_{12} = -50\\ [\Sigma_0]_{22} = 10^2 \]
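The off-diagonal entry follows directly from the stated correlation and marginal standard deviations:
\[ [\Sigma_0]_{12} = Corr(\alpha, \beta)\sqrt{[\Sigma_0]_{11}[\Sigma_0]_{22}} = (-0.5)(10)(10) = -50 \]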
Since our prior is proper, it follows that our posterior is proper.
We will use a bivariate normal proposal distribution for \(\alpha\) and \(\beta\) with tuning parameter \(\Sigma_{Tune}\).
\[ \text{proposal: } (\alpha^*, \beta^*) \sim N_2((\alpha^{(b-1)}, \beta^{(b-1)})^{T}, \Sigma_{Tune})\\ [\Sigma_{Tune}]_{11} = s_1\\ [\Sigma_{Tune}]_{12} = 0\\ [\Sigma_{Tune}]_{21} = 0\\ [\Sigma_{Tune}]_{22} = s_2 \]
In summary, our model is as follows:
\[ \text{prior: } (\alpha, \beta) \sim N_2(\mu_0, \Sigma_0)\\ \mu_0 = (0,0)^T\\ [\Sigma_0]_{11} = 10^2\\ [\Sigma_0]_{12} = -50\\ [\Sigma_0]_{22} = 10^2\\ \text{Data}: y_i|\theta_i \sim Gamma(4, \theta_i)\\ \text{Link Function: } \log(\theta_i) = \alpha + \beta x_i \implies \theta_i = \exp(\alpha + \beta x_i)\\ \text{Proposal Distribution for MH: } (\alpha^*, \beta^*) \sim N_2((\alpha^{(b-1)}, \beta^{(b-1)})^{T}, \Sigma_{Tune})\\ [\Sigma_{Tune}]_{11} = s_1\\ [\Sigma_{Tune}]_{12} = 0\\ [\Sigma_{Tune}]_{21} = 0\\ [\Sigma_{Tune}]_{22} = s_2 \]
Step 1.) Choose initial values \(\alpha_{MH}^{(0)}\) and \(\beta_{MH}^{(0)}\).
For b = 1 to B:
Step 2.) Sample \((\alpha^{*^b}, \beta^{*^b}) \sim N_2((\alpha^{(b-1)}, \beta^{(b-1)})^{T}, \Sigma_{Tune})\), where \(\Sigma_{Tune}\) is a tuning parameter that primarily controls the acceptance rate. This is our proposed new \((\alpha_{MH}, \beta_{MH})\) sample. The proposal enters the ratio \(r_b\), and we decide whether to keep it in Step 5.)
Step 3.) Compute the log ratio \(\log(r_b)\). Since our proposal distribution is symmetric, its density terms cancel and do not need to appear in the ratio.
\[ \log(r_b) = \log(\text{Likelihood}(y_i, \alpha^{*^b}, \beta^{*^b})) + \log(\text{Prior}(\alpha^{*^b},\beta^{*^b})) - \log(\text{Likelihood}(y_i, \alpha_{MH}^{(b-1)}, \beta_{MH}^{(b-1)})) - \log(\text{Prior}(\alpha_{MH}^{(b-1)}, \beta_{MH}^{(b-1)})) \]
Step 4.) Compute \(r_b\) by taking \(\exp(\log(r_b))\)
Step 5.) Sample a \(u \sim Unif(0,1)\).
If \(u < \text{min}(1, r_b)\), then we accept the proposal \((\alpha^*, \beta^*)\). In other words, set \((\alpha_{MH}^{(b)}, \beta_{MH}^{(b)}) = (\alpha^*, \beta^*)\).
Else, we reject \((\alpha^*, \beta^*)\) and set \((\alpha_{MH}^{(b)}, \beta_{MH}^{(b)}) = (\alpha_{MH}^{(b-1)},\beta_{MH}^{(b-1)})\).
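For reference, each log-likelihood term appearing in \(\log(r_b)\) expands as follows under the \(Gamma(4, \theta_i)\) model with the log link (using the gamma density with shape 4 and rate \(\theta_i\)):
\[ \log(\text{Likelihood}(y, \alpha, \beta)) = \sum_{i=1}^{n}\left[4(\alpha + \beta x_i) + 3\log(y_i) - y_i\exp(\alpha + \beta x_i) - \log\Gamma(4)\right] \]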
We use the initial chain values \((\hat{\alpha}, \hat{\beta}) = (0.76, -0.35)\). These were approximated by maximizing the log-likelihood of the data as a function of \((\alpha, \beta)\).
Libraries needed.
library(mvtnorm)
library(gtools)
library(coda)
#dataset
data.rat <- read.csv('rat.csv')
y <- data.rat$V1
x <- data.rat$group
n <- nrow(data.rat)
##Predefine Functions used in M-H###
#prior values for alpha and beta coming from bivariate normal
prior.mu = c(0,0)
prior.sig = matrix(c(10^2, -50, -50, 10^2), nrow = 2, byrow = TRUE)
#Log of the prior density
Log.Prior <- function(alpha, beta, prior.mean, prior.sigma) {
z = dmvnorm(c(alpha, beta), mean = prior.mean, sigma = prior.sigma, log = TRUE)
return(z)
}
#Log of the likelihood of data
Log.Likelihood2 <- function(y, alpha, beta, x) {
theta = exp(alpha + beta*x)
val = sum(dgamma(y, shape = 4, rate = theta, log = TRUE))
return(val)
}
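As noted above, the initial chain values \((0.76, -0.35)\) come from maximizing the log-likelihood. A minimal sketch of how this could be done with optim() and the Log.Likelihood2 function just defined (an illustration, not necessarily the original code; it assumes the vectors y and x loaded above):
#Approximate the MLE of (alpha, beta) by maximizing Log.Likelihood2.
#control = list(fnscale = -1) turns optim's default minimization into maximization.
mle.fit <- optim(par = c(0, 0),
                 fn = function(par) Log.Likelihood2(y, par[1], par[2], x),
                 control = list(fnscale = -1))
mle.fit$par #expected to be approximately (0.76, -0.35)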
### -- main function -- ##
Metrop.func <- function(s1, s2, B, alpha.chain.init, beta.chain.init, data) {
#read data
y <- data$V1
x <- data$group
## initialize
accept = 0
alpha.post.MH <- numeric()
beta.post.MH <- numeric()
alpha.post.MH[1] <- alpha.chain.init #initalize alpha
beta.post.MH[1] <- beta.chain.init #initialize beta
set.seed(11)
#tuning parameter covariance matrix
diag.delta <- matrix(c(s1, 0, 0, s2), nrow = 2, byrow = TRUE)
for(b in 2:(B+1)) {
#print(paste('iteration: ', b))
#M-H Step for alpha and beta | data
ab.star <- as.numeric(rmvnorm(1, mean = c(alpha.post.MH[(b-1)], beta.post.MH[(b-1)]), sigma = diag.delta))
a.star <- ab.star[1]
b.star <- ab.star[2]
num1 <- Log.Likelihood2(y, a.star, b.star, x)
num2 <- Log.Prior(a.star, b.star, prior.mean = prior.mu, prior.sigma = prior.sig)
denom1 <- Log.Likelihood2(y, alpha.post.MH[(b-1)], beta.post.MH[(b-1)], x)
denom2 <- Log.Prior(alpha.post.MH[(b-1)], beta.post.MH[(b-1)], prior.mu, prior.sig)
#log acceptance ratio: (log-likelihood + log-prior) at proposal minus at current value
r.log <- num1 + num2 - denom1 - denom2
r = exp(r.log)
if(runif(1) < min(1,r)) {
alpha.post.MH[(b)] = a.star
beta.post.MH[(b)] = b.star
accept = accept + 1
} else {
alpha.post.MH[(b)] <- alpha.post.MH[(b-1)]
beta.post.MH[(b)] <- beta.post.MH[(b-1)]
}
#Print progress every 10,000th iteration
if(b %% 10000 == 0) {
cat(paste0("iteration: ", b, "\n"))
}
} #end for loop
return.list = list(accep.rate = accept/B, beta.samples = beta.post.MH, alpha.samples = alpha.post.MH)
return(return.list)
}
We’ll look at our acceptance rate. Per lecture from Math 534 and online sources, an acceptance rate near 25% is appropriate in higher dimensions. We ran the algorithm a few times with differing values of \(s_1\) and \(s_2\). With \(s_1 = s_2 = 1\) the acceptance rate was far too low. After a few trials, we found that using \(s_1 = s_2 = 0.01\) gave us an acceptance rate of approximately 25%.
#initial chain values for alpha and beta. (see above explanation)
a.init <- 0.76
b.init <- -0.35
#Tuning parameters for variance of each of alpha and beta
s1 = .01
s2 = .01
#prior values for alpha and beta coming from bivariate normal
prior.mu = c(0,0)
prior.sig = matrix(c(10^2, -50, -50, 10^2), nrow = 2, byrow = TRUE)
#chain length
B = 50000
z <- Metrop.func(s1 = s1, s2 = s2, B = 50000, a.init, b.init, data = data.rat)
## iteration: 10000
## iteration: 20000
## iteration: 30000
## iteration: 40000
## iteration: 50000
z$accep.rate
## [1] 0.26582
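The trial-and-error tuning described above could also be scripted. A short sketch (not the original code) that reruns brief chains for a few candidate values of \(s_1 = s_2\) and reports the resulting acceptance rates:
#Compare acceptance rates for a few candidate tuning values.
#Short chains (B = 5000) are used here only to gauge the acceptance rate.
for (s in c(1, 0.1, 0.01)) {
  trial <- Metrop.func(s1 = s, s2 = s, B = 5000, a.init, b.init, data = data.rat)
  cat("s1 = s2 =", s, "-> acceptance rate:", trial$accep.rate, "\n")
}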
We will first report our tuning parameter (covariance matrix \(\Sigma_{Tune}\)). As discussed in Part (b.), we have:
\[ [\Sigma_{Tune}]_{11} = s_1\\ [\Sigma_{Tune}]_{12} = 0\\ [\Sigma_{Tune}]_{21} = 0\\ [\Sigma_{Tune}]_{22} = s_2 \]
With our choice of \(s_1 = s_2 = 0.01\) as discussed in Part (b.), we obtain an acceptance rate of approximately 25%, which is near the recommended rate for higher dimensions.
Let’s do some diagnostics. Let’s first examine some trace plots.
par(mfrow = c(1,2))
plot(z$beta.samples[1:10000], type = 'l', main = 'Trace Plot of Beta[1:10,000]')
plot(z$alpha.samples[1:10000], type = 'l', main = 'Trace Plot of Alpha[1:10,000]')
Both the \(\alpha\) and \(\beta\) traces are dense (“thick”) over the first 10,000 samples, suggesting the chains are mixing reasonably well.
We will approximate a burn-in value using the running means of the \(\alpha\) and \(\beta\) samples.
params = cbind(z$beta.samples, z$alpha.samples)
temp = apply(params, 2, cumsum)
seq.div <- seq(from = 1, to = (B+1), by = 1)
means.beta = temp[,1]/seq.div
means.alpha = temp[,2]/seq.div
par(mfrow = c(1,2))
plot(1:(B+1), means.beta, type = 'l', main = 'Running Mean of Beta')
plot(1:(B+1), means.alpha, type = 'l', main = 'Running Mean of Alpha')
It appears that a burn-in value of approximately 7,500 will be sufficient.
burnin = 7500
#Autocorrelation plots#
par(mfrow = c(1,2))
acf(z$beta.samples[-c(1:burnin)], main = 'Beta Samples after Burn-in')
acf(z$alpha.samples[-c(1:burnin)], main = 'Alpha Samples after Burn-in')
Our \(\alpha\) and \(\beta\) samples show noticeable autocorrelation that we need to account for. Let’s see at what lag it drops to approximately 0.
autocorr(as.mcmc(z$alpha.samples[-c(1:burnin)]), lags = c(10,20,30,40,50, 60, 70, 80, 90, 100), relative = FALSE)
## , , 1
##
## [,1]
## Lag 10 0.2310767119
## Lag 20 0.0648119004
## Lag 30 0.0162263040
## Lag 40 0.0207803543
## Lag 50 0.0238873039
## Lag 60 0.0073002487
## Lag 70 -0.0106006388
## Lag 80 -0.0069793375
## Lag 90 0.0006591463
## Lag 100 0.0050989047
We will use a thinning rate of 60, since the autocorrelation is essentially 0 by lag 60.
Now let’s thin our chain. We first need to calculate the chain length \(B\) required for an effective sample size of 2,000 roughly independent samples. Since we keep every 60th value, we need \(60 \times 2000 = 120{,}000\) post-burn-in draws. Adding the 7,500 burned draws gives a total chain length of \(B = 120{,}000 + 7{,}500 = 127{,}500\).
B = 127500
burnin = 7500
thin = 60
Eff.samp = floor((B - burnin)/thin)
Eff.samp
## [1] 2000
B = 127500
burnin = 7500
z <- Metrop.func(s1 = s1, s2 = s2, B = B, a.init, b.init, data = data.rat)
## iteration: 10000
## iteration: 20000
## iteration: 30000
## iteration: 40000
## iteration: 50000
## iteration: 60000
## iteration: 70000
## iteration: 80000
## iteration: 90000
## iteration: 100000
## iteration: 110000
## iteration: 120000
alpha.thin = beta.thin = numeric()
for(i in 1:Eff.samp) {
alpha.thin[i] = z$alpha.samples[(burnin+1+(thin*(i-1)))]
beta.thin[i] = z$beta.samples[(burnin + 1 + (thin*(i-1)))]
}
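Equivalently, the thinning indices can be built in one step with seq(); this sketch (not the original code) selects exactly the same draws as the loop above:
#Vectorized thinning: keep every 'thin'-th draw after the burn-in.
thin.idx <- seq(from = burnin + 1, by = thin, length.out = Eff.samp)
alpha.thin <- z$alpha.samples[thin.idx]
beta.thin  <- z$beta.samples[thin.idx]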
In summary, with the tuning covariance matrix \(\Sigma_{Tune}\) discussed above with \(s_1 = 0.01\) and \(s_2 = 0.01\), we obtained an acceptance rate of approximately 25%, a burn-in of 7,500, a thinning rate of 60, and an effective sample size of 2,000, requiring a total chain length of \(B = 127{,}500\).
Let’s see if we can further fine-tune our covariance matrix. First, the joint posterior plot from this initial run:
plot(alpha.thin, beta.thin, main = 'p(alpha, beta|data) under MH', xlab = 'alpha', ylab = 'beta', pch = 20)
cov(alpha.thin, beta.thin)
## [1] -0.002970878
Thus, we can fine-tune our covariance tuning matrix by inserting an off-diagonal covariance term of approximately -0.0029. We will do that now.
Metrop.func2 <- function(s1, s2, B, alpha.chain.init, beta.chain.init, data) {
#read data
y <- data$V1
x <- data$group
## initialize
accept = 0
alpha.post.MH <- numeric()
beta.post.MH <- numeric()
alpha.post.MH[1] <- alpha.chain.init #initalize alpha
beta.post.MH[1] <- beta.chain.init #initialize beta
set.seed(11)
#tuning parameter covariance matrix
diag.delta <- matrix(c(s1, -0.0029, -0.0029, s2), nrow = 2, byrow = TRUE)
for(b in 2:(B+1)) {
#print(paste('iteration: ', b))
#M-H Step for alpha and beta | data
ab.star <- as.numeric(rmvnorm(1, mean = c(alpha.post.MH[(b-1)], beta.post.MH[(b-1)]), sigma = diag.delta))
a.star <- ab.star[1]
b.star <- ab.star[2]
num1 <- Log.Likelihood2(y, a.star, b.star, x)
num2 <- Log.Prior(a.star, b.star, prior.mean = prior.mu, prior.sigma = prior.sig)
denom1 <- Log.Likelihood2(y, alpha.post.MH[(b-1)], beta.post.MH[(b-1)], x)
denom2 <- Log.Prior(alpha.post.MH[(b-1)], beta.post.MH[(b-1)], prior.mu, prior.sig)
#log acceptance ratio: (log-likelihood + log-prior) at proposal minus at current value
r.log <- num1 + num2 - denom1 - denom2
r = exp(r.log)
if(runif(1) < min(1,r)) {
alpha.post.MH[(b)] = a.star
beta.post.MH[(b)] = b.star
accept = accept + 1
} else {
alpha.post.MH[(b)] <- alpha.post.MH[(b-1)]
beta.post.MH[(b)] <- beta.post.MH[(b-1)]
}
#Print progress every 10,000th iteration
if(b %% 10000 == 0) {
cat(paste0("iteration: ", b, "\n"))
}
} #end for loop
return.list = list(accep.rate = accept/B, beta.samples = beta.post.MH, alpha.samples = alpha.post.MH)
return(return.list)
}
Let’s run it with this new tuning covariance matrix:
\[ [\Sigma_{Tune}]_{11} = s_1 = 0.01\\ [\Sigma_{Tune}]_{12} = -0.0029\\ [\Sigma_{Tune}]_{21} = -0.0029\\ [\Sigma_{Tune}]_{22} = s_2 = 0.01 \]
z <- Metrop.func2(s1 = s1, s2 = s2, B = 50000, a.init, b.init, data = data.rat)
## iteration: 10000
## iteration: 20000
## iteration: 30000
## iteration: 40000
## iteration: 50000
z$accep.rate
## [1] 0.2836
We have a slightly higher acceptance rate, which makes sense since our tuning covariance matrix accounted for the covariance we saw in our initial run of the joint posterior.
Let’s re-run our diagnostics again.
B = 50000
params = cbind(z$beta.samples, z$alpha.samples)
temp = apply(params, 2, cumsum)
seq.div <- seq(from = 1, to = (B+1), by = 1)
means.beta = temp[,1]/seq.div
means.alpha = temp[,2]/seq.div
par(mfrow = c(1,2))
plot(1:(B+1), means.beta, type = 'l', main = 'Running Mean of Beta')
plot(1:(B+1), means.alpha, type = 'l', main = 'Running Mean of Alpha')
A burn-in value of 7,500 still seems necessary.
Let’s recheck the ACF plots.
burnin = 7500
#Autocorrelation plots#
par(mfrow = c(1,2))
acf(z$beta.samples[-c(1:burnin)], main = 'Beta Samples after Burn-in')
acf(z$alpha.samples[-c(1:burnin)], main = 'Alpha Samples after Burn-in')
Let’s check when the auto correlation goes to 0.
autocorr(as.mcmc(z$alpha.samples[-c(1:burnin)]), lags = c(10,20,30,40,50, 60, 70, 80, 90, 100), relative = FALSE)
## , , 1
##
## [,1]
## Lag 10 0.1589400090
## Lag 20 0.0338350984
## Lag 30 0.0165962370
## Lag 40 0.0087988960
## Lag 50 0.0105552772
## Lag 60 0.0004682415
## Lag 70 0.0114996750
## Lag 80 -0.0022761143
## Lag 90 0.0141132079
## Lag 100 0.0121622141
Again, a lag/thinning rate of 60 will be used.
In the interest of time, and since adding the covariance term \(s_{12} = s_{21} = -0.0029\) changed neither the lag/thinning rate nor the burn-in value, we will not re-run this chain at the full length needed for the target effective sample size; the plots below reuse the thinned samples from the first run.
plot(alpha.thin, beta.thin, main = 'p(alpha, beta|data) under MH', xlab = 'alpha', ylab = 'beta', pch = 20)
We provide the marginal posterior densities of \(\alpha\) and \(\beta\) based on the thinned samples.
par(mfrow = c(1,2))
plot(density(alpha.thin), main = 'P(alpha|data)')
z.a = density(alpha.thin)
alpha.thin.MAP <- z.a$x[which.max(z.a$y)]
abline(v = alpha.thin.MAP, col = 'red')
alpha.thin.lower <- quantile(alpha.thin, 0.025)
alpha.thin.upper <- quantile(alpha.thin, 0.975)
abline(v = alpha.thin.lower, col = 'blue')
abline(v = alpha.thin.upper, col = 'blue')
plot(density(beta.thin), main = 'P(beta|data)')
z.b = density(beta.thin)
beta.thin.MAP <- z.b$x[which.max(z.b$y)]
abline(v = beta.thin.MAP, col = 'red')
beta.thin.lower <- quantile(beta.thin, 0.025)
beta.thin.upper <- quantile(beta.thin, 0.975)
abline(v = beta.thin.lower, col = 'blue')
abline(v = beta.thin.upper, col = 'blue')
We can visually see the MAP estimates and the 95% credible intervals for each of \(\alpha|data\) and \(\beta|data\). Numerically they are presented below in the console output.
## parameters lowers maps uppers
## 1 alpha 0.656580 0.7691150 0.8644199
## 2 beta -0.491793 -0.3424071 -0.2046865
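A minimal sketch (not necessarily the original code) of how this summary table could be assembled from the quantities computed in the plotting code above:
#Collect the MAP estimates and 95% credible-interval endpoints computed above.
data.frame(parameters = c("alpha", "beta"),
           lowers = as.numeric(c(alpha.thin.lower, beta.thin.lower)),
           maps   = c(alpha.thin.MAP, beta.thin.MAP),
           uppers = as.numeric(c(alpha.thin.upper, beta.thin.upper)))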
What does all this mean for our rat data? Recall that \(\log(\theta_i) = \alpha + \beta x_i\) and \(y_i|\theta_i \sim Gamma(4, \theta_i)\). Thus, the rate parameter \(\theta_i\) is the exponential of a linear function of \(\alpha\) and \(\beta\), where \(\alpha\) is the intercept and \(\beta\) is a slope that is only active for Group 1 (the Restricted Group).
Thus, since \(\beta_{MAP}\) is negative, the rate parameter for Group 1 (the restricted group) is smaller than that of Group 0 (the ad libitum group).
In a gamma distribution the rate parameter has a survival-analysis interpretation as the rate of death occurrence. Thus, the rate of death occurrence is lower for Group 1 than for Group 0.
To confirm our inference we will compute the expected survival time distribution for each group in part e.).
We will obtain posterior samples of \(\theta\) for each group. The Restricted Group corresponds to \(x_i = 1\) and the Ad Libitum Group to \(x_i = 0\).
Once we have the posterior samples of \(\theta\), we can approximate the expected value of \(y_i\) given \(\theta_i\), which equals \(\frac{4}{\theta_i}\).
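In symbols, since the gamma shape is 4 and the rate is \(\theta_i\):
\[ E[y_i \mid \theta_i] = \frac{4}{\theta_i} = \frac{4}{\exp(\alpha + \beta x_i)} = 4\exp\{-(\alpha + \beta x_i)\} \]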
alpha.beta.samp <- data.frame(alpha = alpha.thin, beta = beta.thin)
bootstrap.index <- sample(1:nrow(alpha.beta.samp), size = 2000, replace = TRUE)
ab.samp.mat <- matrix(NA, nrow = 2000, ncol = 2)
for(i in 1:2000) {
ab.samp.mat[i,] <- as.numeric(alpha.beta.samp[bootstrap.index[i],])
}
theta.samples.x1 <- numeric()
for(i in 1:2000) {
theta.samples.x1[i] <- exp(ab.samp.mat[i,1] + ab.samp.mat[i,2]*1)
}
expected.val.x1 <- 4/theta.samples.x1
#-------------------
#resample and bootstrap again for expected value of theta_i when x_i = 0
alpha.beta.samp <- data.frame(alpha = alpha.thin, beta = beta.thin)
bootstrap.index <- sample(1:nrow(alpha.beta.samp), size = 2000, replace = TRUE)
ab.samp.mat <- matrix(NA, nrow = 2000, ncol = 2)
for(i in 1:2000) {
ab.samp.mat[i,] <- as.numeric(alpha.beta.samp[bootstrap.index[i],])
}
theta.samples.x0 <- numeric()
for(i in 1:2000) {
theta.samples.x0[i] <- exp(ab.samp.mat[i,1] + ab.samp.mat[i,2]*0)
}
expected.val.x0 <- 4/theta.samples.x0
#--------------------
#make plots
par(mfrow = c(1,2))
plot(density(expected.val.x1), main = 'Group 1 - Exp.Survival')
z.x1 = density(expected.val.x1)
x1.MAP <- z.x1$x[which.max(z.x1$y)]
abline(v = x1.MAP, col = 'red')
x1.lower <- quantile(expected.val.x1, 0.025)
x1.upper <- quantile(expected.val.x1, 0.975)
abline(v = x1.lower, col = 'blue')
abline(v = x1.upper, col = 'blue')
plot(density(expected.val.x0), main = 'Group 0 - Exp.Survival')
z.x0 = density(expected.val.x0)
x0.MAP <- z.x0$x[which.max(z.x0$y)]
abline(v = x0.MAP, col = 'red')
x0.lower <- quantile(expected.val.x0, 0.025)
x0.upper <- quantile(expected.val.x0, 0.975)
abline(v = x0.lower, col = 'blue')
abline(v = x0.upper, col = 'blue')
We can visually see that Group 1 has an expected survival time that exceeds that of Group 0. In other words, a rat on a restricted diet appears to live longer, on average, than a rat allowed to eat freely. Thus, it appears that rats do eat themselves to the grave.
We can numerically check our results as well.
## parameters lowers maps uppers
## 1 Group 0 Exp.Survival 1.695004 1.847032 2.079459
## 2 Group 1 Exp.Survival 2.424211 2.627391 2.939281
As we can see from the table output, even the 97.5th percentile of the Group 0 expected survival time distribution is below the 2.5th percentile of the Group 1 expected survival time distribution. Again, this confirms what we see in the posterior plots: the Restricted Group survived longer on average, and we can conclude that rats will eat themselves to the grave.
#libraries
library(invgamma)
#Data
y <- c(0.03, 0.12, -0.14, 1.18, 0.26, -0.06, -0.02, -0.32, 0.27, 0.80, 0.54, 0.18, -0.02, 0.23, -0.18, -0.06, 0.30, 0.07, -0.07)
sigma.j <- c(0.125, 0.147, 0.167, 0.373, 0.369, 0.103, 0.103, 0.220, 0.164, 0.251, 0.302, 0.223, 0.289, 0.290, 0.159, 0.167, 0.139, 0.094, 0.174)
Our data \(y_j|\theta_j \sim N(\theta_j, \sigma^2_j)\).
We first note the following form of the joint posterior:
\[ \begin{align} p(\theta, \mu, \tau^2|y) &\propto p(\tau^2)p(\mu)[\prod_{j=1}^J p(\theta_j|\mu, \tau^2)] [\prod_{j=1}^J p(y_j|\theta_j)] \end{align} \]
From Lecture Slide 6 we have the distributions of the full posterior conditionals.
\[ (\mu|\theta, \tau^2) \sim N(\frac{J\bar{\theta}/\tau^2 + \mu_0 / \delta_0^2}{J/\tau^2 + 1/\delta_0^2}, [J/\tau^2 + 1/\delta_0^2]^{-1})\\ (\tau^2|\theta, \mu) \sim \Gamma^{-1}(\frac{\eta_0 + J}{2}, \frac{\eta_0 \tau_0^2 + \sum_{j=1}^J (\theta_j - \mu)^2}{2})\\ (\theta_j|y, \mu, \tau^2) \sim N(\frac{n_jy_j/\sigma^2 + \mu/\tau^2}{n_j/\sigma^2 + 1/\tau^2}, [n_j/\sigma^2 + 1/\tau^2]^{-1}) \]
Furthermore we have the following fixed values and identities:
\[ \mu_0 = 0\\ \delta_0^2 = 10^6\\ \eta_0 = 2*10^{-3}\\ \tau_0^2 = 10^6\\ J= 19\\ \sigma^2_j = \frac{\sigma^2}{n_j} \implies \frac{n_j}{\sigma^2} = \frac{1}{\sigma^2_j} \]
Substituting these fixed values and identities, the full conditionals become:
\[ (\mu|\theta, \tau^2) \sim N(\frac{J\bar{\theta}/\tau^2 + \mu_0 / \delta_0^2}{J/\tau^2 + 1/\delta_0^2}, [J/\tau^2 + 1/\delta_0^2]^{-1})\\ (\tau^2|\theta, \mu) \sim \Gamma^{-1}(\frac{\eta_0 + J}{2}, \frac{\eta_0 \tau_0^2 + \sum_{j=1}^J (\theta_j - \mu)^2}{2})\\ (\theta_j|y, \mu, \tau^2) \sim N(\frac{y_j/\sigma_j^2 + \mu/\tau^2}{1/\sigma_j^2 + 1/\tau^2}, [1/\sigma_j^2 + 1/\tau^2]^{-1}) \]
Our Gibbs Sampler Algorithm is as follows:
Step 0.) Set \(\theta^{(0)} = y\) as an initial value.
Step 1.) Sample \(\mu \sim N(\mu_0 = 0, \delta_0^2 = 10^6)\). Set \(\mu^{(0)} = \mu\)
Step 2.) Sample \(\tau^2 \sim \Gamma^{-1}(\frac{\eta_0}{2} = 10^{-3}, \frac{\eta_0 \tau_0^2}{2} = 10^3)\). Set \(\tau^{2^{(0)}} = \tau^2\)
For \(b = 2,... B\),
Step 3.) Sample \(\tau^{2^{(b)}}\) from \((\tau^2|\theta^{(b-1)}, \mu^{(b-1)}) \sim \Gamma^{-1}(\frac{\eta_0 + J}{2}, \frac{\eta_0 \tau_0^2 + \sum_{j=1}^J (\theta_j^{(b-1)} - \mu^{(b-1)})^2}{2})\)
Step 4.) Sample \(\mu^{(b)}\) from \((\mu|\theta^{(b-1)}, \tau^{2^{(b)}}) \sim N(\frac{J\bar{\theta^{(b-1)}}/\tau^{2^{(b)}} + \mu_0 / \delta_0^2}{J/\tau^{2^{(b)}} + 1/\delta_0^2}, [J/\tau^{2^{(b)}} + 1/\delta_0^2]^{-1})\)
Step 5.) Sample \(\theta^{(b)}\) from \((\theta_j|y, \mu^{(b)}, \tau^{2^{(b)}}) \sim N(\frac{y_j/\sigma_j^2 + \mu^{(b)}/\tau^{2^{(b)}}}{1/\sigma_j^2 + 1/\tau^{2^{(b)}}}, [1/\sigma_j^2 + 1/\tau^{2^{(b)}}]^{-1})\)
Step 6.) Compute the joint posterior \(p(\theta, \mu, \tau^2|y) \propto p(\tau^2)p(\mu)[\prod_{j=1}^J p(\theta_j|\mu, \tau^2)] [\prod_{j=1}^J p(y_j|\theta_j)]\) with the Gibbs parameter samples computed in Steps 3-5.
Step 7.) Repeat Steps 3-6 for \(b = 2, \dots, B\), i.e., until the desired number of posterior samples has been obtained.
#fixed values
mu.0 <- 0
delta.0 <- sqrt(10^6)
eta.0 <- 2*(10^(-3))
tau.0.sq <- 10^6
#Defining posterior functions
#############################################
post_theta <- function(sigma.j, y, mu, tau.sq) {
J = length(y)
param1.num <- (y/sigma.j^2) + (mu/tau.sq)
param1.denom <- ((1/sigma.j^2) + (1/tau.sq))^(-1)
param1 <- param1.num*param1.denom
param2 <- param1.denom
theta.rval<- rnorm(J, mean = param1, sd = sqrt(param2))
dens.eval <- sum(dnorm(theta.rval, mean = param1, sd = sqrt(param2), log = TRUE))
return.list = list(rval = theta.rval, dens.val = dens.eval)
return(return.list)
}
#################################################
post_tau <- function(eta.0, tau.0.sq, mu, theta) {
J = length(theta)
param1 <- (eta.0 + J)/2
param2 <- ((eta.0*tau.0.sq) + (sum((theta - mu)^2)))/2
rval <- rinvgamma(1, shape = param1, rate = param2)
dens.eval <- dinvgamma(rval, shape = param1, rate = param2, log = TRUE)
return.list = list(rval = rval, dens.val = dens.eval)
return(return.list)
}
####################################################
post_mu <- function(tau.sq, theta, mu.0, delta.0) {
J = length(theta)
param1.num <- (J*mean(theta)/tau.sq) + (mu.0/delta.0^2)
param1.denom <- ((J/tau.sq) + (1/delta.0^2))^(-1)
param1 <- param1.num*param1.denom
param2 <- param1.denom
rval <- rnorm(1, mean = param1, sd = sqrt(param2))
dens.eval <- dnorm(rval, mean = param1, sd = sqrt(param2), log = TRUE)
return.list = list(rval = rval, dens.val = dens.eval)
return(return.list)
}
#fixed values
mu.0 <- 0
delta.0 <- sqrt(10^6)
eta.0 <- 2*(10^(-3))
tau.0.sq <- 10^6
B = 50000
J = length(y)
theta.samples <- matrix(NA, nrow = B+1, ncol = J) #each row will be a Gibbs sample theta = (theta1,..., thetaJ)
tausq.samples <- numeric()
mu.samples <- numeric()
joint.posterior.log <- numeric()
y_data.log <- numeric()
#start our chain with Step 0, 1, 2 from algorithm above
set.seed(11)
theta.samples[1,] <- y
mu.samples[1] <- rnorm(1, mean = mu.0, sd = delta.0)
tausq.samples[1] <- rinvgamma(1, shape = eta.0/2, rate = eta.0 * tau.0.sq/2)
#Step 3-6
for(i in 2:(B+1)) {
#tausq | theta, mu
tausq.list = post_tau(eta.0, tau.0.sq, mu.samples[(i-1)], as.numeric(theta.samples[i-1,]))
tausq.samples[i] <- tausq.list$rval
#mu | theta, tausq
mu.list = post_mu(tausq.samples[i], as.numeric(theta.samples[i-1,]), mu.0, delta.0)
mu.samples[i] <- mu.list$rval
#theta | mu, tausq
theta.list = post_theta(sigma.j, y, mu.samples[i], tausq.samples[i])
theta.samples[i,] <- theta.list$rval
y_data.log[(i-1)] <- sum(dnorm(y, mean = as.numeric(theta.samples[i,]), sd = sigma.j, log = TRUE))
joint.posterior.log[(i-1)] <- sum(theta.list$dens.val, tausq.list$dens.val, mu.list$dens.val, y_data.log[(i-1)])
}
Trace plot of joint.posterior.log. Note that, for numerical stability, we computed the log of the joint posterior.
joint.posterior = exp(joint.posterior.log)
plot(joint.posterior.log, type = 'l', main = 'Trace Plot Joint.Posterior.Log')
Let’s look at a running mean of some of our parameter estimates. We will now convert our log(joint posterior) to the joint posterior by exponentiating our vector.
B = 50000
joint.posterior = exp(joint.posterior.log)
params = joint.posterior[-1]
temp = cumsum(params)
seq.div <- seq(from = 2, to = (B), by = 1)
means.joint.posterior = temp/seq.div
plot(2:(B), means.joint.posterior, type = 'l', main = 'Running Mean of JPL')
It appears that a burn-in value of approximately 30,000 will be sufficient.
Let’s check our ACF plot.
burnin = 30000
#Autocorrelation plots#
acf(joint.posterior[-c(1:burnin)], main = 'Joint Posterior after Burn-In')
It appears that we do not need to thin our joint posterior samples, since there is little autocorrelation remaining after removing the burn-in.
Thus, since we are not thinning, and we started with an initial run of \(B = 50{,}000\) values and are burning \(30{,}000\), we have obtained \(20{,}000\) approximately independent samples from \(p(\theta, \mu, \tau^2|data)\), as desired.
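As an optional check (a sketch, not part of the original analysis), coda’s effectiveSize() estimates how many effectively independent draws the retained portion of the chain contains:
#Estimate the effective sample size of the retained joint-posterior draws.
effectiveSize(as.mcmc(joint.posterior[-c(1:burnin)]))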
Let’s do some diagnostics on our \(\mu\) and \(\tau^2\) samples.
B = B
params = cbind(mu.samples, tausq.samples)
params = params[-1,]
temp = apply(params, 2, cumsum)
seq.div <- seq(from = 2, to = (B+1), by = 1)
means.mu = temp[,1]/seq.div
means.tau.sq = temp[,2]/seq.div
par(mfrow = c(1,2))
plot(2:(B+1), means.mu, type = 'l', main = 'Running Mean of Mu')
plot(2:(B+1), means.tau.sq, type = 'l', main = 'Running Mean of Tau.Sq')
lower.vec <- numeric()
upper.vec <- numeric()
map.vec <- numeric()
par(mfrow = c(2,1))
samples = tail(mu.samples, 10000)
#samples = mu.samples
plot(density(samples), main = 'P(mu_Gibbs)')
z = density(samples)
MAP <- z$x[which.max(z$y)]
abline(v = MAP, col = 'red')
lower <- quantile(samples, 0.025)
upper <- quantile(samples, 0.975)
abline(v = lower, col = 'blue')
abline(v = upper, col = 'blue')
lower.vec[1] <- lower
upper.vec[1] <- upper
map.vec[1] <- MAP
samples = tail(tausq.samples, 10000)
plot(density(samples), main = 'P(tau.sq_Gibbs)')
z = density(samples)
MAP <- z$x[which.max(z$y)]
abline(v = MAP, col = 'red')
lower <- quantile(samples, 0.025)
upper <- quantile(samples, 0.975)
abline(v = lower, col = 'blue')
abline(v = upper, col = 'blue')
lower.vec[2] <- lower
upper.vec[2] <- upper
map.vec[2] <- MAP
Here are the numerical results in a data.frame console output.
## parameters lowers maps uppers
## 1 mu_gibbs -4.92614 0.1111276 5.224513
## 2 tau.sq_gibbs 63.58325 95.4254798 244.529403
Let’s check out a couple of our Gibbs \(\theta_j\).
lower.vec <- numeric()
upper.vec <- numeric()
map.vec <- numeric()
par(mfrow = c(2,1))
theta.1 <- as.numeric(theta.samples[,1])
samples = tail(theta.1, 10000)
#samples = mu.samples
plot(density(samples), main = 'P(theta_1 Gibbs)')
z = density(samples)
MAP <- z$x[which.max(z$y)]
abline(v = MAP, col = 'red')
lower <- quantile(samples, 0.025)
upper <- quantile(samples, 0.975)
abline(v = lower, col = 'blue')
abline(v = upper, col = 'blue')
lower.vec[1] <- lower
upper.vec[1] <- upper
map.vec[1] <- MAP
theta.4 <- as.numeric(theta.samples[,4])
samples = tail(theta.4, 10000)
plot(density(samples), main = 'P(theta_4 Gibbs)')
z = density(samples)
MAP <- z$x[which.max(z$y)]
abline(v = MAP, col = 'red')
lower <- quantile(samples, 0.025)
upper <- quantile(samples, 0.975)
abline(v = lower, col = 'blue')
abline(v = upper, col = 'blue')
lower.vec[2] <- lower
upper.vec[2] <- upper
map.vec[2] <- MAP
Here are the numerical results in a data.frame console output.
## parameters lowers maps uppers observed.effect
## 1 theta_1 gibbs -0.2119753 0.04026493 0.2737457 0.03
## 2 theta_4 gibbs 0.4509788 1.16513946 1.9283290 1.18
We will also include console output of the 95% credible intervals for the \(\theta_j\).
95% Credible Intervals for the Thetas
## 2.5% 50% 97.5%
## -0.21197534 0.02955113 0.27374571
## 2.5% 50% 97.5%
## -0.1677276 0.1220001 0.4103404
## 2.5% 50% 97.5%
## -0.4638582 -0.1387893 0.1896810
## 2.5% 50% 97.5%
## 0.4509788 1.1793690 1.9283290
## 2.5% 50% 97.5%
## -0.4755122 0.2592586 0.9728368
## 2.5% 50% 97.5%
## -0.26493088 -0.05685746 0.14249402
## 2.5% 50% 97.5%
## -0.22336212 -0.02207886 0.17759042
## 2.5% 50% 97.5%
## -0.7537856 -0.3204266 0.1121454
## 2.5% 50% 97.5%
## -0.04742889 0.26878043 0.58762908
## 2.5% 50% 97.5%
## 0.3118087 0.7997457 1.2951560
## 2.5% 50% 97.5%
## -0.05452223 0.54096462 1.13958951
## 2.5% 50% 97.5%
## -0.2532166 0.1772947 0.6129856
## 2.5% 50% 97.5%
## -0.57625305 -0.01481592 0.53780485
## 2.5% 50% 97.5%
## -0.3349262 0.2296117 0.7961485
## 2.5% 50% 97.5%
## -0.4889775 -0.1824865 0.1333774
## 2.5% 50% 97.5%
## -0.38811461 -0.05718639 0.26977266
## 2.5% 50% 97.5%
## 0.02265753 0.30096514 0.57159284
## 2.5% 50% 97.5%
## -0.11795452 0.06992924 0.24944663
## 2.5% 50% 97.5%
## -0.41827525 -0.06903138 0.27347358
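A minimal sketch (not necessarily the original code) of how these per-\(\theta_j\) intervals could be produced from the retained draws, printed here as one matrix rather than one block per \(\theta_j\):
#2.5%, 50%, and 97.5% quantiles of the last 10,000 draws of each theta_j.
apply(tail(theta.samples, 10000), 2, quantile, probs = c(0.025, 0.5, 0.975))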
We are confident in the results of our sampling. Looking at the console output in the table above, \(\hat{\theta}_4\) is close in value to the observed effect \(y_4 = 1.18\). Furthermore, the 95% interval for each \(\theta_j\) is consistent with the observed effect \(y_j\).
Recall in our model that \(y_j | \theta_j \sim N(\theta_j, \sigma^2_j)\). Thus, one interpretation is that \(\theta_j\) is the mean effect for school \(j\). This is consistent with the console output above.
Recall also that in our hierarchical model \((\theta_j|\mu, \tau^2) \sim N(\mu, \tau^2)\). Thus, one interpretation is that the overall mean of the effects on all \(J\) schools is captured by the \(\mu\) parameter.
Therefore, we will now look more carefully at the distribution of the \(\mu\) samples, and in particular the last \(10,000\).
The overall observed mean effect across all studies is \(\bar{y} = 0.16\). Let’s compare this to the 95% CI for \(\mu\) and the mean of the last \(10,000\) \(\mu\) samples.
mu.10 <- tail(mu.samples, 10000)
quantile(mu.10, probs = c(0.025, 0.5, 0.975))
## 2.5% 50% 97.5%
## -4.9261399 0.1861451 5.2245125
mean(y)
## [1] 0.1636842
mean(mu.10)
## [1] 0.1826752
As we can see the mean of our \(\mu\) samples does come close to the overall mean \(\bar{y}\).
That being said, our \(\mu\) samples do not agree with the naive analysis of the effects \(y_j\). Taken at face value, the overall observed mean effect \(\bar{y} = 0.16\) suggests that the teachers’ actions had a positive effect on the students’ IQ scores. However, our 95% CI for \(\mu\) includes 0, which suggests there is no clear evidence of a positive effect. It turns out that a Bayesian analysis can lead to more informed conclusions.
We first define some notation. Let \(y_h\) denote the vector of observations for subject \(h = 1,...,24\). In other words, \(y_h = (y_{h1}, y_{h2}, y_{h3}, y_{h4})\). Then, the joint likelihood is given by the following.
\[ \begin{align} f(y|\theta, \phi, \sigma^2) &= \prod_{h = 1}^{24} f(y_h)\\ &=\prod_{h = 1}^{24} f(y_{h1}, y_{h2}, y_{h3}, y_{h4})\\ &= \prod_{h = 1}^{24} \prod_{j=1}^4 f(y_{hj})\\ &= \prod_{h = 1}^{24} \prod_{j=1}^4 dNormal(\text{mean =} \theta_h + \phi_j, \text{var =} \sigma^2)\\ \end{align} \]
\[ \begin{align} f(\theta_h|\theta_{-h}, \phi, \mu, \sigma^2, y) &= f(\theta_h|\phi, \mu, \sigma^2, y)\\ &\propto f(\theta_h|\mu, \sigma^2) \prod_{j=1}^{4} f(y_{hj}|\phi, \theta_h, \sigma^2)\\ &\sim N(\frac{\bar{y}_h/\sigma_h^2 + \mu/\sigma^2}{1/\sigma_h^2 + 1/\sigma^2}, [1/\sigma_h^2 + 1/\sigma^2]^{-1}) \end{align} \]
\[ \begin{align} f(\phi_j|\phi_{-j}, \theta, \mu, \sigma^2, y) &= f(\phi_j|\theta, \mu, \sigma^2, y)\\ &\propto f(\phi_j|\sigma^2) \prod_{h=1}^{24} f(y_{hj}|\phi_j, \theta, \sigma^2)\\ &\sim N(\frac{\bar{y}_h/\sigma_h^2 + 0}{1/\sigma_h^2 + 4/\sigma^2}, [1/\sigma_h^2 + 4/\sigma^2]^{-1}) \end{align} \]
\[ f(\mu|\phi, \theta, \sigma^2, y) \propto f(\mu) \prod_{h=1}^{24} f(\theta_h|\mu, \sigma^2)\\ \implies \sim N(\frac{24\bar{\theta}/\sigma^2 + 30/\frac{\sigma^2}{9}}{24/\sigma^2 + 1/\frac{\sigma^2}{9}}, [24/\sigma^2 + 1/\frac{\sigma^2}{9}]^{-1}) \]
\[ f(\sigma^2|\phi, \theta, \mu, y) \propto f(\sigma^2) \prod_{h=1}^{24} f(\theta_h|\mu, \sigma^2)\\ \implies \sim \Gamma^{-1}(\frac{2 + 24}{2},\frac{(2)(1) + \sum_{h = 1}^{24} (\theta_h - \mu)^2}{2}) \]
post_mu <- function(theta, sig.sq) {
theta.bar <- mean(theta)
param1.num <- (24*theta.bar/sig.sq) + (30/(sig.sq/9)) #prior mean 30, prior variance sigma^2/9, per the full conditional above
param1.denom <- ((24/sig.sq) + (1/(sig.sq/9)))^(-1)
param1 <- param1.num*param1.denom
rval <- rnorm(1, mean = param1, sd = sqrt(param1.denom))
return(rval)
}
post_theta <- function(y.bar, sig.h.sq, sig.sq, mu) {
H = length(y.bar)
param1.num <- (y.bar/sig.h.sq) + (mu/sig.sq)
param1.denom <- ((1/sig.h.sq) + (1/sig.sq))^(-1)
param1 <- param1.num*param1.denom
rval <- rnorm(H, mean = param1, sd = sqrt(param1.denom))
return(rval)
}
post_phi <- function(y.bar, sig.h.sq, sig.sq) {
param1.num <- (y.bar/sig.h.sq)
param1.denom <- ((1/sig.h.sq) + (4/sig.sq))^(-1)
param1 <- param1.num*param1.denom
rval <- rnorm(4, mean = param1, sd= sqrt(param1.denom))
return(rval)
}
post_sigma.sq <- function(theta, mu) {
param1 <- (2+24)/2
param2 <- (2 + sum((theta - mu)^2))/2 #sum of squared deviations, per the full conditional above
rval <- rinvgamma(1, shape = param1, rate = param2)
return(rval)
}
hearing <- read.csv('hearing-1.txt', sep = '\t')
y.bar <- apply(hearing, 1, mean)
sig.h.sq <- apply(hearing, 1, sd)^2
B = 50000
J = length(y.bar)
theta.samples <- matrix(NA, nrow = B+1, ncol = J) #each row will be a Gibbs sample theta = (theta1,..., thetaJ)
mu.samples <- numeric()
phi.samples <- matrix(NA, nrow = B+1, ncol = ncol(hearing))
sig.sq.samples <- numeric()
#start our chain with some initial values
set.seed(11)
theta.samples[1,] <- y.bar
sig.sq.samples[1] <- rinvgamma(1, shape = 1, rate = 1)
#Step 3-6
for(i in 2:(B+1)) {
#mu|others
mu.samples[i-1] <- post_mu(as.numeric(theta.samples[i-1, ]), sig.sq.samples[i-1])
#phi|others
phi.samples[i-1,] <- post_phi(y.bar, sig.h.sq, sig.sq.samples[i-1])
#theta | others
theta.samples[i,] <- post_theta(y.bar, sig.h.sq, sig.sq.samples[i-1], mu.samples[i-1])
#sigma.sq|others
sig.sq.samples[i] <- post_sigma.sq(as.numeric(theta.samples[i,]), mu.samples[i-1])
}
par(mfrow = c(1,2))
plot(mu.samples, type = 'l', main = 'Trace Plot Mu')
B = 50000
params = mu.samples
temp = cumsum(params)
seq.div <- seq(from = 1, to = (B), by = 1)
running.means = temp/seq.div
plot(1:(B), running.means, type = 'l', main = 'Running Mean of Mu')
par(mfrow = c(1,2))
samples <- as.numeric(theta.samples[,1])
plot(samples, type = 'l', main = 'Trace Plot Theta_1')
N = length(samples)
params = samples
temp = cumsum(params)
seq.div <- seq(from = 1, to = (N), by = 1)
running.means = temp/seq.div
plot(1:(N), running.means, type = 'l', main = 'Running Mean of Theta_1')
par(mfrow = c(1,2))
samples <- as.numeric(theta.samples[,2])
plot(samples, type = 'l', main = 'Trace Plot Theta_2')
N = length(samples)
params = samples
temp = cumsum(params)
seq.div <- seq(from = 1, to = (N), by = 1)
running.means = temp/seq.div
plot(1:(N), running.means, type = 'l', main = 'Running Mean of Theta_2')
par(mfrow = c(1,2))
samples <- as.numeric(theta.samples[,6])
plot(samples, type = 'l', main = 'Trace Plot Theta_6')
N = length(samples)
params = samples
temp = cumsum(params)
seq.div <- seq(from = 1, to = (N), by = 1)
running.means = temp/seq.div
plot(1:(N), running.means, type = 'l', main = 'Running Mean of Theta_6')
par(mfrow = c(1,2))
samples <- as.numeric(phi.samples[,1])
plot(samples, type = 'l', main = 'Trace Plot Phi_1')
N = length(samples)
params = samples
temp = cumsum(params)
seq.div <- seq(from = 1, to = (N), by = 1)
running.means = temp/seq.div
plot(1:(N), running.means, type = 'l', main = 'Running Mean of Phi_1')
par(mfrow = c(1,2))
samples <- as.numeric(phi.samples[,4])
plot(samples, type = 'l', main = 'Trace Plot Phi_4')
N = length(samples)
params = samples
temp = cumsum(params)
seq.div <- seq(from = 1, to = (N), by = 1)
running.means = temp/seq.div
plot(1:(N), running.means, type = 'l', main = 'Running Mean of Phi_4')
It appears from all of our trace and running mean plots that a burn-in of \(10{,}000\) will be sufficient.
Recall our model.
\[ (y_{hj}|\theta_h, \phi_j, \sigma^2) \sim N(\theta_h + \phi_j, \sigma^2) \]
Thus, using the usual MLE of a normal mean, we have
\[ \hat{\theta}_{h_{MLE}} = \bar{y}_h \]
We will provide a numerical output comparing our MLE estimate and our Bayesian posterior means of our \(\theta_h\).
burnin= 10000
theta.bayes.means <- apply(theta.samples[-c(1:burnin),], 2, mean)
bayes.means <- round(theta.bayes.means, 1)
mles <- y.bar
results.df <- data.frame(c(1:length(y.bar)), mles, bayes.means)
colnames(results.df) <- c("h", "MLE", "Bayes.Means")
results.df
## h MLE Bayes.Means
## 1 1 24.5 25.7
## 2 2 24.0 25.5
## 3 3 28.0 26.5
## 4 4 20.5 24.6
## 5 5 31.0 27.6
## 6 6 28.0 26.8
## 7 7 27.0 26.3
## 8 8 28.5 27.1
## 9 9 36.5 28.4
## 10 10 30.5 27.1
## 11 11 28.0 26.6
## 12 12 31.0 27.2
## 13 13 32.0 28.2
## 14 14 36.0 29.4
## 15 15 33.5 28.4
## 16 16 27.0 26.2
## 17 17 28.5 26.7
## 18 18 19.5 24.3
## 19 19 38.0 30.4
## 20 20 21.5 24.7
## 21 21 17.0 23.2
## 22 22 22.0 25.0
## 23 23 27.0 26.3
## 24 24 40.0 30.4
library(reshape2)
library(ggplot2)
melted.gg.df <- melt(results.df, id.vars='h', variable.name='z')
p <- ggplot(data=melted.gg.df, aes(h, value)) + geom_point(aes(colour=z)) + geom_hline(yintercept = mean(y.bar)) +
scale_x_discrete(name ="Theta_h",
limits=c("1","2","3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24"))
p
We can see that the Bayes estimates tend to be more centered, with fewer extreme values, than the MLE estimates. For example, while the MLE estimate of \(\theta_{21}\) is below 20, the corresponding Bayes estimate is about 23, closer to the overall mean of the data, i.e., mean(y.bar).
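This shrinkage is exactly what the full conditional of \(\theta_h\) given earlier predicts: the posterior mean is a precision-weighted average of the subject mean \(\bar{y}_h\) and the overall mean \(\mu\),
\[ E[\theta_h \mid \cdot] = \frac{\bar{y}_h/\sigma_h^2 + \mu/\sigma^2}{1/\sigma_h^2 + 1/\sigma^2} = w_h\,\bar{y}_h + (1 - w_h)\,\mu, \qquad w_h = \frac{1/\sigma_h^2}{1/\sigma_h^2 + 1/\sigma^2}, \]
so subjects with more variable scores (larger \(\sigma_h^2\)) are pulled more strongly toward \(\mu\).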
burnin = 10000
lower.vec <- numeric()
upper.vec <- numeric()
map.vec <- numeric()
par(mfrow = c(2,2))
phi.post.samples <- phi.samples[-c(1:burnin, nrow(phi.samples)), ]
for(j in 1:4) {
phi <- as.numeric(phi.post.samples[,j])
samples <- phi
title <- paste0('P(Phi_', j, ' Gibbs)')
plot(density(samples), main = title)
z = density(samples)
MAP <- z$x[which.max(z$y)]
abline(v = MAP, col = 'red')
lower <- quantile(samples, 0.025)
upper <- quantile(samples, 0.975)
abline(v = lower, col = 'blue')
abline(v = upper, col = 'blue')
lower.vec[j] <- lower
upper.vec[j] <- upper
map.vec[j] <- MAP
}
Here are the numerical results:
## parameters lowers maps uppers
## 1 phi_1 gibbs -0.2567914 0.22481639 27.13545
## 2 phi_2 gibbs -0.5653560 0.06600242 27.52085
## 3 phi_3 gibbs -0.7031373 0.06468620 31.01528
## 4 phi_4 gibbs -0.6478045 0.06888625 24.43991
Zero is contained in each of the 95% credible intervals. Thus, we conclude that the lists have the same level of difficulty, i.e., the list has no effect on the subject’s score.