Problem 1.) Data Analysis: Rat Survival Times

1a.) Select an appropriate diffuse prior for \(p(\alpha, \beta)\).

We will select a bivariate normal diffuse prior for \(p(\alpha, \beta)\).

\[ \alpha \sim N(0, 10^2)\\ \beta \sim N(0, 10^2)\\ Corr(\alpha, \beta) = -0.5 \implies \\ (\alpha,\beta) \sim N_2(\mu_0, \Sigma_0)\\ \mu_0 = (0,0)^T\\ [\Sigma_0]_{11} = 10^2\\ [\Sigma_0]_{12} = -50\\ [\Sigma_0]_{22} = 10^2 \]

Since our prior is proper, it follows that our posterior is proper.
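
As a quick check (a sketch using the mvtnorm package, which is also loaded for the implementation below), we can build \(\Sigma_0\) and verify the implied correlation; note that \([\Sigma_0]_{12} = (-0.5)(10)(10) = -50\):

library(mvtnorm)
#prior mean and covariance for (alpha, beta)
prior.mu  <- c(0, 0)
prior.sig <- matrix(c(10^2, -50, -50, 10^2), nrow = 2, byrow = TRUE)
cov2cor(prior.sig)                               #off-diagonals recover Corr(alpha, beta) = -0.5
rmvnorm(5, mean = prior.mu, sigma = prior.sig)   #a few draws from the prior p(alpha, beta)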

Problem 1b.) Perform MCMC using a Metropolis-Hastings Algorithm with a bivariate normal proposal distribution.

We will use a bivariate normal proposal distribution for \(\alpha\) and \(\beta\) with tuning parameter \(\Sigma_{Tune}\).

\[ \text{proposal: } (\alpha^*, \beta^*) \sim N_2((\alpha^{(b-1)}, \beta^{(b-1)})^{T}, \Sigma_{Tune})\\ [\Sigma_{Tune}]_{11} = s_1\\ [\Sigma_{Tune}]_{12} = 0\\ [\Sigma_{Tune}]_{21} = 0\\ [\Sigma_{Tune}]_{22} = s_2 \]

In summary our model is as follows:

\[ \text{prior: } (\alpha, \beta) \sim N_2(\mu_0, \Sigma_0)\\ \mu_0 = (0,0)^T\\ [\Sigma_0]_{11} = 10^2\\ [\Sigma_0]_{12} = -50\\ [\Sigma_0]_{22} = 10^2\\ \text{Data}: y_i|\theta_i \sim Gamma(4, \theta_i)\\ \text{Link Function: } \log(\theta_i) = \alpha + \beta x_i \implies \theta_i = \exp(\alpha + \beta x_i)\\ \text{Proposal Distribution for MH: } (\alpha^*, \beta^*) \sim N_2((\alpha^{(b-1)}, \beta^{(b-1)})^{T}, \Sigma_{Tune})\\ [\Sigma_{Tune}]_{11} = s_1\\ [\Sigma_{Tune}]_{12} = 0\\ [\Sigma_{Tune}]_{21} = 0\\ [\Sigma_{Tune}]_{22} = s_2 \]

Algorithm:

Step 1.) Choose initial values \(\alpha_{MH}^{(0)}\) and \(\beta_{MH}^{(0)}\).

For b = 1 to B:

Step 2.) Sample \((\alpha^{*^b}, \beta^{*^b}) \sim N_2((\alpha^{(b-1)}, \beta^{(b-1)})^{T}, \Sigma_{Tune})\), where \(\Sigma_{Tune}\) is a tuning parameter that primarily controls the acceptance rate. This is our proposed new \((\alpha_{MH}, \beta_{MH})\) sample; it enters the ratio \(r_b\), and we decide whether to keep it in Step 5.)

Step 3.) Compute the log ratio \(\log(r_b)\). Note that because our proposal distribution is symmetric, the proposal densities cancel and do not appear in the ratio.
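
For reference, the full Metropolis-Hastings ratio including the proposal density \(q\) is

\[ r_b = \frac{p(y \mid \alpha^{*^b}, \beta^{*^b})\, p(\alpha^{*^b}, \beta^{*^b})\, q(\alpha^{(b-1)}, \beta^{(b-1)} \mid \alpha^{*^b}, \beta^{*^b})}{p(y \mid \alpha^{(b-1)}, \beta^{(b-1)})\, p(\alpha^{(b-1)}, \beta^{(b-1)})\, q(\alpha^{*^b}, \beta^{*^b} \mid \alpha^{(b-1)}, \beta^{(b-1)})} \]

and because the bivariate normal proposal satisfies \(q(a \mid b) = q(b \mid a)\), the two \(q\) terms cancel, leaving only the likelihood and prior terms in the log ratio below.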

\[ \log(r_b) = \log(\text{Likelihood}(y, \alpha^{*^b}, \beta^{*^b})) + \log(\text{Prior}(\alpha^{*^b},\beta^{*^b})) - \log(\text{Likelihood}(y, \alpha_{MH}^{(b-1)}, \beta_{MH}^{(b-1)})) - \log(\text{Prior}(\alpha_{MH}^{(b-1)}, \beta_{MH}^{(b-1)})) \]

Step 4.) Compute \(r_b\) by taking \(\exp(\log(r_b))\)

Step 5.) Sample a \(u \sim Unif(0,1)\).

If \(u < \text{min}(1,r_b)\), then we accept the proposed \((\alpha^*, \beta^*)\). In other words, set \((\alpha_{MH}^{(b)}, \beta_{MH}^{(b)}) = (\alpha^*, \beta^*)\).

Else, we reject the proposed \((\alpha^*, \beta^*)\) and set \((\alpha_{MH}^{(b)}, \beta_{MH}^{(b)}) = (\alpha_{MH}^{(b-1)},\beta_{MH}^{(b-1)})\).
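
Note that Steps 4-5 can equivalently be carried out entirely on the log scale, which avoids numerical overflow when exponentiating a large log ratio. A minimal sketch with illustrative placeholder values (log.r, ab.star, and ab.curr are hypothetical names, not objects from the implementation below):

#Sketch: accept/reject on the log scale; log(u) < min(0, log(r_b)) is equivalent to u < min(1, r_b)
log.r   <- -0.7                #placeholder for log(r_b) from Step 3
ab.star <- c(0.8, -0.3)        #placeholder proposed (alpha*, beta*)
ab.curr <- c(0.76, -0.35)      #placeholder current (alpha, beta)
u <- runif(1)
if (log(u) < min(0, log.r)) {
  ab.curr <- ab.star           #accept the proposal
}                              #otherwise keep the previous (alpha, beta) unchanged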

Initial Chain Values

We use the initial chain values \((\hat{\alpha}, \hat{\beta}) = (0.76, -0.35)\). These were approximated by maximizing the log-likelihood of the data as a function of \((\alpha, \beta)\).
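
One way such starting values could be obtained is by numerically minimizing the negative log-likelihood with optim(); a minimal sketch, assuming y and x are the survival times and group indicator loaded in the next chunk:

#Sketch: maximum-likelihood starting values via optim()
neg.loglik <- function(par, y, x) {
  theta <- exp(par[1] + par[2] * x)                      #theta_i = exp(alpha + beta*x_i)
  -sum(dgamma(y, shape = 4, rate = theta, log = TRUE))   #negative log-likelihood
}
fit <- optim(par = c(0, 0), fn = neg.loglik, y = y, x = x)
fit$par   #should be close to the (0.76, -0.35) reported above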

Code Implementation:

Libraries needed.

library(mvtnorm)
library(gtools)
library(coda)
#dataset
data.rat <- read.csv('rat.csv')
y <- data.rat$V1
x <- data.rat$group
n <- nrow(data.rat)
##Predefine Functions used in M-H###

#prior values for alpha and beta coming from bivariate normal
prior.mu = c(0,0)
prior.sig = matrix(c(10^2, -50, -50, 10^2), nrow = 2, byrow = TRUE)

#Log of the prior density
Log.Prior <- function(alpha, beta, prior.mean, prior.sigma) {
  z = dmvnorm(c(alpha, beta), mean = prior.mean, sigma = prior.sigma, log = TRUE)
  return(z)
}

#Log of the likelihood of data
Log.Likelihood2 <- function(y, alpha, beta, x) {
  theta = exp(alpha + beta*x)
  val = sum(dgamma(y, shape = 4, rate = theta, log = TRUE))
  return(val)
}
### --  main function  -- ##
Metrop.func <- function(s1, s2, B, alpha.chain.init, beta.chain.init, data) {
  
  #read data
  y <- data$V1
  x <- data$group
  
  ## initialize
  accept = 0
  alpha.post.MH <- numeric()
  beta.post.MH <- numeric()
  alpha.post.MH[1]  <- alpha.chain.init #initalize alpha
  beta.post.MH[1] <- beta.chain.init #initialize beta
  set.seed(11)
  
  #tuning parameter covariance matrix
  diag.delta <- matrix(c(s1, 0, 0, s2), nrow = 2, byrow = TRUE) 
  
  for(b in 2:(B+1)) {
    #print(paste('iteration: ', b))
    #M-H Step for alpha and beta | data
    ab.star <- as.numeric(rmvnorm(1, mean = c(alpha.post.MH[(b-1)], beta.post.MH[(b-1)]), sigma = diag.delta))
    a.star <- ab.star[1]
    b.star <- ab.star[2]
    
    num1 <- Log.Likelihood2(y, a.star, b.star, x)
    num2 <- Log.Prior(a.star, b.star, prior.mean = prior.mu, prior.sigma = prior.sig)
    denom1 <- Log.Likelihood2(y, alpha.post.MH[(b-1)], beta.post.MH[(b-1)], x)
    denom2 <- Log.Prior(alpha.post.MH[(b-1)], beta.post.MH[(b-1)], prior.mu, prior.sig)
    
    #log MH ratio: (log-likelihood + log-prior) at the proposal minus the same at the current values
    r.log <- num1 + num2 - denom1 - denom2
    
    r = exp(r.log)
    
    
    if(runif(1) < min(1,r)) {
      alpha.post.MH[(b)] = a.star
      beta.post.MH[(b)] = b.star
      accept = accept + 1
    } else {
      alpha.post.MH[(b)] <- alpha.post.MH[(b-1)]
      beta.post.MH[(b)] <- beta.post.MH[(b-1)]
    }
    
   #Print progress every 10,000th iteration
    if(b %% 10000 == 0) {
      cat(paste0("iteration: ", b, "\n"))
    }
    
  } #end for loop
  return.list = list(accep.rate = accept/B, beta.samples = beta.post.MH, alpha.samples = alpha.post.MH)
  return(return.list)
}

We’ll look at our acceptance rate. Per lecture notes from Math 534 and online sources, an acceptance rate near 25% is appropriate in higher dimensions. We ran the algorithm a few times with differing values of \(s_1\) and \(s_2\). With \(s_1 = s_2 = 1\) the acceptance rate was far too low. After a few trials, we found that using \(s_1 = s_2 = 0.01\) gave us an acceptance rate of approximately 25%.

#initial chain values for alpha and beta. (see above explanation) 
a.init <- 0.76
b.init <- -0.35

#Tuning parameters for variance of each of alpha and beta
s1 = .01
s2 = .01

#prior values for alpha and beta coming from bivariate normal
prior.mu = c(0,0)
prior.sig = matrix(c(10^2, -50, -50, 10^2), nrow = 2, byrow = TRUE)

#chain length
B = 50000
z <- Metrop.func(s1 = s1, s2 = s2, B = 50000, a.init, b.init, data = data.rat)
## iteration: 10000
## iteration: 20000
## iteration: 30000
## iteration: 40000
## iteration: 50000
z$accep.rate
## [1] 0.26582
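
The trial-and-error search over \(s_1 = s_2\) described above can be organized as a short loop of pilot runs; a sketch, where the candidate values and the pilot chain length of 5,000 are illustrative choices:

#Sketch: pilot runs over a few candidate tuning values, recording acceptance rates
for (s in c(1, 0.1, 0.05, 0.01)) {
  pilot <- Metrop.func(s1 = s, s2 = s, B = 5000, a.init, b.init, data = data.rat)
  cat("s1 = s2 =", s, " acceptance rate:", round(pilot$accep.rate, 3), "\n")
}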

Problem 1c.) Provide trace plots, acf plots, thinning rate, burnin, and effective sample size. Report your tuning parameter (covariance matrix).

We will first report our tuning parameter (covariance matrix \(\Sigma_{Tune}\)). As discussed in Part (b.), we have:

\[ [\Sigma_{Tune}]_{11} = s_1\\ [\Sigma_{Tune}]_{12} = 0\\ [\Sigma_{Tune}]_{21} = 0\\ [\Sigma_{Tune}]_{22} = s_2 \]

With our choice of \(s_1 = s_2 = 0.01\), as discussed in Part (b.), the acceptance rate is approximately 25%, which is near-optimal for higher-dimensional targets.

Let’s do some diagnostics. Let’s first examine some trace plots.

par(mfrow = c(1,2))
plot(z$beta.samples[1:10000], type = 'l', main = 'Trace Plot of Beta[1:10,000]')
plot(z$alpha.samples[1:10000], type = 'l', main = 'Trace Plot of Alpha[1:10,000]')

It appears that both our \(\alpha\) and \(\beta\) chains are dense and appear to mix well over the first 10,000 samples.

We will approximate a burn-in value using the running mean for alpha and beta samples.

params = cbind(z$beta.samples, z$alpha.samples)
temp = apply(params, 2, cumsum)
seq.div <- seq(from = 1, to = (B+1), by = 1)
means.beta = temp[,1]/seq.div
means.alpha = temp[,2]/seq.div
par(mfrow = c(1,2))
plot(1:(B+1), means.beta, type = 'l', main = 'Running Mean of Beta')
plot(1:(B+1), means.alpha, type = 'l', main = 'Running Mean of Alpha')

It appears that a burn-in value of approximately 7,500 will be sufficient.

burnin = 7500
#Autocorrelation plots# 
par(mfrow = c(1,2))
acf(z$beta.samples[-c(1:burnin)], main = 'Beta Samples after Burn-in')
acf(z$alpha.samples[-c(1:burnin)], main = 'Alpha Samples after Burn-in')

It appears our \(\alpha\) and \(\beta\) samples show noticeable autocorrelation that we need to account for. Let’s see when it decays to 0.

autocorr(as.mcmc(z$alpha.samples[-c(1:burnin)]), lags = c(10,20,30,40,50, 60, 70, 80, 90, 100), relative = FALSE)
## , , 1
## 
##                  [,1]
## Lag 10   0.2310767119
## Lag 20   0.0648119004
## Lag 30   0.0162263040
## Lag 40   0.0207803543
## Lag 50   0.0238873039
## Lag 60   0.0073002487
## Lag 70  -0.0106006388
## Lag 80  -0.0069793375
## Lag 90   0.0006591463
## Lag 100  0.0050989047

We will use a thinning rate of 60, since by lag 60 the autocorrelation has decayed essentially to 0.

Now let’s thin our chain. We first need to calculate the chain length \(B\) needed for an effective sample size of 2,000 independent samples. Since we keep every 60th value, we need \(60 \times 2000 = 120,000\) post-burn-in iterations; adding the burn-in of 7,500 gives a total chain length of \(B = 127,500\).

B = 127500
burnin = 7500
thin = 60
Eff.samp = floor((B - burnin)/thin) 
Eff.samp
## [1] 2000
B = 127500
burnin = 7500
z <- Metrop.func(s1 = s1, s2 = s2, B = B, a.init, b.init, data = data.rat)
## iteration: 10000
## iteration: 20000
## iteration: 30000
## iteration: 40000
## iteration: 50000
## iteration: 60000
## iteration: 70000
## iteration: 80000
## iteration: 90000
## iteration: 100000
## iteration: 110000
## iteration: 120000
alpha.thin = beta.thin = numeric()
for(i in 1:Eff.samp) {
  alpha.thin[i] = z$alpha.samples[(burnin+1+(thin*(i-1)))]
  beta.thin[i] = z$beta.samples[(burnin + 1 + (thin*(i-1)))]
}
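
Equivalently, the thinning can be done without a loop by building the index vector directly; a minimal sketch that yields the same alpha.thin and beta.thin as the loop above:

#Loop-free thinning: keep every 60th draw after the burn-in
keep <- seq(from = burnin + 1, by = thin, length.out = Eff.samp)
alpha.thin <- z$alpha.samples[keep]
beta.thin  <- z$beta.samples[keep]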

In summary, with the tuning covariance matrix \(\Sigma_{Tune}\) discussed above with \(s_1 = 0.01\) and \(s_2 = 0.01\), we obtained an acceptance rate of approximately 25%, a burn-in of 7,500, a thinning rate of 60, and an effective sample size of 2,000, requiring a total chain length of \(B = 127,500\).

Let’s see if we can further fine-tune our covariance matrix. The joint posterior plot from this initial run is shown below.

plot(alpha.thin, beta.thin, main = 'p(alpha, beta|data) under MH', xlab = 'alpha', ylab = 'beta', pch = 20)

cov(alpha.thin, beta.thin)
## [1] -0.002970878

Thus, we see that we can fine-tune our covariance tuning matrix by inserting an off-diagonal covariance term of approximately -0.0029. We will do that now.

Metrop.func2 <- function(s1, s2, B, alpha.chain.init, beta.chain.init, data) {
  
  #read data
  y <- data$V1
  x <- data$group
  
  ## initialize
  accept = 0
  alpha.post.MH <- numeric()
  beta.post.MH <- numeric()
  alpha.post.MH[1]  <- alpha.chain.init #initalize alpha
  beta.post.MH[1] <- beta.chain.init #initialize beta
  set.seed(11)
  
  #tuning parameter covariance matrix
  diag.delta <- matrix(c(s1, -0.0029, -0.0029, s2), nrow = 2, byrow = TRUE) 
  
  for(b in 2:(B+1)) {
    #print(paste('iteration: ', b))
    #M-H Step for alpha and beta | data
    ab.star <- as.numeric(rmvnorm(1, mean = c(alpha.post.MH[(b-1)], beta.post.MH[(b-1)]), sigma = diag.delta))
    a.star <- ab.star[1]
    b.star <- ab.star[2]
    
    num1 <- Log.Likelihood2(y, a.star, b.star, x)
    num2 <- Log.Prior(a.star, b.star, prior.mean = prior.mu, prior.sigma = prior.sig)
    denom1 <- Log.Likelihood2(y, alpha.post.MH[(b-1)], beta.post.MH[(b-1)], x)
    denom2 <- Log.Prior(alpha.post.MH[(b-1)], beta.post.MH[(b-1)], prior.mu, prior.sig)
    
    #log MH ratio: (log-likelihood + log-prior) at the proposal minus the same at the current values
    r.log <- num1 + num2 - denom1 - denom2
    
    r = exp(r.log)
    
    
    if(runif(1) < min(1,r)) {
      alpha.post.MH[(b)] = a.star
      beta.post.MH[(b)] = b.star
      accept = accept + 1
    } else {
      alpha.post.MH[(b)] <- alpha.post.MH[(b-1)]
      beta.post.MH[(b)] <- beta.post.MH[(b-1)]
    }
    
   #Print progress every 10,000th iteration
    if(b %% 10000 == 0) {
      cat(paste0("iteration: ", b, "\n"))
    }
    
  } #end for loop
  return.list = list(accep.rate = accept/B, beta.samples = beta.post.MH, alpha.samples = alpha.post.MH)
  return(return.list)
}

Let’s run it with this new tuning covariance matrix:

\[ [\Sigma_{Tune}]_{11} = s_1 = 0.01\\ [\Sigma_{Tune}]_{12} = -0.0029\\ [\Sigma_{Tune}]_{21} = -0.0029\\ [\Sigma_{Tune}]_{22} = s_2 = 0.01 \]

z <- Metrop.func2(s1 = s1, s2 = s2, B = 50000, a.init, b.init, data = data.rat)
## iteration: 10000
## iteration: 20000
## iteration: 30000
## iteration: 40000
## iteration: 50000
z$accep.rate
## [1] 0.2836

We have a slightly higher acceptance rate, which makes sense since our tuning covariance matrix accounted for the covariance we saw in our initial run of the joint posterior.

Let’s re-run our diagnostics again.

B = 50000
params = cbind(z$beta.samples, z$alpha.samples)
temp = apply(params, 2, cumsum)
seq.div <- seq(from = 1, to = (B+1), by = 1)
means.beta = temp[,1]/seq.div
means.alpha = temp[,2]/seq.div
par(mfrow = c(1,2))
plot(1:(B+1), means.beta, type = 'l', main = 'Running Mean of Beta')
plot(1:(B+1), means.alpha, type = 'l', main = 'Running Mean of Alpha')

A burn-in value of 7,500 still seems necessary.

Let’s recheck the ACF plots.

burnin = 7500
#Autocorrelation plots# 
par(mfrow = c(1,2))
acf(z$beta.samples[-c(1:burnin)], main = 'Beta Samples after Burn-in')
acf(z$alpha.samples[-c(1:burnin)], main = 'Alpha Samples after Burn-in')

Let’s check when the autocorrelation goes to 0.

autocorr(as.mcmc(z$alpha.samples[-c(1:burnin)]), lags = c(10,20,30,40,50, 60, 70, 80, 90, 100), relative = FALSE)
## , , 1
## 
##                  [,1]
## Lag 10   0.1589400090
## Lag 20   0.0338350984
## Lag 30   0.0165962370
## Lag 40   0.0087988960
## Lag 50   0.0105552772
## Lag 60   0.0004682415
## Lag 70   0.0114996750
## Lag 80  -0.0022761143
## Lag 90   0.0141132079
## Lag 100  0.0121622141

Again, a lag/thinning rate of 60 will be used.

Due to time constraints, and because tuning our covariance matrix with covariance term \(s_{12} = s_{21} = -0.0029\) did not change the thinning rate or the burn-in value, we will not re-run this chain out to the full length needed for the required effective sample size.
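
As an optional cross-check on the thinning choice, the coda package (already loaded above) can estimate the effective sample size of the un-thinned, post-burn-in chains directly; a sketch:

#Sketch: estimated effective sample size of the post-burn-in alpha and beta chains
effectiveSize(as.mcmc(cbind(alpha = z$alpha.samples[-(1:burnin)],
                            beta  = z$beta.samples[-(1:burnin)])))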

Problem 1d.) Obtain marginal and joint posterior plots of 2,000 independent posterior samples of \(\alpha\) and \(\beta\). Also report the MAP and 95% credible estimates for these model parameters. Is there a significant effect of the experimental group on the survival times? Elaborate.

plot(alpha.thin, beta.thin, main = 'p(alpha, beta|data) under MH', xlab = 'alpha', ylab = 'beta', pch = 20)

We provide the marginal posterior samples of \(\alpha\) and \(\beta\).

par(mfrow = c(1,2))
plot(density(alpha.thin), main = 'P(alpha|data)')
z.a = density(alpha.thin)
alpha.thin.MAP <- z.a$x[which.max(z.a$y)]
abline(v = alpha.thin.MAP, col = 'red')
alpha.thin.lower <- quantile(alpha.thin, 0.025)
alpha.thin.upper <- quantile(alpha.thin, 0.975)
abline(v = alpha.thin.lower, col = 'blue')
abline(v = alpha.thin.upper, col = 'blue')

plot(density(beta.thin), main = 'P(beta|data)')
z.b = density(beta.thin)
beta.thin.MAP <- z.b$x[which.max(z.b$y)]
abline(v = beta.thin.MAP, col = 'red')
beta.thin.lower <- quantile(beta.thin, 0.025)
beta.thin.upper <- quantile(beta.thin, 0.975)
abline(v = beta.thin.lower, col = 'blue')
abline(v = beta.thin.upper, col = 'blue')

We can visually see the MAP estimates and the 95% credible intervals for each of \(\alpha|data\) and \(\beta|data\). Numerically they are presented below in the console output.

##   parameters    lowers       maps     uppers
## 1      alpha  0.656580  0.7691150  0.8644199
## 2       beta -0.491793 -0.3424071 -0.2046865

What does all this mean for our rat data? Recall that \(\log(\theta_i) = \alpha + \beta x_i\) and that \(y_i|\theta_i \sim Gamma(4, \theta_i)\). Thus, our rate parameter \(\theta_i\) is the exponential of a linear function of \(\alpha\) and \(\beta\), where \(\alpha\) is the intercept and \(\beta\) is a slope that is only active for Group 1 (the restricted group).

Since \(\beta_{MAP}\) is negative (and the 95% credible interval for \(\beta\) excludes 0), the rate parameter for Group 1 (the restricted group), \(\exp(\alpha + \beta)\), is smaller than that of Group 0 (the ad libitum group), \(\exp(\alpha)\).

In a gamma distribution used for survival analysis, the rate parameter can be interpreted as the rate at which deaths occur. Thus, we can see that the rate of death occurrence is lower for Group 1 than for Group 0.

To confirm our inference we will compute the expected survival time distribution for each group in part e.).

Problem 1e.) Using your posterior samples, obtain distributions of the posterior mean survival times for each group utilizing the fact that the mean of the gamma distribution is given by \(\frac{4}{\theta_i}\).

We will obtain posterior samples of \(\theta\) for each group. The restricted group corresponds to \(x_i = 1\) and the ad libitum group to \(x_i = 0\).

Once we have the posterior samples of \(\theta\), we can approximate the posterior distribution of the expected survival time \(E[y_i|\theta_i] = \frac{4}{\theta_i}\) for each group.

alpha.beta.samp <- data.frame(alpha = alpha.thin, beta = beta.thin)
bootstrap.index <- sample(1:nrow(alpha.beta.samp), size = 2000, replace = TRUE)
ab.samp.mat <- matrix(NA, nrow = 2000, ncol = 2)
for(i in 1:2000) {
  ab.samp.mat[i,] <- as.numeric(alpha.beta.samp[bootstrap.index[i],])
}
theta.samples.x1 <- numeric()
for(i in 1:2000) {
  theta.samples.x1[i] <- exp(ab.samp.mat[i,1] + ab.samp.mat[i,2]*1)
}

expected.val.x1 <- 4/theta.samples.x1


#-------------------
#resample and bootstrap again for expected value of theta_i when x_i = 0
alpha.beta.samp <- data.frame(alpha = alpha.thin, beta = beta.thin)
bootstrap.index <- sample(1:nrow(alpha.beta.samp), size = 2000, replace = TRUE)
ab.samp.mat <- matrix(NA, nrow = 2000, ncol = 2)
for(i in 1:2000) {
  ab.samp.mat[i,] <- as.numeric(alpha.beta.samp[bootstrap.index[i],])
}
theta.samples.x0 <- numeric()
for(i in 1:2000) {
  theta.samples.x0[i] <- exp(ab.samp.mat[i,1] + ab.samp.mat[i,2]*0)
}

expected.val.x0 <- 4/theta.samples.x0

#--------------------
#make plots
par(mfrow = c(1,2))
plot(density(expected.val.x1), main = 'Group 1 - Exp.Survival')
z.x1 = density(expected.val.x1)
x1.MAP <- z.x1$x[which.max(z.x1$y)]
abline(v = x1.MAP, col = 'red')
x1.lower <- quantile(expected.val.x1, 0.025)
x1.upper <- quantile(expected.val.x1, 0.975)
abline(v = x1.lower, col = 'blue')
abline(v = x1.upper, col = 'blue')


plot(density(expected.val.x0), main = 'Group 0 - Exp.Survival')
z.x0 = density(expected.val.x0)
x0.MAP <- z.x0$x[which.max(z.x0$y)]
abline(v = x0.MAP, col = 'red')
x0.lower <- quantile(expected.val.x0, 0.025)
x0.upper <- quantile(expected.val.x0, 0.975)
abline(v = x0.lower, col = 'blue')
abline(v = x0.upper, col = 'blue')

We can visually see that Group 1 has an expected survival time that exceeds Group 0. In other words, a rat on a restricted diet appears, on average, to live longer than a rat that is allowed to eat freely. Thus, it appears that rats do eat themselves to the grave.

We can numerically check our results as well.

##             parameters   lowers     maps   uppers
## 1 Group 0 Exp.Survival 1.695004 1.847032 2.079459
## 2 Group 1 Exp.Survival 2.424211 2.627391 2.939281

As we can see from the table output, even the 97.5th percentile of the Group 0 expected survival time distribution is below the 2.5th percentile of the Group 1 distribution. Again, this confirms what we see in our posterior plots: the restricted group survived longer on average, and we can conclude that rats will eat themselves to the grave.
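
A complementary numerical summary is the posterior probability that the Group 1 mean survival time exceeds that of Group 0, along with a 95% interval for the difference; a sketch using the draws computed above (note the two groups were formed from separate resamples of the same thinned posterior):

#Posterior probability and 95% interval for the difference in mean survival times
mean(expected.val.x1 > expected.val.x0)
quantile(expected.val.x1 - expected.val.x0, probs = c(0.025, 0.975))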

Problem 2.) IQ Score Data

#libraries
library(invgamma)
#Data
y <- c(0.03, 0.12, -0.14, 1.18, 0.26, -0.06, -0.02, -0.32, 0.27, 0.80, 0.54, 0.18, -0.02, 0.23, -0.18, -0.06, 0.30, 0.07, -0.07)

sigma.j <- c(0.125, 0.147, 0.167, 0.373, 0.369, 0.103, 0.103, 0.220, 0.164, 0.251, 0.302, 0.223, 0.289, 0.290, 0.159, 0.167, 0.139, 0.094, 0.174)

Problem 2a.) Write out a Gibbs Sampler MCMC Algorithm for obtaining the posterior samples from \(p(\theta, \mu, \tau^2|data)\)

Our data \(y_j|\theta_j \sim N(\theta_j, \sigma^2_j)\).

We first note the following form of the joint posterior:

\[ \begin{align} p(\theta, \mu, \tau^2|y) &\propto p(\tau^2)p(\mu)[\prod_{j=1}^J p(\theta_j|\mu, \tau^2)] [\prod_{j=1}^J p(y_j|\theta_j)] \end{align} \]

From Lecture Slide 6 we have the distributions of the full posterior conditionals.

\[ (\mu|\theta, \tau^2) \sim N(\frac{J\bar{\theta}/\tau^2 + \mu_0 / \delta_0^2}{J/\tau^2 + 1/\delta_0^2}, [J/\tau^2 + 1/\delta_0^2]^{-1})\\ (\tau^2|\theta, \mu) \sim \Gamma^{-1}(\frac{\eta_0 + J}{2}, \frac{\eta_0 \tau_0^2 + \sum_{j=1}^J (\theta_j - \mu)^2}{2})\\ (\theta_j|y, \mu, \tau^2) \sim N(\frac{n_jy_j/\sigma^2 + \mu/\tau^2}{n_j/\sigma^2 + 1/\tau^2}, [n_j/\sigma^2 + 1/\tau^2]^{-1}) \]

Furthermore we have the following fixed values and identities:

\[ \mu_0 = 0\\ \delta_0^2 = 10^6\\ \eta_0 = 2*10^{-3}\\ \tau_0^2 = 10^6\\ J= 19\\ \sigma^2_j = \frac{\sigma^2}{n_j} \implies \frac{n_j}{\sigma^2} = \frac{1}{\sigma^2_j} \]

\[ (\mu|\theta, \tau^2) \sim N(\frac{J\bar{\theta}/\tau^2 + \mu_0 / \delta_0^2}{J/\tau^2 + 1/\delta_0^2}, [J/\tau^2 + 1/\delta_0^2]^{-1})\\ (\tau^2|\theta, \mu) \sim \Gamma^{-1}(\frac{\eta_0 + J}{2}, \frac{\eta_0 \tau_0^2 + \sum_{j=1}^J (\theta_j - \mu)^2}{2})\\ (\theta_j|y, \mu, \tau^2) \sim N(\frac{y_j/\sigma_j^2 + \mu/\tau^2}{1/\sigma_j^2 + 1/\tau^2}, [1/\sigma_j^2 + 1/\tau^2]^{-1}) \]

Our Gibbs Sampler Algorithm is as follows:

Step 0.) Set \(\theta^{(0)} = y\) as an initial value.

Step 1.) Sample \(\mu \sim N(\mu_0 = 0, \delta_0^2 = 10^6)\). Set \(\mu^{(0)} = \mu\)

Step 2.) Sample \(\tau^2 \sim \Gamma^{-1}(\frac{\eta_0}{2} = 10^{-3}, \frac{\eta_0 \tau_0^2}{2} = 10^3)\). Set \(\tau^{2^{(0)}} = \tau^2\)

For \(b = 1, \ldots, B\):

Step 3.) Sample \(\tau^{2^{(b)}}\) from \((\tau^2|\theta^{(b-1)}, \mu^{(b-1)}) \sim \Gamma^{-1}(\frac{\eta_0 + J}{2}, \frac{\eta_0 \tau_0^2 + \sum_{j=1}^J (\theta_j^{(b-1)} - \mu^{(b-1)})^2}{2})\)

Step 4.) Sample \(\mu^{(b)}\) from \((\mu|\theta^{(b-1)}, \tau^{2^{(b)}}) \sim N(\frac{J\bar{\theta^{(b-1)}}/\tau^{2^{(b)}} + \mu_0 / \delta_0^2}{J/\tau^{2^{(b)}} + 1/\delta_0^2}, [J/\tau^{2^{(b)}} + 1/\delta_0^2]^{-1})\)

Step 5.) Sample \(\theta^{(b)}\) from \((\theta_j|y, \mu^{(b)}, \tau^{2^{(b)}}) \sim N(\frac{y_j/\sigma_j^2 + \mu^{(b)}/\tau^{2^{(b)}}}{1/\sigma_j^2 + 1/\tau^{2^{(b)}}}, [1/\sigma_j^2 + 1/\tau^{2^{(b)}}]^{-1})\)

Step 6.) Evaluate the joint posterior density \(p(\theta, \mu, \tau^2|y) \propto p(\tau^2)p(\mu)[\prod_{j=1}^J p(\theta_j|\mu, \tau^2)] [\prod_{j=1}^J p(y_j|\theta_j)]\) (in practice, on the log scale) at the Gibbs samples drawn in Steps 3-5.

Step 7.) Repeat Steps 3-6 until you have obtained the desired number of independent posterior samples.

Problem 2b.) Obtain at least 2,000 independent posterior samples from \(p(\theta, \mu, \tau^2|y)\). Provide trace plots of your posterior samples and perform thinning if needed.

#fixed values
mu.0 <- 0
delta.0 <- sqrt(10^6)
eta.0 <- 2*(10^(-3))
tau.0.sq <- 10^6

#Defining posterior functions
#############################################
post_theta <- function(sigma.j, y, mu, tau.sq) {
  J = length(y)
  param1.num <- (y/sigma.j^2) + (mu/tau.sq)
  param1.denom <- ((1/sigma.j^2) + (1/tau.sq))^(-1)
  param1 <- param1.num*param1.denom
  param2 <- param1.denom
  theta.rval<- rnorm(J, mean = param1, sd = sqrt(param2))
  
  dens.eval <- sum(dnorm(theta.rval, mean = param1, sd = sqrt(param2), log = TRUE))
  return.list = list(rval = theta.rval, dens.val = dens.eval)
  
  return(return.list)
}

#################################################
post_tau <- function(eta.0, tau.0.sq, mu, theta) {
  J = length(theta)
  param1 <- (eta.0 + J)/2
  param2 <- ((eta.0*tau.0.sq) + (sum((theta - mu)^2)))/2
  rval <- rinvgamma(1, shape = param1, rate = param2)
  
  dens.eval <- dinvgamma(rval, shape = param1, rate = param2, log = TRUE)
  return.list = list(rval = rval, dens.val = dens.eval)

  return(return.list)
}
####################################################
post_mu <- function(tau.sq, theta, mu.0, delta.0) {
  J = length(theta)
  param1.num <- (J*mean(theta)/tau.sq) + (mu.0/delta.0^2)
  param1.denom <- ((J/tau.sq) + (1/delta.0^2))^(-1)
  param1 <- param1.num*param1.denom
  param2 <- param1.denom
  rval <- rnorm(1, mean = param1, sd = sqrt(param2))
  
  dens.eval <- dnorm(rval, mean = param1, sd = sqrt(param2), log = TRUE)
  return.list = list(rval = rval, dens.val = dens.eval)

  return(return.list)
}
#fixed values
mu.0 <- 0
delta.0 <- sqrt(10^6)
eta.0 <- 2*(10^(-3))
tau.0.sq <- 10^6

B = 50000
J = length(y)
theta.samples <- matrix(NA, nrow = B+1, ncol = J) #each row will be a Gibbs sample theta = (theta1,..., thetaJ)
tausq.samples <- numeric()
mu.samples <- numeric()

joint.posterior.log <- numeric()

y_data.log <- numeric()


#start our chain with Step 0, 1, 2 from algorithm above
set.seed(11)
theta.samples[1,] <- y
mu.samples[1] <- rnorm(1, mean = mu.0, sd = delta.0)
tausq.samples[1] <- rinvgamma(1, shape = eta.0/2, rate = eta.0 * tau.0.sq/2)


#Step 3-6
for(i in 2:(B+1)) { 
  
  #tausq | theta, mu
  tausq.list = post_tau(eta.0, tau.0.sq, mu.samples[(i-1)], as.numeric(theta.samples[i-1,]))
  tausq.samples[i] <- tausq.list$rval
  
  #mu | theta, tausq
  mu.list = post_mu(tausq.samples[i], as.numeric(theta.samples[i-1,]), mu.0, delta.0)
  mu.samples[i] <- mu.list$rval
  
  #theta | mu, tausq
  theta.list = post_theta(sigma.j, y, mu.samples[i], tausq.samples[i])
  theta.samples[i,] <- theta.list$rval
  
  y_data.log[(i-1)] <- sum(dnorm(y, mean = as.numeric(theta.samples[i,]), sd = sigma.j, log = TRUE))
  
  joint.posterior.log[(i-1)] <- sum(theta.list$dens.val, tausq.list$dens.val, mu.list$dens.val, y_data.log[(i-1)])
}

Trace plots of the joint.posterior.log. Note that for numerical stability purposes we computed the log of the joint posterior.

joint.posterior = exp(joint.posterior.log)
plot(joint.posterior.log, type = 'l', main = 'Trace Plot Joint.Posterior.Log')

Let’s look at a running mean of the joint posterior. We will now convert our log(joint posterior) back to the joint posterior by exponentiating the vector.

B = 50000
joint.posterior = exp(joint.posterior.log)
params = joint.posterior[-1]
temp = cumsum(params)
seq.div <- seq(from = 2, to = (B), by = 1)
means.joint.posterior = temp/seq.div
plot(2:(B), means.joint.posterior, type = 'l', main = 'Running Mean of JPL')

It appears that a burn-in value of approximately 30,000 will be sufficient.

Let’s check our ACF plot.

burnin = 30000
#Autocorrelation plots# 
acf(joint.posterior[-c(1:burnin)], main = 'Joint Posterior after Burn-In')

It appears that we do not need to thin our joint posterior samples since there doesn’t exist much autocorrelation after eliminating the burn-in.

Thus, since we are not thinning, started with an initial run of \(B = 50,000\) iterations, and discard \(30,000\) as burn-in, we have obtained \(20,000\) approximately independent samples from \(p(\theta, \mu, \tau^2|data)\), as desired.
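
As a quick sanity check on the claim of approximate independence, the effective sample sizes of the post-burn-in \(\mu\) and \(\tau^2\) chains can be estimated with coda (loaded in Problem 1); a sketch:

#Sketch: effective sample sizes of the post-burn-in mu and tau^2 chains
effectiveSize(as.mcmc(cbind(mu = mu.samples[-(1:burnin)],
                            tau.sq = tausq.samples[-(1:burnin)])))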

Problem 2c.) Part I.) Provide density plots of your posterior samples of the model parameters, as well as 95% credible intervals and MAP estimates.

Let’s do some diagnostics on our \(\mu\) and \(\tau^2\) samples.

B = B
params = cbind(mu.samples, tausq.samples)
params = params[-1,]
temp = apply(params, 2, cumsum)
seq.div <- seq(from = 2, to = (B+1), by = 1)
means.mu = temp[,1]/seq.div
means.tau.sq = temp[,2]/seq.div
par(mfrow = c(1,2))
plot(2:(B+1), means.mu, type = 'l', main = 'Running Mean of Mu')
plot(2:(B+1), means.tau.sq, type = 'l', main = 'Running Mean of Tau.Sq')

lower.vec <- numeric()
upper.vec <- numeric()
map.vec <- numeric()

par(mfrow = c(2,1))
samples = tail(mu.samples, 10000)
#samples = mu.samples
plot(density(samples), main = 'P(mu_Gibbs)')
z = density(samples)
MAP <- z$x[which.max(z$y)]
abline(v = MAP, col = 'red')
lower <- quantile(samples, 0.025)
upper <- quantile(samples, 0.975)
abline(v = lower, col = 'blue')
abline(v = upper, col = 'blue')
lower.vec[1] <- lower
upper.vec[1] <- upper
map.vec[1] <- MAP


samples = tail(tausq.samples, 10000)
plot(density(samples), main = 'P(tau.sq_Gibbs)')
z = density(samples)
MAP <- z$x[which.max(z$y)]
abline(v = MAP, col = 'red')
lower <- quantile(samples, 0.025)
upper <- quantile(samples, 0.975)
abline(v = lower, col = 'blue')
abline(v = upper, col = 'blue')

lower.vec[2] <- lower
upper.vec[2] <- upper
map.vec[2] <- MAP

Here are the numerical results in a data.frame console output.

##     parameters   lowers       maps     uppers
## 1     mu_gibbs -4.92614  0.1111276   5.224513
## 2 tau.sq_gibbs 63.58325 95.4254798 244.529403
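
The table above can be assembled from the lower/MAP/upper values stored in the previous chunk; a sketch of how such a summary data.frame can be built:

#Sketch: summary table from the stored lower/MAP/upper values
data.frame(parameters = c("mu_gibbs", "tau.sq_gibbs"),
           lowers = lower.vec, maps = map.vec, uppers = upper.vec)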

Problem 2c.) Part II.) Moreover, compare the estimates of the \(\theta_i\) to the observed effect and the overall mean effect across all studies.

Let’s check out a couple of our Gibbs \(\theta_j\).

lower.vec <- numeric()
upper.vec <- numeric()
map.vec <- numeric()

par(mfrow = c(2,1))
theta.1 <- as.numeric(theta.samples[,1])
samples = tail(theta.1, 10000)
#samples = mu.samples
plot(density(samples), main = 'P(theta_1 Gibbs)')
z = density(samples)
MAP <- z$x[which.max(z$y)]
abline(v = MAP, col = 'red')
lower <- quantile(samples, 0.025)
upper <- quantile(samples, 0.975)
abline(v = lower, col = 'blue')
abline(v = upper, col = 'blue')
lower.vec[1] <- lower
upper.vec[1] <- upper
map.vec[1] <- MAP



theta.4 <- as.numeric(theta.samples[,4])
samples = tail(theta.4, 10000)
plot(density(samples), main = 'P(theta_4 Gibbs)')
z = density(samples)
MAP <- z$x[which.max(z$y)]
abline(v = MAP, col = 'red')
lower <- quantile(samples, 0.025)
upper <- quantile(samples, 0.975)
abline(v = lower, col = 'blue')
abline(v = upper, col = 'blue')

lower.vec[2] <- lower
upper.vec[2] <- upper
map.vec[2] <- MAP

Here are the numerical results in a data.frame console output.

##      parameters     lowers       maps    uppers observed.effect
## 1 theta_1 gibbs -0.2119753 0.04026493 0.2737457            0.03
## 2 theta_4 gibbs  0.4509788 1.16513946 1.9283290            1.18

We will also include a console output of the 95% credible intervals for the \(\theta_j\).

95% Credible Intervals for Thetas

##        2.5%         50%       97.5% 
## -0.21197534  0.02955113  0.27374571 
##       2.5%        50%      97.5% 
## -0.1677276  0.1220001  0.4103404 
##       2.5%        50%      97.5% 
## -0.4638582 -0.1387893  0.1896810 
##      2.5%       50%     97.5% 
## 0.4509788 1.1793690 1.9283290 
##       2.5%        50%      97.5% 
## -0.4755122  0.2592586  0.9728368 
##        2.5%         50%       97.5% 
## -0.26493088 -0.05685746  0.14249402 
##        2.5%         50%       97.5% 
## -0.22336212 -0.02207886  0.17759042 
##       2.5%        50%      97.5% 
## -0.7537856 -0.3204266  0.1121454 
##        2.5%         50%       97.5% 
## -0.04742889  0.26878043  0.58762908 
##      2.5%       50%     97.5% 
## 0.3118087 0.7997457 1.2951560 
##        2.5%         50%       97.5% 
## -0.05452223  0.54096462  1.13958951 
##       2.5%        50%      97.5% 
## -0.2532166  0.1772947  0.6129856 
##        2.5%         50%       97.5% 
## -0.57625305 -0.01481592  0.53780485 
##       2.5%        50%      97.5% 
## -0.3349262  0.2296117  0.7961485 
##       2.5%        50%      97.5% 
## -0.4889775 -0.1824865  0.1333774 
##        2.5%         50%       97.5% 
## -0.38811461 -0.05718639  0.26977266 
##       2.5%        50%      97.5% 
## 0.02265753 0.30096514 0.57159284 
##        2.5%         50%       97.5% 
## -0.11795452  0.06992924  0.24944663 
##        2.5%         50%       97.5% 
## -0.41827525 -0.06903138  0.27347358

I am confident in the results of my sampling. Looking at the console output for the table above, \(\hat{\theta}_4\) is close in value to the observed effect \(y_4 = 1.18\). Furthermore, the 95% credible interval for each \(\theta_j\) is consistent with the corresponding observed effect \(y_j\).

Recall in our model that \(y_j | \theta_j \sim N(\theta_j, \sigma^2_j)\). Thus, one interpretation is that \(\theta_j\) is the mean effect for school \(j\). This is consistent with our console output above.

Recall also that in our hierarchical model \((\theta_j|\mu, \tau^2) \sim N(\mu, \tau^2)\). Thus, one interpretation is that the overall mean of the effects on all \(J\) schools is captured by the \(\mu\) parameter.

Therefore, we will now look more carefully at the distribution of the \(\mu\) samples, and in particular the last \(10,000\).

The overall observed mean effect across all studies is \(\bar{y} = 0.16\). Let’s compare this to the 95% CI for \(\mu\) and the mean of the last \(10,000\) \(\mu\) samples.

mu.10 <- tail(mu.samples, 10000)
quantile(mu.10, probs = c(0.025, 0.5, 0.975))
##       2.5%        50%      97.5% 
## -4.9261399  0.1861451  5.2245125
mean(y)
## [1] 0.1636842
mean(mu.10)
## [1] 0.1826752

As we can see the mean of our \(\mu\) samples does come close to the overall mean \(\bar{y}\).

That being said, the results of our \(\mu\) samples do not agree with the naive analysis of the effects \(y_j\). Taken at face value, the overall observed mean effect across all studies, \(\bar{y} = 0.16\), suggests that the teachers’ actions did have a positive effect on the students’ IQ scores. However, because the 95% credible interval for \(\mu\) includes 0, we cannot conclude that the teachers’ actions had a positive effect. It turns out that knowing Bayesian statistics helps us carry out a more informed analysis.

Problem 3.) Hearing.txt Hierarchical Model

Problem 3a.) Write out the joint likelihood \(p(y|\theta, \phi, \sigma^2)\)

We first define some notation. Let \(y_h\) denote the vector of observations for subject \(h = 1,...,24\). In other words, \(y_h = (y_{h1}, y_{h2}, y_{h3}, y_{h4})\). Then, the joint likelihood is given by the following.

\[ \begin{align} f(y|\theta, \phi, \sigma^2) &= \prod_{h = 1}^{24} f(y_h)\\ &=\prod_{h = 1}^{24} f(y_{h1}, y_{h2}, y_{h3}, y_{h4})\\ &= \prod_{h = 1}^{24} \prod_{j=1}^4 f(y_{hj})\\ &= \prod_{h = 1}^{24} \prod_{j=1}^4 dNormal(\text{mean =} \theta_h + \phi_j, \text{var =} \sigma^2)\\ \end{align} \]
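
For reference, this joint likelihood can be evaluated (on the log scale) directly from the product above; a minimal sketch, assuming hearing is the 24 × 4 data frame of scores read in below and the example call uses illustrative parameter values:

#Sketch: joint log-likelihood log p(y | theta, phi, sigma^2) for the hearing data
joint.loglik <- function(y, theta, phi, sig.sq) {
  mu.mat <- outer(theta, phi, "+")     #24 x 4 matrix of means theta_h + phi_j
  sum(dnorm(as.matrix(y), mean = mu.mat, sd = sqrt(sig.sq), log = TRUE))
}
#example call: joint.loglik(hearing, theta = rep(30, 24), phi = rep(0, 4), sig.sq = 25)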

Problem 3b.) Derive the full posterior conditional distribution for \(\theta_h\).

\[ \begin{align} f(\theta_h|\theta_{-h}, \phi, \mu, \sigma^2, y) &= f(\theta_h|\phi, \mu, \sigma^2, y)\\ &\propto f(\theta_h|\mu, \sigma^2) \prod_{j=1}^{4} f(y_{hj}|\phi, \theta_h, \sigma^2)\\ &\sim N(\frac{\bar{y}_h/\sigma_h^2 + \mu/\sigma^2}{1/\sigma_h^2 + 1/\sigma^2}, [1/\sigma_h^2 + 1/\sigma^2]^{-1}) \end{align} \]

Problem 3c.) Derive the full posterior conditional distribution for \(\phi_j\).

\[ \begin{align} f(\phi_j|\phi_{-j}, \theta, \mu, \sigma^2, y) &= f(\phi_j|\theta, \mu, \sigma^2, y)\\ &\propto f(\phi_j|\sigma^2) \prod_{j=1}^{4} f(y_{hj}|\phi_j, \theta, \sigma^2)\\ &\sim N(\frac{\bar{y}_h/\sigma_h^2 + 0}{1/\sigma_h^2 + 4/\sigma^2}, [1/\sigma_h^2 + 4/\sigma^2]^{-1}) \end{align} \]

Problem 3d.) Derive the full posterior conditionals for the hyperparameters.

\[ f(\mu|\phi, \theta, \sigma^2, y) \propto f(\mu) \prod_{h=1}^{24} f(\theta_h|\mu, \sigma^2)\\ \implies \sim N(\frac{24\bar{\theta}/\sigma^2 + 30/\frac{\sigma^2}{9}}{24/\sigma^2 + 1/\frac{\sigma^2}{9}}, [24/\sigma^2 + 1/\frac{\sigma^2}{9}]^{-1}) \]

\[ f(\sigma^2|\phi, \theta, \mu, y) \propto f(\sigma^2) \prod_{h=1}^{24} f(\theta_h|\mu, \sigma^2)\\ \implies \sim \Gamma^{-1}(\frac{2 + 24}{2},\frac{(2)(1) + \sum_{h = 1}^{24} (\theta_h - \mu)^2}{2}) \]

Problem 3e.) Fit the model with MCMC.

post_mu <- function(theta, sig.sq) {
  theta.bar <- mean(theta)
  param1.num <- (24*theta.bar/sig.sq) + (30/(sig.sq/9))     #mu_0 = 30, delta_0^2 = sigma^2/9
  param1.denom <- ((24/sig.sq) + (1/(sig.sq/9)))^(-1)
  param1 <- param1.num*param1.denom
  rval <- rnorm(1, mean = param1, sd = sqrt(param1.denom))
  return(rval)
}

post_theta <- function(y.bar, sig.h.sq, sig.sq, mu) {
  H = length(y.bar)
  param1.num <- (y.bar/sig.h.sq) + (mu/sig.sq)
  param1.denom <- ((1/sig.h.sq) + (1/sig.sq))^(-1)
  param1 <- param1.num*param1.denom
  rval <- rnorm(H, mean = param1, sd = sqrt(param1.denom))
  return(rval)
}

post_phi <- function(y.bar, sig.h.sq, sig.sq) {
  param1.num <- (y.bar/sig.h.sq)
  param1.denom <- ((1/sig.h.sq) + (4/sig.sq))^(-1)
  param1 <- param1.num*param1.denom
  rval <- rnorm(4, mean = param1, sd= sqrt(param1.denom))
  return(rval)
}

post_sigma.sq <- function(theta, mu) {
  param1 <- (2+24)/2
  param2 <- (2 + sum((theta - mu)^2))/2   #sum of squared deviations, per the full conditional above
  rval <- rinvgamma(1, shape = param1, rate = param2)
  return(rval)
}
hearing <- read.csv('hearing-1.txt', sep = '\t')
y.bar <- apply(hearing, 1, mean)
sig.h.sq <- apply(hearing, 1, sd)^2

B = 50000
J = length(y.bar)

theta.samples <- matrix(NA, nrow = B+1, ncol = J) #each row will be a Gibbs sample theta = (theta1,..., thetaJ)
mu.samples <- numeric()
phi.samples <- matrix(NA, nrow = B+1, ncol = ncol(hearing))
sig.sq.samples <- numeric()


#start our chain with some initial values
set.seed(11)
theta.samples[1,] <- y.bar
sig.sq.samples[1] <- rinvgamma(1, shape = 1, rate = 1)

#Step 3-6
for(i in 2:(B+1)) { 
  
  #mu|others
  mu.samples[i-1] <- post_mu(as.numeric(theta.samples[i-1, ]), sig.sq.samples[i-1])
  
  #phi|others
  phi.samples[i-1,] <- post_phi(y.bar, sig.h.sq, sig.sq.samples[i-1])
  
  #theta | others
  theta.samples[i,] <- post_theta(y.bar, sig.h.sq, sig.sq.samples[i-1], mu.samples[i-1])
  
  #sigma.sq|others
  sig.sq.samples[i] <- post_sigma.sq(as.numeric(theta.samples[i,]), mu.samples[i-1])
}

Trace Plot and Running Mean of \(\mu\)

par(mfrow = c(1,2))
plot(mu.samples, type = 'l', main = 'Trace Plot Mu')
B = 50000
params = mu.samples
temp = cumsum(params)
seq.div <- seq(from = 1, to = (B), by = 1)
running.means = temp/seq.div
plot(1:(B), running.means, type = 'l', main = 'Running Mean of Mu')

Trace Plot and Running Mean of \(\theta_1\)

par(mfrow = c(1,2))
samples <- as.numeric(theta.samples[,1])
plot(samples, type = 'l', main = 'Trace Plot Theta_1')
N = length(samples)
params = samples
temp = cumsum(params)
seq.div <- seq(from = 1, to = (N), by = 1)
running.means = temp/seq.div
plot(1:(N), running.means, type = 'l', main = 'Running Mean of Theta_1')

Trace Plot and Running Mean of \(\theta_2\)

par(mfrow = c(1,2))
samples <- as.numeric(theta.samples[,2])
plot(samples, type = 'l', main = 'Trace Plot Theta_2')
N = length(samples)
params = samples
temp = cumsum(params)
seq.div <- seq(from = 1, to = (N), by = 1)
running.means = temp/seq.div
plot(1:(N), running.means, type = 'l', main = 'Running Mean of Theta_2')

Trace Plot and Running Mean of \(\theta_6\)

par(mfrow = c(1,2))
samples <- as.numeric(theta.samples[,6])
plot(samples, type = 'l', main = 'Trace Plot Theta_6')
N = length(samples)
params = samples
temp = cumsum(params)
seq.div <- seq(from = 1, to = (N), by = 1)
running.means = temp/seq.div
plot(1:(N), running.means, type = 'l', main = 'Running Mean of Theta_6')

Trace Plot and Running Mean of \(\phi_1\)

par(mfrow = c(1,2))
samples <- as.numeric(phi.samples[,1])
plot(samples, type = 'l', main = 'Trace Plot Phi_1')
N = length(samples)
params = samples
temp = cumsum(params)
seq.div <- seq(from = 1, to = (N), by = 1)
running.means = temp/seq.div
plot(1:(N), running.means, type = 'l', main = 'Running Mean of Phi_1')

Trace Plot and Running Mean of \(\phi_4\)

par(mfrow = c(1,2))
samples <- as.numeric(phi.samples[,4])
plot(samples, type = 'l', main = 'Trace Plot Phi_4')
N = length(samples)
params = samples
temp = cumsum(params)
seq.div <- seq(from = 1, to = (N), by = 1)
running.means = temp/seq.div
plot(1:(N), running.means, type = 'l', main = 'Running Mean of Phi_4')

It appears from all of our trace and running mean plots that a burn-in of \(10,000\) will be sufficient.

Problem 3f.) Provide MLE estimates of the \(\theta_h\).

Recall our model.

\[ (y_{hj}|\theta_h, \phi_j, \sigma^2) \sim N(\theta_h + \phi_j, \sigma^2) \]

Thus, by the MLE estimate of a Normal distribution we have

\[ \hat{\theta}_{h_{MLE}} = \bar{y}_h \]

We will provide a numerical output comparing our MLE estimate and our Bayesian posterior means of our \(\theta_h\).

burnin= 10000
theta.bayes.means <- apply(theta.samples[-c(1:burnin),], 2, mean)
bayes.means <- round(theta.bayes.means, 1)
mles <- y.bar

results.df <- data.frame(c(1:length(y.bar)), mles, bayes.means)
colnames(results.df) <- c("h", "MLE", "Bayes.Means")
results.df
##     h  MLE Bayes.Means
## 1   1 24.5        25.7
## 2   2 24.0        25.5
## 3   3 28.0        26.5
## 4   4 20.5        24.6
## 5   5 31.0        27.6
## 6   6 28.0        26.8
## 7   7 27.0        26.3
## 8   8 28.5        27.1
## 9   9 36.5        28.4
## 10 10 30.5        27.1
## 11 11 28.0        26.6
## 12 12 31.0        27.2
## 13 13 32.0        28.2
## 14 14 36.0        29.4
## 15 15 33.5        28.4
## 16 16 27.0        26.2
## 17 17 28.5        26.7
## 18 18 19.5        24.3
## 19 19 38.0        30.4
## 20 20 21.5        24.7
## 21 21 17.0        23.2
## 22 22 22.0        25.0
## 23 23 27.0        26.3
## 24 24 40.0        30.4
library(reshape2)
library(ggplot2)
melted.gg.df <- melt(results.df, id.vars='h', variable.name='z')
p <- ggplot(data=melted.gg.df, aes(h, value)) + geom_point(aes(colour=z)) + geom_hline(yintercept = mean(y.bar)) + 
  scale_x_discrete(name ="Theta_h", 
                    limits=c("1","2","3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24"))
p

We can see that the Bayes estimates are shrunk toward the center and have fewer ‘extreme’ values compared to the MLE estimates. For example, while the MLE estimate of \(\theta_{21}\) is below 20, the corresponding Bayes estimate is near 24, closer to the overall mean of the data, i.e., mean(y.bar).

Problem 3g.) Plot the densities of the \(\phi_j\) for all four \(\phi\). What can you conclude about the lists?

burnin = 10000
lower.vec <- numeric()
upper.vec <- numeric()
map.vec <- numeric()
par(mfrow = c(2,2))
phi.post.samples <- phi.samples[-c(1:burnin, nrow(phi.samples)), ]

for(j in 1:4) {
  phi <- as.numeric(phi.post.samples[,j])
  samples <- phi
  title <- paste0('P(Phi_', j, ' Gibbs)')
  plot(density(samples), main = title)
  z = density(samples)
  MAP <- z$x[which.max(z$y)]
  abline(v = MAP, col = 'red')
  lower <- quantile(samples, 0.025)
  upper <- quantile(samples, 0.975)
  abline(v = lower, col = 'blue')
  abline(v = upper, col = 'blue')
  lower.vec[j] <- lower
  upper.vec[j] <- upper
  map.vec[j] <- MAP
}

Here are the numerical results:

##    parameters     lowers       maps   uppers
## 1 phi_1 gibbs -0.2567914 0.22481639 27.13545
## 2 phi_2 gibbs -0.5653560 0.06600242 27.52085
## 3 phi_3 gibbs -0.7031373 0.06468620 31.01528
## 4 phi_4 gibbs -0.6478045 0.06888625 24.43991

Zero is contained in each of the 95% credible intervals. Thus, we conclude that the lists appear to have the same level of difficulty, i.e., there is no evidence that any individual list has an effect on the subject scores.
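
This conclusion can also be checked directly through pairwise differences of the list effects; a sketch using phi.post.samples from above (an interval containing 0 is consistent with no difference between the two lists):

#Sketch: 95% credible intervals for pairwise differences of the list effects
for (j in 1:3) {
  for (k in (j + 1):4) {
    ci <- quantile(phi.post.samples[, j] - phi.post.samples[, k], probs = c(0.025, 0.975))
    cat("phi_", j, " - phi_", k, ": [", round(ci[1], 3), ", ", round(ci[2], 3), "]\n", sep = "")
  }
}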