Chapter 11 Problem 2.) Metropolis algorithm: Replicate the computations for the bioassay example of Section 3.7 using the Metropolis algorithm. Be sure to define your starting points and your jumping rule. Compute with log-densities (see page 261). Run the simulations long enough for approximate convergence.

We use the Gaussian prior for \((\alpha, \beta)\) as given in Problem 11.) of Chapter 3.

\[ (\alpha, \beta) \sim N_2(\mu_0, \Sigma_0)\\ \mu_0 = (0,10)^T\\ [\Sigma_0]_{11} = 2^2\\ [\Sigma_0]_{12} = [\Sigma_0]_{21} = 12\\ [\Sigma_0]_{22} = 10^2 \]

We will use a bivariate Normal proposal distribution for \(\alpha\) and \(\beta\) with tuning (covariance) parameter \(\Sigma_{Tune}\).

\[ \text{proposal: } (\alpha^*, \beta^*) \sim N_2((\alpha^{(b-1)}, \beta^{(b-1)})^{T}, \Sigma_{Tune})\\ [\Sigma_{Tune}]_{11} = s_1\\ [\Sigma_{Tune}]_{12} = 0\\ [\Sigma_{Tune}]_{21} = 0\\ [\Sigma_{Tune}]_{22} = s_2 \]

In summary our model is as follows:

\[ \text{prior: } (\alpha, \beta) \sim N_2(\mu_0, \Sigma_0)\\ \mu_0 = (0,10)^T\\ [\Sigma_0]_{11} = 2^2\\ [\Sigma_0]_{12} = [\Sigma_0]_{21} = 12\\ [\Sigma_0]_{22} = 10^2\\ \text{Data}: y_i|\theta_i \sim Binomial(n_i, \theta_i)\\ \text{Link Function: } \text{logit}(\theta_i) = \alpha + \beta x_i\\ \text{Proposal Distribution for MH: } (\alpha^*, \beta^*) \sim N_2((\alpha^{(b-1)}, \beta^{(b-1)})^{T}, \Sigma_{Tune})\\ [\Sigma_{Tune}]_{11} = s_1\\ [\Sigma_{Tune}]_{12} = 0\\ [\Sigma_{Tune}]_{21} = 0\\ [\Sigma_{Tune}]_{22} = s_2 \]
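For later reference (this term appears in the acceptance ratio in the Algorithm section below), the log-likelihood implied by the Binomial data model is, up to an additive constant that cancels in the ratio:

\[ \log(\text{Likelihood}(y, \alpha, \beta)) = \sum_{i=1}^{4}\left[ y_i \log\left(\text{logit}^{-1}(\alpha + \beta x_i)\right) + (n_i - y_i)\log\left(1 - \text{logit}^{-1}(\alpha + \beta x_i)\right)\right] \]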

###Algorithm:

Step 1.) Initial values for \(\alpha_{MH}^{0}\) and \(\beta_{MH}^{0}\)

For b = 1 to B:

Step 2.) Sample \((\alpha^{*^b}, \beta^{*^b}) \sim N_2((\alpha^{(b-1)}, \beta^{(b-1)})^{T}, \Sigma_{Tune})\), where \(\Sigma_{Tune}\) is a tuning parameter that primarily controls the acceptance rate. This is our proposed new \((\alpha_{MH}, \beta_{MH})\) sample; it enters the ratio \(r_b\), and we decide whether to keep it in Step 5.)

Step 3.) Compute the log ratio \(\log(r_b)\). Since our proposal is symmetric, the proposal densities cancel and do not need to appear in the ratio.

\[ \log(r_b) = \log(\text{Likelihood}(y_i, \alpha^{*^b}, \beta^{*^b})) + \log(\text{Prior}(\alpha^{*^b},\beta^{*^b})) - \log(\text{Likelihood}(y_i, \alpha_{MH}^{(b-1)}, \beta_{MH}^{(b-1)})) - \log(\text{Prior}(\alpha_{MH}^{(b-1)}, \beta_{MH}^{(b-1)})) \]

Step 4.) Compute \(r_b\) by taking \(\exp(\log(r_b))\)

Step 5.) Sample a \(u \sim Unif(0,1)\).

If \(u < \text{min}(1,r_b)\), then we accept the proposal \((\alpha^*, \beta^*)\). In other words, set \((\alpha_{MH}^{(b)}, \beta_{MH}^{(b)}) = (\alpha^*, \beta^*)\).

Else, we reject the proposal \((\alpha^*, \beta^*)\) and set \((\alpha_{MH}^{(b)}, \beta_{MH}^{(b)}) = (\alpha_{MH}^{(b-1)},\beta_{MH}^{(b-1)})\)

Initial Chain Values

We simply use the initial chain values we found in Homework 2, namely \((\hat{\alpha}, \hat{\beta}) = (0.7135563, 7.93254623)\).

###Code Implementation:

library(mvtnorm)
library(gtools)
library(coda)
##Dataset 
data.bio <- data.frame(Dose = c(-0.86, -0.30, -0.05, 0.73), Num.Animals = c(5,5,5,5), Num.Deaths = c(0,1,3,5))
n <- data.bio$Num.Animals
y <- data.bio$Num.Deaths
x <- data.bio$Dose
##Predefine Functions used in M-H###

#prior values for alpha and beta coming from bivariate normal
prior.mu = c(0,10)
prior.sig = matrix(c(2^2, 12, 12, 10^2), nrow = 2, byrow = TRUE)

#Log of the prior density
Log.Prior <- function(alpha, beta, prior.mean, prior.sigma) {
  z = dmvnorm(c(alpha, beta), mean = prior.mean, sigma = prior.sigma, log = TRUE)
  return(z)
}

#Log of the likelihood
Log.Likelihood <- function(y, alpha, beta, n, x) {
  val = 0
  for(i in 1:length(y)) {
    current.val <- y[i]*log(inv.logit(alpha + beta*x[i])) + (n[i] - y[i])*log(1 - inv.logit(alpha + beta*x[i]))
    val = val + current.val
  }
  return(val)
}
### --  main function  -- ##
Metrop.func <- function(s1, s2, B, alpha.chain.init, beta.chain.init, data) {
  
  #read data
  n <- data$Num.Animals
  y <- data$Num.Deaths
  x <- data$Dose
  
  ## initialize
  accept = 0
  alpha.post.MH <- numeric()
  beta.post.MH <- numeric()
  alpha.post.MH[1]  <- alpha.chain.init #initialize alpha
  beta.post.MH[1] <- beta.chain.init #initialize beta
  set.seed(10)
  
  #tuning parameter covariance matrix
  diag.delta <- matrix(c(s1, 0, 0, s2), nrow = 2, byrow = TRUE) 
  
  for(b in 2:(B+1)) {
    #M-H Step for alpha and beta | data
    ab.star <- as.numeric(rmvnorm(1, mean = c(alpha.post.MH[(b-1)], beta.post.MH[(b-1)]), sigma = diag.delta))
    a.star <- ab.star[1]
    b.star <- ab.star[2]

    r.log <- Log.Likelihood(y, a.star, b.star, n, x) + Log.Prior(a.star, b.star, prior.mean = prior.mu, prior.sigma = prior.sig) - Log.Likelihood(y, alpha.post.MH[(b-1)], beta.post.MH[(b-1)], n, x) - Log.Prior(alpha.post.MH[(b-1)], beta.post.MH[(b-1)], prior.mu, prior.sig)
    
    r = exp(r.log)
    
    if(runif(1) < min(1,r)) {
      alpha.post.MH[(b)] = a.star
      beta.post.MH[(b)] = b.star
      accept = accept + 1
    } else {
      alpha.post.MH[(b)] <- alpha.post.MH[(b-1)]
      beta.post.MH[(b)] <- beta.post.MH[(b-1)]
    }
    
   # Print every 100th iteration
    #if(b %% 10000 == 0) {
     # cat(paste0("iteration: ", b, "\n"))
   # }
    
  } #end for loop
  return.list = list(accep.rate = accept/B, beta.samples = beta.post.MH, alpha.samples = alpha.post.MH)
  return(return.list)
}

Let’s run our algorithm. Initial run with \(s_1 = 0.8\) and \(s_2 = 2.5\). We’ll look at the acceptance rate. Per lecture in Math 534 and other sources, an acceptance rate near 25% is appropriate for higher-dimensional targets.

#initial chain values for alpha and beta. Looked up from Homework 2. 
a.init <- 0.713
b.init <- 7.92

#Tuning parameters for variance of each of alpha and beta
s1 = 0.8
s2 = 2.5

#prior values for alpha and beta coming from bivariate normal
prior.mu = c(0,10)
prior.sig = matrix(c(2^2, 12, 12, 10^2), nrow = 2, byrow = TRUE)

#chain length
B = 50000
z <- Metrop.func(s1 = s1, s2 = s2, B = 50000, a.init, b.init, data = data.bio)
z$accep.rate
## [1] 0.59914

Thus, it appears we need to change our tuning parameters. We will perform a search. Note, I will not run the following code in the report due to its lengthy run-time.

delta1 <- seq(0.5, 8, by = 0.5)
delta2 <- seq(0.5, 8, by = 0.5)
delta.mat <- cbind(delta1, delta2) #grid of matched (s1, s2) pairs to try
acceptance.ratios <- rep(NA_real_, nrow(delta.mat))
for(i in 1:nrow(delta.mat)) {
  chain = Metrop.func(delta.mat[i,1], delta.mat[i,2], B, a.init, b.init, data.bio)
  acceptance.ratios[i] <- chain$accep.rate
}

It turns out that when \(s_1 = 8\) and \(s_2 = 8\) we see an acceptance rate of approximately 26%. This will suffice for our analysis.

z = Metrop.func(8, 8, B, a.init, b.init, data.bio)

Let’s do some diagnostics. We will approximate a burn-in value using the running mean for alpha and beta samples.

params = cbind(z$beta.samples, z$alpha.samples)
temp = apply(params, 2, cumsum)
seq.div <- seq(from = 1, to = (B+1), by = 1)
means.beta = temp[,1]/seq.div
means.alpha = temp[,2]/seq.div
par(mfrow = c(1,2))
plot(1:(B+1), means.beta, type = 'l', main = 'Running Mean of Beta')
plot(1:(B+1), means.alpha, type = 'l', main = 'Running Mean of Alpha')

It appears that a burn-in value of approximately 10,000 will be sufficient.

burnin = 10000
#Autocorrelation plots# 
par(mfrow = c(1,2))
acf(z$beta.samples[-c(1:burnin)], main = 'Beta Samples after Burn-in')
acf(z$alpha.samples[-c(1:burnin)], main = 'Alpha Samples after Burn-in')

It appears our \(\alpha\) and \(\beta\) samples show noticeable autocorrelation. Let’s see at what lag it drops to roughly 0.

autocorr(as.mcmc(z$alpha.samples[-c(1:burnin)]), lags = c(10,20,30,40,50, 60, 70, 80), relative = FALSE)
## , , 1
## 
##               [,1]
## Lag 10 0.303389386
## Lag 20 0.191380900
## Lag 30 0.142024834
## Lag 40 0.091999009
## Lag 50 0.056161230
## Lag 60 0.030160135
## Lag 70 0.002534454
## Lag 80 0.007546357

It appears that a thinning rate of 70 will work, since the autocorrelation has essentially dropped to 0 by lag 70.
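As an optional cross-check (coda is already loaded above), coda’s effectiveSize() estimates how many approximately independent draws the post-burn-in chain is worth; a minimal sketch:

#optional cross-check: estimated effective sample size of the post-burn-in chains
effectiveSize(as.mcmc(cbind(alpha = z$alpha.samples[-c(1:burnin)],
                            beta  = z$beta.samples[-c(1:burnin)])))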

Now let’s thin our chain. We first need to calculate the chain length \(B\) needed for an effective sample of 1,000 approximately independent draws. Since we will be taking every 70th value, we will need \(70 \times 1000 = 70{,}000\) post-burn-in iterations. In addition, since we burn \(10{,}000\), we will need \(70{,}000 + 10{,}000 = 80{,}000\) total iterations, so \(B = 80{,}000\).

B = 80000
burnin = 10000
thin = 70
Eff.samp = floor((B - burnin)/thin) 
Eff.samp
## [1] 1000

Let’s re-run our algorithm with the proper \(B\) size of \(80,000\).

B = 80000
#reset burnin
burnin = 10000
z = Metrop.func(8, 8, B, a.init, b.init, data.bio)
alpha.thin = beta.thin = numeric()
for(i in 1:Eff.samp) {
  alpha.thin[i] = z$alpha.samples[(burnin+1+(thin*(i-1)))]
  beta.thin[i] = z$beta.samples[(burnin + 1 + (thin*(i-1)))]
}
alpha.1000 = alpha.thin[1:1000]
beta.1000 = beta.thin[1:1000]

Let’s plot our posterior samples. We can visually see the MAP estimates and 95% credible intervals for each. We will also present them numerically.

par(mfrow = c(1,2))
plot(density(alpha.1000), main = 'P(alpha|data)')
z.a = density(alpha.1000)
alpha.1000.MAP <- z.a$x[which.max(z.a$y)]
abline(v = alpha.1000.MAP, col = 'red')
alpha.1000.lower <- quantile(alpha.1000, 0.025)
alpha.1000.upper <- quantile(alpha.1000, 0.975)
abline(v = alpha.1000.lower, col = 'blue')
abline(v = alpha.1000.upper, col = 'blue')

plot(density(beta.1000), main = 'P(beta|data)')
z.b = density(beta.1000)
beta.1000.MAP <- z.b$x[which.max(z.b$y)]
abline(v = beta.1000.MAP, col = 'red')
beta.1000.lower <- quantile(beta.1000, 0.025)
beta.1000.upper <- quantile(beta.1000, 0.975)
abline(v = beta.1000.lower, col = 'blue')
abline(v = beta.1000.upper, col = 'blue')

We can visually see the MAP estimates and the 95% credible intervals for each of \(\alpha|data\) and \(\beta|data\). Numerically they are presented below in the console output.

##   parameters     lowers      maps   uppers
## 1      alpha -0.6377425 0.6322571  2.86299
## 2       beta  3.6592066 8.5781361 21.76323

After plotting the posteriors, we derive the posterior distribution of LD50 on the original (g/mL) scale. Recall that in our logistic model the covariate \(x_i\) is the dose level, recorded on the log(g/mL) scale.

\[ \text{logit} (\theta_i) = \alpha + \beta x_i \] If the probability of death is 50%, then \(\theta_i = 0.50\), and we have:

\[ \text{logit} (0.50) = 0 = \alpha + \beta x_i \implies x_i = -\frac{\alpha}{\beta} \]

In other words, the dose that gives a 50% probability of death is \(x_i = -\frac{\alpha}{\beta}\). We will denote this dose level by LD50.
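Since the dose covariate \(x_i\) in the bioassay data is recorded as the log dose (log g/mL), the LD50 on the original g/mL scale is obtained by exponentiating, which is what the code below does:

\[ \text{LD50}_{\log\text{ scale}} = -\frac{\alpha}{\beta}, \qquad \text{LD50}_{\text{g/mL}} = \exp\left(-\frac{\alpha}{\beta}\right) \]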

LD50orig.sorted <- sort(exp(-alpha.1000/beta.1000))
LD50.orig.Lower95CI <- LD50orig.sorted[.025*length(LD50orig.sorted)]
LD50.orig.Upper95CI <- LD50orig.sorted[.975*length(LD50orig.sorted)]

hist(LD50orig.sorted, freq = T, xlim = c(0.5, 2), breaks = 100, main = 'Lethal Dose, 50% (g/mL) - Posterior Sample', xlab = 'LD50 (g/mL)')
arrows(x0 = LD50.orig.Lower95CI, y0 = 2, x1 = LD50.orig.Upper95CI, y1 = 2, code = 3, angle = 90, length = 0.5, col = 'red', lwd = 3)

Thus, we can say the following:

## [1] "There is a 95% chance (via the central posterior credible interval) that an animal has a 50% chance of dying if it receives between  0.792  and  1.112  g/mL of drugs for treating this tumor."

Chapter 5 Textbook Problem 4.)

###4a.) Are \(\theta_1,...,\theta_{2J}\) exchangeable under this prior distribution?

Yes. In order to be exchangeable we must show that \(p(\theta_1,...,\theta_{2J}) = p(\theta_{\pi_1}, ..., \theta_{\pi_{2J}})\) for any permutation \(\pi\).

The prior states only that exactly \(J\) of the components are drawn from \(N(1,1)\) and the other \(J\) from \(N(-1,1)\), without specifying which are which. By independence within each assignment, the joint density is the average over all \(\binom{2J}{J}\) possible assignments of the labels:

\[ p(\theta) = \binom{2J}{J}^{-1} \sum_{A:\,|A| = J} \left(\prod_{j \in A} dNormal(\theta_j, \mu = 1, \sigma = 1)\right)\left(\prod_{j \notin A} dNormal(\theta_j, \mu = -1, \sigma = 1)\right) \]

This implies exchangeability: permuting the \(\theta_j\) merely permutes which index sets \(A\) appear in the sum, and since the sum already runs over every size-\(J\) subset, the value of \(p(\theta)\) is unchanged. In other words, the joint is always a symmetric average of products of \(J\) \(Normal(1,1)\) and \(J\) \(Normal(-1,1)\) densities, so the order in which we label the components makes no difference.
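As a quick numerical sanity check (a minimal sketch with a small \(J = 2\) and arbitrary hypothetical \(\theta\) values; the names J.check, prior.dens, and theta.check are ours), we can evaluate this averaged density at a point and at a permutation of that point and confirm the two values agree:

#numerical check of exchangeability for J = 2: the prior density, averaged over all
#choose(2J, J) assignments of which J components come from N(1,1), is permutation invariant
J.check <- 2
prior.dens <- function(theta) {
  groups <- combn(2*J.check, J.check)  #every candidate index set for the N(1,1) group
  mean(apply(groups, 2, function(idx) {
    prod(dnorm(theta[idx], mean = 1, sd = 1)) * prod(dnorm(theta[-idx], mean = -1, sd = 1))
  }))
}
theta.check <- c(0.3, -1.2, 2.0, 0.7)
prior.dens(theta.check)
prior.dens(sample(theta.check))  #identical value under any permutation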

###4b.) Show that this distribution cannot be written as a mixture of independent and identically distributed components.

We must show that if we condition on the information of one component \(j\) that the probability distribution of another component \(i\) would then change.

Without loss of generality we will show that \(p(\theta_1|\theta_2) \neq p(\theta_1)\).

Marginally and without any conditioning we know that

\[ \theta_1 \sim N(1,1) \text{ with probability } \frac{J}{2J} = 0.5\\ \theta_1 \sim N(-1,1) \text{ with probability } \frac{J}{2J} = 0.5 \]

Suppose we learn that the event \(\theta_2 > 1\) has occurred. This makes it much more likely that \(\theta_2\) came from the \(N(1,1)\) prior component.

Since there are a finite number \(2J\) of components, exactly \(J\) of which come from \(N(1,1)\), learning that \(\theta_2\) (most likely) belongs to the \(N(1,1)\) group leaves only \(J-1\) of the remaining \(2J-1\) components to come from \(N(1,1)\). In summary, approximately,

\[ \theta_1|\theta_2 > 1 \sim N(1,1) \text{ with probability } \approx \frac{J-1}{2J-1}\\ \theta_1|\theta_2 > 1 \sim N(-1,1) \text{ with probability } \approx \frac{J}{2J-1} \]

In summary, this distribution cannot be written as a mixture of iid components. If it could, the \(\theta_j\) would be iid conditional on some mixing parameter, which forces \(Cov(\theta_1, \theta_2) = Var(E[\theta_j|\text{mixing parameter}]) \geq 0\). Here, however, learning that \(\theta_2\) came from the \(N(1,1)\) group makes it less likely that \(\theta_1\) did, so the components are negatively dependent, which is impossible for a mixture of iid components.

###4c.) Why can we not simply take the limit \(J \to \infty\) and get a counterexample to De Finetti’s Theorem?

As we let \(J\) go to \(\infty\), the conditional probabilities above both approach \(\frac{1}{2}\), so conditioning on \(\theta_2\) no longer changes the distribution of \(\theta_1\). This does not contradict De Finetti’s Theorem, which states that an infinite exchangeable sequence of random variables is a mixture of iid components (conditionally iid given some latent distribution). Our construction is exchangeable only for each finite \(J\); it is not a single infinite exchangeable sequence, and as \(J \to \infty\) the dependence between components disappears, exactly as the theorem would require.

Chapter 5 Textbook Problem 11.)

###11a.) Write the joint posterior density \(p(\theta, \mu, \tau|y)\)

Our model:

\[ \text{prior: } p(\mu, \tau^2) \propto \tau^{-2}\\ \text{prior: } logit(\theta_j) \sim N(\mu, \tau^2)\\ \text{data: } y_j \sim Bin(n_j, \theta_j) \]

Then we can write the posterior as a product of the Likelihood and prior.

Before we proceed, we note that to write the posterior \(p(\theta, \mu, \tau|y)\) we need the prior density of \(\theta_j\), which we do not immediately have. What we do have is the prior on the transformed scale: \(logit(\theta_j) \sim N(\mu, \tau^2)\).

Therefore, we will need to apply a transformation. Thankfully \(logit()\) is a one-to-one function, so the density of \(\theta_j\) can be obtained by the change-of-variables theorem. Recall that \(logit(\theta) = \ln(\frac{\theta}{1-\theta})\). Then we have the following:

\[ \begin{align} f_{\theta_j}(\theta_j) &= f_{logit(\theta_j)}\left(\ln\left(\frac{\theta_j}{1-\theta_j}\right)\right)\left|\frac{d}{d\theta_j}\ln\left(\frac{\theta_j}{1-\theta_j}\right)\right|\\ &= dNormal\left(\ln\left(\frac{\theta_j}{1-\theta_j}\right), \mu, \tau^2\right)\left[\frac{1}{\theta_j} + \frac{1}{1-\theta_j}\right]\\ &= \frac{1}{\tau\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\frac{logit(\theta_j) - \mu}{\tau}\right)^2\right)\frac{1}{\theta_j(1-\theta_j)}\\ &\propto \tau^{-1}\theta_j^{-1}(1-\theta_j)^{-1}\exp\left(-\frac{1}{2}\left(\frac{logit(\theta_j) - \mu}{\tau}\right)^2\right) \end{align} \]

Now we are prepared to write the posterior as the product of the likelihood and the prior.

Let’s quickly summarize our model again:

\[ \text{prior: } p(\mu, \tau^2) \propto \tau^{-2}\\ \text{prior: } logit(\theta_j) \sim N(\mu, \tau^2) \implies p(\theta_j) \propto \tau^{-1}\theta_j^{-1}(1-\theta_j)^{-1}\exp(-\frac{1}{2}(\frac{logit(\theta_j) - \mu}{\tau})^2)\\ \text{data: } y_j \sim Bin(n_j, \theta_j) \]

Thus we have:

\[ p(\theta, \mu, \tau|y) \propto (\prod_{j=1}^J \theta_j^{y_j} (1-\theta_j)^{n_j - y_j})[\prod_{j=1}^J \tau^{-1}\theta_j^{-1}(1-\theta_j)^{-1}\exp(-\frac{1}{2}(\frac{logit(\theta_j) - \mu}{\tau})^2)] \tau^{-2} \]
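To make this expression concrete, here is a minimal sketch of the corresponding unnormalized log posterior; the function name log.joint.post is ours and the arguments theta, mu, tau, y, and n are assumed inputs (no particular data set is implied):

#hedged sketch: unnormalized log p(theta, mu, tau | y) from the expression above;
#theta is a vector in (0,1), y and n are the binomial counts and trials, mu and tau are scalars
log.joint.post <- function(theta, mu, tau, y, n) {
  logit <- function(p) log(p/(1 - p))
  sum(y*log(theta) + (n - y)*log(1 - theta)) +            #binomial log-likelihood (up to constants)
    sum(-log(tau) - log(theta) - log(1 - theta) -
          0.5*((logit(theta) - mu)/tau)^2) +              #transformed Normal prior on each theta_j
    (-2*log(tau))                                         #hyperprior p(mu, tau^2) propto tau^-2
}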

###11b.) Show that the integral (5.4) has no closed-form expression.

The integrand couples the \(logit(\theta_j)\) inside the Normal kernel with the binomial terms \(\theta_j^{y_j}(1-\theta_j)^{n_j - y_j}\); this does not match any standard density I recognize, so the integral has no closed-form expression even though the \(J\) integrals factor and can be examined one at a time.

In other words we seek \(p(\mu, \tau|y)\), but since we cannot integrate the joint above over \(\theta = (\theta_1,...,\theta_J)\) analytically, we will not find a closed-form solution.

###11c.) Why is expression (5.5) no help for this problem?

The expression (5.5) is given below:

\[ p(\phi|y) = \frac{p(\theta, \phi|y)}{p(\theta|\phi, y)} \]

To use this expression we would need to evaluate the denominator \(p(\theta|\phi, y)\) exactly, including its normalizing constant, which depends on \(\phi = (\mu, \tau)\). In the conjugate settings where (5.5) is useful (for example, the Rat Tumor example), that conditional is a standard density with a known normalizing constant. Here \(p(\theta_j|\mu, \tau, y_j)\) is not a standard distribution (this is essentially the difficulty from problem 11b.), so we only know it up to proportionality, and (5.5) does not give us \(p(\mu, \tau|y)\).

Chapter 5 Textbook Problem 13.)

Chapter 5 Problem 13a.)

Here are the model specifications.

Let \(y_1,...,y_{10}\) denote the number of bicycles at ten different locations.

Let \(n_1,...,n_{10}\) be the total vehicles including bicycles at location 1-10.

Then \(y_j|\theta_j \sim Binomial(n_j, \theta_j)\) for \(j = 1,2,...,10\)

\[ \text{data: } y_j|\theta_j \sim Binomial(n_j, \theta_j)\\ \text{prior: } \theta_j \sim Beta(\alpha, \beta)\\ \text{Hyperparameter Prior: } p(\alpha, \beta) \propto (\alpha + \beta)^{-\frac{5}{2}} \]

We seek the joint posterior distribution, which per usual will be Likelihood * prior.

\[ \begin{align} p(\theta, \alpha, \beta|y) &\propto p(y|\theta, \alpha, \beta)*p(\theta, \alpha, \beta)\\ &\propto p(y|\theta, \alpha, \beta)*p(\theta|\alpha, \beta)*p(\alpha, \beta)\\ &\propto (\prod_{j=1}^{10} \binom{n_j}{y_j} \theta_j^{y_j}(1-\theta_j)^{n_j - y_j})(\prod_{j=1}^{10} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta_j^{\alpha - 1}(1-\theta_j)^{\beta - 1})(\alpha + \beta)^{-\frac{5}{2}}\\ &\propto (\alpha + \beta)^{-\frac{5}{2}} (\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)})^{10}\prod_{j=1}^{10} \theta_j^{\alpha + y_j - 1}(1-\theta_j)^{\beta + n_j - y_j - 1} \end{align} \]

Chapter 5 Problem 13b.) Math Work for Marginal Posterior Densities and Overall view of the Sampling Algorithm

Now we seek the marginal posterior density of the hyperparameters. In other words we seek \(p(\alpha, \beta|y)\).

To compute this we use the conditional-probability identity from expression (5.5) of the textbook (quoted in problem 11c. above).

\[ p(\alpha, \beta|y) = \frac{p(\alpha, \beta, \theta|y)}{p(\theta|y, \alpha, \beta)} \]

The numerator is the joint posterior distribution, which we know from problem 13a.) up to a constant of proportionality.

The denominator can be computed by an application of Bayes’ rule. We will focus on one component \(j\), since the \(\theta_j\) are conditionally independent given \((\alpha, \beta, y)\) and the joint is just a product over the 10 components. In the second line we drop \(p(\alpha, \beta)\) since it does not depend on \(\theta_j\); furthermore, we note that this is simply the conjugate Beta-Binomial update.

\[ \begin{align} p(\theta_j|\alpha, \beta, y_j) &\propto p(y_j|\theta_j)p(\theta_j|\alpha, \beta)p(\alpha, \beta)\\ &\propto Beta(\alpha + y_j, \beta + n_j - y_j)\\ & = \frac{\Gamma(\alpha + \beta + n_j)}{\Gamma(\alpha + y_j)\Gamma(\beta + n_j - y_j)} \theta_j^{\alpha + y_j - 1}(1-\theta_j)^{\beta + n_j - y_j - 1} \end{align} \]

Combining the numerator, which is the joint posterior distribution from part (a.), and the denominator computed above, we arrive at

\[ \begin{align} p(\alpha, \beta|y) &= \frac{p(\alpha, \beta, \theta|y)}{p(\theta|y, \alpha, \beta)}\\ &\propto p(\alpha, \beta)\prod_{j=1}^{10}\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \frac{\Gamma(\alpha + y_j)\Gamma(\beta + n_j - y_j)}{\Gamma(\alpha + \beta + n_j)}\\ &\propto (\alpha + \beta)^{-\frac{5}{2}}(\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)})^{10}\prod_{j=1}^{10}\frac{\Gamma(\alpha + y_j)\Gamma(\beta + n_j - y_j)}{\Gamma(\alpha + \beta + n_j)} \end{align} \]

Just like the Rat Tumor problem, this is not a standard distribution. Thus, in order to sample from it we would have to grid sample or use Metropolis-Hastings.
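As a sanity check on this expression, the unnormalized log marginal posterior can be evaluated directly with lgamma for numerical stability (a sketch; the function name and arguments are ours, and it assumes the y and n vectors defined in the data block below):

#hedged sketch: unnormalized log p(alpha, beta | y) from the expression above
log.marg.post <- function(alpha, beta, y, n) {
  -2.5*log(alpha + beta) +
    sum(lgamma(alpha + beta) - lgamma(alpha) - lgamma(beta) +
          lgamma(alpha + y) + lgamma(beta + n - y) - lgamma(alpha + beta + n))
}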

We go the MH route (embedded in a Gibbs sampler that also updates the \(\theta_j\)). To do so we need the full conditional of the hyperparameters, i.e., we need \(p(\alpha, \beta|\theta, y)\):

\[ \begin{align} p(\alpha, \beta|\theta, y) &\propto p(\theta|\alpha, \beta)*p(\alpha, \beta)\\ &\propto \prod_{j=1}^{10} Beta(\theta_j| \alpha, \beta)(\alpha + \beta)^{-\frac{5}{2}} \end{align} \]

Thus, in our MH step (recall we are trying to sample the hyperparameters \(\alpha\) and \(\beta\)) we will compute the ratio \(r_b\) using this full conditional.

We will use a bivariate Normal random-walk proposal on \((\log\alpha, \log\beta)\) (the same approach as the Rat Tumor example from Lecture); because the proposal is on the log scale, a Jacobian factor \(\frac{1}{\alpha\beta}\) appears when it is written on the \((\alpha, \beta)\) scale.

We have, using the \(\log()\) of the ratio,

\[ \log(r_{b}) = \left(\sum_{j=1}^{10} \log dbeta(\theta_j^{b}, \alpha^*, \beta^*)\right) -\frac{5}{2}\log(\alpha^* + \beta^*) + \log(\alpha^* \beta^*) - \left(\sum_{j=1}^{10}\log dbeta(\theta_j^b, \alpha^{(b-1)}, \beta^{(b-1)})\right) + \frac{5}{2}\log(\alpha^{(b-1)} + \beta^{(b-1)}) - \log(\alpha^{(b-1)} \beta^{(b-1)}) \]

And where \(\theta_j^b\) comes from the Gibbs step, since \((\theta_j^b|\alpha^{(b-1)}, \beta^{(b-1)}, y_j) \sim Beta(\alpha^{(b-1)} + y_j, \beta^{(b-1)} + n_j - y_j)\).
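To see where the \(\log(\alpha^* \beta^*)\) and \(\log(\alpha^{(b-1)} \beta^{(b-1)})\) terms come from: the random walk is Normal on \(\phi = (\log\alpha, \log\beta)\), and converting that proposal density back to the \((\alpha, \beta)\) scale introduces a Jacobian factor \(\frac{1}{\alpha\beta}\). The Normal kernels cancel by symmetry, leaving only the ratio of Jacobians:

\[ J(\alpha^*, \beta^*|\alpha, \beta) = \frac{N_2(\log\alpha^*, \log\beta^*\,;\, \log\alpha, \log\beta, S)}{\alpha^* \beta^*} \implies \frac{J(\alpha, \beta|\alpha^*, \beta^*)}{J(\alpha^*, \beta^*|\alpha, \beta)} = \frac{\alpha^* \beta^*}{\alpha \beta} \]

This is exactly what the Proposal() helper in the code below contributes.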

Chapter 5 Problem 13b.) Coding Implementation

#--- Predefine Function ##

Prior <- function(alpha, beta) {
  (alpha + beta)^(-5/2) #unnormalized hyperprior p(alpha, beta)
}

LogLikelihood <- function(theta, alpha, beta) {
  sum(log(dbeta(theta, alpha, beta)))
}


Proposal <- function(alpha,beta) {
  1/(alpha * beta) #Jacobian factor from proposing on the (log alpha, log beta) scale
}

rProposal <- function(n, mean, cov) {
  rmvnorm(n, mean, cov)
}

#Data
y <- c(16, 9, 10, 13, 19, 20, 18, 17, 35, 55)
n <- c(74, 99, 58, 70, 122, 77, 104, 129, 308, 119)

accept = 0
B = 100000
alpha.post.MH = beta.post.MH = numeric()
theta.post2 <- matrix(0, ncol = B, nrow = 10)

S <- matrix(c(0.1229, 0.1193, 0.1193, 0.1263), ncol = 2, nrow = 2)

#Decide initial values for alpha/beta
alpha.c = 4
beta.c = 15
theta.c = numeric(length = 10)

for(b in 1:B) {
  
  #Gibbs step for thetas
  for(j in 1:10) {
    shp1 = alpha.c + y[j]
    shp2 = beta.c + n[j] - y[j]
    theta.c[j] = rbeta(1, shp1, shp2)
  }
  
  #---- M-H Step for alpha and beta ----
  phi.star = rProposal(1, c(log(alpha.c), log(beta.c)), 1*S)
  
  alpha.star = exp(phi.star[1])
  beta.star = exp(phi.star[2])
  
  r = exp(
    LogLikelihood(theta.c, alpha.star, beta.star)
    + log(Prior(alpha.star, beta.star))
    + log(Proposal(alpha.c, beta.c))
    - LogLikelihood(theta.c, alpha.c, beta.c)
    - log(Prior(alpha.c, beta.c))
    - log(Proposal(alpha.star, beta.star))
  )
  
  if(runif(1) < min(1,r)) {
    alpha.c = alpha.star
    beta.c = beta.star
    accept = accept + 1
  }
  
  #-- drop off samples to vector/matrix
  
  alpha.post.MH[b] = alpha.c
  beta.post.MH[b] = beta.c
  theta.post2[,b] = theta.c
  
  #Prints every 10000th iteration
  if(b %% 10000 == 0) {
    cat(paste0('iteration: ', b, '\n'))
  }
  
} #end of main MCMC loop
#the overall M-H acceptance rate for (alpha, beta) can be computed as accept/B
## iteration: 10000
## iteration: 20000
## iteration: 30000
## iteration: 40000
## iteration: 50000
## iteration: 60000
## iteration: 70000
## iteration: 80000
## iteration: 90000
## iteration: 100000

Let’s look at the running means to choose a burn-in value.

params = cbind(beta.post.MH, alpha.post.MH)
temp = apply(params, 2, cumsum)
seq.div <- seq(from = 1, to = B, by = 1)
means.beta = temp[,1]/seq.div
means.alpha = temp[,2]/seq.div
par(mfrow = c(1,2))
plot(1:B, means.beta, type = 'l', main = 'Running Mean of Beta')
plot(1:B, means.alpha, type = 'l', main = 'Running Mean of Alpha')

It looks like a burn-in value of 10,000 is needed.

burnin = 10000 #burn in 10000
#Autocorrelation plots# 
par(mfrow = c(1,2))
acf(beta.post.MH[-c(1:burnin)], main = 'Beta Samples after Burn-in')
acf(alpha.post.MH[-c(1:burnin)], main = 'Alpha Samples after Burn-in')

It looks like a thinning rate of 50 will suffice.

B = 100000
burnin = 10000
thin = 50
Eff.samp = floor((B - burnin)/thin) 
Eff.samp
## [1] 1800
alpha.thin = beta.thin = numeric()
theta.thin = matrix(data = NA, nrow = 10, ncol = Eff.samp)
for(i in 1:Eff.samp) {
  alpha.thin[i] = alpha.post.MH[(burnin+1+(thin*(i-1)))]
  beta.thin[i] = beta.post.MH[(burnin + 1 + (thin*(i-1)))]
  theta.thin[,i] = theta.post2[,(burnin+1+(thin*(i-1)))]
}

Before comparing the \(\theta_j\) let’s plot the \(\alpha\) and \(\beta\) samples.

alpha.2000 = alpha.thin #note: despite the name, these hold Eff.samp = 1800 thinned draws
beta.2000 = beta.thin
par(mfrow = c(1,2))
plot(density(alpha.2000), main = 'P(alpha|data)')
z.a = density(alpha.2000)
alpha.2000.MAP <- z.a$x[which.max(z.a$y)]
abline(v = alpha.2000.MAP, col = 'red')
alpha.2000.lower <- quantile(alpha.2000, 0.025)
alpha.2000.upper <- quantile(alpha.2000, 0.975)
abline(v = alpha.2000.lower, col = 'blue')
abline(v = alpha.2000.upper, col = 'blue')

plot(density(beta.2000), main = 'P(beta|data)')
z.b = density(beta.2000)
beta.2000.MAP <- z.b$x[which.max(z.b$y)]
abline(v = beta.2000.MAP, col = 'red')
beta.2000.lower <- quantile(beta.2000, 0.025)
beta.2000.upper <- quantile(beta.2000, 0.975)
abline(v = beta.2000.lower, col = 'blue')
abline(v = beta.2000.upper, col = 'blue')

Chapter 5 Problem 13c.) Compare the theta’s

CI = matrix(0, ncol = 3, nrow = 10)
for(i in 1:10) {
  CI[i, ] = quantile(theta.thin[i,], probs = c(0.025, 0.5, 0.975))
}

plot(y/n, y/n, type = 'l')
points(y/n, CI[,2], pch = 19, col = 3)
for(j in 1:10) {
  points(c(y[j]/n[j], y[j]/n[j]), c(CI[j,1], CI[j,3]))
}

As we can see, the posterior distributions of the parameters \(\theta_j\) are consistent with the observed data at each location \(j\): the posterior medians (green points) are almost identical to the raw proportions (number of bicycles)/(total number of vehicles), and the unfilled points mark the 2.5% and 97.5% posterior quantiles.

Chapter 5 Problem 13d.) Give a 95% posterior interval for the average underlying proportion of traffic that is bicycles.

Recall that \(\theta \sim Beta(\alpha, \beta)\). Since \(E[Beta(\alpha, \beta)] = \frac{\alpha}{\alpha + \beta}\) we have \(E[\theta] = \frac{\alpha}{\alpha + \beta}\).

props <- alpha.thin/(alpha.thin + beta.thin)
vals = as.numeric(quantile(props, c(0.025, 0.975)))
print(paste('The average underlying proportion of traffic that is bicycles, with probability 95%, is between ', round(vals[1], 3), ' and ' , round(vals[2], 3)))
## [1] "The average underlying proportion of traffic that is bicycles, with probability 95%, is between  0.145  and  0.287"

Chapter 5 Problem 13e.) Posterior Predictive

We seek the posterior predictive distribution \(p(y_{new}|y)\).

We would like to sample from this distribution. In order to do this we need to first generate \(\theta_j \sim Beta(\alpha, \beta)\) using our \(\alpha\) and \(\beta\) samples.

Then, using these values, we can sample from a Binomial distribution with n = 100 (the 100 vehicles in the town) since \((y_j|\theta_j, n_j) \sim Binomial(n_j = 100, \theta_j)\).

We will produce a posterior predictive distribution of \(M = 2000\) samples.

M = 2000
theta.samples <- rbeta(M, alpha.thin, beta.thin)
y.new <- rbinom(M, 100, theta.samples)
hist(y.new)
abline(v = quantile(y.new, 0.025), col = 'blue')
abline(v = quantile(y.new, 0.975), col = 'blue')

vals = as.numeric(quantile(y.new, c(0.025, 0.975)))

Our posterior predictive distribution can be seen above. The numerical results are printed in the statement below:

## [1] "With probability 95%, between  3  and  52  of the 100 vehicles will be bicycles."

Chapter 5 Problem 13f.) Was the beta distribution for the \(\theta_j\) reasonable?

Yes, based on the plot above from part c.) I believe the beta distribution was reasonable.

Additional Problem

To do MCMC using a Gibbs step for \(\beta\) we need to specify the distribution \(\beta|\alpha, data\)

Our model for \(\beta\) is the following:

\[ \text{prior}: \beta \sim Gamma(\alpha_{\beta} = 1, \beta_{\beta} = 0.8)\\ \text{Data}: y_i \sim Gamma(\alpha, \beta) \]

It follows from Homework 1 that our posterior distribution of \(\beta|\alpha, data\) is

\[ \text{posterior: } (\beta|\alpha, data) \sim Gamma(\alpha_{\beta} + n\alpha, \beta_{\beta} + \sum_{i=1}^{n}y_i) \]

Thus, at each step of our \(B\) iterations, we will sample \(\beta|\alpha, data\) from a \(Gamma(\alpha_{\beta} + n\alpha^{b-1}, \beta_{\beta} + \sum_{i=1}^{n}y_i)\), where \(\alpha^{b-1}\) is the \((b-1)\) iteration of the Metropolis Hastings Algorithm for \(\alpha\).
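A minimal sketch of this single Gibbs draw (placeholder values are used here; the prior hyperparameters and data vector y match the code further below, and alpha.prev stands in for \(\alpha_{MH}^{(b-1)}\)):

#hedged sketch of one Gibbs draw for beta | alpha, data
alpha.b.prior <- 1; beta.b.prior <- 0.8     #prior hyperparameters for beta
y <- c(1.8, 1.6, 1.4, 1.6, 1.4, 1.5, 1.2)   #data (same values as in the code below)
alpha.prev <- 4                             #placeholder for alpha^{(b-1)} (the prior mean)
beta.draw <- rgamma(1,
                    shape = alpha.b.prior + length(y)*alpha.prev,
                    rate  = beta.b.prior + sum(y))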

On that note, let us discuss our MH-Algorithm for \(\alpha\).

Since \(\alpha\) is the shape parameter of the data \(y_i \sim Gamma(\alpha, \beta)\), \(\alpha\) must be greater than 0.

Thus, we choose our proposal density as:

\[ Gamma(\text{shape = } \alpha_{MH}^{(b-1)} \cdot s, \text{ rate = } s) \]

where \(s\) is a tuning parameter. We chose a Gamma proposal because \(\alpha\), as the shape parameter of the Gamma data model, must be strictly greater than 0, and a Gamma proposal automatically respects that constraint.
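With this parameterization the proposal is centered at the current value, and the rate \(s\) controls the spread inversely, which is worth keeping in mind when we tune \(s\) later:

\[ E[\alpha^*|\alpha_{MH}^{(b-1)}] = \frac{\alpha_{MH}^{(b-1)} s}{s} = \alpha_{MH}^{(b-1)}, \qquad Var[\alpha^*|\alpha_{MH}^{(b-1)}] = \frac{\alpha_{MH}^{(b-1)} s}{s^2} = \frac{\alpha_{MH}^{(b-1)}}{s} \]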

With our proposal density in mind we now define our ratio \(r_b\) at step \(b\) of our iterations.

\[ r_b = \frac{p(y|\alpha^*, \beta) * p(\alpha^*|\beta)/J_b(\alpha^*|\alpha^{(b-1)})}{p(y|\alpha^{(b-1)}, \beta) * p(\alpha^{(b-1)}|\beta)/J_b(\alpha^{b-1}|\alpha^*)} \]

Which in terms of our problem we have:

\[ r_b = \frac{[(\prod_{i=1}^n dGamma(y_i, \text{shape = }\alpha^{*^b}, \text{rate = } \beta_{Gibbs}^{(b)})) (dgamma(\alpha^{*^b}, \text{shape = }\alpha_{\alpha}, \text{rate} = \beta_{\alpha})]/ dgamma(\alpha^{*^b}, \text{shape =} \alpha_{MH}^{(b-1)}*s, \text{rate = } s)} {[(\prod_{i=1}^n dGamma(y_i, \text{shape = }\alpha_{MH}^{(b-1)}, \text{rate = } \beta_{Gibbs}^{(b)})) (dgamma(\alpha_{MH}^{(b-1)}, \text{shape = }\alpha_{\alpha}, \text{rate} = \beta_{\alpha})]/ dgamma(\alpha_{MH}^{(b-1)}, \text{shape =} \alpha^{*^b}*s, \text{rate = } s)} \]

Additional Problem a.) Model Statement, Algorithm, and Proposal Distribution

In summary our model:

\[ \text{prior: } \alpha \sim Gamma(\alpha_{\alpha} = 2, \beta_{\alpha} = 0.5) \\ \text{prior}: \beta \sim Gamma(\alpha_{\beta} = 1, \beta_{\beta} = 0.8)\\ \text{Data}: y_i \sim Gamma(\alpha, \beta)\\ \text{Conditional for Gibbs: } (\beta|\alpha, data) \sim Gamma(\alpha_{\beta} + n\alpha, \beta_{\beta} + \sum_{i=1}^{n}y_i)\\ \text{Proposal Distribution for MH: } \alpha^* \sim Gamma(\text{shape = } \alpha * \text{s}, \text{rate = } s) \]

Our algorithm:

Step 1.) Initial values for \(\alpha_{MH}^{0}\) and \(\beta_{Gibbs}^{0}\)

For b = 1 to B:

Step 2.) Sample \(\beta^{b}\) using Gibbs sampling, i.e., set \(\beta^{b}\) equal to a randomly sampled value from \((\beta|\alpha, data) \sim Gamma(\alpha_{\beta} + n\alpha_{MH}^{(b-1)}, \beta_{\beta} + \sum_{i=1}^{n}y_i)\).

Step 3.) Sample \(\alpha^{*^b} \sim Gamma(\text{shape = } \alpha_{MH}^{(b-1)} \cdot s, \text{ rate = } s)\), where the rate \(s\) is a tuning parameter that primarily controls the acceptance rate. This is our proposed new \(\alpha_{MH}\) sample; it enters the ratio \(r_b\), and we decide whether to keep it in Step 5.)

Step 4.) Compute our ratio \(r_b\).

\[ r_b = \frac{[(\prod_{i=1}^n dGamma(y_i, \text{shape = }\alpha^{*^b}, \text{rate = } \beta_{Gibbs}^{(b)})) (dgamma(\alpha^{*^b}, \text{shape = }\alpha_{\alpha}, \text{rate} = \beta_{\alpha})]/ dgamma(\alpha^{*^b}, \text{shape =} \alpha_{MH}^{(b-1)}*s, \text{rate = } s)} {[(\prod_{i=1}^n dGamma(y_i, \text{shape = }\alpha_{MH}^{(b-1)}, \text{rate = } \beta_{Gibbs}^{(b)})) (dgamma(\alpha_{MH}^{(b-1)}, \text{shape = }\alpha_{\alpha}, \text{rate} = \beta_{\alpha})]/ dgamma(\alpha_{MH}^{(b-1)}, \text{shape =} \alpha^{*^b}*s, \text{rate = } s)} \]

Step 5.) Sample a \(u \sim Unif(0,1)\).

If \(u < \text{min}(1,r_b)\), then we accept the proposal \(\alpha^*\). In other words, set \(\alpha_{MH}^{(b)} = \alpha^*\).

Else, we reject the proposal \(\alpha^*\) and set \(\alpha_{MH}^{(b)} = \alpha_{MH}^{(b-1)}\)

Initial Chain Values

Recall that:

\[ \text{prior: } \alpha \sim Gamma(\alpha_{\alpha} = 2, \beta_{\alpha} = 0.5) \\ \text{prior}: \beta \sim Gamma(\alpha_{\beta} = 1, \beta_{\beta} = 0.8)\\ \]

Our choice of the initial chain values will be the expected value of each of the prior distributions. In other words,

\[ \alpha_{MH}^{(0)} = E[Gamma(\alpha_{\alpha} = 2, \beta_{\alpha} = 0.5)] = \frac{\alpha_{\alpha}}{\beta_{\alpha}} = 4\\ \beta_{Gibbs}^{(0)} = E[Gamma(\alpha_{\beta} = 1, \beta_{\beta} = 0.8)] = \frac{\alpha_{\beta}}{\beta_{\beta}} = 1.25 \]

Code Implementation

We can see our code below.

Libraries Used

library(ggplot2)
library(coda)
#Model values and Data
alpha.a.prior <- 2
beta.a.prior <- 0.5

alpha.b.prior <- 1
beta.b.prior <- 0.8

y <- c(1.8, 1.6, 1.4, 1.6, 1.4, 1.5, 1.2)
n <- length(y)

#Initial Chain Values
##Prior initial values of the algorithm##
beta0 = alpha.b.prior/beta.b.prior #beta is the mean of its prior distribution
alpha0 = alpha.a.prior/beta.a.prior #alpha is the mean of its prior distribution
##Helper Functions

### Predefine functions for M-H Steps ###
Prior.alpha <- function(x, shape, rate) {
  dgamma(x, shape = shape, rate = rate)
}

Likelihood <- function(data, shape, rate) {
  prod(dgamma(data, shape = shape, rate = rate))
}

Proposal <- function(x, shape, rate) {
  dgamma(x, shape = shape, rate = rate)
}

rProposal <- function(n, shape, rate) {
  rgamma(n, shape = shape, rate = rate)
}
MH.func2 <- function(s1, B, alpha.chain.init, beta.chain.init) {
  ## initialize
  accept = 0
  alpha.post.MH <- numeric()
  beta.post.gibbs <- numeric()
  alpha.post.MH[1]  <- alpha.chain.init #initialize alpha
  beta.post.gibbs[1] <- beta.chain.init #initialize beta
  set.seed(10)
  for(b in 2:(B+1)) {
    
    #Gibbs Step for beta|alpha, data
    shape.param = alpha.b.prior + n*alpha.post.MH[(b-1)]
    rate.param = beta.b.prior + sum(y)
    beta.post.gibbs[(b)] <- rgamma(1, shape = shape.param, rate = rate.param)
    
    #M-H Step for alpha|beta, data
    alpha.star <- rProposal(1, shape = alpha.post.MH[(b-1)] * s1, rate = s1)
    
    r.num <- (Likelihood(y, shape = alpha.star, rate = beta.post.gibbs[(b)])) * 
      Prior.alpha(alpha.star, shape = alpha.a.prior, rate = beta.a.prior) / 
      Proposal(alpha.star, shape = alpha.post.MH[(b-1)] * s1, rate = s1)
    
    r.denom <- (Likelihood(y, shape = alpha.post.MH[(b-1)], rate = beta.post.gibbs[(b)])) * 
      Prior.alpha(alpha.post.MH[(b-1)], shape = alpha.a.prior, rate = beta.a.prior) /
      Proposal(alpha.post.MH[(b-1)], shape = alpha.star * s1, rate = s1)
    
    r =  r.num/r.denom
    
    if(runif(1) < min(1,r)) {
      alpha.post.MH[(b)] = alpha.star
      accept = accept + 1
    } else {
      alpha.post.MH[(b)] <- alpha.post.MH[(b-1)] 
    }
    
    #Print every 100th iteration
    #if(b %% 1000 == 0) {
     # cat(paste0("iteration: ", b, "\n"))
    #}
    
    
  } #end for loop
  return.list = list(accep.rate = accept/B, beta.samples = beta.post.gibbs, alpha.samples = alpha.post.MH)
  return(return.list)
}
B = 50000
y.05 <- MH.func2(s1 = 0.5, B = 50000, alpha0, beta0)
y.1 <- MH.func2(s1 = 1, B = 50000, alpha0, beta0)
y.15 <- MH.func2(s1 = 1.5, B = 50000, alpha0, beta0)
y.2 <- MH.func2(s1 = 2, B = 50000, alpha0, beta0)

accep.rates <- c(y.05$accep.rate, y.1$accep.rate, y.15$accep.rate, y.2$accep.rate)
accep.rates
## [1] 0.28426 0.38272 0.44714 0.49180

We can see the acceptance rates above. Conventional wisdom (i.e. people on the Internet and Professor Morey) has it that the optimal acceptance rate is between 20% and 30%. Too high an acceptance rate suggests the proposal steps are too small and the chain is not fully exploring the space of values; too low an acceptance rate means many proposals are wasted and the sampler is inefficient (large computational time).

Thus, we will analyze the chain whose acceptance rate falls closest to that range, which corresponds to the tuning parameter \(s = 0.5\) (acceptance rate \(\approx 0.28\)).

Additional Problem b.)

par(mfrow = c(1,2))
plot(y.05$beta.samples[1:5000], type = 'l', main = 'Trace Plot of Beta')
plot(y.05$alpha.samples[1:5000], type = 'l', main = 'Trace Plot of Alpha')

As expected, the \(\beta\) trace moves at every iteration, since \(\beta\) is updated with a Gibbs draw and is never rejected. Conversely, the \(\alpha\) trace shows flat stretches (especially in the earlier iterations) where proposed values were rejected and the chain stayed put.

We will now visualize the running means to get an idea of our burn-in value.

params = cbind(y.05$beta.samples, y.05$alpha.samples)
temp = apply(params, 2, cumsum)
seq.div <- seq(from = 1, to = (B+1), by = 1)
means.beta = temp[,1]/seq.div
means.alpha = temp[,2]/seq.div
par(mfrow = c(1,2))
plot(1:(B+1), means.beta, type = 'l', main = 'Running Mean of Beta')
plot(1:(B+1), means.alpha, type = 'l', main = 'Running Mean of Alpha')


It appears that a burn-in value of 7,500 will be sufficient.

Now with our burnin value of 7,500 we will produce the autocorrelation plots.

burnin = 7500 #burn in 7500
#Autocorrelation plots# 
par(mfrow = c(1,2))
acf(y.05$beta.samples[-c(1:burnin)], main = 'Beta Samples after Burn-in')
acf(y.05$alpha.samples[-c(1:burnin)], main = 'Alpha Samples after Burn-in')

It appears that our chain values are correlated out to a large lag. This makes sense: whenever a proposal is rejected the chain repeats its previous value, and with an acceptance rate near 28% many values are repeated, which drives up the autocorrelation.

We will need to find the appropriate lag-value, i.e., a lag-value at which the ACF plot goes to zero, particularly in our \(\alpha_{MH}\) chain.

autocorr(as.mcmc(y.05$alpha.samples[-c(1:burnin)]), lags = c(10,20,30,40,50, 60, 70, 80), relative = FALSE)
## , , 1
## 
##               [,1]
## Lag 10 0.645852537
## Lag 20 0.412378488
## Lag 30 0.252085362
## Lag 40 0.142832833
## Lag 50 0.073252454
## Lag 60 0.041571355
## Lag 70 0.019159851
## Lag 80 0.007109668

It appears that a thinning rate of 75 will work: the autocorrelation is already small by lag 70 and drops below our threshold of 0.01 by lag 80.

Now let’s thin our chain. We first need to calculate the chain length \(B\) needed for an effective sample of 2,000 approximately independent draws. Since we will be taking every 75th value, we will need \(75 \times 2000 = 150{,}000\) post-burn-in iterations. In addition, since we burn \(7{,}500\), we will need \(150{,}000 + 7{,}500 = 157{,}500\) total iterations, so \(B = 157{,}500\).

B = 157500
burnin = 7500
thin = 75
Eff.samp = floor((B - burnin)/thin) 
Eff.samp
## [1] 2000

In summary, with our tuning parameter \(s = 0.50\), we have burn-in = 7,500, thinning = 75, B = 157,500, and that will lead to our effective sample size of 2,000.

Checking a different tuning parameter \(s = 3\)

Out of curiosity, and to check whether the mixing changes substantially with the tuning parameter \(s\), we explore the following. To be safe we use a burn-in of 10,000.

B = 157500
y.3 = MH.func2(s1 = 3, B = B, alpha0, beta0)
y.3$accep.rate
## [1] 0.556927
burnin = 10000 #burnin to be safe
#Autocorrelation plots# 
par(mfrow = c(1,2))
acf(y.3$beta.samples[-c(1:burnin)], main = 'Beta Samples after Burn-in')
acf(y.3$alpha.samples[-c(1:burnin)], main = 'Alpha Samples after Burn-in')

autocorr(as.mcmc(y.3$alpha.samples[-c(1:burnin)]), lags = c(10,20,30,40,50, 60, 70, 80, 90, 100), relative = FALSE)
## , , 1
## 
##                [,1]
## Lag 10  0.628193750
## Lag 20  0.401008451
## Lag 30  0.253895451
## Lag 40  0.158850261
## Lag 50  0.101490332
## Lag 60  0.065312718
## Lag 70  0.036874351
## Lag 80  0.021518600
## Lag 90  0.013532691
## Lag 100 0.008444598

As expected, increasing the tuning rate parameter \(s\) to 3 shrinks the proposal spread (the proposal variance is \(\alpha_{MH}^{(b-1)}/s\)), which raised the acceptance rate to about 0.56; but this came at the cost of increased autocorrelation. Here is a comparison:

##   tuning accept.rate lags
## 1    0.5    0.284260   75
## 2    3.0    0.556927   95

Therefore, we will stick with the tuning parameter value \(s = 0.5\)

Additional Problem c.) Generate 2,000 posterior samples of \(\alpha\) and \(\beta\)

Let’s re-run our algorithm with the proper \(B\) value.

B = 157500
#reset burnin
burnin = 7500
y.05 = MH.func2(s1 = 0.5, B = B, alpha0, beta0)
alpha.thin = beta.thin = numeric()
for(i in 1:Eff.samp) {
  alpha.thin[i] = y.05$alpha.samples[(burnin+1+(thin*(i-1)))]
  beta.thin[i] = y.05$beta.samples[(burnin + 1 + (thin*(i-1)))]
}
alpha.2000 = alpha.thin[1:2000]
beta.2000 = beta.thin[1:2000]

Let’s plot our posterior samples. We can visually see the MAP estimates and 95% credible intervals for each. We will also present them numerically.

par(mfrow = c(1,2))
plot(density(alpha.2000), main = 'P(alpha|data)')
z.a = density(alpha.2000)
alpha.2000.MAP <- z.a$x[which.max(z.a$y)]
abline(v = alpha.2000.MAP, col = 'red')
alpha.2000.lower <- quantile(alpha.2000, 0.025)
alpha.2000.upper <- quantile(alpha.2000, 0.975)
abline(v = alpha.2000.lower, col = 'blue')
abline(v = alpha.2000.upper, col = 'blue')

plot(density(beta.2000), main = 'P(beta|data)')
z.b = density(beta.2000)
beta.2000.MAP <- z.b$x[which.max(z.b$y)]
abline(v = beta.2000.MAP, col = 'red')
beta.2000.lower <- quantile(beta.2000, 0.025)
beta.2000.upper <- quantile(beta.2000, 0.975)
abline(v = beta.2000.lower, col = 'blue')
abline(v = beta.2000.upper, col = 'blue')

##   parameters   lowers     maps    uppers
## 1      alpha 2.108043 4.741958 11.275007
## 2       beta 1.195120 3.330826  7.359483

An interpretation of the above would be the following:

## [1] "With probability 95% the parameter alpha is between  2.10804269329218  and  11.2750070294818 . Furthermore, the most likely value of alpha (i.e. the MAP estimate) is  4.74195783851813 ."

Additional Problem Part d.) Posterior Predictive Distribution

We will repeat the following procedure many times:

1.) Take one draw from the joint posterior samples, i.e., a matched pair \((\alpha_{MH}, \beta_{Gibbs})\)

2.) Plug that pair into the data model to generate a new observation, i.e., \(y_{new} \sim Gamma(\alpha = \alpha_{MH}, \beta = \beta_{Gibbs})\)

B <- 2000
#draw (alpha, beta) as matched pairs (same index) to preserve their joint posterior dependence
idx <- sample(length(alpha.2000), size = B, replace = TRUE)
alpha <- alpha.2000[idx]
beta <- beta.2000[idx]
y.posterior <- rgamma(B, shape = alpha, rate = beta)

df <- data.frame(y.posterior.predictive = y.posterior)


z <- density(y.posterior)
y.map.estimate <-z$x[which.max(z$y)] 
y.posterior.lower = quantile(y.posterior, 0.025)
y.posterior.upper = quantile(y.posterior, 0.975)

p <- ggplot(df, aes(x=y.posterior.predictive)) +
  geom_histogram(aes(y=..density..)) +  # scale histogram y
  geom_density(col = "red") + ggtitle('Empirical Density Estimate Plot - Posterior Predictive Distribution') + geom_vline(xintercept = y.posterior.lower, color='blue') + geom_vline(xintercept = y.posterior.upper, color = 'blue') + geom_vline(xintercept=y.map.estimate, color='green')
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see our posterior predictive distribution for \(y_{new}\) with visual lines for the 95% credible interval and the MAP estimate. We will present the results numerically below.

##      y.new    lowers     maps   uppers
## 2.5% y.new 0.2515413 1.032163 5.998761

We can interpret the numerical results as follows:

## [1] "With probability 95% the new y-value is between  0.251541348382796  and  5.99876119531376 . Furthermore, the most likely value of the new y-value (i.e. the MAP estimate) is  1.03216292898411 . In the context of our data, this means the most likely new artichoke tree height would be  1.03  meters."