Textbook Chapter 2 Problem 5.)

Textbook 5a.)

\[ \begin{align} Pr(y=k) &= \int_{0}^{1}Pr(y=k|\theta)\,p(\theta)\,d\theta\\ &= \int_{0}^{1}\binom{n}{k}\theta^k (1-\theta)^{n-k}d\theta\\ &= \binom{n}{k}\frac{\Gamma(k+1)\Gamma(n-k+1)}{\Gamma(n+2)}\\ &= \frac{1}{n+1} \end{align} \]

From the second to third equality we recognized the integrand as the kernel of a \(Beta(k+1, n-k+1)\) density, which lets us evaluate the integral in closed form.
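
As a quick numerical sanity check (a minimal sketch with an assumed value n = 10, chosen only for illustration), integrating the binomial probability over \(\theta\) for each possible k should return \(1/(n+1)\) every time:

n <- 10
# integrate Pr(y = k | theta) over the Unif(0,1) prior for each k
marginal <- sapply(0:n, function(k) {
  integrate(function(theta) dbinom(k, size = n, prob = theta), lower = 0, upper = 1)$value
})
marginal       # each entry should be approximately 1/(n + 1)
1 / (n + 1)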

Textbook 5b.)

Suppose prior \(\theta \sim Beta(\alpha, \beta)\).

The posterior is given by:

\[ \begin{align} p(\theta|y) &\propto p(\theta)p(y|\theta)\\ &\propto \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1}\binom{n}{y}\theta^y(1-\theta)^{n-y}\\ &\propto \theta^{\alpha+y-1}(1-\theta)^{\beta+n-y-1} \end{align} \]

We recognize this as the kernel of the \(Beta(\alpha+y, \beta+n-y)\) distribution.

It follows that the mean of this distribution is \(\frac{\alpha+y}{\alpha+\beta+n}\).

We will show that this posterior mean lies between \(\frac{y}{n}\) and \(\frac{\alpha}{\alpha+\beta}\) by showing that it is a weighted average of those two quantities, and therefore must fall between them.

\[ \begin{align} (\frac{y}{n})(\frac{n}{\alpha+\beta+n}) + (\frac{\alpha}{\alpha+\beta})(\frac{\alpha+\beta}{\alpha+\beta+n}) &= \frac{yn(\alpha+\beta) + n\alpha(\alpha+\beta)}{n(\alpha+\beta)(\alpha+\beta+n)}\\ &= \frac{n(\alpha+\beta)(\alpha+y)}{n(\alpha+\beta)(\alpha+\beta+n)}\\ &= \frac{\alpha+y}{\alpha+\beta+n} \end{align} \] Since \(\frac{n}{\alpha+\beta+n} < 1\) and \(\frac{\alpha+\beta}{\alpha+\beta+n} < 1\) and the two weights sum to 1, we have expressed the posterior mean as a weighted average of \(\frac{y}{n}\) and \(\frac{\alpha}{\alpha+\beta}\). As argued above, this places the posterior mean between those two values.
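
As a quick check (with assumed illustrative values y = 3, n = 10, \(\alpha\) = 2, \(\beta\) = 5), the weighted average above reproduces the posterior mean exactly:

y <- 3; n <- 10; alpha <- 2; beta <- 5
posterior.mean <- (alpha + y) / (alpha + beta + n)
weighted.avg   <- (y / n) * (n / (alpha + beta + n)) +
  (alpha / (alpha + beta)) * ((alpha + beta) / (alpha + beta + n))
c(posterior.mean, weighted.avg)   # the two numbers should be identical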

Textbook 5c.)

If \(\theta \sim Unif(0,1)\) then we have the following posterior distribution:

\[ \begin{align} p(\theta|y) &\propto p(\theta)p(y|\theta)\\ &\propto (1)\binom{n}{y}\theta^y(1-\theta)^{n-y}\\ \end{align} \]

This implies that \(\theta|y\) is \(Beta(y+1, n-y+1)\).

Using the formula for the variance of a Beta distribution, we have \(Var(\theta|y) = \frac{(y+1)(n-y+1)}{(n+2)^2(n+3)}\).

We must show that this is less than \(Var(\theta) = Var(Unif(0,1)) = \frac{1}{12}\).

We have

\[ \begin{align} Var(\theta|y) &= \frac{(y+1)(n-y+1)}{(n+2)^2(n+3)}\\ &= \frac{y+1}{n+2}\cdot\frac{n-y+1}{n+2}\cdot\frac{1}{n+3}\\ &<\frac{y+1}{n+2}\cdot\frac{n-y+1}{n+2}\cdot\frac{1}{3}\\ &\leq \frac{1}{4}\cdot\frac{1}{3}\\ &= \frac{1}{12} \end{align} \] The first inequality is valid because \(\frac{1}{n+3} < \frac{1}{3}\) whenever \(n \geq 1\). Furthermore, the two remaining fractions are each less than 1 and sum to 1, so their product is at most \(\frac{1}{4}\).

Thus, we have proved that if \(\theta \sim Unif(0,1)\) then we have \(Var(\theta|y) < Var(\theta)\).
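
A quick numerical check (assumed value n = 10, chosen only for illustration) confirms that the posterior variance stays below \(\frac{1}{12}\) for every possible value of y:

n <- 10
y <- 0:n
posterior.var <- (y + 1) * (n - y + 1) / ((n + 2)^2 * (n + 3))
max(posterior.var)   # should be strictly less than 1/12
1 / 12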

Textbook 5d.)

In general if we have prior \(Beta(\alpha, \beta)\), then we know from problem 5b. that \(\theta|y \sim Beta(\alpha+y, \beta+n-y)\).

This means that \(Var(\theta|y) = \frac{(\alpha +y)(\beta + n - y)}{(\alpha+\beta+n)^2(\alpha+\beta+n+1)}\).

Note since our prior is \(Beta(\alpha, \beta)\) we have \(Var(\theta) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\).

One such set of parameters is \(y=1\), \(n=1\), \(\alpha=2\) and \(\beta=6\), which gives prior variance \(Var(\theta)\approx 0.0208\) and posterior variance \(Var(\theta|y)\approx 0.0222\). Thus the posterior variance can be greater than the prior variance, though typically only for small values of \(n\): as \(n\) increases, the posterior variance shrinks.
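
The claim is easy to verify with the Beta variance formula and the values above:

beta.var <- function(a, b) a * b / ((a + b)^2 * (a + b + 1))
alpha <- 2; beta <- 6; n <- 1; y <- 1
beta.var(alpha, beta)               # prior variance, about 0.0208
beta.var(alpha + y, beta + n - y)   # posterior variance, about 0.0222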

Textbook Ch.2 Problem 17.)

Textbook 17a.)

Let \(\tau=\sigma^2\), so \(\sigma = \sqrt{\tau}\). Starting from the noninformative prior \(p(\sigma) \propto \sigma^{-1}\) and using the change-of-variables formula for one-to-one single-variable transformations, we have the following density for \(\sigma^2\). \[ \begin{align} f_{\sigma^2}(\sigma^2) &=f_{\tau}(\tau)\\ &=f_{\sigma}(\sqrt{\tau})\left|\frac{d\sigma}{d\tau}\right|\\ &=f_{\sigma}(\sqrt{\tau})\frac{1}{2\sqrt\tau}\\ &\propto\frac{1}{\sqrt{\tau}}\cdot\frac{1}{2\sqrt{\tau}}\\ &= \frac{1}{2\tau}\\ &\propto \frac{1}{\sigma^2} \end{align} \]

Textbook 17b.)

We must first find the posterior distribution \(p(\sigma|\nu)\). We know that \(p(\sigma|\nu) \propto p(\sigma)p(\nu|\sigma)\).

In order to find this posterior distribution \(\sigma|\nu\) we need the distribution \(p(\nu|\sigma)\). This will take a little work in the form of a linear transformation and the fact that we know that \(\frac{n\nu}{\sigma^2} \sim \chi_{n}^2\).

In the below derivation we will use the following fact from probability theory: If \(Y = aX\) and \(X \sim f_X\), then we have the following:

\[ f_Y(y) = \frac{1}{|a|}f_X(\frac{y}{a}) \]

In particular, taking \(a = \frac{\sigma^2}{n}\) (so that \(\nu = a\cdot\frac{n\nu}{\sigma^2}\)), we have:

\[ \begin{align} p_{\nu|\sigma}(\nu|\sigma) &= \frac{n}{\sigma^2}\,p_{\frac{n\nu}{\sigma^2}}\left(\frac{n\nu}{\sigma^2}\,\Big|\,\sigma\right)\\ &\propto \frac{n}{\sigma^2}\left(\frac{n\nu}{\sigma^2}\right)^{\frac{n}{2}-1}e^{-\frac{n\nu}{2\sigma^2}} \end{align} \]

Then the posterior of \((\sigma|\nu)\) is given by

\[ \begin{align} p(\sigma|\nu) &\propto p(\sigma)p(\nu|\sigma)\\ &\propto (\sigma^{-1})(\sigma^2)^{-1}({\sigma^2})^{-(\frac{n}{2}-1)}e^{-\frac{c}{\sigma^2}}\\ &= (\sigma^2)^{-\frac{n}{2}-\frac{1}{2}}e^{-\frac{c}{\sigma^2}} \end{align} \] where \(c = \frac{n\nu}{2}\) and all factors not involving \(\sigma\) have been absorbed into the proportionality constant.

Similarly, the posterior of \(\sigma^2|\nu\) is the following:

\[ \begin{align} p(\sigma^2|\nu) &\propto p(\sigma^2)p(\nu|\sigma^2)\\ &\propto (\sigma^{-2})(\sigma^2)^{-1}({\sigma^2})^{-(\frac{n}{2}-1)}e^{-\frac{c}{\sigma^2}}\\ &= (\sigma^2)^{-\frac{n}{2}-1}e^{-\frac{c}{\sigma^2}} \end{align} \]

A 95% highest posterior density (HPD) interval \((a,b)\) is such that \(p(a) = p(b)\) and the posterior probability between \(a\) and \(b\) equals 0.95.

We will show that if we have a 95% HPD for \(\sigma\) then we cannot simply just square the endpoints and obtain a 95% HPD for \(\sigma^2\).

We will argue by contradiction.

Suppose we have a 95% HPD \((\sqrt{a}, \sqrt{b})\) for \(\sigma\). Then, since \(p_{\sigma|\nu}(\sqrt{a})=p_{\sigma|\nu}(\sqrt{b})\) we have the following:

\[ a^{-(\frac{n}{2}+\frac{1}{2})}\exp(-c/a) = b^{-(\frac{n}{2}+\frac{1}{2})}\exp(-c/b) \] Suppose we can simply square these endpoints \((\sqrt{a}, \sqrt{b}) \rightarrow (a,b)\) and we will also have a 95% HPD for \(\sigma^2\).

Then we will also have, by \(p_{\sigma^2|\nu}(a) = p_{\sigma^2|\nu}(b)\),

\[ a^{-(\frac{n}{2}+1)}\exp(-c/a) = b^{-(\frac{n}{2}+1)}\exp(-c/b) \]

Combining these two equalities involving \(a\) and \(b\), we will arrive at the conclusion that

\[ a^{-\frac{1}{2}} = b^{-\frac{1}{2}} \implies a=b \] But \(a = b\) contradicts the requirement that the interval \((a,b)\) contain 95% of the posterior probability. This contradiction shows that we cannot, in general, obtain a 95% HPD interval for \(\sigma^2\) by simply squaring the endpoints of a 95% HPD interval for \(\sigma\).
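
A small simulation illustrates this (a sketch only, with assumed values n = 10 and \(\nu\) = 4, using the shortest interval containing 95% of the posterior draws as an estimate of the HPD interval): the squared endpoints of the 95% HPD interval for \(\sigma\) will generally not coincide with the 95% HPD interval for \(\sigma^2\).

set.seed(1)
n <- 10; v <- 4; S <- 1e5
sigma2.draws <- n * v / rchisq(S, df = n)   # draws of sigma^2 from its scaled-inverse-chi^2 posterior
sigma.draws  <- sqrt(sigma2.draws)          # corresponding draws of sigma

# shortest interval containing a given fraction of the draws (an HPD estimate for a unimodal density)
hpd <- function(draws, level = 0.95) {
  x <- sort(draws)
  m <- floor(level * length(x))
  widths <- x[(m + 1):length(x)] - x[1:(length(x) - m)]
  i <- which.min(widths)
  c(x[i], x[i + m])
}

hpd(sigma.draws)^2   # squared endpoints of the 95% HPD interval for sigma
hpd(sigma2.draws)    # 95% HPD interval for sigma^2; the two intervals should differ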

Textbook Chapter 2 Problem 20.) Censored and Uncensored Data

Problem 20a.)

\[ \begin{align} p(\theta|y \geq 100) &\propto p(y\geq100|\theta)p(\theta)\\ &= \left(\int_{100}^{\infty}\theta e^{-\theta y}dy\right)\frac{\beta^{\alpha}}{\Gamma(\alpha)}\theta^{\alpha-1}e^{-\beta\theta}\\ &=(e^{-100\theta})\frac{\beta^{\alpha}}{\Gamma(\alpha)}\theta^{\alpha-1}e^{-\beta\theta}\\ &\propto\theta^{\alpha-1}e^{-\theta(\beta+100)}\\ &\sim Gamma(\alpha, \beta+100) \end{align} \]

It follows that the posterior mean is \(\frac{\alpha}{\beta+100}\) and the posterior variance is \(\frac{\alpha}{(\beta+100)^2}\).

Problem 20b.)

\[ \begin{align} p(\theta|y = 100) &\propto p(y=100|\theta)p(\theta)\\ &= \left(\theta e^{-100\theta}\right)\frac{\beta^{\alpha}}{\Gamma(\alpha)}\theta^{\alpha-1}e^{-\beta\theta}\\ &\propto\theta^{\alpha+1-1}e^{-\theta(\beta+100)}\\ &\sim Gamma(\alpha+1, \beta+100) \end{align} \]

It follows that the posterior mean is \(\frac{\alpha+1}{\beta+100}\) and the posterior variance is \(\frac{\alpha+1}{(\beta+100)^2}\).
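
As a numerical sanity check (assumed hyperparameters \(\alpha = 2\), \(\beta = 1\), chosen only for illustration), normalizing the prior times the likelihood on a grid of \(\theta\) values reproduces the posterior means derived above:

alpha <- 2; beta <- 1
theta <- seq(1e-6, 1, length.out = 1e5)   # the posterior mass is concentrated well below 1 here
prior <- dgamma(theta, shape = alpha, rate = beta)

# (a) censored observation: Pr(y >= 100 | theta) = exp(-100 * theta)
post.a <- prior * exp(-100 * theta)
post.a <- post.a / sum(post.a)
sum(theta * post.a)    # should be close to alpha / (beta + 100)

# (b) exact observation: p(y = 100 | theta) = theta * exp(-100 * theta)
post.b <- prior * theta * exp(-100 * theta)
post.b <- post.b / sum(post.b)
sum(theta * post.b)    # should be close to (alpha + 1) / (beta + 100)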

Problem 20c.)

\(Var(\theta|y=100) > Var(\theta|y \geq 100)\) because \(\frac{\alpha+1}{(\beta+100)^2}\) > \(\frac{\alpha}{(\beta+100)^2}\).

This doesn’t contradict identity (2.8), \(Var(\theta) = E(Var(\theta|y)) + Var(E(\theta|y))\), because that identity only implies that the posterior variance is smaller than the prior variance on average over the prior predictive distribution of \(y\); it does not require \(Var(\theta|y)\) to be smaller for every particular observation. Here we are also comparing posteriors conditioned on two different events (\(y = 100\) versus \(y \geq 100\)), and the exact observation happens to leave more uncertainty about \(\theta\) than the censored one.
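
A small simulation (assumed hyperparameters \(\alpha = 2\), \(\beta = 1\), for illustration) shows identity (2.8) holding in the averaged sense: the posterior variance for an exactly observed y, averaged over the prior predictive distribution, does not exceed the prior variance, even though it can exceed it for particular values of y.

set.seed(1)
alpha <- 2; beta <- 1; S <- 1e5
theta.draws <- rgamma(S, shape = alpha, rate = beta)   # theta drawn from the prior
y.draws     <- rexp(S, rate = theta.draws)             # y drawn from the prior predictive distribution
post.var    <- (alpha + 1) / (beta + y.draws)^2        # Var(theta | y) for each exact observation

mean(post.var)   # E[Var(theta | y)]: should not exceed the prior variance
alpha / beta^2   # prior variance Var(theta)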

Textbook Ch.3 Problem 9.)

\[ \begin{align} p(\mu,\sigma^2|y) &\propto p(y|\mu,\sigma^2)p(\mu,\sigma^2)\\ &\propto [(\sigma^2)^{-\frac{n}{2}}\exp(-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\mu)^2)][(\frac{\sigma^2}{\kappa_0})^{-\frac{1}{2}}\exp(-\frac{(\mu-\mu_0)^2}{2(\frac{\sigma^2}{\kappa_0})})][\frac{\exp(\frac{-\nu_0\sigma_0^2}{2\sigma^2})}{(\sigma^2)^{1+\nu_0/2}}]\\ &\propto \sigma^{-1}(\sigma^2)^{-((n+\nu_0)/2 + 1)}\exp(-\frac{\nu_0\sigma_0^2+(n-1)s^2 + \frac{\kappa_0 n(\bar{y}-\mu_0)^2}{\kappa_0+n} + (\kappa_0+n)(\mu - \frac{\kappa_0\mu_0+n\bar{y}}{\kappa_0+n})^2}{2\sigma^2})\\ &\sim N-Inv-\chi^2\left(\mu_n, \frac{\sigma_n^2}{\kappa_n};\ \nu_n, \sigma_n^2\right) \end{align} \] where \(\mu_n = \frac{\kappa_0\mu_0 + n\bar{y}}{\kappa_0+n}\), \(\kappa_n = \kappa_0 + n\), \(\nu_n = \nu_0 + n\), and \(\sigma_n^2 = \frac{\nu_0\sigma_0^2 + (n-1)s^2 + \frac{\kappa_0 n(\bar{y}-\mu_0)^2}{\kappa_0+n}}{\nu_0 + n}\).
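
Given the conjugacy, sampling from this posterior is direct: draw \(\sigma^2\) from its scaled-inverse-\(\chi^2\) marginal and then \(\mu\) from its conditional normal distribution. Below is a minimal sketch with made-up data and prior hyperparameters (all values assumed purely for illustration).

set.seed(1)
y <- c(2.1, 1.4, 3.3, 2.8, 1.9)                   # made-up data
n <- length(y); ybar <- mean(y); s2 <- var(y)
mu0 <- 0; kappa0 <- 1; nu0 <- 2; sigma0.sq <- 1   # assumed prior hyperparameters

kappa.n    <- kappa0 + n
mu.n       <- (kappa0 * mu0 + n * ybar) / kappa.n
nu.n       <- nu0 + n
sigma.n.sq <- (nu0 * sigma0.sq + (n - 1) * s2 + kappa0 * n * (ybar - mu0)^2 / kappa.n) / nu.n

S <- 1e4
sigma2.draws <- nu.n * sigma.n.sq / rchisq(S, df = nu.n)                  # sigma^2 | y
mu.draws     <- rnorm(S, mean = mu.n, sd = sqrt(sigma2.draws / kappa.n))  # mu | sigma^2, y
c(mean(mu.draws), mean(sigma2.draws))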

Textbook Ch.3 Problem 11.)

Textbook Ch.3 Problem 11a.)

In order to estimate the posterior distribution we must first home in on the region of \((\alpha, \beta)\) where the density is non-negligible. We will start by finding the mode of the unnormalized posterior density (the prior times the likelihood), which is proportional to the posterior distribution.

We will need the mvtnorm, ggplot2, and boot libraries (boot provides the inv.logit function).

library(mvtnorm)
library(ggplot2)
library(boot)
data.bio <- data.frame(Dose = c(-0.86, -0.30, -0.05, 0.73), Num.Animals=c(5,5,5,5), Num.Deaths = c(0,1,3,5))

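# likelihood p(y | alpha, beta) for the bioassay data: product of binomial kernels over the dose groups (binomial coefficients omitted since they do not depend on alpha or beta)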
p.y <- function(alpha, beta, data) {
  val = 1
  for(i in 1:nrow(data)) {
    ni = data$Num.Animals[i]
    yi = data$Num.Deaths[i]
    xi = data$Dose[i]
    current.val = ((inv.logit(alpha+beta*xi))^yi) * ((1-inv.logit(alpha+beta*xi))^(ni-yi))
    val = val * current.val
  }
  return(val)
}

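# bivariate normal prior density on (alpha, beta): means (0, 10), sds (2, 10), correlation 0.5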
p.ab <- function(alpha, beta) {
  cov.matrix = matrix(c(4, 10, 10, 100), nrow=2, byrow=TRUE)
  val <- dmvnorm(x=c(alpha,beta), mean=c(0,10), sigma=cov.matrix)
  return(val)
}
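
# unnormalized posterior density: prior times likelihood, evaluated at par = c(alpha, beta)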
posterior.likelihood <- function(par, dat) {
  z <- p.ab(par[1], par[2]) * p.y(par[1], par[2], dat)
  return(z)
}

fit <- optim(par = c(0,10), fn=posterior.likelihood, dat=data.bio, control=list(fnscale=-1)) # fnscale = -1 makes optim maximize rather than minimize
fit$par
## [1] 0.7135563 7.9254623

Thus, the posterior mode (the maximizer of the unnormalized posterior density) is \((\hat{\alpha}, \hat{\beta}) = (0.7135563, 7.9254623)\).

We are now ready to compute the posterior density at a grid of points \((\alpha, \beta)\). We will use the range \((\alpha, \beta) \in [-8, 8] \times [-20, 40]\).

alpha.grid <-  seq(-8, 8, length.out=100)
beta.grid <-  seq(-20, 40, length.out=100)
data.fit <- expand.grid(alpha=alpha.grid, beta=beta.grid)
posterior <- rep(0, nrow(data.fit))
for(i in 1:nrow(data.fit)) {
  posterior[i] <- posterior.likelihood(as.numeric(data.fit[i,]), dat = data.bio)
}
gdata <- cbind(data.fit, posterior)
plot1 <- ggplot(gdata, aes(x=alpha, y=beta, z=posterior)) + 
  stat_contour() + 
  geom_tile(aes(fill=posterior)) + 
  stat_contour(bins = 15) + 
  xlab("Alpha") + 
  ylab("Beta") 
plot1

It appears that we can tighten our range of values for \((\alpha, \beta)\). We will try another plot here:

alpha.grid <-  seq(-4, 5, length.out=100)
beta.grid <-  seq(-10, 30, length.out=100)
data.fit <- expand.grid(alpha=alpha.grid, beta=beta.grid)
posterior <- rep(0, nrow(data.fit))
for(i in 1:nrow(data.fit)) {
  posterior[i] <- posterior.likelihood(as.numeric(data.fit[i,]), dat = data.bio)
}
gdata <- cbind(data.fit, posterior)
plot2 <- ggplot(gdata, aes(x=alpha, y=beta, z=posterior)) + 
  stat_contour() + 
  geom_tile(aes(fill=posterior)) + 
  stat_contour(bins = 15) + 
  xlab("Alpha") + 
  ylab("Beta") 
plot2

Now we must sample from the joint posterior distribution. We will follow the example from page 76 of the textbook.

#construct normalized posterior distribution
p.ab.table = gdata
p.ab.table$normalized = p.ab.table$posterior/sum(p.ab.table$posterior)
discrete.alphas = unique(data.fit$alpha)
discrete.betas = unique(data.fit$beta)

#marginal posterior of alpha: sum the normalized joint posterior over all betas paired with each alpha
p.a = matrix(nrow=length(discrete.alphas), ncol=2)
for(i in 1:nrow(p.a)) {
  current.alpha = discrete.alphas[i]
  rows = which(p.ab.table$alpha == current.alpha)
  p.a[i,] = c(current.alpha, sum(p.ab.table$normalized[rows]))
}
#third column: normalized marginal posterior p(alpha | y)
p.a.table = cbind(p.a, p.a[,2]/sum(p.a[,2]))

#------------------------------------------
#sample alphas
cum.column = cumsum(p.a.table[,3])
NUMSIM <- 1000
alpha.sampled <- rep(-10, NUMSIM)
for(i in 1:NUMSIM) {
  counter <- 1
  r = runif(1, min=0, max=1)
  while(r > cum.column[counter]) {
    counter = counter + 1
  }
  alpha.sampled[i] <- p.a.table[counter,1]
}
#--------------------------------------

#table for sampling beta given alpha: the marginal.alpha column (one value per grid alpha)
#is recycled so that each row is paired with the marginal of its own alpha value
#(alpha varies fastest in expand.grid)
p.bgivena = data.frame(alpha = gdata$alpha,
                       beta = gdata$beta,
                       joint.normalized = p.ab.table$normalized,
                       marginal.alpha = p.a.table[,3])

beta.S = rep(-100, NUMSIM)
alpha.S = alpha.sampled   #alpha draws from the marginal posterior, sampled above
for(i in 1:length(alpha.S)) {
  current.alpha.S = alpha.S[i]
  conditional.table = p.bgivena[which(p.bgivena$alpha == current.alpha.S), ]
  conditional.likelihood = conditional.table$joint.normalized/conditional.table$marginal.alpha
  conditional.prob = conditional.likelihood/sum(conditional.likelihood)
  beta.S[i] = sample(conditional.table[,2], size=1, prob=conditional.prob)
}
discrete.space.a = diff(discrete.alphas)[1]
discrete.space.b = diff(discrete.betas)[1]

alpha.s.jittered = alpha.S + runif(length(alpha.S), min=-discrete.space.a/2, max=discrete.space.a/2)
beta.s.jittered = beta.S + runif(length(alpha.S), min=-discrete.space.b/2, max=discrete.space.b/2)

alpha.beta.samples = data.frame(alpha.samples = alpha.s.jittered, beta.samples = beta.s.jittered)
plot1 <-ggplot(alpha.beta.samples, aes(x=alpha.samples, y=beta.samples)) + geom_point()
plot1

Textbook Ch.3 Problem 11c.)

As stated in the text, we may be interested in the parameter LD50, the dose level at which the probability of death is 50%. As noted in the textbook, LD50 \(= -\alpha/\beta\) (the value of \(x\) solving \(\alpha + \beta x = \text{logit}(0.5) = 0\)), so we compute \(-\alpha/\beta\) for each of the 1000 posterior draws of \((\alpha, \beta)\).

LD50 <- -alpha.s.jittered/beta.s.jittered   # LD50 = -alpha/beta
hist(LD50)
abline(v = mean(LD50), col='red')

Textbook Ch.3 Problem 14.)

We assume that \(p(\alpha, \beta) \propto 1\). Thus, our prior is improper since \(\int\int 1 d\alpha d\beta\) will diverge to infinity.

We will now prove that the posterior distribution will still be proper.

\[ \begin{align} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p(\alpha, \beta) \prod_{i=1}^{k}[logit^{-1}(\alpha+\beta x_i)]^{y_i}[1-logit^{-1}(\alpha+\beta x_i)]^{n_i - y_i}\,d\alpha\, d\beta &\propto \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \prod_{i=1}^{k}\theta_i^{y_i}(1-\theta_i)^{n_i - y_i}\,d\alpha\, d\beta\\ &\leq \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \theta_1^{y_1}(1-\theta_1)^{n_1 - y_1}\,\theta_2^{y_2}(1-\theta_2)^{n_2 - y_2}\,d\alpha\, d\beta\\ &= \frac{1}{|x_2-x_1|}\int_{0}^{1}\int_{0}^{1} \theta_1^{y_1-1}(1-\theta_1)^{n_1 - y_1-1}\,\theta_2^{y_2-1}(1-\theta_2)^{n_2 - y_2-1}\,d\theta_1\, d\theta_2\\ &< \infty \end{align} \]

In the work above, \(\theta_i = logit^{-1}(\alpha+\beta x_i) \in (0,1)\) and \(k\) denotes the number of dose groups. For the first inequality we dropped every factor in the product except the two corresponding to dose levels \(x_1 \neq x_2\), chosen (and relabeled as \(i=1,2\)) so that \(0 < y_i < n_i\); each dropped factor is at most 1. The equality is the change of variables \((\alpha, \beta) \mapsto (\theta_1, \theta_2)\), which maps \(\mathbb{R}^2\) one-to-one onto \((0,1)^2\) with \(d\alpha\, d\beta = \frac{d\theta_1\, d\theta_2}{|x_2 - x_1|\,\theta_1(1-\theta_1)\,\theta_2(1-\theta_2)}\). The resulting integral factors into the two Beta integrals \(B(y_1, n_1-y_1)\) and \(B(y_2, n_2-y_2)\), which are finite precisely because \(0 < y_i < n_i\). Such a pair of dose levels exists in the bioassay data (the groups with 1 and 3 deaths out of 5).

Thus, we conclude that the posterior distribution is proper.

Additional Problem 3.)

3a.) Derive Jeffrey’s Prior for Lambda

By definition, the Jeffreys prior is proportional to the square root of the Fisher information. In other words,

\(g(\lambda) \propto \sqrt{-E_y(\frac{\partial^2}{\partial\lambda^2} \log(f(y|\lambda)))}\)

Thus, we will calculate the Fisher information. \[ \begin{align} \sqrt{-E_y\left[\frac{\partial^2}{\partial\lambda^2} \log f(y|\lambda)\right]} &= \sqrt{-E_y\left[\frac{\partial^2}{\partial\lambda^2}\log\left(\frac{e^{-\lambda}\lambda^y}{y!}\right)\right]} \\ &= \sqrt{-E_y\left[\frac{-y}{\lambda^2}\right]}\\ &= \sqrt{\frac{\lambda}{\lambda^2}}\\ &= \sqrt{\frac{1}{\lambda}} \end{align} \]

Therefore, we have that the Jeffreys prior \(g(\lambda)\) is proportional to \(\frac{1}{\sqrt{\lambda}}\).
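
A quick simulation check (with an assumed value \(\lambda = 3\), for illustration): the negative expected second derivative of the Poisson log-likelihood, which is \(E_y[y/\lambda^2]\), should come out to \(1/\lambda\).

set.seed(1)
lambda <- 3
y <- rpois(1e6, lambda)
mean(y / lambda^2)   # Monte Carlo estimate of the Fisher information
1 / lambda           # theoretical value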

3b.) Is this prior proper? Defend your answer.

This prior is improper because the integral of \(g(\lambda)\) over all possible values of \(\lambda\) does not converge to a finite number.

In other words,

\(\int_{0}^{\infty} \frac{1}{\sqrt{\lambda}}d\lambda\) is not finite, so no choice of normalizing constant \(c\) would produce a pdf from \(g(\lambda)\) over the parameter space \(\Lambda\).

3c.) Derive the posterior and give conditions on Y to ensure it is proper

We know that \(g(\lambda|y) \propto g(\lambda)f(y|\lambda)\). In other words, the posterior is proportional to the prior times the likelihood.

Thus,

\[ \begin{align} g(\lambda)f(y|\lambda) &= (\lambda^{-1/2})\frac{e^{-\lambda}\lambda^y}{y!}\\ &= \frac{e^{-\lambda}\lambda^{y+1/2 - 1}}{y!}\\ & \sim Gamma(\alpha=y+1/2, \beta=1) \end{align} \]

In order for this distribution to be proper, we must have \(\alpha > 0\) and \(\beta > 0\), which requires \(y > -1/2\). Since \(y\) is a Poisson count, \(y \geq 0\), so this condition holds for every possible observation and the posterior is always proper.
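
As a numerical check (assumed observation y = 4, for illustration), the grid-normalized prior-times-likelihood matches the \(Gamma(y + 1/2, 1)\) density:

y <- 4
lambda <- seq(0.001, 30, length.out = 1e4)
unnorm <- lambda^(-1/2) * dpois(y, lambda)   # prior times likelihood
unnorm <- unnorm / sum(unnorm)
gamma.post <- dgamma(lambda, shape = y + 1/2, rate = 1)
gamma.post <- gamma.post / sum(gamma.post)
max(abs(unnorm - gamma.post))   # should be essentially zero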

Additional Problem 4

We will derive the posterior distribution of \(\theta\) below using the fact that the posterior is proportional to the prior times the likelihood, and noting that the likelihood of \((y_1,\dots,y_n)^{T}\) is \(\frac{1}{\theta^{n}}I\{\max_i y_i \leq \theta\}\). Remember that we can ignore all terms not involving \(\theta\) in the derivation below.

\[ \begin{align} p(\theta|y_1,\dots,y_n) &\propto p(\theta)p(y_1,\dots,y_n|\theta)\\ &\propto \frac{1}{\theta^{\alpha+1}}I\{\beta \leq \theta\}\, \frac{1}{\theta^n}I\{\max_i y_i \leq \theta\} \\ &\propto \frac{1}{\theta^{n+\alpha+1}}I\{\max(\max_i y_i,\, \beta)\leq \theta\}\\ &\sim Pareto(\tilde{\alpha}=n+\alpha,\ \tilde{\beta}=\max(\max_i y_i,\, \beta)) \end{align} \]
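
A quick numerical check (all values assumed for illustration: \(\alpha = 2\), \(\beta = 1\), and n = 5 observations drawn from Unif(0, 3)): the grid-based mean of the posterior kernel agrees with the mean of the \(Pareto(\tilde{\alpha}, \tilde{\beta})\) distribution, \(\tilde{\alpha}\tilde{\beta}/(\tilde{\alpha}-1)\).

set.seed(1)
alpha <- 2; beta <- 1
y <- runif(5, min = 0, max = 3)
alpha.tilde <- length(y) + alpha
beta.tilde  <- max(max(y), beta)

theta <- seq(beta.tilde, 200, length.out = 1e6)   # grid over the support theta >= beta.tilde
post  <- theta^(-(alpha.tilde + 1))               # posterior kernel
post  <- post / sum(post)
sum(theta * post)                                 # grid-based posterior mean
alpha.tilde * beta.tilde / (alpha.tilde - 1)      # Pareto mean, should be close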