These are my solutions to the exercises in chapter 4 of Hoff's Bayesian statistics book (so there might be some mistakes!).

Hoff, Peter D. 2009. A First Course in Bayesian Statistical Methods. New York: Springer. Accessed March 14, 2023.

This chapter covers theory and implementation of Monte Carlo simulation to approximate posterior quantities of interest.

4-1

Posterior comparison: estimating the posterior probability that the rate of support for policy Z is lower in county A than in county B, i.e. \(\operatorname{Pr}\left(\theta_1<\theta_2 \mid \text{data}\right)\).
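The rbeta calls below correspond to independent uniform (\(\operatorname{beta}(1,1)\)) priors on the two rates, under which each posterior is again a beta distribution: \[ \theta_1 \mid y_1 \sim \operatorname{beta}\left(1+y_1,\; 1+n_1-y_1\right), \qquad \theta_2 \mid y_2 \sim \operatorname{beta}\left(1+y_2,\; 1+n_2-y_2\right) \]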

set.seed(1)
N <- 5000 # number of Monte Carlo draws

# sample information (sufficient statistics)
n_1 <- 100; y_1 <- 57 # county A
n_2 <- 50; y_2 <- 30 # county B

# sample theta for each county from its posterior distribution
theta_1_mc <- rbeta(N, 1 + y_1, 1 + n_1 - y_1)
theta_2_mc <- rbeta(N, 1 + y_2, 1 + n_2 - y_2)

# estimate p(theta_1 < theta_2 | data, prior) 
mean(theta_1_mc < theta_2_mc)
## [1] 0.637

4-2

Tumor count comparison: comparing the rates of tumorigenesis in two strains of mice (A and B).

4-2 (a)
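The gamma prior is conjugate to the Poisson sampling model, so the posteriors sampled below are \[ \theta_A \mid y_A \sim \operatorname{gamma}\left(\alpha_A+\sum_{i=1}^{n_A} y_{A,i},\; \beta_A+n_A\right), \qquad \theta_B \mid y_B \sim \operatorname{gamma}\left(\alpha_B+\sum_{i=1}^{n_B} y_{B,i},\; \beta_B+n_B\right) \]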

N <- 5000
# data
y_a <- c(12,9,12,14,13,13,15,8,15,6)
y_b <- c(11,11,10,9,9,8,7,10,6,8,8,9,7)

# prior information
alpha_a <- 120; beta_a <- 10
alpha_b <- 12; beta_b <- 1

# sample information (sufficient statistic)
sy_a <- sum(y_a); n_a <- length(y_a)
sy_b <- sum(y_b); n_b <- length(y_b)

# sample theta for each strain from its posterior distribution
theta_a_mc <- rgamma(N, alpha_a + sy_a, beta_a + n_a)
theta_b_mc <- rgamma(N, alpha_b + sy_b, beta_b + n_b)

# estimate p(theta_b < theta_a | data, prior)
mean(theta_b_mc < theta_a_mc)
## [1] 0.9958

4-2 (b)

Set the prior to \(\theta_B \sim \operatorname{gamma}\left(12 n_0, n_0\right)\) for \(n_0=1,2,\ldots,50\), and calculate \(\operatorname{Pr}\left(\theta_B<\theta_A \mid y_A, y_B\right)\) for each \(n_0\).

n_0 <- 1:50
p_list <- c() # list to store p(theta_b < theta_a | data, prior) for each n_0

for (i in n_0){
  theta_b_mc <- rgamma(N, 12 * i + sy_b, i + n_b)
  p <- mean(theta_b_mc < theta_a_mc)
  p_list <- append(p_list, p)
}

plot(n_0, p_list, type='l', ylab="Probability")

As \(n_0\) increases, the probability decreases. \(n_0\) plays the role of a prior sample size, i.e. how confident we are a priori that the rate for strain B is around 12. As \(n_0\) increases, the posterior for \(\theta_B\) reflects the prior more than the data. Since \(\bar{y}_A = 11.7 < 12\) (and the posterior mean of \(\theta_A\) is about 11.85), becoming more confident a priori that \(\theta_B \approx 12\) pulls \(\theta_B\) upward, so \(\operatorname{Pr}\left(\theta_B<\theta_A \mid y_A, y_B\right)\) decreases.

4-2 (c)

Repeat (a) and (b), replacing the event \(\{\theta_B<\theta_A\}\) with the event \(\{\tilde{Y}_B<\tilde{Y}_A\}\), where \(\tilde{Y}_A\) and \(\tilde{Y}_B\) are samples from the posterior predictive distributions.
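For the Poisson–gamma model the posterior predictive distribution is negative binomial: integrating the Poisson density against the \(\operatorname{gamma}\left(a+\sum y_i,\; b+n\right)\) posterior gives \[ p(\tilde{y} \mid y)=\frac{\Gamma\left(a+\sum_{i=1}^n y_i+\tilde{y}\right)}{\Gamma\left(a+\sum_{i=1}^n y_i\right) \tilde{y} !}\left(\frac{b+n}{b+n+1}\right)^{a+\sum_{i=1}^n y_i}\left(\frac{1}{b+n+1}\right)^{\tilde{y}}, \] which matches R's rnbinom parameterization with size \(a+\sum y_i\) and prob \((b+n)/(b+n+1)\), as used below.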

# sampling y_tilde from their posterior predictive distribution
y_a_pred_mc <- rnbinom(N, alpha_a + sy_a, (beta_a + n_a) / (beta_a + n_a + 1))
y_b_pred_mc <- rnbinom(N, alpha_b + sy_b, (beta_b + n_b) / (beta_b + n_b + 1))

# estimate p(y_tilde_b < y_tilde_a | data, prior)
mean(y_b_pred_mc < y_a_pred_mc)
## [1] 0.2394
n_0 <- 1:50
p_list <- c()

for (i in n_0){
  y_b_pred_mc <- rnbinom(N, 12 * i + sy_b, (i + n_b) / (i + n_b + 1))
  p <- mean(y_b_pred_mc < y_a_pred_mc)
  p_list <- append(p_list, p)
}

plot(n_0, p_list, type='l', ylab="Probability")

4-3

Posterior predictive checks: investigate the adequacy of the Poisson model for the tumor count data (same data as in 4-2).

4-3(a)

Obtain 10 samples from the posterior predictive distribution of \(y_A\) and calculate \(t\), the sample average of the 10 values divided by their sample standard deviation. Repeat this 1000 times and make a histogram of \(t\).

t_mc <- NULL
for (s in 1:1000){
  theta_a <- rgamma(1, alpha_a + sy_a, beta_a + n_a) # draw theta_a from its posterior
  y_mc <- rpois(10, theta_a) # draw n_a = 10 samples from the posterior predictive distribution
  t_mc <- c(t_mc, mean(y_mc) / sd(y_mc)) # t = sample mean / sample standard deviation
}

# observed value of t
t_obs <- mean(y_a) / sd(y_a)

# plot
hist(t_mc)
abline(v=t_obs, col=2)

The vertical line shows the observed value. It lies in a region of high density, which suggests that the Poisson model is a reasonable choice for the strain A data.
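To attach a rough number to this, one can report the fraction of simulated statistics at least as large as the observed one (a one-sided Monte Carlo tail estimate; a minimal sketch using t_mc and t_obs from above):

# fraction of posterior predictive t values at least as large as the observed t
mean(t_mc >= t_obs)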

4-3(b)

Repeat this for \(y_b\).

t_mc <- NULL
for (s in 1:1000){
  theta <- rgamma(1, alpha_b + sy_b, beta_b + n_b)
  y_mc <- rpois(13, theta)
  t_mc <- c(t_mc, mean(y_mc) / sd(y_mc))
}

t_obs <- mean(y_b) / sd(y_b)

hist(t_mc)
abline(v=t_obs, col=2)

The observed value falls in the tail of the predictive distribution of \(t\), i.e. data like this would rarely be generated under the Poisson model. This suggests the Poisson model may not be adequate for the strain B data.

4-4

Mixture prior: the prior used below is a mixture of two beta densities, \(p(\theta) = 0.75\,\operatorname{beta}(\theta; 2, 8) + 0.25\,\operatorname{beta}(\theta; 8, 2)\).

# mixed prior
d_prior_mixed <- function(theta) {
  d <- 0.25*(gamma(10)/(gamma(2) * gamma(8))) * 
    (3*theta*((1 - theta)^7) + (theta^7)*(1-theta))
  return(d)
}

4-4(a)

Make a plot of \(p(y\mid\theta)\,p(\theta)\) (the unnormalized posterior) as a function of \(\theta\).

n <- 43; y <- 25; # sample information

theta <- seq(0,1, 0.01)
likelihood <- dbinom(y, n, theta)
posterior <- d_prior_mixed(theta) * likelihood # unnormalized posterior on the grid

plot(theta, posterior, type='l')

Now obtain a 95% posterior confidence interval using a discrete approximation.

discretised <- NULL
for (i in 1:length(theta)){
  discretised <- c(discretised, rep(theta[i], floor(10000 * posterior[i])))
}

quantile(discretised, c(.025, .975)) 
##  2.5% 97.5% 
##  0.41  0.73

We discretize the continuous distribution by treating posterior[i] as the (relative) frequency of theta[i]. Since frequencies must be integers, we multiply posterior[i] by an integer large enough that floor(10000 * posterior[i]) is nonzero wherever the density is non-negligible. The exact multiplier does not matter much, since only the relative frequencies affect the quantiles.
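Equivalently, sample() can handle the weighting directly (a minimal sketch reusing the theta grid and the unnormalized posterior values from above), which avoids the flooring step:

# draw theta values from the grid with probabilities proportional to the unnormalized posterior
theta_grid_mc <- sample(theta, size = 10000, replace = TRUE, prob = posterior)
quantile(theta_grid_mc, c(.025, .975))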

4-4(b)

Use Monte Carlo approximation to obtain a 95% posterior confidence interval and compare it with (a).

N <- 100000
w <- 0.75 # prior weight on the beta(2, 8) component of the mixture
posterior_mc <- NULL

for (i in 1:N){
  x <- rbinom(1, 1, w) # pick which mixture component to draw from
  if (x == 1){
    theta <- rbeta(1, y + 2, n - y + 8)
  } else {
    theta <- rbeta(1, y + 8, n - y + 2)
  }
  posterior_mc <- c(posterior_mc, theta)
}

hist(posterior_mc)

quantile(posterior_mc, c(.025, .975))
##      2.5%     97.5% 
## 0.3848259 0.7073393

The interval obtained from the Monte Carlo approximation is close to the one obtained from the discrete approximation in (a).
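One caveat about the sampler above (a sketch only, not re-run here): it picks the beta component using the prior weight 0.75, but in the exact posterior the mixture weights are the prior weights reweighted by each component's marginal likelihood. With the y and n from part (a), the updated weight on the \(\operatorname{beta}(y+2, n-y+8)\) component could be computed as

# posterior mixture weight on the beta(y + 2, n - y + 8) component;
# each component's prior weight is multiplied by its marginal likelihood (up to a common constant)
log_m1 <- log(0.75) + lbeta(y + 2, n - y + 8) - lbeta(2, 8)
log_m2 <- log(0.25) + lbeta(y + 8, n - y + 2) - lbeta(8, 2)
w_post <- 1 / (1 + exp(log_m2 - log_m1))

and drawing the component indicator with probability w_post instead of 0.75 would make the Monte Carlo sampler exact.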

4-5

Cancer deaths: we have \(X_i =\) the number of people (in 10,000s) and \(Y_i =\) the number of cancer fatalities, and we assume \(Y_i \sim \operatorname{Poisson}(\theta X_i)\). We have data from two groups of counties: "near nuclear reactor" and "not near nuclear reactor".

4-5(a)

Identify the posterior distribution of \(\theta\) given the data and a \(\operatorname{gamma}(\alpha, \beta)\) prior distribution.

\[ \begin{aligned} p(\theta \mid y, x) & \propto p(\theta) \cdot p(y, x \mid \theta) \\ & =p(\theta) \cdot p(y \mid \theta, x)\, p(x) \\ & =\frac{\beta^\alpha}{\Gamma(\alpha)} \theta^{\alpha-1} e^{-\beta \theta} \cdot \prod_{i=1}^n \frac{\left(\theta x_i\right)^{y_i} e^{-\theta x_i}}{y_{i} !} \cdot p(x) \\ & \propto \theta^{\alpha-1} e^{-\beta \theta} \cdot \theta^{\sum_{i=1}^n y_i} e^{-\theta \sum_{i=1}^n x_i} \\ & =\theta^{\alpha+\sum_{i=1}^n y_i-1} e^{-\left(\beta+\sum_{i=1}^n x_i\right) \theta}, \end{aligned} \] so that \(\theta \mid y, x \sim \operatorname{gamma}\left(\alpha+\sum_{i=1}^n y_i,\; \beta+\sum_{i=1}^n x_i\right)\), treating \(x\) as fixed (i.e. \(p(x \mid \theta)=p(x)\)).

4-5(b)

# load data
react <- read.table("https://www2.stat.duke.edu/courses/Fall09/sta290/datasets/Hoffdata/cancer_react.dat", header=TRUE)

non_react <- read.table("https://www2.stat.duke.edu/courses/Fall09/sta290/datasets/Hoffdata/cancer_noreact.dat", header=TRUE)
a1 <- 1; b1 <- 1 # prior information for react
a2 <- 1; b2 <- 1 # prior information for non-react
sy_react <- sum(react$y); sx_react <- sum(react$x) # sufficient statistic for react
sy_n_react <- sum(non_react$y); sx_n_react <- sum(non_react$x) # sufficient statistic for non-react

theta <- seq(0, 5, 0.01) # support of theta

pos_1 <- dgamma(theta, sy_react + a1, b1 + sx_react)
pos_2 <- dgamma(theta, sy_n_react + a2, b2 + sx_n_react)

plot_d <- matrix(c(pos_1, pos_2), ncol=2)
matplot(theta, plot_d, type='l', col=1:2, ylab="density")
legend(x="topright",legend = c("react", "non react"), col=1:2, lwd=2, cex=0.6)

4-5(c)

Compute posterior summaries and plot the posterior densities under several different prior opinions.

output_result <- function(a1, a2, b1, b2){
  print(c(a1, a2, b1, b2))
  expected_react <- (sy_react + a1) / (sx_react + b1)
  nine_five_react <- qgamma(c(.025, .975), sy_react + a1, sx_react + b1)
  density_react <- dgamma(theta, sy_react + a1, sx_react + b1)
  
  expected_n_react <- (sy_n_react + a2) / (sx_n_react + b2)
  nine_five_n_react <- qgamma(c(.025, .975), sy_n_react + a2, sx_n_react + b2)
  density_n_react <- dgamma(theta, sy_n_react + a2, sx_n_react + b2)
  
  # MC simulation
  theta_mc_1 <- rgamma(10000, sy_react + a1, sx_react + b1)
  theta_mc_2 <- rgamma(10000, sy_n_react + a2, sx_n_react + b2)
  p <- mean(theta_mc_2 > theta_mc_1)
  
  # create table to show posterior quantity of interest
  tab <- matrix(c(expected_react, expected_n_react, paste(nine_five_react, collapse="-"), paste(nine_five_n_react, collapse="-"), p, ""), ncol=2, byrow=TRUE)
  colnames(tab) <- c("react", "non react")
  rownames(tab) <- c("Expected value", "95% interval", "P(theta_2 > theta_1)")
  tab <- as.table(tab)
  print(tab)
  
  # plot posterior density
  plot_d <- matrix(c(density_react, density_n_react), ncol=2)
  matplot(theta, plot_d, type='l', col=1:2)
  legend(x="topright",legend = c("react", "non react"), col=1:2, lwd=2, cex=0.6)
}
# option 1: same rate for both types of counties
output_result(2.2*100,2.2*100,100,100)
## [1] 220 220 100 100
##                      react                           
## Ecpected value       2.44102564102564                
## 95% interval         2.2266334491328-2.28184384454256
## P(theta_2 > theta_1) 0.0233                          
##                      non react                        
## Ecpected value       2.20316622691293                 
## 95% interval         2.11772595673103-2.14011682859407
## P(theta_2 > theta_1)

# option 2: informative prior for reactor counties, vague prior for non-reactor counties
output_result(2.2*100,2.2,100,1)
## [1] 220.0   2.2 100.0   1.0
##                      react                           
## Ecpected value       2.44102564102564                
## 95% interval         2.2266334491328-2.28184384454256
## P(theta_2 > theta_1) 0.0229                          
##                      non react                        
## Ecpected value       2.20346820809249                 
## 95% interval         2.11408099084459-2.13749171410179
## P(theta_2 > theta_1)

# option 3: more uncertainty for both
output_result(2.2,2.2,1,1)
## [1] 2.2 2.2 1.0 1.0
##                      react                            
## Ecpected value       2.68958333333333                 
## 95% interval         2.37149686692554-2.45248296202239
## P(theta_2 > theta_1) 0.0019                           
##                      non react                        
## Ecpected value       2.20346820809249                 
## 95% interval         2.11408099084459-2.13749171410179
## P(theta_2 > theta_1)

4-5(d)

Is it reasonable to assume that population size does not affect the fatality rate?

I think it is reasonable to assume that the fatality rate does not depend on population size: cancer is not infectious, so population size or density should not change the per-capita rate parameter of the Poisson distribution. If one believed it did, one could let the per-capita rate itself depend on population size, e.g. \(y_i \sim \operatorname{Poisson}(\theta(x_i)\, x_i)\) for some function \(\theta(\cdot)\).

4-5(e)

It might be the case that, a priori, the rate decomposes as \(\theta = c_1 + c_2 \cdot \mathbf{1}(\text{near reactor})\).

In that case a change in \(c_1\) shifts both \(\theta_1\) and \(\theta_2\), so the two rates are dependent a priori, and it would make sense to model that dependence rather than use independent priors.

4-6

Non-informative prior: sample \(\theta \sim \operatorname{Uniform}(0,1)\), plot the induced distribution of the log odds \(\log[\theta/(1-\theta)]\), and comment on it.

theta <- runif(10000, 0, 1)
log_odd <- log(theta/(1 - theta))
hist(log_odd)

The prior distribution for the log odds has most of its density around 0. The log odds equal 0 when theta = 0.5, so a uniform prior on \(\theta\) is not really non-informative on the log-odds scale. The histogram looks roughly normal, but the induced distribution is actually the standard logistic distribution (this can be shown analytically via a change of variables, as in Exercise 3.10).
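The induced density follows from a change of variables: with \(\gamma=\log [\theta /(1-\theta)]\) and \(\theta \sim \operatorname{Uniform}(0,1)\), \[ \theta=\frac{e^\gamma}{1+e^\gamma}, \qquad p(\gamma)=p_\theta\!\left(\frac{e^\gamma}{1+e^\gamma}\right)\left|\frac{d \theta}{d \gamma}\right|=\frac{e^\gamma}{\left(1+e^\gamma\right)^2}, \] the standard logistic density, which is symmetric and peaked at \(\gamma=0\) (i.e. \(\theta=0.5\)).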

4-7

Mixture models: after a posterior analysis of data from a population of squash plants, it was determined that the total vegetable weight of a given plant could be modeled with a mixture of normal distributions.

4-7(a)

Sample at least 5,000 values of \(y\) from the posterior predictive distribution.

N <- 5000
vars <- 1 / rgamma(N, 10, 2.5) # draw sigma^2 from its posterior (1/sigma^2 ~ gamma(10, 2.5))
theta_mc <- rnorm(N, 4.1, sqrt(vars/20)) # draw theta | sigma^2 from its posterior
y_pred_mc <- .31 * rnorm(N, theta_mc, sqrt(vars)) + .46 * rnorm(N, 2*theta_mc, sqrt(2*vars)) + .23 *  rnorm(N, 3*theta_mc, sqrt(3*vars))
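Note that the last line above adds together weighted draws from the three normal components, which produces a single unimodal distribution rather than a draw from the three-component mixture. To sample from the mixture itself, one would first draw a component label with probabilities (.31, .46, .23) and then draw y from that component; a minimal sketch reusing theta_mc and vars from above:

# sketch: sample from the 3-component normal mixture by first drawing a component label
comp <- sample(1:3, N, replace = TRUE, prob = c(.31, .46, .23))
mu <- cbind(theta_mc, 2 * theta_mc, 3 * theta_mc)[cbind(1:N, comp)] # component means
s2 <- cbind(vars, 2 * vars, 3 * vars)[cbind(1:N, comp)] # component variances
y_pred_mix <- rnorm(N, mu, sqrt(s2))

Sampled this way the predictive distribution is multimodal (modes near \(\theta\), \(2\theta\), \(3\theta\)), so the 75% HPD region in part (c) would likely be a union of disjoint intervals rather than a single interval.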

4-7(b)

Form a 75% quantile-based confidence interval

# 75% quantile-based confidence interval
q_based <- quantile(y_pred_mc, c(0.125, 0.875))
print(q_based)
##    12.5%    87.5% 
## 7.329973 8.437732

4-7(c)

Form a 75% HPD region

# get density
y_pred_dens <- density(y_pred_mc)

# normalize the density values so that they sum to 1
dens_norm <- y_pred_dens$y / sum(y_pred_dens$y)
y <- y_pred_dens$x

# sort discrete probabilities in decreasing order
ordered_p <- dens_norm[order(dens_norm, decreasing = TRUE)]
ordered_y <- y[order(dens_norm, decreasing = TRUE)]

# find the largest index such that the cumulative sum of the sorted probabilities stays below 0.75
threshold <- max(which(cumsum(ordered_p) < 0.75))

# extract y only in the interval (HPD)
ordered_y_in_conf <- ordered_y[1:threshold]

# report the HPD region as a single interval (this assumes the selected points form one contiguous region)
hpd <- c(min(ordered_y_in_conf), max(ordered_y_in_conf))
print(hpd)
## [1] 7.338057 8.436482
# plot both confidence interval and HPD
plot(y, dens_norm, type='l')
abline(v=q_based, col=2)
abline(v=hpd, col=3)

The HPD region and the quantile-based interval are close. This makes sense because the plotted distribution looks roughly symmetric and unimodal, in which case the two kinds of intervals nearly coincide.

Using a mixture of normals as the sampling model makes sense if different parts of the vegetable have different weight distributions.

4-8

More posterior predictive checks: Let \(\theta_A\) and \(\theta_B\) be the average number of children of men in their 30s with and without bachelor’s degrees, respectively.

4-8(a)

Approximate the posterior predictive distributions for the two groups using the data and a \(\operatorname{gamma}(2, 1)\) prior via Monte Carlo.

# load data
data_a <- scan("https://www2.stat.duke.edu/courses/Fall09/sta290/datasets/Hoffdata/menchild30bach.dat") # bach
data_b <- scan("https://www2.stat.duke.edu/courses/Fall09/sta290/datasets/Hoffdata/menchild30nobach.dat") # non bach

# sufficient statistic
sy_a <- sum(data_a)
sy_b <- sum(data_b)
n_a <- length(data_a)
n_b <- length(data_b)
N <- 5000 # number of Monte Carlo draws
a <- 2; b <- 1 # prior information
# sample theta from posterior distribution
theta_mc_a <- rgamma(N, a + sy_a, b + n_a)
theta_mc_b <- rgamma(N, a + sy_b, b + n_b)

# sample y from posterior predictive distribution
y_pred_a <- rpois(N, theta_mc_a)
y_pred_b <- rpois(N, theta_mc_b)
# plot (requires the ggplot2 package)
library(ggplot2)
data <- data.frame(
  var = c(rep("bach", N), rep("non bach", N)),
  value = c(y_pred_a, y_pred_b)
)

p <- ggplot(data, aes(x=value, fill=var)) +
  geom_histogram(color="#e9ecef", alpha=0.6, position="dodge")

p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

4-8(b)

Find 95% confidence intervals for \(\theta_B - \theta_A\) and for \(\tilde{Y}_B - \tilde{Y}_A\).

# 95% confidence interval for theta_b - theta_a
print(c(quantile(theta_mc_b - theta_mc_a, c(0.025, 0.975))))
##      2.5%     97.5% 
## 0.1488143 0.7350778
# 95% confidence interval for y_pred_b - y_pred_a
print(c(quantile(y_pred_b - y_pred_a, c(0.025, 0.975))))
##  2.5% 97.5% 
##    -3     4
  • It is likely that \(\theta_B\) is higher than \(\theta_A\): having a bachelor's degree is associated with fewer children on average.
  • The 95% interval for the difference in \(\theta\) is entirely positive, while the interval for the difference in posterior predictive draws covers both negative and positive values.

4-8(c)

Obtain the empirical distribution of the data in group B. Compare this to the Poisson distribution with mean \(\hat{\theta} = 1.4\).

library(fitdistrplus) # provides plotdist()
plotdist(data_b, "pois", discrete=TRUE, para=list(lambda = 1.4), lwd=2)

The empirical distribution has its peak at 0, while the theoretical Poisson(1.4) distribution peaks at 1. The model fits reasonably for \(y \geq 2\) but not for 0 and 1.
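For reference, the same comparison can be sketched in base R without the fitdistrplus helper (using data_b as loaded above):

# empirical pmf of the group B data vs. the Poisson(1.4) pmf
k <- 0:max(data_b)
emp <- as.numeric(table(factor(data_b, levels = k))) / length(data_b)
theo <- dpois(k, lambda = 1.4)
barplot(rbind(empirical = emp, poisson = theo), beside = TRUE, names.arg = k,
        legend.text = TRUE, xlab = "number of children", ylab = "probability")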

4-8(d)

Examine the adequacy of the Poisson model by counting the number of 0s and 1s in each simulated dataset and comparing these counts with the observed values.

n_0 <- NULL
n_1 <- NULL

n_b <- 218 # sample size of group B
for (i in 1:length(theta_mc_b)){
  y_mc <- rpois(n_b, theta_mc_b[i]) # simulate a dataset using the i-th posterior draw of theta_B
  n_0 <- c(n_0, sum(y_mc == 0))
  n_1 <- c(n_1, sum(y_mc == 1))
}

plot(n_0, n_1)
points(sum(data_b == 0), sum(data_b == 1), col="red")

The observed combination of n_0 and n_1 (red point) looks like an outlier relative to the simulated combinations generated under the Poisson model with \(\theta_B\) drawn from its posterior. This strengthens the argument that the Poisson model is not appropriate for these data.