\(P(\theta_B < \theta_A \mid y_A, y_B)\) will be estimated using Monte Carlo sampling (S = 10,000 draws) in the code below.
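Since the gamma prior is conjugate to the Poisson likelihood, the posteriors being sampled have closed form. With \(\theta_A \sim \text{gamma}(120, 10)\) and \(\theta_B \sim \text{gamma}(12, 1)\), the posteriors are
\[\theta_A \mid y_A \sim \text{gamma}\!\left(120 + \textstyle\sum_i y_{A,i},\; 10 + n_A\right), \qquad \theta_B \mid y_B \sim \text{gamma}\!\left(12 + \textstyle\sum_i y_{B,i},\; 1 + n_B\right),\]
which is exactly what the rgamma() calls below draw from.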
#sampling data
ya <- c(12, 9, 12, 14, 13, 13, 15, 8, 15, 6)
sumya <- sum(ya)
na <- length(ya)
yb <- c(11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7)
sumyb <- sum(yb)
nb <- length(yb)
#priors: theta_A ~ gamma(a1, a2) = gamma(120, 10); theta_B ~ gamma(b1, b2) = gamma(12, 1)
a1 <- 120
a2 <- 10
b1 <- 12
b2 <- 1
#monte carlo sampling
set.seed(8675309)
theta.amc <- rgamma(10000, a1 + sumya, a2 + na)  #posterior draws of theta_A
theta.bmc <- rgamma(10000, b1 + sumyb, b2 + nb)  #posterior draws of theta_B
mean(theta.amc > theta.bmc)  #Monte Carlo estimate of P(theta_B < theta_A | y_A, y_B)
## [1] 0.995
So \(P(\theta_B < \theta_A \mid y_A, y_B) \approx 0.995\).
Now let \(\theta_B \sim \text{gamma}(12 n_0, n_0)\), and observe how \(P(\theta_B < \theta_A \mid y_A, y_B)\) varies as \(n_0\) changes.
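Under this prior the posterior becomes
\[\theta_B \mid y_B \sim \text{gamma}\!\left(12 n_0 + \textstyle\sum_i y_{B,i},\; n_0 + n_B\right),\]
so \(n_0\) acts as a prior sample size: the prior contributes the equivalent of \(n_0\) observations with mean 12. This matches the newb1 + sumyb and newb2 + nb parameters used in the loop below.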
#sampling data
ya <- c(12, 9, 12, 14, 13, 13, 15, 8, 15, 6)
sumya <- sum(ya)
na <- length(ya)
yb <- c(11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7)
sumyb <- sum(yb)
nb <- length(yb)
#priors: theta_A ~ gamma(a1, a2) = gamma(120, 10); theta_B ~ gamma(b1, b2) = gamma(12, 1)
a1 <- 120
a2 <- 10
b1 <- 12
b2 <- 1
#sample 10,000 posterior theta_a and theta_b for each n0 value
set.seed(8675309)
n0 <- c(1:50)
prob <- c()
for(i in 1:length(n0)){
  newb1 <- b1 * n0[i]   #prior shape: 12 * n0
  newb2 <- b2 * n0[i]   #prior rate: n0
  theta.amc <- rgamma(10000, a1 + sumya, a2 + na)
  theta.bmc <- rgamma(10000, newb1 + sumyb, newb2 + nb)
  prob <- c(prob, mean(theta.amc > theta.bmc))   #P(theta_B < theta_A | y_A, y_B) for this n0
}
newprob <- data.frame(n0 = n0, probability = prob)
print(newprob)
## n0 probability
## 1 1 0.9950
## 2 2 0.9944
## 3 3 0.9885
## 4 4 0.9859
## 5 5 0.9835
## 6 6 0.9754
## 7 7 0.9749
## 8 8 0.9680
## 9 9 0.9604
## 10 10 0.9552
## 11 11 0.9476
## 12 12 0.9419
## 13 13 0.9328
## 14 14 0.9300
## 15 15 0.9251
## 16 16 0.9140
## 17 17 0.9092
## 18 18 0.9038
## 19 19 0.8947
## 20 20 0.8861
## 21 21 0.8769
## 22 22 0.8670
## 23 23 0.8574
## 24 24 0.8595
## 25 25 0.8537
## 26 26 0.8458
## 27 27 0.8322
## 28 28 0.8329
## 29 29 0.8250
## 30 30 0.8166
## 31 31 0.8206
## 32 32 0.8097
## 33 33 0.8071
## 34 34 0.8027
## 35 35 0.7860
## 36 36 0.7888
## 37 37 0.7787
## 38 38 0.7762
## 39 39 0.7741
## 40 40 0.7684
## 41 41 0.7660
## 42 42 0.7597
## 43 43 0.7494
## 44 44 0.7434
## 45 45 0.7484
## 46 46 0.7402
## 47 47 0.7382
## 48 48 0.7243
## 49 49 0.7249
## 50 50 0.7216
plot(newprob$n0, newprob$probability, type = "l", xlab = "n0", ylab = "Probability", main = "Posterior Probability (theta_A > theta_B)")
In general, as \(n_0\) increases, the conclusion that \(\theta_A > \theta_B\) becomes less strongly supported. However, I would argue that the conclusion is not especially sensitive to the prior on \(\theta_B\). At \(n_0 = 10\), which matches the prior sample size used for the well-studied population A, \(P(\theta_B < \theta_A \mid y_A, y_B)\) is still greater than 0.95. Even at \(n_0 = 50\), \(P(\theta_B < \theta_A \mid y_A, y_B)\) remains above 0.70.
So the prior on \(\theta_B\) certainly has a tangible effect on the posterior probability, but I do not believe the conclusion is very sensitive to it, since a relatively large \(n_0\) is required to change the conclusion about \(\theta_A > \theta_B\).
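As a quick check of this claim, the newprob data frame built above can be used to locate the smallest \(n_0\) at which the estimated probability drops below 0.95:
#smallest n0 for which P(theta_B < theta_A | y_A, y_B) falls below .95
min(newprob$n0[newprob$probability < 0.95])
With the results printed above, this gives \(n_0 = 11\), consistent with the claim that the conclusion still holds comfortably at \(n_0 = 10\).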
We now repeat parts a) and b) for \(\tilde{Y}_B < \tilde{Y}_A\), where \(\tilde{Y}_A\) and \(\tilde{Y}_B\) are samples from the posterior predictive distributions of populations A and B.
First, we estimate \(P(\tilde{Y}_B < \tilde{Y}_A \mid y_A, y_B)\) using Monte Carlo sampling.
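The predictive draws are generated by composition: first a posterior draw of \(\theta\), then a Poisson draw given that \(\theta\),
\[\theta_A^{(s)} \sim p(\theta_A \mid y_A), \qquad \tilde{Y}_A^{(s)} \sim \text{Poisson}\!\left(\theta_A^{(s)}\right),\]
and similarly for population B. Marginally this gamma-Poisson mixture is a negative binomial distribution, but the Monte Carlo approach below does not require its closed form: the proportion of samples with \(\tilde{Y}_A^{(s)} > \tilde{Y}_B^{(s)}\) estimates \(P(\tilde{Y}_B < \tilde{Y}_A \mid y_A, y_B)\).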
#sampling data
ya <- c(12, 9, 12, 14, 13, 13, 15, 8, 15, 6)
sumya <- sum(ya)
na <- length(ya)
yb <- c(11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7)
sumyb <- sum(yb)
nb <- length(yb)
#priors: theta_A ~ gamma(a1, a2) = gamma(120, 10); theta_B ~ gamma(b1, b2) = gamma(12, 1)
a1 <- 120
a2 <- 10
b1 <- 12
b2 <- 1
#posterior predictive distribution, generated using monte carlo
set.seed(8675309)
theta.amc <- rgamma(10000, a1 + sumya, a2 + na)  #posterior draws of theta_A
theta.bmc <- rgamma(10000, b1 + sumyb, b2 + nb)  #posterior draws of theta_B
ytil.mca <- rpois(10000, theta.amc)  #posterior predictive draws for population A
ytil.mcb <- rpois(10000, theta.bmc)  #posterior predictive draws for population B
mean(ytil.mca > ytil.mcb)  #Monte Carlo estimate of P(Ytilde_B < Ytilde_A | y_A, y_B)
## [1] 0.7011
So we have \(P(\tilde{Y}_B < \tilde{Y}_A \mid y_A, y_B) \approx 0.7011\). Now we repeat the process with the prior \(\theta_B \sim \text{gamma}(12 n_0, n_0)\) and see how the probability varies as \(n_0\) changes.
#sampling data
ya <- c(12, 9, 12, 14, 13, 13, 15, 8, 15, 6)
sumya <- sum(ya)
na <- length(ya)
yb <- c(11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7)
sumyb <- sum(yb)
nb <- length(yb)
#priors: theta_A ~ gamma(a1, a2) = gamma(120, 10); theta_B ~ gamma(b1, b2) = gamma(12, 1)
a1 <- 120
a2 <- 10
b1 <- 12
b2 <- 1
#sample 10,000 posterior theta_a and theta_b for each n0 value
set.seed(8675309)
n0 <- c(1:50)
prob <- c()
for(i in 1:length(n0)){
  newb1 <- b1 * n0[i]   #prior shape: 12 * n0
  newb2 <- b2 * n0[i]   #prior rate: n0
  theta.amc <- rgamma(10000, a1 + sumya, a2 + na)
  theta.bmc <- rgamma(10000, newb1 + sumyb, newb2 + nb)
  ytil.mca <- rpois(10000, theta.amc)   #posterior predictive draws for population A
  ytil.mcb <- rpois(10000, theta.bmc)   #posterior predictive draws for population B
  prob <- c(prob, mean(ytil.mca > ytil.mcb))   #P(Ytilde_B < Ytilde_A | y_A, y_B) for this n0
}
newprob <- data.frame(n0 = n0, probability = prob)
print(newprob)
## n0 probability
## 1 1 0.7011
## 2 2 0.6812
## 3 3 0.6713
## 4 4 0.6536
## 5 5 0.6428
## 6 6 0.6290
## 7 7 0.6243
## 8 8 0.6141
## 9 9 0.6119
## 10 10 0.6021
## 11 11 0.5834
## 12 12 0.5891
## 13 13 0.5854
## 14 14 0.5718
## 15 15 0.5731
## 16 16 0.5704
## 17 17 0.5632
## 18 18 0.5590
## 19 19 0.5570
## 20 20 0.5445
## 21 21 0.5418
## 22 22 0.5486
## 23 23 0.5411
## 24 24 0.5341
## 25 25 0.5344
## 26 26 0.5391
## 27 27 0.5340
## 28 28 0.5344
## 29 29 0.5323
## 30 30 0.5314
## 31 31 0.5214
## 32 32 0.5263
## 33 33 0.5210
## 34 34 0.5316
## 35 35 0.5273
## 36 36 0.5120
## 37 37 0.5154
## 38 38 0.5081
## 39 39 0.5106
## 40 40 0.5047
## 41 41 0.5106
## 42 42 0.5119
## 43 43 0.5097
## 44 44 0.5104
## 45 45 0.4983
## 46 46 0.5125
## 47 47 0.5117
## 48 48 0.5000
## 49 49 0.5085
## 50 50 0.4993
plot(newprob$n0, newprob$probability, type = "l", ylim = c(0.45, 1), xlab = "n0", ylab = "Probability", main = "Posterior Probability (ytilde_A > ytilde_B)")
abline(h = .95, col = "blue")  #reference line at the .95 level
As \(n_0\) increases, \(P(\tilde{Y}_B < \tilde{Y}_A \mid y_A, y_B)\) decreases. I would argue that the conclusion about \(\tilde{Y}_B < \tilde{Y}_A\) is sensitive to the prior on \(\theta_B\). This differs from part b) because \(P(\tilde{Y}_B < \tilde{Y}_A \mid y_A, y_B)\) starts at a relatively low value of 0.7011 at \(n_0 = 1\), so we are only somewhat confident that \(\tilde{Y}_B < \tilde{Y}_A\) to begin with. Once \(n_0\) exceeds roughly 20, the estimated probability stays in the 0.50 to 0.55 range, and we are very uncertain about which posterior prediction is larger.
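As a quick numeric check on this, the range of estimated probabilities for \(n_0 > 20\) can be read off the newprob data frame from the predictive comparison above:
#range of estimated P(Ytilde_B < Ytilde_A | y_A, y_B) for n0 > 20
range(newprob$probability[newprob$n0 > 20])
From the table above this spans roughly 0.50 to 0.55.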
To investigate the adequacy of the Poisson model for the tumor count data, 1,000 posterior predictive datasets will be generated for each population, each the same size as the observed data (\(n_A = 10\) for population A and \(n_B = 13\) for population B).
For each predictive dataset \(y^{(s)}\), let \(t^{(s)} = \bar{y}^{(s)} / \text{sd}(y^{(s)})\).
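This statistic targets the Poisson property that the mean equals the variance: if \(y_1, \dots, y_n \sim \text{Poisson}(\theta)\), then
\[E[\bar{y}] = \theta = E[s^2], \quad \text{so} \quad t = \bar{y}/s \approx \sqrt{\theta}.\]
An observed \(t\) far out in either tail of the simulated \(t^{(s)}\) values therefore signals over- or under-dispersion relative to the Poisson model.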
ya <- c(12, 9, 12, 14, 13, 13, 15, 8, 15, 6)
sumya <- sum(ya)
na <- length(ya)
a1 <- 120
a2 <- 10
set.seed(8675309)
t.mc <- c()
for(s in 1:1000){
  theta1 <- rgamma(1, a1 + sumya, a2 + na)   #draw theta_A from its posterior
  y1.mc <- rpois(na, theta1)                 #predictive dataset of size n_A = 10
  t.mc <- c(t.mc, mean(y1.mc)/sd(y1.mc))     #test statistic t = mean/sd
}
t.obs <- mean(ya)/sd(ya)
hist(t.mc, main ="Histogram of t", xlab = "t")
abline(v = t.obs, col = "blue",lwd = 2)
As shown by the vertical blue line, the observed value of the statistic falls close to the mode of the histogram. Therefore, the Poisson model was likely a good fit for the data based on this test statistic.
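As an optional numeric companion to this visual check (not part of the original comparison), a Monte Carlo tail probability can be computed from the t.mc and t.obs objects above; values close to 0 or 1 indicate poor fit. The same line can be reused after the population B simulation below.
#proportion of simulated statistics at least as large as the observed one
mean(t.mc >= t.obs)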
Now we repeat the same process for population B.
yb <- c(11, 11, 10, 9, 9, 8, 7, 10, 6, 8, 8, 9, 7)
sumyb <- sum(yb)
nb <- length(yb)
b1 <- 12
b2 <- 1
set.seed(8675309)
t.mc <- c()
for(s in 1:1000){
  theta1 <- rgamma(1, b1 + sumyb, b2 + nb)   #draw theta_B from its posterior
  y1.mc <- rpois(nb, theta1)                 #predictive dataset of size n_B = 13, matching the observed data
  t.mc <- c(t.mc, mean(y1.mc)/sd(y1.mc))     #test statistic t = mean/sd
}
t.obs <- mean(yb)/sd(yb)
hist(t.mc, main ="Histogram of t", xlab = "t")
abline(v = t.obs, col = "blue",lwd = 2)
For population B, the Poisson model does not appear to fit the data well based on this test statistic: the observed value falls in the right tail of the predictive distribution, suggesting the observed counts are less variable (relative to their mean) than the Poisson model would predict.