Homework: Model Checking

Exercise 6.1

Posterior predictive checking:

  • On page 120, the data from the SAT coaching experiments were checked against the model that assumed identical effects in all eight schools: the expected order statistics of the effect sizes were \((26, 19, 14, 10, 6, 2, −3, −9)\), compared to observed data of \((28,18, 12, 8, 7, 1, −1, −3)\). Express this comparison formally as a posterior predictive check comparing this model to the data. Does the model fit the aspect of the data tested here?

Response: Formally, in a posterior predictive check, values of the unknown parameter are drawn from the posterior distribution and used to generate replicated data \(y^{rep}\) from the posterior predictive distribution. In this problem the unknown parameter is \(\mu\), while the \(\sigma_j^2\) are known, since “Under the hypothesis that all experiments have the same effect and produce independent estimates of this common effect, we could treat the data in Table 5.2 as eight normally distributed observations with known variances” (Gelman, pp. 119-120). We proceed as follows:

  1. Draw \(\mu\) values from the posterior distribution \(N(7.7,4.1^2)\), since we assume a common effect in all schools, i.e., a pooled estimate. Use them to generate \(y^{rep}\):
# Observed effects (sorted) and expected order statistics given in the problem
y_obs <- sort(c(28, 8, -3, 7, -1, 1, 18, 12))
y_checked <- sort(c(26, 19, 14, 10, 6, 2, -3, -9))

# Posterior of the common effect mu under complete pooling: N(7.7, 4.1^2)
mu <- 7.7
sd_mu <- 4.1

# Known sampling standard deviations of the eight schools
sd_known <- c(15, 10, 16, 11, 9, 11, 10, 18)

N <- 200  # number of replicated data sets

# Storage for the draws of mu and the replicated (sorted) data sets
mu_gen <- rep(NA, N)
yrep <- matrix(NA, nrow = N, ncol = 8)

for (j in 1:N) {
  # One draw of the common effect, shared by all eight schools
  mu_gen[j] <- rnorm(1, mean = mu, sd = sd_mu)
  # Replicated effects with the known school-level variances, then sorted
  yrep[j, ] <- sort(rnorm(8, mean = mu_gen[j], sd = sd_known))
}

To compare the replicated data with the observed data, we use test quantities. Rather than choosing a single arbitrary test quantity, we treat each order statistic as its own test quantity and plot the distribution of each sorted component of \(y^{rep}\):

# One histogram per order statistic of the replicated data,
# with the expected order statistic (red) and the observed value (green)
par(mfrow = c(3, 3))
for (i in 1:8) {
  hist(yrep[, i])
  abline(v = y_checked[i], col = "red")
  abline(v = y_obs[i], col = "green")
}

The histograms show that the replicated order statistics (bars) are centered near the expected order statistics given in the problem (red lines), confirming that the generation of \(y^{rep}\) is consistent with the text. More importantly, the observed order statistics (green lines) also fall well within the bulk of their replicated distributions. Therefore, the model does, in fact, fit this aspect of the data.
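As a complementary numerical summary (not part of the check above), each order statistic can be given its own posterior predictive \(p\)-value, reusing the yrep and y_obs objects already defined; a minimal sketch:

# Posterior predictive p-value for each order statistic:
# the proportion of replications whose i-th sorted value is at least
# as large as the i-th sorted observed value
p_order <- sapply(1:8, function(i) mean(yrep[, i] >= y_obs[i]))
round(p_order, 2)

Values far from 0 and 1 for all eight order statistics would confirm numerically what the histograms show visually.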

  • Explain why, even though the identical-schools model fits under this test, it is still unacceptable for some practical purposes.

If we adopt an identical-effects model, we are asserting that coaching works equally well in all eight schools. For some practical purposes this is unacceptable: for example, a researcher deciding where to continue experimenting would want to concentrate on the schools with the better apparent performance, a question the identical-effects model cannot even pose. So we should move away from complete pooling and look for a compromise that combines information from all eight experiments without assuming all the \(\theta_j\)'s are equal. A natural choice is a hierarchical model, which relaxes the strong pooling assumption: it distinguishes the within-school and between-school variation and yields posterior inferences that reflect both partial pooling and the uncertainty in the hyperparameters.
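For reference (this formula is not part of the original answer), the compromise can be seen explicitly in the conditional posterior mean of each school effect under the hierarchical normal model with hyperparameters \(\mu\) and \(\tau\) (Gelman et al., Chapter 5):

\[E(\theta_j \mid \mu,\tau,y)=\frac{\frac{1}{\sigma_j^2}\,y_j+\frac{1}{\tau^2}\,\mu}{\frac{1}{\sigma_j^2}+\frac{1}{\tau^2}},\]

which reduces to complete pooling (\(\theta_j=\mu\)) as \(\tau\to 0\) and to no pooling (\(\theta_j=y_j\)) as \(\tau\to\infty\), with intermediate values of \(\tau\) giving the partial pooling described above.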

Exercise 6.5

  • Hypothesis testing: discuss the statement, ‘Null hypotheses of no difference are usually known to be false before the data are collected; when they are, their rejection or acceptance simply reflects the size of the sample and the power of the test, and is not a contribution to science’ (Savage, 1957, quoted in Kish, 1965). If you agree with this statement, what does this say about the model checking discussed in this chapter?

By definition, we reject the null hypothesis when the test quantity computed on the replicated data differs too much from the one computed on the observed data, i.e. when the \(p\)-value is extreme. The \(p\)-value depends on the choice of test quantity, so that choice matters: in particular, a test quantity that is (close to) a sufficient statistic of the model is a poor choice, because with little prior information the replications are automatically centered on the observed value of that statistic and the check can never signal a problem. The \(p\)-value also depends strongly on the sample size: with enough data even a tiny, practically irrelevant discrepancy produces an extreme \(p\)-value, while with little data even a large discrepancy may not. In that sense I agree with the statement: rejection or acceptance of a point null hypothesis of no difference mostly reflects the sample size, the power of the test, and the chosen test quantity, rather than being an objective scientific contribution by itself. For the model checking discussed in this chapter, this means posterior predictive checks should not be read as accept/reject decisions about a model we already know is not exactly true; their purpose is to learn whether the model's deficiencies matter for the aspects of the data we actually care about.
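As an illustration of the sufficient-statistic point (this example is mine, not part of the exercise), consider a \(N(\theta,1)\) model with an essentially flat prior and the test quantity \(T(y)=\overline{y}\): the posterior predictive \(p\)-value comes out near \(0.5\) no matter how badly the model fits, so the check carries essentially no information. A minimal R sketch:

set.seed(123)
n <- 100
y <- rnorm(n, mean = 3, sd = 5)  # data that clearly violate the N(theta, 1) model

N <- 10000
ybar_rep <- rep(NA, N)
for (i in 1:N) {
  theta <- rnorm(1, mean = mean(y), sd = sqrt(1 / n))  # posterior draw under a flat prior
  ybar_rep[i] <- mean(rnorm(n, mean = theta, sd = 1))  # replicated sample mean
}
mean(ybar_rep >= mean(y))  # close to 0.5: the sufficient statistic cannot detect the misfit

A test quantity that is not (close to) sufficient, such as the sample variance, would expose the misfit immediately.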

Exercise 6.7

Prior vs. posterior predictive checks (from Gelman, Meng, and Stern, 1996): consider 100 observations, \(y_1,\ldots,y_n\), modeled as independent samples from a \(N(\theta, 1)\) distribution with a diffuse prior distribution, say, \(p(\theta) = \frac{1}{2A}\) for \(\theta \in [-A,A]\) with some extremely large value of \(A\), such as \(10^5\). We wish to check the model using, as a test statistic, \(T(y)=\max_i|y_i|\): is the maximum absolute observed value consistent with the normal model? Consider a data set in which \(\overline{y}=5.1\) and \(T(y)=8.1\).

  • What is the posterior predictive distribution for \(y^{rep}\)? Make a histogram for the posterior predictive distribution of \(T(y^{rep})\) and give the posterior predictive \(p-\)value for the observation \(T(y)=8.1\).

To obtain the posterior predictive distribution, we first derive the posterior distribution of \(\theta\).

By Bayes' theorem:

\[p(\theta|y)=\frac{\frac{1}{(2\pi)^{n/2}}\exp\!\left(-\frac{1}{2}\sum_{i=1}^n(y_i-\theta)^2\right)\cdot\frac{1}{2A}\,\mathbb{1}_{[-A,A]}(\theta)}{\int_{-A}^A\frac{1}{(2\pi)^{n/2}}\exp\!\left(-\frac{1}{2}\sum_{i=1}^n(y_i-\theta)^2\right)\cdot\frac{1}{2A}\,d\theta},\]

Using \(\sum_{i=1}^n(y_i-\theta)^2=\sum_{i=1}^n(y_i-\overline{y})^2+n(\overline{y}-\theta)^2\), where the first term does not depend on \(\theta\) and cancels between numerator and denominator, a few algebraic steps give:

\[p(\theta|y)\approx\frac{1}{\sqrt{2\pi/n}}\,e^{-\frac{n}{2}(\overline y-\theta)^2}\,\mathbb{1}_{[-A,A]}(\theta)\;\propto\;N\!\left(\overline y,\tfrac{1}{n}\right),\]

i.e., for \(A\) as large as \(10^5\) the posterior is effectively \(N(5.1,\,1/100)\).
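To answer the question directly (this step is implicit in the simulation below): since \(y_i^{rep}\mid\theta\sim N(\theta,1)\) and \(\theta\mid y\approx N(\overline y,\tfrac{1}{n})\), each replicated observation has posterior predictive distribution

\[y_i^{rep}\mid y\;\sim\;N\!\left(\overline y,\;1+\tfrac{1}{n}\right)=N(5.1,\,1.01),\]

and \(T(y^{rep})=\max_i|y_i^{rep}|\) is the maximum of 100 such independent draws in absolute value.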

Then \(y^{rep}\) can be drawn and \(T(y^{rep})\) computed as follows:

# Define necessary objects
set.seed(1111)
N <- 10000        # number of replicated data sets
Nobs <- 100       # observations per data set
theta <- rep(0, N)
yrep <- matrix(NA, nrow = N, ncol = Nobs)
Tstat_obs <- 8.1  # observed T(y) = max|y_i|

mu <- 5.1         # posterior mean of theta (= ybar)
var <- 1 / Nobs   # posterior variance of theta (= 1/n)

# Draw theta from the posterior, then generate yrep;
# the last column of the sorted |yrep| values is T(yrep) = max|y_i^rep|
for (i in 1:N) {
  theta[i] <- rnorm(1, mean = mu, sd = sqrt(var))
  yrep[i, ] <- sort(abs(rnorm(Nobs, mean = theta[i], sd = 1)))
}

# Posterior predictive p-value: Pr(T(yrep) >= T(y))
mean(yrep[, 100] >= Tstat_obs)
[1] 0.1334

# Histogram of T(yrep) with the observed value marked in red
hist(yrep[, 100])
abline(v = Tstat_obs, col = "red")

The posterior predictive \(p\)-value of about \(0.13\) is not extreme: as the histogram shows, the observed maximum \(8.1\) lies in the right tail of the replicated \(T(y^{rep})\) values but well within their range. Under this test quantity, then, the observed maximum is consistent with the posterior predictive distribution.

  • The prior predictive distribution is \(p(y^{rep})=\int p(y^{rep}|\theta)p(\theta)d\theta\) (Compare to equation (6.1).) What is the prior predictive distribution for \(y^{rep}\) in this example? Roughly sketch the prior predictive distribution of \(T(y^{rep})\) and give the approximate prior predictive \(p-\)value for the observation \(T(y)=8.1\).

Since we are now performing a prior predictive check, the \(\theta\) values are drawn from the prior distribution and then used to generate \(y^{rep}\).
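For completeness (this derivation is not spelled out in the write-up above), the prior predictive density of a single replicated observation is essentially flat over \([-A,A]\):

\[p(y_i^{rep})=\int_{-A}^{A}N(y_i^{rep}\mid\theta,1)\,\frac{1}{2A}\,d\theta=\frac{\Phi(y_i^{rep}+A)-\Phi(y_i^{rep}-A)}{2A}\approx\frac{1}{2A}\quad\text{for }|y_i^{rep}|\ll A,\]

so \(T(y^{rep})=\max_i|y_i^{rep}|\) is of the order of \(|\theta|\), which is huge with high prior probability. The simulation below confirms this: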

# Prepare necessary variables
theta_p <- rep(0, N)
yrep_p <- matrix(NA, nrow = N, ncol = Nobs)
A <- 10^5

# Draw theta from the uniform prior, then generate yrep
for (j in 1:N) {
  theta_p[j] <- runif(1, -A, A)
  yrep_p[j, ] <- sort(abs(rnorm(Nobs, mean = theta_p[j], sd = 1)))
}

# Prior predictive p-value: Pr(T(yrep) >= T(y))
mean(yrep_p[, 100] >= Tstat_obs)
[1] 0.9999

When data are generated from the prior predictive distribution, no observed data are used to learn about \(\theta\): all the information about the unknown parameter comes from the prior. With the prior of this exercise, where \(A\) is extremely large, the maximum of the generated data is almost always greater than the observed maximum; roughly, since \(T(y^{rep})\approx|\theta|\) and \(|\theta|\) is uniform on \([0,A]\), \(\Pr\!\left(T(y^{rep})\ge T(y)\right)\approx 1-\frac{8.1}{A}\approx 0.9999\).

  • Your answers for (a) and (b) should show that the data are consistent with the posterior predictive but not the prior predictive distribution. Does this make sense? Explain.

The non-informative prior, with its very large value of \(A\), spreads the probability mass over an enormous range of parameter values, assigning substantial probability to regions that the data do not support at all. As a result, the prior predictive distribution produces replicated data sets that look nothing like the observed data: the observed \(T(y)=8.1\) is smaller than essentially every replicated maximum, which is exactly what the extreme \(p\)-value (\(\approx 1\)) reflects. The posterior distribution, on the other hand, updates the prior with the observed data; since the prior is non-informative and the data are informative, the data dominate and the posterior is essentially determined by them. It therefore makes sense that the posterior predictive distribution is consistent with the data: the 100 observations contain enough information to correct the overly diffuse prior, producing replications that represent the underlying process much more accurately.