A model needs to be checked: can it predict the data used to fit it? Can it recover the parameters that generated the data? Here we use data simulation for model evaluation, exploiting the fact that generative models work both forward and backward.

Logistics

Resources

  • Software:

    • on Canvas: /Modules/Code and data/clarkFunctions2026.r

For next time

  • Unit 4 problems (this vignette) to Canvas/Discussion folder

Today’s plan

  • Complete vignette 3

  • This vignette, including Exercise 1

Objectives:

  • Understand a generative model that captures the joint distribution [data, parameters]
  • Use simulation for model checking
  • Check coverage for estimates and predictions.


Once you have downloaded the file clarkFunctions2026.R from Canvas, move it to your working directory and source it this way:

source('clarkFunctions2026.R')

From last time

Here are the questions from the last vignette:

Exercise 1: Based on plots of soil moisture and climate deficit, does it appear useful to include these predictors in a model for biomass of this species? Do both variables bring information?

Exercise 2: Repeat the regression analysis using function lm with a different species as a response. Interpret the analysis.

Exercise 3: Using a different species and predictors, interpret differences between estimates for the Tobit and standard regression models.

Questions??

Evaluate a generative model

Models that involve computation (i.e., all of them) require algorithms. Many things can go wrong, so models must be checked. Model checking can be done in a number of ways, but ultimately the model should recover the right parameter values and predict the data used to fit it.

The computation that yields parameter estimates from data is the inverse of the generative process. Returning to the Bayesian Tobit model as an example, consider the model graph below:

Model graph for the Bayesian Tobit model. Censored values of w have point mass in y, e.g., zero.

A generative model is one that works backwards and forwards. Said another way, it can predict the data used to fit the model. We could think of these two directions as forward simulation (up the graph) and estimation or inference (down). Recall that the posterior distribution, [parameters | data], is in some sense the “inverse” of the data-generating likelihood, [data | parameters]. If you think about this, a generative model has to properly capture the joint distribution [data, parameters]; otherwise, I could not invert it to recover inputs (estimation) or data (simulation). Of course, this requires Bayes’ theorem,

\[ [ P | D ] \propto [ D | P] [P] \]

Not all models are generative (i.e., not all can predict the data used to fit the model). We can use data simulated in forward mode to check if there is parameter recovery in the inverse mode.

Forward simulation

Here I simulate data by working through the model graph from the bottom up. I use the Tobit model from vignette 3.

1. Parameters from prior

The graph shows the prior distributions at the bottom; in this case the prior is

\[ \begin{align} [ \boldsymbol{\beta}, \sigma^2 ] &= [ \boldsymbol{\beta} ] \times [\sigma^2 ] \\ &= MVN_p( \boldsymbol{\beta} | \mathbf{b}, \mathbf{B} ) \times IG(\sigma^2 | s_1, s_2 ) \end{align} \] Notice that coefficients \(\boldsymbol{\beta}\) are a priori independent of the variance parameter \(\sigma^2\). They will not be independent a posteriori. Because they are independent at this stage I can simulate them independently:

Coefficient vector: For the coefficients in length-\(p\) vector \(\boldsymbol{\beta}\) I have a multivariate normal distribution. In the code that follows I specify this prior distribution,

\[ \mathbf{b}' = [0 \quad 0 \quad 0 \quad 0 \quad 0], \quad \mathbf{B} = \left[ \begin{matrix} 10 & 0 & 0 & 0 & 0\\ 0 & 10 & 0 & 0 & 0\\ 0 & 0 & 10 & 0 & 0\\ 0 & 0 & 0 & 10 & 0\\ 0 & 0 & 0 & 0 & 10\\ \end{matrix} \right] \] From this prior distribution I generate coefficients by drawing a random vector from \(MVN_p( \boldsymbol{\beta} | \mathbf{b}, \mathbf{B} )\)

p <- 5
n <- 1000
b <- matrix( 0, p )
B <- diag( p )*10
beta <- t( myrmvnorm( 1, b, B ) )
vars <- paste( 'x', c( 1:( p - 1) ), sep = '_' )
rownames( beta ) <- c('Intercept', vars ) 

Take a look at the coefficients in beta.
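For example, one way to print the draw:

t( signif( beta, 2 ) )    # one prior draw of the p coefficients, labeled by predictor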

Now I draw a value for the residual variance from \(IG(\sigma^2 | s_1, s_2 )\),

s1 <- s2 <- 1
sigma2 <- 1/rgamma( 1, s1, s2 )
sigma  <- sqrt( sigma2 )

Here is the inverse gamma distribution, which has some useful properties that we will introduce later. This prior distribution \(IG( \sigma^2| 1, 1 )\) looks like this:

Inverse gamma prior distribution, showing the value of sigma2 drawn at random.
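The figure can be reproduced without any special functions. Here is a minimal sketch that evaluates the \(IG(s_1, s_2)\) density by a change of variables on dgamma (if \(1/\sigma^2 \sim Gamma(s_1, s_2)\), then the density of \(\sigma^2\) at \(x\) is dgamma(1/x, s1, rate = s2)/x^2) and marks the value of sigma2 drawn above:

xseq <- seq( .01, 10, length = 500 )                 # grid of sigma2 values
dens <- dgamma( 1/xseq, s1, rate = s2 )/xseq^2       # IG( s1, s2 ) density by change of variables
plot( xseq, dens, type = 'l', bty = 'n', las = 1,
      xlab = expression( sigma^2 ), ylab = 'Density' )
abline( v = sigma2, lty = 2 )                        # the randomly drawn value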

2. Predictors (design matrix)

Now I need a design matrix \(\mathbf{X}\) with \(n\) rows and \(p\) columns. The model will have an intercept, so the first column will have all ones. I’ll simply fill the other columns with values drawn from \(N( 0, 10^2 )\),

X <- matrix( rnorm( n*p, 0, 10 ), n, p ) 
X[,1] <- 1
colnames( X ) <- rownames( beta )

This is the design matrix.
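A quick look confirms the column of ones for the intercept and the simulated predictors:

head( round( X, 1 ) )    # first rows of the design matrix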

3. Latent states and data

Were this a standard regression model, the responses would be simulated as \(y_i \sim N( \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2)\). In code this could be

y <- rnorm( n, X%*%beta, sigma )

Recall that the Tobit has negative values censored at zero, the \(\mathbf{w}\) stage in the graph,

\[ y_i = \begin{cases} w_i & \quad w_i > 0\\ 0 & \quad w_i \leq 0\\ \end{cases} \] So I need to adjust this:

y <- w <- rnorm( n, X%*%beta, sigma )
y[ w < 0 ] <- 0
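How much censoring this produces depends on the coefficients and variance that were drawn; a quick tally:

mean( y == 0 )    # fraction of observations censored at zero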

I have now simulated data starting from the prior. The correlation structure looks like this:

pairs( cbind( y, X[,-1] ), pch = 16, cex = .3, las = 1 )

Note the relationship between \(w\) and \(y\):

plot( w, y, cex = .2, bty = 'n', las = 1 )

Convince yourself that we have worked our way up the model graph, from prior to parameters to data:

\[ [data | parameters] \leftarrow [ parameters | prior ] \leftarrow [prior] \]

Inverse estimation: Parameter recovery, predictions

To predict data I worked my way up from the bottom of the graph. To estimate parameters I work down from the top,

\[ [data | parameters] \rightarrow [ parameters | data, prior ] \]

All that remains is to fit the model:

data     <- data.frame( y, X[,-1] )
form     <- as.formula( paste( 'y ~ ', paste0( vars, collapse = ' + ' ) ) )
fitTobit <- bayesReg( form , data, TOBIT = T )
## fitted as Tobit model

Look at the formula in form and the data.
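For example:

form           # formula assembled from the simulated predictor names
head( data )   # response y and predictors x_1, ..., x_4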

Here is a plot of some outputs:

par( mfrow = c(1,2), bty='n', las = 1 )
plot( y, fitTobit$predict[, '50%'], cex = .3, xlab = 'Observed y',
      ylab = 'Predicted y', pch = 16, asp = 1 )
abline( 0, 1, lty=2 )

plot( beta, fitTobit$beta[,1], xlab = 'True beta', ylab = 'Estimated beta' )
segments( beta, fitTobit$beta[,2], beta, fitTobit$beta[,3] )
text( beta[,1], fitTobit$beta[,1], rownames(beta), pos = 4, cex = .8 )
abline( 0, 1, lty=2 )

Left: Tobit predictions (posterior medians) versus observed y. Right: true coefficients versus posterior estimates with 95% credible intervals. Dashed lines show 1:1 agreement.

The Tobit model developed here is generative: It predicts the data used to fit the model. It recovers the parameters used to simulate the data. In addition to the coefficients shown in the plot above, it also recovers the variance sigma2 = 2.9154851, which is estimated to be fitTobit$sigma2 = 2.764 with 95% credible interval (2.469, 3.13).

Predict from the posterior distribution

The plot of predicted y versus observed y shows the observations that are predicted from the posterior distribution. They look tight, but are they reasonable? Coverage is the idea that a 95% credible (parameters) or predictive (responses) interval should include the true value 95% of the time. This will never be precisely true; among other things, there is a prior distribution that pulls the predictions away from the data. However, it’s good to know if I’m in the ballpark. For the Tobit model I exclude the zeros because they are too easy. Here is the percent of observations that fall within the 95% predictive interval (PI):

yhat <- fitTobit$predict
yhat <- yhat[ y > 0, ]                                # non-zero observations
yin  <- length( inside( y[ y > 0 ], yhat[, 3:4] ) )   # observed y is within PI
perc <- yin/nrow(yhat)*100

For this example, I find that the 95% PI includes 95% of the observations.

I could also check coverage for the coefficients:

inside( beta, fitTobit$beta[,2:3] )
## [1] 1 2 3 4 5

Of the p = 5 coefficients, all 5 are within the 95% credible interval. If I wanted to evaluate coverage for the coefficients more thoroughly, I might simulate a large number of data sets, estimate parameters, and tally the fraction of credible intervals that contain the true coefficient values, as in the sketch below.
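Here is a minimal sketch of that idea, reusing form and the prior settings from above together with the sourced functions myrmvnorm, bayesReg, and inside. The number of replicates nsim is arbitrary, and the loop can be slow because each pass refits the model:

nsim <- 50                                        # number of simulated data sets (arbitrary)
hits <- rep( 0, p )                               # per-coefficient count of CIs covering the truth
names( hits ) <- c( 'Intercept', vars )

for( k in 1:nsim ){
  betaTrue <- t( myrmvnorm( 1, b, B ) )           # parameters drawn from the prior
  sigTrue  <- sqrt( 1/rgamma( 1, s1, s2 ) )
  X <- matrix( rnorm( n*p, 0, 10 ), n, p )        # new design matrix
  X[,1] <- 1
  colnames( X ) <- names( hits )
  y <- w <- rnorm( n, X%*%betaTrue, sigTrue )     # simulate, then censor at zero
  y[ w < 0 ] <- 0
  data <- data.frame( y, X[,-1] )
  fit  <- bayesReg( form, data, TOBIT = T )       # fit to the simulated data
  ok   <- inside( betaTrue, fit$beta[,2:3] )      # true values inside the 95% CI
  hits[ ok ] <- hits[ ok ] + 1
}
hits/nsim                                         # per-coefficient coverage, ideally near 0.95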

We will talk more about predicting from the posterior when we discuss Markov chain Monte Carlo (MCMC).

Prior versus posterior coefficients

Recall that my prior distribution for coefficients was centered on zero with a covariance matrix having 10 along the diagonal (the variances) and zeros everywhere else (the covariances). We already saw that the posterior estimates are not zero. There is also non-zero posterior covariance. Here is the covariance matrix for \(\boldsymbol{\hat{\beta}}\):

signif( fitTobit$betaVar, 3 )
##           intercept       x_1       x_2       x_3       x_4
## intercept  1.14e-02  8.48e-04 -2.79e-04  4.62e-05  2.79e-04
## x_1        8.48e-04  1.22e-04 -2.09e-05  5.98e-06  1.99e-05
## x_2       -2.79e-04 -2.09e-05  5.43e-05 -1.31e-06 -6.99e-06
## x_3        4.62e-05  5.98e-06 -1.31e-06  5.26e-05  3.20e-07
## x_4        2.79e-04  1.99e-05 -6.99e-06  3.20e-07  5.78e-05

The standard errors are the square roots of the diagonal of this matrix:

signif( sqrt( diag( fitTobit$betaVar ) ), 3 )
## intercept       x_1       x_2       x_3       x_4 
##   0.10700   0.01100   0.00737   0.00725   0.00760

Covariance matrices are hard to interpret due to the differences in scale for the different predictors. It is easier to digest the posterior correlation. Recall that correlation can be obtained from covariance as

\[ \mbox{cor}( x_{i}, x_{j} ) = \frac{\sigma_{i,j}}{\sqrt{\sigma_i^2\sigma_{j}^2}} \] where the numerator is the covariance between elements \(i\) and \(j\), and the denominator holds their variances. The function cov2cor converts a covariance matrix to a correlation matrix:

corBeta <- cov2cor( fitTobit$betaVar )
signif( corBeta, 3 )
##           intercept     x_1     x_2     x_3     x_4
## intercept    1.0000  0.7200 -0.3550  0.0597  0.3440
## x_1          0.7200  1.0000 -0.2570  0.0747  0.2370
## x_2         -0.3550 -0.2570  1.0000 -0.0246 -0.1250
## x_3          0.0597  0.0747 -0.0246  1.0000  0.0058
## x_4          0.3440  0.2370 -0.1250  0.0058  1.0000

Note that some of these correlations are substantial.
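To connect this output to the formula above, the same matrix can be computed directly from the covariance matrix:

V <- fitTobit$betaVar
D <- sqrt( diag( V ) )            # posterior standard errors
corManual <- V/outer( D, D )      # sigma_ij / sqrt( sigma_i^2 * sigma_j^2 )
range( corManual - corBeta )      # differences are numerically zero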

A few notes on data prediction

When simulating data it is tempting to assume sample sizes and levels of “noise” comparable to those that might be encountered in actual data. That’s fine, but when checking a model, start by ensuring there will be signal: large \(n\) and small \(\sigma^2\). If I use a small sample size \(n\) and a large residual variance \(\sigma^2\), it will be hard to know whether the lack of agreement between true and estimated parameters, or between data and predictions, results from noisy simulated data or a faulty algorithm. First determine whether the algorithm is reliable. Then explore the impact of sample size and noise.
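As a sketch of this advice, the forward simulation above can be wrapped in a small helper (simTobit is a hypothetical name, not part of clarkFunctions2026.R) so that n and sigma are easy to vary while holding the coefficients fixed:

simTobit <- function( n, sigma, beta ){           # hypothetical helper, not in clarkFunctions2026.R
  p <- length( beta )
  X <- matrix( rnorm( n*p, 0, 10 ), n, p )        # predictors simulated as above
  X[,1] <- 1
  y <- w <- rnorm( n, X%*%beta, sigma )
  y[ w < 0 ] <- 0                                 # Tobit censoring at zero
  list( y = y, X = X, w = w )
}
strongSignal <- simTobit( n = 5000, sigma = .5, beta = beta )   # large n, small residual SD
weakSignal   <- simTobit( n = 100,  sigma = 5,  beta = beta )   # small n, large residual SD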

Recap

Generative models return us to Bayes’ Theorem. There are two ways to factor the joint distribution of data and parameters \([\mbox{parameters}, \mbox{data}]\),

\[ \begin{align} [\mbox{parameters} | \mbox{data}] \times [ \mbox{data} ] &= [\mbox{data} | \mbox{parameters}] \times [ \mbox{parameters} ]\\ [\mbox{parameters} | \mbox{data}] & \propto [\mbox{data} | \mbox{parameters}] \times [ \mbox{parameters} ] \end{align} \] The left side is the posterior distribution. The right side is the likelihood \(\times\) prior. The left side gives us parameter estimates. The right side gives us data. Models are transparent when this relationship holds.

A generative model allows both parameter estimation and prediction; each can be viewed as the inverse of the other. Prediction is one of the most valuable ways to evaluate a model, both to check that the algorithms are working and to confirm that the model can recover the correct estimates.


Exercise 1: Model performance will depend on \(n\), \(p\), \(\sigma^2\) and the range of simulated values in \(\boldsymbol{\beta}\) and \(\mathbf{X}\). Generate simulated data sets to discover the effects of each of these elements on predictions and parameter recovery.


Appendix

Multivariate normal

We will discuss the multivariate normal distribution in later vignettes. For now, I’ve just used the function myrmvnorm to draw a random vector of coefficients.

The arguments to myrmvnorm are, in order, the number of random vectors, the mean vector, and the covariance matrix. If the mean vector has \(p\) elements then the covariance matrix must be \(p \times p\). A valid covariance matrix is positive definite. The easiest way to generate one is to simulate \(n\) random vectors, where \(n\) > \(p\), and then take their sample covariance matrix. I start with a mean of zero and a simple identity covariance matrix:

n <- 5
p <- 4
z <- myrmvnorm( n, 0, diag(p) )

Now z has n rows and p columns. Here is the covariance matrix of z,

Sigma <- var( z )

I can check that it is positive definite by inverting it,

SI <- solve( Sigma )

Conversely, if p > n it will not invert,

n <- 4
p <- 5
z <- myrmvnorm( n, 0, diag(p) )
Sigma <- var( z )
solve( Sigma )
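The reason is rank: the sample covariance of n vectors has rank at most n - 1, so with p > n it is singular. The eigenvalues make this visible:

eigen( Sigma )$values    # at least one eigenvalue is (numerically) zero, so Sigma cannot be inverted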

Notation

Here are some notation conventions used in these vignettes.

notation example interpretation
italic \(x\), \(X\) scalar quantity, known
greek \(\alpha, \beta, \dots\) stochastic (fitted) variable, unknown
parentheses \(\phi(\mu, \sigma^2)\), \(N(\mu, \sigma^2)\) parametric function/density
curly brackets \(\{0, 1, \dots\}\) a set of objects
closed interval \((-\infty,0]\), \([0, \infty)\) intervals include zero
open interval \((-\infty,0)\), \((0, \infty)\) intervals exclude zero
distributed as \(x \sim N(\mu, \sigma^2)\) distribution or density
expectation \(E[\epsilon] = 0\) expected value of a variable
variance \(Var[\epsilon] = \sigma^2\) variance of a variable
bracket, distribution \([A, B, C]\) unspecified density or distribution
proportional \(f(x) \propto g(x)\) differ by a constant, \(f(x) = c g(x)\)
approximate \(\pi \approx 3.14159\) approximately equal
real numbers \(\mathbb{R} = (-\infty, \infty)\) note: also positive real \(\mathbb{R}_+\)
is an element of \(\pi \in \mathbb{R}\) \(\pi\) is a real number
subset \(\{a\} \subseteq \{a, b\}\) and \(\{a, b\} \subseteq \{a, b\}\) every element is in the set
proper subset \(\{a\} \subset \{a, b\}\), but not \(\{a, b\} \subset \{a, b\}\) cannot include all elements
union \(a \cup b\) either \(a\) or \(b\)
intersection \(a \cap b\) both \(a\) and \(b\)
sum \(\sum_{i=1}^n x_i\) \(x_1 + \dots + x_n\)
product \(\prod_{i=1}^n x_i\) \(x_1 \times \dots \times x_n\)
exponentiate \(e^x\), \(exp(x)\) \(e \approx 2.71828\), inverse of \(ln(x)\)
natural logarithm \(ln(x)\) inverse of \(e^x\)

Matrices

notation example interpretation
bold, l.c. \(\mathbf{x}\), \(\mathbf{x}'\) column and row vectors, respectively
bold, u.c. \(\mathbf{X}\) matrix
dimensions \(\underset{\scriptscriptstyle n\times q}{\mathbf{X}}\) \(n\) rows, \(q\) columns
subscript \(\mathbf{x}_i\), \(\mathbf{X}_{ij}\) element of an array
row vector \(\mathbf{X}_{i\cdot}\) row \(i\)
column vector \(\mathbf{X}_{\cdot j}\) column \(j\)
transpose \(\mathbf{X}'\) rows become columns
matrix inverse \(\mathbf{A}^{-1}\) solve systems of linear equations
identity matrix \(\mathbf{I}_p = \mathbf{A} \mathbf{A}^{-1}\) \(p \times p\) with 1’s on the diagonal, zeros elsewhere
matrix determinant \(det\left( \mathbf{A} \right)\) obtain for a square matrix, e.g., covariance
Kronecker product \(\underset{\scriptscriptstyle n\times m}{\mathbf{A}} \otimes \underset{\scriptscriptstyle p\times q}{\mathbf{B}}\) \(np \times mq\) matrix, defined in text