A model needs to be checked. Can it predict the data used to fit it? Can it recover the parameters that generated the data? Here we use data simulation for model evaluation: generative models work forward and backward.
Software:

- /Modules/Code and data/clarkFunctions2026.R

Objectives:

- Complete vignette 3
- This vignette, including Exercise 1
Once you have downloaded the file clarkFunctions2026.R from Canvas, move it to your working directory and source it this way:

source('clarkFunctions2026.R')
Here are the questions from the last vignette:
Exercise 1: Based on plots of soil moisture and climate deficit, does it appear useful to include these predictors in a model for biomass of this species? Do both variables bring information?

Exercise 2: Repeat the regression analysis using function lm with a different species as the response. Interpret the analysis.

Exercise 3: Using a different species and predictors, interpret differences between estimates for the Tobit and standard regression models.
Questions??
Models that involve computation (i.e., all of them) require algorithms. Many things can go wrong, so models must be checked. Model checking can be done in a number of ways, but ultimately the fitted model should recover the right parameter values and predict the data used to fit it.
The computation that yields parameter estimates from data is the inverse of the generative process. Returning to the Bayesian Tobit model as an example, consider the model graph below:
Model graph for Bayesian Tobit model. Censored values of w have point mass in y, e.g., zero.
A generative model is one that works backwards and forwards. Said another way, it can predict the data used to fit the model. We could think of these two directions as forward simulation (up the graph) and estimation or inference (down). Recall that the posterior distribution, [parameters | data], is in some sense the “inverse” of the data-generating likelihood, [data | parameters]. If you think about this, a generative model has to properly capture the joint distribution [data, parameters]; otherwise, I could not invert it to recover inputs (estimation) or data (simulation). Of course, this requires Bayes theorem,
\[ [ P | D ] \propto [ D | P] [P] \]
Not all models are generative (i.e., not all can predict the data used to fit the model). We can use data simulated in forward mode to check if there is parameter recovery in the inverse mode.
Here I simulate data using the model from the bottom up. I use the Tobit model from vignette 3.
The graph shows the prior distributions for the parameters at the bottom, which in this case are
\[ \begin{align} [ \boldsymbol{\beta}, \sigma^2 ] &= [ \boldsymbol{\beta} ] \times [\sigma^2 ] \\ &= MVN_p( \boldsymbol{\beta} | \mathbf{b}, \mathbf{B} ) \times IG(\sigma^2 | s_1, s_2 ) \end{align} \] Notice that coefficients \(\boldsymbol{\beta}\) are a priori independent of the variance parameter \(\sigma^2\). They will not be independent a posteriori. Because they are independent at this stage I can simulate them independently:
Coefficient vector: For the coefficients in length-\(p\) vector \(\boldsymbol{\beta}\) I have a multivariate normal distribution. In the code that follows I specify this prior distribution,
\[ \mathbf{b}' = [0 \quad 0 \quad 0 \quad 0 \quad 0], \quad \mathbf{B} = \left[ \begin{matrix} 10 & 0 & 0 & 0 & 0\\ 0 & 10 & 0 & 0 & 0\\ 0 & 0 & 10 & 0 & 0\\ 0 & 0 & 0 & 10 & 0\\ 0 & 0 & 0 & 0 & 10\\ \end{matrix} \right] \] From this prior distribution I generate coefficients by drawing a random vector from \(MVN_p( \boldsymbol{\beta} | \mathbf{b}, \mathbf{B} )\)
p <- 5                                     # number of coefficients, including intercept
n <- 1000                                  # number of observations
b <- matrix( 0, p )                        # prior mean vector
B <- diag( p )*10                          # prior covariance matrix
beta <- t( myrmvnorm( 1, b, B ) )          # one draw from MVN(b, B)
vars <- paste( 'x', c( 1:( p - 1) ), sep = '_' )  # predictor names x_1, ..., x_4
rownames( beta ) <- c('Intercept', vars )
Take a look at the coefficients in beta.
Now I draw a value for the residual variance from \(IG(\sigma^2 | s_1, s_2 )\),
s1 <- s2 <- 1                     # prior parameter values
sigma2 <- 1/rgamma( 1, s1, s2 )   # inverse gamma draw: 1 over a gamma random variable
sigma <- sqrt( sigma2 )           # residual standard deviation
Here is the inverse gamma distribution, which has some useful properties that we will introduce later. This prior distribution \(IG( \sigma^2| 1, 1 )\) looks like this:
Inverse gamma distribution showing the value of sigma2 drawn at random.
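To reproduce a plot like this one (a sketch I add here, not from the original code), note that if \(1/\sigma^2 \sim Gamma(s_1, s_2)\) then the density of \(\sigma^2\) is \(\mbox{dgamma}(1/\sigma^2 | s_1, s_2)/(\sigma^2)^2\), which can be evaluated with base R:

xs <- seq( .01, 10, length = 500 )               # grid of sigma2 values
ds <- dgamma( 1/xs, s1, s2 )/xs^2                # IG(s1, s2) density by change of variables
plot( xs, ds, type = 'l', xlab = 'sigma2', ylab = '[ sigma2 ]', bty = 'n', las = 1 )
abline( v = sigma2, lty = 2 )                    # the value drawn above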
Now I need a design matrix \(\mathbf{X}\) with \(n\) rows and \(p\) columns. The model will have an intercept, so the first column will have all ones. I’ll simply fill the other columns with values drawn from \(N( 0, 10^2 )\),
X <- matrix( rnorm( n*p, 0, 10 ), n, p )
X[,1] <- 1
colnames( X ) <- rownames( beta )
This is the design matrix.
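A quick look, added here as a check with base R, confirms the intercept column:

head( X, 3 )                     # first rows of the design matrix
all( X[,1] == 1 )                # TRUE: column 1 holds the intercept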
Were this a standard regression model, the responses would be simulated as \(y_i \sim N( \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2)\). In code this could be
y <- rnorm( n, X%*%beta, sigma )
Recall that the Tobit has negative values censored at zero, the \(\mathbf{w}\) stage in the graph,
\[ y_i = \begin{cases} w_i & \quad w_i > 0\\ 0 & \quad w_i \leq 0\\ \end{cases} \] So I need to adjust this:
y <- w <- rnorm( n, X%*%beta, sigma )   # latent w
y[ w < 0 ] <- 0                         # censor negative values at zero
I have now simulated data starting from the prior. The correlation structure looks like this:
pairs( cbind( y, X[,-1] ), pch = 16, cex = .3, las = 1 )
Note the relationship between \(w\) and \(y\):
plot( w, y, cex = .2, bty = 'n', las = 1 )
Convince yourself that we have worked our way up the model graph, from prior to parameters to data:
\[ [data | parameters] \leftarrow [ parameters | prior ] \leftarrow [prior] \]
To predict data I worked my way up from the bottom of the graph. To estimate parameters I work down from the top,
\[ [data | parameters] \rightarrow [ parameters | data, prior ] \]
There is only one thing to do, fit the model:
data <- data.frame( y, X[,-1] )
form <- as.formula( paste( 'y ~ ', paste0( vars, collapse = ' + ' ) ) )
fitTobit <- bayesReg( form , data, TOBIT = T )
## fitted as Tobit model
Look at the formula in form and the data.
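For example (added for convenience), printing these objects shows the formula and the first rows of the simulated data:

form                             # y ~ x_1 + x_2 + x_3 + x_4
head( data )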
Here is a plot of some outputs:
par( mfrow = c(1,2), bty='n', las = 1 )
plot( y, fitTobit$predict[, '50%'], cex = .3, xlab = 'Observed y',
ylab = 'Predicted y', pch = 16, asp = 1 )
abline( 0, 1, lty=2 )
plot( beta, fitTobit$beta[,1], xlab = 'True beta', ylab = 'Estimated beta' )
segments( beta, fitTobit$beta[,2], beta, fitTobit$beta[,3] )
text( beta[,1], fitTobit$beta[,1], rownames(beta), pos = 4, cex = .8 )
abline( 0, 1, lty=2 )
Predicted versus observed y for the Tobit model (left) and estimated versus true coefficients with 95% credible intervals (right).
The Tobit model developed here is generative: It predicts the data used to fit the model. It recovers the parameters used to simulate the data. In addition to the coefficients shown in the plot above, it also recovers the variance sigma2 = 2.9154851, which is estimated to be fitTobit$sigma2 = 2.764 with 95% credible interval (2.469, 3.13).
The plot of predicted y versus observed y shows the observations as predicted from the posterior distribution. They look tight, but are they reasonable? Coverage is the idea that a 95% credible interval (for parameters) or predictive interval (for responses) should include the true value about 95% of the time. This will never be precisely true; among other things, there is a prior distribution that pulls the predictions away from the data. However, it's good to know if I'm in the ballpark. For the Tobit model I exclude the zeros because they are too easy. Here is the percentage of observations that fall within the 95% predictive interval (PI):
yhat <- fitTobit$predict
yhat <- yhat[ y > 0, ] # non-zero observations
yin <- length( inside( y[ y > 0 ], yhat[, 3:4] ) ) # observed y is within PI
perc <- yin/nrow(yhat)*100
For this example, I find that the 95% PI includes 95% of the observations.
I could also check coverage for the coefficients:
inside( beta, fitTobit$beta[,2:3] )
## [1] 1 2 3 4 5
Of the p = 5 coefficients I see that 5 are within the 95% credible interval. If I wanted to evaluate coverage for the coefficients I might simulate a large number of data sets, estimate parameters, and tally the fraction of credible intervals that contain the true coefficient values.
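Here is a minimal sketch of that check, assuming bayesReg and inside (from clarkFunctions2026.R) behave as in the calls above, with inside returning the indices of values that fall within their intervals. Refitting the model many times can take a while:

# A sketch (added, not from the original vignette): repeat the simulation
# and fit to tally how often each 95% CI contains the true coefficient.
nsim  <- 20                                   # number of simulated data sets
cover <- matrix( 0, nsim, p )                 # 1 where the CI contains true beta
for( k in 1:nsim ){
  Xk <- matrix( rnorm( n*p, 0, 10 ), n, p )   # new design matrix
  Xk[,1] <- 1
  colnames( Xk ) <- rownames( beta )
  yk <- wk <- rnorm( n, Xk%*%beta, sigma )    # latent w, then censor at zero
  yk[ wk < 0 ] <- 0
  dk  <- data.frame( y = yk, Xk[,-1] )
  fit <- bayesReg( form, dk, TOBIT = T )      # same model as above
  cover[ k, inside( beta, fit$beta[,2:3] ) ] <- 1
}
colMeans( cover )                             # per-coefficient coverage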
We will talk more about predicting from the posterior when we discuss Markov chain Monte Carlo (MCMC).
Recall that my prior distribution for coefficients was centered on zero with a covariance matrix having 10 along the diagonal (the variances) and zeros everywhere else (the covariances). We already saw that the posterior estimates are not zero. There is also non-zero posterior covariance. Here is the covariance matrix for \(\boldsymbol{\hat{\beta}}\):
signif( fitTobit$betaVar, 3 )
## intercept x_1 x_2 x_3 x_4
## intercept 1.14e-02 8.48e-04 -2.79e-04 4.62e-05 2.79e-04
## x_1 8.48e-04 1.22e-04 -2.09e-05 5.98e-06 1.99e-05
## x_2 -2.79e-04 -2.09e-05 5.43e-05 -1.31e-06 -6.99e-06
## x_3 4.62e-05 5.98e-06 -1.31e-06 5.26e-05 3.20e-07
## x_4 2.79e-04 1.99e-05 -6.99e-06 3.20e-07 5.78e-05
The posterior standard errors are the square roots of the diagonal elements of this matrix:
signif( sqrt( diag( fitTobit$betaVar ) ), 3 )
## intercept x_1 x_2 x_3 x_4
## 0.10700 0.01100 0.00737 0.00725 0.00760
Covariance matrices are hard to interpret due to the differences in scale for the different predictors. It is easier to digest the posterior correlation. Recall that the correlation can be obtained from the covariance as

\[
\mbox{cor}( x_{i}, x_{j} ) = \frac{\sigma_{i,j}}{\sqrt{\sigma_i^2 \sigma_{j}^2}}
\]

where the numerator is the covariance between elements \(i\) and \(j\), and the denominator holds their variances. The function cov2cor converts a covariance matrix to a correlation matrix:
corBeta <- cov2cor( fitTobit$betaVar )
signif( corBeta, 3 )
## intercept x_1 x_2 x_3 x_4
## intercept 1.0000 0.7200 -0.3550 0.0597 0.3440
## x_1 0.7200 1.0000 -0.2570 0.0747 0.2370
## x_2 -0.3550 -0.2570 1.0000 -0.0246 -0.1250
## x_3 0.0597 0.0747 -0.0246 1.0000 0.0058
## x_4 0.3440 0.2370 -0.1250 0.0058 1.0000
Note that some of these correlations are substantial.
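As a check on the formula above (added here, not in the original output), one entry can be computed directly from the covariance matrix and compared with the cov2cor result:

V <- fitTobit$betaVar
V[1,2]/sqrt( V[1,1]*V[2,2] )     # correlation between the intercept and x_1
corBeta[1,2]                     # same value from cov2cor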
When simulating data it is tempting to assume sample sizes and levels of “noise” comparable to those that might be encountered in actual data. That’s fine, but when checking a model, start by ensuring there will be signal: large \(n\) and small \(\sigma^2\). If I use a small sample size \(n\) and a large residual variance \(\sigma^2\), it will be hard to know whether the lack of agreement between true and estimated parameters, or between data and predictions, results from noisy simulated data or a faulty algorithm. First determine if the algorithm is reliable. Then explore the impact of sample size and noise.
Generative models return us to Bayes’ Theorem. There are two ways to factor the joint distribution of data and parameters \([\mbox{parameters}, \mbox{data}]\),
\[ \begin{align} [\mbox{parameters} | \mbox{data}] \times [ \mbox{data} ] &= [\mbox{data} | \mbox{parameters}] \times [ \mbox{parameters} ]\\ [\mbox{parameters} | \mbox{data}] & \propto [\mbox{data} | \mbox{parameters}] \times [ \mbox{parameters} ] \end{align} \] The left side is the posterior distribution. The right side is the likelihood \(\times\) prior. The left side gives us parameter estimates. The right side gives us data. Models are transparent when this relationship holds.
A generative model allows both parameter estimation and prediction. The one can be viewed as the inverse of the other. Prediction is one of the most valuable ways to evaluate a model, both to check that the algorithms are working and that the model can recover the correct estimates.
Exercise 1: Model performance will depend on \(n\), \(p\), \(\sigma^2\) and the range of simulated values in \(\boldsymbol{\beta}\) and \(\mathbf{X}\). Generate simulated data sets to discover the effects of each of these elements on predictions and parameter recovery.
We will discuss the multivariate normal distribution in later vignettes. For now, I’ve just used the function myrmvnorm to draw a random vector of coefficients.
The arguments to myrmvnorm are, in order, the number of random vectors, the mean vector, and the covariance matrix. If the mean vector has \(p\) elements then the covariance matrix must be \(p \times p\). A valid covariance matrix is positive definite. The easiest way to generate a positive definite covariance matrix is to simulate \(n\) random vectors, with \(n > p\), and then take their sample covariance. I start with a simple example, a mean of zero and an identity covariance matrix:
n <- 5
p <- 4
z <- myrmvnorm( n, 0, diag(p) )
Now z has n rows and p columns. Here is the covariance matrix of z,
Sigma <- var( z )
I can check that it is positive definite by inverting it (a sample covariance matrix is positive semidefinite, so being nonsingular makes it positive definite),
SI <- solve( Sigma )
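An equivalent check I add here: a symmetric matrix is positive definite exactly when all of its eigenvalues are positive,

eigen( Sigma )$values            # all positive for a positive definite matrix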
Conversely, if \(p \geq n\) the sample covariance matrix is singular and will not invert; here \(p > n\),
n <- 4
p <- 5
z <- myrmvnorm( n, 0, diag(p) )
Sigma <- var( z )
solve( Sigma )
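To see the failure without halting a script (an added convenience), wrap the call in try; the eigenvalues show why, since a covariance estimated from \(n\) vectors has rank at most \(n - 1\):

try( solve( Sigma ) )            # error: the matrix is singular
eigen( Sigma )$values            # some eigenvalues are (numerically) zero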
Here are some notation conventions used in these vignettes.
| notation | example | interpretation |
|---|---|---|
| italic | \(x\), \(X\) | scalar quantity, known |
| greek | \(\alpha, \beta, \dots\) | stochastic (fitted) variable, unknown |
| parentheses | \(\phi(\mu, \sigma^2)\), \(N(\mu, \sigma^2)\) | parametric function/density |
| curly brackets | \(\{0, 1, \dots\}\) | a set of objects |
| closed interval | \((-\infty,0]\), \([0, \infty)\) | intervals include zero |
| open interval | \((-\infty,0)\), \((0, \infty)\) | exclude zero |
| distributed as | \(x \sim N(\mu, \sigma^2)\) | distribution or density |
| expectation | \(E[\epsilon] = 0\) | expected value of a variable |
| variance | \(Var[\epsilon] = \sigma^2\) | variance of a variable |
| bracket, distribution | \([A, B, C]\) | unspecified density or distribution |
| proportional | \(f(x) \propto g(x)\) | differ by a constant, \(f(x) = c g(x)\) |
| approximate | \(\pi \approx 3.14159\) | approximately equal |
| real numbers | \(\mathbb{R} = (-\infty, \infty)\) | note: also positive real \(\mathbb{R}_+\) |
| is an element of | \(\pi \in \mathbb{R}\) | \(\pi\) is a real number |
| subset | \(\{a\}\) and \(\{a, b\} \subseteq \{a, b\}\) | in the set |
| proper subset | \(\{a\}\), but not \(\{a, b\} \subset \{a, b\}\) | cannot include all elements |
| union | \(a \cup b\) | either \(a\) or \(b\) |
| intersection | \(a \cap b\) | both \(a\) and \(b\) |
| sum | \(\sum_{i=1}^n x_i\) | \(x_1 + \dots + x_n\) |
| product | \(\prod_{i=1}^n x_i\) | \(x_1 \times \dots \times x_n\) |
| exponentiate | \(e^x\), \(exp(x)\) | \(e \approx 2.71828\), inverse of \(ln(x)\) |
| natural logarithm | \(ln(x)\) | inverse of \(e^x\) |
| notation | example | interpretation |
|---|---|---|
| bold, l.c. | \(\mathbf{x}\), \(\mathbf{x}'\) | column and row vectors, respectively |
| bold, u.c. | \(\mathbf{X}\) | matrix |
| dimensions | \(\underset{\scriptscriptstyle n\times q}{\mathbf{X}}\) | \(n\) rows, \(q\) columns |
| subscript | \(\mathbf{x}_i\), \(\mathbf{X}_{ij}\) | element of an array |
| row vector | \(\mathbf{X}_{i\cdot}\) | row \(i\) |
| column vector | \(\mathbf{X}_{\cdot j}\) | column \(j\) |
| transpose | \(\mathbf{X}'\) | rows become columns |
| matrix inverse | \(\mathbf{A}^{-1}\) | solve systems of linear equations |
| identity matrix | \(\mathbf{I}_p = \mathbf{A} \mathbf{A}^{-1}\) | \(p \times p\) with 1’s on the diagonal, zeros elsewhere |
| matrix determinant | \(det\left( \mathbf{A} \right)\) | obtain for a square matrix, e.g., covariance |
| Kronecker product | \(\underset{\scriptscriptstyle n\times m}{\mathbf{A}} \otimes \underset{\scriptscriptstyle p\times q}{\mathbf{B}}\) | \(np \times mq\) matrix, defined in text |