Torture the data long enough and they will confess to anything.
BBSeast.Rdata
source('clarkFunctions2024.r')
library( gjam )
Brynjarsdottir, J. and A.E. Gelfand. 2014. Collective sensitivity analysis for ecological regression models with multivariate response. Journal of Biological, Environmental, and Agricultural Statistics 19, 481-502.
Chib, S and E Greenberg. 1998. Analysis of multivariate probit models. Biometrika 85, 347-361.
Clark, JS, D Nemergut, B Seyednasrollah, P Turner, and S Zhang. 2017. Generalized joint attribute modeling for biodiversity analysis: Median-zero, multivariate, multifarious data, Ecological Monographs 87, 34–56.
Understand correlation structure and a joint distribution of responses
Execute and interpret a multivariate model
Multivariate models are built around a multivariate likelihood for a response vector. This likelihood describes a dependence structure in the response vector. This dependence exists after accounting for relationships in other parts of the model.
Multivariate modeling in environmental science has transitioned from descriptive to inferential. Descriptive methods have been popular for decades, most of which attempt to extract low-dimensional pattern from high-dimensional data. These methods tend to be algorithmic rather than model-based. Traditional principal components analysis (PCA) concentrates variation into a new set of orthogonal (uncorrelated) axes. Each axis is selected to maximize the residual information. There is one axis for each of the dimensions in the original data, but structured such that most information is contained in a few. Related descriptive methods are factor analysis, correspondence analysis, multidimensional scaling, discriminant analysis, and cluster analysis. There is a huge literature on these important tools for EDA.
Because Bayesian analysis is based on a probability model, I focus here on multivariate models that can be used to draw inference. The distributions I consider are:
multivariate normal: a likelihood for continuous responses and a prior distribution for regression coefficients; this is generalization of the normal distribution for univariate responses
inverse Wishart: a prior distribution for the covariance matrix; this is generalization of the inverse gamma distribution for univariate responses
multinomial: a likelihood for discrete responses; this is generalization of the binomial distribution for univariate responses
categorical: a likelihood for discrete responses when the number of trials is 1; this is a special case of the multinomial and a generalization of the Bernoulli distribution for a univariate response
The concepts of covariance, correlation, and distance are central to exploratory methods and probabilistic models in multiple dimensions.
A covariance matrix for two variables has this structure:
\[ \Sigma = \begin{pmatrix} \ \sigma_1^2 & \rho_{12} \sigma_1 \sigma_2 \\ \ \rho_{21} \sigma_1 \sigma_2 & \sigma_2^2 \end{pmatrix} \] where the covariance between 1 and 2 consists of a correlation coefficient \(\rho_{12}\) times the product of the two standard deviations. Here is an example covariance matrix generated from a sample of random vectors from the multivariate normal:
y <- .rMVN(5, 0, diag(2) )
Sigma <- cov( y )
print( Sigma )
## [,1] [,2]
## [1,] 1.177658 -1.487145
## [2,] -1.487145 2.407106
The vector of two variances is on the diagonal
diag( Sigma ) = 1.1776578, 2.4071063. The two covariance
elements are the same by symmetry and occupy the off-diagonal.
The correlation matrix makes it easier to see relations between variables because it strips away the differences in scale:
\[
\mathbf{R} =
\begin{pmatrix}
\ 1 & \rho_{12} \\
\ \rho_{21} & 1
\end{pmatrix}
\] I can translate covariance to correlation using the
function cov2cor in clarkFunctions2024,
R <- cov2cor( Sigma )
print( R )
## [,1] [,2]
## [1,] 1.0000000 -0.8832754
## [2,] -0.8832754 1.0000000
The correlation coefficient for the two variables is
R[1,2] = -0.8832754.
I can recover covariance usting the
function cor2cov,
V <- diag( Sigma ) # vector of variances
cor2cov( V, R )
## [,1] [,2]
## [1,] 1.177658 -1.487145
## [2,] -1.487145 2.407106
This is the original covariance matrix Sigma.
Covariance matrices are needed for models. Correlation matrices are often more transparent. It’s easy to translate one to the other.
One more thing related to covariance: often we need a concept of distance between variables. Highly correlated variables are “close” and vice versa. Distance matrices are required for many multivariate techniques including cluster analysis.
A distance matrix is obtained from a covariance using the
function cov2dist. Here is a comparison of covariance,
correlation, and distance for three variables:
y <- .rMVN( 5, 0, diag(3) )
Sigma <- cov( y )
print( cov2cor( Sigma ) )
## [,1] [,2] [,3]
## [1,] 1.0000000 0.6401016 -0.3260528
## [2,] 0.6401016 1.0000000 0.2674412
## [3,] -0.3260528 0.2674412 1.0000000
print( cov2dist( Sigma ) )
## [,1] [,2] [,3]
## [1,] 0.0000000 0.8894929 2.251071
## [2,] 0.8894929 0.0000000 1.256459
## [3,] 2.2510712 1.2564587 0.000000
Note how highly correlated variables are “close” (small distance). Negatively correlated variables are “far”.
There are many ways to calculate a distance between vectors that do not work directly from the covariance matrix.
Consider using a multivariate model when responses are related, they will be interpreted together, and there is residual variation at the likelihood stage.
Multivariate data are jointly distributed. Each observed response is treated as a vector. The elements of an observation vector are not modeled independently, because the likelihood must capture the covariance between them, after accounting for dependence in other parts of the model. For example, consider a continous response variable modeled as a (univariate) regression,
\[y_i|\mathbf{x}_i \sim N(\mathbf{x}'_i\boldsymbol{\beta},\sigma^2)\] When a second response variable is measured at the same time/location, call it \(j\), the analyst might fit a regression model,
\[y_{ij}|\mathbf{x}_i \sim N(\mathbf{x}'_i\boldsymbol{\beta}_j,\sigma^2_j)\] where \(j\) can be response 1 or response 2, and \(\mathbf{x}_i\) and \(\boldsymbol{\beta}_j\) are length-\(Q\) vectors for the regression coefficients.
Note that the two variables can be viewed as part of the same response, because they were observed together, \(\mathbf{y}_i = (y_{i1}, y_{i2})'\). The two regression models could be collapsed into a single multivariate equation,
\[\mathbf{y}_i|\mathbf{x}_i \sim MVN( \mathbf{x}'_i \boldsymbol{\beta},\boldsymbol{\Sigma}) \] where \(\boldsymbol{\beta}\) is now a \(Q \times 2\) matrix, and the covariance matrix assumes independence,
\[ \Sigma = \begin{pmatrix} \ \sigma_1^2 & 0 \\ \ 0 & \sigma_2^2 \end{pmatrix} \]
I have not actually changed anything, because the two responses have zero residual covariance. I know this because I see the off-diagonal zero elements in \(\boldsymbol{\Sigma}\). Still, the question becomes, should these two variables be modeled independently?
There are several issues that might motivate a multivariate model. First, if I am interested in a set of responses, and they are related in some sense, then a multivariate model helps to insure valid inference. If I want to make statements about significance of an effect (a value in \(\boldsymbol{\beta}\)), then I need the covariance matrix.
Conversely, if I am only interested in \(y_1\), and I have no reason to believe that it depends on \(y_2\) or if I am simply ignorant of \(y_2\), then I choose a univariate model for \(y_1\). Most effects of the environment are not measured, but this fact should not keep us from fitting models to what we can measure.
Second, suppose that I fit a multivariate model like the one above and estimate that residual covariance elements \(\Sigma_{1,2} = \Sigma_{2,1}\) are close to zero. In this case, I have not really changed anything by modeling them jointly. Of course, \(y_{i1}\) and \(y_{i2}\) still depend on one another, because they both depend on \(\mathbf{x}_i\). However, if mean structure of the model accounts for the dependence, then it will not show up in the residual covariance.
To convince myself of this fact, I do a simple experiment.
If responses depend on the same predictors \(\mathbf{X}\), then those responses should be correlated, despite the fact that they are simulated independently from one another. In this example I simulate three data sets independently, and I compare their correlation structure, with and without the effect of predictors (raw data versus residual from the regression):
n <- 5000 # sample size
Q <- 4 # predictors
x <- matrix( rnorm(n*Q), n, Q) # design
x <- sweep( x, 2, colMeans(x), '-' ) # center
x <- sweep( x, 2, apply(x, 2, sd), '/' ) # standardize
x[, 1] <- 1 # intercept
beta1 <- matrix( rnorm(Q), Q) # coefficients
beta2 <- matrix( rnorm(Q), Q)
beta3 <- matrix( rnorm(Q), Q)
sigma <- .1 # variances
y1 <- rnorm(n, x%*%beta1, sqrt(sigma)) # responses
y2 <- rnorm(n, x%*%beta2, sqrt(sigma))
y3 <- rnorm(n, x%*%beta3, sqrt(sigma))
The empirical covariance between the \(y\)’s includes some that are substantially different from zero:
y <- cbind( y1, y2, y3 )
round( cov( y ), 3 )
## y1 y2 y3
## y1 1.862 0.765 -1.829
## y2 0.765 5.291 -4.021
## y3 -1.829 -4.021 4.151
The residual covariance (after accounting for \(\mathbf{x}'_i \boldsymbol{\beta}\)) is much smaller:
res <- cbind(y1 - x%*%beta1,
y2 - x%*%beta2,
y3 - x%*%beta3)
round( cov( res ), 3 )
## [,1] [,2] [,3]
## [1,] 0.101 0.002 0.001
## [2,] 0.002 0.103 0.000
## [3,] 0.001 0.000 0.100
The residual covariances between the \(\mathbf{y}\)’s are low compared with the empirical covariances.
Here is another way to think about the covariance that comes through \(\mathbf{X}\). This contribution from \(\mathbf{X}\) depends on its own covariance, which is transmitted to covariance in \(\mathbf{Y}\),
\[ \begin{aligned} Cov(\mathbf{y}_i) &= Cov( \boldsymbol{\beta}'\mathbf{x}_i + \epsilon_i) \\ &= Cov( \boldsymbol{\beta}'\mathbf{x}_i) + \Sigma \end{aligned} \] If we think about \(\mathbf{X}\) as random and multiplied by fixed \(\boldsymbol{\beta}\), then the first term becomes
\[ Cov( \boldsymbol{\beta}'\mathbf{x}_i) = \boldsymbol{\beta}' Cov(\mathbf{X}) \boldsymbol{\beta} \]
Here is a comparison of this covariance induced by \(\mathbf{X}\) with the empirical covariance for this example:
beta <- cbind(beta1, beta2, beta3) # Q by S
cov(cbind(y1, y2, y3)) # empirical covariance
## y1 y2 y3
## y1 1.8620781 0.7645774 -1.829186
## y2 0.7645774 5.2913737 -4.021142
## y3 -1.8291856 -4.0211416 4.150844
t(beta)%*%cov(x)%*%beta + diag(sigma, 3) # covariance induced by X plus residual
## [,1] [,2] [,3]
## [1,] 1.8747089 0.7509225 -1.830880
## [2,] 0.7509225 5.2702302 -3.995892
## [3,] -1.8308799 -3.9958925 4.131275
They do not agree precisely, because the \(\mathbf{y}\)’s were generated stochastically, but they are quite close.
Thus the y1, y2, y3 can show
substantial correlation (off-diagonal elements different from zero),
despite having been simulated independently, because they both depend on
x. When the mean structure is removed (i.e., the
contribution from \(\mathbf{X}\)), the
correlation declines substantially. This is the “residual
correlation”.
Again, two points here. First, correlated responses do not need to be modeled jointly solely because they are correlated. If the mean structure of the model takes up the correlation structure, then univariate models (including credible intervals on coefficients) will offer near identical results. Second, even if two responses have residual correlation (the mean structure does not remove the correlation), we may still consider modeling them independently if we believe they have nothing to do with one another, and we do not intend to interpret their joint distribution.
Having said this, if there is residual correlation, and we have reason to expect dependence, then independent modeling will yield incorrect estimates of credible intervals, and it will preclude examination of indirect effects.
To examine the impact of ignoring the joint structure, I repeat the previous experiment, now with dependence, showing just two responses. Here again is a covariance matrix,
\[ \Sigma = \begin{pmatrix} \ \sigma_1^2 & \sigma_{12}^2 \\ \ \sigma_{21}^2 & \sigma_2^2 \end{pmatrix} \] where \(\sigma_{12}^2 = \sigma_{21}^2 = \rho_{12} \sigma_{1} \sigma_{2}\), with \(\rho_{12}\) being the correlation. I want to know how the non-zero covariances affect estimates, so I use simulation.
Now I simulate the three responses jointly,
# a design matrix
n <- 20 # sample size
Q <- 4 # predictors
x <- matrix( rnorm(n*Q), n, Q) # design
x[, 1] <- 1 # intercept
# coefficients
beta1 <- matrix( rnorm(Q), Q) # coefficients
beta2 <- matrix( rnorm(Q), Q)
beta3 <- matrix( rnorm(Q), Q)
beta <- cbind(beta1, beta2, beta3) # coefficient matrix
S <- ncol(beta)
Sigma <- cov( .rMVN(S+1, 0, diag(S) ) ) # a cheap covariance matrix
xnames <- paste('x',1:Q,sep='') # label everything
ynames <- paste('y',1:S,sep='')
rownames(beta) <- colnames(x) <- xnames
colnames(beta) <- rownames(Sigma) <- colnames(Sigma) <- ynames
# simulated data
y <- x%*%beta + .rMVN(n, 0, Sigma) # simulated y have residual covariance
res <- y - x%*%beta # residuals
par( mfrow = c(2,2), mar = c(4,4,2,1), bty = 'n' )
plot( y, (x%*%beta), xlab='observed', ylab='predicted' ); abline( 0, 1, lty = 2 )
plot( res[,1], res[,2] ); title( paste('cov =', round(Sigma[1,2],2)) )
plot( res[,1], res[,3] ); title( paste('cov =', round(Sigma[1,3],2)) )
plot( res[,2], res[,3] ); title( paste('cov =', round(Sigma[2,3],2)) )
Residual correlations come from matrix Sigma.
Look at each of the objects generated by this code. Does the residual
covariance look like Sigma?
Unlike an empirical covariance matrix, a model covariance matrix must be not only symmetric but also positive definite or full rank.
One thing to note here is the generation of the covariance matrix, which I obtained by simulating MVN vectors. A positive definite covariance (or full-rank) matrix is one that can be inverted, i.e., there is a solution to \(\Sigma^{-1}\). This inversion occurs when I solve for parameters. I cannot arbitrarily construct a symmetric matrix and assume it will be full rank. I have only \(S = 3\) responses here, but there could be 100 responses. An easy way to do this is to simply draw random data and take the covariance. Provided the number of vectors used to simulate \(\Sigma\) is larger than the number of response variables (i.e., in this example > \(S = 3\)), it will be full rank. I now have a simulated data set.
Here are the estimates I would obtain if I modeled each column of the simulated data independently:
bhat <- numeric(0)
for(s in 1:S){
tmp <- summary( lm(y[,s] ~ x[,-1]) )$coefficients[,1:2]
bhat <- rbind(bhat,tmp)
}
blo <- bhat[,1] - 1.96*bhat[,2]
bhi <- bhat[,1] + 1.96*bhat[,2]
plot(beta, bhat[,1], ylim = range(c(blo,bhi)))
segments(beta, blo, beta, bhi)
abline(0,1)
Coefficients used to simulate the data compared with estimates from a univariate model.
The point estimates for coefficients are ok, but their standard errors are not. To see this I next estimate coefficients for the multivariate model.
Just as with the univariate model, I can fit the multivariate model with Gibbs sampling. The big change is the fact that I must now sample coefficients as a matrix of values (\(S \times Q\)), and the parameter \(\sigma^2\) is replaced with a covariance matrix \(\Sigma\). The prior distribution is multivariate normal for the coefficients and non-informative for the covariance,
\[ MVN(vec(\boldsymbol{\beta}) | 0, \Sigma \otimes \mathbf{C}) \times |\Sigma|^{-(S + 2)/2} \] The new notation here is the \(vec(\cdot)\) operator and the Kronecker product \(\otimes\). The \(vec(\boldsymbol{\beta})\) operation takes the \(S \times Q\) matrix \(\boldsymbol{\beta}\) and stacks the columns into a length-\(SQ\) vector. Of course, covariance for a vector of length \(SQ\) requires a \(SQ \times SQ\) covariance matrix.
Here is the matrix \(\boldsymbol{\beta}\) from this example and its vector form \(vec(\boldsymbol{\beta})\)
print( beta )
## y1 y2 y3
## x1 -0.1783069 0.07833256 2.6070522
## x2 -0.2711932 -0.67298680 0.5417304
## x3 -0.8592820 -0.06381870 0.6151654
## x4 -1.3105012 0.01734260 -1.0785816
print( as.vector( beta ) )
## [1] -0.17830689 -0.27119320 -0.85928204 -1.31050124 0.07833256 -0.67298680
## [7] -0.06381870 0.01734260 2.60705224 0.54173040 0.61516540 -1.07858165
The \(vec()\) operator unstacks the matrix one column at a time. This \(Q \times S\) matrix has a distribution for the length-\(QS\) vector.
The \(QS \times QS\) prior covariance matrix is a Kronecker product. The Kronecker product generates a block-diagonal matrix, where there are \(S\) blocks (\(\Sigma\) is \(S \times S\)), each of which is a \(Q \times Q\) matrix (\(\mathbf{C}\) is \(Q \times Q\)). Here is matrix:
C <- diag( 2 )
print( Sigma )
## y1 y2 y3
## y1 0.5320020 -0.1158832 0.2481713
## y2 -0.1158832 2.2442979 0.1808189
## y3 0.2481713 0.1808189 0.2945873
print( C )
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
print( kronecker( Sigma, C ) )
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.5320020 0.0000000 -0.1158832 0.0000000 0.2481713 0.0000000
## [2,] 0.0000000 0.5320020 0.0000000 -0.1158832 0.0000000 0.2481713
## [3,] -0.1158832 0.0000000 2.2442979 0.0000000 0.1808189 0.0000000
## [4,] 0.0000000 -0.1158832 0.0000000 2.2442979 0.0000000 0.1808189
## [5,] 0.2481713 0.0000000 0.1808189 0.0000000 0.2945873 0.0000000
## [6,] 0.0000000 0.2481713 0.0000000 0.1808189 0.0000000 0.2945873
If I label the blocks from 1 to \(S\), then block \((s, s')\) is \(\Sigma_{ss'}\mathbf{C}\). This is a type of separable matrix, which means that every element of \(\mathbf{C}\) in block \((s, s')\) is multiplied by the same scalar value (one element of \(\Sigma\)). In other words this species \(\times\) covariate matrix has no interactions between species and covariates a priori. However, there will be interactions a posteriori induced by the data.
I take the prior covariance \(\mathbf{C}\) to be non-informative.
Gibbs sampling is all direct from conditional posterior distributions. First I mention the inverse Wishart distribution. The prior distribution for the covariance matrix \(|\Sigma|^{-(S + 2)/2}\) is conjugate with the \(MVN\) likelihood. The conditional posterior distribution is also inverse Wishart. This distribution takes as arguments a prior covariance matrix having the same dimensions as \(\Sigma\), and degrees of freedom, which must be larger than \(S + 1\). The larger the degrees of freedom, the more informative (more weight on prior). The conditional posteriors are admittedly ugly, but direct sampling is worth a bit of math:
\[ \begin{aligned} vec(\boldsymbol{\beta}) & \sim & MVN\left( vec( (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} ), \Sigma \otimes (\mathbf{X}'\mathbf{X})^{-1} \right) \\ \Sigma & \sim & IW( n - Q + S - 1,\mathbf{Y}' \mathbf{D} \mathbf{Y} ) \end{aligned} \] where \(\mathbf{D} = \mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\).
To do this as a single multivariate regression I write a Gibbs sampler:
bg <- beta*0
sg <- Sigma*0
rownames(bg) <- colnames(x) <- xnames
colnames(bg) <- rownames(sg) <- colnames(sg) <- ynames
ng <- 2000 # setup Gibbs sampler
bgibbs <- matrix( 0, ng, S*Q)
sgibbs <- matrix( 0, ng, S*S)
predy <- matrix( 0, ng, S*n) # predict the data
colnames(bgibbs) <- as.vector( outer(xnames,ynames,paste,sep='_') )
colnames(sgibbs) <- as.vector( outer(ynames,ynames,paste,sep='_') )
IXX <- solve(crossprod(x)) # only do this once
df <- n - Q + S - 1 # Wishart
for(g in 1:ng){
bg <- updateBetaMVN( x, y, sg)
sg <- updateWishart( x, y, df, beta = bg, IXX = IXX )$sigma
sgibbs[g,] <- sg
bgibbs[g,] <- bg
predy[g,] <- as.vector( .rMVN( n, x%*%bg, sg ) ) # predicted y
}
Here are some plots of estimates and predictions:
par( mfrow=c(1,3), mar=c(4,4,1,1), bty = 'n')
bmu <- colMeans(bgibbs)
bci <- apply(bgibbs, 2, quantile, c(.025,.975))
plot(as.vector(beta), bmu, ylim = range(bgibbs),
xlab='true beta', ylab='estimate')
segments(as.vector(beta), bci[1,], as.vector(beta), bci[2,])
abline(0, 1, lty = 2)
smu <- colMeans(sgibbs)
sci <- apply(sgibbs, 2, quantile, c(.025,.975))
plot(as.vector(Sigma), smu, ylim = range(sgibbs),
xlab='true Sigma', ylab='')
segments(as.vector(Sigma), sci[1,], as.vector(Sigma), sci[2,])
abline(0, 1, lty = 2)
ymu <- matrix( colMeans(predy),n,S )
plot(y, ymu, ylim = range(predy), xlab = 'Observed y', ylab = 'Predicted y' )
abline(0, 1, lty = 2)
yci <- apply(predy, 2, quantile, c(.025,.975))
segments(as.vector(y), yci[1,], as.vector(y), yci[2,])
Estimates and predictions from the multivariate model.
This analysis can be done compactly in GJAM:
library(gjam)
ml <- list(ng = 2000, burnin = 500, typeNames = 'CON')
form <- as.formula(~ x2 + x3 + x4)
out <- gjam(form, data.frame(x), y, modelList = ml)
## ================================================================================================================================================================
summary(out)
##
## Sensitivity by predictor variables f:
## Estimate SE CI_025 CI_975
## x2 4.00 1.060 2.29 6.29
## x3 2.43 0.734 1.25 4.06
## x4 2.55 0.913 1.16 4.64
##
##
##
## Coefficient matrix B:
## y1 y2 y3
## intercept -0.212 0.185 2.510
## x2 -0.258 -0.448 0.720
## x3 -1.030 0.902 0.627
## x4 -1.260 0.708 -1.050
## RMSPE 1.410 1.510 1.010
##
##
##
## Coefficient matrix B as table
##
## Last column indicates if 95% posterior distribution contains zero:
## Estimate SE CI_025 CI_975 sig95
## y1_intercept -0.212 0.276 -0.76800 0.3190
## y1_x2 -0.258 0.218 -0.70200 0.1620
## y1_x3 -1.030 0.422 -1.86000 -0.1940 *
## y1_x4 -1.260 0.277 -1.79000 -0.6930 *
## y2_intercept 0.185 0.295 -0.38300 0.7730
## y2_x2 -0.448 0.235 -0.91700 0.0308
## y2_x3 0.902 0.457 0.00645 1.7700 *
## y2_x4 0.708 0.309 0.08500 1.3100 *
## y3_intercept 2.510 0.198 2.12000 2.9300 *
## y3_x2 0.720 0.156 0.41000 1.0200 *
## y3_x3 0.627 0.297 0.04630 1.2400 *
## y3_x4 -1.050 0.202 -1.45000 -0.6500 *
##
##
##
## Variance contributions from model mean and random effects:
## y1 y2 y3
## total variance 2.090 1.670 2.690
## mean fraction 0.709 0.583 0.876
## R2 0.523 0.319 0.812
##
##
##
## Design Table: VIF and correlations:
## x2 x3 x4
## VIF 1.01 0.99 0.97
## factor 0.00 0.00 0.00
## x3 0.27 NA NA
## x4 -0.24 -0.20 NA
##
##
##
## The sample contains n = 20 observations on S = 3 response variables. Data types (typeNames) include CON. There are missing values in X and missing values in Y. The RMSPE is 1.33, and the DIC is 400. Computation involved 2000 Gibbs steps, with a burnin of 500.
Here is a comparison of coefficients from the univariate model and gjam:
Q <- 4 # intercept plus three predictors
betaMu <- out$parameters$betaMu
betaSe <- out$parameters$betaSe
bUniMu <- matrix(bhat[,1], Q, S)
bUniSe <- matrix(bhat[,2], Q, S)
par(mfrow=c(S,2),bty='n', mar=c(1,1,1,1), omi = c(.6,.6,.2,0))
for(j in 1:S){
xlab <- ''
if(j == S)xlab <- c('mean','sd')
plot(betaMu[,j], bUniMu[,j], xlab = xlab[1], ylab = '', pch=15); abline(0,1, lty=2)
if(j == 1)title('Mean estimate')
plot(betaSe[,j], bUniSe[,j], xlab = xlab[2], ylab = '', pch=15); abline(0,1,lty=2)
if(j == 1)title('SE')
}
mtext('Multivariate model', 1, outer = T, line = 2 )
mtext('Univariate model', 2, outer = T, line = 2 )
Coefficients and standard errors for the univariate and multivariate model.
The mean estimates are ok (left side), but the standard errors differ (right side). Why does the univariate model fail to generate correct standard errors? Recall the conditional posterior for the univariate model has a contribution from the likelihood to the covariance
\[ Var(\mathbf{\beta})|\sigma^2 = \sigma^2(\mathbf{X}'\mathbf{X})^{-1} \] For the multivariate model it changes to
\[ Var(vec(\mathbf{\beta}))|\Sigma = \Sigma \otimes (\mathbf{X}'\mathbf{X})^{-1} \] This conditional relationship reveals an important role for the covariance \(\Sigma\), directly affecting uncertainty of coefficients.
Multivariate discrete observations are often modeled with a multinomial, or its special case for a single trial, the categorical distribution. A multinomial distribution might be used to describe eye color in members of our class (say, \(n = 15\)),
\[\mathbf{y}_i \sim multinom(n, [\theta_{i1}, \dots, \theta_{iR}]\] where \(R\) is the number of classes. There are \(R - 1\) actual classes, because the last one is redundant with others. For the special case of the binomial (\(R = 2\)), there is only one actual class, because the second class, having probability \(\theta_2 = 1 - \theta_1\) is fully determined by the first class.
In this case, \(\mathbf{y}_i\) is a length-\(R\) vector of counts (e.g., for number of brown, blue, green) that sum to \(n\), with a corresponding length-\(R\) vector of probabilities \(\boldsymbol{\theta}_i\) that sum to 1. A categorical distribution might be used to describe the eye color of one individual selected at random from the class.
To draw random vectors from the multinomial I can use the R function
rmultinom.
n <- 10
m <- 20
theta <- c(.5, .3, .2) # probability vector
y <- t( rmultinom(m, n, theta) ) # sample of size m
The drawback with this function is that it accepts only a single probability vector \(\boldsymbol{\theta}\) that is applied to all \(m\) observations. I rarely use this function because I typically have a model, which means that there is a size \(n_i\) and a vector \(\boldsymbol{\theta}_i\) for each of the \(i = 1, \dots, m\) observations \(\mathbf{y}_i\). For example, microbial composition reads do not sum to the same total for all observations.
Here is an example from myrmultinom:
theta <- matrix( runif(m*3, 0, 1), m, 3) # random vectors
theta <- sweep( theta, 1, rowSums(theta), '/' ) # normalization is not necessary
y <- myrmultinom(n, theta) # sample size = nrow(theta)
This function is useful when I have a model for \(\mathbf{\theta}_i\).
Again, because the multinomial has one less response variable than there are classes (they sum to fixed \(n_i\) for \(R\) classes), a multinomial regression has \(R-1\) independent responses. Each looks like a logit model:
\[ \begin{align} \log \frac{[y_i = 1]}{[y_i = R]} &= \boldsymbol{\beta}_1 \mathbf{x}'_i \\ \log \frac{[y_i = 2]}{[y_i = R]} &= \boldsymbol{\beta}_2 \mathbf{x}'_i \\ & \vdots \\ \log \frac{[y_i = R-1]}{[y_i = R]} &= \boldsymbol{\beta}_{R-1} \mathbf{x}'_i \end{align} \] The coefficient vector for response \(r\) is the vector \(\boldsymbol{\beta}_r = (\beta_{r1}, \dots \beta_{rQ})'\), where \(Q\) is the number of predictors in \(\mathbf{x}_i\). This vector is not the response for \(r\) but rather the response relative to the response of the reference class \(R\). This relationship, together with the non-linear link function makes parameters hard to interpret. Here are the probabilities for classes:
\[ \begin{align} [y_i = 1] &= \frac{exp(\beta_1 \mathbf{x}'_i)}{1 + \sum_{r=1}^{R-1}exp(\beta_r \mathbf{x}'_i)} \\ & \vdots \\ [y_i = R-1] &= \frac{exp(\beta_{R-1} \mathbf{x}'_i)}{1 + \sum_{r=1}^{R-1}exp(\beta_{r} \mathbf{x}'_i)} \end{align} \] For \(R = 2\) this becomes the simple logit model. Here is a data simulation and model fitting with Gibbs sampling. First I simulate data
n <- 500
Q <- 3 # inputs
S <- 5 # response classes
size <- 1 + sample(30,n,replace=T) # up to 30 objects per observation
b <- bmultiProp( S, Q)$cc
ynames <- paste( 's', 1:S, sep = '-' )
xnames <- paste( 'q', 1:Q, sep = '-' )
loX <- c( 1, -1, -1) # range of covariates, length = k
hiX <- c( 1, 1, 1)
LIKE <- 'multinom' # 'norm', 'pois', 'binom', or 'multinom'
x <- simX(n, loX, hiX) # simulate x and y
y <- simY(x, b, LIKE, S, size, 0)
colnames(y) <- ynames
colnames(x) <- rownames(b) <- xnames
colnames(b) <- ynames[-S]
Here is a Gibbs sampler:
priorB <- as.vector(b*0) # prior means
priorVB <- diag(1000, Q*(S-1)) # prior covariance
priorIV <- solve(priorVB) # prior inv covariance
ng <- 15000
tmp <- gibbsLoop(LIKE, ng, x, y, b,
priorB = priorB, priorVB = priorVB, burnin = 5000)
## ================================================================================
chainPlot( tmp$bgibbs, trueValues = as.vector(b) )
MCMC chains for the multinomial example, where red lines indicate true values.
The headings in this plot identify the values in b as
row b[q,s]. The true values are within 95% credible
intervals. There is good mixing.
Here are predictions with observed:
predVsObsPlot( obs = y, pred = tmp$ymean )
Predicted and observed data from the multinomial logit.
Prediction uncertainty has much to do with the size of each observation. Here are the predictive CVs:
plot( jitter(y), tmp$yse/tmp$ymean, bty = 'n', xlab = 'Count',
ylab = 'Predictive CV', cex = .1 )
The multinomial logit model yields coefficients on the logit scale, which, as for the binomial, do not have transparent interpretation beyond their sign. As this example shows, the model yields to Bayesian analysis and can provide confident estimates and predictions.
Leaf habit is one of the most important trait adaptations for variation in climate and habitat. Leaves can be thin and expendable or thick, durable, and long-lived. The type of leaf that a species invests in can tell us much about the environment in which it lives and how that species copes with it. A full description of trait analysis in this FIA is available in Clark et al. (2017). I used GJAM there, whereas here I provide an example with the multinomial logit.
Leaf habit is an example of a categorical variable. In this example,
I look at the distribution of leaf types across FIA plots. I have
counted the number of individuals on each aggregate plot that possess
each of ** leaf types. These are the columns in
matrix yleaf:
load( 'leafFIA.rdata', verbose=T )
yleaf[1:10,]
Each row of yleaf is a multinomial distribution, each
with a different size due to the fact that different numbers of trees
occur on each plot.
The predictors in the model are held in a matrix x. They
include the site moisture status (moist), winter minimum
temperature (winTmin), annual precipitation
(annualPrec), and annual moisture deficit
(annualDef).
Here is a Gibbs sampler for a multinomial logit distribution. It needs many interations due to the fact that this is a large data set, and coefficients must be sampled indirectly (with Metropolis).
x <- model.matrix(~ moist*annualDef + winTmin + annualPrec,
data = data.frame(xcluster) ) # generate design matrix
S <- ncol(yleaf)
Q <- ncol(x)
b <- matrix(0, Q, S-1) # coefficient matrix
rownames(b) <- colnames(x)
colnames(b) <- colnames(yleaf)[-ncol(yleaf)]
priorB <- as.vector(b*0) # prior means
priorVB <- diag(1000,Q*(S-1)) # prior covariance
priorIV <- solve(priorVB) # prior inv covariance
ng <- 4000
burnin <- 1000
tmp <- gibbsLoop(LIKE = 'multinom', ng = ng, x, yleaf, b, # Gibbs sampler
priorB=priorB, priorVB=priorVB, burnin = burnin)
par( mfrow=c(2,3), mar=c(4,4,2,1), bty='n' )
for(j in 1:S){
predVsObsPlot( sqrt(yleaf[,j]), sqrt(tmp$ymean[,j]) )
abline(0, 1, lty=2) # agreement prediction
abline( h = mean(yleaf[,j], na.rm=T), lty=2) # noise prediction
title(colnames(yleaf)[j])
}
Predicted versus observed leaf traits from the multinomial logit model.
Note that these covariates do not do a good job of predicting leaf
habit for most types. It will always be especially hard to predict rare
types (e.g., broad_evergreen). The fact that it does best
on needle_evergreen is consistent with the observation that there are
climate trends in the prevalence of evergreen foliage.
Here are responses using as density plots:
vmat <- columnSplit( colnames(tmp$bgibbs), '_' )
par( mfrow=c(2,3), mar=c(4,2,1,1) , omi = c(.5, .5, .2, 0), bty='n' )
wr <- burnin:ng
for( j in 2:Q ){
wj <- which( vmat[,1] == colnames(x)[j] )
xlim <- range( tmp$bgibbs[wr,wj] )
plot( NA, xlim = xlim, ylim=c(0, 20/diff(xlim) ), xlab='', ylab='' )
title( colnames(x)[j] )
for( i in 1:length(wj) ){
xij <- density( tmp$bgibbs[wr,wj[i]], n = 30 )
lines( xij$x, xij$y, lwd=2, col=i )
abline( v = 0, lty = 2 )
}
}
legend('topright', colnames(yleaf)[-S], text.col = 1:S, bty='n', cex=.9)
mtext('Parameter value', 1, outer=T, line = 1)
mtext('Density', 2, outer=T, line = 1)
Posterior densities for the predictors for each leaf type. Note that there are no estimates for the reference class, scalelike_evergreen.
The bumpiness of these plots indicate that more mixing is needed
(bigger ng). The indirect Metropolis sampling in this
algorithm will typically be difficult to tune for efficiency.
There are a number of important methods built on the multinomial, including the multinomial logit and probit models. There are Bayesian implementations of these methods that are worth consideration when data are discrete.
There are three reasons why I turn immediately to an alternative approach. First is the issue of covariance. Just as in the continuous case, discrete data can have correlation in outcomes. However, unlike the multivariate normal distribution, which has a covariance matrix, the multinomial distribution has only a probability vector. With fixed \(n\), the competing classes impose negative covariance. The variance in outcome \(s\) is \(n \theta_s (1 - \theta_s)\). The covariance between outcomes \(s\) and \(s'\) is \(-n \theta_s \theta_{s'}\). If you are unconvinced, try this:
n <- 50 # simulated multinomial data
theta <- c(.5, .3, .2)
y <- t( rmultinom(1000, n, theta) )
cov(y)
## [,1] [,2] [,3]
## [1,] 12.707098 -7.415085 -5.292013
## [2,] -7.415085 9.725626 -2.310541
## [3,] -5.292013 -2.310541 7.602554
and compare with this theoretical result:
sigma <- -n*tcrossprod(theta) # covariances for all (s, s')
diag(sigma) <- n*theta*(1 - theta) # variances on diagonal
print( sigma )
## [,1] [,2] [,3]
## [1,] 12.5 -7.5 -5
## [2,] -7.5 10.5 -3
## [3,] -5.0 -3.0 8
The negative covariance is imposed by the compositional nature of the data–the \(y_{is}\) sum to \(n\), so increasing any one tends to decrease others. Multinomial data are also called composition data. I call this a disadvantage because, regardless of whether or not two microbial taxa tend to occur together, the multinomial distribution used for count data will impose negative covariance.
A second limitation comes from the difficulty interpreting coefficients when a non-linear link function is used to model count data with a continuous intensity or probability model. If I do a linear regression, I know that a regression coefficient \(\beta_q\) represents the effect of \(x_q\) on response \(y\) and has the dimension \(y/x_q\). Non-linear link functions are used in GLMs to insure positive \((0, \infty)\) intensity or \((0, 1)\) probability. The problem comes from interpreting coefficients on these non-linear scales.
Still another reason that I tend not to use models based on the multinomial comes from the fact that data commonly consist of combinations of continuous and discrete variables. In social sciences, a response could consist of income (continuous) and political party (discrete). In the health sciences, coronavirus sypmtoms can be discrete (cough, tiredness, difficulty breathing) and continuous (fever).
Real data tend to be more complicated than simply continuous versus
discrete. In ecology, observations on plant health can include nomimal
classes for herbivore damage, ordinal classes from abundance, or
combinations of continuous and discrete in the same variable: discrete
zero or detection limit for chemical constituents, continuous positive
values, discrete maximum detection. gjam was developed
(partly in this class) to address the common problems I typically
encounter with multivariate data. These issues are summarized in the
next section.
Generalized Joint Attribute Modelling gjam models multivariate responses that can be combinations of discrete and continuous variables, where interpretation is needed on the observation scale. It was motivated by the challenges of modeling distribution and abundance of multiple species, so-called joint species distribution models (JSDMs). Because the different attributes that make up a multivariate response are often recorded on many different scales and with different levels of sampling effort, analyses often default to the least informative common-denominator, binary or ‘presence-absence’.
Combining species and other attributes in a single analysis is challenging because individuals of some species groups are counted. Some may be observed as continuous cover values or basal area. Some may be recorded in ordinal bins, such as ‘rare’, ‘moderate’, and ‘abundant’. Others may be presence-absence. Some are composition data, either fractional (continuous on (0, 1)) or counts (e.g., molecular and fossil pollen data). Attributes such as body condition, infection status, and herbivore damage are often included in field data. GJAM accommodate multifarious observations.
As discussed in the previous section, a covariance matrix allows me to see the relationships between responses that are left after I remove the mean structure. For a simple linear model this is easy, because the responses are real numbers on \((-\infty, \infty)\). I cannot do this with counts unless I impose a non-linear link function, e.g., the link functions used in GLMs. The problem is that the coefficients no longer have a simple interpretation. For example, the covariance matrix is not the covariance between species, but rather a non-linear transformation of them.
To allow transparent interpretation GJAM avoids non-linear link functions. This is a departure from generalized linear models (GLMs) and most hierarchical Bayes models.
In GJAM, the integration of discrete and continuous data on the observed scales makes use of censoring. Censoring extends a model for continuous variables across censored intervals. Continuous observations are uncensored. Recall, the Tobit model was an example of censoring for discrete zeros. Censored observations are discrete and can depend on sample effort.
The partition for discrete data in one dimension.
Censoring is used with the effort for an observation to combine discrete variables with appropriate weight. In count data, effort is determined by the size of the sample plot, search time, or both. It is comparable to the offset in GLMs. In count composition data (e.g., microbiome, fossil pollen), effort is the total count taken over all species. In GJAM, discrete observations can be viewed as censored versions of an underlying continuous space.
As a data-generating model (a), a realization \(w_{is}\) that lies within a censored interval is translated by the partition \(\mathbf{p}_{is}\) to discrete \(y_{is}\). The distribution of data (bars at left) is induced by the latent scale and the partition. For inference (b), observed discrete \(y_{is}\) takes values on the latent scale from a truncated distribution.
GJAM achieves effort-based weighting through a partition of the continuous scale. Where effort \(E = 1\), the partition for discrete counts \(0, 1, 2, \dots\) begin at \(-\infty\), followed by midpoints between count values, \(\mathbf{p} = (-\infty, 1/2, 3/2, \dots)\). For \(\mathbf{z}_{is} = k\) the interval is thus \((p_{i,k}, p_{i,k+1}] = (k - 1/2, k + 1/2]\). When effort varies between observations the partition shifts to the ‘effort scale’,
\[\begin{equation} \label{eqn:pik} \mathbf{p}_{ik} = \left(\frac{k - 1/2}{E_{i}}, \frac{k + 1/2}{E_{i}}\right] \end{equation}\] If observations are animals counted per hour, \(E_i\) can be search time. If observations are benthic organisms per sediment core, \(E_i\) can be core volume. If observations are seedlings per plot, then \(E_i\) can be the area of plot \(i\). Because plots have different areas one might choose to model \(w_{is}\) on a ‘per-area’ scale (density) rather than a ‘per-plot’ scale.
To see how censoring accommodates the increasing uncertainty with reduced effort, consider point counts for birds. Suppose an observation lasting 10 min yields roughly 10 times larger counts than observations lasting one minute. Instead of 2 birds per minute I might count 20 birds per 10 minutes. The partition for a count of 2 per 1 minute is (1.5, 2.5), or a spread of 1 unit. The partition for a count of 20 birds per 10 minutes is (19.5/10, 20.5/10) = (1.95, 2.05). The narrow partition for the second case (0.1 unit) admits the lower uncertainty represented by greater effort.
GJAM uses this partition with a MVN model for the continuous scale. Disadvantages of models based on the multinomial or GLMs are remedied by the GJAM treatment based on censoring. First, there is a covariance \(\Sigma\) fitted to data–a fixed covariance is not imposed by the multinomial distribution. Second, the scale is easily interpreted. If effort is omitted (\(E = 1\)), then coefficients and covariance are interpreted on the observation scale (e.g., per plot, per interval). If effort is included, the scale is ‘per effort’ (e.g., per hectare, per minute). Coefficients and covariance are interpreted on this scale. Finally, the different data types are combined. For biodiversity, where birds come from point counts, plants from fixed-area plots, microbes from sequencing, and mammals from live traps, there is a model that combines them. In the next section I summarize the main data types.
(CA) data can be concentration, biomass, density, basal area, leaf area, cover, and so on, all of which are continuous above zero, but have discrete zero.
Table 1. Partition for each data type
typeNames |
Type | Obs values | Default partition | Comments |
|---|---|---|---|---|
'CON' |
continuous, uncensored | \((-\infty, \infty)\) | none | e.g., centered, standardized |
'CA' |
continuous abundance | \([0, \infty)\) | \((-\infty, 0, \infty)\) | with zeros |
'DA' |
discrete abundance | \(\{0, 1, 2, \dots \}\) | \((-\infty, \frac{1}{2E_{i}}, \frac{3}{2E_{i}}, \dots , \frac{max_s(y_{is}) - 1/2}{E_i}, \infty)^1\) | e.g., count data |
'PA' |
presence- absence | \(\{0, 1\}\) | \((-\infty, 0, \infty)\) | unit variance scale |
'OC' |
ordinal counts | \(\{0, 1, 2, \dots , K \}\) | \((-\infty, 0, estimates, \infty)\) | unit variance scale, imputed partition |
'FC' |
fractional composition | \([0, 1]\) | \((-\infty, 0, 1, \infty)\) | relative abundance |
'CC' |
count composition | \(\{0, 1, 2, \dots \}\) | \((-\infty, \frac{1}{2E_{i}}, \frac{3}{2E_{i}}, \dots , 1 - \frac{1}{2E_i}, \infty)^1\) | relative abundance counts |
'CAT' |
categorical | \(\{0, 1\}\) | \((-\infty, max_{k}(0, w_{is,k}), \infty)^2\) | unit variance, multiple levels |
(DA) data arise from counts, which are often not well described by standard distributions, such as the Poisson or the negative binomial, and perform poorly when zeros are common. Again, the multinomial distribution induces a negative covariance. When the total count in the multinomial distribution is related to abundance a separate model is needed for this total. By treating observed counts as a censored version of true abundance GJAM accommodates effort, and parameters can be interpreted on the observation scale or the effort scale.
(PA) data include only two categories, \(\{0, 1\}\). The multivariate probit model of Chib and Greenberg (1998) is a special case of GJAM for PA data, where both intervals are censored.
(OC) data are collected where abundance must be evaluated rapidly, where precise measurements are difficult, or absolute scales are difficult to apply. Because there is no absolute scale the partition must be inferred. Consider the ordinal scale represented by these labels: (absent, rare, intermediate, abundant). The sample partition is \(\mathbf{p}_{s} = (-\infty, 0, p_{s,2},p_{s,3}, \infty)\), where elements 2 and 3 are estimated. The zero anchors location, and unit variance imposes a scale.
may be continuous fractions with a sum-to-one constraint (fractional composition) or discrete counts. Both have interpretation on the relative abundance \([0,1]\) scale, and both require point mass at zero and one. Due to the sum-to-one (fractional composition) or sum-to-\(E_i\) (count composition) constraint, there is information on only \(S - 1\) columns in \(\mathbf{Y}\).
(CC) data are composition data reported as numbers of each species counted. Composition counts are only meaningful in a relative sense; they provide no information on absolute abundance. The total count for a sample is the effort \(E_i = \sum_{s}{y_{is}}\). Common examples include molecular sequence data, paleoecology, and fungal assays. In paleoecology total counts can differ widely between observations (www.neotomadb.org). The number of DNA sequence reads in microbiome data can range over orders of magnitude. A practice that is widespread in the microbiome community rarifies count data to achieve approximate equity between samples. This amounts to a massive manipulation of data that can throw away vast amounts of information. Alternative model-based approaches applied to counts are limited to single taxa. A multinomial model with second-stage covariance is not on the data scale. Moreover, dominance of zeros in microbiome data limits application of most approaches.
GJAM accommodates the discrete observations and the underlying relative abundance scale. A sample count can take values \(y_{is} \in \{0,1, 2, \dots \}\), with \(E_i\) being the total count for sample \(i\). The partition segments the \([0, 1]\) composition scale according to effort and allowing for zeros. Small samples have wide bins and, thus, high variance and low weight.
(FC) data arise in many ways, examples including the fraction of a photoplot or remotely sensed image occupied by each species or cover type. It can be the fraction of leaves lost to different types of herbivory or stream or foliar chemistry. The correlations between responses are distorted when estimated on the multivariate logit scale. Still more problematic, the logit scale does not admit zeros, which are common in composition data. In GJAM a FC observation is represented in continuous space and censored at 0 (absent species) and 1 (monoculture).
A sample may have multiple composition groups. For example, \(\mathbf{Y}\) may include both soil and endophytic microbiome data, each with its own total count (effort). Let \(G\) be the number of composition groups. If there are \(L_g\) response variables for a given FC or CC group \(g\), then there are \(L_g - 1\) non-redundant columns in \(\mathbf{Y}\) for group \(g\). A sample includes information on the total number of non-redundant columns, \(S = \sum_g L_g - G\). A link function provides support over the real line for composition data, while providing estimates on the observation scale (Appendix S1).
(CAT) describe unordered categories. If observation \(i\) refers to a sample plot, and the response \(s\) is a cover-type variable, then it might be assigned to one of several categories \(k\), such as ‘tidal flat’, ‘low marsh’, or ‘high marsh’. If it refers to a sample plant, and a response is growth habit, it might be assigned one of four categories ‘herb’, ‘graminoid’, ‘shrub’, or ‘tree’. These are multinomial responses. Like composition data, a categorical response \(s\) occupies as many columns in \(\mathbf{Y}\) as there are non-redundant levels \(K_s - 1\), because the \(K_s\) columns sum to 1. The observed category is that having the largest value of \(w_{is,k}\) for response \(s\).
The different types of data that can be included in the model are
summarized in Table 1, assigned to the character variable
typeNames that is included in the modelList
passed to gjam:
Consider a sample of size \(n =
500\) for \(S = 10\) species and
\(Q = 4\) predictors. To indicate that
all species are continuous abundance data I specify
typeNames as 'CA':
library(gjam)
f <- gjamSimData(n = 500, S = 10, Q = 4, typeNames = 'CA')
summary(f)
The object f includes elements needed to analyze the
simulated data set. f$typeNames is a length-\(S\) character vector. The
formula follows standard R syntax. It does not start with
y ~, because gjam is multivariate. The multivariate
response is supplied as a \(n \times
S\) matrix or data.frame ydata. Here is
the formula for this example:
f$formula
The simulated parameter values are returned from
gjamSimData in the list f$trueValues
The predictors are held in a \(n \times
Q\) data.frame xdata. The responses are a \(n \times S\) matrix ydata.
par(bty = 'n', mfrow = c(1,2), family='')
h <- hist( c(-1,f$y), nclass = 50, plot = F )
plot( h$counts, h$mids, type = 's', xlab='count', ylab='abundance' )
plot( f$w, f$y, cex = .2, xlab='w', ylab='y' )
A histogram of observed y (left) and plotted against latent w (right).
Here is a short Gibbs sampler to estimate parameters and fit the
data. The function GJAM needs the formula for the model,
the data.frame xdata, which includes the predictors, the
response matrix or data.frame ydata, and a
modelList specifying number of Gibbs steps ng,
the burnin, and typeNames.
ml <- list(ng = 1000, burnin = 100, typeNames = f$typeNames)
out <- gjam(f$formula, f$xdata, f$ydata, modelList = ml)
summary(out)
The print function also works:
print(out)
Among the objects to consider initially are the design matrix
out$inputs$x, response matrix out$inputs$y,
and the MCMC out$chains with these names and sizes:
summary(out$chains)
$chains is a list of matrices, each with ng
rows and as many columns as needed to hold parameter estimates. For
example, each row of $chains$bgibbs is a length-\(QS\) vector of values for the \(Q \times S\) matrix \(\mathbf{B}\), standardized for \(\mathbf{X}\). In other words, a coefficient
\(\mathbf{B}_{qs}\) has the units of
\(w_s\). $chains$sgibbs
holds either the \(S(S + 1)/2\) unique
values of \(\boldsymbol{\Sigma}\) or
the \(N \times r\) unique values of the
dimension reduced covariance matrix.
Additional summaries are available in the list
inputs:
summary(out$inputs)
The matrix classBySpec shows the number of observations
in each interval. For this example of continuous data censored at zero,
the two labels are \(k \in \{0, 1\}\)
corresponding to the intervals \((p_{s,0},
p_{s,1}] = (-\infty,0]\) and \((p_{s,1}, p_{s,2}) = (0, \infty)\). The
length-\((K + 1)\) partition vector is
the same for all species, \(\mathbf{p} =
(-\infty, 0, \infty)\). Here is classBySpec for this
example:
out$inputs$classBySpec
The first interval is censored (all values of \(y_{is}\) = 0). The second interval is not censored (\(y_{is} = w_{is}\)).
The fitted coefficients in $parameters, as summarized in
Table 2. For example, here is posterior mean estimate of unstandardized
coefficients \(\mathbf{B}_u\),
out$parameters$betaMu
head( out$parameters$betaTable )
The data are also predicted in GJAM, summarized by predictive means
and standard errors. These are contained in \(n \times Q\) matrices
$prediction$xpredMu and $prediction$xpredSd
and \(n \times S\) matrices
$prediction$ypredMu and $prediction$ypredSd.
These are in-sample predictions.
The output can be viewed with the function gjamPlot:
f <- gjamSimData(n = 500, S = 10, typeNames = 'CA') # simulate data
ml <- list(ng = 1000, burnin = 200, typeNames = f$typeNames) # inputs
out <- gjam(f$formula, f$xdata, f$ydata, modelList = ml) # fit model
pl <- list(trueValues = f$trueValues, PLOTALLY = T, GRIDPLOTS = T) # plot parameters
gjamPlot(output = out, plotPars = pl)
gjamPlot creates a number of plots comparing true and
estimated parameters (for simulated data). Here are true versus
estimated coefficients and correlation:
MCMC chains converge rapidly.
MCMC chains converge rapidly.
Fitted coefficients plotted against values used to simulate data
Predictions of responses in Y against true values.
Exercise 1: I have a data set with $S = $ hundreds of
species, but few observations. I know that estimating many species will
be difficult because it means that I have to estimate all values in the
\(S \times S\) covariance matrix.
Conduct simlation experiments to determine how the number of response
variables affects estimates of errors in beta coefficients. You can
simulate data sets in gjam as shown below, changing the value
S, and checking the standard errors in
output$parameters$betaSe.
f <- gjamSimData(n = 200, S = 3, typeNames = 'CA') #S species, continuous abundance data
ml <- list(ng = 2000, burnin = 500, typeNames = f$typeNames)
out <- gjam(f$formula, f$xdata, f$ydata, modelList = ml)
summary(out)
betaSe <- out$parameters$betaSe
Sonya has two scores that are assigned to patients that suffer with different levels of severity. One score is binary. The other score is ordinal. Ordinal scores have a meaningful rank, but they do not have a scale. In other words scores of 1, 2, 3 are increasing in magnitude, but the difference between 1 and 2 is not the same as the difference between 2 and 3.
In this case, I want a multivariate model for multiple data types. I want coefficients on the data scale (i.e., not a logit or some other scale). I want a residual covariance matrix (the multinomial imposes covariance, always negative). GJAM was developed for this purpose, written as part of this class.
Here is a simulated data set for \(Q =
3\) predictors (e.g., an intercept and two variables affecting
severity). The response types are binary (PA) and ordinal
(OC).
f <- gjamSimData(n = 200, Q = 3, S = 2, typeNames = c('PA', 'OC') )
plot( jitter( f$ydata[,1] ), jitter( f$ydata[,2] ), cex = .3, bty = 'n',
xlab = 'binary score', ylab = 'ordinal score' )
Here is the model fit, with coefficients and residual covariance matrix:
ml <- list(ng = 500, burnin = 50, typeNames = f$typeNames)
out <- gjam(f$formula, f$xdata, f$ydata, modelList = ml)
## ================================================================================================================================================================
summary(out)
##
## Sensitivity by predictor variables f:
## Estimate SE CI_025 CI_975
## x2 0.251 0.0869 0.0985 0.439
## x3 2.020 0.2600 1.5900 2.580
##
##
##
## Coefficient matrix B:
## S1 S2
## intercept 0.933 0.341
## x2 0.114 -0.215
## x3 0.479 1.680
## RMSPE 0.381 1.320
##
##
##
## Coefficient matrix B as table
##
## Last column indicates if 95% posterior distribution contains zero:
## Estimate SE CI_025 CI_975 sig95
## S1_intercept 0.933 0.1180 0.6960 1.1300 *
## S1_x2 0.114 0.0951 -0.0672 0.2970
## S1_x3 0.479 0.0935 0.3020 0.6530 *
## S2_intercept 0.341 0.1390 0.0623 0.6160 *
## S2_x2 -0.215 0.0832 -0.3750 -0.0506 *
## S2_x3 1.680 0.1330 1.4400 1.9500 *
##
##
##
## Variance contributions from model mean and random effects:
## S1 S2
## total variance 1.040 3.820
## mean fraction 0.249 0.751
## R2 0.226 0.730
##
##
##
## Design Table: VIF and correlations:
## x2 x3
## VIF 1.00 1
## factor 0.00 0
## x3 0.03 NA
##
##
##
## The sample contains n = 200 observations on S = 2 response variables. Data types (typeNames) include PA, OC. There are missing values in X and missing values in Y. The RMSPE is 0.975, and the DIC is 3076. Computation involved 500 Gibbs steps, with a burnin of 50.
btrue <- as.vector( f$trueValues$beta )
chain2density( out$chains$bgibbs, trueValues = btrue )
btab <- out$parameters$betaTable
plot( btrue, btab[,1], xlab = 'True value', ylab = 'Estimated',
ylim = range( btab[,3:4]), bty = 'n' )
segments( btrue, btab[,'CI_025'], btrue, btab[,'CI_975'] )
abline( 0, 1, lty = 2 )
chain2density( out$chains$sgibbs, trueValues = f$trueValues$sigma[c(1,2,4)] )
If the variables in the design x explain much of the
variation in two responses, then the residual covariance might be small.
However, if there is large residual covariance, then both responses
might be finding agreement for reasons that are not covered by
x.
Here is an example using BBS data, which we looked at earlier in the semester. In this case, I’m using data from the eastern US:
load('BBSeast.Rdata', verbose = T)
Here is a gjam analysis. First, I trim the matrix ydata
down to only those species that occur in at least 1000 observations:
tmp <- gjamTrimY(ydata, 1000) # at least 1000 observations
ytrim <- tmp$y # trimmed version of ydata
dim(ytrim)
Commuity-wide sensitivity to predictors in the model.
Sensitivity to developed cover types in BBS.
Communities based on responses to predictors in the model (left) and responses to predictors in the coefficient matrix (right).
I set up three objects:
reductList does dimension reduction on the covariance
matrix. With this many columns in ytrim, there may be more
coefficients in \(\Sigma\) than the
data can inform. Or it may be too large to invert. I reduce the size to
N effective rows and r effective columns.
modelList holds the number of MCMC iterations
ng, the burnin iterations, and the data type
'DA'.
formula is the model
reductList <- list(N = 20, r = 15) # S X S reduced to N X r
modelList <- list(ng = 2000, burnin = 500, typeNames='DA', reductList = reductList)
formula <- as.formula(~ winterTemp + therm + def + nlcd)
Predicted abundances in the object outBBS\(prediction\)ypredMu.
Here is a GJAM fit, will take some time:
outBBS <- gjam(formula, xdata, ydata = ytrim, modelList = modelList)
gjamPlot( outBBS, plotPars = list(PLOTALLY = T, GRIDPLOTS = T) )
Here is the dimension reduction, the full \(98 \times 98\) covariance matrix is reduced to 112 coefficients:
The small block below represents the number of coefficients estimated in the covariance matrix.
Here is the structure it creates in \(\Sigma\):
Dimension reduction structures the residual correlation matrix.
There are a large number of species. To help visualize output I
assign cluster codes to specific groups. This is a list
specColor, where I have organized some of the species into
groups that a non-specialist might recognize:
snames <- colnames(outBBS$inputs$y) # species names
S <- length(snames)
specGroups <- list( # 3 color groups
urban = c("EuropeanStarling","AmericanCrow","ChimneySwift",
"CommonGrackle","FishCrow","HouseFinch",
"HouseSparrow","NorthernMockingbird","HouseWren"),
raptor = c('CoopersHawk','RedtailedHawk','SwainsonsHawk',
'RedshoulderedHawk','BaldEagle','AmericanKestrel',
'Osprey','BarredOwl','GreatHornedOwl'),
wetland = c("Anhinga","DoublecrestedCormorant",
"Mallard","WoodDuck","CanadaGoose","WhiteIbis",
"RedwingedBlackbird","Mallard","WoodDuck","CanadaGoose",
"WhiteIbis","GreatBlueHeron","GreatEgret",
"LittleBlueHeron","CattleEgret","GreenHeron",
"YellowcrownedNightHeron","BeltedKingfisher")
)
specColor <- rep('black',S) # assign colors to groups
names(specColor) <- rep('others', S)
cc <- c('brown','red','blue')
for(j in 1:length(specGroups)){
wj <- which(snames %in% specGroups[[j]])
specColor[wj] <- cc[j]
names(specColor)[wj] <- names(specGroups)[j]
}
Look at the structure of specColor.
Inverse prediction of continuous predictors (climate) for BBS. Important variables are well-predicted by the bird assemblage.
Inverse prediction of land cover predictors (discrete) for BBS. Important variables are well-predicted by the bird assemblage.
Here are plots:
plotPars <- list(GRIDPLOTS=T, specColor = specColor, PLOTALLY = T)
gjamPlot(outBBS, plotPars)
Prediction is used to i) evaluate model fit (in-sample, out-of-sample), ii) impute missing values in x and y, and iii) project the fitted model to a (out-of-sample) prediction grid. Predictions marginalize the uncertainty in the fitted model. Consider a likelihood and distribution of parameters \([\theta]\). I can integrate the variation in parameters this way.
\[ [Y|X] = \int [Y | X, \theta] [\theta] d\theta \] You might recognize this to be the marginal likelihood of Bayes theorem, where \([\theta]\) is the prior distribution.
To predict from a fitted model, I marginalize the posterior distribution,
\[ [Y^* |X^*] = \int [Y^* | X^*, \theta] [\theta | X, Y] d\theta \] The factors in the integrand are the likelihood, now written for a new data set \((X^*, Y^*)\), and the posterior distribution \([\theta | X, Y]\). If the predictors in \(X^*\) are the same as \(X\), then I am predicting in-sample. If not, I am predicting out-of-sample.
For a hierarchical model, there is uncertainty in parameters, in [process| parameters], and in [data | process, parameters]. Suppose \(Z\) represents a latent state or process such that \(X, \theta \rightarrow Z \rightarrow Y\). Then prediction includes an additional stage for \(Z\), written like this:
\[ [Y^*] = \int [Y^* | Z, \theta] [Z| \theta, X, Y] [\theta | X, Y] d\theta dZ \] In this case, I am assuming that some of the parameters connect observed \((X, Y)\) to the process \(Z\) and others to the observations \(Y\).
When the model is fitted with MCMC, then this integration is done numerically. The MCMC chains are stored as iteration \(\times\) parameters matrix. To integrate numerically, I draw a random row from this matrix, evaluate the process model, and then the data model. By repeating this process I build up a distribution of predictions.
Inverse prediction involves inverting the model to predict \(X\). Here is a prediction grid for the BBS data fitted with GJAM,
newdata <- list(xdata = predGrid, nsim = 100)
p1 <- gjamPredict(outBBS, newdata = newdata)
yPredMu <- p1$sdList$yMu # predictive mean
yPredSe <- p1$sdList$yPe # predictive sd
Here is the distribution of soils in the prediction map:
par(mfrow=c(1,1), mar=c(1,1,1,1))
library('RColorBrewer')
palette(brewer.pal(8,'Accent'))
mapx <- range(predGrid$lon)
mapy <- range(predGrid$lat)
maps::map(boundary=T,col='grey',lwd=2,xlim=mapx,ylim=mapy)
levs <- levels(predGrid$soil)
scols <- match(predGrid$soil,levs)
points(predGrid[,'lon'],predGrid[,'lat'],col=scols,pch=15, cex=.6)
maps::map(boundary=T,add=T,interior=T,col='white',lwd=7)
maps::map(boundary=T,add=T,interior=T,col='grey',lwd=3)
legend('bottomright',levs,text.col=c(1:length(levs)),bg='white',
box.col='white', cex=.6)
In this block, I generate maps for random species, including the predictive coefficient of variation,
library(MBA)
par(mfrow=c(2,2), mar=c(2,2,2,1), bty='n') # 4 plot panels
ngrid <- 30 # density of map grid
colM <- colorRampPalette(brewer.pal(5,'YlOrRd')) # color ramp function
qlev <- seq(0,1,length=10) # 10 color levels
zlevs <- quantile(yPredMu,qlev) # equal quantiles per color level
zlevs <- zlevs[!duplicated(zlevs)] # in case some levels are empty
nz <- length(zlevs) - 1
colm <- colM(nz) # color ramp
ii <- sample(colnames(yPredMu), 4) # select species randomly
for(j in 1:2){
i <- ii[j]
values2contour(xx=predGrid[,'lon'],yy=predGrid[,'lat'],
z=yPredMu[,i],nx=ngrid,ny=ngrid,col=colm,
zlevs=zlevs,lwd=.1,add=F,fill=T)
maps::map(boundary=T,col='grey',lwd=2,xlim=mapx,ylim=mapy,add=T)
ydat <- outBBS$inputs$y[,i]*1000 # per 1000 hrs effort
ydat <- ydat/max(ydat,na.rm=T)
colj <- .getColor('darkblue',ydat)
symbols(xdata$Longitude, xdata$Latitude, circles=ydat, inches=F, add=T,
fg = colj, bg = colj)
title(i)
}
# coefficient of variation
for(j in 1:2){
i <- ii[j]
wi <- which(colnames(yPredMu) == i)
zj <- yPredSe[,i]/(yPredMu[,i] + .01)
values2contour(xx=predGrid[,'lon'],yy=predGrid[,'lat'],
z=zj,nx=ngrid,ny=ngrid,col=colm,
zlevs=slevs,lwd=.1,add=F,fill=T)
maps::map(boundary=T,col='grey',lwd=2,xlim=mapx,ylim=mapy,add=T)
ydat <- outBBS$inputs$y[,i]*1000 # per 1000 hrs effort
ydat <- ydat/max(ydat,na.rm=T)
colj <- .getColor('darkblue',ydat)
symbols(xdata$Longitude, xdata$Latitude, circles=ydat, inches=F, add=T,
fg = colj, bg = colj)
title('CV')
}
Predicted abundance for random species in BBS data.
The prediction errors (lower maps) increase where species are rarely observed.
Here is the example of leaf traits that we previously analyzed with a
multinomial logit. In GJAM I set the data type to composition count
(CC).
load( 'leafFIA.rdata', verbose=T )
formula <- as.formula( ~ moist*annualDef + winTmin + annualPrec )
modelList <- list(ng = 1000, burnin = 200, typeNames='CC')
out <- gjam(formula, xdata = data.frame( xcluster ),
ydata = yleaf, modelList = modelList)
gjamPlot( out, plotPars = list(PLOTALLY = T))
Predictions by category (sqrt scale).
As with the MV logit example, note that these variables provide limited predictive capacity for leaf type, and rare types are most poorly predicted.
Beta chains converged by 100 iterations.
Sigma chains chains converged by 100 iterations.
Because all sampling is direct, the chains converge quickly.
Responses by predictor.
These estimates do not look anything like those that we obtained with the MV logit. For example, GJAM shows positive responses to winter temperatures for broadleaf evergreens, as we would expect (they dominate in the South). The MV logit estimates a negative coefficient.
Overall sensitivity to predictors (across all classes).
The sensitivity plot summarizes the contributions of each predictor across all response variables. Local site moisture dominates in this analysis.
Recall that the multinomial models impose negative covariance. Note that that is not what we see with GJAM:
round( out$parameters$sigMu,4 )
## S1 S2
## S1 0.7933 -0.2276
## S2 -0.2276 1.0209
round( out$parameters$sigSe,5 )
## S1 S2
## S1 0.10710 0.11106
## S2 0.11106 0.16408
Like binary data, ordinal data do not have a scale. Unlike binary data, ordinal data have more than two classes, and they always have a definite order (binary data can by ‘yes’ versus ‘no’ or ‘female’ versus ‘male’). Because there are more than two classes, the scale has to be estimated (see description of data types above).
Here is an example with ordinal data:
f <- gjamSimData( S = 10, typeNames = 'OC')
ml <- list(ng = 2000, burnin = 500, typeNames = f$typeNames)
out <- gjam(f$formula, f$xdata, f$ydata, modelList = ml)
pl <- list(trueValues = f$trueValues)
gjamPlot(out, plotPars = pl)
The partition estimates with true values (dashed lines) for S species.
These different data types can be modeled in combination. Here is an example that combines discrete abundance (DA) counts, ordinal counts (OC, recall there is order, but no scale), count composition (CC), continuous abundance (CA), and presence-absence (PA).
types <- c('DA','DA','OC','OC','OC','OC','CA','CA')
f <- gjamSimData(S = length(types), typeNames = types)
ml <- list(ng = 2000, burnin = 500, typeNames = f$typeNames)
out <- gjam(f$formula, f$xdata, f$ydata, modelList = ml)
## ================================================================================================================================================================
pl <- list(trueValues = f$trueValues)
gjamPlot(out, plotPars = pl)
## ================================================================================
## simulated beta, corSpec vs betaMu, corMu (95%) -- return to continue
## cuts vs cutMu -- return to continue
## obs y vs predicted y -- return to continue
## x inverse prediction, covariates -- return to continue
## sensitivity over full model -- return to continue
## beta coefficient thinned chains -- return to continue
## correlation thinned chains -- return to continue
## beta standardized for W/X, 95% posterior -- return to continue
## beta standardized for W/X, 95% posterior -- return to continue
## beta standardized for W/X, 95% posterior -- return to continue
## beta standardized for W/X, 95% posterior -- return to continue
## 95% posterior -- return to continue
Exercise 2. A former student in this class had a survey
with one binary response (typeNames = 'PA') and three
ordinal responses (ordinal counts, 'OC'). Simulate his data
using gjamSimData and fit the model using
gjam. Using the documentation for GJAM:
What are the partition estimates and what role do they play?
Which are better predicted by the fitted model, the
PA data or the OC data? Why might they
differ?
Exercise 3. Select 10 species from the BBS data and do
a model selection (using DIC) to evaluate a limited number of variable
combinations. DIC can be found using the function summary
or in the object out$fit$DIC. How do the DIC values you get
when including different predictors compare with the inverse prediction
and sensitivity you get when all of them are inclulded in the model?
An observation consists of environmental variables and species attributes, \(\lbrace \mathbf{x}_{i}, \mathbf{y}_{i}\rbrace\), \(i = 1,..., n\). The vector \(\mathbf{x}_{i}\) contains predictors \(x_{iq}: q = 1,..., Q\). The vector \(\mathbf{y}_{i}\) contains attributes (responses), such as species abundance, presence-absence, and so forth, \(y_{is}: s = 1,..., S\). The effort \(E_{is}\) invested to obtain the observation of response \(s\) at location \(i\) can affect the observation.
A length-\(S\) vector \(\mathbf{w}\) represents response \(\mathbf{y}_i\) in continuous space. This continuous space allows for the dependence structure with a covariance matrix. An element \(w_{is}\) can be known (e.g., continuous response \(y_{is}\)) or unknown (e.g., discrete responses).
For discrete observations, \(k\) is a censored interval, and \(w_{is}\) is a latent variable. A length-\(S\) vector of integers \(\mathbf{z}_{i}\) labels \(\mathbf{y}_i\) in discrete space. Each observed \(y_{is}\) determine the interval \(z_{is} \in \{0,...,K_{is}\}\). The number of intervals \(K_{is}\) can differ between observations and between species, because each species can be observed in different ways.
The partition defines the relationship between discrete observations \(y_{is}\) and continuous \(w_{is}\). Continuous space is partitioned at points \(p_{is,z} \in{\mathcal{P}}\). The label assigned to \(y_{is}\) is \(z_{is}\). Two values \((p_{is,k}, p_{is,k+1}]\) bound the \(k^{th}\) interval of \(s\) in observation \(i\). Intervals are contiguous and provide support over the real line \((-\infty, \infty)\). The set of censored intervals is \(\mathcal{C}\). The partition set \(\mathcal{P}\) can include both known (discrete counts, including composition data) and unknown (ordinal, categorical) points.
An observation \(y\) maps to \(w\) as
\[y_{is} = \left \{ \begin{matrix} \ w_{is} & & continuous\\ \ z_{is} & w_{is} \in (p_{z_{is}}, p_{z_{is} + 1}] & discrete \end{matrix} \right.\]
As discussed above, effort \(E_{is}\) affects the partition for discrete data. For the simple case where there is no error in the assignment of discrete intervals, \(\mathbf{z}_i\) is known, and the model for \(\mathbf{w}_i\) is
\[\mathbf{w}_i|\mathbf{x}_i, \mathbf{y}_i, \mathbf{E}_i \sim MVN(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}) \times \prod_{s=1}^S\mathcal{I}_{is}\] \[\boldsymbol{\mu}_i = \mathbf{B}'\mathbf{x}_i\] \[\mathcal{I}_{is} = \prod_{k \in \mathcal{C}}I_{is,k}^{I(y_{is} = k)} (1 - I_{is,k})^{I(y_{is} \neq k)}\]
where \(I_{is} =I(p_{z_{is}} < w_{is} < p_{z_{is} + 1}]\), \(\mathcal{C}\) is the set of discrete intervals, \(\mathbf{B}\) is a \(Q \times S\) matrix of coefficients, and \(\boldsymbol{\Sigma}\) is a \(S \times S\) covariance matrix. There is a correlation matrix associated with \(\boldsymbol{\Sigma}\),
\[\mathbf{R}_{s,s'} = \frac{\boldsymbol{\Sigma}_{s,s'}}{\sqrt{\boldsymbol{\Sigma}_{s,s} \boldsymbol{\Sigma}_{s',s'}}}\]