Torture the data long enough and they will confess to anything.
Resources
Data file
BBSeast.Rdata
Readings
Brynjarsdottir, J. and A.E. Gelfand. 2014. Collective sensitivity analysis for ecological regression models with multivariate response. Journal of Agricultural, Biological, and Environmental Statistics 19, 481-502.
Chib, S and E Greenberg. 1998. Analysis of multivariate probit models. Biometrika 85, 347-361.
Clark, JS, D Nemergut, B Seyednasrollah, P Turner, and S Zhang. 2017. Generalized joint attribute modeling for biodiversity analysis: Median-zero, multivariate, multifarious data, Ecological Monographs 87, 34–56.
Objectives
Understand correlation structure and a joint distribution of responses
Execute and interpret a multivariate model
Overview of multivariate models and data
Multivariate models are built around a multivariate likelihood for a response vector. This likelihood describes a dependence structure in the response vector. This dependence exists after accounting for relationships in other parts of the model.
Multivariate modeling in environmental science has transitioned from descriptive to inferential. Descriptive methods have been popular for decades, most of which attempt to extract low-dimensional pattern from high-dimensional data. These methods tend to be algorithmic rather than model-based. For example, principal components analysis (PCA) concentrates covariance into a new set of orthogonal (uncorrelated) axes. Each axis is selected to maximize the residual information. There is one axis for each of the dimensions in the original data, but structured such that most information is contained in a few. Related descriptive methods are factor analysis, correspondence analysis, multidimensional scaling, discriminant analysis, and cluster analysis. There is a huge literature on these important tools for EDA.
Because Bayesian analysis is based on a model, I focus here on multivariate models that can be used to draw inference. The distributions I consider are listed below; a short example in R follows the list:
multivariate normal: a likelihood for continuous responses and a prior distribution for regression coefficients; this is a generalization of the normal distribution for univariate responses
inverse Wishart: a prior distribution for the covariance matrix; this is a generalization of the inverse gamma distribution for univariate responses
multinomial distribution: a likelihood for discrete responses; this is a generalization of the binomial distribution for univariate responses
categorical distribution: a likelihood for discrete responses when the number of trials is 1; this is a special case of the multinomial and a generalization of the Bernoulli distribution for a univariate response
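As a quick orientation, here is a small sketch in base R (plus the MASS package, an assumption on my part) illustrating three of these distributions:
library(MASS)                                  # for mvrnorm
Sigma <- matrix( c(1, .6, .6, 2), 2, 2)        # a 2 x 2 covariance matrix
mvrnorm(5, mu = c(0, 0), Sigma = Sigma)        # 5 draws from a bivariate normal
rmultinom(3, size = 10, prob = c(.5, .3, .2))  # 3 multinomial vectors, 10 trials each
rmultinom(3, size = 1,  prob = c(.5, .3, .2))  # categorical: a single trial per draw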
Dependent responses
Consider using a multivariate model when responses are related, they will be interpreted together, and there is residual variation at the likelihood stage.
Multivariate data are jointly distributed. Each observed response is treated as a vector. The elements of an observation vector are not modeled independently, because the likelihood must capture the covariance between them, after accounting for dependence in other parts of the model. For example, consider a continuous response variable modeled as a (univariate) regression,
\[y_i|\mathbf{x}_i \sim N(\mathbf{x}'_i\boldsymbol{\beta},\sigma^2)\] When a second response variable is measured at the same time/location, call it \(j\), the analyst might fit a regression model,
\[y_{ij}|\mathbf{x}_i \sim N(\mathbf{x}'_i\boldsymbol{\beta}_j,\sigma^2_j)\] where \(j\) can be response 1 or response 2, and \(\mathbf{x}_i\) and \(\boldsymbol{\beta}_j\) are length-\(Q\) vectors.
Note that the two variables can be viewed as part of the same response, because they were observed together, \(\mathbf{y}_i = (y_{i1}, y_{i2})'\). The two regressions could be collapsed into a single multivariate equation,
\[\mathbf{y}_i|\mathbf{x}_i \sim MVN(\boldsymbol{\beta} \mathbf{x}_i,\boldsymbol{\Sigma}) \] where \(\boldsymbol{\beta}\) is now a \(2 \times Q\) matrix, and the covariance matrix assumes independence,
\[ \Sigma = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix} \]
I have not actually changed anything, because the two responses have zero residual covariance. I know this because I see the zeros in the matrix. Still, the question becomes, should they be modeled independently?
There are several issues that might motivate a multivariate model. First, if I am interested in a set of responses, and they are related in some sense, then a multivariate model helps to ensure valid inference. If I want to make statements about significance of an effect (a value in \(\boldsymbol{\beta}\)), then I need the covariance matrix.
Conversely, if I am only interested in \(y_1\), and I have no reason to believe that it depends on \(y_2\) or if I am simply ignorant of \(y_2\), then I choose a univariate model for \(y_1\). Most effects of the environment are not measured, but this fact should not keep us from fitting models to what we can measure.
Second, suppose that I fit a multivariate model like the one above and estimate that the residual covariance elements \(\Sigma_{1,2} = \Sigma_{2,1}\) are close to zero. In this case, I have not really changed anything by modeling them jointly. Of course, \(y_{i1}\) and \(y_{i2}\) still depend on one another, because they both depend on \(\mathbf{x}_i\). However, if the mean structure of the model accounts for the dependence, then it will not show up in the residual covariance.
To convince myself of this fact, I do a simple experiment.
Empirical versus residual correlation
If responses depend on the same predictors \(\mathbf{X}\), then those responses should be correlated, despite the fact that they are simulated independently from one another. In this example I simulate three data sets independently, and I compare their correlation structure, with and without the effect of predictors (raw data versus residual from the regression):
n <- 200 # sample size
Q <- 4 # predictors
x <- matrix( rnorm(n*Q), n, Q) # design
x[, 1] <- 1
beta1 <- matrix( rnorm(Q), Q) # coefficients
beta2 <- matrix( rnorm(Q), Q)
beta3 <- matrix( rnorm(Q), Q)
sigma <- 1 # variances
y1 <- rnorm(n, x%*%beta1, sqrt(sigma)) # responses
y2 <- rnorm(n, x%*%beta2, sqrt(sigma))
y3 <- rnorm(n, x%*%beta3, sqrt(sigma))
res <- cbind(y1 - x%*%beta1, y2 - x%*%beta2, y3 - x%*%beta3) #residuals
cor(cbind(y1, y2, y3)) # empirical correlation
##           y1        y2        y3
## y1 1.0000000 0.6165583 0.5561943
## y2 0.6165583 1.0000000 0.7545621
## y3 0.5561943 0.7545621 1.0000000
cor(res)               # residual correlation
##             [,1]        [,2]        [,3]
## [1,]  1.00000000 -0.00587396  0.02990103
## [2,] -0.00587396  1.00000000 -0.11941155
## [3,]  0.02990103 -0.11941155  1.00000000
Notice, first, that y1, y2, and y3 can show substantial correlation (off-diagonal elements different from zero), despite having been simulated independently, because they all depend on x. Second, when the mean structure is removed (i.e., the contribution from \(\mathbf{X}\)), the correlation declines substantially. This is the “residual correlation”.
There are two points here. First, correlated responses do not need to be modeled jointly solely because they are correlated. If the mean structure of the model takes up the correlation structure, then univariate models (including credible intervals on coefficients) will offer nearly identical results. Second, even if two responses have residual correlation (the mean structure does not remove the correlation), we may still consider modeling them independently if we believe they have nothing to do with one another, and we do not intend to interpret their joint distribution.
Having said this, if there is residual correlation, and we have reason to expect dependence, then independent modeling will yield incorrect estimates of credible intervals, and it will preclude examination of indirect effects.
A simple MVN model
To examine the impact of ignoring the joint structure, I repeat the previous experiment, now with dependence, showing just two responses,
\[ \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix} \] where \(\sigma_{12} = \sigma_{21} = \rho_{12} \sigma_{1} \sigma_{2}\), with \(\rho_{12}\) being the correlation. I want to know how the non-zero covariances affect estimates, so I use simulation.
Simulating a MVN data set
Now I simulate the three responses jointly,
beta <- cbind(beta1, beta2, beta3) # coefficient matrix
S <- ncol(beta)
Q <- nrow(beta)
Sigma <- cov( .rMVN(5, 0, diag(S) ) ) # a cheap covariance matrix
xnames <- paste('x',1:Q,sep='') # label everything
ynames <- paste('y',1:S,sep='')
rownames(beta) <- colnames(x) <- xnames
colnames(beta) <- rownames(Sigma) <- colnames(Sigma) <- ynames
y <- x%*%beta + .rMVN(n, 0, Sigma) # simulated y have residual covariance
r <- y - x%*%beta # residuals
par(mfrow=c(2,2), mar=c(4,4,2,1))
plot( y, (x%*%beta), xlab='observed', ylab='predicted' )
plot( r[,1], r[,2] ); title( paste('r =', round(Sigma[1,2],2)) )
plot( r[,1], r[,3] ); title( paste('r =', round(Sigma[1,3],2)) )
plot( r[,2], r[,3] ); title( paste('r =', round(Sigma[2,3],2)) )
Predicted versus observed values for the simulated data with residual correlations.
Look at each of the objects generated by this code.
Unlike an empirical covariance matrix, a model covariance matrix must be not only symmetric but also positive definite or full rank.
One thing to note here is the generation of the covariance matrix, which I obtained by simulating MVN vectors. A positive definite covariance matrix is one that can be inverted, i.e., there is a solution to \(\Sigma^{-1}\). This inversion occurs in the solution for parameters. I cannot arbitrarily construct a symmetric matrix and assume it will be full rank. I have only \(S = 3\) responses here, but there could be 100. An easy way to generate a valid covariance matrix is to draw random data and take their covariance. Provided the number of vectors used to simulate \(\Sigma\) exceeds the number of response variables (here \(n = 5 > S = 3\)), it will be full rank. I now have a simulated data set.
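As a quick check (a sketch; the all-ones matrix is my own example), positive eigenvalues confirm that a symmetric matrix is positive definite and can be inverted:
SigmaBad <- matrix(1, 3, 3)              # symmetric but rank 1: not positive definite
eigen(SigmaBad)$values                   # two zero eigenvalues reveal the problem
SigmaOK  <- cov( .rMVN(5, 0, diag(3)) )  # simulated as above
eigen(SigmaOK)$values                    # all positive
solve(SigmaOK)                           # so the inverse exists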
Here are the estimates I would obtain if I modeled each column of the simulated data independently:
bhat <- numeric(0)
for(s in 1:S){
tmp <- summary( lm(y[,s] ~ x[,-1]) )$coefficients[,1:2]
bhat <- rbind(bhat,tmp)
}
blo <- bhat[,1] - 1.96*bhat[,2]
bhi <- bhat[,1] + 1.96*bhat[,2]
plot(beta, bhat[,1], ylim = range(c(blo,bhi)))
segments(beta, blo, beta, bhi)
abline(0,1)
Coefficients used to simulate the data compared with estimates from a univariate model.
The point estimates for coefficients are ok, but their standard errors are not. To see this I estimate coefficients for the multivariate model.
Fitting MVN data
Just as with the univariate model, I can fit the multivariate model with Gibbs sampling. The big change is that I must now sample the coefficients as a \(Q \times S\) matrix of values, and the parameter \(\sigma^2\) is replaced with a covariance matrix \(\Sigma\). The prior distribution is multivariate normal for the coefficients and non-informative for the covariance,
\[ MVN(vec(\boldsymbol{\beta}) | 0, \Sigma \otimes \mathbf{C}) \times |\Sigma|^{-(S + 2)/2} \] The new notation here is the \(vec(\cdot)\) operator and the Kronecker product \(\otimes\). The \(vec(\boldsymbol{\beta})\) operation takes the \(Q \times S\) matrix \(\boldsymbol{\beta}\) and stacks the columns into a length-\(SQ\) vector. Of course, a covariance for a vector of length \(SQ\) requires a \(SQ \times SQ\) covariance matrix. The Kronecker product generates a block matrix with \(S \times S\) blocks (\(\Sigma\) is \(S \times S\)), each of which is a \(Q \times Q\) matrix (\(\mathbf{C}\) is \(Q \times Q\)). If I label the blocks from 1 to \(S\), then block \((s, s')\) is \(\Sigma_{ss'}\mathbf{C}\). This is a type of separable matrix, which means that every element of \(\mathbf{C}\) in block \((s, s')\) is multiplied by the same scalar value (one element of \(\Sigma\)). In other words, this species \(\times\) covariate covariance matrix has no interactions between species and covariates a priori. However, there will be interactions a posteriori, induced by the data.
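To see the block structure, here is a small sketch with a 2 x 2 \(\Sigma\) and a 3 x 3 \(\mathbf{C}\) (arbitrary values of my own):
Sig <- matrix( c(1, .5, .5, 2), 2, 2)   # S = 2 responses
C   <- diag(3)                          # Q = 3 predictors
kronecker(Sig, C)                       # 6 x 6: block (s, s') is Sig[s, s'] * C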
I take the prior covariance \(\mathbf{C}\) to be non-informative.
Gibbs sampling is all direct from conditional posterior distributions. First I mention the inverse Wishart distribution. The prior distribution for the covariance matrix, \(|\Sigma|^{-(S + 2)/2}\), is conjugate with the \(MVN\) likelihood. The conditional posterior distribution is also inverse Wishart. This distribution takes as arguments a prior covariance matrix having the same dimensions as \(\Sigma\) and degrees of freedom, which must be larger than \(S + 1\). The larger the degrees of freedom, the more informative the prior (more weight on the prior). The conditional posteriors are admittedly ugly, but direct sampling is worth a bit of math:
\[ \begin{aligned} vec(\boldsymbol{\beta}) & \sim & MVN\left( vec( (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} ), \Sigma \otimes (\mathbf{X}'\mathbf{X})^{-1} \right) \\ \Sigma & \sim & IW( n - Q + S - 1,\mathbf{Y}' \mathbf{D} \mathbf{Y} ) \end{aligned} \] where \(\mathbf{D} = \mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\).
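The functions .updateBetaMVN and .updateWishart used below come from the course functions and are not reproduced here. As a rough sketch of what direct sampling from these two conditionals could look like (my own version, using the matrix-normal identity and base R's rWishart; not the course code):
updateBetaSketch <- function(x, y, Sigma){              # draw beta | Sigma, X, Y
  IXX  <- solve(crossprod(x))                           # (X'X)^{-1}
  bhat <- IXX%*%crossprod(x, y)                         # (X'X)^{-1} X'Y
  Z    <- matrix( rnorm(nrow(bhat)*ncol(bhat)), nrow(bhat), ncol(bhat) )
  bhat + t(chol(IXX))%*%Z%*%chol(Sigma)                 # vec ~ MVN(vec(bhat), Sigma x IXX)
}
updateSigmaSketch <- function(x, y, beta, df){          # draw Sigma | beta, X, Y
  R <- crossprod(y - x%*%beta)                          # residual cross-product
  solve( rWishart(1, df, solve(R))[,,1] )               # inverse Wishart draw
}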
To do this as a single multivariate regression I write a Gibbs sampler:
bg <- beta*0
sg <- Sigma*0
rownames(bg) <- colnames(x) <- xnames
colnames(bg) <- rownames(sg) <- colnames(sg) <- ynames
ng <- 2000 # setup Gibbs sampler
bgibbs <- matrix(0,ng,S*Q)
sgibbs <- matrix(0,ng,S*S)
predy <- matrix(0,ng,S*n) # predict the data
colnames(bgibbs) <- as.vector( outer(xnames,ynames,paste,sep='_') )
colnames(sgibbs) <- as.vector( outer(ynames,ynames,paste,sep='_') )
IXX <- solve(crossprod(x)) # only do this once
df <- n - Q + S - 1 # Wishart
for(g in 1:ng){
bg <- .updateBetaMVN(x,y,sg)
sg <- .updateWishart(x, y, df, beta=bg, IXX=IXX)$sigma
sgibbs[g,] <- sg
bgibbs[g,] <- bg
predy[g,] <- as.vector( .rMVN(n,x%*%bg,sg) ) # predicted y
}
Here are some plots of estimates and predictions:
par(mfrow=c(2,2), mar=c(4,4,1,1))
bmu <- colMeans(bgibbs)
bci <- apply(bgibbs, 2, quantile, c(.025,.975))
plot(as.vector(beta),bmu, xlab='true beta', ylab='estimate')
segments(as.vector(beta), bci[1,], as.vector(beta), bci[2,])
abline(0,1)
smu <- colMeans(sgibbs)
sci <- apply(sgibbs, 2, quantile, c(.025,.975))
plot(as.vector(Sigma),smu, xlab='true Sigma', ylab='')
segments(as.vector(Sigma), sci[1,], as.vector(Sigma), sci[2,])
abline(0,1)
ymu <- matrix( colMeans(predy),n,S )
plot(y,ymu)
abline(0,1)
yci <- apply(predy, 2, quantile, c(.025,.975))
segments(as.vector(y), yci[1,], as.vector(y), yci[2,])
Estimates and predictions from the multivariate model.
This analysis can be done compactly in gjam:
library(gjam)
ml <- list(ng = 2000, burnin = 500, typeNames = 'CON')
form <- as.formula(~ x2 + x3 + x4)
out <- gjam(form, data.frame(x), y, modelList = ml)
## ================================================================================================================================================================
##
## Sensitivity by predictor variables f:
## Estimate SE CI_025 CI_975
## x2 3.54 0.413 2.80 4.46
## x3 8.96 0.635 7.73 10.20
## x4 12.20 1.270 9.90 14.80
##
## Coefficient matrix B:
## y1 y2 y3
## intercept 0.626 -0.993 0.494
## x2 -0.602 -0.525 -1.180
## x3 0.486 0.359 -0.945
## x4 -0.471 -1.820 -3.130
## RMSPE 1.800 0.721 1.090
##
## Coefficient matrix B:
## Estimate SE CI_025 CI_975 sig95
## y1_intercept 0.626 0.0927 0.438 0.805 *
## y1_x2 -0.602 0.0987 -0.783 -0.409 *
## y1_x3 0.486 0.0973 0.292 0.674 *
## y1_x4 -0.471 0.0920 -0.644 -0.289 *
## y2_intercept -0.993 0.0372 -1.060 -0.917 *
## y2_x2 -0.525 0.0389 -0.604 -0.448 *
## y2_x3 0.359 0.0397 0.283 0.438 *
## y2_x4 -1.820 0.0374 -1.890 -1.740 *
## y3_intercept 0.494 0.0539 0.389 0.596 *
## y3_x2 -1.180 0.0606 -1.300 -1.070 *
## y3_x3 -0.945 0.0581 -1.060 -0.828 *
## y3_x4 -3.130 0.0548 -3.240 -3.020 *
##
## Last column indicates if 95% posterior distribution contains zero.
##
## Coefficient matrix B, standardized for X:
## NULL
##
## Last column indicates if 95% posterior distribution contains zero.
##
## Coefficient matrix B, standardized for X and W:
## NULL
##
## Last column indicates if 95% posterior distribution contains zero.
##
## Sample contains n = 200 observations on S = 3 response variables. Data types (typeNames) include CON. There are 0 missing values in X and 0 missing values in Y. The RMSPE is 1.28, and the DIC is 2542. Computation involved 2000 Gibbs steps, with a burnin of 500.
# repeat with ng = 5000, burnin = 500, then plot data:
trueValues <- list(sigma = Sigma, beta = beta, corSpec = cov2cor(Sigma))
pl <- list(trueValues = trueValues)
gjamPlot(out, plotPars = pl)
## ================================================================================
## simulated beta, corSpec vs betaMu, corMu (95%) -- return to continue
## obs y vs predicted y -- return to continue
## x inverse prediction, covariates -- return to continue
## sensitivity over full model -- return to continue
## beta coefficient thinned chains -- return to continue
## correlation thinned chains -- return to continue
## beta standardized for W/X, 95% posterior -- return to continue
## beta standardized for W/X, 95% posterior -- return to continue
## beta standardized for W/X, 95% posterior -- return to continue
## 95% posterior -- return to continue
Here is a comparison of coefficients from the univariate model and gjam:
Q <- 4 # intercept plus three predictors
betaMu <- out$parameters$betaMu
betaSe <- matrix(out$parameters$betaTable[,2], Q, S)
bUniMu <- matrix(bhat[,1], Q, S)
bUniSe <- matrix(bhat[,2], Q, S)
par(mfrow=c(S,2),bty='n',mar=c(4,4,1,1))
for(j in 1:S){
plot(betaMu[,j], bUniMu[,j], pch=15); abline(0,1,lty=2)
plot(betaSe[,j], bUniSe[,j], pch=15); abline(0,1,lty=2)
}
Coefficients and standard errors for the univariate and multivariate model.
The mean estimates are okay (left side), but the standard errors differ (right side). Why does the univariate model fail to generate correct standard errors? Recall that the conditional posterior for the univariate model has a contribution from the likelihood to the covariance,
\[ Var(\mathbf{\beta})|\sigma^2 = \sigma^2(\mathbf{X}'\mathbf{X})^{-1} \] For the multivariate model it changes to
\[ Var(vec(\mathbf{\beta}))|\Sigma = \Sigma \otimes (\mathbf{X}'\mathbf{X})^{-1} \] This conditional relationship reveals an important role for the covariance, directly affecting uncertainty of coefficients.
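As a quick numerical check, here is a sketch that reuses objects from the simulation above; the conditional standard deviations of the coefficients are the square roots of the diagonal of \(\Sigma \otimes (\mathbf{X}'\mathbf{X})^{-1}\):
condSd <- sqrt( diag( kronecker(Sigma, solve(crossprod(x))) ) )
matrix(condSd, Q, S)   # one column of conditional sd values per response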
Exercise 1: Conduct simulation experiments to determine how the number of response variables affects estimates of the coefficients and covariance matrix.
Discrete data
Multivariate discrete observations are typically modeled with a multinomial, or its special case for a single trial, the categorical distribution. A multinomial distribution might be used to describe eye color in members of our class (say, \(n = 15\)),
\[\mathbf{y}_i \sim multinom(n, [\theta_{i1}, \dots, \theta_{iR}])\] where \(R\) is the number of classes. There are \(R - 1\) free classes, because the last one is redundant with the others. For the special case of the binomial (\(R = 2\)), there is only one free class, because the second class, having probability \(\theta_2 = 1 - \theta_1\), is fully determined by the first.
In this case, \(\mathbf{y}_i\) is a length-\(R\) vector of counts (e.g., for number of brown, blue, green) that sum to \(n\), with a corresponding length-\(R\) vector of probabilities \(\boldsymbol{\theta}_i\) that sum to 1. A categorical distribution might be used to describe the eye color of one individual selected at random from the class.
To draw random vectors from the multinomial I can use the R function rmultinom.
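For the eye-color example above, a single draw might look like this (the probabilities are made up):
rmultinom(1, size = 15, prob = c(.6, .25, .15))  # counts of brown, blue, green in a class of 15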
The drawback with this function is that it accepts only a single probability vector \(\boldsymbol{\theta}\) that is applied to all observations. I rarely use this function, because I typically have a model, which means that there is a vector \(\boldsymbol{\theta}_i\) for each of the \(i = 1, \dots, m\) observations \(\mathbf{y}_i\). Here is an example using myrmultinom:
theta <- matrix( runif(m*3, 0, 1), m, 3) # random vectors
theta <- sweep( theta, 1, rowSums(theta), '/' ) # normalization is not actually necessary
y <- myrmultinom(n, theta)
This function is useful when I have a model for \(\boldsymbol{\theta}_i\).
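The function myrmultinom comes from the course functions and is not shown here. A minimal sketch of a row-wise sampler with this behavior (one probability vector per observation; my own version) might be:
myrmultinomSketch <- function(size, theta){       # one multinomial draw per row of theta
  t( sapply( 1:nrow(theta), function(i)
       rmultinom(1, size = size[i], prob = theta[i,]) ) )
}
m     <- 10                                       # observations
size  <- sample(5:20, m, replace = TRUE)          # trials per observation
theta <- matrix( runif(m*3), m, 3 )               # rmultinom normalizes each row internally
myrmultinomSketch(size, theta)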
Multinomial logistic regression
Again, because the multinomial has one less response variable than there are classes (they sum to fixed \(n_i\) for \(R\) classes), a multinomial regression has \(R-1\) independent responses. Each looks like a logit model:
\[ \begin{align} \log \frac{[y_i = 1]}{[y_i = R]} &= \boldsymbol{\beta}_1 \mathbf{x}'_i \\ \log \frac{[y_i = 2]}{[y_i = R]} &= \boldsymbol{\beta}_2 \mathbf{x}'_i \\ & \vdots \\ \log \frac{[y_i = R-1]}{[y_i = R]} &= \boldsymbol{\beta}_{R-1} \mathbf{x}'_i \end{align} \] The coefficient vector for response \(r\) is the vector \(\boldsymbol{\beta}_r = (\beta_{r1}, \dots, \beta_{rQ})'\), where \(Q\) is the number of predictors in \(\mathbf{x}_i\). This vector is not the response for \(r\) but rather the response relative to the response of the reference class \(R\). This relationship, together with the non-linear link function, makes parameters hard to interpret. Here are the probabilities for classes:
\[ \begin{align} [y_i = 1] &= \frac{exp(\beta_1 \mathbf{x}'_i)}{1 + \sum_{r=1}^{R-1}exp(\beta_r \mathbf{x}'_i)} \\ & \vdots \\ [y_i = R-1] &= \frac{exp(\beta_{R-1} \mathbf{x}'_i)}{1 + \sum_{r=1}^{R-1}exp(\beta_{r} \mathbf{x}'_i)} \end{align} \] For \(R = 2\) this becomes the simple logit model. Here is a data simulation and model fitting with Gibbs sampling. First I simulate data
n <- 500
k <- 3 # inputs
r <- 5 # response classes
size <- 1 + sample(10,n,replace=T) # up to 10 objects per observation
b <- bmultiProp(r,k)$cc
loX <- c(1,-1,-1) # range of covariates, length = k
hiX <- c(1,1,1)
LIKE <- 'multinom' # 'norm', 'pois', 'binom', or 'multinom'
x <- simX(n,loX,hiX) # simulate x and y
y <- simY(x, b, LIKE, r, size, 0)
Here is a Gibbs sampler:
priorB <- as.vector(b*0) # prior means
priorVB <- diag(1000,k*(r-1)) # prior covariance
priorIV <- solve(priorVB) # prior inv covariance
ng <- 5000
tmp <- gibbsLoop(LIKE, ng, x, y, b,
              priorB=priorB, priorVB=priorVB, burnin=1000)
## [1] 0
## ================================================================================
## $summary
## estimate se 0.025 0.975 true value
## q-1_s-1 -0.053710 0.03078 -0.11890 0.0003714 -0.094030
## q-2_s-1 0.134000 0.04680 0.02535 0.2159000 0.233200
## q-3_s-1 -0.565400 0.04412 -0.66440 -0.4944000 -0.440000
## q-1_s-2 -0.035450 0.04351 -0.12260 0.0426800 -0.107900
## q-2_s-2 0.094310 0.05958 -0.02314 0.2085000 0.004535
## q-3_s-2 0.396000 0.05403 0.30350 0.5002000 0.319400
## q-1_s-3 0.143500 0.02880 0.08304 0.1959000 0.199700
## q-2_s-3 0.050410 0.06436 -0.06110 0.1950000 -0.003658
## q-3_s-3 -0.585600 0.04555 -0.67150 -0.4918000 -0.567800
## q-1_s-4 0.811100 0.03273 0.74710 0.8763000 0.747200
## q-2_s-4 -0.007418 0.03646 -0.08138 0.0602300 -0.069010
## q-3_s-4 -0.114000 0.05963 -0.23850 0.0094420 -0.213100
Here are predictions with observed:
Predicted and observed data from the multinomial logit.
With leaf habit
Leaf habit is one of the most important trait adaptations for variation in climate and habitat. Leaves can be thin and expendable or thick, durable, and long-lived. The type of leaf that a species invests in can tell us much about the environment in which it lives and how that species copes with it. A full description of trait analysis with these FIA data is available in Clark et al. (2017). I used GJAM there, whereas here I provide an example with the multinomial logit.
Leaf habit is an example of a categorical variable. In this example, I look at the distribution of leaf types across FIA plots. I have counted the number of individuals on each aggregate plot that possess each of ** leaf types. These are the columns in matrix yleaf:
Each row of yleaf is a multinomial observation, each with a different size (number of trials), because different numbers of trees occur on each plot.
The predictors in the model are held in a matrix x. They include the site moisture status (moist), winter minimum temperature (winTmin), annual precipitation (annualPrec), and annual moisture deficit (annualDef).
Here is a Gibbs sampler for a multinomial logit distribution. It needs many iterations, because this is a large data set and the coefficients must be sampled indirectly (with Metropolis).
x <- model.matrix(~ moist*annualDef + winTmin + annualPrec,
data = data.frame(xcluster) )
r <- ncol(yleaf)
k <- ncol(x)
b <- matrix(0, k, r-1)
rownames(b) <- colnames(x)
colnames(b) <- colnames(yleaf)[-ncol(yleaf)]
priorB <- as.vector(b*0) # prior means
priorVB <- diag(1000,k*(r-1)) # prior covariance
priorIV <- solve(priorVB) # prior inv covariance
ng <- 10000
tmp <- gibbsLoop(LIKE = 'multinom', ng, x, yleaf, b,
priorB=priorB, priorVB=priorVB, burnin=4000)
par( mfrow=c(2,3), mar=c(4,4,2,1), bty='n' )
for(j in 1:ncol(yleaf)){
plotObsPred(sqrt(yleaf[,j]),sqrt(tmp$ymean[,j]) )
abline(0, 1, lty=2)
abline( h = mean(yleaf[,j], na.rm=T), lty=2)
title(colnames(yleaf)[j])
}
Predicted versus observed leaf traits from the multinomial logit model.
Note that these covariates do not do a good job of predicting leaf habit. It will always be especially hard to predict rare types (e.g., broad_evergreen).
Here are the responses to each predictor:
par( mfrow=c(2,3), mar=c(4,4,2,1), bty='n' )
for(j in 2:ncol(x)){
wj <- grep(colnames(x)[j], colnames(tmp$bgibbs))
if( j %in% c(2,3) )wj <- wj[ !wj %in% grep(':', colnames(tmp$bgibbs)) ]
xlim <- range(tmp$bgibbs[,wj])
ylim <- c(0, 50/diff(xlim) )
plot(NA, xlim = xlim, ylim=ylim, xlab='Parameter value', ylab='Density')
title(colnames(x)[j])
for(i in 1:length(wj)){
xij <- density(tmp$bgibbs[,wj[i]], n = 50)
lines(xij$x, xij$y, lwd=2, col=i)
abline(v = 0, lty=2)
}
}
legend('topleft', colnames(yleaf)[-r], text.col = 1:r, bty='n',
       cex=.9)
Posterior densities for the predictors for each leaf type. Note that there are no estimates for the reference class, scalelike_evergreen.
The bumpiness of these plots indicates that more mixing is needed (bigger ng). The indirect Metropolis sampling in this algorithm will typically be difficult to tune for efficiency.
There are a number of important methods built on the multinomial, including the multinomial logit and probit models. There are Bayesian implementations of these methods that are worth consideration when data are discrete.
Some limitations
There are three reasons why I turn immediately to an alternative approach. First is the issue of covariance. Just as in the continuous case, discrete data can have correlation in outcomes. However, unlike the multivariate normal distribution, which has a covariance matrix, the multinomial distribution has only a probability vector. In other words, the distribution itself imposes the covariance in outcomes. The variance in outcome \(s\) is \(n \theta_s (1 - \theta_s)\). The covariance between outcomes \(s\) and \(s'\) is \(-n \theta_s \theta_{s'}\). If you are unconvinced, try this:
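The code that produced the output below is not shown. Here is a sketch that roughly reproduces it, assuming \(n = 100\) trials and \(\boldsymbol{\theta} = (0.5, 0.3, 0.2)\) (values inferred from the theoretical matrix shown further down):
theta <- c(.5, .3, .2)
size  <- 100
ysim  <- t( rmultinom(5000, size = size, prob = theta) )  # one draw per row
cov(ysim)                                                 # empirical covariance of the draws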
## [,1] [,2] [,3]
## [1,] 25.610054 -15.720116 -9.889938
## [2,] -15.720116 21.732897 -6.012781
## [3,] -9.889938 -6.012781 15.902719
and compare with this:
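Continuing the sketch, the theoretical covariance under the same assumed \(\boldsymbol{\theta}\) and \(n\):
size*( diag(theta) - theta%o%theta )   # n*theta_s*(1 - theta_s) on the diagonal, -n*theta_s*theta_s' off it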
## [,1] [,2] [,3]
## [1,] 25 -15 -10
## [2,] -15 21 -6
## [3,] -10 -6 16
The negative covariance is imposed by the compositional nature of the data: the \(y_{is}\) sum to \(n\), so increasing any one tends to decrease the others. Multinomial data are also called composition data. I call this a disadvantage because, regardless of whether or not two microbial taxa tend to occur together, the multinomial distribution used for count data will impose negative covariance.
A second limitation comes from the difficulty of interpreting coefficients when a non-linear link function is used to model count data with a continuous intensity or probability model. Non-linear link functions are used in GLMs to ensure positive \((0, \infty)\) intensity or \((0, 1)\) probability. The problem comes from interpreting coefficients (and covariances) on these non-linear scales. For example, the covariance fitted on a non-linear scale is not a covariance between outcomes on the observation scale.
Still another reason that I tend not to use models based on the multinomial is that data commonly consist of combinations of continuous and discrete variables. In the social sciences, a response could consist of income (continuous) and political party (discrete). In the health sciences, coronavirus symptoms can be discrete (cough, tiredness, difficulty breathing) and continuous (fever).
Real data tend to be more complicated than simply continuous versus discrete. In ecology, observations on plant health can include nominal classes for herbivore damage, ordinal classes for abundance, or combinations of continuous and discrete in the same variable: a discrete zero or detection limit for chemical constituents, continuous positive values, and a discrete maximum detection limit. gjam was developed (partly in this class) to address the common problems I typically encounter with multivariate data. These issues are summarized in the next section.
GJAM for multivariate data
Generalized Joint Attribute Modelling gjam models multivariate responses that can be combinations of discrete and continuous variables, where interpretation is needed on the observation scale. It was motivated by the challenges of modeling distribution and abundance of multiple species, so-called joint species distribution models (JSDMs). Because the different attributes that make up a multivariate response are often recorded on many different scales and with different levels of sampling effort, analyses often default to the least informative common-denominator, binary or ‘presence-absence’.
Combining species and other attributes in a single analysis is challenging because the attributes are recorded in different ways. Some species groups are counted. Some may be continuous cover values or basal area. Some may be recorded in ordinal bins, such as ‘rare’, ‘moderate’, and ‘abundant’. Others may be presence-absence. Some are composition data, either fractional (continuous on (0, 1)) or counts (e.g., molecular and fossil pollen data). Attributes such as body condition, infection status, and herbivore damage are often included in field data. GJAM accommodates multifarious observations.
As discussed in the previous section, a covariance matrix allows me to see the relationships between responses that are left after I remove the mean structure. For a simple linear model this is easy, because the responses are real numbers on \((-\infty, \infty)\). I cannot do this with counts unless I impose a non-linear link function, e.g., the link functions used in GLMs. The problem is that the coefficients no longer have a simple interpretation. For example, the fitted covariance matrix is not the covariance between species, but rather a non-linear transformation of it.
To allow transparent interpretation gjam avoids non-linear link functions. This is a departure from generalized linear models (GLMs) and most hierarchical Bayes models.
Censoring and observation effort
In GJAM, the integration of discrete and continuous data on the observed scales makes use of censoring. Censoring extends a model for continuous variables across censored intervals. Continuous observations are uncensored. Recall, the Tobit model was an example of censoring for discrete zeros. Censored observations are discrete and can depend on sample effort.
Censoring is used with the effort for an observation to combine discrete variables with appropriate weight. In count data, effort is determined by the size of the sample plot, search time, or both. It is comparable to the offset in GLMs. In count composition data (e.g., microbiome, fossil pollen), effort is the total count taken over all species. In GJAM, discrete observations can be viewed as censored versions of an underlying continuous space.
The partition and sample effort
GJAM achieves effort-based weighting through a partition of the continuous scale. Where effort \(E = 1\), the partition for discrete counts \(0, 1, 2, \dots\) begins at \(-\infty\), followed by midpoints between count values, \(\mathbf{p} = (-\infty, 1/2, 3/2, \dots)\). For \(y_{is} = k\) the interval is thus \((p_{i,k}, p_{i,k+1}] = (k - 1/2, k + 1/2]\). When effort varies between observations the partition shifts to the ‘effort scale’,
\[\begin{equation} \label{eqn:pik} \mathbf{p}_{ik} = \left(\frac{k - 1/2}{E_{i}}, \frac{k + 1/2}{E_{i}}\right] \end{equation}\] If observations are animals counted per hour, \(E_i\) can be search time. If observations are benthic organisms per sediment core, \(E_i\) can be core volume. If observations are seedlings per plot, then \(E_i\) can be the area of plot \(i\). Because plots have different areas one might choose to model \(w_{is}\) on a ‘per-area’ scale (density) rather than a ‘per-plot’ scale.
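As a small worked example (the numbers are mine): a count of \(y = 3\) birds observed with effort \(E = 0.5\) hours is censored to this interval on the birds-per-hour scale:
k <- 3; E <- 0.5
c( (k - 1/2)/E, (k + 1/2)/E )   # the latent w lies in (5, 7]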
GJAM uses this partition with a MVN model for the continuous scale. Disadvantages of models based on the multinomial or GLMs are remedied by the GJAM treatment based on censoring. First, there is a covariance \(\Sigma\) fitted to data: a fixed covariance is not imposed by the multinomial distribution. Second, the scale is easily interpreted. If effort is omitted (\(E = 1\)), then coefficients and covariance are interpreted on the observation scale (e.g., per plot, per interval). If effort is included, the scale is ‘per effort’ (e.g., per hectare, per minute). Coefficients and covariance are interpreted on this scale. Finally, the different data types are combined. For biodiversity, where birds come from point counts, plants from fixed-area plots, microbes from sequencing, and mammals from live traps, there is a model that combines them. In the next section I summarize the main data types.
Data types
(CA) data can be concentration, biomass, density, basal area, leaf area, cover, and so on, all of which are continuous above zero, but have discrete zero.
Table 1. Partition for each data type
| typeNames | Type | Obs values | Default partition | Comments |
|---|---|---|---|---|
| 'CON' | continuous, uncensored | \((-\infty, \infty)\) | none | e.g., centered, standardized |
| 'CA' | continuous abundance | \([0, \infty)\) | \((-\infty, 0, \infty)\) | with zeros |
| 'DA' | discrete abundance | \(\{0, 1, 2, \dots \}\) | \((-\infty, \frac{1}{2E_{i}}, \frac{3}{2E_{i}}, \dots , \frac{max_s(y_{is}) - 1/2}{E_i}, \infty)^1\) | e.g., count data |
| 'PA' | presence-absence | \(\{0, 1\}\) | \((-\infty, 0, \infty)\) | unit variance scale |
| 'OC' | ordinal counts | \(\{0, 1, 2, \dots , K \}\) | \((-\infty, 0, estimates, \infty)\) | unit variance scale, imputed partition |
| 'FC' | fractional composition | \([0, 1]\) | \((-\infty, 0, 1, \infty)\) | relative abundance |
| 'CC' | count composition | \(\{0, 1, 2, \dots \}\) | \((-\infty, \frac{1}{2E_{i}}, \frac{3}{2E_{i}}, \dots , 1 - \frac{1}{2E_i}, \infty)^1\) | relative abundance counts |
| 'CAT' | categorical | \(\{0, 1\}\) | \((-\infty, max_{k}(0, w_{is,k}), \infty)^2\) | unit variance, multiple levels |
(DA) data arise from counts, which are often not well described by standard distributions such as the Poisson or negative binomial, which perform poorly when zeros are common. Again, the multinomial distribution induces a negative covariance. When the total count in the multinomial distribution is related to abundance, a separate model is needed for this total. By treating observed counts as a censored version of true abundance, GJAM accommodates effort, and parameters can be interpreted on the observation scale or the effort scale.
(PA) data include only two categories, \(\{0, 1\}\). The multivariate probit model of Chib and Greenberg (1998) is a special case of GJAM for PA data, where both intervals are censored.
(OC) data are collected where abundance must be evaluated rapidly, where precise measurements are difficult, or absolute scales are difficult to apply. Because there is no absolute scale the partition must be inferred. Consider the ordinal scale represented by these labels: (absent, rare, intermediate, abundant). The sample partition is \(\mathbf{p}_{s} = (-\infty, 0, p_{s,2},p_{s,3}, \infty)\), where elements 2 and 3 are estimated. The zero anchors location, and unit variance imposes a scale.
Composition (FC, CC) data may be continuous fractions with a sum-to-one constraint (fractional composition) or discrete counts. Both have an interpretation on the relative abundance \([0,1]\) scale, and both require point mass at zero and one. Due to the sum-to-one (fractional composition) or sum-to-\(E_i\) (count composition) constraint, there is information on only \(S - 1\) columns in \(\mathbf{Y}\).
(CC) data are composition data reported as the numbers of each species counted. Composition counts are only meaningful in a relative sense; they provide no information on absolute abundance. The total count for a sample is the effort \(E_i = \sum_{s}{y_{is}}\). Common examples include molecular sequence data, paleoecology, and fungal assays. In paleoecology total counts can differ widely between observations (www.neotomadb.org). The number of DNA sequence reads in microbiome data can range over orders of magnitude. A practice that is widespread in the microbiome community rarefies count data to achieve approximate equity between samples. This amounts to a massive manipulation of data that can throw away vast amounts of information. Alternative model-based approaches applied to counts are limited to single taxa. A multinomial model with second-stage covariance is not on the data scale. Moreover, the dominance of zeros in microbiome data limits application of most approaches.
GJAM accommodates the discrete observations and the underlying relative abundance scale. A sample count can take values \(y_{is} \in \{0,1, 2, \dots \}\), with \(E_i\) being the total count for sample \(i\). The partition segments the \([0, 1]\) composition scale according to effort while allowing for zeros. Small samples have wide bins and, thus, high variance and low weight.
(FC) data arise in many ways; examples include the fraction of a photoplot or remotely sensed image occupied by each species or cover type, the fraction of leaves lost to different types of herbivory, or stream or foliar chemistry. The correlations between responses are distorted when estimated on the multivariate logit scale. Still more problematic, the logit scale does not admit zeros, which are common in composition data. In GJAM a FC observation is represented in continuous space and censored at 0 (absent species) and 1 (monoculture).
A sample may have multiple composition groups. For example, \(\mathbf{Y}\) may include both soil and endophytic microbiome data, each with its own total count (effort). Let \(G\) be the number of composition groups. If there are \(L_g\) response variables for a given FC or CC group \(g\), then there are \(L_g - 1\) non-redundant columns in \(\mathbf{Y}\) for group \(g\). A sample includes information on the total number of non-redundant columns, \(S = \sum_g L_g - G\). A link function provides support over the real line for composition data, while providing estimates on the observation scale (Appendix S1).
(CAT) data describe unordered categories. If observation \(i\) refers to a sample plot, and the response \(s\) is a cover-type variable, then it might be assigned to one of several categories \(k\), such as ‘tidal flat’, ‘low marsh’, or ‘high marsh’. If it refers to a sample plant, and a response is growth habit, it might be assigned one of four categories: ‘herb’, ‘graminoid’, ‘shrub’, or ‘tree’. These are multinomial responses. Like composition data, a categorical response \(s\) occupies as many columns in \(\mathbf{Y}\) as there are non-redundant levels, \(K_s - 1\), because the \(K_s\) columns sum to 1. The observed category is the one having the largest value of \(w_{is,k}\) for response \(s\).
Implementing GJAM
The different types of data that can be included in the model are summarized in Table 1 and assigned to the character variable typeNames that is included in the modelList passed to gjam.
Consider a sample of size \(n = 500\) for \(S = 10\) species and \(Q = 4\) predictors. To indicate that all species are continuous abundance data I specify typeNames as 'CA':
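The simulation call is not shown above; it presumably resembles the call used later in this section, with \(Q = 4\) predictors:
library(gjam)
f <- gjamSimData(n = 500, S = 10, Q = 4, typeNames = 'CA')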
The object f includes elements needed to analyze the simulated data set. f$typeNames is a length-\(S\) character vector. The formula follows standard R syntax. It does not start with y ~, because gjam is multivariate. The multivariate response is supplied as a \(n \times S\) matrix or data.frame ydata. The formula for this example is stored in f$formula.
The simulated parameter values are returned from gjamSimData in the list f$trueValues.
The predictors are held in a \(n \times Q\) data.frame xdata. The responses are a \(n \times S\) matrix ydata.
par(bty = 'n', mfrow = c(1,2), family='')
h <- hist(c(-1,f$y), nclass = 50, plot = F)
plot(h$counts,h$mids, type = 's', xlab='count', ylab='abundance')
plot(f$w,f$y,cex = .2, xlab='w', ylab='y')
A histogram of observed y (left) and plotted against latent w (right).
Here is a short Gibbs sampler to estimate parameters and fit the data. The function gjam needs the formula for the model, the data.frame xdata, which includes the predictors, the response matrix or data.frame ydata, and a modelList specifying the number of Gibbs steps ng, the burnin, and typeNames.
ml <- list(ng = 1000, burnin = 100, typeNames = f$typeNames)
out <- gjam(f$formula, f$xdata, f$ydata, modelList = ml)
summary(out)
The print function also works.
Among the objects to consider initially are the design matrix out$inputs$x, response matrix out$inputs$y, and the MCMC out$chains with these names and sizes:
$chains is a list of matrices, each with ng rows and as many columns as needed to hold parameter estimates. For example, each row of $chains$bgibbs is a length-\(QS\) vector of values for the \(Q \times S\) matrix \(\mathbf{B}\), standardized for \(\mathbf{X}\). In other words, a coefficient \(\mathbf{B}_{qs}\) has the units of \(w_s\). $chains$sgibbs holds either the \(S(S + 1)/2\) unique values of \(\boldsymbol{\Sigma}\) or the \(N \times r\) unique values of the dimension reduced covariance matrix.
Additional summaries are available in the list inputs:
The matrix classBySpec shows the number of observations in each interval. For this example of continuous data censored at zero, the two labels are \(k \in \{0, 1\}\) corresponding to the intervals \((p_{s,0}, p_{s,1}] = (-\infty,0]\) and \((p_{s,1}, p_{s,2}) = (0, \infty)\). The length-\((K + 1)\) partition vector is the same for all species, \(\mathbf{p} = (-\infty, 0, \infty)\). Here is classBySpec for this example:
The first interval is censored (all values of \(y_{is}\) = 0). The second interval is not censored (\(y_{is} = w_{is}\)).
The fitted coefficients are in $parameters, as summarized in Table 2. For example, the posterior mean estimate of the unstandardized coefficients \(\mathbf{B}_u\) is in out$parameters$betaMu.
Species richness and diversity across all observations. The distribution of observed values is shown as a histogram. This plot is generated as the file richness.pdf when plotPars$SAVEPLOTS = T is passed to gjamPlot.
The data are also predicted in GJAM, summarized by predictive means and standard errors. These are contained in \(n \times Q\) matrices $prediction$xpredMu and $prediction$xpredSd and \(n \times S\) matrices $prediction$ypredMu and $prediction$ypredSd. These are in-sample predictions.
The output can be viewed with the function gjamPlot:
f <- gjamSimData(n = 500, S = 10, typeNames = 'CA')
ml <- list(ng = 1000, burnin = 200, typeNames = f$typeNames)
out <- gjam(f$formula, f$xdata, f$ydata, modelList = ml)
pl <- list(trueValues = f$trueValues, GRIDPLOTS = T)
gjamPlot(output = out, plotPars = pl)
gjamPlot creates a number of plots comparing true and estimated parameters (for simulated data).
Predictions of responses in Y against true values.
Breeding bird survey (BBS) example
Here is an example using BBS data, which we looked at earlier in the semester. In this case, I’m using data from the eastern US:
Predicted richness and diversity for BBS data.
Here is a gjam analysis. First, I trim the matrix ydata down to only those species that occur in at least 1000 observations:
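The trimming code is not shown; a minimal sketch (assuming species are the columns of a count matrix ydata) could be:
keep  <- colSums(ydata > 0) >= 1000   # species present in at least 1000 observations
ytrim <- ydata[, keep]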
Sensitivity to winter temperature in BBS. Weak for migrant species?
Sensitivity to croplands in BBS.
Sensitivity to wetlands in BBS.
Community-wide sensitivity to predictors in the model.
I set up three objects:
reductList does dimension reduction on the covariance matrix. With this many columns in ytrim, there may be more coefficients in \(\Sigma\) than the data can inform, or the matrix may be too large to invert. I reduce it to N effective rows and r effective columns.
modelList holds the number of MCMC iterations ng, the burnin iterations, and the data type 'DA'.
formula is the model.
reductList <- list(N = 20, r = 15)
modelList <- list(ng = 2000, burnin = 500, typeNames='DA',
reductList = reductList)
formula <- as.formula(~ winterTemp + therm + def + nlcd)
Predicted abundance for BBS data.
Here is a GJAM fit; it will take some time:
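The fitting call is not shown; assuming the objects defined above, it presumably looks like this:
outBBS <- gjam(formula, xdata = xdata, ydata = ytrim, modelList = modelList)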
Here is the dimension reduction: the full \(98 \times 98\) covariance matrix is reduced to 112 coefficients:
The small block below represents the number of coefficients estimated in the covariance matrix.
There are a large number of species. To help visualize output I assign colors to groups of species. This is done with a list specGroups and a vector specColor, where I have organized some of the species into groups that a non-specialist might recognize:
snames <- colnames(out$inputs$y)
specGroups <- list(
urban = c("EuropeanStarling","AmericanCrow","ChimneySwift",
"CommonGrackle","FishCrow","HouseFinch",
"HouseSparrow","NorthernMockingbird","HouseWren"),
raptor = c('CoopersHawk','RedtailedHawk','SwainsonsHawk',
'RedshoulderedHawk','BaldEagle','AmericanKestrel',
'Osprey','BarredOwl','GreatHornedOwl'),
 wetland = c("Anhinga","DoublecrestedCormorant",
 "Mallard","WoodDuck","CanadaGoose","WhiteIbis",
 "RedwingedBlackbird","GreatBlueHeron","GreatEgret",
 "LittleBlueHeron","CattleEgret","GreenHeron",
 "YellowcrownedNightHeron","BeltedKingfisher")
)
S <- ncol(out$inputs$y)
specColor <- rep('black',S)
cc <- c('brown','red','blue')
for(j in 1:length(specGroups)){
wj <- which(snames %in% specGroups[[j]])
specColor[wj] <- cc[j]
names(specColor)[wj] <- names(specGroups)[j]
}
Look at the structure of specColor.
Inverse prediction of predictors for BBS. Important variables are well-predicted by the bird assemblage.
Here are plots:
Prediction
Prediction is used to i) evaluate model fit (in-sample, out-of-sample), ii) impute missing values in x and y, and iii) project the fitted model to an (out-of-sample) prediction grid. Predictions marginalize the uncertainty in the fitted model. Consider a likelihood and a distribution of parameters \([\theta]\). I can integrate over the variation in parameters this way:
\[ [Y|X] = \int [Y | X, \theta] [\theta] d\theta \] You might recognize this to be the marginal likelihood of Bayes theorem, where \([\theta]\) is the prior distribution.
To predict from a fitted model, I marginalize the posterior distribution,
\[ [Y^* |X^*] = \int [Y^* | X^*, \theta] [\theta | X, Y] d\theta \] The factors in the integrand are the likelihood, now written for a new data set, and the posterior distribution. If the predictors in \(X^*\) are the same as \(X\), then I am predicting in-sample. If not, I am predicting out-of-sample.
For a hierarchical model, there is uncertainty in parameters, in [process| parameters], and in [data | process, parameters], or
\[ [Y^*] = \int [Y^* | Z, \theta] [Z| \theta, X, Y] [\theta | X, Y] d\theta dZ \] where \(Z\) represents a latent state or process. In this case, I am assuming that some of the parameters connect observed \((X, Y)\) to the process \(Z\) and others to the observations \(Y\).
When the model is fitted with MCMC, this integration is done numerically. The MCMC chains are stored as an iteration \(\times\) parameter matrix. To integrate numerically, I draw a random row from this matrix, evaluate the process model, and then the data model. By repeating this process I build up a distribution of predictions.
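Here is a sketch of that numerical integration, reusing the chains bgibbs and sgibbs from the MVN Gibbs sampler fitted earlier in this document (there, S = 3 responses and Q = 4 predictors):
nsim  <- 500
rows  <- sample(ng, nsim, replace = TRUE)     # random iterations from the chains
ypred <- array(NA, c(nsim, n, S))
for(g in 1:nsim){
  bg <- matrix( bgibbs[rows[g],], Q, S )      # one posterior draw of the coefficients
  sg <- matrix( sgibbs[rows[g],], S, S )      # one posterior draw of Sigma
  ypred[g,,] <- .rMVN(n, x%*%bg, sg)          # predictive draw for each observation
}
yMu <- apply(ypred, c(2,3), mean)             # posterior predictive mean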
Inverse prediction involves inverting the model to predict \(X\). This is done by putting a prior distribution on \(X\). GJAM provides these predictions for all predictors.
The same basic concepts apply to multivariate distributions. Because GJAM accommodates many types of data with censoring, it looks a bit different, as outlined in the figure below:
Censoring in gjam. As a data-generating model (a), a realization \(w_{is}\) that lies within a censored interval is translated by the partition \(\mathbf{p}_{is}\) to discrete \(y_{is}\). The distribution of data (bars at left) is induced by the latent scale and the partition. For inference (b), observed discrete \(y_{is}\) takes values on the latent scale from a truncated distribution.
Here is a prediction grid for the BBS data fitted with GJAM,
newdata <- list(xdata = predGrid, nsim = 100)
p1 <- gjamPredict(outBBS, newdata = newdata)
yPredMu <- p1$sdList$yMu # predictive mean
yPredSe <- p1$sdList$yPe # predictive sd
Here is the distribution of soils in the prediction map:
par(mfrow=c(1,1), mar=c(1,1,1,1))
library('RColorBrewer')
palette(brewer.pal(8,'Accent'))
mapx <- range(predGrid$lon)
mapy <- range(predGrid$lat)
maps::map(boundary=T,col='grey',lwd=2,xlim=mapx,ylim=mapy)
levs <- levels(predGrid$soil)
scols <- match(predGrid$soil,levs)
points(predGrid[,'lon'],predGrid[,'lat'],col=scols,pch=15, cex=.6)
maps::map(boundary=T,add=T,interior=T,col='white',lwd=7)
maps::map(boundary=T,add=T,interior=T,col='grey',lwd=3)
legend('bottomright',levs,text.col=c(1:length(levs)),bg='white',
       box.col='white', cex=.6)
Predicted abundance for random species in BBS data.
In this block, I generate maps for random species, including the predictive coefficient of variation,
library(MBA)
par(mfrow=c(2,2), mar=c(1,1,3,2), bty='n')
ngrid <- 30
colM <- colorRampPalette(brewer.pal(5,'YlOrRd'))
qlev <- seq(0,1,length=10)
zlevs <- quantile(yPredMu,qlev)
zlevs <- zlevs[!duplicated(zlevs)]
nz <- length(zlevs) - 1
colm <- colM(nz)
slevs <- quantile(yPredSe/(yPredMu + .01),qlev)
slevs <- slevs[!duplicated(slevs)]
ns <- length(slevs) - 1
cols <- colM(ns)
ii <- sample(colnames(yPredMu), 4)
for(j in 1:4){
i <- ii[j]
values2contour(xx=predGrid[,'lon'],yy=predGrid[,'lat'],
z=yPredMu[,i],nx=ngrid,ny=ngrid,col=colm,
zlevs=zlevs,lwd=.1,add=F,fill=T)
maps::map(boundary=T,col='grey',lwd=2,xlim=mapx,ylim=mapy,add=T)
ydat <- outBBS$inputs$y[,i]*1000 # per 1000 hrs effort
ydat <- ydat/max(ydat,na.rm=T)
colj <- .getColor('darkblue',ydat)
symbols(xdata$Longitude, xdata$Latitude, circles=ydat, inches=F, add=T,
fg = colj, bg = colj)
title(i)
}
# coefficient of variation
for(j in 1:4){
i <- ii[j]
wi <- which(colnames(yPredMu) == i)
zj <- yPredSe[,i]/(yPredMu[,i] + .01)
values2contour(xx=predGrid[,'lon'],yy=predGrid[,'lat'],
z=zj,nx=ngrid,ny=ngrid,col=colm,
zlevs=slevs,lwd=.1,add=F,fill=T)
maps::map(boundary=T,col='grey',lwd=2,xlim=mapx,ylim=mapy,add=T)
ydat <- outBBS$inputs$y[,i]*1000 # per 1000 hrs effort
ydat <- ydat/max(ydat,na.rm=T)
colj <- .getColor('darkblue',ydat)
symbols(xdata$Longitude, xdata$Latitude, circles=ydat, inches=F, add=T,
fg = colj, bg = colj)
title(i)
}
Predicted coefficient of variation for random species in BBS data.
A second look at composition data
Here is the example of leaf traits that we previously analyzed with a multinomial logit. In GJAM I set the data type to composition count (CC).
load( 'leafFIA.rdata', verbose=T )
formula <- as.formula( ~ moist*annualDef + winTmin + annualPrec )
modelList <- list(ng = 1000, burnin = 200, typeNames='CC')
out <- gjam(formula, xdata = data.frame( xcluster ),
ydata = yleaf, modelList = modelList)
gjamPlot( out, plotPars = list(PLOTALLY = T))
Predictions by category (sqrt scale).
As with the MV logit example, note that these variables provide limited predictive capacity for leaf type, and rare types are most poorly predicted.
Beta chains converged by 100 iterations.
Sigma chains converged by 100 iterations.
Because all sampling is direct, the chains converge quickly.
Responses by predictor.
These estimates do not look anything like those that we obtained with the MV logit. For example, GJAM shows positive responses to winter temperatures for broadleaf evergreens, as we would expect (they dominate in the South). The MV logit estimates a negative coefficient.
Overall sensitivity to predictors (across all classes).
The sensitivity plot summarizes the contributions of each predictor across all response variables. Local site moisture dominates in this analysis.
Recall that the multinomial models impose negative covariance. Note that this is not what we see with GJAM:
## y1 y2 y3
## y1 1.6064 -0.4819 0.6669
## y2 -0.4819 0.2602 -0.0300
## y3 0.6669 -0.0300 0.6028
## [,1] [,2] [,3]
## y1 0.15905 0.05743 0.08169
## y2 0.05743 0.02747 0.02580
## y3 0.08169 0.02580 0.05868
Combinations of data types
These different data types can be modeled in combination. Here is an example that combines discrete abundance (DA) counts, ordinal counts (OC, recall there is order, but no scale), count composition (CC), continuous abundance (CA), and presence-absence (PA).
types <- c('DA','DA','OC','OC','OC','OC','CC','CC','CC','CC','CC','CA','CA','PA','PA')
f <- gjamSimData(S = length(types), typeNames = types)
ml <- list(ng = 4000, burnin = 500, typeNames = f$typeNames)
out <- gjam(f$formula, f$xdata, f$ydata, modelList = ml)
## ================================================================================================================================================================
##
## Sensitivity by predictor variables f:
## Estimate SE CI_025 CI_975
## x2 1400 149 1120 1700
## x3 1560 138 1290 1820
## x4 3400 247 2930 3870
## x5 1740 158 1440 2040
##
## Coefficient matrix B:
## S9 S10 S7 S8 S1 S2 S11 S12 S13
## intercept -0.637 0.1100 1.2300 -0.439 0.536 1.030 0.2410 0.1690 0.2520
## x2 0.669 0.7750 0.4450 1.550 1.140 0.105 -0.0581 -0.0433 0.0312
## x3 -0.574 0.7540 -0.2150 -0.192 1.350 0.254 0.0434 -0.0437 0.0845
## x4 -0.165 -0.5020 1.0200 0.210 0.789 0.889 -0.1010 0.0590 0.0837
## x5 -0.263 0.0528 0.0536 1.220 1.480 1.080 0.0405 0.0347 -0.1110
## RMSPE 0.471 0.4550 0.6440 0.646 1.190 1.140 38.3000 26.7000 21.1000
## S14 S3 S4 S5 S6
## intercept 0.16300 -0.135 1.920 0.799 0.9840
## x2 0.06920 1.450 1.350 0.183 1.8800
## x3 -0.05040 -0.213 1.890 1.680 0.7920
## x4 -0.00231 1.360 -0.404 0.455 0.0663
## x5 0.06030 0.651 0.310 1.120 2.0000
## RMSPE 23.20000 0.752 0.816 0.782 0.7600
##
## Coefficient matrix B:
## Estimate SE CI_025 CI_975 sig95
## S9_intercept -0.63700 0.04740 -0.72100 -0.53500 *
## S9_x2 0.66900 0.04020 0.59000 0.74500 *
## S9_x3 -0.57400 0.04940 -0.66400 -0.46500 *
## S9_x4 -0.16500 0.05600 -0.27800 -0.06590 *
## S9_x5 -0.26300 0.05390 -0.35800 -0.15900 *
## S10_intercept 0.11000 0.03330 0.04540 0.17500 *
## S10_x2 0.77500 0.03750 0.69800 0.84700 *
## S10_x3 0.75400 0.03610 0.68200 0.82300 *
## S10_x4 -0.50200 0.03650 -0.57100 -0.43000 *
## S10_x5 0.05280 0.03270 -0.01110 0.11600
## S7_intercept 1.23000 0.01980 1.19000 1.27000 *
## S7_x2 0.44500 0.01970 0.40600 0.48300 *
## S7_x3 -0.21500 0.01810 -0.25000 -0.18000 *
## S7_x4 1.02000 0.02120 0.97700 1.06000 *
## S7_x5 0.05360 0.01650 0.02220 0.08680 *
## S8_intercept -0.43900 0.06410 -0.56400 -0.31400 *
## S8_x2 1.55000 0.06370 1.43000 1.67000 *
## S8_x3 -0.19200 0.04500 -0.28000 -0.10500 *
## S8_x4 0.21000 0.04160 0.12600 0.29300 *
## S8_x5 1.22000 0.02540 1.17000 1.27000 *
## S1_intercept 0.53600 0.03560 0.46600 0.60700 *
## S1_x2 1.14000 0.03910 1.06000 1.21000 *
## S1_x3 1.35000 0.03620 1.28000 1.42000 *
## S1_x4 0.78900 0.03700 0.71700 0.86100 *
## S1_x5 1.48000 0.03610 1.41000 1.55000 *
## S2_intercept 1.03000 0.03070 0.97400 1.09000 *
## S2_x2 0.10500 0.03120 0.04520 0.16700 *
## S2_x3 0.25400 0.02950 0.19600 0.31100 *
## S2_x4 0.88900 0.03080 0.82600 0.94800 *
## S2_x5 1.08000 0.03110 1.03000 1.15000 *
## S11_intercept 0.24100 0.00277 0.23600 0.24700 *
## S11_x2 -0.05810 0.00277 -0.06350 -0.05280 *
## S11_x3 0.04340 0.00256 0.03860 0.04840 *
## S11_x4 -0.10100 0.00283 -0.10600 -0.09530 *
## S11_x5 0.04050 0.00279 0.03500 0.04610 *
## S12_intercept 0.16900 0.00243 0.16400 0.17300 *
## S12_x2 -0.04330 0.00220 -0.04790 -0.03930 *
## S12_x3 -0.04370 0.00217 -0.04820 -0.03970 *
## S12_x4 0.05900 0.00271 0.05410 0.06440 *
## S12_x5 0.03470 0.00214 0.03070 0.03900 *
## S13_intercept 0.25200 0.00195 0.24800 0.25600 *
## S13_x2 0.03120 0.00160 0.02800 0.03430 *
## S13_x3 0.08450 0.00171 0.08120 0.08770 *
## S13_x4 0.08370 0.00184 0.08000 0.08730 *
## S13_x5 -0.11100 0.00231 -0.11500 -0.10600 *
## S14_intercept 0.16300 0.00178 0.16000 0.16700 *
## S14_x2 0.06920 0.00174 0.06590 0.07270 *
## S14_x3 -0.05040 0.00176 -0.05390 -0.04700 *
## S14_x4 -0.00231 0.00177 -0.00580 0.00115
## S14_x5 0.06030 0.00171 0.05700 0.06380 *
## S15other_intercept 0.00000 0.00000 0.00000 0.00000
## S15other_x2 0.00000 0.00000 0.00000 0.00000
## S15other_x3 0.00000 0.00000 0.00000 0.00000
## S15other_x4 0.00000 0.00000 0.00000 0.00000
## S15other_x5 0.00000 0.00000 0.00000 0.00000
## S3_intercept -0.13500 0.03260 -0.19800 -0.06880 *
## S3_x2 1.45000 0.04610 1.36000 1.54000 *
## S3_x3 -0.21300 0.03170 -0.27500 -0.15100 *
## S3_x4 1.36000 0.04450 1.27000 1.44000 *
## S3_x5 0.65100 0.03550 0.58000 0.72000 *
## S4_intercept 1.92000 0.05970 1.80000 2.03000 *
## S4_x2 1.35000 0.04930 1.26000 1.44000 *
## S4_x3 1.89000 0.06210 1.77000 2.02000 *
## S4_x4 -0.40400 0.03460 -0.47100 -0.33700 *
## S4_x5 0.31000 0.03370 0.24400 0.37300 *
## S5_intercept 0.79900 0.04830 0.70900 0.89300 *
## S5_x2 0.18300 0.03700 0.11200 0.25800 *
## S5_x3 1.68000 0.07080 1.56000 1.83000 *
## S5_x4 0.45500 0.03780 0.38000 0.52900 *
## S5_x5 1.12000 0.04660 1.03000 1.21000 *
## S6_intercept 0.98400 0.03930 0.90900 1.06000 *
## S6_x2 1.88000 0.05450 1.77000 1.99000 *
## S6_x3 0.79200 0.03640 0.71900 0.86000 *
## S6_x4 0.06630 0.03190 0.00297 0.12800 *
## S6_x5 2.00000 0.05620 1.89000 2.11000 *
##
## Last column indicates if 95% posterior distribution contains zero.
##
## Coefficient matrix B, standardized for X:
## NULL
##
## Last column indicates if 95% posterior distribution contains zero.
##
## Coefficient matrix B, standardized for X and W:
## NULL
##
## Last column indicates if 95% posterior distribution contains zero.
##
## Sample contains n = 1000 observations on S = 15 response variables. Data types (typeNames) include PA, CA, DA, CC, OC. DA effort is 1 for all observations. DA counts per effort (W) ranges from 0 to 5.32. CC count range is (0, 7). 'other' class detected in ydata, 'S15other column ', not fitted. There are 0 missing values in X and 0 missing values in Y. The RMSPE is 18.2, and the DIC is 11552882. Computation involved 4000 Gibbs steps, with a burnin of 500.
# repeat with ng = 5000, burnin = 500, then plot data:
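# a minimal sketch of that re-run, assuming the same simulation object f as above:
ml  <- list(ng = 5000, burnin = 500, typeNames = f$typeNames)
out <- gjam(f$formula, f$xdata, f$ydata, modelList = ml)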
pl <- list(trueValues = f$trueValues)
gjamPlot(out, plotPars = pl)
## ================================================================================
## simulated beta, corSpec vs betaMu, corMu (95%) -- return to continue
## cuts vs cutMu -- return to continue
## obs y vs predicted y -- return to continue
## x inverse prediction, covariates -- return to continue
## sensitivity over full model -- return to continue
## beta coefficient thinned chains -- return to continue
## correlation thinned chains -- return to continue
## beta standardized for W/X, 95% posterior -- return to continue
## beta standardized for W/X, 95% posterior -- return to continue
## beta standardized for W/X, 95% posterior -- return to continue
## beta standardized for W/X, 95% posterior -- return to continue
## 95% posterior -- return to continue
More exercises
Exercise 2. A former student in this class had a survey with one binary response (typeNames = 'PA') and three ordinal responses (ordinal counts, 'OC'). Simulate his data using gjamSimData and analyze it using GJAM (a starter sketch follows the questions below). Using the documentation for GJAM:
What are the partition estimates and what role do they play?
Which are better predicted by the fitted model, the PA data or the OC data? Why might they differ?
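A minimal starter sketch, reusing the call pattern from the combined-types example above (the ng and burnin values are placeholders):
types <- c('PA','OC','OC','OC')
f     <- gjamSimData(S = length(types), typeNames = types)
ml    <- list(ng = 2000, burnin = 500, typeNames = f$typeNames)
out   <- gjam(f$formula, f$xdata, f$ydata, modelList = ml)
gjamPlot(out, plotPars = list(trueValues = f$trueValues))   # the 'cuts vs cutMu' panel shows the partition estimates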
Exercise 3. Select 10 species from the BBS data and do a model selection (using DIC) to evaluate a limited number of variable combinations (a sketch of the comparison loop follows the questions below).
How do the DIC values compare with overall sensitivity?
With inverse prediction?
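A hedged sketch of the comparison loop. Here xdata and y10 stand in for the BBS predictors and your 10 chosen species, x1 and x2 are placeholder predictor names, typeNames should match whatever the earlier BBS fit used, and the location of DIC in the fitted object (fit$fit$DIC) should be checked against your gjam version:
forms <- list(~ x1, ~ x1 + x2, ~ x1 * x2)        # candidate variable combinations (placeholders)
dic   <- sapply(forms, function(fm){
  ml  <- list(ng = 3000, burnin = 500, typeNames = rep('DA', ncol(y10)))
  fit <- gjam(fm, xdata, y10, modelList = ml)
  fit$fit$DIC                                    # assumed element name; see the fitted-object documentation
})
dic                                              # compare with overall sensitivity and inverse prediction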
Exercise 4. Using the block of code that generates the predictive coefficient of variation, provide an interpretation of what you see.
Appendix - summary of GJAM model
An observation consists of environmental variables and species attributes, \(\lbrace \mathbf{x}_{i}, \mathbf{y}_{i}\rbrace\), \(i = 1,..., n\). The vector \(\mathbf{x}_{i}\) contains predictors \(x_{iq}: q = 1,..., Q\). The vector \(\mathbf{y}_{i}\) contains attributes (responses), such as species abundance, presence-absence, and so forth, \(y_{is}: s = 1,..., S\). The effort \(E_{is}\) invested to obtain the observation of response \(s\) at location \(i\) can affect the observation.
A length-\(S\) vector \(\mathbf{w}_i\) represents the response \(\mathbf{y}_i\) in continuous space. This continuous space allows for the dependence structure with a covariance matrix. An element \(w_{is}\) can be known (e.g., a continuous response \(y_{is}\)) or unknown (e.g., a discrete response).
For discrete observations, \(k\) indexes a censored interval, and \(w_{is}\) is a latent variable. A length-\(S\) vector of integers \(\mathbf{z}_{i}\) labels \(\mathbf{y}_i\) in discrete space. Each observed \(y_{is}\) determines the interval \(z_{is} \in \{0,...,K_{is}\}\). The number of intervals \(K_{is}\) can differ between observations and between species, because each species can be observed in different ways.
The partition defines the relationship between discrete observations \(y_{is}\) and continuous \(w_{is}\). Continuous space is partitioned at points \(p_{is,z} \in{\mathcal{P}}\). The label assigned to \(y_{is}\) is \(z_{is}\). Two values \((p_{is,k}, p_{is,k+1}]\) bound the \(k^{th}\) interval of \(s\) in observation \(i\). Intervals are contiguous and provide support over the real line \((-\infty, \infty)\). The set of censored intervals is \(\mathcal{C}\). The partition set \(\mathcal{P}\) can include both known (discrete counts, including composition data) and unknown (ordinal, categorical) points.
An observation \(y\) maps to \(w\) as
\[y_{is} = \left \{ \begin{matrix} \ w_{is} & & continuous\\ \ z_{is} & w_{is} \in (p_{z_{is}}, p_{z_{is} + 1}] & discrete \end{matrix} \right.\]
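For example, presence-absence data are the simplest case: \(K_{is} = 1\), the partition is known, and the mapping reduces to probit-style censoring at zero,
\[\mathcal{P} = \{-\infty, 0, \infty\}, \qquad y_{is} = \left\{ \begin{matrix} 0 & w_{is} \leq 0\\ 1 & w_{is} > 0 \end{matrix} \right.\]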
As discussed above, effort \(E_{is}\) affects the partition for discrete data. For the simple case where there is no error in the assignment of discrete intervals, \(\mathbf{z}_i\) is known, and the model for \(\mathbf{w}_i\) is
\[\mathbf{w}_i|\mathbf{x}_i, \mathbf{y}_i, \mathbf{E}_i \sim MVN(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}) \times \prod_{s=1}^S\mathcal{I}_{is}\] \[\boldsymbol{\mu}_i = \mathbf{B}'\mathbf{x}_i\] \[\mathcal{I}_{is} = \prod_{k \in \mathcal{C}}I_{is,k}^{I(y_{is} = k)} (1 - I_{is,k})^{I(y_{is} \neq k)}\]
where \(I_{is,k} = I(p_{is,k} < w_{is} \leq p_{is,k+1})\), \(\mathcal{C}\) is the set of censored intervals, \(\mathbf{B}\) is a \(Q \times S\) matrix of coefficients, and \(\boldsymbol{\Sigma}\) is an \(S \times S\) covariance matrix. There is a correlation matrix associated with \(\boldsymbol{\Sigma}\),
\[\mathbf{R}_{s,s'} = \frac{\boldsymbol{\Sigma}_{s,s'}}{\sqrt{\boldsymbol{\Sigma}_{s,s} \boldsymbol{\Sigma}_{s',s'}}}\]
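The indicator terms in the likelihood are what the sampler must respect when it imputes a latent \(w_{is}\) for a discrete response: the draw is a normal restricted to the censored interval. A simplified univariate sketch of that step by inverse-CDF sampling (an illustration only, not gjam's internal sampler):
# draw w ~ N(mu, sig^2) truncated to the censored interval (lo, hi]
rtnorm1 <- function(mu, sig, lo, hi){
  u <- runif(1, pnorm(lo, mu, sig), pnorm(hi, mu, sig))   # uniform on the interval's CDF range
  qnorm(u, mu, sig)                                       # map back through the normal quantile function
}
# e.g., an observed presence (y = 1) with the partition at zero:
w <- rtnorm1(mu = 0.3, sig = 1, lo = 0, hi = Inf)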