Conditional Prediction with GJAM

If you have questions or comments, email me at lane.scher@duke.edu, tweet at me @lane_scher, and find more information on my website.

How to use this tutorial

This tutorial works through the theoretical background of conditional prediction using GJAM and then provides an example. It does not explain the GJAM model itself—for more information on GJAM, see my GJAM tutorial.

Theoretical background

Briefly, conditional prediction uses information from “incidental” species to inform prediction of “target” species. It is able to use information from incidental species because all these species were modeled jointly in a GJAM, and a residual covariance matrix was produced.

Conditional prediction is useful because limited information about the environment enters the model through the covariates. There are many other things happening in the environment that species are responding to that we can’t measure, and often can’t even identify. Conditional prediction allows us to capture information about the environment through incidental species.

In the diagram below, A shows species \(y_1\) directly predicting species \(y_2\). This assumes the observation error \(\sigma_2^2\) only applies to \(y_2\) and that there is no observation error associated with \(y_1\). Realistically, both observation error applies to both species.

B demonstrates a joint model, in which both \(y_1\) and \(y_2\) enter the model as responses to the covariates, with a residual covariance \(\Sigma\) between them. The unobserved environment (snow) affects both species, but there is no \(\beta\) capturing the relationship between snow and species counts because it was not included in the model.

C shows prediction of \(y_2\) conditional on \(y_1\). It uses the residual of \(y_1\) to capture information about unobserved variables. In other words, if there are fewer than expected individuals of \(y_1\), that is because of unobserved variables (like snow). The residual therefore gives us information about the unobserved variables, which we can use to inform predictions of \(y_2\). diagram

Typical prediction defines the mean value \(\mu\) of target species \(s\) from the covariate values \(X\) and fitted coefficients for target species \(B_s\). \[ \mu_s = XB_s \]

Conditional prediction builds on this structure by adding information from the incidental species \(-s\). \[ \mu_s = XB_s + R_{-s}A \] Here, the first term is identical to traditional prediction, but we add a second term. \(R_{-s}\) contains residual abundances for incidental species (in other words, abundance beyond what is explained by the environment). \(A\) contains coefficient representing the relationships between incidental species \(s\) and target species \(-s\). The relationships between incidental and target species \(A\) come from the residual covariance in the fitted model: \[ A = \Sigma_{-s,-s}^{-1}\Sigma_{-s,s} \]

An example

Let’s quickly fit a GJAM using the data in birdData.rdata. For more information about this analysis, check my GJAM tutorial.

#install.packages('gjam')
library(gjam)
library(tidyverse)

load("birdData.rdata")

# trim ydata
ydata <- as.data.frame(gjamTrimY(ydata[,1:ncol(ydata)], maxCols = 20)$y)
ydata$other <- NULL

# make landcover a factor
xdata$landcover <- as.factor(xdata$landcover)
xdata$landcover <- relevel(xdata$landcover, 
                           "barren")

# format effort
elist <- list(columns = 1:ncol(ydata),
               values = xdata$eff)

# define formula
form1 <- as.formula(~ elevation + summerSMOS + precip + landcover)

# model list
ng <- 15000
burnin <- ng/2
mlist <- list(ng = ng, burnin = burnin, typeNames = 'DA',
                  effort = elist)

# fit model
out <- gjam(form1, xdata = xdata, ydata = ydata, modelList = mlist)

## ================================================================================================================================================================

Predictions

First, let’s look at predictions using the traditional method. These are calculated automatically and saved in the gjam output. We’ll look at predictions for three species: American crow, Common yellowthroat, and Great crested flycatcher.

trad_pred <- as.data.frame(out$prediction$ypredMu)

plot_trad <- trad_pred %>%
  select(americancrow, commonyellowthroat, greatcrestedflycatcher) %>%
  mutate(obs = 1:nrow(out$prediction$ypredMu)) %>%
  pivot_longer(!obs) 

plot_true <- as.data.frame(out$inputs$y) %>%
  select(americancrow, commonyellowthroat, greatcrestedflycatcher) %>%
  mutate(obs = 1:nrow(out$prediction$ypredMu)) %>%
  pivot_longer(!obs) 

plot_all1 <- inner_join(plot_trad, plot_true, by = c("obs", "name"))
colnames(plot_all1) <- c("obs", "name", "trad", "count")

ggplot(plot_all1) +
  geom_point(aes(x = count, y = trad)) +
  facet_wrap(~name)

Now, we’ll make conditional predictions for these three species. For each species, we’ll condition on all other species in the model.

cond_pred <- data.frame(obs = 1:nrow(trad_pred))

y <- out$inputs$y
for (s in c("americancrow", "commonyellowthroat", "greatcrestedflycatcher")) {
  
  # condition on all species except s
  condCols <- colnames(out$inputs$y)
  condCols <- condCols[-grep(s, condCols)]
  
  # predict s

  new <- list(ydataCond = y[,condCols], nsim = 200)
  cond_pred2 <- gjamPredict(output = out, newdata = new)

  cond_pred[,s] <- cond_pred2$sdList$yMu[,s]
}

## ================================================================================================================================================================================================================================================

Now plot those

plot_cond <- cond_pred %>%
  pivot_longer(!obs)

plot_all2 <- inner_join(plot_cond, plot_true, by = c("obs", "name"))
colnames(plot_all2) <- c("obs", "name", "cond", "count")

ggplot(plot_all2) +
  geom_point(aes(x = count, y = cond)) +
  facet_wrap(~name)

We can plot them next to each other, and binned, with a one-to-one line to really see the difference.

plot_all <- inner_join(plot_all1, plot_all2, by = c("obs", "name", "count")) %>%
  pivot_longer(cols = c("trad", "cond"), names_to = "type")

brks <- seq(-1, max(plot_all$value)+1, by = 10)
plot_all <- plot_all %>%
  mutate(bins = cut(count,
                    breaks = brks)) %>%
  mutate(bin = round(count, digits = -1))

plot_all$type <- factor(plot_all$type, levels = c("trad", "cond"))

ggplot(plot_all) +
  geom_abline(aes(intercept = 0, slope = 1), linetype = "dashed", color = "red", size = 1) +
  geom_boxplot(aes(x = bin, y = value, group = bin)) +
  facet_grid(rows = vars(type), cols = vars(name)) +
  labs(x = "Observed", y = "Predicted")

Notice that the conditional predictions (bottom panel) fall closer to the one-to-one line than traditional predictions, especially for high observed values. And for the American crow, traditional prediction predicts high values where the true count is 0, which conditional prediction avoids.

The A matrix

Conditional prediction is also useful to identify relationships between species, held in the \(A\) matrix. This can be pulled from the gjam output using gjamConditionalParameters(). Let’s say we want to know how many American goldfinches to expect, given the other birds present (i.e., we condition on the other birds).

condOn <- colnames(y)
condOn <- condOn[-grep("americangoldfinch", condOn)]
pars <- gjamConditionalParameters(out, conditionOn = condOn)

A <- pars$Atab %>%
  mutate(species = condOn)

ggplot(A) +
  geom_abline(intercept = 0, slope = 0) +
  geom_point(aes(x = reorder(species, Estimate), y = Estimate, color = sig95)) +
  geom_segment(aes(x = reorder(species, Estimate), xend = reorder(species, Estimate),
                   y = CI_975, yend = CI_025,
                   color = sig95)) +
  theme_bw() +
  theme(axis.text = element_text(angle = 45,
                                 hjust = 1))

This tells us that for each Red-tailed hawk beyond what is expected from the environment, there will be about 6 fewer American goldfinches. And an additional White-breasted means there will be about 3 more American goldfinches.