joint spp distance sampling

The goal here is to model changes in species abundances based on distance sampling. To model species responses, we want to use a joint species modelling approach a la Ovaskainen.

As a starting point we need a distance sampling model that can be fitted with Stan so that we can then include the joint species part.

Distance sampling is a good example of a hierarchical model where we separate the ecological quantities of interest (animal abundance/density), from the observation of such quantities.

Let’s start by simulating a data set and trying to recover known parameters. We assume that detection probability decreases with distance to the trasect as a half-normal: \(\exp\left(- \frac{x^2}{2\sigma^2} \right)\)

We assume further that a total of \(n\) individuals are located at random over the area being sampled. We run the transect for a length \(L\) and the boundaries of the study area are at a distance \(B\) from each side of the transect. Let’s simulate some data:

set.seed(36)
sigma = 1
n = 200
B = 3
L = 10

x <- runif(n, -B, B)
p <- exp(-x^2/2*sigma^2) # detection probability
y_obs <- rbinom(n, size = 1, prob = p)

Now we plot the distribution of distances of all the animals to the transect and the distance of those that were obseved:

hist(abs(x), nclass = 15,
     xlab = "Distance (x)", col = "grey", 
     main = "True (grey) \nand observed distances (blue)" )
hist(abs(x[y_obs==1]), col = "blue", add = TRUE)

To estimate density \(\frac{n}{ L \times 2 B}\) we can use a data augmentation approach as presented in Royle and Dorazio (2009). For this, we add to the list of observed individual a number \(nz\) of zeros and define a parameter \(\psi\) for the probability that an individual (observed or not) is part of the population under study.

Because with Stan we cannot have discrete-valued hidden variables we have to use a zero-inflated version, see the dist.stan file

Let’s fit this model to the simulated data

nz = 200

datos <- list(n_obs = sum(y_obs),
              nz = nz,
              x = abs(x[y_obs==1]),
              y = c(rep(1,sum(y_obs)), rep(0,nz)),
              B = B,
              Area = 2*B*L)

pars = c("D", "psi", "sigma")

library("rstan")

## Loading required package: StanHeaders

## Loading required package: ggplot2

## rstan (Version 2.19.3, GitRev: 2e1f913d3ca3)

## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

fit <- stan(file = 'dist.stan',
            data = datos,
            pars = pars,
            include = TRUE,
            iter = 1000, thin = 1, chains = 3)

## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#bulk-ess

## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## http://mc-stan.org/misc/warnings.html#tail-ess

print(fit)

## Inference for Stan model: dist.
## 3 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=1500.
## 
##          mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
## D        3.96    0.02  0.30    3.39    3.75    3.94    4.16    4.58   317 1.02
## psi      0.77    0.01  0.09    0.60    0.71    0.76    0.83    0.95   317 1.02
## sigma    1.08    0.00  0.09    0.91    1.02    1.08    1.14    1.24   540 1.01
## lp__  -288.86    1.08 16.13 -320.60 -300.00 -288.39 -278.01 -256.94   221 1.01
## 
## Samples were drawn using NUTS(diag_e) at Fri Jul 17 15:53:12 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

samples = extract(fit)

op = par(mfrow=c(1,2))
plot(density(samples$D), main = "density", xlab = "")
abline(v = n/(2*B*L))
plot(density(samples$sigma), main = "sigma", xlab = "")
abline(v=sigma)

par(op)

Integrating the detection function

An alternative way of estimating \(\sigma\) and population density is by looking at the probabilities of obtaining the observed distances \(x\). Using Bayes rule, we can see that

\[ p(x|y=1) = \frac{p(y=1|x)p(x)}{p(y=1)} \]

for the case of the half-normal, we get \[ p(x|y=1) = \frac{\exp(- x^2/(2 \sigma^2))}{\int_0^B{\exp(- x^2/(2 \sigma^2))}} \]

We can use the function integrate_1d in Stan to get the denominator. See the dist_int.stan file (this is a conditional likelihood model and not the full likelihood one. Should work out the full likelihood example…).

Let’s fit this model to the simulated data

datos <- list(n_obs = sum(y_obs),
              x = abs(x[y_obs==1]),
              B = B,
              Area = 2*B*L)

pars = c("D", "sigma", "pbar")

library("rstan")
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

fit <- stan(file = 'dist_int.stan',
            data = datos,
            pars = pars,
            include = TRUE,
            iter = 1000, thin = 1, chains = 3)

print(fit)

## Inference for Stan model: dist_int.
## 3 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=1500.
## 
##        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## D      3.60    0.02 0.30  3.01  3.38  3.59  3.80  4.16   357    1
## sigma  0.94    0.00 0.08  0.81  0.88  0.93  0.99  1.12   339    1
## pbar   0.39    0.00 0.03  0.34  0.37  0.39  0.41  0.47   340    1
## lp__  36.78    0.03 0.74 34.53 36.54 37.07 37.29 37.35   630    1
## 
## Samples were drawn using NUTS(diag_e) at Fri Jul 17 15:53:23 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

samples = extract(fit)

op = par(mfrow=c(1,2))
plot(density(samples$D), main = "density", xlab = "")
abline(v = n/(2*B*L))
plot(density(samples$sigma), main = "sigma", xlab = "")
abline(v=sigma)

par(op)

Joint Species Model

Now let’s look at a case where we have several species. Following Ovaskainen et al. 2019, we want to build a hierarchical model where the expected values of model parameters are related to species traits, and the variance-covariance of parameters is possibly related to phylogenetic similarity. Let’s denote the number of species by \(n_s\), the number of species traits by \(n_t\), and the number of species-specific parameters by \(n_p\). The species-specific parameters are combined into the \(n_s \times n_p\) matrix \(\mathbf{\Theta}\), so that each row of \(\mathbf{\Theta}\), i.e., the vector \(\mathbf{\Theta}_s\), contains the model parameters for species \(s\). The number of parameters (\(n_p\)) and their interpretations depend on the specific model.

To determine the way in which species respond to environmental factors (as described by the parameters \(\mathbf{\Theta}\)) depend on their traits and phylogenetic relationships, we vectorize the matrix \(\mathbf{\Theta}\) to the \(n_s n_p \times 1\) vector \(\mathbf{\theta} = vec(\mathbf{\Theta})\), which we model with a multivariate normal distribution as

\[ \theta \sim \text{N} \left(\mathbf{m}, \Sigma \otimes \left[ \rho \mathbf{C} + \left(1 - \rho \right) \mathbf{I}_{n_s} \right] \right) \]

here the \(n_s n_p \times 1\) vector \(\mathbf{m} = \text{vec}(\mathbf{M})\) is the vectorized version of the \(n_s \times n_p\) matrix \(\mathbf{M}\) (with elements \(m_{sp}\), where \(p = 1,...,n_p\) is the index for the regression parameter), and it gives the expected parameters based on species traits. The influence of traits is modeled with a linear regression \(m_{sp} = \sum_k t_{sk} \zeta_{kp}\) ,where the index \(k\) runs over the \(t\) traits, \(t_{sk}\) is the trait \(k\) for species \(s\), and the parameter \(\zeta_{kp}\) measures the influence of trait \(k\) on parameter \(p\). Arranging the regression parameters \(\zeta_{kp}\) into a \(n_t \times n_p\) matrix \(\mathbf{Z}\) and the trait values \(t_{sk}\) into a \(n_s \times n_t\) matrix \(\mathbf{T}\), we can write in matrix form \(\mathbf{M} = \mathbf{T} \mathbf{Z}\). To include the intercept into the model, we set \(t_{s1}=1\) for all species \(s\).

The \(n_p \times n_p\) variance–covariance matrix \(\Sigma\) above models the species-specific deviations around the expectation based on species traits, and \(\otimes\) is the Kronecker (outer) product. The \(n_s \times n_s\) matrix \(\mathbf{C}\) is a phylogenetic correlation matrix that can be derived from a phylogenetic tree based on genetic data, or constructed from taxonomic information if a quantitative phylogeny is not available. The matrix \(\mathbf{I}_{n_s}\) is the identity matrix of dimension \(n_s\), and the parameter \(0 \leq \rho \leq 1\) measures the strength of the phylogenetic signal.

We can use this setup to model the number of individuals of species \(s\) at sample \(i\) as

\[ \begin{aligned} & y_{is} \sim \text{Poisson}(\lambda_{is}) \\ & \log(\lambda_{is}) = \theta_{0s} + \theta_{1s} \times x_i \\ \end{aligned} \]

where parameters \(\theta_{0}\)s and \(\theta_{1}\)s are sampled from a multivariate normal with mean related to species traits and variance-covariance related to the phylogenetic similarity among species as described above.

Simulation

Let’s simulate some data using this joint species formulation. We start by defining the parameter structure as described above. We will have three parameters per species as we will use one for the scale of detection probability.

library(ape)
library(mvtnorm)
set.seed(123)

n_s <- 20  # number of spp
n_p <- 3   # number of parameters
n_t <- 2   # number of traits

# define species traits
dgrass <- rbeta(n_s, 1, 1)  # some variable such as fraction of grass in diet
log_bm <- rnorm(n_s, 0, 1)  # log body mass

# define trait matrix includding ones for the intercept
TT <- as.matrix(cbind(rep(1, n_s), scale(dgrass), scale(log_bm)))

# simulate phylogeny 
tree <- rtree(n = n_s)    
CC <- vcv(tree, corr = TRUE) # species correlation based on phylogeny

# sort species and re-arrange phylogenetic correlation
tmp <- dimnames(CC)
ids <- as.numeric(as.factor(tmp[[1]]))

C <- matrix(NA, ncol(CC), ncol(CC))
for(i in 1:ncol(CC)){
  for(j in 1:ncol(CC)){
    C[ids[i],ids[j]] <- CC[i,j]
  }
}

# define parameters that link traits to expected thetas
Z <- matrix(c(1, 0, 0.4, 0.1, 0.5, 0.4, 0.5, -0.5, 0.6 ), (n_t + 1), n_p)

# get expected thetas
M <- TT %*% Z

Sigma <- diag(n_p) * 0.5
rho = 0.5  # correlation with phylogeny

thetas <- rmvnorm(1, mean = as.vector(M), kronecker(Sigma, rho*C + (1-rho) * diag(n_s)))
Theta <- matrix(thetas[1,], n_s, n_p)

We can inspect now how grass in diet and body mass relate to the intercept and slope of the regression between the predictor \(x\) and the expected count at a samping unit

nf = layout(matrix(c(1,2,3,4,5,6,0,0),4,2,byrow=TRUE), widths=c(1,1,1), heights=c(1,1,1,0.1))
#layout.show(nf)

op <- par( mar = c(3, 3, 2, 2) + 0.1, mgp = c(3.5, 1, 0), las = 1, bty = "n", cex = 1.2)
plot(scale(dgrass), exp(Theta[,1]), ylab = "", xlab = "", main = "sigma")
plot(scale(log_bm), exp(Theta[,1]), ylab = "", xlab = "")
plot(scale(dgrass), Theta[,2], ylab = "", xlab = "", main = "intercept")
plot(scale(log_bm), Theta[,2], ylab = "", xlab = "")
plot(scale(dgrass), Theta[,3], ylab = "", xlab = "", main = "slope")
plot(scale(log_bm), Theta[,3], ylab = "", xlab = "")

mtext("         scaled grass in diet           scaled log body mass", side = 1, line = -2, outer = TRUE, cex=1.3)

par(op)

Now we define the sampling scheme. Here we follow the simulation sheme of package AHMbook.

n_sites <- 100 # sampling units
# sample level predictor (environmental covariate)
x <- rnorm(n_sites)
B = 5 # max distance

# species indicator
j <- rep(1:n_s, each = n_sites) 
site <- rep(1:n_sites, n_s)
x.s <- rep(x, n_s)
n <- length(j)

log_lambda <- Theta[j,2] + Theta[j,3] * x.s
N <- rpois(n, exp(log_lambda))
sigma <- exp(Theta[,1])

data <- NULL
for (i in 1:length(N)) {
  d <- runif(N[i], 0, B)
  p <- exp(-d * d/(2 * (sigma[j[i]]^2)))
  y <- rbinom(N[i], 1, p)
  d <- d[y == 1]
  y <- y[y == 1]
  
  if (sum(y) > 0){
    data <- rbind(data, cbind(rep(site[i], sum(y)), rep(j[i], sum(y)), y, d))
  } 
}
colnames(data) <- c("site","sp", "y", "d")
datos = as.data.frame(data)

Now call Stan

library("rstan")
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

stan_dat <- list(
  n_obs = dim(datos)[1],
  n_sites = n_sites,
  B = B,
  site = as.integer(datos$site), # (c(s, rep(1:n_sites, each = nzs ))),
  r = datos$d,
  K = 2, 
  X = as.matrix(cbind(numeric(length(x))+1, x)),
  n_max = rep(50, n_s),
  n_s = as.integer(n_s),
  n_t = 3,
  TT = TT,
  C = C,
  ones = numeric(n_s) + 1,
  sp = datos$sp
)

pars <- c("pbar", "b_m", "rho",  "Sigma", "z")


fit <- stan(file = 'dist_pois_bin_int.stan',
            data = stan_dat,
            pars = pars,
            iter = 1000, thin = 1, chains = 3)

## recompiling to avoid crashing R session

## Trying to compile a simple C file

## Running /Library/Frameworks/R.framework/Resources/bin/R CMD SHLIB foo.c
## clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include/"  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppEigen/include/"  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppEigen/include/unsupported"  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/BH/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/StanHeaders/include/src/"  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/StanHeaders/include/"  -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/rstan/include" -DEIGEN_NO_DEBUG  -D_REENTRANT  -DBOOST_DISABLE_ASSERTS -DBOOST_PENDING_INTEGER_LOG2_HPP -include stan/math/prim/mat/fun/Eigen.hpp   -I/usr/local/include   -fPIC  -Wall -g -O2  -c foo.c -o foo.o
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppEigen/include/Eigen/Dense:1:
## In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppEigen/include/Eigen/Core:88:
## /Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:1: error: unknown type name 'namespace'
## namespace Eigen {
## ^
## /Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppEigen/include/Eigen/src/Core/util/Macros.h:613:16: error: expected ';' after top level declarator
## namespace Eigen {
##                ^
##                ;
## In file included from <built-in>:1:
## In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/StanHeaders/include/stan/math/prim/mat/fun/Eigen.hpp:13:
## In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppEigen/include/Eigen/Dense:1:
## /Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppEigen/include/Eigen/Core:96:10: fatal error: 'complex' file not found
## #include <complex>
##          ^~~~~~~~~
## 3 errors generated.
## make: *** [foo.o] Error 1

#print(fit)

Before making any inferences we check for convergence and reasonable sample sizes on the posteriors

fit_summary <- summary(fit)$summary
op <- par(mfrow = c(1,2))
hist(fit_summary[,10], main = "R-hat")
hist(fit_summary[,9], main = "n-eff" )

par(op)

N <- dim(fit_summary)[[1]]
for (n in 1:N) {
  rhat <- fit_summary[,10][n]
  if (rhat > 1.1 || is.infinite(rhat) || is.nan(rhat)) {
      print(sprintf('Rhat for parameter %s is %s!',
                    rownames(fit_summary)[n], rhat))
    }
  }

Rhats and n-eff look good (we get a waringin for the Rhat of Omega[1,1] but that’s a fixed value so no worries).

We can now check some posteriors such as that for \(\rh0\)

draws <- extract(fit)

plot(density(draws$rho), main = "")
abline(v=rho)

Now compare true \(\zeta\) values (in red) to fitted ones

zs <- fit_summary[grepl("z", rownames(fit_summary)),]
#plot(c(Z) - zs[,1])
df <- data.frame(x = 1:dim(zs)[1],
                 tz = c(Z),
                 fz = zs[,1],
                 L = zs[,4],
                 U = zs[,8])

ggplot(df, aes(x = x, y = tz)) +
  geom_point(size = 2, color="red") +
  geom_point(aes(y = fz), size = 2) +
  geom_linerange(aes(ymin = L, ymax = U)) +
  theme_classic()

Let’s compare the sample-level regression coefficients

bs <- fit_summary[grepl("b_m", rownames(fit_summary)),]

nf = layout(matrix(c(1,2,3,4,0,0),3,2,byrow=TRUE), widths=c(1,1), heights=c(1,1,0.1))
#layout.show(nf)
op <- par( mar = c(3, 3, 2, 2) + 0.1, mgp = c(3.5, 1, 0), las = 1, bty = "n", cex = 1.2)
plot(scale(dgrass), Theta[,2], col = 2,  ylab = "", xlab = "", main = "intercept", ylim=c(-3,3))
points(scale(dgrass), bs[seq(1, (n_s*3) ,by=3) + 1,1])

plot(scale(dgrass), Theta[,3], col = 2 , ylab = "", xlab = "", main = "slope", ylim=c(-3,3) )
points(scale(dgrass), bs[seq(1, (n_s*3) ,by=3) + 2,1])

plot(scale(log_bm),Theta[,2], col = 2 , ylab = "", xlab = "", )
points(scale(log_bm), bs[seq(1, (n_s*3) ,by=3) + 1,1])

plot(scale(log_bm),Theta[,3], col = 2 , ylab = "", xlab = "")
points(scale(log_bm),  bs[seq(1, (n_s*3) ,by=3) + 2,1])
mtext("         scaled grass           scaled log body mass", side = 1, line = -2, outer = TRUE, cex=1.3)

par(op)

joint spp distance sampling

jmm

5/9/2020

Integrating the detection function

Joint Species Model

Simulation