Week 10 Joint SDMs

Author

MD

Week 10 Joint Species Distribution Modelling

I this week’s practical we will take a final look at community ecology analysis in R focusing on Joint Species Distribution Models (JSDMs). The practical is based on a data set from Labam’s Kayitete’s dissertation project from 2024 and takes presence data on seven species of vulture in East Africa sources from the GBIF database. We will model the co-occurence of these species as a u function of land cover and climatic variables and use the hmsc (Hierarchical Modelling of Species Coummnities).

After setting your working directory with setwd(), load in the packages needed:

# Install necessary packages if they are not already installed
required_packages = install.packages(c("sf", "terra", "Hmsc", "coda", "corrplot"))
# Load necessary libraries
library(sf)
Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(terra)
terra 1.9.11
library(Hmsc)
Loading required package: coda

Attaching package: 'coda'
The following objects are masked from 'package:terra':

    varnames, varnames<-
library(coda)
library(corrplot)
corrplot 0.95 loaded

And load in the data:

# Load the data

#species
vultures=st_read('vultures_samp_ea.shp')
Reading layer `vultures_samp_ea' from data source 
  `/Users/mattdennis/Library/Mobile Documents/com~apple~CloudDocs/Documents/DocumentsNew/DocumentsMatt/GEOG71922/vultures_samp_ea.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 5874 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 34.08972 ymin: -3.704935 xmax: 38.13105 ymax: -1.152151
Geodetic CRS:  WGS 84
#Environmental covariates
env=rast("env_EA.tif")

The env object is a stacked raster for which the first 6 bands are land cover layers and the subsequent layers are climatic varialble (and one layer for elevation) from the WorldClim data that we used in an earlier practical. We can check this with the names() function.

names(env)
 [1] "forest"                           "shrub"                           
 [3] "herbaceous"                       "Wetland"                         
 [5] "arable"                           "built"                           
 [7] "Annual Mean Temperature"          "Isothermality"                   
 [9] "Temperature Seasonality"          "Max Temperature of Warmest Month"
[11] "Min Temperature of Coldest Month" "Temperature Annual Range"        
[13] "elev"                            

Take a look:

plot(env)

The species data are in a different CRS than hte environment data so we need to transform these.

vultures=st_transform(vultures,crs(env))

Next we extract the data on the environment to our point data as in our previous SDM exercises.

envData=terra::extract(env,vultures)

Recall that we don’t have a classic species-abundance matrix and are using raster cells (of approximates 5 x 5 km) to approximate a sampling scheme where we records the number of times each species is recorded in each cell. In order to treat each cell as a separate case for analysis, we need to give each one an ID. The most sensible way to do this is to just use the cell number informatio already contained in the raster layer. We can use the cellFromXY() function to get the cell number at different locations. As we only wnat cells that contain point data, we use the coordinates of the point (vulture) layer to get the cell number as follows:

crdsVulture=st_coordinates(vultures)

cn=cellFromXY(env, crdsVulture)

Then we create two data frames, one for species, one ofr environment which both contain the cell numbers so we can combine them later in the modelling.

#create env data
enVar=cbind(st_drop_geometry(envData),cn)

#create species community data
spVar=cbind(st_drop_geometry(vultures),cn)

Next, to approximate a species-abundance table, we need to do some data manipulation to get our records on the correct arrangement. First we will get the frequency of each species in each cell using group_by() and summarise() from dplyr/tidyverse.

# Import the tidyverse libraries
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:terra':

    intersect, union
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr)

Attaching package: 'tidyr'
The following object is masked from 'package:terra':

    extract
freqData=spVar|>
  group_by(cn,scientific)|>
  summarize(Freq=n(),.groups = "drop_last")

#inspect
head(freqData)
# A tibble: 6 × 3
# Groups:   cn [4]
     cn scientific            Freq
  <dbl> <chr>                <int>
1  1691 Necrosyrtes_monachus     1
2  1696 Gyps_africanus           1
3  1697 Gyps_africanus           2
4  1697 Necrosyrtes_monachus     1
5  1698 Gyps_africanus           2
6  1698 Gyps_rueppellii          1

With the freqencies obtained, the next job is to convert to wide format so that species are columns and sites (cells) are rows.

#use pivot_wider() to wrangle into wide format data frame
spData=pivot_wider(freqData,names_from = scientific, values_from = Freq,values_fill=0)

#inspect frequency table
head(spData)
# A tibble: 6 × 8
# Groups:   cn [6]
     cn Necrosyrtes_monachus Gyps_africanus Gyps_rueppellii Torgos_tracheliotos
  <dbl>                <int>          <int>           <int>               <int>
1  1691                    1              0               0                   0
2  1696                    0              1               0                   0
3  1697                    1              2               0                   0
4  1698                    0              2               1                   1
5  1710                    0              0               0                   0
6  1715                    0              0               1                   0
# ℹ 3 more variables: Neophron_percnopterus <int>,
#   Trigonoceps_occipitalis <int>, Gypohierax_angolensis <int>
#merge data by the "cn" column.
allData=merge(spData,enVar[2:ncol(enVar)],by="cn")
allData=distinct(allData)

Next we prepare the final matrices that we be entered in to Hmsc() function.

First, the community matrix:

# Prepare species data matrix - only need columns 1 to 8
Y <- data.frame(as.matrix(allData[, c(1:8)]))

#compute species richness for each cell and truncate to only those cells with a good number of occurrences - better for model fitting.
Y$Abs=rowSums(Y[,2:8])

#subset to cells with total abundance at least 40 
Y=Y[Y$Abs>=25,]

#collect all environment data into a single matrix
X <- data.frame(as.matrix(allData[, c(1,9:21)]))

#makes sure only cells that are in Y are kept
X=X[X$cn%in%Y$cn,]

#now both matricies are set up, we can lose the cell number column
Y=Y[,2:8]

#convert to data frames and ensure all coilumns are numeric
Y=data.frame(apply(Y,MARGIN=2,FUN=as.numeric))

#convert env data
XData=data.frame(apply(X,MARGIN=2,FUN=as.numeric))

One useful feature of the hmsc library function is that it is built to accept random effects. Adding random affects is one way of address sampling bias and, to a degree, spatial dependence in our data by assigned a random leveln effect such as site number or coordinates. We’ll add site (cell) number as a randon effect here:

# define the random level object. 

# Assuming dataset has unique identifiers for each sampling unit
library(Hmsc)

#the basis of the random level needs to be a factor (categorical)
cn=as.factor(XData$cn)

#this needs to be set up as a data frame - as the Hmsc( ) function only accepts data frames (not matrices)
studyDesign <- data.frame(sample = cn)

#use the HmscRandomLevel() function to store the random level
rL <- HmscRandomLevel(units = studyDesign$sample)

library(Hmsc)

cn=as.factor(XData$cn)

XData=XData[,2:ncol(XData)]

Now we are ready to actually specify out model formula.

#Write model formula
#Write model formula
XFormula = ~ shrub + herbaceous+Isothermality+built+
  Annual.Mean.Temperature+Temperature.Annual.Range 

With this information we can construct the model formula. The Hmsc() function takes the commnunity data, the environment data, the formula, the error distribution (we specify a probit model here as they approximate a normal error distribution which is helpful for model convergence and assessment), the study design (the sampling scheme i.e. sites) and the random levels (cell/site numbers).

# Construct the model


m <- Hmsc(Y = Y, XData = XData, XFormula = XFormula, distr = "probit",
          studyDesign = studyDesign,ranLevels = list(sample = rL))

To get our probability estimates for each species, the HMSC library uses a Markov Chain Monte Carlo approach to approximate the posterior distribution of the estimates. This is just a range of values for the regression coefficent with higher density close to the mean estimate and lower density for possible but less likely values. We set the number of chains (the number of times we attempt to estimate the distribution), the number of samples using in each run, the number of samples (transient) to ignore before estimating the distribution (because the first bunch of samples are always less stable)

# Set MCMC parameters
nChains <- 2
nIter <- 1000
nBurn <- 1000

set.seed(11)
# Fit the model
fit <- sampleMcmc(
  m,
  samples = nIter,
  transient = nBurn,
  nChains = nChains
)
Computing chain 1
Chain 1, iteration 40 of 2000 (transient)
Chain 1, iteration 80 of 2000 (transient)
Chain 1, iteration 120 of 2000 (transient)
Chain 1, iteration 160 of 2000 (transient)
Chain 1, iteration 200 of 2000 (transient)
Chain 1, iteration 240 of 2000 (transient)
Chain 1, iteration 280 of 2000 (transient)
Chain 1, iteration 320 of 2000 (transient)
Chain 1, iteration 360 of 2000 (transient)
Chain 1, iteration 400 of 2000 (transient)
Chain 1, iteration 440 of 2000 (transient)
Chain 1, iteration 480 of 2000 (transient)
Chain 1, iteration 520 of 2000 (transient)
Chain 1, iteration 560 of 2000 (transient)
Chain 1, iteration 600 of 2000 (transient)
Chain 1, iteration 640 of 2000 (transient)
Chain 1, iteration 680 of 2000 (transient)
Chain 1, iteration 720 of 2000 (transient)
Chain 1, iteration 760 of 2000 (transient)
Chain 1, iteration 800 of 2000 (transient)
Chain 1, iteration 840 of 2000 (transient)
Chain 1, iteration 880 of 2000 (transient)
Chain 1, iteration 920 of 2000 (transient)
Chain 1, iteration 960 of 2000 (transient)
Chain 1, iteration 1000 of 2000 (transient)
Chain 1, iteration 1040 of 2000 (sampling)
Chain 1, iteration 1080 of 2000 (sampling)
Chain 1, iteration 1120 of 2000 (sampling)
Chain 1, iteration 1160 of 2000 (sampling)
Chain 1, iteration 1200 of 2000 (sampling)
Chain 1, iteration 1240 of 2000 (sampling)
Chain 1, iteration 1280 of 2000 (sampling)
Chain 1, iteration 1320 of 2000 (sampling)
Chain 1, iteration 1360 of 2000 (sampling)
Chain 1, iteration 1400 of 2000 (sampling)
Chain 1, iteration 1440 of 2000 (sampling)
Chain 1, iteration 1480 of 2000 (sampling)
Chain 1, iteration 1520 of 2000 (sampling)
Chain 1, iteration 1560 of 2000 (sampling)
Chain 1, iteration 1600 of 2000 (sampling)
Chain 1, iteration 1640 of 2000 (sampling)
Chain 1, iteration 1680 of 2000 (sampling)
Chain 1, iteration 1720 of 2000 (sampling)
Chain 1, iteration 1760 of 2000 (sampling)
Chain 1, iteration 1800 of 2000 (sampling)
Chain 1, iteration 1840 of 2000 (sampling)
Chain 1, iteration 1880 of 2000 (sampling)
Chain 1, iteration 1920 of 2000 (sampling)
Chain 1, iteration 1960 of 2000 (sampling)
Chain 1, iteration 2000 of 2000 (sampling)
Computing chain 2
Chain 2, iteration 40 of 2000 (transient)
Chain 2, iteration 80 of 2000 (transient)
Chain 2, iteration 120 of 2000 (transient)
Chain 2, iteration 160 of 2000 (transient)
Chain 2, iteration 200 of 2000 (transient)
Chain 2, iteration 240 of 2000 (transient)
Chain 2, iteration 280 of 2000 (transient)
Chain 2, iteration 320 of 2000 (transient)
Chain 2, iteration 360 of 2000 (transient)
Chain 2, iteration 400 of 2000 (transient)
Chain 2, iteration 440 of 2000 (transient)
Chain 2, iteration 480 of 2000 (transient)
Chain 2, iteration 520 of 2000 (transient)
Chain 2, iteration 560 of 2000 (transient)
Chain 2, iteration 600 of 2000 (transient)
Chain 2, iteration 640 of 2000 (transient)
Chain 2, iteration 680 of 2000 (transient)
Chain 2, iteration 720 of 2000 (transient)
Chain 2, iteration 760 of 2000 (transient)
Chain 2, iteration 800 of 2000 (transient)
Chain 2, iteration 840 of 2000 (transient)
Chain 2, iteration 880 of 2000 (transient)
Chain 2, iteration 920 of 2000 (transient)
Chain 2, iteration 960 of 2000 (transient)
Chain 2, iteration 1000 of 2000 (transient)
Chain 2, iteration 1040 of 2000 (sampling)
Chain 2, iteration 1080 of 2000 (sampling)
Chain 2, iteration 1120 of 2000 (sampling)
Chain 2, iteration 1160 of 2000 (sampling)
Chain 2, iteration 1200 of 2000 (sampling)
Chain 2, iteration 1240 of 2000 (sampling)
Chain 2, iteration 1280 of 2000 (sampling)
Chain 2, iteration 1320 of 2000 (sampling)
Chain 2, iteration 1360 of 2000 (sampling)
Chain 2, iteration 1400 of 2000 (sampling)
Chain 2, iteration 1440 of 2000 (sampling)
Chain 2, iteration 1480 of 2000 (sampling)
Chain 2, iteration 1520 of 2000 (sampling)
Chain 2, iteration 1560 of 2000 (sampling)
Chain 2, iteration 1600 of 2000 (sampling)
Chain 2, iteration 1640 of 2000 (sampling)
Chain 2, iteration 1680 of 2000 (sampling)
Chain 2, iteration 1720 of 2000 (sampling)
Chain 2, iteration 1760 of 2000 (sampling)
Chain 2, iteration 1800 of 2000 (sampling)
Chain 2, iteration 1840 of 2000 (sampling)
Chain 2, iteration 1880 of 2000 (sampling)
Chain 2, iteration 1920 of 2000 (sampling)
Chain 2, iteration 1960 of 2000 (sampling)
Chain 2, iteration 2000 of 2000 (sampling)

We will check model convergence next. The following code puts the results from the MCMC in a format the can be plotted with the Coda package. We will plot the chains (each attempt to build the coefficients for each environmental covariate) to see how well the model performed. The plotted samples will move up and down around a relatively stable mean ifconvergence is good and the Beta paramter distribution should show a clear peak. Note the function produces plots for each species so there are a lot of outputs!

# Check MCMC convergence
mcmc_chains <- convertToCodaObject(fit)

par(mfrow = c(4, 4)) #adjust plotting scheme
par(mar = c(2, 2, 2, 2))  # Adjust margin sizes (bottom, left, top, right)
plot(mcmc_chains$Beta)#plot

Model convergence is not great in this case and would definitely benefit from more extensing sampling (perhaps many thousands for a serious bit of analysis) but we can work with this and, next, plot the mean estimates for each parameter for each species as a grid.

# Extract the posterior estimates of the beta parameters (model coefficients)
postBeta <- getPostEstimate(fit, "Beta")

par(mar = c(3, 3, 0.5, 0.5))  # Adjust margin sizes

#Plot the response of species to env covariates
plotBeta(fit, postBeta, param = "Mean")

The scale reductiom factor is a measure of how the variation within each chain compares to variation between chains. In a stable model the should be low. The following measure the ratio between the within-chain and between-chain variation and closer to one is good.

#Extract the potential scale reduction factor as a model diagnostic (should ideally be close to one for most runs)
mpost = convertToCodaObject(fit)

psrf.omega = gelman.diag(mpost$Omega[[1]], multivariate =
                           FALSE)$psrf

#plot values
hist(psrf.omega)

Next we compute the species association based on the model residuals using the computeAssociations() function.

# Compute associations using the fitted model

OmegaCor <- computeAssociations(fit)

# Extract the first set of associations from OmegaCor - mean between-species associations

toPlot <- OmegaCor[[1]]$mean

Now plot the species associations.

First reset the plotting panel

# Open a new plot window
dev.off()
null device 
          1 

Now use corrplot() to plot the associations:

library(corrplot)
# Plot the correlation matrix (interspecific interactions) using corrplot 
corrplot(toPlot,
         method = "color",
         col = colorRampPalette(c("blue", "white", "red"))(10),
         title = NULL,
         type = "lower",
         tl.col = "black",
         tl.cex = 0.5,  # Adjust text label size
         mar = c(0, 0, 6, 0))  # Adjust margins

This plot suggests that there are some weak-to-moderate associations between certains species and in particular some negative associations for Necrosyrtes Monachus which was our focal species in Laban’s 2025 paper.

There are a couple of options for model validation in HMSC. The computePredictedValues() function allows us to get the mean model AUC.

We can also build response plots. for example, below we plot the influence of the “Shrub” variable on species richness by setting measure = “S” in the plotGradient() function.

#compute gradient
Gradient = constructGradient(fit,focalVariable = "shrub")

#make prediction
predY = predict(fit, Gradient=Gradient, expected = TRUE)

#plot
plotGradient(fit, Gradient, pred=predY, measure="S", index = 1, showData = FALSE)

[1] 0.1405

We next construct the objects xy.grid and XData.grid that have the coordinates and use these to build a data frame that can be used for plotting with ggplot.

# environmental predictors from the env layer 

#forst set NAs to zero 
env[is.na(env)]=0

#get coordinates of all raster cells
xy.grid = as.data.frame(as.matrix(crds(env)))

#transform all raster covariates to a matrix of values
XData.grid = (as.matrix(env))

#convert to data frame
XData.grid=data.frame(XData.grid)


# We next use the prepareGradient function to convert the environmental and spatial
# predictors into a format that can be used as input for the predict function

Gradient = prepareGradient(fit, XDataNew = XData.grid,sDataNew = xy.grid)



# We are now ready to compute the predicted distribution (takes a minute so please be patient)

# Note that we used expected = TRUE to predict occurrence probabilities (e.g. 0.2) instead of occurrences (0 or 1) 
predY = predict(fit, Gradient=Gradient,expected=TRUE)


EpredY=Reduce("+",predY)/length(predY)


# EpredY is a matrix of posterior mean occurrence probabilities
# The next step is to post-process the predictions to those community features
# that we wish to illustrate over the prediction space. With the script below,
# we derive from the predictions for species richness


SpeciesRichness=rowSums(EpredY)

#coordinates
xy =xy.grid[,1:2]

#prediction for Hooded Vulture
H = EpredY[,1]

mapData=data.frame(xy,SpeciesRichness,H)


# We will use the ggplot function from the ggplot2 package, 
library(ggplot2)

# We can plot variation in species richness.

ggplot(data = mapData, aes(x=x, y=y, color=SpeciesRichness))+geom_point(size=3)  + scale_color_gradient(low="blue",high="green") + coord_equal()

# We then exemplify prediction for one focal species, here N. monachus 

ggplot(data = mapData, aes(x=x, y=y, color=H))+geom_point(size=2) + ggtitle(expression(italic("Necrosyrtes monachus ")))+ scale_color_gradient(low="blue", high="green") + coord_equal()

The distribution shown here for N. monachus mainly showing high probability in areas of otherwise low species richness is consistent with the correlation plot above in which this species exhibits negative associations with multiple other vulture species.