A. Introduction

Last week, in Lab 3a, we got species occurrence points, generated background points, got environmental data, and ‘sampled’ the environmental data at the presence and background points. This week, we’ll be continuing this exercise and running Species Distribution Models in R. Remember that the aim of SDM is to estimate the probability that a species occurs at a given site (in a given grid cell) based in that site’s (grid cell’s) environmental similarity with the environmental conditions at known species occurrence locations.

B. Set-up

1. Set up your R Script

1.1 Open RStudio
1.2 Create an R Script file. Click File -> NewFile -> R Script
1.3 Save the R Script to your Lab3 folder. Click File > Save As
1.4 Save the script frequently as you’re working. Click File > Save

2. Set your working Directory and load packages

As we did last week, set your working directory and load the packages we used last week. Remember to MODIFY my path name below to your own path name

setwd("/Users/megancattau/Dropbox/0_BoiseState/Courses/2024_spr_Ecol_Dist/Labs/Lab3_SDM")
library(dismo)
library(raster)
library(sf)
library(rgbif)
library(maps)

3. Load data and preprocess

3.1 Load sdm dataframe from Lab 3a and preprocess

sdm<-read.csv("sdmdata.csv")
sdm<-sdm[,-1]

3.2 Create presence / background sets

Model evaluation often relies on cross-validation. This consists of creating a model with one ’calibration / training’ data set, and testing it with another ‘evaluation / testing’ data set. Training and testing data can be created through random sampling (without replacement) from a single data set (we’ll also look at k-fold partitioning below). We now partition the presence data into 5 groups, using 4 of the groups for training and one group for testing. The background data will only be used for model testing and does not need to be partitioned. Note that we use the function kfold to assign each presence sample to a group, we are not doing true k-fold partitioning, as we do not run the model k times, alternating which set is training and which is testing.

# create presence data from our dataframe and partition
presence<-sdm[sdm$pb==1,-1]
group <- kfold(presence,5) 
pres_train <- presence[group != 1, ] 
pres_test <- presence[group == 1, ]

# create background data from our dataframe 
background<-sdm[sdm$pb==0, -1]

C. MODEL BUILDING, PREDICTION, AND EVALUATION

MODELING METHODS: Profile, regression, machine-learning

A large number of algorithms have been used in species distribution modeling. They can be classified as ’profile’, ’regression’, and ’machine learning’ methods. Profile methods only consider ’presence’ data, not absence or background data. Regression and machine learning methods often use both presence and absence or background data, but not always.

1. Build and execute a profile model: BIOCLIM

Several profile methods can be implemented in the dismo package, including BIOCLIM. The BIOCLIM algorithm is a ’climate-envelope-model’ (Booth et al., 2014), and it’s has been used extensively for SDM even though it generally does not perform as well as some other modeling methods, particularly in the context of climate change.

Note that here we’re fitting a Bioclim model using data.frame with each row representing the environmental data at known sites of presence of a species. You could instead fit a bioclim model simply using the predictors and the occurrence points (and the function would do the extracting for us).

Remind yourself what environmental variables we have available to us: https://worldclim.org/data/bioclim.html. Let’s select three varibales: BIO1, BIO4, and BIO12. Different modeling methods return different type of model objects (‘bc’ for the bioclim model we will run below). When implementing it in dismo, the results are reported between 0 and 1.

# Look at environmental variables we want to use
worldcl<-getData(name='worldclim', var='bio', res=2.5)
par(mfrow=c(1, 3))
plot(worldcl[[c(1,4,12)]])

# Run the model
bc <- dismo::bioclim(pres_train[,c(1,4,12)])

2. Model Evaluation

Different statistical measures can help us assess how well the model fits performs. We’ll want to use the test data set, which includes presence points that were not used in model calibration / training.
Threshold independent measures include the correlation coefficient and the Area Under the Receiver Operator Curve (AUC). AUC is a measure of rank-correlation, and a high AUC indicates that sites with high predicted suitability values tend to be areas of known presence and locations with lower model prediction values tend to be areas where the species is not known to be present. An AUC score of 0.5 means that the model is as good as random.

2.1 Evaluation

We evaluate the model by providing testing presence points, background (absence) points, and the model object (if we were using spatial presence / background points without the environmental data, we would also include a RasterStack of the environmental variables). Note that we include just the environmental data we used to build the model.

eBio <- dismo::evaluate(p=pres_test[,c(1,4,12)], a = background[,c(1,4,12)], model = bc)
eBio

## class          : ModelEvaluation 
## n presences    : 193 
## n absences     : 500 
## AUC            : 0.7173627 
## cor            : 0.2949839 
## max TPR+TNR at : 0.2771152

2.2 Evaluation Visualization

We can visualize the model evaluation by creating a 3-panel figure showing 1) the density curves of presence-absence data, 2) boxplots of suitability values for presence-absence data, and 3) an ROC curve

par(mfrow=c(1, 3))
density(eBio)
boxplot(eBio, col=c("blue", "red"))
plot(eBio, "ROC")

We can also look at the response function that creates response plots for each variable, with the other variables at their median value.

response(bc)

2.3 Threshold

For threshold dependent measures, a threshold must be set first that indicates presence / suitable above that threshold and absence / not suitable below that threshold. Some measures emphasize the weight of false absences; others give more weight to false presences.

Set the threshold and think: how did we find this threshold? There are several other options (run the code ?threshold to explore them). Think: is this the one you would choose?

tr <- threshold(eBio, "spec_sens") 
tr

## [1] 0.2771152

2.4 Prediction

The purpose is SDM is to create a map of suitability scores. Model objects can be used to with the predict function to make predictions for any combination of values of the independent variables, including for values across a raster. Here, we predict across the values of the bioclimatic variables in worldclim (Note that we include just the environmental data we used to build the model) at the extent of our species presence points./ Please note that if you’re predicting for a model outside of the dismo package (random forest, boosted regression trees, glm, etc.), you’ll want the order of the arguments in the predict function to be first the RasterStack, then the model object. In the example below, this would be predict(bc, worldcl) rather than predict(worldcl, bc) as we’ve done it.

zoom_x<-c(90,120)
zoom_y<-c(-10,10)
ext = extent(zoom_x[1], zoom_x[2], zoom_y[1], zoom_y[2])
pb <- dismo::predict(bc, worldcl[[c(1, 4, 12)]], ext=ext, progress="")

2.5 Visualize Prediction

Visualize predictions with a two-panel map with 1) the Bioclim raw output and 2) the thresholded model

par(mfrow=c(1,2)) 
plot(pb, main="Bioclim, raw values") 
plot(pb > tr, main="presence/absence")

It’s also worth considering what else can be done to evaluate a model. Useful questions include:
Does the model seem sensible, ecologically?
Do the fitted functions (the shapes of the modeled relationships) make sense?
Do the predictions seem reasonable? (map them, and think about them)?

REPORT

This report can be based on the data of the species of your choosing or on the orangutan data we worked with in class.

Introduction:

A brief overview (5-7ish sentences) a general discussion about SDM, why we care to do it, and a bit about the species you selected. Include 2-3 references if you are a grad student.

Methods:

A description of the data and a high-level overview of what you did (not a step-by-step, but a few sentences describing the workflow such that someone could repeat it, including citations for packages).

Describe the data and any way you manipulated it.

Include a map of the cleaned species occurrence (presence) points and the absence / background points (one of the Figures we turned in for Lab 3a).

Describe the environmental variables and why did you choose the environmental variables that you did.

Include a map of all the environmental variables used.

Briefly describe the BIOCLIM modeling approach.

Results (what you found in the data and what it means)

Include what threshold you chose, the AUC, and the cor Include a 3-panel figure showing 1) the density curves of presence-absence data, 2) boxplots of suitability values for presence-absence data, and 3) an ROC curve.

Include a two-panel map with 1) the Bioclim raw output and 2) the thresholded model.

Discussion:

Should include:
How well did the model perform using the evaluation metrics?
How well did the model perform based on any other considerations?
2-3 references if you are a grad student
Could include: What your results demonstrated, and a few sentences about if your results showed the relationship that you were expecting, caveats to the data / analysis, next steps for a broader analysis, and / or broader implications of the findings.

Everything below here is Optional

These are some additional approaches

Optional: explore k-fold partitioning with a BIOCLIM model

In real projects, you would want to use k-fold data partitioning instead of a single random sample. The dismo function kfold facilitates that type of data partitioning. It creates a vector that assigns each row in the data matrix to a a group (between 1 to k). You can get fancy and use kfold partitioning to run 5 models and take the average of those models to estimate the AUC and threshold, but you don’t have to.

# partition the data
k <- 5
e <- list()
for (i in 1:k) {
    train <- presence[group != i, c(1,4,12)]
    test <- presence[group == i, c(1,4,12)]
    bc <- bioclim(train)
    e[[i]] <- evaluate(p=test, a=background[ , c(1,4,12)], bc)
}

# Generate AUC and threshold
auc <- sapply( e, function(x){slot(x, "auc")} )
mean(auc)

## [1] 0.7072188

thresh<-sapply( e, function(x){ x@t[which.max(x@TPR + x@TNR)] } )
mean(thresh)

## [1] 0.2786357

Optional: Explore building and executing a regression model: GLM

For the remaining modeling approach, models need to be fit with presence and absence (background) data. The two common regression methods: GLM and GAM. We won’t work through these in lab, but below is some additional information. In modeling, you may be familiar with taking a ’formula’ and identifying the dependent and independent variables, accompanied with a data.frame that holds these variables. For example, y ~ x1 + x2 + x3, i.e. y is a function of x1, x2, and x3. You’ll have noticed that models that are implemented in the dismo package (e.g., Bioclim) do not use a formula. In the example below, the function ’glm’ is used to fit generalized linear model, and it does use a formula.

m1 <- glm(pb ~ bio1 + bio4 + bio12, data=sdm)
class(m1)
summary(m1)

Optional: Build and execute a machine-learning model: Maxent

There are a variety of machine learning methods that can be easily implemented in R with associated packages, including Artifical Neural Networks (ANN) Classification and Regression Trees (CART), Random Forests, Boosted Regression Trees, and Support Vector Machines. Breiman (2001a) provides a accessible introduction to machine learning, and how it contrasts with ’classical statistics’ (model based probabilistic inference). Hastie et al., 2009 provide what is probably the most extensive overview of these methods.
MaxEnt (Maximum Entropy; Phillips et al., 2006) is the most widely used SDM algorithm. Elith et al. (2010) provide an explanation of the algorithm (and software) geared towards ecologists. MaxEnt is available as a standalone Java program. Dismo has a function ’maxent’ that communicates with this program.

1 Set-up for Maxent

1.1 Java development kit

Make sure you have the Java development kit installed on your machine: https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html. You may need to sign up for a user account.

1.2 Install and load the necessary packages in R

Run the code install.packages(‘rJava’) if you don’t have this package installed, then load below:

library(rJava)

1.3 Download and install MaxEnt

Access the program from https://biodiversityinformatics.amnh.org/open_source/maxent/. Note the citation for the software on this page.

1.4 Move files

Once the program is downloaded, put the file ’maxent.jar’ in the ’java’ folder of the ’dismo’ package. That is the folder returned by: system.file(“java”, package=“dismo”). Meaning, physically copy it from where you downloaded it and paste it in the file that the above command returns. Note that there are several different Library locations on a mac, and you’ll need to loacte the correct one.

Please note that this program (maxent.jar) cannot be redistributed or used for commercial purposes.

1.5 Run the model

Now you can run Maxent! If the jar file is in the right place, this should run. If not, you will see “cannot run this example because maxent is not available” in the console. If you see that, just put the jar file in the correct place.

jar <- paste(system.file(package="dismo"), "/java/maxent.jar", sep="") 
if (file.exists(jar)) {
    xm <- maxent(x=sdm[,c(2, 5, 13)], p= sdm[,1]) 
    plot(xm) 
} else { 
    cat('cannot run this example because maxent is not available')
        plot(1)
}

1.6 Model evaluation and prediction

eBioMax <- dismo::evaluate(p=pres_test[,c(1,4,12)], a = background[,c(1,4,12)], model = xm)
eBioMax

## class          : ModelEvaluation 
## n presences    : 193 
## n absences     : 500 
## AUC            : 0.8187409 
## cor            : 0.418324 
## max TPR+TNR at : 0.6100196

# Evaluation Visualization
# 1) the density curves of presence-absence data, 2) boxplots of suitability values for presence-absence data, and 3) an ROC curve
par(mfrow=c(1, 3))
density(eBioMax)
boxplot(eBioMax, col=c("blue", "red"))
plot(eBioMax, "ROC")

# Response plot
response(xm)

# Threshold
trMax <- threshold(eBioMax, "spec_sens") 
trMax

## [1] 0.6100196

# Prediction
pbMax <- dismo::predict(xm, worldcl[[c(1, 4, 12)]], ext=ext, progress="") 

# Visualize Prediction
par(mfrow=c(1,2)) 
plot(pbMax, main="Maxent, raw values") 
plot(pbMax > trMax, main="presence/absence")

Lab3b

Megan Cattau, Modified from Robert J. Hijmans and Jane Elith, 2016

2024