Last week, in Lab 3a, we got species occurrence points, generated background points, got environmental data, and ‘sampled’ the environmental data at the presence and background points. This week, we’ll be continuing this exercise and running Species Distribution Models in R. Remember that the aim of SDM is to estimate the probability that a species occurs at a given site (in a given grid cell) based in that site’s (grid cell’s) environmental similarity with the environmental conditions at known species occurrence locations.
1.1 Open RStudio
1.2 Create an R Script file. Click File -> NewFile -> R
Script
1.3 Save the R Script to your Lab3 folder. Click File > Save
As
1.4 Save the script frequently as you’re working. Click File
> Save
As we did last week, set your working directory and load the packages
we used last week. Remember to MODIFY my path name below
to your own path name
setwd("/Users/megancattau/Dropbox/0_BoiseState/Courses/2024_spr_Ecol_Dist/Labs/Lab3_SDM")
library(dismo)
library(raster)
library(sf)
library(rgbif)
library(maps)
3.1 Load sdm dataframe from Lab 3a and preprocess
sdm<-read.csv("sdmdata.csv")
sdm<-sdm[,-1]
Model evaluation often relies on cross-validation. This consists of creating a model with one ’calibration / training’ data set, and testing it with another ‘evaluation / testing’ data set. Training and testing data can be created through random sampling (without replacement) from a single data set (we’ll also look at k-fold partitioning below). We now partition the presence data into 5 groups, using 4 of the groups for training and one group for testing. The background data will only be used for model testing and does not need to be partitioned. Note that we use the function kfold to assign each presence sample to a group, we are not doing true k-fold partitioning, as we do not run the model k times, alternating which set is training and which is testing.
# create presence data from our dataframe and partition
presence<-sdm[sdm$pb==1,-1]
group <- kfold(presence,5)
pres_train <- presence[group != 1, ]
pres_test <- presence[group == 1, ]
# create background data from our dataframe
background<-sdm[sdm$pb==0, -1]
MODELING METHODS: Profile, regression, machine-learning
A large number of algorithms have been used in species distribution modeling. They can be classified as ’profile’, ’regression’, and ’machine learning’ methods. Profile methods only consider ’presence’ data, not absence or background data. Regression and machine learning methods often use both presence and absence or background data, but not always.
Several profile methods can be implemented in the dismo package, including BIOCLIM. The BIOCLIM algorithm is a ’climate-envelope-model’ (Booth et al., 2014), and it’s has been used extensively for SDM even though it generally does not perform as well as some other modeling methods, particularly in the context of climate change.
Note that here we’re fitting a Bioclim model using data.frame with
each row representing the environmental data at known sites of presence
of a species. You could instead fit a bioclim model simply using the
predictors and the occurrence points (and the function would do the
extracting for us).
Remind yourself what environmental variables we have available to us:
https://worldclim.org/data/bioclim.html. Let’s select
three varibales: BIO1, BIO4, and BIO12. Different modeling methods
return different type of model objects (‘bc’ for the bioclim model we
will run below). When implementing it in dismo, the results are reported
between 0 and 1.
# Look at environmental variables we want to use
worldcl<-getData(name='worldclim', var='bio', res=2.5)
par(mfrow=c(1, 3))
plot(worldcl[[c(1,4,12)]])
# Run the model
bc <- dismo::bioclim(pres_train[,c(1,4,12)])
Different statistical measures can help us assess how well the model
fits performs. We’ll want to use the test data set, which includes
presence points that were not used in model calibration /
training.
Threshold independent measures include the correlation coefficient and
the Area Under the Receiver Operator Curve (AUC). AUC is a measure of
rank-correlation, and a high AUC indicates that sites with high
predicted suitability values tend to be areas of known presence and
locations with lower model prediction values tend to be areas where the
species is not known to be present. An AUC score of 0.5 means that the
model is as good as random.
We evaluate the model by providing testing presence points, background (absence) points, and the model object (if we were using spatial presence / background points without the environmental data, we would also include a RasterStack of the environmental variables). Note that we include just the environmental data we used to build the model.
eBio <- dismo::evaluate(p=pres_test[,c(1,4,12)], a = background[,c(1,4,12)], model = bc)
eBio
## class : ModelEvaluation
## n presences : 193
## n absences : 500
## AUC : 0.7173627
## cor : 0.2949839
## max TPR+TNR at : 0.2771152
We can visualize the model evaluation by creating a 3-panel figure showing 1) the density curves of presence-absence data, 2) boxplots of suitability values for presence-absence data, and 3) an ROC curve
par(mfrow=c(1, 3))
density(eBio)
boxplot(eBio, col=c("blue", "red"))
plot(eBio, "ROC")
We can also look at the response function that creates response plots for each variable, with the other variables at their median value.
response(bc)
For threshold dependent measures, a threshold must be set first that indicates presence / suitable above that threshold and absence / not suitable below that threshold. Some measures emphasize the weight of false absences; others give more weight to false presences.
Set the threshold and think: how did we find this threshold? There are several other options (run the code ?threshold to explore them). Think: is this the one you would choose?
tr <- threshold(eBio, "spec_sens")
tr
## [1] 0.2771152
The purpose is SDM is to create a map of suitability scores. Model objects can be used to with the predict function to make predictions for any combination of values of the independent variables, including for values across a raster. Here, we predict across the values of the bioclimatic variables in worldclim (Note that we include just the environmental data we used to build the model) at the extent of our species presence points./ Please note that if you’re predicting for a model outside of the dismo package (random forest, boosted regression trees, glm, etc.), you’ll want the order of the arguments in the predict function to be first the RasterStack, then the model object. In the example below, this would be predict(bc, worldcl) rather than predict(worldcl, bc) as we’ve done it.
zoom_x<-c(90,120)
zoom_y<-c(-10,10)
ext = extent(zoom_x[1], zoom_x[2], zoom_y[1], zoom_y[2])
pb <- dismo::predict(bc, worldcl[[c(1, 4, 12)]], ext=ext, progress="")
Visualize predictions with a two-panel map with 1) the Bioclim raw output and 2) the thresholded model
par(mfrow=c(1,2))
plot(pb, main="Bioclim, raw values")
plot(pb > tr, main="presence/absence")
It’s also worth considering what else can be done to evaluate a
model. Useful questions include:
Does the model seem sensible, ecologically?
Do the fitted functions (the shapes of the modeled relationships) make
sense?
Do the predictions seem reasonable? (map them, and think about
them)?
A brief overview (5-7ish sentences) a general discussion about SDM, why we care to do it, and a bit about the species you selected. Include 2-3 references if you are a grad student.
A description of the data and a high-level overview of what you did (not a step-by-step, but a few sentences describing the workflow such that someone could repeat it, including citations for packages).
Describe the data and any way you manipulated it.
Include a map of the cleaned species occurrence (presence) points and the absence / background points (one of the Figures we turned in for Lab 3a).
Describe the environmental variables and why did you choose the environmental variables that you did.
Include a map of all the environmental variables used.
Briefly describe the BIOCLIM modeling approach.
Include what threshold you chose, the AUC, and the cor Include a 3-panel figure showing 1) the density curves of presence-absence data, 2) boxplots of suitability values for presence-absence data, and 3) an ROC curve.
Include a two-panel map with 1) the Bioclim raw output and 2) the thresholded model.
Should include:
How well did the model perform using the evaluation metrics?
How well did the model perform based on any other considerations?
2-3 references if you are a grad student
Could include: What your results demonstrated, and a few sentences about
if your results showed the relationship that you were expecting, caveats
to the data / analysis, next steps for a broader analysis, and / or
broader implications of the findings.
These are some additional approaches
In real projects, you would want to use k-fold data partitioning instead of a single random sample. The dismo function kfold facilitates that type of data partitioning. It creates a vector that assigns each row in the data matrix to a a group (between 1 to k). You can get fancy and use kfold partitioning to run 5 models and take the average of those models to estimate the AUC and threshold, but you don’t have to.
# partition the data
k <- 5
e <- list()
for (i in 1:k) {
train <- presence[group != i, c(1,4,12)]
test <- presence[group == i, c(1,4,12)]
bc <- bioclim(train)
e[[i]] <- evaluate(p=test, a=background[ , c(1,4,12)], bc)
}
# Generate AUC and threshold
auc <- sapply( e, function(x){slot(x, "auc")} )
mean(auc)
## [1] 0.7072188
thresh<-sapply( e, function(x){ x@t[which.max(x@TPR + x@TNR)] } )
mean(thresh)
## [1] 0.2786357
For the remaining modeling approach, models need to be fit with presence and absence (background) data. The two common regression methods: GLM and GAM. We won’t work through these in lab, but below is some additional information. In modeling, you may be familiar with taking a ’formula’ and identifying the dependent and independent variables, accompanied with a data.frame that holds these variables. For example, y ~ x1 + x2 + x3, i.e. y is a function of x1, x2, and x3. You’ll have noticed that models that are implemented in the dismo package (e.g., Bioclim) do not use a formula. In the example below, the function ’glm’ is used to fit generalized linear model, and it does use a formula.
m1 <- glm(pb ~ bio1 + bio4 + bio12, data=sdm)
class(m1)
summary(m1)
There are a variety of machine learning methods that can be easily
implemented in R with associated packages, including Artifical Neural
Networks (ANN) Classification and Regression Trees (CART), Random
Forests, Boosted Regression Trees, and Support Vector Machines. Breiman
(2001a) provides a accessible introduction to machine learning, and how
it contrasts with ’classical statistics’ (model based probabilistic
inference). Hastie et al., 2009 provide what is probably the most
extensive overview of these methods.
MaxEnt (Maximum Entropy; Phillips et al., 2006) is the most widely used
SDM algorithm. Elith et al. (2010) provide an explanation of the
algorithm (and software) geared towards ecologists. MaxEnt is available
as a standalone Java program. Dismo has a function ’maxent’ that
communicates with this program.
Make sure you have the Java development kit installed on your machine: https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html. You may need to sign up for a user account.
Run the code install.packages(‘rJava’) if you don’t have this package installed, then load below:
library(rJava)
Access the program from https://biodiversityinformatics.amnh.org/open_source/maxent/. Note the citation for the software on this page.
Once the program is downloaded, put the file ’maxent.jar’ in the
’java’ folder of the ’dismo’ package. That is the folder returned by:
system.file(“java”, package=“dismo”). Meaning, physically copy it from
where you downloaded it and paste it in the file that the above command
returns. Note that there are several different Library locations on a
mac, and you’ll need to loacte the correct one.
Please note that this program (maxent.jar) cannot be redistributed or used for commercial purposes.
Now you can run Maxent! If the jar file is in the right place, this should run. If not, you will see “cannot run this example because maxent is not available” in the console. If you see that, just put the jar file in the correct place.
jar <- paste(system.file(package="dismo"), "/java/maxent.jar", sep="")
if (file.exists(jar)) {
xm <- maxent(x=sdm[,c(2, 5, 13)], p= sdm[,1])
plot(xm)
} else {
cat('cannot run this example because maxent is not available')
plot(1)
}
eBioMax <- dismo::evaluate(p=pres_test[,c(1,4,12)], a = background[,c(1,4,12)], model = xm)
eBioMax
## class : ModelEvaluation
## n presences : 193
## n absences : 500
## AUC : 0.8187409
## cor : 0.418324
## max TPR+TNR at : 0.6100196
# Evaluation Visualization
# 1) the density curves of presence-absence data, 2) boxplots of suitability values for presence-absence data, and 3) an ROC curve
par(mfrow=c(1, 3))
density(eBioMax)
boxplot(eBioMax, col=c("blue", "red"))
plot(eBioMax, "ROC")
# Response plot
response(xm)
# Threshold
trMax <- threshold(eBioMax, "spec_sens")
trMax
## [1] 0.6100196
# Prediction
pbMax <- dismo::predict(xm, worldcl[[c(1, 4, 12)]], ext=ext, progress="")
# Visualize Prediction
par(mfrow=c(1,2))
plot(pbMax, main="Maxent, raw values")
plot(pbMax > trMax, main="presence/absence")