Step 1: Learning about the species

To learn about our target species, the Brown-throated sloth (Bradypus variegatus), check for available online resources. In addition to the published literature, good sources include Wikipedia (https://en.wikipedia.org/wiki/Brown-throated_sloth) and the IUCN Red List (https://www.iucnredlist.org/fr/species/3038/210442893).

Species Distribution Models (SDMs) or ecological niche models (ENMs) are empirical models that relate records of species occurrence (e.g. presence or presence/absence data) with environmental variables to predict a species’ potential distribution. SDMs are of great use in conservation biology, especially in the context of global climate change. In this lab, you will learn how to:

  1. Prepare species occurrence data.
  2. Prepare environmental predictor variables.
  3. Specify, train, and validate a Generalized Linear Model (GLM) for predicting the distribution of the brown-throated sloth (Bradypus variegatus).
  4. Predict the distribution across the entire study area and generate a suitability map.

Step 2: Collecting occurrence data

Before we start with any analyses in R, we will load all packages that we need. If any packages are missing, install them with the install.packages() function.

We will use the following R packages: "sf" for handling spatial vector data; "terra" for handling spatial raster data; "mapview" for interactive visualization; "predicts" for spatial prediction methods, especially species distribution models (including an R link to the 'maxent' model); "corrplot" for visualizing correlation matrices; and the "tidyverse" for general data handling.
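
If you are unsure which of these packages are already installed, a small optional helper like the one below can install any that are missing (the vector of package names simply lists the packages used in this lab):

# optional: install any packages that are not yet available
pkgs <- c("sf", "terra", "mapview", "predicts", "tidyverse", "corrplot")
missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)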

# load packages
library(sf)
library(terra)
library(mapview)
library(predicts)
library(tidyverse)
library(corrplot)

In addition, we need to set R’s working directory, ideally to the folder where the data to be analyzed are located.

# setwd("D:/YOUR/WORKING/DIRECTORY")

After having collected occurrence data, we can load them into R. Here, we want to load the table of occurrences for the Brown-throated sloth and inspect it. To read in datasets stored in a comma-separated table format (.csv files), you can use the read.csv() function.

The resulting object will be a data frame. To inspect the data frame, you can use functions such as str(), View(), nrow(), ncol(), colnames(), etc.

To see how it is done, you can have a look at the hidden code below. However, here and throughout the session, always try to implement the R code yourself first!

bradypus_presence <- read.csv("bradypus_presence.csv")

str(bradypus_presence)
## 'data.frame':    80 obs. of  2 variables:
##  $ lon: num  -65.1 -63.2 -64.4 -56.7 -59.1 ...
##  $ lat: num  -16.8 -17.8 -16 -2.6 -3.7 ...
nrow(bradypus_presence)
## [1] 80
ncol(bradypus_presence)
## [1] 2
colnames(bradypus_presence)
## [1] "lon" "lat"

As a next step, we need to convert the table of coordinates to a spatial dataset. We here use the sf package to represent spatial datasets (an alternative would be the older sp package). To convert a data frame to a spatial sf point object, you can use the function st_as_sf(). To learn how to use this or any other function, pull up its help file by typing ?st_as_sf or by searching for the function under the “Help” tab inside RStudio (Alt+Ctrl+F1). The crs argument has to be set to 4326, the EPSG code of the WGS 84 coordinate reference system for longitude / latitude data.

bradypus_presence_sf <- st_as_sf(bradypus_presence, coords = c("lon", "lat"), crs = 4326)
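
As a quick optional sanity check, you can visualize the imported presence points interactively and confirm that the coordinates fall within the species’ known range:

mapview(bradypus_presence_sf)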

Step 3: Collecting environmental predictors

After having collected environmental predictor variables (typically raster data), we also need to import them into R and inspect them. We here will be using the terra package for this purpose.

To load a raster file, we can use the rast() function.

Here, our environmental predictors are in a .grd format, which allows saving layer names with the file. More commonly, environmental predictor rasters will be available in a GEOTiff (.tif) format.

To inspect layer names, we can use the names() function. To visualize raster data, we can use the plot() function. To interactively view a raster, we can use the mapview() function from the mapview package. To plot one raster layer at a time, we can select a single layer of the raster stack using the layer number or name inside double square brackets (e.g., env[[1]] to select the first layer of the raster object env).

Here is an overview of what the variable names stand for:

Name Variable (Unit)
tmp6190_ann Average annual temperature (10 degree C)
tmn6190_ann Minimum annual temperature (10 degree C)
tmx6190_ann Maximum annual temperature (10 degree C)
dtr6190_ann Annual daily temperature range (10 degree C)
frs6190_ann Annual frost frequency
pre6190_ann Annual rainfall (mm)
pre6190_l1 January rainfall (mm)
pre6190_l4 April rainfall (mm)
pre6190_l7 July rainfall (mm)
pre6190_l10 October rainfall (mm)
cld6190_ann Annual cloud cover (%)
vap6190_ann Annual vapor pressure (kPa)
h_dem Digital Elevation Model (m)

env <- terra::rast("environmental_vars.grd")
names(env)
##  [1] "cld6190_ann" "dtr6190_ann" "frs6190_ann" "h_dem"       "pre6190_ann"
##  [6] "pre6190_l1"  "pre6190_l10" "pre6190_l4"  "pre6190_l7"  "tmn6190_ann"
## [11] "tmp6190_ann" "tmx6190_ann" "vap6190_ann"
plot(env[[4]])

mapview(env[[4]])
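
Layers can also be selected by name instead of by number, which is often less error-prone, for example:

plot(env[["pre6190_ann"]])  # same layer as env[[5]], selected by name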

Step 4: Matching occurrences and predictors

As a next step, we need to match occurrences with environmental predictors. Since we do not have any absence information available for our species, we need to generate background points before we can build an SDM. For a start, we here will use a random sample of background points across the entire study area.

Next to the sampling method, another important consideration is the number of background points to sample. For most modeling algorithms, the risk lies in sampling too few background points, whereas sampling too many only leads to longer computation times for the model. In theory, we can test whether we have sampled enough background points by fitting models with an increasing number of background points and checking whether the model coefficients converge (i.e., stop changing). As a simpler alternative, a good rule of thumb is to fix a ratio of presence to background points. We here will start with a 1:10 ratio of presence:background points.

Use the backgroundSample() function from the predicts package to randomly sample background points across the study area extent given by the set of environmental predictors. The result is a table (matrix) containing the coordinates of the randomly sampled background points.

n_background <- nrow(bradypus_presence) * 10
background_points <- predicts::backgroundSample(env, n = n_background, p = bradypus_presence[,c("lon", "lat")])

Next, we need to convert the resulting background points into a spatial sf object using the st_as_sf() function, as we did above with the presence points (note that the coordinates first need to be converted to a data frame with as.data.frame()). Once you have converted them into an sf object, you can inspect the background points using the plot() or mapview() functions.

background_points_sf <- st_as_sf(as.data.frame(background_points), coords = c("x", "y"), crs = 4326)
mapview(background_points_sf)

Then, we combine the sf point objects of presences and background points. Before merging the two, we need to add a new column to each object indicating whether a point is a presence or a background point. Therefore, add a column called ‘occ’ to both sf objects and fill it with 1s for the presences and 0s for the background points. For example, you can use the $ operator together with the assignment operator (<-) to add a new column. Finally, we can merge both datasets using the rbind() function.

bradypus_presence_sf$occ <- 1
background_points_sf$occ <- 0

bradypus_presence_background_sf <- rbind(bradypus_presence_sf, background_points_sf)
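
As a quick optional check, you can tabulate the ‘occ’ column of the combined dataset; with the 1:10 ratio you should see 80 presences (1) and roughly 800 background points (0):

table(bradypus_presence_background_sf$occ)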

We can extract the values of the environmental predictors at the presence and background locations using the extract() function from the terra package. If you have other packages loaded that contain a function with the same name, make sure to use the notation terra::extract() to call the function. Also, we need to make sure that the information on whether a point is a presence or a background location does not get lost during extraction. One way to ensure this is to set the bind = TRUE option of the extract() function, which attaches the extracted values to the input points.

Extract the environmental variables using the extract() function. You can inspect the resulting dataset using the summary() function. Are there any missing values (NAs)?

env_extract <- terra::extract(env, bradypus_presence_background_sf, bind = TRUE)
# summary(env_extract) 

Finally, we can convert the SpatVector object resulting from the extract() function back into a regular data frame using as.data.frame(). To build the model correctly, we also need to convert the occurrence column (‘occ’) into a factor variable and remove rows containing missing values for the predictor variables. We can use the as.factor() and complete.cases() functions for these purposes.

model_df <- as.data.frame(env_extract)
model_df$occ <- as.factor(bradypus_presence_background_sf$occ)
model_df <- model_df[complete.cases(model_df),]
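
If you want to verify the background sample size rule of thumb from above, a rough sketch of the coefficient convergence check could look like the following. It simply wraps the sampling and extraction steps from this step into a loop; the predictor pair pre6190_ann + h_dem is used purely for illustration:

# refit the same simple model with an increasing number of background points
# and check whether the coefficients stabilize (illustrative sketch only)
n_test <- c(200, 400, 800, 1600)
coef_by_n <- sapply(n_test, function(n) {
  bg <- predicts::backgroundSample(env, n = n, p = bradypus_presence[, c("lon", "lat")])
  bg_sf <- st_as_sf(as.data.frame(bg), coords = c("x", "y"), crs = 4326)
  bg_sf$occ <- 0
  pts <- rbind(bradypus_presence_sf, bg_sf)  # bradypus_presence_sf already has occ = 1
  df <- as.data.frame(terra::extract(env, pts, bind = TRUE))
  df <- df[complete.cases(df), ]
  coef(glm(occ ~ pre6190_ann + h_dem, family = binomial(), data = df))
})
colnames(coef_by_n) <- n_test
coef_by_n  # columns should become very similar once enough background points are sampled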

Step 5: Building a model

Now we are ready to fit our first species distribution model!

Your task now will be to build two candidate SDMs using 2-3 predictor variables in each model. Select variables based on what you learned about the Brown-throated sloth. Variable sets between candidate models can be partially overlapping. To avoid including highly correlated predictor variables, you can check correlations between predictor variables in your modeling data frame with the cor() function. Make sure to first select the columns corresponding to the environmental predictor variables.

variable_names <- names(env)
cor_results <- cor(model_df[,variable_names])
# ?corrplot::corrplot
corrplot::corrplot(cor_results, addCoef.col = "black", number.digits = 1, diag = FALSE)
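
A commonly used (though not strict) rule of thumb is to avoid combining predictors whose pairwise correlation exceeds about |r| = 0.7. A small helper to list such pairs from the correlation matrix could look like this (the 0.7 cut-off is only an example):

# list predictor pairs with absolute correlation above 0.7 (example threshold)
high_cor <- which(abs(cor_results) > 0.7 & upper.tri(cor_results), arr.ind = TRUE)
data.frame(var1 = rownames(cor_results)[high_cor[, 1]],
           var2 = colnames(cor_results)[high_cor[, 2]],
           r = round(cor_results[high_cor], 2))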

Once we have decided on our candidate models, we can fit a logistic regression model using the glm() function for Generalized Linear Models (GLMs), selecting the correct error distribution by specifying family = binomial().

To select which variables in the data frame are used as predictors, we can use the formula interface. For example, to use the variables annual precipitation and elevation in the model, we can specify the formula occ ~ pre6190_ann+h_dem.

Check and interpret the resulting model coefficients using the summary() or coef() functions on the model object. You can use the exp() function to convert the coefficients to odds ratios.

glm1 <- glm(occ ~ pre6190_ann+h_dem, family=binomial(), data = model_df) 
summary(glm1)
## 
## Call:
## glm(formula = occ ~ pre6190_ann + h_dem, family = binomial(), 
##     data = model_df)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.2243078  0.3907307 -10.811  < 2e-16 ***
## pre6190_ann  0.0389182  0.0056096   6.938 3.98e-12 ***
## h_dem       -0.0005839  0.0003574  -1.634    0.102    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 536.16  on 879  degrees of freedom
## Residual deviance: 458.71  on 877  degrees of freedom
## AIC: 464.71
## 
## Number of Fisher Scoring iterations: 7
exp(coef(glm1))
## (Intercept) pre6190_ann       h_dem 
##  0.01463546  1.03968541  0.99941629
glm2 <- glm(occ ~ pre6190_ann+tmp6190_ann, family=binomial(), data = model_df) 
summary(glm2)
## 
## Call:
## glm(formula = occ ~ pre6190_ann + tmp6190_ann, family = binomial(), 
##     data = model_df)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.416760   1.146792  -5.595 2.20e-08 ***
## pre6190_ann  0.036811   0.005822   6.323 2.57e-10 ***
## tmp6190_ann  0.008731   0.004724   1.848   0.0646 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 536.16  on 879  degrees of freedom
## Residual deviance: 458.14  on 877  degrees of freedom
## AIC: 464.14
## 
## Number of Fisher Scoring iterations: 7
exp(coef(glm2))
## (Intercept) pre6190_ann tmp6190_ann 
## 0.001633942 1.037497076 1.008769381

Step 6: Validating the model

We now want to validate our model by assessing its predictive performance. We will calculate AUC and AIC values as two different indicators of performance.

To calculate AUC, we need to split our modeling data frame into a training and testing dataset. Therefore, we will first randomly sample the row numbers (i.e., observations) which will be assigned to the training dataset, while all remaining rows will be used as a testing dataset.

To sample row numbers, we can use the sample() function. As a sample size, we can start with using 80% of our observations in the training dataset.

Once we have sampled our training observations, we need to create two separate data frames containing the training and testing datasets. We can do this by subsetting our full modeling data frame, selecting rows like this: model_df[train_ids,] (or model_df[-train_ids,] to select all rows except those in train_ids).

How many observations (presence + background) do we have available for model training and testing, respectively?

train_ids <- sample(1:nrow(model_df), size = 0.8*nrow(model_df)) 
training_data <- model_df[train_ids,]
testing_data <- model_df[-train_ids,]
nrow(training_data)
## [1] 704
nrow(testing_data)
## [1] 176
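
Note that sample() draws a different random split each time it is run, so your exact numbers and AUC values will differ slightly from the ones shown here. If you want your split (and thus your results) to be reproducible, set a random seed before sampling, for example:

set.seed(123)  # any fixed number; run this before the sample() call above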

Once we have defined training and testing data, we have to re-fit our candidate models on the training dataset.

Use the glm() function to re-fit your two candidate models on the training dataset. Use the same predictor variables as above.

glm1_val <- glm(occ ~ pre6190_ann+h_dem, family=binomial(), data = training_data) 
glm2_val <- glm(occ ~ pre6190_ann+cld6190_ann, family=binomial(), data = training_data) 

To calculate AUC values, we can use the pa_evaluate() function from the predicts package. We need to provide the function with three inputs: (1) a data frame containing all rows corresponding to presence observations in the testing dataset, (2) a data frame containing all rows corresponding to background observations in the testing dataset, and (3) the model object fitted on the training dataset. To select rows for inputs (1) and (2), we can use testing_data[testing_data$occ == 1,] and testing_data[testing_data$occ == 0,], respectively.

In addition, we again need to pass on type = "response" so that predictions are made on the response (probability) scale (pa_evaluate() forwards this argument to the predict() function).

Evaluate your two candidate models fitted on the training data subset and save the outputs as R objects. You can then print the objects to see the AUC values or use the plot() function with "ROC" as the second argument to see the receiver operating characteristic (ROC) curve.

e1 <- predicts::pa_evaluate(p = testing_data[testing_data$occ == 1,], 
         a = testing_data[testing_data$occ == 0,], 
         glm1_val, 
         type = "response")

e1
## @stats
##   np  na prevalence  auc   cor pcor   ODP
## 1 20 156      0.114 0.77 0.403    0 0.886
## 
## @thresholds
##   max_kappa max_spec_sens no_omission equal_prevalence equal_sens_spec
## 1     0.236         0.072       0.022            0.113           0.092
## 
## @tr_stats
##     treshold kappa  CCR  TPR  TNR  FPR  FNR  PPP  NPP  MCR  OR
## 1          0     0 0.11    1    0    1    0 0.11  NaN 0.89 NaN
## 2          0     0 0.12    1 0.01 0.99    0 0.11    1 0.88 Inf
## 3          0     0 0.12    1 0.01 0.99    0 0.11    1 0.88 Inf
## 4        ...   ...  ...  ...  ...  ...  ...  ...  ...  ... ...
## 176     0.96  0.09 0.89 0.05    1    0 0.95    1 0.89 0.11 Inf
## 177     0.96     0 0.89    0    1    0    1  NaN 0.89 0.11 NaN
## 178     0.96     0 0.89    0    1    0    1  NaN 0.89 0.11 NaN
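
You can inspect the ROC curve of the first candidate model in the same way:

plot(e1, "ROC")
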
e2 <- predicts::pa_evaluate(p = testing_data[testing_data$occ == 1,], 
         a = testing_data[testing_data$occ == 0,], 
         glm2_val, 
         type = "response")

e2
## @stats
##   np  na prevalence  auc   cor pcor   ODP
## 1 20 156      0.114 0.76 0.371    0 0.886
## 
## @thresholds
##   max_kappa max_spec_sens no_omission equal_prevalence equal_sens_spec
## 1     0.256         0.056       0.033            0.114           0.082
## 
## @tr_stats
##     treshold kappa  CCR  TPR  TNR  FPR  FNR  PPP  NPP  MCR  OR
## 1       0.01     0 0.11    1    0    1    0 0.11  NaN 0.89 NaN
## 2       0.01     0 0.12    1 0.01 0.99    0 0.11    1 0.88 Inf
## 3       0.01     0 0.12    1 0.01 0.99    0 0.11    1 0.88 Inf
## 4        ...   ...  ...  ...  ...  ...  ...  ...  ...  ... ...
## 170     0.97  0.09 0.89 0.05    1    0 0.95    1 0.89 0.11 Inf
## 171     0.97     0 0.89    0    1    0    1  NaN 0.89 0.11 NaN
## 172     0.97     0 0.89    0    1    0    1  NaN 0.89 0.11 NaN
plot(e2, "ROC")

Calculating AIC values is much easier, since they can be calculated directly from the model object without needing to re-fit models on data subsets.

Calculate AIC values for your two candidate models using the AIC() function. Use your original models built with the full modeling data frame as an input.

AIC(glm1)
## [1] 464.7075
AIC(glm2)
## [1] 464.1395
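
To compare the two candidate models side by side, you could, for example, collect the AUC and AIC values in a small table. This sketch assumes the evaluation objects e1 and e2 created above and reads the AUC from their @stats slot:

# simple comparison table of the two candidate models (illustrative)
data.frame(model = c("glm1", "glm2"),
           AUC = c(e1@stats$auc, e2@stats$auc),
           AIC = c(AIC(glm1), AIC(glm2)))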

Step 7: Assessing the environmental niche

Once we have built and validated our SDMs, we can start using them. One key motivation for using SDMs is to gain insights into the ecology of a species, for example to better understand its habitat preferences or which variables are most important in determining its distribution or habitat use. Response curves are one useful tool in this regard. Note that checking the relationships learned by your model (e.g., via response curves or model coefficients) and comparing them to your ecological knowledge about the species should also be considered an integral part of model validation: it helps ensure that the model reflects realistic assumptions and is therefore able to make good predictions!

You can generate plots of response curves (one for each predictor variable) using the partialResponse() function from the predicts package. We again set type = 'response' so that the predicted values are returned on the response (probability) scale.

pr_pre6190_ann_1 <- partialResponse(model = glm1, 
                      data = training_data, 
                      var = "pre6190_ann",
                      nsteps = 30,
                      type = "response",    
                      plot = TRUE)

pr_h_dem_1 <- partialResponse(model = glm1, 
                      data = training_data, 
                      var = "h_dem",
                      nsteps = 30,
                      type = "response",    
                      plot = TRUE)

pr_pre6190_ann_2 <- partialResponse(model = glm2, 
                      data = training_data, 
                      var = "pre6190_ann",
                      nsteps = 30,
                      type = "response",    
                      plot = TRUE)

pr_cld6190_ann <- partialResponse(model = glm2, 
                      data = training_data, 
                      var = "cld6190_ann",
                      nsteps = 30,
                      type = "response",    
                      plot = TRUE)

Step 8: Making spatial predictions

As a final step, we can predict our model across geographical space to create maps of habitat suitability or habitat patches.

To do this, you can simply use the predict() function (i.e., the predict() method provided by the terra package). To check out the help file for this function, type ?terra::predict. You need to provide the function with a raster (stack) of predictor variables and your model object. In addition, you need to pass on the argument type = "response" to make sure that predictions are on the response (probability) scale.

Once you have created a habitat suitability map for the Brown-throated sloth by predicting your model, you can plot the result (e.g., using mapview()).

habitat_suitability <- predict(env, glm1, type = "response")
mapview(habitat_suitability)

To calculate thresholds for converting the continuous suitability prediction into a binary habitat-non habitat map, we can use the threshold() function from the predicts package. As an input, this requires the result of the pa_evaluate() function we used earlier to calculate the model’s predictive performance (i.e., AUC value).

thresholds <- threshold(e1)
thresholds
##   max_kappa max_spec_sens no_omission equal_prevalence equal_sens_spec
## 1 0.2356765    0.07161687  0.02191788         0.112584      0.09154412

To apply a threshold and reclassify your habitat suitability map, we can use the classify() function from the terra package. This function requires a matrix specifying class boundaries. We here will use a three-column matrix with the structure "from, to, becomes". We can create this matrix with the matrix() function, defining the values as c(0, threshold, 0, threshold, 1, 1), where threshold corresponds to your selected threshold. This means that all values below or equal to the threshold will be assigned a value of 0 (no habitat), while values above it will be assigned a value of 1 (habitat).

You can use the maximum sensitivity + specificity or the fixed sensitivity / omission threshold - or compare both.

classes <- matrix(
  c(0, thresholds[["max_spec_sens"]], 0,
    thresholds[["max_spec_sens"]],1, 1), 
  ncol = 3, byrow = TRUE) 
classes
##            [,1]       [,2] [,3]
## [1,] 0.00000000 0.07161687    0
## [2,] 0.07161687 1.00000000    1
binary_prediction <- terra::classify(habitat_suitability, classes)

mapview(binary_prediction)
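
If you want to keep your maps for later use (e.g., in a GIS), you can write them to disk with terra::writeRaster(); the file names below are just examples:

# save the continuous and the binary prediction as GeoTIFF files (example file names)
terra::writeRaster(habitat_suitability, "bradypus_suitability.tif", overwrite = TRUE)
terra::writeRaster(binary_prediction, "bradypus_habitat_binary.tif", overwrite = TRUE)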