Habitat Suitability Modeling (HSM)

Habitat suitability modeling is the process of using species occurrence records, environmental predictor variables, and statistical models to predict suitable habitat for a given species across a given extent. The model output is a probability raster, with each cell assigned a value between 0% and 100%.

Producing a habitat suitability model has its limitations. It’s important to know these!

Note: Do yourself a favor and toggle the code button at the top right of this document. It’ll expand all code throughout the tutorial! As of the creation date, everything works. R and its packages are subject to change, so future updates may create errors. This tutorial is not maintained.


Limitations

  1. HSM does not tell you the abundance of a species in geographic space.
  • You could infer that more suitable areas would have higher abundance, but that inference does not always hold.
  2. HSM does not directly tell you where a species is.
  • You can infer that more suitable areas actually house the species. However, this isn’t what the model is predicting.
  3. HSM is limited to the spatial resolution of the coarsest raster in your stack of predictors.
  • If you build a model with predictor variables at a 1km spatial resolution, then your output will be at a 1km spatial resolution. Mixing spatial resolutions (e.g., using 800m climate data and 30m topography data) will leave the model output at 800m; the predictors would first need to be resampled onto a common grid (see the sketch after this list).
  4. HSM is limited to the relevance of the predictor variables chosen.
  • For example, the primary drivers of plant distribution are climate, topography, and soils. If you develop a model using only demographic and economic predictors, the model might actually predict “well” in terms of metrics, but that doesn’t mean your model is ecologically relevant.
  5. HSM is temporally limited by the dates of the occurrence records used.
  • Let’s say you’re modeling habitat suitability for a migratory bird species. If you only use occurrence points collected during summer, you’re building a summer habitat model, excluding any winter range.
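The sketch below (not part of this tutorial’s workflow) shows one way to bring mixed-resolution predictors onto a common grid with terra before stacking them; the file names are hypothetical placeholders.

library(terra)
clim   <- rast("climate_800m.tif")    # hypothetical 800m climate predictor
topo30 <- rast("topography_30m.tif")  # hypothetical 30m topography predictor
# Resample the finer raster onto the coarser grid so both layers share the
# same resolution, extent, and cell alignment before stacking
topo800 <- resample(topo30, clim, method = "bilinear")
preds   <- c(clim, topo800)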

Examples of Habitat Suitability Modeling

Habitat suitability modeling is synonymous with species distribution modeling and ecological niche modeling. Given its name, it’s easy to see why it’s heavily used in ecological research. Below are some links to popular publications that use OR discuss habitat suitability modeling:


Modeling Workflow

If you capture the whole process visually, it looks something like the figure below.

An extremely watered-down version of the entire process


Modeling Preprocessing

Let’s begin the modeling preprocessing. Click on each tab in the order presented to prepare your data.

1. Load Libraries

We’re using the following libraries:

library(sf)
library(dplyr)
library(tmap)
library(pander)
library(PresenceAbsence)
library(DAAG)
library(corrplot)
library(terra)
library(raster)
  • sf: Used for spatial data analysis of shapefiles.

  • dplyr: Used for dataframe organization and manipulation.

  • tmap: Used for making maps.

  • pander: Used for making nicer looking tables.

  • PresenceAbsence: Used for evaluating presence/absence models and deriving model metrics.

  • DAAG: Data Analysis And Graphics; used here for cross-validation via CVbinary().

  • corrplot: Used for predictor variable correlation.

  • terra: Used for raster analysis.

  • raster: Also used for raster analysis. Generally, terra is the newer and quicker way of doing raster analysis, but not every other package works well with terra. Sometimes we use both.

2. Organize Study Area & Occurrences

We first load our study area and occurrence data using the sf library and the st_read() function.

Note: A few things happen in the code below for the occurrences. We not only load in the occurrences shapefile, but we also reproject it to match the coordinate system of the study area (AOI) and then clip the occurrences to the study area.

# study area
AOI <- sf::st_read("Z:/2024/Summer/Habitat Suitability Modeling/WBP_HSM/data/study_area/study_area.shp") 
## Reading layer `study_area' from data source 
##   `Z:\2024\Summer\Habitat Suitability Modeling\WBP_HSM\data\study_area\study_area.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -115.1031 ymin: 42.31813 xmax: -108.6536 ymax: 46.9485
## Geodetic CRS:  WGS 84
# Occurrences
occ <- sf::st_read("Z:/2024/Summer/Habitat Suitability Modeling/WBP_HSM/data/occurrences/occurrences.shp") %>% 
  sf::st_transform(., st_crs(AOI)) %>% 
  st_intersection(., AOI)
## Reading layer `occurrences' from data source 
##   `Z:\2024\Summer\Habitat Suitability Modeling\WBP_HSM\data\occurrences\occurrences.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1083 features and 109 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 2425517 ymin: 1245625 xmax: 2928904 ymax: 1692369
## Projected CRS: NAD83 / Idaho Transverse Mercator

Now that shapefiles have been added into our R environment, let’s visualize using tmap with the study area in tan and the occurrences in black. Expand the code to see how.

tm_shape(AOI) + # Study area
  tm_polygons(col = "tan") + # polygons
  tm_shape(occ) + # occurrences
  tm_dots() + # points
  tm_scale_bar(position = c("left", "bottom")) + # scale bar location
  tm_compass(position = c("right", "top")) + # north arrow location
  tm_layout(frame = FALSE) # remove map border

3. Create Pseudo-absences

Presence/absence models require… well, presence and absence data. Often we use field data collection to record where a species is occurring. It’s actually quite rare to go out and record locations where the species is not. Thus, we develop pseudo-absences.

Pseudo-absences are randomized points created to simulate species absence. There are a few ways to do this and some extra steps one can take (developing a grid, etc.), but I find they change modeling outcomes very little. Here, we will generate pseudo-absences throughout the study area at a rate of twice the number of presences. Why? Well, absence is tricky. Just because a species isn’t present doesn’t mean the habitat isn’t suitable, that the species wasn’t there previously, or that the species won’t occupy the location tomorrow. To account for these issues, we generate more absences than presences, since some of these pseudo-absences will inevitably overlap occurrences or fall into suitable habitat.

What’s happening here: We use the st_sample() function to generate random points throughout the study area. Size refers to the number of points we want to create, which is 2 x the number of rows (observations) of the occurrences. We do some mapping for visualization, then take the coordinates of the presences and the absences and combine them into a single data frame with longitude, latitude, and presence/absence columns. We identify a presence with 1 and an absence with 0.

set.seed(1000)
occ.abs <- st_sample(st_geometry(AOI), type = "random", size = (nrow(occ)*2)) %>% 
  st_sf() 

tm_shape(AOI) +
    tm_polygons(col = "tan") +
  tm_shape(occ.abs) +
    tm_dots(col = "blue") +
  tm_shape(occ) +
    tm_dots(col = "red") +
  tm_compass(position = c("right", "top")) +
  tm_scale_bar(position = c("left", "bottom")) +
  tm_layout(frame = FALSE) 

### Combine coordinates of pres and abs into 1 dataframe #######################

abs.coords <- sf::st_coordinates(occ.abs) %>% 
  cbind.data.frame(., occ.abs$geometry) %>% 
  mutate(PA = 0)

pres.coords <- sf::st_coordinates(occ) %>% 
  cbind.data.frame(., occ$geometry) %>% 
  mutate(PA = 1)

occ.all <- rbind.data.frame(pres.coords, abs.coords) %>% 
  st_sf()

4. Organize Predictors

We’re now going to load our environmental predictor variables into the R environment using the terra library. We can also check some metadata and raster information. One thing to know is that in R (and many other tools) we stack the predictor variables into one object. While “topo” is a single object in our environment, it actually represents 7 rasters stacked on top of each other. You can see this when printing “topo”, under the dimensions attribute, which lists the number of rows of cells, the number of columns of cells, and the number of layers (7 rasters).

What’s happening here: We identify the raster data in the stored folder by selecting all files whose names end in “.tif”. Then we load in those data and clip (mask) the rasters to the study area shapefile, which may take a few minutes to process.

topo_list <- list.files("Z:/2024/Summer/Habitat Suitability Modeling/WBP_HSM/data/topo", pattern = '.tif$', all.files = T, full.names = T)
topo <- terra::rast(topo_list) %>% 
  terra::mask(., AOI)
topo
## class       : SpatRaster 
## dimensions  : 17183, 23933, 7  (nrow, ncol, nlyr)
## resolution  : 0.0002694946, 0.0002694946  (x, y)
## extent      : -115.1033, -108.6535, 42.31793, 46.94865  (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326) 
## source      : spat_243fc1407efb_148476.tif 
## color table : 3 
## names       :   aspect, elevation, landcover,       NDVI,    slope, constant, ... 
## min values  :   0.0000,         0,         0, -0.8963234,  0.00000,        0, ... 
## max values  : 359.8173,      4216,        19,  0.8600246, 86.88714,        1, ...
plot(topo$elevation, main = "Elevation Plot")

5. Extract Predictor Values

In this section, we are going to extract information from the predictor variables at each individual presence/absence point. Basically, we overlay the points onto the raster stack and drill through each raster, resulting in a table of predictor values for each presence or absence. This information is crucial: paired with the presence/absence response, it is the training data for our model.

# topo

spec.trained.topo <- terra::extract(topo, occ.all) %>% 
  cbind.data.frame(occ.all)

spec.trained.topo <- spec.trained.topo[c(11, 2:8)] # keep the PA column (11) and the 7 extracted predictor columns (2:8)
spec.trained.topo <- na.omit(spec.trained.topo)    # drop points with missing predictor values

pander::pander(head(spec.trained.topo))
PA   aspect   elevation   landcover   NDVI      slope   constant   NLCD_Percent_Tree_Canopy_Cover
--   ------   ---------   ---------   -------   -----   --------   ------------------------------
1    180      2475        1           0.1697    1.854   0.6224     41
1    216.3    2916        1           0.06712   18.86   0.8563     28
1    285.1    2966        1           0.1591    29.17   0.7535     32
1    237.6    2759        1           0.3655    13.58   0.3594     37
1    154.7    2643        1           0.1411    12.12   0.5509     37
1    125.4    2698        8           0.4312    17.08   0.9548     21

One important thing to check when modeling (in basically anything) is the correlation, or multicollinearity, among your predictor variables. The correlation coefficient is a statistical measure of the strength of the linear relationship between two variables. It ranges from -1 to 1 (perfect negative to perfect positive), with 0 indicating no linear relationship.

So why does it matter? Well, multicollinearity can make it hard to determine the effect of each predictor on the response variable. It can also lead to less reliable results, and it can increase model complexity and overfitting, which can decrease the predictive capacity of your model. So we seek to identify whether there is a high correlation coefficient among variables. The threshold value varies from subject to subject, with some saying 0.6 is too high and others 0.8. We’re going to stick with 0.7, but it’s really up to you.

So what’s happening: We select the columns of just the extracted predictor information, remove any NAs, and run a Pearson correlation among the extracted predictor values. The plot shows -1 to 1 (red to blue) and the correlation coefficient for each pair of variables. If any pair is above 0.7, we remove one of the two highly correlated variables. If everything is under 0.7, we’re good to go ahead with the modeling process as is.

spec.cor.topo <- spec.trained.topo[2:7] # subset dataframe to show just extracted predictor info
spec.cor.topo <- na.omit(spec.cor.topo) # can't have any NA's for the correlation plot
topo.cor <- cor(spec.cor.topo) # correlation
corrplot.mixed(topo.cor, lower = 'number', upper = 'square') # plot
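If any pair did exceed the 0.7 threshold, here is a minimal sketch (not part of the original workflow) for flagging those pairs so you can decide which variable to drop:

# Flag predictor pairs with |r| > 0.7 (upper triangle only, to avoid duplicate pairs)
high.cor <- which(abs(topo.cor) > 0.7 & upper.tri(topo.cor), arr.ind = TRUE)
data.frame(var1 = rownames(topo.cor)[high.cor[, 1]],
           var2 = colnames(topo.cor)[high.cor[, 2]],
           r    = round(topo.cor[high.cor], 2))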


Modeling

We’re now going to go step by step through the modeling process. It’s a good idea to continually run the code to check for issues as you build it. If you wait until all the code is written, you may run into errors that are much harder to trace.

1. GLM

Below is the model written within the glm() function. We assign our response variable (PA = presence/absence) the class of factor, which is suitable for a binary response (0 or 1 = absence or presence) modeled with a logit link function. By setting family = binomial, we make this model a logistic regression.

The math for the model can be written as:

\[ \mathrm{logit}(P(Y=1)) = \beta_0 + \beta_1\,\mathrm{aspect} + \beta_2\,\mathrm{elevation} + \beta_3\,\mathrm{landcover} + \beta_4\,\mathrm{slope} + \beta_5\,\mathrm{NDVI} + \beta_6\,\mathrm{CanopyCover} + \beta_7\,\mathrm{TopographicDiversity} \]

where logit(P(Y=1)) is the log odds of the probability of the response variable being 1 (i.e., presence). The model coefficients (β0 through β7) are estimates derived from maximum likelihood estimation, which finds the parameter values that maximize the probability of the observed data given the model.
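To go from log odds back to a probability, the model applies the inverse logit; this is what predict(..., type = "response") returns below:

\[ P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_7 x_7)}} \]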

### Basic GLM with logit link = Binomial, dataframe = spec.trained.topo
mod1.LR <- glm(as.factor(PA) ~ aspect + elevation + landcover + slope + NDVI + constant + NLCD_Percent_Tree_Canopy_Cover, 
               family = binomial, data = spec.trained.topo)

mod1.pred <- predict(mod1.LR, type = "response") 

# save
#save(mod1.LR, file = "data/model_outputs/30m/r.mods/mod1.LR.RData")
#save(mod1.pred, file = "data/model_outputs/30m/r.mods/mod1.pred.LR.RData")

2. Model Summary

# summary 
summary(mod1.LR)
## 
## Call:
## glm(formula = as.factor(PA) ~ aspect + elevation + landcover + 
##     slope + NDVI + constant + NLCD_Percent_Tree_Canopy_Cover, 
##     family = binomial, data = spec.trained.topo)
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -6.0054208  0.3908276 -15.366  < 2e-16 ***
## aspect                          0.0007096  0.0004038   1.758  0.07883 .  
## elevation                       0.0021325  0.0001310  16.276  < 2e-16 ***
## landcover                      -0.0401000  0.0138759  -2.890  0.00385 ** 
## slope                          -0.0276999  0.0053365  -5.191 2.10e-07 ***
## NDVI                           -0.4971664  0.2976998  -1.670  0.09491 .  
## constant                        1.2445847  0.2462757   5.054 4.34e-07 ***
## NLCD_Percent_Tree_Canopy_Cover -0.0017935  0.0028487  -0.630  0.52898    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4134.4  on 3246  degrees of freedom
## Residual deviance: 3552.3  on 3239  degrees of freedom
## AIC: 3568.3
## 
## Number of Fisher Scoring iterations: 4
#plot
# plot(mod1.LR) # you can additionally plot this for residual plots, etc.

In this logistic regression model, we are examining the relationship between the binary response variable (PA) and several predictor variables. The model’s significant predictors at the 0.05 level include elevation, landcover, slope, and the constant term, indicating that these variables have a statistically significant impact on the likelihood of an area being suitable or not. Specifically, higher elevations are associated with a higher likelihood of suitability (positive coefficient), while higher landcover values (keeping in mind landcover is a class code, not a quantitative variable), slope, and NDVI decrease this likelihood. Aspect and NDVI show weaker significance levels (0.05 < p < 0.1), suggesting potential influence. The model’s goodness-of-fit is indicated by the residual deviance of 3552.3 and an AIC of 3568.3, showing a decent fit, with the null deviance considerably higher at 4134.4. Overall, the results suggest that topographical and vegetation-related factors play important roles in the designation of habitat suitability.
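If you prefer to interpret the coefficients on the odds scale (an optional aside, not part of the original workflow), exponentiate them; values above 1 increase the odds of presence and values below 1 decrease them.

# Odds ratios: multiplicative change in the odds of presence per one-unit
# increase in each predictor, holding the others constant
round(exp(coef(mod1.LR)), 4)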

3. Explaining model deviance

A way to gauge how well your model explains the variability in the response variable is to assess deviance. This is done through the equation:

\[ 100 * (1 - \frac{ResidualDeviance}{NullDeviance}) \]

The null and residual deviance are found in the summary above. Looking at our value of 14.08, we see that the model only explains about 14% of the deviance in the data. So we are capturing some variability in the response variable, but a substantial portion of the variability is yet to be explained. This is most likely due to the lack of predictor variables available at the 30m spatial resolution.

mod1.fit <- 100 * (1 - mod1.LR$deviance/mod1.LR$null.deviance) 
mod1.fit
## [1] 14.07929

4. Stepwise variable reduction

One thing you can do is stepwise variable reduction, which produces a simplified model that excludes variables the procedure deems as not significantly contributing. This may improve the model, or it may not. The new logistic regression model after backward stepwise variable reduction includes the following predictors: aspect, elevation, landcover, slope, NDVI, and a constant term. The NLCD Percent Tree Canopy Cover variable was removed during the stepwise reduction process due to its lack of significant contribution to the model. Overall, the two models performed very similarly, but we’re going to keep the stepwise model because it is simpler.

# backwards stepwise variable reduction
mod2.LR <- step(mod1.LR, trace = F)
mod2.LR
## 
## Call:  glm(formula = as.factor(PA) ~ aspect + elevation + landcover + 
##     slope + NDVI + constant, family = binomial, data = spec.trained.topo)
## 
## Coefficients:
## (Intercept)       aspect    elevation    landcover        slope         NDVI  
##  -6.0635104    0.0007113    0.0021310   -0.0345246   -0.0275199   -0.5540655  
##    constant  
##   1.2453904  
## 
## Degrees of Freedom: 3246 Total (i.e. Null);  3240 Residual
## Null Deviance:       4134 
## Residual Deviance: 3553  AIC: 3567
# save
#save(mod2.LR, file = "data/model_outputs/30m/r.mods/mod2.LR.RData") 

# model 2 fit
100 * (1- mod2.LR$deviance/mod2.LR$null.deviance)
## [1] 14.0697
# model 2 prediction
mod2.pred <- predict(mod2.LR, type = "response")

# save
#save(mod2.pred, file = "data/model_outputs/30m/r.mods/mod2.pred.LR.RData")

5. Confusion Matrix

Below, we do a few things to find the optimal threshold for the model (if we had to split the output into 2 classes (suitable/not suitable), which value would be the cutoff?) and create the confusion matrix (a prediction summary in matrix form, showing correct and incorrect predictions per class).

We produce a model cut threshold of 0.36, which was determined using the “MaxKappa” methodology, a measure of classification accuracy.

We also produce a confusion matrix showing 1530 true negatives, 634 false positives, 321 false negatives, and 762 true positives.

### model confusion matrix #####################################################

modl <- "mod2.LR"
dat2 <- cbind(modl, spec.trained.topo[1], mod2.pred)

# confusion matrix
mod.cut <- optimal.thresholds(dat2, opt.methods = c("MaxKappa"))
mod2.cfmat <- table(dat2[[2]], factor(as.numeric(dat2$mod2.pred >= mod.cut$mod2.pred)))
mod.cut
##     Method mod2.pred
## 1 MaxKappa      0.36
mod2.cfmat
##    
##        0    1
##   0 1530  634
##   1  321  762
# save
#save(mod.cut, file = "data/model_outputs/30m/r.mods/mod.cut.LR.RData")
#save(mod2.cfmat, file = "data/model_outputs/30m/r.mods/mod2.cfmat.LR.RData")
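As an optional sanity check (not in the original workflow), the accuracy metrics reported in the next step can be recovered directly from this confusion matrix:

# Hand-computed check from the confusion matrix above (rows = observed, columns = predicted)
TP <- 762; FN <- 321; TN <- 1530; FP <- 634
TP / (TP + FN)                       # sensitivity, ~0.704
TN / (TN + FP)                       # specificity, ~0.707
TP / (TP + FN) + TN / (TN + FP) - 1  # true skill statistic, ~0.411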

6. Model Accuracy Metrics

Here, we calculate a few metrics commonly used for reporting the predictive capabilities of a model:

  • Sensitivity: Known as the true positive rate, is a measure of how well the model identifies positive cases (i.e., suitable habitat).

  • Specificity: True negative rate, describes how well the model identifies negative cases, or non-suitability.

  • Area Under the Curve: A measurement of a model’s performance, specifically its ability to distinguish between classes. It’s represented by the percentage of area below the ROC (Receiver Operating Characteristic) curve, and is a key part of ROC analysis. A higher AUC indicates better model performance, especially in binary classification tasks.

  • True Skill Statistic: A metric used to evaluate the accuracy of species distribution models, calculated as sensitivity + specificity - 1. It ranges from -1 to 1, where 0 indicates performance no better than random.

Model sensitivity and specificity were reported as 0.704 and 0.707, respectively. Sensitivity, also known as the true positive rate, is a measure of how well the model identifies positive cases (i.e., suitable habitat). Specificity, the true negative rate, describes how well the model identifies negative cases, or non-suitability. Another principal metric for determining predictive performance is the area under the curve (AUC). The model’s AUC was 0.756 out of 1, which is higher than a random classifier (AUC of 0.5). Lastly, the true skill statistic was reported as 0.411, which suggests that the model performed better than random chance.

### Model Accuracy #############################################################
mod2.acc <- presence.absence.accuracy(dat2, threshold = mod.cut$mod2.pred, st.dev = F)
tss <- mod2.acc$sensitivity + mod2.acc$specificity - 1
mod2.acc <- cbind(mod2.acc[1:7], tss)
pander::pander(mod2.acc[c(1,4:5,7:8)])
model       sensitivity   specificity   AUC      tss
---------   -----------   -----------   ------   ------
mod2.pred   0.7036        0.707         0.7558   0.4106
auc.roc.plot(dat2, color = T)

#savePlot("auc.roc.plot.LR.jpeg", type = "jpeg")

7. Cross-validation

Below, cross-validation was performed using the CVbinary() function, which performs cross-validation for a binary classification model; here we used 5 folds. Examining the table output, we see that for each point (either presence or absence, represented by PA), we have the model’s prediction in the mod2.pred column and the cross-validated prediction in the mod2.cv10.1 column. The cross-validated predictions can be used to assess the model’s performance more robustly, as cross-validation helps to mitigate overfitting and provides a more realistic estimate of predictive performance on unseen data.

# Cross validation accuracy
mod2.cv10 <- CVbinary(mod2.LR, nfolds = 5, print.details = F)
ls(mod2.cv10)
## [1] "acc.cv"       "acc.internal" "acc.training" "cvhat"        "internal"    
## [6] "training"
mod2.cv10.1 <- mod2.cv10$cvhat
dat2 <- cbind(dat2, mod2.cv10.1)
pander::pander(head(dat2))
modl      PA   mod2.pred   mod2.cv10.1
-------   --   ---------   -----------
mod2.LR   1    0.4836      0.4864
mod2.LR   1    0.6857      0.6547
mod2.LR   1    0.6161      0.6463
mod2.LR   1    0.4556      0.4647
mod2.LR   1    0.4796      0.4955
mod2.LR   1    0.4948      0.4859
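To see whether cross-validation changes the headline metrics, here is a minimal sketch (not in the original workflow) that reruns presence.absence.accuracy() now that dat2 holds both the apparent and cross-validated predictions:

# Compare apparent vs. cross-validated accuracy; dat2 now contains both prediction columns
cv.acc <- presence.absence.accuracy(dat2, threshold = mod.cut$mod2.pred, st.dev = FALSE)
pander::pander(cv.acc[c("model", "sensitivity", "specificity", "AUC")])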

8. Prediction

The best part of the whole process: prediction and visualization.

We will create 3 model output rasters:

  • Probability 0-100%

  • Classified Binary (either 0 or 1 based on model optimal threshold value)

  • Binned Probability with 5 classes based on probability (really only used to simplify visualizations).

Probability

spp.predsR <- as(topo, "Raster") # convert the SpatRaster stack to a Raster* stack - this is why we need the raster package

modFprob.LR.1 <- predict(spp.predsR, mod2.LR, filename = "mod2.LRprob_2.tif", 
                         type = "response", fun = predict, 
                         index = 2, overwrite = TRUE)

plot(modFprob.LR.1)

##terra::writeRaster(modFprob.LR.1, filename = "data/model_outputs/30m/rasters/Probability_WBP_30m_topo.tif", overwrite = T)

Classified Binary

# classified binary prediction: reclassify() takes (from, to, becomes) triplets, so values
# from 0 up to the threshold become 0 and values from the threshold up to 1 become 1
modFprobclas.R <- reclassify(modFprob.LR.1, c(0, mod.cut[[2]], 0,
                                              mod.cut[[2]], 1, 1))

plot(modFprobclas.R)

# terra::writeRaster(modFprobclas.R, filename = "data/model_outputs/30m/rasters/Classification_30m.tif", overwrite = T)

Binned Probability

modFprobclas.bin.R <- reclassify(modFprob.LR.1,c(0,.2,1,
                                                 .2,.4,2,
                                                 .4,.6,3,
                                                 .6,.8,4,
                                                 0.8,1,5))
plot(modFprobclas.bin.R)

# terra::writeRaster(modFprobclas.bin.R, filename = "data/model_outputs/30m/rasters/BinnedClassification_30m.tif", overwrite = T)
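If you would rather stay in terra for this step (an alternative approach, not what the tutorial uses), terra::classify() does the same job as reclassify(); here is a minimal sketch, assuming you first convert the probability prediction back to a SpatRaster:

# Alternative sketch using terra instead of raster
modFprob.t <- terra::rast(modFprob.LR.1)           # convert the RasterLayer prediction to a SpatRaster
rcl <- matrix(c(0, mod.cut[[2]], 0,                # (from, to, becomes) rows
                mod.cut[[2]], 1, 1), ncol = 3, byrow = TRUE)
binary.t <- terra::classify(modFprob.t, rcl)
plot(binary.t)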

Summary

You have now created a logistic regression model (a generalized linear model) to predict habitat suitability of whitebark pine across the study area for the NASA DEVELOP ID NODE - Summer 2024 project. Future work would benefit from identifying and using more predictor variables at the 30m resolution. Currently, climate data is pretty much limited to 800m (PRISM data), and soil data is variable but usually comes as polygons that just need to be rasterized. The climate data needs to be downscaled by someone, which will probably happen eventually, but 30m climate data isn’t of much interest to climate modelers focused on regional studies (plus, does climate really vary that much among neighboring 30m pixels? Probably not. Also imagine the computing power needed - wild). Additionally, consider developing other models and averaging their outputs (take the cell-wise mean of two model outputs from two different models to get the best of both worlds; see the sketch below). Models have bias and error. Some predict conservatively and some predict liberally. Use many models to reduce the amount of uncertainty.
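As a minimal sketch of that ensemble idea (the file names are hypothetical; the second model’s raster does not exist in this tutorial), you can average two probability surfaces cell by cell with terra:

# Cell-wise mean of two probability rasters produced by different models
glm.prob <- terra::rast("data/model_outputs/30m/rasters/Probability_WBP_30m_topo.tif")
rf.prob  <- terra::rast("Probability_WBP_30m_secondmodel.tif")  # hypothetical second model output
ens.prob <- terra::app(c(glm.prob, rf.prob), fun = mean)
plot(ens.prob, main = "Ensemble (mean) habitat suitability")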

Enjoy!