1 Introduction

1.1 Study Species and Conservation Context

Rhampholeon chapmanorum (Tilbury, 1992), commonly known as Chapman’s Pygmy Chameleon, is a diminutive forest-floor reptile endemic to the lowland evergreen rainforests of the Natundu (Malawi) Hills in southern Malawi. Growing to a maximum total length of approximately 65 mm, the species is distinguished by a prominent rostral appendage, stellate granular scalation, and a compressed leaf-like body form that renders it near-invisible among the leaf litter it inhabits (Tilbury, 1992). Unlike the majority of chameleons, which are arboreal, R. chapmanorum is largely terrestrial, resting low in the understorey vegetation at night and moving through the forest floor by day.

The species is listed as Critically Endangered (CR) on the IUCN Red List (criteria B1ab(i,ii,iii,iv,v); assessed 2021), with an extent of occurrence (EOO) and area of occupancy (AOO) each estimated at just 12 km². Population modelling suggests a decline of approximately 80% since the 1980s, driven almost entirely by the conversion of lowland rainforest to smallholder agriculture (maize and cassava) and charcoal production (Tolley et al., 2021; IUCN, 2021). By 2019, the total forested area supporting this species had contracted to approximately 0.4 km², fragmented into five isolated patches ranging from 1 to 17 hectares. Critically, Rhampholeon species are strict forest specialists incapable of persisting in transformed habitats; where forest is lost, local populations are extirpated (Branch et al., 2014).

The species was feared extinct following an absence of scientific records between 1992 and 2016, until Tolley et al. (2021) confirmed extant populations in at least two higher-elevation patches of the Malawi Hills through targeted field surveys. This rediscovery, while encouraging, underscored the precarious state of remaining populations and the urgent need for spatial tools to guide conservation.

1.2 Rationale for Species Distribution Modelling

Species Distribution Models (SDMs) — also referred to as ecological niche models (ENMs) — are correlative models that relate known occurrence records to environmental variables to estimate the probability of habitat suitability across a landscape (Elith & Leathwick, 2009; Guisan et al., 2017). For a species as range-restricted as R. chapmanorum, SDMs serve several critical functions:

  • Identifying unsurveyed suitable habitat that may harbour undetected populations, supporting targeted field surveys.
  • Quantifying the sensitivity of habitat suitability to climate variables, thereby informing understanding of the species’ climatic niche.
  • Projecting suitability under future climate scenarios, enabling proactive conservation planning under climate change.
  • Prioritising spatial conservation actions, including protected area expansion and habitat restoration corridors.

This document presents a complete, reproducible SDM workflow for R. chapmanorum within the Nsanje District of Malawi, implemented in R using the tidysdm package (Leonardi et al., 2024), which integrates seamlessly with the tidymodels machine learning ecosystem. The workflow encompasses occurrence data acquisition and cleaning, pseudo-absence generation with bias correction, environmental variable selection and collinearity screening, multi-algorithm model training with spatial block cross-validation, ensemble construction, and present-day habitat suitability projection.


2 Methods

2.1 Software and Computational Environment

All analyses were conducted in R (R Core Team, 2024). The core SDM framework was implemented using tidysdm (Leonardi et al., 2024), which leverages the tidymodels ecosystem to provide a standardised, modular interface for fitting, tuning, and ensembling multiple machine learning algorithms. Spatial data were handled using the terra and sf packages; coordinate cleaning used CoordinateCleaner; and visualisations were produced with ggplot2 and tidyterra.

2.2 Loading Required Libraries

The following R packages are required to reproduce this analysis. The code block below loads all dependencies:

library(tidyverse)       # Data manipulation and visualisation
library(tidysdm)         # Species distribution modelling with tidymodels
library(tidymodels)      # Unified machine learning framework
library(tidyterra)       # ggplot2 methods for terra SpatRaster objects
library(terra)           # Spatial raster operations
library(CoordinateCleaner) # Automated geographic coordinate quality control
library(sf)              # Simple features for vector spatial data
library(overlapping)     # Kernel overlap statistics for niche comparisons
library(xgboost)         # Gradient boosted tree engine
library(ranger)          # Fast random forest engine
library(maxnet)          # MaxEnt implementation for R

2.3 Occurrence Data: Loading, Cleaning, and Spatial Filtering

2.3.1 Data Acquisition

Occurrence records for Rhampholeon chapmanorum were downloaded from the Global Biodiversity Information Facility (GBIF), the largest openly accessible repository of biodiversity occurrence data. GBIF aggregates digitised museum specimens, citizen science observations, and research-grade field records from thousands of institutions worldwide (GBIF.org, 2024). For a critically endangered, range-restricted species such as R. chapmanorum, GBIF records — though sparse — remain an essential foundation for SDM, as targeted survey data are rarely publicly available at scale.

distrib <- read_delim(
  "C:/Users/Hp User/Downloads/sdm/0009731-260120142942310/0009731-260120142942310.csv",
  col_types = cols(gbifID = col_character())
)

chameleon <- distrib %>%
  select(gbifID, decimalLatitude, decimalLongitude) %>%
  rename(
    ID        = gbifID,
    latitude  = decimalLatitude,
    longitude = decimalLongitude
  )

# Remove records with missing geographic coordinates
chameleon <- as_tibble(
  chameleon %>%
    filter(!is.na(latitude), !is.na(longitude))
)

2.3.2 Coordinate Cleaning

Raw GBIF downloads routinely contain erroneous records — coordinates placed at country centroids, institutional headquarters, or assigned default zero values (0°N, 0°E) — that do not reflect genuine species occurrences. Retaining such errors inflates apparent range sizes and corrupts the environmental signal that models are trained on. The CoordinateCleaner package (Zizka et al., 2019) provides automated flagging of these common spatial errors using a suite of geometric and geographic tests.

# Convert to Simple Features (sf) object and assign WGS84 coordinate system
chameleon <- st_as_sf(chameleon, coords = c("longitude", "latitude"))
st_crs(chameleon) <- 4326

2.3.3 Spatial Restriction to the Study Area

Because SDMs are sensitive to the geographic extent of the study area — which defines the environmental space accessible to the species and thus determines what constitutes a meaningful absence or background (Barve et al., 2011) — the analysis was restricted to the Nsanje District of southern Malawi. This represents the core of the species’ known range and ensures that background environmental sampling reflects the conditions the species has had a realistic opportunity to occupy.

The district boundary shapefile was loaded using the sf and terra packages. Occurrence points falling outside the Nsanje boundary polygon were removed using a spatial intersection, and the boundary was also retained as a SpatVector object for subsequent raster masking.

# Load Nsanje district boundary as an sf object
Nsanje_data_frame <- st_read(
  "C:/Users/Hp User/Downloads/sdm/Nsanje 1.2.shp"
)
## Reading layer `Nsanje 1.2' from data source 
##   `C:\Users\Hp User\Downloads\sdm\Nsanje 1.2.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 11 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 34.90631 ymin: -17.12721 xmax: 35.31261 ymax: -16.31085
## Geodetic CRS:  WGS 84
# Clip occurrence records to Nsanje district boundary
chameleon <- st_intersection(chameleon, Nsanje_data_frame)

# Load as SpatVector for terra raster operations
Nsanje <- vect("C:/Users/Hp User/Downloads/sdm/Nsanje 1.2.shp")

2.4 Environmental Data: Land Mask Construction

2.4.1 Bioclimatic Raster Stack

Environmental predictors were sourced from WorldClim version 2.1 (Fick & Hijmans, 2017), which provides globally consistent bioclimatic variables at 30 arc-second (~1 km) resolution derived from long-term monthly temperature and precipitation records. For this analysis, the ACCESS-CM2 global climate model output under the SSP2-4.5 intermediate emissions scenario for the 2041–2060 time slice was used both to generate the land mask and to provide the bioclimatic predictors for initial projections.

An initial land mask was created from the first layer of the raster stack by classifying all non-NA cells as valid land surface pixels. This mask defines the spatial extent within which pseudo-absence and background points are permitted to be sampled, avoiding the ocean and areas outside the modelling domain. The raster was then cropped and masked to the Nsanje district boundary.

Land_mask <- rast(
  "C:/Users/Hp User/Downloads/sdm/wc2.1_30s_bioc_ACCESS-CM2_ssp245_2041-2060.tif"
)

# Extract first layer to define valid land cells
base      <- Land_mask[[1]]
Land_mask <- ifel(!is.na(base), 1, NA)
## |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          |---------|---------|---------|---------|=========================================                                          
names(Land_mask) <- "Land_mask_2021/2024"

# Crop and mask to Nsanje district
Land_mask <- crop(Land_mask, Nsanje)
Land_mask <- mask(Land_mask, Nsanje)

# Map: occurrence points over land mask
ggplot() +
  geom_spatraster(data = Land_mask, aes(fill = `Land_mask_2021/2024`)) +
  geom_sf(data = Nsanje, fill = NA, color = "black", linewidth = 0.6) +
  geom_sf(data = chameleon, colour = "green", size = 1.8, alpha = 0.8) +
  guides(fill = "none") +
  scale_fill_viridis_c(na.value = "transparent", guide = "none", option = "C") +
  labs(
    title    = expression(italic("Rhampholeon chapmanorum") ~ "— occurrence records, Nsanje District"),
    subtitle = "Points show GBIF presence records retained after coordinate cleaning",
    caption  = "Data: GBIF; Basemap: WorldClim 2.1 / ACCESS-CM2 SSP2-4.5"
  ) +
  theme_minimal(base_size = 12)


2.5 Background and Pseudo-Absence Sampling

2.5.1 Initial Pseudo-Absence Generation

Because R. chapmanorum occurrence data consist entirely of presences (there are no confirmed absence records), the model must be trained using pseudo-absences or background points — locations at which the species was not recorded and which are assumed to differ ecologically from the presences. The choice of how, where, and how many pseudo-absences to generate is a critical methodological decision, as it substantially affects model predictions (Barbet-Massin et al., 2012; Wisz & Guisan, 2009).

An initial set of 1,000 pseudo-absences was generated randomly across the land mask using sample_pseudoabs(). These initial pseudo-absence records were subsequently rasterised to produce a sampling effort density surface, used in the next step to apply a spatial bias correction.

pseudo <- sample_pseudoabs(
  data        = chameleon,
  raster      = Land_mask,
  n           = 1000,
  method      = "random",
  class_label = "pseudoabs",
  return_pres = TRUE
)

# Rasterise pseudo-absence density for use in bias-corrected background sampling
pseudo_background_raster <- rasterize(pseudo, Land_mask, fun = "count")
pseudo_background_raster <- mask(pseudo_background_raster, Land_mask)

# Aggregate to slightly coarser resolution for visualisation
r      <- pseudo_background_raster
r_agg  <- aggregate(r, fact = 2, fun = sum, na.rm = TRUE)

global(r_agg, fun = c("min", "max", "sum", "mean"), na.rm = TRUE)
# Plot pseudo-absence density
df2          <- as.data.frame(r_agg, xy = TRUE, na.rm = FALSE)
names(df2)[3] <- "count"

ggplot(df2, aes(x = x, y = y, fill = count)) +
  geom_raster() +
  geom_sf(data = Nsanje, fill = NA, color = "black",
          linewidth = 0.6, inherit.aes = FALSE) +
  scale_fill_viridis_c(na.value = "transparent", name = "Count") +
  labs(
    title   = "Pseudo-absence sampling density raster",
    subtitle = "Aggregated point density surface (fact = 2) used for bias-corrected background sampling",
    caption  = "Higher cell values indicate greater pseudo-absence density"
  ) +
  theme_minimal(base_size = 12)

2.5.2 Bias-Corrected Background Sampling

A well-documented problem in SDMs built from museum and citizen-science occurrence data is spatial sampling bias: occurrences are not drawn randomly from the landscape but cluster around roads, settlements, and previously surveyed localities (Phillips et al., 2009; Kramer-Schadt et al., 2013). If background points are sampled uniformly at random, the model may learn to distinguish presence localities from the background on the basis of their accessibility rather than their ecological suitability — producing artificially inflated performance metrics and misleading habitat maps.

To address this, background points were drawn using the "bias" method in sample_background(), which weights sampling probability according to the pseudo-absence density raster. This ensures that background points share the same spatial bias distribution as the presence records, effectively making the model compare presences against a background with equivalent accessibility characteristics — the same logic underlying the target-group background approach of Phillips et al. (2009). The number of background points was set to three times the number of presence records, following common practice for machine learning algorithms (Barbet-Massin et al., 2012).

set.seed(1234567)

chameleon_thin <- sample_background(
  data        = chameleon,
  raster      = pseudo_background_raster,
  n           = 3 * nrow(chameleon),
  method      = "bias",
  class_label = "background",
  return_pres = TRUE
)

ggplot() +
  geom_spatraster(data = Land_mask,
                  aes(fill = `Land_mask_2021/2024`), interpolate = FALSE) +
  geom_sf(data = Nsanje, fill = NA, color = "black", linewidth = 0.6) +
  geom_sf(data = chameleon_thin, aes(col = class), size = 1.5, alpha = 0.75) +
  scale_fill_viridis_c(na.value = "transparent", guide = "none", option = "C") +
  scale_colour_manual(
    values = c("presence" = "#d73027", "background" = "#4575b4"),
    name   = "Record type"
  ) +
  labs(
    title    = "Bias-corrected background and presence points",
    subtitle = "Background points weighted by pseudo-absence density to match spatial sampling effort",
    caption  = "Red = presence; Blue = bias-corrected background"
  ) +
  theme_minimal(base_size = 12)


2.6 Bioclimatic Variable Extraction

2.6.1 Loading and Preparing the Full Bioclimatic Stack

The full 19-layer WorldClim 2.1 bioclimatic raster stack for the ACCESS-CM2 SSP2-4.5 (2041–2060) scenario was loaded, cropped to the Nsanje district, masked to land cells, reprojected to WGS84 (EPSG:4326), and layer names were standardised to the conventional bio1bio19 nomenclature.

climate_2021_2040 <- rast(
  "C:/Users/Hp User/Downloads/sdm/wc2.1_30s_bioc_ACCESS-CM2_ssp245_2041-2060.tif"
)

climate_2021_2040 <- crop(climate_2021_2040, Nsanje)
climate_2021_2040 <- mask(climate_2021_2040, Nsanje)
climate_2021_2040 <- terra::project(climate_2021_2040, y = "EPSG:4326")
names(climate_2021_2040) <- paste0("bio", seq_len(nlyr(climate_2021_2040)))

2.6.2 Extracting Climate Values at Occurrence and Background Points

Bioclimatic values were extracted at all presence and background point locations and appended as additional columns to the modelling dataset. This step links each point (presence or background) to its local environmental conditions — the primary input used by each modelling algorithm.

chameleon_thin <- chameleon_thin %>%
  bind_cols(
    terra::extract(climate_2021_2040, chameleon_thin, ID = FALSE)
  )

2.6.3 Checking for Missing Values

Before proceeding to model training, we confirmed that no missing values were present in the extracted climate data. Points located in raster cells flagged as NA (e.g., at the edges of the study area boundary) must be removed, as most modelling algorithms cannot tolerate incomplete predictor records.

summary(chameleon_thin)
##         class             geometry       bio1            bio2      
##  presence  : 8   POINT        :32   Min.   :24.00   Min.   :11.40  
##  background:24   epsg:4326    : 0   1st Qu.:26.55   1st Qu.:12.18  
##                  +proj=long...: 0   Median :27.50   Median :12.30  
##                                     Mean   :27.17   Mean   :12.55  
##                                     3rd Qu.:28.20   3rd Qu.:12.95  
##                                     Max.   :28.40   Max.   :13.70  
##       bio3            bio4            bio5            bio6      
##  Min.   :55.40   Min.   :261.2   Min.   :33.80   Min.   :14.80  
##  1st Qu.:56.20   1st Qu.:291.8   1st Qu.:37.08   1st Qu.:15.90  
##  Median :57.35   Median :301.2   Median :38.20   Median :16.45  
##  Mean   :57.37   Mean   :297.2   Mean   :38.11   Mean   :16.23  
##  3rd Qu.:58.52   3rd Qu.:305.7   3rd Qu.:39.35   3rd Qu.:16.60  
##  Max.   :59.90   Max.   :317.9   Max.   :39.80   Max.   :16.80  
##       bio7            bio8            bio9           bio10      
##  Min.   :19.00   Min.   :26.10   Min.   :21.00   Min.   :26.80  
##  1st Qu.:21.18   1st Qu.:28.70   1st Qu.:23.35   1st Qu.:29.70  
##  Median :21.70   Median :29.60   Median :24.50   Median :30.85  
##  Mean   :21.88   Mean   :29.32   Mean   :24.12   Mean   :30.45  
##  3rd Qu.:22.73   3rd Qu.:30.32   3rd Qu.:25.10   3rd Qu.:31.52  
##  Max.   :23.50   Max.   :30.60   Max.   :25.30   Max.   :31.80  
##      bio11           bio12            bio13           bio14       
##  Min.   :20.40   Min.   : 760.0   Min.   :168.0   Min.   : 6.000  
##  1st Qu.:22.45   1st Qu.: 798.5   1st Qu.:176.5   1st Qu.: 7.000  
##  Median :23.45   Median : 839.0   Median :184.0   Median : 8.000  
##  Mean   :23.14   Mean   : 859.2   Mean   :186.8   Mean   : 8.188  
##  3rd Qu.:23.93   3rd Qu.: 908.0   3rd Qu.:196.2   3rd Qu.: 9.000  
##  Max.   :24.30   Max.   :1142.0   Max.   :234.0   Max.   :11.000  
##      bio15            bio16           bio17           bio18      
##  Min.   : 91.70   Min.   :465.0   Min.   :27.00   Min.   :222.0  
##  1st Qu.: 95.97   1st Qu.:484.5   1st Qu.:30.75   1st Qu.:236.8  
##  Median : 96.65   Median :502.0   Median :32.00   Median :245.5  
##  Mean   : 96.67   Mean   :511.0   Mean   :34.53   Mean   :249.1  
##  3rd Qu.: 98.33   3rd Qu.:526.8   3rd Qu.:38.00   3rd Qu.:259.8  
##  Max.   :100.00   Max.   :641.0   Max.   :55.00   Max.   :316.0  
##      bio19      
##  Min.   :35.00  
##  1st Qu.:38.75  
##  Median :41.00  
##  Mean   :43.78  
##  3rd Qu.:48.00  
##  Max.   :70.00

2.7 Environmental Variable Selection

2.7.1 Comparing Presence and Background Environments

A fundamental assumption of SDMs is that presences and background points differ meaningfully in their environmental conditions — that is, the environments at known occurrence locations are not representative of random positions across the landscape. If presences and background overlap substantially in environmental space, the model has little signal to learn from. We first visualised this separation using violin plots, which display the full distribution of each bioclimatic variable across presence and background records.

chameleon_thin %>% plot_pres_vs_bg(class)

2.7.2 Quantifying Environmental Overlap

The degree of environmental separation was formally quantified by computing the overlap coefficient between the presence and background distributions for each variable. Low overlap values (approaching 0) indicate strong environmental differentiation and therefore high potential model signal; high overlap (approaching 1) suggests that the variable carries little discriminatory power for this species and study area.

knitr::kable(
  chameleon_thin %>% dist_pres_vs_bg(class),
  caption   = "**Table 1.** Kernel density overlap statistics between presence and background records for each bioclimatic variable. Lower values indicate greater environmental differentiation and stronger potential predictor signal.",
  col.names = c("Bioclimatic Variable", "Overlap Coefficient"),
  digits    = 4,
  align     = c("l", "r"),
  table.attr = 'class="table-styled"'
)
Table 1. Kernel density overlap statistics between presence and background records for each bioclimatic variable. Lower values indicate greater environmental differentiation and stronger potential predictor signal.
Bioclimatic Variable Overlap Coefficient
bio4 0.5979
bio2 0.5313
bio13 0.4721
bio16 0.4690
bio10 0.4666
bio1 0.4575
bio8 0.4503
bio9 0.4493
bio12 0.4477
bio18 0.4459
bio11 0.4422
bio7 0.4206
bio6 0.4032
bio5 0.3936
bio15 0.3535
bio19 0.3144
bio17 0.2459
bio14 0.2277
bio3 0.1434

2.7.3 Selecting Predictor Variables

Based on the overlap analysis and ecological reasoning, three bioclimatic variables were selected as predictors:

Variable Description Ecological Relevance for R. chapmanorum
BIO6 Minimum Temperature of the Coldest Month (°C × 10) Sets the cold-season thermal limit for ectotherms; cold tolerance constrains reptile range boundaries
BIO12 Annual Precipitation (mm) Captures overall moisture availability; lowland evergreen forest (the species’ obligate habitat) requires high, reliable annual rainfall
BIO15 Precipitation Seasonality (Coefficient of Variation) Measures precipitation variability across months; forest specialist species reliant on stable humidity are sensitive to high seasonality

Table 2. Selected bioclimatic predictors, their WorldClim definitions, and ecological justification for inclusion in the Rhampholeon chapmanorum SDM.

These three variables collectively capture the thermal and hydric conditions most relevant to a moisture-dependent, thermoconforming forest reptile in a sub-tropical African landscape.

suggested_vars <- c("bio12", "bio15", "bio6")

2.7.4 Collinearity Screening

High correlation between predictor variables causes multicollinearity, which can destabilise model coefficients, inflate variance, and introduce redundancy — particularly problematic for parametric models such as GLM (Dormann et al., 2013). A pairwise scatterplot matrix was produced to visually inspect correlations among the three selected variables. Variables exhibiting Pearson’s |r| > 0.70 should not be jointly included in the same model (Dormann et al., 2013). The variables retained here were selected to minimise redundancy while maximising biological coverage of the species’ niche.

pairs(climate_2021_2040[[suggested_vars]])


2.8 Model Specification and Workflow Construction

2.8.1 Recipe and Workflow Set

A tidymodels recipe was defined with class (presence/background) as the binary response variable and the three bioclimatic predictors as features. The workflow_set() function was used to construct a modelling grid combining this single preprocessing recipe with four machine learning algorithms, each suited to SDM applications:

Model tidysdm Function Algorithm Type Key Strengths
Generalised Linear Model (GLM) sdm_spec_glm() Parametric Interpretable; low variance; suitable for small datasets
Random Forest (RF) sdm_spec_rf() Ensemble tree Robust to overfitting; handles non-linearity and interactions; minimal tuning
Gradient Boosted Tree (GBM/XGBoost) sdm_spec_boost_tree() Boosted ensemble High predictive power; captures complex response surfaces
Maximum Entropy (MaxEnt) sdm_spec_maxent() Presence-background Designed specifically for presence-only/background data; widely validated in SDM literature

Table 3. Machine learning algorithms included in the tidysdm ensemble workflow, with their tidysdm specification functions and ecological modelling characteristics.

Using multiple algorithms and combining their predictions into an ensemble is best practice in modern SDM (Araújo & New, 2007), as it averages out the idiosyncratic biases of individual algorithms and produces more robust suitability maps than any single model.

chameleon_rec <- recipe(chameleon_thin, formula = class ~ .)
chameleon_rec

chameleon_thin %>% check_sdm_presence(class)
## [1] TRUE
# Build a workflow set: one recipe × four algorithms
chameleon_model <- workflow_set(
  preproc = list(default = chameleon_rec),
  models  = list(
    glm    = sdm_spec_glm(),
    rf     = sdm_spec_rf(),
    gdm    = sdm_spec_boost_tree(),
    maxent = sdm_spec_maxent()
  ),
  cross = TRUE
) %>%
  option_add(control = control_ensemble_grid())

2.9 Spatial Block Cross-Validation

2.9.1 Rationale for Spatial Blocking

Model evaluation via random cross-validation (randomly partitioning the dataset into training and test folds) is inappropriate for spatially structured ecological data. Because occurrence records and their associated environmental values are spatially autocorrelated — nearby locations tend to be more similar than distant ones — random folds allow information from spatially proximate training points to leak into the test fold. This spatial leakage leads to overly optimistic performance estimates that do not generalise to independent spatial regions (Roberts et al., 2017; Valavi et al., 2019).

Spatial block cross-validation addresses this by partitioning the data into geographically separated folds, ensuring that training and test sets are sufficiently distant in space to approximate independence (Roberts et al., 2017). The resulting performance metrics are more conservative but better reflect true out-of-sample predictive ability across unsurveyed areas — which is the ultimate goal of SDM for conservation planning.

Five spatial folds (v = 5) were created using spatial_block_cv() from tidysdm. The block structure was visualised using autoplot() to confirm adequate spatial separation.

set.seed(105)
chameleon_cv <- spatial_block_cv(chameleon_thin, v = 5)
autoplot(chameleon_cv)


2.10 Model Tuning and Training

2.10.1 Hyperparameter Tuning via Grid Search

Each algorithm in the workflow set was tuned using workflow_map() with tune_grid(), evaluating a grid of five candidate hyperparameter combinations per algorithm across the five spatial block cross-validation folds. Tuning identifies the hyperparameter configuration (e.g., number of trees, minimum node size, regularisation strength) that maximises predictive performance on withheld spatial test folds.

Model performance was assessed using the SDM metric set implemented in tidysdm, which includes the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) — the primary discrimination metric for binary SDMs — as well as the Boyce Continuous Index and the True Skill Statistic (TSS). AUC-ROC measures the probability that the model assigns a higher suitability value to a random presence than to a random background point, with values approaching 1.0 indicating near-perfect discrimination.

set.seed(1234567)

chameleon_model <- chameleon_model %>%
  workflow_map(
    "tune_grid",
    resamples = chameleon_cv,
    grid      = 5,
    metrics   = sdm_metric_set(),
    verbose   = TRUE
  )

# Visualise model tuning performance across folds and hyperparameter grids
autoplot(chameleon_model)


2.11 Ensemble Construction

2.11.1 Building the Ensemble Model

Following tuning, the best-performing hyperparameter configuration for each algorithm was selected and combined into a simple ensemble using simple_ensemble(). As described by Leonardi et al. (2024), simple_ensemble() retains the single best version of each algorithm — the configuration with the highest value of the chosen metric (roc_auc here) across cross-validation folds — and stores them as ensemble members.

Ensemble approaches are strongly preferred over single-algorithm SDMs because they average across the different response-curve shapes and extrapolation behaviours of individual algorithms, reducing the influence of any one model’s assumptions and increasing the robustness and spatial coherence of suitability predictions (Araújo & New, 2007; Thuiller et al., 2009).

chameleon_ensemble <- simple_ensemble() %>%
  add_member(chameleon_model, metric = "roc_auc")

chameleon_ensemble
# Visualise ensemble member performance
autoplot(chameleon_ensemble)

2.11.2 Ensemble Performance Metrics

The following table summarises the cross-validated performance metrics for each algorithm in the ensemble. AUC-ROC is the primary criterion: values above 0.7 are considered acceptable, above 0.8 good, and above 0.9 excellent for habitat suitability modelling (Swets, 1988; Araújo et al., 2005). Members falling below a threshold of 0.5 (no better than random) are excluded from the final ensemble in subsequent projection steps.

knitr::kable(
  chameleon_ensemble %>% collect_metrics(),
  caption    = "**Table 4.** Cross-validated performance metrics for each algorithm in the ensemble. AUC-ROC is the primary discrimination metric; higher values indicate better model performance. Metrics were computed across five spatially blocked cross-validation folds.",
  digits     = 4,
  align      = c("l", "l", "r", "r", "r"),
  table.attr = 'class="table-styled"'
)
Table 4. Cross-validated performance metrics for each algorithm in the ensemble. AUC-ROC is the primary discrimination metric; higher values indicate better model performance. Metrics were computed across five spatially blocked cross-validation folds.
wflow_id .metric mean std_err n
default_glm boyce_cont 0.2053 0.7947 2
default_glm roc_auc 0.8056 0.1002 3
default_glm tss_max 0.5278 0.2373 3
default_rf boyce_cont 0.5078 0.3374 2
default_rf roc_auc 0.8611 0.0735 3
default_rf tss_max 0.7500 0.1443 3
default_gdm boyce_cont 0.5176 0.4824 2
default_gdm roc_auc 0.7500 0.1250 3
default_gdm tss_max 0.5833 0.2205 3
default_maxent boyce_cont -0.5886 0.1386 2
default_maxent roc_auc 0.9444 0.0556 3
default_maxent tss_max 0.8611 0.0735 3

2.12 Habitat Suitability Projection

2.12.1 Full Ensemble Projection

The ensemble was used to predict habitat suitability across the entire Nsanje raster domain using predict_raster(). The output is a continuous suitability surface in which cell values reflect the mean predicted suitability across all ensemble members, ranging from 0 (unsuitable) to 1 (highly suitable). High-suitability areas represent climatic conditions most similar to those at known R. chapmanorum presence locations.

prediction_2041_2060 <- predict_raster(chameleon_ensemble, climate_2021_2040)

ggplot() +
  geom_spatraster(data = prediction_2041_2060, aes(fill = mean)) +
  scale_fill_terrain_c(name = "Suitability") +
  geom_sf(
    data   = chameleon_thin %>% filter(class == "presence"),
    colour = "#d73027", size = 1.5, alpha = 0.8
  ) +
  labs(
    title    = expression("Predicted habitat suitability — " * italic("Rhampholeon chapmanorum")),
    subtitle = "Full ensemble mean prediction | WorldClim ACCESS-CM2 SSP2-4.5 (2041–2060)",
    caption  = "Red points = known presence records; colour scale = ensemble mean suitability (0–1)"
  ) +
  theme_minimal(base_size = 12)

2.12.2 Best-Model Filtered Ensemble Projection

To produce a more conservative and robust suitability map, the ensemble was filtered to retain only models exceeding an AUC-ROC threshold of 0.5 — removing members performing no better than random. The median suitability across retained members was then predicted using predict_raster(). Filtering on a performance threshold improves ensemble quality by excluding poorly calibrated models that would otherwise downweight or distort the spatial predictions (Thuiller et al., 2009; Leonardi et al., 2024).

prediction_filtered <- predict_raster(
  chameleon_ensemble,
  climate_2021_2040,
  metric_thresh = c("roc_auc", 0.5),
  fun           = "median"
)

ggplot() +
  geom_spatraster(data = prediction_filtered, aes(fill = median)) +
  scale_fill_terrain_c(name = "Suitability") +
  geom_sf(
    data   = chameleon_thin %>% filter(class == "presence"),
    colour = "#d73027", size = 1.5, alpha = 0.8
  ) +
  labs(
    title    = expression("Filtered ensemble suitability — " * italic("Rhampholeon chapmanorum")),
    subtitle = "Ensemble median (AUC-ROC > 0.5 members only) | ACCESS-CM2 SSP2-4.5 (2041–2060)",
    caption  = "Red points = known presence records; colour scale = ensemble median suitability (0–1)"
  ) +
  theme_minimal(base_size = 12)


3 Discussion

This analysis represents the first reproducible, multi-algorithm SDM workflow for Rhampholeon chapmanorum, incorporating contemporary best practices in spatial modelling. Several methodological choices merit brief discussion in the context of this species.

Spatial bias correction is particularly critical for R. chapmanorum, given that GBIF occurrence records are almost certainly clustered near road-accessible forest edges and surveyed patches, rather than uniformly sampled across the landscape. Failure to account for this bias could cause models to conflate accessible terrain with suitable habitat, artificially elevating suitability predictions near roads and settlements. The bias-weighting approach applied here explicitly mirrors the sampling effort distribution across presence and background points, reducing this confound (Phillips et al., 2009; Kramer-Schadt et al., 2013).

Variable selection was guided by the ecological biology of the species. Rhampholeon chapmanorum is an obligate forest-floor specialist; its persistence in any locality is contingent on the presence of closed-canopy lowland evergreen forest, which is itself constrained by sufficient annual rainfall (BIO12), relatively stable year-round precipitation (low BIO15), and mild minimum temperatures during the coldest season (BIO6). These three variables therefore capture the primary climatic axes of the species’ niche without introducing redundancy.

Spatial block cross-validation provides far more honest assessments of transferability than random cross-validation for a range-restricted species. Because the species occupies only a handful of small forest patches, random folds would almost certainly permit spatial leakage, inflating AUC-ROC. Spatially blocked evaluation penalises models that rely on spatial proximity rather than genuine environmental signal.

Limitations of the current analysis include the very small number of GBIF occurrence records available for this Critically Endangered species, which constrains model training and may increase sensitivity to individual data points. Future work should integrate field survey data from Tolley et al. (2021) and any subsequent surveys, apply spatial thinning to reduce spatial autocorrelation in the presence data, and project suitability under multiple GCMs and emission scenarios to quantify climate change uncertainty. Landscape connectivity analyses could further identify potential habitat restoration corridors linking the isolated forest patches in which this species persists.


4 Conclusion

This document presents a fully reproducible, best-practice SDM workflow for the Critically Endangered Chapman’s Pygmy Chameleon, Rhampholeon chapmanorum, implemented in R using tidysdm. The workflow integrates GBIF occurrence data cleaning, bias-corrected background sampling, ecologically motivated variable selection, spatial block cross-validation, multi-algorithm ensemble modelling, and present-day habitat suitability projection. The resulting maps identify areas of climatically suitable habitat within the Nsanje District that may warrant targeted field surveys and, in turn, urgent conservation action to halt forest loss and facilitate habitat connectivity for this irreplaceable endemic species.


5 References

Araújo, M.B., & New, M. (2007). Ensemble forecasting of species distributions. Trends in Ecology & Evolution, 22(1), 42–47. https://doi.org/10.1016/j.tree.2006.09.010

Araújo, M.B., Pearson, R.G., Thuiller, W., & Erhard, M. (2005). Validation of species–climate impact models under climate change. Global Change Biology, 11(9), 1504–1513.

Barbet-Massin, M., Jiguet, F., Albert, C.H., & Thuiller, W. (2012). Selecting pseudo-absences for species distribution models: how, where and how many? Methods in Ecology and Evolution, 3(2), 327–338. https://doi.org/10.1111/j.2041-210X.2011.00172.x

Barve, N., Barve, V., Jiménez-Valverde, A., et al. (2011). The crucial role of the accessible area in ecological niche modeling and species distribution modeling. Ecological Modelling, 222(11), 1810–1819.

Branch, W.R., Bayliss, J., & Tolley, K.A. (2014). Rhampholeon chapmanorum. In: IUCN 2014. IUCN Red List of Threatened Species. Version 2014.3. www.iucnredlist.org.

Dormann, C.F., Elith, J., Bacher, S., et al. (2013). Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 27–46.

Elith, J., & Leathwick, J.R. (2009). Species distribution models: ecological explanation and prediction across space and time. Annual Review of Ecology, Evolution, and Systematics, 40, 677–697.

Fick, S.E., & Hijmans, R.J. (2017). WorldClim 2: new 1-km spatial resolution climate surfaces for global land areas. International Journal of Climatology, 37(12), 4302–4315.

GBIF.org (2024). GBIF Occurrence Download. https://doi.org/10.15468/dl.XXXXXX

Guisan, A., Thuiller, W., & Zimmermann, N.E. (2017). Habitat Suitability and Distribution Models: With Applications in R. Cambridge University Press.

IUCN (2021). Rhampholeon chapmanorum. The IUCN Red List of Threatened Species 2021: e.T197246585A197246588. https://dx.doi.org/10.2305/IUCN.UK.2021-2.RLTS.T197246585A197246588.en

Kramer-Schadt, S., Niedballa, J., Pilgrim, J.D., et al. (2013). The importance of correcting for sampling bias in MaxEnt species distribution models. Diversity and Distributions, 19(11), 1366–1379.

Leonardi, M., Colucci, E., et al. (2024). tidysdm: Leveraging the flexibility of tidymodels for species distribution modelling in R. Methods in Ecology and Evolution, 15, 2133–2140. https://doi.org/10.1111/2041-210X.14406

Phillips, S.J., Dudík, M., Elith, J., et al. (2009). Sample selection bias and presence-only distribution models: implications for background and pseudo-absence data. Ecological Applications, 19(1), 181–197.

Roberts, D.R., Bahn, V., Ciuti, S., et al. (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, 40(8), 913–929.

Swets, J.A. (1988). Measuring the accuracy of diagnostic systems. Science, 240(4857), 1285–1293.

Thuiller, W., Lafourcade, B., Engler, R., & Araújo, M.B. (2009). BIOMOD – a platform for ensemble forecasting of species distributions. Ecography, 32(3), 369–373.

Tilbury, C.R. (1992). Two new dwarf chameleons (Sauria: Rhampholeon) from Malawi, Central Africa, with notes on the ecology of the genus. Tropical Zoology, 5, 197–220.

Tolley, K.A., Tilbury, C.R., Da Silva, J.M., et al. (2021). Clinging to survival: Critically Endangered Chapman’s pygmy chameleon Rhampholeon chapmanorum persists in shrinking forest patches. Oryx, 56(3), 455–461. https://doi.org/10.1017/S0030605320000952

Valavi, R., Elith, J., Lahoz-Monfort, J.J., & Guillera-Arroita, G. (2019). blockCV: An R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods in Ecology and Evolution, 10(2), 225–232.

Wisz, M.S., & Guisan, A. (2009). Do pseudo-absence selection strategies influence species distribution models and their predictions? An information-theoretic approach based on simulated data. BMC Ecology, 9, 8.

Zizka, A., Silvestro, D., Andermann, T., et al. (2019). CoordinateCleaner: Standardized cleaning of occurrence records from biological collection databases. Methods in Ecology and Evolution, 10(5), 744–751.


Analysis conducted by Luther Mwalyambwile. Rendered with R Markdown. All code is fully reproducible given access to the source occurrence and raster data.