Rhampholeon chapmanorum (Tilbury, 1992), commonly known as Chapman’s Pygmy Chameleon, is a diminutive forest-floor reptile endemic to the lowland evergreen rainforests of the Natundu (Malawi) Hills in southern Malawi. Growing to a maximum total length of approximately 65 mm, the species is distinguished by a prominent rostral appendage, stellate granular scalation, and a compressed leaf-like body form that renders it near-invisible among the leaf litter it inhabits (Tilbury, 1992). Unlike the majority of chameleons, which are arboreal, R. chapmanorum is largely terrestrial, resting low in the understorey vegetation at night and moving through the forest floor by day.
The species is listed as Critically Endangered (CR) on the IUCN Red List (criteria B1ab(i,ii,iii,iv,v); assessed 2021), with an extent of occurrence (EOO) and area of occupancy (AOO) each estimated at just 12 km². Population modelling suggests a decline of approximately 80% since the 1980s, driven almost entirely by the conversion of lowland rainforest to smallholder agriculture (maize and cassava) and charcoal production (Tolley et al., 2021; IUCN, 2021). By 2019, the total forested area supporting this species had contracted to approximately 0.4 km², fragmented into five isolated patches ranging from 1 to 17 hectares. Critically, Rhampholeon species are strict forest specialists incapable of persisting in transformed habitats; where forest is lost, local populations are extirpated (Branch et al., 2014).
The species was feared extinct following an absence of scientific records between 1992 and 2016, until Tolley et al. (2021) confirmed extant populations in at least two higher-elevation patches of the Malawi Hills through targeted field surveys. This rediscovery, while encouraging, underscored the precarious state of remaining populations and the urgent need for spatial tools to guide conservation.
Species Distribution Models (SDMs) — also referred to as ecological niche models (ENMs) — are correlative models that relate known occurrence records to environmental variables to estimate the probability of habitat suitability across a landscape (Elith & Leathwick, 2009; Guisan et al., 2017). For a species as range-restricted as R. chapmanorum, SDMs serve several critical functions:
This document presents a complete, reproducible SDM workflow for
R. chapmanorum within the Nsanje District of Malawi,
implemented in R using the tidysdm package (Leonardi et
al., 2024), which integrates seamlessly with the
tidymodels machine learning ecosystem. The workflow
encompasses occurrence data acquisition and cleaning, pseudo-absence
generation with bias correction, environmental variable selection and
collinearity screening, multi-algorithm model training with spatial
block cross-validation, ensemble construction, and present-day habitat
suitability projection.
All analyses were conducted in R (R Core Team, 2024). The core SDM
framework was implemented using tidysdm (Leonardi et
al., 2024), which leverages the tidymodels ecosystem
to provide a standardised, modular interface for fitting, tuning, and
ensembling multiple machine learning algorithms. Spatial data were
handled using the terra and sf packages;
coordinate cleaning used CoordinateCleaner; and
visualisations were produced with ggplot2 and
tidyterra.
The following R packages are required to reproduce this analysis. The code block below loads all dependencies:
library(tidyverse) # Data manipulation and visualisation
library(tidysdm) # Species distribution modelling with tidymodels
library(tidymodels) # Unified machine learning framework
library(tidyterra) # ggplot2 methods for terra SpatRaster objects
library(terra) # Spatial raster operations
library(CoordinateCleaner) # Automated geographic coordinate quality control
library(sf) # Simple features for vector spatial data
library(overlapping) # Kernel overlap statistics for niche comparisons
library(xgboost) # Gradient boosted tree engine
library(ranger) # Fast random forest engine
library(maxnet) # MaxEnt implementation for ROccurrence records for Rhampholeon chapmanorum were downloaded from the Global Biodiversity Information Facility (GBIF), the largest openly accessible repository of biodiversity occurrence data. GBIF aggregates digitised museum specimens, citizen science observations, and research-grade field records from thousands of institutions worldwide (GBIF.org, 2024). For a critically endangered, range-restricted species such as R. chapmanorum, GBIF records — though sparse — remain an essential foundation for SDM, as targeted survey data are rarely publicly available at scale.
distrib <- read_delim(
"C:/Users/Hp User/Downloads/sdm/0009731-260120142942310/0009731-260120142942310.csv",
col_types = cols(gbifID = col_character())
)
chameleon <- distrib %>%
select(gbifID, decimalLatitude, decimalLongitude) %>%
rename(
ID = gbifID,
latitude = decimalLatitude,
longitude = decimalLongitude
)
# Remove records with missing geographic coordinates
chameleon <- as_tibble(
chameleon %>%
filter(!is.na(latitude), !is.na(longitude))
)Raw GBIF downloads routinely contain erroneous records — coordinates
placed at country centroids, institutional headquarters, or assigned
default zero values (0°N, 0°E) — that do not reflect genuine species
occurrences. Retaining such errors inflates apparent range sizes and
corrupts the environmental signal that models are trained on. The
CoordinateCleaner package (Zizka et al., 2019)
provides automated flagging of these common spatial errors using a suite
of geometric and geographic tests.
Because SDMs are sensitive to the geographic extent of the study area — which defines the environmental space accessible to the species and thus determines what constitutes a meaningful absence or background (Barve et al., 2011) — the analysis was restricted to the Nsanje District of southern Malawi. This represents the core of the species’ known range and ensures that background environmental sampling reflects the conditions the species has had a realistic opportunity to occupy.
The district boundary shapefile was loaded using the sf
and terra packages. Occurrence points falling outside the
Nsanje boundary polygon were removed using a spatial intersection, and
the boundary was also retained as a SpatVector object for
subsequent raster masking.
# Load Nsanje district boundary as an sf object
Nsanje_data_frame <- st_read(
"C:/Users/Hp User/Downloads/sdm/Nsanje 1.2.shp"
)## Reading layer `Nsanje 1.2' from data source
## `C:\Users\Hp User\Downloads\sdm\Nsanje 1.2.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 11 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 34.90631 ymin: -17.12721 xmax: 35.31261 ymax: -16.31085
## Geodetic CRS: WGS 84
# Clip occurrence records to Nsanje district boundary
chameleon <- st_intersection(chameleon, Nsanje_data_frame)
# Load as SpatVector for terra raster operations
Nsanje <- vect("C:/Users/Hp User/Downloads/sdm/Nsanje 1.2.shp")Environmental predictors were sourced from WorldClim version 2.1 (Fick & Hijmans, 2017), which provides globally consistent bioclimatic variables at 30 arc-second (~1 km) resolution derived from long-term monthly temperature and precipitation records. For this analysis, the ACCESS-CM2 global climate model output under the SSP2-4.5 intermediate emissions scenario for the 2041–2060 time slice was used both to generate the land mask and to provide the bioclimatic predictors for initial projections.
An initial land mask was created from the first layer of the raster
stack by classifying all non-NA cells as valid land surface
pixels. This mask defines the spatial extent within which pseudo-absence
and background points are permitted to be sampled, avoiding the ocean
and areas outside the modelling domain. The raster was then cropped and
masked to the Nsanje district boundary.
Land_mask <- rast(
"C:/Users/Hp User/Downloads/sdm/wc2.1_30s_bioc_ACCESS-CM2_ssp245_2041-2060.tif"
)
# Extract first layer to define valid land cells
base <- Land_mask[[1]]
Land_mask <- ifel(!is.na(base), 1, NA)## |---------|---------|---------|---------|========================================= |---------|---------|---------|---------|========================================= |---------|---------|---------|---------|========================================= |---------|---------|---------|---------|=========================================
names(Land_mask) <- "Land_mask_2021/2024"
# Crop and mask to Nsanje district
Land_mask <- crop(Land_mask, Nsanje)
Land_mask <- mask(Land_mask, Nsanje)
# Map: occurrence points over land mask
ggplot() +
geom_spatraster(data = Land_mask, aes(fill = `Land_mask_2021/2024`)) +
geom_sf(data = Nsanje, fill = NA, color = "black", linewidth = 0.6) +
geom_sf(data = chameleon, colour = "green", size = 1.8, alpha = 0.8) +
guides(fill = "none") +
scale_fill_viridis_c(na.value = "transparent", guide = "none", option = "C") +
labs(
title = expression(italic("Rhampholeon chapmanorum") ~ "— occurrence records, Nsanje District"),
subtitle = "Points show GBIF presence records retained after coordinate cleaning",
caption = "Data: GBIF; Basemap: WorldClim 2.1 / ACCESS-CM2 SSP2-4.5"
) +
theme_minimal(base_size = 12)Because R. chapmanorum occurrence data consist entirely of presences (there are no confirmed absence records), the model must be trained using pseudo-absences or background points — locations at which the species was not recorded and which are assumed to differ ecologically from the presences. The choice of how, where, and how many pseudo-absences to generate is a critical methodological decision, as it substantially affects model predictions (Barbet-Massin et al., 2012; Wisz & Guisan, 2009).
An initial set of 1,000 pseudo-absences was generated randomly across
the land mask using sample_pseudoabs(). These initial
pseudo-absence records were subsequently rasterised to produce a
sampling effort density surface, used in the next step
to apply a spatial bias correction.
pseudo <- sample_pseudoabs(
data = chameleon,
raster = Land_mask,
n = 1000,
method = "random",
class_label = "pseudoabs",
return_pres = TRUE
)
# Rasterise pseudo-absence density for use in bias-corrected background sampling
pseudo_background_raster <- rasterize(pseudo, Land_mask, fun = "count")
pseudo_background_raster <- mask(pseudo_background_raster, Land_mask)
# Aggregate to slightly coarser resolution for visualisation
r <- pseudo_background_raster
r_agg <- aggregate(r, fact = 2, fun = sum, na.rm = TRUE)
global(r_agg, fun = c("min", "max", "sum", "mean"), na.rm = TRUE)# Plot pseudo-absence density
df2 <- as.data.frame(r_agg, xy = TRUE, na.rm = FALSE)
names(df2)[3] <- "count"
ggplot(df2, aes(x = x, y = y, fill = count)) +
geom_raster() +
geom_sf(data = Nsanje, fill = NA, color = "black",
linewidth = 0.6, inherit.aes = FALSE) +
scale_fill_viridis_c(na.value = "transparent", name = "Count") +
labs(
title = "Pseudo-absence sampling density raster",
subtitle = "Aggregated point density surface (fact = 2) used for bias-corrected background sampling",
caption = "Higher cell values indicate greater pseudo-absence density"
) +
theme_minimal(base_size = 12)A well-documented problem in SDMs built from museum and citizen-science occurrence data is spatial sampling bias: occurrences are not drawn randomly from the landscape but cluster around roads, settlements, and previously surveyed localities (Phillips et al., 2009; Kramer-Schadt et al., 2013). If background points are sampled uniformly at random, the model may learn to distinguish presence localities from the background on the basis of their accessibility rather than their ecological suitability — producing artificially inflated performance metrics and misleading habitat maps.
To address this, background points were drawn using the
"bias" method in
sample_background(), which weights sampling probability
according to the pseudo-absence density raster. This ensures that
background points share the same spatial bias distribution as the
presence records, effectively making the model compare presences against
a background with equivalent accessibility characteristics — the same
logic underlying the target-group background approach of Phillips et
al. (2009). The number of background points was set to three times
the number of presence records, following common practice for machine
learning algorithms (Barbet-Massin et al., 2012).
set.seed(1234567)
chameleon_thin <- sample_background(
data = chameleon,
raster = pseudo_background_raster,
n = 3 * nrow(chameleon),
method = "bias",
class_label = "background",
return_pres = TRUE
)
ggplot() +
geom_spatraster(data = Land_mask,
aes(fill = `Land_mask_2021/2024`), interpolate = FALSE) +
geom_sf(data = Nsanje, fill = NA, color = "black", linewidth = 0.6) +
geom_sf(data = chameleon_thin, aes(col = class), size = 1.5, alpha = 0.75) +
scale_fill_viridis_c(na.value = "transparent", guide = "none", option = "C") +
scale_colour_manual(
values = c("presence" = "#d73027", "background" = "#4575b4"),
name = "Record type"
) +
labs(
title = "Bias-corrected background and presence points",
subtitle = "Background points weighted by pseudo-absence density to match spatial sampling effort",
caption = "Red = presence; Blue = bias-corrected background"
) +
theme_minimal(base_size = 12)The full 19-layer WorldClim 2.1 bioclimatic raster stack for the
ACCESS-CM2 SSP2-4.5 (2041–2060) scenario was loaded, cropped to the
Nsanje district, masked to land cells, reprojected to WGS84 (EPSG:4326),
and layer names were standardised to the conventional
bio1–bio19 nomenclature.
climate_2021_2040 <- rast(
"C:/Users/Hp User/Downloads/sdm/wc2.1_30s_bioc_ACCESS-CM2_ssp245_2041-2060.tif"
)
climate_2021_2040 <- crop(climate_2021_2040, Nsanje)
climate_2021_2040 <- mask(climate_2021_2040, Nsanje)
climate_2021_2040 <- terra::project(climate_2021_2040, y = "EPSG:4326")
names(climate_2021_2040) <- paste0("bio", seq_len(nlyr(climate_2021_2040)))Bioclimatic values were extracted at all presence and background point locations and appended as additional columns to the modelling dataset. This step links each point (presence or background) to its local environmental conditions — the primary input used by each modelling algorithm.
Before proceeding to model training, we confirmed that no missing
values were present in the extracted climate data. Points located in
raster cells flagged as NA (e.g., at the edges of the study
area boundary) must be removed, as most modelling algorithms cannot
tolerate incomplete predictor records.
## class geometry bio1 bio2
## presence : 8 POINT :32 Min. :24.00 Min. :11.40
## background:24 epsg:4326 : 0 1st Qu.:26.55 1st Qu.:12.18
## +proj=long...: 0 Median :27.50 Median :12.30
## Mean :27.17 Mean :12.55
## 3rd Qu.:28.20 3rd Qu.:12.95
## Max. :28.40 Max. :13.70
## bio3 bio4 bio5 bio6
## Min. :55.40 Min. :261.2 Min. :33.80 Min. :14.80
## 1st Qu.:56.20 1st Qu.:291.8 1st Qu.:37.08 1st Qu.:15.90
## Median :57.35 Median :301.2 Median :38.20 Median :16.45
## Mean :57.37 Mean :297.2 Mean :38.11 Mean :16.23
## 3rd Qu.:58.52 3rd Qu.:305.7 3rd Qu.:39.35 3rd Qu.:16.60
## Max. :59.90 Max. :317.9 Max. :39.80 Max. :16.80
## bio7 bio8 bio9 bio10
## Min. :19.00 Min. :26.10 Min. :21.00 Min. :26.80
## 1st Qu.:21.18 1st Qu.:28.70 1st Qu.:23.35 1st Qu.:29.70
## Median :21.70 Median :29.60 Median :24.50 Median :30.85
## Mean :21.88 Mean :29.32 Mean :24.12 Mean :30.45
## 3rd Qu.:22.73 3rd Qu.:30.32 3rd Qu.:25.10 3rd Qu.:31.52
## Max. :23.50 Max. :30.60 Max. :25.30 Max. :31.80
## bio11 bio12 bio13 bio14
## Min. :20.40 Min. : 760.0 Min. :168.0 Min. : 6.000
## 1st Qu.:22.45 1st Qu.: 798.5 1st Qu.:176.5 1st Qu.: 7.000
## Median :23.45 Median : 839.0 Median :184.0 Median : 8.000
## Mean :23.14 Mean : 859.2 Mean :186.8 Mean : 8.188
## 3rd Qu.:23.93 3rd Qu.: 908.0 3rd Qu.:196.2 3rd Qu.: 9.000
## Max. :24.30 Max. :1142.0 Max. :234.0 Max. :11.000
## bio15 bio16 bio17 bio18
## Min. : 91.70 Min. :465.0 Min. :27.00 Min. :222.0
## 1st Qu.: 95.97 1st Qu.:484.5 1st Qu.:30.75 1st Qu.:236.8
## Median : 96.65 Median :502.0 Median :32.00 Median :245.5
## Mean : 96.67 Mean :511.0 Mean :34.53 Mean :249.1
## 3rd Qu.: 98.33 3rd Qu.:526.8 3rd Qu.:38.00 3rd Qu.:259.8
## Max. :100.00 Max. :641.0 Max. :55.00 Max. :316.0
## bio19
## Min. :35.00
## 1st Qu.:38.75
## Median :41.00
## Mean :43.78
## 3rd Qu.:48.00
## Max. :70.00
A fundamental assumption of SDMs is that presences and background points differ meaningfully in their environmental conditions — that is, the environments at known occurrence locations are not representative of random positions across the landscape. If presences and background overlap substantially in environmental space, the model has little signal to learn from. We first visualised this separation using violin plots, which display the full distribution of each bioclimatic variable across presence and background records.
The degree of environmental separation was formally quantified by computing the overlap coefficient between the presence and background distributions for each variable. Low overlap values (approaching 0) indicate strong environmental differentiation and therefore high potential model signal; high overlap (approaching 1) suggests that the variable carries little discriminatory power for this species and study area.
knitr::kable(
chameleon_thin %>% dist_pres_vs_bg(class),
caption = "**Table 1.** Kernel density overlap statistics between presence and background records for each bioclimatic variable. Lower values indicate greater environmental differentiation and stronger potential predictor signal.",
col.names = c("Bioclimatic Variable", "Overlap Coefficient"),
digits = 4,
align = c("l", "r"),
table.attr = 'class="table-styled"'
)| Bioclimatic Variable | Overlap Coefficient |
|---|---|
| bio4 | 0.5979 |
| bio2 | 0.5313 |
| bio13 | 0.4721 |
| bio16 | 0.4690 |
| bio10 | 0.4666 |
| bio1 | 0.4575 |
| bio8 | 0.4503 |
| bio9 | 0.4493 |
| bio12 | 0.4477 |
| bio18 | 0.4459 |
| bio11 | 0.4422 |
| bio7 | 0.4206 |
| bio6 | 0.4032 |
| bio5 | 0.3936 |
| bio15 | 0.3535 |
| bio19 | 0.3144 |
| bio17 | 0.2459 |
| bio14 | 0.2277 |
| bio3 | 0.1434 |
Based on the overlap analysis and ecological reasoning, three bioclimatic variables were selected as predictors:
| Variable | Description | Ecological Relevance for R. chapmanorum |
|---|---|---|
| BIO6 | Minimum Temperature of the Coldest Month (°C × 10) | Sets the cold-season thermal limit for ectotherms; cold tolerance constrains reptile range boundaries |
| BIO12 | Annual Precipitation (mm) | Captures overall moisture availability; lowland evergreen forest (the species’ obligate habitat) requires high, reliable annual rainfall |
| BIO15 | Precipitation Seasonality (Coefficient of Variation) | Measures precipitation variability across months; forest specialist species reliant on stable humidity are sensitive to high seasonality |
Table 2. Selected bioclimatic predictors, their WorldClim definitions, and ecological justification for inclusion in the Rhampholeon chapmanorum SDM.
These three variables collectively capture the thermal and hydric conditions most relevant to a moisture-dependent, thermoconforming forest reptile in a sub-tropical African landscape.
High correlation between predictor variables causes multicollinearity, which can destabilise model coefficients, inflate variance, and introduce redundancy — particularly problematic for parametric models such as GLM (Dormann et al., 2013). A pairwise scatterplot matrix was produced to visually inspect correlations among the three selected variables. Variables exhibiting Pearson’s |r| > 0.70 should not be jointly included in the same model (Dormann et al., 2013). The variables retained here were selected to minimise redundancy while maximising biological coverage of the species’ niche.
A tidymodels recipe was defined with class
(presence/background) as the binary response variable and the three
bioclimatic predictors as features. The workflow_set()
function was used to construct a modelling grid combining this single
preprocessing recipe with four machine learning
algorithms, each suited to SDM applications:
| Model | tidysdm Function |
Algorithm Type | Key Strengths |
|---|---|---|---|
| Generalised Linear Model (GLM) | sdm_spec_glm() |
Parametric | Interpretable; low variance; suitable for small datasets |
| Random Forest (RF) | sdm_spec_rf() |
Ensemble tree | Robust to overfitting; handles non-linearity and interactions; minimal tuning |
| Gradient Boosted Tree (GBM/XGBoost) | sdm_spec_boost_tree() |
Boosted ensemble | High predictive power; captures complex response surfaces |
| Maximum Entropy (MaxEnt) | sdm_spec_maxent() |
Presence-background | Designed specifically for presence-only/background data; widely validated in SDM literature |
Table 3. Machine learning algorithms included in the tidysdm
ensemble workflow, with their tidysdm specification
functions and ecological modelling characteristics.
Using multiple algorithms and combining their predictions into an ensemble is best practice in modern SDM (Araújo & New, 2007), as it averages out the idiosyncratic biases of individual algorithms and produces more robust suitability maps than any single model.
chameleon_rec <- recipe(chameleon_thin, formula = class ~ .)
chameleon_rec
chameleon_thin %>% check_sdm_presence(class)## [1] TRUE
# Build a workflow set: one recipe × four algorithms
chameleon_model <- workflow_set(
preproc = list(default = chameleon_rec),
models = list(
glm = sdm_spec_glm(),
rf = sdm_spec_rf(),
gdm = sdm_spec_boost_tree(),
maxent = sdm_spec_maxent()
),
cross = TRUE
) %>%
option_add(control = control_ensemble_grid())Model evaluation via random cross-validation (randomly partitioning the dataset into training and test folds) is inappropriate for spatially structured ecological data. Because occurrence records and their associated environmental values are spatially autocorrelated — nearby locations tend to be more similar than distant ones — random folds allow information from spatially proximate training points to leak into the test fold. This spatial leakage leads to overly optimistic performance estimates that do not generalise to independent spatial regions (Roberts et al., 2017; Valavi et al., 2019).
Spatial block cross-validation addresses this by partitioning the data into geographically separated folds, ensuring that training and test sets are sufficiently distant in space to approximate independence (Roberts et al., 2017). The resulting performance metrics are more conservative but better reflect true out-of-sample predictive ability across unsurveyed areas — which is the ultimate goal of SDM for conservation planning.
Five spatial folds (v = 5) were created using
spatial_block_cv() from tidysdm. The block
structure was visualised using autoplot() to confirm
adequate spatial separation.
Each algorithm in the workflow set was tuned using
workflow_map() with tune_grid(), evaluating a
grid of five candidate hyperparameter combinations per algorithm across
the five spatial block cross-validation folds. Tuning identifies the
hyperparameter configuration (e.g., number of trees, minimum node size,
regularisation strength) that maximises predictive performance on
withheld spatial test folds.
Model performance was assessed using the SDM metric
set implemented in tidysdm, which includes the
Area Under the Receiver Operating Characteristic Curve
(AUC-ROC) — the primary discrimination metric for binary SDMs —
as well as the Boyce Continuous Index and the True Skill Statistic
(TSS). AUC-ROC measures the probability that the model assigns a higher
suitability value to a random presence than to a random background
point, with values approaching 1.0 indicating near-perfect
discrimination.
set.seed(1234567)
chameleon_model <- chameleon_model %>%
workflow_map(
"tune_grid",
resamples = chameleon_cv,
grid = 5,
metrics = sdm_metric_set(),
verbose = TRUE
)
# Visualise model tuning performance across folds and hyperparameter grids
autoplot(chameleon_model)Following tuning, the best-performing hyperparameter configuration
for each algorithm was selected and combined into a simple
ensemble using simple_ensemble(). As described by
Leonardi et al. (2024), simple_ensemble() retains
the single best version of each algorithm — the configuration with the
highest value of the chosen metric (roc_auc here) across
cross-validation folds — and stores them as ensemble members.
Ensemble approaches are strongly preferred over single-algorithm SDMs because they average across the different response-curve shapes and extrapolation behaviours of individual algorithms, reducing the influence of any one model’s assumptions and increasing the robustness and spatial coherence of suitability predictions (Araújo & New, 2007; Thuiller et al., 2009).
chameleon_ensemble <- simple_ensemble() %>%
add_member(chameleon_model, metric = "roc_auc")
chameleon_ensembleThe following table summarises the cross-validated performance metrics for each algorithm in the ensemble. AUC-ROC is the primary criterion: values above 0.7 are considered acceptable, above 0.8 good, and above 0.9 excellent for habitat suitability modelling (Swets, 1988; Araújo et al., 2005). Members falling below a threshold of 0.5 (no better than random) are excluded from the final ensemble in subsequent projection steps.
knitr::kable(
chameleon_ensemble %>% collect_metrics(),
caption = "**Table 4.** Cross-validated performance metrics for each algorithm in the ensemble. AUC-ROC is the primary discrimination metric; higher values indicate better model performance. Metrics were computed across five spatially blocked cross-validation folds.",
digits = 4,
align = c("l", "l", "r", "r", "r"),
table.attr = 'class="table-styled"'
)| wflow_id | .metric | mean | std_err | n |
|---|---|---|---|---|
| default_glm | boyce_cont | 0.2053 | 0.7947 | 2 |
| default_glm | roc_auc | 0.8056 | 0.1002 | 3 |
| default_glm | tss_max | 0.5278 | 0.2373 | 3 |
| default_rf | boyce_cont | 0.5078 | 0.3374 | 2 |
| default_rf | roc_auc | 0.8611 | 0.0735 | 3 |
| default_rf | tss_max | 0.7500 | 0.1443 | 3 |
| default_gdm | boyce_cont | 0.5176 | 0.4824 | 2 |
| default_gdm | roc_auc | 0.7500 | 0.1250 | 3 |
| default_gdm | tss_max | 0.5833 | 0.2205 | 3 |
| default_maxent | boyce_cont | -0.5886 | 0.1386 | 2 |
| default_maxent | roc_auc | 0.9444 | 0.0556 | 3 |
| default_maxent | tss_max | 0.8611 | 0.0735 | 3 |
The ensemble was used to predict habitat suitability across the
entire Nsanje raster domain using predict_raster(). The
output is a continuous suitability surface in which cell values reflect
the mean predicted suitability across all ensemble
members, ranging from 0 (unsuitable) to 1 (highly suitable).
High-suitability areas represent climatic conditions most similar to
those at known R. chapmanorum presence locations.
prediction_2041_2060 <- predict_raster(chameleon_ensemble, climate_2021_2040)
ggplot() +
geom_spatraster(data = prediction_2041_2060, aes(fill = mean)) +
scale_fill_terrain_c(name = "Suitability") +
geom_sf(
data = chameleon_thin %>% filter(class == "presence"),
colour = "#d73027", size = 1.5, alpha = 0.8
) +
labs(
title = expression("Predicted habitat suitability — " * italic("Rhampholeon chapmanorum")),
subtitle = "Full ensemble mean prediction | WorldClim ACCESS-CM2 SSP2-4.5 (2041–2060)",
caption = "Red points = known presence records; colour scale = ensemble mean suitability (0–1)"
) +
theme_minimal(base_size = 12)To produce a more conservative and robust suitability map, the
ensemble was filtered to retain only models exceeding an AUC-ROC
threshold of 0.5 — removing members performing no better than
random. The median suitability across retained members was then
predicted using predict_raster(). Filtering on a
performance threshold improves ensemble quality by excluding poorly
calibrated models that would otherwise downweight or distort the spatial
predictions (Thuiller et al., 2009; Leonardi et al.,
2024).
prediction_filtered <- predict_raster(
chameleon_ensemble,
climate_2021_2040,
metric_thresh = c("roc_auc", 0.5),
fun = "median"
)
ggplot() +
geom_spatraster(data = prediction_filtered, aes(fill = median)) +
scale_fill_terrain_c(name = "Suitability") +
geom_sf(
data = chameleon_thin %>% filter(class == "presence"),
colour = "#d73027", size = 1.5, alpha = 0.8
) +
labs(
title = expression("Filtered ensemble suitability — " * italic("Rhampholeon chapmanorum")),
subtitle = "Ensemble median (AUC-ROC > 0.5 members only) | ACCESS-CM2 SSP2-4.5 (2041–2060)",
caption = "Red points = known presence records; colour scale = ensemble median suitability (0–1)"
) +
theme_minimal(base_size = 12)This analysis represents the first reproducible, multi-algorithm SDM workflow for Rhampholeon chapmanorum, incorporating contemporary best practices in spatial modelling. Several methodological choices merit brief discussion in the context of this species.
Spatial bias correction is particularly critical for R. chapmanorum, given that GBIF occurrence records are almost certainly clustered near road-accessible forest edges and surveyed patches, rather than uniformly sampled across the landscape. Failure to account for this bias could cause models to conflate accessible terrain with suitable habitat, artificially elevating suitability predictions near roads and settlements. The bias-weighting approach applied here explicitly mirrors the sampling effort distribution across presence and background points, reducing this confound (Phillips et al., 2009; Kramer-Schadt et al., 2013).
Variable selection was guided by the ecological biology of the species. Rhampholeon chapmanorum is an obligate forest-floor specialist; its persistence in any locality is contingent on the presence of closed-canopy lowland evergreen forest, which is itself constrained by sufficient annual rainfall (BIO12), relatively stable year-round precipitation (low BIO15), and mild minimum temperatures during the coldest season (BIO6). These three variables therefore capture the primary climatic axes of the species’ niche without introducing redundancy.
Spatial block cross-validation provides far more honest assessments of transferability than random cross-validation for a range-restricted species. Because the species occupies only a handful of small forest patches, random folds would almost certainly permit spatial leakage, inflating AUC-ROC. Spatially blocked evaluation penalises models that rely on spatial proximity rather than genuine environmental signal.
Limitations of the current analysis include the very small number of GBIF occurrence records available for this Critically Endangered species, which constrains model training and may increase sensitivity to individual data points. Future work should integrate field survey data from Tolley et al. (2021) and any subsequent surveys, apply spatial thinning to reduce spatial autocorrelation in the presence data, and project suitability under multiple GCMs and emission scenarios to quantify climate change uncertainty. Landscape connectivity analyses could further identify potential habitat restoration corridors linking the isolated forest patches in which this species persists.
This document presents a fully reproducible, best-practice SDM
workflow for the Critically Endangered Chapman’s Pygmy Chameleon,
Rhampholeon chapmanorum, implemented in R using
tidysdm. The workflow integrates GBIF occurrence data
cleaning, bias-corrected background sampling, ecologically motivated
variable selection, spatial block cross-validation, multi-algorithm
ensemble modelling, and present-day habitat suitability projection. The
resulting maps identify areas of climatically suitable habitat within
the Nsanje District that may warrant targeted field surveys and, in
turn, urgent conservation action to halt forest loss and facilitate
habitat connectivity for this irreplaceable endemic species.
Araújo, M.B., & New, M. (2007). Ensemble forecasting of species distributions. Trends in Ecology & Evolution, 22(1), 42–47. https://doi.org/10.1016/j.tree.2006.09.010
Araújo, M.B., Pearson, R.G., Thuiller, W., & Erhard, M. (2005). Validation of species–climate impact models under climate change. Global Change Biology, 11(9), 1504–1513.
Barbet-Massin, M., Jiguet, F., Albert, C.H., & Thuiller, W. (2012). Selecting pseudo-absences for species distribution models: how, where and how many? Methods in Ecology and Evolution, 3(2), 327–338. https://doi.org/10.1111/j.2041-210X.2011.00172.x
Barve, N., Barve, V., Jiménez-Valverde, A., et al. (2011). The crucial role of the accessible area in ecological niche modeling and species distribution modeling. Ecological Modelling, 222(11), 1810–1819.
Branch, W.R., Bayliss, J., & Tolley, K.A. (2014). Rhampholeon chapmanorum. In: IUCN 2014. IUCN Red List of Threatened Species. Version 2014.3. www.iucnredlist.org.
Dormann, C.F., Elith, J., Bacher, S., et al. (2013). Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography, 36(1), 27–46.
Elith, J., & Leathwick, J.R. (2009). Species distribution models: ecological explanation and prediction across space and time. Annual Review of Ecology, Evolution, and Systematics, 40, 677–697.
Fick, S.E., & Hijmans, R.J. (2017). WorldClim 2: new 1-km spatial resolution climate surfaces for global land areas. International Journal of Climatology, 37(12), 4302–4315.
GBIF.org (2024). GBIF Occurrence Download. https://doi.org/10.15468/dl.XXXXXX
Guisan, A., Thuiller, W., & Zimmermann, N.E. (2017). Habitat Suitability and Distribution Models: With Applications in R. Cambridge University Press.
IUCN (2021). Rhampholeon chapmanorum. The IUCN Red List of Threatened Species 2021: e.T197246585A197246588. https://dx.doi.org/10.2305/IUCN.UK.2021-2.RLTS.T197246585A197246588.en
Kramer-Schadt, S., Niedballa, J., Pilgrim, J.D., et al. (2013). The importance of correcting for sampling bias in MaxEnt species distribution models. Diversity and Distributions, 19(11), 1366–1379.
Leonardi, M., Colucci, E., et al. (2024). tidysdm: Leveraging the flexibility of tidymodels for species distribution modelling in R. Methods in Ecology and Evolution, 15, 2133–2140. https://doi.org/10.1111/2041-210X.14406
Phillips, S.J., Dudík, M., Elith, J., et al. (2009). Sample selection bias and presence-only distribution models: implications for background and pseudo-absence data. Ecological Applications, 19(1), 181–197.
Roberts, D.R., Bahn, V., Ciuti, S., et al. (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, 40(8), 913–929.
Swets, J.A. (1988). Measuring the accuracy of diagnostic systems. Science, 240(4857), 1285–1293.
Thuiller, W., Lafourcade, B., Engler, R., & Araújo, M.B. (2009). BIOMOD – a platform for ensemble forecasting of species distributions. Ecography, 32(3), 369–373.
Tilbury, C.R. (1992). Two new dwarf chameleons (Sauria: Rhampholeon) from Malawi, Central Africa, with notes on the ecology of the genus. Tropical Zoology, 5, 197–220.
Tolley, K.A., Tilbury, C.R., Da Silva, J.M., et al. (2021). Clinging to survival: Critically Endangered Chapman’s pygmy chameleon Rhampholeon chapmanorum persists in shrinking forest patches. Oryx, 56(3), 455–461. https://doi.org/10.1017/S0030605320000952
Valavi, R., Elith, J., Lahoz-Monfort, J.J., & Guillera-Arroita, G. (2019). blockCV: An R package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods in Ecology and Evolution, 10(2), 225–232.
Wisz, M.S., & Guisan, A. (2009). Do pseudo-absence selection strategies influence species distribution models and their predictions? An information-theoretic approach based on simulated data. BMC Ecology, 9, 8.
Zizka, A., Silvestro, D., Andermann, T., et al. (2019). CoordinateCleaner: Standardized cleaning of occurrence records from biological collection databases. Methods in Ecology and Evolution, 10(5), 744–751.
Analysis conducted by Luther Mwalyambwile. Rendered with R Markdown. All code is fully reproducible given access to the source occurrence and raster data.