Abstract

Xerochlamys bojeriana is a shrub that primarily grows in the seasonally dry tropical biome and is an endemic species of Madagascar(Hong-Wa 2009). This assessment aims to use species distribution models (SDMs) and maps to model and describe the current distribution of a plant species occurring in Madagascar. After modeling, we are able to describe this species’ distribution and found that precipitation in the wettest quarter and annual average temperature are the primary factors affecting the distribution of this species. Xerochlamys bojeriana’s predicted habitats are mostly in the central region of Madagascar. Also, at the top and bottom of Madagascar, Xerochlamys bojeriana has a has a high probability of survival. By evaluating this model with Accuracy, TSS and AUC, it can be seen that the model can fit the data well. Although this analysis provides some information for the study of this species and the evaluation indicators also show that the model is good, there is still room for improvement in the model, such as using cross validation and other methods to enhance its robustness(Marla et al. 2020).

Introduction

Madagascar is an island country located off the southeastern coast of Africa (worldatlas.com 2017). The combination of southeast trade winds and northwest monsoons resulted in a hot rainy season (November to April) and frequent destructive cyclones, as well as a relatively cool dry season (May to October)(Vences et al. 2009). The diverse landscapes of the island, including tropical rainforest, thorny forests, and plateaus, contribute to its rich ecological tapestry. Its rich biodiversity, shaped by millions of years of separation from the mainland, has given rise to a plethora of endemic species, making it a global hotspot for conservation efforts(Vences et al. 2009).

Madagascar boasts remarkable botanical diversity with a high number of plant species found nowhere else on Earth. From the iconic baobabs to rare orchids, the plant life on the island plays a vital role in its ecosystems. However, many plant species face severe threats from habitat destruction, climate change, and human activities such as fires(Alvarado et al. 2014).

Species distribution modelling (SDM) uses computer algorithms to predict the distribution of a species across geographic space and time using environmental data(Elith and Leathwick 2009). By analyzing environmental variables, climate patterns, and plant occurrence records, SDMs empower scientists to unravel the intricate relationships between plants and their habitats. Predictions from an SDM may include a species’ future distribution under climate change, a species’ past distribution to assess evolutionary relationships, or the potential future distribution of an invasive species. This knowledge becomes invaluable in crafting targeted conservation strategies, ensuring the preservation of Madagascar’s botanical treasures for generations to come.

Methods

  • Describe and motivate your choice of methods

Library all packages before starting.

library(pacman)
p_load(rgbif, DT, tidyverse, CoordinateCleaner, rnaturalearth, rnaturalearthdata, sf, mapview, terra, flexsdm, here, corrplot, SDMtune, virtualspecies, exactextractr)

1. Download locality data from GBIF

Download GBIF occurrence data for this Xerochlamys bojeriana(Chamberlain and Boettiger 2017).

spp <- c("Xerochlamys bojeriana")
gbif_download <- occ_data(scientificName = spp, hasCoordinate = TRUE, country = 'MG', limit = 10000)

Select only the data and visualise the data and the interaction map.

gbif_data <- gbif_download$data
gbif_data %>% DT::datatable(extensions = 'Scroller',
                            options = list(scrollX = TRUE,
                                           scrollY = 200,
                                           scroller = TRUE))
gbif_data %>% 
  st_as_sf(coords = c('decimalLongitude', 'decimalLatitude'), crs = 4326) %>% 
  dplyr::select(year) %>% 
  mapview(layer.name = spp)

2. Clean GBIF data

Use custom filter and cc_*() functions to clean the data(Zizka et al. 2019).

gbif_data %>%
  filter(!decimalLatitude == 0 | !decimalLongitude == 0) %>%
  distinct(decimalLongitude,decimalLatitude,speciesKey,datasetKey, .keep_all = TRUE) -> gbif_filt

gbif_filt %>%
  cc_cen(lat = 'decimalLatitude', lon = 'decimalLongitude', buffer = 2000) %>% # remove country centroids within 2km 
  cc_cap(lat = 'decimalLatitude', lon = 'decimalLongitude', buffer = 2000) %>% # remove capitals centroids within 2km
  cc_inst(lat = 'decimalLatitude', lon = 'decimalLongitude', buffer = 2000) %>% # remove zoo and herbaria within 2km 
  cc_sea(lat = 'decimalLatitude', lon = 'decimalLongitude') -> spp_clean # remove from ocean
## Testing sea coordinates
## Testing biodiversity institutions
## Testing country capitals
## Testing country centroids
## Removed 2 records.
## Removed 3 records.
## Removed 0 records.
## Removed 1 records.

We have now removed another 6 records, leaving us with 241 records. Download the country border with rnaturalearth package and ne_countries() function and return it as an sf object.

mad <- ne_countries(country = 'Madagascar', scale = 'medium', returnclass = 'sf')
spp_clean %>% 
  dplyr::select(key, year, basisOfRecord, locality, coordinateUncertaintyInMeters, occurrenceRemarks,  recordedBy, lon = decimalLongitude, lat = decimalLatitude) -> spp_sel
spp_sf <- st_as_sf(spp_sel, coords = c('lon', 'lat'))
st_crs(spp_sf) = 4326

Plot the spatial layers and show Interactive map.

# Spatial layers
ggplot() +
  geom_sf(data = mad) +
  geom_sf(data = spp_sf, size = 3, alpha = 0.5) +
  labs(title = 'Plotting as simple features (sf)') +
  theme_void()

# Interactive maps
spp_sf %>% 
  st_jitter(factor = 0.0001) %>% 
  mapview(zcol = 'year')

3. Spatial thinning

Load in worldclim locally. Plot mean annual temperature (C).

worldclim <- rast(here("data/rast/afr_worldclim.tif"))
names(worldclim)
##  [1] "bio1"  "bio2"  "bio3"  "bio4"  "bio5"  "bio6"  "bio7"  "bio8"  "bio9" 
## [10] "bio10" "bio11" "bio12" "bio13" "bio14" "bio15" "bio16" "bio17" "bio18"
## [19] "bio19"
names(worldclim) <- c("mean_ann_t","mean_diurnal_t_range","isothermality", "t_seas", 'max_t_warm_m','min_t_cold_m',"t_ann_range",'mean_t_wet_q','mean_t_dry_q','mean_t_warm_q','mean_t_cold_q','ann_p', 'p_wet_m','p_dry_m','p_seas','p_wet_q','p_dry_q','p_warm_q','p_cold_q')
plot(worldclim$mean_ann_t, main = 'mean annual temperature (C)')

Plot all climate factors.

mad <- st_transform(mad, 'EPSG:4326')
wc_mad <- worldclim %>% 
  crop(., vect(mad)) %>% 
  mask(., vect(mad))
plot(wc_mad)

Thin records by distance.

set.seed(42)
spp_filt <- occfilt_geo(
  data = spp_sel %>% rename(x = lon, y = lat),
  x = 'x',
  y = 'y',
  method = c('defined', d = 10), # set a 10 km threshold
  env_layer = wc_mad,
  prj = crs(wc_mad)
)
## Extracting values from raster ...
## Number of unfiltered records: 241
## Distance threshold(km): 10
## Number of filtered records: 46

Show localities included and filtered on one plot.

# join the result
spp_filt_sf <- spp_filt %>% 
  st_as_sf(coords = c('x','y'), crs = st_crs(4326))
# identify the localities that were removed
spp_removed_sf <- spp_sel %>% 
  filter(!key %in% spp_filt$key) %>% 
  st_as_sf(coords = c('lon','lat'), crs = st_crs(4326))
# Visualise output
mapview(spp_filt_sf, zcol = NULL, col.regions = 'blue', layer.name = 'localities included') +
  mapview(spp_removed_sf, zcol = NULL, col.regions = 'red', layer.name = 'localities filtered')

4. Create pseudo-absences

Create random absence points in spaces outside of a 10 km buffer of presence points to accurate the model.

set.seed(42) # make sure the result is repeatable
pseudo_abs <-  sample_pseudoabs(
  data = spp_filt,
  x = 'x',
  y = 'y',
  n = nrow(spp_filt),
  method = c('geo_const', width = 10000), 
  rlayer = wc_mad,
  calibarea = vect(mad)
)
head(pseudo_abs)
## # A tibble: 6 × 3
##       x     y pr_ab
##   <dbl> <dbl> <dbl>
## 1  49.2 -17.6     0
## 2  46.8 -19.7     0
## 3  46.0 -17.4     0
## 4  45.0 -21.6     0
## 5  47.0 -15.7     0
## 6  45.7 -16.0     0

Visualise presences and pseudoabsences points on one plot, and show the result sf.

# process presence points to match pseudoabsences
pres <- spp_filt %>% 
  dplyr::select(x, y) %>% 
  mutate(pr_ab = 1)
head(pres)
## # A tibble: 6 × 3
##       x     y pr_ab
##   <dbl> <dbl> <dbl>
## 1  46.8 -20.6     1
## 2  46.9 -20.5     1
## 3  45.4 -22.6     1
## 4  47.4 -20.3     1
## 5  45.3 -22.5     1
## 6  47.2 -19.0     1
# combine presences and pseudoabsences
all_pts <- rbind(pres, pseudo_abs)
# convert to spatial format
all_pts_sf <- all_pts %>% 
  mutate(pr_ab = as.factor(pr_ab)) %>% 
  st_as_sf(coords = c('x','y'), crs = st_crs(4326))
# visualise
mapview(all_pts_sf, zcol = 'pr_ab', layer.name = 'all points')
# Export locality data
write_sf(all_pts_sf, 
         here(paste0('output/species_localities/',gbif_data$genus[1],'_',gbif_data$specificEpithet[1],'_','all_points.shp')))
# Show locality data
DT::datatable(all_pts_sf)

5. Process environmental data

Download the country border and return it as an sf object

spp_code <- gsub(' ', '_', spp)
all_pts <- vect(here(paste0('output/species_localities/',spp_code, '_all_points.shp')))
mad_sf <- ne_countries(country = 'Madagascar', scale = 'medium', returnclass = 'sf')
mad <- vect(mad_sf)

Load worldclim data.

worldclim <- rast(here("data/rast/afr_worldclim.tif"))
names(worldclim) <- c("mean_ann_t","mean_diurnal_t_range","isothermality", "t_seas", 'max_t_warm_m','min_t_cold_m',"t_ann_range",'mean_t_wet_q','mean_t_dry_q','mean_t_warm_q','mean_t_cold_q','ann_p', 'p_wet_m','p_dry_m','p_seas','p_wet_q','p_dry_q','p_warm_q','p_cold_q')

mad <- project(mad, 'EPSG:4326')
# crop and mask
wc_mad <- worldclim %>% 
  crop(., mad) %>% 
  mask(., mad)

Re-scale temperature values and visualise raw data.

wc_mad[[c(1:2,5:11)]] <- wc_mad[[c(1:2,5:11)]]/10
wc_mad[[3:4]] <- wc_mad[[3:4]]/100
 
plot(wc_mad$mean_ann_t)
plot(mad, add = T)
plot(all_pts, add = T)

Check projections.

crs(wc_mad) == crs(mad)
## [1] TRUE
crs(wc_mad) == crs(all_pts)
## [1] FALSE
crs(mad) == crs(all_pts)
## [1] FALSE
mad <- project(mad, wc_mad)
crs(wc_mad) == crs(mad)
## [1] TRUE

Check for collinearity 检查共线性

# Using Pearson correlation
cov_colin <- correct_colinvar(wc_mad, method = c('pearson', th = "0.7"))
# Take a look at the correlations using corrplot
corrplot(cov_colin$cor_table, type = 'lower', diag = FALSE, order = 'hclust', tl.cex = 0.6, tl.col = 'black', addCoef.col = 'black')

# remove collinearity
set.seed(42)
non_colin <- removeCollinearity(raster::stack(wc_mad), 0.7, method = 'pearson', plot = TRUE, select.variables = TRUE, sample.points = FALSE)

non_colin
## [1] "mean_ann_t"           "mean_diurnal_t_range" "isothermality"       
## [4] "t_seas"               "mean_t_wet_q"         "p_warm_q"            
## [7] "p_wet_q"              "p_seas"
# subset variables and run again x1
wc_mad_sel <- wc_mad[[non_colin]]
set.seed(42)
non_colin_check <- removeCollinearity(raster::stack(wc_mad_sel), 0.7, method = 'pearson', plot = TRUE, select.variables = TRUE, sample.points = FALSE)

# subset variables and run again x2
wc_mad_sel <- wc_mad_sel[[non_colin_check]]
set.seed(42)
non_colin_check_2 <- removeCollinearity(raster::stack(wc_mad_sel), 0.7, method = 'pearson', plot = TRUE, select.variables = TRUE, sample.points = FALSE)
##   - No multicollinearity detected in your data at threshold 0.7

6. Prepare extracted data

Extract data.

# Prepare a SWD (Sample with Data), which is a class of data specifically used in the SDMtune package
all_pts_df <- as.data.frame(all_pts, geom = 'XY')
SWDdata <- prepareSWD(
  species = 'Aristida rufescens',
  p = all_pts_df %>% filter(pr_ab == 1) %>% dplyr::select(x, y),
  a = all_pts_df %>% filter(pr_ab == 0) %>% dplyr::select(x, y),
  env = wc_mad_sel
)
## 
ℹ Extracting predictor information for presence locations

✔ Extracting predictor information for presence locations [129ms]
## 
ℹ Extracting predictor information for absence/background locations

✔ Extracting predictor information for absence/background locations [78ms]
# Inspect 10 random rows
sample_n(cbind(SWDdata@pa, SWDdata@data), 10)
##    SWDdata@pa mean_ann_t mean_diurnal_t_range isothermality t_seas p_warm_q
## 1           0       25.6                 13.2          0.69  15.42      470
## 2           0       23.0                  9.0          0.62  22.40     1046
## 3           1       14.1                 12.1          0.67  22.73      833
## 4           0       26.7                 10.7          0.68  13.23      807
## 5           1       15.8                 12.5          0.65  25.99      786
## 6           1       19.6                 11.4          0.65  25.57      959
## 7           1       15.9                 12.4          0.67  24.14      823
## 8           1       21.6                 13.6          0.67  22.18      515
## 9           1       19.1                 13.1          0.69  20.21      933
## 10          1       21.9                 14.4          0.67  23.05      438
##    p_wet_q p_seas
## 1     1051    112
## 2     1046     50
## 3      867     90
## 4     1075    124
## 5      786     84
## 6      959     83
## 7      841     90
## 8      515    104
## 9      933    101
## 10     498    102

Split locations in training (80%) and testing (20%) datasets.

split_data <- trainValTest(SWDdata, test = 0.2, seed = 42)
train <- split_data[[1]]
test <- split_data[[2]]

7. Run & evaluate model

Run a random forest model.

set.seed(42)
rf_model <- train(method = c('RF'), ntrees = 500, data = train)

Predict probability for training points.

cbind(train@pa, predict(rf_model, data = train)) %>% 
  as.data.frame() %>% 
  rename(true_class = V1, predicted_class = V2) %>% 
  mutate(threshold_0.4 = ifelse(predicted_class >= 0.5, 1, 0),
         threshold_0.5 = ifelse(predicted_class >= 0.5, 1, 0),
         threshold_0.6 = ifelse(predicted_class >= 0.6, 1, 0)) %>%
  DT::datatable()

Evaluate model.

cm <- confMatrix(rf_model, test = test, th = 0.5)
cm_format <- as.data.frame(rbind(c(cm[,2],cm[,3]), c(cm[,4],cm[,5])))
names(cm_format) <- c('Positive', 'Negative')
rownames(cm_format) <- c('Positive', 'Negative')
cm_format %>% DT::datatable()
# Accuracy
round((cm$tp + cm$tn)/sum(cm$tp, cm$tp, cm$fp, cm$fn)*100, 2)
## [1] 71.43
# True Skill Statistic (TSS)
tss(rf_model, test = test)
## [1] 0.8888889
# Receiver Operator Curve (ROC)
plotROC(rf_model, test = test)

# Area Under Curve (AUC)
auc(rf_model, test = test)
## [1] 0.962963
# Selecting a threshold
(ths <- thresholds(rf_model, test = test))
##                                       Threshold  value
## 1                     Minimum training presence  0.672
## 2    Equal training sensitivity and specificity  0.672
## 3 Maximum training sensitivity plus specificity  0.672
## 4        Equal test sensitivity and specificity  0.724
## 5     Maximum test sensitivity plus specificity  0.810
##   Fractional predicted area Training omission rate Test omission rate P-values
## 1                         0             0.00000000         0.05555556        0
## 2                         0             0.00000000         0.05555556        0
## 3                         0             0.00000000         0.05555556        0
## 4                         0             0.02702703         0.05555556        0
## 5                         0             0.24324324         0.05555556        0
ths %>% DT::datatable()
cm_custom <- confMatrix(rf_model, test = test, th = 0.71)
# True Positive Rate (Sensitivity)
(TPR <- cm_custom$tp/(cm_custom$tp + cm_custom$fn))
## [1] 0.8888889
# False Positive Rate (1 - Specificity)
(FPR <- 1 - cm_custom$tn/(cm_custom$tn + cm_custom$fp))
## [1] 0.1111111
# Receiver Operator Curve (ROC) showing best threshold
plotROC(rf_model, test = test) + 
  geom_point(aes(x = FPR, y = TPR), col = 'red', size = 3) +
  labs(x = 'False Positive Rate (1 - Specificity)', 
       y = 'True Positive Rate (Sensitivity)')

8. Variable insights & predicting habitat suitability

# Variable importance
vi <- varImp(rf_model, permut = 5)
plotVarImp(vi[,1:2])

Response curves.

SDMtune::plotResponse(rf_model, var = vi$Variable[1], marginal = TRUE, rug = TRUE)

SDMtune::plotResponse(rf_model, var = vi$Variable[2], marginal = TRUE, rug = TRUE)

Predict habitat suitability.

pred <- predict(rf_model, data = wc_mad_sel)
pred_df <- as.data.frame(pred, xy = TRUE)

ggplot() +
  geom_tile(data = pred_df, aes(x = x, y = y, fill = lyr1)) +
  scale_fill_gradientn(colours = c("#2c7bb6", "#abd9e9", "#ffffbf", "#fdae61", "#d7191c"), 
                       na.value = NA, name = 'Habitat\nsuitability') +
  geom_sf(data = mad_sf, fill = NA) +
  geom_point(data = all_pts_df %>% filter(pr_ab == 1), aes(x = x, y = y)) +
  labs(x = 'Longitude', y = 'Latitude') +
  theme_minimal()

Predict habitat suitability with threshold

ths <- thresholds(rf_model, test = test)
th = ths[4,2]
print(paste('The chosen threshold value is:', th))
## [1] "The chosen threshold value is: 0.724"
pred_th <- pred >= th
pred_th_df <- as.data.frame(pred_th, xy = TRUE)
ggplot() +
  geom_tile(data = pred_th_df, aes(x = x, y = y, fill = as.factor(lyr1))) +
  scale_fill_manual(values = c("#2c7bb6", "#d7191c"), na.value = NA,
                    name = paste0('Habitat\nsuitability\nbinary (>= ',th,')'),
                    labels = c('absent','present','')) +
  geom_sf(data = mad_sf, fill = NA) +
  geom_point(data = all_pts_df %>% filter(pr_ab == 1), aes(x = x, y = y)) +
  labs(x = 'Longitude', y = 'Latitude') +
  theme_minimal()

Identify overlap with Protected Areas.

# load in protected area shapefiles. There are 3 files, so we want to load them all in together and then bind them into one file
prot_areas <- list.files(here('data/vect/WDPA_Madagascar'), pattern = '*.shp', full.names = TRUE)
prot_areas_list <- lapply(prot_areas, read_sf)

Bind the 3 files togther.

# convert to equal area projection
prot_areas_all <- bind_rows(prot_areas_list) %>% filter(MARINE == 0)
# convert the protected areas
prot_areas_all %>% 
  st_transform(crs = 'EPSG:29702') -> prot_areas_all_proj
# convert the presence/absence raster
pred_th %>% 
  project(.,vect(prot_areas_all_proj), method = 'near') -> pred_th_proj
## Warning: [project,SpatRaster] argument y (the crs) should be a character value
# visualise the different projections
par(mfrow = c(1,2))
plot(pred_th)
plot(vect(prot_areas_all), add = TRUE)
plot(pred_th_proj)
plot(vect(prot_areas_all_proj), add = TRUE)

We select and sum only the cells with 1’s, then multiply this by the size of the raster cells and lastly divide this by meters to get a result in km2.

pres_area <- (sum(pred_th_proj[] == 1, na.rm = TRUE) * (res(pred_th_proj)[1]*res(pred_th_proj)[2]) / (1000^2))
paste('The area of species presences is',pres_area, 'km2')
## [1] "The area of species presences is 64359.2795802828 km2"
# Calculate the area of all cells
all_area <- (sum(!is.na(pred_th_proj[])) * (res(pred_th_proj)[1]*res(pred_th_proj)[2]) / (1000^2))
paste('The area of all cells is',all_area, 'km2')
## [1] "The area of all cells is 615331.402308378 km2"
# And lastly calculate the percentage of coverage of our species across all of Madagascar
paste('The species presences cover',round(pres_area/all_area*100, 2), '% of Madagascar')
## [1] "The species presences cover 10.46 % of Madagascar"

Create custom function to calculate the proportion of area covered by each Protected Area.

sum_cover <- function(x){
  list(x %>%
         group_by(value) %>%
         summarize(total_area = sum(coverage_area)))
}
# extract the amount of area covered 
extract_all <- exact_extract(pred_th_proj, prot_areas_all_proj, coverage_area = TRUE, summarize_df = TRUE, fun = sum_cover)
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=                                                                     |   1%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |====                                                                  |   5%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |=====                                                                 |   8%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |=======                                                               |   9%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |=============                                                         |  18%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |==============                                                        |  21%
  |                                                                            
  |===============                                                       |  21%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |=================                                                     |  25%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |===================                                                   |  26%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |====================                                                  |  28%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |======================                                                |  32%
  |                                                                            
  |=======================                                               |  32%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |=========================                                             |  35%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |==========================                                            |  38%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |============================                                          |  39%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |==================================                                    |  49%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================                                  |  51%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |=====================================                                 |  53%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |======================================                                |  55%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |=========================================                             |  58%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |==========================================                            |  61%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |============================================                          |  62%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |=============================================                         |  65%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |===============================================                       |  68%
  |                                                                            
  |================================================                      |  68%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |==================================================                    |  72%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |===================================================                   |  74%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |=====================================================                 |  75%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |========================================================              |  79%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |=========================================================             |  82%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |===============================================================       |  91%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |=================================================================     |  92%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |==================================================================    |  95%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |===================================================================== |  98%
  |                                                                            
  |===================================================================== |  99%
  |                                                                            
  |======================================================================| 100%

Add the names of the protected areas back on to our extraction

names(extract_all) <- prot_areas_all_proj$ORIG_NAME
# convert the list to a data frame
extract_df <- bind_rows(extract_all, .id = 'ORIG_NAME')
# take a look at the first 6 rows
head(extract_df)
## # A tibble: 6 × 3
##   ORIG_NAME       value  total_area
##   <chr>           <dbl>       <dbl>
## 1 Ankarafantsika      0 1695222347.
## 2 Andohahela          0  872005779.
## 3 Marojejy            0  752626644.
## 4 Tsaratanana         0 1084988320.
## 5 Tsaratanana         1  405561313.
## 6 Tsimanampesotse     0 2629427415.

Sum all of the area that overlaps with the protected areas for presences (i.e. 1’s) and divide this by the total area of all presences.

area_under_pas <- extract_df %>% 
  filter(value == 1) %>% 
  summarise(sum(total_area)/(1000^2))
paste(round(area_under_pas/pres_area * 100, 2),'% of the predicted presences are found within protected areas')
## [1] "18.5 % of the predicted presences are found within protected areas"
# provides important context on both the quality and quantity of protected areas overlapping with our species range:
iucn_cat <- prot_areas_all_proj %>% 
  st_drop_geometry() %>% 
  dplyr::select(ORIG_NAME, IUCN_CAT)
extract_df %>% 
  left_join(iucn_cat, by = 'ORIG_NAME', multiple = "all") %>% 
  filter(value == 1) %>%
  group_by(IUCN_CAT) %>%
  summarise(area = sum(total_area)/(1000^2)) %>%
  mutate(perc = round(area/sum(area) * 100, 2))
## # A tibble: 7 × 3
##   IUCN_CAT        area  perc
##   <chr>          <dbl> <dbl>
## 1 II             2471. 20.8 
## 2 IV              790.  6.63
## 3 Ia              406.  3.41
## 4 Not Applicable  395.  3.32
## 5 Not Reported   3493. 29.3 
## 6 V              2773. 23.3 
## 7 VI             1582. 13.3

Results

  • By drawing a distribution map, it can be found that the survival area of this species mainly exists in the central part of Madagascar, with a few distributed near the boundary.
  • From the second step, we can see the occurance records from GBIF were mostly earlier than 2000. This may lead to inaccurate recent assessments of this species, and we need more recent data.
  • From climate calculation results, annual temperature range is at a relatively high level. And the high isotherm may be related to the humid regional environment.
  • By using a recursive approach using hierarchical classification of groups of similar variables, we select 7 variables with no intercorrelation among them with a cutoff at 0.7. They are mean_ann_t, t_seas, p_wet_q, isothermality, p_seas, mean_diurnal_t_range and p_warm_q. Treat these variables as predictors to build a model. Due to limited data, 80% of the dataset was used as the training set and 20% as the testing set for machine learning training. Use confusion matrice to evaluate the model. Results here suggest that there are 9 true positives and 6 true negatives, 3 false positives and no false negatives. TO calculate the number of observations correctly classified (true) over all of the observations, we can use formula below: \[ Accuracy = ((True Positive + True Negative) / Total Observations) * 100 \] In this case, the formula is: \[ Accuracy = ((9 + 6) / 21 ) * 100 \] \[ Accuracy = 71.43 % \] Value of accuracy in this case is above 70%, which means may be doing a decent to very good job.

In order to make the model evaluation more accurate, there are a few other metrics we could use, such as the True Skill Statistic (TSS) or the Area Under Curve (AUC). The formula for TSS is: \[ TSS = True Positive Rate + True Negative Rate - 1 \]

where:

\[ True Positive Rate = True Positives / (True Positives + False Negatives) \] and:

\[ True Negative Rate = True Negatives / (True Negatives + False Positives) \] In this case, True Positive Rate: \[ True Positive Rate = 9 / (9 + 0) = 1 \] True Negative Rate: \[ True Negative Rate = 6 / (6 + 3) = 0.67 \] TSS formula:

\[ TSS = True Positive Rate + True Negative Rate - 1 = 1 + 0.67 - 1 = 0.67 \] AUC is calculated as the integral of the area under the curve of the receiver operator curve (ROC). It is a performance metric that measures the quality of machine learning. In this case, AUC value is 0.96, which means the model performs well and is close to perfect. - By calculating the variable importance, the result suggests that p_wet_q(49.8%) and mean_ann_t(46.5%) are the most important variables in this model. Meanwhile, these two indicators are also within the 7 indicators after excluding collinearity, and can be considered effective. - Xerochlamys bojeriana’s predicted habitats are mostly in the middle of Madagascar. Also, on the top and bottom of Madagascar, Xerochlamys bojeriana has a has a high probability of survival. - The area of species presences is 84700.81 km2. The area of all cells is 615331.40 km2. The species presences cover 13.77 % of Madagascar. 19.82 % of the predicted presences are found within protected areas.

Discussion

In conclusion, Xerochlamys bojeriana has sufficient locality data to accurately predict its species distribution. Metrics such as Accuracy, TSS, and AUC were employed to evaluate the model, and the results are reliable. Typically, a model with an Accuracy greater than 0.7 is considered a good fit. In this case, the model’s Accuracy is 71.43%. However, it’s essential to note that Accuracy is a simplistic measure, and some argue it may not be entirely reliable. Therefore, additional evaluation using TSS and AUC was conducted, and all metrics yielded positive results.

The variable mean_t_dry_q might also play a significant role in determining the species’ distribution. The locations with higher growth probability due to this factor closely align with the actual distribution of the species, suggesting its importance. Notably, this variable was excluded during the collinearity analysis step.

According to the calculation results, precipitation of wettest quarter and annual average temperature are the biggest factors affecting the distribution of this species. Xerochlamys bojeriana is a shrub and grows primarilly in the seasonally dry tropical biome. Madagascar is effected by the combination of southeastern trade winds and northwestern monsoons. That produces a hot rainy season and a relatively cooler dry season. According to the species distribution modelling results, Xerochlamys bojeriana mainliy grows in the middle of the country, which are highlands of the country. Highlands are both drier and cooler than other places in Madagascar. And the plant here mainly grows during the rainy season rather than the dry season, so the precipitation during the rainy season is particularly important for this plant. According to the probability of presence on mean_ann_t, these two variables exhibit a nearly linear relationship, namely the lower the mean_ Ann_t, the higher the probability of presence. This is probably related to the fact that the plant grows at a cooler highland.

  • Cross-validation and adding factors could make the modeling results more robust.

Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations. Using different data for model training and testing may lead to different model results. If the data is evenly divided into n equal parts, a portion is taken each time as the test set, and the remaining portion is used as the training set. After repeating this process n times and taking the mean for modeling, this problem can be avoided. It is called K-fold Cross Validation, a kind of cross-validation way.

In addition, there are also some methods that can be solved from the perspective of machine learning model computation. But in addition, more factors can also be considered to be added. For example, if it is necessary to improve the robustness of habitat prediction, taking into account the evolution trends of soil and climate during prediction can make the predicted results have temporal robustness.

References

Alvarado, Swanni T, Elise Buisson, Harison Rabarison, Charlotte Rajeriarison, Chris Birkinshaw, Porter P Lowry II, and Leonor PC Morellato. 2014. “Fire and the Reproductive Phenology of Endangered Madagascar Sclerophyllous Tapia Woodlands.” South African Journal of Botany 94: 79–87.
Chamberlain, Scott A, and Carl Boettiger. 2017. “R Python, and Ruby Clients for GBIF Species Occurrence Data.” PeerJ Preprints.
Elith, Jane, and John R Leathwick. 2009. “Species Distribution Models: Ecological Explanation and Prediction Across Space and Time.” Annual Review of Ecology, Evolution, and Systematics 40: 677–97.
Hong-Wa, Cynthia. 2009. “Endemic Families of Madagascar. XII. Resurrection and Taxonomic Revision of the Genera Mediusella (Cavaco) Hutchinson and Xerochlamys Baker (Sarcolaenaceae).” Adansonia 31 (2): 311–39.
Marla, Lavanya, Alexander Rikun, Gautier Stauffer, and Eleni Pratsini. 2020. “Robust Modeling and Planning: Insights from Three Industrial Applications.” Operations Research Perspectives 7: 100150.
Vences, Miguel, Katharina C Wollenberg, David R Vieites, and David C Lees. 2009. “Madagascar as a Model Region of Species Diversification.” Trends in Ecology & Evolution 24 (8): 456–65.
worldatlas.com. 2017. “Which Are the Island Countries of the World? - WorldAtlas.com.” https://web.archive.org/web/20171207094959/http://www.worldatlas.com/articles/which-are-the-island-countries-of-the-world.html.
Zizka, Alexander, Daniele Silvestro, Tobias Andermann, Josue Azevedo, Camila Duarte Ritter, Daniel Edler, Harith Farooq, et al. 2019. “CoordinateCleaner: Standardized Cleaning of Occurrence Records from Biological Collection Databases.” Methods in Ecology and Evolution, no. 10: –7. https://doi.org/10.1111/2041-210X.13152.