Xerochlamys bojeriana is a shrub that primarily grows in the
seasonally dry tropical biome and is an endemic species of
Madagascar(Hong-Wa 2009). This assessment
aims to use species distribution models (SDMs) and maps to model and
describe the current distribution of a plant species occurring in
Madagascar. After modeling, we are able to describe this species’
distribution and found that
precipitation in the wettest quarter and
annual average temperature are the primary factors
affecting the distribution of this species. Xerochlamys
bojeriana’s predicted habitats are mostly in the central region of
Madagascar. Also, at the top and bottom of Madagascar, Xerochlamys
bojeriana has a has a high probability of survival. By evaluating
this model with Accuracy, TSS and
AUC, it can be seen that the model can fit the data well.
Although this analysis provides some information for the study of this
species and the evaluation indicators also show that the model is good,
there is still room for improvement in the model, such as using cross
validation and other methods to enhance its robustness(Marla et al. 2020).
Madagascar is an island country located off the southeastern coast of Africa (worldatlas.com 2017). The combination of southeast trade winds and northwest monsoons resulted in a hot rainy season (November to April) and frequent destructive cyclones, as well as a relatively cool dry season (May to October)(Vences et al. 2009). The diverse landscapes of the island, including tropical rainforest, thorny forests, and plateaus, contribute to its rich ecological tapestry. Its rich biodiversity, shaped by millions of years of separation from the mainland, has given rise to a plethora of endemic species, making it a global hotspot for conservation efforts(Vences et al. 2009).
Madagascar boasts remarkable botanical diversity with a high number of plant species found nowhere else on Earth. From the iconic baobabs to rare orchids, the plant life on the island plays a vital role in its ecosystems. However, many plant species face severe threats from habitat destruction, climate change, and human activities such as fires(Alvarado et al. 2014).
Species distribution modelling (SDM) uses computer algorithms to predict the distribution of a species across geographic space and time using environmental data(Elith and Leathwick 2009). By analyzing environmental variables, climate patterns, and plant occurrence records, SDMs empower scientists to unravel the intricate relationships between plants and their habitats. Predictions from an SDM may include a species’ future distribution under climate change, a species’ past distribution to assess evolutionary relationships, or the potential future distribution of an invasive species. This knowledge becomes invaluable in crafting targeted conservation strategies, ensuring the preservation of Madagascar’s botanical treasures for generations to come.
Library all packages before starting.
library(pacman)
p_load(rgbif, DT, tidyverse, CoordinateCleaner, rnaturalearth, rnaturalearthdata, sf, mapview, terra, flexsdm, here, corrplot, SDMtune, virtualspecies, exactextractr)
Download GBIF occurrence data for this Xerochlamys bojeriana(Chamberlain and Boettiger 2017).
spp <- c("Xerochlamys bojeriana")
gbif_download <- occ_data(scientificName = spp, hasCoordinate = TRUE, country = 'MG', limit = 10000)
Select only the data and visualise the data and the interaction map.
gbif_data <- gbif_download$data
gbif_data %>% DT::datatable(extensions = 'Scroller',
options = list(scrollX = TRUE,
scrollY = 200,
scroller = TRUE))
gbif_data %>%
st_as_sf(coords = c('decimalLongitude', 'decimalLatitude'), crs = 4326) %>%
dplyr::select(year) %>%
mapview(layer.name = spp)
Use custom filter and cc_*() functions to
clean the data(Zizka et al. 2019).
gbif_data %>%
filter(!decimalLatitude == 0 | !decimalLongitude == 0) %>%
distinct(decimalLongitude,decimalLatitude,speciesKey,datasetKey, .keep_all = TRUE) -> gbif_filt
gbif_filt %>%
cc_cen(lat = 'decimalLatitude', lon = 'decimalLongitude', buffer = 2000) %>% # remove country centroids within 2km
cc_cap(lat = 'decimalLatitude', lon = 'decimalLongitude', buffer = 2000) %>% # remove capitals centroids within 2km
cc_inst(lat = 'decimalLatitude', lon = 'decimalLongitude', buffer = 2000) %>% # remove zoo and herbaria within 2km
cc_sea(lat = 'decimalLatitude', lon = 'decimalLongitude') -> spp_clean # remove from ocean
## Testing sea coordinates
## Testing biodiversity institutions
## Testing country capitals
## Testing country centroids
## Removed 2 records.
## Removed 3 records.
## Removed 0 records.
## Removed 1 records.
We have now removed another 6 records, leaving us with 241 records.
Download the country border with rnaturalearth package and
ne_countries() function and return it as an sf object.
mad <- ne_countries(country = 'Madagascar', scale = 'medium', returnclass = 'sf')
spp_clean %>%
dplyr::select(key, year, basisOfRecord, locality, coordinateUncertaintyInMeters, occurrenceRemarks, recordedBy, lon = decimalLongitude, lat = decimalLatitude) -> spp_sel
spp_sf <- st_as_sf(spp_sel, coords = c('lon', 'lat'))
st_crs(spp_sf) = 4326
Plot the spatial layers and show Interactive map.
# Spatial layers
ggplot() +
geom_sf(data = mad) +
geom_sf(data = spp_sf, size = 3, alpha = 0.5) +
labs(title = 'Plotting as simple features (sf)') +
theme_void()
# Interactive maps
spp_sf %>%
st_jitter(factor = 0.0001) %>%
mapview(zcol = 'year')
Load in worldclim locally. Plot mean annual temperature (C).
worldclim <- rast(here("data/rast/afr_worldclim.tif"))
names(worldclim)
## [1] "bio1" "bio2" "bio3" "bio4" "bio5" "bio6" "bio7" "bio8" "bio9"
## [10] "bio10" "bio11" "bio12" "bio13" "bio14" "bio15" "bio16" "bio17" "bio18"
## [19] "bio19"
names(worldclim) <- c("mean_ann_t","mean_diurnal_t_range","isothermality", "t_seas", 'max_t_warm_m','min_t_cold_m',"t_ann_range",'mean_t_wet_q','mean_t_dry_q','mean_t_warm_q','mean_t_cold_q','ann_p', 'p_wet_m','p_dry_m','p_seas','p_wet_q','p_dry_q','p_warm_q','p_cold_q')
plot(worldclim$mean_ann_t, main = 'mean annual temperature (C)')
Plot all climate factors.
mad <- st_transform(mad, 'EPSG:4326')
wc_mad <- worldclim %>%
crop(., vect(mad)) %>%
mask(., vect(mad))
plot(wc_mad)
Thin records by distance.
set.seed(42)
spp_filt <- occfilt_geo(
data = spp_sel %>% rename(x = lon, y = lat),
x = 'x',
y = 'y',
method = c('defined', d = 10), # set a 10 km threshold
env_layer = wc_mad,
prj = crs(wc_mad)
)
## Extracting values from raster ...
## Number of unfiltered records: 241
## Distance threshold(km): 10
## Number of filtered records: 46
Show localities included and filtered on one plot.
# join the result
spp_filt_sf <- spp_filt %>%
st_as_sf(coords = c('x','y'), crs = st_crs(4326))
# identify the localities that were removed
spp_removed_sf <- spp_sel %>%
filter(!key %in% spp_filt$key) %>%
st_as_sf(coords = c('lon','lat'), crs = st_crs(4326))
# Visualise output
mapview(spp_filt_sf, zcol = NULL, col.regions = 'blue', layer.name = 'localities included') +
mapview(spp_removed_sf, zcol = NULL, col.regions = 'red', layer.name = 'localities filtered')
Create random absence points in spaces outside of a 10 km buffer of presence points to accurate the model.
set.seed(42) # make sure the result is repeatable
pseudo_abs <- sample_pseudoabs(
data = spp_filt,
x = 'x',
y = 'y',
n = nrow(spp_filt),
method = c('geo_const', width = 10000),
rlayer = wc_mad,
calibarea = vect(mad)
)
head(pseudo_abs)
## # A tibble: 6 × 3
## x y pr_ab
## <dbl> <dbl> <dbl>
## 1 49.2 -17.6 0
## 2 46.8 -19.7 0
## 3 46.0 -17.4 0
## 4 45.0 -21.6 0
## 5 47.0 -15.7 0
## 6 45.7 -16.0 0
Visualise presences and pseudoabsences points on one plot, and show the result sf.
# process presence points to match pseudoabsences
pres <- spp_filt %>%
dplyr::select(x, y) %>%
mutate(pr_ab = 1)
head(pres)
## # A tibble: 6 × 3
## x y pr_ab
## <dbl> <dbl> <dbl>
## 1 46.8 -20.6 1
## 2 46.9 -20.5 1
## 3 45.4 -22.6 1
## 4 47.4 -20.3 1
## 5 45.3 -22.5 1
## 6 47.2 -19.0 1
# combine presences and pseudoabsences
all_pts <- rbind(pres, pseudo_abs)
# convert to spatial format
all_pts_sf <- all_pts %>%
mutate(pr_ab = as.factor(pr_ab)) %>%
st_as_sf(coords = c('x','y'), crs = st_crs(4326))
# visualise
mapview(all_pts_sf, zcol = 'pr_ab', layer.name = 'all points')
# Export locality data
write_sf(all_pts_sf,
here(paste0('output/species_localities/',gbif_data$genus[1],'_',gbif_data$specificEpithet[1],'_','all_points.shp')))
# Show locality data
DT::datatable(all_pts_sf)
Download the country border and return it as an sf object
spp_code <- gsub(' ', '_', spp)
all_pts <- vect(here(paste0('output/species_localities/',spp_code, '_all_points.shp')))
mad_sf <- ne_countries(country = 'Madagascar', scale = 'medium', returnclass = 'sf')
mad <- vect(mad_sf)
Load worldclim data.
worldclim <- rast(here("data/rast/afr_worldclim.tif"))
names(worldclim) <- c("mean_ann_t","mean_diurnal_t_range","isothermality", "t_seas", 'max_t_warm_m','min_t_cold_m',"t_ann_range",'mean_t_wet_q','mean_t_dry_q','mean_t_warm_q','mean_t_cold_q','ann_p', 'p_wet_m','p_dry_m','p_seas','p_wet_q','p_dry_q','p_warm_q','p_cold_q')
mad <- project(mad, 'EPSG:4326')
# crop and mask
wc_mad <- worldclim %>%
crop(., mad) %>%
mask(., mad)
Re-scale temperature values and visualise raw data.
wc_mad[[c(1:2,5:11)]] <- wc_mad[[c(1:2,5:11)]]/10
wc_mad[[3:4]] <- wc_mad[[3:4]]/100
plot(wc_mad$mean_ann_t)
plot(mad, add = T)
plot(all_pts, add = T)
Check projections.
crs(wc_mad) == crs(mad)
## [1] TRUE
crs(wc_mad) == crs(all_pts)
## [1] FALSE
crs(mad) == crs(all_pts)
## [1] FALSE
mad <- project(mad, wc_mad)
crs(wc_mad) == crs(mad)
## [1] TRUE
Check for collinearity 检查共线性
# Using Pearson correlation
cov_colin <- correct_colinvar(wc_mad, method = c('pearson', th = "0.7"))
# Take a look at the correlations using corrplot
corrplot(cov_colin$cor_table, type = 'lower', diag = FALSE, order = 'hclust', tl.cex = 0.6, tl.col = 'black', addCoef.col = 'black')
# remove collinearity
set.seed(42)
non_colin <- removeCollinearity(raster::stack(wc_mad), 0.7, method = 'pearson', plot = TRUE, select.variables = TRUE, sample.points = FALSE)
non_colin
## [1] "mean_ann_t" "mean_diurnal_t_range" "isothermality"
## [4] "t_seas" "mean_t_wet_q" "p_warm_q"
## [7] "p_wet_q" "p_seas"
# subset variables and run again x1
wc_mad_sel <- wc_mad[[non_colin]]
set.seed(42)
non_colin_check <- removeCollinearity(raster::stack(wc_mad_sel), 0.7, method = 'pearson', plot = TRUE, select.variables = TRUE, sample.points = FALSE)
# subset variables and run again x2
wc_mad_sel <- wc_mad_sel[[non_colin_check]]
set.seed(42)
non_colin_check_2 <- removeCollinearity(raster::stack(wc_mad_sel), 0.7, method = 'pearson', plot = TRUE, select.variables = TRUE, sample.points = FALSE)
## - No multicollinearity detected in your data at threshold 0.7
Extract data.
# Prepare a SWD (Sample with Data), which is a class of data specifically used in the SDMtune package
all_pts_df <- as.data.frame(all_pts, geom = 'XY')
SWDdata <- prepareSWD(
species = 'Aristida rufescens',
p = all_pts_df %>% filter(pr_ab == 1) %>% dplyr::select(x, y),
a = all_pts_df %>% filter(pr_ab == 0) %>% dplyr::select(x, y),
env = wc_mad_sel
)
##
ℹ Extracting predictor information for presence locations
✔ Extracting predictor information for presence locations [129ms]
##
ℹ Extracting predictor information for absence/background locations
✔ Extracting predictor information for absence/background locations [78ms]
# Inspect 10 random rows
sample_n(cbind(SWDdata@pa, SWDdata@data), 10)
## SWDdata@pa mean_ann_t mean_diurnal_t_range isothermality t_seas p_warm_q
## 1 0 25.6 13.2 0.69 15.42 470
## 2 0 23.0 9.0 0.62 22.40 1046
## 3 1 14.1 12.1 0.67 22.73 833
## 4 0 26.7 10.7 0.68 13.23 807
## 5 1 15.8 12.5 0.65 25.99 786
## 6 1 19.6 11.4 0.65 25.57 959
## 7 1 15.9 12.4 0.67 24.14 823
## 8 1 21.6 13.6 0.67 22.18 515
## 9 1 19.1 13.1 0.69 20.21 933
## 10 1 21.9 14.4 0.67 23.05 438
## p_wet_q p_seas
## 1 1051 112
## 2 1046 50
## 3 867 90
## 4 1075 124
## 5 786 84
## 6 959 83
## 7 841 90
## 8 515 104
## 9 933 101
## 10 498 102
Split locations in training (80%) and testing (20%) datasets.
split_data <- trainValTest(SWDdata, test = 0.2, seed = 42)
train <- split_data[[1]]
test <- split_data[[2]]
Run a random forest model.
set.seed(42)
rf_model <- train(method = c('RF'), ntrees = 500, data = train)
Predict probability for training points.
cbind(train@pa, predict(rf_model, data = train)) %>%
as.data.frame() %>%
rename(true_class = V1, predicted_class = V2) %>%
mutate(threshold_0.4 = ifelse(predicted_class >= 0.5, 1, 0),
threshold_0.5 = ifelse(predicted_class >= 0.5, 1, 0),
threshold_0.6 = ifelse(predicted_class >= 0.6, 1, 0)) %>%
DT::datatable()
Evaluate model.
cm <- confMatrix(rf_model, test = test, th = 0.5)
cm_format <- as.data.frame(rbind(c(cm[,2],cm[,3]), c(cm[,4],cm[,5])))
names(cm_format) <- c('Positive', 'Negative')
rownames(cm_format) <- c('Positive', 'Negative')
cm_format %>% DT::datatable()
# Accuracy
round((cm$tp + cm$tn)/sum(cm$tp, cm$tp, cm$fp, cm$fn)*100, 2)
## [1] 71.43
# True Skill Statistic (TSS)
tss(rf_model, test = test)
## [1] 0.8888889
# Receiver Operator Curve (ROC)
plotROC(rf_model, test = test)
# Area Under Curve (AUC)
auc(rf_model, test = test)
## [1] 0.962963
# Selecting a threshold
(ths <- thresholds(rf_model, test = test))
## Threshold value
## 1 Minimum training presence 0.672
## 2 Equal training sensitivity and specificity 0.672
## 3 Maximum training sensitivity plus specificity 0.672
## 4 Equal test sensitivity and specificity 0.724
## 5 Maximum test sensitivity plus specificity 0.810
## Fractional predicted area Training omission rate Test omission rate P-values
## 1 0 0.00000000 0.05555556 0
## 2 0 0.00000000 0.05555556 0
## 3 0 0.00000000 0.05555556 0
## 4 0 0.02702703 0.05555556 0
## 5 0 0.24324324 0.05555556 0
ths %>% DT::datatable()
cm_custom <- confMatrix(rf_model, test = test, th = 0.71)
# True Positive Rate (Sensitivity)
(TPR <- cm_custom$tp/(cm_custom$tp + cm_custom$fn))
## [1] 0.8888889
# False Positive Rate (1 - Specificity)
(FPR <- 1 - cm_custom$tn/(cm_custom$tn + cm_custom$fp))
## [1] 0.1111111
# Receiver Operator Curve (ROC) showing best threshold
plotROC(rf_model, test = test) +
geom_point(aes(x = FPR, y = TPR), col = 'red', size = 3) +
labs(x = 'False Positive Rate (1 - Specificity)',
y = 'True Positive Rate (Sensitivity)')
# Variable importance
vi <- varImp(rf_model, permut = 5)
plotVarImp(vi[,1:2])
Response curves.
SDMtune::plotResponse(rf_model, var = vi$Variable[1], marginal = TRUE, rug = TRUE)
SDMtune::plotResponse(rf_model, var = vi$Variable[2], marginal = TRUE, rug = TRUE)
Predict habitat suitability.
pred <- predict(rf_model, data = wc_mad_sel)
pred_df <- as.data.frame(pred, xy = TRUE)
ggplot() +
geom_tile(data = pred_df, aes(x = x, y = y, fill = lyr1)) +
scale_fill_gradientn(colours = c("#2c7bb6", "#abd9e9", "#ffffbf", "#fdae61", "#d7191c"),
na.value = NA, name = 'Habitat\nsuitability') +
geom_sf(data = mad_sf, fill = NA) +
geom_point(data = all_pts_df %>% filter(pr_ab == 1), aes(x = x, y = y)) +
labs(x = 'Longitude', y = 'Latitude') +
theme_minimal()
Predict habitat suitability with threshold
ths <- thresholds(rf_model, test = test)
th = ths[4,2]
print(paste('The chosen threshold value is:', th))
## [1] "The chosen threshold value is: 0.724"
pred_th <- pred >= th
pred_th_df <- as.data.frame(pred_th, xy = TRUE)
ggplot() +
geom_tile(data = pred_th_df, aes(x = x, y = y, fill = as.factor(lyr1))) +
scale_fill_manual(values = c("#2c7bb6", "#d7191c"), na.value = NA,
name = paste0('Habitat\nsuitability\nbinary (>= ',th,')'),
labels = c('absent','present','')) +
geom_sf(data = mad_sf, fill = NA) +
geom_point(data = all_pts_df %>% filter(pr_ab == 1), aes(x = x, y = y)) +
labs(x = 'Longitude', y = 'Latitude') +
theme_minimal()
Identify overlap with Protected Areas.
# load in protected area shapefiles. There are 3 files, so we want to load them all in together and then bind them into one file
prot_areas <- list.files(here('data/vect/WDPA_Madagascar'), pattern = '*.shp', full.names = TRUE)
prot_areas_list <- lapply(prot_areas, read_sf)
Bind the 3 files togther.
# convert to equal area projection
prot_areas_all <- bind_rows(prot_areas_list) %>% filter(MARINE == 0)
# convert the protected areas
prot_areas_all %>%
st_transform(crs = 'EPSG:29702') -> prot_areas_all_proj
# convert the presence/absence raster
pred_th %>%
project(.,vect(prot_areas_all_proj), method = 'near') -> pred_th_proj
## Warning: [project,SpatRaster] argument y (the crs) should be a character value
# visualise the different projections
par(mfrow = c(1,2))
plot(pred_th)
plot(vect(prot_areas_all), add = TRUE)
plot(pred_th_proj)
plot(vect(prot_areas_all_proj), add = TRUE)
We select and sum only the cells with 1’s, then multiply this by the
size of the raster cells and lastly divide this by meters to get a
result in km2.
pres_area <- (sum(pred_th_proj[] == 1, na.rm = TRUE) * (res(pred_th_proj)[1]*res(pred_th_proj)[2]) / (1000^2))
paste('The area of species presences is',pres_area, 'km2')
## [1] "The area of species presences is 64359.2795802828 km2"
# Calculate the area of all cells
all_area <- (sum(!is.na(pred_th_proj[])) * (res(pred_th_proj)[1]*res(pred_th_proj)[2]) / (1000^2))
paste('The area of all cells is',all_area, 'km2')
## [1] "The area of all cells is 615331.402308378 km2"
# And lastly calculate the percentage of coverage of our species across all of Madagascar
paste('The species presences cover',round(pres_area/all_area*100, 2), '% of Madagascar')
## [1] "The species presences cover 10.46 % of Madagascar"
Create custom function to calculate the proportion of area covered by each Protected Area.
sum_cover <- function(x){
list(x %>%
group_by(value) %>%
summarize(total_area = sum(coverage_area)))
}
# extract the amount of area covered
extract_all <- exact_extract(pred_th_proj, prot_areas_all_proj, coverage_area = TRUE, summarize_df = TRUE, fun = sum_cover)
##
|
| | 0%
|
|= | 1%
|
|= | 2%
|
|== | 3%
|
|=== | 4%
|
|==== | 5%
|
|==== | 6%
|
|===== | 7%
|
|===== | 8%
|
|====== | 9%
|
|======= | 9%
|
|======= | 10%
|
|======== | 11%
|
|======== | 12%
|
|========= | 13%
|
|========== | 14%
|
|========== | 15%
|
|=========== | 15%
|
|=========== | 16%
|
|============ | 17%
|
|============= | 18%
|
|============= | 19%
|
|============== | 20%
|
|============== | 21%
|
|=============== | 21%
|
|================ | 22%
|
|================ | 23%
|
|================= | 24%
|
|================= | 25%
|
|================== | 26%
|
|=================== | 26%
|
|=================== | 27%
|
|==================== | 28%
|
|==================== | 29%
|
|===================== | 30%
|
|====================== | 31%
|
|====================== | 32%
|
|======================= | 32%
|
|======================= | 33%
|
|======================== | 34%
|
|========================= | 35%
|
|========================= | 36%
|
|========================== | 37%
|
|========================== | 38%
|
|=========================== | 38%
|
|============================ | 39%
|
|============================ | 40%
|
|============================= | 41%
|
|============================= | 42%
|
|============================== | 43%
|
|=============================== | 44%
|
|================================ | 45%
|
|================================ | 46%
|
|================================= | 47%
|
|================================== | 48%
|
|================================== | 49%
|
|=================================== | 50%
|
|==================================== | 51%
|
|==================================== | 52%
|
|===================================== | 53%
|
|====================================== | 54%
|
|====================================== | 55%
|
|======================================= | 56%
|
|======================================== | 57%
|
|========================================= | 58%
|
|========================================= | 59%
|
|========================================== | 60%
|
|========================================== | 61%
|
|=========================================== | 62%
|
|============================================ | 62%
|
|============================================ | 63%
|
|============================================= | 64%
|
|============================================= | 65%
|
|============================================== | 66%
|
|=============================================== | 67%
|
|=============================================== | 68%
|
|================================================ | 68%
|
|================================================ | 69%
|
|================================================= | 70%
|
|================================================== | 71%
|
|================================================== | 72%
|
|=================================================== | 73%
|
|=================================================== | 74%
|
|==================================================== | 74%
|
|===================================================== | 75%
|
|===================================================== | 76%
|
|====================================================== | 77%
|
|====================================================== | 78%
|
|======================================================= | 79%
|
|======================================================== | 79%
|
|======================================================== | 80%
|
|========================================================= | 81%
|
|========================================================= | 82%
|
|========================================================== | 83%
|
|=========================================================== | 84%
|
|=========================================================== | 85%
|
|============================================================ | 85%
|
|============================================================ | 86%
|
|============================================================= | 87%
|
|============================================================== | 88%
|
|============================================================== | 89%
|
|=============================================================== | 90%
|
|=============================================================== | 91%
|
|================================================================ | 91%
|
|================================================================= | 92%
|
|================================================================= | 93%
|
|================================================================== | 94%
|
|================================================================== | 95%
|
|=================================================================== | 96%
|
|==================================================================== | 97%
|
|===================================================================== | 98%
|
|===================================================================== | 99%
|
|======================================================================| 100%
Add the names of the protected areas back on to our extraction
names(extract_all) <- prot_areas_all_proj$ORIG_NAME
# convert the list to a data frame
extract_df <- bind_rows(extract_all, .id = 'ORIG_NAME')
# take a look at the first 6 rows
head(extract_df)
## # A tibble: 6 × 3
## ORIG_NAME value total_area
## <chr> <dbl> <dbl>
## 1 Ankarafantsika 0 1695222347.
## 2 Andohahela 0 872005779.
## 3 Marojejy 0 752626644.
## 4 Tsaratanana 0 1084988320.
## 5 Tsaratanana 1 405561313.
## 6 Tsimanampesotse 0 2629427415.
Sum all of the area that overlaps with the protected areas for presences (i.e. 1’s) and divide this by the total area of all presences.
area_under_pas <- extract_df %>%
filter(value == 1) %>%
summarise(sum(total_area)/(1000^2))
paste(round(area_under_pas/pres_area * 100, 2),'% of the predicted presences are found within protected areas')
## [1] "18.5 % of the predicted presences are found within protected areas"
# provides important context on both the quality and quantity of protected areas overlapping with our species range:
iucn_cat <- prot_areas_all_proj %>%
st_drop_geometry() %>%
dplyr::select(ORIG_NAME, IUCN_CAT)
extract_df %>%
left_join(iucn_cat, by = 'ORIG_NAME', multiple = "all") %>%
filter(value == 1) %>%
group_by(IUCN_CAT) %>%
summarise(area = sum(total_area)/(1000^2)) %>%
mutate(perc = round(area/sum(area) * 100, 2))
## # A tibble: 7 × 3
## IUCN_CAT area perc
## <chr> <dbl> <dbl>
## 1 II 2471. 20.8
## 2 IV 790. 6.63
## 3 Ia 406. 3.41
## 4 Not Applicable 395. 3.32
## 5 Not Reported 3493. 29.3
## 6 V 2773. 23.3
## 7 VI 1582. 13.3
mean_ann_t, t_seas, p_wet_q,
isothermality, p_seas,
mean_diurnal_t_range and p_warm_q. Treat these
variables as predictors to build a model. Due to limited
data, 80% of the dataset was used as the training set and 20% as the
testing set for machine learning training. Use confusion matrice to
evaluate the model. Results here suggest that there are 9 true positives
and 6 true negatives, 3 false positives and no false negatives. TO
calculate the number of observations correctly classified (true) over
all of the observations, we can use formula below: \[
Accuracy = ((True Positive + True Negative) / Total Observations) * 100
\] In this case, the formula is: \[
Accuracy = ((9 + 6) / 21 ) * 100
\] \[
Accuracy = 71.43 %
\] Value of accuracy in this case is above 70%, which means may
be doing a decent to very good job.In order to make the model evaluation more accurate, there are a few other metrics we could use, such as the True Skill Statistic (TSS) or the Area Under Curve (AUC). The formula for TSS is: \[ TSS = True Positive Rate + True Negative Rate - 1 \]
where:
\[ True Positive Rate = True Positives / (True Positives + False Negatives) \] and:
\[ True Negative Rate = True Negatives / (True Negatives + False Positives) \] In this case, True Positive Rate: \[ True Positive Rate = 9 / (9 + 0) = 1 \] True Negative Rate: \[ True Negative Rate = 6 / (6 + 3) = 0.67 \] TSS formula:
\[
TSS = True Positive Rate + True Negative Rate - 1 = 1 + 0.67 - 1 = 0.67
\] AUC is calculated as the integral of the area under the curve
of the receiver operator curve (ROC). It is a performance metric that
measures the quality of machine learning. In this case, AUC value is
0.96, which means the model performs well and is close to perfect. - By
calculating the variable importance, the result suggests that
p_wet_q(49.8%) and mean_ann_t(46.5%) are the
most important variables in this model. Meanwhile, these two indicators
are also within the 7 indicators after excluding collinearity, and can
be considered effective. - Xerochlamys bojeriana’s predicted
habitats are mostly in the middle of Madagascar. Also, on the top and
bottom of Madagascar, Xerochlamys bojeriana has a has a high
probability of survival. - The area of species presences is 84700.81
km2. The area of all cells is 615331.40 km2. The species presences cover
13.77 % of Madagascar. 19.82 % of the predicted presences are found
within protected areas.
In conclusion, Xerochlamys bojeriana has sufficient locality
data to accurately predict its species distribution. Metrics such as
Accuracy, TSS, and AUC were
employed to evaluate the model, and the results are reliable. Typically,
a model with an Accuracy greater than 0.7 is considered a good fit. In
this case, the model’s Accuracy is 71.43%. However, it’s essential to
note that Accuracy is a simplistic measure, and some argue it may not be
entirely reliable. Therefore, additional evaluation using TSS and AUC
was conducted, and all metrics yielded positive results.
The variable mean_t_dry_q might also play a significant
role in determining the species’ distribution. The locations with higher
growth probability due to this factor closely align with the actual
distribution of the species, suggesting its importance. Notably, this
variable was excluded during the collinearity analysis step.
According to the calculation results,
precipitation of wettest quarter and
annual average temperature are the biggest factors
affecting the distribution of this species. Xerochlamys
bojeriana is a shrub and grows primarilly in the seasonally dry
tropical biome. Madagascar is effected by the combination of
southeastern trade winds and northwestern monsoons. That produces a hot
rainy season and a relatively cooler dry season. According to the
species distribution modelling results, Xerochlamys bojeriana
mainliy grows in the middle of the country, which are highlands of the
country. Highlands are both drier and cooler than other places in
Madagascar. And the plant here mainly grows during the rainy season
rather than the dry season, so the precipitation during the rainy season
is particularly important for this plant. According to the probability
of presence on mean_ann_t, these two variables exhibit a
nearly linear relationship, namely the lower the
mean_ Ann_t, the higher the
probability of presence. This is probably related to the
fact that the plant grows at a cooler highland.
Cross-validation is a resampling method that uses
different portions of the data to test and train a model on different
iterations. Using different data for model training and testing may lead
to different model results. If the data is evenly divided into
n equal parts, a portion is taken each time as the test set,
and the remaining portion is used as the training set. After repeating
this process n times and taking the mean for modeling, this
problem can be avoided. It is called
K-fold Cross Validation, a kind of cross-validation
way.
In addition, there are also some methods that can be solved from the perspective of machine learning model computation. But in addition, more factors can also be considered to be added. For example, if it is necessary to improve the robustness of habitat prediction, taking into account the evolution trends of soil and climate during prediction can make the predicted results have temporal robustness.