This report uses unsupervised learning to build a simple global typology of national biodiversity patterns. The goal is to group countries into a small number of clusters based on:
The analysis follows the strategy document “A Global Typology of National Biodiversity” and uses data from the World Bank dataset:
National level dataset of Species Endemism, Small Occurrence and Species at Risk data https://datacatalog.worldbank.org/search/dataset/0066034/global-biodiversity-data
All clustering is done on standardized indicators and is purely data-driven (no labels or targets).
The objective of this analysis is to classify countries into distinct groups based on their biodiversity characteristics. Rather than a simple ranking (e.g., “most biodiverse”), we aim to create a typology-a set of categories that describes the nature of biodiversity and risk in different parts of the world. This helps in understanding which countries face similar conservation challenges.
Scope: 237 national and territorial entities.
Variables: Counts of total species, endemic species (unique to that location), and threatened species.
The first code chunk loads all required R packages and sets a random seed for reproducibility.
library(cluster)
library(factoextra)
library(flexclust)
library(NbClust)
library(hopkins)
library(ClustGeo)
library(psych)
library(rstatix)
library(FeatureImpCluster)
library(data.table)
library(ape)
library(stats)
library(gridExtra)
library(sf)
library(rnaturalearth)
library(dplyr)
library(ggplot2)
library(knitr)
set.seed(123)
In this step the script:
national_species_summary_ddh.csv.We start with 238 country–territory records. After dropping rows with missing values, 237 remain.
setwd("C:/Users/natal/OneDrive/Pulpit/DSBA - UW/Semestr 1/Unsupervised Learning/Final Projects")
data_raw <- read.csv("national_species_summary_ddh.csv")
head(data_raw)
## administrative.entity admin_code land.and.or.EEZ Number.of.endemics.at.100.
## 1 Albania ALB land-eez 33
## 2 Algeria DZA land-eez 98
## 3 American Samoa ASM land-eez 15
## 4 Angola AGO land-eez 213
## 5 Anguilla AIA land-eez 2
## 6 Antigua and Barbuda ATG land-eez 5
## Number.of.species.with.small.occurrence.regions..50x50.
## 1 895
## 2 613
## 3 127
## 4 469
## 5 337
## 6 370
## Number.of.species.with.extinction.threat.probability.gte.80
## 1 165
## 2 137
## 3 46
## 4 73
## 5 140
## 6 157
## Number.of.endemic.species.with.small.occurrence.regions..50x50.
## 1 31
## 2 47
## 3 15
## 4 89
## 5 2
## 6 4
## Number.of.species.with.small.occurrence.regions..50x50..and.extinction.threat.probability.gte.80
## 1 118
## 2 79
## 3 9
## 4 46
## 5 36
## 6 33
## Number.of.endemic.species.with.extinction.threat.probability.gte.80
## 1 5
## 2 23
## 3 3
## 4 25
## 5 1
## 6 0
## Number.of.endemic.species.with.small.occurrence.regions..50x50..and.extinction.threat.probability.gte.80
## 1 4
## 2 20
## 3 3
## 4 17
## 5 1
## 6 0
## Number.of.total.species
## 1 16505
## 2 9943
## 3 1851
## 4 7872
## 5 6097
## 6 6906
# Data Structure
str(data_raw)
## 'data.frame': 238 obs. of 11 variables:
## $ administrative.entity : chr "Albania" "Algeria" "American Samoa" "Angola" ...
## $ admin_code : chr "ALB" "DZA" "ASM" "AGO" ...
## $ land.and.or.EEZ : chr "land-eez" "land-eez" "land-eez" "land-eez" ...
## $ Number.of.endemics.at.100. : int 33 98 15 213 2 5 1694 7 54427 10 ...
## $ Number.of.species.with.small.occurrence.regions..50x50. : int 895 613 127 469 337 370 1596 460 13148 508 ...
## $ Number.of.species.with.extinction.threat.probability.gte.80 : int 165 137 46 73 140 157 716 202 5540 84 ...
## $ Number.of.endemic.species.with.small.occurrence.regions..50x50. : int 31 47 15 89 2 4 624 7 11533 8 ...
## $ Number.of.species.with.small.occurrence.regions..50x50..and.extinction.threat.probability.gte.80 : int 118 79 9 46 36 33 324 48 3378 69 ...
## $ Number.of.endemic.species.with.extinction.threat.probability.gte.80 : int 5 23 3 25 1 0 289 1 4831 7 ...
## $ Number.of.endemic.species.with.small.occurrence.regions..50x50..and.extinction.threat.probability.gte.80: int 4 20 3 17 1 0 192 1 3189 7 ...
## $ Number.of.total.species : int 16505 9943 1851 7872 6097 6906 20209 8656 80856 7789 ...
# Data Summary - Checking for NAs
summary(data_raw)
## administrative.entity admin_code land.and.or.EEZ
## Length:238 Length:238 Length:238
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Number.of.endemics.at.100.
## Min. : 0
## 1st Qu.: 15
## Median : 68
## Mean : 1120
## 3rd Qu.: 358
## Max. :54427
## NA's :1
## Number.of.species.with.small.occurrence.regions..50x50.
## Min. : 41
## 1st Qu.: 338
## Median : 553
## Mean : 1113
## 3rd Qu.: 1233
## Max. :13148
## NA's :1
## Number.of.species.with.extinction.threat.probability.gte.80
## Min. : 4.0
## 1st Qu.: 78.0
## Median : 148.0
## Mean : 413.2
## 3rd Qu.: 403.0
## Max. :6456.0
## NA's :1
## Number.of.endemic.species.with.small.occurrence.regions..50x50.
## Min. : 0.0
## 1st Qu.: 12.0
## Median : 49.0
## Mean : 393.8
## 3rd Qu.: 197.0
## Max. :11533.0
## NA's :1
## Number.of.species.with.small.occurrence.regions..50x50..and.extinction.threat.probability.gte.80
## Min. : 1.0
## 1st Qu.: 39.0
## Median : 79.0
## Mean : 231.6
## 3rd Qu.: 205.0
## Max. :3378.0
## NA's :1
## Number.of.endemic.species.with.extinction.threat.probability.gte.80
## Min. : 0.0
## 1st Qu.: 3.0
## Median : 16.0
## Mean : 177.5
## 3rd Qu.: 67.0
## Max. :4831.0
## NA's :1
## Number.of.endemic.species.with.small.occurrence.regions..50x50..and.extinction.threat.probability.gte.80
## Min. : 0.0
## 1st Qu.: 3.0
## Median : 15.0
## Mean : 132.5
## 3rd Qu.: 51.0
## Max. :3189.0
## NA's :1
## Number.of.total.species
## Min. : 444
## 1st Qu.: 4685
## Median : 8712
## Mean : 14795
## 3rd Qu.: 18977
## Max. :102900
## NA's :1
new_names <- c(
"admin_entity", "admin_code", "land_eez",
"n_endemic_100", "n_small_occ", "n_threat",
"n_endemic_small_occ", "n_small_occ_threat", "n_endemic_threat",
"n_endemic_small_occ_threat", "n_total"
)
if (length(names(data_raw)) == length(new_names)) {
names(data_raw) <- new_names
} else {
stop(paste("Column count mismatch. Expected", length(new_names), "columns but found", length(names(data_raw)), "."))
}
data_clean <- na.omit(data_raw)
print(paste("Loaded", nrow(data_raw), "rows. After removing NAs,", nrow(data_clean), "rows remain."))
## [1] "Loaded 238 rows. After removing NAs, 237 rows remain."
data_identifiers <- data_clean[, c("admin_entity", "admin_code", "land_eez")]
data_numeric_raw <- data_clean[, new_names[!(new_names %in% c("admin_entity", "admin_code", "land_eez"))]]
The admin_entity, admin_code, and
land_eez columns are used as identifiers (country name, ISO
code, and land/EEZ flag). The remaining columns contain counts of
endemic species, small-range species, threatened species, and their
overlaps.
Instead of clustering directly on raw counts (which are strongly influenced by country size), the analysis builds five ratio-based indicators:
Division by zero is handled by replacing resulting NA
with 0. Finally, all five indicators are standardized (z-scores) so that
each has mean 0 and standard deviation 1.
# Creating a new data frame for the 5 engineered features
data_features <- data.frame(matrix(ncol = 5, nrow = nrow(data_numeric_raw)))
names(data_features) <- c(
"Prop_Endemic", "Prop_At_Risk", "Prop_Small_Occurrence",
"Risk_Concentration_Endemic", "Risk_Concentration_Small_Occ"
)
# 1. Prop_Endemic (Proportion of Endemism)
data_features$Prop_Endemic <- data_numeric_raw$n_endemic_100 / data_numeric_raw$n_total
# 2. Prop_At_Risk (Proportion of Total Biodiversity at Risk)
data_features$Prop_At_Risk <- data_numeric_raw$n_threat / data_numeric_raw$n_total
# 3. Prop_Small_Occurrence (Proportion of Geographically Concentrated Species)
data_features$Prop_Small_Occurrence <- data_numeric_raw$n_small_occ / data_numeric_raw$n_total
# 4. Risk_Concentration_Endemic (Risk Concentration among Endemic Species)
data_features$Risk_Concentration_Endemic <- data_numeric_raw$n_endemic_threat / data_numeric_raw$n_endemic_100
# 5. Risk_Concentration_Small_Occ (Risk Concentration among Small Occurrence Species)
data_features$Risk_Concentration_Small_Occ <- data_numeric_raw$n_small_occ_threat / data_numeric_raw$n_small_occ
data_features[is.na(data_features)] <- 0
data_scaled <- scale(data_features)
These engineered features allow us to compare very different countries (for example, small islands and large continental states) on a common scale.
Before choosing a clustering method, it is helpful to check whether the data actually contain cluster structure.
The Hopkins statistic tests whether the data are more “clustered” than a random uniform distribution. Values close to 1 indicate strong cluster tendency; values around 0.5 suggest randomness.
hopkins_stat_result <- hopkins(data_scaled)
hopkins_stat_result
## [1] 0.9999822
In this analysis the Hopkins statistic is approximately 0.9999, which is extremely high. This means that the biodiversity indicators are very far from random and are suitable for clustering.
We compute an ordered dissimilarity matrix (ODM) based on Euclidean distances and visualize it as a heatmap.
dist_matrix_odm <- get_dist(data_scaled, method = "euclidean")
fviz_dist(
dist_matrix_odm,
show_labels = FALSE,
gradient = list(low = "darkblue", mid = "white", high = "#FC4E07")
) +
labs(title = "Visual Assessment of Trends (VAT / ODM)")
Blocks of similar colours along the diagonal correspond to groups of countries with similar biodiversity profiles. The heatmap shows several such blocks, which visually confirms the strong cluster tendency indicated by the Hopkins statistic.
To decide how many clusters to use, three standard methods are applied to K-means clustering:
# 1. Elbow Method (WSS)
print(fviz_nbclust(data_scaled, kmeans, method = "wss") +
labs(title = "Elbow Method (WSS)"))
# 2. Average Silhouette Method
print(fviz_nbclust(data_scaled, kmeans, method = "silhouette") +
labs(title = "Average Silhouette Method"))
# 3. Gap Statistic Method
print(fviz_nbclust(data_scaled, kmeans, method = "gap_stat", nboot = 50) +
labs(title = "Gap Statistic Method"))
Interpretation:
To explore both possibilities, the analysis continues with k = 2 (more compact) and k = 3 (more detailed) in the next section.
This part uses two partitioning methods:
Two scenarios are tested:
k_optimal = 2 (from the silhouette method),k_alternative = 3 (from the elbow method and VAT
heatmap).k_optimal <- 2
k_alternative <- 3
set.seed(123)
# K-Means Model (k=2)
model_kmeans_k2 <- kmeans(data_scaled, centers = k_optimal, nstart = 25)
print(model_kmeans_k2)
## K-means clustering with 2 clusters of sizes 31, 206
##
## Cluster means:
## Prop_Endemic Prop_At_Risk Prop_Small_Occurrence Risk_Concentration_Endemic
## 1 1.8653927 1.9589414 1.7778007 -0.051982188
## 2 -0.2807144 -0.2947921 -0.2675331 0.007822562
## Risk_Concentration_Small_Occ
## 1 1.7656863
## 2 -0.2657101
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 1 2 2 2 1 1 2 2 2 1 2
## [38] 2 2 2 1 2 2 1 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 2 2 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2
## [112] 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 1 1 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
## [186] 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 203.4500 491.5743
## (between_SS / total_SS = 41.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
fviz_cluster(
model_kmeans_k2,
data = data_scaled,
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal(),
main = "K-Means Clustering (k=2)"
)
# K-Means Model (k=3)
model_kmeans_k3 <- kmeans(data_scaled, centers = k_alternative, nstart = 25)
print(model_kmeans_k3)
## K-means clustering with 3 clusters of sizes 101, 16, 120
##
## Cluster means:
## Prop_Endemic Prop_At_Risk Prop_Small_Occurrence Risk_Concentration_Endemic
## 1 -0.0955359 0.2278040 0.0128061 0.7290054
## 2 3.0782215 2.6118455 2.6549181 -0.3153484
## 3 -0.3300202 -0.5399811 -0.3647676 -0.5715331
## Risk_Concentration_Small_Occ
## 1 0.5241211
## 2 1.8014436
## 3 -0.6813277
##
## Clustering vector:
## [1] 3 3 3 3 1 3 1 3 2 1 3 3 3 3 3 3 1 1 1 2 1 3 3 1 3 1 1 3 3 3 2 2 1 3 3 1 3
## [38] 1 3 1 2 3 1 1 3 1 1 3 1 1 1 1 1 3 3 1 3 3 1 1 2 3 3 3 1 1 3 1 1 1 1 1 1 3
## [75] 1 1 3 3 1 1 1 3 3 3 1 1 1 2 3 3 1 3 3 1 3 1 1 1 1 1 2 1 3 1 3 3 1 2 1 3 3
## [112] 1 3 3 1 3 3 2 2 1 3 3 1 1 1 3 1 3 1 2 1 3 1 3 1 1 3 3 1 2 3 3 1 1 3 3 1 3
## [149] 1 3 1 3 3 2 1 2 3 3 3 3 3 3 1 1 1 1 3 3 3 3 3 3 1 3 3 3 1 1 2 1 1 3 1 1 3
## [186] 3 1 3 3 1 1 1 3 1 3 3 3 3 1 3 3 3 3 3 1 1 3 3 1 3 3 3 3 3 3 3 3 3 3 1 3 3
## [223] 1 1 3 1 3 3 3 1 1 3 1 1 1 3 3
##
## Within cluster sum of squares by cluster:
## [1] 258.90021 96.39229 151.13102
## (between_SS / total_SS = 57.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
fviz_cluster(
model_kmeans_k3,
data = data_scaled,
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal(),
main = "K-Means Clustering (k=3)"
)
# PAM Model (k=2)
model_pam_k2 <- pam(data_scaled, k = k_optimal)
print(model_pam_k2)
## Medoids:
## ID Prop_Endemic Prop_At_Risk Prop_Small_Occurrence
## [1,] 172 -0.3077431 -0.3475631 -0.2724983
## [2,] 165 0.4410704 0.8886212 0.4464451
## Risk_Concentration_Endemic Risk_Concentration_Small_Occ
## [1,] -0.10015024 -0.4070351
## [2,] 0.03519766 1.4972228
## Clustering vector:
## [1] 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 2 2 1 1 1 2 2 1 1 1 2 1
## [38] 1 1 1 2 1 1 2 1 1 1 1 2 2 2 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 1 2 1 1 1
## [75] 1 2 1 1 1 2 2 1 1 1 2 2 1 2 1 1 2 1 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2 2 1 1 1
## [112] 2 1 1 1 1 1 2 2 1 1 1 2 1 2 1 1 1 2 2 2 1 1 1 2 2 1 1 1 2 1 1 2 1 1 1 1 1
## [149] 2 1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1
## [186] 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## Objective function:
## build swap
## 1.549300 1.501007
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
fviz_cluster(
model_pam_k2,
data = data_scaled,
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal(),
main = "PAM Clustering (k=2)"
)
# PAM Model (k=3)
model_pam_k3 <- pam(data_scaled, k = k_alternative)
print(model_pam_k3)
## Medoids:
## ID Prop_Endemic Prop_At_Risk Prop_Small_Occurrence
## [1,] 89 -0.4460285 -0.3735735 -0.2970748
## [2,] 46 -0.2690618 -0.4485538 -0.2116820
## [3,] 36 0.9068746 1.3936975 0.7277914
## Risk_Concentration_Endemic Risk_Concentration_Small_Occ
## [1,] -0.5806052 -0.58383933
## [2,] 0.8693392 -0.01175344
## [3,] -0.1161640 1.88626341
## Clustering vector:
## [1] 1 1 1 1 2 1 3 1 3 2 1 1 1 1 1 1 2 1 2 3 2 1 2 3 1 2 3 1 1 1 3 3 2 1 1 3 1
## [38] 2 1 2 3 1 2 3 1 2 2 1 2 3 3 2 2 1 1 2 1 1 2 3 3 1 1 1 2 2 2 3 3 2 2 2 2 1
## [75] 2 2 1 2 2 3 3 1 2 1 2 2 1 3 1 1 3 1 1 3 1 2 2 2 2 2 3 3 1 2 1 1 2 3 2 2 1
## [112] 2 1 1 1 1 1 3 3 2 1 1 3 2 2 1 2 1 3 3 3 1 2 1 3 3 1 1 1 3 1 1 3 2 1 1 1 1
## [149] 2 1 2 1 1 3 3 3 1 1 1 1 1 2 3 2 3 2 1 2 1 1 1 1 2 2 1 1 2 2 3 3 2 1 2 2 1
## [186] 1 3 2 1 2 2 2 2 2 2 1 1 2 2 1 1 1 2 1 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2
## [223] 2 2 1 2 2 1 1 2 2 1 2 2 2 1 1
## Objective function:
## build swap
## 1.371894 1.300829
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
fviz_cluster(
model_pam_k3,
data = data_scaled,
ellipse.type = "convex",
palette = "jco",
ggtheme = theme_minimal(),
main = "PAM Clustering (k=3)"
)
Key observations:
These results suggest that more than one reasonable partition exists, which motivates the use of hierarchical clustering to better understand the structure.
Hierarchical clustering is applied using a Ward linkage on the same distance matrix. This method builds a tree (dendrogram) of nested clusters and does not require specifying k in advance.
# 1. Distance Matrix
dist_matrix_hc <- dist_matrix_odm
# 2. Model Building (Ward's method)
model_hc <- hclust(dist_matrix_hc, method = "ward.D2")
# 3. Visualization (Dendrogram)
fviz_dend(
model_hc,
k = k_optimal, # k=2
rect = TRUE,
k_colors = "jco",
cex = 0.5,
main = paste("Hierarchical Clustering Dendrogram (k=", k_optimal, ")")
)
fviz_dend(
model_hc,
k = k_alternative, # k=3
rect = TRUE,
k_colors = "jco",
cex = 0.5,
main = paste("Hierarchical Clustering Dendrogram (k=", k_alternative, ")")
)
# 4. Tree Cutting and Inertia Validation
total_inertia <- inertdiss(dist_matrix_hc)
print("Hierarchical Clustering Validation")
## [1] "Hierarchical Clustering Validation"
print(paste("Total Inertia:", round(total_inertia, 2)))
## [1] "Total Inertia: 4.98"
# Validation for k=2
clusters_hc_k2 <- cutree(model_hc, k = k_optimal)
within_inertia_k2 <- withindiss(dist_matrix_hc, clusters_hc_k2)
q_metric_k2 <- 1 - (within_inertia_k2 / total_inertia)
print(paste("Within-Cluster Inertia (k=2):", round(within_inertia_k2, 2)))
## [1] "Within-Cluster Inertia (k=2): 3.06"
print(paste("Q Metric (Proportion of Variance Explained) (k=2):", round(q_metric_k2, 4)))
## [1] "Q Metric (Proportion of Variance Explained) (k=2): 0.3847"
# Validation for k=3
clusters_hc_k3 <- cutree(model_hc, k = k_alternative)
within_inertia_k3 <- withindiss(dist_matrix_hc, clusters_hc_k3)
q_metric_k3 <- 1 - (within_inertia_k3 / total_inertia)
print(paste("Within-Cluster Inertia (k=3):", round(within_inertia_k3, 2)))
## [1] "Within-Cluster Inertia (k=3): 2.23"
print(paste("Q Metric (Proportion of Variance Explained) (k=3):", round(q_metric_k3, 4)))
## [1] "Q Metric (Proportion of Variance Explained) (k=3): 0.553"
Interpretation:
Because the k = 3 solution explains substantially more variance and matches the visual structure of the dendrogram, the analysis chooses hierarchical clustering with k = 3 as the final model.
The final step assigns each country to one of the three hierarchical clusters and computes descriptive statistics for each feature.
k_final <- 3
clusters_final <- clusters_hc_k3
data_profile_numeric <- data_features
data_profile_numeric$Cluster <- as.factor(clusters_final)
data_final_profile <- cbind(data_identifiers, data_profile_numeric)
print(head(data_final_profile))
## admin_entity admin_code land_eez Prop_Endemic Prop_At_Risk
## 1 Albania ALB land-eez 0.0019993941 0.009996971
## 2 Algeria DZA land-eez 0.0098561802 0.013778538
## 3 American Samoa ASM land-eez 0.0081037277 0.024851432
## 4 Angola AGO land-eez 0.0270579268 0.009273374
## 5 Anguilla AIA land-eez 0.0003280302 0.022962113
## 6 Antigua and Barbuda ATG land-eez 0.0007240081 0.022733855
## Prop_Small_Occurrence Risk_Concentration_Endemic Risk_Concentration_Small_Occ
## 1 0.05422599 0.1515152 0.13184358
## 2 0.06165141 0.2346939 0.12887439
## 3 0.06861156 0.2000000 0.07086614
## 4 0.05957825 0.1173709 0.09808102
## 5 0.05527309 0.5000000 0.10682493
## 6 0.05357660 0.0000000 0.08918919
## Cluster
## 1 1
## 2 1
## 3 1
## 4 1
## 5 2
## 6 1
profile_stats <- describeBy(
data_profile_numeric,
group = data_profile_numeric$Cluster
)
# Simplified mean-based profile
for (i in 1:k_final) {
print(paste("Cluster", i))
cluster_data <- profile_stats[[i]]
print(cluster_data[, c("n", "mean", "sd", "min", "max")])
}
## [1] "Cluster 1"
## n mean sd min max
## Prop_Endemic 108 0.01 0.02 0.00 0.07
## Prop_At_Risk 108 0.01 0.01 0.00 0.03
## Prop_Small_Occurrence 108 0.06 0.01 0.04 0.12
## Risk_Concentration_Endemic 108 0.11 0.09 0.00 0.29
## Risk_Concentration_Small_Occ 108 0.11 0.04 0.01 0.17
## Cluster 108 1.00 0.00 1.00 1.00
## [1] "Cluster 2"
## n mean sd min max
## Prop_Endemic 114 0.03 0.03 0.00 0.13
## Prop_At_Risk 114 0.02 0.01 0.01 0.06
## Prop_Small_Occurrence 114 0.07 0.02 0.05 0.16
## Risk_Concentration_Endemic 114 0.33 0.15 0.00 1.00
## Risk_Concentration_Small_Occ 114 0.18 0.05 0.06 0.33
## Cluster 114 2.00 0.00 2.00 2.00
## [1] "Cluster 3"
## n mean sd min max
## Prop_Endemic 15 0.32 0.15 0.15 0.67
## Prop_At_Risk 15 0.06 0.02 0.04 0.09
## Prop_Small_Occurrence 15 0.14 0.03 0.11 0.21
## Risk_Concentration_Endemic 15 0.16 0.05 0.09 0.28
## Risk_Concentration_Small_Occ 15 0.27 0.04 0.20 0.35
## Cluster 15 3.00 0.00 3.00 3.00
Based on the means printed by describeBy, the three
clusters can be interpreted as follows:
Cluster 1 – “Low-Profile” biodiversity (n = 108)
Cluster 2 – “Hidden Risk” biodiversity (n = 114)
Cluster 3 – “Endemic Hotspots” (n = 15)
To check which indicators truly separate the clusters, one-way ANOVA tests are run for each feature using cluster membership as the factor.
for (var_name in colnames(data_features)) {
formula_str <- paste(var_name, "~ Cluster")
anova_result <- anova_test(data = data_profile_numeric, formula = as.formula(formula_str))
print(paste(
"Variable:", var_name,
"| p-value:", anova_result$p, 10,
"| Significance:", anova_result$p < 0.05
))
}
## [1] "Variable: Prop_Endemic | p-value: 9.35e-67 10 | Significance: TRUE"
## [1] "Variable: Prop_At_Risk | p-value: 1.52e-48 10 | Significance: TRUE"
## [1] "Variable: Prop_Small_Occurrence | p-value: 1.6e-34 10 | Significance: TRUE"
## [1] "Variable: Risk_Concentration_Endemic | p-value: 6.07e-30 10 | Significance: TRUE"
## [1] "Variable: Risk_Concentration_Small_Occ | p-value: 1.93e-36 10 | Significance: TRUE"
All five indicators have extremely small p-values (on the order of 10⁻³⁰ to 10⁻⁶⁷), which are far below 0.05. This confirms that each variable contributes significantly to distinguishing the three biodiversity clusters.
Finally, the stability of the chosen hierarchical k = 3 solution is checked by comparing it to the K-means and PAM k = 3 solutions using the Adjusted Rand Index (ARI).
# All cluster assignments for k=3
clusters_km_k3 <- model_kmeans_k3$cluster
clusters_pam_k3 <- model_pam_k3$clustering
clusters_hc_k3 <- clusters_final
# Adjusted Rand Index (ARI)
ari_km_hc_k3 <- randIndex(table(clusters_km_k3, clusters_hc_k3))
ari_pam_hc_k3 <- randIndex(table(clusters_pam_k3, clusters_hc_k3))
print(paste("ARI (K-Means k=3 vs. Hierarchical k=3):", round(ari_km_hc_k3, 4)))
## [1] "ARI (K-Means k=3 vs. Hierarchical k=3): 0.6731"
print(paste("ARI (PAM k=3 vs. Hierarchical k=3):", round(ari_pam_hc_k3, 4)))
## [1] "ARI (PAM k=3 vs. Hierarchical k=3): 0.5857"
The ARI values are:
These values indicate moderate agreement. The different algorithms largely agree on the main structure but not on every single country. This is typical in real data sets and suggests that the three-cluster typology is reasonably stable but not unique.
To visualise the typology geographically, the cluster labels are
joined to a world map from the rnaturalearth package. Some
special cases (for example Falkland Islands, Aland, Somaliland, Kosovo,
Northern Cyprus, Siachen Glacier, Antarctica) are handled with custom
join codes.
tryCatch({
if (!exists("clusters_hc_k3") || !exists("data_identifiers")) {
stop("Prerequisite objects 'clusters_hc_k3' or 'data_identifiers' not found.")
}
final_clusters_vector <- clusters_hc_k3
data_to_map <- data_identifiers
data_to_map$Cluster <- as.factor(final_clusters_vector)
world_sf_original <- ne_countries(scale = "medium", returnclass = "sf")
# Harmonizing Map Join Codes
world_sf_harmonized <- world_sf_original %>%
mutate(
join_key = case_when(
admin == "Falkland Islands" ~ "GBR",
admin == "Aland" ~ "FIN",
admin == "Somaliland" ~ "SOL",
admin == "Northern Cyprus" ~ "CYN",
admin == "Siachen Glacier" ~ "KAS",
admin == "Kosovo" ~ "KOS",
admin == "Antarctica" ~ "ATF",
TRUE ~ iso_a3_eh
)
)
# Joining Cluster Data to Harmonized Map
map_with_clusters <- left_join(
world_sf_harmonized,
data_to_map,
by = c("join_key" = "admin_code")
)
final_map_plot <- ggplot(data = map_with_clusters) +
geom_sf(aes(fill = Cluster), color = "grey60", size = 0.1) +
scale_fill_manual(
name = "Biodiversity Cluster",
values = c(
"1" = "#1b9e77",
"2" = "#f0c24f",
"3" = "#e41a1c"
),
labels = c(
"1" = "1: Low-Profile",
"2" = "2: Hidden Risk",
"3" = "3: Endemic Hotspot"
),
na.value = "grey80"
) +
labs(
title = "Global Distribution of Biodiversity Clusters (k=3)",
caption = "Clusters: 1 (Low-Profile, n=108), 2 (Hidden Risk, n=114), 3 (Endemic Hotspot, n=15)"
) +
coord_sf(crs = "+proj=robin") +
theme_void() +
theme(
plot.title = element_text(size = 18, face = "bold"),
plot.subtitle = element_text(size = 12),
legend.position = "bottom",
legend.title = element_text(face = "bold")
)
print(final_map_plot)
final_grey_list <- map_with_clusters %>%
filter(is.na(Cluster)) %>%
st_drop_geometry() %>%
select(CountryName = admin, JoinCode = join_key)
print("The remaining grey countries:")
print(final_grey_list)
}, error = function(e) {
print("MAP SCRIPT FAILED")
print("An error occurred:")
print(paste("Error:", e$message))
})
## [1] "The remaining grey countries:"
## CountryName JoinCode
## 1 Vatican VAT
## 2 Taiwan TWN
## 3 Somaliland SOL
## 4 Western Sahara ESH
## 5 Northern Cyprus CYN
## 6 Siachen Glacier KAS
The map shows:
A small number of territories remain grey because they are not present in the input dataset (for example, Vatican City, Taiwan, Somaliland, Western Sahara, Northern Cyprus, Siachen Glacier).
This unsupervised learning analysis uses a global biodiversity dataset to derive a simple three-cluster typology of countries:
hotspot_countries <- data_final_profile %>%
dplyr::filter(Cluster == 3) %>%
dplyr::arrange(desc(Prop_Endemic)) %>%
dplyr::select(admin_entity, admin_code, Prop_Endemic, Prop_At_Risk)
hotspot_countries
## admin_entity admin_code Prop_Endemic Prop_At_Risk
## 1 Australia AUS 0.6731350 0.06851687
## 2 Madagascar MDG 0.5531475 0.08869402
## 3 New Zealand NZL 0.4987617 0.07930653
## 4 South Africa ZAF 0.3972094 0.08343904
## 5 Brazil BRA 0.3526888 0.05120711
## 6 New Caledonia NCL 0.3021393 0.05344123
## 7 United States USA 0.2881633 0.06274052
## 8 Mexico MEX 0.2455140 0.04971188
## 9 French Polynesia PYF 0.2392743 0.05400213
## 10 Japan JPN 0.2346024 0.05859095
## 11 Sri Lanka LKA 0.2331965 0.03891326
## 12 Chile CHL 0.2226353 0.07132976
## 13 China CHN 0.2062673 0.04268305
## 14 Papua New Guinea PNG 0.1800443 0.05273705
## 15 Costa Rica CRI 0.1514903 0.07767581
All engineered indicators contribute significantly to separating the clusters, and the hierarchical k = 3 solution explains more than half of the variance in the distance matrix. Although different clustering algorithms do not agree perfectly, they show consistent patterns, and the global map helps to connect the quantitative results with geographic reality.
The red “Endemic Hotspot” cluster (Cluster 3, shown in red on the map) contains only 15 countries and territories, but these are among the most important places for global conservation. They are dominated by tropical forest and island states where a large share of species is both endemic and highly threatened.
Many of these countries are internationally recognised as “megadiverse” nations that together hold a large share of global biodiversity; this includes Australia, Brazil, China, Madagascar, Mexico, Papua New Guinea, South Africa and the United States, which appear consistently in lists of the 17 megadiverse countries identified by Conservation International. (iberdrola.com)
Several of the remaining red countries lie inside or overlap major biodiversity hotspots as defined by Conservation International – such as Madagascar and the Indian Ocean Islands, New Caledonia, Polynesia-Micronesia (including French Polynesia), Mesoamerica (Mexico and Costa Rica), Japan and New Zealand where high endemism and strong human pressure are both documented. (Conservation International)
This agreement between the data-driven clusters and independent global assessments suggests that the model is capturing real ecological patterns rather than noise. In practical terms, the typology can support high-level discussions on where conservation strategies might focus on preventing future risks and where they must respond to existing critical hotspots (Endemic Hotspots).
An AI assistant (ChatGPT, OpenAI) was occasionally consulted for minor R code debugging.