1. Introduction

This report uses unsupervised learning to build a simple global typology of national biodiversity patterns. The goal is to group countries into a small number of clusters based on:

how many species are endemic (found only in that country),
how many species have small geographic ranges, and
how many species are at high risk of extinction.

The analysis follows the strategy document “A Global Typology of National Biodiversity” and uses data from the World Bank dataset:

National level dataset of Species Endemism, Small Occurrence and Species at Risk data https://datacatalog.worldbank.org/search/dataset/0066034/global-biodiversity-data

All clustering is done on standardized indicators and is purely data-driven (no labels or targets).

1.1 Research Objective

The objective of this analysis is to classify countries into distinct groups based on their biodiversity characteristics. Rather than a simple ranking (e.g., “most biodiverse”), we aim to create a typology-a set of categories that describes the nature of biodiversity and risk in different parts of the world. This helps in understanding which countries face similar conservation challenges.

Scope: 237 national and territorial entities.

Variables: Counts of total species, endemic species (unique to that location), and threatened species.

2. Data and feature construction

2.1 Loading packages and setting the seed

The first code chunk loads all required R packages and sets a random seed for reproducibility.

library(cluster)       
library(factoextra)    
library(flexclust)     
library(NbClust)       
library(hopkins)       
library(ClustGeo)      
library(psych)         
library(rstatix)       
library(FeatureImpCluster) 
library(data.table)    
library(ape)           
library(stats)        
library(gridExtra)
library(sf)
library(rnaturalearth)
library(dplyr)
library(ggplot2)
library(knitr)

set.seed(123)

2.2 Reading, cleaning, and inspecting the data

In this step the script:

Sets the working directory.
Reads the CSV file national_species_summary_ddh.csv.
Renames the long original column names to shorter, more readable ones.
Removes rows with missing values.

We start with 238 country–territory records. After dropping rows with missing values, 237 remain.

setwd("C:/Users/natal/OneDrive/Pulpit/DSBA - UW/Semestr 1/Unsupervised Learning/Final Projects")

data_raw <- read.csv("national_species_summary_ddh.csv")

head(data_raw)

##   administrative.entity admin_code land.and.or.EEZ Number.of.endemics.at.100.
## 1               Albania        ALB        land-eez                         33
## 2               Algeria        DZA        land-eez                         98
## 3        American Samoa        ASM        land-eez                         15
## 4                Angola        AGO        land-eez                        213
## 5              Anguilla        AIA        land-eez                          2
## 6   Antigua and Barbuda        ATG        land-eez                          5
##   Number.of.species.with.small.occurrence.regions..50x50.
## 1                                                     895
## 2                                                     613
## 3                                                     127
## 4                                                     469
## 5                                                     337
## 6                                                     370
##   Number.of.species.with.extinction.threat.probability.gte.80
## 1                                                         165
## 2                                                         137
## 3                                                          46
## 4                                                          73
## 5                                                         140
## 6                                                         157
##   Number.of.endemic.species.with.small.occurrence.regions..50x50.
## 1                                                              31
## 2                                                              47
## 3                                                              15
## 4                                                              89
## 5                                                               2
## 6                                                               4
##   Number.of.species.with.small.occurrence.regions..50x50..and.extinction.threat.probability.gte.80
## 1                                                                                              118
## 2                                                                                               79
## 3                                                                                                9
## 4                                                                                               46
## 5                                                                                               36
## 6                                                                                               33
##   Number.of.endemic.species.with.extinction.threat.probability.gte.80
## 1                                                                   5
## 2                                                                  23
## 3                                                                   3
## 4                                                                  25
## 5                                                                   1
## 6                                                                   0
##   Number.of.endemic.species.with.small.occurrence.regions..50x50..and.extinction.threat.probability.gte.80
## 1                                                                                                        4
## 2                                                                                                       20
## 3                                                                                                        3
## 4                                                                                                       17
## 5                                                                                                        1
## 6                                                                                                        0
##   Number.of.total.species
## 1                   16505
## 2                    9943
## 3                    1851
## 4                    7872
## 5                    6097
## 6                    6906

# Data Structure
str(data_raw)

## 'data.frame':    238 obs. of  11 variables:
##  $ administrative.entity                                                                                   : chr  "Albania" "Algeria" "American Samoa" "Angola" ...
##  $ admin_code                                                                                              : chr  "ALB" "DZA" "ASM" "AGO" ...
##  $ land.and.or.EEZ                                                                                         : chr  "land-eez" "land-eez" "land-eez" "land-eez" ...
##  $ Number.of.endemics.at.100.                                                                              : int  33 98 15 213 2 5 1694 7 54427 10 ...
##  $ Number.of.species.with.small.occurrence.regions..50x50.                                                 : int  895 613 127 469 337 370 1596 460 13148 508 ...
##  $ Number.of.species.with.extinction.threat.probability.gte.80                                             : int  165 137 46 73 140 157 716 202 5540 84 ...
##  $ Number.of.endemic.species.with.small.occurrence.regions..50x50.                                         : int  31 47 15 89 2 4 624 7 11533 8 ...
##  $ Number.of.species.with.small.occurrence.regions..50x50..and.extinction.threat.probability.gte.80        : int  118 79 9 46 36 33 324 48 3378 69 ...
##  $ Number.of.endemic.species.with.extinction.threat.probability.gte.80                                     : int  5 23 3 25 1 0 289 1 4831 7 ...
##  $ Number.of.endemic.species.with.small.occurrence.regions..50x50..and.extinction.threat.probability.gte.80: int  4 20 3 17 1 0 192 1 3189 7 ...
##  $ Number.of.total.species                                                                                 : int  16505 9943 1851 7872 6097 6906 20209 8656 80856 7789 ...

# Data Summary - Checking for NAs
summary(data_raw)

##  administrative.entity  admin_code        land.and.or.EEZ   
##  Length:238            Length:238         Length:238        
##  Class :character      Class :character   Class :character  
##  Mode  :character      Mode  :character   Mode  :character  
##                                                             
##                                                             
##                                                             
##                                                             
##  Number.of.endemics.at.100.
##  Min.   :    0             
##  1st Qu.:   15             
##  Median :   68             
##  Mean   : 1120             
##  3rd Qu.:  358             
##  Max.   :54427             
##  NA's   :1                 
##  Number.of.species.with.small.occurrence.regions..50x50.
##  Min.   :   41                                          
##  1st Qu.:  338                                          
##  Median :  553                                          
##  Mean   : 1113                                          
##  3rd Qu.: 1233                                          
##  Max.   :13148                                          
##  NA's   :1                                              
##  Number.of.species.with.extinction.threat.probability.gte.80
##  Min.   :   4.0                                             
##  1st Qu.:  78.0                                             
##  Median : 148.0                                             
##  Mean   : 413.2                                             
##  3rd Qu.: 403.0                                             
##  Max.   :6456.0                                             
##  NA's   :1                                                  
##  Number.of.endemic.species.with.small.occurrence.regions..50x50.
##  Min.   :    0.0                                                
##  1st Qu.:   12.0                                                
##  Median :   49.0                                                
##  Mean   :  393.8                                                
##  3rd Qu.:  197.0                                                
##  Max.   :11533.0                                                
##  NA's   :1                                                      
##  Number.of.species.with.small.occurrence.regions..50x50..and.extinction.threat.probability.gte.80
##  Min.   :   1.0                                                                                  
##  1st Qu.:  39.0                                                                                  
##  Median :  79.0                                                                                  
##  Mean   : 231.6                                                                                  
##  3rd Qu.: 205.0                                                                                  
##  Max.   :3378.0                                                                                  
##  NA's   :1                                                                                       
##  Number.of.endemic.species.with.extinction.threat.probability.gte.80
##  Min.   :   0.0                                                     
##  1st Qu.:   3.0                                                     
##  Median :  16.0                                                     
##  Mean   : 177.5                                                     
##  3rd Qu.:  67.0                                                     
##  Max.   :4831.0                                                     
##  NA's   :1                                                          
##  Number.of.endemic.species.with.small.occurrence.regions..50x50..and.extinction.threat.probability.gte.80
##  Min.   :   0.0                                                                                          
##  1st Qu.:   3.0                                                                                          
##  Median :  15.0                                                                                          
##  Mean   : 132.5                                                                                          
##  3rd Qu.:  51.0                                                                                          
##  Max.   :3189.0                                                                                          
##  NA's   :1                                                                                               
##  Number.of.total.species
##  Min.   :   444         
##  1st Qu.:  4685         
##  Median :  8712         
##  Mean   : 14795         
##  3rd Qu.: 18977         
##  Max.   :102900         
##  NA's   :1

new_names <- c(
  "admin_entity", "admin_code", "land_eez",
  "n_endemic_100", "n_small_occ", "n_threat",
  "n_endemic_small_occ", "n_small_occ_threat", "n_endemic_threat",
  "n_endemic_small_occ_threat", "n_total"
)

if (length(names(data_raw)) == length(new_names)) {
  names(data_raw) <- new_names
} else {
  stop(paste("Column count mismatch. Expected", length(new_names), "columns but found", length(names(data_raw)), "."))
}

data_clean <- na.omit(data_raw)
print(paste("Loaded", nrow(data_raw), "rows. After removing NAs,", nrow(data_clean), "rows remain."))

## [1] "Loaded 238 rows. After removing NAs, 237 rows remain."

data_identifiers <- data_clean[, c("admin_entity", "admin_code", "land_eez")]

data_numeric_raw <- data_clean[, new_names[!(new_names %in% c("admin_entity", "admin_code", "land_eez"))]]

The admin_entity, admin_code, and land_eez columns are used as identifiers (country name, ISO code, and land/EEZ flag). The remaining columns contain counts of endemic species, small-range species, threatened species, and their overlaps.

2.3 Feature engineering and scaling

Instead of clustering directly on raw counts (which are strongly influenced by country size), the analysis builds five ratio-based indicators:

Prop_Endemic – share of total species that are endemic.
Prop_At_Risk – share of total species that are highly threatened.
Prop_Small_Occurrence – share of total species with small geographic ranges.
Risk_Concentration_Endemic – among endemic species, share that are highly threatened.
Risk_Concentration_Small_Occ – among small-range species, share that are highly threatened.

Division by zero is handled by replacing resulting NA with 0. Finally, all five indicators are standardized (z-scores) so that each has mean 0 and standard deviation 1.

# Creating a new data frame for the 5 engineered features
data_features <- data.frame(matrix(ncol = 5, nrow = nrow(data_numeric_raw)))
names(data_features) <- c(
  "Prop_Endemic", "Prop_At_Risk", "Prop_Small_Occurrence",
  "Risk_Concentration_Endemic", "Risk_Concentration_Small_Occ"
)

# 1. Prop_Endemic (Proportion of Endemism) 
data_features$Prop_Endemic <- data_numeric_raw$n_endemic_100 / data_numeric_raw$n_total

# 2. Prop_At_Risk (Proportion of Total Biodiversity at Risk) 
data_features$Prop_At_Risk <- data_numeric_raw$n_threat / data_numeric_raw$n_total

# 3. Prop_Small_Occurrence (Proportion of Geographically Concentrated Species) 
data_features$Prop_Small_Occurrence <- data_numeric_raw$n_small_occ / data_numeric_raw$n_total

# 4. Risk_Concentration_Endemic (Risk Concentration among Endemic Species) 
data_features$Risk_Concentration_Endemic <- data_numeric_raw$n_endemic_threat / data_numeric_raw$n_endemic_100

# 5. Risk_Concentration_Small_Occ (Risk Concentration among Small Occurrence Species) 
data_features$Risk_Concentration_Small_Occ <- data_numeric_raw$n_small_occ_threat / data_numeric_raw$n_small_occ


data_features[is.na(data_features)] <- 0

data_scaled <- scale(data_features)

These engineered features allow us to compare very different countries (for example, small islands and large continental states) on a common scale.

3. Assessing clustering tendency

Before choosing a clustering method, it is helpful to check whether the data actually contain cluster structure.

3.1 Hopkins statistic

The Hopkins statistic tests whether the data are more “clustered” than a random uniform distribution. Values close to 1 indicate strong cluster tendency; values around 0.5 suggest randomness.

hopkins_stat_result <- hopkins(data_scaled)
hopkins_stat_result

## [1] 0.9999822

In this analysis the Hopkins statistic is approximately 0.9999, which is extremely high. This means that the biodiversity indicators are very far from random and are suitable for clustering.

3.2 Visual Assessment of Trends (VAT / ODM)

We compute an ordered dissimilarity matrix (ODM) based on Euclidean distances and visualize it as a heatmap.

dist_matrix_odm <- get_dist(data_scaled, method = "euclidean")
fviz_dist(
  dist_matrix_odm,
  show_labels = FALSE,
  gradient = list(low = "darkblue", mid = "white", high = "#FC4E07")
) +
  labs(title = "Visual Assessment of Trends (VAT / ODM)")

Blocks of similar colours along the diagonal correspond to groups of countries with similar biodiversity profiles. The heatmap shows several such blocks, which visually confirms the strong cluster tendency indicated by the Hopkins statistic.

4. Selecting the number of clusters

To decide how many clusters to use, three standard methods are applied to K-means clustering:

Elbow method using within-cluster sum of squares (WSS).
Average silhouette width.
Gap statistic.

# 1. Elbow Method (WSS)
print(fviz_nbclust(data_scaled, kmeans, method = "wss") +
        labs(title = "Elbow Method (WSS)"))

# 2. Average Silhouette Method 
print(fviz_nbclust(data_scaled, kmeans, method = "silhouette") +
        labs(title = "Average Silhouette Method"))

# 3. Gap Statistic Method
print(fviz_nbclust(data_scaled, kmeans, method = "gap_stat", nboot = 50) +
        labs(title = "Gap Statistic Method"))

Interpretation:

The Elbow method shows a clear reduction in WSS up to around k = 3, after which the curve flattens. This suggests that 3 clusters capture most of the structure.
The Silhouette method reaches its maximum average silhouette width at k = 2, indicating that two well-separated clusters also fit the data well.
The Gap statistic curve shows a gradual increase in Gap values as k grows, with no clear local maximum at small k. Using the standard 1-SE rule, the method actually selects k = 1, which would correspond to “no clear clustering structure”. Given the very high Hopkins statistic and the clear patterns in the WSS and silhouette plots, I treat the Gap statistic as inconclusive rather than decisive, and rely on the elbow and silhouette methods when exploring k = 2 and k = 3.

To explore both possibilities, the analysis continues with k = 2 (more compact) and k = 3 (more detailed) in the next section.

5. Partitional clustering: K-means and PAM

This part uses two partitioning methods:

K-means – efficient but sensitive to outliers.
PAM (Partitioning Around Medoids) – more robust as it uses medoids instead of means.

Two scenarios are tested:

k_optimal = 2 (from the silhouette method),
k_alternative = 3 (from the elbow method and VAT heatmap).

k_optimal <- 2
k_alternative <- 3

set.seed(123)

5.1 K-Means

# K-Means Model (k=2) 
model_kmeans_k2 <- kmeans(data_scaled, centers = k_optimal, nstart = 25)
print(model_kmeans_k2)

## K-means clustering with 2 clusters of sizes 31, 206
## 
## Cluster means:
##   Prop_Endemic Prop_At_Risk Prop_Small_Occurrence Risk_Concentration_Endemic
## 1    1.8653927    1.9589414             1.7778007               -0.051982188
## 2   -0.2807144   -0.2947921            -0.2675331                0.007822562
##   Risk_Concentration_Small_Occ
## 1                    1.7656863
## 2                   -0.2657101
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 1 2 2 2 1 1 2 2 2 1 2
##  [38] 2 2 2 1 2 2 1 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2
## [112] 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 1 1 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2
## [186] 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 203.4500 491.5743
##  (between_SS / total_SS =  41.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

fviz_cluster(
  model_kmeans_k2,
  data = data_scaled,
  ellipse.type = "convex", 
  palette = "jco",         
  ggtheme = theme_minimal(),
  main = "K-Means Clustering (k=2)"
)

# K-Means Model (k=3)
model_kmeans_k3 <- kmeans(data_scaled, centers = k_alternative, nstart = 25)
print(model_kmeans_k3)

## K-means clustering with 3 clusters of sizes 101, 16, 120
## 
## Cluster means:
##   Prop_Endemic Prop_At_Risk Prop_Small_Occurrence Risk_Concentration_Endemic
## 1   -0.0955359    0.2278040             0.0128061                  0.7290054
## 2    3.0782215    2.6118455             2.6549181                 -0.3153484
## 3   -0.3300202   -0.5399811            -0.3647676                 -0.5715331
##   Risk_Concentration_Small_Occ
## 1                    0.5241211
## 2                    1.8014436
## 3                   -0.6813277
## 
## Clustering vector:
##   [1] 3 3 3 3 1 3 1 3 2 1 3 3 3 3 3 3 1 1 1 2 1 3 3 1 3 1 1 3 3 3 2 2 1 3 3 1 3
##  [38] 1 3 1 2 3 1 1 3 1 1 3 1 1 1 1 1 3 3 1 3 3 1 1 2 3 3 3 1 1 3 1 1 1 1 1 1 3
##  [75] 1 1 3 3 1 1 1 3 3 3 1 1 1 2 3 3 1 3 3 1 3 1 1 1 1 1 2 1 3 1 3 3 1 2 1 3 3
## [112] 1 3 3 1 3 3 2 2 1 3 3 1 1 1 3 1 3 1 2 1 3 1 3 1 1 3 3 1 2 3 3 1 1 3 3 1 3
## [149] 1 3 1 3 3 2 1 2 3 3 3 3 3 3 1 1 1 1 3 3 3 3 3 3 1 3 3 3 1 1 2 1 1 3 1 1 3
## [186] 3 1 3 3 1 1 1 3 1 3 3 3 3 1 3 3 3 3 3 1 1 3 3 1 3 3 3 3 3 3 3 3 3 3 1 3 3
## [223] 1 1 3 1 3 3 3 1 1 3 1 1 1 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 258.90021  96.39229 151.13102
##  (between_SS / total_SS =  57.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

fviz_cluster(
  model_kmeans_k3,
  data = data_scaled,
  ellipse.type = "convex",
  palette = "jco",
  ggtheme = theme_minimal(),
  main = "K-Means Clustering (k=3)"
)

5.2 PAM Model

# PAM Model (k=2)
model_pam_k2 <- pam(data_scaled, k = k_optimal)
print(model_pam_k2)

## Medoids:
##       ID Prop_Endemic Prop_At_Risk Prop_Small_Occurrence
## [1,] 172   -0.3077431   -0.3475631            -0.2724983
## [2,] 165    0.4410704    0.8886212             0.4464451
##      Risk_Concentration_Endemic Risk_Concentration_Small_Occ
## [1,]                -0.10015024                   -0.4070351
## [2,]                 0.03519766                    1.4972228
## Clustering vector:
##   [1] 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 2 2 1 1 1 2 2 1 1 1 2 1
##  [38] 1 1 1 2 1 1 2 1 1 1 1 2 2 2 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 1 2 1 1 1
##  [75] 1 2 1 1 1 2 2 1 1 1 2 2 1 2 1 1 2 1 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2 2 1 1 1
## [112] 2 1 1 1 1 1 2 2 1 1 1 2 1 2 1 1 1 2 2 2 1 1 1 2 2 1 1 1 2 1 1 2 1 1 1 1 1
## [149] 2 1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1
## [186] 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## Objective function:
##    build     swap 
## 1.549300 1.501007 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"

fviz_cluster(
  model_pam_k2,
  data = data_scaled,
  ellipse.type = "convex",
  palette = "jco",
  ggtheme = theme_minimal(),
  main = "PAM Clustering (k=2)"
)

# PAM Model (k=3)
model_pam_k3 <- pam(data_scaled, k = k_alternative)
print(model_pam_k3)

## Medoids:
##      ID Prop_Endemic Prop_At_Risk Prop_Small_Occurrence
## [1,] 89   -0.4460285   -0.3735735            -0.2970748
## [2,] 46   -0.2690618   -0.4485538            -0.2116820
## [3,] 36    0.9068746    1.3936975             0.7277914
##      Risk_Concentration_Endemic Risk_Concentration_Small_Occ
## [1,]                 -0.5806052                  -0.58383933
## [2,]                  0.8693392                  -0.01175344
## [3,]                 -0.1161640                   1.88626341
## Clustering vector:
##   [1] 1 1 1 1 2 1 3 1 3 2 1 1 1 1 1 1 2 1 2 3 2 1 2 3 1 2 3 1 1 1 3 3 2 1 1 3 1
##  [38] 2 1 2 3 1 2 3 1 2 2 1 2 3 3 2 2 1 1 2 1 1 2 3 3 1 1 1 2 2 2 3 3 2 2 2 2 1
##  [75] 2 2 1 2 2 3 3 1 2 1 2 2 1 3 1 1 3 1 1 3 1 2 2 2 2 2 3 3 1 2 1 1 2 3 2 2 1
## [112] 2 1 1 1 1 1 3 3 2 1 1 3 2 2 1 2 1 3 3 3 1 2 1 3 3 1 1 1 3 1 1 3 2 1 1 1 1
## [149] 2 1 2 1 1 3 3 3 1 1 1 1 1 2 3 2 3 2 1 2 1 1 1 1 2 2 1 1 2 2 3 3 2 1 2 2 1
## [186] 1 3 2 1 2 2 2 2 2 2 1 1 2 2 1 1 1 2 1 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2
## [223] 2 2 1 2 2 1 1 2 2 1 2 2 2 1 1
## Objective function:
##    build     swap 
## 1.371894 1.300829 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"

fviz_cluster(
  model_pam_k3,
  data = data_scaled,
  ellipse.type = "convex",
  palette = "jco",
  ggtheme = theme_minimal(),
  main = "PAM Clustering (k=3)"
)

Key observations:

For K-means with k = 2, the two clusters are clearly separated in the first two principal dimensions, with about 31 and 206 entities.
For K-means with k = 3, the data split into a small high-biodiversity cluster (16 entities) and two larger clusters (101 and 120 entities). The between-cluster variance explained increases from about 41% (k = 2) to about 57% (k = 3).
The PAM solutions (k = 2 and k = 3) show similar shapes in the 2-D projection but use medoids as representatives, which makes them less sensitive to extreme countries.

These results suggest that more than one reasonable partition exists, which motivates the use of hierarchical clustering to better understand the structure.

6. Hierarchical clustering

Hierarchical clustering is applied using a Ward linkage on the same distance matrix. This method builds a tree (dendrogram) of nested clusters and does not require specifying k in advance.

# 1. Distance Matrix
dist_matrix_hc <- dist_matrix_odm

# 2. Model Building (Ward's method)
model_hc <- hclust(dist_matrix_hc, method = "ward.D2")

# 3. Visualization (Dendrogram)
fviz_dend(
  model_hc,
  k = k_optimal, # k=2
  rect = TRUE,
  k_colors = "jco",
  cex = 0.5, 
  main = paste("Hierarchical Clustering Dendrogram (k=", k_optimal, ")")
)

fviz_dend(
  model_hc,
  k = k_alternative, # k=3
  rect = TRUE,
  k_colors = "jco",
  cex = 0.5,
  main = paste("Hierarchical Clustering Dendrogram (k=", k_alternative, ")")
)

# 4. Tree Cutting and Inertia Validation
total_inertia <- inertdiss(dist_matrix_hc)
print("Hierarchical Clustering Validation")

## [1] "Hierarchical Clustering Validation"

print(paste("Total Inertia:", round(total_inertia, 2)))

## [1] "Total Inertia: 4.98"

# Validation for k=2
clusters_hc_k2 <- cutree(model_hc, k = k_optimal)
within_inertia_k2 <- withindiss(dist_matrix_hc, clusters_hc_k2)
q_metric_k2 <- 1 - (within_inertia_k2 / total_inertia)
print(paste("Within-Cluster Inertia (k=2):", round(within_inertia_k2, 2)))

## [1] "Within-Cluster Inertia (k=2): 3.06"

print(paste("Q Metric (Proportion of Variance Explained) (k=2):", round(q_metric_k2, 4)))

## [1] "Q Metric (Proportion of Variance Explained) (k=2): 0.3847"

# Validation for k=3
clusters_hc_k3 <- cutree(model_hc, k = k_alternative)
within_inertia_k3 <- withindiss(dist_matrix_hc, clusters_hc_k3)
q_metric_k3 <- 1 - (within_inertia_k3 / total_inertia)
print(paste("Within-Cluster Inertia (k=3):", round(within_inertia_k3, 2)))

## [1] "Within-Cluster Inertia (k=3): 2.23"

print(paste("Q Metric (Proportion of Variance Explained) (k=3):", round(q_metric_k3, 4)))

## [1] "Q Metric (Proportion of Variance Explained) (k=3): 0.553"

Interpretation:

Total inertia (overall variance) is about 4.98.
For k = 2, the within-cluster inertia is about 3.06, leading to a Q metric (variance explained) of 0.3847.
For k = 3, the within-cluster inertia drops to 2.23, so the Q metric increases to 0.553.

Because the k = 3 solution explains substantially more variance and matches the visual structure of the dendrogram, the analysis chooses hierarchical clustering with k = 3 as the final model.

7. Cluster profiling and validation

7.1 Building cluster profiles

The final step assigns each country to one of the three hierarchical clusters and computes descriptive statistics for each feature.

k_final <- 3
clusters_final <- clusters_hc_k3

data_profile_numeric <- data_features
data_profile_numeric$Cluster <- as.factor(clusters_final)

data_final_profile <- cbind(data_identifiers, data_profile_numeric)

print(head(data_final_profile))

##          admin_entity admin_code land_eez Prop_Endemic Prop_At_Risk
## 1             Albania        ALB land-eez 0.0019993941  0.009996971
## 2             Algeria        DZA land-eez 0.0098561802  0.013778538
## 3      American Samoa        ASM land-eez 0.0081037277  0.024851432
## 4              Angola        AGO land-eez 0.0270579268  0.009273374
## 5            Anguilla        AIA land-eez 0.0003280302  0.022962113
## 6 Antigua and Barbuda        ATG land-eez 0.0007240081  0.022733855
##   Prop_Small_Occurrence Risk_Concentration_Endemic Risk_Concentration_Small_Occ
## 1            0.05422599                  0.1515152                   0.13184358
## 2            0.06165141                  0.2346939                   0.12887439
## 3            0.06861156                  0.2000000                   0.07086614
## 4            0.05957825                  0.1173709                   0.09808102
## 5            0.05527309                  0.5000000                   0.10682493
## 6            0.05357660                  0.0000000                   0.08918919
##   Cluster
## 1       1
## 2       1
## 3       1
## 4       1
## 5       2
## 6       1

profile_stats <- describeBy(
  data_profile_numeric,
  group = data_profile_numeric$Cluster
)

# Simplified mean-based profile
for (i in 1:k_final) {
  print(paste("Cluster", i))
  cluster_data <- profile_stats[[i]]
  print(cluster_data[, c("n", "mean", "sd", "min", "max")])
}

## [1] "Cluster 1"
##                                n mean   sd  min  max
## Prop_Endemic                 108 0.01 0.02 0.00 0.07
## Prop_At_Risk                 108 0.01 0.01 0.00 0.03
## Prop_Small_Occurrence        108 0.06 0.01 0.04 0.12
## Risk_Concentration_Endemic   108 0.11 0.09 0.00 0.29
## Risk_Concentration_Small_Occ 108 0.11 0.04 0.01 0.17
## Cluster                      108 1.00 0.00 1.00 1.00
## [1] "Cluster 2"
##                                n mean   sd  min  max
## Prop_Endemic                 114 0.03 0.03 0.00 0.13
## Prop_At_Risk                 114 0.02 0.01 0.01 0.06
## Prop_Small_Occurrence        114 0.07 0.02 0.05 0.16
## Risk_Concentration_Endemic   114 0.33 0.15 0.00 1.00
## Risk_Concentration_Small_Occ 114 0.18 0.05 0.06 0.33
## Cluster                      114 2.00 0.00 2.00 2.00
## [1] "Cluster 3"
##                               n mean   sd  min  max
## Prop_Endemic                 15 0.32 0.15 0.15 0.67
## Prop_At_Risk                 15 0.06 0.02 0.04 0.09
## Prop_Small_Occurrence        15 0.14 0.03 0.11 0.21
## Risk_Concentration_Endemic   15 0.16 0.05 0.09 0.28
## Risk_Concentration_Small_Occ 15 0.27 0.04 0.20 0.35
## Cluster                      15 3.00 0.00 3.00 3.00

Based on the means printed by describeBy, the three clusters can be interpreted as follows:

Cluster 1 – “Low-Profile” biodiversity (n = 108)
- Endemic share around 1% of all species.
- Species at high risk around 1%.
- About 6% of species have small ranges.
- Risk concentration among endemics and small-range species is relatively low (≈11%). These countries have modest endemism and relatively few highly threatened species in relative terms.
Cluster 2 – “Hidden Risk” biodiversity (n = 114)
- Endemic share around 3%.
- Species at high risk around 2%.
- Around 7% of species have small ranges.
- Risk concentration among endemics is high (≈33%), and for small-range species ≈18%. These countries do not have the highest overall biodiversity, but a large fraction of their endemic and small-range species is under serious threat.
Cluster 3 – “Endemic Hotspots” (n = 15)
- Very high endemic share (on average 32% of all species).
- Highest share of species at risk (around 6%).
- Many species with small ranges (around 14%).
- High concentration of risk among small-range species (around 27%). This small group corresponds to global biodiversity hotspots where conservation stakes are very high.

7.2 Statistical validation with ANOVA

To check which indicators truly separate the clusters, one-way ANOVA tests are run for each feature using cluster membership as the factor.

for (var_name in colnames(data_features)) {
  
  formula_str <- paste(var_name, "~ Cluster")
  
  anova_result <- anova_test(data = data_profile_numeric, formula = as.formula(formula_str))
  
  print(paste(
    "Variable:", var_name,
    "| p-value:", anova_result$p, 10,
    "| Significance:", anova_result$p < 0.05
  ))
}

## [1] "Variable: Prop_Endemic | p-value: 9.35e-67 10 | Significance: TRUE"
## [1] "Variable: Prop_At_Risk | p-value: 1.52e-48 10 | Significance: TRUE"
## [1] "Variable: Prop_Small_Occurrence | p-value: 1.6e-34 10 | Significance: TRUE"
## [1] "Variable: Risk_Concentration_Endemic | p-value: 6.07e-30 10 | Significance: TRUE"
## [1] "Variable: Risk_Concentration_Small_Occ | p-value: 1.93e-36 10 | Significance: TRUE"

All five indicators have extremely small p-values (on the order of 10⁻³⁰ to 10⁻⁶⁷), which are far below 0.05. This confirms that each variable contributes significantly to distinguishing the three biodiversity clusters.

7.3 Stability analysis (Adjusted Rand Index)

Finally, the stability of the chosen hierarchical k = 3 solution is checked by comparing it to the K-means and PAM k = 3 solutions using the Adjusted Rand Index (ARI).

# All cluster assignments for k=3
clusters_km_k3 <- model_kmeans_k3$cluster  
clusters_pam_k3 <- model_pam_k3$clustering 
clusters_hc_k3 <- clusters_final           

# Adjusted Rand Index (ARI)
ari_km_hc_k3 <- randIndex(table(clusters_km_k3, clusters_hc_k3))
ari_pam_hc_k3 <- randIndex(table(clusters_pam_k3, clusters_hc_k3))

print(paste("ARI (K-Means k=3 vs. Hierarchical k=3):", round(ari_km_hc_k3, 4)))

## [1] "ARI (K-Means k=3 vs. Hierarchical k=3): 0.6731"

print(paste("ARI (PAM k=3 vs. Hierarchical k=3):", round(ari_pam_hc_k3, 4)))

## [1] "ARI (PAM k=3 vs. Hierarchical k=3): 0.5857"

The ARI values are:

0.6731 for K-means vs. hierarchical,
0.5857 for PAM vs. hierarchical.

These values indicate moderate agreement. The different algorithms largely agree on the main structure but not on every single country. This is typical in real data sets and suggests that the three-cluster typology is reasonably stable but not unique.

8. Mapping the global biodiversity clusters

To visualise the typology geographically, the cluster labels are joined to a world map from the rnaturalearth package. Some special cases (for example Falkland Islands, Aland, Somaliland, Kosovo, Northern Cyprus, Siachen Glacier, Antarctica) are handled with custom join codes.

tryCatch({
  if (!exists("clusters_hc_k3") || !exists("data_identifiers")) {
    stop("Prerequisite objects 'clusters_hc_k3' or 'data_identifiers' not found.")
  }
  
  final_clusters_vector <- clusters_hc_k3
  data_to_map <- data_identifiers
  
  data_to_map$Cluster <- as.factor(final_clusters_vector)
  
  world_sf_original <- ne_countries(scale = "medium", returnclass = "sf")
  
  # Harmonizing Map Join Codes
  world_sf_harmonized <- world_sf_original %>%
    mutate(
      join_key = case_when(
        admin == "Falkland Islands" ~ "GBR", 
        admin == "Aland"               ~ "FIN", 
        admin == "Somaliland"          ~ "SOL",
        admin == "Northern Cyprus"      ~ "CYN",
        admin == "Siachen Glacier"      ~ "KAS",
        admin == "Kosovo"              ~ "KOS",
        admin == "Antarctica"          ~ "ATF",
        
        TRUE                           ~ iso_a3_eh
      )
    )
  
  # Joining Cluster Data to Harmonized Map 
  
  map_with_clusters <- left_join(
    world_sf_harmonized, 
    data_to_map, 
    by = c("join_key" = "admin_code") 
  )
  
  final_map_plot <- ggplot(data = map_with_clusters) +
    geom_sf(aes(fill = Cluster), color = "grey60", size = 0.1) +
    
    scale_fill_manual(
      name = "Biodiversity Cluster",
      
      values = c(
        "1" = "#1b9e77",  
        "2" = "#f0c24f",  
        "3" = "#e41a1c" 
      ),
      
      labels = c(
        "1" = "1: Low-Profile",
        "2" = "2: Hidden Risk",
        "3" = "3: Endemic Hotspot"
      ),
      
      na.value = "grey80"
    ) +
    
    labs(
      title = "Global Distribution of Biodiversity Clusters (k=3)",
      caption = "Clusters: 1 (Low-Profile, n=108), 2 (Hidden Risk, n=114), 3 (Endemic Hotspot, n=15)"
    ) +
    
    coord_sf(crs = "+proj=robin") + 
    theme_void() + 
    theme(
      plot.title = element_text(size = 18, face = "bold"),
      plot.subtitle = element_text(size = 12),
      legend.position = "bottom",
      legend.title = element_text(face = "bold")
    )
  
  print(final_map_plot)
  
  final_grey_list <- map_with_clusters %>%
    filter(is.na(Cluster)) %>%
    st_drop_geometry() %>%
    select(CountryName = admin, JoinCode = join_key)
  
  print("The remaining grey countries:")
  print(final_grey_list)
  
}, error = function(e) {
  print("MAP SCRIPT FAILED")
  print("An error occurred:")
  print(paste("Error:", e$message))
})

## [1] "The remaining grey countries:"
##       CountryName JoinCode
## 1         Vatican      VAT
## 2          Taiwan      TWN
## 3      Somaliland      SOL
## 4  Western Sahara      ESH
## 5 Northern Cyprus      CYN
## 6 Siachen Glacier      KAS

The map shows:

Green (Cluster 1 – Low-Profile) countries with relatively low endemism and risk.
Yellow (Cluster 2 – Hidden Risk) countries with moderate biodiversity but high risk concentration among endemic and small-range species.
Red (Cluster 3 – Endemic Hotspots) countries where a large share of biodiversity is both unique and threatened.

A small number of territories remain grey because they are not present in the input dataset (for example, Vatican City, Taiwan, Somaliland, Western Sahara, Northern Cyprus, Siachen Glacier).

9. Conclusions

This unsupervised learning analysis uses a global biodiversity dataset to derive a simple three-cluster typology of countries:

Low-Profile biodiversity systems with modest endemism and comparatively low threat levels.
Hidden Risk systems where a large fraction of endemic and small-range species are threatened, even if overall endemism is moderate.
Endemic Hotspots with very high shares of unique species and elevated extinction risk.

hotspot_countries <- data_final_profile %>%
  dplyr::filter(Cluster == 3) %>%
  dplyr::arrange(desc(Prop_Endemic)) %>%
  dplyr::select(admin_entity, admin_code, Prop_Endemic, Prop_At_Risk)

hotspot_countries

##        admin_entity admin_code Prop_Endemic Prop_At_Risk
## 1         Australia        AUS    0.6731350   0.06851687
## 2        Madagascar        MDG    0.5531475   0.08869402
## 3       New Zealand        NZL    0.4987617   0.07930653
## 4      South Africa        ZAF    0.3972094   0.08343904
## 5            Brazil        BRA    0.3526888   0.05120711
## 6     New Caledonia        NCL    0.3021393   0.05344123
## 7     United States        USA    0.2881633   0.06274052
## 8            Mexico        MEX    0.2455140   0.04971188
## 9  French Polynesia        PYF    0.2392743   0.05400213
## 10            Japan        JPN    0.2346024   0.05859095
## 11        Sri Lanka        LKA    0.2331965   0.03891326
## 12            Chile        CHL    0.2226353   0.07132976
## 13            China        CHN    0.2062673   0.04268305
## 14 Papua New Guinea        PNG    0.1800443   0.05273705
## 15       Costa Rica        CRI    0.1514903   0.07767581

All engineered indicators contribute significantly to separating the clusters, and the hierarchical k = 3 solution explains more than half of the variance in the distance matrix. Although different clustering algorithms do not agree perfectly, they show consistent patterns, and the global map helps to connect the quantitative results with geographic reality.

The red “Endemic Hotspot” cluster (Cluster 3, shown in red on the map) contains only 15 countries and territories, but these are among the most important places for global conservation. They are dominated by tropical forest and island states where a large share of species is both endemic and highly threatened.

Many of these countries are internationally recognised as “megadiverse” nations that together hold a large share of global biodiversity; this includes Australia, Brazil, China, Madagascar, Mexico, Papua New Guinea, South Africa and the United States, which appear consistently in lists of the 17 megadiverse countries identified by Conservation International. (iberdrola.com)

Several of the remaining red countries lie inside or overlap major biodiversity hotspots as defined by Conservation International – such as Madagascar and the Indian Ocean Islands, New Caledonia, Polynesia-Micronesia (including French Polynesia), Mesoamerica (Mexico and Costa Rica), Japan and New Zealand where high endemism and strong human pressure are both documented. (Conservation International)

This agreement between the data-driven clusters and independent global assessments suggests that the model is capturing real ecological patterns rather than noise. In practical terms, the typology can support high-level discussions on where conservation strategies might focus on preventing future risks and where they must respond to existing critical hotspots (Endemic Hotspots).

Bibliography

World Bank. Global Biodiversity Data – National level dataset of Species Endemism, Small Occurrence and Species at Risk data. https://datacatalog.worldbank.org/search/dataset/0066034/global-biodiversity-data

An AI assistant (ChatGPT, OpenAI) was occasionally consulted for minor R code debugging.

A Global Typology of Biodiversity

Natalia Namysłowska

2025-12-01