Introduction

The primary goal of this study is to identify similarities among European countries in terms of their level of development. Two key variables are used for this purpose. The first, GDP per capita, reflects a country’s economic and financial prosperity. The second, the Human Development Index (HDI), captures the quality of human capital, including factors such as life expectancy and years of education. Together, these variables provide a comprehensive view of a country’s development level. To explore patterns in the data, multiple clustering algorithms will be applied: k-means, PAM (Partitioning Around Medoids) and hierarchical clustering (both agglomerative and divisive approaches). Using different methods will help verify the robustness of the results and uncover meaningful groupings among European countries.

Chapter 1. Short Description of the Dataset

I created the dataset myself using data from 2023. GDP per capita values were obtained from the World Bank, while HDI values were taken from the UNDP Human Development Reports. Some countries were excluded due to missing values: San Marino and Monaco had incomplete data for either GDP per capita or HDI, and Kosovo was excluded because of its missing HDI value and its limited international recognition. Russia was included because its most developed regions are located in Europe.

Chapter 2. Data Preparation and Presentation

2.1. Loading the Dataset and Summarizing Its Descriptive Statistics

#Loading the libraries
library(dplyr)
library(utils)
library(factoextra)
library(ggplot2)
library(fpc)
library(clustertend)
library(NbClust)
library(readxl)
library(stats)
library(cluster)
library(maps)
library(tidyverse)
library(dendextend)
library(gridExtra)
library(grid)
library(hopkins)
library(googlesheets4)
library(readr)

# Importing the dataset from Google Drive and presenting its basic descriptive statistics:
url <- "https://docs.google.com/spreadsheets/d/1Rs6R63fb-JOXBKLn5sHHjuQBzx32VPTn/export?format=csv"

europe <- read_csv(
  url,
  locale = locale(decimal_mark = ",")
)

head(europe)
## # A tibble: 6 × 3
##   Country                  hdi GDP_per_capita
##   <chr>                  <dbl>          <dbl>
## 1 Albania                0.81           8575.
## 2 Andorra                0.913         46812.
## 3 Austria                0.93          56034.
## 4 Belarus                0.824          7897.
## 5 Belgium                0.951         54690.
## 6 Bosnia and Herzegovina 0.804          8663.
summary(europe)
##    Country               hdi         GDP_per_capita  
##  Length:42          Min.   :0.7790   Min.   :  5140  
##  Class :character   1st Qu.:0.8640   1st Qu.: 19270  
##  Mode  :character   Median :0.9130   Median : 32101  
##                     Mean   :0.8987   Mean   : 43156  
##                     3rd Qu.:0.9440   3rd Qu.: 54503  
##                     Max.   :0.9720   Max.   :207974

2.2. Visualization of the Data

# Making a plot:
ggplot(europe, aes(x = hdi, y = GDP_per_capita)) +
  geom_point(color = "black", size = 1.5) +
  labs(x = "HDI", y = "GDP per capita") +
  theme_gray(base_size = 14) +
  theme(panel.grid.major = element_line(color = "gray80"),
        panel.grid.minor = element_blank())

2.3. Standardization of Variables

Because GDP per capita and HDI are measured on very different scales, the variables must be standardized before applying the distance-based clustering algorithms used in this study.

# Standardizing the variables using Z-score standardization:
data <- europe[, 2:3]
data_z <- as.data.frame(lapply(data, scale))
head(data_z)
##          hdi GDP_per_capita
## 1 -1.6355506    -0.88692481
## 2  0.2633737     0.09379253
## 3  0.5767884     0.33029765
## 4 -1.3774444    -0.90432506
## 5  0.9639477     0.29583980
## 6 -1.7461676    -0.88467162

2.4. Checking Dataset Clusterability

It is important to check whether the data are clusterable at all. For this purpose, the Hopkins statistic is calculated: values close to 1 indicate a strong clustering tendency, while values around 0.5 correspond to randomly (uniformly) distributed data. The “hopkins” library is used, as it is actively maintained and designed to align with modern clustering approaches.

#Checking dataset clusterability using Hopkins statistic
data_matrix <- as.matrix(data_z)
set.seed(123)
hopkins_stat <- round(hopkins(data_matrix), digits=2)
print(hopkins_stat)
## [1] 0.95

The value of the Hopkins statistic is 0.95, which indicates a strong clustering tendency in the data.

Chapter 3. K-means Algorithm

The first algorithm used to cluster the data is k-means. It works as follows:

  1. Choose the number of clusters k.
  2. Initialize k cluster centers (centroids).
  3. Assign each data point to the nearest centroid to form clusters.
  4. Recalculate each centroid as the mean of all points assigned to it.
  5. Repeat steps 3–4 until cluster assignments no longer change or a maximum number of iterations is reached.

The ultimate goal of the algorithm is to minimize the total within-cluster sum of squares, i.e. the sum of squared distances between each point and the centroid of its cluster.
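
As an illustration of steps 3 and 4, the sketch below performs a single k-means iteration by hand on the standardized data, assuming k = 4 and randomly chosen initial centroids. It is only meant to clarify the mechanics; the actual model is fitted later with the kmeans() function.

# Sketch of one k-means (Lloyd) iteration, for illustration only:
set.seed(1)
k <- 4
centroids <- data_z[sample(nrow(data_z), k), ]
# Step 3: assign every country to its nearest centroid (Euclidean distance)
dist_to_centroids <- as.matrix(dist(rbind(centroids, data_z)))[-(1:k), 1:k]
assignment <- apply(dist_to_centroids, 1, which.min)
# Step 4: recompute each centroid as the mean of the points assigned to it
centroids <- aggregate(data_z, by = list(cluster = assignment), FUN = mean)[, -1]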

3.1. Finding the Optimal Number of Clusters

The first step is to determine the optimal number of clusters. For this purpose, both the elbow method and the silhouette method are applied:

# Finding the optimal number of clusters using elbow method and silhouette method.
elb_meth_kmeans<-fviz_nbclust(data_z, kmeans, method = "wss")
sil_meth_kmeans<-fviz_nbclust(data_z, kmeans, method = "silhouette")
grid.arrange(elb_meth_kmeans, sil_meth_kmeans,
             nrow=1,
             top = textGrob("Optimal number of clusters for k-means",
                            gp = gpar(fontsize = 16, fontface = "bold")))

Both methods suggest that 4 is the optimal number of clusters. However, k=3 does not appear to be much worse, so it will also be considered. Cluster quality measures, namely the average silhouette width and the Calinski-Harabasz index, will be calculated to determine which number of clusters fits the data better; that value will be chosen as the final solution.
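
For reference, the Calinski-Harabasz index is the ratio of between-cluster to within-cluster dispersion, scaled by the corresponding degrees of freedom; higher values indicate more compact and better separated clusters. A minimal hand-written sketch of the index is shown below; the analysis itself relies on calinhara() from the fpc package, which should give essentially the same values.

# Sketch: Calinski-Harabasz index computed by hand (for illustration only)
ch_index <- function(x, cl) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- length(unique(cl))
  overall <- colMeans(x)
  groups <- split(as.data.frame(x), cl)
  # between-cluster dispersion: cluster sizes times squared distances of
  # the cluster means from the overall mean
  bss <- sum(sapply(groups, function(g) nrow(g) * sum((colMeans(g) - overall)^2)))
  # within-cluster dispersion: squared distances of points from their cluster mean
  wss <- sum(sapply(groups, function(g) sum(scale(g, scale = FALSE)^2)))
  (bss / (k - 1)) / (wss / (n - k))
}

Once the model is fitted in the next subsection, ch_index(data_z, k_means_4$cluster) should closely match the calinhara() value reported there.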

3.2. K-means for k=4

#K-means algorithm for k=4:
set.seed(123)
k_means_4 <- kmeans(data_z, centers=4, nstart=25)

#Average silhouette width for k=4:
sil_kmeans_4 <- silhouette(k_means_4$cluster, dist(data_z))
fviz_silhouette(sil_kmeans_4)
##   cluster size ave.sil.width
## 1       1   11          0.61
## 2       2   12          0.49
## 3       3   17          0.60
## 4       4    2          0.17

#Calinski-Harabasz index for k=4:
round(calinhara(data_z, k_means_4$cluster), digits = 2)
## [1] 93.23

The overall average silhouette width (the weighted mean of the per-cluster values shown above) is 0.55, which means the clusters are reasonably well separated. The Calinski-Harabasz index is equal to 93.23.
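
As a sketch, this overall value can be verified directly from the silhouette object computed above:

# Sketch: overall average silhouette width for k=4
round(mean(sil_kmeans_4[, "sil_width"]), digits = 2)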

3.3. K-means for k=3

#K-means algorithm for k=3:
set.seed(123)
k_means_3 <- kmeans(data_z, centers=3, nstart=25)

#Average silhouette width for k=3:
sil_kmeans_3 <- silhouette(k_means_3$cluster, dist(data_z))
fviz_silhouette(sil_kmeans_3)
##   cluster size ave.sil.width
## 1       1   12          0.63
## 2       2    6          0.32
## 3       3   24          0.54

#Calinski-Harabasz index for k=3:
round(calinhara(data_z, k_means_3$cluster), digits = 2)
## [1] 63.73

Both the average silhouette width and the Calinski-Harabasz index are higher for k=4, suggesting that it produces more distinct and compact clusters compared to k=3. Therefore, k=4 is chosen as the final number of clusters for the K-means algorithm.

3.4. Visualization of the Clusters

# Visualization of the clusters:
plot_kmeans <- fviz_cluster(k_means_4, data = data_z, frame = FALSE, geom = "point", main = "K-means Algorithm Results (k=4)")
plot_kmeans

It is now time to assign clusters to each country in the dataset.

# Assigning number of k-means cluster to each observation in "europe" dataset:
europe$cluster_kmeans <- k_means_4$cluster
head(europe)
## # A tibble: 6 × 4
##   Country                  hdi GDP_per_capita cluster_kmeans
##   <chr>                  <dbl>          <dbl>          <int>
## 1 Albania                0.81           8575.              1
## 2 Andorra                0.913         46812.              3
## 3 Austria                0.93          56034.              2
## 4 Belarus                0.824          7897.              1
## 5 Belgium                0.951         54690.              2
## 6 Bosnia and Herzegovina 0.804          8663.              1

Chapter 4. PAM (Partitioning Around Medoids) Algorithm

The PAM (Partitioning Around Medoids) algorithm is the next method applied in this study. It works as follows:

  1. Choose number of clusters k.
  2. Select initial medoids – pick k representative points (medoids) from the dataset.
  3. Assign points to clusters – each data point is assigned to the nearest medoid.
  4. Update medoids – for each cluster, choose the point that minimizes the total dissimilarity within the cluster as the new medoid.
  5. Repeat steps 3–4 – continue until medoids no longer change or the total dissimilarity is minimized.

The ultimate goal is to minimize the sum of dissimilarities between points and their assigned medoids. PAM is more robust than k-means in the presence of noise and outliers because using medoids is less influenced by outliers or other extreme values than calculating mean values like in k-means.
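
As a rough sketch of this objective (not part of the original analysis), the total cost PAM minimizes can be written as a small helper function; cl and med_idx are hypothetical placeholders for a vector of cluster assignments and the row indices of the medoids.

# Hypothetical helper: total dissimilarity between each point and the medoid of
# its assigned cluster -- the quantity PAM tries to minimize
pam_cost <- function(x, cl, med_idx) {
  d <- as.matrix(dist(x))
  sum(d[cbind(seq_len(nrow(d)), med_idx[cl])])
}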

4.1. Finding the Optimal Number of Clusters

The elbow method and the silhouette method are applied to find the optimal number of clusters:

elb_meth_pam<-fviz_nbclust(data_z, pam, method = "wss")
sil_meth_pam<-fviz_nbclust(data_z, pam, method = "silhouette")
grid.arrange(elb_meth_pam, sil_meth_pam,
             nrow=1,
             top = textGrob("Optimal number of clusters for PAM",
                            gp = gpar(fontsize = 16, fontface = "bold")))

Both the elbow and silhouette methods again suggest that 4 is the optimal number of clusters. However, as before, k=3 will also be considered, and cluster quality measures will be calculated to determine which k actually fits the data better.

4.2. PAM for k=4

#PAM algorithm for k=4:
set.seed(123)
pam_4 <- pam(data_z, 4)

#Average silhouette width for k=4:
sil_pam_4 <- silhouette(pam_4$clustering, dist(data_z))
fviz_silhouette(sil_pam_4)
##   cluster size ave.sil.width
## 1       1   11          0.61
## 2       2   17          0.62
## 3       3   13          0.39
## 4       4    1          0.00

#Calinski-Harabasz index for k=4:
round(calinhara(data_z, pam_4$clustering), digits = 2)
## [1] 85.63

The average silhouette width is 0.53, suggesting that the clusters are reasonably well-separated. However, it is slightly lower than that obtained with K-means (0.55).
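
Since PAM represents every cluster by an actual observation, it is also possible to inspect which countries serve as the cluster prototypes. The sketch below uses the id.med component of the fitted pam object, which stores the row indices of the medoids.

# Sketch: countries acting as cluster medoids for k=4
europe$Country[pam_4$id.med]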

4.3. PAM for k=3

#PAM algorithm for k=3:
set.seed(123)
pam_3 <- pam(data_z, 3)

#Average silhouette width for k=3:
sil_pam_3 <- silhouette(pam_3$clustering, dist(data_z))
fviz_silhouette(sil_pam_3)
##   cluster size ave.sil.width
## 1       1   11          0.61
## 2       2   17          0.65
## 3       3   14          0.25

#Calinski-Harabasz index for k=3:
round(calinhara(data_z, pam_3$clustering), digits = 2)
## [1] 54.37

Both the average silhouette width and the Calinski-Harabasz index are higher for k=4. Therefore, k=4 is chosen as the final number of clusters for the PAM algorithm.

4.4. Visualization of the Clusters

#Visualization of the clusters:
plot_pam <- fviz_cluster(pam_4, data = data_z, frame = FALSE, geom = "point", main = "PAM Algorithm Results (k=4)")
plot_pam

It can be observed that, compared to K-means, similar clusters have different numbers assigned to them. For example, cluster 2 of K-means corresponds almost exactly to cluster 3 of PAM, and vice versa. The specific numbering of the clusters is therefore arbitrary and carries no meaning. What matters is the actual grouping, as the results of each algorithm will ultimately be compared using a colored map of Europe.
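
A quick cross-tabulation of the two label vectors (a sketch, not part of the original output) makes this label permutation explicit: large counts concentrated in a few cells indicate that the partitions largely coincide, just with different cluster numbers.

# Sketch: cross-tabulation of k-means and PAM cluster labels
table(kmeans = k_means_4$cluster, pam = pam_4$clustering)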

It is now time to assign clusters to each country in the dataset.

#Assigning PAM clusters to each observation in the dataset:
europe$cluster_pam <- pam_4$clustering
head(europe)
## # A tibble: 6 × 5
##   Country                  hdi GDP_per_capita cluster_kmeans cluster_pam
##   <chr>                  <dbl>          <dbl>          <int>       <int>
## 1 Albania                0.81           8575.              1           1
## 2 Andorra                0.913         46812.              3           2
## 3 Austria                0.93          56034.              2           3
## 4 Belarus                0.824          7897.              1           1
## 5 Belgium                0.951         54690.              2           3
## 6 Bosnia and Herzegovina 0.804          8663.              1           1

Chapter 5. Hierarchical Clustering (Agglomerative and Divisive Approaches)

The next algorithm used in this study is hierarchical clustering, which has two approaches: agglomerative and divisive. Unlike the previous algorithms (K-means and PAM), hierarchical clustering does not require specifying the desired number of clusters at the beginning.

5.1. Agglomerative Approach

The agglomerative approach works as follows:

  1. Start with individual points – treat each data point as its own cluster.
  2. Compute distances – calculate the distance (dissimilarity) between all clusters.
  3. Merge closest clusters – combine the two clusters with the smallest distance.
  4. Update distances – recalculate distances between the new cluster and all remaining clusters using one of the linkage methods.
  5. Repeat steps 3–4 – continue merging until all points are in a single cluster.
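
To illustrate the very first step, the sketch below finds the pair of countries that would be merged first, i.e. the two observations with the smallest pairwise Euclidean distance in the standardized data (this holds for any linkage method, since all clusters are initially singletons).

# Sketch: the two countries merged in the first agglomerative step
d_mat <- as.matrix(dist(data_z))
diag(d_mat) <- Inf
closest <- which(d_mat == min(d_mat), arr.ind = TRUE)[1, ]
europe$Country[closest]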

5.1.1. Choosing the Linkage Method and Running the Agglomerative Algorithm

First, let’s compare the agglomerative coefficients (AC) across different linkage methods to choose the best one.

#Comparing agglomerative coefficients (AC) across different linkage methods:
m <- c( "average", "single", "complete", "ward")
ac <- function(x) {
  agnes(data_z, method = x)$ac
}
map_dbl(m, ac)
## [1] 0.9346212 0.9037888 0.9462390 0.9682194

The “Ward” method has the highest agglomerative coefficient (~0.97), suggesting that it is the best linkage method.

#Hierarchical clustering (agglomerative approach) with "Ward" linkage method:
set.seed(123)
hc_agg <- agnes(data_z, method = "ward")

#Calculating agglomerative coefficient
round(hc_agg$ac, digits=2)
## [1] 0.97

The agglomerative coefficient is 0.97. Such a high value indicates a strong clustering structure in the data.

5.1.2. Creating the Dendrogram

#Dendrogram for agglomerative approach:
pltree(hc_agg, cex = 0.6, hang = -1, main = "Hierarchical clustering (agglomerative approach)")

5.1.3. Cutting the Dendrogram into Desired Number of Clusters

Now, it’s time to cut the dendrogram into three and four clusters and compute cluster quality metrics to assess the most appropriate number of clusters.

#4 clusters:
hc_agg_cut_4 <- cutree(hc_agg, k = 4)

#Average silhouette width for k=4:
sil_agg_4 <- silhouette(hc_agg_cut_4, dist(data_z))
fviz_silhouette(sil_agg_4)
##   cluster size ave.sil.width
## 1       1   11          0.62
## 2       2   18          0.58
## 3       3   12          0.41
## 4       4    1          0.00

#Calinski-Harabasz index for k=4:
round(calinhara(data_z, hc_agg_cut_4), digits = 2)
## [1] 84.08
#3 clusters:
hc_agg_cut_3 <- cutree(hc_agg, k = 3)

#Average silhouette width for k=3:
sil_agg_3 <- silhouette(hc_agg_cut_3, dist(data_z))
fviz_silhouette(sil_agg_3)
##   cluster size ave.sil.width
## 1       1   11          0.62
## 2       2   18          0.62
## 3       3   13          0.25

#Calinski-Harabasz index for k=3:
round(calinhara(data_z, hc_agg_cut_3), digits = 2)
## [1] 54.76

Both the average silhouette width and the Calinski-Harabasz index are higher for k=4. Therefore, k=4 is chosen as the final number of clusters for the agglomerative hierarchical algorithm.

5.1.4. Visualization of the Clusters

#Visualization of the clusters:
plot_agg <- fviz_cluster(list(data = data_z, cluster = hc_agg_cut_4), labelsize = 0, main = "Hierarchical Clustering: Agglomerative Approach Results For k=4")
plot_agg

Time to assign the clusters to each country in the dataset.

europe <- europe %>%
  mutate(cluster_agg = hc_agg_cut_4)
head(europe)
## # A tibble: 6 × 6
##   Country              hdi GDP_per_capita cluster_kmeans cluster_pam cluster_agg
##   <chr>              <dbl>          <dbl>          <int>       <int>       <int>
## 1 Albania            0.81           8575.              1           1           1
## 2 Andorra            0.913         46812.              3           2           2
## 3 Austria            0.93          56034.              2           3           2
## 4 Belarus            0.824          7897.              1           1           1
## 5 Belgium            0.951         54690.              2           3           3
## 6 Bosnia and Herzeg… 0.804          8663.              1           1           1

5.2. Divisive Approach

The divisive approach works as follows:

  1. Start with all data in one cluster – all points belong to a single group.
  2. Compute dissimilarities – measure distances between data points.
  3. Split the cluster – divide the current cluster into two smaller clusters to maximize their separation.
  4. Select a cluster to split next – usually the one with the largest internal dissimilarity.
  5. Repeat splitting – continue until each point is its own cluster.

5.2.1. Running the Divisive Algorithm

#Divisive approach:
set.seed(123)
hc_div <- diana(data_z)

#Divisive coefficient:
round(hc_div$dc, digits = 2)
## [1] 0.95

The divisive coefficient is equal to 0.95, which indicates that the divisive clustering also captures a strong structure in the data.
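
As an optional cross-check (a sketch using the already-loaded dendextend package), the cophenetic correlation between the agglomerative and divisive trees can be computed; a value close to 1 would mean that the two hierarchies arrange the countries in a very similar way.

# Sketch: cophenetic correlation between the agglomerative and divisive hierarchies
dend_agg <- as.dendrogram(as.hclust(hc_agg))
dend_div <- as.dendrogram(as.hclust(hc_div))
cor_cophenetic(dend_agg, dend_div)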

5.2.2. Creating the Dendrogram

#Dendrogram of the divisive approach:
pltree(hc_div, cex = 0.6, hang = -1, main = "Hierarchical clustering (divisive approach)")

5.2.3. Cutting the Dendrogram into Desired Number of Clusters

Now, it is time to cut the dendrogram into three and four clusters and compute cluster quality metrics to assess the most appropriate number of clusters.

#Cutting the tree into 4 and 3 clusters to see which number is more optimal:
hc_div_cut_4 <- cutree(hc_div, k=4)

#Average silhouette width for k=4:
sil_div_4 <- silhouette(hc_div_cut_4, dist(data_z))
fviz_silhouette(sil_div_4)
##   cluster size ave.sil.width
## 1       1   11          0.61
## 2       2   17          0.62
## 3       3   13          0.39
## 4       4    1          0.00

#Calinski-Harabasz index for k=4:
round(calinhara(data_z, hc_div_cut_4), digits = 2)
## [1] 85.63
#Cutting the tree into 3 clusters:
hc_div_cut_3 <- cutree(hc_div, k=3)

#Average silhouette width for k=3:
sil_div_3 <- silhouette(hc_div_cut_3, dist(data_z))
fviz_silhouette(sil_div_3)
##   cluster size ave.sil.width
## 1       1   28          0.43
## 2       2   13          0.57
## 3       3    1          0.00

#Calinski-Harabasz index for k=3:
round(calinhara(data_z, hc_div_cut_3), digits = 2)
## [1] 37.22

Again, both the average silhouette width and the Calinski-Harabasz index are clearly higher for k=4. Therefore, k=4 is chosen as the final number of clusters for the divisive hierarchical algorithm.

5.2.4. Visualization of the Clusters

#Visualization of the clusters:
plot_div <- fviz_cluster(list(data = data_z, cluster = hc_div_cut_4), labelsize = 0, main = "Hierarchical Clustering: Divisive Approach Results For k=4")
plot_div

Assigning the clusters to each country in the dataset:

europe <- europe %>%
  mutate(cluster_div = hc_div_cut_4)
head(europe)
## # A tibble: 6 × 7
##   Country              hdi GDP_per_capita cluster_kmeans cluster_pam cluster_agg
##   <chr>              <dbl>          <dbl>          <int>       <int>       <int>
## 1 Albania            0.81           8575.              1           1           1
## 2 Andorra            0.913         46812.              3           2           2
## 3 Austria            0.93          56034.              2           3           2
## 4 Belarus            0.824          7897.              1           1           1
## 5 Belgium            0.951         54690.              2           3           3
## 6 Bosnia and Herzeg… 0.804          8663.              1           1           1
## # ℹ 1 more variable: cluster_div <int>

All four algorithms have been applied to cluster the data. Now, it is time to compare the obtained results to assess whether they are robust to changes in the method and, therefore, whether the clustering is actually meaningful.

Chapter 6. Clustering Results: Comparison, Interpretation and Visualization of Similarities and Differences

6.1. Comparison of Cluster Quality Measures

#Comparison of cluster quality measures:
algorithm <- c("K-means", "PAM", "Agglomerative hierarchical", "Divisive hierarchical")
avg_sil_width <- c(0.55, 0.53, 0.53, 0.53)
cal_har_index <- c(round(calinhara(data_z, k_means_4$cluster), digits = 2),
                   round(calinhara(data_z, pam_4$clustering), digits = 2),
                   round(calinhara(data_z, hc_agg_cut_4), digits = 2),
                   round(calinhara(data_z, hc_div_cut_4), digits = 2))
quality_stat_table <- cbind(algorithm, avg_sil_width, cal_har_index)
quality_stat_table
##      algorithm                    avg_sil_width cal_har_index
## [1,] "K-means"                    "0.55"        "93.23"      
## [2,] "PAM"                        "0.53"        "85.63"      
## [3,] "Agglomerative hierarchical" "0.53"        "84.08"      
## [4,] "Divisive hierarchical"      "0.53"        "85.63"

K-means achieved the highest average silhouette width and Calinski–Harabasz index, indicating the most compact and well-separated clusters. The remaining methods (PAM and both hierarchical approaches) produced similar results, with slightly lower cluster quality.

6.2. Comparison of the Plots

#Making the plot:
plot1 <- fviz_cluster(k_means_4, data = data_z, frame = FALSE, geom = "point", main = "K-means (k=4)")
plot2 <- fviz_cluster(pam_4, data = data_z, frame = FALSE, geom = "point", main = "PAM (k=4)")
plot3 <- fviz_cluster(list(data = data_z, cluster = hc_agg_cut_4), labelsize = 0, main = "Hierarchical Agglomerative (k=4)")
plot4 <- fviz_cluster(list(data = data_z, cluster = hc_div_cut_4), labelsize = 0, main = "Hierarchical Divisive (k=4)")
grid.arrange(plot1, plot2, plot3, plot4, nrow = 2)

The four plots are clearly very similar, and the differences between them are small. This suggests that the results are robust to the choice of clustering algorithm, which makes them more reliable.
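
This visual agreement can also be quantified with the already-loaded fpc package (a sketch, not part of the original analysis): the corrected (adjusted) Rand index equals 1 for identical partitions and is close to 0 for unrelated ones.

# Sketch: corrected Rand index between the k-means partition and the other three
d <- dist(data_z)
round(c(
  kmeans_vs_pam = cluster.stats(d, k_means_4$cluster, pam_4$clustering)$corrected.rand,
  kmeans_vs_agg = cluster.stats(d, k_means_4$cluster, hc_agg_cut_4)$corrected.rand,
  kmeans_vs_div = cluster.stats(d, k_means_4$cluster, hc_div_cut_4)$corrected.rand
), digits = 2)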

Now, let us take a more detailed look at the four plots. It can be observed that Cluster 1 consistently represents the least developed countries, characterized by the lowest GDP per capita and HDI values.

Cluster 3 in the K-means algorithm and Cluster 2 in the PAM, hierarchical agglomerative, and hierarchical divisive algorithms generally represent moderately developed countries, with moderate values of GDP per capita and HDI.

Cluster 2 in the K-means algorithm and Cluster 3 in the PAM, hierarchical agglomerative, and hierarchical divisive algorithms generally represent highly developed countries, characterized by high GDP per capita and HDI.

Cluster 4 represents one or two outliers. In this case, these countries are Liechtenstein and Luxembourg, which can be described as the most developed countries in the dataset.

6.3. Visualization of the Clusters on the Map of Europe

Now, let us prepare a colored map of Europe for all four algorithms to visualize which countries belong to which clusters. Since the clusters differ in names and colors across the algorithms, the same colors will be assigned to the same types of clusters as described above for consistency. The “maps” library will be used for this visualization.

map_countries <- unique(maps::map("world", plot = FALSE)$names)
setdiff(europe$Country, map_countries)
## [1] "Czechia"        "United Kingdom"

Since the names of Czechia and the United Kingdom differ in the “map_countries” dataset (where they appear as Czech Republic and UK), it is necessary to modify these names in the “europe” dataset in order to merge the two datasets correctly. In addition, only the European part of Russia will be shown on the map.

europe$Country[europe$Country == "Czechia"] <- "Czech Republic"
europe$Country[europe$Country == "United Kingdom"] <- "UK"

europe_map <- map_data("world", region = europe$Country)

europe_map <- europe_map %>%
  left_join(europe, by = c("region" = "Country"))

#Including only European part of Russia
europe_map <- europe_map %>% 
  filter(!(region == "Russia" & long > 60))

As mentioned before, the cluster numbers (and therefore the default colors) assigned to the same groups of countries can vary slightly across algorithms, so the same color is assigned to the same type of cluster for consistency.

#For k-means:
my_cols_kmeans <- c(
  "1" = "#FF0000",
  "2" = "#FFA500",   
  "3" = "#FFFF00",   
  "4" = "#FF69B4"    
)
MAP_KMEANS <- ggplot(europe_map, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = factor(cluster_kmeans)), colour = "black") +
  scale_fill_manual(values = my_cols_kmeans) +
  coord_cartesian(xlim = c(-25, 45), ylim = c(34, 72)) +
  theme_minimal() +
  labs(fill = "Cluster", title = "K-means")


my_cols_rest <- c(
  "1" = "#FF0000",
  "2" = "#FFFF00",   
  "3" = "#FFA500",   
  "4" = "#FF69B4"    
)

#For PAM:
MAP_PAM <- ggplot(europe_map, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = factor(cluster_pam)), colour = "black") +
  scale_fill_manual(values = my_cols_rest) +
  coord_cartesian(xlim = c(-25, 45), ylim = c(34, 72)) +
  theme_minimal() +
  labs(fill = "Cluster", title = "PAM")

#For hierarchical agglomerative:
MAP_AGG <- ggplot(europe_map, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = factor(cluster_agg)), colour = "black") +
  scale_fill_manual(values = my_cols_rest) +
  coord_cartesian(xlim = c(-25, 45), ylim = c(34, 72)) +
  theme_minimal() +
  labs(fill = "Cluster", title = "Hierarchical Agglomerative")

#For hierarchical divisive:
MAP_DIV <- ggplot(europe_map, aes(long, lat, group = group)) +
  geom_polygon(aes(fill = factor(cluster_div)), colour = "black") +
  scale_fill_manual(values = my_cols_rest) +
  coord_cartesian(xlim = c(-25, 45), ylim = c(34, 72)) +
  theme_minimal() +
  labs(fill = "Cluster", title = "Hierarchical Divisive")


grid.arrange(MAP_KMEANS, MAP_PAM, MAP_AGG, MAP_DIV, nrow = 2)

Looking at the maps, it becomes even more apparent that the differences between the algorithms are very small.

What is particularly interesting is the visible geographical pattern of the clusters. The red clusters (least developed) include countries that were part of the former Soviet Union as well as most of the Balkans. The yellow clusters (moderately developed) include countries of Central Europe and some Western European countries such as France, Portugal, and Spain. The orange clusters (highly developed) consist mainly of Western and Northern European countries. Finally, the pink clusters (most developed) include outliers such as Liechtenstein and Luxembourg.

6.4. Which Countries Belong to Which Cluster?

6.4.1. Least Developed Countries (Red Cluster)

All algorithms produced exactly the same red cluster. The countries classified as least developed are:

  1. Albania
  2. Belarus
  3. Bosnia and Herzegovina
  4. Bulgaria
  5. Moldova
  6. Montenegro
  7. North Macedonia
  8. Romania
  9. Russia
  10. Serbia
  11. Ukraine

6.4.2. Moderately Developed Countries (Yellow Cluster)

The only difference across the algorithms is that the agglomerative hierarchical algorithm assigns Austria to this cluster, while the other algorithms do not. Since the agglomerative hierarchical algorithm has the lowest Calinski–Harabasz index among all the applied algorithms, it is reasonable to conclude that Austria is better placed outside this group.

In general, the countries that can be described as moderately developed are:

  1. Andorra
  2. Croatia
  3. Cyprus
  4. Czechia
  5. Estonia
  6. France
  7. Greece
  8. Hungary
  9. Italy
  10. Latvia
  11. Lithuania
  12. Malta
  13. Poland
  14. Portugal
  15. Slovakia
  16. Slovenia
  17. Spain

6.4.3. Highly Developed Countries (Orange Cluster)

There are some very small differences between the results. Austria is classified as a highly developed country by almost all algorithms, except for the agglomerative hierarchical algorithm, which assigns it to the moderately developed group. Similarly, Luxembourg is classified as highly developed by most algorithms, except for K-means, which assigns it to the most developed cluster.

In general, the countries that can be described as highly developed are:

  1. Austria
  2. Belgium
  3. Denmark
  4. Finland
  5. Germany
  6. Iceland
  7. Ireland
  8. Luxembourg (assigned to this cluster by every algorithm except K-means)
  9. Netherlands
  10. Norway
  11. Sweden
  12. Switzerland
  13. United Kingdom

6.4.4. Most Developed Countries (Pink Cluster)

K-means is the only algorithm that assigned both Liechtenstein and Luxembourg to this cluster. The other algorithms created this separate cluster only for Liechtenstein.

Summary and Conclusions

This study aimed to identify patterns of similarity among European countries based on their level of development using two indicators: GDP per capita and the Human Development Index (HDI). Four clustering approaches were applied and compared: K-means, Partitioning Around Medoids, Hierarchical Agglomerative Clustering and Hierarchical Divisive Clustering. All algorithms produced similar and consistent results, confirming the robustness of the clustering structure, with K-means achieving slightly higher cluster quality measures than the other methods.

The analysis shows that European countries can be divided into four development groups.

The cluster of least developed countries largely consists of post-socialist countries, former Soviet states, and Balkan countries, which explains their similar economic and social development patterns.

The moderately developed cluster mainly consists of Central and Southern European countries, as well as some Western European countries, characterized by medium levels of GDP per capita and HDI. The highly developed cluster consists primarily of Western and Northern European countries, known for strong economies, high institutional quality, and advanced welfare systems.

Liechtenstein consistently forms its own cluster due to its exceptionally high GDP per capita, making it a clear outlier among European countries. Luxembourg sometimes joins this cluster depending on the algorithm, but Liechtenstein remains the most extreme case.

These results show clear economic differences between European countries. The geographical distribution of the clusters confirms that some parts of Europe are generally less developed than others.