Introduction

In this project I group European countries by their economic characteristics to find out which countries resemble each other. The data come from the World Bank (https://data.worldbank.org/).

The main questions:

- How many groups (clusters) exist?
- Which countries belong together?
- What makes each group different?

Loading Libraries and Data

# install if needed
if (!require("cluster")) install.packages("cluster")
if (!require("factoextra")) install.packages("factoextra")
if (!require("corrplot")) install.packages("corrplot")

library(cluster)
library(factoextra)
library(corrplot)

# data from World Bank 2022
europe <- data.frame(
  Country = c("Albania", "Austria", "Belgium", "Bulgaria", "Croatia", 
              "Czech Republic", "Denmark", "Estonia", "Finland", "France",
              "Germany", "Greece", "Hungary", "Ireland", "Italy",
              "Latvia", "Lithuania", "Netherlands", "Norway", "Poland",
              "Portugal", "Romania", "Slovakia", "Slovenia", "Spain",
              "Sweden", "Switzerland", "United Kingdom"),
  GDP_per_capita = c(6810, 52085, 49582, 13772, 18570, 27220, 67803, 28247, 
                     50732, 40886, 48636, 20867, 18390, 103685, 34776, 21947, 
                     24032, 57025, 106149, 18688, 24521, 15892, 21088, 28439, 
                     29674, 56424, 93259, 45295),
  Life_expectancy = c(76.5, 81.3, 81.9, 74.8, 78.1, 78.3, 81.4, 78.6, 82.0, 
                      82.5, 80.6, 80.1, 76.2, 82.0, 82.9, 75.3, 75.7, 81.4, 
                      83.2, 77.0, 81.1, 74.2, 77.0, 80.6, 83.0, 83.0, 84.0, 80.4),
  Unemployment = c(11.0, 4.8, 5.6, 4.3, 7.0, 2.2, 4.5, 5.6, 6.8, 7.3, 3.1, 
                   12.4, 3.6, 4.5, 8.1, 6.9, 5.9, 3.5, 3.2, 2.9, 6.0, 5.6, 
                   6.1, 4.0, 12.9, 7.5, 4.3, 3.7),
  Education = c(58, 89, 79, 79, 67, 65, 82, 70, 93, 67, 72, 143, 52, 80, 64, 
                88, 72, 90, 84, 67, 66, 52, 60, 82, 93, 78, 63, 62),
  Internet_users = c(79, 93, 92, 80, 82, 88, 98, 92, 93, 86, 93, 83, 89, 92, 
                     85, 90, 87, 93, 98, 87, 84, 84, 90, 89, 94, 95, 96, 97),
  CO2_emissions = c(1.8, 7.1, 8.0, 5.7, 4.3, 9.3, 5.1, 8.4, 7.5, 4.6, 8.1, 
                    5.7, 4.9, 7.7, 5.3, 3.8, 4.5, 8.8, 7.5, 8.1, 4.4, 3.7, 
                    5.8, 6.1, 5.1, 3.8, 4.0, 5.2)
)
rownames(europe) <- europe$Country

# quick look at data
head(europe)

##                       Country GDP_per_capita Life_expectancy Unemployment
## Albania               Albania           6810            76.5         11.0
## Austria               Austria          52085            81.3          4.8
## Belgium               Belgium          49582            81.9          5.6
## Bulgaria             Bulgaria          13772            74.8          4.3
## Croatia               Croatia          18570            78.1          7.0
## Czech Republic Czech Republic          27220            78.3          2.2
##                Education Internet_users CO2_emissions
## Albania               58             79           1.8
## Austria               89             93           7.1
## Belgium               79             92           8.0
## Bulgaria              79             80           5.7
## Croatia               67             82           4.3
## Czech Republic        65             88           9.3

summary(europe[,-1])

##  GDP_per_capita   Life_expectancy  Unemployment      Education     
##  Min.   :  6810   Min.   :74.20   Min.   : 2.200   Min.   : 52.00  
##  1st Qu.: 21033   1st Qu.:77.00   1st Qu.: 3.925   1st Qu.: 64.75  
##  Median : 29056   Median :80.60   Median : 5.600   Median : 72.00  
##  Mean   : 40160   Mean   :79.75   Mean   : 5.832   Mean   : 75.61  
##  3rd Qu.: 51070   3rd Qu.:82.00   3rd Qu.: 6.925   3rd Qu.: 82.50  
##  Max.   :106149   Max.   :84.00   Max.   :12.900   Max.   :143.00  
##  Internet_users  CO2_emissions  
##  Min.   :79.00   Min.   :1.800  
##  1st Qu.:85.75   1st Qu.:4.475  
##  Median :90.00   Median :5.500  
##  Mean   :89.61   Mean   :5.868  
##  3rd Qu.:93.00   3rd Qu.:7.550  
##  Max.   :98.00   Max.   :9.300

Correlation Check

Before clustering, let's see whether the variables are related.

cor_matrix <- cor(europe[,-1])
corrplot(cor_matrix, method = "number", type = "upper")

GDP per capita and life expectancy are positively correlated, which makes sense: richer countries usually have better healthcare.
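As a quick numeric check of that reading, the sketch below computes the correlation directly. The two vectors are the GDP_per_capita and Life_expectancy columns from the data frame above, copied so the snippet runs on its own; the names `gdp`, `life`, `r_pearson` and `r_spearman` are only illustrative.

```r
# GDP per capita and life expectancy, copied from the europe data frame above
gdp <- c(6810, 52085, 49582, 13772, 18570, 27220, 67803, 28247,
         50732, 40886, 48636, 20867, 18390, 103685, 34776, 21947,
         24032, 57025, 106149, 18688, 24521, 15892, 21088, 28439,
         29674, 56424, 93259, 45295)
life <- c(76.5, 81.3, 81.9, 74.8, 78.1, 78.3, 81.4, 78.6, 82.0,
          82.5, 80.6, 80.1, 76.2, 82.0, 82.9, 75.3, 75.7, 81.4,
          83.2, 77.0, 81.1, 74.2, 77.0, 80.6, 83.0, 83.0, 84.0, 80.4)

# Pearson measures linear association; Spearman uses ranks, so it is
# less affected by the few very high GDP values
r_pearson  <- cor(gdp, life, method = "pearson")
r_spearman <- cor(gdp, life, method = "spearman")
round(c(pearson = r_pearson, spearman = r_spearman), 2)
```

Comparing the two coefficients gives a rough idea of how much the very rich countries drive the linear correlation.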

Data Standardization

The variables are on different scales (GDP per capita in thousands, others in percentages), so we need to standardize them. Otherwise GDP would dominate the distance calculations.

data_scaled <- scale(europe[,-1])
head(data_scaled)

##                GDP_per_capita Life_expectancy Unemployment  Education
## Albania            -1.2598410      -1.1125061   1.91125058 -0.9912533
## Austria             0.4504572       0.5287762  -0.38172178  0.7539959
## Belgium             0.3559045       0.7339365  -0.08585438  0.1910123
## Bulgaria           -0.9968461      -1.6937935  -0.56663891  0.1910123
## Croatia            -0.8155979      -0.5654120   0.43191357 -0.4845681
## Czech Republic     -0.4888374      -0.4970252  -1.34329083 -0.5971648
##                Internet_users CO2_emissions
## Albania            -1.9879612   -2.15864455
## Austria             0.6358799    0.65384756
## Belgium             0.4484626    1.13144056
## Bulgaria           -1.8005440   -0.08907488
## Croatia            -1.4257096   -0.83199733
## Czech Republic     -0.3012062    1.82129711
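To make clear what `scale()` does, here is a small sketch that standardizes a single column by hand and compares it with `scale()`. It uses only the six GDP values shown by `head(europe)`, so the z-scores differ from the table above, which uses the mean and standard deviation of all 28 countries. The names `z_manual` and `z_scale` are illustrative.

```r
# GDP per capita of the first six countries (from head(europe) above)
gdp <- c(6810, 52085, 49582, 13772, 18570, 27220)

# z-score by hand: subtract the mean, divide by the sample standard deviation
z_manual <- (gdp - mean(gdp)) / sd(gdp)

# scale() does the same column by column and returns a matrix
z_scale <- as.numeric(scale(gdp))

all.equal(z_manual, z_scale)
```

After standardization every column has mean 0 and standard deviation 1, so each variable contributes on a comparable footing to the Euclidean distances used later.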

Finding Optimal Number of Clusters

Elbow Method

fviz_nbclust(data_scaled, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2, color = "red")

The elbow appears to be around k = 3.
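The quantity behind the elbow plot is the total within-cluster sum of squares for each k. A minimal sketch of that calculation follows. To keep it runnable on its own it uses the built-in `USArrests` data as a stand-in, which is an assumption on my side and not part of this analysis; `wss_by_k` is a hypothetical helper name.

```r
# Stand-in data (assumption): USArrests ships with base R
x <- scale(USArrests)

# Total within-cluster sum of squares for k = 1..k_max (hypothetical helper)
wss_by_k <- function(x, k_max = 8, nstart = 25, seed = 123) {
  set.seed(seed)  # k-means starts from random centers
  sapply(seq_len(k_max), function(k) {
    kmeans(x, centers = k, nstart = nstart)$tot.withinss
  })
}

wss <- wss_by_k(x)
data.frame(k = seq_along(wss), wss = round(wss, 1))
```

With the objects from this report the call would be `wss_by_k(data_scaled)`. The elbow is the k after which the drop in `wss` becomes small.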

Silhouette Method

fviz_nbclust(data_scaled, kmeans, method = "silhouette")

The silhouette method suggests k = 2, but k = 3 also looks reasonable.

I will go with 3 clusters because it matches the elbow and gives a more informative grouping.
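To compare candidate values of k by number and not only by eye, the average silhouette width can be computed per k. The sketch below does this with `cluster::silhouette()`. It again runs on the built-in `USArrests` data as a stand-in (an assumption), and `avg_sil_by_k` is a hypothetical helper.

```r
library(cluster)

# Stand-in data (assumption): replace with data_scaled from this report
x <- scale(USArrests)

# Average silhouette width for each k in k_range (hypothetical helper)
avg_sil_by_k <- function(x, k_range = 2:8, nstart = 25, seed = 123) {
  set.seed(seed)
  d <- dist(x)  # Euclidean distances, computed once
  vapply(k_range, function(k) {
    cl <- kmeans(x, centers = k, nstart = nstart)$cluster
    mean(silhouette(cl, d)[, "sil_width"])
  }, numeric(1))
}

k_range <- 2:8
avg_sil <- avg_sil_by_k(x, k_range)
data.frame(k = k_range, avg_sil = round(avg_sil, 3))
```

Looking at the gap between the best and second-best k is a simple way to judge whether choosing k = 3 over k = 2 costs much in silhouette terms.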

K-Means Clustering

set.seed(123)
km <- kmeans(data_scaled, centers = 3, nstart = 25)

# cluster sizes
table(km$cluster)

## 
##  1  2  3 
## 12  2 14

# add cluster to data
europe$cluster <- as.factor(km$cluster)

# visualize
fviz_cluster(km, data = data_scaled, geom = "point", 
             ellipse.type = "convex") +
  ggtitle("K-Means Clustering (k=3)")

Which Countries in Each Cluster?

for(i in 1:3) {
  cat("Cluster", i, ":\n")
  cat(europe$Country[europe$cluster == i], sep = ", ")
  cat("\n\n")
}

## Cluster 1 :
## Albania, Bulgaria, Croatia, France, Hungary, Italy, Latvia, Lithuania, Poland, Portugal, Romania, Slovakia
## 
## Cluster 2 :
## Greece, Spain
## 
## Cluster 3 :
## Austria, Belgium, Czech Republic, Denmark, Estonia, Finland, Germany, Ireland, Netherlands, Norway, Slovenia, Sweden, Switzerland, United Kingdom

Cluster Profiles

Let's look at the average values for each cluster:

cluster_means <- aggregate(europe[,2:7], by = list(Cluster = europe$cluster), FUN = mean)
cluster_means[,-1] <- round(cluster_means[,-1], 1)
print(cluster_means)

##   Cluster GDP_per_capita Life_expectancy Unemployment Education Internet_users
## 1       1        21614.3            77.6          6.2      66.0           85.2
## 2       2        25270.5            81.6         12.7     118.0           88.5
## 3       3        58184.4            81.3          4.5      77.8           93.5
##   CO2_emissions
## 1           4.7
## 2           5.4
## 3           6.9

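Looking at the results:

- Cluster 3 has the highest average GDP per capita (about $58k) and includes rich countries like Norway, Switzerland and Ireland, along with most of Western and Northern Europe
- Cluster 1 has the lowest average GDP per capita (about $22k) and the lowest life expectancy; it is mostly Eastern European countries, plus France, Italy and Portugal
- Cluster 2 contains only Greece and Spain, which stand out for high unemployment (12.7 on average) and a high Education value, not for GDP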
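The cluster means in original units can also be recovered from the k-means centers, which are stored on the standardized scale. The sketch below back-transforms them using the attributes that `scale()` attaches and checks the result against `aggregate()`. It uses `USArrests` as stand-in data (an assumption), and `km_demo` and `centers_orig` are illustrative names.

```r
# Stand-in data (assumption): USArrests ships with base R
x <- scale(USArrests)
set.seed(123)
km_demo <- kmeans(x, centers = 3, nstart = 25)

# Undo the standardization: multiply by the column sd, then add the column mean
centers_orig <- sweep(km_demo$centers, 2, attr(x, "scaled:scale"), "*")
centers_orig <- sweep(centers_orig, 2, attr(x, "scaled:center"), "+")

# Same numbers as averaging the raw data within each cluster
means_direct <- aggregate(USArrests, by = list(cluster = km_demo$cluster),
                          FUN = mean)
all.equal(unname(as.matrix(means_direct[, -1])), unname(centers_orig))
```

Both routes give the same profile table. The `aggregate()` version used in this report is easier to read, while the back-transform is handy when only the fitted k-means object is available.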

Hierarchical Clustering

Let's compare with another method.

dist_matrix <- dist(data_scaled)
hc <- hclust(dist_matrix, method = "ward.D2")

plot(hc, main = "Dendrogram", cex = 0.7)
rect.hclust(hc, k = 3)

hc_clusters <- cutree(hc, k = 3)
europe$hc_cluster <- as.factor(hc_clusters)

# compare with kmeans
table(KMeans = europe$cluster, Hierarchical = europe$hc_cluster)

##       Hierarchical
## KMeans  1  2  3
##      1 11  1  0
##      2  0  0  2
##      3  0 14  0

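The cluster numbers differ between the methods (hierarchical cluster 2 corresponds to k-means cluster 3), but the groupings are almost the same: 27 of 28 countries are grouped identically. One country that k-means puts in cluster 1 falls into the richer group under hierarchical clustering, so it sits on the border between clusters.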
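Because cluster labels are arbitrary, agreement should be measured in a way that ignores the numbering. One simple sketch is to match each k-means cluster to the hierarchical cluster it overlaps most and count the matched countries. The matrix below is the cross-table printed above, typed in by hand; `tab` and `agreement` are illustrative names.

```r
# Cross-table from the output above (rows: k-means, columns: hierarchical)
tab <- matrix(c(11,  1, 0,
                 0,  0, 2,
                 0, 14, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(KMeans = 1:3, Hierarchical = 1:3))

# For each k-means cluster take its largest overlap, then sum and normalize
agreement <- sum(apply(tab, 1, max)) / sum(tab)
round(agreement, 3)  # 27 of 28 countries, i.e. 0.964
```

This row-maximum measure is only a rough heuristic: it would overstate agreement if two k-means clusters mapped to the same hierarchical cluster, which is not the case here.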

PAM Clustering

PAM uses medoids (actual observations) instead of means as cluster centers, so it is more robust to outliers.
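A tiny one-dimensional example illustrates the difference. The numbers are made up for illustration and are not part of the dataset: one extreme value pulls the mean far away, while the medoid (the observation with the smallest total distance to all others) stays in the bulk of the data.

```r
# Made-up values with one outlier (illustration only)
x <- c(1, 2, 3, 4, 100)

# Medoid: the observation minimizing the sum of distances to all others
d <- as.matrix(dist(x))
medoid <- x[which.min(rowSums(d))]

mean(x)  # 22, dragged toward the outlier
medoid   # 3, an actual data point near the center of the bulk
```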

pam_result <- pam(data_scaled, k = 3)
europe$pam_cluster <- as.factor(pam_result$clustering)

# medoids (representative countries)
# pam_result$medoids holds the medoids' coordinates, not row numbers,
# so index with id.med (row indices of the medoids) instead
europe$Country[pam_result$id.med]

fviz_cluster(pam_result, data = data_scaled, geom = "point",
             ellipse.type = "convex") +
  ggtitle("PAM Clustering")

Validation - Silhouette Analysis

sil <- silhouette(km$cluster, dist_matrix)
fviz_silhouette(sil)

##   cluster size ave.sil.width
## 1       1   12          0.30
## 2       2    2          0.18
## 3       3   14          0.33

The average silhouette width is around 0.3, which indicates a weak but usable cluster structure. The Greece and Spain cluster has the lowest value (0.18). Countries with a low silhouette width could reasonably belong to a different cluster.
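To see which observations are borderline, the silhouette object can be turned into a table and sorted. The sketch below shows the idea on the built-in `USArrests` data as a stand-in (an assumption); `km_demo`, `sil_demo` and `sil_df` are illustrative names.

```r
library(cluster)

# Stand-in data (assumption): replace with data_scaled and km from this report
x <- scale(USArrests)
set.seed(123)
km_demo <- kmeans(x, centers = 3, nstart = 25)

# Rows of the silhouette object follow the row order of x
sil_demo <- silhouette(km_demo$cluster, dist(x))
sil_df <- data.frame(
  name      = rownames(x),
  cluster   = sil_demo[, "cluster"],
  neighbor  = sil_demo[, "neighbor"],   # closest other cluster
  sil_width = sil_demo[, "sil_width"]
)

# Lowest silhouette widths first: these are the borderline observations
head(sil_df[order(sil_df$sil_width), ], 5)
```

A width near 0 means the observation lies between its own cluster and the `neighbor` cluster; a negative width suggests it fits the neighbor better.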

Conclusions

What I found:

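- European countries can be divided into roughly 3 groups based on these indicators
- The richest group (average GDP per capita of about $58k) includes Norway, Switzerland, Ireland and Denmark, along with most of Western and Northern Europe such as Germany and the UK
- The lower-income group (average GDP per capita of about $22k) is mostly Eastern European countries, together with France, Italy and Portugal
- Greece and Spain form a small group of their own, set apart mainly by high unemployment
- K-means and hierarchical clustering agree on 27 of 28 countries, which is good; the PAM clusters were not cross-tabulated against them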

Limitations:

- Only used data from one year
- 6 variables might not be enough
- Some countries are between clusters (like Greece, Spain)

References

World Bank: https://data.worldbank.org/
Kassambara (2017) - Cluster Analysis in R

sessionInfo()

## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Warsaw
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] corrplot_0.95    factoextra_1.0.7 ggplot2_4.0.0    cluster_2.1.8.1 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     dplyr_1.1.4        compiler_4.5.1    
##  [5] ggsignif_0.6.4     tidyselect_1.2.1   Rcpp_1.1.0         tidyr_1.3.1       
##  [9] jquerylib_0.1.4    scales_1.4.0       yaml_2.3.10        fastmap_1.2.0     
## [13] R6_2.6.1           ggpubr_0.6.2       labeling_0.4.3     generics_0.1.4    
## [17] Formula_1.2-5      knitr_1.50         backports_1.5.0    ggrepel_0.9.6     
## [21] tibble_3.3.0       car_3.1-3          bslib_0.9.0        pillar_1.11.1     
## [25] RColorBrewer_1.1-3 rlang_1.1.6        broom_1.0.10       cachem_1.1.0      
## [29] xfun_0.53          sass_0.4.10        S7_0.2.0           cli_3.6.5         
## [33] withr_3.0.2        magrittr_2.0.4     digest_0.6.37      grid_4.5.1        
## [37] rstudioapi_0.17.1  lifecycle_1.0.4    vctrs_0.6.5        rstatix_0.7.3     
## [41] evaluate_1.0.5     glue_1.8.0         farver_2.1.2       abind_1.4-8       
## [45] carData_3.0-6      rmarkdown_2.30     purrr_1.1.0        tools_4.5.1       
## [49] pkgconfig_2.0.3    htmltools_0.5.8.1

Clustering of European Countries by Economic Indicators

Mekhroj Doliev

2026-02-01