Introduction

In this project I group European countries by their economic and social indicators to find which countries are similar to each other. The data come from the World Bank (https://data.worldbank.org/).

The main questions:

- How many groups (clusters) exist?
- Which countries belong together?
- What makes each group different?

Loading Libraries and Data

# install if needed
if (!require("cluster")) install.packages("cluster")
if (!require("factoextra")) install.packages("factoextra")
if (!require("corrplot")) install.packages("corrplot")

library(cluster)
library(factoextra)
library(corrplot)
# data from World Bank 2022
europe <- data.frame(
  Country = c("Albania", "Austria", "Belgium", "Bulgaria", "Croatia", 
              "Czech Republic", "Denmark", "Estonia", "Finland", "France",
              "Germany", "Greece", "Hungary", "Ireland", "Italy",
              "Latvia", "Lithuania", "Netherlands", "Norway", "Poland",
              "Portugal", "Romania", "Slovakia", "Slovenia", "Spain",
              "Sweden", "Switzerland", "United Kingdom"),
  GDP_per_capita = c(6810, 52085, 49582, 13772, 18570, 27220, 67803, 28247, 
                     50732, 40886, 48636, 20867, 18390, 103685, 34776, 21947, 
                     24032, 57025, 106149, 18688, 24521, 15892, 21088, 28439, 
                     29674, 56424, 93259, 45295),
  Life_expectancy = c(76.5, 81.3, 81.9, 74.8, 78.1, 78.3, 81.4, 78.6, 82.0, 
                      82.5, 80.6, 80.1, 76.2, 82.0, 82.9, 75.3, 75.7, 81.4, 
                      83.2, 77.0, 81.1, 74.2, 77.0, 80.6, 83.0, 83.0, 84.0, 80.4),
  Unemployment = c(11.0, 4.8, 5.6, 4.3, 7.0, 2.2, 4.5, 5.6, 6.8, 7.3, 3.1, 
                   12.4, 3.6, 4.5, 8.1, 6.9, 5.9, 3.5, 3.2, 2.9, 6.0, 5.6, 
                   6.1, 4.0, 12.9, 7.5, 4.3, 3.7),
  Education = c(58, 89, 79, 79, 67, 65, 82, 70, 93, 67, 72, 143, 52, 80, 64, 
                88, 72, 90, 84, 67, 66, 52, 60, 82, 93, 78, 63, 62),
  Internet_users = c(79, 93, 92, 80, 82, 88, 98, 92, 93, 86, 93, 83, 89, 92, 
                     85, 90, 87, 93, 98, 87, 84, 84, 90, 89, 94, 95, 96, 97),
  CO2_emissions = c(1.8, 7.1, 8.0, 5.7, 4.3, 9.3, 5.1, 8.4, 7.5, 4.6, 8.1, 
                    5.7, 4.9, 7.7, 5.3, 3.8, 4.5, 8.8, 7.5, 8.1, 4.4, 3.7, 
                    5.8, 6.1, 5.1, 3.8, 4.0, 5.2)
)
rownames(europe) <- europe$Country
# quick look at data
head(europe)
##                       Country GDP_per_capita Life_expectancy Unemployment
## Albania               Albania           6810            76.5         11.0
## Austria               Austria          52085            81.3          4.8
## Belgium               Belgium          49582            81.9          5.6
## Bulgaria             Bulgaria          13772            74.8          4.3
## Croatia               Croatia          18570            78.1          7.0
## Czech Republic Czech Republic          27220            78.3          2.2
##                Education Internet_users CO2_emissions
## Albania               58             79           1.8
## Austria               89             93           7.1
## Belgium               79             92           8.0
## Bulgaria              79             80           5.7
## Croatia               67             82           4.3
## Czech Republic        65             88           9.3
summary(europe[,-1])
##  GDP_per_capita   Life_expectancy  Unemployment      Education     
##  Min.   :  6810   Min.   :74.20   Min.   : 2.200   Min.   : 52.00  
##  1st Qu.: 21033   1st Qu.:77.00   1st Qu.: 3.925   1st Qu.: 64.75  
##  Median : 29056   Median :80.60   Median : 5.600   Median : 72.00  
##  Mean   : 40160   Mean   :79.75   Mean   : 5.832   Mean   : 75.61  
##  3rd Qu.: 51070   3rd Qu.:82.00   3rd Qu.: 6.925   3rd Qu.: 82.50  
##  Max.   :106149   Max.   :84.00   Max.   :12.900   Max.   :143.00  
##  Internet_users  CO2_emissions  
##  Min.   :79.00   Min.   :1.800  
##  1st Qu.:85.75   1st Qu.:4.475  
##  Median :90.00   Median :5.500  
##  Mean   :89.61   Mean   :5.868  
##  3rd Qu.:93.00   3rd Qu.:7.550  
##  Max.   :98.00   Max.   :9.300

Correlation Check

Before clustering, let's check whether the variables are correlated.

cor_matrix <- cor(europe[,-1])
corrplot(cor_matrix, method = "number", type = "upper")

GDP per capita and life expectancy are positively correlated, which makes sense: richer countries usually have better healthcare.
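With more than a handful of variables, the correlation plot gets hard to read, so it can help to list the variable pairs sorted by strength of correlation. The sketch below is only an illustration: it uses the built-in USArrests data as a stand-in for the World Bank table, and the name `cor_pairs` is my own.

```r
# Stand-in data (assumption): built-in USArrests, not the World Bank table.
cor_matrix <- cor(USArrests)

# Keep each pair once by taking the upper triangle only.
idx <- which(upper.tri(cor_matrix), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  r    = round(cor_matrix[idx], 2)
)

# Sort by absolute correlation, strongest pair first.
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs, row.names = FALSE)
```

Applied to this report, the same steps should work on `cor(europe[,-1])`.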

Data Standardization

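The variables are on very different scales (GDP per capita runs into the tens of thousands, while the others are percentages or small numbers), so they need to be standardized before clustering. Otherwise GDP would dominate the distances between countries.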

data_scaled <- scale(europe[,-1])
head(data_scaled)
##                GDP_per_capita Life_expectancy Unemployment  Education
## Albania            -1.2598410      -1.1125061   1.91125058 -0.9912533
## Austria             0.4504572       0.5287762  -0.38172178  0.7539959
## Belgium             0.3559045       0.7339365  -0.08585438  0.1910123
## Bulgaria           -0.9968461      -1.6937935  -0.56663891  0.1910123
## Croatia            -0.8155979      -0.5654120   0.43191357 -0.4845681
## Czech Republic     -0.4888374      -0.4970252  -1.34329083 -0.5971648
##                Internet_users CO2_emissions
## Albania            -1.9879612   -2.15864455
## Austria             0.6358799    0.65384756
## Belgium             0.4484626    1.13144056
## Bulgaria           -1.8005440   -0.08907488
## Croatia            -1.4257096   -0.83199733
## Czech Republic     -0.3012062    1.82129711
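To make clear what `scale()` does, here is a small sketch that computes the z-scores by hand and checks them against `scale()`. It uses the built-in USArrests data as a stand-in (an assumption, not the data from this report), and `z_manual` and `z_scale` are illustrative names.

```r
# Stand-in numeric data (assumption): built-in USArrests.
x <- as.matrix(USArrests)

# z-score each column: subtract the column mean, then divide by the column
# standard deviation (sd() uses the n - 1 denominator, as scale() does).
z_manual <- sweep(x, 2, colMeans(x), "-")
z_manual <- sweep(z_manual, 2, apply(x, 2, sd), "/")

z_scale <- scale(x)

# The two results agree up to floating-point error.
print(all.equal(z_manual, z_scale, check.attributes = FALSE))

# After scaling, every column has mean 0 and standard deviation 1.
print(round(colMeans(z_scale), 10))
print(apply(z_scale, 2, sd))
```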

Finding Optimal Number of Clusters

Elbow Method

set.seed(123)  # k-means uses random starts, so fix the seed for a reproducible plot
fviz_nbclust(data_scaled, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2, color = "red")

The elbow appears to be around k = 3.
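The elbow plot shows the total within-cluster sum of squares (WSS) for each k. As a rough sketch of what is computed behind the plot, the numbers can be tabulated directly with base R. The data here are the built-in USArrests (a stand-in, not the World Bank table), and the range of k values and the `drop_pct` column are my own choices.

```r
set.seed(123)          # k-means starts from random centers
x <- scale(USArrests)  # stand-in data (assumption)

k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Percentage drop in WSS when going from k - 1 to k clusters.
drop_pct <- c(NA, -diff(wss) / head(wss, -1) * 100)

print(data.frame(k = k_values,
                 wss = round(wss, 1),
                 drop_pct = round(drop_pct, 1)))
```

The elbow is the k after which `drop_pct` becomes small; picking it is still a judgment call.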

Silhouette Method

fviz_nbclust(data_scaled, kmeans, method = "silhouette")

The silhouette method suggests 2 clusters, but 3 also looks acceptable.

I will go with 3 clusters because it agrees with the elbow plot and gives a more informative grouping.
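To compare 2 and 3 clusters with numbers rather than by eye, the average silhouette width can be computed for each k. This is a minimal sketch using `silhouette()` from the cluster package on the built-in USArrests data (a stand-in for the data in this report); `avg_sil` and `best_k` are illustrative names.

```r
library(cluster)  # silhouette()

set.seed(123)
x <- scale(USArrests)  # stand-in data (assumption)
d <- dist(x)

k_values <- 2:8
avg_sil <- sapply(k_values, function(k) {
  cl <- kmeans(x, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

print(data.frame(k = k_values, avg_sil = round(avg_sil, 3)))

# k with the highest average silhouette width
best_k <- k_values[which.max(avg_sil)]
print(best_k)
```

If the values for two candidate k are close, choosing the one that is easier to interpret is a reasonable decision.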

K-Means Clustering

set.seed(123)
km <- kmeans(data_scaled, centers = 3, nstart = 25)

# cluster sizes
table(km$cluster)
## 
##  1  2  3 
## 12  2 14
# add cluster to data
europe$cluster <- as.factor(km$cluster)

# visualize
fviz_cluster(km, data = data_scaled, geom = "point", 
             ellipse.type = "convex") +
  ggtitle("K-Means Clustering (k=3)")

Which Countries in Each Cluster?

for(i in 1:3) {
  cat("Cluster", i, ":\n")
  cat(europe$Country[europe$cluster == i], sep = ", ")
  cat("\n\n")
}
## Cluster 1 :
## Albania, Bulgaria, Croatia, France, Hungary, Italy, Latvia, Lithuania, Poland, Portugal, Romania, Slovakia
## 
## Cluster 2 :
## Greece, Spain
## 
## Cluster 3 :
## Austria, Belgium, Czech Republic, Denmark, Estonia, Finland, Germany, Ireland, Netherlands, Norway, Slovenia, Sweden, Switzerland, United Kingdom

Cluster Profiles

Let's look at the average values for each cluster:

cluster_means <- aggregate(europe[,2:7], by = list(Cluster = europe$cluster), FUN = mean)
cluster_means[,-1] <- round(cluster_means[,-1], 1)
print(cluster_means)
##   Cluster GDP_per_capita Life_expectancy Unemployment Education Internet_users
## 1       1        21614.3            77.6          6.2      66.0           85.2
## 2       2        25270.5            81.6         12.7     118.0           88.5
## 3       3        58184.4            81.3          4.5      77.8           93.5
##   CO2_emissions
## 1           4.7
## 2           5.4
## 3           6.9

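Looking at the results:

- Cluster 3 has the highest average GDP per capita (about $58k) and the highest internet use. It includes rich countries like Norway, Switzerland, and Ireland, along with most of Western and Northern Europe.
- Cluster 1 has the lowest average GDP per capita (about $21.6k) and the lowest life expectancy. It is mostly Eastern European countries, plus France, Italy, and Portugal.
- Cluster 2 contains only Greece and Spain, which stand out for high unemployment (12.7 on average) and a high Education value (118).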
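Raw averages are hard to compare across variables because the units differ. One option is to look at the cluster centers in standardized units, which `kmeans()` already returns when it is fitted on scaled data. The sketch below uses the built-in USArrests data as a stand-in (an assumption), and `spread` is an illustrative name for a simple measure of how far apart the cluster means are on each variable.

```r
set.seed(123)
x <- scale(USArrests)  # stand-in data (assumption)
km_fit <- kmeans(x, centers = 3, nstart = 25)

# Cluster centers in z-score units: 0 is the overall mean, and +1 means
# one standard deviation above it.
print(round(km_fit$centers, 2))

# Range of the cluster means per variable; larger values mean the variable
# separates the clusters more strongly.
spread <- apply(km_fit$centers, 2, function(v) diff(range(v)))
print(sort(round(spread, 2), decreasing = TRUE))

# Sanity check: the centers are the per-cluster means of the scaled data.
agg <- aggregate(as.data.frame(x),
                 by = list(cluster = km_fit$cluster), FUN = mean)
print(all.equal(as.matrix(agg[, -1]), km_fit$centers,
                check.attributes = FALSE))
```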

Hierarchical Clustering

Let's compare with another method.

dist_matrix <- dist(data_scaled)
hc <- hclust(dist_matrix, method = "ward.D2")

plot(hc, main = "Dendrogram", cex = 0.7)
rect.hclust(hc, k = 3)

hc_clusters <- cutree(hc, k = 3)
europe$hc_cluster <- as.factor(hc_clusters)

# compare with kmeans
table(KMeans = europe$cluster, Hierarchical = europe$hc_cluster)
##       Hierarchical
## KMeans  1  2  3
##      1 11  1  0
##      2  0  0  2
##      3  0 14  0

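The results are similar but not identical: the two methods agree on 27 of the 28 countries. One country from k-means cluster 1 lands in the hierarchical cluster that otherwise matches k-means cluster 3. Some countries sit on the border between clusters.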
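The cross-table shows how many observations move, but not which ones. A possible way to list them is sketched below on the built-in USArrests data (a stand-in, not the data from this report). Cluster numbers are arbitrary, so each k-means cluster is matched to the hierarchical cluster it overlaps with most. That matching rule is a simplification of my own and is not guaranteed to be one-to-one.

```r
set.seed(123)
x <- scale(USArrests)  # stand-in data (assumption)

km_cl <- kmeans(x, centers = 3, nstart = 25)$cluster
hc_cl <- cutree(hclust(dist(x), method = "ward.D2"), k = 3)

tab <- table(KMeans = km_cl, Hierarchical = hc_cl)
print(tab)

# Map each k-means cluster to the hierarchical cluster with the largest
# overlap (row-wise maximum of the cross-table).
best_match <- apply(tab, 1, which.max)

# Observations that fall outside the matched hierarchical cluster.
disagree <- hc_cl != best_match[as.character(km_cl)]
print(rownames(x)[disagree])

# Share of observations on which the two methods agree.
print(mean(!disagree))
```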

PAM Clustering

PAM uses medoids (actual observations) instead of means as cluster centers, which makes it more robust to outliers.

pam_result <- pam(data_scaled, k = 3)
europe$pam_cluster <- as.factor(pam_result$clustering)

# medoids (representative countries)
# pam_result$medoids is the matrix of medoid coordinates, so it cannot be used
# as an index; pam_result$id.med holds the row numbers of the medoids
europe$Country[pam_result$id.med]
fviz_cluster(pam_result, data = data_scaled, geom = "point",
             ellipse.type = "convex") +
  ggtitle("PAM Clustering")
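Because medoids are real observations, they can be read off by name. The sketch below shows the two relevant components of a `pam()` result from the cluster package: `$medoids` (coordinates) and `$id.med` (row indices). It runs on the built-in USArrests data as a stand-in (an assumption), and `pam_fit` and `medoid_names` are illustrative names.

```r
library(cluster)  # pam()

x <- scale(USArrests)  # stand-in data (assumption)
pam_fit <- pam(x, k = 3)

# $medoids is a matrix of coordinates, one row per cluster.
print(round(pam_fit$medoids, 2))

# $id.med holds the row indices of the medoids in the input data,
# so this is the right component to use for looking up names.
medoid_names <- rownames(x)[pam_fit$id.med]
print(medoid_names)

# The medoid coordinates are exactly the corresponding rows of x.
print(all.equal(x[pam_fit$id.med, ], pam_fit$medoids,
                check.attributes = FALSE))
```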

Validation - Silhouette Analysis

sil <- silhouette(km$cluster, dist_matrix)
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1   12          0.30
## 2       2    2          0.18
## 3       3   14          0.33

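The average silhouette width is around 0.3, which indicates weak but usable cluster structure. Cluster 2 (Greece and Spain) has the lowest width at 0.18. Some countries have a low silhouette value, meaning they could just as well belong to a different cluster.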
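To see which observations are the borderline ones, the per-observation silhouette widths can be sorted. This is a minimal sketch on the built-in USArrests data (a stand-in for the data in this report); `sil_df` is an illustrative name. The `neighbor` column gives the closest other cluster, which is where a badly placed observation would most likely move.

```r
library(cluster)  # silhouette()

set.seed(123)
x <- scale(USArrests)  # stand-in data (assumption)
km_fit <- kmeans(x, centers = 3, nstart = 25)
sil <- silhouette(km_fit$cluster, dist(x))

# silhouette() keeps the observations in their original order.
sil_df <- data.frame(
  observation = rownames(x),
  cluster     = sil[, "cluster"],
  neighbor    = sil[, "neighbor"],
  sil_width   = round(sil[, "sil_width"], 2),
  row.names   = NULL
)

# Lowest silhouette widths first; values near 0 or below are borderline.
print(head(sil_df[order(sil_df$sil_width), ], 5))

# Overall average silhouette width.
print(mean(sil[, "sil_width"]))
```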

Conclusions

What I found:

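  1. European countries can be divided into roughly 3 groups based on these indicators, although the separation is weak (average silhouette width around 0.3)

  2. The richest group (average GDP per capita of about $58k) includes Norway, Switzerland, Ireland, and Denmark, together with Germany, the UK, and most of Western and Northern Europe

  3. The second group has the lowest average GDP per capita (about $21.6k) and is mostly Eastern European countries, plus France, Italy, and Portugal

  4. Greece and Spain form a small third group, set apart by high unemployment and a high Education value

  5. K-means and hierarchical clustering agree on 27 of 28 countries, which is good; the PAM result was only inspected visually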

Limitations:

- Only data from one year (2022) is used
- 6 variables might not be enough
- Some countries sit between clusters (like Greece and Spain, whose cluster has the lowest silhouette width)

References

sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Warsaw
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] corrplot_0.95    factoextra_1.0.7 ggplot2_4.0.0    cluster_2.1.8.1 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       jsonlite_2.0.0     dplyr_1.1.4        compiler_4.5.1    
##  [5] ggsignif_0.6.4     tidyselect_1.2.1   Rcpp_1.1.0         tidyr_1.3.1       
##  [9] jquerylib_0.1.4    scales_1.4.0       yaml_2.3.10        fastmap_1.2.0     
## [13] R6_2.6.1           ggpubr_0.6.2       labeling_0.4.3     generics_0.1.4    
## [17] Formula_1.2-5      knitr_1.50         backports_1.5.0    ggrepel_0.9.6     
## [21] tibble_3.3.0       car_3.1-3          bslib_0.9.0        pillar_1.11.1     
## [25] RColorBrewer_1.1-3 rlang_1.1.6        broom_1.0.10       cachem_1.1.0      
## [29] xfun_0.53          sass_0.4.10        S7_0.2.0           cli_3.6.5         
## [33] withr_3.0.2        magrittr_2.0.4     digest_0.6.37      grid_4.5.1        
## [37] rstudioapi_0.17.1  lifecycle_1.0.4    vctrs_0.6.5        rstatix_0.7.3     
## [41] evaluate_1.0.5     glue_1.8.0         farver_2.1.2       abind_1.4-8       
## [45] carData_3.0-6      rmarkdown_2.30     purrr_1.1.0        tools_4.5.1       
## [49] pkgconfig_2.0.3    htmltools_0.5.8.1