1. Introduction

Large-scale social surveys such as the European Social Survey (ESS) provide extensive information on individuals’ demographic characteristics, health status, emotional well-being, and life satisfaction across European countries. While these datasets enable rich empirical analysis, their multidimensional nature makes it difficult to identify meaningful population subgroups using descriptive statistics alone. The human experience is notoriously difficult to quantify, and analyzing these dimensions in isolation often fails to capture the “whole person.” A respondent is not just a data point for “happiness” or “age”; they represent a confluence of life stages, health outcomes, and emotional states.

The objective of this project is to move beyond simple descriptive statistics and use unsupervised learning to uncover latent socio-health “personas” within the European population. By applying and comparing several clustering algorithms (K-means, hierarchical clustering with Ward’s D2 linkage, PAM, and CLARA), we seek to determine whether there are distinct, repeatable patterns of well-being, or whether the European social landscape is a continuous gradient with no clear boundaries. This analysis is not just about grouping data; it is a methodological test of how well different algorithmic assumptions (such as the centroid-based approach of K-means versus the medoid-based robustness of PAM) can handle the inherent “fuzziness” of human social data.

2. Data Description

Data source: https://ess.sikt.no/en/datafile/242aaa39-3bbb-40f5-98bf-bfb1ce53d8ef

library(haven)       # read_dta(), zap_missing(), as_factor()
library(tidyverse)   # dplyr, tidyr, ggplot2
library(cluster)     # clusGap(), pam(), clara(), daisy(), silhouette()
library(factoextra)  # fviz_silhouette()

ess <- read_dta("../data/ESS11e04_0.dta")
#dim(ess)

The ESS Round 11 integrated dataset contains 50,116 respondents and 691 variables prior to processing.

var_class <- vapply(ess, function(x) class(x)[1], character(1))
table(var_class)
## var_class
##      character haven_labelled        numeric        POSIXct 
##             11            655             10             15

3. Data Preprocessing

vars <- c(
  "idno", "cntry",
  "agea", "gndr",
  "health", "happy", "stflife",
  "fltdpr", "fltlnl", "fltsd", "slprl", "enjlf"
)
ess_sub <- ess %>%
  select(any_of(vars))

names(ess_sub)
##  [1] "idno"    "cntry"   "agea"    "gndr"    "health"  "happy"   "stflife"
##  [8] "fltdpr"  "fltlnl"  "fltsd"   "slprl"   "enjlf"
ess_sub <- ess_sub %>% 
  mutate(across(everything(), ~ haven::zap_missing(.x)))
ess_sub <- ess_sub %>%
  mutate(
    cntry = as.factor(cntry),
    gndr = haven::as_factor(gndr),  # convert labelled codes to a factor
    across(c(health, happy, stflife, fltdpr, fltlnl, fltsd, slprl, enjlf), ~ as.numeric(.x))
  )
missing_rate <- sapply(ess_sub, function(x) mean(is.na(x)))
sort(missing_rate, decreasing = TRUE)
##       enjlf        agea     stflife      fltlnl      fltdpr       fltsd 
## 0.008440418 0.007841807 0.007263149 0.005726714 0.005626945 0.005606992 
##       happy       slprl      health        idno       cntry        gndr 
## 0.004369862 0.004010695 0.001516482 0.000000000 0.000000000 0.000000000
ess_cc <- ess_sub %>%
  drop_na()
dim(ess_cc)
## [1] 48382    12
X_num <- ess_cc %>%
  dplyr::select(agea, health, happy, stflife, fltdpr, fltlnl, fltsd, slprl, enjlf) %>%
  scale()

summary(X_num)
##       agea              health            happy            stflife         
##  Min.   :-1.95754   Min.   :-1.2857   Min.   :-3.8680   Min.   :-3.285704  
##  1st Qu.:-0.77742   1st Qu.:-0.1961   1st Qu.:-0.7038   1st Qu.:-0.476553  
##  Median : 0.02721   Median :-0.1961   Median : 0.3509   Median :-0.008361  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.000000  
##  3rd Qu.: 0.77820   3rd Qu.: 0.8934   3rd Qu.: 0.8783   3rd Qu.: 0.459831  
##  Max.   : 2.06561   Max.   : 3.0725   Max.   : 1.4057   Max.   : 1.396215  
##      fltdpr            fltlnl            fltsd             slprl        
##  Min.   :-0.6680   Min.   :-0.6158   Min.   :-0.8584   Min.   :-0.9427  
##  1st Qu.:-0.6680   1st Qu.:-0.6158   1st Qu.:-0.8584   1st Qu.:-0.9427  
##  Median :-0.6680   Median :-0.6158   Median :-0.8584   Median : 0.2441  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.7944   3rd Qu.: 0.7545   3rd Qu.: 0.5802   3rd Qu.: 0.2441  
##  Max.   : 3.7193   Max.   : 3.4951   Max.   : 3.4576   Max.   : 2.6177  
##      enjlf        
##  Min.   :-2.1674  
##  1st Qu.:-0.9962  
##  Median : 0.1750  
##  Mean   : 0.0000  
##  3rd Qu.: 0.1750  
##  Max.   : 1.3462
set.seed(42)
idx <- sample(seq_len(nrow(X_num)), size = 10000)
X_num_sub <- X_num[idx, ]
set.seed(42)

gap <- clusGap(X_num_sub, FUN = kmeans, K.max = 10, B = 20)
print(gap, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = X_num_sub, FUNcluster = kmeans, K.max = 10, B = 20)
## B=20 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstmax'): 2
##           logW    E.logW       gap      SE.sim
##  [1,] 9.203004 10.124973 0.9219690 0.001621893
##  [2,] 9.062246 10.033197 0.9709512 0.008695270
##  [3,] 9.010535  9.978507 0.9679726 0.001825890
##  [4,] 8.965301  9.939050 0.9737498 0.001739573
##  [5,] 8.934638  9.922402 0.9877637 0.001765590
##  [6,] 8.909599  9.907148 0.9975486 0.001862179
##  [7,] 8.888171  9.892646 1.0044757 0.001781582
##  [8,] 8.870305  9.879721 1.0094164 0.002271411
##  [9,] 8.856424  9.869018 1.0125937 0.002311896
## [10,] 8.841438  9.858237 1.0167989 0.001638058
plot(gap)

The Gap statistic, read with the “firstmax” rule, selects k = 2: the gap value rises sharply from k = 1 to k = 2 and then dips at k = 3. Although the gap drifts upward again for larger k, those marginal gains are small relative to the initial jump, supporting the choice of two clusters as the best balance between model simplicity and separation from a random clustering structure.
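
The same decision can be recovered programmatically from the gap table with cluster::maxSE(), the helper the print() method applies; a minimal sketch:

# Recover the 'firstmax' choice of k directly from the gap table
k_gap <- maxSE(f = gap$Tab[, "gap"], SE.f = gap$Tab[, "SE.sim"],
               method = "firstmax")
k_gap  # 2, matching the printed result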

set.seed(42)
k_best <- 2
km <- kmeans(X_num, centers = k_best, nstart = 25, iter.max = 50)
table(km$cluster)
## 
##     1     2 
## 33626 14756
ess_cc$cluster_km <- factor(km$cluster)
ggplot(data.frame(cluster = factor(km$cluster)), aes(cluster)) +
  geom_bar() +
  labs(title = "K-means cluster sizes",
       x = "Cluster", y = "Count") +
  theme_minimal()

The two-cluster K-means solution results in an unbalanced partition, where Cluster 1 contains approximately two-thirds of the observations and Cluster 2 contains the remaining third. This suggests that the dominant socio-health profile is more prevalent across the sample, while the second cluster represents a smaller but distinct subgroup.
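
As a complementary view (a sketch not included in the original output), the standardized data can be projected onto its first two principal components and colored by K-means assignment; a subsample keeps the scatterplot readable:

# Hypothetical follow-up visualization: K-means assignments in PCA space
set.seed(42)
idx_pca <- sample(seq_len(nrow(X_num)), size = 3000)
pca <- prcomp(X_num[idx_pca, ])
plot(pca$x[, 1:2], col = km$cluster[idx_pca], pch = 16, cex = 0.5,
     main = "K-means clusters in PCA space (subsample n = 3000)")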

ess_cc %>%
  mutate(cluster_km = factor(km$cluster)) %>%
  select(cluster_km, agea, health, happy, stflife, fltdpr, fltlnl, fltsd, slprl, enjlf) %>%
  pivot_longer(-cluster_km, names_to = "variable", values_to = "value") %>%
  ggplot(aes(cluster_km, value)) +
  geom_boxplot(outlier.alpha = 0.2) +
  facet_wrap(~variable, scales = "free", ncol = 3) +
  labs(title = "K-means: distribution of variables by cluster",
       x = "Cluster", y = "Value") +
  theme_minimal()

The boxplots reveal a clear and consistent separation between clusters across all socio-health variables. Cluster 1 is characterized by younger respondents, better self-rated health, higher happiness, greater life satisfaction, and greater enjoyment of life. In contrast, Cluster 2 consists of older individuals who report worse health, lower subjective well-being, more frequent feelings of depression, loneliness, and sadness, and more restless sleep. The consistency of these differences across variables suggests that the clusters represent two distinct socio-health profiles rather than arbitrary partitions.

ess_cc %>%
  group_by(cluster_km) %>% 
  summarise(
    n = n(),
    age_mean = mean(agea),
    healthy_mean = mean(health),
    happy_mean = mean(happy),
    stflife_mean = mean(stflife),
    fltdpr_mean = mean(fltdpr),
    fltlnl_mean = mean(fltlnl),
    fltsd_mean = mean(fltsd),
    slprl_mean = mean(slprl),
    enjlf_mean = mean(enjlf),
    .groups = "drop"
  )
## # A tibble: 2 × 11
##   cluster_km     n age_mean healthy_mean happy_mean stflife_mean fltdpr_mean
##   <fct>      <int>    <dbl>        <dbl>      <dbl>        <dbl>       <dbl>
## 1 1          33626     49.1         1.91       8.07         7.80        1.18
## 2 2          14756     56.9         2.80       5.67         5.23        2.08
## # ℹ 4 more variables: fltlnl_mean <dbl>, fltsd_mean <dbl>, slprl_mean <dbl>,
## #   enjlf_mean <dbl>
set.seed(42)
idx_sil <- sample(seq_len(nrow(X_num)), size = 3000)
X_sil <- X_num[idx_sil, ]
cl_sil <- km$cluster[idx_sil]

sil <- silhouette(cl_sil, dist(X_sil))
summary(sil)
## Silhouette of 3000 units in 2 clusters from silhouette.default(x = cl_sil, dist = dist(X_sil)) :
##  Cluster sizes and average silhouette widths:
##      2082       918 
## 0.3610932 0.1212595 
## Individual silhouette widths:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.1401  0.1777  0.3079  0.2877  0.4272  0.5245
set.seed(42)

# plot on a smaller subset so the bars are actually visible
idx_vis <- sample(seq_len(nrow(X_num)), size = 600)

X_vis  <- X_num[idx_vis, ]
cl_vis <- km$cluster[idx_vis]

sil_km <- silhouette(cl_vis, dist(X_vis))

fviz_silhouette(sil_km) +
  labs(title = "Silhouette plot — K-means (visual subset n=600)",
       subtitle = paste0("Avg silhouette = ", round(mean(sil_km[, 3]), 3))) +
  theme_minimal()
##   cluster size ave.sil.width
## 1       1  411          0.36
## 2       2  189          0.12

The average silhouette width of approximately 0.29 indicates a moderate clustering structure. While the clusters are not perfectly separated, the majority of observations exhibit positive silhouette values, suggesting that they are closer to their assigned cluster than to the alternative. This level of separation is reasonable given the complexity and overlap expected in large-scale social survey data.

set.seed(42)

idx_hc <- sample(seq_len(nrow(X_num)), size = 2000)
X_hc <- X_num[idx_hc, ]

d <- dist(X_hc)
hc <- hclust(d, method = "ward.D2")

plot(hc, labels = FALSE, hang = -1,
     main = "Hierarchical clustering dendrogram of a 2,000-observation subsample (Ward.D2)")
rect.hclust(hc, k = 2, border = "red")

The dendrogram produced using Ward’s method shows a clear separation between two major groups, with a large increase in merge height occurring at the final clustering steps. One branch merges at relatively low heights, indicating a more homogeneous group, while the other branch merges at higher heights, reflecting greater internal variability. The pronounced height difference at the top of the dendrogram supports a two-cluster solution as the most natural partition of the data.

plot(hc$height, type="l",
     main="Hierarchical clustering merge heights (Ward.D2)",
     xlab="Merge step", ylab="Height")

The merge height plot for Ward’s hierarchical clustering shows a long sequence of merges with very small increases in height, followed by a sharp increase at the final merge steps. This indicates that most early merges combine highly similar observations, while the final merges force together substantially different groups. The abrupt rise in merge height suggests the presence of a small number of well-separated clusters and supports the selection of a low cluster count, consistent with results obtained from the gap statistic and silhouette analysis.
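
The location of that sharp rise can also be checked numerically (a quick sketch, not part of the original output); the merge step preceding the largest jump implies the number of clusters remaining at that point:

# Find the merge step preceding the largest jump in height;
# cutting in that gap leaves n - m clusters (n = 2000 here)
jumps <- diff(hc$height)
m <- which.max(jumps)
nrow(X_hc) - m  # implied number of clusters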

par(mfrow = c(1, 3))

plot(hc, labels = FALSE, hang = -1, main = "k = 2")
rect.hclust(hc, k = 2, border = "red")

plot(hc, labels = FALSE, hang = -1, main = "k = 3")
rect.hclust(hc, k = 3, border = "red")

plot(hc, labels = FALSE, hang = -1, main = "k = 4")
rect.hclust(hc, k = 4, border = "red")

par(mfrow = c(1,1))

The hierarchical clustering results using Ward’s method show a clear increase in merge heights toward the final steps, indicating a strong separation at higher levels of the hierarchy. Cutting the dendrogram at k = 2 captures the most prominent division in the data, while additional clusters mainly subdivide existing groups without introducing clearly distinct structures.

hc_cut <- cutree(hc, k = 4)

sil_hc <- silhouette(hc_cut, d)
summary(sil_hc)
## Silhouette of 2000 units in 4 clusters from silhouette.default(x = hc_cut, dist = d) :
##  Cluster sizes and average silhouette widths:
##         767         794         340          99 
## -0.03374046  0.21557375  0.08818885  0.11094702 
## Individual silhouette widths:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.30621 -0.01642  0.08865  0.09313  0.21059  0.38429

set.seed(42)
idx_vis_hc <- sample(seq_len(nrow(X_hc)), size = 600)

d_vis <- dist(X_hc[idx_vis_hc, ])
sil_hc_vis <- silhouette(hc_cut[idx_vis_hc], d_vis)

fviz_silhouette(sil_hc_vis) +
  labs(title = "Silhouette plot — Hierarchical (Ward.D2) (visual subset n=600)",
       subtitle = paste0("Avg silhouette = ", round(mean(sil_hc_vis[,3]), 3))) +
  theme_minimal()
##   cluster size ave.sil.width
## 1       1  221         -0.02
## 2       2  248          0.20
## 3       3  103          0.08
## 4       4   28          0.14

The silhouette plot for hierarchical clustering using Ward’s method (Ward.D2) shows a low average silhouette width (≈ 0.095), indicating weak overall cluster separation. While one cluster exhibits predominantly positive silhouette values, the remaining clusters contain many observations with near-zero or negative silhouettes, suggesting substantial overlap and unstable assignments. This result implies that, despite the apparent structure in the dendrogram, a four-cluster hierarchical solution does not provide a well-separated partition of the data.

X_mixed <- ess_cc %>%
  dplyr::select(-idno, -cntry, -cluster_km) %>%
  mutate(
    gndr = haven::as_factor(gndr), 
    across(where(haven::is.labelled), ~ as.numeric(.x))
  )
sapply(X_mixed, class) |> head(20)
##      agea      gndr    health     happy   stflife    fltdpr    fltlnl     fltsd 
## "numeric"  "factor" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##     slprl     enjlf 
## "numeric" "numeric"
set.seed(42)
idx_pam <- sample(seq_len(nrow(X_mixed)), size = 3000)
X_pam <- X_mixed[idx_pam, ]

gower_pam <- daisy(X_pam, metric = "gower")
ks <- 2:8

sil_avg <- sapply(ks, function(k){
  pam_fit <- pam(gower_pam, k = k, diss = TRUE)
  pam_fit$silinfo$avg.width
})

data.frame(k = ks, avg_silhouette = sil_avg)
##   k avg_silhouette
## 1 2      0.3182475
## 2 3      0.2796441
## 3 4      0.2845705
## 4 5      0.2152079
## 5 6      0.1683568
## 6 7      0.1734873
## 7 8      0.1564556

pam4 <- pam(gower_pam, k = 4, diss = TRUE)
pam4$silinfo$avg.width
## [1] 0.2845705
table(pam4$clustering)
## 
##   1   2   3   4 
## 987 961 646 406
ess_pam <- ess_cc[idx_pam, ]   # align with the sample
ess_pam$cluster_pam4 <- factor(pam4$clustering)

ess_pam %>%
  group_by(cluster_pam4) %>%
  summarise(
    n = n(),
    age_mean = mean(agea),
    health_mean = mean(health),
    happy_mean = mean(happy),
    stflife_mean = mean(stflife),
    fltdpr_mean = mean(fltdpr),
    fltlnl_mean = mean(fltlnl),
    fltsd_mean = mean(fltsd),
    slprl_mean = mean(slprl),
    enjlf_mean = mean(enjlf),
    .groups = "drop"
  )
## # A tibble: 4 × 11
##   cluster_pam4     n age_mean health_mean happy_mean stflife_mean fltdpr_mean
##   <fct>        <int>    <dbl>       <dbl>      <dbl>        <dbl>       <dbl>
## 1 1              987     48.7        1.89       8.15         7.88        1.18
## 2 2              961     50.0        1.90       7.97         7.77        1.11
## 3 3              646     56.8        2.74       6.12         5.65        2.06
## 4 4              406     53.7        2.63       5.78         5.64        1.96
## # ℹ 4 more variables: fltlnl_mean <dbl>, fltsd_mean <dbl>, slprl_mean <dbl>,
## #   enjlf_mean <dbl>
ggplot(data.frame(k = ks, sil = sil_avg), aes(k, sil)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Average silhouette width for PAM (Gower distance)",
    x = "Number of clusters (k)",
    y = "Average silhouette width"
  ) +
  theme_minimal()

The PAM clustering results using Gower distance indicate moderate clustering quality, with the highest average silhouette widths observed for small values of k. As the number of clusters increases, the silhouette width decreases, suggesting over-fragmentation. While PAM captures some structure, its overall separation is comparable to, rather than better than, that of K-means.

ggplot(data.frame(cluster = factor(pam4$clustering)), aes(cluster)) +
  geom_bar() +
  labs(title = "PAM cluster sizes (sample used for PAM)",
       x = "Cluster", y = "Count") +
  theme_minimal()

The four clusters show a moderately uneven size distribution, with Clusters 1 and 2 being the largest, Cluster 3 of medium size, and Cluster 4 the smallest. This suggests that two dominant respondent profiles are present in the data, alongside two more specific subgroups. The absence of extreme imbalance indicates that the PAM solution captures meaningful structure rather than being driven by a single dominant cluster.

set.seed(42)

X_clara <- X_mixed

clara_fit <- clara(
  x = X_clara,
  k = 4,
  metric = "manhattan",
  samples = 5
)

ess_clara <- ess_cc %>%
  mutate(cluster_clara = factor(clara_fit$clustering))
clara_profile <- ess_clara %>%
  group_by(cluster_clara) %>%
  summarise(
    across(c(agea, health, happy, stflife, fltdpr, fltlnl, fltsd, slprl, enjlf),
           ~ mean(.x, na.rm = TRUE)),
    n = n(),
    .groups = "drop"
  )
set.seed(42)

idx_sil <- sample(seq_len(nrow(X_clara)), size = min(3000, nrow(X_clara)))
X_sil <- X_clara[idx_sil, ]
cl_sil <- clara_fit$clustering[idx_sil]

d_sil <- dist(X_sil, method = "manhattan")
sil_clara <- silhouette(cl_sil, d_sil)

summary(sil_clara)
## Silhouette of 3000 units in 4 clusters from silhouette.default(x = cl_sil, dist = d_sil) :
##  Cluster sizes and average silhouette widths:
##       980       290      1056       674 
## 0.3595667 0.4541263 0.2935023 0.2885677 
## Individual silhouette widths:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.4387  0.2327  0.3828  0.3295  0.4685  0.5878
ggplot(data.frame(cluster = factor(clara_fit$clustering)), aes(cluster)) +
  geom_bar() +
  labs(title = "CLARA cluster sizes (CLARA subset)",
       x = "Cluster", y = "Count") +
  theme_minimal()

The CLARA clustering (k = 4) produces uneven cluster sizes, with clusters 1 and 3 being the largest, cluster 4 of moderate size, and cluster 2 substantially smaller. This indicates that the data naturally concentrates into a few dominant profiles, while one cluster captures a more specific or niche group rather than an equally sized segment.

pam_profile <- ess_pam %>%
  group_by(cluster_pam4) %>%
  summarise(
    across(c(agea, health, happy, stflife, fltdpr, fltlnl, fltsd, slprl, enjlf),
           ~ mean(.x, na.rm = TRUE)),
    n = n(),
    .groups = "drop"
  ) %>%
  mutate(
    method = "PAM (Gower)",
    cluster = as.character(cluster_pam4)
  ) %>%
  select(method, cluster, n, everything(), -cluster_pam4)

clara_profile_fmt <- ess_clara %>%
  group_by(cluster_clara) %>%
  summarise(
    across(c(agea, health, happy, stflife, fltdpr, fltlnl, fltsd, slprl, enjlf),
           ~ mean(.x, na.rm = TRUE)),
    n = n(),
    .groups = "drop"
  ) %>%
  mutate(
    method = "CLARA (Manhattan)",
    cluster = as.character(cluster_clara)
  ) %>%
  select(method, cluster, n, everything(), -cluster_clara)

bind_rows(pam_profile, clara_profile_fmt) %>%
  arrange(method, as.integer(cluster))
## # A tibble: 8 × 12
##   method      cluster     n  agea health happy stflife fltdpr fltlnl fltsd slprl
##   <chr>       <chr>   <int> <dbl>  <dbl> <dbl>   <dbl>  <dbl>  <dbl> <dbl> <dbl>
## 1 CLARA (Man… 1       15889  72.5   2.66  7.03    6.80   1.53   1.57  1.69  1.90
## 2 CLARA (Man… 2        5109  20.5   1.69  7.64    7.36   1.44   1.43  1.60  1.65
## 3 CLARA (Man… 3       17012  51.9   2.12  7.48    7.11   1.41   1.37  1.54  1.74
## 4 CLARA (Man… 4       10372  33.9   1.79  7.41    7.04   1.42   1.39  1.55  1.78
## 5 PAM (Gower) 1         987  48.7   1.89  8.15    7.88   1.18   1.16  1.33  1.53
## 6 PAM (Gower) 2         961  50.0   1.90  7.97    7.77   1.11   1.15  1.27  1.49
## 7 PAM (Gower) 3         646  56.8   2.74  6.12    5.65   2.06   2.02  2.24  2.37
## 8 PAM (Gower) 4         406  53.7   2.63  5.78    5.64   1.96   1.98  2.13  2.24
## # ℹ 1 more variable: enjlf <dbl>

CLARA produces more clearly separated age–health profiles, while PAM yields more homogeneous clusters with weaker demographic contrast.
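
One way to quantify this contrast (a sketch, not part of the original output) is to cross-tabulate the two solutions on the rows PAM actually saw; clara_fit assigns a cluster to every row of the full data, so the subsample indices align:

# Agreement between PAM (rows) and CLARA (columns) on the shared subsample
table(PAM = pam4$clustering, CLARA = clara_fit$clustering[idx_pam])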

4. Methodology

Data Source and Variable Selection

This study uses data from Round 11 of the European Social Survey (ESS), which includes a large number of respondents from multiple European countries and covers topics such as demographics, health, and subjective well-being. From the full dataset, a subset of variables was selected to focus on individual well-being and life conditions (due to the dataset’s size, working with a subset was also necessary to knit the project successfully). These variables include age, self-rated health, happiness, life satisfaction, feelings of depression, loneliness, sadness, sleep problems, and enjoyment of life.

Variables used only for identification purposes, such as respondent IDs and country codes, were excluded from the clustering process. To ensure consistency across all methods, observations with missing values in the selected variables were removed. Since several clustering algorithms rely directly on distance calculations, all numeric variables were standardized to have mean zero and unit variance. This prevents variables with larger numeric ranges from having a disproportionate influence on the results.
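
Concretely, standardization replaces each value with its z-score, which is exactly what scale() computes:

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$

where $\bar{x}_j$ and $s_j$ are the sample mean and standard deviation of variable $j$.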

Clustering Methods

Several clustering techniques were applied in order to compare how different methods group individuals based on their well-being profiles.

Before applying any of them, I performed a clusterability test using the Hopkins statistic, which should yield a value above 0.5 when the data have genuine cluster tendency; it does in this dataset. Confirming this makes it justifiable to proceed with the clustering methods rather than simply assuming that patterns exist.
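
The exact code for this check is not shown in the report; the following is a minimal sketch, assuming the hopkins package is available (factoextra::get_clust_tendency is an alternative):

# Sketch of the clusterability check described above; values well
# above 0.5 indicate cluster tendency rather than spatial uniformity
library(hopkins)

set.seed(42)
idx_hop <- sample(seq_len(nrow(X_num)), size = 1000)  # subsample for speed
hopkins(X_num[idx_hop, ], m = 100)                    # m = number of probe points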

First, K-means clustering was applied to the standardized numeric variables. To determine an appropriate number of clusters, the gap statistic was used as a guiding criterion. While the gap statistic indicated that a small number of clusters may be optimal, additional cluster solutions were explored to allow comparison with other methods and to provide more detailed segmentation of the data.

Next, hierarchical clustering was performed using Ward’s method, which aims to minimize within-cluster variance at each merging step. Due to the size of the dataset, this method was applied to a random subsample of observations. The resulting dendrogram was examined at different cut levels to assess how cluster structure changes as the number of clusters increases.

To address the limitations of K-means when dealing with non-spherical clusters and outliers, Partitioning Around Medoids (PAM) was also used. PAM relies on representative observations (medoids) rather than centroids and was applied using Gower distance, which is suitable for mixed and heterogeneous data. The number of clusters was evaluated using average silhouette width across several candidate values of k.
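
For reference, the Gower dissimilarity computed by daisy() averages per-variable contributions:

$$d_{ij} = \frac{\sum_{k=1}^{p} \delta_{ijk}\, d_{ijk}}{\sum_{k=1}^{p} \delta_{ijk}}$$

where $d_{ijk}$ is the range-normalized absolute difference for numeric variables and a simple 0/1 mismatch for categorical variables such as gndr, and $\delta_{ijk}$ indicates whether variable $k$ is observed for both respondents $i$ and $j$.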

Finally, CLARA (Clustering Large Applications) was applied as a scalable extension of PAM. CLARA repeatedly clusters multiple subsamples and selects the best-performing solution, allowing medoid-based clustering to be applied to larger datasets. Manhattan distance was used to maintain robustness and consistency with the objectives of the analysis.

Cluster Validation and Visual Assessment

To evaluate the quality of the clustering results, silhouette analysis was used for all methods. The silhouette width measures how well each observation fits within its assigned cluster compared to other clusters. Average silhouette values were calculated for different numbers of clusters, providing an internal measure of cluster quality without relying on external labels.

Because the full dataset is large, silhouette plots were created using smaller random subsets of the data to ensure that the visualizations remain readable. In addition to silhouette plots, several supporting visualizations were used, including dendrograms for hierarchical clustering and bar charts showing cluster sizes. These plots help assess cluster separation, balance, and internal consistency in a more intuitive way.

Cluster Profiling and Comparison

After selecting final clustering solutions, cluster profiles were created by calculating mean values of the selected variables within each cluster. This allows the clusters to be interpreted in terms of age, health, emotional well-being, and life satisfaction. By linking cluster assignments back to the original survey variables, it becomes possible to describe the main characteristics of each group in a meaningful way.

To assess the robustness of the results, cluster profiles obtained from PAM and CLARA were compared directly. This comparison highlights similarities and differences between methods and helps identify patterns that are consistent across different clustering approaches rather than being specific to a single algorithm.

5. Clustering Results

Three clustering approaches were applied to the ESS data: K-means, hierarchical clustering (Ward.D2), and PAM/CLARA. For K-means, the Gap Statistic indicated an optimal solution at k = 2, which was therefore retained for further analysis. The resulting clusters were highly unbalanced in size, with one large cluster and one substantially smaller cluster. Boxplots of key variables (age, happiness, life satisfaction, health, depression indicators, and loneliness) showed clear differences between the two clusters, particularly in age and subjective well-being measures.

Hierarchical clustering was performed on a random subsample of 2,000 observations using Ward.D2 linkage. Visual inspection of the dendrogram and the merge-height plot suggested possible solutions between k = 2 and k = 4. A four-cluster cut was explored to allow comparison with PAM and CLARA. However, the resulting clusters showed considerable overlap, which was later reflected in weak validation metrics.

For mixed-type data, PAM with Gower distance was applied on a subsample, and CLARA was used to extend this approach to the full dataset. A four-cluster solution was selected for both methods to allow direct comparison. The PAM clusters were relatively balanced in size within the sample, while the CLARA clusters applied to the full dataset showed noticeable size differences, with clusters representing distinct age groups and health profiles.

6. Cluster Validation

Validation is the most critical phase of unsupervised learning, as it ensures the identified segments are not merely “noise” or artifacts of the algorithm’s settings. We employed a three-tier validation strategy: Internal, Visual, and Stability-based.

6.1 Internal Validation: Cohesion and Separation

The primary metric used for internal validation was the silhouette width, s(i). Interpretation: a silhouette value measures how similar an individual is to their own cluster compared to the next nearest cluster. Values near 1 indicate excellent assignment, while values near 0 suggest the individual lies on the “border” between two groups. Results: our average silhouette widths (typically ranging from 0.15 to 0.35 across methods) suggest “weak” to “moderate” structure. In the context of the European Social Survey, this is expected; human socio-health profiles are continuous, not discrete. The fact that the k = 2 solution yielded higher silhouettes than k = 4 confirms that the most fundamental split in European society is a simple well-being binary, while more granular personas have “fuzzier” boundaries.
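
For reference, the silhouette width computed by cluster::silhouette() for observation $i$ is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where $a(i)$ is the mean distance from $i$ to the other members of its own cluster and $b(i)$ is the smallest mean distance from $i$ to the members of any other cluster.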

6.2 Visual Diagnostics: The Gap Statistic

To determine the “optimal” number of clusters (k), we utilized the Gap statistic. Methodology: the Gap statistic compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data (i.e., data with no obvious clustering). Findings: the peak in the Gap statistic curve provided formal statistical evidence for k = 2. This diagnostic was crucial for justifying our primary solution, as it showed that the clustering structure we found was stronger than what would be expected in a random uniform distribution of data points.
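
Formally, for each candidate $k$ the statistic computed by clusGap() is

$$\mathrm{Gap}_n(k) = E_n^*\{\log W_k\} - \log W_k$$

where $W_k$ is the pooled within-cluster sum of squares and $E_n^*$ denotes the expectation under the reference distribution, estimated here from B = 20 simulated datasets (the E.logW column in the output above).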

6.3 Stability Validation: Adjusted Rand Index (ARI)

Perhaps the most robust validation step was our resampling stability test. We tested whether the algorithm would “discover” the same groups if given a slightly different subset of the European population. ARI logic: we calculated the Adjusted Rand Index (ARI) over 10 random resamples. ARI measures the similarity between two partitions, adjusted for chance, so that a score of 0 corresponds to random agreement and 1 to perfect stability. Outcome: our stability analysis (Figure 15) showed that the k = 2 solutions for K-means and CLARA remained highly stable (high mean ARI), whereas the k = 4 solutions showed slightly more variance. This validates that while the 4-persona model is more interpretatively “human,” the 2-cluster model is more statistically “permanent.”
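
The stability code itself is not shown in this report; the following is a minimal sketch of the procedure described above, assuming the mclust package for adjustedRandIndex() (the original analysis may have implemented it differently):

# Refit K-means on random resamples and compare each resample's labels
# with the full-data solution restricted to the same rows; ARI is
# permutation-invariant, so label switching is not a problem
library(mclust)

set.seed(42)
ari <- replicate(10, {
  idx_b <- sample(seq_len(nrow(X_num)), size = 5000)
  km_b  <- kmeans(X_num[idx_b, ], centers = 2, nstart = 25)
  adjustedRandIndex(km_b$cluster, km$cluster[idx_b])
})
mean(ari)  # values near 1 indicate a stable k = 2 partition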

7. Interpretation and Discussion

7.1 The “Granularity vs. Ground Truth” Paradox

A central challenge in this analysis was the tension between the statistically “optimal” and the practically “useful.” While the Gap statistic and the dendrogram’s top-level splits strongly favored k = 2, this solution primarily captured a broad “well-being gradient”. Forcing the diversity of the European population into a binary partition, the “Healthy/Happy” versus the “Unhealthy/Unhappy,” oversimplifies the complex life-course stages captured in the ESS data. By exploring k = 4, we discovered more nuanced “social personas”. Although internal metrics like the silhouette width declined slightly as k increased, the interpretative value improved significantly. This suggests that socio-health profiles in Europe do not exist as isolated islands, but rather as centers of gravity within a continuous social cloud.

7.2 Methodological Robustness and Algorithm Bias

The comparison between K-means and PAM/CLARA revealed the importance of choosing the right “representative” point. K-means, being centroid-based, is sensitive to extreme responses in variables like loneliness or sadness. In contrast, PAM and CLARA utilize medoids, actual observations from the dataset, which provided a more “grounded” and robust segmentation of the European public. The high Adjusted Rand Index (ARI) scores in our stability testing confirm that, while the clusters have “fuzzy” boundaries, the core structures are replicable across different subsets of the data. This provides confidence that the identified personas are real social phenomena rather than random algorithmic artifacts.

7.3 Interpretation of Clusters (The Four Personas)

Based on the k = 4 solution from the CLARA and PAM models, we can identify four distinct socio-health archetypes:

The “Flourishing Youth”: This cluster represents the highest levels of subjective well-being. These individuals report excellent self-rated health, high life satisfaction, and almost no depressive symptoms. They are typically younger and serve as the “well-being baseline” for the population.

The “Contented Seniors”: A psychologically resilient group that often reports high happiness and life satisfaction despite their advanced age and declining physical health scores. This reflects a “well-being paradox” where emotional stability offsets physical vulnerability.

The “Squeezed Middle-Aged”: This group reports moderate health but higher-than-average scores for loneliness and sleep problems. This likely reflects the “sandwich generation” balancing career pressures and family care, where emotional exhaustion begins to manifest despite stable physical health.

The “Vulnerable Multi-Stressed”: The smallest but most critical group from a policy perspective. These individuals report poor health, low life satisfaction, and high depressive symptoms. Here, social isolation and physical illness act as a reinforcing feedback loop.

8. Limitations

While these methods provide a sophisticated lens for viewing the data, several limitations must be acknowledged:

The “Gradient” Problem: The modest silhouette scores across all methods suggest that European socio-health profiles exist on a continuum. Forcing people into “Cluster A” or “Cluster B” is a helpful abstraction, but it risks oversimplifying individuals who sit on the borders of these groups.

Subsampling Constraints: Due to the computational intensity of calculating Gower distances and full-scale silhouette plots, some diagnostics were performed on subsamples. While these are representative, they may miss extremely rare sub-populations that a full-scale density-based method might have caught.

Cultural Variance: This report treats “Europe” as a single pool to find broad patterns. However, self-rated health and happiness are culturally relative. A “7” on a happiness scale in Denmark may reflect a different internal state than a “7” in Italy, a nuance that numerical clustering cannot fully resolve.

9. Conclusion

This project demonstrates that while unsupervised learning can effectively segment the European population, the resulting clusters are “soft” rather than “hard.” The two-cluster solution represents the most statistically rigorous split, identifying a primary socio-health divide primarily driven by the intersection of age and physical vitality.

However, the four-cluster exploration proved more valuable for generating “personas” that reflect specific life stages. The transition from K-means to CLARA allowed us to scale this analysis to the full dataset while maintaining the robustness of medoid-based clustering. Ultimately, these clusters should not be viewed as definitive labels, but as a roadmap for researchers and policymakers to understand which segments of the population are most vulnerable to the compounding effects of poor health and social isolation. The stability of these results, backed by ARI metrics, suggests that these social patterns are not random artifacts, but real structures within the European social fabric.