Large-scale social surveys such as the European Social Survey (ESS) provide extensive information on individuals’ demographic characteristics, health status, emotional well-being, and life satisfaction across European countries. While these datasets enable rich empirical analysis, their multidimensional nature makes it difficult to identify meaningful population subgroups using descriptive statistics alone. The human experience is notoriously difficult to quantify, and analyzing these dimensions in isolation often fails to capture the “whole person.” A respondent is not just a data point for “happiness” or “age”; they represent a confluence of life stages, health outcomes, and emotional states.
The objective of this project is to move beyond simple descriptive statistics and use unsupervised learning to uncover latent socio-health “personas” within the European population. By applying and comparing several clustering algorithms (K-means, hierarchical clustering with Ward’s D2 linkage, PAM, and CLARA), we seek to determine whether there are distinct, repeatable patterns of well-being, or whether the European social landscape is a continuous gradient with no clear boundaries. This analysis is not just about grouping data; it is a methodological test of how well different algorithmic assumptions (such as the centroid-based approach of K-means versus the medoid-based robustness of PAM) handle the inherent “fuzziness” of human social data.
Data source: https://ess.sikt.no/en/datafile/242aaa39-3bbb-40f5-98bf-bfb1ce53d8ef
library(haven); library(dplyr); library(tidyr); library(ggplot2)  # import, wrangling, plotting
library(cluster); library(factoextra)  # clusGap(), pam(), clara(), daisy(), silhouette(); fviz_*
ess <- read_dta("../data/ESS11e04_0.dta")
# dim(ess)
The ESS Round 11 integrated dataset contains 50,116 respondents and 691 variables prior to processing.
var_class <- vapply(ess, function(x) class(x)[1], character(1))
table(var_class)
## var_class
## character haven_labelled numeric POSIXct
## 11 655 10 15
vars <- c(
"idno", "cntry",
"agea", "gndr",
"health", "happy", "stflife",
"fltdpr", "fltlnl", "fltsd", "slprl", "enjlf"
)
ess_sub <- ess %>%
select(any_of(vars))
names(ess_sub)
## [1] "idno" "cntry" "agea" "gndr" "health" "happy" "stflife"
## [8] "fltdpr" "fltlnl" "fltsd" "slprl" "enjlf"
ess_sub <- ess_sub %>%
mutate(across(everything(), ~ haven::zap_missing(.x)))
ess_sub <- ess_sub %>%
mutate(
cntry = as.factor(cntry),
gndr = as.factor(as_factor(gndr)),
across(c(health, happy, stflife, fltdpr, fltlnl, fltsd, slprl, enjlf), ~ as.numeric(.x))
)
missing_rate <- sapply(ess_sub, function(x) mean(is.na(x)))
sort(missing_rate, decreasing = TRUE)
## enjlf agea stflife fltlnl fltdpr fltsd
## 0.008440418 0.007841807 0.007263149 0.005726714 0.005626945 0.005606992
## happy slprl health idno cntry gndr
## 0.004369862 0.004010695 0.001516482 0.000000000 0.000000000 0.000000000
ess_cc <- ess_sub %>%
drop_na()
dim(ess_cc)
## [1] 48382 12
X_num <- ess_cc %>%
dplyr::select(agea, health, happy, stflife, fltdpr, fltlnl, fltsd, slprl, enjlf) %>%
scale()
summary(X_num)
## agea health happy stflife
## Min. :-1.95754 Min. :-1.2857 Min. :-3.8680 Min. :-3.285704
## 1st Qu.:-0.77742 1st Qu.:-0.1961 1st Qu.:-0.7038 1st Qu.:-0.476553
## Median : 0.02721 Median :-0.1961 Median : 0.3509 Median :-0.008361
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.000000
## 3rd Qu.: 0.77820 3rd Qu.: 0.8934 3rd Qu.: 0.8783 3rd Qu.: 0.459831
## Max. : 2.06561 Max. : 3.0725 Max. : 1.4057 Max. : 1.396215
## fltdpr fltlnl fltsd slprl
## Min. :-0.6680 Min. :-0.6158 Min. :-0.8584 Min. :-0.9427
## 1st Qu.:-0.6680 1st Qu.:-0.6158 1st Qu.:-0.8584 1st Qu.:-0.9427
## Median :-0.6680 Median :-0.6158 Median :-0.8584 Median : 0.2441
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7944 3rd Qu.: 0.7545 3rd Qu.: 0.5802 3rd Qu.: 0.2441
## Max. : 3.7193 Max. : 3.4951 Max. : 3.4576 Max. : 2.6177
## enjlf
## Min. :-2.1674
## 1st Qu.:-0.9962
## Median : 0.1750
## Mean : 0.0000
## 3rd Qu.: 0.1750
## Max. : 1.3462
set.seed(42)
idx <- sample(seq_len(nrow(X_num)), size = 10000)
X_num_sub <- X_num[idx, ]
set.seed(42)
gap <- clusGap(X_num_sub, FUN = kmeans, K.max = 10, B = 20)
print(gap, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = X_num_sub, FUNcluster = kmeans, K.max = 10, B = 20)
## B=20 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 2
## logW E.logW gap SE.sim
## [1,] 9.203004 10.124973 0.9219690 0.001621893
## [2,] 9.062246 10.033197 0.9709512 0.008695270
## [3,] 9.010535 9.978507 0.9679726 0.001825890
## [4,] 8.965301 9.939050 0.9737498 0.001739573
## [5,] 8.934638 9.922402 0.9877637 0.001765590
## [6,] 8.909599 9.907148 0.9975486 0.001862179
## [7,] 8.888171 9.892646 1.0044757 0.001781582
## [8,] 8.870305 9.879721 1.0094164 0.002271411
## [9,] 8.856424 9.869018 1.0125937 0.002311896
## [10,] 8.841438 9.858237 1.0167989 0.001638058
plot(gap)
The Gap statistic, read with the ‘firstmax’ rule, suggests that a
two-cluster solution provides the best balance between model simplicity
and separation from a random reference structure. While the Gap
statistic creeps upward again at higher values of k, the marginal gains
beyond k = 2 are small, supporting the choice of two clusters.
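The ‘firstmax’ rule can also be applied programmatically to the clusGap table via cluster::maxSE(), using the gap values and their simulation standard errors:
# Locate the first local maximum of the gap curve; reproduces the k = 2 choice.
maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstmax")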
set.seed(42)
k_best <- 2
km <- kmeans(X_num, centers = k_best, nstart = 25, iter.max = 50)
table(km$cluster)
##
## 1 2
## 33626 14756
ess_cc$cluster_km <- factor(km$cluster)
ggplot(data.frame(cluster = factor(km$cluster)), aes(cluster)) +
geom_bar() +
labs(title = "K-means cluster sizes",
x = "Cluster", y = "Count") +
theme_minimal()
The two-cluster K-means solution results in an unbalanced partition,
where Cluster 1 contains approximately two-thirds of the observations
and Cluster 2 contains the remaining third. This suggests that the
dominant socio-health profile is more prevalent across the sample, while
the second cluster represents a smaller but distinct subgroup.
ess_cc %>%
mutate(cluster_km = factor(km$cluster)) %>%
select(cluster_km, agea, health, happy, stflife, fltdpr, fltlnl, fltsd, slprl, enjlf) %>%
pivot_longer(-cluster_km, names_to = "variable", values_to = "value") %>%
ggplot(aes(cluster_km, value)) +
geom_boxplot(outlier.alpha = 0.2) +
facet_wrap(~variable, scales = "free", ncol = 3) +
labs(title = "K-means: distribution of variables by cluster",
x = "Cluster", y = "Value") +
theme_minimal()
The boxplots reveal a clear and consistent separation between clusters
across all socio-health variables. Cluster 1 is characterized by younger
respondents, better self-rated health (lower scores on the ESS health
item indicate better health), higher happiness, greater life
satisfaction, and less frequent depression, loneliness, sadness, and
sleep problems. In contrast, Cluster 2 consists of older individuals who
report worse health outcomes and lower subjective well-being across all
of these measures. The consistency of these differences across variables
suggests that the clusters represent two distinct socio-health profiles
rather than arbitrary partitions.
ess_cc %>%
group_by(cluster_km) %>%
summarise(
n = n(),
age_mean = mean(agea),
healthy_mean = mean(health),
happy_mean = mean(happy),
stflife_mean = mean(stflife),
fltdpr_mean = mean(fltdpr),
fltlnl_mean = mean(fltlnl),
fltsd_mean = mean(fltsd),
slprl_mean = mean(slprl),
enjlf_mean = mean(enjlf),
.groups = "drop"
)
## # A tibble: 2 × 11
## cluster_km n age_mean healthy_mean happy_mean stflife_mean fltdpr_mean
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 33626 49.1 1.91 8.07 7.80 1.18
## 2 2 14756 56.9 2.80 5.67 5.23 2.08
## # ℹ 4 more variables: fltlnl_mean <dbl>, fltsd_mean <dbl>, slprl_mean <dbl>,
## # enjlf_mean <dbl>
set.seed(42)
idx_sil <- sample(seq_len(nrow(X_num)), size = 3000)
X_sil <- X_num[idx_sil, ]
cl_sil <- km$cluster[idx_sil]
sil <- silhouette(cl_sil, dist(X_sil))
summary(sil)
## Silhouette of 3000 units in 2 clusters from silhouette.default(x = cl_sil, dist = dist(X_sil)) :
## Cluster sizes and average silhouette widths:
## 2082 918
## 0.3610932 0.1212595
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1401 0.1777 0.3079 0.2877 0.4272 0.5245
set.seed(42)
# plot on a smaller subset so the bars are actually visible
idx_vis <- sample(seq_len(nrow(X_num)), size = 600)
X_vis <- X_num[idx_vis, ]
cl_vis <- km$cluster[idx_vis]
sil_km <- silhouette(cl_vis, dist(X_vis))
fviz_silhouette(sil_km) +
labs(title = "Silhouette plot — K-means (visual subset n=600)",
subtitle = paste0("Avg silhouette = ", round(mean(sil_km[, 3]), 3))) +
theme_minimal()
## cluster size ave.sil.width
## 1 1 411 0.36
## 2 2 189 0.12
The average silhouette width of approximately 0.29 indicates a moderate
clustering structure. While the clusters are not perfectly separated,
the majority of observations exhibit positive silhouette values,
suggesting that they are closer to their assigned cluster than to the
alternative. This level of separation is reasonable given the complexity
and overlap expected in large-scale social survey data.
set.seed(42)
idx_hc <- sample(seq_len(nrow(X_num)), size = 2000)
X_hc <- X_num[idx_hc, ]
d <- dist(X_hc)
hc <- hclust(d, method = "ward.D2")
plot(hc, labels = FALSE, hang = -1,
main = "Heirarchical clustering dendrogram of a 2000 observation subsample (Ward.D2)")
rect.hclust(hc, k = 2, border = "red")
The dendrogram produced using Ward’s method shows a clear separation
between two major groups, with a large increase in merge height
occurring at the final clustering steps. One branch merges at relatively
low heights, indicating a more homogeneous group, while the other branch
merges at higher heights, reflecting greater internal variability. The
pronounced height difference at the top of the dendrogram supports a
two-cluster solution as the most natural partition of the data.
plot(hc$height, type="l",
main="Hierarchical clustering merge heights (Ward.D2)",
xlab="Merge step", ylab="Height")
The merge height plot for Ward’s hierarchical clustering shows a long
sequence of merges with very small increases in height, followed by a
sharp increase at the final merge steps. This indicates that most early
merges combine highly similar observations, while the final merges force
together substantially different groups. The abrupt rise in merge height
suggests the presence of a small number of well-separated clusters and
supports the selection of a low cluster count, consistent with results
obtained from the gap statistic and silhouette analysis.
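A quick numeric check confirms this pattern: printing the last few merge heights and their step-to-step increases shows that the final merge is by far the most expensive.
# The last five merge heights and their increments; the largest jump
# marks the fusion of the two most dissimilar groups.
tail(hc$height, 5)
diff(tail(hc$height, 6))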
par(mfrow = c(1, 3))
plot(hc, labels = FALSE, hang = -1, main = "k = 2")
rect.hclust(hc, k = 2, border = "red")
plot(hc, labels = FALSE, hang = -1, main = "k = 3")
rect.hclust(hc, k = 3, border = "red")
plot(hc, labels = FALSE, hang = -1, main = "k = 4")
rect.hclust(hc, k = 4, border = "red")
par(mfrow = c(1,1))
The hierarchical clustering results using Ward’s method show a clear increase in merge heights toward the final steps, indicating a strong separation at higher levels of the hierarchy. Cutting the dendrogram at k = 2 captures the most prominent division in the data, while additional clusters mainly subdivide existing groups without introducing clearly distinct structures.
hc_cut <- cutree(hc, k = 4)
sil_hc <- silhouette(hc_cut, d)
summary(sil_hc)
## Silhouette of 2000 units in 4 clusters from silhouette.default(x = hc_cut, dist = d) :
## Cluster sizes and average silhouette widths:
## 767 794 340 99
## -0.03374046 0.21557375 0.08818885 0.11094702
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.30621 -0.01642 0.08865 0.09313 0.21059 0.38429
set.seed(42)
idx_vis_hc <- sample(seq_len(nrow(X_hc)), size = 600)
d_vis <- dist(X_hc[idx_vis_hc, ])
sil_hc_vis <- silhouette(hc_cut[idx_vis_hc], d_vis)
fviz_silhouette(sil_hc_vis) +
labs(title = "Silhouette plot — Hierarchical (Ward.D2) (visual subset n=600)",
subtitle = paste0("Avg silhouette = ", round(mean(sil_hc_vis[,3]), 3))) +
theme_minimal()
## cluster size ave.sil.width
## 1 1 221 -0.02
## 2 2 248 0.20
## 3 3 103 0.08
## 4 4 28 0.14
The silhouette plot for hierarchical clustering using Ward’s method
(Ward.D2) shows a low average silhouette width (≈ 0.095), indicating
weak overall cluster separation. While one cluster exhibits
predominantly positive silhouette values, the remaining clusters contain
many observations with near-zero or negative silhouettes, suggesting
substantial overlap and unstable assignments. This result implies that,
despite the apparent structure in the dendrogram, a four-cluster
hierarchical solution does not provide a well-separated partition of the
data.
X_mixed <- ess_cc %>%
dplyr::select(-idno, -cntry, -cluster_km) %>%
mutate(
gndr = haven::as_factor(gndr),
across(where(haven::is.labelled), ~ as.numeric(.x))
)
sapply(X_mixed, class) |> head(20)
## agea gndr health happy stflife fltdpr fltlnl fltsd
## "numeric" "factor" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## slprl enjlf
## "numeric" "numeric"
set.seed(42)
idx_pam <- sample(seq_len(nrow(X_mixed)), size = 3000)
X_pam <- X_mixed[idx_pam, ]
gower_pam <- daisy(X_pam, metric = "gower")
ks <- 2:8
sil_avg <- sapply(ks, function(k){
pam_fit <- pam(gower_pam, k = k, diss = TRUE)
pam_fit$silinfo$avg.width
})
data.frame(k = ks, avg_silhouette = sil_avg)
## k avg_silhouette
## 1 2 0.3182475
## 2 3 0.2796441
## 3 4 0.2845705
## 4 5 0.2152079
## 5 6 0.1683568
## 6 7 0.1734873
## 7 8 0.1564556
pam4 <- pam(gower_pam, k = 4, diss = TRUE)
pam4$silinfo$avg.width
## [1] 0.2845705
table(pam4$clustering)
##
## 1 2 3 4
## 987 961 646 406
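Because PAM’s medoids are actual respondents, each cluster comes with a concrete representative. A quick way to inspect them (not shown in the original output) is:
# Each medoid is a real observation; these four rows are the
# "representative respondents" of the four PAM clusters.
X_pam[pam4$id.med, ]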
ess_pam <- ess_cc[idx_pam, ] # align with the sample
ess_pam$cluster_pam4 <- factor(pam4$clustering)
ess_pam %>%
group_by(cluster_pam4) %>%
summarise(
n = n(),
age_mean = mean(agea),
health_mean = mean(health),
happy_mean = mean(happy),
stflife_mean = mean(stflife),
fltdpr_mean = mean(fltdpr),
fltlnl_mean = mean(fltlnl),
fltsd_mean = mean(fltsd),
slprl_mean = mean(slprl),
enjlf_mean = mean(enjlf),
.groups = "drop"
)
## # A tibble: 4 × 11
## cluster_pam4 n age_mean health_mean happy_mean stflife_mean fltdpr_mean
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 987 48.7 1.89 8.15 7.88 1.18
## 2 2 961 50.0 1.90 7.97 7.77 1.11
## 3 3 646 56.8 2.74 6.12 5.65 2.06
## 4 4 406 53.7 2.63 5.78 5.64 1.96
## # ℹ 4 more variables: fltlnl_mean <dbl>, fltsd_mean <dbl>, slprl_mean <dbl>,
## # enjlf_mean <dbl>
ggplot(data.frame(k = ks, sil = sil_avg), aes(k, sil)) +
geom_line() +
geom_point() +
labs(
title = "Average silhouette width for PAM (Gower distance)",
x = "Number of clusters (k)",
y = "Average silhouette width"
) +
theme_minimal()
The PAM clustering results using Gower distance indicate moderate
clustering quality, with the highest average silhouette width observed
at k = 2. As the number of clusters increases, the silhouette width
decreases, suggesting over-fragmentation. Because these silhouettes are
computed under Gower distance, they are not directly comparable with the
Euclidean-based K-means values, but they point to a similar
weak-to-moderate level of separation.
ggplot(data.frame(cluster = factor(pam4$clustering)), aes(cluster)) +
geom_bar() +
labs(title = "PAM cluster sizes (sample used for PAM)",
x = "Cluster", y = "Count") +
theme_minimal()
The four clusters show a moderately uneven size distribution, with
Clusters 1 and 2 being the largest, Cluster 3 of medium size, and
Cluster 4 the smallest. This suggests that two dominant respondent
profiles are present in the data, alongside two more specific subgroups.
The absence of extreme imbalance indicates that the PAM solution
captures meaningful structure rather than being driven by a single
dominant cluster.
set.seed(42)
X_clara <- X_mixed  # note: clara() coerces input via data.matrix(), so the factor gndr enters as numeric codes (1/2)
clara_fit <- clara(
x = X_clara,
k = 4,
metric = "manhattan",
samples = 5
)
ess_clara <- ess_cc %>%
mutate(cluster_clara = factor(clara_fit$clustering))
clara_profile <- ess_clara %>%
group_by(cluster_clara) %>%
summarise(
across(c(agea, health, happy, stflife, fltdpr, fltlnl, fltsd, slprl, enjlf),
~ mean(.x, na.rm = TRUE)),
n = n(),
.groups = "drop"
)
set.seed(42)
idx_sil <- sample(seq_len(nrow(X_clara)), size = min(3000, nrow(X_clara)))
X_sil <- X_clara[idx_sil, ]
cl_sil <- clara_fit$clustering[idx_sil]
d_sil <- dist(data.matrix(X_sil), method = "manhattan")  # data.matrix() mirrors clara()'s internal coercion of the factor gndr
sil_clara <- silhouette(cl_sil, d_sil)
summary(sil_clara)
## Silhouette of 3000 units in 4 clusters from silhouette.default(x = cl_sil, dist = d_sil) :
## Cluster sizes and average silhouette widths:
## 980 290 1056 674
## 0.3595667 0.4541263 0.2935023 0.2885677
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4387 0.2327 0.3828 0.3295 0.4685 0.5878
ggplot(data.frame(cluster = factor(clara_fit$clustering)), aes(cluster)) +
geom_bar() +
labs(title = "CLARA cluster sizes (CLARA subset)",
x = "Cluster", y = "Count") +
theme_minimal()
The CLARA clustering (k = 4) produces uneven cluster sizes, with
clusters 1 and 3 being the largest, cluster 4 of moderate size, and
cluster 2 substantially smaller. This indicates that the data naturally
concentrates into a few dominant profiles, while one cluster captures a
more specific or niche group rather than an equally sized segment.
pam_profile <- ess_pam %>%
group_by(cluster_pam4) %>%
summarise(
across(c(agea, health, happy, stflife, fltdpr, fltlnl, fltsd, slprl, enjlf),
~ mean(.x, na.rm = TRUE)),
n = n(),
.groups = "drop"
) %>%
mutate(
method = "PAM (Gower)",
cluster = as.character(cluster_pam4)
) %>%
select(method, cluster, n, everything(), -cluster_pam4)
clara_profile_fmt <- ess_clara %>%
group_by(cluster_clara) %>%
summarise(
across(c(agea, health, happy, stflife, fltdpr, fltlnl, fltsd, slprl, enjlf),
~ mean(.x, na.rm = TRUE)),
n = n(),
.groups = "drop"
) %>%
mutate(
method = "CLARA (Manhattan)",
cluster = as.character(cluster_clara)
) %>%
select(method, cluster, n, everything(), -cluster_clara)
bind_rows(pam_profile, clara_profile_fmt) %>%
arrange(method, as.integer(cluster))
## # A tibble: 8 × 12
## method cluster n agea health happy stflife fltdpr fltlnl fltsd slprl
## <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CLARA (Man… 1 15889 72.5 2.66 7.03 6.80 1.53 1.57 1.69 1.90
## 2 CLARA (Man… 2 5109 20.5 1.69 7.64 7.36 1.44 1.43 1.60 1.65
## 3 CLARA (Man… 3 17012 51.9 2.12 7.48 7.11 1.41 1.37 1.54 1.74
## 4 CLARA (Man… 4 10372 33.9 1.79 7.41 7.04 1.42 1.39 1.55 1.78
## 5 PAM (Gower) 1 987 48.7 1.89 8.15 7.88 1.18 1.16 1.33 1.53
## 6 PAM (Gower) 2 961 50.0 1.90 7.97 7.77 1.11 1.15 1.27 1.49
## 7 PAM (Gower) 3 646 56.8 2.74 6.12 5.65 2.06 2.02 2.24 2.37
## 8 PAM (Gower) 4 406 53.7 2.63 5.78 5.64 1.96 1.98 2.13 2.24
## # ℹ 1 more variable: enjlf <dbl>
CLARA produces more clearly separated age–health profiles, while PAM yields more homogeneous clusters with weaker demographic contrast. Part of this difference is mechanical: CLARA was run on unstandardized variables, so the Manhattan distance is dominated by age (the variable with by far the largest range), pulling the solution toward age bands, whereas PAM’s Gower distance rescales each variable by its range and gives the well-being items more relative weight.
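One way to probe this interpretation is a small robustness sketch (not part of the original pipeline): rerun CLARA on standardized columns so that age no longer dominates the distance, and cross-tabulate the two partitions.
# Robustness sketch: standardize all columns before clustering. clara()
# coerces data frames via data.matrix(), so the factor gndr becomes 1/2.
X_clara_std <- scale(data.matrix(X_mixed))
set.seed(42)
clara_std <- clara(X_clara_std, k = 4, metric = "manhattan", samples = 5)
table(original = clara_fit$clustering, standardized = clara_std$clustering)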
Data Source and Variable Selection
This study uses data from Round 11 of the European Social Survey (ESS), which includes a large number of respondents from multiple European countries and covers topics such as demographics, health, and subjective well-being. From the full dataset, a subset of variables was selected to focus on individual well-being and life conditions (due to the dataset’s large size, a subset was also needed to knit the project successfully). These variables include age, self-rated health, happiness, life satisfaction, feelings of depression, loneliness, sadness, sleep problems, and enjoyment of life.
Variables used only for identification purposes, such as respondent IDs and country codes, were excluded from the clustering process. To ensure consistency across all methods, observations with missing values in the selected variables were removed. Since several clustering algorithms rely directly on distance calculations, the numeric variables used for K-means and hierarchical clustering were standardized to have mean zero and unit variance. This prevents variables with larger numeric ranges from having a disproportionate influence on the results.
Clustering Methods
Several clustering techniques were applied in order to compare how different methods group individuals based on their well-being profiles.
Before applying any of them, I performed a clusterability test using the Hopkins statistic. A value above 0.5 indicates that the data depart from a uniform (structureless) distribution, which is the case in this dataset. Confirming this justifies proceeding with the clustering methods rather than simply assuming that patterns exist.
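The Hopkins check itself is not shown in the code above; a minimal sketch, assuming the standardized subsample X_num_sub and factoextra::get_clust_tendency() (where values well above 0.5 and close to 1 indicate clusterable data), looks like this:
# Hopkins statistic on the standardized subsample; n is the number of
# sampled points for the test and must be smaller than nrow(X_num_sub).
set.seed(42)
get_clust_tendency(X_num_sub, n = 100, graph = FALSE)$hopkins_stat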
First, K-means clustering was applied to the standardized numeric variables. To determine an appropriate number of clusters, the gap statistic was used as a guiding criterion. While the gap statistic indicated that a small number of clusters may be optimal, additional cluster solutions were explored to allow comparison with other methods and to provide more detailed segmentation of the data.
Next, hierarchical clustering was performed using Ward’s method, which aims to minimize within-cluster variance at each merging step. Due to the size of the dataset, this method was applied to a random subsample of observations. The resulting dendrogram was examined at different cut levels to assess how cluster structure changes as the number of clusters increases.
To address the limitations of K-means when dealing with non-spherical clusters and outliers, Partitioning Around Medoids (PAM) was also used. PAM relies on representative observations (medoids) rather than centroids and was applied using Gower distance, which is suitable for mixed and heterogeneous data. The number of clusters was evaluated using average silhouette width across several candidate values of k.
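For reference, the Gower dissimilarity between respondents i and j averages per-variable contributions, d(i, j) = (1/p) * sum_k d_k(i, j): numeric variables contribute |x_ik - x_jk| / R_k, where R_k is the range of variable k, and categorical variables (here, gender) contribute 0 for a match and 1 otherwise.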
Finally, CLARA (Clustering Large Applications) was applied as a scalable extension of PAM. CLARA repeatedly clusters multiple subsamples and selects the best-performing solution, allowing medoid-based clustering to be applied to larger datasets. Manhattan distance was used to maintain robustness and consistency with the objectives of the analysis.
Cluster Validation and Visual Assessment
To evaluate the quality of the clustering results, silhouette analysis was used for all methods. The silhouette width measures how well each observation fits within its assigned cluster compared to other clusters. Average silhouette values were calculated for different numbers of clusters, providing an internal measure of cluster quality without relying on external labels.
Because the full dataset is large, silhouette plots were created using smaller random subsets of the data to ensure that the visualizations remain readable. In addition to silhouette plots, several supporting visualizations were used, including dendrograms for hierarchical clustering and bar charts showing cluster sizes. These plots help assess cluster separation, balance, and internal consistency in a more intuitive way.
Cluster Profiling and Comparison
After selecting final clustering solutions, cluster profiles were created by calculating mean values of the selected variables within each cluster. This allows the clusters to be interpreted in terms of age, health, emotional well-being, and life satisfaction. By linking cluster assignments back to the original survey variables, it becomes possible to describe the main characteristics of each group in a meaningful way.
To assess the robustness of the results, cluster profiles obtained from PAM and CLARA were compared directly. This comparison highlights similarities and differences between methods and helps identify patterns that are consistent across different clustering approaches rather than being specific to a single algorithm.
Three clustering approaches were applied to the ESS data: K-means, hierarchical clustering (Ward.D2), and PAM/CLARA. For K-means, the Gap Statistic indicated an optimal solution at k = 2, which was therefore retained for further analysis. The resulting clusters were highly unbalanced in size, with one large cluster and one substantially smaller cluster. Boxplots of key variables (age, happiness, life satisfaction, health, depression indicators, and loneliness) showed clear differences between the two clusters, particularly in age and subjective well-being measures.
Hierarchical clustering was performed on a random subsample of 2,000 observations using Ward.D2 linkage. Visual inspection of the dendrogram and the merge-height plot suggested possible solutions between k = 2 and k = 4. A four-cluster cut was explored to allow comparison with PAM and CLARA. However, the resulting clusters showed considerable overlap, which was later reflected in weak validation metrics.
For mixed-type data, PAM with Gower distance was applied on a subsample, and CLARA was used to extend this approach to the full dataset. A four-cluster solution was selected for both methods to allow direct comparison. The PAM clusters were relatively balanced in size within the sample, while the CLARA clusters applied to the full dataset showed noticeable size differences, with clusters representing distinct age groups and health profiles.
Validation is the most critical phase of unsupervised learning, as it ensures the identified segments are not merely “noise” or artifacts of the algorithm’s settings. We employed a three-tier validation strategy: Internal, Visual, and Stability-based.
6.1 Internal Validation: Cohesion and Separation
The primary metric used for internal validation was the Silhouette Width (S_i).
Interpretation: A silhouette value measures how similar an individual is to their own cluster compared to the next nearest cluster. Values near 1 indicate excellent assignment, while values near 0 suggest the individual lies on the “border” between two groups.
Results: Our average silhouette widths (typically ranging from 0.15 to 0.35 across methods) suggest “weak” to “moderate” structure. In the context of the European Social Survey, this is expected; human socio-health profiles are continuous, not discrete. The fact that the k = 2 solution yielded higher silhouettes than k = 4 confirms that the most fundamental split in European society is a simple well-being binary, while more granular personas have “fuzzier” boundaries.
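For reference, the silhouette width of observation i is defined as S_i = (b_i - a_i) / max(a_i, b_i), where a_i is the mean dissimilarity between i and the other members of its own cluster and b_i is the mean dissimilarity between i and the members of the nearest other cluster; S_i ranges from -1 (likely misassigned) to 1 (well assigned).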
6.2 Visual Diagnostics: The Gap Statistic
To determine the “optimal” number of clusters (k), we utilized the Gap Statistic.
Methodology: The Gap Statistic compares the total intra-cluster variation for different values of k with its expected value under a null reference distribution of the data (i.e., data with no obvious clustering).
Findings: The peak in the Gap Statistic curve under the ‘firstmax’ rule provided formal statistical evidence for k = 2. This diagnostic was crucial for justifying our primary solution, as it showed that the clustering structure we found was stronger than what would be expected under a reference distribution with no cluster structure.
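Formally, for each k the statistic is Gap_n(k) = E*_n[log(W_k)] - log(W_k), where W_k is the total within-cluster dispersion and E*_n denotes the expectation over B reference datasets drawn from the null distribution (here B = 20 with the “scaledPCA” reference, as reported in the clusGap output above).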
6.3 Stability Validation: Adjusted Rand Index (ARI)
Perhaps the most robust validation step was our resampling stability test. We tested whether the algorithm would “discover” the same groups if given a slightly different subset of the European population.
ARI Logic: We calculated the Adjusted Rand Index (ARI) over 10 random resamples. ARI is a measure of similarity between two partitions that is “adjusted for chance,” meaning a score of 0 represents the agreement expected by chance and 1 represents perfect stability.
Outcome: Our stability analysis (Figure 15) showed that the k = 2 solutions for K-means and CLARA remained highly stable (high mean ARI), whereas the k = 4 solutions showed slightly more variance. This validates that while the 4-persona model is more interpretatively “human,” the 2-cluster model is more statistically “permanent.”
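The resampling procedure is not shown in the code above; a minimal sketch, assuming the standardized matrix X_num, the full-data K-means fit km, and mclust::adjustedRandIndex(), could look like this:
library(mclust)  # for adjustedRandIndex()
set.seed(42)
ref <- km$cluster  # reference partition from the full-data K-means fit
ari_vals <- replicate(10, {
  idx_b <- sample(seq_len(nrow(X_num)), size = 10000)  # random resample
  km_b <- kmeans(X_num[idx_b, ], centers = 2, nstart = 25)  # recluster
  adjustedRandIndex(km_b$cluster, ref[idx_b])  # agreement on the shared rows
})
summary(ari_vals)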
7.1 The “Granularity vs. Ground Truth” Paradox
A central challenge in this analysis was the tension between the statistical “optimal” and the practical “useful.” While the Gap Statistic and the dendrogram’s top-level splits strongly favored k = 2, this solution primarily captured a broad “well-being gradient”. Forcing the diversity of the European population into a binary partition, the “Healthy/Happy” versus the “Unhealthy/Unhappy,” oversimplifies the complex life-course stages captured in the ESS data.
By exploring k = 4, we discovered more nuanced “social personas”. Although internal metrics like the Silhouette Width declined slightly as k increased, the interpretative value improved significantly. This suggests that socio-health profiles in Europe do not exist as isolated islands, but rather as centers of gravity within a continuous social cloud.
7.2 Methodological Robustness and Algorithm Bias
The comparison between K-means and PAM/CLARA revealed the importance of choosing the right “representative” point. K-means, being centroid-based, is sensitive to extreme responses in variables like loneliness or sadness. In contrast, PAM and CLARA utilize medoids (actual observations from the dataset), which provided a more “grounded” and robust segmentation of the European public.
The high Adjusted Rand Index (ARI) scores in our stability testing confirm that, while the clusters have “fuzzy” boundaries, the core structures are replicable across different subsets of the data. This provides confidence that the identified personas are real social phenomena rather than random algorithmic artifacts.
7.3 Interpretation of Clusters (The Four Personas)
Based on the k = 4 solution from the CLARA and PAM models, we can identify four distinct socio-health archetypes:
The “Flourishing Youth”: This cluster represents the highest levels of subjective well-being. These individuals report excellent self-rated health, high life satisfaction, and almost no depressive symptoms. They are typically younger and serve as the “well-being baseline” for the population.
The “Contented Seniors”: A psychologically resilient group that often reports high happiness and life satisfaction despite their advanced age and declining physical health scores. This reflects a “well-being paradox” where emotional stability offsets physical vulnerability.
The “Squeezed Middle-Aged”: This group reports moderate health but higher-than-average scores for loneliness and sleep problems. This likely reflects the “sandwich generation” balancing career pressures and family care, where emotional exhaustion begins to manifest despite stable physical health.
The “Vulnerable Multi-Stressed”: The smallest but most critical group from a policy perspective. These individuals report poor health, low life satisfaction, and high depressive symptoms. Here, social isolation and physical illness act as a reinforcing feedback loop.
While these methods provide a sophisticated lens for viewing the data, several limitations must be acknowledged:
The “Gradient” Problem: The modest silhouette scores across all methods suggest that European socio-health profiles exist on a continuum. Forcing people into “Cluster A” or “Cluster B” is a helpful abstraction, but it risks oversimplifying individuals who sit on the borders of these groups.
Subsampling Constraints: Due to the computational intensity of calculating Gower distances and full-scale silhouette plots, some diagnostics were performed on subsamples. While these are representative, they may miss extremely rare sub-populations that a full-scale density-based method might have caught.
Cultural Variance: This report treats “Europe” as a single pool to find broad patterns. However, self-rated health and happiness are culturally relative. A “7” on a happiness scale in Denmark may reflect a different internal state than a “7” in Italy, a nuance that numerical clustering cannot fully resolve.
This project demonstrates that while unsupervised learning can effectively segment the European population, the resulting clusters are “soft” rather than “hard.” The two-cluster solution represents the most statistically rigorous split, identifying a primary socio-health divide primarily driven by the intersection of age and physical vitality.
However, the four-cluster exploration proved more valuable for generating “personas” that reflect specific life stages. The transition from K-means to CLARA allowed us to scale this analysis to the full dataset while maintaining the robustness of medoid-based clustering. Ultimately, these clusters should not be viewed as definitive labels, but as a roadmap for researchers and policymakers to understand which segments of the population are most vulnerable to the compounding effects of poor health and social isolation. The stability of these results, backed by ARI metrics, suggests that these social patterns are not random artifacts, but real structures within the European social fabric.