The primary objective of this project is to investigate the underlying structure of social and institutional trust in European countries. Using Principal Component Analysis (PCA) and CLARA clustering, this research reduces the dimensionality of European Social Survey data and compares clustering on the reduced principal components with clustering on the raw data, to determine which yields more robust results.
The data were sourced from the European Social Survey and focus on variables measuring trust in national and international institutions, as well as interpersonal trust. Observations containing responses such as “Refusal,” “Don’t know,” or “No answer” were removed, which left a sample of 43744 observations.
| Variable | Description |
|---|---|
| trstplt | Trust in politicians |
| trstplc | Trust in the police |
| trstprl | Trust in country’s parliament |
| trstprt | Trust in political parties |
| trstlgl | Trust in the legal system |
| trstep | Trust in the European Parliament |
| trstun | Trust in the United Nations |
| ppltrst | Most people can be trusted or you can’t be too careful |
| pplhlp | Most of the time people helpful or mostly looking out for themselves |
| pplfair | Most people try to take advantage of you, or try to be fair |
data_final <- data %>% dplyr::select(trstplt, trstplc, trstprl, trstprt, trstlgl, trstep, trstun, ppltrst, pplhlp, pplfair)
colnames(data_final)
## [1] "trstplt" "trstplc" "trstprl" "trstprt" "trstlgl" "trstep" "trstun"
## [8] "ppltrst" "pplhlp" "pplfair"
## trstplt trstplc trstprl trstprt
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 1.000 1st Qu.: 5.000 1st Qu.: 2.000 1st Qu.: 2.000
## Median : 4.000 Median : 7.000 Median : 5.000 Median : 4.000
## Mean : 4.807 Mean : 7.174 Mean : 5.967 Mean : 5.076
## 3rd Qu.: 5.000 3rd Qu.: 8.000 3rd Qu.: 7.000 3rd Qu.: 5.000
## Max. :99.000 Max. :99.000 Max. :99.000 Max. :99.000
## trstlgl trstep trstun ppltrst
## Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.: 3.00 1st Qu.: 3.00 1st Qu.: 3.000
## Median : 6.000 Median : 5.00 Median : 5.00 Median : 5.000
## Mean : 7.136 Mean :10.14 Mean :11.24 Mean : 5.211
## 3rd Qu.: 8.000 3rd Qu.: 7.00 3rd Qu.: 7.00 3rd Qu.: 7.000
## Max. :99.000 Max. :99.00 Max. :99.00 Max. :99.000
## pplhlp pplfair
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.: 4.000
## Median : 5.000 Median : 6.000
## Mean : 5.246 Mean : 6.097
## 3rd Qu.: 7.000 3rd Qu.: 7.000
## Max. :99.000 Max. :99.000
## [1] 50116 10
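The maxima of 99 in the summary above are missing-value codes rather than trust scores, which is why those observations are dropped before any further analysis. The filtering step itself is not shown, so the following is only a minimal sketch on a toy data frame. It assumes that every valid answer lies on the 0-10 scale and that anything outside it is a code for a refusal, "don't know" or no answer; the names `toy` and `toy_clean` are illustrative.

```r
# Toy stand-in for data_final; the real ESS extract is not reproduced here.
toy <- data.frame(
  trstplt = c(4, 77, 6, 10, 0),
  ppltrst = c(5, 3, 88, 7, 99)
)

# Assumption: valid answers are integers from 0 to 10; any other value
# is treated as a missing-value code.
valid_rows <- apply(toy, 1, function(row) all(row %in% 0:10))

# Keep only respondents with a valid answer on every item.
toy_clean <- toy[valid_rows, , drop = FALSE]
dim(toy_clean)
```

Dropping whole rows (listwise deletion) keeps the later PCA and clustering steps simple, at the cost of discarding respondents who skipped only one item.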
The distribution of each variable was examined. The plots reveal that trust levels are not uniformly distributed, although all variables share the same numerical range (0-10). The variables were therefore standardized to equalize their variances and to keep them comparable in later steps.
#distribution plot
vars <- c("trstplt", "trstplc", "trstprl", "trstprt", "trstlgl", "trstep", "trstun", "ppltrst", "pplhlp", "pplfair")
data_melt <- melt(data_final[, vars], id.vars = NULL)
ggplot(data = data_melt) +
  geom_histogram(aes(x = value), fill = "indianred", color = "white", bins = 30) +
  facet_wrap(~ variable, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Distribution of variables", x = "Value", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5, size = 15))
#normalizing
data_scaled <- scale(data_final)
data_scaled <- as.data.frame(data_scaled)
summary(data_scaled)
## trstplt trstplc trstprl trstprt
## Min. :-1.4426 Min. :-2.443 Min. :-1.6273 Min. :-1.4592
## 1st Qu.:-0.6362 1st Qu.:-0.493 1st Qu.:-0.8843 1st Qu.:-0.6401
## Median : 0.1702 Median : 0.287 Median : 0.2302 Median : 0.1789
## Mean : 0.0000 Mean : 0.000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5733 3rd Qu.: 0.677 3rd Qu.: 0.6017 3rd Qu.: 0.5885
## Max. : 2.5893 Max. : 1.457 Max. : 2.0877 Max. : 2.6362
## trstlgl trstep trstun ppltrst
## Min. :-1.9312 Min. :-1.7727 Min. :-1.85586 Min. :-2.048020
## 1st Qu.:-0.8374 1st Qu.:-0.6163 1st Qu.:-0.73159 1st Qu.:-0.824383
## Median : 0.2564 Median : 0.1546 Median : 0.01792 Median :-0.008625
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.000000
## 3rd Qu.: 0.6210 3rd Qu.: 0.9256 3rd Qu.: 0.76743 3rd Qu.: 0.807133
## Max. : 1.7148 Max. : 2.0820 Max. : 1.89170 Max. : 2.030770
## pplhlp pplfair
## Min. :-2.133570 Min. :-2.4487
## 1st Qu.:-0.849165 1st Qu.:-0.6954
## Median : 0.007106 Median : 0.1813
## Mean : 0.000000 Mean : 0.0000
## 3rd Qu.: 0.863376 3rd Qu.: 0.6196
## Max. : 2.147781 Max. : 1.9346
The correlation plot presented below exhibits significant positive correlations among the selected trust variables, which provides an empirical justification for the use of PCA, as the observed variables share substantial common variance.
Two formal tests were performed to evaluate the suitability of the data for dimension reduction: the Kaiser-Meyer-Olkin (KMO) measure and Bartlett’s Test of Sphericity. The overall MSA value is 0.86, confirming that the data are well suited for PCA. Furthermore, Bartlett’s test rejected the null hypothesis that the correlation matrix is an identity matrix at a high level of significance, which confirms that the dataset has enough correlation structure for dimension reduction.
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = data_scaled)
## Overall MSA = 0.86
## MSA for each item =
## trstplt trstplc trstprl trstprt trstlgl trstep trstun ppltrst pplhlp pplfair
## 0.84 0.88 0.94 0.85 0.88 0.84 0.83 0.85 0.88 0.83
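#Bartlett
# Assumption: M is the correlation matrix of the standardized variables,
# as computed for the correlation plot.
M <- cor(data_scaled)
n_obs <- nrow(data_scaled)
bartlett_result <- cortest.bartlett(M, n = n_obs)
print(bartlett_result)
## $chisq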
## [1] 268502.6
##
## $p.value
## [1] 0
##
## $df
## [1] 45
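To make the reported statistic easier to read, the test can be reproduced by hand. The sketch below assumes the standard form of Bartlett's statistic, chi-square = -(n - 1 - (2p + 5)/6) * ln|R| with p(p - 1)/2 degrees of freedom, and applies it to simulated data rather than the survey. The helper name `bartlett_sphericity` and the toy data are illustrative only.

```r
# Hypothetical helper (not part of any package): Bartlett's test of
# sphericity computed from a correlation matrix R and sample size n.
bartlett_sphericity <- function(R, n) {
  p <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df <- p * (p - 1) / 2
  list(chisq = chisq, df = df,
       p.value = pchisq(chisq, df, lower.tail = FALSE))
}

# Toy data: 10 variables sharing one common component, so they correlate.
set.seed(42)
n <- 1000
common <- rnorm(n)
toy <- sapply(1:10, function(i) common + rnorm(n))

res <- bartlett_sphericity(cor(toy), n)
res$df       # 10 variables give 10 * 9 / 2 degrees of freedom
res$p.value  # very small when the variables are correlated
```

With 10 variables the test always has 45 degrees of freedom, matching the output above. A printed p-value of exactly 0 means the value is too small to be represented, not that it is literally zero.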
PCA was conducted to transform the 10 correlated trust variables into a set of uncorrelated principal components. The first principal component is the most important, accounting for 51.3% of the total variance. The second explains an additional 15.01%, reaching a cumulative proportion of 66.31%. Adding PC3 and PC4 raises the cumulative proportion to 81.75%; the 85% threshold is only crossed once PC5 is included (86.6%).
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.265 1.2253 0.90600 0.8503 0.69831 0.63023 0.56294
## Proportion of Variance 0.513 0.1501 0.08208 0.0723 0.04876 0.03972 0.03169
## Cumulative Proportion 0.513 0.6631 0.74519 0.8175 0.86625 0.90597 0.93766
## PC8 PC9 PC10
## Standard deviation 0.50627 0.49421 0.35051
## Proportion of Variance 0.02563 0.02442 0.01229
## Cumulative Proportion 0.96329 0.98771 1.00000
The next step is to determine the optimal number of components for further analysis. This decision is guided by the eigenvalues, which represent the amount of variance captured by each principal component. According to the Kaiser rule, only components with an eigenvalue greater than 1.0 should be retained. The scree plot shows that only components 1 and 2 satisfy this criterion, although components 3 and 4 are not far behind.
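The eigenvalues behind the Kaiser rule can be recovered directly from the standard deviations in the "Importance of components" table, since each eigenvalue is the squared standard deviation of its component. The sketch below copies those reported values by hand; the variable names are illustrative.

```r
# Standard deviations of PC1..PC10 as reported in the PCA summary.
sdev <- c(2.265, 1.2253, 0.90600, 0.8503, 0.69831,
          0.63023, 0.56294, 0.50627, 0.49421, 0.35051)

# Eigenvalue of a component = its variance = squared standard deviation.
eigenvalues <- sdev^2

# Kaiser rule: retain components whose eigenvalue exceeds 1.
kaiser_keep <- which(eigenvalues > 1)

# Cumulative share of variance and the first component count reaching 85%.
cum_var <- cumsum(eigenvalues) / sum(eigenvalues)
n_for_85 <- which(cum_var >= 0.85)[1]

round(eigenvalues, 3)
kaiser_keep
n_for_85
```

On these figures the Kaiser rule keeps PC1 and PC2, while PC3 and PC4 have eigenvalues of roughly 0.82 and 0.72, and five components are needed to pass 85% of the variance.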
fviz_eig(pca_model, choice = "eigenvalue", ncp = 22, barfill = "skyblue", barcolor = "skyblue3", linecolor = "skyblue4", addlabels = TRUE, main = "Eigenvalues")
The first dimension (Dim1) is predominantly defined by variables related to official institutions (vertical trust). The second dimension (Dim2), explaining 15% of the variance, represents interpersonal trust.
The cos2 plot visualizes the quality of projection for the entire respondent population within the newly defined factor space. Orange and red points are individuals who are well represented by the two-dimensional space, while the blue points near the center have more complex trust patterns and may be better represented in higher dimensions.
fviz_pca_ind(pca_model, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             geom = "point", alpha.ind = 0.5)
To achieve a simple structure, a Varimax rotation was applied, which redistributed the variance to make the factors more interpretable without changing the total variance explained.
The rotated two-factor solution maintains a cumulative variance of 66.3%. The first component (RC1) is mostly made up of institutional trust, with the highest loadings for trust in politicians and political parties. The second component (RC2) consists of the social trust variables. The uniqueness plot for the two-factor solution shows a relatively uniform distribution across most variables, hovering between 0.25 and 0.4, which means that 60-75% of each variable's variance is explained by the first two components.
#varimax rotation
res_rot <- principal(data_scaled, nfactors = 2, rotate = "varimax")
#loadings
print(loadings(res_rot), digits = 3, cutoff = 0.4, sort = TRUE)
##
## Loadings:
## RC1 RC2
## trstplt 0.843
## trstplc 0.650
## trstprl 0.813
## trstprt 0.836
## trstlgl 0.749
## trstep 0.777
## trstun 0.753
## ppltrst 0.816
## pplhlp 0.794
## pplfair 0.837
##
## RC1 RC2
## SS loadings 4.327 2.304
## Proportion Var 0.433 0.230
## Cumulative Var 0.433 0.663
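The claim that rotation leaves the total explained variance unchanged, and the link between loadings and uniqueness, can be checked on a small example. The sketch below uses base R's `prcomp()` and `stats::varimax()` on simulated data instead of `psych::principal()`, so it illustrates the idea rather than reproducing the analysis above. The toy variables and the two latent factors are assumptions.

```r
set.seed(1)
n <- 500
f1 <- rnorm(n)  # toy latent "institutional" factor
f2 <- rnorm(n)  # toy latent "social" factor
toy <- cbind(
  inst1 = f1 + rnorm(n, sd = 0.6),
  inst2 = f1 + rnorm(n, sd = 0.6),
  inst3 = f1 + rnorm(n, sd = 0.6),
  soc1  = f2 + rnorm(n, sd = 0.6),
  soc2  = f2 + rnorm(n, sd = 0.6)
)

pca <- prcomp(toy, scale. = TRUE)

# Component loadings: eigenvectors scaled by the component standard deviations.
L_raw <- pca$rotation[, 1:2] %*% diag(pca$sdev[1:2])

# Orthogonal varimax rotation of the two retained components.
L_rot <- unclass(varimax(L_raw)$loadings)

# Communality = share of a variable's variance explained by the retained
# components; uniqueness is the remainder.
communality_raw <- rowSums(L_raw^2)
communality_rot <- rowSums(L_rot^2)
uniqueness <- 1 - communality_rot

all.equal(communality_raw, communality_rot)  # rotation preserves communalities
c(raw = sum(L_raw^2), rotated = sum(L_rot^2)) / ncol(toy)
```

Because the rotation is orthogonal, each variable's communality, and therefore its uniqueness, is identical before and after rotation. Only the split of variance between the two components changes.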
#contribution of variables
p1 <- fviz_contrib(pca_model, choice = "var", axes = 1, fill = "skyblue")
p2 <- fviz_contrib(pca_model, choice = "var", axes = 2, fill = "salmon")
#layout
p1 + p2 + plot_layout(ncol = 2)
#uniqueness plot
barplot(res_rot$uniqueness,
        main = "Uniqueness of variables",
        las = 2,
        col = "skyblue",
        cex.names = 0.7,
        ylab = "Uniqueness")
abline(h = 0.6, col = "red", lty = 2)
Increasing the model to four factors raises the cumulative explained variance to 81.7%. RC1 mostly reflects trust in politicians and parties, RC2 remains a proxy for social trust, RC3 represents trust in international institutions, and RC4 captures trust in the police and the legal system.
The uniqueness plot shows that more of the variance is explained when the number of components increases from 2 to 4. Most variables now have a uniqueness between 0.1 and 0.2, with the exception of those related to social trust.
#varimax rotation
res_rot4 <- principal(data_scaled, nfactors = 4, rotate = "varimax")
#loadings
print(loadings(res_rot4), digits = 3, cutoff = 0.4, sort = TRUE)
##
## Loadings:
## RC1 RC2 RC3 RC4
## trstplt 0.870
## trstprl 0.745
## trstprt 0.867
## ppltrst 0.812
## pplhlp 0.795
## pplfair 0.832
## trstep 0.846
## trstun 0.867
## trstplc 0.877
## trstlgl 0.424 0.743
##
## RC1 RC2 RC3 RC4
## SS loadings 2.534 2.194 1.777 1.670
## Proportion Var 0.253 0.219 0.178 0.167
## Cumulative Var 0.253 0.473 0.650 0.817
#contribution of variables
p1 <- fviz_contrib(pca_model, choice = "var", axes = 1, fill = "skyblue")
p2 <- fviz_contrib(pca_model, choice = "var", axes = 2, fill = "salmon")
p3 <- fviz_contrib(pca_model, choice = "var", axes = 3, fill = "palegreen")
p4 <- fviz_contrib(pca_model, choice = "var", axes = 4, fill = "plum")
#layout
p1 + p2 + p3 + p4 + plot_layout(ncol = 2)
#uniqueness plot
barplot(res_rot4$uniqueness,
        main = "Uniqueness of variables",
        las = 2,
        col = "skyblue",
        cex.names = 0.7,
        ylab = "Uniqueness")
abline(h = 0.6, col = "red", lty = 2)
Following the dimension reduction, the next stage of the analysis involves grouping respondents into clusters. To assess whether the data are suitable for clustering, the Hopkins statistic was calculated on a random sample of 2000 observations for computational reasons. The resulting score of 0.93 indicates a strong clustering tendency. For the clustering itself, the CLARA algorithm was selected, which is an extension of PAM designed for large datasets. The clustering is performed on the raw data, on the first two principal components and on the first four principal components.
set.seed(123)
#hopkins stat on a sample due to computational issues
sample_indices <- sample(1:nrow(data_scaled), 2000)
data_sample <- data_scaled[sample_indices, ]
h_stat <- hopkins(data_sample, m = 200)
print(h_stat)
## [1] 0.9345512
To find the most appropriate number of clusters, two methods were applied: the elbow method and the average silhouette method. The plot of the total WSS (elbow) shows a noticeable “bend” at k=2 or k=3. The silhouette plot shows a peak at k=2, which represents the highest level of cluster separation. Although the two-cluster solution is mathematically the most distinct, this study adopts a three-cluster solution, with the aim of obtaining more informative and interpretable groups.
# Elbow
fviz_nbclust(data_sample, clara, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2) +
  labs(subtitle = "Elbow method")
# Silhouette
fviz_nbclust(data_sample, clara, method = "silhouette") +
  labs(subtitle = "Silhouette score")
The clusters follow the general diagonal trend of the data, separating the population into three segments: high trust, low trust, and a central moderate group.
As observed in the plot, the clusters are separated by nearly straight, parallel lines. This occurs because the clustering was performed in the exact same 2D space in which it is visualized.
Unlike the PCA 2 plot, the clusters here appear to overlap significantly in the 2D space. This may indicate that the clusters separate in the third and fourth dimensions, which are not visible on the plot.
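The CLARA calls that produce the three clusterings are not shown in this report. The sketch below shows one way such a comparison could be set up with `cluster::clara()` on simulated data. The toy data, the `_toy` object names and the `samples` and `pamLike` settings are assumptions, not the values used in the study.

```r
library(cluster)  # clara() comes from the 'cluster' package

set.seed(123)
# Toy stand-in for the standardized survey data: three loose groups
# of respondents measured on 10 variables.
centers <- c(-1, 0, 1)
group <- sample(1:3, 3000, replace = TRUE)
toy <- matrix(rnorm(3000 * 10, mean = centers[group], sd = 0.8), ncol = 10)

toy_pca <- prcomp(toy)
k <- 3

# Same algorithm and k on three inputs: all variables, 2 PCs and 4 PCs.
# samples and pamLike are illustrative choices.
clara_raw_toy  <- clara(toy,               k = k, samples = 50, pamLike = TRUE)
clara_pca2_toy <- clara(toy_pca$x[, 1:2],  k = k, samples = 50, pamLike = TRUE)
clara_pca4_toy <- clara(toy_pca$x[, 1:4],  k = k, samples = 50, pamLike = TRUE)

# Cluster sizes and the average silhouette width of CLARA's best sample.
table(clara_pca2_toy$clustering)
sapply(list(raw = clara_raw_toy, pca2 = clara_pca2_toy, pca4 = clara_pca4_toy),
       function(m) m$silinfo$avg.width)
```

CLARA runs PAM on several subsamples and keeps the medoids that work best for the full dataset, which is what makes it practical for tens of thousands of respondents.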
First, the silhouette width was compared, as it is a key metric for evaluating clustering performance. The silhouette profile for the raw data reveals a relatively weak cluster structure: the average silhouette width is 0.17, with several observations below 0. The PCA 2-dim solution shows the highest average width (0.34) with consistently higher bars, indicating clusters that are more compact and better separated. The PCA 4-dim average silhouette width is 0.23, with some negative values.
## [1] 0.168179
## [1] 0.3399685
## [1] 0.2312136
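For readers less familiar with the metric, the silhouette width of an observation is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The toy example below uses `cluster::silhouette()` on seven one-dimensional points, with one point deliberately assigned to the wrong cluster. The data are illustrative.

```r
library(cluster)

# Two tight groups around 0-2 and 10-12; the last point (7) sits closer
# to the second group but is assigned to the first.
x  <- c(0, 1, 2, 10, 11, 12, 7)
cl <- c(1L, 1L, 1L, 2L, 2L, 2L, 1L)

sil <- silhouette(cl, dist(x))

sil[, "sil_width"]            # one width per observation
mean(sil[, "sil_width"])      # average silhouette width
sum(sil[, "sil_width"] < 0)   # number of observations with negative width
```

For the misassigned point, a = 6 and b = 4, giving a width of -1/3. Negative widths in the silhouette profiles flag respondents who sit closer to a neighboring cluster than to their own.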
#Raw Data
s1 <- fviz_silhouette(clara_raw, print.summary = FALSE) +
  ggtitle("Silhouette: Raw Data") +
  theme_minimal() +
  theme(axis.text.x = element_blank())
# PCA (2 components)
s2 <- fviz_silhouette(clara_pca, print.summary = FALSE) +
  ggtitle("Silhouette: PCA 2-dim") +
  theme_minimal() +
  theme(axis.text.x = element_blank())
# PCA (4 components)
s3 <- fviz_silhouette(clara_pca_4, print.summary = FALSE) +
  ggtitle("Silhouette: PCA 4-dim") +
  theme_minimal() +
  theme(axis.text.x = element_blank())
s1 / s2 / s3
The three clustering results were evaluated using the average silhouette width, the Dunn index and WSS. The PCA 2-component model was identified as superior on silhouette and WSS. In contrast, clustering on raw data yielded a substantially weaker structure, which may indicate that some noise was filtered out by dimension reduction. Although the PCA 2 model outperformed the others in terms of silhouette and WSS, it recorded the lowest Dunn index (0.002), which is a typical result for large datasets where observations form a continuous cloud rather than isolated groups. In the reduced 2D PCA space, the clusters are contiguous, meaning their boundaries are in direct contact.
####Quality measures comparison
idx <- sample(1:nrow(data_scaled), 2000)
# Distance matrix
dist_raw <- dist(data_scaled[idx, ])
dist_pca2 <- dist(pca_model$x[idx, 1:2])
dist_pca4 <- dist(pca_model$x[idx, 1:4])
# Quality measures
stats_raw <- cluster.stats(dist_raw, clara_raw$clustering[idx])
stats_pca2 <- cluster.stats(dist_pca2, clara_pca$clustering[idx])
stats_pca4 <- cluster.stats(dist_pca4, clara_pca_4$clustering[idx])
quality_measures <- data.frame(
  Method = c("Raw Data", "PCA 2", "PCA 4"),
  Silhouette = c(stats_raw$avg.silwidth, stats_pca2$avg.silwidth, stats_pca4$avg.silwidth),
  Dunn_Index = c(stats_raw$dunn, stats_pca2$dunn, stats_pca4$dunn),
  WSS = c(stats_raw$within.cluster.ss, stats_pca2$within.cluster.ss, stats_pca4$within.cluster.ss)
)
print(round(quality_measures[, -1], 3) %>% cbind(Method = quality_measures$Method))
## Silhouette Dunn_Index WSS Method
## 1 0.180 0.064 11424.144 Raw Data
## 2 0.340 0.002 4729.155 PCA 2
## 3 0.234 0.021 7953.785 PCA 4
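The very low Dunn index for the PCA 2 solution is easier to interpret with the definition in hand. In its usual form, the index is the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. The sketch below implements that definition in base R. The helper `dunn_index` is hypothetical and is not the function used to produce the table above.

```r
# Hypothetical helper: Dunn index = minimum between-cluster distance
# divided by maximum within-cluster distance (cluster diameter).
dunn_index <- function(d, cl) {
  d <- as.matrix(d)
  same <- outer(cl, cl, "==")  # TRUE where two points share a cluster
  min(d[!same]) / max(d[same])
}

cl <- rep(1:2, each = 5)

# Two isolated groups with a wide gap between them.
separated <- c(0, 1, 2, 3, 4, 20, 21, 22, 23, 24)
# One continuous run of points cut in half: the clusters touch.
contiguous <- 0:9

dunn_index(dist(separated), cl)
dunn_index(dist(contiguous), cl)
```

In the separated case the gap (16) is four times the cluster diameter (4), so the index is 4. In the contiguous case the nearest points across the boundary are 1 apart, giving 0.25. A dense cloud of thousands of respondents cut into adjacent regions pushes the numerator toward zero, which is consistent with the 0.002 reported above.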
The PCA 2-component model identified three distinct groups: the trusting (variable means close to 1), the neutral (ranging around zero) and the skeptics (negative scores).
# Profiling the clusters
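get_profile <- function(cluster_vector, original_data, label) {
  original_data %>%
    mutate(Cluster = cluster_vector) %>%
    group_by(Cluster) %>%
    # Anonymous function avoids passing extra arguments through across(),
    # which is deprecated in recent dplyr versions
    summarise(across(everything(), \(x) mean(x, na.rm = TRUE))) %>%
    mutate(Method = label)
}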
#Raw
profile_raw <- get_profile(data_final_results$cluster_raw, data_scaled, "Raw Data")
#PCA2
profile_pca2 <- get_profile(data_final_results$cluster_pca, data_scaled, "PCA 2-dim")
# PCA4
profile_pca4 <- get_profile(data_final_results$cluster_pca_4, data_scaled, "PCA 4-dim")
final_comparison <- bind_rows(profile_raw, profile_pca2, profile_pca4)
final_comparison
This study aimed to demonstrate the effectiveness of integrating PCA with CLARA clustering to analyze social attitudes related to overall trust in institutions and people. By reducing the initial 10 variables from the European Social Survey to a lower-dimensional space, greater clarity and interpretability were achieved.
Dimension reduction with two principal components yielded the most robust results, outperforming the raw data approach by filtering out noise. While the 4-factor model captured more of the variance, the 2-component model provided a more distinct partition. The final clusters revealed three profiles of Europeans: individuals with a high level of trust in institutions who also show high interpersonal trust, a “middle-ground” group, and skeptics characterized by a deep distrust of political actors and of other people.