Introduction and database

The primary objective of this project is to investigate the underlying structure of social and institutional trust in European countries. Principal Component Analysis (PCA) is used to reduce the dimensionality of European Social Survey (ESS) data, and CLARA clustering is then applied both to the raw variables and to the retained principal components in order to determine which input yields the more robust clusters.

The data were sourced from the European Social Survey and cover variables measuring trust in national and international institutions as well as interpersonal trust. Observations containing responses such as “Refusal,” “Don’t know,” or “No answer” were removed, which reduced the sample from 50116 to 43744 observations.

Variable Descriptions (ESS Round 11)

Variable   Description
trstplt    Trust in politicians
trstplc    Trust in the police
trstprl    Trust in country’s parliament
trstprt    Trust in political parties
trstlgl    Trust in the legal system
trstep     Trust in the European Parliament
trstun     Trust in the United Nations
ppltrst    Most people can be trusted or you can’t be too careful
pplhlp     Most of the time people helpful or mostly looking out for themselves
pplfair    Most people try to take advantage of you, or try to be fair
data_final <- data %>% dplyr::select(trstplt, trstplc, trstprl, trstprt, trstlgl, trstep, trstun, ppltrst, pplhlp, pplfair)
colnames(data_final)
##  [1] "trstplt" "trstplc" "trstprl" "trstprt" "trstlgl" "trstep"  "trstun" 
##  [8] "ppltrst" "pplhlp"  "pplfair"
summary(data_final)
##     trstplt          trstplc          trstprl          trstprt      
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 1.000   1st Qu.: 5.000   1st Qu.: 2.000   1st Qu.: 2.000  
##  Median : 4.000   Median : 7.000   Median : 5.000   Median : 4.000  
##  Mean   : 4.807   Mean   : 7.174   Mean   : 5.967   Mean   : 5.076  
##  3rd Qu.: 5.000   3rd Qu.: 8.000   3rd Qu.: 7.000   3rd Qu.: 5.000  
##  Max.   :99.000   Max.   :99.000   Max.   :99.000   Max.   :99.000  
##     trstlgl           trstep          trstun         ppltrst      
##  Min.   : 0.000   Min.   : 0.00   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.: 3.000   1st Qu.: 3.00   1st Qu.: 3.00   1st Qu.: 3.000  
##  Median : 6.000   Median : 5.00   Median : 5.00   Median : 5.000  
##  Mean   : 7.136   Mean   :10.14   Mean   :11.24   Mean   : 5.211  
##  3rd Qu.: 8.000   3rd Qu.: 7.00   3rd Qu.: 7.00   3rd Qu.: 7.000  
##  Max.   :99.000   Max.   :99.00   Max.   :99.00   Max.   :99.000  
##      pplhlp          pplfair      
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 3.000   1st Qu.: 4.000  
##  Median : 5.000   Median : 6.000  
##  Mean   : 5.246   Mean   : 6.097  
##  3rd Qu.: 7.000   3rd Qu.: 7.000  
##  Max.   :99.000   Max.   :99.000
dim(data_final)
## [1] 50116    10
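# Valid answers lie on the 0-10 scale; larger values (max 99 in the summary above)
# are missing-answer codes, so set them to NA and drop incomplete rows
data_final[data_final > 10] <- NA
data_final <- na.omit(data_final)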
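
The filter above works because every valid answer lies between 0 and 10. A more explicit variant recodes named missing-value codes instead. The sketch below uses a toy data frame and assumes the codes 77, 88 and 99 for “Refusal,” “Don’t know,” and “No answer”; these codes and the toy values are assumptions for illustration and should be checked against the ESS codebook.

```r
# Toy data frame standing in for the ESS extract (values are illustrative only)
toy <- data.frame(
  trstplt = c(4, 77, 6, 10),
  ppltrst = c(5, 3, 88, 0)
)

# Assumed missing-value codes for the 0-10 items
missing_codes <- c(77, 88, 99)

# Replace each missing code with NA, column by column
toy[] <- lapply(toy, function(x) replace(x, x %in% missing_codes, NA))

# Keep only complete rows, as na.omit() does in the report
toy_clean <- na.omit(toy)

nrow(toy_clean)            # number of rows left after listwise deletion
range(unlist(toy_clean))   # remaining values stay within the 0-10 scale
```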

The distribution of each variable was examined. The histograms show that trust levels are not uniformly distributed, although all variables share the same numerical range (0-10). The variables were nevertheless standardized so that each contributes equal variance to the PCA and to the distance calculations used later in clustering.

#distribution plot 
vars <- c("trstplt", "trstplc", "trstprl", "trstprt", "trstlgl", "trstep", "trstun", "ppltrst", "pplhlp", "pplfair")
data_melt <- melt(data_final[, vars], id.vars = NULL)

ggplot(data = data_melt) + 
  geom_histogram(aes(x = value), fill = "indianred", color = "white", bins = 30) + 
  facet_wrap(~ variable, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Distribution of variables", x = "Value", y = "Frequency") +
  theme(plot.title = element_text(hjust = 0.5, size = 15))

# standardizing to z-scores (mean 0, standard deviation 1)
data_scaled <- scale(data_final)
data_scaled <- as.data.frame(data_scaled)
summary(data_scaled)
##     trstplt           trstplc          trstprl           trstprt       
##  Min.   :-1.4426   Min.   :-2.443   Min.   :-1.6273   Min.   :-1.4592  
##  1st Qu.:-0.6362   1st Qu.:-0.493   1st Qu.:-0.8843   1st Qu.:-0.6401  
##  Median : 0.1702   Median : 0.287   Median : 0.2302   Median : 0.1789  
##  Mean   : 0.0000   Mean   : 0.000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5733   3rd Qu.: 0.677   3rd Qu.: 0.6017   3rd Qu.: 0.5885  
##  Max.   : 2.5893   Max.   : 1.457   Max.   : 2.0877   Max.   : 2.6362  
##     trstlgl            trstep            trstun            ppltrst         
##  Min.   :-1.9312   Min.   :-1.7727   Min.   :-1.85586   Min.   :-2.048020  
##  1st Qu.:-0.8374   1st Qu.:-0.6163   1st Qu.:-0.73159   1st Qu.:-0.824383  
##  Median : 0.2564   Median : 0.1546   Median : 0.01792   Median :-0.008625  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.000000  
##  3rd Qu.: 0.6210   3rd Qu.: 0.9256   3rd Qu.: 0.76743   3rd Qu.: 0.807133  
##  Max.   : 1.7148   Max.   : 2.0820   Max.   : 1.89170   Max.   : 2.030770  
##      pplhlp             pplfair       
##  Min.   :-2.133570   Min.   :-2.4487  
##  1st Qu.:-0.849165   1st Qu.:-0.6954  
##  Median : 0.007106   Median : 0.1813  
##  Mean   : 0.000000   Mean   : 0.0000  
##  3rd Qu.: 0.863376   3rd Qu.: 0.6196  
##  Max.   : 2.147781   Max.   : 1.9346
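
As a quick check of what the standardization does, the following sketch compares a manual z-score with `scale()` on simulated 0-10 answers. The simulated vector is a stand-in, not ESS data.

```r
set.seed(1)
# Simulated 0-10 answers; a stand-in for one trust item
x <- sample(0:10, 500, replace = TRUE)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() applies the same transformation column-wise and returns a matrix
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)
c(mean = mean(z_scale), sd = sd(z_scale))
```

After this step every variable has mean 0 and standard deviation 1, which is why the means in the summary above are all 0.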

PCA

Preparation

The correlation plot below shows substantial positive correlations among the selected trust variables. This provides an empirical justification for the use of PCA, as the observed variables share considerable common variance.

M <- cor(data_scaled)
corrplot(M, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)

Two formal tests were performed to evaluate the suitability of the data for dimension reduction: the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy and Bartlett’s test of sphericity. The overall MSA value is 0.86 and every item-level MSA is at least 0.83, confirming that the data are well suited for PCA. Furthermore, Bartlett’s test rejects the null hypothesis that the correlation matrix is an identity matrix (chi-square = 268502.6, df = 45, p-value reported as 0), which confirms that the variables are sufficiently correlated for dimension reduction.

#Kaiser-Meyer-Olkin
kmo_result <- KMO(data_scaled)
print(kmo_result)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = data_scaled)
## Overall MSA =  0.86
## MSA for each item = 
## trstplt trstplc trstprl trstprt trstlgl  trstep  trstun ppltrst  pplhlp pplfair 
##    0.84    0.88    0.94    0.85    0.88    0.84    0.83    0.85    0.88    0.83
#Bartlett
n_obs <- nrow(data_scaled)
bartlett_result <- cortest.bartlett(M, n = n_obs)
print(bartlett_result)
## $chisq
## [1] 268502.6
## 
## $p.value
## [1] 0
## 
## $df
## [1] 45
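
To make the Bartlett output easier to read, the statistic can be reproduced by hand from a correlation matrix. The sketch below uses simulated data and the standard formula chi-square = -(n - 1 - (2p + 5) / 6) * log(det(R)) with p(p - 1) / 2 degrees of freedom; it is an illustration of the formula, not a re-run of the analysis.

```r
set.seed(42)
n <- 300
p <- 4
# Simulate correlated variables through a shared latent factor (illustrative data)
f <- rnorm(n)
X <- sapply(1:p, function(j) 0.7 * f + rnorm(n, sd = 0.7))

R <- cor(X)

# Bartlett's statistic and its degrees of freedom
chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
df <- p * (p - 1) / 2
p_value <- pchisq(chisq, df, lower.tail = FALSE)

c(chisq = chisq, df = df, p.value = p_value)
```

With p = 10 variables the degrees of freedom are 10 * 9 / 2 = 45, which matches the output above.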

PCA

PCA was conducted to transform the 10 correlated trust variables into a set of uncorrelated principal components. The first principal component is the most important, accounting for 51.3% of the total variance. The second explains an additional 15.01%, bringing the cumulative proportion to 66.31%. Adding PC3 and PC4 raises the cumulative proportion to 81.75%; an 85% threshold is only crossed with PC5 (86.63%).

pca_model <- prcomp(data_scaled, center = TRUE, scale. = FALSE)
summary(pca_model)
## Importance of components:
##                          PC1    PC2     PC3    PC4     PC5     PC6     PC7
## Standard deviation     2.265 1.2253 0.90600 0.8503 0.69831 0.63023 0.56294
## Proportion of Variance 0.513 0.1501 0.08208 0.0723 0.04876 0.03972 0.03169
## Cumulative Proportion  0.513 0.6631 0.74519 0.8175 0.86625 0.90597 0.93766
##                            PC8     PC9    PC10
## Standard deviation     0.50627 0.49421 0.35051
## Proportion of Variance 0.02563 0.02442 0.01229
## Cumulative Proportion  0.96329 0.98771 1.00000

The next step is to determine the number of components to retain for further analysis. This decision is guided by the eigenvalues, each of which represents the amount of variance captured by the corresponding principal component. According to the Kaiser rule, only components with an eigenvalue greater than 1.0 should be retained. The scree plot shows that components 1 and 2 satisfy this criterion (eigenvalues of about 5.13 and 1.50), while components 3 and 4 fall somewhat short of it (about 0.82 and 0.72).

fviz_eig(pca_model, choice = "eigenvalue", ncp = 10, barfill = "skyblue", barcolor = "skyblue3", linecolor = "skyblue4", addlabels = TRUE, main = "Eigenvalues")
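
The retention criteria can also be checked numerically. The sketch below copies the standard deviations printed by `summary(pca_model)` and derives eigenvalues, the Kaiser count and the number of components needed for a given cumulative-variance threshold. The helper `k_for` is a hypothetical name introduced here for illustration.

```r
# Standard deviations copied from the summary(pca_model) output above
sdev <- c(2.265, 1.2253, 0.90600, 0.8503, 0.69831,
          0.63023, 0.56294, 0.50627, 0.49421, 0.35051)

eigenvalues <- sdev^2                        # variance captured by each component
prop_var    <- eigenvalues / sum(eigenvalues)
cum_var     <- cumsum(prop_var)

# Kaiser rule: count components with an eigenvalue above 1
kaiser_k <- sum(eigenvalues > 1)

# Smallest number of components whose cumulative variance reaches a threshold
k_for <- function(threshold) which(cum_var >= threshold)[1]

round(eigenvalues[1:4], 2)
kaiser_k
k_for(0.80)
k_for(0.85)
```

With the reported values this gives two components under the Kaiser rule, four components for an 80% threshold and five for an 85% threshold.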

The first dimension is predominantly defined by variables related to official institutions (vertical trust). The second dimension (Dim2), which explains 15% of the variance, represents interpersonal trust.

fviz_pca_var(pca_model, col.var = "black")

The cos2 plot shows the quality of representation of each respondent in the plane spanned by the first two components. The orange and red points are individuals who are well represented by the two-dimensional space, while those located near the center (blue) have more complex trust patterns and may be better represented in higher dimensions.

fviz_pca_ind(pca_model, col.ind = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             geom = "point", alpha.ind = 0.5)
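
For intuition, the cos2 of an individual on the first two components is the share of its squared distance from the origin that lies in the PC1-PC2 plane. The sketch below computes it directly from `prcomp` scores on simulated data; the data and object names are illustrative and not part of the original analysis.

```r
set.seed(7)
# Illustrative correlated data; not the ESS sample
X <- scale(matrix(rnorm(200 * 5), ncol = 5) %*% matrix(runif(25), 5, 5))
pca <- prcomp(X, center = TRUE, scale. = FALSE)

scores <- pca$x
# Squared distance of each individual from the origin in the full PC space
d2 <- rowSums(scores^2)
# Share of that squared distance captured by PC1 and PC2
cos2_12 <- rowSums(scores[, 1:2]^2) / d2

summary(cos2_12)
```

Values close to 1 mean the point is almost fully described by the two plotted axes; values close to 0 mean most of its variation lies in the remaining components.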

PCA Rotation

To achieve a simple structure, a Varimax rotation was applied, which redistributed the variance to make the factors more interpretable without changing the total variance explained.
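
The claim that rotation leaves the explained variance unchanged can be illustrated with base R. The sketch below uses simulated data and `stats::varimax()` as a stand-in for the rotation step; it shows that each variable's communality (and hence its uniqueness, 1 minus communality) is the same before and after an orthogonal rotation.

```r
set.seed(11)
# Illustrative data with two blocks of correlated variables
f1 <- rnorm(500)
f2 <- rnorm(500)
X <- cbind(sapply(1:4, function(j) f1 + rnorm(500)),
           sapply(1:3, function(j) f2 + rnorm(500)))
pca <- prcomp(scale(X))

# Unrotated loadings for the first two components: eigenvectors scaled by sdev
L <- pca$rotation[, 1:2] %*% diag(pca$sdev[1:2])

# Orthogonal varimax rotation from the stats package
L_rot <- unclass(varimax(L)$loadings)

# Communality per variable = row sum of squared loadings
comm_before <- rowSums(L^2)
comm_after  <- rowSums(L_rot^2)
uniqueness  <- 1 - comm_after

all.equal(comm_before, comm_after)
c(total_before = sum(L^2), total_after = sum(L_rot^2))
```

Only the split of variance between the two components changes, which is why the cumulative proportion stays the same after rotation.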

PCA 2

The rotated two-component solution retains a cumulative variance of 66.3%. The first rotated component (RC1) captures institutional trust, with the highest loadings for trust in politicians (0.843) and political parties (0.836). The second (RC2) consists of the social trust variables. The uniqueness plot for the two-component solution shows a relatively uniform distribution across most variables, hovering between 0.25 and 0.4, which means that 0.6 to 0.75 of the variance of each of these variables is explained by the two components.

#varimax rotation 
res_rot <- principal(data_scaled, nfactors = 2, rotate = "varimax")

#loadings
print(loadings(res_rot), digits = 3, cutoff = 0.4, sort = TRUE)
## 
## Loadings:
##         RC1   RC2  
## trstplt 0.843      
## trstplc 0.650      
## trstprl 0.813      
## trstprt 0.836      
## trstlgl 0.749      
## trstep  0.777      
## trstun  0.753      
## ppltrst       0.816
## pplhlp        0.794
## pplfair       0.837
## 
##                  RC1   RC2
## SS loadings    4.327 2.304
## Proportion Var 0.433 0.230
## Cumulative Var 0.433 0.663
# contribution of variables to the unrotated components of pca_model
p1 <- fviz_contrib(pca_model, choice = "var", axes = 1, fill = "skyblue")
p2 <- fviz_contrib(pca_model, choice = "var", axes = 2, fill = "salmon")

#layout
p1 + p2 + plot_layout(ncol = 2)

#uniqueness plot
barplot(res_rot$uniqueness, 
        main = "Uniqueness of variables", 
        las = 2,           
        col = "skyblue", 
        cex.names = 0.7,   
        ylab = "Uniqueness")
abline(h = 0.6, col = "red", lty = 2)

PCA 4

Increasing the model to four components raises the cumulative explained variance to 81.7%. RC1 focuses mostly on politicians, parliament and political parties; RC2 remains a proxy for social trust; RC3 represents trust in international institutions; and RC4 captures trust in the police and the legal system.

The uniqueness plot shows that more of the variance is explained when the number of components increases from 2 to 4. Uniqueness for most variables ranges from 0.1 to 0.2, with the exception of the variables related to social trust.

#varimax rotation 
res_rot4 <- principal(data_scaled, nfactors = 4, rotate = "varimax")

#loadings
print(loadings(res_rot4), digits = 3, cutoff = 0.4, sort = TRUE)
## 
## Loadings:
##         RC1   RC2   RC3   RC4  
## trstplt 0.870                  
## trstprl 0.745                  
## trstprt 0.867                  
## ppltrst       0.812            
## pplhlp        0.795            
## pplfair       0.832            
## trstep              0.846      
## trstun              0.867      
## trstplc                   0.877
## trstlgl 0.424             0.743
## 
##                  RC1   RC2   RC3   RC4
## SS loadings    2.534 2.194 1.777 1.670
## Proportion Var 0.253 0.219 0.178 0.167
## Cumulative Var 0.253 0.473 0.650 0.817
# contribution of variables to the unrotated components of pca_model
p1 <- fviz_contrib(pca_model, choice = "var", axes = 1, fill = "skyblue")
p2 <- fviz_contrib(pca_model, choice = "var", axes = 2, fill = "salmon")
p3 <- fviz_contrib(pca_model, choice = "var", axes = 3, fill = "palegreen")
p4 <- fviz_contrib(pca_model, choice = "var", axes = 4, fill = "plum")

#layout
p1 + p2 + p3 + p4 + plot_layout(ncol = 2)

#uniqueness plot
barplot(res_rot4$uniqueness, 
        main = "Uniqueness of variables", 
        las = 2,           
        col = "skyblue", 
        cex.names = 0.7,   
        ylab = "Uniqueness")
abline(h = 0.6, col = "red", lty = 2)

Clustering

Preparation

Following the dimension reduction, the next stage of the analysis groups respondents into clusters. To assess whether the data are suitable for clustering, the Hopkins statistic was calculated on a random sample of 2000 observations for computational reasons. The resulting score of 0.93 indicates a strong clustering tendency. For the clustering itself, the CLARA algorithm was selected; it is an extension of PAM designed for large datasets. Clustering is performed on the raw data, on the first two principal components and on the first four principal components.

set.seed(123)
#hopkins stat on a sample due to computational issues
sample_indices <- sample(1:nrow(data_scaled), 2000)
data_sample <- data_scaled[sample_indices, ]
h_stat <- hopkins(data_sample, m = 200) 
print(h_stat)
## [1] 0.9345512
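
For readers unfamiliar with the statistic, the sketch below computes one common form of the Hopkins statistic by hand on simulated data. It compares nearest-neighbour distances of uniform random points (u) with those of sampled real points (w), each raised to the power of the number of dimensions d. This is an assumption about the general definition; the `hopkins()` function used above may differ in implementation details.

```r
set.seed(123)
# Illustrative 2-D data with two well-separated groups (not the ESS sample)
X <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(200, mean = 3, sd = 0.3), ncol = 2))
n <- nrow(X)
d <- ncol(X)
m <- 20

# Distance from a point to its nearest neighbour among the rows of `data`
nn_dist <- function(point, data) min(sqrt(colSums((t(data) - point)^2)))

# w: nearest-neighbour distances for m sampled real points (excluding themselves)
idx <- sample(n, m)
w <- sapply(idx, function(i) nn_dist(X[i, ], X[-i, , drop = FALSE]))

# u: nearest-neighbour distances from m uniform points in the bounding box to the data
U <- sapply(1:d, function(j) runif(m, min(X[, j]), max(X[, j])))
u <- apply(U, 1, nn_dist, data = X)

# Hopkins statistic: near 0.5 for random data, near 1 for clustered data
H <- sum(u^d) / (sum(u^d) + sum(w^d))
H
```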

Optimal number of clusters

To find the most appropriate number of clusters, two methods were applied: the elbow method and the average silhouette method. The plot of the total within-cluster sum of squares (WSS) shows a visible bend at k=2 or k=3. The silhouette plot shows a peak at k=2, which corresponds to the highest level of cluster separation. Although the two-cluster solution is the most distinct by this measure, this study adopts a three-cluster solution in order to obtain more informative and interpretable profiles.

# Elbow
fviz_nbclust(data_sample, clara, method = "wss") +  
  geom_vline(xintercept = 3, linetype = 2) +
  labs(subtitle = "Elbow method")

# Silhouette
fviz_nbclust(data_sample, clara, method = "silhouette") + 
  labs(subtitle = "Silhouette score")
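
The silhouette criterion can also be inspected as plain numbers rather than a plot. The sketch below, on simulated data, runs `clara()` for several values of k and collects the average silhouette width it reports (computed on the best subsample). The data and the `samples = 20` setting are assumptions for illustration.

```r
library(cluster)  # provides clara()

set.seed(123)
# Illustrative data with three groups; a stand-in for data_sample
X <- rbind(matrix(rnorm(300, mean = 0), ncol = 2),
           matrix(rnorm(300, mean = 4), ncol = 2),
           cbind(rnorm(150, mean = 0), rnorm(150, mean = 4)))

ks <- 2:6
# Average silhouette width reported by clara() for each candidate k
avg_sil <- sapply(ks, function(k) clara(X, k = k, samples = 20)$silinfo$avg.width)

data.frame(k = ks, avg_sil = round(avg_sil, 3))
ks[which.max(avg_sil)]   # k with the highest average silhouette width
```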

CLARA Clustering

Raw data

The clusters follow the general diagonal trend of the data, separating the population into three segments: high trust, low trust, and a central moderate group.

PCA 2

As observed in the plot, the clusters are separated by nearly straight, parallel lines. This occurs because the clustering was performed in the exact same 2D space in which it is visualized.

PCA 4

Unlike the PCA 2 plot, the clusters here appear to significantly overlap in the 2D space. It may indicate that the clusters separate in the third and fourth dimensions, which are not visible on the plot.
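
The three runs described above could be set up roughly as follows. This is a minimal sketch on simulated data: the object names `clara_raw`, `clara_pca` and `clara_pca_4` and the choice of k = 3 follow the report, while the `samples` and `sampsize` values and the helper `run_clara` are assumptions, since the original settings are not shown.

```r
library(cluster)

set.seed(123)
# Stand-in for data_scaled: 10 standardized columns driven by two latent factors
f1 <- rnorm(1000)
f2 <- rnorm(1000)
data_scaled <- as.data.frame(scale(cbind(
  sapply(1:7, function(j) f1 + rnorm(1000)),
  sapply(1:3, function(j) f2 + rnorm(1000))
)))
pca_model <- prcomp(data_scaled, center = TRUE, scale. = FALSE)

# samples and sampsize are assumed values, not taken from the report
run_clara <- function(x) clara(x, k = 3, metric = "euclidean",
                               samples = 50, sampsize = 200)

clara_raw   <- run_clara(data_scaled)          # all 10 standardized variables
clara_pca   <- run_clara(pca_model$x[, 1:2])   # scores on PC1-PC2
clara_pca_4 <- run_clara(pca_model$x[, 1:4])   # scores on PC1-PC4

# Average silhouette width of each solution (on clara's best subsample)
sapply(list(raw = clara_raw, pca2 = clara_pca, pca4 = clara_pca_4),
       function(m) m$silinfo$avg.width)
```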

Comparison

First, the silhouette width was compared, as it is a key metric for evaluating clustering quality. The silhouette profile for the raw data reveals a relatively weak cluster structure: the average silhouette width is 0.17, with several observations below 0. The PCA 2-dim solution shows the highest average width (0.34), with consistently higher bars, meaning that the clusters are more compact and better separated. The PCA 4-dim average silhouette width is 0.23, with some negative values.

clara_raw$silinfo$avg.width
## [1] 0.168179
clara_pca$silinfo$avg.width
## [1] 0.3399685
clara_pca_4$silinfo$avg.width
## [1] 0.2312136
#Raw Data
s1 <- fviz_silhouette(clara_raw, print.summary = FALSE) +
  ggtitle("Silhouette: Raw Data") +
  theme_minimal() +
  theme(axis.text.x = element_blank())

# PCA (2 components)
s2 <- fviz_silhouette(clara_pca, print.summary = FALSE) +
  ggtitle("Silhouette: PCA 2-dim") +
  theme_minimal() +
  theme(axis.text.x = element_blank())

# PCA (4 components)
s3 <- fviz_silhouette(clara_pca_4, print.summary = FALSE) +
  ggtitle("Silhouette: PCA 4-dim") +
  theme_minimal() +
  theme(axis.text.x = element_blank())

s1 / s2 / s3

The three clustering results were evaluated using the average silhouette width, the Dunn index and WSS, computed on a random subsample of 2000 observations. The PCA 2-component model performed best on silhouette width and WSS. In contrast, clustering on raw data yielded a substantially weaker structure, which may indicate that some noise was filtered out by dimension reduction. Note that WSS is measured in spaces of different dimensionality, so part of the reduction simply reflects the variance discarded by PCA. Although the PCA 2 model outperformed the others in terms of silhouette and WSS, it recorded the lowest Dunn index (0.002), which is typical for large datasets in which observations form a continuous cloud rather than isolated groups. In the reduced 2D PCA space the clusters are contiguous, meaning that their boundaries are in direct contact.

# Quality measures comparison on a random subsample of 2000 observations
idx <- sample(1:nrow(data_scaled), 2000)
# Distance matrix
dist_raw <- dist(data_scaled[idx, ])
dist_pca2 <- dist(pca_model$x[idx, 1:2])
dist_pca4 <- dist(pca_model$x[idx, 1:4])

# Quality measures
stats_raw  <- cluster.stats(dist_raw,  clara_raw$clustering[idx])
stats_pca2 <- cluster.stats(dist_pca2, clara_pca$clustering[idx])
stats_pca4 <- cluster.stats(dist_pca4, clara_pca_4$clustering[idx])

quality_measures <- data.frame(
  Method = c("Raw Data", "PCA 2", "PCA 4"),
  Silhouette = c(stats_raw$avg.silwidth, stats_pca2$avg.silwidth, stats_pca4$avg.silwidth),
  Dunn_Index = c(stats_raw$dunn, stats_pca2$dunn, stats_pca4$dunn),
  WSS = c(stats_raw$within.cluster.ss, stats_pca2$within.cluster.ss, stats_pca4$within.cluster.ss)
)

print(round(quality_measures[, -1], 3) %>% cbind(Method = quality_measures$Method))
##   Silhouette Dunn_Index       WSS   Method
## 1      0.180      0.064 11424.144 Raw Data
## 2      0.340      0.002  4729.155    PCA 2
## 3      0.234      0.021  7953.785    PCA 4
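
The very low Dunn index is easier to interpret with its definition in hand: the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. The sketch below implements that definition on two simulated partitions; `dunn_index` is a hypothetical helper written for illustration, not the function used in the report.

```r
set.seed(5)
# Two illustrative partitions of 2-D points
sep <- rbind(matrix(rnorm(100, 0, 0.5), ncol = 2),
             matrix(rnorm(100, 6, 0.5), ncol = 2))   # two isolated groups
cloud <- matrix(rnorm(200), ncol = 2)                 # one continuous cloud
labels_sep <- rep(1:2, each = 50)
labels_cloud <- ifelse(cloud[, 1] < 0, 1, 2)          # cut the cloud at x = 0

dunn_index <- function(x, labels) {
  d <- as.matrix(dist(x))
  same <- outer(labels, labels, "==")
  # smallest distance between points in different clusters
  min_between <- min(d[!same])
  # largest distance between points in the same cluster (cluster diameter)
  max_within <- max(d[same])
  min_between / max_within
}

dunn_index(sep, labels_sep)
dunn_index(cloud, labels_cloud)
```

When a continuous cloud is cut into adjacent regions, some pair of neighbouring points always straddles the boundary, so the numerator is close to zero regardless of how sensible the partition is.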

The PCA 2-component model identified three groups: the trusting (standardized variable means close to 1), the neutral (means around zero) and the skeptics (negative means).

# Profiling the clusters
get_profile <- function(cluster_vector, original_data, label) {
  original_data %>%
    mutate(Cluster = cluster_vector) %>%
    group_by(Cluster) %>%
    summarise(across(everything(), ~ mean(.x, na.rm = TRUE))) %>%
    mutate(Method = label)
}

#Raw
profile_raw <- get_profile(data_final_results$cluster_raw, data_scaled, "Raw Data")
#PCA2
profile_pca2 <- get_profile(data_final_results$cluster_pca, data_scaled, "PCA 2-dim")
# PCA4
profile_pca4 <- get_profile(data_final_results$cluster_pca_4, data_scaled, "PCA 4-dim")

final_comparison <- bind_rows(profile_raw, profile_pca2, profile_pca4)
final_comparison

Conclusions

This study aimed to demonstrate the effectiveness of combining PCA with CLARA clustering to analyze social attitudes related to trust in institutions and in other people. By reducing the initial 10 variables from the European Social Survey to a lower-dimensional space, greater clarity and interpretability were achieved.

Dimension reduction with two principal components yielded the most robust results, outperforming the raw data approach, plausibly by filtering out noise. While the four-component model captured more of the variance, the two-component model provided a more distinct partition. The final clusters revealed three profiles of Europeans: individuals with a high level of trust in institutions who also show strong confidence in other people, a “middle-ground” group, and skeptics characterized by deep distrust of both political actors and the community.