Introduction

This is my second project for Unsupervised Learning course. I use PCA (Principal Component Analysis) to reduce dimensions in economic data about European countries. The data is same as in my clustering project - from World Bank.

Why PCA? We have 6 variables and its hard to visualize. PCA helps to reduce this to 2-3 components while keeping most information.

Data and Libraries

if (!require("FactoMineR")) install.packages("FactoMineR")
if (!require("factoextra")) install.packages("factoextra")
if (!require("corrplot")) install.packages("corrplot")

library(FactoMineR)
library(factoextra)
library(corrplot)
# same data as clustering project (World Bank 2022)
europe <- data.frame(
  Country = c("Albania", "Austria", "Belgium", "Bulgaria", "Croatia", 
              "Czech Republic", "Denmark", "Estonia", "Finland", "France",
              "Germany", "Greece", "Hungary", "Ireland", "Italy",
              "Latvia", "Lithuania", "Netherlands", "Norway", "Poland",
              "Portugal", "Romania", "Slovakia", "Slovenia", "Spain",
              "Sweden", "Switzerland", "United Kingdom"),
  GDP_per_capita = c(6810, 52085, 49582, 13772, 18570, 27220, 67803, 28247, 
                     50732, 40886, 48636, 20867, 18390, 103685, 34776, 21947, 
                     24032, 57025, 106149, 18688, 24521, 15892, 21088, 28439, 
                     29674, 56424, 93259, 45295),
  Life_expectancy = c(76.5, 81.3, 81.9, 74.8, 78.1, 78.3, 81.4, 78.6, 82.0, 
                      82.5, 80.6, 80.1, 76.2, 82.0, 82.9, 75.3, 75.7, 81.4, 
                      83.2, 77.0, 81.1, 74.2, 77.0, 80.6, 83.0, 83.0, 84.0, 80.4),
  Unemployment = c(11.0, 4.8, 5.6, 4.3, 7.0, 2.2, 4.5, 5.6, 6.8, 7.3, 3.1, 
                   12.4, 3.6, 4.5, 8.1, 6.9, 5.9, 3.5, 3.2, 2.9, 6.0, 5.6, 
                   6.1, 4.0, 12.9, 7.5, 4.3, 3.7),
  Education = c(58, 89, 79, 79, 67, 65, 82, 70, 93, 67, 72, 143, 52, 80, 64, 
                88, 72, 90, 84, 67, 66, 52, 60, 82, 93, 78, 63, 62),
  Internet_users = c(79, 93, 92, 80, 82, 88, 98, 92, 93, 86, 93, 83, 89, 92, 
                     85, 90, 87, 93, 98, 87, 84, 84, 90, 89, 94, 95, 96, 97),
  CO2_emissions = c(1.8, 7.1, 8.0, 5.7, 4.3, 9.3, 5.1, 8.4, 7.5, 4.6, 8.1, 
                    5.7, 4.9, 7.7, 5.3, 3.8, 4.5, 8.8, 7.5, 8.1, 4.4, 3.7, 
                    5.8, 6.1, 5.1, 3.8, 4.0, 5.2)
)
rownames(europe) <- europe$Country

Checking Correlations

PCA works best when variables are correlated. Lets check:

cor_mat <- cor(europe[,-1])
corrplot(cor_mat, method = "color", addCoef.col = "black", number.cex = 0.7)

There are some correlations here especially between GDP and life expectancy. Good - PCA makes sense.

Running PCA

pca <- PCA(europe[,-1], graph = FALSE)

How Much Variance is Explained?

eig <- get_eigenvalue(pca)
print(round(eig, 2))
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1       2.71            45.19                       45.19
## Dim.2       1.56            26.05                       71.25
## Dim.3       0.97            16.15                       87.39
## Dim.4       0.36             6.05                       93.44
## Dim.5       0.28             4.60                       98.04
## Dim.6       0.12             1.96                      100.00

First component explains about 42%, second adds 18%. Together thats 60%. Not amazing but ok for visualization.

fviz_eig(pca, addlabels = TRUE)

Kaiser rule says keep components with eigenvalue > 1. Here thats first 2 components.

Variable Contributions

Which variables matter most?

fviz_pca_var(pca, col.var = "contrib",
             gradient.cols = c("blue", "yellow", "red"))

Reading this plot: - GDP and life expectancy point in similar direction - they are correlated - Variables pointing opposite are negatively correlated - Longer arrow = better represented

# loadings
pca$var$coord[,1:2]
##                      Dim.1       Dim.2
## GDP_per_capita   0.8913103  0.05132561
## Life_expectancy  0.7631650  0.43731057
## Unemployment    -0.4237216  0.85624150
## Education        0.2400949  0.74396571
## Internet_users   0.8623456 -0.02909967
## CO2_emissions    0.5948122 -0.28615340

PC1 is mainly about GDP, life expectancy, internet - basically “how developed is the country”

PC2 has more to do with education and unemployment

Country Scores

Where does each country fall?

fviz_pca_ind(pca, col.ind = "cos2",
             gradient.cols = c("blue", "yellow", "red"),
             repel = TRUE)

Countries on right side are more developed (high PC1). Norway and Switzerland are furthest right - they have highest GDP. Albania and Romania are on left - lower development.

Biplot

This shows both variables and countries together:

fviz_pca_biplot(pca, repel = TRUE,
                col.var = "red", col.ind = "blue")

You can see Norway and Switzerland are close to the GDP arrow direction.

Quality of Representation

Cos2 tells us how well each country is represented in 2D:

fviz_cos2(pca, choice = "ind", axes = 1:2)

Most countries are reasonably well represented. Some like Greece have lower cos2 - they might need more components to explain their position.

Simple Development Index

We can use PC1 as a simple development score:

scores <- as.data.frame(pca$ind$coord)
scores$Country <- rownames(scores)

# rank by PC1
scores <- scores[order(-scores$Dim.1),]
head(scores[, c("Country", "Dim.1")], 10)
##                       Country     Dim.1
## Norway                 Norway 3.4129643
## Ireland               Ireland 2.4479135
## Switzerland       Switzerland 2.1083997
## Netherlands       Netherlands 1.8746807
## Denmark               Denmark 1.7128163
## Germany               Germany 1.3225480
## Finland               Finland 1.2919728
## Austria               Austria 1.2894854
## Belgium               Belgium 1.2488049
## United Kingdom United Kingdom 0.9126716

Top countries by this “index”: Norway, Switzerland, Ireland…

tail(scores[, c("Country", "Dim.1")], 5)
##             Country     Dim.1
## Lithuania Lithuania -1.554336
## Croatia     Croatia -1.967948
## Bulgaria   Bulgaria -2.165052
## Romania     Romania -2.560033
## Albania     Albania -3.721780

Bottom: Albania, Romania, Bulgaria - mostly Eastern Europe.

Conclusions

What I learned:

  1. PCA reduced 6 variables to 2 main components explaining 60% of variance

  2. PC1 is basically a “development index” - combines GDP, health, technology

  3. PC2 is harder to interpret but relates to education vs environment tradeoffs

  4. Results match what we know - Western/Nordic countries score high, Eastern Europe scores lower

  5. Ireland scores very high because of its huge GDP (partly due to multinational companies)

Problems with this analysis: - We lose 40% of information by using only 2 components - Greece has very high education score (143%) which seems like outlier - PCA assumes linear relationships

This connects to my clustering project - countries that are close in PCA plot also ended up in same clusters.

References

sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Warsaw
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] corrplot_0.95    factoextra_1.0.7 ggplot2_4.0.0    FactoMineR_2.13 
## 
## loaded via a namespace (and not attached):
##  [1] tidyr_1.3.1          sass_0.4.10          generics_0.1.4      
##  [4] rstatix_0.7.3        lattice_0.22-7       digest_0.6.37       
##  [7] magrittr_2.0.4       evaluate_1.0.5       grid_4.5.1          
## [10] estimability_1.5.1   RColorBrewer_1.1-3   mvtnorm_1.3-3       
## [13] fastmap_1.2.0        jsonlite_2.0.0       ggrepel_0.9.6       
## [16] backports_1.5.0      Formula_1.2-5        purrr_1.1.0         
## [19] scales_1.4.0         jquerylib_0.1.4      abind_1.4-8         
## [22] cli_3.6.5            rlang_1.1.6          scatterplot3d_0.3-44
## [25] leaps_3.2            withr_3.0.2          cachem_1.1.0        
## [28] yaml_2.3.10          tools_4.5.1          multcompView_0.1-10 
## [31] ggsignif_0.6.4       dplyr_1.1.4          DT_0.34.0           
## [34] ggpubr_0.6.2         flashClust_1.01-2    broom_1.0.10        
## [37] vctrs_0.6.5          R6_2.6.1             lifecycle_1.0.4     
## [40] emmeans_2.0.1        car_3.1-3            htmlwidgets_1.6.4   
## [43] MASS_7.3-65          cluster_2.1.8.1      pkgconfig_2.0.3     
## [46] pillar_1.11.1        bslib_0.9.0          gtable_0.3.6        
## [49] glue_1.8.0           Rcpp_1.1.0           xfun_0.53           
## [52] tibble_3.3.0         tidyselect_1.2.1     rstudioapi_0.17.1   
## [55] knitr_1.50           farver_2.1.2         xtable_1.8-4        
## [58] htmltools_0.5.8.1    labeling_0.4.3       carData_3.0-6       
## [61] rmarkdown_2.30       compiler_4.5.1       S7_0.2.0