This is my second project for Unsupervised Learning course. I use PCA (Principal Component Analysis) to reduce dimensions in economic data about European countries. The data is same as in my clustering project - from World Bank.
Why PCA? We have 6 variables and its hard to visualize. PCA helps to reduce this to 2-3 components while keeping most information.
if (!require("FactoMineR")) install.packages("FactoMineR")
if (!require("factoextra")) install.packages("factoextra")
if (!require("corrplot")) install.packages("corrplot")
library(FactoMineR)
library(factoextra)
library(corrplot)
# same data as clustering project (World Bank 2022)
europe <- data.frame(
Country = c("Albania", "Austria", "Belgium", "Bulgaria", "Croatia",
"Czech Republic", "Denmark", "Estonia", "Finland", "France",
"Germany", "Greece", "Hungary", "Ireland", "Italy",
"Latvia", "Lithuania", "Netherlands", "Norway", "Poland",
"Portugal", "Romania", "Slovakia", "Slovenia", "Spain",
"Sweden", "Switzerland", "United Kingdom"),
GDP_per_capita = c(6810, 52085, 49582, 13772, 18570, 27220, 67803, 28247,
50732, 40886, 48636, 20867, 18390, 103685, 34776, 21947,
24032, 57025, 106149, 18688, 24521, 15892, 21088, 28439,
29674, 56424, 93259, 45295),
Life_expectancy = c(76.5, 81.3, 81.9, 74.8, 78.1, 78.3, 81.4, 78.6, 82.0,
82.5, 80.6, 80.1, 76.2, 82.0, 82.9, 75.3, 75.7, 81.4,
83.2, 77.0, 81.1, 74.2, 77.0, 80.6, 83.0, 83.0, 84.0, 80.4),
Unemployment = c(11.0, 4.8, 5.6, 4.3, 7.0, 2.2, 4.5, 5.6, 6.8, 7.3, 3.1,
12.4, 3.6, 4.5, 8.1, 6.9, 5.9, 3.5, 3.2, 2.9, 6.0, 5.6,
6.1, 4.0, 12.9, 7.5, 4.3, 3.7),
Education = c(58, 89, 79, 79, 67, 65, 82, 70, 93, 67, 72, 143, 52, 80, 64,
88, 72, 90, 84, 67, 66, 52, 60, 82, 93, 78, 63, 62),
Internet_users = c(79, 93, 92, 80, 82, 88, 98, 92, 93, 86, 93, 83, 89, 92,
85, 90, 87, 93, 98, 87, 84, 84, 90, 89, 94, 95, 96, 97),
CO2_emissions = c(1.8, 7.1, 8.0, 5.7, 4.3, 9.3, 5.1, 8.4, 7.5, 4.6, 8.1,
5.7, 4.9, 7.7, 5.3, 3.8, 4.5, 8.8, 7.5, 8.1, 4.4, 3.7,
5.8, 6.1, 5.1, 3.8, 4.0, 5.2)
)
rownames(europe) <- europe$Country
PCA works best when variables are correlated. Lets check:
cor_mat <- cor(europe[,-1])
corrplot(cor_mat, method = "color", addCoef.col = "black", number.cex = 0.7)
There are some correlations here especially between GDP and life expectancy. Good - PCA makes sense.
pca <- PCA(europe[,-1], graph = FALSE)
eig <- get_eigenvalue(pca)
print(round(eig, 2))
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.71 45.19 45.19
## Dim.2 1.56 26.05 71.25
## Dim.3 0.97 16.15 87.39
## Dim.4 0.36 6.05 93.44
## Dim.5 0.28 4.60 98.04
## Dim.6 0.12 1.96 100.00
First component explains about 42%, second adds 18%. Together thats 60%. Not amazing but ok for visualization.
fviz_eig(pca, addlabels = TRUE)
Kaiser rule says keep components with eigenvalue > 1. Here thats first 2 components.
Which variables matter most?
fviz_pca_var(pca, col.var = "contrib",
gradient.cols = c("blue", "yellow", "red"))
Reading this plot: - GDP and life expectancy point in similar direction - they are correlated - Variables pointing opposite are negatively correlated - Longer arrow = better represented
# loadings
pca$var$coord[,1:2]
## Dim.1 Dim.2
## GDP_per_capita 0.8913103 0.05132561
## Life_expectancy 0.7631650 0.43731057
## Unemployment -0.4237216 0.85624150
## Education 0.2400949 0.74396571
## Internet_users 0.8623456 -0.02909967
## CO2_emissions 0.5948122 -0.28615340
PC1 is mainly about GDP, life expectancy, internet - basically “how developed is the country”
PC2 has more to do with education and unemployment
Where does each country fall?
fviz_pca_ind(pca, col.ind = "cos2",
gradient.cols = c("blue", "yellow", "red"),
repel = TRUE)
Countries on right side are more developed (high PC1). Norway and Switzerland are furthest right - they have highest GDP. Albania and Romania are on left - lower development.
This shows both variables and countries together:
fviz_pca_biplot(pca, repel = TRUE,
col.var = "red", col.ind = "blue")
You can see Norway and Switzerland are close to the GDP arrow direction.
Cos2 tells us how well each country is represented in 2D:
fviz_cos2(pca, choice = "ind", axes = 1:2)
Most countries are reasonably well represented. Some like Greece have lower cos2 - they might need more components to explain their position.
We can use PC1 as a simple development score:
scores <- as.data.frame(pca$ind$coord)
scores$Country <- rownames(scores)
# rank by PC1
scores <- scores[order(-scores$Dim.1),]
head(scores[, c("Country", "Dim.1")], 10)
## Country Dim.1
## Norway Norway 3.4129643
## Ireland Ireland 2.4479135
## Switzerland Switzerland 2.1083997
## Netherlands Netherlands 1.8746807
## Denmark Denmark 1.7128163
## Germany Germany 1.3225480
## Finland Finland 1.2919728
## Austria Austria 1.2894854
## Belgium Belgium 1.2488049
## United Kingdom United Kingdom 0.9126716
Top countries by this “index”: Norway, Switzerland, Ireland…
tail(scores[, c("Country", "Dim.1")], 5)
## Country Dim.1
## Lithuania Lithuania -1.554336
## Croatia Croatia -1.967948
## Bulgaria Bulgaria -2.165052
## Romania Romania -2.560033
## Albania Albania -3.721780
Bottom: Albania, Romania, Bulgaria - mostly Eastern Europe.
What I learned:
PCA reduced 6 variables to 2 main components explaining 60% of variance
PC1 is basically a “development index” - combines GDP, health, technology
PC2 is harder to interpret but relates to education vs environment tradeoffs
Results match what we know - Western/Nordic countries score high, Eastern Europe scores lower
Ireland scores very high because of its huge GDP (partly due to multinational companies)
Problems with this analysis: - We lose 40% of information by using only 2 components - Greece has very high education score (143%) which seems like outlier - PCA assumes linear relationships
This connects to my clustering project - countries that are close in PCA plot also ended up in same clusters.
sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Warsaw
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] corrplot_0.95 factoextra_1.0.7 ggplot2_4.0.0 FactoMineR_2.13
##
## loaded via a namespace (and not attached):
## [1] tidyr_1.3.1 sass_0.4.10 generics_0.1.4
## [4] rstatix_0.7.3 lattice_0.22-7 digest_0.6.37
## [7] magrittr_2.0.4 evaluate_1.0.5 grid_4.5.1
## [10] estimability_1.5.1 RColorBrewer_1.1-3 mvtnorm_1.3-3
## [13] fastmap_1.2.0 jsonlite_2.0.0 ggrepel_0.9.6
## [16] backports_1.5.0 Formula_1.2-5 purrr_1.1.0
## [19] scales_1.4.0 jquerylib_0.1.4 abind_1.4-8
## [22] cli_3.6.5 rlang_1.1.6 scatterplot3d_0.3-44
## [25] leaps_3.2 withr_3.0.2 cachem_1.1.0
## [28] yaml_2.3.10 tools_4.5.1 multcompView_0.1-10
## [31] ggsignif_0.6.4 dplyr_1.1.4 DT_0.34.0
## [34] ggpubr_0.6.2 flashClust_1.01-2 broom_1.0.10
## [37] vctrs_0.6.5 R6_2.6.1 lifecycle_1.0.4
## [40] emmeans_2.0.1 car_3.1-3 htmlwidgets_1.6.4
## [43] MASS_7.3-65 cluster_2.1.8.1 pkgconfig_2.0.3
## [46] pillar_1.11.1 bslib_0.9.0 gtable_0.3.6
## [49] glue_1.8.0 Rcpp_1.1.0 xfun_0.53
## [52] tibble_3.3.0 tidyselect_1.2.1 rstudioapi_0.17.1
## [55] knitr_1.50 farver_2.1.2 xtable_1.8-4
## [58] htmltools_0.5.8.1 labeling_0.4.3 carData_3.0-6
## [61] rmarkdown_2.30 compiler_4.5.1 S7_0.2.0