For my additional homework I decided to perform PCA on the World Bank - World Development Indicators (2022) dataset. I focused on 2022 rather than 2023 because the 2023 data contained many more missing values. The homework is structured as follows: an overview of the data, selection of ten macroeconomic variables, PCA with interpretation of the components, and finally a comparison with MDS.
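The loading step itself is not shown in this report; below is a minimal sketch of how the 2022 extract could be read (the file name "WDI_2022.csv" and the use of readr are my assumptions - the actual import may differ).
# Hypothetical import of a wide 2022 extract (one row per country,
# indicator codes as columns); adjust the file name to your own export.
library(readr)
world <- read_csv("WDI_2022.csv")
dim(world)  # the summary below shows 196 countries and 47 columns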
summary(world)
## Country Name SP.ADO.TFRT NV.AGR.TOTL.ZS NE.EXP.GNFS.ZS
## Length:196 Min. : 0.473 Min. : 0.01149 Min. : 1.571
## Class :character 1st Qu.: 8.537 1st Qu.: 2.09654 1st Qu.: 24.422
## Mode :character Median : 29.694 Median : 6.95629 Median : 41.029
## Mean : 41.187 Mean :10.17519 Mean : 47.853
## 3rd Qu.: 61.053 3rd Qu.:15.82598 3rd Qu.: 59.864
## Max. :163.019 Max. :43.05999 Max. :211.278
## NA's :11 NA's :22
## SP.DYN.TFRT.IN BX.KLT.DINV.CD.WD AG.LND.FRST.K2 NY.GDP.MKTP.CD
## Min. :0.701 Min. :-2.942e+11 Min. : 0 Min. :1.568e+08
## 1st Qu.:1.530 1st Qu.: 7.508e+07 1st Qu.: 3096 1st Qu.:1.107e+10
## Median :2.011 Median : 8.764e+08 Median : 24317 Median :3.919e+10
## Mean :2.489 Mean : 1.011e+10 Mean : 208111 Mean :5.172e+11
## 3rd Qu.:3.271 3rd Qu.: 8.046e+09 3rd Qu.: 122824 3rd Qu.:2.544e+11
## Max. :6.749 Max. : 4.090e+11 Max. :8153116 Max. :2.601e+13
## NA's :3 NA's :16 NA's :5 NA's :2
## NY.GDP.MKTP.KD.ZG NY.GNP.PCAP.CD NY.GNP.PCAP.PP.CD NY.GNP.ATLS.CD
## Min. :-28.759 Min. : 230 Min. : 890 Min. :2.496e+08
## 1st Qu.: 2.553 1st Qu.: 2400 1st Qu.: 6430 1st Qu.:1.010e+10
## Median : 4.185 Median : 6640 Median : 16950 Median :3.899e+10
## Mean : 4.332 Mean : 17376 Mean : 27258 Mean :5.243e+11
## 3rd Qu.: 6.202 3rd Qu.: 22290 3rd Qu.: 41010 3rd Qu.:2.483e+11
## Max. : 63.335 Max. :132240 Max. :124910 Max. :2.582e+13
## NA's :1 NA's :3 NA's :3 NA's :3
## NY.GNP.MKTP.PP.CD NE.GDI.TOTL.ZS TX.VAL.TECH.MF.ZS SH.IMM.MEAS
## Min. :2.558e+08 Min. : 1.225 Min. : 0.000 Min. :33.00
## 1st Qu.:2.635e+10 1st Qu.:19.744 1st Qu.: 2.246 1st Qu.:76.25
## Median :8.805e+10 Median :24.201 Median : 6.824 Median :90.00
## Mean :8.742e+11 Mean :24.703 Mean :11.439 Mean :84.47
## 3rd Qu.:4.940e+11 3rd Qu.:27.829 3rd Qu.:16.804 3rd Qu.:96.00
## Max. :3.152e+13 Max. :56.066 Max. :72.637 Max. :99.00
## NA's :3 NA's :27 NA's :46 NA's :10
## NE.IMP.GNFS.ZS NV.IND.TOTL.ZS NY.GDP.DEFL.KD.ZG SP.DYN.LE00.IN
## Min. : 1.128 Min. : 2.391 Min. :-10.801 Min. :53.00
## 1st Qu.: 30.507 1st Qu.:18.028 1st Qu.: 4.719 1st Qu.:66.56
## Median : 48.133 Median :24.905 Median : 7.199 Median :73.43
## Mean : 53.793 Mean :27.028 Mean : 13.501 Mean :72.21
## 3rd Qu.: 68.521 3rd Qu.:33.060 3rd Qu.: 14.046 3rd Qu.:77.96
## Max. :190.394 Max. :76.838 Max. :266.987 Max. :85.38
## NA's :22 NA's :10 NA's :1 NA's :4
## TG.VAL.TOTL.GD.ZS MS.MIL.XPND.GD.ZS IT.CEL.SETS.P2 SH.DYN.MORT
## Min. : 15.66 Min. : 0.06918 Min. : 42.07 Min. : 1.50
## 1st Qu.: 43.32 1st Qu.: 0.96722 1st Qu.: 98.89 1st Qu.: 6.05
## Median : 63.77 Median : 1.45800 Median :118.07 Median : 14.50
## Mean : 73.70 Mean : 1.97281 Mean :118.16 Mean : 25.01
## 3rd Qu.: 85.56 3rd Qu.: 2.16442 3rd Qu.:132.97 3rd Qu.: 38.35
## Max. :356.16 Max. :33.54657 Max. :291.91 Max. :117.30
## NA's :6 NA's :50 NA's :47 NA's :9
## SM.POP.NETM BX.TRF.PWKR.CD.DT EN.POP.DNST SP.POP.GROW
## Min. :-5699445 Min. :0.000e+00 Min. : 2.205 Min. :-7.619
## 1st Qu.: -8675 1st Qu.:1.482e+08 1st Qu.: 38.994 1st Qu.: 0.191
## Median : -112 Median :6.897e+08 Median : 89.101 Median : 1.062
## Mean : 718 Mean :4.087e+09 Mean : 219.795 Mean : 1.062
## 3rd Qu.: 26832 3rd Qu.:3.699e+09 3rd Qu.: 214.583 3rd Qu.: 2.013
## Max. : 1319009 Max. :1.112e+11 Max. :7851.006 Max. : 5.907
## NA's :3 NA's :5 NA's :1
## SP.POP.TOTL SE.PRM.CMPT.ZS GC.REV.XGRT.GD.ZS SE.PRM.ENRR
## Min. :1.180e+04 Min. : 52.99 Min. : 3.268 Min. : 67.23
## 1st Qu.:1.707e+06 1st Qu.: 87.74 1st Qu.:17.894 1st Qu.: 94.66
## Median :7.453e+06 Median : 95.76 Median :27.098 Median : 99.71
## Mean :4.007e+07 Mean : 91.92 Mean :27.210 Mean :100.34
## 3rd Qu.:2.989e+07 3rd Qu.: 99.90 3rd Qu.:35.560 3rd Qu.:104.08
## Max. :1.425e+09 Max. :120.12 Max. :65.932 Max. :156.80
## NA's :72 NA's :94 NA's :46
## SE.SEC.ENRR AG.SRF.TOTL.K2 GC.TAX.TOTL.GD.ZS ER.PTD.TOTL.ZS
## Min. : 33.72 Min. : 20 Min. : 0.5785 Min. : 0.00
## 1st Qu.: 86.41 1st Qu.: 20760 1st Qu.:12.1591 1st Qu.: 2.93
## Median : 97.12 Median : 114760 Median :17.2955 Median :10.35
## Mean : 94.30 Mean : 718785 Mean :17.4935 Mean :13.23
## 3rd Qu.:106.01 3rd Qu.: 520990 3rd Qu.:22.2794 3rd Qu.:18.69
## Max. :144.85 Max. :17098250 Max. :31.2733 Max. :99.96
## NA's :70 NA's :5 NA's :94 NA's :3
## DT.TDS.DECT.EX.ZS SP.URB.GROW EG.ELC.ACCS.ZS FP.CPI.TOTL
## Min. : 0.1694 Min. :-7.3874 Min. : 10.30 Min. : 102.2
## 1st Qu.: 7.1991 1st Qu.: 0.6367 1st Qu.: 85.30 1st Qu.: 126.3
## Median :11.3586 Median : 1.6478 Median :100.00 Median : 141.9
## Mean :14.0297 Mean : 1.7315 Mean : 87.03 Mean : 487.3
## 3rd Qu.:18.6198 3rd Qu.: 2.9046 3rd Qu.:100.00 3rd Qu.: 195.6
## Max. :60.2780 Max. : 5.9473 Max. :100.00 Max. :38796.6
## NA's :82 NA's :2 NA's :1 NA's :23
## FR.INR.DPST GC.XPN.TOTL.GD.ZS NE.CON.TOTL.ZS NY.GDP.PCAP.KD.ZG
## Min. :-0.1523 Min. : 3.875 Min. : 32.37 Min. :-21.6953
## 1st Qu.: 1.2523 1st Qu.:19.630 1st Qu.: 69.43 1st Qu.: 0.9803
## Median : 2.5579 Median :28.240 Median : 79.03 Median : 2.5948
## Mean : 4.7563 Mean :29.018 Mean : 80.31 Mean : 3.1467
## 3rd Qu.: 6.1671 3rd Qu.:35.875 3rd Qu.: 90.96 3rd Qu.: 5.2230
## Max. :52.4167 Max. :81.642 Max. :162.39 Max. : 62.1110
## NA's :98 NA's :94 NA's :26 NA's :1
## BX.GSR.MRCH.CD BM.GSR.MRCH.CD SE.XPD.TOTL.GD.ZS
## Min. :2.098e+06 Min. :1.021e+08 Min. : 0.3485
## 1st Qu.:2.112e+09 1st Qu.:4.517e+09 1st Qu.: 3.0358
## Median :1.311e+10 Median :1.522e+10 Median : 3.9541
## Mean :1.333e+11 Mean :1.332e+11 Mean : 4.2242
## 3rd Qu.:8.563e+10 3rd Qu.:8.145e+10 3rd Qu.: 5.0780
## Max. :3.347e+12 Max. :3.270e+12 Max. :10.7034
## NA's :23 NA's :23 NA's :64
The dataset contains many missing values. As discussed in class, when dealing with missing data it is often better to leave it as-is so as not to distort the real meaning of the data. In my case, however, I will be calculating correlations and other numerical summaries, so I decided to replace the missing values with 0.
world[is.na(world)] <- 0
world[!complete.cases(world),]
## # A tibble: 0 × 47
## # ℹ 47 variables: Country Name <chr>, SP.ADO.TFRT <dbl>, NV.AGR.TOTL.ZS <dbl>,
## # NE.EXP.GNFS.ZS <dbl>, SP.DYN.TFRT.IN <dbl>, BX.KLT.DINV.CD.WD <dbl>,
## # AG.LND.FRST.K2 <dbl>, NY.GDP.MKTP.CD <dbl>, NY.GDP.MKTP.KD.ZG <dbl>,
## # NY.GNP.PCAP.CD <dbl>, NY.GNP.PCAP.PP.CD <dbl>, NY.GNP.ATLS.CD <dbl>,
## # NY.GNP.MKTP.PP.CD <dbl>, NE.GDI.TOTL.ZS <dbl>, TX.VAL.TECH.MF.ZS <dbl>,
## # SH.IMM.MEAS <dbl>, NE.IMP.GNFS.ZS <dbl>, NV.IND.TOTL.ZS <dbl>,
## # NY.GDP.DEFL.KD.ZG <dbl>, SP.DYN.LE00.IN <dbl>, TG.VAL.TOTL.GD.ZS <dbl>, …
Now the data is complete and we can look at the correlations. The dataset contains different kinds of variables - macroeconomic ones and more demographic ones - and I would like to focus on the macroeconomic data. A first look at the data:
world_matrix <- data.matrix(world[2:47])
correlation_matrix<-cor(na.omit(world_matrix))
corrplot(correlation_matrix, method="circle", tl.col="black", tl.cex=0.2, col=inferno(10))
We need to choose which 10 variables to take into consideration when performing the dimension reduction. My choices:
Agriculture, forestry, and fishing, value added (% of GDP) (NV.AGR.TOTL.ZS)
High-technology exports (% of manufactured exports) (TX.VAL.TECH.MF.ZS)
Imports of goods and services (% of GDP) (NE.IMP.GNFS.ZS)
Military expenditure (% of GDP) (MS.MIL.XPND.GD.ZS)
Population growth (annual %) (SP.POP.GROW)
Tax revenue (% of GDP) (GC.TAX.TOTL.GD.ZS)
Consumer price index (2010 = 100) (FP.CPI.TOTL)
Deposit interest rate (%) (FR.INR.DPST)
GDP per capita growth (annual %) (NY.GDP.PCAP.KD.ZG)
Government expenditure on education, total (% of GDP) (SE.XPD.TOTL.GD.ZS)
I focused mostly on macroeconomic factors.
Renaming the columns and an overview of the chosen variables:
world_macro<-world[,c("Country Name", "NY.GDP.PCAP.KD.ZG", "FR.INR.DPST", "FP.CPI.TOTL", "NE.IMP.GNFS.ZS", "TX.VAL.TECH.MF.ZS", "GC.TAX.TOTL.GD.ZS", "SP.POP.GROW","MS.MIL.XPND.GD.ZS", "SE.XPD.TOTL.GD.ZS", "NV.AGR.TOTL.ZS")]
colnames(world_macro)<-c("country", "GDP_percapita_growth", "deposit_rate", "CPI", "import_GS", "high_tech_export","tax_revenue", "pop_growth_rate", "military_exp", "goverm_exp_educ", "agriculture_to_GDP")
Correlation plot - the correlation values are shown in the cells:
world_macro_matrix <- data.matrix(world_macro[2:11])
correlation_matrix_macro<-cor(world_macro_matrix)
corrplot(correlation_matrix_macro, method="color", tl.col="black", addCoef.col="black", number.cex=0.5, tl.cex=0.5, col=inferno(10))
OBSERVATION: The correlation is lower than 0.4 for all pairs of variables.
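The strength of the relationships can be checked directly from the correlation matrix computed above (a quick check I added, not part of the original script):
# Largest absolute pairwise correlation among the ten selected variables
max(abs(correlation_matrix_macro[upper.tri(correlation_matrix_macro)]))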
In this step I proceed with the dimension reduction using PCA. The homework task was to reduce the number of dimensions from 10 to 3. First, PCA requires the data to be standardized: each variable is centered and rescaled to variance 1.
preproc_world_macro <- preProcess(world_macro[2:11], method=c("center", "scale"))
world_macro.s <- predict(preproc_world_macro, world_macro[2:11])
summary(world_macro.s)
## GDP_percapita_growth deposit_rate CPI import_GS
## Min. :-3.82067 Min. :-0.4544 Min. :-0.14994 Min. :-1.4121
## 1st Qu.:-0.34613 1st Qu.:-0.4270 1st Qu.:-0.10752 1st Qu.:-0.6315
## Median :-0.08321 Median :-0.4270 Median :-0.10293 Median :-0.1573
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.32017 3rd Qu.: 0.0309 3rd Qu.:-0.08639 3rd Qu.: 0.4756
## Max. : 9.07703 Max. : 8.9851 Max. :13.37386 Max. : 4.2178
## high_tech_export tax_revenue pop_growth_rate military_exp
## Min. :-0.6921 Min. :-0.9226 Min. :-5.590937 Min. :-0.5522
## 1st Qu.:-0.6921 1st Qu.:-0.9226 1st Qu.:-0.564272 1st Qu.:-0.5522
## Median :-0.4067 Median :-0.2134 Median :-0.008859 Median :-0.1267
## Mean : 0.0000 Mean : 0.0000 Mean : 0.000000 Mean : 0.0000
## 3rd Qu.: 0.2710 3rd Qu.: 0.8720 3rd Qu.: 0.615361 3rd Qu.: 0.1135
## Max. : 5.0503 Max. : 2.2468 Max. : 3.125569 Max. :12.0523
## goverm_exp_educ agriculture_to_GDP
## Min. :-1.1584 Min. :-0.9554
## 1st Qu.:-1.1584 1st Qu.:-0.7859
## Median : 0.0850 Median :-0.3588
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6972 3rd Qu.: 0.5412
## Max. : 3.1999 Max. : 3.3282
Eigenvalues represent the variance explained by the components. The higher the eigenvalue, the larger the share of variance explained by the corresponding component. The Kaiser criterion used in PCA states that, for standardized data, we should drop components whose eigenvalues are less than 1. According to the eigenvalues and this criterion, we should narrow the analysis down to 4 principal components (a quick check follows the eigenvalues below).
world_macro.cov<-cov(world_macro.s)
world_macro.eigen<-eigen(world_macro.cov)
world_macro.eigen$values
## [1] 1.9528671 1.3322583 1.1814486 1.1245120 0.9854871 0.8038083 0.7758225
## [8] 0.6892791 0.5946023 0.5599148
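The Kaiser criterion mentioned above can be applied programmatically (a small check I added, not in the original script):
# Number of components with eigenvalue greater than 1 (Kaiser criterion);
# with the eigenvalues printed above this returns 4
sum(world_macro.eigen$values > 1)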
Loadings are interpreted as the importance of a variable within a component: the higher the absolute loading, the more important the variable is for that component. When the correlation between the component and a variable is positive, the loading is positive as well; when the correlation is negative, the loading is negative.
Let's investigate the loadings in our case. Since we will restrict the data to 3 principal components, we will examine only the first three.
xxx<-world_macro.s
xxx.pca1<-prcomp(xxx, center=FALSE, scale.=FALSE)
xxx.pca1
## Standard deviations (1, .., p=10):
## [1] 1.3974502 1.1542349 1.0869446 1.0604301 0.9927170 0.8965535 0.8808078
## [8] 0.8302283 0.7711046 0.7482745
##
## Rotation (n x k) = (10 x 10):
## PC1 PC2 PC3 PC4
## GDP_percapita_growth -0.05297794 -0.417660491 -0.45790197 0.513456726
## deposit_rate -0.08014343 0.335681118 -0.48911229 0.079387361
## CPI -0.14344063 -0.009978956 -0.32253795 -0.667318829
## import_GS 0.44887220 -0.163928225 0.30250321 -0.096368023
## high_tech_export 0.41558856 -0.258401027 0.14304111 0.002494964
## tax_revenue 0.45988827 0.146981271 0.02785852 0.002884607
## pop_growth_rate -0.37208985 -0.261376369 0.37788608 -0.158567643
## military_exp 0.17346435 0.630276192 0.02369625 -0.077864738
## goverm_exp_educ -0.06673912 0.332107090 0.25097181 0.488593615
## agriculture_to_GDP -0.45989453 0.147208932 0.35773304 0.074254571
## PC5 PC6 PC7 PC8
## GDP_percapita_growth -0.05816475 -0.11086800 -0.24578407 0.05793824
## deposit_rate 0.55121836 0.44100541 0.12194070 -0.29487298
## CPI 0.28095995 -0.48502737 -0.03814317 0.02878682
## import_GS 0.14028527 0.09548680 0.52490129 -0.15111392
## high_tech_export 0.26903242 -0.14854501 -0.49388781 -0.59596964
## tax_revenue 0.30190214 0.27136745 -0.31533726 0.60075397
## pop_growth_rate 0.35188194 0.15572018 -0.32103608 0.23905630
## military_exp -0.34135123 -0.09130488 -0.39878740 -0.11204596
## goverm_exp_educ 0.43530397 -0.60074159 0.14821670 0.10134545
## agriculture_to_GDP 0.02832451 0.24427753 -0.14254082 -0.29995121
## PC9 PC10
## GDP_percapita_growth 0.35470409 -0.379716385
## deposit_rate -0.13812997 -0.127023962
## CPI 0.33420777 -0.042562880
## import_GS 0.31813404 -0.492425179
## high_tech_export -0.10943793 0.185801965
## tax_revenue 0.31310302 0.207251640
## pop_growth_rate -0.30662308 -0.472218695
## military_exp 0.01315391 -0.519058435
## goverm_exp_educ -0.03028308 0.002010686
## agriculture_to_GDP 0.66112257 0.159669705
OBSERVATION:
For PC1 the highest positive loadings are for imports of goods and services, high-technology exports, and tax revenue. The largest negative loadings are for the contribution of agriculture to GDP and the population growth rate. Based on that, the first component appears positively related to trade and technology and negatively related to agriculture and population growth.
For PC2 the highest loadings are for military expenditure (% of GDP), the deposit rate, and government expenditure on education (% of GDP). This indicates that the second component is positively correlated with those variables and negatively correlated with GDP per capita growth and the population growth rate.
For PC3 the largest positive loadings are for the population growth rate and the contribution of agriculture to GDP, while the largest negative loadings are for the deposit rate and GDP per capita growth. These readings can be verified by sorting the loadings, as in the sketch below.
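To make the statements above easier to verify, the loadings of a component can be ordered by absolute magnitude (a helper I added, not part of the original script); for PC1:
# Loadings of PC1 sorted by absolute value (signs kept for interpretation);
# repeat with columns 2 and 3 for PC2 and PC3
pc1 <- xxx.pca1$rotation[, 1]
pc1[order(abs(pc1), decreasing = TRUE)]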
fviz_pca_var(xxx.pca1, col.var=paletteInferno[3])
OBSERVATIONS: This graph presents the contribution of the variables to the dimensions and the relationships between the variables. If the angle between two variable vectors is close to 0, the variables are highly correlated - for example, imports of goods and services and high-technology exports.
An example of the opposite relation: the CPI and military expenditure vectors form an angle close to 90 degrees, which suggests they are nearly orthogonal to each other, i.e., practically uncorrelated.
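Since the cosine of the angle between two variable arrows approximates their correlation, both readings can be cross-checked against the correlation matrix computed earlier (my addition, using the renamed columns):
# Nearly parallel arrows should give a clearly positive correlation,
# nearly orthogonal arrows a correlation close to zero
correlation_matrix_macro["import_GS", "high_tech_export"]
correlation_matrix_macro["CPI", "military_exp"]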
We can also analyze the contribution of the variables to each component graphically:
var<-get_pca_var(xxx.pca1)
a<-fviz_contrib(xxx.pca1, "var", axes=1, xtickslab.rt=90, fill = paletteInferno[2], color = paletteInferno[2], cex.lab=0.4)+theme_minimal()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
a
b<-fviz_contrib(xxx.pca1, "var", axes=2, xtickslab.rt=90, fill = paletteInferno[4], color = paletteInferno[4])+theme_minimal()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
b
c<-fviz_contrib(xxx.pca1, "var", axes=3, xtickslab.rt=90, fill = paletteInferno[6], color = paletteInferno[6])+theme_minimal()+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
c
Now we will investigate the importance of the components:
summary(xxx.pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.3975 1.1542 1.0869 1.0604 0.99272 0.89655 0.88081
## Proportion of Variance 0.1953 0.1332 0.1181 0.1124 0.09855 0.08038 0.07758
## Cumulative Proportion 0.1953 0.3285 0.4467 0.5591 0.65766 0.73804 0.81562
## PC8 PC9 PC10
## Standard deviation 0.83023 0.77110 0.74827
## Proportion of Variance 0.06893 0.05946 0.05599
## Cumulative Proportion 0.88455 0.94401 1.00000
OBSERVATIONS: The first three components explain 44.67% of the variance. If I did not have the restriction to narrow the analysis to three components, I would keep seven or eight components instead.
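The cumulative proportion can also be recomputed directly from the standard deviations stored in the prcomp object (equivalent to the summary above):
# Proportion of variance per component and the cumulative share of the first three
prop_var <- xxx.pca1$sdev^2 / sum(xxx.pca1$sdev^2)
cumsum(prop_var)[3]  # about 0.4467, as in the summary table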
We can also visualise this for a clearer presentation:
eigenvalue <- fviz_eig(xxx.pca1, choice='eigenvalue', barcolor = paletteInferno[4], barfill = paletteInferno[4], main="Eigenvalues scree plot")+
theme_classic()
var_perc<-fviz_eig(xxx.pca1, barcolor = paletteInferno[3], barfill = paletteInferno[3], main = "Percentage of explained variance scree plot")+theme_classic()
grid.arrange(eigenvalue, var_perc, nrow=2)
We can observe that narrowing the analysis to 3 principal components gives a lower-quality dimension reduction than restricting it to 7 dimensions; however, reducing 10 dimensions to 7 is not satisfactory either.
In the last step I will investigate the meaning of each component. To do that we rotate the components once more (varimax) to obtain a simpler structure that is easier to interpret. We restrict the analysis to 3 factors, as stated in the homework description.
xxx.pca.rotated.analysis<-principal(xxx, nfactors=3, rotate="varimax")
xxx.pca.rotated.analysis
## Principal Components Analysis
## Call: principal(r = xxx, nfactors = 3, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC3 RC2 h2 u2 com
## GDP_percapita_growth 0.03 -0.18 -0.67 0.49 0.51 1.2
## deposit_rate 0.31 -0.59 0.05 0.45 0.55 1.5
## CPI 0.01 -0.35 -0.19 0.16 0.84 1.5
## import_GS 0.27 0.68 0.01 0.54 0.46 1.3
## high_tech_export 0.29 0.58 -0.18 0.45 0.55 1.7
## tax_revenue 0.55 0.34 0.15 0.44 0.56 1.8
## pop_growth_rate -0.72 0.09 -0.03 0.53 0.47 1.0
## military_exp 0.42 -0.14 0.63 0.59 0.41 1.9
## goverm_exp_educ -0.08 -0.03 0.47 0.23 0.77 1.1
## agriculture_to_GDP -0.65 -0.19 0.36 0.59 0.41 1.8
##
## RC1 RC3 RC2
## SS loadings 1.69 1.49 1.29
## Proportion Var 0.17 0.15 0.13
## Cumulative Var 0.17 0.32 0.45
## Proportion Explained 0.38 0.33 0.29
## Cumulative Proportion 0.38 0.71 1.00
##
## Mean item complexity = 1.5
## Test of the hypothesis that 3 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.12
## with the empirical chi square 270.17 with prob < 6.3e-47
##
## Fit based upon off diagonal values = 0.14
KEY OBSERVATIONS (note that the rotated components are printed in the order RC1, RC3, RC2):
RC1 - a component representing social and agricultural influences on development. It is dominated by the population growth rate (-0.72), the contribution of agriculture to GDP (-0.65) and tax revenue (0.55).
RC3 - the component with the highest contributions from imports of goods and services (0.68), high-technology exports (0.58) and the deposit rate (-0.59). It is related to the trade activity of economies.
RC2 - related to government spending on the military (0.63) and on education (0.47), and negatively to GDP per capita growth (-0.67).
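The dominant variables of each rotated component are easier to read when the loadings are sorted and small values are hidden (a presentation helper I added, not in the original script):
# Print rotated loadings sorted by component, suppressing values below 0.4
print(xxx.pca.rotated.analysis$loadings, cutoff = 0.4, sort = TRUE)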
Additionally, I took a look at the other statistics present in the summary:
complexity (com) - it indicates roughly how many components are needed to account for a variable; the higher the value, the more components the variable loads on. Here, for example, the most complex variable is military expenditure. Overall, high complexity values are undesirable; we can set a benchmark that values above 2 are high.
plot(xxx.pca.rotated.analysis$complexity, pch=16, xlim=c(-60, 60), main="Complexity of factors", xlab=" ", ylab="complexity", col=paletteInferno[5])
text(xxx.pca.rotated.analysis$complexity, labels=names(xxx.pca.rotated.analysis$complexity), cex=1)
uniqueness - the proportion of variance that is unique to a given variable (not shared with the other variables). In PCA we want it to be low, because then we can reduce the space to a smaller number of dimensions; we can set a benchmark that uniqueness above 0.7 is high.
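Uniqueness can also be recovered from the communalities reported in the table above, since u2 = 1 - h2 (a small check I added):
# Uniqueness per variable: one minus the communality (the h2 column above);
# the same values are used by the plotting code below via $uniqueness
1 - xxx.pca.rotated.analysis$communality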
We can visualise both characteristics on one graph with benchmark lines. We want uniqueness below 0.7 and complexity below 2:
plot(xxx.pca.rotated.analysis$complexity, xxx.pca.rotated.analysis$uniqueness, xlim=c(0, 3), xlab="complexity", ylab="uniqueness", main="Complexity and uniqueness graph",cex=0.8)
text(xxx.pca.rotated.analysis$complexity, xxx.pca.rotated.analysis$uniqueness, labels=names(xxx.pca.rotated.analysis$uniqueness), cex=0.8)
abline(h=c(0.38, 0.7), lty=3, col=2)
abline(v=c(2), lty=3, col=2)
PCA is not the best tool for reducing the dimensionality in this case. The reduction to 3 dimensions explains only about 45% of the total variance, which is not a desirable result. Additionally, some variables in the dataset show high uniqueness values, which is not good for PCA.
MDS is an algorithm that preserves pairwise distances (whereas PCA looks for directions of maximal variance). MDS also does not assume a linear relationship between the variables, so it can deal with nonlinear structure more easily.
First we calculate the distance matrix and then we perform multidimensional scaling down to 3 dimensions. The goal of MDS is to minimize a stress function that measures the discrepancy between the original distances and the distances in the reduced space (classical MDS, used here via cmdscale, solves this with an eigendecomposition rather than iterative optimization). In order to obtain comparable results we use the scaled dataset.
world_macro.s
## # A tibble: 196 × 10
## GDP_percapita_growth deposit_rate CPI import_GS high_tech_export
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.295 8.99 -0.150 -0.960 -0.316
## 2 -0.0318 -0.427 -0.104 -0.812 1.35
## 3 -0.0745 1.73 -0.0787 -0.845 0.0282
## 4 -0.0256 -0.158 -0.104 -0.893 1.50
## 5 -0.138 -0.427 -0.109 -0.250 1.12
## 6 -0.383 -0.427 -0.107 -0.132 0.690
## 7 0.464 -0.427 -0.0784 -0.621 0.311
## 8 0.215 0.149 -0.0931 -0.792 -0.0356
## 9 0.268 -0.427 -0.107 -0.320 0.0320
## 10 -0.266 -0.427 -0.112 -0.664 0.751
## # ℹ 186 more rows
## # ℹ 5 more variables: tax_revenue <dbl>, pop_growth_rate <dbl>,
## # military_exp <dbl>, goverm_exp_educ <dbl>, agriculture_to_GDP <dbl>
dist.world<-dist(world_macro.s, method = "euclidean")
mds1<-cmdscale(dist.world, k=3, eig = TRUE)
mds1
## $points
## [,1] [,2] [,3]
## [1,] 0.966138620 -3.277061351 -4.932023152
## [2,] -1.153377850 -0.154769877 0.296460439
## [3,] -0.164067520 -0.404669391 -1.621746016
## [4,] -0.524443722 -0.031884810 -0.156865144
## [5,] -1.769921636 0.371019871 -0.307791446
## [6,] -0.808743033 -0.031044004 0.267787553
## [7,] 0.470129414 -0.196447978 0.196009172
## [8,] 0.313392698 0.218450425 -0.564395062
## [9,] -1.368284753 0.216826321 -0.797976060
## [10,] -0.370807534 0.164845164 -0.386616242
## [11,] -2.169673381 0.351689068 -0.533631740
## [12,] -0.927463818 0.760616592 -0.420608683
## [13,] -2.001555927 0.225032169 0.644540273
## [14,] -0.092016565 -1.922091947 -0.854759915
## [15,] 0.634245400 -0.507329923 0.266419166
## [16,] -0.710966648 0.081302703 -0.126753485
## [17,] -1.331437688 0.553645012 0.564424641
## [18,] 0.130580585 -1.510613142 -2.513739926
## [19,] -1.708984088 -0.098637910 0.276443581
## [20,] -0.963625462 0.057734824 -0.618840085
## [21,] 1.691117525 -0.113231215 1.593497361
## [22,] 0.481510290 -0.258108999 -0.457948907
## [23,] 1.053640808 -1.180293630 0.184829782
## [24,] 0.717047346 1.050108189 -0.669521761
## [25,] 1.024924971 0.197872100 0.309865632
## [26,] 0.172128114 0.714953344 -0.623240989
## [27,] -1.287775806 -0.455353111 -1.130445814
## [28,] -0.764665286 1.323652406 -1.007740713
## [29,] -1.535911931 0.166017176 0.284245868
## [30,] -0.432580075 -1.447442249 -1.261674044
## [31,] -0.489679760 0.629466479 -0.958925153
## [32,] -0.145242957 0.212851260 -0.105259702
## [33,] 1.184324561 0.024130715 -0.892005773
## [34,] 0.395308705 1.231432473 -1.542086836
## [35,] -0.343815712 -1.449102791 -0.403207397
## [36,] -2.318626260 1.023216998 0.188432434
## [37,] 0.824118174 0.359588444 -0.025513837
## [38,] 1.992617497 0.267753453 0.995545102
## [39,] -0.003438902 0.904723718 -0.810328940
## [40,] 0.925545191 -0.346445503 0.415253991
## [41,] 0.974704267 -0.756068731 0.445867414
## [42,] -1.362676231 0.328901282 -0.893841681
## [43,] -0.564323441 0.033843931 -0.493117270
## [44,] -0.171952194 0.071106918 -0.021285356
## [45,] -2.367444634 0.082836410 -1.441838433
## [46,] 0.618402863 -1.090892473 1.216673273
## [47,] 2.110244868 -0.811355668 1.434706180
## [48,] 0.306673018 0.897252482 -0.778933465
## [49,] 0.473451178 0.096206145 0.572984542
## [50,] 1.710472552 0.157142581 0.636035086
## [51,] -0.428551069 0.199784852 0.242792118
## [52,] 0.480389521 0.068991817 1.118437849
## [53,] 1.814343262 -0.277414272 1.364302961
## [54,] -0.845468862 -0.465270169 -1.141658933
## [55,] -0.125246168 -0.355453512 -1.184761161
## [56,] 2.356392706 -0.087925940 1.127276221
## [57,] 1.067220862 0.531513367 0.837529364
## [58,] 0.893632719 0.006344651 0.617533134
## [59,] -0.952144919 0.880193266 -0.654874259
## [60,] 0.931388711 0.133757719 0.630535243
## [61,] -1.747110971 0.449823362 -0.771899868
## [62,] -0.136388724 0.642348573 -0.814458802
## [63,] 0.296125039 0.645627184 -1.955573759
## [64,] -2.166286166 0.820391874 0.221798957
## [65,] -1.246167041 0.874919781 0.188097007
## [66,] -1.880385394 -0.348650890 0.545191289
## [67,] -1.336456383 1.609838626 0.819594788
## [68,] 0.516697034 0.081592502 -0.210903305
## [69,] 0.013075754 -0.349346601 -0.615247895
## [70,] -0.060731366 -0.421117021 -0.508308280
## [71,] 1.156027923 0.191313866 -1.115986414
## [72,] -1.064246779 -0.119610585 0.100092947
## [73,] 0.557102173 0.572848599 -0.051943565
## [74,] 1.010676071 0.656645442 -0.549951656
## [75,] -1.842869692 0.460563697 0.574660062
## [76,] 1.260436807 -0.958794907 0.004449671
## [77,] 2.479355428 -0.210918209 1.347375918
## [78,] 0.932074269 0.736464936 0.378411127
## [79,] 0.452244015 0.900400356 -0.913125964
## [80,] -1.280449374 0.156146957 -0.313959571
## [81,] -0.265278150 1.097111568 -0.699801028
## [82,] 0.954459633 0.298283304 0.227012557
## [83,] 1.741862530 0.029888616 0.381966446
## [84,] -0.803432631 -0.544739159 -1.467997690
## [85,] 1.102093930 -0.654632751 -0.378938102
## [86,] -2.259812262 -0.156267184 -0.862724734
## [87,] 0.936622532 0.249720180 -0.948786133
## [88,] 0.304126981 -0.042701868 -0.123629535
## [89,] 1.787833700 -0.058133085 1.153754500
## [90,] 2.206714087 0.152211423 0.928547939
## [91,] 1.712280111 4.154732728 -4.890211425
## [92,] 1.471256898 -0.153739928 0.294652894
## [93,] 0.741495466 -0.263213005 0.405799157
## [94,] -4.452438869 1.927168487 1.886772272
## [95,] -2.314021568 0.163921873 -1.019592153
## [96,] -1.109801942 0.729552642 0.906259026
## [97,] 1.142631516 -0.361190471 -0.018033684
## [98,] 0.832980485 0.781331522 -0.509500723
## [99,] -2.285129223 2.274140030 0.703517259
## [100,] -1.459750013 -0.647679578 0.539578919
## [101,] 1.095223789 -0.665469424 -0.843705694
## [102,] -0.620568580 -1.100005540 0.175438098
## [103,] -0.682047716 0.526816412 0.516460574
## [104,] 1.168587838 -0.782953600 0.113944116
## [105,] 0.094259935 0.521946298 1.124034431
## [106,] -0.284915399 0.505073988 -0.703305236
## [107,] 1.325934137 -0.643319207 1.077980135
## [108,] -0.708384861 -0.100800978 0.911343816
## [109,] 1.103103239 0.085253038 -0.030993366
## [110,] -2.143946126 0.282934781 -0.006187563
## [111,] -0.730623744 1.055076900 -0.046098646
## [112,] -1.449901674 -0.764114492 0.856104173
## [113,] 2.442940365 -0.037927401 0.968087596
## [114,] 0.046523131 0.254605058 0.589468324
## [115,] -1.703066051 0.364535155 0.335996390
## [116,] -2.591770440 0.449229463 1.767152577
## [117,] -1.458037895 -1.683415353 1.564334860
## [118,] 1.514040417 -0.786557750 -0.172793736
## [119,] 1.738588495 -0.018190399 0.748228989
## [120,] -1.966067014 1.456128387 0.595730100
## [121,] 0.283863970 0.935571510 -0.384434025
## [122,] 2.302507102 -0.634771107 1.787198875
## [123,] -2.644004370 1.772776808 1.063053557
## [124,] -0.021833109 -0.988907919 0.322358444
## [125,] 1.262690014 0.297737829 1.084485514
## [126,] -0.890317877 0.147245304 -0.559954536
## [127,] 0.872868711 0.337827098 0.571117609
## [128,] -1.263404197 -1.725128517 -0.662280900
## [129,] -0.304867915 0.910425947 0.987341360
## [130,] -0.506392303 0.782904721 -0.584449643
## [131,] -0.694900120 -1.436353519 0.627719462
## [132,] 0.378243714 -0.949440553 1.357826022
## [133,] 1.520100101 -0.304168541 -0.191494567
## [134,] -0.241893770 -1.503656943 1.108917596
## [135,] -0.465971199 0.118586204 0.848277103
## [136,] 1.234857198 -0.041052864 0.368479374
## [137,] -1.433456105 -0.660074620 -0.366665247
## [138,] -0.065867270 -0.168304236 0.676695768
## [139,] 2.580173838 0.283765893 1.481289711
## [140,] 2.181601928 -0.068505251 -0.205481106
## [141,] -0.575935362 0.492270276 -0.219751618
## [142,] -1.686251688 -0.199676085 0.083337316
## [143,] 0.898548161 0.007874487 0.928785725
## [144,] 1.604731873 -0.231638707 0.447544376
## [145,] -0.187046847 0.092407109 0.201684391
## [146,] 0.574990005 0.909766211 -0.496222248
## [147,] 1.148114811 0.357550964 -0.177694710
## [148,] 0.051583079 -0.070422995 0.590557589
## [149,] 0.740274586 -0.320114438 -0.344836392
## [150,] -2.150056829 1.421364868 0.618110224
## [151,] -1.450192792 0.311878164 -0.664986457
## [152,] -1.172158910 0.585011170 -0.573009525
## [153,] -0.263518459 0.670240719 -1.210339767
## [154,] 1.080164357 -0.357385361 0.730018496
## [155,] -1.146861969 0.104812545 -0.982837145
## [156,] 1.785516326 -0.421765070 0.235370296
## [157,] 0.611777499 -0.553183040 0.769934957
## [158,] -2.706874666 0.796572692 -0.106950015
## [159,] 1.693998628 -0.367114887 0.451265139
## [160,] 0.353516891 -0.534532007 1.280452258
## [161,] -1.836981590 -0.791647943 -0.733265300
## [162,] -1.380674016 -0.067853160 0.944694675
## [163,] 2.607108132 -1.048326020 1.171777161
## [164,] -3.100222218 1.650679646 2.055936626
## [165,] 1.097061573 0.559340553 -0.805590955
## [166,] -2.068832034 0.309049739 0.059931176
## [167,] -1.689957464 0.599667121 -0.076468119
## [168,] 1.302544683 -0.962815484 2.141074423
## [169,] 0.342891999 1.203592554 0.436917363
## [170,] -0.551786122 -0.845629968 0.017971736
## [171,] 0.189661460 -0.688163547 0.241572097
## [172,] 0.837204011 0.394589501 -1.534685235
## [173,] 0.982938587 1.090540091 -2.052901475
## [174,] 0.768460961 -0.297165718 -0.916993624
## [175,] 3.364383541 0.656270469 -4.243097237
## [176,] 1.013752057 -0.172502804 -0.888923130
## [177,] -1.815215369 0.394317736 -0.092738656
## [178,] 3.059991731 0.313953385 1.945386148
## [179,] 1.100615596 -0.143580273 1.036609809
## [180,] 1.581637319 -0.140060993 1.041506358
## [181,] -1.429401564 0.364766599 0.321019959
## [182,] -0.397682287 -1.364130677 1.638014753
## [183,] 0.591228751 -1.117877037 0.851160293
## [184,] 0.523653187 -0.613589250 0.592395395
## [185,] 0.680108344 -0.219720537 -0.931133970
## [186,] 0.214821015 -0.623522574 0.736008943
## [187,] 1.484021001 -0.299494839 0.999211395
## [188,] -4.536310269 -11.194916328 -0.560692995
## [189,] 0.404044678 1.942108450 0.988678360
## [190,] 0.338334662 -0.564139694 -0.866233206
## [191,] 1.350662386 -1.090206860 -0.678230615
## [192,] 0.710872882 -0.479937974 1.153298011
## [193,] -1.077700912 1.471104282 0.343297150
## [194,] 0.563319911 1.123883457 0.279318138
## [195,] 1.040387559 0.323564399 0.248696045
## [196,] 1.850280508 -1.123534108 -4.556604141
##
## $eig
## [1] 3.808091e+02 2.597904e+02 2.303825e+02 2.192798e+02 1.921700e+02
## [6] 1.567426e+02 1.512854e+02 1.344094e+02 1.159475e+02 1.091834e+02
## [11] 6.880946e-13 2.668232e-13 2.361772e-13 1.462996e-13 1.345763e-13
## [16] 8.407673e-14 3.460965e-14 2.584081e-14 2.256230e-14 2.220531e-14
## [21] 1.795908e-14 1.508267e-14 1.384050e-14 1.258982e-14 1.079169e-14
## [26] 9.990249e-15 9.453339e-15 9.076563e-15 8.850293e-15 8.617568e-15
## [31] 8.081677e-15 7.969679e-15 7.482417e-15 6.434706e-15 6.147151e-15
## [36] 5.586305e-15 5.154366e-15 4.860911e-15 4.859449e-15 4.699985e-15
## [41] 4.570646e-15 4.474876e-15 4.461016e-15 3.994779e-15 3.944326e-15
## [46] 3.571916e-15 3.557447e-15 3.478092e-15 3.334405e-15 3.309017e-15
## [51] 3.262828e-15 3.038482e-15 2.958770e-15 2.764798e-15 2.755243e-15
## [56] 2.653158e-15 2.517919e-15 2.500352e-15 2.368606e-15 2.300339e-15
## [61] 2.140398e-15 1.988257e-15 1.949903e-15 1.896890e-15 1.849789e-15
## [66] 1.841729e-15 1.805562e-15 1.708777e-15 1.688360e-15 1.680371e-15
## [71] 1.652432e-15 1.629878e-15 1.621577e-15 1.552497e-15 1.516870e-15
## [76] 1.465981e-15 1.426809e-15 1.394362e-15 1.284667e-15 1.267783e-15
## [81] 1.237374e-15 1.236469e-15 1.200628e-15 1.123359e-15 1.088557e-15
## [86] 1.065884e-15 1.041021e-15 1.027288e-15 1.018400e-15 9.978079e-16
## [91] 9.948959e-16 9.591708e-16 9.093739e-16 8.865428e-16 8.765591e-16
## [96] 8.532539e-16 7.617271e-16 7.527166e-16 6.805238e-16 6.460121e-16
## [101] 6.219553e-16 5.434901e-16 5.012187e-16 4.967110e-16 4.595967e-16
## [106] 4.496621e-16 4.476675e-16 4.443187e-16 3.417833e-16 3.270619e-16
## [111] 1.810898e-16 1.651887e-16 1.499935e-16 9.283906e-17 8.152539e-17
## [116] -1.944740e-17 -4.260418e-17 -5.525236e-17 -2.019966e-16 -2.637971e-16
## [121] -2.828092e-16 -2.909576e-16 -3.233568e-16 -3.400485e-16 -4.497722e-16
## [126] -4.713495e-16 -4.820456e-16 -4.824304e-16 -5.053585e-16 -5.121736e-16
## [131] -5.420321e-16 -5.447006e-16 -5.479336e-16 -6.268899e-16 -6.607863e-16
## [136] -7.236243e-16 -7.571433e-16 -8.113674e-16 -8.831856e-16 -8.911434e-16
## [141] -9.195455e-16 -9.241656e-16 -1.035144e-15 -1.071459e-15 -1.071727e-15
## [146] -1.092167e-15 -1.095923e-15 -1.105475e-15 -1.315736e-15 -1.350768e-15
## [151] -1.429542e-15 -1.444023e-15 -1.446644e-15 -1.549397e-15 -1.577194e-15
## [156] -1.577349e-15 -1.710482e-15 -1.726458e-15 -1.783444e-15 -1.798126e-15
## [161] -1.947489e-15 -1.987014e-15 -2.001301e-15 -2.050868e-15 -2.162041e-15
## [166] -2.188858e-15 -2.228734e-15 -2.332059e-15 -2.444970e-15 -2.504959e-15
## [171] -2.740499e-15 -2.744976e-15 -2.781443e-15 -2.957212e-15 -3.031298e-15
## [176] -3.100308e-15 -3.144482e-15 -3.166503e-15 -3.421981e-15 -3.429866e-15
## [181] -3.470370e-15 -3.500045e-15 -3.696943e-15 -3.784081e-15 -4.084763e-15
## [186] -9.505633e-15 -1.145797e-14 -1.693873e-14 -1.770825e-14 -1.928121e-14
## [191] -5.198379e-14 -8.898376e-14 -1.158095e-13 -1.719275e-13 -2.426788e-13
## [196] -6.048682e-13
##
## $x
## NULL
##
## $ac
## [1] 0
##
## $GOF
## [1] 0.4466574 0.4466574
Adding eig=TRUE lets us inspect the eigenvalues of the MDS solution. Only the first ten eigenvalues are meaningfully different from zero (the remaining ones are numerical noise), and each of them explains only a modest share of the total variance.
GOF (goodness of fit) is a measure of how well the reduced configuration reproduces the data. I obtained 0.4466574, which indicates that the model explains 44.67% of the variance after the reduction to 3 dimensions. The same value can be reproduced from the eigenvalues, as sketched below.
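The reported GOF can be recomputed from the eigenvalues returned by cmdscale (the two criteria it uses coincide here because the negative eigenvalues are numerically zero):
# cmdscale's two GOF criteria: sum of the first k eigenvalues divided by
# sum(abs(eig)) and by sum(pmax(eig, 0)), respectively
k <- 3
sum(mds1$eig[1:k]) / sum(abs(mds1$eig))
sum(mds1$eig[1:k]) / sum(pmax(mds1$eig, 0))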
Let's compare the visual output of the two methods:
mds1df<-as.data.frame(mds1$points)
pcadf<-as.data.frame(xxx.pca1$x[,1:3])
scatterplot3d(mds1df, xlab="dim1", ylab="dim2", zlab="dim3", main="MDS", pch=16, color="#BB3754FF", grid=FALSE)
scatterplot3d(pcadf, xlab="pc1", ylab="pc2", zlab="pc3", main="PCA", color="#ED6925FF", grid=FALSE, pch=2)
We can observe that the outputs of the two methods differ slightly in the location of the points, which is due to the different algorithms the methods use (and to the arbitrary orientation of the axes). We can also compare them in a single plot:
s3d <- scatterplot3d(mds1df, xlab="dim1", ylab="dim2", zlab="dim3", main="MDS and PCA Combined", pch=16, color="#BB3754FF", grid=FALSE)
s3d$points3d(xxx.pca1$x[,1], xxx.pca1$x[,2], xxx.pca1$x[,3], pch=2, col="#ED6925FF")
OBSERVATION: The two outputs differ mainly in the configuration of the outliers; the main "cloud" of points occupies the same region of the space.
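To quantify how close the two configurations are (an extra check, not part of the original script), the coordinates can be correlated dimension by dimension; since classical MDS on Euclidean distances of standardized data is essentially equivalent to PCA, high absolute correlations are expected (signs may flip):
# Correlation between MDS coordinates and PCA scores, dimension by dimension
diag(cor(mds1df, pcadf))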
For my dataset PCA and MDS lead to very similar results. After reducing to 3 dimensions:
PCA explained 44.67% of all variance.
MDS explained 44.67% of all variance.
This indicates that, in terms of explained variance, both methods give the same result on my dataset, which is expected given that classical MDS on Euclidean distances of standardized data is closely related to PCA. PCA offers deeper insight into the data: we can compare how the individual variables contribute to the components and identify patterns. MDS does not give us this level of insight; however, it handles non-linear data better.
In my case both methods produced results below expectations. This may be due to the missing values (replaced with 0) and the choice of weakly correlated variables.
PCA: script from classes
3D plots: https://plotly.com/r/3d-scatter-plots/ and debugging with ChatGPT
MDS: script from classes and https://rpubs.com/folwalsh/mdspca, https://fricas.org/~p-wyk4/mag2020/refs/MDS_KBazner.pdf, https://anastasiospanagiotelis.netlify.app/teaching/hdda2019/lectures/05mds/mds#87