PCA and MDS on World Development Indicators (2022)

For my additional homework I decided to perform PCA on the dataset from WorldBank - World Development Indicators (2022). The reason why I decided to focus on 2022 was due to the a lot of missing values in 2023 data set. The homework is structured as follows:

  1. Data loading and preprocesing
  2. PCA
  3. MDS
  4. Summary

Data preprocesing

summary(world)
##  Country Name        SP.ADO.TFRT      NV.AGR.TOTL.ZS     NE.EXP.GNFS.ZS   
##  Length:196         Min.   :  0.473   Min.   : 0.01149   Min.   :  1.571  
##  Class :character   1st Qu.:  8.537   1st Qu.: 2.09654   1st Qu.: 24.422  
##  Mode  :character   Median : 29.694   Median : 6.95629   Median : 41.029  
##                     Mean   : 41.187   Mean   :10.17519   Mean   : 47.853  
##                     3rd Qu.: 61.053   3rd Qu.:15.82598   3rd Qu.: 59.864  
##                     Max.   :163.019   Max.   :43.05999   Max.   :211.278  
##                                       NA's   :11         NA's   :22       
##  SP.DYN.TFRT.IN  BX.KLT.DINV.CD.WD    AG.LND.FRST.K2    NY.GDP.MKTP.CD     
##  Min.   :0.701   Min.   :-2.942e+11   Min.   :      0   Min.   :1.568e+08  
##  1st Qu.:1.530   1st Qu.: 7.508e+07   1st Qu.:   3096   1st Qu.:1.107e+10  
##  Median :2.011   Median : 8.764e+08   Median :  24317   Median :3.919e+10  
##  Mean   :2.489   Mean   : 1.011e+10   Mean   : 208111   Mean   :5.172e+11  
##  3rd Qu.:3.271   3rd Qu.: 8.046e+09   3rd Qu.: 122824   3rd Qu.:2.544e+11  
##  Max.   :6.749   Max.   : 4.090e+11   Max.   :8153116   Max.   :2.601e+13  
##  NA's   :3       NA's   :16           NA's   :5         NA's   :2          
##  NY.GDP.MKTP.KD.ZG NY.GNP.PCAP.CD   NY.GNP.PCAP.PP.CD NY.GNP.ATLS.CD     
##  Min.   :-28.759   Min.   :   230   Min.   :   890    Min.   :2.496e+08  
##  1st Qu.:  2.553   1st Qu.:  2400   1st Qu.:  6430    1st Qu.:1.010e+10  
##  Median :  4.185   Median :  6640   Median : 16950    Median :3.899e+10  
##  Mean   :  4.332   Mean   : 17376   Mean   : 27258    Mean   :5.243e+11  
##  3rd Qu.:  6.202   3rd Qu.: 22290   3rd Qu.: 41010    3rd Qu.:2.483e+11  
##  Max.   : 63.335   Max.   :132240   Max.   :124910    Max.   :2.582e+13  
##  NA's   :1         NA's   :3        NA's   :3         NA's   :3          
##  NY.GNP.MKTP.PP.CD   NE.GDI.TOTL.ZS   TX.VAL.TECH.MF.ZS  SH.IMM.MEAS   
##  Min.   :2.558e+08   Min.   : 1.225   Min.   : 0.000    Min.   :33.00  
##  1st Qu.:2.635e+10   1st Qu.:19.744   1st Qu.: 2.246    1st Qu.:76.25  
##  Median :8.805e+10   Median :24.201   Median : 6.824    Median :90.00  
##  Mean   :8.742e+11   Mean   :24.703   Mean   :11.439    Mean   :84.47  
##  3rd Qu.:4.940e+11   3rd Qu.:27.829   3rd Qu.:16.804    3rd Qu.:96.00  
##  Max.   :3.152e+13   Max.   :56.066   Max.   :72.637    Max.   :99.00  
##  NA's   :3           NA's   :27       NA's   :46        NA's   :10     
##  NE.IMP.GNFS.ZS    NV.IND.TOTL.ZS   NY.GDP.DEFL.KD.ZG SP.DYN.LE00.IN 
##  Min.   :  1.128   Min.   : 2.391   Min.   :-10.801   Min.   :53.00  
##  1st Qu.: 30.507   1st Qu.:18.028   1st Qu.:  4.719   1st Qu.:66.56  
##  Median : 48.133   Median :24.905   Median :  7.199   Median :73.43  
##  Mean   : 53.793   Mean   :27.028   Mean   : 13.501   Mean   :72.21  
##  3rd Qu.: 68.521   3rd Qu.:33.060   3rd Qu.: 14.046   3rd Qu.:77.96  
##  Max.   :190.394   Max.   :76.838   Max.   :266.987   Max.   :85.38  
##  NA's   :22        NA's   :10       NA's   :1         NA's   :4      
##  TG.VAL.TOTL.GD.ZS MS.MIL.XPND.GD.ZS  IT.CEL.SETS.P2    SH.DYN.MORT    
##  Min.   : 15.66    Min.   : 0.06918   Min.   : 42.07   Min.   :  1.50  
##  1st Qu.: 43.32    1st Qu.: 0.96722   1st Qu.: 98.89   1st Qu.:  6.05  
##  Median : 63.77    Median : 1.45800   Median :118.07   Median : 14.50  
##  Mean   : 73.70    Mean   : 1.97281   Mean   :118.16   Mean   : 25.01  
##  3rd Qu.: 85.56    3rd Qu.: 2.16442   3rd Qu.:132.97   3rd Qu.: 38.35  
##  Max.   :356.16    Max.   :33.54657   Max.   :291.91   Max.   :117.30  
##  NA's   :6         NA's   :50         NA's   :47       NA's   :9       
##   SM.POP.NETM       BX.TRF.PWKR.CD.DT    EN.POP.DNST        SP.POP.GROW    
##  Min.   :-5699445   Min.   :0.000e+00   Min.   :   2.205   Min.   :-7.619  
##  1st Qu.:   -8675   1st Qu.:1.482e+08   1st Qu.:  38.994   1st Qu.: 0.191  
##  Median :    -112   Median :6.897e+08   Median :  89.101   Median : 1.062  
##  Mean   :     718   Mean   :4.087e+09   Mean   : 219.795   Mean   : 1.062  
##  3rd Qu.:   26832   3rd Qu.:3.699e+09   3rd Qu.: 214.583   3rd Qu.: 2.013  
##  Max.   : 1319009   Max.   :1.112e+11   Max.   :7851.006   Max.   : 5.907  
##                     NA's   :3           NA's   :5          NA's   :1       
##   SP.POP.TOTL        SE.PRM.CMPT.ZS   GC.REV.XGRT.GD.ZS  SE.PRM.ENRR    
##  Min.   :1.180e+04   Min.   : 52.99   Min.   : 3.268    Min.   : 67.23  
##  1st Qu.:1.707e+06   1st Qu.: 87.74   1st Qu.:17.894    1st Qu.: 94.66  
##  Median :7.453e+06   Median : 95.76   Median :27.098    Median : 99.71  
##  Mean   :4.007e+07   Mean   : 91.92   Mean   :27.210    Mean   :100.34  
##  3rd Qu.:2.989e+07   3rd Qu.: 99.90   3rd Qu.:35.560    3rd Qu.:104.08  
##  Max.   :1.425e+09   Max.   :120.12   Max.   :65.932    Max.   :156.80  
##                      NA's   :72       NA's   :94        NA's   :46      
##   SE.SEC.ENRR     AG.SRF.TOTL.K2     GC.TAX.TOTL.GD.ZS ER.PTD.TOTL.ZS 
##  Min.   : 33.72   Min.   :      20   Min.   : 0.5785   Min.   : 0.00  
##  1st Qu.: 86.41   1st Qu.:   20760   1st Qu.:12.1591   1st Qu.: 2.93  
##  Median : 97.12   Median :  114760   Median :17.2955   Median :10.35  
##  Mean   : 94.30   Mean   :  718785   Mean   :17.4935   Mean   :13.23  
##  3rd Qu.:106.01   3rd Qu.:  520990   3rd Qu.:22.2794   3rd Qu.:18.69  
##  Max.   :144.85   Max.   :17098250   Max.   :31.2733   Max.   :99.96  
##  NA's   :70       NA's   :5          NA's   :94        NA's   :3      
##  DT.TDS.DECT.EX.ZS  SP.URB.GROW      EG.ELC.ACCS.ZS    FP.CPI.TOTL     
##  Min.   : 0.1694   Min.   :-7.3874   Min.   : 10.30   Min.   :  102.2  
##  1st Qu.: 7.1991   1st Qu.: 0.6367   1st Qu.: 85.30   1st Qu.:  126.3  
##  Median :11.3586   Median : 1.6478   Median :100.00   Median :  141.9  
##  Mean   :14.0297   Mean   : 1.7315   Mean   : 87.03   Mean   :  487.3  
##  3rd Qu.:18.6198   3rd Qu.: 2.9046   3rd Qu.:100.00   3rd Qu.:  195.6  
##  Max.   :60.2780   Max.   : 5.9473   Max.   :100.00   Max.   :38796.6  
##  NA's   :82        NA's   :2         NA's   :1        NA's   :23       
##   FR.INR.DPST      GC.XPN.TOTL.GD.ZS NE.CON.TOTL.ZS   NY.GDP.PCAP.KD.ZG 
##  Min.   :-0.1523   Min.   : 3.875    Min.   : 32.37   Min.   :-21.6953  
##  1st Qu.: 1.2523   1st Qu.:19.630    1st Qu.: 69.43   1st Qu.:  0.9803  
##  Median : 2.5579   Median :28.240    Median : 79.03   Median :  2.5948  
##  Mean   : 4.7563   Mean   :29.018    Mean   : 80.31   Mean   :  3.1467  
##  3rd Qu.: 6.1671   3rd Qu.:35.875    3rd Qu.: 90.96   3rd Qu.:  5.2230  
##  Max.   :52.4167   Max.   :81.642    Max.   :162.39   Max.   : 62.1110  
##  NA's   :98        NA's   :94        NA's   :26       NA's   :1         
##  BX.GSR.MRCH.CD      BM.GSR.MRCH.CD      SE.XPD.TOTL.GD.ZS
##  Min.   :2.098e+06   Min.   :1.021e+08   Min.   : 0.3485  
##  1st Qu.:2.112e+09   1st Qu.:4.517e+09   1st Qu.: 3.0358  
##  Median :1.311e+10   Median :1.522e+10   Median : 3.9541  
##  Mean   :1.333e+11   Mean   :1.332e+11   Mean   : 4.2242  
##  3rd Qu.:8.563e+10   3rd Qu.:8.145e+10   3rd Qu.: 5.0780  
##  Max.   :3.347e+12   Max.   :3.270e+12   Max.   :10.7034  
##  NA's   :23          NA's   :23          NA's   :64

In the data set we can observe many missing values. As discussed during classes when we dealing with missing data it is better to leave it in order to not impact the real meaning of the data. In my case we will be calculating correlations and other numerical values, that’s why I decided to enter 0, to places where missing value occurs.

world[is.na(world)] <- 0
world[!complete.cases(world),]
## # A tibble: 0 × 47
## # ℹ 47 variables: Country Name <chr>, SP.ADO.TFRT <dbl>, NV.AGR.TOTL.ZS <dbl>,
## #   NE.EXP.GNFS.ZS <dbl>, SP.DYN.TFRT.IN <dbl>, BX.KLT.DINV.CD.WD <dbl>,
## #   AG.LND.FRST.K2 <dbl>, NY.GDP.MKTP.CD <dbl>, NY.GDP.MKTP.KD.ZG <dbl>,
## #   NY.GNP.PCAP.CD <dbl>, NY.GNP.PCAP.PP.CD <dbl>, NY.GNP.ATLS.CD <dbl>,
## #   NY.GNP.MKTP.PP.CD <dbl>, NE.GDI.TOTL.ZS <dbl>, TX.VAL.TECH.MF.ZS <dbl>,
## #   SH.IMM.MEAS <dbl>, NE.IMP.GNFS.ZS <dbl>, NV.IND.TOTL.ZS <dbl>,
## #   NY.GDP.DEFL.KD.ZG <dbl>, SP.DYN.LE00.IN <dbl>, TG.VAL.TOTL.GD.ZS <dbl>, …

Now we have full data and we can observe correlations. My data set contains different variables - macroeconomics ones and more demographic ones. I would like to focus on macroeconomic data. First look at the data:

world_matrix <- data.matrix(world[2:47])
correlation_matrix<-cor(na.omit(world_matrix))
corrplot(correlation_matrix, method="circle", tl.col="black", tl.cex=0.2,col=inferno(10))+theme_classic()

## NULL

We need to make choice which 10 variables we will take into consideration while performing dimension reduction. My choices:

  • Agriculture, forestry, and fishing, value added (% of GDP) (NV.AGR.TOTL.ZS)

  • High-technology exports (% of manufactured exports) (TX.VAL.TECH.MF.ZS)

  • Imports of goods and services (% of GDP) (NE.IMP.GNFS.ZS)

  • Military expenditure (% of GDP) (MS.MIL.XPND.GD.ZS)

  • Population growth (annual %) (SP.POP.GROW)

  • Tax revenue (% of GDP) (GC.TAX.TOTL.GD.ZS)

  • Consumer price index (2010 = 100) (FP.CPI.TOTL)

  • Deposit interest rate (%) (FR.INR.DPST)

  • GDP per capita growth (annual %) (NY.GDP.PCAP.KD.ZG)

  • Government expenditure on education, total (% of GDP) (SE.XPD.TOTL.GD.ZS)

I mostly focused on the macroeconomics factors.

Renaming columns and chosen variables overview:

world_macro<-world[,c("Country Name", "NY.GDP.PCAP.KD.ZG", "FR.INR.DPST", "FP.CPI.TOTL", "NE.IMP.GNFS.ZS", "TX.VAL.TECH.MF.ZS", "GC.TAX.TOTL.GD.ZS", "SP.POP.GROW","MS.MIL.XPND.GD.ZS", "SE.XPD.TOTL.GD.ZS", "NV.AGR.TOTL.ZS")]

colnames(world_macro)<-c("country", "GDP_percapita_growth", "deposit_rate", "CPI", "import_GS", "high_tech_export","tax_revenue", "pop_growth_rate", "military_exp", "goverm_exp_educ", "agriculture_to_GDP")

Correlation plot - correlation values are below

world_macro_matrix <- data.matrix(world_macro[2:11])
correlation_matrix_macro<-cor(world_macro_matrix)
corrplot(correlation_matrix_macro, method="color", tl.col="black", addCoef.col="black", number.cex=0.5,tl.cex=0.5,col=inferno(10))+theme_classic()

## NULL

OBSERVATION: Correlation is lower than 0.4 for all variables combinations.

PCA

In this step I will proceed with the dimension reduction using PCA. The homework task was to reduce dimensions from 10 to 3. Firstly in the PCA we need to standardize our data and rescale variance to 1.

preproc_world_macro <- preProcess(world_macro[2:11], method=c("center", "scale"))
world_macro.s <- predict(preproc_world_macro, world_macro[2:11])
summary(world_macro.s)
##  GDP_percapita_growth  deposit_rate          CPI             import_GS      
##  Min.   :-3.82067     Min.   :-0.4544   Min.   :-0.14994   Min.   :-1.4121  
##  1st Qu.:-0.34613     1st Qu.:-0.4270   1st Qu.:-0.10752   1st Qu.:-0.6315  
##  Median :-0.08321     Median :-0.4270   Median :-0.10293   Median :-0.1573  
##  Mean   : 0.00000     Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.32017     3rd Qu.: 0.0309   3rd Qu.:-0.08639   3rd Qu.: 0.4756  
##  Max.   : 9.07703     Max.   : 8.9851   Max.   :13.37386   Max.   : 4.2178  
##  high_tech_export   tax_revenue      pop_growth_rate      military_exp    
##  Min.   :-0.6921   Min.   :-0.9226   Min.   :-5.590937   Min.   :-0.5522  
##  1st Qu.:-0.6921   1st Qu.:-0.9226   1st Qu.:-0.564272   1st Qu.:-0.5522  
##  Median :-0.4067   Median :-0.2134   Median :-0.008859   Median :-0.1267  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.000000   Mean   : 0.0000  
##  3rd Qu.: 0.2710   3rd Qu.: 0.8720   3rd Qu.: 0.615361   3rd Qu.: 0.1135  
##  Max.   : 5.0503   Max.   : 2.2468   Max.   : 3.125569   Max.   :12.0523  
##  goverm_exp_educ   agriculture_to_GDP
##  Min.   :-1.1584   Min.   :-0.9554   
##  1st Qu.:-1.1584   1st Qu.:-0.7859   
##  Median : 0.0850   Median :-0.3588   
##  Mean   : 0.0000   Mean   : 0.0000   
##  3rd Qu.: 0.6972   3rd Qu.: 0.5412   
##  Max.   : 3.1999   Max.   : 3.3282

Eigenvalues

Eigenvalues represents the variance explained by the components in the dataset. The highest value, the higher part of variance explained by component. There is a Kaiser criterion used i PCA which states that we should drop the components for which eignevalues are less than 1 when data is standardized. Here according to eigenvalues and criterion we should narrow aour analysis to 4 pca.

world_macro.cov<-cov(world_macro.s)
world_macro.eigen<-eigen(world_macro.cov)
world_macro.eigen$values
##  [1] 1.9528671 1.3322583 1.1814486 1.1245120 0.9854871 0.8038083 0.7758225
##  [8] 0.6892791 0.5946023 0.5599148

Loadings analysis

Loadings are interpret as a importance of the variable in the factor. The higher is the loading the higher its importance in the vector. When the correlation between factor and variable is possitive the loading is positive as well. In case of the negative correlation the loading may be negative.

Let’s investigate loadings in our case. Since we will be restricting our data to 3 PCA, we will investigate only firts three PCA’s.

xxx<-world_macro.s
xxx.pca1<-prcomp(xxx, center=FALSE, scale.=FALSE)
xxx.pca1
## Standard deviations (1, .., p=10):
##  [1] 1.3974502 1.1542349 1.0869446 1.0604301 0.9927170 0.8965535 0.8808078
##  [8] 0.8302283 0.7711046 0.7482745
## 
## Rotation (n x k) = (10 x 10):
##                              PC1          PC2         PC3          PC4
## GDP_percapita_growth -0.05297794 -0.417660491 -0.45790197  0.513456726
## deposit_rate         -0.08014343  0.335681118 -0.48911229  0.079387361
## CPI                  -0.14344063 -0.009978956 -0.32253795 -0.667318829
## import_GS             0.44887220 -0.163928225  0.30250321 -0.096368023
## high_tech_export      0.41558856 -0.258401027  0.14304111  0.002494964
## tax_revenue           0.45988827  0.146981271  0.02785852  0.002884607
## pop_growth_rate      -0.37208985 -0.261376369  0.37788608 -0.158567643
## military_exp          0.17346435  0.630276192  0.02369625 -0.077864738
## goverm_exp_educ      -0.06673912  0.332107090  0.25097181  0.488593615
## agriculture_to_GDP   -0.45989453  0.147208932  0.35773304  0.074254571
##                              PC5         PC6         PC7         PC8
## GDP_percapita_growth -0.05816475 -0.11086800 -0.24578407  0.05793824
## deposit_rate          0.55121836  0.44100541  0.12194070 -0.29487298
## CPI                   0.28095995 -0.48502737 -0.03814317  0.02878682
## import_GS             0.14028527  0.09548680  0.52490129 -0.15111392
## high_tech_export      0.26903242 -0.14854501 -0.49388781 -0.59596964
## tax_revenue           0.30190214  0.27136745 -0.31533726  0.60075397
## pop_growth_rate       0.35188194  0.15572018 -0.32103608  0.23905630
## military_exp         -0.34135123 -0.09130488 -0.39878740 -0.11204596
## goverm_exp_educ       0.43530397 -0.60074159  0.14821670  0.10134545
## agriculture_to_GDP    0.02832451  0.24427753 -0.14254082 -0.29995121
##                              PC9         PC10
## GDP_percapita_growth  0.35470409 -0.379716385
## deposit_rate         -0.13812997 -0.127023962
## CPI                   0.33420777 -0.042562880
## import_GS             0.31813404 -0.492425179
## high_tech_export     -0.10943793  0.185801965
## tax_revenue           0.31310302  0.207251640
## pop_growth_rate      -0.30662308 -0.472218695
## military_exp          0.01315391 -0.519058435
## goverm_exp_educ      -0.03028308  0.002010686
## agriculture_to_GDP    0.66112257  0.159669705

OBSERVATION:

For the first PCA the highest postive loading are for: import good’s and services, high technology export and tax revenue. The most important negative loading are: contribution of the agriculture to GDP and population growth rate. Based on that we can assume that the first PCA is positively related to the technology and economic and negative with agriculture and population growth.

For the PCA2 the highest loadings we can observe for military expenditure as % of GDP, deposit rate and expenditures for education (as % of GDP). That indicates that the second PCA is positively correlated with those variables and negatively with growth GDP per capita and population growth rate.

For the thrid PCA the major positive correlation we can observe for population growth rate and agriculture contribution to GDP and the highest negative to deposit rate and growth of GDP per capita.

Relationship between variables

fviz_pca_var(xxx.pca1, col.var=paletteInferno[3])

OBSERVATIONS: This graph presents contribution of the varaibles to dimensions and relationship between variables. If the angle bewteen vectors of variables is close to 0, the variables are highly correlated - for example import of good and services and export of high technologies.

The example of the opposite relation - CPI and military expedinture variables are close to 90 degrees angle which suggest that their ortogonal to each other - no correlation.

We can also analyze the contribution of the individuals to component on graph:

var<-get_pca_var(xxx.pca1)
a<-fviz_contrib(xxx.pca1, "var", axes=1, xtickslab.rt=90, fill = paletteInferno[2], color = paletteInferno[2], cex.lab=0.4)+theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
a

b<-fviz_contrib(xxx.pca1, "var", axes=2, xtickslab.rt=90, fill = paletteInferno[4], color = paletteInferno[4])+theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
b

c<-fviz_contrib(xxx.pca1, "var", axes=3, xtickslab.rt=90, fill = paletteInferno[6], color = paletteInferno[6])+theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
c

Importance of components analysis

Now we will investigate the importance of the components:

summary(xxx.pca1)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.3975 1.1542 1.0869 1.0604 0.99272 0.89655 0.88081
## Proportion of Variance 0.1953 0.1332 0.1181 0.1124 0.09855 0.08038 0.07758
## Cumulative Proportion  0.1953 0.3285 0.4467 0.5591 0.65766 0.73804 0.81562
##                            PC8     PC9    PC10
## Standard deviation     0.83023 0.77110 0.74827
## Proportion of Variance 0.06893 0.05946 0.05599
## Cumulative Proportion  0.88455 0.94401 1.00000

OBSERVATIONS: The first three components explained 44.67% of the variance - if I wouldn’t have the restriction to narrow my analysis to three components I will keep it at seven or eight component.

We can also visualise that to better visual presentation:

eigenvalue <- fviz_eig(xxx.pca1, choice='eigenvalue', barcolor = paletteInferno[4], barfill = paletteInferno[4], main="Eigenvalues scree plot")+
  theme_classic()

var_perc<-fviz_eig(xxx.pca1, barcolor = paletteInferno[3], barfill = paletteInferno[3], main = "Percentage of explained variance scree plot")+theme_classic() 

grid.arrange(eigenvalue, var_perc, nrow=2)

We can observe that when we narrows our analysis to 3 PCA quality of our dimension reduction process will be worse than if we will restrict it to 7 dimension. However restricting the 10 dimension to 7 isn’t satisfactory as well.

Rotation - meaning of the axes

In the last step I will investigate the meaning of each component. In order to do that we need to rotate our PCA once again to achieve linear structure in order to easier interprets them. We will restrict our analysis to 3 factors as was stated in the homework description.

xxx.pca.rotated.analysis<-principal(xxx, nfactors=3, rotate="varimax")
xxx.pca.rotated.analysis
## Principal Components Analysis
## Call: principal(r = xxx, nfactors = 3, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                        RC1   RC3   RC2   h2   u2 com
## GDP_percapita_growth  0.03 -0.18 -0.67 0.49 0.51 1.2
## deposit_rate          0.31 -0.59  0.05 0.45 0.55 1.5
## CPI                   0.01 -0.35 -0.19 0.16 0.84 1.5
## import_GS             0.27  0.68  0.01 0.54 0.46 1.3
## high_tech_export      0.29  0.58 -0.18 0.45 0.55 1.7
## tax_revenue           0.55  0.34  0.15 0.44 0.56 1.8
## pop_growth_rate      -0.72  0.09 -0.03 0.53 0.47 1.0
## military_exp          0.42 -0.14  0.63 0.59 0.41 1.9
## goverm_exp_educ      -0.08 -0.03  0.47 0.23 0.77 1.1
## agriculture_to_GDP   -0.65 -0.19  0.36 0.59 0.41 1.8
## 
##                        RC1  RC3  RC2
## SS loadings           1.69 1.49 1.29
## Proportion Var        0.17 0.15 0.13
## Cumulative Var        0.17 0.32 0.45
## Proportion Explained  0.38 0.33 0.29
## Cumulative Proportion 0.38 0.71 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.12 
##  with the empirical chi square  270.17  with prob <  6.3e-47 
## 
## Fit based upon off diagonal values = 0.14

KEY OBSERVATIONS:

  • RC1 - it seems to be a component which represent social and agriculture influences to world development. It’s dominated by variables: taxes revenue, population growth rate and contribution of agriculture to GDP

  • RC2 - component with the higher contribution of import, export and deposit rate. It’s related to the trade activities of economies.

  • RC3 - it’s related of government spending - for military and education.

Additionally I took a look into the other statistic presents in the summary:

complexity (com) - it explains how many variables we need to explain a variable. Higher means that we need more variables to explain the meaning of the factor. Here for example the most complex variable is military expenditure. Overall - high variables are undesirable. We can set a benchmark that the high uniqueness is 0.7.

plot(xxx.pca.rotated.analysis$complexity, pch=16, xlim=c(-60, 60), main="Complexity of factors", xlab=" ", ylab="complexity", col=paletteInferno[5])
text(xxx.pca.rotated.analysis$complexity, labels=names(xxx.pca.rotated.analysis$complexity), cex=1)

uniqueness - it’s the proportion of variance that is unique for givenvariable (it’s not shared with others variables). We want it to be low in PCA because then we will be able to reducce the space to a smallernumber of dimensions.

We can visualise both characteristcs on the graph with the benchmark points. We want uniqueness below 0.7 and complexity below 2:

plot(xxx.pca.rotated.analysis$complexity, xxx.pca.rotated.analysis$uniqueness, xlim=c(0, 3), xlab="complexity", ylab="uniqueness", main="Complexity and uniqueness graph",cex=0.8)
text(xxx.pca.rotated.analysis$complexity, xxx.pca.rotated.analysis$uniqueness, labels=names(xxx.pca.rotated.analysis$uniqueness), cex=0.8)
abline(h=c(0.38, 0.7), lty=3, col=2)
abline(v=c(2), lty=3, col=2)

Summary PCA

The PCA isn’t the best tool to reduce the dimension in this case. Reduction to 3 dimensions explain only 45% of total variance, which isn’t a desired results. Additionaly some variables in the data set show high uniqueness values which isn’t good in case of PCA.

MDS

MDS is an algorithm that perseve pairwise distance (PCA was finding the directions). MDS also doesn’t assume relationship between variables so it can easily deal withe nonlinear linkage.

Performing MDS

Firstly we calculate distance matrix and then we proceed to multidimensional scaling to 3 dimensions. MDS goals is to minimize the stress function, which measures the discrepancy between the original distances and the distances in the reduced space. In order to receive comparable redults we need to scale dataset.

world_macro.s
## # A tibble: 196 × 10
##    GDP_percapita_growth deposit_rate     CPI import_GS high_tech_export
##                   <dbl>        <dbl>   <dbl>     <dbl>            <dbl>
##  1               0.295         8.99  -0.150     -0.960          -0.316 
##  2              -0.0318       -0.427 -0.104     -0.812           1.35  
##  3              -0.0745        1.73  -0.0787    -0.845           0.0282
##  4              -0.0256       -0.158 -0.104     -0.893           1.50  
##  5              -0.138        -0.427 -0.109     -0.250           1.12  
##  6              -0.383        -0.427 -0.107     -0.132           0.690 
##  7               0.464        -0.427 -0.0784    -0.621           0.311 
##  8               0.215         0.149 -0.0931    -0.792          -0.0356
##  9               0.268        -0.427 -0.107     -0.320           0.0320
## 10              -0.266        -0.427 -0.112     -0.664           0.751 
## # ℹ 186 more rows
## # ℹ 5 more variables: tax_revenue <dbl>, pop_growth_rate <dbl>,
## #   military_exp <dbl>, goverm_exp_educ <dbl>, agriculture_to_GDP <dbl>
dist.world<-dist(world_macro.s, method = "euclidean")
mds1<-cmdscale(dist.world, k=3, eig = TRUE)
mds1
## $points
##                [,1]          [,2]         [,3]
##   [1,]  0.966138620  -3.277061351 -4.932023152
##   [2,] -1.153377850  -0.154769877  0.296460439
##   [3,] -0.164067520  -0.404669391 -1.621746016
##   [4,] -0.524443722  -0.031884810 -0.156865144
##   [5,] -1.769921636   0.371019871 -0.307791446
##   [6,] -0.808743033  -0.031044004  0.267787553
##   [7,]  0.470129414  -0.196447978  0.196009172
##   [8,]  0.313392698   0.218450425 -0.564395062
##   [9,] -1.368284753   0.216826321 -0.797976060
##  [10,] -0.370807534   0.164845164 -0.386616242
##  [11,] -2.169673381   0.351689068 -0.533631740
##  [12,] -0.927463818   0.760616592 -0.420608683
##  [13,] -2.001555927   0.225032169  0.644540273
##  [14,] -0.092016565  -1.922091947 -0.854759915
##  [15,]  0.634245400  -0.507329923  0.266419166
##  [16,] -0.710966648   0.081302703 -0.126753485
##  [17,] -1.331437688   0.553645012  0.564424641
##  [18,]  0.130580585  -1.510613142 -2.513739926
##  [19,] -1.708984088  -0.098637910  0.276443581
##  [20,] -0.963625462   0.057734824 -0.618840085
##  [21,]  1.691117525  -0.113231215  1.593497361
##  [22,]  0.481510290  -0.258108999 -0.457948907
##  [23,]  1.053640808  -1.180293630  0.184829782
##  [24,]  0.717047346   1.050108189 -0.669521761
##  [25,]  1.024924971   0.197872100  0.309865632
##  [26,]  0.172128114   0.714953344 -0.623240989
##  [27,] -1.287775806  -0.455353111 -1.130445814
##  [28,] -0.764665286   1.323652406 -1.007740713
##  [29,] -1.535911931   0.166017176  0.284245868
##  [30,] -0.432580075  -1.447442249 -1.261674044
##  [31,] -0.489679760   0.629466479 -0.958925153
##  [32,] -0.145242957   0.212851260 -0.105259702
##  [33,]  1.184324561   0.024130715 -0.892005773
##  [34,]  0.395308705   1.231432473 -1.542086836
##  [35,] -0.343815712  -1.449102791 -0.403207397
##  [36,] -2.318626260   1.023216998  0.188432434
##  [37,]  0.824118174   0.359588444 -0.025513837
##  [38,]  1.992617497   0.267753453  0.995545102
##  [39,] -0.003438902   0.904723718 -0.810328940
##  [40,]  0.925545191  -0.346445503  0.415253991
##  [41,]  0.974704267  -0.756068731  0.445867414
##  [42,] -1.362676231   0.328901282 -0.893841681
##  [43,] -0.564323441   0.033843931 -0.493117270
##  [44,] -0.171952194   0.071106918 -0.021285356
##  [45,] -2.367444634   0.082836410 -1.441838433
##  [46,]  0.618402863  -1.090892473  1.216673273
##  [47,]  2.110244868  -0.811355668  1.434706180
##  [48,]  0.306673018   0.897252482 -0.778933465
##  [49,]  0.473451178   0.096206145  0.572984542
##  [50,]  1.710472552   0.157142581  0.636035086
##  [51,] -0.428551069   0.199784852  0.242792118
##  [52,]  0.480389521   0.068991817  1.118437849
##  [53,]  1.814343262  -0.277414272  1.364302961
##  [54,] -0.845468862  -0.465270169 -1.141658933
##  [55,] -0.125246168  -0.355453512 -1.184761161
##  [56,]  2.356392706  -0.087925940  1.127276221
##  [57,]  1.067220862   0.531513367  0.837529364
##  [58,]  0.893632719   0.006344651  0.617533134
##  [59,] -0.952144919   0.880193266 -0.654874259
##  [60,]  0.931388711   0.133757719  0.630535243
##  [61,] -1.747110971   0.449823362 -0.771899868
##  [62,] -0.136388724   0.642348573 -0.814458802
##  [63,]  0.296125039   0.645627184 -1.955573759
##  [64,] -2.166286166   0.820391874  0.221798957
##  [65,] -1.246167041   0.874919781  0.188097007
##  [66,] -1.880385394  -0.348650890  0.545191289
##  [67,] -1.336456383   1.609838626  0.819594788
##  [68,]  0.516697034   0.081592502 -0.210903305
##  [69,]  0.013075754  -0.349346601 -0.615247895
##  [70,] -0.060731366  -0.421117021 -0.508308280
##  [71,]  1.156027923   0.191313866 -1.115986414
##  [72,] -1.064246779  -0.119610585  0.100092947
##  [73,]  0.557102173   0.572848599 -0.051943565
##  [74,]  1.010676071   0.656645442 -0.549951656
##  [75,] -1.842869692   0.460563697  0.574660062
##  [76,]  1.260436807  -0.958794907  0.004449671
##  [77,]  2.479355428  -0.210918209  1.347375918
##  [78,]  0.932074269   0.736464936  0.378411127
##  [79,]  0.452244015   0.900400356 -0.913125964
##  [80,] -1.280449374   0.156146957 -0.313959571
##  [81,] -0.265278150   1.097111568 -0.699801028
##  [82,]  0.954459633   0.298283304  0.227012557
##  [83,]  1.741862530   0.029888616  0.381966446
##  [84,] -0.803432631  -0.544739159 -1.467997690
##  [85,]  1.102093930  -0.654632751 -0.378938102
##  [86,] -2.259812262  -0.156267184 -0.862724734
##  [87,]  0.936622532   0.249720180 -0.948786133
##  [88,]  0.304126981  -0.042701868 -0.123629535
##  [89,]  1.787833700  -0.058133085  1.153754500
##  [90,]  2.206714087   0.152211423  0.928547939
##  [91,]  1.712280111   4.154732728 -4.890211425
##  [92,]  1.471256898  -0.153739928  0.294652894
##  [93,]  0.741495466  -0.263213005  0.405799157
##  [94,] -4.452438869   1.927168487  1.886772272
##  [95,] -2.314021568   0.163921873 -1.019592153
##  [96,] -1.109801942   0.729552642  0.906259026
##  [97,]  1.142631516  -0.361190471 -0.018033684
##  [98,]  0.832980485   0.781331522 -0.509500723
##  [99,] -2.285129223   2.274140030  0.703517259
## [100,] -1.459750013  -0.647679578  0.539578919
## [101,]  1.095223789  -0.665469424 -0.843705694
## [102,] -0.620568580  -1.100005540  0.175438098
## [103,] -0.682047716   0.526816412  0.516460574
## [104,]  1.168587838  -0.782953600  0.113944116
## [105,]  0.094259935   0.521946298  1.124034431
## [106,] -0.284915399   0.505073988 -0.703305236
## [107,]  1.325934137  -0.643319207  1.077980135
## [108,] -0.708384861  -0.100800978  0.911343816
## [109,]  1.103103239   0.085253038 -0.030993366
## [110,] -2.143946126   0.282934781 -0.006187563
## [111,] -0.730623744   1.055076900 -0.046098646
## [112,] -1.449901674  -0.764114492  0.856104173
## [113,]  2.442940365  -0.037927401  0.968087596
## [114,]  0.046523131   0.254605058  0.589468324
## [115,] -1.703066051   0.364535155  0.335996390
## [116,] -2.591770440   0.449229463  1.767152577
## [117,] -1.458037895  -1.683415353  1.564334860
## [118,]  1.514040417  -0.786557750 -0.172793736
## [119,]  1.738588495  -0.018190399  0.748228989
## [120,] -1.966067014   1.456128387  0.595730100
## [121,]  0.283863970   0.935571510 -0.384434025
## [122,]  2.302507102  -0.634771107  1.787198875
## [123,] -2.644004370   1.772776808  1.063053557
## [124,] -0.021833109  -0.988907919  0.322358444
## [125,]  1.262690014   0.297737829  1.084485514
## [126,] -0.890317877   0.147245304 -0.559954536
## [127,]  0.872868711   0.337827098  0.571117609
## [128,] -1.263404197  -1.725128517 -0.662280900
## [129,] -0.304867915   0.910425947  0.987341360
## [130,] -0.506392303   0.782904721 -0.584449643
## [131,] -0.694900120  -1.436353519  0.627719462
## [132,]  0.378243714  -0.949440553  1.357826022
## [133,]  1.520100101  -0.304168541 -0.191494567
## [134,] -0.241893770  -1.503656943  1.108917596
## [135,] -0.465971199   0.118586204  0.848277103
## [136,]  1.234857198  -0.041052864  0.368479374
## [137,] -1.433456105  -0.660074620 -0.366665247
## [138,] -0.065867270  -0.168304236  0.676695768
## [139,]  2.580173838   0.283765893  1.481289711
## [140,]  2.181601928  -0.068505251 -0.205481106
## [141,] -0.575935362   0.492270276 -0.219751618
## [142,] -1.686251688  -0.199676085  0.083337316
## [143,]  0.898548161   0.007874487  0.928785725
## [144,]  1.604731873  -0.231638707  0.447544376
## [145,] -0.187046847   0.092407109  0.201684391
## [146,]  0.574990005   0.909766211 -0.496222248
## [147,]  1.148114811   0.357550964 -0.177694710
## [148,]  0.051583079  -0.070422995  0.590557589
## [149,]  0.740274586  -0.320114438 -0.344836392
## [150,] -2.150056829   1.421364868  0.618110224
## [151,] -1.450192792   0.311878164 -0.664986457
## [152,] -1.172158910   0.585011170 -0.573009525
## [153,] -0.263518459   0.670240719 -1.210339767
## [154,]  1.080164357  -0.357385361  0.730018496
## [155,] -1.146861969   0.104812545 -0.982837145
## [156,]  1.785516326  -0.421765070  0.235370296
## [157,]  0.611777499  -0.553183040  0.769934957
## [158,] -2.706874666   0.796572692 -0.106950015
## [159,]  1.693998628  -0.367114887  0.451265139
## [160,]  0.353516891  -0.534532007  1.280452258
## [161,] -1.836981590  -0.791647943 -0.733265300
## [162,] -1.380674016  -0.067853160  0.944694675
## [163,]  2.607108132  -1.048326020  1.171777161
## [164,] -3.100222218   1.650679646  2.055936626
## [165,]  1.097061573   0.559340553 -0.805590955
## [166,] -2.068832034   0.309049739  0.059931176
## [167,] -1.689957464   0.599667121 -0.076468119
## [168,]  1.302544683  -0.962815484  2.141074423
## [169,]  0.342891999   1.203592554  0.436917363
## [170,] -0.551786122  -0.845629968  0.017971736
## [171,]  0.189661460  -0.688163547  0.241572097
## [172,]  0.837204011   0.394589501 -1.534685235
## [173,]  0.982938587   1.090540091 -2.052901475
## [174,]  0.768460961  -0.297165718 -0.916993624
## [175,]  3.364383541   0.656270469 -4.243097237
## [176,]  1.013752057  -0.172502804 -0.888923130
## [177,] -1.815215369   0.394317736 -0.092738656
## [178,]  3.059991731   0.313953385  1.945386148
## [179,]  1.100615596  -0.143580273  1.036609809
## [180,]  1.581637319  -0.140060993  1.041506358
## [181,] -1.429401564   0.364766599  0.321019959
## [182,] -0.397682287  -1.364130677  1.638014753
## [183,]  0.591228751  -1.117877037  0.851160293
## [184,]  0.523653187  -0.613589250  0.592395395
## [185,]  0.680108344  -0.219720537 -0.931133970
## [186,]  0.214821015  -0.623522574  0.736008943
## [187,]  1.484021001  -0.299494839  0.999211395
## [188,] -4.536310269 -11.194916328 -0.560692995
## [189,]  0.404044678   1.942108450  0.988678360
## [190,]  0.338334662  -0.564139694 -0.866233206
## [191,]  1.350662386  -1.090206860 -0.678230615
## [192,]  0.710872882  -0.479937974  1.153298011
## [193,] -1.077700912   1.471104282  0.343297150
## [194,]  0.563319911   1.123883457  0.279318138
## [195,]  1.040387559   0.323564399  0.248696045
## [196,]  1.850280508  -1.123534108 -4.556604141
## 
## $eig
##   [1]  3.808091e+02  2.597904e+02  2.303825e+02  2.192798e+02  1.921700e+02
##   [6]  1.567426e+02  1.512854e+02  1.344094e+02  1.159475e+02  1.091834e+02
##  [11]  6.880946e-13  2.668232e-13  2.361772e-13  1.462996e-13  1.345763e-13
##  [16]  8.407673e-14  3.460965e-14  2.584081e-14  2.256230e-14  2.220531e-14
##  [21]  1.795908e-14  1.508267e-14  1.384050e-14  1.258982e-14  1.079169e-14
##  [26]  9.990249e-15  9.453339e-15  9.076563e-15  8.850293e-15  8.617568e-15
##  [31]  8.081677e-15  7.969679e-15  7.482417e-15  6.434706e-15  6.147151e-15
##  [36]  5.586305e-15  5.154366e-15  4.860911e-15  4.859449e-15  4.699985e-15
##  [41]  4.570646e-15  4.474876e-15  4.461016e-15  3.994779e-15  3.944326e-15
##  [46]  3.571916e-15  3.557447e-15  3.478092e-15  3.334405e-15  3.309017e-15
##  [51]  3.262828e-15  3.038482e-15  2.958770e-15  2.764798e-15  2.755243e-15
##  [56]  2.653158e-15  2.517919e-15  2.500352e-15  2.368606e-15  2.300339e-15
##  [61]  2.140398e-15  1.988257e-15  1.949903e-15  1.896890e-15  1.849789e-15
##  [66]  1.841729e-15  1.805562e-15  1.708777e-15  1.688360e-15  1.680371e-15
##  [71]  1.652432e-15  1.629878e-15  1.621577e-15  1.552497e-15  1.516870e-15
##  [76]  1.465981e-15  1.426809e-15  1.394362e-15  1.284667e-15  1.267783e-15
##  [81]  1.237374e-15  1.236469e-15  1.200628e-15  1.123359e-15  1.088557e-15
##  [86]  1.065884e-15  1.041021e-15  1.027288e-15  1.018400e-15  9.978079e-16
##  [91]  9.948959e-16  9.591708e-16  9.093739e-16  8.865428e-16  8.765591e-16
##  [96]  8.532539e-16  7.617271e-16  7.527166e-16  6.805238e-16  6.460121e-16
## [101]  6.219553e-16  5.434901e-16  5.012187e-16  4.967110e-16  4.595967e-16
## [106]  4.496621e-16  4.476675e-16  4.443187e-16  3.417833e-16  3.270619e-16
## [111]  1.810898e-16  1.651887e-16  1.499935e-16  9.283906e-17  8.152539e-17
## [116] -1.944740e-17 -4.260418e-17 -5.525236e-17 -2.019966e-16 -2.637971e-16
## [121] -2.828092e-16 -2.909576e-16 -3.233568e-16 -3.400485e-16 -4.497722e-16
## [126] -4.713495e-16 -4.820456e-16 -4.824304e-16 -5.053585e-16 -5.121736e-16
## [131] -5.420321e-16 -5.447006e-16 -5.479336e-16 -6.268899e-16 -6.607863e-16
## [136] -7.236243e-16 -7.571433e-16 -8.113674e-16 -8.831856e-16 -8.911434e-16
## [141] -9.195455e-16 -9.241656e-16 -1.035144e-15 -1.071459e-15 -1.071727e-15
## [146] -1.092167e-15 -1.095923e-15 -1.105475e-15 -1.315736e-15 -1.350768e-15
## [151] -1.429542e-15 -1.444023e-15 -1.446644e-15 -1.549397e-15 -1.577194e-15
## [156] -1.577349e-15 -1.710482e-15 -1.726458e-15 -1.783444e-15 -1.798126e-15
## [161] -1.947489e-15 -1.987014e-15 -2.001301e-15 -2.050868e-15 -2.162041e-15
## [166] -2.188858e-15 -2.228734e-15 -2.332059e-15 -2.444970e-15 -2.504959e-15
## [171] -2.740499e-15 -2.744976e-15 -2.781443e-15 -2.957212e-15 -3.031298e-15
## [176] -3.100308e-15 -3.144482e-15 -3.166503e-15 -3.421981e-15 -3.429866e-15
## [181] -3.470370e-15 -3.500045e-15 -3.696943e-15 -3.784081e-15 -4.084763e-15
## [186] -9.505633e-15 -1.145797e-14 -1.693873e-14 -1.770825e-14 -1.928121e-14
## [191] -5.198379e-14 -8.898376e-14 -1.158095e-13 -1.719275e-13 -2.426788e-13
## [196] -6.048682e-13
## 
## $x
## NULL
## 
## $ac
## [1] 0
## 
## $GOF
## [1] 0.4466574 0.4466574

Adding the “eig=TRUE” allowed us to take a look into eigenvalues assigned to variables. The eigenvalues are very small - each dimension explain very little percent of the variance

GDF value (goodness of fit) - its a measure how well the model fitted the data. I achieved result of 0.4466574 which indicates that the model explained 44.67% variance of the data after reduction to 3 dimensions.

Visual comparison of MDS and PCA - three dimensions

Let’s analyze the visual presentation of two methods:

mds1df<-as.data.frame(mds1$points)
pcadf<-as.data.frame(xxx.pca1$x[,1:3])

scatterplot3d(mds1df, xlab="dim1", ylab="dim2", zlab="dim3", main="MDS", pch=16, color="#BB3754FF", grid=FALSE) 

scatterplot3d(pcadf, xlab="pc1", ylab="pc2", zlab="pc3", main="PCA", color="#ED6925FF", grid=FALSE, pch=2)

We can observe that two methods output slightly differs in the points location. It is due to the different algorithms that two methods use. We can also investigate it in one plot:

s3d <- scatterplot3d(mds1df, xlab="dim1", ylab="dim2", zlab="dim3", main="MDS and PCA Combined", pch=16, color="#BB3754FF", grid=FALSE)
s3d$points3d(xxx.pca1$x[,1], xxx.pca1$x[,2], xxx.pca1$x[,3], pch=2, col="#ED6925FF")

OBSERVATION: Two outputs differs in the outliers configuration. The main “cloud” of points is located in the same place on scale.

Conclusions

The PCA and MDS in case of my dataset occurs in similar results. After reducing to 3 dimensions:

  • PCA explained 44.67% of all variance.

  • MDS explained 44.67% of all variance.

That indicates that in terms of variance analysis both method on my dataset occurs in the same result in terms of variance. PCA allows us to take deeper insight into data we are handling - we can compare how different variables contributes to components and identify patterns. MDS doesn’t give us this deep insight. However, MDS is better for non-linear data.

In my case both methods resulted in results under expectations. It may occur due to the missing values in data and choice of the weak correlated factors.