Data is imported from the World Bank web source, section World Development Indicators. The year 2021 was chosen as it represents the period with the highest availability of the selected indicators.

The following indicators were selected:
- Access to electricity (% of population). Access to electricity is the percentage of population with access to electricity. Electrification data are collected from industry, national surveys and international sources.
- Current account balance (% of GDP). Current account balance is the sum of net exports of goods and services, net primary income, and net secondary income.
- Trade (% of GDP).Trade is the sum of exports and imports of goods and services measured as a share of gross domestic product.
- Exports of goods and services (% of GDP). Exports of goods and services represent the value of all goods and other market services provided to the rest of the world. They include the value of merchandise, freight, insurance, transport, travel, royalties, license fees, and other services, such as communication, construction, financial, information, business, personal, and government services. They exclude compensation of employees and investment income (formerly called factor services) and transfer payments.
- Net migration. Net migration is the number of immigrants minus the number of emigrants, including citizens and noncitizens.
- Agriculture, forestry, and fishing, value added (% of GDP). Agriculture, forestry, and fishing corresponds to ISIC divisions 1-3 and includes forestry, hunting, and fishing, as well as cultivation of crops and livestock production.
- Urban population (% of total population). Urban population refers to people living in urban areas as defined by national statistical offices. The data are collected and smoothed by United Nations Population Division.
- GDP per capita (current US$). GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in current U.S. dollars.
- Population density (people per sq. km of land area). Population density is midyear population divided by land area in square kilometers.
- Population growth (annual %). Annual population growth rate. Population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship.

library(ggplot2)
library(corrplot)
library(factoextra)
library(labdsv)
library(psych)
library(ClusterR)
library(readxl)
library(cluster)
library(flexclust)
library(fpc)
library(clustertend)
library(ggthemes)
library(plotly)
library(stringr)
library(missMDA)
library(ade4)
library(smacof)
library(ggplot2)
library(Rtsne)
library(scales)
library(kableExtra)
library(factoextra)
library(psy)
library(pdp)
library(scales)
library(corrplot)
library(tidyr)
library(dplyr)
library(caret)
library(gridExtra)

Import the data and check data types.

setwd("C:/Users/ydmar/Documents/UW/UW - 1 semester/UL/HW 2")
variables <- read.csv("Data_HW2.csv", 
                      header=TRUE, col.names = c("Country", "Code",
                                                 "Accesstoelectricity","CurrentAccountBalanceOfGDP",
                                                 "TradeOfGDP","ExportOfGDP","NetMigration",
                                                 "AgricultureOfGDP","UrbanPopPortion","GDPPerCapita",
                                                 "PopDensity","PopGrowth"))
str(variables)

## 'data.frame':    217 obs. of  12 variables:
##  $ Country                   : chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Code                      : chr  "AFG" "ALB" "DZA" "ASM" ...
##  $ Accesstoelectricity       : num  97.7 100 99.8 NA 100 48.2 100 100 100 100 ...
##  $ CurrentAccountBalanceOfGDP: num  -15.7 -7.6 -2.42 NA 14.06 ...
##  $ TradeOfGDP                : num  51.4 75.6 46.8 136.8 NA ...
##  $ ExportOfGDP               : num  14.3 31.1 23.4 44.3 NA ...
##  $ NetMigration              : int  -548784 -32848 -10803 -1333 1514 29094 0 314 -19457 503 ...
##  $ AgricultureOfGDP          : num  33.598 17.793 11.235 NA 0.532 ...
##  $ UrbanPopPortion           : num  26.3 63 74.3 87.2 87.9 ...
##  $ GDPPerCapita              : num  356 6413 4161 15236 42426 ...
##  $ PopDensity                : num  61.3 102.6 18.8 246.1 166.7 ...
##  $ PopGrowth                 : num  2.356 -0.927 1.619 -1.083 1.264 ...

summary(variables)

##    Country              Code           Accesstoelectricity
##  Length:217         Length:217         Min.   :  7.70     
##  Class :character   Class :character   1st Qu.: 87.10     
##  Mode  :character   Mode  :character   Median :100.00     
##                                        Mean   : 87.18     
##                                        3rd Qu.:100.00     
##                                        Max.   :100.00     
##                                        NA's   :2          
##  CurrentAccountBalanceOfGDP   TradeOfGDP       ExportOfGDP    
##  Min.   :-148.000           Min.   :  4.128   Min.   :  2.25  
##  1st Qu.:  -6.786           1st Qu.: 55.274   1st Qu.: 22.31  
##  Median :  -1.593           Median : 80.000   Median : 34.59  
##  Mean   :  -2.300           Mean   : 91.635   Mean   : 42.78  
##  3rd Qu.:   3.883           3rd Qu.:110.000   3rd Qu.: 52.08  
##  Max.   :  42.053           Max.   :402.460   Max.   :213.22  
##  NA's   :17                 NA's   :16        NA's   :23      
##   NetMigration       AgricultureOfGDP    UrbanPopPortion   GDPPerCapita     
##  Min.   :-994722.0   Min.   :  0.01663   Min.   : 13.46   Min.   :   214.1  
##  1st Qu.:  -9888.0   1st Qu.:  2.16056   1st Qu.: 43.00   1st Qu.:  2518.6  
##  Median :   -368.0   Median :  7.14749   Median : 62.97   Median :  7176.7  
##  Mean   :    744.5   Mean   : 11.77065   Mean   : 61.73   Mean   : 20303.6  
##  3rd Qu.:   6493.0   3rd Qu.: 16.93128   3rd Qu.: 81.42   3rd Qu.: 27187.0  
##  Max.   : 674787.0   Max.   :110.00000   Max.   :100.00   Max.   :223823.4  
##                      NA's   :18          NA's   :2        NA's   :7         
##    PopDensity          PopGrowth        
##  Min.   :    0.138   Min.   :-10.92744  
##  1st Qu.:   39.427   1st Qu.:  0.08925  
##  Median :   94.195   Median :  0.83103  
##  Mean   :  447.269   Mean   :  0.78397  
##  3rd Qu.:  238.573   3rd Qu.:  1.83582  
##  Max.   :20681.818   Max.   :  3.65720  
##  NA's   :1

Remove NA values from dataset and create matrix only with numerical data.

pure_variables <- variables %>%
  filter(complete.cases(select(., -Country, -Code)))
variable_mx <- as.matrix(pure_variables[,3:12])

Compute the covariance matrix to see if there is any relationship between variables.

cor_matrix <- cor(variable_mx, method = "pearson")
print(cor_matrix)

##                            Accesstoelectricity CurrentAccountBalanceOfGDP
## Accesstoelectricity                 1.00000000                 0.11620203
## CurrentAccountBalanceOfGDP          0.11620203                 1.00000000
## TradeOfGDP                          0.22408667                 0.14100098
## ExportOfGDP                         0.22367246                 0.23637860
## NetMigration                        0.04103629                 0.08526557
## AgricultureOfGDP                   -0.37749488                -0.01892055
## UrbanPopPortion                     0.50259339                 0.17545155
## GDPPerCapita                        0.36324984                 0.25519912
## PopDensity                          0.08534384                 0.11381147
## PopGrowth                          -0.54490164                -0.16742597
##                             TradeOfGDP ExportOfGDP NetMigration
## Accesstoelectricity         0.22408667  0.22367246   0.04103629
## CurrentAccountBalanceOfGDP  0.14100098  0.23637860   0.08526557
## TradeOfGDP                  1.00000000  0.95026002   0.06925632
## ExportOfGDP                 0.95026002  1.00000000   0.09112753
## NetMigration                0.06925632  0.09112753   1.00000000
## AgricultureOfGDP           -0.19101084 -0.27543366  -0.13791142
## UrbanPopPortion             0.32672829  0.38342431   0.23444995
## GDPPerCapita                0.44287896  0.51432791   0.24356124
## PopDensity                  0.31559456  0.32139894  -0.03869927
## PopGrowth                  -0.27620165 -0.28555875  -0.03324977
##                            AgricultureOfGDP UrbanPopPortion GDPPerCapita
## Accesstoelectricity             -0.37749488       0.5025934    0.3632498
## CurrentAccountBalanceOfGDP      -0.01892055       0.1754516    0.2551991
## TradeOfGDP                      -0.19101084       0.3267283    0.4428790
## ExportOfGDP                     -0.27543366       0.3834243    0.5143279
## NetMigration                    -0.13791142       0.2344500    0.2435612
## AgricultureOfGDP                 1.00000000      -0.3340769   -0.3016043
## UrbanPopPortion                 -0.33407692       1.0000000    0.5685653
## GDPPerCapita                    -0.30160427       0.5685653    1.0000000
## PopDensity                       0.30982143       0.2033779    0.1989183
## PopGrowth                        0.27010434      -0.4388513   -0.3791341
##                             PopDensity   PopGrowth
## Accesstoelectricity         0.08534384 -0.54490164
## CurrentAccountBalanceOfGDP  0.11381147 -0.16742597
## TradeOfGDP                  0.31559456 -0.27620165
## ExportOfGDP                 0.32139894 -0.28555875
## NetMigration               -0.03869927 -0.03324977
## AgricultureOfGDP            0.30982143  0.27010434
## UrbanPopPortion             0.20337788 -0.43885128
## GDPPerCapita                0.19891829 -0.37913406
## PopDensity                  1.00000000 -0.15464858
## PopGrowth                  -0.15464858  1.00000000

corrplot::corrplot(cor_matrix, method = "circle")

The plot reveals a strong correlation between Trade of GDP (TradeOfGDP) and Export of GDP (ExportOfGDP). Since exports are a major component of trade (the sum of imports and exports), an increase in exports directly drives an increase in total trade activity. Consequently, an increase in TradeOfGDP is often associated with a rise in GDP per capita (GDPPerCapita), as both TradeOfGDP and ExportOfGDP are positively correlated with GDPPerCapita. Additionally, the plot highlights a positive correlation between Urban Population Proportion (UrbanPopPortion) and GDPPerCapita, ExportOfGDP, and TradeOfGDP. This suggests that more urbanized countries tend to achieve higher levels of GDP per capita. A further observation is the positive correlation between UrbanPopPortion and Access to Electricity, indicating that urbanized regions are more likely to have better access to electricity infrastructure.

PCA

Normalize the data before further analysis.

preproc1 <- preProcess(variable_mx, method=c("center", "scale"))
variable_mx.s <- predict(preproc1, variable_mx)
summary(variable_mx.s)

##  Accesstoelectricity CurrentAccountBalanceOfGDP   TradeOfGDP     
##  Min.   :-3.18817    Min.   :-9.66709           Min.   :-1.4314  
##  1st Qu.:-0.06691    1st Qu.:-0.27230           1st Qu.:-0.6261  
##  Median : 0.56151    Median : 0.03692           Median :-0.2197  
##  Mean   : 0.00000    Mean   : 0.00000           Mean   : 0.0000  
##  3rd Qu.: 0.56151    3rd Qu.: 0.39651           3rd Qu.: 0.3032  
##  Max.   : 0.56151    Max.   : 2.93712           Max.   : 5.1396  
##   ExportOfGDP       NetMigration      AgricultureOfGDP  UrbanPopPortion   
##  Min.   :-1.2081   Min.   :-7.04532   Min.   :-0.8341   Min.   :-2.05337  
##  1st Qu.:-0.6097   1st Qu.:-0.09632   1st Qu.:-0.6667   1st Qu.:-0.76970  
##  Median :-0.2642   Median :-0.02643   Median :-0.3229   Median : 0.07717  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.2566   3rd Qu.: 0.04468   3rd Qu.: 0.3355   3rd Qu.: 0.83650  
##  Max.   : 5.0094   Max.   : 4.74319   Max.   : 6.6850   Max.   : 1.70653  
##   GDPPerCapita       PopDensity         PopGrowth       
##  Min.   :-0.7043   Min.   :-0.21309   Min.   :-7.47161  
##  1st Qu.:-0.6100   1st Qu.:-0.19242   1st Qu.:-0.44998  
##  Median :-0.4509   Median :-0.16416   Median : 0.01595  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.1493   3rd Qu.:-0.09063   3rd Qu.: 0.66638  
##  Max.   : 4.9241   Max.   :11.89901   Max.   : 1.65063

Eigenvalues on the basis of covariance.

variable_mx.cov<-cov(variable_mx.s)
variable_mx.eigen<-eigen(variable_mx.cov)
variable_mx.eigen$values

##  [1] 3.56704719 1.54843489 1.11757071 1.05361972 0.87965618 0.56278307
##  [7] 0.46525012 0.39443745 0.37478974 0.03641093

head(variable_mx.eigen$vectors)

##            [,1]       [,2]        [,3]        [,4]        [,5]        [,6]
## [1,] -0.3270633 -0.3324280  0.40069388  0.07938043 -0.05360354 -0.13955332
## [2,] -0.1772058  0.1086928  0.04756724 -0.50037595  0.81901644 -0.03490101
## [3,] -0.3913644  0.3659184 -0.26655656  0.31888786  0.01144333 -0.20166028
## [4,] -0.4184695  0.3356922 -0.28938771  0.25952451  0.08897670 -0.10073816
## [5,] -0.1255890 -0.1664455 -0.49790421 -0.60682593 -0.38102658 -0.42532389
## [6,]  0.2433332  0.4932585  0.28785342 -0.31279151 -0.14395305 -0.10647433
##             [,7]          [,8]        [,9]       [,10]
## [1,]  0.48962736 -0.4851487174  0.33983176  0.05221146
## [2,]  0.15474660  0.0455401661 -0.03583309 -0.06953282
## [3,]  0.03767639 -0.2019842631 -0.09980791 -0.66874446
## [4,]  0.04563945 -0.0536476139 -0.07555263  0.72963111
## [5,]  0.10320525 -0.0004294539  0.05947414  0.01032634
## [6,] -0.24378293 -0.6230867074 -0.16862623  0.09287441

xxx<-variable_mx.s
xxx.pca1<-prcomp(xxx, center=FALSE, scale.=FALSE)
xxx.pca1

## Standard deviations (1, .., p=10):
##  [1] 1.8886628 1.2443612 1.0571522 1.0264598 0.9378999 0.7501887 0.6820925
##  [8] 0.6280426 0.6122007 0.1908165
## 
## Rotation (n x k) = (10 x 10):
##                                   PC1         PC2         PC3         PC4
## Accesstoelectricity        -0.3270633  0.33242801 -0.40069388  0.07938043
## CurrentAccountBalanceOfGDP -0.1772058 -0.10869278 -0.04756724 -0.50037595
## TradeOfGDP                 -0.3913644 -0.36591845  0.26655656  0.31888786
## ExportOfGDP                -0.4184695 -0.33569221  0.28938771  0.25952451
## NetMigration               -0.1255890  0.16644551  0.49790421 -0.60682593
## AgricultureOfGDP            0.2433332 -0.49325852 -0.28785342 -0.31279151
## UrbanPopPortion            -0.3919354  0.17461029 -0.11831882 -0.19174349
## GDPPerCapita               -0.4051167  0.02660823  0.09110463 -0.18691114
## PopDensity                 -0.1705403 -0.53846740 -0.37483550 -0.18750315
## PopGrowth                   0.3339826 -0.19733210  0.43452835 -0.03667714
##                                     PC5         PC6         PC7           PC8
## Accesstoelectricity        -0.053603545  0.13955332  0.48962736 -0.4851487174
## CurrentAccountBalanceOfGDP  0.819016436  0.03490101  0.15474660  0.0455401661
## TradeOfGDP                  0.011443328  0.20166028  0.03767639 -0.2019842631
## ExportOfGDP                 0.088976701  0.10073816  0.04563945 -0.0536476139
## NetMigration               -0.381026576  0.42532389  0.10320525 -0.0004294539
## AgricultureOfGDP           -0.143953048  0.10647433 -0.24378293 -0.6230867074
## UrbanPopPortion            -0.221325469 -0.44812374  0.12373000 -0.0265662546
## GDPPerCapita               -0.046298123 -0.48590947 -0.52934827 -0.1352259326
## PopDensity                 -0.318062674 -0.05670013  0.29146555  0.5047474592
## PopGrowth                  -0.009739602 -0.54416610  0.53128500 -0.2384996029
##                                    PC9        PC10
## Accesstoelectricity        -0.33983176 -0.05221146
## CurrentAccountBalanceOfGDP  0.03583309  0.06953282
## TradeOfGDP                  0.09980791  0.66874446
## ExportOfGDP                 0.07555263 -0.72963111
## NetMigration               -0.05947414 -0.01032634
## AgricultureOfGDP            0.16862623 -0.09287441
## UrbanPopPortion             0.70624079  0.02328586
## GDPPerCapita               -0.50343670  0.04172474
## PopDensity                 -0.24424641  0.03940226
## PopGrowth                  -0.15397306  0.01704202

xxx.pca1$rotation

##                                   PC1         PC2         PC3         PC4
## Accesstoelectricity        -0.3270633  0.33242801 -0.40069388  0.07938043
## CurrentAccountBalanceOfGDP -0.1772058 -0.10869278 -0.04756724 -0.50037595
## TradeOfGDP                 -0.3913644 -0.36591845  0.26655656  0.31888786
## ExportOfGDP                -0.4184695 -0.33569221  0.28938771  0.25952451
## NetMigration               -0.1255890  0.16644551  0.49790421 -0.60682593
## AgricultureOfGDP            0.2433332 -0.49325852 -0.28785342 -0.31279151
## UrbanPopPortion            -0.3919354  0.17461029 -0.11831882 -0.19174349
## GDPPerCapita               -0.4051167  0.02660823  0.09110463 -0.18691114
## PopDensity                 -0.1705403 -0.53846740 -0.37483550 -0.18750315
## PopGrowth                   0.3339826 -0.19733210  0.43452835 -0.03667714
##                                     PC5         PC6         PC7           PC8
## Accesstoelectricity        -0.053603545  0.13955332  0.48962736 -0.4851487174
## CurrentAccountBalanceOfGDP  0.819016436  0.03490101  0.15474660  0.0455401661
## TradeOfGDP                  0.011443328  0.20166028  0.03767639 -0.2019842631
## ExportOfGDP                 0.088976701  0.10073816  0.04563945 -0.0536476139
## NetMigration               -0.381026576  0.42532389  0.10320525 -0.0004294539
## AgricultureOfGDP           -0.143953048  0.10647433 -0.24378293 -0.6230867074
## UrbanPopPortion            -0.221325469 -0.44812374  0.12373000 -0.0265662546
## GDPPerCapita               -0.046298123 -0.48590947 -0.52934827 -0.1352259326
## PopDensity                 -0.318062674 -0.05670013  0.29146555  0.5047474592
## PopGrowth                  -0.009739602 -0.54416610  0.53128500 -0.2384996029
##                                    PC9        PC10
## Accesstoelectricity        -0.33983176 -0.05221146
## CurrentAccountBalanceOfGDP  0.03583309  0.06953282
## TradeOfGDP                  0.09980791  0.66874446
## ExportOfGDP                 0.07555263 -0.72963111
## NetMigration               -0.05947414 -0.01032634
## AgricultureOfGDP            0.16862623 -0.09287441
## UrbanPopPortion             0.70624079  0.02328586
## GDPPerCapita               -0.50343670  0.04172474
## PopDensity                 -0.24424641  0.03940226
## PopGrowth                  -0.15397306  0.01704202

As we can see, the first eigenvalue explains the most variance (~35.67%). The eigenvectors represents the loadings of original variables on each principal component.

summary(xxx.pca1)

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.8887 1.2444 1.0572 1.0265 0.93790 0.75019 0.68209
## Proportion of Variance 0.3567 0.1548 0.1118 0.1054 0.08797 0.05628 0.04653
## Cumulative Proportion  0.3567 0.5115 0.6233 0.7287 0.81663 0.87291 0.91944
##                            PC8     PC9    PC10
## Standard deviation     0.62804 0.61220 0.19082
## Proportion of Variance 0.03944 0.03748 0.00364
## Cumulative Proportion  0.95888 0.99636 1.00000

fviz_eig(xxx.pca1, choice='eigenvalue') + 
geom_hline(yintercept = 1, linetype = "dashed", color = "red", size = 1)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Eigenvalues between 3 and 4 components are higher than 1. We can choose between 3 or 4 components, since they are explain most of the variance.

fviz_eig(xxx.pca1)

Scree plot represents graphically the percentage of variance explained by every component.The results show that PC1 explains 35% of variation, PC2 explains 15 % of variation, PC3 explains 11% of variation and PC4 explains 10% of variation.

eig.val<-get_eigenvalue(xxx.pca1)
eig.val

##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1  3.56704719       35.6704719                    35.67047
## Dim.2  1.54843489       15.4843489                    51.15482
## Dim.3  1.11757071       11.1757071                    62.33053
## Dim.4  1.05361972       10.5361972                    72.86673
## Dim.5  0.87965618        8.7965618                    81.66329
## Dim.6  0.56278307        5.6278307                    87.29112
## Dim.7  0.46525012        4.6525012                    91.94362
## Dim.8  0.39443745        3.9443745                    95.88799
## Dim.9  0.37478974        3.7478974                    99.63589
## Dim.10 0.03641093        0.3641093                   100.00000

And again, eigenvalues are higher than 1 for the Dim.1, Dim.2, Dim.3 and for the Dim.4.

a<-summary(xxx.pca1)
plot(a$importance[3,],type="p")

The plot shows what percent of variance has been explained for each number of principal components (aggregate variance explained). The first PCs (PC1, PC2, PC3 and PC4) explain most of the variance. PC1 explains ~35%. PC1 + PC2 explain ~50%, PC1 + PC2 + PC3 explain ~ 62% (quite acceptable). And PC1+ PC2 + PC3 + PC4 explain 72%. The plot suggests that retaining the first 3–4 components would explain most of the variance (~ 60% - 70%) without including all components. The “cloud of points” graph shows individual observations quality of representation.

2D visualization

fviz_pca_ind(xxx.pca1, col.ind="cos2", geom = "point", gradient.cols = c("blue","steelblue","gray"))

fviz_pca_var(xxx.pca1, col.var="steelblue")

3D visualization

cos2 <- apply(xxx.pca1$x[, 1:3]^2, 1, sum) / apply(xxx.pca1$x^2, 1, sum) # quality of representation of a data point on the selected principal component 
colors <- colorRampPalette(c("blue", "steelblue", "gray"))(length(cos2))
colors <- colors[rank(cos2)]  
plot_ly(
  as.data.frame(xxx.pca1$x[, 1:3]),
  x = ~PC1,
  y = ~PC2,
  z = ~PC3,
  color = ~cos2,
  colors = c("blue", "steelblue", "gray"),
  marker = list(size = 5),
  type = "scatter3d",
  mode = "markers"
) %>%
  layout(
    scene = list(
      xaxis = list(title = "PC1"),
      yaxis = list(title = "PC2"),
      zaxis = list(title = "PC3")
    ),
    title = "PCA Individuals Plot (3D)"
  )

This plot extends the 2D PCA individuals plot by incorporating the third principal components along with PC1 and PC2. Variance explained by the first three PCs ( ~62%). Most observations appear clustered near the center (origin), which suggests that they have moderate or balanced contributions from the variables related to PC1, PC2, and PC3.

plot_ly(
  x = ~xxx.pca1$rotation[, 1], 
  y = ~xxx.pca1$rotation[, 2],
  z = ~xxx.pca1$rotation[, 3], 
  type = 'scatter3d',
  mode = 'markers+text',
  text = rownames(xxx.pca1$rotation), 
  marker = list(size = 5, color = 'steelblue')
) %>%
  layout(
    scene = list(
      xaxis = list(title = "PC1"),
      yaxis = list(title = "PC2"),
      zaxis = list(title = "PC3")
    )
  )

The variable plots displayed above show relations between variables as well as the “quality” of all factors. Variables correlated positively are close to each other whereas those correlated negatively are on the opposite sites of the plot.“Quality” of the variable is presented by the distance from the center. We can see what variables and how strong influence on the selected dimension based on the length of the arrow (example, TradeOfGDP and ExportOfGDP have a strong positive correlation and strongly represented in the PC1-PC2 plane; both variables contribute significantly to PC1 and PC2). However, we can notice other variables with long arrows (GDPPerCapita, UrbamPopPortion and AccesstoElectrisity), but they are not strongly correlated between each other. Regarding the 3D plots, the coordinates represents its loadings to PC1, PC2 and PC3. Variables, such as ExportOfGDP, TradeOfGDP and NetMigration are farther from the origin, indicating they strongly influence the components. And again, ExportOfGDP and TradeOfGDP are positioned close together, indicating a strong positive correlation. Variables positioned far apart or oppositely across the origin are negatively correlated (example: PopGrowth and AgricultureOfGDP).They might have opposing contributions to the components.

loading_scores_PC_1<-xxx.pca1$rotation[,1]
fac_scores_PC_1<-abs(loading_scores_PC_1)
fac_scores_PC_1_ranked<-names(sort(fac_scores_PC_1, decreasing=T))
xxx.pca1$rotation[fac_scores_PC_1_ranked, 1]

##                ExportOfGDP               GDPPerCapita 
##                 -0.4184695                 -0.4051167 
##            UrbanPopPortion                 TradeOfGDP 
##                 -0.3919354                 -0.3913644 
##                  PopGrowth        Accesstoelectricity 
##                  0.3339826                 -0.3270633 
##           AgricultureOfGDP CurrentAccountBalanceOfGDP 
##                  0.2433332                 -0.1772058 
##                 PopDensity               NetMigration 
##                 -0.1705403                 -0.1255890

As we can see, variables were sorted based on their absolute contributions to PC1. The following contributions are the strongest one: ExportOfGDP, GDPPerCapita, UrbanPopPortion, TradeOfGDP, PopGrowth.

loading_scores_PC_2<-xxx.pca1$rotation[,2]
fac_scores_PC_2<-abs(loading_scores_PC_2)
fac_scores_PC_2_ranked<-names(sort(fac_scores_PC_2, decreasing=T))
xxx.pca1$rotation[fac_scores_PC_2_ranked, 2]

##                 PopDensity           AgricultureOfGDP 
##                -0.53846740                -0.49325852 
##                 TradeOfGDP                ExportOfGDP 
##                -0.36591845                -0.33569221 
##        Accesstoelectricity                  PopGrowth 
##                 0.33242801                -0.19733210 
##            UrbanPopPortion               NetMigration 
##                 0.17461029                 0.16644551 
## CurrentAccountBalanceOfGDP               GDPPerCapita 
##                -0.10869278                 0.02660823

As we can observe from the PC2, the following contributions are the strongest one: PopDensity, AgricultureOfGDP, TradeOfGDP, ExportOfGDP and AccesstoElectricity.

loading_scores_PC_3<-xxx.pca1$rotation[, 3]
fac_scores_PC_3<-abs(loading_scores_PC_3)
fac_scores_PC_3_ranked<-names(sort(fac_scores_PC_3, decreasing = TRUE))
xxx.pca1$rotation[fac_scores_PC_3_ranked, 3]

##               NetMigration                  PopGrowth 
##                 0.49790421                 0.43452835 
##        Accesstoelectricity                 PopDensity 
##                -0.40069388                -0.37483550 
##                ExportOfGDP           AgricultureOfGDP 
##                 0.28938771                -0.28785342 
##                 TradeOfGDP            UrbanPopPortion 
##                 0.26655656                -0.11831882 
##               GDPPerCapita CurrentAccountBalanceOfGDP 
##                 0.09110463                -0.04756724

Regarding the PC3, we can observe NetMigration, PopGrowth, Accesstoelectricity and PopDensity as strong contributions.

ind<-get_pca_ind(xxx.pca1) 
print(ind)

## Principal Component Analysis Results for individuals
##  ===================================================
##   Name       Description                       
## 1 "$coord"   "Coordinates for the individuals" 
## 2 "$cos2"    "Cos2 for the individuals"        
## 3 "$contrib" "contributions of the individuals"

head(ind$coord) #coordinates of variables

##        Dim.1      Dim.2      Dim.3      Dim.4       Dim.5      Dim.6
## 1  2.6821596 -0.9663685 -2.3277322  2.3777296  0.84523647 -1.1881496
## 2  0.0441960  0.5174540 -1.0970598  0.2181173 -0.25536799  0.7036110
## 3  0.5322906  0.7646938 -0.4414868 -0.2449284 -0.09853443 -0.4359844
## 4  1.0350411 -0.6813527  1.2936583 -0.7236341  0.85206642 -0.7777434
## 5  0.2871516  0.3508099  0.2308214  1.2019684 -0.38485450  0.8109654
## 6 -0.1402277  1.3364393 -0.8863389 -0.6163592 -0.07811597 -0.5001120
##         Dim.7      Dim.8      Dim.9      Dim.10
## 1 -0.08949271 -1.1831738 -0.6496809 -0.05067535
## 2 -0.32051027 -0.2083634  0.3263366 -0.03672802
## 3  0.78099144 -0.2118803  0.3495734 -0.09120295
## 4  0.49682084  0.5243659  0.8881107 -0.12526784
## 5 -0.04009047  0.1103267 -1.3911295 -0.04770611
## 6  0.37493616  0.1638155  0.8152576 -0.07105062

head(ind$contrib) #contributions of variables

##          Dim.1      Dim.2      Dim.3      Dim.4       Dim.5     Dim.6
## 1 1.1020702381 0.32956532 2.64935355 2.93217534 0.443805216 1.3707245
## 2 0.0002992306 0.09449298 0.58848381 0.02467432 0.040510617 0.4806989
## 3 0.0434047853 0.20636264 0.09530363 0.03111311 0.006031314 0.1845652
## 4 0.1641176306 0.16383246 0.81830097 0.27158329 0.451006557 0.5873279
## 5 0.0126317231 0.04343099 0.02605111 0.74929202 0.092008743 0.6385759
## 6 0.0030123655 0.63031038 0.38412586 0.19703018 0.003790667 0.2428524
##         Dim.7      Dim.8     Dim.9     Dim.10
## 1 0.009406711 1.93940210 0.6154056 0.03853990
## 2 0.120655277 0.06014695 0.1552719 0.02024473
## 3 0.716399000 0.06219449 0.1781713 0.12483459
## 4 0.289909217 0.38092518 1.1499936 0.23550294
## 5 0.001887751 0.01686289 2.8216057 0.03415585
## 6 0.165111406 0.03717753 0.9690606 0.07576226

var<-get_pca_var(xxx.pca1)
a<-fviz_contrib(xxx.pca1, "var", axes=1, xtickslab.rt=90) +
  theme(axis.title.x = element_blank())
b<-fviz_contrib(xxx.pca1, "var", axes=2, xtickslab.rt=90) +
  theme(axis.title.x = element_blank())
c<-fviz_contrib(xxx.pca1, "var", axes=3, xtickslab.rt=90) +
  theme(axis.title.x = element_blank())
grid.arrange(a)

grid.arrange(b)

grid.arrange(c)

Plots represent the same variables contributions as we describe above. We only missed Accesstoelectricity for the PC1. The first three components have eigenvalues greater than 1, indicating they explain more variance than an average variable (PC1 - 3.57, PC2 - 1.55, PC3 - 1.12). Combined the first three PCs explain ~62.3% of the variance, that indicates, that we can reduce dimensions to three. Note: based on the findings, including PC4 would increase explanation to ~72.8%, but it’s not so significant difference compared to the first three components. Such variables as TradeOfGDP and ExportOfGDP are highly correlated. We can see it in the correlation matrix and PCA loadings. UrbanPopPortion and GDPPerCapita is also positively correlated.

t-SNE

tsne_data <- variables[, c("Accesstoelectricity", "CurrentAccountBalanceOfGDP",
                           "TradeOfGDP", "ExportOfGDP", "NetMigration",
                           "AgricultureOfGDP", "UrbanPopPortion", "GDPPerCapita",
                           "PopDensity", "PopGrowth")]
tsne_data <- na.omit(tsne_data)

t-SNE (1). Set 3 dimensions, perplexity = 30.

set.seed(42)
tsne_out_1 <- Rtsne(variable_mx, dims = 3, perplexity = 30, verbose = TRUE, max_iter = 1000)

## Performing PCA
## Read the 183 x 10 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 3, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.03 seconds (sparsity = 0.691809)!
## Learning embedding...
## Iteration 50: error is 49.671298 (50 iterations in 0.02 seconds)
## Iteration 100: error is 48.664024 (50 iterations in 0.02 seconds)
## Iteration 150: error is 48.597943 (50 iterations in 0.03 seconds)
## Iteration 200: error is 48.404369 (50 iterations in 0.02 seconds)
## Iteration 250: error is 48.717785 (50 iterations in 0.02 seconds)
## Iteration 300: error is 0.191370 (50 iterations in 0.02 seconds)
## Iteration 350: error is 0.110262 (50 iterations in 0.01 seconds)
## Iteration 400: error is 0.105177 (50 iterations in 0.01 seconds)
## Iteration 450: error is 0.106944 (50 iterations in 0.02 seconds)
## Iteration 500: error is 0.107461 (50 iterations in 0.01 seconds)
## Iteration 550: error is 0.107567 (50 iterations in 0.01 seconds)
## Iteration 600: error is 0.109085 (50 iterations in 0.01 seconds)
## Iteration 650: error is 0.108467 (50 iterations in 0.01 seconds)
## Iteration 700: error is 0.110864 (50 iterations in 0.01 seconds)
## Iteration 750: error is 0.110261 (50 iterations in 0.02 seconds)
## Iteration 800: error is 0.109214 (50 iterations in 0.01 seconds)
## Iteration 850: error is 0.110537 (50 iterations in 0.02 seconds)
## Iteration 900: error is 0.109831 (50 iterations in 0.02 seconds)
## Iteration 950: error is 0.109585 (50 iterations in 0.01 seconds)
## Iteration 1000: error is 0.109022 (50 iterations in 0.02 seconds)
## Fitting performed in 0.35 seconds.

plot_ly(
  data.frame(
  x = tsne_out_1$Y[, 1],
  y = tsne_out_1$Y[, 2],
  z = tsne_out_1$Y[, 3], 
  Country = pure_variables$Country[!is.na(rowSums(tsne_data))]
),
  x = ~x,
  y = ~y,
  z = ~z,
  color = ~Country, 
  type = "scatter3d",
  mode = "markers"
) %>%
  layout(
    title = "Data Visualized via t-SNE p 30 & iter 1000"
  )

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

We can slightly indicate clusters from that plot. There are lot’s of regions where clusters overlap or close to each other. The overlap may suggest similarities in some factors across countries.

t-SNE (2). Set 3 dimensions, perplexity = 50.

set.seed(42)
tsne_out_2 <- Rtsne(variable_mx, dims = 3, perplexity = 50, verbose = TRUE, max_iter = 1000)

## Performing PCA
## Read the 183 x 10 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 3, perplexity = 50.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.02 seconds (sparsity = 0.953149)!
## Learning embedding...
## Iteration 50: error is 44.244555 (50 iterations in 0.02 seconds)
## Iteration 100: error is 43.848997 (50 iterations in 0.03 seconds)
## Iteration 150: error is 43.841653 (50 iterations in 0.03 seconds)
## Iteration 200: error is 43.282574 (50 iterations in 0.03 seconds)
## Iteration 250: error is 43.985355 (50 iterations in 0.03 seconds)
## Iteration 300: error is 0.214034 (50 iterations in 0.02 seconds)
## Iteration 350: error is 0.054697 (50 iterations in 0.01 seconds)
## Iteration 400: error is 0.054083 (50 iterations in 0.01 seconds)
## Iteration 450: error is 0.052128 (50 iterations in 0.01 seconds)
## Iteration 500: error is 0.052159 (50 iterations in 0.02 seconds)
## Iteration 550: error is 0.049671 (50 iterations in 0.01 seconds)
## Iteration 600: error is 0.051869 (50 iterations in 0.01 seconds)
## Iteration 650: error is 0.048967 (50 iterations in 0.01 seconds)
## Iteration 700: error is 0.051522 (50 iterations in 0.01 seconds)
## Iteration 750: error is 0.050943 (50 iterations in 0.01 seconds)
## Iteration 800: error is 0.053591 (50 iterations in 0.02 seconds)
## Iteration 850: error is 0.055399 (50 iterations in 0.02 seconds)
## Iteration 900: error is 0.056304 (50 iterations in 0.02 seconds)
## Iteration 950: error is 0.056295 (50 iterations in 0.02 seconds)
## Iteration 1000: error is 0.055907 (50 iterations in 0.02 seconds)
## Fitting performed in 0.39 seconds.

plot_ly(
  data.frame(
  x = tsne_out_2$Y[, 1],
  y = tsne_out_2$Y[, 2],
  z = tsne_out_2$Y[, 3], 
  Country = pure_variables$Country[!is.na(rowSums(tsne_data))]
),
  x = ~x,
  y = ~y,
  z = ~z,
  color = ~Country, 
  type = "scatter3d",
  mode = "markers"
) %>%
  layout(
    title = "Data Visualized via t-SNE p 50 & iter 1000"
  )

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

We can observe that more regions become overlapped now and cluster identification is slightly possible now.

t-SNE (3). Set 3 dimensions, perplexity = 10.

set.seed(42)
tsne_out_3<- Rtsne(variable_mx, dims = 3, perplexity = 10,verbose = TRUE, max_iter = 1000)

## Performing PCA
## Read the 183 x 10 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 3, perplexity = 10.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.01 seconds (sparsity = 0.240855)!
## Learning embedding...
## Iteration 50: error is 55.062593 (50 iterations in 0.02 seconds)
## Iteration 100: error is 54.179610 (50 iterations in 0.02 seconds)
## Iteration 150: error is 53.594269 (50 iterations in 0.01 seconds)
## Iteration 200: error is 54.742051 (50 iterations in 0.02 seconds)
## Iteration 250: error is 52.997651 (50 iterations in 0.02 seconds)
## Iteration 300: error is 0.503769 (50 iterations in 0.01 seconds)
## Iteration 350: error is 0.268688 (50 iterations in 0.01 seconds)
## Iteration 400: error is 0.260427 (50 iterations in 0.01 seconds)
## Iteration 450: error is 0.254632 (50 iterations in 0.01 seconds)
## Iteration 500: error is 0.249588 (50 iterations in 0.01 seconds)
## Iteration 550: error is 0.246131 (50 iterations in 0.01 seconds)
## Iteration 600: error is 0.243871 (50 iterations in 0.01 seconds)
## Iteration 650: error is 0.243083 (50 iterations in 0.01 seconds)
## Iteration 700: error is 0.241250 (50 iterations in 0.01 seconds)
## Iteration 750: error is 0.238840 (50 iterations in 0.01 seconds)
## Iteration 800: error is 0.236589 (50 iterations in 0.01 seconds)
## Iteration 850: error is 0.234576 (50 iterations in 0.01 seconds)
## Iteration 900: error is 0.234085 (50 iterations in 0.01 seconds)
## Iteration 950: error is 0.234982 (50 iterations in 0.01 seconds)
## Iteration 1000: error is 0.233785 (50 iterations in 0.01 seconds)
## Fitting performed in 0.25 seconds.

plot_ly(
  data.frame(
  x = tsne_out_3$Y[, 1],
  y = tsne_out_3$Y[, 2],
  z = tsne_out_3$Y[, 3], 
  Country = pure_variables$Country[!is.na(rowSums(tsne_data))]
),
  x = ~x,
  y = ~y,
  z = ~z,
  color = ~Country, 
  type = "scatter3d",
  mode = "markers"
) %>%
  layout(
    title = "Data Visualized via t-SNE p 10 & iter 1000"
  )

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

Compared to previous observations with perplexity values of 30 and 50, the clusters representing different “economic behaviors” are more distinct and well-formed. This suggests that t-SNE with these parameters effectively captures local groupings within the data based on economic performance. Lower perplexity values result in tighter and smaller clusters.The common characteristics of these clusters are driven by economic activity, allowing for the identification of groups such as developed, developing, and underdeveloped countries, small economies, or nations characterized by high GDP, trade intensity, and urbanization. Points that lie far from the main clusters likely represent outlier countries with unique economic profiles that do not align with other groups. Approximately 10 clusters, along with outliers, can be observed. For greater precision and to refine the clustering further, applying a perplexity value of 5 may be beneficial.

t-SNE (4). Set 3 dimensions, perplexity = 5.

set.seed(42)
tsne_out_4<- Rtsne(variable_mx, dims = 3, perplexity = 5,verbose = TRUE, max_iter = 1000)

## Performing PCA
## Read the 183 x 10 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 3, perplexity = 5.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.00 seconds (sparsity = 0.111320)!
## Learning embedding...
## Iteration 50: error is 60.797243 (50 iterations in 0.01 seconds)
## Iteration 100: error is 55.673925 (50 iterations in 0.01 seconds)
## Iteration 150: error is 56.398287 (50 iterations in 0.01 seconds)
## Iteration 200: error is 55.823025 (50 iterations in 0.01 seconds)
## Iteration 250: error is 57.359579 (50 iterations in 0.01 seconds)
## Iteration 300: error is 1.068412 (50 iterations in 0.01 seconds)
## Iteration 350: error is 0.337501 (50 iterations in 0.01 seconds)
## Iteration 400: error is 0.312611 (50 iterations in 0.01 seconds)
## Iteration 450: error is 0.296204 (50 iterations in 0.01 seconds)
## Iteration 500: error is 0.291130 (50 iterations in 0.01 seconds)
## Iteration 550: error is 0.287907 (50 iterations in 0.01 seconds)
## Iteration 600: error is 0.282089 (50 iterations in 0.01 seconds)
## Iteration 650: error is 0.278523 (50 iterations in 0.01 seconds)
## Iteration 700: error is 0.277947 (50 iterations in 0.01 seconds)
## Iteration 750: error is 0.276734 (50 iterations in 0.01 seconds)
## Iteration 800: error is 0.273748 (50 iterations in 0.01 seconds)
## Iteration 850: error is 0.271295 (50 iterations in 0.01 seconds)
## Iteration 900: error is 0.270727 (50 iterations in 0.01 seconds)
## Iteration 950: error is 0.268156 (50 iterations in 0.01 seconds)
## Iteration 1000: error is 0.267512 (50 iterations in 0.01 seconds)
## Fitting performed in 0.21 seconds.

plot_ly(
  data.frame(
  x = tsne_out_4$Y[, 1],
  y = tsne_out_4$Y[, 2],
  z = tsne_out_4$Y[, 3], 
  Country = pure_variables$Country[!is.na(rowSums(tsne_data))]
),
  x = ~x,
  y = ~y,
  z = ~z,
  color = ~Country, 
  type = "scatter3d",
  mode = "markers"
) %>%
  layout(
    title = "Data Visualized via t-SNE p 5 & iter 1000"
  )

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

Considering that plot, we can better identify number of clusters. We can use this information to understand the patterns withing the dataset. The baseline setting (perplexity 5, iterations 1000) is effective for the imported dataset, providing a good starting point for visualization. Increasing perplexity can sometimes merge clusters that should be distinct,indicating that the chosen perplexity may be too high for preserving the local structure of the dataset. Tests with iteration aren’t presented here since there is no difference with iteration changes. # PCA and t-SNE

PCA reveals the global patterns within the dataset, such as a strong positive correlation between GDP per capita, trade as a percentage of GDP, and exports as a percentage of GDP. Additionally, urbanization can be explained through its positive correlation with these variables, highlighting broader economic trends. When t-SNE is applied, it complements PCA by uncovering local trends and revealing how countries group together based on their economic characteristics. However, relying solely on t-SNE would limit the depth of the analysis. While t-SNE effectively visualizes clusters, it does not provide explicit information on why countries are grouped together. By incorporating PCA, we can extract detailed insights and draw stronger, data-driven conclusions about the underlying relationships in the dataset. The combined use of PCA and t-SNE enhances the analysis by balancing global structure identification (from PCA) with the ability to detect local clusters (from t-SNE). This approach ensures a more comprehensive understanding of the data.Using both techniques we can simplify high-dimensional data, visualize it and understand complex data structures effectively. Those techniques help us to transform large dataset into more manageable three-dimensional representation.

HW2_Yuliya Martyniuk

Yuliya Martyniuk

2024-12-21

PCA

t-SNE