Data is imported from the World Bank web source, section World Development Indicators. The year 2021 was chosen as it represents the period with the highest availability of the selected indicators.
The following indicators were selected:
- Access to electricity
(% of population). Access to electricity is the percentage of population
with access to electricity. Electrification data are collected from
industry, national surveys and international sources.
- Current
account balance (% of GDP). Current account balance is the sum of net
exports of goods and services, net primary income, and net secondary
income.
- Trade (% of GDP).Trade is the sum of exports and imports
of goods and services measured as a share of gross domestic product.
- Exports of goods and services (% of GDP). Exports of goods and
services represent the value of all goods and other market services
provided to the rest of the world. They include the value of
merchandise, freight, insurance, transport, travel, royalties, license
fees, and other services, such as communication, construction,
financial, information, business, personal, and government services.
They exclude compensation of employees and investment income (formerly
called factor services) and transfer payments.
- Net migration. Net
migration is the number of immigrants minus the number of emigrants,
including citizens and noncitizens.
- Agriculture, forestry, and
fishing, value added (% of GDP). Agriculture, forestry, and fishing
corresponds to ISIC divisions 1-3 and includes forestry, hunting, and
fishing, as well as cultivation of crops and livestock production.
-
Urban population (% of total population). Urban population refers to
people living in urban areas as defined by national statistical offices.
The data are collected and smoothed by United Nations Population
Division.
- GDP per capita (current US$). GDP per capita is gross
domestic product divided by midyear population. GDP is the sum of gross
value added by all resident producers in the economy plus any product
taxes and minus any subsidies not included in the value of the products.
It is calculated without making deductions for depreciation of
fabricated assets or for depletion and degradation of natural resources.
Data are in current U.S. dollars.
- Population density (people per
sq. km of land area). Population density is midyear population divided
by land area in square kilometers.
- Population growth (annual %).
Annual population growth rate. Population is based on the de facto
definition of population, which counts all residents regardless of legal
status or citizenship.
library(ggplot2)
library(corrplot)
library(factoextra)
library(labdsv)
library(psych)
library(ClusterR)
library(readxl)
library(cluster)
library(flexclust)
library(fpc)
library(clustertend)
library(ggthemes)
library(plotly)
library(stringr)
library(missMDA)
library(ade4)
library(smacof)
library(ggplot2)
library(Rtsne)
library(scales)
library(kableExtra)
library(factoextra)
library(psy)
library(pdp)
library(scales)
library(corrplot)
library(tidyr)
library(dplyr)
library(caret)
library(gridExtra)
Import the data and check data types.
setwd("C:/Users/ydmar/Documents/UW/UW - 1 semester/UL/HW 2")
variables <- read.csv("Data_HW2.csv",
header=TRUE, col.names = c("Country", "Code",
"Accesstoelectricity","CurrentAccountBalanceOfGDP",
"TradeOfGDP","ExportOfGDP","NetMigration",
"AgricultureOfGDP","UrbanPopPortion","GDPPerCapita",
"PopDensity","PopGrowth"))
str(variables)
## 'data.frame': 217 obs. of 12 variables:
## $ Country : chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Code : chr "AFG" "ALB" "DZA" "ASM" ...
## $ Accesstoelectricity : num 97.7 100 99.8 NA 100 48.2 100 100 100 100 ...
## $ CurrentAccountBalanceOfGDP: num -15.7 -7.6 -2.42 NA 14.06 ...
## $ TradeOfGDP : num 51.4 75.6 46.8 136.8 NA ...
## $ ExportOfGDP : num 14.3 31.1 23.4 44.3 NA ...
## $ NetMigration : int -548784 -32848 -10803 -1333 1514 29094 0 314 -19457 503 ...
## $ AgricultureOfGDP : num 33.598 17.793 11.235 NA 0.532 ...
## $ UrbanPopPortion : num 26.3 63 74.3 87.2 87.9 ...
## $ GDPPerCapita : num 356 6413 4161 15236 42426 ...
## $ PopDensity : num 61.3 102.6 18.8 246.1 166.7 ...
## $ PopGrowth : num 2.356 -0.927 1.619 -1.083 1.264 ...
summary(variables)
## Country Code Accesstoelectricity
## Length:217 Length:217 Min. : 7.70
## Class :character Class :character 1st Qu.: 87.10
## Mode :character Mode :character Median :100.00
## Mean : 87.18
## 3rd Qu.:100.00
## Max. :100.00
## NA's :2
## CurrentAccountBalanceOfGDP TradeOfGDP ExportOfGDP
## Min. :-148.000 Min. : 4.128 Min. : 2.25
## 1st Qu.: -6.786 1st Qu.: 55.274 1st Qu.: 22.31
## Median : -1.593 Median : 80.000 Median : 34.59
## Mean : -2.300 Mean : 91.635 Mean : 42.78
## 3rd Qu.: 3.883 3rd Qu.:110.000 3rd Qu.: 52.08
## Max. : 42.053 Max. :402.460 Max. :213.22
## NA's :17 NA's :16 NA's :23
## NetMigration AgricultureOfGDP UrbanPopPortion GDPPerCapita
## Min. :-994722.0 Min. : 0.01663 Min. : 13.46 Min. : 214.1
## 1st Qu.: -9888.0 1st Qu.: 2.16056 1st Qu.: 43.00 1st Qu.: 2518.6
## Median : -368.0 Median : 7.14749 Median : 62.97 Median : 7176.7
## Mean : 744.5 Mean : 11.77065 Mean : 61.73 Mean : 20303.6
## 3rd Qu.: 6493.0 3rd Qu.: 16.93128 3rd Qu.: 81.42 3rd Qu.: 27187.0
## Max. : 674787.0 Max. :110.00000 Max. :100.00 Max. :223823.4
## NA's :18 NA's :2 NA's :7
## PopDensity PopGrowth
## Min. : 0.138 Min. :-10.92744
## 1st Qu.: 39.427 1st Qu.: 0.08925
## Median : 94.195 Median : 0.83103
## Mean : 447.269 Mean : 0.78397
## 3rd Qu.: 238.573 3rd Qu.: 1.83582
## Max. :20681.818 Max. : 3.65720
## NA's :1
Remove NA values from dataset and create matrix only with numerical data.
pure_variables <- variables %>%
filter(complete.cases(select(., -Country, -Code)))
variable_mx <- as.matrix(pure_variables[,3:12])
Compute the covariance matrix to see if there is any relationship between variables.
cor_matrix <- cor(variable_mx, method = "pearson")
print(cor_matrix)
## Accesstoelectricity CurrentAccountBalanceOfGDP
## Accesstoelectricity 1.00000000 0.11620203
## CurrentAccountBalanceOfGDP 0.11620203 1.00000000
## TradeOfGDP 0.22408667 0.14100098
## ExportOfGDP 0.22367246 0.23637860
## NetMigration 0.04103629 0.08526557
## AgricultureOfGDP -0.37749488 -0.01892055
## UrbanPopPortion 0.50259339 0.17545155
## GDPPerCapita 0.36324984 0.25519912
## PopDensity 0.08534384 0.11381147
## PopGrowth -0.54490164 -0.16742597
## TradeOfGDP ExportOfGDP NetMigration
## Accesstoelectricity 0.22408667 0.22367246 0.04103629
## CurrentAccountBalanceOfGDP 0.14100098 0.23637860 0.08526557
## TradeOfGDP 1.00000000 0.95026002 0.06925632
## ExportOfGDP 0.95026002 1.00000000 0.09112753
## NetMigration 0.06925632 0.09112753 1.00000000
## AgricultureOfGDP -0.19101084 -0.27543366 -0.13791142
## UrbanPopPortion 0.32672829 0.38342431 0.23444995
## GDPPerCapita 0.44287896 0.51432791 0.24356124
## PopDensity 0.31559456 0.32139894 -0.03869927
## PopGrowth -0.27620165 -0.28555875 -0.03324977
## AgricultureOfGDP UrbanPopPortion GDPPerCapita
## Accesstoelectricity -0.37749488 0.5025934 0.3632498
## CurrentAccountBalanceOfGDP -0.01892055 0.1754516 0.2551991
## TradeOfGDP -0.19101084 0.3267283 0.4428790
## ExportOfGDP -0.27543366 0.3834243 0.5143279
## NetMigration -0.13791142 0.2344500 0.2435612
## AgricultureOfGDP 1.00000000 -0.3340769 -0.3016043
## UrbanPopPortion -0.33407692 1.0000000 0.5685653
## GDPPerCapita -0.30160427 0.5685653 1.0000000
## PopDensity 0.30982143 0.2033779 0.1989183
## PopGrowth 0.27010434 -0.4388513 -0.3791341
## PopDensity PopGrowth
## Accesstoelectricity 0.08534384 -0.54490164
## CurrentAccountBalanceOfGDP 0.11381147 -0.16742597
## TradeOfGDP 0.31559456 -0.27620165
## ExportOfGDP 0.32139894 -0.28555875
## NetMigration -0.03869927 -0.03324977
## AgricultureOfGDP 0.30982143 0.27010434
## UrbanPopPortion 0.20337788 -0.43885128
## GDPPerCapita 0.19891829 -0.37913406
## PopDensity 1.00000000 -0.15464858
## PopGrowth -0.15464858 1.00000000
corrplot::corrplot(cor_matrix, method = "circle")
The plot reveals a strong correlation between Trade of GDP (TradeOfGDP) and Export of GDP (ExportOfGDP). Since exports are a major component of trade (the sum of imports and exports), an increase in exports directly drives an increase in total trade activity. Consequently, an increase in TradeOfGDP is often associated with a rise in GDP per capita (GDPPerCapita), as both TradeOfGDP and ExportOfGDP are positively correlated with GDPPerCapita. Additionally, the plot highlights a positive correlation between Urban Population Proportion (UrbanPopPortion) and GDPPerCapita, ExportOfGDP, and TradeOfGDP. This suggests that more urbanized countries tend to achieve higher levels of GDP per capita. A further observation is the positive correlation between UrbanPopPortion and Access to Electricity, indicating that urbanized regions are more likely to have better access to electricity infrastructure.
Normalize the data before further analysis.
preproc1 <- preProcess(variable_mx, method=c("center", "scale"))
variable_mx.s <- predict(preproc1, variable_mx)
summary(variable_mx.s)
## Accesstoelectricity CurrentAccountBalanceOfGDP TradeOfGDP
## Min. :-3.18817 Min. :-9.66709 Min. :-1.4314
## 1st Qu.:-0.06691 1st Qu.:-0.27230 1st Qu.:-0.6261
## Median : 0.56151 Median : 0.03692 Median :-0.2197
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.56151 3rd Qu.: 0.39651 3rd Qu.: 0.3032
## Max. : 0.56151 Max. : 2.93712 Max. : 5.1396
## ExportOfGDP NetMigration AgricultureOfGDP UrbanPopPortion
## Min. :-1.2081 Min. :-7.04532 Min. :-0.8341 Min. :-2.05337
## 1st Qu.:-0.6097 1st Qu.:-0.09632 1st Qu.:-0.6667 1st Qu.:-0.76970
## Median :-0.2642 Median :-0.02643 Median :-0.3229 Median : 0.07717
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.2566 3rd Qu.: 0.04468 3rd Qu.: 0.3355 3rd Qu.: 0.83650
## Max. : 5.0094 Max. : 4.74319 Max. : 6.6850 Max. : 1.70653
## GDPPerCapita PopDensity PopGrowth
## Min. :-0.7043 Min. :-0.21309 Min. :-7.47161
## 1st Qu.:-0.6100 1st Qu.:-0.19242 1st Qu.:-0.44998
## Median :-0.4509 Median :-0.16416 Median : 0.01595
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.1493 3rd Qu.:-0.09063 3rd Qu.: 0.66638
## Max. : 4.9241 Max. :11.89901 Max. : 1.65063
Eigenvalues on the basis of covariance.
variable_mx.cov<-cov(variable_mx.s)
variable_mx.eigen<-eigen(variable_mx.cov)
variable_mx.eigen$values
## [1] 3.56704719 1.54843489 1.11757071 1.05361972 0.87965618 0.56278307
## [7] 0.46525012 0.39443745 0.37478974 0.03641093
head(variable_mx.eigen$vectors)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.3270633 -0.3324280 0.40069388 0.07938043 -0.05360354 -0.13955332
## [2,] -0.1772058 0.1086928 0.04756724 -0.50037595 0.81901644 -0.03490101
## [3,] -0.3913644 0.3659184 -0.26655656 0.31888786 0.01144333 -0.20166028
## [4,] -0.4184695 0.3356922 -0.28938771 0.25952451 0.08897670 -0.10073816
## [5,] -0.1255890 -0.1664455 -0.49790421 -0.60682593 -0.38102658 -0.42532389
## [6,] 0.2433332 0.4932585 0.28785342 -0.31279151 -0.14395305 -0.10647433
## [,7] [,8] [,9] [,10]
## [1,] 0.48962736 -0.4851487174 0.33983176 0.05221146
## [2,] 0.15474660 0.0455401661 -0.03583309 -0.06953282
## [3,] 0.03767639 -0.2019842631 -0.09980791 -0.66874446
## [4,] 0.04563945 -0.0536476139 -0.07555263 0.72963111
## [5,] 0.10320525 -0.0004294539 0.05947414 0.01032634
## [6,] -0.24378293 -0.6230867074 -0.16862623 0.09287441
xxx<-variable_mx.s
xxx.pca1<-prcomp(xxx, center=FALSE, scale.=FALSE)
xxx.pca1
## Standard deviations (1, .., p=10):
## [1] 1.8886628 1.2443612 1.0571522 1.0264598 0.9378999 0.7501887 0.6820925
## [8] 0.6280426 0.6122007 0.1908165
##
## Rotation (n x k) = (10 x 10):
## PC1 PC2 PC3 PC4
## Accesstoelectricity -0.3270633 0.33242801 -0.40069388 0.07938043
## CurrentAccountBalanceOfGDP -0.1772058 -0.10869278 -0.04756724 -0.50037595
## TradeOfGDP -0.3913644 -0.36591845 0.26655656 0.31888786
## ExportOfGDP -0.4184695 -0.33569221 0.28938771 0.25952451
## NetMigration -0.1255890 0.16644551 0.49790421 -0.60682593
## AgricultureOfGDP 0.2433332 -0.49325852 -0.28785342 -0.31279151
## UrbanPopPortion -0.3919354 0.17461029 -0.11831882 -0.19174349
## GDPPerCapita -0.4051167 0.02660823 0.09110463 -0.18691114
## PopDensity -0.1705403 -0.53846740 -0.37483550 -0.18750315
## PopGrowth 0.3339826 -0.19733210 0.43452835 -0.03667714
## PC5 PC6 PC7 PC8
## Accesstoelectricity -0.053603545 0.13955332 0.48962736 -0.4851487174
## CurrentAccountBalanceOfGDP 0.819016436 0.03490101 0.15474660 0.0455401661
## TradeOfGDP 0.011443328 0.20166028 0.03767639 -0.2019842631
## ExportOfGDP 0.088976701 0.10073816 0.04563945 -0.0536476139
## NetMigration -0.381026576 0.42532389 0.10320525 -0.0004294539
## AgricultureOfGDP -0.143953048 0.10647433 -0.24378293 -0.6230867074
## UrbanPopPortion -0.221325469 -0.44812374 0.12373000 -0.0265662546
## GDPPerCapita -0.046298123 -0.48590947 -0.52934827 -0.1352259326
## PopDensity -0.318062674 -0.05670013 0.29146555 0.5047474592
## PopGrowth -0.009739602 -0.54416610 0.53128500 -0.2384996029
## PC9 PC10
## Accesstoelectricity -0.33983176 -0.05221146
## CurrentAccountBalanceOfGDP 0.03583309 0.06953282
## TradeOfGDP 0.09980791 0.66874446
## ExportOfGDP 0.07555263 -0.72963111
## NetMigration -0.05947414 -0.01032634
## AgricultureOfGDP 0.16862623 -0.09287441
## UrbanPopPortion 0.70624079 0.02328586
## GDPPerCapita -0.50343670 0.04172474
## PopDensity -0.24424641 0.03940226
## PopGrowth -0.15397306 0.01704202
xxx.pca1$rotation
## PC1 PC2 PC3 PC4
## Accesstoelectricity -0.3270633 0.33242801 -0.40069388 0.07938043
## CurrentAccountBalanceOfGDP -0.1772058 -0.10869278 -0.04756724 -0.50037595
## TradeOfGDP -0.3913644 -0.36591845 0.26655656 0.31888786
## ExportOfGDP -0.4184695 -0.33569221 0.28938771 0.25952451
## NetMigration -0.1255890 0.16644551 0.49790421 -0.60682593
## AgricultureOfGDP 0.2433332 -0.49325852 -0.28785342 -0.31279151
## UrbanPopPortion -0.3919354 0.17461029 -0.11831882 -0.19174349
## GDPPerCapita -0.4051167 0.02660823 0.09110463 -0.18691114
## PopDensity -0.1705403 -0.53846740 -0.37483550 -0.18750315
## PopGrowth 0.3339826 -0.19733210 0.43452835 -0.03667714
## PC5 PC6 PC7 PC8
## Accesstoelectricity -0.053603545 0.13955332 0.48962736 -0.4851487174
## CurrentAccountBalanceOfGDP 0.819016436 0.03490101 0.15474660 0.0455401661
## TradeOfGDP 0.011443328 0.20166028 0.03767639 -0.2019842631
## ExportOfGDP 0.088976701 0.10073816 0.04563945 -0.0536476139
## NetMigration -0.381026576 0.42532389 0.10320525 -0.0004294539
## AgricultureOfGDP -0.143953048 0.10647433 -0.24378293 -0.6230867074
## UrbanPopPortion -0.221325469 -0.44812374 0.12373000 -0.0265662546
## GDPPerCapita -0.046298123 -0.48590947 -0.52934827 -0.1352259326
## PopDensity -0.318062674 -0.05670013 0.29146555 0.5047474592
## PopGrowth -0.009739602 -0.54416610 0.53128500 -0.2384996029
## PC9 PC10
## Accesstoelectricity -0.33983176 -0.05221146
## CurrentAccountBalanceOfGDP 0.03583309 0.06953282
## TradeOfGDP 0.09980791 0.66874446
## ExportOfGDP 0.07555263 -0.72963111
## NetMigration -0.05947414 -0.01032634
## AgricultureOfGDP 0.16862623 -0.09287441
## UrbanPopPortion 0.70624079 0.02328586
## GDPPerCapita -0.50343670 0.04172474
## PopDensity -0.24424641 0.03940226
## PopGrowth -0.15397306 0.01704202
As we can see, the first eigenvalue explains the most variance (~35.67%). The eigenvectors represents the loadings of original variables on each principal component.
summary(xxx.pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.8887 1.2444 1.0572 1.0265 0.93790 0.75019 0.68209
## Proportion of Variance 0.3567 0.1548 0.1118 0.1054 0.08797 0.05628 0.04653
## Cumulative Proportion 0.3567 0.5115 0.6233 0.7287 0.81663 0.87291 0.91944
## PC8 PC9 PC10
## Standard deviation 0.62804 0.61220 0.19082
## Proportion of Variance 0.03944 0.03748 0.00364
## Cumulative Proportion 0.95888 0.99636 1.00000
fviz_eig(xxx.pca1, choice='eigenvalue') +
geom_hline(yintercept = 1, linetype = "dashed", color = "red", size = 1)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Eigenvalues between 3 and 4 components are higher than 1. We can choose
between 3 or 4 components, since they are explain most of the
variance.
fviz_eig(xxx.pca1)
Scree plot represents graphically the percentage of variance explained
by every component.The results show that PC1 explains 35% of variation,
PC2 explains 15 % of variation, PC3 explains 11% of variation and PC4
explains 10% of variation.
eig.val<-get_eigenvalue(xxx.pca1)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.56704719 35.6704719 35.67047
## Dim.2 1.54843489 15.4843489 51.15482
## Dim.3 1.11757071 11.1757071 62.33053
## Dim.4 1.05361972 10.5361972 72.86673
## Dim.5 0.87965618 8.7965618 81.66329
## Dim.6 0.56278307 5.6278307 87.29112
## Dim.7 0.46525012 4.6525012 91.94362
## Dim.8 0.39443745 3.9443745 95.88799
## Dim.9 0.37478974 3.7478974 99.63589
## Dim.10 0.03641093 0.3641093 100.00000
And again, eigenvalues are higher than 1 for the Dim.1, Dim.2, Dim.3 and for the Dim.4.
a<-summary(xxx.pca1)
plot(a$importance[3,],type="p")
The plot shows what percent of variance has been explained for each
number of principal components (aggregate variance explained). The first
PCs (PC1, PC2, PC3 and PC4) explain most of the variance. PC1 explains
~35%. PC1 + PC2 explain ~50%, PC1 + PC2 + PC3 explain ~ 62% (quite
acceptable). And PC1+ PC2 + PC3 + PC4 explain 72%. The plot suggests
that retaining the first 3–4 components would explain most of the
variance (~ 60% - 70%) without including all components. The “cloud of
points” graph shows individual observations quality of
representation.
2D visualization
fviz_pca_ind(xxx.pca1, col.ind="cos2", geom = "point", gradient.cols = c("blue","steelblue","gray"))
fviz_pca_var(xxx.pca1, col.var="steelblue")
3D visualization
cos2 <- apply(xxx.pca1$x[, 1:3]^2, 1, sum) / apply(xxx.pca1$x^2, 1, sum) # quality of representation of a data point on the selected principal component
colors <- colorRampPalette(c("blue", "steelblue", "gray"))(length(cos2))
colors <- colors[rank(cos2)]
plot_ly(
as.data.frame(xxx.pca1$x[, 1:3]),
x = ~PC1,
y = ~PC2,
z = ~PC3,
color = ~cos2,
colors = c("blue", "steelblue", "gray"),
marker = list(size = 5),
type = "scatter3d",
mode = "markers"
) %>%
layout(
scene = list(
xaxis = list(title = "PC1"),
yaxis = list(title = "PC2"),
zaxis = list(title = "PC3")
),
title = "PCA Individuals Plot (3D)"
)
This plot extends the 2D PCA individuals plot by incorporating the third principal components along with PC1 and PC2. Variance explained by the first three PCs ( ~62%). Most observations appear clustered near the center (origin), which suggests that they have moderate or balanced contributions from the variables related to PC1, PC2, and PC3.
plot_ly(
x = ~xxx.pca1$rotation[, 1],
y = ~xxx.pca1$rotation[, 2],
z = ~xxx.pca1$rotation[, 3],
type = 'scatter3d',
mode = 'markers+text',
text = rownames(xxx.pca1$rotation),
marker = list(size = 5, color = 'steelblue')
) %>%
layout(
scene = list(
xaxis = list(title = "PC1"),
yaxis = list(title = "PC2"),
zaxis = list(title = "PC3")
)
)
The variable plots displayed above show relations between variables as well as the “quality” of all factors. Variables correlated positively are close to each other whereas those correlated negatively are on the opposite sites of the plot.“Quality” of the variable is presented by the distance from the center. We can see what variables and how strong influence on the selected dimension based on the length of the arrow (example, TradeOfGDP and ExportOfGDP have a strong positive correlation and strongly represented in the PC1-PC2 plane; both variables contribute significantly to PC1 and PC2). However, we can notice other variables with long arrows (GDPPerCapita, UrbamPopPortion and AccesstoElectrisity), but they are not strongly correlated between each other. Regarding the 3D plots, the coordinates represents its loadings to PC1, PC2 and PC3. Variables, such as ExportOfGDP, TradeOfGDP and NetMigration are farther from the origin, indicating they strongly influence the components. And again, ExportOfGDP and TradeOfGDP are positioned close together, indicating a strong positive correlation. Variables positioned far apart or oppositely across the origin are negatively correlated (example: PopGrowth and AgricultureOfGDP).They might have opposing contributions to the components.
loading_scores_PC_1<-xxx.pca1$rotation[,1]
fac_scores_PC_1<-abs(loading_scores_PC_1)
fac_scores_PC_1_ranked<-names(sort(fac_scores_PC_1, decreasing=T))
xxx.pca1$rotation[fac_scores_PC_1_ranked, 1]
## ExportOfGDP GDPPerCapita
## -0.4184695 -0.4051167
## UrbanPopPortion TradeOfGDP
## -0.3919354 -0.3913644
## PopGrowth Accesstoelectricity
## 0.3339826 -0.3270633
## AgricultureOfGDP CurrentAccountBalanceOfGDP
## 0.2433332 -0.1772058
## PopDensity NetMigration
## -0.1705403 -0.1255890
As we can see, variables were sorted based on their absolute contributions to PC1. The following contributions are the strongest one: ExportOfGDP, GDPPerCapita, UrbanPopPortion, TradeOfGDP, PopGrowth.
loading_scores_PC_2<-xxx.pca1$rotation[,2]
fac_scores_PC_2<-abs(loading_scores_PC_2)
fac_scores_PC_2_ranked<-names(sort(fac_scores_PC_2, decreasing=T))
xxx.pca1$rotation[fac_scores_PC_2_ranked, 2]
## PopDensity AgricultureOfGDP
## -0.53846740 -0.49325852
## TradeOfGDP ExportOfGDP
## -0.36591845 -0.33569221
## Accesstoelectricity PopGrowth
## 0.33242801 -0.19733210
## UrbanPopPortion NetMigration
## 0.17461029 0.16644551
## CurrentAccountBalanceOfGDP GDPPerCapita
## -0.10869278 0.02660823
As we can observe from the PC2, the following contributions are the strongest one: PopDensity, AgricultureOfGDP, TradeOfGDP, ExportOfGDP and AccesstoElectricity.
loading_scores_PC_3<-xxx.pca1$rotation[, 3]
fac_scores_PC_3<-abs(loading_scores_PC_3)
fac_scores_PC_3_ranked<-names(sort(fac_scores_PC_3, decreasing = TRUE))
xxx.pca1$rotation[fac_scores_PC_3_ranked, 3]
## NetMigration PopGrowth
## 0.49790421 0.43452835
## Accesstoelectricity PopDensity
## -0.40069388 -0.37483550
## ExportOfGDP AgricultureOfGDP
## 0.28938771 -0.28785342
## TradeOfGDP UrbanPopPortion
## 0.26655656 -0.11831882
## GDPPerCapita CurrentAccountBalanceOfGDP
## 0.09110463 -0.04756724
Regarding the PC3, we can observe NetMigration, PopGrowth, Accesstoelectricity and PopDensity as strong contributions.
ind<-get_pca_ind(xxx.pca1)
print(ind)
## Principal Component Analysis Results for individuals
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the individuals"
## 2 "$cos2" "Cos2 for the individuals"
## 3 "$contrib" "contributions of the individuals"
head(ind$coord) #coordinates of variables
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## 1 2.6821596 -0.9663685 -2.3277322 2.3777296 0.84523647 -1.1881496
## 2 0.0441960 0.5174540 -1.0970598 0.2181173 -0.25536799 0.7036110
## 3 0.5322906 0.7646938 -0.4414868 -0.2449284 -0.09853443 -0.4359844
## 4 1.0350411 -0.6813527 1.2936583 -0.7236341 0.85206642 -0.7777434
## 5 0.2871516 0.3508099 0.2308214 1.2019684 -0.38485450 0.8109654
## 6 -0.1402277 1.3364393 -0.8863389 -0.6163592 -0.07811597 -0.5001120
## Dim.7 Dim.8 Dim.9 Dim.10
## 1 -0.08949271 -1.1831738 -0.6496809 -0.05067535
## 2 -0.32051027 -0.2083634 0.3263366 -0.03672802
## 3 0.78099144 -0.2118803 0.3495734 -0.09120295
## 4 0.49682084 0.5243659 0.8881107 -0.12526784
## 5 -0.04009047 0.1103267 -1.3911295 -0.04770611
## 6 0.37493616 0.1638155 0.8152576 -0.07105062
head(ind$contrib) #contributions of variables
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## 1 1.1020702381 0.32956532 2.64935355 2.93217534 0.443805216 1.3707245
## 2 0.0002992306 0.09449298 0.58848381 0.02467432 0.040510617 0.4806989
## 3 0.0434047853 0.20636264 0.09530363 0.03111311 0.006031314 0.1845652
## 4 0.1641176306 0.16383246 0.81830097 0.27158329 0.451006557 0.5873279
## 5 0.0126317231 0.04343099 0.02605111 0.74929202 0.092008743 0.6385759
## 6 0.0030123655 0.63031038 0.38412586 0.19703018 0.003790667 0.2428524
## Dim.7 Dim.8 Dim.9 Dim.10
## 1 0.009406711 1.93940210 0.6154056 0.03853990
## 2 0.120655277 0.06014695 0.1552719 0.02024473
## 3 0.716399000 0.06219449 0.1781713 0.12483459
## 4 0.289909217 0.38092518 1.1499936 0.23550294
## 5 0.001887751 0.01686289 2.8216057 0.03415585
## 6 0.165111406 0.03717753 0.9690606 0.07576226
var<-get_pca_var(xxx.pca1)
a<-fviz_contrib(xxx.pca1, "var", axes=1, xtickslab.rt=90) +
theme(axis.title.x = element_blank())
b<-fviz_contrib(xxx.pca1, "var", axes=2, xtickslab.rt=90) +
theme(axis.title.x = element_blank())
c<-fviz_contrib(xxx.pca1, "var", axes=3, xtickslab.rt=90) +
theme(axis.title.x = element_blank())
grid.arrange(a)
grid.arrange(b)
grid.arrange(c)
Plots represent the same variables contributions as we describe above. We only missed Accesstoelectricity for the PC1. The first three components have eigenvalues greater than 1, indicating they explain more variance than an average variable (PC1 - 3.57, PC2 - 1.55, PC3 - 1.12). Combined the first three PCs explain ~62.3% of the variance, that indicates, that we can reduce dimensions to three. Note: based on the findings, including PC4 would increase explanation to ~72.8%, but it’s not so significant difference compared to the first three components. Such variables as TradeOfGDP and ExportOfGDP are highly correlated. We can see it in the correlation matrix and PCA loadings. UrbanPopPortion and GDPPerCapita is also positively correlated.
tsne_data <- variables[, c("Accesstoelectricity", "CurrentAccountBalanceOfGDP",
"TradeOfGDP", "ExportOfGDP", "NetMigration",
"AgricultureOfGDP", "UrbanPopPortion", "GDPPerCapita",
"PopDensity", "PopGrowth")]
tsne_data <- na.omit(tsne_data)
t-SNE (1). Set 3 dimensions, perplexity = 30.
set.seed(42)
tsne_out_1 <- Rtsne(variable_mx, dims = 3, perplexity = 30, verbose = TRUE, max_iter = 1000)
## Performing PCA
## Read the 183 x 10 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 3, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.03 seconds (sparsity = 0.691809)!
## Learning embedding...
## Iteration 50: error is 49.671298 (50 iterations in 0.02 seconds)
## Iteration 100: error is 48.664024 (50 iterations in 0.02 seconds)
## Iteration 150: error is 48.597943 (50 iterations in 0.03 seconds)
## Iteration 200: error is 48.404369 (50 iterations in 0.02 seconds)
## Iteration 250: error is 48.717785 (50 iterations in 0.02 seconds)
## Iteration 300: error is 0.191370 (50 iterations in 0.02 seconds)
## Iteration 350: error is 0.110262 (50 iterations in 0.01 seconds)
## Iteration 400: error is 0.105177 (50 iterations in 0.01 seconds)
## Iteration 450: error is 0.106944 (50 iterations in 0.02 seconds)
## Iteration 500: error is 0.107461 (50 iterations in 0.01 seconds)
## Iteration 550: error is 0.107567 (50 iterations in 0.01 seconds)
## Iteration 600: error is 0.109085 (50 iterations in 0.01 seconds)
## Iteration 650: error is 0.108467 (50 iterations in 0.01 seconds)
## Iteration 700: error is 0.110864 (50 iterations in 0.01 seconds)
## Iteration 750: error is 0.110261 (50 iterations in 0.02 seconds)
## Iteration 800: error is 0.109214 (50 iterations in 0.01 seconds)
## Iteration 850: error is 0.110537 (50 iterations in 0.02 seconds)
## Iteration 900: error is 0.109831 (50 iterations in 0.02 seconds)
## Iteration 950: error is 0.109585 (50 iterations in 0.01 seconds)
## Iteration 1000: error is 0.109022 (50 iterations in 0.02 seconds)
## Fitting performed in 0.35 seconds.
plot_ly(
data.frame(
x = tsne_out_1$Y[, 1],
y = tsne_out_1$Y[, 2],
z = tsne_out_1$Y[, 3],
Country = pure_variables$Country[!is.na(rowSums(tsne_data))]
),
x = ~x,
y = ~y,
z = ~z,
color = ~Country,
type = "scatter3d",
mode = "markers"
) %>%
layout(
title = "Data Visualized via t-SNE p 30 & iter 1000"
)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
We can slightly indicate clusters from that plot. There are lot’s of regions where clusters overlap or close to each other. The overlap may suggest similarities in some factors across countries.
t-SNE (2). Set 3 dimensions, perplexity = 50.
set.seed(42)
tsne_out_2 <- Rtsne(variable_mx, dims = 3, perplexity = 50, verbose = TRUE, max_iter = 1000)
## Performing PCA
## Read the 183 x 10 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 3, perplexity = 50.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.02 seconds (sparsity = 0.953149)!
## Learning embedding...
## Iteration 50: error is 44.244555 (50 iterations in 0.02 seconds)
## Iteration 100: error is 43.848997 (50 iterations in 0.03 seconds)
## Iteration 150: error is 43.841653 (50 iterations in 0.03 seconds)
## Iteration 200: error is 43.282574 (50 iterations in 0.03 seconds)
## Iteration 250: error is 43.985355 (50 iterations in 0.03 seconds)
## Iteration 300: error is 0.214034 (50 iterations in 0.02 seconds)
## Iteration 350: error is 0.054697 (50 iterations in 0.01 seconds)
## Iteration 400: error is 0.054083 (50 iterations in 0.01 seconds)
## Iteration 450: error is 0.052128 (50 iterations in 0.01 seconds)
## Iteration 500: error is 0.052159 (50 iterations in 0.02 seconds)
## Iteration 550: error is 0.049671 (50 iterations in 0.01 seconds)
## Iteration 600: error is 0.051869 (50 iterations in 0.01 seconds)
## Iteration 650: error is 0.048967 (50 iterations in 0.01 seconds)
## Iteration 700: error is 0.051522 (50 iterations in 0.01 seconds)
## Iteration 750: error is 0.050943 (50 iterations in 0.01 seconds)
## Iteration 800: error is 0.053591 (50 iterations in 0.02 seconds)
## Iteration 850: error is 0.055399 (50 iterations in 0.02 seconds)
## Iteration 900: error is 0.056304 (50 iterations in 0.02 seconds)
## Iteration 950: error is 0.056295 (50 iterations in 0.02 seconds)
## Iteration 1000: error is 0.055907 (50 iterations in 0.02 seconds)
## Fitting performed in 0.39 seconds.
plot_ly(
data.frame(
x = tsne_out_2$Y[, 1],
y = tsne_out_2$Y[, 2],
z = tsne_out_2$Y[, 3],
Country = pure_variables$Country[!is.na(rowSums(tsne_data))]
),
x = ~x,
y = ~y,
z = ~z,
color = ~Country,
type = "scatter3d",
mode = "markers"
) %>%
layout(
title = "Data Visualized via t-SNE p 50 & iter 1000"
)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
We can observe that more regions become overlapped now and cluster identification is slightly possible now.
t-SNE (3). Set 3 dimensions, perplexity = 10.
set.seed(42)
tsne_out_3<- Rtsne(variable_mx, dims = 3, perplexity = 10,verbose = TRUE, max_iter = 1000)
## Performing PCA
## Read the 183 x 10 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 3, perplexity = 10.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.01 seconds (sparsity = 0.240855)!
## Learning embedding...
## Iteration 50: error is 55.062593 (50 iterations in 0.02 seconds)
## Iteration 100: error is 54.179610 (50 iterations in 0.02 seconds)
## Iteration 150: error is 53.594269 (50 iterations in 0.01 seconds)
## Iteration 200: error is 54.742051 (50 iterations in 0.02 seconds)
## Iteration 250: error is 52.997651 (50 iterations in 0.02 seconds)
## Iteration 300: error is 0.503769 (50 iterations in 0.01 seconds)
## Iteration 350: error is 0.268688 (50 iterations in 0.01 seconds)
## Iteration 400: error is 0.260427 (50 iterations in 0.01 seconds)
## Iteration 450: error is 0.254632 (50 iterations in 0.01 seconds)
## Iteration 500: error is 0.249588 (50 iterations in 0.01 seconds)
## Iteration 550: error is 0.246131 (50 iterations in 0.01 seconds)
## Iteration 600: error is 0.243871 (50 iterations in 0.01 seconds)
## Iteration 650: error is 0.243083 (50 iterations in 0.01 seconds)
## Iteration 700: error is 0.241250 (50 iterations in 0.01 seconds)
## Iteration 750: error is 0.238840 (50 iterations in 0.01 seconds)
## Iteration 800: error is 0.236589 (50 iterations in 0.01 seconds)
## Iteration 850: error is 0.234576 (50 iterations in 0.01 seconds)
## Iteration 900: error is 0.234085 (50 iterations in 0.01 seconds)
## Iteration 950: error is 0.234982 (50 iterations in 0.01 seconds)
## Iteration 1000: error is 0.233785 (50 iterations in 0.01 seconds)
## Fitting performed in 0.25 seconds.
plot_ly(
data.frame(
x = tsne_out_3$Y[, 1],
y = tsne_out_3$Y[, 2],
z = tsne_out_3$Y[, 3],
Country = pure_variables$Country[!is.na(rowSums(tsne_data))]
),
x = ~x,
y = ~y,
z = ~z,
color = ~Country,
type = "scatter3d",
mode = "markers"
) %>%
layout(
title = "Data Visualized via t-SNE p 10 & iter 1000"
)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
Compared to previous observations with perplexity values of 30 and 50, the clusters representing different “economic behaviors” are more distinct and well-formed. This suggests that t-SNE with these parameters effectively captures local groupings within the data based on economic performance. Lower perplexity values result in tighter and smaller clusters.The common characteristics of these clusters are driven by economic activity, allowing for the identification of groups such as developed, developing, and underdeveloped countries, small economies, or nations characterized by high GDP, trade intensity, and urbanization. Points that lie far from the main clusters likely represent outlier countries with unique economic profiles that do not align with other groups. Approximately 10 clusters, along with outliers, can be observed. For greater precision and to refine the clustering further, applying a perplexity value of 5 may be beneficial.
t-SNE (4). Set 3 dimensions, perplexity = 5.
set.seed(42)
tsne_out_4<- Rtsne(variable_mx, dims = 3, perplexity = 5,verbose = TRUE, max_iter = 1000)
## Performing PCA
## Read the 183 x 10 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 3, perplexity = 5.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.00 seconds (sparsity = 0.111320)!
## Learning embedding...
## Iteration 50: error is 60.797243 (50 iterations in 0.01 seconds)
## Iteration 100: error is 55.673925 (50 iterations in 0.01 seconds)
## Iteration 150: error is 56.398287 (50 iterations in 0.01 seconds)
## Iteration 200: error is 55.823025 (50 iterations in 0.01 seconds)
## Iteration 250: error is 57.359579 (50 iterations in 0.01 seconds)
## Iteration 300: error is 1.068412 (50 iterations in 0.01 seconds)
## Iteration 350: error is 0.337501 (50 iterations in 0.01 seconds)
## Iteration 400: error is 0.312611 (50 iterations in 0.01 seconds)
## Iteration 450: error is 0.296204 (50 iterations in 0.01 seconds)
## Iteration 500: error is 0.291130 (50 iterations in 0.01 seconds)
## Iteration 550: error is 0.287907 (50 iterations in 0.01 seconds)
## Iteration 600: error is 0.282089 (50 iterations in 0.01 seconds)
## Iteration 650: error is 0.278523 (50 iterations in 0.01 seconds)
## Iteration 700: error is 0.277947 (50 iterations in 0.01 seconds)
## Iteration 750: error is 0.276734 (50 iterations in 0.01 seconds)
## Iteration 800: error is 0.273748 (50 iterations in 0.01 seconds)
## Iteration 850: error is 0.271295 (50 iterations in 0.01 seconds)
## Iteration 900: error is 0.270727 (50 iterations in 0.01 seconds)
## Iteration 950: error is 0.268156 (50 iterations in 0.01 seconds)
## Iteration 1000: error is 0.267512 (50 iterations in 0.01 seconds)
## Fitting performed in 0.21 seconds.
plot_ly(
data.frame(
x = tsne_out_4$Y[, 1],
y = tsne_out_4$Y[, 2],
z = tsne_out_4$Y[, 3],
Country = pure_variables$Country[!is.na(rowSums(tsne_data))]
),
x = ~x,
y = ~y,
z = ~z,
color = ~Country,
type = "scatter3d",
mode = "markers"
) %>%
layout(
title = "Data Visualized via t-SNE p 5 & iter 1000"
)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
Considering that plot, we can better identify number of clusters. We can use this information to understand the patterns withing the dataset. The baseline setting (perplexity 5, iterations 1000) is effective for the imported dataset, providing a good starting point for visualization. Increasing perplexity can sometimes merge clusters that should be distinct,indicating that the chosen perplexity may be too high for preserving the local structure of the dataset. Tests with iteration aren’t presented here since there is no difference with iteration changes. # PCA and t-SNE
PCA reveals the global patterns within the dataset, such as a strong positive correlation between GDP per capita, trade as a percentage of GDP, and exports as a percentage of GDP. Additionally, urbanization can be explained through its positive correlation with these variables, highlighting broader economic trends. When t-SNE is applied, it complements PCA by uncovering local trends and revealing how countries group together based on their economic characteristics. However, relying solely on t-SNE would limit the depth of the analysis. While t-SNE effectively visualizes clusters, it does not provide explicit information on why countries are grouped together. By incorporating PCA, we can extract detailed insights and draw stronger, data-driven conclusions about the underlying relationships in the dataset. The combined use of PCA and t-SNE enhances the analysis by balancing global structure identification (from PCA) with the ability to detect local clusters (from t-SNE). This approach ensures a more comprehensive understanding of the data.Using both techniques we can simplify high-dimensional data, visualize it and understand complex data structures effectively. Those techniques help us to transform large dataset into more manageable three-dimensional representation.