Dimension Reduction of Country Data

Introduction

The aim of this project is to apply dimension reduction methods to a country data set. Multidimensional Scaling (MDS) and Principal Component Analysis (PCA) are the main methods used, and t-SNE is shown briefly at the end.

MDS is a statistical technique that aims to represent the structure of data by positioning similar data points closely together and dissimilar data points farther apart in a lower dimensional space. This method can be used for data visualization, clustering, and data mining.
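
The idea can be illustrated with a minimal sketch using base R's cmdscale() (classical MDS). The four points and all object names below are made up for illustration and are not part of the country data.

```r
# Toy example: four points in 3-D (made-up values)
pts <- matrix(c(0, 0, 0,
                1, 0, 0,
                0, 1, 0,
                5, 5, 1),
              ncol = 3, byrow = TRUE,
              dimnames = list(c("a", "b", "c", "d"), NULL))

d_orig <- dist(pts)               # pairwise Euclidean distances in 3-D
conf2d <- cmdscale(d_orig, k = 2) # classical MDS configuration in 2-D
d_2d   <- dist(conf2d)            # distances in the reduced space

# Points that were close (a, b, c) stay close; the distant point d stays far
round(as.matrix(d_orig), 2)
round(as.matrix(d_2d), 2)
```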

Meanwhile, PCA is a method that transforms a high-dimensional dataset into a smaller set of variables, known as principal components, while preserving as much of the original variance in the data as possible.
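
As a small illustration, the sketch below runs base R's prcomp() on made-up data in which two of three variables are strongly related. The data and names are assumptions chosen only to show how the variance is redistributed across components.

```r
set.seed(1)  # made-up data, fixed for reproducibility
x1 <- rnorm(100)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(100, sd = 0.3),  # strongly related to x1
                  x3 = rnorm(100))                 # unrelated noise

toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of the total variance captured by each principal component
var_share <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(var_share, 3)
cumsum(var_share)
```

Because x1 and x2 carry nearly the same information, the first component should absorb most of their joint variance, so fewer than three components are needed to summarize the data.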

Although both MDS and PCA can reduce the dimensionality of a data set, they differ in their approach. MDS starts from pairwise dissimilarities between objects and looks for a low-dimensional configuration that reproduces them, which makes it primarily a visualization tool. PCA starts from the variables themselves and builds new, uncorrelated variables that retain as much of the original variance as possible.

Description of the data set

The data set consists of 167 countries and 10 columns: the country name and nine numeric features such as child mortality, health spending, income and inflation.

As the dimension reduction techniques used here operate on numeric data, the column containing the country names has to be removed before any dimension reduction is performed.

Moreover, for better performance and accurate dimension reduction results, it is highly recommended to scale the numeric features of the data set. The scaling process is essential because the dimension reduction techniques such as MDS and PCA are highly sensitive to differences in the scales of the features. Failure to scale the data can result in the features with the larger magnitudes dominating the analysis and skewing the results.
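
The effect of scale can be seen in a minimal sketch with two made-up variables of very different magnitude. The data and names are illustrative assumptions, and prcomp() stands in for any scale-sensitive method.

```r
set.seed(42)  # made-up data
toy <- data.frame(small = rnorm(50, mean = 5, sd = 1),        # e.g. a percentage
                  large = rnorm(50, mean = 20000, sd = 5000)) # e.g. an income

pca_raw    <- prcomp(toy, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(toy, center = TRUE, scale. = TRUE)

# Absolute loadings on the first component
abs(pca_raw$rotation[, 1])     # almost entirely the large-magnitude variable
abs(pca_scaled$rotation[, 1])  # both variables contribute equally
```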

MDS

Multidimensional Scaling experiments

countries <- read.csv("Country-data.csv")  
head(countries)

##               country child_mort exports health imports income inflation
## 1         Afghanistan       90.2    10.0   7.58    44.9   1610      9.44
## 2             Albania       16.6    28.0   6.55    48.6   9930      4.49
## 3             Algeria       27.3    38.4   4.17    31.4  12900     16.10
## 4              Angola      119.0    62.3   2.85    42.9   5900     22.40
## 5 Antigua and Barbuda       10.3    45.5   6.03    58.9  19100      1.44
## 6           Argentina       14.5    18.9   8.10    16.0  18700     20.90
##   life_expec total_fer  gdpp
## 1       56.2      5.82   553
## 2       76.3      1.65  4090
## 3       76.5      2.89  4460
## 4       60.1      6.16  3530
## 5       76.8      2.13 12200
## 6       75.8      2.37 10300

colnames(countries)

##  [1] "country"    "child_mort" "exports"    "health"     "imports"   
##  [6] "income"     "inflation"  "life_expec" "total_fer"  "gdpp"

Conversion of variables to numeric

summary(countries)

##    country            child_mort        exports            health      
##  Length:167         Min.   :  2.60   Min.   :  0.109   Min.   : 1.810  
##  Class :character   1st Qu.:  8.25   1st Qu.: 23.800   1st Qu.: 4.920  
##  Mode  :character   Median : 19.30   Median : 35.000   Median : 6.320  
##                     Mean   : 38.27   Mean   : 41.109   Mean   : 6.816  
##                     3rd Qu.: 62.10   3rd Qu.: 51.350   3rd Qu.: 8.600  
##                     Max.   :208.00   Max.   :200.000   Max.   :17.900  
##     imports             income         inflation         life_expec   
##  Min.   :  0.0659   Min.   :   609   Min.   : -4.210   Min.   :32.10  
##  1st Qu.: 30.2000   1st Qu.:  3355   1st Qu.:  1.810   1st Qu.:65.30  
##  Median : 43.3000   Median :  9960   Median :  5.390   Median :73.10  
##  Mean   : 46.8902   Mean   : 17145   Mean   :  7.782   Mean   :70.56  
##  3rd Qu.: 58.7500   3rd Qu.: 22800   3rd Qu.: 10.750   3rd Qu.:76.80  
##  Max.   :174.0000   Max.   :125000   Max.   :104.000   Max.   :82.80  
##    total_fer          gdpp       
##  Min.   :1.150   Min.   :   231  
##  1st Qu.:1.795   1st Qu.:  1330  
##  Median :2.410   Median :  4660  
##  Mean   :2.948   Mean   : 12964  
##  3rd Qu.:3.880   3rd Qu.: 14050  
##  Max.   :7.490   Max.   :105000

str(countries)

## 'data.frame':    167 obs. of  10 variables:
##  $ country   : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
##  $ child_mort: num  90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
##  $ exports   : num  10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
##  $ health    : num  7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
##  $ imports   : num  44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
##  $ income    : int  1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
##  $ inflation : num  9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
##  $ life_expec: num  56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
##  $ total_fer : num  5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
##  $ gdpp      : int  553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...

countries[!complete.cases(countries),]

##  [1] country    child_mort exports    health     imports    income    
##  [7] inflation  life_expec total_fer  gdpp      
## <0 rows> (or 0-length row.names)

countries$income <- as.numeric(countries$income)
countries$gdpp <- as.numeric(countries$gdpp)

Distribution of variables

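library(tidyr)    # gather() and the %>% pipe
library(ggplot2)

countries %>%
  gather(Features, value, 2:10) %>%   # long format: one row per country and feature
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "white", colour = "black") +
  facet_wrap(~Features, scales = "free_x") +
  labs(x = "Values", y = "Frequency")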

str(countries)

## 'data.frame':    167 obs. of  10 variables:
##  $ country   : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
##  $ child_mort: num  90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
##  $ exports   : num  10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
##  $ health    : num  7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
##  $ imports   : num  44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
##  $ income    : num  1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
##  $ inflation : num  9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
##  $ life_expec: num  56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
##  $ total_fer : num  5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
##  $ gdpp      : num  553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...

The correlation plot displays the associations between pairs of variables in a given dataset. Positive correlations between two variables are represented by a darker blue shade in the plot, while negative correlations are indicated by a shade of red. Additionally, the size of the dots in the plot corresponds to the strength of the correlation between the two variables.

library(corrplot)
cor.mat<-cor(countries[,-1], method="pearson") # named cor.mat to avoid masking cor()
corrplot(cor.mat, order ="alphabet", tl.cex=0.6)

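This is the correlation plot after normalizing the data set. Because Pearson correlation is unaffected by centering and rescaling each variable, it is identical to the plot above.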

library(clusterSim) # data.Normalization()
n<-data.Normalization(countries[,-1], type="n1",normalization="column") # n1: (x - mean) / sd per column
n.cor<-cor(n, method="pearson") 
corrplot(n.cor, order ="alphabet", tl.cex=0.6)
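
The correlation matrix does not change under standardization. A minimal check on made-up data, using base R's scale() as a stand-in for the normalization step (an assumption for the purpose of illustration):

```r
set.seed(7)  # made-up data
toy <- data.frame(a = rnorm(30, 100, 20), b = runif(30, 0, 5))
toy$c <- 2 * toy$a + rnorm(30, sd = 10)

toy_std <- scale(toy)  # (x - mean) / sd per column

all.equal(cor(toy), cor(toy_std))  # TRUE up to floating-point error
```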

library(GGally) # ggpairs()
ggpairs(as.data.frame(countries[,-1]))

library(qgraph)
qgraph(cor(countries[,-1]), shape="rectangle", posCol="blue", negCol="pink")

Normalization of variables

data<-countries[,-1]    
n<-data.Normalization(data, type="n1", normalization="column")
summary(n)

##    child_mort         exports            health           imports       
##  Min.   :-0.8845   Min.   :-1.4957   Min.   :-1.8223   Min.   :-1.9341  
##  1st Qu.:-0.7444   1st Qu.:-0.6314   1st Qu.:-0.6901   1st Qu.:-0.6894  
##  Median :-0.4704   Median :-0.2229   Median :-0.1805   Median :-0.1483  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5909   3rd Qu.: 0.3736   3rd Qu.: 0.6496   3rd Qu.: 0.4899  
##  Max.   : 4.2086   Max.   : 5.7964   Max.   : 4.0353   Max.   : 5.2504  
##      income          inflation         life_expec        total_fer      
##  Min.   :-0.8577   Min.   :-1.1344   Min.   :-4.3242   Min.   :-1.1877  
##  1st Qu.:-0.7153   1st Qu.:-0.5649   1st Qu.:-0.5910   1st Qu.:-0.7616  
##  Median :-0.3727   Median :-0.2263   Median : 0.2861   Median :-0.3554  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.2934   3rd Qu.: 0.2808   3rd Qu.: 0.7021   3rd Qu.: 0.6157  
##  Max.   : 5.5947   Max.   : 9.1023   Max.   : 1.3768   Max.   : 3.0003  
##       gdpp         
##  Min.   :-0.69471  
##  1st Qu.:-0.63475  
##  Median :-0.45307  
##  Mean   : 0.00000  
##  3rd Qu.: 0.05924  
##  Max.   : 5.02140
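
The summary above (mean 0 in every column) is consistent with standardization, that is, subtracting the column mean and dividing by the column standard deviation. The sketch below reproduces that operation in base R on the first six rows of two columns from head(countries). Because it uses only six rows, the numbers will differ from the full-data summary.

```r
# First six rows of two columns, copied from head(countries) above
toy <- data.frame(child_mort = c(90.2, 16.6, 27.3, 119.0, 10.3, 14.5),
                  income     = c(1610, 9930, 12900, 5900, 19100, 18700))

# Manual standardization: subtract the column mean, divide by the column sd
toy_manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# Same result with base R's scale()
toy_scaled <- as.data.frame(scale(toy))

colMeans(toy_manual)    # numerically zero
sapply(toy_manual, sd)  # 1 for every column
all.equal(toy_manual, toy_scaled, check.attributes = FALSE)
```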

Distance matrix

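# t(n) puts the variables in rows, so these are distances between the nine variables, not between countries
dist<-dist(t(n))
dist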

##            child_mort   exports    health   imports    income inflation
## exports     20.919057                                                  
## health      19.963303 19.234957                                        
## imports     19.345129  9.337535 17.326917                              
## income      22.496057 12.666011 16.999409 17.069304                    
## inflation   15.371803 19.173464 20.415307 20.347042 19.520630          
## life_expec  25.027514 15.065995 16.187965 17.718418 11.350263 20.287485
## total_fer    7.092621 20.934266 19.932279 19.616424 22.329597 15.059290
## gdpp        22.189337 13.891846 14.735652 17.136353  5.888148 20.139054
##            life_expec total_fer
## exports                        
## health                         
## imports                        
## income                         
## inflation                      
## life_expec                     
## total_fer   24.178718          
## gdpp        11.522604 21.977948
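
For standardized columns, the Euclidean distance between two variables is a direct function of their Pearson correlation r: d^2 = 2 (N - 1) (1 - r), where N is the number of observations. For example, the income/gdpp distance of 5.888 with N = 167 implies a correlation of about 0.90. The sketch below checks the identity on made-up data. The names are illustrative, and scale() is assumed to match the normalization used above.

```r
set.seed(3)  # made-up data
toy <- data.frame(u = rnorm(40))
toy$v <-  toy$u + rnorm(40, sd = 0.5)  # positively correlated with u
toy$w <- -toy$u + rnorm(40, sd = 0.5)  # negatively correlated with u

z      <- scale(toy)   # standardized columns
d_vars <- dist(t(z))   # distances between variables (columns)
r      <- cor(toy)     # Pearson correlations

# For standardized columns: d^2 = 2 * (N - 1) * (1 - r)
n_obs    <- nrow(toy)
d_from_r <- sqrt(2 * (n_obs - 1) * pmax(1 - r, 0))  # pmax guards against tiny negative rounding

all.equal(as.matrix(d_vars), d_from_r, check.attributes = FALSE)
```

Highly correlated variables therefore end up close together in the MDS map of variables, and negatively correlated ones end up far apart.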

Performing MDS with 2-Dimensions

library(smacof) # mds()
fit.data<-mds(dist, ndim=2,  type="ratio") 
fit.data

## 
## Call:
## mds(delta = dist, ndim = 2, type = "ratio")
## 
## Model: Symmetric SMACOF 
## Number of objects: 9 
## Stress-1 value: 0.175 
## Number of iterations: 46

summary(fit.data)

## 
## Configurations:
##                 D1      D2
## child_mort  0.7550 -0.2412
## exports    -0.2828 -0.4996
## health     -0.1122  0.6317
## imports     0.0058 -0.6105
## income     -0.5623 -0.0980
## inflation   0.5223  0.4544
## life_expec -0.6645  0.3171
## total_fer   0.7719 -0.0196
## gdpp       -0.4334  0.0658
## 
## 
## Stress per point (in %):
## child_mort    exports     health    imports     income  inflation life_expec 
##       6.68       7.92      19.34      12.55       7.81      17.71      12.34 
##  total_fer       gdpp 
##       7.25       8.38
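
Stress-1 measures how far the distances in the low-dimensional configuration are from the (rescaled) input dissimilarities, with 0 meaning a perfect fit. The helper below is a hand-written sketch of that quantity for a ratio transformation. The function name and the toy data are assumptions, and it is not guaranteed to reproduce the value reported by mds() digit for digit.

```r
# Stress-1 for a ratio transformation: disparities are b * delta, with the
# scale factor b chosen by least squares against the configuration distances.
stress1_ratio <- function(delta, conf) {
  delta <- as.numeric(as.dist(delta))  # observed dissimilarities
  d     <- as.numeric(dist(conf))      # distances in the configuration
  b     <- sum(delta * d) / sum(delta^2)
  sqrt(sum((b * delta - d)^2) / sum(d^2))
}

# Made-up check: 2-D points are reproduced exactly by a 2-D configuration
set.seed(11)
pts   <- matrix(rnorm(20), ncol = 2)
delta <- dist(pts)
stress_2d <- stress1_ratio(delta, cmdscale(delta, k = 2))  # numerically zero
stress_1d <- stress1_ratio(delta, cmdscale(delta, k = 1))  # larger: one dimension loses information
c(stress_2d, stress_1d)
```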

# Stress plot: share of the total stress contributed by each variable
plot(fit.data, plot.type = "stressplot")

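# Build a data frame with one column per variable and one row per country
# (the helper column "whatever" only fixes the number of rows and is dropped again)
lab<-data.frame(whatever=n[,1], child_mort=0, exports=0, health=0, imports=0, income=0, inflation=0, life_expec=0, total_fer=0, gdpp=0)   
lab<-lab[,-1]

# Label each standardized value as "high" (> 1.25), "low" (< 0.75) or "average" (in between).
# Note: n has mean 0, so "low" covers everything at or below the mean; this is why most cells are "low".
for(i in 1:9){
  lab[,i]<-"average"
  lab[n[,i]>1.25,i]<-"high"
  lab[n[,i]<0.75,i]<-"low"
}
head(lab)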

##   child_mort exports health imports income inflation life_expec total_fer gdpp
## 1       high     low    low     low    low       low        low      high  low
## 2        low     low    low     low    low       low        low       low  low
## 3        low     low    low     low    low   average        low       low  low
## 4       high average    low     low    low      high        low      high  low
## 5        low     low    low     low    low       low        low       low  low
## 6        low     low    low     low    low   average        low       low  low

library(StatMatch) # gower.dist()
dist.gower<-gower.dist(t(lab)) # Gower distance between the nine labelled variables
dist.gower

##            child_mort   exports    health   imports    income inflation
## child_mort  0.0000000 0.3592814 0.3712575 0.3173653 0.3772455 0.2574850
## exports     0.3592814 0.0000000 0.3712575 0.1916168 0.2574850 0.2874251
## health      0.3712575 0.3712575 0.0000000 0.3053892 0.2335329 0.3652695
## imports     0.3173653 0.1916168 0.3053892 0.0000000 0.2814371 0.2874251
## income      0.3772455 0.2574850 0.2335329 0.2814371 0.0000000 0.3113772
## inflation   0.2574850 0.2874251 0.3652695 0.2874251 0.3113772 0.0000000
## life_expec  0.4311377 0.2934132 0.2814371 0.3173653 0.2035928 0.3772455
## total_fer   0.1197605 0.3892216 0.4191617 0.3652695 0.4191617 0.2694611
## gdpp        0.3772455 0.2694611 0.2215569 0.2814371 0.1017964 0.3113772
##            life_expec total_fer      gdpp
## child_mort  0.4311377 0.1197605 0.3772455
## exports     0.2934132 0.3892216 0.2694611
## health      0.2814371 0.4191617 0.2215569
## imports     0.3173653 0.3652695 0.2814371
## income      0.2035928 0.4191617 0.1017964
## inflation   0.3772455 0.2694611 0.3113772
## life_expec  0.0000000 0.4730539 0.1976048
## total_fer   0.4730539 0.0000000 0.4191617
## gdpp        0.1976048 0.4191617 0.0000000
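
The values in this matrix are consistent with a simple reading: for purely categorical labels, the Gower distance between two variables is the share of countries for which the two labels differ. A minimal sketch with made-up labels (the vectors a and b are illustrative, not taken from the data):

```r
# Made-up labels for two variables across six observations
a <- c("low", "low",  "high", "average", "low", "high")
b <- c("low", "high", "high", "low",     "low", "high")

# Share of observations where the labels differ
mismatch_share <- mean(a != b)
mismatch_share

# The child_mort / total_fer entry above, 0.1197605, corresponds to
# 20 differing labels out of 167 countries
20 / 167
```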

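dist<-dist(t(n)) # distances between the nine variables

mds.var<-cmdscale(dist, k=2) # classical MDS; named mds.var to avoid masking smacof::mds()
plot(mds.var, type='n') 
text(mds.var, labels=colnames(n), cex=0.6, adj=0.5) # colnames(n) excludes "country", so the 9 labels match the 9 points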

fit.data<-mds(dist, ndim=2,  type="ratio") 
plot(fit.data, pch=21, cex=as.numeric(fit.data$spp), bg="pink", main="MDS for selected variables")

dist<-dist(n) 
mds2<-cmdscale(dist, k=2) 
plot(mds2, type='n') 
text(mds2, labels=countries$country, cex=0.6, adj=0.5)

fit.data<-mds(dist, ndim=2,  type="ratio") 
plot(fit.data, pch=21, cex=as.numeric(fit.data$spp), bg="pink", main="MDS for Countries")

PCA

Principal Component Analysis experiments

pca<-prcomp(n, center = FALSE, scale.=FALSE) # n is already standardized, so no centering or scaling here
pca

## Standard deviations (1, .., p=9):
## [1] 2.0336314 1.2435217 1.0818425 0.9973889 0.8127847 0.4728437 0.3368067
## [8] 0.2971790 0.2586020
## 
## Rotation (n x k) = (9 x 9):
##                   PC1          PC2         PC3          PC4         PC5
## child_mort -0.4195194 -0.192883937  0.02954353 -0.370653262  0.16896968
## exports     0.2838970 -0.613163494 -0.14476069 -0.003091019 -0.05761584
## health      0.1508378  0.243086779  0.59663237 -0.461897497 -0.51800037
## imports     0.1614824 -0.671820644  0.29992674  0.071907461 -0.25537642
## income      0.3984411 -0.022535530 -0.30154750 -0.392159039  0.24714960
## inflation  -0.1931729  0.008404473 -0.64251951 -0.150441762 -0.71486910
## life_expec  0.4258394  0.222706743 -0.11391854  0.203797235 -0.10821980
## total_fer  -0.4037290 -0.155233106 -0.01954925 -0.378303645  0.13526221
## gdpp        0.3926448  0.046022396 -0.12297749 -0.531994575  0.18016662
##                     PC6         PC7         PC8         PC9
## child_mort -0.200628153  0.07948854  0.68274306  0.32754180
## exports     0.059332832  0.70730269  0.01419742 -0.12308207
## health     -0.007276456  0.24983051 -0.07249683  0.11308797
## imports     0.030031537 -0.59218953  0.02894642  0.09903717
## income     -0.160346990 -0.09556237 -0.35262369  0.61298247
## inflation  -0.066285372 -0.10463252  0.01153775 -0.02523614
## life_expec  0.601126516 -0.01848639  0.50466425  0.29403981
## total_fer   0.750688748 -0.02882643 -0.29335267 -0.02633585
## gdpp       -0.016778761 -0.24299776  0.24969636 -0.62564572

library(factoextra) # fviz_pca_var(), fviz_eig(), get_eigenvalue(), fviz_contrib()
fviz_pca_var(pca, col.var = "steelblue")
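
The pieces of a prcomp() result relate to each other in a simple way, sketched below on made-up standardized data (object names are illustrative): the squared standard deviations are the eigenvalues of the correlation matrix, the scores are the data projected onto the loading vectors, and the loading vectors are orthonormal.

```r
set.seed(5)  # made-up data
toy <- scale(matrix(rnorm(60), ncol = 3,
                    dimnames = list(NULL, c("v1", "v2", "v3"))))

toy_pca <- prcomp(toy, center = FALSE, scale. = FALSE)  # data already standardized

# Squared standard deviations are the eigenvalues of the correlation matrix
all.equal(toy_pca$sdev^2, eigen(cor(toy))$values)

# Scores are the data projected onto the loading vectors
all.equal(toy_pca$x, toy %*% toy_pca$rotation, check.attributes = FALSE)

# Loading vectors are orthonormal (identity matrix)
round(crossprod(toy_pca$rotation), 10)
```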

Eigenvalues measure the amount of variance explained by each principal component. The scree plot below displays the eigenvalues of the principal components against their component number.

The first two principal components have eigenvalues of 4.14 and 1.55 and together explain about 63% of the variance in the original data. The third component also has an eigenvalue above 1 (1.17), and four components are needed to exceed 85% of the variance, so keeping only two components is a simplification that trades some information for results that are easy to plot.

In this case, the first two principal components are used as input for k-means clustering. The two component scores act as new variables that summarize the nine original features.

fviz_eig(pca, choice='eigenvalue')

eig.val<-get_eigenvalue(pca)
eig.val

##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.13565658       45.9517398                    45.95174
## Dim.2 1.54634631       17.1816257                    63.13337
## Dim.3 1.17038330       13.0042589                    76.13762
## Dim.4 0.99478456       11.0531618                    87.19079
## Dim.5 0.66061903        7.3402114                    94.53100
## Dim.6 0.22358112        2.4842347                    97.01523
## Dim.7 0.11343874        1.2604304                    98.27566
## Dim.8 0.08831536        0.9812817                    99.25694
## Dim.9 0.06687501        0.7430556                   100.00000
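
The table can be rebuilt by hand from the standard deviations printed by prcomp() earlier: each eigenvalue is a squared standard deviation, and the percentages follow from dividing by the total. The sketch uses the rounded values copied from that output, so the last digits may differ slightly.

```r
# Standard deviations copied from the prcomp output above
sdev <- c(2.0336314, 1.2435217, 1.0818425, 0.9973889, 0.8127847,
          0.4728437, 0.3368067, 0.2971790, 0.2586020)

eigenvalue       <- sdev^2
variance_percent <- 100 * eigenvalue / sum(eigenvalue)

data.frame(eigenvalue = eigenvalue,
           variance.percent = variance_percent,
           cumulative.variance.percent = cumsum(variance_percent),
           row.names = paste0("Dim.", seq_along(sdev)))
```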

library(gridExtra) # grid.arrange()
var<-get_pca_var(pca) # coordinates, cos2 and contributions of the variables
a<-fviz_contrib(pca, "var", axes=1, xtickslab.rt=90) # default angle=45°
b<-fviz_contrib(pca, "var", axes=2, xtickslab.rt=90)
grid.arrange(a,b,top='Contribution to the first two Principal Components')
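
A variable's contribution to a component can be approximated by hand as its squared loading expressed as a percentage. This is an assumption about how the bars are computed, stated here only for illustration. The sketch uses the PC1 loadings copied from the rotation matrix above.

```r
# PC1 loadings copied from the rotation matrix above
pc1 <- c(child_mort = -0.4195194, exports = 0.2838970, health = 0.1508378,
         imports = 0.1614824, income = 0.3984411, inflation = -0.1931729,
         life_expec = 0.4258394, total_fer = -0.4037290, gdpp = 0.3926448)

contrib_pc1 <- 100 * pc1^2 / sum(pc1^2)  # percentage contribution per variable
round(sort(contrib_pc1, decreasing = TRUE), 1)

# Reference level if all nine variables contributed equally
100 / length(pc1)
```

Variables whose contribution exceeds the equal-share reference level can be read as the main drivers of that component.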

library(rgl)
open3d() # opens a new rgl device; replaces the deprecated rgl.open()
plot3d(pca$x[,1], pca$x[,2], pca$x[,3], col = "pink")
rglwidget()

kmeans_model_mds <- kmeans(mds2, 3)

sum(kmeans_model_mds$withinss)

## [1] 368.6546

ms_pca <- princomp(n)$scores[,1:2]
ms_km <- kmeans(ms_pca,3)
sum(ms_km$withinss)

## [1] 368.6546
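
The two sums of squares are identical because classical MDS applied to Euclidean distances gives the same coordinates as the first principal component scores, up to the sign of each axis. A minimal check on made-up data (names are illustrative):

```r
set.seed(9)
toy <- scale(matrix(rnorm(200), ncol = 4))  # made-up standardized data

mds_xy <- cmdscale(dist(toy), k = 2)        # classical MDS on Euclidean distances
pca_xy <- prcomp(toy)$x[, 1:2]              # first two principal component scores

# Identical up to the sign of each axis
all.equal(abs(mds_xy), abs(pca_xy), check.attributes = FALSE)

# Hence the same distances between points, and the same k-means objective
all.equal(as.numeric(dist(mds_xy)), as.numeric(dist(pca_xy)))
```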

ms_pca

##           Comp.1       Comp.2
## 1    2.904289861  0.095333856
## 2   -0.428622238 -0.586392077
## 3    0.284369828 -0.453809569
## 4    2.923629763  1.690470936
## 5   -1.030476681  0.136248937
## 6   -0.022340073 -1.773851672
## 7    0.101279137 -0.566547817
## 8   -2.335141612 -1.982496744
## 9   -2.964846812 -0.732485689
## 10   0.180942807 -0.401657877
## 11  -1.264939526 -0.654619578
## 12  -1.665985901  0.559479843
## 13   1.120481059 -0.958514647
## 14  -1.078131688 -0.480524341
## 15  -0.578285942  0.533721652
## 16  -3.134359295  0.661558267
## 17  -0.210621996  0.697145978
## 18   2.664300926  0.416918232
## 19   0.156101483  0.775064591
## 20   0.791471191 -0.119900481
## 21  -0.992881028 -0.968974224
## 22   0.879442693  0.455996758
## 23  -0.140359227 -2.144627284
## 24  -2.452709502  0.016404706
## 25  -0.903876085  0.030186818
## 26   3.112691936  0.038658702
## 27   2.890278094 -0.421395969
## 28   0.580665501  0.892137207
## 29   2.799489040  0.078413068
## 30  -2.536003455 -1.721915992
## 31   0.155334280  0.350182275
## 32   3.953075045  0.385460038
## 33   3.546887847  1.285262632
## 34  -0.948802508 -1.073200592
## 35  -0.057309620 -1.186428306
## 36  -0.120782862 -1.763605049
## 37   2.087278886  0.342570698
## 38   3.163854741  1.047232056
## 39   1.720501954  2.169823152
## 40  -0.935014535 -1.346422975
## 41   2.573964967  1.204251605
## 42  -1.145418560 -0.842278871
## 43  -2.167934801 -0.004496923
## 44  -2.047106561  0.421929317
## 45  -3.001464839 -0.862953375
## 46   0.230409959 -0.878000693
## 47  -0.009589492 -1.042086864
## 48   0.845643404 -0.817360669
## 49  -0.081622363 -0.566101379
## 50   1.289544500  2.356606996
## 51   2.467275503 -0.616172083
## 52  -1.654108622  1.018501304
## 53   0.188262206  1.068550889
## 54  -2.451586981 -1.072916114
## 55  -2.247511347 -1.861041001
## 56   1.417451531  0.318764665
## 57   2.207031878  0.222825742
## 58  -0.320976861 -0.516701233
## 59  -2.663411673 -1.269790972
## 60   2.048007495  0.378894857
## 61  -1.774157118 -1.760103376
## 62  -0.145068502 -0.430043001
## 63   0.661513606 -0.612070021
## 64   2.960625317  0.726349272
## 65   2.825119853 -0.090854946
## 66   0.321813602  1.357259367
## 67   4.396494705  1.737006394
## 68  -1.833645391  1.269147932
## 69  -2.473484888 -0.632798767
## 70   1.338799322 -0.533534328
## 71   0.951887298 -0.730165794
## 72   0.001061420 -1.330348538
## 73   1.026142012 -0.282419938
## 74  -3.657627636  1.724307271
## 75  -1.480862926 -1.046078257
## 76  -2.159315753 -1.767170669
## 77  -0.018553500 -0.238244777
## 78  -2.259087723 -2.428290687
## 79  -0.159662454  0.539442784
## 80   0.292466898 -0.236813212
## 81   1.869081163 -0.170517133
## 82   1.235501061  0.368031547
## 83  -2.458265407  0.087785761
## 84   0.338950478  1.294303760
## 85   1.523188916  0.544150345
## 86  -1.185275100  0.161554156
## 87  -1.168476537 -0.255526609
## 88   1.797744636  2.031740468
## 89   1.768262138  1.050240042
## 90  -0.816487444  0.388672717
## 91  -1.405560863  0.727644787
## 92  -6.897012020  4.835301393
## 93  -0.731011782 -0.094582971
## 94   2.129603836  0.341705354
## 95   2.970950045  0.215972875
## 96  -1.227137773  1.596945780
## 97  -1.105276859  1.006287825
## 98   3.402023455  0.559784946
## 99  -3.668509461  4.751196705
## 100  1.948068605  1.379236434
## 101 -0.897077073  0.415230963
## 102  0.379786578  0.101468460
## 103 -0.508011595  0.161173606
## 104  0.942142021  0.528210954
## 105 -1.023605372 -0.256869026
## 106  0.232171894 -0.280185105
## 107  2.911783246  0.890591816
## 108  1.831688884 -1.608830396
## 109  1.040246148  0.999834092
## 110  1.303170528 -0.786682663
## 111 -3.369024843  0.115355507
## 112 -1.810302129 -1.579971882
## 113  3.439822383  0.967014133
## 114  4.897337278 -0.094215330
## 115 -3.710037100 -1.442915374
## 116 -1.124006175  0.490137036
## 117  2.353269663 -0.477962163
## 118 -1.160294677  1.111932039
## 119 -0.117492862  0.359948586
## 120  0.020573576 -1.083359181
## 121  0.780398801 -0.096208740
## 122 -1.214175871 -0.657192438
## 123 -1.808627988 -1.446536042
## 124 -4.229575787 -0.195017155
## 125 -0.571075181 -0.635473639
## 126 -0.163270503 -1.063480038
## 127  1.674666958 -0.998625230
## 128  0.561209779 -0.022038116
## 129 -0.853369283 -0.182890710
## 130  1.906436650  0.091285392
## 131 -0.829924169 -0.866719316
## 132 -1.597792344  2.930307595
## 133  3.371484958 -0.235592964
## 134 -5.766034799  6.662053986
## 135 -2.023637558  1.047257791
## 136 -2.272656633  0.194689692
## 137  0.803791712  1.299582062
## 138  1.188263623 -0.555087724
## 139 -1.912311127 -0.426186478
## 140 -2.013142633 -1.779031972
## 141  0.573846297 -0.994560313
## 142 -0.026543635 -0.016015913
## 143  2.312469057 -0.767100254
## 144 -0.171159963 -0.094523359
## 145 -2.809872095 -0.911738890
## 146 -4.076284595 -0.428174164
## 147  1.240732828 -0.028830722
## 148  2.546390853 -0.214383193
## 149 -0.923315812  0.825747199
## 150  2.364858104 -1.173982164
## 151  1.991652303  0.955487932
## 152  0.752744639 -0.087630306
## 153 -0.600425816  0.172915660
## 154 -0.400233991 -1.407755872
## 155  0.462545049  1.287999777
## 156  2.846275992 -0.351026661
## 157 -0.301393352 -0.097278499
## 158 -2.419863452  1.148359354
## 159 -2.061789050 -1.530709845
## 160 -2.633286156 -2.988376840
## 161 -0.615461581 -1.426187936
## 162  0.850969631 -0.652522634
## 163  0.818170463  0.637652316
## 164  0.549383280 -1.230186365
## 165 -0.497029556  1.386574158
## 166  1.881791521 -0.109124819
## 167  2.855476001  0.484540717

t-SNE

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a sophisticated method for reducing high-dimensional data into a lower-dimensional space. Its popularity has grown in recent years due to its remarkable ability to preserve the local structure of data, especially when used for data visualization purposes.

While PCA and classical MDS are linear methods that capture the dominant global patterns in the data, t-SNE takes a different approach. It is a non-linear method that aims to preserve the local neighbourhood of each data point when the data are mapped to a lower-dimensional space. To do this, t-SNE converts pairwise distances in the high-dimensional space into probabilities that two points are neighbours, defines analogous probabilities in the low-dimensional space, and then minimizes the Kullback-Leibler divergence between the two sets of probabilities. In other words, t-SNE looks for a low-dimensional representation in which points that were close neighbours in the original space remain close neighbours.
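
The probability model can be sketched in a few lines of base R. This is a simplified illustration, not the Rtsne implementation: it uses a fixed Gaussian bandwidth sigma instead of the per-point perplexity search, and it only evaluates the cost of a random layout without optimizing it. All function and object names are assumptions.

```r
set.seed(2)
high <- matrix(rnorm(50), ncol = 5)  # 10 made-up points in 5-D
low  <- matrix(rnorm(20), ncol = 2)  # a random 2-D layout of the same points

# High-dimensional affinities: Gaussian kernel with a fixed bandwidth sigma
# (real t-SNE tunes sigma per point to match a target perplexity)
gaussian_affinities <- function(x, sigma = 1) {
  d2 <- as.matrix(dist(x))^2
  p  <- exp(-d2 / (2 * sigma^2))
  diag(p) <- 0
  p  <- p / rowSums(p)        # conditional probabilities p_{j|i}
  (p + t(p)) / (2 * nrow(x))  # symmetrized joint probabilities p_ij
}

# Low-dimensional affinities: heavy-tailed Student-t kernel (1 degree of freedom)
student_affinities <- function(y) {
  q <- 1 / (1 + as.matrix(dist(y))^2)
  diag(q) <- 0
  q / sum(q)
}

# Kullback-Leibler divergence KL(P || Q): the quantity t-SNE minimizes
kl_divergence <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log(p[keep] / q[keep]))
}

P <- gaussian_affinities(high)
Q <- student_affinities(low)
kl_divergence(P, Q)  # cost of the random layout; t-SNE would move the points to lower it
```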

t-SNE offers several advantages over other dimension reduction techniques. Firstly, it preserves local structure well, so groups of similar data points tend to appear as visible clusters in the low-dimensional map. This makes t-SNE particularly useful for visualizing complex, high-dimensional data sets. The distances between well-separated clusters and the apparent cluster sizes in a t-SNE map are not reliable, however, and the layout depends on the perplexity parameter and on the random initialization.

Another advantage of t-SNE is its ability to capture non-linear relationships between variables, which linear methods such as PCA and classical MDS cannot represent.

As a result, t-SNE is a powerful tool for data visualization, especially for large and complex data sets, and a useful complement to PCA and MDS.

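library(Rtsne)

# Run t-SNE on the standardized data (n), consistent with the MDS and PCA steps
set.seed(123) # t-SNE is stochastic; fixing the seed makes the layout reproducible
tsne_result <- Rtsne(as.matrix(n))

# Plot the t-SNE results
plot(tsne_result$Y, col = "blue", main = "t-SNE Plot")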

Conclusion

In this paper, applications of Multidimensional Scaling (MDS), Principal Component Analysis (PCA) and t-SNE (t-Distributed Stochastic Neighbor Embedding) were presented. The first step in the analysis was to explore and transform the data in order to prepare it for MDS and PCA. Once the data had been reduced with MDS and PCA, k-means clustering was performed on both results in order to compare the two methods.

Clustering the two-dimensional MDS coordinates and the first two principal components gave the same total SSE (within-cluster sum of squared errors), 368.6546. This agreement is expected: classical MDS applied to Euclidean distances yields the same configuration as the PCA scores, up to the sign of the axes. The identical value therefore shows that the two reductions are equivalent for this data set; it does not by itself measure how much information was lost in the reduction.

Dimension reduction is an important step before clustering, as it allows for better visualizations and simpler data structures that are easier to analyze. MDS is more suitable for exploratory data analysis, as it aims to preserve the distances between data points and produces visualizations that are intuitive and interpretable. On the other hand, PCA is more suitable for dimension reduction, especially for data with a very large number of dimensions.

Dimension Reduction of Country Data

Zehra Usta

2022/2023