The aim of this project is to apply dimension reduction methods to a country data set. Multidimensional Scaling (MDS) and Principal Component Analysis (PCA) are used for dimension reduction.
MDS is a statistical technique that aims to represent the structure of data by positioning similar data points closely together and dissimilar data points farther apart in a lower dimensional space. This method can be used for data visualization, clustering, and data mining.
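As a minimal illustration of this idea, the sketch below applies base R's `cmdscale()` (classical MDS) to a small made-up distance matrix. The four corners of a unit square are an assumption chosen only because the result is easy to check; they are not part of the country data. The check confirms that the recovered two-dimensional configuration reproduces the input distances.

```r
# Toy example: four corners of a unit square (illustrative data, not from the country set)
pts <- matrix(c(0, 0,
                1, 0,
                1, 1,
                0, 1), ncol = 2, byrow = TRUE)
d_in <- dist(pts)                  # pairwise Euclidean distances

# Classical MDS: find 2-D coordinates from the distances alone
conf <- cmdscale(d_in, k = 2)
d_out <- dist(conf)                # distances in the recovered configuration

# The configuration may be rotated or reflected, but the distances are preserved
max_abs_diff <- max(abs(as.vector(d_in) - as.vector(d_out)))
print(max_abs_diff)
```

The recovered coordinates generally differ from the original ones by a rotation or reflection, which is why the comparison is made on distances rather than on coordinates.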
Meanwhile, PCA is a method that transforms a high-dimensional data set into a smaller set of variables, known as principal components, while preserving as much of the original variance in the data as possible.
Although both MDS and PCA can effectively reduce the dimensionality of high-dimensional data sets, they differ in their approach. MDS works from the pairwise distances between observations and tries to reproduce them in fewer dimensions, which makes it well suited to visualizing data structure, while PCA works from the variables themselves and finds new orthogonal axes that retain as much of the original variance as possible.
The data set consists of 167 countries and 10 columns: the country name and nine numeric features such as health, income and inflation.
Because dimension reduction techniques only work on numeric data, the column containing the country names must be removed before any dimension reduction is performed.
Moreover, the numeric features should be scaled before dimension reduction. MDS and PCA are highly sensitive to differences in the scales of the features: without scaling, the features with the largest magnitudes (here income and gdpp, which reach values above 100,000) dominate the analysis and skew the results.
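The effect of scaling can be sketched on a few made-up rows. The values below are hypothetical and are not taken from `Country-data.csv`, and base R's `scale()` is used as a stand-in for the `(x - mean) / sd` standardization applied later in this report. The sketch compares how much of the squared distance between two rows comes from the income column before and after scaling.

```r
# Hypothetical toy values (not taken from Country-data.csv)
toy <- data.frame(income = c(1600, 9900, 41000),
                  health = c(7.6, 6.5, 8.7))

# Without scaling, income dominates the Euclidean distance
d_raw <- as.matrix(dist(toy))

# z-score standardization: (x - mean) / sd, column by column
toy_z <- scale(toy)
d_z <- as.matrix(dist(toy_z))

# Share of the squared distance between rows 1 and 2 that comes from income
share_raw <- (toy$income[1] - toy$income[2])^2 / d_raw[1, 2]^2
share_z   <- (toy_z[1, "income"] - toy_z[2, "income"])^2 / d_z[1, 2]^2
print(c(raw = share_raw, scaled = share_z))
```

Before scaling, practically all of the distance is due to income; after scaling, both features contribute on a comparable footing.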
Multidimensional Scaling experiments
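# Packages inferred from the functions called in this report
library(dplyr)       # %>%
library(tidyr)       # gather()
library(ggplot2)     # ggplot(), geom_histogram()
library(corrplot)    # corrplot()
library(GGally)      # ggpairs()
library(clusterSim)  # data.Normalization()
library(smacof)      # mds()
library(StatMatch)   # gower.dist()
library(factoextra)  # fviz_*(), get_eigenvalue()
library(gridExtra)   # grid.arrange()
library(rgl)         # plot3d(), rglwidget()
library(Rtsne)       # Rtsne()

countries <- read.csv("Country-data.csv")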
head(countries)
## country child_mort exports health imports income inflation
## 1 Afghanistan 90.2 10.0 7.58 44.9 1610 9.44
## 2 Albania 16.6 28.0 6.55 48.6 9930 4.49
## 3 Algeria 27.3 38.4 4.17 31.4 12900 16.10
## 4 Angola 119.0 62.3 2.85 42.9 5900 22.40
## 5 Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44
## 6 Argentina 14.5 18.9 8.10 16.0 18700 20.90
## life_expec total_fer gdpp
## 1 56.2 5.82 553
## 2 76.3 1.65 4090
## 3 76.5 2.89 4460
## 4 60.1 6.16 3530
## 5 76.8 2.13 12200
## 6 75.8 2.37 10300
colnames(countries)
## [1] "country" "child_mort" "exports" "health" "imports"
## [6] "income" "inflation" "life_expec" "total_fer" "gdpp"
Conversion of variables to numeric
summary(countries)
## country child_mort exports health
## Length:167 Min. : 2.60 Min. : 0.109 Min. : 1.810
## Class :character 1st Qu.: 8.25 1st Qu.: 23.800 1st Qu.: 4.920
## Mode :character Median : 19.30 Median : 35.000 Median : 6.320
## Mean : 38.27 Mean : 41.109 Mean : 6.816
## 3rd Qu.: 62.10 3rd Qu.: 51.350 3rd Qu.: 8.600
## Max. :208.00 Max. :200.000 Max. :17.900
## imports income inflation life_expec
## Min. : 0.0659 Min. : 609 Min. : -4.210 Min. :32.10
## 1st Qu.: 30.2000 1st Qu.: 3355 1st Qu.: 1.810 1st Qu.:65.30
## Median : 43.3000 Median : 9960 Median : 5.390 Median :73.10
## Mean : 46.8902 Mean : 17145 Mean : 7.782 Mean :70.56
## 3rd Qu.: 58.7500 3rd Qu.: 22800 3rd Qu.: 10.750 3rd Qu.:76.80
## Max. :174.0000 Max. :125000 Max. :104.000 Max. :82.80
## total_fer gdpp
## Min. :1.150 Min. : 231
## 1st Qu.:1.795 1st Qu.: 1330
## Median :2.410 Median : 4660
## Mean :2.948 Mean : 12964
## 3rd Qu.:3.880 3rd Qu.: 14050
## Max. :7.490 Max. :105000
str(countries)
## 'data.frame': 167 obs. of 10 variables:
## $ country : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
## $ child_mort: num 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
## $ exports : num 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
## $ health : num 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
## $ imports : num 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
## $ income : int 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
## $ inflation : num 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
## $ life_expec: num 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
## $ total_fer : num 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
## $ gdpp : int 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
countries[!complete.cases(countries),]
## [1] country child_mort exports health imports income
## [7] inflation life_expec total_fer gdpp
## <0 rows> (or 0-length row.names)
countries$income <- as.numeric(countries$income)
countries$gdpp <- as.numeric(countries$gdpp)
Distribution of variables
countries %>%
  gather(Features, value, 2:10) %>%
  ggplot(aes(x = value)) +
  geom_histogram(fill = "white", colour = "black") +
  facet_wrap(~Features, scales = "free_x") +
  labs(x = "Values", y = "Frequency")
str(countries)
## 'data.frame': 167 obs. of 10 variables:
## $ country : chr "Afghanistan" "Albania" "Algeria" "Angola" ...
## $ child_mort: num 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
## $ exports : num 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
## $ health : num 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
## $ imports : num 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
## $ income : num 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
## $ inflation : num 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
## $ life_expec: num 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
## $ total_fer : num 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
## $ gdpp : num 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
The correlation plot displays the pairwise Pearson correlations between the variables in the data set. Positive correlations are shown in blue and negative correlations in red; both the colour intensity and the size of each dot increase with the strength of the correlation.
# Stored as cor.mat so that the cor() function is not masked
cor.mat<-cor(countries[,-1], method="pearson")
corrplot(cor.mat, order ="alphabet", tl.cex=0.6)
This is the correlation plot after normalizing the data set.
n<-data.Normalization(countries[,-1], type="n1",normalization="column")
n.cor<-cor(n, method="pearson")
corrplot(n.cor, order ="alphabet", tl.cex=0.6)
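Pearson correlation does not change when each variable is shifted and rescaled by a positive constant, so the correlation matrix of the standardized data is expected to equal that of the raw data. The sketch below checks this on simulated data, with base R's `scale()` standing in for the standardization used in this report; the variables `a`, `b` and `c` are made up for the illustration.

```r
set.seed(1)
# Simulated data (illustrative only)
x <- data.frame(a = rnorm(50, mean = 100, sd = 20),
                b = rnorm(50, mean = 5, sd = 1))
x$c <- 0.5 * x$a + rnorm(50, sd = 5)

r_raw <- cor(x, method = "pearson")
r_std <- cor(scale(x), method = "pearson")

# Both matrices agree up to floating-point error
print(all.equal(r_raw, r_std))
```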
ggpairs(as.data.frame(countries[,-1]))
library(qgraph)
qgraph(cor(countries[,-1]), shape="rectangle", posCol="blue", negCol="pink")
Normalization of variables
data<-countries[,-1]
n<-data.Normalization(data, type="n1", normalization="column")
summary(n)
## child_mort exports health imports
## Min. :-0.8845 Min. :-1.4957 Min. :-1.8223 Min. :-1.9341
## 1st Qu.:-0.7444 1st Qu.:-0.6314 1st Qu.:-0.6901 1st Qu.:-0.6894
## Median :-0.4704 Median :-0.2229 Median :-0.1805 Median :-0.1483
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5909 3rd Qu.: 0.3736 3rd Qu.: 0.6496 3rd Qu.: 0.4899
## Max. : 4.2086 Max. : 5.7964 Max. : 4.0353 Max. : 5.2504
## income inflation life_expec total_fer
## Min. :-0.8577 Min. :-1.1344 Min. :-4.3242 Min. :-1.1877
## 1st Qu.:-0.7153 1st Qu.:-0.5649 1st Qu.:-0.5910 1st Qu.:-0.7616
## Median :-0.3727 Median :-0.2263 Median : 0.2861 Median :-0.3554
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2934 3rd Qu.: 0.2808 3rd Qu.: 0.7021 3rd Qu.: 0.6157
## Max. : 5.5947 Max. : 9.1023 Max. : 1.3768 Max. : 3.0003
## gdpp
## Min. :-0.69471
## 1st Qu.:-0.63475
## Median :-0.45307
## Mean : 0.00000
## 3rd Qu.: 0.05924
## Max. : 5.02140
Distance matrix
dist<-dist(t(n))
dist
## child_mort exports health imports income inflation
## exports 20.919057
## health 19.963303 19.234957
## imports 19.345129 9.337535 17.326917
## income 22.496057 12.666011 16.999409 17.069304
## inflation 15.371803 19.173464 20.415307 20.347042 19.520630
## life_expec 25.027514 15.065995 16.187965 17.718418 11.350263 20.287485
## total_fer 7.092621 20.934266 19.932279 19.616424 22.329597 15.059290
## gdpp 22.189337 13.891846 14.735652 17.136353 5.888148 20.139054
## life_expec total_fer
## exports
## health
## imports
## income
## inflation
## life_expec
## total_fer 24.178718
## gdpp 11.522604 21.977948
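Because `n` is transposed, this matrix holds distances between the nine variables rather than between countries. Assuming the standardization divides by the sample standard deviation, the squared Euclidean distance between two standardized variables is tied to their Pearson correlation r by d^2 = 2 (m - 1) (1 - r), where m is the number of observations (167 here). For example, the distance of 7.09 between child_mort and total_fer corresponds to r = 1 - 7.09^2 / (2 * 166), which is about 0.85. The sketch below checks the identity on simulated data using base R only.

```r
set.seed(42)
m <- 100
# Simulated, correlated variables (illustrative only)
u <- rnorm(m)
v <- 0.8 * u + rnorm(m, sd = 0.6)
z <- scale(cbind(u, v))          # z-scores using the sample standard deviation

d <- as.numeric(dist(t(z)))      # Euclidean distance between the two columns
r <- cor(u, v)

# Identity: d^2 = 2 * (m - 1) * (1 - r)
print(c(d_squared = d^2, from_r = 2 * (m - 1) * (1 - r)))
```

Small distances in the matrix therefore indicate strongly positively correlated variables, and large distances indicate negatively correlated ones.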
Performing MDS in two dimensions
fit.data<-mds(dist, ndim=2, type="ratio")
fit.data
##
## Call:
## mds(delta = dist, ndim = 2, type = "ratio")
##
## Model: Symmetric SMACOF
## Number of objects: 9
## Stress-1 value: 0.175
## Number of iterations: 46
summary(fit.data)
##
## Configurations:
## D1 D2
## child_mort 0.7550 -0.2412
## exports -0.2828 -0.4996
## health -0.1122 0.6317
## imports 0.0058 -0.6105
## income -0.5623 -0.0980
## inflation 0.5223 0.4544
## life_expec -0.6645 0.3171
## total_fer 0.7719 -0.0196
## gdpp -0.4334 0.0658
##
##
## Stress per point (in %):
## child_mort exports health imports income inflation life_expec
## 6.68 7.92 19.34 12.55 7.81 17.71 12.34
## total_fer gdpp
## 7.25 8.38
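Stress-1 summarizes how far the distances in the fitted configuration are from the target dissimilarities. One common form is Kruskal's Stress-1, the square root of sum((delta - d)^2) / sum(d^2), where delta are the dissimilarities and d the fitted distances. The sketch below computes it by hand for classical MDS solutions from `cmdscale()` on simulated data. It only illustrates the formula: `smacof` rescales the dissimilarities internally, so the numbers are not expected to match the value reported above, and the helper `stress1()` is a name introduced here.

```r
set.seed(7)
# Simulated 5-dimensional data for 20 objects (illustrative only)
x <- matrix(rnorm(20 * 5), nrow = 20)
delta <- as.vector(dist(x))               # target dissimilarities

# Kruskal's Stress-1 for a given configuration (hypothetical helper)
stress1 <- function(delta, conf) {
  d <- as.vector(dist(conf))              # distances in the fitted configuration
  sqrt(sum((delta - d)^2) / sum(d^2))
}

conf2 <- cmdscale(dist(x), k = 2)         # 2-D configuration
conf5 <- cmdscale(dist(x), k = 5)         # full-dimensional configuration

print(c(k2 = stress1(delta, conf2), k5 = stress1(delta, conf5)))
```

With all five dimensions retained the stress is essentially zero, while the two-dimensional solution has to distort some distances and therefore shows a positive stress.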
# Stress decomposition plot: share of the total stress contributed by each variable
plot(fit.data, plot.type = "stressplot")
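# Recode each standardized variable into three text labels.
# The first column ("whatever") only sets the number of rows and is dropped again.
lab<-data.frame(whatever=n[,1], child_mort=0, exports=0, health=0, imports=0, income=0, inflation=0, life_expec=0, total_fer=0, gdpp=0)
lab<-lab[,-1]
for(i in 1:9){
  lab[,i]<-"average"
  lab[n[,i]>1.25,i]<-"high"
  # Note: the lower cut-off is +0.75 (not -0.75), so "average" only covers values between 0.75 and 1.25
  lab[n[,i]<0.75,i]<-"low"
}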
head(lab)
## child_mort exports health imports income inflation life_expec total_fer gdpp
## 1 high low low low low low low high low
## 2 low low low low low low low low low
## 3 low low low low low average low low low
## 4 high average low low low high low high low
## 5 low low low low low low low low low
## 6 low low low low low average low low low
dist.gower<-gower.dist(t(lab))
dist.gower
## child_mort exports health imports income inflation
## child_mort 0.0000000 0.3592814 0.3712575 0.3173653 0.3772455 0.2574850
## exports 0.3592814 0.0000000 0.3712575 0.1916168 0.2574850 0.2874251
## health 0.3712575 0.3712575 0.0000000 0.3053892 0.2335329 0.3652695
## imports 0.3173653 0.1916168 0.3053892 0.0000000 0.2814371 0.2874251
## income 0.3772455 0.2574850 0.2335329 0.2814371 0.0000000 0.3113772
## inflation 0.2574850 0.2874251 0.3652695 0.2874251 0.3113772 0.0000000
## life_expec 0.4311377 0.2934132 0.2814371 0.3173653 0.2035928 0.3772455
## total_fer 0.1197605 0.3892216 0.4191617 0.3652695 0.4191617 0.2694611
## gdpp 0.3772455 0.2694611 0.2215569 0.2814371 0.1017964 0.3113772
## life_expec total_fer gdpp
## child_mort 0.4311377 0.1197605 0.3772455
## exports 0.2934132 0.3892216 0.2694611
## health 0.2814371 0.4191617 0.2215569
## imports 0.3173653 0.3652695 0.2814371
## income 0.2035928 0.4191617 0.1017964
## inflation 0.3772455 0.2694611 0.3113772
## life_expec 0.0000000 0.4730539 0.1976048
## total_fer 0.4730539 0.0000000 0.4191617
## gdpp 0.1976048 0.4191617 0.0000000
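For purely categorical labels such as these, the Gower distance between two variables reduces to the share of countries on which their labels differ. For instance, the value 0.1197605 between child_mort and total_fer equals 20/167, meaning the two variables receive different labels for 20 of the 167 countries. A minimal base-R sketch of this simple-matching calculation, on made-up labels and with a helper `mismatch()` introduced only for the illustration, is:

```r
# Made-up labels for six observations (illustrative only)
lab_toy <- data.frame(
  v1 = c("high", "low", "low", "high", "low", "average"),
  v2 = c("high", "low", "average", "high", "low", "low"),
  v3 = c("low", "low", "low", "low", "high", "average"),
  stringsAsFactors = FALSE
)

# Simple-matching distance: proportion of positions where two label vectors differ
mismatch <- function(a, b) mean(a != b)

p <- ncol(lab_toy)
d_match <- matrix(0, p, p, dimnames = list(names(lab_toy), names(lab_toy)))
for (i in seq_len(p)) {
  for (j in seq_len(p)) {
    d_match[i, j] <- mismatch(lab_toy[[i]], lab_toy[[j]])
  }
}
print(d_match)
```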
dist<-dist(t(n))
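mds.var<-cmdscale(dist, k=2)  # named mds.var so that the smacof function mds() is not masked
plot(mds.var, type='n')
text(mds.var, labels=colnames(n), cex=0.6, adj=0.5)  # nine variable names; colnames(countries) would also include "country"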
fit.data<-mds(dist, ndim=2, type="ratio")
plot(fit.data, pch=21, cex=as.numeric(fit.data$spp), bg="pink", main="MDS for selected variables")
dist<-dist(n)
mds2<-cmdscale(dist, k=2)
plot(mds2, type='n')
text(mds2, labels=countries$country, cex=0.6, adj=0.5)
fit.data<-mds(dist, ndim=2, type="ratio")
plot(fit.data, pch=21, cex=as.numeric(fit.data$spp), bg="pink", main="MDS for Countries")
Principal Component Analysis experiments
pca<-prcomp(n, center = FALSE, scale.=FALSE) # n is already standardized, so prcomp() does not center or scale it again
pca
## Standard deviations (1, .., p=9):
## [1] 2.0336314 1.2435217 1.0818425 0.9973889 0.8127847 0.4728437 0.3368067
## [8] 0.2971790 0.2586020
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4 PC5
## child_mort -0.4195194 -0.192883937 0.02954353 -0.370653262 0.16896968
## exports 0.2838970 -0.613163494 -0.14476069 -0.003091019 -0.05761584
## health 0.1508378 0.243086779 0.59663237 -0.461897497 -0.51800037
## imports 0.1614824 -0.671820644 0.29992674 0.071907461 -0.25537642
## income 0.3984411 -0.022535530 -0.30154750 -0.392159039 0.24714960
## inflation -0.1931729 0.008404473 -0.64251951 -0.150441762 -0.71486910
## life_expec 0.4258394 0.222706743 -0.11391854 0.203797235 -0.10821980
## total_fer -0.4037290 -0.155233106 -0.01954925 -0.378303645 0.13526221
## gdpp 0.3926448 0.046022396 -0.12297749 -0.531994575 0.18016662
## PC6 PC7 PC8 PC9
## child_mort -0.200628153 0.07948854 0.68274306 0.32754180
## exports 0.059332832 0.70730269 0.01419742 -0.12308207
## health -0.007276456 0.24983051 -0.07249683 0.11308797
## imports 0.030031537 -0.59218953 0.02894642 0.09903717
## income -0.160346990 -0.09556237 -0.35262369 0.61298247
## inflation -0.066285372 -0.10463252 0.01153775 -0.02523614
## life_expec 0.601126516 -0.01848639 0.50466425 0.29403981
## total_fer 0.750688748 -0.02882643 -0.29335267 -0.02633585
## gdpp -0.016778761 -0.24299776 0.24969636 -0.62564572
pca$rotation
## PC1 PC2 PC3 PC4 PC5
## child_mort -0.4195194 -0.192883937 0.02954353 -0.370653262 0.16896968
## exports 0.2838970 -0.613163494 -0.14476069 -0.003091019 -0.05761584
## health 0.1508378 0.243086779 0.59663237 -0.461897497 -0.51800037
## imports 0.1614824 -0.671820644 0.29992674 0.071907461 -0.25537642
## income 0.3984411 -0.022535530 -0.30154750 -0.392159039 0.24714960
## inflation -0.1931729 0.008404473 -0.64251951 -0.150441762 -0.71486910
## life_expec 0.4258394 0.222706743 -0.11391854 0.203797235 -0.10821980
## total_fer -0.4037290 -0.155233106 -0.01954925 -0.378303645 0.13526221
## gdpp 0.3926448 0.046022396 -0.12297749 -0.531994575 0.18016662
## PC6 PC7 PC8 PC9
## child_mort -0.200628153 0.07948854 0.68274306 0.32754180
## exports 0.059332832 0.70730269 0.01419742 -0.12308207
## health -0.007276456 0.24983051 -0.07249683 0.11308797
## imports 0.030031537 -0.59218953 0.02894642 0.09903717
## income -0.160346990 -0.09556237 -0.35262369 0.61298247
## inflation -0.066285372 -0.10463252 0.01153775 -0.02523614
## life_expec 0.601126516 -0.01848639 0.50466425 0.29403981
## total_fer 0.750688748 -0.02882643 -0.29335267 -0.02633585
## gdpp -0.016778761 -0.24299776 0.24969636 -0.62564572
fviz_pca_var(pca, col.var = "steelblue")
Eigenvalues measure the amount of variance explained by each principal component in Principal Component Analysis (PCA). The scree plot displays the eigenvalue of each principal component against its component number.
The first two principal components have the largest eigenvalues (4.14 and 1.55) and together explain 63.1% of the variance in the data, so they carry most of the information and are retained for further analysis.
These two components are then used as the input to k-means clustering, where they replace the nine original variables with a two-dimensional summary of each country.
fviz_eig(pca, choice='eigenvalue')
eig.val<-get_eigenvalue(pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.13565658 45.9517398 45.95174
## Dim.2 1.54634631 17.1816257 63.13337
## Dim.3 1.17038330 13.0042589 76.13762
## Dim.4 0.99478456 11.0531618 87.19079
## Dim.5 0.66061903 7.3402114 94.53100
## Dim.6 0.22358112 2.4842347 97.01523
## Dim.7 0.11343874 1.2604304 98.27566
## Dim.8 0.08831536 0.9812817 99.25694
## Dim.9 0.06687501 0.7430556 100.00000
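The eigenvalues in this table are the squared standard deviations reported by `prcomp()`: for the first component, 2.0336314^2 is about 4.1357. Because each of the nine standardized variables has unit variance, the eigenvalues sum to 9, so the first component explains 4.1357 / 9, or about 45.95%, of the variance. The sketch below rebuilds the same three columns from a `prcomp` fit on simulated data, using base R only.

```r
set.seed(3)
# Simulated data with 4 variables (illustrative only)
x <- matrix(rnorm(200 * 4), ncol = 4)
x[, 2] <- x[, 1] + 0.5 * x[, 2]            # induce some correlation
z <- scale(x)

fit <- prcomp(z, center = FALSE, scale. = FALSE)

eigenvalue <- fit$sdev^2                    # variance along each component
variance.percent <- 100 * eigenvalue / sum(eigenvalue)
cumulative.variance.percent <- cumsum(variance.percent)

print(data.frame(eigenvalue, variance.percent, cumulative.variance.percent))
```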
var<-get_pca_var(pca)
a<-fviz_contrib(pca, "var", axes=1, xtickslab.rt=90) # default angle=45°
b<-fviz_contrib(pca, "var", axes=2, xtickslab.rt=90)
grid.arrange(a,b,top='Contribution to the first two Principal Components')
open3d()  # rgl.open() is deprecated in current rgl releases
plot3d(pca$x[,1], pca$x[,2], pca$x[,3], col = "pink")
rglwidget()
kmeans_model_mds <- kmeans(mds2, 3)
sum(kmeans_model_mds$withinss)
## [1] 368.6546
ms_pca <- princomp(n)$scores[,1:2]
ms_km <- kmeans(ms_pca,3)
sum(ms_km$withinss)
## [1] 368.6546
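The two totals coincide because classical MDS (`cmdscale`) applied to Euclidean distances yields the same configuration as the PCA scores, up to the sign of each axis. k-means depends only on the distances between points, so any given partition has the same within-cluster sum of squares in both configurations. A small check on simulated data, using base R only (`prcomp` is used here in place of `princomp`):

```r
set.seed(11)
# Simulated standardized data (illustrative only)
z <- scale(matrix(rnorm(60 * 5), ncol = 5))

mds_xy <- cmdscale(dist(z), k = 2)            # classical MDS on Euclidean distances
pca_xy <- prcomp(z, center = FALSE)$x[, 1:2]  # first two PCA scores

# Axes may be flipped, so compare absolute coordinates
max_diff <- max(abs(abs(mds_xy) - abs(pca_xy)))
print(max_diff)

# Same inter-point distances, hence the same k-means objective for any partition
same_dist <- all.equal(as.vector(dist(mds_xy)), as.vector(dist(pca_xy)))
print(same_dist)
```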
ms_pca
## Comp.1 Comp.2
## 1 2.904289861 0.095333856
## 2 -0.428622238 -0.586392077
## 3 0.284369828 -0.453809569
## 4 2.923629763 1.690470936
## 5 -1.030476681 0.136248937
## 6 -0.022340073 -1.773851672
## 7 0.101279137 -0.566547817
## 8 -2.335141612 -1.982496744
## 9 -2.964846812 -0.732485689
## 10 0.180942807 -0.401657877
## 11 -1.264939526 -0.654619578
## 12 -1.665985901 0.559479843
## 13 1.120481059 -0.958514647
## 14 -1.078131688 -0.480524341
## 15 -0.578285942 0.533721652
## 16 -3.134359295 0.661558267
## 17 -0.210621996 0.697145978
## 18 2.664300926 0.416918232
## 19 0.156101483 0.775064591
## 20 0.791471191 -0.119900481
## 21 -0.992881028 -0.968974224
## 22 0.879442693 0.455996758
## 23 -0.140359227 -2.144627284
## 24 -2.452709502 0.016404706
## 25 -0.903876085 0.030186818
## 26 3.112691936 0.038658702
## 27 2.890278094 -0.421395969
## 28 0.580665501 0.892137207
## 29 2.799489040 0.078413068
## 30 -2.536003455 -1.721915992
## 31 0.155334280 0.350182275
## 32 3.953075045 0.385460038
## 33 3.546887847 1.285262632
## 34 -0.948802508 -1.073200592
## 35 -0.057309620 -1.186428306
## 36 -0.120782862 -1.763605049
## 37 2.087278886 0.342570698
## 38 3.163854741 1.047232056
## 39 1.720501954 2.169823152
## 40 -0.935014535 -1.346422975
## 41 2.573964967 1.204251605
## 42 -1.145418560 -0.842278871
## 43 -2.167934801 -0.004496923
## 44 -2.047106561 0.421929317
## 45 -3.001464839 -0.862953375
## 46 0.230409959 -0.878000693
## 47 -0.009589492 -1.042086864
## 48 0.845643404 -0.817360669
## 49 -0.081622363 -0.566101379
## 50 1.289544500 2.356606996
## 51 2.467275503 -0.616172083
## 52 -1.654108622 1.018501304
## 53 0.188262206 1.068550889
## 54 -2.451586981 -1.072916114
## 55 -2.247511347 -1.861041001
## 56 1.417451531 0.318764665
## 57 2.207031878 0.222825742
## 58 -0.320976861 -0.516701233
## 59 -2.663411673 -1.269790972
## 60 2.048007495 0.378894857
## 61 -1.774157118 -1.760103376
## 62 -0.145068502 -0.430043001
## 63 0.661513606 -0.612070021
## 64 2.960625317 0.726349272
## 65 2.825119853 -0.090854946
## 66 0.321813602 1.357259367
## 67 4.396494705 1.737006394
## 68 -1.833645391 1.269147932
## 69 -2.473484888 -0.632798767
## 70 1.338799322 -0.533534328
## 71 0.951887298 -0.730165794
## 72 0.001061420 -1.330348538
## 73 1.026142012 -0.282419938
## 74 -3.657627636 1.724307271
## 75 -1.480862926 -1.046078257
## 76 -2.159315753 -1.767170669
## 77 -0.018553500 -0.238244777
## 78 -2.259087723 -2.428290687
## 79 -0.159662454 0.539442784
## 80 0.292466898 -0.236813212
## 81 1.869081163 -0.170517133
## 82 1.235501061 0.368031547
## 83 -2.458265407 0.087785761
## 84 0.338950478 1.294303760
## 85 1.523188916 0.544150345
## 86 -1.185275100 0.161554156
## 87 -1.168476537 -0.255526609
## 88 1.797744636 2.031740468
## 89 1.768262138 1.050240042
## 90 -0.816487444 0.388672717
## 91 -1.405560863 0.727644787
## 92 -6.897012020 4.835301393
## 93 -0.731011782 -0.094582971
## 94 2.129603836 0.341705354
## 95 2.970950045 0.215972875
## 96 -1.227137773 1.596945780
## 97 -1.105276859 1.006287825
## 98 3.402023455 0.559784946
## 99 -3.668509461 4.751196705
## 100 1.948068605 1.379236434
## 101 -0.897077073 0.415230963
## 102 0.379786578 0.101468460
## 103 -0.508011595 0.161173606
## 104 0.942142021 0.528210954
## 105 -1.023605372 -0.256869026
## 106 0.232171894 -0.280185105
## 107 2.911783246 0.890591816
## 108 1.831688884 -1.608830396
## 109 1.040246148 0.999834092
## 110 1.303170528 -0.786682663
## 111 -3.369024843 0.115355507
## 112 -1.810302129 -1.579971882
## 113 3.439822383 0.967014133
## 114 4.897337278 -0.094215330
## 115 -3.710037100 -1.442915374
## 116 -1.124006175 0.490137036
## 117 2.353269663 -0.477962163
## 118 -1.160294677 1.111932039
## 119 -0.117492862 0.359948586
## 120 0.020573576 -1.083359181
## 121 0.780398801 -0.096208740
## 122 -1.214175871 -0.657192438
## 123 -1.808627988 -1.446536042
## 124 -4.229575787 -0.195017155
## 125 -0.571075181 -0.635473639
## 126 -0.163270503 -1.063480038
## 127 1.674666958 -0.998625230
## 128 0.561209779 -0.022038116
## 129 -0.853369283 -0.182890710
## 130 1.906436650 0.091285392
## 131 -0.829924169 -0.866719316
## 132 -1.597792344 2.930307595
## 133 3.371484958 -0.235592964
## 134 -5.766034799 6.662053986
## 135 -2.023637558 1.047257791
## 136 -2.272656633 0.194689692
## 137 0.803791712 1.299582062
## 138 1.188263623 -0.555087724
## 139 -1.912311127 -0.426186478
## 140 -2.013142633 -1.779031972
## 141 0.573846297 -0.994560313
## 142 -0.026543635 -0.016015913
## 143 2.312469057 -0.767100254
## 144 -0.171159963 -0.094523359
## 145 -2.809872095 -0.911738890
## 146 -4.076284595 -0.428174164
## 147 1.240732828 -0.028830722
## 148 2.546390853 -0.214383193
## 149 -0.923315812 0.825747199
## 150 2.364858104 -1.173982164
## 151 1.991652303 0.955487932
## 152 0.752744639 -0.087630306
## 153 -0.600425816 0.172915660
## 154 -0.400233991 -1.407755872
## 155 0.462545049 1.287999777
## 156 2.846275992 -0.351026661
## 157 -0.301393352 -0.097278499
## 158 -2.419863452 1.148359354
## 159 -2.061789050 -1.530709845
## 160 -2.633286156 -2.988376840
## 161 -0.615461581 -1.426187936
## 162 0.850969631 -0.652522634
## 163 0.818170463 0.637652316
## 164 0.549383280 -1.230186365
## 165 -0.497029556 1.386574158
## 166 1.881791521 -0.109124819
## 167 2.855476001 0.484540717
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a sophisticated method for reducing high-dimensional data into a lower-dimensional space. Its popularity has grown in recent years due to its remarkable ability to preserve the local structure of data, especially when used for data visualization purposes.
Whereas PCA and classical MDS are linear methods that capture the dominant global patterns in the data, t-SNE is a non-linear method that aims to preserve the local relationships between data points. It converts the pairwise distances in the high-dimensional space into probabilities that express how likely two points are to be neighbours, defines analogous probabilities in the low-dimensional space, and minimizes the difference between the two sets of probabilities, measured by the Kullback-Leibler divergence. In other words, t-SNE looks for a low-dimensional representation in which points that are similar in the original data stay close together.
t-SNE offers several advantages over other dimension reduction techniques. First, it preserves local structure well, so clusters of similar data points tend to remain visible in the low-dimensional map. This makes t-SNE particularly useful for visualizing complex, high-dimensional data sets, where it can help reveal patterns and points that lie apart from the main groups.
Another advantage of t-SNE is its ability to handle non-linear relationships between variables, which linear methods such as PCA and classical MDS cannot capture. This makes t-SNE a valuable tool for data sets with complex, non-linear variable relationships.
As a result, t-SNE is a powerful tool for data visualization, especially for large and complex data sets, because it preserves local structure and handles non-linear relationships between variables.
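The objective described above can be sketched in base R. This is a simplified illustration, not the algorithm implemented in `Rtsne`: it uses one fixed Gaussian bandwidth instead of tuning a bandwidth per point from the perplexity, and it only evaluates the Kullback-Leibler divergence for two given embeddings rather than optimizing it. The data are simulated, and the helper names `p_matrix()`, `q_matrix()` and `kl()` are introduced here for the example.

```r
set.seed(5)
# Simulated data: two well-separated groups in 10 dimensions (illustrative only)
x <- rbind(matrix(rnorm(20 * 10), ncol = 10),
           matrix(rnorm(20 * 10, mean = 6), ncol = 10))

# High-dimensional similarities P: Gaussian kernel with ONE fixed bandwidth
# (simplification: real t-SNE tunes a bandwidth per point from the perplexity)
p_matrix <- function(x, sigma = 3) {
  d2 <- as.matrix(dist(x))^2
  p <- exp(-d2 / (2 * sigma^2))
  diag(p) <- 0
  p <- p / rowSums(p)              # conditional probabilities p_{j|i}
  (p + t(p)) / (2 * nrow(x))       # symmetrised joint probabilities
}

# Low-dimensional similarities Q: Student-t kernel with one degree of freedom
q_matrix <- function(y) {
  w <- 1 / (1 + as.matrix(dist(y))^2)
  diag(w) <- 0
  w / sum(w)
}

# Kullback-Leibler divergence KL(P || Q), the quantity t-SNE minimises
kl <- function(p, q) {
  keep <- p > 0
  sum(p[keep] * log(p[keep] / q[keep]))
}

p <- p_matrix(x)
y_good <- prcomp(x)$x[, 1:2]                  # embedding that keeps the groups apart
y_bad  <- y_good[sample(nrow(y_good)), ]      # same points, neighbours scrambled

kl_good <- kl(p, q_matrix(y_good))
kl_bad  <- kl(p, q_matrix(y_bad))
print(c(good = kl_good, bad = kl_bad))
```

An embedding that keeps neighbours together gives a lower divergence than one in which the same coordinates are assigned to the wrong points, which is the behaviour the optimization exploits.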
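# Run t-SNE on the standardized data so that income and gdpp do not dominate the distances
set.seed(123)  # t-SNE starts from a random configuration; fix the seed for reproducibility
tsne_result <- Rtsne(as.matrix(n))
# Plot the t-SNE results
plot(tsne_result$Y, col = "blue", main = "t-SNE Plot")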
This paper presented applications of Multidimensional Scaling (MDS), Principal Component Analysis (PCA) and t-SNE (t-Distributed Stochastic Neighbor Embedding). The first step was to explore and transform the data in order to prepare it for MDS and PCA. Once the data had been reduced to two dimensions with MDS and PCA, k-means clustering was performed on each result in order to compare the two methods.
The clustering results were the same for both methods: the total within-cluster SSE (sum of squared errors) was 368.6546 for both MDS and PCA. This is expected, because classical MDS applied to Euclidean distances produces the same two-dimensional configuration as PCA, up to the sign of the axes.
Dimension reduction is an important step before clustering, as it allows for better visualizations and simpler data structures that are easier to analyze. MDS is more suitable for exploratory data analysis, as it aims to preserve the distances between data points and produces visualizations that are intuitive and interpretable. PCA, on the other hand, is more suitable for dimension reduction, especially for data with a very large number of dimensions.