Main goal of this article is to use dimensions reduction methods like MDS, PCA, t-SNE and UMAP. I would like to discuss some of them more precisely. MDS (multidimensional scaling) - is the machine learning approach that used to reduce dimension of the data to show meaningful features of the data. MDS is based on measuring dissimilarities between pairs of elements using euclidean, manhattan and other distances. Method mostly used to reduce dimension of the data in order to be able to plot the data. It is used for example in testing structural hypothesis, exploratory data analysis to help to discover the structure of the data. t-SNE (t-distributed stochastic neighbor embedding) - dimensions reduction algorithm used to visualize the data which is based on Stochastic Neighbor Embedding. This method is able to identify non-linear relationship in data. UMAP is a new machine learning method that is considered to be faster than t-SNE and UMAP is more successful in differentiating each cluster.
Data set used for this project was taken from the dataset repository. Dataset has information on different educational indicators across both countries and regions.
Indicators:
setwd("C:/Users/daria/OneDrive/Desktop/DSBA 1 semester/1 semestr/Unsupervised Learning")
# loading the data. Working directory was setted before.
library(readr)
data = read_csv("Education Indicators 2014.csv", col_names = TRUE)
#View(data)
# dimension of the data
dim(data)
## [1] 63 10
# summary statistics. We can notice that our indicators have different scale.
summary(data)
## Country Name PPT GDP PRPE
## Length:63 Min. : 351706 Min. :1.399e+09 Min. : 0.000
## Class :character 1st Qu.: 4237223 1st Qu.:1.301e+10 1st Qu.: 0.395
## Mode :character Median : 9535079 Median :5.017e+10 Median : 1.380
## Mean : 25216447 Mean :2.683e+11 Mean : 3.959
## 3rd Qu.: 22465060 3rd Qu.:2.655e+11 3rd Qu.: 6.155
## Max. :254454778 Max. :3.880e+12 Max. :24.250
## OOCP ESE EPE UNEMP
## Min. : 397 Min. : 30230 Min. : 24072 Min. : 1.000
## 1st Qu.: 4128 1st Qu.: 351824 1st Qu.: 271797 1st Qu.: 4.400
## Median : 24274 Median : 650516 Median : 773568 Median : 6.200
## Mean : 267882 Mean : 1981848 Mean : 2492792 Mean : 8.571
## 3rd Qu.: 186085 3rd Qu.: 1769626 3rd Qu.: 2728357 3rd Qu.: 9.750
## Max. :5611792 Max. :22586956 Max. :29838440 Max. :31.000
## LEB TDP
## Min. :49.70 Min. :4.000
## 1st Qu.:68.58 1st Qu.:5.000
## Median :73.50 Median :6.000
## Mean :71.68 Mean :5.524
## 3rd Qu.:77.35 3rd Qu.:6.000
## Max. :83.98 Max. :7.000
# Important to know the data type of the columns.
str(data)
## tibble [63 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Country Name: chr [1:63] "Albania" "United Arab Emirates" "Azerbaijan" "Burundi" ...
## $ PPT : num [1:63] 2893654 9086139 9535079 10816860 11231213 ...
## $ GDP : num [1:63] 1.32e+10 4.02e+11 7.52e+10 3.09e+09 5.32e+11 ...
## $ PRPE : num [1:63] 0.73 0.22 0.16 24.25 2.49 ...
## $ OOCP : num [1:63] 7097 14611 22821 69246 6538 ...
## $ ESE : num [1:63] 333291 411040 949294 583308 1210112 ...
## $ EPE : num [1:63] 195720 409776 517708 2046794 773568 ...
## $ UNEMP : num [1:63] 16.1 3.6 5.2 6.9 8.5 1 3.1 11.6 27.9 5.9 ...
## $ LEB : num [1:63] 77.8 77.4 70.8 56.7 80.6 ...
## $ TDP : num [1:63] 5 5 4 6 6 6 6 4 5 4 ...
## - attr(*, "spec")=
## .. cols(
## .. `Country Name` = col_character(),
## .. PPT = col_double(),
## .. GDP = col_double(),
## .. PRPE = col_double(),
## .. OOCP = col_double(),
## .. ESE = col_double(),
## .. EPE = col_double(),
## .. UNEMP = col_double(),
## .. LEB = col_double(),
## .. TDP = col_double()
## .. )
#preparing data for analysis
countries_regions = data[,1]
indicators = colnames(data)
indicators = indicators[ 2:length(indicators)]
data = as.matrix(data[,2:ncol(data)])
# normalization of the data, as our indicators have different scales
library(clusterSim)
## Warning: package 'cluster' was built under R version 4.0.4
data_nr<-data.Normalization(data, type="n1",normalization="column")
head(data_nr)
## PPT GDP PRPE OOCP ESE EPE
## [1,] -0.4945285 -0.4285266 -0.6019239 -0.3381103 -0.4473052 -0.48026373
## [2,] -0.3573432 0.2247060 -0.6970066 -0.3283683 -0.4262094 -0.43550964
## [3,] -0.3473976 -0.3243902 -0.7081928 -0.3177240 -0.2801643 -0.41294360
## [4,] -0.3190016 -0.4455407 3.7830653 -0.2575335 -0.3794677 -0.09324762
## [5,] -0.3098222 0.4431334 -0.2737955 -0.3388350 -0.2093962 -0.35944928
## [6,] -0.3238394 -0.4344282 1.2829503 -0.2562603 -0.2944176 -0.07515498
## UNEMP LEB TDP
## [1,] 1.12859518 0.7219009 -0.5849633
## [2,] -0.74525830 0.6679148 -0.5849633
## [3,] -0.50540505 -0.1078418 -1.7017115
## [4,] -0.25056098 -1.7591119 0.5317848
## [5,] -0.01070773 1.0458175 0.5317848
## [6,] -1.13501982 -1.4281537 0.5317848
library(corrplot)
## corrplot 0.84 loaded
library(clusterSim)
corrr<-cor(data_nr, method="pearson")
corrplot(corrr)
Correlation plot is showing that some of the indicators are correlated between each other. Enrollment in Secondary Education (ESE) and Enrollment in Primary Education (EPE) are highly correlated with population (PPT) as well as between each other. Life expectancy at birth (LEB) doesn’t have any relations with Percentage of repeaters in Primary Education (PRPE).
library(smacof)
# creating a dissimilarity matrix
dist_data = dist(t(data_nr))
mds = smacofSym(dist_data, ndim = 2 )
library(stats)
mds1<-cmdscale(dist_data, k=2) #k - the maximum dimension of the space
summary(mds1)
## V1 V2
## Min. :-4.736 Min. :-5.1146
## 1st Qu.:-4.007 1st Qu.:-2.4754
## Median :-1.724 Median :-0.8212
## Mean : 0.000 Mean : 0.0000
## 3rd Qu.: 4.931 3rd Qu.: 2.9029
## Max. : 5.578 Max. : 7.5593
par(mfrow = c(1,2))
plot(mds)
plot(mds1)
We can say that EPE, PPT, OOCP, GDP and ESE are in the same group hence providing the same information about education in countries. LEB, UNEMP, TDP and PRPE are possibly outliers in data.
Next step is to check the goodness of MDS fitting the data.
library(smacof)
stress_vector<-randomstress(n=9, ndim=2, nrep=1)
mean(stress_vector)
## [1] 0.3070712
mds$stress
## [1] 0.1708296
ratio<- mds$stress/ mean(stress_vector)
ratio
## [1] 0.5563192
Due to the scale presented by Kruskal: 0.20 = poor, 0.10 = fair, 0.05 = good, 0.025 = excellent, 0.00 = perfect, one can say that MDS is not the best method for dimension reduction. Value of 0.53 is much higher than 0.2 hence this method should be changed for more appropriate fro the data.
I also would like to demonstrate the results of t-SNE dimensions reduction method. In order to use this method we need to have matrix of distances as well, but allocation of the elements into clusters is calculated with help of t-Student distribution.
One of the feature of t-SNE function are tunable parameters that can change our output significantly.
library(labdsv)
tsnee1 = tsne(dist_data, perplexity = 1, eta = 500)
plot(tsnee1)
text(tsnee1$points[,1], tsnee1$points[,2], indicators, cex=0.7, pos=4, col="red")
library(labdsv)
tsnee2 = tsne(dist_data, perplexity = 2, eta = 1000)
plot(tsnee2)
text(tsnee2$points[,1], tsnee2$points[,2], indicators, cex=0.7, pos=4, col="red")
Having a value of 1 and 2 for hyperparameter, we can notice that TDP and PRPE could create a separate cluster from the rest of the indicators. With the values of the perplexity lower than 5, we can say that local variations are dominating making it difficult to distinguish clusters comparing to first graph. On the first graph we can clearly separate observations into two clusters.
Before conducting PCA I want to check how many dimensions should be used. To understand this I will make scree plot which describes percentage of explained variances by each of the components.
pca <- prcomp(data_nr, center=FALSE, scale=FALSE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9012 1.4602 0.9993 0.93621 0.86315 0.58885 0.50724
## Proportion of Variance 0.4016 0.2369 0.1110 0.09739 0.08278 0.03853 0.02859
## Cumulative Proportion 0.4016 0.6385 0.7495 0.84688 0.92966 0.96818 0.99677
## PC8 PC9
## Standard deviation 0.1500 0.08102
## Proportion of Variance 0.0025 0.00073
## Cumulative Proportion 0.9993 1.00000
# scree plot
library(factoextra)
fviz_eig(pca, addlabels = TRUE)
3 components will describe 81% of the variation in data.
#variable correlation plot
fviz_pca_var(pca)
EPE, PPT, ESE and GDP are grouped together. Quality of the variable and its impact on the describing our data is represented by the distance from the origin. Previously mentioned variables are allocated far from the origin thus they are describing major part of the variation in data. Variable UNEMP doesn’t have a big impact on describing the variation of the data, so it shouldn’t be considered in the future analysis. PRPE and LEB have negative impact on the data as they are located opposite to the rest of the variables.
library(gridExtra)
ind<-get_pca_ind(pca)
var<-get_pca_var(pca)
a<-fviz_contrib(pca, "var", axes=1, xtickslab.rt=90)
b<-fviz_contrib(pca, "var", axes=2, xtickslab.rt=90)
c<-fviz_contrib(pca, "var", axes=3, xtickslab.rt=90)
grid.arrange(a,b,c, top='Contribution to the first three Principal Components')
On the above graphs we can see variables that contributes the most to the each PCA element. I plotted contributions of the individual variables for the each of the 3 dimensions. EPE, PPT, ESE, OOCD and GDP are the main components in PCA 1 element. PPRE and LEB contributes the most to the PCA 2 and UNEMP variable is main one in PCA 3.
library(psych)
pca2<-principal(data_nr, nfactors=3, rotate="varimax")
plot(pca2$complexity)
text(pca2$complexity, labels = indicators)
OOCP and TDP are characterized to have a high complexity.
library(psych)
plot(pca2$uniquenesses)
text(pca2$uniquenesses, labels = indicators)
PPT, ESE, EPE and UNEMP are considered to have a low level of uniqueness.
In order to perform umap dimension reduction I prepared new data with classification for the following regions: Africa, Arab, South America, North America, Europe. I needed it in order to plot the results after UMAP method. Umap is mainly used for larger datasets, so it could be difficult to show all the options.
library(umap)
library(ggplot2)
library(clusterSim)
x = read_csv("Book1.csv", col_names = TRUE)
countries = x[,1]
regions = x[, 11]
x_final = as.matrix(x[,2:10])
# we need to normalize data in order to perform clustering to compare the results before and after using UMAP method.
x_nr = data.Normalization(x_final, type = "n1", normalization = "column")
# for umap method we do not need to normalize the data, so we are proceeding with x_final data.
df = umap(x_final)
results = data.frame(x = df$layout[,1],
y = df$layout[,2],
regions_2 = x[,11])
ggplot(results, aes(x = x, y = y, col = Regions)) + geom_point()
From the graph we can notice 5 groups of the countries creating 5 clusters depending on the indicators that were taken into account. Interesting that Umap classified countries from difernt regions into different clusters. However, we can notice some tendency in the lower part of the graph where we can find countries from Asia, Europe and South America. Probably could be developing countries. In the upper part we can find cluster containing mainly African countries which probably could be interpreted as poor countries. In the middle righ part of the graph we can notice the group of countries from North America, Europe and some African and Asian countries. It can be the group of developed countries.
km1 = eclust(x_nr, "kmeans", hc_metric = "eucledian", k = 3, graph = FALSE)
a1 = fviz_silhouette(km1)
## cluster size ave.sil.width
## 1 1 42 0.36
## 2 2 16 0.28
## 3 3 5 0.06
b1 = fviz_cluster(km1, data = data, elipse.type = "convex", main = "K-means/before") + theme_minimal()
grid.arrange(a1, b1, ncol = 2)
km2 = eclust(results[,1:2], "kmeans", hc_metric = "eucledian", k = 3, graph = FALSE)
a2 = fviz_silhouette(km2)
## cluster size ave.sil.width
## 1 1 20 0.77
## 2 2 23 0.60
## 3 3 20 0.60
b2 = fviz_cluster(km2, data = data, elipse.type = "convex", main = "K-means/after") + theme_minimal()
grid.arrange(a2, b2, ncol = 2)
From the graphs we can see that clustering after dimensions reduction method (UMAP) results are much better taking into consideration silhouette statistics (before - 0.32, after - 0.66).