Introduction
Description of data
Dimensions reduction
Clustering Principal components
Clustering Raw Data
Summary
Bibliography
In modern times, we tend to store larger and larger amounts of data. Because of that, data scientists also have more choice when it comes to selecting features/variables for their models. But is having more features/variables necessarily helpful when it comes to clustering methods? In fact, when the number of features is high, the Curse of Dimensionality may occur. According to Aggarwal, Hinneburg and Keim (2001), adding more variables to nearest-neighbor search models does not bring better results. The authors also imply that the problem is even more severe when the Euclidean metric is used. This happens because distance in high dimensions does not behave as intuitively as in 2 or 3 dimensions: as more dimensions are added, the ratios between pairwise distances become nearly uniform, hence the Euclidean distance metric becomes almost useless for telling nearest neighbors apart from the rest (Aggarwal et al., 2001).
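To illustrate this effect, the following small simulation (a hedged sketch, not taken from the cited paper) draws random points in an increasing number of dimensions and shows how the ratio between the farthest and the nearest neighbor of a point shrinks towards 1:

# Hedged illustration of distance concentration: for uniformly drawn points,
# the farthest-to-nearest Euclidean distance ratio from a query point
# approaches 1 as the number of dimensions grows.
set.seed(1)
for (d in c(2, 10, 100, 1000)) {
  x <- matrix(runif(100 * d), ncol = d)   # 100 random points in d dimensions
  dists <- as.matrix(dist(x))[1, -1]      # Euclidean distances from the first point
  cat("d =", d, " max/min distance ratio =", round(max(dists) / min(dists), 2), "\n")
}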
To deal with the Curse of Dimensionality, dimension-reduction methods can be applied before any clustering. This reduces unnecessary noise in the data, which shows that dimension reduction can also serve as a form of pre-processing that keeps the most useful information contained in the analyzed data (Ben-Hur and Guyon, 2003).
In the following research, I will verify whether dimension reduction performed before clustering helps to achieve better clustering results. To do that, a data set of 18 quality-of-life features for 31 countries will be used. Having more than half as many variables as observations can indeed be problematic according to the Curse of Dimensionality. Additionally, the focus will also be on searching for patterns in the clustered data, which could tell us more about groups of countries that are similar in terms of quality of life.
The analysis covers 31 European countries in the year 2018. The data set consists of 18 indicators, each of which describes a different aspect of the quality of life of a country's inhabitants. All data were taken from the Eurostat database. The list of variables can be found below:
Firstly, let's load all required packages and read the data.
library(knitr)
library(dplyr)
library(tidyr)
library(clustertend)
library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(ClusterR)
library(clusterSim)
library(smacof)
library(gridExtra)
library(MASS)
library(paran)
library(BBmisc)
library(stats)
library(rworldmap)
library(ClustGeo)
df<-read.csv("country.csv", sep=";", dec=".", header=TRUE)
After reading the data, the first dimension-reduction method will be applied: Multidimensional Scaling (MDS). The first steps are to create a matrix of unlabeled data and then to standardize it. Standardization reduces the effect of extreme values on the MDS results. Moreover, the analyzed variables are not expressed in the same units, hence standardization will be used for all dimension-reduction and clustering techniques.
# separate labels from the data
country <- df[,1]                # country names from the first column
variable <- colnames(df[1,2:19]) # variable names
df <- data.frame(df[,2:19], row.names = df[,1]) # first column used as row index
data_raw <- as.matrix(df)        # numeric data only
df.n <- data.Normalization(data_raw, type="n1", normalization="column") # clusterSim::
Now, it's time for Multidimensional Scaling of all countries in our data. With the following code, the distances between observations are computed. Then, the configuration in 2 dimensions is optimized so that its inter-point distances correspond as closely as possible to the distances in the original matrix. With only two dimensions, we will be able to see whether any clusters emerge.
# analysis for countries
distance <- dist(df.n)
mds1<-cmdscale(distance, k=2)
plot(mds1, type='n')
text(mds1, labels=country, cex=0.6, adj=0.5)
As one can see from the graph above, there are some outliers which might distort the results. Hence, the next step is to remove them.
# looking for outliers
data.out<-which(mds1[,1]>4.1 | mds1[,1]<(-4.1)) # arbitrary limits on the first dimension
data.all<-c(data.out)                           # indices of the outlying observations
# visualization with outliers highlighted
plot(mds1)
abline(v=c(-4.1, 4.1), lty=3, col=2)            # vertical cut-off lines
points(mds1[data.all,], pch=21, bg="red", cex=2)
text(mds1[data.all,]-0.5, labels=rownames(mds1[data.all,]))
# removing outliers from the database
df.n <- df.n[-c(8,31),]
df <- df[-c(8,31),]
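After removing the outliers, the MDS map can be redrawn to verify how the cleaned data look. The corresponding chunk is not echoed above, so the following is a minimal sketch, assuming the country names are kept as row names of df.n:

# MDS on the cleaned, standardized data
distance2 <- dist(df.n)
mds2 <- cmdscale(distance2, k=2)
plot(mds2, type='n')
text(mds2, labels=rownames(df.n), cex=0.6, adj=0.5)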
Greece and Turkey were identified as outliers, hence they will be omitted from the further analysis. They could also strongly affect the PCA, which is our next step. However, even without the outliers there are still no clearly visible clusters.
As the next step, PCA will be applied to check whether it brings better results.
Principal Component Analysis is a well-known method which enables researchers to reduce the number of data dimensions by creating new, uncorrelated components (eigenvectors) that carry the most variance, i.e. the most information. Firstly, let's take a glance at the initial PCA results.
xxx.pca1<-prcomp(df.n, center=FALSE, scale.=FALSE)
summary(xxx.pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3063 1.5445 1.2911 1.13655 0.92700 0.82541 0.77157
## Proportion of Variance 0.3618 0.1622 0.1134 0.08785 0.05844 0.04633 0.04049
## Cumulative Proportion 0.3618 0.5240 0.6373 0.72520 0.78364 0.82997 0.87046
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.6598 0.5764 0.5144 0.48224 0.41619 0.38759 0.32713
## Proportion of Variance 0.0296 0.0226 0.0180 0.01582 0.01178 0.01022 0.00728
## Cumulative Proportion 0.9001 0.9227 0.9406 0.95647 0.96825 0.97847 0.98574
## PC15 PC16 PC17 PC18
## Standard deviation 0.29968 0.2628 0.1957 0.11171
## Proportion of Variance 0.00611 0.0047 0.0026 0.00085
## Cumulative Proportion 0.99185 0.9966 0.9991 1.00000
# visualisation of the PCA quality
fviz_eig(xxx.pca1)                          # scree plot of explained variance
fviz_pca_var(xxx.pca1, col.var="steelblue") # variable map
The cumulative variance of the first two principal components equals 0.524. Although it would be easier to plot the data in 2 dimensions, more principal components may be needed to explain enough variance. In order to determine the adequate number of PCs, two methods will be used: the Kaiser criterion and the scree plot adjusted with parallel analysis.
The Kaiser criterion helps to establish the number of principal components that explain enough variance. According to the criterion, one should keep as many PCs as there are eigenvalues greater than 1.
# table of eigenvalues
eig.val<-get_eigenvalue(xxx.pca1)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 5.31922820 36.1747509 36.17475
## Dim.2 2.38546983 16.2229883 52.39774
## Dim.3 1.66705179 11.3372055 63.73494
## Dim.4 1.29175629 8.7849139 72.51986
## Dim.5 0.85932830 5.8440785 78.36394
## Dim.6 0.68129903 4.6333456 82.99728
## Dim.7 0.59532331 4.0486461 87.04593
## Dim.8 0.43530530 2.9604033 90.00633
## Dim.9 0.33225811 2.2596049 92.26594
## Dim.10 0.26460396 1.7995059 94.06544
## Dim.11 0.23255359 1.5815392 95.64698
## Dim.12 0.17321546 1.1779954 96.82498
## Dim.13 0.15022582 1.0216485 97.84663
## Dim.14 0.10701204 0.7277623 98.57439
## Dim.15 0.08980844 0.6107649 99.18515
## Dim.16 0.06904650 0.4695681 99.65472
## Dim.17 0.03829159 0.2604116 99.91513
## Dim.18 0.01247908 0.0848671 100.00000
In the analyzed example 4 eigenvalues are greater than 1, therefore according to the criterion 4 principal components should be chosen.
Parallel analysis with the scree plot was introduced by Horn (1965). It was designed to assess, both graphically and numerically, the number of components to keep in a PCA. It is based on the assumption that if the data were random, the factors would be uncorrelated and the eigenvalues would equal 1 on average. Horn's technique compares the obtained PCA eigenvalues with the eigenvalues of random data. For this purpose the paran package may be found useful.
According to the paran package description, the results below can be read in two ways: from the table, where adjusted eigenvalues greater than 1 indicate components to retain, and from the graph, where the retained components are those whose unadjusted eigenvalues lie above the random-data eigenvalues.
paran(df.n, iterations=10000, quietly=FALSE,
status=FALSE, all=TRUE, cfa=FALSE, graph=TRUE,
color=TRUE, col=c("black","red","blue"),
lty=c(1,2,3), lwd=1, legend=TRUE, file="",
width=640, height=640, grdevice="png", seed=0, mat=NA, n=NA)
##
## Using eigendecomposition of correlation matrix.
##
## Results of Horn's Parallel Analysis for component retention
## 10000 iterations, using the mean estimate
##
## --------------------------------------------------
## Component Adjusted Unadjusted Estimated
## Eigenvalue Eigenvalue Bias
## --------------------------------------------------
## 1 4.547977 6.271412 1.723435
## 2 1.520106 2.826159 1.306052
## 3 1.002757 2.005554 1.002797
## 4 0.801781 1.556885 0.755103
## 5 0.603719 1.140798 0.537079
## 6 0.461683 0.807473 0.345789
## 7 0.598184 0.772484 0.174299
## 8 0.629686 0.646852 0.017166
## 9 0.599397 0.473681 -0.12571
## 10 0.706564 0.449794 -0.25677
## 11 0.652416 0.276837 -0.37557
## 12 0.691169 0.208176 -0.48299
## 13 0.761849 0.180710 -0.58113
## 14 0.798916 0.130131 -0.66878
## 15 0.866667 0.120080 -0.74658
## 16 0.894284 0.078217 -0.81606
## 17 0.919190 0.042183 -0.87700
## 18 0.943645 0.012565 -0.93108
## --------------------------------------------------
##
## Adjusted eigenvalues > 1 indicate dimensions to retain.
## (3 components retained)
The results in both the table and the graph show that three components should be retained: only the first three have adjusted eigenvalues greater than 1, i.e. eigenvalues higher than the ones obtained from random data. This is a different conclusion than the Kaiser criterion, which suggested 4 PCs. However, as three components explain only about 63% of the variance and four explain about 72%, 4 dimensions will be chosen for the sake of keeping more information.
Now, let's see which variables contribute the most to the first four principal components (the ones that carry the most variance).
var<-get_pca_var(xxx.pca1)
a<-fviz_contrib(xxx.pca1, "var", axes=1, xtickslab.rt=90) # default angle=45°
plot(a, main = "Variable contributions to the first Principal Component")
a<-fviz_contrib(xxx.pca1, "var", axes=2, xtickslab.rt=90)
plot(a, main = "Variable contributions to the second Principal Component")
a<-fviz_contrib(xxx.pca1, "var", axes=3, xtickslab.rt=90)
plot(a, main = "Variable contributions to the third Principal Component")
a<-fviz_contrib(xxx.pca1, "var", axes=4, xtickslab.rt=90)
plot(a, main = "Variable contributions to the fourth Principal Component")
Looking at the first principal component, one can see that the most contributing variables concern work life and finances, such as consumption and average working hours. On the other hand, the second principal component is driven by variables such as life expectancy, greenhouse gas emissions, high life satisfaction or crime, which relate to the general happiness and equality of societies.
Firstly, let's check the clustering tendency of our dataset with the Hopkins statistic. In the clustertend implementation used here, the smaller the statistic, the stronger the clustering tendency (values close to 0.5 suggest essentially random data).
xxx.pca1<-prcomp(df.n, center=FALSE, scale.=FALSE, rank. = 4) # stats::, keep the first 4 PCs
results <- xxx.pca1$x                                         # PC scores
hopkins(df, n=nrow(results)-1)                                # clustertend::, on the data without outliers
## $H
## [1] 0.4246506
The Hopkins statistic does not give optimistic results, which means that the clustering quality measures may also turn out poor. We will see, however, whether performing PCA before clustering improves the results.
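As a side check (not part of the original analysis), the same statistic could also be computed on the PCA scores to see whether the reduced space looks any more clusterable; a minimal sketch:

set.seed(123)                       # hopkins() samples points, so fix the seed
hopkins(results, n=nrow(results)-1) # clustertend:: Hopkins statistic on the 4 PC scores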
After the dimension reduction, clustering will be performed on the features obtained from the PCA. One of the most popular clustering methods is the K-means algorithm. The method requires choosing a distance metric and the number of clusters.
Firstly, a proper number of clusters will be established using the Silhouette statistic and the Gap statistic graph.
xxx.pca1<-prcomp(df.n, center=FALSE, scale.=FALSE, rank. = 4) # stats::
results <- xxx.pca1$x
fviz_nbclust(results, FUNcluster=kmeans, k.max = 8)
The Silhouette statistic is highest for 2 clusters.
fviz_nbclust(results, FUNcluster=kmeans, method="gap_stat", k.max = 8)+ theme_classic()
The biggest jump in the Gap statistic on the graph is visible at 2 clusters. With that being said, two clusters will be used in the further analysis.
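As a side note (not shown in the original report), the gap values behind the plot can also be inspected numerically with cluster::clusGap; a minimal sketch:

set.seed(123)
gap <- clusGap(results, FUNcluster=kmeans, nstart=20, K.max=8, B=50) # cluster::
print(gap, method="firstmax")    # gap statistic for k = 1..8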
The next step is to choose the most suitable distance metric. For that purpose, clustering will be run with two different distance measures, namely the Euclidean and the Manhattan distance:
km1<-eclust(results, "kmeans", hc_metric="euclidean",k=2)
fviz_silhouette(km1)
## cluster size ave.sil.width
## 1 1 14 0.25
## 2 2 15 0.34
km2<-eclust(results, "kmeans", hc_metric="manhattan",k=2)
fviz_silhouette(km2)
## cluster size ave.sil.width
## 1 1 14 0.25
## 2 2 15 0.34
Both measures give the same Silhouette statistic, hence the choice of distance measure makes no difference here. For simplicity, the Euclidean distance will be used. The average silhouette width, (14·0.25 + 15·0.34)/29 ≈ 0.30, indicates that the clustering quality is not the best. Other clustering methods will be tried to see whether this result can be improved.
The obtained results can also be portrayed on a map:
cluster <- km1$cluster
map_cluster <- data.frame(country = c("BE","BG","CZ", "DK", "DE", "EE", "IE", "ES",
"FR", "HR", "IT", "LV", "LT", "LU", "HU", "MT", "NL", "AT", "PL",
"PT", "RO", "SI", "SK", "FI", "SE", "GB", "IS", "NO", "CH"),cluster)
clustering <- joinCountryData2Map(map_cluster,
joinCode = "ISO2",
nameJoinColumn = "country")
## 29 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 214 codes from the map weren't represented in your data
map <- mapCountryData(clustering,
nameColumnToPlot="cluster",
catMethod = "categorical",
missingCountryCol = gray(.8),
addLegend = F,
colourPalette = c("deepskyblue4","aquamarine4"),
borderCol = "black",
mapTitle = "K-means clusters",
xlim = c(-20, 45), ylim = c(40, 71))
do.call(addMapLegendBoxes, c(map,
x = 'right',
title = "Cluster number",
horiz = TRUE,
bg = "transparent",
bty = "n"))
The second clustering method is the PAM algorithm, a more robust version of K-means that is less sensitive to outliers. The overall goal of the method is to cluster points around medoids, which are actual observations, instead of the artificial centroids used by K-means.
As before, let's establish the optimal number of clusters.
# choosing the number of clusters for PAM
fviz_nbclust(results, FUNcluster=cluster::pam, k.max = 7)
fviz_nbclust(results, FUNcluster=cluster::pam, method="gap_stat", k.max = 7)+ theme_classic() # factoextra::
According to the Silhouette statistic, 6-7 clusters should be chosen, whereas the Gap statistic points mainly to 6 clusters. As 6 clusters are supported by both methods, this number will be used in the further analysis.
pam1<-eclust(results, "pam", k=6) # factoextra::
fviz_silhouette(pam1)
## cluster size ave.sil.width
## 1 1 6 0.10
## 2 2 4 0.44
## 3 3 4 0.25
## 4 4 7 0.39
## 5 5 5 0.36
## 6 6 3 0.16
The PAM algorithm brought completely different results than K-means: it established 6 clusters, and on closer inspection the countries appear to be grouped more sensibly than in the 2-cluster K-means solution. Unfortunately, the Silhouette statistic is not higher than for K-means; the weighted average, (6·0.10 + 4·0.44 + 4·0.25 + 7·0.39 + 5·0.36 + 3·0.16)/29, is only about 0.29.
Let’s see how results look on the map.
cluster <- pam1$cluster
map_cluster <- data.frame(country = c("BE","BG","CZ", "DK", "DE", "EE", "IE", "ES",
"FR", "HR", "IT", "LV", "LT", "LU", "HU", "MT", "NL", "AT", "PL",
"PT", "RO", "SI", "SK", "FI", "SE", "GB", "IS", "NO", "CH"),cluster)
clustering <- joinCountryData2Map(map_cluster,
joinCode = "ISO2",
nameJoinColumn = "country")
## 29 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 214 codes from the map weren't represented in your data
map <- mapCountryData(clustering,
nameColumnToPlot="cluster",
catMethod = "categorical",
missingCountryCol = gray(.8),
addLegend = F,
colourPalette = c("deepskyblue4","darkred","aquamarine4","antiquewhite3","coral3","purple"),
borderCol = "black",
mapTitle = "PAM clusters",
xlim = c(-20, 45), ylim = c(40, 71))
do.call(addMapLegendBoxes, c(map,
x = 'top',
title = "Cluster number",
horiz = TRUE,
bg = "transparent",
bty = "n"))
Hierarchical clustering will be conducted using a cluster dendrogram, which presents the hierarchy of clusters.
The next step is to calculate the distances between observations and create a simple dendrogram.
dm<-dist(results)
hc<-hclust(dm, method="complete") # simple dendrogram
plot(hc, hang=-1)
rect.hclust(hc, k=5, border="red")
plot(density(dm))
It looks like 5 different clusters can be distinguished. To assess the quality of such a division, inertia measures need to be calculated.
Inertia is a measure which helps to evaluate the quality of clustering. As a rule of thumb, a good partitioning is characterized by high variability between clusters and small variability inside clusters.
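The chunk that produced the table below is not echoed in the report; the following is a minimal sketch of one way such a table could be computed from the dendrogram cuts (the exact definition of inertia used in the report may differ, so the numbers are not guaranteed to match):

# within-cluster and total inertia for k = 2..5 cuts of the dendrogram
inertia_for_k <- function(scores, tree, k){
  groups  <- cutree(tree, k=k)
  centers <- apply(scores, 2, function(col) tapply(col, groups, mean)) # cluster centroids
  within  <- sum((scores - centers[groups, , drop=FALSE])^2)           # intra-cluster sum of squares
  total   <- sum(scale(scores, scale=FALSE)^2)                         # total sum of squares
  c(intra=within, total=total, percentage=within/total, Q=1-within/total)
}
round(sapply(2:5, function(k) inertia_for_k(results, hc, k)), 4)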
## division to 2 clust. division to 3 clust. division to 4 clust.
## intra-clust 6.7567 5.0868 4.4332
## total 10.1580 10.1580 10.1580
## percentage 0.6652 0.5008 0.4364
## Q 0.3348 0.4992 0.5636
## division to 5 clust.
## intra-clust 3.3482
## total 10.1580
## percentage 0.3296
## Q 0.6704
The above results show that 5 clusters are indeed the most optimal choice. This is due to the high Q statistic, defined as:
Q = 1 - (within-cluster variability)/(total variability)
For example, for 5 clusters Q = 1 - 3.3482/10.1580 ≈ 0.6704.
The final step is to visualize the clusters in two dimensions and compute the Silhouette statistic.
clust.vec.5<-cutree(hc, k=5) # cluster membership for the 5-cluster cut
fviz_cluster(list(data=results, cluster=clust.vec.5))
plot(silhouette(clust.vec.5,dm))
Based on the Silhouette statistic, the clustering results are poor. The reason could also be the high intra-cluster variability relative to the total variability.
Once again, let’s see the final results on the map.
cluster <- clust.vec.5
map_cluster <- data.frame(country = c("BE","BG","CZ", "DK", "DE", "EE", "IE", "ES",
"FR", "HR", "IT", "LV", "LT", "LU", "HU", "MT", "NL", "AT", "PL",
"PT", "RO", "SI", "SK", "FI", "SE", "GB", "IS", "NO", "CH"),cluster)
clustering <- joinCountryData2Map(map_cluster,
joinCode = "ISO2",
nameJoinColumn = "country")
## 29 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 214 codes from the map weren't represented in your data
map <- mapCountryData(clustering,
nameColumnToPlot="cluster",
catMethod = "categorical",
missingCountryCol = gray(.8),
addLegend = F,
colourPalette = c("deepskyblue4","darkred","aquamarine4","antiquewhite3","coral3"),
borderCol = "black",
mapTitle = "Hierarchical clusters",
xlim = c(-20, 45), ylim = c(40, 71))
do.call(addMapLegendBoxes, c(map,
x = 'top',
title = "Cluster number",
horiz = TRUE,
bg = "transparent",
bty = "n"))
In this part of the analysis, clustering will be performed on the raw data, i.e. the standardized variables without dimension reduction. All results will be compared with the corresponding results of clustering on the principal components. The outcomes will show whether using PCA before clustering indeed helps with the problem of high dimensionality.
km1<-eclust(df.n, "kmeans", hc_metric="euclidean",k=2)
fviz_silhouette(km1)
## cluster size ave.sil.width
## 1 1 14 0.20
## 2 2 15 0.22
km1<-eclust(results, "kmeans", hc_metric="euclidean",k=2)
fviz_silhouette(km1)
## cluster size ave.sil.width
## 1 1 14 0.25
## 2 2 15 0.34
Looking at the above results, clustering on the raw data clearly has a smaller average silhouette width. With that being said, the PCA step did help and improved the final results of K-means clustering.
pam1<-eclust(df.n, "pam", k=6) # factoextra::
fviz_silhouette(pam1)
## cluster size ave.sil.width
## 1 1 6 0.10
## 2 2 1 0.00
## 3 3 7 0.09
## 4 4 7 0.18
## 5 5 6 -0.02
## 6 6 2 0.37
pam1<-eclust(results, "pam", k=6) # factoextra::
fviz_silhouette(pam1)
## cluster size ave.sil.width
## 1 1 6 0.10
## 2 2 4 0.44
## 3 3 4 0.25
## 4 4 7 0.39
## 5 5 5 0.36
## 6 6 3 0.16
Using PCA before PAM clustering also proved to be a good decision: after all, the average silhouette width is higher for the PCA-based solution.
dm<-dist(df.n)
hc<-hclust(dm, method="complete")
clust.vec.5<-cutree(hc, k=5)
fviz_cluster(list(data=df.n, cluster=clust.vec.5))
plot(silhouette(clust.vec.5,dm))
dm<-dist(results)
hc<-hclust(dm, method="complete")
clust.vec.5<-cutree(hc, k=5)
fviz_cluster(list(data=results, cluster=clust.vec.5))
plot(silhouette(clust.vec.5,dm))
Last but not least, hierarchical clustering also produced better results on the principal components, judging by the silhouette statistic.
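For completeness (and assuming the objects from the chunks above are still in memory), the average silhouette widths behind those two plots can also be extracted numerically rather than read off the graphs; a minimal sketch:

sil_raw <- silhouette(cutree(hclust(dist(df.n),    method="complete"), k=5), dist(df.n))
sil_pca <- silhouette(cutree(hclust(dist(results), method="complete"), k=5), dist(results))
summary(sil_raw)$avg.width   # raw (standardized) data
summary(sil_pca)$avg.width   # PCA scores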
In conclusion, the Curse of Dimensionality is indeed a serious problem, especially nowadays when we deal with more and more high-dimensional data. One of the solutions is to apply Principal Component Analysis before clustering, both to reduce the number of dimensions and to remove unnecessary noise from the data. The research above shows that PCA can improve clustering performance (measured by the average silhouette width) for all of the methods used: K-means, PAM, and hierarchical clustering.
Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (2001, January). On the surprising behavior of distance metrics in high dimensional space. In International conference on database theory (pp. 420-434). Springer, Berlin, Heidelberg.
Ben-Hur, A., & Guyon, I. (2003). Detecting stable clusters using principal component analysis. In Functional genomics (pp. 159-182). Humana Press.
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2), 179-185.