Introduction
Description of data
Dimensions reduction
Clustering Principal components
Clustering Raw Data
Summary
Bibliography
In modern times, we tend to store larger and larger amounts of data. Because of that, data scientists also have more choice when it comes to selecting features/variables for their models. But is having more features/variables necessarily helpful when it comes to clustering methods? In fact, when the number of features is high, the Curse of Dimensionality may occur. According to Aggarwal, Hinneburg and Keim (2001), adding more variables to nearest-neighbor search models does not bring better results. The authors also imply that the problem is even more severe when the Euclidean metric is used. This happens because distance in high dimensions does not behave as intuitively as in 2 or 3 dimensions: as more dimensions are added, the ratios between pairwise distances become nearly uniform, hence the Euclidean distance metric becomes almost useless for telling nearest neighbors apart from the rest (Aggarwal et al., 2001).
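To illustrate this effect, the following small simulation (a hedged sketch, not taken from the cited paper) draws random points in an increasing number of dimensions and shows how the ratio between the farthest and the nearest neighbor of a point shrinks towards 1:

# Hedged illustration of distance concentration: for uniformly drawn points,
# the farthest-to-nearest Euclidean distance ratio from a query point
# approaches 1 as the number of dimensions grows.
set.seed(1)
for (d in c(2, 10, 100, 1000)) {
  x <- matrix(runif(100 * d), ncol = d)   # 100 random points in d dimensions
  dists <- as.matrix(dist(x))[1, -1]      # Euclidean distances from the first point
  cat("d =", d, " max/min distance ratio =", round(max(dists) / min(dists), 2), "\n")
}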
To deal with the Curse of Dimensionality, dimension-reduction methods can be applied before any clustering. This reduces unnecessary noise in the data, which shows that dimension reduction can also serve as a form of pre-processing that keeps the most useful information contained in the analyzed data (Ben-Hur and Guyon, 2003).
In the following research, I will verify whether dimension reduction performed before clustering helps to achieve better clustering results. To do that, a data set of 18 quality-of-life features for 31 countries will be used. Having more than half as many variables as observations can indeed be problematic according to the Curse of Dimensionality. Additionally, the focus will also be on searching for patterns in the clustered data, which could tell us more about groups of countries that are similar in terms of quality of life.
The analysis covers 31 European countries in the year 2018. The data set consists of 18 indicators, each of which describes a different aspect of the quality of life of a country's inhabitants. All data were taken from the Eurostat database. The list of variables can be found below:
Firstly, let's load all required packages and read the data.
library(knitr)
library(dplyr)
library(tidyr)
library(clustertend)
library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(ClusterR)
library(clusterSim)
library(smacof)
library(gridExtra)
library(MASS)
library(paran)
library(BBmisc)
library(stats)
library(rworldmap)
library(ClustGeo)
df<-read.csv("country.csv", sep=";", dec=".", header=TRUE)
After reading the data, the first dimension-reduction method will be applied: Multidimensional Scaling (MDS). The first steps are to create a matrix of unlabeled data and then to standardize it. Standardization reduces the effect of extreme values on the MDS results. Moreover, the analyzed variables are not expressed in the same units, hence standardization will be used for all dimension-reduction and clustering techniques.
# separate labels from the data
country <- df[,1]                # country names from the first column
variable <- colnames(df[1,2:19]) # variable names
df <- data.frame(df[,2:19], row.names = df[,1]) # first column used as row index
data_raw <- as.matrix(df)        # numeric data only
df.n <- data.Normalization(data_raw, type="n1", normalization="column") # clusterSim::
Now, it's time for Multidimensional Scaling of all countries in our data. With the following code, the distances between observations are computed. Then, the configuration in 2 dimensions is optimized so that its inter-point distances correspond as closely as possible to the distances in the original matrix. With only two dimensions, we will be able to see whether any clusters emerge.
# analysis for countries
distance <- dist(df.n)
mds1<-cmdscale(distance, k=2)
plot(mds1, type='n')
text(mds1, labels=country, cex=0.6, adj=0.5)
As one can see from the graph above, there are some outliers which might distort the results. Hence, the next step is to remove them.
# looking for outliers
data.out<-which(mds1[,1]>4.1 | mds1[,1]<(-4.1)) # arbitrary limits on the first dimension
data.all<-c(data.out)                           # indices of the outlying observations
# visualization with outliers highlighted
plot(mds1)
abline(v=c(-4.1, 4.1), lty=3, col=2)            # vertical cut-off lines
points(mds1[data.all,], pch=21, bg="red", cex=2)
text(mds1[data.all,]-0.5, labels=rownames(mds1[data.all,]))
# removing outliers from the database
df.n <- df.n[-c(8,31),]
df <- df[-c(8,31),]
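After removing the outliers, the MDS map can be redrawn to verify how the cleaned data look. The corresponding chunk is not echoed above, so the following is a minimal sketch, assuming the country names are kept as row names of df.n:

# MDS on the cleaned, standardized data
distance2 <- dist(df.n)
mds2 <- cmdscale(distance2, k=2)
plot(mds2, type='n')
text(mds2, labels=rownames(df.n), cex=0.6, adj=0.5)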
Greece and Turkey were identified as outliers, hence they will be omitted from the further analysis. They could also strongly affect the PCA, which is our next step. However, even without the outliers there are still no clearly visible clusters.
As the next step, PCA will be applied to check whether it brings better results.
Principal Component Analysis is a well-known method which enables researchers to reduce the number of data dimensions by creating new, uncorrelated components (eigenvectors) that carry the most variance, i.e. the most information. Firstly, let's take a glance at the initial PCA results.
xxx.pca1<-prcomp(df.n, center=FALSE, scale.=FALSE)
summary(xxx.pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3063 1.5445 1.2911 1.13655 0.92700 0.82541 0.77157
## Proportion of Variance 0.3618 0.1622 0.1134 0.08785 0.05844 0.04633 0.04049
## Cumulative Proportion 0.3618 0.5240 0.6373 0.72520 0.78364 0.82997 0.87046
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.6598 0.5764 0.5144 0.48224 0.41619 0.38759 0.32713
## Proportion of Variance 0.0296 0.0226 0.0180 0.01582 0.01178 0.01022 0.00728
## Cumulative Proportion 0.9001 0.9227 0.9406 0.95647 0.96825 0.97847 0.98574
## PC15 PC16 PC17 PC18
## Standard deviation 0.29968 0.2628 0.1957 0.11171
## Proportion of Variance 0.00611 0.0047 0.0026 0.00085
## Cumulative Proportion 0.99185 0.9966 0.9991 1.00000
# visualisation of the PCA quality
fviz_eig(xxx.pca1)                          # scree plot of explained variance
fviz_pca_var(xxx.pca1, col.var="steelblue") # variable map
The cumulative variance of the first two principal components equals 0.524. Although it would be easier to plot the data in 2 dimensions, more principal components may be needed to explain enough variance. In order to determine the adequate number of PCs, two methods will be used: the Kaiser criterion and the scree plot adjusted with parallel analysis.
The Kaiser criterion helps to establish the number of principal components that explain enough variance. According to the criterion, one should keep as many PCs as there are eigenvalues greater than 1.
# table of eigenvalues
eig.val<-get_eigenvalue(xxx.pca1)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 5.31922820 36.1747509 36.17475
## Dim.2 2.38546983 16.2229883 52.39774
## Dim.3 1.66705179 11.3372055 63.73494
## Dim.4 1.29175629 8.7849139 72.51986
## Dim.5 0.85932830 5.8440785 78.36394
## Dim.6 0.68129903 4.6333456 82.99728
## Dim.7 0.59532331 4.0486461 87.04593
## Dim.8 0.43530530 2.9604033 90.00633
## Dim.9 0.33225811 2.2596049 92.26594
## Dim.10 0.26460396 1.7995059 94.06544
## Dim.11 0.23255359 1.5815392 95.64698
## Dim.12 0.17321546 1.1779954 96.82498
## Dim.13 0.15022582 1.0216485 97.84663
## Dim.14 0.10701204 0.7277623 98.57439
## Dim.15 0.08980844 0.6107649 99.18515
## Dim.16 0.06904650 0.4695681 99.65472
## Dim.17 0.03829159 0.2604116 99.91513
## Dim.18 0.01247908 0.0848671 100.00000
In the analyzed example 4 eigenvalues are greater than 1, therefore according to the criterion 4 principal components should be chosen.
Parallel analysis with the scree plot was introduced by Horn (1965). It was designed to assess, both graphically and numerically, the number of components to keep in a PCA. It is based on the assumption that if the data were random, the factors would be uncorrelated and the eigenvalues would equal 1 on average. Horn's technique compares the obtained PCA eigenvalues with the eigenvalues of random data. For this purpose the paran package may be found useful.
According to the paran package description, the results below can be read in two ways: from the table, where adjusted eigenvalues greater than 1 indicate components to retain, and from the graph, where the retained components are those whose unadjusted eigenvalues lie above the random-data eigenvalues.
paran(df.n, iterations=10000, quietly=FALSE,
status=FALSE, all=TRUE, cfa=FALSE, graph=TRUE,
color=TRUE, col=c("black","red","blue"),
lty=c(1,2,3), lwd=1, legend=TRUE, file="",
width=640, height=640, grdevice="png", seed=0, mat=NA, n=NA)
##
## Using eigendecomposition of correlation matrix.
##
## Results of Horn's Parallel Analysis for component retention
## 10000 iterations, using the mean estimate
##
## --------------------------------------------------
## Component Adjusted Unadjusted Estimated
## Eigenvalue Eigenvalue Bias
## --------------------------------------------------
## 1 4.547977 6.271412 1.723435
## 2 1.520106 2.826159 1.306052
## 3 1.002757 2.005554 1.002797
## 4 0.801781 1.556885 0.755103
## 5 0.603719 1.140798 0.537079
## 6 0.461683 0.807473 0.345789
## 7 0.598184 0.772484 0.174299
## 8 0.629686 0.646852 0.017166
## 9 0.599397 0.473681 -0.12571
## 10 0.706564 0.449794 -0.25677
## 11 0.652416 0.276837 -0.37557
## 12 0.691169 0.208176 -0.48299
## 13 0.761849 0.180710 -0.58113
## 14 0.798916 0.130131 -0.66878
## 15 0.866667 0.120080 -0.74658
## 16 0.894284 0.078217 -0.81606
## 17 0.919190 0.042183 -0.87700
## 18 0.943645 0.012565 -0.93108
## --------------------------------------------------
##
## Adjusted eigenvalues > 1 indicate dimensions to retain.
## (3 components retained)
The results in both the table and the graph show that three components should be retained: only the first three have adjusted eigenvalues greater than 1, i.e. eigenvalues higher than the ones obtained from random data. This is a different conclusion than the Kaiser criterion, which suggested 4 PCs. However, as three components explain only about 63% of the variance and four explain about 72%, 4 dimensions will be chosen for the sake of keeping more information.
Now, let's see which variables contribute the most to the first four principal components (the ones that carry the most variance).
var<-get_pca_var(xxx.pca1)
a<-fviz_contrib(xxx.pca1, "var", axes=1, xtickslab.rt=90) # default angle=45°
plot(a, main = "Variable contributions to the first Principal Component")
a<-fviz_contrib(xxx.pca1, "var", axes=2, xtickslab.rt=90)
plot(a, main = "Variable contributions to the second Principal Component")
a<-fviz_contrib(xxx.pca1, "var", axes=3, xtickslab.rt=90)
plot(a, main = "Variable contributions to the third Principal Component")
a<-fviz_contrib(xxx.pca1, "var", axes=4, xtickslab.rt=90)
plot(a, main = "Variable contributions to the fourth Principal Component")
Looking at the first principal component, one can see that the most contributing variables concern work life and finances, such as consumption and average working hours. On the other hand, the second principal component is driven by variables such as life expectancy, greenhouse gas emissions, high life satisfaction or crime, which relate to the general happiness and equality of societies.
Firstly, let's check the clustering tendency of our dataset with the Hopkins statistic. In the clustertend implementation used here, the smaller the statistic, the stronger the clustering tendency (values close to 0.5 suggest essentially random data).
xxx.pca1<-prcomp(df.n, center=FALSE, scale.=FALSE, rank. = 4) # stats::, keep the first 4 PCs
results <- xxx.pca1$x                                         # PC scores
hopkins(df, n=nrow(results)-1)                                # clustertend::, on the data without outliers
## $H
## [1] 0.4246506
The Hopkins statistic does not give optimistic results, which means that the clustering quality measures may also turn out poor. We will see, however, whether performing PCA before clustering improves the results.
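As a side check (not part of the original analysis), the same statistic could also be computed on the PCA scores to see whether the reduced space looks any more clusterable; a minimal sketch:

set.seed(123)                       # hopkins() samples points, so fix the seed
hopkins(results, n=nrow(results)-1) # clustertend:: Hopkins statistic on the 4 PC scores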
After the dimension reduction, clustering will be performed on the features obtained from the PCA. One of the most popular clustering methods is the K-means algorithm. The method requires choosing a distance metric and the number of clusters.
Firstly, a proper number of clusters will be established using the Silhouette statistic and the Gap statistic graph.
xxx.pca1<-prcomp(df.n, center=FALSE, scale.=FALSE, rank. = 4) # stats::
results <- xxx.pca1$x
fviz_nbclust(results, FUNcluster=kmeans, k.max = 8)
The Silhouette statistic is highest for 2 clusters.
fviz_nbclust(results, FUNcluster=kmeans, method="gap_stat", k.max = 8)+ theme_classic()
The biggest jump in the Gap statistic on the graph is visible at 2 clusters. With that being said, two clusters will be used in the further analysis.
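As a side note (not shown in the original report), the gap values behind the plot can also be inspected numerically with cluster::clusGap; a minimal sketch:

set.seed(123)
gap <- clusGap(results, FUNcluster=kmeans, nstart=20, K.max=8, B=50) # cluster::
print(gap, method="firstmax")    # gap statistic for k = 1..8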
The next step is to choose the most suitable distance metric. For that purpose, clustering will be run with two different distance measures, namely the Euclidean and the Manhattan distance:
km1<-eclust(results, "kmeans", hc_metric="euclidean",k=2)
fviz_silhouette(km1)
## cluster size ave.sil.width
## 1 1 14 0.25
## 2 2 15 0.34
km2<-eclust(results, "kmeans", hc_metric="manhattan",k=2)
fviz_silhouette(km2)
## cluster size ave.sil.width
## 1 1 14 0.25
## 2 2 15 0.34
Both measures give the same Silhouette statistic, hence the choice of distance measure makes no difference here. For simplicity, the Euclidean distance will be used. The average silhouette width, (14·0.25 + 15·0.34)/29 ≈ 0.30, indicates that the clustering quality is not the best. Other clustering methods will be tried to see whether this result can be improved.
The obtained results can also be portrayed on a map:
cluster <- km1$cluster
map_cluster <- data.frame(country = c("BE","BG","CZ", "DK", "DE", "EE", "IE", "ES",
"FR", "HR", "IT", "LV", "LT", "LU", "HU", "MT", "NL", "AT", "PL",
"PT", "RO", "SI", "SK", "FI", "SE", "GB", "IS", "NO", "CH"),cluster)
clustering <- joinCountryData2Map(map_cluster,
joinCode = "ISO2",
nameJoinColumn = "country")
## 29 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 214 codes from the map weren't represented in your data
map <- mapCountryData(clustering,
nameColumnToPlot="cluster",
catMethod = "categorical",
missingCountryCol = gray(.8),
addLegend = F,
colourPalette = c("deepskyblue4","aquamarine4"),
borderCol = "black",
mapTitle = "K-means clusters",
xlim = c(-20, 45), ylim = c(40, 71))
do.call(addMapLegendBoxes, c(map,
x = 'right',
title = "Cluster number",
horiz = TRUE,
bg = "transparent",
bty = "n"))
The second clustering method is the PAM algorithm, a more robust version of K-means that is less sensitive to outliers. The overall goal of the method is to cluster points around medoids, which are actual observations, instead of the artificial centroids used by K-means.
As before, let's establish the optimal number of clusters.
# choosing the number of clusters for PAM
fviz_nbclust(results, FUNcluster=cluster::pam, k.max = 7)
fviz_nbclust(results, FUNcluster=cluster::pam, method="gap_stat", k.max = 7)+ theme_classic() # factoextra::
According to the Silhouette statistic, 6-7 clusters should be chosen, whereas the Gap statistic points mainly to 6 clusters. As 6 clusters are supported by both methods, this number will be used in the further analysis.
pam1<-eclust(results, "pam", k=6) # factoextra::
fviz_silhouette(pam1)
## cluster size ave.sil.width
## 1 1 6 0.10
## 2 2 4 0.44
## 3 3 4 0.25
## 4 4 7 0.39
## 5 5 5 0.36
## 6 6 3 0.16
The PAM algorithm brought completely different results than K-means: it established 6 clusters, and on closer inspection the countries appear to be grouped more sensibly than in the 2-cluster K-means solution. Unfortunately, the Silhouette statistic is not higher than for K-means; the weighted average, (6·0.10 + 4·0.44 + 4·0.25 + 7·0.39 + 5·0.36 + 3·0.16)/29, is only about 0.29.
Let’s see how results look on the map.
cluster <- pam1$cluster
map_cluster <- data.frame(country = c("BE","BG","CZ", "DK", "DE", "EE", "IE", "ES",
"FR", "HR", "IT", "LV", "LT", "LU", "HU", "MT", "NL", "AT", "PL",
"PT", "RO", "SI", "SK", "FI", "SE", "GB", "IS", "NO", "CH"),cluster)
clustering <- joinCountryData2Map(map_cluster,
joinCode = "ISO2",
nameJoinColumn = "country")
## 29 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 214 codes from the map weren't represented in your data
map <- mapCountryData(clustering,
nameColumnToPlot="cluster",
catMethod = "categorical",
missingCountryCol = gray(.8),
addLegend = F,
colourPalette = c("deepskyblue4","darkred","aquamarine4","antiquewhite3","coral3","purple"),
borderCol = "black",
mapTitle = "PAM clusters",
xlim = c(-20, 45), ylim = c(40, 71))
do.call(addMapLegendBoxes, c(map,
x = 'top',
title = "Cluster number",
horiz = TRUE,
bg = "transparent",
bty = "n"))
Hierarchical clustering will be conducted using a cluster dendrogram, which presents the hierarchy of clusters.
The next step is to calculate the distances between observations and create a simple dendrogram.
dm<-dist(results)
hc<-hclust(dm, method="complete") # simple dendrogram
plot(hc, hang=-1)
rect.hclust(hc, k=5, border="red")
plot(density(dm))
It looks like 5 different clusters can be distinguished. To assess the quality of such a division, inertia measures need to be calculated.
Inertia is a measure which helps to evaluate the quality of clustering. As a rule of thumb, a good partitioning is characterized by high variability between clusters and small variability inside clusters.
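The chunk that produced the table below is not echoed in the report; the following is a minimal sketch of one way such a table could be computed from the dendrogram cuts (the exact definition of inertia used in the report may differ, so the numbers are not guaranteed to match):

# within-cluster and total inertia for k = 2..5 cuts of the dendrogram
inertia_for_k <- function(scores, tree, k){
  groups  <- cutree(tree, k=k)
  centers <- apply(scores, 2, function(col) tapply(col, groups, mean)) # cluster centroids
  within  <- sum((scores - centers[groups, , drop=FALSE])^2)           # intra-cluster sum of squares
  total   <- sum(scale(scores, scale=FALSE)^2)                         # total sum of squares
  c(intra=within, total=total, percentage=within/total, Q=1-within/total)
}
round(sapply(2:5, function(k) inertia_for_k(results, hc, k)), 4)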
## division to 2 clust. division to 3 clust. division to 4 clust.
## intra-clust 6.7567 5.0868 4.4332
## total 10.1580 10.1580 10.1580
## percentage 0.6652 0.5008 0.4364
## Q 0.3348 0.4992 0.5636
## division to 5 clust.
## intra-clust 3.3482
## total 10.1580
## percentage 0.3296
## Q 0.6704
The above results show that 5 clusters are indeed the most optimal choice. This is due to the high Q statistic, defined as:
Q = 1 - (within-cluster variability)/(total variability)
For example, for 5 clusters Q = 1 - 3.3482/10.1580 ≈ 0.6704.
The final step is to visualize the clusters in two dimensions and compute the Silhouette statistic.
clust.vec.5<-cutree(hc, k=5) # cluster membership for the 5-cluster cut
fviz_cluster(list(data=results, cluster=clust.vec.5))
plot(silhouette(clust.vec.5,dm))
Based on the Silhouette statistic, the clustering results are poor. The reason could also be the high intra-cluster variability relative to the total variability.
Once again, let’s see the final results on the map.
cluster <- clust.vec.5
map_cluster <- data.frame(country = c("BE","BG","CZ", "DK", "DE", "EE", "IE", "ES",
"FR", "HR", "IT", "LV", "LT", "LU", "HU", "MT", "NL", "AT", "PL",
"PT", "RO", "SI", "SK", "FI", "SE", "GB", "IS", "NO", "CH"),cluster)
clustering <- joinCountryData2Map(map_cluster,
joinCode = "ISO2",
nameJoinColumn = "country")
## 29 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 214 codes from the map weren't represented in your data
map <- mapCountryData(clustering,
nameColumnToPlot="cluster",
catMethod = "categorical",
missingCountryCol = gray(.8),
addLegend = F,
colourPalette = c("deepskyblue4","darkred","aquamarine4","antiquewhite3","coral3"),
borderCol = "black",
mapTitle = "Hierarchical clusters",
xlim = c(-20, 45), ylim = c(40, 71))
do.call(addMapLegendBoxes, c(map,
x = 'top',
title = "Cluster number",
horiz = TRUE,
bg = "transparent",
bty = "n"))
In this part of the analysis, clustering will be performed on the raw data, i.e. the standardized variables without dimension reduction. All results will be compared with the corresponding results of clustering on the principal components. The outcomes will show whether using PCA before clustering indeed helps with the problem of high dimensionality.
km1<-eclust(df.n, "kmeans", hc_metric="euclidean",k=2)
fviz_silhouette(km1)
## cluster size ave.sil.width
## 1 1 14 0.20
## 2 2 15 0.22
km1<-eclust(results, "kmeans", hc_metric="euclidean",k=2)
fviz_silhouette(km1)
## cluster size ave.sil.width
## 1 1 14 0.25
## 2 2 15 0.34
Looking at the above results, clustering on the raw data clearly has a smaller average silhouette width. With that being said, the PCA step did help and improved the final results of K-means clustering.
pam1<-eclust(df.n, "pam", k=6) # factoextra::
fviz_silhouette(pam1)
## cluster size ave.sil.width
## 1 1 6 0.10
## 2 2 1 0.00
## 3 3 7 0.09
## 4 4 7 0.18
## 5 5 6 -0.02
## 6 6 2 0.37
pam1<-eclust(results, "pam", k=6) # factoextra::
fviz_silhouette(pam1)
## cluster size ave.sil.width
## 1 1 6 0.10
## 2 2 4 0.44
## 3 3 4 0.25
## 4 4 7 0.39
## 5 5 5 0.36
## 6 6 3 0.16
Using PCA before PAM clustering also proved to be a good decision: after all, the average silhouette width is higher for the PCA-based solution.
dm<-dist(df.n)
hc<-hclust(dm, method="complete")
clust.vec.5<-cutree(hc, k=5)
fviz_cluster(list(data=df.n, cluster=clust.vec.5))
plot(silhouette(clust.vec.5,dm))
dm<-dist(results)
hc<-hclust(dm, method="complete")
clust.vec.5<-cutree(hc, k=5)
fviz_cluster(list(data=results, cluster=clust.vec.5))
plot(silhouette(clust.vec.5,dm))
Last but not least, hierarchical clustering also produced better results on the principal components, judging by the silhouette statistic.
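For completeness (and assuming the objects from the chunks above are still in memory), the average silhouette widths behind those two plots can also be extracted numerically rather than read off the graphs; a minimal sketch:

sil_raw <- silhouette(cutree(hclust(dist(df.n),    method="complete"), k=5), dist(df.n))
sil_pca <- silhouette(cutree(hclust(dist(results), method="complete"), k=5), dist(results))
summary(sil_raw)$avg.width   # raw (standardized) data
summary(sil_pca)$avg.width   # PCA scores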
In conclusion, the Curse of Dimensionality is indeed a serious problem, especially nowadays when we deal with more and more high-dimensional data. One of the solutions is to apply Principal Component Analysis before clustering, both to reduce the number of dimensions and to remove unnecessary noise from the data. The research above shows that PCA can improve clustering performance (measured by the average silhouette width) for all of the methods used: K-means, PAM, and hierarchical clustering.
Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (2001, January). On the surprising behavior of distance metrics in high dimensional space. In International conference on database theory (pp. 420-434). Springer, Berlin, Heidelberg.
Ben-Hur, A., & Guyon, I. (2003). Detecting stable clusters using principal component analysis. In Functional genomics (pp. 159-182). Humana Press.
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2), 179-185.