Analyse the dataset ‘macro’ using tandem analysis (PCA, followed by k-means), k-means analysis and reduced k-means analysis (PCA and k-means simultaneously) and compare the results in order to find the best clustering solution.
We begin by installing the following packages and loading the libraries (a brief installation sketch is given after this list):
‘clustrd’, which provides the ‘cluspca’ function for performing PCA and K-means jointly (reduced K-means).
‘devtools’, which we use to install the ‘factoextra’ package from GitHub.
‘factoextra’, which is used to visualize the results of the clustering analysis.
‘mclust’, which is used to calculate the Adjusted Rand Index, a measure of the similarity between two clusterings.
‘cluster’, which is used to assess the quality of clustering via silhouette analysis.
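For reference, a minimal installation sketch (run once; note that ‘factoextra’ can be installed either from CRAN or, via ‘devtools’, from its GitHub repository):
# one-time installation (not run)
install.packages(c("clustrd", "devtools", "mclust", "cluster"))
devtools::install_github("kassambara/factoextra")  # or install.packages("factoextra")
We then load the libraries: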
library(clustrd)
## Loading required package: ggplot2
## Loading required package: grid
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(devtools)
## Loading required package: usethis
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(mclust)
## Package 'mclust' version 6.1.1
## Type 'citation("mclust")' for citing this R package in publications.
library(cluster)
We load the ‘macro’ dataset, which contains six macroeconomic indicators — GDP, leading indicator (LI), unemployment rate (UR), interest rate (IR), trade balance (TB) and net national savings (NNS) — for 20 OECD countries. We scale the data and reassign the result to the variable ‘macro’.
data("macro")
macro <- scale(macro)
macro.pca <- prcomp(macro, center = TRUE, scale. = TRUE)
macro.pca$sdev
## [1] 1.3163087 1.1999971 1.0872354 0.8074061 0.7665947 0.6369345
By the Kaiser rule (retain components whose eigenvalue exceeds 1, i.e. whose standard deviation exceeds 1 for standardized data), we would select the 1st, 2nd and 3rd principal components to capture a significant portion of the variance (and thus information about the data). However, in order to be able to view 2D plots, we will only use the first two components. This may have consequences for the clustering results.
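As a quick check of the Kaiser rule, we can inspect the eigenvalues (the squared standard deviations) and the cumulative proportion of variance explained; a minimal sketch:
eig <- macro.pca$sdev^2
round(eig, 3)                     # eigenvalues; the Kaiser rule keeps those above 1
round(cumsum(eig) / sum(eig), 3)  # cumulative proportion of variance explained
With this in mind, we proceed with K-means on the first two component scores: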
macro.pca.km3 <- kmeans(macro.pca$x[,1:2], 3, iter.max = 10, nstart = 10)
macro.pca.km3$cluster
## Australia Canada Finland France Spain Sweden
## 1 1 1 1 1 1
## USA Netherlands Greece Mexico Portugal Austria
## 3 2 3 3 2 2
## Belgium Denmark Germany Italy Japan Norway
## 2 2 2 1 2 2
## Switzerland UK
## 2 3
plot(macro.pca$x[, 1:2],
     pch = 19,
     col = macro.pca.km3$cluster,
     main = "K-means with 3 clusters on the first 2 principal components")
text(macro.pca$x[, 1:2], labels = rownames(macro), cex = 0.6, pos = 3)
Even with limited knowledge of the economics of the countries listed, it seems strange that the USA and UK have been grouped with Mexico and Greece. This is most likely due to our selection of only the first 2 (and not 3) principal components. This means that a significant portion (47.1%) of the variance (and hence information about the variables) is not being considered.
print(macro.pca.km3)
## K-means clustering with 3 clusters of sizes 7, 9, 4
##
## Cluster means:
## PC1 PC2
## 1 -1.2316556 0.7097318
## 2 1.2073667 0.2373445
## 3 -0.5611776 -1.7760559
##
## Clustering vector:
## Australia Canada Finland France Spain Sweden
## 1 1 1 1 1 1
## USA Netherlands Greece Mexico Portugal Austria
## 3 2 3 3 2 2
## Belgium Denmark Germany Italy Japan Norway
## 2 2 2 1 2 2
## Switzerland UK
## 2 3
##
## Within cluster sum of squares by cluster:
## [1] 6.537273 5.959924 6.134729
## (between_SS / total_SS = 69.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
macro.pca.km3$tot.withinss
## [1] 18.63193
Examining ‘tot.withinss’ (the total within-cluster sum of squares), we see a value of 18.6. This measures how tightly the data points in each cluster are grouped around their centroid: a lower value indicates more compact clusters. The value is only directly comparable between solutions computed on the same data, however, since dropping principal components discards variance and mechanically lowers the within-cluster sum of squares.
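As a sanity check, ‘tot.withinss’ is simply the sum of the per-cluster ‘withinss’ values, and the ratio ‘betweenss/totss’ gives the proportion of variance separated between clusters (about 69% here, as in the printout above); a small sketch:
sum(macro.pca.km3$withinss)                    # equals tot.withinss (about 18.6)
macro.pca.km3$betweenss / macro.pca.km3$totss  # between_SS / total_SS (about 0.69)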
For comparison, we can also perform K-means clustering on the first 3 principal components:
macro.pca3.km3 <- kmeans(macro.pca$x[,1:3], 3, iter.max = 10, nstart = 10)
macro.pca3.km3$tot.withinss
## [1] 38.88012
lambdas <-macro.pca$sdev^2/sum(macro.pca$sdev^2)
round(lambdas,3)
## [1] 0.289 0.240 0.197 0.109 0.098 0.068
The ‘tot.withinss’ value is now 38.9. This is higher than the previous value, but it is more meaningful as it takes into account much more of the variance in the data (72.6% rather than only 52.9%).
macro.km3 <- kmeans(macro, 3, iter.max = 10, nstart = 10)
macro.km3$cluster
## Australia Canada Finland France Spain Sweden
## 3 3 3 3 3 3
## USA Netherlands Greece Mexico Portugal Austria
## 3 1 2 2 2 1
## Belgium Denmark Germany Italy Japan Norway
## 1 1 1 1 1 1
## Switzerland UK
## 1 1
plot(macro.pca$x[, 1:2],
     pch = 19,
     col = macro.km3$cluster,
     main = "K-means with 3 clusters represented on the first 2 principal components")
text(macro.pca$x[, 1:2], labels = rownames(macro), cex = 0.6, pos = 3)
This time, the USA is grouped with Canada and Australia, and the UK has been grouped with many other European countries. The smallest group now contains only Greece, Portugal and Mexico. This seems more reasonable than the previous clustering.
print(macro.km3)
## K-means clustering with 3 clusters of sizes 10, 3, 7
##
## Cluster means:
## GDP LI UR IR TB NNS
## 1 -0.8175538 0.003067305 -0.3397450 -0.3515438 0.2997370 0.3390524
## 2 0.3333478 -0.574608549 -0.3883388 1.6497683 -1.6445426 0.5684991
## 3 1.0250706 0.241878942 0.6517809 -0.2048381 0.2766082 -0.7280030
##
## Clustering vector:
## Australia Canada Finland France Spain Sweden
## 3 3 3 3 3 3
## USA Netherlands Greece Mexico Portugal Austria
## 3 1 2 2 2 1
## Belgium Denmark Germany Italy Japan Norway
## 1 1 1 1 1 1
## Switzerland UK
## 1 1
##
## Within cluster sum of squares by cluster:
## [1] 27.48119 20.96884 20.12533
## (between_SS / total_SS = 39.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
macro.km3$tot.withinss
## [1] 68.57536
The minimised objective function value is 68.6. This is higher, and hence nominally worse, than the values reached with tandem analysis, but it is the more meaningful figure as it is computed on all of the variance in the data (100% rather than only 52.9%). Note that although we have plotted the clusters against the first 2 principal components, the clustering itself was performed on the original (scaled) data.
We can compare the two cluster solutions, plotted against the first two principal components:
fviz_cluster(macro.pca.km3, macro,
             main = "TANDEM ANALYSIS: PCA (2 components), then K-means (k=3)")
fviz_cluster(macro.km3, macro,
             main = "K-MEANS ANALYSIS: K-means (k=3) clusters based on scaled data, plotted against the first 2 principal components")
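To inspect the two solutions side by side, one option is to arrange the two ggplot objects returned by ‘fviz_cluster’ in a grid; a sketch assuming the ‘gridExtra’ package is installed:
library(gridExtra)  # assumed to be available
p1 <- fviz_cluster(macro.pca.km3, macro, main = "Tandem: PCA (2 components), then K-means (k=3)")
p2 <- fviz_cluster(macro.km3, macro, main = "K-means (k=3) on the scaled data")
grid.arrange(p1, p2, ncol = 2)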
We use the ‘cluspca’ function from the ‘clustrd’ package to perform PCA and K-means clustering in one step. We specify k=3, the number of principal components (2), the clustering method (reduced K-means), the rotation method (varimax), and we centre and scale the data.
macro.RKM <- cluspca(macro, 3, 2, method = "RKM", center = TRUE, scale = TRUE, rotation = "varimax")
summary(macro.RKM)
## Solution with 3 clusters of sizes 10 (50%), 7 (35%), 3 (15%) in 2 dimensions. Variables were mean centered and standardized.
##
## Cluster centroids:
## Dim.1 Dim.2
## Cluster 1 0.9264 -0.5039
## Cluster 2 -1.4344 -0.3536
## Cluster 3 0.2589 2.5049
##
## Variable scores:
## Dim.1 Dim.2
## GDP -0.7670 0.2123
## LI -0.1150 -0.2175
## UR -0.4271 -0.1109
## IR -0.0201 0.6607
## TB -0.0318 -0.6532
## NNS 0.4634 0.1791
##
## Within cluster sum of squares by cluster:
## [1] 5.1105 4.2402 1.9681
## (between_SS / total_SS = 80.05 %)
##
## Clustering vector:
## Australia Canada Finland France Spain Sweden
## 2 2 2 2 2 2
## USA Netherlands Greece Mexico Portugal Austria
## 2 1 3 3 3 1
## Belgium Denmark Germany Italy Japan Norway
## 1 1 1 1 1 1
## Switzerland UK
## 1 1
##
## Objective criterion value: 34.2877
##
## Available output:
##
## [1] "obscoord" "attcoord" "centroid" "cluster" "criterion" "size"
## [7] "odata" "scale" "center" "nstart"
Using reduced K-means analysis, the USA is grouped with Canada and Australia, and the UK is grouped with many other European countries. The smallest group again contains only Greece, Portugal and Mexico. This is the same partition as that produced by the K-means analysis on the scaled data (the cluster labels differ, but the groupings are identical).
If we plot, as before, the RKM clusters against the first two principal components, the solution is identical to the K-means clustering analysis.
plot(macro.pca$x[, 1:2],
     pch = 19,
     col = macro.RKM$cluster,
     main = "REDUCED K-MEANS ANALYSIS: (k=3 and 2 principal components)")
text(macro.pca$x[,1:2], labels = rownames(macro), cex = 0.6, pos = 3)
We can instead plot the RKM clusters against the two dimensions estimated by RKM itself:
plot(macro.RKM$obscoord,
     pch = 19,
     col = macro.RKM$cluster,
     main = "REDUCED K-MEANS ANALYSIS: (k=3 and 2 principal components)")
text(macro.RKM$obscoord, labels = rownames(macro), cex = 0.6, pos = 4)
We can see that although the countries are clustered in the same way, the configuration is rotated relative to the PCA solution. This is because reduced K-means does not simply use the directions of maximum total variance: it estimates a low-dimensional subspace in which the clusters are best separated (here with a varimax rotation applied to the solution). In this case, the clusters appear better separated in the RKM space.
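One way to see that the RKM dimensions are not simply the first two principal components is to compare the variable loadings from the two analyses; a minimal sketch:
round(macro.pca$rotation[, 1:2], 3)  # PCA loadings on the first two components
round(macro.RKM$attcoord, 3)         # RKM variable scores (as in the summary above)
Next, we look at the value of the objective function minimised by RKM: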
macro.RKM$criterion
## [1] 34.28768
The reduced K-means analysis also reports a ‘criterion’, the minimised value of its objective function, which here is 34.3.
Objective function minimisation results:
- Tandem analysis (2 PCs): 18.6, but with only 53% of the variance explained
- Tandem analysis (3 PCs): 38.9, with 73% of the variance explained
- K-means analysis: 68.6, using all of the variance
- Reduced K-means analysis (2 dimensions): 34.3
The silhouette width measures how similar an object is to its own cluster (cohesion) compared with the other clusters (separation). It ranges from -1 to 1: values close to 1 indicate that a point lies well within its own cluster and far from the others, values near 0 indicate that it lies on or near the boundary between two clusters, and negative values indicate a potentially incorrect assignment.
silRKM <- silhouette(macro.RKM$cluster, dist(macro.RKM$obscoord))
summary(silRKM)
## Silhouette of 20 units in 3 clusters from silhouette.default(x = macro.RKM$cluster, dist = dist(macro.RKM$obscoord)) :
## Cluster sizes and average silhouette widths:
## 10 7 3
## 0.6042366 0.5238140 0.5802296
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2497 0.5344 0.6287 0.5725 0.6759 0.7084
plot(silRKM)
silTAN <- silhouette(macro.pca.km3$cluster, dist(macro.pca$x[,1:2]))
summary(silTAN)
## Silhouette of 20 units in 3 clusters from silhouette.default(x = macro.pca.km3$cluster, dist = dist(macro.pca$x[, 1:2])) :
## Cluster sizes and average silhouette widths:
## 7 9 4
## 0.3742488 0.5213183 0.2781720
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0630 0.3033 0.4337 0.4212 0.6116 0.6873
plot(silTAN)
silKM <- silhouette(macro.km3$cluster, dist(macro))
summary(silKM)
## Silhouette of 20 units in 3 clusters from silhouette.default(x = macro.km3$cluster, dist = dist(macro)) :
## Cluster sizes and average silhouette widths:
## 10 3 7
## 0.26017868 -0.06336677 0.22648422
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.07746 0.09798 0.22412 0.19985 0.30145 0.42817
plot(silKM)
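To compare the three solutions directly, the average silhouette widths can be collected into a single named vector; a small sketch:
c(RKM    = mean(silRKM[, "sil_width"]),
  tandem = mean(silTAN[, "sil_width"]),
  kmeans = mean(silKM[, "sil_width"]))  # approximately 0.57, 0.42 and 0.20 (see the summaries above)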
So we have the results:
| | K-means | Tandem (3 PCs) | Reduced K-means | Tandem (2 PCs) |
|---|---|---|---|---|
| Objective function minimum | 68.6 | 38.9 | 34.3 | 18.6 |
| Variance used | 100% | 73% | n/a (2 RKM dimensions) | 53% |
| Average silhouette width | 0.20 | not computed | 0.57 | 0.42 |
| Verdict | WORST | BETTER | BEST | DISCARD: variance explained too small |
Using the ‘table’ function, we can cross-tabulate the clusters, with the k-means clustering in the rows and the tandem clustering in the columns:
table(macro.km3$cluster, macro.pca.km3$cluster)
##
## 1 2 3
## 1 1 8 1
## 2 0 1 2
## 3 6 0 1
As already confirmed by our plots, there are differences in the grouping of countries in each cluster using k-means and tandem analysis.
Similarly, we can cross-tabulate the k-means clustering (rows) against the reduced k-means clustering (columns):
table(macro.km3$cluster, macro.RKM$cluster)
##
## 1 2 3
## 1 10 0 0
## 2 0 0 3
## 3 0 7 0
K-means and reduced k-means have clustered exactly the same countries together: each row of the table has a single non-zero entry, so the two partitions agree up to a relabelling of the clusters. As already confirmed by our plots, there will therefore also be differences between the reduced k-means and tandem groupings.
We can also use the function ‘adjustedRandIndex’, which takes a maximum value of 1 (perfect agreement), has an expected value of zero for a random partition, and can be negative when the agreement is worse than expected by chance.
adjustedRandIndex(macro.km3$cluster, macro.RKM$cluster)
## [1] 1
As expected, we obtain a value of 1, since the k-means and reduced k-means partitions agree perfectly.
adjustedRandIndex(macro.km3$cluster, macro.pca.km3$cluster)
## [1] 0.4898084
Comparing k-means and tandem analysis, we obtain a result of 0.49.
adjustedRandIndex(macro.RKM$cluster, macro.pca.km3$cluster)
## [1] 0.4898084
Comparing reduced k-means and tandem analysis, we again obtain a result of 0.49.
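As a quick illustration of the claim that the ARI of a random partition has expected value zero, we can compare the k-means clusters against randomly generated labels (a hypothetical check; the seed is only for reproducibility):
set.seed(123)  # hypothetical seed
adjustedRandIndex(macro.km3$cluster, sample(1:3, nrow(macro), replace = TRUE))
# this should be close to 0, and will vary with the random labels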
In this case, by all measurements used, the reduced k-means analysis has resulted in better clustering than tandem analysis or k-means alone.