Analyse the dataset ‘macro’ using tandem analysis (PCA, followed by k-means), k-means analysis and reduced k-means analysis (PCA and k-means simultaneously) and compare the results in order to find the best clustering solution.
We begin by installing the following packages and loading the libraries (a brief installation sketch is given after this list):
‘clustrd’, which provides the ‘cluspca’ function for performing PCA and K-means jointly (reduced K-means).
‘devtools’, which we use to install the ‘factoextra’ package from GitHub.
‘factoextra’, which is used to visualize the results of the clustering analysis.
‘mclust’, which is used to calculate the Adjusted Rand Index, a measure of the similarity between two clusterings.
‘cluster’, which is used to assess the quality of clustering via silhouette analysis.
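For reference, a minimal installation sketch (run once; note that ‘factoextra’ can be installed either from CRAN or, via ‘devtools’, from its GitHub repository):
# one-time installation (not run)
install.packages(c("clustrd", "devtools", "mclust", "cluster"))
devtools::install_github("kassambara/factoextra")  # or install.packages("factoextra")
We then load the libraries: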
library(clustrd)
## Loading required package: ggplot2
## Loading required package: grid
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(devtools)
## Loading required package: usethis
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(mclust)
## Package 'mclust' version 6.1.1
## Type 'citation("mclust")' for citing this R package in publications.
library(cluster)
We load the ‘macro’ dataset, which contains six macroeconomic indicators — GDP, leading indicator (LI), unemployment rate (UR), interest rate (IR), trade balance (TB) and net national savings (NNS) — for 20 OECD countries. We scale the data and reassign the result to the variable ‘macro’.
data("macro")
macro <- scale(macro)
macro.pca <- prcomp(macro, center = TRUE, scale. = TRUE)
macro.pca$sdev
## [1] 1.3163087 1.1999971 1.0872354 0.8074061 0.7665947 0.6369345
By the Kaiser rule (retain components whose eigenvalue exceeds 1, i.e. whose standard deviation exceeds 1 for standardized data), we would select the 1st, 2nd and 3rd principal components to capture a significant portion of the variance (and thus information about the data). However, in order to be able to view 2D plots, we will only use the first two components. This may have consequences for the clustering results.
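As a quick check of the Kaiser rule, we can inspect the eigenvalues (the squared standard deviations) and the cumulative proportion of variance explained; a minimal sketch:
eig <- macro.pca$sdev^2
round(eig, 3)                     # eigenvalues; the Kaiser rule keeps those above 1
round(cumsum(eig) / sum(eig), 3)  # cumulative proportion of variance explained
With this in mind, we proceed with K-means on the first two component scores: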
macro.pca.km3 <- kmeans(macro.pca$x[,1:2], 3, iter.max = 10, nstart = 10)
macro.pca.km3$cluster
## Australia Canada Finland France Spain Sweden
## 1 1 1 1 1 1
## USA Netherlands Greece Mexico Portugal Austria
## 3 2 3 3 2 2
## Belgium Denmark Germany Italy Japan Norway
## 2 2 2 1 2 2
## Switzerland UK
## 2 3
plot(macro.pca$x[, 1:2],
     pch = 19,
     col = macro.pca.km3$cluster,
     main = "K-means with 3 clusters on the first 2 principal components")
text(macro.pca$x[, 1:2], labels = rownames(macro), cex = 0.6, pos = 3)
Even with limited knowledge of the economics of the countries listed, it seems strange that the USA and UK have been grouped with Mexico and Greece. This is most likely due to our selection of only the first 2 (and not 3) principal components. This means that a significant portion (47.1%) of the variance (and hence information about the variables) is not being considered.
print(macro.pca.km3)
## K-means clustering with 3 clusters of sizes 7, 9, 4
##
## Cluster means:
## PC1 PC2
## 1 -1.2316556 0.7097318
## 2 1.2073667 0.2373445
## 3 -0.5611776 -1.7760559
##
## Clustering vector:
## Australia Canada Finland France Spain Sweden
## 1 1 1 1 1 1
## USA Netherlands Greece Mexico Portugal Austria
## 3 2 3 3 2 2
## Belgium Denmark Germany Italy Japan Norway
## 2 2 2 1 2 2
## Switzerland UK
## 2 3
##
## Within cluster sum of squares by cluster:
## [1] 6.537273 5.959924 6.134729
## (between_SS / total_SS = 69.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
macro.pca.km3$tot.withinss
## [1] 18.63193
Examining ‘tot.withinss’ (the total within-cluster sum of squares), we see a value of 18.6. This measures how tightly the data points in each cluster are grouped around their centroid: a lower value indicates more compact clusters. The value is only directly comparable between solutions computed on the same data, however, since dropping principal components discards variance and mechanically lowers the within-cluster sum of squares.
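As a sanity check, ‘tot.withinss’ is simply the sum of the per-cluster ‘withinss’ values, and the ratio ‘betweenss/totss’ gives the proportion of variance separated between clusters (about 69% here, as in the printout above); a small sketch:
sum(macro.pca.km3$withinss)                    # equals tot.withinss (about 18.6)
macro.pca.km3$betweenss / macro.pca.km3$totss  # between_SS / total_SS (about 0.69)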
For comparison, we can also perform K-means clustering on the first 3 principal components:
macro.pca3.km3 <- kmeans(macro.pca$x[,1:3], 3, iter.max = 10, nstart = 10)
macro.pca3.km3$tot.withinss
## [1] 38.88012
lambdas <-macro.pca$sdev^2/sum(macro.pca$sdev^2)
round(lambdas,3)
## [1] 0.289 0.240 0.197 0.109 0.098 0.068
The ‘tot.withinss’ value is now 38.9. This is higher than the previous value, but it is more meaningful as it takes into account much more of the variance in the data (72.6% rather than only 52.9%).
macro.km3 <- kmeans(macro, 3, iter.max = 10, nstart = 10)
macro.km3$cluster
## Australia Canada Finland France Spain Sweden
## 3 3 3 3 3 3
## USA Netherlands Greece Mexico Portugal Austria
## 3 1 2 2 2 1
## Belgium Denmark Germany Italy Japan Norway
## 1 1 1 1 1 1
## Switzerland UK
## 1 1
plot(macro.pca$x[, 1:2],
     pch = 19,
     col = macro.km3$cluster,
     main = "K-means with 3 clusters represented on the first 2 principal components")
text(macro.pca$x[, 1:2], labels = rownames(macro), cex = 0.6, pos = 3)
This time, the USA is grouped with Canada and Australia, and the UK has been grouped with many other European countries. The smallest group now contains only Greece, Portugal and Mexico. This seems more reasonable than the previous clustering.
print(macro.km3)
## K-means clustering with 3 clusters of sizes 10, 3, 7
##
## Cluster means:
## GDP LI UR IR TB NNS
## 1 -0.8175538 0.003067305 -0.3397450 -0.3515438 0.2997370 0.3390524
## 2 0.3333478 -0.574608549 -0.3883388 1.6497683 -1.6445426 0.5684991
## 3 1.0250706 0.241878942 0.6517809 -0.2048381 0.2766082 -0.7280030
##
## Clustering vector:
## Australia Canada Finland France Spain Sweden
## 3 3 3 3 3 3
## USA Netherlands Greece Mexico Portugal Austria
## 3 1 2 2 2 1
## Belgium Denmark Germany Italy Japan Norway
## 1 1 1 1 1 1
## Switzerland UK
## 1 1
##
## Within cluster sum of squares by cluster:
## [1] 27.48119 20.96884 20.12533
## (between_SS / total_SS = 39.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
macro.km3$tot.withinss
## [1] 68.57536
The minimised objective function value is 68.6. This is higher, and hence nominally worse, than the values reached with tandem analysis, but it is the more meaningful figure as it is computed on all of the variance in the data (100% rather than only 52.9%). Note that although we have plotted the clusters against the first 2 principal components, the clustering itself was performed on the original (scaled) data.
We can compare the two cluster solutions, plotted against the first two principal components:
fviz_cluster(macro.pca.km3, macro,
             main = "TANDEM ANALYSIS: PCA (2 components), then K-means (k=3)")
fviz_cluster(macro.km3, macro,
             main = "K-MEANS ANALYSIS: K-means (k=3) clusters based on scaled data, plotted against the first 2 principal components")
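To inspect the two solutions side by side, one option is to arrange the two ggplot objects returned by ‘fviz_cluster’ in a grid; a sketch assuming the ‘gridExtra’ package is installed:
library(gridExtra)  # assumed to be available
p1 <- fviz_cluster(macro.pca.km3, macro, main = "Tandem: PCA (2 components), then K-means (k=3)")
p2 <- fviz_cluster(macro.km3, macro, main = "K-means (k=3) on the scaled data")
grid.arrange(p1, p2, ncol = 2)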
We use the ‘cluspca’ function from the ‘clustrd’ package to perform PCA and K-means clustering in one step. We specify k=3, the number of principal components (2), the clustering method (reduced K-means), the rotation method (varimax), and we centre and scale the data.
macro.RKM <- cluspca(macro, 3, 2, method = "RKM", center = TRUE, scale = TRUE, rotation = "varimax")
summary(macro.RKM)
## Solution with 3 clusters of sizes 10 (50%), 7 (35%), 3 (15%) in 2 dimensions. Variables were mean centered and standardized.
##
## Cluster centroids:
## Dim.1 Dim.2
## Cluster 1 0.9264 -0.5039
## Cluster 2 -1.4344 -0.3536
## Cluster 3 0.2589 2.5049
##
## Variable scores:
## Dim.1 Dim.2
## GDP -0.7670 0.2123
## LI -0.1150 -0.2175
## UR -0.4271 -0.1109
## IR -0.0201 0.6607
## TB -0.0318 -0.6532
## NNS 0.4634 0.1791
##
## Within cluster sum of squares by cluster:
## [1] 5.1105 4.2402 1.9681
## (between_SS / total_SS = 80.05 %)
##
## Clustering vector:
## Australia Canada Finland France Spain Sweden
## 2 2 2 2 2 2
## USA Netherlands Greece Mexico Portugal Austria
## 2 1 3 3 3 1
## Belgium Denmark Germany Italy Japan Norway
## 1 1 1 1 1 1
## Switzerland UK
## 1 1
##
## Objective criterion value: 34.2877
##
## Available output:
##
## [1] "obscoord" "attcoord" "centroid" "cluster" "criterion" "size"
## [7] "odata" "scale" "center" "nstart"
Using reduced K-means analysis, the USA is grouped with Canada and Australia, and the UK is grouped with many other European countries. The smallest group again contains only Greece, Portugal and Mexico. This is the same partition as that produced by the K-means analysis on the scaled data (the cluster labels differ, but the groupings are identical).
If we plot, as before, the RKM clusters against the first two principal components, the solution is identical to the K-means clustering analysis.
plot(macro.pca$x[, 1:2],
     pch = 19,
     col = macro.RKM$cluster,
     main = "REDUCED K-MEANS ANALYSIS: (k=3 and 2 principal components)")
text(macro.pca$x[,1:2], labels = rownames(macro), cex = 0.6, pos = 3)
We can instead plot the RKM clusters against the two dimensions estimated by RKM itself:
plot(macro.RKM$obscoord,
     pch = 19,
     col = macro.RKM$cluster,
     main = "REDUCED K-MEANS ANALYSIS: (k=3 and 2 principal components)")
text(macro.RKM$obscoord, labels = rownames(macro), cex = 0.6, pos = 4)
We can see that although the countries are clustered in the same way, the configuration is rotated relative to the PCA solution. This is because reduced K-means does not simply use the directions of maximum total variance: it estimates a low-dimensional subspace in which the clusters are best separated (here with a varimax rotation applied to the solution). In this case, the clusters appear better separated in the RKM space.
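One way to see that the RKM dimensions are not simply the first two principal components is to compare the variable loadings from the two analyses; a minimal sketch:
round(macro.pca$rotation[, 1:2], 3)  # PCA loadings on the first two components
round(macro.RKM$attcoord, 3)         # RKM variable scores (as in the summary above)
Next, we look at the value of the objective function minimised by RKM: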
macro.RKM$criterion
## [1] 34.28768
The reduced K-means analysis also reports a ‘criterion’, the minimised value of its objective function, which here is 34.3.
Objective function minimisation results:
- Tandem analysis (2 PCs): 18.6, but with only 53% of the variance explained
- Tandem analysis (3 PCs): 38.9, with 73% of the variance explained
- K-means analysis: 68.6, using all of the variance
- Reduced K-means analysis (2 dimensions): 34.3
The silhouette width measures how similar an object is to its own cluster (cohesion) compared with the other clusters (separation). It ranges from -1 to 1: values close to 1 indicate that a point lies well within its own cluster and far from the others, values near 0 indicate that it lies on or near the boundary between two clusters, and negative values indicate a potentially incorrect assignment.
silRKM <- silhouette(macro.RKM$cluster, dist(macro.RKM$obscoord))
summary(silRKM)
## Silhouette of 20 units in 3 clusters from silhouette.default(x = macro.RKM$cluster, dist = dist(macro.RKM$obscoord)) :
## Cluster sizes and average silhouette widths:
## 10 7 3
## 0.6042366 0.5238140 0.5802296
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2497 0.5344 0.6287 0.5725 0.6759 0.7084
plot(silRKM)
silTAN <- silhouette(macro.pca.km3$cluster, dist(macro.pca$x[,1:2]))
summary(silTAN)
## Silhouette of 20 units in 3 clusters from silhouette.default(x = macro.pca.km3$cluster, dist = dist(macro.pca$x[, 1:2])) :
## Cluster sizes and average silhouette widths:
## 7 9 4
## 0.3742488 0.5213183 0.2781720
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0630 0.3033 0.4337 0.4212 0.6116 0.6873
plot(silTAN)
silKM <- silhouette(macro.km3$cluster, dist(macro))
summary(silKM)
## Silhouette of 20 units in 3 clusters from silhouette.default(x = macro.km3$cluster, dist = dist(macro)) :
## Cluster sizes and average silhouette widths:
## 10 3 7
## 0.26017868 -0.06336677 0.22648422
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.07746 0.09798 0.22412 0.19985 0.30145 0.42817
plot(silKM)
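To compare the three solutions directly, the average silhouette widths can be collected into a single named vector; a small sketch:
c(RKM    = mean(silRKM[, "sil_width"]),
  tandem = mean(silTAN[, "sil_width"]),
  kmeans = mean(silKM[, "sil_width"]))  # approximately 0.57, 0.42 and 0.20 (see the summaries above)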
So we have the results:
| | K-means | Tandem (3 PCs) | Reduced K-means | Tandem (2 PCs) |
|---|---|---|---|---|
| Objective function minimum | 68.6 | 38.9 | 34.3 | 18.6 |
| Variance used | 100% | 73% | n/a (2 RKM dimensions) | 53% |
| Average silhouette width | 0.20 | not computed | 0.57 | 0.42 |
| Verdict | WORST | BETTER | BEST | DISCARD: variance explained too small |
Using the ‘table’ function, we can cross-tabulate the clusters, with the k-means clustering in the rows and the tandem clustering in the columns:
table(macro.km3$cluster, macro.pca.km3$cluster)
##
## 1 2 3
## 1 1 8 1
## 2 0 1 2
## 3 6 0 1
As already confirmed by our plots, there are differences in the grouping of countries in each cluster using k-means and tandem analysis.
Similarly, we can cross-tabulate the k-means clustering (rows) against the reduced k-means clustering (columns):
table(macro.km3$cluster, macro.RKM$cluster)
##
## 1 2 3
## 1 10 0 0
## 2 0 0 3
## 3 0 7 0
K-means and reduced k-means have clustered exactly the same countries together: each row of the table has a single non-zero entry, so the two partitions agree up to a relabelling of the clusters. As already confirmed by our plots, there will therefore also be differences between the reduced k-means and tandem groupings.
We can also use the function ‘adjustedRandIndex’, which takes a maximum value of 1 (perfect agreement), has an expected value of zero for a random partition, and can be negative when the agreement is worse than expected by chance.
adjustedRandIndex(macro.km3$cluster, macro.RKM$cluster)
## [1] 1
As expected, we obtain a value of 1, since the k-means and reduced k-means partitions agree perfectly.
adjustedRandIndex(macro.km3$cluster, macro.pca.km3$cluster)
## [1] 0.4898084
Comparing k-means and tandem analysis, we obtain a result of 0.49.
adjustedRandIndex(macro.RKM$cluster, macro.pca.km3$cluster)
## [1] 0.4898084
Comparing reduced k-means and tandem analysis, we again obtain a result of 0.49.
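As a quick illustration of the claim that the ARI of a random partition has expected value zero, we can compare the k-means clusters against randomly generated labels (a hypothetical check; the seed is only for reproducibility):
set.seed(123)  # hypothetical seed
adjustedRandIndex(macro.km3$cluster, sample(1:3, nrow(macro), replace = TRUE))
# this should be close to 0, and will vary with the random labels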
In this case, by all measurements used, the reduced k-means analysis has resulted in better clustering than tandem analysis or k-means alone.