11 - Simultaneous Use of Principal Components and Cluster Analysis

Examining the Economic Indicators Dataset ‘macro’ with PCA and Cluster Analysis

Examining the Dataset

library(clustrd)

## Loading required package: ggplot2

## Loading required package: grid

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

The dataset ‘macro’, which becomes available once ‘clustrd’ is loaded, contains economic indicators for 20 OECD countries for 1999. According to the R documentation, there are six main indicators, each expressed as the percentage change from the previous year: gross domestic product (GDP), leading indicator (LI), unemployment rate (UR), interest rate (IR), trade balance (TB) and net national savings (NNS).

We load the dataset and scale it, assigning the scaled data to the variable ‘macro.scaled’:

data("macro")
macro.scaled <- scale(macro)
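As a quick sanity check on the scaling step, the sketch below confirms that each column of the scaled data has mean zero and standard deviation one. It assumes only that ‘clustrd’ is installed so that ‘macro’ can be loaded.

```r
library(clustrd)
data("macro")
macro.scaled <- scale(macro)

# After scaling, every column should have mean 0 (up to rounding error) ...
col_means <- colMeans(macro.scaled)
round(col_means, 10)

# ... and standard deviation 1
col_sds <- apply(macro.scaled, 2, sd)
col_sds
```

The centring and scaling values that were applied are kept as the attributes "scaled:center" and "scaled:scale" of ‘macro.scaled’.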

To get a better feel for the data, we use the ‘head’ function to display the first few rows of the scaled data, rounded to two decimal places, and the ‘pairs’ function to create a scatterplot matrix:

round(head(macro.scaled),2)

##             GDP    LI   UR    IR    TB   NNS
## Australia  1.82  2.13 0.10 -0.04 -0.25 -1.19
## Canada     0.65  0.33 0.17 -0.11 -0.05 -1.06
## Finland    1.16 -0.75 1.01 -0.44  1.52 -0.42
## France    -0.01 -0.23 0.99 -0.42  0.45 -0.52
## Spain      0.94  0.33 2.79 -0.15 -0.14  0.07
## Sweden     1.31 -0.10 0.30 -0.30  1.13 -1.37

pairs(macro.scaled)

In this case, it is quite difficult to spot any correlations by eye. We can use the ‘cor’ function to calculate the correlation matrix of the scaled data, rounding the results to one decimal place:

round(cor(macro.scaled),1)

##      GDP   LI   UR   IR   TB  NNS
## GDP  1.0  0.1  0.3  0.2 -0.1 -0.4
## LI   0.1  1.0 -0.1  0.2  0.2 -0.2
## UR   0.3 -0.1  1.0 -0.1  0.1 -0.4
## IR   0.2  0.2 -0.1  1.0 -0.4  0.0
## TB  -0.1  0.2  0.1 -0.4  1.0  0.0
## NNS -0.4 -0.2 -0.4  0.0  0.0  1.0

Indeed, the rounded correlation matrix shows that no pair of variables has a correlation larger than 0.4 in absolute value. We can now proceed with the PCA and cluster analysis.
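Rather than scanning the matrix by eye, the strongest pairwise correlation can be located programmatically. The following is a minimal sketch using base R; the helper names ‘cm’, ‘off’ and ‘idx’ are illustrative and not part of the original analysis. It also shows that scaling leaves the correlations unchanged, so ‘cor(macro)’ and ‘cor(macro.scaled)’ agree.

```r
library(clustrd)
data("macro")

cm <- cor(macro)

# Correlations are invariant to centring and scaling
same_after_scaling <- isTRUE(all.equal(cm, cor(scale(macro))))
same_after_scaling

# Blank out the diagonal, then find the largest absolute correlation
off <- cm
diag(off) <- NA
idx <- which(abs(off) == max(abs(off), na.rm = TRUE), arr.ind = TRUE)[1, ]

# The pair of variables involved and their (unrounded) correlation
c(rownames(cm)[idx["row"]], colnames(cm)[idx["col"]])
strongest <- off[idx["row"], idx["col"]]
strongest
```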

Performing PCA and Reduced K-means Cluster Analysis

The function cluspca() from the ‘clustrd’ package performs PCA and cluster analysis simultaneously. Its main arguments are, in order: the dataset, the number of clusters, the number of principal components, the clustering method and the rotation method for the principal components. We will use RKM (reduced k-means) with varimax rotation, setting the number of clusters to 3 and the number of principal components to 2:

outRKM <- cluspca(macro.scaled, 3, 2, method = "RKM", rotation = "varimax")

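Reduced k-means is fitted from several random starting points, so repeated runs can in principle differ. The sketch below shows one way to make a run reproducible. It assumes that the installed version of ‘clustrd’ exposes the ‘nstart’ and ‘seed’ arguments of cluspca() (this can be checked with ‘args(cluspca)’); the object names ‘run1’ and ‘run2’ and the values 20 and 1234 are arbitrary choices for illustration.

```r
library(clustrd)
data("macro")
macro.scaled <- scale(macro)

# nstart: number of random starts (assumed argument)
# seed:   fixes the random number stream (assumed argument)
run1 <- cluspca(macro.scaled, 3, 2, method = "RKM", rotation = "varimax",
                nstart = 20, seed = 1234)
run2 <- cluspca(macro.scaled, 3, 2, method = "RKM", rotation = "varimax",
                nstart = 20, seed = 1234)

# With the same seed, the two runs should assign countries identically
identical(run1$cluster, run2$cluster)
```

Increasing ‘nstart’ lowers the risk of reporting a poor local optimum, at the cost of a longer run time.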

The output of the function is stored in the variable ‘outRKM’. We can use the ‘names’ function to see the components of the output:

names(outRKM)

##  [1] "obscoord"  "attcoord"  "centroid"  "cluster"   "criterion" "size"     
##  [7] "odata"     "scale"     "center"    "nstart"

Results of the PCA and Cluster Analysis

The ‘summary’ function gives an overview of the solution, drawing on the 10 components of the output:

summary(outRKM)

## Solution with 3 clusters of sizes 10 (50%), 7 (35%), 3 (15%) in 2 dimensions. Variables were mean centered and standardized.
## 
## Cluster centroids:
##             Dim.1   Dim.2
## Cluster 1  0.9264 -0.5039
## Cluster 2 -1.4344 -0.3536
## Cluster 3  0.2589  2.5049
## 
## Variable scores:
##       Dim.1   Dim.2
## GDP -0.7670  0.2123
## LI  -0.1150 -0.2175
## UR  -0.4271 -0.1109
## IR  -0.0201  0.6607
## TB  -0.0318 -0.6532
## NNS  0.4634  0.1791
## 
## Within cluster sum of squares by cluster:
## [1] 5.1105 4.2402 1.9681
##  (between_SS / total_SS =  80.05 %) 
## 
## Clustering vector:
##   Australia      Canada     Finland      France       Spain      Sweden 
##           2           2           2           2           2           2 
##         USA Netherlands      Greece      Mexico    Portugal     Austria 
##           2           1           3           3           3           1 
##     Belgium     Denmark     Germany       Italy       Japan      Norway 
##           1           1           1           1           1           1 
## Switzerland          UK 
##           1           1 
## 
## Objective criterion value: 34.2877 
## 
## Available output:
## 
##  [1] "obscoord"  "attcoord"  "centroid"  "cluster"   "criterion" "size"     
##  [7] "odata"     "scale"     "center"    "nstart"

The output of the function ‘cluspca’ is a list of 10 components. The summary shows that the within-cluster sums of squares for the three clusters are 5.1, 4.2 and 2.0, and that the between-cluster sum of squares accounts for 80% of the total.

‘centroid’ gives us the coordinates of the centre of each cluster in the reduced two-dimensional space. These are not directly interpretable in the original units, because the data are scaled and projected onto the two dimensions.

round(outRKM$centroid, 2)

##       [,1]  [,2]
## [1,]  0.93 -0.50
## [2,] -1.43 -0.35
## [3,]  0.26  2.50
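One way to build intuition for the centroids is to compare them with the average object score in each cluster. The sketch below assumes that the centroids are the per-cluster means of ‘obscoord’, as the reduced k-means formulation suggests; if that holds for the installed version of ‘clustrd’, the two printed tables should agree. The names ‘out’ and ‘score_means’ are illustrative.

```r
library(clustrd)
data("macro")
macro.scaled <- scale(macro)
out <- cluspca(macro.scaled, 3, 2, method = "RKM", rotation = "varimax")

# Mean object score on each dimension, computed separately per cluster
score_means <- aggregate(as.data.frame(out$obscoord),
                         by = list(cluster = out$cluster), FUN = mean)
round(score_means, 2)

# Centroids as reported by cluspca(), for comparison
round(out$centroid, 2)
```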

‘attcoord’ (loadings) and ‘obscoord’ give us the coordinates of the variables and of the observations, respectively, in the rotated space. The raw coordinates are difficult to interpret on their own, however.

round(outRKM$attcoord,1)

##     [,1] [,2]
## GDP -0.8  0.2
## LI  -0.1 -0.2
## UR  -0.4 -0.1
## IR   0.0  0.7
## TB   0.0 -0.7
## NNS  0.5  0.2

We can see that the most important variables for the first principal component are ‘Gross Dom. Prod.’ and ‘Net Nat. Savings’. For the second principal component, the most important variables are ‘Interest Rate’ and ‘Trade Balance’.
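The ranking of variables can be made explicit by ordering them by absolute score on each dimension. The sketch below uses the variable scores copied from the summary output above, so it runs without refitting the model; the names ‘load’ and ‘top’ are illustrative.

```r
# Variable scores copied from the summary output (rows: variables)
load <- matrix(c(-0.7670,  0.2123,
                 -0.1150, -0.2175,
                 -0.4271, -0.1109,
                 -0.0201,  0.6607,
                 -0.0318, -0.6532,
                  0.4634,  0.1791),
               ncol = 2, byrow = TRUE,
               dimnames = list(c("GDP", "LI", "UR", "IR", "TB", "NNS"),
                               c("Dim.1", "Dim.2")))

# For each dimension, order the variables by absolute score, largest first
top <- apply(abs(load), 2, function(x) names(sort(x, decreasing = TRUE)))

# The two most important variables per dimension
top[1:2, ]
```

Note that on the first dimension the unemployment rate (0.43 in absolute value) is almost as influential as net national savings (0.46), and that the signs matter for interpretation: GDP and UR load negatively on this dimension, while NNS loads positively.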

‘cluster’ shows us which countries belong to which cluster, and ‘size’ gives us the number of countries in each cluster.

outRKM$cluster

##   Australia      Canada     Finland      France       Spain      Sweden 
##           2           2           2           2           2           2 
##         USA Netherlands      Greece      Mexico    Portugal     Austria 
##           2           1           3           3           3           1 
##     Belgium     Denmark     Germany       Italy       Japan      Norway 
##           1           1           1           1           1           1 
## Switzerland          UK 
##           1           1

outRKM$size

## [1] 10  7  3

We can see that there are two larger clusters and a smaller one.
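The clustering vector is easier to read when the countries are listed by cluster. A minimal sketch, using the cluster assignments copied from the output above (with the fitted object, ‘outRKM$cluster’ could be used in place of the hand-typed vector ‘cl’):

```r
# Cluster assignments copied from the clustering vector above
cl <- c(Australia = 2, Canada = 2, Finland = 2, France = 2, Spain = 2,
        Sweden = 2, USA = 2, Netherlands = 1, Greece = 3, Mexico = 3,
        Portugal = 3, Austria = 1, Belgium = 1, Denmark = 1, Germany = 1,
        Italy = 1, Japan = 1, Norway = 1, Switzerland = 1, UK = 1)

# Country names grouped by cluster label
members <- split(names(cl), cl)
members

# Cluster sizes, which should match the 'size' component
lengths(members)
```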

‘criterion’ gives us the optimal value of the objective function.

outRKM$criterion

## [1] 34.28768

We can use the ‘aggregate’ function to calculate the mean of the original data for each cluster.

aggregate(macro, by=list(cluster=outRKM$cluster), mean)

##   cluster      GDP         LI        UR        IR        TB      NNS
## 1       1 1.190000  1.4500000  6.330000  3.982000  3.220000 10.67000
## 2       2 3.714286  2.2285714 10.342857  4.607143  3.114286  6.50000
## 3       3 2.766667 -0.4333333  6.133333 12.510000 -5.666667 11.56667
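To see how each cluster differs from the sample as a whole, the cluster means can be expressed as deviations from the overall mean of each indicator. The sketch below uses the values copied from the table above together with the cluster sizes; the overall mean is recovered as the size-weighted average of the cluster means. The names ‘means’, ‘sizes’, ‘overall’ and ‘dev’ are illustrative.

```r
# Cluster means of the original data, copied from the aggregate() output
means <- rbind(
  c(1.190000,  1.4500000,  6.330000,  3.982000,  3.220000, 10.67000),
  c(3.714286,  2.2285714, 10.342857,  4.607143,  3.114286,  6.50000),
  c(2.766667, -0.4333333,  6.133333, 12.510000, -5.666667, 11.56667))
dimnames(means) <- list(paste("Cluster", 1:3),
                        c("GDP", "LI", "UR", "IR", "TB", "NNS"))
sizes <- c(10, 7, 3)

# Overall mean per indicator: size-weighted average of the cluster means
# (row i of 'means' is multiplied by sizes[i])
overall <- colSums(means * sizes) / sum(sizes)

# Positive values: the cluster lies above the overall mean on that indicator
dev <- sweep(means, 2, overall)
round(dev, 2)
```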

We can use the ‘plot’ function to plot the results of the PCA and cluster analysis.

plot(outRKM, cludesc = TRUE)

In this plot, each coloured line represents a cluster. The black horizontal line represents the scaled mean.

We can also use the ‘plot’ function to see the biplot of the PCA and cluster analysis.

plot(outRKM)

This allows us to see the relative importance of the variables in the rotated space and the centres of the clusters.

We can also plot the variables on their own, with descriptive labels, by setting ‘what = c(FALSE, TRUE)’ and passing the labels through ‘attlabs’.

lbl <- c("Gross Dom. Prod.", "Lead. Indicator", "Unempl. Rate", "Interest Rate", "Trade Balance", "Net Nat. Savings")
plot(outRKM, what = c(FALSE, TRUE), attlabs = lbl)

This allows us to see the relative importance of the variables in the rotated space. We can see that for the first principal component, the most important variables are ‘Gross Dom. Prod.’ and ‘Net Nat. Savings’. For the second principal component, the most important variables are ‘Interest Rate’ and ‘Trade Balance’. This confirms what we saw when looking at the ‘attcoord’ component of the output.

Interpretation of the results:

The PCA and cluster analysis shows that the countries can be grouped into three clusters based on the economic indicators. Judging by the cluster means of the original data, the largest cluster (10 countries) has the lowest average Gross Domestic Product and Interest Rate values; the second largest cluster (7 countries) combines high Gross Domestic Product with high Unemployment Rate and low Net National Savings; and the smallest cluster (Greece, Mexico and Portugal) consists of countries with high Interest Rates and low Trade Balances.