Purpose

Clustering is an unsupervised learning method that aims to split the observations in a dataset into clusters based on their similarity. Clustering can be used as an exploratory data analysis tool, to help discover structures and relationships in the data. The resulting cluster IDs can be considered a “summary” of the variables in the data, and can serve as compact features when training supervised learning models on datasets with many observations and variables.

Clustering is used in real-world applications such as customer segmentation, anomaly detection and medical imaging. In this example analysis, we have a dataset of country statistics, and we will see if we can cluster the countries meaningfully. The dataset is sourced from Kaggle, uploaded by the user Rohan kokkula.

Data preparation

Let’s load our dataset, and make a table of the first five rows.
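A minimal sketch of this step (the file name is an assumption, as the loading code is not shown in the original):

# read the country statistics and preview the first five rows
# (file name is an assumption; adjust to the actual path)
df <- read.csv("Country-data.csv")
head(df, 5)

Later steps standardize the data with scale(), so the non-numeric country column is assumed to be set aside (for example, moved to row names) beforehand.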

Country data
country child_mort exports health imports income inflation life_expect total_fert gdpp
1 Afghanistan 90.2 10.0 7.58 44.9 1610 9.44 56.2 5.82 553
2 Albania 16.6 28.0 6.55 48.6 9930 4.49 76.3 1.65 4090
3 Algeria 27.3 38.4 4.17 31.4 12900 16.10 76.5 2.89 4460
4 Angola 119.0 62.3 2.85 42.9 5900 22.40 60.1 6.16 3530
5 Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44 76.8 2.13 12200


The variables in our dataset, besides country, are child_mort, exports, health, imports, income, inflation, life_expect, total_fert and gdpp.

All variables in our dataset except country are numeric. Some variables with a range of 0-100 are actually percentages, but we don’t need to transform them now, as we will standardize all variables before we apply our algorithms.

Exploratory analysis

Let’s visualize and explore the distributions of the variables, and their correlations with each other. Since clustering doesn’t make predictions, we don’t have an outcome variable, so we don’t need to look for a specific relationship between any particular pair of variables.

Distributions

Let’s create and visualize histograms for all our variables, starting with the first four.
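One way such histograms could be produced (a sketch; the ggplot2 facet approach and bin count are assumptions):

library(ggplot2)
library(tidyr)

# reshape the first four variables to long format, one histogram per variable
df_long <- pivot_longer(df, c(child_mort, exports, health, imports),
                        names_to = "variable", values_to = "value")

ggplot(df_long, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free") +
  theme_bw()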

We see that all four variables are right skewed, with numerous outliers. Child mortality in particular is very right skewed: most countries have low child mortality rates, but a small number of countries have very high rates. The median is very close to the 25th percentile, and far from the 75th. This is likely to be a key factor in splitting countries into clusters.

Let’s look at the next 2 variables.

Income per capita and inflation are both very right skewed with outliers, especially inflation. Most countries have low inflation rates, but a few countries have very high rates. Most countries have an income per capita lower than roughly 25,000.

Let’s look at our last 3 variables.

  • Life expectancy, unlike the other variables, has a left skewed distribution, but also has outliers. Most countries have a high life expectancy, roughly over 65, but a few countries have very low values, around 30 to 50. This is possibly another key factor for splitting the data.
  • Fertility rates are right skewed, with a considerable number of observations in the right tail: many countries have relatively low fertility rates around 2, but quite a few countries have higher rates between 4 and 6.
  • GDP per capita is very right skewed: most countries are below roughly 15,000, while numerous countries have a GDPP of more than 2-3x that amount.

Overall, the distributions in our data are very right skewed, with a lot of outliers both for “positive” variables such as income and for “negative” variables such as child mortality. We have no reason to remove any outliers from our analysis: they all represent real countries we need to account for, and are potentially more interesting for our analysis than countries with moderate statistics.

Correlations

Highly correlated variables can have a strong impact on clustering analyses, especially if the algorithm is distance based: they can exaggerate the similarity or dissimilarity between observations. Unless there is a serious reason to keep them, dropping them can be a good idea.

Let’s calculate and plot the correlations of all variables in our dataset.
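A sketch of how the correlation matrix and the square plot could be produced (corrplot is an assumed choice; the scatterplots referred to below could come from a pairs-style plot such as GGally::ggpairs):

library(corrplot)

# correlation matrix of the numeric variables, drawn as colored squares
corr_mat <- cor(Filter(is.numeric, df))
corrplot(corr_mat, method = "square")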

In the square plot, larger blue squares indicate stronger positive correlation, and larger red squares indicate stronger negative correlation. We have numerous strong correlations between variables:

  • Income and GDPP have a very strong positive correlation of 0.89, as we would expect.
    • The pattern in the scatterplot from the first plot suggests there may be two different groups with respect to the relationship between income and GDPP, possibly because income and GDPP diverge for some countries.
  • Child mortality and life expectancy have a strong negative correlation of -0.88, again an expected relationship.
  • Child mortality and fertility have a strong positive correlation of 0.84.
    • Looking at the scatterplot of these two variables, this correlation may have been even higher if not for one outlier dragging the fit down: Haiti has a high child mortality of 208, yet a relatively low fertility of 3.33.
  • Fertility and life expectancy have a negative correlation of -0.76.
    • This relationship is affected by the same outlier: Haiti has a relatively low fertility of 3.33, yet also a low life expectancy of 32.1.
  • Imports and exports have a positive correlation of 0.73.

Since some of these variables are highly correlated, and record very similar features, we should drop some of them from our clustering analysis. To help with the elimination of variables, and to reduce the dimensions in the dataset, we can make use of a principal component analysis (PCA).

Principal Component Analysis (PCA)

Principal component analysis aims to reduce the dimensionality of a dataset with minimal information loss. Information from all variables is combined into principal components while minimizing redundancy, i.e., correlation. A dataset with 10 variables yields 10 PCs, but each PC summarizes as much of the remaining variance as possible in a single dimension, and is uncorrelated with the other PCs. We can then discard some of the PCs while using the remaining ones as our variables. This way, we reduce the number of dimensions with minimal information loss and no correlation between the new variables.

Before we can compute a PCA, let’s standardize our variables into z-scores. Our variables have different scales, some in the thousands while others represent percentages, and this can lead to biased results both for PCA and for clustering.

# standardize all numeric variables into z-scores
# (the non-numeric country column is assumed to have been set aside)
df_z <- as.data.frame(scale(df))

Now let’s compute a PCA for our standardized dataframe, and display the calculated PCs.

pca <- prcomp(df_z)           # variables are already standardized, so no further scaling is needed
eival <- get_eigenvalue(pca)  # factoextra helper: eigenvalues and variance explained per PC
Principal components
eigenvalue variance.percent cumulative.variance.percent
Dim.1 4.13565658 45.9517398 45.95174
Dim.2 1.54634631 17.1816257 63.13337
Dim.3 1.17038330 13.0042589 76.13762
Dim.4 0.99478456 11.0531618 87.19079
Dim.5 0.66061903 7.3402114 94.53100
Dim.6 0.22358112 2.4842347 97.01523
Dim.7 0.11343874 1.2604304 98.27566
Dim.8 0.08831536 0.9812817 99.25694
Dim.9 0.06687501 0.7430556 100.00000


The variance.percent column shows the percentage of variance captured by each individual PC. The cumulative variance column shows the percentage captured by each PC together with all previous PCs. The first 4 PCs alone capture 87% of the variance, the fifth PC brings it to 94%, and the sixth to 97%. These likely represent a good tradeoff between fewer dimensions and variance explained.

We can also use the results of our PCA to get insight on the effects of our original variables on the first two PCs, by plotting them as vectors in a two-dimensional space.
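A minimal sketch of this variable plot with factoextra (the coloring option is an assumption):

# plot the original variables as vectors in the space of the first two PCs
fviz_pca_var(pca, col.var = "contrib", repel = TRUE)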

  • In the above plot, variables with longer vectors and closer directions to a PC axis have a bigger effect on that PC.
    • We can see PC1 is most strongly influenced by income and GDPP.
  • If the angle between two variable vectors is close to 0 degrees, they are strongly positively correlated.
    • We can see income and GDPP have a strong positive correlation.
  • If the angle between two variable vectors is close to 180 degrees, they are strongly negatively correlated.
    • Child mortality and life expectancy have a strong negative correlation.
  • A 90-degree angle between two vectors represents close to zero correlation.
    • Imports and life expectancy are close to having zero correlation.

If we wanted to use the original variables for clustering while dropping some of the highly correlated ones, the intuition from this plot could help us make our choice. Since we want the variables with the biggest impact, without correlating with one another, a good choice could be the following variables, in order of importance: life_expect, income, imports and health.

Clustering Analysis

Cluster tendencies

Before we start our clustering analysis, we can test the cluster tendency of our data. An issue with clustering algorithms is that they will cluster any given dataset, even if there are no meaningful structures present. Metrics of cluster tendency can sometimes indicate whether the data is likely to contain meaningful clusters or not.

We apply two methods to test cluster tendency: The Hopkins statistic, and a visual assessment. Let’s test the cluster tendency of a dataset with the first 5 PCs from our PCA.
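The data frame of PC scores is not shown in the text; a minimal sketch of how df_pc5 could be built, assuming the hopkins and factoextra packages provide the functions used below:

library(hopkins)      # hopkins()
library(factoextra)   # get_clust_tendency()

# scores of the observations on the first five principal components
# (assumed construction of df_pc5)
df_pc5 <- as.data.frame(pca$x[, 1:5])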

set.seed(1)
hopkins(df_pc5, m=nrow(df_pc5)-1)
## [1] 0.9481295
get_clust_tendency(data=df_pc5, n=50, gradient=list(low="white",high="blue"), seed=1)
## $hopkins_stat
## [1] 0.7874387
## 
## $plot


A clear consensus on the interpretation of clustering tendency does not seem to exist, and we can see that one function returned a Hopkins statistic of 0.95 while the other returned 0.79; the two calls also sample the data differently (m = n - 1 points versus n = 50), which partly explains the gap.

  • The interpretation depends on the formulation used: in the convention documented by the hopkins package and recent versions of factoextra, a Hopkins statistic close to 1 indicates a highly non-uniform dataset, which may be interpreted as a tendency to cluster, while a value near 0.5 indicates roughly uniformly distributed data. Some older implementations report the complement, where values close to 0 indicate clusterable data.
  • The plot visualizes the ordered dissimilarity matrix of the observations in our dataset, with white representing low dissimilarity and blue representing high. Ideally, we would see distinct low-dissimilarity (white) square blocks along the diagonal, separated by high-dissimilarity (blue) areas, indicating the presence of well-separated clusters.

Let’s apply the same tests to a dataset of the four original variables we chose using the PCA: life_expect, income, imports and health.
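df1 is assumed to hold these four variables on the standardized scale; a sketch of the same tests:

# the four original variables chosen via the PCA, taken from the standardized data
df1 <- df_z[, c("life_expect", "income", "imports", "health")]

set.seed(1)
hopkins(df1, m = nrow(df1) - 1)
get_clust_tendency(data = df1, n = 50,
                   gradient = list(low = "white", high = "blue"), seed = 1)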

## [1] 0.9903301
## $hopkins_stat
## [1] 0.8470405
## 
## $plot


The results are higher Hopkins statistics of 0.99 and 0.85, and we also see more strongly contrasting areas in the dissimilarity plot.

The evidence on clustering tendency is therefore ambiguous rather than conclusive, but intuitively, we would expect countries to cluster at least somewhat decently based on statistics of income and life quality. Given the lack of a clear consensus on the usage and interpretation of clustering tendency statistics, let’s attempt to cluster both datasets and evaluate the results.

K-medoids clustering

The most well-known and widely used clustering algorithm is probably k-means. This algorithm picks K random centroids as the centers of K clusters, assigns all data points to their closest centroid, recalculates each centroid as the mean of its cluster, reassigns the data points to the closest centroid, and repeats this until convergence. As this approach is based on means, it is sensitive to outliers.

The variables in our data generally follow very right skewed distributions, with a lot of outliers. Because of this, we will prefer a k-medoids clustering algorithm: k-medoids uses K actual observations from the data (the medoids) as cluster centers, instead of calculating means. The medoids are chosen to minimize the dissimilarity within each cluster. K-medoids is more robust against outliers, as it does not rely on means.

To apply a non-hierarchical clustering method, such as k-medoids, we need to determine the number of clusters we want. Choosing the optimal number of clusters has a strong impact on the quality of our clustering, and can be subjective. One method to choose the optimal number of clusters is by using average silhouette width:

fviz_nbclust(df_pc5, cluster::pam, method="silhouette")


This plot displays the average silhouette width for each number of clusters. The closer the average silhouette width is to 1, the better the clustering. A negative silhouette width for an observation indicates that the observation was likely put in the wrong cluster. According to our plot, 2 clusters is optimal, but 3 clusters also comes close, and intuitively makes more sense than simply splitting the countries into two groups.

Another method we can use is the elbow method:

fviz_nbclust(df_pc5, cluster::pam, method="wss")


This plot displays the total within-cluster sum of squares, i.e. the unexplained variance, for each number of clusters. As we divide the data into more clusters, we can expect to explain more of the variance, but also to obtain smaller, less meaningful and less generalizable clusters, similar to the overfitting problem in regression analysis. The elbow method refers to choosing the number of clusters that creates an “elbow” in this plot: a point of diminishing returns. We don’t have a very clear elbow in this case, but 3 and 6 are the candidates.

Another method we can use is the gap statistic:

fviz_nbclust(df_pc5, cluster::pam, method="gap_stat")


The gap statistic compares the within-cluster variation for each number of clusters with the variation expected under a uniform reference distribution. Here, we want to choose the smallest number of clusters K whose gap statistic is within one standard error of the gap statistic for K+1 clusters. This can be considered the “inverse” of the previous method: we are again choosing the point of diminishing returns. In this case, 3 clusters emerges as the ideal choice again.

Clustering attempt: pam1

Let’s fit and plot our first clustering, with 3 clusters, using the PAM k-medoids algorithm. We choose Manhattan distance as our distance metric, as it’s known to be more robust to outliers.

pam1 <- pam(df_pc5, k=3, metric="manhattan")

fviz_cluster(pam1, data=df_pc5, geom="point", ellipse.type="norm", repel=TRUE, ggtheme=theme_bw()) +
  labs(title="Cluster plot, pam1", subtitle="k=3 clusters", x="Dim1, 20% variance", y="Dim2, 20% variance" )


The first and second clusters are decently separated, but the third cluster appears to overlap with the other two quite a bit. Some observations in cluster 3 also seem to be quite distant from one another. This is likely not a very effective clustering, but at the same time, it suggests at least some structure in our data. We should also keep in mind that the plot only represents the observations and clusters with respect to two dimensions, which have 20% effect each, so the remaining dimensions may still separate the data more clearly.

Let’s plot our clusters against the first 3 dimensions, in a 3D scatterplot:
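A sketch of such a plot using plotly (the library choice is an assumption; the column names PC1-PC3 assume df_pc5 was built from pca$x as sketched earlier):

library(plotly)

# first three PC scores, colored by the pam1 cluster assignment
plot_ly(data = df_pc5,
        x = ~PC1, y = ~PC2, z = ~PC3,
        color = ~factor(pam1$clustering),
        type = "scatter3d", mode = "markers")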
The 3D plot shows that our clusters are actually much more separated than the 2D plot would suggest.

Let’s plot the silhouette widths for our clusters.

fviz_silhouette(pam1)
##   cluster size ave.sil.width
## 1       1   41          0.31
## 2       2   96          0.36
## 3       3   30          0.25


The average silhouette width across all clusters is 0.33, while the smallest average width per cluster is 0.25 and the largest is 0.36, indicative of a somewhat ineffective clustering. Some observations in clusters 2 and 3 have negative silhouette widths, indicating they may have been wrongly placed in these clusters. However, some observations exceed a silhouette width of 0.5, again indicating that at least some structure is present in the data.

Let’s see the 3 countries that form the center of each cluster.

pam1$id.med
## [1]  95 144  54
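These are row indices, so the medoid rows can be looked up in the original data (assuming df is the unscaled dataset that still carries the country names):

# rows 95, 144 and 54 of the original data are the three cluster medoids
df[pam1$id.med, ]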
Center observations of pam1 clusters
country child_mort exports health imports income inflation life_expect total_fert gdpp
95 Malawi 90.5 22.8 6.59 34.9 1030 12.100 53.1 5.31 459
144 Suriname 24.1 52.5 7.01 38.4 14200 7.200 70.3 2.52 8300
54 Finland 3.0 38.7 8.95 37.4 39800 0.351 80.0 1.87 46200
  • The first cluster’s center is Malawi. We can see a high child mortality rate, low income, and relatively lower life expectancy.
  • The second cluster’s center is Suriname. A relatively lower child mortality rate, higher life expectancy, and higher income, closer to the median amount.
  • The third cluster’s center is Finland. Very low child mortality, very high life expectancy and income.

This suggests that clusters 1, 2 and 3 may represent the least developed countries, developing countries and highly developed countries respectively, even if some observations may be poorly clustered.

Clustering attempt: pam2

Let’s also try clustering using the dataset with the four key variables we chose using our PCA: life_expect, income, imports and health. Again, we start with testing the optimal number of clusters.

fviz_nbclust(df1, cluster::pam, method="silhouette")

fviz_nbclust(df1, cluster::pam, method="wss")

fviz_nbclust(df1, cluster::pam, method="gap_stat")


Both the silhouette and gap methods suggest 2 as the optimal number, with 3 being a close runner-up. The elbow method doesn’t yield any number that can be considered a clear point of diminishing returns. This suggests this dataset may be even less cleanly separable into clusters.

Let’s try fitting and 3D plotting 2 clusters.
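The fit behind the output below is not shown; a sketch, assuming the same Manhattan metric as pam1 (the 3D plot can be drawn as before, with plot_ly on the columns of df1):

# k-medoids with 2 clusters on the four chosen variables
pam2 <- pam(df1, k = 2, metric = "manhattan")
fviz_silhouette(pam2)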

##   cluster size ave.sil.width
## 1       1   77          0.40
## 2       2   90          0.21


With 2 clusters, the average silhouette width is 0.3, slightly lower than pam1. This time, one cluster performs a bit better, with 0.4 width, while the other performs a bit worse, with 0.21 width. We see some observations with a width of more than 0.5, but also more observations with negative widths in cluster 2.

Let’s see the medoids for each cluster.

pam2$id.med
## [1]  60 122


Center observations of pam2 clusters
country child_mort exports health imports income inflation life_expect total_fert gdpp
60 Ghana 74.7 29.5 5.22 45.9 3060 16.60 62.2 4.27 1310
122 Poland 6.0 40.1 7.46 42.1 21800 1.66 76.3 1.41 12600
  • Cluster 1’s center is Ghana, which has high child mortality, low income, and a somewhat medium life expectancy. Not as extreme an observation as Malawi from the previous clustering.
  • Cluster 2’s center is Poland, which has low child mortality, higher income and life expectancy. Again, not as extreme an observation as Finland from the previous clustering.

This clustering is likely a more generalized version of the previous one, which may still be meaningful, but is likely to ignore the extreme observations, such as very underdeveloped or very developed countries. Indeed, we see numerous observations with negative silhouette width in cluster 2: These are likely the highly developed countries such as Finland, which were previously in cluster 3.

Clustering attempt: pam2.1

Let’s try 3 clusters with the same dataset of 4 original variables.
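As before, the fit is assumed to be the same call with k = 3:

# same data and metric as pam2, but with 3 clusters
pam2.1 <- pam(df1, k = 3, metric = "manhattan")
fviz_silhouette(pam2.1)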

##   cluster size ave.sil.width
## 1       1   67          0.26
## 2       2   68          0.29
## 3       3   32          0.25


The result is somewhat similar to pam1’s clusters. Clusters 1 and 3 are completely separated, while cluster 2 is somewhat caught in the middle. The average silhouette width is 0.27, slightly lower than the previous attempts, but the differences between the clusters’ average widths are smaller: they are all between 0.25 and 0.29. Each cluster has some observations with negative widths, and some observations close to 0.5 width.

Let’s see the medoids for each cluster.

pam2.1$id.med
## [1]  60  25 159
Center observations of pam2.1 clusters
country child_mort exports health imports income inflation life_expect total_fert gdpp
60 Ghana 74.7 29.5 5.22 45.9 3060 16.60 62.2 4.27 1310
25 Bulgaria 10.8 50.2 6.87 53.0 15300 1.11 73.9 1.57 6840
159 United Kingdom 5.2 28.2 9.64 30.8 36200 1.57 80.3 1.92 38900
  • The first cluster’s center is Ghana again: An example of an underdeveloped country, not as extreme as Malawi from the first clustering attempt.
  • The second cluster’s center is Bulgaria: A developing country with middle levels of income and life quality statistics, probably more representative of developing countries compared to Suriname from the first attempt.
  • The third cluster’s center is the United Kingdom: A developed country with high income and good life quality statistics, though not as extreme as Finland, and possibly more representative of developed countries as a whole.

These results suggest that this clustering attempt “distributes” the error more evenly between clusters, and may offer a more balanced degree of representativeness compared to the first attempt. There is no objective way to determine if one is better than the other: If we are interested in the most “accurate” results overall, we may prefer the first clustering, but if we prefer to minimize the differences in accuracy between clusters, we may prefer the third clustering. Another upside of the third clustering is that it uses original variables instead of PCs, so the plots and values are more directly interpretable.

Analysis with clustering results

Now, we can group our original dataset into 3 clusters, and see if the variables differ strongly by cluster. First, let’s add the cluster assignment to the original dataset as a new categorical variable, and see some of the countries in each cluster:
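The cluster sizes reported below (41, 96 and 30) match pam1, so the labels are assumed to come from that fit; a minimal sketch of this step:

# attach the pam1 cluster assignment to the original data as a categorical variable
df$cluster <- factor(pam1$clustering)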

Cluster 1
country child_mort exports health imports income inflation life_expect total_fert gdpp cluster
1 Afghanistan 90.2 10.0 7.58 44.9 1610 9.44 56.2 5.82 553 1
50 Equatorial Guinea 111.0 85.8 4.48 58.9 33700 24.90 60.9 5.21 17100 1
100 Mauritania 97.4 50.7 4.41 61.2 3320 18.90 68.2 4.98 1200 1
138 South Africa 53.7 28.6 8.94 27.4 12000 6.35 54.3 2.59 7280 1
150 Timor-Leste 62.6 2.2 9.12 27.8 1850 26.50 71.1 6.23 3600 1


We can see cluster 1 consists of generally underdeveloped countries. This cluster has 41 observations.

Cluster 2
country child_mort exports health imports income inflation life_expect total_fert gdpp cluster
6 Argentina 14.5 18.9 8.10 16.0 18700 20.90 75.8 2.37 10300 2
44 Czech Republic 3.4 66.0 7.88 62.9 28300 -1.43 77.5 1.51 19800 2
68 Hungary 6.0 81.8 7.33 76.5 22300 2.33 74.5 1.25 13100 2
155 Turkmenistan 62.0 76.3 2.50 44.5 9940 2.31 67.9 2.83 4440 2
162 Uzbekistan 36.3 31.7 5.81 28.5 4240 16.50 68.8 2.34 1380 2


Cluster 2 generally consists of developing countries across the world, but there are some observations that may have fit into the first cluster as well. We knew that cluster 2 had some overlap with the other 2 clusters. This is the largest cluster with 96 observations.

Cluster 3
country child_mort exports health imports income inflation life_expect total_fert gdpp cluster
8 Australia 4.8 19.8 8.73 20.9 41400 1.160 82.0 1.93 51900 3
16 Belgium 4.5 76.4 10.70 74.7 41100 1.880 80.0 1.86 44400 3
69 Iceland 2.6 53.4 9.40 43.3 38800 5.470 82.0 2.20 41900 3
134 Singapore 2.8 200.0 3.96 174.0 72100 -0.046 82.7 1.15 46600 3
158 United Arab Emirates 8.6 77.7 3.66 63.6 57600 12.500 76.5 1.87 35000 3


Cluster 3 seems to consist of the more developed countries in the world. This is the smallest cluster, with 30 observations.

Since all our variables are numeric, let’s see boxplots for each one, grouped by clusters, to determine if there are significant differences between clusters. Let’s start with the first 4 variables:
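A sketch of how these grouped boxplots could be drawn (the ggplot2 facet approach is an assumed choice):

library(ggplot2)
library(tidyr)

# boxplots of the first four variables, grouped by the cluster variable added above
df_long <- pivot_longer(df, c(child_mort, exports, health, imports),
                        names_to = "variable", values_to = "value")

ggplot(df_long, aes(x = cluster, y = value, fill = cluster)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y") +
  theme_bw()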

From the boxplots, we see clear differences between the clusters for these variables.

These differences are in line with what we’d expect for underdeveloped, developing and developed countries.

Let’s look at the next two variables, income and inflation.

Next, life expectancy and total fertility.

And our final variable, GDP per capita.

Cluster 1 has a very low GDPP, cluster 2 is a bit higher, and cluster 3 is much higher, as expected due to the income inequality between developed and underdeveloped countries.

Conclusion

The suitability and performance of clustering algorithms is not always easy to assess: in our example, the cluster tendency statistics were ambiguous and the silhouette widths indicated a fairly weak clustering, but exploring the results shows that we have identified three meaningful clusters in the data: broadly, the underdeveloped, developing and highly developed countries.

The first and third clusters are smaller, with more extreme observations, and the middle cluster is larger, with less extreme observations, as we’d expect. Clustering proved to be an effective and insightful exploratory analysis tool, even though some of the summary statistics and metrics suggested otherwise. Visualizing the clustering results in 3D helped us see a clearer picture of the separation between clusters, which looked deceptively overlapping in the 2D plots.

Another important step was the preprocessing: if we hadn’t applied PCA and/or removed the highly correlated variables, our results may have been less effective. Some variables in our dataset were highly correlated measures of similar features, and they could have exaggerated the similarity or dissimilarity of observations and clusters.