Clustering is an unsupervised learning method that aims to split the
observations in a dataset into clusters based on their similarity.
Clustering can be used as an exploratory data analysis tool, to help
discover structures and relationships in the data. The resulting
cluster IDs can be seen as a "summary" of the variables in the
data, and can in turn be used as features when training supervised
learning models on data with many observations and variables.
Clustering is used in real-world applications such as customer
segmentation, anomaly detection and medical imaging. In this example
analysis, we have a dataset of country statistics, and we will see
whether we can cluster the countries meaningfully. The dataset is
sourced from Kaggle, uploaded by the user Rohan kokkula.
Let’s load our dataset, and make a table of the first five
rows.
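As a minimal sketch of the setup (the file name "Country-data.csv" and the object names df_orig and df are assumptions, not taken from the original analysis):

df_orig <- read.csv("Country-data.csv", stringsAsFactors = FALSE)  # file name is an assumption
df <- df_orig[, -1]              # numeric variables only, assuming country is the first column
rownames(df) <- df_orig$country  # keep country names as row names
head(df_orig, 5)                 # first five rows, shown in the table below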
| Country data | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|   | country | child_mort | exports | health | imports | income | inflation | life_expect | total_fert | gdpp |
| 1 | Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 |
| 2 | Albania | 16.6 | 28.0 | 6.55 | 48.6 | 9930 | 4.49 | 76.3 | 1.65 | 4090 |
| 3 | Algeria | 27.3 | 38.4 | 4.17 | 31.4 | 12900 | 16.10 | 76.5 | 2.89 | 4460 |
| 4 | Angola | 119.0 | 62.3 | 2.85 | 42.9 | 5900 | 22.40 | 60.1 | 6.16 | 3530 |
| 5 | Antigua and Barbuda | 10.3 | 45.5 | 6.03 | 58.9 | 19100 | 1.44 | 76.8 | 2.13 | 12200 |
The variables in our dataset are country, child_mort, exports, health, imports, income, inflation, life_expect, total_fert and gdpp. All variables except country are numeric. Some variables with a range between 0 and 100 are actually percentages, but we don't need to modify them now, as we will standardize all variables before we apply our algorithms.
Let's visualize and explore the distributions of the variables and their correlations with each other. Since clustering doesn't make predictions, we don't have an outcome variable, so we don't need to look for a specific relationship between any particular pair of variables.
Let’s create and visualize histograms for all our variables, starting
with the first four.
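As a sketch of how these histograms could be produced (the tidyverse reshape-and-facet approach here is an assumption, not the original plotting code):

library(dplyr)
library(tidyr)
library(ggplot2)

df %>%
  select(child_mort, exports, health, imports) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +              # histogram of each variable
  facet_wrap(~ variable, scales = "free")  # one panel per variable, free axes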
We see that all four variables are right skewed, with numerous outliers.
Child mortality in particular is very right skewed: most countries have
low child mortality rates, but a small number of countries have very
high rates. The median is very close to the 25th percentile, and far
from the 75th. This is likely to be a key factor in splitting countries
into clusters.
Let’s look at the next 2 variables.
Income per capita and inflation are both very right skewed with
outliers, especially inflation. Most countries have low inflation rates,
but a few countries have very high rates. Most countries have an income
per capita lower than roughly 25,000.
Let’s look at our last 3 variables.
Overall, the distributions in our data are very right skewed, with many outliers both for "positive" variables such as income and for "negative" variables such as child mortality. We have no reason to remove any outliers from our analysis: they all represent real countries we need to account for, and are potentially more interesting for our analysis than countries with moderate statistics.
Highly correlated variables can have a strong impact on clustering
analyses, especially if the algorithm is distance based: Highly
correlated variables can exaggerate the similarity or dissimilarity
between observations. Dropping them, unless there is a serious reason to
keep them, can be a good idea.
Let’s calculate and plot the correlations of all variables in our
dataset.
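One way to produce such a plot is with the corrplot package (an assumption; any correlation heatmap would serve the same purpose):

library(corrplot)

corr_mat <- cor(df)                    # pairwise correlations of the numeric variables
corrplot(corr_mat, method = "square")  # blue = positive, red = negative; size = strength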
In the square plot, larger blue squares indicate stronger positive
correlation, and larger red squares indicate stronger negative
correlation. We have numerous strong correlations between variables.
Since some of these variables are highly correlated, and record very similar features, we should drop some of them from our clustering analysis. To help with the elimination of variables, and to reduce the dimensions in the dataset, we can make use of a principal component analysis (PCA).
Principal component analysis aims to reduce the dimensions of a dataset
with minimal information loss. Information from all variables is
combined into principal components (PCs) while minimizing redundancy,
i.e., correlation. A dataset with 10 variables yields 10 PCs, but each
PC aims to maximize the information summarized in one dimension and is
uncorrelated with the other PCs. We can then decide to discard some of
the PCs while using the remaining ones as our variables. This way, we
reduce the number of dimensions with minimal information loss and no
correlation between the new variables.
Before we can compute a PCA, let’s standardize our variables into
z-scores. Our variables have different scales, some in the thousands
while others represent percentages, and this can lead to biased results
both for PCA and for clustering.
df_z <- as.data.frame(scale(df))  # standardize all numeric variables to z-scores
Now let’s compute a PCA for our standardized dataframe, and display the calculated PCs.
library(factoextra)          # provides get_eigenvalue() and the fviz_* helpers used below
pca <- prcomp(df_z)          # PCA on the standardized variables
eival <- get_eigenvalue(pca) # eigenvalue and variance explained for each principal component
| Principal components | |||
|---|---|---|---|
|   | eigenvalue | variance.percent | cumulative.variance.percent |
| Dim.1 | 4.13565658 | 45.9517398 | 45.95174 |
| Dim.2 | 1.54634631 | 17.1816257 | 63.13337 |
| Dim.3 | 1.17038330 | 13.0042589 | 76.13762 |
| Dim.4 | 0.99478456 | 11.0531618 | 87.19079 |
| Dim.5 | 0.66061903 | 7.3402114 | 94.53100 |
| Dim.6 | 0.22358112 | 2.4842347 | 97.01523 |
| Dim.7 | 0.11343874 | 1.2604304 | 98.27566 |
| Dim.8 | 0.08831536 | 0.9812817 | 99.25694 |
| Dim.9 | 0.06687501 | 0.7430556 | 100.00000 |
The variance percent column shows the percentage of variance captured by
each single PC. The cumulative variance column shows the percentage of
variance captured by each PC together with all previous PCs. We can see
that the first 4 PCs alone capture 87% of the variance, the fifth PC
brings it to 94%, and the sixth to 97%. Using four to six PCs therefore
likely represents a good tradeoff between fewer dimensions and variance
explained.
We can also use the results of our PCA to get insight into the
contributions of our original variables to the first two PCs, by
plotting them as vectors in a two-dimensional space.
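A sketch of such a variable plot with factoextra (the exact options are assumptions):

fviz_pca_var(pca, col.var = "contrib", repel = TRUE)  # variables as vectors on the PC1/PC2 plane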
If we wanted to use the original variables for clustering while dropping some of the highly correlated ones, the intuition from this plot could help us make our choice. Since we want the variables with the biggest impact, without correlating with one another, a good choice could be the following variables, in order of importance: life_expect, income, imports and health.
Before we start our clustering analysis, we can test the cluster
tendency of our data. An issue with clustering algorithms is that they
will cluster any given dataset, even if there are no meaningful
structures present. Metrics of cluster tendency can sometimes indicate
whether the data is likely to contain meaningful clusters or not.
We apply two methods to test cluster tendency: The Hopkins statistic,
and a visual assessment. Let’s test the cluster tendency of a dataset
with the first 5 PCs from our PCA.
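Here, df_pc5 is assumed to be built from the scores of the first five principal components, for example:

df_pc5 <- as.data.frame(pca$x[, 1:5])  # PC1-PC5 scores for each country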
set.seed(1)
hopkins(df_pc5, m=nrow(df_pc5)-1)
## [1] 0.9481295
get_clust_tendency(data=df_pc5, n=50, gradient=list(low="white",high="blue"), seed=1)
## $hopkins_stat
## [1] 0.7874387
##
## $plot
A clear consensus on the interpretation of clustering tendency does not
seem to exist, and we can see that one function returned a Hopkins
statistic of 0.95, while the other returned 0.79.
Let’s apply the same tests to a dataset of the four original variables we chose using the PCA: life_expect, income, imports and health.
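Assuming df1 holds the four chosen variables from the standardized data, the same tests could be run as follows:

df1 <- df_z[, c("life_expect", "income", "imports", "health")]

set.seed(1)
hopkins(df1, m = nrow(df1) - 1)
get_clust_tendency(data = df1, n = 50, gradient = list(low = "white", high = "blue"), seed = 1)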
## [1] 0.9903301
## $hopkins_stat
## [1] 0.8470405
##
## $plot
The results are higher Hopkins statistics of 0.99 and 0.85 respectively,
but we also have more dissimilar areas in the dissimilarity plot.
The results for both datasets suggest a low clustering tendency for our
data, but intuitively, we would expect countries to cluster at least
somewhat decently based on statistics of income and quality of life.
Since there is no clear consensus on the usage and interpretation of
clustering tendency statistics, let's attempt to cluster both datasets
and evaluate the results.
The most well-known and widely used clustering algorithm is probably the
k-means algorithm. This algorithm starts with K random centroids as the
centers of K clusters, assigns every data point to its closest centroid,
recalculates each centroid as the mean of the points in its cluster,
reassigns the data points to the closest centroid, and repeats this
until convergence. As this approach is based on means, it is sensitive
to outliers.
The variables in our data generally follow a very right skewed
distribution, with many outliers. Because of this, we will prefer a
k-medoids clustering algorithm: k-medoids uses K actual observations
from the data as cluster centers (medoids), instead of calculated means.
These medoids are chosen to minimize the dissimilarity within each
cluster. K-medoids is more robust against outliers, as it does not rely
on means.
To apply a non-hierarchical clustering method, such as k-medoids, we
need to determine the number of clusters we want. Choosing the optimal
number of clusters has a strong impact on the quality of our clustering,
and can be subjective. One method to choose the optimal number of
clusters is by using average silhouette width:
fviz_nbclust(df_pc5, cluster::pam, method="silhouette")
This plot displays the average silhouette width for each candidate
number of clusters. The closer the average silhouette width is to 1, the
better the clustering. A negative silhouette width for an observation
indicates that the observation was likely put in the wrong cluster.
According to our plot, 2 clusters is optimal, but 3 clusters comes
close, and intuitively makes more sense than simply splitting the
countries into two groups.
Another method we can use is the elbow method:
fviz_nbclust(df_pc5, cluster::pam, method="wss")
This plot displays the total within-cluster sum of squares, i.e. the
unexplained variance, for each number of clusters. As we divide the data
into more clusters, we can expect to explain more of the variance, but
also to end up with smaller, less meaningful and less generalizable
clusters, similar to the overfitting problem in regression analysis. The
elbow method refers to choosing the number of clusters that creates an
"elbow" in this plot: a point of diminishing returns. We don't have a
very clear elbow in this case, but 3 and 6 are the candidates.
Another method we can use is the gap statistic:
fviz_nbclust(df_pc5, cluster::pam, method="gap_stat")
The gap statistic compares the within-cluster variation for each number
of clusters with what we would expect under a reference distribution
with no clustering structure. Here, we want to choose the smallest
number K of clusters whose gap statistic is within one standard error of
the gap statistic of K+1 clusters. This can be considered the "inverse"
of the previous method; we are again choosing the point of diminishing
returns. In this case, 3 clusters emerges as the ideal choice again.
Let’s fit and plot our first clustering, with 3 clusters, using the
PAM k-medoids algorithm. We choose Manhattan distance as our metric of
distance, as it’s known to be more robust to outliers.
library(cluster)  # provides pam()
pam1 <- pam(df_pc5, k=3, metric="manhattan")
fviz_cluster(pam1, data=df_pc5, geom="point", ellipse.type="norm", repel=TRUE, ggtheme=theme_bw()) +
  labs(title="Cluster plot, pam1", subtitle="k=3 clusters", x="Dim1, 20% variance", y="Dim2, 20% variance")
The first and second clusters are decently separated, but the third
cluster appears to overlap with the other two quite a bit. Some
observations in cluster 3 also seem to be quite distant from one
another. This is likely not a very effective clustering, but at the same
time, it suggests at least some structure in our data. We should also
keep in mind that the plot only represents the observations and clusters
with respect to two dimensions, which account for roughly 20% of the
variance each, so the remaining dimensions may still separate the data
more clearly.
Let’s plot our clusters against the first 3 dimensions, in a 3D
scatterplot:
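A sketch of such a 3D scatterplot using plotly (an assumption; the original plotting library is not shown). The column names PC1-PC3 follow from the df_pc5 sketch above:

library(plotly)

plot_ly(df_pc5,
        x = ~PC1, y = ~PC2, z = ~PC3,
        color = factor(pam1$clustering),      # color points by cluster assignment
        type = "scatter3d", mode = "markers")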
The 3D plot shows that our clusters are actually much more separated
than the 2D plot would suggest.
Let’s plot the silhouette widths for our clusters.
fviz_silhouette(pam1)
## cluster size ave.sil.width
## 1 1 41 0.31
## 2 2 96 0.36
## 3 3 30 0.25
The average silhouette width across all clusters is 0.33, while the
per-cluster averages range from 0.25 to 0.36, indicative of a somewhat
ineffective clustering. Some observations in clusters 2 and 3 have
negative silhouette widths, indicating they may have been placed in the
wrong cluster. However, some observations exceed a silhouette width of
0.5, again suggesting that at least some structure is present in the
data.
Let’s see the 3 countries that form the center of each cluster.
pam1$id.med
## [1] 95 144 54
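The corresponding rows can be pulled from the original data, assuming df_orig is the unstandardized data frame with the country column from the loading sketch above:

df_orig[pam1$id.med, ]  # the three medoid countries, shown in the table below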
| Center observations of pam1 clusters | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|   | country | child_mort | exports | health | imports | income | inflation | life_expect | total_fert | gdpp |
| 95 | Malawi | 90.5 | 22.8 | 6.59 | 34.9 | 1030 | 12.100 | 53.1 | 5.31 | 459 |
| 144 | Suriname | 24.1 | 52.5 | 7.01 | 38.4 | 14200 | 7.200 | 70.3 | 2.52 | 8300 |
| 54 | Finland | 3.0 | 38.7 | 8.95 | 37.4 | 39800 | 0.351 | 80.0 | 1.87 | 46200 |
This suggests that clusters 1, 2 and 3 may respectively represent the least developed countries, developing countries and highly developed countries, even if some observations may be poorly clustered.
Let’s also try clustering using the dataset with the four key
variables we chose using our PCA: life_expect, income, imports and
health. Again, we start with testing the optimal number of
clusters.
fviz_nbclust(df1, cluster::pam, method="silhouette")
fviz_nbclust(df1, cluster::pam, method="wss")
fviz_nbclust(df1, cluster::pam, method="gap_stat")
Both the silhouette and gap methods suggest 2 as the optimal number,
with 3 being a close runner-up. The elbow method doesn’t yield any
number that can be considered a point of diminishing returns. This
suggests this data is even less suitable for clustering.
Let's try fitting 2 clusters and plotting them in 3D.
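A sketch of this fit, assuming the same Manhattan metric as before (the object name pam2 matches the output below; which three variables to show in 3D is also an assumption):

pam2 <- pam(df1, k = 2, metric = "manhattan")  # k-medoids on the four chosen variables
fviz_silhouette(pam2)                          # silhouette summary shown below
plot_ly(df1, x = ~life_expect, y = ~income, z = ~imports,
        color = factor(pam2$clustering), type = "scatter3d", mode = "markers")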
## cluster size ave.sil.width
## 1 1 77 0.40
## 2 2 90 0.21
With 2 clusters, the average silhouette width is 0.3, slightly lower
than pam1. This time, one cluster performs a bit better, with 0.4 width,
while the other performs a bit worse, with 0.21 width. We see some
observations with a width of more than 0.5, but also more observations
with negative widths in cluster 2.
Let’s see the medoids for each cluster.
pam2$id.med
## [1] 60 122
| Center observations of pam2 clusters | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|   | country | child_mort | exports | health | imports | income | inflation | life_expect | total_fert | gdpp |
| 60 | Ghana | 74.7 | 29.5 | 5.22 | 45.9 | 3060 | 16.60 | 62.2 | 4.27 | 1310 |
| 122 | Poland | 6.0 | 40.1 | 7.46 | 42.1 | 21800 | 1.66 | 76.3 | 1.41 | 12600 |
This clustering is likely a more generalized version of the previous one, which may still be meaningful, but is likely to ignore the extreme observations, such as very underdeveloped or very developed countries. Indeed, we see numerous observations with negative silhouette width in cluster 2: These are likely the highly developed countries such as Finland, which were previously in cluster 3.
Let’s try 3 clusters with the same dataset of 4 original
variables.
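Again a sketch, assuming the same settings with k = 3 (the object name pam2.1 matches the output below):

pam2.1 <- pam(df1, k = 3, metric = "manhattan")
fviz_silhouette(pam2.1)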
## cluster size ave.sil.width
## 1 1 67 0.26
## 2 2 68 0.29
## 3 3 32 0.25
The result is somewhat similar to pam1's clusters. Clusters 1 and 3 are
completely separated, while cluster 2 is somewhat caught in the middle.
The average silhouette width is 0.27, slightly lower than in the
previous attempts, but the differences between the clusters' average
widths are smaller: they are closer to each other, all between 0.25 and
0.29. Each cluster has some observations with negative widths, and some
observations close to a width of 0.5.
Let’s see the medoids for each cluster.
pam2.1$id.med
## [1] 60 25 159
| Center observations of pam2.1 clusters | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|   | country | child_mort | exports | health | imports | income | inflation | life_expect | total_fert | gdpp |
| 60 | Ghana | 74.7 | 29.5 | 5.22 | 45.9 | 3060 | 16.60 | 62.2 | 4.27 | 1310 |
| 25 | Bulgaria | 10.8 | 50.2 | 6.87 | 53.0 | 15300 | 1.11 | 73.9 | 1.57 | 6840 |
| 159 | United Kingdom | 5.2 | 28.2 | 9.64 | 30.8 | 36200 | 1.57 | 80.3 | 1.92 | 38900 |
These results suggest that this clustering attempt “distributes” the error more evenly between clusters, and may offer a more balanced degree of representativeness compared to the first attempt. There is no objective way to determine if one is better than the other: If we are interested in the most “accurate” results overall, we may prefer the first clustering, but if we prefer to minimize the differences in accuracy between clusters, we may prefer the third clustering. Another upside of the third clustering is that it uses original variables instead of PCs, so the plots and values are more directly interpretable.
Now we can group our original dataset into 3 clusters and see whether the variables differ strongly by cluster. First, let's add the cluster assignment to the original dataset as a new categorical variable, and look at some of the countries in each cluster:
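A sketch of this step. The reported cluster sizes (41, 96 and 30) match pam1, so the pam1 assignments are assumed here, with the hypothetical df_orig from the loading sketch above:

df_orig$cluster <- factor(pam1$clustering)  # cluster membership as a categorical variable
head(df_orig[df_orig$cluster == 1, ], 5)    # a few countries from cluster 1 (tables below show a sample)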
| Cluster 1 | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
|   | country | child_mort | exports | health | imports | income | inflation | life_expect | total_fert | gdpp | cluster |
| 1 | Afghanistan | 90.2 | 10.0 | 7.58 | 44.9 | 1610 | 9.44 | 56.2 | 5.82 | 553 | 1 |
| 50 | Equatorial Guinea | 111.0 | 85.8 | 4.48 | 58.9 | 33700 | 24.90 | 60.9 | 5.21 | 17100 | 1 |
| 100 | Mauritania | 97.4 | 50.7 | 4.41 | 61.2 | 3320 | 18.90 | 68.2 | 4.98 | 1200 | 1 |
| 138 | South Africa | 53.7 | 28.6 | 8.94 | 27.4 | 12000 | 6.35 | 54.3 | 2.59 | 7280 | 1 |
| 150 | Timor-Leste | 62.6 | 2.2 | 9.12 | 27.8 | 1850 | 26.50 | 71.1 | 6.23 | 3600 | 1 |
We can see cluster 1 consists of generally underdeveloped countries.
This cluster has 41 observations.
| Cluster 2 | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
|   | country | child_mort | exports | health | imports | income | inflation | life_expect | total_fert | gdpp | cluster |
| 6 | Argentina | 14.5 | 18.9 | 8.10 | 16.0 | 18700 | 20.90 | 75.8 | 2.37 | 10300 | 2 |
| 44 | Czech Republic | 3.4 | 66.0 | 7.88 | 62.9 | 28300 | -1.43 | 77.5 | 1.51 | 19800 | 2 |
| 68 | Hungary | 6.0 | 81.8 | 7.33 | 76.5 | 22300 | 2.33 | 74.5 | 1.25 | 13100 | 2 |
| 155 | Turkmenistan | 62.0 | 76.3 | 2.50 | 44.5 | 9940 | 2.31 | 67.9 | 2.83 | 4440 | 2 |
| 162 | Uzbekistan | 36.3 | 31.7 | 5.81 | 28.5 | 4240 | 16.50 | 68.8 | 2.34 | 1380 | 2 |
Cluster 2 generally consists of developing countries across the world,
but there are some observations that may have fit into the first cluster
as well. We knew that cluster 2 had some overlap with the other 2
clusters. This is the largest cluster with 96 observations.
| Cluster 3 | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
|   | country | child_mort | exports | health | imports | income | inflation | life_expect | total_fert | gdpp | cluster |
| 8 | Australia | 4.8 | 19.8 | 8.73 | 20.9 | 41400 | 1.160 | 82.0 | 1.93 | 51900 | 3 |
| 16 | Belgium | 4.5 | 76.4 | 10.70 | 74.7 | 41100 | 1.880 | 80.0 | 1.86 | 44400 | 3 |
| 69 | Iceland | 2.6 | 53.4 | 9.40 | 43.3 | 38800 | 5.470 | 82.0 | 2.20 | 41900 | 3 |
| 134 | Singapore | 2.8 | 200.0 | 3.96 | 174.0 | 72100 | -0.046 | 82.7 | 1.15 | 46600 | 3 |
| 158 | United Arab Emirates | 8.6 | 77.7 | 3.66 | 63.6 | 57600 | 12.500 | 76.5 | 1.87 | 35000 | 3 |
Cluster 3 seems to consist of the most developed countries in the world.
This is the smallest cluster, with 30 observations.
Since all our variables are numeric, let’s see boxplots for each one,
grouped by clusters, to determine if there are significant differences
between clusters. Let’s start with the first 4 variables:
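A sketch of such grouped boxplots, again using the tidyverse reshape approach (an assumption) with the hypothetical df_orig from above:

library(dplyr)
library(tidyr)
library(ggplot2)

df_orig %>%
  select(cluster, child_mort, exports, health, imports) %>%
  pivot_longer(-cluster, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = cluster, y = value)) +
  geom_boxplot() +                            # one box per cluster
  facet_wrap(~ variable, scales = "free_y")   # one panel per variable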
From the boxplots, we see clear differences between the clusters for
these variables, in line with what we'd expect for underdeveloped,
developing and developed countries.
Let’s look at the next 2 variables.
Let’s see the next 2 variables.
And our final variable.
Cluster 1 has a very low GDPP, cluster 2 is a bit higher, and cluster 3
is much higher, as expected due to the income inequality between
developed and underdeveloped countries.
The suitability and performance of clustering algorithms is not always easy to assess: in our example, cluster tendency and performance statistics indicated low suitability and performance for our clustering, but exploring the results shows that we have identified three meaningful clusters in the data.
The first and third clusters are smaller, with more extreme
observations, and the middle cluster is larger, with less extreme
observations, as we'd expect. Clustering proved to be an effective and
insightful exploratory analysis tool, even though the summary statistics
and metrics suggested otherwise. Visualizing the clustering results in
3D helped us see a clearer picture of the separation between clusters,
which looked deceptively overlapping in the 2D plots.
Another important step was the preprocessing: if we hadn't applied PCA
and/or removed the highly correlated variables, our results might have
been less effective. Some variables in our dataset were highly
correlated measures of similar features, and they could have exaggerated
the similarity or dissimilarity of observations and clusters.