Is the data clusterable?
Clustering algorithms will form clusters even when data is uniformly
distributed. The first step in cluster analysis is to investigate the
underlying structure of the data in order to ensure that clustering is
justifiable. The Hopkins statistic gives a measure of whether the data can be clustered: a value close to 0.5 indicates random data, whereas values close to 1 indicate clusterable data. We can also visualize the structure of the data by looking at the distance matrix. Two occupations that require the same skills will have a distance of zero; the more different the occupations are, the larger the distance will be. This raises the question of which distance metric to use: we consider two, Euclidean (as the crow flies) and Manhattan (as the New Yorker walks). There is some evidence (Aggarwal, Hinneburg, and Keim 2001) that Manhattan distance might be more appropriate in high-dimensional space.
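As a rough illustration, the Hopkins statistic and the distance heatmaps might be produced along the following lines. This is a minimal sketch, not the code used here: the object `skills` (the scaled 506 x 35 occupation-by-skill matrix) is hypothetical, and the factoextra package is an assumed choice.

```r
library(factoextra)   # assumed: provides get_clust_tendency() and fviz_dist()

set.seed(123)

# Hopkins statistic: ~0.5 suggests uniform (random) data,
# values near 1 suggest clusterable data
get_clust_tendency(skills, n = 50, graph = FALSE)$hopkins_stat

# Distance matrices under the two metrics, visualized as heatmaps
d_euc <- dist(skills, method = "euclidean")
d_man <- dist(skills, method = "manhattan")
fviz_dist(d_euc)
fviz_dist(d_man)
```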

In the above heatmaps the rows and columns are the 506 occupations, and the colour indicates the distance between any two occupations. Note that distance is zero along the diagonal (the distance from an occupation to itself). The clustering of the data can be inferred from the patchwork design of the plot: occupations that are similar to one another in terms of skill requirements are grouped together, and the colour of the grouping is bluish. Occupations that differ markedly are located further apart, and the colour tends towards yellow. When we measure distance with the Euclidean metric it appears that there are 3 main groupings of occupations in terms of skills, whereas with Manhattan distance it appears that there are 2 main groups. Of course there is some variation in colour within these main groups, suggesting that it might be appropriate to break the data down further.
One potential way of addressing the problems associated with measuring distance in high-dimensional data is to perform principal component analysis and then cluster on a subset of the principal components. In the table below we see that the first 5 principal components contain 85% of the total variation in skills, and the 5th component only captures 2% of the variation (components 6 to 35 each contain even less).
The relationship can also be depicted by what is known as a scree plot:
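A minimal sketch of how the principal components and the scree plot might be obtained, assuming the same hypothetical `skills` matrix as above:

```r
# PCA of the skills data
pca <- prcomp(skills, center = TRUE, scale. = TRUE)

# Proportion of total variation captured by each component
var_share <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_share)[1:5], 2)   # first 5 components: roughly 85% (per the table)

# Scree plot
plot(var_share, type = "b", xlab = "Principal component",
     ylab = "Proportion of variance explained")

# Scores on the first 5 components, used for the 5D analysis below
pca5 <- pca$x[, 1:5]
```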

Next we look at the distance matrices based on the first 5 principal components.

There are a couple of differences of note when compared with the distance plots based on the original (35D) data. With the Euclidean distance metric, rather than 3 main clusters there now appear to be more (well-defined) smaller clusters: these look like small blue squares along the diagonal, of which I count at least 11. The Manhattan distance plot now looks more similar to the Euclidean distance plot than it did with the original (35D) data, and the two distance measures are closer in 5D than they were in 35D.
If the skill profiles of occupations were independent, the distance matrix would look very different. To investigate, for each skill we replace each NOC’s value with a random draw from a uniform distribution with the same support as the skill:
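A sketch of the randomization, assuming `skills` is held as a data frame of the 35 skill measures:

```r
set.seed(123)

# For each skill, draw uniformly between that skill's observed min and max,
# destroying any dependence between skills across occupations
random_skills <- as.data.frame(
  lapply(skills, function(s) runif(length(s), min = min(s), max = max(s)))
)

# Distance matrix of the random data, for comparison with the real data
fviz_dist(dist(random_skills, method = "euclidean"))
```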

Now all occupations are roughly the same distance apart from all other occupations, with the exception of the distance to themselves: the random data is not clusterable, whereas the skills data is. Based on the above, it appears to be optimal to cluster the data based on the first 5 principal components, using Euclidean distance and setting the number of clusters to 11. But how to cluster?
Algorithms considered
Agnes
- Agglomerative clustering is the most common type of hierarchical clustering used to group objects into clusters based on their similarity.
- It’s also known as AGNES (Agglomerative Nesting) and works from the bottom up.
- The algorithm starts by treating each object as a singleton cluster (leaf).
- At each step of the algorithm, the two clusters that are the most similar are combined into a new, bigger cluster (node).
- This procedure is iterated until all points are members of just one single big cluster (root).
- The result is a tree-based representation of the objects called a dendrogram (see the sketch after this list).
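A minimal usage sketch, assuming the agnes() function from the cluster package and the 5D scores `pca5` from the earlier sketch; the Ward linkage is an illustrative choice, not necessarily the one used here:

```r
library(cluster)

# Agglomerative clustering on the first 5 principal components
ag <- agnes(pca5, metric = "euclidean", method = "ward")

# Cut the dendrogram to obtain 11 clusters
agnes_clusters <- cutree(as.hclust(ag), k = 11)
table(agnes_clusters)
```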
Diana
- The inverse of agglomerative clustering is divisive clustering, which is also known as DIANA (Divisive Analysis) and works in a top-down manner.
- It begins with the root, in which all objects are included in a single cluster.
- At each step of the iteration, the most heterogeneous cluster is divided into two (nodes).
- The process is iterated until all objects are in their own cluster (leaf).
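An analogous sketch with diana() from the cluster package, on the same hypothetical 5D scores:

```r
# Divisive clustering on the first 5 principal components
dv <- diana(pca5, metric = "euclidean")

# Cut the tree at the desired number of clusters
diana_clusters <- cutree(as.hclust(dv), k = 11)
```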
K-means
- K-means clustering (MacQueen, 1967) is the most commonly used
unsupervised machine learning algorithm for partitioning a given data
set into a set of k groups (i.e. k clusters), where k represents the
number of groups pre-specified by the analyst.
- It classifies objects in multiple groups (i.e., clusters), such that
objects within the same cluster are as similar as possible (i.e., high
intra-class similarity), whereas objects from different clusters are as
dissimilar as possible (i.e., low inter-class similarity).
- In k-means clustering, each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the points assigned to the cluster.
- The algorithm starts by randomly selecting k objects from the data
set to serve as the initial centers for the clusters. The selected
objects are also known as cluster means or centroids.
- Cluster assignment: each of the remaining objects is assigned to its closest centroid, where closest is defined using the distance (typically Euclidean) between the object and the cluster mean.
- Centroid Update: After the assignment step, the algorithm computes
the new mean value of each cluster.
- After the centers have been recalculated, every observation is
checked again to see if it might be closer to a different cluster.
- All the objects are reassigned again using the updated cluster
means.
- The cluster assignment and centroid update steps are repeated iteratively until the cluster assignments stop changing (i.e., until convergence is achieved); a usage sketch follows below.
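A minimal sketch with base R’s kmeans(), which uses Euclidean distance; the nstart value is an illustrative choice:

```r
set.seed(123)

# k-means with k = 11 on the 5D scores; multiple starts guard against a poor
# random initialization of the centroids
km <- kmeans(pca5, centers = 11, nstart = 25)

head(km$cluster)   # cluster assignment for each occupation
km$centers         # the 11 centroids in 5D space
```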
PAM
- The k-medoids algorithm is a clustering approach related to k-means
clustering.
- In k-medoids clustering, each cluster is represented by one of the data points in the cluster.
- These points are named cluster medoids.
- The term medoid refers to the object within a cluster for which the average dissimilarity between it and all the other members of the cluster is minimal.
- It corresponds to the most centrally located point in the
cluster.
- These objects (one per cluster) can be considered as a
representative example of the members of that cluster.
- K-medoids is a robust alternative to k-means clustering.
- This means that the algorithm is less sensitive to noise and outliers than k-means, because it uses medoids as cluster centers instead of means.
- The algorithm starts by randomly selecting k objects to become the medoids.
- Assign every object to its closest medoid.
- For each cluster, check whether any object in the cluster decreases the average dissimilarity; if so, select that object as the new medoid.
- If at least one medoid has changed, return to the assignment step; otherwise end the algorithm (see the sketch below).
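And a sketch with pam() from the cluster package:

```r
# k-medoids (PAM) with k = 11; each medoid is an actual occupation
pm <- pam(pca5, k = 11, metric = "euclidean")

pm$medoids            # the representative row (occupation) for each cluster
head(pm$clustering)   # cluster assignment for each occupation
```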
Measures to assess the models:
We need to assess which clustering algorithm is best to use and to specify the number of clusters we want. We can assess the algorithms based on the following measures:
Connectivity
- Measures the extent to which items are placed in the same cluster as their nearest neighbors in the data space.
- Let \(N\) denote the total number
of observations (rows) in a dataset and \(M\) denote the total number of
columns.
- Define \(nn_{i(j)}\) as the \(j\)th nearest neighbor of observation \(i\), and let \(x_{i,nn_{i(j)}}\) be zero if \(i\) and \(nn_{i(j)}\) are in the same cluster and \(1/j\) otherwise.
- For a particular clustering partition \(C
= \{C_1 , . . . , C_K \}\) of the \(N\) observations into \(K\) disjoint clusters, the connectivity is
defined as: \[Conn(C) =
\sum^{N}_{i=1}\sum^{L}_{j=1}x_{i,nn_{i(j)}}\] where \(L\) is a parameter giving the number of
nearest neighbors to use.
- The connectivity has a value between zero and \(\infty\) and should be minimized.
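Connectivity is available through the connectivity() function in the clValid package; a sketch, using the hypothetical objects from the earlier sketches and clValid’s default neighbourhood size of 10:

```r
library(clValid)

# Connectivity of a candidate partition; lower is better
connectivity(distance = dist(pca5, method = "euclidean"),
             clusters = agnes_clusters,
             neighbSize = 10)
```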
Dunn
- The Dunn index is another internal clustering validation measure, which can be computed as follows:
- For each cluster, compute the distance between each of the objects in the cluster and the objects in the other clusters.
- Use the minimum of these pairwise distances as the inter-cluster separation (min.separation).
- For each cluster, compute the distances between the objects in the same cluster.
- Use the maximal intra-cluster distance (i.e., the maximum diameter) as a measure of intra-cluster compactness.
- Calculate the Dunn index (D) as follows: \(D=\frac{min.separation}{max.diameter}\)
- If the data set contains compact and well-separated clusters, the
diameter of the clusters is expected to be small and the distance
between the clusters is expected to be large.
- Large values of the Dunn index are desirable.
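The Dunn index can be computed with dunn() from the clValid package; a sketch:

```r
# Dunn index of the same candidate partition; higher is better
dunn(distance = dist(pca5, method = "euclidean"),
     clusters = agnes_clusters)
```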
Silhouette
- Silhouette analysis measures how well an observation is clustered, and it estimates the average distance between clusters.
- For each observation \(i\), the
silhouette width \(s_i\) is calculated
as follows:
- For each observation \(i\), calculate the average dissimilarity \(a_i\) between \(i\) and all other points of the cluster to which \(i\) belongs.
- For all other clusters \(C\), to
which \(i\) does not belong, calculate
the average dissimilarity \(d(i, C)\)
of \(i\) to all observations of \(C\). The smallest of these \(d(i, C)\) is defined as \(b_i = \min_{C} d(i, C)\). The value of
\(b_i\) can be seen as the
dissimilarity between \(i\) and its
neighbor cluster, i.e., the nearest one to which it does not
belong.
- The silhouette width of observation \(i\) is defined by the formula: \(S_i = (b_i - a_i)/\max(a_i, b_i)\).
- Silhouette width can be interpreted as follows:
- Observations with a large \(S_i\)
(almost 1) are very well clustered.
- A small \(S_i\) (around 0) means
that the observation lies between two clusters.
- Observations with a negative \(S_i\) are probably placed in the wrong
cluster.
- A simple example:
- Consider a case where we want to calculate the silhouette width of an observation \(x_1\) in cluster \(x\) with two other members \(x_2\) and \(x_3\), and the nearest other cluster \(y\) has two members \(y_1\) and \(y_2\).
- Suppose that the distance between \(x_1\) and \(x_2\) is 1.2, and the distance between \(x_1\) and \(x_3\) is 0.8.
- Thus the average dissimilarity between \(x_1\) and all other points in its cluster is \(a_{x_1}=1\).
- Further, suppose that the distance between \(x_1\) and \(y_1\) is 1.6, and the distance between \(x_1\) and \(y_2\) is 1.4.
- Thus the average dissimilarity between \(x_1\) and all points in the nearest other cluster is \(b_{x_1}=1.5\).
- The silhouette value for \(x_1\) is \(\frac{0.5}{1.5}=\frac{1}{3}\).
- In other words, if the closest other cluster is (on average) 50% further away than the other observations in its own cluster, then the silhouette value will be \(\frac{1}{3}\).
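The hand calculation above can be verified directly, and silhouette widths for a full clustering are available via silhouette() in the cluster package; a sketch:

```r
# Worked example from above
a_x1 <- mean(c(1.2, 0.8))          # average within-cluster dissimilarity = 1
b_x1 <- mean(c(1.6, 1.4))          # average dissimilarity to nearest other cluster = 1.5
(b_x1 - a_x1) / max(a_x1, b_x1)    # silhouette width = 1/3

# Silhouette widths for every occupation under a candidate partition
sil <- silhouette(agnes_clusters, dist(pca5, method = "euclidean"))
summary(sil)$avg.width             # average silhouette width
```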
What algorithm and what K?
The function clValid tests which algorithm/K combination optimizes the above (internal) measures of validity. For each combination of data dimensionality and distance metric we can see the following:
- Agnes (agglomerative hierarchical clustering) is typically the
“best” algorithm.
- If the number of clusters were chosen on the basis of Connectivity or Silhouette width, the optimal number of clusters would be two small (pun intended).
- In contrast, the Dunn index is maximized using the 35D data with Euclidean distance and 11 clusters.
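The tables below could be produced along these lines; a sketch, in which the range of cluster counts and the set of methods are assumptions:

```r
# Compare algorithms and cluster counts on the internal validity measures
cv <- clValid(pca5, nClust = 2:20,
              clMethods = c("agnes", "diana", "kmeans", "pam"),
              validation = "internal", metric = "euclidean")

optimalScores(cv)   # best Score / Method / Clusters for each measure
```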
35D data, Euclidean distance
##                   Score Method Clusters
## Connectivity 25.8960317  agnes        2
## Dunn          0.1959499  agnes       11
## Silhouette    0.3489387 kmeans        2
35D data, Manhattan distance
##                   Score Method Clusters
## Connectivity 35.0781746  agnes        2
## Dunn          0.1906731  agnes       16
## Silhouette    0.3961289 kmeans        2
5D data, Euclidean distance
##                   Score Method Clusters
## Connectivity  6.0813492  agnes        2
## Dunn          0.1146672  agnes        2
## Silhouette    0.4099474  diana        2
5D data, Manhattan distance
##                   Score Method Clusters
## Connectivity 21.9507937  agnes        2
## Dunn          0.1339870  agnes       20
## Silhouette    0.3606603 kmeans        2
Models fit:
Based on the above we proceed with 2 candidate algorithms:
- Agnes, 5D data, 11 clusters, Euclidean distance (suggested by visual inspection of the distance matrix)
- Agnes, 35D data, 11 clusters, Euclidean distance (suggested by the maximal value of the Dunn statistic)
Chosen algorithm:
Recall that for Agnes, 5D data, 11 clusters, Euclidean distance the average silhouette width was 0.3, whereas for Agnes, 35D data, 11 clusters, Euclidean distance the average silhouette width was 0.18.
Because a higher average silhouette width is desirable, we proceed with
Agnes, 5D data, 11 clusters, Euclidean distance. Note that even though
the average silhouette value is higher, there are still 56 NOCs that
have negative silhouette widths, indicating assignment to the wrong
cluster. In the table below we report both the original and corrected
cluster assignments, where we move NOCs that have a negative silhouette
value to their closest neighbour cluster.
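The correction can be read straight off the silhouette object, whose columns give each observation’s assigned cluster, its neighbour cluster, and its silhouette width; a sketch, reusing the hypothetical objects above:

```r
sil <- silhouette(agnes_clusters, dist(pca5, method = "euclidean"))

# Move NOCs with a negative silhouette width to their neighbour cluster
corrected <- ifelse(sil[, "sil_width"] < 0,
                    sil[, "neighbor"],
                    sil[, "cluster"])

sum(sil[, "sil_width"] < 0)   # number of NOCs reassigned
```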
Different from the Conference Board:
It is known that the Conference Board created its clusters using K-means clustering with 8 clusters on NOC 2016 data, but beyond that their method is somewhat of a black box. If we map from 2016 NOCs to 2021 NOCs we can create an alluvial plot that shows how the occupations from the Conference Board clusters flow into our clusters. As you can see from the alluvial plot, the clustering is obviously quite different.
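One way to draw such a plot is with the ggalluvial package; a sketch, assuming a hypothetical data frame `flows` with one row per occupation giving its Conference Board cluster (cb_cluster), our cluster (our_cluster), and a count n:

```r
library(ggplot2)
library(ggalluvial)

ggplot(flows, aes(axis1 = cb_cluster, axis2 = our_cluster, y = n)) +
  geom_alluvium(aes(fill = cb_cluster)) +   # flows between the two clusterings
  geom_stratum() +                          # boxes for each cluster
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), size = 3) +
  scale_x_discrete(limits = c("Conference Board", "Ours"))
```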

Conference Board vs. our clusters:
Higher measures of internal validity than the Conference Board:
Next we apply the Conference Board methodology (K-means, 8 clusters) to the NOC 2021 data, and compare the internal validity criteria with our AGNES clustering. We find that our clustering weakly dominates the Conference Board’s clustering when making an “apples to apples” comparison (i.e., using the same NOC 2021 data set). In particular, Agnes dominates in terms of silhouette width and connectivity, and ties with K-means in terms of the Dunn index.
Cluster Means:
Next we attach the cluster assignments to the scaled data and then calculate cluster means for each skill. It would be nice to see how the cluster means are arranged relative to one another. To do so we can use a biplot, which takes high-dimensional data (we have 35 skills) and represents it in two-dimensional space.
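A sketch of the cluster means and their biplot, assuming the hypothetical scaled `skills` data and the corrected cluster assignments from above:

```r
# Mean of each (scaled) skill within each cluster
cluster_means <- aggregate(skills, by = list(cluster = corrected), FUN = mean)

# Biplot of the cluster means: points are clusters, arrows are skills
pca_means <- prcomp(cluster_means[, -1])   # data are already scaled
biplot(pca_means, xlabs = as.character(cluster_means$cluster))
```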
Biplot of new cluster means:
Consider the following biplot. The arrows represent the projection of the 35 different skills into this two-dimensional space. The length of an arrow is proportional to the variance of the skill across NOCs, and the angle between two arrows reflects the correlation between the two skills (the smaller the angle, the stronger the correlation). The fact that many arrows overlap suggests a fair degree of redundancy in the skill measures (variables measuring basically the same thing). Notice that most arrows point northeast(ish), indicating this is where high-skill NOCs will be located (e.g., STEM). In the north direction the skills are hard (making and fixing things), whereas in the east the skills are soft (managing people). If high-skill NOCs are located to the northeast, then low-skill NOCs will be found to the southwest (e.g., Doers, personal services and artists).
Top 5 skills of each cluster:
Relationship between TEER and New clusters:
Note the strong relationship between our clusters and the TEER classifications.
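A sketch of how such a test might be run, assuming a hypothetical data frame `noc` with a TEER column and the cluster assignments:

```r
# Test the association between TEER category and cluster membership
chisq.test(table(noc$TEER, noc$cluster))
```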
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 791.33, df = 50, p-value < 2.2e-16
And finally, the scaled data:
References
Aggarwal, Charu C., Alexander Hinneburg, and Daniel A. Keim. 2001. “On the Surprising Behavior of Distance Metrics in High Dimensional Space.” In Database Theory—ICDT 2001: 8th International Conference, London, UK, January 4–6, 2001, Proceedings, 420–34. Springer.