Is the data clusterable?

Clustering algorithms will form clusters even when data is uniformly distributed. The first step in cluster analysis is therefore to investigate the underlying structure of the data, to ensure that clustering is justifiable. The Hopkins statistic measures whether the data can be clustered: values close to 0.5 indicate random data, whereas values close to 1 indicate clusterable data. We can also visualize the structure of the data by looking at the distance matrix. Two occupations that require exactly the same skills will have a distance of zero, and the more the occupations differ the larger the distance will be. This raises the question of which distance metric to use: we consider two, Euclidean (as the crow flies) and Manhattan (as the New York resident walks). There is some evidence (Aggarwal, Hinneburg, and Keim 2001) that Manhattan distance might be more appropriate in high dimensional space.
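
As a rough sketch of how these quantities can be computed in R (assuming the scaled skills data are in a data frame called `skills_scaled`, 506 rows by 35 columns; the object name is purely illustrative):

```r
library(hopkins)

set.seed(123)
# Hopkins statistic: ~0.5 suggests random data, values near 1 suggest clusterable data
hopkins(skills_scaled, m = 50)

# Pairwise distance matrices under the two metrics considered
d_euc <- dist(skills_scaled, method = "euclidean")
d_man <- dist(skills_scaled, method = "manhattan")
```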

In the above heatmaps the rows and columns are the 506 occupations, and the colour indicates the distance between any two occupations. Note that the distance along the diagonal is zero (the distance from an occupation to itself). The clustering of the data can be inferred from the patchwork pattern of the plot: occupations that are similar to one another in terms of skill requirements are grouped together, and these groupings appear blueish, whereas occupations that differ markedly are located further apart and the colour tends towards yellow. When we measure distance as Euclidean there appear to be 3 main groupings of occupations in terms of skills, whereas with Manhattan distance there appear to be 2 main groups. Of course there is some variation in colour within these main groups, suggesting that it might be appropriate to break the data down further.
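
Heatmaps like these can be produced with `factoextra::fviz_dist()`; a minimal sketch, reusing the distance objects from the previous chunk:

```r
library(factoextra)

fviz_dist(d_euc, show_labels = FALSE) + ggplot2::ggtitle("Euclidean distance")
fviz_dist(d_man, show_labels = FALSE) + ggplot2::ggtitle("Manhattan distance")
```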

One potential way of addressing the problems associated with measuring distance in high dimensional data is to perform principal component analysis and then cluster on a subset of the principal components. In the table below we see that the first 5 principal components contain 85% of the total variation in skills, and the 5th component only captures 2% of the variation (components 6-35 each contain even less).
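
A minimal sketch of the PCA step (again assuming the hypothetical `skills_scaled` object):

```r
pca <- prcomp(skills_scaled, center = TRUE, scale. = TRUE)

# Proportion of variance explained: the first 5 components capture ~85%
summary(pca)$importance[, 1:6]

# Scores on the first 5 principal components, used for clustering below
pca_5d <- pca$x[, 1:5]
```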

The relationship can also be depicted by what is known as a scree plot:
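
A scree plot can be drawn directly from the `prcomp` object, for example with factoextra:

```r
# Bar chart of the percentage of variance explained by each component
factoextra::fviz_eig(pca, ncp = 10)
```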

Next we look at the distance matrices based on the first 5 principal components.

There are a couple of differences of note when compared with the distance plots based on the original (35D) data. With the Euclidean distance metric, rather than 3 main clusters there now appear to be more, smaller, well defined clusters: these show up as small blue squares along the diagonal, of which I count at least 11. The Manhattan distance plot now looks more similar to the Euclidean distance plot than it did based on the original (35D) data; the two distance measures are closer in 5D than they were in 35D.

If the skill profiles of occupations were independent, the distance matrix would look quite different. To investigate, for each skill we replace each NOC’s value with a random draw from the uniform distribution with the same support as that skill:
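
A sketch of this randomization, keeping the same support (range) for each skill:

```r
set.seed(123)
# Column by column: replace observed values with uniform draws over the same range
skills_random <- as.data.frame(
  lapply(skills_scaled, function(col) runif(length(col), min(col), max(col)))
)
fviz_dist(dist(skills_random, method = "euclidean"), show_labels = FALSE)
```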

Now all occupations are roughly the same distance from all other occupations, with the exception of the zero distance of each occupation to itself: the random data is not clusterable, whereas the skills data is. Based on the above, it appears reasonable to cluster the data based on the first 5 principal components, using Euclidean distance and setting the number of clusters to 11. But how to cluster?

Algorithms considered

Agnes

  • Agglomerative clustering is the most common type of hierarchical clustering, used to group objects into clusters based on their similarity.
  • It’s also known as AGNES (Agglomerative Nesting) and works from the bottom up.
    • The algorithm starts by treating each object as a singleton cluster (leaf).
    • At each step of the algorithm, the two clusters that are the most similar are combined into a new, bigger cluster (a node).
  • This procedure is iterated until all points are members of a single big cluster (the root).
  • The result is a tree-based representation of the objects, called a dendrogram; a sketch of fitting AGNES in R follows below.
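
A minimal sketch of AGNES with the cluster package, using the 5-dimensional PCA scores from above (Ward linkage is shown purely for illustration):

```r
library(cluster)

agnes_fit <- agnes(pca_5d, metric = "euclidean", method = "ward")
plot(agnes_fit, which.plots = 2)                       # dendrogram
clusters_agnes <- cutree(as.hclust(agnes_fit), k = 11) # cut into 11 clusters
```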

Diana

  • The inverse of agglomerative clustering is divisive clustering, also known as DIANA (DIvisive ANAlysis), which works in a top-down manner.
    • It begins with the root, in which all objects are included in a single cluster.
    • At each step of iteration, the most heterogeneous cluster is divided into two (nodes).
    • The process is iterated until all objects are in their own cluster (leaf).
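
The divisive counterpart, again only as an illustrative call on the same data:

```r
diana_fit <- diana(pca_5d, metric = "euclidean")
clusters_diana <- cutree(as.hclust(diana_fit), k = 11)
```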

K-means

  • K-means clustering (MacQueen, 1967) is the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of k groups (i.e. k clusters), where k represents the number of groups pre-specified by the analyst.
  • It classifies objects in multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity).
  • In k-means clustering, each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the points assigned to the cluster.
    • The algorithm starts by randomly selecting k objects from the data set to serve as the initial centers for the clusters. The selected objects are also known as cluster means or centroids.
    • Cluster assignment: each of the remaining objects is assigned to its closest centroid, where closest is defined using either Euclidean or Manhattan distance between the object and the cluster mean.
    • Centroid Update: After the assignment step, the algorithm computes the new mean value of each cluster.
    • After the centers have been recalculated, every observation is checked again and reassigned to the cluster whose updated mean is now closest.
    • The cluster assignment and centroid update steps are iteratively repeated until the cluster assignments stop changing (i.e. until convergence is achieved); a sketch of a call in R follows below.
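
A sketch of k-means on the PCA scores; `nstart` restarts the algorithm from several random sets of initial centroids and keeps the best solution:

```r
set.seed(123)
kmeans_fit <- kmeans(pca_5d, centers = 11, nstart = 25)
table(kmeans_fit$cluster)  # cluster sizes
```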

PAM

  • The k-medoids algorithm is a clustering approach related to k-means clustering.
  • In k-medoids clustering, each cluster is represented by one of the data points in the cluster.
  • These points are named cluster medoids.
  • The term medoid refers to an object within a cluster for which the average dissimilarity between it and all the other members of the cluster is minimal.
  • It corresponds to the most centrally located point in the cluster.
  • These objects (one per cluster) can be considered as a representative example of the members of that cluster.
  • K-medoid is a robust alternative to k-means clustering.
  • This means that the algorithm is less sensitive to noise and outliers than k-means, because it uses medoids as cluster centers instead of means.
    1. The algorithm starts by randomly selecting k objects to become the medoids.
    2. Assign every object to its closest medoid.
    3. For each cluster, check whether any object in the cluster decreases the average dissimilarity; if so, select that object as the new medoid.
    4. If at least one medoid has changed, go back to step 2; otherwise end the algorithm.
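
A sketch of k-medoids with `cluster::pam()`; note that the medoids returned are actual occupations rather than averaged skill profiles:

```r
pam_fit <- pam(pca_5d, k = 11, metric = "euclidean")
pam_fit$medoids  # the representative observation for each cluster
```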

Measures to assess the models:

We need to assess which is the best clustering algorithm to use and specify the number of clusters we want. We can assess the algorithms based on the following measures:

Connectivity

  • Measures the extent to which items are placed in the same cluster as their nearest neighbors in the data space.
    • Let \(N\) denote the total number of observations (rows) in a dataset and \(M\) denote the total number of columns.
    • Define \(nn_{i(j)}\) as the \(j\)th nearest neighbor of observation \(i\), and let \(x_{i,nn_{i(j)}}\) be zero if \(i\) and \(nn_{i(j)}\) are in the same cluster and \(1/j\) otherwise.
    • For a particular clustering partition \(C = \{C_1 , . . . , C_K \}\) of the \(N\) observations into \(K\) disjoint clusters, the connectivity is defined as: \[Conn(C) = \sum^{N}_{i=1}\sum^{L}_{j=1}x_{i,nn_{i(j)}}\] where \(L\) is a parameter giving the number of nearest neighbors to use.
  • The connectivity has a value between zero and \(\infty\) and should be minimized.
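
Connectivity can be computed directly with `clValid::connectivity()`, where `neighbSize` plays the role of \(L\) above; a sketch using the AGNES assignment from the earlier chunk:

```r
library(clValid)

connectivity(Data = pca_5d, clusters = clusters_agnes,
             neighbSize = 10, method = "euclidean")
```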

Dunn

  • The Dunn index is another internal clustering validation measure, which can be computed as follows:
    1. For each cluster, compute the distance between each of the objects in the cluster and the objects in the other clusters.
    2. Use the minimum of these pairwise distances as the inter-cluster separation (min.separation).
    3. For each cluster, compute the distance between the objects in the same cluster.
    4. Use the maximal intra-cluster distance (i.e. the maximum diameter) as the intra-cluster compactness (max.diameter).
    5. Calculate the Dunn index (D) as follows: \(D=\frac{min.separation}{max.diameter}\)
  • If the data set contains compact and well-separated clusters, the diameter of the clusters is expected to be small and the distance between the clusters is expected to be large.
  • Large values of the Dunn index are desirable.
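
The Dunn index is available from the same clValid package; a sketch:

```r
dunn(Data = pca_5d, clusters = clusters_agnes, method = "euclidean")
```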

Silhouette

  • The silhouette analysis measures how well an observation is clustered and it estimates the average distance between clusters.
  • For each observation \(i\), the silhouette width \(s_i\) is calculated as follows:
    1. For each observation i, calculate the average dissimilarity \(a_i\) between \(i\) and all other points of the cluster to which \(i\) belongs.
    2. For all other clusters \(C\), to which \(i\) does not belong, calculate the average dissimilarity \(d(i, C)\) of \(i\) to all observations of \(C\). The smallest of these \(d(i, C)\) is defined as \(b_i = \min_{C} d(i, C)\). The value of \(b_i\) can be seen as the dissimilarity between \(i\) and its neighbor cluster, i.e., the nearest one to which it does not belong.
    3. The silhouette width of the observation \(i\) is defined by the formula: \(S_i = (b_i - a_i)/\max(a_i, b_i)\).
  • Silhouette width can be interpreted as follows:
    • Observations with a large \(S_i\) (almost 1) are very well clustered.
    • A small \(S_i\) (around 0) means that the observation lies between two clusters.
    • Observations with a negative \(S_i\) are probably placed in the wrong cluster.
  • A simple example:
    • Consider a case where we want to calculate the silhouette width of an observation \(x_1\) in cluster \(x\) with two other members \(x_2\) and \(x_3\), and the nearest other cluster \(y\) has two members \(y_1\) and \(y_2\).
    • Suppose that the distance between \(x_1\) and \(x_2\) is 1.2, and the distance between \(x_1\) and \(x_3\) is .8.
    • Thus the average dissimilarity between \(x_1\) and all other points in its cluster is \(a_{x_1}=1\).
    • Further, suppose that the distance between \(x_1\) and \(y_1\) is 1.6, and the distance between \(x_1\) and \(y_2\) is 1.4.
    • Thus the average dissimilarity between \(x_1\) and all points in the nearest other cluster is \(b_{x_1}=1.5\).
    • The silhouette value for \(x_1\) is \(\frac{.5}{1.5}=\frac13\).
    • In other words, if the closest other cluster is (on average) 50% further away than the other observations in \(x_1\)'s own cluster, then the silhouette value will be \(\frac13\).
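
The toy calculation can be reproduced in a couple of lines:

```r
a <- mean(c(1.2, 0.8))  # average dissimilarity within x1's own cluster
b <- mean(c(1.6, 1.4))  # average dissimilarity to the nearest other cluster
(b - a) / max(a, b)     # silhouette width = 1/3
```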

What algorithm and what K?

The function clValid tests which algorithm/K combination optimizes the above (internal) measures of validity (recall that connectivity should be minimized, while the Dunn index and silhouette width should be maximized). Depending on the dimensionality of the data and the distance metric utilized, we see the following:
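
A minimal sketch of one such call (here for the 5D PCA scores with Euclidean distance; analogous calls on the 35D data and with `metric = "manhattan"` produce the other panels below):

```r
library(clValid)

cv <- clValid(pca_5d, nClust = 2:20,
              clMethods = c("agnes", "diana", "kmeans", "pam"),
              validation = "internal", metric = "euclidean")
optimalScores(cv)  # best score, method, and number of clusters per measure
```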

35D data, Euclidean distance

##                   Score Method Clusters
## Connectivity 25.8960317  agnes        2
## Dunn          0.1959499  agnes       11
## Silhouette    0.3489387 kmeans        2

35D data, Manhattan distance

##                   Score Method Clusters
## Connectivity 35.0781746  agnes        2
## Dunn          0.1906731  agnes       16
## Silhouette    0.3961289 kmeans        2

5D data, Euclidean distance

##                  Score Method Clusters
## Connectivity 6.0813492  agnes        2
## Dunn         0.1146672  agnes        2
## Silhouette   0.4099474  diana        2

5D data, Manhattan distance

##                   Score Method Clusters
## Connectivity 21.9507937  agnes        2
## Dunn          0.1339870  agnes       20
## Silhouette    0.3606603 kmeans        2

Models fit:

Based on the above we proceed with 2 candidate algorithms:

Chosen algorithm:

Recall that for Agnes with the 5D data, 11 clusters, and Euclidean distance the average silhouette width was 0.3, whereas for Agnes with the 35D data, 11 clusters, and Euclidean distance it was 0.18. Because a higher average silhouette width is desirable, we proceed with Agnes on the 5D data, with 11 clusters and Euclidean distance. Note that even though the average silhouette value is higher, there are still 56 NOCs that have negative silhouette widths, indicating assignment to the wrong cluster. In the table below we report both the original and corrected cluster assignments, where we move NOCs with a negative silhouette value to their closest neighbouring cluster.
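
A sketch of this correction, using the `neighbor` column of the silhouette object (hypothetical object names as in the earlier chunks):

```r
sil <- silhouette(clusters_agnes, dist(pca_5d, method = "euclidean"))

mean(sil[, "sil_width"])     # average silhouette width
sum(sil[, "sil_width"] < 0)  # NOCs with negative silhouette widths

# Move negative-width NOCs to their nearest neighbouring cluster
corrected <- ifelse(sil[, "sil_width"] < 0, sil[, "neighbor"], sil[, "cluster"])
```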

Different from the Conference Board:

It is known that the Conference Board created its clusters using K-means clustering with 8 clusters on NOC 2016 data, but beyond that their method is somewhat of a black box. If we map from 2016 NOCs to 2021 NOCs we can create an alluvium plot that shows how the occupations from the Conference Board clusters flow into our clusters. As you can see from the alluvium plot, the clustering is obviously quite different.

Conference Board vs. our clusters:

Higher measures of internal validity than Conference Board:

Next we apply the Conference Board methodology (K-means, 8 clusters) to the NOC 2021 data, and compare the internal validity criteria with those of our AGNES clustering. We find that our clustering weakly dominates the Conference Board’s clustering when making an “apples to apples” comparison (i.e. using the same NOC 2021 data set). In particular, AGNES dominates in terms of silhouette width and connectivity, and ties with K-means in terms of the Dunn index.

Cluster Means:

Next we attach the cluster assignments to the scaled data and then calculate cluster means for each skill. It would be nice to see how the cluster means are arranged relative to one another. To do so we can use a biplot, which takes high dimensional data (we have 35 skills) and represents it in 2 dimensional space.
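
A sketch of this step, assuming `skills_scaled` holds the 35 scaled skills and `corrected` the cluster assignments from above:

```r
library(dplyr)

cluster_means <- skills_scaled %>%
  mutate(cluster = corrected) %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean))
```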

Biplot of new cluster means:

Consider the following biplot. The arrows represent the projection of the 35 different skills into this 2 dimensional space. The length of an arrow is proportional to the variance of the skill across NOCs, and the angle between two arrows reflects the correlation between the two skills (the smaller the angle, the higher the correlation). The fact that many arrows overlap suggests a fair degree of redundancy in the skill measures (variables measuring basically the same thing). Notice that most arrows point northeast(ish), indicating this is where high skill NOCs will be located (e.g. STEM). Towards the north the skills are hard (making and fixing things), whereas towards the east the skills are soft (managing people). If high skill NOCs are located to the northeast, then low skill NOCs will be found to the southwest (e.g. Doers, personal services and artists).
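
One way such a biplot can be produced (a sketch, not necessarily the exact figure shown): run a PCA on the matrix of cluster means and plot the first two components with the skills as arrows.

```r
pca_means <- prcomp(select(cluster_means, -cluster))
fviz_pca_biplot(pca_means, repel = TRUE)
```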

Top 5 skills of each cluster:

Relationship between TEER and New clusters:

Note the strong relationship between our clusters and the TEER classifications.

## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 791.33, df = 50, p-value < 2.2e-16
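
The test can be reproduced along these lines, where `teer` is assumed to be a vector giving each NOC’s TEER category in the same order as the cluster assignments:

```r
chisq.test(table(teer, corrected))
```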

The raw data:

And finally, the scaled data:

References

Aggarwal, Charu C, Alexander Hinneburg, and Daniel A Keim. 2001. “On the Surprising Behavior of Distance Metrics in High Dimensional Space.” In Database Theory—ICDT 2001: 8th International Conference London, UK, January 4–6, 2001 Proceedings 8, 420–34. Springer.