imaging data pulled: 2021-01-21
clinical data pulled: 2020-11-16
code written: 2021-02-15
last run: 2021-08-09
website: http://rpubs.com/navona/NM_clustering
code: https://github.com/navonacalarco/NM-MRI/blob/master/analyses/05_hierarchicalClustering.Rmd





Background: CCA analysis

In another set of analyses, we performed canonical correlation analysis (CCA) to describe the association between two sets of variables: a ‘brain’ (\(X\)) set and a ‘behaviour’ (\(Y\)) set. The \(X\) set comprised 6 NM variables (brightness from the segments of the LC), and the \(Y\) set comprised 14 behaviour variables from the RBANS and DKEFS. Both the brain and behaviour variables were corrected for age and sex. The CCA found a statistically significant relationship between the two sets (Wilks’ p = 0.0031), with the first two variates of the solution reaching statistical significance (canonical correlations of 0.8081 and 0.7686, respectively).
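For reference, below is a minimal R sketch of how such a CCA could be run with stats::cancor() and CCP::p.asym(). The object names (brain, cognition, X_scores) are assumptions made here for illustration; the original analysis code may have used a different implementation.

# Minimal sketch of the prior CCA step (assumed object names; not the original script).
# `brain`: data frame of the 6 NM variables; `cognition`: data frame of the 14
# behaviour variables; both already residualized for age and sex.
library(CCP)

cca <- cancor(brain, cognition)   # canonical correlations and coefficients
cca$cor                           # canonical correlation for each variate

# Wilks' lambda test of the overall solution
p.asym(rho = cca$cor, N = nrow(brain), p = ncol(brain), q = ncol(cognition),
       tstat = "Wilks")

# participant-wise scores on the brain (X) side; column 1 = CV1, column 2 = CV2
X_scores <- scale(as.matrix(brain), center = cca$xcenter, scale = FALSE) %*% cca$xcoef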


Present analysis: Hierarchical clustering

CCA could be our endpoint, but it is also possible to use its outputs, which capture multivariate correlations between brain and behaviour, as input to various data-driven modelling techniques. Here, we perform hierarchical clustering on the \(X\) scores derived from the first (CV1) and second (CV2) canonical variates of the CCA. This restricts our clustering effort to axes of brain variability that we know are related to the cognitive symptomatology we are interested in. The derived clusters are therefore constrained to meaningful information, which we can then review for potential clinical relevance (e.g., identifying participants with depression, or with different clinical/severity profiles). We largely follow the approach taken by Dinga and colleagues in their 2019 NeuroImage paper.


STEP 1

Select hierarchical clustering method

We opted to perform hierarchical clustering, as it does not require us to pre-specify the number of clusters (as is required by k-means and related methods). However, we do select among the available hierarchical clustering solutions by choosing the one with the highest agglomerative coefficient, which measures the strength of the identified clustering structure (values closer to 1 suggest a stronger structure). We find that hierarchical clustering using Euclidean distance and Ward’s D linkage is best for our data, for both significant variates. This method minimizes the total within-cluster variance.

CV1

##   average    single  complete      ward 
## 0.9052629 0.7847475 0.9526009 0.9720249

CV2

##   average    single  complete      ward 
## 0.8865483 0.6848490 0.9421517 0.9665573
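A minimal R sketch of this comparison, using cluster::agnes(); `X_scores` is carried over from the CCA sketch above, and the variable names are assumptions rather than those of the original code.

# Compare agglomerative coefficients across linkage methods (sketch).
library(cluster)

cv1_scores <- data.frame(cv1 = X_scores[, 1])   # participants' X scores on CV1
methods    <- c("average", "single", "complete", "ward")

sapply(methods, function(m) agnes(cv1_scores, method = m)$ac)
# repeat with X_scores[, 2] for CV2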

STEP 2

Perform hierarchical clustering

Now, using the Euclidean distance measure and Ward’s D linkage method, we perform hierarchical clustering on the \(X\) set’s CV1 values (participant-wise scores) from the previously found CCA solution. We can visualize the hierarchical clustering using a dendrogram, below. Individual observations (i.e., participants) are on the x axis. The y axis represents the ‘height of the fusion’, which indicates the (dis)similarity between two observations: the higher the height of the fusion, the less similar the observations. (Note that conclusions about the proximity of two observations can be drawn only from the height at which the branches containing those two observations are first fused; we cannot use the proximity of two observations along the x axis as a criterion of their similarity.)
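A sketch of this step in R, assuming the `cv1_scores` object defined above; hclust()'s "ward.D" option is used here to stand in for Ward's D, though the original code may use a different implementation of Ward linkage.

# Hierarchical clustering of CV1 scores: Euclidean distance, Ward linkage (sketch).
d_cv1 <- dist(cv1_scores, method = "euclidean")
hc1   <- hclust(d_cv1, method = "ward.D")

# dendrogram: participants on the x axis, height of fusion on the y axis
plot(hc1, labels = FALSE, hang = -1,
     main = "CV1 dendrogram", xlab = "Participants", ylab = "Height of fusion")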

Cluster solution

Visualization of 2 clusters

Here, we highlight where the algorithm would “cut the tree” for 2 clusters. This looks good, but it seems as though the right cluster might contain 2 distinct groups (i.e., a three-cluster solution might be preferable).

Visualization of 3 clusters

Here, we highlight where the algorithm would “cut the tree” for 3 clusters. This looks like a good fit for the data.

Visualization of 4 clusters

Here, we highlight where the algorithm would “cut the tree” for 4 clusters. We see that one of the clusters is very small, containing just 4 participants. As we likely want all clusters to contain at least 5 participants, we might not want to use the 4-cluster solution, even if it proves to be a good fit.
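The candidate solutions above can be produced by cutting the tree at different values of k; a brief sketch, continuing from the hypothetical `hc1` object defined earlier:

# Cut the CV1 dendrogram at k = 2, 3, and 4 and inspect cluster sizes (sketch).
for (k in 2:4) {
  print(table(cutree(hc1, k = k)))   # flags any cluster with < 5 participants
}

# highlight a given cut on the dendrogram, e.g., the 3-cluster solution
plot(hc1, labels = FALSE, hang = -1)
rect.hclust(hc1, k = 3, border = "red")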


STEP 3

Confirm optimal clusters

Next, we can determine the optimal number of clusters without relying on visual heuristics (as we did above). Dinga and colleagues use two methods: the Calinski-Harabasz (CH) index and the silhouette index. The two methods agree that the ‘best’ clustering yields 3 clusters, if we insist that all clusters contain at least 5 participants. Essentially, then, we are comparing the 3-cluster solution only to the 2-cluster solution. Note that neither the CH index nor the silhouette index provides a statistical test of, or statistical evidence for, the existence of 3 clusters; they indicate only that, if clustering is performed (i.e., if 2 or more clusters are derived from the data), the 3-cluster solution is preferred. Notably, they do not tell us whether the derived indices are significantly higher than what would have been expected under the null hypothesis of data with no underlying clusters (i.e., a continuous distribution).

Calinski-Harabasz index

The Calinski-Harabasz index is also known as the Variance Ratio Criterion. It is the ratio of between-cluster dispersion to within-cluster dispersion, computed over all clusters. The higher the score, the better the clustering.

CV1

##     k=2     k=3     k=4     k=5
##  59.636 62.6817 68.6381 60.3294

CV2

##     k=2     k=3     k=4     k=5
## 57.3633 66.8429 69.1993 63.4672
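One way to compute these values in R is fpc::calinhara(), sketched below for the CV1 tree (reusing the assumed `cv1_scores` and `hc1` objects from the earlier sketches; the original code may differ).

# Calinski-Harabasz index for k = 2..5 (sketch).
library(fpc)

ch <- sapply(2:5, function(k) calinhara(as.matrix(cv1_scores), cutree(hc1, k = k)))
names(ch) <- 2:5
round(ch, 4)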

Silhouette index

The silhouette index compares average within-cluster distances to average distances between points from different clusters. A higher silhouette coefficient indicates better-defined clusters.

CV1

##    k=2    k=3    k=4    k=5
## 0.4418 0.4106 0.3583 0.3472

CV2

##    k=2    k=3    k=4    k=5
## 0.4525 0.4264 0.4255 0.3738
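Equivalently, the average silhouette width can be obtained with cluster::silhouette(); a sketch, reusing the hypothetical `d_cv1` and `hc1` objects:

# Average silhouette width for k = 2..5 (sketch).
library(cluster)

sil <- sapply(2:5, function(k) {
  mean(silhouette(cutree(hc1, k = k), d_cv1)[, "sil_width"])
})
names(sil) <- 2:5
round(sil, 4)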


STEP 4

Visualize clusters

For visualization purposes, we plot participant-wise \(X\) vs \(Y\) scores on the first and second variates (both significant in our model), as above, now coloured according to the 3-cluster solution. It is important to note that agreement between the cluster assignments derived from CV1 and CV2 is not high.
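A sketch of one way to draw this plot with ggplot2; the data frame `scores` (with columns x_cv1 and y_cv1 holding the participant-wise X and Y scores on CV1) is an assumed name, not an object from the original code.

# Scatter plot of X vs Y scores on CV1, coloured by the 3-cluster solution (sketch).
library(ggplot2)

scores$cluster <- factor(cutree(hc1, k = 3))

ggplot(scores, aes(x = x_cv1, y = y_cv1, colour = cluster)) +
  geom_point() +
  labs(x = "Brain (X) score, CV1", y = "Behaviour (Y) score, CV1", colour = "Cluster") +
  theme_minimal()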


STEP 5

Cluster significance test

Now, we test whether the observed CH and silhouette indices for 3 clusters are unlikely to occur under the null hypothesis of no clusters in the data. The null hypothesis here is that the data came from a single two-dimensional Gaussian distribution (i.e., a distribution with no underlying clusters). We draw random samples, each the size of our dataset, from a Gaussian distribution with the same parameters as our data (mean and covariance). We then run the same hierarchical clustering procedure as performed on the real data on each random sample, and calculate the best obtained CH and silhouette indices (this time, capped at 3 clusters), thus obtaining an empirical null distribution of these indices. The p-value is then defined as the proportion of indices in the null distribution that are at least as large as the value observed in our real data.
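A sketch of this simulation in R, assuming `dat` is the matrix of scores that was clustered and `observed_ch` / `observed_sil` hold the indices reported above (all hypothetical names); the original implementation may differ in its details.

# Empirical null distribution of CH and silhouette indices under a single
# Gaussian with the data's mean and covariance (sketch).
library(MASS)
library(fpc)
library(cluster)

set.seed(2021)
n_sim    <- 1000
null_ch  <- numeric(n_sim)
null_sil <- numeric(n_sim)

for (i in seq_len(n_sim)) {
  sim   <- mvrnorm(n = nrow(dat), mu = colMeans(dat), Sigma = cov(dat))
  d_sim <- dist(sim)
  hc    <- hclust(d_sim, method = "ward.D")
  ks    <- 2:3                                   # capped at 3 clusters
  null_ch[i]  <- max(sapply(ks, function(k) calinhara(sim, cutree(hc, k = k))))
  null_sil[i] <- max(sapply(ks, function(k)
    mean(silhouette(cutree(hc, k = k), d_sim)[, "sil_width"])))
}

# empirical p-values: proportion of null indices at least as large as observed
mean(null_ch  >= observed_ch)
mean(null_sil >= observed_sil)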

We fail to reject the null hypothesis; i.e., we do not find evidence of clusters in our data. However, it may be that the biological axes related to cognition identified via the CCA are important in their own right, without subsequent classification into subtypes, which does not appear to provide additional insight.

P values

CV1

## CH p-value  Silhouette p-value
##      0.383               0.782

CV2

## CH p-value  Silhouette p-value
##      0.216               0.142
Visualization of null distribution CH indices

The CH values of our 3-cluster solutions are shown in red for comparative purposes, at 62.6817 (CV1) and 66.8429 (CV2).

Visualization of null distribution Silhouette indices

The silhouette values of our 3-cluster solutions are shown in red for comparative purposes, at 0.4106 (CV1) and 0.4264 (CV2).


STEP 6

Post hoc exploration

As stated above, there is no evidence for clusters in our data. For exploratory purposes only, we can examine the characteristics of participants in each cluster of the 3-cluster solution. It does not appear that the (non-significant) clusters differ in age or depression severity.

Age

PHQ-9

MADRS
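A sketch of how these exploratory comparisons could be run, assuming a clinical data frame `clin` with columns age, phq9, madrs, and cluster (hypothetical names); the Kruskal-Wallis test is used here purely for illustration, not because it was the test applied in the original analysis.

# Describe and compare age, PHQ-9, and MADRS across the 3 clusters (sketch).
vars <- c("age", "phq9", "madrs")

for (v in vars) {
  cat("\n==", v, "==\n")
  print(tapply(clin[[v]], clin$cluster, summary))   # per-cluster summaries
  print(kruskal.test(clin[[v]] ~ clin$cluster))     # omnibus group comparison
}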