imaging data pulled: 2021-01-21
clinical data pulled: 2020-11-16
code written: 2021-02-15
last ran: 2021-08-09
website: http://rpubs.com/navona/NM_clustering
code: https://github.com/navonacalarco/NM-MRI/blob/master/analyses/05_hierarchicalClustering.Rmd
Background: CCA analysis
In another set of analyses, we performed canonical correlation analysis (CCA) to describe the association between two sets of variables: a ‘brain’ (\(X\)) set and a ‘behaviour’ (\(Y\)) set. The \(X\) set included 6 NM variables (brightness from the segments of the LC), and the \(Y\) set included 14 behaviour variables from the RBANS and DKEFS. Both the brain and cognition variables were corrected for age and sex. The CCA found a statistically significant relationship between the two sets (Wilks’ p = 0.0031); the first two variates of the solution reached statistical significance, with canonical correlations of 0.8081 and 0.7686, respectively.
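For reference, a minimal sketch of how such a CCA could be run in R is shown below, using `stats::cancor()` and the `CCP` package for the Wilks test. The object names `brain_vars` and `behav_vars` are placeholders, and this is not necessarily the pipeline used in the original analysis (see the linked repository for the actual code).

```r
# Sketch only: hypothetical object names; age/sex correction is assumed to have
# been applied to both sets beforehand.
library(CCP)  # asymptotic significance tests for canonical correlations

# brain_vars: n x 6 matrix of NM variables; behav_vars: n x 14 matrix of RBANS/DKEFS variables
cca <- cancor(brain_vars, behav_vars)
cca$cor  # canonical correlations for each variate

# Wilks' lambda tests of the canonical dimensions
p.asym(cca$cor, N = nrow(brain_vars), p = ncol(brain_vars),
       q = ncol(behav_vars), tstat = "Wilks")
```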
Present analysis: Hierarchical clustering
CCA could be our endpoint, but it is also possible to use its output (which captures multivariate correlations between brain and behaviour) as an input to various data-driven modelling techniques. Here, we perform hierarchical clustering on the \(X\) values derived from the first (CV1) and second (CV2) canonical variates from the CCA. This allows us to restrict our clustering effort to axes of brain variability that we know are related to the cognitive symptomatology we are interested in. The derived clusters will therefore be constrained to meaningful information, which we can then review for potential clinical relevance (e.g., to identify those with depression, or different clinical/severity profiles). We largely follow the approach taken by Dinga and colleagues in their 2019 NeuroImage paper.
Select hierarchical clustering method
We opted to perform hierarchical clustering, as it does not require us to pre-specify the number of clusters to be generated (as is required by k-means and similar methods). However, we do select among the available hierarchical clustering solutions by choosing the one with the highest agglomerative coefficient, which measures the strength of the clustering structure identified (values closer to 1 suggest a stronger clustering structure). We find that hierarchical clustering using Euclidean distance and Ward’s D linkage is best for our data, for both significant variates; this method minimizes the total within-cluster variance.
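The coefficients reported below could be obtained with something like the following sketch, where `cv1_scores` (and analogously `cv2_scores`) is a placeholder for the participant-wise \(X\) scores being clustered.

```r
library(cluster)

# compare agglomerative coefficients across linkage methods; agnes() returns
# the coefficient in the $ac component (closer to 1 = stronger structure)
methods <- c("average", "single", "complete", "ward")
ac <- sapply(methods, function(m) agnes(cv1_scores, method = m)$ac)
round(ac, 4)
```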
| Variate | average | single | complete | ward |
|---|---|---|---|---|
| CV1 | 0.9052629 | 0.7847475 | 0.9526009 | 0.9720249 |
| CV2 | 0.8865483 | 0.6848490 | 0.9421517 | 0.9665573 |
Perform hierarchical clustering
Now, using the Euclidean distance measure and Ward’s D linkage method, we perform hierarchical clustering on the \(X\) set’s CV1 values (participant-wise scores) from the previously found CCA solution. We can visualize the hierarchical clustering using a dendrogram, below. Individual observations (i.e., participants) are on the x axis. The y axis represents the ‘height of the fusion’, which indicates the (dis)similarity between two observations: the higher the height of the fusion, the less similar the observations are. (Note that conclusions about the proximity of two observations can be drawn only from the height at which the branches containing those two observations are first fused; we cannot use the proximity of two observations along the x axis as a criterion of their similarity.)
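A sketch of this step is shown below; `cv1_scores` is again a placeholder for the participant-wise CV1 \(X\) scores.

```r
# Euclidean distances between participants, then Ward's D hierarchical clustering
d_cv1 <- dist(cv1_scores, method = "euclidean")
hc_cv1 <- hclust(d_cv1, method = "ward.D")

# dendrogram: participants on the x axis, height of fusion on the y axis
plot(hc_cv1, labels = FALSE, hang = -1,
     main = "Hierarchical clustering of CV1 scores",
     xlab = "Participants", ylab = "Height of fusion")
```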
Here, we highlight where the algorithm would “cut the tree” for 2 clusters. This looks good, but it seems as though the right cluster might contain 2 distinct groups (i.e., a three-cluster solution might be preferable).
Here, we highlight where the algorithm would “cut the tree” for 3 clusters. This looks like a good fit for the data.
Here, we highlight where the algorithm would “cut the tree” for 4 clusters. We see that one of the clusters is very small, containing just 4 participants. As we likely want all clusters to contain at least 5 participants, we might not want to use the 4-cluster solution, even if it proves to be a good fit.
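The cuts illustrated above can be reproduced with `cutree()` and `rect.hclust()`, as in the sketch below (using the `hc_cv1` object from the previous sketch).

```r
# cluster sizes for 2- to 4-cluster solutions; flags any cluster with < 5 participants
for (k in 2:4) {
  print(table(cutree(hc_cv1, k = k)))
}

# highlight the 3-cluster cut on the dendrogram
plot(hc_cv1, labels = FALSE, hang = -1)
rect.hclust(hc_cv1, k = 3, border = "red")
```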
Confirm optimal clusters
Next, we can determine the optimal number of clusters without relying on visual heuristics (as we did above). Dinga and colleagues use two methods: the Calinski-Harabasz (CH) index and the silhouette index. The two methods agree that the ‘best’ clustering yields 3 clusters, if we insist that all clusters contain at least 5 participants. Essentially, then, we are comparing the 3-cluster solution only to the 2-cluster solution. Note that neither the CH index nor the silhouette index provides a statistical test or statistical evidence of the existence of 3 clusters, only that if clustering is performed (i.e., if 2 or more clusters are derived from the data), then the 3-cluster solution is ideal. Notably, neither index tells us whether the derived values are significantly higher than what would have been expected under the null hypothesis of data with no underlying clusters (i.e., a continuous distribution).
The Calinski-Harabasz index is also known as the Variance Ratio Criterion. It is the ratio of the sum of between-cluster dispersion to the sum of within-cluster dispersion, across all clusters. The higher the score, the better the clustering performance.
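The CH values in the table below could be computed along the lines of the following sketch, using `fpc::calinhara()` (again with the placeholder objects from the sketches above).

```r
library(fpc)

# CH index for 2- to 5-cluster solutions on the CV1 scores
ch_cv1 <- sapply(2:5, function(k) calinhara(cv1_scores, cutree(hc_cv1, k = k)))
names(ch_cv1) <- paste0(2:5, " clusters")
round(ch_cv1, 4)
```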
CH index, by number of clusters:

| | 2 clusters | 3 clusters | 4 clusters | 5 clusters |
|---|---|---|---|---|
| CV1 | 59.636 | 62.6817 | 68.6381 | 60.3294 |
| CV2 | 57.3633 | 66.8429 | 69.1993 | 63.4672 |
The silhouette index compares average within-cluster distances to average distances between points in different clusters. A higher silhouette coefficient indicates a model with better-defined clusters.
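The silhouette values below could be computed as in the sketch, reusing the distance matrix from the clustering step above.

```r
library(cluster)

# average silhouette width for 2- to 5-cluster solutions on CV1
sil_cv1 <- sapply(2:5, function(k) {
  mean(silhouette(cutree(hc_cv1, k = k), d_cv1)[, "sil_width"])
})
names(sil_cv1) <- paste0(2:5, " clusters")
round(sil_cv1, 4)
```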
Silhouette index, by number of clusters:

| | 2 clusters | 3 clusters | 4 clusters | 5 clusters |
|---|---|---|---|---|
| CV1 | 0.4418 | 0.4106 | 0.3583 | 0.3472 |
| CV2 | 0.4525 | 0.4264 | 0.4255 | 0.3738 |
Visualize clusters
For visualization purposes, we view the participant-wise \(X\) vs \(Y\) scores on the first and second variates (both significant in our model), as above, now coloured according to the 3-cluster solution. It is important to note that there is not high agreement between the cluster assignments derived from CV1 and CV2.
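A sketch of this plot is shown below; `scores_df` and its columns are hypothetical names for a data frame holding the participant-wise CV1 \(X\) and \(Y\) scores and the 3-cluster assignment.

```r
library(ggplot2)

# X vs Y scores on CV1, coloured by the 3-cluster solution
ggplot(scores_df, aes(x = x_cv1, y = y_cv1, colour = factor(cluster3_cv1))) +
  geom_point() +
  labs(x = "CV1 brain (X) score", y = "CV1 behaviour (Y) score", colour = "Cluster") +
  theme_minimal()
```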
Cluster significance test
Now, we test whether the observed CH and silhouette indices for 3 clusters are unlikely to occur under the null hypothesis that no clusters exist in the data. The null hypothesis here is that the data came from a single two-dimensional Gaussian distribution (i.e., a distribution with no underlying clusters). We take random samples, each the size of our dataset, from a Gaussian distribution defined by the same parameters as our data (mean and covariance). We then run, on each random sample, the same hierarchical clustering as performed on the real data, and calculate the best obtained CH and silhouette index (this time, capped at 3 clusters), thus obtaining an empirical null distribution of these indices. The p-value is then defined as the proportion of indices in the null distribution that are at least as large as the value observed in our real data.
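A sketch of this null simulation is given below. It assumes `obs_scores` is the two-column matrix of participant scores that was clustered, and `ch_obs` and `sil_obs` are the observed 3-cluster indices; the number of simulations and the seed are arbitrary, and this is not necessarily the original implementation.

```r
library(MASS)     # mvrnorm
library(fpc)      # calinhara
library(cluster)  # silhouette

set.seed(1)
n_sims   <- 1000
null_ch  <- numeric(n_sims)
null_sil <- numeric(n_sims)

for (i in seq_len(n_sims)) {
  # a single Gaussian with the observed mean and covariance (i.e., no clusters)
  sim <- mvrnorm(nrow(obs_scores), mu = colMeans(obs_scores), Sigma = cov(obs_scores))
  d_sim  <- dist(sim)
  hc_sim <- hclust(d_sim, method = "ward.D")

  # best CH and silhouette over 2- and 3-cluster solutions (capped at 3)
  null_ch[i]  <- max(sapply(2:3, function(k) calinhara(sim, cutree(hc_sim, k))))
  null_sil[i] <- max(sapply(2:3, function(k)
    mean(silhouette(cutree(hc_sim, k), d_sim)[, "sil_width"])))
}

# empirical p-values: proportion of null indices at least as large as observed
p_ch  <- mean(null_ch  >= ch_obs)
p_sil <- mean(null_sil >= sil_obs)
```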
We fail to reject the null hypothesis; that is, we do not find evidence of clusters in our data. However, it may be that the biological axes related to cognition identified via the CCA are important in their own right, without subsequent classification into subtypes, which appears not to provide any additional insight.
| | CH p-value | Silhouette p-value |
|---|---|---|
| CV1 | 0.383 | 0.782 |
| CV2 | 0.216 | 0.142 |
The CH value of our 3-cluster solution is shown for comparative purposes in red, at 62.6817 for CV1 and 66.8429 for CV2.
The silhouette value of our 3-cluster solution is shown for comparative purposes in red, at 0.4106 for CV1 and 0.4264 for CV2.
Post hoc exploration
As stated above, there is no evidence for clusters in our data. For exploratory purposes only, we can examine the characteristics of participants in each cluster of the 3-cluster solution. The (non-significant) clusters do not appear to differ in age or depression severity.
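As a sketch of how this exploration might look (with a hypothetical data frame `df` containing columns `cluster3`, `age`, and `depression_severity`):

```r
library(dplyr)

# per-cluster descriptives
df %>%
  group_by(cluster3) %>%
  summarise(n = n(),
            mean_age = mean(age, na.rm = TRUE),
            mean_depression = mean(depression_severity, na.rm = TRUE))

# simple omnibus tests across the three (non-significant) clusters
kruskal.test(age ~ factor(cluster3), data = df)
kruskal.test(depression_severity ~ factor(cluster3), data = df)
```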