Clustering and cut-off for CCA CV1 X scores

Date last code update: 2021-08-19
Date last ran: 2021-08-22

#clean environment
rm(list = ls())

#load packages
library(cluster)
library(NbClust)
library(MASS)
library(tidyverse)
library(factoextra)
library(ggbeeswarm)
library(cutpointr)
library(gridExtra)
library(caret)
library(kableExtra)
library(ggROC)
library(pROC)


#set up chunk options
knitr::opts_chunk$set(
 fig.width = 6,
 fig.asp = 0.8,
 out.width = "60%",
 echo=FALSE
)

#set seed so doesn't change in permutation for youden
addTaskCallback(function(...) {set.seed(123);TRUE})

## 1 
## 1

In this analysis, we utilize the CCA’s participant-wise \(X\) set scores, from CV1 (only significant variate). Because these scores have been computed separately for the discovery (GE) and validation (PRISMA) samples, and are based on correlations, they can be combined here, across all 308 participants. (Note that while the cluster analysis performs clustering on all participants, the cutpoint analysis separates participants into a train and test set, described below).

CLUSTER ANALYSIS

Step 1: First, we determine what type of clustering is ideal for our data, on the basis of the largest agglomerative coefficient (value closest to 1 is best). We find Ward’s method is the ‘best’ clustering method.

##   average    single  complete      ward 
## 0.9939510 0.9732562 0.9975855 0.9994854

Step 2. Next, we perform clustering using Ward’s D linkage method. Below is the dendrogram of our data (no groups imposed).

Step 3.

Next, we determine the optimal number of clusters in the dendrogram, using the Calinski-Harabasz index, setting an arbitrary maximum of 7 groups. We find that, within this range, the optimal number of clusters is 5 (with the highest CH value of 595.7073).

Step 4.

Next, we determine if a CH index of 595.7073 can occur under the null hypothesis of no clusters embedded in the data, using the method described by Dinga, 2019. We employ 1000 permutations. We derive a p value of 0.1098901, and thus we fail to reject the null hypothesis: our cluster solutions’ properties are not distinguishable from those of continuous data.

Step 5.

Though we now know we cannot say that natural clusters exist in our data – on the basis of the distributional properties of the CV1 X set scores – we consider a two-group cluster solution. We opt to consider a two-group cluster solution because we ‘know’ we have two diagnostic groups in our data, i.e., SSD and HC. This is a bit of a strange analysis, because usually if labels are known, clustering (cf. classification) is not performed. But in keeping with spirit of RDoC, etc., and because evidence of continuous multivariate relationship between WM-cognition, we aren’t yet convinced the labels are ‘good’ here. Thus, we want to want to explore the commonality of the hierarchical clustering-derived label (cluster 1, cluster 2), and diagnostic label (HC, SSD). Put otherwise, the question is, if we impose a two-group cluster solution, will we find that the derived cluster labels match well with known diagnostic labels?

It seems that the two-group solution cluster labels may be related to diagnostic group. In cluster 1, there are 213 participants with 173 SSD, and in cluster 2, there are 95 participants, with 7 SSD. However, the conventionally-applied external criterion validation metrics are not high. First, we calculate the Adjusted Rand Index (ARI), which indicates how similar the clusters are. A value of 1 indicates clusters are identical, 0 indicates they behave as randomly unrelated, and negative values indicate worse than chance. Our adjusted ARI value is 0.479169 (poor recovery). We also calculate cluster purity. Purity is the percent of the total number of observations that were classified correctly. Our purity is 0.8474026. This means the clusters derived from WM-cognition map onto diagnostic labels with some accuracy, but not very good; could be because diagnostic labels not reliable, or because WM-cognition not sufficient information (or likely both).

CUT POINT ANALYSIS

Here, we are asking a different question than above. We are asking, given the data distribution of CV1 \(X\) scores, and ‘known’ diagnostic labels, can we/where can we place a cut point, that best separates HC and SSD? Thus, this analysis does not involve any data-driven clustering (we are no longer using labels derived from hierarchical clustering).

For the cutpoint analysis, we split the sample into training (n=200) and test (n=108) sets, equal to a ~64.94% percent of data in training. We use the rPROC and cutpointr packages, which allows us to determine and evaluate optimal cutpoints in binary classification tasks.

Step 6.

The first thing we do, before attempting to identify a cutpoint, is fit the ROC curve, and then calculate the AUC. We do this on our real data (training set), and also on a permuted distribution (i.e., diagnostic labels shuffled). We then statistically test to see if the real data have a higher AUC compared to random data.

## 
##  Bootstrap test for two ROC curves
## 
## data:  roc_real and roc_permuted
## D = 9.4367, boot.n = 1000, boot.stratified = 1, p-value < 2.2e-16
## alternative hypothesis: true difference in AUC is greater than 0
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.9414062   0.5351663

We find a large, significant difference in the AUC values, with 1000 permutations, p=1.924372510^{-21}. Thus, our observed distribution has a cutpoint that distinguishes between diagnostic groups far better than a null distribution. The confidence intervals for the real distribution were 0.9173639, 0.9414062, 0.9654486 and for the null distribution were 0.4528586, 0.5351663, 0.6174741.

Step 7.

Now that we know our data are more related to diagnostic class than chance, we can go about identifying the cutpoint. There are many different ways to calculate a cutpoint. All give slightly different solutions, visualized below:

[red] Youden (J) index (sensitivity + specificity - 1)
[blue] maximizes accuracy (fraction correctly classified)
[purple] minimizes difference between sensitivity and specificity
[pink] minimizes the product of positive predictive value (PPV) and negative predictive value (NPV)

We calculated cutpoints using a bootstrap of 1000. It has been shown that bagging can substantially improve performance of a wide range of types of models in classification tasks. The optimization is carried out in each one and the mean of the resulting optimal cutpoints is taken as the optimal-optimal point. The advantages of bootstrapping the optimal cutpoint are that the procedure doesn’t possess parameters that have to be tuned, unlike the LOESS smoothing, that it doesn’t rely on assumptions, unlike the Normal method. Furthermore, like random forests, it cannot be overfit by increasing the number of trees, the bootstrapped cutpoints cannot be overfit by running an excessive amount of repetitions. Note that we did not perform smoothing (e.g., LOESS, GAM, spline), which may be useful in small samples.

Step 8.

We decided to compute the cutpoint via the Youden index, which estimates the ideal cutpoint value at -0.237. We chose Youden as the reviewed metrics have similar values, and Youden is most familiar in medicine. Youden treats sensitivity and specificity similarly – might not be desirable, but we think OK for question at hand. Below, we visualize:

1. optimal cutpoint, by diagnostic group
- (note that the optimal cut-point occurs at the intersection of the normal probability density functions of SSD and HC subjects)
1. the ROC curve
- (note that the Youden J-index minimum occurs when sensitivity=1−specificity, i.e., represented by the equal line (the diagonal) in the ROC diagram, and the vertical distance between the equal line and the ROC curve is the J-index for that particular cutoff.)
1. the Youden-index curve (higher is better)

The third visualization, directly above (Youden curve) suggests that other potential cutpoints might have similar J values. We confirmed this. Of the 200 cutpoints in the training dataset, n=22 cutpoints have values, within .05 of this index. This suggests a ‘wide’ window for our cutpoint might be required; a precise cutpoint value is likely not to be accurate around its boundaries.

Step 9.

Next, we visualize how the Youden index splits participants’ observed X1 scores. (This visualization is a bit cleaner than that comparing cutpoint methods above, but it contains the same data.) It appears that this separation is quite good. In any event, we need to evaluate the split statistically.

Step 10.

Here, we evaluate the accuracy of the Youden index cutpoint in separating the HC and SSD groups, within the training test. Typically, accuracy is established via the The Area Under the Curve (AUC), which is the measure of the ability of a classifier to distinguish between classes, and is used as a summary of the ROC curve. An AUC of 0.5 suggests no discrimination, 0.7 to 0.8 is considered acceptable, 0.8 to 0.9 is considered excellent. Our AUC value is 0.9388323. An AUC this high is considered outstanding. This suggests that the multivariate WM-cognition relationship is strongly picked up by DSM diagnosis, which endorses both diagnosis, as well as the relevance of the WM-cognition relationship to it. The AUC value is 0.9388323, with confidence intervals of 0.9077358, 0.9388323, 0.9699287. The model can be summarized as:

## Method: method 
## Predictor: X_CV1 
## Outcome: group 
## Direction: >= 
## 
##     AUC   n n_pos n_neg
##  0.9388 200   117    83
## 
##  optimal_cutpoint youden   acc sensitivity specificity    ppv    npv  tp fn fp
##           -0.2368 0.7544 0.885      0.9231      0.8313 0.8852 0.8846 108  9 14
##  tn
##  69
## 
## Predictor summary: 
##     Data      Min.         5%     1st Qu.      Median        Mean    3rd Qu.
##  Overall -1.992815 -1.4376393 -0.73549668  0.05715452  0.05298668  0.7473492
##       HC -1.992815 -1.6447964 -1.21947594 -0.87246289 -0.79228807 -0.4890552
##      SSD -1.414598 -0.4227561  0.09848468  0.66228032  0.65262602  1.1177835
##        95%      Max.        SD NAs
##  1.6706560 2.5164753 0.9755037   0
##  0.3380836 0.4783015 0.5840327   0
##  1.9069671 2.5164753 0.7193636   0

Step 11.

Here, we evaluate the accuracy of the Youden index cutpoint in separating the HC and SSD groups, within the testing set. It is essential to test classification performance in held-out data. We applied our model to the testing set of n=108. Below, on the left, we visualize the performance of our model, on held=out data. (On the right, we show the training data, as above.) As summarize above, we see that 94 participants were classified correctly, and 14 were classified incorrectly, an accuracy of 87.037037%, an error of 12.962963%. The AUC value was 0.9488536, with r roc_test_confidence.

Step 12.

The model metrics are as follows. Note that the the kappa value is the model accuracy corrected for chance.

##                 predicted.classes
## observed.classes HC SSD
##              HC  38   7
##              SSD  7  56

	value
Accuracy	0.870
Kappa	0.733
AccuracyLower	0.792
AccuracyUpper	0.927
AccuracyNull	0.583
AccuracyPValue	0.000
McnemarPValue	1.000
Sensitivity	0.889
Specificity	0.844
Pos Pred Value	0.889
Neg Pred Value	0.844
Precision	0.889
Recall	0.889
F1	0.889
Prevalence	0.583
Detection Rate	0.519
Detection Prevalence	0.583
Balanced Accuracy	0.867

Step 13.

Finally, we are interested in knowing if our Youden-derived cutpoint on the basis of multivariate WM-cognition relationships is better (higher AUC value) that cutpoints based on any univariate characterizations of WM and cognition. We repeated the bootstrapped cutpoint analysis described above, but replaced \(X\) set CV1 scores which each of the 36 independent features provided to the CCA analysis (i.e., we fit and compare 36 models). We find that the AUC derived from the \(X\) set CV1 scores is much higher than that from any single variable. It is the only model with an ‘oustanding’ AUC. We also find that on average, the WM features have higher AUC cutpoints (mean=0.7373055) than the cognition features (mean=0.5365854). Note that, in keeping with the size of our training test, we completed this analysis in a partial sample of the same 200 participants upon whom the observed cutpoint was established.