07_hierarchicalClustering.utf8

imaging data pulled: 2021-10-12
clinical data pulled: 2020-11-16
code written: 2021-11-22
last ran: 2021-11-23

Description. CCA could be our endpoint, but it is also possible to use its’ data – capturing multivariate correlations between brain and behaviour – as in input to various data-driven modeling techniques. Here, we perform hierarchical clustering on the \(X\) and \(Y\) scores derived from the first canonical variate from the CCA (CV1). This allows us to restrict our clustering effort to lie along axes of brain variability that we know are related to the cognitive symptomatology we are interested in. The derived clusters will therefore be constrained to meaningful information, that we can then review for potential clinical relevance (e.g., identify those with depression, or different clinical/severity profiles). We largely follow the approach taken by Dinga and colleagues in their 2019 NeuroImage paper.

#clear environment
rm(list = ls())

#set global RMarkdown chunk options
knitr::opts_chunk$set(fig.width=9.5, warning=FALSE, message=FALSE) 

#list required libraries
packages<- c(
  'tidyverse',
  'cluster', #agnes fn
  'purrr',
  'stats',
  'NbClust',
  'kableExtra',
  'MASS',
  'ggpubr'
)

#load required libraries
lapply(packages, require, character.only=TRUE)

#read in CCA model
cca <- readRDS('cca.rds')

#make dataframe that contains just X and Y CV1 scores
df_scores <- as.data.frame(cbind(cca$scores$xscores[,1], cca$scores$yscores[,1]))
names(df_scores) <- c('x','y')

#read in demographic data
df_demo <- read.csv(dir('../clinical', full.names=TRUE, pattern="^df_2021")) 

#for clarity, keep only demographic data of interest
df <- merge(df_scores, df_demo[, c('id','Diagnosis','Age','madrs_total_calc','phq9_total','pss_total','gad7_total')], by.x='row.names', by.y='id')
  
#cleanup
rm(df_scores, df_demo)

#set number of permutations
numberPermutations <- 1000

STEP 1

Select hierarchical clustering method

We opted to perform hierarchical clustering, as it does not require us to pre-specify the number of clusters to be generated (as is required by k-means, etc.). However, we do select from available hierarchical clustering solution by using that with the highest agglomerative coefficient, which measures the strength of the identified clustering structure found (values closer to 1 suggest strong clustering structure). We find that hierarchical clustering using Euclidean distance and Ward’s D linkage method, is best for our data, i.e., the first significant variate. This method minimizes the total within-cluster variance.

#methods by which examine agglomerative coefficient (closer to 1: stronger clustering)
methods <- c('average', 'single', 'complete', 'ward')
names(methods) <- c('average', 'single', 'complete', 'ward')

#function to compute agglomerative coefficient 
agglomerativeCoefficient_fn <- function(x){
  cluster::agnes(df, method=x)$ac
}

#apply function to data
purrr::map_dbl(methods, agglomerativeCoefficient_fn) #Ward is best.

##   average    single  complete      ward 
## 0.7141652 0.4415155 0.8221957 0.9164410

STEP 2

Perform hierarchical clustering

Now, using the Euclidean distance measure and Ward’s D linkage method, we perform hierarchical clustering on the \(X\) and \(Y\) set’s CV1 values (participant-wise scores), from the previously found CCA solution. We can visualize hierarchical clustering using a dendrogram, below. Individual observations (i.e., participants) are on the x axis. The y axis represents ‘height of the fusion’, which indicates the (dis)similarity between two observations. The higher the height of the fusion, the less similar the observations are. (Note that conclusions about the proximity of two observations can be drawn only based on the height where branches containing those two observations first are fused: we cannot use the proximity of two observations along the x axis as a criteria of their similarity.)

#create dissimilarity matrix
d <- dist(df[, c('x','y')], method = "euclidean") 

#hierarchical clustering, using Ward's method
hc <- stats::hclust(d, method = "ward.D")

#ensure cluster membership count >= 2
count <- table(cutree(hc, k=3)) 

#make to plots, side by side
par(mfrow=c(1, 2))

#plot the dendrogram (with no clusters drawn)
plot(hc, cex=.6, hang=-1, xlab='NM CV1 value')

#plot the dengrogram (with 3 clusters)
plot(hc, cex=.6, hang=-1)
rect.hclust(hc, k=length(count))

STEP 3

Confirm optimal clusters

Next, we can determine the optimal number of clusters, without relying on visual heuristics (as we did above). Dinga uses two methods: the Calinski-Harabasz (CH) index, and the silhouette index. We insist that all clusters contain a minimum of >=2 participants (when divided into three, clusters 1 through 4 have counts as follows: 18, 19, 11, and higher numbers of divisions show 2 or fewer participants in each cluster).

Note that neither the CH index nor the silhouette index provide a statistical test or statistical evidence of the existence of any number of clusters, only that if clustering is performed (i.e., if 2 or more clusters are derived from the data), then the given cluster solution is ideal. Notably, it does not tell us if the derived indices are significantly higher than what would be been expected under the null hypothesis of data with no underlying clusters (i.e., a continuous distribution).

Below, we see that the two methods, i.e., the Calinski-Harabasz (CH) index and the silhouette index suggest that 3 clusters are preferable to 2. Note that we show the indices for 10 clusters, for visualization only (but do not analyze them, as 4 clusters and above involve too-small numbers of participants).

Calinski-Harabasz index

The Calinski-Harabasz index is also known as the Variance Ratio Criterion. It is the ratio of the sum of between-clusters dispersion and of inter-cluster dispersion for all clusters. The higher the score, the better the performance.

#calculate CH
hcfit_ch <- NbClust::NbClust(df[, c('x','y')], method="ward.D",  min.nc = 2, max.nc = 10, index = "ch")

#print CH values
t(hcfit_ch$All.index) %>%
  kable() %>% kable_styling()

2	3	4	5	6	7	8	9	10
40.6349	46.6769	75.5456	73.9854	73.129	78.3646	79.2193	76.9203	82.2578

#plot CH
plot(names(hcfit_ch$All.index), hcfit_ch$All.index, 
     main = "variance ratio criterion\n (Calinski-Harabasz index)", 
     xlab = "Number of clusters, CV1", ylab="variance ratio criterion", type='b')

Silhouette index

The silhouette index compares average within-cluster distances, to average distances between points from different clusters. A higher silhouette coefficient score relates to a model with better defined clusters.

#calculate silhouette
hcfit_sl <- NbClust(df[,c('x','y')], method="ward.D", min.nc=2, max.nc=10, index = "silhouette")

#print silhouette values
t(hcfit_sl$All.index) %>%
  kable() %>% kable_styling()

2	3	4	5	6	7	8	9	10
0.4638	0.3706	0.4279	0.4174	0.3685	0.3718	0.4062	0.3744	0.3865

#plot silhouette
plot(names(hcfit_sl$All.index), hcfit_sl$All.index, 
     main = "Silhoutte", xlab = "Number of clusters, CV1", ylab="Silhoutte", 
     type='b')

####

STEP 4

Visualize clusters

For visualization purposes, we view participant-wise \(X\) vs \(Y\) scores on the first variate, now with colour in accordance with the 3-cluster solution.

#decide when to cut the tree, based on optimal clusters
df$Clusters <- as.factor(cutree(hc, k=length(count)))

#make ggplot, CV1
ggplot(df, aes(x, y, colour=Clusters)) +
  geom_point(size=3.5, stroke=2, aes(shape=Diagnosis)) + 
  theme_classic() +
  xlab('scores, LC') +
  ylab('scores, cognition') +
  ggtitle('CV1') + 
  theme(text= element_text(size=15))

STEP 5

Cluster significance test

Now, we test whether the observed CH and silhouette indices for 3 clusters are unlikely to occur under the null hypothesis of no clusters extant in the data. The null hypothesis here is that the data came from a single two-dimensional Gaussian distribution (i.e., distribution with no underlying clusters). We take random samples the size of our dataset from a Gaussian distribution, defined by our same model parameters (defined as mean and covariance). Then, we run the same hierarchical clustering as performed on real data on each random sample, and calculate the best obtained CH and silhoutte index (this time, capped at 3 clusters), thus obtaining an empirical null distribution of these indices. The p-value was then defined as a proportion of the calculated indices in the null distribution smaller than what we observed in our real data.

We fail to reject the null hypothesis, i.e., we do not find evidence of clusters in our data. However, it may be that the found biological axes related to cognition (via the CCA) are important on their own merit, without subsequent classification into subtypes, which appear not to provide any more potential for insight.

#first, define function to perform hierarchical clustering and return highest clustering indexes
clusterTest_fn <- function(df, clusters=length(count)){
  hcfit <- NbClust(df[, c('x','y')], method="ward.D", index="ch", min.nc=clusters, max.nc=clusters) #note: min could be 2
  CH_index <- max(hcfit$All.index)
  hcfit <- NbClust(df[, c('x','y')], method="ward.D", index="silhouette", min.nc=clusters, max.nc=clusters)
  sil_index <- max(hcfit$All.index)
  return(c("CH"=CH_index, "Silhouette"=sil_index))
}

#identify distributional characteristics of our real data (just LC data)
sigma <- cov(df[,c('x','y')])
mu <- colMeans(df[,c('x','y')])
real_CI <- clusterTest_fn(df[,c('x','y')])

#repeatedly perform hierarchical clustering on samples from this distribution
#(thus creating an empirical null distribution of clustering indices)
null_CI <- list()
for (i in 1:numberPermutations){
  rand_sample <- MASS::mvrnorm(n=nrow(df), mu=mu, Sigma=sigma)
  null_CI[[i]] <- clusterTest_fn(rand_sample)
}

#turn empirical null distribution list into dataframe
null_CI <- as.data.frame(do.call(rbind, null_CI))

P values

CV1

#calculate p-values CV1
rank_CH <- sum(real_CI[1] < null_CI[,1]) + 1
pval_CH <- rank_CH / (numberPermutations + 1)
rank_sil <- sum(real_CI[2] < null_CI[,2]) + 1
pval_sil <- rank_sil / (numberPermutations + 1)

#put p-values into a nice table
t((c("CH p-value"=pval_CH, "Silhouette p-value"=pval_sil))) %>%
  kable() %>% kable_styling()

CH p-value	Silhouette p-value
0.979021	0.9200799

Visualization of null distribution CH indices

The CH value of our 3-cluster solution is shown for comparative purposes in red, at 46.6769.

plot(null_CI$CH,
     main = "Calinski-Harabasz (CH) index, CV1", xlab = "Permutation", ylab="CH value")
points(numberPermutations/2, hcfit_ch$All.index[2], col = "red", cex = 3, pch=19)

Visualization of null distribution Silhouette indices

The silhouette value of our 3-cluster solution is shown for comparative purposes in red, at 0.3706.

plot(null_CI$Silhouette,
     main = "Silhouette index (CV1)", xlab = "Permutation", ylab="Silhouette value")
points(numberPermutations/2, hcfit_sl$All.index[2], col = "red", cex = 3, pch=19)

STEP 6

Post hoc exploration

As stated above, there is no evidence of clusters in our data. Nonetheless, for exploratory purposes, we can examine the characteristics of participants in each of the clusters in the 3 cluster solution. We do not see any interesting differences, in terms of age, depression severity, or anxiety/stress.

#make a list of all comparisons between clusters
my_comparisons <- list(c('1','2'), c('1','3'), c('2','3'))

#function for boxplot
plot_fn <- function(y){
  max <- max(df[['y']])
  ggboxplot(df, x='Clusters', y=y, color='Clusters', add='jitter', size=1, add.params=list(size=3)) +
    stat_compare_means(method='anova') +
    stat_compare_means(comparisons=my_comparisons) 
}

Age

plot_fn('Age')

PHQ-9

plot_fn('phq9_total')

MADRS

plot_fn('madrs_total_calc')

PSS

plot_fn('pss_total')

GAD

plot_fn('gad7_total')

NM-MRI 07_hierarchicalClustering