imaging data pulled: 2021-10-13
clinical data pulled: 2020-11-16
code written: 2021-10-16
last ran: 2021-11-23

Description. In this script, we perform CCA on the X (LC NM) and Y (cognition) variables of interest, for the n=48 participants included in our analysis. The 6 LC segments are visualized below:


Load libraries and data

#clear environment
rm(list = ls())

#set global RMarkdown chunk options
knitr::opts_chunk$set(fig.width=9.5, warning=FALSE, message=FALSE) 

#detach all libraries -- otherwise, may be conflicts in CCA functions
basic.packages <- c("package:stats","package:graphics","package:grDevices","package:utils","package:datasets","package:methods","package:base")
package.list <- search()[ifelse(unlist(gregexpr("package:",search()))==1,TRUE,FALSE)]
package.list <- setdiff(package.list,basic.packages)
if (length(package.list)>0)  for (package in package.list) detach(package, character.only=TRUE)

#list required libraries
packages<- c(
  'tidyverse',
  'forcats',
  'maditr',
  'kableExtra',
  'GGally', #ggpairs
  'pheatmap', #heatmaps
  'usdm', #multicollinearity
  'RColorBrewer',
  'cowplot',
  'CCA',
  'CCP',
  'psych')

#load required libraries
lapply(packages, require, character.only=TRUE)

#read in cleaned, corrected dataset
df <- read.csv(dir('../clinical', full.names=TRUE, pattern="^dfCorrected_")) #48

#also read in list of participants to exclude, on the basis of medication
exclude <- read.csv('../clinical/participantOmit.txt', sep='', header=FALSE)

#set number of permutations
numberPermutations <- 1000

#set threshold for feature importance
threshold <- .35

#set random seed globally
set.seed(1)

Data cleaning

#exclude only those participants taking a noradrenergic-affecting medication
#df <- df[!df$id %in% exclude$V1[exclude$V2 == "depression_med"],] #42

#also exclude all participants with beta-blocker, or antipsychotic, etc
#df <- df[!df$id %in% exclude$V1,] #39

#cleanup
rm(exclude)

#scale all numeric variables
df <- df %>% mutate_at(vars(-id,-Diagnosis,-Age,-Sex), scale) %>% as.list %>% as.data.frame

#move id column to row.names
row.names(df) <- df$id

#keep demographic variables aside
df_demo <- df[, c('id','Diagnosis','Age','Sex')]

#remove demographic variables, not include in CCA
df <- subset(df, select=-c(id,Diagnosis,Age,Sex))

#rename all variables (safenames) -- better for readability
df <- df %>% rename( 
`Segment1_left.rostral` =  avg_max_seg_1_cor,  
`Segment2_left.middle` =  avg_max_seg_2_cor,    
`Segment3_left.caudal` =   avg_max_seg_3_cor,   
`Segment4_right.rostral` =  avg_max_seg_4_cor, 
`Segment5_right.middle` =  avg_max_seg_5_cor,    
`Segment6_right.caudal` =  avg_max_seg_6_cor, 
`RBANS_immediate.memory` = rbans_immmemory_index_cor, 
`RBANS_visuospatial` = rbans_visuo_index_cor,     
`RBANS_language` = rbans_language_index_cor, 
`RBANS_attention` = rbans_attention_index_cor, 
`RBANS_delayed.memory` = rbans_delmem_index_cor,    
`DKEFS_CWIT.colour.naming` = dkefs_cwi_1_time_normcor,     
`DKEFS_CWIT.word.reading` = dkefs_cwi_2_time_normcor,      
`DKEFS_CWIT.inhibition` = dkefs_cwi_3_time_normcor,      
`DKEFS_CWIT.inhibition.switching` = dkefs_cwi_4_time_normcor,     
`DKEFS_TMT.A` = dkefs_trails4_time_cor,    
`DKEFS_TMT.B` = dkefs_trails5_time_cor,
`DKEFS_verbal.fluency_letters` = dkefs_vf_lftotalcorrect_cor,
`DKEFS_verbal.fluency_categories` = dkefs_vf_cftotalcorrect_cor,
`DKEFS_verbal.fluency_category.switching` = dkefs_vf_csswitchtotcor_cor) 

#reorganize variables, for clarity, nice plots, etc
df <- df[, c(
      "Segment1_left.rostral",                  
      "Segment2_left.middle",                    
      "Segment3_left.caudal",                    
      "Segment4_right.rostral",                 
      "Segment5_right.middle",                   
      "Segment6_right.caudal",  
      "RBANS_immediate.memory",                  
      "RBANS_visuospatial",                     
      "RBANS_language",                          
      "RBANS_attention",                         
      "RBANS_delayed.memory",                   
      "DKEFS_TMT.A",                             
      "DKEFS_TMT.B",     
      "DKEFS_CWIT.colour.naming",              
      "DKEFS_CWIT.word.reading",                 
      "DKEFS_CWIT.inhibition",                   
      "DKEFS_CWIT.inhibition.switching",
      "DKEFS_verbal.fluency_letters",          
      "DKEFS_verbal.fluency_categories",         
      "DKEFS_verbal.fluency_category.switching"
)]

#make separate dfs by diagnosis, for select analyses
df_HC <- df[row.names(df) %in% df_demo$id[df_demo$Diagnosis == 'HC'],] #23
df_LLD <- df[row.names(df) %in% df_demo$id[df_demo$Diagnosis == 'LLD'],] #25

#make a vector of X and Y set variables
vars_nm <- names(df[, grep('Segment', names(df))])
vars_cog <- names(df[! names(df) %in% vars_nm])

#make separate X and Y sets
X <- df[, grep(paste(vars_nm,collapse="|"), names(df), value=TRUE)] #6
Y <- df[, grep(paste(vars_cog,collapse="|"), names(df), value=TRUE)] #14

#write out df
write.csv(df, paste0('../clinical/dfCorrectedScaled_', Sys.Date(), '.csv', sep=''), row.names = F) 

Define functions

########################################
#Preprocessing
########################################
heatmap_fn <- function(df, title){
  heatmap <- heatmap(cov(df), main=title, scale="column", Rowv=NA, Colv=NA, cexRow=.7, cexCol=.7) 
  legend(x='bottomleft', legend=c("min", "ave", "max"), fill=colorRampPalette(brewer.pal(8, "Oranges"))(3))
}

########################################
#CCA test statistics
########################################

#function to pull out important model values
modelCalculation_fn <- function(cca, X, Y){
  
  #calculate correlations
  rxx <- cor(X, use = "p") #pairwise complete observations
  ryy <- cor(Y, use = "p")
  rxy <- cor(X,Y, use = "p")
  
  #calculate omega
  omega = t(solve(chol(rxx))) %*% rxy %*% solve(chol(ryy)) 
  
  #decompose matrix
  usv <- svd(omega)
  
  #standardized weights (beta)
  x.weights.std <- solve(chol(rxx)) %*% usv$u
  y.weights.std <- solve(chol(ryy)) %*% usv$v
  
  #structure coefficients
  x.structures <- rxx %*% x.weights.std
  y.structures <- ryy %*% y.weights.std
  
  #canonical cross-loadings/redundancy
  x.redun <- as.matrix(X) %*% x.weights.std
  y.redun <- as.matrix(Y) %*% y.weights.std
  
  return(list(rxx=rxx, ryy=ryy, rxy=rxy, omega=omega, usv=usv, 
              x.weights.std=x.weights.std, y.weights.std=y.weights.std, 
              x.structures=x.structures, y.structures=y.structures, 
              x.redun=x.redun, y.redun=y.redun))
}

#######Parametric

#function to calculate parametric F distributions (4 multivariate tests)
sig_fn <- function(tstat){
  result <- CCP::p.asym(rho, n, p, q, tstat)
  return(result)
}

#function for pulling out desired statistical values from parametric F distribution
sigPrint_fn <- function(model, statistic){
  result <- round(model[[statistic]][1],3)
  return(result)
}

#######Permuted

#function for permuted distribution -- shuffle rows in Y
permute_fn <- function(X, Y){
  Y_shuffle <- Y[sample(nrow(Y)),] #shuffle rows in Y set
  cca <- cc(X, Y_shuffle)
  correlations <- cca$cor #compute CCA, take first correlation value
  return(correlations) #add values to a dataframe
}

#function to manipulate permuted null dataframes
permutedNull_fn <- function(null){
  df_null <- data.frame(matrix(NA, ncol=min(ncol(X), ncol(Y)), nrow=numberPermutations))
  for (i in 1:min(ncol(X), ncol(Y))){
    df_null[i] <- unlist(purrr::map(null, i))
  }
  names(df_null) <- paste('permRc_var', 1:min(ncol(X), ncol(Y)), sep='')
  df_null$permuation <- seq(1:numberPermutations) #add variable
  return(df_null)
}

#function to plot null distribution histogram
nullPlot_fn <- function(df_null, variate, title){
 df_null <- df_null[variate]
 ggplot(df_null, aes_string(x=names(df_null)[1])) +
   geom_histogram(color='black', fill='white', size=.75) +
   xlim(.5, .9) +
   ylim(0, 120) +
   ggtitle(title) +
   theme_classic() +
   theme(axis.ticks=element_blank(), 
         text= element_text(size=20),
         panel.grid.minor = element_blank(), 
         panel.grid.major.x = element_blank()) + 
     geom_vline(xintercept = cca$cor[variate], linetype="dashed", color="black", size=2) 
}

#function to compute significance, via traditional multivariate tests
permutedStats_fn <- function(X, Y, nboot=numberPermutations, rhostart=1){
  wilks <- CCP::p.perm(X, Y, nboot=nboot, rhostart=rhostart, type='Wilks')
  hotelling <- CCP::p.perm(X, Y, nboot=nboot, rhostart=rhostart, type='Hotelling')
  pillai <- CCP::p.perm(X, Y, nboot=nboot, rhostart=rhostart, type='Pillai')
  roy <- CCP::p.perm(X, Y, nboot=nboot, rhostart=rhostart, type='Roy') 
  return(list(wilksPermuted=wilks, hotellingPermuted=hotelling, pillaiPermuted=pillai, royPermuted=roy))
}

########################################
#Participant scores
########################################

#function to pull together participant scores for all analyzed variates
scores_fn <- function(variatesToAnalyze){
  x_scores <- as.data.frame(cca$scores$xscores[,variatesToAnalyze]) #get data
  names(x_scores) <- paste('X_variate', seq_along(variatesToAnalyze), sep='_') #add names
  y_scores <- as.data.frame(cca$scores$yscores[,variatesToAnalyze]) #get data
  names(y_scores) <- paste('Y_variate', seq_along(variatesToAnalyze), sep='_') #add names
  scores_plot <- merge(x_scores, y_scores, by='row.names') #merge
  scores_plot <- merge(scores_plot, df_demo[, c('id', 'Diagnosis')], by.x='Row.names', by.y='id') #add in diagnosis
  return(scores_plot)
}

#function to reshape score data for plot
scoresReshape_fn <- function(df_scores){
  df_scores <- df_scores %>% pivot_longer(cols= -c(Row.names, Diagnosis))
  df_scores$variable_set <- substring(df_scores$name, 1, 1)
  df_scores$variate <- substring(df_scores$name, 3)
  df_scores <- maditr::dcast(df_scores, Row.names + Diagnosis + variate ~ variable_set, value.var='value')
  return(df_scores)
}

#function to print scores plot
scoresPlot_fn <- function(scores){
  ggplot(scores, aes(X, Y)) +
    geom_point(size=3.5, stroke=2, aes(shape=scores[['Diagnosis']])) +
    geom_smooth(method=lm, se=FALSE, color='black') +
    theme_classic() +
    theme(text= element_text(size=20),
        axis.ticks = element_blank(), 
        legend.title = element_blank()) + 
    xlab('scores, LC') +
    ylab('scores, cognition') +
    xlim(-3.5, 3.5) +
    ylim(-3.5, 3.5) 
}

########################################
#Feature importance
########################################

#function to summarize model information
modelSummary_fn <- function(list_model, variate){
  
  #pull out standardized weights
  x.weights.std <- t(data.frame(lapply(list_model$x.weights.std[,variate], type.convert), stringsAsFactors = F))
  y.weights.std <- t(data.frame(lapply(list_model$y.weights.std[,variate], type.convert), stringsAsFactors = F))
  weights <- rbind(x.weights.std, y.weights.std) #bind together standardized weights
  
  #pull out structure coefficients
  x.structures <- t(data.frame(lapply(list_model$x.structures[,variate], type.convert), stringsAsFactors = F))
  y.structures <- t(data.frame(lapply(list_model$y.structures[,variate], type.convert), stringsAsFactors = F))
  structures <- rbind(x.structures, y.structures) #bind together structures
  
  #calculate communality coefficients
  communalities <- structures ^ 2
  
  #make table
  tbl <- as.data.frame(cbind(weights, structures, communalities)) #for specified variate
  names(tbl) <- c('weights', 'structures', 'communalities') #rename variables in data, for clarity
  tbl <- tibble::rownames_to_column(tbl, "variable") #add rownames as a variable
  
  #round all numeric variables 
  tbl <- tbl %>% mutate_if(is.numeric, round, digits=3)
  
  #sort the variables in the table alphabetically
  tbl <- tbl[order(tbl$variable),]
  
  #return
  return(tbl)
}

#function for model summary table, for each variate
structureTbl_fn <- function(df_model, variate=1, threshold){
  
  #create dynamic names for table header
  label <- paste("variate", variate, "$R_c$=", round(cca$cor[variate], 5))
  myHeader <- c(" " = 1, label = 3) # create vector with colspan
  names(myHeader) <- c(" ", label) #set names
  
  #make formatted table
  df_model %>%
    mutate(structures = ifelse((structures > threshold | structures < -threshold),
                               cell_spec(structures, color='black', underline = T, bold = T), #no green
                               cell_spec(structures, color='black'))) %>%
    kable(escape=F,
          align = c('l', 'c', 'c', 'c'),
          col.names = c('_Variable_', '$\\beta$', '$r_s$', '$r_s^2$ ($h^2$)')) %>%
    kable_styling(bootstrap_options=c('striped', 'hover', 'condensed')) %>%
    add_header_above(myHeader) %>%
    pack_rows('X', 1, p) %>%
    pack_rows('Y', p+1, p+q) %>%
    footnote(general = "following convention, $r_s$ values >.40 are emphasized;  
         $\\beta$ = standardized canonical function coefficient;  
         $r_s$ = structure coefficient;  
         $r_s^2$ = squared structure coefficient, here also communality $h^2$", escape=T)
}

#function to visualize the structure coefficients
featurePlot_fn <- function(df_model){
ggplot(df_model, aes(forcats::fct_rev(variable), structures)) + 
    geom_bar(stat='identity', fill='white', color='black', size=1, width=1) + 
    theme_classic() +
    theme(axis.ticks=element_blank(), 
          panel.grid.minor = element_blank(), 
          panel.grid.major.x = element_blank(),
          legend.position = "none",
          text = element_text(colour='black'),
          axis.text = element_text(colour = 'black', size=20)) + 
    xlab('') + 
    ylab('') + 
    geom_hline(yintercept = threshold, linetype="dashed", color = "black", size=1.5) +
    geom_hline(yintercept = -threshold, linetype="dashed", color = "black", size=1.5) + 
    coord_flip() 
}

#function to reorder the variables in the table/plot, for nice presentation
 reorder_fn <- function(df){
 df[,'variable']  <- factor(df[,'variable'],
     levels=c(
       "Segment1_left.rostral",                  
       "Segment2_left.middle",                    
       "Segment3_left.caudal",                    
       "Segment4_right.rostral",                 
       "Segment5_right.middle",                   
       "Segment6_right.caudal",  
       "RBANS_immediate.memory",                  
       "RBANS_visuospatial",                     
       "RBANS_language",                          
       "RBANS_attention",                         
       "RBANS_delayed.memory",                   
       "DKEFS_TMT.A",                             
       "DKEFS_TMT.B",     
       "DKEFS_CWIT.colour.naming",              
       "DKEFS_CWIT.word.reading",                 
       "DKEFS_CWIT.inhibition",                   
       "DKEFS_CWIT.inhibition.switching",
       "DKEFS_verbal.fluency_letters",          
       "DKEFS_verbal.fluency_categories",         
       "DKEFS_verbal.fluency_category.switching"
     ))
 return(df)
 }
 
########################################
#Stability analysis
########################################
 
#######By feature
 
#function to leave out a feature (column), and calculate canonical correlation (stability)
leaveOneFeatureOut_fn <- function(X, Y, list_model){ #takes X and Y dataframe or matrix
  
  #use base combn function to leave out a column ...
  f <- function(x) combn(ncol(x), ncol(x) - 1) 
  
  #run on both variable sets
  X_inx <- f(X) #from X
  Y_inx <- f(Y) #from Y
  
  #extract the canonical correlations
  ccX <- as.data.frame(t(apply(X_inx, 2, function(i) cc(X[, i], Y)$cor))) #correlation between dropped X and Y
  ccY <- as.data.frame(t(apply(Y_inx, 2, function(i) cc(X, Y[, i])$cor))) #correlation between dropped Y and X
  
  #put into a dataframe
  values <- plyr::rbind.fill(ccX, ccY)
  
  #add the original observations as the last row
  values <- rbind(values, list_model$usv$d)
  
  #add a column indicating that the last row is original -- used to colour the plot
  values$original <- c(rep(FALSE, ncol(X)), rep(FALSE, ncol(Y)), TRUE)
  
  #melt for ggplot
  values <- reshape2::melt(values)
  
  #return df
  return(values)
  
}
 
#function to visualize feature stability
stabilityFeaturePlot_fn <- function(df_stabilityFeature){
  ggplot(df_stabilityFeature, aes(x=variable, y=value)) +
    geom_boxplot(size=1, width=1, outlier.shape = NA) +
    geom_violin(fill='grey', alpha=.5, size=1, width=1, color='grey') +
    geom_jitter(aes(colour = original, size= original), shape=21, stroke=1.5) +
    theme_classic() +
    xlab('variate') +
    ylab('canonical correlation value') +
    theme(axis.ticks = element_blank(),
          text= element_text(size=20, color='black')) +
    scale_color_manual(values = c("FALSE" = "darkgrey", "TRUE" = "black")) +
    scale_size_manual(values = c("FALSE" = 2.5, "TRUE" = 5)) +
    guides(color = FALSE, size = FALSE)
}

#######By participant

#function to leave out a participant (row), and calculate canonical correlation (stability)
leaveOneParticipantOut_fn <- function(X,Y){
  df_stabilityParticipant <- data.frame(matrix(ncol=nrow(X), nrow=length(X)+length(Y)))
  
  for(i in 1:nrow(X)){
  
   #calculate correlations
    rxx <- cor(X[-i,], use = "p") #pairwise complete observations
    ryy <- cor(Y[-i,], use = "p")
    rxy <- cor(X[-i,], Y[-i,], use = "p")
  
    #calculate omega
    omega = t(solve(chol(rxx))) %*% rxy %*% solve(chol(ryy)) 
  
    #decompose matrix
    usv <- svd(omega)
  
    #standardized weights (beta)
    x.weights.std <- solve(chol(rxx)) %*% usv$u
    y.weights.std <- solve(chol(ryy)) %*% usv$v
  
    #structure coefficients 
    x.structures <- as.data.frame(rxx %*% x.weights.std)
    y.structures <- as.data.frame(ryy %*% y.weights.std)
    
    #put in dataframe
    values <- plyr::rbind.fill(x.structures, y.structures)
    
    #take only first column, i.e., CVI
    values <- values[, 1, drop=FALSE] #keep as a dataframe
    
    #write iteration into big dataframe
    df_stabilityParticipant[i] <- values
    
  }
    
  #label our dataframe, so really clear what's what
  names(df_stabilityParticipant) <- paste('omit', seq(1:ncol(df_stabilityParticipant)), sep='_')

  #add variables names, as a variable
  df_stabilityParticipant$variable <- c(names(X), names(Y))

  #melt for ggplot
  df_stabilityParticipant <- df_stabilityParticipant %>%
    pivot_longer(!variable, names_to='participant', values_to='value')
  
  #make sure variable is a factor
  df_stabilityParticipant$variable <- as.factor(df_stabilityParticipant$variable)
  
  #like Dinga, turn loadings into absolute values
  df_stabilityParticipant$value <- abs(df_stabilityParticipant$value)
  
  #return
  return(as.data.frame(df_stabilityParticipant))
}

#function to visualize participant stability
stabilityParticipantPlot_fn <- function(df_stabilityParticipant){
  df_stabilityParticipant <- reorder_fn(df_stabilityParticipant)
  ggplot(df_stabilityParticipant, aes(forcats::fct_rev(variable), value)) + 
  geom_boxplot(size=1, outlier.size = 2, outlier.shape=21, outlier.stroke=1.5, width=1) +
  theme_classic() +
  theme(axis.ticks = element_blank(),
        axis.text = element_text(color='black'),
        text= element_text(size=20, color='black')) +
  xlab('') +
  ylab('') +
  coord_flip() 
}

Preprocessing

Correlations

Below, we summarise correlations within the \(X\) and \(Y\) sets, respectively, and between the \(X\) and \(Y\) sets (the cross correlation matrix). We see that variables within sets are more highly correlated than between sets. We also see that the \(X\) (brain) variables are more highly correlated than the \(Y\) (behaviour) variables.

Within X set
ggpairs(X)

Within Y set
ggpairs(Y)

Between X and Y sets
XY_cor <- matcor(X, Y)$XYcor 
pheatmap(XY_cor, angle_col=45)

The mean correlation is 0.0241929. The range of correlations is -0.614526, 0.8220572.


Multicollinearity

Multicollinearity indicates a redundancy between variables, rendering any solution unstable and difficult to interpret. If the pair-wise correlation coefficient between two variables is high (say >0.8) then it is possible that multicollinearity exists; however, this it is not a sufficient nor necessary condition for the detection of multicollinearity, as a linear relation in CCA involves many variables. One common approach to identify collinearity among explanatory variables is the use of variance inflation factors (VIF). The VIF is an indicator of how much the variance of the coefficient is inflated compared to the non-significant correlations with any other variables in the model. The higher the VIF value, the higher the collinearity. Typically, VIF = 1 indicates no correlation; > 1 < 5 indicates moderate correlation, >5 high correlation. We see evidence of moderate multicollinearity in some variables in both the X and Y sets.

X set
usdm::vif(X)
##                Variables      VIF
## 1  Segment1_left.rostral 1.942288
## 2   Segment2_left.middle 2.176708
## 3   Segment3_left.caudal 2.000835
## 4 Segment4_right.rostral 2.578963
## 5  Segment5_right.middle 2.212560
## 6  Segment6_right.caudal 1.739271
Y set
usdm::vif(Y)
##                                  Variables      VIF
## 1                   RBANS_immediate.memory 2.437143
## 2                       RBANS_visuospatial 1.664639
## 3                           RBANS_language 2.734084
## 4                          RBANS_attention 2.777030
## 5                     RBANS_delayed.memory 1.898539
## 6                              DKEFS_TMT.A 2.126379
## 7                              DKEFS_TMT.B 1.579855
## 8                 DKEFS_CWIT.colour.naming 4.757744
## 9                  DKEFS_CWIT.word.reading 4.363634
## 10                   DKEFS_CWIT.inhibition 3.716795
## 11         DKEFS_CWIT.inhibition.switching 3.570581
## 12            DKEFS_verbal.fluency_letters 2.174128
## 13         DKEFS_verbal.fluency_categories 1.951267
## 14 DKEFS_verbal.fluency_category.switching 1.988091


Homogeneity

We can use Box’s M-test to test the null hypothesis that the LLD and HC covariance matrices are equal. We find that the covariance matrices are indeed equal; i.e., the observed covariance matrices of the variable sets are equal across the LLD and HC groups. We can also visually review the covariance matrices, upon which Box’s M-test is based. The matrices display how much two random variables vary together (as opposed to correlation, which tells how a change in one variable corresponds to a change in another). In our case, the covariance matrices are useful to compare between LLD and HC groups.

Box’s M test
#box's m-test
box <- rstatix::box_m(df, df_demo$Diagnosis) 

#pretty table
box %>% kable() %>% kable_styling()
statistic p.value parameter method
224.8796 0.2290651 210 Box’s M-test for Homogeneity of Covariance Matrices
LLD
heatmap_fn(df_LLD, 'LLD')

HC
heatmap_fn(df_HC, 'HC')


CCA

We compute CCA on k=20 brain and behaviour variables of interest. There are 6 variables in our brain (\(X\)) set, representing the hemispheric rostral-caudal LC segment scores. There are 14 variables in our behaviour (\(Y\)) set: factor scores from the RBANS and DKEFS. As before, the LC and cognition variables have been corrected for age and sex, and the DKEFS RT variables have additionally be adjusted for normalcy. Note that the DKEFS RT variables have not been reverse scored (so, a lower score is “better”). All variables have been scaled to mean 0, and unit variance.

Canonical correlations

#compute CCA
cca <- cc(X, Y)

#save the CCA model
saveRDS(cca, "cca.rds")

#calculate other model information
list_model <- modelCalculation_fn(cca, X, Y)

#review canonical correlations
rbind(paste('variate', seq(1, 6)), round(cca$cor, 3)) %>%
  kable(caption='Canonical correlation ($R_c$) per variate', escape=FALSE) %>% 
  kable_styling() 
Canonical correlation (\(R_c\)) per variate
variate 1 variate 2 variate 3 variate 4 variate 5 variate 6
0.853 0.687 0.629 0.549 0.467 0.373

Parametric test

We used the CCP package to calculate significance of the CCA, via comparison to the theoretical F distribution. There are 4 primary tests of significance in multivariate models (Wilks, Hotelling, Pillai, and Roy). Test results often vary slightly.

#define variables, for significance calculation
rho <- cca$cor
n <- nrow(df)
p <- ncol(X)
q <- ncol(Y)
Wilks
model_wilks <- (sig_fn('Wilks'))
## Wilks' Lambda, using F-approximation (Rao's F):
##                stat    approx df1       df2    p.value
## 1 to 6:  0.04076633 1.4997546  84 162.43988 0.01433982
## 2 to 6:  0.15004345 1.0712794  65 140.99227 0.36265789
## 3 to 6:  0.28401410 0.9468785  48 117.60182 0.57524060
## 4 to 6:  0.46998349 0.8147077  33  92.03571 0.74343113
## 5 to 6:  0.67293907 0.7008757  20  64.00000 0.81054178
## 6 to 6:  0.86061116 0.5938715   9  33.00000 0.79257412
Hotelling
model_hotelling <- (sig_fn('Hotelling'))
##  Hotelling-Lawley Trace, using F-approximation:
##               stat    approx df1 df2     p.value
## 1 to 6:  5.1009261 1.5990998  84 158 0.005843257
## 2 to 6:  2.4203530 1.0550257  65 170 0.385564726
## 3 to 6:  1.5274740 0.9652787  48 182 0.543170519
## 4 to 6:  0.8726847 0.8550547  33 194 0.695773346
## 5 to 6:  0.4408492 0.7567911  20 206 0.762836780
## 6 to 6:  0.1619649 0.6538585   9 218 0.749949497
Pillai
model_pillai <- (sig_fn('Pillai'))
##  Pillai-Bartlett Trace, using F-approximation:
##               stat    approx df1 df2    p.value
## 1 to 6:  2.2547539 1.4190728  84 198 0.02480977
## 2 to 6:  1.5264507 1.1023931  65 210 0.30012221
## 3 to 6:  1.0547465 0.9864413  48 222 0.50477621
## 4 to 6:  0.6590530 0.8749918  33 234 0.66705241
## 5 to 6:  0.3574572 0.7792097  20 246 0.73769373
## 6 to 6:  0.1393888 0.6818083   9 258 0.72519551
Roy
model_roy <- (sig_fn('Roy'))
##  Roy's Largest Root, using F-approximation:
##               stat   approx df1 df2      p.value
## 1 to 1:  0.7283032 6.318494  14  33 6.810228e-06
## 
##  F statistic for Roy's Greatest Root is an upper bound.

Evaluation of the full model. In the present analysis, we see that 4 of the 4 multivariate tests find significance: Wilks=0.014, Hotelling=0.006, Pillai=0.025, Roy’s largest root=0. Wilk’s lambda is the oft-most reported statistic: here, Wilks’ \(\lambda\)=0.041, F(84, 162.44)=1.5, p=0.014, with an effect size (calculated as 1-lambda) of 95.9% variance explained. (The effect statistic indicates the proportion of variance shared between the \(X\) and \(Y\) sets across all variates.)

Evaluation of variates within the full model. Further examination of the variates within the full model reveal that only the first canonical dimensions are significant. The first canonical variate showed an effect size (calculated as \(R_c^2\)) of 72.83% variance explained.


Permutation test

Statistics.

#get stats for permuted sample
permuted <- permutedStats_fn(X, Y)

#pull out important results
tbl_permuted <- data.frame(matrix(ncol = 4, nrow = 1))
colnames(tbl_permuted) <- c('Wilks', 'Hotelling', 'Pillai', 'Roy')
tbl_permuted[1,1] <- permuted$wilksPermuted$p.value
tbl_permuted[1,2] <- permuted$hotellingPermuted$p.value
tbl_permuted[1,3] <- permuted$pillaiPermuted$p.value
tbl_permuted[1,4] <- permuted$royPermuted$p.value
tbl_permuted %>% kable() %>% kable_styling() 
Wilks Hotelling Pillai Roy
0.014 0.018 0.021 0.009

In addition to/instead of the objective statistical test above, we should examine results to a permuted sample, as we don’t know the population distribution in question. Permutation testing builds - rather than assumes - the sampling distribution by resampling the observed data. We permuted the observe distribution, by shuffling rows of Y set so they no longer correspond to X features. We did this 1000 times, without replacement. Using the CPP package, we can calculate the 4 multivariate statistics. The calculation of significance for all 4 multivariate statistics show that the observed data has “stronger statistic” than the permuted distribution data. For example, using the Wilks’ statistics, we see that only 14 of the 1000 have a lower Wilks value than the real distribution (lower Wilks values are better), meaning that the observed model can be interpreted, p=0.014. (Also note: this value is virtually identical to the parametric distribution.)

We can also review the significance of variates within the CCA model using the same method. Below, we compare shuffling rows of Y set so they no longer correspond to X features (again, 1000 times), and compare the canonical correlation values of each variate obtained for the null distribution to the observed distribution. This method suggests that the entire model is significant, and nested significance testing within the entire model suggests that only the first variate is significant.

#run the permutation function, to get a null distribution (for all variates)
null <- lapply(seq_len(numberPermutations), function(x) permute_fn(X, Y))

#run function to manipulate null dataframes for ggplot
df_null <- permutedNull_fn(null)

#run function to make plots
plot_null_1 <- nullPlot_fn(df_null, 1, 'Variate 1')
plot_null_2 <- nullPlot_fn(df_null, 2, 'Variate 2')

#write out plots
ggsave('../paperFigures/nullDistribution_CV1.png', plot=plot_null_1, height=3, width=4)
ggsave('../paperFigures/nullDistribution_CV2.png', plot=plot_null_2, height=3, width=4)

#visualize
plot_grid(plot_null_1, plot_null_2)

Participant scores

#run function to pull out participant scores on all variates of interest
scores_CV1 <- scores_fn(1)

#reshape data for plot
scores_CV1 <- scoresReshape_fn(scores_CV1) 

#pull out correlations for LLD and HC
scoreCor_CV1 <- scores_CV1 %>%
  group_by(Diagnosis) %>%
  dplyr::summarize(cor(X, Y))

#store correlation values
scoreCor_CV1_HC <- scoreCor_CV1$`cor(X, Y)`[1]
scoreCor_CV1_SSD <- scoreCor_CV1$`cor(X, Y)`[2]

#t-test to compare correlations between groups
scores_p_CV1 <- psych::paired.r(scoreCor_CV1_HC, scoreCor_CV1_SSD, yz=NULL, n=length(X))

Because our full model met significance, we proceed to analyze our model structure, i.e., the structure of the first variate. Here, we show the association between scores in the \(X\) set and \(Y\) set, for each participant, for the first variate. Scores are, for each participant, the variable ‘coordinates’ on the given canonical variate. We see that LLD and HC participants show similar correspondence between scores. Specifically, the correlation for the HC group is 0.817, and the correlation for the LLD group is 0.892, which a t-test shows is an insignificant test of the difference between two independent correlations, with probability p = 0.727. I believe that, in addition to our finding of homogeneity of variance between groups, this provides additional justification for combining LLD and HC groups.

scoresPlot_fn(scores_CV1); ggsave('../paperFigures/scoresPlot_CV1.png', height=3, width=5)


Feature importance

The table below shows standardized canonical function coefficients, structure coefficients, and squared structure coefficients, for the two significant canonical variates.

Standardized canonical function coefficients (\(\beta\)). The \(\beta\) values reflect the relative contribution of one predictor to the criterion given the contribution of other predictors. These values are analogous to beta weights in multiple regression; however, they are less relevant here, especially given multicollinearity.

Structure coefficients (\(r_s\)). The \(r_s\) values are of most interest here. Structure coefficients show the bivariate correlation between an observed variable and the canonical variate scores for the variable’s set (i.e., the synthetic variable created from all the predictor/criterion variables via the linear equation). We can interpret the structure coefficients like regression coefficients, i.e., a one unit increase in the given variable would result in an increase/decrease of the \(r_s\) value of the given canonical variate for the given variable’s set, when the other variables are held constant. Thus, they inform interpretation by helping to define the structure of the synthetic variable, that is, what observed variables can be useful in creating the synthetic variable and therefore may be useful in the model.

Squared structure coefficients / ‘communalities’ (\(r^2_s\)). The \(r^2_s\) values represent the percentage of shared variance between the observed variable and the synthetic variable created from the observed variable’s set. Because we here consider only one variate, the \(r^2_s\) values are identical to communality coefficients (h\(^2\)), which represent the amount of variance in the observed variable that was reproducible across all functions. The communalities are analogous to communality coefficients in factor analysis, and can be viewed as an indication of how useful the variable was for the solution.

#summarize weights, structures, and communalities, for CV1
df_modelVar1 <- modelSummary_fn(list_model, variate=1)

#organize levels of "variable" variable, so nice table, plot
df_modelVar1 <- reorder_fn(df_modelVar1)
df_modelVar1 <- df_modelVar1[order(as.numeric(rownames(df_modelVar1))),]

#put in nice table
structureTbl_fn(df_modelVar1, variate=1, threshold)
variate 1 \(R_c\)= 0.85341
Variable \(\beta\) \(r_s\) \(r_s^2\) (\(h^2\))
X
Segment1_left.rostral -0.881 -0.314 0.099
Segment2_left.middle -0.222 -0.139 0.019
Segment3_left.caudal -0.529 -0.478 0.228
Segment4_right.rostral 0.882 0.366 0.134
Segment5_right.middle 0.329 0.272 0.074
Segment6_right.caudal 0.202 0.137 0.019
Y
RBANS_immediate.memory -0.365 0.067 0.004
RBANS_visuospatial -0.213 0.288 0.083
RBANS_language 0.317 0.291 0.084
RBANS_attention 0.493 0.355 0.126
RBANS_delayed.memory 0.474 0.399 0.159
DKEFS_TMT.A -0.383 -0.371 0.138
DKEFS_TMT.B 0.219 0.105 0.011
DKEFS_CWIT.colour.naming 0.163 0.187 0.035
DKEFS_CWIT.word.reading 0.242 0.2 0.040
DKEFS_CWIT.inhibition -0.006 -0.239 0.057
DKEFS_CWIT.inhibition.switching -0.454 -0.344 0.118
DKEFS_verbal.fluency_letters -0.461 -0.395 0.156
DKEFS_verbal.fluency_categories -0.237 -0.235 0.055
DKEFS_verbal.fluency_category.switching -0.161 0.06 0.004
Note:
following convention, \(r_s\) values >.40 are emphasized;
\(\beta\) = standardized canonical function coefficient;
\(r_s\) = structure coefficient;
\(r_s^2\) = squared structure coefficient, here also communality \(h^2\)

This is the same data as in the table above, but perhaps more clearly presented in a visualization. As above, we see that the relationship reflected in the first variate is driven by Segment3_left.caudal, Segment4_right.rostral in the \(X\) set, and RBANS_attention, RBANS_delayed.memory, DKEFS_TMT.A, DKEFS_verbal.fluency_letters in the \(Y\) set. Several other variables come close to our chosen threshold of interpretation (i.e., 0.35); we could be more liberal or more conservative, if desired. Recall that the DKEFS variables are reaction time, and have not been reverse scored, so a higher score (higher reaction time) is ‘worse’.

#make structure plot
featurePlot_fn(df_modelVar1); ggsave('../paperFigures/featureImportance_CV1.png', width=9, height=6)

Stability

Stability related to features. We iteratively performed feature removal, i.e., we removed one of combined 20 features of the \(X\) and \(Y\) sets, ran the CCA, and compared the derived canonical correlation coefficients. Across variates, we see that estimates with all features include are similar (albiet slightly higher) than estimates made on the basis of n-1 features (generally, the more features a model includes, the higher correlation estimates tend to be). As the difference is slight, we can conclude that estimates are relatively stable, irrespective of feature removal.

#run function to iteratively leave out a feature
df_stabilityFeature <- leaveOneFeatureOut_fn(X,Y,list_model)

#create stability figure
stabilityFeaturePlot_fn(df_stabilityFeature)

Stability related to participants. We perform leave-one-out jackknife procedure to get uncertainty of canonical loadings. Jackknife repeatedly leaves one subject out and then performs the CCA procedure. If the model is stable, estimates will remain similar; if unstable, estimate will vary widely. This analysis is identical to that performed by Dinga, is his 2019 paper.

df_stabilityParticipant <- leaveOneParticipantOut_fn(X,Y)
stabilityParticipantPlot_fn(df_stabilityParticipant)




Summary

  • We performed CCA on 6 brain (LC segment) variables and 14 behaviour (cognition) variables, with data from 48 participants (25 LLD, 23 HC).
  • Note that data from LLD and HC is combined (endorsed by lack of evidence of non-homogeneity between groups in the raw variables (p=0.229) and no significant difference in the correlation between \(X\) and \(Y\) participant scores on CV1, between groups (p= 0.727) .
  • The analysis yielded 6 variates, with canonical correlations (\(R_c\)) of 0.853, 0.687, 0.629, 0.549, 0.467, 0.373 for each successive variate. Importantly, feature- and participant-wise analyses suggest that model values are stable.
  • Collectively, the full model across all variates was found to be statistically significant by 4 of 4 parametric multivariate tests (Wilks=0.014; Hotelling=0.006; Pillai=0.025; Roy=0).
  • Likewise, our permutation test found 4 of the 4 multivariate tests to be significant via permutation testing (Wilks=0.014; Hotelling=0.018; Pillai=0.021; Roy=0.009).
  • The full statistical results using Wilks’ lambda (the most often reported) were: \(\lambda\)=0.041, F(84, 162.44)=1.5, p=0.014, p(perm,1000)=0.014.
  • The full model showed an effect size of \(R_c^2\)= 95.9% variance explained (i.e., percentage of variance shared by the \(X\) and \(Y\) variable sets).
  • Further review of the dimensions showed that only the first canonical variate within the full model was statistically significant. The first variate (CV1) showed a large effect size of 72.83% variance explained (i.e., percentage of variance explained within the first variate). We proceeded to interpret only this variate.
  • We found that 2 observed \(X\) (brain) variables loaded highly onto the synthetic \(X\) variable for the first variate: Segment3_left.caudal, Segment4_right.rostral, and 4 observed Y (cognition) variables loaded highly onto the synthetic \(Y\) variable for the first variate: RBANS_attention, RBANS_delayed.memory, DKEFS_TMT.A, DKEFS_verbal.fluency_letters. These variables also tended to have larger canonical function coefficients (\(\beta\)).
  • Some above-threshold structure coefficients within the \(X\) and \(Y\) set have discordant signs (e.g., positive and negative), indicating that they were not all positively related. Note that the DKEFS is a reaction time measure that was not reverse scored in the present analysis.