Date written: 2020-08-25
Date last ran: 2020-09-02


#clean environment
rm(list = ls())

#set seed (for replicable permutation)
set.seed(123)

#libraries
libraries <- c(
  'kableExtra',
  'eqs2lavaan',
  'gridExtra',
  'TExPosition',    #two tables
  'PTCA4CATA',      #remotes::install_github("HerveAbdi/PTCA4CATA")
  'data4PCCAR',     #remotes::install_github("HerveAbdi/data4PCCAR")
  'prettyGraphs')    

#load libraries
lapply(libraries, require, character.only = T); rm(libraries)

#read in df (already scaled)
df <- read.csv(dir('../csvs/modified', full.names=T, pattern="^df_CCA_noOutliersINTTransformed_"), row.names = 1) #174

#define Y variables (k=12)
Y_vars <- names(df)[!grepl('left|right|commissural', colnames(df))]

#define X variables (k=23)
X_vars <- c(

 #association
 "AF_left",  
 "CB_left",             
 "ILF_left",            
 "IOFF_left", 
 "UF_left", 
 "AF_right",
 "CB_right",            
 "ILF_right",            
 "IOFF_right", 
 "UF_right",
 
 #projection
 "CR_F_left",            
 "CR_P_left",            
 "TF_left",              
 "CR_F_right",
 "CR_P_right",  
 "TF_right",            

 #commissural
 "CC1_commissural",     
 "CC2_commissural",     
 "CC3_commissural",     
 "CC4_commissural",      
 "CC5_commissural",      
 "CC6_commissural",     
 "CC7_commissural"      
)

#make new df, of just X and Y
df <- cbind(df[, X_vars], df[, Y_vars])

#build X and Y sets as matrices
X <- as.matrix(df[, X_vars])
Y <- as.matrix(df[, Y_vars])

Description

Overview. Here, we perform Partial Least Squares Correlation (PLSC). As in CCA, PLS is "doubly-multivariate", i.e., both the X (left, rows) and Y (right, column) sets contain multiple variables, and the model is "symmetrical", i.e., both the X and Y sets play a similar role and can be conceptualized as dependent variables. The goal of PLSC is to analyze commonality between the two sets. Specifically, PLSC derives two new sets of latent dimensions (one for each set) as linear combinations of the observed variables, with the goal of maximizing covariance (i.e., finding latent dimensions that “explain” the largest portion of the covariance between the X and Y sets), under the constraints that the latent dimensions are orthogonal, and coefficients are normalized.

Dataset. The complete dataset is identical to that used in the previous CCA analysis. We include data from n=174 participants. The X set contains 23 brain variables (white matter FA), and the Y set 12 behaviour variables (neurocognition and social cognition scores). Univariate outliers have been removed, and missing values have been interpolated (though non-normality and missing data is permitted by PLSC, we wanted an identical dataset to that used in the CCA, to aide comparison). The data have been scaled and normed before model application.

Method. I used Derek Beaton's TExPosition package to compute the PLSC, and a variety of related packages for visualization.


Correlation plot

The plot below shows both correlation (upper) and covariance (lower). Color intensity is proportional to the coefficients. We see that the X and Y sets contain higher values within set than between set.

#compute covariance matrix
XY.cov <- cov(df)

#plot correlation and covariance
plotCov(XY.cov)


PLSC model

I fit the PLSC with the tepPLS function in Derek's TExPosition package. As in CCA, PLSC produces a number of latent dimensions commensurate with the number of observed variables in the smaller of the X and Y set; here, 12. The model returns a TExPosition.Data list with the following 18 objects, where i is the X (row, left) set, and j is the Y (column, right) set:

#PLSC
model <- tepPLS(X, Y,
       center1=FALSE, 
       center2=FALSE,
       scale1=FALSE, #'SS1'
       scale2=FALSE,
       DESIGN = NULL, #rows do not belong to groups
       make_design_nominal = FALSE,
       graphs = FALSE, 
       k=0) #don't decide components yet

#capture model contents
text <- capture.output(model$TExPosition.Data) 

#put model objects into table
text <- data.frame(matrix(unlist(text)))
text <- data.frame(text[7:nrow(text),])
text <- data.frame(scan(text=as.vector(text[,1]), what='character', quiet=TRUE))
text <- data.frame(split(text, 1:3))
names(text) <- c('number', 'name', 'description')
kable(text, row.names = FALSE) %>% kable_styling(fixed_thead = TRUE, bootstrap_options='condensed') 
number name description
1 $fi Factor scores of the rows
2 $di Squared distances of the rows
3 $ci Contributions of the rows
4 $ri Cosines of the rows
5 $fj Factor scores of the columns
6 $dj square distances of the columns
7 $cj Contributions for the columns
8 $rj Cosines of the columns
9 $lx Latent variables of X (DATA1)
10 $ly Latent variables of Y (DATA2)
11 $t Explained Variance
12 $eigs Eigenvalues
13 $W1 weights for DATA1
14 $W2 weights for DATA2
15 $pdq GSVD data
16 $X X matrix to decompose
17 $data1.norm center and scale for DATA1
18 $data2.norm center and scale for DATA2

Number of dimensions

Determining the optimal number of latent dimensions/components is a very important problem in PLS. In PLSR (regression), interpreting a less-than-optimal number of dimensions leads to a loss of information, whereas selecting a more-than-optimal number can results in a model with poor predictive ability. In PLSC (present analysis), selecting a less-than-optimal number of dimensions likewise leads to a loss of information, whereas selecting a more-than-optimal number can lead to interpreting non-meaningful relationships.

Scree plot

We can use a scree plot to help determine how many dimensions are important for interpretation. Here, the black dots indicate the percent variance explained by each dimension. The large red dot overlaying the black dot indicates a significant permuted probability (1000 iterations) associated with the given dimension, i.e., that number of dimensions is significant. The horizontal purple line corresponds to the average inertia (Kaiser's criterion). We see that both the permutation test and Kaiser's criterion would have us interpret the first three dimensions, which together explain ~79.22% variance.

#permutation test for sig
resPerm4PLSC <- perm4PLSC(X, Y, 
                center1=FALSE,
                center2=FALSE,
                scale1=FALSE,
                scale2=FALSE,
                nIter = 1000) 

#plot scree
PlotScree(ev = model$TExPosition.Data$eigs,
          p.ev = resPerm4PLSC$pEigenvalues,
          plotKaiser = TRUE,
          col.ns = "black",
          col.sig = "red")

Explained variance
names <- paste0('dim', seq(1:length(Y_vars)))
t(round(model$TExPosition.Data$t, 2)) %>% kable(col.names = names) %>% kable_styling()
dim1 dim2 dim3 dim4 dim5 dim6 dim7 dim8 dim9 dim10 dim11 dim12
53.08 13.94 12.21 6.23 5.43 3.56 1.84 1.26 1.03 0.91 0.45 0.08
Eigenvalues
t(round(model$TExPosition.Data$eigs, 2)) %>% kable(col.names = names) %>% kable_styling()
dim1 dim2 dim3 dim4 dim5 dim6 dim7 dim8 dim9 dim10 dim11 dim12
25766.11 6765.37 5927.51 3024.51 2637.03 1729.96 891.31 611.28 497.9 440.29 216.14 36.79
Permuted probabilities
t(resPerm4PLSC$pEigenvalues) %>% kable(col.names = names) %>% kable_styling()
dim1 dim2 dim3 dim4 dim5 dim6 dim7 dim8 dim9 dim10 dim11 dim12
0.372 0.69 0.043 0.372 0.073 0.2 0.862 0.89 0.68 0.24 0.637 0.997


Score plots

Here, we show the score plots (one point for each of the 174 participants) for the first three latent dimensions. The scores of the X (brain) set are on the X axis, and the scores for the Y (behaviour) set are on the Y axis. Note that the scores for both sets have a mean of 0. The distance on the plot directly reflects the amount of explained covariances (i.e. the correlation matrix).

#function for correlations 
correlation_fn <- function(dimension){
  round(cor(model$TExPosition.Data$lx[,dimension], model$TExPosition.Data$ly[,dimension]), 3)
}

#use correlation function
cor_1 <- correlation_fn(1)
cor_2 <- correlation_fn(2)
cor_3 <- correlation_fn(3)

#function for plotting score plots
plotScore_fn <- function(dimension, correlation){
  latvar <- cbind(model$TExPosition.Data$lx[,dimension], model$TExPosition.Data$ly[,dimension])
  colnames(latvar) <- c(paste("Latent X dim", dimension), paste("Latent Y dim", dimension))
  plot.lv <- createFactorMap(latvar, title=paste('correlation = ', correlation))
  plot <- plot.lv$zeMap_background + plot.lv$zeMap_dots
}

#use function for first three dimensions
score1 <- plotScore_fn(1, cor_1)
score2 <- plotScore_fn(2, cor_2)
score3 <- plotScore_fn(3, cor_3)

#plot together
grid.arrange(score1, score2, score3, nrow=1)


Component plots

Highly correlated observed variables appear close together in the component plots of all dimensions. Below, the size of the observed variable reflects the size of the salience / contribution to the given latent dimension.

Combined
factor <- rbind(model$TExPosition.Data$fi, model$TExPosition.Data$fj)
contributions <- rbind(model$TExPosition.Data$ci, model$TExPosition.Data$cj)

#Component plots with factor scores
prettyPlot(factor, 
           contributionCircles = TRUE,
           contributions=contributions,
           xlab='Component 1',
           ylab='Component 2',
           main='Component plot',
           dev.new=F) #required so doesn't print list

X set
prettyPlot(model$TExPosition.Data$fi, 
           contributionCircles = TRUE,
           contributions=model$TExPosition.Data$ci,
           xlab='Component 1',
           ylab='Component 2',
           main='Component map X set', 
           dev.new = F)

Y set
prettyPlot(model$TExPosition.Data$fj, 
           contributionCircles = TRUE,
           contributions=model$TExPosition.Data$cj,
           xlab='Component 1',
           ylab='Component 2',
           main='Component map Y set', 
           dev.new = F)


Salience plots

For each latent dimension, the salience for each X and Y observed variable indicates the extent to which that variable has contributed to the latent dimension. The variables that have a large salience contributed a large amount and should be used to interpret that dimension. It is often the case that dimensions are “made” of a small number of highly contributing variables. In the plots below, purple bars indicate high positive contributions, green bars indicate high negative contributions, and grey bars indicate low contributions. The red line indicates a data-driven expected contribution threshold of 100 divided by the number of observed variables in X or Y, respectively. We also conducted bootstrap salience estimates (1000 iterations), to identify significane after random sampling, with an alpha = .05 and critical value = 2.

#get sign of contributions / saliences
signed.X <- model$TExPosition.Data$ci * sign(model$TExPosition.Data$fi) 
signed.Y <- model$TExPosition.Data$cj * sign(model$TExPosition.Data$fj)

#function for plotting salience plots
plotSalience_fn <- function(set, dimension){
  PrettyBarPlot2(bootratio = round(100*set[,dimension]),
    threshold = 100 / nrow(set),
    plotnames = TRUE,
    main = paste('Variable Factor loadings Dim', dimension),
    ylab = "Variable Factor loadings")
}

#bootstrapped significance model
model_boot <- Boot4PLSC(
  X,
  Y,
  center1 = FALSE,
  center2 = FALSE,
  scale1 = FALSE,
  scale2 = FALSE,
  Fi = NULL,
  Fj = NULL,
  #nf2keep = 3,
  nIter = 1000,
  critical.value = 2,
  eig = TRUE,
  alphaLevel = 0.05
)
X set Dim 1
plotSalience_fn(signed.X, 1)

The permutation test shows the following variables to be significant: CR_F_right

X set Dim 2
plotSalience_fn(signed.X, 2)

The permutation test shows the following variables to be significant: UF_right

X set Dim 3
plotSalience_fn(signed.X, 3)

The permutation test shows the following variables to be significant: ILF_right, CC2_commissural, CC6_commissural


Y set Dim 1
plotSalience_fn(signed.Y, 1)

The permutation test shows the following variables to be significant: Attention_vigilance, Working_memory, Verbal_learning, TASIT_3, RMET, RAD, ER_40

Y set Dim 2
plotSalience_fn(signed.Y, 2)

The permutation test shows the following variables to be significant: Attention_vigilance

Y set Dim 3
plotSalience_fn(signed.Y, 3)

The permutation test shows the following variables to be significant: Problem_solving