Date written: 2020-08-25
Date last run: 2020-09-02
#clean environment
rm(list = ls())
#set seed (for replicable permutation)
set.seed(123)
#libraries
libraries <- c(
'kableExtra',
'eqs2lavaan',
'gridExtra',
'TExPosition', #two tables
'PTCA4CATA', #remotes::install_github("HerveAbdi/PTCA4CATA")
'data4PCCAR', #remotes::install_github("HerveAbdi/data4PCCAR")
'prettyGraphs')
#load libraries
lapply(libraries, require, character.only = TRUE); rm(libraries)
#read in df (already scaled)
df <- read.csv(dir('../csvs/modified', full.names=T, pattern="^df_CCA_noOutliersINTTransformed_"), row.names = 1) #174
#define Y variables (k=12)
Y_vars <- names(df)[!grepl('left|right|commissural', colnames(df))]
#define X variables (k=23)
X_vars <- c(
#association
"AF_left",
"CB_left",
"ILF_left",
"IOFF_left",
"UF_left",
"AF_right",
"CB_right",
"ILF_right",
"IOFF_right",
"UF_right",
#projection
"CR_F_left",
"CR_P_left",
"TF_left",
"CR_F_right",
"CR_P_right",
"TF_right",
#commissural
"CC1_commissural",
"CC2_commissural",
"CC3_commissural",
"CC4_commissural",
"CC5_commissural",
"CC6_commissural",
"CC7_commissural"
)
#make new df, of just X and Y
df <- cbind(df[, X_vars], df[, Y_vars])
#build X and Y sets as matrices
X <- as.matrix(df[, X_vars])
Y <- as.matrix(df[, Y_vars])
Overview. Here, we perform Partial Least Squares Correlation (PLSC). As in CCA, PLSC is "doubly-multivariate", i.e., both the X (left, rows) and Y (right, columns) sets contain multiple variables, and the model is "symmetrical", i.e., the X and Y sets play equivalent roles and can both be conceptualized as dependent variables. The goal of PLSC is to analyze the commonality between the two sets. Specifically, PLSC derives two new sets of latent dimensions (one per set) as linear combinations of the observed variables, chosen to maximize covariance (i.e., to find latent dimensions that "explain" the largest portion of the covariance between the X and Y sets), under the constraints that the latent dimensions are orthogonal and the coefficients are normalized.
Dataset. The complete dataset is identical to that used in the previous CCA analysis. We include data from n=174 participants. The X set contains 23 brain variables (white matter FA), and the Y set 12 behaviour variables (neurocognition and social cognition scores). Univariate outliers have been removed, and missing values have been interpolated (although PLSC tolerates non-normality and missing data, we wanted a dataset identical to that used in the CCA, to aid comparison). The data were scaled and normed before model application.
Method. I used Derek Beaton's TExPosition package to compute the PLSC, and a variety of related packages for visualization.
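Under the hood, PLSC amounts to a singular value decomposition of the cross-product matrix between the two sets. The sketch below is not part of the analysis pipeline; it is a minimal illustration, assuming the X and Y matrices built above are already centred and scaled (as stated), and the exact scaling and sign conventions used internally by tepPLS may differ.
#minimal PLSC sketch: SVD of the cross-product matrix (illustration only)
R.xy <- t(X) %*% Y #23 x 12 cross-product matrix (correlation-like, given scaled data)
dec <- svd(R.xy) #decompose R.xy = U D V'
lx.sketch <- X %*% dec$u #latent variables for the X set
ly.sketch <- Y %*% dec$v #latent variables for the Y set
dec$d^2 #squared singular values, analogous to the eigenvalues reported by tepPLS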
The plot below shows both correlation (upper) and covariance (lower). Color intensity is proportional to the magnitude of the coefficients. We see that both the X and Y sets show higher values within-set than between-set.
#compute covariance matrix
XY.cov <- cov(df)
#plot correlation and covariance
plotCov(XY.cov)
I fit the PLSC with the tepPLS function in Derek's TExPosition package. As in CCA, PLSC produces a number of latent dimensions equal to the number of observed variables in the smaller of the X and Y sets; here, 12. The model returns a TExPosition.Data list with the following 18 objects, where i refers to the X (row, left) set and j to the Y (column, right) set:
#PLSC
model <- tepPLS(X, Y,
center1=FALSE,
center2=FALSE,
scale1=FALSE, #data already scaled (tepPLS default would be 'SS1')
scale2=FALSE,
DESIGN = NULL, #rows do not belong to groups
make_design_nominal = FALSE,
graphs = FALSE,
k=0) #don't decide components yet
#capture model contents
text <- capture.output(model$TExPosition.Data)
#put model objects into table
text <- data.frame(matrix(unlist(text)))
text <- data.frame(text[7:nrow(text),])
text <- data.frame(scan(text=as.vector(text[,1]), what='character', quiet=TRUE))
text <- data.frame(split(text, 1:3))
names(text) <- c('number', 'name', 'description')
kable(text, row.names = FALSE) %>% kable_styling(fixed_thead = TRUE, bootstrap_options='condensed')
number | name | description |
---|---|---|
1 | $fi | Factor scores of the rows |
2 | $di | Squared distances of the rows |
3 | $ci | Contributions of the rows |
4 | $ri | Cosines of the rows |
5 | $fj | Factor scores of the columns |
6 | $dj | square distances of the columns |
7 | $cj | Contributions for the columns |
8 | $rj | Cosines of the columns |
9 | $lx | Latent variables of X (DATA1) |
10 | $ly | Latent variables of Y (DATA2) |
11 | $t | Explained Variance |
12 | $eigs | Eigenvalues |
13 | $W1 | weights for DATA1 |
14 | $W2 | weights for DATA2 |
15 | $pdq | GSVD data |
16 | $X | X matrix to decompose |
17 | $data1.norm | center and scale for DATA1 |
18 | $data2.norm | center and scale for DATA2 |
Determining the optimal number of latent dimensions/components is an important problem in PLS. In PLSR (regression), interpreting fewer than the optimal number of dimensions leads to a loss of information, whereas selecting more than the optimal number can result in a model with poor predictive ability. In PLSC (the present analysis), selecting too few dimensions likewise loses information, whereas selecting too many can lead to interpreting non-meaningful relationships.
We can use a scree plot to help determine how many dimensions are important for interpretation. Here, the black dots indicate the percent variance explained by each dimension. A large red dot overlaying a black dot indicates a significant permuted probability (1000 iterations) for that dimension, i.e., that dimension is significant. The horizontal purple line corresponds to the average inertia (Kaiser's criterion). We see that both the permutation test and Kaiser's criterion would have us interpret the first three dimensions, which together explain ~79.22% of the variance.
#permutation test for sig
resPerm4PLSC <- perm4PLSC(X, Y,
center1=FALSE,
center2=FALSE,
scale1=FALSE,
scale2=FALSE,
nIter = 1000)
#plot scree
PlotScree(ev = model$TExPosition.Data$eigs,
p.ev = resPerm4PLSC$pEigenvalues,
plotKaiser = TRUE,
col.ns = "black",
col.sig = "red")
#column names for the 12 latent dimensions
names <- paste0('dim', seq_along(Y_vars))
t(round(model$TExPosition.Data$t, 2)) %>% kable(col.names = names) %>% kable_styling()
dim1 | dim2 | dim3 | dim4 | dim5 | dim6 | dim7 | dim8 | dim9 | dim10 | dim11 | dim12 |
---|---|---|---|---|---|---|---|---|---|---|---|
53.08 | 13.94 | 12.21 | 6.23 | 5.43 | 3.56 | 1.84 | 1.26 | 1.03 | 0.91 | 0.45 | 0.08 |
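As a quick arithmetic check, the ~79.22% figure quoted above is simply the sum of the first three entries of this explained-variance table:
#cumulative variance explained by the first three dimensions
sum(model$TExPosition.Data$t[1:3]) #approximately 79.2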
t(round(model$TExPosition.Data$eigs, 2)) %>% kable(col.names = names) %>% kable_styling()
dim1 | dim2 | dim3 | dim4 | dim5 | dim6 | dim7 | dim8 | dim9 | dim10 | dim11 | dim12 |
---|---|---|---|---|---|---|---|---|---|---|---|
25766.11 | 6765.37 | 5927.51 | 3024.51 | 2637.03 | 1729.96 | 891.31 | 611.28 | 497.9 | 440.29 | 216.14 | 36.79 |
t(resPerm4PLSC$pEigenvalues) %>% kable(col.names = names) %>% kable_styling()
dim1 | dim2 | dim3 | dim4 | dim5 | dim6 | dim7 | dim8 | dim9 | dim10 | dim11 | dim12 |
---|---|---|---|---|---|---|---|---|---|---|---|
0.372 | 0.69 | 0.043 | 0.372 | 0.073 | 0.2 | 0.862 | 0.89 | 0.68 | 0.24 | 0.637 | 0.997 |
Here, we show the score plots (one point for each of the 174 participants) for the first three latent dimensions. The latent scores of the X (brain) set are on the x axis, and those of the Y (behaviour) set are on the y axis. Note that the scores for both sets have a mean of 0. The distances on each plot directly reflect the amount of covariance explained by that dimension; the correlation between the X and Y latent scores is given in each plot title.
#function for correlations
correlation_fn <- function(dimension){
round(cor(model$TExPosition.Data$lx[,dimension], model$TExPosition.Data$ly[,dimension]), 3)
}
#use correlation function
cor_1 <- correlation_fn(1)
cor_2 <- correlation_fn(2)
cor_3 <- correlation_fn(3)
#function for plotting score plots
plotScore_fn <- function(dimension, correlation){
latvar <- cbind(model$TExPosition.Data$lx[,dimension], model$TExPosition.Data$ly[,dimension])
colnames(latvar) <- c(paste("Latent X dim", dimension), paste("Latent Y dim", dimension))
plot.lv <- createFactorMap(latvar, title=paste('correlation = ', correlation))
plot.lv$zeMap_background + plot.lv$zeMap_dots #return the assembled map
}
#use function for first three dimensions
score1 <- plotScore_fn(1, cor_1)
score2 <- plotScore_fn(2, cor_2)
score3 <- plotScore_fn(3, cor_3)
#plot together
grid.arrange(score1, score2, score3, nrow=1)
Highly correlated observed variables appear close together in the component plots for all dimensions. Below, the size of each variable's marker reflects the size of its salience/contribution to the given latent dimension.
factor <- rbind(model$TExPosition.Data$fi, model$TExPosition.Data$fj)
contributions <- rbind(model$TExPosition.Data$ci, model$TExPosition.Data$cj)
#Component plots with factor scores
prettyPlot(factor,
contributionCircles = TRUE,
contributions=contributions,
xlab='Component 1',
ylab='Component 2',
main='Component plot',
dev.new=F) #required so doesn't print list
prettyPlot(model$TExPosition.Data$fi,
contributionCircles = TRUE,
contributions=model$TExPosition.Data$ci,
xlab='Component 1',
ylab='Component 2',
main='Component map X set',
dev.new = F)
prettyPlot(model$TExPosition.Data$fj,
contributionCircles = TRUE,
contributions=model$TExPosition.Data$cj,
xlab='Component 1',
ylab='Component 2',
main='Component map Y set',
dev.new = F)
For each latent dimension, the salience of each X and Y observed variable indicates the extent to which that variable contributed to the latent dimension. Variables with a large salience contributed a large amount and should be used to interpret that dimension. It is often the case that dimensions are "made" of a small number of highly contributing variables. In the plots below, purple bars indicate high positive contributions, green bars indicate high negative contributions, and grey bars indicate low contributions. The red line indicates the expected average contribution, i.e., 100 divided by the number of observed variables in X or Y, respectively. We also computed bootstrapped salience estimates (1000 iterations) to identify significance under resampling, with alpha = .05 and a critical value of 2.
#get sign of contributions / saliences
signed.X <- model$TExPosition.Data$ci * sign(model$TExPosition.Data$fi)
signed.Y <- model$TExPosition.Data$cj * sign(model$TExPosition.Data$fj)
#function for plotting salience plots
plotSalience_fn <- function(set, dimension){
PrettyBarPlot2(bootratio = round(100*set[,dimension]),
threshold = 100 / nrow(set),
plotnames = TRUE,
main = paste('Signed variable contributions, Dim', dimension),
ylab = "Signed contribution (%)")
}
#bootstrapped significance model
model_boot <- Boot4PLSC(
X,
Y,
center1 = FALSE,
center2 = FALSE,
scale1 = FALSE,
scale2 = FALSE,
Fi = NULL,
Fj = NULL,
#nf2keep = 3,
nIter = 1000,
critical.value = 2,
eig = TRUE,
alphaLevel = 0.05
)
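The variables reported as significant under each bar plot below were read off the bootstrap ratios returned by Boot4PLSC. The sketch below shows one way to list them for the first dimension; the element names bootRatios.i and bootRatiosSignificant.i (and their .j counterparts) are assumptions about the structure of model_boot, so check str(model_boot) if they differ.
#hedged sketch: list variables whose |bootstrap ratio| exceeds the critical value on dimension 1
#(element names below are assumed; inspect str(model_boot) to confirm)
sigX.dim1 <- rownames(model_boot$bootRatios.i)[model_boot$bootRatiosSignificant.i[, 1]]
sigY.dim1 <- rownames(model_boot$bootRatios.j)[model_boot$bootRatiosSignificant.j[, 1]]
sigX.dim1 #X (brain) variables significant on dimension 1
sigY.dim1 #Y (behaviour) variables significant on dimension 1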
plotSalience_fn(signed.X, 1)
The bootstrap test shows the following variables to be significant: CR_F_right
plotSalience_fn(signed.X, 2)
The bootstrap test shows the following variables to be significant: UF_right
plotSalience_fn(signed.X, 3)
The bootstrap test shows the following variables to be significant: ILF_right, CC2_commissural, CC6_commissural
plotSalience_fn(signed.Y, 1)
The bootstrap test shows the following variables to be significant: Attention_vigilance, Working_memory, Verbal_learning, TASIT_3, RMET, RAD, ER_40
plotSalience_fn(signed.Y, 2)
The bootstrap test shows the following variables to be significant: Attention_vigilance
plotSalience_fn(signed.Y, 3)
The bootstrap test shows the following variables to be significant: Problem_solving