Background. I’d like to do a CCA. X will be imaging data. But, I have lots of missing values (result of unreliable white matter fiber estimation in small tracts).
Set up.
#load libraries
libraries <- c('CCA', 'dplyr', 'matrixcalc') #list
lapply(libraries, require, character.only = T) #load
#read in data
df <- read.csv(dir('../data/out/', full.names=T, pattern="^eigen"))
Data cleaning.
#identify variables with data from fewer than 10 participants
(missing <- names(which(colSums(is.na(df)) > 10))) #n=29, all imaging vars are expected
## [1] "FA.CC7_left" "FA.CC5_left"
## [3] "FA.PLIC_left" "FA.CR.P_right"
## [5] "FA.SLF.I_left" "FA.EC_left"
## [7] "FA.MCP_right" "FA.SLF.II_right"
## [9] "FA.AF_right" "FA.IOFF_left"
## [11] "FA.SO_left" "FA.CPC_left"
## [13] "FA.CR.P_left" "FA.CC5_right"
## [15] "FA.CC6_right" "FA.CC7_right"
## [17] "FA.EC_right" "FA.EmC_right"
## [19] "FA.SLF.II_left" "FA.CR.F_commissural"
## [21] "FA.MCP_left" "FA.Intra.CBLM.I.P_commissural"
## [23] "FA.IOFF_right" "FA.EmC_left"
## [25] "FA.CPC_commissural" "FA.CPC_right"
## [27] "FA.SLF.I_right" "FA.CB_commissural"
## [29] "FA.ICP_commissural"
#remove variables with data from fewer than 10 participants
df<- df[, !colnames(df) %in% missing]
Create X and Y matrices.
#grep for relevant vars in df
FA <- grep('FA', names(df), value=TRUE)
ncog <- grep('np_domain', names(df), value=TRUE)
scog <- grep('scog_', names(df), value=TRUE)
neg <- grep('sans_sub_', names(df), value=TRUE)
#create X as imaging -- contains 66 vars
X <- scale(df[, names(df) %in% FA])
#create Y as ncog, scog, and neg -- contains 12 vars
Y <- c(ncog, scog, neg)
Y <- scale(df[, names(df) %in% Y])
Calculate within and between correlation matrices.
#because of missing data, using pairwise complete observations
rxx <- cor(X, use = "p") ; ryy <- cor(Y, use = "p") ; rxy <- cor(X,Y, use = "p")
#check if any _complete_ columns of correlation matrices are NA
sum(colSums(is.na(rxx)) > 0) ; sum(colSums(is.na(ryy)) > 0) ; sum(colSums(is.na(rxy)) > 0) #all 0
## [1] 0
## [1] 0
## [1] 0
But, omega calculation fails… (can’t complete chol).
#calculate omega
#omega = t(solve(chol(rxx))) %*% rxy %*% solve(chol(ryy)) #FAILS because can't compute chol(rxx)
#figure out why it fails - ensure symmetric
is.symmetric.matrix(rxx)
## [1] TRUE
is.positive.definite(rxx) #uhoh
## [1] FALSE
#figure out why it fails - ensure positive/non-0 eigenvalues
eigen(rxx)[1] #lots of negatives, and pretty big ones, too...
## $values
## [1] 19.069233157 5.062949103 3.726063703 3.163291291 2.863719020
## [6] 2.612850446 2.346930308 2.296552056 2.115957039 1.868643425
## [11] 1.561577794 1.488734107 1.426247862 1.352851961 1.331568384
## [16] 1.207808815 1.078191420 1.004179647 0.981671762 0.935848036
## [21] 0.778493868 0.750788015 0.738481158 0.716927051 0.615803392
## [26] 0.586544765 0.545450742 0.518911575 0.467132747 0.440889130
## [31] 0.382934668 0.341028375 0.314113691 0.299981803 0.278021891
## [36] 0.270899032 0.219290422 0.187842289 0.163850075 0.156708885
## [41] 0.152931027 0.128201168 0.111111374 0.094026192 0.079206427
## [46] 0.065963077 0.050382174 0.039674881 0.032054030 0.020996974
## [51] 0.014594319 0.011746533 -0.001500048 -0.004055399 -0.016014938
## [56] -0.021683976 -0.026602420 -0.030988033 -0.044858662 -0.049264397
## [61] -0.067672551 -0.072787360 -0.100301257 -0.105399265 -0.119188258
## [66] -0.409534525
So, we see X is non-positive definite, and the negative eigenvalues (see last 14 values) are non-trivially large. This suggests that the correlation matrix estimation isn’t robust, I think because I calculated pairwise correlation coefficients in the presence of missing values. So, I need to do something else about the missing data…
Unfortunately, N is not big enough to omit all observations with missing values (also I don’t think it would be valid, given that so participants are expected to have some missing values, i.e., missing white matter tracts).
So, should I impute missing values? Or, should I “shrink” the correlation matrix (e.g., via the Ledoit Wolf method). Or something else?