Missing data, unreliable corr matrices, negative eigenvalues

Background. I’d like to do a CCA. X will be imaging data. But, I have lots of missing values (result of unreliable white matter fiber estimation in small tracts).

Set up.

#load libraries
libraries <- c('CCA', 'dplyr', 'matrixcalc') #list
lapply(libraries, require, character.only = T) #load 

#read in data
df <- read.csv(dir('../data/out/', full.names=T, pattern="^eigen"))

Data cleaning.

#identify variables with data from fewer than 10 participants 
(missing <- names(which(colSums(is.na(df)) > 10))) #n=29, all imaging vars are expected

##  [1] "FA.CC7_left"                   "FA.CC5_left"                  
##  [3] "FA.PLIC_left"                  "FA.CR.P_right"                
##  [5] "FA.SLF.I_left"                 "FA.EC_left"                   
##  [7] "FA.MCP_right"                  "FA.SLF.II_right"              
##  [9] "FA.AF_right"                   "FA.IOFF_left"                 
## [11] "FA.SO_left"                    "FA.CPC_left"                  
## [13] "FA.CR.P_left"                  "FA.CC5_right"                 
## [15] "FA.CC6_right"                  "FA.CC7_right"                 
## [17] "FA.EC_right"                   "FA.EmC_right"                 
## [19] "FA.SLF.II_left"                "FA.CR.F_commissural"          
## [21] "FA.MCP_left"                   "FA.Intra.CBLM.I.P_commissural"
## [23] "FA.IOFF_right"                 "FA.EmC_left"                  
## [25] "FA.CPC_commissural"            "FA.CPC_right"                 
## [27] "FA.SLF.I_right"                "FA.CB_commissural"            
## [29] "FA.ICP_commissural"

#remove variables with data from fewer than 10 participants 
df<- df[, !colnames(df) %in% missing]

Create X and Y matrices.

#grep for relevant vars in df
FA   <- grep('FA', names(df), value=TRUE)
ncog <- grep('np_domain', names(df), value=TRUE)
scog <- grep('scog_', names(df), value=TRUE)
neg  <- grep('sans_sub_', names(df), value=TRUE)

#create X as imaging -- contains 66 vars
X <- scale(df[, names(df) %in% FA]) 

#create Y as ncog, scog, and neg -- contains 12 vars
Y <- c(ncog, scog, neg)
Y <- scale(df[, names(df) %in% Y])

Calculate within and between correlation matrices.

#because of missing data, using pairwise complete observations
rxx <- cor(X, use = "p") ; ryy <- cor(Y, use = "p") ; rxy <- cor(X,Y, use = "p") 

#check if any _complete_ columns of correlation matrices are NA
sum(colSums(is.na(rxx)) > 0) ; sum(colSums(is.na(ryy)) > 0) ; sum(colSums(is.na(rxy)) > 0) #all 0

## [1] 0

## [1] 0

## [1] 0

But, omega calculation fails… (can’t complete chol).

#calculate omega
#omega = t(solve(chol(rxx))) %*% rxy %*% solve(chol(ryy)) #FAILS because can't compute chol(rxx)

#figure out why it fails - ensure symmetric
is.symmetric.matrix(rxx)

## [1] TRUE

is.positive.definite(rxx) #uhoh

## [1] FALSE

#figure out why it fails - ensure positive/non-0 eigenvalues
eigen(rxx)[1] #lots of negatives, and pretty big ones, too...

## $values
##  [1] 19.069233157  5.062949103  3.726063703  3.163291291  2.863719020
##  [6]  2.612850446  2.346930308  2.296552056  2.115957039  1.868643425
## [11]  1.561577794  1.488734107  1.426247862  1.352851961  1.331568384
## [16]  1.207808815  1.078191420  1.004179647  0.981671762  0.935848036
## [21]  0.778493868  0.750788015  0.738481158  0.716927051  0.615803392
## [26]  0.586544765  0.545450742  0.518911575  0.467132747  0.440889130
## [31]  0.382934668  0.341028375  0.314113691  0.299981803  0.278021891
## [36]  0.270899032  0.219290422  0.187842289  0.163850075  0.156708885
## [41]  0.152931027  0.128201168  0.111111374  0.094026192  0.079206427
## [46]  0.065963077  0.050382174  0.039674881  0.032054030  0.020996974
## [51]  0.014594319  0.011746533 -0.001500048 -0.004055399 -0.016014938
## [56] -0.021683976 -0.026602420 -0.030988033 -0.044858662 -0.049264397
## [61] -0.067672551 -0.072787360 -0.100301257 -0.105399265 -0.119188258
## [66] -0.409534525

So, we see X is non-positive definite, and the negative eigenvalues (see last 14 values) are non-trivially large. This suggests that the correlation matrix estimation isn’t robust, I think because I calculated pairwise correlation coefficients in the presence of missing values. So, I need to do something else about the missing data…

Unfortunately, N is not big enough to omit all observations with missing values (also I don’t think it would be valid, given that so participants are expected to have some missing values, i.e., missing white matter tracts).

So, should I impute missing values? Or, should I “shrink” the correlation matrix (e.g., via the Ledoit Wolf method). Or something else?