imaging data pulled: 2021-10-12
clinical data pulled: 2020-11-16
code written: 2021-11-05
last ran: 2021-11-15

Description. Here, we examine the distributions of the cognitive and LC NM-MRI data. Both cognition and imaging variables have already been corrected for age and sex in the prior script, 01_dataCleaning.Rmd. The present script finds that the reaction time cognition variables on the D-KEFS additionally need to be corrected for non-normality. (Note that variables are not standardized to zero mean and unit variance here; this is done prior to the CCA in 06_CCA.)


Load libraries and data

#clear environment
rm(list = ls())

#list required libraries
packages <- c(
  'tidyverse',
  'kableExtra', #pretty tables
  'reshape2', #wrangling
  'bestNormalize',
  'MVN'
)

#load required libraries
lapply(packages, require, character.only = TRUE)

#read in cleaned participant demographic/clinical/cognition/lc data
df <- read.csv(dir('../clinical', full.names=T, pattern="^df_2021")) #48

Functions for analyses

#function to plot violin distributions for univariate analysis
plotDistribution_fn <- function(df){
  ggplot(df, aes(x=Diagnosis, y=value, color=Diagnosis, fill=Diagnosis)) +
    geom_violin(alpha=.1) +
    geom_dotplot(dotsize=1, binaxis='y', stackdir='center') +
    geom_boxplot(alpha=.2, width=0.5, outlier.colour='black', outlier.size=3, outlier.alpha=1) +
    theme_minimal() +
    theme(legend.position = 'none',
          axis.title = element_blank()) +
    facet_wrap(~variable, scales='free')
}

LC

Outliers

The plots below show the LC NM-MRI data distributions, by diagnosis. Outliers are plotted in black. Note that we decided not to remove any outlying LC NM-MRI values, as all appear to be plausible.

#identify the LC variables
vars_LC <- names(df[,grep('^avg_max.+[0-9]_cor$', names(df))])

#pull out LC variables, ID, and diagnosis
df_LC <- df[, c('id', 'Diagnosis', vars_LC)] 
            
#melt data, for easier plotting
df_LC <- melt(df_LC, id.vars=c('id', 'Diagnosis')) 

#run plotting function
plotDistribution_fn(df_LC)

#function to pull outliers out of data -- for manual review
is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

#identify outliers
df_LC <- df_LC %>%
  group_by(variable) %>%
  mutate(outlier = ifelse(is_outlier(value), value, as.numeric(NA)))

The following values should be double checked for possible manual intervention: SEN039 avg_max_seg_1_cor, SEN046 avg_max_seg_3_cor, SEN047 avg_max_seg_5_cor, SEN029 avg_max_seg_6_cor, SEN087 avg_max_seg_6_cor.
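As a minimal sketch of how such a review list can be pulled from flagged data, the example below applies the same 1.5 * IQR rule to a toy long-format table. The data, the `flag_outlier` helper, and the `toy_review` table are hypothetical and only illustrate the rule; they are not the study data.

```r
#toy long-format data: one variable, six participants (values are made up)
toy <- data.frame(
  id = paste0('P', 1:6),
  variable = 'seg_1',
  value = c(1.0, 1.1, 0.9, 1.05, 0.95, 3.0)
)

#same 1.5 * IQR rule as is_outlier, with na.rm added as a precaution
flag_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

#flag within each variable (ave returns 0/1), then keep the rows to review
toy$outlier <- ave(toy$value, toy$variable, FUN = flag_outlier) == 1
toy_review <- toy[toy$outlier, c('id', 'variable', 'value')]
print(toy_review)
```

Here the quartiles are 0.9625 and 1.0875, so the fences sit at 0.775 and 1.275, and only the value 3.0 is flagged.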


Univariate normality

We also reviewed univariate normality via the Shapiro-Wilk test. Normality is indicated by p values >.05. We see that LC NM-MRI values are effectively normal, with a potential small deviation in avg_max_seg_6_cor (i.e., right caudal LC) in the combined sample and in HC.

Combined
mvn(df[,vars_LC], mvnTest = 'mardia', univariatePlot='qqplot')$univariateNormality

##           Test          Variable Statistic   p value Normality
## 1 Shapiro-Wilk avg_max_seg_1_cor    0.9731    0.3322    YES   
## 2 Shapiro-Wilk avg_max_seg_2_cor    0.9774    0.4769    YES   
## 3 Shapiro-Wilk avg_max_seg_3_cor    0.9722    0.3087    YES   
## 4 Shapiro-Wilk avg_max_seg_4_cor    0.9929    0.9916    YES   
## 5 Shapiro-Wilk avg_max_seg_5_cor    0.9677    0.2065    YES   
## 6 Shapiro-Wilk avg_max_seg_6_cor    0.9468    0.0299    NO
LLD
mvn(df[df$Diagnosis == 'LLD', vars_LC], mvnTest = 'mardia', univariatePlot='qqplot')$univariateNormality

##           Test          Variable Statistic   p value Normality
## 1 Shapiro-Wilk avg_max_seg_1_cor    0.9699    0.6437    YES   
## 2 Shapiro-Wilk avg_max_seg_2_cor    0.9359    0.1190    YES   
## 3 Shapiro-Wilk avg_max_seg_3_cor    0.9785    0.8540    YES   
## 4 Shapiro-Wilk avg_max_seg_4_cor    0.9780    0.8423    YES   
## 5 Shapiro-Wilk avg_max_seg_5_cor    0.9330    0.1017    YES   
## 6 Shapiro-Wilk avg_max_seg_6_cor    0.9468    0.2120    YES
HC
mvn(df[df$Diagnosis == 'HC', vars_LC], mvnTest = 'mardia', univariatePlot='qqplot')$univariateNormality

##           Test          Variable Statistic   p value Normality
## 1 Shapiro-Wilk avg_max_seg_1_cor    0.9254    0.0871    YES   
## 2 Shapiro-Wilk avg_max_seg_2_cor    0.9733    0.7673    YES   
## 3 Shapiro-Wilk avg_max_seg_3_cor    0.9523    0.3257    YES   
## 4 Shapiro-Wilk avg_max_seg_4_cor    0.9590    0.4432    YES   
## 5 Shapiro-Wilk avg_max_seg_5_cor    0.9826    0.9457    YES   
## 6 Shapiro-Wilk avg_max_seg_6_cor    0.9066    0.0346    NO
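
The univariate tables above can be approximated with base R alone. The sketch below runs `shapiro.test` on simulated data and applies the same p > .05 rule; the toy variables and the `sw_table` layout are assumptions for illustration, and the exact output format of `mvn` is not reproduced.

```r
#simulated data: one roughly normal and one right-skewed variable (n = 48)
set.seed(1)
toy <- data.frame(
  normal_var = rnorm(48),
  skewed_var = rexp(48)
)

#run the Shapiro-Wilk test on each column
sw <- lapply(toy, shapiro.test)

#collect W and p into one table and apply the p > .05 rule
sw_table <- data.frame(
  variable = names(sw),
  W = sapply(sw, function(s) unname(s$statistic)),
  p = sapply(sw, function(s) s$p.value),
  row.names = NULL
)
sw_table$normal <- ifelse(sw_table$p > 0.05, 'YES', 'NO')
print(sw_table)
```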

Multivariate normality

Next, we assessed multivariate normality via Mardia’s test. Normality is indicated by p values >.05. Though the combined sample shows evidence of skewness, it does not seem particularly problematic, and we have consequently opted not to adjust the LC NM-MRI values.

Combined
mvn(df[,vars_LC], mvnTest = 'mardia', multivariateOutlierMethod='adj')$multivariateNormality

##              Test         Statistic            p value Result
## 1 Mardia Skewness  80.2701250017644 0.0184102830303479     NO
## 2 Mardia Kurtosis 0.160730717974568  0.872305494533103    YES
## 3             MVN              <NA>               <NA>     NO
LLD
mvn(df[df$Diagnosis == 'LLD', vars_LC], mvnTest = 'mardia', multivariateOutlierMethod='adj')$multivariateNormality

##              Test          Statistic           p value Result
## 1 Mardia Skewness   57.4114310562735 0.422626028728036    YES
## 2 Mardia Kurtosis -0.102512139294482 0.918350177847724    YES
## 3             MVN               <NA>              <NA>    YES
HC
mvn(df[df$Diagnosis == 'HC', vars_LC], mvnTest = 'mardia', multivariateOutlierMethod='adj')$multivariateNormality

##              Test          Statistic           p value Result
## 1 Mardia Skewness   62.5737714401465 0.254353775116634    YES
## 2 Mardia Kurtosis -0.656660408378788 0.511399297074907    YES
## 3             MVN               <NA>              <NA>    YES
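
For readers who want to see what the two Mardia rows measure, here is a minimal base R sketch of the textbook statistics, run on simulated data. The `mardia_sketch` function is hypothetical, and the `mvn` implementation may differ in details (for example, small-sample corrections), so the numbers are not expected to match the package output exactly.

```r
#textbook Mardia skewness and kurtosis (a sketch, not the MVN implementation)
mardia_sketch <- function(X) {
  X <- as.matrix(X)
  n <- nrow(X)
  p <- ncol(X)
  Xc <- scale(X, center = TRUE, scale = FALSE) #center each column
  S <- crossprod(Xc) / n                       #ML covariance (divides by n)
  D <- Xc %*% solve(S) %*% t(Xc)               #Mahalanobis cross-products
  b1 <- sum(D^3) / n^2                         #multivariate skewness
  b2 <- sum(diag(D)^2) / n                     #multivariate kurtosis
  skew_stat <- n * b1 / 6
  skew_df <- p * (p + 1) * (p + 2) / 6
  kurt_z <- (b2 - p * (p + 2)) / sqrt(8 * p * (p + 2) / n)
  data.frame(
    test = c('skewness', 'kurtosis'),
    statistic = c(skew_stat, kurt_z),
    p_value = c(pchisq(skew_stat, skew_df, lower.tail = FALSE),
                2 * pnorm(abs(kurt_z), lower.tail = FALSE))
  )
}

#simulated data with the same shape as the LC block (48 rows, 6 columns)
set.seed(1)
toy <- matrix(rnorm(48 * 6), ncol = 6)
mardia_toy <- mardia_sketch(toy)
print(mardia_toy)
```

With six variables, the skewness statistic is referred to a chi-square distribution with 6 * 7 * 8 / 6 = 56 degrees of freedom, and the kurtosis statistic to a standard normal.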

Variance

Lastly, we performed a variance test for all LC NM-MRI values, to ensure that the LLD and HC groups show similar variance. Equal variance is indicated by p values >.05. We see that all LC NM-MRI values have indistinguishable variance.

lapply(df[, vars_LC], function(x) var.test(x ~ df$Diagnosis, alternative='two.sided'))
## $avg_max_seg_1_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 1.3011, num df = 22, denom df = 24, p-value = 0.5287
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.5666854 3.0334117
## sample estimates:
## ratio of variances 
##           1.301056 
## 
## 
## $avg_max_seg_2_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 0.5961, num df = 22, denom df = 24, p-value = 0.2267
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.2596359 1.3898055
## sample estimates:
## ratio of variances 
##          0.5960994 
## 
## 
## $avg_max_seg_3_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 0.73734, num df = 22, denom df = 24, p-value = 0.4758
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.3211538 1.7191049
## sample estimates:
## ratio of variances 
##          0.7373387 
## 
## 
## $avg_max_seg_4_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 1.4009, num df = 22, denom df = 24, p-value = 0.4208
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6101724 3.2661933
## sample estimates:
## ratio of variances 
##           1.400898 
## 
## 
## $avg_max_seg_5_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 0.57311, num df = 22, denom df = 24, p-value = 0.1937
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.249623 1.336207
## sample estimates:
## ratio of variances 
##          0.5731108 
## 
## 
## $avg_max_seg_6_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 1.5124, num df = 22, denom df = 24, p-value = 0.3242
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6587491 3.5262195
## sample estimates:
## ratio of variances 
##           1.512426
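
The F test reported above is simply the ratio of the two group variances, referred to an F distribution. The sketch below reproduces it by hand on simulated data and checks the result against `var.test`. The toy data are made up; the group sizes of 23 and 25 are chosen to match the degrees of freedom in the output above (22 and 24), and which group has which size is an assumption.

```r
#simulated two-group data (sizes chosen to give 22 and 24 degrees of freedom)
set.seed(1)
toy <- data.frame(
  Diagnosis = factor(rep(c('HC', 'LLD'), times = c(23, 25))),
  value = c(rnorm(23, sd = 1), rnorm(25, sd = 1.2))
)

#group variances and sizes
v <- tapply(toy$value, toy$Diagnosis, var)
n <- tapply(toy$value, toy$Diagnosis, length)

#F statistic: variance of the first factor level over the second
F_manual <- v[['HC']] / v[['LLD']]

#two-sided p value from the F distribution
p_lower <- pf(F_manual, n[['HC']] - 1, n[['LLD']] - 1)
p_manual <- 2 * min(p_lower, 1 - p_lower)

#same test via var.test, for comparison
ft <- var.test(value ~ Diagnosis, data = toy)
print(c(F_manual = F_manual, p_manual = p_manual))
```

A ratio above 1 means the first factor level (alphabetically, HC in this sketch) has the larger variance; a ratio below 1 means the second does.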

Cognition

Outliers

As above, these plots show the cognition data distributions, by diagnosis. Outliers are plotted in black. Note that we decided not to remove any outlying cognition values, as all are “real” values, within the possible range of scores on the various assessments.

#identify the cognition variables
vars_cognition <- names(df[,grep('^rbans.+index_cor$|^dkefs.+_cor$', names(df))])

#put cognition, ID, and diagnosis in a separate df
df_cognition <- df[, c('id', 'Diagnosis', vars_cognition)] 
            
#melt data
df_cognition <- melt(df_cognition, id.vars=c('id', 'Diagnosis')) 

#run function
plotDistribution_fn(df_cognition)


Univariate normality

As above, we review univariate normality with the Shapiro-Wilk test (normality shown by p values >.05). We see that several variables are not normal. Most are reaction time scores on the D-KEFS. We have opted to correct all reaction time variables for non-normality. We do not correct the non-normal, non-reaction time variables, as they do not appear to deviate substantially from normality.

Combined
mvn(df[, vars_cognition], mvnTest = 'mardia', univariatePlot='qqplot')$univariateNormality
##            Test                    Variable Statistic   p value Normality
## 1  Shapiro-Wilk  rbans_immmemory_index_cor     0.9664  0.1834      YES   
## 2  Shapiro-Wilk    rbans_visuo_index_cor       0.9856  0.8137      YES   
## 3  Shapiro-Wilk  rbans_language_index_cor      0.9712  0.2811      YES   
## 4  Shapiro-Wilk  rbans_attention_index_cor     0.9763  0.4341      YES   
## 5  Shapiro-Wilk   rbans_delmem_index_cor       0.9346  0.0101      NO    
## 6  Shapiro-Wilk   dkefs_trails4_time_cor       0.9319   0.008      NO    
## 7  Shapiro-Wilk   dkefs_trails5_time_cor       0.8303  <0.001      NO    
## 8  Shapiro-Wilk    dkefs_cwi_1_time_cor        0.8402  <0.001      NO    
## 9  Shapiro-Wilk    dkefs_cwi_2_time_cor        0.8349  <0.001      NO    
## 10 Shapiro-Wilk    dkefs_cwi_3_time_cor        0.8569  <0.001      NO    
## 11 Shapiro-Wilk    dkefs_cwi_4_time_cor        0.9679  0.2104      YES   
## 12 Shapiro-Wilk dkefs_vf_lftotalcorrect_cor    0.9799  0.5766      YES   
## 13 Shapiro-Wilk dkefs_vf_cftotalcorrect_cor    0.9759  0.4225      YES   
## 14 Shapiro-Wilk dkefs_vf_csswitchtotcor_cor    0.9337  0.0094      NO

LLD

As expected, not all variables are normal in LLD.

mvn(df[df$Diagnosis == 'LLD', vars_cognition], mvnTest = 'mardia', univariatePlot='qqplot')$univariateNormality
##            Test                    Variable Statistic   p value Normality
## 1  Shapiro-Wilk  rbans_immmemory_index_cor     0.9430    0.1736    YES   
## 2  Shapiro-Wilk    rbans_visuo_index_cor       0.9806    0.8967    YES   
## 3  Shapiro-Wilk  rbans_language_index_cor      0.9280    0.0783    YES   
## 4  Shapiro-Wilk  rbans_attention_index_cor     0.9730    0.7220    YES   
## 5  Shapiro-Wilk   rbans_delmem_index_cor       0.8709    0.0045    NO    
## 6  Shapiro-Wilk   dkefs_trails4_time_cor       0.8942    0.0137    NO    
## 7  Shapiro-Wilk   dkefs_trails5_time_cor       0.9739    0.7431    YES   
## 8  Shapiro-Wilk    dkefs_cwi_1_time_cor        0.8034    0.0003    NO    
## 9  Shapiro-Wilk    dkefs_cwi_2_time_cor        0.7882    0.0001    NO    
## 10 Shapiro-Wilk    dkefs_cwi_3_time_cor        0.8150    0.0004    NO    
## 11 Shapiro-Wilk    dkefs_cwi_4_time_cor        0.9476    0.2217    YES   
## 12 Shapiro-Wilk dkefs_vf_lftotalcorrect_cor    0.9693    0.6283    YES   
## 13 Shapiro-Wilk dkefs_vf_cftotalcorrect_cor    0.9766    0.8105    YES   
## 14 Shapiro-Wilk dkefs_vf_csswitchtotcor_cor    0.9166    0.0428    NO

HC

As expected, not all variables are normal in HC.

mvn(df[df$Diagnosis == 'HC', vars_cognition], mvnTest = 'mardia', univariatePlot='qqplot')$univariateNormality
##            Test                    Variable Statistic   p value Normality
## 1  Shapiro-Wilk  rbans_immmemory_index_cor     0.9686  0.6565      YES   
## 2  Shapiro-Wilk    rbans_visuo_index_cor       0.9673  0.6244      YES   
## 3  Shapiro-Wilk  rbans_language_index_cor      0.9370  0.1549      YES   
## 4  Shapiro-Wilk  rbans_attention_index_cor     0.9508  0.3046      YES   
## 5  Shapiro-Wilk   rbans_delmem_index_cor       0.9697  0.6826      YES   
## 6  Shapiro-Wilk   dkefs_trails4_time_cor       0.9639  0.5454      YES   
## 7  Shapiro-Wilk   dkefs_trails5_time_cor       0.7399  <0.001      NO    
## 8  Shapiro-Wilk    dkefs_cwi_1_time_cor        0.8983  0.0233      NO    
## 9  Shapiro-Wilk    dkefs_cwi_2_time_cor        0.8624  0.0046      NO    
## 10 Shapiro-Wilk    dkefs_cwi_3_time_cor        0.9277  0.0977      YES   
## 11 Shapiro-Wilk    dkefs_cwi_4_time_cor        0.9803  0.9107      YES   
## 12 Shapiro-Wilk dkefs_vf_lftotalcorrect_cor    0.9633  0.5322      YES   
## 13 Shapiro-Wilk dkefs_vf_cftotalcorrect_cor    0.9600  0.4638      YES   
## 14 Shapiro-Wilk dkefs_vf_csswitchtotcor_cor    0.9479  0.2642      YES


Multivariate normality

As above, we review Mardia’s test for multivariate normality (normality shown by p values >.05). As with the LC NM-MRI values, we see evidence of skewness in the combined sample, but this is not present in the separate diagnostic groups.

Combined
mvn(df[,vars_cognition], mvnTest = 'mardia', multivariateOutlierMethod='adj')$multivariateNormality

##              Test         Statistic              p value Result
## 1 Mardia Skewness   687.45883813858 0.000176887770014033     NO
## 2 Mardia Kurtosis 0.624677068615212    0.532183026436021    YES
## 3             MVN              <NA>                 <NA>     NO
LLD
mvn(df[df$Diagnosis == 'LLD', vars_cognition], mvnTest = 'mardia', multivariateOutlierMethod='adj')$multivariateNormality

##              Test         Statistic           p value Result
## 1 Mardia Skewness  544.834053007212 0.669074356353767    YES
## 2 Mardia Kurtosis -1.16565590897873  0.24375359338542    YES
## 3             MVN              <NA>              <NA>    YES
HC
mvn(df[df$Diagnosis == 'HC', vars_cognition], mvnTest = 'mardia', multivariateOutlierMethod='adj')$multivariateNormality

##              Test         Statistic            p value Result
## 1 Mardia Skewness  511.497343769698  0.929667137951828    YES
## 2 Mardia Kurtosis -1.85729924566618 0.0632685916860407    YES
## 3             MVN              <NA>               <NA>    YES

Variance

Lastly, we review variance, to ensure that participants in the HC and LLD groups show similar values. Two reaction time variables (dkefs_trails5_time_cor and dkefs_cwi_4_time_cor) show unequal variance at p < .05, and a third (dkefs_cwi_1_time_cor) is borderline (p = .052). These tests were run before the variables were corrected for non-normality, and the result contributes to our decision to correct the reaction-time cognition variables.

lapply(df[, vars_cognition], function(x) var.test(x ~ df$Diagnosis, alternative='two.sided'))
## $rbans_immmemory_index_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 0.73587, num df = 22, denom df = 24, p-value = 0.4729
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.3205138 1.7156790
## sample estimates:
## ratio of variances 
##          0.7358693 
## 
## 
## $rbans_visuo_index_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 0.82984, num df = 22, denom df = 24, p-value = 0.6635
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.3614423 1.9347653
## sample estimates:
## ratio of variances 
##          0.8298373 
## 
## 
## $rbans_language_index_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 1.627, num df = 22, denom df = 24, p-value = 0.2469
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.7086466 3.7933161
## sample estimates:
## ratio of variances 
##           1.626986 
## 
## 
## $rbans_attention_index_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 0.91167, num df = 22, denom df = 24, p-value = 0.8314
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.3970834 2.1255487
## sample estimates:
## ratio of variances 
##          0.9116659 
## 
## 
## $rbans_delmem_index_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 1.6607, num df = 22, denom df = 24, p-value = 0.2278
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.7233127 3.8718225
## sample estimates:
## ratio of variances 
##           1.660658 
## 
## 
## $dkefs_trails4_time_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 0.67567, num df = 22, denom df = 24, p-value = 0.3589
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.294294 1.575327
## sample estimates:
## ratio of variances 
##          0.6756713 
## 
## 
## $dkefs_trails5_time_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 3.2655, num df = 22, denom df = 24, p-value = 0.005829
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  1.422326 7.613574
## sample estimates:
## ratio of variances 
##           3.265527 
## 
## 
## $dkefs_cwi_1_time_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 0.43191, num df = 22, denom df = 24, p-value = 0.05186
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.1881199 1.0069875
## sample estimates:
## ratio of variances 
##          0.4319055 
## 
## 
## $dkefs_cwi_2_time_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 0.65002, num df = 22, denom df = 24, p-value = 0.3136
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.2831203 1.5155151
## sample estimates:
## ratio of variances 
##          0.6500174 
## 
## 
## $dkefs_cwi_3_time_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 0.51505, num df = 22, denom df = 24, p-value = 0.1223
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.2243334 1.2008351
## sample estimates:
## ratio of variances 
##          0.5150484 
## 
## 
## $dkefs_cwi_4_time_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 0.40693, num df = 22, denom df = 24, p-value = 0.03773
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.1772430 0.9487646
## sample estimates:
## ratio of variances 
##          0.4069332 
## 
## 
## $dkefs_vf_lftotalcorrect_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 0.79548, num df = 22, denom df = 24, p-value = 0.5929
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.3464759 1.8546518
## sample estimates:
## ratio of variances 
##           0.795476 
## 
## 
## $dkefs_vf_cftotalcorrect_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 1.1254, num df = 22, denom df = 24, p-value = 0.7747
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.4901959 2.6239707
## sample estimates:
## ratio of variances 
##           1.125443 
## 
## 
## $dkefs_vf_csswitchtotcor_cor
## 
##  F test to compare two variances
## 
## data:  x by df$Diagnosis
## F = 1.1225, num df = 22, denom df = 24, p-value = 0.7795
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.4888947 2.6170058
## sample estimates:
## ratio of variances 
##           1.122456

Transform cognition

Identify best transform.
Here, we compare several common transformations applied to all six of the reaction-time cognition variables. We find that the two D-KEFS TMT variables are best left untransformed. We see that all four of the D-KEFS CWI variables ought to be transformed. Arcsinh is best for two of the four variables; thus, we opt to apply it to all four of the D-KEFS CWI variables.

#first, pull out the age- and sex- corrected reaction time variables
df_cognitionTime <- df[ ,grep('time_cor', names(df))]

#compare transformations on all age- and sex- corrected reaction time variables
transformComparison <- lapply(df_cognitionTime, function(x) bestNormalize(x, allow_orderNorm = FALSE, out_of_sample = FALSE))
  
#extract goodness of fit and model selection
transform_fit <- lapply(transformComparison, `[`, 'norm_stats')
  
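#bind together elements of list into a df
#note: separate() assigns the pieces by position, so if a transformation is dropped for a variable, 
#the remaining values may shift under the wrong column names (check any row that ends in NA)
df_transform <- do.call(cbind, transform_fit) %>% 
  as.data.frame() %>% gather() %>% separate(col=value, sep=',', remove=TRUE, 
  into=c('arcsinh_x','boxcox','exp_x','log_x','no_transform','sqrt_x','yeojohnson'))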
  
#remove non-numeric components of variables
df_transform <- cbind(df_transform[1], sapply(df_transform[2:ncol(df_transform)], function(x) gsub("[^0-9.-]", "", x)))
  
#make sure all variables with numbers are numeric
df_transform[2:ncol(df_transform)] <- sapply(df_transform[2:ncol(df_transform)],as.numeric)
  
#make a variable, indicating selection
df_transform$selection <- colnames(df_transform)[apply(df_transform,1,which.min)]
  
#round all numeric values
df_transform <- df_transform %>% mutate_if(is.numeric, round, 4)
  
#remove unneeded tables, dfs
rm(transform_fit, transformComparison)
  
#put into table
df_transform %>% kable() %>% kable_styling() 
key                     arcsinh_x   boxcox    exp_x   log_x  no_transform  sqrt_x  yeojohnson  selection
dkefs_trails4_time_cor    20.2262  58.9167   7.7857  2.0119        1.0595  1.2976          NA  no_transform
dkefs_trails5_time_cor     2.0714  58.9167   4.7500  2.6667        0.1667  0.2857          NA  no_transform
dkefs_cwi_1_time_cor       0.6429   0.7619  58.9167  0.6429        1.7143  0.7619      0.9405  arcsinh_x
dkefs_cwi_2_time_cor       2.8452   1.2976  56.1786  2.8452        3.3214  2.2500      0.5238  yeojohnson
dkefs_cwi_3_time_cor       0.2857   0.4048  58.9167  0.2857        1.5357  1.0000      0.4048  arcsinh_x
dkefs_cwi_4_time_cor       0.8214   0.3452  58.9167  0.8214        0.3452  0.3452      0.3452  boxcox
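
The NA values in the yeojohnson column for the two trails variables suggest that fewer statistics were returned for those variables, in which case splitting the text by position may place values under the wrong transformation names. A more robust alternative is to align the statistics by name. The sketch below assumes that each `norm_stats` element is a named numeric vector; the list `norm_stats_list` and its values are hypothetical stand-ins, not output from `bestNormalize`.

```r
#hypothetical stand-ins for bestNormalize(x)$norm_stats: named numeric vectors
#whose set of names can differ between variables (values are made up)
norm_stats_list <- list(
  var_a = c(arcsinh_x = 1.2, exp_x = 9.0, log_x = 1.5,
            no_transform = 1.1, sqrt_x = 1.3, yeojohnson = 1.4),
  var_b = c(arcsinh_x = 0.6, boxcox = 0.8, exp_x = 9.0, log_x = 0.6,
            no_transform = 1.7, sqrt_x = 0.8, yeojohnson = 0.9)
)

#union of all transformation names seen across variables
all_transforms <- sort(unique(unlist(lapply(norm_stats_list, names))))

#look up each statistic by name; transformations that were not fit become NA
stats_matrix <- t(sapply(norm_stats_list, function(s) s[all_transforms]))
colnames(stats_matrix) <- all_transforms

#one row per variable, plus the transformation with the smallest statistic
df_stats <- data.frame(key = rownames(stats_matrix), stats_matrix, row.names = NULL)
df_stats$selection <- all_transforms[apply(stats_matrix, 1, which.min)]
print(df_stats)
```

In this toy example `var_a` has no boxcox statistic, so that cell is NA while every other value stays under its own name. Ties are resolved in favor of the first column, as with `which.min` in the script above.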

Apply transformation. Note: though the variables are not standardized here, they are substantially different from their original values.

#transform the four variables with arcsinh
dkefs_cwi_1_time_normcor <- arcsinh_x(df_cognitionTime$dkefs_cwi_1_time_cor, standardize = F)$x.t
dkefs_cwi_2_time_normcor <- arcsinh_x(df_cognitionTime$dkefs_cwi_2_time_cor, standardize = F)$x.t
dkefs_cwi_3_time_normcor <- arcsinh_x(df_cognitionTime$dkefs_cwi_3_time_cor, standardize = F)$x.t
dkefs_cwi_4_time_normcor <- arcsinh_x(df_cognitionTime$dkefs_cwi_4_time_cor, standardize = F)$x.t

#replace the D-KEFS CWI reaction time variables in the dataset 
df <- df[, !names(df) %in% names(df_cognitionTime[, grep('cwi', names(df_cognitionTime))])]
df <- cbind(df, dkefs_cwi_1_time_normcor, dkefs_cwi_2_time_normcor, dkefs_cwi_3_time_normcor, dkefs_cwi_4_time_normcor)

#remove unneeded dfs
rm(df_cognitionTime, df_transform)
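
To illustrate why the transformed values differ substantially from the originals even without standardization, the sketch below applies the inverse hyperbolic sine to a few made-up values. It assumes that `arcsinh_x` with `standardize = F` reduces to base R's `asinh`, i.e. log(x + sqrt(x^2 + 1)); the input values are hypothetical.

```r
#made-up residual-like values, including negatives and one large value
x <- c(-20, -5, 0, 5, 20, 80)

#inverse hyperbolic sine, and the equivalent closed form
asinh_x <- asinh(x)
manual <- log(x + sqrt(x^2 + 1))

#the transformation is invertible with sinh
back <- sinh(asinh_x)

print(round(data.frame(x = x, asinh_x = asinh_x, back = back), 4))
```

Unlike a log transformation, arcsinh is defined for zero and negative inputs, which matters for age- and sex-corrected values that can fall below zero. Large values are compressed roughly logarithmically, which pulls in a long right tail.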

Write out data

#pull dataframe with variables of interest
df <- df[, grep('^id$|Diagnosis|Age|Sex|^rbans.+index_cor$|^dkefs.+_cor$|^dkefs.+normcor$|avg_max_seg.+[0-9]_cor', names(df))]

#write out
write.csv(df, paste0('../clinical/dfCorrected_', Sys.Date(), '.csv'), row.names = F)