Description. In a previous analysis, we removed highly influential points/outliers (with both a "capped" and an "interpolated" IQR method). Here, we review normality for the interpolated data (the "capped" data could not be adjusted to normality). After outlier treatment, we find that k=17 of the k=84 variables intended for the CCA are not normally distributed, and thus should be transformed to normality (i.e., rendered approximately Gaussian), as required by CCA. We use the bestNormalize package to identify the ideal transformation, and then implement it (Yeo-Johnson).
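
For reference, the code below assumes the following packages are loaded (a minimal sketch; the exact setup chunk is not shown in this section):

#packages assumed loaded for the analysis below
library(dplyr)         #select_if, mutate_if, pipes
library(tidyr)         #gather, separate
library(ggplot2)       #density plots and histograms
library(ggpubr)        #stat_overlay_normal_density
library(MVN)           #mvn (assumed source of the `result` object)
library(bestNormalize) #bestNormalize, yeojohnson
library(DT)            #datatable, formatRound
library(knitr)         #kable
library(kableExtra)    #kable_styling, cell_spec, spec_color
library(reshape2)      #melt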


Normality

Density with normal distribution. Note that visual inspection of these plots can make non-normality difficult to detect (the visualization is helpful, but we need to rely on the significance testing, below).

#function to get into long format
plot_fn <- function(df) {
  df %>%
  as_tibble() %>%
  select_if(is.numeric) %>%
  gather(key='variable', value='value')
}

#make melted datasets
df_plot <- plot_fn(df)
df_plot_pns <- plot_fn(df_pns)
df_plot_spins <- plot_fn(df_spins)

#function for plot -- density with normal distribution
density_fn <- function(df){
  ggplot(df, aes(x=value)) +
    geom_density() + 
    stat_overlay_normal_density(color = "red", linetype = "dashed") +
    facet_wrap(~variable, scales = 'free', ncol=6)
}

Entire sample
density_fn(df_plot)

SPINS only
density_fn(df_plot_spins)

PNS only
density_fn(df_plot_pns)



Data descriptives. Here, we see that skew and kurtosis are high in several variables; interestingly, this is especially the case for the social cognition variables (negative):
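
The result object used below is assumed to come from the MVN package (its output contains the Descriptives, univariateNormality, and multivariateNormality elements referenced throughout this section); a minimal sketch, where the choice of Shapiro-Wilk for the univariate test is an assumption:

#assumed construction of `result` via the MVN package
result <- MVN::mvn(df, mvnTest = 'mardia', univariateTest = 'SW', desc = TRUE)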

DT::datatable(result$Descriptives[9:10]) %>%
  DT::formatRound(digits=3, columns = c('Skew', 'Kurtosis'))

Significance testing of univariate normality. Here, we can review the statistical test result in the Normality column (YES indicates a normal distribution). In total, k=67 of the k=84 variables intended for the CCA are normal. Note that none of the six social cognition variables are normal, and only three of the six neurocognition variables are (the rest of the count, of course, reflects the FA variables). (Also note that missing p values simply indicate very small values, p < 0.0001.)

DT::datatable(result$univariateNormality[2:length(result$univariateNormality)]) %>%
    DT::formatRound(digits=3, columns = c('Statistic', 'p value'))

Significance testing of multivariate normality. We can also conduct a multivariate test of normality (there are several; Mardia's test is used here). We see that kurtosis, and therefore the variable set as a whole, does not meet multivariate normality.

result$multivariateNormality
##              Test         Statistic              p value Result
## 1 Mardia Skewness  101781.266262134    0.891730035954104    YES
## 2 Mardia Kurtosis -3.49949471522181 0.000466140844881835     NO
## 3             MVN              <NA>                 <NA>     NO

Transformation

Identify non-normal variables for transformation.

The k=17 non-normal variables in the dataset of 84 variables intended for possible inclusion in the CCA are: Verbal_learning, Visual_learning, Problem_solving, TASIT_1, TASIT_2, TASIT_3, RMET, RAD, ER_40, CC6_commissural, CST_right, CST_left, Sup_P_left, SLF_I_left, SLF_I_right, SLF_III_right, Sup_O_left.
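
The vector of non-normal variable names (nonNormal) and the corresponding data frames used below (df_nonNormal, and df_normal for the remaining variables) are not constructed in this section; a minimal sketch of how they are assumed to be derived from the univariate results:

#assumed derivation of the non-normal variable set from the univariate test results
#(column names follow the MVN output; whitespace trimming may be needed)
nonNormal <- result$univariateNormality %>%
  filter(trimws(Normality) == 'NO') %>%
  pull(Variable) %>%
  trimws()

df_nonNormal <- df[, nonNormal]
df_normal    <- df[, setdiff(names(df), nonNormal)]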


Compare transformation methods.

Here, we use bestNormalize to identify the best transformation according to the Pearson P statistic (divided by its degrees of freedom). The comparative benefit of the Pearson P / df normality statistic is that it is a relatively interpretable goodness-of-fit measure, and the ratio P / df can be compared between transformations as an absolute measure of the departure from normality (if the data follow a near-normal distribution, this ratio will be close to 1; below, the lowest values are indicated in the brightest shade of green). We calculate this value out of sample, via cross-validation with 10 folds and 5 repeats. The best transformation per variable is indicated in the selection column. We will ultimately select a single transformation to apply to all non-normal variables, according to which transformation is best overall. We see that Yeo-Johnson (also used by Lindsay) achieves the best results across the greatest number of variables (k=4) and consistently shows low values.
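
As an aside, the P / df ratio for a single raw variable can be computed directly with the nortest package (which bestNormalize is understood to use for this statistic); an illustrative sketch, using RMET as an example:

#illustrative only: Pearson chi-squared normality statistic divided by its df, for one raw variable
pt <- nortest::pearson.test(df_nonNormal$RMET)
unname(pt$statistic / pt$df)  #values near 1 indicate a near-normal distribution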

#run comparisons
transformComparison <- lapply(df_nonNormal, function(x) bestNormalize(x, allow_orderNorm = FALSE))

#extract goodness of fit and model selection
transform_fit <- lapply(transformComparison, `[`, 'norm_stats')

#bind together elements of the list into a dataframe
df_transform <- do.call(cbind, transform_fit) %>% 
  as.data.frame() %>%
  gather() %>%
  separate(col=value, sep=',', remove=TRUE, into=c(
    'arcsinh_x',
    'boxcox',
    'exp_x',
    'log_x',
    'no_transform',
    'sqrt_x',
    'yeojohnson'))

#why did ER_40 fail? who knows... rewrite values
er_40 <- c('ER_40',
           'arcsinh_x = 2.31212418300654',
           'boxcox = NA',
           'exp_x = 3.82',
           'log_x = 2.78794934640523',
           'no_transform = 2.11996732026144',  
           'sqrt_x = 2.30203839869281',
           'yeojohnson = 4.06532679738562')

#replace row
df_transform[10,] <- er_40

#remove non-numeric components of variables
df_transform <- cbind(df_transform[1], sapply(df_transform[2:ncol(df_transform)], function(x) gsub("[^0-9.-]", "", x)))

#make sure all variables with numbers are numeric
df_transform[2:ncol(df_transform)] <- sapply(df_transform[2:ncol(df_transform)],as.numeric)

#make a final variable, indicating which transformation was selected (lowest P / df)
df_transform$selection <- names(df_transform[2:8])[apply(df_transform[2:8], 1, which.min)]

#round all numeric values
df_transform <- df_transform %>% mutate_if(is.numeric, round, 4)

#function to highlight value of number
color_fn = function(x) cell_spec(x, bold = T, color = spec_color(x, direction=1, begin=.7, end=.4))    

#apply function
df_transform[, 2:8] = t(apply(df_transform[,2:8], 1, color_fn)) 

#make a pretty kable table
df_transform %>%
  kable(escape=F) %>%
  kable_styling() 
key arcsinh_x boxcox exp_x log_x no_transform sqrt_x yeojohnson selection
Verbal_learning 1.1061 1.1306 19.8055 1.1061 1.1073 1.1358 1.1306 arcsinh_x
Visual_learning 1.9482 1.167 18.344 1.9482 1.21 1.5774 1.1752 boxcox
Problem_solving 1.1542 1.2665 18.5869 1.1542 1.1946 1.1214 1.2665 sqrt_x
TASIT_1 2.0129 1.9454 7.0197 2.0129 1.9436 1.9155 1.9415 sqrt_x
TASIT_2 2.0271 1.3696 13.6901 2.0271 1.5327 1.7885 1.4234 boxcox
TASIT_3 1.3923 1.1994 15.6075 1.3923 1.21 1.2745 1.1908 yeojohnson
RMET 1.8187 1.4678 14.3503 1.8187 1.5044 1.5822 1.441 yeojohnson
RAD 2.9671 1.7906 15.8596 2.9671 2.3286 2.6793 1.487 yeojohnson
ER_40 1.2363 0 3.5455 1.3985 1.9366 1.1528 NA boxcox
ER_40 2.3121 NA 3.82 2.7879 2.12 2.302 4.0653 no_transform
CST_right 1.0874 1.0357 1.0116 1.0787 1.0562 1.1075 1.035 exp_x
CST_left 1.5632 1.482 1.5106 1.5833 1.5796 1.603 1.4433 yeojohnson
Sup_P_left 1.0931 1.1681 1.0858 1.1628 1.0931 1.121 1.1873 exp_x
SLF_I_left 1.2052 1.245 1.2246 1.317 1.193 1.2132 1.2527 no_transform
SLF_I_right 1.2681 1.271 1.2438 1.2898 1.2557 1.28 1.2788 exp_x
SLF_III_right 1.0231 1.0629 1.067 1.0757 1.0151 1.051 1.0677 no_transform
Sup_O_left 1.0537 0.9357 1.0244 1.1441 1.0576 1.0949 0.9357 boxcox

Application of Yeo-Johnson transformation

The Yeo-Johnson transformation, proposed by Yeo and Johnson in 2000, attempts to find the value of lambda (see equation) that minimizes the Kullback-Leibler distance between the normal distribution and the transformed distribution. Similar to the Box-Cox λ, this λ parameter can be estimated via maximum likelihood. The plots below show the success of this transformation. The first, Transformed distribution, shows new scaled data values, with the actual density distribution in black, and a normal distribution in blue. The second, Pre- and post- correction, visualizes the density of data before (pink) and after (blue) application of the Yeo-Johnson transformation.
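
For reference, the equation referred to above is the Yeo-Johnson family, which extends the Box-Cox transformation to zero and negative values:

\psi(y, \lambda) =
\begin{cases}
\left((y + 1)^{\lambda} - 1\right) / \lambda, & \lambda \neq 0,\; y \ge 0 \\
\log(y + 1), & \lambda = 0,\; y \ge 0 \\
-\left((1 - y)^{2 - \lambda} - 1\right) / (2 - \lambda), & \lambda \neq 2,\; y < 0 \\
-\log(1 - y), & \lambda = 2,\; y < 0
\end{cases}

(If of interest, the fitted λ for any one variable can be inspected on the object returned by bestNormalize's yeojohnson(), e.g. yeojohnson(df_nonNormal$RMET)$lambda.)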

#apply the Yeo-Johnson transformation to all non-normal variables
yeojohnson <- lapply(df_nonNormal, function(x) yeojohnson(x))

#pull out new data - this will comprise the plots of new distributions
yeojohnson <- lapply(yeojohnson, `[`, 'x.t')

#bind data together, into a dataframe (instead of list)
df_yeojohnson  <-  as.data.frame(matrix(unlist(yeojohnson), nrow=length(unlist(yeojohnson[1]))))
names(df_yeojohnson) <- nonNormal
row.names(df_yeojohnson) <- row.names(df_nonNormal)

#melt the df for ggplot
plot_yeojohnson <- reshape2::melt(df_yeojohnson)

#cleanup
rm(yeojohnson)

Transformed distribution
ggplot(plot_yeojohnson, aes(x=value)) +
  geom_histogram(aes(y=..density..)) + 
  geom_density(alpha=.2, fill='white') +
  stat_overlay_normal_density(color = "blue", linetype = "dashed") +
  facet_wrap(~variable, scales = 'free', ncol=6)

Pre- and post- correction
#scale the original data, as the transformed data has been, to allow for comparison
df_nonNormal_scaled <- as.data.frame(scale(df_nonNormal))

#melt the original data for ggplot
plot_original <- reshape2::melt(df_nonNormal_scaled)

#make a new variable for colour
plot_original$source <- 'original'
plot_yeojohnson$source <- 'transformed'

#combine with yeo-johnson transformed data
plot_compare <- rbind(plot_yeojohnson, plot_original)

#make plot
ggplot(plot_compare, aes(x=value, fill=source)) +
  geom_histogram(aes(y=..density..)) + 
  geom_density(alpha=.2)  +
  facet_wrap(~variable, scales = 'free', ncol=6) +
  theme(legend.position='bottom',
        legend.title = element_blank())


#scale the normal data
df_normal_scaled <- as.data.frame(scale(df_normal))

#combine with the scaled transformed data
df_transformed <- merge(df_normal_scaled, df_yeojohnson, by='row.names')

#put participant IDs back as rownames
rownames(df_transformed) <- df_transformed$Row.names ; df_transformed <- df_transformed[, -1]

Significance testing for univariate normality, post Yeo-Johnson. We still see some skew and kurtosis, though less than before. Now, just k=10 variables are not normal, as follows: Visual_learning, Problem_solving, TASIT_1, TASIT_2, TASIT_3, RMET, RAD, CST_right, CST_left, SLF_I_right. We are not concerned about the SLFs (these segmentations are uniquely poor and will not be included in the CCA), but it is concerning that most of the social cognition variables resist correction.
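
The results summarized here (and the multivariate test below) are assumed to come from re-running the same normality testing on the transformed dataset, e.g.:

#assumed re-run of the normality testing on the transformed data
result <- MVN::mvn(df_transformed, mvnTest = 'mardia', univariateTest = 'SW', desc = TRUE)

DT::datatable(result$univariateNormality[2:length(result$univariateNormality)]) %>%
    DT::formatRound(digits=3, columns = c('Statistic', 'p value'))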


Review significance testing of multivariate normality, post Yeo-Johnson. We also see that the dataset still does not meet the criteria for multivariate normality.

result$multivariateNormality
##              Test         Statistic              p value Result
## 1 Mardia Skewness  101698.634152625    0.922071637526136    YES
## 2 Mardia Kurtosis -3.58377920158076 0.000338658395604874     NO
## 3             MVN              <NA>                 <NA>     NO
#write out csv
write.csv(df_transformed, paste('../csvs/modified/df_CCA_noOutliersINTTransformed_', Sys.Date(), '.csv', sep=''), row.names = T)