Description. In a previous analysis, we removed highly influential points/outliers (with both a "capped" and an "interpolated" IQR method). Here, we review normality for the interpolated data (the "capped" data could not be brought to normality). After outlier treatment, we find that k=17 of the k=84 variables intended for the CCA are not normally distributed, and should therefore be transformed to normality (i.e., rendered Gaussian), as required by CCA. We use the bestNormalize package to identify the ideal transformation, and then implement it (Yeo-Johnson).
Density with normal distribution. Note that non-normality can be difficult to detect by visual inspection of these plots (the visualization is helpful, but we rely on the significance testing below).
#function to get into long format
plot_fn <- function(df) {
  df %>%
    as_tibble() %>%
    select_if(is.numeric) %>%
    gather(key='variable', value='value')
}
#make melted datasets
df_plot <- plot_fn(df)
df_plot_pns <- plot_fn(df_pns)
df_plot_spins <- plot_fn(df_spins)
#function for plot -- density with normal distribution
density_fn <- function(df){
  ggplot(df, aes(x=value)) +
    geom_density() +
    stat_overlay_normal_density(color = "red", linetype = "dashed") +
    facet_wrap(~variable, scales = 'free', ncol=6)
}
density_fn(df_plot)
density_fn(df_plot_spins)
density_fn(df_plot_pns)
Data descriptives. Here, we see that skew and kurtosis are high in several variables; interestingly, this is especially the case for the social cognition variables (in the negative direction):
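The descriptives and normality results that follow are read from a result object; a minimal sketch of how such an object could be produced with the MVN package (the mvn() call and its arguments are an assumption; this step is not shown in the code above):
#assumed sketch: univariate (Shapiro-Wilk) and multivariate (Mardia) tests, plus descriptives
library(MVN)
result <- mvn(df, mvnTest = 'mardia', univariateTest = 'SW', desc = TRUE)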
DT::datatable(result$Descriptives[9:10]) %>%
DT::formatRound(digits=3, columns = c('Skew', 'Kurtosis'))
Significance testing of univariate normality. Here, we can review the statistical test result in the Normality column (YES indicates a normal distribution). In total, k=67 of the k=84 variables intended for the CCA are normal. Note that none of the six social cognition variables are normal, and only three of the six neurocognition variables are (the remaining variables in the count are FA variables). (Also note that missing p values simply indicate values too small to display, i.e., p ≈ 0.)
DT::datatable(result$univariateNormality[2:length(result$univariateNormality)]) %>%
DT::formatRound(digits=3, columns = c('Statistic', 'p value'))
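As a quick check on the counts reported above, the Normality column can be tallied directly (a sketch; trimws() is used because the MVN output pads this column with whitespace):
#tally normal (YES) vs. non-normal (NO) variables
table(trimws(result$univariateNormality$Normality))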
Significance testing of multivariate normality. We can also conduct a multivariate test of normality (there are several; Mardia's test is used here). We see that kurtosis, and therefore the variable set as a whole, does not meet normality criteria.
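For reference, Mardia's multivariate skewness and kurtosis, for n observations on p variables with sample mean $\bar{\mathbf{x}}$ and covariance $\mathbf{S}$, are

$$
b_{1,p} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left[ (\mathbf{x}_i - \bar{\mathbf{x}})^\top \mathbf{S}^{-1} (\mathbf{x}_j - \bar{\mathbf{x}}) \right]^3,
\qquad
b_{2,p} = \frac{1}{n} \sum_{i=1}^{n} \left[ (\mathbf{x}_i - \bar{\mathbf{x}})^\top \mathbf{S}^{-1} (\mathbf{x}_i - \bar{\mathbf{x}}) \right]^2 .
$$

Under multivariate normality, $n\,b_{1,p}/6$ is asymptotically chi-squared with $p(p+1)(p+2)/6$ degrees of freedom, and $b_{2,p}$ is asymptotically normal with mean $p(p+2)$ and variance $8p(p+2)/n$; these give the skewness and kurtosis results below.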
result$multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 101781.266262134 0.891730035954104 YES
## 2 Mardia Kurtosis -3.49949471522181 0.000466140844881835 NO
## 3 MVN <NA> <NA> NO
Identify non-normal variables for transformation.
The k=17 non-normal variables in the dataset of 84 variables intended for possible inclusion in the CCA are: Verbal_learning, Visual_learning, Problem_solving, TASIT_1, TASIT_2, TASIT_3, RMET, RAD, ER_40, CC6_commissural, CST_right, CST_left, Sup_P_left, SLF_I_left, SLF_I_right, SLF_III_right, Sup_O_left.
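A minimal sketch of how these variables could be pulled out for transformation (the data frame df and the object names here are assumptions, chosen to match their use further below):
#flag variables that failed the univariate test and split the data accordingly
nonNormal <- trimws(result$univariateNormality$Variable)[
  trimws(result$univariateNormality$Normality) == 'NO']
df_nonNormal <- df[, nonNormal]
df_normal <- df[, setdiff(names(df), nonNormal)]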
Compare transformation methods.
Here, we use bestNormalize to identify the best transformation according to the Pearson P statistic (divided by its degrees of freedom). The comparative benefit of the Pearson P / df normality statistic is that it is a relatively interpretable goodness-of-fit measure, and the ratio P / df can be compared between transformations as an absolute measure of the departure from normality (if the data follow a nearly normal distribution, this ratio will be close to 1; below, the lowest values are indicated in the brightest shade of green). We calculate this value out of sample, via cross-validation with 10 folds and 5 repeats. The best transformation per variable is indicated in the selection column. We will ultimately select a single transformation to apply to all non-normal variables, according to which transformation is best overall. We see that Yeo-Johnson (also used by Lindsay) achieves the best results across the greatest number of variables (k=4) and consistently shows low values.
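For reference, the Pearson P statistic bins each (transformed) variable into k class intervals that are equiprobable under the fitted normal distribution and compares the observed with the expected counts; roughly,

$$
\frac{P}{\mathrm{df}} = \frac{1}{\mathrm{df}} \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i},
$$

where $O_i$ and $E_i$ are the observed and expected counts in class $i$.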
#run comparisons
transformComparison <- lapply(df_nonNormal, function(x) bestNormalize(x, allow_orderNorm = FALSE))
#extract goodness of fit and model selection
transform_fit <- lapply(transformComparison, `[`, 'norm_stats')
#bind together elements of the list into a dataframe
df_transform <- do.call(cbind, transform_fit) %>%
  as.data.frame() %>%
  gather() %>%
  separate(col=value, sep=',', remove=TRUE, into=c(
    'arcsinh_x',
    'boxcox',
    'exp_x',
    'log_x',
    'no_transform',
    'sqrt_x',
    'yeojohnson'))
#why did ER_40 fail? who knows... rewrite values
er_40 <- c('ER_40',
'arcsinh_x = 2.31212418300654',
'boxcox = NA',
'exp_x = 3.82',
'log_x = 2.78794934640523',
'no_transform = 2.11996732026144',
'sqrt_x = 2.30203839869281',
'yeojohnson = 4.06532679738562')
#replace row
df_transform[10,] <- er_40
#remove non-numeric components of variables
df_transform <- cbind(df_transform[1], sapply(df_transform[2:ncol(df_transform)], function(x) gsub("[^0-9.-]", "", x)))
#make sure all variables with numbers are numeric
df_transform[2:ncol(df_transform)] <- sapply(df_transform[2:ncol(df_transform)],as.numeric)
#make a final variable indicating the selected (lowest P / df) transformation
df_transform$selection <- colnames(df_transform[-1])[apply(df_transform[-1], 1, which.min)]
#round all numeric values
df_transform <- df_transform %>% mutate_if(is.numeric, round, 4)
#function to colour cell values (lowest values appear in the brightest green)
color_fn = function(x) cell_spec(x, bold = T, color = spec_color(x, direction=1, begin=.7, end=.4))
#apply function
df_transform[, 2:8] = t(apply(df_transform[,2:8], 1, color_fn))
#make a pretty kable table
df_transform %>%
kable(escape=F) %>%
kable_styling()
key | arcsinh_x | boxcox | exp_x | log_x | no_transform | sqrt_x | yeojohnson | selection |
---|---|---|---|---|---|---|---|---|
Verbal_learning | 1.1061 | 1.1306 | 19.8055 | 1.1061 | 1.1073 | 1.1358 | 1.1306 | arcsinh_x |
Visual_learning | 1.9482 | 1.167 | 18.344 | 1.9482 | 1.21 | 1.5774 | 1.1752 | boxcox |
Problem_solving | 1.1542 | 1.2665 | 18.5869 | 1.1542 | 1.1946 | 1.1214 | 1.2665 | sqrt_x |
TASIT_1 | 2.0129 | 1.9454 | 7.0197 | 2.0129 | 1.9436 | 1.9155 | 1.9415 | sqrt_x |
TASIT_2 | 2.0271 | 1.3696 | 13.6901 | 2.0271 | 1.5327 | 1.7885 | 1.4234 | boxcox |
TASIT_3 | 1.3923 | 1.1994 | 15.6075 | 1.3923 | 1.21 | 1.2745 | 1.1908 | yeojohnson |
RMET | 1.8187 | 1.4678 | 14.3503 | 1.8187 | 1.5044 | 1.5822 | 1.441 | yeojohnson |
RAD | 2.9671 | 1.7906 | 15.8596 | 2.9671 | 2.3286 | 2.6793 | 1.487 | yeojohnson |
ER_40 | 1.2363 | 0 | 3.5455 | 1.3985 | 1.9366 | 1.1528 | NA | boxcox |
ER_40 | 2.3121 | NA | 3.82 | 2.7879 | 2.12 | 2.302 | 4.0653 | no_transform |
CST_right | 1.0874 | 1.0357 | 1.0116 | 1.0787 | 1.0562 | 1.1075 | 1.035 | exp_x |
CST_left | 1.5632 | 1.482 | 1.5106 | 1.5833 | 1.5796 | 1.603 | 1.4433 | yeojohnson |
Sup_P_left | 1.0931 | 1.1681 | 1.0858 | 1.1628 | 1.0931 | 1.121 | 1.1873 | exp_x |
SLF_I_left | 1.2052 | 1.245 | 1.2246 | 1.317 | 1.193 | 1.2132 | 1.2527 | no_transform |
SLF_I_right | 1.2681 | 1.271 | 1.2438 | 1.2898 | 1.2557 | 1.28 | 1.2788 | exp_x |
SLF_III_right | 1.0231 | 1.0629 | 1.067 | 1.0757 | 1.0151 | 1.051 | 1.0677 | no_transform |
Sup_O_left | 1.0537 | 0.9357 | 1.0244 | 1.1441 | 1.0576 | 1.0949 | 0.9357 | boxcox |
Application of Yeo-Johnson transformation
The Yeo-Johnson transformation, proposed by Yeo and Johnson in 2000, finds the value of lambda (see the equation below) that minimizes the Kullback-Leibler distance between the normal distribution and the transformed distribution. As with the Box-Cox λ, this parameter can be estimated via maximum likelihood. The plots below show the success of this transformation. The first, Transformed distribution, shows the new scaled data values, with the actual density distribution in black and a normal distribution in blue. The second, Pre- and post-correction, visualizes the density of the data before (pink) and after (blue) application of the Yeo-Johnson transformation.
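For reference, the Yeo-Johnson family is defined piecewise over λ and the sign of the data value y:

$$
\psi(y, \lambda) =
\begin{cases}
\dfrac{(y+1)^{\lambda} - 1}{\lambda} & \lambda \neq 0,\; y \geq 0 \\[4pt]
\log(y+1) & \lambda = 0,\; y \geq 0 \\[4pt]
-\dfrac{(1-y)^{2-\lambda} - 1}{2-\lambda} & \lambda \neq 2,\; y < 0 \\[4pt]
-\log(1-y) & \lambda = 2,\; y < 0
\end{cases}
$$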
#Yeo-Johnson's transformation, to all non-normal variables
yeojohnson <- lapply(df_nonNormal, function(x) yeojohnson(x))
#pull out new data - this will comprise the plots of new distributions
yeojohnson <- lapply(yeojohnson, `[`, 'x.t')
#bind data together, into a dataframe (instead of list)
df_yeojohnson <- as.data.frame(matrix(unlist(yeojohnson), nrow=length(unlist(yeojohnson[1]))))
names(df_yeojohnson) <- nonNormal
row.names(df_yeojohnson) <- row.names(df_nonNormal)
#melt the df for ggplot
plot_yeojohnson <- reshape2::melt(df_yeojohnson)
#cleanup
rm(yeojohnson)
ggplot(plot_yeojohnson, aes(x=value)) +
geom_histogram(aes(y=..density..)) +
geom_density(alpha=.2, fill='white') +
stat_overlay_normal_density(color = "blue", linetype = "dashed") +
facet_wrap(~variable, scales = 'free', ncol=6)
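If of interest, the maximum-likelihood λ estimate for each variable can also be inspected; a minimal sketch (re-fitting here, since the yeojohnson objects above were reduced to x.t and then removed):
#estimated Yeo-Johnson lambda per non-normal variable (inspection step, not in the original analysis)
yj_lambdas <- sapply(df_nonNormal, function(x) yeojohnson(x)$lambda)
round(yj_lambdas, 3)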
#scale the original data, as the transformed data has been, to allow for comparison
df_nonNormal_scaled <- as.data.frame(scale(df_nonNormal))
#melt the original data for ggplot
plot_original <- reshape2::melt(df_nonNormal_scaled)
#make a new variable for colour
plot_original$source <- 'original'
plot_yeojohnson$source <- 'transformed'
#combine with yeo-johnson transformed data
plot_compare <- rbind(plot_yeojohnson, plot_original)
#make plot
ggplot(plot_compare, aes(x=value, fill=source)) +
geom_histogram(aes(y=..density..)) +
geom_density(alpha=.2) +
facet_wrap(~variable, scales = 'free', ncol=6) +
theme(legend.position='bottom',
legend.title = element_blank())
#scale the normal data
df_normal_scaled <- as.data.frame(scale(df_normal))
#combine with the scaled transformed data
df_transformed <- merge(df_normal_scaled, df_yeojohnson, by='row.names')
#put participant IDs back as rownames
rownames(df_transformed) <- df_transformed$Row.names ; df_transformed <- df_transformed[, -1]
Significance testing for univariate normality, post Yeo-Johnson. We still see some skew and kurtosis, though less than before. Now, just k=10 variables are not normal, as follows: Visual_learning, Problem_solving, TASIT_1, TASIT_2, TASIT_3, RMET, RAD, CST_right, CST_left, SLF_I_right. We are not concerned about the SLFs (these segmentations are uniquely poor and will not be included in the CCA), but it is concerning that the social cognition variables largely resist correction.
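The post-transformation results reviewed here assume the normality tests were re-run on the transformed data; a minimal sketch (again assuming the MVN package and arguments used above):
#re-run univariate and multivariate normality tests on the transformed dataset
result <- mvn(df_transformed, mvnTest = 'mardia', univariateTest = 'SW', desc = TRUE)
DT::datatable(result$univariateNormality[2:length(result$univariateNormality)]) %>%
  DT::formatRound(digits=3, columns = c('Statistic', 'p value'))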
Review significance testing of multivariate normality, post Yeo-Johnson. We also see that the dataset still does not meet the criteria for multivariate normality.
result$multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 101698.634152625 0.922071637526136 YES
## 2 Mardia Kurtosis -3.58377920158076 0.000338658395604874 NO
## 3 MVN <NA> <NA> NO
#write out csv
write.csv(df_transformed, paste('../csvs/modified/df_CCA_noOutliersINTTransformed_', Sys.Date(), '.csv', sep=''), row.names = T)