Background. In preceding analyses, we found that, even post-transformation, 10 of the variables to be included in the CCA were not normally distributed (all 6 social cognition variables, 3 of the 6 neurocognition variables, and 1 estimate of FA). Jumping ahead, we also found that our CCA had a significant test statistic (Roy’s largest root) when compared to the theoretical F distribution, but not when the same statistic was compared to a permutation distribution built from randomized empirical data, suggesting an important deviation from normality.
Purpose. Our purpose here is to understand how our variables diverge from the F distribution assumed by multivariate significance tests (which, at our sample size, is considered to be well approximated by a normal distribution), and to determine whether the empirical distributions are better approximated by a non-normal theoretical distribution. Because our data are (i) continuous and (ii) asymmetric, and (iii) include both negative and positive outliers, we review the Weibull, lognormal, and gamma distributions. Goodness-of-fit tests suggest that the social cognition tests are well described by the Weibull distribution, and the neurocognition tests by the gamma distribution.
Damodaran, Probabilistic Approaches to Risk
Review candidate distributions.
Below we show Cullen and Frey graphs, which plot kurtosis against squared skewness (based on the fourth and third standardized moments, respectively) and can help us identify a suitable distribution. The blue dot represents our data, and 500 bootstrapped estimates are shown in yellow. The other symbols and lines represent theoretical distributions. Summary statistics that can help choose candidates are shown below (a non-zero skewness reveals a lack of symmetry in the empirical distribution, while the kurtosis quantifies the weight of the tails relative to the normal distribution, for which the kurtosis is 3). Note that these plots, though shown on pre-transformed data, would look identical after a shift or rescaling of the data, as skewness and kurtosis are invariant to location and scale. I think that in our case the beta distribution isn’t appropriate, as our data is not bounded (and the beta distribution is). In most cases, it looks like the Weibull, lognormal, and gamma distributions are most appropriate.
## summary statistics
## ------
## min: 97.36918 max: 101.6429
## median: 100.2183
## mean: 100
## estimated sd: 1
## estimated skewness: -0.5490273
## estimated kurtosis: 2.783908
## summary statistics
## ------
## min: 97.26561 max: 101.4561
## median: 100.1667
## mean: 100
## estimated sd: 1
## estimated skewness: -0.7634337
## estimated kurtosis: 2.838951
## summary statistics
## ------
## min: 97.36619 max: 101.8022
## median: 100.1196
## mean: 100
## estimated sd: 1
## estimated skewness: -0.4669279
## estimated kurtosis: 2.574405
## summary statistics
## ------
## min: 97.38306 max: 101.8602
## median: 100.28
## mean: 100
## estimated sd: 1
## estimated skewness: -0.494101
## estimated kurtosis: 2.732824
## summary statistics
## ------
## min: 97.51778 max: 101.335
## median: 100.3246
## mean: 100
## estimated sd: 1
## estimated skewness: -0.9588261
## estimated kurtosis: 2.943157
## summary statistics
## ------
## min: 97.48794 max: 101.855
## median: 100.1466
## mean: 100
## estimated sd: 1
## estimated skewness: -0.5678609
## estimated kurtosis: 2.726506
## summary statistics
## ------
## min: 98.10316 max: 102.941
## median: 99.98455
## mean: 100
## estimated sd: 1
## estimated skewness: 0.2206722
## estimated kurtosis: 2.441465
## summary statistics
## ------
## min: 97.46088 max: 101.7152
## median: 100.143
## mean: 100
## estimated sd: 1
## estimated skewness: -0.6183466
## estimated kurtosis: 2.724818
## summary statistics
## ------
## min: 97.41447 max: 101.8081
## median: 99.96888
## mean: 100
## estimated sd: 1
## estimated skewness: -0.1538115
## estimated kurtosis: 2.345181
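The skewness and kurtosis values reported above, and their invariance to location and scale, can be illustrated with a minimal sketch. It uses plain moment estimators, so it will not exactly reproduce the "estimated" values in the output, which may include a small-sample bias correction. The function name and the toy sample are hypothetical and serve only as illustration.

```python
import math


def skewness_and_kurtosis(values):
    """Plain moment estimators: m3 / m2**1.5 and m4 / m2**2 (normal kurtosis = 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Hypothetical right-skewed sample, for illustration only.
sample = [1.0, 2.0, 3.0, 4.0, 10.0]

original = skewness_and_kurtosis(sample)
# A shift and positive rescaling leaves both quantities unchanged ...
rescaled = skewness_and_kurtosis([100.0 + 3.0 * v for v in sample])
# ... whereas a nonlinear transformation such as the log does not.
logged = skewness_and_kurtosis([math.log(v) for v in sample])

print("original:", original)
print("rescaled:", rescaled)
print("logged:  ", logged)
```

This is why a variable can be moved to a positive range (here, mean 100 and sd 1) without changing where it falls on a Cullen and Frey graph.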
Graphically compare select distributions.
On the basis of the Cullen and Frey graphs, which suggest distributions that may be appropriate, we elected to compare the Weibull, lognormal, and gamma distributions alongside our empirical data distribution. Note: we do not perform statistical tests to determine appropriate distributions because, with large sample sizes, a test can have so much power that trivial departures from the distribution produce statistically significant results (significance indicating lack of fit). In the following plots, it seems that all social cognition tests may be consistent with the Weibull distribution. The best fit for neurocognition is less (visually) evident. Specifically, the plots indicate:
Compare goodness of fit across select distributions.
Several goodness-of-fit tests exist, and they are not equally sensitive to different types of deviation between the empirical and theoretical distributions. For example, the Kolmogorov-Smirnov test is sensitive when distributions differ near their centre; the Cramér-von Mises test is more sensitive when there are small but repetitive differences between the empirical and theoretical distributions; the Anderson-Darling test is more sensitive when distributions differ in their tails. In all cases, we also review the Akaike information criterion (AIC), which is an estimator of out-of-sample prediction error and thereby of the relative quality of statistical models. The AIC estimates the relative amount of information lost by each model: the less information a model loses, the higher the quality of that model, so a lower value indicates a better fit. Interestingly, Weibull has the lowest AIC for all social cognition tests, and gamma for the neurocognition tests.
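To make the three statistics named above concrete, here is a minimal sketch that fits a Weibull distribution by maximum likelihood and then computes the Kolmogorov-Smirnov, Cramér-von Mises, and Anderson-Darling statistics from the fitted CDF. It is not the code that produced the output below, which appears to come from an R fitting package. The function names, the bisection solver, and the simulated sample are all assumptions made for illustration.

```python
import math
import random


def fit_weibull_mle(xs):
    """Return (shape, scale) maximising the Weibull likelihood for positive xs."""
    n = len(xs)
    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / n
    x_max = max(xs)
    zs = [x / x_max for x in xs]  # rescale so z**k cannot overflow

    def score(k):
        # Profile score for the shape: increasing in k, zero at the MLE.
        w = [z ** k for z in zs]
        return sum(wi * li for wi, li in zip(w, logs)) / sum(w) - 1.0 / k - mean_log

    lo, hi = 1e-6, 1.0
    while score(hi) < 0:  # widen the bracket until it contains the root
        hi *= 2.0
    for _ in range(200):  # plain bisection
        mid = 0.5 * (lo + hi)
        if score(mid) < 0:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    scale = x_max * (sum(z ** shape for z in zs) / n) ** (1.0 / shape)
    return shape, scale


def gof_statistics(xs, cdf):
    """Return (KS, Cramer-von Mises, Anderson-Darling) for data xs against cdf."""
    n = len(xs)
    f = [cdf(x) for x in sorted(xs)]
    # KS: largest vertical gap between the empirical and theoretical CDFs.
    ks = max(max((i + 1) / n - fi, fi - i / n) for i, fi in enumerate(f))
    # CvM: accumulated squared gaps across the whole range.
    cvm = 1.0 / (12 * n) + sum((fi - (2 * i + 1) / (2 * n)) ** 2 for i, fi in enumerate(f))
    # AD: the log terms give extra weight to discrepancies in the tails.
    ad = -n - sum(
        (2 * i + 1) * (math.log(f[i]) + math.log(1.0 - f[n - 1 - i])) for i in range(n)
    ) / n
    return ks, cvm, ad


# Simulated stand-in for one variable (hypothetical; not the study data).
random.seed(0)
sample = [random.weibullvariate(100.0, 120.0) for _ in range(174)]
shape, scale = fit_weibull_mle(sample)
ks, cvm, ad = gof_statistics(sample, lambda x: 1.0 - math.exp(-((x / scale) ** shape)))
print(f"shape={shape:.2f} scale={scale:.2f} KS={ks:.4f} CvM={cvm:.4f} AD={ad:.4f}")
```

For all three statistics, smaller values indicate closer agreement between the data and the fitted distribution, which is how the columns in the output below are compared.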
## Goodness-of-fit statistics
## Weibull lognormal gamma
## Kolmogorov-Smirnov statistic 0.1091686 0.1399918 0.139423
## Cramer-von Mises statistic 0.3917458 0.4754609 0.472199
## Anderson-Darling statistic 2.3654409 2.9852493 2.963973
##
## Goodness-of-fit criteria
## Weibull lognormal gamma
## Akaike's Information Criterion 484.2635 497.7518 497.4257
## Bayesian Information Criterion 490.5816 504.0699 503.7438
## Goodness-of-fit statistics
## Weibull lognormal gamma
## Kolmogorov-Smirnov statistic 0.08611888 0.1140297 0.1134703
## Cramer-von Mises statistic 0.17000055 0.5230725 0.5169119
## Anderson-Darling statistic 1.33194888 3.5700645 3.5336596
##
## Goodness-of-fit criteria
## Weibull lognormal gamma
## Akaike's Information Criterion 468.0565 498.1191 497.6709
## Bayesian Information Criterion 474.3746 504.4372 503.9891
## Goodness-of-fit statistics
## Weibull lognormal gamma
## Kolmogorov-Smirnov statistic 0.0690982 0.08464583 0.08455328
## Cramer-von Mises statistic 0.1073189 0.20095425 0.19755782
## Anderson-Darling statistic 0.6890720 1.37861493 1.35699048
##
## Goodness-of-fit criteria
## Weibull lognormal gamma
## Akaike's Information Criterion 487.5941 497.6080 497.3301
## Bayesian Information Criterion 493.9122 503.9261 503.6482
## Goodness-of-fit statistics
## Weibull lognormal gamma
## Kolmogorov-Smirnov statistic 0.08339626 0.1407987 0.1402443
## Cramer-von Mises statistic 0.15674962 0.3832250 0.3781606
## Anderson-Darling statistic 0.94891887 2.0996778 2.0717640
##
## Goodness-of-fit criteria
## Weibull lognormal gamma
## Akaike's Information Criterion 488.9946 497.6570 497.3625
## Bayesian Information Criterion 495.3128 503.9751 503.6806
## Goodness-of-fit statistics
## Weibull lognormal gamma
## Kolmogorov-Smirnov statistic 0.1137354 0.185395 0.1847622
## Cramer-von Mises statistic 0.4201826 1.199827 1.1904235
## Anderson-Darling statistic 2.6644963 6.768894 6.7188064
##
## Goodness-of-fit criteria
## Weibull lognormal gamma
## Akaike's Information Criterion 452.4309 498.4539 497.8946
## Bayesian Information Criterion 458.7490 504.7720 504.2127
## Goodness-of-fit statistics
## Weibull lognormal gamma
## Kolmogorov-Smirnov statistic 0.04708707 0.07011926 0.06993143
## Cramer-von Mises statistic 0.06490998 0.26152576 0.25736510
## Anderson-Darling statistic 0.44286706 1.78154207 1.75477067
##
## Goodness-of-fit criteria
## Weibull lognormal gamma
## Akaike's Information Criterion 482.8315 497.7831 497.4467
## Bayesian Information Criterion 489.1497 504.1012 503.7648
## Goodness-of-fit statistics
## Weibull lognormal gamma
## Kolmogorov-Smirnov statistic 0.1016918 0.07866827 0.07881389
## Cramer-von Mises statistic 0.3418518 0.11277359 0.11350090
## Anderson-Darling statistic 2.4574078 0.75087375 0.75579474
##
## Goodness-of-fit criteria
## Weibull lognormal gamma
## Akaike's Information Criterion 527.7688 496.4244 496.5410
## Bayesian Information Criterion 534.0869 502.7425 502.8591
## Goodness-of-fit statistics
## Weibull lognormal gamma
## Kolmogorov-Smirnov statistic 0.05722486 0.1047389 0.1041203
## Cramer-von Mises statistic 0.06122281 0.3488407 0.3435403
## Anderson-Darling statistic 0.43078817 2.1939108 2.1615515
##
## Goodness-of-fit criteria
## Weibull lognormal gamma
## Akaike's Information Criterion 479.0047 497.8694 497.5043
## Bayesian Information Criterion 485.3228 504.1875 503.8225
## Goodness-of-fit statistics
## Weibull lognormal gamma
## Kolmogorov-Smirnov statistic 0.1047548 0.07041396 0.07021146
## Cramer-von Mises statistic 0.3010507 0.12275933 0.12322818
## Anderson-Darling statistic 1.7919436 0.88595890 0.88609301
##
## Goodness-of-fit criteria
## Weibull lognormal gamma
## Akaike's Information Criterion 504.0895 497.0675 496.9698
## Bayesian Information Criterion 510.4076 503.3856 503.2879
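The relation between the two criteria reported above can be sketched as follows, using the standard definitions AIC = 2k − 2 log L and BIC = k ln(n) − 2 log L. The sketch assumes each candidate distribution has k = 2 free parameters; the function names are hypothetical. Under that assumption, the constant gap between BIC and AIC in the output (about 6.318) would correspond to a sample of roughly 174 observations, an inference from the printed values rather than something stated in the text.

```python
import math


def aic(loglik, k):
    """Akaike information criterion for log-likelihood loglik and k parameters."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian information criterion for n observations."""
    return k * math.log(n) - 2 * loglik


def implied_sample_size(aic_value, bic_value, k):
    """Solve BIC - AIC = k * (ln(n) - 2) for n."""
    return math.exp((bic_value - aic_value) / k + 2.0)


# Values copied from the first block of output above (Weibull, lognormal, gamma).
aic_values = {"Weibull": 484.2635, "lognormal": 497.7518, "gamma": 497.4257}
k = 2  # assumption: two free parameters per distribution

# Differences from the best (lowest) AIC; larger gaps mean weaker support.
best = min(aic_values.values())
delta_aic = {name: value - best for name, value in aic_values.items()}

# Recover the Weibull log-likelihood and the sample size implied by the BIC.
weibull_loglik = (2 * k - aic_values["Weibull"]) / 2
n_implied = implied_sample_size(aic_values["Weibull"], 490.5816, k)

print(delta_aic)
print(weibull_loglik, n_implied)
```

Because every candidate has the same number of parameters under this assumption, ranking by AIC, by BIC, or by log-likelihood gives the same order within each block.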