08c | Determine distributions

Background. In preceding analyses, we found that, even after transformation, 10 of the variables to be included in the CCA were not normally distributed (all 6 social cognition variables, 3 of the 6 neurocognition variables, and one estimate of FA). Jumping ahead, we also found that our CCA had a significant test statistic (Roy’s largest root) when compared to the theoretical F distribution, but not when compared to a null distribution obtained by permuting the empirical data, suggesting an important deviation from normality.

Purpose. Our purpose here is to understand how our variables diverge from the normality that the F-based multivariate significance tests assume (an assumption that, at our sample size, would ordinarily be considered well approximated), and to determine whether the empirical distributions are better approximated by a non-normal theoretical distribution. Because our data are (i) continuous and (ii) asymmetric, and because (iii) we observe both negative and positive outliers, we review the Weibull, lognormal, and gamma distributions. Goodness-of-fit statistics suggest that the social cognition tests, and visual learning, are best described by the Weibull distribution, whereas verbal learning and problem solving are described about equally well by the lognormal and gamma distributions.

Damodaran, Probabilistic Approaches to Risk

Review candidate distributions.

Below we show Cullen and Frey graphs, which plot kurtosis against squared skewness (skewness and kurtosis being the third and fourth standardized moments) and can help us identify a suitable distribution. The blue dot represents our data, and bootstrapped estimates (500 resamples) are shown in yellow. The other symbols and lines represent theoretical distributions. Summary statistics that can help choose candidates are shown below each variable name: a non-zero skewness reveals a lack of symmetry in the empirical distribution, while the kurtosis quantifies the weight of the tails relative to the normal distribution, for which the kurtosis is 3. Note that these plots are unaffected by a linear change of location or scale, because skewness and kurtosis are invariant to both; this invariance does not extend to nonlinear transformations. In our case, the beta distribution is not appropriate, as our data are not bounded (and the beta distribution is). In most cases, the Weibull, lognormal, and gamma distributions look most appropriate.
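To make the two axes of the Cullen and Frey graph concrete, here is a minimal sketch in Python (standard library only) of how skewness, kurtosis, and the bootstrap cloud can be computed. It uses the plain moment estimators; the estimator behind the "estimated skewness" and "estimated kurtosis" lines below may apply a small-sample correction, so values can differ slightly. The data and the function names `skewness_kurtosis` and `bootstrap_points` are hypothetical and are not taken from the study.

```python
import math
import random

def skewness_kurtosis(xs):
    """Return (skewness, kurtosis): the third and fourth standardized moments.

    Plain moment estimators (dividing by n). Under this convention a normal
    distribution has skewness 0 and kurtosis 3.
    """
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

def bootstrap_points(xs, n_boot=500, seed=0):
    """Resample with replacement; return one (skewness**2, kurtosis) per resample."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_boot):
        resample = [rng.choice(xs) for _ in xs]
        s, k = skewness_kurtosis(resample)
        points.append((s ** 2, k))
    return points

# Hypothetical left-skewed scores (illustration only, not study data).
rng = random.Random(1)
raw = [40 - rng.gammavariate(2.0, 3.0) for _ in range(200)]

# Rescale to mean 100 and sd 1: a linear change of location and scale.
mean = sum(raw) / len(raw)
sd = math.sqrt(sum((x - mean) ** 2 for x in raw) / (len(raw) - 1))
scaled = [100 + (x - mean) / sd for x in raw]

skew_raw, kurt_raw = skewness_kurtosis(raw)
skew_scaled, kurt_scaled = skewness_kurtosis(scaled)

# The cloud of points that would surround the observed (blue) point.
cloud = bootstrap_points(scaled)
```

Here `skew_raw` and `skew_scaled` agree up to floating-point error, as do the two kurtosis values, which is why a linear rescaling leaves the graph unchanged. A nonlinear transformation such as a logarithm would move the point.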

TASIT 1

## summary statistics
## ------
## min:  97.36918   max:  101.6429 
## median:  100.2183 
## mean:  100 
## estimated sd:  1 
## estimated skewness:  -0.5490273 
## estimated kurtosis:  2.783908

TASIT 2

## summary statistics
## ------
## min:  97.26561   max:  101.4561 
## median:  100.1667 
## mean:  100 
## estimated sd:  1 
## estimated skewness:  -0.7634337 
## estimated kurtosis:  2.838951

TASIT 3

## summary statistics
## ------
## min:  97.36619   max:  101.8022 
## median:  100.1196 
## mean:  100 
## estimated sd:  1 
## estimated skewness:  -0.4669279 
## estimated kurtosis:  2.574405

RMET

## summary statistics
## ------
## min:  97.38306   max:  101.8602 
## median:  100.28 
## mean:  100 
## estimated sd:  1 
## estimated skewness:  -0.494101 
## estimated kurtosis:  2.732824

RAD

## summary statistics
## ------
## min:  97.51778   max:  101.335 
## median:  100.3246 
## mean:  100 
## estimated sd:  1 
## estimated skewness:  -0.9588261 
## estimated kurtosis:  2.943157

ER-40

## summary statistics
## ------
## min:  97.48794   max:  101.855 
## median:  100.1466 
## mean:  100 
## estimated sd:  1 
## estimated skewness:  -0.5678609 
## estimated kurtosis:  2.726506

Verbal learning

## summary statistics
## ------
## min:  98.10316   max:  102.941 
## median:  99.98455 
## mean:  100 
## estimated sd:  1 
## estimated skewness:  0.2206722 
## estimated kurtosis:  2.441465

Visual learning

## summary statistics
## ------
## min:  97.46088   max:  101.7152 
## median:  100.143 
## mean:  100 
## estimated sd:  1 
## estimated skewness:  -0.6183466 
## estimated kurtosis:  2.724818

Problem solving

## summary statistics
## ------
## min:  97.41447   max:  101.8081 
## median:  99.96888 
## mean:  100 
## estimated sd:  1 
## estimated skewness:  -0.1538115 
## estimated kurtosis:  2.345181

Graphically compare select distributions.

On the basis of the Cullen and Frey graphs, which suggest distributions that may be appropriate, we elected to compare the Weibull, lognormal, and gamma distributions alongside our empirical data distribution. Note: we do not use statistical tests to select a distribution, because with large sample sizes such tests can have so much power that trivial departures from the distribution produce statistically significant results (significance indicating lack of fit). In the following plots, all social cognition tests appear consistent with the Weibull distribution. The best fit for the neurocognition tests is less evident visually. Each figure contains four panels:

[a] Histogram and theoretical densities. The underlying histogram shows the distribution of our empirical data on a density scale. The coloured lines show the fitted density of each named theoretical distribution.
[b] Empirical and theoretical CDFs. The empirical CDF usually approximates the true CDF quite well, so comparing it with each fitted CDF shows where a candidate distribution departs from the data. The CDF plot may be considered the basic classical goodness-of-fit plot.
[c] Q-Q plot. This plot shows, for each value of the data set, the quantiles of the theoretical fitted distribution (x-axis) against the empirical quantiles of our data (y-axis). The Q-Q plot emphasizes the lack-of-fit at the distribution tails.
[d] P-P plot. This plot shows, for each value of the data set, the cumulative distribution function of the fitted distribution (x-axis) against the empirical cumulative distribution function (y-axis). The P-P plot emphasizes the lack-of-fit at the distribution center.
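The coordinates behind panels [c] and [d] can be sketched as follows for a Weibull fit. This is a minimal illustration in Python (standard library only), not the code that produced the figures. The shape and scale values are hypothetical, and the plotting positions (i - 0.5) / n are one common convention that is assumed here.

```python
import math

def weibull_cdf(x, shape, scale):
    """Weibull CDF: F(x) = 1 - exp(-(x / scale) ** shape), for x >= 0."""
    return 1.0 - math.exp(-((x / scale) ** shape))

def weibull_quantile(p, shape, scale):
    """Inverse of the Weibull CDF, for 0 < p < 1."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)

def qq_pp_points(xs, shape, scale):
    """Return Q-Q and P-P coordinates as lists of (theoretical, empirical) pairs."""
    data = sorted(xs)
    n = len(data)
    # Plotting positions (i - 0.5) / n: an assumed convention.
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    # Q-Q: fitted quantile (x-axis) against ordered observation (y-axis).
    qq = [(weibull_quantile(p, shape, scale), x) for p, x in zip(probs, data)]
    # P-P: fitted CDF at each observation (x-axis) against empirical probability (y-axis).
    pp = [(weibull_cdf(x, shape, scale), p) for p, x in zip(probs, data)]
    return qq, pp

# Hypothetical parameters, with data placed exactly on the fitted quantiles,
# so both sets of points should fall on the identity line.
shape, scale = 120.0, 100.5
n = 50
ideal = [weibull_quantile((i - 0.5) / n, shape, scale) for i in range(1, n + 1)]
qq, pp = qq_pp_points(ideal, shape, scale)
max_qq_gap = max(abs(t - e) for t, e in qq)
max_pp_gap = max(abs(t - e) for t, e in pp)
```

With real data the points scatter around the identity line. Departures in the Q-Q pairs are largest where quantiles are far apart (the tails), while departures in the P-P pairs are largest where the CDF is steep (the center), which matches the emphasis described for each panel.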

TASIT 1

TASIT 2

TASIT 3

RMET

RAD

ER-40

Verbal learning

Visual learning

Problem solving

Compare goodness of fit across select distributions.

Several goodness-of-fit statistics exist, and they are not equally sensitive to different types of deviation between the empirical and theoretical distributions. For example, the Kolmogorov-Smirnov statistic is sensitive when distributions differ near their centre; the Cramer-von Mises statistic is more sensitive when there are small but repeated differences between the empirical and theoretical distributions; and the Anderson-Darling statistic is more sensitive when distributions differ in their tails. For all three, smaller values indicate a closer fit. In all cases, we also review the Akaike information criterion (AIC), an estimator of out-of-sample prediction error and thereby of the relative quality of statistical models. The AIC estimates the relative amount of information lost by each model: the less information a model loses, the higher its quality, so a lower value indicates a better fit. Weibull has the lowest AIC for all six social cognition tests and for visual learning. For verbal learning and problem solving, lognormal and gamma are nearly tied (lognormal is marginally lower for verbal learning, gamma for problem solving), and both are clearly lower than Weibull.
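As a reading aid for the tables that follow, here is a minimal sketch in Python (standard library only) of the three distance statistics, using their standard textbook formulas. It is not the code that generated the tables. The lognormal is used because its maximum-likelihood fit has a closed form; the sample is simulated, and the function names are hypothetical.

```python
import math
import random

def gof_statistics(xs, cdf):
    """Return (Kolmogorov-Smirnov, Cramer-von Mises, Anderson-Darling) statistics."""
    data = sorted(xs)
    n = len(data)
    u = [cdf(x) for x in data]  # fitted CDF at each ordered observation
    # KS: largest vertical gap between the empirical and fitted CDFs.
    ks = max(max(i / n - u[i - 1], u[i - 1] - (i - 1) / n) for i in range(1, n + 1))
    # CvM: sum of squared gaps, so many small deviations accumulate.
    cvm = 1.0 / (12 * n) + sum(
        (u[i - 1] - (2 * i - 1) / (2 * n)) ** 2 for i in range(1, n + 1)
    )
    # AD: log terms give extra weight to deviations in both tails.
    ad = -n - sum(
        (2 * i - 1) * (math.log(u[i - 1]) + math.log(1.0 - u[n - i]))
        for i in range(1, n + 1)
    ) / n
    return ks, cvm, ad

def lognormal_mle(xs):
    """Closed-form maximum-likelihood estimates (mu, sigma) on the log scale."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma

def lognormal_cdf(x, mu, sigma):
    """Lognormal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

# Simulated sample (illustration only, not study data).
rng = random.Random(2)
sample = [rng.lognormvariate(4.6, 0.01) for _ in range(200)]
mu, sigma = lognormal_mle(sample)

# Statistics for the fitted model, and for a deliberately shifted one.
good = gof_statistics(sample, lambda x: lognormal_cdf(x, mu, sigma))
bad = gof_statistics(sample, lambda x: lognormal_cdf(x, mu + sigma, sigma))
```

All three entries of `good` come out smaller than the corresponding entries of `bad`, which is the sense in which a lower value in the tables indicates a closer fit.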

TASIT 1

## Goodness-of-fit statistics
##                                Weibull lognormal    gamma
## Kolmogorov-Smirnov statistic 0.1091686 0.1399918 0.139423
## Cramer-von Mises statistic   0.3917458 0.4754609 0.472199
## Anderson-Darling statistic   2.3654409 2.9852493 2.963973
## 
## Goodness-of-fit criteria
##                                 Weibull lognormal    gamma
## Akaike's Information Criterion 484.2635  497.7518 497.4257
## Bayesian Information Criterion 490.5816  504.0699 503.7438

TASIT 2

## Goodness-of-fit statistics
##                                 Weibull lognormal     gamma
## Kolmogorov-Smirnov statistic 0.08611888 0.1140297 0.1134703
## Cramer-von Mises statistic   0.17000055 0.5230725 0.5169119
## Anderson-Darling statistic   1.33194888 3.5700645 3.5336596
## 
## Goodness-of-fit criteria
##                                 Weibull lognormal    gamma
## Akaike's Information Criterion 468.0565  498.1191 497.6709
## Bayesian Information Criterion 474.3746  504.4372 503.9891

TASIT 3

## Goodness-of-fit statistics
##                                Weibull  lognormal      gamma
## Kolmogorov-Smirnov statistic 0.0690982 0.08464583 0.08455328
## Cramer-von Mises statistic   0.1073189 0.20095425 0.19755782
## Anderson-Darling statistic   0.6890720 1.37861493 1.35699048
## 
## Goodness-of-fit criteria
##                                 Weibull lognormal    gamma
## Akaike's Information Criterion 487.5941  497.6080 497.3301
## Bayesian Information Criterion 493.9122  503.9261 503.6482

RMET

## Goodness-of-fit statistics
##                                 Weibull lognormal     gamma
## Kolmogorov-Smirnov statistic 0.08339626 0.1407987 0.1402443
## Cramer-von Mises statistic   0.15674962 0.3832250 0.3781606
## Anderson-Darling statistic   0.94891887 2.0996778 2.0717640
## 
## Goodness-of-fit criteria
##                                 Weibull lognormal    gamma
## Akaike's Information Criterion 488.9946  497.6570 497.3625
## Bayesian Information Criterion 495.3128  503.9751 503.6806

RAD

## Goodness-of-fit statistics
##                                Weibull lognormal     gamma
## Kolmogorov-Smirnov statistic 0.1137354  0.185395 0.1847622
## Cramer-von Mises statistic   0.4201826  1.199827 1.1904235
## Anderson-Darling statistic   2.6644963  6.768894 6.7188064
## 
## Goodness-of-fit criteria
##                                 Weibull lognormal    gamma
## Akaike's Information Criterion 452.4309  498.4539 497.8946
## Bayesian Information Criterion 458.7490  504.7720 504.2127

ER-40

## Goodness-of-fit statistics
##                                 Weibull  lognormal      gamma
## Kolmogorov-Smirnov statistic 0.04708707 0.07011926 0.06993143
## Cramer-von Mises statistic   0.06490998 0.26152576 0.25736510
## Anderson-Darling statistic   0.44286706 1.78154207 1.75477067
## 
## Goodness-of-fit criteria
##                                 Weibull lognormal    gamma
## Akaike's Information Criterion 482.8315  497.7831 497.4467
## Bayesian Information Criterion 489.1497  504.1012 503.7648

Verbal learning

## Goodness-of-fit statistics
##                                Weibull  lognormal      gamma
## Kolmogorov-Smirnov statistic 0.1016918 0.07866827 0.07881389
## Cramer-von Mises statistic   0.3418518 0.11277359 0.11350090
## Anderson-Darling statistic   2.4574078 0.75087375 0.75579474
## 
## Goodness-of-fit criteria
##                                 Weibull lognormal    gamma
## Akaike's Information Criterion 527.7688  496.4244 496.5410
## Bayesian Information Criterion 534.0869  502.7425 502.8591

Visual learning

## Goodness-of-fit statistics
##                                 Weibull lognormal     gamma
## Kolmogorov-Smirnov statistic 0.05722486 0.1047389 0.1041203
## Cramer-von Mises statistic   0.06122281 0.3488407 0.3435403
## Anderson-Darling statistic   0.43078817 2.1939108 2.1615515
## 
## Goodness-of-fit criteria
##                                 Weibull lognormal    gamma
## Akaike's Information Criterion 479.0047  497.8694 497.5043
## Bayesian Information Criterion 485.3228  504.1875 503.8225

Problem solving

## Goodness-of-fit statistics
##                                Weibull  lognormal      gamma
## Kolmogorov-Smirnov statistic 0.1047548 0.07041396 0.07021146
## Cramer-von Mises statistic   0.3010507 0.12275933 0.12322818
## Anderson-Darling statistic   1.7919436 0.88595890 0.88609301
## 
## Goodness-of-fit criteria
##                                 Weibull lognormal    gamma
## Akaike's Information Criterion 504.0895  497.0675 496.9698
## Bayesian Information Criterion 510.4076  503.3856 503.2879
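
The information criteria in the tables above can be cross-checked against their definitions, AIC = 2k - 2 log L and BIC = k ln(n) - 2 log L. The sketch below, in Python (standard library only), does this for the TASIT 1 values. It assumes k = 2 fitted parameters for each distribution, which is an assumption about how the models were fitted rather than something stated in this report. Under that assumption, the constant gap between BIC and AIC implies a sample size of about 174.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion for log-likelihood log_lik and k parameters."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion for n observations."""
    return k * math.log(n) - 2 * log_lik

# (AIC, BIC) pairs copied from the TASIT 1 table above.
reported = {
    "Weibull": (484.2635, 490.5816),
    "lognormal": (497.7518, 504.0699),
    "gamma": (497.4257, 503.7438),
}
k = 2  # assumption: two fitted parameters per distribution

# BIC - AIC = k * (ln(n) - 2), so the gap alone determines n.
implied_n = {name: math.exp((b - a) / k + 2) for name, (a, b) in reported.items()}

# Rearranging the AIC definition recovers each maximized log-likelihood.
implied_loglik = {name: (2 * k - a) / 2 for name, (a, _) in reported.items()}

# AIC differences relative to the best (lowest-AIC) model.
best = min(reported, key=lambda name: reported[name][0])
delta_aic = {name: a - reported[best][0] for name, (a, _) in reported.items()}
```

Because all three candidates are assumed to have the same number of parameters, ranking them by AIC, by BIC, or by log-likelihood gives the same order; the criteria differ only by a constant within each table.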