Variable Description ID: Unique identifier for the participant. ID_genetics: Genetic identifier related to the participant’s sample or genetic profile. cognitive: Assessment or overall score of cognitive function. de800cog: Result obtained from a specific cognitive test (e.g., DE800 version). images: Indicates the availability or analysis of images (e.g., neuroimaging). DATE_STUDY: Date when the study was conducted. BIRTH: Participant’s birth date. AGE-2024: Participant’s age in the year 2024. AGE-PCR: Age of the participant at the time of the PCR test. AGE_INTERVAL: Interval between age assessments (e.g., between the study date and the PCR test).
Variable Description ID Identifier: associated with the symptom evaluation. ANOSMIA: Indicator of loss of smell (anosmia), presented in binary format or on a scale. Anosmia: A second measure or confirmation of the presence of anosmia. RISK-HOSPITAL-ICU: Indicator of the risk of hospitalization or the need for ICU care due to COVID-19. VACCINE_BEFORE_STUDY: Participant’s vaccination status prior to the study. COVID_BEFORE_VACCINATION: History of COVID-19 infection before vaccination. FEVER: Presence of fever. COUGH: Presence of cough. MUSCLE_PAIN: Presence of muscle pain (myalgia). BREATH_DIF: Indication of breathing difficulties (dyspnea). SMELL_LOST: Report of loss of smell. TASTE_LOST: Report of loss of taste. DATE_PCR: Date when the PCR test for COVID-19 was performed. PCR: Qualitative result of the PCR test (e.g., positive or negative). PCR_NUM: Numerical value associated with the PCR test (e.g., Ct value or viral load). COVID-VARIANT: SARS-CoV-2 variant identified (e.g., Alpha, Delta, Omicron). VACCINE_1: Information regarding the first dose of the vaccine (type or brand). VACCINE_2: Information regarding the second dose of the vaccine (type or brand). VACCINE_3: Information regarding the third dose or booster dose of the vaccine.
Variable Description ID: Identifier related to environmental data or COVID-19 exposure. LISTAPRIMERREC: List referring to the first recognition (possibly of symptoms or initial contact). LISTAAPRENDIZAJE: List associated with the acquisition of knowledge or learning in the context of COVID-19. LISTACP: Indicator or list related to close contacts (CP) or similar parameters. LISTALP: List of parameters or locations (LP), whose definition depends on the study protocol. LISTARECON: List for the recognition of signs or symptoms related to COVID-19. CORSIDIRECTO: Measure or score of direct correlation, possibly related to environmental factors. CORSIINVERSO: Measure or score of inverse correlation related to the same parameters. CACTUSCORRECTAS: Number of correct responses obtained in the “cactus” test. CACTUSVIVOS: Number of responses classified as “living” in the “cactus” test. CACTUSINANIM: Number of responses classified as “inanimate” in the “cactus” test.
Variable Description OTVERBALTPO: Response time in the verbal task (OT). OTVERBALERR: Number of errors in the verbal task (OT). OTVISUALTPO: Response time in the visual task (OT). OTVISUALERR: Number of errors in the visual task (OT). OTMENTALTPO: Response time in the mental task (OT). OTMENTALERR: Number of errors in the mental task (OT). OTVISMENTTPO: Response time in the task combining visual and mental stimuli (OT). OTVISMENTERR: Number of errors in the task combining visual and mental stimuli (OT). OTSWITCHTPO: Response time in the switching task (OT evaluation). OTSWITCHERR: Number of errors in the switching task (OT evaluation). 5DREADTPO: Reading time in the designated 5D task. 5DREADERR: Number of errors in the 5D reading task. 5DCOUNTTPO: Time taken in the counting task (5D). 5DCOUNTERR: Number of errors in the counting task (5D). 5DFOCTPO: Execution time in the focus task (5D). 5DFOCERR: Number of errors in the focus task (5D). 5DSWITCHTPO: Response time in the switching task (5D evaluation). 5DSWITCHERR: Number of errors in the switching task (5D evaluation). DSCORR: Number of correct responses in the DS task (e.g., digit span test). DSOMIS: Number of omissions in the DS task. DSCOMIS: Number of commission errors (incorrect responses) in the DS task. TORREMOV: Indicator or number of removals in a tower task (possibly related to response inhibition). TORRETPO: Execution time in the tower task (e.g., Tower of Hanoi or similar). BOSTONSC: Score obtained on the Boston test subscale, possibly related to naming. BOSTONLAT: Latency in the performance of the Boston test. BOSTONSEM: Performance in the semantic component of the Boston test. BOSTONSEMERR: Number of errors in the semantic component of the Boston test. BOSTONFON: Performance in the phonemic component of the Boston test. BOSTONFONERR: Number of errors in the phonemic component of the Boston test. FLUENCIA: Measure of verbal fluency, evaluating the ability to generate words within a specified time.
library(readxl)
library(tidyverse)
library(caret)
# Read the Excel file
dataset <- read_excel("~/Downloads/en_uso8_last_V10.xlsx",
sheet = "dataset", col_types = c("numeric",
"numeric", "numeric", "numeric",
"numeric", "date", "date", "numeric",
"numeric", "numeric", "date", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "date", "text", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric"))
library(janitor)
# Rename variables to clean names
dataset <- dataset %>%
clean_names()
# Subset the data (selecting specific columns)
data <- dataset[c(2:5, 8:10, 12, 14:22, 24:29, 31:37, 39:65, 67, 69:101)]
# Get the names of the columns
names(data)
## [1] "id_genetics"
## [2] "cognitive"
## [3] "de800cog"
## [4] "images"
## [5] "age_2024"
## [6] "age_pcr"
## [7] "age_interval"
## [8] "anosmia"
## [9] "risk_hospital_icu"
## [10] "vaccine_before_study"
## [11] "covid_before_vaccination"
## [12] "fever"
## [13] "cough"
## [14] "muscle_pain"
## [15] "breath_dif"
## [16] "smell_lost"
## [17] "taste_lost"
## [18] "pcr"
## [19] "pcr_num"
## [20] "covid_variant"
## [21] "vaccine_1"
## [22] "vaccine_2"
## [23] "vaccine_3"
## [24] "listaprimerrec"
## [25] "listaaprendizaje"
## [26] "listacp"
## [27] "listalp"
## [28] "listarecon"
## [29] "corsidirecto"
## [30] "corsiinverso"
## [31] "cactusvivos"
## [32] "cactusinanim"
## [33] "otverbaltpo"
## [34] "otverbalerr"
## [35] "otvisualtpo"
## [36] "otvisualerr"
## [37] "otmentaltpo"
## [38] "otmentalerr"
## [39] "otvismenttpo"
## [40] "otvismenterr"
## [41] "otswitchtpo"
## [42] "otswitcherr"
## [43] "x5dreadtpo"
## [44] "x5dreaderr"
## [45] "x5dcounttpo"
## [46] "x5dcounterr"
## [47] "x5dfoctpo"
## [48] "x5dfocerr"
## [49] "x5dswitchtpo"
## [50] "x5dswitcherr"
## [51] "dscorr"
## [52] "dsomis"
## [53] "dscomis"
## [54] "torremov"
## [55] "torretpo"
## [56] "bostonsc"
## [57] "bostonlat"
## [58] "bostonsemerr"
## [59] "bostonfonerr"
## [60] "fluencia"
## [61] "right_accumbens_area"
## [62] "left_accumbens_area"
## [63] "right_amygdala"
## [64] "left_amygdala"
## [65] "right_cerebellum_exterior"
## [66] "left_cerebellum_exterior"
## [67] "right_hippocampus"
## [68] "left_hippocampus"
## [69] "right_putamen"
## [70] "left_putamen"
## [71] "right_thalamus_proper"
## [72] "left_thalamus_proper"
## [73] "fornix_right"
## [74] "fornix_left"
## [75] "anterior_limb_of_internal_capsule_right"
## [76] "anterior_limb_of_internal_capsule_left"
## [77] "posterior_limb_of_internal_capsule_inc_cerebral_peduncle_right"
## [78] "posterior_limb_of_internal_capsule_inc_cerebral_peduncle_left"
## [79] "corpus_callosum"
## [80] "right_a_cg_g_anterior_cingulate_gyrus"
## [81] "left_a_cg_g_anterior_cingulate_gyrus"
## [82] "right_a_ins_anterior_insula"
## [83] "left_a_ins_anterior_insula"
## [84] "right_an_g_angular_gyrus"
## [85] "left_an_g_angular_gyrus"
## [86] "right_cun_cuneus"
## [87] "left_cun_cuneus"
## [88] "right_ent_entorhinal_area"
## [89] "left_ent_entorhinal_area"
## [90] "right_g_re_gyrus_rectus"
## [91] "left_g_re_gyrus_rectus"
# Load skimr
library(skimr)
# Get a detailed summary of cognitive data:
skim(data[24:60])
| Name | data[24:60] |
| Number of rows | 463 |
| Number of columns | 37 |
| _______________________ | |
| Column type frequency: | |
| numeric | 37 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| listaprimerrec | 0 | 1.00 | 3.71 | 1.63 | 0.00 | 3.00 | 4.00 | 5.00 | 16.00 | ▇▇▁▁▁ |
| listaaprendizaje | 0 | 1.00 | 23.99 | 7.35 | 0.00 | 19.00 | 24.00 | 29.00 | 41.00 | ▁▂▇▇▂ |
| listacp | 0 | 1.00 | 6.20 | 2.21 | 0.00 | 5.00 | 6.00 | 8.00 | 12.00 | ▁▃▇▃▁ |
| listalp | 0 | 1.00 | 5.48 | 2.38 | 0.00 | 4.00 | 5.00 | 7.00 | 12.00 | ▂▅▇▃▁ |
| listarecon | 0 | 1.00 | 21.21 | 3.27 | 0.00 | 20.00 | 22.00 | 23.00 | 24.00 | ▁▁▁▂▇ |
| corsidirecto | 0 | 1.00 | 5.13 | 1.34 | 0.00 | 4.00 | 5.00 | 6.00 | 11.00 | ▁▃▇▁▁ |
| corsiinverso | 1 | 1.00 | 4.17 | 1.21 | 0.00 | 4.00 | 4.00 | 5.00 | 9.00 | ▁▂▇▂▁ |
| cactusvivos | 0 | 1.00 | 12.67 | 2.56 | 0.00 | 12.00 | 13.00 | 14.00 | 16.00 | ▁▁▁▃▇ |
| cactusinanim | 0 | 1.00 | 15.11 | 2.46 | 0.00 | 14.00 | 16.00 | 17.00 | 17.00 | ▁▁▁▁▇ |
| otverbaltpo | 0 | 1.00 | 185.29 | 11.42 | 99.00 | 180.00 | 189.00 | 192.00 | 215.00 | ▁▁▁▇▃ |
| otverbalerr | 0 | 1.00 | 19.84 | 0.55 | 15.00 | 20.00 | 20.00 | 20.00 | 20.00 | ▁▁▁▁▇ |
| otvisualtpo | 0 | 1.00 | 63.95 | 31.16 | 0.00 | 46.00 | 55.00 | 74.00 | 300.00 | ▇▅▁▁▁ |
| otvisualerr | 0 | 1.00 | 19.70 | 1.42 | 0.00 | 20.00 | 20.00 | 20.00 | 20.00 | ▁▁▁▁▇ |
| otmentaltpo | 0 | 1.00 | 245.58 | 52.95 | 19.00 | 234.50 | 262.00 | 278.00 | 319.00 | ▁▁▁▅▇ |
| otmentalerr | 0 | 1.00 | 18.04 | 3.03 | 0.00 | 17.00 | 19.00 | 20.00 | 20.00 | ▁▁▁▂▇ |
| otvismenttpo | 0 | 1.00 | 240.89 | 49.22 | 32.00 | 226.50 | 256.00 | 272.50 | 332.00 | ▁▁▂▇▃ |
| otvismenterr | 0 | 1.00 | 18.63 | 2.70 | 0.00 | 18.00 | 20.00 | 20.00 | 20.00 | ▁▁▁▁▇ |
| otswitchtpo | 0 | 1.00 | 350.00 | 53.89 | 28.00 | 338.00 | 366.00 | 383.00 | 428.00 | ▁▁▁▃▇ |
| otswitcherr | 0 | 1.00 | 17.65 | 2.95 | 0.00 | 17.00 | 19.00 | 20.00 | 20.00 | ▁▁▁▂▇ |
| x5dreadtpo | 0 | 1.00 | 188.32 | 20.01 | 22.00 | 185.00 | 193.00 | 198.00 | 222.00 | ▁▁▁▂▇ |
| x5dreaderr | 3 | 0.99 | 49.83 | 1.06 | 30.00 | 50.00 | 50.00 | 50.00 | 50.00 | ▁▁▁▁▇ |
| x5dcounttpo | 2 | 1.00 | 165.29 | 14.23 | 68.00 | 162.00 | 168.00 | 173.00 | 200.00 | ▁▁▁▇▃ |
| x5dcounterr | 2 | 1.00 | 49.79 | 1.51 | 19.00 | 50.00 | 50.00 | 50.00 | 50.00 | ▁▁▁▁▇ |
| x5dfoctpo | 2 | 1.00 | 246.04 | 23.31 | 89.00 | 239.00 | 252.00 | 259.00 | 300.00 | ▁▁▁▇▃ |
| x5dfocerr | 2 | 1.00 | 48.28 | 3.01 | 3.00 | 48.00 | 49.00 | 50.00 | 50.00 | ▁▁▁▁▇ |
| x5dswitchtpo | 2 | 1.00 | 271.96 | 35.53 | 0.00 | 257.00 | 280.00 | 295.00 | 350.00 | ▁▁▁▇▇ |
| x5dswitcherr | 2 | 1.00 | 46.58 | 4.59 | 0.00 | 46.00 | 48.00 | 49.00 | 50.00 | ▁▁▁▁▇ |
| dscorr | 0 | 1.00 | 69.16 | 18.11 | 0.00 | 57.00 | 70.00 | 82.00 | 109.00 | ▁▂▆▇▂ |
| dsomis | 0 | 1.00 | 0.35 | 1.95 | 0.00 | 0.00 | 0.00 | 0.00 | 24.00 | ▇▁▁▁▁ |
| dscomis | 0 | 1.00 | 25.93 | 2.25 | 0.00 | 26.00 | 27.00 | 27.00 | 27.00 | ▁▁▁▁▇ |
| torremov | 1 | 1.00 | 341.44 | 3.93 | 323.00 | 340.00 | 343.00 | 344.00 | 351.00 | ▁▁▂▇▁ |
| torretpo | 0 | 1.00 | 264.54 | 74.67 | 2.58 | 240.58 | 287.58 | 316.08 | 352.58 | ▁▁▁▅▇ |
| bostonsc | 0 | 1.00 | 3.20 | 2.93 | 0.00 | 1.00 | 2.00 | 5.00 | 12.00 | ▇▃▂▂▁ |
| bostonlat | 0 | 1.00 | 2.71 | 1.89 | 0.00 | 1.83 | 2.62 | 3.52 | 16.00 | ▇▃▁▁▁ |
| bostonsemerr | 0 | 1.00 | 1.99 | 2.37 | 0.00 | 0.00 | 1.00 | 3.00 | 10.00 | ▇▂▁▁▁ |
| bostonfonerr | 0 | 1.00 | 1.12 | 1.76 | 0.00 | 0.00 | 0.00 | 2.00 | 11.00 | ▇▁▁▁▁ |
| fluencia | 1 | 1.00 | 16.52 | 4.82 | 1.00 | 14.00 | 16.50 | 20.00 | 33.00 | ▁▃▇▂▁ |
# Load DataExplorer
library(DataExplorer)
# Visualize the missing data pattern:
plot_missing(data[24:60])
# Plot density distributions of numeric variables:
plot_density(data[24:60])
# Load dlookr
library(dlookr)
# Diagnose potential data issues
print(diagnose_outlier(data[24:60]), n = 40)
## # A tibble: 37 × 6
## variables outliers_cnt outliers_ratio outliers_mean with_mean without_mean
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 listaprimer… 4 0.864 10.8 3.71 3.64
## 2 listaaprend… 4 0.864 0.75 24.0 24.2
## 3 listacp 8 1.73 0 6.20 6.31
## 4 listalp 4 0.864 12 5.48 5.42
## 5 listarecon 16 3.46 8.75 21.2 21.7
## 6 corsidirecto 7 1.51 6 5.13 5.12
## 7 corsiinverso 52 11.2 3.17 4.17 4.3
## 8 cactusvivos 31 6.70 5.81 12.7 13.2
## 9 cactusinanim 20 4.32 6.6 15.1 15.5
## 10 otverbaltpo 20 4.32 157. 185. 187.
## 11 otverbalerr 49 10.6 18.5 19.8 20
## 12 otvisualtpo 29 6.26 142. 64.0 58.7
## 13 otvisualerr 75 16.2 18.2 19.7 20
## 14 otmentaltpo 44 9.50 110. 246. 260.
## 15 otmentalerr 18 3.89 7 18.0 18.5
## 16 otvismenttpo 29 6.26 97.7 241. 250.
## 17 otvismenterr 24 5.18 10.0 18.6 19.1
## 18 otswitchtpo 30 6.48 197. 350. 361.
## 19 otswitcherr 21 4.54 8.43 17.7 18.1
## 20 x5dreadtpo 29 6.26 138. 188. 192.
## 21 x5dreaderr 38 8.21 47.9 49.8 50
## 22 x5dcounttpo 32 6.91 135. 165. 168.
## 23 x5dcounterr 46 9.94 47.9 49.8 50
## 24 x5dfoctpo 28 6.05 194. 246. 249.
## 25 x5dfocerr 21 4.54 39 48.3 48.7
## 26 x5dswitchtpo 19 4.10 160. 272. 277.
## 27 x5dswitcherr 31 6.70 34.4 46.6 47.5
## 28 dscorr 3 0.648 11.3 69.2 69.5
## 29 dsomis 46 9.94 3.54 0.352 0
## 30 dscomis 68 14.7 21.8 25.9 26.6
## 31 torremov 18 3.89 330. 341. 342.
## 32 torretpo 37 7.99 66.8 265. 282.
## 33 bostonsc 4 0.864 12 3.20 3.12
## 34 bostonlat 19 4.10 8.04 2.71 2.48
## 35 bostonsemerr 17 3.67 8.76 1.99 1.73
## 36 bostonfonerr 22 4.75 6.64 1.12 0.846
## 37 fluencia 8 1.73 17.6 16.5 16.5
# Explore missing data patterns
print(diagnose(data[24:60]), n = 40)
## # A tibble: 37 × 6
## variables types missing_count missing_percent unique_count unique_rate
## <chr> <chr> <int> <dbl> <int> <dbl>
## 1 listaprimerrec nume… 0 0 11 0.0238
## 2 listaaprendizaje nume… 0 0 39 0.0842
## 3 listacp nume… 0 0 13 0.0281
## 4 listalp nume… 0 0 13 0.0281
## 5 listarecon nume… 0 0 17 0.0367
## 6 corsidirecto nume… 0 0 11 0.0238
## 7 corsiinverso nume… 1 0.216 11 0.0238
## 8 cactusvivos nume… 0 0 14 0.0302
## 9 cactusinanim nume… 0 0 13 0.0281
## 10 otverbaltpo nume… 0 0 49 0.106
## 11 otverbalerr nume… 0 0 5 0.0108
## 12 otvisualtpo nume… 0 0 99 0.214
## 13 otvisualerr nume… 0 0 7 0.0151
## 14 otmentaltpo nume… 0 0 136 0.294
## 15 otmentalerr nume… 0 0 14 0.0302
## 16 otvismenttpo nume… 0 0 128 0.276
## 17 otvismenterr nume… 0 0 13 0.0281
## 18 otswitchtpo nume… 0 0 134 0.289
## 19 otswitcherr nume… 0 0 14 0.0302
## 20 x5dreadtpo nume… 0 0 65 0.140
## 21 x5dreaderr nume… 3 0.648 7 0.0151
## 22 x5dcounttpo nume… 2 0.432 59 0.127
## 23 x5dcounterr nume… 2 0.432 7 0.0151
## 24 x5dfoctpo nume… 2 0.432 84 0.181
## 25 x5dfocerr nume… 2 0.432 15 0.0324
## 26 x5dswitchtpo nume… 2 0.432 118 0.255
## 27 x5dswitcherr nume… 2 0.432 21 0.0454
## 28 dscorr nume… 0 0 82 0.177
## 29 dsomis nume… 0 0 12 0.0259
## 30 dscomis nume… 0 0 13 0.0281
## 31 torremov nume… 1 0.216 20 0.0432
## 32 torretpo nume… 0 0 151 0.326
## 33 bostonsc nume… 0 0 13 0.0281
## 34 bostonlat nume… 0 0 147 0.317
## 35 bostonsemerr nume… 0 0 11 0.0238
## 36 bostonfonerr nume… 0 0 10 0.0216
## 37 fluencia nume… 1 0.216 30 0.0648
# Summary of the COVID data
summary(data[c(2:23)])
## cognitive de800cog images age_2024 age_pcr
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :12.0 Min. :53.00
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:65.0 1st Qu.:62.00
## Median :2.000 Median :2.000 Median :2.000 Median :69.0 Median :65.00
## Mean :1.501 Mean :2.352 Mean :1.877 Mean :69.7 Mean :66.48
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:73.0 3rd Qu.:70.00
## Max. :2.000 Max. :4.000 Max. :3.000 Max. :90.0 Max. :86.00
## NA's :13
## age_interval anosmia risk_hospital_icu vaccine_before_study
## Min. :1.000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:1.000
## Median :2.000 Median :2.000 Median :0.0000 Median :2.000
## Mean :1.922 Mean :1.734 Mean :0.3024 Mean :1.482
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:0.0000 3rd Qu.:2.000
## Max. :3.000 Max. :3.000 Max. :3.0000 Max. :3.000
## NA's :1
## covid_before_vaccination fever cough muscle_pain
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :0.0000 Median :1.0000 Median :1.0000
## Mean :0.5487 Mean :0.4323 Mean :0.5178 Mean :0.6081
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## NA's :42 NA's :42 NA's :42 NA's :42
## breath_dif smell_lost taste_lost pcr
## Min. :0.0000 Min. :0.0000 Min. :0.000 Length:463
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 Class :character
## Median :0.0000 Median :0.0000 Median :0.000 Mode :character
## Mean :0.3967 Mean :0.4561 Mean :0.399
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.000
## Max. :1.0000 Max. :1.0000 Max. :1.000
## NA's :42 NA's :42 NA's :42
## pcr_num covid_variant vaccine_1 vaccine_2
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.000
## 1st Qu.:1.0000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:1.000
## Median :1.0000 Median :1.000 Median :2.000 Median :2.000
## Mean :0.8377 Mean :1.348 Mean :1.996 Mean :1.819
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :1.0000 Max. :7.000 Max. :5.000 Max. :6.000
## NA's :1
## vaccine_3
## Min. :0.000
## 1st Qu.:0.000
## Median :1.000
## Mean :1.726
## 3rd Qu.:2.000
## Max. :6.000
##
# Summary of the cognitive data
summary(data[24:60])
## listaprimerrec listaaprendizaje listacp listalp
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.:19.00 1st Qu.: 5.000 1st Qu.: 4.000
## Median : 4.000 Median :24.00 Median : 6.000 Median : 5.000
## Mean : 3.706 Mean :23.99 Mean : 6.203 Mean : 5.477
## 3rd Qu.: 5.000 3rd Qu.:29.00 3rd Qu.: 8.000 3rd Qu.: 7.000
## Max. :16.000 Max. :41.00 Max. :12.000 Max. :12.000
##
## listarecon corsidirecto corsiinverso cactusvivos
## Min. : 0.00 Min. : 0.000 Min. :0.000 Min. : 0.00
## 1st Qu.:20.00 1st Qu.: 4.000 1st Qu.:4.000 1st Qu.:12.00
## Median :22.00 Median : 5.000 Median :4.000 Median :13.00
## Mean :21.21 Mean : 5.134 Mean :4.173 Mean :12.67
## 3rd Qu.:23.00 3rd Qu.: 6.000 3rd Qu.:5.000 3rd Qu.:14.00
## Max. :24.00 Max. :11.000 Max. :9.000 Max. :16.00
## NA's :1
## cactusinanim otverbaltpo otverbalerr otvisualtpo
## Min. : 0.00 Min. : 99.0 Min. :15.00 Min. : 0.00
## 1st Qu.:14.00 1st Qu.:180.0 1st Qu.:20.00 1st Qu.: 46.00
## Median :16.00 Median :189.0 Median :20.00 Median : 55.00
## Mean :15.11 Mean :185.3 Mean :19.84 Mean : 63.95
## 3rd Qu.:17.00 3rd Qu.:192.0 3rd Qu.:20.00 3rd Qu.: 74.00
## Max. :17.00 Max. :215.0 Max. :20.00 Max. :300.00
##
## otvisualerr otmentaltpo otmentalerr otvismenttpo otvismenterr
## Min. : 0.0 Min. : 19.0 Min. : 0.00 Min. : 32.0 Min. : 0.00
## 1st Qu.:20.0 1st Qu.:234.5 1st Qu.:17.00 1st Qu.:226.5 1st Qu.:18.00
## Median :20.0 Median :262.0 Median :19.00 Median :256.0 Median :20.00
## Mean :19.7 Mean :245.6 Mean :18.04 Mean :240.9 Mean :18.63
## 3rd Qu.:20.0 3rd Qu.:278.0 3rd Qu.:20.00 3rd Qu.:272.5 3rd Qu.:20.00
## Max. :20.0 Max. :319.0 Max. :20.00 Max. :332.0 Max. :20.00
##
## otswitchtpo otswitcherr x5dreadtpo x5dreaderr x5dcounttpo
## Min. : 28 Min. : 0.00 Min. : 22.0 Min. :30.00 Min. : 68.0
## 1st Qu.:338 1st Qu.:17.00 1st Qu.:185.0 1st Qu.:50.00 1st Qu.:162.0
## Median :366 Median :19.00 Median :193.0 Median :50.00 Median :168.0
## Mean :350 Mean :17.65 Mean :188.3 Mean :49.83 Mean :165.3
## 3rd Qu.:383 3rd Qu.:20.00 3rd Qu.:198.0 3rd Qu.:50.00 3rd Qu.:173.0
## Max. :428 Max. :20.00 Max. :222.0 Max. :50.00 Max. :200.0
## NA's :3 NA's :2
## x5dcounterr x5dfoctpo x5dfocerr x5dswitchtpo x5dswitcherr
## Min. :19.00 Min. : 89 Min. : 3.00 Min. : 0 Min. : 0.00
## 1st Qu.:50.00 1st Qu.:239 1st Qu.:48.00 1st Qu.:257 1st Qu.:46.00
## Median :50.00 Median :252 Median :49.00 Median :280 Median :48.00
## Mean :49.79 Mean :246 Mean :48.28 Mean :272 Mean :46.58
## 3rd Qu.:50.00 3rd Qu.:259 3rd Qu.:50.00 3rd Qu.:295 3rd Qu.:49.00
## Max. :50.00 Max. :300 Max. :50.00 Max. :350 Max. :50.00
## NA's :2 NA's :2 NA's :2 NA's :2 NA's :2
## dscorr dsomis dscomis torremov
## Min. : 0.00 Min. : 0.0000 Min. : 0.00 Min. :323.0
## 1st Qu.: 57.00 1st Qu.: 0.0000 1st Qu.:26.00 1st Qu.:340.0
## Median : 70.00 Median : 0.0000 Median :27.00 Median :343.0
## Mean : 69.16 Mean : 0.3521 Mean :25.93 Mean :341.4
## 3rd Qu.: 82.00 3rd Qu.: 0.0000 3rd Qu.:27.00 3rd Qu.:344.0
## Max. :109.00 Max. :24.0000 Max. :27.00 Max. :351.0
## NA's :1
## torretpo bostonsc bostonlat bostonsemerr
## Min. : 2.58 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:240.58 1st Qu.: 1.000 1st Qu.: 1.830 1st Qu.: 0.000
## Median :287.58 Median : 2.000 Median : 2.620 Median : 1.000
## Mean :264.54 Mean : 3.201 Mean : 2.706 Mean : 1.991
## 3rd Qu.:316.08 3rd Qu.: 5.000 3rd Qu.: 3.520 3rd Qu.: 3.000
## Max. :352.58 Max. :12.000 Max. :16.000 Max. :10.000
##
## bostonfonerr fluencia
## Min. : 0.000 Min. : 1.00
## 1st Qu.: 0.000 1st Qu.:14.00
## Median : 0.000 Median :16.50
## Mean : 1.121 Mean :16.52
## 3rd Qu.: 2.000 3rd Qu.:20.00
## Max. :11.000 Max. :33.00
## NA's :1
# Summary of the volume brain data
summary(data[71:91])
## right_thalamus_proper left_thalamus_proper fornix_right fornix_left
## Min. :4349 Min. :4478 Min. : 67.0 Min. :105.0
## 1st Qu.:6449 1st Qu.:6800 1st Qu.:324.0 1st Qu.:399.5
## Median :6859 Median :7203 Median :361.0 Median :453.0
## Mean :6871 Mean :7231 Mean :358.6 Mean :446.2
## 3rd Qu.:7265 3rd Qu.:7658 3rd Qu.:394.0 3rd Qu.:496.0
## Max. :9039 Max. :9560 Max. :541.0 Max. :694.0
## anterior_limb_of_internal_capsule_right anterior_limb_of_internal_capsule_left
## Min. :1849 Min. :1438
## 1st Qu.:2842 1st Qu.:2440
## Median :3071 Median :2673
## Mean :3099 Mean :2687
## 3rd Qu.:3336 3rd Qu.:2919
## Max. :4742 Max. :4213
## posterior_limb_of_internal_capsule_inc_cerebral_peduncle_right
## Min. :1571
## 1st Qu.:2194
## Median :2349
## Mean :2384
## 3rd Qu.:2557
## Max. :4129
## posterior_limb_of_internal_capsule_inc_cerebral_peduncle_left corpus_callosum
## Min. :1455 Min. : 6119
## 1st Qu.:2155 1st Qu.: 9718
## Median :2333 Median :10789
## Mean :2365 Mean :10885
## 3rd Qu.:2550 3rd Qu.:11905
## Max. :3814 Max. :17580
## right_a_cg_g_anterior_cingulate_gyrus left_a_cg_g_anterior_cingulate_gyrus
## Min. :1901 Min. :2210
## 1st Qu.:2962 1st Qu.:3522
## Median :3316 Median :3896
## Mean :3392 Mean :3953
## 3rd Qu.:3772 3rd Qu.:4374
## Max. :5867 Max. :6185
## right_a_ins_anterior_insula left_a_ins_anterior_insula
## Min. :1537 Min. :1562
## 1st Qu.:2925 1st Qu.:2907
## Median :3186 Median :3165
## Mean :3206 Mean :3177
## 3rd Qu.:3493 3rd Qu.:3450
## Max. :4521 Max. :5098
## right_an_g_angular_gyrus left_an_g_angular_gyrus right_cun_cuneus
## Min. : 4960 Min. : 4715 Min. :2653
## 1st Qu.: 7492 1st Qu.: 6732 1st Qu.:4556
## Median : 8189 Median : 7401 Median :5139
## Mean : 8312 Mean : 7447 Mean :5188
## 3rd Qu.: 9116 3rd Qu.: 8115 3rd Qu.:5791
## Max. :13560 Max. :11434 Max. :8236
## left_cun_cuneus right_ent_entorhinal_area left_ent_entorhinal_area
## Min. :2627 Min. : 749 Min. : 796
## 1st Qu.:4551 1st Qu.:2131 1st Qu.:1856
## Median :5085 Median :2372 Median :2070
## Mean :5144 Mean :2424 Mean :2109
## 3rd Qu.:5652 3rd Qu.:2704 3rd Qu.:2336
## Max. :8628 Max. :3989 Max. :4156
## right_g_re_gyrus_rectus left_g_re_gyrus_rectus
## Min. :1253 Min. :1189
## 1st Qu.:1752 1st Qu.:1774
## Median :1955 Median :1956
## Mean :1972 Mean :1978
## 3rd Qu.:2166 3rd Qu.:2152
## Max. :2912 Max. :3072
# Impute the data using the MICE package
library(mice)
# Use the CART imputation method
imp <- mice( data[c(2:91)], m = 5, maxit = 5, method = "cart", seed = 123)
# Complete the imputation process
imputed_data <- complete(imp)
names(imputed_data)
library(EGAnet)
# Run EGA on the imputed data
# EGA on the Symptoms and Demographics
fit_demo <- EGA(imputed_data[c(1:3, 5, 8:16, 18:22)])
# Scale cognitive variables before running the EGA
fit_cog <- EGA(scale(imputed_data[23:59]))
# Confirmatory Factor Analysis from EGA model
cfa <- CFA(fit_cog, plot = TRUE, data = (scale(imputed_data[23:59])), estimator = "MLR")
## [1] "listaprimerrec" "listaaprendizaje" "listacp" "listalp"
## [5] "listarecon" "fluencia"
## [1] "corsidirecto" "corsiinverso" "cactusvivos" "cactusinanim" "otverbalerr"
## [6] "dscorr" "dscomis" "bostonsc" "bostonlat" "bostonsemerr"
## [11] "bostonfonerr"
## [1] "otverbaltpo" "otvisualtpo" "x5dreadtpo" "x5dcounttpo" "x5dfoctpo"
## [6] "x5dswitchtpo"
## [1] "otvisualerr" "otmentaltpo" "otmentalerr" "otvismenttpo" "otvismenterr"
## [6] "otswitchtpo" "otswitcherr"
## [1] "x5dreaderr" "x5dcounterr" "x5dfocerr" "x5dswitcherr"
## [1] "torremov" "torretpo"
# Scale brain variables before running the EGA
fit_all <- EGA(scale(imputed_data[60:90]))
#' Compare Groups Across Multiple Variables Automatically, Adjusting for Covariates
#'
#' This function iterates through specified variables in a dataset, determines their type,
#' fits appropriate generalized linear models (GLM), ordinal, or multinomial models,
#' and compares a model with the group predictor (and optional covariates) to a
#' null model (with optional covariates only) using a Likelihood Ratio Test (LRT).
#' It reports p-values and FDR-adjusted p-values for the group effect, adjusted for covariates.
#'
#' @param data A data.frame containing the variables to compare, the grouping variable,
#' and any covariates.
#' @param group A character string specifying the name of the column in `data` that
#' contains the grouping factor. Must have at least 2 levels.
#' @param vars_to_test A character vector specifying the names of the outcome variables
#' in `data` to be tested against the `group`. If NULL (default), all columns
#' except `group` and `covariates` are tested.
#' @param covariates A character vector specifying the names of the columns in `data` to
#' be used as covariates in the regression models. Default is `NULL` (no covariates).
#' These variables will be included in both the full and null models.
#' @param alpha The significance level (default 0.05) used to determine significance
#' based on the FDR-adjusted p-value of the group effect.
#' @param zero_threshold For continuous outcome variables, the minimum proportion of zero values
#' (default 0.3) to consider potentially using a Gamma GLM (along with skewness).
#' @param skew_threshold For continuous outcome variables, the minimum absolute skewness
#' (default 2) to consider potentially using a Gamma GLM (along with zero proportion).
#' Note: Uses `moments::skewness`.
#' @param verbose Logical indicating whether to print progress messages (default TRUE).
#' @param group_ref Optional. A string specifying the reference level for the group variable.
#' If NULL (default), the first level is used.
#'
#' @return A data.frame summarizing the results for each variable tested, including:
#' \item{Variable}{Name of the outcome variable tested.}
#' \item{Type}{Detected type of the outcome variable (e.g., "Continuous (Gaussian GLM used)", "Binary", "Ordinal", "Categorical", "Constant Outcome Variable").}
#' \item{Covariates_Used}{Comma-separated string of covariates included in the models for this variable.}
#' \item{n_obs}{Number of observations used for the test after handling missing data for the outcome, group, and all covariates.}
#' \item{Status}{Outcome of the modeling ("OK", "Error: [message]", "Constant Outcome Variable", "No DF improvement", "Low N or too many NAs", "Unsupported outcome variable type", "Log-Likelihood calculation failed...").}
#' \item{p_value}{The raw p-value from the Likelihood Ratio Test for the `group` effect, formatted as a string.}
#' \item{p_value_FDR}{The group p-value adjusted for multiple comparisons using the Benjamini-Hochberg FDR method, formatted as a string.}
#' \item{Significant}{"Yes" if p_value_FDR < alpha, "No" otherwise (based on non-NA adjusted p-values).}
#' \item{Gamma_Shift_Warning}{A flag ("Yes"/"No") indicating if non-positive outcome values were shifted for the Gamma GLM.}
#' \item{Convergence_Warning}{A flag ("Yes"/"No") indicating if GLM/Multinom models reported non-convergence or if logLik calculation failed for Ordinal models (proxy for potential issues).}
#'
#' @details
#' - **Covariate Adjustment:** Compares `model(y ~ group + covariates)` vs `model(y ~ covariates)` (or `y ~ group` vs `y ~ 1` if no covariates).
#' - **Variable Selection:** Uses `vars_to_test` or defaults to all other variables.
#' - **Variable Types & Modeling:** Automatically detects outcome type. Uses Gaussian GLM (default continuous), Gamma GLM (skewed/zeros continuous), Binomial GLM (binary), `MASS::polr` (ordinal), `nnet::multinom` (categorical). Covariates used as-is (ensure factor/numeric).
#' - **Convergence:** Checks convergence flags for GLM/Multinom. For Ordinal (`polr`), failure to calculate log-likelihood is used as a proxy for potential convergence/Hessian issues and flagged.
#' - **Missing Data:** Uses complete cases for outcome, group, and all covariates per test.
#' - **Group Reference Level:** Users can specify the reference level for the group variable using `group_ref`. If the specified level is not present in the data for a particular variable after removing NAs, a warning is issued, and the default reference level is used.
#' - **Dependencies:** Requires 'MASS', 'nnet', and 'moments' packages.
#'
#' @examples
#' \dontrun{
#' set.seed(123)
#' sample_data <- data.frame(
#' Site = factor(rep(c("Site1", "Site2"), each = 60)),
#' Age = rnorm(120, mean = 50, sd = 10),
#' Education = sample(10:20, 120, replace = TRUE),
#' Sex = factor(sample(c("M", "F"), 120, replace = TRUE)),
#' CognitiveScore = rnorm(120, mean = rep(c(100, 105), each = 60) + 0.1 * Age - 5 * (Sex == "F")),
#' BiomarkerA = rgamma(120, shape = rep(c(2, 3), each = 60) + 0.05 * Age, scale = 1.5),
#' Diagnosis = factor(sample(c("Normal", "MCI", "AD"), 120, replace = TRUE, prob = c(0.6, 0.3, 0.1))),
#' OrdinalResp = factor(sample(1:5, 120, replace = TRUE), ordered = TRUE),
#' ConstantVar = rep(10, 120)
#' )
#' sample_data$Age[sample(1:120, 10)] <- NA
#' sample_data$CognitiveScore[sample(1:120, 5)] <- NA
#' sample_data$BiomarkerA[1:5] <- NA
#'
#' results_with_cov <- compare_groups_auto_v4(
#' data = sample_data,
#' group = "Site",
#' vars_to_test = c("CognitiveScore", "BiomarkerA", "Diagnosis", "OrdinalResp", "ConstantVar"),
#' covariates = c("Age", "Education", "Sex"),
#' verbose = TRUE
#' )
#' print(results_with_cov)
#'
#' # Example with specified group reference level
#' results_with_ref <- compare_groups_auto_v4(
#' data = sample_data,
#' group = "Site",
#' vars_to_test = c("CognitiveScore", "BiomarkerA", "Diagnosis", "OrdinalResp", "ConstantVar"),
#' covariates = c("Age", "Education", "Sex"),
#' group_ref = "Site2",
#' verbose = TRUE
#' )
#' print(results_with_ref)
#' }
#' @importFrom stats glm gaussian binomial Gamma logLik pchisq p.adjust complete.cases sd anova reformulate relevel na.omit
#' @importFrom MASS polr
#' @importFrom nnet multinom
#' @importFrom moments skewness
#' @export
compare_groups_auto_v4 <- function(data, group, vars_to_test = NULL, covariates = NULL, alpha = 0.05,
zero_threshold = 0.3, skew_threshold = 2, verbose = TRUE, group_ref = NULL) {
# --- 1. Input Validation and Package Checks ---
if (!requireNamespace("MASS", quietly = TRUE)) stop("Package 'MASS' is required. Please install it.")
if (!requireNamespace("nnet", quietly = TRUE)) stop("Package 'nnet' is required. Please install it.")
if (!requireNamespace("moments", quietly = TRUE)) stop("Package 'moments' is required. Please install it.")
if (!is.data.frame(data)) stop("'data' must be a data.frame.")
if (!is.character(group) || length(group) != 1 || !(group %in% names(data))) {
stop("'group' must be a single string naming an existing column in 'data'.")
}
if (!is.null(covariates)) {
if (!is.character(covariates) || any(!covariates %in% names(data))) {
stop("'covariates' must be NULL or a character vector of existing column names in 'data'.")
}
if (any(covariates %in% group)) stop("The 'group' variable cannot also be listed in 'covariates'.")
covariates <- unique(covariates)
} else {
covariates <- character(0)
}
if (!is.null(vars_to_test)) {
if (!is.character(vars_to_test) || any(!vars_to_test %in% names(data))) {
stop("'vars_to_test' must be NULL or a character vector of existing column names in 'data'.")
}
if (any(vars_to_test %in% c(group, covariates))) stop("Variables in 'vars_to_test' cannot include the 'group' or 'covariates'.")
vars_to_test <- unique(vars_to_test)
}
if (!is.null(group_ref) && (!is.character(group_ref) || length(group_ref) != 1)) {
stop("'group_ref' must be NULL or a single string.")
}
if (!is.numeric(alpha) || alpha <= 0 || alpha >= 1) stop("'alpha' must be a numeric value strictly between 0 and 1.")
if (!is.numeric(zero_threshold) || zero_threshold < 0 || zero_threshold > 1) stop("'zero_threshold' must be numeric between 0 and 1.")
if (!is.numeric(skew_threshold) || skew_threshold < 0) stop("'skew_threshold' must be non-negative numeric.")
if (!is.logical(verbose)) stop("'verbose' must be a logical value (TRUE or FALSE).")
# --- 2. Prepare Data and Identify Variables ---
if (!is.factor(data[[group]])) {
if (verbose) message("Converting grouping variable '", group, "' to factor.")
data[[group]] <- as.factor(data[[group]])
}
if (nlevels(data[[group]]) < 2) stop("Grouping variable '", group, "' must have at least two levels.")
if (!is.null(group_ref)) {
if (!(group_ref %in% levels(data[[group]]))) {
stop(sprintf("'group_ref' '%s' is not a level of the group variable '%s'.", group_ref, group))
}
}
if (is.null(vars_to_test)) {
var_names <- setdiff(names(data), c(group, covariates))
} else {
var_names <- vars_to_test
}
if (length(var_names) == 0) {
warning("No variables identified for testing after excluding 'group' and 'covariates'.")
return(data.frame(Variable = character(), Type = character(), Covariates_Used = character(),
n_obs = integer(), Status = character(), p_value = numeric(),
p_value_FDR = numeric(), Significant = character(),
Gamma_Shift_Warning = character(), Convergence_Warning = character(),
stringsAsFactors = FALSE))
}
if (verbose) {
message(sprintf("Starting analysis of %d variables with '%s' as the grouping variable.",
length(var_names), group))
if (length(covariates) > 0) {
message(sprintf("Using %d covariates: %s", length(covariates), paste(covariates, collapse = ", ")))
}
}
# --- 3. Internal Helper Function: Model Fitting and LRT ---
fit_models_and_lrt <- function(temp_data, formula_full, formula_null, type, variable, verbose,
zero_threshold, skew_threshold) {
gamma_shift_warning <- FALSE
convergence_warning <- FALSE
p_val <- NA_real_
model_status <- "OK"
type_used <- type
data_for_model <- temp_data
outcome <- data_for_model$y
res <- tryCatch({
if (type == "Continuous") {
zero_prop <- mean(outcome == 0, na.rm = TRUE)
skew_val <- tryCatch(moments::skewness(outcome, na.rm = TRUE), error = function(e) NA_real_)
use_gamma <- !is.na(skew_val) && (zero_prop > zero_threshold) && (abs(skew_val) > skew_threshold)
if (is.na(skew_val) && verbose) {
message(sprintf("Skewness calculation failed for '%s'. Using Gaussian GLM.", variable))
}
if (use_gamma) {
min_val <- min(outcome, na.rm = TRUE)
if (min_val <= 0) {
shift_amount <- max(1e-6, abs(min_val) * 1.01 + 1e-6)
outcome <- outcome + shift_amount
data_for_model$y <- outcome
gamma_shift_warning <- TRUE
if (verbose) warning(sprintf("Variable '%s': Contains non-positive values. Added shift of ~%s for Gamma GLM fitting.",
variable, format(shift_amount, digits = 2)), call. = FALSE, immediate. = TRUE)
}
model_full <- stats::glm(formula = formula_full, data = data_for_model, family = stats::Gamma(link = "log"))
model_null <- stats::glm(formula = formula_null, data = data_for_model, family = stats::Gamma(link = "log"))
type_used <- "Continuous (Gamma GLM used)"
if (!model_full$converged || !model_null$converged) {
convergence_warning <- TRUE
if (verbose) warning(sprintf("Gamma GLM did not converge for '%s'.", variable), call. = FALSE)
}
} else {
model_full <- stats::glm(formula = formula_full, data = data_for_model, family = stats::gaussian(link = "identity"))
model_null <- stats::glm(formula = formula_null, data = data_for_model, family = stats::gaussian(link = "identity"))
type_used <- "Continuous (Gaussian GLM used)"
if (!model_full$converged || !model_null$converged) {
convergence_warning <- TRUE
if (verbose) warning(sprintf("Gaussian GLM did not converge for '%s'.", variable), call. = FALSE)
}
}
} else if (type == "Binary") {
model_full <- stats::glm(formula = formula_full, data = data_for_model, family = stats::binomial(link = "logit"))
model_null <- stats::glm(formula = formula_null, data = data_for_model, family = stats::binomial(link = "logit"))
type_used <- "Binary"
if (!model_full$converged || !model_null$converged) {
convergence_warning <- TRUE
if (verbose) warning(sprintf("Binomial GLM did not converge for '%s'.", variable), call. = FALSE)
}
} else if (type == "Ordinal") {
model_full <- MASS::polr(formula = formula_full, data = data_for_model, method = "logistic", Hess = TRUE)
model_null <- MASS::polr(formula = formula_null, data = data_for_model, method = "logistic", Hess = TRUE)
type_used <- "Ordinal"
} else if (type == "Categorical") {
model_full <- nnet::multinom(formula = formula_full, data = data_for_model, trace = FALSE)
model_null <- nnet::multinom(formula = formula_null, data = data_for_model, trace = FALSE)
type_used <- "Categorical"
if ((!is.null(model_full$convergence) && model_full$convergence != 0) ||
(!is.null(model_null$convergence) && model_null$convergence != 0)) {
convergence_warning <- TRUE
if (verbose) warning(sprintf("Multinomial model did not fully converge for '%s'.", variable), call. = FALSE)
}
} else {
stop("Internal error: Unknown model type.")
}
# --- Likelihood Ratio Test ---
if (inherits(model_full, "glm")) {
lrt_result <- suppressWarnings(stats::anova(model_null, model_full, test = "LRT"))
if (!is.null(lrt_result) && "Pr(>Chi)" %in% names(lrt_result) && nrow(lrt_result) > 1) {
candidate <- lrt_result$"Pr(>Chi)"[2]
if (!is.na(candidate) && is.finite(candidate)) {
p_val <- candidate
} else {
if (!is.na(lrt_result$Df[2]) && lrt_result$Df[2] <= 0) {
model_status <- "No DF improvement/equivalent models"
p_val <- 1.0
} else {
model_status <- "LRT p-value calculation failed (anova)"
p_val <- NA_real_
}
}
} else {
model_status <- "LRT failed (anova output invalid/models identical)"
p_val <- NA_real_
}
} else {
ll_full <- tryCatch(stats::logLik(model_full), error = function(e) NA)
ll_null <- tryCatch(stats::logLik(model_null), error = function(e) NA)
if (is.na(ll_full) || is.na(ll_null)) {
model_status <- "Log-Likelihood calculation failed (NA/NaN)"
if (inherits(model_full, "polr")) {
model_status <- paste(model_status, "- potential polr convergence?")
convergence_warning <- TRUE
}
p_val <- NA_real_
} else {
lr_stat <- 2 * (as.numeric(ll_full) - as.numeric(ll_null))
df_full <- attr(ll_full, "df")
df_null <- attr(ll_null, "df")
if (is.null(df_full) || is.null(df_null)) {
model_status <- "Could not retrieve degrees of freedom for LRT"
p_val <- NA_real_
} else {
df_diff <- df_full - df_null
if (df_diff > .Machine$double.eps^0.5) {
if (lr_stat < -1e-8) {
model_status <- sprintf("Negative LRT statistic (%.2g)", lr_stat)
p_val <- NA_real_
} else {
p_val <- stats::pchisq(max(0, lr_stat), df = df_diff, lower.tail = FALSE)
}
} else {
p_val <- 1.0
model_status <- "No DF improvement"
}
}
}
}
list(p_value = p_val, model_status = model_status, type_used = type_used,
gamma_shift_warning = gamma_shift_warning, convergence_warning = convergence_warning)
}, error = function(e) {
msg <- paste("Error:", gsub("[\\r\\n\\t]+", " ", conditionMessage(e)))
if (verbose) message(sprintf("Model fitting/LRT failed for '%s'. Status: %s", variable, msg))
list(p_value = NA_real_, model_status = msg, type_used = type,
gamma_shift_warning = gamma_shift_warning, convergence_warning = convergence_warning)
})
return(res)
}
# --- 4. Loop Through Variables ---
res_list <- vector("list", length(var_names))
names(res_list) <- var_names
for (i in seq_along(var_names)) {
variable <- var_names[i]
if (verbose && length(var_names) > 5 && i %% ceiling(length(var_names) / 10) == 1) {
message(sprintf("Processing variable %d of %d: %s", i, length(var_names), variable))
}
# Initialize default result for the variable
res <- data.frame(
Variable = variable,
Type = "Unknown",
Covariates_Used = paste(covariates, collapse = ", "),
n_obs = 0L,
Status = "Not processed",
p_value = NA_real_,
p_value_FDR = NA_real_,
Significant = "No",
Gamma_Shift_Warning = "No",
Convergence_Warning = "No",
stringsAsFactors = FALSE
)
# --- Data Extraction and Missing Data Handling ---
current_cols <- c(variable, group, covariates)
full_data <- tryCatch(data[, current_cols, drop = FALSE], error = function(e) NULL)
if (is.null(full_data)) {
res$Status <- paste("Error accessing column:", variable)
res_list[[variable]] <- res
next
}
complete_idx <- stats::complete.cases(full_data)
n_complete <- sum(complete_idx)
res$n_obs <- n_complete
min_obs_needed <- (nlevels(data[[group]]) - 1) + length(covariates) + nlevels(data[[group]]) + 5
if (n_complete < min_obs_needed || n_complete < (2 * nlevels(data[[group]]))) {
res$Status <- "Low N or too many NAs for model complexity"
res_list[[variable]] <- res
next
}
temp_data <- full_data[complete_idx, , drop = FALSE]
names(temp_data)[names(temp_data) == variable] <- "y"
temp_data[[group]] <- factor(temp_data[[group]])
# Check if group has at least two levels after removing NAs
if (nlevels(temp_data[[group]]) < 2) {
res$Status <- "Grouping variable has less than 2 levels after removing NAs"
res_list[[variable]] <- res
next
}
# Set reference level for group if specified and present
if (!is.null(group_ref) && group_ref %in% levels(temp_data[[group]])) {
temp_data[[group]] <- relevel(temp_data[[group]], ref = group_ref)
} else if (!is.null(group_ref)) {
warning(sprintf("For variable '%s', the specified 'group_ref' '%s' is not present in the data after removing NAs. Using default reference level.", variable, group_ref))
}
x <- temp_data$y
# --- Check for Constant Outcome Variable ---
is_constant <- FALSE
if (is.numeric(x)) {
if (is.na(stats::sd(x, na.rm = TRUE)) || stats::sd(x, na.rm = TRUE) < .Machine$double.eps^0.5) {
is_constant <- TRUE
}
} else {
if (length(unique(stats::na.omit(x))) <= 1) is_constant <- TRUE
}
if (is_constant) {
res$Status <- "Constant Outcome Variable"
res$Type <- if (is.numeric(x)) "Numeric Constant" else "Factor/Char Constant"
res_list[[variable]] <- res
next
}
# --- Determine Outcome Variable Type ---
type <- "Unknown"
if (is.numeric(x)) {
unique_vals <- unique(stats::na.omit(x))
n_unique <- length(unique_vals)
is_int_like_binary <- n_unique == 2 && all(abs(unique_vals - round(unique_vals)) < .Machine$double.eps^0.5)
if (is_int_like_binary) {
type <- "Binary"
temp_data$y <- factor(temp_data$y, levels = sort(unique_vals))
} else {
type <- "Continuous"
}
} else if (is.factor(x)) {
current_levels <- levels(droplevels(factor(x)))
n_levels <- length(current_levels)
original_levels <- levels(data[[variable]])
if (is.ordered(x)) {
if (n_levels < 2) {
res$Status <- "Ordinal outcome with < 2 levels after NA removal"
res_list[[variable]] <- res
next
}
type <- "Ordinal"
temp_data$y <- factor(temp_data$y, levels = intersect(original_levels, current_levels), ordered = TRUE)
} else {
if (n_levels == 2) {
type <- "Binary"
temp_data$y <- factor(temp_data$y, levels = current_levels)
} else if (n_levels > 2) {
type <- "Categorical"
temp_data$y <- stats::relevel(factor(temp_data$y, levels = current_levels), ref = current_levels[1])
} else {
res$Status <- "Factor outcome with < 2 levels after NA removal"
res_list[[variable]] <- res
next
}
}
} else if (is.character(x)) {
temp_data$y <- factor(temp_data$y)
current_levels <- levels(temp_data$y)
n_levels <- length(current_levels)
if (n_levels == 2) {
type <- "Binary"
} else if (n_levels > 2) {
type <- "Categorical"
temp_data$y <- stats::relevel(temp_data$y, ref = current_levels[1])
} else {
res$Status <- "Character outcome with < 2 levels after NA removal"
res_list[[variable]] <- res
next
}
} else {
res$Status <- "Unsupported outcome variable type"
res$Type <- class(x)[1]
res_list[[variable]] <- res
next
}
res$Type <- type
# --- Ensure Covariates are Factor or Numeric ---
for (covar in covariates) {
if (!is.numeric(temp_data[[covar]]) && !is.factor(temp_data[[covar]])) {
if (verbose) message("Converting covariate '", covar, "' to factor for variable '", variable, "'.")
temp_data[[covar]] <- factor(temp_data[[covar]])
}
}
# --- Construct Model Formulas ---
terms_full <- c(group, covariates)
terms_null <- if (length(covariates) > 0) covariates else "1"
formula_full <- tryCatch(stats::reformulate(termlabels = terms_full, response = "y"),
error = function(e) NULL)
formula_null <- tryCatch(stats::reformulate(termlabels = terms_null, response = "y"),
error = function(e) NULL)
if (is.null(formula_full) || is.null(formula_null)) {
res$Status <- "Error: Failed to construct model formulas"
res_list[[variable]] <- res
next
}
# --- Fit Models and Perform LRT via Helper Function ---
model_out <- fit_models_and_lrt(temp_data, formula_full, formula_null, type, variable, verbose,
zero_threshold, skew_threshold)
res$p_value <- model_out$p_value
res$Status <- model_out$model_status
res$Type <- model_out$type_used
if (model_out$gamma_shift_warning) res$Gamma_Shift_Warning <- "Yes"
if (model_out$convergence_warning) res$Convergence_Warning <- "Yes"
res_list[[variable]] <- res
# --- Periodic Garbage Collection ---
if (i %% 50 == 0 && (nrow(data) * ncol(data) > 1e6 || length(var_names) > 100)) {
if (verbose) message("Running garbage collection...")
gc(verbose = FALSE)
}
}
# --- 5. Combine and Adjust Results ---
res_df <- do.call(rbind, res_list)
rownames(res_df) <- NULL
valid_idx <- which(!is.na(res_df$p_value) & is.finite(res_df$p_value))
if (length(valid_idx) > 0) {
res_df$p_value_FDR[valid_idx] <- stats::p.adjust(res_df$p_value[valid_idx], method = "fdr")
res_df$Significant <- ifelse(!is.na(res_df$p_value_FDR) &
res_df$p_value_FDR < alpha, "Yes", "No")
res_df$Significant[is.na(res_df$p_value)] <- "No"
} else {
res_df$p_value_FDR <- NA_real_
res_df$Significant <- "No"
}
# --- 6. Round and Format P-values ---
res_df$p_value <- ifelse(!is.na(res_df$p_value) & res_df$p_value < 0.001,
"<0.001", sprintf("%.3f", res_df$p_value))
res_df$p_value_FDR <- ifelse(!is.na(res_df$p_value_FDR) & res_df$p_value_FDR < 0.001,
"<0.001", sprintf("%.3f", res_df$p_value_FDR))
final_order <- c("Variable", "Type", "Covariates_Used", "n_obs", "Status",
"p_value", "p_value_FDR", "Significant",
"Gamma_Shift_Warning", "Convergence_Warning")
if (!all(final_order %in% names(res_df))) {
warning("Internal issue: Some expected result columns are missing.")
} else {
res_df <- res_df[, final_order, drop = FALSE]
}
if (verbose) message("Analysis finished. Returning results table.")
return(res_df)
}
# Compare brain variables across groups (cognitive)
compare_groups_auto_v4(imputed_data[c(1, 5, 60:90)], group = "cognitive", covariates = "age_pcr")
## Variable
## 1 right_accumbens_area
## 2 left_accumbens_area
## 3 right_amygdala
## 4 left_amygdala
## 5 right_cerebellum_exterior
## 6 left_cerebellum_exterior
## 7 right_hippocampus
## 8 left_hippocampus
## 9 right_putamen
## 10 left_putamen
## 11 right_thalamus_proper
## 12 left_thalamus_proper
## 13 fornix_right
## 14 fornix_left
## 15 anterior_limb_of_internal_capsule_right
## 16 anterior_limb_of_internal_capsule_left
## 17 posterior_limb_of_internal_capsule_inc_cerebral_peduncle_right
## 18 posterior_limb_of_internal_capsule_inc_cerebral_peduncle_left
## 19 corpus_callosum
## 20 right_a_cg_g_anterior_cingulate_gyrus
## 21 left_a_cg_g_anterior_cingulate_gyrus
## 22 right_a_ins_anterior_insula
## 23 left_a_ins_anterior_insula
## 24 right_an_g_angular_gyrus
## 25 left_an_g_angular_gyrus
## 26 right_cun_cuneus
## 27 left_cun_cuneus
## 28 right_ent_entorhinal_area
## 29 left_ent_entorhinal_area
## 30 right_g_re_gyrus_rectus
## 31 left_g_re_gyrus_rectus
## Type Covariates_Used n_obs Status p_value
## 1 Continuous (Gaussian GLM used) age_pcr 463 OK 0.884
## 2 Continuous (Gaussian GLM used) age_pcr 463 OK 0.497
## 3 Continuous (Gaussian GLM used) age_pcr 463 OK 0.668
## 4 Continuous (Gaussian GLM used) age_pcr 463 OK 0.318
## 5 Continuous (Gaussian GLM used) age_pcr 463 OK 0.277
## 6 Continuous (Gaussian GLM used) age_pcr 463 OK 0.251
## 7 Continuous (Gaussian GLM used) age_pcr 463 OK 0.134
## 8 Continuous (Gaussian GLM used) age_pcr 463 OK 0.134
## 9 Continuous (Gaussian GLM used) age_pcr 463 OK 0.948
## 10 Continuous (Gaussian GLM used) age_pcr 463 OK 0.439
## 11 Continuous (Gaussian GLM used) age_pcr 463 OK 0.032
## 12 Continuous (Gaussian GLM used) age_pcr 463 OK 0.058
## 13 Continuous (Gaussian GLM used) age_pcr 463 OK 0.410
## 14 Continuous (Gaussian GLM used) age_pcr 463 OK 0.584
## 15 Continuous (Gaussian GLM used) age_pcr 463 OK 0.770
## 16 Continuous (Gaussian GLM used) age_pcr 463 OK 0.528
## 17 Continuous (Gaussian GLM used) age_pcr 463 OK 0.789
## 18 Continuous (Gaussian GLM used) age_pcr 463 OK 0.561
## 19 Continuous (Gaussian GLM used) age_pcr 463 OK 0.492
## 20 Continuous (Gaussian GLM used) age_pcr 463 OK 0.511
## 21 Continuous (Gaussian GLM used) age_pcr 463 OK 0.281
## 22 Continuous (Gaussian GLM used) age_pcr 463 OK 0.283
## 23 Continuous (Gaussian GLM used) age_pcr 463 OK 0.395
## 24 Continuous (Gaussian GLM used) age_pcr 463 OK 0.901
## 25 Continuous (Gaussian GLM used) age_pcr 463 OK 0.565
## 26 Continuous (Gaussian GLM used) age_pcr 463 OK 0.606
## 27 Continuous (Gaussian GLM used) age_pcr 463 OK 0.721
## 28 Continuous (Gaussian GLM used) age_pcr 463 OK 0.671
## 29 Continuous (Gaussian GLM used) age_pcr 463 OK 0.820
## 30 Continuous (Gaussian GLM used) age_pcr 463 OK 0.252
## 31 Continuous (Gaussian GLM used) age_pcr 463 OK 0.210
## p_value_FDR Significant Gamma_Shift_Warning Convergence_Warning
## 1 0.931 No No No
## 2 0.854 No No No
## 3 0.867 No No No
## 4 0.854 No No No
## 5 0.854 No No No
## 6 0.854 No No No
## 7 0.854 No No No
## 8 0.854 No No No
## 9 0.948 No No No
## 10 0.854 No No No
## 11 0.854 No No No
## 12 0.854 No No No
## 13 0.854 No No No
## 14 0.854 No No No
## 15 0.905 No No No
## 16 0.854 No No No
## 17 0.905 No No No
## 18 0.854 No No No
## 19 0.854 No No No
## 20 0.854 No No No
## 21 0.854 No No No
## 22 0.854 No No No
## 23 0.854 No No No
## 24 0.931 No No No
## 25 0.854 No No No
## 26 0.854 No No No
## 27 0.894 No No No
## 28 0.867 No No No
## 29 0.908 No No No
## 30 0.854 No No No
## 31 0.854 No No No
# Compare brain variables across groups (age_interval)
compare_groups_auto_v4(imputed_data[c(6, 60:90)], group = "age_interval")
## Variable
## 1 right_accumbens_area
## 2 left_accumbens_area
## 3 right_amygdala
## 4 left_amygdala
## 5 right_cerebellum_exterior
## 6 left_cerebellum_exterior
## 7 right_hippocampus
## 8 left_hippocampus
## 9 right_putamen
## 10 left_putamen
## 11 right_thalamus_proper
## 12 left_thalamus_proper
## 13 fornix_right
## 14 fornix_left
## 15 anterior_limb_of_internal_capsule_right
## 16 anterior_limb_of_internal_capsule_left
## 17 posterior_limb_of_internal_capsule_inc_cerebral_peduncle_right
## 18 posterior_limb_of_internal_capsule_inc_cerebral_peduncle_left
## 19 corpus_callosum
## 20 right_a_cg_g_anterior_cingulate_gyrus
## 21 left_a_cg_g_anterior_cingulate_gyrus
## 22 right_a_ins_anterior_insula
## 23 left_a_ins_anterior_insula
## 24 right_an_g_angular_gyrus
## 25 left_an_g_angular_gyrus
## 26 right_cun_cuneus
## 27 left_cun_cuneus
## 28 right_ent_entorhinal_area
## 29 left_ent_entorhinal_area
## 30 right_g_re_gyrus_rectus
## 31 left_g_re_gyrus_rectus
## Type Covariates_Used n_obs Status p_value
## 1 Continuous (Gaussian GLM used) 463 OK 0.008
## 2 Continuous (Gaussian GLM used) 463 OK 0.081
## 3 Continuous (Gaussian GLM used) 463 OK 0.609
## 4 Continuous (Gaussian GLM used) 463 OK 0.358
## 5 Continuous (Gaussian GLM used) 463 OK 0.095
## 6 Continuous (Gaussian GLM used) 463 OK 0.100
## 7 Continuous (Gaussian GLM used) 463 OK 0.002
## 8 Continuous (Gaussian GLM used) 463 OK <0.001
## 9 Continuous (Gaussian GLM used) 463 OK 0.187
## 10 Continuous (Gaussian GLM used) 463 OK 0.030
## 11 Continuous (Gaussian GLM used) 463 OK <0.001
## 12 Continuous (Gaussian GLM used) 463 OK <0.001
## 13 Continuous (Gaussian GLM used) 463 OK <0.001
## 14 Continuous (Gaussian GLM used) 463 OK <0.001
## 15 Continuous (Gaussian GLM used) 463 OK <0.001
## 16 Continuous (Gaussian GLM used) 463 OK <0.001
## 17 Continuous (Gaussian GLM used) 463 OK 0.009
## 18 Continuous (Gaussian GLM used) 463 OK 0.068
## 19 Continuous (Gaussian GLM used) 463 OK 0.007
## 20 Continuous (Gaussian GLM used) 463 OK <0.001
## 21 Continuous (Gaussian GLM used) 463 OK <0.001
## 22 Continuous (Gaussian GLM used) 463 OK <0.001
## 23 Continuous (Gaussian GLM used) 463 OK <0.001
## 24 Continuous (Gaussian GLM used) 463 OK <0.001
## 25 Continuous (Gaussian GLM used) 463 OK 0.028
## 26 Continuous (Gaussian GLM used) 463 OK 0.004
## 27 Continuous (Gaussian GLM used) 463 OK 0.458
## 28 Continuous (Gaussian GLM used) 463 OK 0.131
## 29 Continuous (Gaussian GLM used) 463 OK 0.115
## 30 Continuous (Gaussian GLM used) 463 OK 0.040
## 31 Continuous (Gaussian GLM used) 463 OK 0.577
## p_value_FDR Significant Gamma_Shift_Warning Convergence_Warning
## 1 0.016 Yes No No
## 2 0.114 No No No
## 3 0.609 No No No
## 4 0.397 No No No
## 5 0.128 No No No
## 6 0.130 No No No
## 7 0.004 Yes No No
## 8 <0.001 Yes No No
## 9 0.215 No No No
## 10 0.049 Yes No No
## 11 <0.001 Yes No No
## 12 <0.001 Yes No No
## 13 <0.001 Yes No No
## 14 <0.001 Yes No No
## 15 <0.001 Yes No No
## 16 <0.001 Yes No No
## 17 0.016 Yes No No
## 18 0.100 No No No
## 19 0.014 Yes No No
## 20 <0.001 Yes No No
## 21 <0.001 Yes No No
## 22 <0.001 Yes No No
## 23 <0.001 Yes No No
## 24 0.001 Yes No No
## 25 0.049 Yes No No
## 26 0.009 Yes No No
## 27 0.489 No No No
## 28 0.156 No No No
## 29 0.143 No No No
## 30 0.061 No No No
## 31 0.596 No No No
# Compare cognitive variables across groups (cognitive)
compare_groups_auto_v4(imputed_data[c(1, 5, 23:59)], group = "cognitive", covariates = "age_pcr")
## Variable Type Covariates_Used n_obs
## 1 listaprimerrec Continuous (Gaussian GLM used) age_pcr 463
## 2 listaaprendizaje Continuous (Gaussian GLM used) age_pcr 463
## 3 listacp Continuous (Gaussian GLM used) age_pcr 463
## 4 listalp Continuous (Gaussian GLM used) age_pcr 463
## 5 listarecon Continuous (Gaussian GLM used) age_pcr 463
## 6 corsidirecto Continuous (Gaussian GLM used) age_pcr 463
## 7 corsiinverso Continuous (Gaussian GLM used) age_pcr 463
## 8 cactusvivos Continuous (Gaussian GLM used) age_pcr 463
## 9 cactusinanim Continuous (Gaussian GLM used) age_pcr 463
## 10 otverbaltpo Continuous (Gaussian GLM used) age_pcr 463
## 11 otverbalerr Continuous (Gaussian GLM used) age_pcr 463
## 12 otvisualtpo Continuous (Gaussian GLM used) age_pcr 463
## 13 otvisualerr Continuous (Gaussian GLM used) age_pcr 463
## 14 otmentaltpo Continuous (Gaussian GLM used) age_pcr 463
## 15 otmentalerr Continuous (Gaussian GLM used) age_pcr 463
## 16 otvismenttpo Continuous (Gaussian GLM used) age_pcr 463
## 17 otvismenterr Continuous (Gaussian GLM used) age_pcr 463
## 18 otswitchtpo Continuous (Gaussian GLM used) age_pcr 463
## 19 otswitcherr Continuous (Gaussian GLM used) age_pcr 463
## 20 x5dreadtpo Continuous (Gaussian GLM used) age_pcr 463
## 21 x5dreaderr Continuous (Gaussian GLM used) age_pcr 463
## 22 x5dcounttpo Continuous (Gaussian GLM used) age_pcr 463
## 23 x5dcounterr Continuous (Gaussian GLM used) age_pcr 463
## 24 x5dfoctpo Continuous (Gaussian GLM used) age_pcr 463
## 25 x5dfocerr Continuous (Gaussian GLM used) age_pcr 463
## 26 x5dswitchtpo Continuous (Gaussian GLM used) age_pcr 463
## 27 x5dswitcherr Continuous (Gaussian GLM used) age_pcr 463
## 28 dscorr Continuous (Gaussian GLM used) age_pcr 463
## 29 dsomis Continuous age_pcr 463
## 30 dscomis Continuous (Gaussian GLM used) age_pcr 463
## 31 torremov Continuous (Gaussian GLM used) age_pcr 463
## 32 torretpo Continuous (Gaussian GLM used) age_pcr 463
## 33 bostonsc Continuous (Gaussian GLM used) age_pcr 463
## 34 bostonlat Continuous (Gaussian GLM used) age_pcr 463
## 35 bostonsemerr Continuous (Gaussian GLM used) age_pcr 463
## 36 bostonfonerr Continuous age_pcr 463
## 37 fluencia Continuous (Gaussian GLM used) age_pcr 463
## Status p_value p_value_FDR Significant Gamma_Shift_Warning
## 1 OK <0.001 <0.001 Yes No
## 2 OK <0.001 <0.001 Yes No
## 3 OK <0.001 <0.001 Yes No
## 4 OK <0.001 <0.001 Yes No
## 5 OK <0.001 <0.001 Yes No
## 6 OK 0.191 0.202 No No
## 7 OK <0.001 <0.001 Yes No
## 8 OK <0.001 <0.001 Yes No
## 9 OK <0.001 <0.001 Yes No
## 10 OK <0.001 <0.001 Yes No
## 11 OK <0.001 <0.001 Yes No
## 12 OK <0.001 <0.001 Yes No
## 13 OK 0.041 0.046 Yes No
## 14 OK <0.001 <0.001 Yes No
## 15 OK <0.001 <0.001 Yes No
## 16 OK <0.001 <0.001 Yes No
## 17 OK <0.001 <0.001 Yes No
## 18 OK <0.001 <0.001 Yes No
## 19 OK <0.001 <0.001 Yes No
## 20 OK <0.001 <0.001 Yes No
## 21 OK 0.349 0.360 No No
## 22 OK <0.001 <0.001 Yes No
## 23 OK 0.826 0.826 No No
## 24 OK <0.001 <0.001 Yes No
## 25 OK 0.001 0.001 Yes No
## 26 OK <0.001 <0.001 Yes No
## 27 OK <0.001 <0.001 Yes No
## 28 OK <0.001 <0.001 Yes No
## 29 Error: NA/NaN/I f i 'x' NA NA No Yes
## 30 OK 0.057 0.062 No No
## 31 OK <0.001 <0.001 Yes No
## 32 OK <0.001 <0.001 Yes No
## 33 OK <0.001 <0.001 Yes No
## 34 OK <0.001 <0.001 Yes No
## 35 OK <0.001 <0.001 Yes No
## 36 Error: NA/NaN/I f i 'x' NA NA No Yes
## 37 OK <0.001 <0.001 Yes No
## Convergence_Warning
## 1 No
## 2 No
## 3 No
## 4 No
## 5 No
## 6 No
## 7 No
## 8 No
## 9 No
## 10 No
## 11 No
## 12 No
## 13 No
## 14 No
## 15 No
## 16 No
## 17 No
## 18 No
## 19 No
## 20 No
## 21 No
## 22 No
## 23 No
## 24 No
## 25 No
## 26 No
## 27 No
## 28 No
## 29 No
## 30 No
## 31 No
## 32 No
## 33 No
## 34 No
## 35 No
## 36 No
## 37 No
# Compare symptoms variables across groups (cognitive)
compare_groups_auto_v4(imputed_data[c(1:22)], group = "cognitive", covariates = "age_pcr")
## Variable Type Covariates_Used
## 1 de800cog Continuous (Gaussian GLM used) age_pcr
## 2 images Continuous (Gaussian GLM used) age_pcr
## 3 age_2024 Continuous (Gaussian GLM used) age_pcr
## 4 age_interval Continuous (Gaussian GLM used) age_pcr
## 5 anosmia Continuous (Gaussian GLM used) age_pcr
## 6 risk_hospital_icu Continuous age_pcr
## 7 vaccine_before_study Continuous (Gaussian GLM used) age_pcr
## 8 covid_before_vaccination Binary age_pcr
## 9 fever Binary age_pcr
## 10 cough Binary age_pcr
## 11 muscle_pain Binary age_pcr
## 12 breath_dif Binary age_pcr
## 13 smell_lost Binary age_pcr
## 14 taste_lost Binary age_pcr
## 15 pcr Binary age_pcr
## 16 pcr_num Binary age_pcr
## 17 covid_variant Continuous (Gaussian GLM used) age_pcr
## 18 vaccine_1 Continuous (Gaussian GLM used) age_pcr
## 19 vaccine_2 Continuous (Gaussian GLM used) age_pcr
## 20 vaccine_3 Continuous (Gaussian GLM used) age_pcr
## n_obs Status p_value p_value_FDR Significant
## 1 463 OK <0.001 <0.001 Yes
## 2 463 OK 0.235 0.446 No
## 3 463 OK 0.483 0.765 No
## 4 463 OK 0.132 0.314 No
## 5 463 OK 0.089 0.240 No
## 6 463 Error: NA/NaN/I f i 'x' NA NA No
## 7 463 OK 0.015 0.147 No
## 8 463 OK 0.081 0.240 No
## 9 463 OK 0.227 0.446 No
## 10 463 OK 0.811 0.873 No
## 11 463 OK 0.606 0.873 No
## 12 463 OK 0.058 0.240 No
## 13 463 OK 0.725 0.873 No
## 14 463 OK 0.357 0.617 No
## 15 462 OK 0.064 0.240 No
## 16 463 OK 0.087 0.240 No
## 17 463 OK 0.719 0.873 No
## 18 463 OK 0.998 0.998 No
## 19 463 OK 0.827 0.873 No
## 20 463 OK 0.782 0.873 No
## Gamma_Shift_Warning Convergence_Warning
## 1 No No
## 2 No No
## 3 No No
## 4 No No
## 5 No No
## 6 Yes No
## 7 No No
## 8 No No
## 9 No No
## 10 No No
## 11 No No
## 12 No No
## 13 No No
## 14 No No
## 15 No No
## 16 No No
## 17 No No
## 18 No No
## 19 No No
## 20 No No
# Compare symptoms variables across groups (age_interval)
compare_groups_auto_v4(imputed_data[c(1:22)], group = "age_interval")
## Variable Type Covariates_Used
## 1 cognitive Binary
## 2 de800cog Continuous (Gaussian GLM used)
## 3 images Continuous (Gaussian GLM used)
## 4 age_2024 Continuous (Gaussian GLM used)
## 5 age_pcr Continuous (Gaussian GLM used)
## 6 anosmia Continuous (Gaussian GLM used)
## 7 risk_hospital_icu Continuous
## 8 vaccine_before_study Continuous (Gaussian GLM used)
## 9 covid_before_vaccination Binary
## 10 fever Binary
## 11 cough Binary
## 12 muscle_pain Binary
## 13 breath_dif Binary
## 14 smell_lost Binary
## 15 taste_lost Binary
## 16 pcr Binary
## 17 pcr_num Binary
## 18 covid_variant Continuous (Gaussian GLM used)
## 19 vaccine_1 Continuous (Gaussian GLM used)
## 20 vaccine_2 Continuous (Gaussian GLM used)
## 21 vaccine_3 Continuous (Gaussian GLM used)
## n_obs Status p_value p_value_FDR Significant
## 1 463 OK <0.001 <0.001 Yes
## 2 463 OK <0.001 0.001 Yes
## 3 463 OK <0.001 <0.001 Yes
## 4 463 OK <0.001 <0.001 Yes
## 5 463 OK <0.001 <0.001 Yes
## 6 463 OK 0.358 0.447 No
## 7 463 Error: NA/NaN/I f i 'x' NA NA No
## 8 463 OK 0.103 0.159 No
## 9 463 OK 0.537 0.597 No
## 10 463 OK 0.125 0.178 No
## 11 463 OK 0.022 0.055 No
## 12 463 OK 0.034 0.076 No
## 13 463 OK 0.016 0.055 No
## 14 463 OK 0.078 0.142 No
## 15 463 OK 0.020 0.055 No
## 16 462 OK 0.071 0.142 No
## 17 463 OK 0.087 0.144 No
## 18 463 OK 0.844 0.844 No
## 19 463 OK 0.416 0.489 No
## 20 463 OK 0.614 0.646 No
## 21 463 OK 0.315 0.420 No
## Gamma_Shift_Warning Convergence_Warning
## 1 No No
## 2 No No
## 3 No No
## 4 No No
## 5 No No
## 6 No No
## 7 Yes No
## 8 No No
## 9 No No
## 10 No No
## 11 No No
## 12 No No
## 13 No No
## 14 No No
## 15 No No
## 16 No No
## 17 No No
## 18 No No
## 19 No No
## 20 No No
## 21 No No
The dataset (N=463) exhibited minimal missingness for most variables, although some items (e.g., certain “cough” or “PCR” measures) had higher rates of missing data. They were imputed using CART method (see: https://stefvanbuuren.name/fimd/sec-cart.html)
Outlier diagnostics indicated a small subset of extreme values in several cognitive-performance and neuroimaging measures. These did not prevent model convergence overall, but a few variables required a shift when attempting Gamma-based modeling, suggesting skewed distributions and the presence of zeros or negative values. I used false discovery rate (FDR-controlling procedures) in the R syntax (https://link.springer.com/referenceworkentry/10.1007/978-1-4419-9863-7_223)
When grouping by the “cognitive” variable (binary/ordinal classification), we tested multiple subcortical and cortical volumes as outcomes in generalized linear models (Gaussian family) with “age_pcr” as a covariate.
Most brain volumetric measures did not show a statistically significant difference between cognitive groups after false discovery rate (FDR) correction. A single region (e.g., right_thalamus_proper, p=0.032) approached nominal significance but was not significant after multiple-comparison adjustment.
By contrast, grouping participants according to their “age_interval” revealed widespread associations with numerous brain volumes (e.g., hippocampus, thalamus, cerebellum, and other subcortical/cortical areas). Several regions displayed highly significant p-values that remained robust after FDR correction. This finding underscores the substantial impact of age on volumetric measures.
Additionally, certain task-based cognitive measures (e.g., reaction times and error rates in the 5D tasks or OT tasks) also varied significantly across age intervals, indicating age-related differences in cognitive performance.
Exploratory analyses of symptoms (e.g., anosmia, cough, muscle pain) across “age_interval” did not yield consistent significance after FDR correction, although some nominal p-values (e.g., for cough and muscle_pain) were <0.05 uncorrected.
Interestingly, “cognitive” status and certain related variables (e.g., de800cog) differed significantly across age intervals (p<0.001), suggesting that older groups may show different cognitive test profiles.
These preliminary results highlight age as a crucial factor influencing both neuroimaging measures and certain cognitive markers.
The “cognitive” grouping alone did not robustly differentiate subcortical volumes once age was accounted for, suggesting that chronological age exerts a stronger effect on structural brain metrics than the cognitive classification in this sample.
A machine learning approach such as a Random Forest can be used to classify participants according to cognitive status using demographic and neuroimaging predictors. This approach can also help identify the most important predictors of cognitive differences.
# Load the randomForest package
library(randomForest)
# Ensure cognitive status is treated as a factor
imputed_data$cognitive <- as.factor(imputed_data$cognitive)
# Select predictors (here we include age_pcr and neuroimaging measures; adjust as needed)
predictors <- imputed_data %>%
select(age_pcr, right_accumbens_area:left_g_re_gyrus_rectus)
# Combine response and predictors into one dataframe
rf_data <- data.frame(cognitive = imputed_data$cognitive, predictors)
# Set seed for reproducibility and train the random forest model
set.seed(123)
rf_model <- randomForest(cognitive ~ ., data = rf_data, importance = TRUE)
# Print the model summary
print(rf_model)
##
## Call:
## randomForest(formula = cognitive ~ ., data = rf_data, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 40.6%
## Confusion matrix:
## 1 2 class.error
## 1 134 97 0.4199134
## 2 91 141 0.3922414
# Plot variable importance to identify key predictors
varImpPlot(rf_model)
In these two plots (yes, I know they are too small to be seen), each point represents the contribution of a particular predictor variable to the random forest model used to classify participants according to their cognitive status. The left panel (Mean Decrease in Accuracy) indicates how much model accuracy would drop if a given variable were excluded, while the right panel (Mean Decrease in Gini) shows how each variable contributes to node purity (i.e., how well it splits the data within the trees). 1. Top Predictors • Age (age_pcr) stands out as the most influential predictor for classifying cognitive status, indicating that chronological age has the strongest impact on the model’s decision-making process. • Subcortical volumes (e.g., hippocampus and thalamus) also rank highly, suggesting that these brain regions are important for differentiating between cognitive groups. 2. Model Performance • The out-of-bag (OOB) error rate of about 40.6% implies a moderate level of predictive accuracy (roughly 59.4% correct classification). While certain variables (like age and hippocampal volumes) are clearly influential, the model still struggles to classify all individuals correctly. 3. Practical Implications • The strong role of age underscores the need to control for or further investigate age-related effects when studying cognitive outcomes. • The importance of hippocampal and thalamic volumes aligns with existing evidence that these structures are linked to cognitive performance, particularly in aging populations. • The moderate overall accuracy suggests that either (a) additional variables or (b) a different modeling approach (e.g., feature engineering, dimension reduction, or other machine learning methods) may be needed to improve classification performance.
Overall, these plots highlight that age and certain subcortical regions are key drivers of the model’s classification decisions, but the relatively high misclassification rate points to a complex interplay of factors influencing cognitive status.
GAMs allow flexible modeling of non-linear effects. For instance, you can explore the non-linear association between age (or age interval) and brain volumes, which may not be captured adequately by linear models.
# Load the mgcv package
library(mgcv)
# Fit a GAM for one brain region (e.g., right_hippocampus) as a function of age_pcr
gam_model <- gam(right_hippocampus ~ s(age_pcr), data = imputed_data)
# Print the summary of the model
summary(gam_model)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## right_hippocampus ~ s(age_pcr)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3823.52 19.28 198.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age_pcr) 3.419 4.326 9.536 3.68e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0801 Deviance explained = 8.69%
## GCV = 1.7373e+05 Scale est. = 1.7207e+05 n = 463
# Plot the fitted smooth along with the residuals
plot(gam_model, residuals = TRUE, pch = 20, cex = 0.5, shade = TRUE)
# Diagnostic plots for the GAM model
par(mfrow = c(2,2))
plot(gam_model, residuals = TRUE, pch = 20, cex = 0.5, shade = TRUE)
# Check model diagnostics
gam.check(gam_model)
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 4 iterations.
## The RMS GCV score gradient at convergence was 0.01721632 .
## The Hessian was positive definite.
## Model rank = 10 / 10
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(age_pcr) 9.00 3.42 0.95 0.12
library(lavaan)
# Convert 'cognitive' to numeric (ensure that the factor levels correspond to numeric values)
imputed_data$cognitive_num <- as.numeric(as.character(imputed_data$cognitive))
# Specify the mediation model using the numeric version of cognitive
mediation_model <- '
# Direct effect
cognitive_num ~ c*age_pcr
# Mediator path
right_hippocampus ~ a*age_pcr
cognitive_num ~ b*right_hippocampus
# Indirect effect (a*b) and total effect
ab := a*b
total := c + (a*b)
'
# Fit the mediation model with robust maximum likelihood estimation
fit_mediation <- sem(mediation_model, data = imputed_data, estimator = "MLR")
summary(fit_mediation, standardized = TRUE, fit.measures = TRUE)
## lavaan 0.6-19 ended normally after 1 iteration
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 5
##
## Number of observations 463
##
## Model Test User Model:
## Standard Scaled
## Test Statistic 0.000 NA
## Degrees of freedom 0 0
##
## Model Test Baseline Model:
##
## Test statistic 67.355 67.354
## Degrees of freedom 3 3
## P-value 0.000 0.000
## Scaling correction factor 1.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 1.000 NA
## Tucker-Lewis Index (TLI) 1.000 NA
##
## Robust Comparative Fit Index (CFI) NA
## Robust Tucker-Lewis Index (TLI) NA
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -3769.059 -3769.059
## Loglikelihood unrestricted model (H1) -3769.059 -3769.059
##
## Akaike (AIC) 7548.117 7548.117
## Bayesian (BIC) 7568.806 7568.806
## Sample-size adjusted Bayesian (SABIC) 7552.937 7552.937
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.000 NA
## 90 Percent confidence interval - lower 0.000 NA
## 90 Percent confidence interval - upper 0.000 NA
## P-value H_0: RMSEA <= 0.050 NA NA
## P-value H_0: RMSEA >= 0.080 NA NA
##
## Robust RMSEA 0.000
## 90 Percent confidence interval - lower NA
## 90 Percent confidence interval - upper NA
## P-value H_0: Robust RMSEA <= 0.050 NA
## P-value H_0: Robust RMSEA >= 0.080 NA
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.000 0.000
##
## Parameter Estimates:
##
## Standard errors Sandwich
## Information bread Observed
## Observed information based on Hessian
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## cognitive_num ~
## age_pcr (c) -0.024 NA -0.024 -0.252
## right_hippocampus ~
## age_pcr (a) -20.512 NA -20.512 -0.252
## cognitive_num ~
## rght_hppcm (b) 0.000 NA 0.000 0.069
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .cognitive_num 0.231 NA 0.231 0.923
## .right_hippcmps 174832.240 NA 174832.240 0.937
##
## Defined Parameters:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## ab -0.002 -0.002 -0.017
## total -0.025 -0.025 -0.269
The mediation model is just‐identified, so all global fit indices (CFI, TLI, RMSEA, SRMR) are perfect by definition and do not provide additional insight. The estimated parameters indicate the following: • Direct Effect (c): The effect of age (age_pcr) on cognitive performance (cognitive_num) is estimated at –0.024 (standardized –0.252), suggesting that, holding other factors constant, an increase in age is directly associated with lower cognitive scores. • Path a: The effect of age on right hippocampal volume is –20.512 (standardized –0.252), indicating that older age is associated with a reduction in hippocampal volume. • Path b: The effect of right hippocampal volume on cognitive performance is estimated at 0.000 (standardized 0.069). This near‐zero coefficient suggests that, in this model, hippocampal volume does not significantly predict cognitive performance once age is accounted for. • Indirect Effect (ab): The product of paths a and b (i.e., the mediated effect) is –0.002 (standardized –0.017), which is very small relative to the total effect. • Total Effect: The sum of the direct and indirect effects is –0.025 (standardized –0.269), which is essentially driven by the direct effect of age on cognition.
While age significantly predicts both lower cognitive performance and reduced hippocampal volume, there is no evidence from this model that hippocampal volume mediates the relationship between age and cognitive performance. The mediation pathway (ab) is negligible, indicating that the association between age and cognition is primarily a direct effect.