Argentina Data Analysis

1. Demographics

Variable Description ID: Unique identifier for the participant. ID_genetics: Genetic identifier related to the participant’s sample or genetic profile. cognitive: Assessment or overall score of cognitive function. de800cog: Result obtained from a specific cognitive test (e.g., DE800 version). images: Indicates the availability or analysis of images (e.g., neuroimaging). DATE_STUDY: Date when the study was conducted. BIRTH: Participant’s birth date. AGE-2024: Participant’s age in the year 2024. AGE-PCR: Age of the participant at the time of the PCR test. AGE_INTERVAL: Interval between age assessments (e.g., between the study date and the PCR test).

2. Symptoms

Variable Description ID Identifier: associated with the symptom evaluation. ANOSMIA: Indicator of loss of smell (anosmia), presented in binary format or on a scale. Anosmia: A second measure or confirmation of the presence of anosmia. RISK-HOSPITAL-ICU: Indicator of the risk of hospitalization or the need for ICU care due to COVID-19. VACCINE_BEFORE_STUDY: Participant’s vaccination status prior to the study. COVID_BEFORE_VACCINATION: History of COVID-19 infection before vaccination. FEVER: Presence of fever. COUGH: Presence of cough. MUSCLE_PAIN: Presence of muscle pain (myalgia). BREATH_DIF: Indication of breathing difficulties (dyspnea). SMELL_LOST: Report of loss of smell. TASTE_LOST: Report of loss of taste. DATE_PCR: Date when the PCR test for COVID-19 was performed. PCR: Qualitative result of the PCR test (e.g., positive or negative). PCR_NUM: Numerical value associated with the PCR test (e.g., Ct value or viral load). COVID-VARIANT: SARS-CoV-2 variant identified (e.g., Alpha, Delta, Omicron). VACCINE_1: Information regarding the first dose of the vaccine (type or brand). VACCINE_2: Information regarding the second dose of the vaccine (type or brand). VACCINE_3: Information regarding the third dose or booster dose of the vaccine.

3. Environmental COVID

Variable Description ID: Identifier related to environmental data or COVID-19 exposure. LISTAPRIMERREC: List referring to the first recognition (possibly of symptoms or initial contact). LISTAAPRENDIZAJE: List associated with the acquisition of knowledge or learning in the context of COVID-19. LISTACP: Indicator or list related to close contacts (CP) or similar parameters. LISTALP: List of parameters or locations (LP), whose definition depends on the study protocol. LISTARECON: List for the recognition of signs or symptoms related to COVID-19. CORSIDIRECTO: Measure or score of direct correlation, possibly related to environmental factors. CORSIINVERSO: Measure or score of inverse correlation related to the same parameters. CACTUSCORRECTAS: Number of correct responses obtained in the “cactus” test. CACTUSVIVOS: Number of responses classified as “living” in the “cactus” test. CACTUSINANIM: Number of responses classified as “inanimate” in the “cactus” test.

4. Cognitive Variables

Variable Description OTVERBALTPO: Response time in the verbal task (OT). OTVERBALERR: Number of errors in the verbal task (OT). OTVISUALTPO: Response time in the visual task (OT). OTVISUALERR: Number of errors in the visual task (OT). OTMENTALTPO: Response time in the mental task (OT). OTMENTALERR: Number of errors in the mental task (OT). OTVISMENTTPO: Response time in the task combining visual and mental stimuli (OT). OTVISMENTERR: Number of errors in the task combining visual and mental stimuli (OT). OTSWITCHTPO: Response time in the switching task (OT evaluation). OTSWITCHERR: Number of errors in the switching task (OT evaluation). 5DREADTPO: Reading time in the designated 5D task. 5DREADERR: Number of errors in the 5D reading task. 5DCOUNTTPO: Time taken in the counting task (5D). 5DCOUNTERR: Number of errors in the counting task (5D). 5DFOCTPO: Execution time in the focus task (5D). 5DFOCERR: Number of errors in the focus task (5D). 5DSWITCHTPO: Response time in the switching task (5D evaluation). 5DSWITCHERR: Number of errors in the switching task (5D evaluation). DSCORR: Number of correct responses in the DS task (e.g., digit span test). DSOMIS: Number of omissions in the DS task. DSCOMIS: Number of commission errors (incorrect responses) in the DS task. TORREMOV: Indicator or number of removals in a tower task (possibly related to response inhibition). TORRETPO: Execution time in the tower task (e.g., Tower of Hanoi or similar). BOSTONSC: Score obtained on the Boston test subscale, possibly related to naming. BOSTONLAT: Latency in the performance of the Boston test. BOSTONSEM: Performance in the semantic component of the Boston test. BOSTONSEMERR: Number of errors in the semantic component of the Boston test. BOSTONFON: Performance in the phonemic component of the Boston test. BOSTONFONERR: Number of errors in the phonemic component of the Boston test. FLUENCIA: Measure of verbal fluency, evaluating the ability to generate words within a specified time.

library(readxl)
library(tidyverse)
library(caret)

# Read the Excel file
dataset <- read_excel("~/Downloads/en_uso8_last_V10.xlsx", 
                      sheet = "dataset", col_types = c("numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "date", "date", "numeric", 
                                                       "numeric", "numeric", "date", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "date", "text", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric", "numeric", "numeric", 
                                                       "numeric"))

library(janitor)
# Rename variables to clean names
dataset <- dataset %>% 
  clean_names()

# Subset the data (selecting specific columns)
data <- dataset[c(2:5, 8:10, 12, 14:22, 24:29, 31:37, 39:65, 67, 69:101)]

# Get the names of the columns
names(data)
##  [1] "id_genetics"                                                   
##  [2] "cognitive"                                                     
##  [3] "de800cog"                                                      
##  [4] "images"                                                        
##  [5] "age_2024"                                                      
##  [6] "age_pcr"                                                       
##  [7] "age_interval"                                                  
##  [8] "anosmia"                                                       
##  [9] "risk_hospital_icu"                                             
## [10] "vaccine_before_study"                                          
## [11] "covid_before_vaccination"                                      
## [12] "fever"                                                         
## [13] "cough"                                                         
## [14] "muscle_pain"                                                   
## [15] "breath_dif"                                                    
## [16] "smell_lost"                                                    
## [17] "taste_lost"                                                    
## [18] "pcr"                                                           
## [19] "pcr_num"                                                       
## [20] "covid_variant"                                                 
## [21] "vaccine_1"                                                     
## [22] "vaccine_2"                                                     
## [23] "vaccine_3"                                                     
## [24] "listaprimerrec"                                                
## [25] "listaaprendizaje"                                              
## [26] "listacp"                                                       
## [27] "listalp"                                                       
## [28] "listarecon"                                                    
## [29] "corsidirecto"                                                  
## [30] "corsiinverso"                                                  
## [31] "cactusvivos"                                                   
## [32] "cactusinanim"                                                  
## [33] "otverbaltpo"                                                   
## [34] "otverbalerr"                                                   
## [35] "otvisualtpo"                                                   
## [36] "otvisualerr"                                                   
## [37] "otmentaltpo"                                                   
## [38] "otmentalerr"                                                   
## [39] "otvismenttpo"                                                  
## [40] "otvismenterr"                                                  
## [41] "otswitchtpo"                                                   
## [42] "otswitcherr"                                                   
## [43] "x5dreadtpo"                                                    
## [44] "x5dreaderr"                                                    
## [45] "x5dcounttpo"                                                   
## [46] "x5dcounterr"                                                   
## [47] "x5dfoctpo"                                                     
## [48] "x5dfocerr"                                                     
## [49] "x5dswitchtpo"                                                  
## [50] "x5dswitcherr"                                                  
## [51] "dscorr"                                                        
## [52] "dsomis"                                                        
## [53] "dscomis"                                                       
## [54] "torremov"                                                      
## [55] "torretpo"                                                      
## [56] "bostonsc"                                                      
## [57] "bostonlat"                                                     
## [58] "bostonsemerr"                                                  
## [59] "bostonfonerr"                                                  
## [60] "fluencia"                                                      
## [61] "right_accumbens_area"                                          
## [62] "left_accumbens_area"                                           
## [63] "right_amygdala"                                                
## [64] "left_amygdala"                                                 
## [65] "right_cerebellum_exterior"                                     
## [66] "left_cerebellum_exterior"                                      
## [67] "right_hippocampus"                                             
## [68] "left_hippocampus"                                              
## [69] "right_putamen"                                                 
## [70] "left_putamen"                                                  
## [71] "right_thalamus_proper"                                         
## [72] "left_thalamus_proper"                                          
## [73] "fornix_right"                                                  
## [74] "fornix_left"                                                   
## [75] "anterior_limb_of_internal_capsule_right"                       
## [76] "anterior_limb_of_internal_capsule_left"                        
## [77] "posterior_limb_of_internal_capsule_inc_cerebral_peduncle_right"
## [78] "posterior_limb_of_internal_capsule_inc_cerebral_peduncle_left" 
## [79] "corpus_callosum"                                               
## [80] "right_a_cg_g_anterior_cingulate_gyrus"                         
## [81] "left_a_cg_g_anterior_cingulate_gyrus"                          
## [82] "right_a_ins_anterior_insula"                                   
## [83] "left_a_ins_anterior_insula"                                    
## [84] "right_an_g_angular_gyrus"                                      
## [85] "left_an_g_angular_gyrus"                                       
## [86] "right_cun_cuneus"                                              
## [87] "left_cun_cuneus"                                               
## [88] "right_ent_entorhinal_area"                                     
## [89] "left_ent_entorhinal_area"                                      
## [90] "right_g_re_gyrus_rectus"                                       
## [91] "left_g_re_gyrus_rectus"
# Load skimr
library(skimr)
# Get a detailed summary of cognitive data:
skim(data[24:60])
Data summary
Name data[24:60]
Number of rows 463
Number of columns 37
_______________________
Column type frequency:
numeric 37
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
listaprimerrec 0 1.00 3.71 1.63 0.00 3.00 4.00 5.00 16.00 ▇▇▁▁▁
listaaprendizaje 0 1.00 23.99 7.35 0.00 19.00 24.00 29.00 41.00 ▁▂▇▇▂
listacp 0 1.00 6.20 2.21 0.00 5.00 6.00 8.00 12.00 ▁▃▇▃▁
listalp 0 1.00 5.48 2.38 0.00 4.00 5.00 7.00 12.00 ▂▅▇▃▁
listarecon 0 1.00 21.21 3.27 0.00 20.00 22.00 23.00 24.00 ▁▁▁▂▇
corsidirecto 0 1.00 5.13 1.34 0.00 4.00 5.00 6.00 11.00 ▁▃▇▁▁
corsiinverso 1 1.00 4.17 1.21 0.00 4.00 4.00 5.00 9.00 ▁▂▇▂▁
cactusvivos 0 1.00 12.67 2.56 0.00 12.00 13.00 14.00 16.00 ▁▁▁▃▇
cactusinanim 0 1.00 15.11 2.46 0.00 14.00 16.00 17.00 17.00 ▁▁▁▁▇
otverbaltpo 0 1.00 185.29 11.42 99.00 180.00 189.00 192.00 215.00 ▁▁▁▇▃
otverbalerr 0 1.00 19.84 0.55 15.00 20.00 20.00 20.00 20.00 ▁▁▁▁▇
otvisualtpo 0 1.00 63.95 31.16 0.00 46.00 55.00 74.00 300.00 ▇▅▁▁▁
otvisualerr 0 1.00 19.70 1.42 0.00 20.00 20.00 20.00 20.00 ▁▁▁▁▇
otmentaltpo 0 1.00 245.58 52.95 19.00 234.50 262.00 278.00 319.00 ▁▁▁▅▇
otmentalerr 0 1.00 18.04 3.03 0.00 17.00 19.00 20.00 20.00 ▁▁▁▂▇
otvismenttpo 0 1.00 240.89 49.22 32.00 226.50 256.00 272.50 332.00 ▁▁▂▇▃
otvismenterr 0 1.00 18.63 2.70 0.00 18.00 20.00 20.00 20.00 ▁▁▁▁▇
otswitchtpo 0 1.00 350.00 53.89 28.00 338.00 366.00 383.00 428.00 ▁▁▁▃▇
otswitcherr 0 1.00 17.65 2.95 0.00 17.00 19.00 20.00 20.00 ▁▁▁▂▇
x5dreadtpo 0 1.00 188.32 20.01 22.00 185.00 193.00 198.00 222.00 ▁▁▁▂▇
x5dreaderr 3 0.99 49.83 1.06 30.00 50.00 50.00 50.00 50.00 ▁▁▁▁▇
x5dcounttpo 2 1.00 165.29 14.23 68.00 162.00 168.00 173.00 200.00 ▁▁▁▇▃
x5dcounterr 2 1.00 49.79 1.51 19.00 50.00 50.00 50.00 50.00 ▁▁▁▁▇
x5dfoctpo 2 1.00 246.04 23.31 89.00 239.00 252.00 259.00 300.00 ▁▁▁▇▃
x5dfocerr 2 1.00 48.28 3.01 3.00 48.00 49.00 50.00 50.00 ▁▁▁▁▇
x5dswitchtpo 2 1.00 271.96 35.53 0.00 257.00 280.00 295.00 350.00 ▁▁▁▇▇
x5dswitcherr 2 1.00 46.58 4.59 0.00 46.00 48.00 49.00 50.00 ▁▁▁▁▇
dscorr 0 1.00 69.16 18.11 0.00 57.00 70.00 82.00 109.00 ▁▂▆▇▂
dsomis 0 1.00 0.35 1.95 0.00 0.00 0.00 0.00 24.00 ▇▁▁▁▁
dscomis 0 1.00 25.93 2.25 0.00 26.00 27.00 27.00 27.00 ▁▁▁▁▇
torremov 1 1.00 341.44 3.93 323.00 340.00 343.00 344.00 351.00 ▁▁▂▇▁
torretpo 0 1.00 264.54 74.67 2.58 240.58 287.58 316.08 352.58 ▁▁▁▅▇
bostonsc 0 1.00 3.20 2.93 0.00 1.00 2.00 5.00 12.00 ▇▃▂▂▁
bostonlat 0 1.00 2.71 1.89 0.00 1.83 2.62 3.52 16.00 ▇▃▁▁▁
bostonsemerr 0 1.00 1.99 2.37 0.00 0.00 1.00 3.00 10.00 ▇▂▁▁▁
bostonfonerr 0 1.00 1.12 1.76 0.00 0.00 0.00 2.00 11.00 ▇▁▁▁▁
fluencia 1 1.00 16.52 4.82 1.00 14.00 16.50 20.00 33.00 ▁▃▇▂▁
# Load DataExplorer
library(DataExplorer)
# Visualize the missing data pattern:
plot_missing(data[24:60])

# Plot density distributions of numeric variables:
plot_density(data[24:60])

# Load dlookr
library(dlookr)
# Diagnose potential data issues
print(diagnose_outlier(data[24:60]), n = 40)
## # A tibble: 37 × 6
##    variables    outliers_cnt outliers_ratio outliers_mean with_mean without_mean
##    <chr>               <int>          <dbl>         <dbl>     <dbl>        <dbl>
##  1 listaprimer…            4          0.864         10.8      3.71         3.64 
##  2 listaaprend…            4          0.864          0.75    24.0         24.2  
##  3 listacp                 8          1.73           0        6.20         6.31 
##  4 listalp                 4          0.864         12        5.48         5.42 
##  5 listarecon             16          3.46           8.75    21.2         21.7  
##  6 corsidirecto            7          1.51           6        5.13         5.12 
##  7 corsiinverso           52         11.2            3.17     4.17         4.3  
##  8 cactusvivos            31          6.70           5.81    12.7         13.2  
##  9 cactusinanim           20          4.32           6.6     15.1         15.5  
## 10 otverbaltpo            20          4.32         157.     185.         187.   
## 11 otverbalerr            49         10.6           18.5     19.8         20    
## 12 otvisualtpo            29          6.26         142.      64.0         58.7  
## 13 otvisualerr            75         16.2           18.2     19.7         20    
## 14 otmentaltpo            44          9.50         110.     246.         260.   
## 15 otmentalerr            18          3.89           7       18.0         18.5  
## 16 otvismenttpo           29          6.26          97.7    241.         250.   
## 17 otvismenterr           24          5.18          10.0     18.6         19.1  
## 18 otswitchtpo            30          6.48         197.     350.         361.   
## 19 otswitcherr            21          4.54           8.43    17.7         18.1  
## 20 x5dreadtpo             29          6.26         138.     188.         192.   
## 21 x5dreaderr             38          8.21          47.9     49.8         50    
## 22 x5dcounttpo            32          6.91         135.     165.         168.   
## 23 x5dcounterr            46          9.94          47.9     49.8         50    
## 24 x5dfoctpo              28          6.05         194.     246.         249.   
## 25 x5dfocerr              21          4.54          39       48.3         48.7  
## 26 x5dswitchtpo           19          4.10         160.     272.         277.   
## 27 x5dswitcherr           31          6.70          34.4     46.6         47.5  
## 28 dscorr                  3          0.648         11.3     69.2         69.5  
## 29 dsomis                 46          9.94           3.54     0.352        0    
## 30 dscomis                68         14.7           21.8     25.9         26.6  
## 31 torremov               18          3.89         330.     341.         342.   
## 32 torretpo               37          7.99          66.8    265.         282.   
## 33 bostonsc                4          0.864         12        3.20         3.12 
## 34 bostonlat              19          4.10           8.04     2.71         2.48 
## 35 bostonsemerr           17          3.67           8.76     1.99         1.73 
## 36 bostonfonerr           22          4.75           6.64     1.12         0.846
## 37 fluencia                8          1.73          17.6     16.5         16.5
# Explore missing data patterns
print(diagnose(data[24:60]), n = 40)
## # A tibble: 37 × 6
##    variables        types missing_count missing_percent unique_count unique_rate
##    <chr>            <chr>         <int>           <dbl>        <int>       <dbl>
##  1 listaprimerrec   nume…             0           0               11      0.0238
##  2 listaaprendizaje nume…             0           0               39      0.0842
##  3 listacp          nume…             0           0               13      0.0281
##  4 listalp          nume…             0           0               13      0.0281
##  5 listarecon       nume…             0           0               17      0.0367
##  6 corsidirecto     nume…             0           0               11      0.0238
##  7 corsiinverso     nume…             1           0.216           11      0.0238
##  8 cactusvivos      nume…             0           0               14      0.0302
##  9 cactusinanim     nume…             0           0               13      0.0281
## 10 otverbaltpo      nume…             0           0               49      0.106 
## 11 otverbalerr      nume…             0           0                5      0.0108
## 12 otvisualtpo      nume…             0           0               99      0.214 
## 13 otvisualerr      nume…             0           0                7      0.0151
## 14 otmentaltpo      nume…             0           0              136      0.294 
## 15 otmentalerr      nume…             0           0               14      0.0302
## 16 otvismenttpo     nume…             0           0              128      0.276 
## 17 otvismenterr     nume…             0           0               13      0.0281
## 18 otswitchtpo      nume…             0           0              134      0.289 
## 19 otswitcherr      nume…             0           0               14      0.0302
## 20 x5dreadtpo       nume…             0           0               65      0.140 
## 21 x5dreaderr       nume…             3           0.648            7      0.0151
## 22 x5dcounttpo      nume…             2           0.432           59      0.127 
## 23 x5dcounterr      nume…             2           0.432            7      0.0151
## 24 x5dfoctpo        nume…             2           0.432           84      0.181 
## 25 x5dfocerr        nume…             2           0.432           15      0.0324
## 26 x5dswitchtpo     nume…             2           0.432          118      0.255 
## 27 x5dswitcherr     nume…             2           0.432           21      0.0454
## 28 dscorr           nume…             0           0               82      0.177 
## 29 dsomis           nume…             0           0               12      0.0259
## 30 dscomis          nume…             0           0               13      0.0281
## 31 torremov         nume…             1           0.216           20      0.0432
## 32 torretpo         nume…             0           0              151      0.326 
## 33 bostonsc         nume…             0           0               13      0.0281
## 34 bostonlat        nume…             0           0              147      0.317 
## 35 bostonsemerr     nume…             0           0               11      0.0238
## 36 bostonfonerr     nume…             0           0               10      0.0216
## 37 fluencia         nume…             1           0.216           30      0.0648
# Summary of the COVID data
summary(data[c(2:23)])
##    cognitive        de800cog         images         age_2024       age_pcr     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :12.0   Min.   :53.00  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:65.0   1st Qu.:62.00  
##  Median :2.000   Median :2.000   Median :2.000   Median :69.0   Median :65.00  
##  Mean   :1.501   Mean   :2.352   Mean   :1.877   Mean   :69.7   Mean   :66.48  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:73.0   3rd Qu.:70.00  
##  Max.   :2.000   Max.   :4.000   Max.   :3.000   Max.   :90.0   Max.   :86.00  
##                                                  NA's   :13                    
##   age_interval      anosmia      risk_hospital_icu vaccine_before_study
##  Min.   :1.000   Min.   :0.000   Min.   :0.0000    Min.   :0.000       
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000    1st Qu.:1.000       
##  Median :2.000   Median :2.000   Median :0.0000    Median :2.000       
##  Mean   :1.922   Mean   :1.734   Mean   :0.3024    Mean   :1.482       
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:0.0000    3rd Qu.:2.000       
##  Max.   :3.000   Max.   :3.000   Max.   :3.0000    Max.   :3.000       
##                  NA's   :1                                             
##  covid_before_vaccination     fever            cough         muscle_pain    
##  Min.   :0.0000           Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000           1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000           Median :0.0000   Median :1.0000   Median :1.0000  
##  Mean   :0.5487           Mean   :0.4323   Mean   :0.5178   Mean   :0.6081  
##  3rd Qu.:1.0000           3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000           Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  NA's   :42               NA's   :42       NA's   :42       NA's   :42      
##    breath_dif       smell_lost       taste_lost        pcr           
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Length:463        
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000   Class :character  
##  Median :0.0000   Median :0.0000   Median :0.000   Mode  :character  
##  Mean   :0.3967   Mean   :0.4561   Mean   :0.399                     
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.000                     
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.000                     
##  NA's   :42       NA's   :42       NA's   :42                        
##     pcr_num       covid_variant     vaccine_1       vaccine_2    
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.0000   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :1.0000   Median :1.000   Median :2.000   Median :2.000  
##  Mean   :0.8377   Mean   :1.348   Mean   :1.996   Mean   :1.819  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :1.0000   Max.   :7.000   Max.   :5.000   Max.   :6.000  
##  NA's   :1                                                       
##    vaccine_3    
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :1.000  
##  Mean   :1.726  
##  3rd Qu.:2.000  
##  Max.   :6.000  
## 
# Summary of the cognitive data
summary(data[24:60])
##  listaprimerrec   listaaprendizaje    listacp          listalp      
##  Min.   : 0.000   Min.   : 0.00    Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 3.000   1st Qu.:19.00    1st Qu.: 5.000   1st Qu.: 4.000  
##  Median : 4.000   Median :24.00    Median : 6.000   Median : 5.000  
##  Mean   : 3.706   Mean   :23.99    Mean   : 6.203   Mean   : 5.477  
##  3rd Qu.: 5.000   3rd Qu.:29.00    3rd Qu.: 8.000   3rd Qu.: 7.000  
##  Max.   :16.000   Max.   :41.00    Max.   :12.000   Max.   :12.000  
##                                                                     
##    listarecon     corsidirecto     corsiinverso    cactusvivos   
##  Min.   : 0.00   Min.   : 0.000   Min.   :0.000   Min.   : 0.00  
##  1st Qu.:20.00   1st Qu.: 4.000   1st Qu.:4.000   1st Qu.:12.00  
##  Median :22.00   Median : 5.000   Median :4.000   Median :13.00  
##  Mean   :21.21   Mean   : 5.134   Mean   :4.173   Mean   :12.67  
##  3rd Qu.:23.00   3rd Qu.: 6.000   3rd Qu.:5.000   3rd Qu.:14.00  
##  Max.   :24.00   Max.   :11.000   Max.   :9.000   Max.   :16.00  
##                                   NA's   :1                      
##   cactusinanim    otverbaltpo     otverbalerr     otvisualtpo    
##  Min.   : 0.00   Min.   : 99.0   Min.   :15.00   Min.   :  0.00  
##  1st Qu.:14.00   1st Qu.:180.0   1st Qu.:20.00   1st Qu.: 46.00  
##  Median :16.00   Median :189.0   Median :20.00   Median : 55.00  
##  Mean   :15.11   Mean   :185.3   Mean   :19.84   Mean   : 63.95  
##  3rd Qu.:17.00   3rd Qu.:192.0   3rd Qu.:20.00   3rd Qu.: 74.00  
##  Max.   :17.00   Max.   :215.0   Max.   :20.00   Max.   :300.00  
##                                                                  
##   otvisualerr    otmentaltpo     otmentalerr     otvismenttpo    otvismenterr  
##  Min.   : 0.0   Min.   : 19.0   Min.   : 0.00   Min.   : 32.0   Min.   : 0.00  
##  1st Qu.:20.0   1st Qu.:234.5   1st Qu.:17.00   1st Qu.:226.5   1st Qu.:18.00  
##  Median :20.0   Median :262.0   Median :19.00   Median :256.0   Median :20.00  
##  Mean   :19.7   Mean   :245.6   Mean   :18.04   Mean   :240.9   Mean   :18.63  
##  3rd Qu.:20.0   3rd Qu.:278.0   3rd Qu.:20.00   3rd Qu.:272.5   3rd Qu.:20.00  
##  Max.   :20.0   Max.   :319.0   Max.   :20.00   Max.   :332.0   Max.   :20.00  
##                                                                                
##   otswitchtpo   otswitcherr      x5dreadtpo      x5dreaderr     x5dcounttpo   
##  Min.   : 28   Min.   : 0.00   Min.   : 22.0   Min.   :30.00   Min.   : 68.0  
##  1st Qu.:338   1st Qu.:17.00   1st Qu.:185.0   1st Qu.:50.00   1st Qu.:162.0  
##  Median :366   Median :19.00   Median :193.0   Median :50.00   Median :168.0  
##  Mean   :350   Mean   :17.65   Mean   :188.3   Mean   :49.83   Mean   :165.3  
##  3rd Qu.:383   3rd Qu.:20.00   3rd Qu.:198.0   3rd Qu.:50.00   3rd Qu.:173.0  
##  Max.   :428   Max.   :20.00   Max.   :222.0   Max.   :50.00   Max.   :200.0  
##                                                NA's   :3       NA's   :2      
##   x5dcounterr      x5dfoctpo     x5dfocerr      x5dswitchtpo  x5dswitcherr  
##  Min.   :19.00   Min.   : 89   Min.   : 3.00   Min.   :  0   Min.   : 0.00  
##  1st Qu.:50.00   1st Qu.:239   1st Qu.:48.00   1st Qu.:257   1st Qu.:46.00  
##  Median :50.00   Median :252   Median :49.00   Median :280   Median :48.00  
##  Mean   :49.79   Mean   :246   Mean   :48.28   Mean   :272   Mean   :46.58  
##  3rd Qu.:50.00   3rd Qu.:259   3rd Qu.:50.00   3rd Qu.:295   3rd Qu.:49.00  
##  Max.   :50.00   Max.   :300   Max.   :50.00   Max.   :350   Max.   :50.00  
##  NA's   :2       NA's   :2     NA's   :2       NA's   :2     NA's   :2      
##      dscorr           dsomis           dscomis         torremov    
##  Min.   :  0.00   Min.   : 0.0000   Min.   : 0.00   Min.   :323.0  
##  1st Qu.: 57.00   1st Qu.: 0.0000   1st Qu.:26.00   1st Qu.:340.0  
##  Median : 70.00   Median : 0.0000   Median :27.00   Median :343.0  
##  Mean   : 69.16   Mean   : 0.3521   Mean   :25.93   Mean   :341.4  
##  3rd Qu.: 82.00   3rd Qu.: 0.0000   3rd Qu.:27.00   3rd Qu.:344.0  
##  Max.   :109.00   Max.   :24.0000   Max.   :27.00   Max.   :351.0  
##                                                     NA's   :1      
##     torretpo         bostonsc        bostonlat       bostonsemerr   
##  Min.   :  2.58   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:240.58   1st Qu.: 1.000   1st Qu.: 1.830   1st Qu.: 0.000  
##  Median :287.58   Median : 2.000   Median : 2.620   Median : 1.000  
##  Mean   :264.54   Mean   : 3.201   Mean   : 2.706   Mean   : 1.991  
##  3rd Qu.:316.08   3rd Qu.: 5.000   3rd Qu.: 3.520   3rd Qu.: 3.000  
##  Max.   :352.58   Max.   :12.000   Max.   :16.000   Max.   :10.000  
##                                                                     
##   bostonfonerr       fluencia    
##  Min.   : 0.000   Min.   : 1.00  
##  1st Qu.: 0.000   1st Qu.:14.00  
##  Median : 0.000   Median :16.50  
##  Mean   : 1.121   Mean   :16.52  
##  3rd Qu.: 2.000   3rd Qu.:20.00  
##  Max.   :11.000   Max.   :33.00  
##                   NA's   :1
# Summary of the volume brain data
summary(data[71:91])
##  right_thalamus_proper left_thalamus_proper  fornix_right    fornix_left   
##  Min.   :4349          Min.   :4478         Min.   : 67.0   Min.   :105.0  
##  1st Qu.:6449          1st Qu.:6800         1st Qu.:324.0   1st Qu.:399.5  
##  Median :6859          Median :7203         Median :361.0   Median :453.0  
##  Mean   :6871          Mean   :7231         Mean   :358.6   Mean   :446.2  
##  3rd Qu.:7265          3rd Qu.:7658         3rd Qu.:394.0   3rd Qu.:496.0  
##  Max.   :9039          Max.   :9560         Max.   :541.0   Max.   :694.0  
##  anterior_limb_of_internal_capsule_right anterior_limb_of_internal_capsule_left
##  Min.   :1849                            Min.   :1438                          
##  1st Qu.:2842                            1st Qu.:2440                          
##  Median :3071                            Median :2673                          
##  Mean   :3099                            Mean   :2687                          
##  3rd Qu.:3336                            3rd Qu.:2919                          
##  Max.   :4742                            Max.   :4213                          
##  posterior_limb_of_internal_capsule_inc_cerebral_peduncle_right
##  Min.   :1571                                                  
##  1st Qu.:2194                                                  
##  Median :2349                                                  
##  Mean   :2384                                                  
##  3rd Qu.:2557                                                  
##  Max.   :4129                                                  
##  posterior_limb_of_internal_capsule_inc_cerebral_peduncle_left corpus_callosum
##  Min.   :1455                                                  Min.   : 6119  
##  1st Qu.:2155                                                  1st Qu.: 9718  
##  Median :2333                                                  Median :10789  
##  Mean   :2365                                                  Mean   :10885  
##  3rd Qu.:2550                                                  3rd Qu.:11905  
##  Max.   :3814                                                  Max.   :17580  
##  right_a_cg_g_anterior_cingulate_gyrus left_a_cg_g_anterior_cingulate_gyrus
##  Min.   :1901                          Min.   :2210                        
##  1st Qu.:2962                          1st Qu.:3522                        
##  Median :3316                          Median :3896                        
##  Mean   :3392                          Mean   :3953                        
##  3rd Qu.:3772                          3rd Qu.:4374                        
##  Max.   :5867                          Max.   :6185                        
##  right_a_ins_anterior_insula left_a_ins_anterior_insula
##  Min.   :1537                Min.   :1562              
##  1st Qu.:2925                1st Qu.:2907              
##  Median :3186                Median :3165              
##  Mean   :3206                Mean   :3177              
##  3rd Qu.:3493                3rd Qu.:3450              
##  Max.   :4521                Max.   :5098              
##  right_an_g_angular_gyrus left_an_g_angular_gyrus right_cun_cuneus
##  Min.   : 4960            Min.   : 4715           Min.   :2653    
##  1st Qu.: 7492            1st Qu.: 6732           1st Qu.:4556    
##  Median : 8189            Median : 7401           Median :5139    
##  Mean   : 8312            Mean   : 7447           Mean   :5188    
##  3rd Qu.: 9116            3rd Qu.: 8115           3rd Qu.:5791    
##  Max.   :13560            Max.   :11434           Max.   :8236    
##  left_cun_cuneus right_ent_entorhinal_area left_ent_entorhinal_area
##  Min.   :2627    Min.   : 749              Min.   : 796            
##  1st Qu.:4551    1st Qu.:2131              1st Qu.:1856            
##  Median :5085    Median :2372              Median :2070            
##  Mean   :5144    Mean   :2424              Mean   :2109            
##  3rd Qu.:5652    3rd Qu.:2704              3rd Qu.:2336            
##  Max.   :8628    Max.   :3989              Max.   :4156            
##  right_g_re_gyrus_rectus left_g_re_gyrus_rectus
##  Min.   :1253            Min.   :1189          
##  1st Qu.:1752            1st Qu.:1774          
##  Median :1955            Median :1956          
##  Mean   :1972            Mean   :1978          
##  3rd Qu.:2166            3rd Qu.:2152          
##  Max.   :2912            Max.   :3072
# Impute the data using the MICE package
library(mice)
# Use the CART imputation method
imp <- mice( data[c(2:91)], m = 5, maxit = 5, method = "cart", seed = 123)

# Complete the imputation process
imputed_data <- complete(imp)
names(imputed_data)
library(EGAnet)
# Run EGA on the imputed data

# EGA on the Symptoms and Demographics
fit_demo <- EGA(imputed_data[c(1:3, 5, 8:16, 18:22)])

# Scale cognitive variables before running the EGA
fit_cog <- EGA(scale(imputed_data[23:59]))

# Confirmatory Factor Analysis from EGA model
cfa <- CFA(fit_cog, plot = TRUE, data = (scale(imputed_data[23:59])), estimator = "MLR")
## [1] "listaprimerrec"   "listaaprendizaje" "listacp"          "listalp"         
## [5] "listarecon"       "fluencia"        
##  [1] "corsidirecto" "corsiinverso" "cactusvivos"  "cactusinanim" "otverbalerr" 
##  [6] "dscorr"       "dscomis"      "bostonsc"     "bostonlat"    "bostonsemerr"
## [11] "bostonfonerr"
## [1] "otverbaltpo"  "otvisualtpo"  "x5dreadtpo"   "x5dcounttpo"  "x5dfoctpo"   
## [6] "x5dswitchtpo"
## [1] "otvisualerr"  "otmentaltpo"  "otmentalerr"  "otvismenttpo" "otvismenterr"
## [6] "otswitchtpo"  "otswitcherr" 
## [1] "x5dreaderr"   "x5dcounterr"  "x5dfocerr"    "x5dswitcherr"
## [1] "torremov" "torretpo"

# Scale brain variables before running the EGA
fit_all <- EGA(scale(imputed_data[60:90]))

#' Compare Groups Across Multiple Variables Automatically, Adjusting for Covariates
#'
#' This function iterates through specified variables in a dataset, determines their type,
#' fits appropriate generalized linear models (GLM), ordinal, or multinomial models,
#' and compares a model with the group predictor (and optional covariates) to a
#' null model (with optional covariates only) using a Likelihood Ratio Test (LRT).
#' It reports p-values and FDR-adjusted p-values for the group effect, adjusted for covariates.
#'
#' @param data A data.frame containing the variables to compare, the grouping variable,
#'   and any covariates.
#' @param group A character string specifying the name of the column in `data` that
#'   contains the grouping factor. Must have at least 2 levels.
#' @param vars_to_test A character vector specifying the names of the outcome variables
#'   in `data` to be tested against the `group`. If NULL (default), all columns
#'   except `group` and `covariates` are tested.
#' @param covariates A character vector specifying the names of the columns in `data` to
#'   be used as covariates in the regression models. Default is `NULL` (no covariates).
#'   These variables will be included in both the full and null models.
#' @param alpha The significance level (default 0.05) used to determine significance
#'   based on the FDR-adjusted p-value of the group effect.
#' @param zero_threshold For continuous outcome variables, the minimum proportion of zero values
#'   (default 0.3) to consider potentially using a Gamma GLM (along with skewness).
#' @param skew_threshold For continuous outcome variables, the minimum absolute skewness
#'   (default 2) to consider potentially using a Gamma GLM (along with zero proportion).
#'   Note: Uses `moments::skewness`.
#' @param verbose Logical indicating whether to print progress messages (default TRUE).
#' @param group_ref Optional. A string specifying the reference level for the group variable.
#'   If NULL (default), the first level is used.
#'
#' @return A data.frame summarizing the results for each variable tested, including:
#'   \item{Variable}{Name of the outcome variable tested.}
#'   \item{Type}{Detected type of the outcome variable (e.g., "Continuous (Gaussian GLM used)", "Binary", "Ordinal", "Categorical", "Constant Outcome Variable").}
#'   \item{Covariates_Used}{Comma-separated string of covariates included in the models for this variable.}
#'   \item{n_obs}{Number of observations used for the test after handling missing data for the outcome, group, and all covariates.}
#'   \item{Status}{Outcome of the modeling ("OK", "Error: [message]", "Constant Outcome Variable", "No DF improvement", "Low N or too many NAs", "Unsupported outcome variable type", "Log-Likelihood calculation failed...").}
#'   \item{p_value}{The raw p-value from the Likelihood Ratio Test for the `group` effect, formatted as a string.}
#'   \item{p_value_FDR}{The group p-value adjusted for multiple comparisons using the Benjamini-Hochberg FDR method, formatted as a string.}
#'   \item{Significant}{"Yes" if p_value_FDR < alpha, "No" otherwise (based on non-NA adjusted p-values).}
#'   \item{Gamma_Shift_Warning}{A flag ("Yes"/"No") indicating if non-positive outcome values were shifted for the Gamma GLM.}
#'   \item{Convergence_Warning}{A flag ("Yes"/"No") indicating if GLM/Multinom models reported non-convergence or if logLik calculation failed for Ordinal models (proxy for potential issues).}
#'
#' @details
#' - **Covariate Adjustment:** Compares `model(y ~ group + covariates)` vs `model(y ~ covariates)` (or `y ~ group` vs `y ~ 1` if no covariates).
#' - **Variable Selection:** Uses `vars_to_test` or defaults to all other variables.
#' - **Variable Types & Modeling:** Automatically detects outcome type. Uses Gaussian GLM (default continuous), Gamma GLM (skewed/zeros continuous), Binomial GLM (binary), `MASS::polr` (ordinal), `nnet::multinom` (categorical). Covariates used as-is (ensure factor/numeric).
#' - **Convergence:** Checks convergence flags for GLM/Multinom. For Ordinal (`polr`), failure to calculate log-likelihood is used as a proxy for potential convergence/Hessian issues and flagged.
#' - **Missing Data:** Uses complete cases for outcome, group, and all covariates per test.
#' - **Group Reference Level:** Users can specify the reference level for the group variable using `group_ref`. If the specified level is not present in the data for a particular variable after removing NAs, a warning is issued, and the default reference level is used.
#' - **Dependencies:** Requires 'MASS', 'nnet', and 'moments' packages.
#'
#' @examples
#' \dontrun{
#' set.seed(123)
#' sample_data <- data.frame(
#'   Site = factor(rep(c("Site1", "Site2"), each = 60)),
#'   Age = rnorm(120, mean = 50, sd = 10),
#'   Education = sample(10:20, 120, replace = TRUE),
#'   Sex = factor(sample(c("M", "F"), 120, replace = TRUE)),
#'   CognitiveScore = rnorm(120, mean = rep(c(100, 105), each = 60) + 0.1 * Age - 5 * (Sex == "F")),
#'   BiomarkerA = rgamma(120, shape = rep(c(2, 3), each = 60) + 0.05 * Age, scale = 1.5),
#'   Diagnosis = factor(sample(c("Normal", "MCI", "AD"), 120, replace = TRUE, prob = c(0.6, 0.3, 0.1))),
#'   OrdinalResp = factor(sample(1:5, 120, replace = TRUE), ordered = TRUE),
#'   ConstantVar = rep(10, 120)
#' )
#' sample_data$Age[sample(1:120, 10)] <- NA
#' sample_data$CognitiveScore[sample(1:120, 5)] <- NA
#' sample_data$BiomarkerA[1:5] <- NA
#'
#' results_with_cov <- compare_groups_auto_v4(
#'     data = sample_data,
#'     group = "Site",
#'     vars_to_test = c("CognitiveScore", "BiomarkerA", "Diagnosis", "OrdinalResp", "ConstantVar"),
#'     covariates = c("Age", "Education", "Sex"),
#'     verbose = TRUE
#' )
#' print(results_with_cov)
#'
#' # Example with specified group reference level
#' results_with_ref <- compare_groups_auto_v4(
#'     data = sample_data,
#'     group = "Site",
#'     vars_to_test = c("CognitiveScore", "BiomarkerA", "Diagnosis", "OrdinalResp", "ConstantVar"),
#'     covariates = c("Age", "Education", "Sex"),
#'     group_ref = "Site2",
#'     verbose = TRUE
#' )
#' print(results_with_ref)
#' }
#' @importFrom stats glm gaussian binomial Gamma logLik pchisq p.adjust complete.cases sd anova reformulate relevel na.omit
#' @importFrom MASS polr
#' @importFrom nnet multinom
#' @importFrom moments skewness
#' @export
compare_groups_auto_v4 <- function(data, group, vars_to_test = NULL, covariates = NULL, alpha = 0.05,
                                   zero_threshold = 0.3, skew_threshold = 2, verbose = TRUE, group_ref = NULL) {

  # --- 1. Input Validation and Package Checks ---
  if (!requireNamespace("MASS", quietly = TRUE)) stop("Package 'MASS' is required. Please install it.")
  if (!requireNamespace("nnet", quietly = TRUE)) stop("Package 'nnet' is required. Please install it.")
  if (!requireNamespace("moments", quietly = TRUE)) stop("Package 'moments' is required. Please install it.")

  if (!is.data.frame(data)) stop("'data' must be a data.frame.")
  if (!is.character(group) || length(group) != 1 || !(group %in% names(data))) {
    stop("'group' must be a single string naming an existing column in 'data'.")
  }

  if (!is.null(covariates)) {
    if (!is.character(covariates) || any(!covariates %in% names(data))) {
      stop("'covariates' must be NULL or a character vector of existing column names in 'data'.")
    }
    if (any(covariates %in% group)) stop("The 'group' variable cannot also be listed in 'covariates'.")
    covariates <- unique(covariates)
  } else {
    covariates <- character(0)
  }

  if (!is.null(vars_to_test)) {
    if (!is.character(vars_to_test) || any(!vars_to_test %in% names(data))) {
      stop("'vars_to_test' must be NULL or a character vector of existing column names in 'data'.")
    }
    if (any(vars_to_test %in% c(group, covariates))) stop("Variables in 'vars_to_test' cannot include the 'group' or 'covariates'.")
    vars_to_test <- unique(vars_to_test)
  }

  if (!is.null(group_ref) && (!is.character(group_ref) || length(group_ref) != 1)) {
    stop("'group_ref' must be NULL or a single string.")
  }

  if (!is.numeric(alpha) || alpha <= 0 || alpha >= 1) stop("'alpha' must be a numeric value strictly between 0 and 1.")
  if (!is.numeric(zero_threshold) || zero_threshold < 0 || zero_threshold > 1) stop("'zero_threshold' must be numeric between 0 and 1.")
  if (!is.numeric(skew_threshold) || skew_threshold < 0) stop("'skew_threshold' must be non-negative numeric.")
  if (!is.logical(verbose)) stop("'verbose' must be a logical value (TRUE or FALSE).")

  # --- 2. Prepare Data and Identify Variables ---
  if (!is.factor(data[[group]])) {
    if (verbose) message("Converting grouping variable '", group, "' to factor.")
    data[[group]] <- as.factor(data[[group]])
  }
  if (nlevels(data[[group]]) < 2) stop("Grouping variable '", group, "' must have at least two levels.")

  if (!is.null(group_ref)) {
    if (!(group_ref %in% levels(data[[group]]))) {
      stop(sprintf("'group_ref' '%s' is not a level of the group variable '%s'.", group_ref, group))
    }
  }

  if (is.null(vars_to_test)) {
    var_names <- setdiff(names(data), c(group, covariates))
  } else {
    var_names <- vars_to_test
  }

  if (length(var_names) == 0) {
    warning("No variables identified for testing after excluding 'group' and 'covariates'.")
    return(data.frame(Variable = character(), Type = character(), Covariates_Used = character(),
                      n_obs = integer(), Status = character(), p_value = numeric(),
                      p_value_FDR = numeric(), Significant = character(),
                      Gamma_Shift_Warning = character(), Convergence_Warning = character(),
                      stringsAsFactors = FALSE))
  }

  if (verbose) {
    message(sprintf("Starting analysis of %d variables with '%s' as the grouping variable.",
                    length(var_names), group))
    if (length(covariates) > 0) {
      message(sprintf("Using %d covariates: %s", length(covariates), paste(covariates, collapse = ", ")))
    }
  }

  # --- 3. Internal Helper Function: Model Fitting and LRT ---
  fit_models_and_lrt <- function(temp_data, formula_full, formula_null, type, variable, verbose,
                                 zero_threshold, skew_threshold) {
    gamma_shift_warning <- FALSE
    convergence_warning <- FALSE
    p_val <- NA_real_
    model_status <- "OK"
    type_used <- type
    data_for_model <- temp_data
    outcome <- data_for_model$y

    res <- tryCatch({
      if (type == "Continuous") {
        zero_prop <- mean(outcome == 0, na.rm = TRUE)
        skew_val <- tryCatch(moments::skewness(outcome, na.rm = TRUE), error = function(e) NA_real_)
        use_gamma <- !is.na(skew_val) && (zero_prop > zero_threshold) && (abs(skew_val) > skew_threshold)
        if (is.na(skew_val) && verbose) {
          message(sprintf("Skewness calculation failed for '%s'. Using Gaussian GLM.", variable))
        }
        if (use_gamma) {
          min_val <- min(outcome, na.rm = TRUE)
          if (min_val <= 0) {
            shift_amount <- max(1e-6, abs(min_val) * 1.01 + 1e-6)
            outcome <- outcome + shift_amount
            data_for_model$y <- outcome
            gamma_shift_warning <- TRUE
            if (verbose) warning(sprintf("Variable '%s': Contains non-positive values. Added shift of ~%s for Gamma GLM fitting.",
                                         variable, format(shift_amount, digits = 2)), call. = FALSE, immediate. = TRUE)
          }
          model_full <- stats::glm(formula = formula_full, data = data_for_model, family = stats::Gamma(link = "log"))
          model_null <- stats::glm(formula = formula_null, data = data_for_model, family = stats::Gamma(link = "log"))
          type_used <- "Continuous (Gamma GLM used)"
          if (!model_full$converged || !model_null$converged) {
            convergence_warning <- TRUE
            if (verbose) warning(sprintf("Gamma GLM did not converge for '%s'.", variable), call. = FALSE)
          }
        } else {
          model_full <- stats::glm(formula = formula_full, data = data_for_model, family = stats::gaussian(link = "identity"))
          model_null <- stats::glm(formula = formula_null, data = data_for_model, family = stats::gaussian(link = "identity"))
          type_used <- "Continuous (Gaussian GLM used)"
          if (!model_full$converged || !model_null$converged) {
            convergence_warning <- TRUE
            if (verbose) warning(sprintf("Gaussian GLM did not converge for '%s'.", variable), call. = FALSE)
          }
        }
      } else if (type == "Binary") {
        model_full <- stats::glm(formula = formula_full, data = data_for_model, family = stats::binomial(link = "logit"))
        model_null <- stats::glm(formula = formula_null, data = data_for_model, family = stats::binomial(link = "logit"))
        type_used <- "Binary"
        if (!model_full$converged || !model_null$converged) {
          convergence_warning <- TRUE
          if (verbose) warning(sprintf("Binomial GLM did not converge for '%s'.", variable), call. = FALSE)
        }
      } else if (type == "Ordinal") {
        model_full <- MASS::polr(formula = formula_full, data = data_for_model, method = "logistic", Hess = TRUE)
        model_null <- MASS::polr(formula = formula_null, data = data_for_model, method = "logistic", Hess = TRUE)
        type_used <- "Ordinal"
      } else if (type == "Categorical") {
        model_full <- nnet::multinom(formula = formula_full, data = data_for_model, trace = FALSE)
        model_null <- nnet::multinom(formula = formula_null, data = data_for_model, trace = FALSE)
        type_used <- "Categorical"
        if ((!is.null(model_full$convergence) && model_full$convergence != 0) ||
            (!is.null(model_null$convergence) && model_null$convergence != 0)) {
          convergence_warning <- TRUE
          if (verbose) warning(sprintf("Multinomial model did not fully converge for '%s'.", variable), call. = FALSE)
        }
      } else {
        stop("Internal error: Unknown model type.")
      }

      # --- Likelihood Ratio Test ---
      if (inherits(model_full, "glm")) {
        lrt_result <- suppressWarnings(stats::anova(model_null, model_full, test = "LRT"))
        if (!is.null(lrt_result) && "Pr(>Chi)" %in% names(lrt_result) && nrow(lrt_result) > 1) {
          candidate <- lrt_result$"Pr(>Chi)"[2]
          if (!is.na(candidate) && is.finite(candidate)) {
            p_val <- candidate
          } else {
            if (!is.na(lrt_result$Df[2]) && lrt_result$Df[2] <= 0) {
              model_status <- "No DF improvement/equivalent models"
              p_val <- 1.0
            } else {
              model_status <- "LRT p-value calculation failed (anova)"
              p_val <- NA_real_
            }
          }
        } else {
          model_status <- "LRT failed (anova output invalid/models identical)"
          p_val <- NA_real_
        }
      } else {
        ll_full <- tryCatch(stats::logLik(model_full), error = function(e) NA)
        ll_null <- tryCatch(stats::logLik(model_null), error = function(e) NA)
        if (is.na(ll_full) || is.na(ll_null)) {
          model_status <- "Log-Likelihood calculation failed (NA/NaN)"
          if (inherits(model_full, "polr")) {
            model_status <- paste(model_status, "- potential polr convergence?")
            convergence_warning <- TRUE
          }
          p_val <- NA_real_
        } else {
          lr_stat <- 2 * (as.numeric(ll_full) - as.numeric(ll_null))
          df_full <- attr(ll_full, "df")
          df_null <- attr(ll_null, "df")
          if (is.null(df_full) || is.null(df_null)) {
            model_status <- "Could not retrieve degrees of freedom for LRT"
            p_val <- NA_real_
          } else {
            df_diff <- df_full - df_null
            if (df_diff > .Machine$double.eps^0.5) {
              if (lr_stat < -1e-8) {
                model_status <- sprintf("Negative LRT statistic (%.2g)", lr_stat)
                p_val <- NA_real_
              } else {
                p_val <- stats::pchisq(max(0, lr_stat), df = df_diff, lower.tail = FALSE)
              }
            } else {
              p_val <- 1.0
              model_status <- "No DF improvement"
            }
          }
        }
      }

      list(p_value = p_val, model_status = model_status, type_used = type_used,
           gamma_shift_warning = gamma_shift_warning, convergence_warning = convergence_warning)
    }, error = function(e) {
      msg <- paste("Error:", gsub("[\\r\\n\\t]+", " ", conditionMessage(e)))
      if (verbose) message(sprintf("Model fitting/LRT failed for '%s'. Status: %s", variable, msg))
      list(p_value = NA_real_, model_status = msg, type_used = type,
           gamma_shift_warning = gamma_shift_warning, convergence_warning = convergence_warning)
    })

    return(res)
  }

  # --- 4. Loop Through Variables ---
  res_list <- vector("list", length(var_names))
  names(res_list) <- var_names

  for (i in seq_along(var_names)) {
    variable <- var_names[i]
    if (verbose && length(var_names) > 5 && i %% ceiling(length(var_names) / 10) == 1) {
      message(sprintf("Processing variable %d of %d: %s", i, length(var_names), variable))
    }

    # Initialize default result for the variable
    res <- data.frame(
      Variable = variable,
      Type = "Unknown",
      Covariates_Used = paste(covariates, collapse = ", "),
      n_obs = 0L,
      Status = "Not processed",
      p_value = NA_real_,
      p_value_FDR = NA_real_,
      Significant = "No",
      Gamma_Shift_Warning = "No",
      Convergence_Warning = "No",
      stringsAsFactors = FALSE
    )

    # --- Data Extraction and Missing Data Handling ---
    current_cols <- c(variable, group, covariates)
    full_data <- tryCatch(data[, current_cols, drop = FALSE], error = function(e) NULL)
    if (is.null(full_data)) {
      res$Status <- paste("Error accessing column:", variable)
      res_list[[variable]] <- res
      next
    }

    complete_idx <- stats::complete.cases(full_data)
    n_complete <- sum(complete_idx)
    res$n_obs <- n_complete
    min_obs_needed <- (nlevels(data[[group]]) - 1) + length(covariates) + nlevels(data[[group]]) + 5

    if (n_complete < min_obs_needed || n_complete < (2 * nlevels(data[[group]]))) {
      res$Status <- "Low N or too many NAs for model complexity"
      res_list[[variable]] <- res
      next
    }

    temp_data <- full_data[complete_idx, , drop = FALSE]
    names(temp_data)[names(temp_data) == variable] <- "y"
    temp_data[[group]] <- factor(temp_data[[group]])

    # Check if group has at least two levels after removing NAs
    if (nlevels(temp_data[[group]]) < 2) {
      res$Status <- "Grouping variable has less than 2 levels after removing NAs"
      res_list[[variable]] <- res
      next
    }

    # Set reference level for group if specified and present
    if (!is.null(group_ref) && group_ref %in% levels(temp_data[[group]])) {
      temp_data[[group]] <- relevel(temp_data[[group]], ref = group_ref)
    } else if (!is.null(group_ref)) {
      warning(sprintf("For variable '%s', the specified 'group_ref' '%s' is not present in the data after removing NAs. Using default reference level.", variable, group_ref))
    }

    x <- temp_data$y

    # --- Check for Constant Outcome Variable ---
    is_constant <- FALSE
    if (is.numeric(x)) {
      if (is.na(stats::sd(x, na.rm = TRUE)) || stats::sd(x, na.rm = TRUE) < .Machine$double.eps^0.5) {
        is_constant <- TRUE
      }
    } else {
      if (length(unique(stats::na.omit(x))) <= 1) is_constant <- TRUE
    }
    if (is_constant) {
      res$Status <- "Constant Outcome Variable"
      res$Type <- if (is.numeric(x)) "Numeric Constant" else "Factor/Char Constant"
      res_list[[variable]] <- res
      next
    }

    # --- Determine Outcome Variable Type ---
    type <- "Unknown"
    if (is.numeric(x)) {
      unique_vals <- unique(stats::na.omit(x))
      n_unique <- length(unique_vals)
      is_int_like_binary <- n_unique == 2 && all(abs(unique_vals - round(unique_vals)) < .Machine$double.eps^0.5)
      if (is_int_like_binary) {
        type <- "Binary"
        temp_data$y <- factor(temp_data$y, levels = sort(unique_vals))
      } else {
        type <- "Continuous"
      }
    } else if (is.factor(x)) {
      current_levels <- levels(droplevels(factor(x)))
      n_levels <- length(current_levels)
      original_levels <- levels(data[[variable]])
      if (is.ordered(x)) {
        if (n_levels < 2) {
          res$Status <- "Ordinal outcome with < 2 levels after NA removal"
          res_list[[variable]] <- res
          next
        }
        type <- "Ordinal"
        temp_data$y <- factor(temp_data$y, levels = intersect(original_levels, current_levels), ordered = TRUE)
      } else {
        if (n_levels == 2) {
          type <- "Binary"
          temp_data$y <- factor(temp_data$y, levels = current_levels)
        } else if (n_levels > 2) {
          type <- "Categorical"
          temp_data$y <- stats::relevel(factor(temp_data$y, levels = current_levels), ref = current_levels[1])
        } else {
          res$Status <- "Factor outcome with < 2 levels after NA removal"
          res_list[[variable]] <- res
          next
        }
      }
    } else if (is.character(x)) {
      temp_data$y <- factor(temp_data$y)
      current_levels <- levels(temp_data$y)
      n_levels <- length(current_levels)
      if (n_levels == 2) {
        type <- "Binary"
      } else if (n_levels > 2) {
        type <- "Categorical"
        temp_data$y <- stats::relevel(temp_data$y, ref = current_levels[1])
      } else {
        res$Status <- "Character outcome with < 2 levels after NA removal"
        res_list[[variable]] <- res
        next
      }
    } else {
      res$Status <- "Unsupported outcome variable type"
      res$Type <- class(x)[1]
      res_list[[variable]] <- res
      next
    }

    res$Type <- type

    # --- Ensure Covariates are Factor or Numeric ---
    for (covar in covariates) {
      if (!is.numeric(temp_data[[covar]]) && !is.factor(temp_data[[covar]])) {
        if (verbose) message("Converting covariate '", covar, "' to factor for variable '", variable, "'.")
        temp_data[[covar]] <- factor(temp_data[[covar]])
      }
    }

    # --- Construct Model Formulas ---
    terms_full <- c(group, covariates)
    terms_null <- if (length(covariates) > 0) covariates else "1"
    formula_full <- tryCatch(stats::reformulate(termlabels = terms_full, response = "y"),
                             error = function(e) NULL)
    formula_null <- tryCatch(stats::reformulate(termlabels = terms_null, response = "y"),
                             error = function(e) NULL)
    if (is.null(formula_full) || is.null(formula_null)) {
      res$Status <- "Error: Failed to construct model formulas"
      res_list[[variable]] <- res
      next
    }

    # --- Fit Models and Perform LRT via Helper Function ---
    model_out <- fit_models_and_lrt(temp_data, formula_full, formula_null, type, variable, verbose,
                                    zero_threshold, skew_threshold)
    res$p_value <- model_out$p_value
    res$Status <- model_out$model_status
    res$Type <- model_out$type_used
    if (model_out$gamma_shift_warning) res$Gamma_Shift_Warning <- "Yes"
    if (model_out$convergence_warning) res$Convergence_Warning <- "Yes"
    res_list[[variable]] <- res

    # --- Periodic Garbage Collection ---
    if (i %% 50 == 0 && (nrow(data) * ncol(data) > 1e6 || length(var_names) > 100)) {
      if (verbose) message("Running garbage collection...")
      gc(verbose = FALSE)
    }
  }

  # --- 5. Combine and Adjust Results ---
  res_df <- do.call(rbind, res_list)
  rownames(res_df) <- NULL
  valid_idx <- which(!is.na(res_df$p_value) & is.finite(res_df$p_value))
  if (length(valid_idx) > 0) {
    res_df$p_value_FDR[valid_idx] <- stats::p.adjust(res_df$p_value[valid_idx], method = "fdr")
    res_df$Significant <- ifelse(!is.na(res_df$p_value_FDR) &
                                   res_df$p_value_FDR < alpha, "Yes", "No")
    res_df$Significant[is.na(res_df$p_value)] <- "No"
  } else {
    res_df$p_value_FDR <- NA_real_
    res_df$Significant <- "No"
  }

  # --- 6. Round and Format P-values ---
  res_df$p_value <- ifelse(!is.na(res_df$p_value) & res_df$p_value < 0.001,
                           "<0.001", sprintf("%.3f", res_df$p_value))
  res_df$p_value_FDR <- ifelse(!is.na(res_df$p_value_FDR) & res_df$p_value_FDR < 0.001,
                               "<0.001", sprintf("%.3f", res_df$p_value_FDR))

  final_order <- c("Variable", "Type", "Covariates_Used", "n_obs", "Status",
                   "p_value", "p_value_FDR", "Significant",
                   "Gamma_Shift_Warning", "Convergence_Warning")
  if (!all(final_order %in% names(res_df))) {
    warning("Internal issue: Some expected result columns are missing.")
  } else {
    res_df <- res_df[, final_order, drop = FALSE]
  }

  if (verbose) message("Analysis finished. Returning results table.")
  return(res_df)
}
# Compare brain variables across groups (cognitive)
compare_groups_auto_v4(imputed_data[c(1, 5, 60:90)], group = "cognitive", covariates = "age_pcr")
##                                                          Variable
## 1                                            right_accumbens_area
## 2                                             left_accumbens_area
## 3                                                  right_amygdala
## 4                                                   left_amygdala
## 5                                       right_cerebellum_exterior
## 6                                        left_cerebellum_exterior
## 7                                               right_hippocampus
## 8                                                left_hippocampus
## 9                                                   right_putamen
## 10                                                   left_putamen
## 11                                          right_thalamus_proper
## 12                                           left_thalamus_proper
## 13                                                   fornix_right
## 14                                                    fornix_left
## 15                        anterior_limb_of_internal_capsule_right
## 16                         anterior_limb_of_internal_capsule_left
## 17 posterior_limb_of_internal_capsule_inc_cerebral_peduncle_right
## 18  posterior_limb_of_internal_capsule_inc_cerebral_peduncle_left
## 19                                                corpus_callosum
## 20                          right_a_cg_g_anterior_cingulate_gyrus
## 21                           left_a_cg_g_anterior_cingulate_gyrus
## 22                                    right_a_ins_anterior_insula
## 23                                     left_a_ins_anterior_insula
## 24                                       right_an_g_angular_gyrus
## 25                                        left_an_g_angular_gyrus
## 26                                               right_cun_cuneus
## 27                                                left_cun_cuneus
## 28                                      right_ent_entorhinal_area
## 29                                       left_ent_entorhinal_area
## 30                                        right_g_re_gyrus_rectus
## 31                                         left_g_re_gyrus_rectus
##                              Type Covariates_Used n_obs Status p_value
## 1  Continuous (Gaussian GLM used)         age_pcr   463     OK   0.884
## 2  Continuous (Gaussian GLM used)         age_pcr   463     OK   0.497
## 3  Continuous (Gaussian GLM used)         age_pcr   463     OK   0.668
## 4  Continuous (Gaussian GLM used)         age_pcr   463     OK   0.318
## 5  Continuous (Gaussian GLM used)         age_pcr   463     OK   0.277
## 6  Continuous (Gaussian GLM used)         age_pcr   463     OK   0.251
## 7  Continuous (Gaussian GLM used)         age_pcr   463     OK   0.134
## 8  Continuous (Gaussian GLM used)         age_pcr   463     OK   0.134
## 9  Continuous (Gaussian GLM used)         age_pcr   463     OK   0.948
## 10 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.439
## 11 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.032
## 12 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.058
## 13 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.410
## 14 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.584
## 15 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.770
## 16 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.528
## 17 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.789
## 18 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.561
## 19 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.492
## 20 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.511
## 21 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.281
## 22 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.283
## 23 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.395
## 24 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.901
## 25 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.565
## 26 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.606
## 27 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.721
## 28 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.671
## 29 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.820
## 30 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.252
## 31 Continuous (Gaussian GLM used)         age_pcr   463     OK   0.210
##    p_value_FDR Significant Gamma_Shift_Warning Convergence_Warning
## 1        0.931          No                  No                  No
## 2        0.854          No                  No                  No
## 3        0.867          No                  No                  No
## 4        0.854          No                  No                  No
## 5        0.854          No                  No                  No
## 6        0.854          No                  No                  No
## 7        0.854          No                  No                  No
## 8        0.854          No                  No                  No
## 9        0.948          No                  No                  No
## 10       0.854          No                  No                  No
## 11       0.854          No                  No                  No
## 12       0.854          No                  No                  No
## 13       0.854          No                  No                  No
## 14       0.854          No                  No                  No
## 15       0.905          No                  No                  No
## 16       0.854          No                  No                  No
## 17       0.905          No                  No                  No
## 18       0.854          No                  No                  No
## 19       0.854          No                  No                  No
## 20       0.854          No                  No                  No
## 21       0.854          No                  No                  No
## 22       0.854          No                  No                  No
## 23       0.854          No                  No                  No
## 24       0.931          No                  No                  No
## 25       0.854          No                  No                  No
## 26       0.854          No                  No                  No
## 27       0.894          No                  No                  No
## 28       0.867          No                  No                  No
## 29       0.908          No                  No                  No
## 30       0.854          No                  No                  No
## 31       0.854          No                  No                  No
# Compare brain variables across groups (age_interval)
compare_groups_auto_v4(imputed_data[c(6, 60:90)], group = "age_interval")
##                                                          Variable
## 1                                            right_accumbens_area
## 2                                             left_accumbens_area
## 3                                                  right_amygdala
## 4                                                   left_amygdala
## 5                                       right_cerebellum_exterior
## 6                                        left_cerebellum_exterior
## 7                                               right_hippocampus
## 8                                                left_hippocampus
## 9                                                   right_putamen
## 10                                                   left_putamen
## 11                                          right_thalamus_proper
## 12                                           left_thalamus_proper
## 13                                                   fornix_right
## 14                                                    fornix_left
## 15                        anterior_limb_of_internal_capsule_right
## 16                         anterior_limb_of_internal_capsule_left
## 17 posterior_limb_of_internal_capsule_inc_cerebral_peduncle_right
## 18  posterior_limb_of_internal_capsule_inc_cerebral_peduncle_left
## 19                                                corpus_callosum
## 20                          right_a_cg_g_anterior_cingulate_gyrus
## 21                           left_a_cg_g_anterior_cingulate_gyrus
## 22                                    right_a_ins_anterior_insula
## 23                                     left_a_ins_anterior_insula
## 24                                       right_an_g_angular_gyrus
## 25                                        left_an_g_angular_gyrus
## 26                                               right_cun_cuneus
## 27                                                left_cun_cuneus
## 28                                      right_ent_entorhinal_area
## 29                                       left_ent_entorhinal_area
## 30                                        right_g_re_gyrus_rectus
## 31                                         left_g_re_gyrus_rectus
##                              Type Covariates_Used n_obs Status p_value
## 1  Continuous (Gaussian GLM used)                   463     OK   0.008
## 2  Continuous (Gaussian GLM used)                   463     OK   0.081
## 3  Continuous (Gaussian GLM used)                   463     OK   0.609
## 4  Continuous (Gaussian GLM used)                   463     OK   0.358
## 5  Continuous (Gaussian GLM used)                   463     OK   0.095
## 6  Continuous (Gaussian GLM used)                   463     OK   0.100
## 7  Continuous (Gaussian GLM used)                   463     OK   0.002
## 8  Continuous (Gaussian GLM used)                   463     OK  <0.001
## 9  Continuous (Gaussian GLM used)                   463     OK   0.187
## 10 Continuous (Gaussian GLM used)                   463     OK   0.030
## 11 Continuous (Gaussian GLM used)                   463     OK  <0.001
## 12 Continuous (Gaussian GLM used)                   463     OK  <0.001
## 13 Continuous (Gaussian GLM used)                   463     OK  <0.001
## 14 Continuous (Gaussian GLM used)                   463     OK  <0.001
## 15 Continuous (Gaussian GLM used)                   463     OK  <0.001
## 16 Continuous (Gaussian GLM used)                   463     OK  <0.001
## 17 Continuous (Gaussian GLM used)                   463     OK   0.009
## 18 Continuous (Gaussian GLM used)                   463     OK   0.068
## 19 Continuous (Gaussian GLM used)                   463     OK   0.007
## 20 Continuous (Gaussian GLM used)                   463     OK  <0.001
## 21 Continuous (Gaussian GLM used)                   463     OK  <0.001
## 22 Continuous (Gaussian GLM used)                   463     OK  <0.001
## 23 Continuous (Gaussian GLM used)                   463     OK  <0.001
## 24 Continuous (Gaussian GLM used)                   463     OK  <0.001
## 25 Continuous (Gaussian GLM used)                   463     OK   0.028
## 26 Continuous (Gaussian GLM used)                   463     OK   0.004
## 27 Continuous (Gaussian GLM used)                   463     OK   0.458
## 28 Continuous (Gaussian GLM used)                   463     OK   0.131
## 29 Continuous (Gaussian GLM used)                   463     OK   0.115
## 30 Continuous (Gaussian GLM used)                   463     OK   0.040
## 31 Continuous (Gaussian GLM used)                   463     OK   0.577
##    p_value_FDR Significant Gamma_Shift_Warning Convergence_Warning
## 1        0.016         Yes                  No                  No
## 2        0.114          No                  No                  No
## 3        0.609          No                  No                  No
## 4        0.397          No                  No                  No
## 5        0.128          No                  No                  No
## 6        0.130          No                  No                  No
## 7        0.004         Yes                  No                  No
## 8       <0.001         Yes                  No                  No
## 9        0.215          No                  No                  No
## 10       0.049         Yes                  No                  No
## 11      <0.001         Yes                  No                  No
## 12      <0.001         Yes                  No                  No
## 13      <0.001         Yes                  No                  No
## 14      <0.001         Yes                  No                  No
## 15      <0.001         Yes                  No                  No
## 16      <0.001         Yes                  No                  No
## 17       0.016         Yes                  No                  No
## 18       0.100          No                  No                  No
## 19       0.014         Yes                  No                  No
## 20      <0.001         Yes                  No                  No
## 21      <0.001         Yes                  No                  No
## 22      <0.001         Yes                  No                  No
## 23      <0.001         Yes                  No                  No
## 24       0.001         Yes                  No                  No
## 25       0.049         Yes                  No                  No
## 26       0.009         Yes                  No                  No
## 27       0.489          No                  No                  No
## 28       0.156          No                  No                  No
## 29       0.143          No                  No                  No
## 30       0.061          No                  No                  No
## 31       0.596          No                  No                  No
# Compare cognitive variables across groups (cognitive)
compare_groups_auto_v4(imputed_data[c(1, 5, 23:59)], group = "cognitive", covariates = "age_pcr")
##            Variable                           Type Covariates_Used n_obs
## 1    listaprimerrec Continuous (Gaussian GLM used)         age_pcr   463
## 2  listaaprendizaje Continuous (Gaussian GLM used)         age_pcr   463
## 3           listacp Continuous (Gaussian GLM used)         age_pcr   463
## 4           listalp Continuous (Gaussian GLM used)         age_pcr   463
## 5        listarecon Continuous (Gaussian GLM used)         age_pcr   463
## 6      corsidirecto Continuous (Gaussian GLM used)         age_pcr   463
## 7      corsiinverso Continuous (Gaussian GLM used)         age_pcr   463
## 8       cactusvivos Continuous (Gaussian GLM used)         age_pcr   463
## 9      cactusinanim Continuous (Gaussian GLM used)         age_pcr   463
## 10      otverbaltpo Continuous (Gaussian GLM used)         age_pcr   463
## 11      otverbalerr Continuous (Gaussian GLM used)         age_pcr   463
## 12      otvisualtpo Continuous (Gaussian GLM used)         age_pcr   463
## 13      otvisualerr Continuous (Gaussian GLM used)         age_pcr   463
## 14      otmentaltpo Continuous (Gaussian GLM used)         age_pcr   463
## 15      otmentalerr Continuous (Gaussian GLM used)         age_pcr   463
## 16     otvismenttpo Continuous (Gaussian GLM used)         age_pcr   463
## 17     otvismenterr Continuous (Gaussian GLM used)         age_pcr   463
## 18      otswitchtpo Continuous (Gaussian GLM used)         age_pcr   463
## 19      otswitcherr Continuous (Gaussian GLM used)         age_pcr   463
## 20       x5dreadtpo Continuous (Gaussian GLM used)         age_pcr   463
## 21       x5dreaderr Continuous (Gaussian GLM used)         age_pcr   463
## 22      x5dcounttpo Continuous (Gaussian GLM used)         age_pcr   463
## 23      x5dcounterr Continuous (Gaussian GLM used)         age_pcr   463
## 24        x5dfoctpo Continuous (Gaussian GLM used)         age_pcr   463
## 25        x5dfocerr Continuous (Gaussian GLM used)         age_pcr   463
## 26     x5dswitchtpo Continuous (Gaussian GLM used)         age_pcr   463
## 27     x5dswitcherr Continuous (Gaussian GLM used)         age_pcr   463
## 28           dscorr Continuous (Gaussian GLM used)         age_pcr   463
## 29           dsomis                     Continuous         age_pcr   463
## 30          dscomis Continuous (Gaussian GLM used)         age_pcr   463
## 31         torremov Continuous (Gaussian GLM used)         age_pcr   463
## 32         torretpo Continuous (Gaussian GLM used)         age_pcr   463
## 33         bostonsc Continuous (Gaussian GLM used)         age_pcr   463
## 34        bostonlat Continuous (Gaussian GLM used)         age_pcr   463
## 35     bostonsemerr Continuous (Gaussian GLM used)         age_pcr   463
## 36     bostonfonerr                     Continuous         age_pcr   463
## 37         fluencia Continuous (Gaussian GLM used)         age_pcr   463
##                      Status p_value p_value_FDR Significant Gamma_Shift_Warning
## 1                        OK  <0.001      <0.001         Yes                  No
## 2                        OK  <0.001      <0.001         Yes                  No
## 3                        OK  <0.001      <0.001         Yes                  No
## 4                        OK  <0.001      <0.001         Yes                  No
## 5                        OK  <0.001      <0.001         Yes                  No
## 6                        OK   0.191       0.202          No                  No
## 7                        OK  <0.001      <0.001         Yes                  No
## 8                        OK  <0.001      <0.001         Yes                  No
## 9                        OK  <0.001      <0.001         Yes                  No
## 10                       OK  <0.001      <0.001         Yes                  No
## 11                       OK  <0.001      <0.001         Yes                  No
## 12                       OK  <0.001      <0.001         Yes                  No
## 13                       OK   0.041       0.046         Yes                  No
## 14                       OK  <0.001      <0.001         Yes                  No
## 15                       OK  <0.001      <0.001         Yes                  No
## 16                       OK  <0.001      <0.001         Yes                  No
## 17                       OK  <0.001      <0.001         Yes                  No
## 18                       OK  <0.001      <0.001         Yes                  No
## 19                       OK  <0.001      <0.001         Yes                  No
## 20                       OK  <0.001      <0.001         Yes                  No
## 21                       OK   0.349       0.360          No                  No
## 22                       OK  <0.001      <0.001         Yes                  No
## 23                       OK   0.826       0.826          No                  No
## 24                       OK  <0.001      <0.001         Yes                  No
## 25                       OK   0.001       0.001         Yes                  No
## 26                       OK  <0.001      <0.001         Yes                  No
## 27                       OK  <0.001      <0.001         Yes                  No
## 28                       OK  <0.001      <0.001         Yes                  No
## 29 Error: NA/NaN/I f i  'x'      NA          NA          No                 Yes
## 30                       OK   0.057       0.062          No                  No
## 31                       OK  <0.001      <0.001         Yes                  No
## 32                       OK  <0.001      <0.001         Yes                  No
## 33                       OK  <0.001      <0.001         Yes                  No
## 34                       OK  <0.001      <0.001         Yes                  No
## 35                       OK  <0.001      <0.001         Yes                  No
## 36 Error: NA/NaN/I f i  'x'      NA          NA          No                 Yes
## 37                       OK  <0.001      <0.001         Yes                  No
##    Convergence_Warning
## 1                   No
## 2                   No
## 3                   No
## 4                   No
## 5                   No
## 6                   No
## 7                   No
## 8                   No
## 9                   No
## 10                  No
## 11                  No
## 12                  No
## 13                  No
## 14                  No
## 15                  No
## 16                  No
## 17                  No
## 18                  No
## 19                  No
## 20                  No
## 21                  No
## 22                  No
## 23                  No
## 24                  No
## 25                  No
## 26                  No
## 27                  No
## 28                  No
## 29                  No
## 30                  No
## 31                  No
## 32                  No
## 33                  No
## 34                  No
## 35                  No
## 36                  No
## 37                  No
# Compare symptoms variables across groups (cognitive)
compare_groups_auto_v4(imputed_data[c(1:22)], group = "cognitive", covariates = "age_pcr")
##                    Variable                           Type Covariates_Used
## 1                  de800cog Continuous (Gaussian GLM used)         age_pcr
## 2                    images Continuous (Gaussian GLM used)         age_pcr
## 3                  age_2024 Continuous (Gaussian GLM used)         age_pcr
## 4              age_interval Continuous (Gaussian GLM used)         age_pcr
## 5                   anosmia Continuous (Gaussian GLM used)         age_pcr
## 6         risk_hospital_icu                     Continuous         age_pcr
## 7      vaccine_before_study Continuous (Gaussian GLM used)         age_pcr
## 8  covid_before_vaccination                         Binary         age_pcr
## 9                     fever                         Binary         age_pcr
## 10                    cough                         Binary         age_pcr
## 11              muscle_pain                         Binary         age_pcr
## 12               breath_dif                         Binary         age_pcr
## 13               smell_lost                         Binary         age_pcr
## 14               taste_lost                         Binary         age_pcr
## 15                      pcr                         Binary         age_pcr
## 16                  pcr_num                         Binary         age_pcr
## 17            covid_variant Continuous (Gaussian GLM used)         age_pcr
## 18                vaccine_1 Continuous (Gaussian GLM used)         age_pcr
## 19                vaccine_2 Continuous (Gaussian GLM used)         age_pcr
## 20                vaccine_3 Continuous (Gaussian GLM used)         age_pcr
##    n_obs                   Status p_value p_value_FDR Significant
## 1    463                       OK  <0.001      <0.001         Yes
## 2    463                       OK   0.235       0.446          No
## 3    463                       OK   0.483       0.765          No
## 4    463                       OK   0.132       0.314          No
## 5    463                       OK   0.089       0.240          No
## 6    463 Error: NA/NaN/I f i  'x'      NA          NA          No
## 7    463                       OK   0.015       0.147          No
## 8    463                       OK   0.081       0.240          No
## 9    463                       OK   0.227       0.446          No
## 10   463                       OK   0.811       0.873          No
## 11   463                       OK   0.606       0.873          No
## 12   463                       OK   0.058       0.240          No
## 13   463                       OK   0.725       0.873          No
## 14   463                       OK   0.357       0.617          No
## 15   462                       OK   0.064       0.240          No
## 16   463                       OK   0.087       0.240          No
## 17   463                       OK   0.719       0.873          No
## 18   463                       OK   0.998       0.998          No
## 19   463                       OK   0.827       0.873          No
## 20   463                       OK   0.782       0.873          No
##    Gamma_Shift_Warning Convergence_Warning
## 1                   No                  No
## 2                   No                  No
## 3                   No                  No
## 4                   No                  No
## 5                   No                  No
## 6                  Yes                  No
## 7                   No                  No
## 8                   No                  No
## 9                   No                  No
## 10                  No                  No
## 11                  No                  No
## 12                  No                  No
## 13                  No                  No
## 14                  No                  No
## 15                  No                  No
## 16                  No                  No
## 17                  No                  No
## 18                  No                  No
## 19                  No                  No
## 20                  No                  No
# Compare symptoms variables across groups (age_interval)
compare_groups_auto_v4(imputed_data[c(1:22)], group = "age_interval")
##                    Variable                           Type Covariates_Used
## 1                 cognitive                         Binary                
## 2                  de800cog Continuous (Gaussian GLM used)                
## 3                    images Continuous (Gaussian GLM used)                
## 4                  age_2024 Continuous (Gaussian GLM used)                
## 5                   age_pcr Continuous (Gaussian GLM used)                
## 6                   anosmia Continuous (Gaussian GLM used)                
## 7         risk_hospital_icu                     Continuous                
## 8      vaccine_before_study Continuous (Gaussian GLM used)                
## 9  covid_before_vaccination                         Binary                
## 10                    fever                         Binary                
## 11                    cough                         Binary                
## 12              muscle_pain                         Binary                
## 13               breath_dif                         Binary                
## 14               smell_lost                         Binary                
## 15               taste_lost                         Binary                
## 16                      pcr                         Binary                
## 17                  pcr_num                         Binary                
## 18            covid_variant Continuous (Gaussian GLM used)                
## 19                vaccine_1 Continuous (Gaussian GLM used)                
## 20                vaccine_2 Continuous (Gaussian GLM used)                
## 21                vaccine_3 Continuous (Gaussian GLM used)                
##    n_obs                   Status p_value p_value_FDR Significant
## 1    463                       OK  <0.001      <0.001         Yes
## 2    463                       OK  <0.001       0.001         Yes
## 3    463                       OK  <0.001      <0.001         Yes
## 4    463                       OK  <0.001      <0.001         Yes
## 5    463                       OK  <0.001      <0.001         Yes
## 6    463                       OK   0.358       0.447          No
## 7    463 Error: NA/NaN/I f i  'x'      NA          NA          No
## 8    463                       OK   0.103       0.159          No
## 9    463                       OK   0.537       0.597          No
## 10   463                       OK   0.125       0.178          No
## 11   463                       OK   0.022       0.055          No
## 12   463                       OK   0.034       0.076          No
## 13   463                       OK   0.016       0.055          No
## 14   463                       OK   0.078       0.142          No
## 15   463                       OK   0.020       0.055          No
## 16   462                       OK   0.071       0.142          No
## 17   463                       OK   0.087       0.144          No
## 18   463                       OK   0.844       0.844          No
## 19   463                       OK   0.416       0.489          No
## 20   463                       OK   0.614       0.646          No
## 21   463                       OK   0.315       0.420          No
##    Gamma_Shift_Warning Convergence_Warning
## 1                   No                  No
## 2                   No                  No
## 3                   No                  No
## 4                   No                  No
## 5                   No                  No
## 6                   No                  No
## 7                  Yes                  No
## 8                   No                  No
## 9                   No                  No
## 10                  No                  No
## 11                  No                  No
## 12                  No                  No
## 13                  No                  No
## 14                  No                  No
## 15                  No                  No
## 16                  No                  No
## 17                  No                  No
## 18                  No                  No
## 19                  No                  No
## 20                  No                  No
## 21                  No                  No

Descriptive Statistics and Data Quality

The dataset (N=463) exhibited minimal missingness for most variables, although some items (e.g., certain “cough” or “PCR” measures) had higher rates of missing data. They were imputed using CART method (see: https://stefvanbuuren.name/fimd/sec-cart.html)

Outlier diagnostics indicated a small subset of extreme values in several cognitive-performance and neuroimaging measures. These did not prevent model convergence overall, but a few variables required a shift when attempting Gamma-based modeling, suggesting skewed distributions and the presence of zeros or negative values. I used false discovery rate (FDR-controlling procedures) in the R syntax (https://link.springer.com/referenceworkentry/10.1007/978-1-4419-9863-7_223)

Cognitive Group Comparisons (Controlling for Age)

When grouping by the “cognitive” variable (binary/ordinal classification), we tested multiple subcortical and cortical volumes as outcomes in generalized linear models (Gaussian family) with “age_pcr” as a covariate.

Most brain volumetric measures did not show a statistically significant difference between cognitive groups after false discovery rate (FDR) correction. A single region (e.g., right_thalamus_proper, p=0.032) approached nominal significance but was not significant after multiple-comparison adjustment.

Age Interval Group Comparisons

By contrast, grouping participants according to their “age_interval” revealed widespread associations with numerous brain volumes (e.g., hippocampus, thalamus, cerebellum, and other subcortical/cortical areas). Several regions displayed highly significant p-values that remained robust after FDR correction. This finding underscores the substantial impact of age on volumetric measures.

Additionally, certain task-based cognitive measures (e.g., reaction times and error rates in the 5D tasks or OT tasks) also varied significantly across age intervals, indicating age-related differences in cognitive performance.

Symptom and Demographic Variables

Exploratory analyses of symptoms (e.g., anosmia, cough, muscle pain) across “age_interval” did not yield consistent significance after FDR correction, although some nominal p-values (e.g., for cough and muscle_pain) were <0.05 uncorrected.

Interestingly, “cognitive” status and certain related variables (e.g., de800cog) differed significantly across age intervals (p<0.001), suggesting that older groups may show different cognitive test profiles.

Overall Implications

These preliminary results highlight age as a crucial factor influencing both neuroimaging measures and certain cognitive markers.

The “cognitive” grouping alone did not robustly differentiate subcortical volumes once age was accounted for, suggesting that chronological age exerts a stronger effect on structural brain metrics than the cognitive classification in this sample.

Random Forest Classification for Cognitive Status

A machine learning approach such as a Random Forest can be used to classify participants according to cognitive status using demographic and neuroimaging predictors. This approach can also help identify the most important predictors of cognitive differences.

# Load the randomForest package
library(randomForest)

# Ensure cognitive status is treated as a factor
imputed_data$cognitive <- as.factor(imputed_data$cognitive)

# Select predictors (here we include age_pcr and neuroimaging measures; adjust as needed)
predictors <- imputed_data %>% 
  select(age_pcr, right_accumbens_area:left_g_re_gyrus_rectus)

# Combine response and predictors into one dataframe
rf_data <- data.frame(cognitive = imputed_data$cognitive, predictors)

# Set seed for reproducibility and train the random forest model
set.seed(123)
rf_model <- randomForest(cognitive ~ ., data = rf_data, importance = TRUE)

# Print the model summary
print(rf_model)
## 
## Call:
##  randomForest(formula = cognitive ~ ., data = rf_data, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 40.6%
## Confusion matrix:
##     1   2 class.error
## 1 134  97   0.4199134
## 2  91 141   0.3922414
# Plot variable importance to identify key predictors
varImpPlot(rf_model)

Interpretation of the Random Forest Variable Importance Plot

In these two plots (yes, I know they are too small to be seen), each point represents the contribution of a particular predictor variable to the random forest model used to classify participants according to their cognitive status. The left panel (Mean Decrease in Accuracy) indicates how much model accuracy would drop if a given variable were excluded, while the right panel (Mean Decrease in Gini) shows how each variable contributes to node purity (i.e., how well it splits the data within the trees). 1. Top Predictors • Age (age_pcr) stands out as the most influential predictor for classifying cognitive status, indicating that chronological age has the strongest impact on the model’s decision-making process. • Subcortical volumes (e.g., hippocampus and thalamus) also rank highly, suggesting that these brain regions are important for differentiating between cognitive groups. 2. Model Performance • The out-of-bag (OOB) error rate of about 40.6% implies a moderate level of predictive accuracy (roughly 59.4% correct classification). While certain variables (like age and hippocampal volumes) are clearly influential, the model still struggles to classify all individuals correctly. 3. Practical Implications • The strong role of age underscores the need to control for or further investigate age-related effects when studying cognitive outcomes. • The importance of hippocampal and thalamic volumes aligns with existing evidence that these structures are linked to cognitive performance, particularly in aging populations. • The moderate overall accuracy suggests that either (a) additional variables or (b) a different modeling approach (e.g., feature engineering, dimension reduction, or other machine learning methods) may be needed to improve classification performance.

Overall, these plots highlight that age and certain subcortical regions are key drivers of the model’s classification decisions, but the relatively high misclassification rate points to a complex interplay of factors influencing cognitive status.

Generalized Additive Models (GAM) for Non-linear Relationships

GAMs allow flexible modeling of non-linear effects. For instance, you can explore the non-linear association between age (or age interval) and brain volumes, which may not be captured adequately by linear models.

# Load the mgcv package
library(mgcv)

# Fit a GAM for one brain region (e.g., right_hippocampus) as a function of age_pcr
gam_model <- gam(right_hippocampus ~ s(age_pcr), data = imputed_data)

# Print the summary of the model
summary(gam_model)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## right_hippocampus ~ s(age_pcr)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3823.52      19.28   198.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df     F  p-value    
## s(age_pcr) 3.419  4.326 9.536 3.68e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0801   Deviance explained = 8.69%
## GCV = 1.7373e+05  Scale est. = 1.7207e+05  n = 463
# Plot the fitted smooth along with the residuals
plot(gam_model, residuals = TRUE, pch = 20, cex = 0.5, shade = TRUE)

# Diagnostic plots for the GAM model
par(mfrow = c(2,2))
plot(gam_model, residuals = TRUE, pch = 20, cex = 0.5, shade = TRUE)
# Check model diagnostics
gam.check(gam_model)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 4 iterations.
## The RMS GCV score gradient at convergence was 0.01721632 .
## The Hessian was positive definite.
## Model rank =  10 / 10 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##              k'  edf k-index p-value
## s(age_pcr) 9.00 3.42    0.95    0.12
library(lavaan)

# Convert 'cognitive' to numeric (ensure that the factor levels correspond to numeric values)
imputed_data$cognitive_num <- as.numeric(as.character(imputed_data$cognitive))

# Specify the mediation model using the numeric version of cognitive
mediation_model <- '
  # Direct effect
  cognitive_num ~ c*age_pcr
  # Mediator path
  right_hippocampus ~ a*age_pcr
  cognitive_num ~ b*right_hippocampus
  # Indirect effect (a*b) and total effect
  ab := a*b
  total := c + (a*b)
'

# Fit the mediation model with robust maximum likelihood estimation
fit_mediation <- sem(mediation_model, data = imputed_data, estimator = "MLR")
summary(fit_mediation, standardized = TRUE, fit.measures = TRUE)
## lavaan 0.6-19 ended normally after 1 iteration
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##   Number of observations                           463
## 
## Model Test User Model:
##                                               Standard      Scaled
##   Test Statistic                                 0.000          NA
##   Degrees of freedom                                 0           0
## 
## Model Test Baseline Model:
## 
##   Test statistic                                67.355      67.354
##   Degrees of freedom                                 3           3
##   P-value                                        0.000       0.000
##   Scaling correction factor                                  1.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    1.000          NA
##   Tucker-Lewis Index (TLI)                       1.000          NA
##                                                                   
##   Robust Comparative Fit Index (CFI)                            NA
##   Robust Tucker-Lewis Index (TLI)                               NA
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -3769.059   -3769.059
##   Loglikelihood unrestricted model (H1)      -3769.059   -3769.059
##                                                                   
##   Akaike (AIC)                                7548.117    7548.117
##   Bayesian (BIC)                              7568.806    7568.806
##   Sample-size adjusted Bayesian (SABIC)       7552.937    7552.937
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.000          NA
##   90 Percent confidence interval - lower         0.000          NA
##   90 Percent confidence interval - upper         0.000          NA
##   P-value H_0: RMSEA <= 0.050                       NA          NA
##   P-value H_0: RMSEA >= 0.080                       NA          NA
##                                                                   
##   Robust RMSEA                                               0.000
##   90 Percent confidence interval - lower                        NA
##   90 Percent confidence interval - upper                        NA
##   P-value H_0: Robust RMSEA <= 0.050                            NA
##   P-value H_0: Robust RMSEA >= 0.080                            NA
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.000       0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Sandwich
##   Information bread                           Observed
##   Observed information based on                Hessian
## 
## Regressions:
##                       Estimate    Std.Err  z-value  P(>|z|)   Std.lv    Std.all
##   cognitive_num ~                                                              
##     age_pcr    (c)        -0.024       NA                       -0.024   -0.252
##   right_hippocampus ~                                                          
##     age_pcr    (a)       -20.512       NA                      -20.512   -0.252
##   cognitive_num ~                                                              
##     rght_hppcm (b)         0.000       NA                        0.000    0.069
## 
## Variances:
##                    Estimate    Std.Err  z-value  P(>|z|)   Std.lv    Std.all
##    .cognitive_num       0.231       NA                        0.231    0.923
##    .right_hippcmps 174832.240       NA                   174832.240    0.937
## 
## Defined Parameters:
##                    Estimate    Std.Err  z-value  P(>|z|)   Std.lv    Std.all
##     ab                 -0.002                                -0.002   -0.017
##     total              -0.025                                -0.025   -0.269

The mediation model is just‐identified, so all global fit indices (CFI, TLI, RMSEA, SRMR) are perfect by definition and do not provide additional insight. The estimated parameters indicate the following: • Direct Effect (c): The effect of age (age_pcr) on cognitive performance (cognitive_num) is estimated at –0.024 (standardized –0.252), suggesting that, holding other factors constant, an increase in age is directly associated with lower cognitive scores. • Path a: The effect of age on right hippocampal volume is –20.512 (standardized –0.252), indicating that older age is associated with a reduction in hippocampal volume. • Path b: The effect of right hippocampal volume on cognitive performance is estimated at 0.000 (standardized 0.069). This near‐zero coefficient suggests that, in this model, hippocampal volume does not significantly predict cognitive performance once age is accounted for. • Indirect Effect (ab): The product of paths a and b (i.e., the mediated effect) is –0.002 (standardized –0.017), which is very small relative to the total effect. • Total Effect: The sum of the direct and indirect effects is –0.025 (standardized –0.269), which is essentially driven by the direct effect of age on cognition.

While age significantly predicts both lower cognitive performance and reduced hippocampal volume, there is no evidence from this model that hippocampal volume mediates the relationship between age and cognitive performance. The mediation pathway (ab) is negligible, indicating that the association between age and cognition is primarily a direct effect.