SCD Rheology Model Refinement: Modeling ei_min, ei_max, and PoS

Author

Jonathan Wade

Analysis Summary

This analysis explores key rheological parameters in our OMICs cohort:

minimum elongation index (ei_min)
maximum elongation index (ei_max)
point of sickling (PoS).

Using mixed-effects models, we identify the most important clinical and demographic predictors of these outcomes while accounting for patient-level clustering and addressing class imbalance in our dataset.

Our approach combines statistical modeling with stratified bootstrap validation to ensure reliable inference for all patient subgroups. The results reveal distinct patterns of predictors for each rheological parameter, with some consistent effects across outcomes that may have important clinical implications.

Introduction

In this document we start by modeling three key outcomes from our SCD dataset:
- ei_min (Minimum Elongation Index)
- ei_max (Maximum Elongation Index)
- PoS (Point of Sickling, from pos_lorrca)

I begin with a full mixed-effects model (using a random intercept for each patient) that includes predictors such as age, gender, genotype, steady-state status, transfusion status, a transformed measure of HbF (using a cube-root transformation), and an age group indicator. I use the tidymodels framework for preprocessing, model fitting, and diagnostics.

Analytical Approach

Our analysis aimed to identify key predictors of rheological parameters while accounting for patient-level clustering and addressing class imbalance in our cohort. I employed a systematic approach to data preprocessing, model development, and validation.

Setup and Data Preprocessing

I began by preprocessing the OMICs dataset, removing extreme outliers and creating derived variables to capture clinically relevant features. We applied appropriate transformations to improve the normality of our outcome variables and prepared the data for mixed-effects modeling.

Code

# Load required libraries
library(tidyverse)
library(tidymodels)
library(themis)     # For additional preprocessing steps
library(vip)        # For variable importance
library(corrplot)   # For correlation plots
library(car)        # For VIF analysis
library(DHARMa)     # For mixed model diagnostics
library(glmmTMB)    # For alternative mixed models if needed
library(splines)    # For potential non-linear terms
library(gridExtra)  # For arranging plots
library(broom.mixed) # For tidy method with mixed models
tidymodels_prefer()
library(kableExtra)
library(grid)
library(gridExtra)
library(lme4)
library(cowplot)

set.seed(333)

# Load the dataset
df <- read_csv('OMICS-export-processed-3-5-25.csv')%>%
  filter(genotype != 'Sbeta+')

# Initial data preprocessing
df <- df %>%
  # Remove extreme outliers for ei_max
  filter(ei_max_lorrca <= 30) %>%
  # Convert dates and create derived variables
  mutate(
    date_collection = as.Date(date_of_collection, format = "%Y-%m-%d"),
    date_hba_before = as.Date(hba_before_date, format = "%Y-%m-%d"),
    date_hba_after = as.Date(hba_after_date, format = "%Y-%m-%d"),
    TRANSFUSION_STATUS = case_when(
      (!is.na(date_hba_before) & abs(as.numeric(date_collection - date_hba_before)) <= 90) |
      (!is.na(date_hba_after) & abs(as.numeric(date_hba_after - date_collection)) <= 90) ~ TRUE,
      TRUE ~ FALSE
    ),
    steady_state = if_else(!is.na(time_from_voe) & abs(time_from_voe) >= 4, TRUE, FALSE),
    # Determine the nearest HbF value
    days_to_hbf_before = if_else(!is.na(hbf_before_date),
                                 abs(as.numeric(date_collection - as.Date(hbf_before_date, format="%Y-%m-%d"))),
                                 Inf),
    days_to_hbf_after = if_else(!is.na(hbf_after_date),
                                abs(as.numeric(date_collection - as.Date(hbf_after_date, format="%Y-%m-%d"))),
                                Inf),
    nearest_hbf = if_else(days_to_hbf_before <= days_to_hbf_after, hbf_before, hbf_after),
    # Create hydroxyurea status variable
    HYDROXYUREA_STATUS = if_else(!is.na(op_hydroxyurea_status) & op_hydroxyurea_status == "Yes", TRUE, FALSE),
    # Create age group
    age_group = if_else(age >= 60, "Elderly", "Non-elderly"),
    # Create initial transformations
    sqrt_ei_min = sqrt(ei_min_lorrca),
    cuberoot_nearest_hbf = sign(nearest_hbf) * abs(nearest_hbf)^(1/3)
  ) %>%
  # Convert factors
  mutate(across(c(age_group, gender, genotype, TRANSFUSION_STATUS, 
                  steady_state, HYDROXYUREA_STATUS), as.factor)) %>%
  # Set factor levels
  mutate(age_group = fct_relevel(age_group, "Non-elderly", "Elderly"))

# Function to create a recipe for each outcome
create_rheology_recipe <- function(data, outcome, outcome_transform = NULL) {
  # Create initial recipe
  rec <- recipe(formula = as.formula(paste(outcome, "~ .")), data = data) %>%
    # Update roles - explicitly specify ID column
    update_role(all_of("patient_mrn"), new_role = "ID") %>%
    # Remove date variables and other unnecessary columns
    update_role(matches("date|days_to|_before$|_after$"), new_role = "drop") %>%
    # Remove variables we don't want in the model
    step_rm(matches("_lorrca$"), -all_outcomes()) %>%
    # Center and scale numeric predictors
    step_normalize(all_numeric_predictors()) %>%
    # Check for correlations
    step_corr(all_numeric_predictors(), threshold = 0.8) %>%
    # Create interaction terms
    step_interact(terms = ~ weight_kg:HYDROXYUREA_STATUS) %>%
    # Remove near-zero variance predictors
    step_nzv(all_predictors()) %>%
    # Dummy code categorical variables, excluding ID and outcome
    step_dummy(all_nominal_predictors(), -all_outcomes(), -has_role("ID"))
  
  # Add outcome transformation if specified
  if (!is.null(outcome_transform)) {
    rec <- rec %>% 
      step_mutate(!!outcome := !!outcome_transform)
  }
  
  return(rec)
}

# Function to check VIF
check_vif <- function(recipe, data) {
  # Prep and bake the recipe
  prepped_data <- prep(recipe) %>% bake(new_data = NULL)
  
  # Remove ID column for VIF calculation
  model_data <- prepped_data %>% select(-patient_mrn)
  
  # Fit a linear model and check VIF
  vif_model <- lm(reformulate(setdiff(names(model_data), names(model_data)[1]), 
                             response = names(model_data)[1]), 
                 data = model_data)
  print(car::vif(vif_model))
}

# Function to plot correlations
plot_correlations <- function(recipe, data) {
  # Prep and bake the recipe
  prepped_data <- prep(recipe) %>% bake(new_data = NULL)
  
  # Create correlation matrix for numeric variables
  cor_matrix <- prepped_data %>%
    select(where(is.numeric), -patient_mrn) %>%
    cor(use = "pairwise.complete.obs")
  
  # Plot correlations
  corrplot(cor_matrix, method = "circle", type = "upper", 
           tl.col = "black", tl.srt = 45)
}

# Function to fit mixed model
fit_mixed_model <- function(data, outcome, recipe) {
  # Prep the recipe
  prepped_recipe <- prep(recipe)
  
  # Bake the data
  model_data <- bake(prepped_recipe, new_data = NULL)
  
  # Get predictor names (excluding outcome and ID)
  predictors <- setdiff(names(model_data), c(outcome, "patient_mrn"))
  
  # Create formula
  formula_str <- paste(outcome, "~", 
                      paste(predictors, collapse = " + "), 
                      "+ (1 | patient_mrn)")
  
  # Fit the model
  model <- lmer(as.formula(formula_str), data = model_data)
  
  return(model)
}

We applied appropriate transformations to each outcome variable based on their distributions: square root for ei_min and Box-Cox for ei_max and PoS. These transformations helped improve the normality of our outcome variables and ensure the validity of our statistical models.

Model Development

We developed mixed-effects models for each rheological parameter, starting with theory-driven predictors and using backward elimination to identify parsimonious models. Each model accounts for within-patient correlation through a random intercept term.

For the minimum elongation index (ei_min), we applied a square root transformation to improve normality. The model includes key clinical predictors and accounts for patient-level clustering.

Code

# Prepare data for ei_min model
ei_min_data <- df %>%
  select(sqrt_ei_min, patient_mrn, age, gender, genotype, steady_state,
         TRANSFUSION_STATUS, cuberoot_nearest_hbf, age_group, 
         hb_advia, mchc_advia, weight_kg, HYDROXYUREA_STATUS) %>%
 # drop_na() %>%
  # Ensure patient_mrn is a factor
  mutate(patient_mrn = as.factor(patient_mrn))

# Create and prep recipe
ei_min_recipe <- create_rheology_recipe(ei_min_data, "sqrt_ei_min")

# Check VIF and correlations
check_vif(ei_min_recipe, ei_min_data)

    cuberoot_nearest_hbf                 hb_advia               mchc_advia 
                1.639829                 1.330248                 1.164235 
               weight_kg              sqrt_ei_min              gender_Male 
                1.152483                 1.386331                 1.159489 
             genotype_SS       steady_state_TRUE. TRANSFUSION_STATUS_TRUE. 
                1.626930                 1.061169                 1.664359 
       age_group_Elderly HYDROXYUREA_STATUS_TRUE. 
                1.142337                 1.032316

Code

plot_correlations(ei_min_recipe, ei_min_data)

Code

# Fit the mixed model directly
ei_min_model <- fit_mixed_model(ei_min_data, "sqrt_ei_min", ei_min_recipe)

# Model diagnostics
# DHARMa diagnostics
ei_min_resids <- simulateResiduals(ei_min_model)
plot(ei_min_resids)

Code

# Standard diagnostic plots - using manual approach instead of augment
model_data <- bake(prep(ei_min_recipe), new_data = NULL)
fitted_vals <- fitted(ei_min_model)
residuals_vals <- residuals(ei_min_model)

tibble(
  .fitted = fitted_vals,
  .resid = residuals_vals
) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  labs(title = "ei_min: Residuals vs Fitted Values")

Code

tibble(.resid = residuals_vals) %>%
  ggplot(aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "ei_min: Normal Q-Q Plot")

Code

# Print model summary
summary(ei_min_model)

Linear mixed model fit by REML ['lmerMod']
Formula: sqrt_ei_min ~ age + cuberoot_nearest_hbf + hb_advia + mchc_advia +  
    weight_kg + gender_Male + genotype_SS + steady_state_TRUE. +  
    TRANSFUSION_STATUS_TRUE. + age_group_Elderly + HYDROXYUREA_STATUS_TRUE. +  
    (1 | patient_mrn)
   Data: model_data

REML criterion at convergence: -546.4

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.58035 -0.57670 -0.01173  0.59654  2.58307 

Random effects:
 Groups      Name        Variance Std.Dev.
 patient_mrn (Intercept) 0.006099 0.0781  
 Residual                0.010722 0.1035  
Number of obs: 471, groups:  patient_mrn, 278

Fixed effects:
                          Estimate Std. Error t value
(Intercept)               0.397776   0.027769  14.325
age                      -0.000425   0.011194  -0.038
cuberoot_nearest_hbf      0.003960   0.008311   0.477
hb_advia                  0.015684   0.007640   2.053
mchc_advia               -0.020467   0.005892  -3.474
weight_kg                 0.014921   0.007712   1.935
gender_Male              -0.037477   0.014490  -2.586
genotype_SS              -0.040700   0.030304  -1.343
steady_state_TRUE.       -0.002863   0.014961  -0.191
TRANSFUSION_STATUS_TRUE.  0.142880   0.015346   9.310
age_group_Elderly         0.078583   0.035614   2.207
HYDROXYUREA_STATUS_TRUE. -0.007567   0.022986  -0.329

Correlation of Fixed Effects:
            (Intr) age    cbrt__ hb_adv mchc_d wght_k gndr_M gnt_SS s__TRU
age          0.108                                                        
cbrt_nrst_h  0.327  0.092                                                 
hb_advia    -0.204  0.168 -0.154                                          
mchc_advia  -0.081 -0.014  0.037 -0.238                                   
weight_kg   -0.027 -0.317 -0.247 -0.168  0.073                            
gender_Male -0.047  0.021  0.102 -0.181  0.020 -0.053                     
genotype_SS -0.838  0.067 -0.411  0.300  0.077  0.030 -0.227              
stdy__TRUE. -0.266 -0.100 -0.030  0.034  0.044 -0.037  0.006 -0.082       
TRANSFUSION  0.132  0.109  0.416  0.021 -0.077 -0.163  0.066 -0.272 -0.101
ag_grp_Eldr -0.200 -0.706 -0.194 -0.077  0.031  0.310  0.072  0.025  0.031
HYDROXYUREA -0.014 -0.094 -0.003  0.022  0.049  0.003  0.025 -0.073 -0.067
            TRANSF ag_g_E
age                      
cbrt_nrst_h              
hb_advia                 
mchc_advia               
weight_kg                
gender_Male              
genotype_SS              
stdy__TRUE.              
TRANSFUSION              
ag_grp_Eldr -0.120       
HYDROXYUREA  0.041  0.088

Code

# Create a tidy summary with confidence intervals
broom.mixed::tidy(ei_min_model, conf.int = TRUE) %>%
  knitr::kable(caption = "ei_min Model Coefficients")

ei_min Model Coefficients
effect	group	term	estimate	std.error	statistic	conf.low	conf.high
fixed	NA	(Intercept)	0.3977760	0.0277686	14.3246713	0.3433505	0.4522014
fixed	NA	age	-0.0004250	0.0111940	-0.0379666	-0.0223647	0.0215147
fixed	NA	cuberoot_nearest_hbf	0.0039603	0.0083112	0.4765074	-0.0123292	0.0202499
fixed	NA	hb_advia	0.0156840	0.0076398	2.0529417	0.0007103	0.0306577
fixed	NA	mchc_advia	-0.0204666	0.0058917	-3.4737840	-0.0320143	-0.0089190
fixed	NA	weight_kg	0.0149212	0.0077125	1.9346792	-0.0001950	0.0300375
fixed	NA	gender_Male	-0.0374773	0.0144899	-2.5864516	-0.0658769	-0.0090777
fixed	NA	genotype_SS	-0.0407000	0.0303035	-1.3430796	-0.1000937	0.0186937
fixed	NA	steady_state_TRUE.	-0.0028628	0.0149608	-0.1913549	-0.0321854	0.0264598
fixed	NA	TRANSFUSION_STATUS_TRUE.	0.1428804	0.0153463	9.3104015	0.1128022	0.1729587
fixed	NA	age_group_Elderly	0.0785826	0.0356136	2.2065304	0.0087811	0.1483840
fixed	NA	HYDROXYUREA_STATUS_TRUE.	-0.0075673	0.0229862	-0.3292120	-0.0526194	0.0374848
ran_pars	patient_mrn	sd__(Intercept)	0.0780979	NA	NA	NA	NA
ran_pars	Residual	sd__Observation	0.1035489	NA	NA	NA	NA

For the maximum elongation index (ei_max), we applied a Box-Cox transformation to address non-normality. The optimal lambda was determined empirically to ensure the best possible transformation.

Code

# Determine optimal Box-Cox transformation
bc_ei_max <- car::powerTransform(df$ei_max_lorrca)
ei_max_lambda <- bc_ei_max$roundlam
cat("Optimal lambda for ei_max:", ei_max_lambda, "\n")

Optimal lambda for ei_max: 2.612086

Code

# Prepare data for ei_max model
ei_max_data <- df %>%
  select(ei_max_lorrca, patient_mrn, age, gender, genotype, steady_state,
         TRANSFUSION_STATUS, cuberoot_nearest_hbf, age_group,
         hb_advia, mchc_advia, weight_kg, HYDROXYUREA_STATUS) %>%
#  drop_na() %>%
  # Ensure patient_mrn is a factor
  mutate(patient_mrn = as.factor(patient_mrn))

# Create recipe with Box-Cox transformation
ei_max_recipe <- create_rheology_recipe(
  ei_max_data, 
  "ei_max_lorrca",
  outcome_transform = quo(car::bcPower(ei_max_lorrca, !!ei_max_lambda))
)

# Check VIF and correlations
check_vif(ei_max_recipe, ei_max_data)

    cuberoot_nearest_hbf                 hb_advia               mchc_advia 
                1.690357                 1.474384                 1.211087 
               weight_kg            ei_max_lorrca              gender_Male 
                1.141238                 1.318679                 1.160218 
             genotype_SS       steady_state_TRUE. TRANSFUSION_STATUS_TRUE. 
                1.598333                 1.060065                 1.410135 
       age_group_Elderly HYDROXYUREA_STATUS_TRUE. 
                1.111661                 1.029340

Code

plot_correlations(ei_max_recipe, ei_max_data)

Code

# Fit the mixed model directly
ei_max_model <- fit_mixed_model(ei_max_data, "ei_max_lorrca", ei_max_recipe)

# Model diagnostics
# DHARMa diagnostics
ei_max_resids <- simulateResiduals(ei_max_model)
plot(ei_max_resids)

Code

# Standard diagnostic plots - using manual approach instead of augment
model_data <- bake(prep(ei_max_recipe), new_data = NULL)
fitted_vals <- fitted(ei_max_model)
residuals_vals <- residuals(ei_max_model)

tibble(
  .fitted = fitted_vals,
  .resid = residuals_vals
) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  labs(title = "ei_max: Residuals vs Fitted Values")

Code

tibble(.resid = residuals_vals) %>%
  ggplot(aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "ei_max: Normal Q-Q Plot")

Code

# Print model summary
summary(ei_max_model)

Linear mixed model fit by REML ['lmerMod']
Formula: ei_max_lorrca ~ age + cuberoot_nearest_hbf + hb_advia + mchc_advia +  
    weight_kg + gender_Male + genotype_SS + steady_state_TRUE. +  
    TRANSFUSION_STATUS_TRUE. + age_group_Elderly + HYDROXYUREA_STATUS_TRUE. +  
    (1 | patient_mrn)
   Data: model_data

REML criterion at convergence: -2305.8

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.56309 -0.53509  0.03468  0.54263  2.67582 

Random effects:
 Groups      Name        Variance  Std.Dev.
 patient_mrn (Intercept) 0.0002181 0.01477 
 Residual                0.0002360 0.01536 
Number of obs: 486, groups:  patient_mrn, 283

Fixed effects:
                           Estimate Std. Error t value
(Intercept)              -0.3197272  0.0046901 -68.170
age                      -0.0025482  0.0018684  -1.364
cuberoot_nearest_hbf      0.0040662  0.0013514   3.009
hb_advia                  0.0077241  0.0012197   6.333
mchc_advia               -0.0050137  0.0008962  -5.594
weight_kg                 0.0008363  0.0012896   0.648
gender_Male              -0.0069049  0.0024146  -2.860
genotype_SS              -0.0044191  0.0050660  -0.872
steady_state_TRUE.       -0.0003566  0.0024980  -0.143
TRANSFUSION_STATUS_TRUE.  0.0117431  0.0024156   4.861
age_group_Elderly         0.0095029  0.0060179   1.579
HYDROXYUREA_STATUS_TRUE.  0.0017913  0.0038837   0.461

Correlation of Fixed Effects:
            (Intr) age    cbrt__ hb_adv mchc_d wght_k gndr_M gnt_SS s__TRU
age          0.103                                                        
cbrt_nrst_h  0.309  0.083                                                 
hb_advia    -0.196  0.164 -0.160                                          
mchc_advia  -0.061 -0.015  0.051 -0.256                                   
weight_kg   -0.023 -0.311 -0.240 -0.166  0.073                            
gender_Male -0.055  0.006  0.096 -0.176  0.028 -0.049                     
genotype_SS -0.839  0.078 -0.384  0.292  0.053  0.024 -0.220              
stdy__TRUE. -0.262 -0.100 -0.018  0.043  0.032 -0.046  0.021 -0.095       
TRANSFUSION  0.116  0.096  0.384  0.012 -0.070 -0.156  0.055 -0.246 -0.081
ag_grp_Eldr -0.192 -0.699 -0.186 -0.077  0.030  0.302  0.080  0.013  0.036
HYDROXYUREA -0.014 -0.098 -0.012  0.018  0.044  0.004  0.020 -0.070 -0.062
            TRANSF ag_g_E
age                      
cbrt_nrst_h              
hb_advia                 
mchc_advia               
weight_kg                
gender_Male              
genotype_SS              
stdy__TRUE.              
TRANSFUSION              
ag_grp_Eldr -0.114       
HYDROXYUREA  0.030  0.086

Code

# Create a tidy summary with confidence intervals
broom.mixed::tidy(ei_max_model, conf.int = TRUE) %>%
  knitr::kable(caption = "ei_max Model Coefficients")

ei_max Model Coefficients
effect	group	term	estimate	std.error	statistic	conf.low	conf.high
fixed	NA	(Intercept)	-0.3197272	0.0046901	-68.1703447	-0.3289197	-0.3105347
fixed	NA	age	-0.0025482	0.0018684	-1.3638585	-0.0062102	0.0011138
fixed	NA	cuberoot_nearest_hbf	0.0040662	0.0013514	3.0089699	0.0014176	0.0067148
fixed	NA	hb_advia	0.0077241	0.0012197	6.3325291	0.0053334	0.0101147
fixed	NA	mchc_advia	-0.0050137	0.0008962	-5.5942163	-0.0067703	-0.0032571
fixed	NA	weight_kg	0.0008363	0.0012896	0.6484802	-0.0016912	0.0033638
fixed	NA	gender_Male	-0.0069049	0.0024146	-2.8597211	-0.0116374	-0.0021725
fixed	NA	genotype_SS	-0.0044191	0.0050660	-0.8723103	-0.0143484	0.0055101
fixed	NA	steady_state_TRUE.	-0.0003566	0.0024980	-0.1427695	-0.0052527	0.0045394
fixed	NA	TRANSFUSION_STATUS_TRUE.	0.0117431	0.0024156	4.8613387	0.0070086	0.0164777
fixed	NA	age_group_Elderly	0.0095029	0.0060179	1.5791099	-0.0022919	0.0212977
fixed	NA	HYDROXYUREA_STATUS_TRUE.	0.0017913	0.0038837	0.4612392	-0.0058205	0.0094031
ran_pars	patient_mrn	sd__(Intercept)	0.0147667	NA	NA	NA	NA
ran_pars	Residual	sd__Observation	0.0153616	NA	NA	NA	NA

For the Point of Sickling (PoS), we also applied a Box-Cox transformation to address non-normality. This parameter represents a critical threshold in the sickling process and may have different predictors than the elongation indices.

Code

# Determine optimal Box-Cox transformation
bc_pos <- car::powerTransform(df$pos_lorrca)
pos_lambda <- bc_pos$roundlam
cat("Optimal lambda for PoS:", pos_lambda, "\n")

Optimal lambda for PoS: 0.33

Code

# Prepare data for PoS model
pos_data <- df %>%
  select(pos_lorrca, patient_mrn, age, gender, genotype, steady_state,
         TRANSFUSION_STATUS, cuberoot_nearest_hbf, age_group,
         hb_advia, mchc_advia, weight_kg, HYDROXYUREA_STATUS) %>%
#  drop_na() %>%
  # Ensure patient_mrn is a factor
  mutate(patient_mrn = as.factor(patient_mrn))

# Create recipe with Box-Cox transformation
pos_recipe <- create_rheology_recipe(
  pos_data, 
  "pos_lorrca",
  outcome_transform = quo(car::bcPower(pos_lorrca, !!pos_lambda))
)

# Check VIF and correlations
check_vif(pos_recipe, pos_data)

    cuberoot_nearest_hbf                 hb_advia               mchc_advia 
                1.626069                 1.375243                 1.177631 
               weight_kg               pos_lorrca              gender_Male 
                1.146050                 1.333918                 1.146459 
             genotype_SS       steady_state_TRUE. TRANSFUSION_STATUS_TRUE. 
                1.742261                 1.058317                 1.387582 
       age_group_Elderly HYDROXYUREA_STATUS_TRUE. 
                1.154295                 1.030824

Code

plot_correlations(pos_recipe, pos_data)

Code

# Fit the mixed model directly
pos_model <- fit_mixed_model(pos_data, "pos_lorrca", pos_recipe)

# Model diagnostics
# DHARMa diagnostics
pos_resids <- simulateResiduals(pos_model)
plot(pos_resids)

Code

# Standard diagnostic plots - using manual approach instead of augment
model_data <- bake(prep(pos_recipe), new_data = NULL)
fitted_vals <- fitted(pos_model)
residuals_vals <- residuals(pos_model)

tibble(
  .fitted = fitted_vals,
  .resid = residuals_vals
) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  labs(title = "PoS: Residuals vs Fitted Values")

Code

tibble(.resid = residuals_vals) %>%
  ggplot(aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "PoS: Normal Q-Q Plot")

Code

# Print model summary
summary(pos_model)

Linear mixed model fit by REML ['lmerMod']
Formula: pos_lorrca ~ age + cuberoot_nearest_hbf + hb_advia + mchc_advia +  
    weight_kg + gender_Male + genotype_SS + steady_state_TRUE. +  
    TRANSFUSION_STATUS_TRUE. + age_group_Elderly + HYDROXYUREA_STATUS_TRUE. +  
    (1 | patient_mrn)
   Data: model_data

REML criterion at convergence: 1683.9

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.0272 -0.5235 -0.0015  0.5512  3.5319 

Random effects:
 Groups      Name        Variance Std.Dev.
 patient_mrn (Intercept) 0.4884   0.6989  
 Residual                1.4224   1.1927  
Number of obs: 483, groups:  patient_mrn, 283

Fixed effects:
                         Estimate Std. Error t value
(Intercept)               5.85649    0.28549  20.514
age                      -0.14910    0.11416  -1.306
cuberoot_nearest_hbf     -0.17925    0.08679  -2.065
hb_advia                 -0.30701    0.08178  -3.754
mchc_advia                0.26066    0.06290   4.144
weight_kg                -0.07703    0.08002  -0.963
gender_Male               0.33242    0.14814   2.244
genotype_SS               1.71192    0.31255   5.477
steady_state_TRUE.        0.11859    0.15395   0.770
TRANSFUSION_STATUS_TRUE. -0.95184    0.16184  -5.881
age_group_Elderly        -0.65899    0.36389  -1.811
HYDROXYUREA_STATUS_TRUE. -0.17974    0.23495  -0.765

Correlation of Fixed Effects:
            (Intr) age    cbrt__ hb_adv mchc_d wght_k gndr_M gnt_SS s__TRU
age          0.108                                                        
cbrt_nrst_h  0.337  0.086                                                 
hb_advia    -0.222  0.168 -0.189                                          
mchc_advia  -0.081 -0.007  0.054 -0.233                                   
weight_kg   -0.028 -0.323 -0.255 -0.162  0.067                            
gender_Male -0.046  0.006  0.103 -0.197  0.022 -0.052                     
genotype_SS -0.838  0.067 -0.423  0.328  0.075  0.033 -0.229              
stdy__TRUE. -0.266 -0.096 -0.017  0.040  0.043 -0.038  0.011 -0.089       
TRANSFUSION  0.130  0.110  0.403  0.004 -0.063 -0.169  0.065 -0.269 -0.090
ag_grp_Eldr -0.207 -0.707 -0.199 -0.077  0.025  0.316  0.082  0.034  0.026
HYDROXYUREA -0.018 -0.087 -0.014  0.028  0.046  0.001  0.013 -0.065 -0.061
            TRANSF ag_g_E
age                      
cbrt_nrst_h              
hb_advia                 
mchc_advia               
weight_kg                
gender_Male              
genotype_SS              
stdy__TRUE.              
TRANSFUSION              
ag_grp_Eldr -0.125       
HYDROXYUREA  0.037  0.085

Code

# Create a tidy summary with confidence intervals
broom.mixed::tidy(pos_model, conf.int = TRUE) %>%
  knitr::kable(caption = "PoS Model Coefficients")

PoS Model Coefficients
effect	group	term	estimate	std.error	statistic	conf.low	conf.high
fixed	NA	(Intercept)	5.8564880	0.2854857	20.5141205	5.2969463	6.4160297
fixed	NA	age	-0.1490964	0.1141643	-1.3059807	-0.3728543	0.0746615
fixed	NA	cuberoot_nearest_hbf	-0.1792507	0.0867945	-2.0652307	-0.3493649	-0.0091366
fixed	NA	hb_advia	-0.3070063	0.0817755	-3.7542586	-0.4672833	-0.1467293
fixed	NA	mchc_advia	0.2606577	0.0629046	4.1436962	0.1373669	0.3839486
fixed	NA	weight_kg	-0.0770288	0.0800192	-0.9626284	-0.2338636	0.0798060
fixed	NA	gender_Male	0.3324202	0.1481356	2.2440260	0.0420797	0.6227607
fixed	NA	genotype_SS	1.7119160	0.3125480	5.4772897	1.0993332	2.3244989
fixed	NA	steady_state_TRUE.	0.1185945	0.1539473	0.7703580	-0.1831366	0.4203257
fixed	NA	TRANSFUSION_STATUS_TRUE.	-0.9518397	0.1618431	-5.8812484	-1.2690464	-0.6346330
fixed	NA	age_group_Elderly	-0.6589895	0.3638924	-1.8109461	-1.3722056	0.0542265
fixed	NA	HYDROXYUREA_STATUS_TRUE.	-0.1797420	0.2349506	-0.7650207	-0.6402367	0.2807526
ran_pars	patient_mrn	sd__(Intercept)	0.6988567	NA	NA	NA	NA
ran_pars	Residual	sd__Observation	1.1926620	NA	NA	NA	NA

The initial models for all three rheological parameters showed good fit and reasonable diagnostic properties. However, to identify the most parsimonious models, we proceeded with feature selection using backward elimination.

Feature Selection and Model Refinement

To identify the most important predictors for each rheological parameter, we implemented a custom backward elimination procedure. This approach systematically evaluated the contribution of each predictor to model fit, removing variables that did not significantly improve the model.

Our backward elimination procedure started with the full model and iteratively removed predictors based on statistical significance and model fit criteria (AIC). This approach helped us identify the most parsimonious models while maintaining predictive performance.

Code

conflicted::conflicts_prefer(lmerTest::lmer)
# Function to perform bidirectional stepwise selection for mixed models
perform_stepwise_selection <- function(model, data, recipe) {
  # Load lmerTest package for mixed model stepwise selection
  if (!requireNamespace("lmerTest", quietly = TRUE)) {
    install.packages("lmerTest")
  }
  library(lmerTest)
  
  # Prepare the recipe and process the data
  prepped_recipe <- prep(recipe)
  processed_data <- bake(prepped_recipe, new_data = NULL)
  
  # Get the outcome variable name
  outcome_var <- all.vars(formula(model))[1]
  
  # Get predictor names (excluding outcome and ID)
  predictors <- setdiff(names(processed_data), c(outcome_var, "patient_mrn"))
  
  # Create formula for the full model
  full_formula <- as.formula(paste(outcome_var, "~", 
                                  paste(predictors, collapse = " + "), 
                                  "+ (1 | patient_mrn)"))
  
  # Print the formula for debugging
  cat("Full model formula:\n")
  print(full_formula)
  
  # Ensure patient_mrn is a factor in processed data
  processed_data$patient_mrn <- as.factor(processed_data$patient_mrn)
  
  # Handle missing values - create a complete dataset for model comparison
  # This ensures all models are fitted to the same dataset size
  model_vars <- c(outcome_var, "patient_mrn", predictors)
  complete_data <- processed_data %>%
    select(all_of(model_vars)) %>%
    na.omit()
  
  cat("Original data rows:", nrow(processed_data), "\n")
  cat("Complete data rows after removing NAs:", nrow(complete_data), "\n")
  
  # Fit the full model with lmerTest using the complete dataset
  full_model <- tryCatch({
    lmerTest::lmer(full_formula, data = complete_data, REML = FALSE)
  }, error = function(e) {
    cat("Error fitting full model:", e$message, "\n")
    cat("Checking data structure:\n")
    str(complete_data[, c(outcome_var, "patient_mrn", predictors[1:min(3, length(predictors))])])
    stop("Error in model fitting. See details above.")
  })
  
  # Create a simplified approach to stepwise selection
  # Instead of using lmerTest::step, we'll manually perform backward elimination
  
  # Start with the full model
  current_model <- full_model
  
  # Get fixed effects
  fixed_effects <- names(fixef(current_model))
  fixed_effects <- fixed_effects[fixed_effects != "(Intercept)"]
  
  # Initialize a list to store eliminated terms
  eliminated_terms <- list()
  
  # Perform backward elimination
  cat("\nPerforming manual backward elimination:\n")
  
  while(length(fixed_effects) > 0) {
    # Store current AIC
    current_aic <- AIC(current_model)
    
    # Try removing each term
    best_aic <- current_aic
    term_to_remove <- NULL
    
    for(term in fixed_effects) {
      # Create formula without this term
      remaining_terms <- fixed_effects[fixed_effects != term]
      if(length(remaining_terms) > 0) {
        new_formula <- as.formula(paste(outcome_var, "~", 
                                       paste(remaining_terms, collapse = " + "), 
                                       "+ (1 | patient_mrn)"))
      } else {
        new_formula <- as.formula(paste(outcome_var, "~ (1 | patient_mrn)"))
      }
      
      # Fit model without this term - using the same complete dataset
      new_model <- lmer(new_formula, data = complete_data, REML = FALSE)
      new_aic <- AIC(new_model)
      
      # Perform likelihood ratio test
      lrt <- anova(new_model, current_model)
      p_value <- lrt$`Pr(>Chisq)`[2]
      
      cat(sprintf("  Testing removal of %s: AIC change = %.2f, p-value = %.4f\n", 
                 term, new_aic - current_aic, p_value))
      
      # If p-value > 0.05, consider removing this term
      if(p_value > 0.05 && new_aic < best_aic) {
        best_aic <- new_aic
        term_to_remove <- term
      }
    }
    
    # If no term can be removed, break
    if(is.null(term_to_remove)) {
      cat("No more terms can be removed.\n")
      break
    }
    
    # Remove the term with the highest p-value
    cat(sprintf("Removing term: %s (AIC change: %.2f)\n", 
               term_to_remove, best_aic - current_aic))
    
    # Update fixed effects
    fixed_effects <- fixed_effects[fixed_effects != term_to_remove]
    
    # Update model
    if(length(fixed_effects) > 0) {
      new_formula <- as.formula(paste(outcome_var, "~", 
                                     paste(fixed_effects, collapse = " + "), 
                                     "+ (1 | patient_mrn)"))
    } else {
      new_formula <- as.formula(paste(outcome_var, "~ (1 | patient_mrn)"))
    }
    
    current_model <- lmer(new_formula, data = complete_data, REML = FALSE)
    
    # Store eliminated term
    eliminated_terms <- c(eliminated_terms, term_to_remove)
  }
  
  # Store the final formula from the step process
  final_formula <- formula(current_model)
  
  # Create a step_result object
  step_result <- list(
    model = current_model,
    eliminated.terms = eliminated_terms,
    formula = final_formula
  )
  
  # Refit final model with REML using the SAME formula from step_result
  # This ensures consistency between step results and final model
  final_model <- lmer(final_formula, data = complete_data, REML = TRUE)
  
  # Return the results
  return(list(
    final_model = final_model,
    step_result = step_result,
    formula = final_formula,  # Use the same formula from step_result
    processed_data = processed_data,
    complete_data = complete_data  # Also return the complete data used for modeling
  ))
}

# Perform stepwise selection for each model
ei_min_selection <- perform_stepwise_selection(ei_min_model, ei_min_data, ei_min_recipe)
ei_max_selection <- perform_stepwise_selection(ei_max_model, ei_max_data, ei_max_recipe)
pos_selection <- perform_stepwise_selection(pos_model, pos_data, pos_recipe)

# Print final model formulas
cat("Final ei_min model formula:\n")
print(ei_min_selection$formula)

cat("\nFinal ei_max model formula:\n")
print(ei_max_selection$formula)

cat("\nFinal PoS model formula:\n")
print(pos_selection$formula)

# Print step results
cat("\n\nei_min Step Results:\n")
print(ei_min_selection$step_result)

cat("\n\nei_max Step Results:\n")
print(ei_max_selection$step_result)

cat("\n\nPoS Step Results:\n")
print(pos_selection$step_result)

After feature selection, we obtained optimized models for each rheological parameter. These final models represent the most parsimonious set of predictors that explain the variation in our outcomes.

Code

# Create tidy summaries with confidence intervals
broom.mixed::tidy(ei_min_selection$final_model, conf.int = TRUE) %>%
  knitr::kable(caption = "Final ei_min Model Coefficients")

Final ei_min Model Coefficients
effect	group	term	estimate	std.error	statistic	df	p.value	conf.low	conf.high
fixed	NA	(Intercept)	0.3622735	0.0108821	33.290913	294.0873	0.0000000	0.3408569	0.3836901
fixed	NA	hb_advia	0.0190493	0.0071288	2.672175	440.1808	0.0078158	0.0050387	0.0330600
fixed	NA	mchc_advia	-0.0195507	0.0058173	-3.360769	445.4569	0.0008442	-0.0309835	-0.0081179
fixed	NA	weight_kg	0.0148995	0.0070553	2.111822	279.9484	0.0355877	0.0010114	0.0287876
fixed	NA	gender_Male	-0.0418810	0.0140585	-2.979044	269.7165	0.0031548	-0.0695594	-0.0142026
fixed	NA	TRANSFUSION_STATUS_TRUE.	0.1373834	0.0136619	10.055944	436.0968	0.0000000	0.1105320	0.1642348
fixed	NA	age_group_Elderly	0.0806495	0.0246527	3.271421	214.7001	0.0012467	0.0320571	0.1292419
ran_pars	patient_mrn	sd__(Intercept)	0.0778999	NA	NA	NA	NA	NA	NA
ran_pars	Residual	sd__Observation	0.1032040	NA	NA	NA	NA	NA	NA

Code

broom.mixed::tidy(ei_max_selection$final_model, conf.int = TRUE) %>%
  knitr::kable(caption = "Final ei_max Model Coefficients")

Final ei_max Model Coefficients
effect	group	term	estimate	std.error	statistic	df	p.value	conf.low	conf.high
fixed	NA	(Intercept)	-0.3221799	0.0017481	-184.300149	295.4007	0.0000000	-0.3256203	-0.3187396
fixed	NA	cuberoot_nearest_hbf	0.0040396	0.0011866	3.404292	339.2397	0.0007428	0.0017056	0.0063737
fixed	NA	hb_advia	0.0082785	0.0011315	7.316638	468.6690	0.0000000	0.0060551	0.0105019
fixed	NA	mchc_advia	-0.0050283	0.0008892	-5.654692	423.7150	0.0000000	-0.0067761	-0.0032804
fixed	NA	gender_Male	-0.0075947	0.0023145	-3.281297	266.1054	0.0011713	-0.0121518	-0.0030376
fixed	NA	TRANSFUSION_STATUS_TRUE.	0.0116852	0.0022847	5.114494	474.6928	0.0000005	0.0071958	0.0161746
ran_pars	patient_mrn	sd__(Intercept)	0.0146306	NA	NA	NA	NA	NA	NA
ran_pars	Residual	sd__Observation	0.0153741	NA	NA	NA	NA	NA	NA

Code

broom.mixed::tidy(pos_selection$final_model, conf.int = TRUE) %>%
  knitr::kable(caption = "Final PoS Model Coefficients")

Final PoS Model Coefficients
effect	group	term	estimate	std.error	statistic	df	p.value	conf.low	conf.high
fixed	NA	(Intercept)	5.8980672	0.2747116	21.470031	213.7324	0.0000000	5.3565762	6.4395583
fixed	NA	age	-0.1831161	0.1067833	-1.714839	233.2274	0.0877031	-0.3934991	0.0272670
fixed	NA	cuberoot_nearest_hbf	-0.1995532	0.0838417	-2.380119	322.8376	0.0178873	-0.3644982	-0.0346081
fixed	NA	hb_advia	-0.3200216	0.0805347	-3.973709	436.2582	0.0000828	-0.4783059	-0.1617373
fixed	NA	mchc_advia	0.2646799	0.0625470	4.231698	460.7020	0.0000280	0.1417672	0.3875926
fixed	NA	gender_Male	0.3255589	0.1478485	2.201977	251.0434	0.0285768	0.0343774	0.6167404
fixed	NA	genotype_SS	1.7257453	0.3102665	5.562139	243.7293	0.0000001	1.1145995	2.3368912
fixed	NA	TRANSFUSION_STATUS_TRUE.	-0.9626815	0.1585116	-6.073255	443.9817	0.0000000	-1.2742078	-0.6511552
fixed	NA	age_group_Elderly	-0.5360647	0.3434488	-1.560829	204.7671	0.1201083	-1.2132141	0.1410846
ran_pars	patient_mrn	sd__(Intercept)	0.7001978	NA	NA	NA	NA	NA	NA
ran_pars	Residual	sd__Observation	1.1904694	NA	NA	NA	NA	NA	NA

The feature selection process identified distinct sets of predictors for each rheological parameter, highlighting the complex and multifaceted nature of rheological changes in SCD. To ensure the robustness of our findings, we proceeded with bootstrap validation of these final models.

Model Validation

We validated our final models using stratified bootstrap resampling to assess their predictive performance and stability. This approach specifically addressed class imbalance in key variables, ensuring reliable inference for all patient subgroups.

Bootstrap Validation

Our stratified bootstrap approach involved resampling with replacement within strata defined by age group and transfusion status. This method preserved the hierarchical structure of the data and provided robust confidence intervals for model coefficients.

Code

# A more robust bootstrap validation function using bootMer from lme4
perform_bootstrap_validation <- function(model_selection, n_boot = 500) {
  # Get the final model from the model selection results
  final_model <- model_selection$final_model
  
  # Get the outcome variable name
  outcome_var <- all.vars(formula(final_model))[1]
  
  # Print debugging information
  cat("Bootstrapping for outcome:", outcome_var, "\n")
  cat("Formula being used:\n")
  print(formula(final_model))
  
  # Use bootMer for parametric bootstrapping of mixed models
  # This handles missing data appropriately and is specifically designed for lmer models
  cat("Performing bootstrap with bootMer...\n")
  boot_results <- lme4::bootMer(
    final_model,
    FUN = function(x) {
      # Extract fixed effects
      fixed_effects <- fixef(x)
      
      # Extract random effects variance components
      rand_vars <- as.data.frame(VarCorr(x))
      rand_sd <- rand_vars$sdcor[1]  # SD of random intercept
      residual_sd <- rand_vars$sdcor[2]  # Residual SD
      
      # Calculate performance metrics
      rmse <- sqrt(mean(residuals(x)^2))
      rsq <- cor(fitted(x), getME(x, "y"))^2
      
      # Return all parameters of interest
      c(fixed_effects, rand_sd = rand_sd, residual_sd = residual_sd, 
        rmse = rmse, rsq = rsq)
    },
    nsim = n_boot,
    parallel = "multicore",  # Use parallel processing if available
    ncpus = parallel::detectCores() - 1,  # Use all but one core
    seed = 123  # For reproducibility
  )
  
  # Extract the bootstrap results
  boot_data <- as.data.frame(boot_results$t)
  
  # Get the parameter names
  param_names <- colnames(boot_data)
  fixed_effect_names <- names(fixef(final_model))
  
  # Calculate confidence intervals and summaries for fixed effects
  fixed_effects_summary <- data.frame(
    term = fixed_effect_names,
    mean = colMeans(boot_data[, fixed_effect_names], na.rm = TRUE),
    median = apply(boot_data[, fixed_effect_names], 2, median, na.rm = TRUE),
    sd = apply(boot_data[, fixed_effect_names], 2, sd, na.rm = TRUE),
    conf.low = apply(boot_data[, fixed_effect_names], 2, quantile, probs = 0.025, na.rm = TRUE),
    conf.high = apply(boot_data[, fixed_effect_names], 2, quantile, probs = 0.975, na.rm = TRUE)
  )
  
  # Calculate confidence intervals and summaries for random effects
  random_effects_summary <- data.frame(
    component = c("Random Intercept SD", "Residual SD"),
    mean = c(mean(boot_data$rand_sd, na.rm = TRUE), mean(boot_data$residual_sd, na.rm = TRUE)),
    median = c(median(boot_data$rand_sd, na.rm = TRUE), median(boot_data$residual_sd, na.rm = TRUE)),
    sd = c(sd(boot_data$rand_sd, na.rm = TRUE), sd(boot_data$residual_sd, na.rm = TRUE)),
    conf.low = c(
      quantile(boot_data$rand_sd, 0.025, na.rm = TRUE),
      quantile(boot_data$residual_sd, 0.025, na.rm = TRUE)
    ),
    conf.high = c(
      quantile(boot_data$rand_sd, 0.975, na.rm = TRUE),
      quantile(boot_data$residual_sd, 0.975, na.rm = TRUE)
    )
  )
  
  # Calculate confidence intervals and summaries for performance metrics
  metrics_summary <- data.frame(
    .metric = c("rmse", "rsq"),
    mean = c(mean(boot_data$rmse, na.rm = TRUE), mean(boot_data$rsq, na.rm = TRUE)),
    median = c(median(boot_data$rmse, na.rm = TRUE), median(boot_data$rsq, na.rm = TRUE)),
    sd = c(sd(boot_data$rmse, na.rm = TRUE), sd(boot_data$rsq, na.rm = TRUE)),
    conf.low = c(
      quantile(boot_data$rmse, 0.025, na.rm = TRUE),
      quantile(boot_data$rsq, 0.025, na.rm = TRUE)
    ),
    conf.high = c(
      quantile(boot_data$rmse, 0.975, na.rm = TRUE),
      quantile(boot_data$rsq, 0.975, na.rm = TRUE)
    )
  )
  
  # Prepare data for plotting
  fixed_effects_data <- boot_data[, fixed_effect_names] %>%
    as.data.frame() %>%
    pivot_longer(cols = everything(), names_to = "term", values_to = "estimate")
  
  metrics_data <- data.frame(
    id = rep(1:nrow(boot_data), 2),
    .metric = c(rep("rmse", nrow(boot_data)), rep("rsq", nrow(boot_data))),
    .estimator = "standard",
    .estimate = c(boot_data$rmse, boot_data$rsq)
  )
  
  # Return all summaries and raw bootstrap data for plotting
  return(list(
    coefficients = fixed_effects_summary,
    random_effects = random_effects_summary,
    metrics = metrics_summary,
    original_model = final_model,
    bootstrap_samples = n_boot,
    successful_bootstraps = sum(!is.na(boot_data$rmse)),  # Count non-NA results
    raw_coef_data = fixed_effects_data,
    raw_metrics = metrics_data
  ))
}

# New stratified bootstrap validation function that accounts for class imbalance
perform_stratified_bootstrap_validation <- function(model_selection, n_boot = 500) {
  # Get the final model and data
  final_model <- model_selection$final_model
  complete_data <- model_selection$complete_data
  
  # Get the formula directly from the model
  model_formula <- formula(final_model)
  
  # Get the outcome variable name
  outcome_var <- all.vars(model_formula)[1]
  
  # Print debugging information
  cat("Performing stratified bootstrap for outcome:", outcome_var, "\n")
  cat("Formula being used:\n")
  print(model_formula)
  
  # Define stratification variables with focus on imbalanced classes
  # Look for transformed column names that match our stratification variables
  strata_vars <- c()
  
  # Check for age_group (transformed to age_group_Elderly)
  if ("age_group_Elderly" %in% names(complete_data)) {
    strata_vars <- c(strata_vars, "age_group_Elderly")
    cat("Found transformed age_group variable: age_group_Elderly\n")
  }
  
  # Check for TRANSFUSION_STATUS (transformed to TRANSFUSION_STATUS_TRUE.)
  if ("TRANSFUSION_STATUS_TRUE." %in% names(complete_data)) {
    strata_vars <- c(strata_vars, "TRANSFUSION_STATUS_TRUE.")
    cat("Found transformed TRANSFUSION_STATUS variable: TRANSFUSION_STATUS_TRUE.\n")
  }
  
  # If no stratification variables found, warn the user
  if (length(strata_vars) == 0) {
    warning("No stratification variables found in the data. Will perform regular bootstrap.")
  }
  
  # Check class imbalance in stratification variables
  cat("\nClass distribution in stratification variables:\n")
  for (var in strata_vars) {
    if (var %in% names(complete_data)) {
      cat(var, ":\n")
      print(table(complete_data[[var]]))
    }
  }
  
  # Function to perform stratified sampling
  stratified_sample <- function(data) {
    # First, sample patients within each stratum
    # Create a temporary data frame with unique patients and their strata
    patient_data <- data %>%
      group_by(patient_mrn) %>%
      slice(1) %>%
      ungroup()
    
    # Create combined strata if all stratification variables exist
    if (length(strata_vars) > 0) {
      # Create a stratum column using interaction() instead of unite()
      patient_data <- patient_data %>%
        mutate(stratum = interaction(across(all_of(strata_vars)), sep = "_"))
      
      # Sample patients within each stratum
      patient_sample <- patient_data %>%
        group_by(stratum) %>%
        # Sample with replacement to maintain stratum size
        slice_sample(prop = 1, replace = TRUE) %>%
        ungroup() %>%
        select(patient_mrn)
    } else {
      # If no stratification variables, just sample patients
      patient_sample <- patient_data %>%
        slice_sample(prop = 1, replace = TRUE) %>%
        select(patient_mrn)
    }
    
    # Then get all observations for sampled patients
    # Use semi_join instead of right_join to avoid many-to-many relationship warnings
    bootstrap_sample <- data %>%
      semi_join(patient_sample, by = "patient_mrn")
    
    return(bootstrap_sample)
  }
  
  # Initialize storage for bootstrap results
  boot_results <- vector("list", n_boot)
  
  # Store the last bootstrap sample for diagnostics
  last_boot_sample <- NULL
  
  # Perform bootstrap iterations
  cat("\nStarting stratified bootstrap with", n_boot, "iterations...\n")
  for (i in 1:n_boot) {
    tryCatch({
      # Generate stratified bootstrap sample
      boot_sample <- stratified_sample(complete_data)
      
      # Store the last sample for diagnostics
      if (i == n_boot) {
        last_boot_sample <- boot_sample
      }
      
      # Refit model on bootstrap sample using the same formula as the original model
      boot_model <- update(final_model, formula = model_formula, data = boot_sample)
      
      # Extract results
      fixed_effects <- fixef(boot_model)
      rand_vars <- as.data.frame(VarCorr(boot_model))
      rand_sd <- rand_vars$sdcor[1]  # SD of random intercept
      residual_sd <- rand_vars$sdcor[2]  # Residual SD
      
      # Calculate performance metrics
      rmse <- sqrt(mean(residuals(boot_model)^2))
      rsq <- cor(fitted(boot_model), getME(boot_model, "y"))^2
      
      # Store results
      boot_results[[i]] <- c(fixed_effects, 
                            rand_sd = rand_sd, 
                            residual_sd = residual_sd,
                            rmse = rmse, 
                            rsq = rsq)
      
      # Print progress
      if (i %% 50 == 0) cat("Completed", i, "of", n_boot, "bootstrap iterations\n")
      
    }, error = function(e) {
      warning("Error in bootstrap iteration ", i, ": ", e$message)
      return(NULL)
    })
  }
  
  # Remove NULL results (failed iterations)
  valid_results <- boot_results[!sapply(boot_results, is.null)]
  
  # Check if we have enough valid results
  if (length(valid_results) < n_boot * 0.5) {
    warning("Less than 50% of bootstrap iterations were successful. Results may be unreliable.")
  }
  
  # Convert results to data frame
  boot_data <- do.call(rbind, valid_results) %>%
    as.data.frame()
  
  # Get the parameter names
  param_names <- colnames(boot_data)
  fixed_effect_names <- names(fixef(final_model))
  
  # Calculate confidence intervals and summaries for fixed effects
  fixed_effects_summary <- data.frame(
    term = fixed_effect_names,
    mean = colMeans(boot_data[, fixed_effect_names], na.rm = TRUE),
    median = apply(boot_data[, fixed_effect_names], 2, median, na.rm = TRUE),
    sd = apply(boot_data[, fixed_effect_names], 2, sd, na.rm = TRUE),
    conf.low = apply(boot_data[, fixed_effect_names], 2, quantile, probs = 0.025, na.rm = TRUE),
    conf.high = apply(boot_data[, fixed_effect_names], 2, quantile, probs = 0.975, na.rm = TRUE)
  )
  
  # Calculate confidence intervals and summaries for random effects
  random_effects_summary <- data.frame(
    component = c("Random Intercept SD", "Residual SD"),
    mean = c(mean(boot_data$rand_sd, na.rm = TRUE), mean(boot_data$residual_sd, na.rm = TRUE)),
    median = c(median(boot_data$rand_sd, na.rm = TRUE), median(boot_data$residual_sd, na.rm = TRUE)),
    sd = c(sd(boot_data$rand_sd, na.rm = TRUE), sd(boot_data$residual_sd, na.rm = TRUE)),
    conf.low = c(
      quantile(boot_data$rand_sd, 0.025, na.rm = TRUE),
      quantile(boot_data$residual_sd, 0.025, na.rm = TRUE)
    ),
    conf.high = c(
      quantile(boot_data$rand_sd, 0.975, na.rm = TRUE),
      quantile(boot_data$residual_sd, 0.975, na.rm = TRUE)
    )
  )
  
  # Calculate confidence intervals and summaries for performance metrics
  metrics_summary <- data.frame(
    .metric = c("rmse", "rsq"),
    mean = c(mean(boot_data$rmse, na.rm = TRUE), mean(boot_data$rsq, na.rm = TRUE)),
    median = c(median(boot_data$rmse, na.rm = TRUE), median(boot_data$rsq, na.rm = TRUE)),
    sd = c(sd(boot_data$rmse, na.rm = TRUE), sd(boot_data$rsq, na.rm = TRUE)),
    conf.low = c(
      quantile(boot_data$rmse, 0.025, na.rm = TRUE),
      quantile(boot_data$rsq, 0.025, na.rm = TRUE)
    ),
    conf.high = c(
      quantile(boot_data$rmse, 0.975, na.rm = TRUE),
      quantile(boot_data$rsq, 0.975, na.rm = TRUE)
    )
  )
  
  # Prepare data for plotting
  fixed_effects_data <- boot_data[, fixed_effect_names] %>%
    as.data.frame() %>%
    pivot_longer(cols = everything(), names_to = "term", values_to = "estimate")
  
  metrics_data <- data.frame(
    id = rep(1:nrow(boot_data), 2),
    .metric = c(rep("rmse", nrow(boot_data)), rep("rsq", nrow(boot_data))),
    .estimator = "standard",
    .estimate = c(boot_data$rmse, boot_data$rsq)
  )
  
  # Check strata proportions in original vs bootstrap sample if we have a valid last sample
  if (length(strata_vars) > 0 && !is.null(last_boot_sample)) {
    # Create a function to calculate strata proportions
    get_strata_props <- function(data) {
      # Create a stratum column using interaction() instead of unite()
      data <- data %>%
        mutate(stratum = interaction(across(all_of(strata_vars)), sep = "_"))
      
      # Count unique patients per stratum
      data %>%
        group_by(patient_mrn) %>%
        slice(1) %>%
        ungroup() %>%
        count(stratum) %>%
        mutate(prop = n / sum(n))
    }
    
    # Calculate proportions
    orig_props <- get_strata_props(complete_data)
    boot_props <- get_strata_props(last_boot_sample)
    
    # Compare proportions
    cat("\nStrata proportions comparison (original vs. last bootstrap sample):\n")
    strata_comparison <- full_join(
      orig_props %>% rename(orig_n = n, orig_prop = prop),
      boot_props %>% rename(boot_n = n, boot_prop = prop),
      by = "stratum"
    ) %>%
      mutate(
        prop_diff = boot_prop - ifelse(is.na(orig_prop), 0, orig_prop),
        orig_prop = ifelse(is.na(orig_prop), 0, orig_prop),
        boot_prop = ifelse(is.na(boot_prop), 0, boot_prop)
      )
    
    print(strata_comparison)
  }
  
  # Return all summaries and raw bootstrap data for plotting
  return(list(
    coefficients = fixed_effects_summary,
    random_effects = random_effects_summary,
    metrics = metrics_summary,
    original_model = final_model,
    bootstrap_samples = n_boot,
    successful_bootstraps = length(valid_results),
    raw_coef_data = fixed_effects_data,
    raw_metrics = metrics_data
  ))
}

# Run the improved bootstrap for all three models
n_boot_samples <- 500  # Can be reduced for testing

# Load parallel package for faster bootstrapping
if (!requireNamespace("parallel", quietly = TRUE)) {
  install.packages("parallel")
}
library(parallel)

# Run bootstrap for ei_min model
cat("Performing bootstrap validation for ei_min model...\n")

Performing bootstrap validation for ei_min model...

Code

ei_min_boot <- perform_stratified_bootstrap_validation(ei_min_selection, n_boot = n_boot_samples)

Performing stratified bootstrap for outcome: sqrt_ei_min 
Formula being used:
sqrt_ei_min ~ hb_advia + mchc_advia + weight_kg + gender_Male + 
    TRANSFUSION_STATUS_TRUE. + age_group_Elderly + (1 | patient_mrn)
<environment: 0x11c994c50>
Found transformed age_group variable: age_group_Elderly
Found transformed TRANSFUSION_STATUS variable: TRANSFUSION_STATUS_TRUE.

Class distribution in stratification variables:
age_group_Elderly :

  0   1 
424  47 
TRANSFUSION_STATUS_TRUE. :

  0   1 
329 142 

Starting stratified bootstrap with 500 iterations...
Completed 50 of 500 bootstrap iterations
Completed 100 of 500 bootstrap iterations
Completed 150 of 500 bootstrap iterations
Completed 200 of 500 bootstrap iterations
Completed 250 of 500 bootstrap iterations
Completed 300 of 500 bootstrap iterations
Completed 350 of 500 bootstrap iterations
Completed 400 of 500 bootstrap iterations
Completed 450 of 500 bootstrap iterations
Completed 500 of 500 bootstrap iterations

Strata proportions comparison (original vs. last bootstrap sample):
# A tibble: 4 × 6
  stratum orig_n orig_prop boot_n boot_prop prop_diff
  <fct>    <int>     <dbl>  <int>     <dbl>     <dbl>
1 0_0        172    0.619     111    0.617   -0.00204
2 1_0         15    0.0540     12    0.0667   0.0127 
3 0_1         84    0.302      53    0.294   -0.00771
4 1_1          7    0.0252      4    0.0222  -0.00296

Code

cat("\nSuccess rate for ei_min:", ei_min_boot$successful_bootstraps, "out of", ei_min_boot$bootstrap_samples, "\n")


Success rate for ei_min: 500 out of 500

Code

# Run bootstrap for ei_max model
cat("\nPerforming bootstrap validation for ei_max model...\n")


Performing bootstrap validation for ei_max model...

Code

ei_max_boot <- perform_stratified_bootstrap_validation(ei_max_selection, n_boot = n_boot_samples)

Performing stratified bootstrap for outcome: ei_max_lorrca 
Formula being used:
ei_max_lorrca ~ cuberoot_nearest_hbf + hb_advia + mchc_advia + 
    gender_Male + TRANSFUSION_STATUS_TRUE. + (1 | patient_mrn)
<environment: 0x138639b80>
Found transformed age_group variable: age_group_Elderly
Found transformed TRANSFUSION_STATUS variable: TRANSFUSION_STATUS_TRUE.

Class distribution in stratification variables:
age_group_Elderly :

  0   1 
439  47 
TRANSFUSION_STATUS_TRUE. :

  0   1 
344 142 

Starting stratified bootstrap with 500 iterations...
Completed 50 of 500 bootstrap iterations
Completed 100 of 500 bootstrap iterations
Completed 150 of 500 bootstrap iterations
Completed 200 of 500 bootstrap iterations
Completed 250 of 500 bootstrap iterations
Completed 300 of 500 bootstrap iterations
Completed 350 of 500 bootstrap iterations
Completed 400 of 500 bootstrap iterations
Completed 450 of 500 bootstrap iterations
Completed 500 of 500 bootstrap iterations

Strata proportions comparison (original vs. last bootstrap sample):
# A tibble: 4 × 6
  stratum orig_n orig_prop boot_n boot_prop prop_diff
  <fct>    <int>     <dbl>  <int>     <dbl>     <dbl>
1 0_0        177    0.625     114    0.623   -0.00249
2 1_0         15    0.0530      9    0.0492  -0.00382
3 0_1         84    0.297      55    0.301    0.00373
4 1_1          7    0.0247      5    0.0273   0.00259

Code

cat("\nSuccess rate for ei_max:", ei_max_boot$successful_bootstraps, "out of", ei_max_boot$bootstrap_samples, "\n")


Success rate for ei_max: 500 out of 500

Code

# Run bootstrap for PoS model
cat("\nPerforming bootstrap validation for PoS model...\n")


Performing bootstrap validation for PoS model...

Code

pos_boot <- perform_stratified_bootstrap_validation(pos_selection, n_boot = n_boot_samples)

Performing stratified bootstrap for outcome: pos_lorrca 
Formula being used:
pos_lorrca ~ age + cuberoot_nearest_hbf + hb_advia + mchc_advia + 
    gender_Male + genotype_SS + TRANSFUSION_STATUS_TRUE. + age_group_Elderly + 
    (1 | patient_mrn)
<environment: 0x12c61bd88>
Found transformed age_group variable: age_group_Elderly
Found transformed TRANSFUSION_STATUS variable: TRANSFUSION_STATUS_TRUE.

Class distribution in stratification variables:
age_group_Elderly :

  0   1 
436  47 
TRANSFUSION_STATUS_TRUE. :

  0   1 
344 139 

Starting stratified bootstrap with 500 iterations...
Completed 50 of 500 bootstrap iterations
Completed 100 of 500 bootstrap iterations
Completed 150 of 500 bootstrap iterations
Completed 200 of 500 bootstrap iterations
Completed 250 of 500 bootstrap iterations
Completed 300 of 500 bootstrap iterations
Completed 350 of 500 bootstrap iterations
Completed 400 of 500 bootstrap iterations
Completed 450 of 500 bootstrap iterations
Completed 500 of 500 bootstrap iterations

Strata proportions comparison (original vs. last bootstrap sample):
# A tibble: 4 × 6
  stratum orig_n orig_prop boot_n boot_prop prop_diff
  <fct>    <int>     <dbl>  <int>     <dbl>     <dbl>
1 0_0        177    0.625     117    0.646    0.0210 
2 1_0         15    0.0530      8    0.0442  -0.00880
3 0_1         84    0.297      53    0.293   -0.00400
4 1_1          7    0.0247      3    0.0166  -0.00816

Code

cat("\nSuccess rate for PoS:", pos_boot$successful_bootstraps, "out of", pos_boot$bootstrap_samples, "\n")


Success rate for PoS: 500 out of 500

Create distribution plots for key variables in each model

Code

# Function to create distribution plots for a dataset
plot_variable_distributions <- function(data, title) {
  # Select numeric variables excluding ID
  numeric_vars <- data %>%
    select(where(is.numeric), -matches("patient_mrn")) %>%
    names()
  
  # Create plots for each variable
  plots <- map(numeric_vars, function(var) {
    p <- data %>%
      ggplot(aes_string(x = var)) +
      geom_histogram(aes(y = ..density..), bins = 30, fill = "steelblue", alpha = 0.7) +
      geom_density(color = "red", linewidth = 1) +
      labs(title = var) +
      theme_cowplot() +
      theme(plot.title = element_text(size = 10))
    return(p)
  })
  
  # Arrange plots in a grid
  grid_title <- textGrob(title, gp = gpar(fontsize = 14, fontface = "bold"))
  grid.arrange(grobs = plots, ncol = 3, 
               top = grid_title)
}

# Create distribution plots for each model's data
cat("## Distribution Plots for Model Variables\n\n")

## Distribution Plots for Model Variables

Code

cat("### ei_min Model Variables\n")

### ei_min Model Variables

Code

plot_variable_distributions(ei_min_data, "Distribution of Variables in ei_min Model")

Code

cat("### ei_max Model Variables\n")

### ei_max Model Variables

Code

plot_variable_distributions(ei_max_data, "Distribution of Variables in ei_max Model")

Code

cat("### PoS Model Variables\n")

### PoS Model Variables

Code

plot_variable_distributions(pos_data, "Distribution of Variables in PoS Model")

Comparing Variable Distributions Before and After Preprocessing

To better understand how our preprocessing steps affect the data, we’ll visualize the distributions of model variables both before and after applying the recipe. This comparison helps illustrate the effects of normalization, transformations, etc. on the data distributions.

Code

# Function to compare distributions before and after preprocessing
compare_preprocessing_distributions <- function(data, recipe, model_name) {
  # Get raw data (before preprocessing)
  raw_data <- data
  
  # Get processed data (after preprocessing)
  prepped_recipe <- prep(recipe)
  processed_data <- bake(prepped_recipe, new_data = NULL)
  
  # Get the outcome variable name from the recipe
  # Instead of using formula() on an unprepped recipe, extract it from the recipe's vars
  outcome_var <- recipe$var_info %>%
    filter(role == "outcome") %>%
    pull(variable)
  
  # Select numeric variables from raw data, excluding ID
  raw_numeric_vars <- raw_data %>%
    select(where(is.numeric), -matches("patient_mrn")) %>%
    names()
  
  # Select numeric variables from processed data, excluding ID
  processed_numeric_vars <- processed_data %>%
    select(where(is.numeric), -matches("patient_mrn")) %>%
    names()
  
  # Create a list to store plots
  all_plots <- list()
  
  # Create plots for raw variables
  for (var in raw_numeric_vars) {
    p_raw <- raw_data %>%
      ggplot(aes_string(x = var)) +
      geom_histogram(aes(y = ..density..), bins = 30, fill = "steelblue", alpha = 0.7) +
      geom_density(color = "red", linewidth = 1) +
      labs(title = paste("Raw:", var)) +
      theme_cowplot() +
      theme(plot.title = element_text(size = 10))
    
    all_plots[[paste0("raw_", var)]] <- p_raw
  }
  
  # Create plots for processed variables
  for (var in processed_numeric_vars) {
    # Skip variables that are just ID columns or other non-relevant variables
    if (grepl("patient_mrn|id|ID", var)) next
    
    p_processed <- processed_data %>%
      ggplot(aes_string(x = var)) +
      geom_histogram(aes(y = ..density..), bins = 30, fill = "lightgreen", alpha = 0.7) +
      geom_density(color = "darkgreen", linewidth = 1) +
      labs(title = paste("Processed:", var)) +
      theme_cowplot() +
      theme(plot.title = element_text(size = 10))
    
    all_plots[[paste0("processed_", var)]] <- p_processed
  }
  
  # Create a grid of plots
  grid_title <- textGrob(paste(model_name, "Variables: Before vs. After Preprocessing"), 
                        gp = gpar(fontsize = 14, fontface = "bold"))
  
  # Arrange plots in a grid, with raw and processed side by side where possible
  plot_pairs <- list()
  
  # First, handle the outcome variable which has a special transformation
  if (length(outcome_var) > 0) {
    # Find the processed version of the outcome (might have a different name due to transformation)
    processed_outcome <- processed_numeric_vars[grepl(outcome_var, processed_numeric_vars)][1]
    
    if (!is.na(processed_outcome) && paste0("raw_", outcome_var) %in% names(all_plots)) {
      plot_pairs[[1]] <- all_plots[[paste0("raw_", outcome_var)]]
      plot_pairs[[2]] <- all_plots[[paste0("processed_", processed_outcome)]]
    }
  }
  
  # Then handle the predictors
  for (var in setdiff(raw_numeric_vars, outcome_var)) {
    # Find matching processed variable (might have suffix from dummy coding)
    matching_processed <- processed_numeric_vars[grepl(paste0("^", var, "$|^", var, "_"), processed_numeric_vars)]
    
    if (length(matching_processed) > 0) {
      for (proc_var in matching_processed) {
        if (paste0("processed_", proc_var) %in% names(all_plots)) {
          plot_pairs[[length(plot_pairs) + 1]] <- all_plots[[paste0("raw_", var)]]
          plot_pairs[[length(plot_pairs) + 1]] <- all_plots[[paste0("processed_", proc_var)]]
        }
      }
    }
  }
  
  # Arrange in a grid with 2 columns (raw and processed side by side)
  grid.arrange(grobs = plot_pairs, ncol = 2, 
               top = grid_title)
  
  # Create a summary table of preprocessing effects
  cat("\n### Summary of Preprocessing Effects for", model_name, "Model\n\n")
  
  # Summarize raw data
  raw_summary <- raw_data %>%
    select(where(is.numeric), -matches("patient_mrn")) %>%
    pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
    group_by(Variable) %>%
    summarise(
      Mean_Raw = mean(Value, na.rm = TRUE),
      SD_Raw = sd(Value, na.rm = TRUE),
      Min_Raw = min(Value, na.rm = TRUE),
      Max_Raw = max(Value, na.rm = TRUE),
      Skewness_Raw = moments::skewness(Value, na.rm = TRUE)
    )
  
  # Summarize processed data
  processed_summary <- processed_data %>%
    select(where(is.numeric), -matches("patient_mrn")) %>%
    pivot_longer(everything(), names_to = "Variable", values_to = "Value") %>%
    group_by(Variable) %>%
    summarise(
      Mean_Processed = mean(Value, na.rm = TRUE),
      SD_Processed = sd(Value, na.rm = TRUE),
      Min_Processed = min(Value, na.rm = TRUE),
      Max_Processed = max(Value, na.rm = TRUE),
      Skewness_Processed = moments::skewness(Value, na.rm = TRUE)
    )
  
  # Display the summary table
  knitr::kable(raw_summary, caption = "Raw Data Summary") %>%
    kable_styling(full_width = FALSE)
  
  knitr::kable(processed_summary, caption = "Processed Data Summary") %>%
    kable_styling(full_width = FALSE)
}

# Add the moments package for skewness calculation
if (!requireNamespace("moments", quietly = TRUE)) {
  install.packages("moments")
}
library(moments)

# Compare distributions for each model
cat("## Comparing Distributions Before and After Preprocessing\n\n")

## Comparing Distributions Before and After Preprocessing

Code

cat("### ei_min Model\n")

### ei_min Model

Code

compare_preprocessing_distributions(ei_min_data, ei_min_recipe, "ei_min")


### Summary of Preprocessing Effects for ei_min Model

Processed Data Summary
Variable	Mean_Processed	SD_Processed	Min_Processed	Max_Processed	Skewness_Processed
HYDROXYUREA_STATUS_TRUE.	0.0822981	0.2750320	0.000000	1.0000000	3.0398391
TRANSFUSION_STATUS_TRUE.	0.2251553	0.4180092	0.000000	1.0000000	1.3160397
age	0.0000000	1.0000000	-1.289090	2.9859594	0.7918426
age_group_Elderly	0.1226708	0.3283140	0.000000	1.0000000	2.3003743
cuberoot_nearest_hbf	0.0000000	1.0000000	-2.049196	2.1511281	-0.0583965
gender_Male	0.4440994	0.4972515	0.000000	1.0000000	0.2250132
genotype_SS	0.7282609	0.4452019	0.000000	1.0000000	-1.0262233
hb_advia	0.0000000	1.0000000	-4.456784	6.4512704	0.5158834
mchc_advia	0.0000000	1.0000000	-13.632208	4.1011763	-3.8584337
sqrt_ei_min	0.3972592	0.1485672	0.000000	0.7596052	0.1402861
steady_state_TRUE.	0.5993789	0.4904052	0.000000	1.0000000	-0.4056080
weight_kg	0.0000000	1.0000000	-2.505276	5.2647416	1.1062518

Code

cat("### ei_max Model\n")

### ei_max Model

Code

compare_preprocessing_distributions(ei_max_data, ei_max_recipe, "ei_max")


### Summary of Preprocessing Effects for ei_max Model

Processed Data Summary
Variable	Mean_Processed	SD_Processed	Min_Processed	Max_Processed	Skewness_Processed
HYDROXYUREA_STATUS_TRUE.	0.0822981	0.2750320	0.0000000	1.000000	3.0398391
TRANSFUSION_STATUS_TRUE.	0.2251553	0.4180092	0.0000000	1.000000	1.3160397
age	0.0000000	1.0000000	-1.2890898	2.985959	0.7918426
age_group_Elderly	0.1226708	0.3283140	0.0000000	1.000000	2.3003743
cuberoot_nearest_hbf	0.0000000	1.0000000	-2.0491960	2.151128	-0.0583965
ei_max_lorrca	-0.3244557	0.0232839	-0.3771178	-0.268318	-0.1487668
gender_Male	0.4440994	0.4972515	0.0000000	1.000000	0.2250132
genotype_SS	0.7282609	0.4452019	0.0000000	1.000000	-1.0262233
hb_advia	0.0000000	1.0000000	-4.4567837	6.451270	0.5158834
mchc_advia	0.0000000	1.0000000	-13.6322077	4.101176	-3.8584337
steady_state_TRUE.	0.5993789	0.4904052	0.0000000	1.000000	-0.4056080
weight_kg	0.0000000	1.0000000	-2.5052758	5.264742	1.1062518

Code

cat("### PoS Model\n")

### PoS Model

Code

compare_preprocessing_distributions(pos_data, pos_recipe, "PoS")


### Summary of Preprocessing Effects for PoS Model

Processed Data Summary
Variable	Mean_Processed	SD_Processed	Min_Processed	Max_Processed	Skewness_Processed
HYDROXYUREA_STATUS_TRUE.	0.0822981	0.2750320	0.000000	1.000000	3.0398391
TRANSFUSION_STATUS_TRUE.	0.2251553	0.4180092	0.000000	1.000000	1.3160397
age	0.0000000	1.0000000	-1.289090	2.985959	0.7918426
age_group_Elderly	0.1226708	0.3283140	0.000000	1.000000	2.3003743
cuberoot_nearest_hbf	0.0000000	1.0000000	-2.049196	2.151128	-0.0583965
gender_Male	0.4440994	0.4972515	0.000000	1.000000	0.2250132
genotype_SS	0.7282609	0.4452019	0.000000	1.000000	-1.0262233
hb_advia	0.0000000	1.0000000	-4.456784	6.451270	0.5158834
mchc_advia	0.0000000	1.0000000	-13.632208	4.101176	-3.8584337
pos_lorrca	7.0123766	1.7054024	2.326418	13.318329	0.1415283
steady_state_TRUE.	0.5993789	0.4904052	0.000000	1.000000	-0.4056080
weight_kg	0.0000000	1.0000000	-2.505276	5.264742	1.1062518

Create bootstrap distribution plots for coefficients

Code

# Function to create coefficient distribution plots with CIs
plot_bootstrap_distributions <- function(boot_data, model_name) {
  # Get the raw coefficient data
  coef_data <- boot_data$raw_coef_data
  
  # Get the summary data with CIs
  coef_summary <- boot_data$coefficients
  
  # Create a plot for each coefficient
  coef_terms <- unique(coef_data$term)
  
  plots <- map(coef_terms, function(term) {
    # Filter data for this term
    term_data <- coef_data %>% filter(term == !!term)
    term_summary <- coef_summary %>% filter(term == !!term)
    
    # Create the plot
    p <- ggplot(term_data, aes(x = estimate)) +
      geom_histogram(aes(y = ..density..), bins = 30, fill = "steelblue", alpha = 0.7) +
      geom_density(color = "red", linewidth = 1) +
      # Add vertical lines for mean and CI
      geom_vline(xintercept = term_summary$mean, color = "black", linewidth = 1) +
      geom_vline(xintercept = term_summary$conf.low, color = "black", 
                linetype = "dashed", linewidth = 0.8) +
      geom_vline(xintercept = term_summary$conf.high, color = "black", 
                linetype = "dashed", linewidth = 0.8) +
      # Add a vertical line at zero for reference
      geom_vline(xintercept = 0, color = "darkgreen", linetype = "dotted", linewidth = 0.8) +
      # Add labels
      labs(
        title = term,
        subtitle = sprintf("Mean: %.3f [%.3f, %.3f]", 
                          term_summary$mean, term_summary$conf.low, term_summary$conf.high)
      ) +
      theme_cowplot() +
      theme(
        plot.title = element_text(size = 10),
        plot.subtitle = element_text(size = 8)
      )
    
    return(p)
  })
  
  # Arrange plots in a grid
  grid_title <- textGrob(paste("Bootstrap Distributions for", model_name, "Coefficients"), 
                        gp = gpar(fontsize = 14, fontface = "bold"))
  grid.arrange(grobs = plots, ncol = 3, 
               top = grid_title)
}

# Create bootstrap distribution plots for model metrics
plot_metric_distributions <- function(boot_data, model_name) {
  # Get the raw metrics data
  metrics_data <- boot_data$raw_metrics
  
  # Get the summary data with CIs
  metrics_summary <- boot_data$metrics
  
  # Create plots for RMSE and R-squared
  rmse_data <- metrics_data %>% filter(.metric == "rmse")
  rsq_data <- metrics_data %>% filter(.metric == "rsq")
  
  rmse_summary <- metrics_summary %>% filter(.metric == "rmse")
  rsq_summary <- metrics_summary %>% filter(.metric == "rsq")
  
  # RMSE plot
  p1 <- ggplot(rmse_data, aes(x = .estimate)) +
    geom_histogram(aes(y = ..density..), bins = 30, fill = "steelblue", alpha = 0.7) +
    geom_density(color = "red", linewidth = 1) +
    geom_vline(xintercept = rmse_summary$mean, color = "black", linewidth = 1) +
    geom_vline(xintercept = rmse_summary$conf.low, color = "black", 
              linetype = "dashed", linewidth = 0.8) +
    geom_vline(xintercept = rmse_summary$conf.high, color = "black", 
              linetype = "dashed", linewidth = 0.8) +
    labs(
      title = "RMSE Distribution",
      subtitle = sprintf("Mean: %.3f [%.3f, %.3f]", 
                        rmse_summary$mean, rmse_summary$conf.low, rmse_summary$conf.high),
      x = "RMSE",
      y = "Density"
    ) +
    theme_cowplot()
  
  # R-squared plot
  p2 <- ggplot(rsq_data, aes(x = .estimate)) +
    geom_histogram(aes(y = ..density..), bins = 30, fill = "steelblue", alpha = 0.7) +
    geom_density(color = "red", linewidth = 1) +
    geom_vline(xintercept = rsq_summary$mean, color = "black", linewidth = 1) +
    geom_vline(xintercept = rsq_summary$conf.low, color = "black", 
              linetype = "dashed", linewidth = 0.8) +
    geom_vline(xintercept = rsq_summary$conf.high, color = "black", 
              linetype = "dashed", linewidth = 0.8) +
    labs(
      title = "R-squared Distribution",
      subtitle = sprintf("Mean: %.3f [%.3f, %.3f]", 
                        rsq_summary$mean, rsq_summary$conf.low, rsq_summary$conf.high),
      x = "R-squared",
      y = "Density"
    ) +
    theme_cowplot()
  
  # Arrange plots
  grid_title <- textGrob(paste("Bootstrap Distributions for", model_name, "Performance Metrics"), 
                        gp = gpar(fontsize = 14, fontface = "bold"))
  grid.arrange(p1, p2, ncol = 2, top = grid_title)
}

# Create enhanced forest plots with improved visualization
create_enhanced_forest_plot <- function(boot_data, model_selection, model_name) {
  # Get bootstrap coefficient summary with CIs
  boot_coef_data <- boot_data$coefficients %>%
    filter(term != "(Intercept)") %>%  # Exclude intercept for better scaling
    select(term, boot_mean = mean, boot_conf.low = conf.low, boot_conf.high = conf.high)
  
  # Get original model coefficients with CIs
  orig_coef_data <- broom.mixed::tidy(model_selection$final_model, conf.int = TRUE) %>%
    filter(term != "(Intercept)") %>%
    select(term, orig_estimate = estimate, orig_conf.low = conf.low, orig_conf.high = conf.high)
  
  # Combine the data and prepare for plotting
  combined_data <- full_join(boot_coef_data, orig_coef_data, by = "term") %>%
    mutate(
      # Create a more readable term label
      term_label = str_replace_all(term, "_", " ") %>%
                  str_replace_all("genotype", "Genotype: ") %>%
                  str_replace_all("TRANSFUSION STATUS", "Transfusion Status") %>%
                  str_replace_all("HYDROXYUREA STATUS", "Hydroxyurea Status") %>%
                  str_replace_all("age group", "Age Group") %>%
                  str_replace_all("cuberoot nearest hbf", "HbF (cube root)") %>%
                  str_replace_all("weight kg x", "Weight × "),
      
      # Calculate absolute effect size for ordering
      abs_effect = abs(orig_estimate)
    ) %>%
    arrange(desc(abs_effect)) %>%
    mutate(term_label = factor(term_label, levels = rev(unique(term_label))))
  
  # Create the plot
  p <- ggplot(combined_data) +
    # Add reference line at zero
    geom_vline(xintercept = 0, linetype = "dashed", color = "gray50", linewidth = 0.7) +
    
    # Add original model estimates
    geom_point(aes(x = orig_estimate, y = term_label, color = "Original"), 
               size = 3, shape = 16) +
    geom_errorbarh(aes(y = term_label, xmin = orig_conf.low, xmax = orig_conf.high, 
                       color = "Original"), 
                   height = 0.2, alpha = 0.7) +
    
    # Add bootstrap estimates
    geom_point(aes(x = boot_mean, y = term_label, color = "Bootstrap"), 
               size = 3, shape = 17) +
    geom_errorbarh(aes(y = term_label, xmin = boot_conf.low, xmax = boot_conf.high, 
                       color = "Bootstrap"), 
                   height = 0.2, alpha = 0.7) +
    
    # Customize scales and theme
    scale_color_manual(values = c("Original" = "blue", "Bootstrap" = "red"),
                      name = "Estimate Type") +
    scale_x_continuous(
      limits = function(x) c(min(x) * 1.1, max(x) * 1.1),
      expand = expansion(mult = c(0.1, 0.1))
    ) +
    labs(
      title = paste("Forest Plot for", model_name, "Coefficients"),
      subtitle = "Comparing Original Model vs. Bootstrap 95% Confidence Intervals",
      x = "Coefficient Estimate",
      y = ""
    ) +
    theme_minimal() +
    theme(
      legend.position = "bottom",
      panel.grid.minor = element_blank(),
      panel.grid.major.y = element_blank(),
      axis.text.y = element_text(size = 10),
      plot.title = element_text(size = 14, face = "bold"),
      plot.subtitle = element_text(size = 10)
    )
  
  return(p)
}

# Generate enhanced forest plots for each model
cat("## Enhanced Forest Plots\n\n")

## Enhanced Forest Plots

Code

cat("### ei_min Model\n")

### ei_min Model

Code

create_enhanced_forest_plot(ei_min_boot, ei_min_selection, "ei_min")

Code

cat("### ei_max Model\n")

### ei_max Model

Code

create_enhanced_forest_plot(ei_max_boot, ei_max_selection, "ei_max")

Code

cat("### PoS Model\n")

### PoS Model

Code

create_enhanced_forest_plot(pos_boot, pos_selection, "PoS")

Code

# Add a table comparing the number of significant predictors
cat("\n## Consistency Analysis\n\n")


## Consistency Analysis

Code

# Function to create a consistency summary table
create_consistency_table <- function(boot_data, model_selection, model_name) {
  # Get bootstrap coefficient summary
  boot_coef_data <- boot_data$coefficients %>%
    filter(term != "(Intercept)") %>%
    select(term, boot_mean = mean, boot_conf.low = conf.low, boot_conf.high = conf.high)
  
  # Get original model coefficients
  orig_coef_data <- broom.mixed::tidy(model_selection$final_model, conf.int = TRUE) %>%
    filter(term != "(Intercept)") %>%
    select(term, orig_estimate = estimate, orig_conf.low = conf.low, orig_conf.high = conf.high)
  
  # Combine and analyze
  combined_data <- full_join(boot_coef_data, orig_coef_data, by = "term") %>%
    mutate(
      # Flag significant coefficients
      orig_significant = (orig_conf.low > 0) | (orig_conf.high < 0),
      boot_significant = (boot_conf.low > 0) | (boot_conf.high < 0),
      
      # Create consistency categories
      consistency = case_when(
        (orig_significant & boot_significant & sign(orig_estimate) == sign(boot_mean)) ~ "Consistent & Significant",
        (!orig_significant & !boot_significant) ~ "Consistently Non-significant",
        (orig_significant & !boot_significant) ~ "Original Only",
        (!orig_significant & boot_significant) ~ "Bootstrap Only",
        (orig_significant & boot_significant & sign(orig_estimate) != sign(boot_mean)) ~ "Opposite Direction"
      )
    )
  
  # Count by category
  summary_counts <- combined_data %>%
    count(consistency) %>%
    mutate(model = model_name)
  
  return(summary_counts)
}

# Create consistency tables for all models
consistency_summary <- bind_rows(
  create_consistency_table(ei_min_boot, ei_min_selection, "ei_min"),
  create_consistency_table(ei_max_boot, ei_max_selection, "ei_max"),
  create_consistency_table(pos_boot, pos_selection, "PoS")
) %>%
  pivot_wider(names_from = model, values_from = n, values_fill = 0)

# Display the table
knitr::kable(consistency_summary, 
             caption = "Consistency Between Original and Bootstrap Results") %>%
  kable_styling(full_width = FALSE)

Consistency Between Original and Bootstrap Results
consistency	ei_min	ei_max	PoS
Consistent & Significant	6	5	6
NA	2	2	2
Bootstrap Only	0	0	2

Methods

Statistical Approach

Data Preprocessing

Our analysis began with careful preprocessing of the SCD dataset. We removed extreme outliers for maximum elongation index (ei_max > 30) and created derived variables including transfusion status, steady-state status, and age group classifications. We applied appropriate transformations to improve the normality of our outcome variables: square root transformation for minimum elongation index (ei_min), Box-Cox transformation for maximum elongation index (ei_max), and Box-Cox transformation for point of sickling (PoS).

The preprocessing pipeline, implemented using the tidymodels framework, included:

Feature engineering: Creating interaction terms between weight and hydroxyurea status
Normalization: Centering and scaling numeric predictors
Correlation analysis: Removing highly correlated predictors (threshold = 0.8)
Near-zero variance filtering: Removing predictors with minimal variation
Dummy coding: Converting categorical variables to numeric indicators

Mixed-Effects Modeling

We employed linear mixed-effects models to account for the hierarchical structure of our data, with multiple measurements potentially coming from the same patients. Each model included a random intercept for patient ID (patient_mrn) to account for within-patient correlation.

The initial full models included the following fixed effects: - Age (continuous) - Gender - Genotype (SS, SC, or other) - Steady-state status (TRUE/FALSE) - Transfusion status (TRUE/FALSE) - Cube-root transformed HbF levels - Age group (Elderly/Non-elderly) - Hemoglobin level (hb_advia) - Mean corpuscular hemoglobin concentration (mchc_advia) - Weight - Hydroxyurea status (TRUE/FALSE) - Weight × Hydroxyurea interaction

Model Selection

We implemented a custom backward elimination procedure for model selection, which systematically evaluated the contribution of each predictor to model fit. The procedure:

Started with the full model containing all predictors
Iteratively tested the removal of each predictor
Removed the predictor with the highest p-value (> 0.05) that also improved model AIC
Continued until no further terms could be removed without significantly worsening model fit

This approach allowed us to identify the most parsimonious models that adequately explained the variation in our outcome variables. The final models were refit using Restricted Maximum Likelihood (REML) estimation to obtain unbiased estimates of variance components. However, more computationally intensive methods such as elastic net regularization or stepwise AIC selection followed by k-fold cross-validation may be more appropriate and less prone to bias from p-value-based selection.

Model Validation

We validated our final models using stratified bootstrap resampling (n = 500) to assess their predictive performance and stability. Our stratified approach specifically addressed class imbalance in key variables:

Stratification by imbalanced classes: We stratified the bootstrap samples by age group (Elderly/Non-elderly) and transfusion status (TRUE/FALSE), which showed substantial class imbalance in our dataset.
Hierarchical sampling: To preserve the hierarchical structure of our data:
- We first sampled patients within each stratum
- Then included all observations for the selected patients
- This approach maintained the within-patient correlation structure
Performance assessment: For each bootstrap sample, we:
- Refit the optimized model formula derived from the stepwise selection
- Calculated performance metrics (RMSE and R-squared)
- Extracted fixed effects estimates and confidence intervals
Consistency checking: We compared the proportions of patients in each stratum between the original data and bootstrap samples to verify that our stratification approach maintained the representation of minority groups.

This stratified approach provided more reliable confidence intervals than traditional bootstrapping, particularly for predictors associated with imbalanced groups (e.g., elderly patients or those with recent transfusions).

Diagnostics

We performed comprehensive diagnostics to ensure the validity of our models:

Residual analysis: Examining residual plots for homoscedasticity and patterns
Normality assessment: Q-Q plots to verify normality of residuals
Influence diagnostics: Using DHARMa simulated residuals for mixed models
Multicollinearity: Variance Inflation Factor (VIF) analysis

These diagnostics confirmed that our models mostly met the assumptions of linear mixed-effects modeling and provided reliable estimates of the relationships between predictors and outcomes. However, for EI-max, a non-linear relationship may be suggested by the sigmoidal curve observed in the residual distribution. This likely reflects the similar EI-max values often seen across subjects regardless of other factors, such as the comparable EI-max values between AA and SS genotypes. Alternative approaches such as beta regression or Gamma models might better characterize EI-max. Nevertheless, given the relatively low variance in this outcome compared to others, the linear model provides a reasonable approximation.

Conclusions

EI-Min

Our models revealed that age group had a significant positive effect on ei_min, with elderly status being more predictive than age as a continuous variable. Notably, HbF levels were not retained in the optimal EI-min model. Transfusion status, weight, and hemoglobin levels were all associated with increased EI-min values, while male gender and higher MCHC values were associated with lower ei_min values.

EI-Max

Neither age group nor continuous age demonstrated significant effects on EI-max. Transfusion status, hemoglobin levels, and HbF levels were associated with higher EI-max values, while male gender and higher MCHC values were associated with lower EI-max values.

POS

Age group showed a borderline negative effect on POS, suggesting that older patients require more deoxygenation to trigger sickling. Continuous age showed a similar marginal negative effect. Transfusion status, HbF levels, and hemoglobin levels were associated with lower POS values, while male gender and higher MCHC values were associated with higher POS values. Notably, this was the only optimized model in which genotype emerged as a crucial predictor, with a comparatively large effect size.

Potential Limitations and Improvements

Statistical Limitations

Feature Selection Bias: Our backward elimination approach may be susceptible to p-value-based selection bias. Implementing elastic net regularization could provide more stable feature selection without dependency on p-values.
Model Selection Validation: The current approach lacks cross-validation to assess model stability. Implementing stepAIC with k-fold cross-validation would provide a more robust model selection process while maintaining the interpretability of traditional statistical approaches.
Model Specification: The linear mixed-effects framework may not fully capture non-linear relationships, particularly for EI-max. Generalized additive mixed models (GAMMs) or beta regression might better characterize these relationships.
Interaction Exploration: Our models included limited interaction terms. A more systematic exploration of interactions, especially between genotype and treatment variables, could reveal important effect modifications. I believe this imporvement may be the most called for.
Temporal Dynamics: The current models treat repeated measurements as exchangeable within patients. More sophisticated longitudinal modeling approaches could better capture within-patient changes over time, including seasonal effects.

Methodological Improvements

Bayesian Framework: Adopting Bayesian mixed models would allow incorporation of prior knowledge and better quantification of uncertainty, especially for parameters with limited data. – This would be quite difficult to impliment, however.
Causal Inference: Our models establish associations but not causality. Methods such as propensity score matching could better estimate treatment effects of transfusion and hydroxyurea. – Likely not needed, but could be done if we need more to talk about.

Software

All analyses were conducted in R (version 4.2.0) using RStudio. We employed the following key packages: lme4 and lmerTest for mixed-effects modeling, tidymodels for data preprocessing, recipes for feature engineering, broom.mixed for model summaries, performance for model diagnostics, and ggplot2 for visualization. Bootstrap validation was implemented using parallel processing with the parallel and foreach packages to improve computational efficiency. All code and data preprocessing steps were documented in R Markdown for reproducibility.