SCD Rheology Model Refinement: Modeling ei_min, ei_max, and PoS
Author
Jonathan Wade
Analysis Summary
This analysis explores key rheological parameters in our OMICs cohort:
minimum elongation index (ei_min)
maximum elongation index (ei_max)
point of sickling (PoS).
Using mixed-effects models, we identify the most important clinical and demographic predictors of these outcomes while accounting for patient-level clustering and addressing class imbalance in our dataset.
Our approach combines statistical modeling with stratified bootstrap validation to ensure reliable inference for all patient subgroups. The results reveal distinct patterns of predictors for each rheological parameter, with some consistent effects across outcomes that may have important clinical implications.
Introduction
In this document we start by modeling three key outcomes from our SCD dataset:
- ei_min (Minimum Elongation Index)
- ei_max (Maximum Elongation Index)
- PoS (Point of Sickling, from pos_lorrca)
I begin with a full mixed-effects model (using a random intercept for each patient) that includes predictors such as age, gender, genotype, steady-state status, transfusion status, a transformed measure of HbF (using a cube-root transformation), and an age group indicator. I use the tidymodels framework for preprocessing, model fitting, and diagnostics.
Analytical Approach
Our analysis aimed to identify key predictors of rheological parameters while accounting for patient-level clustering and addressing class imbalance in our cohort. I employed a systematic approach to data preprocessing, model development, and validation.
I began by preprocessing the OMICs dataset, removing extreme outliers and creating derived variables to capture clinically relevant features. We applied appropriate transformations to improve the normality of our outcome variables and prepared the data for mixed-effects modeling.
Code
# Load required librarieslibrary(tidyverse)library(tidymodels)library(themis) # For additional preprocessing stepslibrary(vip) # For variable importancelibrary(corrplot) # For correlation plotslibrary(car) # For VIF analysislibrary(DHARMa) # For mixed model diagnosticslibrary(glmmTMB) # For alternative mixed models if neededlibrary(splines) # For potential non-linear termslibrary(gridExtra) # For arranging plotslibrary(broom.mixed) # For tidy method with mixed modelstidymodels_prefer()library(kableExtra)library(grid)library(gridExtra)library(lme4)library(cowplot)set.seed(333)# Load the datasetdf <-read_csv('OMICS-export-processed-3-5-25.csv')%>%filter(genotype !='Sbeta+')# Initial data preprocessingdf <- df %>%# Remove extreme outliers for ei_maxfilter(ei_max_lorrca <=30) %>%# Convert dates and create derived variablesmutate(date_collection =as.Date(date_of_collection, format ="%Y-%m-%d"),date_hba_before =as.Date(hba_before_date, format ="%Y-%m-%d"),date_hba_after =as.Date(hba_after_date, format ="%Y-%m-%d"),TRANSFUSION_STATUS =case_when( (!is.na(date_hba_before) &abs(as.numeric(date_collection - date_hba_before)) <=90) | (!is.na(date_hba_after) &abs(as.numeric(date_hba_after - date_collection)) <=90) ~TRUE,TRUE~FALSE ),steady_state =if_else(!is.na(time_from_voe) &abs(time_from_voe) >=4, TRUE, FALSE),# Determine the nearest HbF valuedays_to_hbf_before =if_else(!is.na(hbf_before_date),abs(as.numeric(date_collection -as.Date(hbf_before_date, format="%Y-%m-%d"))),Inf),days_to_hbf_after =if_else(!is.na(hbf_after_date),abs(as.numeric(date_collection -as.Date(hbf_after_date, format="%Y-%m-%d"))),Inf),nearest_hbf =if_else(days_to_hbf_before <= days_to_hbf_after, hbf_before, hbf_after),# Create hydroxyurea status variableHYDROXYUREA_STATUS =if_else(!is.na(op_hydroxyurea_status) & op_hydroxyurea_status =="Yes", TRUE, FALSE),# Create age groupage_group =if_else(age >=60, "Elderly", "Non-elderly"),# Create initial transformationssqrt_ei_min =sqrt(ei_min_lorrca),cuberoot_nearest_hbf =sign(nearest_hbf) *abs(nearest_hbf)^(1/3) ) %>%# Convert factorsmutate(across(c(age_group, gender, genotype, TRANSFUSION_STATUS, steady_state, HYDROXYUREA_STATUS), as.factor)) %>%# Set factor levelsmutate(age_group =fct_relevel(age_group, "Non-elderly", "Elderly"))# Function to create a recipe for each outcomecreate_rheology_recipe <-function(data, outcome, outcome_transform =NULL) {# Create initial recipe rec <-recipe(formula =as.formula(paste(outcome, "~ .")), data = data) %>%# Update roles - explicitly specify ID columnupdate_role(all_of("patient_mrn"), new_role ="ID") %>%# Remove date variables and other unnecessary columnsupdate_role(matches("date|days_to|_before$|_after$"), new_role ="drop") %>%# Remove variables we don't want in the modelstep_rm(matches("_lorrca$"), -all_outcomes()) %>%# Center and scale numeric predictorsstep_normalize(all_numeric_predictors()) %>%# Check for correlationsstep_corr(all_numeric_predictors(), threshold =0.8) %>%# Create interaction termsstep_interact(terms =~ weight_kg:HYDROXYUREA_STATUS) %>%# Remove near-zero variance predictorsstep_nzv(all_predictors()) %>%# Dummy code categorical variables, excluding ID and outcomestep_dummy(all_nominal_predictors(), -all_outcomes(), -has_role("ID"))# Add outcome transformation if specifiedif (!is.null(outcome_transform)) { rec <- rec %>%step_mutate(!!outcome :=!!outcome_transform) }return(rec)}# Function to check VIFcheck_vif <-function(recipe, data) {# Prep and bake the recipe prepped_data <-prep(recipe) %>%bake(new_data =NULL)# Remove ID column for VIF calculation model_data <- prepped_data %>%select(-patient_mrn)# Fit a linear model and check VIF vif_model <-lm(reformulate(setdiff(names(model_data), names(model_data)[1]), response =names(model_data)[1]), data = model_data)print(car::vif(vif_model))}# Function to plot correlationsplot_correlations <-function(recipe, data) {# Prep and bake the recipe prepped_data <-prep(recipe) %>%bake(new_data =NULL)# Create correlation matrix for numeric variables cor_matrix <- prepped_data %>%select(where(is.numeric), -patient_mrn) %>%cor(use ="pairwise.complete.obs")# Plot correlationscorrplot(cor_matrix, method ="circle", type ="upper", tl.col ="black", tl.srt =45)}# Function to fit mixed modelfit_mixed_model <-function(data, outcome, recipe) {# Prep the recipe prepped_recipe <-prep(recipe)# Bake the data model_data <-bake(prepped_recipe, new_data =NULL)# Get predictor names (excluding outcome and ID) predictors <-setdiff(names(model_data), c(outcome, "patient_mrn"))# Create formula formula_str <-paste(outcome, "~", paste(predictors, collapse =" + "), "+ (1 | patient_mrn)")# Fit the model model <-lmer(as.formula(formula_str), data = model_data)return(model)}
We applied appropriate transformations to each outcome variable based on their distributions: square root for ei_min and Box-Cox for ei_max and PoS. These transformations helped improve the normality of our outcome variables and ensure the validity of our statistical models.
Model Development
We developed mixed-effects models for each rheological parameter, starting with theory-driven predictors and using backward elimination to identify parsimonious models. Each model accounts for within-patient correlation through a random intercept term.
For the minimum elongation index (ei_min), we applied a square root transformation to improve normality. The model includes key clinical predictors and accounts for patient-level clustering.
Code
# Prepare data for ei_min modelei_min_data <- df %>%select(sqrt_ei_min, patient_mrn, age, gender, genotype, steady_state, TRANSFUSION_STATUS, cuberoot_nearest_hbf, age_group, hb_advia, mchc_advia, weight_kg, HYDROXYUREA_STATUS) %>%# drop_na() %>%# Ensure patient_mrn is a factormutate(patient_mrn =as.factor(patient_mrn))# Create and prep recipeei_min_recipe <-create_rheology_recipe(ei_min_data, "sqrt_ei_min")# Check VIF and correlationscheck_vif(ei_min_recipe, ei_min_data)
# Fit the mixed model directlyei_min_model <-fit_mixed_model(ei_min_data, "sqrt_ei_min", ei_min_recipe)# Model diagnostics# DHARMa diagnosticsei_min_resids <-simulateResiduals(ei_min_model)plot(ei_min_resids)
Code
# Standard diagnostic plots - using manual approach instead of augmentmodel_data <-bake(prep(ei_min_recipe), new_data =NULL)fitted_vals <-fitted(ei_min_model)residuals_vals <-residuals(ei_min_model)tibble(.fitted = fitted_vals,.resid = residuals_vals) %>%ggplot(aes(x = .fitted, y = .resid)) +geom_point(alpha =0.5) +geom_hline(yintercept =0, color ="red", linetype ="dashed") +geom_smooth(method ="loess", se =FALSE, color ="blue") +labs(title ="ei_min: Residuals vs Fitted Values")
# Create a tidy summary with confidence intervalsbroom.mixed::tidy(ei_min_model, conf.int =TRUE) %>% knitr::kable(caption ="ei_min Model Coefficients")
ei_min Model Coefficients
effect
group
term
estimate
std.error
statistic
conf.low
conf.high
fixed
NA
(Intercept)
0.3977760
0.0277686
14.3246713
0.3433505
0.4522014
fixed
NA
age
-0.0004250
0.0111940
-0.0379666
-0.0223647
0.0215147
fixed
NA
cuberoot_nearest_hbf
0.0039603
0.0083112
0.4765074
-0.0123292
0.0202499
fixed
NA
hb_advia
0.0156840
0.0076398
2.0529417
0.0007103
0.0306577
fixed
NA
mchc_advia
-0.0204666
0.0058917
-3.4737840
-0.0320143
-0.0089190
fixed
NA
weight_kg
0.0149212
0.0077125
1.9346792
-0.0001950
0.0300375
fixed
NA
gender_Male
-0.0374773
0.0144899
-2.5864516
-0.0658769
-0.0090777
fixed
NA
genotype_SS
-0.0407000
0.0303035
-1.3430796
-0.1000937
0.0186937
fixed
NA
steady_state_TRUE.
-0.0028628
0.0149608
-0.1913549
-0.0321854
0.0264598
fixed
NA
TRANSFUSION_STATUS_TRUE.
0.1428804
0.0153463
9.3104015
0.1128022
0.1729587
fixed
NA
age_group_Elderly
0.0785826
0.0356136
2.2065304
0.0087811
0.1483840
fixed
NA
HYDROXYUREA_STATUS_TRUE.
-0.0075673
0.0229862
-0.3292120
-0.0526194
0.0374848
ran_pars
patient_mrn
sd__(Intercept)
0.0780979
NA
NA
NA
NA
ran_pars
Residual
sd__Observation
0.1035489
NA
NA
NA
NA
For the maximum elongation index (ei_max), we applied a Box-Cox transformation to address non-normality. The optimal lambda was determined empirically to ensure the best possible transformation.
# Fit the mixed model directlyei_max_model <-fit_mixed_model(ei_max_data, "ei_max_lorrca", ei_max_recipe)# Model diagnostics# DHARMa diagnosticsei_max_resids <-simulateResiduals(ei_max_model)plot(ei_max_resids)
Code
# Standard diagnostic plots - using manual approach instead of augmentmodel_data <-bake(prep(ei_max_recipe), new_data =NULL)fitted_vals <-fitted(ei_max_model)residuals_vals <-residuals(ei_max_model)tibble(.fitted = fitted_vals,.resid = residuals_vals) %>%ggplot(aes(x = .fitted, y = .resid)) +geom_point(alpha =0.5) +geom_hline(yintercept =0, color ="red", linetype ="dashed") +geom_smooth(method ="loess", se =FALSE, color ="blue") +labs(title ="ei_max: Residuals vs Fitted Values")
# Create a tidy summary with confidence intervalsbroom.mixed::tidy(ei_max_model, conf.int =TRUE) %>% knitr::kable(caption ="ei_max Model Coefficients")
ei_max Model Coefficients
effect
group
term
estimate
std.error
statistic
conf.low
conf.high
fixed
NA
(Intercept)
-0.3197272
0.0046901
-68.1703447
-0.3289197
-0.3105347
fixed
NA
age
-0.0025482
0.0018684
-1.3638585
-0.0062102
0.0011138
fixed
NA
cuberoot_nearest_hbf
0.0040662
0.0013514
3.0089699
0.0014176
0.0067148
fixed
NA
hb_advia
0.0077241
0.0012197
6.3325291
0.0053334
0.0101147
fixed
NA
mchc_advia
-0.0050137
0.0008962
-5.5942163
-0.0067703
-0.0032571
fixed
NA
weight_kg
0.0008363
0.0012896
0.6484802
-0.0016912
0.0033638
fixed
NA
gender_Male
-0.0069049
0.0024146
-2.8597211
-0.0116374
-0.0021725
fixed
NA
genotype_SS
-0.0044191
0.0050660
-0.8723103
-0.0143484
0.0055101
fixed
NA
steady_state_TRUE.
-0.0003566
0.0024980
-0.1427695
-0.0052527
0.0045394
fixed
NA
TRANSFUSION_STATUS_TRUE.
0.0117431
0.0024156
4.8613387
0.0070086
0.0164777
fixed
NA
age_group_Elderly
0.0095029
0.0060179
1.5791099
-0.0022919
0.0212977
fixed
NA
HYDROXYUREA_STATUS_TRUE.
0.0017913
0.0038837
0.4612392
-0.0058205
0.0094031
ran_pars
patient_mrn
sd__(Intercept)
0.0147667
NA
NA
NA
NA
ran_pars
Residual
sd__Observation
0.0153616
NA
NA
NA
NA
For the Point of Sickling (PoS), we also applied a Box-Cox transformation to address non-normality. This parameter represents a critical threshold in the sickling process and may have different predictors than the elongation indices.
# Fit the mixed model directlypos_model <-fit_mixed_model(pos_data, "pos_lorrca", pos_recipe)# Model diagnostics# DHARMa diagnosticspos_resids <-simulateResiduals(pos_model)plot(pos_resids)
Code
# Standard diagnostic plots - using manual approach instead of augmentmodel_data <-bake(prep(pos_recipe), new_data =NULL)fitted_vals <-fitted(pos_model)residuals_vals <-residuals(pos_model)tibble(.fitted = fitted_vals,.resid = residuals_vals) %>%ggplot(aes(x = .fitted, y = .resid)) +geom_point(alpha =0.5) +geom_hline(yintercept =0, color ="red", linetype ="dashed") +geom_smooth(method ="loess", se =FALSE, color ="blue") +labs(title ="PoS: Residuals vs Fitted Values")
# Create a tidy summary with confidence intervalsbroom.mixed::tidy(pos_model, conf.int =TRUE) %>% knitr::kable(caption ="PoS Model Coefficients")
PoS Model Coefficients
effect
group
term
estimate
std.error
statistic
conf.low
conf.high
fixed
NA
(Intercept)
5.8564880
0.2854857
20.5141205
5.2969463
6.4160297
fixed
NA
age
-0.1490964
0.1141643
-1.3059807
-0.3728543
0.0746615
fixed
NA
cuberoot_nearest_hbf
-0.1792507
0.0867945
-2.0652307
-0.3493649
-0.0091366
fixed
NA
hb_advia
-0.3070063
0.0817755
-3.7542586
-0.4672833
-0.1467293
fixed
NA
mchc_advia
0.2606577
0.0629046
4.1436962
0.1373669
0.3839486
fixed
NA
weight_kg
-0.0770288
0.0800192
-0.9626284
-0.2338636
0.0798060
fixed
NA
gender_Male
0.3324202
0.1481356
2.2440260
0.0420797
0.6227607
fixed
NA
genotype_SS
1.7119160
0.3125480
5.4772897
1.0993332
2.3244989
fixed
NA
steady_state_TRUE.
0.1185945
0.1539473
0.7703580
-0.1831366
0.4203257
fixed
NA
TRANSFUSION_STATUS_TRUE.
-0.9518397
0.1618431
-5.8812484
-1.2690464
-0.6346330
fixed
NA
age_group_Elderly
-0.6589895
0.3638924
-1.8109461
-1.3722056
0.0542265
fixed
NA
HYDROXYUREA_STATUS_TRUE.
-0.1797420
0.2349506
-0.7650207
-0.6402367
0.2807526
ran_pars
patient_mrn
sd__(Intercept)
0.6988567
NA
NA
NA
NA
ran_pars
Residual
sd__Observation
1.1926620
NA
NA
NA
NA
The initial models for all three rheological parameters showed good fit and reasonable diagnostic properties. However, to identify the most parsimonious models, we proceeded with feature selection using backward elimination.
Feature Selection and Model Refinement
To identify the most important predictors for each rheological parameter, we implemented a custom backward elimination procedure. This approach systematically evaluated the contribution of each predictor to model fit, removing variables that did not significantly improve the model.
Our backward elimination procedure started with the full model and iteratively removed predictors based on statistical significance and model fit criteria (AIC). This approach helped us identify the most parsimonious models while maintaining predictive performance.
Code
conflicted::conflicts_prefer(lmerTest::lmer)# Function to perform bidirectional stepwise selection for mixed modelsperform_stepwise_selection <-function(model, data, recipe) {# Load lmerTest package for mixed model stepwise selectionif (!requireNamespace("lmerTest", quietly =TRUE)) {install.packages("lmerTest") }library(lmerTest)# Prepare the recipe and process the data prepped_recipe <-prep(recipe) processed_data <-bake(prepped_recipe, new_data =NULL)# Get the outcome variable name outcome_var <-all.vars(formula(model))[1]# Get predictor names (excluding outcome and ID) predictors <-setdiff(names(processed_data), c(outcome_var, "patient_mrn"))# Create formula for the full model full_formula <-as.formula(paste(outcome_var, "~", paste(predictors, collapse =" + "), "+ (1 | patient_mrn)"))# Print the formula for debuggingcat("Full model formula:\n")print(full_formula)# Ensure patient_mrn is a factor in processed data processed_data$patient_mrn <-as.factor(processed_data$patient_mrn)# Handle missing values - create a complete dataset for model comparison# This ensures all models are fitted to the same dataset size model_vars <-c(outcome_var, "patient_mrn", predictors) complete_data <- processed_data %>%select(all_of(model_vars)) %>%na.omit()cat("Original data rows:", nrow(processed_data), "\n")cat("Complete data rows after removing NAs:", nrow(complete_data), "\n")# Fit the full model with lmerTest using the complete dataset full_model <-tryCatch({ lmerTest::lmer(full_formula, data = complete_data, REML =FALSE) }, error =function(e) {cat("Error fitting full model:", e$message, "\n")cat("Checking data structure:\n")str(complete_data[, c(outcome_var, "patient_mrn", predictors[1:min(3, length(predictors))])])stop("Error in model fitting. See details above.") })# Create a simplified approach to stepwise selection# Instead of using lmerTest::step, we'll manually perform backward elimination# Start with the full model current_model <- full_model# Get fixed effects fixed_effects <-names(fixef(current_model)) fixed_effects <- fixed_effects[fixed_effects !="(Intercept)"]# Initialize a list to store eliminated terms eliminated_terms <-list()# Perform backward eliminationcat("\nPerforming manual backward elimination:\n")while(length(fixed_effects) >0) {# Store current AIC current_aic <-AIC(current_model)# Try removing each term best_aic <- current_aic term_to_remove <-NULLfor(term in fixed_effects) {# Create formula without this term remaining_terms <- fixed_effects[fixed_effects != term]if(length(remaining_terms) >0) { new_formula <-as.formula(paste(outcome_var, "~", paste(remaining_terms, collapse =" + "), "+ (1 | patient_mrn)")) } else { new_formula <-as.formula(paste(outcome_var, "~ (1 | patient_mrn)")) }# Fit model without this term - using the same complete dataset new_model <-lmer(new_formula, data = complete_data, REML =FALSE) new_aic <-AIC(new_model)# Perform likelihood ratio test lrt <-anova(new_model, current_model) p_value <- lrt$`Pr(>Chisq)`[2]cat(sprintf(" Testing removal of %s: AIC change = %.2f, p-value = %.4f\n", term, new_aic - current_aic, p_value))# If p-value > 0.05, consider removing this termif(p_value >0.05&& new_aic < best_aic) { best_aic <- new_aic term_to_remove <- term } }# If no term can be removed, breakif(is.null(term_to_remove)) {cat("No more terms can be removed.\n")break }# Remove the term with the highest p-valuecat(sprintf("Removing term: %s (AIC change: %.2f)\n", term_to_remove, best_aic - current_aic))# Update fixed effects fixed_effects <- fixed_effects[fixed_effects != term_to_remove]# Update modelif(length(fixed_effects) >0) { new_formula <-as.formula(paste(outcome_var, "~", paste(fixed_effects, collapse =" + "), "+ (1 | patient_mrn)")) } else { new_formula <-as.formula(paste(outcome_var, "~ (1 | patient_mrn)")) } current_model <-lmer(new_formula, data = complete_data, REML =FALSE)# Store eliminated term eliminated_terms <-c(eliminated_terms, term_to_remove) }# Store the final formula from the step process final_formula <-formula(current_model)# Create a step_result object step_result <-list(model = current_model,eliminated.terms = eliminated_terms,formula = final_formula )# Refit final model with REML using the SAME formula from step_result# This ensures consistency between step results and final model final_model <-lmer(final_formula, data = complete_data, REML =TRUE)# Return the resultsreturn(list(final_model = final_model,step_result = step_result,formula = final_formula, # Use the same formula from step_resultprocessed_data = processed_data,complete_data = complete_data # Also return the complete data used for modeling ))}# Perform stepwise selection for each modelei_min_selection <-perform_stepwise_selection(ei_min_model, ei_min_data, ei_min_recipe)ei_max_selection <-perform_stepwise_selection(ei_max_model, ei_max_data, ei_max_recipe)pos_selection <-perform_stepwise_selection(pos_model, pos_data, pos_recipe)# Print final model formulascat("Final ei_min model formula:\n")print(ei_min_selection$formula)cat("\nFinal ei_max model formula:\n")print(ei_max_selection$formula)cat("\nFinal PoS model formula:\n")print(pos_selection$formula)# Print step resultscat("\n\nei_min Step Results:\n")print(ei_min_selection$step_result)cat("\n\nei_max Step Results:\n")print(ei_max_selection$step_result)cat("\n\nPoS Step Results:\n")print(pos_selection$step_result)
After feature selection, we obtained optimized models for each rheological parameter. These final models represent the most parsimonious set of predictors that explain the variation in our outcomes.
Code
# Create tidy summaries with confidence intervalsbroom.mixed::tidy(ei_min_selection$final_model, conf.int =TRUE) %>% knitr::kable(caption ="Final ei_min Model Coefficients")
Final ei_min Model Coefficients
effect
group
term
estimate
std.error
statistic
df
p.value
conf.low
conf.high
fixed
NA
(Intercept)
0.3622735
0.0108821
33.290913
294.0873
0.0000000
0.3408569
0.3836901
fixed
NA
hb_advia
0.0190493
0.0071288
2.672175
440.1808
0.0078158
0.0050387
0.0330600
fixed
NA
mchc_advia
-0.0195507
0.0058173
-3.360769
445.4569
0.0008442
-0.0309835
-0.0081179
fixed
NA
weight_kg
0.0148995
0.0070553
2.111822
279.9484
0.0355877
0.0010114
0.0287876
fixed
NA
gender_Male
-0.0418810
0.0140585
-2.979044
269.7165
0.0031548
-0.0695594
-0.0142026
fixed
NA
TRANSFUSION_STATUS_TRUE.
0.1373834
0.0136619
10.055944
436.0968
0.0000000
0.1105320
0.1642348
fixed
NA
age_group_Elderly
0.0806495
0.0246527
3.271421
214.7001
0.0012467
0.0320571
0.1292419
ran_pars
patient_mrn
sd__(Intercept)
0.0778999
NA
NA
NA
NA
NA
NA
ran_pars
Residual
sd__Observation
0.1032040
NA
NA
NA
NA
NA
NA
Code
broom.mixed::tidy(ei_max_selection$final_model, conf.int =TRUE) %>% knitr::kable(caption ="Final ei_max Model Coefficients")
Final ei_max Model Coefficients
effect
group
term
estimate
std.error
statistic
df
p.value
conf.low
conf.high
fixed
NA
(Intercept)
-0.3221799
0.0017481
-184.300149
295.4007
0.0000000
-0.3256203
-0.3187396
fixed
NA
cuberoot_nearest_hbf
0.0040396
0.0011866
3.404292
339.2397
0.0007428
0.0017056
0.0063737
fixed
NA
hb_advia
0.0082785
0.0011315
7.316638
468.6690
0.0000000
0.0060551
0.0105019
fixed
NA
mchc_advia
-0.0050283
0.0008892
-5.654692
423.7150
0.0000000
-0.0067761
-0.0032804
fixed
NA
gender_Male
-0.0075947
0.0023145
-3.281297
266.1054
0.0011713
-0.0121518
-0.0030376
fixed
NA
TRANSFUSION_STATUS_TRUE.
0.0116852
0.0022847
5.114494
474.6928
0.0000005
0.0071958
0.0161746
ran_pars
patient_mrn
sd__(Intercept)
0.0146306
NA
NA
NA
NA
NA
NA
ran_pars
Residual
sd__Observation
0.0153741
NA
NA
NA
NA
NA
NA
Code
broom.mixed::tidy(pos_selection$final_model, conf.int =TRUE) %>% knitr::kable(caption ="Final PoS Model Coefficients")
Final PoS Model Coefficients
effect
group
term
estimate
std.error
statistic
df
p.value
conf.low
conf.high
fixed
NA
(Intercept)
5.8980672
0.2747116
21.470031
213.7324
0.0000000
5.3565762
6.4395583
fixed
NA
age
-0.1831161
0.1067833
-1.714839
233.2274
0.0877031
-0.3934991
0.0272670
fixed
NA
cuberoot_nearest_hbf
-0.1995532
0.0838417
-2.380119
322.8376
0.0178873
-0.3644982
-0.0346081
fixed
NA
hb_advia
-0.3200216
0.0805347
-3.973709
436.2582
0.0000828
-0.4783059
-0.1617373
fixed
NA
mchc_advia
0.2646799
0.0625470
4.231698
460.7020
0.0000280
0.1417672
0.3875926
fixed
NA
gender_Male
0.3255589
0.1478485
2.201977
251.0434
0.0285768
0.0343774
0.6167404
fixed
NA
genotype_SS
1.7257453
0.3102665
5.562139
243.7293
0.0000001
1.1145995
2.3368912
fixed
NA
TRANSFUSION_STATUS_TRUE.
-0.9626815
0.1585116
-6.073255
443.9817
0.0000000
-1.2742078
-0.6511552
fixed
NA
age_group_Elderly
-0.5360647
0.3434488
-1.560829
204.7671
0.1201083
-1.2132141
0.1410846
ran_pars
patient_mrn
sd__(Intercept)
0.7001978
NA
NA
NA
NA
NA
NA
ran_pars
Residual
sd__Observation
1.1904694
NA
NA
NA
NA
NA
NA
The feature selection process identified distinct sets of predictors for each rheological parameter, highlighting the complex and multifaceted nature of rheological changes in SCD. To ensure the robustness of our findings, we proceeded with bootstrap validation of these final models.
Model Validation
We validated our final models using stratified bootstrap resampling to assess their predictive performance and stability. This approach specifically addressed class imbalance in key variables, ensuring reliable inference for all patient subgroups.
Our stratified bootstrap approach involved resampling with replacement within strata defined by age group and transfusion status. This method preserved the hierarchical structure of the data and provided robust confidence intervals for model coefficients.
Code
# A more robust bootstrap validation function using bootMer from lme4perform_bootstrap_validation <-function(model_selection, n_boot =500) {# Get the final model from the model selection results final_model <- model_selection$final_model# Get the outcome variable name outcome_var <-all.vars(formula(final_model))[1]# Print debugging informationcat("Bootstrapping for outcome:", outcome_var, "\n")cat("Formula being used:\n")print(formula(final_model))# Use bootMer for parametric bootstrapping of mixed models# This handles missing data appropriately and is specifically designed for lmer modelscat("Performing bootstrap with bootMer...\n") boot_results <- lme4::bootMer( final_model,FUN =function(x) {# Extract fixed effects fixed_effects <-fixef(x)# Extract random effects variance components rand_vars <-as.data.frame(VarCorr(x)) rand_sd <- rand_vars$sdcor[1] # SD of random intercept residual_sd <- rand_vars$sdcor[2] # Residual SD# Calculate performance metrics rmse <-sqrt(mean(residuals(x)^2)) rsq <-cor(fitted(x), getME(x, "y"))^2# Return all parameters of interestc(fixed_effects, rand_sd = rand_sd, residual_sd = residual_sd, rmse = rmse, rsq = rsq) },nsim = n_boot,parallel ="multicore", # Use parallel processing if availablencpus = parallel::detectCores() -1, # Use all but one coreseed =123# For reproducibility )# Extract the bootstrap results boot_data <-as.data.frame(boot_results$t)# Get the parameter names param_names <-colnames(boot_data) fixed_effect_names <-names(fixef(final_model))# Calculate confidence intervals and summaries for fixed effects fixed_effects_summary <-data.frame(term = fixed_effect_names,mean =colMeans(boot_data[, fixed_effect_names], na.rm =TRUE),median =apply(boot_data[, fixed_effect_names], 2, median, na.rm =TRUE),sd =apply(boot_data[, fixed_effect_names], 2, sd, na.rm =TRUE),conf.low =apply(boot_data[, fixed_effect_names], 2, quantile, probs =0.025, na.rm =TRUE),conf.high =apply(boot_data[, fixed_effect_names], 2, quantile, probs =0.975, na.rm =TRUE) )# Calculate confidence intervals and summaries for random effects random_effects_summary <-data.frame(component =c("Random Intercept SD", "Residual SD"),mean =c(mean(boot_data$rand_sd, na.rm =TRUE), mean(boot_data$residual_sd, na.rm =TRUE)),median =c(median(boot_data$rand_sd, na.rm =TRUE), median(boot_data$residual_sd, na.rm =TRUE)),sd =c(sd(boot_data$rand_sd, na.rm =TRUE), sd(boot_data$residual_sd, na.rm =TRUE)),conf.low =c(quantile(boot_data$rand_sd, 0.025, na.rm =TRUE),quantile(boot_data$residual_sd, 0.025, na.rm =TRUE) ),conf.high =c(quantile(boot_data$rand_sd, 0.975, na.rm =TRUE),quantile(boot_data$residual_sd, 0.975, na.rm =TRUE) ) )# Calculate confidence intervals and summaries for performance metrics metrics_summary <-data.frame(.metric =c("rmse", "rsq"),mean =c(mean(boot_data$rmse, na.rm =TRUE), mean(boot_data$rsq, na.rm =TRUE)),median =c(median(boot_data$rmse, na.rm =TRUE), median(boot_data$rsq, na.rm =TRUE)),sd =c(sd(boot_data$rmse, na.rm =TRUE), sd(boot_data$rsq, na.rm =TRUE)),conf.low =c(quantile(boot_data$rmse, 0.025, na.rm =TRUE),quantile(boot_data$rsq, 0.025, na.rm =TRUE) ),conf.high =c(quantile(boot_data$rmse, 0.975, na.rm =TRUE),quantile(boot_data$rsq, 0.975, na.rm =TRUE) ) )# Prepare data for plotting fixed_effects_data <- boot_data[, fixed_effect_names] %>%as.data.frame() %>%pivot_longer(cols =everything(), names_to ="term", values_to ="estimate") metrics_data <-data.frame(id =rep(1:nrow(boot_data), 2),.metric =c(rep("rmse", nrow(boot_data)), rep("rsq", nrow(boot_data))),.estimator ="standard",.estimate =c(boot_data$rmse, boot_data$rsq) )# Return all summaries and raw bootstrap data for plottingreturn(list(coefficients = fixed_effects_summary,random_effects = random_effects_summary,metrics = metrics_summary,original_model = final_model,bootstrap_samples = n_boot,successful_bootstraps =sum(!is.na(boot_data$rmse)), # Count non-NA resultsraw_coef_data = fixed_effects_data,raw_metrics = metrics_data ))}# New stratified bootstrap validation function that accounts for class imbalanceperform_stratified_bootstrap_validation <-function(model_selection, n_boot =500) {# Get the final model and data final_model <- model_selection$final_model complete_data <- model_selection$complete_data# Get the formula directly from the model model_formula <-formula(final_model)# Get the outcome variable name outcome_var <-all.vars(model_formula)[1]# Print debugging informationcat("Performing stratified bootstrap for outcome:", outcome_var, "\n")cat("Formula being used:\n")print(model_formula)# Define stratification variables with focus on imbalanced classes# Look for transformed column names that match our stratification variables strata_vars <-c()# Check for age_group (transformed to age_group_Elderly)if ("age_group_Elderly"%in%names(complete_data)) { strata_vars <-c(strata_vars, "age_group_Elderly")cat("Found transformed age_group variable: age_group_Elderly\n") }# Check for TRANSFUSION_STATUS (transformed to TRANSFUSION_STATUS_TRUE.)if ("TRANSFUSION_STATUS_TRUE."%in%names(complete_data)) { strata_vars <-c(strata_vars, "TRANSFUSION_STATUS_TRUE.")cat("Found transformed TRANSFUSION_STATUS variable: TRANSFUSION_STATUS_TRUE.\n") }# If no stratification variables found, warn the userif (length(strata_vars) ==0) {warning("No stratification variables found in the data. Will perform regular bootstrap.") }# Check class imbalance in stratification variablescat("\nClass distribution in stratification variables:\n")for (var in strata_vars) {if (var %in%names(complete_data)) {cat(var, ":\n")print(table(complete_data[[var]])) } }# Function to perform stratified sampling stratified_sample <-function(data) {# First, sample patients within each stratum# Create a temporary data frame with unique patients and their strata patient_data <- data %>%group_by(patient_mrn) %>%slice(1) %>%ungroup()# Create combined strata if all stratification variables existif (length(strata_vars) >0) {# Create a stratum column using interaction() instead of unite() patient_data <- patient_data %>%mutate(stratum =interaction(across(all_of(strata_vars)), sep ="_"))# Sample patients within each stratum patient_sample <- patient_data %>%group_by(stratum) %>%# Sample with replacement to maintain stratum sizeslice_sample(prop =1, replace =TRUE) %>%ungroup() %>%select(patient_mrn) } else {# If no stratification variables, just sample patients patient_sample <- patient_data %>%slice_sample(prop =1, replace =TRUE) %>%select(patient_mrn) }# Then get all observations for sampled patients# Use semi_join instead of right_join to avoid many-to-many relationship warnings bootstrap_sample <- data %>%semi_join(patient_sample, by ="patient_mrn")return(bootstrap_sample) }# Initialize storage for bootstrap results boot_results <-vector("list", n_boot)# Store the last bootstrap sample for diagnostics last_boot_sample <-NULL# Perform bootstrap iterationscat("\nStarting stratified bootstrap with", n_boot, "iterations...\n")for (i in1:n_boot) {tryCatch({# Generate stratified bootstrap sample boot_sample <-stratified_sample(complete_data)# Store the last sample for diagnosticsif (i == n_boot) { last_boot_sample <- boot_sample }# Refit model on bootstrap sample using the same formula as the original model boot_model <-update(final_model, formula = model_formula, data = boot_sample)# Extract results fixed_effects <-fixef(boot_model) rand_vars <-as.data.frame(VarCorr(boot_model)) rand_sd <- rand_vars$sdcor[1] # SD of random intercept residual_sd <- rand_vars$sdcor[2] # Residual SD# Calculate performance metrics rmse <-sqrt(mean(residuals(boot_model)^2)) rsq <-cor(fitted(boot_model), getME(boot_model, "y"))^2# Store results boot_results[[i]] <-c(fixed_effects, rand_sd = rand_sd, residual_sd = residual_sd,rmse = rmse, rsq = rsq)# Print progressif (i %%50==0) cat("Completed", i, "of", n_boot, "bootstrap iterations\n") }, error =function(e) {warning("Error in bootstrap iteration ", i, ": ", e$message)return(NULL) }) }# Remove NULL results (failed iterations) valid_results <- boot_results[!sapply(boot_results, is.null)]# Check if we have enough valid resultsif (length(valid_results) < n_boot *0.5) {warning("Less than 50% of bootstrap iterations were successful. Results may be unreliable.") }# Convert results to data frame boot_data <-do.call(rbind, valid_results) %>%as.data.frame()# Get the parameter names param_names <-colnames(boot_data) fixed_effect_names <-names(fixef(final_model))# Calculate confidence intervals and summaries for fixed effects fixed_effects_summary <-data.frame(term = fixed_effect_names,mean =colMeans(boot_data[, fixed_effect_names], na.rm =TRUE),median =apply(boot_data[, fixed_effect_names], 2, median, na.rm =TRUE),sd =apply(boot_data[, fixed_effect_names], 2, sd, na.rm =TRUE),conf.low =apply(boot_data[, fixed_effect_names], 2, quantile, probs =0.025, na.rm =TRUE),conf.high =apply(boot_data[, fixed_effect_names], 2, quantile, probs =0.975, na.rm =TRUE) )# Calculate confidence intervals and summaries for random effects random_effects_summary <-data.frame(component =c("Random Intercept SD", "Residual SD"),mean =c(mean(boot_data$rand_sd, na.rm =TRUE), mean(boot_data$residual_sd, na.rm =TRUE)),median =c(median(boot_data$rand_sd, na.rm =TRUE), median(boot_data$residual_sd, na.rm =TRUE)),sd =c(sd(boot_data$rand_sd, na.rm =TRUE), sd(boot_data$residual_sd, na.rm =TRUE)),conf.low =c(quantile(boot_data$rand_sd, 0.025, na.rm =TRUE),quantile(boot_data$residual_sd, 0.025, na.rm =TRUE) ),conf.high =c(quantile(boot_data$rand_sd, 0.975, na.rm =TRUE),quantile(boot_data$residual_sd, 0.975, na.rm =TRUE) ) )# Calculate confidence intervals and summaries for performance metrics metrics_summary <-data.frame(.metric =c("rmse", "rsq"),mean =c(mean(boot_data$rmse, na.rm =TRUE), mean(boot_data$rsq, na.rm =TRUE)),median =c(median(boot_data$rmse, na.rm =TRUE), median(boot_data$rsq, na.rm =TRUE)),sd =c(sd(boot_data$rmse, na.rm =TRUE), sd(boot_data$rsq, na.rm =TRUE)),conf.low =c(quantile(boot_data$rmse, 0.025, na.rm =TRUE),quantile(boot_data$rsq, 0.025, na.rm =TRUE) ),conf.high =c(quantile(boot_data$rmse, 0.975, na.rm =TRUE),quantile(boot_data$rsq, 0.975, na.rm =TRUE) ) )# Prepare data for plotting fixed_effects_data <- boot_data[, fixed_effect_names] %>%as.data.frame() %>%pivot_longer(cols =everything(), names_to ="term", values_to ="estimate") metrics_data <-data.frame(id =rep(1:nrow(boot_data), 2),.metric =c(rep("rmse", nrow(boot_data)), rep("rsq", nrow(boot_data))),.estimator ="standard",.estimate =c(boot_data$rmse, boot_data$rsq) )# Check strata proportions in original vs bootstrap sample if we have a valid last sampleif (length(strata_vars) >0&&!is.null(last_boot_sample)) {# Create a function to calculate strata proportions get_strata_props <-function(data) {# Create a stratum column using interaction() instead of unite() data <- data %>%mutate(stratum =interaction(across(all_of(strata_vars)), sep ="_"))# Count unique patients per stratum data %>%group_by(patient_mrn) %>%slice(1) %>%ungroup() %>%count(stratum) %>%mutate(prop = n /sum(n)) }# Calculate proportions orig_props <-get_strata_props(complete_data) boot_props <-get_strata_props(last_boot_sample)# Compare proportionscat("\nStrata proportions comparison (original vs. last bootstrap sample):\n") strata_comparison <-full_join( orig_props %>%rename(orig_n = n, orig_prop = prop), boot_props %>%rename(boot_n = n, boot_prop = prop),by ="stratum" ) %>%mutate(prop_diff = boot_prop -ifelse(is.na(orig_prop), 0, orig_prop),orig_prop =ifelse(is.na(orig_prop), 0, orig_prop),boot_prop =ifelse(is.na(boot_prop), 0, boot_prop) )print(strata_comparison) }# Return all summaries and raw bootstrap data for plottingreturn(list(coefficients = fixed_effects_summary,random_effects = random_effects_summary,metrics = metrics_summary,original_model = final_model,bootstrap_samples = n_boot,successful_bootstraps =length(valid_results),raw_coef_data = fixed_effects_data,raw_metrics = metrics_data ))}# Run the improved bootstrap for all three modelsn_boot_samples <-500# Can be reduced for testing# Load parallel package for faster bootstrappingif (!requireNamespace("parallel", quietly =TRUE)) {install.packages("parallel")}library(parallel)# Run bootstrap for ei_min modelcat("Performing bootstrap validation for ei_min model...\n")
Performing bootstrap validation for ei_min model...
Performing stratified bootstrap for outcome: pos_lorrca
Formula being used:
pos_lorrca ~ age + cuberoot_nearest_hbf + hb_advia + mchc_advia +
gender_Male + genotype_SS + TRANSFUSION_STATUS_TRUE. + age_group_Elderly +
(1 | patient_mrn)
<environment: 0x12c61bd88>
Found transformed age_group variable: age_group_Elderly
Found transformed TRANSFUSION_STATUS variable: TRANSFUSION_STATUS_TRUE.
Class distribution in stratification variables:
age_group_Elderly :
0 1
436 47
TRANSFUSION_STATUS_TRUE. :
0 1
344 139
Starting stratified bootstrap with 500 iterations...
Completed 50 of 500 bootstrap iterations
Completed 100 of 500 bootstrap iterations
Completed 150 of 500 bootstrap iterations
Completed 200 of 500 bootstrap iterations
Completed 250 of 500 bootstrap iterations
Completed 300 of 500 bootstrap iterations
Completed 350 of 500 bootstrap iterations
Completed 400 of 500 bootstrap iterations
Completed 450 of 500 bootstrap iterations
Completed 500 of 500 bootstrap iterations
Strata proportions comparison (original vs. last bootstrap sample):
# A tibble: 4 × 6
stratum orig_n orig_prop boot_n boot_prop prop_diff
<fct> <int> <dbl> <int> <dbl> <dbl>
1 0_0 177 0.625 117 0.646 0.0210
2 1_0 15 0.0530 8 0.0442 -0.00880
3 0_1 84 0.297 53 0.293 -0.00400
4 1_1 7 0.0247 3 0.0166 -0.00816
Code
cat("\nSuccess rate for PoS:", pos_boot$successful_bootstraps, "out of", pos_boot$bootstrap_samples, "\n")
Success rate for PoS: 500 out of 500
Create distribution plots for key variables in each model
Code
# Function to create distribution plots for a datasetplot_variable_distributions <-function(data, title) {# Select numeric variables excluding ID numeric_vars <- data %>%select(where(is.numeric), -matches("patient_mrn")) %>%names()# Create plots for each variable plots <-map(numeric_vars, function(var) { p <- data %>%ggplot(aes_string(x = var)) +geom_histogram(aes(y = ..density..), bins =30, fill ="steelblue", alpha =0.7) +geom_density(color ="red", linewidth =1) +labs(title = var) +theme_cowplot() +theme(plot.title =element_text(size =10))return(p) })# Arrange plots in a grid grid_title <-textGrob(title, gp =gpar(fontsize =14, fontface ="bold"))grid.arrange(grobs = plots, ncol =3, top = grid_title)}# Create distribution plots for each model's datacat("## Distribution Plots for Model Variables\n\n")
## Distribution Plots for Model Variables
Code
cat("### ei_min Model Variables\n")
### ei_min Model Variables
Code
plot_variable_distributions(ei_min_data, "Distribution of Variables in ei_min Model")
Code
cat("### ei_max Model Variables\n")
### ei_max Model Variables
Code
plot_variable_distributions(ei_max_data, "Distribution of Variables in ei_max Model")
Code
cat("### PoS Model Variables\n")
### PoS Model Variables
Code
plot_variable_distributions(pos_data, "Distribution of Variables in PoS Model")
Comparing Variable Distributions Before and After Preprocessing
To better understand how our preprocessing steps affect the data, we’ll visualize the distributions of model variables both before and after applying the recipe. This comparison helps illustrate the effects of normalization, transformations, etc. on the data distributions.
Code
# Function to compare distributions before and after preprocessingcompare_preprocessing_distributions <-function(data, recipe, model_name) {# Get raw data (before preprocessing) raw_data <- data# Get processed data (after preprocessing) prepped_recipe <-prep(recipe) processed_data <-bake(prepped_recipe, new_data =NULL)# Get the outcome variable name from the recipe# Instead of using formula() on an unprepped recipe, extract it from the recipe's vars outcome_var <- recipe$var_info %>%filter(role =="outcome") %>%pull(variable)# Select numeric variables from raw data, excluding ID raw_numeric_vars <- raw_data %>%select(where(is.numeric), -matches("patient_mrn")) %>%names()# Select numeric variables from processed data, excluding ID processed_numeric_vars <- processed_data %>%select(where(is.numeric), -matches("patient_mrn")) %>%names()# Create a list to store plots all_plots <-list()# Create plots for raw variablesfor (var in raw_numeric_vars) { p_raw <- raw_data %>%ggplot(aes_string(x = var)) +geom_histogram(aes(y = ..density..), bins =30, fill ="steelblue", alpha =0.7) +geom_density(color ="red", linewidth =1) +labs(title =paste("Raw:", var)) +theme_cowplot() +theme(plot.title =element_text(size =10)) all_plots[[paste0("raw_", var)]] <- p_raw }# Create plots for processed variablesfor (var in processed_numeric_vars) {# Skip variables that are just ID columns or other non-relevant variablesif (grepl("patient_mrn|id|ID", var)) next p_processed <- processed_data %>%ggplot(aes_string(x = var)) +geom_histogram(aes(y = ..density..), bins =30, fill ="lightgreen", alpha =0.7) +geom_density(color ="darkgreen", linewidth =1) +labs(title =paste("Processed:", var)) +theme_cowplot() +theme(plot.title =element_text(size =10)) all_plots[[paste0("processed_", var)]] <- p_processed }# Create a grid of plots grid_title <-textGrob(paste(model_name, "Variables: Before vs. After Preprocessing"), gp =gpar(fontsize =14, fontface ="bold"))# Arrange plots in a grid, with raw and processed side by side where possible plot_pairs <-list()# First, handle the outcome variable which has a special transformationif (length(outcome_var) >0) {# Find the processed version of the outcome (might have a different name due to transformation) processed_outcome <- processed_numeric_vars[grepl(outcome_var, processed_numeric_vars)][1]if (!is.na(processed_outcome) &&paste0("raw_", outcome_var) %in%names(all_plots)) { plot_pairs[[1]] <- all_plots[[paste0("raw_", outcome_var)]] plot_pairs[[2]] <- all_plots[[paste0("processed_", processed_outcome)]] } }# Then handle the predictorsfor (var insetdiff(raw_numeric_vars, outcome_var)) {# Find matching processed variable (might have suffix from dummy coding) matching_processed <- processed_numeric_vars[grepl(paste0("^", var, "$|^", var, "_"), processed_numeric_vars)]if (length(matching_processed) >0) {for (proc_var in matching_processed) {if (paste0("processed_", proc_var) %in%names(all_plots)) { plot_pairs[[length(plot_pairs) +1]] <- all_plots[[paste0("raw_", var)]] plot_pairs[[length(plot_pairs) +1]] <- all_plots[[paste0("processed_", proc_var)]] } } } }# Arrange in a grid with 2 columns (raw and processed side by side)grid.arrange(grobs = plot_pairs, ncol =2, top = grid_title)# Create a summary table of preprocessing effectscat("\n### Summary of Preprocessing Effects for", model_name, "Model\n\n")# Summarize raw data raw_summary <- raw_data %>%select(where(is.numeric), -matches("patient_mrn")) %>%pivot_longer(everything(), names_to ="Variable", values_to ="Value") %>%group_by(Variable) %>%summarise(Mean_Raw =mean(Value, na.rm =TRUE),SD_Raw =sd(Value, na.rm =TRUE),Min_Raw =min(Value, na.rm =TRUE),Max_Raw =max(Value, na.rm =TRUE),Skewness_Raw = moments::skewness(Value, na.rm =TRUE) )# Summarize processed data processed_summary <- processed_data %>%select(where(is.numeric), -matches("patient_mrn")) %>%pivot_longer(everything(), names_to ="Variable", values_to ="Value") %>%group_by(Variable) %>%summarise(Mean_Processed =mean(Value, na.rm =TRUE),SD_Processed =sd(Value, na.rm =TRUE),Min_Processed =min(Value, na.rm =TRUE),Max_Processed =max(Value, na.rm =TRUE),Skewness_Processed = moments::skewness(Value, na.rm =TRUE) )# Display the summary table knitr::kable(raw_summary, caption ="Raw Data Summary") %>%kable_styling(full_width =FALSE) knitr::kable(processed_summary, caption ="Processed Data Summary") %>%kable_styling(full_width =FALSE)}# Add the moments package for skewness calculationif (!requireNamespace("moments", quietly =TRUE)) {install.packages("moments")}library(moments)# Compare distributions for each modelcat("## Comparing Distributions Before and After Preprocessing\n\n")
## Comparing Distributions Before and After Preprocessing
# Add a table comparing the number of significant predictorscat("\n## Consistency Analysis\n\n")
## Consistency Analysis
Code
# Function to create a consistency summary tablecreate_consistency_table <-function(boot_data, model_selection, model_name) {# Get bootstrap coefficient summary boot_coef_data <- boot_data$coefficients %>%filter(term !="(Intercept)") %>%select(term, boot_mean = mean, boot_conf.low = conf.low, boot_conf.high = conf.high)# Get original model coefficients orig_coef_data <- broom.mixed::tidy(model_selection$final_model, conf.int =TRUE) %>%filter(term !="(Intercept)") %>%select(term, orig_estimate = estimate, orig_conf.low = conf.low, orig_conf.high = conf.high)# Combine and analyze combined_data <-full_join(boot_coef_data, orig_coef_data, by ="term") %>%mutate(# Flag significant coefficientsorig_significant = (orig_conf.low >0) | (orig_conf.high <0),boot_significant = (boot_conf.low >0) | (boot_conf.high <0),# Create consistency categoriesconsistency =case_when( (orig_significant & boot_significant &sign(orig_estimate) ==sign(boot_mean)) ~"Consistent & Significant", (!orig_significant &!boot_significant) ~"Consistently Non-significant", (orig_significant &!boot_significant) ~"Original Only", (!orig_significant & boot_significant) ~"Bootstrap Only", (orig_significant & boot_significant &sign(orig_estimate) !=sign(boot_mean)) ~"Opposite Direction" ) )# Count by category summary_counts <- combined_data %>%count(consistency) %>%mutate(model = model_name)return(summary_counts)}# Create consistency tables for all modelsconsistency_summary <-bind_rows(create_consistency_table(ei_min_boot, ei_min_selection, "ei_min"),create_consistency_table(ei_max_boot, ei_max_selection, "ei_max"),create_consistency_table(pos_boot, pos_selection, "PoS")) %>%pivot_wider(names_from = model, values_from = n, values_fill =0)# Display the tableknitr::kable(consistency_summary, caption ="Consistency Between Original and Bootstrap Results") %>%kable_styling(full_width =FALSE)
Consistency Between Original and Bootstrap Results
consistency
ei_min
ei_max
PoS
Consistent & Significant
6
5
6
NA
2
2
2
Bootstrap Only
0
0
2
Methods
Statistical Approach
Data Preprocessing
Our analysis began with careful preprocessing of the SCD dataset. We removed extreme outliers for maximum elongation index (ei_max > 30) and created derived variables including transfusion status, steady-state status, and age group classifications. We applied appropriate transformations to improve the normality of our outcome variables: square root transformation for minimum elongation index (ei_min), Box-Cox transformation for maximum elongation index (ei_max), and Box-Cox transformation for point of sickling (PoS).
The preprocessing pipeline, implemented using the tidymodels framework, included:
Feature engineering: Creating interaction terms between weight and hydroxyurea status
Normalization: Centering and scaling numeric predictors
Near-zero variance filtering: Removing predictors with minimal variation
Dummy coding: Converting categorical variables to numeric indicators
Mixed-Effects Modeling
We employed linear mixed-effects models to account for the hierarchical structure of our data, with multiple measurements potentially coming from the same patients. Each model included a random intercept for patient ID (patient_mrn) to account for within-patient correlation.
The initial full models included the following fixed effects: - Age (continuous) - Gender - Genotype (SS, SC, or other) - Steady-state status (TRUE/FALSE) - Transfusion status (TRUE/FALSE) - Cube-root transformed HbF levels - Age group (Elderly/Non-elderly) - Hemoglobin level (hb_advia) - Mean corpuscular hemoglobin concentration (mchc_advia) - Weight - Hydroxyurea status (TRUE/FALSE) - Weight × Hydroxyurea interaction
Model Selection
We implemented a custom backward elimination procedure for model selection, which systematically evaluated the contribution of each predictor to model fit. The procedure:
Started with the full model containing all predictors
Iteratively tested the removal of each predictor
Removed the predictor with the highest p-value (> 0.05) that also improved model AIC
Continued until no further terms could be removed without significantly worsening model fit
This approach allowed us to identify the most parsimonious models that adequately explained the variation in our outcome variables. The final models were refit using Restricted Maximum Likelihood (REML) estimation to obtain unbiased estimates of variance components. However, more computationally intensive methods such as elastic net regularization or stepwise AIC selection followed by k-fold cross-validation may be more appropriate and less prone to bias from p-value-based selection.
Model Validation
We validated our final models using stratified bootstrap resampling (n = 500) to assess their predictive performance and stability. Our stratified approach specifically addressed class imbalance in key variables:
Stratification by imbalanced classes: We stratified the bootstrap samples by age group (Elderly/Non-elderly) and transfusion status (TRUE/FALSE), which showed substantial class imbalance in our dataset.
Hierarchical sampling: To preserve the hierarchical structure of our data:
We first sampled patients within each stratum
Then included all observations for the selected patients
This approach maintained the within-patient correlation structure
Performance assessment: For each bootstrap sample, we:
Refit the optimized model formula derived from the stepwise selection
Calculated performance metrics (RMSE and R-squared)
Extracted fixed effects estimates and confidence intervals
Consistency checking: We compared the proportions of patients in each stratum between the original data and bootstrap samples to verify that our stratification approach maintained the representation of minority groups.
This stratified approach provided more reliable confidence intervals than traditional bootstrapping, particularly for predictors associated with imbalanced groups (e.g., elderly patients or those with recent transfusions).
Diagnostics
We performed comprehensive diagnostics to ensure the validity of our models:
Residual analysis: Examining residual plots for homoscedasticity and patterns
Normality assessment: Q-Q plots to verify normality of residuals
Influence diagnostics: Using DHARMa simulated residuals for mixed models
These diagnostics confirmed that our models mostly met the assumptions of linear mixed-effects modeling and provided reliable estimates of the relationships between predictors and outcomes. However, for EI-max, a non-linear relationship may be suggested by the sigmoidal curve observed in the residual distribution. This likely reflects the similar EI-max values often seen across subjects regardless of other factors, such as the comparable EI-max values between AA and SS genotypes. Alternative approaches such as beta regression or Gamma models might better characterize EI-max. Nevertheless, given the relatively low variance in this outcome compared to others, the linear model provides a reasonable approximation.
Conclusions
EI-Min
Our models revealed that age group had a significant positive effect on ei_min, with elderly status being more predictive than age as a continuous variable. Notably, HbF levels were not retained in the optimal EI-min model. Transfusion status, weight, and hemoglobin levels were all associated with increased EI-min values, while male gender and higher MCHC values were associated with lower ei_min values.
EI-Max
Neither age group nor continuous age demonstrated significant effects on EI-max. Transfusion status, hemoglobin levels, and HbF levels were associated with higher EI-max values, while male gender and higher MCHC values were associated with lower EI-max values.
POS
Age group showed a borderline negative effect on POS, suggesting that older patients require more deoxygenation to trigger sickling. Continuous age showed a similar marginal negative effect. Transfusion status, HbF levels, and hemoglobin levels were associated with lower POS values, while male gender and higher MCHC values were associated with higher POS values. Notably, this was the only optimized model in which genotype emerged as a crucial predictor, with a comparatively large effect size.
Potential Limitations and Improvements
Statistical Limitations
Feature Selection Bias: Our backward elimination approach may be susceptible to p-value-based selection bias. Implementing elastic net regularization could provide more stable feature selection without dependency on p-values.
Model Selection Validation: The current approach lacks cross-validation to assess model stability. Implementing stepAIC with k-fold cross-validation would provide a more robust model selection process while maintaining the interpretability of traditional statistical approaches.
Model Specification: The linear mixed-effects framework may not fully capture non-linear relationships, particularly for EI-max. Generalized additive mixed models (GAMMs) or beta regression might better characterize these relationships.
Interaction Exploration: Our models included limited interaction terms. A more systematic exploration of interactions, especially between genotype and treatment variables, could reveal important effect modifications. I believe this imporvement may be the most called for.
Temporal Dynamics: The current models treat repeated measurements as exchangeable within patients. More sophisticated longitudinal modeling approaches could better capture within-patient changes over time, including seasonal effects.
Methodological Improvements
Bayesian Framework: Adopting Bayesian mixed models would allow incorporation of prior knowledge and better quantification of uncertainty, especially for parameters with limited data. – This would be quite difficult to impliment, however.
Causal Inference: Our models establish associations but not causality. Methods such as propensity score matching could better estimate treatment effects of transfusion and hydroxyurea. – Likely not needed, but could be done if we need more to talk about.
Software
All analyses were conducted in R (version 4.2.0) using RStudio. We employed the following key packages: lme4 and lmerTest for mixed-effects modeling, tidymodels for data preprocessing, recipes for feature engineering, broom.mixed for model summaries, performance for model diagnostics, and ggplot2 for visualization. Bootstrap validation was implemented using parallel processing with the parallel and foreach packages to improve computational efficiency. All code and data preprocessing steps were documented in R Markdown for reproducibility.