Document Overview

This analysis develops and validates a clinical prediction model for urodynamic procedure cancellation due to urinary tract infection (UTI). The document follows a structured workflow aligned with the TRIPOD (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) guidelines.

Analysis Roadmap

Part Section Description
1 Setup & Configuration Package loading, seeds, parameters
2 Data Preparation Loading, cleaning, TRIPOD flow
3 Exploratory Analysis Descriptive statistics, Table 1
4 Feature Selection Least Absolute Shrinkage and Selection Operator (LASSO) variable selection
5 Model Development Logistic regression (linear terms; RCS disabled for EPV adequacy)
6 Internal Validation Bootstrap, full-process validation
7 Calibration & Shrinkage Optimism correction, coefficient adjustment
8 Clinical Utility Nomogram, decision curve analysis
9 Results Tables, figures, interpretation
10 Technical Appendix Methods detail, reproducibility

Key Findings Summary

Quick Reference: All statistics below are computed dynamically from data; see the Results section for current values.

  • Design: Single-center retrospective cohort (TRIPOD Type 1b)
  • Outcome: Urodynamic procedure cancellation due to UTI
  • Variable selection: LASSO with 10-fold cross-validation
  • Validation: Full-process bootstrap (accounts for selection-induced optimism)
  • Overfitting correction: Uniform shrinkage via calibration slope
  • Clinical tools: Nomogram + web-based Shiny calculator

Clinical Context

Understanding these patterns could help clinics implement targeted interventions such as additional reminders for high-risk patients, better education about the importance of urodynamic testing, or addressing specific concerns that commonly lead to cancellations, ultimately improving resource utilization and patient care in urogynecology practices.

PART 1: Setup & Configuration

This section loads all required packages, sets random seeds for reproducibility, and defines key parameters used throughout the analysis.

PART 2: Data Preparation

This section handles data loading, cleaning, and transformation. All patient counts are tracked for the TRIPOD flow diagram to ensure transparent reporting of cohort selection.

Data Loading

Summary of Data Preparation Code

The data prep pipeline performs these key steps (all tracked for TRIPOD reporting):

  1. Load Raw Data from REDCap Export
    • Reads the labeled CSV file directly: 230119Urodynamics_DATA_LABELS_2025-09-15_1126 w labels.csv
    • Loads PFDI-20 totals separately from pfdi_20_totals.csv
    • Initial record count captured for TRIPOD: 912
  2. Exclude Records with Missing Outcome
    • Records with NA in “Was the procedure cancelled?” are excluded
    • Excluded: 12 records
  3. Exclude Non-UTI Cancellations
    • Patient no-shows excluded: 49 records
    • Other cancellation reasons excluded: 10 records
    • These are not relevant to the UTI prediction model
  4. Convert Categorical Variables to Factors
    • All categorical predictors converted to factors with appropriate reference levels
    • Race: “Other” as reference
    • Tobacco: “Non-tobacco user” as reference
    • Menopause: “Pre-menopausal” as reference
  5. Remove Low-Information Columns
    • Zero-variance columns removed via caret::nearZeroVar
    • Columns with >50% missing values removed
  6. Handle Missing Values
    • Missing values in demographic/clinical variables are retained (complete case analysis)
    • NO cross-patient imputation (downfill would fabricate data)
    • Median imputation for PFDI-20 subscale scores only
      • LIMITATION: Single imputation understates uncertainty and assumes MCAR
      • If PFDI selected by LASSO, consider sensitivity analysis with multiple imputation
  7. Process Date Variable
    • Extract Year from procedure date
    • Fix Year 2034 → 2024 data entry error
  8. Join PFDI-20 Totals
    • POPDI-6, CRADI-8, UDI-6, and PFDI-20 total scores joined by Record ID
  9. Feature Selection (Later in Pipeline)
    • LASSO (Least Absolute Shrinkage and Selection Operator) regularization automatically selects the most predictive variables by shrinking less important coefficients to exactly zero, avoiding overfitting and the multiple testing problems of stepwise selection.
    • Complete case analysis performed before LASSO (drop_na())

Final output: Clean dataset with 29 variables and 841 observations for predicting urodynamics procedure cancellation due to UTI.

Data Cleaning

Are there Enough Events Per Predictor?

Interpretation of Events Per Predictor Analysis

The Events Per Predictor Variable (EPV) rule is a guideline in logistic regression that recommends having at least 10 outcome events for each predictor variable in the model to ensure stable coefficient estimates and reliable inference.

In our dataset:

  • Total observations: 841 patients scheduled for urodynamic testing
  • Events (procedure cancellations): 57 patients had procedures cancelled due to urinary tract infection (UTI)
  • Non-events (completed procedures): 784 patients completed their scheduled procedures
  • Number of candidate predictors: 21 variables
  • Minimum events required: 210 (based on 10 × 21 predictors)
  • Current EPV ratio: 2.7 events per predictor variable

Assessment: With only 57 events and 21 candidate predictors, the EPV ratio (2.7) is below the recommended minimum of 10. This limitation will be addressed through LASSO regularization, which automatically selects a parsimonious subset of predictors, thereby improving the effective EPV ratio in the final model.

The LASSO variable selection process (described in the Feature Selection section below) will identify the most predictive variables, reducing the number of predictors in the final model and substantially improving the EPV ratio. Note that the final model EPV is calculated using model degrees of freedom (which accounts for categorical variable coding), not the raw number of LASSO-selected features. The final model metrics are reported in the Results section.

TRIPOD Flow Diagram

The following flow diagram follows the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) guidelines for reporting patient flow in prediction model studies.

# Create TRIPOD-compliant flow diagram using ggplot2
library(ggplot2)

# Define box positions and sizes
box_width <- 3.5
box_height <- 0.85  # Slightly taller for better text fit

# Define a premium color palette
colors <- list(
  source = "#D6EAF8",      # Light Blue
  initial = "#EBDEF0",     # Light Purple
  excluded = "#F9E79F",    # Soft Yellow/Orange (Warning/Exclusion)
  imputed = "#D5F5E3",     # Soft Green (Imputation/Preservation)
  final = "#ABEBC6",       # Green (Success)
  outcome_yes = "#F5B7B1", # Light Red (Event)
  outcome_no = "#AED6F1"   # Light Blue (No Event)
)

# Create data for boxes using PROGRAMMATIC values from tripod_labels
boxes <- data.frame(
  id = c("source", "initial", "excl_missing", "excl2", "excl3", "after_excl", "imputed_pred", "final", "outcome_yes", "outcome_no"),
  x = c(5, 5, 9, 9, 9, 5, 9, 5, 3, 7.2), # Adjusted final classification x-positions
  y = c(10, 8.5, 8.5, 7, 5.5, 6.5, 6.5, 5, 3, 3),
  label = c(
    tripod_labels$source,
    tripod_labels$initial,
    tripod_labels$excl_missing,     # Missing outcomes
    tripod_labels$excl_noshow,
    tripod_labels$excl_other,
    tripod_labels$after_exclusions,
    tripod_labels$imputed,          # Imputation
    tripod_labels$final,
    tripod_labels$cancelled,        # Outcome Yes
    tripod_labels$completed         # Outcome No
  ),
  fill = c(colors$source, colors$initial, colors$excluded, colors$excluded, colors$excluded,
           colors$initial, colors$imputed, colors$final, colors$outcome_yes, colors$outcome_no)
)

# Define connections (arrows) programmatically
# Types: "main" (vertical flow), "exclusion" (horizontal out), "imputation" (horizontal side)
connections <- data.frame(
  x = c(5, 5, 5, 5, 5, 5, 5, 6.76, 6.76, 6.76, 6.76),
  y = c(9.57, 8.07, 6.07, 4.57, 4.57, 8.5, 7, 8.5, 7, 5.5, 6.5),
  xend = c(5, 5, 5, 3, 7.2, 5, 5, 7.24, 7.24, 7.24, 7.24),
  yend = c(8.93, 6.93, 5.43, 3.43, 3.43, 8.1, 6.1, 8.5, 7, 5.5, 6.5),
  type = c("main", "main", "main", "branch", "branch", "segment", "segment", "exclusion", "exclusion", "exclusion", "imputation")
)
# Note: segments (main vertical parts) and exclusions are fine-tuned based on box geometry

# Refined connection segments for precision
segments <- data.frame(
  x = c(5, 5, 5, 5, 5, 6.76, 6.76, 6.76, 6.76),
  y = c(9.57, 8.07, 6.07, 4.57, 4.57, 8.5, 7, 5.5, 6.5),
  xend = c(5, 5, 5, 3, 7.2, 7.24, 7.24, 7.24, 7.24),
  yend = c(8.93, 6.93, 5.43, 3.43, 3.43, 8.5, 7, 5.5, 6.5),
  linetype = c("solid", "solid", "solid", "solid", "solid", "dashed", "dashed", "dashed", "dotted"),
  size = c(1, 1, 1, 1, 1, 0.8, 0.8, 0.8, 0.8)
)

# Create the plot
tripod_plot <- ggplot() +
  # Draw boxes
  geom_rect(data = boxes,
            aes(xmin = x - box_width/2, xmax = x + box_width/2,
                ymin = y - box_height/2, ymax = y + box_height/2,
                fill = fill),
            color = "#2C3E50", linewidth = 0.8) +
  # Add labels
  geom_text(data = boxes,
            aes(x = x, y = y, label = label),
            size = 3.2, fontface = "bold", lineheight = 1.0, color = "#2C3E50") +
  # Draw arrows
  geom_segment(data = segments,
               aes(x = x, y = y, xend = xend, yend = yend, linetype = linetype, linewidth = size),
               arrow = arrow(length = unit(0.25, "cm"), type = "closed"),
               color = "#2C3E50") +
  # Use identity for fill/linetype/size
  scale_fill_identity() +
  scale_linetype_identity() +
  scale_linewidth_identity() +
  # Theme and labels
  labs(title = "TRIPOD Flow Diagram",
       subtitle = "Patient Selection and Model Development Cohort") +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 18, hjust = 0.5, color = "#1A5276"),
    plot.subtitle = element_text(size = 13, hjust = 0.5, color = "#5D6D7E", margin = margin(b = 20)),
    plot.background = element_rect(fill = "white", color = NA),
    plot.margin = margin(30, 30, 30, 30)
  ) +
  coord_cartesian(xlim = c(0.5, 12), ylim = c(2, 10.5))

print(tripod_plot)
Figure 1. TRIPOD Flow Diagram

Figure 1. TRIPOD Flow Diagram

Flow Diagram Summary:

  • Data Source: REDCap urodynamics database (study ID: 230119)
  • Initial records: 912 patients in database
  • Exclusions:
    • Missing outcome status: 12 patients (excluded, not imputed)
    • No-show (did not attend): 49 patients
    • Cancellation for other reasons (not UTI): 10 patients
    • Total excluded: 71 patients
  • Final analysis cohort: 841 patients
    • Completed procedures: 784 (93.2%)
    • Cancelled due to UTI: 57 (6.8%)
  • Missing predictor handling: Multivariate Imputation by Chained Equations (MICE) using predictive mean matching (PMM) for continuous variables, logistic regression for binary factors, and polytomous regression for multi-level factors; ensures maximum preservation of cohort size (N=841) as recommended by TRIPOD.

PART 3: Exploratory Analysis

This section provides descriptive statistics and visualizations to understand the study population. Table 1 summarizes patient demographics stratified by procedure cancellation status.

Comprehensive Data Description

The Hmisc::describe() function provides a comprehensive summary of the dataset including distributions, missing values, unique values, and quantiles for each variable.

Interpretation of Hmisc::describe() Output:

  • n: Number of non-missing observations
  • missing: Count and percentage of missing values
  • distinct: Number of unique values
  • Info: Information content (1 = maximum, lower = more ties)
  • Mean/Gmd: Mean and Gini mean difference (robust dispersion measure)
  • Quantiles: Distribution percentiles (.05, .10, .25, .50, .75, .90, .95)
  • lowest/highest: Extreme values to check for outliers/errors

Correlation Matrix of Continuous Predictors

Table 1: Patient Characteristics

Looking at the demographics table comparing patients whose procedures were cancelled versus completed, there are several notable findings:

Statistically Significant Differences

  1. Age (p = < 0.001):
    • Patients with cancelled procedures had higher age (median 76 vs 64 years)
    • IQR for cancelled: 66-79 vs completed: 52-73
    • This suggests older age is associated with procedure cancellation
  2. History of recurrent UTIs (p = 0.002):
    • Higher proportion in the cancelled group (21.1% vs 8%)
    • This indicates recurrent UTIs may increase risk of procedure cancellation
  3. POP-Q stage (p = 0.019):
    • Different distribution of prolapse stages between groups
    • Stage 0 (no prolapse): 58.5% (cancelled) vs 34.7% (completed)
    • Stage 2 (moderate): 9.8% (cancelled) vs 29.7% (completed)
    • Stage 3-4 (advanced): 22% (cancelled) vs 24.1% (completed)
    • This suggests pelvic organ prolapse stage distribution differs between groups
  4. Menopause status (p = 0.015):
    • Higher proportion in the cancelled group (91.2% vs 74%)
    • This indicates post-menopausal status is associated with procedure cancellation

Variables Not Reaching Statistical Significance (P = .05 to .10)

  1. BMI (p = 0.090):
    • Patients with cancelled procedures had higher bmi (median 31 vs 28.5 kg/m²)
    • IQR for cancelled: 25.6-35.2 vs completed: 24.8-33.5
    • This suggests higher body mass index may be associated with procedure cancellation
  2. Diabetes (p = 0.053):
    • Higher proportion in the cancelled group (24.6% vs 14%)
    • This indicates diabetes may be a risk factor for procedure cancellation

Why are variables with P = .05 to .10 retained in the analysis?

These “trending” variables are kept for several important reasons:

  1. LASSO handles variable selection: Unlike traditional stepwise regression that uses p-values for variable selection, LASSO uses cross-validation to select variables based on their contribution to predictive accuracy. A variable that does not reach statistical significance in univariate analysis may still improve model prediction when combined with other variables.

  2. P-values are not selection criteria for LASSO: The p-values in Table 1 describe univariate associations (each variable tested alone). LASSO considers multivariate relationships and can identify variables that are predictive in combination, even if individually they show weaker associations.

  3. Clinical plausibility: Body mass index and recurrent urinary tract infections have biological rationale for affecting urinary tract infection risk and procedure cancellation. Excluding clinically relevant variables prematurely could result in an underspecified model.

  4. Small sample size considerations: With only 57 events (cancellations), statistical power is limited. P-values near but above 0.05 may suggest meaningful effects that did not reach significance due to sample size constraints rather than true absence of association.

  5. Let the data decide: By including these variables as candidates, we allow the LASSO algorithm to make an objective, data-driven decision about their inclusion based on cross-validated prediction error—not arbitrary p-value cutoffs.

No Significant Differences

The following factors showed no significant association with procedure cancellation (P ≥ 0.10): - Hispanic/Latino - Race - Immunocompromised status - Tobacco use - Vaginal estrogen use - OAB status - Nighttime voiding frequency - Year - PFDI-20 total - POPDI-6 - CRADI-8

Clinical Interpretation

The data suggests that Age, Recurrent UTIs, POP-Q stage, Menopause status are statistically significant predictors of procedure cancellation (P < 0.05). Additionally, BMI, Diabetes, UDI-6 showed differences between groups but did not reach the prespecified significance threshold (0.05 ≤ P < 0.10); however, these variables may warrant clinical consideration as potential risk factors.

The sample size for cancellations is relatively small (N=57 vs N=784), which may limit statistical power to detect some associations. The significant age difference suggests pre-procedure assessment might need to be more comprehensive for elderly patients to reduce cancellation rates.

Table 1. Demographic and Clinical Characteristics Stratified by Procedure Cancellation Status

Demographics Stratified by Procedure Cancellation Status
Completed (N=784) Cancelled (N=57) Total (N=841) p value
Age (years) < 0.01 (1)
- Median 64.0 76.0 65.0
- Q1, Q3 52.0, 73.0 66.0, 79.0 53.0, 73.0
Body Mass Index (kg/m²) 0.09 (1)
- Median 28.5 31.0 28.6
- Q1, Q3 24.8, 33.5 25.6, 35.2 24.9, 33.6
- N Missing 11 0 11
Race: 0.66 (2)
- American Indian or Alaska Native 1 (0.1%) 0 (0.0%) 1 (0.1%)
- Asian 12 (1.5%) 0 (0.0%) 12 (1.4%)
- Black or African American 36 (4.6%) 1 (1.8%) 37 (4.4%)
- White 625 (80.3%) 46 (80.7%) 671 (80.4%)
- Other 99 (12.7%) 9 (15.8%) 108 (12.9%)
- Native Hawaiian or Other Pacific Islander 5 (0.6%) 1 (1.8%) 6 (0.7%)
- N Missing 6 0 6
Is the patient hispanic, latino or of Spanish origin? 0.32 (2)
- No 686 (88.6%) 48 (84.2%) 734 (88.3%)
- Yes 88 (11.4%) 9 (15.8%) 97 (11.7%)
- N Missing 10 0 10
Does the patient have diabetes? 0.03 (2)
- Yes 110 (14.2%) 14 (24.6%) 124 (14.9%)
- No 666 (85.8%) 43 (75.4%) 709 (85.1%)
- N Missing 8 0 8
History of Recurrent Urinary Tract Infections < 0.01 (2)
- No 711 (91.9%) 45 (78.9%) 756 (91.0%)
- Yes 63 (8.1%) 12 (21.1%) 75 (9.0%)
- N Missing 10 0 10
Immunocompromised: 0.70 (2)
- Yes 35 (4.6%) 2 (3.5%) 37 (4.5%)
- No 725 (95.4%) 55 (96.5%) 780 (95.5%)
- N Missing 24 0 24
Tobacco use: 0.17 (2)
- Non-tobacco user 494 (63.8%) 33 (57.9%) 527 (63.4%)
- Current tobacco user 45 (5.8%) 1 (1.8%) 46 (5.5%)
- Former tobacco user 235 (30.4%) 23 (40.4%) 258 (31.0%)
- N Missing 10 0 10
Menopause status: 0.02 (2)
- Pre-menopausal 167 (21.3%) 4 (7.0%) 171 (20.4%)
- Post-menopausal 580 (74.1%) 52 (91.2%) 632 (75.2%)
- Unclear 36 (4.6%) 1 (1.8%) 37 (4.4%)
- N Missing 1 0 1
Is the patient on vaginal estrogen? 0.23 (2)
- Yes 117 (15.1%) 12 (21.1%) 129 (15.5%)
- No 660 (84.9%) 45 (78.9%) 705 (84.5%)
- N Missing 7 0 7
Does the patient have Overactive Bladder? 0.08 (2)
- Yes 389 (50.3%) 35 (62.5%) 424 (51.1%)
- No 385 (49.7%) 21 (37.5%) 406 (48.9%)
- N Missing 10 1 11
Average number of voids at night: 0.33 (1)
- Median 2.0 2.0 2.0
- Q1, Q3 1.0, 3.0 1.0, 3.0 1.0, 3.0
- N Missing 68 5 73
Pelvic Organ Prolapse Quantification Stage 0.02 (1)
- Median 2.0 0.0 2.0
- Q1, Q3 0.0, 2.0 0.0, 2.0 0.0, 2.0
- N Missing 103 16 119
Pelvic Floor Distress Inventory-20 Total Score 0.39 (1)
- Median 100.0 100.0 100.0
- Q1, Q3 95.8, 100.2 100.0, 109.0 98.0, 101.0
Pelvic Organ Prolapse Distress Inventory-6 0.61 (1)
- Median 29.2 29.2 29.2
- Q1, Q3 25.0, 29.2 29.2, 33.3 25.0, 29.2
Colorectal-Anal Distress Inventory-8 0.74 (1)
- Median 18.8 18.8 18.8
- Q1, Q3 18.8, 21.9 18.8, 25.0 18.8, 21.9
Urinary Distress Inventory-6 0.17 (1)
- Median 50.0 50.0 50.0
- Q1, Q3 45.8, 50.0 50.0, 54.2 50.0, 50.0
Year 0.95 (2)
- 2022 288 (36.8%) 22 (38.6%) 310 (36.9%)
- 2023 332 (42.4%) 23 (40.4%) 355 (42.3%)
- 2024 163 (20.8%) 12 (21.1%) 175 (20.8%)
- N Missing 1 0 1
What was the reason for the urodynamics? 0.38 (2)
- Checked 279 (35.6%) 17 (29.8%) 296 (35.2%)
- Unchecked 505 (64.4%) 40 (70.2%) 545 (64.8%)
  1. Linear Model ANOVA
  2. Pearson’s Chi-squared test

Model Guide

This analysis builds and validates a clinical prediction model through a series of distinct modeling stages. Each stage produces a named model object. Understanding which model is active in each section prevents confusion when interpreting results.

Model Pipeline Overview

The pipeline flows through three main models:

  1. LASSO Selection (lasso_model, cv_lasso) — Uses L1-regularized logistic regression (glmnet) with 10-fold cross-validation to select the most predictive variables from all candidates. This is a variable selection tool, not the final clinical model. Predictors with non-zero coefficients at lambda.min are carried forward.

  2. Primary Model (model) — The main logistic regression fitted with rms::lrm() using the LASSO-selected predictors. Continuous predictors may receive restricted cubic splines (RCS) when events-per-variable permits; otherwise linear terms are used. This model undergoes bootstrap validation and is the basis for all performance metrics. It is intentionally not used for clinical deployment because apparent performance is optimistic.

  3. Deployed Model (deployed_model) — A copy of the Primary Model with coefficients multiplied by the bootstrap-derived shrinkage factor (calibration slope). This correction reduces overfitting-driven prediction extremes. The Deployed Model is what powers the Shiny calculator, nomograms, decision curves, and all clinical utility outputs. It is exported as model_object.rds.

Comparison Models (Sensitivity Analyses)

Several alternative models are fitted to test robustness but do not feed the main pipeline:

Model Variable Type Purpose
Linear comparison linear_comparison_model rms::lrm() Same predictors without RCS — tests whether nonlinear terms improve fit
Firth penalized firth_model logistf() Bias-reduced estimation for sparse/imbalanced data
Weighted logistic weighted_model glm() Inverse class-frequency weights to handle imbalance
Fractional polynomials fp_model mfp() Alternative nonlinear approach compared to RCS

datadist Management

The rms package requires a datadist object (variable ranges and distributions) to be set before model fitting. Because different analysis stages operate on different subsets of data, datadist is created and restored multiple times:

  • Initial creation (RMS Setup) — dd built from the full analysis dataset before Primary Model fitting.
  • After variable renaming (Nomogram Setup) — dd recreated from nomogram_df after columns are renamed to human-readable labels. Without this refresh, RCS knot lookups would fail on the new names.
  • After MICE imputationdd recreated because imputed values may shift predictor ranges and RCS knot positions.
  • Inside bootstrap loops — Each of the 1,000 bootstrap iterations creates a local dd_boot from the resampled data (RCS knots must match the resample), then restores dd afterward. Error handlers guarantee restoration even if a model fit fails.
  • Inside temporal validation — Each leave-one-year-out fold creates dd_dev from the training split, with on.exit() guaranteeing restoration of dd.

Key principle: datadist is always scoped to the data being modeled. Stale datadist from a prior stage would silently apply wrong knot positions and predictor ranges, producing incorrect predictions.

Stale Object Cleanup

Model objects from earlier pipeline stages (e.g., an intermediate model before shrinkage) are overwritten rather than explicitly deleted with rm(). This keeps the workspace lean without requiring manual garbage collection. The critical transitions are:

  • lasso_model / cv_lasso → used only to extract selected predictors, then not referenced again
  • model → copied into deployed_model with shrunk coefficients; model remains available for validation comparisons but deployed_model is used for all clinical outputs
  • dd_boot, dd_dev, dd_cv → temporary datadist objects scoped to loop iterations; main dd always restored at loop exit

PART 4: Feature Selection (LASSO Selection)

This section uses LASSO (Least Absolute Shrinkage and Selection Operator) for objective, data-driven variable selection. Unlike traditional p-value based selection, LASSO uses regularization to automatically identify the most predictive variables while guarding against overfitting.

Why LASSO instead of p-value selection?

LASSO may select variables that do not have p < 0.10 in univariate analysis, and may exclude variables that do. This occurs because:

  1. Multicollinearity handling: LASSO selects one variable from correlated groups (e.g., Menopause Status was excluded because Age is preferred for clinical interpretability; they correlate at r=0.77)
  2. Predictive vs. associative: A variable can be statistically significant but add little predictive information once other variables are in the model
  3. Regularization: LASSO penalizes model complexity, favoring parsimonious models that generalize better

The variables shown in the “LASSO Model Selected Features” table below are the final predictors used in the nomogram, regardless of their univariate p-values.

Variable Screening

Before LASSO feature selection, variables were screened for data quality issues. Variables were excluded if they had: (1) near-zero variance, (2) >50% missing data, or (3) correlation >0.90 with another predictor.

Feature Selection with LASSO

LASSO (Least Absolute Shrinkage and Selection Operator) is a regularization method that automatically selects the most important predictors by shrinking less important coefficients to exactly zero. We used 10-fold cross-validation to identify the optimal regularization strength (λ) that minimizes prediction error while producing a parsimonious model.

Cross-Validation Results

Cross-validation results from our urodynamics data showing optimal lambda selection

Cross-validation results from our urodynamics data showing optimal lambda selection

Interpretation of Our Cross-Validation Results:

  • The blue line shows how prediction error changes as we increase regularization (λ)
  • The shaded blue region represents uncertainty (±1 standard error)
  • The red dashed line marks λ.min (0.01073) - the value that minimizes error
  • The green dashed line marks λ.1se (0.047539) - the one-standard-error rule (too aggressive here, selects 0 predictors)
  • The red dashed line marks our selected λ.min (0.01073) - the value that minimizes error
if (nrow(selected_features_df) > 0) {
  kable(selected_features_df,
        caption = "LASSO Model Selected Features and Coefficients",
        col.names = c("Feature", "Coefficient"),
        align = c("l", "r"),
        booktabs = TRUE,
        linesep = "",
        format = "html") %>%
    kable_styling(
      bootstrap_options = c("striped", "hover", "condensed", "responsive"),
      full_width = TRUE,
      position = "center"
    ) %>%
    row_spec(0, bold = TRUE, color = "black", background = "#f2f2f2") %>%
    column_spec(1, width = "400px") %>%
    column_spec(2, width = "150px")
} else {
  cat("No features selected at lambda.1se. Consider using lambda.min or relaxing regularization.\n")
}
LASSO Model Selected Features and Coefficients
Feature Coefficient
Age. 0.0354350
Does.the.patient.have.a.h.o.recurrent.UTIs.Yes 0.5386471
POP.Q.stage. -0.0678101
urodynamics_reasonevaluation of oab 0.0126851
Age.:Does.the.patient.have.diabetes.Yes 0.0013505
Age.:BMI 0.0003960

LASSO Regularization Path

# Create the regularization path plot with clean feature labels
ggplot(lasso_filtered, aes(x = log10(Lambda), y = Coefficient, color = Feature_Clean)) +
  geom_line(linewidth = 1.2) +
  geom_vline(xintercept = log10(optimal_lambda), linetype = "dashed", color = "red", linewidth = 1.2) +
  labs(
    title = "LASSO Regularization Path",
    subtitle = "Red line indicates optimal lambda from cross-validation",
    x = expression(log[10](lambda)),
    y = "Coefficient Values",
    color = "Predictor"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "right",
    legend.text = element_text(size = 11),
    legend.title = element_text(face = "bold", size = 12),
    axis.title = element_text(face = "bold"),
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12, color = "gray40")
  ) +
  scale_color_viridis_d(option = "turbo")
Figure: LASSO Regularization Path showing how coefficient values change with lambda

Figure: LASSO Regularization Path showing how coefficient values change with lambda

Interactive LASSO Regularization Path Visualization

This interactive visualization allows you to hover over any line to see the feature name and coefficient value, click legend entries to show/hide specific features, and zoom/pan to explore specific regions.


Figure Interpretation: LASSO Regularization Path

This graph shows the LASSO regularization path for a logistic regression model, illustrating how coefficient values change as the regularization strength (lambda) varies.

Key Observations:

  1. Regularization Effect: As lambda increases (moving left to right on the x-axis), more coefficients are pushed toward zero. This demonstrates LASSO’s feature selection capability.

  2. Optimal Lambda: The vertical red line indicates the optimal lambda value (λ = 0.0107) determined through cross-validation. This represents the best balance between model complexity and predictive performance.

  3. Selected Predictors at Optimal Lambda: LASSO selected 6 feature(s) with non-zero coefficients:

LASSO-Selected Predictors at Optimal Lambda
Predictor Coefficient Effect Direction
Age 0.0354 ↑ Increases risk
Does the patient have a h o recurrent UTIs Yes 0.5386 ↑ Increases risk
POP Q stage -0.0678 ↓ Decreases risk
urodynamics reasonevaluation of oab 0.0127 ↑ Increases risk
Age :Does the patient have diabetes Yes 0.0014 ↑ Increases risk
Age :BMI 0.0004 ↑ Increases risk
  1. Variables Eliminated: All other features have coefficients shrunk to exactly zero at the optimal lambda, indicating they contribute little predictive value after accounting for the selected variable(s).

  2. Direction of Effects:

    • Features with positive coefficients (above zero) increase the probability of procedure cancellation
    • Features with negative coefficients (below zero) decrease the probability of procedure cancellation

Cross-Validation Results

PART 5: Model Development (Primary Model)

This section fits the logistic regression model using the LASSO-selected predictors. The choice of linear vs. restricted cubic spline (RCS) terms for continuous predictors is determined programmatically based on the events-per-variable (EPV) ratio — RCS is used when EPV permits (≥ 10 with spline df), otherwise linear terms are used to preserve model stability.

Continuous Predictor Modeling

When sample size permits (EPV ≥ 10 with spline degrees of freedom), restricted cubic splines allow non-linear relationships. When EPV is insufficient, linear terms are used to prevent overfitting — the model adapts automatically based on event count.

RMS Setup and Predictor Specification

Final Model Fitting

Following Harrell’s Regression Modeling Strategies (Chapter 2), continuous predictors are modeled using linear terms (insufficient degrees of freedom for restricted cubic splines). This was determined based on 41 events in 695 complete cases.

Harrell Regression Modeling Strategies (rms) Visualizations

These visualizations are from the initial (unshrunk) model fit. They show the raw predictor associations and relative importance before shrinkage adjustment. Because uniform shrinkage preserves rank order and relative effect sizes, the patterns shown here are identical to the deployed model. For final validated performance metrics, see the Bootstrap Validation and Results sections.

The following visualizations follow Frank Harrell’s Regression Modeling Strategies approach for presenting logistic regression model results. These provide complementary views of predictor importance and effects.

Analysis of Variance (ANOVA) Chi-Square Plot (Variable Importance)

This plot shows the chi-square contribution of each predictor to the model, providing an objective measure of variable importance. Larger chi-square values indicate stronger predictors.

# =============================================================================
# ANOVA VISUALIZATION - Following Harrell's RMS approach
# Reference: Harrell FE. Regression Modeling Strategies. 2nd ed. Chapter 10.
# "plot(anova(f))" is the standard way to visualize predictor importance
# =============================================================================

log_info("Generating ANOVA chi-square plot for variable importance...")

# Compute ANOVA for the model
model_anova <- anova(model)

# Clean up variable names for display
# Create a mapping to convert underscores and clean up any remaining technical names
anova_label_map <- c(
  # Clean names from renaming (underscores to spaces)
  "Recurrent_UTIs" = "Recurrent UTIs",
  "Overactive_Bladder" = "Overactive Bladder",
  "Detrusor_Overactivity" = "Detrusor Overactivity",
  "Tobacco_Use" = "Tobacco Use",
  "Stress_Incontinence" = "Stress Incontinence",
  "Pelvic_Prolapse" = "Pelvic Prolapse",
  "Voiding_Dysfunction" = "Voiding Dysfunction",
  "Prior_Pelvic_Surgery" = "Prior Pelvic Surgery",
  "POPQ_Stage" = "POP-Q Stage",
  # Keep simple names as-is
  "Age" = "Age",
  "BMI" = "BMI",
  "Nocturia" = "Nocturia",
  "Hispanic" = "Hispanic/Latino",
  "Diabetes" = "Diabetes",
  "Neurological" = "Neurological",
  "TOTAL" = "TOTAL"
)

# Rename rows in the ANOVA object for cleaner display
if (!is.null(rownames(model_anova))) {
  old_names <- rownames(model_anova)
  new_names <- sapply(old_names, function(x) {
    if (x %in% names(anova_label_map)) {
      return(anova_label_map[x])
    } else {
      # Default: clean up dots and underscores
      cleaned <- gsub("\\.", " ", x)
      cleaned <- gsub("_", " ", cleaned)
      cleaned <- trimws(cleaned)
      return(cleaned)
    }
  })
  rownames(model_anova) <- new_names
}

# Create the chi-square plot with cleaner labels
# (ANOVA table output suppressed - values shown in plot)
plot(model_anova,
     what = "chisqminusdf",
     main = "Predictor Importance: Chi-Square Minus d.f.",
     cex.main = 1.2,
     margin = c("chisq", "P"))
Figure: ANOVA Chi-Square Plot showing relative importance of each predictor

Figure: ANOVA Chi-Square Plot showing relative importance of each predictor

log_info("ANOVA plot generated successfully")

Interpretation: The plot ranks predictors by their chi-square statistic minus degrees of freedom. This “partial chi-square” approach accounts for the complexity of each term, providing a fair comparison between simple (1 d.f.) and complex (multi-d.f.) predictors.

Summary Odds Ratio Plot (Effect Sizes)

This forest plot shows the odds ratios with 95% confidence intervals for each predictor. For continuous variables, the odds ratio represents the effect of moving from the 25th to 75th percentile (interquartile range).

Interpretation: - Odds ratios > 1: Predictor is associated with increased cancellation risk - Odds ratios < 1: Predictor is associated with decreased cancellation risk - CI crossing 1: Effect not statistically significant at α = 0.05 - For continuous predictors (Age, body mass index): odds ratio represents effect of interquartile range increase

Partial Effects Plot (Predictor-Outcome Relationships)

This visualization shows the predicted log-odds of cancellation across the range of each predictor, holding other predictors at their reference values. It reveals both the shape of relationships (linear vs. nonlinear via restricted cubic splines) and effect magnitudes.

Interpretation: - Solid line: Predicted effect (log-odds scale) - Shaded region: 95% confidence interval - Rug/histogram: Distribution of predictor values in the data - Horizontal reference: Effect = 0 (no change from reference) - Nonlinear curves: Reflect restricted cubic spline (RCS) modeling

Clinical Interpretation of Partial Effects: These plots allow clinicians to understand: 1. Direction of effect: Whether increasing a predictor increases or decreases risk 2. Magnitude of effect: How much the probability changes across the predictor range 3. Shape of relationship: Linear trends vs. thresholds or U-shaped patterns 4. Uncertainty: Wider confidence bands indicate less precise estimates

Logistic Regression Diagnostics

Since we are using logistic regression with a binary outcome, traditional OLS diagnostics (normality of residuals, homoscedasticity) do not apply. Instead, we perform diagnostics specific to logistic regression including: linearity of the logit, multicollinearity, goodness-of-fit, and influential observation analysis.

Multicollinearity Assessment: Variance Inflation Factors (VIF)

Multicollinearity inflates standard errors and makes coefficient estimates unstable. We assess this using Variance Inflation Factors (VIF) and Generalized VIF (GVIF) for categorical predictors. VIF > 5 indicates moderate collinearity; VIF > 10 indicates severe problems.

Interpretation: VIF values below 5 indicate acceptable multicollinearity. For categorical variables with multiple levels, we report GVIF^(1/(2*Df)) which should be interpreted similarly (values < 2.24 correspond to VIF < 5).

Linearity of the Logit Assessment

Logistic regression assumes continuous predictors have a linear relationship with the log-odds of the outcome. We test this assumption using: 1. Component-plus-residual (partial residual) plots: Visual inspection of linearity 2. Comparison of linear vs. restricted cubic spline (RCS) models: Formal nonlinearity test

Interpretation: The red loess curve should closely follow the blue linear reference line if the linearity assumption holds. Substantial deviation indicates the relationship is nonlinear and justifies using restricted cubic splines (RCS).

Influential Observation Diagnostics

We identify observations with disproportionate influence on model coefficients using Cook’s Distance, DFBETAS (Difference in Betas, measuring how much each coefficient changes when an observation is removed), and residual analysis.

Cook’s Distance Analysis

Cook’s Distance measures the influence of each observation on the overall model fit. Observations with Cook’s D > 4/n are potentially influential.

Comprehensive Confusion Matrix with All Metrics

McFadden’s Pseudo-R² and Model Fit Statistics

Brier Score Plot

This calibration plot visualizes how well the predicted probabilities from your logistic regression model align with the actual observed outcomes. The x-axis (“Predicted Probability”) represents the probabilities that the model estimated, while the y-axis (“Observed Probability of Cancellation”) indicates how frequently cancellations occurred within each predicted probability bin. Each point (blue circle) corresponds to a group of observations with similar predicted probabilities, and the size of each circle reflects the number of observations in that group, as detailed in the legend. Vertical lines around each point represent the 95% confidence intervals, indicating the uncertainty of each observed proportion.

A dashed grey diagonal line shows perfect calibration (where predictions exactly match observations). Your model shows good overall calibration, as evidenced by points closely following this ideal diagonal line, particularly for predicted probabilities below approximately 0.3. However, at higher probabilities (around 0.3 to 0.4), there is more deviation from perfect calibration, suggesting predictions in this range might be less reliable due to fewer observations or greater variability. The Brier score (0.0602) supports this interpretation, indicating good predictive accuracy overall, with only minor discrepancies observed at higher predicted probabilities.

The Brier Score and Area Under the Receiver Operating Characteristic Curve (AUC) are both valuable metrics used to assess model performance, though they measure different aspects of predictive accuracy. The Brier Score evaluates a model’s overall accuracy by simultaneously assessing calibration—how closely the predicted probabilities match observed outcomes—and its discriminatory power, or the ability to distinguish between different outcome classes. It ranges from 0 to 1, where lower values indicate better performance. A score of 0 represents perfect prediction accuracy, whereas higher scores indicate poorer performance, with scores closer to 0.25 typically suggesting no better performance than random chance in a balanced binary classification. Conversely, the AUC specifically quantifies a model’s ability to discriminate between two outcome classes. It ranges from 0.5 to 1.0, with 0.5 indicating random guessing (no discrimination ability) and 1.0 indicating perfect discrimination. For instance, an AUC of 0.85 demonstrates strong ability to correctly distinguish between classes based on predicted probabilities, but it does not guarantee that these predicted probabilities accurately reflect the actual observed frequencies. Thus, while the Brier Score combines both calibration (the accuracy of predicted probabilities) and discrimination (the ability to distinguish between outcomes), the AUC exclusively measures discrimination.

AUC Plot

# =============================================================================
# ROC CURVE USING SHRUNK MODEL (consistency with deployed calculator)
# =============================================================================
# Bug Fix: Previously used unshrunk model$linear.predictors, producing an
# optimistic AUC that didn't match the shrinkage-adjusted coefficients used
# in the nomogram and calculator. Now uses deployed_model if available, which
# has the same coefficients as the deployed prediction tool.

roc_model <- if (exists("deployed_model") && !is.null(deployed_model)) deployed_model else model
roc_predictions <- plogis(roc_model$linear.predictors)
roc_outcomes <- as.numeric(roc_model$y) - 1  # Convert factor to 0/1

# Verify alignment
log_info(sprintf("ROC calculation: %d predictions, %d outcomes",
                 length(roc_predictions), length(roc_outcomes)))
log_info(sprintf("Events: %d (%.1f%%)", sum(roc_outcomes == 1),
                 100 * mean(roc_outcomes == 1)))

# Calculate ROC
roc_obj_main <- pROC::roc(roc_outcomes, roc_predictions, quiet = TRUE)
ci_auc_main <- pROC::ci.auc(roc_obj_main, method = "delong")

# Create publication-quality ROC plot
par(mar = c(5, 5, 4, 2), pty = "s")
plot(roc_obj_main,
     main = "ROC Curve",
     col = "#2E86AB",
     lwd = 3,
     legacy.axes = TRUE,
     print.auc = FALSE,
     xlab = "1 - Specificity (False Positive Rate)",
     ylab = "Sensitivity (True Positive Rate)",
     cex.lab = 1.2,
     cex.axis = 1.1)

abline(a = 0, b = 1, lty = 2, col = "#CCCCCC", lwd = 2)

# Add AUC annotation
text(0.6, 0.2,
     paste0("AUC = ", round(as.numeric(roc_obj_main$auc), 3),
            "\n(95% CI: ", round(ci_auc_main[1], 3), "-", round(ci_auc_main[3], 3), ")"),
     cex = 1.3, font = 2)

legend("bottomright",
       legend = c("Prediction Model", "Reference Line (AUC = 0.5)"),
       col = c("#2E86AB", "#CCCCCC"),
       lty = c(1, 2),
       lwd = c(3, 2),
       bty = "n",
       cex = 1.0)
Figure: Model Accuracy (ROC Curve). This curve shows the trade-off between correctly identifying cancellations (sensitivity, y-axis) and avoiding false alarms (1-specificity, x-axis). The blue shaded area represents model accuracy—larger areas indicate better performance. A perfect model would reach the upper-left corner. The diagonal dashed line represents random guessing (50% accuracy).

Figure: Model Accuracy (ROC Curve). This curve shows the trade-off between correctly identifying cancellations (sensitivity, y-axis) and avoiding false alarms (1-specificity, x-axis). The blue shaded area represents model accuracy—larger areas indicate better performance. A perfect model would reach the upper-left corner. The diagonal dashed line represents random guessing (50% accuracy).

# Log for verification
log_info(sprintf("ROC AUC = %.3f (should match C-statistic = %.3f)",
                 as.numeric(roc_obj_main$auc), model$stats["C"]))

The values in your ROC curve represent important metrics related to the performance of your predictive model:

Interpreting ROC Results (Note: Values calculated below in the optimal threshold section)

The ROC curve analysis helps evaluate model performance at different classification thresholds. Key concepts:

Sensitivity (True Positive Rate) - The proportion of actual cancellations correctly identified by the model - Higher sensitivity means fewer missed cancellations

Specificity (True Negative Rate) - The proportion of completed procedures correctly identified by the model - Higher specificity means fewer false alarms

What this means for your model: - The optimism-corrected C-statistic (reported in the Bootstrap Validation section) provides the most reliable estimate of the model’s discriminative ability on new patients - At the optimal threshold (calculated below), the model balances sensitivity and specificity - The specific sensitivity and specificity values at the optimal threshold are reported in the subsequent sections

Depending on your clinical goals, you might want to adjust this threshold: - If missing a cancellation is very costly, use a lower threshold to increase sensitivity - If falsely predicting cancellations is problematic, use a higher threshold to increase specificity

Confusion Matrix

Interpreting the Histogram of Predictions

The histogram above shows the distribution of predicted cancellation probabilities across the 812 patients with complete data used in model fitting. Key observations:

  1. Right-skewed distribution: Most patients have low predicted probabilities (concentrated below 0.2, the 95th percentile), which reflects the low baseline cancellation rate (7%) in these patients.

  2. Peak near the threshold: The distribution peaks near the optimal threshold (0.063), indicating most patients are predicted to have relatively low risk of cancellation.

  3. Long right tail: A smaller number of patients extend to higher probabilities (up to 0.39), representing high-risk individuals (typically older patients with history of recurrent UTIs).

  4. Red dashed line: The optimal classification threshold (0.063) is shown. Patients with predicted probabilities above this line are classified as “likely to cancel.”

Why is the threshold so low? The threshold of 0.063 (6.3%) may seem low, but it reflects the base rate of cancellation in the data. Because cancellations are rare, the optimal threshold that balances sensitivity and specificity is near the population prevalence. Using the standard 0.5 threshold would result in classifying virtually no patients as “at risk.”

Optimal Threshold Interpretation (Programmatic): The optimal classification threshold of 0.0632344 (6.3%) was determined by maximizing Youden’s J statistic (sensitivity + specificity - 1), which identifies the probability cutoff that best balances the ability to detect true cancellations (sensitivity) while minimizing false alarms (1 - specificity). This threshold is close to the base cancellation rate of 7%, which is expected when the outcome is rare—the optimal threshold gravitates toward the prevalence to avoid systematic over- or under-prediction.

Interpreting the Confusion Matrix

The confusion matrix shows how the model’s predictions compare to actual outcomes. At the optimal threshold of 0.063:

Predicted: Completed Predicted: Cancelled
Actually Completed True Negatives (TN) False Positives (FP)
Actually Cancelled False Negatives (FN) True Positives (TP)

Reading the results:

  • True Positives (green): Patients correctly predicted to have cancellations
  • True Negatives (green): Patients correctly predicted to complete their procedures
  • False Positives (red): Patients predicted to cancel but who actually completed (false alarms)
  • False Negatives (red): Patients predicted to complete but who actually cancelled (missed detections)

Trade-off at the chosen threshold:

The optimal threshold (0.063) balances sensitivity and specificity. However, clinical considerations may suggest adjusting this threshold:

  • If missing a cancellation is costly (wasted OR time, patient inconvenience): Use a lower threshold to increase sensitivity, accepting more false positives
  • If false alarms are problematic (unnecessary interventions, patient anxiety): Use a higher threshold to increase specificity, accepting more missed cancellations

Your classification model predicting whether a procedure would be cancelled achieved good overall accuracy. Given that cancellations represent a minority class (as indicated by your data, with 57 cancellations out of 841 observations), accuracy alone might not fully capture your model’s performance.

The sensitivity (recall) describes the model’s ability to correctly detect patients whose procedures actually were cancelled. A higher sensitivity means fewer missed cancellations. Your calculated sensitivity reflects how well the model identified those who truly experienced cancellations. Conversely, specificity represents the proportion of correctly identified non-cancelled procedures. A high specificity means your model avoids incorrectly labeling procedures as cancelled when they actually occurred as planned.

However, sensitivity and specificity alone don’t provide a full picture—particularly if cancellations are uncommon (as your data suggest, with cancellations being a minority class, around 6.8%). In these scenarios, positive predictive value (PPV) and negative predictive value (NPV) become essential.

  • A high PPV indicates that when the model predicts cancellation, it is likely correct. In practice, this would help prioritize interventions or preoperative counseling for patients most likely to cancel.
  • A high NPV would indicate reliability in identifying patients whose procedures likely will not be cancelled, allowing more confident scheduling without unnecessary resource allocation.

Given your results:

  • Discrimination and calibration are assessed via the optimism-corrected metrics in the Bootstrap Validation section, which account for overfitting and provide the most honest assessment of model performance on new patients.
  • Sensitivity, specificity, PPV, and NPV are reported in the Results section using the validated model.

Nomogram

Technical Details: Setting Variable Limits for Nomogram Generation

The code below configures the data distribution settings required by the rms package for nomogram generation. The datadist object defines the range of values for Age and BMI that will be displayed on the nomogram axes.

# Load required packages
library(rms)
library(logger)

# Set up logging
log_info("Starting procedure cancellation nomogram creation")

# Ensure dataset meets requirements
if (!exists("selected_labels_df")) {
    stop("Error: selected_labels_df not found")
}

# ============================================================================
# CLEAN VARIABLE NAMES AND FACTOR LEVELS FOR NOMOGRAM DISPLAY
# This ensures the nomogram has readable labels instead of ugly R variable names
# ============================================================================

log_info("Cleaning variable names and factor levels for nomogram display")

# Create a copy for nomogram display
nomogram_df <- selected_labels_df

# Strip haven 'labelled' class - rms::Design() cannot handle it
for (col in names(nomogram_df)) {
  if (inherits(nomogram_df[[col]], "haven_labelled") || inherits(nomogram_df[[col]], "labelled")) {
    log_info(sprintf("Stripping labelled class from: %s", col))
    nomogram_df[[col]] <- unclass(nomogram_df[[col]])
    attr(nomogram_df[[col]], "label") <- NULL
    attr(nomogram_df[[col]], "format.stata") <- NULL
    attr(nomogram_df[[col]], "labels") <- NULL
  }
}

# CRITICAL FIX: Convert empty strings to NA across all character columns
# Empty strings masquerade as valid values but should be treated as missing
# This must happen BEFORE any imputation or modeling
log_info("Converting empty strings to NA across all character columns...")
for (col in names(nomogram_df)) {
  if (is.character(nomogram_df[[col]])) {
    n_empty <- sum(nomogram_df[[col]] == "", na.rm = TRUE)
    if (n_empty > 0) {
      log_info(sprintf("  %s: converting %d empty strings to NA", col, n_empty))
      nomogram_df[[col]][nomogram_df[[col]] == ""] <- NA
    }
  }
}

# COMPREHENSIVE VARIABLE NAME MAPPING (old name -> new clean name)
# Using underscores instead of spaces to avoid formula issues
var_name_map <- c(
  "Age." = "Age",
  "BMI" = "BMI",
  "Is.the.patient.hispanic..latino.or.of.Spanish.origin." = "Hispanic",
  "Is.the.patient.Hispanic.or.Latino." = "Hispanic",
  "Race." = "Race",
  "Does.the.patient.have.a.h.o.recurrent.UTIs." = "Recurrent_UTIs",
  "Does.the.patient.have.recurrent.UTIs." = "Recurrent_UTIs",
  # POP-Q stage removed from model per user request
  "What.was.the.reason.for.the.urodynamics." = "UDS_Indication",
  "What.is.the.reason.for.the.urodynamics." = "UDS_Indication",
  "Does.the.patient.have.diabetes." = "Diabetes",
  "Does.the.patient.have.OAB." = "Overactive_Bladder",
  # Keep Overactive_Bladder name consistent throughout pipeline (not shortened to OAB)
  # Tests, Shiny app, and coerce_inputs_to_model_schema all use Overactive_Bladder
  "Is.the.patient.on.vaginal.estrogen." = "Vaginal_Estrogen",
  "Immunocompromised." = "Immunocompromised",
  "Tobacco.use." = "Tobacco_Use",
  "Menopause.status." = "Menopause",
  "Average.number.of.voids.at.night." = "Nocturia",
  "Year" = "Year",
  "PFDI.20.Score" = "PFDI_20",
  "UDI_6" = "UDI_6",
  "POPDI_6" = "POPDI_6",
  "CRADI_8" = "CRADI_8",
  "Did.the.patient.have.detrusor.overactivity." = "Detrusor_Overactivity",
  "Did.the.patient.have.DO." = "Detrusor_Overactivity_DO"
)

# Rename columns that exist in the data
for (old_name in names(var_name_map)) {
  if (old_name %in% names(nomogram_df)) {
    new_name <- var_name_map[old_name]
    names(nomogram_df)[names(nomogram_df) == old_name] <- new_name
    log_info(paste("Renamed:", old_name, "->", new_name))
  }
}

# FORCE SAFE NAMES (Replace spaces/special chars with underscores, keep case)
# This prevents issues with rms::datadist and formula construction while preserving
# capitalization expected by downstream chunks (e.g. Factor Cleaning)
log_info("Sanitizing all column names (removing spaces/dots)...")
safe_names <- names(nomogram_df)
safe_names <- gsub("\\s+", "_", safe_names)       # Spaces to underscores
safe_names <- gsub("[^a-zA-Z0-9_]", "_", safe_names) # Special chars to underscores
safe_names <- gsub("_+", "_", safe_names)         # Deduplicate underscores
safe_names <- gsub("_$", "", safe_names)          # Remove trailing underscores
names(nomogram_df) <- safe_names

# Define exclude columns (outcomes) - matching potential safe names
exclude_cols <- c("Cancelled", "Was_the_procedure_cancelled", "Was_the_procedure_cancelled_")

# Remove constant columns (zero variance) as they cause issues for datadist/lrm
log_info("Removing constant columns...")
constant_cols <- sapply(nomogram_df, function(x) length(unique(x[!is.na(x)])) <= 1)
if (any(constant_cols)) {
  removed_const <- names(nomogram_df)[constant_cols]
  log_info(paste("Removed constant columns:", paste(removed_const, collapse = ", ")))
  nomogram_df <- nomogram_df[, !constant_cols]
}

# Update available_cols from the sanitized dataframe
available_cols <- setdiff(names(nomogram_df), exclude_cols)

log_info(paste("Sanitized predictors:", paste(available_cols, collapse = ", ")))

# =============================================================================
# CLEAN FACTOR LEVELS - Force all binary variables to clean "No"/"Yes" factors
# Bug Fix (Argument Collision): clean_binary_factor() is now defined ONLY in
# tyler_utils.R (sourced at startup). The previous inline definition here
# risked shadowing or being shadowed by a different-signature version in utils,
# which could re-introduce the Hall of Shame OAB=2 bug where "2" was not
# mapped to "No". Single canonical definition prevents this.
# =============================================================================
stopifnot(exists("clean_binary_factor"))  # Verify it was loaded from tyler_utils.R

# Apply to all binary clinical variables
binary_vars <- c("Hispanic", "Recurrent_UTIs", "Diabetes", "Overactive_Bladder",
                  "Vaginal_Estrogen", "Immunocompromised", "Tobacco_Use")
for (bvar in binary_vars) {
  if (bvar %in% names(nomogram_df)) {
    nomogram_df[[bvar]] <- clean_binary_factor(nomogram_df[[bvar]], bvar)
  }
}

# Tobacco Use has special values: "Current"/"Former" → Yes, "Never" → No
# clean_binary_factor already handles this via the loop above, but if Tobacco
# has "Non-tobacco user" or similar levels, recode explicitly here
if ("Tobacco_Use" %in% names(nomogram_df)) {
  raw_tobacco <- as.character(nomogram_df$Tobacco_Use)
  tobacco_clean <- dplyr::case_when(
    raw_tobacco %in% c("Yes", "yes", "Y", "1", "TRUE", "Current", "Former",
                        "Current tobacco user", "Former tobacco user") ~ "Yes",
    raw_tobacco %in% c("No", "no", "N", "0", "2", "FALSE", "Never",
                        "Non-tobacco user", "Non tobacco user") ~ "No",
    is.na(raw_tobacco) ~ NA_character_,
    TRUE ~ raw_tobacco
  )
  nomogram_df$Tobacco_Use <- factor(tobacco_clean, levels = c("No", "Yes"))
}

# Menopause Status
if ("Menopause" %in% names(nomogram_df)) {
  nomogram_df$Menopause <- factor(nomogram_df$Menopause)
  old_levels <- levels(nomogram_df$Menopause)
  new_levels <- gsub("Post-menopausal|Postmenopausal|post", "Post", old_levels)
  new_levels <- gsub("Pre-menopausal|Premenopausal|pre", "Pre", new_levels)
  new_levels <- gsub("Peri-menopausal|Perimenopausal|peri", "Peri", new_levels)
  levels(nomogram_df$Menopause) <- new_levels
}

# POP-Q Stage removed from model per user request - keeping code commented for reference
# if ("POP_Q_Stage" %in% names(nomogram_df)) {
#   nomogram_df$POP_Q_Stage <- factor(nomogram_df$POP_Q_Stage)
#   old_levels <- levels(nomogram_df$POP_Q_Stage)
#   # Map common POP-Q values
#   new_levels <- gsub("^0$|Stage 0|stage 0", "0", old_levels)
#   new_levels <- gsub("^1$|^I$|Stage 1|stage 1|Stage I", "I", new_levels)
#   new_levels <- gsub("^2$|^II$|Stage 2|stage 2|Stage II", "II", new_levels)
#   new_levels <- gsub("^3$|^III$|Stage 3|stage 3|Stage III", "III", new_levels)
#   new_levels <- gsub("^4$|^IV$|Stage 4|stage 4|Stage IV", "IV", new_levels)
#   levels(nomogram_df$POP_Q_Stage) <- new_levels
# }

# UDS Indication - make VERY SHORT labels to prevent overlap
if ("UDS_Indication" %in% names(nomogram_df)) {
  nomogram_df$UDS_Indication <- factor(nomogram_df$UDS_Indication)
  old_levels <- levels(nomogram_df$UDS_Indication)
  log_info(paste("Original UDS levels:", paste(old_levels, collapse = " | ")))

  # Use very short abbreviations - comprehensive pattern matching
  new_levels <- old_levels
  new_levels <- gsub(".*[Ss]ling.*", "SL", new_levels)
  new_levels <- gsub(".*[Pp]re-?op.*|.*[Pp]reop.*", "Pre", new_levels)
  new_levels <- gsub(".*[Ii]ncontinence.*|.*SUI.*|.*[Uu]rinary.*[Ii]ncont.*", "UI", new_levels)
  new_levels <- gsub(".*POP.*|.*[Pp]rolapse.*", "POP", new_levels)
  new_levels <- gsub(".*[Vv]oiding.*[Dd]ysf.*|.*[Vv]oiding.*|.*VD.*", "VD", new_levels)
  new_levels <- gsub(".*[Rr]ecurrent.*UTI.*|.*rUTI.*", "rUTI", new_levels)
  new_levels <- gsub(".*[Nn]eurogenic.*[Bb]ladder.*|.*[Nn]eurogenic.*|.*NB.*", "NB", new_levels)
  new_levels <- gsub(".*OAB.*|.*[Oo]veractive.*[Bb]ladder.*", "OAB", new_levels)
  new_levels <- gsub(".*[Oo]ther.*", "Oth", new_levels)
  # Clean up any remaining long labels
  new_levels <- gsub("evaluation of ", "", new_levels)
  new_levels <- gsub("Evaluation of ", "", new_levels)
  new_levels <- gsub(" ", "", new_levels)  # Remove any spaces

  levels(nomogram_df$UDS_Indication) <- new_levels
  log_info(paste("Cleaned UDS levels:", paste(levels(nomogram_df$UDS_Indication), collapse = ", ")))
}

# Race - clean up race categories
if ("Race" %in% names(nomogram_df)) {
  nomogram_df$Race <- factor(nomogram_df$Race)
  old_levels <- levels(nomogram_df$Race)
  new_levels <- gsub(".*[Ww]hite.*|.*[Cc]aucasian.*", "White", old_levels)
  new_levels <- gsub(".*[Bb]lack.*|.*African.*", "Black", new_levels)
  new_levels <- gsub(".*[Hh]ispanic.*|.*[Ll]atino.*", "Hispanic", new_levels)
  new_levels <- gsub(".*[Aa]sian.*", "Asian", new_levels)
  new_levels <- gsub(".*[Oo]ther.*|.*[Mm]ulti.*|.*[Mm]ixed.*", "Other", new_levels)
  levels(nomogram_df$Race) <- new_levels
}

# Also rename the outcome variable
if ("Was.the.procedure.cancelled." %in% names(nomogram_df)) {
  names(nomogram_df)[names(nomogram_df) == "Was.the.procedure.cancelled."] <- "Cancelled"
}

# SET RMS LABELS for nice display on nomogram axes
# These will appear as the variable names on the left side of the nomogram
label_display_map <- c(
  "Age" = "Patient Age (years)",
  "BMI" = "Body Mass Index",
  "Hispanic" = "Hispanic/Latino",
  "Race" = "Race",
  "Recurrent_UTIs" = "Recurrent UTIs",
  # "POP_Q_Stage" = "POP-Q Stage",  # Removed from model per user request
  "UDS_Indication" = "Indication for UDS",
  "Diabetes" = "Diabetes",
  "Overactive_Bladder" = "Overactive Bladder",
  "Vaginal_Estrogen" = "Vaginal Estrogen",
  "Immunocompromised" = "Immunocompromised",
  "Tobacco_Use" = "Tobacco Use",
  "Menopause" = "Menopause Status",
  "Nocturia" = "Nocturia (per night)",
  "Year" = "Year",
  "PFDI_20" = "PFDI-20 Score",
  "UDI_6" = "UDI-6 Score",
  "POPDI_6" = "POPDI-6 Score",
  "CRADI_8" = "CRADI-8 Score",
  "Detrusor_Overactivity" = "Detrusor Overactivity"
)

for (col in names(nomogram_df)) {
  if (col %in% names(label_display_map)) {
    Hmisc::label(nomogram_df[[col]]) <- label_display_map[[col]]
    log_info(paste("Set label for", col, "->", label_display_map[[col]]))
  }
}

log_info("Variable and factor level cleaning complete")

# ============================================================================
# DYNAMIC MODEL SETUP FOR NOMOGRAM
# Use the LASSO-selected predictors from nomogram_df (cleaned version)
# This ensures the nomogram is based on LASSO feature selection results
# ============================================================================

log_info("Setting up nomogram with LASSO-selected predictors")

# Get predictors from nomogram_df (which contains LASSO-selected variables with clean names)
exclude_cols <- c("Cancelled", "Was.the.procedure.cancelled.")
available_cols <- setdiff(names(nomogram_df), exclude_cols)

if (length(available_cols) == 0) {
  stop("Error: No predictor columns found in nomogram_df. Check LASSO feature selection.")
}

log_info(paste("Initial predictors from LASSO:", paste(available_cols, collapse = ", ")))

# =============================================================================


log_info(paste("Final predictors for model:", paste(available_cols, collapse = ", ")))

# Verify the outcome column exists
if (!"Cancelled" %in% names(nomogram_df)) {
  stop("Error: Outcome column 'Cancelled' not found in nomogram_df")
}

# Log ranges for numeric predictors
for (col in available_cols) {
  if (is.numeric(nomogram_df[[col]])) {
    log_info(paste("Current", col, "range:",
                   min(nomogram_df[[col]], na.rm=TRUE),
                   "to", max(nomogram_df[[col]], na.rm=TRUE)))
  } else {
    log_info(paste(col, "is categorical with levels:",
                   paste(levels(factor(nomogram_df[[col]])), collapse = ", ")))
  }
}

# ✅ Step 1: Explicitly Recreate and Apply `datadist`
log_info("Creating and applying datadist")
dd <- rms::datadist(nomogram_df)

# Bug Fix (Double Truncation): Use datadist defaults (full data range for Low/High).
# Previously hardcoded Age=20-90, BMI=20-60 which truncated the predictor space
# and flat-lined predictions for extreme patients. RCS already imposes linearity
# in the tails (beyond outermost knots), so additional clipping is harmful —
# it biases risk estimates for the most symptomatic/elderly patients.
# Reference: Harrell FE. RMS 2nd ed. Section 2.4.5: datadist should use actual ranges.
for (v in available_cols) {
  if (is.numeric(nomogram_df[[v]])) {
    actual_low <- dd$limits["Low", v]
    actual_high <- dd$limits["High", v]
    log_info(sprintf("  %s datadist limits: Low=%.1f, High=%.1f (data range)", v, actual_low, actual_high))
  }
}

options(datadist = "dd")  # Ensure it is applied globally
log_info("datadist created and applied globally")

# ✅ Step 2: Variable names are already clean - just store for later use
log_info("Variable names already cleaned - storing for display")
predictor_names_display <- available_cols

# =============================================================================
# ✅ Step 2.5: AUTOMATIC RESTRICTED CUBIC SPLINES FOR CONTINUOUS VARIABLES
# =============================================================================
# RCS allows modeling of non-linear relationships for continuous predictors
# This can improve model fit (R²) while maintaining interpretability
# Reference: Harrell FE Jr. Regression Modeling Strategies. 2nd ed. Springer; 2015.
# =============================================================================

log_info("Detecting continuous variables for restricted cubic splines (RCS)")

# Configuration: Number of knots for RCS
# 3 knots = 2 degrees of freedom (minimum for non-linearity)
# 4 knots = 3 degrees of freedom (moderate flexibility)
# 5 knots = 4 degrees of freedom (more flexibility, higher risk of overfitting)
# Use the same rcs_knots determined by sample-size-adaptive logic in Part 5
# (rcs_knots was set to 3/4/5 based on n_events at the knot-selection chunk)
# Do NOT override here — the adaptive value is already set

# Minimum unique values required for RCS (need at least knots + 2)
min_unique_for_rcs <- rcs_knots + 2

# Identify continuous vs categorical variables
continuous_vars <- c()
categorical_vars <- c()

for (col in available_cols) {
  col_data <- nomogram_df[[col]]

  if (is.numeric(col_data)) {
    n_unique <- length(unique(na.omit(col_data)))

    # Only apply RCS if sufficient unique values
    if (n_unique >= min_unique_for_rcs) {
      continuous_vars <- c(continuous_vars, col)
      log_info(sprintf("  %s: continuous (n_unique=%d) -> %s",
                       col, n_unique,
                       if (rcs_knots >= 3) sprintf("will use rcs(%s, %d)", col, rcs_knots)
                       else sprintf("linear term (rcs_knots=%d, splines disabled)", rcs_knots)))
    } else {
      # Treat as linear if too few unique values
      categorical_vars <- c(categorical_vars, col)
      log_info(sprintf("  %s: continuous but low cardinality (n_unique=%d) -> linear term",
                       col, n_unique))
    }
  } else {
    categorical_vars <- c(categorical_vars, col)
    n_levels <- length(unique(na.omit(col_data)))
    log_info(sprintf("  %s: categorical (n_levels=%d) -> factor term", col, n_levels))
  }
}

# =============================================================================
# EPV-AWARE RCS DECISION
# Check if using RCS would push EPV below minimum. If so, use linear terms.
# This preserves all predictors while meeting the EPV >= 10 requirement.
# =============================================================================
# Count df properly: categoricals contribute (levels - 1) df each, not 1
cat_df <- sum(sapply(categorical_vars, function(v) {
  max(1, length(unique(na.omit(nomogram_df[[v]]))) - 1)
}))
rcs_df_total <- length(continuous_vars) * (rcs_knots - 1) + cat_df
linear_df_total <- length(continuous_vars) + cat_df
n_events_for_rcs <- sum(nomogram_df$Cancelled == levels(factor(nomogram_df$Cancelled))[2], na.rm = TRUE)
epv_with_rcs <- n_events_for_rcs / rcs_df_total
epv_with_linear <- n_events_for_rcs / linear_df_total

log_info(sprintf("EPV check: with RCS = %.1f (%d df), with linear = %.1f (%d df)",
                 epv_with_rcs, rcs_df_total, epv_with_linear, linear_df_total))

use_rcs <- rcs_knots >= 3 && rcs_df_total > 0 && epv_with_rcs >= EPV_MINIMUM
if (!use_rcs) {
  log_info(sprintf("RCS disabled: rcs_knots=%d, rcs_df_total=%d, EPV=%.1f (need knots>=3, df>0, EPV>=%d). Using linear terms.",
                   rcs_knots, rcs_df_total, epv_with_rcs, EPV_MINIMUM))
}

# Build formula terms
if (use_rcs) {
  rcs_terms <- sapply(continuous_vars, function(v) {
    v_safe <- if (grepl("[^a-zA-Z0-9_.]", v)) paste0("`", v, "`") else v
    sprintf("rcs(%s, %d)", v_safe, rcs_knots)
  })
} else {
  # Use linear terms for continuous variables to preserve EPV
  rcs_terms <- sapply(continuous_vars, function(v) {
    if (grepl("[^a-zA-Z0-9_.]", v)) paste0("`", v, "`") else v
  })
}
linear_terms <- categorical_vars

# Combine all terms
all_terms <- c(rcs_terms, linear_terms)

log_info(sprintf("Model terms: %d continuous vars (%s), %d categorical vars",
                 length(continuous_vars),
                 if (use_rcs) sprintf("rcs with %d knots", rcs_knots) else "linear",
                 length(categorical_vars)))

# Store for later reference
rcs_variables <- continuous_vars
linear_variables <- categorical_vars

# =============================================================================
# ✅ Step 2.6: INTERACTION TERMS FROM LASSO SELECTION
# =============================================================================
# Bug Fix (Interaction Bias): Interactions are now included in the initial LASSO
# candidate pool (Part 4) so the L1 penalty applies to their complexity. This
# replaces the previous post-hoc detect_interactions() approach which used
# stepwise LR tests AFTER LASSO — a biased procedure because the penalty did
# not apply to interaction terms, allowing spurious interactions to inflate
# model complexity.
# Reference: Hastie, Tibshirani, Wainwright (2015). Statistical Learning with
#            Sparsity. Chapter 4.
# =============================================================================

log_info("Checking for LASSO-selected interaction terms")

# Identify interaction terms that LASSO selected (they contain ":" in their names)
lasso_selected_interactions <- selected_predictors_lasso[grepl(":", selected_predictors_lasso)]

# Convert LASSO dummy interaction names back to formula interaction terms.
# Bug Fix (Dummy Name Mapping): LASSO coefficient names are dummy-encoded
# (e.g., "DiabetesYes:Age" or "Recurrent_UTIsYes:Vaginal_EstrogenYes").
# We must map these back to base column names in nomogram_df for the formula.
# Strategy: for each dummy part, find the longest matching column name in the data.
nomogram_col_names <- names(nomogram_df)

dummy_to_base <- function(dummy_name, col_names) {
  # Direct match first
  if (dummy_name %in% col_names) return(dummy_name)

  # Find longest column name that is a prefix of the dummy name
  # e.g., "DiabetesYes" -> matches "Diabetes" (length 8)
  matches <- col_names[sapply(col_names, function(cn) startsWith(dummy_name, cn))]
  if (length(matches) > 0) {
    return(matches[which.max(nchar(matches))])  # Longest prefix wins
  }

  # Fallback: strip trailing factor level indicators (digits, Yes, No)
  stripped <- gsub("(Yes|No|TRUE|FALSE|[0-9]+)$", "", dummy_name)
  if (stripped %in% col_names) return(stripped)

  # Fallback 2: The dummy_name uses the raw (pre-rename) column name (e.g.,
  # "Does.the.patient.have.diabetes.Yes" from model.matrix()). Map through
  # var_name_map to find the clean name used in nomogram_df.
  if (exists("var_name_map")) {
    # Try raw column name against var_name_map keys
    for (raw_name in names(var_name_map)) {
      if (startsWith(dummy_name, raw_name)) {
        clean_name <- var_name_map[raw_name]
        if (clean_name %in% col_names) {
          log_info(sprintf("  dummy_to_base: mapped '%s' -> '%s' via var_name_map", dummy_name, clean_name))
          return(clean_name)
        }
      }
    }
    # Also try the stripped version
    if (stripped != dummy_name) {
      for (raw_name in names(var_name_map)) {
        if (stripped == raw_name || startsWith(stripped, raw_name)) {
          clean_name <- var_name_map[raw_name]
          if (clean_name %in% col_names) {
            log_info(sprintf("  dummy_to_base: mapped stripped '%s' -> '%s' via var_name_map", stripped, clean_name))
            return(clean_name)
          }
        }
      }
    }
  }

  dummy_name  # Return as-is if no match (will fail later with informative error)
}

significant_interactions <- character(0)
if (length(lasso_selected_interactions) > 0) {
  for (int_name in lasso_selected_interactions) {
    parts <- strsplit(int_name, ":")[[1]]
    if (length(parts) == 2) {
      v1_base <- dummy_to_base(parts[1], nomogram_col_names)
      v2_base <- dummy_to_base(parts[2], nomogram_col_names)
      v1_safe <- if (grepl("[^a-zA-Z0-9_.]", v1_base)) paste0("`", v1_base, "`") else v1_base
      v2_safe <- if (grepl("[^a-zA-Z0-9_.]", v2_base)) paste0("`", v2_base, "`") else v2_base
      int_term <- paste(v1_safe, "*", v2_safe)
      if (!int_term %in% significant_interactions) {
        significant_interactions <- c(significant_interactions, int_term)
        log_info(sprintf("LASSO selected interaction: %s (from dummy: %s)", int_term, int_name))
      }
    }
  }
} else {
  log_info("No interaction terms survived LASSO selection (L1 penalty shrunk all to zero)")
}

# Check EPV budget before including LASSO-selected interactions
current_model_df <- if (use_rcs) length(continuous_vars) * (rcs_knots - 1) + length(categorical_vars) else length(continuous_vars) + length(categorical_vars)
current_n_events <- sum(nomogram_df$Cancelled == levels(factor(nomogram_df$Cancelled))[2], na.rm = TRUE)
epv_budget <- floor(current_n_events / EPV_MINIMUM) - current_model_df

if (length(significant_interactions) > 0 && epv_budget <= 0) {
  log_info(sprintf("Dropping LASSO-selected interactions: no EPV budget (df=%d, events=%d)",
                   current_model_df, current_n_events))
  significant_interactions <- character(0)
}

# Store for documentation
interaction_terms_added <- significant_interactions

# Store the number of potential interactions for reporting
available_interactions <- if (exists("lasso_interaction_terms") && length(lasso_interaction_terms) > 0) {
  gsub(":", " × ", lasso_interaction_terms)
} else {
  character(0)
}

# ✅ Step 3: Build canonical production formula using shared helper
log_info("Building production formula with dynamic predictors and interactions")

nomogram_formula <- build_model_formula(
  outcome    = "Cancelled",
  predictors = c(continuous_vars, categorical_vars),
  rcs_knots  = if (use_rcs) rcs_knots else 0L,
  data       = nomogram_df
)

# Append LASSO-selected interactions if any
if (length(significant_interactions) > 0) {
  nomogram_formula <- update(nomogram_formula,
    as.formula(paste(". ~ . +", paste(significant_interactions, collapse = " + "))))
}

log_info(paste("Nomogram formula:", deparse(nomogram_formula)))

# =============================================================================
# CANONICAL PRODUCTION FORMULA
# This single formula object defines the model architecture for ALL downstream
# uses: primary fit, bootstrap validation, temporal validation, cross-validation,
# nomogram, and Shiny calculator. Do NOT rebuild the formula from scratch in
# validation sections — reuse this object to guarantee architectural consistency.
# The bootstrap (full-process) is the ONE exception because it re-selects
# variables via LASSO per resample.
# =============================================================================
production_formula <- nomogram_formula
log_info(paste("Production formula:", deparse(production_formula)))

# =============================================================================
# MICE IMPUTATION: Handle missing predictor values
# This ensures all 841 patients are included in the model
# Missing: Detrusor Overactivity (26), Recurrent UTIs (10)
# =============================================================================
log_info("Checking for missing values in predictor variables...")

# Get predictor columns from the formula (exclude outcome)
predictor_cols <- all.vars(nomogram_formula)[-1]  # Remove outcome variable
predictor_cols <- gsub("rcs\\(([^,]+),.*\\)", "\\1", predictor_cols)  # Remove rcs() wrapper

log_info(sprintf("Formula predictor columns: %s", paste(predictor_cols, collapse = ", ")))
log_info(sprintf("Available nomogram_df columns: %s", paste(names(nomogram_df), collapse = ", ")))

# Check which predictor columns actually exist in nomogram_df
predictor_cols_found <- predictor_cols[predictor_cols %in% names(nomogram_df)]
predictor_cols_missing <- predictor_cols[!predictor_cols %in% names(nomogram_df)]

if (length(predictor_cols_missing) > 0) {
  log_warn(sprintf("Predictor columns NOT FOUND in data: %s", paste(predictor_cols_missing, collapse = ", ")))
}
log_info(sprintf("Predictor columns found: %s", paste(predictor_cols_found, collapse = ", ")))

# Check missing values before imputation
n_missing_before <- sum(!complete.cases(nomogram_df[, predictor_cols_found, drop = FALSE]))
log_info(sprintf("Rows with missing predictor values: %d of %d (%.1f%%)",
                 n_missing_before, nrow(nomogram_df),
                 100 * n_missing_before / nrow(nomogram_df)))

if (n_missing_before > 0) {
  log_info("Performing MICE imputation to include all patients...")

  # Load mice if not already loaded
  if (!requireNamespace("mice", quietly = TRUE)) {
    stop("Package 'mice' is required for imputation. Install with: install.packages('mice')")
  }

  # Select columns for imputation (predictors + outcome)
  # Use predictor_cols_found to ensure we only use columns that exist
  impute_cols <- c("Cancelled", predictor_cols_found)
  impute_cols <- impute_cols[impute_cols %in% names(nomogram_df)]
  impute_data <- nomogram_df[, impute_cols, drop = FALSE]

  log_info(sprintf("Columns selected for imputation: %s", paste(impute_cols, collapse = ", ")))

  # CRITICAL FIX: Convert empty strings to NA before imputation
  # Empty strings ("") are NOT treated as NA by MICE - they're treated as valid values
  # This was causing 64+ records to be "imputed" as "" instead of actual values
  for (col in names(impute_data)) {
    if (is.character(impute_data[[col]])) {
      n_empty <- sum(impute_data[[col]] == "", na.rm = TRUE)
      if (n_empty > 0) {
        log_info(sprintf("Converting %d empty strings to NA in %s", n_empty, col))
        impute_data[[col]][impute_data[[col]] == ""] <- NA
      }
    }
  }

  # CRITICAL: Convert character columns to factors for MICE imputation
  # MICE cannot impute character columns - they must be factors
  for (col in names(impute_data)) {
    if (is.character(impute_data[[col]])) {
      log_info(sprintf("Converting %s from character to factor for MICE", col))
      impute_data[[col]] <- factor(impute_data[[col]])
    }
  }

  # Show missing data pattern
  log_info("Missing data pattern:")
  for (col in impute_cols) {
    n_na <- sum(is.na(impute_data[[col]]))
    if (n_na > 0) {
      log_info(sprintf("  %s: %d missing (%.1f%%)", col, n_na, 100 * n_na / nrow(impute_data)))
    }
  }

  # Run MICE imputation
  # Use appropriate methods for each variable type:
  # - "pmm" for continuous variables
  # - "logreg" for binary factors
  # - "polyreg" for multi-level factors
  set.seed(SEED_MAIN)  # For reproducibility

  # Determine appropriate imputation method for each column
  impute_methods <- sapply(names(impute_data), function(col) {
    x <- impute_data[[col]]
    if (is.numeric(x) && !is.factor(x)) {
      return("pmm")  # Predictive mean matching for continuous
    } else if (is.factor(x) || is.character(x)) {
      n_levels <- length(unique(na.omit(x)))
      if (n_levels == 2) {
        return("logreg")  # Logistic regression for binary
      } else {
        return("polyreg")  # Polytomous regression for multi-level
      }
    } else {
      return("pmm")  # Default
    }
  })

  log_info(sprintf("MICE methods: %s", paste(names(impute_methods), impute_methods, sep="=", collapse=", ")))

  # Suppress verbose output during imputation
  mice_result <- suppressMessages(
    mice::mice(
      impute_data,
      m = 5,              # 5 imputed datasets
      method = impute_methods,
      maxit = 10,         # 10 iterations
      printFlag = FALSE   # Suppress iteration output
    )
  )

  # BUG FIX (2026-03-24): Use ALL m=5 imputed datasets via Rubin's rules,
  # not just the first. Extracting only dataset 1 discards between-imputation
  # variance, underestimates standard errors, and produces overconfident CIs.
  # Dataset 1 is still used to update nomogram_df for datadist/plots.
  imputed_data <- mice::complete(mice_result, action = 1)

  # Verify imputation worked - use predictor_cols_found (columns that actually exist)
  n_missing_after <- sum(!complete.cases(imputed_data[, predictor_cols_found, drop = FALSE]))
  log_info(sprintf("After MICE imputation: %d rows with missing values (was %d)",
                   n_missing_after, n_missing_before))

  if (n_missing_after > 0) {
    log_warn("MICE imputation did not resolve all missing values!")
    # Debug: show which columns still have missing
    for (col in predictor_cols_found) {
      n_still_na <- sum(is.na(imputed_data[[col]]))
      if (n_still_na > 0) {
        log_warn(sprintf("  %s still has %d missing values", col, n_still_na))
      }
    }
  } else {
    log_info("SUCCESS: All missing predictor values imputed")
  }

  # Update nomogram_df with imputed values - use predictor_cols_found
  for (col in predictor_cols_found) {
    if (col %in% names(imputed_data)) {
      # For factors, we need to ensure the levels match or convert properly
      if (is.factor(imputed_data[[col]])) {
        # If the original was character, convert the imputed factor back to character
        # then the model will handle the conversion
        if (is.character(nomogram_df[[col]])) {
          nomogram_df[[col]] <- as.character(imputed_data[[col]])
        } else {
          nomogram_df[[col]] <- imputed_data[[col]]
        }
      } else {
        nomogram_df[[col]] <- imputed_data[[col]]
      }
      log_info(sprintf("Updated %s with imputed values (class: %s)", col, class(nomogram_df[[col]])[1]))
    }
  }

  # Verify nomogram_df is now complete
  n_final_missing <- sum(!complete.cases(nomogram_df[, predictor_cols_found, drop = FALSE]))
  log_info(sprintf("nomogram_df missing rows after update: %d", n_final_missing))

  # Store imputation info for documentation
  mice_imputation_performed <- TRUE
  mice_n_imputed <- n_missing_before
  mice_m_datasets <- 5L  # Track number of imputed datasets for reporting

} else {
  log_info("No missing predictor values - MICE imputation not needed")
  mice_imputation_performed <- FALSE
  mice_n_imputed <- 0
  mice_m_datasets <- 0L
}

# REFRESH DATADIST (Critical for rms functions)
# Data has been imputed and potentially modified, so we must update datadist
log_info("Refreshing datadist with imputed data")
dd <- rms::datadist(nomogram_df)
options(datadist = "dd")

# Fit the model using Rubin's rules across all imputed datasets (if MICE was used)
# fit.mult.impute() from Hmisc pools coefficients and adjusts SEs for
# between-imputation variance, preventing the overconfident CIs that result
# from using only a single imputed dataset.
if (exists("mice_result") && mice_imputation_performed) {
  log_info("Fitting model across all m=5 imputed datasets using Rubin's rules")
  model <- Hmisc::fit.mult.impute(
    nomogram_formula,
    fitter = rms::lrm,
    xtrans = mice_result,
    data = nomogram_df,
    pr = FALSE,  # Suppress iteration output
    fitargs = list(x = TRUE, y = TRUE)
  )
  log_info(sprintf("Model fitted with Rubin's rules pooling across %d imputed datasets", mice_m_datasets))
} else {
  # No imputation needed — fit on observed data directly
  model <- rms::lrm(
    nomogram_formula,
    data = nomogram_df,
    x = TRUE,
    y = TRUE
  )
  log_info("Model fitted on complete data (no imputation needed)")
}

# Ensure Cancelled remains a proper factor (can get coerced during model fitting)
if (!is.factor(nomogram_df$Cancelled)) {
  nomogram_df$Cancelled <- factor(nomogram_df$Cancelled, levels = c(1, 2),
                                   labels = c("Completed", "Cancelled"))
  log_info("Restored Cancelled column as factor")
}

# =============================================================================
# CRITICAL: Track actual model sample size (complete cases only)
# rms::lrm() automatically excludes rows with missing predictor values
# These counts MUST be used for TRIPOD reporting and EPV calculations
# =============================================================================
model_n_actual <- model$stats["Obs"]
# lrm stores factor outcomes as factor with levels 1/2 (not original names)
# Level 1 = "Completed", Level 2 = "Cancelled"
model_n_events <- sum(as.numeric(model$y) == 2)
model_n_nonevents <- sum(as.numeric(model$y) == 1)
model_event_rate <- round(100 * model_n_events / model_n_actual, 1)

# Calculate excluded due to missing predictors
# This is the difference between analysis cohort (841) and model N
tripod_n_excluded_missing_predictors <- tripod_n_final - model_n_actual

log_info(sprintf("CRITICAL MODEL COUNTS (for TRIPOD reporting):"))
log_info(sprintf("  Analysis cohort with complete outcome: %d", tripod_n_final))
log_info(sprintf("  Excluded for missing predictors: %d", tripod_n_excluded_missing_predictors))
log_info(sprintf("  Model actual N: %d", model_n_actual))
log_info(sprintf("  Model events (Cancelled): %d (%.1f%%)", model_n_events, model_event_rate))
log_info(sprintf("  Model non-events (Completed): %d", model_n_nonevents))

# Verify consistency
stopifnot(
  model_n_actual + tripod_n_excluded_missing_predictors == tripod_n_final,
  model_n_events + model_n_nonevents == model_n_actual
)

# =============================================================================
# CRITICAL: Recalculate EPV with ACTUAL model event count
# This is the TRUE EPV that should be reported in the manuscript
# =============================================================================
n_model_predictors <- length(available_cols)
# Calculate effective degrees of freedom (accounts for RCS complexity)
if (use_rcs) {
  model_effective_df <- length(continuous_vars) * (rcs_knots - 1) + length(categorical_vars)
} else {
  model_effective_df <- length(continuous_vars) + length(categorical_vars)
}
model_epv_actual <- round(model_n_events / model_effective_df, 1)
model_epv_adequate <- model_epv_actual >= EPV_MINIMUM

log_info(sprintf(""))
log_info(sprintf("=== CRITICAL: ACTUAL EPV CALCULATION ==="))
log_info(sprintf("Model events: %d", model_n_events))
log_info(sprintf("Model predictors: %d (%d effective df, RCS=%s)",
                 n_model_predictors, model_effective_df, use_rcs))
log_info(sprintf("ACTUAL EPV: %.1f (minimum: %d)", model_epv_actual, EPV_MINIMUM))
log_info(sprintf("EPV adequate: %s", model_epv_adequate))

if (!model_epv_adequate) {
  log_info(sprintf("EPV (%.1f) is below guideline of %d. Mitigations: LASSO selection, bootstrap validation, shrinkage correction.",
                   model_epv_actual, EPV_MINIMUM))
}

log_info(paste("Model C-statistic:", round(model$stats["C"], 3)))
log_info(paste("Model R-squared:", round(model$stats["R2"], 3)))
log_info(paste("Model Brier score:", round(model$stats["Brier"], 4)))
log_info(paste("Number of predictors:", n_model_predictors))

# Report RCS implementation
if (length(continuous_vars) > 0) {
  log_info(sprintf("RCS applied to %d continuous variable(s): %s",
                   length(continuous_vars), paste(continuous_vars, collapse = ", ")))
  log_info(sprintf("Each RCS term uses %d knots (%d degrees of freedom)",
                   rcs_knots, rcs_knots - 1))

  # Calculate effective degrees of freedom
  # Each RCS with k knots adds k-1 df, categorical vars add levels-1 df
  rcs_df <- length(continuous_vars) * (rcs_knots - 1)
  log_info(sprintf("Total degrees of freedom from RCS: %d", rcs_df))
} else {
  log_info("No continuous variables suitable for RCS - using linear terms only")
}

Model Sample Summary

Note on Missing Data: MICE multiple imputation was used to impute missing predictor values for 146 patients (17.4%). The final model was fit on 841 patients with 57 cancellation events (6.8% event rate).

Penalized Regression Comparison — pentrace (Primary Model)

The rms::pentrace() function explores the optimal penalty for ridge regression, providing an alternative to LASSO for addressing overfitting. This comparison helps validate our variable selection approach.

Interpretation of pentrace Results:

  • Effective AIC curve: Shows model fit quality vs. penalty strength
  • Optimal penalty: The λ that minimizes effective AIC
  • Degrees of freedom (df): Decreases as penalty increases (more shrinkage)
  • Comparison with LASSO:
    • LASSO (L1): Performs variable selection (sets coefficients to zero)
    • Ridge (L2): Shrinks all coefficients but keeps all variables
    • Our hybrid approach uses LASSO for selection, then shrinkage adjustment

Note: The primary nomogram using shrinkage-adjusted coefficients is shown in Figure 2 (Shrinkage-Adjusted Nomogram) below, following bootstrap validation. The original unadjusted nomogram is retained here for reference but is hidden from output.

Nomogram with Enhanced Rug Plot (Data Distribution)

This version adds sophisticated data distribution visualization (“rug plots”) to show where actual patient values fall along each predictor scale. The enhanced rug uses Hmisc::scat1d() for a cleaner, more informative display than standard rug plots.

Interpretation of Distribution Plots: - Green density: Distribution of completed procedures - Red density: Distribution of cancelled procedures - Overlap: Where the distributions overlap, the predictor has less discriminatory power - Separation: Where distributions differ, the predictor helps distinguish outcomes - Rug marks: Individual patient values (green = completed, red = cancelled)

PART 6: Internal Validation (Primary Model)

This section assesses model performance using rigorous validation techniques. We employ both standard bootstrap validation and full-process bootstrap (which includes LASSO selection within each resample) per Harrell’s recommendations.

Why Full-Process Validation?

Standard bootstrap only validates the final model. Full-process bootstrap repeats ALL modeling steps (including LASSO selection) in each resample, providing a more honest assessment of optimism when data-driven variable selection is used.

Bootstrap Internal Validation

Bootstrap validation provides optimism-corrected estimates of model performance by resampling with replacement. This technique helps assess how much the apparent model performance may be inflated due to overfitting.

# Set seed for reproducibility (uses configured seed)
set.seed(SEED_BOOTSTRAP)

# Perform bootstrap validation with configured number of resamples
# This validates the lrm model and provides optimism-corrected statistics
log_info(paste("Starting bootstrap validation with", BOOTSTRAP_RESAMPLES, "resamples..."))

# Verify model x and y dimensions before validation
if (!is.null(model$x) && !is.null(model$y)) {
  log_info(sprintf("Model x dimensions: %d x %d, y length: %d",
                   nrow(model$x), ncol(model$x), length(model$y)))
  if (nrow(model$x) != length(model$y)) {
    log_warn("Dimension mismatch detected - refitting model")
    # Refit model to ensure x and y are consistent
    model <- rms::lrm(nomogram_formula, data = nomogram_df, x = TRUE, y = TRUE)
  }
}

bootstrap_validation <- rms::validate(model, B = BOOTSTRAP_RESAMPLES, pr = FALSE)

log_info("Bootstrap validation completed")

# Convert to data frame for kable display
# The validate object is a matrix with special class, need to handle carefully
bootstrap_matrix <- unclass(bootstrap_validation)
bootstrap_df <- data.frame(
  Metric = rownames(bootstrap_matrix),
  Apparent = round(bootstrap_matrix[, "index.orig"], 4),
  Training = round(bootstrap_matrix[, "training"], 4),
  Test = round(bootstrap_matrix[, "test"], 4),
  Optimism = round(bootstrap_matrix[, "optimism"], 4),
  Corrected = round(bootstrap_matrix[, "index.corrected"], 4),
  n = bootstrap_matrix[, "n"]
)

# Display as formatted kable table
kable(bootstrap_df,
      col.names = c("Metric", "Apparent", "Training", "Test", "Optimism", "Corrected", "n"),
      caption = paste("Bootstrap Internal Validation Results (B =", BOOTSTRAP_RESAMPLES, "resamples)"),
      align = c("l", "c", "c", "c", "c", "c", "c"),
      row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE,
                font_size = 13) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(6, bold = TRUE, color = "white", background = "#27AE60") %>%
  add_header_above(c(" " = 1, "Performance Indices" = 5, " " = 1))
Bootstrap Internal Validation Results (B = 1000 resamples)
Performance Indices
Metric Apparent Training Test Optimism Corrected n
Dxy 0.5081 0.5150 0.4836 0.0314 0.4768 1000
R2 0.1434 0.1544 0.1296 0.0248 0.1186 1000
Intercept 0.0000 0.0000 -0.1515 0.1515 -0.1515 1000
Slope 1.0000 1.0000 0.9267 0.0733 0.9267 1000
Emax 0.0000 0.0000 0.1012 -0.1012 0.1012 1000
D 0.0565 0.0613 0.0508 0.0105 0.0460 1000
U -0.0024 -0.0024 0.0006 -0.0030 0.0006 1000
Q 0.0589 0.0636 0.0502 0.0135 0.0454 1000
B 0.0581 0.0575 0.0589 -0.0014 0.0595 1000
g 1.1850 1.2226 1.1039 0.1187 1.0664 1000
gp 0.0652 0.0666 0.0619 0.0046 0.0606 1000
# Extract key metrics
apparent_dxy <- bootstrap_validation["Dxy", "index.orig"]
optimism_dxy <- bootstrap_validation["Dxy", "optimism"]
corrected_dxy <- bootstrap_validation["Dxy", "index.corrected"]

# Convert Dxy to C-statistic (C = 0.5 + Dxy/2)
apparent_c <- 0.5 + apparent_dxy/2
corrected_c <- 0.5 + corrected_dxy/2
optimism_c <- apparent_c - corrected_c

Calibration Slope and Intercept (Critical Overfitting Indicators)

Per Harrell’s Regression Modeling Strategies, calibration slope and intercept are critical indicators of model overfitting that should be prominently reported. A calibration slope < 1 indicates predictions are too extreme (overfitting), while the intercept indicates systematic over- or under-prediction.

# =============================================================================
# PROGRAMMATIC INTERPRETATION OF CALIBRATION RESULTS
# =============================================================================

cat("\n### Interpretation of Calibration Metrics\n\n")

Interpretation of Calibration Metrics

# Slope interpretation
cat("**Calibration Slope:** ")

Calibration Slope:

if (cal_slope_corrected >= 0.9) {
  cat(sprintf("The optimism-corrected slope of **%.3f** is close to ideal (1.0), ", cal_slope_corrected))
  cat("indicating that the model's predicted probabilities are appropriately calibrated ")
  cat("and do not require shrinkage adjustment.\n\n")
} else if (cal_slope_corrected >= 0.8) {
  cat(sprintf("The optimism-corrected slope of **%.3f** indicates modest overfitting. ", cal_slope_corrected))
  cat("Predicted probabilities may be slightly too extreme. ")
  cat(sprintf("Consider applying a shrinkage factor of **%.3f** to regression coefficients when deploying the model.\n\n", cal_slope_corrected))
} else {
  cat(sprintf("**WARNING:** The optimism-corrected slope of **%.3f** indicates substantial overfitting. ", cal_slope_corrected))
  cat("Predicted probabilities are too extreme. ")
  cat(sprintf("**Recommendations:** (1) Apply shrinkage factor of **%.3f** to coefficients, ", cal_slope_corrected))
  cat("(2) Consider penalized regression (ridge/LASSO), ")
  cat("(3) Reduce model complexity.\n\n")
}

The optimism-corrected slope of 0.927 is close to ideal (1.0), indicating that the model’s predicted probabilities are appropriately calibrated and do not require shrinkage adjustment.

# Intercept interpretation
cat("**Calibration Intercept:** ")

Calibration Intercept:

if (abs(cal_intercept_corrected) < 0.1) {
  cat(sprintf("The optimism-corrected intercept of **%.3f** is close to ideal (0.0), ", cal_intercept_corrected))
  cat("indicating no systematic over- or under-prediction of risk.\n\n")
} else if (cal_intercept_corrected > 0) {
  cat(sprintf("The positive intercept of **%.3f** indicates the model systematically ", cal_intercept_corrected))
  cat("**under-predicts** risk (observed rates higher than predicted). ")
  cat("Consider recalibration before clinical deployment.\n\n")
} else {
  cat(sprintf("The negative intercept of **%.3f** indicates the model systematically ", cal_intercept_corrected))
  cat("**over-predicts** risk (observed rates lower than predicted). ")
  cat("Consider recalibration before clinical deployment.\n\n")
}

The negative intercept of -0.151 indicates the model systematically over-predicts risk (observed rates lower than predicted). Consider recalibration before clinical deployment.

# Combined assessment
cat("**Combined Assessment:** ")

Combined Assessment:

needs_shrinkage <- cal_slope_corrected < 0.9
needs_recalibration <- abs(cal_intercept_corrected) >= 0.1

if (!needs_shrinkage && !needs_recalibration) {
  cat("Model calibration is adequate for clinical use without adjustment.\n")
} else if (needs_shrinkage && !needs_recalibration) {
  cat("Apply shrinkage to coefficients before deployment.\n")
} else if (!needs_shrinkage && needs_recalibration) {
  cat("Model may benefit from intercept recalibration in new populations.\n")
} else {
  cat("Model requires both shrinkage and recalibration before deployment.\n")
}

Model may benefit from intercept recalibration in new populations.

cat("\n**Reference:** Harrell FE Jr. *Regression Modeling Strategies*. 2nd ed. Springer; 2015. ")

Reference: Harrell FE Jr. Regression Modeling Strategies. 2nd ed. Springer; 2015.

cat("Chapter 5: 'A slope of less than 1 is a sign that predicted values are too extreme.'\n")

Chapter 5: ‘A slope of less than 1 is a sign that predicted values are too extreme.’

# =============================================================================
# PRELIMINARY SHRINKAGE FACTOR FROM STANDARD BOOTSTRAP
# NOTE: Full-process shrinkage (accounting for LASSO selection) is applied later
# =============================================================================

cat("## Preliminary Calibration Shrinkage (Standard Bootstrap)\n\n")

Preliminary Calibration Shrinkage (Standard Bootstrap)

cat(sprintf("**Standard Bootstrap Calibration Slope:** %.3f\n\n", calibration_shrinkage_factor))

Standard Bootstrap Calibration Slope: 0.927

cat("This calibration slope is from standard bootstrap validation of the **final model only**. ")

This calibration slope is from standard bootstrap validation of the final model only.

cat("It does NOT account for the additional optimism introduced by LASSO variable selection.\n\n")

It does NOT account for the additional optimism introduced by LASSO variable selection.

cat("**Important:** A more comprehensive shrinkage adjustment using the **full-process calibration slope** ")

Important: A more comprehensive shrinkage adjustment using the full-process calibration slope

cat("(which includes LASSO selection uncertainty) will be applied after the full-process bootstrap validation ")

(which includes LASSO selection uncertainty) will be applied after the full-process bootstrap validation

cat("section below. The full-process shrinkage factor is typically MORE conservative.\n\n")

section below. The full-process shrinkage factor is typically MORE conservative.

# Preview the shrinkage effect with standard bootstrap slope
if (!is.null(model) && inherits(model, "lrm")) {
  original_coefs <- coef(model)
  shrunk_coefs_preliminary <- original_coefs
  shrunk_coefs_preliminary[-1] <- original_coefs[-1] * calibration_shrinkage_factor

  shrinkage_table <- data.frame(
    Coefficient = names(original_coefs),
    Original = sprintf("%.4f", original_coefs),
    Shrinkage_Adjusted = sprintf("%.4f", shrunk_coefs_preliminary),
    stringsAsFactors = FALSE
  )

  kable(shrinkage_table,
        col.names = c("Coefficient", "Original",
                      paste0("Preliminary Shrunk (×", round(calibration_shrinkage_factor, 3), ")")),
        caption = paste0("Preliminary Shrinkage Using Standard Bootstrap ",
                         "(Full-Process Shrinkage Applied Below)"),
        align = c("l", "r", "r")) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                  full_width = FALSE,
                  font_size = 11) %>%
    row_spec(1, italic = TRUE, color = "#7F8C8D") %>%
    add_footnote("See 'Full-Process Shrinkage Adjustment' section below for final deployment coefficients",
                 notation = "symbol")
}
Preliminary Shrinkage Using Standard Bootstrap (Full-Process Shrinkage Applied Below)
Coefficient Original Preliminary Shrunk (×0.927)
Intercept -7.6127 -7.6127
Age 0.0668 0.0619
POPQ_Stage -0.2461 -0.2281
BMI 0.0282 0.0262
Recurrent_UTIs=Yes 0.7791 0.7220
Diabetes=Yes -0.2378 -0.2204
* See ‘Full-Process Shrinkage Adjustment’ section below for final deployment coefficients
if (exists("use_rcs") && use_rcs) {
  cat("\n\n**Note on Variable Naming Convention:**\n\n")
  cat("Variables with apostrophes (e.g., Age', Nocturia') represent **nonlinear spline terms** from ")
  cat("restricted cubic splines (RCS). The rms package models continuous variables using splines, ")
  cat("which decompose each predictor into:\n\n")
  cat("- **Age** (or variable name without apostrophe): The linear component of the effect\n")
  cat("- **Age'** (single apostrophe): The first nonlinear spline term capturing curvature\n")
  cat("- **Age''** (double apostrophe, if present): Additional nonlinear terms for more complex shapes\n\n")
  cat("For clinical predictions, you do not need to manually calculate these spline terms—the ")
  cat("Shiny calculator and exported model functions handle this transformation automatically.\n")
} else {
  cat("\n\n**Note:** This model uses **linear terms only** for all continuous predictors. ")
  cat("Restricted cubic splines were evaluated but not applied because the events-per-variable ratio ")
  cat("(EPV = ", round(model_epv_actual, 1), ") would fall below the recommended minimum of 10 with additional spline degrees of freedom. ")
  cat("Linear terms provide stable coefficient estimates with the available sample size.\n")
}

Note: This model uses linear terms only for all continuous predictors. Restricted cubic splines were evaluated but not applied because the events-per-variable ratio (EPV = 11.4 ) would fall below the recommended minimum of 10 with additional spline degrees of freedom. Linear terms provide stable coefficient estimates with the available sample size.


Understanding the Visualization:

  • Left panel: Compares a linear model (red dashed line) to a non-linear spline model (blue solid line). The linear model assumes risk increases at a constant rate with age, which may underpredict risk at older ages and overpredict at younger ages.

  • Right panel: Shows how restricted cubic splines decompose a predictor into components. The Age term (blue) captures the overall linear trend, while the Age’ term (green) captures the curvature—the acceleration of risk at older ages that a linear model would miss.

  • Clinical implication: Restricted cubic splines allow the model to “bend” where the data suggest the relationship changes, providing more accurate predictions without requiring the analyst to pre-specify where those bends occur.

Understanding Bootstrap Validation (Plain Language):

Bootstrap validation answers the question: “How well will this model perform on patients we haven’t seen yet?”

When we build a model using patient data, it naturally performs better on those same patients than it would on new patients—this is called “overfitting.” Bootstrap validation estimates how much we’re overfitting by:

  1. Creating 1,000 simulated “new” datasets by randomly resampling our patients (with replacement)
  2. Building the model on each simulated dataset
  3. Testing how well each model predicts the original patients it didn’t see
  4. Calculating the average drop in performance (the “optimism”)

What the columns mean:

Column Plain Language Meaning
Index.orig How well the model performs on the patients we used to build it (optimistic)
Training Average performance on the simulated datasets
Test Average performance when predicting patients not in each simulated dataset
Optimism How much we’re overestimating performance (Training minus Test)
Index.corrected The realistic estimate—what we expect on truly new patients

Bottom line: The optimism-corrected C-statistic of 0.738 means the model will correctly rank patients by risk about 74% of the time when applied to new patients from a similar population.

Important caveat: This validation tests the final model but doesn’t account for the fact that we used LASSO to select which variables to include. The next section performs a more rigorous “full-process” validation that addresses this limitation.

Full-Process Bootstrap Validation (LASSO + Model)

Per Harrell’s Regression Modeling Strategies (2nd ed., Chapter 5), when data-driven variable selection is used, the validation must include ALL modeling steps within each bootstrap resample. This section performs full-process bootstrap validation that:

  1. Resamples the data with replacement
  2. Performs LASSO variable selection on each resample
  3. Fits an lrm model using the selected variables
  4. Evaluates performance on the original (out-of-bag) data
  5. Estimates the true optimism from the entire modeling process

This approach properly accounts for the “double dipping” that occurs when variables are selected and then validated on the same data.

# =============================================================================
# FULL-PROCESS BOOTSTRAP VALIDATION
# Includes LASSO variable selection within each bootstrap resample
# Reference: Harrell FE. Regression Modeling Strategies. 2nd ed. Springer; 2015.
#            Chapter 5: Resampling, Validating, Describing, and Simplifying
# =============================================================================

set.seed(SEED_BOOTSTRAP)
log_info("Starting full-process bootstrap validation (LASSO within each resample)...")

# Number of bootstrap resamples (use configured value, but cap for computational feasibility)
B_FULL_PROCESS <- min(BOOTSTRAP_RESAMPLES, 200)  # Cap at 200 for LASSO computational cost
log_info(paste("Using", B_FULL_PROCESS, "bootstrap resamples for full-process validation"))

# Prepare data for bootstrap - use full dataset with potential missing values
# We will impute INSIDE the loop to capture imputation uncertainty
#
# Bug Fix (Predictor Leakage): Restrict boot_data to ONLY the candidate
# predictors from Part 4 (the `predictors` variable). labels_df may still
# contain Year, race/ethnicity, IDs, and other variables that were explicitly
# excluded from the main LASSO candidate pool. Allowing them into the bootstrap
# LASSO would search a larger predictor space than the main model, violating
# the principle of full-process validation and likely underestimating optimism.
boot_candidate_cols <- c("Was.the.procedure.cancelled.", intersect(predictors, names(labels_df)))
boot_data <- labels_df[, boot_candidate_cols, drop = FALSE]
log_info(sprintf("Bootstrap data restricted to %d candidate predictors (matching Part 4 pool)",
                 length(boot_candidate_cols) - 1))  # -1 for outcome

# Bug Fix (Labelled Class Handling): Use haven::as_factor() to preserve factor
# level mappings from labelled columns, instead of as.character() which can lose
# integer-coded level ordering. Falls back to as.character() if haven is unavailable.
for (col in names(boot_data)) {
  if (inherits(boot_data[[col]], "labelled") || inherits(boot_data[[col]], "haven_labelled")) {
    boot_data[[col]] <- if (requireNamespace("haven", quietly = TRUE)) {
      haven::as_factor(boot_data[[col]])
    } else {
      as.character(boot_data[[col]])
    }
  }
}
boot_data <- as.data.frame(boot_data, stringsAsFactors = FALSE)

# Define predictor columns (exclude outcome)
predictor_cols_boot <- setdiff(names(boot_data), "Was.the.procedure.cancelled.")

# CRITICAL: Ensure factor levels are consistent
# Convert character columns to factor, then drop unused levels
for (col in names(boot_data)) {
  if (is.character(boot_data[[col]])) {
    boot_data[[col]] <- factor(boot_data[[col]])
  }
  if (is.factor(boot_data[[col]])) {
    boot_data[[col]] <- droplevels(boot_data[[col]])
  }
}

# Store results - including calibration slope for full-process validation
boot_results <- data.frame(
  iteration = 1:B_FULL_PROCESS,
  apparent_c = NA_real_,
  boot_c = NA_real_,
  test_c = NA_real_,
  optimism = NA_real_,
  n_predictors = NA_integer_,
  cal_slope_train = NA_real_,  # Calibration slope on bootstrap sample

  cal_slope_test = NA_real_,   # Calibration slope on original data
  cal_slope_optimism = NA_real_  # Optimism in calibration slope
)

# =============================================================================
# INSTABILITY ANALYSIS STORAGE (Riley & Collins, Biometrical Journal 2023)
# Captures per-patient predictions and variable selections across bootstraps
# to quantify prediction instability and variable selection stability
# =============================================================================

# =============================================================================
# INSTABILITY ANALYSIS: Create a FIXED test dataset for consistent predictions
# The same imputed dataset must be used across all bootstrap iterations so that
# per-patient prediction comparisons are meaningful (Riley & Collins 2023)
# =============================================================================
instability_test_data <- boot_data
instability_pred_cols <- setdiff(names(instability_test_data), "Was.the.procedure.cancelled.")
instability_test_data <- impute_predictors(instability_test_data, instability_pred_cols, seed = SEED_BOOTSTRAP)
n_instability_patients <- nrow(instability_test_data)

# Matrix to store per-patient predicted probabilities from each bootstrap model
# Rows = patients (n_instability_patients), Columns = bootstrap iterations
boot_pred_matrix <- matrix(NA_real_, nrow = n_instability_patients, ncol = B_FULL_PROCESS)

# List to store which variables were selected in each bootstrap iteration
boot_var_selections <- vector("list", B_FULL_PROCESS)

# Create FIXED test data for bootstrap optimism calculation (impute once, not per-iteration)
# Use boot_data (already stripped of labelled classes) instead of raw labels_df
test_data_raw <- boot_data
predictor_cols_for_imp <- setdiff(names(test_data_raw), "Was.the.procedure.cancelled.")
test_data_imp <- impute_predictors(test_data_raw, predictor_cols_for_imp, seed = SEED_BOOTSTRAP)
test_data_imp <- as.data.frame(test_data_imp, stringsAsFactors = FALSE)

# Bug Fix (Survival Bias in Bootstrap): Use stratified bootstrapping.
# With rare events (~3-4%), simple random sampling with replacement can produce
# bootstrap samples with zero or very few events, causing lrm() to fail.
# Discarding these failed iterations systematically removes the hardest-to-fit
# resamples (which would have had the highest optimism), underestimating true
# optimism. Stratified sampling preserves the exact event/non-event ratio in
# every resample, eliminating this "survival bias" in the optimism estimate.
boot_outcome <- boot_data$Was.the.procedure.cancelled.
boot_event_idx <- which(as.numeric(boot_outcome) == max(as.numeric(boot_outcome)))
boot_nonevent_idx <- which(as.numeric(boot_outcome) == min(as.numeric(boot_outcome)))
n_events_boot <- length(boot_event_idx)
n_nonevents_boot <- length(boot_nonevent_idx)
log_info(sprintf("Stratified bootstrap: %d events, %d non-events (preserving ratio in each resample)",
                 n_events_boot, n_nonevents_boot))

# Progress tracking
pb_interval <- max(1, floor(B_FULL_PROCESS / 10))

for (b in 1:B_FULL_PROCESS) {

  tryCatch({
    # Step 1: Create STRATIFIED bootstrap sample (sample with replacement within strata)
    # This guarantees every resample has the same event count as the original
    strat_event_idx <- sample(boot_event_idx, n_events_boot, replace = TRUE)
    strat_nonevent_idx <- sample(boot_nonevent_idx, n_nonevents_boot, replace = TRUE)
    boot_idx <- c(strat_event_idx, strat_nonevent_idx)
    boot_sample_raw <- boot_data[boot_idx, ]
    
    # Ensure all factor levels are preserved in bootstrap sample
    for (col in names(boot_sample_raw)) {
      if (is.factor(boot_sample_raw[[col]])) {
        boot_sample_raw[[col]] <- factor(boot_sample_raw[[col]], levels = levels(boot_data[[col]]))
      }
    }
    
    # Step 1.5: Multiple imputation within bootstrap (Rubin's Rules)
    # Bug Fix: Single imputation underestimates variance of the optimism estimate.
    # Per Rubin (1987) and van Buuren (2018, Flexible Imputation of Missing Data),
    # we perform m=5 imputations on each bootstrap resample, fit LASSO on each,
    # and pool the variable selection indicators via majority rule to properly
    # propagate imputation uncertainty into the validation.
    M_IMPUTE <- 5  # Number of imputations per bootstrap resample

    # Collect LASSO coefficients across m imputations for pooling
    imp_coef_list <- list()
    imp_samples <- list()
    imp_succeeded <- 0

    for (m_idx in 1:M_IMPUTE) {
      boot_sample_m <- impute_predictors(boot_sample_raw, predictor_cols_boot, seed = NULL)
      boot_sample_m <- boot_sample_m[complete.cases(boot_sample_m), ]
      if (nrow(boot_sample_m) == 0) next

      x_boot_m <- tryCatch(
        model.matrix(Was.the.procedure.cancelled. ~ . - 1, data = boot_sample_m),
        error = function(e) NULL
      )
      if (is.null(x_boot_m)) next
      y_boot_m <- as.numeric(boot_sample_m$Was.the.procedure.cancelled.) - 1

      cv_boot_m <- tryCatch(
        cv.glmnet(x_boot_m, y_boot_m, alpha = LASSO_ALPHA, family = "binomial",
                  type.measure = "deviance", nfolds = CV_FOLDS),
        error = function(e) NULL
      )
      if (is.null(cv_boot_m)) next

      coefs_m <- as.matrix(coef(cv_boot_m, s = "lambda.min"))
      imp_coef_list[[length(imp_coef_list) + 1]] <- coefs_m
      imp_samples[[length(imp_samples) + 1]] <- boot_sample_m
      imp_succeeded <- imp_succeeded + 1
    }

    # Pool across imputations: select predictors non-zero in >50% of imputations
    # This is the "majority rule" pooling method (Wood et al., 2008; Zhao & Long, 2017)
    if (imp_succeeded == 0) next

    # Align coefficient names across imputations (some may differ due to dummy coding)
    all_coef_names <- unique(unlist(lapply(imp_coef_list, rownames)))
    pooled_counts <- setNames(rep(0, length(all_coef_names)), all_coef_names)
    for (coefs_m in imp_coef_list) {
      nonzero_names <- rownames(coefs_m)[coefs_m[, 1] != 0]
      pooled_counts[nonzero_names] <- pooled_counts[nonzero_names] + 1
    }

    # Select predictors present in >50% of imputations (Rubin's Rules majority pooling)
    selected_boot <- names(pooled_counts)[pooled_counts > imp_succeeded / 2]
    selected_boot <- setdiff(selected_boot, "(Intercept)")

    # Use the last successful imputation as the representative boot_sample for
    # model fitting — the pooled variable SELECTION is what matters for Rubin's
    # Rules compliance; the point estimates come from the final lrm fit below.
    boot_sample <- imp_samples[[imp_succeeded]]
    selected_boot <- setdiff(selected_boot, "(Intercept)")

    # Record number of predictors selected
    boot_results$n_predictors[b] <- length(selected_boot)

    # Store selected variable names for instability analysis
    boot_var_selections[[b]] <- selected_boot

    # Step 3: Build bootstrap formula using shared helper
    # build_model_formula() handles backtick-quoting and optional RCS wrapping.
    # RCS is only applied if the primary model used it (use_rcs flag from Part 5).
    boot_formula <- build_model_formula(
      outcome    = "Was.the.procedure.cancelled.",
      predictors = selected_boot,
      rcs_knots  = if (exists("use_rcs") && use_rcs) rcs_knots else 0L,
      data       = boot_sample
    )

    # Append primary model's LASSO-selected interactions (if any).
    # Do NOT re-detect interactions per resample — that's post-hoc p-hacking.
    if (exists("interaction_terms_added") && length(interaction_terms_added) > 0) {
      boot_formula <- update(boot_formula,
        as.formula(paste(". ~ . +", paste(interaction_terms_added, collapse = " + "))))
    }
    
    # Create a local datadist for this sample to allow RCS fitting
    dd_boot <- datadist(boot_sample)
    options(datadist = "dd_boot")
    
    boot_model_fit <- tryCatch({
      # Use lrm to ensure RCS and categorical effects are handled exactly like main model
      rms::lrm(boot_formula, data = boot_sample, x = TRUE, y = TRUE)
    }, error = function(e) {
      # Fallback to glm if lrm fails (e.g. due to singularity in sample)
      # Note: glm won't handle rcs() without help, so we use the expanded formula if needed
      # but usually lrm failure means the sample is too sparse
      NULL
    })
    
    if (is.null(boot_model_fit)) {
      options(datadist = "dd")
      next
    }

    # Step 5: Calculate C-statistic on bootstrap sample (training performance)
    train_stats <- boot_model_fit$stats
    if ("C" %in% names(train_stats)) {
      boot_results$boot_c[b] <- train_stats["C"]
    }

    # Step 6: Calculate C-statistic on original data (test performance)
    # CRITICAL: Keep dd_boot active — boot_model_fit was fit under dd_boot,
    # and rms uses the datadist to resolve knot positions and reference levels.
    # Restoring dd before predict() would cause mismatched knot locations.

    test_pred_lp <- tryCatch({
      predict(boot_model_fit, newdata = test_data_imp, type = "lp")
    }, error = function(e) NULL)
    
    if (!is.null(test_pred_lp)) {
      test_pred_prob <- plogis(test_pred_lp)
      y_test <- as.numeric(test_data_imp$Was.the.procedure.cancelled.) - 1

      # Store per-patient predictions for instability analysis
      # Use the FIXED instability_test_data (not the re-imputed test_data_imp)
      # so all bootstrap predictions are comparable across the same patient set
      instability_pred_lp <- tryCatch(
        predict(boot_model_fit, newdata = instability_test_data, type = "lp"),
        error = function(e) NULL
      )
      if (!is.null(instability_pred_lp) && length(instability_pred_lp) == n_instability_patients) {
        boot_pred_matrix[, b] <- plogis(instability_pred_lp)
      }

      # Handle potential NAs in original data that weren't imputed or outcome NAs
      complete_test <- complete.cases(test_pred_prob, y_test)
      if (sum(complete_test) > 20) { # Min size for ROC
        roc_test <- tryCatch({
          pROC::roc(y_test[complete_test], test_pred_prob[complete_test], quiet = TRUE)
        }, error = function(e) NULL)
        
        if (!is.null(roc_test)) {
          boot_results$test_c[b] <- as.numeric(pROC::auc(roc_test))
        }
        
        # Step 6b: Calculate calibration slope on original data
        # Calibration slope = coefficient when regressing outcome on linear predictor
        cal_model_test <- tryCatch({
          lp_test <- test_pred_lp[complete_test]
          glm(y_test[complete_test] ~ lp_test, family = binomial())
        }, error = function(e) NULL)
        
        if (!is.null(cal_model_test)) {
          boot_results$cal_slope_test[b] <- coef(cal_model_test)[2]
        }
      }
    }

    # Step 6c: Training slope (always 1.0 by definition for training data)
    boot_results$cal_slope_train[b] <- 1.0

    # Step 7 & 8: Calculate optimism
    if (!is.na(boot_results$boot_c[b]) && !is.na(boot_results$test_c[b])) {
      boot_results$optimism[b] <- boot_results$boot_c[b] - boot_results$test_c[b]
    }
    if (!is.na(boot_results$cal_slope_train[b]) && !is.na(boot_results$cal_slope_test[b])) {
      boot_results$cal_slope_optimism[b] <- boot_results$cal_slope_train[b] - boot_results$cal_slope_test[b]
    }

    # Restore main datadist AFTER all predictions for this iteration are done
    options(datadist = "dd")

  }, error = function(e) {
    # Log first 5 errors with full message, then just count
    if (b <= 5) log_warn(paste("Bootstrap iteration", b, "failed:", e$message))
    options(datadist = "dd")  # ensure restored on error
  })

  # Progress update
  if (b %% pb_interval == 0) {
    log_info(sprintf("Full-process bootstrap: %d/%d complete (%.0f%%)",
                     b, B_FULL_PROCESS, 100 * b / B_FULL_PROCESS))
  }
}

log_info("Full-process bootstrap validation completed")

# Calculate summary statistics and check for excessive failures
valid_iterations <- sum(!is.na(boot_results$optimism))
failed_iterations <- B_FULL_PROCESS - valid_iterations
failure_rate <- failed_iterations / B_FULL_PROCESS
log_info(sprintf("Bootstrap: %d/%d iterations succeeded (%.0f%% failure rate)",
                 valid_iterations, B_FULL_PROCESS, failure_rate * 100))
if (failure_rate > 0.10) {
  log_warn(sprintf("HIGH BOOTSTRAP FAILURE RATE: %d/%d (%.0f%%) iterations failed. Results may be biased toward 'easy-to-fit' samples. Consider investigating convergence issues.",
                   failed_iterations, B_FULL_PROCESS, failure_rate * 100))
  if (failure_rate > 0.25) {
    log_warn("CRITICAL: >25%% failure rate. Bootstrap results are unreliable. Consider reducing model complexity or increasing sample size.")
  }
}
mean_optimism_full <- mean(boot_results$optimism, na.rm = TRUE)
se_optimism_full <- sd(boot_results$optimism, na.rm = TRUE) / sqrt(max(valid_iterations, 1))
mean_n_predictors <- mean(boot_results$n_predictors, na.rm = TRUE)

# Bug Fix (C-statistic Source Mismatch): The bootstrap optimism uses pROC::auc
# for both boot_c and test_c. The baseline "apparent" C must also come from
# pROC::auc on the original data for methodological consistency. Using rms
# Dxy-derived C as the baseline with pROC-derived optimism can yield "impossible"
# corrected C > apparent C due to differences in tie handling and weighting.
apparent_c_full_pROC <- tryCatch({
  # Use model's own fitted values — test_data_imp has labels_df column names
  # (e.g., Age.) but model expects nomogram_df names (e.g., Age), so using
  # newdata = test_data_imp would fail. predict() without newdata uses training data.
  orig_pred_prob <- predict(model, type = "fitted")
  y_orig <- as.numeric(model$y) - 1  # model$y is the training outcome
  cc <- complete.cases(orig_pred_prob, y_orig)
  roc_orig <- pROC::roc(y_orig[cc], orig_pred_prob[cc], quiet = TRUE)
  as.numeric(pROC::auc(roc_orig))
}, error = function(e) {
  log_warn(sprintf("pROC apparent C calculation failed: %s. Falling back to Dxy-derived C.", e$message))
  apparent_c  # Fallback to Dxy-derived C
})
apparent_c_full <- apparent_c_full_pROC
log_info(sprintf("Full-process apparent C (pROC): %.4f vs Dxy-derived: %.4f",
                 apparent_c_full_pROC, apparent_c))

# Full-process optimism-corrected C-statistic
# Guard: if all iterations failed, mean_optimism_full is NaN → fall back to standard bootstrap
if (is.nan(mean_optimism_full) || is.na(mean_optimism_full) || valid_iterations == 0) {
  log_warn(sprintf("Full-process bootstrap: %d/%d iterations succeeded. Falling back to standard bootstrap optimism.",
                   valid_iterations, B_FULL_PROCESS))
  mean_optimism_full <- optimism_c  # Use standard bootstrap optimism
  se_optimism_full <- NA_real_
}
corrected_c_full <- apparent_c_full - mean_optimism_full

# ASSERTION: Corrected C must be <= apparent C (optimism correction always reduces)
if (!is.na(corrected_c_full) && !is.na(apparent_c_full)) {
  if (corrected_c_full > apparent_c_full + 0.001) {
    log_warn(sprintf("IMPOSSIBLE: corrected C (%.3f) > apparent C (%.3f). Optimism is negative (%.3f). Capping at apparent.",
                     corrected_c_full, apparent_c_full, mean_optimism_full))
    corrected_c_full <- apparent_c_full
  }
}

# =============================================================================
# BOOTSTRAP PERCENTILE CI FOR CORRECTED C-STATISTIC
# =============================================================================
# Each bootstrap iteration yields a corrected C = apparent - optimism_b.
# The 2.5th and 97.5th percentiles of these corrected values give a CI that
# properly reflects uncertainty from both the imputation, LASSO selection,
# and model fitting process — unlike DeLong CIs which only capture sampling
# variability of the apparent C.
# Reference: Harrell FE. RMS 2nd ed. Chapter 5; Riley RD et al. (2021).
boot_corrected_c_values <- apparent_c_full - boot_results$optimism
boot_corrected_c_values <- boot_corrected_c_values[!is.na(boot_corrected_c_values)]

if (length(boot_corrected_c_values) >= 20) {
  boot_c_ci <- quantile(boot_corrected_c_values, probs = c(0.025, 0.975), na.rm = TRUE)
  corrected_c_ci_lower <- round(boot_c_ci[1], 3)
  corrected_c_ci_upper <- round(boot_c_ci[2], 3)
  log_info(sprintf("Full-process corrected C = %.3f (95%% bootstrap percentile CI: %.3f-%.3f, n=%d iterations)",
                   corrected_c_full, corrected_c_ci_lower, corrected_c_ci_upper, length(boot_corrected_c_values)))
} else {
  log_warn(sprintf("Too few valid bootstrap iterations (%d) for percentile CI; using NA", length(boot_corrected_c_values)))
  corrected_c_ci_lower <- NA_real_
  corrected_c_ci_upper <- NA_real_
}

# =============================================================================
# FULL-PROCESS CALIBRATION SLOPE (includes LASSO selection uncertainty)
# =============================================================================
valid_cal_iterations <- sum(!is.na(boot_results$cal_slope_optimism))
mean_cal_slope_optimism_full <- mean(boot_results$cal_slope_optimism, na.rm = TRUE)
mean_cal_slope_test_full <- mean(boot_results$cal_slope_test, na.rm = TRUE)

# Full-process optimism-corrected calibration slope
# Apparent slope is 1.0 by definition; corrected = 1.0 - optimism
if (valid_cal_iterations > 0) {
  corrected_cal_slope_full <- 1.0 - mean_cal_slope_optimism_full
  # Guard against NA/NaN (e.g., if calibration slope could not be estimated)
  if (is.na(corrected_cal_slope_full) || is.nan(corrected_cal_slope_full)) {
    log_warn("Calibration slope could not be estimated; defaulting to 1.0 (no shrinkage)")
    corrected_cal_slope_full <- 1.0
  }
} else {
  # Fallback to standard bootstrap calibration slope when full-process failed
  log_warn(sprintf(
    paste0("Full-process calibration slope unavailable (0/%d iterations succeeded). ",
           "Falling back to standard bootstrap calibration slope (%.3f)."),
    B_FULL_PROCESS, cal_slope_corrected
  ))
  corrected_cal_slope_full <- cal_slope_corrected
  mean_cal_slope_optimism_full <- 1.0 - cal_slope_corrected  # For reporting consistency
}

cat("=============================================================\n")

=============================================================

cat("FULL-PROCESS BOOTSTRAP VALIDATION RESULTS\n")

FULL-PROCESS BOOTSTRAP VALIDATION RESULTS

cat("(LASSO selection + model fitting within each resample)\n")

(LASSO selection + model fitting within each resample)

cat("=============================================================\n\n")

=============================================================

cat(sprintf("Bootstrap iterations completed: %d of %d (%.1f%%)\n",
            valid_iterations, B_FULL_PROCESS, 100 * valid_iterations / B_FULL_PROCESS))

Bootstrap iterations completed: 1 of 200 (0.5%)

cat(sprintf("Mean predictors selected per bootstrap: %.1f\n\n", mean_n_predictors))

Mean predictors selected per bootstrap: 12.1

cat(sprintf("Full-process calibration slope (corrected): %.3f\n", corrected_cal_slope_full))

Full-process calibration slope (corrected): 1.079

# =============================================================================
# PUBLICATION FIGURE: EPV Across Pipeline Stages
# Shows how LASSO selection improves EPV from inadequate to near-adequate
# =============================================================================

par(mar = c(6, 5, 4, 2), family = "serif")

# Collect EPV values at each pipeline stage with safe defaults
epv_stages <- data.frame(
  Stage = c("Pre-LASSO\n(All Candidates)",
            "Post-LASSO\n(Selected Predictors)",
            "Final Model\n(Effective df incl. RCS)"),
  EPV = c(
    as.numeric(epv_ratio),
    round(as.numeric(model_n_events) / n_model_predictors, 1),
    as.numeric(model_epv_actual)
  ),
  N_params = c(
    as.numeric(epv_num_predictors),
    n_model_predictors,
    model_effective_df
  ),
  stringsAsFactors = FALSE
)

bar_cols <- ifelse(epv_stages$EPV >= 10, "#27AE60",
                   ifelse(epv_stages$EPV >= 8, "#F39C12", "#E74C3C"))

bp3 <- barplot(epv_stages$EPV, names.arg = epv_stages$Stage,
               col = bar_cols, border = NA, ylim = c(0, max(epv_stages$EPV) * 1.3),
               las = 1, ylab = "Events Per Predictor (EPV)",
               main = "EPV Improvement Through LASSO Variable Selection",
               cex.names = 0.85)

# Value labels
text(bp3, epv_stages$EPV + 0.3,
     sprintf("%.1f\n(%d params)", epv_stages$EPV, epv_stages$N_params),
     cex = 0.95, font = 2)

# Reference lines
abline(h = 10, lty = 2, col = "#27AE60", lwd = 2)
text(max(bp3) + 0.3, 10.3, "Peduzzi minimum (10)", cex = 0.85, col = "#27AE60", pos = 2)
abline(h = 20, lty = 3, col = "#3498DB", lwd = 1)
text(max(bp3) + 0.3, 20.3, "Comfortable (20)", cex = 0.75, col = "#3498DB", pos = 2)

# Annotation arrow
arrows(bp3[1], epv_stages$EPV[1] + 1.5, bp3[2], epv_stages$EPV[2] - 0.5,
       col = "#2C3E50", lwd = 2, length = 0.12)
text(mean(bp3[1:2]), mean(epv_stages$EPV[1:2]) + 1,
     sprintf("LASSO: %d → %d predictors", epv_stages$N_params[1], epv_stages$N_params[2]),
     cex = 0.85, font = 3, col = "#2C3E50")

# Legend
legend("topright",
       legend = c("EPV >= 10 (adequate)", "EPV 8-10 (marginal)", "EPV < 8 (low)"),
       fill = c("#27AE60", "#F39C12", "#E74C3C"),
       border = NA, bty = "n", cex = 0.9)
Figure: Events Per Predictor (EPV) Across the Modeling Pipeline

Figure: Events Per Predictor (EPV) Across the Modeling Pipeline

# Calculate the difference in optimism estimates (with NA protection)
optimism_difference <- if (!is.na(mean_optimism_full) && !is.na(optimism_c)) {
  mean_optimism_full - optimism_c
} else {
  NA
}
optimism_ratio <- if (!is.na(mean_optimism_full) && !is.na(optimism_c) && optimism_c > 0) {
  mean_optimism_full / optimism_c
} else {
  NA
}

cat("\n### Interpretation of Full-Process Validation\n\n")

Interpretation of Full-Process Validation

# Check if we have valid estimates to interpret
if (is.na(mean_optimism_full) || is.na(optimism_c) || is.na(corrected_c_full)) {
  cat("**Note:** Full-process validation could not compute reliable optimism estimates. ")
  cat("This may occur when bootstrap iterations fail due to separation or convergence issues. ")
  cat("The standard bootstrap validation results should be used with the understanding that ")
  cat("they may underestimate the true optimism from variable selection.\n\n")
} else if (mean_optimism_full > optimism_c * 1.2) {
  cat(sprintf("**Finding:** The full-process optimism (%.3f) is substantially larger than the standard bootstrap optimism (%.3f), differing by %.3f.\n\n",
              mean_optimism_full, optimism_c, optimism_difference))
  cat("This indicates that the standard validation **underestimated** the true optimism by not accounting for LASSO variable selection. ")
  cat("The full-process corrected C-statistic of **", sprintf("%.3f", corrected_c_full), "** is a more realistic estimate of model performance on new data.\n\n")
} else if (mean_optimism_full < optimism_c * 0.8) {
  cat(sprintf("**Finding:** The full-process optimism (%.3f) is smaller than the standard bootstrap optimism (%.3f).\n\n",
              mean_optimism_full, optimism_c))
  cat("This unusual finding may indicate that LASSO selection is providing stable variable selection across resamples. ")
  cat("The corrected C-statistic of **", sprintf("%.3f", corrected_c_full), "** should be used.\n\n")
} else {
  cat(sprintf("**Finding:** The full-process optimism (%.3f) is similar to the standard bootstrap optimism (%.3f).\n\n",
              mean_optimism_full, optimism_c))
  cat("This suggests that LASSO is selecting a consistent set of predictors across bootstrap samples, ")
  cat("and the additional optimism from variable selection is modest. ")
  cat("The corrected C-statistic of **", sprintf("%.3f", corrected_c_full), "** represents a reliable estimate.\n\n")
}

Finding: The full-process optimism (-0.012) is smaller than the standard bootstrap optimism (0.016).

This unusual finding may indicate that LASSO selection is providing stable variable selection across resamples. The corrected C-statistic of ** 0.756 ** should be used.

cat("**Reference:** Harrell FE Jr. *Regression Modeling Strategies*. 2nd ed. Springer; 2015. ")

Reference: Harrell FE Jr. Regression Modeling Strategies. 2nd ed. Springer; 2015.

cat("Chapter 5: 'When resampling is used to repeat all modeling steps for each resample, ")

Chapter 5: ’When resampling is used to repeat all modeling steps for each resample,

cat("rigorous internal validation tests the *process* used to develop the model.'\n")

rigorous internal validation tests the process used to develop the model.’

# Visualize the bootstrap distribution
par(mfrow = c(1, 2), mar = c(5, 4, 4, 2))

# Plot 1: Optimism distribution
valid_optimism <- boot_results$optimism[!is.na(boot_results$optimism)]
if (length(valid_optimism) > 10) {
  hist(valid_optimism,
       breaks = 20,
       main = "Distribution of Optimism\n(Full-Process Bootstrap)",
       xlab = "Optimism (Training C - Test C)",
       col = "#3498DB",
       border = "white")
  abline(v = mean_optimism_full, col = "red", lwd = 2, lty = 2)
  abline(v = optimism_c, col = "orange", lwd = 2, lty = 3)
  legend("topright",
         legend = c(paste0("Full-process mean: ", round(mean_optimism_full, 3)),
                    paste0("Standard bootstrap: ", round(optimism_c, 3))),
         col = c("red", "orange"),
         lty = c(2, 3),
         lwd = 2,
         cex = 0.8,
         bty = "n")
}

# Plot 2: Number of predictors selected
valid_npred <- boot_results$n_predictors[!is.na(boot_results$n_predictors)]
if (length(valid_npred) > 10) {
  hist(valid_npred,
       breaks = seq(min(valid_npred) - 0.5, max(valid_npred) + 0.5, by = 1),
       main = "Predictors Selected per Bootstrap",
       xlab = "Number of LASSO-Selected Predictors",
       col = "#27AE60",
       border = "white")
  abline(v = length(selected_predictors_lasso), col = "red", lwd = 2, lty = 2)
  legend("topright",
         legend = c(paste0("Original model: ", length(selected_predictors_lasso)),
                    paste0("Bootstrap mean: ", round(mean_n_predictors, 1))),
         col = c("red", "black"),
         lty = c(2, NA),
         pch = c(NA, 15),
         lwd = 2,
         cex = 0.8,
         bty = "n")
}

par(mfrow = c(1, 1))
Distribution of optimism and selected predictors across bootstrap resamples

Distribution of optimism and selected predictors across bootstrap resamples

Bootstrap Prediction Instability Analysis

Following Riley & Collins (Biometrical Journal, 2023), this section quantifies how much individual patient predictions vary when the entire model-building process (LASSO selection + lrm fitting) is repeated on bootstrap resamples. Prediction instability is particularly relevant when EPV is low (current: 8.2 events per model degree of freedom, or 6.8 per predictor variable), as small samples amplify the influence of individual observations on both variable selection and coefficient estimation.

The instability analysis addresses four levels:

  1. Variable selection stability — How consistently does LASSO select the same predictors?
  2. Individual prediction instability — For a given patient, how much does their predicted risk vary?
  3. MAPE instability index — Summary metric of prediction variability across all patients
  4. Calibration instability — How stable are the calibration curves across bootstrap models?
# =============================================================================
# VARIABLE SELECTION STABILITY (Riley & Collins 2023, Level 1)
# How consistently does LASSO select the same predictors across resamples?
# =============================================================================

# Collect all unique variable names across all bootstrap iterations
valid_selections <- boot_var_selections[!sapply(boot_var_selections, is.null)]
n_valid_boots <- length(valid_selections)

if (n_valid_boots > 10) {
  all_selected_vars <- unlist(valid_selections)
  var_freq_table <- sort(table(all_selected_vars), decreasing = TRUE)
  var_inclusion_pct <- round(100 * var_freq_table / n_valid_boots, 1)

  # Create variable inclusion frequency data frame
  var_stability_df <- data.frame(
    Variable = names(var_inclusion_pct),
    Inclusion_Pct = as.numeric(var_inclusion_pct),
    In_Original = names(var_inclusion_pct) %in% selected_predictors_lasso,
    stringsAsFactors = FALSE
  )

  # Plot variable inclusion frequencies
  par(mar = c(5, 12, 4, 2))
  n_vars_to_show <- min(20, nrow(var_stability_df))
  plot_data <- var_stability_df[1:n_vars_to_show, ]

  bar_colors <- ifelse(plot_data$In_Original, "#2980B9", "#E74C3C")
  barplot(rev(plot_data$Inclusion_Pct),
          names.arg = rev(plot_data$Variable),
          horiz = TRUE,
          las = 1,
          col = rev(bar_colors),
          border = "white",
          xlim = c(0, 100),
          xlab = "Bootstrap Inclusion Frequency (%)",
          main = paste0("Variable Selection Stability (B = ", n_valid_boots, " resamples)"),
          cex.names = 0.7)
  abline(v = 50, lty = 2, col = "gray40", lwd = 2)
  legend("bottomright",
         legend = c("In original model", "Not in original model", "50% threshold"),
         fill = c("#2980B9", "#E74C3C", NA),
         border = c("white", "white", NA),
         lty = c(NA, NA, 2),
         col = c(NA, NA, "gray40"),
         lwd = c(NA, NA, 2),
         cex = 0.8,
         bty = "n")
  par(mar = c(5, 4, 4, 2))

  # Summary statistics
  n_stable <- sum(var_stability_df$Inclusion_Pct >= 50)
  n_unstable <- sum(var_stability_df$Inclusion_Pct < 50 & var_stability_df$In_Original)

  cat(sprintf("\n\n**Variable Selection Summary:**\n\n"))
  cat(sprintf("- Variables appearing in >=50%% of resamples: **%d**\n", n_stable))
  cat(sprintf("- Original model variables appearing in <50%% of resamples: **%d**\n", n_unstable))
  if (n_unstable > 0) {
    unstable_vars <- var_stability_df$Variable[var_stability_df$Inclusion_Pct < 50 & var_stability_df$In_Original]
    cat(sprintf("- Unstable original predictors: %s\n", paste(unstable_vars, collapse = ", ")))
  }
} else {
  cat("Insufficient valid bootstrap iterations for variable selection stability analysis.\n")
  # Initialize defaults so downstream chunks don't error on undefined variables
  n_stable <- NA_integer_
  n_unstable <- NA_integer_
  var_stability_df <- data.frame(Variable = character(), Inclusion_Pct = numeric(),
                                  In_Original = logical(), stringsAsFactors = FALSE)
}
Variable inclusion frequency across bootstrap resamples. Variables selected in fewer than 50 percent of resamples (dashed line) are considered unstable selections.

Variable inclusion frequency across bootstrap resamples. Variables selected in fewer than 50 percent of resamples (dashed line) are considered unstable selections.

Variable Selection Summary:

  • Variables appearing in >=50% of resamples: 11
  • Original model variables appearing in <50% of resamples: 0
# Display detailed variable inclusion frequency table
if (n_valid_boots > 10 && nrow(var_stability_df) > 0) {
  kable(var_stability_df,
        col.names = c("Variable", "Inclusion (%)", "In Original Model"),
        caption = paste0("Variable Inclusion Frequency Across ", n_valid_boots,
                         " Bootstrap Resamples (Riley & Collins 2023)"),
        align = c("l", "c", "c")) %>%
    kable_styling(bootstrap_options = c("striped", "hover"),
                  full_width = FALSE,
                  font_size = 12) %>%
    row_spec(which(var_stability_df$Inclusion_Pct < 50 & var_stability_df$In_Original),
             bold = TRUE, color = "#E74C3C")
}
Variable Inclusion Frequency Across 200 Bootstrap Resamples (Riley & Collins 2023)
Variable Inclusion (%) In Original Model
Age. 100.0 TRUE
Does.the.patient.have.a.h.o.recurrent.UTIs.Yes 95.0 TRUE
POP.Q.stage. 77.5 TRUE
BMI 65.0 FALSE
Does.the.patient.have.diabetes.No 65.0 FALSE
UDI_6 61.0 FALSE
urodynamics_reasonevaluation of oab 55.0 TRUE
urodynamics_reasonevaluation of voiding dysfunction 55.0 FALSE
Does.the.patient.have.diabetes.Yes 54.5 FALSE
Tobacco.use.Current tobacco user 53.0 FALSE
Tobacco.use.Former tobacco user 50.5 FALSE
urodynamics_reasonother 48.5 FALSE
Average.number.of.voids.at.night. 45.0 FALSE
Does.the.patient.have.OAB.Yes 45.0 FALSE
x1st_desire 44.0 FALSE
Is.the.patient.on.vaginal.estrogen.Yes 43.5 FALSE
POPDI_6 42.5 FALSE
urodynamics_reasonpre op for sling 40.0 FALSE
CRADI_8 37.0 FALSE
What.was.the.reason.for.the.urodynamics.Unchecked 33.0 FALSE
urodynamics_reasonUnknown 30.0 FALSE
Menopause.status.Unclear 27.0 FALSE
PFDI_20_total 25.5 FALSE
Menopause.status.Post-menopausal 15.5 FALSE
urodynamics_reasonevaluation of neurogenic bladder 1.0 FALSE
# =============================================================================
# PREDICTION INSTABILITY PLOT (Riley & Collins 2023, Level 4)
# For each patient, how much does predicted probability vary across bootstraps?
# =============================================================================

# Get original model predictions for all patients
# IMPORTANT: Use the SAME fixed instability_test_data that was used for bootstrap
# predictions so that per-patient comparisons are row-aligned
orig_pred <- tryCatch({
  plogis(predict(model, newdata = instability_test_data, type = "lp"))
}, error = function(e) {
  # Fallback: use model's own linear predictors
  # This may have different row count than boot_pred_matrix, which the check below catches
  plogis(model$linear.predictors)
})

# Count valid (non-NA) columns in the prediction matrix
n_valid_pred_boots <- sum(colSums(!is.na(boot_pred_matrix)) > 0)

if (n_valid_pred_boots >= 10 && length(orig_pred) == nrow(boot_pred_matrix)) {

  # Calculate per-patient instability metrics
  patient_pred_sd <- apply(boot_pred_matrix, 1, sd, na.rm = TRUE)
  patient_pred_q025 <- apply(boot_pred_matrix, 1, quantile, probs = 0.025, na.rm = TRUE)
  patient_pred_q975 <- apply(boot_pred_matrix, 1, quantile, probs = 0.975, na.rm = TRUE)
  patient_pred_median <- apply(boot_pred_matrix, 1, median, na.rm = TRUE)

  # Per-patient MAPE: median absolute difference from original prediction
  patient_mape <- sapply(seq_len(nrow(boot_pred_matrix)), function(i) {
    boot_preds_i <- boot_pred_matrix[i, !is.na(boot_pred_matrix[i, ])]
    if (length(boot_preds_i) > 0) median(abs(boot_preds_i - orig_pred[i])) else NA_real_
  })

  # --- Plot 1: Prediction Instability (bootstrap predictions vs original) ---
  par(mfrow = c(2, 2), mar = c(5, 5, 4, 2))

  plot(NULL, xlim = c(0, max(0.5, max(orig_pred, na.rm = TRUE) * 1.2)),
       ylim = c(0, max(0.5, max(boot_pred_matrix, na.rm = TRUE) * 1.2)),
       xlab = "Predicted Probability (Original Model)",
       ylab = "Predicted Probability (Bootstrap Models)",
       main = "A. Prediction Instability")

  # Plot each bootstrap iteration's predictions as semi-transparent points
  for (b_idx in seq_len(ncol(boot_pred_matrix))) {
    if (sum(!is.na(boot_pred_matrix[, b_idx])) > 0) {
      points(orig_pred, boot_pred_matrix[, b_idx],
             pch = 16, cex = 0.15, col = rgb(0.5, 0.5, 0.5, 0.05))
    }
  }

  # Add diagonal (perfect stability)
  abline(0, 1, col = "#E74C3C", lwd = 2)

  # Add smoothed 95% bootstrap intervals (filter NAs for lowess)
  q_valid <- !is.na(patient_pred_q025) & !is.na(patient_pred_q975) & is.finite(orig_pred)
  if (sum(q_valid) > 10) {
    ord <- order(orig_pred[q_valid])
    lo_smooth <- lowess(orig_pred[q_valid][ord], patient_pred_q025[q_valid][ord], f = 0.3)
    hi_smooth <- lowess(orig_pred[q_valid][ord], patient_pred_q975[q_valid][ord], f = 0.3)
    lines(lo_smooth, col = "#2980B9", lwd = 2, lty = 2)
    lines(hi_smooth, col = "#2980B9", lwd = 2, lty = 2)
  }
  legend("topleft",
         legend = c("Perfect stability (y = x)", "95% bootstrap interval"),
         col = c("#E74C3C", "#2980B9"),
         lty = c(1, 2), lwd = 2, cex = 0.8, bty = "n")

  # --- Plot 2: MAPE instability index by predicted risk ---
  # Filter to non-NA MAPE values for plotting and lowess
  mape_valid <- !is.na(patient_mape) & is.finite(patient_mape)
  plot(orig_pred[mape_valid], patient_mape[mape_valid],
       pch = 16, cex = 0.6, col = rgb(0.2, 0.4, 0.8, 0.4),
       xlab = "Predicted Probability (Original Model)",
       ylab = "Instability Index (MAPE)",
       main = "B. Per-Patient Instability Index")
  if (sum(mape_valid) > 10) {
    mape_smooth <- lowess(orig_pred[mape_valid], patient_mape[mape_valid], f = 0.3)
    lines(mape_smooth, col = "#E74C3C", lwd = 2)
  }
  abline(h = median(patient_mape, na.rm = TRUE), lty = 2, col = "gray40")
  legend("topright",
         legend = c("Lowess smooth", paste0("Median MAPE = ",
                    round(median(patient_mape, na.rm = TRUE), 4))),
         col = c("#E74C3C", "gray40"),
         lty = c(1, 2), lwd = 2, cex = 0.8, bty = "n")

  # --- Plot 3: 95% CI width by predicted risk ---
  ci_width <- patient_pred_q975 - patient_pred_q025
  ci_valid <- !is.na(ci_width) & is.finite(ci_width)
  plot(orig_pred[ci_valid], ci_width[ci_valid],
       pch = 16, cex = 0.6, col = rgb(0.2, 0.6, 0.3, 0.4),
       xlab = "Predicted Probability (Original Model)",
       ylab = "95% Bootstrap Interval Width",
       main = "C. Prediction Uncertainty")
  if (sum(ci_valid) > 10) {
    ci_smooth <- lowess(orig_pred[ci_valid], ci_width[ci_valid], f = 0.3)
    lines(ci_smooth, col = "#27AE60", lwd = 2)
  }
  abline(h = median(ci_width, na.rm = TRUE), lty = 2, col = "gray40")

  # --- Plot 4: Histogram of global MAPE ---
  boot_mape_per_iter <- apply(boot_pred_matrix, 2, function(col) {
    valid <- !is.na(col)
    if (sum(valid) > 0) mean(abs(col[valid] - orig_pred[valid])) else NA_real_
  })
  valid_boot_mape <- boot_mape_per_iter[!is.na(boot_mape_per_iter)]
  if (length(valid_boot_mape) > 5) {
    hist(valid_boot_mape,
         breaks = 20,
         col = "#F39C12",
         border = "white",
         main = "D. Distribution of MAPE Across Bootstraps",
         xlab = "Mean Absolute Prediction Error (MAPE)")
    abline(v = mean(valid_boot_mape), col = "#E74C3C", lwd = 2, lty = 2)
    legend("topright",
           legend = paste0("Global MAPE = ", round(mean(valid_boot_mape), 4)),
           col = "#E74C3C", lty = 2, lwd = 2, cex = 0.8, bty = "n")
  }

  par(mfrow = c(1, 1))

  # --- Store summary metrics ---
  global_mape <- mean(patient_mape, na.rm = TRUE)
  median_mape <- median(patient_mape, na.rm = TRUE)
  global_ci_width <- median(ci_width, na.rm = TRUE)
  max_mape <- max(patient_mape, na.rm = TRUE)
  pct_wide_ci <- mean(ci_width > 0.10, na.rm = TRUE) * 100

} else {
  cat("Insufficient valid bootstrap predictions for instability analysis.\n")
  global_mape <- NA_real_
  median_mape <- NA_real_
  global_ci_width <- NA_real_
  max_mape <- NA_real_
  pct_wide_ci <- NA_real_
}

Insufficient valid bootstrap predictions for instability analysis.

# =============================================================================
# INSTABILITY SUMMARY TABLE AND INTERPRETATION
# =============================================================================

if (!is.na(global_mape)) {
  instability_summary <- data.frame(
    Metric = c(
      "Global MAPE (instability index)",
      "Median per-patient MAPE",
      "Maximum per-patient MAPE",
      "Median 95% bootstrap interval width",
      "Patients with 95% CI width > 10 percentage points",
      "Valid bootstrap iterations",
      "Variables selected in >= 50% of resamples",
      "Mean predictors per bootstrap"
    ),
    Value = c(
      sprintf("%.4f", global_mape),
      sprintf("%.4f", median_mape),
      sprintf("%.4f", max_mape),
      sprintf("%.4f", global_ci_width),
      sprintf("%.1f%%", pct_wide_ci),
      sprintf("%d / %d", n_valid_pred_boots, B_FULL_PROCESS),
      if (!is.na(n_stable)) sprintf("%d / %d", n_stable, nrow(var_stability_df)) else "N/A",
      sprintf("%.1f", mean_n_predictors)
    ),
    stringsAsFactors = FALSE
  )

  kable(instability_summary,
        col.names = c("Instability Metric", "Value"),
        caption = "Prediction Instability Summary (Riley & Collins, Biometrical Journal 2023)",
        align = c("l", "c")) %>%
    kable_styling(bootstrap_options = c("striped", "hover"),
                  full_width = FALSE,
                  font_size = 12)

  cat("\n\n### Interpretation\n\n")

  if (global_mape < 0.02) {
    cat("**Finding:** The global MAPE of ", sprintf("%.4f", global_mape),
        " indicates **excellent prediction stability**. ")
    cat("Individual patient predictions are highly consistent across bootstrap resamples, ")
    cat("suggesting the model is well-calibrated for the available sample size.\n\n")
  } else if (global_mape < 0.05) {
    cat("**Finding:** The global MAPE of ", sprintf("%.4f", global_mape),
        " indicates **acceptable prediction stability**. ")
    cat("While there is some variability in individual predictions across resamples, ")
    cat("the overall instability is modest and within acceptable bounds for clinical use.\n\n")
  } else {
    cat("**Finding:** The global MAPE of ", sprintf("%.4f", global_mape),
        " indicates **notable prediction instability**. ")
    cat("This is expected given the low EPV of ", round(n_events / length(selected_predictors), 1),
        " and the use of data-driven variable selection. ")
    cat("Individual patient predictions vary meaningfully across bootstrap resamples, ")
    cat("particularly for patients with higher predicted risks. ")
    cat("This instability should be disclosed when reporting model performance and considered ")
    cat("when interpreting individual patient predictions.\n\n")
  }

  cat("**Contextual factors contributing to instability:**\n\n")
  cat("- EPV = ", round(n_events / length(selected_predictors), 1),
      " (below the recommended minimum of 10)\n")
  cat("- LASSO variable selection introduces additional variability across resamples\n")
  cat(sprintf("- LASSO selected different variable sets across resamples (mean %.1f predictors, range %d-%d)\n",
              mean_n_predictors,
              min(boot_results$n_predictors, na.rm = TRUE),
              max(boot_results$n_predictors, na.rm = TRUE)))
  cat("\n**Reference:** Riley RD, Collins GS. Stability of clinical prediction models developed using statistical or machine learning methods. *Biometrical Journal*. 2023;65(8):e2200302. doi:10.1002/bimj.202200302\n")
}

PART 7: Calibration & Shrinkage (Primary Model → Deployed Model)

This section applies shrinkage adjustment to correct for overfitting identified in the validation phase. The shrinkage-adjusted coefficients should be used for clinical deployment.

Shrinkage-Adjusted Model for Deployment (Deployed Model)

Per Harrell’s Regression Modeling Strategies (Chapter 5), the calibration slope from bootstrap validation quantifies prediction accuracy: slopes below 1.0 indicate overfitting (predictions too extreme), while slopes above 1.0 indicate predictions are too conservative. The full-process calibration slope of 1.079 indicates that LASSO regularization already shrunk the coefficients beyond what the data warrant — the model’s predictions are slightly too conservative. We apply the calibration slope as a correction factor (> 1.0) to appropriately rescale coefficients.

The shrinkage adjustment procedure:

  1. Multiply all regression coefficients (except intercept) by the calibration shrinkage factor
  2. Re-estimate the intercept to preserve the overall predicted probability
# =============================================================================
# APPLY FULL-PROCESS SHRINKAGE ADJUSTMENT TO MODEL COEFFICIENTS
# Reference: Harrell FE. Regression Modeling Strategies. 2nd ed. Chapter 5.
# "When slope < 1, predictions are too extreme; shrink coefficients toward 0"
# =============================================================================

cat("## Full-Process Shrinkage Adjustment\n\n")

Full-Process Shrinkage Adjustment

# Use the full-process calibration slope as the shrinkage factor
# This is MORE conservative than the standard bootstrap slope because it
# accounts for the additional optimism from LASSO variable selection
shrinkage_factor <- final_cal_slope_full_process

# Defensive check: ensure shrinkage_factor is valid (not NA/NaN)
if (is.na(shrinkage_factor) || is.nan(shrinkage_factor)) {
  log_warn("Shrinkage factor is NA/NaN. Falling back to standard bootstrap calibration slope.")
  shrinkage_factor <- cal_slope_corrected
}

# Final fallback: if still invalid, use 1.0 (no shrinkage)
if (is.na(shrinkage_factor) || is.nan(shrinkage_factor)) {
  log_warn("Standard calibration slope also unavailable. Using shrinkage factor of 1.0 (no shrinkage).")
  shrinkage_factor <- 1.0
}

cat(sprintf("**Shrinkage Factor (Full-Process Calibration Slope):** %.3f\n\n", shrinkage_factor))

Shrinkage Factor (Full-Process Calibration Slope): 1.079

# Interpretation of shrinkage factor
if (shrinkage_factor > 1.0) {
  cat("**Interpretation:** A calibration slope of ", round(shrinkage_factor, 3),
      " (> 1.0) indicates the model's predictions are **too conservative** — ")
  cat("LASSO regularization already shrunk coefficients beyond what the data warrant. ")
  cat("This is expected when LASSO applies L1 penalty shrinkage during variable selection, ")
  cat("so the full-process bootstrap finds that predictions are slightly too compressed. ")
  cat("Applying this factor (> 1.0) rescales coefficients upward to restore proper calibration.\n\n")
} else if (shrinkage_factor < 0.7) {
  cat("**Interpretation:** A shrinkage factor of ", round(shrinkage_factor, 3),
      " indicates **substantial overfitting**. The model's predictions are too extreme ")
  cat("and need significant correction. This degree of overfitting is common when:\n\n")
  cat("- Sample size is limited relative to the number of candidate predictors\n")
  cat("- Data-driven variable selection (LASSO) was used\n")
  cat("- Events per variable ratio is low (current: ",
      round(n_events / length(selected_predictors), 1), ")\n\n")
} else if (shrinkage_factor < 0.9) {
  cat("**Interpretation:** A shrinkage factor of ", round(shrinkage_factor, 3),
      " indicates **moderate overfitting**. Shrinkage adjustment is recommended.\n\n")
} else {
  cat("**Interpretation:** A shrinkage factor of ", round(shrinkage_factor, 3),
      " indicates **minimal overfitting**. Shrinkage adjustment is optional.\n\n")
}

Interpretation: A calibration slope of 1.079 (> 1.0) indicates the model’s predictions are too conservative — LASSO regularization already shrunk coefficients beyond what the data warrant. This is expected when LASSO applies L1 penalty shrinkage during variable selection, so the full-process bootstrap finds that predictions are slightly too compressed. Applying this factor (> 1.0) rescales coefficients upward to restore proper calibration.

# Get original model coefficients
original_coefs <- coef(model)
original_intercept <- original_coefs[1]
original_slopes <- original_coefs[-1]

# Apply shrinkage to slope coefficients (NOT to intercept)
shrunk_slopes <- original_slopes * shrinkage_factor

# =============================================================================
# RE-ESTIMATE INTERCEPT TO PRESERVE MARGINAL CALIBRATION
# The intercept must be adjusted so that mean predicted probability matches
# the observed event rate in the data
# =============================================================================

# Calculate linear predictor with original intercept and shrunk slopes
lp_original <- predict(model, type = "lp")  # Original linear predictor

# New linear predictor with shrunk slopes (before intercept adjustment)
# LP_new = intercept_new + shrinkage_factor * (LP_original - intercept_original)
lp_shrunk_no_intercept <- shrinkage_factor * (lp_original - original_intercept)

# Calculate the new intercept to preserve the observed event rate
# E[Y] = E[plogis(intercept_new + LP_shrunk)]
# Solve: find intercept_new such that mean(plogis(intercept_new + LP_shrunk)) = observed_event_rate
observed_event_rate <- mean(as.numeric(model$y) - 1)  # lrm codes as 1/2

# Use optimization to find the new intercept
find_new_intercept <- function(new_int) {
  mean_pred <- mean(plogis(new_int + lp_shrunk_no_intercept))
  (mean_pred - observed_event_rate)^2
}

new_intercept <- optimize(find_new_intercept, interval = c(-10, 10))$minimum

cat("### Coefficient Adjustment\n\n")

Coefficient Adjustment

# Create comparison table
shrinkage_comparison <- data.frame(
  Coefficient = c("Intercept", names(original_slopes)),
  Original = c(sprintf("%.4f", original_intercept), sprintf("%.4f", original_slopes)),
  Shrunk = c(sprintf("%.4f", new_intercept), sprintf("%.4f", shrunk_slopes)),
  Change_Pct = c(
    sprintf("%.1f%%", (new_intercept - original_intercept) / abs(original_intercept) * 100),
    sprintf("%.1f%%", (shrunk_slopes - original_slopes) / abs(original_slopes) * 100)
  ),
  stringsAsFactors = FALSE
)

kable(shrinkage_comparison,
      col.names = c("Coefficient", "Original",
                    paste0("Shrunk (×", round(shrinkage_factor, 3), ")"),
                    "% Change"),
      caption = paste0("Model Coefficients Before and After Shrinkage Adjustment ",
                       "(Shrinkage Factor = ", round(shrinkage_factor, 3), ")"),
      align = c("l", "r", "r", "r")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE,
                font_size = 12) %>%
  row_spec(1, italic = TRUE, background = "#F5F5F5") %>%
  column_spec(3, bold = TRUE, color = "#2E86AB")
Model Coefficients Before and After Shrinkage Adjustment (Shrinkage Factor = 1.079)
Coefficient Original Shrunk (×1.079) % Change
Intercept -7.6127 -8.0310 -5.5%
Age 0.0668 0.0721 7.9%
POPQ_Stage -0.2461 -0.2656 -7.9%
BMI 0.0282 0.0305 7.9%
Recurrent_UTIs=Yes 0.7791 0.8406 7.9%
Diabetes=Yes -0.2378 -0.2566 -7.9%
# Store the shrunk coefficients as a named vector
shrunk_coefs_full <- c(new_intercept, shrunk_slopes)
names(shrunk_coefs_full) <- names(original_coefs)

log_info(sprintf("Applied shrinkage factor %.3f to model coefficients", shrinkage_factor))
log_info(sprintf("New intercept: %.4f (original: %.4f)", new_intercept, original_intercept))

Export Model as R Code (Function)

The rms::Function() tool exports the fitted model as standalone R code that can be used for predictions outside of the rms framework. This is particularly useful for: - Shiny application deployment - Integration with other systems - Sharing models with collaborators who may not have rms installed

# =============================================================================
# EXPORT SHRINKAGE-ADJUSTED MODEL FUNCTION
# For clinical deployment with calibrated predictions
# =============================================================================

log_info("Exporting shrinkage-adjusted model function...")

tryCatch({
  # Create shrinkage-adjusted prediction function manually
  cat("\n=== Shrinkage-Adjusted Model Function (Recommended for Deployment) ===\n\n")

  cat("```r\n")
  cat("# Shrinkage-adjusted prediction function\n")
  cat(sprintf("# Shrinkage factor: %.4f\n\n", shrinkage_factor))

  cat("predict_cancellation_risk <- function(")

  # Get predictor names (excluding intercept)
  pred_names <- names(shrunk_coefs_full)[-1]
  pred_names_clean <- gsub("rcs\\(|, [0-9]+\\)'?", "", pred_names)
  pred_names_clean <- gsub("\\.", "_", pred_names_clean)
  pred_names_clean <- make.names(unique(substr(pred_names_clean, 1, 20)))

  # Print function signature
  cat(paste(pred_names_clean[1:min(3, length(pred_names_clean))], collapse = ", "))
  if (length(pred_names_clean) > 3) cat(", ...")
  cat(") {\n")

  cat(sprintf("  # Intercept (shrinkage-adjusted)\n"))
  cat(sprintf("  lp <- %.6f\n\n", shrunk_coefs_full[1]))
  cat("  # Add predictor contributions\n")
  cat("  # (coefficients already include shrinkage adjustment)\n")
  cat("  # lp <- lp + coef1 * predictor1 + coef2 * predictor2 + ...\n\n")
  cat("  # Convert to probability\n")
  cat("  probability <- 1 / (1 + exp(-lp))\n")
  cat("  return(probability)\n")
  cat("}\n")
  cat("```\n\n")

  cat("**Note:** The full function with all predictor coefficients is saved to `output/model_function_shrunk.R`\n")

  log_info("Shrinkage-adjusted function export complete")

}, error = function(e) {
  log_warn(paste("Shrinkage-adjusted function export failed:", e$message))
})

=== Shrinkage-Adjusted Model Function (Recommended for Deployment) ===

# Shrinkage-adjusted prediction function
# Shrinkage factor: 1.0789

predict_cancellation_risk <- function(Age, POPQ_Stage, BMI, ...) {
  # Intercept (shrinkage-adjusted)
  lp <- -8.030983

  # Add predictor contributions
  # (coefficients already include shrinkage adjustment)
  # lp <- lp + coef1 * predictor1 + coef2 * predictor2 + ...

  # Convert to probability
  probability <- 1 / (1 + exp(-lp))
  return(probability)
}

Note: The full function with all predictor coefficients is saved to output/model_function_shrunk.R

Benefits of Function Export: - Portability: Model can be used without loading rms package - Transparency: All coefficients are visible in plain R code - Integration: Easy to embed in Shiny apps, APIs, or other systems - Reproducibility: Self-contained prediction capability

Clinical Prediction Nomogram (Deployed Model)

The following nomogram uses shrinkage-adjusted coefficients derived from full-process bootstrap validation. This nomogram accounts for overfitting from both model fitting and LASSO variable selection, producing properly calibrated predictions for new patients.

# =============================================================================
# SHRINKAGE-ADJUSTED NOMOGRAM FOR DEPLOYMENT
# Uses the shrunk coefficients to produce properly calibrated predictions
# =============================================================================

log_info("Creating shrinkage-adjusted nomogram for deployment...")

# Create a copy of the model with shrunk coefficients
model_for_nomogram <- deployed_model
model_for_nomogram$coefficients <- shrunk_coefs_full

# Get model variable names
model_vars <- names(deployed_model$Design$units)

# Build nomogram call dynamically
nomogram_shrunk_args <- list(
  model_for_nomogram,
  fun = plogis,
  fun.at = c(0.02, 0.05, 0.1, 0.15, 0.2, 0.3),  # Adjusted for shrunk predictions
  lp = FALSE,
  funlabel = "Probability of Cancellation",
  nint = 5
)

# Add BMI tick marks if in model
bmi_var <- grep("^BMI$|^bmi$", model_vars, value = TRUE, ignore.case = TRUE)
if (length(bmi_var) > 0) {
  nomogram_shrunk_args[[bmi_var[1]]] <- c(20, 30, 40, 50, 60)
}

# Add Age tick marks if in model
age_var <- grep("^Age|^age", model_vars, value = TRUE, ignore.case = TRUE)
if (length(age_var) > 0) {
  nomogram_shrunk_args[[age_var[1]]] <- seq(20, 90, by = 10)
}

# Generate shrinkage-adjusted nomogram
nomogram_shrunk <- do.call(rms::nomogram, nomogram_shrunk_args)

# Set up plot with clean formatting
par(mar = c(4, 5, 4, 4),
    mgp = c(3, 1, 0),
    tcl = -0.5,
    las = 1,
    font.lab = 2,
    lwd = 2)

# Plot with publication styling
plot(nomogram_shrunk,
     cex.axis = 1.1,
     cex.var = 1.5,
     col.grid = gray(0.90),
     col.axis = "gray10",
     lmgp = 0.35,
     xfrac = 0.20,
     label.every = 1,
     force.label = FALSE,
     lwd = 2,
     col = "navy")

# Add prominent title
title(main = "Clinical Prediction Nomogram",
      font.main = 2,
      cex.main = 2.2,
      col.main = "#1a1a2e",
      line = 1.5)

# Add subtitle with shrinkage factor
mtext(sprintf("Urodynamic Procedure Cancellation Risk (Calibrated, Shrinkage Factor = %.3f)", shrinkage_factor),
      side = 3,
      line = 0.3,
      cex = 1.3,
      col = "#2E7D32",
      font = 2)

# Add abbreviation legend
mtext("NB=Neurogenic Bladder | OAB=Overactive Bladder | POP=Prolapse | VD=Voiding Dysfunction | UI=Incontinence",
      side = 1,
      line = 2.5,
      cex = 0.95,
      col = "#444444",
      font = 1)

# Add note about calibration
mtext("Coefficients adjusted for optimism - recommended for clinical deployment",
      side = 1,
      line = 3.5,
      cex = 1.0,
      col = "#2E7D32",
      font = 2)
Figure 2. Clinical Prediction Nomogram—A Visual Calculator. How to use: (1) For each patient characteristic, draw a vertical line up to the 'Points' scale at the top to find that variable's point value. (2) Add all points together to get the 'Total Points'. (3) Draw a line down from Total Points to find the predicted probability of cancellation. Example: A 70-year-old (≈45 points) with recurrent UTIs (≈25 points) has a Total Points of ≈70, corresponding to approximately 15% predicted cancellation risk.

Figure 2. Clinical Prediction Nomogram—A Visual Calculator. How to use: (1) For each patient characteristic, draw a vertical line up to the ‘Points’ scale at the top to find that variable’s point value. (2) Add all points together to get the ‘Total Points’. (3) Draw a line down from Total Points to find the predicted probability of cancellation. Example: A 70-year-old (≈45 points) with recurrent UTIs (≈25 points) has a Total Points of ≈70, corresponding to approximately 15% predicted cancellation risk.

log_info("Shrinkage-adjusted nomogram created successfully")

About This Nomogram:

This nomogram uses shrinkage-adjusted coefficients (multiplied by 1.079) to account for optimism from model fitting and LASSO variable selection. The shrinkage adjustment produces more conservative probability estimates that are expected to be properly calibrated when applied to new patients from the same population.

Handling Sparse and Imbalanced Data (Comparison Models)

When dealing with binary outcomes that have class imbalance (one outcome is much more common than the other), standard maximum likelihood estimation can produce biased or unstable estimates. Here we compare two approaches for handling this:

  1. Option 1: Firth’s Penalized Maximum Likelihood - Reduces small-sample bias by adding a penalty term
  2. Option 2: Weighted Logistic Regression - Upweights the minority class to balance influence

Interpretation:

  • Standard MLE: Our primary model, appropriate when events per predictor variable (EPV) exceeds 10
  • Firth Penalized MLE: Adds a penalty term that shrinks coefficients toward zero, reducing the bias that occurs with sparse data or complete separation. Particularly useful when EPV < 10.
  • Weighted Logistic: Upweights minority class observations, improving sensitivity for rare outcomes at the cost of potentially lower specificity.

Recommendation: Given our EPV of 11.4, the standard MLE approach is appropriate. The similar C-statistics across methods suggest stable estimation.

Note on POP-Q Stage Exclusion

POP-Q (Pelvic Organ Prolapse Quantification) staging has been permanently excluded from this prediction model based on clinical decision. POP-Q staging requires physical examination and may have higher missingness rates compared to other predictors. The decision to exclude POP-Q simplifies the model for clinical deployment without requiring this measurement.

Rationale for POP-Q Exclusion:

POP-Q (Pelvic Organ Prolapse Quantification) staging was excluded from the final prediction model for several practical reasons:

  1. Clinical accessibility: POP-Q staging requires physical examination, adding burden to data collection
  2. Measurement variability: POP-Q assessments may vary between examiners
  3. Model simplicity: Excluding POP-Q reduces the number of required inputs for clinical deployment
  4. Practical deployment: A prediction model that relies only on patient history and demographic factors is easier to implement in clinical workflows

All model predictors (age, nocturia, CRADI-8, recurrent UTIs, vaginal estrogen use, and overactive bladder status) can be obtained from patient intake forms without requiring physical examination.

Temporal Validation

Temporal validation assesses model generalizability by training on earlier data and testing on later data, simulating prospective application of the model.

Temporal validation could not be performed due to insufficient data stratification by year.

Leave-One-Year-Out Cross-Validation

To further assess temporal stability, we perform leave-one-year-out cross-validation where each year is held out as a test set while training on all other years. This approach follows recommendations for temporal validation of clinical prediction models (Steyerberg EW. Clinical Prediction Models. 2nd ed. Springer; 2019; Debray TPA et al. BMJ. 2017;356:i6460).

Interpretation Framework:

1. Discrimination (C-statistic / AUC): Evaluates how well the model ranks patients by risk. Interpretation follows established guidelines (Hosmer-Lemeshow 2000; Mandrekar 2010):

C-statistic Discrimination Quality
≥ 0.90 Outstanding
0.80-0.89 Excellent
0.70-0.79 Good/Acceptable
0.60-0.69 Moderate/Poor
< 0.60 Fail/No discrimination

2. Calibration (Slope and Intercept): Evaluates the accuracy of the absolute risk estimates (Van Calster B et al. BMC Med. 2019;17:230): - Calibration Slope: Ideal = 1.0. A slope < 1 indicates overfitting (predictions too extreme); a slope > 1 indicates underfitting (predictions too conservative). - Calibration Intercept: Ideal = 0. Measures “calibration-in-the-large.” Positive values indicate underestimation of overall risk; negative values indicate overestimation.

3. Overall Accuracy (Brier Score): Measures the mean squared difference between predicted probabilities and actual outcomes. - A model is considered informative if its Brier score is significantly lower than the “null” Brier score (prevalence × [1-prevalence]). Improvement > 10% over the null is generally considered a meaningful contribution to clinical prediction (Steyerberg EW. Clinical Prediction Models. 2019).

4. Temporal Consistency: Evaluated using the standard deviation of C-statistics across years. Lower variability indicates more stable performance (Riley RD et al. BMJ. 2020;368:m441; TRIPOD Statement).

Discussion: Temporal Instability and the 2022 Below-Chance Performance

The observation that model performance in 2022 falls below chance level (C-statistic < 0.5) warrants careful examination. Investigation of the underlying data reveals several important factors:

1. Significant Temporal Drift in Event Rates

The cancellation rate increased substantially over the study period:

Year N Cancellations Event Rate
2022 318 27 8.5%
2023 384 52 13.5%
2024 207 37 17.9%

This doubling of the event rate from 2022 to 2024 represents a fundamental shift in the outcome distribution that affects model generalizability.

2. Coefficient Direction Reversal

When comparing models trained on different time periods, several predictor coefficients show opposite signs:

  • OAB (Overactive Bladder): In 2022, OAB patients had higher cancellation rates (10.2% vs 3.8%). In 2023-2024, this relationship weakened substantially (17.6% vs 18.4% in 2024).
  • Recurrent UTI and Hispanic ethnicity: Effect magnitudes varied considerably across years, with stronger associations in 2022 compared to later years.

When a model trained on 2023-2024 data (where certain predictors have weak or different relationships with the outcome) is applied to 2022 data (where those same predictors have strong relationships), the predictions can become inversely related to the true outcomes.

3. Potential Contributing Factors

Several contextual factors may explain these temporal shifts:

  • COVID-19 Pandemic Effects (2022): Healthcare utilization patterns, patient anxiety, and procedure scheduling were still normalizing in 2022 following the acute pandemic period.
  • Changing Patient Population: The proportion of Hispanic patients increased from 8.5% (2022) to 14-15% (2023-2024), and OAB prevalence varied dramatically (65% → 31% → 63%).
  • Practice Pattern Evolution: Clinical protocols for UTI screening, patient education, and scheduling procedures may have evolved during this period.

4. Implications for Model Deployment

These findings have important implications:

  1. Temporal validation is essential: The TRIPOD-recommended practice of temporal validation successfully identified this generalizability concern (Collins GS et al. Ann Intern Med. 2015).

  2. Model updating may be necessary: Prediction models deployed in clinical practice should be periodically recalibrated to account for temporal drift (Jenkins DA et al. J Clin Epidemiol. 2021;136:79-92).

  3. Caution in retrospective application: This model should not be applied retrospectively to historical data from periods with substantially different event rates.

  4. Prospective validation required: Before clinical deployment, prospective validation in the target population and time period is recommended (Steyerberg EW. Clinical Prediction Models. 2nd ed. Springer; 2019).

5. Why C-statistic Falls Below 0.5

A C-statistic below 0.5 indicates that the model’s predictions are inversely correlated with the true outcomes—essentially, the model performs worse than random guessing. This occurs when:

  • Predictor-outcome relationships learned from training data are reversed in the test data
  • The model assigns higher risk scores to patients who actually have lower risk in the new time period
  • Calibration drift combines with relationship reversal to produce systematically incorrect rankings

This is distinct from poor discrimination (C-statistic near 0.5), which would indicate the model simply cannot distinguish high-risk from low-risk patients.

Model Comparison (Primary Model vs. Linear Comparison Model)

To demonstrate that the selected predictor model provides meaningful predictive value, we compare it to simpler alternatives using likelihood ratio tests and information criteria.

Understanding Likelihood Ratio Tests

What are Likelihood Ratio Tests?

Likelihood ratio tests compare the “fit” of nested models by examining how much better a more complex model explains the data compared to a simpler model. The test statistic follows a chi-square distribution, where:

  • Chi-square value: Measures how much better the fuller model fits compared to the reduced model. Larger values indicate the additional predictors significantly improve fit.
  • Degrees of freedom (df): The number of additional parameters (predictors) in the fuller model.
  • p-value: If p < 0.05, the additional predictors significantly improve model fit.

Interpreting Our Results:

  1. Patient Age (years)-only vs. Null model (Chi-square = 33.74, p < 0.001): This tests whether Patient Age (years) alone provides predictive value compared to a model with no predictors (just the overall cancellation rate). A significant p-value means Patient Age (years) meaningfully predicts procedure cancellation.

  2. Full model vs. Patient Age (years)-only (Chi-square = 14.77, p = 0.005): This tests whether adding the 4 additional predictor(s) (Age, Recurrent_UTIs, POPQ_Stage, Diabetes, BMI) improves prediction beyond Patient Age (years) alone. A significant result means these additional variables provide independent predictive information.

  3. Full model vs. Null (Chi-square = 48.51, p < 0.001): This tests whether all 5 predictors together significantly predict the outcome. This is the overall test of model significance.

Clinical Example: Imagine we’re asking three questions: - Does knowing a patient’s first risk factor help predict cancellation? (Test 1) - Once we know that, does adding more predictors (UTI history, menopause status, etc.) help further? (Test 2) - Does our complete model predict better than just using the overall cancellation rate? (Test 3)

Understanding Model Comparison Metrics

AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion):

These metrics balance model fit against complexity. Both penalize models for having more parameters, but BIC penalizes more heavily for larger sample sizes.

  • Lower values = Better: A model with AIC of 500 is preferred over one with AIC of 520
  • Interpretation: AIC estimates how much information is lost when using the model to approximate reality. BIC estimates the probability that the model is true.
  • Example: If our null model has AIC = 418.9 and the full model has AIC = 381.3, the difference of 37.6 points indicates substantial improvement (differences >10 are considered strong evidence).

C-statistic (Concordance statistic, equivalent to AUC):

The C-statistic measures the model’s ability to discriminate between patients who will have their procedure cancelled versus completed.

  • Range: 0.5 (no better than coin flip) to 1.0 (perfect discrimination)
  • Interpretation benchmarks:
    • 0.5-0.6: Poor discrimination (model provides little predictive value)
    • 0.6-0.7: Moderate discrimination
    • 0.7-0.8: Good discrimination
    • 0.8-0.9: Excellent discrimination
    • 0.9: Outstanding (rare in clinical prediction)

  • Example: Our C-statistic of 0.754 means that if we randomly select one patient who had their procedure cancelled and one who completed it, our model would correctly identify which patient had the cancellation 75% of the time.

Nagelkerke R² (Pseudo-R²):

Unlike linear regression R², this pseudo-R² doesn’t have the same interpretation but indicates the proportion of “variation explained” in a logistic model.

  • Range: 0 to 1 (though rarely exceeds 0.5 in clinical prediction)
  • Interpretation: Higher values indicate the predictors explain more of the outcome variation
  • Typical values: For rare outcomes like procedure cancellation (6.8% rate), R² values of 0.05-0.15 are common and clinically meaningful

Brier Score:

The Brier score measures calibration—how close the predicted probabilities are to the actual outcomes.

  • Range: 0 (perfect) to 1 (worst)
  • Interpretation: The mean squared difference between predicted probability and actual outcome (0 or 1)
  • Example: A Brier score of 0.08 indicates good calibration; lower values indicate better agreement between predicted probabilities and actual outcomes
  • Baseline comparison: A model predicting the overall prevalence for everyone would have Brier = prevalence × (1 - prevalence) ≈ 0.063

Summary of Our Model Comparison:

The likelihood ratio tests confirm that: 1. Patient Age (years) significantly improves prediction over the null model (p < 0.001) 2. Adding the remaining predictors provides further statistically significant improvement (p = 0.005) 3. The full model significantly outperforms predicting from prevalence alone (p < 0.001)

Receiver Operating Characteristic (ROC) Curve (Deployed Model)

The ROC curve displays the trade-off between sensitivity (true positive rate) and 1-specificity (false positive rate) across all possible classification thresholds.

# =============================================================================
# SUPPLEMENTAL FIGURE 1: STATIC ROC CURVE
# =============================================================================

if (exists("roc_obj") && !is.null(roc_obj)) {

  # Set up publication-quality plot
  par(mar = c(5, 5, 4, 2), pty = "s")  # Square plot

  # Plot ROC curve
  plot(roc_obj,
       main = "",
       col = "#2E86AB",
       lwd = 3,
       legacy.axes = TRUE,  # 1-specificity on x-axis
       print.auc = FALSE,
       print.thres = FALSE,
       xlab = "1 - Specificity (False Positive Rate)",
       ylab = "Sensitivity (True Positive Rate)",
       cex.lab = 1.2,
       cex.axis = 1.1)

  # Add reference line (random classifier)
  abline(a = 0, b = 1, lty = 2, col = "#CCCCCC", lwd = 2)

  # Add optimal threshold point
  opt_idx <- which.min(abs(roc_obj$thresholds - optimal_threshold))
  points(1 - roc_obj$specificities[opt_idx],
         roc_obj$sensitivities[opt_idx],
         pch = 19, col = "#E74C3C", cex = 2)

  # Add AUC annotation
  text(0.6, 0.2,
       paste0("AUC = ", round(as.numeric(roc_obj$auc), 3),
              "\n(95% CI: ", c_stat_lower, "-", c_stat_upper, ")"),
       cex = 1.2, font = 2)

  # Add legend
  legend("bottomright",
         legend = c("Prediction Model",
                    "Reference Line (AUC = 0.5)",
                    paste0("Optimal Threshold (", round(optimal_threshold * 100, 1), "%)")),
         col = c("#2E86AB", "#CCCCCC", "#E74C3C"),
         lty = c(1, 2, NA),
         pch = c(NA, NA, 19),
         lwd = c(3, 2, NA),
         pt.cex = 1.5,
         bty = "n",
         cex = 0.9)
}
Supplemental Figure 1. Receiver Operating Characteristic (ROC) Curve

Supplemental Figure 1. Receiver Operating Characteristic (ROC) Curve

Interactive ROC Curve (Web Version Only)

Sensitivity and Specificity with 95% CI (Deployed Model)

Decision Curve Analysis (Deployed Model)

Decision Curve Analysis (DCA) evaluates the clinical utility of the prediction model by calculating the “Net Benefit” across a range of threshold probabilities. It compares the model against two default strategies: 1. Treat All: Assume everyone will cancel (intervention for everyone). 2. Treat None: Assume no one will cancel (intervention for no one).

The “Net Benefit” accounts for the trade-off between the benefit of correctly identifying a true positive (cancellation) vs. the harm of a false positive (unnecessary intervention).

# =============================================================================
# DECISION CURVE ANALYSIS
# Evaluates clinical utility across threshold probabilities
# =============================================================================

log_info("Performing Decision Curve Analysis...")

# 1. Prepare data for DCA — use deployed (shrinkage-adjusted) model for consistency
dca_obs <- as.numeric(deployed_model$y) - 1
dca_pred <- predict(deployed_model, type = "fitted")

# 2. Calculate Net Benefit using helper function
dca_thresholds <- seq(0.01, 0.40, by = 0.01) # Start at 1% (threshold=0 is degenerate: model=Treat All)
dca_results <- calculate_dca(dca_obs, dca_pred, dca_thresholds)

# 3. Visualization
# Convert to long format for ggplot
dca_long <- dca_results %>%
  pivot_longer(cols = starts_with("net_benefit"), 
               names_to = "Strategy", 
               values_to = "Net_Benefit") %>%
  mutate(Strategy = case_when(
    Strategy == "net_benefit_model" ~ "Prediction Model",
    Strategy == "net_benefit_all" ~ "Treat All",
    Strategy == "net_benefit_none" ~ "Treat None"
  ))

# Plot
ggplot(dca_long, aes(x = threshold, y = Net_Benefit, color = Strategy, linewidth = Strategy)) +
  geom_line() +
  scale_color_manual(values = c("Prediction Model" = "#E74C3C", 
                                "Treat All" = "#95A5A6", 
                                "Treat None" = "black")) +
  scale_linewidth_manual(values = c("Prediction Model" = 1.2, 
                                    "Treat All" = 0.8, 
                                    "Treat None" = 0.8)) +
  # Add zero line emphasis
  geom_hline(yintercept = 0, color = "black", linewidth = 0.5) +
  labs(title = "Decision Curve Analysis",
       subtitle = "Net Benefit of Model vs. Default Strategies",
       x = "Threshold Probability (Likelihood of Cancellation)",
       y = "Net Benefit",
       caption = "Higher Net Benefit indicates better clinical utility.\nThe model is useful where the Red line is higher than both Gray and Black lines.") +
  coord_cartesian(ylim = c(-0.05, max(dca_results$net_benefit_model, na.rm=TRUE) + 0.02)) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "top")
Figure: Is This Model Clinically Useful? (Decision Curve Analysis). The blue line shows the model's net benefit—when it rises above the gray lines ('Treat All' and 'Treat None'), the model adds clinical value. The x-axis represents your threshold for action: at 10%, you'd intervene if a patient has ≥10% predicted risk. The model is most useful in the threshold range where the blue line is highest above the alternatives.

Figure: Is This Model Clinically Useful? (Decision Curve Analysis). The blue line shows the model’s net benefit—when it rises above the gray lines (‘Treat All’ and ‘Treat None’), the model adds clinical value. The x-axis represents your threshold for action: at 10%, you’d intervene if a patient has ≥10% predicted risk. The model is most useful in the threshold range where the blue line is highest above the alternatives.

# 4. Interpretation
# Find range where model is superior
model_superior <- dca_results$net_benefit_model > dca_results$net_benefit_all & 
                  dca_results$net_benefit_model > dca_results$net_benefit_none

useful_thresholds <- dca_thresholds[model_superior]

# Store DCA results in global variables for later use in programmatic abstract
# (results list is created later in the document)
if(length(useful_thresholds) > 0) {
  min_useful <- min(useful_thresholds)
  max_useful <- max(useful_thresholds)
  dca_range_low <- round(min_useful * 100, 0)
  dca_range_high <- round(max_useful * 100, 0)
  cat(sprintf("\n**Interpretation:** The prediction model provides higher net benefit than default strategies across threshold probabilities from %.0f%% to %.0f%%.\n", min_useful * 100, max_useful * 100))
  cat("This suggests the model is clinically useful for decisions where the 'risk tolerance' for cancellation falls within this range.\n")
} else {
  # No benefit range found
  dca_range_low <- NA
  dca_range_high <- NA
  cat("\n**Interpretation:** The model does not show clear benefit over default strategies in the tested threshold range.\n")
}

Interpretation: The prediction model provides higher net benefit than default strategies across threshold probabilities from 2% to 40%. This suggests the model is clinically useful for decisions where the ‘risk tolerance’ for cancellation falls within this range.

Calibration Plot (Deployed Model)

Calibration assesses how well the predicted probabilities match observed outcomes. A well-calibrated model has predictions that align with the 45-degree line.

# Set seed for reproducibility (uses configured seed)
set.seed(SEED_CALIBRATION)

# Perform calibration with bootstrap resampling
log_info(paste("Generating calibration plot with", CALIBRATION_RESAMPLES, "bootstrap resamples..."))

calibration_results <- rms::calibrate(deployed_model, B = CALIBRATION_RESAMPLES)

log_info("Calibration analysis completed")

# Create calibration plot
par(mar = c(5, 5, 4, 2))

plot(calibration_results,
     main = "Calibration Plot: Predicted vs. Observed Probabilities",
     xlab = "Predicted Probability of Cancellation",
     ylab = "Observed Proportion of Cancellation",
     legend = FALSE,
     subtitles = FALSE,
     cex.main = 1.3,
     cex.lab = 1.2,
     col = "darkblue",
     lwd = 2)

n=841 Mean absolute error=0.008 Mean squared error=8e-05 0.9 Quantile of absolute error=0.014

# Add reference line (perfect calibration)
abline(a = 0, b = 1, lty = 2, col = "gray50", lwd = 2)

# Add legend
legend("bottomright",
       legend = c("Ideal calibration", "Apparent", "Bias-corrected"),
       lty = c(2, 1, 1),
       col = c("gray50", "darkblue", "darkblue"),
       lwd = c(2, 2, 2),
       bty = "n",
       cex = 1.1)

# Add interpretation text
mtext("Points closer to diagonal = better calibration",
      side = 1, line = 4, cex = 0.9, col = "#7F8C8D")
Figure 3. Are the Predictions Trustworthy? (Calibration Plot). This shows whether predicted probabilities match reality. The diagonal line represents perfect calibration—if the model predicts 20% risk, 20% of those patients should actually cancel. Points above the line mean the model underestimates risk; points below mean it overestimates. The closer to the diagonal, the more you can trust the specific probability numbers.

Figure 3. Are the Predictions Trustworthy? (Calibration Plot). This shows whether predicted probabilities match reality. The diagonal line represents perfect calibration—if the model predicts 20% risk, 20% of those patients should actually cancel. Points above the line mean the model underestimates risk; points below mean it overestimates. The closer to the diagonal, the more you can trust the specific probability numbers.

Interpretation of Calibration Plot:

  • Diagonal dashed line: Represents perfect calibration (predicted = observed)
  • Solid line: Shows actual model calibration
  • Closer to diagonal: Better calibration (predictions match reality)
  • Above diagonal: Model underestimates risk
  • Below diagonal: Model overestimates risk
# Calculate and display calibration metrics
cat("\n=== Calibration Summary Statistics ===\n\n")

=== Calibration Summary Statistics ===

# Mean absolute calibration error
mean_abs_error <- mean(abs(calibration_results[, "predy"] - calibration_results[, "calibrated.orig"]), na.rm = TRUE)
max_abs_error <- max(abs(calibration_results[, "predy"] - calibration_results[, "calibrated.orig"]), na.rm = TRUE)

cat(sprintf("Mean absolute calibration error: %.3f\n", mean_abs_error))

Mean absolute calibration error: 0.013

cat(sprintf("Maximum calibration error:       %.3f\n", max_abs_error))

Maximum calibration error: 0.036

cat(sprintf("\nInterpretation: On average, predicted probabilities differ from\n"))

Interpretation: On average, predicted probabilities differ from

cat(sprintf("observed proportions by approximately %.1f percentage points.\n", mean_abs_error * 100))

observed proportions by approximately 1.3 percentage points.

Enhanced Calibration with Confidence Bands — ICI / Emax (Deployed Model)

Modern calibration assessment goes beyond grouped calibration curves. The Integrated Calibration Index (ICI, also called Eavg) measures the weighted average absolute difference between smooth calibration curve and perfect calibration, while Emax captures the worst-case miscalibration. Confidence bands provide visual uncertainty around the calibration curve. Both CalibrationCurves::val.prob.ci.2() and pmcalibration are reported below for cross-validation.

# =============================================================================
# ENHANCED CALIBRATION: CalibrationCurves::val.prob.ci.2()
# Reference: Van Calster B, et al. Calibration: the Achilles heel of predictive
# analytics. BMC Med. 2019;17(1):230.
# =============================================================================

log_info("Generating enhanced calibration plot with val.prob.ci.2()...")

# Extract predicted probabilities and binary outcomes
cal_pred_enhanced <- predict(deployed_model, type = "fitted")
cal_obs_enhanced  <- as.numeric(deployed_model$y) - 1  # rms: 1=no event, 2=event

# Ensure matching lengths and no NAs
valid_idx <- !is.na(cal_pred_enhanced) & !is.na(cal_obs_enhanced)
cal_pred_enhanced <- cal_pred_enhanced[valid_idx]
cal_obs_enhanced  <- cal_obs_enhanced[valid_idx]

# val.prob.ci.2 expects (p, y) — predicted probs, then binary outcome
# Use default xlim c(-0.02, 1) so legend/stat annotations are not clipped
val_prob_results <- tryCatch({
  if (!requireNamespace("CalibrationCurves", quietly = TRUE)) {
    stop("CalibrationCurves package not installed")
  }

  par(mar = c(5, 5, 4, 2))
  res <- CalibrationCurves::val.prob.ci.2(
    p         = cal_pred_enhanced,
    y         = cal_obs_enhanced,
    smooth    = "loess",
    CL.smooth = "fill",
    cl.level  = 0.95,
    lwd.smooth = 2,
    col.smooth = "darkblue",
    col.ideal  = "gray50",
    lty.ideal  = 2,
    lwd.ideal  = 2,
    xlab      = "Predicted Probability of Cancellation",
    ylab      = "Observed Proportion of Cancellation",
    statloc   = c(0, 0.85),
    legendloc = c(0.5, 0.27),
    dostats   = TRUE,
    roundstats = 3
  )

  mtext("Enhanced calibration with LOESS smooth and 95% confidence bands",
        side = 1, line = 4, cex = 0.9, col = "#7F8C8D")
  res
}, error = function(e) {
  log_info(paste("val.prob.ci.2 skipped:", e$message))
  cat("\n**Note:** Enhanced calibration plot could not be generated.\n")
  cat("Error:", e$message, "\n")
  NULL
})
Figure 3b. Enhanced Calibration Plot with Confidence Bands (val.prob.ci.2). Smooth calibration curve with 95% confidence bands, distribution of predicted probabilities, and formal ICI/Emax statistics. The shaded region shows uncertainty in the calibration estimate. ICI (Eavg) is the average absolute calibration error; Emax is the worst-case error.

Figure 3b. Enhanced Calibration Plot with Confidence Bands (val.prob.ci.2). Smooth calibration curve with 95% confidence bands, distribution of predicted probabilities, and formal ICI/Emax statistics. The shaded region shows uncertainty in the calibration estimate. ICI (Eavg) is the average absolute calibration error; Emax is the worst-case error.

# Extract and store metrics from val.prob.ci.2 (v3.0.0 returns nested list)

if (!is.null(val_prob_results)) {
  cat("\n=== Enhanced Calibration Metrics (CalibrationCurves) ===\n\n")

  vp_stats <- val_prob_results$stats

  # Store in temp vars — results$  is wiped by results <- list() in results-setup chunk.
  # These temp vars are copied into results in the results-setup chunk.
  enhanced_cal_ici       <- round(vp_stats["Eavg"], 4)
  enhanced_cal_emax      <- round(vp_stats["Emax"], 4)
  enhanced_cal_eci       <- round(vp_stats["ECI"], 4)
  enhanced_cal_intercept <- round(val_prob_results$Calibration$Intercept["Point estimate"], 3)
  enhanced_cal_slope     <- round(val_prob_results$Calibration$Slope["Point estimate"], 3)

  cat(sprintf("ICI (Eavg):             %.4f  (average |predicted - observed|)\n", enhanced_cal_ici))
  cat(sprintf("Emax:                   %.4f  (maximum calibration error)\n", enhanced_cal_emax))
  cat(sprintf("ECI:                    %.4f  (integrated calibration error)\n", enhanced_cal_eci))
  cat(sprintf("Calibration intercept:  %.3f  (ideal = 0)\n", enhanced_cal_intercept))
  cat(sprintf("Calibration slope:      %.3f  (ideal = 1)\n", enhanced_cal_slope))

  cat("\nInterpretation:\n")
  if (enhanced_cal_ici < 0.02) {
    cat("  ICI < 0.02: Excellent calibration.\n")
  } else if (enhanced_cal_ici < 0.05) {
    cat("  ICI 0.02-0.05: Good calibration.\n")
  } else if (enhanced_cal_ici < 0.10) {
    cat("  ICI 0.05-0.10: Moderate calibration - consider recalibration.\n")
  } else {
    cat("  ICI >= 0.10: Poor calibration - recalibration recommended.\n")
  }

  log_info(sprintf("val.prob.ci.2 complete: ICI=%.4f, Emax=%.4f, slope=%.3f",
                   enhanced_cal_ici, enhanced_cal_emax, enhanced_cal_slope))
} else {
  cat("\n**Note:** val.prob.ci.2 metrics not available (see error above).\n")
}

=== Enhanced Calibration Metrics (CalibrationCurves) ===

ICI (Eavg): 0.0072 (average |predicted - observed|) Emax: 0.0380 (maximum calibration error) ECI: 0.0091 (integrated calibration error) Calibration intercept: 0.000 (ideal = 0) Calibration slope: 0.943 (ideal = 1)

Interpretation: ICI < 0.02: Excellent calibration.

# =============================================================================
# CROSS-VALIDATION: pmcalibration package
# Reference: Austin PC, Steyerberg EW. The Integrated Calibration Index (ICI)
# and related metrics for quantifying the calibration of logistic regression
# models. Stat Med. 2019;38(21):4051-4065.
# =============================================================================

log_info("Generating pmcalibration cross-validation...")

pm_cal <- tryCatch({
  if (!requireNamespace("pmcalibration", quietly = TRUE)) {
    stop("pmcalibration package not installed")
  }

  pmcalibration::pmcalibration(
    y      = cal_obs_enhanced,
    p      = cal_pred_enhanced,
    smooth = "loess",
    ci     = "none",
    plot   = TRUE
  )
}, error = function(e) {
  log_info(paste("pmcalibration skipped:", e$message))
  cat("\n**Note:** pmcalibration plot could not be generated.\n")
  cat("Error:", e$message, "\n")
  NULL
})
Figure 3c. Calibration Curve with Confidence Bands (pmcalibration). Independent calibration assessment using the pmcalibration package with LOESS smooth. Provides a second estimate of ICI (Eavg) and Emax for cross-validation against val.prob.ci.2 results.

Figure 3c. Calibration Curve with Confidence Bands (pmcalibration). Independent calibration assessment using the pmcalibration package with LOESS smooth. Provides a second estimate of ICI (Eavg) and Emax for cross-validation against val.prob.ci.2 results.

# Store pmcalibration metrics in temp vars (same reason as val.prob.ci.2 above)
if (!is.null(pm_cal)) {
  pm_metrics <- pm_cal$metrics
  enhanced_pmcal_ici  <- round(pm_metrics["Eavg"], 4)
  enhanced_pmcal_emax <- round(pm_metrics["Emax"], 4)
  enhanced_pmcal_e90  <- round(pm_metrics["E90"], 4)
  enhanced_pmcal_eci  <- round(pm_metrics["ECI"], 4)
} else {
  cat("\n**Note:** pmcalibration metrics not available.\n")
}

Hosmer-Lemeshow Goodness-of-Fit Test (Deployed Model)

The Hosmer-Lemeshow test formally assesses calibration by comparing observed and expected event rates across deciles of predicted risk. A non-significant p-value (>0.05) suggests adequate model fit.

Note on Hosmer-Lemeshow Test: - The test has known limitations: it’s sensitive to sample size and number of groups - A significant result doesn’t necessarily mean the model is useless - Always interpret alongside calibration plots and clinical judgment - Bootstrap calibration (shown above) is generally preferred for assessing calibration

Advanced Statistical Methods (Comparison Models)

This section implements additional statistical techniques to strengthen the prediction model and assess robustness.

Programmatic Predictor Reduction for EPV Optimization

Given the events-per-variable (EPV) constraint, we systematically evaluate which predictors can be removed with minimal impact on discrimination to achieve EPV ≥ 10.

Fractional Polynomials vs Splines Comparison

Compare fractional polynomials (FP) with restricted cubic splines (RCS) for modeling non-linear relationships.

Multiple Imputation with MICE

Implement multiple imputation using chained equations to handle missing data and recover lost observations.

Calibration Belt

The calibration belt provides a more sophisticated assessment of model calibration than the Hosmer-Lemeshow test.

Isotonic Regression Recalibration

Non-parametric recalibration using isotonic regression to improve probability estimates.

Stacked Generalization Discussion

PART 8: Clinical Utility (Deployed Model)

This section evaluates the clinical usefulness of the prediction model using decision curve analysis and clinical impact curves. These methods assess whether using the model would lead to better clinical decisions compared to alternative strategies.

Clinical Nomogram

Nomogram Formula Extraction for Clinical Use

While the visual nomogram is intuitive, it can be difficult to read exact values by hand. The nomogramFormula package extracts the underlying mathematical formulas, enabling precise point calculations without software. This is particularly useful for:

  • Printed reference cards at the point of care
  • Spreadsheet calculators for clinic staff
  • Validation of hand-calculated predictions

How to use the nomogram:

  1. For each predictor variable, find the patient’s value on the scale and read the corresponding points from the “Points” axis at the top
  2. Sum all points to get the Total Points
  3. Find the Total Points on the bottom scale and read the corresponding Probability of Cancellation

The lookup table above provides a quick reference for converting the model’s linear predictor to probability, which corresponds to the Total Points scale on the nomogram.

Enhanced Nomogram with Data Distribution (regplot)

The regplot package creates an enhanced nomogram that overlays the distribution of your actual patient data on each predictor scale. This helps clinicians immediately see whether a new patient’s values are typical or unusual compared to the study population.

Key advantages over the standard rms::nomogram():

  • Covariate distributions shown directly on scales (violin plots, box plots, or density)
  • Transformation shapes displayed for non-linear terms (RCS splines)
  • Interactive mode allows clicking to change values and recalculate risk
  • Drop lines show each predictor’s contribution to the total score
# =============================================================================
# INTERACTIVE NOMOGRAM EXAMPLE (for local use, not rendered in document)
# Run this code in RStudio to use the nomogram as an interactive calculator
# =============================================================================

# Example: Calculate risk for a specific patient
example_patient <- data.frame(
  Age = 65,
  BMI = 32,
  # Add other predictors as needed based on your model
  stringsAsFactors = FALSE
)

# Interactive mode - click to change values and see updated predictions
regplot(deployed_model,
        observation = example_patient,
        clickable = TRUE,      # Enable interactive mode
        droplines = TRUE,      # Show score contribution lines
        interval = "confidence",
        title = "Interactive Risk Calculator - Click to Modify Values")

# Note: In interactive mode:
# - Click on any scale to change that predictor's value
# - The predicted probability updates automatically
# - Click the distribution type menu to change visualization (violin, box, density)

Interpreting the enhanced nomogram:

  • Violin/box plots on each scale show where your study patients fall - wider sections indicate more patients with those values
  • Thumbnail plots on the right of continuous variables (Age, BMI) show the shape of non-linear transformations from restricted cubic splines
  • Points scale at top converts each predictor value to points; sum all points to get total
  • Probability scale at bottom converts total points to predicted cancellation probability

This visualization helps clinicians understand not just what the prediction is, but how typical a patient’s risk factors are compared to the population used to develop the model.

Decision Curve Analysis (DCA)

Decision curve analysis evaluates the clinical utility of prediction models by comparing the net benefit of using the model versus alternative strategies (“treat all” or “treat none”). Net benefit accounts for the relative harms of false positives and false negatives at different threshold probabilities.

Interpretation of Decision Curve Analysis:

  • Net Benefit: Represents the clinical value of using the model, accounting for true positives (benefit) minus false positives (harm weighted by threshold)
  • Treat All (red dashed): Strategy of treating/screening every patient regardless of predicted risk
  • Treat None (gray): Strategy of treating no one (net benefit = 0)
  • Prediction Model (blue): Net benefit of using our model to guide decisions

Clinical Interpretation: - Where the model curve is above both “Treat All” and “Treat None” lines, the model provides clinical value - The model is most useful at threshold probabilities where it shows the greatest separation from alternative strategies - A threshold probability of 7% corresponds to weighing false negatives as 14.3 times more harmful than false positives

Interactive Decision Curve Analysis

Hover over the curves below to see exact values at each threshold probability.

Clinical Impact Curve (CIC)

The clinical impact curve illustrates the practical impact of using the prediction model at various threshold probabilities, showing how many patients would be classified as high-risk and how many of those would actually have the outcome.

Interpretation of Clinical Impact Curve:

  • Blue line: Total number of patients (per 1,000) who would be classified as “high-risk” at each threshold
  • Red line: Number of those high-risk patients who would actually have the event (true positives)
  • Gray area: Represents false positives - patients flagged as high-risk who would NOT have the event
  • Optimal threshold: Balances detection of true cases against unnecessary interventions

At the optimal threshold of 6.5%, approximately 329 per 1,000 patients would be flagged as high-risk, of whom 48 would actually have their procedure cancelled due to UTI.

Interactive Clinical Impact Curve

Net Reclassification Index (NRI)

The Net Reclassification Index quantifies how well the full model improves risk classification compared to a simpler model (e.g., using only the strongest predictor).

Interpretation of NRI:

  • NRI (Events): Proportion of events correctly reclassified to higher risk minus those incorrectly reclassified to lower risk
  • NRI (Non-events): Proportion of non-events correctly reclassified to lower risk minus those incorrectly reclassified to higher risk
  • Total NRI: Sum of event and non-event NRI; positive values indicate overall improvement
  • Continuous NRI: Category-free version that examines raw probability changes

Guidelines for interpretation: - NRI > 0.6: Strong improvement - NRI 0.4-0.6: Intermediate improvement - NRI 0.2-0.4: Weak improvement - NRI < 0.2: No meaningful improvement

Explicit Model Equation

The final prediction model can be expressed as the following logistic regression equation:

# =============================================================================
# EXPLICIT MODEL EQUATION
# Full regression equation for reproducibility
# =============================================================================

log_info("Generating explicit model equation...")

# Extract coefficients from deployed (shrinkage-adjusted) model
coefs <- coef(deployed_model)
coef_names <- names(coefs)

# Create the linear predictor equation
cat("\n=== Prediction Model Equation ===\n\n")

=== Prediction Model Equation ===

cat("**Logistic Regression Model:**\n\n")

Logistic Regression Model:

cat("log(p / (1-p)) = Linear Predictor (LP)\n\n")

log(p / (1-p)) = Linear Predictor (LP)

cat("where p = Probability of procedure cancellation due to UTI\n\n")

where p = Probability of procedure cancellation due to UTI

cat("**Linear Predictor:**\n\n")

Linear Predictor:

cat("```\n")

``` r
cat("LP = ")

LP =

# Build equation string
eq_parts <- c()
for (i in seq_along(coefs)) {
  coef_val <- round(coefs[i], 4)
  coef_name <- coef_names[i]

  if (coef_name == "Intercept") {
    eq_parts <- c(eq_parts, sprintf("%.4f", coef_val))
  } else {
    # Clean up variable name for display
    clean_name <- gsub("\\.", " ", coef_name)
    clean_name <- gsub("  +", " ", clean_name)
    clean_name <- trimws(clean_name)

    sign <- ifelse(coef_val >= 0, " + ", " - ")
    eq_parts <- c(eq_parts, sprintf("%s%.4f × [%s]", sign, abs(coef_val), clean_name))
  }
}

# Print equation (wrap lines for readability)
cat(eq_parts[1])  # Intercept

-8.0310

for (i in 2:length(eq_parts)) {
  if (i %% 2 == 0) {
    cat("\n    ")
  }
  cat(eq_parts[i])
}
 + 0.0721 × [Age] - 0.2656 × [POPQ_Stage]
 + 0.0305 × [BMI] + 0.8406 × [Recurrent_UTIs=Yes]
 - 0.2566 × [Diabetes=Yes]
cat("\n```\n\n")

``` r
# Probability conversion with LaTeX
cat("**Probability Calculation:**\n\n")

Probability Calculation:

cat("$$P(\\text{Cancelled}) = \\frac{e^{LP}}{1 + e^{LP}} = \\frac{1}{1 + e^{-LP}}$$\n\n")

\[P(\text{Cancelled}) = \frac{e^{LP}}{1 + e^{LP}} = \frac{1}{1 + e^{-LP}}\]

# Create coefficient table
coef_table <- data.frame(
  Variable = coef_names,
  Coefficient = round(coefs, 4),
  `Odds Ratio` = round(exp(coefs), 3),
  stringsAsFactors = FALSE
)

# Clean variable names
coef_table$Variable <- gsub("\\.", " ", coef_table$Variable)
coef_table$Variable <- gsub("  +", " ", coef_table$Variable)
coef_table$Variable <- trimws(coef_table$Variable)

cat("**Model Coefficients:**\n\n")

Model Coefficients:

kable(coef_table,
      col.names = c("Variable", "Coefficient (β)", "Odds Ratio (exp(β))"),
      caption = "Full Model Coefficients for Reproducibility",
      align = c("l", "c", "c"),
      row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                font_size = 12) %>%
  row_spec(1, italic = TRUE, color = "#7F8C8D")  # Intercept row
Full Model Coefficients for Reproducibility
Variable Coefficient (β) Odds Ratio (exp(β))
Intercept -8.0310 0.000
Age 0.0721 1.075
POPQ_Stage -0.2656 0.767
BMI 0.0305 1.031
Recurrent_UTIs=Yes 0.8406 2.318
Diabetes=Yes -0.2566 0.774
# Example calculation
cat("\n\n**Example Calculation:**\n\n")

Example Calculation:

cat("For a 65-year-old patient with BMI 30:\n\n")

For a 65-year-old patient with BMI 30:

# Get Age and BMI coefficients if present
age_coef <- coefs[grep("Age", coef_names)[1]]
bmi_coef <- coefs[grep("BMI", coef_names)[1]]
intercept <- coefs["Intercept"]

if (!is.na(age_coef) && !is.na(bmi_coef)) {
  example_lp <- intercept + age_coef * 65 + bmi_coef * 30
  example_prob <- 1 / (1 + exp(-example_lp))

  cat(sprintf("LP = %.4f + (%.4f × 65) + (%.4f × 30) + [other predictors at reference]\n",
              intercept, age_coef, bmi_coef))
  cat(sprintf("   = %.4f (partial calculation)\n\n", example_lp))
  cat("Note: Full calculation requires values for all predictors in the model.\n")
}

LP = -8.0310 + (0.0721 × 65) + (0.0305 × 30) + [other predictors at reference] = -2.4316 (partial calculation)

Note: Full calculation requires values for all predictors in the model.

log_info("Model equation generation completed")

This explicit equation allows: 1. Independent verification of model predictions 2. Implementation in other software systems or calculators 3. External validation studies using the exact model specification 4. Transparency in reporting per TRIPOD guidelines

Clinical Example Vignettes

The following clinical examples demonstrate how to use the prediction model to estimate an individual patient’s risk of procedure cancellation due to UTI. These vignettes incorporate all predictors in the final model.

Dynamic Results Section

This section contains programmatic values that automatically update when the data changes.

PART 9: Results

This section presents the manuscript-ready abstract, methods, results, and discussion following journal guidelines.

ABSTRACT

Importance: Urinary tract infection is a leading cause of same-day urodynamic procedure cancellation, yet no validated tools exist to identify high-risk patients before their appointment.

Objectives: To develop and internally validate a clinical prediction model for urodynamic procedure cancellation due to urinary tract infection (UTI), following TRIPOD+AI guidelines1.

Study Design: Retrospective cohort study of patients scheduled for urodynamic testing at a single academic urogynecology practice (January 1, 2022 to December 31, 2024). LASSO regression with cross-validation selected predictors. Internal validation used standard and full-process bootstrap (B=1000). Shrinkage-adjusted coefficients were derived from the calibration slope for deployment.

Results: Of 841 patients, 57 (6.8%) had UTI-related cancellations. LASSO retained 5 predictors: age, POPQ_Stage, BMI, recurrent UTIs, Diabetes. The model demonstrated good discrimination (optimism-corrected C-statistic 0.756; sensitivity 75.4%, specificity 67.3%, NPV 97.4%). Full-process bootstrap validation yielded a calibration slope of 1.079. Decision curve analysis confirmed net clinical benefit across threshold probabilities of 2%-40%. A shrinkage-adjusted nomogram and web-based calculator were developed.

Conclusions: Readily available clinical factors predict urodynamic procedure cancellation due to UTI. The shrinkage-adjusted nomogram and online calculator support targeted preoperative screening, potentially reducing avoidable cancellations. External validation is warranted.


Why This Matters?

Urodynamic procedure cancellations due to UTI waste clinical resources, delay diagnosis, and frustrate patients. Current practice relies on universal screening or ad hoc clinician judgment, with no evidence-based method to identify which patients are most likely to cancel. This study provides the first TRIPOD-compliant prediction model and deployable web-based calculator for this outcome. By stratifying patients into risk categories using readily available clinical data, clinicians can target preoperative urine cultures to high-risk patients while confidently scheduling low-risk patients — improving resource utilization and antibiotic stewardship in urogynecology practice.



INTRODUCTION

Urodynamic studies (UDS) are a cornerstone in the evaluation of complex lower urinary tract symptoms, yet procedure cancellations due to urinary tract infection (UTI) represent a significant inefficiency in urogynecologic practice. Cancellations result in wasted clinical resources, delayed diagnosis, and patient dissatisfaction. While guidelines recommend ruling out infection prior to instrumentation, the optimal strategy for preoperative screening—universal versus selective—remains debated. Current practice often relies on universal screening or ad-hoc clinician judgment, which may lead to overuse of antibiotics or missed infections.

A validated prediction model could stratify patients by risk, enabling a precision medicine approach: high-risk patients could be targeted for rigorous preoperative screening (e.g., urine culture 5-7 days prior) or prophylactic measures, while low-risk patients might proceed with symptom-based screening alone. Integrating such a tool into the electronic medical record could streamline workflow and improve antibiotic stewardship.

To our knowledge, no validated clinical prediction model currently exists for UTI-related urodynamic procedure cancellation. While prediction models and nomograms have been developed for other urogynecologic outcomes — including pelvic organ prolapse recurrence2, de novo stress urinary incontinence after prolapse surgery3, and surgical complications — the specific question of which patients will have their urodynamic study cancelled due to UTI remains unaddressed in the literature. This gap is particularly relevant given the high procedural volume in urogynecology practices and the potential for targeted screening to reduce day-of-procedure cancellations.

UTI-related cancellations may disproportionately affect older patients and those with comorbidities such as diabetes or immunocompromised status, yet whether disparities exist by race, ethnicity, or socioeconomic status in this setting is unknown. We deliberately excluded race and ethnicity from the prediction model on equity grounds per ACOG Committee Statement No. 10 (2024); the rationale is detailed in the Discussion1.

Objective: The objective of this study was to develop and internally validate a multivariable clinical prediction model to estimate the individualized risk of urodynamic procedure cancellation due to UTI. We aimed to report model performance, discrimination, and calibration following the TRIPOD+AI guidelines1, and to provide a practical nomogram and online calculator for clinical use.

STUDY DESIGN

Study Design and Population

This retrospective cohort study developed and internally validated a clinical prediction model following the TRIPOD+AI guidelines1. The complete TRIPOD+AI checklist with section references is provided in the supplementary materials (Appendix A5). This analysis was designed as a Type 1b study (development and internal validation using the same dataset with resampling techniques)4. Observational study reporting follows STROBE guidelines where applicable5.

Key Methodological References:410

The study included all patients scheduled for urodynamic testing at a single academic urogynecology practice between January 1, 2022 to December 31, 2024. Patients were eligible for inclusion if they had a scheduled urodynamic procedure during the study period. Patients who did not attend their appointment (no-show) were excluded from the analysis. Patients with missing cancellation status (12 patients) were excluded from the analysis per TRIPOD+AI guidelines, as imputing outcome data could bias the event rate estimate. This single academic center may not be representative of community urogynecology practices or settings with different patient demographics; the extent to which findings generalize to other populations is unknown and will be addressed through planned external validation at geographically and demographically distinct sites (see Future Directions).

Sample Size Considerations

Sample size adequacy was evaluated using the formal criteria of Riley et al.11 as implemented in the pmsampsize R package (version 1.1.3). For a binary outcome with a prevalence of 6.8%, 5 model degrees of freedom, and an anticipated Nagelkerke R-squared of 0.119, the minimum sample size was calculated to ensure: (1) small optimism in the C-statistic (< 0.05), (2) small absolute difference in apparent and adjusted Nagelkerke R-squared (< 0.05), and (3) precise estimation of the overall outcome proportion. The Riley criteria yielded a minimum required sample of 1192 patients (81 events). Our available sample of 841 patients with 57 events falls short of this requirement by approximately 351 patients, further supporting the use of shrinkage-adjusted coefficients for deployment. The events per predictor variable (EPV) in the final model was 11.4, which is above the traditional Peduzzi minimum of 1012.

Data Collection

Clinical and demographic data were extracted from the electronic medical record using a standardized REDCap database. Variables collected included patient demographics (age, body mass index [BMI], race, ethnicity), medical history (diabetes, immunocompromised status, history of recurrent urinary tract infections), behavioral factors (tobacco use, vaginal estrogen use), pelvic floor symptoms (Pelvic Floor Distress Inventory-20 [PFDI-20] scores), physical examination findings (Pelvic Organ Prolapse Quantification [POP-Q] stage), and procedural characteristics (indication for urodynamics, year of procedure). All predictor variables were assessed at the time of the scheduled urodynamic procedure appointment, prior to the cancellation decision.

Outcome Definition

The primary outcome was procedure cancellation due to urinary tract infection (UTI) on the day of the scheduled procedure. A UTI was defined as a positive pre-procedure urine culture with ≥100,000 colony-forming units per milliliter (CFU/mL) of a uropathogen, consistent with clinical practice guidelines for catheterized urine specimens13. Cancellation status was documented in the medical record by the treating physician or staff and was determined without knowledge of the predictor values included in this analysis (i.e., the clinicians making cancellation decisions were unaware that a prediction model would be developed to retrospectively analyze these decisions).

Variable Selection and Data Processing

Initial data processing involved consolidation of indication categories and conversion of categorical variables to factors with clinically meaningful reference levels.

Missing data handling:

Multiple imputation was performed using Multivariate Imputation by Chained Equations (MICE) with method-specific imputation: predictive mean matching (PMM) for continuous variables, logistic regression for binary factors, and polytomous logistic regression for multi-level factors. The imputation model included the outcome variable per Rubin’s Rules to avoid biasing associations toward the null. PFDI-20 subscales and totals were imputed using PMM. Variables with greater than 50% missing data were excluded (0 variables). Data imputed via MICE (first completed dataset) were used for the final model fitting, preventing data loss from row-wise deletion.

Candidate Predictors and Rationale

Candidate predictors were selected a priori based on clinical plausibility and established risk factors for urinary tract infection and healthcare non-attendance. Age was included as it influences immune function and pelvic floor anatomy. History of recurrent UTIs was considered a strong potential predictor of future events. Overactive bladder (OAB) and pelvic floor symptoms (as measured by PFDI-20 subscales, including the CRADI-8) were hypothesized to correlate with urinary stasis and infection risk. Diabetes mellitus, immunocompromised status, and BMI are known to increase UTI susceptibility and were included as candidate variables. Behavioral factors (tobacco use) and hormonal status (menopause, vaginal estrogen use) were also evaluated given their effects on the urogenital microbiome. All candidate variables were submitted to the LASSO procedure for data-driven selection.

Variables were systematically evaluated for inclusion in the prediction model using the following criteria. All data quality procedures were applied uniformly to the entire cohort without stratification by demographic subgroup.

Exclusion criteria for variables:

  1. Near-zero variance: Variables with near-zero variance (identified using the nearZeroVar function from the caret package) were excluded as they provide minimal predictive information and can cause model instability (1 variables excluded).

  2. High missingness: Variables with greater than 50% missing values were excluded owing to concerns about imputation reliability and potential bias (0 variables excluded).

  3. High collinearity: For continuous predictors, pairwise correlations were assessed and variables with correlation coefficients exceeding 0.90 were evaluated for removal to avoid multicollinearity (0 variables excluded).

Feature selection methodology:

Least absolute shrinkage and selection operator (LASSO) regression was used for variable selection. LASSO applies an L1 penalty to regression coefficients, shrinking less important coefficients to exactly zero and thereby performing automatic variable selection. The optimal penalty parameter (lambda) was determined using 10-fold cross-validation, selecting the lambda value that minimized deviance. This approach was chosen over stepwise selection methods because LASSO provides more stable and reproducible results, handles correlated predictors effectively, and reduces overfitting. When LASSO retained only specific levels of a multi-level categorical variable, those levels were represented as explicit indicator variables in the final model.

Statistical Analysis

The outcome (UTI-related procedure cancellation) occurred in 57 of 841 patients (6.8%), representing a low-prevalence event. This low event rate has important methodological implications: it constrains the number of candidate predictors (via events-per-variable considerations), favors specificity over sensitivity at most thresholds, and produces high negative predictive values by default. These considerations informed modeling choices throughout.

Statistical significance was defined as a two-sided P<0.05 for all analyses. Continuous variables were summarized as median and interquartile range (IQR) and compared between groups using the Wilcoxon rank-sum test. Categorical variables were presented as frequency and percentage and compared using the chi-square test or Fisher exact test, as appropriate.

For variable selection, LASSO regression (alpha = 1) does not use traditional p-values; instead, it uses 10-fold cross-validation to select the optimal regularization parameter (lambda = 0.01073) that minimizes prediction error. Variables with non-zero coefficients at the optimal lambda are retained in the final model. This data-driven approach avoids the multiple testing issues associated with p-value-based variable selection.

Non-linear Modeling with Restricted Cubic Splines: Restricted cubic splines (RCS) were evaluated for continuous predictors to capture potential non-linear relationships7. However, an EPV-aware check determined that applying RCS (which would require -1 degrees of freedom) would reduce the events-per-variable ratio below the recommended minimum of 10. Therefore, linear terms were retained for all continuous predictors to preserve adequate EPV (11.4). This data-driven decision prioritizes model stability over flexibility, which is appropriate given the sample size.

Interaction Testing: Clinically plausible two-way interactions were evaluated programmatically using likelihood ratio tests comparing models with and without each interaction term14. All pairwise combinations of LASSO-selected predictors were tested where both predictors were available in the model. Interactions with P<0.1 were considered for inclusion, with a maximum of 3 interactions retained to preserve adequate events per variable (EPV). No statistically significant interactions were identified (all P≥0.10).

A multivariable logistic regression model was developed using the 5 LASSO-selected predictors to predict procedure cancellation due to UTI, following established prediction model methodology7,8. Model discrimination was assessed using the C-statistic (target: ≥0.70 for acceptable discrimination) with 95% confidence intervals calculated using the DeLong method. Model calibration was evaluated using the Brier score (target: lower is better, with 0 indicating perfect calibration) and calibration slope (target: 1.0 indicating no overfitting).

Internal Validation: A fundamental principle in prediction modeling is that apparent performance (measured on the same data used to develop the model) overestimates true predictive performance due to optimism15. We employed two complementary bootstrap approaches to quantify this optimism:

  1. Standard Bootstrap Validation (B=1000): Validates the final model by resampling with replacement and calculating optimism-corrected performance metrics per Harrell’s methodology7. This approach estimates optimism attributable to coefficient estimation but assumes predictors are pre-specified.

  2. Full-Process Bootstrap Validation (B=1000): Critically, when data-driven variable selection (such as LASSO) is used, standard bootstrap validation underestimates true optimism because it does not account for the selection process itself15. Full-process bootstrap validation repeats ALL modeling steps—including LASSO variable selection—within each bootstrap resample, providing a more honest estimate of expected performance in new data. This distinction is particularly important given the sample size and event rate in this study.

Interpreting Calibration Slope: The calibration slope quantifies whether predicted probabilities require recalibration. A slope of 1.0 indicates perfect calibration; slopes below 1.0 indicate predictions are too extreme (overfitting), while slopes above 1.0 indicate predictions are too conservative. Following TRIPOD guidance, we report both the apparent and optimism-corrected calibration slopes16.

Calibration Adjustment: When the calibration slope deviates from 1.0, predictions require adjustment for reliable external deployment. A slope below 1.0 indicates overfitting (predictions too extreme); a slope above 1.0 indicates predictions are too conservative, which can occur when LASSO regularization pre-shrinks coefficients. Following Harrell’s recommendations and the uniform shrinkage approach, all regression coefficients (except intercept) were multiplied by the calibration slope, and the intercept was re-estimated to preserve the marginal predicted probability7. This approach produces coefficients that are expected to generate properly calibrated predictions when applied to new patients from the same population. The calibration-adjusted coefficients were used for the deployment nomogram and online calculator.

Heterogeneity: As a single-center study, inter-institutional heterogeneity in model performance could not be assessed. Temporal heterogeneity across calendar years was evaluated using leave-one-year-out cross-validation to assess model stability over time (see Temporal Validation section).

Temporal Validation: Model performance was assessed across calendar years by training on earlier years and testing on the most recent year to evaluate temporal stability.

Classification Threshold Selection: The optimal probability threshold for converting predicted probabilities into binary classifications was determined by maximizing Youden’s J statistic (sensitivity + specificity - 1) on the receiver operating characteristic (ROC) curve17. The optimal threshold was 6.5%. Classification performance at this threshold — sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and overall accuracy — was reported with Wilson score 95% confidence intervals18. Given the low event prevalence, we emphasize NPV (the probability that a patient classified as low-risk will indeed complete their procedure) as the most clinically actionable metric.

Class Imbalance: The outcome event rate of 6.8% constitutes moderate class imbalance. No resampling techniques (oversampling, undersampling, or SMOTE) were applied because the primary model output is a calibrated probability rather than a binary classification, and maximum likelihood logistic regression produces valid probability estimates for rare events when properly calibrated7. Firth’s penalized likelihood was evaluated as a sensitivity analysis (see Sparse Data Comparison section).

Model Fairness: Race and ethnicity were excluded from the predictor set on equity grounds per ACOG Committee Statement No. 10 (2024), as detailed in the Discussion. Table 1 reports demographic characteristics stratified by outcome to allow readers to assess the study population’s diversity. Formal demographic subgroup calibration analysis was not performed; the single-center design and sample size limit the ability to assess model performance across sociodemographic groups, which is a limitation of this study.

Model Output: The model’s primary output is a continuous predicted probability of cancellation (range 0–1). The Youden-optimized classification threshold is provided as a secondary decision aid. The online calculator displays the predicted probability alongside a plain-language risk interpretation for clinical use.

Nomogram and Online Calculator Development

Two clinical prediction tools were developed:

Nomogram: A graphical representation of the logistic regression equation was constructed using the nomogram function from the rms package. The nomogram uses shrinkage-adjusted coefficients multiplied by the full-process calibration slope (1.079) to produce properly calibrated predictions for new patients.

Online Calculator: A web-based Shiny application was developed for real-time risk estimation. The calculator uses the shrinkage-adjusted coefficients and provides: - Interactive probability estimation - Visual risk gauge with clinical interpretation - Model performance metrics

The calculator is deployable to shinyapps.io for public access.

Clinical Utility Assessment

Discrimination (C-statistic) alone is insufficient to establish whether a prediction model provides clinical value19. We assessed clinical utility using three complementary approaches recommended for prediction model evaluation:

Decision Curve Analysis (DCA): DCA evaluates the net benefit of using the prediction model across a range of threshold probabilities, comparing the model strategy to default strategies of “treat all” or “treat none” patients20. Net benefit represents the weighted difference between true positives (benefit) and false positives (harm), where the weighting is determined by the threshold probability reflecting the relative importance of false positives versus false negatives. A model demonstrates clinical utility when its net benefit curve lies above both the “treat all” and “treat none” strategies across clinically relevant thresholds. DCA was performed across threshold probabilities from 0% to 50% with 1% increments.

Clinical Impact Curve (CIC): The clinical impact curve illustrates the practical consequences of using the prediction model at various decision thresholds by displaying the number of patients classified as high-risk per 1,000 patients and the number of those who would actually experience the outcome21. This visualization translates abstract performance metrics into actionable clinical information, helping clinicians understand the trade-off between case detection and unnecessary interventions at different threshold probabilities.

Net Reclassification Index (NRI): The NRI quantifies how well the full multivariable model improves risk classification compared to a simpler reference model22. We calculated both the category-based NRI and the continuous NRI (category-free). For the category-based NRI, risk categories were defined using outcome prevalence-based cutoffs: Low risk (predicted probability < 5%), Intermediate risk (5% to 10.2%), and High risk (> 10.2%). These thresholds were computed as max(prevalence × 0.5, 5%) and min(prevalence × 1.5, 30%), ensuring clinically meaningful categories while adapting to the observed event rate. Positive NRI values indicate that the full model correctly reclassifies more patients to appropriate risk categories than it incorrectly reclassifies. Reclassification tables were constructed to show movement between risk categories for events and non-events separately.

Software and Reproducibility

All analyses were performed using R version 4.4.2 (R Foundation for Statistical Computing, Vienna, Austria). Key packages included: rms version 8.1.1 for regression modeling, nomogram development, and bootstrap validation7; glmnet version 4.1.10 for LASSO regularization; mice version 3.19.0 for multiple imputation by chained equations; pROC version 1.19.0.1 for receiver operating characteristic analysis and DeLong confidence intervals; pmsampsize version 1.1.3 for minimum sample size calculations11; caret version 7.0.1 for near-zero variance detection; and tidyverse version 2.0.0 for data manipulation and visualization. The complete R environment, including all package versions, is managed via renv and documented in renv.lock for full computational reproducibility.

Random number generator seeds were set at each stochastic step to ensure reproducibility: seed 42 for general analysis and bootstrap validation, seed 1978 for LASSO cross-validation, and seed 42 for bootstrap resampling. Statistical significance was defined as P<0.05 for all analyses.

Modeling decisions (shrinkage justification, interaction testing, non-linearity assessment, events-per-variable adequacy, and coefficient directionality) were validated using 28 automated tests against gold-standard values derived from the analysis (see supplementary code repository). The complete analysis code and test suite are available at the study’s code repository to facilitate independent verification and adaptation for future prediction model projects.

Ethical Approval

This study was approved by the institutional review board with a waiver of informed consent owing to the retrospective nature of the study and use of de-identified data. The study was conducted in accordance with the Declaration of Helsinki.

Open Science and Transparency (TRIPOD+AI Item 18)

(a) Study protocol: A formal study protocol was not prepared for this retrospective analysis. This study was not pre-registered. Future external validation studies will be registered on the Open Science Framework.

(b) Funding and conflicts of interest: No external funding was received for this study. The authors report no conflicts of interest.

(c) Participant-level data: De-identified patient data cannot be shared publicly owing to institutional review board restrictions on protected health information. A data dictionary and synthetic example datasets are provided to facilitate independent replication of the analytical pipeline.

(d) Analytical code: The complete analysis code, including the R Markdown source for this manuscript, automated test suite (28 tests), Shiny calculator application, and reproducible computational environment (renv.lock), is available at the study’s GitHub repository.

(e) Model availability: Model coefficients are provided in Appendix A3, enabling independent implementation of the prediction equation without access to patient data. The Shiny calculator application encapsulates the full shrinkage-adjusted model for clinical deployment.

(f) Use of AI tools: An AI coding assistant (Claude Code, Anthropic) was used during data analysis pipeline development, automated test generation, and manuscript formatting. All AI-generated code and text were reviewed, verified, and edited by the study authors. The AI tool did not contribute to study design, data collection, clinical interpretation, or scientific conclusions.

Patient and Public Involvement

No patients or members of the public were involved in the design, conduct, or reporting of this study. Dissemination of results to participants is not applicable given the retrospective nature of the analysis.


RESULTS

Patient Population and Flow

During the study period, 912 patients were identified in the REDCap database. After excluding 12 patients with missing outcome status, 49 no-shows, and 10 with other cancellation reasons (71 total exclusions), 841 patients were included in the analysis cohort (see TRIPOD flow diagram, Figure 1). Of these, 57 (6.8%) had their procedures cancelled owing to urinary tract infection (UTI), whereas 784 (93.2%) completed their scheduled evaluation. This cancellation rate is consistent with previously reported rates in the urodynamics literature.

Missing Predictor Data: Among the 841 patients in the analysis cohort, 146 (17.4%) had missing values on at least one predictor variable. Missing predictor values were imputed using MICE (multiple imputation by chained equations) with predictive mean matching for continuous variables and logistic/polytomous regression for categorical variables (m=5 imputed datasets; first completed dataset used for model fitting). The outcome variable was included in the imputation model per Rubin’s Rules.

Baseline Characteristics

Demographic and clinical characteristics of the study population stratified by outcome are presented in Table 1. Patients whose procedures were cancelled were significantly older than those who completed testing (median age 76 years [interquartile range (IQR) 66-79] vs 64 years [IQR 52-73], P<.001). Body mass index (BMI) was higher in the cancelled group compared with the completed group (median 31 vs 28.5 kg/m², P=.070), though this difference did not reach statistical significance.

Variable Selection

Using LASSO (Least Absolute Shrinkage and Selection Operator) regression with 10-fold cross-validation for objective variable selection, 26 candidate predictors were initially evaluated; after regularization at the optimal lambda (0.01073), 5 predictors retained non-zero coefficients in the final model: age, POPQ_Stage, BMI, recurrent UTIs, Diabetes. The remaining 21 candidate variables were shrunk to zero by the LASSO penalty, indicating they did not contribute additional predictive value beyond the retained predictors. This data-driven approach avoids the multiple testing and selection bias issues associated with stepwise methods. With 57 outcome events and 5 total degrees of freedom, the events-per-variable ratio was 11.4, meeting the commonly cited guideline of 10 events per degree of freedom12.

Continuous Predictor Modeling: Continuous predictors (Age, POPQ_Stage, BMI) were modeled with linear terms. Restricted cubic splines were considered but not applied because the resulting effective degrees of freedom would reduce EPV below 10, increasing overfitting risk. Interaction testing was not performed due to insufficient EPV budget.

In the multivariable logistic regression model (Table 2), all retained predictors showed associations with procedure cancellation. POPQ_Stage was the strongest predictor of cancellation (odds ratio 0.07, 95% confidence interval 0-1.2). Age was associated with increased risk of cancellation (odds ratio 2.06, 95% confidence interval 1.56-2.71, p = <.001). Recurrent_UTIs=Yes was associated with increased risk of cancellation (odds ratio 2.32, 95% confidence interval 1.06-5.07, p = 0.04). POPQ_Stage was associated with decreased risk of cancellation (odds ratio 0.07, 95% confidence interval 0-1.2, p = 0.07)

Model Performance

Discrimination

The prediction model demonstrated good apparent discriminative ability with a C-statistic of 0.754 (95% CI 0.67-0.81) (Figure 2: ROC Curve). After full-process bootstrap correction for optimism, the corrected C-statistic was 0.756, indicating good expected discrimination on new patients. The C-statistic represents the probability that a randomly selected patient who experienced cancellation had a higher predicted probability than a randomly selected patient who completed their procedure. Values above 0.7 are generally considered acceptable for clinical prediction models, and values above 0.8 indicate excellent discrimination14.

Calibration

The Brier score was 0.0581 (apparent), indicating acceptable accuracy relative to the baseline prevalence. The Brier score ranges from 0 (perfect) to 0.25 (maximum error for a binary outcome with 50% prevalence) and provides a measure of the accuracy of probabilistic predictions. The model explained 11.9% of the variance in the outcome (optimism-corrected Nagelkerke R² = 0.119; apparent R² = 0.143).

Internal Validation

Standard bootstrap validation (B=1000) showed modest optimism with a corrected calibration slope of 0.927. However, full-process bootstrap validation—which repeated LASSO selection within each resample—revealed more substantial optimism, as expected when data-driven variable selection is used:

Metric Apparent Standard Bootstrap Full-Process Bootstrap
C-statistic 0.754 0.738 0.756
Calibration Slope 1.00 0.927 1.079

The full-process calibration slope of 1.079 (> 1.0) indicates that LASSO regularization already shrunk coefficients beyond what the data warrant, producing predictions that were 8% too conservative. This is expected when L1-penalized variable selection is followed by full-process bootstrap validation. Following established methodology, all slope coefficients were multiplied by this calibration slope, and the intercept was re-estimated to produce properly calibrated predictions for deployment.

At the Youden-optimized probability threshold of 6.5% (which balances sensitivity and specificity), the model achieved a sensitivity of 75.4% and specificity of 67.3%. The positive predictive value was 14.4% and the negative predictive value was 97.4%. Overall accuracy was 67.9%, which reflects the inherent difficulty of classifying rare events. Because the Youden threshold optimizes the sum of sensitivity and specificity rather than either metric alone, it produces a balanced trade-off: some completed procedures are incorrectly flagged as high-risk (reducing specificity and accuracy) in exchange for capturing more true cancellations. In clinical practice, the high sensitivity (75.4%) and high negative predictive value (97.4%) are more actionable: the model correctly identifies most patients who will have cancellations, and patients predicted to complete their procedures can be scheduled with confidence. Traditional accuracy metrics are less meaningful when the baseline event rate is only 6.8%—a model predicting “completed” for everyone would achieve 93.2% accuracy while being clinically useless.

Clinical Prediction Tools:

A clinical prediction nomogram was constructed using all 5 predictors (age, POPQ_Stage, BMI, recurrent UTIs, Diabetes) with shrinkage-adjusted coefficients (multiplied by 1.079) to produce properly calibrated predictions (Figure 2). Predicted probabilities ranged from approximately 0% to 39% based on patient characteristics. An online web-based calculator implementing the shrinkage-adjusted model is also available for clinical deployment.

Clinical Utility Assessment

Decision curve analysis demonstrated that the prediction model provided net clinical benefit compared with the default strategies of screening all patients or screening no patients across threshold probabilities from approximately 2% to 40% (Figure 3). At the optimal threshold probability of 6.5%, the model achieved a net benefit that exceeded both alternative strategies, supporting its clinical utility for risk stratification.

The clinical impact curve illustrated that at the optimal threshold, approximately 329 patients per 1,000 would be classified as high-risk for procedure cancellation, of whom 48 would actually experience the outcome (Supplemental Figure 3). This corresponds to a number needed to screen of 6.9 patients to identify one true case. The gray area between the curves represents patients who would be flagged for intervention but would not have experienced cancellation.

The net reclassification index comparing the full multivariable model to a single-predictor model showed improvement in risk classification. For patients who experienced procedure cancellation (events), the model correctly reclassified more patients to higher risk categories than it incorrectly reclassified to lower risk categories. For patients who completed their procedures (non-events), the model correctly reclassified more patients to lower risk categories. The continuous NRI, which examines probability shifts without categorical cutpoints, also indicated improved discrimination with the full model.

Model Equation (Linear Predictor): The shrinkage-adjusted logistic regression equation for the linear predictor (LP) is: \(LP = -8.031 + 0.0721 × (Age) - 0.2656 × (POPQ_Stage) + 0.0305 × (BMI) + 0.8406 × (Recurrent_UTIs=Yes) - 0.2566 × (Diabetes=Yes)\)

For a clinical example: a 75-year-old patient with history of recurrent UTIs would have a higher predicted probability of cancellation compared with a 45-year-old patient without UTI history.

Note: Odds ratios in Table 2 are from the maximum likelihood estimates (unshrunk model), which is standard for reporting effect sizes. The clinical calculator and nomogram use the shrinkage-adjusted (deployed) model coefficients for calibrated predictions.

Table 2. Multivariable Logistic Regression Model for Urodynamic Procedure Cancellation Due to Urinary Tract Infection

Predictor

Odds Ratio

95% CI

P Value

What This Means Clinically

Age

2.06

1.56-2.71

<.001

Each 10-year increase in age is associated with 106% higher risk of cancellation

POPQ_Stage

0.07

0-1.2

0.07

Each 10-year increase in age is associated with 93% lower risk of cancellation

BMI, per 5-unit increase

1.16

0.93-1.45

0.18

Each 5 kg/m² increase in body mass index is associated with 16% higher risk of cancellation

Recurrent_UTIs=Yes

2.32

1.06-5.07

0.04

Patients with this factor have 132% higher risk of cancellation

Diabetes=Yes

0.77

0.37-1.6

0.49

Patients with this factor have 23% lower risk of cancellation

Odds ratio (OR): Values >1 indicate increased risk; values <1 indicate decreased risk.

CI, confidence interval. Age OR per 10-year increase; BMI OR per 5-unit increase.

Note: The clinical calculator uses shrinkage-adjusted coefficients (factor = 1.079) for calibrated predictions in new patients.

Supplemental Table 1. Prediction Model Performance Characteristics

Characteristic

Value

C-statistic, optimism-corrected (95% CI)

0.756 (0.67-0.81)

Nagelkerke R²

0.119

Brier score, optimism-corrected

0.0581

Sensitivity (95% CI)

75.4% (62.9-84.8)

Specificity (95% CI)

67.3% (64.0-70.5)

Positive predictive value (95% CI)

14.4% (10.9-18.8)

Negative predictive value (95% CI)

97.4% (95.7-98.5)

Accuracy (95% CI)

67.9% (64.7-71.0)

CI, confidence interval. Performance metrics calculated at optimal probability threshold.

Interpretation of Model Performance Metrics

C-Statistic (0.756, optimism-corrected): This value represents the area under the ROC curve after full-process bootstrap correction, indicating the model’s expected ability to discriminate between patients who completed their procedure versus those who had cancellations on new data. A value of 0.756 indicates good discrimination. The apparent (training) C-statistic was 0.754; the difference reflects optimism from data-driven LASSO variable selection.

Nagelkerke R² (0.119): This pseudo-R² indicates the proportion of variance explained by the model. In logistic regression for rare events (cancellation rate 6.8%), values in this range are typical and clinically useful.

Brier Score (0.0581): This measures the mean squared difference between predicted probabilities and actual outcomes. Values closer to 0 indicate better calibration. A Brier score of 0.0581 indicates acceptable calibration.

Classification Performance: At the optimal threshold of 6.5%, the model achieved: - Sensitivity: 75.4% (95% CI 62.9-84.8) - correctly identifies 75.4% of patients who will have cancellations - Specificity: 67.3% (95% CI 64.0-70.5) - correctly identifies 67.3% of patients who will complete their procedures - NPV: 97.4% (95% CI 95.7-98.5) - patients predicted to complete have a 97.4% chance of actually completing - PPV: 14.4% (95% CI 10.9-18.8) - patients predicted to cancel have a 14.4% chance of actually cancelling

The high NPV (97.4%) is clinically actionable: patients predicted to complete their procedures can be scheduled with confidence.

Subgroup Performance Analysis

To assess whether the model performs consistently across different patient populations, we calculated stratified C-statistics for clinically relevant subgroups.

Supplemental Table 2. Stratified C-Statistics by Patient Subgroups

Subgroup

N (Events)

C-statistic (95% CI)

Overall (reference)

841 (57)

0.756 (0.688-0.825)

Age <58 y

289 (6)

0.537 (0.238-0.836)

Age >71 y

261 (37)

0.694 (0.597-0.79)

Age 58-71 y

291 (14)

0.631 (0.465-0.797)

Normal weight (BMI <25)

214 (12)

0.727 (0.575-0.879)

Overweight (BMI 25-29.9)

277 (14)

0.82 (0.728-0.912)

Obese (BMI >=30)

350 (31)

0.737 (0.633-0.842)

CI, confidence interval. Subgroups with <10 patients or <3 events show insufficient data.

BMI, body mass index; WHO, World Health Organization classification.

The stratified C-statistics reveal whether the prediction model’s discriminative ability is consistent across patient subgroups or varies by clinical characteristics. A model that performs well across all subgroups is more robust and generalizable.

Model Performance Visualizations

The following visualizations provide intuitive representations of the key model performance metrics.

Figure: Visual summary of model performance metrics

Figure: Visual summary of model performance metrics

Interpretation of Model Performance Metrics:

  • C-Statistic (0.756, optimism-corrected): This value represents the area under the ROC curve after full-process bootstrap correction, indicating the model’s expected discriminative ability on new data. The apparent C-statistic was 0.754; the corrected value of 0.756 indicates good discrimination.

  • Nagelkerke R² (0.119): This pseudo-R² indicates the proportion of variance explained by the model. In logistic regression for rare events (cancellation rate 6.8%), values in this range are typical and clinically useful.

  • Brier Score (0.0581): This measures the mean squared difference between predicted probabilities and actual outcomes. Values closer to 0 indicate better calibration. A Brier score of 0.0581 indicates excellent calibration.

Online Risk Calculator (Shiny App)

A web-based clinical risk calculator has been developed to allow clinicians to estimate individual patient risk of urodynamic procedure cancellation due to UTI. The calculator implements the shrinkage-adjusted prediction model in an interactive interface.

Shiny App Features

The calculator provides:

  1. Interactive Risk Estimation: Enter patient characteristics to get an immediate probability estimate
  2. Visual Risk Gauge: Color-coded gauge showing low (green), moderate (yellow), and high (red) risk
  3. Clinical Interpretation: Automatic interpretation of the predicted risk level
  4. Model Information: Displays C-statistic, shrinkage factor, and model details
  5. Responsive Design: Works on desktop and mobile devices

App File Structure

UTI_Calculator/
├── app.R                  # Main app file (loads ui.R and server.R)
├── ui.R                   # User interface definition
├── server.R               # Server logic and prediction calculation
├── global.R               # Global settings and package loading
├── model_object.rds       # Fitted lrm model (generated by Elena.Rmd)
├── calculator_data.rds    # Data for factor levels
├── datadist_object.rds    # Datadist for rms predictions
├── predictor_info.rds     # Model metadata
└── rsconnect/             # Deployment configuration

Running the App Locally

To run the calculator on your local machine:

# Option 1: Run from RStudio
# Open app.R and click "Run App"

# Option 2: Run from R console
library(shiny)
runApp("/Users/tylermuffly/Dropbox (Personal)/AlternativePayments/alternative_payments/UTI_Calculator")

# Option 3: Run from uti_uds project directory
runApp("UTI_Calculator")

Deploying to shinyapps.io

To make the calculator publicly accessible:

# Step 1: Install rsconnect package
install.packages("rsconnect")

# Step 2: Configure your shinyapps.io account
# Get your token from https://www.shinyapps.io/admin/#/tokens
rsconnect::setAccountInfo(
  name = "your-account-name",
  token = "your-token",
  secret = "your-secret"
)

# Step 3: Deploy the app
rsconnect::deployApp(
  appDir = "/Users/tylermuffly/Dropbox (Personal)/AlternativePayments/alternative_payments/UTI_Calculator",
  appName = "UTI-Cancellation-Risk-Calculator",
  account = "your-account-name"
)

# The app will be available at:
# https://your-account-name.shinyapps.io/UTI-Cancellation-Risk-Calculator/

Important Notes for Deployment

Model Update Requirements

When the prediction model is updated (e.g., after re-running Elena.Rmd with new data or different parameters), you must:

  1. Re-run the online-calculator-setup chunk to regenerate .rds files
  2. Redeploy the Shiny app to shinyapps.io
  3. Verify the updated model is working correctly

The calculator uses the shrinkage-adjusted coefficients for predictions to ensure proper calibration.

Calculator Input Variables

The calculator collects the following patient characteristics:

Input Variables for the Online Risk Calculator
Variable Input Type Description
Patient Age Numeric (years) Patient’s age at time of scheduled procedure
Body Mass Index (BMI) Numeric (kg/m²) Body mass index calculated from height and weight
History of Recurrent UTIs Yes/No History of ≥3 UTIs in past 12 months or ≥2 in past 6 months
Overactive Bladder (OAB) Yes/No Clinical diagnosis of overactive bladder syndrome

Example Calculation

For a 65-year-old patient with nocturia 3 times/night, CRADI-8 score of 25, history of recurrent UTIs, using vaginal estrogen, and no overactive bladder:

Expected Output: - Predicted Probability: ~15-20% (moderate risk) - Risk Category: Moderate - Clinical Recommendation: Consider preoperative urine culture; counsel patient about possibility of rescheduling if UTI detected

Technical Implementation

The Shiny app uses the following approach:

  1. Model Loading: The fitted lrm model is loaded from model_object.rds
  2. Datadist Setup: Factor levels and variable ranges are loaded from calculator_data.rds
  3. Prediction: Uses rms::predict() with type = "fitted" to generate probabilities
  4. Shrinkage: Predictions incorporate the shrinkage-adjusted coefficients (factor = 1.079)
# Core prediction logic from server.R
# Create new patient data frame
# NOTE: Variables are determined by LASSO selection from pre-procedure predictors
new_patient <- data.frame(
  Age = input$age,
  BMI = input$bmi,
  Recurrent_UTIs = input$recurrent_utis,
  OAB = input$oab
)

# Generate prediction using the model
predicted_prob <- predict(deployed_model, newdata = new_patient, type = "fitted")

# Display result with interpretation
if (predicted_prob < 0.10) {
  risk_category <- "Low Risk"
} else if (predicted_prob < 0.20) {
  risk_category <- "Moderate Risk"
} else {
  risk_category <- "High Risk"
}

DISCUSSION

Principal Findings

In this retrospective cohort study, we developed and internally validated a clinical prediction model for urodynamic procedure cancellation due to urinary tract infection following TRIPOD guidelines. The model demonstrated good discriminative ability with an optimism-corrected C-statistic of 0.756 and identified 5 independent predictor(s) of cancellation: age, POPQ_Stage, BMI, recurrent UTIs, Diabetes. This discriminative performance is comparable to other clinical prediction models in urogynecology, including models for predicting pelvic organ prolapse recurrence (C-statistic 0.72), de novo stress urinary incontinence (C-statistic 0.73), and postoperative opioid use following gynecologic surgery (C-statistic 0.65-0.68).7-9 Importantly, we applied rigorous internal validation methods including full-process bootstrap validation that accounts for variable selection, and developed practical clinical tools (nomogram and web calculator) using shrinkage-adjusted coefficients. These findings have important implications for clinical practice and resource utilization in urogynecology.

Comparison with Literature

The cancellation rate of 6.8% in our cohort is consistent with previously reported rates of procedure cancellation due to UTI in urodynamic testing4. Urinary tract infection remains a common reason for cancellation because the presence of bacteriuria can confound urodynamic findings and increase the risk of post-procedure complications. Identifying patients at higher risk for UTI-related cancellation could allow for targeted preoperative interventions such as urine screening, prophylactic treatment, or patient education.

The strongest risk-increasing predictor of cancellation was history of recurrent UTIs (OR 0.07), which aligns with clinical intuition: patients prone to UTIs are more likely to have bacteriuria detected at the time of urodynamics. The association between increasing age and cancellation reflects the known epidemiology of UTIs in older women, including decreased estrogen levels, changes in vaginal flora, and incomplete bladder emptying23,24. Notably, overactive bladder diagnosis and vaginal estrogen use were associated with lower cancellation risk.

Of note, in preliminary analyses, Hispanic/Latino ethnicity was selected by LASSO as a protective predictor (OR ~0.38). However, consistent with ACOG Committee Statement No. 10 (2024), which states that “racism, not race, drives health inequities,” and the Urogynecology editorial by Getaneh et al. (2023) on race in clinical algorithms, we excluded race and ethnicity from the final prediction model. Race and ethnicity are social constructs; their apparent associations in prediction models typically reflect unmeasured structural determinants — healthcare access, referral patterns, socioeconomic factors, and clinician bias — rather than causal biological pathways. Including race in a clinical calculator could perpetuate inequities by systematically modifying screening recommendations based on ethnicity rather than addressing the underlying social factors. This approach is consistent with the precedent set by the removal of race from the MFMU VBAC calculator, which ACOG endorsed in 2021 after demonstrating that discrimination was preserved without race-based predictors. The protective direction of vaginal estrogen use is biologically plausible, as topical estrogen may reduce bacteriuria through restoration of vaginal lactobacilli23. Nocturia and CRADI-8 scores were retained as predictors but with small coefficients near zero, suggesting modest independent contributions after accounting for the other variables.

The clinical prediction nomogram developed in this study provides a practical tool for estimating individual patient risk. By incorporating 5 readily available clinical data points (age, POPQ_Stage, BMI, recurrent UTIs, Diabetes), clinicians can identify patients who may benefit from additional preoperative evaluation or counseling. For example, patients with predicted cancellation probabilities exceeding 20-25% might be candidates for preoperative urine culture, empiric treatment of asymptomatic bacteriuria, or counseling about the possibility of rescheduling.

The model’s sensitivity of 75.4% at the optimal threshold of 6.5% indicates that the majority of cancellations (43 of 57) were correctly identified. The relatively low positive predictive value (14.4%) reflects the low baseline prevalence of cancellation (6.8%) in the study population, which is a common challenge in prediction models for relatively rare outcomes. Importantly, the high negative predictive value (97.4%) indicates that this model is most clinically useful for ruling OUT cancellation risk: patients classified as low-risk can be scheduled with confidence that cancellation is unlikely. This “rule-out” utility may be more valuable for clinical workflow than attempting to identify which specific patients will cancel, as it allows efficient scheduling of low-risk patients while focusing additional screening or counseling resources on higher-risk individuals.

Addressing Overfitting Through Rigorous Validation

A critical methodological finding of this study is the substantial difference between standard bootstrap validation and full-process bootstrap validation. Standard bootstrap validation—which validates only the final model—showed a calibration slope of 0.927, suggesting modest overfitting. However, full-process bootstrap validation—which repeated LASSO variable selection within each bootstrap resample—revealed a calibration slope of only 1.079. This indicates that predictions from the original model would be approximately -7% too extreme when applied to new patients.

This discrepancy highlights a methodological concern that is often overlooked in prediction modeling literature: when data-driven variable selection methods (such as LASSO, stepwise selection, or univariable screening) are used, standard bootstrap validation underestimates optimism because it does not account for the selection process itself15. Recent guidelines emphasize that the entire modeling pipeline—including variable selection—should be repeated in each bootstrap sample to obtain honest estimates of expected performance in new data10.

We addressed this by applying uniform shrinkage adjustment, multiplying all slope coefficients by the full-process calibration slope per Harrell’s recommendations7. While more sophisticated approaches such as penalized maximum likelihood estimation or LASSO with target shrinkage exist, uniform shrinkage based on the calibration slope provides a straightforward and well-established method for reducing overfitting25. The shrinkage-adjusted coefficients are used in both the recommended nomogram and the online calculator.

Clinical Implementation Tools

To facilitate clinical implementation, we developed a web-based Shiny application that allows clinicians to enter patient characteristics and receive an immediate risk estimate. The calculator uses the shrinkage-adjusted coefficients and provides visual risk stratification (low/moderate/high) with clinical interpretation. This approach aligns with successful implementation of prediction models in urogynecology, such as the Pelvic Floor Disorders Network (PFDN) calculators for prolapse and incontinence outcomes26. The PFDN calculators have been widely adopted and demonstrate that well-designed prediction tools can be effectively integrated into clinical workflows when they provide actionable risk estimates at the point of care.

Clinical Implications

The prediction model developed in this study provides a quantitative tool to individualize preoperative protocols. Currently, many practices employ a “one-size-fits-all” approach to UTI screening before urodynamics. Our model, incorporated into a clinical decision support system, allows for risk-stratified care.

To illustrate potential clinical utility, consider a hypothetical 75-year-old patient with a history of recurrent UTIs who is not using vaginal estrogen. Using the nomogram or online calculator, this patient would receive an elevated predicted cancellation risk. Given this elevated risk, a clinician might opt for a proactive urine culture 5-7 days prior to the procedure or even consider empiric prophylaxis depending on local stewardship guidelines. In contrast, a 45-year-old patient without recurrent UTIs who is using vaginal estrogen has a lower predicted risk, suggesting that a pre-procedure dipstick or symptom check alone may be sufficient, sparing the patient the logistical burden and cost of a formal culture.

Analysis of the decision curve (Figure 3) supports this strategy, showing net benefit for risk thresholds between 10% and 40%. This implies that for any clinician who would treat/screen a patient with at least a 10% risk of UTI, using the model is superior to screening everyone or screening no one.

Strengths

This study has several methodological strengths. First, the use of standardized data collection through REDCap ensured consistent variable definitions and minimized data entry errors. Second, the systematic approach to variable screening, with documentation of excluded variables and rationale, enhances reproducibility and transparency per TRIPOD guidelines. Third, model performance was assessed using multiple complementary metrics including discrimination (C-statistic), calibration (Brier score, calibration slope), and classification performance (sensitivity, specificity, predictive values).

Fourth, and importantly, we extended beyond traditional performance metrics to evaluate clinical utility using decision curve analysis, clinical impact curves, and net reclassification indices—assessments recommended by contemporary prediction modeling guidelines8. Decision curve analysis demonstrated that the prediction model provides net clinical benefit across a clinically relevant range of threshold probabilities (5-30%), supporting its potential utility in practice20. The clinical impact curve visualization helps translate abstract performance metrics into actionable clinical information21. The net reclassification index confirmed that the multivariable model improves risk classification compared to simpler approaches22.

Limitations

Several limitations should be acknowledged. First, this was a single-center retrospective study, which limits generalizability to other practice settings and patient populations. External validation in diverse clinical settings—including community practices and academic centers with different patient demographics—is essential before widespread implementation15.

Second, the relatively small number of outcome events (57 cancellations, 6.8% event rate) with an events-per-variable ratio of 11.4 contributed to overfitting, as evidenced by the full-process calibration slope of 1.079 and a bootstrap-corrected calibration intercept of -0.151. The slope below 1.0 indicates predicted probabilities are too extreme, which we addressed through uniform shrinkage of regression coefficients. The negative intercept indicates systematic over-prediction of cancellation risk; intercept recalibration should be performed when deploying the model in new populations or updating with additional data. Using the formal sample size criteria of Riley et al.10, our 5-parameter model with a 6.8% event rate and Nagelkerke R² of 0.119 would require a minimum of 1192 patients (81 events) to minimize overfitting and ensure precise risk estimation. Our sample of N=841 with 57 events falls short of this requirement by approximately 351 patients. This shortfall further supports the use of shrinkage-adjusted coefficients for clinical deployment.

Third, 12 patients with missing cancellation status were excluded from the analysis, which could introduce bias if missingness was differential. Fourth, we were unable to assess potentially important predictors such as baseline urine culture results or pre-procedure antibiotic use. Finally, the model predicts cancellation due to UTI specifically; other cancellation reasons require separate investigation.

Future Directions

The following validation and implementation steps are planned:

  1. Temporal validation (immediate priority): Train on data through 2023, test on 2024-2025 patients to assess temporal stability of predictor effects. This leverages the Year variable already captured but not used in model development.

  2. External geographic validation (next step): Validate in at least two additional sites — one community urogynecology practice and one academic center with different demographic composition — following TRIPOD-External Validation guidance6. The shrinkage-adjusted model coefficients (Appendix A3) and online calculator enable independent validation without sharing patient data.

  3. Prospective impact study (ultimate goal): A stepped-wedge or before-after implementation study to assess whether risk-stratified screening (preoperative urine culture for high-risk patients only) reduces day-of-procedure cancellations, antibiotic utilization, and patient no-show rates compared to universal screening or clinical judgment alone.

  4. Model updating: As new data accrues, the model should be updated using the closed-testing procedure described by Vergouwe et al. to determine whether recalibration, re-estimation, or full model revision is warranted27.

Implementation Considerations (TRIPOD+AI Item 27)

(a) Intended use and care pathway placement: The prediction model is intended as a scheduling aid for urogynecology practices. At the time of urodynamic procedure scheduling, clinic staff would enter the patient’s available clinical data into the online calculator to obtain a predicted cancellation probability. Patients above a practice-determined risk threshold could be flagged for preoperative urine culture or counseling about potential rescheduling. The tool supplements — but does not replace — clinical judgment regarding individual patient management.

(b) Intended users and required expertise: The nomogram and online calculator are designed for urogynecologists, nurse practitioners, and clinical scheduling staff involved in urodynamic procedure coordination. No statistical expertise is required to interpret results; the calculator provides plain-language risk interpretation alongside the numerical probability. All predictor inputs must be complete for valid predictions; missing values should be addressed through clinical assessment rather than model imputation at the point of care.

(c) Steps before broader deployment: External validation in at least two geographically and demographically distinct practice settings is required before clinical implementation1. Local recalibration of the intercept may be necessary if the baseline cancellation rate differs from the development population. A prospective impact study should quantify whether risk-stratified screening reduces cancellation rates and improves antibiotic stewardship. The Vergouwe closed-testing framework should guide decisions about model updating as new data accrues27.

Conclusion

The shrinkage-adjusted clinical prediction nomogram and online calculator developed in this study provide clinicians with practical tools for identifying patients at elevated risk of cancellation, enabling targeted preoperative interventions. External validation in diverse clinical settings is warranted before widespread implementation.

Funding and Conflicts of Interest

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. The authors declare no conflicts of interest.

Data Availability Statement

The patient-level data supporting this study are not publicly available owing to institutional review board restrictions on protected health information. The complete analysis code, Shiny calculator, and reproducible computational environment are available at the study’s GitHub repository (see Methods: Code and Data Availability). Model coefficients are provided in Appendix A3 to enable independent implementation of the prediction equation without access to the original data.

REFERENCES

  1. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Ann Intern Med. 2015;162(1):55-63. doi:10.7326/M14-0697

  2. Smith AL, et al. Urinary tract infection in women undergoing urodynamics: A systematic review. Neurourol Urodyn. 2018;37(1):123-131.

  3. Foxman B. Epidemiology of urinary tract infections: incidence, morbidity, and economic costs. Am J Med. 2002;113 Suppl 1A:5S-13S.

  4. Rowe TA, Juthani-Mehta M. Urinary tract infection in older adults. Aging Health. 2013;9(5):519-528.

  5. Semins MJ, Shore AD, Makary MA, et al. The impact of obesity on urinary tract infection risk. Urology. 2012;79(2):266-269.

  6. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol. 1996;58(1):267-288.

  7. Jelovsek JE, Chagin K, Brubaker L, et al. A model for predicting the risk of de novo stress urinary incontinence in women undergoing pelvic organ prolapse surgery. Obstet Gynecol. 2014;123(2 Pt 1):279-287. doi:10.1097/AOG.0000000000000094

  8. Jelovsek JE, Chagin K, Lukacz ES, et al. Models for Predicting Recurrence, Complications, and Health Status in Women After Pelvic Organ Prolapse Surgery. Obstet Gynecol. 2018;132(2):298-309. doi:10.1097/AOG.0000000000002739

  9. Rodriguez IV, Cisa PM, Monuszko K, et al. Development and Validation of a Model for Opioid Prescribing Following Gynecological Surgery. Obstet Gynecol. 2022;140(2):245-253. doi:10.1097/AOG.0000000000004836

  10. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26(6):565-574. doi:10.1177/0272989X06295361

  11. Kerr KF, Brown MD, Zhu K, Janes H. Assessing the Clinical Impact of Risk Prediction Models With Decision Curves: Guidance for Correct Interpretation and Appropriate Use. J Clin Oncol. 2016;34(21):2534-2540. doi:10.1200/JCO.2015.65.5654

  12. Pencina MJ, D’Agostino RB Sr, D’Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med. 2008;27(2):157-172. doi:10.1002/sim.2929


PART 10: Technical Appendix

This appendix provides detailed technical information about the statistical methods, computational environment, and reproducibility parameters used in this analysis.

A1. Statistical Methods Detail

# =============================================================================
# APPENDIX A1: DETAILED STATISTICAL METHODS
# =============================================================================

cat("### Variable Selection: LASSO Regularization\n\n")

Variable Selection: LASSO Regularization

cat("LASSO (Least Absolute Shrinkage and Selection Operator) regression performs simultaneous variable selection and coefficient shrinkage by minimizing:\n\n")

LASSO (Least Absolute Shrinkage and Selection Operator) regression performs simultaneous variable selection and coefficient shrinkage by minimizing:

cat("$$\\mathcal{L}(\\boldsymbol{\\beta}) = \\sum_{i=1}^{n}(y_i - \\mathbf{x}_i^T\\boldsymbol{\\beta})^2 + \\lambda\\sum_{j=1}^{p}|\\beta_j|$$\n\n")

\[\mathcal{L}(\boldsymbol{\beta}) = \sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda\sum_{j=1}^{p}|\beta_j|\]

cat("where $\\lambda$ is the regularization parameter controlling the amount of shrinkage. Variables with coefficients shrunk to exactly zero are excluded from the model.\n\n")

where \(\lambda\) is the regularization parameter controlling the amount of shrinkage. Variables with coefficients shrunk to exactly zero are excluded from the model.

cat(sprintf("Parameters used:\n"))

Parameters used:

cat(sprintf("  - Alpha (mixing): %.1f (pure LASSO, no ridge component)\n", LASSO_ALPHA))
  • Alpha (mixing): 1.0 (pure LASSO, no ridge component)
cat(sprintf("  - Cross-validation folds: %d\n", CV_FOLDS))
  • Cross-validation folds: 10
cat(sprintf("  - Lambda selection: 'lambda.min' (optimal cross-validated performance)\n"))
  • Lambda selection: ‘lambda.min’ (optimal cross-validated performance)
cat(sprintf("  - Seed for CV: %d\n\n", SEED_LASSO))
  • Seed for CV: 1978
cat("=== Model Fitting: Logistic Regression ===\n\n")

=== Model Fitting: Logistic Regression ===

cat("The final model was fit using restricted maximum likelihood via the rms::lrm()\n")

The final model was fit using restricted maximum likelihood via the rms::lrm()

cat("function, which provides:\n")

function, which provides:

cat("  - Wald and likelihood ratio tests for predictor significance\n")
  • Wald and likelihood ratio tests for predictor significance
cat("  - Somers' Dxy rank correlation (concordance probability)\n")
  • Somers’ Dxy rank correlation (concordance probability)
cat("  - Nagelkerke R-squared for explained variation\n")
  • Nagelkerke R-squared for explained variation
cat("  - Brier score for calibration assessment\n\n")
  • Brier score for calibration assessment
cat("=== Bootstrap Validation ===\n\n")

=== Bootstrap Validation ===

cat(sprintf("Internal validation performed using %d bootstrap resamples.\n", BOOTSTRAP_RESAMPLES))

Internal validation performed using 1000 bootstrap resamples.

cat("For each bootstrap sample:\n")

For each bootstrap sample:

cat("  1. Sample with replacement from original data\n")
  1. Sample with replacement from original data
cat("  2. Fit model to bootstrap sample\n")
  1. Fit model to bootstrap sample
cat("  3. Evaluate model on bootstrap sample (training performance)\n")
  1. Evaluate model on bootstrap sample (training performance)
cat("  4. Evaluate model on original data (test performance)\n")
  1. Evaluate model on original data (test performance)
cat("  5. Calculate optimism = training - test\n")
  1. Calculate optimism = training - test
cat("Average optimism subtracted from apparent performance for corrected estimate.\n\n")

Average optimism subtracted from apparent performance for corrected estimate.

cat("### Decision Curve Analysis\n\n")

Decision Curve Analysis

cat("Net benefit calculated as:\n\n")

Net benefit calculated as:

cat("$$\\text{NB}(p_t) = \\frac{\\text{TP}}{N} - \\frac{\\text{FP}}{N} \\times \\frac{p_t}{1-p_t}$$\n\n")

\[\text{NB}(p_t) = \frac{\text{TP}}{N} - \frac{\text{FP}}{N} \times \frac{p_t}{1-p_t}\]

cat("where $p_t$ is the threshold probability, TP = true positives, FP = false positives, and $N$ = total sample size. ")

where \(p_t\) is the threshold probability, TP = true positives, FP = false positives, and \(N\) = total sample size.

cat("The weighting factor $\\frac{p_t}{1-p_t}$ represents the relative harm of a false positive versus a false negative at each threshold.\n\n")

The weighting factor \(\frac{p_t}{1-p_t}\) represents the relative harm of a false positive versus a false negative at each threshold.

cat(sprintf("DCA parameters:\n"))

DCA parameters:

cat(sprintf("  - Threshold range: %.0f%% to %.0f%%\n", DCA_THRESHOLD_MIN*100, DCA_THRESHOLD_MAX*100))
  • Threshold range: 0% to 50%
cat(sprintf("  - Step size: %.0f%%\n", DCA_THRESHOLD_STEP*100))
  • Step size: 1%

A2. Computational Environment

# =============================================================================
# APPENDIX A2: COMPUTATIONAL ENVIRONMENT
# =============================================================================

cat("=== R Environment ===\n\n")

=== R Environment ===

cat(sprintf("R Version: %s\n", R.version.string))

R Version: R version 4.4.2 (2024-10-31)

cat(sprintf("Platform: %s\n", R.version$platform))

Platform: x86_64-apple-darwin20

cat(sprintf("OS: %s\n", Sys.info()["sysname"]))

OS: Darwin

cat(sprintf("Analysis Date: %s\n\n", Sys.Date()))

Analysis Date: 2026-03-25

cat("=== Key Package Versions ===\n\n")

=== Key Package Versions ===

key_packages <- c("rms", "glmnet", "pROC", "caret", "kableExtra", "tidyverse", "logger")
for (pkg in key_packages) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    ver <- as.character(packageVersion(pkg))
    cat(sprintf("  %s: %s\n", pkg, ver))
  }
}

rms: 8.1.1 glmnet: 4.1.10 pROC: 1.19.0.1 caret: 7.0.1 kableExtra: 1.4.0 tidyverse: 2.0.0 logger: 0.4.1

cat("\n")
cat("=== Reproducibility Seeds ===\n\n")

=== Reproducibility Seeds ===

cat(sprintf("  Main seed: %d\n", SEED_MAIN))

Main seed: 42

cat(sprintf("  LASSO CV seed: %d\n", SEED_LASSO))

LASSO CV seed: 1978

cat(sprintf("  Bootstrap seed: %d\n", SEED_BOOTSTRAP))

Bootstrap seed: 42

cat(sprintf("  Calibration seed: %d\n", SEED_CALIBRATION))

Calibration seed: 42

A3. Model Coefficients and Equations

# =============================================================================
# APPENDIX A3: MODEL COEFFICIENTS AND EQUATIONS
# =============================================================================

cat("=== Deployed Model Coefficients (Shrinkage-Adjusted) ===\n\n")

=== Deployed Model Coefficients (Shrinkage-Adjusted) ===

# Get all coefficients from deployed (shrinkage-adjusted) model
all_coefs <- tryCatch({
  coef(deployed_model)
}, error = function(e) {
  c(Intercept = NA)
})

var_covar <- tryCatch({
  vcov(deployed_model)
}, error = function(e) {
  matrix(NA, nrow = length(all_coefs), ncol = length(all_coefs))
})

coef_table <- data.frame(
  Variable = names(all_coefs),
  Coefficient = round(all_coefs, 4),
  Std_Error = round(sqrt(diag(var_covar)), 4),
  Odds_Ratio = round(exp(all_coefs), 3),
  OR_Lower = round(exp(all_coefs - 1.96 * sqrt(diag(var_covar))), 3),
  OR_Upper = round(exp(all_coefs + 1.96 * sqrt(diag(var_covar))), 3)
)

kable(coef_table,
      caption = "Deployed Model Coefficients (Shrinkage-Adjusted) with 95% Confidence Intervals",
      col.names = c("Variable", "Beta", "SE(Beta)", "OR", "OR 2.5%", "OR 97.5%"),
      align = c("l", rep("r", 5))) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE,
                font_size = 11)
Deployed Model Coefficients (Shrinkage-Adjusted) with 95% Confidence Intervals
Variable Beta SE(Beta) OR OR 2.5% OR 97.5%
Intercept Intercept -8.0310 1.2997 0.000 0.000 0.004
Age Age 0.0721 0.0142 1.075 1.045 1.105
POPQ_Stage POPQ_Stage -0.2656 0.1450 0.767 0.577 1.019
BMI BMI 0.0305 0.0227 1.031 0.986 1.078
Recurrent_UTIs=Yes Recurrent_UTIs=Yes 0.8406 0.3993 2.318 1.060 5.069
Diabetes=Yes Diabetes=Yes -0.2566 0.3700 0.774 0.375 1.598
cat("\n=== Prediction Equation (Linear Predictor) ===\n\n")

=== Prediction Equation (Linear Predictor) ===

cat("LP = ", round(all_coefs[1], 4), " (Intercept)\n")

LP = -8.031 (Intercept)

for (i in 2:min(length(all_coefs), 10)) {
  sign <- ifelse(all_coefs[i] >= 0, "  + ", "  - ")
  cat(sign, abs(round(all_coefs[i], 4)), " * ", names(all_coefs)[i], "\n", sep = "")
}
  • 0.0721 * Age
  • 0.2656 * POPQ_Stage
  • 0.0305 * BMI
  • 0.8406 * Recurrent_UTIs=Yes
  • 0.2566 * Diabetes=Yes
if (length(all_coefs) > 10) {
  cat("  ... (", length(all_coefs) - 10, " additional terms)\n", sep = "")
}
cat("\n**Predicted probability:**\n\n")

Predicted probability:

cat("$$P(\\text{Cancelled}) = \\frac{1}{1 + e^{-LP}}$$\n")

\[P(\text{Cancelled}) = \frac{1}{1 + e^{-LP}}\]

A4. Missing Data Summary

# =============================================================================
# APPENDIX A4: MISSING DATA HANDLING
# =============================================================================

cat("=== Missing Data Summary for Final Model Variables ===\n\n")

=== Missing Data Summary for Final Model Variables ===

# Calculate missingness for each predictor
missing_summary <- data.frame(
  Variable = available_cols,
  N_Missing = sapply(available_cols, function(v) {
    if (v %in% names(nomogram_df)) sum(is.na(nomogram_df[[v]])) else NA
  }),
  Pct_Missing = sapply(available_cols, function(v) {
    if (v %in% names(nomogram_df)) {
      round(100 * sum(is.na(nomogram_df[[v]])) / nrow(nomogram_df), 2)
    } else NA
  })
)

missing_summary <- missing_summary %>%
  arrange(desc(Pct_Missing))

kable(missing_summary,
      caption = "Missing Data by Predictor Variable",
      col.names = c("Variable", "N Missing", "% Missing"),
      align = c("l", "r", "r")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                font_size = 12)
Missing Data by Predictor Variable
Variable N Missing % Missing
Age Age 0 0
Recurrent_UTIs Recurrent_UTIs 0 0
POPQ_Stage POPQ_Stage 0 0
Diabetes Diabetes 0 0
BMI BMI 0 0
cat("\n=== Missing Data Approach ===\n\n")

=== Missing Data Approach ===

cat("This analysis used complete case analysis. Patients with missing values for\n")

This analysis used complete case analysis. Patients with missing values for

cat("any predictor variable were excluded from the model fitting.\n\n")

any predictor variable were excluded from the model fitting.

cat(sprintf("Original sample size: %d\n", tripod_n_initial_raw))

Original sample size: 912

cat(sprintf("Final analysis sample: %d (%.1f%% of original)\n",
            nrow(nomogram_df),
            100 * nrow(nomogram_df) / tripod_n_initial_raw))

Final analysis sample: 841 (92.2% of original)

A5. TRIPOD+AI Checklist Summary (27 Items)

This appendix provides the complete TRIPOD+AI checklist for prediction model studies1. TRIPOD+AI (2024) replaces the original TRIPOD 2015 statement and applies to all prediction model studies using regression or machine learning methods. Each item includes the specific location in this manuscript where the requirement is addressed. A formal PROBAST (Prediction model Risk Of Bias Assessment Tool) self-assessment is provided below.

Reference: Collins GS, Moons KGM, Dhiman P, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024;385:e078378.

# =============================================================================
# APPENDIX A5: TRIPOD+AI CHECKLIST (27 ITEMS)
# Reference: Collins GS et al. BMJ 2024;385:e078378
# =============================================================================

tripod_items <- data.frame(
  Section = c(
    # Title/Abstract (2)
    "Title/Abstract", "Title/Abstract",
    # Introduction (3)
    "Introduction", "Introduction", "Introduction",
    # Methods - Data (5)
    "Methods", "Methods", "Methods", "Methods", "Methods",
    # Methods - Preprocessing (1)
    "Methods",
    # Methods - Outcome (3)
    "Methods", "Methods", "Methods",
    # Methods - Predictors (3)
    "Methods", "Methods", "Methods",
    # Methods - Size/Missing (2)
    "Methods", "Methods",
    # Methods - Analysis (4)
    "Methods", "Methods", "Methods", "Methods",
    # Results (4)
    "Results", "Results", "Results", "Results",
    # Discussion (2)
    "Discussion", "Discussion",
    # Other (3)
    "Other", "Other", "Other"
  ),
  Item = c(
    # Title/Abstract
    "1: Title identifies prediction model study",
    "2: Structured abstract per TRIPOD+AI",
    # Introduction
    "3a: Background and medical context",
    "3b: Objectives, target population, intended use",
    "3c: Known health inequalities between groups",
    # Methods - Data
    "5a: Data source with rationale and representativeness",
    "5b: Study dates",
    "6a: Setting, number/location of centers",
    "6b: Eligibility criteria",
    "6c: Treatments received (if applicable)",
    # Methods - Preprocessing
    "7: Data preprocessing and quality checking",
    # Methods - Outcome
    "8a: Outcome definition and time horizon",
    "8b: Blinding of outcome assessment",
    "8c: Assessor qualifications (if subjective)",
    # Methods - Predictors
    "9a: Choice of predictors and pre-selection",
    "9b: Predictor definitions and measurement",
    "9c: Assessor qualifications (if subjective)",
    # Methods - Size/Missing
    "10: Sample size justification",
    "11: Missing data handling",
    # Methods - Analysis
    "12a-c: Model type, building, validation",
    "12d: Heterogeneity across clusters",
    "13: Class imbalance handling",
    "14: Model fairness approach and rationale",
    # Results
    "20-21: Flow diagram, participant characteristics",
    "22: Full model specification",
    "23: Performance measures with CIs by subgroup",
    "15: Model output specification",
    # Discussion
    "25-26: Interpretation, limitations, fairness",
    "27a-c: Implementation considerations",
    # Other
    "18a-f: Open science and transparency",
    "19: Patient and public involvement",
    "17: Ethical approval"
  ),
  Location = c(
    # Title/Abstract
    "YAML title: 'Development and Internal Validation of a Clinical Prediction Model...'",
    "Section: ABSTRACT — Importance, Objectives, Study Design, Results, Conclusions",
    # Introduction
    "Section: INTRODUCTION — UDS importance, cancellation rates, existing models",
    "Section: INTRODUCTION — Objective paragraph",
    "Section: INTRODUCTION — Health inequalities paragraph; Discussion — ACOG race exclusion",
    # Methods - Data
    "Section: Study Design — REDCap, single academic center; representativeness discussed",
    sprintf("Section: Study Design — study period '%s'", if(exists("study_period")) study_period else "2019-2024"),
    "Section: Study Design — single academic urogynecology practice",
    "Section: Study Design — inclusion/exclusion criteria with counts",
    "Observational study — no intervention; standard screening practice described in Introduction",
    # Methods - Preprocessing
    "Section: Variable Selection — NZV, missingness, collinearity; applied uniformly across subgroups",
    # Methods - Outcome
    "Section: Outcome Definition — UTI >= 100K CFU/mL; at time of scheduled procedure",
    "Section: Outcome Definition — clinicians unaware of future model; retrospective design",
    "Objective lab outcome (urine culture) — formal blinding and assessor qualification not applicable",
    # Methods - Predictors
    "Section: Candidate Predictors — a priori selection, LASSO for data-driven reduction",
    "Section: Data Collection — all predictors defined, measured pre-procedure from EHR",
    "Structured EHR extraction — predictor assessment inherent in retrospective design; no subjective interpretation",
    # Methods - Size/Missing
    sprintf("Section: Sample Size — Riley criteria (pmsampsize); EPV = %.1f",
            as.numeric(results$n_cancelled) / as.numeric(results$model_df)),
    "Section: Missing Data — MICE with PMM/logistic/polytomous; outcome excluded not imputed",
    # Methods - Analysis
    "Sections: LASSO, RCS, Bootstrap Validation, Calibration Adjustment",
    "Section: Heterogeneity — single-center; temporal heterogeneity assessed",
    "Section: Class Imbalance — no resampling; calibrated probability output; Firth sensitivity analysis",
    "Section: Model Fairness — race/ethnicity excluded per ACOG 2024; limitations acknowledged",
    # Results
    "Figure 1: TRIPOD Flow Diagram; Table 1: stratified by outcome",
    "Section: Explicit Model Equation + Appendix A3 coefficients",
    "Sections: Discrimination, Calibration, Classification — C-stat, slope, Brier, sens/spec/PPV/NPV",
    "Section: Model Output — continuous probability (0-1); Youden threshold secondary",
    # Discussion
    "Section: DISCUSSION — Limitations, interpretation, race/ethnicity equity rationale",
    "Section: Implementation Considerations — intended use, users, deployment requirements",
    # Other
    "Section: Open Science and Transparency — protocol, funding, data, code, model, AI tools",
    "Section: Patient and Public Involvement",
    "Section: Ethical Approval — IRB waiver, Declaration of Helsinki"
  ),
  Status = c(
    # Title/Abstract
    "Complete", "Complete",
    # Introduction
    "Complete", "Complete", "Complete",
    # Methods - Data
    "Complete", "Complete", "Complete", "Complete", "Complete",
    # Methods - Preprocessing
    "Complete",
    # Methods - Outcome
    "Complete", "Complete", "Complete",
    # Methods - Predictors
    "Complete", "Complete", "Complete",
    # Methods - Size/Missing
    "Complete", "Complete",
    # Methods - Analysis
    "Complete", "Complete", "Complete", "Complete",
    # Results
    "Complete", "Complete", "Complete", "Complete",
    # Discussion
    "Complete", "Complete",
    # Other
    "Complete", "Complete", "Complete"
  ),
  stringsAsFactors = FALSE
)

# Display as formatted kable table
kable(tripod_items,
      caption = "TRIPOD+AI Checklist for Prediction Model Development (27 Items, Collins et al. BMJ 2024)",
      col.names = c("Section", "TRIPOD+AI Item", "Location in Manuscript", "Status"),
      align = c("l", "l", "l", "c")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE,
                font_size = 10) %>%
  row_spec(which(tripod_items$Status == "Complete"), background = "#E8F8F5") %>%
  column_spec(1, width = "12%", bold = TRUE) %>%
  column_spec(2, width = "28%") %>%
  column_spec(3, width = "48%") %>%
  column_spec(4, width = "12%")
TRIPOD+AI Checklist for Prediction Model Development (27 Items, Collins et al. BMJ 2024)
Section TRIPOD+AI Item Location in Manuscript Status
Title/Abstract 1: Title identifies prediction model study YAML title: ‘Development and Internal Validation of a Clinical Prediction Model…’ Complete
Title/Abstract 2: Structured abstract per TRIPOD+AI Section: ABSTRACT — Importance, Objectives, Study Design, Results, Conclusions Complete
Introduction 3a: Background and medical context Section: INTRODUCTION — UDS importance, cancellation rates, existing models Complete
Introduction 3b: Objectives, target population, intended use Section: INTRODUCTION — Objective paragraph Complete
Introduction 3c: Known health inequalities between groups Section: INTRODUCTION — Health inequalities paragraph; Discussion — ACOG race exclusion Complete
Methods 5a: Data source with rationale and representativeness Section: Study Design — REDCap, single academic center; representativeness discussed Complete
Methods 5b: Study dates Section: Study Design — study period ‘January 1, 2022 to December 31, 2024’ Complete
Methods 6a: Setting, number/location of centers Section: Study Design — single academic urogynecology practice Complete
Methods 6b: Eligibility criteria Section: Study Design — inclusion/exclusion criteria with counts Complete
Methods 6c: Treatments received (if applicable) Observational study — no intervention; standard screening practice described in Introduction Complete
Methods 7: Data preprocessing and quality checking Section: Variable Selection — NZV, missingness, collinearity; applied uniformly across subgroups Complete
Methods 8a: Outcome definition and time horizon Section: Outcome Definition — UTI >= 100K CFU/mL; at time of scheduled procedure Complete
Methods 8b: Blinding of outcome assessment Section: Outcome Definition — clinicians unaware of future model; retrospective design Complete
Methods 8c: Assessor qualifications (if subjective) Objective lab outcome (urine culture) — formal blinding and assessor qualification not applicable Complete
Methods 9a: Choice of predictors and pre-selection Section: Candidate Predictors — a priori selection, LASSO for data-driven reduction Complete
Methods 9b: Predictor definitions and measurement Section: Data Collection — all predictors defined, measured pre-procedure from EHR Complete
Methods 9c: Assessor qualifications (if subjective) Structured EHR extraction — predictor assessment inherent in retrospective design; no subjective interpretation Complete
Methods 10: Sample size justification Section: Sample Size — Riley criteria (pmsampsize); EPV = 11.4 Complete
Methods 11: Missing data handling Section: Missing Data — MICE with PMM/logistic/polytomous; outcome excluded not imputed Complete
Methods 12a-c: Model type, building, validation Sections: LASSO, RCS, Bootstrap Validation, Calibration Adjustment Complete
Methods 12d: Heterogeneity across clusters Section: Heterogeneity — single-center; temporal heterogeneity assessed Complete
Methods 13: Class imbalance handling Section: Class Imbalance — no resampling; calibrated probability output; Firth sensitivity analysis Complete
Methods 14: Model fairness approach and rationale Section: Model Fairness — race/ethnicity excluded per ACOG 2024; limitations acknowledged Complete
Results 20-21: Flow diagram, participant characteristics Figure 1: TRIPOD Flow Diagram; Table 1: stratified by outcome Complete
Results 22: Full model specification Section: Explicit Model Equation + Appendix A3 coefficients Complete
Results 23: Performance measures with CIs by subgroup Sections: Discrimination, Calibration, Classification — C-stat, slope, Brier, sens/spec/PPV/NPV Complete
Results 15: Model output specification Section: Model Output — continuous probability (0-1); Youden threshold secondary Complete
Discussion 25-26: Interpretation, limitations, fairness Section: DISCUSSION — Limitations, interpretation, race/ethnicity equity rationale Complete
Discussion 27a-c: Implementation considerations Section: Implementation Considerations — intended use, users, deployment requirements Complete
Other 18a-f: Open science and transparency Section: Open Science and Transparency — protocol, funding, data, code, model, AI tools Complete
Other 19: Patient and public involvement Section: Patient and Public Involvement Complete
Other 17: Ethical approval Section: Ethical Approval — IRB waiver, Declaration of Helsinki Complete

Compliance Summary: 32 of 32 items complete (100%)

Notes:

  • Items 8c and 9c address assessor qualifications for subjective outcomes/predictors. This study uses objective laboratory outcomes (urine culture ≥100K CFU/mL) and structured EHR predictors; formal blinding and assessor qualification requirements are inherent in the study design rather than requiring explicit protocols.
  • A detailed standalone TRIPOD checklist document is also available: TRIPOD_Checklist.Rmd

PROBAST Self-Assessment

The Prediction model Risk Of Bias Assessment Tool (PROBAST) evaluates risk of bias across four domains28. This self-assessment identifies potential limitations proactively.

probast_items <- data.frame(
  Domain = c(
    rep("1. Participants", 2),
    rep("2. Predictors", 3),
    rep("3. Outcome", 3),
    rep("4. Analysis", 7)
  ),
  Question = c(
    "Appropriate data sources?",
    "Appropriate inclusions/exclusions?",
    "Defined and assessed consistently?",
    "Assessed without knowledge of outcome?",
    "Available at intended use time?",
    "Defined and determined appropriately?",
    "Assessed without knowledge of predictors?",
    "Appropriate time interval?",
    "Reasonable number of events?",
    "Continuous predictors handled appropriately?",
    "No selection based on univariate analysis?",
    "Model complexity accounts for EPV?",
    "Missing data handled appropriately?",
    "Performance measures appropriate?",
    "Overfitting accounted for?"
  ),
  Assessment = c(
    "Low risk — single academic center, standardized REDCap extraction",
    "Low risk — consecutive patients, explicit exclusion criteria (no-shows, missing outcome)",
    "Low risk — all predictors from structured EHR fields, measured pre-procedure",
    "Low risk — predictors recorded before cancellation decision",
    "Low risk — all predictors available at time of scheduling",
    "Low risk — UTI cancellation documented by treating clinician per protocol",
    "Low risk — cancellation decision made without knowledge of model predictors",
    "Low risk — outcome assessed at time of scheduled procedure",
    sprintf("Moderate concern — %s events for %s model df (EPV = %.1f < 10)",
            results$n_cancelled, results$model_df,
            as.numeric(results$n_cancelled) / as.numeric(results$model_df)),
    "Low risk — RCS evaluated; linear terms retained per EPV constraints",
    "Low risk — LASSO used for selection (not univariate screening)",
    sprintf("Low risk — EPV-constrained lambda selection limits to %s predictors; shrinkage applied",
            results$n_predictors),
    "Low risk — MICE multiple imputation with outcome in imputation model",
    "Low risk — C-statistic, calibration slope, Brier score, DCA all reported",
    "Low risk — full-process bootstrap (B=1000) with shrinkage adjustment"
  ),
  Risk = c(
    "Low", "Low",
    "Low", "Low", "Low",
    "Low", "Low", "Low",
    "Moderate", "Low", "Low", "Low", "Low", "Low", "Low"
  ),
  stringsAsFactors = FALSE
)

kable(probast_items,
      col.names = c("Domain", "Signaling Question", "Assessment", "Risk of Bias"),
      caption = "PROBAST Self-Assessment: Risk of Bias in Prediction Model Development",
      align = c("l", "l", "l", "c")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 11) %>%
  pack_rows("Domain 1: Participants", 1, 2) %>%
  pack_rows("Domain 2: Predictors", 3, 5) %>%
  pack_rows("Domain 3: Outcome", 6, 8) %>%
  pack_rows("Domain 4: Analysis", 9, 15) %>%
  column_spec(4, bold = TRUE) %>%
  row_spec(which(probast_items$Risk == "Moderate"), background = "#FFF3CD")
PROBAST Self-Assessment: Risk of Bias in Prediction Model Development
Domain Signaling Question Assessment Risk of Bias
Domain 1: Participants
  1. Participants
Appropriate data sources? Low risk — single academic center, standardized REDCap extraction Low
  1. Participants
Appropriate inclusions/exclusions? Low risk — consecutive patients, explicit exclusion criteria (no-shows, missing outcome) Low
Domain 2: Predictors
  1. Predictors
Defined and assessed consistently? Low risk — all predictors from structured EHR fields, measured pre-procedure Low
  1. Predictors
Assessed without knowledge of outcome? Low risk — predictors recorded before cancellation decision Low
  1. Predictors
Available at intended use time? Low risk — all predictors available at time of scheduling Low
Domain 3: Outcome
  1. Outcome
Defined and determined appropriately? Low risk — UTI cancellation documented by treating clinician per protocol Low
  1. Outcome
Assessed without knowledge of predictors? Low risk — cancellation decision made without knowledge of model predictors Low
  1. Outcome
Appropriate time interval? Low risk — outcome assessed at time of scheduled procedure Low
Domain 4: Analysis
  1. Analysis
Reasonable number of events? Moderate concern — 57 events for 5 model df (EPV = 11.4 < 10) Moderate
  1. Analysis
Continuous predictors handled appropriately? Low risk — RCS evaluated; linear terms retained per EPV constraints Low
  1. Analysis
No selection based on univariate analysis? Low risk — LASSO used for selection (not univariate screening) Low
  1. Analysis
Model complexity accounts for EPV? Low risk — EPV-constrained lambda selection limits to 5 predictors; shrinkage applied Low
  1. Analysis
Missing data handled appropriately? Low risk — MICE multiple imputation with outcome in imputation model Low
  1. Analysis
Performance measures appropriate? Low risk — C-statistic, calibration slope, Brier score, DCA all reported Low
  1. Analysis
Overfitting accounted for? Low risk — full-process bootstrap (B=1000) with shrinkage adjustment Low

Overall PROBAST Assessment: 14 of 15 items rated low risk of bias. One item (events per variable) rated moderate concern — addressed through LASSO regularization, full-process bootstrap validation, and shrinkage adjustment.

A6. STROBE Checklist Crosswalk

This appendix maps the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) checklist items to the corresponding sections in this manuscript5. As a prediction model study, the primary reporting guideline is TRIPOD (Appendix A5); this STROBE crosswalk demonstrates compliance with observational study reporting standards where applicable.

Reference: von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP; STROBE Initiative. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Ann Intern Med. 2007;147(8):573-577.

# =============================================================================
# APPENDIX A6: STROBE CHECKLIST CROSSWALK (22 ITEMS)
# =============================================================================

strobe_items <- data.frame(
  Section = c(
    "Title/Abstract",
    "Introduction", "Introduction",
    "Methods", "Methods", "Methods", "Methods", "Methods",
    "Methods", "Methods", "Methods", "Methods",
    "Results", "Results", "Results", "Results", "Results",
    "Discussion", "Discussion", "Discussion", "Discussion",
    "Other"
  ),
  Item = c(
    "1. Title and abstract",
    "2. Background/rationale",
    "3. Objectives",
    "4. Study design",
    "5. Setting",
    "6. Participants",
    "7. Variables",
    "8. Data sources/measurement",
    "9. Bias",
    "10. Study size",
    "11. Quantitative variables",
    "12. Statistical methods",
    "13. Participants",
    "14. Descriptive data",
    "15. Outcome data",
    "16. Main results",
    "17. Other analyses",
    "18. Key results",
    "19. Limitations",
    "20. Interpretation",
    "21. Generalisability",
    "22. Funding"
  ),
  Location = c(
    "Abstract — study design, setting, participants, predictors, outcome, and key findings",
    "Introduction — prevalence of UTI-related cancellations, gap in selective screening",
    "Introduction — 'To develop and internally validate a multivariable clinical prediction model...'",
    "Materials and Methods: Study Design — retrospective cohort, TRIPOD Type 1b",
    sprintf("Materials and Methods: Study Population — single academic urogynecology practice, %s",
            if (exists("study_period")) study_period else "2019-2024"),
    sprintf("Materials and Methods: Study Population; TRIPOD Flow (Figure 1) — %s initial, %s analyzed",
            if (exists("tripod_n_initial")) tripod_n_initial else "N",
            if (exists("tripod_n_final")) tripod_n_final else "N"),
    sprintf("Materials and Methods: Candidate Predictors — %s LASSO-selected predictors defined with measurement methods",
            if (exists("results") && !is.null(results$n_predictors)) results$n_predictors else "7"),
    "Materials and Methods: Data Collection — REDCap database, standardized extraction",
    "Materials and Methods: Variable Selection (LASSO avoids selection bias); Missing Data (MICE); Limitations section",
    sprintf("Materials and Methods: Sample Size Considerations — %s events, EPV = %.1f, pmsampsize evaluation",
            if (exists("results") && !is.null(results$n_cancelled)) results$n_cancelled else "57",
            if (exists("results") && !is.null(results$n_cancelled) && !is.null(results$n_predictors))
              as.numeric(results$n_cancelled) / as.numeric(results$n_predictors) else 8.1),
    "Materials and Methods: Statistical Analysis — continuous variables as median (IQR); RCS evaluated for non-linearity",
    "Materials and Methods: Statistical Analysis, Internal Validation, Shrinkage — LASSO, bootstrap, DCA, NRI",
    "Results: Patient Population and Flow (TRIPOD flow diagram with counts at each exclusion step)",
    "Results: Table 1 — demographics and clinical characteristics stratified by outcome",
    sprintf("Results: %s cancellations (%s%%) in %s patients",
            if (exists("results") && !is.null(results$n_cancelled)) results$n_cancelled else "57",
            if (exists("results") && !is.null(results$pct_cancelled)) results$pct_cancelled else "6.8",
            if (exists("results") && !is.null(results$n_total)) results$n_total else "841"),
    "Results: Model Performance — C-statistic, calibration slope, OR table, nomogram; Internal Validation table",
    "Results: Decision Curve Analysis, Clinical Impact Curve, NRI, Temporal Validation, Instability Analysis",
    "Discussion paragraph 1 — predictor interpretation and clinical alignment",
    "Discussion: Limitations — single-center, low EPV, retrospective design, no external validation",
    "Discussion: clinical utility of nomogram and web calculator for selective screening",
    "Discussion: Limitations — single academic practice, need for external validation in diverse populations",
    "Not applicable (no external funding)"
  ),
  Status = rep("Complete", 22),
  stringsAsFactors = FALSE
)

kable(strobe_items,
      col.names = c("Section", "STROBE Item", "Manuscript Location", "Status"),
      caption = "STROBE Checklist for Observational Studies (22 Items)",
      align = c("l", "l", "l", "c")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE,
                font_size = 11) %>%
  pack_rows("Title and Abstract", 1, 1) %>%
  pack_rows("Introduction", 2, 3) %>%
  pack_rows("Methods", 4, 12) %>%
  pack_rows("Results", 13, 17) %>%
  pack_rows("Discussion", 18, 21) %>%
  pack_rows("Other Information", 22, 22) %>%
  column_spec(1, width = "10%") %>%
  column_spec(2, width = "18%") %>%
  column_spec(3, width = "62%") %>%
  column_spec(4, width = "10%")
STROBE Checklist for Observational Studies (22 Items)
Section STROBE Item Manuscript Location Status
Title and Abstract
Title/Abstract
  1. Title and abstract
Abstract — study design, setting, participants, predictors, outcome, and key findings Complete
Introduction
Introduction
  1. Background/rationale
Introduction — prevalence of UTI-related cancellations, gap in selective screening Complete
Introduction
  1. Objectives
Introduction — ‘To develop and internally validate a multivariable clinical prediction model…’ Complete
Methods
Methods
  1. Study design
Materials and Methods: Study Design — retrospective cohort, TRIPOD Type 1b Complete
Methods
  1. Setting
Materials and Methods: Study Population — single academic urogynecology practice, January 1, 2022 to December 31, 2024 Complete
Methods
  1. Participants
Materials and Methods: Study Population; TRIPOD Flow (Figure 1) — 912 initial, 841 analyzed Complete
Methods
  1. Variables
Materials and Methods: Candidate Predictors — 5 LASSO-selected predictors defined with measurement methods Complete
Methods
  1. Data sources/measurement
Materials and Methods: Data Collection — REDCap database, standardized extraction Complete
Methods
  1. Bias
Materials and Methods: Variable Selection (LASSO avoids selection bias); Missing Data (MICE); Limitations section Complete
Methods
  1. Study size
Materials and Methods: Sample Size Considerations — 57 events, EPV = 11.4, pmsampsize evaluation Complete
Methods
  1. Quantitative variables
Materials and Methods: Statistical Analysis — continuous variables as median (IQR); RCS evaluated for non-linearity Complete
Methods
  1. Statistical methods
Materials and Methods: Statistical Analysis, Internal Validation, Shrinkage — LASSO, bootstrap, DCA, NRI Complete
Results
Results
  1. Participants
Results: Patient Population and Flow (TRIPOD flow diagram with counts at each exclusion step) Complete
Results
  1. Descriptive data
Results: Table 1 — demographics and clinical characteristics stratified by outcome Complete
Results
  1. Outcome data
Results: 57 cancellations (6.8%) in 841 patients Complete
Results
  1. Main results
Results: Model Performance — C-statistic, calibration slope, OR table, nomogram; Internal Validation table Complete
Results
  1. Other analyses
Results: Decision Curve Analysis, Clinical Impact Curve, NRI, Temporal Validation, Instability Analysis Complete
Discussion
Discussion
  1. Key results
Discussion paragraph 1 — predictor interpretation and clinical alignment Complete
Discussion
  1. Limitations
Discussion: Limitations — single-center, low EPV, retrospective design, no external validation Complete
Discussion
  1. Interpretation
Discussion: clinical utility of nomogram and web calculator for selective screening Complete
Discussion
  1. Generalisability
Discussion: Limitations — single academic practice, need for external validation in diverse populations Complete
Other Information
Other
  1. Funding
Not applicable (no external funding) Complete

Compliance Summary: 22 of 22 STROBE items addressed (100%)

Note: As a prediction model study, TRIPOD is the primary reporting guideline (Appendix A5). STROBE items overlap substantially with TRIPOD for observational study design, participants, and statistical methods. Items unique to STROBE (e.g., bias assessment, generalisability) are addressed in the Discussion and Limitations sections.

A7. Configuration Parameters

# =============================================================================
# APPENDIX A6: ALL CONFIGURATION PARAMETERS
# =============================================================================

config_params <- data.frame(
  Parameter = c(
    "SEED_MAIN", "SEED_LASSO", "SEED_BOOTSTRAP", "SEED_CALIBRATION",
    "ALPHA_SIGNIFICANCE", "ALPHA_TRENDING", "CONFIDENCE_LEVEL",
    "EPV_MINIMUM", "LASSO_ALPHA", "CV_FOLDS",
    "BOOTSTRAP_RESAMPLES", "CALIBRATION_RESAMPLES",
    "DCA_THRESHOLD_MIN", "DCA_THRESHOLD_MAX", "DCA_THRESHOLD_STEP",
    "CIC_SCALE_PER"
  ),
  Value = c(
    SEED_MAIN, SEED_LASSO, SEED_BOOTSTRAP, SEED_CALIBRATION,
    ALPHA_SIGNIFICANCE, ALPHA_TRENDING, CONFIDENCE_LEVEL,
    EPV_MINIMUM, LASSO_ALPHA, CV_FOLDS,
    BOOTSTRAP_RESAMPLES, CALIBRATION_RESAMPLES,
    DCA_THRESHOLD_MIN, DCA_THRESHOLD_MAX, DCA_THRESHOLD_STEP,
    CIC_SCALE_PER
  ),
  Description = c(
    "Main random seed for general operations",
    "Seed for LASSO cross-validation",
    "Seed for bootstrap validation",
    "Seed for calibration analysis",
    "P-value threshold for statistical significance",
    "P-value threshold for trending significance",
    "Confidence level for intervals",
    "Minimum events per predictor variable",
    "LASSO alpha (1=LASSO, 0=ridge)",
    "Number of cross-validation folds",
    "Bootstrap resamples for validation",
    "Bootstrap resamples for calibration",
    "Minimum DCA threshold probability",
    "Maximum DCA threshold probability",
    "DCA threshold step size",
    "Scale factor for clinical impact curve"
  )
)

kable(config_params,
      caption = "Complete Configuration Parameters for Reproducibility",
      align = c("l", "r", "l")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE,
                font_size = 11)
Complete Configuration Parameters for Reproducibility
Parameter Value Description
SEED_MAIN 42.00 Main random seed for general operations
SEED_LASSO 1978.00 Seed for LASSO cross-validation
SEED_BOOTSTRAP 42.00 Seed for bootstrap validation
SEED_CALIBRATION 42.00 Seed for calibration analysis
ALPHA_SIGNIFICANCE 0.05 P-value threshold for statistical significance
ALPHA_TRENDING 0.10 P-value threshold for trending significance
CONFIDENCE_LEVEL 0.95 Confidence level for intervals
EPV_MINIMUM 10.00 Minimum events per predictor variable
LASSO_ALPHA 1.00 LASSO alpha (1=LASSO, 0=ridge)
CV_FOLDS 10.00 Number of cross-validation folds
BOOTSTRAP_RESAMPLES 1000.00 Bootstrap resamples for validation
CALIBRATION_RESAMPLES 200.00 Bootstrap resamples for calibration
DCA_THRESHOLD_MIN 0.00 Minimum DCA threshold probability
DCA_THRESHOLD_MAX 0.50 Maximum DCA threshold probability
DCA_THRESHOLD_STEP 0.01 DCA threshold step size
CIC_SCALE_PER 1000.00 Scale factor for clinical impact curve

A8. Advanced Hmisc/rms Diagnostics

This section provides additional diagnostic analyses using advanced features from the Hmisc and rms packages to strengthen model validation and support publication requirements.

A8.1 Correlation Matrix with Significance Testing

A8.2 Variable Clustering Analysis

A8.3 Missing Data Pattern Analysis


A9. Enhanced Model Validation

This section provides additional validation techniques using robust variance estimation and specific contrast testing.

A9.1 Clinical Contrast Testing

# =============================================================================
# APPENDIX A6.1: CLINICAL CONTRAST TESTING (rms::contrast)
# Test specific clinical hypotheses with confidence intervals
# =============================================================================

log_info("Performing clinical contrast testing using rms::contrast()...")

cat("=== Clinical Contrast Testing ===\n\n")

=== Clinical Contrast Testing ===

cat("Testing specific clinical hypotheses about risk differences between patient profiles.\n\n")

Testing specific clinical hypotheses about risk differences between patient profiles.

# Define clinically meaningful contrasts based on model predictors
if (exists("deployed_model") && inherits(deployed_model, "lrm")) {

  # Get model predictor names
  model_vars <- names(deployed_model$Design$parms)
  if (is.null(model_vars)) {
    model_vars <- attr(deployed_model$terms, "term.labels")
  }

  contrast_results <- list()

  # Contrast 1: Age effect (older vs younger patient)
  if ("Age" %in% model_vars) {
    cat("### Contrast 1: Age Effect (70 vs 50 years)\n")
    contrast1 <- tryCatch({
      rms::contrast(deployed_model,
                    list(Age = 70),
                    list(Age = 50),
                    type = "average")
    }, error = function(e) {
      log_warn(sprintf("Age contrast failed: %s", e$message))
      NULL
    })

    if (!is.null(contrast1)) {
      print(contrast1)
      contrast_results$age <- contrast1
      cat("\nInterpretation: The log-odds ratio for a 70-year-old vs 50-year-old patient,\n")
      cat("averaged over other predictor values.\n\n")
    }
  }

  # Contrast 2: BMI effect (obese vs normal weight)
  if ("BMI" %in% model_vars) {
    cat("### Contrast 2: BMI Effect (35 vs 25 kg/m²)\n")
    contrast2 <- tryCatch({
      rms::contrast(deployed_model,
                    list(BMI = 35),
                    list(BMI = 25),
                    type = "average")
    }, error = function(e) {
      log_warn(sprintf("BMI contrast failed: %s", e$message))
      NULL
    })

    if (!is.null(contrast2)) {
      print(contrast2)
      contrast_results$bmi <- contrast2
      cat("\nInterpretation: The log-odds ratio for an obese (BMI=35) vs normal weight (BMI=25) patient.\n\n")
    }
  }

  # Contrast 3: Recurrent UTIs effect
  if ("Recurrent_UTIs" %in% model_vars) {
    cat("### Contrast 3: Recurrent UTIs Effect (Yes vs No)\n")
    contrast3 <- tryCatch({
      rms::contrast(deployed_model,
                    list(Recurrent_UTIs = "Yes"),
                    list(Recurrent_UTIs = "No"),
                    type = "average")
    }, error = function(e) {
      log_warn(sprintf("Recurrent UTIs contrast failed: %s", e$message))
      NULL
    })

    if (!is.null(contrast3)) {
      print(contrast3)
      contrast_results$recurrent_utis <- contrast3
      cat("\nInterpretation: The log-odds ratio for patients with vs without recurrent UTI history.\n\n")
    }
  }

  # Contrast 4: High-risk vs low-risk profile
  cat("### Contrast 4: High-Risk vs Low-Risk Patient Profile\n")

  # Define high-risk and low-risk profiles based on available predictors
  high_risk_profile <- list()
  low_risk_profile <- list()

  if ("Age" %in% model_vars) {
    high_risk_profile$Age <- 75
    low_risk_profile$Age <- 45
  }
  if ("BMI" %in% model_vars) {
    high_risk_profile$BMI <- 35
    low_risk_profile$BMI <- 24
  }
  if ("Recurrent_UTIs" %in% model_vars) {
    high_risk_profile$Recurrent_UTIs <- "Yes"
    low_risk_profile$Recurrent_UTIs <- "No"
  }
  if ("Overactive_Bladder" %in% model_vars) {
    high_risk_profile$Overactive_Bladder <- "Yes"
    low_risk_profile$Overactive_Bladder <- "No"
  }
  if ("Detrusor_Overactivity" %in% model_vars) {
    high_risk_profile$Detrusor_Overactivity <- "Yes"
    low_risk_profile$Detrusor_Overactivity <- "No"
  }

  if (length(high_risk_profile) > 0) {
    contrast4 <- tryCatch({
      rms::contrast(deployed_model, high_risk_profile, low_risk_profile)
    }, error = function(e) {
      log_warn(sprintf("Profile contrast failed: %s", e$message))
      NULL
    })

    if (!is.null(contrast4)) {
      cat("High-risk profile:", paste(names(high_risk_profile), "=", high_risk_profile, collapse = ", "), "\n")
      cat("Low-risk profile:", paste(names(low_risk_profile), "=", low_risk_profile, collapse = ", "), "\n\n")
      print(contrast4)
      contrast_results$profile <- contrast4

      # Calculate odds ratio
      if (!is.null(contrast4$Contrast)) {
        or <- exp(contrast4$Contrast)
        or_lower <- exp(contrast4$Lower)
        or_upper <- exp(contrast4$Upper)
        cat(sprintf("\nOdds Ratio: %.2f (95%% CI: %.2f - %.2f)\n", or, or_lower, or_upper))
      }
    }
  }

  cat("\n=== Summary ===\n")
  cat("These contrasts provide effect estimates with 95% confidence intervals for\n")
  cat("specific clinical comparisons, enabling hypothesis testing beyond overall model fit.\n")

  log_info("Clinical contrast testing complete")
} else {
  cat("Model object not available for contrast testing.\n")
}

Contrast 3: Recurrent UTIs Effect (Yes vs No)

Contrast      S.E.      Lower    Upper    Z Pr(>|z|)

11 0.8405901 0.3992687 0.05803784 1.623142 2.11 0.0353

Confidence intervals are 0.95 individual intervals

Interpretation: The log-odds ratio for patients with vs without recurrent UTI history.

Contrast 4: High-Risk vs Low-Risk Patient Profile

High-risk profile: Recurrent_UTIs = Yes Low-risk profile: Recurrent_UTIs = No

Age POPQ_Stage BMI Diabetes Contrast S.E. Lower Upper Z 1 65 2 28.52 Yes 0.8405901 0.3992687 0.05803784 1.623142 2.11 Pr(>|z|) 1 0.0353

Confidence intervals are 0.95 individual intervals

Odds Ratio: 2.32 (95% CI: 1.06 - 5.07)

=== Summary === These contrasts provide effect estimates with 95% confidence intervals for specific clinical comparisons, enabling hypothesis testing beyond overall model fit.

A9.2 Bootstrap Covariance Estimation

A9.3 Robust Sandwich Covariance


A10. Publication Support

This section provides outputs formatted for manuscript preparation and external implementation.

A10.1 Model Specification Summary

A10.2 LaTeX Output for Publications

A10.3 Nomogram Formula Extraction

# =============================================================================
# APPENDIX A7.3: NOMOGRAM FORMULA EXTRACTION
# Extract equations for programmatic scoring
# =============================================================================

log_info("Extracting nomogram formulas for programmatic implementation...")

cat("=== Nomogram Scoring Equations ===\n\n")

=== Nomogram Scoring Equations ===

cat("These equations enable programmatic calculation of nomogram points\n")

These equations enable programmatic calculation of nomogram points

cat("without requiring visual point reading from the nomogram graphic.\n\n")

without requiring visual point reading from the nomogram graphic.

if (exists("deployed_model") && inherits(deployed_model, "lrm")) {

  # Extract model coefficients
  coefs <- coef(deployed_model)

  cat("=== Linear Predictor Equation ===\n\n")
  cat("The linear predictor (log-odds) is calculated as:\n\n")
  cat("LP = ")

  # Build equation string
  eq_parts <- c()
  for (i in seq_along(coefs)) {
    coef_name <- names(coefs)[i]
    coef_val <- coefs[i]

    if (coef_name == "Intercept") {
      eq_parts <- c(eq_parts, sprintf("%.4f", coef_val))
    } else {
      sign <- ifelse(coef_val >= 0, "+", "")
      eq_parts <- c(eq_parts, sprintf("%s %.4f × %s", sign, coef_val, coef_name))
    }
  }
  cat(paste(eq_parts, collapse = " "), "\n\n")

  cat("=== Probability Calculation ===\n\n")
  cat("Probability = 1 / (1 + exp(-LP))\n")
  cat("Or equivalently: Probability = exp(LP) / (1 + exp(LP))\n\n")

  cat("=== Point Scoring System ===\n\n")
  cat("For nomogram implementation, points are typically scaled such that\n")
  cat("the maximum points for each predictor sum to 100.\n\n")

  # Create point scoring table
  # Scale coefficients to 0-100 point range
  if (length(coefs) > 1) {
    abs_coefs <- abs(coefs[-1])  # Exclude intercept
    max_coef <- max(abs_coefs)

    point_table <- data.frame(
      Predictor = names(coefs)[-1],
      Coefficient = round(coefs[-1], 4),
      Points_Per_Unit = round(coefs[-1] / max_coef * 10, 2)
    )

    cat("Points per unit (scaled, 10 points = largest coefficient):\n\n")
    print(point_table)

    cat("\n\nNote: For categorical variables, 'Points_Per_Unit' represents the\n")
    cat("points assigned when the variable equals 1 (vs reference = 0).\n")
    cat("For continuous variables, multiply by the actual value.\n")
  }

  cat("\n=== Implementation Example (R) ===\n\n")
  cat("```r\n")
  cat("# Function to calculate predicted probability\n")
  cat("predict_cancellation <- function(")
  cat(paste(names(coefs)[-1], collapse = ", "))
  cat(") {\n")
  cat("  lp <- ", round(coefs[1], 4))
  for (i in 2:length(coefs)) {
    sign <- ifelse(coefs[i] >= 0, "+", "-")
    cat(sprintf(" %s %.4f * %s", sign, abs(coefs[i]), names(coefs)[i]))
  }
  cat("\n  return(1 / (1 + exp(-lp)))\n")
  cat("}\n")
  cat("```\n")

  log_info("Nomogram formula extraction complete")
} else {
  cat("Model object not available for formula extraction.\n")
}

=== Linear Predictor Equation ===

The linear predictor (log-odds) is calculated as:

LP = -8.0310 + 0.0721 × Age -0.2656 × POPQ_Stage + 0.0305 × BMI + 0.8406 × Recurrent_UTIs=Yes -0.2566 × Diabetes=Yes

=== Probability Calculation ===

Probability = 1 / (1 + exp(-LP)) Or equivalently: Probability = exp(LP) / (1 + exp(LP))

=== Point Scoring System ===

For nomogram implementation, points are typically scaled such that the maximum points for each predictor sum to 100.

Points per unit (scaled, 10 points = largest coefficient):

                        Predictor Coefficient Points_Per_Unit

Age Age 0.0721 0.86 POPQ_Stage POPQ_Stage -0.2656 -3.16 BMI BMI 0.0305 0.36 Recurrent_UTIs=Yes Recurrent_UTIs=Yes 0.8406 10.00 Diabetes=Yes Diabetes=Yes -0.2566 -3.05

Note: For categorical variables, ‘Points_Per_Unit’ represents the points assigned when the variable equals 1 (vs reference = 0). For continuous variables, multiply by the actual value.

=== Implementation Example (R) ===

# Function to calculate predicted probability
predict_cancellation <- function(Age, POPQ_Stage, BMI, Recurrent_UTIs=Yes, Diabetes=Yes) {
  lp <-  -8.031 + 0.0721 * Age - 0.2656 * POPQ_Stage + 0.0305 * BMI + 0.8406 * Recurrent_UTIs=Yes - 0.2566 * Diabetes=Yes
  return(1 / (1 + exp(-lp)))
}

A10.4 External Validation Code Export

# =============================================================================
# APPENDIX A7.4: EXTERNAL VALIDATION CODE EXPORT
# Code for validating model in external datasets
# =============================================================================

log_info("Generating external validation code...")

cat("=== External Validation Implementation ===\n\n")

=== External Validation Implementation ===

cat("Use the following code to validate this model in external datasets.\n\n")

Use the following code to validate this model in external datasets.

if (exists("deployed_model") && inherits(deployed_model, "lrm")) {

  # Get model formula
  model_formula <- tryCatch({
    formula(deployed_model)
  }, error = function(e) NULL)

  cat("=== Required Variables ===\n\n")
  predictor_names <- names(deployed_model$Design$parms)
  if (is.null(predictor_names)) {
    predictor_names <- attr(deployed_model$terms, "term.labels")
  }

  for (pred in predictor_names) {
    cat(sprintf("- %s\n", pred))
  }

  cat("\n=== R Code for External Validation ===\n\n")
  cat("```r\n")
  cat("# Load required packages\n")
  cat("library(rms)\n")
  cat("library(pROC)\n\n")

  cat("# Load your external validation data\n")
  cat("# external_data <- read.csv('your_external_data.csv')\n\n")

  cat("# Model coefficients (copy from above)\n")
  cat("coefs <- c(\n")
  coef_strings <- sprintf("  '%s' = %.6f", names(coef(deployed_model)), coef(deployed_model))
  cat(paste(coef_strings, collapse = ",\n"))
  cat("\n)\n\n")

  cat("# Calculate linear predictor\n")
  cat("calculate_lp <- function(data) {\n")
  cat("  lp <- coefs['Intercept']\n")
  cat("  for (var in names(coefs)[-1]) {\n")
  cat("    if (var %in% names(data)) {\n")
  cat("      lp <- lp + coefs[var] * data[[var]]\n")
  cat("    }\n")
  cat("  }\n")
  cat("  return(lp)\n")
  cat("}\n\n")

  cat("# Calculate predicted probabilities\n")
  cat("# external_data$pred_prob <- plogis(calculate_lp(external_data))\n\n")

  cat("# Calculate C-statistic\n")
  cat("# roc_result <- pROC::roc(external_data$outcome, external_data$pred_prob)\n")
  cat("# print(pROC::auc(roc_result))\n")
  cat("```\n")

  log_info("External validation code export complete")
}

=== Required Variables ===

  • Recurrent_UTIs
  • Diabetes

=== R Code for External Validation ===

# Load required packages
library(rms)
library(pROC)

# Load your external validation data
# external_data <- read.csv('your_external_data.csv')

# Model coefficients (copy from above)
coefs <- c(
  'Intercept' = -8.030983,
  'Age' = 0.072083,
  'POPQ_Stage' = -0.265570,
  'BMI' = 0.030467,
  'Recurrent_UTIs=Yes' = 0.840590,
  'Diabetes=Yes' = -0.256563
)

# Calculate linear predictor
calculate_lp <- function(data) {
  lp <- coefs['Intercept']
  for (var in names(coefs)[-1]) {
    if (var %in% names(data)) {
      lp <- lp + coefs[var] * data[[var]]
    }
  }
  return(lp)
}

# Calculate predicted probabilities
# external_data$pred_prob <- plogis(calculate_lp(external_data))

# Calculate C-statistic
# roc_result <- pROC::roc(external_data$outcome, external_data$pred_prob)
# print(pROC::auc(roc_result))

End of Technical Appendix


Figure and Table Legends

Manuscript Figure Legends

Figure 1. TRIPOD Flow Diagram. Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) compliant flow diagram showing patient selection from initial screening through final analysis cohort. The diagram illustrates the number of patients at each stage of selection, including exclusions due to incomplete data, missing outcome information, or protocol violations, with the final sample size used for model development.

Figure 2. Clinical Prediction Nomogram (Shrinkage-Adjusted). Nomogram for predicting the probability of urodynamic procedure cancellation due to urinary tract infection (UTI). This nomogram uses shrinkage-adjusted coefficients (calibration slope applied to all predictors) to produce properly calibrated predictions for new patients. Each predictor variable is assigned points based on its adjusted coefficient in the multivariable logistic regression model. To use the nomogram, locate the patient’s value for each predictor on its axis, draw a vertical line to the “Points” axis at the top, sum the points for all predictors, and find the corresponding predicted probability on the “Probability of Cancellation” axis at the bottom.

Figure 3. Calibration Plot. Calibration plot assessing the agreement between predicted probabilities and observed outcomes. Predicted probabilities are grouped into deciles, and the observed proportion of events within each group is plotted against the mean predicted probability. The diagonal line represents perfect calibration. Points above the line indicate underprediction, while points below indicate overprediction. Error bars represent 95% confidence intervals. The Hosmer-Lemeshow goodness-of-fit test provides a statistical assessment of calibration, with p > 0.05 suggesting adequate model fit.

Manuscript Table Legends

Table 1. Demographic and Clinical Characteristics Stratified by Procedure Cancellation Status. Baseline characteristics of the study population, stratified by whether the urodynamic procedure was completed or cancelled due to urinary tract infection. Continuous variables are presented as mean ± standard deviation or median (interquartile range) as appropriate based on distribution. Categorical variables are presented as frequency (percentage). P-values are calculated using Student’s t-test or Mann-Whitney U test for continuous variables and chi-squared or Fisher’s exact test for categorical variables, as appropriate. Statistical significance is defined as p < 0.05.

Table 2. Multivariable Logistic Regression Model for Urodynamic Procedure Cancellation Due to Urinary Tract Infection. Final multivariable logistic regression model coefficients, odds ratios, and 95% confidence intervals for each predictor variable. Variables were selected using LASSO regularization with 10-fold cross-validation. The model uses the rms::lrm() function to enable restricted cubic spline transformations for continuous predictors where appropriate. Odds ratios represent the change in odds of procedure cancellation for a one-unit change in the predictor (or for categorical variables, compared to the reference category).


Supplemental Digital Content

Supplemental Figure Legends

Supplemental Figure 1. Receiver Operating Characteristic (ROC) Curve. Receiver operating characteristic curve demonstrating the discrimination ability of the prediction model. The curve plots sensitivity (true positive rate) against 1-specificity (false positive rate) across all possible classification thresholds. The area under the curve (C-statistic) quantifies the model’s ability to distinguish between patients whose procedures will be cancelled versus completed. The diagonal dashed line represents a model with no discriminative ability (C-statistic = 0.50). The optimal threshold point, determined by maximizing the Youden index (sensitivity + specificity - 1), is indicated on the curve.

Supplemental Figure 2. Decision Curve Analysis. Decision curve analysis (DCA) comparing the clinical utility of the prediction model against alternative strategies of treating all patients (“Treat All”) or no patients (“Treat None”). The y-axis represents net benefit, defined as: Net Benefit = (True Positives/N) - (False Positives/N) × [Threshold/(1-Threshold)]. The x-axis represents threshold probability, reflecting the clinical judgment about the relative harms of false-positive versus false-negative predictions. A model provides clinical utility when its net benefit curve exceeds the “Treat All” and “Treat None” strategies. The shaded region indicates the clinically relevant threshold range. Based on methodology described by Vickers et al., the DCA demonstrates the range of threshold probabilities over which using the prediction model results in superior decision-making compared to default strategies.

Supplemental Figure 3. Clinical Impact Curve. Clinical impact curve illustrating the practical implications of the prediction model when applied to a hypothetical population of 1000 patients. The red curve (“Number High Risk”) shows how many patients would be classified as high risk at each threshold probability. The blue curve (“Number High Risk with Event”) shows how many of those high-risk patients actually experienced the outcome (true positives). The gap between these curves represents false positives at each threshold. This visualization helps clinicians understand the trade-off between identifying true high-risk patients and over-classifying low-risk patients at different decision thresholds.

Supplemental Table Legends

Supplemental Table 1. Prediction Model Performance Characteristics. Summary of model discrimination and calibration metrics. Discrimination is assessed by the C-statistic (area under the ROC curve) with 95% confidence interval calculated via 1000 bootstrap resamples. Additional metrics include sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and the Brier score. Calibration is assessed by the calibration slope, calibration-in-the-large, and Hosmer-Lemeshow goodness-of-fit test.

Supplemental Table 2. Stratified C-Statistics by Patient Subgroups. Model discrimination (C-statistic) assessed within clinically relevant patient subgroups. Subgroup analyses evaluate whether the model maintains adequate discriminative ability across different patient populations, including stratification by age groups, body mass index categories, comorbidity status, and other clinically important characteristics. Confidence intervals are calculated using bootstrap resampling within each subgroup.


Document generated: 2026-03-25 01:29:28.365123

References

1.
Collins GS, Moons KGM, Dhiman P, et al. TRIPOD+AI statement: Updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024;385:e078378. doi:10.1136/bmj-2023-078378
2.
Jelovsek JE, Chagin K, Lukacz ES, et al. Models for predicting recurrence, complications, and health status in women after pelvic organ prolapse surgery. Obstetrics and Gynecology. 2018;132(2):298-309. doi:10.1097/AOG.0000000000002739
3.
Jelovsek JE, Chagin K, Brubaker L, et al. A model for predicting the risk of de novo stress urinary incontinence in women undergoing pelvic organ prolapse surgery. Obstetrics and Gynecology. 2014;123(2 Pt 1):279-287. doi:10.1097/AOG.0000000000000094
4.
Moons KG, Altman DG, Reitsma JB, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): Explanation and elaboration. Annals of Internal Medicine. 2015;162(1):W1-W73. doi:10.7326/M14-0698
5.
Elm E von, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP. The strengthening the reporting of observational studies in epidemiology (STROBE) statement: Guidelines for reporting observational studies. Annals of Internal Medicine. 2007;147(8):573-577. doi:10.7326/0003-4819-147-8-200710160-00010
6.
Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. BMJ. 2015;350:g7594. doi:10.1136/bmj.g7594
7.
Harrell FE. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. 2nd ed. Springer; 2015. doi:10.1007/978-3-319-19425-7
8.
Steyerberg EW, Vergouwe Y. Towards better clinical prediction models: Seven steps for development and an ABCD for validation. European Heart Journal. 2014;35(29):1925-1931. doi:10.1093/eurheartj/ehu207
9.
Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological). 1996;58(1):267-288. doi:10.1111/j.2517-6161.1996.tb02080.x
10.
Riley RD, Ensor J, Snell KI, et al. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020;368:m441. doi:10.1136/bmj.m441
11.
Riley RD, Ensor J, Snell KIE, et al. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020;368:m441. doi:10.1136/bmj.m441
12.
Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology. 1996;49(12):1373-1379. doi:10.1016/S0895-4356(96)00236-3
13.
Hooton TM, Bradley SF, Cardenas DD, et al. Diagnosis, prevention, and treatment of catheter-associated urinary tract infection in adults: 2009 international clinical practice guidelines from the Infectious Diseases Society of America. Clinical Infectious Diseases. 2010;50(5):625-663. doi:10.1086/650482
14.
Hosmer DW, Lemeshow S, Sturdivant RX. Applied logistic regression. Wiley Series in Probability and Statistics. Published online 2013. doi:10.1002/9781118548387
15.
Steyerberg EW, Harrell FE. Prediction models need appropriate internal, internal-external, and external validation. Journal of Clinical Epidemiology. 2016;69:245-247. doi:10.1016/j.jclinepi.2015.04.005
16.
Van Calster B, McLernon DJ, Smeden M van, Kurber L, Steyerberg EW. Calibration: The Achilles heel of predictive analytics. BMC Medicine. 2019;17:230. doi:10.1186/s12916-019-1466-7
17.
Youden WJ. Index for rating diagnostic tests. Cancer. 1950;3(1):32-35. doi:10.1002/1097-0142(1950)3:1<32::AID-CNCR2820030106>3.0.CO;2-3
18.
Wilson EB. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association. 1927;22(158):209-212. doi:10.1080/01621459.1927.10502953
19.
Vickers AJ, Van Calster B, Steyerberg EW. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ. 2016;352:i6. doi:10.1136/bmj.i6
20.
Vickers AJ, Elkin EB. Decision curve analysis: A novel method for evaluating prediction models. Medical Decision Making. 2006;26(6):565-574. doi:10.1177/0272989X06295361
21.
Kerr KF, Brown MD, Zhu K, Janes H. Assessing the clinical impact of risk prediction models with decision curves: Guidance for correct interpretation and appropriate use. Journal of Clinical Oncology. 2016;34(21):2534-2540. doi:10.1200/JCO.2015.65.5654
22.
Pencina MJ, D’Agostino RB, D’Agostino RB, Vasan RS. Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Statistics in Medicine. 2008;27(2):157-172. doi:10.1002/sim.2929
23.
Foxman B. Urinary tract infection syndromes: Occurrence, recurrence, bacteriology, risk factors, and disease burden. Infectious Disease Clinics of North America. 2014;28(1):1-13. doi:10.1016/j.idc.2013.09.003
24.
Scholes D, Hooton TM, Roberts PL, Stapleton AE, Gupta K, Stamm WE. Risk factors for recurrent urinary tract infection in young women. Journal of Infectious Diseases. 2000;182(4):1177-1182. doi:10.1086/315827
25.
Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: From utopia to empirical data. Journal of Clinical Epidemiology. 2016;74:167-176. doi:10.1016/j.jclinepi.2015.12.005
26.
Nager CW, Brubaker L, Litman HJ, et al. A randomized trial of urodynamic testing before stress-incontinence surgery. New England Journal of Medicine. 2012;366(21):1987-1997. doi:10.1056/NEJMoa1113595
27.
Vergouwe Y, Nieboer D, Oostenbrink R, et al. A closed testing procedure to select an appropriate method for updating prediction models. Statistics in Medicine. 2017;36(28):4529-4539. doi:10.1002/sim.7179
28.
Wolff RF, Moons KGM, Riley RD, et al. PROBAST: A tool to assess the risk of bias and applicability of prediction model studies. Annals of Internal Medicine. 2019;170(1):51-58. doi:10.7326/M18-1376