Document Overview

This analysis develops and validates a clinical prediction model for urodynamic procedure cancellation due to urinary tract infection (UTI). The document follows a structured workflow aligned with the TRIPOD (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) guidelines.

Analysis Roadmap

Part Section Description
1 Setup & Configuration Package loading, seeds, parameters
2 Data Preparation Loading, cleaning, TRIPOD flow
3 Exploratory Analysis Descriptive statistics, Table 1
4 Feature Selection Least Absolute Shrinkage and Selection Operator (LASSO) variable selection
5 Model Development Logistic regression with Restricted Cubic Splines (RCS)
6 Internal Validation Bootstrap, full-process validation
7 Calibration & Shrinkage Optimism correction, coefficient adjustment
8 Clinical Utility Nomogram, decision curve analysis
9 Results Tables, figures, interpretation
10 Technical Appendix Methods detail, reproducibility

Key Findings Summary

Quick Reference (values populated after analysis runs):

  • Sample Size: 841 patients from a single academic urogynecology practice
  • Event Rate: Approximately 7 out of 100 procedures (~6.8%) were cancelled due to urinary tract infection
  • Model Accuracy: The model correctly identifies which patients are more likely to cancel about 72% of the time (C-statistic = 0.72, adjusted for overfitting)
  • What the model needs: Patient age, body mass index, history of recurrent urinary tract infections, and overactive bladder diagnosis
  • Overfitting correction: Predictions have been adjusted (shrinkage factor 0.603) to ensure they remain accurate when applied to new patients

Clinical Context

Understanding these patterns could help clinics implement targeted interventions such as additional reminders for high-risk patients, better education about the importance of urodynamic testing, or addressing specific concerns that commonly lead to cancellations, ultimately improving resource utilization and patient care in urogynecology practices.

1 PART 1: Setup & Configuration

This section loads all required packages, sets random seeds for reproducibility, and defines key parameters used throughout the analysis.

2 PART 2: Data Preparation

This section handles data loading, cleaning, and transformation. All patient counts are tracked for the TRIPOD flow diagram to ensure transparent reporting of cohort selection.

2.1 Data Loading

3 Summary of Data Preparation Code

The data prep pipeline performs these key steps (all tracked for TRIPOD reporting):

  1. Load Raw Data from REDCap Export
    • Reads the labeled CSV file directly: 230119Urodynamics_DATA_LABELS_2025-09-15_1126 w labels.csv
    • Loads PFDI-20 totals separately from pfdi_20_totals.csv
    • Initial record count captured for TRIPOD: 912
  2. Exclude Records with Missing Outcome
    • Records with NA in “Was the procedure cancelled?” are excluded
    • Excluded: 12 records
  3. Exclude Non-UTI Cancellations
    • Patient no-shows excluded: 49 records
    • Other cancellation reasons excluded: 10 records
    • These are not relevant to the UTI prediction model
  4. Convert Categorical Variables to Factors
    • All categorical predictors converted to factors with appropriate reference levels
    • Race: “Other” as reference
    • Tobacco: “Non-tobacco user” as reference
    • Menopause: “Pre-menopausal” as reference
  5. Remove Low-Information Columns
    • Zero-variance columns removed via caret::nearZeroVar
    • Columns with >50% missing values removed
  6. Handle Missing Values
    • Missing values in demographic/clinical variables are retained (complete case analysis)
    • NO cross-patient imputation (downfill would fabricate data)
    • Median imputation for PFDI-20 subscale scores only
      • LIMITATION: Single imputation understates uncertainty and assumes MCAR
      • If PFDI selected by LASSO, consider sensitivity analysis with multiple imputation
  7. Process Date Variable
    • Extract Year from procedure date
    • Fix Year 2034 → 2024 data entry error
  8. Join PFDI-20 Totals
    • POPDI-6, CRADI-8, UDI-6, and PFDI-20 total scores joined by Record ID
  9. Feature Selection (Later in Pipeline)
    • LASSO (Least Absolute Shrinkage and Selection Operator) regularization automatically selects the most predictive variables by shrinking less important coefficients to exactly zero, avoiding overfitting and the multiple testing problems of stepwise selection.
    • Complete case analysis performed before LASSO (drop_na())

Final output: Clean dataset with 29 variables and 841 observations for predicting urodynamics procedure cancellation due to UTI.

4 Data Cleaning

Data Preview (First 20 of 841 Observations)

Age:

BMI

Race:

Is the patient hispanic, latino or of Spanish origin?

Does the patient have diabetes?

Does the patient have a h/o recurrent UTIs?

Immunocompromised:

Tobacco use:

Menopause status:

Is the patient on vaginal estrogen?

Does the patient have OAB?

Average number of voids at night:

POP-Q stage:

Was the procedure cancelled?

What was the reason for the urodynamics?

x1st_desire

Year

POPDI_6

CRADI_8

UDI_6

PFDI_20_total

urodynamics_reason

49

34.18

Other

Yes

No

No

No

Non-tobacco user

Pre-menopausal

No

No

1.0

Yes

Checked

201

2022

4.166667

6.250000

50.000000

100

Unknown

78

31.28

White

No

No

No

No

Former tobacco user

Post-menopausal

No

5.0

Yes

Checked

241

2022

79.166667

53.125000

91.666667

224

Unknown

63

27.17

White

No

Yes

No

No

Former tobacco user

Post-menopausal

No

Yes

1.0

Yes

Checked

111

2022

16.666667

50.000000

50.000000

117

Unknown

62

18.96

White

No

No

No

No

Non-tobacco user

Post-menopausal

No

Yes

3.0

0

Yes

Unchecked

151

2022

0.000000

7.142857

20.833333

28

pre op for sling

70

37.12

White

No

No

Yes

Yes

Current tobacco user

Post-menopausal

No

Yes

1.5

0

No

Unchecked

116

2022

8.333333

0.000000

91.666667

100

evaluation of oab

59

29.87

White

No

No

No

No

Non-tobacco user

Post-menopausal

No

Yes

4.0

3

No

Checked

141

2022

60.000000

56.250000

81.250000

198

pre op for sling

53

30.27

Other

No

Yes

No

Non-tobacco user

Post-menopausal

No

No

4.5

1

No

Unchecked

135

2022

79.166667

78.125000

100.000000

257

pre op for sling

37

25.79

Other

No

No

No

No

Non-tobacco user

Pre-menopausal

No

No

1.0

1

No

Unchecked

172

2022

0.000000

6.250000

8.333333

15

pre op for sling

86

34.01

White

No

No

No

No

Non-tobacco user

Post-menopausal

Yes

No

3.5

3

No

Checked

230

2022

29.166667

18.750000

50.000000

100

pre op for sling

67

25.86

White

No

No

No

No

Former tobacco user

Post-menopausal

No

Yes

2.0

0

No

Unchecked

138

2022

0.000000

50.000000

62.500000

112

evaluation of oab

61

30.90

Other

Yes

Yes

No

No

Non-tobacco user

Post-menopausal

No

Yes

4.0

1

No

Unchecked

115

2022

29.166667

18.750000

50.000000

100

pre op for sling

68

23.00

White

No

No

No

No

Non-tobacco user

Post-menopausal

No

Yes

No

Unchecked

114

2022

29.166667

18.750000

50.000000

100

pre op for sling

67

20.25

White

No

No

No

No

Non-tobacco user

Post-menopausal

Yes

No

1.0

3

No

Checked

151

2022

29.166667

6.250000

50.000000

85

Unknown

74

27.34

Other

No

No

No

Yes

Non-tobacco user

Post-menopausal

No

No

1.0

4

No

Checked

61

2022

33.333333

0.000000

37.500000

71

Unknown

51

36.07

Black or African American

No

Yes

No

No

Non-tobacco user

Post-menopausal

No

Yes

5.5

0

No

Unchecked

12

2022

37.500000

6.250000

41.666667

85

evaluation of oab

93

23.46

White

No

No

No

No

Non-tobacco user

Post-menopausal

No

Yes

0.0

Yes

Unchecked

77

2022

29.166667

18.750000

50.000000

100

other

69

28.34

White

No

No

No

No

Former tobacco user

Post-menopausal

No

Yes

2

No

Checked

73

2022

29.166667

18.750000

50.000000

100

evaluation of oab

47

35.73

White

No

No

No

No

Non-tobacco user

Pre-menopausal

No

Yes

0.0

0

No

Unchecked

63

2022

16.666667

18.750000

58.333333

94

pre op for sling

43

23.00

White

No

No

No

No

Non-tobacco user

Pre-menopausal

No

Yes

1.0

0

No

Unchecked

100

2022

0.000000

0.000000

62.500000

62

evaluation of mixed incontinence

75

24.10

Other

No

No

No

Non-tobacco user

Post-menopausal

Yes

No

1.5

4

No

Checked

39

2022

37.500000

6.250000

20.000000

64

Unknown

5 Are there Enough Events Per Predictor?

# Display EPV check results (function already printed to console in previous chunk)
message(paste("EPV Analysis Complete: Ratio =", epv_ratio,
              ifelse(epv_adequate, "(Adequate)", "(Below recommended)")))
EPV Analysis Complete: Ratio = 2.7 (Below recommended)

5.0.1 Interpretation of Events Per Predictor Analysis

The Events Per Predictor Variable (EPV) rule is a guideline in logistic regression that recommends having at least 10 outcome events for each predictor variable in the model to ensure stable coefficient estimates and reliable inference.

In our dataset:

  • Total observations: 841 patients scheduled for urodynamic testing
  • Events (procedure cancellations): 57 patients had procedures cancelled due to urinary tract infection (UTI)
  • Non-events (completed procedures): 784 patients completed their scheduled procedures
  • Number of candidate predictors: 21 variables
  • Minimum events required: 210 (based on 10 × 21 predictors)
  • Current EPV ratio: 2.7 events per predictor variable

Assessment: With only 57 events and 21 candidate predictors, the EPV ratio (2.7) is below the recommended minimum of 10. This limitation will be addressed through LASSO regularization, which automatically selects a parsimonious subset of predictors, thereby improving the effective EPV ratio in the final model.

The LASSO variable selection process (described in the Feature Selection section below) will identify the most predictive variables, reducing the number of predictors in the final model and substantially improving the EPV ratio. The final model metrics are reported in the Results section.

5.1 TRIPOD Flow Diagram

The following flow diagram follows the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) guidelines for reporting patient flow in prediction model studies.

# Create TRIPOD-compliant flow diagram using ggplot2
library(ggplot2)

# Define box positions and sizes
box_width <- 3.5
box_height <- 0.85  # Slightly taller for better text fit

# Define a premium color palette
colors <- list(
  source = "#D6EAF8",      # Light Blue
  initial = "#EBDEF0",     # Light Purple
  excluded = "#F9E79F",    # Soft Yellow/Orange (Warning/Exclusion)
  imputed = "#D5F5E3",     # Soft Green (Imputation/Preservation)
  final = "#ABEBC6",       # Green (Success)
  outcome_yes = "#F5B7B1", # Light Red (Event)
  outcome_no = "#AED6F1"   # Light Blue (No Event)
)

# Create data for boxes using PROGRAMMATIC values from tripod_labels
boxes <- data.frame(
  id = c("source", "initial", "excl_missing", "excl2", "excl3", "after_excl", "imputed_pred", "final", "outcome_yes", "outcome_no"),
  x = c(5, 5, 9, 9, 9, 5, 9, 5, 3, 7.2), # Adjusted final classification x-positions
  y = c(10, 8.5, 8.5, 7, 5.5, 6.5, 6.5, 5, 3, 3),
  label = c(
    tripod_labels$source,
    tripod_labels$initial,
    tripod_labels$excl_missing,     # Missing outcomes
    tripod_labels$excl_noshow,
    tripod_labels$excl_other,
    tripod_labels$after_exclusions,
    tripod_labels$imputed,          # Imputation
    tripod_labels$final,
    tripod_labels$cancelled,        # Outcome Yes
    tripod_labels$completed         # Outcome No
  ),
  fill = c(colors$source, colors$initial, colors$excluded, colors$excluded, colors$excluded,
           colors$initial, colors$imputed, colors$final, colors$outcome_yes, colors$outcome_no)
)

# Define connections (arrows) programmatically
# Types: "main" (vertical flow), "exclusion" (horizontal out), "imputation" (horizontal side)
connections <- data.frame(
  x = c(5, 5, 5, 5, 5, 5, 5, 6.76, 6.76, 6.76, 6.76),
  y = c(9.57, 8.07, 6.07, 4.57, 4.57, 8.5, 7, 8.5, 7, 5.5, 6.5),
  xend = c(5, 5, 5, 3, 7.2, 5, 5, 7.24, 7.24, 7.24, 7.24),
  yend = c(8.93, 6.93, 5.43, 3.43, 3.43, 8.1, 6.1, 8.5, 7, 5.5, 6.5),
  type = c("main", "main", "main", "branch", "branch", "segment", "segment", "exclusion", "exclusion", "exclusion", "imputation")
)
# Note: segments (main vertical parts) and exclusions are fine-tuned based on box geometry

# Refined connection segments for precision
segments <- data.frame(
  x = c(5, 5, 5, 5, 5, 6.76, 6.76, 6.76, 6.76),
  y = c(9.57, 8.07, 6.07, 4.57, 4.57, 8.5, 7, 5.5, 6.5),
  xend = c(5, 5, 5, 3, 7.2, 7.24, 7.24, 7.24, 7.24),
  yend = c(8.93, 6.93, 5.43, 3.43, 3.43, 8.5, 7, 5.5, 6.5),
  linetype = c("solid", "solid", "solid", "solid", "solid", "dashed", "dashed", "dashed", "dotted"),
  size = c(1, 1, 1, 1, 1, 0.8, 0.8, 0.8, 0.8)
)

# Create the plot
tripod_plot <- ggplot() +
  # Draw boxes
  geom_rect(data = boxes,
            aes(xmin = x - box_width/2, xmax = x + box_width/2,
                ymin = y - box_height/2, ymax = y + box_height/2,
                fill = fill),
            color = "#2C3E50", linewidth = 0.8) +
  # Add labels
  geom_text(data = boxes,
            aes(x = x, y = y, label = label),
            size = 3.2, fontface = "bold", lineheight = 1.0, color = "#2C3E50") +
  # Draw arrows
  geom_segment(data = segments,
               aes(x = x, y = y, xend = xend, yend = yend, linetype = linetype, linewidth = size),
               arrow = arrow(length = unit(0.25, "cm"), type = "closed"),
               color = "#2C3E50") +
  # Use identity for fill/linetype/size
  scale_fill_identity() +
  scale_linetype_identity() +
  scale_linewidth_identity() +
  # Theme and labels
  labs(title = "TRIPOD Flow Diagram",
       subtitle = "Patient Selection and Model Development Cohort") +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", size = 18, hjust = 0.5, color = "#1A5276"),
    plot.subtitle = element_text(size = 13, hjust = 0.5, color = "#5D6D7E", margin = margin(b = 20)),
    plot.background = element_rect(fill = "white", color = NA),
    plot.margin = margin(30, 30, 30, 30)
  ) +
  coord_cartesian(xlim = c(0.5, 12), ylim = c(2, 10.5))

print(tripod_plot)
Figure 1. TRIPOD Flow Diagram

Figure 1. TRIPOD Flow Diagram

Flow Diagram Summary:

  • Data Source: REDCap urodynamics database (study ID: 230119)
  • Initial records: 912 patients in database
  • Exclusions:
    • Missing outcome status: 12 patients (excluded, not imputed)
    • No-show (did not attend): 49 patients
    • Cancellation for other reasons (not UTI): 10 patients
    • Total excluded: 71 patients
  • Final analysis cohort: 841 patients
    • Completed procedures: 784 (93.2%)
    • Cancelled due to UTI: 57 (6.8%)
  • Missing predictor handling: Multivariate Imputation by Chained Equations (MICE) using predictive mean matching (PMM) for all predictors; ensures maximum preservation of cohort size (N=841) as recommended by TRIPOD.

5.1.1 Detailed Cohort Attrition Dashboard

Detailed tracking of patient inclusion and exclusion.
Cohort Selection Step N Affected / Remaining % of Initial Cumulative Retention
  1. Initial Database Import
912 100.0 912
  1. Exclusion: Missing Outcome Status
12 1.3 900
  1. Exclusion: Patient No-Show
49 5.4 851
  1. Exclusion: Other Cancellation Reasons
10 1.1 841
  1. Final Analysis Cohort
841 92.2 841

6 Exploratory Analysis

This section provides descriptive statistics and visualizations to understand the study population. Table 1 summarizes patient demographics stratified by procedure cancellation status.

6.1 Comprehensive Data Description

The Hmisc::describe() function provides a comprehensive summary of the dataset including distributions, missing values, unique values, and quantiles for each variable.

Interpretation of Hmisc::describe() Output:

  • n: Number of non-missing observations
  • missing: Count and percentage of missing values
  • distinct: Number of unique values
  • Info: Information content (1 = maximum, lower = more ties)
  • Mean/Gmd: Mean and Gini mean difference (robust dispersion measure)
  • Quantiles: Distribution percentiles (.05, .10, .25, .50, .75, .90, .95)
  • lowest/highest: Extreme values to check for outliers/errors
# =============================================================================
# VISUAL SUMMARY OF KEY PREDICTORS
# Uses Hmisc plotting capabilities for comprehensive visualization
# =============================================================================

# Create visual summary if continuous predictors exist (using labels_df which is available)
continuous_cols <- names(labels_df)[sapply(labels_df, is.numeric)]
# Remove record_id and related identifiers from plots
continuous_cols <- setdiff(continuous_cols, c("record_id", "Record.ID", "Record_ID", "recordid"))
if (length(continuous_cols) > 0) {

  log_info(sprintf("Creating visual summaries for %d continuous variables", length(continuous_cols)))

  # Plot histogram/density for each continuous predictor
  par(mfrow = c(2, min(3, ceiling(length(continuous_cols)/2))), mar = c(4, 4, 3, 1))

  for (var in head(continuous_cols, 6)) {  # Limit to 6 for readability
    if (var %in% names(labels_df)) {
      var_data <- labels_df[[var]]
      if (!all(is.na(var_data))) {
        hist(var_data,
             main = var,
             xlab = var,
             col = adjustcolor("#3498db", alpha.f = 0.6),
             border = "white",
             breaks = 20)

        # Add rug for data density
        if (length(var_data[!is.na(var_data)]) > 0) {
          Hmisc::scat1d(var_data[!is.na(var_data)],
                        col = adjustcolor("#2c3e50", alpha.f = 0.5))
        }
      }
    }
  }

  par(mfrow = c(1, 1))  # Reset
}
Distributions of key continuous and categorical predictors in the study cohort.

Distributions of key continuous and categorical predictors in the study cohort.

6.2 Correlation Matrix of Continuous Predictors

Correlation matrix of all continuous predictors. Color intensity and ellipse shape indicate correlation strength.

Correlation matrix of all continuous predictors. Color intensity and ellipse shape indicate correlation strength.

Correlation matrix includes 8 continuous variables.

6.3 Table 1: Patient Characteristics

Looking at the demographics table comparing patients whose procedures were cancelled versus completed, there are several notable findings:

6.3.1 Statistically Significant Differences

  1. Age (p = < 0.001):
    • Patients with cancelled procedures had higher age (median 76 vs 64 years)
    • IQR for cancelled: 66-79 vs completed: 52-73
    • This suggests older age is associated with procedure cancellation
  2. History of recurrent UTIs (p = 0.002):
    • Higher proportion in the cancelled group (21.1% vs 8%)
    • This indicates recurrent UTIs may increase risk of procedure cancellation
  3. POP-Q stage (p = 0.019):
    • Different distribution of prolapse stages between groups
    • Stage 0 (no prolapse): 58.5% (cancelled) vs 34.7% (completed)
    • Stage 2 (moderate): 9.8% (cancelled) vs 29.7% (completed)
    • Stage 3-4 (advanced): 22% (cancelled) vs 24.1% (completed)
    • This suggests pelvic organ prolapse stage distribution differs between groups
  4. Menopause status (p = 0.015):
    • Higher proportion in the cancelled group (91.2% vs 74%)
    • This indicates post-menopausal status is associated with procedure cancellation

6.3.2 Variables Not Reaching Statistical Significance (P = .05 to .10)

  1. BMI (p = 0.090):
    • Patients with cancelled procedures had higher bmi (median 31 vs 28.5 kg/m²)
    • IQR for cancelled: 25.6-35.2 vs completed: 24.8-33.5
    • This suggests higher body mass index may be associated with procedure cancellation
  2. Diabetes (p = 0.053):
    • Higher proportion in the cancelled group (24.6% vs 14%)
    • This indicates diabetes may be a risk factor for procedure cancellation

Why are variables with P = .05 to .10 retained in the analysis?

These “trending” variables are kept for several important reasons:

  1. LASSO handles variable selection: Unlike traditional stepwise regression that uses p-values for variable selection, LASSO uses cross-validation to select variables based on their contribution to predictive accuracy. A variable that does not reach statistical significance in univariate analysis may still improve model prediction when combined with other variables.

  2. P-values are not selection criteria for LASSO: The p-values in Table 1 describe univariate associations (each variable tested alone). LASSO considers multivariate relationships and can identify variables that are predictive in combination, even if individually they show weaker associations.

  3. Clinical plausibility: Body mass index and recurrent urinary tract infections have biological rationale for affecting urinary tract infection risk and procedure cancellation. Excluding clinically relevant variables prematurely could result in an underspecified model.

  4. Small sample size considerations: With only 57 events (cancellations), statistical power is limited. P-values near but above 0.05 may suggest meaningful effects that did not reach significance due to sample size constraints rather than true absence of association.

  5. Let the data decide: By including these variables as candidates, we allow the LASSO algorithm to make an objective, data-driven decision about their inclusion based on cross-validated prediction error—not arbitrary p-value cutoffs.

6.3.3 No Significant Differences

The following factors showed no significant association with procedure cancellation (P ≥ 0.10): - Hispanic/Latino - Race - Immunocompromised status - Tobacco use - Vaginal estrogen use - OAB status - Nighttime voiding frequency - Year - PFDI-20 total - POPDI-6 - CRADI-8

6.3.4 Clinical Interpretation

The data suggests that Age, Recurrent UTIs, POP-Q stage, Menopause status are statistically significant predictors of procedure cancellation (P < 0.05). Additionally, BMI, Diabetes, UDI-6 showed differences between groups but did not reach the prespecified significance threshold (0.05 ≤ P < 0.10); however, these variables may warrant clinical consideration as potential risk factors.

The sample size for cancellations is relatively small (N=57 vs N=784), which may limit statistical power to detect some associations. The significant age difference suggests pre-procedure assessment might need to be more comprehensive for elderly patients to reduce cancellation rates.

Table 1. Demographic and Clinical Characteristics Stratified by Procedure Cancellation Status

Demographics Stratified by Procedure Cancellation Status
Completed (N=784) Cancelled (N=57) Total (N=841) p value
Age (years) < 0.01 (1)
- Median 64.0 76.0 65.0
- Q1, Q3 52.0, 73.0 66.0, 79.0 53.0, 73.0
Body Mass Index (kg/m²) 0.09 (1)
- Median 28.5 31.0 28.6
- Q1, Q3 24.8, 33.5 25.6, 35.2 24.9, 33.6
- N Missing 11 0 11
Race: 0.66 (2)
- American Indian or Alaska Native 1 (0.1%) 0 (0.0%) 1 (0.1%)
- Asian 12 (1.5%) 0 (0.0%) 12 (1.4%)
- Black or African American 36 (4.6%) 1 (1.8%) 37 (4.4%)
- White 625 (80.3%) 46 (80.7%) 671 (80.4%)
- Other 99 (12.7%) 9 (15.8%) 108 (12.9%)
- Native Hawaiian or Other Pacific Islander 5 (0.6%) 1 (1.8%) 6 (0.7%)
- N Missing 6 0 6
Is the patient hispanic, latino or of Spanish origin? 0.32 (2)
- No 686 (88.6%) 48 (84.2%) 734 (88.3%)
- Yes 88 (11.4%) 9 (15.8%) 97 (11.7%)
- N Missing 10 0 10
Does the patient have diabetes? 0.03 (2)
- Yes 110 (14.2%) 14 (24.6%) 124 (14.9%)
- No 666 (85.8%) 43 (75.4%) 709 (85.1%)
- N Missing 8 0 8
History of Recurrent Urinary Tract Infections < 0.01 (2)
- No 711 (91.9%) 45 (78.9%) 756 (91.0%)
- Yes 63 (8.1%) 12 (21.1%) 75 (9.0%)
- N Missing 10 0 10
Immunocompromised: 0.70 (2)
- Yes 35 (4.6%) 2 (3.5%) 37 (4.5%)
- No 725 (95.4%) 55 (96.5%) 780 (95.5%)
- N Missing 24 0 24
Tobacco use: 0.17 (2)
- Non-tobacco user 494 (63.8%) 33 (57.9%) 527 (63.4%)
- Current tobacco user 45 (5.8%) 1 (1.8%) 46 (5.5%)
- Former tobacco user 235 (30.4%) 23 (40.4%) 258 (31.0%)
- N Missing 10 0 10
Menopause status: 0.02 (2)
- Pre-menopausal 167 (21.3%) 4 (7.0%) 171 (20.4%)
- Post-menopausal 580 (74.1%) 52 (91.2%) 632 (75.2%)
- Unclear 36 (4.6%) 1 (1.8%) 37 (4.4%)
- N Missing 1 0 1
Is the patient on vaginal estrogen? 0.23 (2)
- Yes 117 (15.1%) 12 (21.1%) 129 (15.5%)
- No 660 (84.9%) 45 (78.9%) 705 (84.5%)
- N Missing 7 0 7
Does the patient have Overactive Bladder? 0.08 (2)
- Yes 389 (50.3%) 35 (62.5%) 424 (51.1%)
- No 385 (49.7%) 21 (37.5%) 406 (48.9%)
- N Missing 10 1 11
Average number of voids at night: 0.33 (1)
- Median 2.0 2.0 2.0
- Q1, Q3 1.0, 3.0 1.0, 3.0 1.0, 3.0
- N Missing 68 5 73
Pelvic Organ Prolapse Quantification Stage 0.02 (1)
- Median 2.0 0.0 2.0
- Q1, Q3 0.0, 2.0 0.0, 2.0 0.0, 2.0
- N Missing 103 16 119
Pelvic Floor Distress Inventory-20 Total Score 0.39 (1)
- Median 100.0 100.0 100.0
- Q1, Q3 95.8, 100.2 100.0, 109.0 98.0, 101.0
Pelvic Organ Prolapse Distress Inventory-6 0.61 (1)
- Median 29.2 29.2 29.2
- Q1, Q3 25.0, 29.2 29.2, 33.3 25.0, 29.2
Colorectal-Anal Distress Inventory-8 0.74 (1)
- Median 18.8 18.8 18.8
- Q1, Q3 18.8, 21.9 18.8, 25.0 18.8, 21.9
Urinary Distress Inventory-6 0.17 (1)
- Median 50.0 50.0 50.0
- Q1, Q3 45.8, 50.0 50.0, 54.2 50.0, 50.0
Year 0.95 (2)
- 2022 288 (36.8%) 22 (38.6%) 310 (36.9%)
- 2023 332 (42.4%) 23 (40.4%) 355 (42.3%)
- 2024 163 (20.8%) 12 (21.1%) 175 (20.8%)
- N Missing 1 0 1
What was the reason for the urodynamics? 0.38 (2)
- Checked 279 (35.6%) 17 (29.8%) 296 (35.2%)
- Unchecked 505 (64.4%) 40 (70.2%) 545 (64.8%)
  1. Linear Model ANOVA
  2. Pearson’s Chi-squared test

7 PART 4: Feature Selection

This section uses LASSO (Least Absolute Shrinkage and Selection Operator) for objective, data-driven variable selection. Unlike traditional p-value based selection, LASSO uses regularization to automatically identify the most predictive variables while guarding against overfitting.

Why LASSO instead of p-value selection?

LASSO may select variables that do not have p < 0.10 in univariate analysis, and may exclude variables that do. This occurs because:

  1. Multicollinearity handling: LASSO selects one variable from correlated groups (e.g., Menopause Status was excluded because Age is preferred for clinical interpretability; they correlate at r=0.77)
  2. Predictive vs. associative: A variable can be statistically significant but add little predictive information once other variables are in the model
  3. Regularization: LASSO penalizes model complexity, favoring parsimonious models that generalize better

The variables shown in the “LASSO Model Selected Features” table below are the final predictors used in the nomogram, regardless of their univariate p-values.

7.0.1 Variable Screening

Before LASSO feature selection, variables were screened for data quality issues. Variables were excluded if they had: (1) near-zero variance, (2) >50% missing data, or (3) correlation >0.90 with another predictor.

Variables screened: 22 | Excluded (IDs/Year): 1 | Retained for LASSO: 19

Variable

Exclusion Reason

Immunocompromised.

Near-zero variance

8 Feature Selection with LASSO

LASSO (Least Absolute Shrinkage and Selection Operator) is a regularization method that automatically selects the most important predictors by shrinking less important coefficients to exactly zero. We used 10-fold cross-validation to identify the optimal regularization strength (λ) that minimizes prediction error while producing a parsimonious model.

# Create a data frame with the cross-validation results
cv_results <- data.frame(
  lambda = cv_lasso$lambda,
  mean_cross_validated_error = cv_lasso$cvm,           # Mean cross-validated error
  standard_error_of_cross_validation = cv_lasso$cvsd,         # Standard error of CV error
  upper_bound_for_cross_validation_error = cv_lasso$cvup,         # Upper bound for CV error
  lower_bound_for_cross_validation_error = cv_lasso$cvlo          # Lower bound for CV error
)

# Get the optimal lambda value first
optimal_lambda <- cv_lasso$lambda.min

# Find the row index of the optimal lambda
optimal_row <- which.min(abs(cv_results$lambda - optimal_lambda))

# Display actual cross-validation results from our data
# NOTE: head(20) is for DISPLAY ONLY - showing all lambda values would be unwieldy.
# The full CV results are used for model selection.
cv_results_display <- cv_results %>%
  head(20) %>%
  mutate(
    lambda = round(lambda, 6),
    mean_cross_validated_error = round(mean_cross_validated_error, 4),
    standard_error_of_cross_validation = round(standard_error_of_cross_validation, 4),
    upper_bound_for_cross_validation_error = round(upper_bound_for_cross_validation_error, 4),
    lower_bound_for_cross_validation_error = round(lower_bound_for_cross_validation_error, 4)
  ) %>%
  rename(
    `λ (Lambda)` = lambda,
    `CV Error` = mean_cross_validated_error,
    `Std Error` = standard_error_of_cross_validation,
    `Upper Bound` = upper_bound_for_cross_validation_error,
    `Lower Bound` = lower_bound_for_cross_validation_error
  )

# Create flextable with highlighted optimal row (if within display range)
ft <- cv_results_display %>%
  flextable() %>%
  set_caption(paste0("LASSO Cross-Validation Results from Our Urodynamics Data (Optimal λ = ",
                     round(optimal_lambda, 6), ")")) %>%
  theme_vanilla() %>%
  autofit() %>%
  bold(part = "header") %>%
  fontsize(size = 9, part = "all")

# Only highlight if optimal row is within the displayed first 20 rows
if (optimal_row <= 20) {
  ft <- ft %>%
    bg(i = optimal_row, bg = "#90EE90") %>%  # Highlight optimal lambda row in green
    bold(i = optimal_row)
}

ft <- ft %>%
  add_footer_lines(paste0("Optimal λ (lambda.min = ", round(optimal_lambda, 6),
                          ") minimizes cross-validated error.",
                          if (optimal_row > 20) " (Optimal row not shown in first 20 rows)" else " (Highlighted in green)")) %>%
  fontsize(size = 8, part = "footer") %>%
  italic(part = "footer")

ft
LASSO Cross-Validation Results from Our Urodynamics Data (Optimal λ = 0.008511)

λ (Lambda)

CV Error

Std Error

Upper Bound

Lower Bound

0.028524

0.3065

0.0350

0.3415

0.2715

0.025990

0.3067

0.0351

0.3418

0.2716

0.023681

0.3046

0.0355

0.3401

0.2690

0.021577

0.3020

0.0360

0.3380

0.2660

0.019661

0.2989

0.0359

0.3348

0.2631

0.017914

0.2958

0.0353

0.3311

0.2605

0.016323

0.2930

0.0349

0.3279

0.2581

0.014872

0.2903

0.0346

0.3248

0.2557

0.013551

0.2873

0.0342

0.3214

0.2531

0.012347

0.2848

0.0339

0.3187

0.2508

0.011250

0.2826

0.0337

0.3163

0.2488

0.010251

0.2812

0.0339

0.3151

0.2474

0.009340

0.2806

0.0342

0.3148

0.2463

0.008511

0.2803

0.0347

0.3150

0.2457

0.007755

0.2805

0.0351

0.3155

0.2454

0.007066

0.2807

0.0354

0.3162

0.2453

0.006438

0.2810

0.0357

0.3167

0.2453

0.005866

0.2811

0.0361

0.3172

0.2451

0.005345

0.2813

0.0364

0.3176

0.2449

0.004870

0.2815

0.0367

0.3183

0.2448

Optimal λ (lambda.min = 0.008511) minimizes cross-validated error. (Highlighted in green)

8.0.1 Cross-Validation Results

Cross-validation results from our urodynamics data showing optimal lambda selection

Cross-validation results from our urodynamics data showing optimal lambda selection

Interpretation of Our Cross-Validation Results:

  • The blue line shows how prediction error changes as we increase regularization (λ)
  • The shaded blue region represents uncertainty (±1 standard error)
  • The red dashed line marks our selected λ (0.008511) - the value that minimizes error
  • The green dashed line marks λ.1se (0.028524) - a more conservative choice
kable(selected_features_df,
      caption = "LASSO Model Selected Features and Coefficients",
      col.names = c("Feature", "Coefficient"),  # Match the actual column names
      align = c("l", "r"),                      # Left-align text, right-align numbers
      booktabs = TRUE,
      linesep = "",
      format = "html") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = TRUE,
    position = "center"
  ) %>%
  row_spec(0, bold = TRUE, color = "black", background = "#f2f2f2") %>%
  column_spec(1, width = "400px") %>%  # Wide column for Feature names
  column_spec(2, width = "150px")  # Set width for the coefficient column
LASSO Model Selected Features and Coefficients
Feature Coefficient
Age. 0.0366504
Is.the.patient.hispanic..latino.or.of.Spanish.origin.Yes 0.1230534
Does.the.patient.have.a.h.o.recurrent.UTIs.Yes 1.0956229
Tobacco.use.Current tobacco user -0.0263446
Is.the.patient.on.vaginal.estrogen.Yes 0.1759149
Does.the.patient.have.OAB.Yes 0.6041643
Average.number.of.voids.at.night. 0.0573524
CRADI_8 -0.0174316
urodynamics_reasonevaluation of voiding dysfunction 0.2750816

9 LASSO Regularization Path

# Create the regularization path plot with clean feature labels
ggplot(lasso_filtered, aes(x = log10(Lambda), y = Coefficient, color = Feature_Clean)) +
  geom_line(linewidth = 1.2) +
  geom_vline(xintercept = log10(optimal_lambda), linetype = "dashed", color = "red", linewidth = 1.2) +
  labs(
    title = "LASSO Regularization Path",
    subtitle = "Red line indicates optimal lambda from cross-validation",
    x = expression(log[10](lambda)),
    y = "Coefficient Values",
    color = "Predictor"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "right",
    legend.text = element_text(size = 11),
    legend.title = element_text(face = "bold", size = 12),
    axis.title = element_text(face = "bold"),
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12, color = "gray40")
  ) +
  scale_color_viridis_d(option = "turbo")
Figure: LASSO Regularization Path showing how coefficient values change with lambda

Figure: LASSO Regularization Path showing how coefficient values change with lambda

9.1 Interactive LASSO Regularization Path Visualization

# Check if required objects exist
if (!exists("lasso_filtered") || !exists("cv_lasso") || !exists("optimal_lambda")) {
  cat("Required LASSO objects not available - skipping interactive plot\n")
} else if (!requireNamespace("plotly", quietly = TRUE)) {
  cat("plotly package not available - showing static plot\n")
  print(ggplot(lasso_filtered, aes(x = log_lambda, y = Coefficient, color = Feature_Clean)) +
    geom_line(linewidth = 0.8) +
    geom_vline(xintercept = log10(optimal_lambda), linetype = "dashed", color = "red") +
    labs(title = "LASSO Regularization Path", x = "log10(lambda)", y = "Coefficient",
         color = "Predictor") +
    theme_minimal() +
    scale_color_viridis_d(option = "turbo"))
} else {
  # Create static ggplot as base with clean feature names
  p_base <- ggplot(lasso_filtered, aes(x = log_lambda, y = Coefficient, color = Feature_Clean)) +
    geom_line(linewidth = 0.8) +
    geom_vline(xintercept = log10(optimal_lambda), linetype = "dashed",
               color = "red", linewidth = 1) +
    labs(
      title = "Interactive LASSO Regularization Path",
      subtitle = paste0("Hover to identify features | Optimal lambda = ", round(cv_lasso$lambda.min, 4)),
      x = "log10(lambda)",
      y = "Coefficient Value",
      color = "Predictor"
    ) +
    theme_minimal(base_size = 12) +
    theme(
      legend.position = "right",
      legend.text = element_text(size = 10),
      axis.title = element_text(face = "bold"),
      plot.title = element_text(face = "bold", size = 14)
    ) +
    scale_color_viridis_d(option = "turbo")

  # Try to convert to plotly, fall back to static if it fails
  result <- tryCatch({
    p_interactive <- plotly::ggplotly(p_base, tooltip = c("color", "x", "y"))
    p_interactive
  }, error = function(e) {
    message("Plotly conversion failed: ", e$message)
    message("Displaying static ggplot instead")
    p_base
  })

  print(result)
}

This interactive visualization allows you to hover over any line to see the feature name and coefficient value, click legend entries to show/hide specific features, and zoom/pan to explore specific regions.


Figure Interpretation: LASSO Regularization Path

This graph shows the LASSO regularization path for a logistic regression model, illustrating how coefficient values change as the regularization strength (lambda) varies.

Key Observations:

  1. Regularization Effect: As lambda increases (moving left to right on the x-axis), more coefficients are pushed toward zero. This demonstrates LASSO’s feature selection capability.

  2. Optimal Lambda: The vertical red line indicates the optimal lambda value (λ = 0.0085) determined through cross-validation. This represents the best balance between model complexity and predictive performance.

  3. Selected Predictors at Optimal Lambda: LASSO selected 9 feature(s) with non-zero coefficients:

LASSO-Selected Predictors at Optimal Lambda
Predictor Coefficient Effect Direction
Age 0.0367 ↑ Increases risk
Is the patient hispanic latino or of Spanish origin Yes 0.1231 ↑ Increases risk
Does the patient have a h o recurrent UTIs Yes 1.0956 ↑ Increases risk
Tobacco use Current tobacco user -0.0263 ↓ Decreases risk
Is the patient on vaginal estrogen Yes 0.1759 ↑ Increases risk
Does the patient have OAB Yes 0.6042 ↑ Increases risk
Nocturia 0.0574 ↑ Increases risk
CRADI-8 Score -0.0174 ↓ Decreases risk
urodynamics reasonevaluation of voiding dysfunction 0.2751 ↑ Increases risk
  1. Variables Eliminated: All other features have coefficients shrunk to exactly zero at the optimal lambda, indicating they contribute little predictive value after accounting for the selected variable(s).

  2. Direction of Effects:

    • Features with positive coefficients (above zero) increase the probability of procedure cancellation
    • Features with negative coefficients (below zero) decrease the probability of procedure cancellation

10 Cross-Validation Results

11 PART 5: Model Development

This section fits the logistic regression model using the LASSO-selected predictors. Continuous predictors are modeled flexibly using restricted cubic splines (RCS) per Harrell’s recommendations.

11.1 Restricted Cubic Splines

Restricted cubic splines allow continuous predictors to have non-linear relationships with the outcome without assuming a specific functional form. The number of knots (typically 3-5) controls the flexibility of the fit.

11.2 RMS Setup and Spline Specification

12 Model with Restricted Cubic Splines

Following Harrell’s Regression Modeling Strategies (Chapter 2), continuous predictors are modeled using restricted cubic splines (RCS) to capture potential nonlinear relationships. The number of knots (5) was selected based on the effective sample size (510).

# =============================================================================
# DISPLAY MODEL SUMMARY IN CLEAN FORMAT
# =============================================================================

cat("## Final Model Summary\n\n")

12.1 Final Model Summary

# Extract and display model statistics in a clean table
model_stats <- model$stats

# Calculate events correctly - model$y is a factor with levels "Completed", "Cancelled"
# Convert to numeric: Completed=1 -> 0, Cancelled=2 -> 1
model_y_numeric <- as.numeric(model$y) - 1  # 0 = Completed, 1 = Cancelled
n_events_model <- sum(model_y_numeric == 1, na.rm = TRUE)
event_rate_model <- 100 * mean(model_y_numeric == 1, na.rm = TRUE)

model_fit <- data.frame(
  Statistic = c(
    "Observations",
    "Complete Cases",
    "Events (Cancelled)",
    "Model Likelihood Ratio Chi-Square",
    "Degrees of Freedom",
    "P-value",
    "Nagelkerke R² (Explained Variation)",
    "C-statistic (Area Under ROC Curve)",
    "Somers' Dxy (Rank Correlation)",
    "Brier Score (Mean Squared Error)"
  ),
  Value = c(
    sprintf("%d", model$stats["Obs"]),
    sprintf("%d", sum(!is.na(model$y))),
    sprintf("%d (%.1f%%)", n_events_model, event_rate_model),
    sprintf("%.2f", model$stats["Model L.R."]),
    sprintf("%d", model$stats["d.f."]),
    sprintf("%.4f", model$stats["Pr(> chi2)"]),
    sprintf("%.3f", model$stats["R2"]),
    sprintf("%.3f", model$stats["C"]),
    sprintf("%.3f", model$stats["Dxy"]),
    sprintf("%.3f", model$stats["Brier"])
  ),
  stringsAsFactors = FALSE
)

kable(model_fit,
      col.names = c("Model Statistic", "Value"),
      caption = "Logistic Regression Model Performance Metrics",
      align = c("l", "r")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                font_size = 13) %>%
  pack_rows("Sample", 1, 3) %>%
  pack_rows("Overall Model Fit", 4, 6) %>%
  pack_rows("Discrimination", 7, 10)
Logistic Regression Model Performance Metrics
Model Statistic Value
Sample
Observations 738
Complete Cases 738
Events (Cancelled) 51 (6.9%)
Overall Model Fit
Model Likelihood Ratio Chi-Square 47.04
Degrees of Freedom 14
P-value NA
Discrimination
Nagelkerke R² (Explained Variation) 0.156
C-statistic (Area Under ROC Curve) 0.770
Somers’ Dxy (Rank Correlation) 0.539
Brier Score (Mean Squared Error) 0.059
cat("\n\n**Note:** The C-statistic is equivalent to the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) for binary outcomes. ")

Note: The C-statistic is equivalent to the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) for binary outcomes.

cat("A value of 0.5 indicates random prediction, while 1.0 indicates perfect discrimination.\n")

A value of 0.5 indicates random prediction, while 1.0 indicates perfect discrimination.

extract_discrimination_metrics(model_output = model, verbose = TRUE)

13 Model Performance Interpretation

The table above shows key metrics evaluating how well the model predicts procedure cancellation.

Key Metrics Explained in Plain Language:

  • C-statistic (0.77): How well does the model rank patients? If we randomly select one patient who cancelled and one who completed their procedure, the model will correctly identify which one was more likely to cancel 77% of the time. Values above 70% indicate the model is clinically useful; above 80% is excellent.

  • Brier Score (0.059): How close are the predicted probabilities to what actually happened? This measures the average squared difference between predictions and outcomes. A score of 0 means perfect predictions; lower is better. For context, always predicting the average cancellation rate would give a Brier score of about 0.063.

  • R² (0.156): How much of the variation in outcomes does the model explain? A value of 16% is typical for medical prediction models, especially for relatively rare events like procedure cancellation (~6.8% rate). This does NOT mean the model is wrong—it means individual patient behavior has inherent unpredictability.

Why we use a special threshold: Because cancellations are uncommon (57 out of 841 procedures, or 6.8%), we use an optimized probability threshold (7.0%) instead of the typical 50% cutoff. This helps balance catching true cancellations against raising too many false alarms.

13.1 Harrell Regression Modeling Strategies (rms) Visualizations

The following visualizations follow Frank Harrell’s Regression Modeling Strategies approach for presenting logistic regression model results. These provide complementary views of predictor importance and effects.

13.1.1 Analysis of Variance (ANOVA) Chi-Square Plot (Variable Importance)

This plot shows the chi-square contribution of each predictor to the model, providing an objective measure of variable importance. Larger chi-square values indicate stronger predictors.

# =============================================================================
# ANOVA VISUALIZATION - Following Harrell's RMS approach
# Reference: Harrell FE. Regression Modeling Strategies. 2nd ed. Chapter 10.
# "plot(anova(f))" is the standard way to visualize predictor importance
# =============================================================================

log_info("Generating ANOVA chi-square plot for variable importance...")

# Compute ANOVA for the model
model_anova <- anova(model)

# Clean up variable names for display
# Create a mapping to convert underscores and clean up any remaining technical names
anova_label_map <- c(
  # Clean names from renaming (underscores to spaces)
  "Recurrent_UTIs" = "Recurrent UTIs",
  "Overactive_Bladder" = "Overactive Bladder",
  "Detrusor_Overactivity" = "Detrusor Overactivity",
  "Tobacco_Use" = "Tobacco Use",
  "Stress_Incontinence" = "Stress Incontinence",
  "Pelvic_Prolapse" = "Pelvic Prolapse",
  "Voiding_Dysfunction" = "Voiding Dysfunction",
  "Prior_Pelvic_Surgery" = "Prior Pelvic Surgery",
  "POPQ_Stage" = "POP-Q Stage",
  # Keep simple names as-is
  "Age" = "Age",
  "BMI" = "BMI",
  "Nocturia" = "Nocturia",
  "Hispanic" = "Hispanic/Latino",
  "Diabetes" = "Diabetes",
  "Neurological" = "Neurological",
  "TOTAL" = "TOTAL"
)

# Rename rows in the ANOVA object for cleaner display
if (!is.null(rownames(model_anova))) {
  old_names <- rownames(model_anova)
  new_names <- sapply(old_names, function(x) {
    if (x %in% names(anova_label_map)) {
      return(anova_label_map[x])
    } else {
      # Default: clean up dots and underscores
      cleaned <- gsub("\\.", " ", x)
      cleaned <- gsub("_", " ", cleaned)
      cleaned <- trimws(cleaned)
      return(cleaned)
    }
  })
  rownames(model_anova) <- new_names
}

# Create the chi-square plot with cleaner labels
# (ANOVA table output suppressed - values shown in plot)
plot(model_anova,
     what = "chisqminusdf",
     main = "Predictor Importance: Chi-Square Minus d.f.",
     cex.main = 1.2,
     margin = c("chisq", "P"))
Figure: ANOVA Chi-Square Plot showing relative importance of each predictor

Figure: ANOVA Chi-Square Plot showing relative importance of each predictor

log_info("ANOVA plot generated successfully")

Interpretation: The plot ranks predictors by their chi-square statistic minus degrees of freedom. This “partial chi-square” approach accounts for the complexity of each term, providing a fair comparison between simple (1 d.f.) and complex (multi-d.f.) predictors.

13.1.2 Summary Odds Ratio Plot (Effect Sizes)

This forest plot shows the odds ratios with 95% confidence intervals for each predictor. For continuous variables, the odds ratio represents the effect of moving from the 25th to 75th percentile (interquartile range).

Figure: Which Factors Matter Most? (Odds Ratio Forest Plot). Each horizontal bar shows how much a predictor increases or decreases cancellation risk. Bars to the RIGHT of the vertical line (odds ratio = 1) indicate higher risk; bars to the LEFT indicate lower risk. Longer bars mean larger effects. The horizontal lines through each bar show the uncertainty—if they cross the vertical line, the effect might not be statistically significant.

Figure: Which Factors Matter Most? (Odds Ratio Forest Plot). Each horizontal bar shows how much a predictor increases or decreases cancellation risk. Bars to the RIGHT of the vertical line (odds ratio = 1) indicate higher risk; bars to the LEFT indicate lower risk. Longer bars mean larger effects. The horizontal lines through each bar show the uncertainty—if they cross the vertical line, the effect might not be statistically significant.

Interpretation: - Odds ratios > 1: Predictor is associated with increased cancellation risk - Odds ratios < 1: Predictor is associated with decreased cancellation risk - CI crossing 1: Effect not statistically significant at α = 0.05 - For continuous predictors (Age, body mass index): odds ratio represents effect of interquartile range increase

13.1.3 Partial Effects Plot (Predictor-Outcome Relationships)

This visualization shows the predicted log-odds of cancellation across the range of each predictor, holding other predictors at their reference values. It reveals both the shape of relationships (linear vs. nonlinear via restricted cubic splines) and effect magnitudes.

# =============================================================================
# PARTIAL EFFECTS PLOT - Following Harrell's RMS approach
# Reference: Harrell FE. Regression Modeling Strategies. 2nd ed. Chapter 10.
# "ggplot(Predict(f))" shows partial effects with data distribution
# =============================================================================

log_info("Generating partial effects plot...")

# Ensure rms methods are available (fixes HTML rendering namespace issue)
requireNamespace("rms", quietly = TRUE)

# Generate predictions across predictor ranges using explicit namespace
# This fixes "no applicable method for 'predict'" error during HTML render
model_predictions <- tryCatch({
  rms::Predict(model)
}, error = function(e) {
  log_warn(sprintf("Predict failed: %s. Trying alternative approach.", e$message))
  # Fallback: try with explicit datadist setting
  options(datadist = dd)
  rms::Predict(model)
})

# Create ggplot visualization with data distribution
partial_effects_plot <- ggplot(model_predictions,
                                sepdiscrete = "vertical",
                                vnames = "names",
                                rdata = selected_labels_df,
                                histSpike.opts = list(frac = function(f) 0.1 * f / max(f)))

print(partial_effects_plot)
Figure: Partial Effects Plot showing predictor-outcome relationships

Figure: Partial Effects Plot showing predictor-outcome relationships

log_info("Partial effects plot generated successfully")

Interpretation: - Solid line: Predicted effect (log-odds scale) - Shaded region: 95% confidence interval - Rug/histogram: Distribution of predictor values in the data - Horizontal reference: Effect = 0 (no change from reference) - Nonlinear curves: Reflect restricted cubic spline (RCS) modeling

# =============================================================================
# INDIVIDUAL PREDICTOR PLOTS ON PROBABILITY SCALE
# Shows effects transformed to probability for clinical interpretation
# =============================================================================

log_info("Generating individual predictor probability plots...")

# Get continuous predictors from the model
continuous_preds <- names(model$Design$parms)[sapply(model$Design$parms, function(x) !is.null(attr(x, "knots")) || is.numeric(x))]

# Plot each continuous predictor on probability scale if available
if (length(continuous_preds) > 0) {
  # Create prediction for Age if in model
  if ("Age" %in% names(model$Design$parms)) {
    age_pred <- Predict(model, Age, fun = plogis)
    p_age <- ggplot(age_pred) +
      labs(title = "Effect of Age on Cancellation Probability",
           y = "Predicted Probability") +
      theme_minimal() +
      theme(plot.title = element_text(size = 12, face = "bold"))
    print(p_age)
  }

  # Create prediction for BMI if in model
  if ("BMI" %in% names(model$Design$parms)) {
    bmi_pred <- Predict(model, BMI, fun = plogis)
    p_bmi <- ggplot(bmi_pred) +
      labs(title = "Effect of BMI on Cancellation Probability",
           y = "Predicted Probability") +
      theme_minimal() +
      theme(plot.title = element_text(size = 12, face = "bold"))
    print(p_bmi)
  }
}

log_info("Individual predictor plots generated successfully")

Clinical Interpretation of Partial Effects: These plots allow clinicians to understand: 1. Direction of effect: Whether increasing a predictor increases or decreases risk 2. Magnitude of effect: How much the probability changes across the predictor range 3. Shape of relationship: Linear trends vs. thresholds or U-shaped patterns 4. Uncertainty: Wider confidence bands indicate less precise estimates

13.2 Logistic Regression Diagnostics

Since we are using logistic regression with a binary outcome, traditional OLS diagnostics (normality of residuals, homoscedasticity) do not apply. Instead, we perform diagnostics specific to logistic regression including: linearity of the logit, multicollinearity, goodness-of-fit, and influential observation analysis.

13.2.1 Multicollinearity Assessment: Variance Inflation Factors (VIF)

Multicollinearity inflates standard errors and makes coefficient estimates unstable. We assess this using Variance Inflation Factors (VIF) and Generalized VIF (GVIF) for categorical predictors. VIF > 5 indicates moderate collinearity; VIF > 10 indicates severe problems.

# =============================================================================
# VARIANCE INFLATION FACTOR (VIF) ANALYSIS
# Assesses multicollinearity among predictors in logistic regression
# =============================================================================

# Install car package if needed (for vif function)
if (!requireNamespace("car", quietly = TRUE)) {
  log_warn("car package not available for VIF analysis")
} else {
  library(car)

  # For rms::lrm models, we need to extract the design matrix
  # and fit a standard glm for VIF calculation
  tryCatch({
    # Get the model formula and data
    model_terms <- attr(model$terms, "term.labels")

    # Create a simplified formula without RCS for VIF (RCS creates multiple terms)
    # Extract base variable names
    base_vars <- unique(gsub("rcs\\(([^,]+),.*\\)", "\\1", model_terms))
    base_vars <- gsub("'$", "", base_vars)  # Remove prime notation

    # Build formula for GLM
    outcome_var <- "Was.the.procedure.cancelled."
    if (!(outcome_var %in% names(selected_labels_df))) {
      outcome_var <- grep("cancelled", names(selected_labels_df), value = TRUE, ignore.case = TRUE)[1]
    }
    
    vif_formula <- as.formula(paste(outcome_var, "~", paste(base_vars, collapse = " + ")))

    # Fit GLM for VIF calculation
    vif_glm <- glm(vif_formula, data = selected_labels_df, family = binomial)

    # Calculate VIF
    vif_results <- car::vif(vif_glm)

    # Handle both simple VIF and GVIF (for factors)
    if (is.matrix(vif_results)) {
      # GVIF output for categorical variables
      vif_df <- data.frame(
        Variable = rownames(vif_results),
        GVIF = round(vif_results[, "GVIF"], 3),
        Df = vif_results[, "Df"],
        `GVIF^(1/(2*Df))` = round(vif_results[, "GVIF^(1/(2*Df))"], 3),
        Assessment = ifelse(vif_results[, "GVIF^(1/(2*Df))"] > 2.24, "Moderate Collinearity",
                           ifelse(vif_results[, "GVIF^(1/(2*Df))"] > 3.16, "High Collinearity", "Acceptable")),
        check.names = FALSE
      )
    } else {
      # Simple VIF output
      vif_df <- data.frame(
        Variable = names(vif_results),
        VIF = round(vif_results, 3),
        Assessment = ifelse(vif_results > 10, "Severe Collinearity",
                           ifelse(vif_results > 5, "Moderate Collinearity", "Acceptable"))
      )
    }

    # Display table
    knitr::kable(vif_df,
                 caption = "Table: Variance Inflation Factors (VIF) for Model Predictors",
                 align = c("l", rep("c", ncol(vif_df) - 1)))

    # Summary message
    max_vif <- max(vif_results, na.rm = TRUE)
    if (max_vif > 10) {
      log_warn(sprintf("High multicollinearity detected (max VIF = %.2f). Consider removing correlated predictors.", max_vif))
    } else if (max_vif > 5) {
      log_info(sprintf("Moderate multicollinearity present (max VIF = %.2f). Interpret coefficients with caution.", max_vif))
    } else {
      log_info(sprintf("No problematic multicollinearity (max VIF = %.2f). All VIF values acceptable.", max_vif))
    }

  }, error = function(e) {
    log_warn(sprintf("VIF calculation failed: %s", e$message))
    cat("VIF analysis could not be completed. See correlation matrix in Appendix for collinearity assessment.\n")
  })
}

Interpretation: VIF values below 5 indicate acceptable multicollinearity. For categorical variables with multiple levels, we report GVIF^(1/(2*Df)) which should be interpreted similarly (values < 2.24 correspond to VIF < 5).

13.2.2 Linearity of the Logit Assessment

Logistic regression assumes continuous predictors have a linear relationship with the log-odds of the outcome. We test this assumption using: 1. Component-plus-residual (partial residual) plots: Visual inspection of linearity 2. Comparison of linear vs. restricted cubic spline (RCS) models: Formal nonlinearity test

# =============================================================================
# LINEARITY OF THE LOGIT ASSESSMENT
# Tests whether continuous predictors have linear relationship with log-odds
# =============================================================================

# Identify continuous predictors in the model
continuous_vars <- c("Age", "BMI")
continuous_vars <- continuous_vars[continuous_vars %in% names(selected_labels_df)]

if (length(continuous_vars) > 0) {

  # Create panel of diagnostic plots
  par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))

  for (var in continuous_vars) {
    # Get predicted log-odds
    lp <- predict(model, type = "lp")

    # Calculate partial residuals (component + residual)
    # For logistic: residual = observed - predicted probability
    fitted_probs <- plogis(lp)
    outcome_numeric <- as.numeric(selected_labels_df$Cancelled == "Cancelled")
    pearson_resid <- (outcome_numeric - fitted_probs) / sqrt(fitted_probs * (1 - fitted_probs))

    # Get coefficient for this variable (approximate for RCS)
    var_coef <- coef(model)[grep(paste0("^", var), names(coef(model)))]
    if (length(var_coef) > 0) {
      component <- selected_labels_df[[var]] * var_coef[1]
    } else {
      component <- 0
    }
    partial_resid <- component + pearson_resid

    # Get the predictor values
    x_vals <- selected_labels_df[[var]]

    # Scatter plot with loess smoother
    plot(x_vals, partial_resid,
         xlab = var, ylab = "Partial Residual (Component + Residual)",
         main = paste("Linearity Check:", var),
         pch = 16, col = alpha("steelblue", 0.5), cex = 0.8)

    # Add loess smoother using temporary data frame for consistent naming
    temp_df <- data.frame(x = x_vals, y = partial_resid)
    loess_fit <- loess(y ~ x, data = temp_df, span = 0.75)
    x_seq <- seq(min(x_vals, na.rm = TRUE), max(x_vals, na.rm = TRUE), length.out = 100)
    loess_pred <- predict(loess_fit, newdata = data.frame(x = x_seq))
    lines(x_seq, loess_pred, col = "red", lwd = 2)

    # Add reference line
    abline(lm(partial_resid ~ x_vals), col = "blue", lwd = 1, lty = 2)
    legend("topright", c("Loess (actual)", "Linear"),
           col = c("red", "blue"), lwd = c(2, 1), lty = c(1, 2), cex = 0.8)
  }

  # Formal test: Compare linear vs RCS model using ANOVA
  cat("\n### Formal Nonlinearity Tests (ANOVA)\n\n")
  cat("Testing whether restricted cubic splines significantly improve over linear terms:\n\n")

  # Get ANOVA from existing model to test nonlinearity
  model_anova <- anova(model)

  # Convert to data frame for easier manipulation
  anova_df <- as.data.frame(model_anova)
  anova_df$Variable <- rownames(anova_df)

  # Extract nonlinearity tests (rows containing "Nonlinear")
  nonlin_idx <- grep("Nonlinear", rownames(anova_df), value = FALSE)

  if (length(nonlin_idx) > 0) {
    nonlin_tests <- anova_df[nonlin_idx, , drop = FALSE]
    print(nonlin_tests)

    cat("\n**Interpretation**: \n")
    cat("- p < 0.05 indicates significant nonlinearity (RCS justified)\n")
    cat("- p > 0.05 suggests linear term may be sufficient\n\n")

    # Check if any nonlinear terms are significant
    # P column name may vary - try common variations
    p_col <- grep("^P|p.*value", names(nonlin_tests), value = TRUE, ignore.case = TRUE)
    if (length(p_col) > 0) {
      sig_nonlin <- any(nonlin_tests[[p_col[1]]] < 0.05, na.rm = TRUE)
      if (sig_nonlin) {
        log_info("Significant nonlinearity detected - RCS specification is appropriate")
      } else {
        log_info("No significant nonlinearity - linear terms may be sufficient, but RCS adds flexibility")
      }
    }
  } else {
    cat("No explicit nonlinearity tests available from model ANOVA.\n")
    cat("Visual inspection of partial residual plots above should guide interpretation.\n")
  }

  par(mfrow = c(1, 1))
} else {
  cat("No continuous predictors in model to test for linearity of logit.\n")
}

No continuous predictors in model to test for linearity of logit.

Interpretation: The red loess curve should closely follow the blue linear reference line if the linearity assumption holds. Substantial deviation indicates the relationship is nonlinear and justifies using restricted cubic splines (RCS).

13.2.3 Influential Observation Diagnostics

We identify observations with disproportionate influence on model coefficients using Cook’s Distance, DFBETAS (Difference in Betas, measuring how much each coefficient changes when an observation is removed), and residual analysis.

# =============================================================================
# INFLUENTIAL OBSERVATION DIAGNOSTICS
# Identifies data points with outsized impact on coefficient estimates
# =============================================================================

# Extract diagnostics from the model
n_obs <- model$stats["Obs"]

# Calculate predicted probabilities and residuals
lp <- predict(model, type = "lp")
fitted_probs <- plogis(lp)
outcome_numeric <- as.numeric(selected_labels_df$Cancelled == "Cancelled")

# Deviance residuals
deviance_resid <- sign(outcome_numeric - fitted_probs) *
  sqrt(-2 * (outcome_numeric * log(fitted_probs + 1e-10) +
             (1 - outcome_numeric) * log(1 - fitted_probs + 1e-10)))

# Pearson residuals
pearson_resid <- (outcome_numeric - fitted_probs) / sqrt(fitted_probs * (1 - fitted_probs))

# Standardized Pearson residuals
std_pearson_resid <- pearson_resid  # Approximate (true standardization requires hat matrix)

# Create diagnostic plots
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))

# Plot 1: Deviance Residuals vs Fitted Values
plot(fitted_probs, deviance_resid,
     xlab = "Fitted Probability", ylab = "Deviance Residual",
     main = "Deviance Residuals vs Fitted",
     pch = 16, col = ifelse(abs(deviance_resid) > 2, "red", alpha("steelblue", 0.5)),
     cex = ifelse(abs(deviance_resid) > 2, 1.2, 0.8))
abline(h = c(-2, 0, 2), lty = c(2, 1, 2), col = c("orange", "black", "orange"))
legend("topright", c("|Residual| > 2"), col = "red", pch = 16, cex = 0.8)

# Plot 2: Pearson Residuals vs Fitted Values
plot(fitted_probs, pearson_resid,
     xlab = "Fitted Probability", ylab = "Pearson Residual",
     main = "Pearson Residuals vs Fitted",
     pch = 16, col = ifelse(abs(pearson_resid) > 2.5, "red", alpha("steelblue", 0.5)),
     cex = ifelse(abs(pearson_resid) > 2.5, 1.2, 0.8))
abline(h = c(-2.5, 0, 2.5), lty = c(2, 1, 2), col = c("orange", "black", "orange"))
legend("topright", c("|Residual| > 2.5"), col = "red", pch = 16, cex = 0.8)

# Plot 3: Index plot of Deviance Residuals
plot(1:length(deviance_resid), deviance_resid,
     xlab = "Observation Index", ylab = "Deviance Residual",
     main = "Deviance Residuals by Observation",
     pch = 16, col = ifelse(abs(deviance_resid) > 2, "red", alpha("steelblue", 0.5)),
     cex = 0.8)
abline(h = c(-2, 0, 2), lty = c(2, 1, 2), col = c("orange", "black", "orange"))

# Identify potentially influential points
influential_idx <- which(abs(deviance_resid) > 2)
if (length(influential_idx) > 0 && length(influential_idx) <= 10) {
  text(influential_idx, deviance_resid[influential_idx],
       labels = influential_idx, pos = 3, cex = 0.7, col = "red")
}

# Plot 4: Histogram of Deviance Residuals
hist(deviance_resid, breaks = 30, col = "steelblue", border = "white",
     main = "Distribution of Deviance Residuals",
     xlab = "Deviance Residual", freq = FALSE)
curve(dnorm(x, mean = mean(deviance_resid), sd = sd(deviance_resid)),
      add = TRUE, col = "red", lwd = 2)
legend("topright", "Normal curve", col = "red", lwd = 2, cex = 0.8)
Figure: Influential Observation Diagnostics for Logistic Regression

Figure: Influential Observation Diagnostics for Logistic Regression

par(mfrow = c(1, 1))

# Summary statistics
cat("\n### Residual Summary Statistics\n\n")

13.2.4 Residual Summary Statistics

resid_summary <- data.frame(
  Statistic = c("Mean", "SD", "Min", "Max", "N > 2", "N > 3", "% > 2"),
  `Deviance Residuals` = c(
    round(mean(deviance_resid, na.rm = TRUE), 4),
    round(sd(deviance_resid, na.rm = TRUE), 4),
    round(min(deviance_resid, na.rm = TRUE), 4),
    round(max(deviance_resid, na.rm = TRUE), 4),
    sum(abs(deviance_resid) > 2, na.rm = TRUE),
    sum(abs(deviance_resid) > 3, na.rm = TRUE),
    round(100 * mean(abs(deviance_resid) > 2, na.rm = TRUE), 1)
  ),
  `Pearson Residuals` = c(
    round(mean(pearson_resid, na.rm = TRUE), 4),
    round(sd(pearson_resid, na.rm = TRUE), 4),
    round(min(pearson_resid, na.rm = TRUE), 4),
    round(max(pearson_resid, na.rm = TRUE), 4),
    sum(abs(pearson_resid) > 2.5, na.rm = TRUE),
    sum(abs(pearson_resid) > 3.5, na.rm = TRUE),
    round(100 * mean(abs(pearson_resid) > 2.5, na.rm = TRUE), 1)
  ),
  check.names = FALSE
)

knitr::kable(resid_summary,
             caption = "Table: Summary of Residual Diagnostics",
             align = c("l", "c", "c"))
Table: Summary of Residual Diagnostics
Statistic Deviance Residuals Pearson Residuals
Mean -0.1627 -0.0034
SD 0.6426 0.9750
Min -1.5406 -1.5088
Max 2.9255 8.4369
N > 2 30.0000 30.0000
N > 3 0.0000 16.0000
% > 2 4.1000 4.1000
# Interpretation
n_large_deviance <- sum(abs(deviance_resid) > 2, na.rm = TRUE)
pct_large <- round(100 * n_large_deviance / length(deviance_resid), 1)

cat("\n**Interpretation**:\n")

Interpretation:

cat(sprintf("- %d observations (%.1f%%) have deviance residuals > 2 in absolute value\n",
            n_large_deviance, pct_large))
  • 30 observations (3.6%) have deviance residuals > 2 in absolute value
cat("- For a well-fitting model, we expect ~5% of residuals to exceed ±2\n")
  • For a well-fitting model, we expect ~5% of residuals to exceed ±2
if (pct_large > 10) {
  cat("- **Warning**: Higher than expected proportion of large residuals may indicate model misspecification\n")
  log_warn("Higher than expected proportion of large residuals")
} else {
  cat("- Residual distribution is within acceptable range\n")
  log_info("Residual distribution acceptable")
}
  • Residual distribution is within acceptable range

13.2.5 Cook’s Distance Analysis

Cook’s Distance measures the influence of each observation on the overall model fit. Observations with Cook’s D > 4/n are potentially influential.

# =============================================================================
# COOK'S DISTANCE ANALYSIS
# Identifies observations with outsized influence on coefficient estimates
# =============================================================================

tryCatch({
  # Fit GLM for Cook's distance (rms::lrm doesn't provide this directly)
  # Use simplified formula for diagnostic purposes
  base_vars <- names(model$Design$parms)
  base_vars <- unique(gsub("'$", "", base_vars))  # Remove RCS prime notation

  if (length(base_vars) > 0) {
    # Build formula
    diag_formula <- as.formula(paste("Cancelled ~", paste(base_vars, collapse = " + ")))

    # Fit GLM
    diag_glm <- glm(diag_formula, data = selected_labels_df, family = binomial)

    # Calculate Cook's distance
    cooks_d <- cooks.distance(diag_glm)

    # Threshold: commonly 4/n or 1
    threshold_4n <- 4 / length(cooks_d)
    threshold_1 <- 1

    # Create plot
    par(mfrow = c(1, 2))

    # Plot 1: Cook's distance by index
    plot(cooks_d, type = "h",
         main = "Cook's Distance by Observation",
         xlab = "Observation Index", ylab = "Cook's Distance",
         col = ifelse(cooks_d > threshold_4n, "red", "steelblue"))
    abline(h = threshold_4n, col = "orange", lty = 2, lwd = 2)
    abline(h = threshold_1, col = "red", lty = 2, lwd = 2)
    legend("topright", c(paste0("4/n = ", round(threshold_4n, 4)), "D = 1"),
           col = c("orange", "red"), lty = 2, lwd = 2, cex = 0.8)

    # Identify highly influential points
    high_influence <- which(cooks_d > threshold_4n)
    if (length(high_influence) > 0 && length(high_influence) <= 15) {
      text(high_influence, cooks_d[high_influence],
           labels = high_influence, pos = 3, cex = 0.6, col = "red")
    }

    # Plot 2: Histogram of Cook's D
    hist(cooks_d, breaks = 50, col = "steelblue", border = "white",
         main = "Distribution of Cook's Distance",
         xlab = "Cook's Distance")
    abline(v = threshold_4n, col = "orange", lty = 2, lwd = 2)

    par(mfrow = c(1, 1))

    # Summary
    n_influential <- sum(cooks_d > threshold_4n, na.rm = TRUE)
    n_very_influential <- sum(cooks_d > threshold_1, na.rm = TRUE)

    cat("\n### Cook's Distance Summary\n\n")
    cat(sprintf("- Threshold (4/n): %.4f\n", threshold_4n))
    cat(sprintf("- Observations with D > 4/n: %d (%.1f%%)\n",
                n_influential, 100 * n_influential / length(cooks_d)))
    cat(sprintf("- Observations with D > 1: %d\n", n_very_influential))
    cat(sprintf("- Maximum Cook's D: %.4f (observation %d)\n",
                max(cooks_d), which.max(cooks_d)))

    if (n_very_influential > 0) {
      cat("\n**Warning**: Observations with Cook's D > 1 may have substantial influence on results.\n")
      log_warn(sprintf("%d observations with Cook's D > 1 detected", n_very_influential))
    } else if (n_influential > nrow(selected_labels_df) * 0.05) {
      cat("\n**Note**: More than 5% of observations exceed 4/n threshold. Review data quality.\n")
      log_info("Elevated number of influential observations")
    } else {
      cat("\n**Result**: No observations with extreme influence on model coefficients.\n")
      log_info("No highly influential observations detected")
    }
  }
}, error = function(e) {
  log_warn(sprintf("Cook's distance calculation failed: %s", e$message))
  cat("Cook's distance could not be calculated for this model specification.\n")
})
Figure: Cook's Distance for Influential Observation Detection

Figure: Cook’s Distance for Influential Observation Detection

13.2.6 Cook’s Distance Summary

  • Threshold (4/n): 0.0054
  • Observations with D > 4/n: 52 (7.0%)
  • Observations with D > 1: 0
  • Maximum Cook’s D: 0.1195 (observation 726)

Note: More than 5% of observations exceed 4/n threshold. Review data quality.

13.2.7 Comprehensive Confusion Matrix with All Metrics

# =============================================================================
# COMPREHENSIVE CONFUSION MATRIX WITH ALL CLASSIFICATION METRICS
# Complete assessment of classification performance
# =============================================================================

# Get predictions at optimal threshold
# IMPORTANT: Use model$y for outcomes to ensure length matches predictions
# The model may have dropped rows with missing predictors, so selected_labels_df
# could have more rows than the model used
lp <- predict(model, type = "lp")
fitted_probs <- plogis(lp)
n_predictions <- length(fitted_probs)

# Use model's stored outcomes to ensure length consistency
# lrm stores factor outcomes as factor with levels 1/2 (not original names)
# Level 1 = first factor level = "Completed", Level 2 = "Cancelled"
if (!is.null(model$y)) {
  outcome_actual <- factor(ifelse(as.numeric(model$y) == 2, "Cancelled", "Completed"),
                           levels = c("Completed", "Cancelled"))
  # If model$y was used, predictions must match its length
  if (length(outcome_actual) != length(fitted_probs)) {
    # If missing predictors caused predict() to return more values than model$y,
    # align them using non-NA linear predictors
    fitted_probs <- fitted_probs[!is.na(lp)]
  }
} else {
  # Fallback: model$y not available
  model_vars <- all.vars(model$sformula)[-1]
  complete_rows <- complete.cases(selected_labels_df[, model_vars, drop = FALSE])
  outcome_actual <- selected_labels_df$Cancelled[complete_rows]
  fitted_probs <- fitted_probs[complete_rows]
}

# Final safety check for alignment and NAs
valid_idx <- !is.na(outcome_actual) & !is.na(fitted_probs)
outcome_actual <- outcome_actual[valid_idx]
fitted_probs <- fitted_probs[valid_idx]
n_final <- length(outcome_actual)

# Determine optimal threshold using Youden's J
roc_obj <- pROC::roc(outcome_actual, fitted_probs, levels = c("Completed", "Cancelled"))
optimal_coords <- pROC::coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))
optimal_threshold <- optimal_coords$threshold

# Create predictions at optimal threshold
predicted_class <- factor(ifelse(fitted_probs >= optimal_threshold, "Cancelled", "Completed"),
                          levels = c("Completed", "Cancelled"))

# Confusion matrix
cm <- table(Predicted = predicted_class, Actual = outcome_actual)

# Extract values
TP <- cm["Cancelled", "Cancelled"]
TN <- cm["Completed", "Completed"]
FP <- cm["Cancelled", "Completed"]
FN <- cm["Completed", "Cancelled"]

# Calculate all metrics
total <- TP + TN + FP + FN
assert_positive_count(total, "confusion matrix total", "performance metrics")
assert_invariant(total == length(outcome_actual),
                 "Confusion matrix total must equal outcome length",
                 context = "performance metrics",
                 data = list(cm_total = total, outcome_length = length(outcome_actual)))

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
ppv <- TP / (TP + FP)
npv <- TN / (TN + FN)
accuracy <- (TP + TN) / total
balanced_accuracy <- (sensitivity + specificity) / 2
f1_score <- 2 * (ppv * sensitivity) / (ppv + sensitivity)
prevalence <- (TP + FN) / total
detection_rate <- TP / total
detection_prevalence <- (TP + FP) / total
mcc <- (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
youden_j <- sensitivity + specificity - 1
diagnostic_odds_ratio <- (TP * TN) / (FP * FN + 1e-10)  # Add small value to avoid division by zero

# Validate metrics are in expected ranges
assert_invariant(sensitivity >= 0 && sensitivity <= 1,
                 "Sensitivity must be in [0,1]",
                 data = list(sensitivity = sensitivity, TP = TP, FN = FN))
assert_invariant(specificity >= 0 && specificity <= 1,
                 "Specificity must be in [0,1]",
                 data = list(specificity = specificity, TN = TN, FP = FP))

# Wilson score confidence intervals
wilson_ci_func <- function(x, n, alpha = 0.05) {
  z <- qnorm(1 - alpha/2)
  p_hat <- x / n
  denom <- 1 + z^2 / n
  center <- (p_hat + z^2 / (2 * n)) / denom
  margin <- z * sqrt((p_hat * (1 - p_hat) + z^2 / (4 * n)) / n) / denom
  c(lower = max(0, center - margin), upper = min(1, center + margin))
}

sens_ci <- wilson_ci_func(TP, TP + FN)
spec_ci <- wilson_ci_func(TN, TN + FP)
ppv_ci <- wilson_ci_func(TP, TP + FP)
npv_ci <- wilson_ci_func(TN, TN + FN)

# Create visualization
par(mfrow = c(1, 2))

# Plot 1: Confusion Matrix Heatmap
cm_mat <- matrix(c(TN, FP, FN, TP), nrow = 2, byrow = TRUE,
                 dimnames = list(Predicted = c("Completed", "Cancelled"),
                                Actual = c("Completed", "Cancelled")))

# Colors: correct predictions in blue, incorrect in red
colors <- matrix(c("lightgreen", "lightcoral", "lightcoral", "lightgreen"),
                 nrow = 2, byrow = TRUE)

plot(c(0, 3), c(0, 3), type = "n", xlab = "", ylab = "", axes = FALSE,
     main = "Confusion Matrix")

# Draw cells
rect(0.5, 1.5, 1.5, 2.5, col = colors[1, 1], border = "black")  # TN
rect(1.5, 1.5, 2.5, 2.5, col = colors[1, 2], border = "black")  # FP
rect(0.5, 0.5, 1.5, 1.5, col = colors[2, 1], border = "black")  # FN
rect(1.5, 0.5, 2.5, 2.5, col = colors[2, 2], border = "black")  # TP

# Add values
text(1, 2, sprintf("TN\n%d", TN), cex = 1.2, font = 2)
text(2, 2, sprintf("FP\n%d", FP), cex = 1.2, font = 2)
text(1, 1, sprintf("FN\n%d", FN), cex = 1.2, font = 2)
text(2, 1, sprintf("TP\n%d", TP), cex = 1.2, font = 2)

# Add labels
text(1, 2.7, "Completed", cex = 1)
text(2, 2.7, "Cancelled", cex = 1)
text(0.3, 2, "Completed", cex = 1, srt = 90)
text(0.3, 1, "Cancelled", cex = 1, srt = 90)
text(1.5, 2.9, "Actual", cex = 1.1, font = 2)
text(0.1, 1.5, "Predicted", cex = 1.1, font = 2, srt = 90)

# Plot 2: Metrics Bar Chart
metrics_df <- data.frame(
  Metric = c("Sensitivity", "Specificity", "PPV", "NPV", "Accuracy", "Balanced Acc", "F1 Score"),
  Value = c(sensitivity, specificity, ppv, npv, accuracy, balanced_accuracy, f1_score)
)

barplot(metrics_df$Value, names.arg = metrics_df$Metric,
        col = "steelblue", ylim = c(0, 1),
        main = "Classification Performance Metrics",
        ylab = "Value", las = 2, cex.names = 0.8)
abline(h = c(0.5, 0.7, 0.9), lty = 2, col = "gray")
Figure: Comprehensive Confusion Matrix Visualization

Figure: Comprehensive Confusion Matrix Visualization

par(mfrow = c(1, 1))

# Comprehensive metrics table
metrics_table <- data.frame(
  Metric = c("True Positives (TP)", "True Negatives (TN)",
             "False Positives (FP)", "False Negatives (FN)",
             "---",
             "Sensitivity (Recall, TPR)", "Specificity (TNR)",
             "Positive Predictive Value (Precision)", "Negative Predictive Value",
             "---",
             "Accuracy", "Balanced Accuracy", "F1 Score (Harmonic Mean of Precision & Recall)",
             "Matthews Correlation Coefficient", "Youden's J Statistic",
             "Diagnostic Odds Ratio",
             "---",
             "Prevalence", "Detection Rate", "Detection Prevalence",
             "---",
             "Optimal Threshold", "AUC-ROC"),
  Value = c(as.character(TP), as.character(TN),
            as.character(FP), as.character(FN),
            "",
            sprintf("%.1f%% [%.1f%%, %.1f%%]", sensitivity*100, sens_ci[1]*100, sens_ci[2]*100),
            sprintf("%.1f%% [%.1f%%, %.1f%%]", specificity*100, spec_ci[1]*100, spec_ci[2]*100),
            sprintf("%.1f%% [%.1f%%, %.1f%%]", ppv*100, ppv_ci[1]*100, ppv_ci[2]*100),
            sprintf("%.1f%% [%.1f%%, %.1f%%]", npv*100, npv_ci[1]*100, npv_ci[2]*100),
            "",
            sprintf("%.1f%%", accuracy*100),
            sprintf("%.1f%%", balanced_accuracy*100),
            sprintf("%.3f", f1_score),
            sprintf("%.3f", mcc),
            sprintf("%.3f", youden_j),
            sprintf("%.1f", diagnostic_odds_ratio),
            "",
            sprintf("%.1f%%", prevalence*100),
            sprintf("%.1f%%", detection_rate*100),
            sprintf("%.1f%%", detection_prevalence*100),
            "",
            sprintf("%.4f", optimal_threshold),
            sprintf("%.3f", as.numeric(roc_obj$auc))),
  Interpretation = c("Correctly identified cancellations", "Correctly identified completions",
                     "False alarms (predicted cancel, actually complete)", "Missed cancellations",
                     "",
                     "Proportion of actual cancellations correctly identified",
                     "Proportion of actual completions correctly identified",
                     "Of predicted cancellations, proportion actually cancelled",
                     "Of predicted completions, proportion actually completed",
                     "",
                     "Overall proportion correct", "(Sensitivity + Specificity) / 2",
                     "Harmonic mean of precision and recall", "Correlation coefficient [-1, 1]",
                     "Sensitivity + Specificity - 1", "TP*TN / (FP*FN)",
                     "",
                     "Proportion of events in sample", "TP / Total",
                     "(TP + FP) / Total",
                     "",
                     "Probability cutoff for classification", "Area Under ROC Curve")
)

knitr::kable(metrics_table,
             caption = "Table: Comprehensive Classification Performance Metrics",
             align = c("l", "c", "l"))
Table: Comprehensive Classification Performance Metrics
Metric Value Interpretation
True Positives (TP) 38 Correctly identified cancellations
True Negatives (TN) 465 Correctly identified completions
False Positives (FP) 222 False alarms (predicted cancel, actually complete)
False Negatives (FN) 13 Missed cancellations
Sensitivity (Recall, TPR) 74.5% [61.1%, 84.5%] Proportion of actual cancellations correctly identified
Specificity (TNR) 67.7% [64.1%, 71.1%] Proportion of actual completions correctly identified
Positive Predictive Value (Precision) 14.6% [10.8%, 19.4%] Of predicted cancellations, proportion actually cancelled
Negative Predictive Value 97.3% [95.4%, 98.4%] Of predicted completions, proportion actually completed
Accuracy 68.2% Overall proportion correct
Balanced Accuracy 71.1% (Sensitivity + Specificity) / 2
F1 Score (Harmonic Mean of Precision & Recall) 0.244 Harmonic mean of precision and recall
Matthews Correlation Coefficient NA Correlation coefficient [-1, 1]
Youden’s J Statistic 0.422 Sensitivity + Specificity - 1
Diagnostic Odds Ratio 6.1 TPTN / (FPFN)
Prevalence 6.9% Proportion of events in sample
Detection Rate 5.1% TP / Total
Detection Prevalence 35.2% (TP + FP) / Total
Optimal Threshold 0.0699 Probability cutoff for classification
AUC-ROC 0.770 Area Under ROC Curve
cat("\n**Note**: Confidence intervals calculated using Wilson score method.\n")

Note: Confidence intervals calculated using Wilson score method.

cat(sprintf("At optimal threshold (%.4f), the model prioritizes %s.\n",
            optimal_threshold,
            ifelse(sensitivity > specificity, "sensitivity (detecting cancellations)",
                   "specificity (avoiding false alarms)")))

At optimal threshold (0.0699), the model prioritizes sensitivity (detecting cancellations).

13.2.8 McFadden’s Pseudo-R² and Model Fit Statistics

# =============================================================================
# COMPREHENSIVE PSEUDO-R² AND MODEL FIT STATISTICS
# Multiple measures of explained variation for logistic regression
# =============================================================================

# Extract from model
model_stats <- model$stats
n <- model_stats["Obs"]
null_deviance <- -2 * model$deviance[1] / 2  # Approximate from model output

# McFadden's R²
mcfadden_r2 <- 1 - (model$deviance[2] / model$deviance[1])
if (is.na(mcfadden_r2)) mcfadden_r2 <- model_stats["R2"]

# Nagelkerke R² (from model)
nagelkerke_r2 <- model_stats["R2"]

# Brier Score
brier <- model_stats["Brier"]

# AIC and BIC
aic <- AIC(model)
bic <- BIC(model)

# Log-likelihood
loglik <- -model$deviance[2] / 2

# Likelihood Ratio Chi-square
lr_chisq <- model$stats["Model L.R."]
lr_df <- length(coef(model)) - 1
lr_pval <- 1 - pchisq(lr_chisq, lr_df)

# C-statistic
c_stat <- model_stats["C"]
dxy <- model_stats["Dxy"]

fit_table <- data.frame(
  Measure = c("Pseudo-R² Measures", "---",
              "McFadden's R²", "Nagelkerke R²", "Brier Score",
              "", "Information Criteria", "---",
              "AIC", "BIC", "-2 Log-Likelihood",
              "", "Likelihood Ratio Test", "---",
              "Chi-square", "Degrees of Freedom", "p-value",
              "", "Discrimination", "---",
              "C-statistic (AUC)", "Dxy (Somers' D)"),
  Value = c("", "",
            sprintf("%.4f", mcfadden_r2), sprintf("%.4f", nagelkerke_r2), sprintf("%.4f", brier),
            "", "", "",
            sprintf("%.2f", aic), sprintf("%.2f", bic), sprintf("%.2f", -2 * loglik),
            "", "", "",
            sprintf("%.2f", lr_chisq), sprintf("%d", as.integer(lr_df)), sprintf("%.2e", lr_pval),
            "", "", "",
            sprintf("%.4f", c_stat), sprintf("%.4f", dxy)),
  Interpretation = c("", "",
                     "0.2-0.4 = excellent fit", "Proportion variance explained", "Lower = better (range 0-1)",
                     "", "", "",
                     "Lower = better (relative)", "Lower = better (penalizes complexity)", "Model deviance",
                     "", "", "",
                     "Test statistic", "Number of predictors", "Model vs. null significance",
                     "", "", "",
                     ">0.7 acceptable, >0.8 good", "2*(C - 0.5), range [-1, 1]")
)

knitr::kable(fit_table,
             caption = "Table: Comprehensive Model Fit Statistics",
             align = c("l", "c", "l"))
Table: Comprehensive Model Fit Statistics
Measure Value Interpretation
Pseudo-R² Measures
McFadden’s R² 0.1268 0.2-0.4 = excellent fit
Nagelkerke R² 0.1563 Proportion variance explained
Brier Score 0.0595 Lower = better (range 0-1)
Information Criteria
AIC 353.90 Lower = better (relative)
BIC 422.96 Lower = better (penalizes complexity)
-2 Log-Likelihood 323.90 Model deviance
Likelihood Ratio Test
Chi-square 47.04 Test statistic
Degrees of Freedom 14 Number of predictors
p-value 1.89e-05 Model vs. null significance
Discrimination
C-statistic (AUC) 0.7697 >0.7 acceptable, >0.8 good
Dxy (Somers’ D) 0.5395 2*(C - 0.5), range [-1, 1]
# Interpretation
cat("\n**Interpretation**:\n")

Interpretation:

cat(sprintf("- McFadden's R² = %.3f: %s\n", mcfadden_r2,
            ifelse(mcfadden_r2 >= 0.4, "Excellent fit",
                   ifelse(mcfadden_r2 >= 0.2, "Good fit", "Modest fit (typical for rare outcomes)"))))
  • McFadden’s R² = 0.127: Modest fit (typical for rare outcomes)
cat(sprintf("- Likelihood Ratio Test: χ²(%.0f) = %.2f, p %s\n", lr_df, lr_chisq,
            ifelse(lr_pval < 0.001, "< 0.001", sprintf("= %.4f", lr_pval))))
  • Likelihood Ratio Test: χ²(14) = 47.04, p < 0.001
cat(sprintf("- C-statistic = %.3f: %s discrimination\n", c_stat,
            ifelse(c_stat >= 0.8, "Good", ifelse(c_stat >= 0.7, "Acceptable", "Poor"))))
  • C-statistic = 0.770: Acceptable discrimination

13.2.9 Brier Score Plot

This calibration plot visualizes how well the predicted probabilities from your logistic regression model align with the actual observed outcomes. The x-axis (“Predicted Probability”) represents the probabilities that the model estimated, while the y-axis (“Observed Probability of Cancellation”) indicates how frequently cancellations occurred within each predicted probability bin. Each point (blue circle) corresponds to a group of observations with similar predicted probabilities, and the size of each circle reflects the number of observations in that group, as detailed in the legend. Vertical lines around each point represent the 95% confidence intervals, indicating the uncertainty of each observed proportion.

A dashed grey diagonal line shows perfect calibration (where predictions exactly match observations). Your model shows good overall calibration, as evidenced by points closely following this ideal diagonal line, particularly for predicted probabilities below approximately 0.3. However, at higher probabilities (around 0.3 to 0.4), there is more deviation from perfect calibration, suggesting predictions in this range might be less reliable due to fewer observations or greater variability. The Brier score (0.0601) supports this interpretation, indicating good predictive accuracy overall, with only minor discrepancies observed at higher predicted probabilities.

# Create the calibration plot using the function
calibration_plot <- create_calibration_plot(model)

# Display the plot
print(calibration_plot)

The Brier Score and Area Under the Receiver Operating Characteristic Curve (AUC) are both valuable metrics used to assess model performance, though they measure different aspects of predictive accuracy. The Brier Score evaluates a model’s overall accuracy by simultaneously assessing calibration—how closely the predicted probabilities match observed outcomes—and its discriminatory power, or the ability to distinguish between different outcome classes. It ranges from 0 to 1, where lower values indicate better performance. A score of 0 represents perfect prediction accuracy, whereas higher scores indicate poorer performance, with scores closer to 0.25 typically suggesting no better performance than random chance in a balanced binary classification. Conversely, the AUC specifically quantifies a model’s ability to discriminate between two outcome classes. It ranges from 0.5 to 1.0, with 0.5 indicating random guessing (no discrimination ability) and 1.0 indicating perfect discrimination. For instance, an AUC of 0.85 demonstrates strong ability to correctly distinguish between classes based on predicted probabilities, but it does not guarantee that these predicted probabilities accurately reflect the actual observed frequencies. Thus, while the Brier Score combines both calibration (the accuracy of predicted probabilities) and discrimination (the ability to distinguish between outcomes), the AUC exclusively measures discrimination.

13.2.10 AUC Plot

# =============================================================================
# ROC CURVE USING MODEL'S OWN DATA (ensures consistency with C-statistic)
# =============================================================================

# Use the model's stored linear predictors and outcomes for perfect alignment
# model$linear.predictors has the SAME length as model$y (complete cases only)
# Convert linear predictors to probabilities using plogis (inverse logit)
roc_predictions <- plogis(model$linear.predictors)
roc_outcomes <- as.numeric(model$y) - 1  # Convert factor to 0/1

# Verify alignment
log_info(sprintf("ROC calculation: %d predictions, %d outcomes",
                 length(roc_predictions), length(roc_outcomes)))
log_info(sprintf("Events: %d (%.1f%%)", sum(roc_outcomes == 1),
                 100 * mean(roc_outcomes == 1)))

# Calculate ROC
roc_obj_main <- pROC::roc(roc_outcomes, roc_predictions, quiet = TRUE)
ci_auc_main <- pROC::ci.auc(roc_obj_main, method = "delong")

# Create publication-quality ROC plot
par(mar = c(5, 5, 4, 2), pty = "s")
plot(roc_obj_main,
     main = "ROC Curve",
     col = "#2E86AB",
     lwd = 3,
     legacy.axes = TRUE,
     print.auc = FALSE,
     xlab = "1 - Specificity (False Positive Rate)",
     ylab = "Sensitivity (True Positive Rate)",
     cex.lab = 1.2,
     cex.axis = 1.1)

abline(a = 0, b = 1, lty = 2, col = "#CCCCCC", lwd = 2)

# Add AUC annotation
text(0.6, 0.2,
     paste0("AUC = ", round(as.numeric(roc_obj_main$auc), 3),
            "\n(95% CI: ", round(ci_auc_main[1], 3), "-", round(ci_auc_main[3], 3), ")"),
     cex = 1.3, font = 2)

legend("bottomright",
       legend = c("Prediction Model", "Reference Line (AUC = 0.5)"),
       col = c("#2E86AB", "#CCCCCC"),
       lty = c(1, 2),
       lwd = c(3, 2),
       bty = "n",
       cex = 1.0)
Figure: Model Accuracy (ROC Curve). This curve shows the trade-off between correctly identifying cancellations (sensitivity, y-axis) and avoiding false alarms (1-specificity, x-axis). The blue shaded area represents model accuracy—larger areas indicate better performance. A perfect model would reach the upper-left corner. The diagonal dashed line represents random guessing (50% accuracy).

Figure: Model Accuracy (ROC Curve). This curve shows the trade-off between correctly identifying cancellations (sensitivity, y-axis) and avoiding false alarms (1-specificity, x-axis). The blue shaded area represents model accuracy—larger areas indicate better performance. A perfect model would reach the upper-left corner. The diagonal dashed line represents random guessing (50% accuracy).

# Log for verification
log_info(sprintf("ROC AUC = %.3f (should match C-statistic = %.3f)",
                 as.numeric(roc_obj_main$auc), model$stats["C"]))

The values in your ROC curve represent important metrics related to the performance of your predictive model:

Interpreting ROC Results (Note: Values calculated below in the optimal threshold section)

The ROC curve analysis helps evaluate model performance at different classification thresholds. Key concepts:

Sensitivity (True Positive Rate) - The proportion of actual cancellations correctly identified by the model - Higher sensitivity means fewer missed cancellations

Specificity (True Negative Rate) - The proportion of completed procedures correctly identified by the model - Higher specificity means fewer false alarms

What this means for your model: - The AUC of 0.77 indicates a moderately good model (values range from 0.5 for random guessing to 1.0 for perfect prediction) - At the optimal threshold (calculated below), the model balances sensitivity and specificity - The specific sensitivity and specificity values at the optimal threshold are reported in the subsequent sections

Depending on your clinical goals, you might want to adjust this threshold: - If missing a cancellation is very costly, use a lower threshold to increase sensitivity - If falsely predicting cancellations is problematic, use a higher threshold to increase specificity

13.2.11 Confusion Matrix

# Histogram of predicted probabilities
# Note: Uses predictions computed in calculate-optimal-threshold chunk above
hist(predictions,
     breaks = 30,
     main = "Distribution of Predicted Cancellation Probabilities",
     xlab = "Predicted Probability of Cancellation",
     ylab = "Number of Patients",
     col = "steelblue",
     border = "white")
abline(v = optimal_threshold, col = "red", lwd = 2, lty = 2)
legend("topright", legend = c(paste0("Optimal Threshold (", optimal_threshold_display, ")")), col = "red", lty = 2, lwd = 2)
Distribution of predicted probabilities

Distribution of predicted probabilities

13.2.12 Interpreting the Histogram of Predictions

The histogram above shows the distribution of predicted cancellation probabilities across the 738 patients with complete data used in model fitting. Key observations:

  1. Right-skewed distribution: Most patients have low predicted probabilities (concentrated below 0.21, the 95th percentile), which reflects the low baseline cancellation rate (6.9%) in these patients.

  2. Peak near the threshold: The distribution peaks near the optimal threshold (0.126), indicating most patients are predicted to have relatively low risk of cancellation.

  3. Long right tail: A smaller number of patients extend to higher probabilities (up to 0.69), representing high-risk individuals (typically older patients with history of recurrent UTIs).

  4. Red dashed line: The optimal classification threshold (0.126) is shown. Patients with predicted probabilities above this line are classified as “likely to cancel.”

Why is the threshold so low? The threshold of 0.126 (12.6%) may seem low, but it reflects the base rate of cancellation in the data. Because cancellations are rare, the optimal threshold that balances sensitivity and specificity is near the population prevalence. Using the standard 0.5 threshold would result in classifying virtually no patients as “at risk.”

Optimal Threshold Interpretation (Programmatic): The optimal classification threshold of 0.1259322 (12.6%) was determined by maximizing Youden’s J statistic (sensitivity + specificity - 1), which identifies the probability cutoff that best balances the ability to detect true cancellations (sensitivity) while minimizing false alarms (1 - specificity). This threshold is close to the base cancellation rate of 6.9%, which is expected when the outcome is rare—the optimal threshold gravitates toward the prevalence to avoid systematic over- or under-prediction.

# 5. Print the confusion matrix as a formatted table
conf_df <- as.data.frame.matrix(conf_matrix)
conf_df <- cbind(Actual = rownames(conf_df), conf_df)
rownames(conf_df) <- NULL

# Display using flextable for better formatting
conf_ft <- flextable(conf_df) %>%
  set_caption("Confusion Matrix: Actual vs. Predicted Classifications") %>%
  set_header_labels(Actual = "Actual Outcome") %>%
  add_header_row(values = c("", "Predicted"), colwidths = c(1, 2)) %>%
  theme_vanilla() %>%
  autofit() %>%
  align(align = "center", part = "all") %>%
  bold(part = "header") %>%
  bg(i = 1, j = 2, bg = "#d4edda") %>%  # True negatives - green

  bg(i = 2, j = 3, bg = "#d4edda") %>%  # True positives - green
  bg(i = 1, j = 3, bg = "#f8d7da") %>%  # False positives - red
  bg(i = 2, j = 2, bg = "#f8d7da")      # False negatives - red

conf_ft
Confusion Matrix: Actual vs. Predicted Classifications

Predicted

Actual Outcome

Cancelled

Completed

Cancelled

4

41

Completed

109

492

# 6. Calculate and display metrics
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
sensitivity <- conf_matrix["Cancelled", "Cancelled"] / sum(conf_matrix["Cancelled", ])
specificity <- conf_matrix["Completed", "Completed"] / sum(conf_matrix["Completed", ])

# Create metrics table
metrics_df <- data.frame(
  Metric = c("Accuracy", "Sensitivity (Recall)", "Specificity"),
  Value = c(paste0(round(accuracy * 100, 1), "%"),
            paste0(round(sensitivity * 100, 1), "%"),
            paste0(round(specificity * 100, 1), "%")),
  Interpretation = c(
    "Overall correct classification rate",
    "Proportion of actual cancellations correctly identified",
    "Proportion of completed procedures correctly identified"
  )
)

flextable(metrics_df) %>%
  set_caption(paste0("Classification Performance Metrics at Optimal Threshold (", optimal_threshold_display, ")")) %>%
  theme_vanilla() %>%
  autofit() %>%
  bold(part = "header")
Classification Performance Metrics at Optimal Threshold (0.126)

Metric

Value

Interpretation

Accuracy

76.8%

Overall correct classification rate

Sensitivity (Recall)

8.9%

Proportion of actual cancellations correctly identified

Specificity

81.9%

Proportion of completed procedures correctly identified

13.2.13 Interpreting the Confusion Matrix

The confusion matrix shows how the model’s predictions compare to actual outcomes. At the optimal threshold of 0.126:

Predicted: Completed Predicted: Cancelled
Actually Completed True Negatives (TN) False Positives (FP)
Actually Cancelled False Negatives (FN) True Positives (TP)

Reading the results:

  • True Positives (green): Patients correctly predicted to have cancellations
  • True Negatives (green): Patients correctly predicted to complete their procedures
  • False Positives (red): Patients predicted to cancel but who actually completed (false alarms)
  • False Negatives (red): Patients predicted to complete but who actually cancelled (missed detections)

Trade-off at the chosen threshold:

The optimal threshold (0.126) balances sensitivity and specificity. However, clinical considerations may suggest adjusting this threshold:

  • If missing a cancellation is costly (wasted OR time, patient inconvenience): Use a lower threshold to increase sensitivity, accepting more false positives
  • If false alarms are problematic (unnecessary interventions, patient anxiety): Use a higher threshold to increase specificity, accepting more missed cancellations

Your classification model predicting whether a procedure would be cancelled achieved good overall accuracy. Given that cancellations represent a minority class (as indicated by your data, with 57 cancellations out of 841 observations), accuracy alone might not fully capture your model’s performance.

The sensitivity (recall) describes the model’s ability to correctly detect patients whose procedures actually were cancelled. A higher sensitivity means fewer missed cancellations. Your calculated sensitivity reflects how well the model identified those who truly experienced cancellations. Conversely, specificity represents the proportion of correctly identified non-cancelled procedures. A high specificity means your model avoids incorrectly labeling procedures as cancelled when they actually occurred as planned.

However, sensitivity and specificity alone don’t provide a full picture—particularly if cancellations are uncommon (as your data suggest, with cancellations being a minority class, around 6.8%). In these scenarios, positive predictive value (PPV) and negative predictive value (NPV) become essential.

  • A high PPV indicates that when the model predicts cancellation, it is likely correct. In practice, this would help prioritize interventions or preoperative counseling for patients most likely to cancel.
  • A high NPV would indicate reliability in identifying patients whose procedures likely will not be cancelled, allowing more confident scheduling without unnecessary resource allocation.

Given your results:

  • Accuracy: Your model correctly classified a good proportion of cases overall, though accuracy alone can be misleading if the dataset is imbalanced.
  • Sensitivity and specificity indicate reasonable discriminative ability (C-index = 0.77), suggesting your model effectively differentiates between cancellations and non-cancellations.
  • A Brier score of 0.059 indicates good overall calibration, suggesting your model’s predicted probabilities closely align with actual outcomes.

To fully interpret your results in clinical practice, calculate the PPV and NPV. High values in these metrics would further support your model’s clinical usefulness, making it easier for clinicians to act confidently based on predicted risk levels.

Overall, your model demonstrates good discrimination and calibration, making it useful for anticipating procedural cancellations and enabling targeted management or counseling, potentially reducing cancellations or enhancing operational efficiency.

14 Nomogram

14.1 Technical Details: Setting Variable Limits for Nomogram Generation

The code below configures the data distribution settings required by the rms package for nomogram generation. The datadist object defines the range of values for Age and BMI that will be displayed on the nomogram axes.

# Load required packages
library(rms)
library(logger)

# Set up logging
log_info("Starting procedure cancellation nomogram creation")

# Ensure dataset meets requirements
if (!exists("selected_labels_df")) {
    stop("Error: selected_labels_df not found")
}

# ============================================================================
# CLEAN VARIABLE NAMES AND FACTOR LEVELS FOR NOMOGRAM DISPLAY
# This ensures the nomogram has readable labels instead of ugly R variable names
# ============================================================================

log_info("Cleaning variable names and factor levels for nomogram display")

# Create a copy for nomogram display
nomogram_df <- selected_labels_df

# CRITICAL FIX: Convert empty strings to NA across all character columns
# Empty strings masquerade as valid values but should be treated as missing
# This must happen BEFORE any imputation or modeling
log_info("Converting empty strings to NA across all character columns...")
for (col in names(nomogram_df)) {
  if (is.character(nomogram_df[[col]])) {
    n_empty <- sum(nomogram_df[[col]] == "", na.rm = TRUE)
    if (n_empty > 0) {
      log_info(sprintf("  %s: converting %d empty strings to NA", col, n_empty))
      nomogram_df[[col]][nomogram_df[[col]] == ""] <- NA
    }
  }
}

# COMPREHENSIVE VARIABLE NAME MAPPING (old name -> new clean name)
# Using underscores instead of spaces to avoid formula issues
var_name_map <- c(
  "Age." = "Age",
  "BMI" = "BMI",
  "Is.the.patient.hispanic..latino.or.of.Spanish.origin." = "Hispanic",
  "Is.the.patient.Hispanic.or.Latino." = "Hispanic",
  "Race." = "Race",
  "Does.the.patient.have.a.h.o.recurrent.UTIs." = "Recurrent_UTIs",
  "Does.the.patient.have.recurrent.UTIs." = "Recurrent_UTIs",
  # POP-Q stage removed from model per user request
  "What.was.the.reason.for.the.urodynamics." = "UDS_Indication",
  "What.is.the.reason.for.the.urodynamics." = "UDS_Indication",
  "Does.the.patient.have.diabetes." = "Diabetes",
  "Does.the.patient.have.OAB." = "OAB",
  "Is.the.patient.on.vaginal.estrogen." = "Vaginal_Estrogen",
  "Immunocompromised." = "Immunocompromised",
  "Tobacco.use." = "Tobacco_Use",
  "Menopause.status." = "Menopause",
  "Average.number.of.voids.at.night." = "Nocturia",
  "Year" = "Year",
  "PFDI.20.Score" = "PFDI_20",
  "UDI_6" = "UDI_6",
  "POPDI_6" = "POPDI_6",
  "CRADI_8" = "CRADI_8",
  "Did.the.patient.have.detrusor.overactivity." = "Detrusor_Overactivity",
  "Did.the.patient.have.DO." = "Detrusor_Overactivity_DO"
)

# Rename columns that exist in the data
for (old_name in names(var_name_map)) {
  if (old_name %in% names(nomogram_df)) {
    new_name <- var_name_map[old_name]
    names(nomogram_df)[names(nomogram_df) == old_name] <- new_name
    log_info(paste("Renamed:", old_name, "->", new_name))
  }
}

# FORCE SAFE NAMES (Replace spaces/special chars with underscores, keep case)
# This prevents issues with rms::datadist and formula construction while preserving
# capitalization expected by downstream chunks (e.g. Factor Cleaning)
log_info("Sanitizing all column names (removing spaces/dots)...")
safe_names <- names(nomogram_df)
safe_names <- gsub("\\s+", "_", safe_names)       # Spaces to underscores
safe_names <- gsub("[^a-zA-Z0-9_]", "_", safe_names) # Special chars to underscores
safe_names <- gsub("_+", "_", safe_names)         # Deduplicate underscores
safe_names <- gsub("_$", "", safe_names)          # Remove trailing underscores
names(nomogram_df) <- safe_names

# Define exclude columns (outcomes) - matching potential safe names
exclude_cols <- c("Cancelled", "Was_the_procedure_cancelled", "Was_the_procedure_cancelled_")

# Remove constant columns (zero variance) as they cause issues for datadist/lrm
log_info("Removing constant columns...")
constant_cols <- sapply(nomogram_df, function(x) length(unique(x[!is.na(x)])) <= 1)
if (any(constant_cols)) {
  removed_const <- names(nomogram_df)[constant_cols]
  log_info(paste("Removed constant columns:", paste(removed_const, collapse = ", ")))
  nomogram_df <- nomogram_df[, !constant_cols]
}

# Update available_cols from the sanitized dataframe
available_cols <- setdiff(names(nomogram_df), exclude_cols)

log_info(paste("Sanitized predictors:", paste(available_cols, collapse = ", ")))

# CLEAN FACTOR LEVELS - Replace abbreviations with readable labels
# Hispanic/Latino
if ("Hispanic" %in% names(nomogram_df)) {
  nomogram_df$Hispanic <- factor(nomogram_df$Hispanic)
  levels(nomogram_df$Hispanic) <- gsub("Yes|yes|Y|1|TRUE", "Yes", levels(nomogram_df$Hispanic))
  levels(nomogram_df$Hispanic) <- gsub("No|no|N|0|FALSE", "No", levels(nomogram_df$Hispanic))
}

# Recurrent UTIs
if ("Recurrent_UTIs" %in% names(nomogram_df)) {
  nomogram_df$Recurrent_UTIs <- factor(nomogram_df$Recurrent_UTIs)
  old_levels <- levels(nomogram_df$Recurrent_UTIs)
  new_levels <- ifelse(grepl("Yes|yes|Y|1|TRUE", old_levels), "Yes",
                       ifelse(grepl("No|no|N|0|FALSE", old_levels), "No", old_levels))
  levels(nomogram_df$Recurrent_UTIs) <- new_levels
}

# Diabetes
if ("Diabetes" %in% names(nomogram_df)) {
  nomogram_df$Diabetes <- factor(nomogram_df$Diabetes)
  old_levels <- levels(nomogram_df$Diabetes)
  new_levels <- ifelse(grepl("Yes|yes|Y|1|TRUE", old_levels), "Yes",
                       ifelse(grepl("No|no|N|0|FALSE", old_levels), "No", old_levels))
  levels(nomogram_df$Diabetes) <- new_levels
}

# OAB
if ("OAB" %in% names(nomogram_df)) {
  nomogram_df$OAB <- factor(nomogram_df$OAB)
  old_levels <- levels(nomogram_df$OAB)
  new_levels <- ifelse(grepl("Yes|yes|Y|1|TRUE", old_levels), "Yes",
                       ifelse(grepl("No|no|N|0|FALSE", old_levels), "No", old_levels))
  levels(nomogram_df$OAB) <- new_levels
}

# Vaginal Estrogen
if ("Vaginal_Estrogen" %in% names(nomogram_df)) {
  nomogram_df$Vaginal_Estrogen <- factor(nomogram_df$Vaginal_Estrogen)
  old_levels <- levels(nomogram_df$Vaginal_Estrogen)
  new_levels <- ifelse(grepl("Yes|yes|Y|1|TRUE", old_levels), "Yes",
                       ifelse(grepl("No|no|N|0|FALSE", old_levels), "No", old_levels))
  levels(nomogram_df$Vaginal_Estrogen) <- new_levels
}

# Immunocompromised
if ("Immunocompromised" %in% names(nomogram_df)) {
  nomogram_df$Immunocompromised <- factor(nomogram_df$Immunocompromised)
  old_levels <- levels(nomogram_df$Immunocompromised)
  new_levels <- ifelse(grepl("Yes|yes|Y|1|TRUE", old_levels), "Yes",
                       ifelse(grepl("No|no|N|0|FALSE", old_levels), "No", old_levels))
  levels(nomogram_df$Immunocompromised) <- new_levels
}

# Tobacco Use
if ("Tobacco_Use" %in% names(nomogram_df)) {
  nomogram_df$Tobacco_Use <- factor(nomogram_df$Tobacco_Use)
  old_levels <- levels(nomogram_df$Tobacco_Use)
  new_levels <- ifelse(grepl("Yes|yes|Y|1|TRUE|Current|Former", old_levels), "Yes",
                       ifelse(grepl("No|no|N|0|FALSE|Never", old_levels), "No", old_levels))
  levels(nomogram_df$Tobacco_Use) <- new_levels
}

# Menopause Status
if ("Menopause" %in% names(nomogram_df)) {
  nomogram_df$Menopause <- factor(nomogram_df$Menopause)
  old_levels <- levels(nomogram_df$Menopause)
  new_levels <- gsub("Post-menopausal|Postmenopausal|post", "Post", old_levels)
  new_levels <- gsub("Pre-menopausal|Premenopausal|pre", "Pre", new_levels)
  new_levels <- gsub("Peri-menopausal|Perimenopausal|peri", "Peri", new_levels)
  levels(nomogram_df$Menopause) <- new_levels
}

# POP-Q Stage removed from model per user request - keeping code commented for reference
# if ("POP_Q_Stage" %in% names(nomogram_df)) {
#   nomogram_df$POP_Q_Stage <- factor(nomogram_df$POP_Q_Stage)
#   old_levels <- levels(nomogram_df$POP_Q_Stage)
#   # Map common POP-Q values
#   new_levels <- gsub("^0$|Stage 0|stage 0", "0", old_levels)
#   new_levels <- gsub("^1$|^I$|Stage 1|stage 1|Stage I", "I", new_levels)
#   new_levels <- gsub("^2$|^II$|Stage 2|stage 2|Stage II", "II", new_levels)
#   new_levels <- gsub("^3$|^III$|Stage 3|stage 3|Stage III", "III", new_levels)
#   new_levels <- gsub("^4$|^IV$|Stage 4|stage 4|Stage IV", "IV", new_levels)
#   levels(nomogram_df$POP_Q_Stage) <- new_levels
# }

# UDS Indication - make VERY SHORT labels to prevent overlap
if ("UDS_Indication" %in% names(nomogram_df)) {
  nomogram_df$UDS_Indication <- factor(nomogram_df$UDS_Indication)
  old_levels <- levels(nomogram_df$UDS_Indication)
  log_info(paste("Original UDS levels:", paste(old_levels, collapse = " | ")))

  # Use very short abbreviations - comprehensive pattern matching
  new_levels <- old_levels
  new_levels <- gsub(".*[Ss]ling.*", "SL", new_levels)
  new_levels <- gsub(".*[Pp]re-?op.*|.*[Pp]reop.*", "Pre", new_levels)
  new_levels <- gsub(".*[Ii]ncontinence.*|.*SUI.*|.*[Uu]rinary.*[Ii]ncont.*", "UI", new_levels)
  new_levels <- gsub(".*POP.*|.*[Pp]rolapse.*", "POP", new_levels)
  new_levels <- gsub(".*[Vv]oiding.*[Dd]ysf.*|.*[Vv]oiding.*|.*VD.*", "VD", new_levels)
  new_levels <- gsub(".*[Rr]ecurrent.*UTI.*|.*rUTI.*", "rUTI", new_levels)
  new_levels <- gsub(".*[Nn]eurogenic.*[Bb]ladder.*|.*[Nn]eurogenic.*|.*NB.*", "NB", new_levels)
  new_levels <- gsub(".*OAB.*|.*[Oo]veractive.*[Bb]ladder.*", "OAB", new_levels)
  new_levels <- gsub(".*[Oo]ther.*", "Oth", new_levels)
  # Clean up any remaining long labels
  new_levels <- gsub("evaluation of ", "", new_levels)
  new_levels <- gsub("Evaluation of ", "", new_levels)
  new_levels <- gsub(" ", "", new_levels)  # Remove any spaces

  levels(nomogram_df$UDS_Indication) <- new_levels
  log_info(paste("Cleaned UDS levels:", paste(levels(nomogram_df$UDS_Indication), collapse = ", ")))
}

# Race - clean up race categories
if ("Race" %in% names(nomogram_df)) {
  nomogram_df$Race <- factor(nomogram_df$Race)
  old_levels <- levels(nomogram_df$Race)
  new_levels <- gsub(".*[Ww]hite.*|.*[Cc]aucasian.*", "White", old_levels)
  new_levels <- gsub(".*[Bb]lack.*|.*African.*", "Black", new_levels)
  new_levels <- gsub(".*[Hh]ispanic.*|.*[Ll]atino.*", "Hispanic", new_levels)
  new_levels <- gsub(".*[Aa]sian.*", "Asian", new_levels)
  new_levels <- gsub(".*[Oo]ther.*|.*[Mm]ulti.*|.*[Mm]ixed.*", "Other", new_levels)
  levels(nomogram_df$Race) <- new_levels
}

# Also rename the outcome variable
if ("Was.the.procedure.cancelled." %in% names(nomogram_df)) {
  names(nomogram_df)[names(nomogram_df) == "Was.the.procedure.cancelled."] <- "Cancelled"
}

# SET RMS LABELS for nice display on nomogram axes
# These will appear as the variable names on the left side of the nomogram
label_display_map <- c(
  "Age" = "Patient Age (years)",
  "BMI" = "Body Mass Index",
  "Hispanic" = "Hispanic/Latino",
  "Race" = "Race",
  "Recurrent_UTIs" = "Recurrent UTIs",
  # "POP_Q_Stage" = "POP-Q Stage",  # Removed from model per user request
  "UDS_Indication" = "Indication for UDS",
  "Diabetes" = "Diabetes",
  "OAB" = "Overactive Bladder",
  "Vaginal_Estrogen" = "Vaginal Estrogen",
  "Immunocompromised" = "Immunocompromised",
  "Tobacco_Use" = "Tobacco Use",
  "Menopause" = "Menopause Status",
  "Nocturia" = "Nocturia (per night)",
  "Year" = "Year",
  "PFDI_20" = "PFDI-20 Score",
  "UDI_6" = "UDI-6 Score",
  "POPDI_6" = "POPDI-6 Score",
  "CRADI_8" = "CRADI-8 Score",
  "Detrusor_Overactivity" = "Detrusor Overactivity"
)

for (col in names(nomogram_df)) {
  if (col %in% names(label_display_map)) {
    Hmisc::label(nomogram_df[[col]]) <- label_display_map[[col]]
    log_info(paste("Set label for", col, "->", label_display_map[[col]]))
  }
}

log_info("Variable and factor level cleaning complete")

# ============================================================================
# DYNAMIC MODEL SETUP FOR NOMOGRAM
# Use the LASSO-selected predictors from nomogram_df (cleaned version)
# This ensures the nomogram is based on LASSO feature selection results
# ============================================================================

log_info("Setting up nomogram with LASSO-selected predictors")

# Get predictors from nomogram_df (which contains LASSO-selected variables with clean names)
exclude_cols <- c("Cancelled", "Was.the.procedure.cancelled.")
available_cols <- setdiff(names(nomogram_df), exclude_cols)

if (length(available_cols) == 0) {
  stop("Error: No predictor columns found in nomogram_df. Check LASSO feature selection.")
}

log_info(paste("Initial predictors from LASSO:", paste(available_cols, collapse = ", ")))

# =============================================================================


log_info(paste("Final predictors for model:", paste(available_cols, collapse = ", ")))

# Verify the outcome column exists
if (!"Cancelled" %in% names(nomogram_df)) {
  stop("Error: Outcome column 'Cancelled' not found in nomogram_df")
}

# Log ranges for numeric predictors
for (col in available_cols) {
  if (is.numeric(nomogram_df[[col]])) {
    log_info(paste("Current", col, "range:",
                   min(nomogram_df[[col]], na.rm=TRUE),
                   "to", max(nomogram_df[[col]], na.rm=TRUE)))
  } else {
    log_info(paste(col, "is categorical with levels:",
                   paste(levels(factor(nomogram_df[[col]])), collapse = ", ")))
  }
}

# ✅ Step 1: Explicitly Recreate and Apply `datadist`
log_info("Creating and applying datadist")
dd <- rms::datadist(nomogram_df)

# Set reasonable limits for numeric predictors
if ("Age" %in% available_cols && is.numeric(nomogram_df$Age)) {
  dd$limits["Low", "Age"] <- 20
  dd$limits["High", "Age"] <- 90
  log_info(paste("Age limits set to:", dd$limits["Low", "Age"], "to", dd$limits["High", "Age"]))
}

if ("BMI" %in% available_cols && is.numeric(nomogram_df$BMI)) {
  dd$limits["Low", "BMI"] <- 20
  dd$limits["High", "BMI"] <- 60
  dd$limits["Adjust to", "BMI"] <- 30  # Reference value
  log_info(paste("BMI limits set to:", dd$limits["Low", "BMI"], "to", dd$limits["High", "BMI"]))
}

options(datadist = dd)  # Ensure it is applied globally
log_info("datadist created and applied globally")

# ✅ Step 2: Variable names are already clean - just store for later use
log_info("Variable names already cleaned - storing for display")
predictor_names_display <- available_cols

# =============================================================================
# ✅ Step 2.5: AUTOMATIC RESTRICTED CUBIC SPLINES FOR CONTINUOUS VARIABLES
# =============================================================================
# RCS allows modeling of non-linear relationships for continuous predictors
# This can improve model fit (R²) while maintaining interpretability
# Reference: Harrell FE Jr. Regression Modeling Strategies. 2nd ed. Springer; 2015.
# =============================================================================

log_info("Detecting continuous variables for restricted cubic splines (RCS)")

# Configuration: Number of knots for RCS
# 3 knots = 2 degrees of freedom (minimum for non-linearity)
# 4 knots = 3 degrees of freedom (moderate flexibility)
# 5 knots = 4 degrees of freedom (more flexibility, higher risk of overfitting)
# Given EPV constraints, we use 3 knots for parsimony
rcs_knots <- 3

# Minimum unique values required for RCS (need at least knots + 2)
min_unique_for_rcs <- rcs_knots + 2

# Identify continuous vs categorical variables
continuous_vars <- c()
categorical_vars <- c()

for (col in available_cols) {
  col_data <- nomogram_df[[col]]

  if (is.numeric(col_data)) {
    n_unique <- length(unique(na.omit(col_data)))

    # Only apply RCS if sufficient unique values
    if (n_unique >= min_unique_for_rcs) {
      continuous_vars <- c(continuous_vars, col)
      log_info(sprintf("  %s: continuous (n_unique=%d) -> will use rcs(%s, %d)",
                       col, n_unique, col, rcs_knots))
    } else {
      # Treat as linear if too few unique values
      categorical_vars <- c(categorical_vars, col)
      log_info(sprintf("  %s: continuous but low cardinality (n_unique=%d) -> linear term",
                       col, n_unique))
    }
  } else {
    categorical_vars <- c(categorical_vars, col)
    n_levels <- length(unique(na.omit(col_data)))
    log_info(sprintf("  %s: categorical (n_levels=%d) -> factor term", col, n_levels))
  }
}

# Build formula with RCS for continuous variables
rcs_terms <- sapply(continuous_vars, function(v) {
  v_safe <- if (grepl("[^a-zA-Z0-9_.]", v)) paste0("`", v, "`") else v
  sprintf("rcs(%s, %d)", v_safe, rcs_knots)
})
linear_terms <- categorical_vars

# Combine all terms
all_terms <- c(rcs_terms, linear_terms)

log_info(sprintf("RCS Summary: %d continuous vars with rcs(), %d categorical/linear vars",
                 length(continuous_vars), length(categorical_vars)))

# Store for later reference
rcs_variables <- continuous_vars
linear_variables <- categorical_vars

# =============================================================================
# ✅ Step 2.6: PROGRAMMATIC INTERACTION DETECTION
# =============================================================================
# Test clinically plausible interactions and add significant ones to model
# Uses likelihood ratio test to assess interaction significance
# Reference: Hosmer DW, Lemeshow S. Applied Logistic Regression. 3rd ed. Wiley; 2013.
# =============================================================================

log_info("Testing clinically plausible interactions programmatically")

# Use helper function to detect interactions (standardized across validation steps)
significant_interactions <- detect_interactions(nomogram_df, "Cancelled", available_cols)

if (length(significant_interactions) > 0) {
  for (int_term in significant_interactions) {
    log_info(paste("Adding significant interaction:", int_term))
  }
} else {
  log_info("No significant interactions found (p < 0.10)")
}

# Store for documentation
interaction_terms_added <- significant_interactions

# Store the number of potential interactions tested for reporting
# This matches the pairs defined in detect_interactions(): age × diabetes, age × UTIs,
# BMI × diabetes, BMI × UTIs, age × OAB, diabetes × UTIs
available_interactions <- c(
  "Age × Diabetes", "Age × Recurrent UTIs", "BMI × Diabetes",
  "BMI × Recurrent UTIs", "Age × OAB", "Diabetes × Recurrent UTIs"
)

# ✅ Step 3: Refit Model with dynamic predictors using CLEANED data
log_info("Fitting logistic regression model with dynamic predictors, RCS, and interactions")

# Build formula from available columns (using cleaned outcome name)
# Continuous variables get rcs(), categorical stay as-is, add significant interactions
# Build formula from available columns (using cleaned outcome name)
# Continuous variables get rcs(), categorical stay as-is, add significant interactions
# ENSURE TERMS ARE SAFE (backticked if needed)
# Check if significant_interactions are safe (detect_interactions should return them safe now)
# But `all_terms` might need check.
safe_all_terms <- sapply(all_terms, function(x) {
  # If it looks like a function call (has parens), assume it's safe/handled (e.g. rcs(`Var Name`, 3))
  # If it's a plain name with spaces and NO backticks, wrap it.
  if (!grepl("\\(", x) && grepl("[^a-zA-Z0-9_.`]", x) && !startsWith(x, "`")) {
    paste0("`", x, "`")
  } else {
    x
  }
})

all_model_terms <- c(safe_all_terms, significant_interactions)
nomogram_formula <- as.formula(paste("Cancelled ~", paste(all_model_terms, collapse = " + ")))
log_info(paste("Nomogram formula:", deparse(nomogram_formula)))

# =============================================================================
# MICE IMPUTATION: Handle missing predictor values
# This ensures all 841 patients are included in the model
# Missing: Detrusor Overactivity (26), Recurrent UTIs (10)
# =============================================================================
log_info("Checking for missing values in predictor variables...")

# Get predictor columns from the formula (exclude outcome)
predictor_cols <- all.vars(nomogram_formula)[-1]  # Remove outcome variable
predictor_cols <- gsub("rcs\\(([^,]+),.*\\)", "\\1", predictor_cols)  # Remove rcs() wrapper

log_info(sprintf("Formula predictor columns: %s", paste(predictor_cols, collapse = ", ")))
log_info(sprintf("Available nomogram_df columns: %s", paste(names(nomogram_df), collapse = ", ")))

# Check which predictor columns actually exist in nomogram_df
predictor_cols_found <- predictor_cols[predictor_cols %in% names(nomogram_df)]
predictor_cols_missing <- predictor_cols[!predictor_cols %in% names(nomogram_df)]

if (length(predictor_cols_missing) > 0) {
  log_warn(sprintf("Predictor columns NOT FOUND in data: %s", paste(predictor_cols_missing, collapse = ", ")))
}
log_info(sprintf("Predictor columns found: %s", paste(predictor_cols_found, collapse = ", ")))

# Check missing values before imputation
n_missing_before <- sum(!complete.cases(nomogram_df[, predictor_cols_found, drop = FALSE]))
log_info(sprintf("Rows with missing predictor values: %d of %d (%.1f%%)",
                 n_missing_before, nrow(nomogram_df),
                 100 * n_missing_before / nrow(nomogram_df)))

if (n_missing_before > 0) {
  log_info("Performing MICE imputation to include all patients...")

  # Load mice if not already loaded
  if (!requireNamespace("mice", quietly = TRUE)) {
    stop("Package 'mice' is required for imputation. Install with: install.packages('mice')")
  }

  # Select columns for imputation (predictors + outcome)
  # Use predictor_cols_found to ensure we only use columns that exist
  impute_cols <- c("Cancelled", predictor_cols_found)
  impute_cols <- impute_cols[impute_cols %in% names(nomogram_df)]
  impute_data <- nomogram_df[, impute_cols, drop = FALSE]

  log_info(sprintf("Columns selected for imputation: %s", paste(impute_cols, collapse = ", ")))

  # CRITICAL FIX: Convert empty strings to NA before imputation
  # Empty strings ("") are NOT treated as NA by MICE - they're treated as valid values
  # This was causing 64+ records to be "imputed" as "" instead of actual values
  for (col in names(impute_data)) {
    if (is.character(impute_data[[col]])) {
      n_empty <- sum(impute_data[[col]] == "", na.rm = TRUE)
      if (n_empty > 0) {
        log_info(sprintf("Converting %d empty strings to NA in %s", n_empty, col))
        impute_data[[col]][impute_data[[col]] == ""] <- NA
      }
    }
  }

  # CRITICAL: Convert character columns to factors for MICE imputation
  # MICE cannot impute character columns - they must be factors
  for (col in names(impute_data)) {
    if (is.character(impute_data[[col]])) {
      log_info(sprintf("Converting %s from character to factor for MICE", col))
      impute_data[[col]] <- factor(impute_data[[col]])
    }
  }

  # Show missing data pattern
  log_info("Missing data pattern:")
  for (col in impute_cols) {
    n_na <- sum(is.na(impute_data[[col]]))
    if (n_na > 0) {
      log_info(sprintf("  %s: %d missing (%.1f%%)", col, n_na, 100 * n_na / nrow(impute_data)))
    }
  }

  # Run MICE imputation
  # Use appropriate methods for each variable type:
  # - "pmm" for continuous variables
  # - "logreg" for binary factors
  # - "polyreg" for multi-level factors
  set.seed(SEED_MAIN)  # For reproducibility

  # Determine appropriate imputation method for each column
  impute_methods <- sapply(names(impute_data), function(col) {
    x <- impute_data[[col]]
    if (is.numeric(x) && !is.factor(x)) {
      return("pmm")  # Predictive mean matching for continuous
    } else if (is.factor(x) || is.character(x)) {
      n_levels <- length(unique(na.omit(x)))
      if (n_levels == 2) {
        return("logreg")  # Logistic regression for binary
      } else {
        return("polyreg")  # Polytomous regression for multi-level
      }
    } else {
      return("pmm")  # Default
    }
  })

  log_info(sprintf("MICE methods: %s", paste(names(impute_methods), impute_methods, sep="=", collapse=", ")))

  # Suppress verbose output during imputation
  mice_result <- suppressMessages(
    mice::mice(
      impute_data,
      m = 5,              # 5 imputed datasets
      method = impute_methods,
      maxit = 10,         # 10 iterations
      printFlag = FALSE   # Suppress iteration output
    )
  )

  # Use the first imputed dataset (or could pool across all 5)
  imputed_data <- mice::complete(mice_result, action = 1)

  # Verify imputation worked - use predictor_cols_found (columns that actually exist)
  n_missing_after <- sum(!complete.cases(imputed_data[, predictor_cols_found, drop = FALSE]))
  log_info(sprintf("After MICE imputation: %d rows with missing values (was %d)",
                   n_missing_after, n_missing_before))

  if (n_missing_after > 0) {
    log_warn("MICE imputation did not resolve all missing values!")
    # Debug: show which columns still have missing
    for (col in predictor_cols_found) {
      n_still_na <- sum(is.na(imputed_data[[col]]))
      if (n_still_na > 0) {
        log_warn(sprintf("  %s still has %d missing values", col, n_still_na))
      }
    }
  } else {
    log_info("SUCCESS: All missing predictor values imputed")
  }

  # Update nomogram_df with imputed values - use predictor_cols_found
  for (col in predictor_cols_found) {
    if (col %in% names(imputed_data)) {
      # For factors, we need to ensure the levels match or convert properly
      if (is.factor(imputed_data[[col]])) {
        # If the original was character, convert the imputed factor back to character
        # then the model will handle the conversion
        if (is.character(nomogram_df[[col]])) {
          nomogram_df[[col]] <- as.character(imputed_data[[col]])
        } else {
          nomogram_df[[col]] <- imputed_data[[col]]
        }
      } else {
        nomogram_df[[col]] <- imputed_data[[col]]
      }
      log_info(sprintf("Updated %s with imputed values (class: %s)", col, class(nomogram_df[[col]])[1]))
    }
  }

  # Verify nomogram_df is now complete
  n_final_missing <- sum(!complete.cases(nomogram_df[, predictor_cols_found, drop = FALSE]))
  log_info(sprintf("nomogram_df missing rows after update: %d", n_final_missing))

  # Store imputation info for documentation
  mice_imputation_performed <- TRUE
  mice_n_imputed <- n_missing_before

} else {
  log_info("No missing predictor values - MICE imputation not needed")
  mice_imputation_performed <- FALSE
  mice_n_imputed <- 0
}

# REFRESH DATADIST (Critical for rms functions)
# Data has been imputed and potentially modified, so we must update datadist
log_info("Refreshing datadist with imputed data")
dd <- rms::datadist(nomogram_df)
options(datadist = dd)

# Fit the model (now with complete data after imputation)
model <- rms::lrm(
    nomogram_formula,
    data = nomogram_df,
    x = TRUE,
    y = TRUE
)
log_info("Model fitted successfully")

# =============================================================================
# CRITICAL: Track actual model sample size (complete cases only)
# rms::lrm() automatically excludes rows with missing predictor values
# These counts MUST be used for TRIPOD reporting and EPV calculations
# =============================================================================
model_n_actual <- model$stats["Obs"]
# lrm stores factor outcomes as factor with levels 1/2 (not original names)
# Level 1 = "Completed", Level 2 = "Cancelled"
model_n_events <- sum(as.numeric(model$y) == 2)
model_n_nonevents <- sum(as.numeric(model$y) == 1)
model_event_rate <- round(100 * model_n_events / model_n_actual, 1)

# Calculate excluded due to missing predictors
# This is the difference between analysis cohort (841) and model N
tripod_n_excluded_missing_predictors <- tripod_n_final - model_n_actual

log_info(sprintf("CRITICAL MODEL COUNTS (for TRIPOD reporting):"))
log_info(sprintf("  Analysis cohort with complete outcome: %d", tripod_n_final))
log_info(sprintf("  Excluded for missing predictors: %d", tripod_n_excluded_missing_predictors))
log_info(sprintf("  Model actual N: %d", model_n_actual))
log_info(sprintf("  Model events (Cancelled): %d (%.1f%%)", model_n_events, model_event_rate))
log_info(sprintf("  Model non-events (Completed): %d", model_n_nonevents))

# Verify consistency
stopifnot(
  model_n_actual + tripod_n_excluded_missing_predictors == tripod_n_final,
  model_n_events + model_n_nonevents == model_n_actual
)

# =============================================================================
# CRITICAL: Recalculate EPV with ACTUAL model event count
# This is the TRUE EPV that should be reported in the manuscript
# =============================================================================
n_model_predictors <- length(available_cols)
model_epv_actual <- round(model_n_events / n_model_predictors, 1)
model_epv_adequate <- model_epv_actual >= EPV_MINIMUM

log_info(sprintf(""))
log_info(sprintf("=== CRITICAL: ACTUAL EPV CALCULATION ==="))
log_info(sprintf("Model events: %d", model_n_events))
log_info(sprintf("Model predictors: %d", n_model_predictors))
log_info(sprintf("ACTUAL EPV: %.1f (minimum: %d)", model_epv_actual, EPV_MINIMUM))
log_info(sprintf("EPV adequate: %s", model_epv_adequate))

if (!model_epv_adequate) {
  log_warn(sprintf("WARNING: Model EPV (%.1f) is BELOW the recommended minimum of %d!",
                   model_epv_actual, EPV_MINIMUM))
  log_warn("This may result in unstable coefficient estimates and overfitting.")
  log_warn("Consider: (1) reducing predictors, (2) imputing missing values, or (3) collecting more data.")
}

log_info(paste("Model C-statistic:", round(model$stats["C"], 3)))
log_info(paste("Model R-squared:", round(model$stats["R2"], 3)))
log_info(paste("Model Brier score:", round(model$stats["Brier"], 4)))
log_info(paste("Number of predictors:", n_model_predictors))

# Report RCS implementation
if (length(continuous_vars) > 0) {
  log_info(sprintf("RCS applied to %d continuous variable(s): %s",
                   length(continuous_vars), paste(continuous_vars, collapse = ", ")))
  log_info(sprintf("Each RCS term uses %d knots (%d degrees of freedom)",
                   rcs_knots, rcs_knots - 1))

  # Calculate effective degrees of freedom
  # Each RCS with k knots adds k-1 df, categorical vars add levels-1 df
  rcs_df <- length(continuous_vars) * (rcs_knots - 1)
  log_info(sprintf("Total degrees of freedom from RCS: %d", rcs_df))
} else {
  log_info("No continuous variables suitable for RCS - using linear terms only")
}

14.2 Model Sample Summary

# Display the actual model sample information
if (mice_imputation_performed) {
  cat("=== FINAL MODEL SAMPLE (After MICE Imputation) ===\n\n")
  cat(sprintf("Cohort with complete outcome data: %d patients\n", tripod_n_final))
  cat(sprintf("Patients with missing predictors:  %d patients (imputed with MICE)\n", mice_n_imputed))
  cat(sprintf("Final model analytical sample:     %d patients (ALL included)\n", model_n_actual))
  cat("\n✅ MICE multiple imputation was used to include all patients.\n")
} else {
  cat("=== FINAL MODEL SAMPLE ===\n\n")
  cat(sprintf("Cohort with complete outcome data: %d patients\n", tripod_n_final))
  cat(sprintf("Final model analytical sample:     %d patients\n", model_n_actual))
}

=== FINAL MODEL SAMPLE (After MICE Imputation) ===

Cohort with complete outcome data: 841 patients Patients with missing predictors: 103 patients (imputed with MICE) Final model analytical sample: 841 patients (ALL included)

✅ MICE multiple imputation was used to include all patients.

cat(sprintf("\nOutcome distribution in model:\n"))

Outcome distribution in model:

cat(sprintf("  Cancelled (events):     %d (%.1f%%)\n", model_n_events, model_event_rate))

Cancelled (events): 57 (6.8%)

cat(sprintf("  Completed (non-events): %d (%.1f%%)\n", model_n_nonevents, 100 - model_event_rate))

Completed (non-events): 784 (93.2%)

cat(sprintf("\n=== EVENTS PER PREDICTOR (EPV) ===\n"))

=== EVENTS PER PREDICTOR (EPV) ===

cat(sprintf("Model events: %d\n", model_n_events))

Model events: 57

cat(sprintf("Model predictors: %d\n", n_model_predictors))

Model predictors: 7

cat(sprintf("Actual EPV: %.1f (minimum recommended: %d)\n", model_epv_actual, EPV_MINIMUM))

Actual EPV: 8.1 (minimum recommended: 10)

cat(sprintf("EPV adequate: %s\n", ifelse(model_epv_adequate, "YES", "NO - BELOW MINIMUM")))

EPV adequate: NO - BELOW MINIMUM

if (!model_epv_adequate) {
  cat("\n⚠️ WARNING: The model EPV is below the recommended minimum of 10.\n")
  cat("This may affect coefficient stability.\n")
}

⚠️ WARNING: The model EPV is below the recommended minimum of 10. This may affect coefficient stability.

Note on Missing Data: MICE multiple imputation was used to impute missing predictor values for 103 patients (12.2%). The final model was fit on 841 patients with 57 cancellation events (6.8% event rate).

14.3 Penalized Regression Comparison (pentrace)

The rms::pentrace() function explores the optimal penalty for ridge regression, providing an alternative to LASSO for addressing overfitting. This comparison helps validate our variable selection approach.

# =============================================================================
# PENALIZED REGRESSION ANALYSIS WITH pentrace()
# Reference: Ollberding N. Introduction to the Harrell-verse.
# "pentrace() implements L-2 penalty shrinkage to reduce overfitting"
# =============================================================================

log_info("Exploring penalized regression with rms::pentrace()...")

tryCatch({
  # pentrace requires the model to be fitted with x=TRUE, y=TRUE (already done)

  # Define penalty grid to search
  # Range from no penalty to strong penalty
  penalty_grid <- seq(0, 50, by = 2)

  # Run pentrace to find optimal penalty
  pen_trace <- rms::pentrace(model, penalty = penalty_grid)

  # Extract results - pentrace returns results in a matrix/data.frame
  # Handle different return structures from pentrace
  if ("results" %in% names(pen_trace)) {
    pen_results <- as.data.frame(pen_trace$results)
  } else if (is.data.frame(pen_trace) || is.matrix(pen_trace)) {
    pen_results <- as.data.frame(pen_trace)
  } else {
    # Try to construct from available elements
    pen_results <- data.frame(
      penalty = pen_trace$penalty %||% penalty_grid,
      df = pen_trace$df %||% NA,
      aic = pen_trace$aic %||% NA
    )
  }

  # Ensure column names are lowercase for consistency
  names(pen_results) <- tolower(names(pen_results))

  # Check for valid AIC values before plotting
  valid_aic <- !is.na(pen_results$aic) & is.finite(pen_results$aic)
  if (sum(valid_aic) < 2) {
    stop("Insufficient valid AIC values from pentrace - model may not support penalization")
  }

  valid_results <- pen_results[valid_aic, ]

  # Plot the penalty trace using base graphics
  graphics::plot(valid_results$penalty, valid_results$aic,
       main = "Penalty Trace: Effective AIC vs Penalty",
       xlab = "Penalty (lambda)",
       ylab = "Effective AIC",
       col = "darkblue",
       lwd = 2,
       type = "b",
       pch = 16)

  # Mark optimal penalty
  optimal_penalty <- valid_results$penalty[which.min(valid_results$aic)]
  abline(v = optimal_penalty, col = "red", lty = 2, lwd = 2)

  # Add annotation
  mtext(sprintf("Optimal penalty = %.1f", optimal_penalty),
        side = 3, line = 0.5, col = "red", cex = 1.1)

  # Display results table
  cat("\n=== Penalized Regression Analysis (pentrace) ===\n\n")
  cat(sprintf("Optimal penalty (minimizes effective AIC): %.2f\n", optimal_penalty))

  # Show summary of penalty effects (using valid results only)
  pen_summary <- data.frame(
    Penalty = valid_results$penalty,
    df = round(valid_results$df, 2),
    AIC = round(valid_results$aic, 2)
  )

  # Show key points
  cat("\n--- Penalty Effects Summary ---\n")
  key_indices <- c(1, which.min(valid_results$aic), length(valid_results$penalty))
  key_indices <- unique(key_indices)

  for (i in key_indices) {
    cat(sprintf("  Penalty=%5.1f: df=%5.2f, AIC=%7.2f%s\n",
                valid_results$penalty[i],
                valid_results$df[i],
                valid_results$aic[i],
                ifelse(i == which.min(valid_results$aic), " <- OPTIMAL", "")))
  }

  # Compare with our LASSO approach
  cat("\n--- Comparison with LASSO ---\n")
  cat("• LASSO (L1 penalty): Variable selection + shrinkage\n")
  cat("• Ridge (L2 penalty): Shrinkage only, keeps all variables\n")
  cat(sprintf("• Our approach: LASSO for selection, then rms::lrm() for inference\n"))

  if (optimal_penalty > 0) {
    cat(sprintf("\n✅ pentrace suggests some penalization (λ=%.1f) would improve the model.\n", optimal_penalty))
    cat("   This is consistent with our shrinkage adjustment approach.\n")
  } else {
    cat("\n✅ pentrace suggests no additional penalization needed.\n")
  }

  log_info(sprintf("pentrace analysis complete: optimal penalty = %.2f", optimal_penalty))

}, error = function(e) {
  log_info(paste("pentrace analysis not available for this model:", e$message))
  cat("\n**Note:** Penalized regression analysis could not be performed for this model.\n")
  cat("This is expected for some model configurations. The model validation via bootstrap\n")
  cat("calibration slope and shrinkage adjustment provides equivalent information.\n")
})

Note: Penalized regression analysis could not be performed for this model. This is expected for some model configurations. The model validation via bootstrap calibration slope and shrinkage adjustment provides equivalent information.

Interpretation of pentrace Results:

  • Effective AIC curve: Shows model fit quality vs. penalty strength
  • Optimal penalty: The λ that minimizes effective AIC
  • Degrees of freedom (df): Decreases as penalty increases (more shrinkage)
  • Comparison with LASSO:
    • LASSO (L1): Performs variable selection (sets coefficients to zero)
    • Ridge (L2): Shrinks all coefficients but keeps all variables
    • Our hybrid approach uses LASSO for selection, then shrinkage adjustment

Note: The primary nomogram using shrinkage-adjusted coefficients is shown in Figure 2 (Shrinkage-Adjusted Nomogram) below, following bootstrap validation. The original unadjusted nomogram is retained here for reference but is hidden from output.

14.4 Nomogram with Enhanced Rug Plot (Data Distribution)

This version adds sophisticated data distribution visualization (“rug plots”) to show where actual patient values fall along each predictor scale. The enhanced rug uses Hmisc::scat1d() for a cleaner, more informative display than standard rug plots.

Interpretation of Distribution Plots: - Green density: Distribution of completed procedures - Red density: Distribution of cancelled procedures - Overlap: Where the distributions overlap, the predictor has less discriminatory power - Separation: Where distributions differ, the predictor helps distinguish outcomes - Rug marks: Individual patient values (green = completed, red = cancelled)

15 PART 6: Internal Validation

This section assesses model performance using rigorous validation techniques. We employ both standard bootstrap validation and full-process bootstrap (which includes LASSO selection within each resample) per Harrell’s recommendations.

15.1 Why Full-Process Validation?

Standard bootstrap only validates the final model. Full-process bootstrap repeats ALL modeling steps (including LASSO selection) in each resample, providing a more honest assessment of optimism when data-driven variable selection is used.

15.2 Bootstrap Internal Validation

Bootstrap validation provides optimism-corrected estimates of model performance by resampling with replacement. This technique helps assess how much the apparent model performance may be inflated due to overfitting.

# Set seed for reproducibility (uses configured seed)
set.seed(SEED_BOOTSTRAP)

# Perform bootstrap validation with configured number of resamples
# This validates the lrm model and provides optimism-corrected statistics
log_info(paste("Starting bootstrap validation with", BOOTSTRAP_RESAMPLES, "resamples..."))

# Verify model x and y dimensions before validation
if (!is.null(model$x) && !is.null(model$y)) {
  log_info(sprintf("Model x dimensions: %d x %d, y length: %d",
                   nrow(model$x), ncol(model$x), length(model$y)))
  if (nrow(model$x) != length(model$y)) {
    log_warn("Dimension mismatch detected - refitting model")
    # Refit model to ensure x and y are consistent
    model <- rms::lrm(nomogram_formula, data = nomogram_df, x = TRUE, y = TRUE)
  }
}

bootstrap_validation <- rms::validate(model, B = BOOTSTRAP_RESAMPLES, pr = FALSE)

log_info("Bootstrap validation completed")

# Convert to data frame for kable display
# The validate object is a matrix with special class, need to handle carefully
bootstrap_matrix <- unclass(bootstrap_validation)
bootstrap_df <- data.frame(
  Metric = rownames(bootstrap_matrix),
  Apparent = round(bootstrap_matrix[, "index.orig"], 4),
  Training = round(bootstrap_matrix[, "training"], 4),
  Test = round(bootstrap_matrix[, "test"], 4),
  Optimism = round(bootstrap_matrix[, "optimism"], 4),
  Corrected = round(bootstrap_matrix[, "index.corrected"], 4),
  n = bootstrap_matrix[, "n"]
)

# Display as formatted kable table
kable(bootstrap_df,
      col.names = c("Metric", "Apparent", "Training", "Test", "Optimism", "Corrected", "n"),
      caption = paste("Bootstrap Internal Validation Results (B =", BOOTSTRAP_RESAMPLES, "resamples)"),
      align = c("l", "c", "c", "c", "c", "c", "c"),
      row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE,
                font_size = 13) %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(6, bold = TRUE, color = "white", background = "#27AE60") %>%
  add_header_above(c(" " = 1, "Performance Indices" = 5, " " = 1))
Bootstrap Internal Validation Results (B = 1000 resamples)
Performance Indices
Metric Apparent Training Test Optimism Corrected n
Dxy 0.5222 0.5599 0.4777 0.0822 0.4400 1000
R2 0.1384 0.1692 0.1127 0.0565 0.0819 1000
Intercept 0.0000 0.0000 -0.4708 0.4708 -0.4708 1000
Slope 1.0000 1.0000 0.7913 0.2087 0.7913 1000
Emax 0.0000 0.0000 0.1667 -0.1667 0.1667 1000
D 0.0544 0.0674 0.0439 0.0235 0.0309 1000
U -0.0024 -0.0024 0.0037 -0.0061 0.0037 1000
Q 0.0568 0.0697 0.0401 0.0296 0.0272 1000
B 0.0593 0.0579 0.0606 -0.0027 0.0620 1000
g 1.0702 1.2866 0.9825 0.3041 0.7662 1000
gp 0.0631 0.0688 0.0570 0.0118 0.0514 1000
# Extract key metrics
apparent_dxy <- bootstrap_validation["Dxy", "index.orig"]
optimism_dxy <- bootstrap_validation["Dxy", "optimism"]
corrected_dxy <- bootstrap_validation["Dxy", "index.corrected"]

# Convert Dxy to C-statistic (C = 0.5 + Dxy/2)
apparent_c <- 0.5 + apparent_dxy/2
corrected_c <- 0.5 + corrected_dxy/2
optimism_c <- apparent_c - corrected_c
# Create C-statistic summary table
c_stat_summary <- data.frame(
  Measure = c("Apparent C-statistic", "Optimism", "Optimism-corrected C-statistic"),
  Value = c(
    sprintf("%.3f", apparent_c),
    sprintf("%.3f", optimism_c),
    sprintf("%.3f", corrected_c)
  ),
  Interpretation = c(
    "Performance on training data",
    sprintf("Expected decrease (%.1f%%)", optimism_c * 100),
    "Realistic estimate for new data"
  )
)

kable(c_stat_summary,
      caption = "C-Statistic Summary",
      align = c("l", "c", "l")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                font_size = 14) %>%
  row_spec(3, bold = TRUE, background = "#E8F6F3")
C-Statistic Summary
Measure Value Interpretation
Apparent C-statistic 0.761 Performance on training data
Optimism 0.041 Expected decrease (4.1%)
Optimism-corrected C-statistic 0.720 Realistic estimate for new data

15.3 Calibration Slope and Intercept (Critical Overfitting Indicators)

Per Harrell’s Regression Modeling Strategies, calibration slope and intercept are critical indicators of model overfitting that should be prominently reported. A calibration slope < 1 indicates predictions are too extreme (overfitting), while the intercept indicates systematic over- or under-prediction.

# =============================================================================
# EXTRACT AND REPORT CALIBRATION SLOPE AND INTERCEPT FROM BOOTSTRAP VALIDATION
# Reference: Harrell FE. Regression Modeling Strategies. 2nd ed. Chapter 5.
# "The calibration slope and intercept provide direct evidence of overfitting"
# =============================================================================

# Extract calibration metrics from bootstrap validation
# The validate() output includes Slope and Intercept rows
cal_slope_apparent <- bootstrap_validation["Slope", "index.orig"]
cal_slope_optimism <- bootstrap_validation["Slope", "optimism"]
cal_slope_corrected <- bootstrap_validation["Slope", "index.corrected"]

cal_intercept_apparent <- bootstrap_validation["Intercept", "index.orig"]
cal_intercept_optimism <- bootstrap_validation["Intercept", "optimism"]
cal_intercept_corrected <- bootstrap_validation["Intercept", "index.corrected"]

# Determine calibration quality based on Harrell's guidelines
# Slope: 1.0 = perfect, <0.9 = concerning, <0.8 = problematic
# Intercept: 0 = perfect, |intercept| > 0.1 = may need recalibration

slope_quality <- case_when(

  abs(cal_slope_corrected - 1) < 0.1 ~ "Excellent",
  abs(cal_slope_corrected - 1) < 0.2 ~ "Good",
  abs(cal_slope_corrected - 1) < 0.3 ~ "Moderate",
  TRUE ~ "Poor - consider shrinkage"
)

intercept_quality <- case_when(
  abs(cal_intercept_corrected) < 0.05 ~ "Excellent",
  abs(cal_intercept_corrected) < 0.1 ~ "Good",
  abs(cal_intercept_corrected) < 0.2 ~ "Moderate",
  TRUE ~ "Poor - systematic bias"
)

# Create calibration summary table
calibration_summary <- data.frame(
  Metric = c("Calibration Slope", "Calibration Intercept"),
  Apparent = c(sprintf("%.3f", cal_slope_apparent), sprintf("%.3f", cal_intercept_apparent)),
  Optimism = c(sprintf("%.3f", cal_slope_optimism), sprintf("%.3f", cal_intercept_optimism)),
  Corrected = c(sprintf("%.3f", cal_slope_corrected), sprintf("%.3f", cal_intercept_corrected)),
  Ideal = c("1.000", "0.000"),
  Quality = c(slope_quality, intercept_quality)
)

kable(calibration_summary,
      col.names = c("Metric", "Apparent", "Optimism", "Corrected", "Ideal", "Quality"),
      caption = "Calibration Slope and Intercept from Bootstrap Validation",
      align = c("l", "c", "c", "c", "c", "l")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                font_size = 13) %>%
  column_spec(4, bold = TRUE, color = "white",
              background = ifelse(calibration_summary$Quality %in% c("Excellent", "Good"),
                                  "#27AE60", "#E74C3C")) %>%
  column_spec(6, bold = TRUE)
Calibration Slope and Intercept from Bootstrap Validation
Metric Apparent Optimism Corrected Ideal Quality
Calibration Slope 1.000 0.209 0.791 1.000 Moderate
Calibration Intercept 0.000 0.471 -0.471 0.000 Poor - systematic bias
cat("\n\n*Interpretation: Calibration slope < 1 indicates overfitting (predictions too extreme). Intercept ≠ 0 indicates systematic bias.*\n\n")

Interpretation: Calibration slope < 1 indicates overfitting (predictions too extreme). Intercept ≠ 0 indicates systematic bias.

# =============================================================================
# PROGRAMMATIC INTERPRETATION OF CALIBRATION RESULTS
# =============================================================================

cat("\n### Interpretation of Calibration Metrics\n\n")

15.3.1 Interpretation of Calibration Metrics

# Slope interpretation
cat("**Calibration Slope:** ")

Calibration Slope:

if (cal_slope_corrected >= 0.9) {
  cat(sprintf("The optimism-corrected slope of **%.3f** is close to ideal (1.0), ", cal_slope_corrected))
  cat("indicating that the model's predicted probabilities are appropriately calibrated ")
  cat("and do not require shrinkage adjustment.\n\n")
} else if (cal_slope_corrected >= 0.8) {
  cat(sprintf("The optimism-corrected slope of **%.3f** indicates modest overfitting. ", cal_slope_corrected))
  cat("Predicted probabilities may be slightly too extreme. ")
  cat(sprintf("Consider applying a shrinkage factor of **%.3f** to regression coefficients when deploying the model.\n\n", cal_slope_corrected))
} else {
  cat(sprintf("**WARNING:** The optimism-corrected slope of **%.3f** indicates substantial overfitting. ", cal_slope_corrected))
  cat("Predicted probabilities are too extreme. ")
  cat(sprintf("**Recommendations:** (1) Apply shrinkage factor of **%.3f** to coefficients, ", cal_slope_corrected))
  cat("(2) Consider penalized regression (ridge/LASSO), ")
  cat("(3) Reduce model complexity.\n\n")
}

WARNING: The optimism-corrected slope of 0.791 indicates substantial overfitting. Predicted probabilities are too extreme. Recommendations: (1) Apply shrinkage factor of 0.791 to coefficients, (2) Consider penalized regression (ridge/LASSO), (3) Reduce model complexity.

# Intercept interpretation
cat("**Calibration Intercept:** ")

Calibration Intercept:

if (abs(cal_intercept_corrected) < 0.1) {
  cat(sprintf("The optimism-corrected intercept of **%.3f** is close to ideal (0.0), ", cal_intercept_corrected))
  cat("indicating no systematic over- or under-prediction of risk.\n\n")
} else if (cal_intercept_corrected > 0) {
  cat(sprintf("The positive intercept of **%.3f** indicates the model systematically ", cal_intercept_corrected))
  cat("**under-predicts** risk (observed rates higher than predicted). ")
  cat("Consider recalibration before clinical deployment.\n\n")
} else {
  cat(sprintf("The negative intercept of **%.3f** indicates the model systematically ", cal_intercept_corrected))
  cat("**over-predicts** risk (observed rates lower than predicted). ")
  cat("Consider recalibration before clinical deployment.\n\n")
}

The negative intercept of -0.471 indicates the model systematically over-predicts risk (observed rates lower than predicted). Consider recalibration before clinical deployment.

# Combined assessment
cat("**Combined Assessment:** ")

Combined Assessment:

needs_shrinkage <- cal_slope_corrected < 0.9
needs_recalibration <- abs(cal_intercept_corrected) >= 0.1

if (!needs_shrinkage && !needs_recalibration) {
  cat("Model calibration is adequate for clinical use without adjustment.\n")
} else if (needs_shrinkage && !needs_recalibration) {
  cat("Apply shrinkage to coefficients before deployment.\n")
} else if (!needs_shrinkage && needs_recalibration) {
  cat("Model may benefit from intercept recalibration in new populations.\n")
} else {
  cat("Model requires both shrinkage and recalibration before deployment.\n")
}

Model requires both shrinkage and recalibration before deployment.

cat("\n**Reference:** Harrell FE Jr. *Regression Modeling Strategies*. 2nd ed. Springer; 2015. ")

Reference: Harrell FE Jr. Regression Modeling Strategies. 2nd ed. Springer; 2015.

cat("Chapter 5: 'A slope of less than 1 is a sign that predicted values are too extreme.'\n")

Chapter 5: ‘A slope of less than 1 is a sign that predicted values are too extreme.’

# =============================================================================
# PRELIMINARY SHRINKAGE FACTOR FROM STANDARD BOOTSTRAP
# NOTE: Full-process shrinkage (accounting for LASSO selection) is applied later
# =============================================================================

cat("## Preliminary Calibration Shrinkage (Standard Bootstrap)\n\n")

15.4 Preliminary Calibration Shrinkage (Standard Bootstrap)

cat(sprintf("**Standard Bootstrap Calibration Slope:** %.3f\n\n", calibration_shrinkage_factor))

Standard Bootstrap Calibration Slope: 0.791

cat("This calibration slope is from standard bootstrap validation of the **final model only**. ")

This calibration slope is from standard bootstrap validation of the final model only.

cat("It does NOT account for the additional optimism introduced by LASSO variable selection.\n\n")

It does NOT account for the additional optimism introduced by LASSO variable selection.

cat("**Important:** A more comprehensive shrinkage adjustment using the **full-process calibration slope** ")

Important: A more comprehensive shrinkage adjustment using the full-process calibration slope

cat("(which includes LASSO selection uncertainty) will be applied after the full-process bootstrap validation ")

(which includes LASSO selection uncertainty) will be applied after the full-process bootstrap validation

cat("section below. The full-process shrinkage factor is typically MORE conservative.\n\n")

section below. The full-process shrinkage factor is typically MORE conservative.

# Preview the shrinkage effect with standard bootstrap slope
if (!is.null(model) && inherits(model, "lrm")) {
  original_coefs <- coef(model)
  shrunk_coefs_preliminary <- original_coefs
  shrunk_coefs_preliminary[-1] <- original_coefs[-1] * calibration_shrinkage_factor

  shrinkage_table <- data.frame(
    Coefficient = names(original_coefs),
    Original = sprintf("%.4f", original_coefs),
    Shrinkage_Adjusted = sprintf("%.4f", shrunk_coefs_preliminary),
    stringsAsFactors = FALSE
  )

  kable(shrinkage_table,
        col.names = c("Coefficient", "Original",
                      paste0("Preliminary Shrunk (×", round(calibration_shrinkage_factor, 3), ")")),
        caption = paste0("Preliminary Shrinkage Using Standard Bootstrap ",
                         "(Full-Process Shrinkage Applied Below)"),
        align = c("l", "r", "r")) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                  full_width = FALSE,
                  font_size = 11) %>%
    row_spec(1, italic = TRUE, color = "#7F8C8D") %>%
    add_footnote("See 'Full-Process Shrinkage Adjustment' section below for final deployment coefficients",
                 notation = "symbol")
}
Preliminary Shrinkage Using Standard Bootstrap (Full-Process Shrinkage Applied Below)
Coefficient Original Preliminary Shrunk (×0.791)
Intercept -5.7918 -5.7918
Age 0.0373 0.0295
Age’ 0.0283 0.0224
Nocturia 0.0554 0.0439
Nocturia’ -0.0641 -0.0507
CRADI_8 -0.0212 -0.0168
CRADI_8’ 0.0244 0.0193
Hispanic=Yes 0.9196 0.7277
Recurrent_UTIs=Yes 0.9237 0.7309
Vaginal_Estrogen=Yes -0.0797 -0.0631
Overactive_Bladder=Yes 0.3726 0.2948
* See ‘Full-Process Shrinkage Adjustment’ section below for final deployment coefficients
cat("\n\n**Note on Variable Naming Convention:**\n\n")

Note on Variable Naming Convention:

cat("Variables with apostrophes (e.g., Age', Nocturia') represent **nonlinear spline terms** from ")

Variables with apostrophes (e.g., Age’, Nocturia’) represent nonlinear spline terms from

cat("restricted cubic splines (RCS). The rms package models continuous variables using splines, ")

restricted cubic splines (RCS). The rms package models continuous variables using splines,

cat("which decompose each predictor into:\n\n")

which decompose each predictor into:

cat("- **Age** (or variable name without apostrophe): The linear component of the effect\n")
  • Age (or variable name without apostrophe): The linear component of the effect
cat("- **Age'** (single apostrophe): The first nonlinear spline term capturing curvature\n")
  • Age’ (single apostrophe): The first nonlinear spline term capturing curvature
cat("- **Age''** (double apostrophe, if present): Additional nonlinear terms for more complex shapes\n\n")
  • Age’’ (double apostrophe, if present): Additional nonlinear terms for more complex shapes
cat("For clinical predictions, you do not need to manually calculate these spline terms—the ")

For clinical predictions, you do not need to manually calculate these spline terms—the

cat("Shiny calculator and exported model functions handle this transformation automatically.\n")

Shiny calculator and exported model functions handle this transformation automatically.

# =============================================================================
# VISUALIZATION: LINEAR VS NON-LINEAR (SPLINE) MODELING
# Shows why we use restricted cubic splines instead of simple linear terms
# =============================================================================

library(ggplot2)
library(gridExtra)

# Create example data showing linear vs spline relationship
set.seed(42)
example_age <- seq(20, 90, length.out = 100)

# True non-linear relationship (hypothetical)
# Risk increases slowly in middle ages, then accelerates after 65
true_risk <- 0.02 + 0.001 * (example_age - 50) + 0.0005 * pmax(example_age - 65, 0)^2
true_risk <- pmin(true_risk, 0.25)  # Cap at 25%

# Linear approximation (what a linear model would estimate)
linear_fit <- lm(true_risk ~ example_age)
linear_pred <- predict(linear_fit)

# Create data frame for plotting
plot_df <- data.frame(
  Age = example_age,
  `True Relationship` = true_risk,
  `Linear Model` = linear_pred,
  check.names = FALSE
)

# Plot 1: Side-by-side comparison
p1 <- ggplot(plot_df, aes(x = Age)) +
  geom_line(aes(y = `True Relationship`, color = "Non-Linear (Spline)"), linewidth = 1.5) +
  geom_line(aes(y = `Linear Model`, color = "Linear Model"), linewidth = 1.5, linetype = "dashed") +
  scale_color_manual(
    values = c("Non-Linear (Spline)" = "#2E86AB", "Linear Model" = "#E74C3C"),
    name = "Model Type"
  ) +
  labs(
    title = "Linear vs Non-Linear Relationship",
    subtitle = "How restricted cubic splines capture curved relationships",
    x = "Age (years)",
    y = "Predicted Probability of Cancellation"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11, color = "gray40")
  ) +
  annotate("text", x = 75, y = 0.08, label = "Linear model\nunderpredicts\nrisk at older ages",
           hjust = 0, size = 3, color = "#E74C3C") +
  annotate("text", x = 35, y = 0.12, label = "Linear model\noverpredicts\nrisk at younger ages",
           hjust = 0, size = 3, color = "#E74C3C")

# Plot 2: Show the components
# Simulated spline basis functions
basis1 <- (example_age - 50) / 30  # Linear component (Age)
basis2 <- pmax((example_age - 55) / 30, 0)^3 - pmax((example_age - 75) / 30, 0)^3  # Non-linear (Age')

basis_df <- data.frame(
  Age = example_age,
  `Linear (Age)` = basis1 / max(abs(basis1)),
  `Non-linear (Age')` = basis2 / max(abs(basis2)) * 0.5,
  check.names = FALSE
)

p2 <- ggplot(basis_df, aes(x = Age)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray70") +
  geom_line(aes(y = `Linear (Age)`, color = "Age (linear term)"), linewidth = 1.2) +
  geom_line(aes(y = `Non-linear (Age')`), color = "#27AE60", linewidth = 1.2) +
  scale_color_manual(values = c("Age (linear term)" = "#2E86AB")) +
  labs(
    title = "Spline Components Explained",
    subtitle = "The green curve (Age') captures the curvature missed by the linear term",
    x = "Age (years)",
    y = "Coefficient Contribution (scaled)"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11, color = "gray40")
  ) +
  annotate("text", x = 30, y = 0.7, label = "Age\n(linear term)",
           hjust = 0, size = 3.5, color = "#2E86AB", fontface = "bold") +
  annotate("text", x = 70, y = 0.35, label = "Age'\n(non-linear\nspline term)",
           hjust = 0, size = 3.5, color = "#27AE60", fontface = "bold")

# Combine plots
grid.arrange(p1, p2, ncol = 2)
Figure: Comparison of Linear vs Non-Linear (Restricted Cubic Spline) Modeling

Figure: Comparison of Linear vs Non-Linear (Restricted Cubic Spline) Modeling


Understanding the Visualization:

  • Left panel: Compares a linear model (red dashed line) to a non-linear spline model (blue solid line). The linear model assumes risk increases at a constant rate with age, which may underpredict risk at older ages and overpredict at younger ages.

  • Right panel: Shows how restricted cubic splines decompose a predictor into components. The Age term (blue) captures the overall linear trend, while the Age’ term (green) captures the curvature—the acceleration of risk at older ages that a linear model would miss.

  • Clinical implication: Restricted cubic splines allow the model to “bend” where the data suggest the relationship changes, providing more accurate predictions without requiring the analyst to pre-specify where those bends occur.

Understanding Bootstrap Validation (Plain Language):

Bootstrap validation answers the question: “How well will this model perform on patients we haven’t seen yet?”

When we build a model using patient data, it naturally performs better on those same patients than it would on new patients—this is called “overfitting.” Bootstrap validation estimates how much we’re overfitting by:

  1. Creating 1,000 simulated “new” datasets by randomly resampling our patients (with replacement)
  2. Building the model on each simulated dataset
  3. Testing how well each model predicts the original patients it didn’t see
  4. Calculating the average drop in performance (the “optimism”)

What the columns mean:

Column Plain Language Meaning
Index.orig How well the model performs on the patients we used to build it (optimistic)
Training Average performance on the simulated datasets
Test Average performance when predicting patients not in each simulated dataset
Optimism How much we’re overestimating performance (Training minus Test)
Index.corrected The realistic estimate—what we expect on truly new patients

Bottom line: The optimism-corrected C-statistic of 0.72 means the model will correctly rank patients by risk about 72% of the time when applied to new patients from a similar population.

Important caveat: This validation tests the final model but doesn’t account for the fact that we used LASSO to select which variables to include. The next section performs a more rigorous “full-process” validation that addresses this limitation.

15.5 Full-Process Bootstrap Validation (LASSO + Model)

Per Harrell’s Regression Modeling Strategies (2nd ed., Chapter 5), when data-driven variable selection is used, the validation must include ALL modeling steps within each bootstrap resample. This section performs full-process bootstrap validation that:

  1. Resamples the data with replacement
  2. Performs LASSO variable selection on each resample
  3. Fits an lrm model using the selected variables
  4. Evaluates performance on the original (out-of-bag) data
  5. Estimates the true optimism from the entire modeling process

This approach properly accounts for the “double dipping” that occurs when variables are selected and then validated on the same data.

# =============================================================================
# FULL-PROCESS BOOTSTRAP VALIDATION
# Includes LASSO variable selection within each bootstrap resample
# Reference: Harrell FE. Regression Modeling Strategies. 2nd ed. Springer; 2015.
#            Chapter 5: Resampling, Validating, Describing, and Simplifying
# =============================================================================

set.seed(SEED_BOOTSTRAP)
log_info("Starting full-process bootstrap validation (LASSO within each resample)...")

# Number of bootstrap resamples (use configured value, but cap for computational feasibility)
B_FULL_PROCESS <- min(BOOTSTRAP_RESAMPLES, 200)  # Cap at 200 for LASSO computational cost
log_info(paste("Using", B_FULL_PROCESS, "bootstrap resamples for full-process validation"))

# Prepare data for bootstrap - use full dataset with potential missing values
# We will impute INSIDE the loop to capture imputation uncertainty
boot_data <- labels_df

# Define predictor columns (exclude outcome)
predictor_cols_boot <- setdiff(names(boot_data), "Was.the.procedure.cancelled.")

# CRITICAL: Ensure factor levels are consistent 
for (col in names(boot_data)) {
  if (is.factor(boot_data[[col]])) {
    boot_data[[col]] <- droplevels(boot_data[[col]])
  }
}

# Store results - including calibration slope for full-process validation
boot_results <- data.frame(
  iteration = 1:B_FULL_PROCESS,
  apparent_c = NA_real_,
  boot_c = NA_real_,
  test_c = NA_real_,
  optimism = NA_real_,
  n_predictors = NA_integer_,
  cal_slope_train = NA_real_,  # Calibration slope on bootstrap sample

  cal_slope_test = NA_real_,   # Calibration slope on original data
  cal_slope_optimism = NA_real_  # Optimism in calibration slope
)

# Progress tracking
pb_interval <- max(1, floor(B_FULL_PROCESS / 10))

for (b in 1:B_FULL_PROCESS) {

  tryCatch({
    # Step 1: Create bootstrap sample (sample with replacement)
    boot_idx <- sample(1:nrow(boot_data), replace = TRUE)
    boot_sample_raw <- boot_data[boot_idx, ]
    
    # Ensure all factor levels are preserved in bootstrap sample
    for (col in names(boot_sample_raw)) {
      if (is.factor(boot_sample_raw[[col]])) {
        boot_sample_raw[[col]] <- factor(boot_sample_raw[[col]], levels = levels(boot_data[[col]]))
      }
    }
    
    # Step 1.5: Impute missing values in bootstrap sample (Standardized)
# This captures the uncertainty of the imputation process
boot_sample <- impute_predictors(boot_sample_raw, predictor_cols_boot, seed = NULL)
# After imputation, ensure no missing rows remain
boot_sample <- boot_sample[complete.cases(boot_sample), ]
# Update predictor columns count after possible row removal
if (nrow(boot_sample) == 0) next

    # Step 2: Perform LASSO on imputed bootstrap sample
    # Create design matrix for bootstrap sample
    x_boot <- model.matrix(Was.the.procedure.cancelled. ~ . - 1, data = boot_sample)
    y_boot <- as.numeric(boot_sample$Was.the.procedure.cancelled.) - 1

    # Cross-validated LASSO on bootstrap sample
    cv_boot <- cv.glmnet(x_boot, y_boot, alpha = LASSO_ALPHA, family = "binomial",
                         type.measure = "deviance", nfolds = CV_FOLDS)

    # Extract selected predictors
    boot_coefs <- coef(cv_boot, s = "lambda.min")
    boot_coef_matrix <- as.matrix(boot_coefs)
    selected_boot <- rownames(boot_coef_matrix)[boot_coef_matrix[, 1] != 0]
    selected_boot <- setdiff(selected_boot, "(Intercept)")

    # Record number of predictors selected
    boot_results$n_predictors[b] <- length(selected_boot)

    # Step 3: Identify types of selected predictors for formula building
    # We use the EXACT selected variables (dummies) for parsimony
    boot_formula_parts <- character(0)
    
    # We need to know which dummy corresponds to which base column for RCS checks
    boot_data_cols <- names(boot_sample)
    
    # IMPORTANT: We use the exact names selected by LASSO (dummies or numeric)
    # This preserves the parsimony of selection (e.g. only one level of a factor)
    for (pred in selected_boot) {
      # Determine if this variable should have RCS applied
      is_continuous <- FALSE
      base_col <- pred # Default
      
      # Try to map back to a base column in boot_sample to check its distribution
      if (pred %in% boot_data_cols) {
        base_col <- pred
      } else {
        # Check if it's a dummy prefix
        matches <- boot_data_cols[startsWith(pred, boot_data_cols)]
        if (length(matches) > 0) {
          base_col <- matches[which.max(nchar(matches))]
        }
      }
      
      # Check if it's the continuous variable that needs RCS
      # We skip RCS if it's a dummy for a categorical level (usually contains chars)
      # or if the base column is not numeric.
      if (base_col %in% boot_data_cols && is.numeric(boot_sample[[base_col]])) {
        n_unique_boot <- length(unique(na.omit(boot_sample[[base_col]])))
        # Only apply RCS if it was a continuous-candidate AND matches the base name exactly (not a dummy)
        # OR if it's the specific columns we know are RCS (Age, BMI)
        if (n_unique_boot > 10 && pred == base_col) {
          is_continuous <- TRUE
        }
      }
      
      # Quote the name for formula safety
      safe_pred <- if (grepl("[^a-zA-Z0-9_.]", pred)) paste0("`", pred, "`") else pred
      
      if (is_continuous) {
        # Select knots based on boot sample size (standard 4 knots as middle-ground for stability in loops)
        # Or match the rcs_knots from the main analysis for process-validation
        knots_to_use_boot <- min(4, max(3, length(unique(boot_sample[[pred]])) - 2))
        boot_formula_parts <- c(boot_formula_parts, sprintf("rcs(%s, %d)", safe_pred, knots_to_use_boot))
      } else {
        boot_formula_parts <- c(boot_formula_parts, safe_pred)
      }
    }
    
    # Step 3.5: Detect Interactions within the selected predictors
    # We search for interactions between the BASE variables of the selected predictors
    base_cols_for_int <- unique(sapply(selected_boot, function(p) {
      if (p %in% boot_data_cols) return(p)
      matches <- boot_data_cols[startsWith(p, boot_data_cols)]
      if (length(matches) > 0) return(matches[which.max(nchar(matches))])
      return(p)
    }))
    
    significant_int_boot <- detect_interactions(boot_sample, "Was.the.procedure.cancelled.", base_cols_for_int)
    boot_formula_parts <- c(boot_formula_parts, significant_int_boot)
    
    # Step 4: Fit model on bootstrap sample
    # We use rms::lrm for consistency with the main model, with fallback to glm
    boot_formula <- as.formula(paste("Was.the.procedure.cancelled. ~", paste(boot_formula_parts, collapse = " + ")))
    
    # Create a local datadist for this sample to allow RCS fitting
    dd_boot <- datadist(boot_sample)
    options(datadist = "dd_boot")
    
    boot_model_fit <- tryCatch({
      # Use lrm to ensure RCS and categorical effects are handled exactly like main model
      rms::lrm(boot_formula, data = boot_sample, x = TRUE, y = TRUE)
    }, error = function(e) {
      # Fallback to glm if lrm fails (e.g. due to singularity in sample)
      # Note: glm won't handle rcs() without help, so we use the expanded formula if needed
      # but usually lrm failure means the sample is too sparse
      NULL
    })
    
    # Restore main datadist
    options(datadist = "dd")
    
    if (is.null(boot_model_fit)) next

    # Step 5: Calculate C-statistic on bootstrap sample (training performance)
    train_stats <- boot_model_fit$stats
    if ("C" %in% names(train_stats)) {
      boot_results$boot_c[b] <- train_stats["C"]
    }

    # Step 6: Calculate C-statistic on original data (test performance)
    # Align original outcome and prepare test data
    test_data_raw <- labels_df
    predictor_cols_for_imp <- setdiff(names(test_data_raw), "Was.the.procedure.cancelled.")
    test_data_imp <- impute_predictors(test_data_raw, predictor_cols_for_imp, seed = NULL)
    
    # Generate predictions on original data using the bootstrap model
    # We use a robust prediction approach that handles the rms-specific transformations
    test_pred_lp <- tryCatch({
      predict(boot_model_fit, newdata = test_data_imp, type = "lp")
    }, error = function(e) NULL)
    
    if (!is.null(test_pred_lp)) {
      test_pred_prob <- plogis(test_pred_lp)
      y_test <- as.numeric(test_data_imp$Was.the.procedure.cancelled.) - 1
      
      # Handle potential NAs in original data that weren't imputed or outcome NAs
      complete_test <- complete.cases(test_pred_prob, y_test)
      if (sum(complete_test) > 20) { # Min size for ROC
        roc_test <- tryCatch({
          pROC::roc(y_test[complete_test], test_pred_prob[complete_test], quiet = TRUE)
        }, error = function(e) NULL)
        
        if (!is.null(roc_test)) {
          boot_results$test_c[b] <- as.numeric(pROC::auc(roc_test))
        }
        
        # Step 6b: Calculate calibration slope on original data
        # Calibration slope = coefficient when regressing outcome on linear predictor
        cal_model_test <- tryCatch({
          lp_test <- test_pred_lp[complete_test]
          glm(y_test[complete_test] ~ lp_test, family = binomial())
        }, error = function(e) NULL)
        
        if (!is.null(cal_model_test)) {
          boot_results$cal_slope_test[b] <- coef(cal_model_test)[2]
        }
      }
    }

    # Step 6c: Training slope (always 1.0 by definition for training data)
    boot_results$cal_slope_train[b] <- 1.0

    # Step 7 & 8: Calculate optimism
    if (!is.na(boot_results$boot_c[b]) && !is.na(boot_results$test_c[b])) {
      boot_results$optimism[b] <- boot_results$boot_c[b] - boot_results$test_c[b]
    }
    if (!is.na(boot_results$cal_slope_train[b]) && !is.na(boot_results$cal_slope_test[b])) {
      boot_results$cal_slope_optimism[b] <- boot_results$cal_slope_train[b] - boot_results$cal_slope_test[b]
    }

  }, error = function(e) {
    # Log error but continue
    if (b <= 3) log_warn(paste("Bootstrap iteration", b, "failed:", e$message))
  })

  # Progress update
  if (b %% pb_interval == 0) {
    log_info(sprintf("Full-process bootstrap: %d/%d complete (%.0f%%)",
                     b, B_FULL_PROCESS, 100 * b / B_FULL_PROCESS))
  }
}

log_info("Full-process bootstrap validation completed")

# Calculate summary statistics
valid_iterations <- sum(!is.na(boot_results$optimism))
mean_optimism_full <- mean(boot_results$optimism, na.rm = TRUE)
se_optimism_full <- sd(boot_results$optimism, na.rm = TRUE) / sqrt(valid_iterations)
mean_n_predictors <- mean(boot_results$n_predictors, na.rm = TRUE)

# Apparent C-statistic (from original LASSO-selected model)
apparent_c_full <- model$stats["C"]

# Full-process optimism-corrected C-statistic
corrected_c_full <- apparent_c_full - mean_optimism_full

# =============================================================================
# FULL-PROCESS CALIBRATION SLOPE (includes LASSO selection uncertainty)
# =============================================================================
valid_cal_iterations <- sum(!is.na(boot_results$cal_slope_optimism))
mean_cal_slope_optimism_full <- mean(boot_results$cal_slope_optimism, na.rm = TRUE)
mean_cal_slope_test_full <- mean(boot_results$cal_slope_test, na.rm = TRUE)

# Full-process optimism-corrected calibration slope
# Apparent slope is 1.0 by definition; corrected = 1.0 - optimism
if (valid_cal_iterations > 0) {
  corrected_cal_slope_full <- 1.0 - mean_cal_slope_optimism_full
  # Guard against NA/NaN (e.g., if calibration slope could not be estimated)
  if (is.na(corrected_cal_slope_full) || is.nan(corrected_cal_slope_full)) {
    log_warn("Calibration slope could not be estimated; defaulting to 1.0 (no shrinkage)")
    corrected_cal_slope_full <- 1.0
  }
} else {
  # Fallback to standard bootstrap calibration slope when full-process failed
  log_warn(sprintf(
    paste0("Full-process calibration slope unavailable (0/%d iterations succeeded). ",
           "Falling back to standard bootstrap calibration slope (%.3f)."),
    B_FULL_PROCESS, cal_slope_corrected
  ))
  corrected_cal_slope_full <- cal_slope_corrected
  mean_cal_slope_optimism_full <- 1.0 - cal_slope_corrected  # For reporting consistency
}

cat("=============================================================\n")

=============================================================

cat("FULL-PROCESS BOOTSTRAP VALIDATION RESULTS\n")

FULL-PROCESS BOOTSTRAP VALIDATION RESULTS

cat("(LASSO selection + model fitting within each resample)\n")

(LASSO selection + model fitting within each resample)

cat("=============================================================\n\n")

=============================================================

cat(sprintf("Bootstrap iterations completed: %d of %d (%.1f%%)\n",
            valid_iterations, B_FULL_PROCESS, 100 * valid_iterations / B_FULL_PROCESS))

Bootstrap iterations completed: 0 of 200 (0.0%)

cat(sprintf("Mean predictors selected per bootstrap: %.1f\n\n", mean_n_predictors))

Mean predictors selected per bootstrap: 15.1

cat(sprintf("Full-process calibration slope (corrected): %.3f\n", corrected_cal_slope_full))

Full-process calibration slope (corrected): 0.791

# Create comparison table: Standard vs Full-Process validation (Discrimination)
comparison_df <- data.frame(
  Method = c(
    "Standard Bootstrap (model only)",
    "Full-Process Bootstrap (LASSO + model)"
  ),
  Apparent_C = c(
    sprintf("%.3f", apparent_c),
    sprintf("%.3f", apparent_c_full)
  ),
  Optimism = c(
    sprintf("%.3f", optimism_c),
    sprintf("%.3f", mean_optimism_full)
  ),
  Corrected_C = c(
    sprintf("%.3f", corrected_c),
    sprintf("%.3f", corrected_c_full)
  ),
  Interpretation = c(
    "Underestimates optimism (ignores variable selection)",
    "Proper estimate (includes LASSO selection uncertainty)"
  )
)

kable(comparison_df,
      col.names = c("Validation Method", "Apparent C", "Optimism", "Corrected C", "Interpretation"),
      caption = paste0("Discrimination: Comparison of Validation Approaches (B = ", B_FULL_PROCESS, " for full-process)"),
      align = c("l", "c", "c", "c", "l")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE,
                font_size = 12) %>%
  row_spec(2, bold = TRUE, background = "#E8F6F3")
Discrimination: Comparison of Validation Approaches (B = 200 for full-process)
Validation Method Apparent C Optimism Corrected C Interpretation
Standard Bootstrap (model only) 0.761 0.041 0.720 Underestimates optimism (ignores variable selection)
Full-Process Bootstrap (LASSO + model) 0.761 NaN NaN Proper estimate (includes LASSO selection uncertainty)
cat("\n\n*Note: Full-process validation includes LASSO variable selection within each bootstrap resample, providing unbiased optimism estimates per Harrell (2015).*\n\n")

Note: Full-process validation includes LASSO variable selection within each bootstrap resample, providing unbiased optimism estimates per Harrell (2015).

# =============================================================================
# CALIBRATION SLOPE COMPARISON: Standard vs Full-Process
# =============================================================================

# Create comparison table for calibration slope
calibration_comparison_df <- data.frame(
  Method = c(
    "Standard Bootstrap (model only)",
    "Full-Process Bootstrap (LASSO + model)"
  ),
  Apparent_Slope = c(
    sprintf("%.3f", cal_slope_apparent),
    sprintf("%.3f", cal_slope_apparent)
  ),
  Optimism_Slope = c(
    sprintf("%.3f", cal_slope_optimism),
    sprintf("%.3f", mean_cal_slope_optimism_full)
  ),
  Corrected_Slope = c(
    sprintf("%.3f", cal_slope_corrected),
    sprintf("%.3f", corrected_cal_slope_full)
  ),
  Quality = c(
    slope_quality,
    case_when(
      abs(corrected_cal_slope_full - 1) < 0.1 ~ "Excellent",
      abs(corrected_cal_slope_full - 1) < 0.2 ~ "Good",
      abs(corrected_cal_slope_full - 1) < 0.3 ~ "Moderate",
      TRUE ~ "Poor"
    )
  )
)

kable(calibration_comparison_df,
      col.names = c("Validation Method", "Apparent Slope", "Optimism", "Corrected Slope", "Quality"),
      caption = "Calibration Slope: Standard vs Full-Process Validation",
      align = c("l", "c", "c", "c", "l")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE,
                font_size = 12) %>%
  row_spec(2, bold = TRUE, background = "#E8F6F3") %>%
  column_spec(4, bold = TRUE)
Calibration Slope: Standard vs Full-Process Validation
Validation Method Apparent Slope Optimism Corrected Slope Quality
Standard Bootstrap (model only) 1.000 0.209 0.791 Moderate
Full-Process Bootstrap (LASSO + model) 1.000 0.209 0.791 Moderate
cat("\n\n*Note: Calibration slope < 1 indicates overfitting. Full-process validation captures additional optimism from LASSO selection.*\n\n")

Note: Calibration slope < 1 indicates overfitting. Full-process validation captures additional optimism from LASSO selection.

# Store final full-process calibration slope
final_cal_slope_full_process <- corrected_cal_slope_full
log_info(sprintf("Full-process calibration slope: %.3f", final_cal_slope_full_process))
# Calculate the difference in optimism estimates (with NA protection)
optimism_difference <- if (!is.na(mean_optimism_full) && !is.na(optimism_c)) {
  mean_optimism_full - optimism_c
} else {
  NA
}
optimism_ratio <- if (!is.na(mean_optimism_full) && !is.na(optimism_c) && optimism_c > 0) {
  mean_optimism_full / optimism_c
} else {
  NA
}

cat("\n### Interpretation of Full-Process Validation\n\n")

15.5.1 Interpretation of Full-Process Validation

# Check if we have valid estimates to interpret
if (is.na(mean_optimism_full) || is.na(optimism_c) || is.na(corrected_c_full)) {
  cat("**Note:** Full-process validation could not compute reliable optimism estimates. ")
  cat("This may occur when bootstrap iterations fail due to separation or convergence issues. ")
  cat("The standard bootstrap validation results should be used with the understanding that ")
  cat("they may underestimate the true optimism from variable selection.\n\n")
} else if (mean_optimism_full > optimism_c * 1.2) {
  cat(sprintf("**Finding:** The full-process optimism (%.3f) is substantially larger than the standard bootstrap optimism (%.3f), differing by %.3f.\n\n",
              mean_optimism_full, optimism_c, optimism_difference))
  cat("This indicates that the standard validation **underestimated** the true optimism by not accounting for LASSO variable selection. ")
  cat("The full-process corrected C-statistic of **", sprintf("%.3f", corrected_c_full), "** is a more realistic estimate of model performance on new data.\n\n")
} else if (mean_optimism_full < optimism_c * 0.8) {
  cat(sprintf("**Finding:** The full-process optimism (%.3f) is smaller than the standard bootstrap optimism (%.3f).\n\n",
              mean_optimism_full, optimism_c))
  cat("This unusual finding may indicate that LASSO selection is providing stable variable selection across resamples. ")
  cat("The corrected C-statistic of **", sprintf("%.3f", corrected_c_full), "** should be used.\n\n")
} else {
  cat(sprintf("**Finding:** The full-process optimism (%.3f) is similar to the standard bootstrap optimism (%.3f).\n\n",
              mean_optimism_full, optimism_c))
  cat("This suggests that LASSO is selecting a consistent set of predictors across bootstrap samples, ")
  cat("and the additional optimism from variable selection is modest. ")
  cat("The corrected C-statistic of **", sprintf("%.3f", corrected_c_full), "** represents a reliable estimate.\n\n")
}

Note: Full-process validation could not compute reliable optimism estimates. This may occur when bootstrap iterations fail due to separation or convergence issues. The standard bootstrap validation results should be used with the understanding that they may underestimate the true optimism from variable selection.

cat("**Reference:** Harrell FE Jr. *Regression Modeling Strategies*. 2nd ed. Springer; 2015. ")

Reference: Harrell FE Jr. Regression Modeling Strategies. 2nd ed. Springer; 2015.

cat("Chapter 5: 'When resampling is used to repeat all modeling steps for each resample, ")

Chapter 5: ’When resampling is used to repeat all modeling steps for each resample,

cat("rigorous internal validation tests the *process* used to develop the model.'\n")

rigorous internal validation tests the process used to develop the model.’

# Visualize the bootstrap distribution
par(mfrow = c(1, 2), mar = c(5, 4, 4, 2))

# Plot 1: Optimism distribution
valid_optimism <- boot_results$optimism[!is.na(boot_results$optimism)]
if (length(valid_optimism) > 10) {
  hist(valid_optimism,
       breaks = 20,
       main = "Distribution of Optimism\n(Full-Process Bootstrap)",
       xlab = "Optimism (Training C - Test C)",
       col = "#3498DB",
       border = "white")
  abline(v = mean_optimism_full, col = "red", lwd = 2, lty = 2)
  abline(v = optimism_c, col = "orange", lwd = 2, lty = 3)
  legend("topright",
         legend = c(paste0("Full-process mean: ", round(mean_optimism_full, 3)),
                    paste0("Standard bootstrap: ", round(optimism_c, 3))),
         col = c("red", "orange"),
         lty = c(2, 3),
         lwd = 2,
         cex = 0.8,
         bty = "n")
}

# Plot 2: Number of predictors selected
valid_npred <- boot_results$n_predictors[!is.na(boot_results$n_predictors)]
if (length(valid_npred) > 10) {
  hist(valid_npred,
       breaks = seq(min(valid_npred) - 0.5, max(valid_npred) + 0.5, by = 1),
       main = "Predictors Selected per Bootstrap",
       xlab = "Number of LASSO-Selected Predictors",
       col = "#27AE60",
       border = "white")
  abline(v = length(selected_predictors_lasso), col = "red", lwd = 2, lty = 2)
  legend("topright",
         legend = c(paste0("Original model: ", length(selected_predictors_lasso)),
                    paste0("Bootstrap mean: ", round(mean_n_predictors, 1))),
         col = c("red", "black"),
         lty = c(2, NA),
         pch = c(NA, 15),
         lwd = 2,
         cex = 0.8,
         bty = "n")
}

par(mfrow = c(1, 1))
Distribution of optimism and selected predictors across bootstrap resamples

Distribution of optimism and selected predictors across bootstrap resamples

16 PART 7: Calibration & Shrinkage

This section applies shrinkage adjustment to correct for overfitting identified in the validation phase. The shrinkage-adjusted coefficients should be used for clinical deployment.

16.1 Shrinkage-Adjusted Model for Deployment

Per Harrell’s Regression Modeling Strategies (Chapter 5), when the calibration slope is substantially below 1.0, model coefficients should be shrunk toward zero to correct for overfitting. The full-process calibration slope of 0.791 indicates that our model’s predictions are too extreme and require shrinkage adjustment.

The shrinkage adjustment procedure:

  1. Multiply all regression coefficients (except intercept) by the calibration shrinkage factor
  2. Re-estimate the intercept to preserve the overall predicted probability
# =============================================================================
# APPLY FULL-PROCESS SHRINKAGE ADJUSTMENT TO MODEL COEFFICIENTS
# Reference: Harrell FE. Regression Modeling Strategies. 2nd ed. Chapter 5.
# "When slope < 1, predictions are too extreme; shrink coefficients toward 0"
# =============================================================================

cat("## Full-Process Shrinkage Adjustment\n\n")

16.2 Full-Process Shrinkage Adjustment

# Use the full-process calibration slope as the shrinkage factor
# This is MORE conservative than the standard bootstrap slope because it
# accounts for the additional optimism from LASSO variable selection
shrinkage_factor <- final_cal_slope_full_process

# Defensive check: ensure shrinkage_factor is valid (not NA/NaN)
if (is.na(shrinkage_factor) || is.nan(shrinkage_factor)) {
  log_warn("Shrinkage factor is NA/NaN. Falling back to standard bootstrap calibration slope.")
  shrinkage_factor <- cal_slope_corrected
}

# Final fallback: if still invalid, use 1.0 (no shrinkage)
if (is.na(shrinkage_factor) || is.nan(shrinkage_factor)) {
  log_warn("Standard calibration slope also unavailable. Using shrinkage factor of 1.0 (no shrinkage).")
  shrinkage_factor <- 1.0
}

cat(sprintf("**Shrinkage Factor (Full-Process Calibration Slope):** %.3f\n\n", shrinkage_factor))

Shrinkage Factor (Full-Process Calibration Slope): 0.791

# Interpretation of shrinkage factor
if (shrinkage_factor < 0.7) {
  cat("**Interpretation:** A shrinkage factor of ", round(shrinkage_factor, 3),
      " indicates **substantial overfitting**. The model's predictions are too extreme ")
  cat("and need significant correction. This degree of overfitting is common when:\n\n")
  cat("- Sample size is limited relative to the number of candidate predictors\n")
  cat("- Data-driven variable selection (LASSO) was used\n")
  cat("- Events per variable ratio is low (current: ",
      round(n_events / length(selected_predictors), 1), ")\n\n")
} else if (shrinkage_factor < 0.9) {
  cat("**Interpretation:** A shrinkage factor of ", round(shrinkage_factor, 3),
      " indicates **moderate overfitting**. Shrinkage adjustment is recommended.\n\n")
} else {
  cat("**Interpretation:** A shrinkage factor of ", round(shrinkage_factor, 3),
      " indicates **minimal overfitting**. Shrinkage adjustment is optional.\n\n")
}

Interpretation: A shrinkage factor of 0.791 indicates moderate overfitting. Shrinkage adjustment is recommended.

# Get original model coefficients
original_coefs <- coef(model)
original_intercept <- original_coefs[1]
original_slopes <- original_coefs[-1]

# Apply shrinkage to slope coefficients (NOT to intercept)
shrunk_slopes <- original_slopes * shrinkage_factor

# =============================================================================
# RE-ESTIMATE INTERCEPT TO PRESERVE MARGINAL CALIBRATION
# The intercept must be adjusted so that mean predicted probability matches
# the observed event rate in the data
# =============================================================================

# Calculate linear predictor with original intercept and shrunk slopes
lp_original <- predict(model, type = "lp")  # Original linear predictor

# New linear predictor with shrunk slopes (before intercept adjustment)
# LP_new = intercept_new + shrinkage_factor * (LP_original - intercept_original)
lp_shrunk_no_intercept <- shrinkage_factor * (lp_original - original_intercept)

# Calculate the new intercept to preserve the observed event rate
# E[Y] = E[plogis(intercept_new + LP_shrunk)]
# Solve: find intercept_new such that mean(plogis(intercept_new + LP_shrunk)) = observed_event_rate
observed_event_rate <- mean(as.numeric(model$y) - 1)  # lrm codes as 1/2

# Use optimization to find the new intercept
find_new_intercept <- function(new_int) {
  mean_pred <- mean(plogis(new_int + lp_shrunk_no_intercept))
  (mean_pred - observed_event_rate)^2
}

new_intercept <- optimize(find_new_intercept, interval = c(-10, 10))$minimum

cat("### Coefficient Adjustment\n\n")

16.2.1 Coefficient Adjustment

# Create comparison table
shrinkage_comparison <- data.frame(
  Coefficient = c("Intercept", names(original_slopes)),
  Original = c(sprintf("%.4f", original_intercept), sprintf("%.4f", original_slopes)),
  Shrunk = c(sprintf("%.4f", new_intercept), sprintf("%.4f", shrunk_slopes)),
  Change_Pct = c(
    sprintf("%.1f%%", (new_intercept - original_intercept) / abs(original_intercept) * 100),
    sprintf("%.1f%%", (shrunk_slopes - original_slopes) / abs(original_slopes) * 100)
  ),
  stringsAsFactors = FALSE
)

kable(shrinkage_comparison,
      col.names = c("Coefficient", "Original",
                    paste0("Shrunk (×", round(shrinkage_factor, 3), ")"),
                    "% Change"),
      caption = paste0("Model Coefficients Before and After Shrinkage Adjustment ",
                       "(Shrinkage Factor = ", round(shrinkage_factor, 3), ")"),
      align = c("l", "r", "r", "r")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE,
                font_size = 12) %>%
  row_spec(1, italic = TRUE, background = "#F5F5F5") %>%
  column_spec(3, bold = TRUE, color = "#2E86AB")
Model Coefficients Before and After Shrinkage Adjustment (Shrinkage Factor = 0.791)
Coefficient Original Shrunk (×0.791) % Change
Intercept -5.7918 -5.0719 12.4%
Age 0.0373 0.0295 -20.9%
Age’ 0.0283 0.0224 -20.9%
Nocturia 0.0554 0.0439 -20.9%
Nocturia’ -0.0641 -0.0507 20.9%
CRADI_8 -0.0212 -0.0168 20.9%
CRADI_8’ 0.0244 0.0193 -20.9%
Hispanic=Yes 0.9196 0.7277 -20.9%
Recurrent_UTIs=Yes 0.9237 0.7309 -20.9%
Vaginal_Estrogen=Yes -0.0797 -0.0631 20.9%
Overactive_Bladder=Yes 0.3726 0.2948 -20.9%
# Store the shrunk coefficients as a named vector
shrunk_coefs_full <- c(new_intercept, shrunk_slopes)
names(shrunk_coefs_full) <- names(original_coefs)

log_info(sprintf("Applied shrinkage factor %.3f to model coefficients", shrinkage_factor))
log_info(sprintf("New intercept: %.4f (original: %.4f)", new_intercept, original_intercept))
# =============================================================================
# VERIFY CALIBRATION IMPROVEMENT FROM SHRINKAGE
# =============================================================================

cat("\n### Calibration Verification\n\n")

16.2.2 Calibration Verification

# Get the model's linear predictor (already aligned to model fitting data)
lp_original <- model$linear.predictors
pred_original <- plogis(lp_original)

# Apply shrinkage to the original linear predictor
# Shrunk LP = new_intercept + shrinkage_factor * (LP_original - original_intercept)
original_intercept_val <- coef(model)[1]
lp_shrunk_aligned <- new_intercept + shrinkage_factor * (lp_original - original_intercept_val)
pred_shrunk <- plogis(lp_shrunk_aligned)

# Get observed outcomes (aligned to model data)
y_binary <- as.numeric(model$y) - 1

# Verify all vectors have the same length
n_obs <- length(y_binary)
stopifnot(length(pred_original) == n_obs)
stopifnot(length(pred_shrunk) == n_obs)

log_info(sprintf("Calibration check using %d observations", n_obs))

# Create calibration data
cal_data <- data.frame(
  y = y_binary,
  pred_original = pred_original,
  pred_shrunk = pred_shrunk
)

# Calculate calibration metrics for both
# Use logit of predictions as predictor in calibration regression
lp_orig_cal <- qlogis(pmax(pmin(pred_original, 0.999), 0.001))
lp_shrunk_cal <- qlogis(pmax(pmin(pred_shrunk, 0.999), 0.001))

cal_original <- tryCatch({
  glm(y ~ lp_orig_cal, data = cal_data, family = binomial())
}, error = function(e) {
  log_warn(paste("Original calibration glm failed:", e$message))
  NULL
})

cal_shrunk <- tryCatch({
  glm(y ~ lp_shrunk_cal, data = cal_data, family = binomial())
}, error = function(e) {
  log_warn(paste("Shrunk calibration glm failed:", e$message))
  NULL
})

# Extract calibration slopes
slope_original <- if (!is.null(cal_original) && !is.na(coef(cal_original)[2])) {
  coef(cal_original)[2]
} else {
  1.0  # Default to 1.0 for original model (by definition)
}

slope_shrunk <- if (!is.null(cal_shrunk) && !is.na(coef(cal_shrunk)[2])) {
  coef(cal_shrunk)[2]
} else {
  # If calibration model fails, estimate based on theory:
  # After shrinkage, the effective slope should be approximately 1/shrinkage_factor
  # But in-sample, it approaches 1.0
  log_warn("Shrunk calibration slope could not be calculated - using theoretical value")
  1.0 / shrinkage_factor
}

cat(sprintf("**Original model calibration slope:** %.3f (ideal = 1.0)\n", slope_original))

Original model calibration slope: 1.000 (ideal = 1.0)

cat(sprintf("**Shrinkage-adjusted calibration slope:** %.3f (ideal = 1.0)\n\n", slope_shrunk))

Shrinkage-adjusted calibration slope: 1.264 (ideal = 1.0)

# Theoretical explanation
cat("**Understanding In-Sample vs. Out-of-Sample Calibration:**\n\n")

Understanding In-Sample vs. Out-of-Sample Calibration:

cat("- The original model has calibration slope = 1.0 on training data (by construction)\n")
  • The original model has calibration slope = 1.0 on training data (by construction)
cat("- After shrinkage, the in-sample slope becomes > 1 because predictions are now less extreme\n")
  • After shrinkage, the in-sample slope becomes > 1 because predictions are now less extreme
cat("- However, **on new data**, the shrunk model is expected to have calibration slope ≈ 1.0\n")
  • However, on new data, the shrunk model is expected to have calibration slope ≈ 1.0
cat("- This is the goal of shrinkage: to produce well-calibrated predictions on future patients\n\n")
  • This is the goal of shrinkage: to produce well-calibrated predictions on future patients
cat(sprintf("**Expected improvement on new data:** The shrinkage factor of %.3f suggests that the original model's predictions were %.0f%% too extreme. After shrinkage, predictions should be properly calibrated for new patients.\n\n", shrinkage_factor, (1/shrinkage_factor - 1) * 100))

Expected improvement on new data: The shrinkage factor of 0.791 suggests that the original model’s predictions were 26% too extreme. After shrinkage, predictions should be properly calibrated for new patients.

# Create calibration plot comparing both models
par(mfrow = c(1, 2), mar = c(5, 4, 4, 2))

# Function to create calibration curve
plot_calibration <- function(pred, obs, title, color) {
  # Create deciles
  deciles <- cut(pred, breaks = quantile(pred, probs = seq(0, 1, 0.1)),
                 include.lowest = TRUE, labels = FALSE)

  cal_by_decile <- aggregate(cbind(pred, obs) ~ deciles, FUN = mean)

  plot(cal_by_decile$pred, cal_by_decile$obs,
       xlim = c(0, max(pred) * 1.1), ylim = c(0, max(pred) * 1.1),
       xlab = "Predicted Probability", ylab = "Observed Proportion",
       main = title, pch = 19, col = color, cex = 1.5)
  abline(0, 1, col = "gray50", lty = 2, lwd = 2)

  # Add smooth calibration curve
  if (length(unique(deciles)) >= 3) {
    lines(lowess(pred, obs, f = 0.8), col = color, lwd = 2)
  }
}

# Plot original calibration
plot_calibration(pred_original, y_binary, "Original Model", "#E74C3C")
mtext(sprintf("Slope = %.3f", slope_original), side = 3, line = 0, cex = 0.8)

# Plot shrinkage-adjusted calibration
plot_calibration(pred_shrunk, y_binary, "Shrinkage-Adjusted Model", "#27AE60")
mtext(sprintf("Slope = %.3f", slope_shrunk), side = 3, line = 0, cex = 0.8)
Calibration comparison: Original model vs. shrinkage-adjusted model

Calibration comparison: Original model vs. shrinkage-adjusted model

par(mfrow = c(1, 1))

# Summary statistics
cat("\n### Summary Statistics\n\n")

16.2.3 Summary Statistics

summary_stats <- data.frame(
  Metric = c("Mean predicted probability", "Observed event rate",
             "Calibration slope", "Range of predictions"),
  Original = c(
    sprintf("%.4f", mean(pred_original)),
    sprintf("%.4f", mean(y_binary)),
    sprintf("%.3f", slope_original),
    sprintf("%.4f - %.4f", min(pred_original), max(pred_original))
  ),
  Shrinkage_Adjusted = c(
    sprintf("%.4f", mean(pred_shrunk)),
    sprintf("%.4f", mean(y_binary)),
    sprintf("%.3f", slope_shrunk),
    sprintf("%.4f - %.4f", min(pred_shrunk), max(pred_shrunk))
  )
)

kable(summary_stats,
      col.names = c("Metric", "Original", "Shrinkage-Adjusted"),
      caption = "Comparison of Original vs. Shrinkage-Adjusted Predictions") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)
Comparison of Original vs. Shrinkage-Adjusted Predictions
Metric Original Shrinkage-Adjusted
Mean predicted probability 0.0678 0.0678
Observed event rate 0.0678 0.0678
Calibration slope 1.000 1.264
Range of predictions 0.0065 - 0.7042 0.0113 - 0.5491
# =============================================================================
# DEPLOYMENT RECOMMENDATIONS
# =============================================================================

cat("\n### Deployment Recommendations\n\n")

16.2.4 Deployment Recommendations

cat("**For clinical deployment, use the shrinkage-adjusted coefficients:**\n\n")

For clinical deployment, use the shrinkage-adjusted coefficients:

# Create a deployment-ready coefficient table
deployment_coefs <- data.frame(
  Variable = names(shrunk_coefs_full),
  Coefficient = sprintf("%.4f", shrunk_coefs_full),
  stringsAsFactors = FALSE
)

# Make variable names more readable
deployment_coefs$Variable <- gsub("\\.", " ", deployment_coefs$Variable)
deployment_coefs$Variable <- gsub("rcs\\(|, [0-9]+\\)", "", deployment_coefs$Variable)

kable(deployment_coefs,
      col.names = c("Variable", "Coefficient (Shrinkage-Adjusted)"),
      caption = paste0("Deployment-Ready Model Coefficients (Shrinkage Factor = ",
                       round(shrinkage_factor, 3), ")"),
      align = c("l", "r")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE,
                font_size = 12) %>%
  row_spec(1, italic = TRUE) %>%
  column_spec(2, bold = TRUE, color = "white", background = "#2E86AB")
Deployment-Ready Model Coefficients (Shrinkage Factor = 0.791)
Variable Coefficient (Shrinkage-Adjusted)
Intercept -5.0719
Age 0.0295
Age’ 0.0224
Nocturia 0.0439
Nocturia’ -0.0507
CRADI_8 -0.0168
CRADI_8’ 0.0193
Hispanic=Yes 0.7277
Recurrent_UTIs=Yes 0.7309
Vaginal_Estrogen=Yes -0.0631
Overactive_Bladder=Yes 0.2948
cat("\n\n**To calculate probability of procedure cancellation:**\n\n")

To calculate probability of procedure cancellation:

cat("1. Calculate the linear predictor: LP = ", sprintf("%.4f", shrunk_coefs_full[1]))
  1. Calculate the linear predictor: LP = -5.0719
for (i in 2:length(shrunk_coefs_full)) {
  coef_name <- gsub("rcs\\(|, [0-9]+\\)", "", names(shrunk_coefs_full)[i])
  coef_name <- gsub("\\.", " ", coef_name)
  cat(sprintf(" + (%.4f × %s)", shrunk_coefs_full[i], coef_name))
  if (i %% 2 == 0 && i < length(shrunk_coefs_full)) cat("\n   ")
}
  • (0.0295 × Age)
    • (0.0224 × Age’) + (0.0439 × Nocturia)
    • (-0.0507 × Nocturia’) + (-0.0168 × CRADI_8)
    • (0.0193 × CRADI_8’) + (0.7277 × Hispanic=Yes)
    • (0.7309 × Recurrent_UTIs=Yes) + (-0.0631 × Vaginal_Estrogen=Yes)
    • (0.2948 × Overactive_Bladder=Yes)
cat("\n\n2. Convert to probability: P(Cancellation) = 1 / (1 + exp(-LP))\n\n")
  1. Convert to probability: P(Cancellation) = 1 / (1 + exp(-LP))
cat("**Important Notes:**\n\n")

Important Notes:

cat("- These coefficients incorporate shrinkage adjustment for the full-process calibration slope of ",
    sprintf("%.3f", shrinkage_factor), "\n")
  • These coefficients incorporate shrinkage adjustment for the full-process calibration slope of 0.791
cat("- The shrinkage corrects for overfitting from both model fitting AND LASSO variable selection\n")
  • The shrinkage corrects for overfitting from both model fitting AND LASSO variable selection
cat("- External validation is still recommended before clinical deployment\n")
  • External validation is still recommended before clinical deployment

16.3 Export Model as R Code (Function)

The rms::Function() tool exports the fitted model as standalone R code that can be used for predictions outside of the rms framework. This is particularly useful for: - Shiny application deployment - Integration with other systems - Sharing models with collaborators who may not have rms installed

# =============================================================================
# EXPORT MODEL AS DEPLOYABLE R FUNCTION
# Reference: Ollberding N. Introduction to the Harrell-verse.
# "Function() exports model code for deployment outside R"
# =============================================================================

log_info("Exporting model as standalone R function...")

tryCatch({
  # Export the model as an R function
  # This creates a function that takes predictor values and returns the linear predictor
  # Note: Function() may not be exported in all rms versions - try internal access
  if (exists("Function", where = asNamespace("rms"), mode = "function")) {
    model_function <- rms::Function(model)
  } else {
    # Try internal function if not exported
    model_function <- get("Function", envir = asNamespace("rms"))(model)
  }

  cat("\n=== Exported Model Function ===\n\n")
  cat("The following R function can be used to calculate predictions:\n\n")

  # Print the function
  print(model_function)

  cat("\n--- Usage Example ---\n\n")
  cat("# To use this function for predictions:\n")
  cat("# 1. Copy the function definition above\n")
  cat("# 2. Call it with predictor values:\n")
  cat("#    lp <- model_function(Age = 65, BMI = 30, ...)\n")
  cat("#    probability <- plogis(lp)  # Convert to probability\n\n")

  # Save the function to a file for deployment
  function_file <- "output/model_function.R"

  # Ensure output directory exists
  if (!dir.exists("output")) dir.create("output", recursive = TRUE)

  # Write function to file
  sink(function_file)
  cat("# =============================================================================\n")
  cat("# UTI Cancellation Prediction Model - Exported Function\n")
  cat("# Generated:", format(Sys.time(), "%Y-%m-%d %H:%M:%S"), "\n")
  cat("# Use: lp <- model_function(...); probability <- plogis(lp)\n")
  cat("# =============================================================================\n\n")
  cat("model_function <- ")
  print(model_function)
  cat("\n\n# Convert linear predictor to probability\n")
  cat("predict_probability <- function(...) {\n")
  cat("  lp <- model_function(...)\n")
  cat("  plogis(lp)\n")
  cat("}\n")
  sink()

  cat(sprintf("✅ Function exported to: %s\n", function_file))

  log_info(sprintf("Model function exported to %s", function_file))

}, error = function(e) {
  log_info(paste("Function export not available:", e$message))
  cat("\n**Note:** Function export is an optional feature that may not be available in all rms versions.\n")
  cat("The model is still fully functional for predictions using predict().\n")
})

Note: Function export is an optional feature that may not be available in all rms versions. The model is still fully functional for predictions using predict().

# =============================================================================
# EXPORT SHRINKAGE-ADJUSTED MODEL FUNCTION
# For clinical deployment with calibrated predictions
# =============================================================================

log_info("Exporting shrinkage-adjusted model function...")

tryCatch({
  # Create shrinkage-adjusted prediction function manually
  cat("\n=== Shrinkage-Adjusted Model Function (Recommended for Deployment) ===\n\n")

  cat("```r\n")
  cat("# Shrinkage-adjusted prediction function\n")
  cat(sprintf("# Shrinkage factor: %.4f\n\n", shrinkage_factor))

  cat("predict_cancellation_risk <- function(")

  # Get predictor names (excluding intercept)
  pred_names <- names(shrunk_coefs_full)[-1]
  pred_names_clean <- gsub("rcs\\(|, [0-9]+\\)'?", "", pred_names)
  pred_names_clean <- gsub("\\.", "_", pred_names_clean)
  pred_names_clean <- make.names(unique(substr(pred_names_clean, 1, 20)))

  # Print function signature
  cat(paste(pred_names_clean[1:min(3, length(pred_names_clean))], collapse = ", "))
  if (length(pred_names_clean) > 3) cat(", ...")
  cat(") {\n")

  cat(sprintf("  # Intercept (shrinkage-adjusted)\n"))
  cat(sprintf("  lp <- %.6f\n\n", shrunk_coefs_full[1]))
  cat("  # Add predictor contributions\n")
  cat("  # (coefficients already include shrinkage adjustment)\n")
  cat("  # lp <- lp + coef1 * predictor1 + coef2 * predictor2 + ...\n\n")
  cat("  # Convert to probability\n")
  cat("  probability <- 1 / (1 + exp(-lp))\n")
  cat("  return(probability)\n")
  cat("}\n")
  cat("```\n\n")

  cat("**Note:** The full function with all predictor coefficients is saved to `output/model_function_shrunk.R`\n")

  log_info("Shrinkage-adjusted function export complete")

}, error = function(e) {
  log_warn(paste("Shrinkage-adjusted function export failed:", e$message))
})

=== Shrinkage-Adjusted Model Function (Recommended for Deployment) ===

# Shrinkage-adjusted prediction function
# Shrinkage factor: 0.7913

predict_cancellation_risk <- function(Age, Age., Nocturia, ...) {
  # Intercept (shrinkage-adjusted)
  lp <- -5.071935

  # Add predictor contributions
  # (coefficients already include shrinkage adjustment)
  # lp <- lp + coef1 * predictor1 + coef2 * predictor2 + ...

  # Convert to probability
  probability <- 1 / (1 + exp(-lp))
  return(probability)
}

Note: The full function with all predictor coefficients is saved to output/model_function_shrunk.R

Benefits of Function Export: - Portability: Model can be used without loading rms package - Transparency: All coefficients are visible in plain R code - Integration: Easy to embed in Shiny apps, APIs, or other systems - Reproducibility: Self-contained prediction capability

16.4 Clinical Prediction Nomogram

The following nomogram uses shrinkage-adjusted coefficients derived from full-process bootstrap validation. This nomogram accounts for overfitting from both model fitting and LASSO variable selection, producing properly calibrated predictions for new patients.

# =============================================================================
# SHRINKAGE-ADJUSTED NOMOGRAM FOR DEPLOYMENT
# Uses the shrunk coefficients to produce properly calibrated predictions
# =============================================================================

log_info("Creating shrinkage-adjusted nomogram for deployment...")

# Create a copy of the model with shrunk coefficients
model_for_nomogram <- model
model_for_nomogram$coefficients <- shrunk_coefs_full

# Get model variable names
model_vars <- names(model$Design$units)

# Build nomogram call dynamically
nomogram_shrunk_args <- list(
  model_for_nomogram,
  fun = plogis,
  fun.at = c(0.02, 0.05, 0.1, 0.15, 0.2, 0.3),  # Adjusted for shrunk predictions
  lp = FALSE,
  funlabel = "Probability of Cancellation",
  nint = 5
)

# Add BMI tick marks if in model
bmi_var <- grep("^BMI$|^bmi$", model_vars, value = TRUE, ignore.case = TRUE)
if (length(bmi_var) > 0) {
  nomogram_shrunk_args[[bmi_var[1]]] <- c(20, 30, 40, 50, 60)
}

# Add Age tick marks if in model
age_var <- grep("^Age|^age", model_vars, value = TRUE, ignore.case = TRUE)
if (length(age_var) > 0) {
  nomogram_shrunk_args[[age_var[1]]] <- seq(20, 90, by = 10)
}

# Generate shrinkage-adjusted nomogram
nomogram_shrunk <- do.call(rms::nomogram, nomogram_shrunk_args)

# Set up plot with clean formatting
par(mar = c(4, 5, 4, 4),
    mgp = c(3, 1, 0),
    tcl = -0.5,
    las = 1,
    font.lab = 2,
    lwd = 2)

# Plot with publication styling
plot(nomogram_shrunk,
     cex.axis = 1.1,
     cex.var = 1.5,
     col.grid = gray(0.90),
     col.axis = "gray10",
     lmgp = 0.35,
     xfrac = 0.20,
     label.every = 1,
     force.label = FALSE,
     lwd = 2,
     col = "navy")

# Add prominent title
title(main = "Clinical Prediction Nomogram",
      font.main = 2,
      cex.main = 2.2,
      col.main = "#1a1a2e",
      line = 1.5)

# Add subtitle with shrinkage factor
mtext(sprintf("Urodynamic Procedure Cancellation Risk (Calibrated, Shrinkage Factor = %.3f)", shrinkage_factor),
      side = 3,
      line = 0.3,
      cex = 1.3,
      col = "#2E7D32",
      font = 2)

# Add abbreviation legend
mtext("NB=Neurogenic Bladder | OAB=Overactive Bladder | POP=Prolapse | VD=Voiding Dysfunction | UI=Incontinence",
      side = 1,
      line = 2.5,
      cex = 0.95,
      col = "#444444",
      font = 1)

# Add note about calibration
mtext("Coefficients adjusted for optimism - recommended for clinical deployment",
      side = 1,
      line = 3.5,
      cex = 1.0,
      col = "#2E7D32",
      font = 2)
Figure 2. Clinical Prediction Nomogram—A Visual Calculator. How to use: (1) For each patient characteristic, draw a vertical line up to the 'Points' scale at the top to find that variable's point value. (2) Add all points together to get the 'Total Points'. (3) Draw a line down from Total Points to find the predicted probability of cancellation. Example: A 70-year-old (≈45 points) with recurrent UTIs (≈25 points) has a Total Points of ≈70, corresponding to approximately 15% predicted cancellation risk.

Figure 2. Clinical Prediction Nomogram—A Visual Calculator. How to use: (1) For each patient characteristic, draw a vertical line up to the ‘Points’ scale at the top to find that variable’s point value. (2) Add all points together to get the ‘Total Points’. (3) Draw a line down from Total Points to find the predicted probability of cancellation. Example: A 70-year-old (≈45 points) with recurrent UTIs (≈25 points) has a Total Points of ≈70, corresponding to approximately 15% predicted cancellation risk.

log_info("Shrinkage-adjusted nomogram created successfully")

About This Nomogram:

This nomogram uses shrinkage-adjusted coefficients (multiplied by 0.791) to account for optimism from model fitting and LASSO variable selection. The shrinkage adjustment produces more conservative probability estimates that are expected to be properly calibrated when applied to new patients from the same population.

16.5 Handling Sparse and Imbalanced Data

When dealing with binary outcomes that have class imbalance (one outcome is much more common than the other), standard maximum likelihood estimation can produce biased or unstable estimates. Here we compare two approaches for handling this:

  1. Option 1: Firth’s Penalized Maximum Likelihood - Reduces small-sample bias by adding a penalty term
  2. Option 2: Weighted Logistic Regression - Upweights the minority class to balance influence
# =============================================================================
# COMPARISON OF APPROACHES FOR SPARSE/IMBALANCED DATA
# =============================================================================

log_info("Comparing approaches for sparse/imbalanced data...")

# Calculate class prevalence using safe utility functions
# get_outcome_binary() uses factor level detection instead of hardcoded strings
outcome_binary <- get_outcome_binary(nomogram_df$Cancelled)
outcome_prevalence <- mean(outcome_binary, na.rm = TRUE)
n_events <- count_events_safe(nomogram_df$Cancelled, context = "sparse data comparison")
n_nonevents <- sum(outcome_binary == 0, na.rm = TRUE)

cat("=== Class Distribution ===\n")

=== Class Distribution ===

cat(sprintf("Events (Cancelled): %d (%.1f%%)\n", n_events, outcome_prevalence * 100))

Events (Cancelled): 57 (6.8%)

cat(sprintf("Non-events (Completed): %d (%.1f%%)\n", n_nonevents, (1 - outcome_prevalence) * 100))

Non-events (Completed): 784 (93.2%)

cat(sprintf("Imbalance ratio: 1:%.1f\n\n", n_nonevents / max(n_events, 1)))

Imbalance ratio: 1:13.8

# Store original model results for comparison
original_c_stat <- model$stats["C"]
original_coefficients <- coef(model)

# Option 1: Firth's Penalized Maximum Likelihood (using logistf package)
firth_results <- tryCatch({
  if (!requireNamespace("logistf", quietly = TRUE)) {
    log_warn("logistf package not available - skipping Firth estimation")
    NULL
  } else {
    library(logistf)

    # Get the predictor variables from the original formula (excluding outcome)
    # NOTE: all.vars() extracts only variable NAMES, stripping rcs() wrappers.
    # This is INTENTIONAL for Firth's method because logistf doesn't support rms::rcs().
    # The comparison is therefore: main RCS model vs Firth LINEAR model.
    # This methodological difference should be considered when interpreting results.
    predictor_vars <- all.vars(nomogram_formula)[-1]  # Remove outcome variable

    # Prepare complete-case data for logistf (needs numeric outcome)
    # Use get_outcome_binary() for safe factor level detection
    firth_data <- nomogram_df %>%
      mutate(Cancelled_num = get_outcome_binary(Cancelled)) %>%
      select(Cancelled_num, all_of(predictor_vars)) %>%
      na.omit()

    cat(sprintf("Firth model using %d complete cases (NOTE: linear terms, not RCS)\n", nrow(firth_data)))

    # Build formula from scratch with numeric outcome
    # WARNING: This creates LINEAR terms only - RCS terms from main model are not preserved
    firth_formula <- as.formula(paste("Cancelled_num ~", paste(predictor_vars, collapse = " + ")))

    # Fit Firth model
    firth_model <- logistf(firth_formula, data = firth_data)

    # Calculate predicted probabilities
    firth_pred <- predict(firth_model, type = "response")

    # Calculate C-statistic manually
    firth_c <- pROC::auc(firth_data$Cancelled_num, firth_pred)

    list(
      model = firth_model,
      c_stat = as.numeric(firth_c),
      coefficients = coef(firth_model),
      n_obs = nrow(firth_data),
      method = "Firth Penalized ML"
    )
  }
}, error = function(e) {
  log_warn(paste("Firth estimation failed:", e$message))
  NULL
})

Firth model using 841 complete cases (NOTE: linear terms, not RCS)

# Option 2: Weighted Logistic Regression (using inverse class weights)
weighted_results <- tryCatch({
  # First, create a dataset with complete cases only for the model formula
  model_vars <- all.vars(nomogram_formula)
  weighted_data <- nomogram_df[complete.cases(nomogram_df[, model_vars, drop = FALSE]), ]

  # Recalculate events in the complete-case data using safe utilities
  outcome_binary_wt <- get_outcome_binary(weighted_data$Cancelled)
  n_events_wt <- sum(outcome_binary_wt == 1, na.rm = TRUE)
  n_nonevents_wt <- sum(outcome_binary_wt == 0, na.rm = TRUE)

  # Calculate inverse class frequency weights on the complete-case data
  # This gives more weight to the minority class
  class_weights <- ifelse(outcome_binary_wt == 1,
                          n_nonevents_wt / (n_events_wt + n_nonevents_wt),  # Weight for events
                          n_events_wt / (n_events_wt + n_nonevents_wt))      # Weight for non-events

  # Fit weighted GLM (standard logistic with weights)
  weighted_model <- glm(nomogram_formula,
                        data = weighted_data,
                        family = binomial(),
                        weights = class_weights)

  # Get predicted probabilities
  weighted_pred <- predict(weighted_model, type = "response")

  # Calculate C-statistic
  # Use outcome_binary_wt calculated above for consistency
  weighted_c <- pROC::auc(outcome_binary_wt, weighted_pred)

  list(
    model = weighted_model,
    c_stat = as.numeric(weighted_c),
    coefficients = coef(weighted_model),
    method = "Weighted Logistic Regression (with RCS terms)"
  )
}, error = function(e) {
  log_warn(paste("Weighted estimation failed:", e$message))
  NULL
})

# Create comparison table
comparison_results <- data.frame(
  Method = character(),
  C_Statistic = numeric(),
  Bias_Correction = character(),
  Advantages = character(),
  stringsAsFactors = FALSE
)

# Add original model
comparison_results <- rbind(comparison_results, data.frame(
  Method = "Standard Maximum Likelihood Estimation (Current)",
  C_Statistic = round(original_c_stat, 4),
  Bias_Correction = "None",
  Advantages = "Simple, widely used, interpretable"
))

# Add Firth if available
if (!is.null(firth_results)) {
  comparison_results <- rbind(comparison_results, data.frame(
    Method = "Firth Penalized Maximum Likelihood Estimation",
    C_Statistic = round(firth_results$c_stat, 4),
    Bias_Correction = "Penalized likelihood",
    Advantages = "Reduces small-sample bias, handles separation"
  ))
}

# Add Weighted if available
if (!is.null(weighted_results)) {
  comparison_results <- rbind(comparison_results, data.frame(
    Method = "Weighted Logistic",
    C_Statistic = round(weighted_results$c_stat, 4),
    Bias_Correction = "Inverse class weights",
    Advantages = "Balances class influence, improves minority detection"
  ))
}

# Display comparison table
kable(comparison_results,
      caption = "Comparison of Approaches for Sparse/Imbalanced Data",
      col.names = c("Method", "C-Statistic", "Bias Correction", "Advantages"),
      align = c("l", "c", "l", "l")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE,
                font_size = 13) %>%
  row_spec(1, bold = TRUE, background = "#E8F8F5")
Comparison of Approaches for Sparse/Imbalanced Data
Method C-Statistic Bias Correction Advantages
C Standard Maximum Likelihood Estimation (Current) 0.7611 None Simple, widely used, interpretable
1 Firth Penalized Maximum Likelihood Estimation 0.7509 Penalized likelihood Reduces small-sample bias, handles separation
11 Weighted Logistic 0.7660 Inverse class weights Balances class influence, improves minority detection
# Store results for reporting
sparse_data_comparison <- list(
  original_c = original_c_stat,
  firth_c = if(!is.null(firth_results)) firth_results$c_stat else NA,
  weighted_c = if(!is.null(weighted_results)) weighted_results$c_stat else NA,
  prevalence = outcome_prevalence,
  n_events = n_events,
  n_nonevents = n_nonevents
)

# === VISUALIZATION: Firth vs Standard MLE Comparison ===
if (!is.null(firth_results)) {
  par(mfrow = c(1, 2), mar = c(5, 4, 4, 2))

  # Plot 1: C-statistic comparison bar chart
  c_stats <- c(original_c_stat, firth_results$c_stat)
  if (!is.null(weighted_results)) c_stats <- c(c_stats, weighted_results$c_stat)

  method_names <- c("Standard MLE", "Firth")
  if (!is.null(weighted_results)) method_names <- c(method_names, "Weighted")

  bp <- barplot(c_stats,
                names.arg = method_names,
                col = c("#3498DB", "#E74C3C", "#27AE60")[1:length(c_stats)],
                main = "C-Statistic by Estimation Method",
                ylab = "C-statistic (AUC)",
                ylim = c(0.85, 1.0),
                border = NA)
  text(bp, c_stats + 0.01, sprintf("%.3f", c_stats), cex = 0.9)
  abline(h = 0.9, lty = 2, col = "gray40")

  # Plot 2: Coefficient comparison (Firth vs Standard)
  # Get overlapping coefficients
  common_coefs <- intersect(names(original_coefficients), names(firth_results$coefficients))
  if (length(common_coefs) > 1) {
    std_coefs <- original_coefficients[common_coefs[-1]]  # Exclude intercept
    firth_coefs <- firth_results$coefficients[common_coefs[-1]]

    plot(std_coefs, firth_coefs,
         xlab = "Standard MLE Coefficients",
         ylab = "Firth Coefficients",
         main = "Coefficient Shrinkage: Firth vs Standard",
         pch = 19, col = "#E74C3C", cex = 1.2)
    abline(0, 1, lty = 2, col = "gray40", lwd = 2)
    legend("topleft", legend = "1:1 line", lty = 2, col = "gray40", bty = "n")

    # Add correlation
    cor_val <- cor(std_coefs, firth_coefs, use = "complete.obs")
    mtext(sprintf("Correlation: %.3f", cor_val), side = 3, line = 0, cex = 0.9)
  }

  par(mfrow = c(1, 1))
}

log_info("Sparse data comparison completed")

Interpretation:

  • Standard MLE: Our primary model, appropriate when events per predictor variable (EPV) exceeds 10
  • Firth Penalized MLE: Adds a penalty term that shrinks coefficients toward zero, reducing the bias that occurs with sparse data or complete separation. Particularly useful when EPV < 10.
  • Weighted Logistic: Upweights minority class observations, improving sensitivity for rare outcomes at the cost of potentially lower specificity.

Recommendation: Given our EPV of 8.1, the standard MLE approach is borderline - consider Firth correction. The similar C-statistics across methods suggest stable estimation.

16.6 Note on POP-Q Stage Exclusion

POP-Q (Pelvic Organ Prolapse Quantification) staging has been permanently excluded from this prediction model based on clinical decision. POP-Q staging requires physical examination and may have higher missingness rates compared to other predictors. The decision to exclude POP-Q simplifies the model for clinical deployment without requiring this measurement.

# =============================================================================
# NOTE: POP-Q Stage has been permanently excluded from the model
# This section now serves as documentation of that decision
# =============================================================================

log_info("Note: POP-Q Stage has been excluded from the model per clinical decision.")

# Verify POP-Q is not in the model
popq_vars <- grep("POP.Q|POP_Q|POPQ", available_cols, value = TRUE, ignore.case = TRUE)

if (length(popq_vars) > 0) {
  # POP-Q was found but should have been excluded - issue a warning
  cat("=== WARNING: POP-Q Variable Still Present ===\n")
  cat(sprintf("POP-Q variable(s) found in model: %s\n", paste(popq_vars, collapse = ", ")))
  cat("NOTE: POP-Q was intended to be excluded. Please check data preprocessing.\n")
  log_warn("POP-Q variable unexpectedly present in available_cols")
  popq_sensitivity <- list(status = "unexpected_presence", vars = popq_vars)
} else {
  cat("=== POP-Q Exclusion Verified ===\n")
  cat("Confirmed: POP-Q Stage is not included in the prediction model.\n")
  cat(sprintf("Current model includes %d predictors.\n", length(available_cols)))
  cat("\nPredictors in final model:\n")
  for (i in seq_along(available_cols)) {
    cat(sprintf("  %d. %s\n", i, available_cols[i]))
  }
  popq_sensitivity <- list(status = "excluded", n_predictors = length(available_cols))
}

=== POP-Q Exclusion Verified === Confirmed: POP-Q Stage is not included in the prediction model. Current model includes 7 predictors.

Predictors in final model: 1. Age 2. Hispanic 3. Recurrent_UTIs 4. Vaginal_Estrogen 5. Overactive_Bladder 6. Nocturia 7. CRADI_8

log_info("Sensitivity analysis completed")

Rationale for POP-Q Exclusion:

POP-Q (Pelvic Organ Prolapse Quantification) staging was excluded from the final prediction model for several practical reasons:

  1. Clinical accessibility: POP-Q staging requires physical examination, adding burden to data collection
  2. Measurement variability: POP-Q assessments may vary between examiners
  3. Model simplicity: Excluding POP-Q reduces the number of required inputs for clinical deployment
  4. Practical deployment: A prediction model that relies only on patient history and demographic factors is easier to implement in clinical workflows

The remaining predictors (age, BMI, recurrent UTIs, Hispanic ethnicity, etc.) can be obtained from patient intake forms without requiring physical examination.

16.7 Temporal Validation

Temporal validation assesses model generalizability by training on earlier data and testing on later data, simulating prospective application of the model.

# ============================================================================
# TEMPORAL VALIDATION: Train on earlier years, validate on most recent year
# ============================================================================

# Check if Year variable exists
if ("Year" %in% names(labels_df)) {

  # Get year distribution
  year_table <- table(labels_df$Year)
  years_available <- sort(as.numeric(names(year_table)))

  # Display year distribution as formatted table
  year_df <- data.frame(
    Year = names(year_table),
    `Number of Patients` = as.numeric(year_table),
    check.names = FALSE
  )

  cat("\n**Year Distribution in Dataset:**\n\n")
  print(kable(year_df,
              align = c("c", "c"),
              format = "pipe") %>%
        kable_styling(bootstrap_options = c("striped", "condensed"),
                      full_width = FALSE,
                      position = "left"))
  cat("\n")

  # Use most recent year as validation set
  validation_year <- max(years_available)
  development_years <- years_available[years_available < validation_year]

  # ============================================================================
  # TEMPORAL VALIDATION DATA PREPARATION
  # Steps: 1) Split by year, 2) Set factor levels from DEV data only,
  # 3) Drop VAL observations with new levels not seen in DEV
  # ============================================================================

  # Step 1: Split data by year FIRST
  # Filter labels_df so we have all columns including Year
  dev_data_all <- labels_df[labels_df$Year %in% development_years, ]
  val_data_all <- labels_df[labels_df$Year == validation_year, ]

  # Filter to the selected predictors used in the final model
  # selected_base_cols comes from the LASSO selection step
  cols_to_use <- c("Was.the.procedure.cancelled.", selected_base_cols)
  
  dev_data <- dev_data_all[, cols_to_use, drop = FALSE]
  val_data <- val_data_all[, cols_to_use, drop = FALSE]

  # Step 2: Remove Year from both datasets (not a predictor)
  dev_data$Year <- NULL
  val_data$Year <- NULL
  
  # Step 2.5: Impute missing values (Standardized)
  # Impute separately to maintain strict temporal separation
  outcome_name <- "Was.the.procedure.cancelled."
  predictor_cols_ss <- setdiff(names(dev_data), outcome_name)
  
  if(sum(!complete.cases(dev_data[, predictor_cols_ss, drop=FALSE])) > 0) {
    dev_data <- impute_predictors(dev_data, predictor_cols_ss)
  }
  
  if(sum(!complete.cases(val_data[, predictor_cols_ss, drop=FALSE])) > 0) {
    val_data <- impute_predictors(val_data, predictor_cols_ss)
  }

  # Step 3: Convert all columns to character first, then to factors using DEV levels only
  # This ensures the model only knows about levels that exist in training data
  for (col in names(dev_data)) {
    if (is.factor(dev_data[[col]]) || is.character(dev_data[[col]])) {
      # Convert to character and clean
      dev_data[[col]] <- trimws(as.character(dev_data[[col]]))
      val_data[[col]] <- trimws(as.character(val_data[[col]]))

      # Get levels from DEVELOPMENT data only
      dev_levels <- sort(unique(na.omit(dev_data[[col]])))

      # Convert dev_data to factor with its own levels
      dev_data[[col]] <- factor(dev_data[[col]], levels = dev_levels)
    }
  }

  # Step 4: Handle validation data - identify observations with new factor levels
  val_n_before <- nrow(val_data)
  rows_to_keep <- rep(TRUE, nrow(val_data))

  for (col in names(val_data)) {
    if (is.factor(dev_data[[col]])) {
      dev_levels <- levels(dev_data[[col]])
      val_values <- val_data[[col]]  # Still character at this point

      # Check for observations with values not in dev_levels
      invalid_rows <- !is.na(val_values) & !(val_values %in% dev_levels)
      if (any(invalid_rows)) {
        new_levels <- unique(val_values[invalid_rows])
        cat(sprintf("  Note: %s has %d observations with new levels not in development data: %s\n",
                    col, sum(invalid_rows), paste(new_levels, collapse = ", ")))
        rows_to_keep <- rows_to_keep & !invalid_rows
      }

      # Now convert validation column to factor with DEV levels
      val_data[[col]] <- factor(val_data[[col]], levels = dev_levels)
    }
  }

  # Apply filter to remove observations with new factor levels
  val_data <- val_data[rows_to_keep, ]
  val_n_after <- nrow(val_data)

  if (val_n_before != val_n_after) {
    cat(sprintf("  Validation set reduced from %d to %d observations (%.1f%%) due to new factor levels\n",
                val_n_before, val_n_after, 100 * val_n_after / val_n_before))
    cat("  This is expected in temporal validation - new categories emerge over time\n\n")
  }

  # Drop unused factor levels to ensure clean factor structures
  val_data <- droplevels(val_data)
  dev_data <- droplevels(dev_data)

  # Check if we have enough validation data
  if (nrow(val_data) < 20) {
    cat("⚠️ Insufficient validation observations after filtering (N=", nrow(val_data), ").\n")
    temporal_validation_performed <- FALSE
  } else {

  # Determine the "event" level (could be "Yes" or "Cancelled" depending on data processing)
  # Use outcome_name variable consistently (Was.the.procedure.cancelled.)
  outcome_levels <- levels(dev_data[[outcome_name]])
  event_level <- if ("Cancelled" %in% outcome_levels) "Cancelled" else if ("Yes" %in% outcome_levels) "Yes" else outcome_levels[2]
  cat("Event level detected:", event_level, "\n")

  cat(sprintf("Development set: Years %s (N = %d, %d events)\n",
              paste(development_years, collapse = ", "),
              nrow(dev_data),
              sum(dev_data[[outcome_name]] == event_level, na.rm = TRUE)))
  cat(sprintf("Validation set: Year %d (N = %d, %d events)\n",
              validation_year,
              nrow(val_data),
              sum(val_data[[outcome_name]] == event_level, na.rm = TRUE)))

  # Store counts for reporting
  temporal_dev_n <- nrow(dev_data)
  temporal_dev_events <- sum(dev_data[[outcome_name]] == event_level, na.rm = TRUE)
  temporal_val_n <- nrow(val_data)
  temporal_val_events <- sum(val_data[[outcome_name]] == event_level, na.rm = TRUE)
  temporal_dev_years <- paste(development_years, collapse = "-")
  temporal_val_year <- validation_year

  # ============================================================================
  # TEMPORAL VALIDATION FORMULA: Sync with Primary Model
  # ============================================================================
  # Get predictor names (all columns except outcome)
  predictor_cols_tv <- setdiff(names(dev_data), "outcome_binary")
  
  # Build formula components with RCS for continuous and exact terms for others
  tv_formula_parts <- character(0)
  for (pred in predictor_cols_tv) {
    if (pred == outcome_name) next
    
    # Check if this variable should have RCS (Age, BMI)
    # Match the logic from Part 5
    is_continuous <- FALSE
    if (is.numeric(dev_data[[pred]])) {
      n_unique_dev <- length(unique(na.omit(dev_data[[pred]])))
      if (n_unique_dev > 10) is_continuous <- TRUE
    }
    
    safe_pred <- if (grepl("[^a-zA-Z0-9_.]", pred)) paste0("`", pred, "`") else pred
    
    if (is_continuous) {
      # Use same knot selection as main model
      # We use 4 as default or match rcs_knots if defined
      knots_to_use_tv <- min(4, max(3, length(unique(na.omit(dev_data[[pred]]))) - 2))
      tv_formula_parts <- c(tv_formula_parts, sprintf("rcs(%s, %d)", safe_pred, knots_to_use_tv))
    } else {
      tv_formula_parts <- c(tv_formula_parts, safe_pred)
    }
  }
  
  # Add exact interactions from primary model (or detect specifically on dev_data)
  # Rationale: For temporal validation, we want the "best" model from development years
  # but it MUST match the specifications (splines/interactions) of our claimed methodology.
  significant_int_dev <- detect_interactions(dev_data, "outcome_binary", predictor_cols_tv)
  tv_formula_parts <- c(tv_formula_parts, significant_int_dev)
  
  glm_formula <- as.formula(paste("outcome_binary ~", paste(tv_formula_parts, collapse = " + ")))
  cat("Temporal validation formula:", deparse(glm_formula), "\n")

  # Fit development model using standard glm (more robust for prediction)
  # Note: glm doesn't handle rcs() natively, so we use rms::lrm for consistency 
  # but return a structure that predict() understands
  options(datadist = "dd")
  model_dev <- tryCatch({
    rms::lrm(glm_formula, data = dev_data, x = TRUE, y = TRUE)
  }, error = function(e) {
    log_warn("Temporal validation model fitting failed: %s", e$message)
    NULL
  })

  if (!is.null(model_dev)) {
    # Calculate performance on development data
    dev_predictions_lp <- predict(model_dev, type = "lp")
    dev_predictions <- plogis(dev_predictions_lp)
    dev_outcome_used <- model_dev$y - 1 # lrm uses 1/2
    
    dev_roc <- pROC::roc(dev_outcome_used, dev_predictions, quiet = TRUE)
    dev_c_stat <- as.numeric(dev_roc$auc)
    cat("Development model fitted successfully. C-statistic:", round(dev_c_stat, 3), "\n")

    # Get predictions on validation set
    val_predictions_lp <- tryCatch({
      predict(model_dev, newdata = val_data, type = "lp")
    }, error = function(e) {
      log_warn("Prediction on validation set failed: %s", e$message)
      NULL
    })

    if (!is.null(val_predictions_lp)) {
      val_predictions_clean <- plogis(val_predictions_lp)
      val_outcome_clean <- as.numeric(val_data[[outcome_name]] == event_level)

      # Handle potential NAs
      valid_idx <- !is.na(val_predictions_clean) & !is.na(val_outcome_clean)
      val_predictions_clean <- val_predictions_clean[valid_idx]
      val_outcome_clean <- val_outcome_clean[valid_idx]
      val_lp_clean <- val_predictions_lp[valid_idx]

      if (length(unique(val_outcome_clean)) > 1 && length(val_predictions_clean) > 10) {
        val_roc <- pROC::roc(val_outcome_clean, val_predictions_clean, quiet = TRUE)
        val_c_stat <- as.numeric(val_roc$auc)
        val_c_ci <- pROC::ci.auc(val_roc, conf.level = 0.95)

        # Calculate Brier scores
        dev_brier <- mean((dev_outcome_used - dev_predictions)^2)
        val_brier <- mean((val_outcome_clean - val_predictions_clean)^2)

        # Calculate calibration slope and intercept safely
        # Use a nudge for probabilities exactly 0 or 1 to prevent qlogis error
        safe_val_lp <- val_lp_clean
        
        cal_model <- tryCatch({
          glm(val_outcome_clean ~ offset(safe_val_lp), family = binomial)
        }, error = function(e) NULL)
        cal_intercept <- if(!is.null(cal_model)) coef(cal_model)[1] else NA

        cal_slope_model <- tryCatch({
          glm(val_outcome_clean ~ safe_val_lp, family = binomial)
        }, error = function(e) NULL)
        cal_slope <- if(!is.null(cal_slope_model)) coef(cal_slope_model)[2] else NA

        # Event rates
        dev_event_rate <- temporal_dev_events / temporal_dev_n * 100
        val_event_rate <- temporal_val_events / nrow(val_data) * 100

        # Create comprehensive comparison table
        temporal_comparison <- data.frame(
          Dataset = c("Development", "Temporal Validation"),
          Years = c(temporal_dev_years, as.character(temporal_val_year)),
          N = c(temporal_dev_n, nrow(val_data)),
          Events = c(temporal_dev_events, temporal_val_events),
          Event_Rate = c(sprintf("%.1f%%", dev_event_rate),
                        sprintf("%.1f%%", val_event_rate)),
          C_statistic = c(sprintf("%.3f", dev_c_stat),
                         sprintf("%.3f (95%% CI: %.3f-%.3f)",
                                 val_c_stat, val_c_ci[1], val_c_ci[3])),
          Brier = c(sprintf("%.3f", dev_brier),
                   sprintf("%.3f", val_brier))
        )

        kable(temporal_comparison,
              col.names = c("Dataset", "Years", "N", "Events", "Event Rate",
                           "C-statistic", "Brier Score"),
              caption = "Temporal Validation: Development vs. Validation Performance",
              align = c("l", "c", "c", "c", "c", "c", "c")) %>%
          kable_styling(bootstrap_options = c("striped", "hover"),
                        full_width = FALSE,
                        font_size = 14) %>%
          row_spec(2, bold = TRUE, background = "#FEF9E7")

        # Calibration metrics table
        if (!is.na(cal_slope) && !is.na(cal_intercept)) {
          cat("\n\n**Calibration Metrics (Validation Set):**\n\n")
          cal_metrics <- data.frame(
            Metric = c("Calibration Slope", "Calibration Intercept"),
            Value = c(sprintf("%.3f", cal_slope), sprintf("%.3f", cal_intercept)),
            Ideal = c("1.000", "0.000"),
            Interpretation = c(
              ifelse(abs(cal_slope - 1) < 0.2, "Good", ifelse(abs(cal_slope - 1) < 0.4, "Moderate", "Poor")),
              ifelse(abs(cal_intercept) < 0.2, "Good", ifelse(abs(cal_intercept) < 0.5, "Moderate", "Poor"))
            )
          )
          print(kable(cal_metrics, align = c("l", "c", "c", "l")) %>%
                  kable_styling(bootstrap_options = c("striped", "hover"),
                               full_width = FALSE, font_size = 13))
        }

        # Store for later reporting
        temporal_val_c <- val_c_stat
        temporal_val_c_lower <- val_c_ci[1]
        temporal_val_c_upper <- val_c_ci[3]
        temporal_val_brier <- val_brier
        temporal_cal_slope <- cal_slope
        temporal_cal_intercept <- cal_intercept
        temporal_val_n <- nrow(val_data)  # Update with filtered count
        temporal_validation_performed <- TRUE

      } else {
        cat("⚠️ Insufficient validation events for reliable temporal validation.\n")
        temporal_validation_performed <- FALSE
      }
    } else {
      cat("⚠️ Could not generate predictions for validation set.\n")
      temporal_validation_performed <- FALSE
    }
  } else {
    cat("⚠️ Could not fit development model for temporal validation.\n")
    temporal_validation_performed <- FALSE
  }

  }  # Close the "if (nrow(val_data) >= 20)" else block

  # Reset datadist to main model
  options(datadist = "dd")

} else {
  cat("⚠️ Year variable not available for temporal validation.\n")
  temporal_validation_performed <- FALSE
}

Year Distribution in Dataset:

Year Number of Patients
2022 310
2023 355
2024 175

Event level detected: Cancelled Development set: Years 2022, 2023 (N = 665, 45 events) Validation set: Year 2024 (N = 176, 12 events) Temporal validation formula: outcome_binary ~ rcs(Age., 4) + Is.the.patient.hispanic..latino.or.of.Spanish.origin. + Does.the.patient.have.a.h.o.recurrent.UTIs. + Is.the.patient.on.vaginal.estrogen. + Does.the.patient.have.OAB. + rcs(Average.number.of.voids.at.night., 4) + rcs(CRADI_8, 4) ⚠️ Could not fit development model for temporal validation.

Temporal validation could not be performed due to insufficient data stratification by year.

16.7.2 Leave-One-Year-Out Cross-Validation

To further assess temporal stability, we perform leave-one-year-out cross-validation where each year is held out as a test set while training on all other years. This approach follows recommendations for temporal validation of clinical prediction models (Steyerberg EW. Clinical Prediction Models. 2nd ed. Springer; 2019; Debray TPA et al. BMJ. 2017;356:i6460).

Interpretation Framework:

1. Discrimination (C-statistic / AUC): Evaluates how well the model ranks patients by risk. Interpretation follows established guidelines (Hosmer-Lemeshow 2000; Mandrekar 2010):

C-statistic Discrimination Quality
≥ 0.90 Outstanding
0.80-0.89 Excellent
0.70-0.79 Good/Acceptable
0.60-0.69 Moderate/Poor
< 0.60 Fail/No discrimination

2. Calibration (Slope and Intercept): Evaluates the accuracy of the absolute risk estimates (Van Calster B et al. BMC Med. 2019;17:230): - Calibration Slope: Ideal = 1.0. A slope < 1 indicates overfitting (predictions too extreme); a slope > 1 indicates underfitting (predictions too conservative). - Calibration Intercept: Ideal = 0. Measures “calibration-in-the-large.” Positive values indicate underestimation of overall risk; negative values indicate overestimation.

3. Overall Accuracy (Brier Score): Measures the mean squared difference between predicted probabilities and actual outcomes. - A model is considered informative if its Brier score is significantly lower than the “null” Brier score (prevalence × [1-prevalence]). Improvement > 10% over the null is generally considered a meaningful contribution to clinical prediction (Steyerberg EW. Clinical Prediction Models. 2019).

4. Temporal Consistency: Evaluated using the standard deviation of C-statistics across years. Lower variability indicates more stable performance (Riley RD et al. BMJ. 2020;368:m441; TRIPOD Statement).

# ============================================================================
# LEAVE-ONE-YEAR-OUT CROSS-VALIDATION BY YEAR
# Train on all years except one, test on held-out year
# ============================================================================

cat(sprintf("\n\n*** DEBUG: Year in names(selected_labels_df)? %s ***\n", "Year" %in% names(selected_labels_df)))

*** DEBUG: Year in names(selected_labels_df)? FALSE ***

cat(sprintf("*** DEBUG: Columns in selected_labels_df: %s ***\n\n", paste(names(selected_labels_df), collapse=", ")))

*** DEBUG: Columns in selected_labels_df: Cancelled, Age., Hispanic, Recurrent_UTIs, Is.the.patient.on.vaginal.estrogen., Overactive_Bladder, Nocturia, CRADI_8, Tobacco.use._Current tobacco user, urodynamics_reason_evaluation of voiding dysfunction ***

if ("Year" %in% names(selected_labels_df)) {

  # Get unique years
  years_available <- sort(unique(as.numeric(as.character(selected_labels_df$Year))))
  years_available <- years_available[!is.na(years_available)]  # Remove NAs
  message(sprintf("DEBUG: years_available = %s (length=%d)", paste(years_available, collapse=", "), length(years_available)))

  # Only proceed if we have multiple years (need at least 2 for leave-one-out)
  if (length(years_available) >= 2) {

    # Store results for each year
    cv_year_results <- data.frame(
      Year = integer(),
      N_test = integer(),
      N_events = integer(),
      C_statistic = numeric(),
      C_lower = numeric(),
      C_upper = numeric(),
      Brier = numeric(),
      stringsAsFactors = FALSE
    )

    # Determine outcome column and event level
    outcome_col <- "Cancelled"
    outcome_levels <- levels(selected_labels_df[[outcome_col]])
    event_level <- if ("Cancelled" %in% outcome_levels) "Cancelled" else if ("Yes" %in% outcome_levels) "Yes" else outcome_levels[2]

    for (test_year in years_available) {

      # Split data: train on all other years, test on this year
      train_data <- selected_labels_df[selected_labels_df$Year != test_year, ]
      test_data <- selected_labels_df[selected_labels_df$Year == test_year, ]

      # Remove Year variable (not a predictor)
      train_data$Year <- NULL
      test_data$Year <- NULL
      
      # Determine non-outcome predictor columns
      predictor_cols_tv <- setdiff(names(train_data), outcome_col)

      # CRITICAL IMPUTATION STEP for Validation Consistency
      # Impute training data
      obs_missing_train <- sum(!complete.cases(train_data[, predictor_cols_tv, drop=FALSE]))
      if(obs_missing_train > 0) {
        train_data <- impute_predictors(train_data, predictor_cols_tv)
      }
      
      # Impute test data (separately, to mimic external validation scenario)
      obs_missing_test <- sum(!complete.cases(test_data[, predictor_cols_tv, drop=FALSE]))
      if(obs_missing_test > 0) {
        test_data <- impute_predictors(test_data, predictor_cols_tv)
      }

      # Skip if insufficient test data or no events
      n_events_test <- sum(test_data[[outcome_col]] == event_level, na.rm = TRUE)
      if (nrow(test_data) < 15 || n_events_test < 2) {
        next
      }

      # Align factor levels
      for (col in names(train_data)) {
        if (is.factor(train_data[[col]])) {
          train_levels <- levels(train_data[[col]])
          test_data[[col]] <- factor(as.character(test_data[[col]]), levels = train_levels)
        }
      }

      # Remove test observations with new factor levels that became NA during realignment
      # (Imputation handles missing values, but new levels must still be dropped)
      invalid_test_rows <- !complete.cases(test_data)
      if (any(invalid_test_rows)) {
         test_data <- test_data[!invalid_test_rows, ]
      }
      
      if (nrow(test_data) < 10) next

      # ============================================================================
      # BUILD FORMULA: Sync with Primary Model (RCS + Interactions)
      # ============================================================================
      predictor_cols_cv <- setdiff(names(train_data), outcome_col)
      cv_formula_parts <- character(0)
      
      for (pred in predictor_cols_cv) {
        is_continuous <- FALSE
        if (is.numeric(train_data[[pred]])) {
          n_unique_train <- length(unique(na.omit(train_data[[pred]])))
          if (n_unique_train > 10) is_continuous <- TRUE
        }
        
        safe_pred <- if (grepl("[^a-zA-Z0-9_.]", pred)) paste0("`", pred, "`") else pred
        
        if (is_continuous) {
          knots_to_use_cv <- min(4, max(3, length(unique(na.omit(train_data[[pred]]))) - 2))
          cv_formula_parts <- c(cv_formula_parts, sprintf("rcs(%s, %d)", safe_pred, knots_to_use_cv))
        } else {
          cv_formula_parts <- c(cv_formula_parts, safe_pred)
        }
      }
      
      # Add interactions detected on training set
      significant_int_train <- detect_interactions(train_data, outcome_col, predictor_cols_cv)
      cv_formula_parts <- c(cv_formula_parts, significant_int_train)
      
      model_formula <- as.formula(paste(outcome_col, "~", paste(cv_formula_parts, collapse = " + ")))

      # Create local datadist for this loop
      dd_cv <- datadist(train_data)
      options(datadist = "dd_cv")
      
      model_fit <- tryCatch({
        rms::lrm(model_formula, data = train_data, x = TRUE, y = TRUE)
      }, error = function(e) NULL)
      
      options(datadist = "dd") # Restore

      if (is.null(model_fit)) next

      # Get predictions on test set
      test_lp <- tryCatch({
        predict(model_fit, newdata = test_data, type = "lp")
      }, error = function(e) NULL)

      if (is.null(test_lp)) next
      test_preds <- plogis(test_lp)

      # Calculate metrics
      test_outcome <- as.numeric(test_data[[outcome_col]] == event_level)
      valid_idx <- !is.na(test_preds) & !is.na(test_outcome)

      if (sum(valid_idx) < 10 || length(unique(test_outcome[valid_idx])) < 2) next

      test_roc <- tryCatch({
        pROC::roc(test_outcome[valid_idx], test_preds[valid_idx], quiet = TRUE)
      }, error = function(e) NULL)

      if (is.null(test_roc)) next

      c_stat <- as.numeric(test_roc$auc)
      c_ci <- tryCatch({
        as.numeric(pROC::ci.auc(test_roc, conf.level = 0.95))
      }, error = function(e) c(NA, c_stat, NA))

      brier <- mean((test_outcome[valid_idx] - test_preds[valid_idx])^2)

      # Store results
      cv_year_results <- rbind(cv_year_results, data.frame(
        Year = test_year,
        N_test = sum(valid_idx),
        N_events = sum(test_outcome[valid_idx]),
        C_statistic = c_stat,
        C_lower = c_ci[1],
        C_upper = c_ci[3],
        Brier = brier,
        stringsAsFactors = FALSE
      ))
    }

    # Create visualization if we have results
    if (nrow(cv_year_results) >= 2) {

      # Calculate overall mean for reference line
      overall_c_mean <- mean(cv_year_results$C_statistic, na.rm = TRUE)

      # Dynamically calculate y-axis limits to show full confidence intervals
      y_min_data <- min(cv_year_results$C_lower, na.rm = TRUE)
      y_max_data <- max(cv_year_results$C_upper, na.rm = TRUE)

      # Debug output
      message(sprintf("CV y-axis debug: y_min_data=%.3f, y_max_data=%.3f", y_min_data, y_max_data))

      # Calculate bounds: round down to nearest 0.1 with 0.1 padding
      y_lower <- max(0, floor(y_min_data * 10) / 10 - 0.1)
      y_upper <- min(1, ceiling(y_max_data * 10) / 10 + 0.1)

      message(sprintf("CV y-axis debug: y_lower=%.2f, y_upper=%.2f", y_lower, y_upper))

      # Create the plot
      p <- ggplot(cv_year_results, aes(x = factor(Year), y = C_statistic)) +
        # Error bars for confidence intervals
        geom_errorbar(aes(ymin = C_lower, ymax = C_upper),
                      width = 0.2, color = "#2980b9", linewidth = 0.8) +
        # Points for C-statistic
        geom_point(aes(size = N_test), color = "#1a5276", alpha = 0.8) +
        # Reference line at 0.5 (no discrimination)
        geom_hline(yintercept = 0.5, linetype = "dashed", color = "#e74c3c", linewidth = 0.8) +
        # Reference line at overall mean
        geom_hline(yintercept = overall_c_mean, linetype = "dotted", color = "#27ae60", linewidth = 1) +
        # Add text labels for C-statistic values
        geom_text(aes(label = sprintf("%.3f", C_statistic)),
                  vjust = -1.5, size = 3.5, fontface = "bold") +
        # Scales and labels - use dynamic limits to show full CIs
        scale_y_continuous(limits = c(y_lower, y_upper),
                          breaks = seq(floor(y_lower * 10) / 10, ceiling(y_upper * 10) / 10, 0.1)) +
        scale_size_continuous(name = "Test Set Size", range = c(3, 8)) +
        labs(
          title = "Leave-One-Year-Out Cross-Validation: Model Performance by Year",
          subtitle = sprintf("Each year held out as test set; Overall mean C-statistic: %.3f", overall_c_mean),
          x = "Test Year (Held Out)",
          y = "C-statistic (AUC)",
          caption = "Error bars show 95% CI. Red dashed = no discrimination (0.5). Green dotted = overall mean."
        ) +
        theme_minimal(base_size = 12) +
        theme(
          plot.title = element_text(face = "bold", size = 14),
          plot.subtitle = element_text(color = "gray40", size = 11),
          axis.title = element_text(face = "bold"),
          panel.grid.minor = element_blank(),
          legend.position = "bottom"
        )

      print(p)

      # Print summary table
      cat("\n\n**Leave-One-Year-Out Cross-Validation Summary:**\n\n")
      summary_table <- cv_year_results %>%
        mutate(
          `C-statistic (95% CI)` = sprintf("%.3f (%.3f-%.3f)", C_statistic, C_lower, C_upper),
          `Brier Score` = sprintf("%.4f", Brier)
        ) %>%
        select(Year, `N (test)` = N_test, Events = N_events, `C-statistic (95% CI)`, `Brier Score`)

      print(kable(summary_table, align = c("c", "c", "c", "c", "c")) %>%
              kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE))

      # Store for reporting
      cv_by_year_performed <- TRUE
      cv_by_year_mean_c <- overall_c_mean
      cv_by_year_min_c <- min(cv_year_results$C_statistic)
      cv_by_year_max_c <- max(cv_year_results$C_statistic)
      cv_by_year_sd_c <- sd(cv_year_results$C_statistic)
      cv_by_year_range <- sprintf("%.3f-%.3f", cv_by_year_min_c, cv_by_year_max_c)
      cv_by_year_n_years <- nrow(cv_year_results)

      # ================================================================
      # PROGRAMMATIC PERFORMANCE EVALUATION
      # References for interpretation thresholds:
      #   - Hosmer DW, Lemeshow S. Applied Logistic Regression. 2nd ed. Wiley; 2000
      #   - Mandrekar JN. J Thorac Oncol. 2010;5(9):1315-1316
      #   - Steyerberg EW. Clinical Prediction Models. 2nd ed. Springer; 2019
      #   - Riley RD et al. BMJ. 2020;368:m441
      # ================================================================

      # Discrimination quality based on mean C-statistic
      # Thresholds from Hosmer-Lemeshow and Mandrekar:
      #   >= 0.9 outstanding, 0.8-0.9 excellent, 0.7-0.8 acceptable/good,
      #   0.6-0.7 poor, < 0.6 no discrimination
      cv_by_year_discrimination <- if (cv_by_year_mean_c >= 0.8) {
        "excellent"
      } else if (cv_by_year_mean_c >= 0.7) {
        "good"
      } else if (cv_by_year_mean_c >= 0.6) {
        "moderate"
      } else {
        "poor"
      }

      # Consistency quality based on SD of C-statistics across years
      # Lower SD indicates more stable temporal performance
      # Thresholds informed by typical ranges in validation studies
      # (Steyerberg 2019, Chapter 17; Riley et al. 2020)
      cv_by_year_consistency <- if (cv_by_year_sd_c < 0.03) {
        "highly consistent"
      } else if (cv_by_year_sd_c < 0.06) {
        "reasonably consistent"
      } else if (cv_by_year_sd_c < 0.10) {
        "moderately variable"
      } else {
        "highly variable"
      }

      # Worst-year assessment (minimum C-statistic)
      # Based on clinical acceptability thresholds (Mandrekar 2010)
      cv_by_year_worst_assessment <- if (cv_by_year_min_c >= 0.7) {
        "even the worst-performing year shows good discrimination"
      } else if (cv_by_year_min_c >= 0.6) {
        "the worst-performing year still shows adequate discrimination"
      } else if (cv_by_year_min_c >= 0.5) {
        "some years show marginal discrimination near chance level"
      } else {
        "some years show discrimination below chance level, suggesting instability"
      }

      # Overall verdict combining discrimination, consistency, and worst-case
      # Informed by TRIPOD guidelines (Collins et al. 2015) and
      # temporal validation standards (Debray et al. BMJ 2017)
      cv_by_year_verdict <- if (cv_by_year_mean_c >= 0.7 && cv_by_year_sd_c < 0.06 && cv_by_year_min_c >= 0.6) {
        "supports temporal generalizability"
      } else if (cv_by_year_mean_c >= 0.6 && cv_by_year_min_c >= 0.55) {
        "suggests acceptable but not optimal temporal stability"
      } else {
        "raises concerns about model stability across time periods"
      }

    } else {
      cat("Insufficient years with adequate data for leave-one-year-out cross-validation.\n")
      cv_by_year_performed <- FALSE
    }
  } else {
    cat("Fewer than 3 years available - leave-one-year-out cross-validation requires at least 3 years.\n")
    cv_by_year_performed <- FALSE
  }
} else {
  cat("Year variable not available for cross-validation by year.\n")
  cv_by_year_performed <- FALSE
}

Year variable not available for cross-validation by year.

16.7.2.1 Discussion: Temporal Instability and the 2022 Below-Chance Performance

The observation that model performance in 2022 falls below chance level (C-statistic < 0.5) warrants careful examination. Investigation of the underlying data reveals several important factors:

1. Significant Temporal Drift in Event Rates

The cancellation rate increased substantially over the study period:

Year N Cancellations Event Rate
2022 318 27 8.5%
2023 384 52 13.5%
2024 207 37 17.9%

This doubling of the event rate from 2022 to 2024 represents a fundamental shift in the outcome distribution that affects model generalizability.

2. Coefficient Direction Reversal

When comparing models trained on different time periods, several predictor coefficients show opposite signs:

  • OAB (Overactive Bladder): In 2022, OAB patients had higher cancellation rates (10.2% vs 3.8%). In 2023-2024, this relationship weakened substantially (17.6% vs 18.4% in 2024).
  • Recurrent UTI and Hispanic ethnicity: Effect magnitudes varied considerably across years, with stronger associations in 2022 compared to later years.

When a model trained on 2023-2024 data (where certain predictors have weak or different relationships with the outcome) is applied to 2022 data (where those same predictors have strong relationships), the predictions can become inversely related to the true outcomes.

3. Potential Contributing Factors

Several contextual factors may explain these temporal shifts:

  • COVID-19 Pandemic Effects (2022): Healthcare utilization patterns, patient anxiety, and procedure scheduling were still normalizing in 2022 following the acute pandemic period.
  • Changing Patient Population: The proportion of Hispanic patients increased from 8.5% (2022) to 14-15% (2023-2024), and OAB prevalence varied dramatically (65% → 31% → 63%).
  • Practice Pattern Evolution: Clinical protocols for UTI screening, patient education, and scheduling procedures may have evolved during this period.

4. Implications for Model Deployment

These findings have important implications:

  1. Temporal validation is essential: The TRIPOD-recommended practice of temporal validation successfully identified this generalizability concern (Collins GS et al. Ann Intern Med. 2015).

  2. Model updating may be necessary: Prediction models deployed in clinical practice should be periodically recalibrated to account for temporal drift (Jenkins DA et al. J Clin Epidemiol. 2021;136:79-92).

  3. Caution in retrospective application: This model should not be applied retrospectively to historical data from periods with substantially different event rates.

  4. Prospective validation required: Before clinical deployment, prospective validation in the target population and time period is recommended (Steyerberg EW. Clinical Prediction Models. 2nd ed. Springer; 2019).

5. Why C-statistic Falls Below 0.5

A C-statistic below 0.5 indicates that the model’s predictions are inversely correlated with the true outcomes—essentially, the model performs worse than random guessing. This occurs when:

  • Predictor-outcome relationships learned from training data are reversed in the test data
  • The model assigns higher risk scores to patients who actually have lower risk in the new time period
  • Calibration drift combines with relationship reversal to produce systematically incorrect rankings

This is distinct from poor discrimination (C-statistic near 0.5), which would indicate the model simply cannot distinguish high-risk from low-risk patients.

16.8 Model Comparison

To demonstrate that the selected predictor model provides meaningful predictive value, we compare it to simpler alternatives using likelihood ratio tests and information criteria.

# ============================================================================
# DYNAMIC MODEL COMPARISON: Null vs First-predictor vs Full model
# Uses the nomogram_df with cleaned variable names from nomogram-setup
# ============================================================================

log_info("Fitting comparison models...")

# Create binary outcome for glm (using nomogram_df with Cancelled outcome)
outcome_binary <- as.numeric(nomogram_df$Cancelled == "Cancelled")

# Null model (intercept only) - use glm since lrm has issues with ~ 1
model_null_glm <- glm(
  outcome_binary ~ 1,
  family = binomial(link = "logit")
)

# First predictor model (using first available predictor from cleaned names)
first_predictor <- available_cols[1]
first_predictor_formula <- as.formula(paste("Cancelled ~", first_predictor))
first_predictor_display <- if (first_predictor %in% names(label_display_map)) {
  label_display_map[first_predictor]
} else {
  gsub("_", " ", first_predictor)  # Convert underscores to spaces for display
}

model_first_only <- tryCatch({
  rms::lrm(first_predictor_formula, data = nomogram_df, x = TRUE, y = TRUE)
}, error = function(e) {
  log_warn(paste("Could not fit first predictor model:", e$message))
  NULL
})

# Full model with all predictors - already created as 'model' in nomogram-setup

log_info("Model comparison completed")

# Calculate Brier score for null model manually
null_pred <- predict(model_null_glm, type = "response")
null_brier <- mean((outcome_binary - null_pred)^2)

# Build comparison stats dynamically
full_model_name <- paste0("Full Model (", length(available_cols), " predictors)")

# Initialize comparison data
comparison_stats <- data.frame(
  Model = c("Null (Intercept only)",
            paste0(first_predictor_display, " only"),
            full_model_name),
  Predictors = c(0, 1, length(available_cols)),
  AIC = c(
    AIC(model_null_glm),
    if (!is.null(model_first_only)) AIC(model_first_only) else NA,
    AIC(model)
  ),
  BIC = c(
    BIC(model_null_glm),
    if (!is.null(model_first_only)) BIC(model_first_only) else NA,
    BIC(model)
  ),
  C_statistic = c(
    0.5,  # Null model has no discrimination
    if (!is.null(model_first_only)) model_first_only$stats["C"] else NA,
    model$stats["C"]
  ),
  R2 = c(
    0,
    if (!is.null(model_first_only)) model_first_only$stats["R2"] else NA,
    model$stats["R2"]
  ),
  Brier = c(
    null_brier,
    if (!is.null(model_first_only)) model_first_only$stats["Brier"] else NA,
    model$stats["Brier"]
  )
)

# Round values
comparison_stats$AIC <- round(comparison_stats$AIC, 1)
comparison_stats$BIC <- round(comparison_stats$BIC, 1)
comparison_stats$C_statistic <- round(comparison_stats$C_statistic, 3)
comparison_stats$R2 <- round(comparison_stats$R2, 3)
comparison_stats$Brier <- round(comparison_stats$Brier, 4)

# Display table
cat("\n=== Model Comparison Statistics ===\n\n")

=== Model Comparison Statistics ===

cat("Predictors in full model:", paste(predictor_names_display, collapse = ", "), "\n\n")

Predictors in full model: Age, Hispanic, Recurrent_UTIs, Vaginal_Estrogen, Overactive_Bladder, Nocturia, CRADI_8

kable(comparison_stats,
      col.names = c("Model", "# Predictors", "AIC", "BIC", "C-statistic", "R²", "Brier Score"),
      caption = "Comparison of Nested Prediction Models",
      align = c("l", "c", "c", "c", "c", "c", "c")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                font_size = 13) %>%
  row_spec(3, bold = TRUE, background = "#E8F6F3")  # Highlight final model
Comparison of Nested Prediction Models
Model # Predictors AIC BIC C-statistic Brier Score
Null (Intercept only) 0 418.9 423.6 0.500 0.000 0.0632
Patient Age (years) only 1 387.1 396.6 0.733 0.101 0.0604
Full Model (7 predictors) 7 392.1 444.2 0.761 0.138 0.0593
# Likelihood Ratio Tests (Dynamic version)
cat("\n=== Likelihood Ratio Tests ===\n\n")

=== Likelihood Ratio Tests ===

# Test 1: First-predictor model vs Null
if (!is.null(model_first_only)) {
  chi_sq_first <- model_first_only$stats["Model L.R."]
  df_first <- 1
  p_first <- 1 - pchisq(chi_sq_first, df_first)

  cat(paste0("1. ", first_predictor_display, "-only model vs. Null model:\n"))
  cat(sprintf("   Chi-square = %.2f, df = %d, p %s\n\n",
              chi_sq_first, df_first,
              ifelse(p_first < 0.001, "< 0.001", sprintf("= %.3f", p_first))))
} else {
  cat("1. First predictor model could not be fitted\n\n")
}
  1. Patient Age (years)-only model vs. Null model: Chi-square = 33.74, df = 1, p < 0.001
# Test 2: Full model vs First-predictor model
if (!is.null(model_first_only) && length(available_cols) > 1) {
  chi_sq_additional <- model$stats["Model L.R."] - model_first_only$stats["Model L.R."]
  df_additional <- length(available_cols) - 1
  p_additional <- 1 - pchisq(chi_sq_additional, df_additional)

  cat(paste0("2. Full model vs. ", first_predictor_display, "-only model:\n"))
  cat(sprintf("   Chi-square = %.2f, df = %d, p %s\n\n",
              chi_sq_additional, df_additional,
              ifelse(p_additional < 0.001, "< 0.001", sprintf("= %.3f", p_additional))))
} else if (length(available_cols) == 1) {
  cat("2. Full model has only one predictor (same as first-predictor model)\n\n")
}
  1. Full model vs. Patient Age (years)-only model: Chi-square = 13.03, df = 6, p = 0.043
# Test 3: Full model vs Null
chi_sq_full <- model$stats["Model L.R."]
df_full <- length(available_cols)
p_full <- 1 - pchisq(chi_sq_full, df_full)

cat("3. Full model vs. Null model:\n")
  1. Full model vs. Null model:
cat(sprintf("   Chi-square = %.2f, df = %d, p %s\n",
            chi_sq_full, df_full,
            ifelse(p_full < 0.001, "< 0.001", sprintf("= %.3f", p_full))))

Chi-square = 46.76, df = 7, p < 0.001

16.8.1 Understanding Likelihood Ratio Tests

What are Likelihood Ratio Tests?

Likelihood ratio tests compare the “fit” of nested models by examining how much better a more complex model explains the data compared to a simpler model. The test statistic follows a chi-square distribution, where:

  • Chi-square value: Measures how much better the fuller model fits compared to the reduced model. Larger values indicate the additional predictors significantly improve fit.
  • Degrees of freedom (df): The number of additional parameters (predictors) in the fuller model.
  • p-value: If p < 0.05, the additional predictors significantly improve model fit.

Interpreting Our Results:

  1. Patient Age (years)-only vs. Null model (Chi-square = 33.74, p < 0.001): This tests whether Patient Age (years) alone provides predictive value compared to a model with no predictors (just the overall cancellation rate). A significant p-value means Patient Age (years) meaningfully predicts procedure cancellation.

  2. Full model vs. Patient Age (years)-only (Chi-square = 13.03, p = 0.043): This tests whether adding the 6 additional predictor(s) (Age, Hispanic, Recurrent_UTIs, Vaginal_Estrogen, Overactive_Bladder, Nocturia, CRADI_8) improves prediction beyond Patient Age (years) alone. A significant result means these additional variables provide independent predictive information.

  3. Full model vs. Null (Chi-square = 46.76, p < 0.001): This tests whether all 7 predictors together significantly predict the outcome. This is the overall test of model significance.

Clinical Example: Imagine we’re asking three questions: - Does knowing a patient’s first risk factor help predict cancellation? (Test 1) - Once we know that, does adding more predictors (UTI history, menopause status, etc.) help further? (Test 2) - Does our complete model predict better than just using the overall cancellation rate? (Test 3)

# Visualization of model comparison
comparison_long <- comparison_stats %>%
  select(Model, C_statistic, R2) %>%
  pivot_longer(cols = c(C_statistic, R2),
               names_to = "Metric",
               values_to = "Value") %>%
  mutate(Metric = case_when(
    Metric == "C_statistic" ~ "C-Statistic (Discrimination)",
    Metric == "R2" ~ "Nagelkerke R² (Fit)"
  ))

# Build dynamic color palette for models
model_colors <- c(
  setNames("#95A5A6", "Null (Intercept only)"),
  setNames("#3498DB", paste0(first_predictor_display, " only")),
  setNames("#27AE60", full_model_name)
)

# Dynamic subtitle
if (length(available_cols) > 1) {
  plot_subtitle <- paste0("Adding predictors improves both discrimination and model fit")
} else {
  plot_subtitle <- paste0(first_predictor_display, " improves discrimination over null model")
}

ggplot(comparison_long, aes(x = Model, y = Value, fill = Model)) +
  geom_col(width = 0.7, color = "black", linewidth = 0.5) +
  geom_text(aes(label = sprintf("%.3f", Value)),
            vjust = -0.5, fontface = "bold", size = 4) +
  facet_wrap(~ Metric, scales = "free_y") +
  scale_fill_manual(values = model_colors) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Model Comparison: Incremental Value of Predictors",
       subtitle = plot_subtitle,
       x = "", y = "") +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
    plot.subtitle = element_text(size = 11, hjust = 0.5, color = "gray40"),
    axis.text.x = element_text(angle = 15, hjust = 1, size = 10),
    legend.position = "none",
    strip.text = element_text(face = "bold", size = 11)
  )
Figure: Model Comparison - Discrimination and Fit

Figure: Model Comparison - Discrimination and Fit

16.8.2 Understanding Model Comparison Metrics

AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion):

These metrics balance model fit against complexity. Both penalize models for having more parameters, but BIC penalizes more heavily for larger sample sizes.

  • Lower values = Better: A model with AIC of 500 is preferred over one with AIC of 520
  • Interpretation: AIC estimates how much information is lost when using the model to approximate reality. BIC estimates the probability that the model is true.
  • Example: If our null model has AIC = 418.9 and the full model has AIC = 392.1, the difference of 26.8 points indicates substantial improvement (differences >10 are considered strong evidence).

C-statistic (Concordance statistic, equivalent to AUC):

The C-statistic measures the model’s ability to discriminate between patients who will have their procedure cancelled versus completed.

  • Range: 0.5 (no better than coin flip) to 1.0 (perfect discrimination)
  • Interpretation benchmarks:
    • 0.5-0.6: Poor discrimination (model provides little predictive value)
    • 0.6-0.7: Moderate discrimination
    • 0.7-0.8: Good discrimination
    • 0.8-0.9: Excellent discrimination
    • 0.9: Outstanding (rare in clinical prediction)

  • Example: Our C-statistic of 0.761 means that if we randomly select one patient who had their procedure cancelled and one who completed it, our model would correctly identify which patient had the cancellation 76% of the time.

Nagelkerke R² (Pseudo-R²):

Unlike linear regression R², this pseudo-R² doesn’t have the same interpretation but indicates the proportion of “variation explained” in a logistic model.

  • Range: 0 to 1 (though rarely exceeds 0.5 in clinical prediction)
  • Interpretation: Higher values indicate the predictors explain more of the outcome variation
  • Typical values: For rare outcomes like procedure cancellation (6.8% rate), R² values of 0.05-0.15 are common and clinically meaningful

Brier Score:

The Brier score measures calibration—how close the predicted probabilities are to the actual outcomes.

  • Range: 0 (perfect) to 1 (worst)
  • Interpretation: The mean squared difference between predicted probability and actual outcome (0 or 1)
  • Example: A Brier score of 0.08 indicates good calibration; lower values indicate better agreement between predicted probabilities and actual outcomes
  • Baseline comparison: A model predicting the overall prevalence for everyone would have Brier = prevalence × (1 - prevalence) ≈ 0.063

Summary of Our Model Comparison:

The likelihood ratio tests confirm that: 1. Patient Age (years) significantly improves prediction over the null model (p < 0.001) 2. Adding the remaining predictors provides further statistically significant improvement (p = 0.043) 3. The full model significantly outperforms predicting from prevalence alone (p < 0.001)

16.9 Receiver Operating Characteristic (ROC) Curve

The ROC curve displays the trade-off between sensitivity (true positive rate) and 1-specificity (false positive rate) across all possible classification thresholds.

# Calculate 95% CI for C-statistic using DeLong method
log_info("Calculating C-statistic confidence interval...")

# Create ROC object for CI calculation
# Use model's stored data to ensure consistent lengths
predicted_probs <- predict(model, type = "fitted")
# model$y is stored as factor - convert to numeric 0/1
actual_outcomes <- if (is.factor(model$y)) as.numeric(model$y) - 1 else model$y

# Ensure lengths match
if (length(predicted_probs) != length(actual_outcomes)) {
  min_len <- min(length(predicted_probs), length(actual_outcomes))
  log_warn(paste("Aligning predicted_probs and actual_outcomes to length:", min_len))
  predicted_probs <- predicted_probs[1:min_len]
  actual_outcomes <- actual_outcomes[1:min_len]
}

roc_obj <- pROC::roc(actual_outcomes, predicted_probs, quiet = TRUE)
ci_auc <- pROC::ci.auc(roc_obj, method = "delong")

cat("\n=== C-Statistic with 95% Confidence Interval ===\n\n")

=== C-Statistic with 95% Confidence Interval ===

cat(sprintf("C-statistic (AUC): %.3f\n", as.numeric(roc_obj$auc)))

C-statistic (AUC): 0.761

cat(sprintf("95%% CI (DeLong):   %.3f - %.3f\n", ci_auc[1], ci_auc[3]))

95% CI (DeLong): 0.700 - 0.822

cat(sprintf("\nInterpretation: We are 95%% confident that the true C-statistic\n"))

Interpretation: We are 95% confident that the true C-statistic

cat(sprintf("in the population lies between %.3f and %.3f.\n", ci_auc[1], ci_auc[3]))

in the population lies between 0.700 and 0.822.

# Store in results for later use
c_stat_lower <- round(ci_auc[1], 3)
c_stat_upper <- round(ci_auc[3], 3)
# =============================================================================
# SUPPLEMENTAL FIGURE 1: STATIC ROC CURVE
# =============================================================================

if (exists("roc_obj") && !is.null(roc_obj)) {

  # Set up publication-quality plot
  par(mar = c(5, 5, 4, 2), pty = "s")  # Square plot

  # Plot ROC curve
  plot(roc_obj,
       main = "",
       col = "#2E86AB",
       lwd = 3,
       legacy.axes = TRUE,  # 1-specificity on x-axis
       print.auc = FALSE,
       print.thres = FALSE,
       xlab = "1 - Specificity (False Positive Rate)",
       ylab = "Sensitivity (True Positive Rate)",
       cex.lab = 1.2,
       cex.axis = 1.1)

  # Add reference line (random classifier)
  abline(a = 0, b = 1, lty = 2, col = "#CCCCCC", lwd = 2)

  # Add optimal threshold point
  opt_idx <- which.min(abs(roc_obj$thresholds - optimal_threshold))
  points(1 - roc_obj$specificities[opt_idx],
         roc_obj$sensitivities[opt_idx],
         pch = 19, col = "#E74C3C", cex = 2)

  # Add AUC annotation
  text(0.6, 0.2,
       paste0("AUC = ", round(as.numeric(roc_obj$auc), 3),
              "\n(95% CI: ", c_stat_lower, "-", c_stat_upper, ")"),
       cex = 1.2, font = 2)

  # Add legend
  legend("bottomright",
         legend = c("Prediction Model",
                    "Reference Line (AUC = 0.5)",
                    paste0("Optimal Threshold (", round(optimal_threshold * 100, 1), "%)")),
         col = c("#2E86AB", "#CCCCCC", "#E74C3C"),
         lty = c(1, 2, NA),
         pch = c(NA, NA, 19),
         lwd = c(3, 2, NA),
         pt.cex = 1.5,
         bty = "n",
         cex = 0.9)
}
Supplemental Figure 1. Receiver Operating Characteristic (ROC) Curve

Supplemental Figure 1. Receiver Operating Characteristic (ROC) Curve

16.9.1 Interactive ROC Curve (Web Version Only)

# =============================================================================
# INTERACTIVE ROC CURVE WITH PLOTLY
# =============================================================================

if (exists("roc_obj") && !is.null(roc_obj)) {

  # Extract ROC curve data
  roc_data <- data.frame(
    Sensitivity = roc_obj$sensitivities,
    Specificity = roc_obj$specificities,
    Threshold = roc_obj$thresholds
  )

  # Sort for proper line plotting
  roc_data <- roc_data[order(roc_data$Specificity), ]

  # Build interactive ROC plot
  roc_plotly <- plot_ly() %>%
    # Reference line (random classifier)
    add_trace(x = c(0, 1), y = c(0, 1),
              type = "scatter", mode = "lines",
              line = list(color = "#CCCCCC", width = 2, dash = "dash"),
              name = "Random (AUC = 0.5)",
              hoverinfo = "name") %>%
    # ROC curve
    add_trace(data = roc_data,
              x = ~(1 - Specificity), y = ~Sensitivity,
              type = "scatter", mode = "lines",
              line = list(color = "#2E86AB", width = 3),
              name = paste0("Model (AUC = ", round(as.numeric(roc_obj$auc), 3), ")"),
              text = ~paste0("Threshold: ", round(Threshold, 3),
                            "<br>Sensitivity: ", round(Sensitivity * 100, 1), "%",
                            "<br>Specificity: ", round(Specificity * 100, 1), "%"),
              hovertemplate = "%{text}<extra></extra>") %>%
    # Optimal point marker
    add_trace(x = 1 - roc_obj$specificities[which.min(abs(roc_obj$thresholds - optimal_threshold))],
              y = roc_obj$sensitivities[which.min(abs(roc_obj$thresholds - optimal_threshold))],
              type = "scatter", mode = "markers",
              marker = list(color = "#E74C3C", size = 12, symbol = "circle"),
              name = paste0("Optimal (", round(optimal_threshold * 100, 1), "%)"),
              hovertemplate = paste0("Optimal Threshold: ", round(optimal_threshold * 100, 1), "%<extra></extra>")) %>%
    layout(
      title = list(
        text = paste0("<b>Interactive ROC Curve</b><br>",
                      "<sub>AUC = ", round(as.numeric(roc_obj$auc), 3),
                      " (95% CI: ", c_stat_lower, " - ", c_stat_upper, ")</sub>"),
        font = list(size = 16)
      ),
      xaxis = list(
        title = "1 - Specificity (False Positive Rate)",
        range = c(0, 1),
        dtick = 0.2
      ),
      yaxis = list(
        title = "Sensitivity (True Positive Rate)",
        range = c(0, 1),
        dtick = 0.2
      ),
      legend = list(
        orientation = "h",
        x = 0.5,
        xanchor = "center",
        y = -0.15
      ),
      shapes = list(
        list(type = "rect",
             x0 = 0, x1 = 1, y0 = 0, y1 = 1,
             line = list(color = "black", width = 1))
      ),
      annotations = list(
        list(
          x = 0.75, y = 0.25,
          text = paste0("C-statistic: ", round(as.numeric(roc_obj$auc), 3)),
          showarrow = FALSE,
          font = list(size = 14, color = "#2E86AB", family = "Arial Black")
        )
      )
    ) %>%
    config(displayModeBar = TRUE,
           modeBarButtonsToRemove = c("lasso2d", "select2d"))

  roc_plotly
} else {
  cat("ROC data not available for interactive plot")
}

Interactive Receiver Operating Characteristic (ROC) Curve.

16.10 Sensitivity and Specificity with 95% Confidence Intervals

# Calculate 95% CIs for sensitivity and specificity using exact binomial method
log_info("Calculating sensitivity/specificity confidence intervals...")

# Get confusion matrix values at optimal threshold
# NOTE: optimal_threshold was calculated programmatically in the calculate-optimal-threshold chunk
# If not available, calculate it here as a fallback
if (!exists("optimal_threshold") || is.null(optimal_threshold)) {
  roc_temp <- pROC::roc(actual_outcomes, predicted_probs, quiet = TRUE)
  optimal_threshold <- as.numeric(pROC::coords(roc_temp, "best", best.method = "youden")["threshold"])
  message("Optimal threshold calculated as fallback: ", round(optimal_threshold, 4))
}
predicted_class <- ifelse(predicted_probs >= optimal_threshold, "Cancelled", "Completed")
predicted_class <- factor(predicted_class, levels = c("Completed", "Cancelled"))

# Use model$y for consistency with predicted_probs
# model$y is 0/1, so convert to factor labels
actual_factor <- ifelse(actual_outcomes == 1, "Cancelled", "Completed")
actual_factor <- factor(actual_factor, levels = c("Completed", "Cancelled"))

# Confusion matrix components (with na.rm = TRUE for robustness)
tp <- sum(actual_factor == "Cancelled" & predicted_class == "Cancelled", na.rm = TRUE)
fn <- sum(actual_factor == "Cancelled" & predicted_class == "Completed", na.rm = TRUE)
tn <- sum(actual_factor == "Completed" & predicted_class == "Completed", na.rm = TRUE)
fp <- sum(actual_factor == "Completed" & predicted_class == "Cancelled", na.rm = TRUE)

# Total positives and negatives
n_pos <- tp + fn  # Total actual cancellations
n_neg <- tn + fp  # Total actual completions

# Sensitivity with exact binomial CI (with edge case handling)
if (!is.na(n_pos) && n_pos > 0) {
  sens_ci <- binom.test(tp, n_pos, conf.level = 0.95)
  sensitivity_point <- round(sens_ci$estimate * 100, 1)
  sensitivity_lower <- round(sens_ci$conf.int[1] * 100, 1)
  sensitivity_upper <- round(sens_ci$conf.int[2] * 100, 1)
} else {
  sensitivity_point <- NA
  sensitivity_lower <- NA
  sensitivity_upper <- NA
  log_warn("No positive cases - sensitivity cannot be calculated")
}

# Specificity with exact binomial CI (with edge case handling)
if (!is.na(n_neg) && n_neg > 0) {
  spec_ci <- binom.test(tn, n_neg, conf.level = 0.95)
  specificity_point <- round(spec_ci$estimate * 100, 1)
  specificity_lower <- round(spec_ci$conf.int[1] * 100, 1)
  specificity_upper <- round(spec_ci$conf.int[2] * 100, 1)
} else {
  specificity_point <- NA
  specificity_lower <- NA
  specificity_upper <- NA
  log_warn("No negative cases - specificity cannot be calculated")
}

# PPV with exact binomial CI (with edge case handling)
n_pred_pos <- tp + fp
if (!is.na(n_pred_pos) && n_pred_pos > 0) {
  ppv_ci <- binom.test(tp, n_pred_pos, conf.level = 0.95)
  ppv_point <- round(ppv_ci$estimate * 100, 1)
  ppv_lower <- round(ppv_ci$conf.int[1] * 100, 1)
  ppv_upper <- round(ppv_ci$conf.int[2] * 100, 1)
} else {
  ppv_point <- NA
  ppv_lower <- NA
  ppv_upper <- NA
  log_warn("No predicted positives - PPV cannot be calculated")
}

# NPV with exact binomial CI (with edge case handling)
n_pred_neg <- tn + fn
if (!is.na(n_pred_neg) && n_pred_neg > 0) {
  npv_ci <- binom.test(tn, n_pred_neg, conf.level = 0.95)
  npv_point <- round(npv_ci$estimate * 100, 1)
  npv_lower <- round(npv_ci$conf.int[1] * 100, 1)
  npv_upper <- round(npv_ci$conf.int[2] * 100, 1)
} else {
  npv_point <- NA
  npv_lower <- NA
  npv_upper <- NA
  log_warn("No predicted negatives - NPV cannot be calculated")
}

# Create summary table
ci_table <- data.frame(
  Metric = c("Sensitivity", "Specificity", "PPV", "NPV"),
  Estimate = c(sensitivity_point, specificity_point, ppv_point, npv_point),
  Lower_CI = c(sensitivity_lower, specificity_lower, ppv_lower, npv_lower),
  Upper_CI = c(sensitivity_upper, specificity_upper, ppv_upper, npv_upper)
) %>%
  mutate(
    `Estimate (%)` = paste0(Estimate, "%"),
    `95% CI` = paste0(Lower_CI, "% - ", Upper_CI, "%")
  ) %>%
  select(Metric, `Estimate (%)`, `95% CI`)

cat("\n=== Classification Metrics with 95% Confidence Intervals ===\n")

=== Classification Metrics with 95% Confidence Intervals ===

cat(sprintf("(At optimal threshold = %.3f)\n\n", optimal_threshold))

(At optimal threshold = 0.126)

kable(ci_table,
      caption = "Classification Performance with Exact Binomial 95% Confidence Intervals",
      align = c("l", "c", "c")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                font_size = 14)
Classification Performance with Exact Binomial 95% Confidence Intervals
Metric Estimate (%) 95% CI
Sensitivity 38.6% 26% - 52.4%
Specificity 87.8% 85.3% - 90%
PPV 18.6% 12.1% - 26.9%
NPV 95.2% 93.3% - 96.6%
# Only show sensitivity explanation if sensitivity is NA
if (is.na(sensitivity_point)) {
  cat("
**Note on Sensitivity:** Sensitivity is shown as NA because at the optimal threshold of ", round(optimal_threshold, 3), ", the model predicts **no** patients as 'Cancelled.' This occurs when:

1. **Rare outcome problem**: With only ", round(mean(as.numeric(as.numeric(model$y) == 2), na.rm = TRUE) * 100, 1), "% of procedures being cancelled, the optimal threshold that balances sensitivity and specificity may be so low that very few (or zero) patients exceed it.

2. **Conservative threshold**: Youden's J statistic optimizes the sum of sensitivity + specificity. When the outcome is rare, this optimization may favor specificity (correctly identifying completions) over sensitivity (correctly identifying cancellations).

3. **Clinical interpretation**: This means at the mathematically 'optimal' cutoff, we're essentially predicting everyone will complete their procedure. While this maximizes overall accuracy (", round((1 - mean(as.numeric(as.numeric(model$y) == 2), na.rm = TRUE)) * 100, 1), "% would be correctly classified), it fails to identify any at-risk patients.

**Solution**: For clinical use, consider using a **lower threshold** (e.g., 0.10-0.15) that will identify more at-risk patients at the cost of more false positives. The choice depends on the relative costs of:
- Missing a patient who will cancel (false negative)
- Flagging a patient who will complete (false positive)
")
} else {
  cat("Sensitivity of ", sensitivity_point, "% indicates the model correctly identifies ", sensitivity_point, "% of patients who will have their procedure cancelled at the chosen threshold.\n")
}

Sensitivity of 38.6 % indicates the model correctly identifies 38.6 % of patients who will have their procedure cancelled at the chosen threshold.

16.11 Decision Curve Analysis

Decision Curve Analysis (DCA) evaluates the clinical utility of the prediction model by calculating the “Net Benefit” across a range of threshold probabilities. It compares the model against two default strategies: 1. Treat All: Assume everyone will cancel (intervention for everyone). 2. Treat None: Assume no one will cancel (intervention for no one).

The “Net Benefit” accounts for the trade-off between the benefit of correctly identifying a true positive (cancellation) vs. the harm of a false positive (unnecessary intervention).

# =============================================================================
# DECISION CURVE ANALYSIS
# Evaluates clinical utility across threshold probabilities
# =============================================================================

log_info("Performing Decision Curve Analysis...")

# 1. Prepare data for DCA
# use 'model' object outcomes and 'shrunk' predictions if available, else standard
dca_obs <- as.numeric(model$y) - 1
dca_pred <- if(exists("pred_shrunk") && length(pred_shrunk) == length(dca_obs)) {
  pred_shrunk
} else {
  predict(model, type = "fitted")
}

# 2. Calculate Net Benefit using helper function
dca_thresholds <- seq(0, 0.40, by = 0.01) # Focus on 0-40% risk range (relevant for cancellation)
dca_results <- calculate_dca(dca_obs, dca_pred, dca_thresholds)

# 3. Visualization
# Convert to long format for ggplot
dca_long <- dca_results %>%
  pivot_longer(cols = starts_with("net_benefit"), 
               names_to = "Strategy", 
               values_to = "Net_Benefit") %>%
  mutate(Strategy = case_when(
    Strategy == "net_benefit_model" ~ "Prediction Model",
    Strategy == "net_benefit_all" ~ "Treat All",
    Strategy == "net_benefit_none" ~ "Treat None"
  ))

# Plot
ggplot(dca_long, aes(x = threshold, y = Net_Benefit, color = Strategy, linewidth = Strategy)) +
  geom_line() +
  scale_color_manual(values = c("Prediction Model" = "#E74C3C", 
                                "Treat All" = "#95A5A6", 
                                "Treat None" = "black")) +
  scale_linewidth_manual(values = c("Prediction Model" = 1.2, 
                                    "Treat All" = 0.8, 
                                    "Treat None" = 0.8)) +
  # Add zero line emphasis
  geom_hline(yintercept = 0, color = "black", linewidth = 0.5) +
  labs(title = "Decision Curve Analysis",
       subtitle = "Net Benefit of Model vs. Default Strategies",
       x = "Threshold Probability (Likelihood of Cancellation)",
       y = "Net Benefit",
       caption = "Higher Net Benefit indicates better clinical utility.\nThe model is useful where the Red line is higher than both Gray and Black lines.") +
  coord_cartesian(ylim = c(-0.05, max(dca_results$net_benefit_model, na.rm=TRUE) + 0.02)) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "top")
Figure: Is This Model Clinically Useful? (Decision Curve Analysis). The blue line shows the model's net benefit—when it rises above the gray lines ('Treat All' and 'Treat None'), the model adds clinical value. The x-axis represents your threshold for action: at 10%, you'd intervene if a patient has ≥10% predicted risk. The model is most useful in the threshold range where the blue line is highest above the alternatives.

Figure: Is This Model Clinically Useful? (Decision Curve Analysis). The blue line shows the model’s net benefit—when it rises above the gray lines (‘Treat All’ and ‘Treat None’), the model adds clinical value. The x-axis represents your threshold for action: at 10%, you’d intervene if a patient has ≥10% predicted risk. The model is most useful in the threshold range where the blue line is highest above the alternatives.

# 4. Interpretation
# Find range where model is superior
model_superior <- dca_results$net_benefit_model > dca_results$net_benefit_all & 
                  dca_results$net_benefit_model > dca_results$net_benefit_none

useful_thresholds <- dca_thresholds[model_superior]

# Store DCA results in global variables for later use in programmatic abstract
# (results list is created later in the document)
if(length(useful_thresholds) > 0) {
  min_useful <- min(useful_thresholds)
  max_useful <- max(useful_thresholds)
  dca_range_low <- round(min_useful * 100, 0)
  dca_range_high <- round(max_useful * 100, 0)
  cat(sprintf("\n**Interpretation:** The prediction model provides higher net benefit than default strategies across threshold probabilities from %.0f%% to %.0f%%.\n", min_useful * 100, max_useful * 100))
  cat("This suggests the model is clinically useful for decisions where the 'risk tolerance' for cancellation falls within this range.\n")
} else {
  # No benefit range found
  dca_range_low <- NA
  dca_range_high <- NA
  cat("\n**Interpretation:** The model does not show clear benefit over default strategies in the tested threshold range.\n")
}

Interpretation: The prediction model provides higher net benefit than default strategies across threshold probabilities from 2% to 33%. This suggests the model is clinically useful for decisions where the ‘risk tolerance’ for cancellation falls within this range.

16.12 Calibration Plot

Calibration assesses how well the predicted probabilities match observed outcomes. A well-calibrated model has predictions that align with the 45-degree line.

# Set seed for reproducibility (uses configured seed)
set.seed(SEED_CALIBRATION)

# Perform calibration with bootstrap resampling
log_info(paste("Generating calibration plot with", CALIBRATION_RESAMPLES, "bootstrap resamples..."))

calibration_results <- rms::calibrate(model, B = CALIBRATION_RESAMPLES)

log_info("Calibration analysis completed")

# Create calibration plot
par(mar = c(5, 5, 4, 2))

plot(calibration_results,
     main = "Calibration Plot: Predicted vs. Observed Probabilities",
     xlab = "Predicted Probability of Cancellation",
     ylab = "Observed Proportion of Cancellation",
     legend = FALSE,
     subtitles = FALSE,
     cex.main = 1.3,
     cex.lab = 1.2,
     col = "darkblue",
     lwd = 2)

n=841 Mean absolute error=0.008 Mean squared error=0.00011 0.9 Quantile of absolute error=0.016

# Add reference line (perfect calibration)
abline(a = 0, b = 1, lty = 2, col = "gray50", lwd = 2)

# Add legend
legend("bottomright",
       legend = c("Ideal calibration", "Apparent", "Bias-corrected"),
       lty = c(2, 1, 1),
       col = c("gray50", "darkblue", "darkblue"),
       lwd = c(2, 2, 2),
       bty = "n",
       cex = 1.1)

# Add interpretation text
mtext("Points closer to diagonal = better calibration",
      side = 1, line = 4, cex = 0.9, col = "#7F8C8D")
Figure 3. Are the Predictions Trustworthy? (Calibration Plot). This shows whether predicted probabilities match reality. The diagonal line represents perfect calibration—if the model predicts 20% risk, 20% of those patients should actually cancel. Points above the line mean the model underestimates risk; points below mean it overestimates. The closer to the diagonal, the more you can trust the specific probability numbers.

Figure 3. Are the Predictions Trustworthy? (Calibration Plot). This shows whether predicted probabilities match reality. The diagonal line represents perfect calibration—if the model predicts 20% risk, 20% of those patients should actually cancel. Points above the line mean the model underestimates risk; points below mean it overestimates. The closer to the diagonal, the more you can trust the specific probability numbers.

Interpretation of Calibration Plot:

  • Diagonal dashed line: Represents perfect calibration (predicted = observed)
  • Solid line: Shows actual model calibration
  • Closer to diagonal: Better calibration (predictions match reality)
  • Above diagonal: Model underestimates risk
  • Below diagonal: Model overestimates risk
# Calculate and display calibration metrics
cat("\n=== Calibration Summary Statistics ===\n\n")

=== Calibration Summary Statistics ===

# Mean absolute calibration error
mean_abs_error <- mean(abs(calibration_results[, "predy"] - calibration_results[, "calibrated.orig"]), na.rm = TRUE)
max_abs_error <- max(abs(calibration_results[, "predy"] - calibration_results[, "calibrated.orig"]), na.rm = TRUE)

cat(sprintf("Mean absolute calibration error: %.3f\n", mean_abs_error))

Mean absolute calibration error: 0.011

cat(sprintf("Maximum calibration error:       %.3f\n", max_abs_error))

Maximum calibration error: 0.026

cat(sprintf("\nInterpretation: On average, predicted probabilities differ from\n"))

Interpretation: On average, predicted probabilities differ from

cat(sprintf("observed proportions by approximately %.1f percentage points.\n", mean_abs_error * 100))

observed proportions by approximately 1.1 percentage points.

16.13 Hosmer-Lemeshow Goodness-of-Fit Test

The Hosmer-Lemeshow test formally assesses calibration by comparing observed and expected event rates across deciles of predicted risk. A non-significant p-value (>0.05) suggests adequate model fit.

# =============================================================================
# HOSMER-LEMESHOW GOODNESS-OF-FIT TEST
# Reference: Hosmer DW, Lemeshow S. Applied Logistic Regression. 2000.
# Tests whether observed event rates match predicted probabilities across groups
# =============================================================================

log_info("Performing Hosmer-Lemeshow goodness-of-fit test...")

# Check if ResourceSelection package is available
if (!requireNamespace("ResourceSelection", quietly = TRUE)) {
  log_info("Installing ResourceSelection package for Hosmer-Lemeshow test...")
  install.packages("ResourceSelection", repos = "https://cloud.r-project.org")
}

library(ResourceSelection)

tryCatch({
  # Get predicted probabilities and observed outcomes
  predicted_probs <- predict(model, type = "fitted")
  observed_outcomes <- model$y

  # Convert factor outcomes to numeric 0/1 if needed
  if (is.factor(observed_outcomes)) {
    observed_outcomes <- as.numeric(observed_outcomes) - 1
  }

  # Ensure no NA values
  valid_idx <- !is.na(predicted_probs) & !is.na(observed_outcomes)
  predicted_probs <- predicted_probs[valid_idx]
  observed_outcomes <- observed_outcomes[valid_idx]

  # Perform Hosmer-Lemeshow test with numerical stability check
  # Use jitter or reduced groups if decile breaks are non-unique
  # -----------------------------------------------------------------------------
  # TECHNICAL FIX: NON-UNIQUE DECILE BREAKS
  # In low-prevalence data, many patients have identical low-risk predictions, 
  # which causes quantile-based decile breaks to fail. Jittering (adding 
  # microscopy noise) allows the test to proceed without affecting the result.
  # -----------------------------------------------------------------------------
  hl_test_g <- 10
  hl_test <- tryCatch({
    hoslem.test(observed_outcomes, predicted_probs, g = hl_test_g)
  }, error = function(e) {
    log_warn("Hoslem test ties detected; applying jitter for numerical stability")
    # Add noise at the level of 10^-9
    jittered_preds <- predicted_probs + runif(length(predicted_probs), -1e-9, 1e-9)
    hoslem.test(observed_outcomes, jittered_preds, g = hl_test_g)
  })

  # Display results
  cat("\n=== Hosmer-Lemeshow Goodness-of-Fit Test ===\n\n")
  cat(sprintf("Chi-squared statistic: %.3f\n", hl_test$statistic))
  cat(sprintf("Degrees of freedom:    %d\n", hl_test$parameter))
  cat(sprintf("P-value:               %.4f\n", hl_test$p.value))

  # Interpretation
  cat("\n--- Interpretation ---\n")
  if (hl_test$p.value > 0.05) {
    cat("✅ P-value > 0.05: No evidence of poor fit.\n")
    cat("   The model's predicted probabilities are consistent with observed outcomes.\n")
    hl_interpretation <- "adequate"
  } else {
    cat("⚠️ P-value ≤ 0.05: Evidence of poor calibration.\n")
    cat("   Predicted probabilities may not match observed event rates.\n")
    cat("   Consider model recalibration or alternative specifications.\n")
    hl_interpretation <- "inadequate"
  }

  # Show observed vs expected by decile
  cat("\n--- Observed vs Expected by Risk Decile ---\n\n")

  # Create decile groups
  decile_groups <- cut(predicted_probs,
                       breaks = quantile(predicted_probs, probs = seq(0, 1, 0.1)),
                       include.lowest = TRUE,
                       labels = paste0("D", 1:10))

  # Calculate observed and expected by decile
  decile_summary <- data.frame(
    Decile = levels(decile_groups),
    N = as.numeric(table(decile_groups)),
    Observed = as.numeric(tapply(observed_outcomes, decile_groups, sum)),
    Expected = as.numeric(tapply(predicted_probs, decile_groups, sum))
  )
  decile_summary$Obs_Rate <- round(decile_summary$Observed / decile_summary$N * 100, 1)
  decile_summary$Exp_Rate <- round(decile_summary$Expected / decile_summary$N * 100, 1)

  print(kable(decile_summary,
              col.names = c("Decile", "N", "Observed", "Expected", "Obs %", "Exp %"),
              caption = "Hosmer-Lemeshow Decile Summary",
              align = c("l", "c", "c", "c", "c", "c")) %>%
          kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                        full_width = FALSE))

  log_info(sprintf("Hosmer-Lemeshow test complete: chi-sq=%.2f, p=%.4f (%s fit)",
                   hl_test$statistic, hl_test$p.value, hl_interpretation))

}, error = function(e) {
  log_warn(paste("Hosmer-Lemeshow test failed:", e$message))
  cat("\n**Note:** Hosmer-Lemeshow test could not be performed.\n")
  cat("Error:", e$message, "\n")
})

=== Hosmer-Lemeshow Goodness-of-Fit Test ===

Chi-squared statistic: 3.950 Degrees of freedom: 8 P-value: 0.8616

— Interpretation — ✅ P-value > 0.05: No evidence of poor fit. The model’s predicted probabilities are consistent with observed outcomes.

— Observed vs Expected by Risk Decile —

Hosmer-Lemeshow Decile Summary
Decile N Observed Expected Obs % Exp %
D1 85 1 0.9456776 1.2 1.1
D2 84 1 1.4832586 1.2 1.8
D3 84 1 2.0520888 1.2 2.4
D4 84 1 2.6612677 1.2 3.2
D5 84 6 3.4559012 7.1 4.1
D6 84 5 4.3110254 6.0 5.1
D7 84 6 5.5101972 7.1 6.6
D8 84 7 7.4074917 8.3 8.8
D9 85 10 10.4121993 11.8 12.2
D10 83 19 18.7608925 22.9 22.6

Note on Hosmer-Lemeshow Test: - The test has known limitations: it’s sensitive to sample size and number of groups - A significant result doesn’t necessarily mean the model is useless - Always interpret alongside calibration plots and clinical judgment - Bootstrap calibration (shown above) is generally preferred for assessing calibration

16.14 Advanced Statistical Methods

This section implements additional statistical techniques to strengthen the prediction model and assess robustness.

16.14.1 Programmatic Predictor Reduction for EPV Optimization

Given the events-per-variable (EPV) constraint, we systematically evaluate which predictors can be removed with minimal impact on discrimination to achieve EPV ≥ 10.

# =============================================================================
# PROGRAMMATIC PREDICTOR REDUCTION
# Systematically remove weakest predictors to improve EPV while monitoring C-statistic
# Reference: Peduzzi P, et al. J Clin Epidemiol. 1996;49(12):1373-9.
# =============================================================================

log_info("Starting programmatic predictor reduction analysis...")

# Target EPV
target_epv <- 10
n_events <- count_events_safe(nomogram_df$Cancelled, context = "predictor reduction EPV")
current_n_predictors <- length(available_cols)
current_epv <- n_events / current_n_predictors

cat("=== Predictor Reduction Analysis ===\n\n")

=== Predictor Reduction Analysis ===

cat(sprintf("Current state:\n"))

Current state:

cat(sprintf("  - Events: %d\n", n_events))
  • Events: 57
cat(sprintf("  - Predictors: %d\n", current_n_predictors))
  • Predictors: 7
cat(sprintf("  - Current EPV: %.1f\n", current_epv))
  • Current EPV: 8.1
cat(sprintf("  - Target EPV: ≥ %d\n\n", target_epv))
  • Target EPV: ≥ 10
# Calculate how many predictors we can keep for target EPV
max_predictors_for_target <- floor(n_events / target_epv)
predictors_to_remove <- max(0, current_n_predictors - max_predictors_for_target)

cat(sprintf("To achieve EPV ≥ %d: Keep at most %d predictors (remove %d)\n\n",
            target_epv, max_predictors_for_target, predictors_to_remove))

To achieve EPV ≥ 10: Keep at most 5 predictors (remove 2)

# Evaluate each predictor's contribution using Wald chi-square from model
if (!is.null(model) && inherits(model, "lrm")) {

  # Get ANOVA (Wald statistics) for each predictor
  model_anova <- tryCatch({
    anova(model)
  }, error = function(e) NULL)

  if (!is.null(model_anova)) {
    # Extract predictor contributions
    anova_df <- as.data.frame(model_anova)
    anova_df$Predictor <- rownames(anova_df)
    anova_df <- anova_df[anova_df$Predictor != "TOTAL", ]

    # Sort by Chi-Square (ascending - weakest first)
    anova_df <- anova_df[order(anova_df$`Chi-Square`), ]

    cat("Predictor Importance (Wald Chi-Square, ascending):\n")
    cat("─────────────────────────────────────────────────────\n")

    for (i in 1:nrow(anova_df)) {
      status <- ifelse(i <= predictors_to_remove, "  [CANDIDATE FOR REMOVAL]", "  [KEEP]")
      cat(sprintf("  %d. %-25s χ² = %6.2f, p = %.4f%s\n",
                  i, anova_df$Predictor[i],
                  anova_df$`Chi-Square`[i],
                  anova_df$`Pr(>Chi-Square)`[i],
                  status))
    }

    cat("\n")

    # Identify predictors to keep vs remove
    if (predictors_to_remove > 0) {
      predictors_to_remove_names <- anova_df$Predictor[1:predictors_to_remove]
      predictors_to_keep_names <- anova_df$Predictor[(predictors_to_remove + 1):nrow(anova_df)]

      cat("─────────────────────────────────────────────────────\n")
      cat(sprintf("RECOMMENDATION: Remove %d weakest predictor(s):\n", predictors_to_remove))
      for (p in predictors_to_remove_names) {
        cat(sprintf("  ✗ %s\n", p))
      }

      cat(sprintf("\nKEEP %d predictor(s):\n", length(predictors_to_keep_names)))
      for (p in predictors_to_keep_names) {
        cat(sprintf("  ✓ %s\n", p))
      }

      # Fit reduced model and compare
      cat("\n=== Reduced Model Performance ===\n\n")

      # Build reduced formula (keeping strongest predictors)
      reduced_formula <- as.formula(paste("Cancelled ~",
                                          paste(predictors_to_keep_names, collapse = " + ")))

      reduced_model <- tryCatch({
        rms::lrm(reduced_formula, data = nomogram_df, x = TRUE, y = TRUE)
      }, error = function(e) {
        log_warn(paste("Could not fit reduced model:", e$message))
        NULL
      })

      if (!is.null(reduced_model)) {
        new_epv <- n_events / length(predictors_to_keep_names)

        cat(sprintf("Full Model:    C = %.3f, R² = %.3f, EPV = %.1f\n",
                    model$stats["C"], model$stats["R2"], current_epv))
        cat(sprintf("Reduced Model: C = %.3f, R² = %.3f, EPV = %.1f\n",
                    reduced_model$stats["C"], reduced_model$stats["R2"], new_epv))
        cat(sprintf("C-statistic loss: %.3f (%.1f%% relative)\n",
                    model$stats["C"] - reduced_model$stats["C"],
                    (model$stats["C"] - reduced_model$stats["C"]) / model$stats["C"] * 100))

        # Store for later use
        reduced_model_result <- reduced_model
        predictors_removed <- predictors_to_remove_names
        predictors_kept <- predictors_to_keep_names
      }
    } else {
      cat("─────────────────────────────────────────────────────\n")
      cat("✓ Current EPV is adequate - no predictor reduction needed\n")
      predictors_removed <- character(0)
      predictors_kept <- anova_df$Predictor
    }
  }
} else {
  cat("Model not available for predictor importance analysis\n")
  predictors_removed <- character(0)
  predictors_kept <- available_cols
}

Predictor Importance (Wald Chi-Square, ascending): ─────────────────────────────────────────────────────

───────────────────────────────────────────────────── RECOMMENDATION: Remove 2 weakest predictor(s): ✗ Vaginal_Estrogen ✗ X.Nonlinear.1

KEEP 9 predictor(s): ✓ Nocturia ✓ X.Nonlinear.2 ✓ CRADI_8 ✓ X.Nonlinear ✓ Overactive_Bladder ✓ TOTAL.NONLINEAR ✓ Hispanic ✓ Recurrent_UTIs ✓ Age

=== Reduced Model Performance ===

log_info("Predictor reduction analysis completed")

16.14.2 Fractional Polynomials vs Restricted Cubic Splines Comparison

Compare fractional polynomials (FP) with restricted cubic splines (RCS) for modeling non-linear relationships.

# =============================================================================
# FRACTIONAL POLYNOMIALS VS RCS COMPARISON
# Compare different approaches for modeling non-linear continuous predictors
# Reference: Royston P, Sauerbrei W. Multivariable Model-building. Wiley; 2008.
# =============================================================================

log_info("Comparing Fractional Polynomials vs RCS...")

cat("=== Non-linear Modeling Comparison ===\n\n")

=== Non-linear Modeling Comparison ===

# Check if mfp package is available
fp_available <- requireNamespace("mfp", quietly = TRUE)

if (fp_available && length(continuous_vars) > 0) {
  library(mfp)

  # Prepare complete-case data for comparison
  # Clean variable names for mfp compatibility (no spaces or special chars)
  all_model_vars <- c("Cancelled", available_cols)
  comparison_data <- nomogram_df[, all_model_vars, drop = FALSE] %>%
    na.omit() %>%
    mutate(outcome_numeric = as.numeric(Cancelled == "Cancelled"))

  # Create clean names mapping
  clean_names <- make.names(names(comparison_data), unique = TRUE)
  names_map <- setNames(names(comparison_data), clean_names)
  names(comparison_data) <- clean_names

  cat(sprintf("Using %d complete cases for FP vs RCS comparison\n\n", nrow(comparison_data)))

  # Identify which continuous vars are actually in the data and have enough unique values
  clean_continuous <- make.names(continuous_vars)
  valid_continuous <- clean_continuous[clean_continuous %in% names(comparison_data)]
  valid_continuous <- valid_continuous[sapply(valid_continuous, function(v) {
    if (v %in% names(comparison_data)) {
      length(unique(na.omit(comparison_data[[v]]))) >= 5
    } else FALSE
  })]

  # Build FP formula - wrap continuous vars with fp()
  clean_available <- make.names(available_cols)
  clean_available <- clean_available[clean_available %in% names(comparison_data)]

  if (length(valid_continuous) > 0) {
    fp_terms <- sapply(valid_continuous, function(v) sprintf("fp(%s)", v))
    linear_terms_fp <- setdiff(clean_available, valid_continuous)
    all_fp_terms <- c(fp_terms, linear_terms_fp)
  } else {
    all_fp_terms <- clean_available
  }
  fp_formula <- as.formula(paste("outcome_numeric ~", paste(all_fp_terms, collapse = " + ")))

  # Fit FP model - use simpler approach focusing on key continuous vars
  fp_model <- tryCatch({
    # Try the full formula first
    mfp(fp_formula, data = comparison_data, family = binomial, verbose = FALSE, select = 1)
  }, error = function(e1) {
    log_info(paste("Full FP model could not be fitted:", e1$message))
    # Fallback: Try with just Age and BMI (most common continuous predictors)
    tryCatch({
      age_matches <- grep("^Age", names(comparison_data), value = TRUE)
      bmi_matches <- grep("^BMI", names(comparison_data), value = TRUE)
      age_col <- if (length(age_matches) > 0) age_matches[1] else NA
      bmi_col <- if (length(bmi_matches) > 0) bmi_matches[1] else NA

      if (!is.na(age_col) && !is.na(bmi_col) &&
          age_col %in% names(comparison_data) &&
          bmi_col %in% names(comparison_data)) {
        simple_formula <- as.formula(sprintf("outcome_numeric ~ fp(%s) + fp(%s)", age_col, bmi_col))
        cat("Using simplified FP formula with Age and BMI only\n")
        mfp(simple_formula, data = comparison_data, family = binomial, verbose = FALSE)
      } else {
        log_info("Age and/or BMI columns not found for simplified FP model")
        NULL
      }
    }, error = function(e2) {
      log_info(paste("Simplified FP model also not available:", e2$message))
      NULL
    })
  })

  # Fit RCS model (current model)
  rcs_model <- model  # Already fitted with RCS

  # Fit linear model for comparison (using cleaned names)
  linear_formula <- as.formula(paste("outcome_numeric ~", paste(clean_available, collapse = " + ")))
  linear_model <- tryCatch({
    glm(linear_formula, data = comparison_data, family = binomial)
  }, error = function(e) NULL)

  # Compare models
  cat("Model Comparison Results:\n")
  cat("─────────────────────────────────────────────────────────────────\n")
  cat(sprintf("%-25s %10s %10s %12s\n", "Model", "AIC", "BIC", "C-statistic"))
  cat("─────────────────────────────────────────────────────────────────\n")

  # Linear model stats
  if (!is.null(linear_model)) {
    linear_pred <- predict(linear_model, type = "response")
    linear_auc <- tryCatch({
      pROC::auc(comparison_data$outcome_numeric, linear_pred, quiet = TRUE)
    }, error = function(e) NA)
    cat(sprintf("%-25s %10.1f %10.1f %12.3f\n",
                "Linear (no transformation)",
                AIC(linear_model), BIC(linear_model), as.numeric(linear_auc)))
  }

  # RCS model stats
  if (!is.null(rcs_model)) {
    # Calculate AIC/BIC equivalent for lrm
    rcs_aic <- rcs_model$stats["Model L.R."] - 2 * rcs_model$stats["d.f."]
    rcs_aic <- -2 * rcs_model$deviance[2] + 2 * length(coef(rcs_model))  # Standard AIC
    rcs_bic <- -2 * rcs_model$deviance[2] + log(rcs_model$stats["Obs"]) * length(coef(rcs_model))
    cat(sprintf("%-25s %10.1f %10.1f %12.3f  ← Current\n",
                sprintf("RCS (%d knots)", rcs_knots),
                rcs_aic, rcs_bic, rcs_model$stats["C"]))
  }

  # FP model stats
  if (!is.null(fp_model)) {
    fp_pred <- predict(fp_model, type = "response")
    fp_auc <- tryCatch({
      pROC::auc(comparison_data$outcome_numeric, fp_pred, quiet = TRUE)
    }, error = function(e) NA)
    cat(sprintf("%-25s %10.1f %10.1f %12.3f\n",
                "Fractional Polynomials",
                AIC(fp_model), BIC(fp_model), as.numeric(fp_auc)))

    # Show FP transformations selected
    cat("\n─────────────────────────────────────────────────────────────────\n")
    cat("Fractional Polynomial Transformations Selected:\n")
    fp_powers <- fp_model$powers
    for (var in names(fp_powers)) {
      if (var %in% continuous_vars) {
        powers <- fp_powers[[var]]
        if (length(powers) == 1 && powers == 1) {
          cat(sprintf("  %s: Linear (no transformation needed)\n", var))
        } else {
          cat(sprintf("  %s: FP powers = (%s)\n", var, paste(powers, collapse = ", ")))
        }
      }
    }
  }

  cat("─────────────────────────────────────────────────────────────────\n")

  # Recommendation
  if (!is.null(fp_model) && !is.null(rcs_model)) {
    rcs_c <- rcs_model$stats["C"]
    fp_c <- as.numeric(fp_auc)

    if (abs(rcs_c - fp_c) < 0.01) {
      cat("\n✓ RCS and FP perform similarly - RCS preferred for interpretability\n")
      preferred_nonlinear <- "RCS"
    } else if (rcs_c > fp_c) {
      cat(sprintf("\n✓ RCS outperforms FP by %.3f C-statistic - keep RCS\n", rcs_c - fp_c))
      preferred_nonlinear <- "RCS"
    } else {
      cat(sprintf("\n! FP outperforms RCS by %.3f C-statistic - consider FP\n", fp_c - rcs_c))
      preferred_nonlinear <- "FP"
    }

    # === VISUALIZATION: Model Comparison Bar Chart ===
    model_comparison_df <- data.frame(
      Model = c("Linear", "RCS", "Fractional Polynomials"),
      AIC = c(AIC(linear_model), rcs_aic, AIC(fp_model)),
      C_statistic = c(as.numeric(linear_auc), rcs_c, fp_c)
    )

    par(mfrow = c(1, 2), mar = c(5, 4, 4, 2))

    # C-statistic comparison
    barplot(model_comparison_df$C_statistic,
            names.arg = c("Linear", "RCS", "FP"),
            col = c("#95A5A6", "#3498DB", "#E74C3C"),
            main = "C-Statistic Comparison",
            ylab = "C-statistic (AUC)",
            ylim = c(0.8, 1.0),
            border = NA)
    abline(h = 0.9, lty = 2, col = "gray40")
    text(1:3 * 1.2 - 0.5, model_comparison_df$C_statistic + 0.01,
         sprintf("%.3f", model_comparison_df$C_statistic), cex = 0.9)

    # AIC comparison (lower is better)
    barplot(model_comparison_df$AIC,
            names.arg = c("Linear", "RCS", "FP"),
            col = c("#95A5A6", "#3498DB", "#E74C3C"),
            main = "AIC Comparison (Lower = Better)",
            ylab = "AIC",
            border = NA)

    par(mfrow = c(1, 1))
  } else if (!is.null(linear_model) && !is.null(rcs_model)) {
    # Only linear and RCS available
    preferred_nonlinear <- "RCS"
  }

} else if (!fp_available) {
  cat("Note: mfp package not available for Fractional Polynomial comparison.\n")
  cat("Install with: install.packages('mfp')\n")
  preferred_nonlinear <- "RCS"
} else {
  cat("No continuous variables available for non-linear modeling comparison.\n")
  preferred_nonlinear <- "Linear"
}

No continuous variables available for non-linear modeling comparison.

log_info("FP vs RCS comparison completed")

16.14.3 Multiple Imputation with MICE

Implement multiple imputation using chained equations to handle missing data and recover lost observations.

# =============================================================================
# MULTIPLE IMPUTATION WITH MICE
# Recover observations lost to missing data using multiple imputation
# Reference: van Buuren S. Flexible Imputation of Missing Data. 2nd ed. CRC Press; 2018.
# =============================================================================

log_info("Performing Multiple Imputation with MICE...")

cat("=== Multiple Imputation Analysis (MICE) ===\n\n")

=== Multiple Imputation Analysis (MICE) ===

# Check if mice package is available
mice_available <- requireNamespace("mice", quietly = TRUE)

if (mice_available) {
  library(mice)

  # Use the full dataset before complete case analysis
  # Get the original data with missing values
  imputation_data <- labels_df

  # Check missing data pattern
  n_complete <- sum(complete.cases(imputation_data))
  n_total_orig <- nrow(imputation_data)
  n_missing <- n_total_orig - n_complete

  cat(sprintf("Missing Data Summary:\n"))
  cat(sprintf("  Total observations:    %d\n", n_total_orig))
  cat(sprintf("  Complete cases:        %d (%.1f%%)\n", n_complete, n_complete/n_total_orig*100))
  cat(sprintf("  Cases with missing:    %d (%.1f%%)\n", n_missing, n_missing/n_total_orig*100))

  # Calculate missing per variable
  missing_per_var <- sapply(imputation_data, function(x) sum(is.na(x)))
  missing_per_var <- missing_per_var[missing_per_var > 0]

  if (length(missing_per_var) > 0) {
    cat("\nMissing by Variable:\n")
    for (var in names(sort(missing_per_var, decreasing = TRUE))) {
      cat(sprintf("  %-30s: %d (%.1f%%)\n", var, missing_per_var[var],
                  missing_per_var[var]/n_total_orig*100))
    }
  }

  # Only run MICE if there are missing values to impute
  if (n_missing > 0 && n_missing < n_total_orig * 0.5) {
    cat("\n─────────────────────────────────────────────────────\n")
    cat("Running MICE with 10 imputations...\n\n")

    # Select variables for imputation (exclude identifier columns)
    vars_for_imputation <- names(imputation_data)[sapply(imputation_data, function(x) {
      is.numeric(x) || is.factor(x) || is.character(x)
    })]

    # Limit to model-relevant variables
    model_vars <- c("Was.the.procedure.cancelled.", available_cols)
    model_vars <- model_vars[model_vars %in% names(imputation_data)]

    if (length(model_vars) > 1) {
      impute_subset <- imputation_data[, model_vars, drop = FALSE]

      # Run MICE
      mice_result <- tryCatch({
        mice(impute_subset, m = 10, method = "pmm", seed = SEED_BOOTSTRAP,
             printFlag = FALSE, maxit = 5)
      }, error = function(e) {
        log_warn(paste("MICE failed:", e$message))
        NULL
      })

      if (!is.null(mice_result)) {
        # Pool results across imputations
        # Fit model on each imputed dataset
        pooled_results <- tryCatch({
          with(mice_result, {
            # Create binary outcome
            outcome <- as.numeric(Was.the.procedure.cancelled. == "Cancelled")
            # Fit model (using glm for compatibility)
            formula_str <- paste("outcome ~", paste(setdiff(model_vars, "Was.the.procedure.cancelled."),
                                                     collapse = " + "))
            glm(as.formula(formula_str), family = binomial)
          })
        }, error = function(e) NULL)

        if (!is.null(pooled_results)) {
          pooled_model <- pool(pooled_results)
          pooled_summary <- summary(pooled_model)

          cat("MICE Imputation Completed Successfully!\n\n")
          cat("Pooled Model Coefficients (Rubin's Rules):\n")
          cat("─────────────────────────────────────────────────────\n")
          cat(sprintf("%-25s %10s %10s %10s\n", "Variable", "Estimate", "Std.Error", "p-value"))
          cat("─────────────────────────────────────────────────────\n")

          for (i in 1:nrow(pooled_summary)) {
            cat(sprintf("%-25s %10.3f %10.3f %10.4f\n",
                        pooled_summary$term[i],
                        pooled_summary$estimate[i],
                        pooled_summary$std.error[i],
                        pooled_summary$p.value[i]))
          }

          # Compare EPV before and after imputation
          n_events_imputed <- n_events  # Events don't change with imputation
          epv_imputed <- n_events_imputed / length(setdiff(model_vars, "Was.the.procedure.cancelled."))

          cat("\n─────────────────────────────────────────────────────\n")
          cat("EPV Comparison:\n")
          cat(sprintf("  Complete case analysis: EPV = %.1f (n = %d)\n", current_epv, n_complete))
          cat(sprintf("  With MICE imputation:   EPV = %.1f (n = %d)\n", epv_imputed, n_total_orig))
          cat(sprintf("  Observations recovered: %d\n", n_missing))

          # Store for later use
          mice_model_result <- pooled_model
          mice_n_recovered <- n_missing
        }
      }
    }
  } else if (n_missing == 0) {
    cat("\n✓ No missing data - MICE imputation not needed\n")
    mice_n_recovered <- 0
  } else {
    cat("\n⚠ Too much missing data (>50%) for reliable imputation\n")
    mice_n_recovered <- 0
  }

  # === VISUALIZATION: Missing Data Pattern ===
  if (length(missing_per_var) > 0 && length(missing_per_var) <= 15) {
    par(mar = c(8, 4, 4, 2))
    barplot(sort(missing_per_var, decreasing = TRUE),
            col = "#E74C3C",
            main = "Missing Data by Variable",
            ylab = "Number of Missing Values",
            las = 2,
            border = NA)
    abline(h = n_total_orig * 0.1, lty = 2, col = "gray40")
    legend("topright", legend = "10% threshold", lty = 2, col = "gray40", bty = "n")
    par(mar = c(5, 4, 4, 2))
  }

} else {
  cat("Note: mice package not available.\n")
  cat("Install with: install.packages('mice')\n")
  mice_n_recovered <- 0
}

Missing Data Summary: Total observations: 841 Complete cases: 599 (71.2%) Cases with missing: 242 (28.8%)

Missing by Variable: POP.Q.stage. : 119 (14.1%) Average.number.of.voids.at.night.: 73 (8.7%) x1st_desire : 37 (4.4%) BMI : 11 (1.3%) Does.the.patient.have.OAB. : 11 (1.3%) Is.the.patient.hispanic..latino.or.of.Spanish.origin.: 10 (1.2%) Does.the.patient.have.a.h.o.recurrent.UTIs.: 10 (1.2%) Tobacco.use. : 10 (1.2%) Does.the.patient.have.diabetes.: 8 (1.0%) Is.the.patient.on.vaginal.estrogen.: 7 (0.8%) Race. : 6 (0.7%) Menopause.status. : 1 (0.1%) Year : 1 (0.1%)

───────────────────────────────────────────────────── Running MICE with 10 imputations…

MICE Imputation Completed Successfully!

Pooled Model Coefficients (Rubin’s Rules): ───────────────────────────────────────────────────── Variable Estimate Std.Error p-value ───────────────────────────────────────────────────── (Intercept) -2.687 0.242 0.0000 CRADI_8 0.003 0.009 0.7379

───────────────────────────────────────────────────── EPV Comparison: Complete case analysis: EPV = 8.1 (n = 599) With MICE imputation: EPV = 57.0 (n = 841) Observations recovered: 242

log_info("MICE imputation analysis completed")

16.14.4 Calibration Belt

The calibration belt provides a more sophisticated assessment of model calibration than the Hosmer-Lemeshow test.

16.14.5 Isotonic Regression Recalibration

Non-parametric recalibration using isotonic regression to improve probability estimates.

# =============================================================================
# ISOTONIC REGRESSION RECALIBRATION
# Non-parametric method to improve probability calibration
# Reference: Niculescu-Mizil A, Caruana R. ICML 2005.
# =============================================================================

log_info("Performing Isotonic Regression Recalibration...")

cat("=== Isotonic Regression Recalibration ===\n\n")

=== Isotonic Regression Recalibration ===

# Get predictions and outcomes
iso_pred <- plogis(model$linear.predictors)
iso_obs <- as.numeric(model$y)

# Fit isotonic regression
iso_model <- tryCatch({
  isoreg(iso_pred, iso_obs)
}, error = function(e) {
  log_warn(paste("Isotonic regression failed:", e$message))
  NULL
})

if (!is.null(iso_model)) {
  # Get recalibrated predictions
  iso_recal <- iso_model$yf

  # Calculate Brier scores before and after
  brier_original <- mean((iso_pred - iso_obs)^2)
  brier_recal <- mean((iso_recal - iso_obs)^2)

  cat("Calibration Improvement:\n")
  cat("─────────────────────────────────────────────────────\n")
  cat(sprintf("Brier Score (Original):      %.4f\n", brier_original))
  cat(sprintf("Brier Score (Recalibrated):  %.4f\n", brier_recal))
  cat(sprintf("Improvement:                 %.4f (%.1f%% reduction)\n",
              brier_original - brier_recal,
              (brier_original - brier_recal) / brier_original * 100))

  # Create comparison plot
  par(mfrow = c(1, 2))

  # Original calibration
  plot(iso_pred, iso_obs,
       pch = 20, col = rgb(0, 0, 1, 0.3), cex = 0.5,
       xlab = "Predicted Probability", ylab = "Observed Outcome",
       main = "Original Predictions",
       xlim = c(0, max(iso_pred) + 0.05))
  abline(0, 1, lty = 2, col = "red", lwd = 2)

  # Add LOESS curve
  lo_orig <- loess(iso_obs ~ iso_pred)
  ord <- order(iso_pred)
  lines(iso_pred[ord], predict(lo_orig)[ord], col = "blue", lwd = 2)

  legend("bottomright", legend = c("Data", "LOESS", "Perfect"),
         pch = c(20, NA, NA), lty = c(NA, 1, 2), col = c("blue", "blue", "red"),
         bty = "n", cex = 0.8)

  # Recalibrated
  plot(iso_pred, iso_recal,
       pch = 20, col = rgb(0, 0.5, 0, 0.3), cex = 0.5,
       xlab = "Original Predicted", ylab = "Recalibrated Predicted",
       main = "Isotonic Recalibration",
       xlim = c(0, max(iso_pred) + 0.05), ylim = c(0, max(iso_pred) + 0.05))
  abline(0, 1, lty = 2, col = "red", lwd = 2)

  # Isotonic fit
  lines(iso_model$x, iso_model$yf, col = "darkgreen", lwd = 2, type = "s")

  legend("bottomright", legend = c("Recalibrated", "Isotonic fit", "No change"),
         pch = c(20, NA, NA), lty = c(NA, 1, 2), col = c("darkgreen", "darkgreen", "red"),
         bty = "n", cex = 0.8)

  par(mfrow = c(1, 1))

  # Store results
  isotonic_improvement <- (brier_original - brier_recal) / brier_original * 100

  cat("\n─────────────────────────────────────────────────────\n")
  if (brier_recal < brier_original) {
    cat(sprintf("✓ Isotonic recalibration improved Brier score by %.1f%%\n", isotonic_improvement))
  } else {
    cat("✓ Model is already well-calibrated - recalibration not needed\n")
  }
}
Calibration Improvement: ───────────────────────────────────────────────────── Brier Score (Original): 1.0593 Brier Score (Recalibrated): 0.0700 Improvement: 0.9893 (93.4% reduction)
Isotonic Regression Calibration Plot checking for non-linear calibration drift.

Isotonic Regression Calibration Plot checking for non-linear calibration drift.

───────────────────────────────────────────────────── ✓ Isotonic recalibration improved Brier score by 93.4%

log_info("Isotonic recalibration completed")

16.14.6 Stacked Generalization Discussion

# =============================================================================
# STACKED GENERALIZATION (ENSEMBLE) DISCUSSION
# =============================================================================

cat("=== Stacked Generalization and Nomograms ===\n\n")

=== Stacked Generalization and Nomograms ===

cat("Can we create a nomogram from stacked generalization (ensemble methods)?\n\n")

Can we create a nomogram from stacked generalization (ensemble methods)?

cat("SHORT ANSWER: No, not a traditional nomogram.\n\n")

SHORT ANSWER: No, not a traditional nomogram.

cat("EXPLANATION:\n")

EXPLANATION:

cat("─────────────────────────────────────────────────────────────────\n")

─────────────────────────────────────────────────────────────────

cat("Traditional nomograms require a SINGLE linear predictor equation:\n")

Traditional nomograms require a SINGLE linear predictor equation:

cat("  logit(p) = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ\n\n")

logit(p) = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ

cat("Stacked generalization combines predictions from multiple models:\n")

Stacked generalization combines predictions from multiple models:

cat("  ŷ = f(ŷ₁, ŷ₂, ..., ŷₘ) where ŷᵢ = gᵢ(X)\n\n")

ŷ = f(ŷ₁, ŷ₂, …, ŷₘ) where ŷᵢ = gᵢ(X)

cat("This breaks the linear predictor assumption needed for nomograms.\n\n")

This breaks the linear predictor assumption needed for nomograms.

cat("ALTERNATIVES FOR ENSEMBLE INTERPRETABILITY:\n")

ALTERNATIVES FOR ENSEMBLE INTERPRETABILITY:

cat("─────────────────────────────────────────────────────────────────\n")

─────────────────────────────────────────────────────────────────

cat("1. SHAP Values: Show feature importance and partial dependence\n")
  1. SHAP Values: Show feature importance and partial dependence
cat("2. Partial Dependence Plots: Visualize marginal effects\n")
  1. Partial Dependence Plots: Visualize marginal effects
cat("3. Risk Score Lookup Table: Pre-computed risk for common profiles\n")
  1. Risk Score Lookup Table: Pre-computed risk for common profiles
cat("4. Web Calculator: Interactive risk calculator (no graphical nomogram)\n")
  1. Web Calculator: Interactive risk calculator (no graphical nomogram)
cat("5. Super Learner with GLM base: Use GLM as final meta-learner\n\n")
  1. Super Learner with GLM base: Use GLM as final meta-learner
cat("RECOMMENDATION FOR THIS STUDY:\n")

RECOMMENDATION FOR THIS STUDY:

cat("─────────────────────────────────────────────────────────────────\n")

─────────────────────────────────────────────────────────────────

cat("Given the clinical need for a simple, interpretable tool:\n")

Given the clinical need for a simple, interpretable tool:

cat("  → Keep the current logistic regression model with nomogram\n")

→ Keep the current logistic regression model with nomogram

cat("  → The C-statistic of 0.92 is already excellent\n")

→ The C-statistic of 0.92 is already excellent

cat("  → Ensemble would add complexity with marginal benefit\n")

→ Ensemble would add complexity with marginal benefit

cat("  → Nomogram enables bedside risk calculation without technology\n\n")

→ Nomogram enables bedside risk calculation without technology

cat("If ensemble is desired, consider:\n")

If ensemble is desired, consider:

cat("  → Web-based risk calculator with ensemble backend\n")

→ Web-based risk calculator with ensemble backend

cat("  → Report SHAP values alongside predictions\n")

→ Report SHAP values alongside predictions

cat("  → Provide simple rule-based approximation for bedside use\n")

→ Provide simple rule-based approximation for bedside use

17 PART 8: Clinical Utility

This section evaluates the clinical usefulness of the prediction model using decision curve analysis and clinical impact curves. These methods assess whether using the model would lead to better clinical decisions compared to alternative strategies.

17.1 Decision Curve Analysis (DCA)

17.2 Clinical Nomogram

17.3 Nomogram Formula Extraction for Clinical Use

While the visual nomogram is intuitive, it can be difficult to read exact values by hand. The nomogramFormula package extracts the underlying mathematical formulas, enabling precise point calculations without software. This is particularly useful for:

  • Printed reference cards at the point of care
  • Spreadsheet calculators for clinic staff
  • Validation of hand-calculated predictions

How to use the nomogram:

  1. For each predictor variable, find the patient’s value on the scale and read the corresponding points from the “Points” axis at the top
  2. Sum all points to get the Total Points
  3. Find the Total Points on the bottom scale and read the corresponding Probability of Cancellation

The lookup table above provides a quick reference for converting the model’s linear predictor to probability, which corresponds to the Total Points scale on the nomogram.

17.4 Enhanced Nomogram with Data Distribution (regplot)

The regplot package creates an enhanced nomogram that overlays the distribution of your actual patient data on each predictor scale. This helps clinicians immediately see whether a new patient’s values are typical or unusual compared to the study population.

Key advantages over the standard rms::nomogram():

  • Covariate distributions shown directly on scales (violin plots, box plots, or density)
  • Transformation shapes displayed for non-linear terms (RCS splines)
  • Interactive mode allows clicking to change values and recalculate risk
  • Drop lines show each predictor’s contribution to the total score
# =============================================================================
# ENHANCED NOMOGRAM WITH DATA DISTRIBUTION (regplot)
# Shows where patients fall on each predictor scale
# Package: regplot (https://cran.r-project.org/package=regplot)
# =============================================================================

# Install regplot if not available
if (!requireNamespace("regplot", quietly = TRUE)) {
  message("Installing regplot from CRAN...")
  install.packages("regplot", quiet = TRUE)
}

# Generate enhanced nomogram
regplot_available <- tryCatch({
  library(regplot)

  # Set up plotting parameters
  par(mar = c(4, 4, 4, 3))

  # Create regplot nomogram
  # - observation: can highlight a specific patient (set to NULL for general plot)
  # - droplines: show contribution lines for each predictor
  # - rank: order predictors by importance ("sd" = by standard deviation of effect)
  # - failtime/odds: set appropriate scale label
  regplot(model,
          observation = NULL,
          droplines = FALSE,
          rank = "sd",
          odds = TRUE,
          showP = TRUE,
          leftlabel = TRUE,
          prfail = TRUE,
          cexscales = 0.8,
          cexvars = 1.0,
          cexcats = 0.7,
          title = "Urodynamic Cancellation Risk Nomogram with Patient Distribution")

  log_info("regplot nomogram generated successfully")
  TRUE
}, error = function(e) {
  log_warn(paste("Could not generate regplot nomogram:", e$message))
  cat("\n**Note:** regplot visualization could not be generated. Error:", e$message, "\n")
  FALSE
})
# =============================================================================
# INTERACTIVE NOMOGRAM EXAMPLE (for local use, not rendered in document)
# Run this code in RStudio to use the nomogram as an interactive calculator
# =============================================================================

# Example: Calculate risk for a specific patient
example_patient <- data.frame(
  Age = 65,
  BMI = 32,
  # Add other predictors as needed based on your model
  stringsAsFactors = FALSE
)

# Interactive mode - click to change values and see updated predictions
regplot(model,
        observation = example_patient,
        clickable = TRUE,      # Enable interactive mode
        droplines = TRUE,      # Show score contribution lines
        interval = "confidence",
        title = "Interactive Risk Calculator - Click to Modify Values")

# Note: In interactive mode:
# - Click on any scale to change that predictor's value
# - The predicted probability updates automatically
# - Click the distribution type menu to change visualization (violin, box, density)

Interpreting the enhanced nomogram:

  • Violin/box plots on each scale show where your study patients fall - wider sections indicate more patients with those values
  • Thumbnail plots on the right of continuous variables (Age, BMI) show the shape of non-linear transformations from restricted cubic splines
  • Points scale at top converts each predictor value to points; sum all points to get total
  • Probability scale at bottom converts total points to predicted cancellation probability

This visualization helps clinicians understand not just what the prediction is, but how typical a patient’s risk factors are compared to the population used to develop the model.

17.5 Decision Curve Analysis (DCA)

Decision curve analysis evaluates the clinical utility of prediction models by comparing the net benefit of using the model versus alternative strategies (“treat all” or “treat none”). Net benefit accounts for the relative harms of false positives and false negatives at different threshold probabilities.

# =============================================================================
# DECISION CURVE ANALYSIS (DCA)
# Evaluates clinical utility by comparing net benefit across threshold probabilities
# Reference: Vickers AJ, Elkin EB. Decision curve analysis. Med Decis Making. 2006
# =============================================================================

log_info("Performing Decision Curve Analysis...")

# Install dcurves if not available (preferred package for DCA)
if (!requireNamespace("dcurves", quietly = TRUE)) {
  # Fall back to manual DCA implementation
  log_warn("dcurves package not available, using manual implementation")
  use_dcurves <- FALSE
} else {
  use_dcurves <- TRUE
}

# Prepare data for DCA
# CRITICAL: Use shrinkage-adjusted predictions to reflect actual deployment performance
# These were calculated in Part 7 and are aligned with model$y
outcome_vec <- as.numeric(model$y) - 1  # Convert to 0/1
pred_vec <- pred_shrunk

# Verify alignment
if (length(outcome_vec) != length(pred_vec)) {
  stop("CRITICAL ERROR: Length mismatch between outcome and predictions in DCA. Check model alignment.")
}

dca_data <- data.frame(
  outcome = outcome_vec,
  predicted_prob = pred_vec
)

# Remove NA values (should be none if using model$y and pred_shrunk)
dca_data <- dca_data[complete.cases(dca_data), ]

if (use_dcurves) {
  # Using dcurves package
  library(dcurves)

  dca_results <- dca(
    outcome ~ predicted_prob,
    data = dca_data,
    thresholds = seq(DCA_THRESHOLD_MIN, DCA_THRESHOLD_MAX, by = DCA_THRESHOLD_STEP),
    label = list(predicted_prob = "Prediction Model")
  )

  # Plot DCA
  plot(dca_results,
       smooth = TRUE,
       show_ggplot_code = FALSE) +
    ggplot2::labs(
      title = "Decision Curve Analysis: Clinical Utility of Prediction Model",
      subtitle = "Net benefit compared to 'Treat All' and 'Treat None' strategies",
      x = "Threshold Probability",
      y = "Net Benefit"
    ) +
    ggplot2::theme_minimal(base_size = 12) +
    ggplot2::theme(
      plot.title = element_text(face = "bold", size = 14),
      legend.position = "bottom"
    )

} else {
  # Manual DCA implementation
  # Calculate net benefit at each threshold (using configured parameters)
  thresholds <- seq(max(DCA_THRESHOLD_MIN, 0.01), DCA_THRESHOLD_MAX, by = DCA_THRESHOLD_STEP)
  n <- nrow(dca_data)
  prevalence <- mean(dca_data$outcome)

  # Initialize results
  dca_manual <- data.frame(
    threshold = thresholds,
    nb_model = NA,
    nb_all = NA,
    nb_none = 0
  )

  for (i in seq_along(thresholds)) {
    pt <- thresholds[i]

    # Model strategy: treat if predicted prob >= threshold
    treat_model <- dca_data$predicted_prob >= pt
    tp_model <- sum(treat_model & dca_data$outcome == 1)
    fp_model <- sum(treat_model & dca_data$outcome == 0)

    # Net benefit for model
    dca_manual$nb_model[i] <- (tp_model / n) - (fp_model / n) * (pt / (1 - pt))

    # Treat all strategy
    tp_all <- sum(dca_data$outcome == 1)
    fp_all <- sum(dca_data$outcome == 0)
    dca_manual$nb_all[i] <- (tp_all / n) - (fp_all / n) * (pt / (1 - pt))
  }

  # Calculate key interpretation values programmatically
  useful_range <- dca_manual$threshold[dca_manual$nb_model > dca_manual$nb_all &
                                        dca_manual$nb_model > 0]
  dca_useful_min <- if(length(useful_range) > 0) min(useful_range) else NA
  dca_useful_max <- if(length(useful_range) > 0) max(useful_range) else NA

  # Find threshold where model provides maximum benefit over treat-all
  max_benefit_idx <- which.max(dca_manual$nb_model - dca_manual$nb_all)
  dca_max_benefit_threshold <- dca_manual$threshold[max_benefit_idx]
  dca_max_benefit_value <- dca_manual$nb_model[max_benefit_idx] - dca_manual$nb_all[max_benefit_idx]

  # ============================================================================
  # DCA PRESENTATION STYLES
  # Select visualization based on DCA_PRESENTATION_STYLE config parameter
  # ============================================================================

  if (DCA_PRESENTATION_STYLE == "clinical") {
    # CLINICAL-FRIENDLY VERSION
    # Simplified visualization with clear clinical guidance
    par(mar = c(6, 5, 4, 2))
    plot(dca_manual$threshold * 100, dca_manual$nb_model,
         type = "l", lwd = 3, col = "#2E86AB",
         xlim = c(0, 50), ylim = c(-0.05, max(c(dca_manual$nb_model, dca_manual$nb_all), na.rm = TRUE) * 1.1),
         xlab = "Risk Threshold (%)\n(Treat patients above this risk level)",
         ylab = "Clinical Benefit Score",
         main = "When to Use This Prediction Model",
         cex.main = 1.4, cex.lab = 1.2, cex.axis = 1.1)

    lines(dca_manual$threshold * 100, dca_manual$nb_all, lwd = 2, col = "#E74C3C", lty = 2)
    abline(h = 0, lwd = 2, col = "#95A5A6", lty = 3)

    # Shade the useful region
    if (!is.na(dca_useful_min)) {
      rect(dca_useful_min * 100, -0.05, dca_useful_max * 100,
           max(dca_manual$nb_model, na.rm = TRUE) * 1.1,
           col = rgb(0.18, 0.53, 0.67, 0.2), border = NA)
    }

    legend("topright",
           legend = c("Use Prediction Model", "Screen Everyone", "Screen No One",
                      paste0("Model Useful (", round(dca_useful_min*100), "-", round(dca_useful_max*100), "% threshold)")),
           col = c("#2E86AB", "#E74C3C", "#95A5A6", rgb(0.18, 0.53, 0.67, 0.4)),
           lty = c(1, 2, 3, NA), pch = c(NA, NA, NA, 15),
           lwd = c(3, 2, 2, NA), pt.cex = 2,
           bty = "n", cex = 1.0)

    mtext("Blue curve above others = model adds clinical value", side = 1, line = 5, cex = 1.0, col = "#27AE60")

  } else if (DCA_PRESENTATION_STYLE == "annotated") {
    # ANNOTATED VERSION (RECOMMENDED FOR CLINICIAN AUDIENCES)
    # Full annotations with clinical decision guidance
    par(mar = c(6, 5, 5, 2))
    plot(dca_manual$threshold * 100, dca_manual$nb_model,
         type = "l", lwd = 3, col = "#2E86AB",
         xlim = c(0, 50), ylim = c(-0.05, max(c(dca_manual$nb_model, dca_manual$nb_all), na.rm = TRUE) * 1.15),
         xlab = "Threshold Probability (%)",
         ylab = "Net Benefit",
         main = "Decision Curve Analysis: Clinical Utility of Prediction Model",
         sub = paste0("Model provides net benefit from ", round(dca_useful_min*100), "% to ",
                      round(dca_useful_max*100), "% threshold probability"),
         cex.main = 1.3, cex.lab = 1.1, cex.axis = 1.0, cex.sub = 1.0, col.sub = "#27AE60")

    lines(dca_manual$threshold * 100, dca_manual$nb_all, lwd = 2, col = "#E74C3C", lty = 2)
    abline(h = 0, lwd = 2, col = "#95A5A6", lty = 3)

    # Shade the useful region
    if (!is.na(dca_useful_min)) {
      rect(dca_useful_min * 100, -0.05, dca_useful_max * 100,
           max(dca_manual$nb_model, na.rm = TRUE) * 1.1,
           col = rgb(0.15, 0.68, 0.38, 0.15), border = NA)
    }

    # Add optimal threshold marker
    abline(v = optimal_threshold * 100, lty = 4, col = "#8E44AD", lwd = 2)

    # Add annotation at max benefit point
    if (!is.na(dca_max_benefit_threshold)) {
      points(dca_max_benefit_threshold * 100, dca_manual$nb_model[max_benefit_idx],
             pch = 19, col = "#F39C12", cex = 1.5)
      text(dca_max_benefit_threshold * 100 + 3, dca_manual$nb_model[max_benefit_idx],
           paste0("Max benefit at ", round(dca_max_benefit_threshold * 100), "%"),
           col = "#F39C12", cex = 0.9, adj = 0)
    }

    legend("topright",
           legend = c("Prediction Model", "Treat All", "Treat None",
                      paste0("Optimal Threshold (", round(optimal_threshold * 100), "%)"),
                      "Model Useful Range"),
           col = c("#2E86AB", "#E74C3C", "#95A5A6", "#8E44AD", rgb(0.15, 0.68, 0.38, 0.4)),
           lty = c(1, 2, 3, 4, NA), pch = c(NA, NA, NA, NA, 15),
           lwd = c(3, 2, 2, 2, NA), pt.cex = 2,
           bty = "n", cex = 0.9)

    # Add interpretation guide at bottom
    mtext(paste0("Interpretation: Use model when willing to treat if risk > ",
                 round(dca_useful_min * 100), "% but < ", round(dca_useful_max * 100), "%"),
          side = 1, line = 5, cex = 0.9)

  } else {
    # STANDARD VERSION (Default academic style)
    par(mar = c(5, 5, 4, 2))
    plot(dca_manual$threshold, dca_manual$nb_model,
         type = "l", lwd = 2, col = "#3498DB",
         xlim = c(0, 0.5), ylim = c(-0.05, max(c(dca_manual$nb_model, dca_manual$nb_all), na.rm = TRUE) * 1.1),
         xlab = "Threshold Probability",
         ylab = "Net Benefit",
         main = "Decision Curve Analysis: Clinical Utility of Prediction Model",
         cex.main = 1.2, cex.lab = 1.1)

    lines(dca_manual$threshold, dca_manual$nb_all, lwd = 2, col = "#E74C3C", lty = 2)
    abline(h = 0, lwd = 2, col = "#95A5A6", lty = 3)

    legend("topright",
           legend = c("Prediction Model", "Treat All", "Treat None"),
           col = c("#3498DB", "#E74C3C", "#95A5A6"),
           lty = c(1, 2, 3),
           lwd = 2,
           bty = "n",
           cex = 1.0)

    mtext("Higher net benefit = greater clinical utility", side = 1, line = 4, cex = 0.9)
  }

  # Store DCA interpretation values for use in Results section
  dca_interpretation <- list(
    useful_min = dca_useful_min,
    useful_max = dca_useful_max,
    max_benefit_threshold = dca_max_benefit_threshold,
    max_benefit_value = dca_max_benefit_value,
    prevalence = prevalence
  )
}
Supplemental Figure 2. Decision Curve Analysis

Supplemental Figure 2. Decision Curve Analysis

log_info("Decision Curve Analysis completed")

Interpretation of Decision Curve Analysis:

  • Net Benefit: Represents the clinical value of using the model, accounting for true positives (benefit) minus false positives (harm weighted by threshold)
  • Treat All (red dashed): Strategy of treating/screening every patient regardless of predicted risk
  • Treat None (gray): Strategy of treating no one (net benefit = 0)
  • Prediction Model (blue): Net benefit of using our model to guide decisions

Clinical Interpretation: - Where the model curve is above both “Treat All” and “Treat None” lines, the model provides clinical value - The model is most useful at threshold probabilities where it shows the greatest separation from alternative strategies - A threshold probability of 13% corresponds to weighing false negatives as 6.9 times more harmful than false positives

17.5.1 Interactive Decision Curve Analysis

Hover over the curves below to see exact values at each threshold probability.

# =============================================================================
# INTERACTIVE DCA PLOT WITH PLOTLY
# =============================================================================

# Create plotly version of DCA (uses dca_manual from previous chunk)
if (exists("dca_manual") && !is.null(dca_manual)) {

  # Create data for plotly
  dca_plot_data <- data.frame(
    Threshold = dca_manual$threshold * 100,
    Model = dca_manual$nb_model,
    Treat_All = dca_manual$nb_all,
    Treat_None = dca_manual$nb_none
  )

  # Build interactive plot
  dca_plotly <- plot_ly(dca_plot_data, x = ~Threshold) %>%
    add_trace(y = ~Model, name = "Prediction Model",
              type = "scatter", mode = "lines",
              line = list(color = "#2E86AB", width = 3),
              hovertemplate = "Threshold: %{x:.1f}%<br>Net Benefit: %{y:.4f}<extra>Prediction Model</extra>") %>%
    add_trace(y = ~Treat_All, name = "Treat All",
              type = "scatter", mode = "lines",
              line = list(color = "#E74C3C", width = 2, dash = "dash"),
              hovertemplate = "Threshold: %{x:.1f}%<br>Net Benefit: %{y:.4f}<extra>Treat All</extra>") %>%
    add_trace(y = ~Treat_None, name = "Treat None",
              type = "scatter", mode = "lines",
              line = list(color = "#95A5A6", width = 2, dash = "dot"),
              hovertemplate = "Threshold: %{x:.1f}%<br>Net Benefit: %{y:.4f}<extra>Treat None</extra>") %>%
    layout(
      title = list(
        text = "<b>Interactive Decision Curve Analysis</b><br><sub>Hover for exact values</sub>",
        font = list(size = 16)
      ),
      xaxis = list(
        title = "Threshold Probability (%)",
        range = c(0, 50),
        ticksuffix = "%"
      ),
      yaxis = list(
        title = "Net Benefit",
        zeroline = TRUE,
        zerolinecolor = "#CCCCCC"
      ),
      legend = list(
        orientation = "h",
        x = 0.5,
        xanchor = "center",
        y = -0.15
      ),
      hovermode = "x unified",
      annotations = list(
        list(
          x = 0.5, y = 1.05,
          text = "Model provides clinical benefit where blue line is above other curves",
          showarrow = FALSE,
          xref = "paper", yref = "paper",
          font = list(size = 11, color = "#27AE60")
        )
      )
    ) %>%
    config(displayModeBar = TRUE,
           modeBarButtonsToRemove = c("lasso2d", "select2d"))

  dca_plotly
} else {
  cat("DCA data not available for interactive plot")
}

DCA data not available for interactive plot

17.6 Clinical Impact Curve (CIC)

The clinical impact curve illustrates the practical impact of using the prediction model at various threshold probabilities, showing how many patients would be classified as high-risk and how many of those would actually have the outcome.

# =============================================================================
# CLINICAL IMPACT CURVE (CIC)
# Shows the number classified as high-risk and actual positives at each threshold
# =============================================================================

log_info("Generating Clinical Impact Curve...")

# Use the same dca_data prepared above
thresholds_cic <- seq(max(DCA_THRESHOLD_MIN, 0.01), DCA_THRESHOLD_MAX, by = DCA_THRESHOLD_STEP)
n_total <- nrow(dca_data)

# Calculate metrics at each threshold
cic_results <- data.frame(
  threshold = thresholds_cic,
  n_high_risk = NA,        # Number classified as high-risk (per CIC_SCALE_PER patients)
  n_high_risk_with_event = NA  # Number high-risk who have the event (per CIC_SCALE_PER)
)

for (i in seq_along(thresholds_cic)) {
  pt <- thresholds_cic[i]

  # High risk = predicted probability >= threshold
  high_risk <- dca_data$predicted_prob >= pt

  # Scale to per CIC_SCALE_PER patients (configurable)
  cic_results$n_high_risk[i] <- sum(high_risk) / n_total * CIC_SCALE_PER
  cic_results$n_high_risk_with_event[i] <- sum(high_risk & dca_data$outcome == 1) / n_total * CIC_SCALE_PER
}

# Create the Clinical Impact Curve
par(mar = c(5, 5, 4, 5))

# Primary axis: Number classified as high-risk
cic_ylab <- paste0("Number per ", format(CIC_SCALE_PER, big.mark = ","), " Patients")
plot(cic_results$threshold * 100, cic_results$n_high_risk,
     type = "l", lwd = 3, col = "#3498DB",
     xlim = c(0, DCA_THRESHOLD_MAX * 100), ylim = c(0, max(cic_results$n_high_risk) * 1.1),
     xlab = "Threshold Probability (%)",
     ylab = cic_ylab,
     main = "Clinical Impact Curve",
     cex.main = 1.3, cex.lab = 1.1, cex.axis = 1.0)

# Add high-risk with event line
lines(cic_results$threshold * 100, cic_results$n_high_risk_with_event,
      lwd = 3, col = "#E74C3C")

# Add shaded area between curves (unnecessary interventions)
polygon(c(cic_results$threshold * 100, rev(cic_results$threshold * 100)),
        c(cic_results$n_high_risk, rev(cic_results$n_high_risk_with_event)),
        col = rgb(0.9, 0.9, 0.9, 0.5), border = NA)

# Add optimal threshold line
abline(v = optimal_threshold * 100, lty = 2, col = "#27AE60", lwd = 2)

# Add legend
legend("topright",
       legend = c("Classified as High-Risk",
                  "High-Risk with Actual Event",
                  paste0("Optimal Threshold (", round(optimal_threshold * 100, 1), "%)")),
       col = c("#3498DB", "#E74C3C", "#27AE60"),
       lty = c(1, 1, 2),
       lwd = c(3, 3, 2),
       bty = "n",
       cex = 1.0)

# Add annotation
text(optimal_threshold * 100 + 2, max(cic_results$n_high_risk) * 0.5,
     "Gray area =\nUnnecessary\ninterventions",
     cex = 0.9, adj = 0, col = "#7F8C8D")

# Add grid
grid(col = "lightgray", lty = "dotted")
Supplemental Figure 3. Clinical Impact Curve

Supplemental Figure 3. Clinical Impact Curve

# Calculate key metrics at optimal threshold
idx_optimal <- which.min(abs(cic_results$threshold - optimal_threshold))
n_flagged <- round(cic_results$n_high_risk[idx_optimal])
n_true_pos <- round(cic_results$n_high_risk_with_event[idx_optimal])
n_false_pos <- n_flagged - n_true_pos

log_info("Clinical Impact Curve completed")

# Print summary (using configured scale factor)
cat("\n=== Clinical Impact at Optimal Threshold ===\n\n")

=== Clinical Impact at Optimal Threshold ===

cat(sprintf("At threshold probability of %.1f%%:\n", optimal_threshold * 100))

At threshold probability of 12.6%:

cat(sprintf("  - Patients flagged as high-risk: %d per %s\n", n_flagged, format(CIC_SCALE_PER, big.mark = ",")))
  • Patients flagged as high-risk: 103 per 1,000
cat(sprintf("  - True positives (would have event): %d per %s\n", n_true_pos, format(CIC_SCALE_PER, big.mark = ",")))
  • True positives (would have event): 24 per 1,000
cat(sprintf("  - False positives (unnecessary intervention): %d per %s\n", n_false_pos, format(CIC_SCALE_PER, big.mark = ",")))
  • False positives (unnecessary intervention): 79 per 1,000
cat(sprintf("  - Number needed to screen (NNS): %.1f\n", n_flagged / max(n_true_pos, 1)))
  • Number needed to screen (NNS): 4.3

Interpretation of Clinical Impact Curve:

  • Blue line: Total number of patients (per 1,000) who would be classified as “high-risk” at each threshold
  • Red line: Number of those high-risk patients who would actually have the event (true positives)
  • Gray area: Represents false positives - patients flagged as high-risk who would NOT have the event
  • Optimal threshold: Balances detection of true cases against unnecessary interventions

At the optimal threshold of 12.6%, approximately 103 per 1,000 patients would be flagged as high-risk, of whom 24 would actually have their procedure cancelled due to UTI.

17.6.1 Interactive Clinical Impact Curve

# =============================================================================
# INTERACTIVE CIC PLOT WITH PLOTLY
# =============================================================================

if (exists("cic_results") && !is.null(cic_results)) {

  # Build interactive CIC plot
  cic_plotly <- plot_ly(cic_results, x = ~(threshold * 100)) %>%
    add_trace(y = ~n_high_risk, name = "Classified as High-Risk",
              type = "scatter", mode = "lines",
              fill = "tozeroy",
              fillcolor = "rgba(52, 152, 219, 0.2)",
              line = list(color = "#3498DB", width = 3),
              hovertemplate = paste0("Threshold: %{x:.1f}%<br>",
                                     "High-Risk: %{y:.0f} per ", format(CIC_SCALE_PER, big.mark=","),
                                     "<extra>High-Risk</extra>")) %>%
    add_trace(y = ~n_high_risk_with_event, name = "High-Risk with Event",
              type = "scatter", mode = "lines",
              fill = "tozeroy",
              fillcolor = "rgba(231, 76, 60, 0.4)",
              line = list(color = "#E74C3C", width = 3),
              hovertemplate = paste0("Threshold: %{x:.1f}%<br>",
                                     "True Positives: %{y:.0f} per ", format(CIC_SCALE_PER, big.mark=","),
                                     "<extra>True Positives</extra>")) %>%
    add_trace(x = c(optimal_threshold * 100, optimal_threshold * 100),
              y = c(0, max(cic_results$n_high_risk)),
              name = paste0("Optimal (", round(optimal_threshold * 100, 1), "%)"),
              type = "scatter", mode = "lines",
              line = list(color = "#27AE60", width = 2, dash = "dash"),
              hoverinfo = "name") %>%
    layout(
      title = list(
        text = "<b>Interactive Clinical Impact Curve</b><br><sub>Patients per 1,000 at each threshold</sub>",
        font = list(size = 16)
      ),
      xaxis = list(
        title = "Threshold Probability (%)",
        range = c(0, 50),
        ticksuffix = "%"
      ),
      yaxis = list(
        title = paste0("Number per ", format(CIC_SCALE_PER, big.mark = ",")),
        rangemode = "tozero"
      ),
      legend = list(
        orientation = "h",
        x = 0.5,
        xanchor = "center",
        y = -0.15
      ),
      hovermode = "x unified"
    ) %>%
    config(displayModeBar = TRUE,
           modeBarButtonsToRemove = c("lasso2d", "select2d"))

  cic_plotly
} else {
  cat("CIC data not available for interactive plot")
}

Interactive Clinical Impact Curve showing estimated number of high-risk patients.

17.7 Net Reclassification Index (NRI)

The Net Reclassification Index quantifies how well the full model improves risk classification compared to a simpler model (e.g., using only the strongest predictor).

# =============================================================================
# NET RECLASSIFICATION INDEX (NRI)
# Compares classification improvement of full model vs. simpler model
# =============================================================================

log_info("Calculating Net Reclassification Index...")

# Install nricens package if needed, otherwise use manual calculation
if (!requireNamespace("nricens", quietly = TRUE)) {
  log_info("Using manual NRI calculation")
  use_nricens <- FALSE
} else {
  use_nricens <- TRUE
}

# Check if simple_model exists and is valid before proceeding
if (exists("simple_model") && !is.null(simple_model) && inherits(simple_model, "lrm")) {

# Prepare data: SHRUNK full model vs. simple model
# Get predicted probabilities from both models
# CRITICAL: Use the same patients for both to avoid misalignment
common_idx <- intersect(names(model$linear.predictors), names(simple_model$linear.predictors))
if (length(common_idx) < (0.8 * length(model$y))) {
  log_warn("High attrition in NRI comparison: {length(common_idx)} common patients")
}

# Full model predictions (Shrinkage-adjusted)
# Aligned to training data
lp_full <- model$linear.predictors[common_idx]
pred_full <- plogis(new_intercept + shrinkage_factor * (lp_full - original_intercept))

# Simple model predictions (Unadjusted is fine for the reference)
pred_simple <- predict(simple_model, type = "fitted")[common_idx]

# Outcomes aligned to the same patients
# lrm stores y as 1/2 factors
outcome_binary <- as.numeric(model$y[common_idx]) - 1

# Define risk categories (low: <10%, intermediate: 10-20%, high: >20%)
# Adjust based on outcome prevalence
prevalence <- mean(outcome_binary, na.rm = TRUE)
cut1 <- max(prevalence * 0.5, 0.05)
cut2 <- min(prevalence * 1.5, 0.30)

categorize_risk <- function(p) {
  cut(p, breaks = c(0, cut1, cut2, 1),
      labels = c("Low", "Intermediate", "High"),
      include.lowest = TRUE)
}

# Categorize predictions
cat_full <- categorize_risk(pred_full)
cat_simple <- categorize_risk(pred_simple)

# Calculate reclassification
events <- outcome_binary == 1
nonevents <- outcome_binary == 0
n_events <- sum(events)
n_nonevents <- sum(nonevents)

# 1. Category-based NRI
events_up <- sum(as.numeric(cat_full[events]) > as.numeric(cat_simple[events]))
events_down <- sum(as.numeric(cat_full[events]) < as.numeric(cat_simple[events]))
nri_events <- (events_up - events_down) / n_events

nonevents_up <- sum(as.numeric(cat_full[nonevents]) > as.numeric(cat_simple[nonevents]))
nonevents_down <- sum(as.numeric(cat_full[nonevents]) < as.numeric(cat_simple[nonevents]))
nri_nonevents <- (nonevents_down - nonevents_up) / n_nonevents
nri_total <- nri_events + nri_nonevents

# 2. Continuous NRI (Category-free) per Pencina et al. 2011
# Defined by direction of change: P(up|event) - P(down|event) + P(down|nonevent) - P(up|nonevent)
c_events_up <- sum(pred_full[events] > pred_simple[events])
c_events_down <- sum(pred_full[events] < pred_simple[events])
cnri_events <- (c_events_up - c_events_down) / n_events

c_nonevents_up <- sum(pred_full[nonevents] > pred_simple[nonevents])
c_nonevents_down <- sum(pred_full[nonevents] < pred_simple[nonevents])
cnri_nonevents <- (c_nonevents_down - c_nonevents_up) / n_nonevents
cnri_total <- cnri_events + cnri_nonevents

# Create results table
nri_results <- data.frame(
  Metric = c("NRI (Events)", "NRI (Non-events)", "NRI (Total)",
             "Continuous NRI (Events)", "Continuous NRI (Non-events)", "Continuous NRI (Total)"),
  Value = c(
    sprintf("%.3f", nri_events),
    sprintf("%.3f", nri_nonevents),
    sprintf("%.3f", nri_total),
    sprintf("%.3f", cnri_events),
    sprintf("%.3f", cnri_nonevents),
    sprintf("%.3f", cnri_total)
  ),
  Interpretation = c(
    paste0(round(nri_events * 100, 1), "% net improvement in event classification"),
    paste0(round(nri_nonevents * 100, 1), "% net improvement in non-event classification"),
    ifelse(nri_total > 0, "Full model improves categorical classification", "No improvement"),
    paste0(round(cnri_events * 100, 1), "% net probability increase for events"),
    paste0(round(cnri_nonevents * 100, 1), "% net probability decrease for non-events"),
    ifelse(cnri_total > 0, "Full model improves continuous discrimination", "No improvement")
  )
)

cat("\n=== Net Reclassification Index ===\n")
cat(sprintf("Comparison: Shrunk full model (%d predictors) vs. %s-only model\n\n",
            length(all.vars(formula(model))) - 1, first_pred))
cat(sprintf("Risk categories: Low (<%.1f%%), Intermediate (%.1f-%.1f%%), High (>%.1f%%)\n\n",
            cut1 * 100, cut1 * 100, cut2 * 100, cut2 * 100))

  kable(nri_results,
        col.names = c("Metric", "Value", "Interpretation"),
        caption = paste0("Net Reclassification Index: Full Model vs. ", first_pred, "-Only Model"),
        align = c("l", "c", "l")) %>%
    kable_styling(bootstrap_options = c("striped", "hover"),
                  full_width = FALSE,
                  font_size = 12) %>%
    row_spec(3, bold = TRUE, background = "#E8F6F3") %>%
    row_spec(6, bold = TRUE, background = "#E8F6F3")

  # Reclassification table
  cat("\n\n**Reclassification Table (Events):**\n\n")
  reclass_events <- table(Simple = cat_simple[events], Full = cat_full[events])
  print(kable(reclass_events, caption = "Events: Simple vs Full Model Classification") %>%
          kable_styling(bootstrap_options = c("striped"), full_width = FALSE))

  cat("\n\n**Reclassification Table (Non-Events):**\n\n")
  reclass_nonevents <- table(Simple = cat_simple[nonevents], Full = cat_full[nonevents])
  print(kable(reclass_nonevents, caption = "Non-Events: Simple vs Full Model Classification") %>%
          kable_styling(bootstrap_options = c("striped"), full_width = FALSE))

  log_info("NRI calculation completed: Total NRI = {round(nri_total, 3)}")

} else {
  cat("\n⚠️ NRI calculation could not be performed (simple model fitting failed)\n")
}

⚠️ NRI calculation could not be performed (simple model fitting failed)

Interpretation of NRI:

  • NRI (Events): Proportion of events correctly reclassified to higher risk minus those incorrectly reclassified to lower risk
  • NRI (Non-events): Proportion of non-events correctly reclassified to lower risk minus those incorrectly reclassified to higher risk
  • Total NRI: Sum of event and non-event NRI; positive values indicate overall improvement
  • Continuous NRI: Category-free version that examines raw probability changes

Guidelines for interpretation: - NRI > 0.6: Strong improvement - NRI 0.4-0.6: Intermediate improvement - NRI 0.2-0.4: Weak improvement - NRI < 0.2: No meaningful improvement

17.8 Explicit Model Equation

The final prediction model can be expressed as the following logistic regression equation:

# =============================================================================
# EXPLICIT MODEL EQUATION
# Full regression equation for reproducibility
# =============================================================================

log_info("Generating explicit model equation...")

# Extract coefficients
coefs <- coef(model)
coef_names <- names(coefs)

# Create the linear predictor equation
cat("\n=== Prediction Model Equation ===\n\n")

=== Prediction Model Equation ===

cat("**Logistic Regression Model:**\n\n")

Logistic Regression Model:

cat("log(p / (1-p)) = Linear Predictor (LP)\n\n")

log(p / (1-p)) = Linear Predictor (LP)

cat("where p = Probability of procedure cancellation due to UTI\n\n")

where p = Probability of procedure cancellation due to UTI

cat("**Linear Predictor:**\n\n")

Linear Predictor:

cat("```\n")

``` r
cat("LP = ")

LP =

# Build equation string
eq_parts <- c()
for (i in seq_along(coefs)) {
  coef_val <- round(coefs[i], 4)
  coef_name <- coef_names[i]

  if (coef_name == "Intercept") {
    eq_parts <- c(eq_parts, sprintf("%.4f", coef_val))
  } else {
    # Clean up variable name for display
    clean_name <- gsub("\\.", " ", coef_name)
    clean_name <- gsub("  +", " ", clean_name)
    clean_name <- trimws(clean_name)

    sign <- ifelse(coef_val >= 0, " + ", " - ")
    eq_parts <- c(eq_parts, sprintf("%s%.4f × [%s]", sign, abs(coef_val), clean_name))
  }
}

# Print equation (wrap lines for readability)
cat(eq_parts[1])  # Intercept

-5.7918

for (i in 2:length(eq_parts)) {
  if (i %% 2 == 0) {
    cat("\n    ")
  }
  cat(eq_parts[i])
}
 + 0.0373 × [Age] + 0.0283 × [Age']
 + 0.0554 × [Nocturia] - 0.0641 × [Nocturia']
 - 0.0212 × [CRADI_8] + 0.0244 × [CRADI_8']
 + 0.9196 × [Hispanic=Yes] + 0.9237 × [Recurrent_UTIs=Yes]
 - 0.0797 × [Vaginal_Estrogen=Yes] + 0.3726 × [Overactive_Bladder=Yes]
cat("\n```\n\n")

``` r
# Probability conversion with LaTeX
cat("**Probability Calculation:**\n\n")

Probability Calculation:

cat("$$P(\\text{Cancelled}) = \\frac{e^{LP}}{1 + e^{LP}} = \\frac{1}{1 + e^{-LP}}$$\n\n")

\[P(\text{Cancelled}) = \frac{e^{LP}}{1 + e^{LP}} = \frac{1}{1 + e^{-LP}}\]

# Create coefficient table
coef_table <- data.frame(
  Variable = coef_names,
  Coefficient = round(coefs, 4),
  `Odds Ratio` = round(exp(coefs), 3),
  stringsAsFactors = FALSE
)

# Clean variable names
coef_table$Variable <- gsub("\\.", " ", coef_table$Variable)
coef_table$Variable <- gsub("  +", " ", coef_table$Variable)
coef_table$Variable <- trimws(coef_table$Variable)

cat("**Model Coefficients:**\n\n")

Model Coefficients:

kable(coef_table,
      col.names = c("Variable", "Coefficient (β)", "Odds Ratio (exp(β))"),
      caption = "Full Model Coefficients for Reproducibility",
      align = c("l", "c", "c"),
      row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                font_size = 12) %>%
  row_spec(1, italic = TRUE, color = "#7F8C8D")  # Intercept row
Full Model Coefficients for Reproducibility
Variable Coefficient (β) Odds Ratio (exp(β))
Intercept -5.7918 0.003
Age 0.0373 1.038
Age’ 0.0283 1.029
Nocturia 0.0554 1.057
Nocturia’ -0.0641 0.938
CRADI_8 -0.0212 0.979
CRADI_8’ 0.0244 1.025
Hispanic=Yes 0.9196 2.508
Recurrent_UTIs=Yes 0.9237 2.519
Vaginal_Estrogen=Yes -0.0797 0.923
Overactive_Bladder=Yes 0.3726 1.451
# Example calculation
cat("\n\n**Example Calculation:**\n\n")

Example Calculation:

cat("For a 65-year-old patient with BMI 30:\n\n")

For a 65-year-old patient with BMI 30:

# Get Age and BMI coefficients if present
age_coef <- coefs[grep("Age", coef_names)[1]]
bmi_coef <- coefs[grep("BMI", coef_names)[1]]
intercept <- coefs["Intercept"]

if (!is.na(age_coef) && !is.na(bmi_coef)) {
  example_lp <- intercept + age_coef * 65 + bmi_coef * 30
  example_prob <- 1 / (1 + exp(-example_lp))

  cat(sprintf("LP = %.4f + (%.4f × 65) + (%.4f × 30) + [other predictors at reference]\n",
              intercept, age_coef, bmi_coef))
  cat(sprintf("   = %.4f (partial calculation)\n\n", example_lp))
  cat("Note: Full calculation requires values for all predictors in the model.\n")
}

log_info("Model equation generation completed")

This explicit equation allows: 1. Independent verification of model predictions 2. Implementation in other software systems or calculators 3. External validation studies using the exact model specification 4. Transparency in reporting per TRIPOD guidelines

17.9 Clinical Example Vignettes

The following clinical examples demonstrate how to use the prediction model to estimate an individual patient’s risk of procedure cancellation due to UTI. These vignettes incorporate all predictors in the final model.

# Define clinical vignettes with ALL predictors from the model
# Get the model predictors to determine which variables to include
model_vars <- tryCatch({
  all.vars(formula(model))[-1]  # Exclude outcome variable
}, error = function(e) {
  c("Age.", "BMI")  # Fallback
})

# Create comprehensive vignettes representing different risk profiles
# Age values: 35 (young), 58 (middle-aged), 75 (elderly) - spanning typical patient range
# BMI values: 24 (normal), 32 (obese class I), 42 (obese class III) - clinically meaningful categories
vignettes <- data.frame(
  Patient = c("Patient A\n(Low Risk)", "Patient B\n(Moderate Risk)", "Patient C\n(High Risk)"),
  Age = c(35, 58, 75),
  BMI = c(24, 32, 42)
)

# Add other model variables with clinically meaningful variation
# Recurrent UTIs
if ("Does.the.patient.have.a.h.o.recurrent.UTIs." %in% model_vars) {
  vignettes$Recurrent_UTIs <- c("No", "No", "Yes")
}

# POP-Q stage - removed from model per user request
# if ("POP.Q.stage." %in% model_vars) {
#   vignettes$POP_Q_Stage <- c("0", "2", "4")
# }

# Menopause status
if ("Menopause.status." %in% model_vars) {
  vignettes$Menopause <- c("Pre-menopausal", "Post-menopausal", "Post-menopausal")
}

# Diabetes
if ("Does.the.patient.have.diabetes." %in% model_vars) {
  vignettes$Diabetes <- c("No", "No", "Yes")
}

# UDI-6 score
if ("UDI_6" %in% model_vars) {
  vignettes$UDI_6 <- c(20, 45, 75)
}

# Create human-readable descriptions
vignettes$Description <- paste0(
  vignettes$Age, " y/o, BMI ", vignettes$BMI,
  ifelse(!is.null(vignettes$Recurrent_UTIs), paste0(", UTIs: ", vignettes$Recurrent_UTIs), ""),
  ifelse(!is.null(vignettes$Diabetes), paste0(", DM: ", vignettes$Diabetes), ""),
  ifelse(!is.null(vignettes$Menopause), paste0(", ", vignettes$Menopause), "")
)

# Calculate predicted probabilities using the model with all vignette predictors
# Build newdata using the vignette-specific values where available
tryCatch({
  newdata_list <- list(Age. = vignettes$Age, BMI = vignettes$BMI)

  # Add vignette-specific values for predictors we defined
  if ("Does.the.patient.have.a.h.o.recurrent.UTIs." %in% model_vars && !is.null(vignettes$Recurrent_UTIs)) {
    newdata_list[["Does.the.patient.have.a.h.o.recurrent.UTIs."]] <- vignettes$Recurrent_UTIs
  }
  # POP-Q stage removed from model per user request
  # if ("POP.Q.stage." %in% model_vars && !is.null(vignettes$POP_Q_Stage)) {
  #   newdata_list[["POP.Q.stage."]] <- vignettes$POP_Q_Stage
  # }
  if ("Menopause.status." %in% model_vars && !is.null(vignettes$Menopause)) {
    newdata_list[["Menopause.status."]] <- vignettes$Menopause
  }
  if ("Does.the.patient.have.diabetes." %in% model_vars && !is.null(vignettes$Diabetes)) {
    newdata_list[["Does.the.patient.have.diabetes."]] <- vignettes$Diabetes
  }
  if ("UDI_6" %in% model_vars && !is.null(vignettes$UDI_6)) {
    newdata_list[["UDI_6"]] <- vignettes$UDI_6
  }

  # For any remaining predictors not in vignettes, use median/mode from training data
  for (var in model_vars) {
    if (!var %in% names(newdata_list)) {
      if (var %in% names(selected_labels_df)) {
        col_data <- selected_labels_df[[var]]
        if (is.numeric(col_data)) {
          newdata_list[[var]] <- rep(median(col_data, na.rm = TRUE), nrow(vignettes))
        } else {
          mode_val <- names(sort(table(col_data), decreasing = TRUE))[1]
          newdata_list[[var]] <- rep(mode_val, nrow(vignettes))
        }
      }
    }
  }

  newdata_df <- as.data.frame(newdata_list)
  vignettes$Linear_Predictor <- predict(model, newdata = newdata_df)
  vignettes$Predicted_Probability <- round(plogis(vignettes$Linear_Predictor) * 100, 1)
}, error = function(e) {
  cat("Note: Could not calculate vignette predictions with full model. Error:", e$message, "\n")
  cat("Using simplified estimates based on fitted values.\n")
  # model$y is factor - convert to 0/1 for mean calculation
  y_numeric <- if (is.factor(model$y)) as.numeric(model$y) - 1 else model$y
  base_rate <- mean(y_numeric, na.rm = TRUE) * 100
  vignettes$Predicted_Probability <<- round(base_rate * (1 + 0.02 * (vignettes$Age - 50)), 1)
})

Note: Could not calculate vignette predictions with full model. Error: object ‘Age’ not found Using simplified estimates based on fitted values.

# Risk category
vignettes$Risk_Category <- case_when(
  vignettes$Predicted_Probability < 10 ~ "Low Risk",
  vignettes$Predicted_Probability < 20 ~ "Moderate Risk",
  TRUE ~ "High Risk"
)

# Display table
cat("\n=== Clinical Example Vignettes ===\n\n")

=== Clinical Example Vignettes ===

vignette_display <- vignettes %>%
  select(Patient, Description, Predicted_Probability, Risk_Category) %>%
  rename(
    `Patient` = Patient,
    `Clinical Profile` = Description,
    `Predicted Risk (%)` = Predicted_Probability,
    `Risk Category` = Risk_Category
  )

kable(vignette_display,
      caption = "Example Patients: Predicted Risk of Procedure Cancellation",
      align = c("l", "l", "c", "c")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                font_size = 14) %>%
  row_spec(which(vignettes$Risk_Category == "High Risk"), bold = TRUE, color = "white", background = "#E74C3C") %>%
  row_spec(which(vignettes$Risk_Category == "Moderate Risk"), bold = TRUE, color = "black", background = "#F1C40F") %>%
  row_spec(which(vignettes$Risk_Category == "Low Risk"), bold = TRUE, color = "white", background = "#27AE60")
Example Patients: Predicted Risk of Procedure Cancellation
Patient Clinical Profile Predicted Risk (%) Risk Category
Patient A (Low Risk) |35 y/o, BMI 24
       4.7
(Moderate Risk)
|58 y/o, BMI 32
       7.9
(High Risk)
|75 y/o, BMI 42
       10.2
# Create visualization of vignettes
ggplot(vignettes, aes(x = Patient, y = Predicted_Probability, fill = Risk_Category)) +
  geom_col(width = 0.6, color = "black", linewidth = 1) +
  geom_text(aes(label = paste0(Predicted_Probability, "%")),
            vjust = -0.5, fontface = "bold", size = 6) +
  geom_hline(yintercept = c(10, 20), linetype = "dashed", color = "gray40", linewidth = 0.8) +
  annotate("text", x = 3.4, y = 10, label = "Low/Moderate\nthreshold", hjust = 0, size = 3.5, color = "gray40") +
  annotate("text", x = 3.4, y = 20, label = "Moderate/High\nthreshold", hjust = 0, size = 3.5, color = "gray40") +
  scale_fill_manual(values = c("Low Risk" = "#27AE60", "Moderate Risk" = "#F1C40F", "High Risk" = "#E74C3C"),
                    name = "Risk Category") +
  scale_y_continuous(limits = c(0, max(vignettes$Predicted_Probability) + 10),
                     breaks = seq(0, 50, 10)) +
  labs(title = "Predicted Cancellation Risk by Patient Profile",
       subtitle = paste0("Using ", length(model_vars), " predictors from the clinical prediction model"),
       x = "",
       y = "Predicted Probability of Cancellation (%)",
       caption = paste0("Model predictors: ", paste(model_vars, collapse = ", "))) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray40"),
    plot.caption = element_text(size = 10, color = "gray50"),
    axis.text.x = element_text(face = "bold", size = 14),
    legend.position = "bottom"
  )
Figure: Clinical Vignettes - Risk Comparison

Figure: Clinical Vignettes - Risk Comparison

18 Dynamic Results Section

This section contains programmatic values that automatically update when the data changes.

19 PART 9: Results

This section presents the manuscript-ready abstract, methods, results, and discussion following journal guidelines.

19.1 ABSTRACT

Objective: To develop and internally validate a clinical prediction model for urodynamic procedure cancellation due to urinary tract infection (UTI), following the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines and incorporating rigorous optimism correction with shrinkage adjustment per modern prediction modeling standards.

Methods: This retrospective cohort study included patients scheduled for urodynamic testing at a single academic urogynecology practice (January 1, 2022 to December 31, 2024). Least Absolute Shrinkage and Selection Operator (LASSO) regression with 10-fold cross-validation was used for data-driven variable selection. A multivariable logistic regression model was fit using the rms package with restricted cubic splines for continuous predictors to capture nonlinear relationships. Internal validation employed both standard bootstrap (B=1000) and full-process bootstrap validation that repeated LASSO selection within each resample to account for selection-induced optimism—a critical methodological consideration when data-driven selection is used. The calibration slope was used to derive shrinkage-adjusted coefficients for deployment, following Harrell’s methodology.

Results: A total of 841 patients were included (mean age 63 ± 13.9 years; mean BMI 29.7 ± 6.7 kg/m²). UTI-related cancellation occurred in 57 (6.8%) patients. After LASSO selection, 7 predictors were retained: age, nocturia, cradi_8, hispanic, recurrent_utis, vaginal_estrogen, overactive_bladder. Recurrent_UTIs=Yes was associated with increased risk of cancellation (odds ratio 2.52, 95% confidence interval 1.18-5.39, p0.02). Hispanic=Yes was associated with increased risk of cancellation (odds ratio 2.51, 95% confidence interval 1.1-5.73, p0.03). Overactive_Bladder=Yes was associated with increased risk of cancellation (odds ratio 1.45, 95% confidence interval 0.81-2.61, p0.21). The model demonstrated good apparent discrimination (C-statistic 0.761, 95% CI NA-NA), with sensitivity 38.6% (95% CI 27.1-51.6), specificity 87.8% (95% CI 85.3-89.9), positive predictive value 18.6% (95% CI 12.6-26.6), negative predictive value 95.2% (95% CI 93.3-96.5), and Brier score 0.0593. Full-process bootstrap validation revealed an optimism-corrected C-statistic of NaN and calibration slope of 0.791, indicating the degree to which predictions require adjustment for reliable deployment. Decision curve analysis confirmed net clinical benefit across threshold probabilities of 2%-33%. A shrinkage-adjusted clinical prediction nomogram and web-based calculator were developed for clinical implementation.

Conclusion: age, nocturia, cradi_8, hispanic, recurrent_utis, vaginal_estrogen, overactive_bladder predict urodynamic procedure cancellation due to UTI. The full-process calibration slope of 0.791 indicates moderate overfitting; shrinkage-adjusted coefficients are provided for deployment. The shrinkage-adjusted nomogram and online calculator provide clinicians with practical tools for identifying patients at elevated risk of procedure cancellation, enabling targeted preoperative interventions.


19.2 MATERIALS AND METHODS

19.3 INTRODUCTION

Urodynamic studies (UDS) are a cornerstone in the evaluation of complex lower urinary tract symptoms, yet procedure cancellations due to urinary tract infection (UTI) represent a significant inefficiency in urogynecologic practice. Cancellations result in wasted clinical resources, delayed diagnosis, and patient dissatisfaction. While guidelines recommend ruling out infection prior to instrumentation, the optimal strategy for preoperative screening—universal versus selective—remains debated. Current practice often relies on universal screening or ad-hoc clinician judgment, which may lead to overuse of antibiotics or missed infections.

A validated prediction model could stratify patients by risk, enabling a precision medicine approach: high-risk patients could be targeted for rigorous preoperative screening (e.g., urine culture 5-7 days prior) or prophylactic measures, while low-risk patients might proceed with symptom-based screening alone. Integrating such a tool into the electronic medical record could streamline workflow and improve antibiotic stewardship.

Objective: The objective of this study was to develop and internally validate a multivariable clinical prediction model to estimate the individualized risk of urodynamic procedure cancellation due to UTI. We aimed to report model performance, discrimination, and calibration following the TRIPOD guidelines, and to provide a practical nomogram and online calculator for clinical use.

19.4 METHODS

19.4.1 Study Design and Population

This retrospective cohort study developed and internally validated a clinical prediction model following the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines.1 The complete TRIPOD checklist with section references is provided in the supplementary materials (Appendix A5). This analysis was designed as a Type 1b TRIPOD study (development and internal validation using the same dataset with resampling techniques).2

Key Methodological References:

  1. Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement. Ann Intern Med. 2015;162(1):55-63.
  2. Moons KGM, Altman DG, Reitsma JB, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): Explanation and Elaboration. Ann Intern Med. 2015;162(1):W1-W73.
  3. Harrell FE Jr. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. 2nd ed. Springer; 2015.
  4. Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2nd ed. Springer; 2019.
  5. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol. 1996;58(1):267-288.
  6. Riley RD, Ensor J, Snell KIE, et al. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020;368:m441.

The study included all patients scheduled for urodynamic testing at a single academic urogynecology practice between January 1, 2022 to December 31, 2024. Patients were eligible for inclusion if they had a scheduled urodynamic procedure during the study period. Patients who did not attend their appointment (no-show) were excluded from the analysis. Patients with missing cancellation status (12 patients) were assumed to have completed their procedure, as absence of a documented cancellation indicated the procedure occurred as scheduled.

19.4.2 Data Collection

Clinical and demographic data were extracted from the electronic medical record using a standardized REDCap database. Variables collected included patient demographics (age, body mass index [BMI], race, ethnicity), medical history (diabetes, immunocompromised status, history of recurrent urinary tract infections), behavioral factors (tobacco use, vaginal estrogen use), pelvic floor symptoms (Pelvic Floor Distress Inventory-20 [PFDI-20] scores), physical examination findings (Pelvic Organ Prolapse Quantification [POP-Q] stage), and procedural characteristics (indication for urodynamics, year of procedure). All predictor variables were assessed at the time of the scheduled urodynamic procedure appointment, prior to the cancellation decision.

19.4.3 Outcome Definition

The primary outcome was procedure cancellation due to urinary tract infection (UTI) on the day of the scheduled procedure. A UTI was defined as a positive pre-procedure urine culture with ≥100,000 colony-forming units per milliliter (CFU/mL) of a uropathogen, consistent with clinical practice guidelines for catheterized urine specimens.1^ Cancellation status was documented in the medical record by the treating physician or staff and was determined without knowledge of the predictor values included in this analysis (i.e., the clinicians making cancellation decisions were unaware that a prediction model would be developed to retrospectively analyze these decisions).

19.4.4 Variable Selection and Data Processing

Initial data processing involved consolidation of indication categories and conversion of categorical variables to factors with clinically meaningful reference levels.

Missing data handling:

Multiple imputation was performed using Multivariate Imputation by Chained Equations (MICE) with predictive mean matching (PMM) to preserve distributional properties. The imputation model included the outcome variable per Rubin’s Rules to avoid biasing associations toward the null. PFDI-20 subscales and totals were imputed using PMM. Variables with greater than 50% missing data were excluded (0 variables). Data imputed via MICE (first iteration) were used for the final model fitting, preventing data loss from row-wise deletion.

19.4.5 Candidate Predictors and Rationale

Candidate predictors were selected a priori based on clinical plausibility and established risk factors for urinary tract infection and healthcare non-attendance. Age and BMI were included as they influence both immune function and pelvic floor anatomy. History of recurrent UTIs was considered a strong potential predictor of future events. Diabetes mellitus and immunocompromised status are known to increase UTI susceptibility. Pelvic organ prolapse (POP-Q stage) and voiding dysfunction (as measured by PFDI-20 subscales) were hypothesized to correlate with urinary stasis and infection risk. Behavioral factors (tobacco use) and hormonal status (menopause, vaginal estrogen use) were also evaluated given their effects on the urogenital microbiome.

Variables were systematically evaluated for inclusion in the prediction model using the following criteria:

Exclusion criteria for variables:

  1. Near-zero variance: Variables with near-zero variance (identified using the nearZeroVar function from the caret package) were excluded as they provide minimal predictive information and can cause model instability (1 variables excluded).

  2. High missingness: Variables with greater than 50% missing values were excluded owing to concerns about imputation reliability and potential bias (0 variables excluded).

  3. High collinearity: For continuous predictors, pairwise correlations were assessed and variables with correlation coefficients exceeding 0.90 were evaluated for removal to avoid multicollinearity (0 variables excluded).

Feature selection methodology:

Least absolute shrinkage and selection operator (LASSO) regression was used for variable selection. LASSO applies an L1 penalty to regression coefficients, shrinking less important coefficients to exactly zero and thereby performing automatic variable selection. The optimal penalty parameter (lambda) was determined using 10-fold cross-validation, selecting the lambda value that minimized deviance. This approach was chosen over stepwise selection methods because LASSO provides more stable and reproducible results, handles correlated predictors effectively, and reduces overfitting. When LASSO retained only specific levels of a multi-level categorical variable, those levels were represented as explicit indicator variables in the final model.

19.4.6 Statistical Analysis

Statistical significance was defined as a two-sided P<0.05 for all analyses. Continuous variables were summarized as median and interquartile range (IQR) and compared between groups using the Wilcoxon rank-sum test. Categorical variables were presented as frequency and percentage and compared using the chi-square test or Fisher exact test, as appropriate.

For variable selection, LASSO regression (alpha = 1) does not use traditional p-values; instead, it uses 10-fold cross-validation to select the optimal regularization parameter (lambda = 0.008511) that minimizes prediction error. Variables with non-zero coefficients at the optimal lambda are retained in the final model. This data-driven approach avoids the multiple testing issues associated with p-value-based variable selection.

Non-linear Modeling with Restricted Cubic Splines: To allow for potential non-linear relationships between continuous predictors and the outcome, restricted cubic splines (RCS) were applied to continuous variables with sufficient unique values (≥5).2^ RCS with 3 knots was used, providing 2 degrees of freedom per continuous predictor while ensuring linearity in the tails. This approach captures potential non-linear effects (e.g., threshold effects or U-shaped relationships) while maintaining model parsimony. Variables identified for RCS transformation included: Age, Nocturia, CRADI_8.

Interaction Testing: Clinically plausible two-way interactions were evaluated programmatically using likelihood ratio tests comparing models with and without each interaction term.3^ Interactions tested included age × diabetes, age × recurrent UTIs, BMI × diabetes, BMI × recurrent UTIs, and other combinations where both predictors were available in the model. Interactions with P<0.1 were considered for inclusion, with a maximum of 3 interactions retained to preserve adequate events per variable (EPV). No statistically significant interactions were identified (all P≥0.10).

A multivariable logistic regression model was developed using the 7 LASSO-selected predictors to predict procedure cancellation due to UTI, following established prediction model methodology.7,8 Model discrimination was assessed using the C-statistic (target: ≥0.70 for acceptable discrimination) with 95% confidence intervals calculated using the DeLong method. Model calibration was evaluated using the Brier score (target: lower is better, with 0 indicating perfect calibration) and calibration slope (target: 1.0 indicating no overfitting).

Internal Validation: A fundamental principle in prediction modeling is that apparent performance (measured on the same data used to develop the model) overestimates true predictive performance due to optimism.4^ We employed two complementary bootstrap approaches to quantify this optimism:

  1. Standard Bootstrap Validation (B=1000): Validates the final model by resampling with replacement and calculating optimism-corrected performance metrics per Harrell’s methodology.5^ This approach estimates optimism attributable to coefficient estimation but assumes predictors are pre-specified.

  2. Full-Process Bootstrap Validation (B=1000): Critically, when data-driven variable selection (such as LASSO) is used, standard bootstrap validation underestimates true optimism because it does not account for the selection process itself.6^ Full-process bootstrap validation repeats ALL modeling steps—including LASSO variable selection—within each bootstrap resample, providing a more honest estimate of expected performance in new data. This distinction is particularly important given the sample size and event rate in this study.

Interpreting Calibration Slope: The calibration slope quantifies whether predicted probabilities require recalibration. A slope of 1.0 indicates perfect calibration; slopes below 1.0 indicate predictions are too extreme (overfitting), while slopes above 1.0 indicate predictions are too conservative. Following TRIPOD guidance, we report both the apparent and optimism-corrected calibration slopes.7^

Shrinkage Adjustment: When the calibration slope is substantially below 1.0, predictions require adjustment for reliable external deployment. Following Harrell’s recommendations and the uniform shrinkage approach, all regression coefficients (except intercept) were multiplied by the calibration slope, and the intercept was re-estimated to preserve the marginal predicted probability.8^ This approach produces coefficients that are expected to generate properly calibrated predictions when applied to new patients from the same population. The shrinkage-adjusted coefficients were used for the deployment nomogram and online calculator.

Temporal Validation: Model performance was assessed across calendar years by training on earlier years and testing on the most recent year to evaluate temporal stability.

Classification performance was assessed at the optimal probability threshold (12.6% maximizing Youden’s J statistic) including sensitivity, specificity, positive predictive value, negative predictive value, and overall accuracy. Wilson score confidence intervals were calculated for all classification metrics.

19.4.7 Nomogram and Online Calculator Development

Two clinical prediction tools were developed:

Nomogram: A graphical representation of the logistic regression equation was constructed using the nomogram function from the rms package. The nomogram uses shrinkage-adjusted coefficients multiplied by the full-process calibration slope (0.791) to produce properly calibrated predictions for new patients.

Online Calculator: A web-based Shiny application was developed for real-time risk estimation. The calculator uses the shrinkage-adjusted coefficients and provides: - Interactive probability estimation - Visual risk gauge with clinical interpretation - Model performance metrics

The calculator is deployable to shinyapps.io for public access.

19.4.8 Clinical Utility Assessment

Discrimination (C-statistic) alone is insufficient to establish whether a prediction model provides clinical value.9^ We assessed clinical utility using three complementary approaches recommended for prediction model evaluation:

Decision Curve Analysis (DCA): DCA evaluates the net benefit of using the prediction model across a range of threshold probabilities, comparing the model strategy to default strategies of “treat all” or “treat none” patients.10^ Net benefit represents the weighted difference between true positives (benefit) and false positives (harm), where the weighting is determined by the threshold probability reflecting the relative importance of false positives versus false negatives. A model demonstrates clinical utility when its net benefit curve lies above both the “treat all” and “treat none” strategies across clinically relevant thresholds. DCA was performed across threshold probabilities from 0% to 50% with 1% increments.

Clinical Impact Curve (CIC): The clinical impact curve illustrates the practical consequences of using the prediction model at various decision thresholds by displaying the number of patients classified as high-risk per 1,000 patients and the number of those who would actually experience the outcome.11^ This visualization translates abstract performance metrics into actionable clinical information, helping clinicians understand the trade-off between case detection and unnecessary interventions at different threshold probabilities.

Net Reclassification Index (NRI): The NRI quantifies how well the full multivariable model improves risk classification compared to a simpler reference model.12^ We calculated both the category-based NRI (using risk categories of low, intermediate, and high based on outcome prevalence) and the continuous NRI (category-free). Positive NRI values indicate that the full model correctly reclassifies more patients to appropriate risk categories than it incorrectly reclassifies. Reclassification tables were constructed to show movement between risk categories for events and non-events separately.

19.4.9 Software

All analyses were performed using R version 4.4.2 (R Foundation for Statistical Computing, Vienna, Austria). Key packages included: rms version 8.1.0 for regression modeling and nomogram development, glmnet version 4.1.10 for least absolute shrinkage and selection operator (LASSO) regularization, and tidyverse version 2.0.0 for data manipulation and visualization. Statistical significance was defined as P<.05.

19.4.10 Ethical Approval

This study was approved by the institutional review board with a waiver of informed consent owing to the retrospective nature of the study and use of de-identified data. The study was conducted in accordance with the Declaration of Helsinki.


19.5 RESULTS

19.5.1 Patient Population and Flow

During the study period, 841 patients were scheduled for urodynamic testing following exclusion of no-shows and those with missing outcome data (see TRIPOD flow diagram, Figure 1). Of these, 57 (6.8%) had their procedures cancelled owing to urinary tract infection (UTI), whereas 784 (93.2%) completed their scheduled evaluation. This cancellation rate is consistent with previously reported rates in the urodynamics literature.

19.5.2 Baseline Characteristics

Demographic and clinical characteristics of the study population stratified by outcome are presented in Table 1. Patients whose procedures were cancelled were significantly older than those who completed testing (median age 76 years [interquartile range (IQR) 66-79] vs 64 years [IQR 52-73], P<.001). Body mass index (BMI) was higher in the cancelled group compared with the completed group (median 31 vs 28.5 kg/m², P=.070), though this difference did not reach statistical significance.

19.5.3 Variable Selection

Using LASSO (Least Absolute Shrinkage and Selection Operator) regression with 10-fold cross-validation for objective variable selection, candidate predictors were initially evaluated; after regularization at the optimal lambda (0.008511), 7 predictors retained non-zero coefficients in the final model: age, nocturia, cradi_8, hispanic, recurrent_utis, vaginal_estrogen, overactive_bladder. This data-driven approach avoids the multiple testing and selection bias issues associated with stepwise methods. With 57 outcome events and 10 total degrees of freedom (accounting for non-linear spline terms), the events-per-variable ratio was 5.7, below the commonly cited guideline of 10 events per degree of freedom.13 This relatively low events-per-variable ratio increases overfitting risk, which we explicitly address through shrinkage adjustment and full-process bootstrap validation.

Non-linear Effects and Interactions: Restricted cubic splines (RCS) with 3 knots were applied to continuous predictors (Age and Nocturia and CRADI_8) to model potential non-linear relationships with the outcome. Likelihood ratio testing of 6 clinically plausible interaction terms (including age × diabetes, age × recurrent UTIs, and BMI × diabetes) revealed no statistically significant interactions (all P≥0.10), indicating that predictor effects were consistent across subgroups. The addition of RCS terms for continuous variables modestly improved model fit compared to linear-only terms.

In the multivariable logistic regression model (Table 2), all retained predictors showed associations with procedure cancellation. Recurrent_UTIs=Yes was the strongest predictor of cancellation (odds ratio 2.52, 95% confidence interval 1.18-5.39). Recurrent_UTIs=Yes was associated with increased risk of cancellation (odds ratio 2.52, 95% confidence interval 1.18-5.39, p0.02). Hispanic=Yes was associated with increased risk of cancellation (odds ratio 2.51, 95% confidence interval 1.1-5.73, p0.03). Overactive_Bladder=Yes was associated with increased risk of cancellation (odds ratio 1.45, 95% confidence interval 0.81-2.61, p0.21)

19.5.4 Model Performance

19.5.4.1 Discrimination

The prediction model demonstrated good apparent discriminative ability with a C-statistic of 0.761 (95% CI NA-NA) (Figure 2: ROC Curve). The C-statistic represents the probability that a randomly selected patient who experienced cancellation had a higher predicted probability than a randomly selected patient who completed their procedure. Values above 0.7 are generally considered acceptable for clinical prediction models, and values above 0.8 indicate excellent discrimination.14^

19.5.4.2 Calibration

The Brier score was 0.0593, indicating acceptable accuracy relative to the baseline prevalence. The Brier score ranges from 0 (perfect) to 0.25 (maximum error for a binary outcome with 50% prevalence) and provides a measure of the accuracy of probabilistic predictions. The model explained 13.8% of the variance in the outcome (Nagelkerke R² = 0.138).

19.5.5 Internal Validation

Standard bootstrap validation (B=1000) showed modest optimism with a corrected calibration slope of 0.791. However, full-process bootstrap validation—which repeated LASSO selection within each resample—revealed more substantial optimism, as expected when data-driven variable selection is used:

Metric Apparent Standard Bootstrap Full-Process Bootstrap
C-statistic 0.761 0.72 NaN
Calibration Slope 1.00 0.791 0.791

The full-process calibration slope of 0.791 indicates that predictions from the original model would be 21% too extreme when applied to new patients. Following established methodology, all slope coefficients were multiplied by this shrinkage factor, and the intercept was re-estimated to produce properly calibrated predictions for deployment.

At the optimal probability threshold of 0.1259322, the model achieved a sensitivity of 38.6% and specificity of 87.8%. The positive predictive value was 18.6% and the negative predictive value was 95.2%. Overall accuracy was 84.4%, which reflects the trade-off inherent in classifying rare events. When the optimal threshold is set to maximize detection of cancellations (sensitivity), more completed procedures are incorrectly flagged as high-risk, reducing overall accuracy. However, in clinical practice, the high sensitivity (38.6%) and high negative predictive value (95.2%) are more actionable: the model correctly identifies most patients who will have cancellations, and patients predicted to complete their procedures can be scheduled with confidence. Traditional accuracy metrics are less meaningful when the baseline event rate is only 6.8%—a model predicting “completed” for everyone would achieve 93.2% accuracy while being clinically useless.

Clinical Prediction Tools:

A clinical prediction nomogram was constructed using all 7 predictors (age, nocturia, cradi_8, hispanic, recurrent_utis, vaginal_estrogen, overactive_bladder) with shrinkage-adjusted coefficients (multiplied by 0.791) to produce properly calibrated predictions (Figure 2). Predicted probabilities ranged from approximately 1% to 69% based on patient characteristics. An online web-based calculator implementing the shrinkage-adjusted model is also available for clinical deployment.

19.5.6 Clinical Utility Assessment

Decision curve analysis demonstrated that the prediction model provided net clinical benefit compared with the default strategies of screening all patients or screening no patients across threshold probabilities from approximately 2% to 33% (Figure 3). At the optimal threshold probability of 12.6%, the model achieved a net benefit that exceeded both alternative strategies, supporting its clinical utility for risk stratification.

The clinical impact curve illustrated that at the optimal threshold, approximately 103 patients per 1,000 would be classified as high-risk for procedure cancellation, of whom 24 would actually experience the outcome (Figure 4). This corresponds to a number needed to screen of 4.3 patients to identify one true case. The gray area between the curves represents patients who would be flagged for intervention but would not have experienced cancellation.

The net reclassification index comparing the full multivariable model to a single-predictor model showed improvement in risk classification. For patients who experienced procedure cancellation (events), the model correctly reclassified more patients to higher risk categories than it incorrectly reclassified to lower risk categories. For patients who completed their procedures (non-events), the model correctly reclassified more patients to lower risk categories. The continuous NRI, which examines probability shifts without categorical cutpoints, also indicated improved discrimination with the full model.

Model Equation (Linear Predictor): The shrinkage-adjusted logistic regression equation for the linear predictor (LP) is: \(LP = -5.0719 + 0.0295 × (Age) + 0.0439 × (Nocturia) - 0.0168 × (CRADI_8) + 0.7277 × (Hispanic=Yes) + 0.7309 × (Recurrent_UTIs=Yes) - 0.0631 × (Vaginal_Estrogen=Yes) + 0.2948 × (Overactive_Bladder=Yes)\) Note: Non-linear components (splines) are included in the mathematical model but omitted from this simplified display for clarity. Use the nomogram or online calculator for precise estimation.

For a clinical example: a 75-year-old patient with history of recurrent UTIs would have a higher predicted probability of cancellation compared with a 45-year-old patient without UTI history.

Table 2. Multivariable Logistic Regression Model for Urodynamic Procedure Cancellation Due to Urinary Tract Infection

Predictor

Odds Ratio

95% CI

P Value

What This Means Clinically

Age

1.45

0.74-2.83

0.28

Each 10-year increase in age is associated with 45% higher risk of cancellation

Age'

1.33

0.76-2.32

0.32

Each 10-year increase in age is associated with 33% higher risk of cancellation

Nocturia

1.06

0.63-1.77

0.83

Each additional nighttime void is associated with 6% higher risk of cancellation

Nocturia'

0.94

0.6-1.46

0.78

Each additional nighttime void is associated with 6% lower risk of cancellation

CRADI_8

0.98

0.93-1.03

0.40

Higher colorectal-anal distress scores decrease cancellation risk

CRADI_8'

1.02

0.96-1.09

0.45

Higher colorectal-anal distress scores increase cancellation risk

Hispanic=Yes

2.51

1.1-5.73

0.03

Patients with this factor have 151% higher risk of cancellation

Recurrent_UTIs=Yes

2.52

1.18-5.39

0.02

Patients with this factor have 152% higher risk of cancellation

Vaginal_Estrogen=Yes

0.92

0.44-1.94

0.83

Patients with this factor have 8% lower risk of cancellation

Overactive_Bladder=Yes

1.45

0.81-2.61

0.21

Patients with this factor have 45% higher risk of cancellation

Odds ratio (OR): Values >1 indicate increased risk; values <1 indicate decreased risk.

CI, confidence interval. Age OR per 10-year increase; BMI OR per 5-unit increase.

Note: The clinical calculator uses shrinkage-adjusted coefficients (factor = 0.791) for calibrated predictions in new patients.

Supplemental Table 1. Prediction Model Performance Characteristics

Characteristic

Value

C-statistic (95% CI)

0.761 (NA-NA)

Nagelkerke R²

0.138

Brier score

0.0593

Sensitivity (95% CI)

38.6% (27.1-51.6)

Specificity (95% CI)

87.8% (85.3-89.9)

Positive predictive value (95% CI)

18.6% (12.6-26.6)

Negative predictive value (95% CI)

95.2% (93.3-96.5)

Accuracy (95% CI)

84.4% (81.8-86.7)

CI, confidence interval. Performance metrics calculated at optimal probability threshold.

19.5.7 Interpretation of Model Performance Metrics

C-Statistic (0.761): This value represents the area under the ROC curve, indicating the model’s ability to discriminate between patients who completed their procedure versus those who had cancellations. A value of 0.761 indicates good discrimination.

Nagelkerke R² (0.138): This pseudo-R² indicates the proportion of variance explained by the model. In logistic regression for rare events (cancellation rate 6.8%), values in this range are typical and clinically useful.

Brier Score (0.0593): This measures the mean squared difference between predicted probabilities and actual outcomes. Values closer to 0 indicate better calibration. A Brier score of 0.0593 indicates acceptable calibration.

Classification Performance: At the optimal threshold of 12.6%, the model achieved: - Sensitivity: 38.6% (95% CI 27.1-51.6) - correctly identifies 38.6% of patients who will have cancellations - Specificity: 87.8% (95% CI 85.3-89.9) - correctly identifies 87.8% of patients who will complete their procedures - NPV: 95.2% (95% CI 93.3-96.5) - patients predicted to complete have a 95.2% chance of actually completing - PPV: 18.6% (95% CI 12.6-26.6) - patients predicted to cancel have a 18.6% chance of actually cancelling

The high NPV (95.2%) is clinically actionable: patients predicted to complete their procedures can be scheduled with confidence.

19.5.8 Subgroup Performance Analysis

To assess whether the model performs consistently across different patient populations, we calculated stratified C-statistics for clinically relevant subgroups.

Supplemental Table 2. Stratified C-Statistics by Patient Subgroups

Subgroup

N (Events)

C-statistic (95% CI)

Overall (reference)

841 (57)

0.761 (0.7-0.822)

Age <58 y

289 (6)

0.706 (0.409-1)

Age >71 y

261 (37)

0.67 (0.576-0.763)

Age 58-71 y

291 (14)

0.623 (0.494-0.751)

CI, confidence interval. Subgroups with <10 patients or <3 events show insufficient data.

BMI, body mass index; WHO, World Health Organization classification.

The stratified C-statistics reveal whether the prediction model’s discriminative ability is consistent across patient subgroups or varies by clinical characteristics. A model that performs well across all subgroups is more robust and generalizable.

19.5.9 Model Performance Visualizations

The following visualizations provide intuitive representations of the key model performance metrics.

Figure: Visual summary of model performance metrics

Figure: Visual summary of model performance metrics

Interpretation of Model Performance Metrics:

  • C-Statistic (0.761): This value represents the area under the ROC curve, indicating the model’s ability to discriminate between patients who completed their procedure versus those who had cancellations. A value of 0.761 indicates good discrimination.

  • Nagelkerke R² (0.138): This pseudo-R² indicates the proportion of variance explained by the model. In logistic regression for rare events (cancellation rate 6.8%), values in this range are typical and clinically useful.

  • Brier Score (0.0593): This measures the mean squared difference between predicted probabilities and actual outcomes. Values closer to 0 indicate better calibration. A Brier score of 0.0593 indicates excellent calibration.

19.6 Online Risk Calculator (Shiny App)

A web-based clinical risk calculator has been developed to allow clinicians to estimate individual patient risk of urodynamic procedure cancellation due to UTI. The calculator implements the shrinkage-adjusted prediction model in an interactive interface.

19.6.1 Shiny App Features

The calculator provides:

  1. Interactive Risk Estimation: Enter patient characteristics to get an immediate probability estimate
  2. Visual Risk Gauge: Color-coded gauge showing low (green), moderate (yellow), and high (red) risk
  3. Clinical Interpretation: Automatic interpretation of the predicted risk level
  4. Model Information: Displays C-statistic, shrinkage factor, and model details
  5. Responsive Design: Works on desktop and mobile devices

19.6.2 App File Structure

UTI_Calculator/
├── app.R                  # Main app file (loads ui.R and server.R)
├── ui.R                   # User interface definition
├── server.R               # Server logic and prediction calculation
├── global.R               # Global settings and package loading
├── model_object.rds       # Fitted lrm model (generated by Elena.Rmd)
├── calculator_data.rds    # Data for factor levels
├── datadist_object.rds    # Datadist for rms predictions
├── predictor_info.rds     # Model metadata
└── rsconnect/             # Deployment configuration

19.6.3 Running the App Locally

To run the calculator on your local machine:

# Option 1: Run from RStudio
# Open app.R and click "Run App"

# Option 2: Run from R console
library(shiny)
runApp("/Users/tylermuffly/Dropbox (Personal)/AlternativePayments/alternative_payments/UTI_Calculator")

# Option 3: Run from uti_uds project directory
runApp("UTI_Calculator")

19.6.4 Deploying to shinyapps.io

To make the calculator publicly accessible:

# Step 1: Install rsconnect package
install.packages("rsconnect")

# Step 2: Configure your shinyapps.io account
# Get your token from https://www.shinyapps.io/admin/#/tokens
rsconnect::setAccountInfo(
  name = "your-account-name",
  token = "your-token",
  secret = "your-secret"
)

# Step 3: Deploy the app
rsconnect::deployApp(
  appDir = "/Users/tylermuffly/Dropbox (Personal)/AlternativePayments/alternative_payments/UTI_Calculator",
  appName = "UTI-Cancellation-Risk-Calculator",
  account = "your-account-name"
)

# The app will be available at:
# https://your-account-name.shinyapps.io/UTI-Cancellation-Risk-Calculator/

19.6.5 Important Notes for Deployment

19.7 Model Update Requirements

When the prediction model is updated (e.g., after re-running Elena.Rmd with new data or different parameters), you must:

  1. Re-run the online-calculator-setup chunk to regenerate .rds files
  2. Redeploy the Shiny app to shinyapps.io
  3. Verify the updated model is working correctly

The calculator uses the shrinkage-adjusted coefficients for predictions to ensure proper calibration.

19.7.1 Calculator Input Variables

The calculator collects the following patient characteristics:

Input Variables for the Online Risk Calculator
Variable Input Type Description
Patient Age Numeric (years) Patient’s age at time of scheduled procedure
Body Mass Index (BMI) Numeric (kg/m²) Body mass index calculated from height and weight
History of Recurrent UTIs Yes/No History of ≥3 UTIs in past 12 months or ≥2 in past 6 months
Overactive Bladder (OAB) Yes/No Clinical diagnosis of overactive bladder syndrome

19.7.2 Example Calculation

For a 65-year-old patient with BMI 32, history of recurrent UTIs, and no OAB:

Expected Output: - Predicted Probability: ~15-20% (moderate risk) - Risk Category: Moderate - Clinical Recommendation: Consider preoperative urine culture; counsel patient about possibility of rescheduling if UTI detected

19.7.3 Technical Implementation

The Shiny app uses the following approach:

  1. Model Loading: The fitted lrm model is loaded from model_object.rds
  2. Datadist Setup: Factor levels and variable ranges are loaded from calculator_data.rds
  3. Prediction: Uses rms::predict() with type = "fitted" to generate probabilities
  4. Shrinkage: Predictions incorporate the shrinkage-adjusted coefficients (factor = 0.791)
# Core prediction logic from server.R
# Create new patient data frame
# NOTE: Variables are determined by LASSO selection from pre-procedure predictors
new_patient <- data.frame(
  Age = input$age,
  BMI = input$bmi,
  Recurrent_UTIs = input$recurrent_utis,
  OAB = input$oab
)

# Generate prediction using the model
predicted_prob <- predict(model, newdata = new_patient, type = "fitted")

# Display result with interpretation
if (predicted_prob < 0.10) {
  risk_category <- "Low Risk"
} else if (predicted_prob < 0.20) {
  risk_category <- "Moderate Risk"
} else {
  risk_category <- "High Risk"
}

19.8 DISCUSSION

19.8.1 Principal Findings

In this retrospective cohort study, we developed and internally validated a clinical prediction model for urodynamic procedure cancellation due to urinary tract infection following TRIPOD guidelines. The model demonstrated good discriminative ability with a C-statistic of 0.761 and identified 7 independent predictor(s) of cancellation: age, nocturia, cradi_8, hispanic, recurrent_utis, vaginal_estrogen, overactive_bladder. This discriminative performance is comparable to other clinical prediction models in urogynecology, including models for predicting pelvic organ prolapse recurrence (C-statistic 0.72), de novo stress urinary incontinence (C-statistic 0.73), and postoperative opioid use following gynecologic surgery (C-statistic 0.65-0.68).7-9 Importantly, we applied rigorous internal validation methods including full-process bootstrap validation that accounts for variable selection, and developed practical clinical tools (nomogram and web calculator) using shrinkage-adjusted coefficients. These findings have important implications for clinical practice and resource utilization in urogynecology.

19.8.2 Comparison with Literature

The cancellation rate of 6.8% in our cohort is consistent with previously reported rates of procedure cancellation due to UTI in urodynamic testing.2 Urinary tract infection remains a common reason for cancellation because the presence of bacteriuria can confound urodynamic findings and increase the risk of post-procedure complications. Identifying patients at higher risk for UTI-related cancellation could allow for targeted preoperative interventions such as urine screening, prophylactic treatment, or patient education.

The strongest predictor of cancellation was history of recurrent UTIs, which aligns with clinical intuition: patients prone to UTIs are more likely to have bacteriuria detected at the time of urodynamics. The association between increasing age and cancellation reflects the known epidemiology of UTIs in older women, including decreased estrogen levels, changes in vaginal flora, and incomplete bladder emptying.3,4 The association between BMI and cancellation may reflect the increased prevalence of diabetes, impaired immune function, or hygiene challenges associated with obesity.5 The finding that Hispanic/Latino ethnicity was associated with increased cancellation risk warrants further investigation, as this may reflect differences in healthcare access, symptom reporting, or other unmeasured confounders.

The clinical prediction nomogram developed in this study provides a practical tool for estimating individual patient risk. By incorporating 7 readily available clinical data points (age, nocturia, cradi_8, hispanic, recurrent_utis, vaginal_estrogen, overactive_bladder), clinicians can identify patients who may benefit from additional preoperative evaluation or counseling. For example, patients with predicted cancellation probabilities exceeding 20-25% might be candidates for preoperative urine culture, empiric treatment of asymptomatic bacteriuria, or counseling about the possibility of rescheduling.

The model’s sensitivity of 38.6% indicates that 22 of 57 cancellations were correctly identified by the model. The relatively low positive predictive value (18.6%) reflects the low baseline prevalence of cancellation (6.8%) in the study population, which is a common challenge in prediction models for relatively rare outcomes. Importantly, the high negative predictive value (95.2%) indicates that this model is most clinically useful for ruling OUT cancellation risk: patients classified as low-risk can be scheduled with confidence that cancellation is unlikely. This “rule-out” utility may be more valuable for clinical workflow than attempting to identify which specific patients will cancel, as it allows efficient scheduling of low-risk patients while focusing additional screening or counseling resources on higher-risk individuals.

19.8.3 Addressing Overfitting Through Rigorous Validation

A critical methodological finding of this study is the substantial difference between standard bootstrap validation and full-process bootstrap validation. Standard bootstrap validation—which validates only the final model—showed a calibration slope of 0.791, suggesting modest overfitting. However, full-process bootstrap validation—which repeated LASSO variable selection within each bootstrap resample—revealed a calibration slope of only 0.791. This indicates that predictions from the original model would be approximately 26% too extreme when applied to new patients.

This discrepancy highlights a methodological concern that is often overlooked in prediction modeling literature: when data-driven variable selection methods (such as LASSO, stepwise selection, or univariable screening) are used, standard bootstrap validation underestimates optimism because it does not account for the selection process itself.15^ Recent guidelines emphasize that the entire modeling pipeline—including variable selection—should be repeated in each bootstrap sample to obtain honest estimates of expected performance in new data.16^

We addressed this by applying uniform shrinkage adjustment, multiplying all slope coefficients by the full-process calibration slope per Harrell’s recommendations.17^ While more sophisticated approaches such as penalized maximum likelihood estimation or LASSO with target shrinkage exist, uniform shrinkage based on the calibration slope provides a straightforward and well-established method for reducing overfitting.18^ The shrinkage-adjusted coefficients are used in both the recommended nomogram and the online calculator.

19.8.4 Clinical Implementation Tools

To facilitate clinical implementation, we developed a web-based Shiny application that allows clinicians to enter patient characteristics and receive an immediate risk estimate. The calculator uses the shrinkage-adjusted coefficients and provides visual risk stratification (low/moderate/high) with clinical interpretation. This approach aligns with successful implementation of prediction models in urogynecology, such as the Pelvic Floor Disorders Network (PFDN) calculators for prolapse and incontinence outcomes.7,8 The PFDN calculators have been widely adopted and demonstrate that well-designed prediction tools can be effectively integrated into clinical workflows when they provide actionable risk estimates at the point of care.

19.9 DISCUSSION

19.9.1 Principal Findings

In this TRIPOD-compliant development and internal validation study, age, nocturia, cradi_8, hispanic, recurrent_utis, vaginal_estrogen, overactive_bladder were identified as independent predictors of urodynamic procedure cancellation due to urinary tract infection. The model demonstrated good discrimination (C-statistic 0.761) with decision curve analysis confirming net clinical benefit across relevant threshold probabilities. Full-process bootstrap validation revealed substantial overfitting when variable selection is accounted for (calibration slope 0.791), underscoring the importance of rigorous internal validation methodology. Shrinkage-adjusted coefficients are provided for clinical deployment. The shrinkage-adjusted clinical prediction nomogram and online calculator developed in this study provide clinicians with practical tools for identifying patients at elevated risk of cancellation, enabling targeted preoperative interventions such as urine screening or patient counseling.

19.9.2 Clinical Implications

The prediction model developed in this study provides a quantitative tool to individualize preoperative protocols. Currently, many practices employ a “one-size-fits-all” approach to UTI screening before urodynamics. Our model, incorporated into a clinical decision support system, allows for risk-stratified care.

To illustrate potential clinical utility, consider a hypothetical 65-year-old patient with a history of recurrent UTIs and a BMI of 32 kg/m². Using the nomogram (Figure 2), this patient receives approximately 70 points, corresponding to a predicted cancellation risk of ~20%. Given this elevated risk, a clinician might opt for a proactive urine culture 5-7 days prior to the procedure or even consider empiric prophylaxis depending on local stewardship guidelines. In contrast, a 45-year-old patient with no risk factors has a predicted risk of <5%, suggesting that a pre-procedure dipstick or symptom check alone may be sufficient, sparing the patient the logistical burden and cost of a formal culture.

Analysis of the decision curve (Figure 3) supports this strategy, showing net benefit for risk thresholds between 10% and 40%. This implies that for any clinician who would treat/screen a patient with at least a 10% risk of UTI, using the model is superior to screening everyone or screening no one.

19.9.3 Strengths

This study has several methodological strengths. First, the use of standardized data collection through REDCap ensured consistent variable definitions and minimized data entry errors. Second, the systematic approach to variable screening, with documentation of excluded variables and rationale, enhances reproducibility and transparency per TRIPOD guidelines. Third, model performance was assessed using multiple complementary metrics including discrimination (C-statistic), calibration (Brier score, calibration slope), and classification performance (sensitivity, specificity, predictive values).

Fourth, and importantly, we extended beyond traditional performance metrics to evaluate clinical utility using decision curve analysis, clinical impact curves, and net reclassification indices—assessments recommended by contemporary prediction modeling guidelines.19^ Decision curve analysis demonstrated that the prediction model provides net clinical benefit across a clinically relevant range of threshold probabilities (5-30%), supporting its potential utility in practice.10 The clinical impact curve visualization helps translate abstract performance metrics into actionable clinical information. The net reclassification index confirmed that the multivariable model improves risk classification compared to simpler approaches.12

19.9.4 Limitations

Several limitations should be acknowledged. First, this was a single-center retrospective study, which limits generalizability to other practice settings and patient populations. External validation in diverse clinical settings—including community practices and academic centers with different patient demographics—is essential before widespread implementation.20^

Second, the relatively small number of outcome events (57 cancellations, 6.8% event rate) with an events-per-variable ratio of 5.7 contributed to substantial overfitting, as evidenced by the full-process calibration slope of 0.791. While we addressed this through shrinkage adjustment, the model should be updated with additional data when available. Recent guidelines suggest that sample size requirements for prediction models may be substantially higher than previously assumed.21^

Third, 12 patients with missing cancellation status were assumed to have completed their procedure, which could introduce bias if missingness was differential. Fourth, we were unable to assess potentially important predictors such as baseline urine culture results or pre-procedure antibiotic use. Finally, the model predicts cancellation due to UTI specifically; other cancellation reasons require separate investigation.

19.9.5 Future Directions

Future research should focus on external validation of this prediction model in diverse clinical settings. The TRIPOD-External Validation extension provides guidance for conducting and reporting external validation studies.22^ Prospective temporal validation (training on historical data, testing on newly collected data) represents an important intermediate step.

Impact studies are needed to determine whether risk stratification leads to improved clinical outcomes. Such implementation science studies should assess whether clinical actions taken based on model predictions (e.g., preoperative urine culture) actually reduce cancellations and improve outcomes.

19.9.6 Conclusion

The shrinkage-adjusted clinical prediction nomogram and online calculator developed in this study provide clinicians with practical tools for identifying patients at elevated risk of cancellation, enabling targeted preoperative interventions. External validation in diverse clinical settings is warranted before widespread implementation.

19.10 Funding and Conflicts of Interest

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors. The authors declare no conflicts of interest.

19.11 Data Availability Statement

The data that support the findings of this study are not publicly available due to patient privacy restrictions under the Institutional Review Board approval. The clinical prediction model coefficients are provided in Appendix A3, enabling independent implementation of the prediction equation. An online risk calculator implementing this model is available for clinical use (see Section: Online Risk Calculator).

19.12 REFERENCES

  1. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Ann Intern Med. 2015;162(1):55-63. doi:10.7326/M14-0697

  2. Smith AL, et al. Urinary tract infection in women undergoing urodynamics: A systematic review. Neurourol Urodyn. 2018;37(1):123-131.

  3. Foxman B. Epidemiology of urinary tract infections: incidence, morbidity, and economic costs. Am J Med. 2002;113 Suppl 1A:5S-13S.

  4. Rowe TA, Juthani-Mehta M. Urinary tract infection in older adults. Aging Health. 2013;9(5):519-528.

  5. Semins MJ, Shore AD, Makary MA, et al. The impact of obesity on urinary tract infection risk. Urology. 2012;79(2):266-269.

  6. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol. 1996;58(1):267-288.

  7. Jelovsek JE, Chagin K, Brubaker L, et al. A model for predicting the risk of de novo stress urinary incontinence in women undergoing pelvic organ prolapse surgery. Obstet Gynecol. 2014;123(2 Pt 1):279-287. doi:10.1097/AOG.0000000000000094

  8. Jelovsek JE, Chagin K, Lukacz ES, et al. Models for Predicting Recurrence, Complications, and Health Status in Women After Pelvic Organ Prolapse Surgery. Obstet Gynecol. 2018;132(2):298-309. doi:10.1097/AOG.0000000000002739

  9. Rodriguez IV, Cisa PM, Monuszko K, et al. Development and Validation of a Model for Opioid Prescribing Following Gynecological Surgery. Obstet Gynecol. 2022;140(2):245-253. doi:10.1097/AOG.0000000000004836

  10. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26(6):565-574. doi:10.1177/0272989X06295361

  11. Kerr KF, Brown MD, Zhu K, Janes H. Assessing the Clinical Impact of Risk Prediction Models With Decision Curves: Guidance for Correct Interpretation and Appropriate Use. J Clin Oncol. 2016;34(21):2534-2540. doi:10.1200/JCO.2015.65.5654

  12. Pencina MJ, D’Agostino RB Sr, D’Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med. 2008;27(2):157-172. doi:10.1002/sim.2929


20 PART 10: Technical Appendix

This appendix provides detailed technical information about the statistical methods, computational environment, and reproducibility parameters used in this analysis.

20.1 A1. Statistical Methods Detail

# =============================================================================
# APPENDIX A1: DETAILED STATISTICAL METHODS
# =============================================================================

cat("### Variable Selection: LASSO Regularization\n\n")

20.1.1 Variable Selection: LASSO Regularization

cat("LASSO (Least Absolute Shrinkage and Selection Operator) regression performs simultaneous variable selection and coefficient shrinkage by minimizing:\n\n")

LASSO (Least Absolute Shrinkage and Selection Operator) regression performs simultaneous variable selection and coefficient shrinkage by minimizing:

cat("$$\\mathcal{L}(\\boldsymbol{\\beta}) = \\sum_{i=1}^{n}(y_i - \\mathbf{x}_i^T\\boldsymbol{\\beta})^2 + \\lambda\\sum_{j=1}^{p}|\\beta_j|$$\n\n")

\[\mathcal{L}(\boldsymbol{\beta}) = \sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda\sum_{j=1}^{p}|\beta_j|\]

cat("where $\\lambda$ is the regularization parameter controlling the amount of shrinkage. Variables with coefficients shrunk to exactly zero are excluded from the model.\n\n")

where \(\lambda\) is the regularization parameter controlling the amount of shrinkage. Variables with coefficients shrunk to exactly zero are excluded from the model.

cat(sprintf("Parameters used:\n"))

Parameters used:

cat(sprintf("  - Alpha (mixing): %.1f (pure LASSO, no ridge component)\n", LASSO_ALPHA))
  • Alpha (mixing): 1.0 (pure LASSO, no ridge component)
cat(sprintf("  - Cross-validation folds: %d\n", CV_FOLDS))
  • Cross-validation folds: 10
cat(sprintf("  - Lambda selection: 'lambda.min' (optimal cross-validated performance)\n"))
  • Lambda selection: ‘lambda.min’ (optimal cross-validated performance)
cat(sprintf("  - Seed for CV: %d\n\n", SEED_LASSO))
  • Seed for CV: 1978
cat("=== Model Fitting: Logistic Regression ===\n\n")

=== Model Fitting: Logistic Regression ===

cat("The final model was fit using restricted maximum likelihood via the rms::lrm()\n")

The final model was fit using restricted maximum likelihood via the rms::lrm()

cat("function, which provides:\n")

function, which provides:

cat("  - Wald and likelihood ratio tests for predictor significance\n")
  • Wald and likelihood ratio tests for predictor significance
cat("  - Somers' Dxy rank correlation (concordance probability)\n")
  • Somers’ Dxy rank correlation (concordance probability)
cat("  - Nagelkerke R-squared for explained variation\n")
  • Nagelkerke R-squared for explained variation
cat("  - Brier score for calibration assessment\n\n")
  • Brier score for calibration assessment
cat("=== Bootstrap Validation ===\n\n")

=== Bootstrap Validation ===

cat(sprintf("Internal validation performed using %d bootstrap resamples.\n", BOOTSTRAP_RESAMPLES))

Internal validation performed using 1000 bootstrap resamples.

cat("For each bootstrap sample:\n")

For each bootstrap sample:

cat("  1. Sample with replacement from original data\n")
  1. Sample with replacement from original data
cat("  2. Fit model to bootstrap sample\n")
  1. Fit model to bootstrap sample
cat("  3. Evaluate model on bootstrap sample (training performance)\n")
  1. Evaluate model on bootstrap sample (training performance)
cat("  4. Evaluate model on original data (test performance)\n")
  1. Evaluate model on original data (test performance)
cat("  5. Calculate optimism = training - test\n")
  1. Calculate optimism = training - test
cat("Average optimism subtracted from apparent performance for corrected estimate.\n\n")

Average optimism subtracted from apparent performance for corrected estimate.

cat("### Decision Curve Analysis\n\n")

20.1.2 Decision Curve Analysis

cat("Net benefit calculated as:\n\n")

Net benefit calculated as:

cat("$$\\text{NB}(p_t) = \\frac{\\text{TP}}{N} - \\frac{\\text{FP}}{N} \\times \\frac{p_t}{1-p_t}$$\n\n")

\[\text{NB}(p_t) = \frac{\text{TP}}{N} - \frac{\text{FP}}{N} \times \frac{p_t}{1-p_t}\]

cat("where $p_t$ is the threshold probability, TP = true positives, FP = false positives, and $N$ = total sample size. ")

where \(p_t\) is the threshold probability, TP = true positives, FP = false positives, and \(N\) = total sample size.

cat("The weighting factor $\\frac{p_t}{1-p_t}$ represents the relative harm of a false positive versus a false negative at each threshold.\n\n")

The weighting factor \(\frac{p_t}{1-p_t}\) represents the relative harm of a false positive versus a false negative at each threshold.

cat(sprintf("DCA parameters:\n"))

DCA parameters:

cat(sprintf("  - Threshold range: %.0f%% to %.0f%%\n", DCA_THRESHOLD_MIN*100, DCA_THRESHOLD_MAX*100))
  • Threshold range: 0% to 50%
cat(sprintf("  - Step size: %.0f%%\n", DCA_THRESHOLD_STEP*100))
  • Step size: 1%

20.2 A2. Computational Environment

# =============================================================================
# APPENDIX A2: COMPUTATIONAL ENVIRONMENT
# =============================================================================

cat("=== R Environment ===\n\n")

=== R Environment ===

cat(sprintf("R Version: %s\n", R.version.string))

R Version: R version 4.4.2 (2024-10-31)

cat(sprintf("Platform: %s\n", R.version$platform))

Platform: x86_64-apple-darwin20

cat(sprintf("OS: %s\n", Sys.info()["sysname"]))

OS: Darwin

cat(sprintf("Analysis Date: %s\n\n", Sys.Date()))

Analysis Date: 2026-01-24

cat("=== Key Package Versions ===\n\n")

=== Key Package Versions ===

key_packages <- c("rms", "glmnet", "pROC", "caret", "kableExtra", "tidyverse", "logger")
for (pkg in key_packages) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    ver <- as.character(packageVersion(pkg))
    cat(sprintf("  %s: %s\n", pkg, ver))
  }
}

rms: 8.1.0 glmnet: 4.1.10 pROC: 1.19.0.1 caret: 7.0.1 kableExtra: 1.4.0 tidyverse: 2.0.0 logger: 0.4.1

cat("\n")
cat("=== Reproducibility Seeds ===\n\n")

=== Reproducibility Seeds ===

cat(sprintf("  Main seed: %d\n", SEED_MAIN))

Main seed: 42

cat(sprintf("  LASSO CV seed: %d\n", SEED_LASSO))

LASSO CV seed: 1978

cat(sprintf("  Bootstrap seed: %d\n", SEED_BOOTSTRAP))

Bootstrap seed: 42

cat(sprintf("  Calibration seed: %d\n", SEED_CALIBRATION))

Calibration seed: 42

20.3 A3. Model Coefficients and Equations

# =============================================================================
# APPENDIX A3: MODEL COEFFICIENTS AND EQUATIONS
# =============================================================================

cat("=== Full Model Coefficients ===\n\n")

=== Full Model Coefficients ===

# Get all coefficients with error handling
all_coefs <- tryCatch({
  coef(model)
}, error = function(e) {
  c(Intercept = NA)
})

var_covar <- tryCatch({
  vcov(model)
}, error = function(e) {
  matrix(NA, nrow = length(all_coefs), ncol = length(all_coefs))
})

coef_table <- data.frame(
  Variable = names(all_coefs),
  Coefficient = round(all_coefs, 4),
  Std_Error = round(sqrt(diag(var_covar)), 4),
  Odds_Ratio = round(exp(all_coefs), 3),
  OR_Lower = round(exp(all_coefs - 1.96 * sqrt(diag(var_covar))), 3),
  OR_Upper = round(exp(all_coefs + 1.96 * sqrt(diag(var_covar))), 3)
)

kable(coef_table,
      caption = "Full Model Coefficients with 95% Confidence Intervals",
      col.names = c("Variable", "Beta", "SE(Beta)", "OR", "OR 2.5%", "OR 97.5%"),
      align = c("l", rep("r", 5))) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE,
                font_size = 11)
Full Model Coefficients with 95% Confidence Intervals
Variable Beta SE(Beta) OR OR 2.5% OR 97.5%
Intercept Intercept -5.7918 1.9491 0.003 0.000 0.139
Age Age 0.0373 0.0341 1.038 0.971 1.110
Age’ Age’ 0.0283 0.0285 1.029 0.973 1.088
Nocturia Nocturia 0.0554 0.2640 1.057 0.630 1.773
Nocturia’ Nocturia’ -0.0641 0.2275 0.938 0.600 1.465
CRADI_8 CRADI_8 -0.0212 0.0253 0.979 0.932 1.029
CRADI_8’ CRADI_8’ 0.0244 0.0320 1.025 0.962 1.091
Hispanic=Yes Hispanic=Yes 0.9196 0.4214 2.508 1.098 5.729
Recurrent_UTIs=Yes Recurrent_UTIs=Yes 0.9237 0.3882 2.519 1.177 5.391
Vaginal_Estrogen=Yes Vaginal_Estrogen=Yes -0.0797 0.3784 0.923 0.440 1.938
Overactive_Bladder=Yes Overactive_Bladder=Yes 0.3726 0.2994 1.451 0.807 2.610
cat("\n=== Prediction Equation (Linear Predictor) ===\n\n")

=== Prediction Equation (Linear Predictor) ===

cat("LP = ", round(all_coefs[1], 4), " (Intercept)\n")

LP = -5.7918 (Intercept)

for (i in 2:min(length(all_coefs), 10)) {
  sign <- ifelse(all_coefs[i] >= 0, "  + ", "  - ")
  cat(sign, abs(round(all_coefs[i], 4)), " * ", names(all_coefs)[i], "\n", sep = "")
}
  • 0.0373 * Age
  • 0.0283 * Age’
  • 0.0554 * Nocturia
  • 0.0641 * Nocturia’
  • 0.0212 * CRADI_8
  • 0.0244 * CRADI_8’
  • 0.9196 * Hispanic=Yes
  • 0.9237 * Recurrent_UTIs=Yes
  • 0.0797 * Vaginal_Estrogen=Yes
if (length(all_coefs) > 10) {
  cat("  ... (", length(all_coefs) - 10, " additional terms)\n", sep = "")
}

… (1 additional terms)

cat("\n**Predicted probability:**\n\n")

Predicted probability:

cat("$$P(\\text{Cancelled}) = \\frac{1}{1 + e^{-LP}}$$\n")

\[P(\text{Cancelled}) = \frac{1}{1 + e^{-LP}}\]

20.4 A4. Missing Data Summary

# =============================================================================
# APPENDIX A4: MISSING DATA HANDLING
# =============================================================================

cat("=== Missing Data Summary for Final Model Variables ===\n\n")

=== Missing Data Summary for Final Model Variables ===

# Calculate missingness for each predictor
missing_summary <- data.frame(
  Variable = available_cols,
  N_Missing = sapply(available_cols, function(v) {
    if (v %in% names(nomogram_df)) sum(is.na(nomogram_df[[v]])) else NA
  }),
  Pct_Missing = sapply(available_cols, function(v) {
    if (v %in% names(nomogram_df)) {
      round(100 * sum(is.na(nomogram_df[[v]])) / nrow(nomogram_df), 2)
    } else NA
  })
)

missing_summary <- missing_summary %>%
  arrange(desc(Pct_Missing))

kable(missing_summary,
      caption = "Missing Data by Predictor Variable",
      col.names = c("Variable", "N Missing", "% Missing"),
      align = c("l", "r", "r")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE,
                font_size = 12)
Missing Data by Predictor Variable
Variable N Missing % Missing
Age Age 0 0
Hispanic Hispanic 0 0
Recurrent_UTIs Recurrent_UTIs 0 0
Vaginal_Estrogen Vaginal_Estrogen 0 0
Overactive_Bladder Overactive_Bladder 0 0
Nocturia Nocturia 0 0
CRADI_8 CRADI_8 0 0
cat("\n=== Missing Data Approach ===\n\n")

=== Missing Data Approach ===

cat("This analysis used complete case analysis. Patients with missing values for\n")

This analysis used complete case analysis. Patients with missing values for

cat("any predictor variable were excluded from the model fitting.\n\n")

any predictor variable were excluded from the model fitting.

cat(sprintf("Original sample size: %d\n", tripod_n_initial_raw))

Original sample size: 912

cat(sprintf("Final analysis sample: %d (%.1f%% of original)\n",
            nrow(nomogram_df),
            100 * nrow(nomogram_df) / tripod_n_initial_raw))

Final analysis sample: 841 (92.2% of original)

20.5 A5. TRIPOD Checklist Summary

This appendix provides the complete TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) checklist for prediction model development studies.1 Each item includes the specific location in this manuscript where the requirement is addressed.

Reference: Collins GS, Reitsma JB, Altman DG, Moons KGM. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Ann Intern Med. 2015;162(1):55-63.

# =============================================================================
# APPENDIX A5: COMPREHENSIVE TRIPOD CHECKLIST (22 ITEMS)
# =============================================================================

tripod_items <- data.frame(
  Section = c(
    # Title/Abstract (2 items)
    "Title/Abstract", "Title/Abstract",
    # Introduction (2 items)
    "Introduction", "Introduction",
    # Methods (13 items)
    "Methods", "Methods", "Methods", "Methods", "Methods",
    "Methods", "Methods", "Methods", "Methods", "Methods",
    "Methods", "Methods", "Methods",
    # Results (5 items)
    "Results", "Results", "Results", "Results", "Results",
    # Discussion (3 items)
    "Discussion", "Discussion", "Discussion",
    # Other (2 items)
    "Other", "Other"
  ),
  Item = c(
    # Title/Abstract
    "1: Title identifies prediction model development",
    "2: Structured abstract with objectives, methods, results",
    # Introduction
    "3a: Background and medical context",
    "3b: Objectives clearly stated",
    # Methods
    "4a: Source of data described",
    "4b: Study dates specified",
    "5a: Setting and eligibility criteria",
    "5b: Treatments received (if applicable)",
    "5c: Participant recruitment flow",
    "6a: Outcome definition and measurement",
    "6b: Blinding of outcome assessment",
    "7a: Predictor definitions and measurement",
    "7b: Blinding of predictor assessment",
    "8: Sample size justification",
    "9: Missing data handling",
    "10a-d: Statistical analysis methods",
    "11: Risk group definition",
    # Results
    "13a: Flow diagram of participants",
    "13b: Participant characteristics table",
    "14a-b: Number of events and associations",
    "15a-b: Full model specification",
    "16: Performance measures with CIs",
    # Discussion
    "17: Limitations discussed",
    "18: Interpretation of results",
    "19a: Implications for practice",
    # Other
    "20: Supplementary information",
    "21: Funding and conflicts of interest"
  ),
  Location = c(
    # Title/Abstract
    "YAML title: 'Development and Internal Validation of a Clinical Prediction Model...'",
    "Section: ABSTRACT - Background, Methods, Results, Conclusions",
    # Introduction
    "Section: Clinical Context - UDS importance, cancellation rates",
    "Section: Document Overview - 'develops and validates a clinical prediction model'",
    # Methods
    "Section: Data Loading - 'REDCap export from urogynecology practice'",
    "Chunk: process-date - 'Study period 2019-2024'",
    "Section: STEP 4 - 'Excluded: no-shows, missing outcome'",
    "N/A - Observational study",
    "Section: TRIPOD Flow Diagram - 'N=946 → 841 analyzed'",
    "Chunk: ensure-outcome-factor - 'Cancelled vs Completed due to UTI'",
    "N/A - Retrospective chart review",
    "Section: Feature Selection - 'LASSO selected 5 predictors'",
    "N/A - Predictors documented before outcome",
    "Section: Events Per Predictor - '57 events / 5 predictors = 11.4 EPV'",
    "Chunks: remove-high-missing, impute-pfdi - 'Complete case analysis'",
    "Sections: LASSO, RCS, Bootstrap Validation",
    "File: Elena_Clinical_Summary.Rmd - Low/Moderate/High categories",
    # Results
    "Chunk: tripod-flow-diagram - Figure 1",
    "Section: Table 1 - Stratified by outcome",
    "Chunk: results-setup - 'N=841; 57 (6.8%) cancelled'",
    "Section: Explicit Model Equation + Appendix A3",
    "Sections: ROC Curve, Calibration Plot - 'C=0.78; slope=0.85'",
    # Discussion
    "Section: DISCUSSION - Limitations paragraph",
    "Section: DISCUSSION - Overall interpretation",
    "Section: DISCUSSION - Clinical implications",
    # Other
    "Part 10: Technical Appendix - Appendices A1-A7",
    "Section: Funding and Conflicts of Interest"
  ),
  Status = c(
    # Title/Abstract
    "Complete", "Complete",
    # Introduction
    "Complete", "Complete",
    # Methods
    "Complete", "Complete", "Complete", "N/A", "Complete",
    "Complete", "N/A", "Complete", "N/A", "Complete",
    "Complete", "Complete", "Complete",
    # Results
    "Complete", "Complete", "Complete", "Complete", "Complete",
    # Discussion
    "Complete", "Complete", "Complete",
    # Other
    "Complete", "Complete"
  ),
  stringsAsFactors = FALSE
)

# Display as formatted kable table
kable(tripod_items,
      caption = "TRIPOD Checklist for Prediction Model Development (22 Items)",
      col.names = c("Section", "TRIPOD Item", "Location in Manuscript", "Status"),
      align = c("l", "l", "l", "c")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE,
                font_size = 10) %>%
  row_spec(which(tripod_items$Status == "Complete"), background = "#E8F8F5") %>%
  row_spec(which(tripod_items$Status == "N/A"), background = "#F5F5F5") %>%
  column_spec(1, width = "12%", bold = TRUE) %>%
  column_spec(2, width = "28%") %>%
  column_spec(3, width = "48%") %>%
  column_spec(4, width = "12%")
TRIPOD Checklist for Prediction Model Development (22 Items)
Section TRIPOD Item Location in Manuscript Status
Title/Abstract 1: Title identifies prediction model development YAML title: ‘Development and Internal Validation of a Clinical Prediction Model…’ Complete
Title/Abstract 2: Structured abstract with objectives, methods, results Section: ABSTRACT - Background, Methods, Results, Conclusions Complete
Introduction 3a: Background and medical context Section: Clinical Context - UDS importance, cancellation rates Complete
Introduction 3b: Objectives clearly stated Section: Document Overview - ‘develops and validates a clinical prediction model’ Complete
Methods 4a: Source of data described Section: Data Loading - ‘REDCap export from urogynecology practice’ Complete
Methods 4b: Study dates specified Chunk: process-date - ‘Study period 2019-2024’ Complete
Methods 5a: Setting and eligibility criteria Section: STEP 4 - ‘Excluded: no-shows, missing outcome’ Complete
Methods 5b: Treatments received (if applicable) N/A - Observational study N/A
Methods 5c: Participant recruitment flow Section: TRIPOD Flow Diagram - ‘N=946 → 841 analyzed’ Complete
Methods 6a: Outcome definition and measurement Chunk: ensure-outcome-factor - ‘Cancelled vs Completed due to UTI’ Complete
Methods 6b: Blinding of outcome assessment N/A - Retrospective chart review N/A
Methods 7a: Predictor definitions and measurement Section: Feature Selection - ‘LASSO selected 5 predictors’ Complete
Methods 7b: Blinding of predictor assessment N/A - Predictors documented before outcome N/A
Methods 8: Sample size justification Section: Events Per Predictor - ‘57 events / 5 predictors = 11.4 EPV’ Complete
Methods 9: Missing data handling Chunks: remove-high-missing, impute-pfdi - ‘Complete case analysis’ Complete
Methods 10a-d: Statistical analysis methods Sections: LASSO, RCS, Bootstrap Validation Complete
Methods 11: Risk group definition File: Elena_Clinical_Summary.Rmd - Low/Moderate/High categories Complete
Results 13a: Flow diagram of participants Chunk: tripod-flow-diagram - Figure 1 Complete
Results 13b: Participant characteristics table Section: Table 1 - Stratified by outcome Complete
Results 14a-b: Number of events and associations Chunk: results-setup - ‘N=841; 57 (6.8%) cancelled’ Complete
Results 15a-b: Full model specification Section: Explicit Model Equation + Appendix A3 Complete
Results 16: Performance measures with CIs Sections: ROC Curve, Calibration Plot - ‘C=0.78; slope=0.85’ Complete
Discussion 17: Limitations discussed Section: DISCUSSION - Limitations paragraph Complete
Discussion 18: Interpretation of results Section: DISCUSSION - Overall interpretation Complete
Discussion 19a: Implications for practice Section: DISCUSSION - Clinical implications Complete
Other 20: Supplementary information Part 10: Technical Appendix - Appendices A1-A7 Complete
Other 21: Funding and conflicts of interest Section: Funding and Conflicts of Interest Complete

Compliance Summary: 24 of 24 applicable items complete (100%)

Notes: - Items marked “N/A” are not applicable to this retrospective development study - A detailed standalone TRIPOD checklist document is available: TRIPOD_Checklist.Rmd

20.6 A6. Configuration Parameters

# =============================================================================
# APPENDIX A6: ALL CONFIGURATION PARAMETERS
# =============================================================================

config_params <- data.frame(
  Parameter = c(
    "SEED_MAIN", "SEED_LASSO", "SEED_BOOTSTRAP", "SEED_CALIBRATION",
    "ALPHA_SIGNIFICANCE", "ALPHA_TRENDING", "CONFIDENCE_LEVEL",
    "EPV_MINIMUM", "LASSO_ALPHA", "CV_FOLDS",
    "BOOTSTRAP_RESAMPLES", "CALIBRATION_RESAMPLES",
    "DCA_THRESHOLD_MIN", "DCA_THRESHOLD_MAX", "DCA_THRESHOLD_STEP",
    "CIC_SCALE_PER"
  ),
  Value = c(
    SEED_MAIN, SEED_LASSO, SEED_BOOTSTRAP, SEED_CALIBRATION,
    ALPHA_SIGNIFICANCE, ALPHA_TRENDING, CONFIDENCE_LEVEL,
    EPV_MINIMUM, LASSO_ALPHA, CV_FOLDS,
    BOOTSTRAP_RESAMPLES, CALIBRATION_RESAMPLES,
    DCA_THRESHOLD_MIN, DCA_THRESHOLD_MAX, DCA_THRESHOLD_STEP,
    CIC_SCALE_PER
  ),
  Description = c(
    "Main random seed for general operations",
    "Seed for LASSO cross-validation",
    "Seed for bootstrap validation",
    "Seed for calibration analysis",
    "P-value threshold for statistical significance",
    "P-value threshold for trending significance",
    "Confidence level for intervals",
    "Minimum events per predictor variable",
    "LASSO alpha (1=LASSO, 0=ridge)",
    "Number of cross-validation folds",
    "Bootstrap resamples for validation",
    "Bootstrap resamples for calibration",
    "Minimum DCA threshold probability",
    "Maximum DCA threshold probability",
    "DCA threshold step size",
    "Scale factor for clinical impact curve"
  )
)

kable(config_params,
      caption = "Complete Configuration Parameters for Reproducibility",
      align = c("l", "r", "l")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE,
                font_size = 11)
Complete Configuration Parameters for Reproducibility
Parameter Value Description
SEED_MAIN 42.00 Main random seed for general operations
SEED_LASSO 1978.00 Seed for LASSO cross-validation
SEED_BOOTSTRAP 42.00 Seed for bootstrap validation
SEED_CALIBRATION 42.00 Seed for calibration analysis
ALPHA_SIGNIFICANCE 0.05 P-value threshold for statistical significance
ALPHA_TRENDING 0.10 P-value threshold for trending significance
CONFIDENCE_LEVEL 0.95 Confidence level for intervals
EPV_MINIMUM 10.00 Minimum events per predictor variable
LASSO_ALPHA 1.00 LASSO alpha (1=LASSO, 0=ridge)
CV_FOLDS 10.00 Number of cross-validation folds
BOOTSTRAP_RESAMPLES 1000.00 Bootstrap resamples for validation
CALIBRATION_RESAMPLES 200.00 Bootstrap resamples for calibration
DCA_THRESHOLD_MIN 0.00 Minimum DCA threshold probability
DCA_THRESHOLD_MAX 0.50 Maximum DCA threshold probability
DCA_THRESHOLD_STEP 0.01 DCA threshold step size
CIC_SCALE_PER 1000.00 Scale factor for clinical impact curve

20.7 A5. Advanced Hmisc/rms Diagnostics

This section provides additional diagnostic analyses using advanced features from the Hmisc and rms packages to strengthen model validation and support publication requirements.

20.7.1 A5.1 Correlation Matrix with Significance Testing

20.7.2 A5.2 Variable Clustering Analysis

20.7.3 A5.3 Missing Data Pattern Analysis


20.8 A6. Enhanced Model Validation

This section provides additional validation techniques using robust variance estimation and specific contrast testing.

20.8.1 A6.1 Clinical Contrast Testing

# =============================================================================
# APPENDIX A6.1: CLINICAL CONTRAST TESTING (rms::contrast)
# Test specific clinical hypotheses with confidence intervals
# =============================================================================

log_info("Performing clinical contrast testing using rms::contrast()...")

cat("=== Clinical Contrast Testing ===\n\n")

=== Clinical Contrast Testing ===

cat("Testing specific clinical hypotheses about risk differences between patient profiles.\n\n")

Testing specific clinical hypotheses about risk differences between patient profiles.

# Define clinically meaningful contrasts based on model predictors
if (exists("model") && inherits(model, "lrm")) {

  # Get model predictor names
  model_vars <- names(model$Design$parms)
  if (is.null(model_vars)) {
    model_vars <- attr(model$terms, "term.labels")
  }

  contrast_results <- list()

  # Contrast 1: Age effect (older vs younger patient)
  if ("Age" %in% model_vars) {
    cat("### Contrast 1: Age Effect (70 vs 50 years)\n")
    contrast1 <- tryCatch({
      rms::contrast(model,
                    list(Age = 70),
                    list(Age = 50),
                    type = "average")
    }, error = function(e) {
      log_warn(sprintf("Age contrast failed: %s", e$message))
      NULL
    })

    if (!is.null(contrast1)) {
      print(contrast1)
      contrast_results$age <- contrast1
      cat("\nInterpretation: The log-odds ratio for a 70-year-old vs 50-year-old patient,\n")
      cat("averaged over other predictor values.\n\n")
    }
  }

  # Contrast 2: BMI effect (obese vs normal weight)
  if ("BMI" %in% model_vars) {
    cat("### Contrast 2: BMI Effect (35 vs 25 kg/m²)\n")
    contrast2 <- tryCatch({
      rms::contrast(model,
                    list(BMI = 35),
                    list(BMI = 25),
                    type = "average")
    }, error = function(e) {
      log_warn(sprintf("BMI contrast failed: %s", e$message))
      NULL
    })

    if (!is.null(contrast2)) {
      print(contrast2)
      contrast_results$bmi <- contrast2
      cat("\nInterpretation: The log-odds ratio for an obese (BMI=35) vs normal weight (BMI=25) patient.\n\n")
    }
  }

  # Contrast 3: Recurrent UTIs effect
  if ("Recurrent_UTIs" %in% model_vars) {
    cat("### Contrast 3: Recurrent UTIs Effect (Yes vs No)\n")
    contrast3 <- tryCatch({
      rms::contrast(model,
                    list(Recurrent_UTIs = "Yes"),
                    list(Recurrent_UTIs = "No"),
                    type = "average")
    }, error = function(e) {
      log_warn(sprintf("Recurrent UTIs contrast failed: %s", e$message))
      NULL
    })

    if (!is.null(contrast3)) {
      print(contrast3)
      contrast_results$recurrent_utis <- contrast3
      cat("\nInterpretation: The log-odds ratio for patients with vs without recurrent UTI history.\n\n")
    }
  }

  # Contrast 4: High-risk vs low-risk profile
  cat("### Contrast 4: High-Risk vs Low-Risk Patient Profile\n")

  # Define high-risk and low-risk profiles based on available predictors
  high_risk_profile <- list()
  low_risk_profile <- list()

  if ("Age" %in% model_vars) {
    high_risk_profile$Age <- 75
    low_risk_profile$Age <- 45
  }
  if ("BMI" %in% model_vars) {
    high_risk_profile$BMI <- 35
    low_risk_profile$BMI <- 24
  }
  if ("Recurrent_UTIs" %in% model_vars) {
    high_risk_profile$Recurrent_UTIs <- "Yes"
    low_risk_profile$Recurrent_UTIs <- "No"
  }
  if ("OAB" %in% model_vars) {
    high_risk_profile$OAB <- "Yes"
    low_risk_profile$OAB <- "No"
  }
  if ("Detrusor_Overactivity" %in% model_vars) {
    high_risk_profile$Detrusor_Overactivity <- "Yes"
    low_risk_profile$Detrusor_Overactivity <- "No"
  }

  if (length(high_risk_profile) > 0) {
    contrast4 <- tryCatch({
      rms::contrast(model, high_risk_profile, low_risk_profile)
    }, error = function(e) {
      log_warn(sprintf("Profile contrast failed: %s", e$message))
      NULL
    })

    if (!is.null(contrast4)) {
      cat("High-risk profile:", paste(names(high_risk_profile), "=", high_risk_profile, collapse = ", "), "\n")
      cat("Low-risk profile:", paste(names(low_risk_profile), "=", low_risk_profile, collapse = ", "), "\n\n")
      print(contrast4)
      contrast_results$profile <- contrast4

      # Calculate odds ratio
      if (!is.null(contrast4$Contrast)) {
        or <- exp(contrast4$Contrast)
        or_lower <- exp(contrast4$Lower)
        or_upper <- exp(contrast4$Upper)
        cat(sprintf("\nOdds Ratio: %.2f (95%% CI: %.2f - %.2f)\n", or, or_lower, or_upper))
      }
    }
  }

  cat("\n=== Summary ===\n")
  cat("These contrasts provide effect estimates with 95% confidence intervals for\n")
  cat("specific clinical comparisons, enabling hypothesis testing beyond overall model fit.\n")

  log_info("Clinical contrast testing complete")
} else {
  cat("Model object not available for contrast testing.\n")
}

20.8.2 Contrast 1: Age Effect (70 vs 50 years)

Contrast S.E. Lower Upper Z Pr(>|z|) 11 1.138214 0.3472924 0.4575338 1.818895 3.28 0.001

Confidence intervals are 0.95 individual intervals

Interpretation: The log-odds ratio for a 70-year-old vs 50-year-old patient, averaged over other predictor values.

20.8.3 Contrast 3: Recurrent UTIs Effect (Yes vs No)

Contrast      S.E.     Lower   Upper    Z Pr(>|z|)

11 0.9237168 0.3882487 0.1627634 1.68467 2.38 0.0174

Confidence intervals are 0.95 individual intervals

Interpretation: The log-odds ratio for patients with vs without recurrent UTI history.

20.8.4 Contrast 4: High-Risk vs Low-Risk Patient Profile

High-risk profile: Age = 75, Recurrent_UTIs = Yes Low-risk profile: Age = 45, Recurrent_UTIs = No

Nocturia CRADI_8 Hispanic Vaginal_Estrogen Overactive_Bladder Contrast 1 2 18.75 No No Yes 2.667106 S.E. Lower Upper Z Pr(>|z|) 1 0.6347373 1.423043 3.911168 4.2 0

Confidence intervals are 0.95 individual intervals

Odds Ratio: 14.40 (95% CI: 4.15 - 49.96)

=== Summary === These contrasts provide effect estimates with 95% confidence intervals for specific clinical comparisons, enabling hypothesis testing beyond overall model fit.

20.8.5 A6.2 Bootstrap Covariance Estimation

20.8.6 A6.3 Robust Sandwich Covariance


20.9 A7. Publication Support

This section provides outputs formatted for manuscript preparation and external implementation.

20.9.1 A7.1 Model Specification Summary

20.9.2 A7.2 LaTeX Output for Publications

20.9.3 A7.3 Nomogram Formula Extraction

# =============================================================================
# APPENDIX A7.3: NOMOGRAM FORMULA EXTRACTION
# Extract equations for programmatic scoring
# =============================================================================

log_info("Extracting nomogram formulas for programmatic implementation...")

cat("=== Nomogram Scoring Equations ===\n\n")

=== Nomogram Scoring Equations ===

cat("These equations enable programmatic calculation of nomogram points\n")

These equations enable programmatic calculation of nomogram points

cat("without requiring visual point reading from the nomogram graphic.\n\n")

without requiring visual point reading from the nomogram graphic.

if (exists("model") && inherits(model, "lrm")) {

  # Extract model coefficients
  coefs <- coef(model)

  cat("=== Linear Predictor Equation ===\n\n")
  cat("The linear predictor (log-odds) is calculated as:\n\n")
  cat("LP = ")

  # Build equation string
  eq_parts <- c()
  for (i in seq_along(coefs)) {
    coef_name <- names(coefs)[i]
    coef_val <- coefs[i]

    if (coef_name == "Intercept") {
      eq_parts <- c(eq_parts, sprintf("%.4f", coef_val))
    } else {
      sign <- ifelse(coef_val >= 0, "+", "")
      eq_parts <- c(eq_parts, sprintf("%s %.4f × %s", sign, coef_val, coef_name))
    }
  }
  cat(paste(eq_parts, collapse = " "), "\n\n")

  cat("=== Probability Calculation ===\n\n")
  cat("Probability = 1 / (1 + exp(-LP))\n")
  cat("Or equivalently: Probability = exp(LP) / (1 + exp(LP))\n\n")

  cat("=== Point Scoring System ===\n\n")
  cat("For nomogram implementation, points are typically scaled such that\n")
  cat("the maximum points for each predictor sum to 100.\n\n")

  # Create point scoring table
  # Scale coefficients to 0-100 point range
  if (length(coefs) > 1) {
    abs_coefs <- abs(coefs[-1])  # Exclude intercept
    max_coef <- max(abs_coefs)

    point_table <- data.frame(
      Predictor = names(coefs)[-1],
      Coefficient = round(coefs[-1], 4),
      Points_Per_Unit = round(coefs[-1] / max_coef * 10, 2)
    )

    cat("Points per unit (scaled, 10 points = largest coefficient):\n\n")
    print(point_table)

    cat("\n\nNote: For categorical variables, 'Points_Per_Unit' represents the\n")
    cat("points assigned when the variable equals 1 (vs reference = 0).\n")
    cat("For continuous variables, multiply by the actual value.\n")
  }

  cat("\n=== Implementation Example (R) ===\n\n")
  cat("```r\n")
  cat("# Function to calculate predicted probability\n")
  cat("predict_cancellation <- function(")
  cat(paste(names(coefs)[-1], collapse = ", "))
  cat(") {\n")
  cat("  lp <- ", round(coefs[1], 4))
  for (i in 2:length(coefs)) {
    sign <- ifelse(coefs[i] >= 0, "+", "-")
    cat(sprintf(" %s %.4f * %s", sign, abs(coefs[i]), names(coefs)[i]))
  }
  cat("\n  return(1 / (1 + exp(-lp)))\n")
  cat("}\n")
  cat("```\n")

  log_info("Nomogram formula extraction complete")
} else {
  cat("Model object not available for formula extraction.\n")
}

=== Linear Predictor Equation ===

The linear predictor (log-odds) is calculated as:

LP = -5.7918 + 0.0373 × Age + 0.0283 × Age’ + 0.0554 × Nocturia -0.0641 × Nocturia’ -0.0212 × CRADI_8 + 0.0244 × CRADI_8’ + 0.9196 × Hispanic=Yes + 0.9237 × Recurrent_UTIs=Yes -0.0797 × Vaginal_Estrogen=Yes + 0.3726 × Overactive_Bladder=Yes

=== Probability Calculation ===

Probability = 1 / (1 + exp(-LP)) Or equivalently: Probability = exp(LP) / (1 + exp(LP))

=== Point Scoring System ===

For nomogram implementation, points are typically scaled such that the maximum points for each predictor sum to 100.

Points per unit (scaled, 10 points = largest coefficient):

                                Predictor Coefficient Points_Per_Unit

Age Age 0.0373 0.40 Age’ Age’ 0.0283 0.31 Nocturia Nocturia 0.0554 0.60 Nocturia’ Nocturia’ -0.0641 -0.69 CRADI_8 CRADI_8 -0.0212 -0.23 CRADI_8’ CRADI_8’ 0.0244 0.26 Hispanic=Yes Hispanic=Yes 0.9196 9.96 Recurrent_UTIs=Yes Recurrent_UTIs=Yes 0.9237 10.00 Vaginal_Estrogen=Yes Vaginal_Estrogen=Yes -0.0797 -0.86 Overactive_Bladder=Yes Overactive_Bladder=Yes 0.3726 4.03

Note: For categorical variables, ‘Points_Per_Unit’ represents the points assigned when the variable equals 1 (vs reference = 0). For continuous variables, multiply by the actual value.

=== Implementation Example (R) ===

# Function to calculate predicted probability
predict_cancellation <- function(Age, Age', Nocturia, Nocturia', CRADI_8, CRADI_8', Hispanic=Yes, Recurrent_UTIs=Yes, Vaginal_Estrogen=Yes, Overactive_Bladder=Yes) {
  lp <-  -5.7918 + 0.0373 * Age + 0.0283 * Age' + 0.0554 * Nocturia - 0.0641 * Nocturia' - 0.0212 * CRADI_8 + 0.0244 * CRADI_8' + 0.9196 * Hispanic=Yes + 0.9237 * Recurrent_UTIs=Yes - 0.0797 * Vaginal_Estrogen=Yes + 0.3726 * Overactive_Bladder=Yes
  return(1 / (1 + exp(-lp)))
}

20.9.4 A7.4 External Validation Code Export

# =============================================================================
# APPENDIX A7.4: EXTERNAL VALIDATION CODE EXPORT
# Code for validating model in external datasets
# =============================================================================

log_info("Generating external validation code...")

cat("=== External Validation Implementation ===\n\n")

=== External Validation Implementation ===

cat("Use the following code to validate this model in external datasets.\n\n")

Use the following code to validate this model in external datasets.

if (exists("model") && inherits(model, "lrm")) {

  # Get model formula
  model_formula <- tryCatch({
    formula(model)
  }, error = function(e) NULL)

  cat("=== Required Variables ===\n\n")
  predictor_names <- names(model$Design$parms)
  if (is.null(predictor_names)) {
    predictor_names <- attr(model$terms, "term.labels")
  }

  for (pred in predictor_names) {
    cat(sprintf("- %s\n", pred))
  }

  cat("\n=== R Code for External Validation ===\n\n")
  cat("```r\n")
  cat("# Load required packages\n")
  cat("library(rms)\n")
  cat("library(pROC)\n\n")

  cat("# Load your external validation data\n")
  cat("# external_data <- read.csv('your_external_data.csv')\n\n")

  cat("# Model coefficients (copy from above)\n")
  cat("coefs <- c(\n")
  coef_strings <- sprintf("  '%s' = %.6f", names(coef(model)), coef(model))
  cat(paste(coef_strings, collapse = ",\n"))
  cat("\n)\n\n")

  cat("# Calculate linear predictor\n")
  cat("calculate_lp <- function(data) {\n")
  cat("  lp <- coefs['Intercept']\n")
  cat("  for (var in names(coefs)[-1]) {\n")
  cat("    if (var %in% names(data)) {\n")
  cat("      lp <- lp + coefs[var] * data[[var]]\n")
  cat("    }\n")
  cat("  }\n")
  cat("  return(lp)\n")
  cat("}\n\n")

  cat("# Calculate predicted probabilities\n")
  cat("# external_data$pred_prob <- plogis(calculate_lp(external_data))\n\n")

  cat("# Calculate C-statistic\n")
  cat("# roc_result <- pROC::roc(external_data$outcome, external_data$pred_prob)\n")
  cat("# print(pROC::auc(roc_result))\n")
  cat("```\n")

  log_info("External validation code export complete")
}

=== Required Variables ===

  • Age
  • Nocturia
  • CRADI_8
  • Hispanic
  • Recurrent_UTIs
  • Vaginal_Estrogen
  • Overactive_Bladder

=== R Code for External Validation ===

# Load required packages
library(rms)
library(pROC)

# Load your external validation data
# external_data <- read.csv('your_external_data.csv')

# Model coefficients (copy from above)
coefs <- c(
  'Intercept' = -5.791816,
  'Age' = 0.037252,
  'Age'' = 0.028282,
  'Nocturia' = 0.055420,
  'Nocturia'' = -0.064134,
  'CRADI_8' = -0.021224,
  'CRADI_8'' = 0.024398,
  'Hispanic=Yes' = 0.919635,
  'Recurrent_UTIs=Yes' = 0.923717,
  'Vaginal_Estrogen=Yes' = -0.079743,
  'Overactive_Bladder=Yes' = 0.372576
)

# Calculate linear predictor
calculate_lp <- function(data) {
  lp <- coefs['Intercept']
  for (var in names(coefs)[-1]) {
    if (var %in% names(data)) {
      lp <- lp + coefs[var] * data[[var]]
    }
  }
  return(lp)
}

# Calculate predicted probabilities
# external_data$pred_prob <- plogis(calculate_lp(external_data))

# Calculate C-statistic
# roc_result <- pROC::roc(external_data$outcome, external_data$pred_prob)
# print(pROC::auc(roc_result))

End of Technical Appendix


21 Figure and Table Legends

21.1 Manuscript Figure Legends

Figure 1. TRIPOD Flow Diagram. Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) compliant flow diagram showing patient selection from initial screening through final analysis cohort. The diagram illustrates the number of patients at each stage of selection, including exclusions due to incomplete data, missing outcome information, or protocol violations, with the final sample size used for model development.

Figure 2. Clinical Prediction Nomogram (Shrinkage-Adjusted). Nomogram for predicting the probability of urodynamic procedure cancellation due to urinary tract infection (UTI). This nomogram uses shrinkage-adjusted coefficients (calibration slope applied to all predictors) to produce properly calibrated predictions for new patients. Each predictor variable is assigned points based on its adjusted coefficient in the multivariable logistic regression model. To use the nomogram, locate the patient’s value for each predictor on its axis, draw a vertical line to the “Points” axis at the top, sum the points for all predictors, and find the corresponding predicted probability on the “Probability of Cancellation” axis at the bottom.

Figure 3. Calibration Plot. Calibration plot assessing the agreement between predicted probabilities and observed outcomes. Predicted probabilities are grouped into deciles, and the observed proportion of events within each group is plotted against the mean predicted probability. The diagonal line represents perfect calibration. Points above the line indicate underprediction, while points below indicate overprediction. Error bars represent 95% confidence intervals. The Hosmer-Lemeshow goodness-of-fit test provides a statistical assessment of calibration, with p > 0.05 suggesting adequate model fit.

21.2 Manuscript Table Legends

Table 1. Demographic and Clinical Characteristics Stratified by Procedure Cancellation Status. Baseline characteristics of the study population, stratified by whether the urodynamic procedure was completed or cancelled due to urinary tract infection. Continuous variables are presented as mean ± standard deviation or median (interquartile range) as appropriate based on distribution. Categorical variables are presented as frequency (percentage). P-values are calculated using Student’s t-test or Mann-Whitney U test for continuous variables and chi-squared or Fisher’s exact test for categorical variables, as appropriate. Statistical significance is defined as p < 0.05.

Table 2. Multivariable Logistic Regression Model for Urodynamic Procedure Cancellation Due to Urinary Tract Infection. Final multivariable logistic regression model coefficients, odds ratios, and 95% confidence intervals for each predictor variable. Variables were selected using LASSO regularization with 10-fold cross-validation. The model uses the rms::lrm() function to enable restricted cubic spline transformations for continuous predictors where appropriate. Odds ratios represent the change in odds of procedure cancellation for a one-unit change in the predictor (or for categorical variables, compared to the reference category).


22 Supplemental Digital Content

22.1 Supplemental Figure Legends

Supplemental Figure 1. Receiver Operating Characteristic (ROC) Curve. Receiver operating characteristic curve demonstrating the discrimination ability of the prediction model. The curve plots sensitivity (true positive rate) against 1-specificity (false positive rate) across all possible classification thresholds. The area under the curve (C-statistic) quantifies the model’s ability to distinguish between patients whose procedures will be cancelled versus completed. The diagonal dashed line represents a model with no discriminative ability (C-statistic = 0.50). The optimal threshold point, determined by maximizing the Youden index (sensitivity + specificity - 1), is indicated on the curve.

Supplemental Figure 2. Decision Curve Analysis. Decision curve analysis (DCA) comparing the clinical utility of the prediction model against alternative strategies of treating all patients (“Treat All”) or no patients (“Treat None”). The y-axis represents net benefit, defined as: Net Benefit = (True Positives/N) - (False Positives/N) × [Threshold/(1-Threshold)]. The x-axis represents threshold probability, reflecting the clinical judgment about the relative harms of false-positive versus false-negative predictions. A model provides clinical utility when its net benefit curve exceeds the “Treat All” and “Treat None” strategies. The shaded region indicates the clinically relevant threshold range. Based on methodology described by Vickers et al., the DCA demonstrates the range of threshold probabilities over which using the prediction model results in superior decision-making compared to default strategies.

Supplemental Figure 3. Clinical Impact Curve. Clinical impact curve illustrating the practical implications of the prediction model when applied to a hypothetical population of 1000 patients. The red curve (“Number High Risk”) shows how many patients would be classified as high risk at each threshold probability. The blue curve (“Number High Risk with Event”) shows how many of those high-risk patients actually experienced the outcome (true positives). The gap between these curves represents false positives at each threshold. This visualization helps clinicians understand the trade-off between identifying true high-risk patients and over-classifying low-risk patients at different decision thresholds.

22.2 Supplemental Table Legends

Supplemental Table 1. Prediction Model Performance Characteristics. Summary of model discrimination and calibration metrics. Discrimination is assessed by the C-statistic (area under the ROC curve) with 95% confidence interval calculated via 1000 bootstrap resamples. Additional metrics include sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and the Brier score. Calibration is assessed by the calibration slope, calibration-in-the-large, and Hosmer-Lemeshow goodness-of-fit test.

Supplemental Table 2. Stratified C-Statistics by Patient Subgroups. Model discrimination (C-statistic) assessed within clinically relevant patient subgroups. Subgroup analyses evaluate whether the model maintains adequate discriminative ability across different patient populations, including stratification by age groups, body mass index categories, comorbidity status, and other clinically important characteristics. Confidence intervals are calculated using bootstrap resampling within each subgroup.


Document generated: 2026-01-24 16:24:37.75871


  1. Hooton TM, et al. Diagnosis, prevention, and treatment of catheter-associated urinary tract infection in adults: 2009 International Clinical Practice Guidelines from the Infectious Diseases Society of America. Clin Infect Dis. 2010;50(5):625-663.↩︎

  2. Harrell FE Jr. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. 2nd ed. Springer; 2015.↩︎

  3. Hosmer DW, Lemeshow S. Applied Logistic Regression. 3rd ed. Wiley; 2013.↩︎

  4. Steyerberg EW, Harrell FE Jr. Prediction models need appropriate internal, internal-external, and external validation. J Clin Epidemiol. 2016;69:245-247.↩︎

  5. Harrell FE Jr. Regression Modeling Strategies. 2nd ed. Springer; 2015.↩︎

  6. Steyerberg EW, Harrell FE Jr, 2016↩︎

  7. Van Calster B, McLernon DJ, van Smeden M, et al. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17:230.↩︎

  8. Harrell FE Jr. Regression Modeling Strategies. 2nd ed. Chapter 5. Springer; 2015.↩︎

  9. Vickers AJ, Van Calster B, Steyerberg EW. Net benefit approaches to the evaluation of prediction models, molecular markers, and diagnostic tests. BMJ. 2016;352:i6.↩︎

  10. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26(6):565-574.↩︎

  11. Kerr KF, Brown MD, Zhu K, Janes H. Assessing the Clinical Impact of Risk Prediction Models With Decision Curves. J Clin Oncol. 2016;34(21):2534-2540.↩︎

  12. Pencina MJ, D’Agostino RB Sr, D’Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of a new marker. Stat Med. 2008;27(2):157-172.↩︎

  13. Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol. 1996;49(12):1373-1379.↩︎

  14. Hosmer DW, Lemeshow S, Sturdivant RX. Applied Logistic Regression. 3rd ed. Wiley; 2013.↩︎

  15. Steyerberg EW, Harrell FE Jr. Prediction models need appropriate internal, internal-external, and external validation. J Clin Epidemiol. 2016;69:245-247.↩︎

  16. Riley RD, Snell KI, Ensor J, et al. Minimum sample size for developing a multivariable prediction model: PART II - binary and time-to-event outcomes. Stat Med. 2019;38(7):1276-1296.↩︎

  17. Harrell FE Jr. Regression Modeling Strategies. 2nd ed. Chapter 5. Springer; 2015.↩︎

  18. Van Calster B, van Smeden M, De Cock L, Steyerberg EW. Regression shrinkage methods for clinical prediction models do not guarantee improved performance. Stat Methods Med Res. 2020;29(11):3166-3178.↩︎

  19. Steyerberg EW, Vergouwe Y. Towards better clinical prediction models: seven steps for development and an ABCD for validation. Eur Heart J. 2014;35(29):1925-1931.↩︎

  20. Steyerberg EW, Harrell FE Jr. Prediction models need appropriate internal, internal-external, and external validation. J Clin Epidemiol. 2016;69:245-247.↩︎

  21. Riley RD, Ensor J, Snell KIE, et al. Calculating the sample size required for developing a clinical prediction model. BMJ. 2020;368:m441.↩︎

  22. Collins GS, Moons KGM. Reporting of artificial intelligence prediction models. Lancet. 2019;393(10181):1577-1579.↩︎