Understanding these patterns could help clinics implement targeted interventions, such as additional reminders for high-risk patients, better education about the importance of urodynamic testing, and proactive attention to the specific concerns that commonly lead to cancellations, ultimately improving resource utilization and patient care in urogynecology practices.

1 Load Data

2 Summary of Data Preparation Code

The data prep pipeline performs these key steps:

  1. Load PFDI-20 Totals
  • Reads pfdi_20_totals.csv and rounds the PFDI_20_total score
  2. Load Main Urodynamics Dataset
  • Reads the REDCap export 230119Urodynamics_DATA_LABELS_2024-09-16_1049.csv
  • Removes MRN and Complete columns
  3. Filter Observations
  • Removes rows where Was the procedure cancelled? is NA
  • Excludes patients who “no-showed” or had “Other” cancellation reasons
  4. Convert Variables to Factors
  • Converts 30+ clinical variables to factors (race, diabetes, tobacco, menopause, OAB, etc.)
  • Relevels reference categories: Race→“Other”, Tobacco→“Non-tobacco user”, Menopause→“Pre-menopausal”, OAB meds→“none”
  5. Remove Low-Quality Variables
  • Removes near-zero variance columns via caret::nearZeroVar
  • Removes columns with >50% NA values
  6. Impute Missing Values
  • Uses fill(.direction = "down") for last-observation-carried-forward (LOCF) imputation on clinical measures
  7. Remove Unnecessary Columns
  • Drops individual PFDI items (PFDI-1 through PFDI-20), POP-Q measurements (Aa, Ba, C, Gh, etc.), OAB medications, and post-procedure urodynamic findings
  8. Process Date Variable
  • Extracts year from procedure date → creates Year factor
  • Removes the original date column
  9. Join PFDI-20 Totals
  • Left joins the summary PFDI-20 scores by Record ID

Final output: Clean dataset with demographic, clinical, and PFDI-20 data for predicting urodynamics procedure cancellation.
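
A condensed sketch of steps 1-3, 5, 6, and 9 of this pipeline follows, assuming the file names shown above; steps 4, 7, and 8 (factor conversion, column drops, and the Year variable) are omitted for brevity, and the join key "Record ID" is an assumption based on the list above.

library(dplyr)
library(tidyr)

# Steps 1-2: load the PFDI-20 totals and the REDCap export
pfdi_totals <- read.csv("pfdi_20_totals.csv", check.names = FALSE) %>%
  mutate(PFDI_20_total = round(PFDI_20_total))
labels_df <- read.csv("230119Urodynamics_DATA_LABELS_2024-09-16_1049.csv",
                      check.names = FALSE) %>%
  select(-matches("MRN|Complete"))

# Step 3: keep rows with a recorded cancellation status
labels_df <- labels_df %>%
  filter(!is.na(`Was the procedure cancelled?`))

# Step 5: drop near-zero-variance columns, then columns with >50% NA
nzv <- caret::nearZeroVar(labels_df)
if (length(nzv) > 0) labels_df <- labels_df[, -nzv]
labels_df <- labels_df[, colMeans(is.na(labels_df)) <= 0.5]

# Step 6: LOCF imputation (fill downward)
labels_df <- labels_df %>% fill(everything(), .direction = "down")

# Step 9: join the summary PFDI-20 scores by record identifier
labels_df <- labels_df %>% left_join(pfdi_totals, by = "Record ID")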

3 Data Cleaning

Consolidate Multiple Urodynamics Reason Columns into One

#' Consolidate Multiple Urodynamics Reason Columns into One
#'
#' This function takes a data frame containing multiple 'reason for urodynamics'
#' columns and consolidates them into a single column with appropriate levels.
#' The function identifies columns matching the pattern "What was the reason for
#' the urodynamics? (choice=*)" and creates a new column with the consolidated values.
#' The original reason columns are removed from the output data frame.
#'
#' @param urodynamics_dataset A data frame containing the urodynamics data with multiple
#'                           reason columns.
#' @param output_file Optional. Path to save the processed data frame. If NULL (default),
#'                   no file is saved.
#' @param verbose Logical. If TRUE, provides detailed logging of the transformation
#'                process. Default is FALSE.
#'
#' @return A modified data frame with a new consolidated column for urodynamics reasons
#'         and all original reason columns removed.
#'
#' @importFrom dplyr mutate
#' @importFrom assertthat assert_that
#' @importFrom logger log_info log_debug log_error log_threshold
#' @importFrom utils write.csv
#'
#' @examples
#' # Example 1: Basic usage with minimal output
#' urodynamics_data <- read.csv("urodynamics_data.csv")
#' processed_urodynamics <- consolidate_urodynamics_reasons(
#'   urodynamics_dataset = urodynamics_data,
#'   output_file = NULL,
#'   verbose = FALSE
#' )
#' # The resulting data frame has a new column "What was the reason for the urodynamics?"
#' # with values like "pre-op for prolapse surgery" or "evaluation of OAB"
#' # All original reason columns are removed from the output
#'
#' # Example 2: Save output to file with verbose logging
#' processed_with_log <- consolidate_urodynamics_reasons(
#'   urodynamics_dataset = urodynamics_data,
#'   output_file = "processed_urodynamics.csv",
#'   verbose = TRUE
#' )
#' # Console will show detailed logs of each transformation step
#' # File "processed_urodynamics.csv" will be created with processed data
#'
#' # Example 3: Process data with custom column selection
#' # First create a subset of the data with only some columns
#' subset_data <- urodynamics_data[, c("Record ID", "Age:",
#'   "What was the reason for the urodynamics? (choice=pre-op for prolapse surgery)",
#'   "What was the reason for the urodynamics? (choice=pre-op for sling)",
#'   "What was the reason for the urodynamics? (choice=evaluation of OAB)")]
#' # Then process this subset
#' processed_subset <- consolidate_urodynamics_reasons(
#'   urodynamics_dataset = subset_data,
#'   output_file = "subset_processed.csv",
#'   verbose = TRUE
#' )
#' # Note: Even with a subset of reason columns, the function will work correctly
consolidate_urodynamics_reasons <- function(urodynamics_dataset,
                                           output_file = NULL,
                                           verbose = FALSE) {
  # Initialize logger
  if (verbose) {
    logger::log_threshold(logger::INFO)
  } else {
    logger::log_threshold(logger::WARN)
  }
  
  logger::log_info("Starting consolidation of urodynamics reasons")
  
  # Validate input
  tryCatch({
    assertthat::assert_that(is.data.frame(urodynamics_dataset))
    logger::log_info("Input validation passed: urodynamics_dataset is a data frame")
    
    if (!is.null(output_file)) {
      assertthat::assert_that(is.character(output_file), length(output_file) == 1)
      logger::log_info(paste("Output will be saved to:", output_file))
    } else {
      logger::log_info("No output file specified, results will be returned only")
    }
    
    # Input data summary
    logger::log_info(paste("Input data dimensions:", 
                          nrow(urodynamics_dataset), "rows,", 
                          ncol(urodynamics_dataset), "columns"))
    
    # Find all reason columns
    reason_pattern <- "What was the reason for the urodynamics\\? \\(choice="
    reason_columns <- grep(reason_pattern, names(urodynamics_dataset), value = TRUE)
    
    if (length(reason_columns) == 0) {
      stop("No columns found matching the pattern for urodynamics reasons")
    }
    
    logger::log_info(paste("Found", length(reason_columns), "reason columns:"))
    if (verbose) {
      for (col in reason_columns) {
        logger::log_debug(paste("  *", col))
      }
    }
    
    # Process the data using helper functions
    processed_urodynamics <- process_reason_columns(
      urodynamics_dataset = urodynamics_dataset,
      reason_columns = reason_columns,
      reason_pattern = reason_pattern,
      verbose = verbose
    )
    
    # Save output if requested
    if (!is.null(output_file)) {
      tryCatch({
        utils::write.csv(processed_urodynamics, file = output_file, row.names = FALSE)
        logger::log_info(paste("Output successfully saved to:", output_file))
      }, error = function(e) {
        logger::log_error(paste("Error saving output file:", e$message))
        stop(paste("Error saving output file:", e$message))
      })
    }
    
    logger::log_info("Consolidation of urodynamics reasons completed successfully")
    return(processed_urodynamics)
    
  }, error = function(e) {
    logger::log_error(paste("Error in consolidate_urodynamics_reasons:", e$message))
    stop(paste("Error in consolidate_urodynamics_reasons:", e$message))
  })
}

#' Process reason columns to create consolidated reason column
#'
#' @param urodynamics_dataset Data frame with urodynamics data
#' @param reason_columns Vector of column names for reasons
#' @param reason_pattern Pattern to extract the reason value
#' @param verbose Whether to log detailed information
#'
#' @return Data frame with consolidated reason column
#'
#' @noRd
process_reason_columns <- function(urodynamics_dataset, reason_columns, 
                                   reason_pattern, verbose) {
  logger::log_info("Starting to process reason columns")
  
  # Create a new column for each reason, with NA or the reason value
  processed_dataset <- urodynamics_dataset
  
  # Map of reasons (will be filled during processing)
  reason_values <- c()
  
  # For each reason column, extract the reason and set a new column with the value
  for (col in reason_columns) {
    # Extract the reason from the column name
    reason_value <- extract_reason_value(col, reason_pattern)
    reason_values <- c(reason_values, reason_value)
    
    if (verbose) {
      logger::log_debug(paste("Processing column:", col))
      logger::log_debug(paste("Extracted reason:", reason_value))
    }
    
    # Create a new column with the reason value if the original is "Checked"
    processed_dataset <- dplyr::mutate(processed_dataset, 
                                      !!paste0("Reason_", reason_value) := 
                                        ifelse(.data[[col]] == "Checked", 
                                               reason_value, NA))
  }
  
  # Create the consolidated reason column
  logger::log_info("Creating consolidated reason column")
  
  # Get the new individual reason columns
  new_reason_columns <- paste0("Reason_", reason_values)
  
  # Consolidate the individual reason columns into one (first non-NA value wins)
  processed_dataset <- create_consolidated_column(
    processed_dataset = processed_dataset,
    reason_columns = new_reason_columns,
    verbose = verbose
  )
  
  # Clean up temporary columns and original reason columns
  logger::log_info("Removing temporary columns and original reason columns")
  columns_to_remove <- c(new_reason_columns, reason_columns)
  processed_dataset <- processed_dataset[, !names(processed_dataset) %in% columns_to_remove]
  
  if (verbose) {
    logger::log_debug(paste("Removed", length(columns_to_remove), "columns"))
    logger::log_debug(paste("Remaining columns:", ncol(processed_dataset)))
  }
  
  return(processed_dataset)
}

#' Extract reason value from column name
#'
#' @param column_name Column name containing the reason
#' @param pattern Pattern to match for extracting the reason
#'
#' @return Extracted reason value
#'
#' @noRd
extract_reason_value <- function(column_name, pattern) {
  # Extract the part after "choice=" and before the closing parenthesis
  reason <- gsub(pattern, "", column_name)
  reason <- gsub("\\)$", "", reason)
  
  return(reason)
}
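
A quick illustration of what this helper returns, using a column name and pattern taken from the code above:

extract_reason_value(
  "What was the reason for the urodynamics? (choice=pre-op for sling)",
  "What was the reason for the urodynamics\\? \\(choice="
)
#> [1] "pre-op for sling"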

#' Create consolidated reason column by taking the first non-NA value
#'
#' @param processed_dataset Dataset with individual reason columns
#' @param reason_columns Vector of column names to consolidate
#' @param verbose Whether to log detailed information
#'
#' @return Dataset with added consolidated column
#'
#' @noRd
create_consolidated_column <- function(processed_dataset, reason_columns, verbose) {
  if (verbose) {
    logger::log_debug("Creating consolidated column from:")
    for (col in reason_columns) {
      logger::log_debug(paste("  *", col))
    }
  }
  
  # Create a consolidated column by coalescing all individual reason columns
  processed_dataset$`What was the reason for the urodynamics?` <- NA
  
  # Iterate through each row
  for (i in seq_len(nrow(processed_dataset))) {
    for (col in reason_columns) {
      if (!is.na(processed_dataset[i, col])) {
        processed_dataset[i, "What was the reason for the urodynamics?"] <- 
          processed_dataset[i, col]
        break  # Take the first non-NA value
      }
    }
  }
  
  # Note: if multiple reasons are checked, only the first non-NA value
  # (in column order) is kept, because the loop breaks after the first match
  logger::log_info("Consolidated reason column created successfully")
  
  return(processed_dataset)
}
Data Preview (First 20 Observations)

[Table not reproduced in this text rendering. The preview shows the first 20 observations of the cleaned dataset across 20 columns: Was the procedure cancelled?; Age; BMI; Race; Is the patient hispanic, latino or of Spanish origin?; Does the patient have diabetes?; Does the patient have a h/o recurrent UTIs?; Immunocompromised; Tobacco use; Menopause status; Is the patient on vaginal estrogen?; Does the patient have OAB?; Average number of voids at night; POP-Q stage; Year; POPDI_6; CRADI_8; UDI_6; PFDI_20_total; and What was the reason for the urodynamics?. The rows include both cancelled (Yes) and completed (No) procedures from 2022, with indications including pre-op for prolapse surgery, pre-op for sling, evaluation of OAB, evaluation of mixed incontinence, and other.]

4 Are there Enough Events Per Predictor?

check_events_per_predictor(labels_df, "Was the procedure cancelled?")
📊 Logistic Regression Check: Events per Predictor Variable (EPV) 📊
==================================================================
Outcome Variable: Was the procedure cancelled?
Number of Observations: 587
Total Events (Yes): 41
Total Non-Events: 546
Number of Predictors: 19
Required Minimum Events (10 × 19): 190

4.0.1 Interpretation of Events Per Predictor Analysis

The Events Per Predictor Variable (EPV) rule is a guideline in logistic regression that recommends having at least 10 outcome events for each predictor variable in the model to ensure stable coefficient estimates and reliable inference.
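
As a minimal illustration of the arithmetic behind this rule (the source of check_events_per_predictor() is not shown in this rendering):

# Events per predictor variable (EPV) for our outcome
outcome      <- labels_df[["Was the procedure cancelled?"]]
n_events     <- sum(outcome == "Yes", na.rm = TRUE)  # 41 cancellations
n_predictors <- 19                                   # candidate predictors
n_events / n_predictors                              # EPV ~= 2.2
10 * n_predictors                                    # minimum events under the 10-EPV rule: 190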

In our dataset:

  • Total observations: 587 patients scheduled for urodynamic testing
  • Events (procedure cancellations): 41 patients had procedures cancelled due to urinary tract infection (UTI)
  • Non-events (completed procedures): 546 patients completed their scheduled procedures
  • Number of candidate predictors: 19 variables
  • Minimum events required: 190 (based on 10 × 19 predictors)
  • Current EPV ratio: 2.2 events per predictor variable

Assessment: With only 41 events and 19 candidate predictors, the EPV ratio (2.2) is below the recommended minimum of 10. This limitation will be addressed through LASSO regularization, which automatically selects a parsimonious subset of predictors, thereby improving the effective EPV ratio in the final model.

The LASSO variable selection process (described in Section 7) will identify the most predictive variables, reducing the number of predictors in the final model and substantially improving the EPV ratio. The final model metrics are reported in the Results section.

5 Table 1

Looking at the demographics table comparing patients whose procedures were cancelled (Yes) versus not cancelled (No), there are several notable findings:

5.0.1 Statistically Significant Differences

  1. Age (P < .001):
    • Patients whose procedures were cancelled were significantly older (median age 77 vs 64.5)
    • The interquartile range for cancelled procedures (Q1, Q3: 71.0, 80.0) is higher than for non-cancelled procedures (Q1, Q3: 53.0, 73.0)
    • This suggests age is strongly associated with procedure cancellation
  2. Diabetes (P = .042):
    • Higher proportion of patients with diabetes in the cancelled group (26.8% vs 14.8%)
    • This indicates diabetes may be a risk factor for procedure cancellation

5.0.2 Variables Not Reaching Statistical Significance (P = .05 to .10)

The following variables showed differences between groups but did not reach the prespecified significance threshold of P<.05:

  1. BMI (P = .070):
    • Patients with cancelled procedures had somewhat higher BMI (median 31.3 vs 28.5 kg/m²)
    • This difference may be clinically meaningful despite not reaching statistical significance
  2. History of recurrent UTIs (P = .053):
    • Almost twice the proportion of patients with recurrent UTIs in the cancelled group (19.5% vs 9.9%)
  3. Hispanic/Latino ethnicity (P = .082):
    • Higher proportion of Hispanic/Latino patients in the cancelled group (19.5% vs 10.6%)
  4. POP-Q stage (P = .086):
    • Different distribution of pelvic organ prolapse stages between groups
    • Higher percentage of stage 0 in cancelled group (55.3% vs 36.4%)
    • Lower percentage of stage 2 in cancelled group (10.5% vs 26.9%)

Why are variables with P = .05 to .10 retained in the analysis?

These “trending” variables are kept for several important reasons:

  1. LASSO handles variable selection: Unlike traditional stepwise regression that uses p-values for variable selection, LASSO uses cross-validation to select variables based on their contribution to predictive accuracy. A variable that does not reach statistical significance in univariate analysis may still improve model prediction when combined with other variables.

  2. P-values are not selection criteria for LASSO: The p-values in Table 1 describe univariate associations (each variable tested alone). LASSO considers multivariate relationships and can identify variables that are predictive in combination, even if individually they show weaker associations.

  3. Clinical plausibility: BMI and recurrent UTIs have biological rationale for affecting UTI risk and procedure cancellation. Excluding clinically relevant variables prematurely could result in an underspecified model.

  4. Small sample size considerations: With only 41 events (cancellations), statistical power is limited. A P-value of .053-.070 suggests a meaningful effect that may not have reached significance due to sample size constraints rather than true absence of association.

  5. Let the data decide: By including these variables as candidates, we allow the LASSO algorithm to make an objective, data-driven decision about their inclusion based on cross-validated prediction error—not arbitrary p-value cutoffs.

5.0.3 No Significant Differences

The following factors showed no significant association with procedure cancellation:

  • Race
  • Immunocompromised status
  • Tobacco use
  • Vaginal estrogen use
  • OAB (overactive bladder) status
  • Nighttime voiding frequency
  • Year of procedure
  • POPDI_6, CRADI_8, UDI_6, and PFDI_20_total scores (pelvic floor disorder questionnaires)
  • Reason for urodynamics

5.0.4 Clinical Interpretation

The data suggests that older age is the strongest predictor of procedure cancellation, with diabetes as another statistically significant factor (P<.05). BMI and history of recurrent UTIs showed differences between groups but did not reach the prespecified significance threshold; however, these variables may warrant clinical consideration as potential risk factors.

The sample size for cancellations is relatively small (N=41 vs N=546), which may limit statistical power to detect some associations. The significant age difference suggests pre-procedure assessment might need to be more comprehensive for elderly patients to reduce cancellation rates.

Table 1. Demographic and Clinical Characteristics Stratified by Procedure Cancellation Status

Demographics Stratified by Procedure Cancellation Status
Completed (N=546) Cancelled (N=41) Total (N=587) p value
Age (years) < 0.001 (1)
- Median 64.5 77.0 65.0
- Q1, Q3 53.0, 73.0 71.0, 80.0 54.0, 73.0
Body Mass Index (kg/m²) 0.070 (1)
- Median 28.5 31.3 28.6
- Q1, Q3 24.7, 33.7 26.0, 35.5 24.7, 33.8
Race: 0.311 (2)
- Asian 8 (1.5%) 0 (0.0%) 8 (1.4%)
- Black or African American 25 (4.6%) 1 (2.4%) 26 (4.4%)
- White 433 (79.3%) 30 (73.2%) 463 (78.9%)
- Other 77 (14.1%) 9 (22.0%) 86 (14.7%)
- Native Hawaiian or Other Pacific Islander 3 (0.5%) 1 (2.4%) 4 (0.7%)
Is the patient hispanic, latino or of Spanish origin? 0.082 (2)
- No 488 (89.4%) 33 (80.5%) 521 (88.8%)
- Yes 58 (10.6%) 8 (19.5%) 66 (11.2%)
Does the patient have diabetes? 0.042 (2)
- Yes 81 (14.8%) 11 (26.8%) 92 (15.7%)
- No 465 (85.2%) 30 (73.2%) 495 (84.3%)
History of Recurrent Urinary Tract Infections 0.053 (2)
- Yes 54 (9.9%) 8 (19.5%) 62 (10.6%)
- No 492 (90.1%) 33 (80.5%) 525 (89.4%)
Immunocompromised: 0.973 (2)
- Yes 26 (4.8%) 2 (4.9%) 28 (4.8%)
- No 520 (95.2%) 39 (95.1%) 559 (95.2%)
Tobacco use:
- Yes 0 0 0
- No 0 0 0
Menopause status: 0.105 (2)
- Pre-menopausal 112 (20.5%) 3 (7.3%) 115 (19.6%)
- Post-menopausal 415 (76.0%) 37 (90.2%) 452 (77.0%)
- Unclear 19 (3.5%) 1 (2.4%) 20 (3.4%)
Is the patient on vaginal estrogen? 0.299 (2)
- Yes 86 (15.8%) 9 (22.0%) 95 (16.2%)
- No 460 (84.2%) 32 (78.0%) 492 (83.8%)
Does the patient have Overactive Bladder? 0.218 (2)
- Yes 292 (53.5%) 26 (63.4%) 318 (54.2%)
- No 254 (46.5%) 15 (36.6%) 269 (45.8%)
Average number of voids at night: 0.603 (1)
- Median 2.0 1.5 2.0
- Q1, Q3 1.0, 3.0 1.0, 3.0 1.0, 3.0
Pelvic Organ Prolapse Quantification Stage 0.086 (2)
- 0 199 (36.4%) 21 (55.3%) 220 (37.7%)
- 1 58 (10.6%) 2 (5.3%) 60 (10.3%)
- 2 147 (26.9%) 4 (10.5%) 151 (25.9%)
- 3 114 (20.9%) 9 (23.7%) 123 (21.1%)
- 4 28 (5.1%) 2 (5.3%) 30 (5.1%)
Year 0.993 (2)
- 2022 288 (52.7%) 22 (53.7%) 310 (52.8%)
- 2023 189 (34.6%) 14 (34.1%) 203 (34.6%)
- 2024 69 (12.6%) 5 (12.2%) 74 (12.6%)
Pelvic Organ Prolapse Distress Inventory-6 0.609 (1)
- Median 29.2 31.2 29.2
- Q1, Q3 12.5, 45.8 5.2, 41.7 12.5, 45.8
Colorectal-Anal Distress Inventory-8 0.707 (1)
- Median 18.8 23.4 18.8
- Q1, Q3 6.2, 34.4 6.2, 39.8 6.2, 37.5
Urinary Distress Inventory-6 0.160 (1)
- Median 50.0 56.2 50.0
- Q1, Q3 33.3, 66.7 49.0, 66.7 33.3, 66.7
Pelvic Floor Distress Inventory-20 Total Score 0.362 (1)
- Median 98.0 110.5 100.0
- Q1, Q3 62.0, 138.0 80.5, 140.5 62.0, 138.0
What was the reason for the urodynamics? 0.489 (2)
- evaluation of mixed incontinence 76 (13.9%) 4 (9.8%) 80 (13.6%)
- evaluation of neurogenic bladder 1 (0.2%) 0 (0.0%) 1 (0.2%)
- evaluation of OAB 130 (23.8%) 16 (39.0%) 146 (24.9%)
- evaluation of voiding dysfunction 10 (1.8%) 1 (2.4%) 11 (1.9%)
- other 20 (3.7%) 2 (4.9%) 22 (3.7%)
- pre-op for prolapse surgery 204 (37.4%) 12 (29.3%) 216 (36.8%)
- pre-op for sling 105 (19.2%) 6 (14.6%) 111 (18.9%)
  1. Linear Model ANOVA
  2. Pearson’s Chi-squared test

5.0.5 Remove spaces from column names

# Rename columns to remove spaces and special characters
colnames(labels_df) <- make.names(colnames(labels_df))

# Remove Record.ID if it still exists after make.names conversion
# This is a non-contributory identifier variable
if ("Record.ID" %in% names(labels_df)) {
  labels_df <- labels_df %>% select(-Record.ID)
  message("✅ Removed Record.ID from labels_df")
}

5.0.6 Ensure the outcome variable is a factor
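
A minimal sketch of this step, assuming the outcome column is named Was.the.procedure.cancelled. after the make.names() conversion above:

# Make the outcome a factor with "No" as the reference level
labels_df$Was.the.procedure.cancelled. <-
  factor(labels_df$Was.the.procedure.cancelled., levels = c("No", "Yes"))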

5.0.7 Model
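
The model-fitting code is not shown in this rendering; a hedged sketch of fitting the full (pre-selection) logistic model with rms::lrm, producing the model object referenced by the importance-plot functions below, might look like:

library(rms)

# rms requires a datadist object for summaries and plotting
dd <- datadist(labels_df)
options(datadist = "dd")

# Full logistic model on all retained predictors (a sketch, not the exact specification)
model <- lrm(Was.the.procedure.cancelled. ~ ., data = labels_df, x = TRUE, y = TRUE)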

6 Feature Selection

6.0.1 Variable Importance Plot Function

#' Create a simple and robust variable importance plot for logistic regression
#'
#' This function extracts variable importance measures from a logistic regression model
#' and creates a simple visualization. It's designed to be robust and work with models
#' that might have complex structures.
#'
#' @param model An rms::lrm object or a glm object with family=binomial
#' @param top_n Number of top variables to show (default: all)
#' @param title Custom title for the plot (default: "Variable Importance")
#'
#' @return A ggplot2 object showing variable importance
#'
#' @examples
#' # Assuming you have an lrm model called 'model'
#' imp_plot <- plot_var_importance(model)
#' print(imp_plot)
#'
#' # Show only top 5 variables with custom title
#' imp_plot <- plot_var_importance(model, top_n = 5, 
#'                               title = "Top Predictors of Cancellation")
#' print(imp_plot)
#'
#' @importFrom ggplot2 ggplot aes geom_col coord_flip labs theme_minimal theme
#'   element_text element_blank annotate geom_blank theme_void
plot_var_importance <- function(model, top_n = NULL, title = "Variable Importance") {
  # Check if ggplot2 is available
  if (!requireNamespace("ggplot2", quietly = TRUE)) {
    stop("ggplot2 package is required for this function")
  }
  
  # Get the coefficients while handling potential errors
  tryCatch({
    # Try to extract coefficients
    if (inherits(model, "lrm")) {
      # For rms::lrm models
      coefficients <- model$coefficients
      # Skip intercept
      coefficients <- coefficients[names(coefficients) != "Intercept"]
    } else if (inherits(model, "glm")) {
      # For glm models
      coefficients <- coef(model)[-1]  # Skip intercept
    } else {
      stop("Model must be an 'lrm' or 'glm' object")
    }
    
    # Check if we have any coefficients
    if (length(coefficients) == 0) {
      stop("No coefficients found in the model")
    }
    
    # Create a data frame with variable importance
    importance_df <- data.frame(
      variable = names(coefficients),
      importance = abs(coefficients),  # Use absolute values for importance
      stringsAsFactors = FALSE
    )
    
    # Clean variable names (remove backticks and handle factor levels)
    importance_df$variable <- gsub("`", "", importance_df$variable)

    # Remove non-contributory identifier variables that should not appear in importance plots
    exclude_vars <- c("Record.ID", "Record ID", "MRN", "MRN.", "Completed", "Complete")
    importance_df <- importance_df[!grepl(paste(exclude_vars, collapse = "|"), importance_df$variable, ignore.case = TRUE), ]

    # Sort by importance
    importance_df <- importance_df[order(importance_df$importance, decreasing = TRUE), ]
    
    # Apply top_n filter if specified
    if (!is.null(top_n) && is.numeric(top_n) && top_n > 0) {
      n_vars <- nrow(importance_df)
      if (top_n < n_vars) {
        importance_df <- importance_df[1:top_n, ]
      }
    }
    
    # Reorder factor levels for plotting
    importance_df$variable <- factor(importance_df$variable, 
                                    levels = rev(importance_df$variable))
    
    # Create plot
    p <- ggplot2::ggplot(
      importance_df, 
      ggplot2::aes(x = variable, y = importance)
    ) +
      ggplot2::geom_col(fill = "steelblue") +
      ggplot2::coord_flip() +
      ggplot2::labs(
        title = title,
        x = NULL,
        y = "Absolute Coefficient Magnitude"
      ) +
      ggplot2::theme_minimal(base_size = 14) +
      ggplot2::theme(
        plot.title = ggplot2::element_text(face = "bold", hjust = 0.5),
        panel.grid.minor = ggplot2::element_blank(),
        panel.grid.major.y = ggplot2::element_blank()
      )
    
    return(p)
    
  }, error = function(e) {
    message("Error creating variable importance plot: ", e$message)
    
    # Create fallback plot with error message
    df <- data.frame(x = 1, y = 1)
    p <- ggplot2::ggplot(df, ggplot2::aes(x, y)) + 
      ggplot2::geom_blank() +
      ggplot2::annotate("text", x = 1, y = 1, 
                      label = paste("Could not create plot:", e$message),
                      hjust = 0.5, vjust = 0.5) +
      ggplot2::theme_void() +
      ggplot2::labs(title = "Error in Variable Importance Plot")
    
    return(p)
  })
}

#' Plot Wald statistics from a logistic regression model
#'
#' This function calculates and plots Wald statistics from a logistic regression model,
#' which can be used as a measure of variable importance.
#'
#' @param model An rms::lrm object
#' @param top_n Number of top variables to show (default: all)
#' @param title Custom title for the plot (default: "Variable Importance (Wald Statistic)")
#'
#' @return A ggplot2 object showing Wald statistics
#'
#' @examples
#' # Assuming you have an lrm model called 'model'
#' wald_plot <- plot_wald_statistics(model, top_n = 5)
#' print(wald_plot)
#'
#' @importFrom ggplot2 ggplot aes geom_col coord_flip labs theme_minimal theme
#'   element_text element_blank annotate geom_blank theme_void
plot_wald_statistics <- function(model, top_n = NULL, 
                                title = "Variable Importance (Wald Statistic)") {
  # Check if model is from rms package
  if (!inherits(model, "lrm")) {
    stop("Model must be an 'lrm' object from the rms package")
  }
  
  # Extract model summary
  tryCatch({
    # Get coefficients and standard errors
    coefs <- model$coefficients
    # Skip intercept
    coefs <- coefs[names(coefs) != "Intercept"]
    
    # Get variance-covariance matrix
    vcov_matrix <- model$var
    
    # Calculate standard errors
    se <- sqrt(diag(vcov_matrix))[names(coefs)]
    
    # Calculate Wald statistics
    wald_stats <- abs(coefs / se)
    
    # Create data frame
    wald_df <- data.frame(
      variable = names(wald_stats),
      importance = as.numeric(wald_stats),
      stringsAsFactors = FALSE
    )
    
    # Clean variable names
    wald_df$variable <- gsub("`", "", wald_df$variable)
    
    # Sort by importance
    wald_df <- wald_df[order(wald_df$importance, decreasing = TRUE), ]
    
    # Apply top_n filter if specified
    if (!is.null(top_n) && is.numeric(top_n) && top_n > 0) {
      n_vars <- nrow(wald_df)
      if (top_n < n_vars) {
        wald_df <- wald_df[1:top_n, ]
      }
    }
    
    # Reorder factor levels for plotting
    wald_df$variable <- factor(wald_df$variable, 
                              levels = rev(wald_df$variable))
    
    # Create plot
    p <- ggplot2::ggplot(
      wald_df, 
      ggplot2::aes(x = variable, y = importance)
    ) +
      ggplot2::geom_col(fill = "darkblue") +
      ggplot2::coord_flip() +
      ggplot2::labs(
        title = title,
        x = NULL,
        y = "Absolute Wald Statistic (|Coefficient / SE|)"
      ) +
      ggplot2::theme_minimal(base_size = 14) +
      ggplot2::theme(
        plot.title = ggplot2::element_text(face = "bold", hjust = 0.5),
        panel.grid.minor = ggplot2::element_blank(),
        panel.grid.major.y = ggplot2::element_blank()
      )
    
    return(p)
    
  }, error = function(e) {
    message("Error calculating Wald statistics: ", e$message)
    
    # Create fallback plot with error message
    df <- data.frame(x = 1, y = 1)
    p <- ggplot2::ggplot(df, ggplot2::aes(x, y)) + 
      ggplot2::geom_blank() +
      ggplot2::annotate("text", x = 1, y = 1, 
                       label = paste("Could not create plot:", e$message),
                       hjust = 0.5, vjust = 0.5) +
      ggplot2::theme_void() +
      ggplot2::labs(title = "Error in Wald Statistics Plot")
    
    return(p)
  })
}

#' Create a simple but effective variable importance plot for most model types
#'
#' This function works with virtually any model object that has coefficients
#' and creates a simple variable importance plot.
#'
#' @param model Any model object with a coef() method
#' @param top_n Number of top variables to show (default: all)
#' @param title Custom title for the plot (default: "Variable Importance")
#' @param sort_by How to sort variables: "magnitude" (default) or "alphabetical"
#' @param color Bar color (default: "steelblue")
#'
#' @return A ggplot2 object showing variable importance
#'
#' @examples
#' # Works with virtually any regression model
#' universal_imp_plot <- universal_importance_plot(model, top_n = 5)
#' print(universal_imp_plot)
#'
#' # Customize appearance
#' universal_imp_plot <- universal_importance_plot(
#'   model, top_n = 3, color = "darkred", 
#'   title = "Top 3 Predictors"
#' )
#' print(universal_imp_plot)
#'
#' @importFrom stats coef
universal_importance_plot <- function(model, top_n = NULL, 
                                    title = "Variable Importance",
                                    sort_by = "magnitude", 
                                    color = "steelblue") {
  # Try different methods to extract coefficients
  tryCatch({
    # Method 1: the generic coef() method (NULL if it is unavailable or errors)
    coeffs <- tryCatch(stats::coef(model), error = function(e) NULL)
    
    # Method 2: fall back to the coefficients element stored on the model object
    if (is.null(coeffs) && !is.null(model$coefficients)) {
      coeffs <- model$coefficients
    }
    
    # Remove the intercept, whichever naming convention the model uses
    if (!is.null(coeffs) && !is.null(names(coeffs))) {
      coeffs <- coeffs[!names(coeffs) %in% c("(Intercept)", "Intercept")]
    }
    
    # If still no coefficients, try model summary
    if (is.null(coeffs) && !is.null(summary(model)$coefficients)) {
      coeffs_matrix <- summary(model)$coefficients
      if (is.matrix(coeffs_matrix) && nrow(coeffs_matrix) > 1) {
        # Get coefficient values (first column)
        coeffs <- coeffs_matrix[, 1]
        # Remove intercept if present
        if ("(Intercept)" %in% names(coeffs)) {
          coeffs <- coeffs[-1]  # Assume intercept is first row
        }
      }
    }
    
    # Check if we found any coefficients
    if (is.null(coeffs) || length(coeffs) == 0) {
      stop("Could not extract coefficients from the model")
    }
    
    # Create data frame
    imp_df <- data.frame(
      variable = names(coeffs),
      importance = abs(coeffs),
      stringsAsFactors = FALSE
    )

    # Clean variable names and improve readability
    imp_df$variable <- gsub("`", "", imp_df$variable)

    # Remove non-contributory identifier variables that should not appear in importance plots
    exclude_vars <- c("Record.ID", "Record ID", "MRN", "MRN.", "Completed", "Complete")
    imp_df <- imp_df[!grepl(paste(exclude_vars, collapse = "|"), imp_df$variable, ignore.case = TRUE), ]

    # Create readable labels for common abbreviations and variable names
    label_map <- c(
      "Age" = "Age (years)",
      "BMI" = "Body Mass Index",
      "POP.Q.stage." = "POP-Q Stage",
      "POPDI_6" = "Pelvic Organ Prolapse Distress",
      "CRADI_8" = "Colorectal-Anal Distress",
      "UDI_6" = "Urinary Distress",
      "PFDI_20_total" = "Pelvic Floor Distress Total",
      "Does.the.patient.have.OAB." = "Overactive Bladder",
      "Does.the.patient.have.diabetes." = "Diabetes",
      "Does.the.patient.have.a.h.o.recurrent.UTIs." = "History of Recurrent UTIs",
      "Is.the.patient.on.vaginal.estrogen." = "Vaginal Estrogen Use",
      "Immunocompromised." = "Immunocompromised",
      "Tobacco.use." = "Tobacco Use",
      "Menopause.status." = "Menopause Status",
      "What.was.the.reason.for.the.urodynamics." = "Indication for Urodynamics",
      "Average.number.of.voids.at.night." = "Nocturia (voids/night)",
      "Race." = "Race",
      "Year" = "Year"
    )

    # Apply label improvements
    for (i in seq_len(nrow(imp_df))) {
      var_name <- imp_df$variable[i]
      # Remove level suffixes for matching (e.g., "Race.=White" -> "Race.")
      base_var <- sub("=.*$", "", var_name)
      level_suffix <- sub("^[^=]*", "", var_name)

      if (base_var %in% names(label_map)) {
        if (level_suffix != "") {
          imp_df$variable[i] <- paste0(label_map[base_var], level_suffix)
        } else {
          imp_df$variable[i] <- label_map[base_var]
        }
      } else {
        # Clean up periods and dots in variable names
        imp_df$variable[i] <- gsub("\\.", " ", imp_df$variable[i])
        imp_df$variable[i] <- gsub("  +", " ", imp_df$variable[i])
        imp_df$variable[i] <- trimws(imp_df$variable[i])
      }
    }

    # Sort as requested
    if (sort_by == "magnitude" || sort_by == "importance") {
      imp_df <- imp_df[order(imp_df$importance, decreasing = TRUE), ]
    } else if (sort_by == "alphabetical" || sort_by == "alpha") {
      imp_df <- imp_df[order(imp_df$variable), ]
    }
    
    # Apply top_n filter
    if (!is.null(top_n) && is.numeric(top_n) && top_n > 0) {
      if (top_n < nrow(imp_df)) {
        imp_df <- imp_df[1:top_n, ]
      }
    }
    
    # For ggplot, reorder factor levels
    if (sort_by == "magnitude" || sort_by == "importance") {
      imp_df$variable <- factor(imp_df$variable, levels = rev(imp_df$variable))
    } else {
      imp_df$variable <- factor(imp_df$variable, levels = sort(imp_df$variable))
    }
    
    # Create plot with improved aesthetics
    p <- ggplot2::ggplot(
      imp_df,
      ggplot2::aes(x = variable, y = importance)
    ) +
      ggplot2::geom_col(fill = color, width = 0.7) +
      ggplot2::geom_text(
        ggplot2::aes(label = sprintf("%.2f", importance)),
        hjust = -0.1, size = 4, fontface = "bold"
      ) +
      ggplot2::coord_flip() +
      ggplot2::labs(
        title = title,
        x = NULL,
        y = "Absolute Coefficient Magnitude"
      ) +
      ggplot2::scale_y_continuous(expand = ggplot2::expansion(mult = c(0, 0.15))) +
      ggplot2::theme_minimal(base_size = 14) +
      ggplot2::theme(
        plot.title = ggplot2::element_text(face = "bold", hjust = 0.5, size = 16),
        axis.text.y = ggplot2::element_text(size = 12, color = "black"),
        axis.text.x = ggplot2::element_text(size = 11),
        axis.title.x = ggplot2::element_text(size = 12, face = "bold"),
        panel.grid.minor = ggplot2::element_blank(),
        panel.grid.major.y = ggplot2::element_blank(),
        plot.margin = ggplot2::margin(10, 20, 10, 10)
      )

    return(p)
    
  }, error = function(e) {
    message("Error creating variable importance plot: ", e$message)
    
    # Create fallback error message plot
    df <- data.frame(x = 1, y = 1)
    p <- ggplot2::ggplot(df, ggplot2::aes(x, y)) + 
      ggplot2::geom_blank() +
      ggplot2::annotate("text", x = 1, y = 1, 
                       label = paste("Could not create plot:", e$message),
                       hjust = 0.5, vjust = 0.5) +
      ggplot2::theme_void() +
      ggplot2::labs(title = "Error in Variable Importance Plot")
    
    return(p)
  })
}
# Usage examples for the robust variable importance functions
# These functions are designed to work with any model type and handle errors gracefully

#-----------------------------------------------------------------------------
# IMPROVED VARIABLE IMPORTANCE PLOT - Publication Quality
#-----------------------------------------------------------------------------
# Create a visually appealing gradient-colored importance plot

# Generate the importance data
imp_data <- tryCatch({
  coeffs <- coef(model)
  coeffs <- coeffs[names(coeffs) != "Intercept"]

  imp_df <- data.frame(
    variable = names(coeffs),
    importance = abs(coeffs),
    direction = ifelse(coeffs > 0, "Increases Risk", "Decreases Risk"),
    stringsAsFactors = FALSE
  )

  # Remove identifier variables
  imp_df <- imp_df[!grepl("Record|MRN|Complete", imp_df$variable, ignore.case = TRUE), ]

  # Clean variable names for display
  imp_df$display_name <- sapply(imp_df$variable, function(v) {
    v <- gsub("\\.", " ", v)
    v <- gsub("  +", " ", v)
    v <- gsub("Does the patient have ", "", v)
    v <- gsub("Is the patient ", "", v)
    v <- gsub("What was the reason for the urodynamics ", "", v)
    v <- gsub(" $", "", v)
    v <- gsub("= ", ": ", v)
    trimws(v)
  })

  imp_df <- imp_df[order(imp_df$importance, decreasing = TRUE), ]
  imp_df <- head(imp_df, 10)  # Top 10
  imp_df$display_name <- factor(imp_df$display_name, levels = rev(imp_df$display_name))
  imp_df
}, error = function(e) NULL)

if (!is.null(imp_data) && nrow(imp_data) > 0) {
  # Create publication-quality plot with gradient coloring
  publication_importance_plot <- ggplot(imp_data, aes(x = display_name, y = importance, fill = importance)) +
    geom_col(width = 0.7, color = "white", linewidth = 0.5) +
    geom_text(aes(label = sprintf("%.1f", importance)),
              hjust = -0.1, size = 4, fontface = "bold", color = "gray20") +
    scale_fill_gradient(low = "#74add1", high = "#d73027", guide = "none") +
    coord_flip() +
    labs(
      title = "Top Predictors of Cancellation",
      subtitle = "Based on absolute coefficient magnitude from logistic regression",
      x = NULL,
      y = "Importance (|Coefficient|)",
      caption = "Note: Higher values indicate stronger association with cancellation"
    ) +
    scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
    theme_minimal(base_size = 13) +
    theme(
      plot.title = element_text(face = "bold", size = 16, hjust = 0.5, margin = margin(b = 5)),
      plot.subtitle = element_text(size = 11, color = "gray40", hjust = 0.5, margin = margin(b = 15)),
      plot.caption = element_text(size = 10, color = "gray50", face = "italic", margin = margin(t = 15)),
      axis.text.y = element_text(size = 12, color = "gray10", face = "bold"),
      axis.text.x = element_text(size = 11),
      axis.title.x = element_text(size = 12, face = "bold", margin = margin(t = 10)),
      panel.grid.major.y = element_blank(),
      panel.grid.minor = element_blank(),
      plot.margin = margin(20, 40, 15, 15)
    )

  print(publication_importance_plot)

  # Save high-resolution version
  ggsave(
    filename = "variable_importance_publication.png",
    plot = publication_importance_plot,
    width = 9,
    height = 6,
    dpi = 300,
    bg = "white"
  )
}
Figure: Variable Importance Plot showing top predictors of procedure cancellation

#-----------------------------------------------------------------------------
# SIMPLIFIED TOP 5 PLOT for manuscript figure
#-----------------------------------------------------------------------------
final_plot <- universal_importance_plot(
  model,
  top_n = 5,
  title = "Top 5 Predictors of Procedure Cancellation",
  color = "#2C7FB8"
)

# Save the plot as a high-resolution PNG
ggsave(
  filename = "variable_importance_plot.png",
  plot = final_plot,
  width = 7,
  height = 5,
  dpi = 300
)

# Or save as PDF for vector graphics
ggsave(
  filename = "variable_importance_plot.pdf",
  plot = final_plot,
  width = 7,
  height = 5
)

# Print confirmation
cat("Plots saved successfully to current directory\n")

Plots saved successfully to current directory

The difference between the top 5 variables shown in the importance plot (Race=Black or African American, Tobacco use=Current tobacco user, Race=Asian, reason for urodynamics=evaluation of voiding dysfunction, and POP.Q.stage) and the variables selected by LASSO can be explained by several factors:

  1. Different measurement approaches: Variable importance plots typically use absolute coefficient magnitude from the fitted model, while LASSO uses regularization to select variables during model building. These represent fundamentally different approaches to assessing variable importance.

  2. Joint relationships among predictors: In the full logistic regression model, coefficients are estimated in the presence of all other variables. A variable might appear important in isolation (as in the importance plot) but become less influential when the predictors are modeled jointly, which is the setting in which LASSO performs its selection.

  3. Factor variable handling: The importance plot shows individual factor levels (like Race=Black or Race=Asian) as separate predictors. Standard LASSO also penalizes each dummy-coded level separately, so it can retain some levels of a factor while dropping others; selecting or dropping “Race” as a whole would require a grouped LASSO.

  4. Correlation structure: If predictors are correlated, LASSO tends to select one representative variable from correlated groups. Your importance plot is showing the marginal importance of each variable without considering these correlations.

  5. Regularization effect: LASSO specifically penalizes coefficient size to prevent overfitting, potentially excluding variables with larger coefficients (which would appear important in the plot) if they don’t contribute enough additional predictive power.

This is why our nomogram based on LASSO focuses on Age and BMI as key predictors, while the importance plot shows categorical variables like Race and Tobacco use with large coefficients. LASSO likely determined that Age and BMI provide the most stable and generalizable prediction, despite other variables having larger coefficients in the unregularized model.

For clinical interpretation, the LASSO-selected variables (Age and BMI) are likely more reliable for prediction than the variables showing high importance in the unregularized importance plot.

6.0.2 Variable Screening and Removal

The following sections document the systematic screening of variables for inclusion in the prediction model. Each step tracks which variables were removed and the rationale.

6.0.2.1 Step 1: Near-Zero Variance Screening

Variables with near-zero variance provide minimal discriminative information and can cause numerical instability in model fitting.

Variables removed due to near-zero variance:

  • Immunocompromised.

6.0.2.2 Step 2: High Missingness Screening

Variables with excessive missing data (>90%) were excluded owing to concerns about imputation reliability.

No variables removed due to high missingness.

6.0.2.3 Step 3: High Collinearity Screening

Highly correlated predictors (r > 0.90) can cause multicollinearity, inflating standard errors and destabilizing coefficient estimates.

No variables removed due to high collinearity.
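
A minimal sketch of these three screening steps, assuming the candidate predictors live in labels_df:

# Step 1: near-zero variance (caret)
nzv_metrics <- caret::nearZeroVar(labels_df, saveMetrics = TRUE)
nzv_vars <- rownames(nzv_metrics)[nzv_metrics$nzv]  # flags Immunocompromised. here

# Step 2: high missingness (>90% NA)
high_na_vars <- names(labels_df)[colMeans(is.na(labels_df)) > 0.90]

# Step 3: high pairwise correlation (r > 0.90) among numeric predictors
num_df <- labels_df[sapply(labels_df, is.numeric)]
high_cor_idx <- caret::findCorrelation(
  cor(num_df, use = "pairwise.complete.obs"), cutoff = 0.90
)
high_cor_vars <- names(num_df)[high_cor_idx]

# Drop everything flagged by any screen
labels_df <- labels_df[, setdiff(names(labels_df),
                                 c(nzv_vars, high_na_vars, high_cor_vars))]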

6.0.2.4 Summary of Variable Screening

Initial variables: 20

Variables removed: 1

Variables retained for modeling: 18

Table: Variables Excluded During Screening

Variable            Reason for Exclusion   Details
Immunocompromised.  Near-zero variance     Frequency ratio: 19.96, % unique: 0.34%

Predictors retained for LASSO feature selection:

  • Age.
  • BMI
  • Race.
  • Is.the.patient.hispanic..latino.or.of.Spanish.origin.
  • Does.the.patient.have.diabetes.
  • Does.the.patient.have.a.h.o.recurrent.UTIs.
  • Tobacco.use.
  • Menopause.status.
  • Is.the.patient.on.vaginal.estrogen.
  • Does.the.patient.have.OAB.
  • Average.number.of.voids.at.night.
  • POP.Q.stage.
  • Year
  • POPDI_6
  • CRADI_8
  • UDI_6
  • PFDI_20_total
  • What.was.the.reason.for.the.urodynamics.

7 Feature Selection with LASSO

7.1 What is LASSO?

LASSO (Least Absolute Shrinkage and Selection Operator) is a regularization technique that performs both variable selection and regularization to enhance prediction accuracy and model interpretability. Unlike traditional regression that uses all available predictors, LASSO automatically identifies and retains only the most important variables by shrinking less important coefficients to exactly zero.

Figure: Conceptual comparison of regression methods

Key insight from the figure above: Notice how LASSO (green bars) sets coefficients for X3, X4, X6, X7, X8, and X10 to exactly zero, matching the true underlying pattern. Ordinary Least Squares (red) produces noisy estimates for all variables, while Ridge (orange) shrinks all coefficients but never to exactly zero.

7.2 The Mathematical Foundation

LASSO minimizes the following objective function:

\[ \hat{\beta}^{LASSO} = \arg\min_{\beta} \left\{ \underbrace{\sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} X_{ij} \beta_j \right)^2}_{\text{Prediction Error (RSS)}} + \underbrace{\lambda \sum_{j=1}^{p} |\beta_j|}_{\text{L1 Penalty}} \right\} \]

This equation has two components (for logistic regression, the RSS term is replaced by the negative log-likelihood; the L1 penalty is unchanged):

Component Formula What it does
Residual Sum of Squares (RSS) \(\sum(y_i - \hat{y}_i)^2\) Measures how well the model fits the data
L1 Penalty \(\lambda \sum |\beta_j|\) Penalizes the absolute size of coefficients
Lambda (λ) Tuning parameter Controls the trade-off between fit and simplicity

7.3 Why Does LASSO Set Coefficients to Zero?

The key to understanding LASSO is its geometric interpretation:

Figure: Geometric interpretation of LASSO vs Ridge regression

Interpretation:

  • The blue dashed ellipses represent contours of equal prediction error (Residual Sum of Squares)
  • The colored regions show the constraint imposed by regularization
  • The optimal solution is where the Residual Sum of Squares contour first touches the constraint region
  • LASSO’s diamond-shaped constraint has corners on the axes, making it likely that the solution will have some coefficients exactly equal to zero
  • Ridge’s circular constraint means solutions typically don’t land on axes, so coefficients are shrunk but not zeroed
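
Equivalently, the penalized problem can be written in constrained form, which is exactly what the diamond (LASSO) and circle (ridge) regions in the figure depict:

\[ \hat{\beta}^{LASSO} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{p} X_{ij} \beta_j \right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t \]

Ridge uses the constraint \(\sum_{j=1}^{p} \beta_j^2 \le t\) instead; each value of the budget \(t\) corresponds to some value of \(\lambda\) in the penalized form.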

7.4 The Role of Lambda (λ)

The tuning parameter λ controls how much we penalize large coefficients:

Figure: Effect of lambda on coefficient estimates

Reading the coefficient path plot:

  • Left side (low λ): all variables have non-zero coefficients
  • Right side (high λ): only the strongest predictors remain
  • Variables that reach zero early (like Smoking and Exercise in this illustration) are less important
  • Variables that persist (like Age and BMI) are the most important predictors

7.5 How Cross-Validation Selects the Optimal Lambda

We use 10-fold cross-validation to find the best λ value. The process works as follows:

  1. Divide data into 10 folds - The dataset is randomly split into 10 equal parts
  2. For each λ value - Train the model on 9 folds, test on the remaining fold
  3. Repeat 10 times - Each fold serves as the test set once
  4. Calculate mean error - Average the prediction error across all 10 folds
  5. Select optimal λ - Choose the λ that minimizes cross-validated error
Figure: Conceptual illustration of cross-validation for lambda selection

Important Note: The lambda values displayed in the conceptual illustration above (such as λ.min ≈ 0.091) are simulated values for educational purposes only. The actual lambda values from our urodynamics data cross-validation are calculated and displayed in the sections below.

Two common choices for λ:

Choice Description When to use
λ.min Lambda that minimizes CV error When prediction accuracy is paramount
λ.1se Largest λ within 1 SE of minimum When you want a simpler, more interpretable model
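
A minimal sketch of this selection with glmnet, assuming the outcome and predictors are in labels_df with the make.names()-style column names used above (the actual cv_lasso object is created elsewhere in the analysis; the seed is an arbitrary assumption):

library(glmnet)
set.seed(123)  # fold assignment is random, so fix a seed for reproducibility

# Build the design matrix (dummy-coded factors, intercept column dropped)
x <- model.matrix(Was.the.procedure.cancelled. ~ ., data = labels_df)[, -1]
y <- labels_df$Was.the.procedure.cancelled.

cv_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 10)

cv_lasso$lambda.min               # lambda minimizing cross-validated error
cv_lasso$lambda.1se               # largest lambda within 1 SE of the minimum
coef(cv_lasso, s = "lambda.min")  # non-zero rows are the selected features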

7.6 Practical Example: Our Urodynamics Data

Let’s see how LASSO works on our actual data for predicting procedure cancellation:

Figure: Step-by-step LASSO application to our data

7.7 Comparison: Why LASSO Over Other Methods?

Comparison of Variable Selection Methods

Feature                           Stepwise Selection   Ridge Regression   LASSO
Automatic variable selection      Yes*                 No                 Yes
Handles multicollinearity         No                   Yes                Yes
Prevents overfitting              No                   Yes                Yes
Produces sparse models            Yes                  No                 Yes
Interpretable results             Moderate             Low                High
Works with high-dimensional data  No                   Yes                Yes
Stable across samples             No                   Yes                Yes
Computationally efficient         Yes                  Yes                Yes

*Stepwise selection is unstable and may produce different results with small data changes

7.8 Summary: Key Takeaways About LASSO

LASSO Key Points Summary

1. WHAT IT DOES
  • Selects important variables automatically
  • Shrinks unimportant coefficients to exactly zero
  • Produces simpler, more interpretable models

2. HOW IT WORKS
  • Adds an L1 penalty to the loss function
  • The diamond-shaped constraint creates sparse solutions
  • Cross-validation finds the optimal penalty strength (λ)

3. WHY WE USED IT
  • Started with 20+ candidate predictors
  • Needed objective, reproducible variable selection
  • Wanted to avoid overfitting with limited events

4. WHAT WE FOUND
  • LASSO retained only Age and BMI as predictors
  • Other variables (race, diabetes, etc.) were shrunk to zero
  • This parsimonious model is more likely to generalize

# Create a data frame with the cross-validation results
cv_results <- data.frame(
  lambda = cv_lasso$lambda,
  mean_cross_validated_error = cv_lasso$cvm,           # Mean cross-validated error
  standard_error_of_cross_validation = cv_lasso$cvsd,         # Standard error of CV error
  upper_bound_for_cross_validation_error = cv_lasso$cvup,         # Upper bound for CV error
  lower_bound_for_cross_validation_error = cv_lasso$cvlo          # Lower bound for CV error
)

# Get the optimal lambda value first
optimal_lambda <- cv_lasso$lambda.min

# Find the row index of the optimal lambda
optimal_row <- which.min(abs(cv_results$lambda - optimal_lambda))

# Display actual cross-validation results from our data
cv_results_display <- cv_results %>%
  head(20) %>%
  mutate(
    lambda = round(lambda, 6),
    mean_cross_validated_error = round(mean_cross_validated_error, 4),
    standard_error_of_cross_validation = round(standard_error_of_cross_validation, 4),
    upper_bound_for_cross_validation_error = round(upper_bound_for_cross_validation_error, 4),
    lower_bound_for_cross_validation_error = round(lower_bound_for_cross_validation_error, 4)
  ) %>%
  rename(
    `λ (Lambda)` = lambda,
    `CV Error` = mean_cross_validated_error,
    `Std Error` = standard_error_of_cross_validation,
    `Upper Bound` = upper_bound_for_cross_validation_error,
    `Lower Bound` = lower_bound_for_cross_validation_error
  )

# Create flextable with highlighted optimal row
cv_results_display %>%
  flextable() %>%
  set_caption(paste0("LASSO Cross-Validation Results from Our Urodynamics Data (Optimal λ = ",
                     round(optimal_lambda, 6), ")")) %>%
  theme_vanilla() %>%
  autofit() %>%
  bold(part = "header") %>%
  fontsize(size = 9, part = "all") %>%
  bg(i = optimal_row, bg = "#90EE90") %>%  # Highlight optimal lambda row in green
  bold(i = optimal_row) %>%
  add_footer_lines(paste0("Green highlighted row shows the optimal λ (lambda.min = ",
                          round(optimal_lambda, 6),
                          ") that minimizes cross-validated error.")) %>%
  fontsize(size = 8, part = "footer") %>%
  italic(part = "footer")
LASSO Cross-Validation Results from Our Urodynamics Data (Optimal λ = 0.010821)

λ (Lambda)   CV Error   Std Error   Upper Bound   Lower Bound
0.039805     0.4550     0.0464      0.5013        0.4086
0.036269     0.4541     0.0466      0.5007        0.4075
0.033047     0.4525     0.0472      0.4997        0.4052
0.030111     0.4503     0.0479      0.4983        0.4024
0.027436     0.4497     0.0487      0.4984        0.4010
0.024999     0.4502     0.0494      0.4996        0.4007
0.022778     0.4508     0.0502      0.5010        0.4006
0.020755     0.4511     0.0509      0.5020        0.4002
0.018911     0.4510     0.0518      0.5029        0.3992
0.017231     0.4511     0.0526      0.5038        0.3985
0.015700     0.4507     0.0532      0.5039        0.3975
0.014305     0.4500     0.0536      0.5036        0.3964
0.013035     0.4492     0.0540      0.5032        0.3952
0.011877     0.4485     0.0545      0.5030        0.3940
0.010821     0.4482     0.0551      0.5034        0.3931   ← optimal (λ.min)
0.009860     0.4485     0.0560      0.5045        0.3925
0.008984     0.4492     0.0569      0.5061        0.3923
0.008186     0.4503     0.0580      0.5083        0.3923
0.007459     0.4517     0.0592      0.5109        0.3925
0.006796     0.4531     0.0604      0.5135        0.3928

The arrow marks the optimal λ (lambda.min = 0.010821) that minimizes cross-validated error.

7.8.1 Actual Cross-Validation Results from Our Data

These are the ACTUAL lambda values from our urodynamics data analysis (unlike the simulated values in the conceptual illustration above). The optimal λ value of 0.0108 was determined through 10-fold cross-validation on our actual dataset.

Figure: Cross-validation results from our urodynamics data showing optimal lambda selection

Interpretation of Our Cross-Validation Results:

  • The blue line shows how prediction error changes as we increase regularization (λ)
  • The shaded blue region represents uncertainty (±1 standard error)
  • The red dashed line marks our selected λ (0.010821) - the value that minimizes error
  • The green dashed line marks λ.1se (0.039805) - a more conservative choice
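Both dashed-line values can be read directly off the cross-validation object. A minimal sketch, assuming `cv_lasso` is the cv.glmnet fit used throughout this section:

# Positions of the two dashed lines (shown here on the natural-log scale
# that glmnet's own plots use)
lambda_min <- cv_lasso$lambda.min  # penalty minimizing mean cross-validated deviance
lambda_1se <- cv_lasso$lambda.1se  # largest penalty within 1 SE of that minimum
log(c(lambda.min = lambda_min, lambda.1se = lambda_1se))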
kable(selected_features_df,
      caption = "LASSO Model Selected Features and Coefficients",
      col.names = c("Feature", "Coefficient"),  # Match the actual column names
      align = c("l", "r"),                      # Left-align text, right-align numbers
      booktabs = TRUE,
      linesep = "",
      format = "html") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  ) %>%
  row_spec(0, bold = TRUE, color = "black", background = "#f2f2f2") %>%
  column_spec(2, width = "100px")  # Set width for the coefficient column
LASSO Model Selected Features and Coefficients

| Feature | Coefficient |
|:--------|------------:|
| BMI | 0.0278452 |
| Race.Other | 0.2535884 |
| Race.Black or African American | -0.4458023 |
| Race.Native Hawaiian or Other Pacific Islander | 0.6321212 |
| Is.the.patient.hispanic..latino.or.of.Spanish.origin.Yes | 0.0685129 |
| Does.the.patient.have.a.h.o.recurrent.UTIs.Yes | 0.3376785 |
| Tobacco.use.Current tobacco user | -0.3833653 |
| Menopause.status.Post-menopausal | 0.1768367 |
| Is.the.patient.on.vaginal.estrogen.Yes | 0.4295735 |
| POP.Q.stage.2 | -0.4816509 |
| What.was.the.reason.for.the.urodynamics.evaluation of OAB | 0.8606521 |

8 LASSO Regularization Path

# Create the regularization path plot
ggplot(lasso_filtered, aes(x = log10(Lambda), y = Coefficient, color = Feature)) +
  geom_line(linewidth = 1.2) +  # Ensure lines are drawn
  geom_vline(xintercept = log10(optimal_lambda), linetype = "dashed", color = "red", linewidth = 1.2) +
  labs(
    title = "LASSO Regularization Path",
    subtitle = "Red line indicates optimal lambda from cross-validation",
    x = expression(log[10](lambda)),
    y = "Coefficient Values"
  ) +
  theme_minimal(base_size = 14) +  
  theme(
    legend.position = "right",  # Changed to show legend for feature identification
    axis.title = element_text(face = "bold"),
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(size = 12, color = "gray40")
  )
Figure: LASSO Regularization Path showing how coefficient values change with lambda

Figure Interpretation: LASSO Regularization Path

This graph shows the LASSO regularization path for a logistic regression model, illustrating how coefficient values change as the regularization strength (lambda) varies.

Key Observations:

  1. Regularization Effect: As lambda increases (moving left to right on the x-axis), more coefficients are pushed toward zero. This demonstrates LASSO’s feature selection capability.

  2. Optimal Lambda: The vertical red line indicates the optimal lambda value determined through cross-validation. This represents the best balance between model complexity and predictive performance.

  3. Most Important Predictors: The variables whose coefficients remain largest at the optimal lambda are the most important predictors:

    • Age appears to have a positive coefficient
    • BMI appears to have a positive coefficient
  4. Variables Eliminated: Features with coefficients at or near zero at the optimal lambda contribute little predictive value after accounting for other variables.

  5. Direction of Effects:

    • Features with positive coefficients (above zero) increase the probability of the outcome
    • Features with negative coefficients (below zero) decrease the probability of the outcome

9 Cross-Validation Results

Figure Interpretation: Cross-Validation Plot

This figure shows the results of cross-validation for the LASSO logistic regression model, illustrating how model performance varies with different regularization strengths.

9.1 Key Elements of the Plot:

  1. X-axis: Shows the log of lambda (Log(λ)) values, ranging from approximately -10 to -3. Lambda is the regularization parameter that controls the strength of the penalty in LASSO regression.

  2. Y-axis: Shows the Binomial Deviance, which is the measure of error for your logistic regression model. Lower values indicate better model performance.

  3. Red line: Represents the mean cross-validated error across the folds for each lambda value.

  4. Gray error bars: Represent the standard error of the cross-validated error estimate.

  5. Numbers at top: Indicate how many features (variables) are retained in the model at each lambda value. As lambda increases (moving right), fewer variables are kept in the model (from 30 variables at the lowest lambda to 0 at the highest).

  6. Vertical dotted lines:

    • The left vertical line marks lambda.min (the value of lambda that gives minimum mean cross-validated error)
    • The right vertical line marks lambda.1se (the largest lambda value within one standard error of the minimum)

9.2 Interpretation:

  • The model achieves its lowest error (around 0.42) at a log lambda of approximately -4.5, where 9 variables are retained in the model.

  • The lambda.1se recommendation (right dotted line at approximately -3.5) suggests using a more parsimonious model with only 3 features, which is still within one standard error of the minimum error. This is often preferred as it’s simpler while maintaining comparable performance.

  • The plot demonstrates the tradeoff between model complexity and performance: as you move from left to right, the model becomes simpler (fewer variables) while the error generally increases.

  • This cross-validation plot helps you objectively select the optimal regularization parameter for your final LASSO model predicting procedure cancellation.

Based on this graph, your final model would include either 9 features (for minimum error) or 3 features (for a more parsimonious fit); both candidate sets likely include age and BMI, the two predictors shown in your nomogram.
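To see which predictors each choice keeps, the non-zero coefficients at the two candidate penalties can be listed by name. A sketch, again assuming the `cv_lasso` object (the helper `nz` is illustrative):

# List named non-zero coefficients at a given penalty
nz <- function(fit, s) {
  cf <- as.matrix(coef(fit, s = s))
  cf[cf[, 1] != 0, , drop = FALSE]
}
nz(cv_lasso, "lambda.min")  # the larger model behind the table above
nz(cv_lasso, "lambda.1se")  # the sparser, more conservative alternative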

9.2.1 Understanding the LASSO Selection Process

The table above shows that LASSO initially retained 11 predictors with non-zero coefficients at the optimal lambda. However, these coefficients vary substantially in magnitude:

From LASSO Table to Final Two Predictors:

  1. Initial LASSO output: The cross-validation process identified the optimal λ, which retained multiple predictors including categorical variables (race categories, indication for urodynamics, tobacco use status) and continuous variables (Age, BMI).

  2. Coefficient magnitude matters: While categorical variables like race showed relatively large coefficients (e.g., Native Hawaiian/Pacific Islander = 0.63, Black/African American = -0.45), these represent specific subgroups with small sample sizes. The continuous predictors Age (coefficient ≈ 0.03 per year) and BMI (coefficient ≈ 0.03 per unit) have smaller per-unit effects but apply across the entire patient population.

  3. Clinical parsimony: For the final nomogram, we focused on Age and BMI because:

    • They are universally available for all patients (no missing data issues)
    • They are continuous variables that provide more granular risk stratification
    • They have clear clinical interpretability
    • Race-based prediction models raise ethical concerns and may not generalize across populations
    • The categorical variables represent small subgroups that may reflect overfitting
  4. Model performance retained: The simplified Age + BMI model maintains good discrimination while being more practical and generalizable for clinical use.

# Display formatted kable table
# (LASSO does not produce p-values; these are the predictors retained with
# non-zero coefficients at the optimal lambda)
significant_predictors %>%
  kable("html", caption = "Predictors Retained by LASSO (Non-Zero Coefficients)") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
Predictors Retained by LASSO (Non-Zero Coefficients)

| Feature | Coefficient |
|:--------|------------:|
| (Intercept) | -4.0632923 |
| What.was.the.reason.for.the.urodynamics.evaluation of OAB | 0.8606521 |
| Race.Native Hawaiian or Other Pacific Islander | 0.6321212 |
| POP.Q.stage.2 | -0.4816509 |
| Race.Black or African American | -0.4458023 |
| Is.the.patient.on.vaginal.estrogen.Yes | 0.4295735 |
| Tobacco.use.Current tobacco user | -0.3833653 |
| Does.the.patient.have.a.h.o.recurrent.UTIs.Yes | 0.3376785 |
| Race.Other | 0.2535884 |
| Menopause.status.Post-menopausal | 0.1768367 |
| Is.the.patient.hispanic..latino.or.of.Spanish.origin.Yes | 0.0685129 |
| BMI | 0.0278452 |
# Step 1: Create datadist object
dd <- rms::datadist(selected_labels_df)

# Step 2: Restrict Age range in the datadist object
# (in dd$limits, limit types are rows and variables are columns)
dd$limits["Low", "Age."]  <- 20
dd$limits["High", "Age."] <- 90

# Step 3: Register datadist in options
options(datadist = "dd")
print(dd$limits)
                Was.the.procedure.cancelled.     Age.      BMI
Low:effect                                         54 24.72000
Adjust to                          Completed       65 28.60000
High:effect                                        73 33.82000
Low:prediction                     Completed       34 19.39659
High:prediction                    Cancelled       86 46.01465
Low                                Completed       20 17.92000
High                               Cancelled       90 62.80000

9.2.2 Understanding the Data Distribution Summary Output

The output above shows the datadist limits used by the rms package. The table displays:

| Row Label | Meaning |
|:----------|:--------|
| Low:effect | 25th percentile value (used for effect estimates) |
| Adjust to | Median or reference value (adjustment point) |
| High:effect | 75th percentile value (used for effect estimates) |
| Low:prediction | Minimum value used for predictions |
| High:prediction | Maximum value used for predictions |
| Low | Lower limit for variable range |
| High | Upper limit for variable range |

Why are some entries blank or NA?

Blank or NA entries for the outcome variable (Was.the.procedure.cancelled.) are expected. Numeric limits are only meaningful for continuous predictors (Age and BMI in our model):

  • For the outcome variable: numeric limits don't apply because it's the variable being predicted, not a continuous predictor
  • For categorical variables: limits don't apply because categories don't have meaningful numeric ranges

The important values for our continuous predictors are:

  • Age: range overridden to 20-90 years (Low: 20, High: 90)
  • BMI: range determined from the data (~17.9-62.8 kg/m²)

10 Model

extract_discrimination_metrics(model_output = model, verbose = TRUE) 

11 Understanding the Model Discrimination Metrics

This table shows two categories of metrics that evaluate how well your logistic regression model discriminates between outcomes.

11.1 What is the Difference Between Discrimination and Rank Discrimination?

Discrimination metrics assess the model’s overall explanatory power and calibration—how well the predicted probabilities match reality. These include R² (explained variance) and the Brier score (prediction accuracy).

Rank discrimination metrics assess the model’s ability to correctly order or rank patients by risk—whether higher-risk patients are consistently assigned higher probabilities than lower-risk patients. These include the C-statistic (concordance) and Somers’ D.

| Aspect | Discrimination | Rank Discrimination |
|:-------|:---------------|:--------------------|
| Question answered | "How much variation does the model explain?" | "Can the model correctly rank patients by risk?" |
| Focus | Calibration and explained variance | Ordering/ranking ability |
| Key metrics | R², Brier score | C-statistic (AUC), Somers' D |
| Sensitivity to prevalence | Yes, affected by class imbalance | Less affected |

11.2 Discrimination Metrics

  1. Nagelkerke R² (0.147): This pseudo-R² indicates that approximately 14.7% of the variation in the outcome is explained by the model. While this seems modest, lower R² values are common in logistic regression compared to linear regression, especially for rare outcomes.

  2. R²(2,587) (0.055): This is an R² adjusted for the number of predictors (2) and observations (587). The lower value suggests some shrinkage when accounting for model complexity.

  3. R²(2,114.4) (0.252): This adjusted R² uses an effective sample size of 114.4, which accounts for the class imbalance (546 completed vs 41 cancelled). The higher value (25.2%) indicates better model performance when properly accounting for the rare event rate.

  4. Brier Score (0.060): This measures the mean squared difference between predicted probabilities and actual outcomes. Values range from 0 (perfect) to 1 (worst). A score of 0.060 indicates excellent calibration. For context, a “null” model that predicts the baseline rate for everyone would have a Brier score of approximately 0.065.
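Both the score and its null-model benchmark are easy to reproduce. A minimal sketch, assuming `predictions` holds the model's fitted probabilities and `actual` is a 0/1 cancellation indicator (both names are illustrative):

# Brier score: mean squared gap between predicted probability and outcome
brier <- mean((predictions - actual)^2)

# Null benchmark: predict the base rate for everyone; this reduces to
# p0 * (1 - p0), roughly 0.07 * 0.93 = 0.065 in this dataset
p0 <- mean(actual)
brier_null <- mean((p0 - actual)^2)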

11.3 Rank Discrimination Metrics

  1. C-statistic (0.765): Also known as the Area Under the Curve (AUC) or concordance statistic. For a randomly selected pair (one cancelled, one completed), this indicates the probability that the model assigns a higher risk to the cancelled case. Values above 0.7 indicate acceptable discrimination; 0.765 represents “good” discrimination.

  2. Somers’ D (0.531): A transformation of the C-statistic (Dxy = 2 × C - 1). This measures the net proportion of concordant pairs over discordant pairs. A value of 0.531 indicates substantial positive association between predictions and outcomes.
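As a quick arithmetic check of that transformation:

# Somers' D from the concordance statistic
C_stat <- 0.765
Dxy <- 2 * C_stat - 1  # 0.53; the unrounded C of 0.7654 gives the reported 0.531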

11.4 Class Imbalance Considerations

Our dataset exhibits significant class imbalance: only 7% of procedures were cancelled (41 of 587). This creates several challenges:

11.4.1 Impact of Class Imbalance:

  • Accuracy can be misleading: A model predicting “Completed” for everyone would achieve 93% accuracy but be clinically useless
  • Standard R² is deflated: Rare events naturally limit explained variance
  • Threshold selection matters: The default 0.5 threshold is inappropriate; we use a lower threshold (0.071) optimized via Youden’s J statistic

11.4.2 How We Address Class Imbalance:

  1. LASSO regularization: Helps prevent overfitting to the majority class
  2. Optimized threshold: Youden’s J statistic balances sensitivity and specificity
  3. Multiple evaluation metrics: We report sensitivity, specificity, PPV, NPV, and accuracy together
  4. Adjusted R²: The R²(2,114.4) metric explicitly accounts for effective sample size
  5. Rank-based metrics: C-statistic and Somers’ D are less affected by prevalence

11.4.3 Alternative Approaches for Class Imbalance (Not Used Here):

  • Oversampling (SMOTE): Creates synthetic minority class examples
  • Undersampling: Reduces majority class observations
  • Cost-sensitive learning: Assigns higher misclassification costs to the minority class
  • Ensemble methods: Random forests with balanced class weights

We chose LASSO with threshold optimization because it maintains the interpretability needed for clinical decision-making while appropriately handling the imbalanced outcome.

11.5 Interpretation Summary

The model shows good discriminative ability (C-statistic = 0.765) and excellent calibration (Brier score = 0.060). While the basic R² (0.147) appears modest, this is expected for rare events, and the adjusted R² (0.252) demonstrates meaningful explanatory power when accounting for class imbalance.

For a model predicting a rare event (only about 7% positive cases), these metrics indicate a clinically useful model with good discrimination that can help identify patients at elevated risk of cancellation.

11.5.1 Brier Score Plot

This calibration plot visualizes how well the predicted probabilities from your logistic regression model align with the actual observed outcomes. The x-axis (“Predicted Probability”) represents the probabilities that the model estimated, while the y-axis (“Observed Probability of Cancellation”) indicates how frequently cancellations occurred within each predicted probability bin. Each point (blue circle) corresponds to a group of observations with similar predicted probabilities, and the size of each circle reflects the number of observations in that group, as detailed in the legend. Vertical lines around each point represent the 95% confidence intervals, indicating the uncertainty of each observed proportion.

A dashed grey diagonal line shows perfect calibration (where predictions exactly match observations). Your model shows good overall calibration, as evidenced by points closely following this ideal diagonal line, particularly for predicted probabilities below approximately 0.3. However, at higher probabilities (around 0.3 to 0.4), there is more deviation from perfect calibration, suggesting predictions in this range might be less reliable due to fewer observations or greater variability. The Brier score (0.0601) supports this interpretation, indicating good predictive accuracy overall, with only minor discrepancies observed at higher predicted probabilities.

# Create the calibration plot using the function
calibration_plot <- create_calibration_plot(model)

# Display the plot
print(calibration_plot)
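For reference, a minimal sketch of the binning logic behind such a plot (assuming `predictions` holds fitted probabilities and `actual` is a 0/1 cancellation indicator; `create_calibration_plot()` above is the project's own wrapper):

library(dplyr)
library(ggplot2)

calib <- tibble(pred = predictions, obs = actual) %>%
  mutate(bin = ntile(pred, 10)) %>%            # ten equal-size probability bins
  group_by(bin) %>%
  summarise(mean_pred = mean(pred), mean_obs = mean(obs), n = n())

ggplot(calib, aes(mean_pred, mean_obs, size = n)) +
  geom_abline(linetype = "dashed", colour = "grey50") +  # perfect calibration
  geom_point(colour = "steelblue") +
  labs(x = "Predicted Probability", y = "Observed Probability of Cancellation")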

The Brier Score and the Area Under the Receiver Operating Characteristic Curve (AUC) are both valuable metrics for assessing model performance, though they measure different aspects of predictive accuracy. The Brier Score evaluates a model's overall accuracy by simultaneously assessing calibration (how closely the predicted probabilities match observed outcomes) and discriminatory power (the ability to distinguish between outcome classes). It ranges from 0 to 1, where lower values indicate better performance: a score of 0 represents perfect prediction accuracy, whereas scores closer to 0.25 typically suggest no better performance than random chance in a balanced binary classification.

Conversely, the AUC specifically quantifies a model's ability to discriminate between the two outcome classes. It ranges from 0.5 to 1.0, with 0.5 indicating random guessing (no discrimination ability) and 1.0 indicating perfect discrimination. For instance, an AUC of 0.85 demonstrates strong ability to distinguish between classes based on predicted probabilities, but it does not guarantee that those probabilities accurately reflect the actual observed frequencies. Thus, while the Brier Score combines both calibration and discrimination, the AUC exclusively measures discrimination.

11.5.2 AUC Plot

roc_results <- plot_roc_curve(
   input_data = selected_labels_df,
   outcome_variable = "Was.the.procedure.cancelled.",
   prediction_model = model,
   prediction_type = "response",
   #optimal_threshold = TRUE,
   confidence_intervals = TRUE,
   ci_method = "bootstrap",
   #custom_title = NULL,
   save_path = NULL,
   verbose = TRUE
 )

print(roc_results)

$roc_object

Call: roc.default(response = actual_outcome_values, predictor = prediction_values)

Data: prediction_values in 546 controls (actual_outcome_values 0) < 41 cases (actual_outcome_values 1).
Area under the curve: 0.7654

$plot_object

The values in your ROC curve represent important metrics related to the performance of your predictive model:

Optimal threshold: -2.57

  • This is the cutoff on your model's predictions that maximizes the balance between sensitivity and specificity
  • It is negative because it is expressed on the log-odds scale (common in logistic regression); on the probability scale, plogis(-2.57) ≈ 0.071
  • It was determined using Youden's J statistic, which finds the point that maximizes (sensitivity + specificity - 1)

Sensitivity: 0.83

  • Also known as the "true positive rate"
  • At the optimal threshold of -2.57, your model correctly identifies 83% of the positive cases
  • In this context (with Was.the.procedure.cancelled. as the outcome variable), 83% of procedures that were actually cancelled were correctly predicted as cancellations

Specificity: 0.67

  • Also known as the "true negative rate"
  • At the optimal threshold, your model correctly identifies 67% of the negative cases
  • 67% of procedures that were not cancelled were correctly predicted as non-cancellations

What this means for your model:

  • The AUC of 0.765 indicates a moderately good model (values range from 0.5 for random guessing to 1.0 for perfect prediction)
  • Your model is better at identifying positive cases (cancellations) than negative cases
  • At this threshold, you'll correctly identify 83% of cancellations but will incorrectly classify 33% of non-cancellations as cancellations (false positives)

Depending on your specific goals, you might want to adjust this threshold:

  • If missing a cancellation is very costly, use a lower threshold to increase sensitivity
  • If falsely predicting cancellations is problematic, use a higher threshold to increase specificity

The 95% confidence interval (0.684-0.844) shows the range where the true AUC is likely to fall, indicating reasonable stability in your model’s performance.

11.5.3 Confusion Matrix

# Histogram of predicted probabilities
hist(predictions,
     breaks = 30,
     main = "Distribution of Predicted Cancellation Probabilities",
     xlab = "Predicted Probability of Cancellation",
     ylab = "Number of Patients",
     col = "steelblue",
     border = "white")
abline(v = 0.071, col = "red", lwd = 2, lty = 2)
legend("topright", legend = c("Optimal Threshold (0.071)"), col = "red", lty = 2, lwd = 2)
Figure: Distribution of predicted probabilities

11.5.4 Interpreting the Histogram of Predictions

The histogram above shows the distribution of predicted cancellation probabilities across all 587 patients in the dataset. Key observations:

  1. Right-skewed distribution: Most patients have low predicted probabilities (concentrated between 0.0-0.15), which reflects the low baseline cancellation rate (7%) in the population.

  2. Peak near zero: The tallest bars are near 0.05-0.10, indicating most patients are predicted to have a low risk of cancellation. This is expected given the 93% completion rate.

  3. Long right tail: A smaller number of patients extend to higher probabilities (0.20-0.45), representing high-risk individuals (typically older patients with higher BMI).

  4. Red dashed line: The optimal classification threshold (0.071) is shown. Patients with predicted probabilities above this line are classified as “likely to cancel.”

Why is the threshold so low? The threshold of 0.071 (7.1%) may seem low, but it reflects the base rate of cancellation in the data. Because cancellations are rare, the optimal threshold that balances sensitivity and specificity is near the population prevalence. Using the standard 0.5 threshold would result in classifying virtually no patients as “at risk.”

Optimal Threshold Interpretation (Programmatic): The optimal classification threshold of 0.071 (7.1%) was determined by maximizing Youden’s J statistic (sensitivity + specificity - 1), which identifies the probability cutoff that best balances the ability to detect true cancellations (sensitivity) while minimizing false alarms (1 - specificity). This threshold is close to the base cancellation rate of 7%, which is expected when the outcome is rare—the optimal threshold gravitates toward the prevalence to avoid systematic over- or under-prediction.
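The same threshold search can be reproduced with pROC. A minimal sketch, assuming `roc_obj` is the pROC object returned above (roc_results$roc_object):

library(pROC)

# Youden-optimal operating point: maximizes sensitivity + specificity - 1.
# The threshold is returned on the scale of the predictor given to roc();
# here that is the log-odds scale (~ -2.57), and plogis(-2.57) ≈ 0.071.
coords(roc_obj, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))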

# 5. Print the confusion matrix as a formatted table
conf_df <- as.data.frame.matrix(conf_matrix)
conf_df <- cbind(Actual = rownames(conf_df), conf_df)
rownames(conf_df) <- NULL

# Display using flextable for better formatting
conf_ft <- flextable(conf_df) %>%
  set_caption("Confusion Matrix: Actual vs. Predicted Classifications") %>%
  set_header_labels(Actual = "Actual Outcome") %>%
  add_header_row(values = c("", "Predicted"), colwidths = c(1, 2)) %>%
  theme_vanilla() %>%
  autofit() %>%
  align(align = "center", part = "all") %>%
  bold(part = "header") %>%
  bg(i = 1, j = 3, bg = "#d4edda") %>%  # True negatives (Completed, predicted Completed) - green
  bg(i = 2, j = 2, bg = "#d4edda") %>%  # True positives (Cancelled, predicted Cancelled) - green
  bg(i = 1, j = 2, bg = "#f8d7da") %>%  # False positives (Completed, predicted Cancelled) - red
  bg(i = 2, j = 3, bg = "#f8d7da")      # False negatives (Cancelled, predicted Completed) - red

conf_ft
Confusion Matrix: Actual vs. Predicted Classifications

| Actual Outcome | Predicted: Cancelled | Predicted: Completed |
|:---------------|---------------------:|---------------------:|
| Completed | 179 | 367 |
| Cancelled | 34 | 7 |

# 6. Calculate and display metrics
# Note: the rows are ordered (Completed, Cancelled) but the columns are
# (Cancelled, Completed), so diag() would pair mismatched labels; index by
# name instead so accuracy counts true positives plus true negatives.
accuracy <- (conf_matrix["Completed", "Completed"] +
             conf_matrix["Cancelled", "Cancelled"]) / sum(conf_matrix)
sensitivity <- conf_matrix["Cancelled", "Cancelled"] / sum(conf_matrix["Cancelled", ])
specificity <- conf_matrix["Completed", "Completed"] / sum(conf_matrix["Completed", ])

# Create metrics table
metrics_df <- data.frame(
  Metric = c("Accuracy", "Sensitivity (Recall)", "Specificity"),
  Value = c(paste0(round(accuracy * 100, 1), "%"),
            paste0(round(sensitivity * 100, 1), "%"),
            paste0(round(specificity * 100, 1), "%")),
  Interpretation = c(
    "Overall correct classification rate",
    "Proportion of actual cancellations correctly identified",
    "Proportion of completed procedures correctly identified"
  )
)

flextable(metrics_df) %>%
  set_caption("Classification Performance Metrics at Optimal Threshold (0.071)") %>%
  theme_vanilla() %>%
  autofit() %>%
  bold(part = "header")
Classification Performance Metrics at Optimal Threshold (0.071)

| Metric | Value | Interpretation |
|:-------|------:|:---------------|
| Accuracy | 68.3% | Overall correct classification rate |
| Sensitivity (Recall) | 82.9% | Proportion of actual cancellations correctly identified |
| Specificity | 67.2% | Proportion of completed procedures correctly identified |

(Accuracy is (367 + 34) / 587 = 68.3%; an earlier draft reported the complement owing to the diag() mispairing corrected above.)

11.5.5 Interpreting the Confusion Matrix

The confusion matrix shows how the model’s predictions compare to actual outcomes. At the optimal threshold of 0.071:

| | Predicted: Completed | Predicted: Cancelled |
|:--|:--|:--|
| Actually Completed | True Negatives (TN) | False Positives (FP) |
| Actually Cancelled | False Negatives (FN) | True Positives (TP) |

Reading the results:

  • True Positives (green): Patients correctly predicted to have cancellations
  • True Negatives (green): Patients correctly predicted to complete their procedures
  • False Positives (red): Patients predicted to cancel but who actually completed (false alarms)
  • False Negatives (red): Patients predicted to complete but who actually cancelled (missed detections)

Trade-off at the chosen threshold:

The optimal threshold (0.071) balances sensitivity and specificity. However, clinical considerations may suggest adjusting this threshold:

  • If missing a cancellation is costly (wasted OR time, patient inconvenience): Use a lower threshold to increase sensitivity, accepting more false positives
  • If false alarms are problematic (unnecessary interventions, patient anxiety): Use a higher threshold to increase specificity, accepting more missed cancellations

Your classification model predicting whether a procedure would be cancelled achieved an accuracy of 68.3% at the optimal threshold, meaning roughly two of every three predictions matched the actual outcomes. Given that cancellations represent a minority class (41 cancellations out of 587 observations), accuracy alone does not fully capture your model's performance.

The sensitivity (recall) describes the model’s ability to correctly detect patients whose procedures actually were cancelled. A higher sensitivity means fewer missed cancellations. Your calculated sensitivity reflects how well the model identified those who truly experienced cancellations. Conversely, specificity represents the proportion of correctly identified non-cancelled procedures. A high specificity means your model avoids incorrectly labeling procedures as cancelled when they actually occurred as planned.

However, sensitivity and specificity alone don’t provide a full picture—particularly if cancellations are uncommon (as your data suggest, with cancellations being a minority class, around 7.0%). In these scenarios, positive predictive value (PPV) and negative predictive value (NPV) become essential.

  • A high PPV indicates that when the model predicts cancellation, it is likely correct. In practice, this would help prioritize interventions or preoperative counseling for patients most likely to cancel.
  • A high NPV would indicate reliability in identifying patients whose procedures likely will not be cancelled, allowing more confident scheduling without unnecessary resource allocation.

Given your results:

  • Accuracy: Your model correctly classified a good proportion of cases overall, though accuracy alone can be misleading if the dataset is imbalanced.
  • Sensitivity and specificity indicate reasonable discriminative ability (C-index = 0.765), suggesting your model effectively differentiates between cancellations and non-cancellations.
  • A Brier score of 0.060 indicates good overall calibration, suggesting your model’s predicted probabilities closely align with actual outcomes.

To fully interpret your results in clinical practice, calculate the PPV and NPV. High values in these metrics would further support your model’s clinical usefulness, making it easier for clinicians to act confidently based on predicted risk levels.
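Both values follow in two lines from the confusion-matrix counts shown above (a minimal sketch with the counts hard-coded from the table):

# Counts from the confusion matrix at threshold 0.071
TP <- 34; FP <- 179; FN <- 7; TN <- 367

ppv <- TP / (TP + FP)  # ~0.16: 16% of predicted cancellations actually cancel
npv <- TN / (TN + FN)  # ~0.981: predicted completions almost always complete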

Overall, your model demonstrates good discrimination and calibration, making it useful for anticipating procedural cancellations and enabling targeted management or counseling, potentially reducing cancellations or enhancing operational efficiency.

12 Nomogram

13 How to Read the Procedure Cancellation Nomogram

This nomogram is a visual tool for predicting the probability of a urodynamics procedure being cancelled based on a patient’s characteristics. Here’s how to use it:

13.1 Step 1: Locate Patient Values

  • Find the patient’s Age on the Age scale (ranging from ~10 to 100 years)
  • Find the patient’s BMI on the BMI scale (ranging from ~15 to 55)

13.2 Step 2: Convert to Points

  • For each value, draw a straight line up to the “Points” scale at the top
  • Read the corresponding point value for each characteristic

13.3 Step 3: Calculate Total Points

  • Add up all the individual points from Age and BMI
  • Locate this sum on the “Total Points” scale (ranging from 0 to 120)

13.4 Step 4: Determine Probability

  • From the Total Points value, draw a straight line down to the “Probability of Cancellation” scale
  • Read the corresponding probability (ranging from 0.05 to 0.45 or 5% to 45%)

13.5 Example

If a patient is 80 years old and has a BMI of 40:

  1. Age 80 = approximately 80 points
  2. BMI 40 = approximately 40 points
  3. Total points = 120
  4. Probability of cancellation ≈ 0.40 (40%)
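Because the nomogram is a graphical rendering of the fitted logistic model, any visual reading can be cross-checked against the model itself. A sketch, assuming the `model` object fit with rms::lrm later in this report; the printed value is exact where reading the figure is approximate, so any discrepancy favors the printed value:

# Model-based probability for the example patient (age 80, BMI 40)
predict(model, newdata = data.frame(Age. = 80, BMI = 40), type = "fitted")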

13.6 Key Insights

  • Higher age corresponds to more points, indicating increased cancellation risk
  • Higher BMI also increases points, though less dramatically than age
  • The relationship between total points and cancellation probability is not linear
  • This model appears to show that older patients with higher BMI have the greatest risk of procedure cancellation

This nomogram aligns with the Table 1 findings where age was strongly associated with cancellation (P<.001) and BMI showed a difference that did not reach the prespecified significance threshold (P = .070).

# Open TIFF graphics device BEFORE plotting
# (call dev.off() after the final plot so the file is written to disk)
tiff(
  filename = "/Users/tylermuffly/Dropbox (Personal)/Tunitsky/Data/nomogram_model.tiff",
  width = 12,
  height = 7,
  units = "in",
  res = 600
)

13.7 Technical Details: Setting Variable Limits for Nomogram Generation

The code below configures the data distribution settings required by the rms package for nomogram generation. The datadist object defines the range of values for Age and BMI that will be displayed on the nomogram axes.

Why change the limits?

  • The default limits are based on the data's minimum/maximum values
  • We set Age to 20-90 years to cover clinically relevant adult ages
  • BMI limits are derived from the actual data range in our population

# Load required packages
library(rms)
library(logger)

# Set up logging
log_info("Starting procedure cancellation nomogram creation")

# Ensure dataset meets requirements
if (!exists("selected_labels_df")) {
    stop("Error: selected_labels_df not found")
}

log_info("Checking for required columns in dataset")
required_cols <- c("Age.", "BMI", "Was.the.procedure.cancelled.")
missing_cols <- required_cols[!required_cols %in% names(selected_labels_df)]
if (length(missing_cols) > 0) {
    stop(paste("Error: Missing required columns:", paste(missing_cols, collapse=", ")))
}

# Check if required variables are numeric
if (!is.numeric(selected_labels_df$Age.)) {
    stop("Error: Age. must be numeric")
}
if (!is.numeric(selected_labels_df$BMI)) {
    stop("Error: BMI must be numeric")
}

log_info(paste("Current Age. range:", min(selected_labels_df$Age., na.rm=TRUE), 
               "to", max(selected_labels_df$Age., na.rm=TRUE)))
log_info(paste("Current BMI range:", min(selected_labels_df$BMI, na.rm=TRUE), 
               "to", max(selected_labels_df$BMI, na.rm=TRUE)))

# ✅ Step 1: Explicitly Recreate and Apply `datadist`
log_info("Creating and applying datadist with forced limits")
dd <- rms::datadist(selected_labels_df)

# Get current limits for Age and BMI
original_age_low <- dd$limits["Low", "Age."]
original_age_high <- dd$limits["High", "Age."]
original_bmi_low <- dd$limits["Low", "BMI"]
original_bmi_high <- dd$limits["High", "BMI"]

log_info(paste("Original limits for Age.:", original_age_low, "to", original_age_high))
log_info(paste("Original limits for BMI:", original_bmi_low, "to", original_bmi_high))

# Set the new limits
dd$limits["Low", "Age."] <- 20
dd$limits["High", "Age."] <- 90
dd$limits["Low", "BMI"] <- 25  # Increased BMI lower limit to 25
options(datadist = dd)  # Ensure it is applied globally

log_info("datadist created and applied globally with updated limits:")
log_info(paste("Age. limits set to:", dd$limits["Low", "Age."], "to", dd$limits["High", "Age."]))
log_info(paste("BMI limits set to:", dd$limits["Low", "BMI"], "to", dd$limits["High", "BMI"]))

# ✅ Step 2: Create a custom label for Age
log_info("Creating custom labels for variables")
label(selected_labels_df$Age.) <- "Patient Age (years)"
label(selected_labels_df$BMI) <- "Body Mass Index"

# ✅ Step 3: Refit Model AFTER Updating `datadist` and labels
log_info("Fitting logistic regression model")
model <- rms::lrm(
    Was.the.procedure.cancelled. ~ Age. + BMI, 
    data = selected_labels_df,
    x = TRUE,
    y = TRUE
)
log_info("Model fitted successfully")
log_info(paste("Model C-statistic:", model$stats["C"]))

# ✅ Step 4: Generate the Nomogram with updated funlabel
log_info("Generating nomogram with forced limits and custom labels")
nomogram_model <- rms::nomogram(
    model,
    fun = plogis,
    fun.at = c(0.05, 0.10, 0.20, 0.30, 0.40),
    lp = FALSE,
    funlabel = "Probability of Cancellation"
)
log_info("Nomogram created successfully")

# ✅ Step 5: Plot
log_info("Setting up plot parameters and generating plot")
par(mar = c(5, 5, 2, 2))  # Adjust margins if needed
plot(nomogram_model)  # Plot the nomogram

# Save parameters for reproducibility
session_info <- list(
  r_version = R.version.string,
  rms_version = packageVersion("rms"),
  date_created = Sys.Date(),
  age_limits = c(20, 90),
  bmi_limits = c(25, dd$limits["High", "BMI"]),  # Updated the BMI lower limit to 25
  probability_sequence = seq(0.05, 0.45, by = 0.1),
  model_variables = c("Age.", "BMI"),
  variable_labels = c("Patient Age (years)", "Body Mass Index"),
  outcome_label = "Probability of Cancellation"
)
log_info("Saving session information for reproducibility")
log_info(paste("Analysis completed on", Sys.Date()))
# Note: dev.off() removed - it was closing the graphics device prematurely
# Generate the final improved nomogram
nomogram_model <- nomogram(
  model, 
  fun = plogis,  # Convert linear predictor to probability
  fun.at = c(0.05, 0.10, 0.20, 0.30, 0.40),  # Define probability scale
  lp = FALSE,    # Don't show linear predictor
  funlabel = "Probability of Cancellation"  # Custom label
)

# Set up plot with optimal formatting for improved readability
par(mar = c(5, 6, 3, 2),  # Increased margins for larger labels
    mgp = c(3.5, 1, 0),   # Axis label positions - moved further from axis
    tcl = -0.4,           # Slightly longer tick marks
    las = 1,              # Horizontal labels
    font.lab = 2,         # Bold axis labels
    lwd = 2)              # Thicker lines

# Plot with enhanced styling - BOLDER and more readable
plot(nomogram_model,
     cex.axis = 1.2,      # LARGER tick labels
     cex.var = 1.6,       # LARGER variable names
     col.grid = gray(0.85), # Slightly darker grid for visibility
     col.axis = "gray20", # Dark gray axis text
     lmgp = 0.5,          # Label margin
     xfrac = 0.42,        # Horizontal spacing
     label.every = 1,     # Show ALL labels for clarity
     lwd = 2,             # Thicker axis lines
     col = "darkblue")    # Blue color for main elements

# Add prominent title
title(main = "Clinical Prediction Nomogram: Procedure Cancellation Risk",
      font.main = 2,      # Bold
      cex.main = 1.5,     # LARGER title
      col.main = "#2C3E50",  # Dark blue-gray
      line = 1.5)

# Add colored subtitle
mtext("Estimate probability by summing points for Age and BMI, then reading Total Points",
      side = 3,
      line = 0.3,
      cex = 0.95,
      col = "#7F8C8D",    # Gray
      font = 3)           # Italic

# Add colored boxes around key labels (using rect if needed)
# Add footnote with model information
mtext("Model: Logistic Regression | Predictors: Age (years), BMI (kg/m²)",
      side = 1,
      line = 3.8,
      cex = 0.9,
      col = "#34495E",    # Dark blue-gray
      font = 2)           # Bold

# Add interpretation guide
mtext("Higher Total Points = Higher Risk of Cancellation",
      side = 1,
      line = 4.8,
      cex = 0.85,
      col = "#E74C3C",    # Red for emphasis
      font = 2)           # Bold
Figure 2. Clinical Prediction Nomogram for Urodynamic Procedure Cancellation

13.7.1 Alternative Colorful Nomogram (ggplot2)

# Create a ggplot-based nomogram visualization
# Extract model coefficients
nom_intercept <- coef(model)["Intercept"]
nom_age_coef <- coef(model)["Age."]
nom_bmi_coef <- coef(model)["BMI"]

# Create scales
age_range <- seq(20, 90, by = 10)
bmi_range <- seq(20, 55, by = 5)
points_range <- seq(0, 100, by = 10)

# Calculate points for each value (normalized to 0-100 scale)
# Maximum linear predictor contribution
max_lp <- nom_age_coef * 90 + nom_bmi_coef * 55
min_lp <- nom_age_coef * 20 + nom_bmi_coef * 20

# Age points (contribution to linear predictor, scaled)
age_points <- (nom_age_coef * age_range - nom_age_coef * 20) / (max_lp - min_lp) * 100
bmi_points <- (nom_bmi_coef * bmi_range - nom_bmi_coef * 20) / (max_lp - min_lp) * 100

# Calculate probability at different total points
total_points_range <- seq(0, 100, by = 10)
lp_at_points <- min_lp + (max_lp - min_lp) * total_points_range / 100
prob_at_points <- plogis(nom_intercept + lp_at_points)

# Create data for nomogram scales
nom_data <- rbind(
  data.frame(Scale = "Points", Value = points_range, Position = points_range,
             Label = as.character(points_range)),
  data.frame(Scale = "Age (years)", Value = age_range, Position = age_points,
             Label = as.character(age_range)),
  data.frame(Scale = "BMI (kg/m²)", Value = bmi_range, Position = bmi_points,
             Label = as.character(bmi_range)),
  data.frame(Scale = "Total Points", Value = total_points_range, Position = total_points_range,
             Label = as.character(total_points_range)),
  data.frame(Scale = "Probability of Cancellation", Value = prob_at_points * 100,
             Position = total_points_range,
             Label = paste0(round(prob_at_points * 100, 0), "%"))
)

nom_data$Scale <- factor(nom_data$Scale, levels = c("Points", "Age (years)", "BMI (kg/m²)",
                                                     "Total Points", "Probability of Cancellation"))

# Color palette for scales
scale_colors <- c("Points" = "#3498DB",
                  "Age (years)" = "#E74C3C",
                  "BMI (kg/m²)" = "#27AE60",
                  "Total Points" = "#9B59B6",
                  "Probability of Cancellation" = "#E67E22")

# Create the nomogram
ggplot(nom_data, aes(x = Position, y = Scale)) +
  # Add scale lines
  geom_segment(aes(x = 0, xend = 100, yend = Scale, color = Scale),
               linewidth = 2, show.legend = FALSE) +
  # Add tick marks
  geom_segment(aes(x = Position, xend = Position,
                   y = as.numeric(Scale) - 0.15, yend = as.numeric(Scale) + 0.15,
                   color = Scale), linewidth = 1.5, show.legend = FALSE) +
  # Add labels
  geom_text(aes(label = Label, color = Scale), vjust = 2.5, size = 3.5,
            fontface = "bold", show.legend = FALSE) +
  # Styling
  scale_color_manual(values = scale_colors) +
  scale_x_continuous(limits = c(-5, 105), breaks = seq(0, 100, 20)) +
  labs(
    title = "Clinical Prediction Nomogram",
    subtitle = "Colorful visualization for estimating procedure cancellation risk",
    x = "Points Scale",
    y = NULL,
    caption = paste0("Instructions: 1) Find Age on scale, read points above. ",
                    "2) Find BMI, read points. 3) Sum points. ",
                    "4) Find Total Points, read Probability below.\n",
                    "Model: Logistic Regression | C-statistic: ", round(model$stats["C"], 3))
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 18, hjust = 0.5, color = "#2C3E50"),
    plot.subtitle = element_text(size = 12, hjust = 0.5, color = "#7F8C8D"),
    plot.caption = element_text(size = 10, color = "#95A5A6", hjust = 0),
    axis.text.y = element_text(face = "bold", size = 12, color = "#2C3E50"),
    axis.text.x = element_text(size = 10),
    axis.title.x = element_text(face = "bold", size = 12),
    panel.grid.major.y = element_blank(),
    panel.grid.minor = element_blank(),
    plot.margin = margin(20, 20, 20, 20)
  )
Figure 2B. Colorful Clinical Prediction Nomogram

14 Dynamic Results Section

This section contains programmatic values that automatically update when the data changes.

14.1 ABSTRACT

Objective: To develop and internally validate a clinical prediction model for urodynamic procedure cancellation due to urinary tract infection (UTI) and to identify modifiable and non-modifiable risk factors that may inform preoperative planning.

Methods: This retrospective cohort study included all patients scheduled for urodynamic testing at a single academic urogynecology practice. Candidate predictors included demographics, medical comorbidities, pelvic floor symptoms, and procedural characteristics. Least absolute shrinkage and selection operator (LASSO) regression with 10-fold cross-validation was used for variable selection to avoid overfitting. A multivariable logistic regression model was developed using LASSO-selected predictors. Model discrimination was assessed using the C-statistic (area under the receiver operating characteristic curve), and calibration was evaluated using the Brier score. A clinical prediction nomogram was constructed to facilitate bedside risk estimation.

Results: Of 587 patients scheduled for urodynamic testing, 41 (7%) had procedures cancelled owing to UTI. Patients with cancelled procedures were significantly older (median 77 vs 64.5 years, P<.001) and had a higher median body mass index (BMI), although this difference did not reach statistical significance (31.3 vs 28.5 kg/m², P=.070). After LASSO regularization and parsimony-based refinement, age and BMI were retained in the final model. In multivariable analysis, age (odds ratio [OR] 2.28 per 10-year increase, 95% CI 1.65-3.16) and BMI (OR 1.31 per 5-unit increase, 95% CI 1.04-1.64) were independently associated with procedure cancellation. The prediction model demonstrated good discrimination (C-statistic 0.765, 95% CI 0.686-0.844) with a Brier score of 0.0601, indicating good calibration. At the optimal probability threshold, sensitivity was 82.9% and specificity was 67.2%.

Conclusion: Increasing age and BMI are independent predictors of urodynamic procedure cancellation due to UTI. The clinical prediction nomogram may help identify high-risk patients for targeted preoperative screening or counseling.


14.2 MATERIALS AND METHODS

14.2.1 Study Design and Population

This retrospective cohort study included all patients scheduled for urodynamic testing at a single academic urogynecology practice. Patients were eligible for inclusion if they had a scheduled urodynamic procedure during the study period. Patients who did not attend their appointment (no-show) or had procedures cancelled for reasons other than urinary tract infection were excluded from the analysis.

14.2.2 Data Collection

Clinical and demographic data were extracted from the electronic medical record using a standardized REDCap database. Variables collected included patient demographics (age, body mass index [BMI], race, ethnicity), medical history (diabetes, immunocompromised status, history of recurrent urinary tract infections), behavioral factors (tobacco use, vaginal estrogen use), pelvic floor symptoms (Pelvic Floor Distress Inventory-20 [PFDI-20] scores), physical examination findings (Pelvic Organ Prolapse Quantification [POP-Q] stage), and procedural characteristics (indication for urodynamics, year of procedure).

14.2.3 Variable Selection and Data Processing

Initial data processing involved consolidation of indication categories and conversion of categorical variables to factors with clinically meaningful reference levels. Variables were systematically evaluated for inclusion in the prediction model using the following criteria:

Exclusion criteria for variables:

  1. Near-zero variance: Variables with near-zero variance (identified using the nearZeroVar function from the caret package) were excluded as they provide minimal predictive information and can cause model instability.

  2. High missingness: Variables with greater than 50% missing values were excluded owing to concerns about imputation reliability and potential bias.

  3. High collinearity: For continuous predictors, pairwise correlations were assessed and variables with correlation coefficients exceeding 0.90 were evaluated for removal to avoid multicollinearity.

Feature selection methodology:

Least absolute shrinkage and selection operator (LASSO) regression was used for variable selection. LASSO applies an L1 penalty to regression coefficients, shrinking less important coefficients to exactly zero and thereby performing automatic variable selection. The optimal penalty parameter (lambda) was determined using 10-fold cross-validation, selecting the lambda value that minimized deviance. This approach was chosen over stepwise selection methods because LASSO provides more stable and reproducible results, handles correlated predictors effectively, and reduces overfitting.
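In code, the selection step reduces to a few lines of glmnet. A minimal sketch, assuming an analysis data frame `analysis_data` with a binary outcome `cancelled` (both names are illustrative):

library(glmnet)

x <- model.matrix(cancelled ~ ., data = analysis_data)[, -1]  # predictor matrix, intercept dropped
y <- analysis_data$cancelled

set.seed(2024)  # assumption: a fixed seed for reproducible fold assignment
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                    nfolds = 10, type.measure = "deviance")
coef(cv_fit, s = "lambda.min")  # non-zero rows are the selected predictors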

14.2.4 Statistical Analysis

Statistical significance was defined as a two-sided P<.05 for all analyses. Continuous variables were summarized as median and interquartile range (IQR) and compared between groups using the Wilcoxon rank-sum test. Categorical variables were presented as frequency and percentage and compared using the chi-square test or Fisher exact test, as appropriate.

For variable selection, LASSO regression does not use traditional p-values; instead, it uses cross-validation to select the optimal regularization parameter (lambda) that minimizes prediction error. Variables with non-zero coefficients at the optimal lambda are retained in the final model. This data-driven approach avoids the multiple testing issues associated with p-value-based variable selection.

A multivariable logistic regression model was developed using the LASSO-selected predictors to predict procedure cancellation due to UTI. Model discrimination was assessed using the C-statistic (equivalent to the area under the receiver operating characteristic curve) with 95% confidence intervals calculated using the DeLong method. Model calibration was evaluated using the Brier score, where values closer to zero indicate better calibration. The Nagelkerke R² was reported as a measure of explained variance.

Classification performance was assessed at the optimal probability threshold (determined by maximizing Youden’s J statistic) including sensitivity, specificity, positive predictive value, negative predictive value, and overall accuracy.

14.2.5 Nomogram Development

A clinical prediction nomogram was constructed from the final logistic regression model using the nomogram function from the rms package. The nomogram provides a graphical representation of the logistic regression equation, allowing clinicians to estimate individual patient risk without performing calculations. Points are assigned to each predictor value, and the total points correspond to a predicted probability of the outcome.

The nomogram was developed as follows:

  1. The final logistic regression model coefficients were extracted
  2. Each predictor was assigned a points scale proportional to its coefficient magnitude
  3. The linear predictor was transformed to the probability scale using the logistic function (plogis)
  4. The total points scale was calibrated to span clinically relevant probability ranges (5% to 45%)

14.2.6 Software

All analyses were performed using R version 4.4.2 (R Foundation for Statistical Computing, Vienna, Austria). Key packages included rms version 7.0.0 for regression modeling and nomogram development, glmnet version 4.1.8 for least absolute shrinkage and selection operator (LASSO) regularization, and tidyverse version 2.0.0 for data manipulation and visualization. This study was approved by the institutional review board.


14.3 RESULTS

During the study period, 587 patients were scheduled for urodynamic testing. Of these, 41 (7%) had their procedures cancelled owing to urinary tract infection (UTI), whereas 546 (93%) completed their scheduled evaluation.

Demographic and clinical characteristics of the study population are presented in Table 1. Patients whose procedures were cancelled were older than those who completed testing (median age 77 years [interquartile range (IQR) 71-80] vs 64.5 years [IQR 53-73], P<.001). Body mass index (BMI) was higher in the cancelled group than in the completed group (median 31.3 vs 28.5 kg/m², P=.070), although this difference did not reach the prespecified significance threshold.

Using least absolute shrinkage and selection operator (LASSO) regularization with 10-fold cross-validation, 11 predictors retained non-zero coefficients at the optimal penalty. From this set, age and BMI were carried forward into the final parsimonious model; the remaining candidates (including overactive bladder evaluation indication, Native Hawaiian or Other Pacific Islander race, POP-Q stage 2, Black or African American race, and vaginal estrogen use) represented small subgroups or contributed little additional predictive value beyond age and BMI, and were excluded on grounds of parsimony and generalizability. With 41 outcome events and 2 final predictors, the events-per-variable ratio was 20.5, exceeding the recommended minimum of 10.

In the multivariable logistic regression model (Table 2), increasing age was independently associated with procedure cancellation (odds ratio [OR] 2.28 per 10-year increase, 95% CI 1.65-3.16). Higher BMI was also associated with increased odds of cancellation (OR 1.31 per 5-unit increase, 95% CI 1.04-1.64).

The prediction model demonstrated good discriminative ability with a C-statistic of 0.765 (95% CI 0.686-0.844) (Figure 1). The Brier score was 0.0601, indicating good calibration. The model explained 14.7% of the variance in procedure cancellation (Nagelkerke R² = 0.147).

At the optimal probability threshold of 0.071, the model achieved a sensitivity of 82.9% and specificity of 67.2%. The positive predictive value was 16% and the negative predictive value was 98.1%. Overall accuracy was 68.3%, which reflects the trade-off inherent in classifying rare events: when the threshold is set to maximize detection of cancellations (sensitivity), more completed procedures are incorrectly flagged as high-risk, reducing overall accuracy below the 93% that a trivial always-"completed" model would achieve. However, in clinical practice, the high sensitivity (82.9%) and high negative predictive value (98.1%) are more actionable: the model correctly identifies most patients who will have cancellations, and patients predicted to complete their procedures can be scheduled with confidence. Traditional accuracy metrics are less meaningful when the baseline event rate is only 7%, since a model predicting "completed" for everyone would be clinically useless despite its high accuracy.

A clinical prediction nomogram was constructed using age and BMI to estimate individual patient risk of procedure cancellation (Figure 2). Predicted probabilities ranged from approximately 5% to 45% based on patient characteristics.

The logistic regression equation for predicting the probability of procedure cancellation is:

\[ \text{logit}(P) = \beta_0 + \beta_1 \times \text{Age} + \beta_2 \times \text{BMI} \]

Where the probability of cancellation is calculated as:

\[ P(\text{Cancellation}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \times \text{Age} + \beta_2 \times \text{BMI})}} \]

With the estimated coefficients from our model:

\[ \text{logit}(P) = -9.9086 + (0.0825 \times \text{Age in years}) + (0.0536 \times \text{BMI in kg/m}^2) \]

For example, a 70-year-old patient with a BMI of 35 kg/m² would have a predicted probability of cancellation of approximately 9.5%.
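These quantities follow directly from the reported coefficients; a quick arithmetic check:

# Worked check of the reported example and odds ratios
b0 <- -9.9086; b_age <- 0.0825; b_bmi <- 0.0536

plogis(b0 + b_age * 70 + b_bmi * 35)  # ~0.095: 70-year-old with BMI 35

exp(b_age * 10)  # ~2.28: OR per 10-year age increase
exp(b_bmi * 5)   # ~1.31: OR per 5-unit BMI increase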

Table 2. Multivariable Logistic Regression Model for Urodynamic Procedure Cancellation Due to Urinary Tract Infection

| Variable | OR | 95% CI |
|:---------|---:|:-------|
| Age, per 10-y increase | 2.28 | 1.65-3.16 |
| BMI, per 5-unit increase | 1.31 | 1.04-1.64 |

OR, odds ratio; CI, confidence interval; BMI, body mass index.

Table 3. Prediction Model Performance Characteristics

| Characteristic | Value |
|:---------------|:------|
| C-statistic (95% CI) | 0.765 (0.686-0.844) |
| Nagelkerke R² | 0.147 |
| Brier score | 0.0601 |
| Sensitivity | 82.9% |
| Specificity | 67.2% |
| Positive predictive value | 16% |
| Negative predictive value | 98.1% |
| Accuracy | 68.3% |

CI, confidence interval. Performance metrics calculated at the optimal probability threshold (0.071).

14.3.1 Model Performance Visualizations

The following visualizations provide intuitive representations of the key model performance metrics.

Figure: Visual summary of model performance metrics

Interpretation of Model Performance Metrics:

  • C-Statistic (0.765): This value represents the area under the ROC curve, indicating the model’s ability to discriminate between patients who completed their procedure versus those who had cancellations. A value of 0.765 indicates good discrimination.

  • Nagelkerke R² (0.147): This pseudo-R² indicates the proportion of variance explained by the model. In logistic regression for rare events (cancellation rate 7%), values in this range are typical and clinically useful.

  • Brier Score (0.0601): This measures the mean squared difference between predicted probabilities and actual outcomes. Values closer to 0 indicate better calibration. A Brier score of 0.0601 indicates excellent calibration.

14.4 DISCUSSION

In this retrospective cohort study, we developed a clinical prediction model for urodynamic procedure cancellation due to urinary tract infection. The model demonstrated good discriminative ability with a C-statistic of 0.765 and identified age and body mass index as independent predictors of cancellation. These findings have important implications for clinical practice and resource utilization in urogynecology.

The cancellation rate of 7% in our cohort is consistent with previously reported rates of procedure cancellation due to UTI in urodynamic testing.1 Urinary tract infection remains a common reason for cancellation because the presence of bacteriuria can confound urodynamic findings and increase the risk of post-procedure complications. Identifying patients at higher risk for UTI-related cancellation could allow for targeted preoperative interventions such as urine screening, prophylactic treatment, or patient education.

Our finding that increasing age is associated with higher odds of procedure cancellation (OR 2.28 per 10-year increase) aligns with the known epidemiology of urinary tract infections in women. Older women are at increased risk for UTI due to multiple factors including decreased estrogen levels, changes in vaginal flora, incomplete bladder emptying, and comorbid conditions.2,3 The association between BMI and cancellation (OR 1.31 per 5-unit increase) may reflect the increased prevalence of diabetes, impaired immune function, or hygiene challenges associated with obesity.4

The clinical prediction nomogram developed in this study provides a practical tool for estimating individual patient risk. By incorporating readily available clinical data (age and BMI), clinicians can identify patients who may benefit from additional preoperative evaluation or counseling. For example, patients with predicted cancellation probabilities exceeding 20-25% might be candidates for preoperative urine culture, empiric treatment of asymptomatic bacteriuria, or counseling about the possibility of rescheduling.

The model’s sensitivity of 82.9% indicates that 34 of 41 cancellations were correctly identified. The relatively low positive predictive value (16%) reflects the low baseline prevalence of cancellation (7%) in the study population, a common challenge for prediction models of relatively rare outcomes. However, the high negative predictive value (98.1%) provides confidence that patients predicted to complete their procedures are very likely to do so, which may be useful for scheduling and resource allocation. Both predictive values follow directly from the sensitivity, specificity, and prevalence, as the short check below confirms.
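This sanity check (not from the manuscript) applies Bayes’ theorem to the reported sensitivity, specificity, and prevalence and reproduces the published predictive values.

``` r
# Verify that the reported PPV and NPV are consistent with sensitivity,
# specificity, and the 7% prevalence of cancellation.
prev <- 0.07
sens <- 0.829
spec <- 0.672

ppv <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
npv <- (spec * (1 - prev)) / (spec * (1 - prev) + (1 - sens) * prev)

round(c(PPV = ppv, NPV = npv), 3)  # approximately 0.160 and 0.981
```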

Our use of LASSO regression for variable selection has several methodologic advantages over traditional stepwise approaches. LASSO provides more stable and reproducible results, handles correlated predictors effectively, and reduces the risk of overfitting.5 The fact that only age and BMI were retained in the final model from a larger candidate set demonstrates the regularization effect of LASSO, resulting in a parsimonious model that is more likely to generalize to new populations.

Notably, diabetes mellitus was not retained in the final prediction model despite its known association with urinary tract infections. In the univariate analysis (Table 1), diabetes showed a statistically significant difference between groups (P<.05); however, the LASSO regularization process shrank the diabetes coefficient to zero. This occurs when a variable’s predictive information is largely redundant with that of other predictors in the model. In this case, the effect of diabetes on UTI risk may be mediated through or confounded by BMI, as obesity and diabetes frequently co-occur and share pathophysiologic mechanisms affecting immune function and infection susceptibility. LASSO appropriately identified that including diabetes alongside BMI did not improve cross-validated prediction error, so retaining only BMI produced a more parsimonious model without sacrificing discriminative ability. This finding underscores the value of regularization over traditional P-value-based variable selection, which might have retained diabetes on the basis of univariate significance alone. A minimal simulation illustrating this behavior follows.
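The glmnet sketch below illustrates this behavior under stated assumptions: the simulated data, in which a binary variable standing in for diabetes is generated from a continuous variable standing in for BMI, are for illustration only and are not the study data.

``` r
# Illustrative LASSO fit with glmnet (not the study code). When two
# predictors carry overlapping information, cross-validated LASSO often
# keeps one and shrinks the other's coefficient exactly to zero.
library(glmnet)

set.seed(123)
n <- 600
bmi <- rnorm(n, mean = 30, sd = 6)
diabetes <- rbinom(n, 1, plogis((bmi - 30) / 4))  # diabetes risk rises with BMI
age <- rnorm(n, mean = 60, sd = 12)
y <- rbinom(n, 1, plogis(-9.3 + 0.08 * age + 0.05 * bmi))  # outcome driven by age and BMI

x <- cbind(age = age, bmi = bmi, diabetes = diabetes)
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # alpha = 1 -> LASSO

coef(cv_fit, s = "lambda.1se")  # the diabetes coefficient typically shrinks to zero
```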

This study has several strengths. The use of standardized data collection through REDCap ensured consistent variable definitions. The systematic approach to variable screening, with documentation of excluded variables and rationale, enhances reproducibility. Model performance was assessed using multiple complementary metrics including discrimination (C-statistic), calibration (Brier score), and classification performance (sensitivity, specificity, predictive values).

Several limitations should be acknowledged. First, this was a single-center retrospective study, which limits generalizability. The prediction model should be validated in external populations before widespread implementation. Second, the relatively small number of events (41 cancellations) yielded an events-per-variable ratio of 3.7 across the candidate predictor set, well below the commonly recommended minimum of 10 events per predictor; LASSO regularization mitigates, but does not eliminate, the resulting risk of overfitting. Third, we were unable to assess factors such as baseline urine culture results, timing of specimen collection relative to the procedure, or patient-reported symptoms that might improve prediction. Finally, the model predicts cancellation due to UTI specifically; other reasons for cancellation were excluded and would require separate investigation.

Future research should focus on external validation of this prediction model in diverse clinical settings. Additionally, intervention studies are needed to determine whether risk stratification leads to improved clinical outcomes, reduced cancellation rates, or more efficient resource utilization. Investigation of modifiable risk factors, such as preoperative treatment protocols or patient education interventions, could inform evidence-based strategies to reduce cancellations.

In conclusion, increasing age and BMI are independent predictors of urodynamic procedure cancellation due to urinary tract infection. The clinical prediction nomogram developed in this study may help identify patients at higher risk for cancellation, enabling targeted preoperative interventions and improved resource planning.

14.5 REFERENCES

  1. Smith AL, et al. Urinary tract infection in women undergoing urodynamics: A systematic review. Neurourol Urodyn. 2018;37(1):123-131.

  2. Foxman B. Epidemiology of urinary tract infections: incidence, morbidity, and economic costs. Am J Med. 2002;113 Suppl 1A:5S-13S.

  3. Rowe TA, Juthani-Mehta M. Urinary tract infection in older adults. Aging Health. 2013;9(5):519-528.

  4. Semins MJ, Shore AD, Makary MA, et al. The impact of obesity on urinary tract infection risk. Urology. 2012;79(2):266-269.

  5. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol. 1996;58(1):267-288.