# Importing and re-naming the data
Data <- read_excel("Data.xlsx")

1 Preface

1.1 Executive Summary

This final project is based on the original published paper by Turan and colleagues: Season and Vitamin D Status do not Affect Probability for Surgical Site Infection after Colorectal Surgery. I aim to replicate the entire study, including the methodology and findings of this paper with the data I have access to. A brief summary of the published paper is described below. I will organize the project by each stage of the PPDAC(D) framework to replicate and answer the main research questions. Note, I was not provided the raw data-set, so I attempted to replicate the study as best as I could with the provided data.

Background: Many patients develop surgical site infections (SSI) after colorectal surgery. This can be a result of multiple factors, however, the patient’s immune system is important for prevention of an SSI. Seasonal weather and Vitamin D levels are known to impact the immune system in the general population, however, research in the surgical population remains scarce, and there are no studies within patients after colorectal surgery. Therefore, this study aims to determine the effects of season on the development of an SSI in patients after colorectal surgery. Secondarily, the study aims to examine the impact of vitamin D levels on SSI developments in this population.

Methodology: The investigators conducted a retrospective cohort study of patients who underwent colorectal surgery at the Cleveland Clinic Main Campus (June 2010 to March 2012), with IRB approval. Perioperative exposures, covariates, and outcomes were obtained from the Cleveland Clinic Perioperative Health Documentation System database, with SSI captured by the colorectal surgery wound infection registry. Patients younger than 18, those with ASA physical status greater than 4, and those with missing covariates were excluded. Season of surgery was defined using four seasonal windows, and SSI (any SSI) was the primary outcome. Vitamin D (25(OH)D) was evaluated in a secondary analysis among patients with a measurement from 3 months before to 1 month after surgery.

Results: In total, 2,919 colorectal surgeries were included in the final analytic cohort. Overall, 241 patients developed an SSI (7.7%). Crude SSI incidence varied across seasons, with the lowest rate in spring (6.8%) and the highest rate in summer (9.9%), while fall (7.3%) and winter (8.2%) were intermediate. However, after multivariable adjustment for patient and operative characteristics, there was no clear evidence of a meaningful seasonal difference in SSI risk. Perioperative vitamin D concentration was not associated with SSI.

Conclusions: In this cohort, SSI risk after colorectal surgery did not differ meaningfully by season after accounting for patient and surgical factors. Perioperative vitamin D was not associated with SSI, but vitamin D data were sparse, limiting inference for that secondary question.

2 The Problem

2.1 Background

Colorectal surgery is associated with surgical site infections (SSI); it occurs in approximately 11% to 28% of all patients. SSI complications are a significant clinical concern as it adversely affects the length of hospital stays, reduces a patient’s quality of life, and increases overall healthcare costs. Furthermore, government and private agencies increasingly use infection rates as a quality of care measure to improve surgical care and reduce infections. Given the clinical relevance of SSI developments in patients after colorectal surgery, it is important to investigate factors associated with their occurrence.

Although many factors play a role in infections, the patient’s immune system is crucial for prevention. Seasonal weather and Vitamin D (25-hydroxyvitamin D) levels are known to affect the immune system in the general population, as infections are generally more common in the winter. This is likely associated with lower concentration of vitamin D in patients during the winter; patients with vitamin D deficiency (<25 nmol/L or <10 ng/mL) have higher percentages of risk for infections. Vitamin D stimulates the expression of potent antimicrobial peptides, which exist in neutrophils, monocytes, natural killer cells, and in epithelial cells, where they play a major role in protecting the body from infections.

Research shows that patients undergoing surgical procedures for peptic ulcer perforation, cardiac surgery, and resections for non-small-cell lung carcinoma, have worse morbidity and mortality outcomes in the winter compared to summer.

While most studies have examined the impact of season on patients in the general population, very few have focused on surgical patients, and no study in the literature investigates the potential association between seasons and wound infections in patients after colorectal surgery.

2.2 Objectives and Hypothesis

Therefore, this study primarily examines whether season changes the risk of getting a surgical site infection (SSI) in patients after colorectal surgery.

The primary objective of this study is to compare the SSI rates between patients who had surgery in the winter and those who had surgery in the summer. The secondary objectives are to compare the SSI rates between patients in the remaining seasons, and to compare patients with high Vitamin D levels and those with low vitamin D levels.

Hypothesis: We hypothesize that patients having colorectal surgery in the winter will have more SSI’s than those having surgery in the summer. Similarly, we hypothesize that patients with low Vitamin D levels will have more SSI’s after colorectal surgery than patients with higher levels.

2.3 Units and Target Population

The units in this study are the individual patients aged 18 years or older who had colorectal surgery at the Cleveland Clinic Main Campus from June 2010 to March 2012. The target population encompasses all adult patients undergoing colorectal surgery.

2.4 Problem Aspect

The study objective has a causative aspect; we are trying to see if a change in one factor (the season, Vitamin D) causes a change in the outcome (SSI development) while adjusting for the other variables.

2.5 Population Attributes

The goal is to determine whether there is a difference, and the degree of difference, in SSI development in the target population based on the season the surgery took place (or Vitamin D levels).

The numerical attributes we will be using include Odds Ratios (OR) to compare the odds of SSI between groups. This refers to the specific odds of infection for the entire population during different seasons (or Vitamin D levels).

The graphical attributes we will be using include a bar graph representing the overall pattern of infection rates across the four seasons of the year.

2.6 Study Variates

The response variate (Y) is whether a patient developed an SSI. The primary explanatory variate (X1) is the season (winter, spring, summer, fall) the surgery took place. The secondary explanatory variable is the patient’s Vitamin D concentration (in ng/mL). Finally, the baseline and surgical covariates are all the other important patient details we must take into consideration (e.g., age, BMI, surgery length). Refer to Table 1 for detailed list of the study variables.

2.7 Table 1: Summary of the Study Variates

# Create a data frame for Table 1
variate_data <- data.frame(
  Variable = c("Surgical site infection (SSI)", "Season", "Vitamin D levels", "Age", "Female", "Race", "BMI", 
               "American Society of Anesthesiologists (ASA) Physical Status", "Diabetes", "Chronic Renal Failure", "Preoperative use of Steroids", 
               "Emergency case", "Duration of Surgery", "RBC transfusion"),
  Role = c("Response (Y)", "Primary Explanatory (X1)", "Secondary Explanatory (X2)", 
           "Baseline Covariate", "Baseline Covariate", "Baseline Covariate", "Baseline Covariate", 
           "Baseline Covariate", "Baseline Covariate", "Baseline Covariate", "Surgical Covariate", 
           "Surgical Covariate", "Surgical Covariate", "Surgical Covariate"),
  Type = c("Binary", "Nominal", "Continuous", "Continuous", "Binary", "Nominal", "Continuous", 
           "Ordinal", "Binary", "Binary", "Binary", 
           "Binary", "Continuous", "Continuous"),
  Description = c("Surgical site infection (Yes, No)", 
                  "Season of surgery (Fall, Winter, Spring, Summer)", 
                  "Serum Vitamin D concentration (ng/mL)", 
                  "Patient's age (years)", 
                  "Patient is female (Yes, No)", 
                  "Patient's race (White, Black, Other)", 
                  "Body Mass Index (kg/m^2)",
                  "Physical status score (1 = normal healthy patient, 2 = mild systemic disease, 3 = severe systemic disease, 4 = severe systemic disease
(constant threat to life)", 
                  "History of diabetes (Yes, No)", 
                  "Chronic renal failure (Yes, No)", 
                  "Preoperative steroid use (Yes, No)",
                  "Emergency surgery case (Yes, No)", 
                  "Length of the surgery (hours)", 
                  "Red blood cell transfusion (mL)")
)

# Generate the table
kbl(variate_data, col.names = c("Variable", "Role", "Type", "Description"),
    caption = "<b> Table 1. </b>: Summary of Study Variates") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "bordered", "condensed"),
                full_width = F) %>%
  column_spec(1:4, border_left = T, border_right = T) %>%
  row_spec(0, bold = T, color = "white", background = "#4E79A7")
Table 1. : Summary of Study Variates
Variable Role Type Description
Surgical site infection (SSI) Response (Y) Binary Surgical site infection (Yes, No)
Season Primary Explanatory (X1) Nominal Season of surgery (Fall, Winter, Spring, Summer)
Vitamin D levels Secondary Explanatory (X2) Continuous Serum Vitamin D concentration (ng/mL)
Age Baseline Covariate Continuous Patient’s age (years)
Female Baseline Covariate Binary Patient is female (Yes, No)
Race Baseline Covariate Nominal Patient’s race (White, Black, Other)
BMI Baseline Covariate Continuous Body Mass Index (kg/m^2)
American Society of Anesthesiologists (ASA) Physical Status Baseline Covariate Ordinal Physical status score (1 = normal healthy patient, 2 = mild systemic disease, 3 = severe systemic disease, 4 = severe systemic disease (constant threat to life)
Diabetes Baseline Covariate Binary History of diabetes (Yes, No)
Chronic Renal Failure Baseline Covariate Binary Chronic renal failure (Yes, No)
Preoperative use of Steroids Surgical Covariate Binary Preoperative steroid use (Yes, No)
Emergency case Surgical Covariate Binary Emergency surgery case (Yes, No)
Duration of Surgery Surgical Covariate Continuous Length of the surgery (hours)
RBC transfusion Surgical Covariate Continuous Red blood cell transfusion (mL)

3 The Plan

This is a retrospective cohort study examining records of patients who underwent colorectal surgery at the Cleveland Clinic Main Campus between June 2010 and March 2012. Therefore, the “planning” involves identifying the correct patient records within the existing hospital database and defining how to categorize the data appropriately to test the key hypotheses. We do not manipulate the subjects but instead we use measured values extracted from existing records.

3.1 Study Population

The study population is the group of patients who had colorectal surgery at the Cleveland Clinic Main Campus between June 2010 and March 2012. We will extract perioperative and outcome records from the Cleveland Clinic Perioperative Health and Documentation Status (date of surgery, vitamin D measurements, baseline and surgical covariates, SSI, SSI type). We will also cross-reference this data with the Colorectal Surgery Wound Infection Registry to confirm outcomes (SSI, SSI type).

Exclusion criteria: Patients under 18 years of age, those with an ASA physical status greater than 4, and those with missing covariate information will be excluded from the study.

3.2 Measuring the Variates

Primary analysis: The primary explanatory variable (season) will be determined by the timing of the surgery date; Spring is defined as March 21–June 20, Summer as June 21–September 22, Fall as September 23–December 20 and Winter as December 21–March 20. Within this, the primary comparison will be to determine the difference (if any) between SSI’s in winter and summer. The remaining seasonal comparisons will be a secondary comparison.

Secondary analysis: The secondary explanatory variable (Vitamin D levels) will be measured in ng/mL. We will include measurements taken between 3 months before and 1 month after the surgery. Vitamin D sufficiency is defined as \(\ge 32~ng/mL\). Inadequacy (\(< 32~ng/mL\)) is subdivided into insufficiency (\(20\) to \(< 32~ng/mL\)) and deficiency (\(< 20~ng/mL\)).

The response variate (SSI) for each analysis will be measured as a binary outcome (Yes/No) and will include organ, deep, and superficial infections in patients after colorectal surgery.

3.3 Dealing with Explanatory Variates

We will divide patients by the season they had colorectal surgery in, and then measure and report the 11 covariates (descriptives). We will perform crude comparisons between the SSI rates across the seasons and vitamin D levels and graph the results. Then, for our primary and secondary analyses we will perform a multivariable logistic regression and adjust for all these covariates to isolate the effects between the explanatory variables and the outcome. The assumptions were assessed for this model in the Appendix II. Any patient with missing information for these 11 variables will be excluded to maintain a high-quality data set.

3.4 Statistical Significance

We will use a significance cutoff (\(\alpha\)) for the primary comparison (winter vs summer) of 0.025, while the other 5 seasonal comparisons (e.g., spring vs summer) will have a cutoff of 0.005. This is because we are testing multiple seasons at once. When running multiple tests at once with a standard significance cutoff of 0.05 for each test, there is an increased risk of Type I Error or a false positive (claiming a difference exists when it actually doesn’t). This is because the risk of getting at least one false positive (5% x 6) adds up. This design ensures that the overall study risk for a false positive stays at 5% (0.025 + (5 x 0.005) = 0.05 or 5%). For the secondary analysis, the original paper does not explicitly state an \(\alpha\) value, therefore, we will use the standard cutoff of 0.05. This secondary analysis only uses one test.

3.5 Sample Size and Power Calculation

We plan to have a total of 1638 patients who underwent colorectal surgery in winter (708) or summer (930), and observed incidence of 10% of patients having surgery in summer experienced infection, which will give us 90% power to detect an odds ratio of 0.54 or less at the 0.05 significance level (\(\alpha\)). This indicates that the study is powered to detect a relatively large seasonal difference in SSI odds (46% less in summer than winter), but smaller differences may go undetected.

3.6 Data Collection and Quality Protocol

We monitor the values to ensure the quality and integrity of our data. We ensure our inclusion and exclusion criteria are applied correctly and our data reflects that. We flag for impossible or implausible values (e.g., negative ages, BMI outside realistic bounds, dates out of order). We ensure consistency by ensuring related variables align (e.g., procedure type matches site). We ensure there are no duplication of data (e.g., patient ID). In this case, we exclude any missing data but it is worth noting and keeping track of these patterns (Appendix I). It is also worth noting any outliers in the data (Appendix II). Then, we describe the data and ensure a large enough sample.

The original study obtained Cleveland Clinic Foundation IRB approval. The study addresses several key STROBE reporting components, but formal STROBE adherence cannot be confirmed because the investigators do not explicitly state this.

Data extraction and file structuring was handled using SAS software version 9.3 for the original study, which is replicated using R version 2025.05.1+513 in this project.

4 The Data

df0 <- Data %>%
  transmute(
    Age             = Age,
    Female          = Female,
    Race            = Race,
    BMI             = BMI,
    ASA             = ASAstatus,
    Diabetes        = Diabetes,
    RenalFailure    = ChronicRenalFailure,
    Steroids        = PreopSteroids,
    Emergency       = Emergency,
    DurationSurgery = DurationSurgery,
    VitaminD        = VitaminD,
    Season_num      = Season,
    RBC             = RBC,
    SSI             = SSI
  )

# Define the variables required for the main season analysis.
# Any missing value in these variables will lead to exclusion from the main analytic cohort.
main_vars <- c(
  "Age", "Female", "Race", "BMI", "ASA", "Diabetes",
  "RenalFailure", "Steroids", "Emergency", "DurationSurgery",
  "Season_num", "RBC", "SSI"
)

# Save the original sample size before exclusions.
n_start <- nrow(df0)

# Exclusion 1: remove patients younger than 18.
df_age <- df0 %>%
  filter(Age >= 18)
n_age_excluded <- n_start - nrow(df_age)

# Exclusion 2: remove patients with ASA > 4.
df_asa <- df_age %>%
  filter(ASA <= 4)
n_asa_excluded <- nrow(df_age) - nrow(df_asa)

# Exclusion 3: remove patients with missing values in variables
# needed for the main season analysis.
n_missing_excluded <- df_asa %>%
  filter(!if_all(all_of(main_vars), ~ !is.na(.))) %>%
  nrow()

# Create the final analytic cohort for the main analysis.
# Re-code variables into factors for tables and regression models.
analytic_df <- df_asa %>%
  filter(if_all(all_of(main_vars), ~ !is.na(.))) %>%
  mutate(
    Season = factor(
      Season_num,
      levels = c(3, 4, 1, 2),
      labels = c("Spring", "Summer", "Fall", "Winter")
    ),
    Race_f = factor(
      Race,
      levels = c(1, 2, 3),
      labels = c("White", "Black", "Others")
    ),
    Female_f = factor(
      Female,
      levels = c(0, 1),
      labels = c("Male", "Female")
    ),
    ASA_f = factor(ASA, levels = c(1, 2, 3, 4)),
    Diabetes_f = factor(Diabetes, levels = c(0, 1), labels = c("No", "Yes")),
    RenalFailure_f = factor(RenalFailure, levels = c(0, 1), labels = c("No", "Yes")),
    Steroids_f = factor(Steroids, levels = c(0, 1), labels = c("No", "Yes")),
    Emergency_f = factor(Emergency, levels = c(0, 1), labels = c("No", "Yes")),
    SSI_f = factor(SSI, levels = c(0, 1), labels = c("No", "Yes"))
  )

# Save the final analytic cohort size.
n_final <- nrow(analytic_df)

# Count how many patients in the final cohort have a vitamin D value available.
n_vitd_available <- analytic_df %>%
  filter(!is.na(VitaminD)) %>%
  nrow()

# Create the vitamin D complete-case subset used for the vitamin D regression model.
# This is stricter than "vitamin D available" because the vitamin D model also needs complete covariate data in the final modeling data-set.
vitd_model_df <- analytic_df %>%
  filter(!is.na(VitaminD)) %>%
  select(
    SSI, VitaminD, Age, Female_f, Race_f, BMI, ASA_f,
    Steroids_f, Diabetes_f, RenalFailure_f, Emergency_f,
    DurationSurgery, RBC
  ) %>%
  drop_na()

n_vitd_complete <- nrow(vitd_model_df)

# Create a simple cohort summary table to report exclusions and counts.
cohort_summary <- tibble(
  Step = c(
    "Starting sample",
    "Excluded: Age < 18",
    "Excluded: ASA > 4",
    "Excluded: missing main-analysis variables",
    "Final analytic cohort",
    "Vitamin D available in final cohort",
    "Vitamin D complete-case subset for vitamin D model"
  ),
  N = c(
    n_start,
    n_age_excluded,
    n_asa_excluded,
    n_missing_excluded,
    n_final,
    n_vitd_available,
    n_vitd_complete
  )
)

# Print the cohort summary. 
cohort_summary
## # A tibble: 7 × 2
##   Step                                                   N
##   <chr>                                              <int>
## 1 Starting sample                                     2919
## 2 Excluded: Age < 18                                     0
## 3 Excluded: ASA > 4                                      0
## 4 Excluded: missing main-analysis variables              0
## 5 Final analytic cohort                               2919
## 6 Vitamin D available in final cohort                   62
## 7 Vitamin D complete-case subset for vitamin D model    62

In this stage, we have the data and are monitoring the data quality, and examining the data before we begin the regression modelling. The following data is for informational purposes only (descriptive).

4.1 Data Monitoring

The original study started with 2,985 patients that had undergone colorectal surgery from June 2010 to March 2012, however, after applying the exclusion criteria, 2,919 patients were chosen for the study ((1) the data set I was provided). When divided by the season the surgeries took place, the samples sizes were: Spring (n=400), Summer (n=930), Fall (n=881), Winter (n=708). Furthermore, only 62 of the 2,919 patients had a vitamin D measurement in the defined perioperative window. The timing of the measurement was typically close to surgery with a median difference of 11 days (IQR - 8 to 50) relative to the procedure date (1). The data was checked and quality was good (note, this was of the data-set I was provided not the raw data-set from the original study, Appendix I).

4.2 Data Examination

Overall, the demographic characteristics of the patients were similar across the seasons (Table 2), however, the number of emergency cases differed slightly. 241 of the 2,919 patients experienced an SSI including 125 organ, 11 deep and 126 superficial infections. The observed (descriptive) SSI incidence differed by seasons: 6.8% (spring), 9.9% (summer), 7.3% (fall), 8.2% (winter). Figure 1 presents this seasonal SSI incidence rate. The median vitamin D level was 24.5 ng/mL (IQR 18 to 34 ng/mL) and was similar across the seasons (Figure 2).

4.3 Table 2: Baseline Characteristics by Season

# Define season order and save counts by season.
seasons_lvls <- c("Spring", "Summer", "Fall", "Winter")
n_season <- table(analytic_df$Season)

# Format continuous variables as mean ± SD.
fmt_mean_sd <- function(x) {
  x <- na.omit(x)
  sprintf("%.0f ± %.0f", mean(x), sd(x))
}

# Format skewed continuous variables as median [Q1, Q3].
fmt_med_iqr <- function(x) {
  x <- na.omit(x)
  q <- quantile(x, c(0.25, 0.5, 0.75))
  sprintf("%.0f [%.0f, %.0f]", q[2], q[1], q[3])
}

# Format categorical variables as n (%).
fmt_n_pct <- function(n, total) {
  sprintf("%d (%.0f%%)", n, 100 * n / total)
}

# Format p-values consistently for tables.
fmt_p <- function(p) {
  ifelse(p < 0.001, "<0.001", sprintf("%.3f", p))
}

# Build one row of Table 2.
build_row <- function(label, vals, pval = "", indent = FALSE) {
  lbl <- if (indent) paste0("\u00a0\u00a0\u00a0\u00a0", label) else label
  data.frame(
    Variable = lbl,
    Spring = vals[1],
    Summer = vals[2],
    Fall = vals[3],
    Winter = vals[4],
    p_Value = pval,
    stringsAsFactors = FALSE,
    check.names = FALSE
  )
}

# Helper for approximately normal continuous variables.
add_anova_row <- function(data, var, label) {
  vals <- sapply(seasons_lvls, function(s) fmt_mean_sd(data[[var]][data$Season == s]))
  p <- summary(aov(reformulate("Season", response = var), data = data))[[1]][["Pr(>F)"]][1]
  build_row(label, vals, fmt_p(p))
}

# Helper for skewed continuous variables.
add_kw_row <- function(data, var, label) {
  vals <- sapply(seasons_lvls, function(s) fmt_med_iqr(data[[var]][data$Season == s]))
  p <- kruskal.test(reformulate("Season", response = var), data = data)$p.value
  build_row(label, vals, fmt_p(p))
}

# Helper for binary categorical variables.
add_binary_row <- function(data, var, label) {
  vals <- sapply(seasons_lvls, function(s) {
    sub <- data[data$Season == s, ]
    fmt_n_pct(sum(sub[[var]] == 1, na.rm = TRUE), nrow(sub))
  })
  p <- chisq.test(table(data$Season, data[[var]]))$p.value
  build_row(label, vals, fmt_p(p))
}
# Create an empty list that will store each row of the baseline table.
rows <- list()

# Add rows for continuous and binary variables.
rows$age <- add_anova_row(analytic_df, "Age", "Age, years")
rows$gender <- add_binary_row(analytic_df, "Female", "Gender (female), n (%)")

# Add Race as a grouped categorical section with header + subrows.
race_p <- chisq.test(table(analytic_df$Season, analytic_df$Race_f))$p.value
rows$race_hdr <- build_row("Race, n (%)", rep("", 4), fmt_p(race_p))
for (lv in levels(analytic_df$Race_f)) {
  rv <- sapply(seasons_lvls, function(s) {
    sub <- analytic_df[analytic_df$Season == s, ]
    fmt_n_pct(sum(sub$Race_f == lv, na.rm = TRUE), nrow(sub))
  })
  rows[[paste0("race_", lv)]] <- build_row(lv, rv, "", indent = TRUE)
}

# Add BMI.
rows$bmi <- add_anova_row(analytic_df, "BMI", "BMI (kg/m²)")

# Add ASA as a grouped categorical section with header + subrows.
asa_p <- chisq.test(table(analytic_df$Season, analytic_df$ASA_f))$p.value
rows$asa_hdr <- build_row("ASA status, n (%)", rep("", 4), fmt_p(asa_p))
for (lv in levels(analytic_df$ASA_f)) {
  av <- sapply(seasons_lvls, function(s) {
    sub <- analytic_df[analytic_df$Season == s, ]
    fmt_n_pct(sum(sub$ASA_f == lv, na.rm = TRUE), nrow(sub))
  })
  rows[[paste0("asa_", lv)]] <- build_row(lv, av, "", indent = TRUE)
}

# Add remaining baseline and surgical variables.
rows$diabetes  <- add_binary_row(analytic_df, "Diabetes", "Diabetes, n (%)")
rows$renal     <- add_binary_row(analytic_df, "RenalFailure", "Chronic renal failure, n (%)")
rows$steroids  <- add_binary_row(analytic_df, "Steroids", "Usage of steroids, n (%)")
rows$emergency <- add_binary_row(analytic_df, "Emergency", "Emergency, n (%)")
rows$duration  <- add_kw_row(analytic_df, "DurationSurgery", "Duration of surgery (hours)")
rows$rbc       <- add_kw_row(analytic_df, "RBC", "RBC transfusion (mL)")
rows$ssi       <- add_binary_row(analytic_df, "SSI", "Surgical site infection (SSI), n (%)")

# Combine all rows into one table.
tbl_baseline <- bind_rows(rows)

# Label the columns with sample size by season.
colnames(tbl_baseline) <- c(
  "Variable",
  sprintf("Spring<br><small>(n=%d)</small>", n_season["Spring"]),
  sprintf("Summer<br><small>(n=%d)</small>", n_season["Summer"]),
  sprintf("Fall<br><small>(n=%d)</small>", n_season["Fall"]),
  sprintf("Winter<br><small>(n=%d)</small>", n_season["Winter"]),
  "<i>p</i>-Value"
)

# Save row indices for styling grouped rows.
indent_idx <- which(startsWith(tbl_baseline$Variable, "\u00a0"))
header_idx <- which(tbl_baseline$Variable %in% c("Race, n (%)", "ASA status, n (%)"))
ssi_idx <- nrow(tbl_baseline)

# Print Table 2.
tbl_baseline %>%
  kbl(
    escape = FALSE,
    row.names = FALSE,
    caption = paste0(
      "<b>Table 2.</b> Demographics, preoperative and intraoperative characteristics by season",
      "<sup>a</sup> (n=", n_final, ")"
    )
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width = TRUE
  ) %>%
  row_spec(0, background = "#2E75B6", color = "white", bold = TRUE,
           extra_css = "text-align: center;") %>%
  column_spec(1, width = "30%") %>%
  column_spec(2:5, width = "14%", extra_css = "text-align: center;") %>%
  column_spec(6, width = "8%", extra_css = "text-align: center;") %>%
  row_spec(indent_idx, italic = TRUE, extra_css = "background-color: #f7f9fd;") %>%
  row_spec(header_idx, bold = TRUE, extra_css = "background-color: #dce6f1;") %>%
  row_spec(ssi_idx, bold = TRUE, extra_css = "border-top: 2px solid #2E75B6;") %>%
  footnote(
    general = paste0(
      "Summary statistics presented as number (%) of patients, mean ± SD, or median [Q1, Q3]. ",
      "ASA = American Society of Anesthesiologists; BMI = body mass index; RBC = red blood cell."
    ),
    alphabet = c(
      "Spring: March 21 to June 20; Summer: June 21 to September 22; Fall: September 23 to December 20; Winter: December 21 to March 20.",
      "Pearson Chi-squared test, unless specified.",
      "Analysis of variance (ANOVA) used for Age and BMI.",
      "Kruskal-Wallis test used for Duration of surgery and RBC transfusion."
    ),
    general_title = "",
    alphabet_title = ""
  )
Table 2. Demographics, preoperative and intraoperative characteristics by seasona (n=2919)
Variable Spring
(n=400)
Summer
(n=930)
Fall
(n=881)
Winter
(n=708)
p-Value
Age, years 53 ± 17 52 ± 17 53 ± 17 52 ± 16 0.549
Gender (female), n (%) 190 (48%) 465 (50%) 446 (51%) 391 (55%) 0.059
Race, n (%) 0.773
    White 357 (89%) 853 (92%) 799 (91%) 638 (90%)
    Black 31 (8%) 53 (6%) 59 (7%) 53 (7%)
    Others 12 (3%) 24 (3%) 23 (3%) 17 (2%)
BMI (kg/m²) 27 ± 6 27 ± 6 27 ± 7 27 ± 6 0.992
ASA status, n (%) 0.465
    1 13 (3%) 28 (3%) 30 (3%) 21 (3%)
    2 172 (43%) 478 (51%) 438 (50%) 348 (49%)
    3 197 (49%) 392 (42%) 381 (43%) 314 (44%)
    4 18 (4%) 32 (3%) 32 (4%) 25 (4%)
Diabetes, n (%) 70 (18%) 180 (19%) 181 (21%) 148 (21%) 0.514
Chronic renal failure, n (%) 26 (6%) 36 (4%) 42 (5%) 30 (4%) 0.196
Usage of steroids, n (%) 31 (8%) 81 (9%) 74 (8%) 70 (10%) 0.619
Emergency, n (%) 29 (7%) 33 (4%) 37 (4%) 30 (4%) 0.024
Duration of surgery (hours) 4 [2, 5] 3 [2, 5] 3 [2, 5] 3 [2, 5] 0.059
RBC transfusion (mL) 0 [0, 0] 0 [0, 0] 0 [0, 0] 0 [0, 0] 0.126
Surgical site infection (SSI), n (%) 27 (7%) 92 (10%) 64 (7%) 58 (8%) 0.131
Summary statistics presented as number (%) of patients, mean ± SD, or median [Q1, Q3]. ASA = American Society of Anesthesiologists; BMI = body mass index; RBC = red blood cell.
a Spring: March 21 to June 20; Summer: June 21 to September 22; Fall: September 23 to December 20; Winter: December 21 to March 20.
b Pearson Chi-squared test, unless specified.
c Analysis of variance (ANOVA) used for Age and BMI.
d Kruskal-Wallis test used for Duration of surgery and RBC transfusion.

4.4 Figure 1: SSI Incidence by Season

# Calculate the crude SSI rate by season.
fig_ssi_season <- analytic_df %>%
  group_by(Season) %>%
  summarise(
    n = n(),
    rate = mean(SSI) * 100,
    .groups = "drop"
  ) %>%
  mutate(Season = factor(Season, levels = seasons_lvls))

# Define colors for the seasons.
season_colors <- c(
  Spring = "#70AD47",
  Summer = "#ED7D31",
  Fall   = "#4472C4",
  Winter = "#A9D18E"
)

# Plot crude SSI incidence by season.
ggplot(fig_ssi_season, aes(x = Season, y = rate, fill = Season)) +
  geom_col(width = 0.6, color = "black", linewidth = 0.4) +
  geom_text(
    aes(label = sprintf("%.1f%%", rate)),
    vjust = -0.6, size = 3.8, fontface = "bold"
  ) +
  scale_fill_manual(values = season_colors) +
  scale_y_continuous(
    limits = c(0, 13),
    breaks = seq(0, 12, 2),
    labels = function(x) paste0(x, "%"),
    expand = expansion(mult = c(0, 0.05))
  ) +
  labs(
    x = "Season",
    y = "Surgical Site Infection Rate (%)"
  ) +
  theme_classic(base_size = 12) +
  theme(
    legend.position = "none",
    axis.title = element_text(face = "bold"),
    panel.grid.major.y = element_line(color = "grey85", linetype = "dashed")
  )

Figure 1. Incidence of surgical site infection (SSI) by season (n = 2919). Spring: March 21 to June 20; Summer: June 21 to September 22; Fall: September 23 to December 20; Winter: December 21 to March 20.

4.5 Figure 2: Vitamin D Concentrations by Season

# Restrict to patients with a vitamin D value available.
vitd_df <- analytic_df %>%
  filter(!is.na(VitaminD)) %>%
  mutate(Season = factor(Season, levels = c("Winter", "Spring", "Summer", "Fall")))

# Calculate overall vitamin D summary statistics for the figure caption.
vitd_q   <- quantile(vitd_df$VitaminD, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
vitd_med <- round(vitd_q[2], 1)
vitd_q1  <- round(vitd_q[1], 0)
vitd_q3  <- round(vitd_q[3], 0)

# Define colors for the seasons.
season_colors2 <- c(
  Winter = "#4472C4",
  Spring = "#70AD47",
  Summer = "#ED7D31",
  Fall   = "#A9D18E"
)

# Plot vitamin D concentrations by season.
ggplot(vitd_df, aes(x = Season, y = VitaminD, fill = Season)) +
  geom_boxplot(
    width = 0.5,
    outlier.shape = 18,
    outlier.size = 2,
    alpha = 0.85
  ) +
  stat_summary(
    fun = median, geom = "point",
    shape = 18, size = 3, color = "black"
  ) +
  scale_fill_manual(values = season_colors2) +
  coord_cartesian(ylim = c(0, 75)) +
  scale_y_continuous(breaks = seq(0, 60, 20)) +
  labs(
    x = "Season",
    y = "Vitamin D concentration (ng/mL)"
  ) +
  theme_classic(base_size = 12) +
  theme(
    legend.position = "none",
    axis.title = element_text(face = "bold"),
    panel.grid.major.y = element_line(color = "grey85", linetype = "dashed")
  )

Figure 2. Season-dependent vitamin D concentrations. Values were assessed as described in Methods. Box plots show median, interquartile range, and outliers. n = 62 patients with available measurements. Overall median [Q1, Q3]: 24.5 [18, 33] ng/mL.

5 The Analysis

The primary analysis (season and SSI) showed no statistical significance overall (no overall seasonal effect) after adjustment for the covariates (p=0.08).

5.1 Overall Seasonal Effect

# Fit the full multivariable logistic regression model for SSI.
# This model includes Season plus all covariates.
mod_season_full <- glm(
  SSI ~ Season + Age + Female_f + Race_f + BMI + ASA_f +
    Steroids_f + Diabetes_f + RenalFailure_f + Emergency_f +
    DurationSurgery + RBC,
  data = analytic_df,
  family = binomial(link = "logit")
)

# Fit the null model without Season.
mod_season_null <- glm(
  SSI ~ Age + Female_f + Race_f + BMI + ASA_f +
    Steroids_f + Diabetes_f + RenalFailure_f + Emergency_f +
    DurationSurgery + RBC,
  data = analytic_df,
  family = binomial(link = "logit")
)

# Use a likelihood ratio test to assess the overall effect of Season.
lrt_season <- anova(mod_season_null, mod_season_full, test = "LRT")
overall_p <- lrt_season$`Pr(>Chi)`[2]
overall_p
## [1] 0.07966193

There were no statistically significant seasonal pairwise differences either (Table 3). The estimated odds ratio was 1.29 (97.5% confidence interval (CI): 0.86, 1.93) for having surgery in summer as compared with winter (p=0.159). Although insignificant, the biggest difference in infection was between summer (9.9%) and spring (6.8%) (OR (99.5% CI): 0.60 (0.30, 1.13); p=0.03).

5.2 Table 3: Primary Analysis: Seasons and SSI

# Define the six planned pairwise seasonal comparisons.
comparisons <- list(
  list(tgt = "Spring", ref = "Summer", sig = 0.005, primary = FALSE),
  list(tgt = "Spring", ref = "Fall",   sig = 0.005, primary = FALSE),
  list(tgt = "Spring", ref = "Winter", sig = 0.005, primary = FALSE),
  list(tgt = "Summer", ref = "Fall",   sig = 0.005, primary = FALSE),
  list(tgt = "Summer", ref = "Winter", sig = 0.025, primary = TRUE),
  list(tgt = "Fall",   ref = "Winter", sig = 0.005, primary = FALSE)
)

# Run each pairwise comparison by changing the reference season.
pairwise_results <- lapply(comparisons, function(comp) {
  df_tmp <- analytic_df %>%
    mutate(Season_pw = relevel(Season, ref = comp$ref))

  mod_pw <- glm(
    SSI ~ Season_pw + Age + Female_f + Race_f + BMI + ASA_f +
      Steroids_f + Diabetes_f + RenalFailure_f + Emergency_f +
      DurationSurgery + RBC,
    data = df_tmp,
    family = binomial(link = "logit")
  )

  res <- tidy(
    mod_pw,
    exponentiate = TRUE,
    conf.int = TRUE,
    conf.level = 1 - comp$sig
  ) %>%
    filter(term == paste0("Season_pw", comp$tgt))

  tibble(
    Comparison = paste(comp$tgt, "versus", comp$ref),
    OR = round(res$estimate, 2),
    CI_low = round(res$conf.low, 2),
    CI_high = round(res$conf.high, 2),
    p = round(res$p.value, 3),
    primary = comp$primary
  )
})

# Build Table 3.
tbl_pairwise <- bind_rows(pairwise_results) %>%
  mutate(
    `Odds Ratio (CI)` = sprintf("%.2f (%.2f, %.2f)", OR, CI_low, CI_high),
    Comparison = ifelse(primary, paste0(Comparison, " \u2020"), Comparison)
  ) %>%
  select(Comparison, `Odds Ratio (CI)`, p)

primary_row <- which(grepl("\u2020", tbl_pairwise$Comparison))

# Print Table 3.
tbl_pairwise %>%
  kbl(
    col.names = c("Comparison", "Odds Ratio (CI)\u1d43", "<i>p</i>\u1d43"),
    escape = FALSE,
    caption = paste0(
      "<b>Table 3.</b> Association between seasons<sup>a</sup> and surgical site infection ",
      "for colorectal surgical patients (n=", n_final, ")"
    )
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width = FALSE,
    position = "left"
  ) %>%
  row_spec(0, background = "#2E75B6", color = "white", bold = TRUE) %>%
  row_spec(primary_row, bold = TRUE, italic = TRUE, background = "#fff3cd") %>%
  footnote(
    alphabet = c(
      "Spring: March 21 to June 20; Summer: June 21 to September 22; Fall: September 23 to December 20; Winter: December 21 to March 20.",
      "The significance criterion was 0.025 for the primary comparison († Summer versus Winter, highlighted), and 0.005 for the other five comparisons."
    ),
    alphabet_title = ""
  )
Table 3. Association between seasonsa and surgical site infection for colorectal surgical patients (n=2919)
Comparison Odds Ratio (CI)ᵃ p
Spring versus Summer 0.60 (0.30, 1.13) 0.030
Spring versus Fall 0.86 (0.42, 1.66) 0.521
Spring versus Winter 0.78 (0.38, 1.53) 0.308
Summer versus Fall 1.42 (0.87, 2.32) 0.044
Summer versus Winter † 1.29 (0.86, 1.93) 0.159
Fall versus Winter 0.91 (0.53, 1.57) 0.619
a Spring: March 21 to June 20; Summer: June 21 to September 22; Fall: September 23 to December 20; Winter: December 21 to March 20.
b The significance criterion was 0.025 for the primary comparison († Summer versus Winter, highlighted), and 0.005 for the other five comparisons.

For the secondary analysis (vitamin D and SSI), vitamin D concentration was not associated with SSI. Note, due to the limited number of patients with available vitamin D measurements (n=62) and very few SSI events in this sub-sample, the logistic regression model was unstable (did not converge), which is consistent with the original paper’s wide confidence interval. The results should be interpreted with caution.

# Fit the secondary logistic regression model for vitamin D and SSI.
# This model uses only the vitamin D complete-case subset.
mod_vitd <- suppressWarnings(
  glm(
    SSI ~ VitaminD + Age + Female_f + Race_f + BMI + ASA_f +
      Steroids_f + Diabetes_f + RenalFailure_f + Emergency_f +
      DurationSurgery + RBC,
    data = vitd_model_df,
    family = binomial(link = "logit")
  )
)

# Print vitamin D counts and whether the model converged.
vitd_summary <- tibble(
  VitD_available_in_final_cohort = n_vitd_available,
  VitD_complete_cases_for_model = n_vitd_complete,
  Model_converged = mod_vitd$converged
)

vitd_summary
## # A tibble: 1 × 3
##   VitD_available_in_final_cohort VitD_complete_cases_for_model Model_converged
##                            <int>                         <int> <lgl>          
## 1                             62                            62 FALSE
  tidy(
    mod_vitd,
    exponentiate = TRUE,
    conf.int = TRUE,
    conf.level = 0.95
  ) %>%
    filter(term == "VitaminD")
## # A tibble: 1 × 7
##   term      estimate std.error statistic p.value conf.low conf.high
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 VitaminD 1.57e-192     9551.   -0.0462   0.963        0       Inf
# Only print the vitamin D estimate if the model converged.
if (!mod_vitd$converged) {
  cat("Vitamin D logistic regression did not converge. Estimates are unstable and should not be formally interpreted.\n")
} else {
  tidy(
    mod_vitd,
    exponentiate = TRUE,
    conf.int = TRUE,
    conf.level = 0.95
  ) %>%
    filter(term == "VitaminD")
}
## Vitamin D logistic regression did not converge. Estimates are unstable and should not be formally interpreted.

For interpretation, refer to The Discussion.

6 The Conclusion

The analysis suggests that SSI risk after colorectal surgery did not differ meaningfully by season in this single-center cohort, and perioperative vitamin D concentration was not associated with SSI. It is possible that this lack of association is due to unmeasured confounding variables. The major limitations include the exclusion of missing covariates, lack of generalization (single-center study), and the small sample size of the vitamin D analysis.

7 The Discussion

This study examines whether surgical site infections (SSI) after colorectal surgery vary by season, and whether perioperative vitamin D status helps explain any seasonal pattern. In our cohort, the crude SSI rate was highest in summer and lowest in spring, but after adjusting for patient and operative factors we did not find evidence that SSI odds differed significantly by season.

A seasonal increase in SSI is biologically plausible. Warmer months can change bacterial growth and profiles, and may increase skin exposure and contact that could influence contamination risk. However, operating rooms and inpatient environments are climate-controlled, which may reduce the practical impact of outdoor conditions. We also considered whether staffing turnover in summer could contribute, but this explanation is less convincing in settings where trainees rotate throughout the year rather than entering all at once in summer.

We also examined vitamin D because of its proposed immune and antimicrobial roles and because vitamin D levels can vary seasonally. In our data, vitamin D was only measured in a small subset of patients. Within that subset, vitamin D concentration was not associated with SSI, but the limited availability of measurements means this analysis was imprecise and should be interpreted cautiously. A null result here does not rule out a clinically relevant association, especially if vitamin D testing was selective and linked to patient characteristics that we could not fully capture.

Several limitations affect interpretation. This was a retrospective, single-center analysis in Cleveland, so it may not be as applicable to the target population of all colorectal surgical patients in North America (low external validity), and it may be subject to clustering (similar observations due to same clinical center). The season also varies across states and countries.

Residual confounding remains possible even after multivariable adjustment. Some important SSI predictors may not have been fully measured or consistently recorded. For example, my analysis could not adjust for type of surgery (colon vs rectal). I was not able to determine whether any patterns emerged when excluding patients with missing covariates. Therefore, although the research question is causative, it is difficult to make a definitive conclusion due to the nature of observational studies; they point towards association and not definitive causation.

Season was used as a proxy for seasonal weather and Vitamin D levels, rather than measuring those exact variables directly. The low number of patients with perioperative vitamin D data limits inference about vitamin D and makes it difficult to evaluate mediation of any seasonal effect.

Overall, the findings suggest that, in this setting, SSI risk after colorectal surgery is more strongly influenced by patient factors and operative characteristics than by the season in which surgery occurs. The descriptive summer elevation in SSI rates may still be worth monitoring, but it does not support a season-specific prevention strategy on its own. A more informative next step would be larger, prospective, multicentre studies across different climates and practice settings, ideally with more complete perioperative vitamin D measurement, to test whether these results generalise and to better quantify any modest seasonal effects.

8 Appendix I

  1. Turan OA, Babazade R, Eshraghi Y, You J, Turan A, Remzi F. Season and vitamin D status do not affect probability for surgical site infection after colorectal surgery. Eur Surg. 2015;47:341-345. doi:10.1007/s10353-015-0360-6.

My project was limited to information within this published work and the data provided. Because I did not have access to the full raw data set (database), I was not able to accurately comment on data quality. I am unsure how many patients were excluded because of specific variables and if any patterns exist. I tried to fully replicate the work, however, I modified a few things because of data constraints.

9 Appendix IIa

Here I am assessing the assumptions for using a logistic regression model.

# Check for exact duplicate rows in the analytic dataset.
sum(duplicated(analytic_df))
## [1] 0

No exact duplicate rows were identified, which suggests there was no obvious duplicate data entry; however, true independence of observations could not be formally verified because the data-set did not contain patient, surgeon, or team identifiers.

# Create a helper function to assess linearity in the logit
# for continuous predictors using quartiles.
check_logit <- function(data, var) {
  tmp <- data %>%
    mutate(group = ntile(.data[[var]], 4)) %>%
    group_by(group) %>%
    summarise(
      mean_x = mean(.data[[var]], na.rm = TRUE),
      n = n(),
      events = sum(SSI == 1, na.rm = TRUE),
      p = mean(SSI == 1, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    mutate(emp_logit = log((events + 0.5) / (n - events + 0.5)))

  print(tmp)

  ggplot(tmp, aes(x = mean_x, y = emp_logit)) +
    geom_point() +
    geom_line() +
    labs(x = var, y = "Empirical log-odds of SSI")
}

# Check each continuous predictor used in the model.
check_logit(analytic_df, "Age")
## # A tibble: 4 × 6
##   group mean_x     n events      p emp_logit
##   <int>  <dbl> <int>  <int>  <dbl>     <dbl>
## 1     1   30.4   730     52 0.0712     -2.56
## 2     2   47.0   730     65 0.0890     -2.32
## 3     3   59.0   730     57 0.0781     -2.46
## 4     4   73.9   729     67 0.0919     -2.28

check_logit(analytic_df, "BMI")
## # A tibble: 4 × 6
##   group mean_x     n events      p emp_logit
##   <int>  <dbl> <int>  <int>  <dbl>     <dbl>
## 1     1   20.0   730     48 0.0658     -2.64
## 2     2   24.2   730     63 0.0863     -2.35
## 3     3   28.1   730     57 0.0781     -2.46
## 4     4   35.7   729     73 0.100      -2.19

check_logit(analytic_df, "DurationSurgery")
## # A tibble: 4 × 6
##   group mean_x     n events      p emp_logit
##   <int>  <dbl> <int>  <int>  <dbl>     <dbl>
## 1     1   1.34   730     27 0.0370     -3.24
## 2     2   2.82   730     44 0.0603     -2.74
## 3     3   4.00   730     76 0.104      -2.15
## 4     4   6.17   729     94 0.129      -1.91

check_logit(analytic_df, "RBC")
## # A tibble: 4 × 6
##   group mean_x     n events      p emp_logit
##   <int>  <dbl> <int>  <int>  <dbl>     <dbl>
## 1     1     0    730     56 0.0767     -2.48
## 2     2     0    730     48 0.0658     -2.64
## 3     3     0    730     60 0.0822     -2.41
## 4     4   322.   729     77 0.106      -2.13

Linearity in the logit for continuous predictors was assessed by dividing each variable into quartiles and examining the empirical log-odds of SSI across groups. Age and BMI showed only mild fluctuations across quartiles and were considered acceptable as simple linear terms. Duration of surgery showed a clear monotonic increase in empirical log-odds across quartiles, supporting the linearity assumption. In contrast, RBC transfusion was highly zero-inflated, with the first three quartiles having a mean of 0 and only the top quartile showing higher values, so the linearity assumption for RBC as a continuous term was less convincing.

# Plot Cook's distance to identify influential observations.
plot(cooks.distance(mod_season_full), type = "h", main = "Cook's distance")
abline(h = 4 / nrow(analytic_df), col = "red", lty = 2)

# Print the largest Cook's distance values.
cd <- cooks.distance(mod_season_full)
sort(cd, decreasing = TRUE)[1:10]
##         623          35         543        1552        2786        1550 
## 0.013235893 0.011540550 0.010811490 0.010346630 0.010203515 0.010084640 
##        1815         928        1785        2432 
## 0.009933807 0.009833197 0.009697193 0.009380704
max(cd)
## [1] 0.01323589

Cook’s distance and influence measures identified some potentially influential observations, but no single observation appeared to dominate the model (max cd is very small).

# Check multicollinearity among predictors using GVIF.
vif(mod_season_full)
##                     GVIF Df GVIF^(1/(2*Df))
## Season          1.021463  3        1.003546
## Age             1.330043  1        1.153275
## Female_f        1.011196  1        1.005582
## Race_f          1.026238  2        1.006496
## BMI             1.075237  1        1.036936
## ASA_f           1.553232  3        1.076150
## Steroids_f      1.049567  1        1.024484
## Diabetes_f      1.147676  1        1.071297
## RenalFailure_f  1.127000  1        1.061603
## Emergency_f     1.123488  1        1.059947
## DurationSurgery 1.261484  1        1.123158
## RBC             1.262719  1        1.123708

Multicollinearity was not a concern, as all adjusted GVIF values were close to 1.

# Generate predicted probabilities from the main model.
phat <- predict(mod_season_full, type = "response")

# Run the Hosmer-Lemeshow goodness-of-fit test.
hoslem.test(analytic_df$SSI, phat, g = 10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  analytic_df$SSI, phat
## X-squared = 7.0366, df = 8, p-value = 0.5327
# Plot the ROC curve and calculate AUC.
roc_obj <- roc(analytic_df$SSI, phat)
plot(roc_obj)

auc(roc_obj)
## Area under the curve: 0.6691

Finally, the Hosmer-Lemeshow goodness-of-fit test was non-significant (X² = 7.04, df = 8, p = 0.533), indicating no reliable evidence of poor model fit, although this does not prove the model is perfect.

10 Appendix IIb

Overall, the logistic model looked acceptable for this project: age, BMI, and especially surgery duration behaved reasonably, multi-collinearity was low, and there was no strong evidence of poor fit, but RBC was the weakest continuous term because it was highly zero-heavy.

Therefore, I have chosen to run a sensitivity analysis here where RBC is re-coded as a categorical variable (0 vs any transfusion) and the multivariable regression is repeated to check for any differences.

# Recode RBC as a binary variable: no transfusion vs any transfusion.
analytic_df_sens <- analytic_df %>%
  mutate(
    RBC_any = factor(
      ifelse(RBC > 0, "Any transfusion", "No transfusion"),
      levels = c("No transfusion", "Any transfusion")
    )
  )

# Fit the sensitivity model using RBC_any.
mod_rbc_any_full <- glm(
  SSI ~ Season + Age + Female_f + Race_f + BMI + ASA_f +
    Steroids_f + Diabetes_f + RenalFailure_f + Emergency_f +
    DurationSurgery + RBC_any,
  data = analytic_df_sens,
  family = binomial(link = "logit")
)

# Fit the sensitivity null model without Season.
mod_rbc_any_null <- glm(
  SSI ~ Age + Female_f + Race_f + BMI + ASA_f +
    Steroids_f + Diabetes_f + RenalFailure_f + Emergency_f +
    DurationSurgery + RBC_any,
  data = analytic_df_sens,
  family = binomial(link = "logit")
)

# Test the overall effect of Season in the sensitivity model.
lrt_rbc_any <- anova(mod_rbc_any_null, mod_rbc_any_full, test = "LRT")
overall_p_rbc_any <- lrt_rbc_any$`Pr(>Chi)`[2]
overall_p_rbc_any
## [1] 0.07672236

The p-value in the original methodology was 0.08; therefore, the results do not differ significantly.

# Repeat the six pairwise seasonal comparisons using RBC_any instead of continuous RBC.
pairwise_results_rbc_any <- lapply(comparisons, function(comp) {
  df_tmp <- analytic_df_sens %>%
    mutate(Season_pw = relevel(Season, ref = comp$ref))

  mod_pw <- glm(
    SSI ~ Season_pw + Age + Female_f + Race_f + BMI + ASA_f +
      Steroids_f + Diabetes_f + RenalFailure_f + Emergency_f +
      DurationSurgery + RBC_any,
    data = df_tmp,
    family = binomial(link = "logit")
  )

  res <- tidy(
    mod_pw,
    exponentiate = TRUE,
    conf.int = TRUE,
    conf.level = 1 - comp$sig
  ) %>%
    filter(term == paste0("Season_pw", comp$tgt))

  tibble(
    Comparison = paste(comp$tgt, "versus", comp$ref),
    OR = round(res$estimate, 2),
    CI_low = round(res$conf.low, 2),
    CI_high = round(res$conf.high, 2),
    p = round(res$p.value, 3),
    primary = comp$primary
  )
})

# Print sensitivity pairwise results.
bind_rows(pairwise_results_rbc_any) %>%
  mutate(
    `Odds Ratio (CI)` = sprintf("%.2f (%.2f, %.2f)", OR, CI_low, CI_high),
    Comparison = ifelse(primary, paste0(Comparison, " \u2020"), Comparison)
  ) %>%
  select(Comparison, `Odds Ratio (CI)`, p)
## # A tibble: 6 × 3
##   Comparison             `Odds Ratio (CI)`     p
##   <chr>                  <chr>             <dbl>
## 1 Spring versus Summer   0.60 (0.30, 1.12) 0.028
## 2 Spring versus Fall     0.85 (0.42, 1.64) 0.501
## 3 Spring versus Winter   0.77 (0.38, 1.52) 0.297
## 4 Summer versus Fall     1.41 (0.87, 2.32) 0.045
## 5 Summer versus Winter † 1.29 (0.87, 1.94) 0.158
## 6 Fall versus Winter     0.91 (0.53, 1.57) 0.625

Similarly, the OR and CI do not differ significantly from the original methodology.

Thus, the sensitivity analysis provides similar results to the main analysis.

11 Appendix III

# Tidy the main model results and format them for presentation.
tbl_model <- tidy(
  mod_season_full,
  exponentiate = TRUE,
  conf.int = TRUE,
  conf.level = 0.95
) %>%
  mutate(
    OR = round(estimate, 2),
    Lower = round(conf.low, 2),
    Upper = round(conf.high, 2),
    p = round(p.value, 3)
  ) %>%
  filter(term != "(Intercept)") %>%
  filter(!grepl("^Season", term)) %>%
  select(term, OR, Lower, Upper, p) %>%
  rename(
    Covariate = term,
    `95% CI Lower` = Lower,
    `95% CI Upper` = Upper
  )

# Add Season as one overall row using the LRT p-value.
season_row <- tibble(
  Covariate = "Season (overall)",
  OR = NA_real_,
  `95% CI Lower` = NA_real_,
  `95% CI Upper` = NA_real_,
  p = round(overall_p, 3)
)

# Combine and format the final regression table.
tbl_model_final <- bind_rows(season_row, tbl_model) %>%
  mutate(
    OR = ifelse(is.na(OR), "—", as.character(OR)),
    `95% CI Lower` = ifelse(is.na(`95% CI Lower`), "—", as.character(`95% CI Lower`)),
    `95% CI Upper` = ifelse(is.na(`95% CI Upper`), "—", as.character(`95% CI Upper`))
  )

# Print Table 4.
tbl_model_final %>%
  kbl(
    caption = "<b>Table 4.</b> Multivariable logistic regression: predictors of SSI",
    escape = FALSE
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "bordered"),
    full_width = FALSE
  ) %>%
  row_spec(0, background = "#2E75B6", color = "white", bold = TRUE) %>%
  row_spec(1, bold = TRUE, background = "#dce6f1") %>%
  row_spec(which(suppressWarnings(as.numeric(tbl_model_final$p)) < 0.05), background = "#fff3cd")
Table 4. Multivariable logistic regression: predictors of SSI
Covariate OR 95% CI Lower 95% CI Upper p
Season (overall) 0.080
Age 1 0.99 1.01 0.950
Female_fFemale 1.05 0.8 1.38 0.716
Race_fBlack 0.83 0.45 1.4 0.502
Race_fOthers 1.32 0.57 2.66 0.474
BMI 1.01 0.99 1.03 0.393
ASA_f2 0.55 0.27 1.29 0.131
ASA_f3 0.56 0.27 1.32 0.148
ASA_f4 0.63 0.23 1.78 0.367
Steroids_fYes 1.37 0.86 2.1 0.171
Diabetes_fYes 1.77 1.29 2.41 0.000
RenalFailure_fYes 1.4 0.78 2.4 0.238
Emergency_fYes 1.78 0.97 3.08 0.049
DurationSurgery 1.23 1.15 1.32 0.000
RBC 1 1 1 0.651

According to these full model results, the development of SSI after colorectal surgery is more influenced by factors such as having diabetes (yes) by 77%, being an emergency case (yes) by 78%, and having a longer surgery duration (each additional hour increases SSI odds by 23%).

Note, R always uses the first level as the reference by default. Based on how we set up the factors earlier:

# Print the reference levels used in the regression models.
cat("Season reference:", levels(analytic_df$Season)[1], "\n")
## Season reference: Spring
cat("Race reference:", levels(analytic_df$Race_f)[1], "\n")
## Race reference: White
cat("ASA reference:", levels(analytic_df$ASA_f)[1], "\n")
## ASA reference: 1
cat("Gender reference:", levels(analytic_df$Female_f)[1], "\n")
## Gender reference: Male
cat("Diabetes reference:", levels(analytic_df$Diabetes_f)[1], "\n")
## Diabetes reference: No
cat("Renal failure reference:", levels(analytic_df$RenalFailure_f)[1], "\n")
## Renal failure reference: No
cat("Steroids reference:", levels(analytic_df$Steroids_f)[1], "\n")
## Steroids reference: No
cat("Emergency reference:", levels(analytic_df$Emergency_f)[1], "\n")
## Emergency reference: No
# Continuous variables are interpreted as one-unit increase interpretation. 

12 Appendix IV

# Check distributions of continuous variables using Shapiro-Wilk tests and histograms.
shapiro.test(analytic_df$Age)
## 
##  Shapiro-Wilk normality test
## 
## data:  analytic_df$Age
## W = 0.98735, p-value = 1.815e-15
shapiro.test(analytic_df$BMI)
## 
##  Shapiro-Wilk normality test
## 
## data:  analytic_df$BMI
## W = 0.94248, p-value < 2.2e-16
shapiro.test(analytic_df$DurationSurgery)
## 
##  Shapiro-Wilk normality test
## 
## data:  analytic_df$DurationSurgery
## W = 0.96389, p-value < 2.2e-16
shapiro.test(analytic_df$RBC)
## 
##  Shapiro-Wilk normality test
## 
## data:  analytic_df$RBC
## W = 0.24502, p-value < 2.2e-16
hist(analytic_df$Age, main = "Age Distribution")

hist(analytic_df$BMI, main = "BMI Distribution")

hist(analytic_df$DurationSurgery, main = "Duration Distribution")

hist(analytic_df$RBC, main = "RBC Distribution")

Although the original paper applied ANOVA for BMI, Shapiro-Wilk testing (W=0.94, p<0.05) and visual inspection revealed mild right skew compared to RBC transfusion and surgery duration. However, given the large balanced sample size, ANOVA remains robust and was used to maintain consistency with the original analysis.

13 Final Notes

AI was used to help with light editing of my final written report (grammar suggestions). Claude was used to guide-learning; to support trouble shooting when encountering errors in codes, and most importantly, in learning how to use more fancier codes in R (such as Table footnotes, colors, titles, advanced models and tests) not taught in class. After learning, codes were created by me; AI was used as a resource to guide, support, and solidify my understanding of statistical concepts and coding in R. Main text includes PPDAC(D) framework (excludes Preface and Appendix).