Title: “Surgical Site Infections in Pediatric Patients, 2024”
Author: “Angel Alexandria Porter”
Date: 5/13/2026

Data source: Center for Health Care Quality/Healthcare-Associated Infections Program/Epidemiology Unit

#### Research Questions

Do hospitals classified as Worse have higher infection rates (SIR) than those classified as Non Worse (same or better)?

Do the infection rates (SIR) differ across operative procedure types?

Is there a relationship between procedure volume (Procedure_Count) and infection rates (SIR)?

#### Overview

The table presents surgical site infection (SSI) data for 28 operative procedures performed on pediatric patients between January 1 and December 31, 2024, as reported by California hospitals to the Centers for Disease Control and Prevention National Healthcare Safety Network (NHSN). Data were obtained from NHSN on April 16, 2025. The dataset is not accompanied by a ReadMe (or similar) file documenting this information.

Key Variables

1. Standardized Infection Ratio (SIR): SIR = 1 is as expected; SIR > 1 is worse (more infections than predicted); SIR < 1 is better
2. Procedure Count: number of surgeries performed
3. Infections Reported: number of observed infections
4. Comparison (Performance Category): hospital performance relative to the national benchmark (Better/Same/Worse)
5. Operative Procedure: type of surgery performed
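To make the SIR definition concrete, here is a minimal sketch of how the ratio is formed from observed and predicted infections. The counts are hypothetical and the `toy` and `Reading` names are only for illustration. The published Comparison category may also take the confidence interval into account, so this sketch reads only the point estimate.

```r
# Hypothetical counts for three hospital-procedure records (not from the dataset)
toy <- data.frame(
  Infections_Reported  = c(2, 0, 3),
  Infections_Predicted = c(1.0, 0.8, 3.0)
)

# SIR = observed infections / predicted infections
toy$SIR <- toy$Infections_Reported / toy$Infections_Predicted

# Read each ratio against the benchmark value of 1 (point estimate only)
toy$Reading <- ifelse(toy$SIR > 1, "more than predicted",
               ifelse(toy$SIR < 1, "fewer than predicted", "as predicted"))

print(toy)
```

A record with zero observed infections gets an SIR of exactly 0, which is why many SIR values in the data sit at zero.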

I chose this specific topic because I aspire to become a pediatrician one day. This dataset provides valuable information to the public about the performance of surgical teams treating children in California. It serves as a useful tool for monitoring outcomes and identifying areas where the surgical care for pediatric patients can be improved.

#### Import the dataset

First, load the dataset. The code below reads a local CSV file (ca_ssi_peds_2024.csv) containing the data from the Surgical Site Infections (SSIs) for Operative Procedures in California Hospitals repository.

# Load necessary package(s); run install.packages("readr") first if readr is not installed.
library(readr)
# Load the dataset using read_csv
fully_needed <- read_csv("ca_ssi_peds_2024.csv", show_col_types = FALSE)
head(fully_needed)
# A tibble: 6 × 17
   Year State      County HAI      Operative_Procedure Facility_ID Facility_Name
  <dbl> <chr>      <chr>  <chr>    <chr>                     <dbl> <chr>        
1  2024 California <NA>   Surgica… STATE OF CALIFORNI…          NA <NA>         
2  2024 California <NA>   Surgica… STATE OF CALIFORNI…          NA <NA>         
3  2024 California <NA>   Surgica… STATE OF CALIFORNI…          NA <NA>         
4  2024 California <NA>   Surgica… STATE OF CALIFORNI…          NA <NA>         
5  2024 California <NA>   Surgica… STATE OF CALIFORNI…          NA <NA>         
6  2024 California <NA>   Surgica… STATE OF CALIFORNI…          NA <NA>         
# ℹ 10 more variables: Hospital_Category_RiskAdjustment <chr>,
#   Facility_Type <chr>, Procedure_Count <dbl>, Infections_Reported <dbl>,
#   Infections_Predicted <dbl>, SIR <dbl>, SIR_CI_95_Lower_Limit <dbl>,
#   SIR_CI_95_Upper_Limit <dbl>, Comparison <chr>, Notes <chr>

Handle missing values

# How many NA(s)?
colSums(is.na(fully_needed))
                            Year                            State 
                               0                                0 
                          County                              HAI 
                              19                                0 
             Operative_Procedure                      Facility_ID 
                               0                               19 
                   Facility_Name Hospital_Category_RiskAdjustment 
                              19                               19 
                   Facility_Type                  Procedure_Count 
                              19                              971 
             Infections_Reported             Infections_Predicted 
                             971                              971 
                             SIR            SIR_CI_95_Lower_Limit 
                            1240                             1264 
           SIR_CI_95_Upper_Limit                       Comparison 
                            1240                             1240 
                           Notes 
                             338 
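Raw NA counts are easier to compare as percentages. The following is a small sketch on a made-up data frame standing in for `fully_needed`; the values and the `toy`, `pct_missing`, and `n_complete` names are assumptions for illustration only.

```r
# Toy data frame standing in for fully_needed (values are made up)
toy <- data.frame(
  SIR             = c(1.2, NA, NA, 0.0),
  Procedure_Count = c(40, 12, NA, 95),
  Comparison      = c("Same", NA, NA, "Better")
)

# Percentage of missing values per column
pct_missing <- round(100 * colMeans(is.na(toy)), 1)
print(pct_missing)

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

The same two calls could be run on the real data frame to see how many rows remain usable before filtering.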
# Load necessary package(s)
library(dplyr)

clean_fully_needed <- fully_needed %>%
  # Keep only the columns used in the analysis
  select(SIR,
         Procedure_Count,
         Infections_Reported,
         Comparison,
         Operative_Procedure) %>%
  # Drop rows with a missing SIR or missing counts
  # (SIR >= 0 is a sanity check, since a ratio of counts cannot be negative)
  filter(!is.na(SIR),
         SIR >= 0,
         !is.na(Procedure_Count),
         !is.na(Infections_Reported)) %>%
  # Group by performance category
  group_by(Comparison) %>%
  # Sort by SIR, highest first
  arrange(desc(SIR))

print(clean_fully_needed)
# A tibble: 196 × 5
# Groups:   Comparison [3]
     SIR Procedure_Count Infections_Reported Comparison Operative_Procedure     
   <dbl>           <dbl>               <dbl> <chr>      <chr>                   
 1  9.66              60                   2 Worse      Open reduction of fract…
 2  8.70             126                   2 Same       All procedures          
 3  8.65              33                   3 Worse      Small bowel surgery     
 4  4.84             341                   4 Worse      STATE OF CALIFORNIA POO…
 5  4.44              13                   1 Same       Small bowel surgery     
 6  4.08              28                   1 Same       Spinal fusion           
 7  3.81              82                   2 Same       Laminectomy             
 8  3.54              45                   3 Same       Colon surgery           
 9  3.43             119                   2 Same       Open reduction of fract…
10  3.23              68                   2 Same       Small bowel surgery     
# ℹ 186 more rows
View(clean_fully_needed)

Visualization 1 (Tableau)

https://public.tableau.com/views/StandardInfectionRatioversusHospitalPerformanceGroups/Dashboard1?:language=en-US&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link

The boxplot presents the distribution of Standardized Infection Ratios (SIR) among hospital performance groups classified as Better, Same, and Worse. Hospitals in the Worse category exhibit substantially higher SIR values and greater variability than the other groups, reflecting elevated infection rates. In contrast, hospitals in the Better category display consistently low SIR values with minimal variation. The Same category occupies an intermediate position, characterized by a moderate spread and several high-value outliers. Collectively, these findings indicate a strong association between hospital performance classification and infection rates.
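The comparison in the boxplot can also be summarized numerically. Below is a minimal sketch for the first research question, collapsing Better and Same into a "Non Worse" group and comparing medians and interquartile ranges. The data are hypothetical and the `toy` and `Group` names are assumptions, not part of the original analysis.

```r
# Hypothetical SIR values by performance category (not from the dataset)
toy <- data.frame(
  Comparison = c("Better", "Better", "Same", "Same", "Same", "Worse", "Worse"),
  SIR        = c(0.0, 0.2, 0.5, 1.0, 3.0, 4.0, 9.0)
)

# Collapse the three categories into Worse vs Non Worse
toy$Group <- ifelse(toy$Comparison == "Worse", "Worse", "Non Worse")

# Median and interquartile range of SIR within each group
group_median <- tapply(toy$SIR, toy$Group, median)
group_iqr    <- tapply(toy$SIR, toy$Group, IQR)

print(cbind(median = group_median, IQR = group_iqr))
```

Medians and IQRs are used here rather than means because the SIR distribution is skewed with many zeros.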

Visualization 2

library(ggplot2)
library(dplyr)
library(stringr)

# Remove pooled/reference categories, tidy procedure names, and label each
# procedure with its number of records. This is done in one pipeline so the
# name cleaning is not overwritten by a later assignment.
filtered_statue <- clean_fully_needed %>%
  ungroup() %>%
  filter(
    !grepl("STATE OF CALIFORNIA", Operative_Procedure),
    Operative_Procedure != "All procedures"
  ) %>%
  mutate(
    Operative_Procedure = str_trim(str_replace_all(Operative_Procedure, "_", " "))
  ) %>%
  group_by(Operative_Procedure) %>%
  mutate(
    n = n(),
    Procedure_Label = paste0(
      Operative_Procedure,
      " (n=", n, ")"
    )
  ) %>%
  ungroup()

# Create horizontal boxplot
ggplot(filtered_statue,
       aes(x = reorder(Procedure_Label, SIR, median),
           y = SIR,
           fill = Operative_Procedure)) +

  geom_boxplot(alpha = 0.85, outlier.alpha = 0.6) +
  # Jitter along the category axis only, so plotted SIR values are not shifted
  geom_jitter(
    width = 0.15,
    height = 0,
    alpha = 0.4,
    size = 1
  ) +

  coord_flip() +

  labs(
    title = "Distribution of SIR Across Operative Procedure Types",
    subtitle = "Variation in infection ratios differs across procedure categories",
    x = "Operative Procedure Type",
    y = "Standardized Infection Ratio (SIR)"
  ) +

  scale_fill_viridis_d() +

  theme_minimal() +

  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 11),
    axis.text = element_text(size = 9),
    legend.position = "none",
    panel.grid.minor = element_blank()
  )

The visualization suggests that Standardized Infection Ratios (SIR) vary by procedure. Open reduction of fracture, laparotomy (exploratory abdominal surgery), and spinal fusion exhibit higher median SIRs and greater dispersion, while cardiac, thoracic, and kidney transplant surgeries remain concentrated near zero. Outliers across several categories further highlight substantial hospital-level variation within the same procedures. Overall, the boxplot distributions provide visual evidence that operative type is associated with differences in SIR.
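The visual impression could be checked with a rank-based test. The following sketch applies a Kruskal-Wallis test, which was not part of the original analysis, to made-up data. The procedure names and SIR values are hypothetical.

```r
# Hypothetical SIR values for three procedure types (not from the dataset)
toy <- data.frame(
  Operative_Procedure = rep(c("Procedure A", "Procedure B", "Procedure C"), each = 4),
  SIR = c(0.0, 0.1, 0.2, 0.3,
          1.0, 1.5, 2.0, 2.5,
          3.0, 4.0, 5.0, 6.0)
)

# Kruskal-Wallis rank sum test: do SIR distributions differ across procedures?
kw <- kruskal.test(SIR ~ Operative_Procedure, data = toy)
print(kw)
```

A rank-based test is a reasonable choice here because SIR values are skewed and include many zeros, which makes a one-way ANOVA on the raw values harder to justify.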

Visualization 3

library(plotly)

p <- ggplot(clean_fully_needed, aes(x = Procedure_Count, y = SIR)) +
  # A single jittered point layer (geom_point plus geom_jitter would draw every record twice)
  geom_jitter(alpha = 0.4, color = "steelblue", width = 0, height = 0.05) +
  geom_smooth(method = "lm", se = TRUE, color = "darkred") +
  scale_x_log10() +
  labs(
    title = "Relationship Between Procedure Volume and Infection Rates",
    x = "Procedure Count",
    y = "Standardized Infection Ratio (SIR)"
  ) +
  theme_minimal()
ggplotly(p)
`geom_smooth()` using formula = 'y ~ x'

The scatterplot, with procedure count on a log10 scale, suggests at most a weak positive association between procedure volume and SIR. Most records show low standardized infection ratio (SIR) values regardless of procedure count; however, several with moderate to high procedure volumes display elevated ratios.
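The strength of this association could be quantified with a rank correlation. Below is a minimal sketch using Spearman's rho on hypothetical values; this statistic was not computed in the original analysis, and the `toy` data are made up.

```r
# Hypothetical procedure counts and SIR values (not from the dataset)
toy <- data.frame(
  Procedure_Count = c(10, 25, 60, 120, 340, 800),
  SIR             = c(0.0, 1.2, 0.4, 2.1, 0.9, 1.5)
)

# Spearman rank correlation; exact = FALSE avoids warnings when real data have ties
rho <- cor.test(toy$Procedure_Count, toy$SIR, method = "spearman", exact = FALSE)
print(rho$estimate)

# Ranks are unchanged by a log transform, so the log10 axis does not affect rho
rho_log <- cor(log10(toy$Procedure_Count), toy$SIR, method = "spearman")
print(rho_log)
```

Spearman's rho depends only on ranks, so it gives the same answer whether procedure counts are analyzed on the raw or the log10 scale.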

Multiple linear regression

model <- lm(
  SIR ~ log10(Procedure_Count) + Infections_Reported + SIR_CI_95_Lower_Limit + SIR_CI_95_Upper_Limit,
  data = fully_needed
)
summary(model)

Call:
lm(formula = SIR ~ log10(Procedure_Count) + Infections_Reported + 
    SIR_CI_95_Lower_Limit + SIR_CI_95_Upper_Limit, data = fully_needed)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.84088 -0.60196  0.08868  0.72827  1.53382 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -0.87716    0.41448  -2.116   0.0358 *  
log10(Procedure_Count)  0.19215    0.17247   1.114   0.2668    
Infections_Reported    -0.01889    0.00780  -2.422   0.0165 *  
SIR_CI_95_Lower_Limit   3.61317    0.27283  13.243  < 2e-16 ***
SIR_CI_95_Upper_Limit   0.12887    0.01417   9.095 2.73e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7971 on 167 degrees of freedom
  (1276 observations deleted due to missingness)
Multiple R-squared:  0.7333,    Adjusted R-squared:  0.7269 
F-statistic: 114.8 on 4 and 167 DF,  p-value: < 2.2e-16
par(
  mfrow = c(2,2),
  mar = c(4,4,2,1)
)

plot(model)

The multiple linear regression model, fitted to the 172 records with complete data, was highly significant overall (F = 114.8 on 4 and 167 degrees of freedom, p < 0.001), with an adjusted R-squared of 0.727 (multiple R-squared 0.733). The confidence interval limits and reported infections were significant predictors, whereas procedure volume (log10) was not (p = 0.267). Because the confidence limits are derived from the same observed and predicted infection counts as the SIR, the high R-squared largely reflects that built-in relationship and should be interpreted cautiously. The diagnostic plots also suggested potential heteroscedasticity, which is common in skewed healthcare data.
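One way to look at procedure volume on its own is a reduced model that leaves out the confidence limits. This is only a sketch on hypothetical data and was not fitted in the original analysis; the `toy` and `reduced` names are assumptions.

```r
# Hypothetical procedure counts and SIR values (not from the dataset)
toy <- data.frame(
  Procedure_Count = c(10, 25, 60, 120, 340, 800),
  SIR             = c(0.0, 1.2, 0.4, 2.1, 0.9, 1.5)
)

# Reduced model: SIR explained by log10 procedure volume only
reduced <- lm(SIR ~ log10(Procedure_Count), data = toy)

# Slope for volume and the share of variance it explains by itself
slope <- unname(coef(reduced)[2])
r2    <- summary(reduced)$r.squared

print(c(slope = slope, r_squared = r2))
```

Comparing the R-squared of such a reduced model with that of the full model would show how much of the fit comes from the confidence limits rather than from volume.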

A. The dataset, its key variables, and my reasons for choosing this topic are described in the Overview section above.

B. These are my citations:

Microsoft. (n.d.). Bing. https://www.bing.com/

Prabhakaran, S. (2026, May 11). Ggplot2 geom_jitter() in R: Scatter with random jitter. r. https://r-statistics.co/ggplot2-geom_jitter-in-R.html

C. Boxplots showed that ‘Worse’-rated hospitals have higher median SIRs and more variability. A linear scale was chosen over a log scale because SIR values of zero cannot be placed on a log axis.
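The scale choice can be illustrated with a short sketch. The SIR values are hypothetical, and the log10(SIR + 1) shift is shown only as one possible workaround, not something used in the plots.

```r
# Hypothetical SIR values, including a zero (not from the dataset)
sir <- c(0, 0.5, 1, 4)

# log10(0) is -Inf, so a zero SIR has no position on a log axis
log_sir <- log10(sir)
print(log_sir)

# One possible workaround if a log-like scale were needed: log10(SIR + 1)
shifted <- log10(sir + 1)
print(shifted)
```

The shift keeps zeros on the plot but changes how the axis reads, which is one reason a plain linear scale can be the simpler choice.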

A scatterplot with procedure volume on a log scale revealed only a weak correlation between procedure volume and SIR. This suggests that other factors, such as patient complexity or hospital practices, may be more influential than sheer volume. The regression analysis is consistent with this, since procedure volume was not a significant predictor.

Together, the scatterplot and regression indicate that high volume does not automatically mean high risk.

Horizontal boxplots with jittered points showed that SIR varies considerably by procedure type. Open Reduction of Fracture and Laparotomy showed higher median SIRs and wider variability, while Cardiac and Thoracic surgeries clustered near zero. The jittered data highlighted extreme outliers, suggesting that even within similar procedures, hospital performance varies substantially. Readability was improved by using a horizontal layout and adding record counts to the procedure labels.

Collectively, the visualizations suggest that SIR is more closely associated with procedure type and hospital performance classification than with procedure volume. Although the large number of zero SIR values created scaling challenges, the final presentation highlights the main patterns in pediatric surgical site infection ratios.