DATA 110 - Project 1 - Hospital Acquired Surgical Site Infections in California

Author

Catherine Z. Matenje

Introduction

For project 1, I selected a data set obtained from the California Department of Public Health (CDPH) containing hospital-reported data on surgical site infections (SSIs). Hospitals in California are required to report Healthcare-Associated Infections (HAIs) to the CDPH as part of a larger reporting system under the Centers for Disease Control and Prevention National Healthcare Safety Network (NHSN), which aims to track, prevent, and eliminate infections patients acquire while receiving medical care. California hospitals are required to track and report “deep incision and organ/space SSIs for adult and pediatric patients for 28 types of operative procedures.”

This particular data set reports the incidence of SSIs for adults in 2024 by operative procedure type and includes procedure counts, the number of infections observed (reported) and predicted, the standardized infection ratio (SIR) and its associated 95% confidence interval, and a statistical interpretation showing whether SSI incidence was the same (no different), better (lower), or worse (higher) than the national baseline.

My particular interest for this project was to explore whether the number of procedures performed at hospitals is related to the standardized infection ratio (SIR), and whether this relationship changes by procedure type or county. Basically, I am interested in understanding whether hospitals performing a high number of procedures generally experienced high infection rates, and whether this differs by procedure type and county. Perhaps counties with a higher number of procedures performed experience higher infection rates. This information could be useful in understanding patterns of disease and other disparities that might be at play.

Please note that each row in this data set represents a hospital and a corresponding operative procedure. There can be multiple rows for one hospital since the data set reports up to 28 procedures per hospital.

Below are my research questions and the visualizations I hope to create for this project:

Research Question:

How does surgical procedure volume relate to infection rates, and how does this relationship vary across procedure types and counties in California?

Loading Packages

library(tidyverse)
library(ggplot2)
library(ggfortify)
library(ggalluvial)
library(RColorBrewer) 

Loading Data Set

ca_ssi_data <- readr::read_csv("ca_ssi_adult_odp_2024_data set.csv")
Rows: 6293 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): State, County, HAI, Operative_Procedure, Facility_Name, Hospital_C...
dbl  (8): Year, Facility_ID, Procedure_Count, Infections_Reported, Infection...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Exploring Data Set

head(ca_ssi_data)
# A tibble: 6 × 18
   Year State      County HAI      Operative_Procedure Facility_ID Facility_Name
  <dbl> <chr>      <chr>  <chr>    <chr>                     <dbl> <chr>        
1  2024 California <NA>   Surgica… STATE OF CALIFORNI…          NA <NA>         
2  2024 California <NA>   Surgica… STATE OF CALIFORNI…          NA <NA>         
3  2024 California <NA>   Surgica… STATE OF CALIFORNI…          NA <NA>         
4  2024 California <NA>   Surgica… STATE OF CALIFORNI…          NA <NA>         
5  2024 California <NA>   Surgica… STATE OF CALIFORNI…          NA <NA>         
6  2024 California <NA>   Surgica… STATE OF CALIFORNI…          NA <NA>         
# ℹ 11 more variables: Hospital_Category_RiskAdjustment <chr>,
#   Facility_Type <chr>, Procedure_Count <dbl>, Infections_Reported <dbl>,
#   Infections_Predicted <dbl>, SIR <dbl>, SIR_CI_95_Lower_Limit <dbl>,
#   SIR_CI_95_Upper_Limit <dbl>, Comparison <chr>, Met_2020_Goal <chr>,
#   Notes <chr>
names(ca_ssi_data)
 [1] "Year"                             "State"                           
 [3] "County"                           "HAI"                             
 [5] "Operative_Procedure"              "Facility_ID"                     
 [7] "Facility_Name"                    "Hospital_Category_RiskAdjustment"
 [9] "Facility_Type"                    "Procedure_Count"                 
[11] "Infections_Reported"              "Infections_Predicted"            
[13] "SIR"                              "SIR_CI_95_Lower_Limit"           
[15] "SIR_CI_95_Upper_Limit"            "Comparison"                      
[17] "Met_2020_Goal"                    "Notes"                           

# Key Variables of Interest

# County, SIR, Procedure_Count, Comparison, Operative_Procedure

glimpse(ca_ssi_data)
Rows: 6,293
Columns: 18
$ Year                             <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 2…
$ State                            <chr> "California", "California", "Californ…
$ County                           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ HAI                              <chr> "Surgical Site Infections (SSI)", "Su…
$ Operative_Procedure              <chr> "STATE OF CALIFORNIA POOLED DATA", "S…
$ Facility_ID                      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Facility_Name                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Hospital_Category_RiskAdjustment <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Facility_Type                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ Procedure_Count                  <dbl> 655036, 263, 30137, 6739, 15683, 1143…
$ Infections_Reported              <dbl> 3914, 2, 78, 250, 42, 242, 634, 98, 8…
$ Infections_Predicted             <dbl> 4514.54, 1.79, 124.47, 228.93, 65.16,…
$ SIR                              <dbl> 0.87, 1.12, 0.63, 1.09, 0.65, 1.07, 0…
$ SIR_CI_95_Lower_Limit            <dbl> 0.84, 0.19, 0.50, 0.96, 0.47, 0.94, 0…
$ SIR_CI_95_Upper_Limit            <dbl> 0.89, 3.70, 0.78, 1.23, 0.86, 1.21, 0…
$ Comparison                       <chr> "Better", "Same", "Better", "Same", "…
$ Met_2020_Goal                    <chr> "No", "No", "Yes", "No", "Yes", "No",…
$ Notes                            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
# County, Operative_Procedure, and Comparison are stored as characters. I will change these to factors in the following code chunk.

Cleaning the Data Set

# Deleting rows with statewide pooled data (rows 1-29)
# Deleting rows with "All procedures" (aggregated hospital-level rows)
# Remove missing values ONLY for key variables

ca_ssi_clean <- ca_ssi_data |>
  slice(-(1:29)) |>
  filter(Operative_Procedure != "All procedures") |>
  filter(!is.na(County)) |>
  drop_na(SIR, Procedure_Count, Comparison, Operative_Procedure)
  
# Changing characters to factors for categorical variables

ca_ssi_clean$Comparison <- as.factor(ca_ssi_clean$Comparison)
ca_ssi_clean$Operative_Procedure <- as.factor(ca_ssi_clean$Operative_Procedure)
ca_ssi_clean$County <- as.factor(ca_ssi_clean$County)

glimpse(ca_ssi_clean)
Rows: 3,119
Columns: 18
$ Year                             <dbl> 2024, 2024, 2024, 2024, 2024, 2024, 2…
$ State                            <chr> "California", "California", "Californ…
$ County                           <fct> Sacramento, Sacramento, Sacramento, S…
$ HAI                              <chr> "Surgical Site Infections (SSI)", "Su…
$ Operative_Procedure              <fct> "Appendix surgery", "Cesarean section…
$ Facility_ID                      <dbl> 30000037, 30000037, 30000037, 3000003…
$ Facility_Name                    <chr> "Methodist Hospital of Sacramento", "…
$ Hospital_Category_RiskAdjustment <chr> "Acute Care Hospital", "Acute Care Ho…
$ Facility_Type                    <chr> "Community, 125-250 Beds", "Community…
$ Procedure_Count                  <dbl> 109, 500, 19, 65, 139, 248, 127, 30, …
$ Infections_Reported              <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 3, 1, 1, 0, 0…
$ Infections_Predicted             <dbl> 0.25, 0.52, 0.40, 0.22, 0.27, 0.85, 0…
$ SIR                              <dbl> 0.00, 0.00, 0.00, 0.00, 3.77, 0.00, 0…
$ SIR_CI_95_Lower_Limit            <dbl> 0.00, 0.00, 0.00, 0.00, 0.10, 0.00, 0…
$ SIR_CI_95_Upper_Limit            <dbl> 14.58, 7.07, 9.18, 16.77, 21.03, 4.35…
$ Comparison                       <fct> Same, Same, Same, Same, Same, Same, S…
$ Met_2020_Goal                    <chr> "Yes", "Yes", "Yes", "Yes", "No", "Ye…
$ Notes                            <chr> "\xa5 See Data Dictionary", "\xa5 See…
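
The cleaning step above drops the statewide pooled rows by position (rows 1-29). A minimal sketch of a value-based alternative follows, using made-up rows and assuming (as the `glimpse()` output suggests) that pooled rows are the ones with a missing `County`. The `toy` tibble and its values are hypothetical and only illustrate the idea; this is not the code used for the results in this report.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy data (hypothetical values) mimicking the layout of the raw file
toy <- tibble(
  County = c(NA, NA, "Sacramento", "Sacramento", "Alameda"),
  Operative_Procedure = c("STATE OF CALIFORNIA POOLED DATA",
                          "STATE OF CALIFORNIA POOLED DATA",
                          "All procedures", "Appendix surgery", "Colon surgery"),
  SIR = c(0.87, 1.12, 0.90, 0.00, NA),
  Procedure_Count = c(655036, 263, 609, 109, 40)
)

toy_clean <- toy |>
  filter(!is.na(County)) |>                            # drops pooled rows (no county)
  filter(Operative_Procedure != "All procedures") |>   # drops hospital-level totals
  drop_na(SIR, Procedure_Count)                        # drops rows missing key values

toy_clean
```

Filtering on values keeps working even if the row order of the source file changes, which is the main reason to prefer it over `slice()` by position.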

Exploratory Data Analysis

Counting the Number of Hospitals/Facilities in the Data Set and the Number of Operative Procedures reported by each

First, I want to count how many unique facilities are represented in the data set and how many procedures each facility reported in total. I need to identify the unique number of facilities and then sum the procedure counts reported by each facility.

# Count number of facilities by ID and Name
 n_distinct(ca_ssi_clean$Facility_ID) 
[1] 299
  # 299 

 n_distinct(ca_ssi_clean$Facility_Name)
[1] 298
  # 298
 
 # The most useful value to use is the facility ID since it is a unique number assigned to each hospital.
 
 # Total procedure volume (sum of procedure counts) by facility
 
 hospital_procedure_volume <- ca_ssi_clean |>
   group_by(Facility_ID) |>
   summarize(total_procedures = sum(Procedure_Count, na.rm = TRUE)) |>
   arrange(desc(total_procedures))
 
 hospital_procedure_volume
# A tibble: 299 × 2
   Facility_ID total_procedures
         <dbl>            <dbl>
 1   930000004            13996
 2    70001357            12134
 3    80000149             8610
 4    90001116             8388
 5    30000113             7933
 6   930000912             7872
 7    30000151             7348
 8    60000014             7191
 9   930000127             7098
10    60000071             6777
# ℹ 289 more rows
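
The counts above differ by one (299 IDs versus 298 names). One possible explanation is that two facility IDs share a name. The sketch below shows how such a case could be located; the `toy` tibble and the facility names in it are invented for illustration, and the real cause in this data set has not been checked here.

```r
library(dplyr)
library(tibble)

# Hypothetical facilities: IDs 1 and 2 share the same name
toy <- tibble(
  Facility_ID   = c(1, 1, 2, 3, 3),
  Facility_Name = c("General Hospital", "General Hospital", "General Hospital",
                    "Valley Medical Center", "Valley Medical Center")
)

shared_names <- toy |>
  distinct(Facility_ID, Facility_Name) |>   # one row per ID/name pair
  count(Facility_Name, name = "n_ids") |>   # how many IDs use each name
  filter(n_ids > 1)                         # keep names used by several IDs

shared_names
```

Running the same pipeline on `ca_ssi_clean` would list any names attached to more than one `Facility_ID`.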

Plot 1: Distribution of Total Procedure Volume by Hospital

I want to understand how procedures are distributed among the hospitals. Typically, smaller hospitals, such as rural or community hospitals, might perform fewer surgeries compared to larger hospitals, such as university or teaching hospitals. I decided to plot a histogram of the total procedure count per facility (procedure volume), with the number of hospitals on the y-axis.

ggplot (hospital_procedure_volume, aes (x=total_procedures)) +
  geom_histogram(fill = "lightblue", color = "black", bins =30) +
  labs (
    title = "Distribution of Total Procedure Volume per Hospital",
    x = "Total Procedure Count",
    y = "Number of Hospitals"
  )+
  theme_minimal()

# This plot is not meaningful because it is highly right-skewed: most values are squished into the first few bins, making it look like most hospitals reported 0 procedures, which is incorrect


ggplot (hospital_procedure_volume, aes (x=total_procedures)) +
  geom_histogram(fill = "lightblue", color = "black", bins =30) +
  scale_x_log10() +
  labs (
    title = "Corrected Distribution of Total Procedure Volume per Hospital",
    x = "Total Procedure Count",
    y = "Number of Hospitals", 
    caption = "Each bar represents the number of hospitals whose total procedure volume falls within the bin (range on the x-axis)"
  )+
  theme_minimal()

Each bar in the histogram represents the number of hospitals whose total procedure count falls within a certain range. A majority of hospitals fall between 500 and 3,000 procedures, a few fall between 10 and 100 procedures, and even fewer have more than 5,000 procedures. Even though the log scale hides it, the underlying distribution is still right-skewed because a majority of hospitals have a medium procedure volume and very few have extremely high procedure volume.

Distribution of SIR

Next, I wish to view the distribution of the standardized infection ratio (SIR) in the data set. The SIR is calculated by dividing the number of observed infections by the number of predicted infections. Generally, an SIR of less than 1 (<1) is considered better than expected, an SIR equal to 1 is as expected, and an SIR greater than 1 (>1) is worse than expected. I expect that a majority of values are better than expected, with a few outliers that are worse than expected.
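
The calculation can be checked by hand. The sketch below uses the first four statewide rows shown in the `glimpse()` output earlier (observed and predicted infections) and reproduces the SIR values reported for them. The variable names `observed`, `predicted`, and `sir` are introduced here for illustration only.

```r
# First four rows of Infections_Reported and Infections_Predicted
# as printed in the glimpse() output of the raw data
observed  <- c(3914, 2, 78, 250)
predicted <- c(4514.54, 1.79, 124.47, 228.93)

# SIR = observed infections / predicted infections, rounded to 2 decimals
sir <- round(observed / predicted, 2)
sir
# matches the SIR column for these rows: 0.87 1.12 0.63 1.09
```

Note that the `Comparison` column is not based on the point estimate alone: the second row has an SIR of 1.12 but is labelled "Same", presumably because its 95% confidence interval (0.19 to 3.70) includes 1.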

ggplot (ca_ssi_clean, aes(x = SIR)) + 
  geom_histogram(fill = "steelblue", color = "white", bins = 30) +
  labs( 
    title = "Distribution of Standardized Infection Ratio (SIR)", 
    x = "SIR", 
    y = "Frequency",
    caption = "Each observation represents a hospital and corresponding operative procedure. Each hospital reports up to 28 operative procedures"
    ) + theme_minimal()

The distribution of the SIR is heavily right-skewed. Most values are between 0 and 1, which means more hospitals are performing “better than expected”. However, there are also a number of extreme outliers, going up to almost 15. This suggests that some hospitals or operative procedures have far more infections than predicted.

Procedure Volume by Procedure Type

Next, I want to look at the number of procedures by operative procedure type to better understand which procedures are most common in the data set.

# Order bars by total volume (reorder() uses the mean by default)
ggplot(ca_ssi_clean,
       aes(x = reorder(Operative_Procedure, -Procedure_Count, FUN = sum),
           y = Procedure_Count)) +
  stat_summary(fun = sum, geom = "bar", fill = "purple") +
  labs(
    title = "Total Procedure Volume by Procedure Type",
    x = "Procedure Type",
    y = "Total Procedure Count"
  ) +
  theme_minimal() 

# The x-axis labels for the procedures are overlapping and hard to read. I need to put them at an angle so they are more visible

# Order bars by total volume (reorder() uses the mean by default)
ggplot(ca_ssi_clean,
       aes(x = reorder(Operative_Procedure, -Procedure_Count, FUN = sum),
           y = Procedure_Count)) +
  stat_summary(fun = sum, geom = "bar", fill = "purple") + 
  labs(
    title = "Total Procedure Volume by Procedure Type",
    x = "Procedure Type",
    y = "Total Procedure Count"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The four most common operative procedures appear to be cesarean section, gallbladder surgery, knee prosthesis, and exploratory abdominal surgery (laparotomy). Most of the other procedures are far less common. It is possible that C-sections and other high-volume procedures are driving most of the data.
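
A note on bar ordering: `reorder()` summarises the ordering variable with `mean` unless another function is passed through `FUN`, so bars whose heights are sums are only sorted by height when `FUN = sum` is supplied. The toy example below (invented data, hypothetical names `toy`, `by_mean`, `by_sum`) shows how the two orderings can disagree.

```r
# Procedure A: three hospitals with 10 procedures each (mean 10, total 30)
# Procedure B: one hospital with 25 procedures (mean 25, total 25)
toy <- data.frame(
  procedure = c("A", "A", "A", "B"),
  count     = c(10, 10, 10, 25)
)

# Default FUN = mean: B has the larger mean, so it is placed first
by_mean <- levels(reorder(toy$procedure, -toy$count))

# FUN = sum: A has the larger total, so it is placed first
by_sum <- levels(reorder(toy$procedure, -toy$count, FUN = sum))

by_mean
by_sum
```

With many rows per category, as in this data set, the two orderings can differ noticeably for procedures performed at few hospitals.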

Procedure Volume by County

Next, I am interested in observing whether the number of procedures varies by County. Perhaps, certain counties will have higher procedure counts than others.

# Order bars by total volume (reorder() uses the mean by default)
ggplot(ca_ssi_clean,
       aes(x = reorder(County, -Procedure_Count, FUN = sum),
           y = Procedure_Count)) +
  stat_summary(fun = sum, geom = "bar", fill = "steelblue") +
  labs(
    title = "Total Procedure Volume by County",
    x = "County",
    y = "Total Procedure Count"
  ) +
  theme_minimal()

# Same as above, the x-axis labels for the counties are overlapping and hard to read. I need to put them at an angle so they are more visible

# Order bars by total volume (reorder() uses the mean by default)
ggplot(ca_ssi_clean,
       aes(x = reorder(County, -Procedure_Count, FUN = sum),
           y = Procedure_Count)) +
  stat_summary(fun = sum, geom = "bar", fill = "steelblue") +
  labs(
    title = "Total Procedure Volume by County",
    x = "County",
    y = "Total Procedure Count"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Los Angeles, San Diego, and Orange counties have the highest procedure counts. Most counties have very low procedure counts. This could indicate poor reporting from hospitals in the low-count counties, or it could indicate that the high-procedure counties have higher healthcare activity. Los Angeles, for example, is quite populated and likely has higher healthcare activity than other counties.

Boxplot of SIR by Procedure Type

Next, I was interested to see how the standardized infection ratio varies by procedure type. Perhaps certain procedures have a lower or higher SIR in this data set. I’m not sure how useful or meaningful this will be, as there will be 27 or 28 boxplots displayed; however, I may be able to get an idea of which procedures have lower or higher SIRs.

ggplot(ca_ssi_clean, aes(x = Operative_Procedure, y = SIR)) +
  geom_boxplot(fill = "orange") +
  labs(
    title = "SIR Distribution by Procedure Type",
    x = "Procedure Type",
    y = "SIR"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The procedures with the lowest counts (the less shaded orange boxes) tend to have extreme or high outliers. Some procedures, such as kidney transplant and liver transplant, seem to have a higher median SIR, which indicates worse than expected performance. There are also many outliers across all procedure types, which indicates variability in the data set.

Scatterplot of Relationship between Procedure Count and SIR

Lastly, I am primarily interested in observing whether a relationship exists between Procedure count and SIR.

ggplot(ca_ssi_clean, aes(x = Procedure_Count, y = SIR)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm") +
  labs(
    title = "Relationship Between Procedure Volume and Infection Rate",
    x = "Procedure Count",
    y = "SIR"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

The scatterplot confirms that the data are heavily right-skewed in terms of procedure count: three quarters of the hospital-procedure rows have 235 or fewer procedures, while the maximum is 2,383 (see the summary statistics below). The scatterplot also does not show a strong relationship between procedure count and SIR. There is a slight upward trend indicated by the line, but it is not enough to indicate a positive relationship. Linear regression with other variables will help to determine whether other factors change the effect.
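
Because both variables are heavily skewed, a rank-based (Spearman) correlation is one way to put a number on the relationship without letting a few extreme points dominate. The sketch below runs on simulated data only (the names `sim_count` and `sim_sir` and all values are invented), so its output says nothing about the real data set; it only shows the calls involved.

```r
set.seed(110)  # arbitrary seed so the simulated example is reproducible

# Simulated (not real) data: right-skewed counts and ratios with no built-in link
sim_count <- round(rlnorm(200, meanlog = 5, sdlog = 1))
sim_sir   <- rexp(200, rate = 1)

# Pearson measures linear association; Spearman works on ranks
pearson  <- cor(sim_count, sim_sir, method = "pearson")
spearman <- cor(sim_count, sim_sir, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

Applied to the real data, the equivalent call would be `cor(ca_ssi_clean$Procedure_Count, ca_ssi_clean$SIR, method = "spearman")`.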

Summary Statistics

OVERALL SUMMARY STATS

summary(ca_ssi_clean$SIR)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.9177  1.3800 14.7100 
summary(ca_ssi_clean$Procedure_Count)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    3.0    63.0   122.0   191.4   235.0  2383.0 

SUMMARY STATS BY PROCEDURE TYPE

ca_ssi_clean |>
  group_by(Operative_Procedure) |>
  summarise(
    avg_SIR = mean(SIR, na.rm = TRUE),
    total_procedures = sum(Procedure_Count, na.rm = TRUE)
  ) |>
  arrange(desc(total_procedures))
# A tibble: 27 × 3
   Operative_Procedure                        avg_SIR total_procedures
   <fct>                                        <dbl>            <dbl>
 1 Cesarean section                             1.26            110870
 2 Gallbladder surgery                          0.827            50512
 3 Exploratory abdominal surgery (laparotomy)   1.13             42634
 4 Knee prosthesis                              0.865            42359
 5 Spinal fusion                                1.07             40526
 6 Open reduction of fracture                   0.974            39306
 7 Laminectomy                                  0.748            38344
 8 Hip prosthesis                               0.965            38205
 9 Colon surgery                                0.750            29808
10 Appendix surgery                             0.792            25014
# ℹ 17 more rows

As noted in my plots above, C-section has the highest volume of procedures; however, these summary stats show it also has a slightly higher average SIR (>1). Some procedures, such as kidney transplant; bile duct, liver or pancreatic surgery; ovarian surgery; and spleen surgery, have low volume but high SIRs.
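
One caveat about `avg_SIR`: it is an unweighted mean of hospital-level ratios, so a hospital with very few predicted infections can move it a lot. An alternative summary (not the one used in the table above) is a pooled ratio, which divides total observed infections by total predicted infections. The sketch below uses invented numbers and a hypothetical `toy` tibble to show how far the two can diverge.

```r
library(dplyr)
library(tibble)

# Hypothetical procedure reported by three hospitals of very different size
toy <- tibble(
  Operative_Procedure  = c("Procedure X", "Procedure X", "Procedure X"),
  Infections_Reported  = c(0, 1, 10),
  Infections_Predicted = c(0.5, 0.2, 10)
) |>
  mutate(SIR = Infections_Reported / Infections_Predicted)  # 0, 5, 1

toy_summary <- toy |>
  group_by(Operative_Procedure) |>
  summarise(
    avg_SIR    = mean(SIR),                                          # unweighted mean
    pooled_SIR = sum(Infections_Reported) / sum(Infections_Predicted) # 11 / 10.7
  )

toy_summary
```

Here the single small hospital with one infection and 0.2 predicted pulls the unweighted mean to 2, while the pooled ratio stays close to 1.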

Summary of Exploratory Data Analysis Findings:

Overall, my EDA revealed that most hospitals in the data set perform at or near the expected levels, with SIRs between 0 and 1. There are a few extreme SIR values (outliers) in the data set, which appear to be related to low-volume or less common procedures. Based on my scatterplot, it does not appear that procedure count/volume explains infection rates on its own. I will need to perform multiple linear regression to determine whether other variables, such as procedure type, comparison (performance), or county, are better predictors of infection rates.

Multiple Linear Regression with Backward Elimination

The purpose of this section is to identify which variables in my data set explain or predict SIR (infection rates).
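
For readers unfamiliar with backward elimination, the general idea is to start from a full model and remove predictors one at a time while the fit does not get meaningfully worse. The sketch below illustrates this on R's built-in `mtcars` data with `step()`, which drops terms based on AIC. This is an assumption made for illustration: the elimination in this report may instead be done by hand using p-values, and the variables here are unrelated to the SSI data.

```r
# Full model on the built-in mtcars data (illustrative only)
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log
reduced <- step(full, direction = "backward", trace = 0)

# Predictors that survive elimination
formula(reduced)
```

The same call could be applied to a fitted `lm` object such as the full model below, keeping in mind that AIC-based and p-value-based elimination do not always retain the same predictors.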

Full Model with all variables

fit1 <- lm(SIR ~ Procedure_Count +
                 Infections_Reported +
                 Comparison +
                 Operative_Procedure +
                 County,
           data = ca_ssi_clean)

summary(fit1)

Call:
lm(formula = SIR ~ Procedure_Count + Infections_Reported + Comparison + 
    Operative_Procedure + County, data = ca_ssi_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.3721 -0.6610 -0.2722  0.3471  9.2182 

Coefficients:
                                                                Estimate
(Intercept)                                                    1.1579731
Procedure_Count                                               -0.0018060
Infections_Reported                                            0.2630817
ComparisonSame                                                -0.0120296
ComparisonWorse                                                2.9975527
Operative_ProcedureBile duct, liver or pancreatic surgery     -0.3593180
Operative_ProcedureCardiac surgery                            -0.1409771
Operative_ProcedureCesarean section                            0.9274948
Operative_ProcedureColon surgery                              -0.6996882
Operative_ProcedureCoronary bypass,chest and donor incisions  -0.0526856
Operative_ProcedureCoronary bypass,chest incision only        -0.4545957
Operative_ProcedureExploratory abdominal surgery (laparotomy)  0.0313966
Operative_ProcedureGallbladder surgery                         0.0762173
Operative_ProcedureGastric surgery                            -0.1690881
Operative_ProcedureHeart transplant                           -0.5543579
Operative_ProcedureHip prosthesis                             -0.0238770
Operative_ProcedureHysterectomy, abdominal                     0.1739939
Operative_ProcedureHysterectomy, vaginal                       0.0436220
Operative_ProcedureKidney surgery                             -0.1023795
Operative_ProcedureKidney transplant                           0.3179542
Operative_ProcedureKnee prosthesis                             0.0562572
Operative_ProcedureLaminectomy                                 0.0381708
Operative_ProcedureLiver transplant                           -0.9486058
Operative_ProcedureOpen reduction of fracture                 -0.0728935
Operative_ProcedureOvarian surgery                             0.8140021
Operative_ProcedurePacemaker surgery                           0.0551921
Operative_ProcedureRectal surgery                             -0.5874954
Operative_ProcedureSmall bowel surgery                        -0.3709971
Operative_ProcedureSpinal fusion                              -0.1667891
Operative_ProcedureSpleen surgery                              0.1893339
Operative_ProcedureThoracic surgery                            0.0538888
CountyAmador                                                  -0.5789959
CountyButte                                                    0.0220401
CountyCalaveras                                                2.1495590
CountyContra Costa                                            -0.2433027
CountyDel Norte                                               -0.7135997
CountyEl Dorado                                               -0.2799025
CountyFresno                                                  -0.2381498
CountyHumboldt                                                -0.5349796
CountyImperial                                                 0.8186518
CountyInyo                                                    -0.4155534
CountyKern                                                    -0.4131513
CountyKings                                                   -0.0323239
CountyLake                                                    -0.4019167
CountyLos Angeles                                             -0.3418009
CountyMarin                                                    0.9731203
CountyMendocino                                               -0.5661300
CountyMerced                                                  -0.0009802
CountyMono                                                    -0.9584614
CountyMonterey                                                -0.4287955
CountyNapa                                                     0.2878743
CountyNevada                                                  -0.1779024
CountyOrange                                                  -0.3578967
CountyPlacer                                                  -0.2568145
CountyRiverside                                               -0.3036910
CountySacramento                                              -0.3535378
CountySan Benito                                               4.6653843
CountySan Bernardino                                          -0.1500652
CountySan Diego                                               -0.2056425
CountySan Francisco                                           -0.2629855
CountySan Joaquin                                             -0.2850759
CountySan Luis Obispo                                         -0.4677782
CountySan Mateo                                               -0.0795682
CountySanta Barbara                                           -0.0861912
CountySanta Clara                                             -0.0822171
CountySanta Cruz                                               0.0191257
CountyShasta                                                  -0.0017947
CountySiskiyou                                                -0.3445788
CountySolano                                                   0.0899507
CountySonoma                                                   0.5380372
CountyStanislaus                                              -0.2635970
CountySutter                                                  -0.8669147
CountyTehama                                                  -0.3265798
CountyTulare                                                  -0.3917117
CountyTuolumne                                                -0.2380431
CountyVentura                                                 -0.1405249
CountyYolo                                                     0.3713698
CountyYuba                                                    -0.1461363
                                                              Std. Error
(Intercept)                                                    0.2337916
Procedure_Count                                                0.0001501
Infections_Reported                                            0.0137566
ComparisonSame                                                 0.1831556
ComparisonWorse                                                0.2300055
Operative_ProcedureBile duct, liver or pancreatic surgery      0.1579132
Operative_ProcedureCardiac surgery                             0.1680803
Operative_ProcedureCesarean section                            0.1506513
Operative_ProcedureColon surgery                               0.1236642
Operative_ProcedureCoronary bypass,chest and donor incisions   0.1499359
Operative_ProcedureCoronary bypass,chest incision only         0.3329924
Operative_ProcedureExploratory abdominal surgery (laparotomy)  0.1232586
Operative_ProcedureGallbladder surgery                         0.1248425
Operative_ProcedureGastric surgery                             0.1333740
Operative_ProcedureHeart transplant                            0.4617477
Operative_ProcedureHip prosthesis                              0.1229133
Operative_ProcedureHysterectomy, abdominal                     0.1340253
Operative_ProcedureHysterectomy, vaginal                       0.4111545
Operative_ProcedureKidney surgery                              0.2342649
Operative_ProcedureKidney transplant                           0.3332720
Operative_ProcedureKnee prosthesis                             0.1294376
Operative_ProcedureLaminectomy                                 0.1398527
Operative_ProcedureLiver transplant                            0.3756376
Operative_ProcedureOpen reduction of fracture                  0.1243344
Operative_ProcedureOvarian surgery                             0.4622731
Operative_ProcedurePacemaker surgery                           0.2133072
Operative_ProcedureRectal surgery                              0.1412819
Operative_ProcedureSmall bowel surgery                         0.1240770
Operative_ProcedureSpinal fusion                               0.1359084
Operative_ProcedureSpleen surgery                              0.3330789
Operative_ProcedureThoracic surgery                            0.1582918
CountyAmador                                                   0.8527703
CountyButte                                                    0.2823218
CountyCalaveras                                                1.2008357
CountyContra Costa                                             0.1701321
CountyDel Norte                                                0.8527664
CountyEl Dorado                                                0.3375866
CountyFresno                                                   0.1699035
CountyHumboldt                                                 0.3093238
CountyImperial                                                 0.4649970
CountyInyo                                                     1.2010764
CountyKern                                                     0.1758055
CountyKings                                                    0.3613494
CountyLake                                                     0.5465455
CountyLos Angeles                                              0.1166048
CountyMarin                                                    0.2668208
CountyMendocino                                                0.3765024
CountyMerced                                                   0.3267814
CountyMono                                                     0.6989960
CountyMonterey                                                 0.2219857
CountyNapa                                                     0.2947821
CountyNevada                                                   0.4131568
CountyOrange                                                   0.1325640
CountyPlacer                                                   0.2298322
CountyRiverside                                                0.1404413
CountySacramento                                               0.1511410
CountySan Benito                                               1.2072208
CountySan Bernardino                                           0.1415909
CountySan Diego                                                0.1313093
CountySan Francisco                                            0.1584625
CountySan Joaquin                                              0.1852753
CountySan Luis Obispo                                          0.2830004
CountySan Mateo                                                0.1965618
CountySanta Barbara                                            0.2035945
CountySanta Clara                                              0.1477972
CountySanta Cruz                                               0.2947240
CountyShasta                                                   0.2580489
CountySiskiyou                                                 0.5002848
CountySolano                                                   0.2155035
CountySonoma                                                   0.2155149
CountyStanislaus                                               0.1864633
CountySutter                                                   0.5458124
CountyTehama                                                   0.6990050
CountyTulare                                                   0.2884882
CountyTuolumne                                                 0.3931131
CountyVentura                                                  0.1680724
CountyYolo                                                     0.3618585
CountyYuba                                                     0.3760945
                                                              t value Pr(>|t|)
(Intercept)                                                     4.953 7.71e-07
Procedure_Count                                               -12.033  < 2e-16
Infections_Reported                                            19.124  < 2e-16
ComparisonSame                                                 -0.066 0.947637
ComparisonWorse                                                13.033  < 2e-16
Operative_ProcedureBile duct, liver or pancreatic surgery      -2.275 0.022950
Operative_ProcedureCardiac surgery                             -0.839 0.401676
Operative_ProcedureCesarean section                             6.157 8.41e-10
Operative_ProcedureColon surgery                               -5.658 1.67e-08
Operative_ProcedureCoronary bypass,chest and donor incisions   -0.351 0.725322
Operative_ProcedureCoronary bypass,chest incision only         -1.365 0.172296
Operative_ProcedureExploratory abdominal surgery (laparotomy)   0.255 0.798955
Operative_ProcedureGallbladder surgery                          0.611 0.541571
Operative_ProcedureGastric surgery                             -1.268 0.204976
Operative_ProcedureHeart transplant                            -1.201 0.230014
Operative_ProcedureHip prosthesis                              -0.194 0.845986
Operative_ProcedureHysterectomy, abdominal                      1.298 0.194311
Operative_ProcedureHysterectomy, vaginal                        0.106 0.915513
Operative_ProcedureKidney surgery                              -0.437 0.662125
Operative_ProcedureKidney transplant                            0.954 0.340140
Operative_ProcedureKnee prosthesis                              0.435 0.663864
Operative_ProcedureLaminectomy                                  0.273 0.784921
Operative_ProcedureLiver transplant                            -2.525 0.011610
Operative_ProcedureOpen reduction of fracture                  -0.586 0.557738
Operative_ProcedureOvarian surgery                              1.761 0.078361
Operative_ProcedurePacemaker surgery                            0.259 0.795850
Operative_ProcedureRectal surgery                              -4.158 3.29e-05
Operative_ProcedureSmall bowel surgery                         -2.990 0.002812
Operative_ProcedureSpinal fusion                               -1.227 0.219836
Operative_ProcedureSpleen surgery                               0.568 0.569781
Operative_ProcedureThoracic surgery                             0.340 0.733549
CountyAmador                                                   -0.679 0.497216
CountyButte                                                     0.078 0.937780
CountyCalaveras                                                 1.790 0.073545
CountyContra Costa                                             -1.430 0.152796
CountyDel Norte                                                -0.837 0.402768
CountyEl Dorado                                                -0.829 0.407097
CountyFresno                                                   -1.402 0.161114
CountyHumboldt                                                 -1.730 0.083819
CountyImperial                                                  1.761 0.078415
CountyInyo                                                     -0.346 0.729379
CountyKern                                                     -2.350 0.018835
CountyKings                                                    -0.089 0.928728
CountyLake                                                     -0.735 0.462167
CountyLos Angeles                                              -2.931 0.003401
CountyMarin                                                     3.647 0.000270
CountyMendocino                                                -1.504 0.132774
CountyMerced                                                   -0.003 0.997607
CountyMono                                                     -1.371 0.170415
CountyMonterey                                                 -1.932 0.053497
CountyNapa                                                      0.977 0.328862
CountyNevada                                                   -0.431 0.666795
CountyOrange                                                   -2.700 0.006976
CountyPlacer                                                   -1.117 0.263912
CountyRiverside                                                -2.162 0.030665
CountySacramento                                               -2.339 0.019393
CountySan Benito                                                3.865 0.000114
CountySan Bernardino                                           -1.060 0.289297
CountySan Diego                                                -1.566 0.117431
CountySan Francisco                                            -1.660 0.097097
CountySan Joaquin                                              -1.539 0.123991
CountySan Luis Obispo                                          -1.653 0.098449
CountySan Mateo                                                -0.405 0.685653
CountySanta Barbara                                            -0.423 0.672072
CountySanta Clara                                              -0.556 0.578058
CountySanta Cruz                                                0.065 0.948263
CountyShasta                                                   -0.007 0.994451
CountySiskiyou                                                 -0.689 0.491023
CountySolano                                                    0.417 0.676417
CountySonoma                                                    2.497 0.012594
CountyStanislaus                                               -1.414 0.157562
CountySutter                                                   -1.588 0.112322
CountyTehama                                                   -0.467 0.640386
CountyTulare                                                   -1.358 0.174625
CountyTuolumne                                                 -0.606 0.544870
CountyVentura                                                  -0.836 0.403166
CountyYolo                                                      1.026 0.304839
CountyYuba                                                     -0.389 0.697627
                                                                 
(Intercept)                                                   ***
Procedure_Count                                               ***
Infections_Reported                                           ***
ComparisonSame                                                   
ComparisonWorse                                               ***
Operative_ProcedureBile duct, liver or pancreatic surgery     *  
Operative_ProcedureCardiac surgery                               
Operative_ProcedureCesarean section                           ***
Operative_ProcedureColon surgery                              ***
Operative_ProcedureCoronary bypass,chest and donor incisions     
Operative_ProcedureCoronary bypass,chest incision only           
Operative_ProcedureExploratory abdominal surgery (laparotomy)    
Operative_ProcedureGallbladder surgery                           
Operative_ProcedureGastric surgery                               
Operative_ProcedureHeart transplant                              
Operative_ProcedureHip prosthesis                                
Operative_ProcedureHysterectomy, abdominal                       
Operative_ProcedureHysterectomy, vaginal                         
Operative_ProcedureKidney surgery                                
Operative_ProcedureKidney transplant                             
Operative_ProcedureKnee prosthesis                               
Operative_ProcedureLaminectomy                                   
Operative_ProcedureLiver transplant                           *  
Operative_ProcedureOpen reduction of fracture                    
Operative_ProcedureOvarian surgery                            .  
Operative_ProcedurePacemaker surgery                             
Operative_ProcedureRectal surgery                             ***
Operative_ProcedureSmall bowel surgery                        ** 
Operative_ProcedureSpinal fusion                                 
Operative_ProcedureSpleen surgery                                
Operative_ProcedureThoracic surgery                              
CountyAmador                                                     
CountyButte                                                      
CountyCalaveras                                               .  
CountyContra Costa                                               
CountyDel Norte                                                  
CountyEl Dorado                                                  
CountyFresno                                                     
CountyHumboldt                                                .  
CountyImperial                                                .  
CountyInyo                                                       
CountyKern                                                    *  
CountyKings                                                      
CountyLake                                                       
CountyLos Angeles                                             ** 
CountyMarin                                                   ***
CountyMendocino                                                  
CountyMerced                                                     
CountyMono                                                       
CountyMonterey                                                .  
CountyNapa                                                       
CountyNevada                                                     
CountyOrange                                                  ** 
CountyPlacer                                                     
CountyRiverside                                               *  
CountySacramento                                              *  
CountySan Benito                                              ***
CountySan Bernardino                                             
CountySan Diego                                                  
CountySan Francisco                                           .  
CountySan Joaquin                                                
CountySan Luis Obispo                                         .  
CountySan Mateo                                                  
CountySanta Barbara                                              
CountySanta Clara                                                
CountySanta Cruz                                                 
CountyShasta                                                     
CountySiskiyou                                                   
CountySolano                                                     
CountySonoma                                                  *  
CountyStanislaus                                                 
CountySutter                                                     
CountyTehama                                                     
CountyTulare                                                     
CountyTuolumne                                                   
CountyVentura                                                    
CountyYolo                                                       
CountyYuba                                                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.194 on 3041 degrees of freedom
Multiple R-squared:  0.3982,    Adjusted R-squared:  0.383 
F-statistic: 26.13 on 77 and 3041 DF,  p-value: < 2.2e-16
# Because the model also includes categorical variables (Comparison, Operative_Procedure, and County), R creates a dummy variable for each non-reference level.

Interpretation of Model 1:

Adjusted R^2: 0.383. This tells me that this model explains about 38% of the variation in SIR.
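As a quick check on where the adjusted value comes from, it can be reproduced from the numbers printed at the bottom of the summary output. This is only a sketch: the variable names are made up for illustration, and the inputs are the rounded values from the output above.

```r
# Values copied from the Model 1 summary output
r2       <- 0.3982   # Multiple R-squared
df_resid <- 3041     # residual degrees of freedom
df_model <- 77       # numerator df of the F-statistic (number of slopes)

# Observations used in the fit: residual df + slopes + intercept
n <- df_resid + df_model + 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
round(adj_r2, 3)   # 0.383
```

The gap between 0.3982 and 0.383 is the penalty for estimating 77 coefficients, most of them dummy variables.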

  • Procedure_Count: The estimate is -0.001806 with p < 2e-16. The negative sign tells me that as procedure count increases, SIR decreases, holding the other predictors constant. The p-value is highly significant, so I will keep this variable in my model.

  • Infections_Reported: The estimate is +0.263 with p < 2e-16. This tells me that as the number of infections reported increases, SIR also increases. The p-value is also highly significant, so I will keep this variable in my model.

  • ComparisonWorse: The estimate is +2.99 with p < 2e-16. This tells me that hospitals labeled as performing “worse” than the national baseline generally have a higher SIR, which is expected. The p-value is highly significant, so I will keep this variable in my model.

  • Procedure Type: Certain procedure types have significant p-values (less than 0.05), including cesarean section, colon surgery, and rectal surgery. This tells me that procedure type does matter, so I will keep it.

  • County: While some counties, such as Los Angeles, appear significant, most do not, and it is hard to tell if there is a true pattern. I will eliminate County in my next model.
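Because the Procedure_Count slope is per single procedure, it looks tiny. Rescaling it to a larger unit can make it easier to read. The short sketch below uses the estimate quoted above. The variable names are illustrative only, and the result describes an association with the other predictors held fixed, not a causal effect.

```r
# Estimate quoted in the interpretation of Model 1
b_procedure_count <- -0.001806   # change in SIR per additional procedure

# Rescale to the change in SIR per 100 additional procedures
per_100_procedures <- b_procedure_count * 100
per_100_procedures   # -0.1806
```

Under this model, a hospital-procedure row with 100 more procedures is associated with an SIR about 0.18 lower.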

Second Model with County eliminated

fit2 <- lm(SIR ~ Procedure_Count +
                 Infections_Reported +
                 Comparison +
                 Operative_Procedure,
           data = ca_ssi_clean)

summary(fit2)

Call:
lm(formula = SIR ~ Procedure_Count + Infections_Reported + Comparison + 
    Operative_Procedure, data = ca_ssi_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6271 -0.7058 -0.2874  0.3494  9.9088 

Coefficients:
                                                                Estimate
(Intercept)                                                    0.9081411
Procedure_Count                                               -0.0019110
Infections_Reported                                            0.2641723
ComparisonSame                                                 0.0292952
ComparisonWorse                                                3.1551068
Operative_ProcedureBile duct, liver or pancreatic surgery     -0.4180766
Operative_ProcedureCardiac surgery                            -0.1442883
Operative_ProcedureCesarean section                            0.9796658
Operative_ProcedureColon surgery                              -0.6921486
Operative_ProcedureCoronary bypass,chest and donor incisions  -0.0639782
Operative_ProcedureCoronary bypass,chest incision only        -0.5144165
Operative_ProcedureExploratory abdominal surgery (laparotomy)  0.0311379
Operative_ProcedureGallbladder surgery                         0.0904401
Operative_ProcedureGastric surgery                            -0.1897118
Operative_ProcedureHeart transplant                           -0.5927838
Operative_ProcedureHip prosthesis                             -0.0179482
Operative_ProcedureHysterectomy, abdominal                     0.1381472
Operative_ProcedureHysterectomy, vaginal                      -0.0056873
Operative_ProcedureKidney surgery                             -0.1429997
Operative_ProcedureKidney transplant                           0.2650027
Operative_ProcedureKnee prosthesis                             0.0715693
Operative_ProcedureLaminectomy                                 0.0481859
Operative_ProcedureLiver transplant                           -0.9629479
Operative_ProcedureOpen reduction of fracture                 -0.0596615
Operative_ProcedureOvarian surgery                             0.8116833
Operative_ProcedurePacemaker surgery                           0.0244745
Operative_ProcedureRectal surgery                             -0.6106984
Operative_ProcedureSmall bowel surgery                        -0.3674168
Operative_ProcedureSpinal fusion                              -0.1631111
Operative_ProcedureSpleen surgery                              0.1206924
Operative_ProcedureThoracic surgery                            0.0257079
                                                              Std. Error
(Intercept)                                                    0.2113993
Procedure_Count                                                0.0001489
Infections_Reported                                            0.0137448
ComparisonSame                                                 0.1842195
ComparisonWorse                                                0.2306090
Operative_ProcedureBile duct, liver or pancreatic surgery      0.1590230
Operative_ProcedureCardiac surgery                             0.1694328
Operative_ProcedureCesarean section                            0.1510767
Operative_ProcedureColon surgery                               0.1238136
Operative_ProcedureCoronary bypass,chest and donor incisions   0.1510592
Operative_ProcedureCoronary bypass,chest incision only         0.3358123
Operative_ProcedureExploratory abdominal surgery (laparotomy)  0.1241477
Operative_ProcedureGallbladder surgery                         0.1258646
Operative_ProcedureGastric surgery                             0.1345672
Operative_ProcedureHeart transplant                            0.4653280
Operative_ProcedureHip prosthesis                              0.1234652
Operative_ProcedureHysterectomy, abdominal                     0.1350294
Operative_ProcedureHysterectomy, vaginal                       0.4129260
Operative_ProcedureKidney surgery                              0.2359003
Operative_ProcedureKidney transplant                           0.3357949
Operative_ProcedureKnee prosthesis                             0.1302248
Operative_ProcedureLaminectomy                                 0.1409040
Operative_ProcedureLiver transplant                            0.3783442
Operative_ProcedureOpen reduction of fracture                  0.1248980
Operative_ProcedureOvarian surgery                             0.4657699
Operative_ProcedurePacemaker surgery                           0.2147959
Operative_ProcedureRectal surgery                              0.1424195
Operative_ProcedureSmall bowel surgery                         0.1247380
Operative_ProcedureSpinal fusion                               0.1367209
Operative_ProcedureSpleen surgery                              0.3358581
Operative_ProcedureThoracic surgery                            0.1596810
                                                              t value Pr(>|t|)
(Intercept)                                                     4.296 1.79e-05
Procedure_Count                                               -12.830  < 2e-16
Infections_Reported                                            19.220  < 2e-16
ComparisonSame                                                  0.159  0.87366
ComparisonWorse                                                13.682  < 2e-16
Operative_ProcedureBile duct, liver or pancreatic surgery      -2.629  0.00861
Operative_ProcedureCardiac surgery                             -0.852  0.39450
Operative_ProcedureCesarean section                             6.485 1.03e-10
Operative_ProcedureColon surgery                               -5.590 2.46e-08
Operative_ProcedureCoronary bypass,chest and donor incisions   -0.424  0.67194
Operative_ProcedureCoronary bypass,chest incision only         -1.532  0.12566
Operative_ProcedureExploratory abdominal surgery (laparotomy)   0.251  0.80197
Operative_ProcedureGallbladder surgery                          0.719  0.47247
Operative_ProcedureGastric surgery                             -1.410  0.15870
Operative_ProcedureHeart transplant                            -1.274  0.20279
Operative_ProcedureHip prosthesis                              -0.145  0.88443
Operative_ProcedureHysterectomy, abdominal                      1.023  0.30635
Operative_ProcedureHysterectomy, vaginal                       -0.014  0.98901
Operative_ProcedureKidney surgery                              -0.606  0.54444
Operative_ProcedureKidney transplant                            0.789  0.43007
Operative_ProcedureKnee prosthesis                              0.550  0.58265
Operative_ProcedureLaminectomy                                  0.342  0.73239
Operative_ProcedureLiver transplant                            -2.545  0.01097
Operative_ProcedureOpen reduction of fracture                  -0.478  0.63291
Operative_ProcedureOvarian surgery                              1.743  0.08149
Operative_ProcedurePacemaker surgery                            0.114  0.90929
Operative_ProcedureRectal surgery                              -4.288 1.86e-05
Operative_ProcedureSmall bowel surgery                         -2.946  0.00325
Operative_ProcedureSpinal fusion                               -1.193  0.23295
Operative_ProcedureSpleen surgery                               0.359  0.71935
Operative_ProcedureThoracic surgery                             0.161  0.87211
                                                                 
(Intercept)                                                   ***
Procedure_Count                                               ***
Infections_Reported                                           ***
ComparisonSame                                                   
ComparisonWorse                                               ***
Operative_ProcedureBile duct, liver or pancreatic surgery     ** 
Operative_ProcedureCardiac surgery                               
Operative_ProcedureCesarean section                           ***
Operative_ProcedureColon surgery                              ***
Operative_ProcedureCoronary bypass,chest and donor incisions     
Operative_ProcedureCoronary bypass,chest incision only           
Operative_ProcedureExploratory abdominal surgery (laparotomy)    
Operative_ProcedureGallbladder surgery                           
Operative_ProcedureGastric surgery                               
Operative_ProcedureHeart transplant                              
Operative_ProcedureHip prosthesis                                
Operative_ProcedureHysterectomy, abdominal                       
Operative_ProcedureHysterectomy, vaginal                         
Operative_ProcedureKidney surgery                                
Operative_ProcedureKidney transplant                             
Operative_ProcedureKnee prosthesis                               
Operative_ProcedureLaminectomy                                   
Operative_ProcedureLiver transplant                           *  
Operative_ProcedureOpen reduction of fracture                    
Operative_ProcedureOvarian surgery                            .  
Operative_ProcedurePacemaker surgery                             
Operative_ProcedureRectal surgery                             ***
Operative_ProcedureSmall bowel surgery                        ** 
Operative_ProcedureSpinal fusion                                 
Operative_ProcedureSpleen surgery                                
Operative_ProcedureThoracic surgery                              
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.206 on 3088 degrees of freedom
Multiple R-squared:  0.3766,    Adjusted R-squared:  0.3706 
F-statistic: 62.19 on 30 and 3088 DF,  p-value: < 2.2e-16

Interpretation of Model 2:

Adjusted R^2: 0.3706. This tells me that this model explains about 37% of the variation in SIR. This is slightly lower than the first model (0.383), which is likely due to the elimination of County, most of whose levels were not significant. The key variables from Model 1 (procedure count, infections reported, and the “Worse” comparison level) remain highly significant, with p-values below 2e-16.
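Individual county p-values do not show whether County matters as a block. One way to look at that is a partial F-test comparing the two nested models. The sketch below rebuilds the statistic from the rounded values printed in the two summaries, so the result is approximate. With the fitted objects available, `anova(fit2, fit1)` would give the same comparison directly, assuming the first model was stored as `fit1` (a name not shown in this section).

```r
# Values copied from the two summary() outputs
r2_full <- 0.3982; df_full <- 3041   # Model 1 (with County)
r2_red  <- 0.3766; df_red  <- 3088   # Model 2 (County removed)

# Number of County dummy variables dropped
q <- df_red - df_full                # 47

# Partial F-statistic for the dropped block of coefficients
f_stat <- ((r2_full - r2_red) / q) / ((1 - r2_full) / df_full)
p_val  <- pf(f_stat, q, df_full, lower.tail = FALSE)

round(f_stat, 2)   # 2.32
p_val
```

A small p-value here would suggest that County carries some joint signal even though most single levels are not significant. That is worth weighing against the simpler model and the small drop in adjusted R^2.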

ComparisonSame: This level has a very large p-value (0.87) and is not significant in my model, so in the next model I will collapse the Comparison variable (Better, Same, Worse) into two levels: “Worse” and “Not Worse”.
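The collapse can be sketched on a small toy vector to show what the recoded factor looks like. The values below are illustrative and not taken from the data set. The sketch uses base R only, whereas the project code applies the same `ifelse()` logic inside `mutate()`.

```r
# Toy stand-in for the Comparison column (illustrative values only)
comparison <- c("Better", "Same", "Worse", "Same", "Worse")

# Everything that is not "Worse" becomes "Not Worse"
collapsed <- factor(ifelse(comparison == "Worse", "Worse", "Not Worse"))

levels(collapsed)   # "Not Worse" "Worse"
table(collapsed)    # Not Worse: 3, Worse: 2
```

Factor levels are sorted alphabetically by default, so “Not Worse” becomes the reference level and the model reports a single coefficient for “Worse”.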

Third Model with ComparisonSame modified

ca_ssi_clean <- ca_ssi_clean |>
  mutate(Comparison_collapsed = ifelse(Comparison == "Worse", "Worse", "Not Worse"))

ca_ssi_clean$Comparison_collapsed <- as.factor(ca_ssi_clean$Comparison_collapsed)

fit3 <- lm(SIR ~ Procedure_Count +
                 Infections_Reported +
                 Comparison_collapsed +
                 Operative_Procedure,
           data = ca_ssi_clean)

summary(fit3)

Call:
lm(formula = SIR ~ Procedure_Count + Infections_Reported + Comparison_collapsed + 
    Operative_Procedure, data = ca_ssi_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.6300 -0.7061 -0.2879  0.3501  9.9079 

Coefficients:
                                                                Estimate
(Intercept)                                                    0.9381202
Procedure_Count                                               -0.0019163
Infections_Reported                                            0.2644800
Comparison_collapsedWorse                                      3.1250789
Operative_ProcedureBile duct, liver or pancreatic surgery     -0.4197276
Operative_ProcedureCardiac surgery                            -0.1440867
Operative_ProcedureCesarean section                            0.9819738
Operative_ProcedureColon surgery                              -0.6946037
Operative_ProcedureCoronary bypass,chest and donor incisions  -0.0645236
Operative_ProcedureCoronary bypass,chest incision only        -0.5148751
Operative_ProcedureExploratory abdominal surgery (laparotomy)  0.0308246
Operative_ProcedureGallbladder surgery                         0.0906537
Operative_ProcedureGastric surgery                            -0.1899138
Operative_ProcedureHeart transplant                           -0.5931557
Operative_ProcedureHip prosthesis                             -0.0184210
Operative_ProcedureHysterectomy, abdominal                     0.1376405
Operative_ProcedureHysterectomy, vaginal                      -0.0062333
Operative_ProcedureKidney surgery                             -0.1430067
Operative_ProcedureKidney transplant                           0.2649190
Operative_ProcedureKnee prosthesis                             0.0719209
Operative_ProcedureLaminectomy                                 0.0481600
Operative_ProcedureLiver transplant                           -0.9690436
Operative_ProcedureOpen reduction of fracture                 -0.0599137
Operative_ProcedureOvarian surgery                             0.8125066
Operative_ProcedurePacemaker surgery                           0.0248340
Operative_ProcedureRectal surgery                             -0.6118852
Operative_ProcedureSmall bowel surgery                        -0.3691040
Operative_ProcedureSpinal fusion                              -0.1638113
Operative_ProcedureSpleen surgery                              0.1201819
Operative_ProcedureThoracic surgery                            0.0258317
                                                              Std. Error
(Intercept)                                                    0.0956406
Procedure_Count                                                0.0001452
Infections_Reported                                            0.0136057
Comparison_collapsedWorse                                      0.1323605
Operative_ProcedureBile duct, liver or pancreatic surgery      0.1586587
Operative_ProcedureCardiac surgery                             0.1694013
Operative_ProcedureCesarean section                            0.1503542
Operative_ProcedureColon surgery                               0.1228279
Operative_ProcedureCoronary bypass,chest and donor incisions   0.1509964
Operative_ProcedureCoronary bypass,chest incision only         0.3357469
Operative_ProcedureExploratory abdominal surgery (laparotomy)  0.1241125
Operative_ProcedureGallbladder surgery                         0.1258375
Operative_ProcedureGastric surgery                             0.1345400
Operative_ProcedureHeart transplant                            0.4652487
Operative_ProcedureHip prosthesis                              0.1234099
Operative_ProcedureHysterectomy, abdominal                     0.1349705
Operative_ProcedureHysterectomy, vaginal                       0.4128465
Operative_ProcedureKidney surgery                              0.2358631
Operative_ProcedureKidney transplant                           0.3357415
Operative_ProcedureKnee prosthesis                             0.1301855
Operative_ProcedureLaminectomy                                 0.1408817
Operative_ProcedureLiver transplant                            0.3763380
Operative_ProcedureOpen reduction of fracture                  0.1248682
Operative_ProcedureOvarian surgery                             0.4656677
Operative_ProcedurePacemaker surgery                           0.2147502
Operative_ProcedureRectal surgery                              0.1422013
Operative_ProcedureSmall bowel surgery                         0.1242663
Operative_ProcedureSpinal fusion                               0.1366284
Operative_ProcedureSpleen surgery                              0.3357898
Operative_ProcedureThoracic surgery                            0.1596539
                                                              t value Pr(>|t|)
(Intercept)                                                     9.809  < 2e-16
Procedure_Count                                               -13.201  < 2e-16
Infections_Reported                                            19.439  < 2e-16
Comparison_collapsedWorse                                      23.610  < 2e-16
Operative_ProcedureBile duct, liver or pancreatic surgery      -2.645   0.0082
Operative_ProcedureCardiac surgery                             -0.851   0.3951
Operative_ProcedureCesarean section                             6.531 7.61e-11
Operative_ProcedureColon surgery                               -5.655 1.70e-08
Operative_ProcedureCoronary bypass,chest and donor incisions   -0.427   0.6692
Operative_ProcedureCoronary bypass,chest incision only         -1.534   0.1252
Operative_ProcedureExploratory abdominal surgery (laparotomy)   0.248   0.8039
Operative_ProcedureGallbladder surgery                          0.720   0.4713
Operative_ProcedureGastric surgery                             -1.412   0.1582
Operative_ProcedureHeart transplant                            -1.275   0.2024
Operative_ProcedureHip prosthesis                              -0.149   0.8814
Operative_ProcedureHysterectomy, abdominal                      1.020   0.3079
Operative_ProcedureHysterectomy, vaginal                       -0.015   0.9880
Operative_ProcedureKidney surgery                              -0.606   0.5444
Operative_ProcedureKidney transplant                            0.789   0.4301
Operative_ProcedureKnee prosthesis                              0.552   0.5807
Operative_ProcedureLaminectomy                                  0.342   0.7325
Operative_ProcedureLiver transplant                            -2.575   0.0101
Operative_ProcedureOpen reduction of fracture                  -0.480   0.6314
Operative_ProcedureOvarian surgery                              1.745   0.0811
Operative_ProcedurePacemaker surgery                            0.116   0.9079
Operative_ProcedureRectal surgery                              -4.303 1.74e-05
Operative_ProcedureSmall bowel surgery                         -2.970   0.0030
Operative_ProcedureSpinal fusion                               -1.199   0.2306
Operative_ProcedureSpleen surgery                               0.358   0.7204
Operative_ProcedureThoracic surgery                             0.162   0.8715
                                                                 
(Intercept)                                                   ***
Procedure_Count                                               ***
Infections_Reported                                           ***
Comparison_collapsedWorse                                     ***
Operative_ProcedureBile duct, liver or pancreatic surgery     ** 
Operative_ProcedureCardiac surgery                               
Operative_ProcedureCesarean section                           ***
Operative_ProcedureColon surgery                              ***
Operative_ProcedureCoronary bypass,chest and donor incisions     
Operative_ProcedureCoronary bypass,chest incision only           
Operative_ProcedureExploratory abdominal surgery (laparotomy)    
Operative_ProcedureGallbladder surgery                           
Operative_ProcedureGastric surgery                               
Operative_ProcedureHeart transplant                              
Operative_ProcedureHip prosthesis                                
Operative_ProcedureHysterectomy, abdominal                       
Operative_ProcedureHysterectomy, vaginal                         
Operative_ProcedureKidney surgery                                
Operative_ProcedureKidney transplant                             
Operative_ProcedureKnee prosthesis                               
Operative_ProcedureLaminectomy                                   
Operative_ProcedureLiver transplant                           *  
Operative_ProcedureOpen reduction of fracture                    
Operative_ProcedureOvarian surgery                            .  
Operative_ProcedurePacemaker surgery                             
Operative_ProcedureRectal surgery                             ***
Operative_ProcedureSmall bowel surgery                        ** 
Operative_ProcedureSpinal fusion                                 
Operative_ProcedureSpleen surgery                                
Operative_ProcedureThoracic surgery                              
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.205 on 3089 degrees of freedom
Multiple R-squared:  0.3766,    Adjusted R-squared:  0.3708 
F-statistic: 64.35 on 29 and 3089 DF,  p-value: < 2.2e-16

Interpretation of Model 3:

  • Adjusted R^2: 0.3708. This tells me that this model explains about 37% of the variation in SIR. This is about the same as the previous model.

  • Comparison_collapsedWorse: Estimate is +3.125 and p < 2e-16. This tells me that hospitals labeled as performing “Worse” than expected generally have a SIR about 3.1 units higher than those classified as “Not Worse” (which combines the original categories Better and Same), holding the other predictors constant. The p-value is significant at the 0.05 level.

  • Procedure_Count: Estimate is -0.0019163 and p < 2e-16. This tells me that as procedure count increases, SIR decreases (negative direction of the estimate). Higher procedure counts are associated with slightly lower SIRs. The p-value is significant at the 0.05 level.

  • Infections_Reported: Estimate is +0.264 and p < 2e-16. This tells me that as the number of infections reported increases, SIR also increases. The p-value is also significant at the 0.05 level.

  • Operative_Procedure: Several procedures have significant p-values relative to the reference procedure, such as C-section, colon surgery, rectal surgery, small bowel surgery, and liver transplant. However, the majority of the procedures are not significant. I will try removing this variable in the next model.

Fourth Model with Operative Procedure eliminated

fit4 <- lm(SIR ~ Procedure_Count +
                 Infections_Reported +
                 Comparison_collapsed,
           data = ca_ssi_clean)

summary(fit4)

Call:
lm(formula = SIR ~ Procedure_Count + Infections_Reported + Comparison_collapsed, 
    data = ca_ssi_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.1891 -0.6603 -0.5215  0.3433 10.0012 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                0.7220146  0.0299527  24.105  < 2e-16 ***
Procedure_Count           -0.0008569  0.0001105  -7.753 1.21e-14 ***
Infections_Reported        0.1956387  0.0122381  15.986  < 2e-16 ***
Comparison_collapsedWorse  3.4564286  0.1323665  26.113  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.234 on 3115 degrees of freedom
Multiple R-squared:  0.341, Adjusted R-squared:  0.3403 
F-statistic: 537.2 on 3 and 3115 DF,  p-value: < 2.2e-16
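
Because Model 4 is nested inside Model 3 (it drops only the 26 Operative_Procedure dummy variables), the two can also be compared with a partial F-test. The sketch below is a rough check that rebuilds the test from the rounded residual standard errors and degrees of freedom printed in the two summaries, so the result is approximate. With the fitted objects in memory, `anova(fit4, fit3)` would give the exact version (the name `fit3` for the third model is an assumption).

```r
# Values copied from the two summary() printouts (rounded as printed)
rse_full    <- 1.205; df_full    <- 3089  # Model 3, with Operative_Procedure
rse_reduced <- 1.234; df_reduced <- 3115  # Model 4, without Operative_Procedure

# Residual sum of squares = (residual standard error)^2 * residual df
rss_full    <- rse_full^2 * df_full
rss_reduced <- rse_reduced^2 * df_reduced

# Partial F-test for the block of procedure dummies
df_diff <- df_reduced - df_full  # number of coefficients dropped
f_stat  <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)
p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)

round(c(F = f_stat, df1 = df_diff, df2 = df_full), 2)
p_value
```

If the F statistic is large, procedure type carries some signal as a block even though most individual levels are not significant. That has to be weighed against the small gain in adjusted R^2 (0.3708 vs. 0.3403) and the 26 extra coefficients.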

Interpretation of Model 4:

This is my final model!

Adjusted R^2 is 0.3403, which means this model explains about 34% of the variation in SIR. While this is slightly lower than in the previous models, and the model does not explain all of the variation, all three predictors are significant:

Procedure Count (Estimate is -0.0008569): As procedure count increases, SIR decreases, which suggests that hospitals with higher procedure volume may have lower SIRs.

Infections Reported (Estimate is 0.1956387): Higher reported infections are associated with an increase in SIR, which makes sense because SIR compares observed infections to predicted infections.

Comparison_collapsedWorse (Estimate is 3.4564286): Hospitals that are categorized as “Worse” have SIR values that are generally about 3.46 units higher than those classified as “Not Worse” (Better or Same), holding the other predictors constant.
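
To make these estimates concrete, here is a small worked example that plugs the Model 4 coefficients into the regression equation by hand. The helper `predict_sir()` and the example inputs (100 procedures, 1 reported infection) are hypothetical and only for illustration. With the fitted model available, `predict(fit4, newdata = ...)` would do the same job.

```r
# Coefficients copied from the Model 4 summary output
b <- c(intercept           = 0.7220146,
       procedure_count     = -0.0008569,
       infections_reported = 0.1956387,
       worse               = 3.4564286)

# Hypothetical helper: predicted SIR for one hospital-procedure row
# worse = TRUE means the row is classified "Worse", FALSE means "Not Worse"
predict_sir <- function(procedure_count, infections_reported, worse) {
  unname(b["intercept"] +
           b["procedure_count"] * procedure_count +
           b["infections_reported"] * infections_reported +
           b["worse"] * as.numeric(worse))
}

predict_sir(100, 1, worse = FALSE)  # about 0.83
predict_sir(100, 1, worse = TRUE)   # about 4.29

# Change in predicted SIR for every 100 additional procedures
unname(b["procedure_count"] * 100)  # about -0.086
```

The last line shows why the volume effect is small in practice: it takes 100 extra procedures to lower the predicted SIR by less than 0.1.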

Diagnostic Plot:

I will run diagnostic plots to assess the regression assumptions, based on Professor Saidi’s in-class coding notes.

autoplot(fit4)
Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggfortify package.
  Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
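
The deprecation warnings above are raised inside the ggfortify package, not by the model itself. As a cross-check that does not depend on ggfortify, base R's `plot()` method for `lm` objects draws the usual four diagnostic panels (residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage). The sketch below uses a small simulated data frame (`demo`, with made-up values) in place of `ca_ssi_clean`, so the numbers it produces say nothing about the real data.

```r
# Simulated stand-in for ca_ssi_clean (hypothetical values, illustration only)
set.seed(110)
n <- 200
demo <- data.frame(
  Procedure_Count     = rpois(n, 120),
  Infections_Reported = rpois(n, 2)
)
demo$SIR <- 0.7 - 0.001 * demo$Procedure_Count +
  0.2 * demo$Infections_Reported + rnorm(n, sd = 0.5)

demo_fit <- lm(SIR ~ Procedure_Count + Infections_Reported, data = demo)

# The quantities behind the diagnostic panels, collected in one data frame
diag_df <- data.frame(
  fitted    = fitted(demo_fit),          # x axis of residuals vs. fitted
  std_resid = rstandard(demo_fit),       # standardized residuals (Q-Q, scale-location)
  leverage  = hatvalues(demo_fit),       # x axis of residuals vs. leverage
  cooks_d   = cooks.distance(demo_fit)   # influence of each observation
)
head(diag_df)

# Draw the four standard panels in a 2 x 2 grid
op <- par(mfrow = c(2, 2))
plot(demo_fit, which = c(1, 2, 3, 5))
par(op)
```

Replacing `demo_fit` with `fit4` would produce the same panels for the final model.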

Final Visualization:

Based on my multiple linear regression results, I will create a scatter plot showing procedure count (x axis) against SIR (y axis). I will use color to reflect the Comparison variable (Worse vs. Not Worse), and point size will reflect the number of reported infections.

options(scipen = 999)

ggplot(ca_ssi_clean,
       aes(x = Procedure_Count,
           y = SIR,
           color = Comparison_collapsed)) +
  # size is mapped only on the points so that geom_smooth() does not inherit it
  geom_point(aes(size = Infections_Reported), alpha = 0.4) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1) +
  # SIR = 1 means observed infections equal predicted infections
  geom_hline(yintercept = 1, linetype = "dashed", color = "black") +
  scale_color_brewer(palette = "Set2") +
  labs(
    title = "Relationship Between Procedure Volume and Infection Rate",
    subtitle = "Hospitals performing worse than expected tend to have higher SIR values",
    x = "Procedure Count",
    y = "Standardized Infection Ratio (SIR)",
    size = "Infections Reported",
    color = "Performance",
    caption = "Source: California Department of Public Health, Division of Healthcare Quality Promotion"
  ) +
  theme_minimal(base_size = 12)
`geom_smooth()` using formula = 'y ~ x'

Essay

Description of Scatterplot:

Overall, the scatterplot shows no strong relationship between procedure volume and SIR, as indicated by the nearly flat trend line among hospitals classified as “Not Worse” (green), meaning they performed better than or the same as expected. However, hospitals categorized as “Worse” do have higher SIR values, especially at lower procedure counts. Further, larger points, which represent higher numbers of reported infections, are more concentrated among worse-performing hospitals. This suggests that the burden of infection is more closely associated with performance category (Not Worse vs. Worse) than with procedure volume alone.

Data - Cleaning

First, I removed the first 29 rows in the data set, which contained statewide pooled data rather than hospital-level observations. Since my analysis focuses on individual hospitals, these aggregated rows were not necessary. Next, I removed rows where the operative procedure was labeled as “All procedures,” as these represent aggregated summaries for each hospital rather than specific procedure types. I also filtered out observations with missing values in key variables of interest, including County, SIR, Procedure_Count, Comparison, and Operative_Procedure, using drop_na(). This ensured that all analyses were based on complete cases. Finally, I converted categorical variables (County, Operative_Procedure, and Comparison) from character type to factors. This step is important for regression analysis where categorical variables are converted into dummy variables.
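
The cleaning steps described above can be sketched as a single dplyr/tidyr pipeline. This is a toy reconstruction, not the code used for the analysis: the data frame `raw` and its values are made up, and only the column names and the order of the steps follow the description.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the raw CDPH file (hypothetical values):
# 29 pooled statewide rows first, then hospital-level rows
raw <- tibble(
  County              = c(rep(NA, 29), "Alameda", "Alameda", "Kern", "Kern"),
  Operative_Procedure = c(rep("Colon surgery", 29),
                          "All procedures", "Colon surgery",
                          "Hip prosthesis", "Knee prosthesis"),
  Procedure_Count     = c(rep(1000, 29), 150, 60, 40, 25),
  SIR                 = c(rep(1, 29), 0.9, 1.2, NA, 0.5),
  Comparison          = c(rep("Same", 29), "Same", "Worse", NA, "Better")
)

toy_clean <- raw |>
  slice(-(1:29)) |>                                  # drop the statewide pooled rows
  filter(Operative_Procedure != "All procedures") |> # drop per-hospital summaries
  drop_na(County, SIR, Procedure_Count,
          Comparison, Operative_Procedure) |>        # keep complete cases only
  mutate(across(c(County, Operative_Procedure, Comparison),
                as.factor))                          # character -> factor

toy_clean
```

Removing rows by position with `slice()` only works if the statewide rows really are the first 29 rows of the file; filtering on a column that identifies them would be more robust if such a column exists.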

Challenges/Limitations

One visualization I initially planned to include was an alluvial plot to show how performance categories (Better, Same, Worse) varied across procedure types. However, I chose not to include it in the final analysis because of time constraints and the complexity of the plot. Also, some of my exploratory plots, such as the histogram of procedure volume, were difficult to interpret because the data are heavily right-skewed.