Final Project Quarto

Author

Telesphore E. L. Kabore

** PART I. INTRODUCTION**

In the history of healthcare quality, nosocomial infections remain a significant concern in hospitals and tertiary care institutions. A nosocomial infection, also known as a healthcare-associated infection (HAI), is an infection that patients acquire while receiving treatment for other medical conditions. These infections occur for various reasons and can arise at different stages of the care process. A well-known example is surgical site infection (SSI), in which a patient undergoing surgery may develop an infection before, during, or after the procedure. Given the potential harm to patients, it is critical for healthcare workers and systems to prevent such infections, ensuring that patients do not leave with additional illnesses unrelated to their original condition. Due to their impact on patient safety, nosocomial infections are a key focus of quality improvement initiatives. Understanding their causes and predictors is essential for developing effective prevention strategies.

Historical Context and Background

Research on surgical site infections (SSIs) has evolved significantly since its early documentation. As noted by Sergio et al. (2016), the first reliable surgical mortality statistics were published in 1841 by the French surgeon Malgaigne (1806–1863), who reported a 60% mortality rate for amputations, primarily attributed to hospital-acquired infections. In modern surgical practice, SSIs remain a critical concern, particularly in colorectal surgery. Turan et al. (2015) observed that SSIs occur in 11–28% of colorectal surgery patients, imposing substantial clinical and economic burdens. However, despite anecdotal suggestions that seasonal variations (e.g., winter infections linked to lower vitamin D levels) may influence SSI rates, robust evidence supporting this association remains lacking. Turan et al. (2015) explicitly concluded that “season and vitamin D status do not affect the probability of SSI after colorectal surgery.”

Study Design and Data Source

This project analyzes data from Turan et al. (2015), a retrospective observational cohort study of 2,919 adults undergoing colorectal surgery. Key features of the dataset include: Seasonal stratification: Surgeries were categorized by date (spring: March 21–June 20; summer: June 21–September 22; fall: September 23–December 20; winter: December 21–March 20). Vitamin D measurement: Secondary analysis included patients with documented serum 25(OH)D levels (available for ~5% of the cohort). Variables: Age, gender, race, BMI, surgical risk factors, procedural indices, season (key predictor), and SSI occurrence (binary outcome).

While this design enables hypothesis generation, it is susceptible to biases (e.g., selection bias due to exclusion of pediatric patients, information bias from retrospective data collection).

Research Question and Analytical Approach

The primary objective of this project or overarching question to answer is: What is the model or suit of variables that best explain SSI occurrence? Given the binary outcome (SSI: Yes/No), a three-step logistic regression analysis will be conducted: 1. Descriptive statistics: Summarize cohort characteristics (e.g., frequencies, mean ± SD/median [IQR]). 2. Bivariate analysis: Screen individual variables for SSI associations (chi-square/t-tests; unadjusted ORs). 3. Multivariable analysis: Adjust for confounders to isolate independent predictors (adjusted ORs, 95% CIs).

** PART II. DATA ANALYSIS**

Loading Libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
✔ broom        1.0.8     ✔ rsample      1.3.0
✔ dials        1.4.0     ✔ tune         1.3.0
✔ infer        1.0.8     ✔ workflows    1.2.0
✔ modeldata    1.4.0     ✔ workflowsets 1.1.1
✔ parsnip      1.3.2     ✔ yardstick    1.3.2
✔ recipes      1.3.1     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
library(infer)
library(skimr)

Load data set

setwd("~/Telesphore/Personnel/Etudes/Montgomery_College/Data_Sciences_Certificate_program/Math_217/Final_Project")
SeasonalEffect <- read_csv("SeasonalEffect.csv")
Rows: 2919 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (14): Age, Female, Race, BMI, ASAstatus, Diabetes, ChronicRenalFailure, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data cleaning

# quick look at the data
glimpse(SeasonalEffect)
Rows: 2,919
Columns: 14
$ Age                 <dbl> 44.0, 28.1, 39.7, 26.6, 69.0, 40.2, 69.0, 27.8, 57…
$ Female              <dbl> 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
$ Race                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ BMI                 <dbl> 24.3, 20.3, 21.6, 22.3, 16.0, 23.3, 19.9, 24.9, 30…
$ ASAstatus           <dbl> 2, 2, 2, 2, 3, 2, 3, 2, 2, 2, 2, 3, 2, 3, 2, 2, 3,…
$ Diabetes            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ ChronicRenalFailure <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ PreopSteroids       <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
$ Emergency           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
$ DurationSurgery     <dbl> 5.38, 2.78, 1.92, 3.45, 6.27, 4.33, 3.68, 3.52, 7.…
$ VitaminD            <dbl> NA, NA, NA, 37, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Season              <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
$ RBC                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 963, 0, 0, 0, 550, 0…
$ SSI                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
# Looking at understanding variables definitions
names(SeasonalEffect)
 [1] "Age"                 "Female"              "Race"               
 [4] "BMI"                 "ASAstatus"           "Diabetes"           
 [7] "ChronicRenalFailure" "PreopSteroids"       "Emergency"          
[10] "DurationSurgery"     "VitaminD"            "Season"             
[13] "RBC"                 "SSI"                
# Make all headers lowercase and remove spaces
names(SeasonalEffect) <- tolower(names(SeasonalEffect))
names(SeasonalEffect) <- gsub(" ","",names(SeasonalEffect))
head(SeasonalEffect)
# A tibble: 6 × 14
    age female  race   bmi asastatus diabetes chronicrenalfailure preopsteroids
  <dbl>  <dbl> <dbl> <dbl>     <dbl>    <dbl>               <dbl>         <dbl>
1  44        0     1  24.3         2        0                   0             0
2  28.1      0     1  20.3         2        0                   0             0
3  39.7      1     1  21.6         2        0                   0             1
4  26.6      1     1  22.3         2        0                   0             0
5  69        1     1  16           3        0                   0             0
6  40.2      1     1  23.3         2        0                   0             0
# ℹ 6 more variables: emergency <dbl>, durationsurgery <dbl>, vitamind <dbl>,
#   season <dbl>, rbc <dbl>, ssi <dbl>
SeasonalEffect1 <- SeasonalEffect %>% 
filter(!is.na("VitaminD"))
head(SeasonalEffect1)
# A tibble: 6 × 14
    age female  race   bmi asastatus diabetes chronicrenalfailure preopsteroids
  <dbl>  <dbl> <dbl> <dbl>     <dbl>    <dbl>               <dbl>         <dbl>
1  44        0     1  24.3         2        0                   0             0
2  28.1      0     1  20.3         2        0                   0             0
3  39.7      1     1  21.6         2        0                   0             1
4  26.6      1     1  22.3         2        0                   0             0
5  69        1     1  16           3        0                   0             0
6  40.2      1     1  23.3         2        0                   0             0
# ℹ 6 more variables: emergency <dbl>, durationsurgery <dbl>, vitamind <dbl>,
#   season <dbl>, rbc <dbl>, ssi <dbl>

** Descriptive statistics**

skim(SeasonalEffect1)
Data summary
Name SeasonalEffect1
Number of rows 2919
Number of columns 14
_______________________
Column type frequency:
numeric 14
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1.00 52.57 16.81 18.10 40.20 53.2 64.55 100.0 ▅▆▇▅▁
female 0 1.00 0.51 0.50 0.00 0.00 1.0 1.00 1.0 ▇▁▁▁▇
race 0 1.00 1.12 0.40 1.00 1.00 1.0 1.00 3.0 ▇▁▁▁▁
bmi 0 1.00 27.00 6.47 12.90 22.40 26.0 30.40 76.7 ▇▇▁▁▁
asastatus 0 1.00 2.48 0.62 1.00 2.00 2.0 3.00 4.0 ▁▇▁▇▁
diabetes 0 1.00 0.20 0.40 0.00 0.00 0.0 0.00 1.0 ▇▁▁▁▂
chronicrenalfailure 0 1.00 0.05 0.21 0.00 0.00 0.0 0.00 1.0 ▇▁▁▁▁
preopsteroids 0 1.00 0.09 0.28 0.00 0.00 0.0 0.00 1.0 ▇▁▁▁▁
emergency 0 1.00 0.04 0.21 0.00 0.00 0.0 0.00 1.0 ▇▁▁▁▁
durationsurgery 0 1.00 3.58 1.95 0.02 2.22 3.4 4.70 14.8 ▇▇▂▁▁
vitamind 2857 0.02 26.52 15.00 1.12 18.00 24.5 33.25 88.0 ▅▇▃▁▁
season 0 1.00 2.47 1.22 1.00 1.00 2.0 4.00 4.0 ▇▆▁▃▇
rbc 0 1.00 80.43 348.32 0.00 0.00 0.0 0.00 6740.0 ▇▁▁▁▁
ssi 0 1.00 0.08 0.28 0.00 0.00 0.0 0.00 1.0 ▇▁▁▁▁

Visualizing data

# Visualizing SSI distribution by seasons
ggplot(SeasonalEffect1, aes(x = season, y = ssi)) + 
  geom_jitter(width = .09, height = 0.09, alpha = 0.2)

** Bivariate Analysis**

# List candidate predictors
predictors <- c("age", "female", "race", "bmi", "asastatus","diabetes", "chronicrenalfailure", "preopsteroid", "emergency", "durationsurgery", "vitamind", "season", "rbc")
list(predictors)
[[1]]
 [1] "age"                 "female"              "race"               
 [4] "bmi"                 "asastatus"           "diabetes"           
 [7] "chronicrenalfailure" "preopsteroid"        "emergency"          
[10] "durationsurgery"     "vitamind"            "season"             
[13] "rbc"                
fit1 <- glm(ssi ~ age, family = binomial(), data = SeasonalEffect1)
summary(fit1)

Call:
glm(formula = ssi ~ age, family = binomial(), data = SeasonalEffect1)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.666420   0.225519 -11.823   <2e-16 ***
age          0.004862   0.004007   1.213    0.225    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1663.7  on 2918  degrees of freedom
Residual deviance: 1662.3  on 2917  degrees of freedom
AIC: 1666.3

Number of Fisher Scoring iterations: 5
# Based on this output, age does not seem to be a meaningful predictor of SSI (p=0.225)
fit1 <- glm(ssi ~ female, family = binomial(), data = SeasonalEffect1)
summary(fit1)

Call:
glm(formula = ssi ~ female, family = binomial(), data = SeasonalEffect1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.39713    0.09575 -25.036   <2e-16 ***
female      -0.02141    0.13452  -0.159    0.874    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1663.7  on 2918  degrees of freedom
Residual deviance: 1663.7  on 2917  degrees of freedom
AIC: 1667.7

Number of Fisher Scoring iterations: 5
# Based on this output, gender does not seem to be a meaningful predictor of SSI (p=0.874)
fit1 <- glm(ssi ~ race, family = binomial(), data = SeasonalEffect1)
summary(fit1)

Call:
glm(formula = ssi ~ race, family = binomial(), data = SeasonalEffect1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.47878    0.19646 -12.617   <2e-16 ***
race         0.06298    0.16368   0.385      0.7    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1663.7  on 2918  degrees of freedom
Residual deviance: 1663.6  on 2917  degrees of freedom
AIC: 1667.6

Number of Fisher Scoring iterations: 5
# Based on this output, race does not seem to be a meaningful predictor of SSI (p= 0.7)
fit1 <- glm(ssi ~ bmi, family = binomial(), data = SeasonalEffect1)
summary(fit1)

Call:
glm(formula = ssi ~ bmi, family = binomial(), data = SeasonalEffect1)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.007797   0.278937 -10.783   <2e-16 ***
bmi          0.021893   0.009737   2.249   0.0245 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1663.7  on 2918  degrees of freedom
Residual deviance: 1658.9  on 2917  degrees of freedom
AIC: 1662.9

Number of Fisher Scoring iterations: 5
# Based on this output, BMI seems to be a meaningful predictor of SSI (p= 0.0245)
fit1 <- glm(ssi ~ asastatus, family = binomial(), data = SeasonalEffect1)
summary(fit1)

Call:
glm(formula = ssi ~ asastatus, family = binomial(), data = SeasonalEffect1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.9332     0.2828 -10.372   <2e-16 ***
asastatus     0.2088     0.1078   1.938   0.0527 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1663.7  on 2918  degrees of freedom
Residual deviance: 1660.0  on 2917  degrees of freedom
AIC: 1664

Number of Fisher Scoring iterations: 5
# Based on this output, asastatus seems to be a very moderate predictor of SSI (= 0.0527)
fit1 <- glm(ssi ~ diabetes, family = binomial(), data = SeasonalEffect1)
summary(fit1)

Call:
glm(formula = ssi ~ diabetes, family = binomial(), data = SeasonalEffect1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.57884    0.08075 -31.937  < 2e-16 ***
diabetes     0.68898    0.14719   4.681 2.86e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1663.7  on 2918  degrees of freedom
Residual deviance: 1643.4  on 2917  degrees of freedom
AIC: 1647.4

Number of Fisher Scoring iterations: 5
# Based on this output, people with diabetes have almost 2 times higher odds of developing an SSI compared to those without diabetes (p= 0.001)
fit1 <- glm(ssi ~ chronicrenalfailure, family = binomial(), data = SeasonalEffect1)
summary(fit1)

Call:
glm(formula = ssi ~ chronicrenalfailure, family = binomial(), 
    data = SeasonalEffect1)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -2.44137    0.06982  -34.97   <2e-16 ***
chronicrenalfailure  0.57815    0.26278    2.20   0.0278 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1663.7  on 2918  degrees of freedom
Residual deviance: 1659.5  on 2917  degrees of freedom
AIC: 1663.5

Number of Fisher Scoring iterations: 5
# Based on this output, patients with chronic renal failure have 78% higher odds of developing an SSI compared to those without it (p= 0.0278)
fit1<- glm(ssi ~ emergency, family = binomial(), data = SeasonalEffect1)
summary(fit1)

Call:
glm(formula = ssi ~ emergency, family = binomial(), data = SeasonalEffect1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.43846    0.06967 -35.000   <2e-16 ***
emergency    0.55317    0.26946   2.053   0.0401 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1663.7  on 2918  degrees of freedom
Residual deviance: 1660.0  on 2917  degrees of freedom
AIC: 1664

Number of Fisher Scoring iterations: 5
fit1 <- glm(ssi ~ vitamind, family = binomial(), data = SeasonalEffect1)
summary(fit1)

Call:
glm(formula = ssi ~ vitamind, family = binomial(), data = SeasonalEffect1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.65340    1.06132  -3.442 0.000577 ***
vitamind     0.03253    0.02720   1.196 0.231642    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 29.663  on 61  degrees of freedom
Residual deviance: 28.398  on 60  degrees of freedom
  (2857 observations deleted due to missingness)
AIC: 32.398

Number of Fisher Scoring iterations: 5
# Based on this output, vitamin D is not a predictor of SSI (p= 0.231)
fit1 <- glm(ssi ~ durationsurgery, family = binomial(), data = SeasonalEffect1)
summary(fit1)

Call:
glm(formula = ssi ~ durationsurgery, family = binomial(), data = SeasonalEffect1)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -3.23965    0.14893 -21.753  < 2e-16 ***
durationsurgery  0.21148    0.03091   6.842 7.78e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1663.7  on 2918  degrees of freedom
Residual deviance: 1619.4  on 2917  degrees of freedom
AIC: 1623.4

Number of Fisher Scoring iterations: 5

Based on this output, longer surgical procedures is significantly associated with higher odds of developing SSI (p< 0.001)

fit1 <- glm(ssi ~ season, family = binomial(), data = SeasonalEffect1)
summary(fit1)

Call:
glm(formula = ssi ~ season, family = binomial(), data = SeasonalEffect1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.66271    0.15817 -16.834   <2e-16 ***
season       0.10047    0.05511   1.823   0.0683 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1663.7  on 2918  degrees of freedom
Residual deviance: 1660.4  on 2917  degrees of freedom
AIC: 1664.4

Number of Fisher Scoring iterations: 5

Based on this output, season is not a good predictor of SSI (p= 0.0683)

fit1 <- glm(ssi ~ rbc, family = binomial(), data = SeasonalEffect1)
summary(fit1)

Call:
glm(formula = ssi ~ rbc, family = binomial(), data = SeasonalEffect1)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.4642910  0.0699475 -35.231  < 2e-16 ***
rbc          0.0005087  0.0001320   3.854 0.000116 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1663.7  on 2918  degrees of freedom
Residual deviance: 1650.7  on 2917  degrees of freedom
AIC: 1654.7

Number of Fisher Scoring iterations: 5

Based on this output, rbc explains a small but statistically meaningful prediction of SSI. A one unit increase in rbc is associated with 0.05% increase in the odds of SSI (p= 0.000116)

** Multivariate Analysis: Adjusted Logistic Regression**

** PART III. CONCLUSION**