The data set for my Data 110 Project 1 is David Ngendahimana, “Season Effect Dataset”, TSHS Resources Portal (2016), available at https://www.causeweb.org/tshs/season-effect/. It contains 2,919 individual records (observations) of adults undergoing colorectal surgery and 14 variables: age, gender, race, BMI, several risk factors, several surgical indices, vitamin D levels (recorded for only 62 of the patients, about 2%, as the summary below shows), the key predictor (season), and the outcome (surgical site infection or not). In my project, I plan to compare the rate of surgical wound infection among patients who had colorectal surgery in the winter with the rate among patients who had surgery in the summer.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("~/Telesphore/Personnel/Etudes/Montgomery_College/Data_Sciences_Certificate_program/Data_110/Week3")
SeasonalEffect <- read_csv("SeasonalEffect.csv")
## Rows: 2919 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): Age, Female, Race, BMI, ASAstatus, Diabetes, ChronicRenalFailure, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Make all headers lowercase and remove spaces
names(SeasonalEffect) <- tolower(names(SeasonalEffect))
names(SeasonalEffect) <- gsub(" ","",names(SeasonalEffect))
head(SeasonalEffect)
## # A tibble: 6 × 14
## age female race bmi asastatus diabetes chronicrenalfailure preopsteroids
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 44 0 1 24.3 2 0 0 0
## 2 28.1 0 1 20.3 2 0 0 0
## 3 39.7 1 1 21.6 2 0 0 1
## 4 26.6 1 1 22.3 2 0 0 0
## 5 69 1 1 16 3 0 0 0
## 6 40.2 1 1 23.3 2 0 0 0
## # ℹ 6 more variables: emergency <dbl>, durationsurgery <dbl>, vitamind <dbl>,
## # season <dbl>, rbc <dbl>, ssi <dbl>
summary(SeasonalEffect)
## age female race bmi
## Min. : 18.10 Min. :0.0000 Min. :1.000 Min. :12.9
## 1st Qu.: 40.20 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:22.4
## Median : 53.20 Median :1.0000 Median :1.000 Median :26.0
## Mean : 52.57 Mean :0.5111 Mean :1.119 Mean :27.0
## 3rd Qu.: 64.55 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:30.4
## Max. :100.00 Max. :1.0000 Max. :3.000 Max. :76.7
##
## asastatus diabetes chronicrenalfailure preopsteroids
## Min. :1.000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :2.000 Median :0.0000 Median :0.00000 Median :0.0000
## Mean :2.482 Mean :0.1984 Mean :0.04591 Mean :0.0877
## 3rd Qu.:3.000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :4.000 Max. :1.0000 Max. :1.00000 Max. :1.0000
##
## emergency durationsurgery vitamind season
## Min. :0.00000 Min. : 0.020 Min. : 1.12 Min. :1.000
## 1st Qu.:0.00000 1st Qu.: 2.220 1st Qu.:18.00 1st Qu.:1.000
## Median :0.00000 Median : 3.400 Median :24.50 Median :2.000
## Mean :0.04419 Mean : 3.581 Mean :26.52 Mean :2.472
## 3rd Qu.:0.00000 3rd Qu.: 4.700 3rd Qu.:33.25 3rd Qu.:4.000
## Max. :1.00000 Max. :14.800 Max. :88.00 Max. :4.000
## NA's :2857
## rbc ssi
## Min. : 0.00 Min. :0.00000
## 1st Qu.: 0.00 1st Qu.:0.00000
## Median : 0.00 Median :0.00000
## Mean : 80.43 Mean :0.08256
## 3rd Qu.: 0.00 3rd Qu.:0.00000
## Max. :6740.00 Max. :1.00000
##
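The NA's count in the summary above indicates that vitamind is missing for most patients (2,857 of 2,919 rows). The following is a minimal sketch of how the share of recorded values could be checked. The data frame `toy` and its values are hypothetical stand-ins for SeasonalEffect, not the real data.

```r
# Share of patients with a recorded vitamin D value, using the counts
# reported by summary(): 2,919 rows in total, 2,857 of them NA.
n_total <- 2919
n_missing <- 2857
share_recorded <- (n_total - n_missing) / n_total
print(round(share_recorded, 3))

# The same check on a data frame. `toy` is a hypothetical stand-in
# for SeasonalEffect; its values are invented for illustration.
toy <- data.frame(
  age      = c(44, 28.1, 39.7, 26.6),
  vitamind = c(NA, 24.5, NA, NA)
)

# is.na() returns a logical matrix, so colMeans() gives the share of
# non-missing values in each column.
share_non_missing <- colMeans(!is.na(toy))
print(share_non_missing)
```

Running the same `colMeans(!is.na(...))` call on the full data set would show at a glance which variables are too sparse to use in a comparison across seasons.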
I decided to look at surgical site infections (SSI) by season, in combination with patient health status (ASA physical status). This will help me decide whether to explore further if surgical site infection rates differ between winter and summer.
# Keep only the variables needed for the plots
SeasonalEffect2 <- SeasonalEffect |>
  select(age, female, ssi, season, asastatus)
head(SeasonalEffect2)
## # A tibble: 6 × 5
## age female ssi season asastatus
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 44 0 0 3 2
## 2 28.1 0 0 3 2
## 3 39.7 1 0 3 2
## 4 26.6 1 0 3 2
## 5 69 1 0 3 3
## 6 40.2 1 0 3 2
dim(SeasonalEffect2)
## [1] 2919 5
summary(SeasonalEffect2)
## age female ssi season
## Min. : 18.10 Min. :0.0000 Min. :0.00000 Min. :1.000
## 1st Qu.: 40.20 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median : 53.20 Median :1.0000 Median :0.00000 Median :2.000
## Mean : 52.57 Mean :0.5111 Mean :0.08256 Mean :2.472
## 3rd Qu.: 64.55 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:4.000
## Max. :100.00 Max. :1.0000 Max. :1.00000 Max. :4.000
## asastatus
## Min. :1.000
## 1st Qu.:2.000
## Median :2.000
## Mean :2.482
## 3rd Qu.:3.000
## Max. :4.000
Loading ggplot2
library(ggplot2)
# Treat ASA status and season as categorical variables
SeasonalEffect2$asastatus <- as.factor(SeasonalEffect2$asastatus)
SeasonalEffect2$season <- as.factor(SeasonalEffect2$season)
# geom_bar() counts rows, so each bar is the number of patients
# in a season and ASA class (not only those with an infection)
ggplot(SeasonalEffect2, aes(x = season, fill = asastatus)) +
  geom_bar(position = "dodge") +
  labs(title = "Patient Counts by Season and ASA Status",
       x = "Season",
       y = "Count",
       fill = "ASA Status",
       caption = "Source: David Ngendahimana, “Season Effect Dataset”, TSHS Resources Portal (2016)") +
  theme_minimal()
# asastatus and season were already converted to factors above
# Custom colors, one per ASA class
my_colors <- c("1" = "#1f77b4",  # Blue for ASA 1
               "2" = "#ff7f0e",  # Orange for ASA 2
               "3" = "#2ca02c",  # Green for ASA 3
               "4" = "#d62729")  # Red for ASA 4
# Same bar plot as above, now with the custom colors
ggplot(SeasonalEffect2, aes(x = season, fill = asastatus)) +
  geom_bar(position = "dodge") +
  labs(title = "Patient Counts by Season and ASA Status",
       x = "Season",
       y = "Count",
       caption = "Source: David Ngendahimana, “Season Effect Dataset”, TSHS Resources Portal (2016)") +
  scale_fill_manual(name = "ASA Status",
                    values = my_colors,
                    drop = FALSE) +
  theme_minimal()
## I used DeepSeek to help find out the code for the visualization above.
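Because geom_bar() counts every row it receives, a plot of infections only would need the data restricted to infected patients first. The following is a minimal sketch of that step, assuming ssi is coded 1 for an infection and 0 otherwise (as the 0 to 1 range in the summary suggests). The data frame `toy` and the column name `infections` are hypothetical and only for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for SeasonalEffect2.
# All values are invented for illustration.
toy <- data.frame(
  season    = factor(c(1, 1, 2, 2, 2, 3, 4, 4)),
  asastatus = factor(c(2, 3, 2, 2, 3, 1, 3, 3)),
  ssi       = c(1, 0, 1, 1, 0, 0, 1, 1)
)

# Keep infected patients only, then count them per season and ASA class
ssi_counts <- toy |>
  filter(ssi == 1) |>
  count(season, asastatus, name = "infections")

print(ssi_counts)
```

A table like this could be passed to ggplot() with geom_col(aes(y = infections)) so that the bar heights are infection counts rather than patient counts.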
After loading tidyverse and my data set, I ran the code below to clean the data:
1. names(SeasonalEffect) <- tolower(names(SeasonalEffect))
2. names(SeasonalEffect) <- gsub(" ", "", names(SeasonalEffect))
The first line converts all variable names to lower case, and the second removes any spaces inside a variable name. I then called head(SeasonalEffect) to check the result.
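As a small illustration of what these two cleaning steps do, here is a sketch on a few made-up column names. The vector `raw_names` is hypothetical and is not taken from the data set.

```r
# Hypothetical column names, for illustration only
raw_names <- c("Age", "Duration Surgery", "Chronic Renal Failure")

# Step 1: lower-case everything. Step 2: remove every space.
clean_names <- gsub(" ", "", tolower(raw_names))

print(clean_names)
```

The order of the two steps does not matter here, since changing case never adds or removes spaces.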
I chose to visualize surgical site infection (SSI) data by season (Fall, Winter, Spring and Summer) and by ASA physical status. Two observations stand out from the visualization. First, the counts vary by season: summer has the highest count, followed by fall, and spring has the lowest. Second, the counts vary with patient health status: they are higher for patients with mild systemic disease and highest for patients with severe systemic disease, and comparatively lower for normal healthy patients and for patients with severe systemic disease that is a constant threat to life. One caveat is that geom_bar() counts every row in the data, so the bars show the number of patients who had surgery in each season and ASA class, not only those who developed an infection. Based on the visualization, the next step would be to find out whether SSI rates are related to season, and to clarify the relationship between SSI and patient health status as defined by the American Society of Anesthesiologists (ASA) physical status classification.
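One possible way to follow up on the relationship between SSI and season is to compute the infection rate per season and run a chi-squared test of independence. The following is a sketch under stated assumptions, not a finished analysis: the data frame `toy` and the infection counts in it are invented, ssi is assumed to be coded 0/1, and which season each code 1 to 4 represents is not confirmed here.

```r
library(dplyr)

# Hypothetical toy data: 100 patients per season with invented
# infection counts. These numbers are NOT from the real data set.
infections_per_season <- c(5, 12, 8, 7)
toy <- data.frame(
  season = factor(rep(1:4, each = 100)),
  ssi = unlist(lapply(infections_per_season,
                      function(k) c(rep(1, k), rep(0, 100 - k))))
)

# Infection rate per season: the mean of a 0/1 variable is the
# proportion of 1s.
ssi_by_season <- toy |>
  group_by(season) |>
  summarise(
    patients   = n(),
    infections = sum(ssi),
    ssi_rate   = mean(ssi)
  )
print(ssi_by_season)

# Chi-squared test of independence between season and infection status
test_result <- chisq.test(table(toy$season, toy$ssi))
print(test_result$p.value)
```

Comparing rates rather than raw counts matters because the number of surgeries can differ from season to season. To compare winter with summer only, the data could first be filtered to those two season codes before building the table.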