FIRST BRIEF ESSAY

My Data 110 Project 1 data set is David Ngendahimana, “Season Effect Dataset”, TSHS Resources Portal (2016), available at https://www.causeweb.org/tshs/season-effect/. It contains 2,919 individual records (observations) of adults undergoing colorectal surgery and 14 variables: age, gender, race, BMI, several risk factors, several surgical indices, vitamin D levels (recorded for only 62 patients, about 2%), the key predictor (season), and the outcome (surgical site infection or not). In my project, I plan to compare the rate of surgical wound infection among patients who had colorectal surgery in the winter with the rate among those who had surgery in the summer.

Loading tidyverse

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load data set

setwd("~/Telesphore/Personnel/Etudes/Montgomery_College/Data_Sciences_Certificate_program/Data_110/Week3")
SeasonalEffect <- read_csv("SeasonalEffect.csv")
## Rows: 2919 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (14): Age, Female, Race, BMI, ASAstatus, Diabetes, ChronicRenalFailure, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean up the data:

Make all headers lowercase and remove spaces

names(SeasonalEffect) <- tolower(names(SeasonalEffect))
names(SeasonalEffect) <- gsub(" ","",names(SeasonalEffect))
head(SeasonalEffect)
## # A tibble: 6 × 14
##     age female  race   bmi asastatus diabetes chronicrenalfailure preopsteroids
##   <dbl>  <dbl> <dbl> <dbl>     <dbl>    <dbl>               <dbl>         <dbl>
## 1  44        0     1  24.3         2        0                   0             0
## 2  28.1      0     1  20.3         2        0                   0             0
## 3  39.7      1     1  21.6         2        0                   0             1
## 4  26.6      1     1  22.3         2        0                   0             0
## 5  69        1     1  16           3        0                   0             0
## 6  40.2      1     1  23.3         2        0                   0             0
## # ℹ 6 more variables: emergency <dbl>, durationsurgery <dbl>, vitamind <dbl>,
## #   season <dbl>, rbc <dbl>, ssi <dbl>
summary(SeasonalEffect)
##       age             female            race            bmi      
##  Min.   : 18.10   Min.   :0.0000   Min.   :1.000   Min.   :12.9  
##  1st Qu.: 40.20   1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:22.4  
##  Median : 53.20   Median :1.0000   Median :1.000   Median :26.0  
##  Mean   : 52.57   Mean   :0.5111   Mean   :1.119   Mean   :27.0  
##  3rd Qu.: 64.55   3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:30.4  
##  Max.   :100.00   Max.   :1.0000   Max.   :3.000   Max.   :76.7  
##                                                                  
##    asastatus        diabetes      chronicrenalfailure preopsteroids   
##  Min.   :1.000   Min.   :0.0000   Min.   :0.00000     Min.   :0.0000  
##  1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:0.00000     1st Qu.:0.0000  
##  Median :2.000   Median :0.0000   Median :0.00000     Median :0.0000  
##  Mean   :2.482   Mean   :0.1984   Mean   :0.04591     Mean   :0.0877  
##  3rd Qu.:3.000   3rd Qu.:0.0000   3rd Qu.:0.00000     3rd Qu.:0.0000  
##  Max.   :4.000   Max.   :1.0000   Max.   :1.00000     Max.   :1.0000  
##                                                                       
##    emergency       durationsurgery     vitamind         season     
##  Min.   :0.00000   Min.   : 0.020   Min.   : 1.12   Min.   :1.000  
##  1st Qu.:0.00000   1st Qu.: 2.220   1st Qu.:18.00   1st Qu.:1.000  
##  Median :0.00000   Median : 3.400   Median :24.50   Median :2.000  
##  Mean   :0.04419   Mean   : 3.581   Mean   :26.52   Mean   :2.472  
##  3rd Qu.:0.00000   3rd Qu.: 4.700   3rd Qu.:33.25   3rd Qu.:4.000  
##  Max.   :1.00000   Max.   :14.800   Max.   :88.00   Max.   :4.000  
##                                     NA's   :2857                   
##       rbc               ssi         
##  Min.   :   0.00   Min.   :0.00000  
##  1st Qu.:   0.00   1st Qu.:0.00000  
##  Median :   0.00   Median :0.00000  
##  Mean   :  80.43   Mean   :0.08256  
##  3rd Qu.:   0.00   3rd Qu.:0.00000  
##  Max.   :6740.00   Max.   :1.00000  
## 

I decided to look at surgical site infections by season in combination with patient health status (ASA status). That way, I can decide later whether to explore further any difference in surgical site infections between winter and summer.

SeasonalEffect2 <- SeasonalEffect |>
  select(age, female, ssi, season, asastatus) |>
  group_by(age, female)
head(SeasonalEffect2)
## # A tibble: 6 × 5
## # Groups:   age, female [6]
##     age female   ssi season asastatus
##   <dbl>  <dbl> <dbl>  <dbl>     <dbl>
## 1  44        0     0      3         2
## 2  28.1      0     0      3         2
## 3  39.7      1     0      3         2
## 4  26.6      1     0      3         2
## 5  69        1     0      3         3
## 6  40.2      1     0      3         2

Check the dimensions and the summary to make sure there are no missing values

dim(SeasonalEffect2)
## [1] 2919    5

Check for missing data

summary(SeasonalEffect2)
##       age             female            ssi              season     
##  Min.   : 18.10   Min.   :0.0000   Min.   :0.00000   Min.   :1.000  
##  1st Qu.: 40.20   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median : 53.20   Median :1.0000   Median :0.00000   Median :2.000  
##  Mean   : 52.57   Mean   :0.5111   Mean   :0.08256   Mean   :2.472  
##  3rd Qu.: 64.55   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:4.000  
##  Max.   :100.00   Max.   :1.0000   Max.   :1.00000   Max.   :4.000  
##    asastatus    
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :2.000  
##  Mean   :2.482  
##  3rd Qu.:3.000  
##  Max.   :4.000
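
summary() only prints an NA's row for columns that contain missing values, so an explicit count per column can make this check easier to read. The lines below are a minimal sketch in base R, assuming a small made-up data frame (toy) in place of the real data; only the column names are taken from the data set, and the values are invented for illustration.

```r
# Hypothetical toy data standing in for SeasonalEffect (values are made up)
toy <- data.frame(
  age      = c(44, 28.1, 39.7, 26.6),
  ssi      = c(0, 0, 1, 0),
  vitamind = c(NA, 21.5, NA, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE only when no column contains a missing value
no_missing <- !any(is.na(toy))
print(no_missing)
```

Applied to SeasonalEffect2 instead of toy, the same two calls would report the missing-value count for each of the five selected columns.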

Create Barplot

Loading ggplot2

library(ggplot2)
SeasonalEffect2$asastatus <- as.factor(SeasonalEffect2$asastatus)
SeasonalEffect2$season <- as.factor(SeasonalEffect2$season)
ggplot(SeasonalEffect2, aes(x = season, fill = asastatus)) +
  geom_bar(position = "dodge") +
  labs(title = "SSI Distribution by Season and ASA Status",
       x = "Season",
       y = "Count",
       fill = "ASA Status",
       caption = "Source: David Ngendahimana, “Season Effect Dataset”, TSHS Resources Portal (2016)") +
  theme_minimal()

Creating Barplot with own generated colors

# Ensuring asastatus and season are treated as factors

SeasonalEffect2$asastatus <- as.factor(SeasonalEffect2$asastatus)
SeasonalEffect2$season <- as.factor(SeasonalEffect2$season)
# Customizing colors
my_colors <- c("1" = "#1f77b4", # Blue for ASA 1
               "2" = "#ff7f0e", # Orange for ASA 2
               "3" = "#2ca02c", # Green for ASA 3
               "4" = "#d62729") # Red for ASA 4


# Creating the new Bar plot with own colors
ggplot(SeasonalEffect2, aes(x = season, fill = asastatus)) +
  geom_bar(position = "dodge") +
  labs(title = "SSI Distribution by Season and ASA Status",
       x = "Season",
       y = "Count",
       fill = "ASA Status",
       caption = "Source: David Ngendahimana, “Season Effect Dataset”, TSHS Resources Portal (2016)") +
  scale_fill_manual(values = my_colors,
                    drop = FALSE) +
  theme_minimal()

## I used DeepSeek to help find out the code for the visualization above.
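
Because geom_bar() counts rows, a bar chart of the full data shows how many patients fall in each season and ASA group, not how many of them developed an infection. One possible way to get at infections directly is to summarise ssi within each group. This is a minimal sketch, assuming a small made-up data frame (toy) with the same column names as the data set; the values and the result name ssi_by_group are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; column names match the data set, values are made up
toy <- data.frame(
  season    = c(1, 1, 1, 2, 2, 2, 2, 3),
  asastatus = c(2, 2, 3, 2, 3, 3, 3, 2),
  ssi       = c(0, 1, 0, 0, 1, 1, 0, 0)
)

# Patients, infections, and infection rate per season and ASA status
ssi_by_group <- toy |>
  group_by(season, asastatus) |>
  summarise(
    patients  = n(),        # number of records in the group
    ssi_count = sum(ssi),   # ssi is coded 0/1, so the sum is the infection count
    ssi_rate  = mean(ssi),  # share of the group with an infection
    .groups   = "drop"
  )

print(ssi_by_group)
```

With the real data, ssi_rate could be plotted with geom_col() in place of geom_bar(), so that the bar heights represent infection rates rather than group sizes.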

SECOND BRIEF ESSAY

After loading tidyverse and my data set, I ran the code below to clean the data:

names(SeasonalEffect) <- tolower(names(SeasonalEffect))
names(SeasonalEffect) <- gsub(" ","",names(SeasonalEffect))
head(SeasonalEffect)

This code ensures that all variable names are written in lowercase and contain no spaces.

I chose to visualize the distribution of surgical site infection (SSI) by season (fall, winter, spring, and summer). Two observations stand out from the visualization. First, the number of SSI varies by season: summer has the highest number, followed by fall, and spring has the lowest. Second, the number of SSI varies with patient health status: it is higher for patients with mild systemic disease and highest for patients with severe systemic disease, and comparatively lower for normal healthy patients and for patients with severe systemic disease that is a constant threat to life. One caveat is that geom_bar() with stat = "count" counts every record, not only those with ssi equal to 1, so the bar heights reflect how many patients fall in each group, and these patterns still need to be checked against actual infection rates. Based on the visualization, it would be interesting to find out whether there is a relationship between SSI and season, and to clarify the relationship between SSI and patient health status as defined by the American Society of Anesthesiologists (ASA) physical status classification.
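
One way the relationship between SSI and season might be examined is a chi-squared test of independence on a season-by-infection table. This is only a sketch under stated assumptions: the data frame toy is made up, and the labels "winter" and "summer" are hypothetical, since the mapping from the numeric season codes (1 to 4) to season names has to come from the data set's documentation.

```r
# Hypothetical toy data: 50 winter and 50 summer surgeries (values are made up)
toy <- data.frame(
  season = rep(c("winter", "summer"), each = 50),
  ssi    = c(rep(c(1, 0), times = c(5, 45)),    # winter: 5 infections
             rep(c(1, 0), times = c(10, 40)))   # summer: 10 infections
)

# Cross-tabulate season against infection status
tab <- table(toy$season, toy$ssi)
print(tab)

# Chi-squared test of independence (base R applies a continuity
# correction by default for 2 x 2 tables)
test <- chisq.test(tab)
print(test$p.value)
```

A small p-value would suggest that infection rates differ between the two seasons in the sample, although it would not account for other risk factors such as ASA status.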