Surgery Timings THS Dataset Analysis

By Zaid Hageman

Introduction

The Surgery Timings THS (Teaching Statistics in the Health Sciences) data set focuses on general surgery patients, containing information on 32,001 people who underwent surgery. This data was collected as part of a study by Sessier et al., titled “Operation Timing and 30-Day Mortality After Elective General Surgery,” published in Anesthesia & Analgesia in 2011.

The data set includes a variety of demographic and health variables. Key demographic variables include age, gender, and race. Their health status is measured through BMI (the Body Mass Index), and many baseline health conditions such as baseline cancer, cardiovascular disease, dementia, diabetes, digestive disease, osteoarthritis, psychiatric disorders, and pulmonary disease. There are a couple other variables in the data set however my analysis will focus on BMI, Age, and Race.

This analysis will explore the relationships between surgical timing factors and patient outcomes, particularly investigating whether the time of day, day of the week, or month of the year influences 30-day mortality and in-hospital complications. By examining these factors, the study aims to contribute valuable insights into optimizing surgical scheduling for improved patient safety and outcomes.

The dataset was contributed by Dr. Amy Nowacki of the Cleveland Clinic and is available through the TSHS Resources Portal.

Creating the Analysis

Importing The Necessary Tools

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggplot2)

surgery_data <- read_csv("surgery_timing_THS.csv")

Rows: 32001 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): ahrq_ccs
dbl (24): age, gender, race, asa_status, bmi, baseline_cancer, baseline_cvd,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

surgery_data

# A tibble: 32,001 × 25
   ahrq_ccs   age gender  race asa_status   bmi baseline_cancer baseline_cvd
   <chr>    <dbl>  <dbl> <dbl>      <dbl> <dbl>           <dbl>        <dbl>
 1 <Other>   67.8      1     1          1  28.0               0            1
 2 <Other>   39.5      0     1          1  37.8               0            1
 3 <Other>   56.5      0     1          1  19.6               0            0
 4 <Other>   71        1     1          2  32.2               0            1
 5 <Other>   56.3      1     2          1  24.3               1            0
 6 <Other>   57.7      0     1          1  40.3               0            1
 7 <Other>   56.6      1     3          3  64.6               0            1
 8 <Other>   64.2      0     1          2  43.2               0            1
 9 <Other>   66.2      1     1          2  28.0               0            1
10 <Other>   20.1      0     1          1  27.4               1            0
# ℹ 31,991 more rows
# ℹ 17 more variables: baseline_dementia <dbl>, baseline_diabetes <dbl>,
#   baseline_digestive <dbl>, baseline_osteoart <dbl>, baseline_psych <dbl>,
#   baseline_pulmonary <dbl>, baseline_charlson <dbl>, mortality_rsi <dbl>,
#   complication_rsi <dbl>, ccsMort30Rate <dbl>, ccsComplicationRate <dbl>,
#   hour <dbl>, dow <dbl>, month <dbl>, moonphase <dbl>, mort30 <dbl>,
#   complication <dbl>

Cleaning up the Data

Since the Race Data is saved as numbers we need to change it so that it is clear what each category is.

Then we need to remove any data that has NAs to not form a 4th unnecessary graph.

After than we can take our random sample to create a simple random sample in order not to flood our graph with too many observations.

surgery_data$race <- as.factor(surgery_data$race)

surgery_data <- surgery_data %>%
  mutate(race = recode(race,
                       `1` = "Caucasian",
                       `2` = "African American",
                       `3` = "Other"))
# Everything above is used in order to take the numerical values of race and turn them into categories

surgery_data <- surgery_data %>%
  filter(!is.na(race), !is.na(bmi), !is.na(age)) # Getting rid of any rows lacking data

sampled_data <- surgery_data %>%
    group_by(race) %>%
    slice_sample(n = 100) # Slice sample of 100 people in order not to convolute the graph (Simple Random Sample)

Looking at the correlation between Age, Race, and BMI

custom_colors <- c("Caucasian" = "#00ff05",   # Darkest green
                   "African American" = "#01fec7",  # Medium green
                   "Other" = "#65909a")   # Lightest green

plot1 <- ggplot(sampled_data, aes(x = age, y = bmi, color = race)) +
  geom_point(alpha = 1) + 
  geom_smooth(method = 'lm', formula = y ~ x, color = "orange", se = FALSE) + # Linear Regression Line
  geom_smooth(color = "yellow") + # Line for estimation (not linear regression)
  labs(title = "Scatter Plot of Age vs BMI",
       x = "Age",
       y = "BMI",
       caption = "Source: by Dr. Amy Nowacki of the Cleveland Clinic available through the TSHS Resources Portal") +
  scale_color_manual(values = custom_colors) +
  theme_minimal() + 
  theme(
    plot.title = element_text(size = 15, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 13, face = "bold"),
    axis.text = element_text(size = 9),
    plot.caption = element_text(size = 9),
    legend.title = element_text(face = "bold")
  )
plot1

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#This is the graph to be taken into consideration for grading

plot2 <- ggplot(sampled_data, aes(x = age, y = bmi, color = bmi)) +
 geom_point(alpha = 0.32) + # Allows for some transparency with the dots
  geom_smooth(color = "yellow") +
  labs(title = "Scatter Plot of Age vs BMI by Race",
       x = "Age",
       y = "BMI",
       caption = "Source: by Dr. Amy Nowacki of the Cleveland Clinic available through the TSHS Resources Portal") +
  facet_wrap(~ race) +  # Allows for all races to be seen seperately
  theme_minimal() +  # Use a minimal theme
  theme(
    plot.title = element_text(size = 15, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 13, face = "bold"),
    axis.text = element_text(size = 9),
    plot.caption = element_text(size = 9) 
  )
plot2

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#This is not the graph to be taken into consideration for grading
# Same as the last graph with some different elements

In order to do the linear regression analysis easier i want to make separate groups for each race to see the difference, and the reason i want to do this is i believe i will be running into some issues because these scatter plots are parabolic and therefor the linear regression will be misleading. With this in mind i want to see how they will differ from one another. Assuming the linear regression was not misleading, the graphs in theory should have very similar linear regressions IF and only if the race has no factor on BMI. All of this information is important to keep in mind when looking at the linear regression analysis.

Linear Regression Analysis

caucasian_data <- sampled_data %>%
  filter(race == "Caucasian", !is.na(age), !is.na(bmi))

african_american_data <- sampled_data %>%
  filter(race == "African American", !is.na(age), !is.na(bmi))

other_data <- sampled_data %>%
  filter(race == "Other", !is.na(age), !is.na(bmi))
# Here i created data sets for each race and removed their NAs in order to see their specific linear regressions to see if there is any relation

fit_all <- lm(bmi ~ age, data = sampled_data)
fit_cauc <- lm(bmi ~ age, data = caucasian_data)
fit_afr <- lm(bmi ~ age, data = african_american_data)
fit_oth <- lm(bmi ~ age, data = other_data)

cor(sampled_data$age, sampled_data$bmi)

[1] 0.0195067

summary(fit_all)


Call:
lm(formula = bmi ~ age, data = sampled_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.600  -4.917  -1.312   3.364  34.940 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 28.793894   1.699802  16.940   <2e-16 ***
age          0.009792   0.029073   0.337    0.737    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.015 on 298 degrees of freedom
Multiple R-squared:  0.0003805, Adjusted R-squared:  -0.002974 
F-statistic: 0.1134 on 1 and 298 DF,  p-value: 0.7365

With all the information we have on this data set we can see that there is little to no statistical significance about this data in terms of this linear regression. The p-value of 0.3204 suggests that there is no meaningful relationship and the correlation of 0.05756215 is incredibly weak. On top of this the adjusted R^2 is incredibly weak with a value of -0.00003118 because it is negative. However even knowing there is no significance we can still look at the point slope form for the sake of looking.

Point slope form is y=mx+b

y is dependent variable (BMI)

m is the slope

x is the independent variable (age)

b is the y intercept

BMI = 0.02798 × Age + 28.11873

Correlation of 0.05756215

P value of 0.3204

Adjusted R^2 Value of -0.00003118

What this means

These values show absolutely no relation to one another. The graph looks appearably parabolic which means that a linear model is a horrible method of analyzing this graph. However what this is trying to show is that for each additional year of age, BMI increases by 0.02798 units. This in context makes no sense because we can see in regular day to day life that this is not the case. It also shows us that when our age is 0 the BMI should be 28.11873 which we also know is not the case as according to the World Health Organization (WHO) the average BMI for a baby is 14.0.

Even through all of this it would be interesting to see what the other linear regressions would look like for each race individually.

Caucasian Analysis

cor(caucasian_data$age, caucasian_data$bmi)

[1] 0.04992161

summary(fit_cauc)


Call:
lm(formula = bmi ~ age, data = caucasian_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-12.359  -4.631  -1.969   2.475  34.729 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 27.85970    3.47815   8.010 2.42e-12 ***
age          0.02859    0.05778   0.495    0.622    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.712 on 98 degrees of freedom
Multiple R-squared:  0.002492,  Adjusted R-squared:  -0.007686 
F-statistic: 0.2448 on 1 and 98 DF,  p-value: 0.6218

BMI = −0.04913 × Age + 32.94

African American Analysis

cor(african_american_data$age, african_american_data$bmi)

[1] -0.1550859

summary(fit_afr)


Call:
lm(formula = bmi ~ age, data = african_american_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.630  -4.400  -1.286   3.888  18.999 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 35.06890    2.98457  11.750   <2e-16 ***
age         -0.07938    0.05108  -1.554    0.123    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.935 on 98 degrees of freedom
Multiple R-squared:  0.02405,   Adjusted R-squared:  0.01409 
F-statistic: 2.415 on 1 and 98 DF,  p-value: 0.1234

BMI = −0.05129 × Age + 34.82

Other Race Analysis

cor(other_data$age, other_data$bmi)

[1] 0.1370799

summary(fit_oth)


Call:
lm(formula = bmi ~ age, data = other_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.6280 -4.2180 -0.9116  2.7243 18.5623 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 24.83577    2.35531   10.54   <2e-16 ***
age          0.05689    0.04153    1.37    0.174    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.049 on 98 degrees of freedom
Multiple R-squared:  0.01879,   Adjusted R-squared:  0.008779 
F-statistic: 1.877 on 1 and 98 DF,  p-value: 0.1738

BMI = 0.01009 × Age + 27.56

After looking at all of these seperately we can see that across the board there is no correlation between a persons age and their BMI but rather other factors unforseen to this analysis that take that into account.

Essay

In this data set analysis, i cleaned the data set by converting numerical factors into categorical factors using the as.factor function. I also used mutate and recode functions so that i could manipulate the data set in a manner that allows me to take the numerical values in the data set and make them mean something. For example 0 1 2 were numbers relating to races and using mutate i was able to transform it to be able to be represented as such. I also used filters to ensure quality data and remove any rows that were lacking data.

The primary visualization was a scatter plot that shows the relationship between age and BMI, color coded by race. The plot shows an extremely weak correlation between age and BMI with a correlation of 0.05756215 and a p-value of 0.3204 and on top of this an adjusted R^2 value of -0.00003118, which shows us there is little to no relationship between the two. It was also clear after just looking at the graph that it seemed slightly parabolic rather than linear, proving the linear regression model would show little help in our analysis. While it showed many helpful insights, i wish i looked at some other variables like surgical timings/health conditions because i think they would have had more correlations and I would have found some more impactful trends.

Middle Age Group Analysis (Not Finished but For Future Testing/Practice)

After learning that the middle age group has the most diversity in BMI, im going to mutate the data set to make a middle age group and see what diseases they are most prone to.

surgery_data$age <- as.vector(surgery_data$age)

surgery_data <- surgery_data %>%
  mutate(age_group = case_when(
    age >= 30 & age < 60 ~ "Middle-aged"))

middle_aged_data <- surgery_data %>%
  filter(age_group == "Middle-aged", !is.na(age)) %>%
  arrange(desc(age))

middle_aged_data <- middle_aged_data %>%
  rename(
    cancer = baseline_cancer, cvd = baseline_cvd, dementia = baseline_dementia,
    diabetes = baseline_diabetes, digestive = baseline_digestive, osteoart = baseline_osteoart,
    psych = baseline_psych, pulmonary = baseline_pulmonary)

middle_aged_data

# A tibble: 13,982 × 26
   ahrq_ccs     age gender race  asa_status   bmi cancer   cvd dementia diabetes
   <chr>      <dbl>  <dbl> <fct>      <dbl> <dbl>  <dbl> <dbl>    <dbl>    <dbl>
 1 <Other>     59.9      1 Cauc…          1  28.2      0     0        0        0
 2 <Other>     59.9      0 Cauc…          1  24.8      0     1        0        0
 3 <Other>     59.9      0 Cauc…          1  27.3      0     1        0        0
 4 <Other>     59.9      1 Cauc…         NA  30.6      0     0        0        0
 5 Arthropla…  59.9      1 Cauc…          2  37.9      0     1        0        0
 6 Arthropla…  59.9      0 Afri…          2  41.6      0     1        0        1
 7 Arthropla…  59.9      0 Cauc…          1  27.5      0     0        0        0
 8 Arthropla…  59.9      1 Other          2  35.3      0     1        0        0
 9 Colorecta…  59.9      0 Cauc…          2  30.6      1     1        0        0
10 Colorecta…  59.9      1 Cauc…          2  22.9      1     1        0        0
# ℹ 13,972 more rows
# ℹ 16 more variables: digestive <dbl>, osteoart <dbl>, psych <dbl>,
#   pulmonary <dbl>, baseline_charlson <dbl>, mortality_rsi <dbl>,
#   complication_rsi <dbl>, ccsMort30Rate <dbl>, ccsComplicationRate <dbl>,
#   hour <dbl>, dow <dbl>, month <dbl>, moonphase <dbl>, mort30 <dbl>,
#   complication <dbl>, age_group <chr>