The Surgery Timings THS (Teaching Statistics in the Health Sciences) data set focuses on general surgery patients, containing information on 32,001 people who underwent surgery. This data was collected as part of a study by Sessier et al., titled “Operation Timing and 30-Day Mortality After Elective General Surgery,” published in Anesthesia & Analgesia in 2011.
The data set includes a variety of demographic and health variables. Key demographic variables include age, gender, and race. Their health status is measured through BMI (the Body Mass Index), and many baseline health conditions such as baseline cancer, cardiovascular disease, dementia, diabetes, digestive disease, osteoarthritis, psychiatric disorders, and pulmonary disease. There are a couple other variables in the data set however my analysis will focus on BMI, Age, and Race.
This analysis will explore the relationships between surgical timing factors and patient outcomes, particularly investigating whether the time of day, day of the week, or month of the year influences 30-day mortality and in-hospital complications. By examining these factors, the study aims to contribute valuable insights into optimizing surgical scheduling for improved patient safety and outcomes.
The dataset was contributed by Dr. Amy Nowacki of the Cleveland Clinic and is available through the TSHS Resources Portal.
Creating the Analysis
Importing The Necessary Tools
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)library(ggplot2)
surgery_data <-read_csv("surgery_timing_THS.csv")
Rows: 32001 Columns: 25
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): ahrq_ccs
dbl (24): age, gender, race, asa_status, bmi, baseline_cancer, baseline_cvd,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Since the Race Data is saved as numbers we need to change it so that it is clear what each category is.
Then we need to remove any data that has NAs to not form a 4th unnecessary graph.
After than we can take our random sample to create a simple random sample in order not to flood our graph with too many observations.
surgery_data$race <-as.factor(surgery_data$race)surgery_data <- surgery_data %>%mutate(race =recode(race,`1`="Caucasian",`2`="African American",`3`="Other"))# Everything above is used in order to take the numerical values of race and turn them into categoriessurgery_data <- surgery_data %>%filter(!is.na(race), !is.na(bmi), !is.na(age)) # Getting rid of any rows lacking datasampled_data <- surgery_data %>%group_by(race) %>%slice_sample(n =100) # Slice sample of 100 people in order not to convolute the graph (Simple Random Sample)
Looking at the correlation between Age, Race, and BMI
custom_colors <-c("Caucasian"="#00ff05", # Darkest green"African American"="#01fec7", # Medium green"Other"="#65909a") # Lightest greenplot1 <-ggplot(sampled_data, aes(x = age, y = bmi, color = race)) +geom_point(alpha =1) +geom_smooth(method ='lm', formula = y ~ x, color ="orange", se =FALSE) +# Linear Regression Linegeom_smooth(color ="yellow") +# Line for estimation (not linear regression)labs(title ="Scatter Plot of Age vs BMI",x ="Age",y ="BMI",caption ="Source: by Dr. Amy Nowacki of the Cleveland Clinic available through the TSHS Resources Portal") +scale_color_manual(values = custom_colors) +theme_minimal() +theme(plot.title =element_text(size =15, face ="bold", hjust =0.5),axis.title =element_text(size =13, face ="bold"),axis.text =element_text(size =9),plot.caption =element_text(size =9),legend.title =element_text(face ="bold") )plot1
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#This is the graph to be taken into consideration for grading
plot2 <-ggplot(sampled_data, aes(x = age, y = bmi, color = bmi)) +geom_point(alpha =0.32) +# Allows for some transparency with the dotsgeom_smooth(color ="yellow") +labs(title ="Scatter Plot of Age vs BMI by Race",x ="Age",y ="BMI",caption ="Source: by Dr. Amy Nowacki of the Cleveland Clinic available through the TSHS Resources Portal") +facet_wrap(~ race) +# Allows for all races to be seen seperatelytheme_minimal() +# Use a minimal themetheme(plot.title =element_text(size =15, face ="bold", hjust =0.5),axis.title =element_text(size =13, face ="bold"),axis.text =element_text(size =9),plot.caption =element_text(size =9) )plot2
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
#This is not the graph to be taken into consideration for grading# Same as the last graph with some different elements
In order to do the linear regression analysis easier i want to make separate groups for each race to see the difference, and the reason i want to do this is i believe i will be running into some issues because these scatter plots are parabolic and therefor the linear regression will be misleading. With this in mind i want to see how they will differ from one another. Assuming the linear regression was not misleading, the graphs in theory should have very similar linear regressions IF and only if the race has no factor on BMI. All of this information is important to keep in mind when looking at the linear regression analysis.
Linear Regression Analysis
caucasian_data <- sampled_data %>%filter(race =="Caucasian", !is.na(age), !is.na(bmi))african_american_data <- sampled_data %>%filter(race =="African American", !is.na(age), !is.na(bmi))other_data <- sampled_data %>%filter(race =="Other", !is.na(age), !is.na(bmi))# Here i created data sets for each race and removed their NAs in order to see their specific linear regressions to see if there is any relation
fit_all <-lm(bmi ~ age, data = sampled_data)fit_cauc <-lm(bmi ~ age, data = caucasian_data)fit_afr <-lm(bmi ~ age, data = african_american_data)fit_oth <-lm(bmi ~ age, data = other_data)
cor(sampled_data$age, sampled_data$bmi)
[1] 0.0195067
summary(fit_all)
Call:
lm(formula = bmi ~ age, data = sampled_data)
Residuals:
Min 1Q Median 3Q Max
-13.600 -4.917 -1.312 3.364 34.940
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.793894 1.699802 16.940 <2e-16 ***
age 0.009792 0.029073 0.337 0.737
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.015 on 298 degrees of freedom
Multiple R-squared: 0.0003805, Adjusted R-squared: -0.002974
F-statistic: 0.1134 on 1 and 298 DF, p-value: 0.7365
With all the information we have on this data set we can see that there is little to no statistical significance about this data in terms of this linear regression. The p-value of 0.3204 suggests that there is no meaningful relationship and the correlation of 0.05756215 is incredibly weak. On top of this the adjusted R^2 is incredibly weak with a value of -0.00003118 because it is negative. However even knowing there is no significance we can still look at the point slope form for the sake of looking.
Point slope form is y=mx+b
y is dependent variable (BMI)
m is the slope
x is the independent variable (age)
b is the y intercept
BMI = 0.02798 × Age + 28.11873
Correlation of 0.05756215
P value of 0.3204
Adjusted R^2 Value of -0.00003118
What this means
These values show absolutely no relation to one another. The graph looks appearably parabolic which means that a linear model is a horrible method of analyzing this graph. However what this is trying to show is that for each additional year of age, BMI increases by 0.02798 units. This in context makes no sense because we can see in regular day to day life that this is not the case. It also shows us that when our age is 0 the BMI should be 28.11873 which we also know is not the case as according to the World Health Organization (WHO) the average BMI for a baby is 14.0.
Even through all of this it would be interesting to see what the other linear regressions would look like for each race individually.
Caucasian Analysis
cor(caucasian_data$age, caucasian_data$bmi)
[1] 0.04992161
summary(fit_cauc)
Call:
lm(formula = bmi ~ age, data = caucasian_data)
Residuals:
Min 1Q Median 3Q Max
-12.359 -4.631 -1.969 2.475 34.729
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.85970 3.47815 8.010 2.42e-12 ***
age 0.02859 0.05778 0.495 0.622
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.712 on 98 degrees of freedom
Multiple R-squared: 0.002492, Adjusted R-squared: -0.007686
F-statistic: 0.2448 on 1 and 98 DF, p-value: 0.6218
Call:
lm(formula = bmi ~ age, data = african_american_data)
Residuals:
Min 1Q Median 3Q Max
-15.630 -4.400 -1.286 3.888 18.999
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.06890 2.98457 11.750 <2e-16 ***
age -0.07938 0.05108 -1.554 0.123
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.935 on 98 degrees of freedom
Multiple R-squared: 0.02405, Adjusted R-squared: 0.01409
F-statistic: 2.415 on 1 and 98 DF, p-value: 0.1234
BMI = −0.05129 × Age + 34.82
Other Race Analysis
cor(other_data$age, other_data$bmi)
[1] 0.1370799
summary(fit_oth)
Call:
lm(formula = bmi ~ age, data = other_data)
Residuals:
Min 1Q Median 3Q Max
-9.6280 -4.2180 -0.9116 2.7243 18.5623
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.83577 2.35531 10.54 <2e-16 ***
age 0.05689 0.04153 1.37 0.174
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.049 on 98 degrees of freedom
Multiple R-squared: 0.01879, Adjusted R-squared: 0.008779
F-statistic: 1.877 on 1 and 98 DF, p-value: 0.1738
BMI = 0.01009 × Age + 27.56
After looking at all of these seperately we can see that across the board there is no correlation between a persons age and their BMI but rather other factors unforseen to this analysis that take that into account.
Essay
In this data set analysis, i cleaned the data set by converting numerical factors into categorical factors using the as.factor function. I also used mutate and recode functions so that i could manipulate the data set in a manner that allows me to take the numerical values in the data set and make them mean something. For example 0 1 2 were numbers relating to races and using mutate i was able to transform it to be able to be represented as such. I also used filters to ensure quality data and remove any rows that were lacking data.
The primary visualization was a scatter plot that shows the relationship between age and BMI, color coded by race. The plot shows an extremely weak correlation between age and BMI with a correlation of 0.05756215 and a p-value of 0.3204 and on top of this an adjusted R^2 value of -0.00003118, which shows us there is little to no relationship between the two. It was also clear after just looking at the graph that it seemed slightly parabolic rather than linear, proving the linear regression model would show little help in our analysis. While it showed many helpful insights, i wish i looked at some other variables like surgical timings/health conditions because i think they would have had more correlations and I would have found some more impactful trends.
Middle Age Group Analysis (Not Finished but For Future Testing/Practice)
After learning that the middle age group has the most diversity in BMI, im going to mutate the data set to make a middle age group and see what diseases they are most prone to.