Colon Surgery data set

This ’urgent colon surgery” data is a vast dataset that shows preoperative risk factors, intraoperative variables, and 30-day postoperative mortality and morbidity outcomes for patients undergoing major surgical procedures in both the inpatient and outpatient setting. There are many variables one can consider, which encourages numerous ways to find out a specific study concept from part of this data. That is one of the things that I really like about this dataset, that you can explore through so many factors and how detailed it is. For example, the age of the patient, time of procedure, type of race, type of physician, weight height and so much more. I plan to explore the bmi of a certain age group according to the race of the patients and through that be able to see how colon surgery has affected their health by an alluvial. I specifically chose these data variables because it is said that there are chances that colon surgery is risked by abnormal bmi’s. I wanted to see this through a distribution and conclude how much a bmi is affected. Alluvial’s are widely used for showing certain data over a slope of quantity containing conditional/ unconditional distributions. In my case, over the increasing range of age groups (quantitative variable), I am conditioning the graph to show the bmi (quantitative variable) distributions by the race (categorical variable) of each patient. This type of visualization is a great way to be able to see an entire category and compare them to each other. I started by importing the data into my r code and using the head function to see the first few rows of the data. Then, after I decided the variables I need, I realized the age range is very high which could make my data looked clustered together. Therefore, I used the mutate function to divide the age groups in increments of 10, starting from age 10, and also changed its class from character to integer. For the data with the BMI of each patient, I removed any NA’s the data holds so it doesn’t cause the graph to be skewed in any direction. Finally, I used group_by and summarize to create a summary table for the data I need, and with that started to form my alluvial. Using ggplot and scalefill color palette 3, I created a beautiful alluvial with ages as the x axis, bmi for the y axis, and race as the color fill, to visualize the correspondence to age with bmi, along with the race of a patient, and if that even affects the data at all.

What the visualization showed, was astonishing. I decided to go with the age range starting from 30, because that is about the time patients start getting sick. A human body starts getting weak and that has many effects on them. The first thing that really caught my eye through this visualization is how I saw green abruptly start about a third to the graph, and most of the distribution was all the way on top. It turns out there is no record of American Indian or Alaska Native bmi’s until the age of 36, and after that it goes from a bmi of 115 to above 200, more than any other racial group. Similarly, another group that has an interesting distribution is the Asian population. While in most cases, you can notice that the bmi rises as the age increases, in the Asian distribution it is suddenly high at the age of 33, and then drops by the age of 35. I wonder what could have caused that. I would say that the white population had somewhat of a normal distribution compared to the other racial groups ranging from a bmi of 40 – 140. There is a sudden drop at the age of 37, but it is quickly recovered within a year.

I had a difficult time figuring out how I can divide the ages from the scale into groups of 10. I assumed the group_by function would be able to do this, however it did not give me the accurate results I needed. Later, after consulting with Professor Saidi, I was able to figure out how to correctly incorporate the mutate function and get the desired results. As alluvial’s need to have a quantitative variable for the x – axis, I could not use characters as I had divided the age groups. I wish I could figure out a correct way to display that information through a different visualization as well. For now, I went with a range of from 30 – 50 as my x – axis.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(psych)
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(RColorBrewer)

Import the dataset

surgery <- read_csv("urgent_colon surgery.csv")
## New names:
## • `otheryes` -> `otheryes...8`
## • `Homeyes` -> `Homeyes...15`
## • `facilityyes` -> `facilityyes...16`
## • `otheryes` -> `otheryes...20`
## • `Homeyes` -> `Homeyes...35`
## • `facilityyes` -> `facilityyes...37`
## • `otheryes` -> `otheryes...38`
## • `` -> `...252`
## Warning: One or more parsing issues, see `problems()` for details
## Rows: 43607 Columns: 314
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (69): SEX, RACE_NEW, ETHNICITY_HISPANIC, PRNCPTX, INOUT, TRANST, AGE, D...
## dbl (213): PUFYEAR, CASEID, femaleyes, whiteyes, blackyes, otheryes...8, CPT...
## lgl  (32): OTHERPROC10, OTHERCPT10, OTHERWRVU10, CONCURR7, CONCPT7, CONWRVU7...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(surgery)
## # A tibble: 6 × 314
##   PUFYEAR  CASEID SEX    femal…¹ RACE_…² white…³ black…⁴ other…⁵ ETHNI…⁶ PRNCPTX
##     <dbl>   <dbl> <chr>    <dbl> <chr>     <dbl>   <dbl>   <dbl> <chr>   <chr>  
## 1    2018 8634492 male         0 White         1       0       0 N       COLECT…
## 2    2017 6834792 female       1 White         1       0       0 Y       COLECT…
## 3    2018 8328996 female       1 Black …       0       1       0 N       COLECT…
## 4    2016 5055443 female       1 White         1       0       0 N       COLECT…
## 5    2018 8694016 male         0 White         1       0       0 N       COLECT…
## 6    2018 9030828 female       1 White         1       0       0 N       COLECT…
## # … with 304 more variables: CPT <dbl>, WORKRVU <dbl>, INOUT <chr>,
## #   TRANST <chr>, Homeyes...15 <dbl>, facilityyes...16 <dbl>,
## #   transferyes <dbl>, transferacuteyes <dbl>, transferEDyes <dbl>,
## #   otheryes...20 <dbl>, AGE <chr>, `65+` <dbl>, `18-29` <dbl>, `30-39` <dbl>,
## #   `40-49` <dbl>, `50-59` <dbl>, `60-69` <dbl>, `70-79` <dbl>, `80-89` <dbl>,
## #   `90+` <dbl>, Decade <dbl>, ADMYR <dbl>, OPERYR <dbl>, DISCHDEST <chr>,
## #   Homeyes...35 <dbl>, Expiredyes <dbl>, facilityyes...37 <dbl>, …

Change the class of age and remove any NAs

class(surgery$AGE)
## [1] "character"
surgery$AGE <- as.integer(surgery$AGE) # change the class of age from character to numeric data.
## Warning: NAs introduced by coercion
summary(surgery$AGE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   18.00   54.00   65.00   63.22   75.00   90.00    1333
removena_surgery <- surgery %>% # remove any Nas from the data
  filter(!is.na(BMI) & !is.na(AGE))
head(removena_surgery) # view the updated data
## # A tibble: 6 × 314
##   PUFYEAR  CASEID SEX    femal…¹ RACE_…² white…³ black…⁴ other…⁵ ETHNI…⁶ PRNCPTX
##     <dbl>   <dbl> <chr>    <dbl> <chr>     <dbl>   <dbl>   <dbl> <chr>   <chr>  
## 1    2018 8634492 male         0 White         1       0       0 N       COLECT…
## 2    2017 6834792 female       1 White         1       0       0 Y       COLECT…
## 3    2018 8328996 female       1 Black …       0       1       0 N       COLECT…
## 4    2016 5055443 female       1 White         1       0       0 N       COLECT…
## 5    2018 8694016 male         0 White         1       0       0 N       COLECT…
## 6    2018 9030828 female       1 White         1       0       0 N       COLECT…
## # … with 304 more variables: CPT <dbl>, WORKRVU <dbl>, INOUT <chr>,
## #   TRANST <chr>, Homeyes...15 <dbl>, facilityyes...16 <dbl>,
## #   transferyes <dbl>, transferacuteyes <dbl>, transferEDyes <dbl>,
## #   otheryes...20 <dbl>, AGE <int>, `65+` <dbl>, `18-29` <dbl>, `30-39` <dbl>,
## #   `40-49` <dbl>, `50-59` <dbl>, `60-69` <dbl>, `70-79` <dbl>, `80-89` <dbl>,
## #   `90+` <dbl>, Decade <dbl>, ADMYR <dbl>, OPERYR <dbl>, DISCHDEST <chr>,
## #   Homeyes...35 <dbl>, Expiredyes <dbl>, facilityyes...37 <dbl>, …

Use group_by and summarise to create a summary table

by_race <- removena_surgery %>%
  group_by(RACE_NEW, AGE) %>% # group by race and age 
  summarize(count = n(),
            bmi = mean(BMI)) # we need to use the mean bmi
## `summarise()` has grouped output by 'RACE_NEW'. You can override using the
## `.groups` argument.
unique(by_race$RACE_NEW)
## [1] "American Indian or Alaska Native"    "Asian"                              
## [3] "Black or African American"           "Native Hawaiian or Pacific Islander"
## [5] "Unknown/Not Reported"                "White"

Use the mutate command to make age groups to view the data easier

surgery_Age <- by_race %>%
  mutate(AGEGroup = ifelse(AGE %in% 0:9.99999,"00-10 Years", # age groups 0 - 10 years
                ifelse(AGE %in% 10:19.99999, "10-20 Years", # age groups 10 - 20 years
                ifelse(AGE %in% 20:29.99999, "20-30 Years", # age groups 30 - 40 years
                ifelse(AGE %in% 30:39.99999, "30-40 Years", # age groups 40 - 50 years
                ifelse(AGE %in% 40:49.99999, "40-50 Years", # age groups 50 - 60 years
                ifelse(AGE %in% 50-59.99999, "50-60 Years", # age groups 60 - 70 years
                ifelse(AGE %in% 60-69.99999, "60-70 Years", "70+ Years"))))))))
unique(surgery_Age$AGEGroup)
## [1] "20-30 Years" "30-40 Years" "40-50 Years" "50-60 Years" "10-20 Years"

Load the Alluvial library

library(alluvial)
library(ggalluvial)

Make an alluvial

 surgery_alluv <- surgery_Age %>%
  filter(AGE > 30 & AGE < 50) %>% # filter by age for the x axis 
  ggplot(aes(x=AGE, y = bmi, alluvium = RACE_NEW)) +
  theme_bw() +
  geom_alluvium(aes(fill = RACE_NEW), # fill by race - catagoical variable
                color = "white",
                width = .1, 
                alpha = .8,
                decreasing = FALSE) +
  scale_fill_brewer(palette = "Set3") +
  ggtitle("BMI of Patients Age 30 - 50 According to Race") +
  ylab("BMI") + 
  xlab("Age ranging from 30 - 50")
surgery_alluv