Heart Transplant Survival Analysis

Author

SMukrabine

##Introduction

##Research Question:

Which factors are most associated with patient survival after being accepted for a heart transplant?

My goal in this project is to study how different factors affect how long patients live after they are accepted for a heart transplant. I want to look at age, prior surgery, and transplant status to see how they relate to survival time. The data come from http://www.stat.ucla.edu/~jsanchez/data/stanford.txt. The dataset has the following columns:

id: patient number; acceptyear: year the patient was accepted; age: patient's age; survived: whether the patient survived; survtime: survival time in days; prior: whether the patient had prior surgery; transplant: whether the patient received a transplant; wait: how long the patient waited. The dataset has 103 rows and 8 variables.

I cleaned the data and used R code to find the average age and the longest survival time. I also grouped the data by transplant status to compare the results. This will help me understand which factors may affect survival.

Data Analysis

For this project, I used R to clean and analyze the data. First, I checked the structure of the dataset using str() and looked at the first few rows with head(). Then, I selected the columns I needed with select(). I calculated summary statistics using mean() and max(), and used filter() to keep only the patients who died. I also created a new column called survival_status using mutate(). Finally, I made a histogram to explore the distribution of patient age.

##Load Necessary Libraries

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggplot2)

##Load the Dataset

setwd("C:/Users/sajut/OneDrive/Desktop/DATA_101")

heart_transplant_data <- read_csv("heart_transplant_csv.csv")

Rows: 103 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): survived, prior, transplant
dbl (5): id, acceptyear, age, survtime, wait

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

##Data Cleaning and Exploration

missing_counts <- colSums(is.na(heart_transplant_data))  # number of missing values in each column
head(heart_transplant_data)

# A tibble: 6 × 8
     id acceptyear   age survived survtime prior transplant  wait
  <dbl>      <dbl> <dbl> <chr>       <dbl> <chr> <chr>      <dbl>
1    15         68    53 dead            1 no    control       NA
2    43         70    43 dead            2 no    control       NA
3    61         71    52 dead            2 no    control       NA
4    75         72    52 dead            2 no    control       NA
5     6         68    54 dead            3 no    control       NA
6    42         70    36 dead            3 no    control       NA

str(heart_transplant_data)

spc_tbl_ [103 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ id        : num [1:103] 15 43 61 75 6 42 54 38 85 2 ...
 $ acceptyear: num [1:103] 68 70 71 72 68 70 71 70 73 68 ...
 $ age       : num [1:103] 53 43 52 52 54 36 47 41 47 51 ...
 $ survived  : chr [1:103] "dead" "dead" "dead" "dead" ...
 $ survtime  : num [1:103] 1 2 2 2 3 3 3 5 5 6 ...
 $ prior     : chr [1:103] "no" "no" "no" "no" ...
 $ transplant: chr [1:103] "control" "control" "control" "control" ...
 $ wait      : num [1:103] NA NA NA NA NA NA NA 5 NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   id = col_double(),
  ..   acceptyear = col_double(),
  ..   age = col_double(),
  ..   survived = col_character(),
  ..   survtime = col_double(),
  ..   prior = col_character(),
  ..   transplant = col_character(),
  ..   wait = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

clean_heart_transplant_data <- heart_transplant_data |>
  select(id, acceptyear, age, survived, survtime, prior, transplant, wait)

##Compare Survival Time by Transplant Status

clean_heart_transplant_data <- clean_heart_transplant_data |>
  mutate(survival_status = ifelse(survived == "alive", 1, 0)) # create survival status: 1 = alive, 0 = dead
mean_age <- mean(clean_heart_transplant_data$age, na.rm = TRUE)
max_survival <- max(clean_heart_transplant_data$survtime, na.rm = TRUE)
mean_age

[1] 44.64078

max_survival

[1] 1799

head(clean_heart_transplant_data)

# A tibble: 6 × 9
     id acceptyear   age survived survtime prior transplant  wait
  <dbl>      <dbl> <dbl> <chr>       <dbl> <chr> <chr>      <dbl>
1    15         68    53 dead            1 no    control       NA
2    43         70    43 dead            2 no    control       NA
3    61         71    52 dead            2 no    control       NA
4    75         72    52 dead            2 no    control       NA
5     6         68    54 dead            3 no    control       NA
6    42         70    36 dead            3 no    control       NA
# ℹ 1 more variable: survival_status <dbl>
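
The grouped comparison mentioned in the introduction is not shown in the code above. A minimal sketch of one way to do it with dplyr follows. The data frame `toy_data` and its values are made up for illustration (they are not rows from the Stanford dataset), and the summary column names are my own choice; with the real data, `clean_heart_transplant_data` would take the place of `toy_data`.

```r
library(dplyr)

# Hypothetical toy rows that mimic the columns used above (values are made up).
toy_data <- data.frame(
  transplant = c("control", "control", "control",
                 "treatment", "treatment", "treatment"),
  survived   = c("dead", "dead", "alive", "dead", "alive", "alive"),
  survtime   = c(2, 10, 40, 30, 200, 400)
)

# Summarise survival time (in days) within each transplant group.
survtime_by_group <- toy_data |>
  group_by(transplant) |>
  summarise(
    n_patients      = n(),
    mean_survtime   = mean(survtime, na.rm = TRUE),
    median_survtime = median(survtime, na.rm = TRUE)
  )

survtime_by_group
```

Because survival times are often strongly skewed, the median may be a more representative summary than the mean.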

table(clean_heart_transplant_data$transplant)  # count patients in each transplant group

##Filter only dead patients

dead_patients <- clean_heart_transplant_data |>
  filter(survived == "dead")

# create a factor for transplant status
dead_patients$transplant_factor <- factor(dead_patients$transplant)

table(dead_patients$transplant_factor)


  control treatment 
       30        45
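
Death counts by themselves are hard to compare when the groups differ in size, so the proportion of deaths within each group is usually more informative. The sketch below uses base R `table()` and `prop.table()` on made-up toy rows (not the real dataset); swapping in `clean_heart_transplant_data` would give the real proportions.

```r
# Hypothetical toy rows (made up for illustration, not from the Stanford data).
toy_status <- data.frame(
  transplant = c("control", "control", "control", "control",
                 "treatment", "treatment", "treatment", "treatment"),
  survived   = c("dead", "dead", "dead", "alive",
                 "dead", "dead", "alive", "alive")
)

# Cross-tabulate transplant group (rows) against survival outcome (columns).
status_counts <- table(toy_status$transplant, toy_status$survived)

# Convert counts to row proportions, so each group's row sums to 1.
status_props <- prop.table(status_counts, margin = 1)

status_props
```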

##Visualization: Histogram of Patient Age

ggplot(clean_heart_transplant_data, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(
    title = "Distribution of Patient Age",
    x = "Age",
    y = "Number of Patients"
  ) +
  theme_minimal()

Future Directions

This analysis looked at how patient age and transplant status relate to survival after acceptance for a heart transplant. After loading the data, I used head() to view the first six rows, which show each patient's age, survival time, and transplant status. I also created a new variable, survival_status, which codes the outcome numerically (1 = alive, 0 = dead). Overall, the data suggest that patients who did not receive a transplant had poor long-term survival, which points to transplantation being important for survival.

I then filtered the dataset to include only patients who died and created a factor variable for transplant status to make the results easier to interpret. Counting the deaths in each group gives 30 deaths in the control group (patients who did not receive a transplant) and 45 deaths in the treatment group (patients who received a transplant). These counts suggest that transplant status is related to survival, but raw death counts cannot be compared directly unless the number of patients in each group is also taken into account. The higher death count in the treatment group may reflect that these patients were more critically ill or that the group was larger; comparing the proportion of deaths or the survival times in each group would show whether receiving a transplant improved survival.

I also created a histogram to visualize the distribution of patient ages. It shows that most patients accepted for a heart transplant were between about 40 and 60 years old, with fewer younger patients. This describes who was in the study, but it does not by itself show whether age affects survival after acceptance.

For future research, statistical methods such as regression could be used to examine the influence of several factors on survival at the same time. Including additional variables, such as waiting time and history of prior surgery, could further improve the analysis.
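
As one possible reading of the regression idea, the sketch below fits a logistic regression with base R `glm()`, modelling the chance of being alive from age, prior surgery, and transplant status. The data here are simulated stand-ins (not the Stanford dataset), and the simulated effect sizes are arbitrary assumptions chosen only so the example runs. Logistic regression ignores how long each patient survived, so a survival model such as Cox regression would be another option to consider.

```r
set.seed(101)  # make the simulated example reproducible

# Simulated stand-in data (NOT the Stanford dataset); column names mirror the report.
n <- 200
sim_data <- data.frame(
  age        = round(runif(n, min = 20, max = 65)),
  prior      = sample(c("no", "yes"), n, replace = TRUE),
  transplant = sample(c("control", "treatment"), n, replace = TRUE)
)

# Simulate an outcome where a transplant raises the chance of being alive
# (the coefficients below are arbitrary assumptions for illustration).
p_alive <- plogis(-0.5 - 0.02 * (sim_data$age - 45) +
                    1.0 * (sim_data$transplant == "treatment"))
sim_data$survival_status <- rbinom(n, size = 1, prob = p_alive)

# Logistic regression: survival_status (1 = alive, 0 = dead) on the three predictors.
fit <- glm(survival_status ~ age + prior + transplant,
           data = sim_data, family = binomial)

# Coefficient table: estimates are on the log-odds scale.
summary(fit)$coefficients
```

With the real data, `clean_heart_transplant_data` would replace `sim_data`, and the simulation step would be dropped.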