ERGOT Data Exam

# Getting set up
library(tidyverse)
library(lubridate)
library(janitor)
library(forcats)
library(gt)
library(patchwork)

# Read data
patients <- read.csv("ergot_data_analyst_test_2020/patients.csv")
meld_scores <- read.csv("ergot_data_analyst_test_2020/meld_scores.csv")

# date columns need formatting
patients <- patients %>% 
  mutate(listing_date = lubridate::dmy(listing_date),
         transplant_date = lubridate::dmy(transplant_date),
         death_date = lubridate::dmy(death_date))
    
meld_scores <- meld_scores %>% 
  mutate(start_date = lubridate::dmy(start_date),
         end_date = lubridate::dmy(end_date))

# Identifying the listing_meld score and transplant_meld score as given in the documentation
patients_scores <- patients %>% 
  inner_join(meld_scores, by = "id") %>% 
  filter(!is.na(meld)) %>% 
  mutate(listing_meld_score = if_else(listing_date == start_date, meld, NA_integer_),
         transplant_meld_score = if_else(transplant_date >= start_date & 
                                           transplant_date <= end_date, meld, NA_integer_)) 

patients_max_scores <- meld_scores %>% 
  filter(!is.na(meld)) %>% 
  group_by(id) %>% 
  summarise(max_meld_score = max(meld))

# manually testing use cases
# meld_scores %>% filter(id == 89)
# patients %>% filter(id == 24)      # 10,16, 89
# patients_scores %>% filter(id == 89)


# About 500 patients have no MELD scores; the inner joins drop them
patients_all_scores <- patients_scores %>% 
  inner_join(patients_max_scores, by = "id") %>%  
  group_by(id) %>% 
  summarise(listing_meld_score = max(listing_meld_score, na.rm = TRUE),
            transplant_meld_score = max(transplant_meld_score, na.rm = TRUE),
            age = max(as.numeric(age)),
            max_meld_score = max(max_meld_score),
            died = if_else(is.na(max(death_date)), "Alive", "Died"),
            death_date_before = if_else(max(death_date) < max(listing_date), "data issue","no data issue")) %>%
  mutate(score_group = cut(transplant_meld_score, breaks = c(0, 13, 26, 39, 41), right = FALSE),
         # age_group is derived from age (it was previously cut from the MELD score)
         age_group = cut(age, breaks = seq(1, 90, by = 10), right = FALSE))
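
One thing worth checking in the summarise step above: `max(x, na.rm = TRUE)` returns `-Inf` (with a warning) when every value in a group is missing, and `-Inf` is not `NA`. The sketch below uses made-up toy data and a hypothetical helper, `max_or_na()`, to show the difference. Whether this affects the real data depends on whether any patient has no row matching their transplant date, which is an assumption to verify.

```r
library(dplyr)

# Toy data (made up): patient 2 has no transplant MELD score on any row
scores_toy <- tibble(
  id = c(1, 1, 2, 2),
  transplant_meld_score = c(NA, 22, NA, NA)
)

# Hypothetical helper: like max(na.rm = TRUE), but returns NA when nothing is left
max_or_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_real_ else max(x)
}

result <- scores_toy %>%
  group_by(id) %>%
  summarise(
    naive = suppressWarnings(max(transplant_meld_score, na.rm = TRUE)),
    safe  = max_or_na(transplant_meld_score)
  )

is.na(result$naive)  # FALSE FALSE: -Inf is not NA, so filter(!is.na(...)) keeps it
is.na(result$safe)   # FALSE TRUE
```

Because `-Inf < 13` is `TRUE`, a value like this would pass a later `!is.na()` filter and be labelled as a low score, so converting it to `NA` first may be the safer choice.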

Analysis Ask

There is controversy over whether transplant MELD is associated with mortality risk after receiving a liver transplant. Use the data to address this question. Write an abstract summarizing your findings, as if submitting to an academic conference. You may include one table and/or one figure, if you like. Not counting tables and figures, your abstract should be maximum 300 words. You do not need to cite any sources, or do any external reading.

Initial Exploration

# patients_all_scores %>% # count(age)
#   tabyl(score_group, age_group) %>% 
#   adorn_title()

# Playing around with some dataviz tools for preliminary visuals
# library(rpivotTable)
# rpivotTable(patients_all_scores)
# 
# library(esquisse)
# esquisse::esquisser()

# 10% of patients have high meld score of 40.
# patients_all_scores %>% 
#   tabyl(transplant_meld_score) %>% 
#   adorn_pct_formatting()

patients_all_scores %>% 
  ggplot(aes(x = age_group, y=transplant_meld_score, fill = factor(died))) + 
  geom_boxplot() + 
  labs(x = "Age Group ", title = "Distribution of Transplant Meld Scores by Age") +
  theme_light()  + 
 theme(legend.title = element_blank(), legend.position = c(0.8,0.2), 
       legend.background = element_rect(linetype="solid", colour ="grey"))

patients_all_scores <- patients_all_scores %>% 
 filter(is.na(death_date_before) | death_date_before == "no data issue") %>%
 filter(!is.na(transplant_meld_score)) %>%
 mutate(transplant_meld_score_group = if_else(transplant_meld_score < 13, "Low",
                                              if_else(transplant_meld_score < 26, "Medium",
                                                      if_else(transplant_meld_score < 39, "High","Highest(score of 40 only)")))) %>% 
  mutate(transplant_meld_score_group = factor(transplant_meld_score_group, 
                                              levels = c("Low","Medium","High","Highest(score of 40 only)")))

# 10% of patients have high meld score of 40
patients_all_scores %>% 
  tabyl(transplant_meld_score_group ) %>% 
  adorn_pct_formatting() %>% 
  gt() %>% 
  tab_header(title = "Percentage of patients by MELD score groups", subtitle = NULL)

Percentage of patients by MELD score groups
transplant_meld_score_group	n	percent
Low	8579	18.8%
Medium	18618	40.7%
High	13098	28.7%
Highest(score of 40 only)	5410	11.8%

patients_all_scores %>% 
  tabyl(transplant_meld_score_group ,  died) %>% 
  adorn_percentages("all") %>% 
  adorn_pct_formatting() %>% 
  adorn_ns("front") %>% 
  gt()

transplant_meld_score_group	Alive	Died
Low	8250 (18.1%)	329 (0.7%)
Medium	18028 (39.4%)	590 (1.3%)
High	12604 (27.6%)	494 (1.1%)
Highest(score of 40 only)	5161 (11.3%)	249 (0.5%)
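
The percentages in this table are shares of all patients, so they mostly reflect group size. To compare groups, the within-group mortality rate is more informative. A minimal sketch, using only the counts printed above (the vector names are illustrative):

```r
# Alive/Died counts per transplant MELD group, copied from the table above
alive <- c(Low = 8250, Medium = 18028, High = 12604, Highest = 5161)
died  <- c(Low = 329,  Medium = 590,   High = 494,   Highest = 249)

# Share of patients in each group who died
mortality <- died / (alive + died)

round(100 * mortality, 1)
# Low: 3.8, Medium: 3.2, High: 3.8, Highest: 4.6
```

Inside the pipeline, the same row-wise view should be obtainable by passing `"row"` instead of `"all"` to `adorn_percentages()`.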

patients_all_scores %>%
 ggplot() +
 aes(x = transplant_meld_score, y = after_stat(count / sum(count)), fill = as.factor(died)) +
 geom_histogram(bins = 30) +
 labs(x = " ", title = "Distribution of Transplant Meld Scores", y = "% of patients") +
 theme_light()  +
 scale_y_continuous(labels = scales::percent) + 
 theme(legend.title = element_blank(), legend.position = c(0.8,0.8), 
       legend.background = element_rect(linetype="solid", colour ="grey"))

patients_all_scores %>% 
 ggplot(aes(x = transplant_meld_score_group, y = after_stat(count / sum(count)), fill = as.factor(died))) +
 geom_bar() + 
 theme_light()  +
 labs(x = "Transplant Meld Score Group ", title = "Distribution of Transplant Meld Scores", 
      y = "% of patients") +
 scale_y_continuous(labels = scales::percent) + 
 theme(legend.title = element_blank(), legend.position = c(0.8,0.8), 
       legend.background = element_rect(linetype="solid", colour ="grey"))

# patients_scores %>% tabyl(age, transplant_meld_score ,  died)

Data Challenges

Of the 3047 patients recorded as having died, 1385 (nearly half) have a listing_date after their death_date; these records are excluded from the subsequent analysis.
About 500 patients have no MELD scores and have also been excluded.
The age column contains both actual ages and age groups, which may cause data quality problems when slicing by age.
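
The first two checks can be sketched as follows. The toy tables and their values are made up for illustration; only the column names (`id`, `listing_date`, `death_date`, `meld`) come from the data used above.

```r
library(dplyr)

# Toy stand-ins for patients and meld_scores (values are made up)
patients_toy <- tibble(
  id           = 1:4,
  listing_date = as.Date(c("2015-01-10", "2016-03-01", "2017-06-15", "2018-02-20")),
  death_date   = as.Date(c(NA, "2015-12-31", "2019-01-01", NA))
)
meld_toy <- tibble(id = c(1L, 1L, 3L), meld = c(12L, 18L, 30L))

# Patients whose recorded death precedes their listing date
death_before_listing <- patients_toy %>%
  filter(!is.na(death_date), death_date < listing_date)

# Patients with no MELD rows at all (an inner join would silently drop them)
no_meld <- patients_toy %>%
  anti_join(meld_toy, by = "id")

nrow(death_before_listing)  # 1 (patient 2)
nrow(no_meld)               # 2 (patients 2 and 4)
```

Counting these rows before filtering makes the exclusions explicit and easy to report alongside the results.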

Quick Summary

93% of patients have had a successful liver transplant. The top third of the MELD score range (>= 26) accounts for 40% of this patient population. In the given sample, mortality after transplant is very low and is spread across age buckets. MELD scores also appear to increase with age (boxplot above).

patients_all_scores %>% 
   tabyl(score_group, age_group ) %>% 
  gt()

Statistical Test Analysis Next Steps

Analytically speaking, I would approach this as a hypothesis testing problem and use a control group (another sample of patients who did not receive a liver transplant) and a treatment group (a sample of patients who received a liver transplant, i.e. this data) to test for association, such that:

Null hypothesis (H0): there is no association between transplant and mortality.
Alternative hypothesis (HA): transplant has some effect on the mortality of the receiving patients.

For a categorical outcome (mortality due to transplant or not), the statistical test we would use is Pearson’s chi-square test, which applies regardless of the number of categories of the outcome or the exposure variables.
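
As an illustration of the mechanics, the sketch below runs Pearson's chi-square test on the Alive/Died counts by transplant MELD group printed earlier. This is an assumption about how the test could be applied within this data (MELD group as the exposure), not the control/treatment comparison described above, since no control sample is available here.

```r
# Alive/Died counts per transplant MELD group, copied from the cross-tabulation above
counts <- matrix(
  c( 8250, 329,
    18028, 590,
    12604, 494,
     5161, 249),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    meld_group = c("Low", "Medium", "High", "Highest"),
    outcome    = c("Alive", "Died")
  )
)

# Pearson's chi-square test of independence between MELD group and outcome
test <- chisq.test(counts)

test$statistic        # chi-square statistic
test$parameter        # degrees of freedom: (4 - 1) * (2 - 1) = 3
test$p.value          # p-value under H0 of no association
round(test$expected)  # expected counts if group and outcome were independent
```

The test only asks whether the proportion of deaths differs across groups; it does not account for follow-up time or for confounders such as age.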