PART 1
PART 2
PART 3
Origin of the Topic
While searching for my biostats capstone dataset, I stumbled upon an organ transplant database online. It caught my attention because I am registered as an organ donor but know little about the success factors for organ donation. Curious about the statistics, I took a peek at the data and was amazed by the depth and breadth of the information. It looked like a rich source of data for learning how to apply statistics, and that is where my project began.
Factors that Affect Transplant Success
Organ transplants are remarkable life-saving procedures that continue to make progress on multiple fronts, thanks to advances in medicine and technology – improved surgical procedures, increased donor organ supply, and more efficient organ procurement and matching, to name a few.
Organ transplant success is measured in two ways: graft survival and patient survival. Graft survival is defined as the time from transplant to graft failure without the patient’s death[^1]. Graft is the medical term for the organ that is transplanted from one body (the donor) to another body (the recipient). Patient survival is the time from transplant to the patient’s death[^1].
There are many factors that affect graft and patient survival[^2,^3]. These factors appear to be organ-specific, based on my non-systematic perusal of online research articles. For example, the set of factors that affect kidney transplant outcomes is different from the set of factors that affect heart transplants. This makes sense to me since there are biological factors specific to an organ’s functioning.
Most scientific research articles focus on a single organ, so it is difficult to find scholarship covering all the factors that affect organ transplant success. As one example, liver transplants have 3 major risk factors: donor age, the length of time between when the organ is procured and when it is transplanted, and how urgently the recipient needs the organ[^4].
The Role of Ethnicity in Survival
This project examines the survival rates for different ethnicities and the relationship between ethnicity and survival rate. Among the many variables in the dataset, I chose to focus on ethnicity because a quick, initial analysis of the data revealed some differences in survival rates across ethnicities. I never thought that there could be differences in survival based on ethnicity. So this for me was an interesting question to study.
My online search on the relationship between ethnicity and survival yielded little information. Perhaps this information is only available in medical journals. It is a highly technical topic. But I did find a few articles that referenced ethnic disparities in graft survival and patient survival. The reasons are complex and include both immunological and non-immunological factors[^3].
Researchers at Brigham and Women’s Hospital in Harvard Medical School’s Division of Transplant Surgery studied racial and ethnic disparities in kidney transplantation. In their 2011 published study[^5], they described these factors as:
"a combination of behavioral, social, environmental, and occupational factors, as well as potential intended or unintended discrimination within the healthcare system. Immunologic factors such as human leukocyte antigen matching, composition of the organ donor pool, and patient immune response, all of which affect post-transplantation graft rejection rates and patient survival, also contribute to health disparities between ethnic groups."
McCaughan and Patterson of Belfast City Hospital and Queen’s University, Belfast discuss the lower graft survival statistics of African Americans in their 2014 published study on kidney transplants[^3]:
In the most recent data from the OPTN/SRTR database [29], graft outcomes were shown to vary by racial/ ethnic group, irrespective of donor type, and the differences tended to increase with time post-transplantation. An analysis of graft survival at various time points (3 months, 1, 5, and 10 years) showed that African Americans presented the lowest graft survival at each time interval. This finding was explained by a higher incidence of delayed graft function and acute rejection because of incompletely understood mechanisms, including higher levels of costimulatory molecules and expression of Duffy antigens on erythrocytes.
I observed similar disparities in my data analysis, although mine was focused on patient survival as opposed to graft survival. From here on, when you see “Survival Rate,” it refers to patient survival rate.
About OPTN
OPTN (Organ Procurement and Transplantation Network) is a public-private partnership that links all professionals involved in the U.S. organ donation and transplantation system. It operates under a regulatory framework established by the U.S. Department of Health and Human Services in 2000. The goal of OPTN is to increase the number of transplants, access to transplants, and survival rates[^6].
The OPTN maintains a public online database that contains data on every organ donation and transplant event occurring in the U.S. since 1987. This database allows institutions to register patients, match organs to patients, and manage time-sensitive patient data before and after transplants. It is used by all organ transplant programs, procurement organizations, and tissue labs[^7].
How OPTN Collects and Shares Data
All of OPTN’s data is collected online via web-based forms from hospitals, labs, and organ procurement organizations. The database is used by transplant professionals, public health researchers, scientists, patients, federal/state/local governments, and the general public[^7].
The OPTN database holds an enormous amount of data. You can look up the number of organ transplants, donors, and recipients sliced many different ways: by type of organ, age group, ethnicity, blood type, gender, and more, or by whether the donor organ came from a deceased or living person. You can even look at the data at a state or hospital level.
The OPTN website is set up like a search tool. You first select what you want to see (e.g. transplants, donors, waiting lists, survival rates), and then add criteria that filter the data further. The data reports are displayed on standard templates, but they are also customizable.
Link to OPTN’s database: https://optn.transplant.hrsa.gov/data/view-data-reports/national-data/
The most current survival data on the website contains 8 years of patient data for the time period 2008-2015.
The survival data can be categorized by organ, ethnicity, age group, and number of years post transplant. Survival data is presented as a group count or the mean of a certain group. Data for individual observations (one patient) are not provided. Therefore, all the statistical analyses that follow are based on means of means.
A mean in the data is the mean for the time period 2008-2015.
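One consequence of working with means of means is that every group counts equally, regardless of how many patients it represents. A small sketch with made-up numbers (not OPTN data) shows how an unweighted mean of group rates can differ from a rate weighted by group size:

```r
# Hypothetical group survival rates (%) and group sizes, for illustration only
rate <- c(90, 60)
n    <- c(9000, 1000)

# Unweighted mean of means: each group counts equally
unweighted <- mean(rate)

# Weighted mean: each group counts in proportion to its size
weighted <- weighted.mean(rate, n)

unweighted
weighted
```

This may help explain why a mean survival rate computed across rows can differ from a percentage computed from pooled counts.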
Unfortunately, the OPTN website does not present the survival data with both ethnicity and age group together. You can only get data for one or the other variable in one report, but not both. That’s too bad, as recipient age is a known factor for survival. It would have been interesting to see which factor, age or ethnicity, is more associated with survival.
There are 5 variables in my dataset.
Two are numerical variables: NumberAlive and SurvivalRate.
Three are categorical variables: Organ, Ethnicity, and YPT (Years Post Transplant).
There are 2 csv files. They contain essentially the same data formatted in different ways: the first consists of rates; the second of counts.
The export function on the OPTN website was not working (for many weeks). Therefore, I gave up trying to export and copied the data from my browser and pasted it into Excel. Then I manually deleted extraneous text (headings, footnotes).
When I read this file into R as a csv, the SurvivalRate column was represented as string, not numerical, data. This is because there were asterisks (*) in the column. The asterisks denote data that was not available for that particular row.
This string data had some quirk. I was not able to replace the asterisks in R using the gsub command, nor could I replace them in Excel via the standard Find & Replace function. Strangely, Excel would replace all the values in the column, not just the cells with the asterisk.
So as not to waste time on data cleaning and leave more time for analysis, I manually deleted the asterisks from the Survival Rate column in Excel. There weren’t that many (11 asterisks). Those cells were left blank.
Those blank cells became NA’s when the csv file was read into R. NA’s create errors when numerical calculations are done on that column, so I removed the rows containing NAs from the data frame.
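One possible reason the asterisks resisted replacement is that `*` is a special character in both tools: it is a regular-expression metacharacter in `gsub` and a wildcard in Excel's Find & Replace. A minimal sketch of how this could be handled in R, using a made-up column and a made-up inline csv rather than the real file:

```r
# Hypothetical stand-in for the pasted SurvivalRate column; values are made up
raw <- c("91.2", "*", "78.5")

# fixed = TRUE makes gsub() match a literal asterisk instead of a regex
cleaned <- gsub("*", "", raw, fixed = TRUE)
rates <- as.numeric(cleaned)   # the empty string becomes NA

# Alternatively, declare "*" as a missing-value marker when reading the file
csv_text <- "Organ,SurvivalRate\nHeart,91.2\nLung,*\nLiver,78.5"
df <- read.csv(text = csv_text, na.strings = "*")

# Drop rows with a missing survival rate
df <- df[!is.na(df$SurvivalRate), ]
df
```

With `read_csv` from readr, the `na` argument plays the same role as `na.strings` here, for example `na = c("", "NA", "*")`.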
There were 96 rows before removal. After removal, 85 rows remained. This is still a decent number of data points, certainly enough to allow this analysis to still be useful for learning.
The categorical variables – Organ, YPT (Years Post Transplant), and Ethnicity – needed to be re-coded as factors for the multiple regression analysis.
Here is the coding scheme I used:
ORGAN (Organ_code)
ETHNICITY (Eth_code): 0 = White, 1 = Black, 2 = Hispanic, 3 = Asian
YPT (YPT_code): 0 = 1 year, 1 = 3 years, 2 = 5 years
The original data file has a column for Number Alive (a count) and a column for Survival Rate (a percentage). Total number of organ recipients is not given. I needed the total to calculate proportions for the Chi-square analysis, e.g. the proportion of Alive to Total Number of Organ Recipients.
I adopted the “Don method” to calculate this total because it was faster than doing it in R. I divided Number Alive by Survival Rate (converted from a percentage to a proportion) for each row in the Excel file and put this total in a new column called Total Recipients.
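The same back-calculation could also be done in R. A minimal sketch, assuming the survival rate is stored as a percentage; the two rows below are made-up values, not OPTN figures:

```r
# Hypothetical rows; SurvivalRate is a percentage, as in the OPTN report
number_alive  <- c(900, 450)
survival_rate <- c(90.0, 75.0)

# Convert the percentage to a proportion, then divide to recover the total
total_recipients <- round(number_alive / (survival_rate / 100))
total_recipients
```

Rounding is used because a total recovered from a rounded percentage will rarely be a whole number.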
# Read the 1st csv file into R
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
surv_eth <- read_csv("Survival_Rates.csv")
## Parsed with column specification:
## cols(
## Organ = col_character(),
## Organ_code = col_double(),
## Ethnicity = col_character(),
## Eth_code = col_double(),
## YPT = col_character(),
## YPT_code = col_double(),
## NumberAlive = col_double(),
## SurvivalRate = col_double()
## )
surv_eth
## # A tibble: 96 x 8
## Organ Organ_code Ethnicity Eth_code YPT YPT_code NumberAlive
## <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 Heart 0 White 0 1 Ye… 0 5949
## 2 Heart 0 Black 1 1 Ye… 0 1973
## 3 Heart 0 Hispanic 2 1 Ye… 0 892
## 4 Heart 0 Asian 3 1 Ye… 0 333
## 5 Heart 0 White 0 3 Ye… 1 5286
## 6 Heart 0 Black 1 3 Ye… 1 1591
## 7 Heart 0 Hispanic 2 3 Ye… 1 713
## 8 Heart 0 Asian 3 3 Ye… 1 256
## 9 Heart 0 White 0 5 Ye… 2 4564
## 10 Heart 0 Black 1 5 Ye… 2 1283
## # … with 86 more rows, and 1 more variable: SurvivalRate <dbl>
# See how many rows there are in the data frame
NROW(surv_eth)
## [1] 96
# Remove the rows with NA in the SurvivalRate column
surv_eth <- surv_eth[!is.na(surv_eth$SurvivalRate),]
# See how many rows remain
NROW(surv_eth)
## [1] 85
# NOTE: This code comes from: https://rpubs.com/nhanwei/73936?source=post_page-----4311c6c386b6----------------------
The parameter to be studied is the mean survival rate of 4 ethnicity groups (White, Black, Hispanic, Asian).
For each ethnicity, I lumped all the types of organ transplants together (there are 8 organ types).
The mean for a particular organ is further broken out into 3 categories of YPT (years post transplant). Those 3 categories are: 1 year, 3 year, 5 year.
Therefore, each ethnicity has a total of 24 means: 8 organs multiplied by 3 YPT’s. This is equivalent to 24 rows of data for each ethnicity.
The mean survival rate for a particular ethnicity is the “grand mean” of these 24 means.
EXAMPLE: The mean survival rate for WHITE is the grand mean of:
… for a total of 24 rows for WHITE.
Due to missing values in the OPTN database, the Asian, Black, and Hispanic data have fewer than 24 rows.
SUMMARY OF AVAILABLE DATA:
There are 85 total rows in this dataset.
This section calculates the overall mean survival rate of all the ethnicities combined and the confidence interval.
NOTE: This mean encompasses means for ALL ORGANS at ALL YPT’s (1 year, 3 year, and 5 year). Therefore, the overall mean is for 85 rows of data (see Parameter section for explanation).
The mean survival rate for all ethnicities is 78.1% with a standard deviation of 13.1%.
# Calculate the OVERALL mean survival rate and SD (for all organs and ethnicities, at all YPTs)
mean(surv_eth$SurvivalRate)
## [1] 78.05294
sd(surv_eth$SurvivalRate)
## [1] 13.05047
Formula: y +/- t * se
y = mean
t = two-tailed 5% critical value
se = s / square root of n
s = 13.0505
n = 85
se = 13.0505 / square root of 85 = 1.4155
df = n - 1 = 84
t = 1.9886 obtained from code below
# Calculate t, the two-tailed 5% critical value
qt(0.975,df = 84)
## [1] 1.98861
95% Confidence Interval:
y +/- t * se
78.0529 +/- 1.9886 * 1.4155
95% CI: (75.2380, 80.8678)
INTERPRETATION: We are 95% confident that the true population mean survival rate for U.S. organ transplant recipients of all ethnicities is between 75.24% and 80.87%.
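The interval can also be computed entirely in R, which avoids rounding by hand. A minimal sketch using the mean, standard deviation, and row count reported above; the variable names are illustrative:

```r
# Summary statistics reported above
y_bar <- 78.05294   # overall mean survival rate (%)
s     <- 13.05047   # standard deviation (%)
n     <- 85         # number of rows

se     <- s / sqrt(n)               # standard error of the mean
t_crit <- qt(0.975, df = n - 1)     # two-tailed 5% critical value
ci     <- y_bar + c(-1, 1) * t_crit * se   # lower and upper bounds

se
round(ci, 2)
```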
This section calculates the mean survival rate for each ethnicity.
Again, these means encompass ALL ORGANS and ALL YPT (1 year, 3 year, and 5 year).
# Calculate the mean survival rate & SD for WHITE
white <- surv_eth %>%
filter(Ethnicity == "White")
mean(white$SurvivalRate)
## [1] 74.84583
sd(white$SurvivalRate)
## [1] 14.99307
# Calculate the mean survival rate & SD for BLACK
black <- surv_eth %>%
filter(Ethnicity == "Black")
mean(black$SurvivalRate)
## [1] 76.96087
sd(black$SurvivalRate)
## [1] 12.19675
# Calculate the mean survival rate & SD for HISPANIC
hispanic <- surv_eth %>%
filter(Ethnicity == "Hispanic")
mean(hispanic$SurvivalRate)
## [1] 78.59565
sd(hispanic$SurvivalRate)
## [1] 13.04591
# Calculate the mean survival rate & SD for ASIAN
asian <- surv_eth %>%
filter(Ethnicity == "Asian")
mean(asian$SurvivalRate)
## [1] 84.02667
sd(asian$SurvivalRate)
## [1] 9.646722
# Calculate the mean survival rate for each ethnicity
# Arrange the rows (ethnicities) from highest to lowest survival rate
meanSR_by_eth <- surv_eth %>%
group_by(Ethnicity) %>%
summarize(MeanSR = mean(SurvivalRate)) %>%
arrange(desc(MeanSR))
meanSR_by_eth
## # A tibble: 4 x 2
## Ethnicity MeanSR
## <chr> <dbl>
## 1 Asian 84.0
## 2 Hispanic 78.6
## 3 Black 77.0
## 4 White 74.8
# Create side-by-side boxplots of the mean survival rate for the 4 ethnicities
boxplot(SurvivalRate ~ Ethnicity, data = surv_eth, main = "Mean Survival Rate by Ethnicity", ylab = "Patient Survival Rate (% Alive)")
The boxplots suggest some variation between the ethnicity means, as well as considerable spread within each ethnicity.
But is this difference statistically significant? The ANOVA test in the next section will tell.
Note: The outlier for Asian may be due to Asian having a smaller number of data points than the other ethnicities. There were 3 NA’s for Asian. So, Asian had 3 fewer rows out of 8 total possible rows. This is a large proportion that is missing, and it possibly distorts the findings.
Although this is not the central topic of the project, I’d like to show the differences in mean survival rate for various organs because it is informative.
Compare the mean survival rates for each type of organ.
# Calculate the mean survival rate for each organ type
# Arrange the rows (organ types) from highest to lowest survival rate
meanSR_by_organ <- surv_eth %>%
group_by(Organ) %>%
summarize(MeanSR = mean(SurvivalRate)) %>%
arrange(desc(MeanSR))
meanSR_by_organ
## # A tibble: 8 x 2
## Organ MeanSR
## <chr> <dbl>
## 1 Kidney / Pancreas 88.5
## 2 Kidney 87.6
## 3 Heart 84.0
## 4 Liver 80.3
## 5 Lung 70.9
## 6 Heart / Lung 69.7
## 7 Pancreas 69.6
## 8 Intestine 64.9
# Create side-by-side boxplots of the mean survival rate for the 8 types of organ transplants
library(ggplot2)
p <- ggplot(surv_eth, aes(x = Organ, y = SurvivalRate)) +
ggtitle("Mean Survival Rate for Different Organs")+
ylab("Patient Survival Rate (% Alive)")+
geom_boxplot(aes(fill=Organ))
p + theme_minimal()
The side-by-side boxplots are impactful. These clearly show mean survival rate varying by type of organ.
Kidney and kidney/pancreas together have the highest survival rates (about 88%), while intestine, pancreas alone, lung, and heart/lung together have the lowest (65-71%).
It is interesting that pancreas alone is 69.6%, but pancreas with kidney is 88.5%. Similarly, heart alone is 84%, but heart with lung is only 69.7%.
Let’s look at the differences in mean survival rate for various years post transplant, because like organs, the number of years post transplant makes a difference in survival.
Compare the mean survival rates for 1 year vs 3 years vs 5 years post transplant.
# Calculate the mean survival rate for the 3 YPT intervals
# Arrange the rows (YPT intervals) from highest to lowest survival rate
meanSR_by_YPT <- surv_eth %>%
group_by(YPT) %>%
summarize(MeanSR = mean(SurvivalRate)) %>%
arrange(desc(MeanSR))
meanSR_by_YPT
# Create side-by-side boxplots of the mean survival rate for the 3 YPT intervals
boxplot(SurvivalRate ~ YPT, data = surv_eth, main="Mean Survival Rate by Years Post Transplant", ylab="Patient Survival Rate (% Alive)")
The side-by-side boxplots also show a noticeable difference in mean survival rates for 1 year vs 3 year vs 5 years post transplant.
I performed an ANOVA test to evaluate the difference in the mean survival rates between the four ethnicities.
The mean survival rate for a particular ethnicity includes ALL ORGANS and ALL YPT (1 year, 3 year, and 5 year data).
NULL HYPOTHESIS:
There is no difference in mean survival rate among the ethnicities after a transplant.
Mu (white) = Mu (black) = Mu (hispanic) = Mu (asian)
ALTERNATIVE HYPOTHESIS:
There is at least one ethnicity whose mean survival rate after a transplant is different from the other ethnicities.
# Perform ANOVA on mean survival rate for all 4 ethnicities. All YPT's included.
a1 <- aov(SurvivalRate ~ Ethnicity, data = surv_eth)
summary(a1)
## Df Sum Sq Mean Sq F value Pr(>F)
## Ethnicity 3 816 272.1 1.634 0.188
## Residuals 81 13490 166.5
a1
## Call:
## aov(formula = SurvivalRate ~ Ethnicity, data = surv_eth)
##
## Terms:
## Ethnicity Residuals
## Sum of Squares 816.339 13490.093
## Deg. of Freedom 3 81
##
## Residual standard error: 12.90521
## Estimated effects may be unbalanced
Now check the residuals to fitted plot and the qq plot.
# Plot the fitted values of the AOV to the residuals
plot(a1$fitted.values, residuals(a1))
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
# Plot the normal qq plot
qqnorm(residuals(a1))
qqline(residuals(a1))
The residuals vs. fitted plot looks fine. The points on the Q-Q plot stray from the line, so the residuals do not look very normal. But that is likely due to the “outlier” in the Asian dataset.
I am accepting the Q-Q plot because I think the Asian outlier is an artificial outlier. The lowest survival rate for Asian was 57%, which is in the same range as the lowest rates for the other ethnicities. In fact, the lowest rate for White was even lower (48%).
CONCLUSION:
The p-value is 0.188, which is above the 0.05 level. Fail to reject the null.
There is no compelling evidence that there is a difference in mean survival rate after an organ transplant among the four ethnicities.
This result is surprising to me. The means look different to me in the side-by-side boxplots. Perhaps this is due to incomplete data for Asian and the lowest survival rate being treated as an outlier.
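The p-value can be recovered from the F statistic and degrees of freedom in the ANOVA table, which is a quick way to double-check the number quoted in a conclusion. A minimal sketch, with the three values copied from the `summary(a1)` output and illustrative variable names:

```r
# Values copied from the ANOVA table above
f_value <- 1.634   # F statistic for Ethnicity
df1     <- 3       # degrees of freedom for Ethnicity
df2     <- 81      # residual degrees of freedom

# Upper-tail probability of the F distribution
p_value <- pf(f_value, df1, df2, lower.tail = FALSE)
round(p_value, 3)
```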
I performed a Chi-square test to see if there is a relationship between ethnicity and survival rate.
The test compares the proportions of organ recipients who are still alive after a transplant between the 4 ethnicities.
Like for ANOVA, the data for this test includes rows for ALL ORGANS and ALL YPT’s.
NULL HYPOTHESIS:
There is no relationship between ethnicity and survival rate after an organ transplant.
ALTERNATIVE HYPOTHESIS:
There is some association between ethnicity and survival rate after an organ transplant.
# Read the 2nd csv file
# This file shows counts (number alive) to allow proportions to be computed for the Chi-square test
prop_eth <- read_csv("Alive_Proportions.csv")
## Parsed with column specification:
## cols(
## Organ = col_character(),
## Organ_code = col_double(),
## Ethnicity = col_character(),
## Eth_code = col_double(),
## YPT = col_character(),
## YPT_code = col_double(),
## NumberAlive = col_double(),
## SurvivalRate = col_double(),
## TotalRecipients = col_double()
## )
prop_eth
## # A tibble: 96 x 9
## Organ Organ_code Ethnicity Eth_code YPT YPT_code NumberAlive
## <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 Heart 0 White 0 1 Ye… 0 5949
## 2 Heart 0 Black 1 1 Ye… 0 1973
## 3 Heart 0 Hispanic 2 1 Ye… 0 892
## 4 Heart 0 Asian 3 1 Ye… 0 333
## 5 Heart 0 White 0 3 Ye… 1 5286
## 6 Heart 0 Black 1 3 Ye… 1 1591
## 7 Heart 0 Hispanic 2 3 Ye… 1 713
## 8 Heart 0 Asian 3 3 Ye… 1 256
## 9 Heart 0 White 0 5 Ye… 2 4564
## 10 Heart 0 Black 1 5 Ye… 2 1283
## # … with 86 more rows, and 2 more variables: SurvivalRate <dbl>,
## # TotalRecipients <dbl>
NROW(prop_eth)
## [1] 96
# More data cleaning
# Remove the rows with NA in the TotalRecipients column
prop_eth <- prop_eth[!is.na(prop_eth$TotalRecipients),]
# See how many rows remain
NROW(prop_eth)
## [1] 85
# Calculate the number of people still alive after a transplant for each ethnicity
# This number is the total alive (includes alive after 1/3/5 years)
alive_by_eth <- prop_eth %>%
group_by(Ethnicity) %>%
summarize(Alive = sum(NumberAlive)) %>%
arrange(Ethnicity)
alive_by_eth
## # A tibble: 4 x 2
## Ethnicity Alive
## <chr> <dbl>
## 1 Asian 13933
## 2 Black 55520
## 3 Hispanic 39376
## 4 White 164594
# Calculate the total number of organ recipients for each ethnicity
total_rec <- prop_eth %>%
group_by(Ethnicity) %>%
summarize(TotalRecipients = sum(TotalRecipients)) %>%
arrange(Ethnicity)
total_rec
## # A tibble: 4 x 2
## Ethnicity TotalRecipients
## <chr> <dbl>
## 1 Asian 15711
## 2 Black 67443
## 3 Hispanic 45400
## 4 White 196713
# Calculate the percentage still alive for each ethnicity
alive_percent <- prop_eth %>%
group_by(Ethnicity) %>%
summarize(PercentAlive = 100* sum(NumberAlive) / sum(TotalRecipients)) %>%
arrange(Ethnicity)
alive_percent
## # A tibble: 4 x 2
## Ethnicity PercentAlive
## <chr> <dbl>
## 1 Asian 88.7
## 2 Black 82.3
## 3 Hispanic 86.7
## 4 White 83.7
# Create vectors for the summary table
ethnicity <- c("Asian", "Black", "Hispanic", "White")
alive <- c(13933, 55520, 39376, 164594)
total <- c(15711, 67443, 45400, 196713)
percent_alive <- c(88.7, 82.3, 86.7, 83.7)
# Create a table showing the percent alive for each ethnicity
alive_summary <- data.frame(ethnicity,alive, total, percent_alive)
alive_summary
## ethnicity alive total percent_alive
## 1 Asian 13933 15711 88.7
## 2 Black 55520 67443 82.3
## 3 Hispanic 39376 45400 86.7
## 4 White 164594 196713 83.7
str(alive_summary)
## 'data.frame': 4 obs. of 4 variables:
## $ ethnicity : Factor w/ 4 levels "Asian","Black",..: 1 2 3 4
## $ alive : num 13933 55520 39376 164594
## $ total : num 15711 67443 45400 196713
## $ percent_alive: num 88.7 82.3 86.7 83.7
# Create a new data frame for the Chi-sq test - it holds only the number alive vs total recipients for each ethnicity
alive_df <- data.frame(ethnicity,alive, total)
alive_df
## ethnicity alive total
## 1 Asian 13933 15711
## 2 Black 55520 67443
## 3 Hispanic 39376 45400
## 4 White 164594 196713
# Run a Chi-square test on the alive_df data frame
chisq.test(alive_df[,-1]) # [,-1] removes the 1st column (Ethnicity) from consideration
##
## Pearson's Chi-squared test
##
## data: alive_df[, -1]
## X-squared = 57.029, df = 3, p-value = 2.534e-12
CONCLUSION:
The p-value (2.534e-12) is very low. Reject the null.
There is very strong evidence that there is a relationship between ethnicity and survival rate after an organ transplant.
This finding is also surprising to me. I did not expect to see such resounding evidence for a relationship.
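As a cross-check, `chisq.test` treats each column of its input as a separate outcome category, so a table of alive versus not alive per ethnicity may be a closer match to the hypotheses above than a table of alive versus total. A sketch under that assumption, reusing the counts from the summary table; the `not_alive` column is derived here and does not appear in the OPTN report:

```r
# Counts copied from the summary table above
ethnicity <- c("Asian", "Black", "Hispanic", "White")
alive <- c(13933, 55520, 39376, 164594)
total <- c(15711, 67443, 45400, 196713)

# Derived column (an assumption of this sketch): recipients no longer alive
not_alive <- total - alive

# 4 x 2 contingency table: ethnicity by outcome
outcome <- cbind(alive, not_alive)
rownames(outcome) <- ethnicity

res <- chisq.test(outcome)
res$statistic
res$p.value
```

`prop.test(alive, total)` is another way to run the same comparison of proportions across the 4 groups.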
Since Survival Rate is the only continuous variable in my dataset suitable as a response, and there is no continuous predictor, it is not possible to do a simple linear regression. But it is possible to do a multiple regression using this continuous variable and several categorical variables.
I ran a model to see if type of organ, ethnicity, and number of years post transplant are associated with survival rate.
The response variable is Survival Rate.
The predictor variables are: Organ, YPT (Years Post Transplant), and Ethnicity.
Note that the predictor variables are entered as numeric codes (e.g. 0, 1, 2, etc.), so the model estimates a single slope for each one. Refer to Parameter section in Part 2 for the coding key.
NULL HYPOTHESIS:
There is no relationship between organ, YPT, or ethnicity and survival rate.
ALTERNATIVE HYPOTHESIS:
There is a relationship between one or more of these categories (organ, YPT, or ethnicity) and survival rate.
# Create a multiple regression model with all 3 predictors
# Note that the predictor variables are numeric codes (0, 1, 2, ...), so each gets one slope.
fit1 <- lm(SurvivalRate ~ Organ_code + Eth_code + YPT_code, data = surv_eth)
plot(fit1)
# Plot histogram of residuals to see if there are any outliers
hist(fit1$residuals)
1). Linearity - N/A (since these predictor variables are categorical)
2). Normality of Residuals - normal
In the Residuals vs Fitted plot above, the scattering of the data points looks fairly random.
The plot identifies 3 outliers, which I left in the model:
The residuals histogram shows a little bit of left skew, but I think this could be considered close enough to normal.
3). Constant Variability - N/A (since there is no linear model)
# Look at the statistics
summary(fit1)
##
## Call:
## lm(formula = SurvivalRate ~ Organ_code + Eth_code + YPT_code,
## data = surv_eth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.651 -6.020 1.295 7.344 17.836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86.7372 2.6761 32.412 < 2e-16 ***
## Organ_code -0.7502 0.4718 -1.590 0.11574
## Eth_code 2.7596 0.9909 2.785 0.00666 **
## YPT_code -9.9361 1.3060 -7.608 4.42e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.763 on 81 degrees of freedom
## Multiple R-squared: 0.4603, Adjusted R-squared: 0.4403
## F-statistic: 23.03 on 3 and 81 DF, p-value: 7.072e-11
Assume alpha level = 0.10.
The p-values for Ethnicity (0.00666) and Years Post Transplant (4.42e-11) are very low and below alpha. These variables are significant.
The p-value for Organ (0.11574) is greater than alpha but pretty close to it. So Organ could have some small effect on survival rate.
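Because `Organ_code`, `Eth_code`, and `YPT_code` are read in as numbers, `lm` fits one slope per variable and treats the codes as evenly spaced scores. If the goal is to treat a variable such as organ as a true category, it could be wrapped in `factor()`, which gives one coefficient per level relative to a baseline. A minimal sketch on a small made-up data frame (not the OPTN data):

```r
# Made-up data for illustration only
toy <- data.frame(
  SurvivalRate = c(90, 85, 80, 88, 82, 76, 70, 62, 55, 72, 65, 58),
  Organ_code   = rep(c(0, 1, 2, 3), each = 3),
  YPT_code     = rep(c(0, 1, 2), times = 4)
)

# Numeric coding: a single slope for Organ_code
fit_numeric <- lm(SurvivalRate ~ Organ_code + YPT_code, data = toy)

# Factor coding: one coefficient per organ level (baseline = code 0)
fit_factor <- lm(SurvivalRate ~ factor(Organ_code) + YPT_code, data = toy)

length(coef(fit_numeric))  # intercept + 2 slopes
length(coef(fit_factor))   # intercept + 3 organ contrasts + 1 slope
```

With factor coding, the overall contribution of organ could be judged with `anova(fit_factor)` rather than from a single slope.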
# Remove ORGAN from model
fit2 <- lm(SurvivalRate ~ Eth_code + YPT_code, data = surv_eth)
summary(fit2)
##
## Call:
## lm(formula = SurvivalRate ~ Eth_code + YPT_code, data = surv_eth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.751 -8.090 0.610 7.253 17.555
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.156 2.147 39.193 < 2e-16 ***
## Eth_code 2.734 1.000 2.734 0.00766 **
## YPT_code -10.005 1.317 -7.595 4.4e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.854 on 82 degrees of freedom
## Multiple R-squared: 0.4435, Adjusted R-squared: 0.4299
## F-statistic: 32.67 on 2 and 82 DF, p-value: 3.674e-11
The p-value for Ethnicity (0.00766) increases slightly.
The p-value for YPT (4.4e-11) is almost unchanged.
Organ makes almost no difference to the model. Therefore, remove organ from the model.
(with only 2 predictor variables)
SurvivalRate = 84.156 + 2.734 * Ethnicity - 10.005 * YPT
To interpret the slopes, recall these factor codes defined in Part 2:
ETHNICITY: 0 = White, 1 = Black, 2 = Hispanic, 3 = Asian
YPT: 0 = 1 year, 1 = 3 years, 2 = 5 years
CONCLUSION:
There is some relationship between ethnicity and survival rate, and there is some relationship between YPT and survival rate.
For each one-step change in the ethnicity code, the predicted survival rate increases as follows:
- Blacks have a 2.73 percentage point higher survival rate compared to Whites, all other factors held constant.
- Hispanics have a 5.47 percentage point higher survival rate compared to Whites, all other factors held constant.
- Asians have an 8.20 percentage point higher survival rate compared to Whites, all other factors held constant.
For every additional 2 years after a transplant, the survival rate decreases by 10.01 percentage points, all other factors held constant.
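To make the equation concrete, it can be wrapped in a small function that returns the predicted survival rate for a given pair of codes. A minimal sketch: the coefficients are copied from the two-predictor model above, the codes follow the data printout, and the function name `predict_sr` is illustrative:

```r
# Coefficients from the two-predictor model above
b0    <- 84.156
b_eth <- 2.734
b_ypt <- -10.005

# Predicted survival rate (%) for given codes
# eth_code: 0 = White, 1 = Black, 2 = Hispanic, 3 = Asian
# ypt_code: 0 = 1 year, 1 = 3 years, 2 = 5 years
predict_sr <- function(eth_code, ypt_code) {
  b0 + b_eth * eth_code + b_ypt * ypt_code
}

predict_sr(0, 0)  # White, 1 year post transplant
predict_sr(3, 2)  # Asian, 5 years post transplant
```

Because the codes enter the model as numbers, each step in the ethnicity code shifts the prediction by the same amount.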
MEAN SURVIVAL RATE
The mean survival rate for all ethnicities is 78.1% with a standard deviation of 13.1%.
95% CI: (75.2380, 80.8678)
Interpretation: We are 95% confident that the true population mean survival rate for U.S. organ transplant recipients of all ethnicities is between 75.2% and 80.9%.
ANOVA
The p-value is 0.188. Fail to reject the null.
Conclusion: There is no compelling evidence that there is a difference in mean survival rate after an organ transplant among the four ethnicities.
CHI-SQUARE
The p-value (2.534e-12) is very low. Reject the null.
Conclusion: There is very strong evidence that there is a relationship between ethnicity and survival rate after an organ transplant.
MULTIPLE REGRESSION
Survival Rate = 84.156 + 2.734 * Ethnicity - 10.005 * YPT
Conclusion:
There is some relationship between ethnicity and survival rate, and there is some relationship between YPT and survival rate.
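For every change in ethnicity, the survival rate increases as follows:
- Blacks have a 2.73 percentage point higher survival rate compared to Whites, all other factors held constant.
- Hispanics have a 5.47 percentage point higher survival rate compared to Whites, all other factors held constant.
- Asians have an 8.20 percentage point higher survival rate compared to Whites, all other factors held constant.
For every additional 2 years after a transplant, the survival rate decreases by 10.01 percentage points, all other factors held constant.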
There is no statistically significant difference in mean survival rate among the 4 ethnicities, based on the ANOVA test.
But there is some relationship between ethnicity and survival rate, demonstrated by the Chi-square test and Multiple Regression model. Additionally, there is an association between the number of years post transplant and survival rate.
It is interesting that there is statistically no difference in mean survival rate across ethnicities, but there is an association between ethnicity and survival. I suppose the latter can be true even if the differences between ethnicities are not large enough to be statistically significant.
The association between YPT and survival rate seems logical to me. The longer that time passes, the greater the likelihood of dying, whether due to complications from the transplant, other disease, natural aging, or other factors.
I personally find these results interesting. It never occurred to me that survival rates might differ based on race/ethnicity. I just assumed that the more obvious biological factors, such as the recipient’s underlying health, age, and the quality of the donor organ/transplant, would largely determine survival post transplant. That ethnicity might play a role was a surprise. I’m guessing it might be a surprise for many other people too, since background information on ethnicity as a factor was not easy to find online.
I’m not sure how useful this finding is in terms of being actionable. If ethnicity plays a biological role in survival, it is obviously impossible to alter it for a patient. But what can be useful is studying the etiology or mechanisms that lead one ethnicity, such as Asian, to fare better after a transplant. By understanding the etiology of survival, we might find other ways to improve patient outcomes.
The statistical analysis was not easy to set up. The hardest part of this project was figuring out what part of the data to use. Because there is so much of it, it is easy to get lost in it. The OPTN database is set up in such a way that the burden is on the user to know what he/she wants to get out of it.
Additionally, much of the data is categorical. All the numbers are aggregates. There are no independent observations. So I found it challenging picking the variables to use, and structuring the dataset in a way that would allow statistical analysis to be performed on something meaningful. While there are plenty of variables, only a few variables were meaningful to me (survival, waiting time).
The analysis was diverse in terms of examining the parameter with different statistical tests, but I am not so sure it was accurate. I did not have a strong grasp of organ transplant terminology or methodology for reporting patient outcomes, since I am not a medical professional. So throughout the entire analysis, I kept asking myself if the way I was manipulating the data was correct. I could not help thinking that my groupings might be overly simplistic and that there could be errors in the results, for example from grouping together rows that should not be grouped due to redundancy.
I would have liked to analyze waiting time (for an organ), because it would be very interesting to compare that with survival rates. But the waiting time data in the database was too skimpy to use. One deficiency in the OPTN survival data report was not having ethnicity and age information together, as mentioned earlier. That would have made the multiple regression analysis much more interesting.
But all in all, this was a great exercise to apply everything I learned in class to a topic that is meaningful to me. It also solidified my understanding of these core statistical tests, and more importantly, will help me retain it.
[^1] Survival Rates Following Transplant. (n.d.). Retrieved from: https://nhsbtdbe.blob.core.windows.net/umbraco-assets-corp/4659/section_11_survival_rates.pdf This appears to be a chapter from a medical textbook.
[^2] Legendre, C., Canaudi, G., Martinez, F. (September 2013). Factors influencing long-term outcome after kidney transplantation. Retrieved from: https://onlinelibrary.wiley.com/doi/pdf/10.1111/tri.12217.
[^3] McCaughan, J., Patterson, C., Maxwell, A., Courtney, A. (2014, September 24). Factors influencing survival after kidney transplant failure. Retrieved from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4178314/
[^4] Journal Of The American Medical Association (2005, March 31). Success Of Liver Transplantation May Be Most Influenced By Three Risk Factors. Retrieved from: https://www.sciencedaily.com/releases/2005/03/050329143526.htm.
[^5] Malek SK, Keys BJ, Kumar S, Milford E, Tullius SG. (May 2011). Racial and ethnic disparities in kidney transplantation. Retrieved from: https://www.ncbi.nlm.nih.gov/pubmed/21166727.
[^6] OPTN - About the OPTN. (n.d.). Retrieved from https://optn.transplant.hrsa.gov/governance/about-the-optn/
[^7] OPTN - About Data. (n.d.). Retrieved from https://optn.transplant.hrsa.gov/data/about-data/