Introduction
Organ transplants are a crucial process which saves many lives each year. One of the major challenges with organ transplants is the lack of available organs and the consequent difficulty associated with selecting recipients to receive the organ. Many factors are considered including age, chance of survival, medical history, time waiting for a transplant and so on. The Australia and New Zealand Dialysis and Transplant Registry contains data regarding patient histories and outcomes of kidney transplants for renal failure. This data will be used to determine clustering patterns of certain variables that may impact the success of a transplant, analyse the correlation between a patients time spent on dialysis to the success of a transplant and the amount of previous grafts they have received. Furthermore, the data will be used to assess whether a linear model can account for patients time spent on dialysis by using their age at transplant as a predictor.
Data Manipulation
Data download
setwd("/Users/jaydentreisman/Downloads/42264_EP")
patientData <- read.csv('42264_AnzdataPatients.csv')
transplantData <- read.csv('42264_AnzdataTransplants.csv')
creatineData <- read.csv('42264_AnzdataTransplantSerumCreatinine.csv')
transplantData2 <- transplantData[!duplicated(transplantData$id),]
Data <- full_join(patientData,transplantData2,creatineData, by = "id")
str(Data)
## 'data.frame': 999 obs. of 60 variables:
## $ id : chr "P10002268" "P10002998" "P10003314" "P10013539" ...
## $ gendercode : chr "M" "M" "M" "M" ...
## $ latereferralcode : chr "N" "N" "N" "N" ...
## $ racialorigincode : chr "Caucasoid" "South-East Asian - Vietnamese" "Caucasoid" "Caucasoid" ...
## $ racialoriginother : chr "" "" "" "" ...
## $ primaryrenaldiseasecode : chr "Cystinosis" "Mesangial Proliferative (Iga+)" "Presumed GN (No Biopsy)" "Spina Bifida or Myelomeningocoele" ...
## $ primaryrenaldiseaseother : chr "" "" "" "PREFLUX NEPHROPATHY" ...
## $ biopsycode : chr "" "Y" "N" "N" ...
## $ creatinineatentry : int 600 1234 1025 646 NA 449 NA 503 940 542 ...
## $ height : num 113 152 162 175 154 ...
## $ weight : num 21.8 33.6 53 85 51.5 NA 22.2 NA 51 NA ...
## $ smokingcode : chr "Never" "Never" "Never" "Never" ...
## $ rrtstartdate : chr "01may1990" "30aug1990" "07apr1991" "08may1990" ...
## $ rrtstartcode : chr "APD/IPD Hospital" "Hospital HD" "Hospital HD" "Hospital HD" ...
## $ deathdate : chr "" "" "20feb2016" "" ...
## $ graftsustaininglifecode : chr "" "" "Y" "" ...
## $ cancerever : chr "No" "Yes" "No" "No" ...
## $ chroniclungcode : chr "N" "N" "N" "N" ...
## $ coronaryarterycode : chr "N" "N" "N" "N" ...
## $ peripheralvascularcode : chr "Y" "N" "Y" "Y" ...
## $ cerebrovasularcode : chr "Y" "N" "N" "N" ...
## $ diabetescode : chr "N" "N" "N" "N" ...
## $ diabetescode_text : chr "No" "No" "No" "No" ...
## $ graftno : int 2 2 3 2 1 1 2 2 2 2 ...
## $ transplantdate : chr "06may2008" "27jan2015" "02may2009" "12aug2014" ...
## $ transplantcentrestate : chr "SA" "NSW" "VIC" "NZ" ...
## $ recipientantibodycmvcode : chr "Positive" "Positive" "Positive" "Positive" ...
## $ recipientantibodyebvcode : chr "Positive" "Positive" "Positive" "Positive" ...
## $ donorsourcecode : chr "Friend" "Deceased" "Deceased" "Sister" ...
## $ donorsourceother : chr "" "" "" "" ...
## $ donorage : int 46 59 45 58 55 22 21 47 51 62 ...
## $ donorgendercode : chr "F" "F" "F" "F" ...
## $ ischaemia : int 20 20 7 16 5 11 16 15 15 3 ...
## $ imediatefunctioncode : chr "Spontaneous fall in Se.Creatinine by 10% within 24 Hours" "Spontaneous fall in Se.Creatinine by 10% within 24 Hours" "Spontaneous fall in Se.Creatinine by 10% within 24 Hours" "Spontaneous fall in Se.Creatinine by 10% within 24 Hours" ...
## $ firstprovendate : chr "" "" "" "" ...
## $ diseaseingraftcode : chr "" "" "" "" ...
## $ graftfailurecausecode : chr "" "" "" "" ...
## $ graftfailurecauseother : chr "" "" "" "" ...
## $ graftfailuredate : chr "" "" "" "" ...
## $ countrycode : chr "" "" "" "" ...
## $ endtransplantcode : chr "S" "S" "Z" "S" ...
## $ endtransplantdate : chr "31dec2016" "31dec2016" "20feb2016" "31dec2016" ...
## $ lastknownstatus : chr "S" "S" "Z" "S" ...
## $ lastfollowupdate : chr "31dec2016" "31dec2016" "20feb2016" "31dec2016" ...
## $ ageattransplant : int 27 60 34 54 48 59 24 52 48 54 ...
## $ transplantstatus : int 0 0 1 0 1 1 1 0 1 1 ...
## $ transplantperiod : int 3161 704 2485 872 31 1575 2742 6023 1 3143 ...
## $ alivestatus : int 0 0 1 0 1 1 0 0 1 1 ...
## $ aliveperiod : int 3161 704 2485 872 31 1575 3418 6023 1787 3143 ...
## $ hlamismatchesa : int 2 0 2 0 2 2 1 1 0 2 ...
## $ hlamismatchesb : int 2 2 0 1 2 2 1 1 1 1 ...
## $ hlamismatchesdr : int 0 0 0 0 2 1 1 0 0 2 ...
## $ hlamismatchesdq : int NA 0 NA 0 NA 1 NA 0 0 NA ...
## $ maxcytotoxicantibodies : int 0 0 64 0 99 0 3 3 89 46 ...
## $ currentcytotoxicantibodies : int 0 0 56 0 3 0 0 0 55 0 ...
## $ timeondialysis : num 18 24.4 18.1 24.3 13.9 ...
## $ lasttreatmentcodepretransplant: chr "M" "DC" "D" "DC" ...
## $ lasttreatmentdatepretransplant: chr "13dec2006" "01jan2011" "14jul2008" "01jan2011" ...
## $ lasttreatmentpretransplant : chr "CAPD" "HD Satellite - Conventional" "Satellite HD" "HD Satellite - Conventional" ...
## $ gfailcat : chr "" "" "" "" ...
Variable selection
use <- c('creatinineatentry','height','weight','graftno','donorage','ischaemia','ageattransplant','hlamismatchesa','hlamismatchesb','hlamismatchesdr','maxcytotoxicantibodies','currentcytotoxicantibodies','timeondialysis', 'transplantstatus')
Data <- Data[,use]
Missingness
Data[Data==""] = NA
Data[Data=="-"] = NA
Data <- na.omit(Data)
dim(Data)
## [1] 750 14
Data <- droplevels(Data)
str(Data)
## 'data.frame': 750 obs. of 14 variables:
## $ creatinineatentry : int 600 1234 1025 646 940 1294 874 320 508 1600 ...
## $ height : num 113 152 162 175 162 ...
## $ weight : num 21.8 33.6 53 85 51 43.2 55.6 55 78 82 ...
## $ graftno : int 2 2 3 2 2 2 2 2 1 1 ...
## $ donorage : int 46 59 45 58 51 42 66 62 49 67 ...
## $ ischaemia : int 20 20 7 16 15 4 4 5 4 7 ...
## $ ageattransplant : int 27 60 34 54 48 29 51 38 52 52 ...
## $ hlamismatchesa : int 2 0 2 0 0 2 1 1 0 2 ...
## $ hlamismatchesb : int 2 2 0 1 1 0 2 1 1 2 ...
## $ hlamismatchesdr : int 0 0 0 0 0 0 1 0 1 1 ...
## $ maxcytotoxicantibodies : int 0 0 64 0 89 0 0 74 5 22 ...
## $ currentcytotoxicantibodies: int 0 0 56 0 55 0 0 30 0 0 ...
## $ timeondialysis : num 18 24.4 18.1 24.3 12 ...
## $ transplantstatus : int 0 0 1 0 1 0 1 0 0 0 ...
## - attr(*, "na.action")= 'omit' Named int [1:249] 5 6 7 8 10 13 15 16 17 18 ...
## ..- attr(*, "names")= chr [1:249] "5" "6" "7" "8" ...
Clustering and Chi-square test
Scaling Data
DataScaled <- Data %>%
scale()
K-Means Clustering
set.seed(100)
kM <- kmeans(DataScaled,2)
Data$transplantstatus <- factor(Data$transplantstatus, levels = c(0, 1), labels = c("Graft success", "Graft lost"))
pca <- prcomp(DataScaled)
df <- data.frame(pca$x, cluster = paste("cluster", kM$cluster, sep = "_"), Data)
ggplot(df, aes(x = PC1, y = PC2, shape = transplantstatus, colour = cluster)) + geom_point()
Hierarchical clustering
hC <- hclust(dist(DataScaled), method = "complete")
hCC <- cutree(hC, k = 3)
hC %>%
as.dendrogram() %>%
set("branches_k_color", k = 3) %>%
set("labels","") %>%
plot()
## Warning in `labels<-.dendrogram`(dend, value = value, ...): The lengths of the
## new labels is shorter than the number of leaves in the dendrogram - labels are
## recycled.
Chi-Squared testing
DATA <- data.frame(Data, kmeans = kM$cluster, hclust = hCC)
(tabk <- table(DATA$kmeans, DATA$transplantstatus))
##
## Graft success Graft lost
## 1 77 44
## 2 477 152
chisq.test(tabk)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tabk
## X-squared = 7.2031, df = 1, p-value = 0.007278
chisq.test(tabk)$expected
##
## Graft success Graft lost
## 1 89.37867 31.62133
## 2 464.62133 164.37867
(tabh <- table(DATA$hclust, DATA$transplantstatus))
##
## Graft success Graft lost
## 1 26 7
## 2 527 189
## 3 1 0
chisq.test(tabh)$expected
## Warning in chisq.test(tabh): Chi-squared approximation may be incorrect
##
## Graft success Graft lost
## 1 24.3760000 8.6240000
## 2 528.8853333 187.1146667
## 3 0.7386667 0.2613333
(tabhk <- table(DATA$hclust, DATA$kmeans))
##
## 1 2
## 1 32 1
## 2 89 627
## 3 0 1
The k-means clustering produced one densely packed cluster 2 (blue) and a more scattered cluster 1 (red) where there is larger spaces between points, suggesting this cluster is distinct from 2. The hierarchical clustering produced an unusual trend, with one smaller cluster (red) containing only a singular point, which was a successful graft, hence the decision to choose k = 3, so that more points are separated into distinct clusters. The second cluster (green) also contained very few points compared to k-means clustering. These clusters suggest that k-means clustering is better suited to estimating the effects of the other variables on transplant success.
The Chi-square testing supports this, demonstrating the unbalance of hierarchical clustering when compared to k-means (seen in tabhk). The hierarchical clustering places patients into clusters that are too small (cluster 1 = 33, cluster 3 = 1) to provide insight into the transplant data. This is further supported from the expected chi-square values for tabh, which has several cells that are too small. From the chi-squared testing for independence in the k-means clustering (tabk), the results (p < 0.05) demonstrate that the success of a transplant is dependent on the other variables selected. The odds ratio (77*152)/(44*477) = 0.556 (3dp) suggests that the odds of a graft being successful is lower if the patient falls into cluster 1.
Anova Testing
ggplot(Data, aes(x=factor(graftno), y = timeondialysis, colour=transplantstatus)) + geom_boxplot() +
stat_summary(fun.y = mean, geom = "line", aes(group = transplantstatus)) + theme_classic() + xlab('graft number') + ylab("time on dialysis")+ labs(colour = "Transplant success")
## Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
## ℹ Please use the `fun` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Data$transplantstatus <- as.numeric(Data$transplantstatus)
Data.aov = aov(timeondialysis ~ graftno * transplantstatus, Data)
summary(Data.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## graftno 1 7877 7877 946.130 < 2e-16 ***
## transplantstatus 1 35 35 4.226 0.040155 *
## graftno:transplantstatus 1 96 96 11.528 0.000722 ***
## Residuals 746 6211 8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
qqnorm(residuals(Data.aov))
The ANOVA was performed to test whether there is a correlation between the number of grafts a patient has undergone and the success of the transplant. An interaction model rather than additive was used under the assumption that the two variables (transplant status and graft number) may have linked effects on transplant success. The Q-Q plot demonstrates that there is a normal distribution of the residuals which fulfills this assumption, whereas the interaction plot did show some outliers but these were not numerous enough for concern.The p values were significant for each test suggesting that interactive and individual effects of number of previous grafts and the success of the transplant significantly impacted the time spent on dialysis.
Linear regression
fit = lm(timeondialysis~ageattransplant, Data)
plot(fit)
summary(fit)
##
## Call:
## lm(formula = timeondialysis ~ ageattransplant, data = Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.896 -2.655 -1.369 0.936 26.152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.04409 0.49799 6.113 1.58e-09 ***
## ageattransplant 0.01217 0.01037 1.174 0.241
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.356 on 748 degrees of freedom
## Multiple R-squared: 0.001839, Adjusted R-squared: 0.0005043
## F-statistic: 1.378 on 1 and 748 DF, p-value: 0.2408
The linear regression model, intended to assess whether the age of patients receiving a transplant was correlated to the time they spent on dialysis. The residuals vs fitted plot demonstrated a relatively homoscedastic distribution of residuals, fulfilling that assumption. Regarding Q-Q normal distribution, the plot demonstrates a normal trend until theoretical quantile 1, where the results skew away from the normal. While this is not ideal based on the assumptions, the majority of points fall into the normal distribution. The scale-location plot follows a relatively horizontal line for most points with some falling outside of this trend. This supports the residuals vs fitted plot in confirming homoscedastic residuals. The residuals vs leverage plot is very densely populated for most data points but has several outliers that could be impacting the linear regression. From the summary analysis, there is no significant correlation between age at the transplant and dialysis time (p>0.05) that can be explained by a linear model.