Lab report 2

Introduction

Organ transplantation saves many lives each year. One of its major challenges is the shortage of available organs and the consequent difficulty of selecting recipients. Many factors are considered, including age, chance of survival, medical history and time spent waiting for a transplant. The Australia and New Zealand Dialysis and Transplant Registry (ANZDATA) contains data on patient histories and outcomes of kidney transplants for renal failure. These data will be used to determine clustering patterns among variables that may affect the success of a transplant, and to analyse how a patient's time spent on dialysis relates to the success of the transplant and the number of previous grafts they have received. Furthermore, the data will be used to assess whether a linear model using age at transplant as a predictor can account for patients' time spent on dialysis.

Data Manipulation

Data download

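library(dplyr)      # full_join(), %>%
library(ggplot2)    # ggplot()
library(dendextend) # set() for colouring dendrogram branches

setwd("/Users/jaydentreisman/Downloads/42264_EP")
patientData <- read.csv('42264_AnzdataPatients.csv')
transplantData <- read.csv('42264_AnzdataTransplants.csv')
creatineData <- read.csv('42264_AnzdataTransplantSerumCreatinine.csv')
# Keep the first transplant record listed for each patient
transplantData2 <- transplantData[!duplicated(transplantData$id),]
# full_join() joins two tables at a time; creatineData is not used in the analysis below
Data <- full_join(patientData, transplantData2, by = "id")
str(Data)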

## 'data.frame':    999 obs. of  60 variables:
##  $ id                            : chr  "P10002268" "P10002998" "P10003314" "P10013539" ...
##  $ gendercode                    : chr  "M" "M" "M" "M" ...
##  $ latereferralcode              : chr  "N" "N" "N" "N" ...
##  $ racialorigincode              : chr  "Caucasoid" "South-East Asian - Vietnamese" "Caucasoid" "Caucasoid" ...
##  $ racialoriginother             : chr  "" "" "" "" ...
##  $ primaryrenaldiseasecode       : chr  "Cystinosis" "Mesangial Proliferative (Iga+)" "Presumed GN (No Biopsy)" "Spina Bifida or Myelomeningocoele" ...
##  $ primaryrenaldiseaseother      : chr  "" "" "" "PREFLUX NEPHROPATHY" ...
##  $ biopsycode                    : chr  "" "Y" "N" "N" ...
##  $ creatinineatentry             : int  600 1234 1025 646 NA 449 NA 503 940 542 ...
##  $ height                        : num  113 152 162 175 154 ...
##  $ weight                        : num  21.8 33.6 53 85 51.5 NA 22.2 NA 51 NA ...
##  $ smokingcode                   : chr  "Never" "Never" "Never" "Never" ...
##  $ rrtstartdate                  : chr  "01may1990" "30aug1990" "07apr1991" "08may1990" ...
##  $ rrtstartcode                  : chr  "APD/IPD Hospital" "Hospital HD" "Hospital HD" "Hospital HD" ...
##  $ deathdate                     : chr  "" "" "20feb2016" "" ...
##  $ graftsustaininglifecode       : chr  "" "" "Y" "" ...
##  $ cancerever                    : chr  "No" "Yes" "No" "No" ...
##  $ chroniclungcode               : chr  "N" "N" "N" "N" ...
##  $ coronaryarterycode            : chr  "N" "N" "N" "N" ...
##  $ peripheralvascularcode        : chr  "Y" "N" "Y" "Y" ...
##  $ cerebrovasularcode            : chr  "Y" "N" "N" "N" ...
##  $ diabetescode                  : chr  "N" "N" "N" "N" ...
##  $ diabetescode_text             : chr  "No" "No" "No" "No" ...
##  $ graftno                       : int  2 2 3 2 1 1 2 2 2 2 ...
##  $ transplantdate                : chr  "06may2008" "27jan2015" "02may2009" "12aug2014" ...
##  $ transplantcentrestate         : chr  "SA" "NSW" "VIC" "NZ" ...
##  $ recipientantibodycmvcode      : chr  "Positive" "Positive" "Positive" "Positive" ...
##  $ recipientantibodyebvcode      : chr  "Positive" "Positive" "Positive" "Positive" ...
##  $ donorsourcecode               : chr  "Friend" "Deceased" "Deceased" "Sister" ...
##  $ donorsourceother              : chr  "" "" "" "" ...
##  $ donorage                      : int  46 59 45 58 55 22 21 47 51 62 ...
##  $ donorgendercode               : chr  "F" "F" "F" "F" ...
##  $ ischaemia                     : int  20 20 7 16 5 11 16 15 15 3 ...
##  $ imediatefunctioncode          : chr  "Spontaneous fall in Se.Creatinine by 10% within 24 Hours" "Spontaneous fall in Se.Creatinine by 10% within 24 Hours" "Spontaneous fall in Se.Creatinine by 10% within 24 Hours" "Spontaneous fall in Se.Creatinine by 10% within 24 Hours" ...
##  $ firstprovendate               : chr  "" "" "" "" ...
##  $ diseaseingraftcode            : chr  "" "" "" "" ...
##  $ graftfailurecausecode         : chr  "" "" "" "" ...
##  $ graftfailurecauseother        : chr  "" "" "" "" ...
##  $ graftfailuredate              : chr  "" "" "" "" ...
##  $ countrycode                   : chr  "" "" "" "" ...
##  $ endtransplantcode             : chr  "S" "S" "Z" "S" ...
##  $ endtransplantdate             : chr  "31dec2016" "31dec2016" "20feb2016" "31dec2016" ...
##  $ lastknownstatus               : chr  "S" "S" "Z" "S" ...
##  $ lastfollowupdate              : chr  "31dec2016" "31dec2016" "20feb2016" "31dec2016" ...
##  $ ageattransplant               : int  27 60 34 54 48 59 24 52 48 54 ...
##  $ transplantstatus              : int  0 0 1 0 1 1 1 0 1 1 ...
##  $ transplantperiod              : int  3161 704 2485 872 31 1575 2742 6023 1 3143 ...
##  $ alivestatus                   : int  0 0 1 0 1 1 0 0 1 1 ...
##  $ aliveperiod                   : int  3161 704 2485 872 31 1575 3418 6023 1787 3143 ...
##  $ hlamismatchesa                : int  2 0 2 0 2 2 1 1 0 2 ...
##  $ hlamismatchesb                : int  2 2 0 1 2 2 1 1 1 1 ...
##  $ hlamismatchesdr               : int  0 0 0 0 2 1 1 0 0 2 ...
##  $ hlamismatchesdq               : int  NA 0 NA 0 NA 1 NA 0 0 NA ...
##  $ maxcytotoxicantibodies        : int  0 0 64 0 99 0 3 3 89 46 ...
##  $ currentcytotoxicantibodies    : int  0 0 56 0 3 0 0 0 55 0 ...
##  $ timeondialysis                : num  18 24.4 18.1 24.3 13.9 ...
##  $ lasttreatmentcodepretransplant: chr  "M" "DC" "D" "DC" ...
##  $ lasttreatmentdatepretransplant: chr  "13dec2006" "01jan2011" "14jul2008" "01jan2011" ...
##  $ lasttreatmentpretransplant    : chr  "CAPD" "HD Satellite - Conventional" "Satellite HD" "HD Satellite - Conventional" ...
##  $ gfailcat                      : chr  "" "" "" "" ...
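
full_join() combines two tables per call, so bringing in a third table means chaining joins. A minimal base-R sketch of that pattern is given here, using small made-up tables (patients, transplants and creatinine, and all of their columns and values, are hypothetical stand-ins rather than the registry files):

```r
# Hypothetical toy tables keyed by "id"
patients    <- data.frame(id = c("A", "B", "C"), gendercode = c("M", "F", "M"))
transplants <- data.frame(id = c("A", "A", "B"), graftno = c(1, 2, 1))
creatinine  <- data.frame(id = c("A", "B", "D"), creatinine = c(120, 95, 150))

# Keep the first transplant row listed for each id, as in the report
first_transplant <- transplants[!duplicated(transplants$id), ]

# Chain full (outer) joins over any number of tables
joined <- Reduce(
  function(x, y) merge(x, y, by = "id", all = TRUE),
  list(patients, first_transplant, creatinine)
)
print(joined)
```

If a table holds several rows per id, as a table of repeated measurements might, it would need to be reduced to one row per id first; otherwise the join repeats patient rows.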

Variable selection

use <- c('creatinineatentry','height','weight','graftno','donorage','ischaemia','ageattransplant','hlamismatchesa','hlamismatchesb','hlamismatchesdr','maxcytotoxicantibodies','currentcytotoxicantibodies','timeondialysis', 'transplantstatus')
Data <- Data[,use]

Missingness

Data[Data==""]  = NA
Data[Data=="-"]  = NA
Data <- na.omit(Data)
dim(Data)

## [1] 750  14

Data <- droplevels(Data)
str(Data)

## 'data.frame':    750 obs. of  14 variables:
##  $ creatinineatentry         : int  600 1234 1025 646 940 1294 874 320 508 1600 ...
##  $ height                    : num  113 152 162 175 162 ...
##  $ weight                    : num  21.8 33.6 53 85 51 43.2 55.6 55 78 82 ...
##  $ graftno                   : int  2 2 3 2 2 2 2 2 1 1 ...
##  $ donorage                  : int  46 59 45 58 51 42 66 62 49 67 ...
##  $ ischaemia                 : int  20 20 7 16 15 4 4 5 4 7 ...
##  $ ageattransplant           : int  27 60 34 54 48 29 51 38 52 52 ...
##  $ hlamismatchesa            : int  2 0 2 0 0 2 1 1 0 2 ...
##  $ hlamismatchesb            : int  2 2 0 1 1 0 2 1 1 2 ...
##  $ hlamismatchesdr           : int  0 0 0 0 0 0 1 0 1 1 ...
##  $ maxcytotoxicantibodies    : int  0 0 64 0 89 0 0 74 5 22 ...
##  $ currentcytotoxicantibodies: int  0 0 56 0 55 0 0 30 0 0 ...
##  $ timeondialysis            : num  18 24.4 18.1 24.3 12 ...
##  $ transplantstatus          : int  0 0 1 0 1 0 1 0 0 0 ...
##  - attr(*, "na.action")= 'omit' Named int [1:249] 5 6 7 8 10 13 15 16 17 18 ...
##   ..- attr(*, "names")= chr [1:249] "5" "6" "7" "8" ...
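
na.omit() drops any row with a missing value in any selected column, which here removed 249 of the 999 rows. Before doing that, it can help to see which columns drive the loss. A minimal sketch on a made-up data frame follows (the values in toy are hypothetical; only the column names echo the registry data):

```r
# Hypothetical toy data with missing values and one empty string
toy <- data.frame(
  height     = c(113, NA, 162, 175),
  weight     = c(21.8, 33.6, NA, 85),
  biopsycode = c("", "Y", "N", "N")
)

# Treat empty strings as missing, column by column
toy[] <- lapply(toy, function(col) {
  col[col %in% ""] <- NA
  col
})

# Number of missing values per column, and rows that na.omit() would keep
missing_per_column <- colSums(is.na(toy))
complete_rows <- sum(complete.cases(toy))
print(missing_per_column)
print(complete_rows)
```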

Clustering and Chi-square test

Scaling Data

DataScaled <- Data %>%
   scale()

K-Means Clustering

set.seed(100)
kM <- kmeans(DataScaled,2)
Data$transplantstatus <- factor(Data$transplantstatus, levels = c(0, 1), labels = c("Graft success", "Graft lost"))

pca <- prcomp(DataScaled)
df <- data.frame(pca$x, cluster = paste("cluster", kM$cluster, sep = "_"), Data)
ggplot(df, aes(x = PC1, y = PC2, shape = transplantstatus, colour = cluster)) + geom_point()
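
The clustering above is run on all 14 scaled columns, which includes transplantstatus itself. One variant, sketched here on simulated data, drops the outcome before scaling so that the clusters are formed from the predictors alone and only afterwards compared with the outcome. This is not what the report ran; the data frame toy, its values and the choice of nstart = 25 are assumptions made for illustration.

```r
set.seed(100)
n <- 60

# Hypothetical stand-in for the analysis data
toy <- data.frame(
  donorage         = rnorm(n, mean = 50, sd = 12),
  ischaemia        = rnorm(n, mean = 10, sd = 5),
  timeondialysis   = rexp(n, rate = 1 / 3),
  transplantstatus = rbinom(n, size = 1, prob = 0.25)
)

# Scale the predictors only, leaving the outcome out of the clustering
predictors <- setdiff(names(toy), "transplantstatus")
toy_scaled <- scale(toy[, predictors])

# Several random starts make the k-means solution less dependent on the seed
km <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Cross-tabulate cluster membership against the held-out outcome
cluster_by_outcome <- table(cluster = km$cluster, outcome = toy$transplantstatus)
print(cluster_by_outcome)
```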

Hierarchical clustering

hC <- hclust(dist(DataScaled), method = "complete")
hCC <- cutree(hC, k = 3)
hC %>% 
  as.dendrogram() %>%
  set("branches_k_color", k = 3) %>% 
  set("labels","") %>%
  plot()

## Warning in `labels<-.dendrogram`(dend, value = value, ...): The lengths of the
## new labels is shorter than the number of leaves in the dendrogram - labels are
## recycled.

Chi-Squared testing

DATA <- data.frame(Data, kmeans = kM$cluster, hclust = hCC)

(tabk <- table(DATA$kmeans, DATA$transplantstatus))

##    
##     Graft success Graft lost
##   1            77         44
##   2           477        152

chisq.test(tabk)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tabk
## X-squared = 7.2031, df = 1, p-value = 0.007278

chisq.test(tabk)$expected

##    
##     Graft success Graft lost
##   1      89.37867   31.62133
##   2     464.62133  164.37867

(tabh <- table(DATA$hclust, DATA$transplantstatus))

##    
##     Graft success Graft lost
##   1            26          7
##   2           527        189
##   3             1          0

chisq.test(tabh)$expected

## Warning in chisq.test(tabh): Chi-squared approximation may be incorrect

##    
##     Graft success  Graft lost
##   1    24.3760000   8.6240000
##   2   528.8853333 187.1146667
##   3     0.7386667   0.2613333
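
The warning arises because some expected counts are very small. As a rough check, and assuming the usual rule of thumb that expected counts should be at least 5, the sketch below rebuilds tabh from the counts printed above, counts the cells that fall short, and runs Fisher's exact test as one commonly used alternative. Fisher's test is not part of the original analysis.

```r
# Counts copied from the tabh output above
tabh <- matrix(
  c(26, 7,
    527, 189,
    1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(hclust = c("1", "2", "3"),
                  status = c("Graft success", "Graft lost"))
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tabh), colSums(tabh)) / sum(tabh)
small_cells <- sum(expected < 5)

# Exact test that does not rely on the large-sample approximation
fisher_p <- fisher.test(tabh)$p.value

print(round(expected, 3))
print(small_cells)
print(fisher_p)
```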

(tabhk <- table(DATA$hclust, DATA$kmeans))

##    
##       1   2
##   1  32   1
##   2  89 627
##   3   0   1

K-means clustering produced one densely packed cluster (cluster 2, blue) and a more scattered cluster (cluster 1, red) with larger spaces between points, suggesting that cluster 1 is distinct from cluster 2. Hierarchical clustering produced an unusual result: one cluster (red) contained only a single point, which was a successful graft. For this reason k = 3 was chosen, so that more points were separated into distinct clusters. Even so, the second cluster (green) contained very few points compared with either k-means cluster. These cluster sizes suggest that k-means clustering is better suited to estimating the effects of the other variables on transplant success.
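
One way to see how unbalanced a hierarchical solution is before settling on k is to tabulate the cluster sizes for a few values of k and for more than one linkage. The sketch below does this on simulated data (the matrix toy and its dimensions are made up for illustration). The ward.D2 linkage is shown only as an alternative to compare against, not as the method used in this report.

```r
set.seed(100)

# Hypothetical data: 100 observations of 5 scaled variables
toy <- scale(matrix(rnorm(500), nrow = 100, ncol = 5))
d <- dist(toy)

# Cluster sizes for each k under a given linkage method
cluster_sizes <- function(method, ks = 2:4) {
  hc <- hclust(d, method = method)
  lapply(setNames(ks, paste0("k=", ks)), function(k) {
    as.vector(table(cutree(hc, k = k)))
  })
}

sizes_complete <- cluster_sizes("complete")
sizes_ward <- cluster_sizes("ward.D2")
print(sizes_complete)
print(sizes_ward)
```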

The chi-square testing supports this, demonstrating the imbalance of the hierarchical clusters compared with the k-means clusters (seen in tabhk). Hierarchical clustering places patients into clusters that are too small (cluster 1 = 33, cluster 3 = 1) to provide insight into the transplant data. This is further supported by the expected counts for tabh: both cells for cluster 3 fall below 1, which is why R warns that the chi-squared approximation may be incorrect. For the k-means clusters (tabk), the chi-squared test of independence (X-squared = 7.20, df = 1, p = 0.007) indicates that graft outcome is associated with cluster membership, and therefore with the variables used to form the clusters. The odds ratio (77*152)/(44*477) = 0.558 (3dp) suggests that the odds of a graft being successful are lower for patients in cluster 1 than for patients in cluster 2.
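
The odds ratio and the chi-squared statistic can be recomputed directly from the counts in tabk. The sketch below rebuilds that table by hand and also adds an approximate 95% confidence interval based on the standard error of the log odds ratio (a Wald interval). The interval is an addition for illustration and was not part of the original analysis.

```r
# Counts copied from the tabk output above
tabk <- matrix(
  c(77, 44,
    477, 152),
  nrow = 2, byrow = TRUE,
  dimnames = list(kmeans = c("1", "2"),
                  status = c("Graft success", "Graft lost"))
)

# Odds of graft success within each cluster, then their ratio (cluster 1 vs 2)
odds_success <- tabk[, "Graft success"] / tabk[, "Graft lost"]
odds_ratio <- odds_success[["1"]] / odds_success[["2"]]

# Wald 95% interval on the log odds ratio scale
se_log_or <- sqrt(sum(1 / tabk))
ci <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)

# Pearson chi-squared test with Yates' correction (the default for 2 x 2 tables)
test <- chisq.test(tabk)

print(round(odds_ratio, 3))
print(round(ci, 3))
print(test)
```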

ANOVA Testing

ggplot(Data, aes(x = factor(graftno), y = timeondialysis, colour = transplantstatus)) +
  geom_boxplot() +
  # `fun` replaces the `fun.y` argument deprecated in ggplot2 3.3.0
  stat_summary(fun = mean, geom = "line", aes(group = transplantstatus)) +
  theme_classic() + xlab('graft number') + ylab("time on dialysis") +
  labs(colour = "Transplant success")

Data$transplantstatus <- as.numeric(Data$transplantstatus)

Data.aov = aov(timeondialysis ~ graftno * transplantstatus, Data)
summary(Data.aov)

##                           Df Sum Sq Mean Sq F value   Pr(>F)    
## graftno                    1   7877    7877 946.130  < 2e-16 ***
## transplantstatus           1     35      35   4.226 0.040155 *  
## graftno:transplantstatus   1     96      96  11.528 0.000722 ***
## Residuals                746   6211       8                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

qqnorm(residuals(Data.aov))

The ANOVA was performed to test whether time spent on dialysis is related to the number of grafts a patient has received and to the success of the transplant. An interaction model rather than an additive one was used, under the assumption that the two variables (transplant status and graft number) may have linked effects on time on dialysis. The Q-Q plot shows that the residuals are approximately normally distributed, which fulfils this assumption. The interaction plot did show some outliers, but these were not numerous enough for concern. The p-values were significant for every term (graft number, p < 0.001; transplant status, p = 0.040; interaction, p < 0.001), suggesting that graft number and transplant success, both individually and in combination, are significantly associated with time spent on dialysis.
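
Significance alone does not show how much of the variation each term accounts for. As a rough supplement, the sketch below recomputes the F statistics from the rounded sums of squares in the ANOVA table and adds eta-squared (each term's share of the total sum of squares). Eta-squared is an addition for illustration, and because the inputs are rounded the results only approximate the printed values. Note also that aov() reports sequential sums of squares, so the shares depend on the order of the terms.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table
ss <- c(graftno = 7877, transplantstatus = 35, interaction = 96, residuals = 6211)
df <- c(graftno = 1, transplantstatus = 1, interaction = 1, residuals = 746)

# Mean squares and F statistics (each term against the residual mean square)
ms <- ss / df
f_values <- ms[1:3] / ms[["residuals"]]
p_values <- pf(f_values, df[1:3], df[["residuals"]], lower.tail = FALSE)

# Eta-squared: proportion of the total sum of squares attributed to each term
eta_sq <- ss[1:3] / sum(ss)

print(round(f_values, 2))
print(signif(p_values, 3))
print(round(eta_sq, 4))
```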

Linear regression

fit = lm(timeondialysis~ageattransplant, Data)
plot(fit)

summary(fit)

## 
## Call:
## lm(formula = timeondialysis ~ ageattransplant, data = Data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.896 -2.655 -1.369  0.936 26.152 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.04409    0.49799   6.113 1.58e-09 ***
## ageattransplant  0.01217    0.01037   1.174    0.241    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.356 on 748 degrees of freedom
## Multiple R-squared:  0.001839,   Adjusted R-squared:  0.0005043 
## F-statistic: 1.378 on 1 and 748 DF,  p-value: 0.2408

The linear regression model was intended to assess whether the age of patients at transplant was associated with the time they spent on dialysis. The residuals vs fitted plot showed a relatively homoscedastic distribution of residuals, fulfilling that assumption. In the normal Q-Q plot, the residuals follow the normal line until about theoretical quantile 1, beyond which they deviate from it. While this is not ideal, the majority of points follow the normal distribution. The scale-location plot follows a relatively horizontal line for most points, with some falling outside this trend, which supports the residuals vs fitted plot in confirming homoscedastic residuals. The residuals vs leverage plot is densely populated for most data points but shows several outliers that could be influencing the regression. From the model summary, the slope for age at transplant is not significantly different from zero (estimate = 0.012, p = 0.241) and the model explains almost none of the variance in dialysis time (R-squared = 0.002), so there is no evidence of a linear relationship between age at transplant and time on dialysis.
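
The quantities in the regression summary are tied together by simple formulas, which gives a quick consistency check. The sketch below starts from the rounded slope and standard error printed above and recovers the t value, the two-sided p-value and R-squared (for a single predictor, R-squared = t^2 / (t^2 + residual df)). Because the inputs are rounded, the results match the printed summary only approximately.

```r
# Values copied from the summary(fit) output above
estimate <- 0.01217   # slope for ageattransplant
std_error <- 0.01037
df_resid <- 748

# t statistic and two-sided p-value for the slope
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# With one predictor, the F statistic is t^2 and R-squared follows from it
f_value <- t_value^2
r_squared <- f_value / (f_value + df_resid)

print(round(t_value, 3))
print(round(p_value, 3))
print(signif(r_squared, 3))
```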

2025-03-26