#ANZ Transplant Data Analysis

##Introduction
End stage renal disease (ESRD) is the total and permanent failure of an individual’s kidneys. ESRD carries a very high risk of mortality and affects a large population (incidence rate 363 per million per year) [1]. It is treated by either dialysis or kidney transplant. Because ESRD presents high mortality and morbidity, it is important that treatment data are regularly assessed to identify prognostic factors that determine treatment success. The Australia and New Zealand Dialysis and Transplant Registry (ANZDATA) collects and compiles data on the treatment and outcome of end stage renal failure from all renal units across Australia and New Zealand [2]. Patient and clinical variables such as age, gender, age at transplant, creatinine levels, and the levels of certain antibodies are also collected. The purpose of this report was to analyse ANZDATA patient and transplant data, exploring variables that affect transplant success. Specifically, this report analysed:

1. The relationship between patient cancer history and transplant success
2. Whether kmeans clustering of continuous variables associated with transplant can reveal patient sub-structures (clusters)
3. The relationship between graft number, transplant success, and their impact on time on dialysis
4. The relationship between age at time of transplant and graft survival time

##Results

#Read in packages 
library(plyr)
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ---------------------------------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(naniar)
## Warning: package 'naniar' was built under R version 3.6.3
library(tidyr)
library(ggplot2)
library(RColorBrewer)
library(data.table)
## Warning: package 'data.table' was built under R version 3.6.3
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(Publish)
## Warning: package 'Publish' was built under R version 3.6.3
## Loading required package: prodlim
## Warning: package 'prodlim' was built under R version 3.6.3
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.6.3
library(dplyr)
library(Rfast)
## Warning: package 'Rfast' was built under R version 3.6.3
## Loading required package: Rcpp
## Loading required package: RcppZiggurat
## Warning: package 'RcppZiggurat' was built under R version 3.6.3
## 
## Attaching package: 'Rfast'
## The following object is masked from 'package:data.table':
## 
##     transpose
## The following object is masked from 'package:dplyr':
## 
##     nth
## The following objects are masked from 'package:purrr':
## 
##     is_integer, transpose
library(car)
## Warning: package 'car' was built under R version 3.6.3
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:Rfast':
## 
##     bc
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
#Read in data 
Patientdata = read.csv("AnzdataPatients.csv")
Transplantdata = read.csv("AnzdataTransplants.csv")

dim(Patientdata)
## [1] 999  23
vis_miss(Patientdata)

dim(Transplantdata)
## [1] 1000   38
vis_miss(Transplantdata)

#An initial plan to relate the full set of HLA mismatches to graft success was not feasible: hlamismatchesdq is missing for 741 of 1000 transplants (74%). It could be excluded, or another graft variable could be used instead
colSums(is.na(Transplantdata))
##                             id                        graftno 
##                              0                              0 
##                 transplantdate          transplantcentrestate 
##                              0                              0 
##       recipientantibodycmvcode       recipientantibodyebvcode 
##                              0                              0 
##                donorsourcecode               donorsourceother 
##                              0                              0 
##                       donorage                donorgendercode 
##                             13                              0 
##                      ischaemia           imediatefunctioncode 
##                             59                              0 
##                firstprovendate             diseaseingraftcode 
##                              0                              0 
##          graftfailurecausecode         graftfailurecauseother 
##                              0                              0 
##               graftfailuredate                    countrycode 
##                              0                              0 
##              endtransplantcode              endtransplantdate 
##                              0                              0 
##                lastknownstatus               lastfollowupdate 
##                              0                              0 
##                ageattransplant               transplantstatus 
##                              0                              0 
##               transplantperiod                    alivestatus 
##                              0                              0 
##                    aliveperiod                 hlamismatchesa 
##                              0                             38 
##                 hlamismatchesb                hlamismatchesdr 
##                             38                             41 
##                hlamismatchesdq         maxcytotoxicantibodies 
##                            741                             75 
##     currentcytotoxicantibodies                 timeondialysis 
##                             53                              0 
## lasttreatmentcodepretransplant lasttreatmentdatepretransplant 
##                              0                              0 
##     lasttreatmentpretransplant                       gfailcat 
##                              0                              0
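
The 74% figure quoted for hlamismatchesdq is simply the missing count divided by the number of rows (741 / 1000). A minimal sketch of getting that proportion for every column at once is shown here. The data frame `toy`, its columns, and the 50% cut-off are all hypothetical and chosen only for illustration; they are not ANZDATA values.

```r
# Hypothetical toy data standing in for the transplant table
toy <- data.frame(
  id       = 1:4,
  donorage = c(34, NA, 51, 47),
  hla_dq   = c(NA, NA, NA, 1)
)

# colMeans() of the logical NA matrix gives the fraction missing per column
prop_missing <- colMeans(is.na(toy))
prop_missing

# Columns above an assumed threshold of 50% missingness
high_missing <- names(prop_missing)[prop_missing > 0.5]
high_missing
```

Working with proportions rather than raw counts makes columns comparable when tables have different numbers of rows, as the patient and transplant tables do here.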

1. Is there evidence of a significant relationship between patient cancer history and transplant status (transplant success or failure)?

#Is transplant status (success or failure) associated with a patient history of cancer? 

TransplantStat = data.table(Transplantdata$id, Transplantdata$transplantstatus)
colnames(TransplantStat)= c("ID", "Transplant Status") 
TransplantStat = na.omit(TransplantStat)

Cancer = data.table(Patientdata$id, Patientdata$cancerever)
colnames(Cancer) = c("ID", "Cancer ever")
Joint2 = full_join(TransplantStat, Cancer, by = "ID")
#Chi-square test for independence 
#H0 = there is no association between patient cancer history and transplant status (graft success) , Ha = there is an association between patient cancer history and transplant status
TAB = table(Joint2$`Cancer ever`, Joint2$`Transplant Status`)
TAB
##      
##         0   1
##   No  643 222
##   Yes 105  30
test = chisq.test(TAB)
test$expected >= 5 
##      
##          0    1
##   No  TRUE TRUE
##   Yes TRUE TRUE
chisq.test(TAB)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  TAB
## X-squared = 0.5629, df = 1, p-value = 0.4531
odds.ratio(TAB)
## $res
## odds ratio    p-value 
##  0.8275418  0.3920659 
## 
## $ci
## [1] 0.5364489 1.2765903
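
The odds ratio and its interval can be checked by hand from the printed 2x2 table. The sketch below rebuilds the table from the counts shown above and applies the standard Wald interval on the log odds ratio. It is an independent base-R check under that assumption, not a description of how `odds.ratio()` works internally.

```r
# Counts copied from the printed table (rows: cancer ever, columns: transplant status)
tab <- matrix(c(643, 105, 222, 30), nrow = 2,
              dimnames = list(cancer = c("No", "Yes"), status = c("0", "1")))

# Cross-product odds ratio
or <- (tab["No", "0"] * tab["Yes", "1"]) / (tab["No", "1"] * tab["Yes", "0"])

# Wald 95% interval, built on the log scale and back-transformed
se_log_or <- sqrt(sum(1 / tab))
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se_log_or)

or
ci
```

The interval contains 1, which is consistent with the non-significant chi-square result: these data give no evidence that the odds of each transplant status differ by cancer history.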

2. Can Kmeans clustering of continuous variables associated with transplant data reveal patient sub-structures (clusters)?

#Perform kmeans clustering on the Transplant dataset. Select patient ID, Transplant Status (used to set point shape in the cluster plot) and the numeric variables (Transplant period, Time on Dialysis, Donor Age, Age at Transplant)

data = data.table(Transplantdata$id, Transplantdata$transplantperiod,Transplantdata$timeondialysis, Transplantdata$transplantstatus, Transplantdata$donorage, Transplantdata$ageattransplant)
colnames(data) = c("ID", "Transplant period", "Time on Dialysis", "Transplant Status", "Donor Age", "Age at Transplant")
data = na.omit(data)

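#Vector holds only the clustering candidates (ID and Transplant Status dropped), but it is not used below
Vector = data %>%
  select(-ID) %>% 
  select(-`Transplant Status`)

#Note: data (not Vector) is converted and scaled here, so ID and Transplant Status also enter the clustering
datanumeric = as.data.frame(sapply(data, as.numeric))
datascaled <-  datanumeric%>%
  scale()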

set.seed(50000)
kM <- kmeans(datascaled, 2)
clustered <- data.frame(data, cluster = factor(kM$cluster))
clustered$Transplant.Status = as.factor(clustered$Transplant.Status)
cbPalette <- c("#ff9f88", "#fdc344", "#fdf6eb", "#d5c5a7", "#413d46", "#0072B2", "#D55E00", "#CC79A7")
graph <- ggplot(clustered, aes(x=`Age.at.Transplant`, y=`Time.on.Dialysis`, colour = cluster, shape = `Transplant.Status`)) + geom_point()
graph <- graph + ggtitle("Kmeans clustering: Dialysis, Age at Transplant & Transplant Status") + labs(y="Time on Dialysis", x="Age at Transplant (Yrs)", caption = "Figure1. Kmeans clustering of numeric continuous variables in the ANZ Transplant dataset (Transplant period, Time on dialysis, Transplant status, Donor age, Age at Transplant) plotted with the variables Time on Dialysis and Age at Transplant, with shapes determined by Transplant Status, and clusters by colour")  + theme(plot.title = element_text(hjust=0.5)) + theme(plot.caption = element_text(hjust = 0))  + scale_shape_discrete(name = "Transplant Status", labels = c("Functional", "Failed")) + scale_colour_manual(values=cbPalette)
graph

- The kmeans clustering does not reveal any obvious patterns
- This may be due to a sub-optimal number of clusters and the absence of principal component analysis (PCA)
- Generally, the pink cluster (cluster 1) has a greater spread in time on dialysis and a smaller spread in age at transplant than the yellow cluster (cluster 2), which has a smaller spread in time on dialysis and a greater spread in age at transplant
- Kmeans clustering could potentially reveal patient sub-groups (clusters) relating to transplant data such as transplant period, age at transplant, and time on dialysis; however, the model generated above does not separate such sub-groups well
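
The two limitations named above, the number of clusters and the absence of PCA, could be explored along the following lines. This is a sketch on simulated data, not a re-analysis of the registry: the matrix `sim`, its column names, and the choices of k = 2 to 6 and `nstart = 25` are all assumptions made for illustration.

```r
set.seed(1)

# Simulated stand-in for the transplant variables: two loose groups, four columns
n <- 100
sim <- rbind(
  matrix(rnorm(n * 4, mean = 0), ncol = 4),
  matrix(rnorm(n * 4, mean = 3), ncol = 4)
)
colnames(sim) <- c("period", "dialysis", "donor_age", "age")
sim_scaled <- scale(sim)

# Elbow method: total within-cluster sum of squares for each candidate k
ks  <- 2:6
wss <- sapply(ks, function(k) kmeans(sim_scaled, centers = k, nstart = 25)$tot.withinss)
plot(ks, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")

# PCA on the scaled data, then kmeans on the first two principal components
pca <- prcomp(sim_scaled)
pcs <- pca$x[, 1:2]
km  <- kmeans(pcs, centers = 2, nstart = 25)
plot(pcs, col = km$cluster, pch = 19)
```

The elbow plot is read by eye: the k after which the curve flattens is a candidate number of clusters. Plotting on the principal components shows the clusters in the directions of greatest variance instead of on two hand-picked variables.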

3. Do the variables patient graft number, time on dialysis, and transplant success have any interaction?

#Two way ANOVA: Time on Dialysis by Graft Number and Transplant Status
Part1 = data.table(Transplantdata$id, Transplantdata$graftno, Transplantdata$timeondialysis, Transplantdata$transplantstatus)
colnames(Part1) = c("ID", "Graft Number", "Time on Dialysis", "Transplant Status")
Dataframe1 = na.omit(Part1)
Dataframe1$`Transplant Status` = as.factor(Dataframe1$`Transplant Status`)

#Generate an interaction plot to visualise whether transplant status and graft number have an interactive or additive effect on time on dialysis
graph2 <- ggplot(Dataframe1, aes(x=as.factor(`Graft Number`), y=`Time on Dialysis`, colour = `Transplant Status`)) + stat_summary(fun.y=mean, geom = "line", aes(group=`Transplant Status`), size=2) +theme_classic() 
graph2 <- graph2 + ggtitle("Interaction Plot: Transplant Status & Graft Number on Time on Dialysis") + labs(y="Time on Dialysis", x="Graft Number", caption = "Figure2. Interaction Plot of Transplant Status and Graft Number on Time on Dialysis")  + theme(plot.title = element_text(hjust=0.5)) + theme(plot.caption = element_text(hjust = 0)) +scale_colour_manual(values=cbPalette)
graph2

#H01: mean time on dialysis does not differ between graft numbers; Ha1: it differs
#H02: mean time on dialysis does not differ between transplant statuses; Ha2: it differs
#H03: there is no interaction between Graft Number and Transplant Status; Ha3: there is an interaction
a1 = aov(`Time on Dialysis`~`Graft Number`*`Transplant Status`, Dataframe1)
summary(a1)
##                                     Df Sum Sq Mean Sq  F value   Pr(>F)    
## `Graft Number`                       1  13489   13489 1316.179  < 2e-16 ***
## `Transplant Status`                  1     35      35    3.434   0.0642 .  
## `Graft Number`:`Transplant Status`   1    259     259   25.260 5.94e-07 ***
## Residuals                          996  10208      10                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(a1)
## Analysis of Variance Table
## 
## Response: Time on Dialysis
##                                     Df  Sum Sq Mean Sq   F value    Pr(>F)    
## `Graft Number`                       1 13489.5 13489.5 1316.1791 < 2.2e-16 ***
## `Transplant Status`                  1    35.2    35.2    3.4341   0.06416 .  
## `Graft Number`:`Transplant Status`   1   258.9   258.9   25.2595 5.939e-07 ***
## Residuals                          996 10208.0    10.2                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Equality of variance: leveneTest needs factor grouping variables. Note that a1 above was fitted while Graft Number was still numeric, which is why it has 1 Df there
Dataframe1$`Graft Number` = as.factor(Dataframe1$`Graft Number`)
leveneTest(`Time on Dialysis`~`Graft Number`*`Transplant Status`, Dataframe1)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   6  39.406 < 2.2e-16 ***
##       993                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(a1, 1)

#Normality 
plot(a1, 2)

a1residuals = residuals(object = a1)
shapiro.test(x = a1residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  a1residuals
## W = 0.88019, p-value < 2.2e-16
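
In the ANOVA table above, `Graft Number` takes a single degree of freedom. That is what `aov()` reports when a predictor is numeric and is therefore fitted as a linear trend; a factor with g levels takes g - 1. The sketch below shows the difference on simulated data. The data frame `sim` and its columns are hypothetical. Converting to a factor before fitting gives each graft number its own group mean, which is what the stated hypotheses about differences in means assume.

```r
set.seed(42)

# Balanced simulated design: 3 graft numbers x 2 statuses x 20 replicates
sim <- expand.grid(graft = 1:3, status = c(0, 1), replicate = 1:20)
sim$dialysis <- rnorm(nrow(sim), mean = 2 * sim$graft, sd = 1)
sim$status <- factor(sim$status)

# Graft number left numeric: fitted as a slope, 1 Df
fit_numeric <- aov(dialysis ~ graft * status, data = sim)

# Graft number as a factor: one mean per level, (levels - 1) Df
sim$graft_f <- factor(sim$graft)
fit_factor <- aov(dialysis ~ graft_f * status, data = sim)

# Degrees of freedom per term (main effects, interaction, residuals)
df_numeric <- summary(fit_numeric)[[1]]$Df
df_factor  <- summary(fit_factor)[[1]]$Df
df_numeric
df_factor
```

With three levels, the factor version spends two degrees of freedom on the main effect and two on the interaction, against one each in the numeric version.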

4. What is the relationship between age at time of transplant and graft survival time, and is transplant status (transplant success) associated with it?

#Is there a relationship between Transplant period (graft survival time) and Age at Transplant?
#Regression model
data = data.table(Transplantdata$id, Transplantdata$transplantperiod, Transplantdata$transplantstatus, Transplantdata$ageattransplant)
colnames(data) = c("ID", "Transplant period",  "Transplant Status", "Age at Transplant")
data = na.omit(data)
data$`Transplant Status` = as.factor(data$`Transplant Status`)

graph3 <- ggplot(data, aes(x=`Age at Transplant`, y = `Transplant period`, colour =`Transplant Status`)) + geom_point() + theme_bw() + geom_smooth(method = 'lm', col = "black") 
graph3 <- graph3 + labs(y="Graft survival time", x="Age at Transplant (Yrs)", caption = "Figure3. Regression model of Graft survival time against patient age at transplant, with colours determined by transplant status")  + theme(plot.title = element_text(hjust=0.5)) + theme(plot.caption = element_text(hjust = 0)) + scale_color_manual(values=cbPalette) + labs(fill="Transplant Status") + scale_fill_discrete(name = "Transplant Status", labels = c("Graft failed", "Functioning"))
graph3

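#Fit the model and assess assumptions
#Note: this call has Age at Transplant as the response and Transplant period as the predictor, the reverse of the axes in Figure3. In simple linear regression the slope's sign, t value, p-value and R-squared are the same in either direction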
fit = lm(`Age at Transplant`~`Transplant period`, data)
summary(fit)
## 
## Call:
## lm(formula = `Age at Transplant` ~ `Transplant period`, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.165  -9.926   2.938  11.402  31.598 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         48.4823916  0.7948715   60.99  < 2e-16 ***
## `Transplant period` -0.0013512  0.0002851   -4.74 2.45e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.24 on 998 degrees of freedom
## Multiple R-squared:  0.02202,    Adjusted R-squared:  0.02104 
## F-statistic: 22.47 on 1 and 998 DF,  p-value: 2.449e-06
plot(fit)

- The above model indicates a weak negative linear association between graft survival time and age at transplant (slope p = 2.45e-06, R-squared = 0.022). Graft survival time generally decreases with increasing age at time of transplant: grafts in younger individuals generally survive longer, whereas grafts in older individuals do not last as long. Nothing can be confidently concluded about transplant status from the model.
- Assumptions
  - Homogeneity of variance: there is no clear relationship between the residuals and fitted values in the residuals versus fitted plot, so homogeneity of variance can be assumed
  - Normality: the points in the Normal Q-Q plot of the residuals roughly fall along the reference line, so normality can be assumed
  - Homoscedasticity: the fitted line in the Scale-Location plot indicates the residuals have uniform variance across the range, so the data are homoscedastic
  - Outliers: no points lie beyond the Cook’s distance contours, so there are no evident influential outliers to assess
- As such, this model can be retained
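
The fitted call above has `Age at Transplant` on the left of the formula, while Figure3 plots graft survival time on the y axis. For a simple linear regression this does not change the test of association: swapping response and predictor changes the coefficients but leaves the slope's t value, its p-value and R-squared unchanged. A small sketch on simulated data follows. The vectors `age` and `period` are made up and in arbitrary units; they are not registry values.

```r
set.seed(7)

# Simulated stand-ins for age at transplant and graft survival time
age    <- runif(200, min = 18, max = 75)
period <- 4000 - 20 * age + rnorm(200, sd = 1500)

fit_age_on_period <- lm(age ~ period)   # direction used in the call above
fit_period_on_age <- lm(period ~ age)   # direction matching the plot axes

# Slope t value and R-squared from each fit
t1 <- summary(fit_age_on_period)$coefficients["period", "t value"]
t2 <- summary(fit_period_on_age)$coefficients["age", "t value"]
r1 <- summary(fit_age_on_period)$r.squared
r2 <- summary(fit_period_on_age)$r.squared

c(t1 = t1, t2 = t2)
c(r1 = r1, r2 = r2)

# The slopes themselves differ, so predicting graft survival from age needs period ~ age
coef(fit_period_on_age)
```

The direction matters once the model is used for prediction or its coefficients are interpreted: only the `period ~ age` fit gives the expected change in graft survival time per year of age.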

##Conclusion
The analyses performed on the ANZDATA patient and transplant datasets indicate that cancer history is not predictive of transplant success or failure. Kmeans clustering of transplant data was not able to elucidate underlying patient sub-groups; this may be due to the absence of principal component analysis before clustering and a sub-optimal number of clusters. As such, Kmeans clustering should not be excluded as a way to use patient and transplant data to reveal important patient population sub-structures.

Further, this report found no significant relationship between time on dialysis pre-transplant and transplant success. There was, however, a significant relationship between time on dialysis pre-transplant and graft number, and a statistically significant interaction between graft number and transplant success. This suggests that graft number could be a clinically indicative measure for likely transplant success.

This study also generated a linear regression model which demonstrates a negative linear relationship between graft survival time and age at transplant, indicating that grafts in younger patients survive longer than grafts in older patients. This could be clinically significant: it may indicate a need for a shorter time between transplant and follow-up to assess function in older patients, and that older patients need more intensive post-transplant care than younger patients.

##Bibliography
1. Kidney Disease Statistics for the United States. (2016, December 1). Retrieved April 2020, from https://www.niddk.nih.gov/health-information/health-statistics/kidney-disease
2. Australia and New Zealand Dialysis and Transplant Registry. (n.d.). Retrieved March 20, 2020, from https://www.anzdata.org.au/anzdata/