SID: 490379879

Introduction

This lab report analyses a health dataset from the Australia and New Zealand Dialysis and Transplant Registry (ANZDATA). ANZDATA is a clinical quality registry that collects statistical information on the incidence, prevalence, and outcomes of treatment for patients with end-stage renal failure. It has been in operation since 1977 and is located at The Royal Adelaide Hospital, South Australia.

Initial Data Wrangling

# The three registry extracts were loaded
setwd("~/Downloads/AMED3002/42264")
patientData <- read.csv('42264_AnzdataPatients.csv')
transplantData <- read.csv('42264_AnzdataTransplants.csv')
creatineData <- read.csv('42264_AnzdataTransplantSerumCreatinine.csv')

# Only the first listed transplant per patient was kept; repeat transplants were dropped
transplantData2 <- transplantData[!duplicated(transplantData$id),]

# Variables of interest were extracted
Data <- merge(patientData,transplantData2, by = "id", all=TRUE)
use <- c("gendercode", "latereferralcode", "creatinineatentry", "height", "weight", 
    "smokingcode", "cancerever", "chroniclungcode", "coronaryarterycode", "peripheralvascularcode", 
    "cerebrovasularcode", "diabetescode", "graftno", "transplantcentrestate", "recipientantibodycmvcode", "recipientantibodyebvcode", "donorsourcecode", "donorage", "donorgendercode", 
    "ischaemia", "ageattransplant", "hlamismatchesa", "hlamismatchesb", "hlamismatchesdr", 
    "hlamismatchesdq", "maxcytotoxicantibodies", "currentcytotoxicantibodies", "timeondialysis", 
    "transplantstatus")
Data <- Data[, use]

# Empty strings and "-" placeholders were recoded as NA
Data[Data == ""] <- NA
Data[Data == "-"] <- NA
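The same recoding can also be done when the file is read. The following is a minimal sketch, assuming a small made-up CSV rather than the registry files; `csv_text` and `toy` are hypothetical names, while `na.strings` is a standard argument of `read.csv`.

```r
# Toy CSV text standing in for one of the registry files (values are made up).
csv_text <- "id,smokingcode,height
1,Never,170
2,,165
3,-,
4,Former,180"

# na.strings lists the raw strings that should be treated as missing on import.
toy <- read.csv(text = csv_text, na.strings = c("", "-", "NA"))

# Count missing values per column to confirm the recoding worked.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Handling the placeholders at import keeps numeric columns numeric, because a stray "-" would otherwise force the whole column to be read as text.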

Chi-square test

Chi-square test – Performs a chi-square test, checks assumptions, states whether it is a test of independence or homogeneity, and reports an odds ratio or relative risk.

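H0 (Null hypothesis): There is no relationship between smoking status (smokingcode) and a history of cancer (cancerever); the two variables are independent.

HA (Alternative hypothesis): There is a relationship between smoking status and a history of cancer; the two variables are dependent.

This is a test of independence, because both variables were recorded on the same sample of patients. H0 is rejected if p < 0.05.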

LungTable <- table(patientData$smokingcode, patientData$cancerever)
chisq.test(LungTable)
## Warning in chisq.test(LungTable): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  LungTable
## X-squared = 1.2456, df = 4, p-value = 0.8705

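As the p-value was 0.8705, above the significance level of 0.05, the null hypothesis is not rejected: the data provide no evidence of a relationship between smoking status and a history of cancer in this sample. Smoking exposes the body to carcinogenic chemicals, so an association might have been expected, but none is detectable here. The warning from chisq.test() indicates that some expected cell counts are small, so the chi-squared approximation may be unreliable and the result should be treated with caution.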
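The expected-count assumption behind the chi-squared approximation can be checked directly. The sketch below uses a made-up 3 x 2 table, not the registry counts; `toy_table` and the fallback to Fisher's exact test are illustrative assumptions rather than the method used in this report.

```r
# Hypothetical 3 x 2 table of counts (made up for illustration only).
toy_table <- matrix(c(40, 3,
                      25, 2,
                      30, 4),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(smoking = c("Never", "Former", "Current"),
                                    cancer  = c("No", "Yes")))

# Expected counts under independence: (row total * column total) / grand total.
expected <- outer(rowSums(toy_table), colSums(toy_table)) / sum(toy_table)
small_cells <- sum(expected < 5)

# If any expected count is below 5, fall back to Fisher's exact test.
if (small_cells > 0) {
  result <- fisher.test(toy_table)
} else {
  result <- chisq.test(toy_table)
}
print(expected)
print(result$p.value)
```

A common rule of thumb is that all expected counts should be at least 5 for the chi-squared approximation to be trusted; when it fails, an exact test avoids the approximation altogether.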

Clustering

Clustering – Performs clustering which may or may not provide obvious insight into dataset.

# Data were converted to a numeric design matrix; transplant status and the intercept column were excluded
Data <- model.matrix(~ . - transplantstatus, Data)[, -1]

# Only columns with at least 10 non-zero observations were kept; the rest were excluded
Data <- Data[, colSums(Data != 0) >= 10]


# Columns were scaled, then transposed so that variables (not patients) are clustered
cluster <- as.data.frame(Data)
dataScaled <- scale(cluster)
Hcluster <- hclust(dist(t(dataScaled)))
plot(Hcluster)
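A dendrogram can be hard to read on its own, so it can help to cut the tree into a fixed number of groups and list their members. This is a minimal sketch on the built-in `mtcars` data rather than the registry data; the choice of `k = 3` is arbitrary and only for illustration.

```r
# mtcars ships with R; it stands in for the registry data here.
scaled <- scale(mtcars)

# Transpose so that variables (columns), not cars, are clustered, as in the report.
tree <- hclust(dist(t(scaled)))

# Cut the dendrogram into 3 groups; k = 3 is an arbitrary choice.
groups <- cutree(tree, k = 3)
print(split(names(groups), groups))
```

Listing the members of each group makes it easier to judge whether variables that cluster together are clinically related.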

ANOVA

ANOVA – Fits a two-way ANOVA model. States hypothesis clearly, checks assumptions, makes decision and provides interpretation.

Analysis of variance (ANOVA) is a hypothesis testing method that tests whether the means of several groups are equal, and is a generalisation of a t-test, that can be used to test a variety of different hypotheses.

H0 (Null hypothesis): There is no relationship between the number of HLA-B mismatches and time on dialysis.

HA (Alternative hypothesis): There is a relationship between the number of HLA-B mismatches and time on dialysis.

# data = is used instead of attach() so that variable lookup is explicit
kidney3 <- aov(timeondialysis ~ hlamismatchesb, data = transplantData)
summary(kidney3)
##                 Df Sum Sq Mean Sq F value Pr(>F)  
## hlamismatchesb   1    122   122.3   4.991 0.0257 *
## Residuals      960  23524    24.5                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 38 observations deleted due to missingness

As the p-value is small at 0.0257, below 0.05, there is evidence to reject the null hypothesis of no relationship between the number of HLA-B mismatches and time on dialysis. The single degree of freedom for hlamismatchesb shows that it was fitted as one term, so the test amounts to a test for a linear trend rather than a comparison of separate group means. The model contains only one explanatory variable and no survival outcome, so it is a one-way analysis and does not support any conclusion about how long patients survive.
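For comparison, a two-way ANOVA with assumption checks can be sketched as follows. It uses the built-in `ToothGrowth` data rather than the registry data, so the variables `len`, `supp` and `dose` are stand-ins; the Shapiro-Wilk and Bartlett tests are one possible choice of checks, not the only one.

```r
# ToothGrowth ships with R; dose is numeric, so it is converted to a factor
# to compare group means rather than fit a linear trend.
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

# Two-way ANOVA with an interaction between supplement type and dose.
fit <- aov(len ~ supp * dose, data = tg)
anova_table <- summary(fit)[[1]]
print(anova_table)

# Assumption checks: normality of residuals and homogeneity of variance.
normality <- shapiro.test(residuals(fit))
variances <- bartlett.test(len ~ interaction(supp, dose), data = tg)
print(normality$p.value)
print(variances$p.value)
```

Converting a numeric grouping variable with `factor()` is what changes its degrees of freedom from 1 to the number of levels minus one.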

Regression

Regression – Fits a linear regression model. States hypothesis clearly, checks assumptions, makes decision and provides interpretation

Question: Is there a relationship between height and weight of patients?

H0 (Null hypothesis): There is no linear relationship between height and weight of patients (the slope is zero).

H1 (Alternative hypothesis): There is a positive relationship between height and weight of patients, where as height increases, weight increases.

H0 is rejected if p < 0.05.

To investigate this, a scatter plot is first produced to see whether there is a clear visual relationship between height and weight. As shown below, weight tends to increase with height, which is consistent with the alternative hypothesis.

scatter.smooth(x=patientData$height, y=patientData$weight, main="Height vs weight of patients", xlab="Height (cm)", ylab= "Weight (kg)", col="cornflowerblue")

To investigate this further, a linear model is fitted. Based on the results below, the model is:

Height in cm = 122.2663 + 0.6083 * weight in kg

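# cor() returns NA here because height and weight contain missing values;
# adding use = "complete.obs" would restrict the calculation to complete pairs
cor(patientData$height, patientData$weight)
## [1] NA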
linearMod <- lm(height ~ weight, data=patientData)
print(linearMod)
## 
## Call:
## lm(formula = height ~ weight, data = patientData)
## 
## Coefficients:
## (Intercept)       weight  
##    122.2663       0.6083

To assess whether the relationship is statistically significant, the p-value from the model summary below is examined. As it is < 2.2e-16, well below 0.05, the slope is statistically significant. Hence, the null hypothesis is rejected, and both the scatter plot and the linear regression model support a positive relationship between height and weight (the alternative hypothesis). The model explains about 53% of the variance in height (Multiple R-squared = 0.5254).

summary(linearMod)
## 
## Call:
## lm(formula = height ~ weight, data = patientData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.213  -5.176   1.977   7.917  33.413 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 122.26631    1.40354   87.11   <2e-16 ***
## weight        0.60831    0.01862   32.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.52 on 964 degrees of freedom
##   (33 observations deleted due to missingness)
## Multiple R-squared:  0.5254, Adjusted R-squared:  0.5249 
## F-statistic:  1067 on 1 and 964 DF,  p-value: < 2.2e-16

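In short, the linear regression model is:

Height in cm = 122.2663 + 0.6083 * weight in kg

That is, each additional kilogram of weight is associated with an average increase of about 0.61 cm in height. A plausible interpretation is that taller patients have greater body mass and hence weigh more.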
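As a worked example of how the fitted equation is used, the sketch below computes the predicted height for a hypothetical patient. The coefficients are copied from the printed model output; the 70 kg weight is an arbitrary illustrative value.

```r
# Coefficients copied from the printed model output above.
intercept <- 122.2663
slope <- 0.6083

# Predicted height (cm) for a hypothetical patient weighing 70 kg.
weight_kg <- 70
predicted_height <- intercept + slope * weight_kg
print(predicted_height)
```

This gives a predicted height of roughly 164.8 cm. With the fitted model object available, `predict(linearMod, newdata = data.frame(weight = 70))` would give the same prediction without rounding the coefficients.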