SID: 490379879
This lab report analyses data from the Australia and New Zealand Dialysis and Transplant Registry (ANZDATA), a clinical quality registry that collects information on the incidence, prevalence and outcomes of treatment for patients with end-stage renal failure. ANZDATA has been in operation since 1977 and is located at The Royal Adelaide Hospital, South Australia.
# Load the patient, transplant and serum creatinine data
setwd("~/Downloads/AMED3002/42264")
patientData <- read.csv('42264_AnzdataPatients.csv')
transplantData <- read.csv('42264_AnzdataTransplants.csv')
creatineData <- read.csv('42264_AnzdataTransplantSerumCreatinine.csv')
# Keep only the first listed transplant for each patient id; repeat transplants are ignored
transplantData2 <- transplantData[!duplicated(transplantData$id),]
# Variables of interest were extracted
Data <- merge(patientData,transplantData2, by = "id", all=TRUE)
use <- c("gendercode", "latereferralcode", "creatinineatentry", "height", "weight",
"smokingcode", "cancerever", "chroniclungcode", "coronaryarterycode", "peripheralvascularcode",
"cerebrovasularcode", "diabetescode", "graftno", "transplantcentrestate", "recipientantibodycmvcode", "recipientantibodyebvcode", "donorsourcecode", "donorage", "donorgendercode",
"ischaemia", "ageattransplant", "hlamismatchesa", "hlamismatchesb", "hlamismatchesdr",
"hlamismatchesdq", "maxcytotoxicantibodies", "currentcytotoxicantibodies", "timeondialysis",
"transplantstatus")
Data <- Data[, use]
# Recode empty strings and "-" placeholders as NA
Data[Data == ""] <- NA
Data[Data == "-"] <- NA
Chi-square test – Performs a chi-square test, checks assumptions, states whether test of independence or homogeneity, reports odds ratio or relative risk.
H0 (Null hypothesis): There is no association between smoking status (smokingcode) and a history of cancer (cancerever); the two variables are independent.
HA (Alternative hypothesis): There is an association between smoking status and a history of cancer; the two variables are dependent.
H0 is rejected if p < 0.05.
LungTable =table(patientData$smokingcode, patientData$cancerever)
chisq.test(LungTable)
## Warning in chisq.test(LungTable): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: LungTable
## X-squared = 1.2456, df = 4, p-value = 0.8705
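As the p-value was 0.8705, above the significance level of 0.05, the null hypothesis is not rejected: the data provide no evidence of an association between smoking status and a history of cancer in this cohort. This is a test of independence, because a single sample of patients is cross-classified by both variables. The warning indicates that some expected cell counts are small (conventionally below 5), so the chi-squared approximation may be unreliable and the result should be treated with caution; Fisher's exact test or a simulated p-value would be more appropriate. Because the table is larger than 2 x 2 (df = 4), a single odds ratio or relative risk is not defined without first collapsing categories.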
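The assumption check mentioned by the warning can be sketched as follows. This is a minimal illustration, not the analysis run on the registry data: the counts in `toyTable` and its category labels are hypothetical, and the "expected count of at least 5" threshold is the usual rule of thumb rather than a hard requirement.

```r
# Hypothetical counts for illustration only; these are NOT the ANZDATA values.
toyTable <- matrix(c(40, 12, 3,
                     35, 10, 2,
                     20,  6, 1),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(smoking = c("Never", "Former", "Current"),
                                   cancer  = c("No", "Yes", "Unknown")))

# Expected counts under independence: (row total * column total) / grand total
expected <- suppressWarnings(chisq.test(toyTable))$expected
print(round(expected, 2))

# Rule of thumb: every expected count should be at least 5
assumptionMet <- all(expected >= 5)

if (assumptionMet) {
  result <- chisq.test(toyTable)
} else {
  # A Monte Carlo p-value does not rely on the chi-squared approximation
  set.seed(1)
  result <- chisq.test(toyTable, simulate.p.value = TRUE, B = 2000)
}
print(result)
```

For small tables, `fisher.test(toyTable)` is another option that avoids the approximation.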
Clustering – Performs clustering which may or may not provide obvious insight into dataset.
# Convert the data to a numeric design matrix, dropping transplant status and the intercept
Data <- model.matrix(~ . - transplantstatus, Data)[, -1]
# Keep only columns with at least 10 non-zero observations; the rest are excluded
Data <- Data[, colSums(Data != 0) >= 10]
cluster <- as.data.frame(Data)
dataScaled <- scale(cluster)
# Transpose so that the variables (columns), not the patients, are clustered;
# dist() defaults to Euclidean distance and hclust() to complete linkage
Hcluster <- hclust(dist(t(dataScaled)))
plot(Hcluster)
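One way to read groups off a dendrogram like this is to cut the tree into a fixed number of clusters with `cutree()`. The sketch below uses simulated data as a stand-in for the scaled patient matrix; the variable names `a1`, `a2`, `b1`, `b2` and the choice of `k = 2` are assumptions for illustration only.

```r
set.seed(42)
# Simulated stand-in for the patient matrix (rows = patients, columns = variables)
n <- 50
base1 <- rnorm(n)
base2 <- rnorm(n)
toy <- data.frame(
  a1 = base1 + rnorm(n, sd = 0.1),  # a1 and a2 share one underlying signal
  a2 = base1 + rnorm(n, sd = 0.1),
  b1 = base2 + rnorm(n, sd = 0.1),  # b1 and b2 share a different signal
  b2 = base2 + rnorm(n, sd = 0.1)
)

# Same steps as in the report: scale, transpose, then cluster the variables
toyScaled <- scale(toy)
toyTree <- hclust(dist(t(toyScaled)))

# Cut the dendrogram into two groups of variables
groups <- cutree(toyTree, k = 2)
print(groups)
```

Variables that land in the same group move together across patients, which is the kind of pattern to look for in the dendrogram of the registry variables.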
ANOVA – Fits a two-way ANOVA model. States hypothesis clearly, checks assumptions, makes decision and provides interpretation.
Analysis of variance (ANOVA) is a hypothesis testing method that tests whether the means of several groups are equal. It generalises the two-sample t-test to more than two groups.
H0 (Null hypothesis): Mean time on dialysis (timeondialysis) does not differ with the number of HLA-B mismatches (hlamismatchesb).
HA (Alternative hypothesis): Mean time on dialysis differs with the number of HLA-B mismatches.
# Pass the data frame explicitly rather than using attach()
kidney3 <- aov(timeondialysis ~ hlamismatchesb, data = transplantData)
summary(kidney3)
## Df Sum Sq Mean Sq F value Pr(>F)
## hlamismatchesb 1 122 122.3 4.991 0.0257 *
## Residuals 960 23524 24.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 38 observations deleted due to missingness
As the p-value of 0.0257 is below 0.05, there is evidence to reject the null hypothesis: time on dialysis is associated with the number of HLA-B mismatches. The model contains no survival variable, so it supports no conclusion about how long patients survive. Two limitations apply. First, hlamismatchesb entered the model as a numeric variable (1 degree of freedom), so this is a test for a linear trend rather than a comparison of group means, and with a single predictor it is a one-way model, not a two-way model. Second, the assumptions of normally distributed residuals and equal variances were not checked.
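A two-way ANOVA with both predictors coded as factors, followed by basic assumption checks, could look like the sketch below. The data are simulated and the names `mismatches`, `donor` and `response` are hypothetical; they are not columns of the registry files.

```r
set.seed(7)
# Simulated, balanced data for illustration only (20 observations per cell)
toy <- data.frame(
  mismatches = factor(rep(c("0", "1", "2"), each = 40)),
  donor      = factor(rep(c("Deceased", "Living"), times = 60))
)
toy$response <- 20 + 3 * (toy$mismatches == "2") +
  5 * (toy$donor == "Living") + rnorm(nrow(toy), sd = 4)

# Two-way ANOVA with interaction; factor() ensures group means are compared
fit <- aov(response ~ mismatches * donor, data = toy)
anovaTable <- summary(fit)[[1]]
print(anovaTable)

# Assumption checks: normality of residuals and equal variance across cells
shapiro <- shapiro.test(residuals(fit))
toy$cell <- interaction(toy$mismatches, toy$donor)
bart <- bartlett.test(response ~ cell, data = toy)
print(shapiro$p.value)
print(bart$p.value)
```

Large p-values from the Shapiro-Wilk and Bartlett tests would give no reason to doubt the normality and equal-variance assumptions; residual plots are a useful complement to both.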
Regression – Fits a linear regression model. States hypothesis clearly, checks assumptions, makes decision and provides interpretation
Question: Is there a relationship between height and weight of patients?
H0 (Null hypothesis): There is no linear relationship between height and weight of patients (the slope is zero).
H1 (Alternative hypothesis): There is a positive linear relationship between height and weight of patients, where as height increases, weight increases.
H0 is rejected if p < 0.05.
To investigate this, a scatter plot is first produced to see whether there is a clear visual relationship between height and weight. As shown below, weight tends to increase with height, which supports the alternative hypothesis.
scatter.smooth(x=patientData$height, y=patientData$weight, main="Height vs weight of patients", xlab="Height (cm)", ylab= "Weight (kg)", col="cornflowerblue")
To investigate this further, a linear model is fitted with height as the response and weight as the predictor. Based on the results below, the fitted model is:
Height (cm) = 122.2663 + 0.6083 * weight (kg)
cor(patientData$height, patientData$weight)
## [1] NA
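The NA above arises because `cor()` returns NA by default when either vector contains missing values, and the regression summary reports 33 observations deleted due to missingness. A minimal sketch of the usual fix, using made-up vectors rather than the registry data:

```r
# Made-up values for illustration; NA marks a missing measurement
height <- c(150, 160, NA, 170, 180, 175)
weight <- c(50, 60, 70, NA, 85, 80)

# Default behaviour (use = "everything"): any NA makes the result NA
rDefault <- cor(height, weight)
print(rDefault)

# Restrict the calculation to rows where both values are present
rComplete <- cor(height, weight, use = "complete.obs")
print(rComplete)
```

For a simple linear regression the squared correlation equals the Multiple R-squared, so the value of 0.5254 reported in the model summary implies a correlation of roughly 0.72 for the complete cases.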
linearMod <- lm(height ~ weight, data=patientData)
print(linearMod)
##
## Call:
## lm(formula = height ~ weight, data = patientData)
##
## Coefficients:
## (Intercept) weight
## 122.2663 0.6083
To assess whether the model is statistically significant, the p-value for the slope is read from the model summary below. As it is < 2.2e-16, and hence below 0.05, the null hypothesis is rejected: both the scatter plot and the linear regression model support a positive relationship between height and weight (the alternative hypothesis). The model explains about 53% of the variance in height (Multiple R-squared = 0.5254).
summary(linearMod)
##
## Call:
## lm(formula = height ~ weight, data = patientData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.213 -5.176 1.977 7.917 33.413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 122.26631 1.40354 87.11 <2e-16 ***
## weight 0.60831 0.01862 32.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.52 on 964 degrees of freedom
## (33 observations deleted due to missingness)
## Multiple R-squared: 0.5254, Adjusted R-squared: 0.5249
## F-statistic: 1067 on 1 and 964 DF, p-value: < 2.2e-16
In short, the fitted linear regression model is:
Height (cm) = 122.2663 + 0.6083 * weight (kg)
Each additional kilogram of weight is associated with an average increase of about 0.61 cm in height. This is consistent with taller patients having greater body mass and hence greater weight.
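The assumptions of the linear model (linearity, constant variance and approximately normal residuals) can be checked with the standard diagnostic plots. The sketch below uses simulated data; `heightCm` and `weightKg` are hypothetical names, and the simulated coefficients are only loosely modelled on the fitted values above.

```r
set.seed(3)
# Simulated data for illustration only; not the registry measurements
weightKg <- runif(200, min = 45, max = 110)
heightCm <- 120 + 0.6 * weightKg + rnorm(200, sd = 8)
toy <- data.frame(heightCm, weightKg)

fit <- lm(heightCm ~ weightKg, data = toy)

# Four standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# 95% confidence interval for the slope
slopeCI <- confint(fit)["weightKg", ]
print(slopeCI)
```

A residuals-vs-fitted plot with no curve or funnel shape, and a Q-Q plot close to the reference line, would support the assumptions; the large negative minimum residual in the summary above (-74.213) suggests at least one outlier worth inspecting.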