Load dataset
#install.packages("survival","flexsurv","SurvRegCensCov","survminer","lmtest","ggplot2","dplyr" )
library(survival)
library(flexsurv)
library(SurvRegCensCov)
library(survminer)
library(lmtest)
library(ggplot2)
library(dplyr)
survdata <- url('https://acaimo.github.io/teaching/Survival/survdata.Rdata')
load(survdata)
attach(survdata)
Q1 Load the dataset and find the overall number of subjects involved in the study.
length(survdata$id)
## [1] 200
200 subjects in the set
Calculate the Kaplan-Meier estimate of the survival function and plot it.
KM <- survfit(Surv(time, status) ~ 1)
plot(KM, main="Kaplan - Maier Estimates of Survival Time + confidence intervals", xlab="Time", ylab="S(t)")
Calculate the Nelson-Aalen estimate of the survival function and plot it.
N_A <- survfit(coxph(Surv(time, status) ~ 1), type="aalen")
plot(N_A, main="Nelson-Aalen Estimates of Survival Time + Confidence Intervals", xlab="Time", ylab="S(t)")
Estimate the conditional probability that a subject survives beyond time t=7 given that the subject does not survive beyond time t=12.
Using \[\Pr(X>7 | X<12) = \frac{\Pr(X>7 \cap X<12)}{\Pr(X<12)}.\] we get( since \(S(X)=1-F(X)\)) \[Pr(X>7 | X<12) = \frac{S(7) -S(12)}{1-S(12)}\]
s7 <- summary(KM, times=7)$surv #Pr(X>7)
s12 <- summary(KM, times=12)$surv#1-Pr(X<12)
(s7-s12)/(1-s12)
## [1] 0.2048043
So the likelihood of surving for at least T=7 given you dont survive past T=12 is 0.204
Q2. Calculate the Kaplan-Meier estimate of the survival function for each treatment. Plot the two survival curves and comment on them. Test the difference between the two curves and state your hypotheses and conclusions clearly.
KM_group <- survfit(Surv(time, status) ~ trt, conf.type='none')
#summary(KM_group)
plot(KM_group, col=c(1,2), lwd=2, xlab='time', ylab='Survival', main="Survival Rates grouped by treatment")
legend("topright", legend = c("Treatment 1", "Treatment 2"), col=c(1, 2), lwd=c(2, 2))
Here we see a marked drop in survival rates over time for Treatment 2 in comparison to treatment 1.This suggests that T1 is a better treatment. However we need to run some hypothesis tests in order to determine if it is significantly significant The Null Hypothesis: There is no difference in the survival rates of the treatments The Alternative Hypothesis: There is a difference in survival rate depending on treatment
cv<- qchisq(.95, 1) #chisquared test for 95% confidence and 1 degree of freedom
cv
## [1] 3.841459
#checking Log-Rank Test
lr <- survdiff(Surv(time, status) ~ trt, rho = 0 )
lr
## Call:
## survdiff(formula = Surv(time, status) ~ trt, rho = 0)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## trt=1 30 17 29.8 5.47 9.47
## trt=2 170 81 68.2 2.38 9.47
##
## Chisq= 9.5 on 1 degrees of freedom, p= 0.002
lr$chisq > cv
## [1] TRUE
#Checking Wilcoxon Test
wc <- survdiff(Surv(time,status) ~trt, rho = 1)
wc
## Call:
## survdiff(formula = Surv(time, status) ~ trt, rho = 1)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## trt=1 30 8.21 15.8 3.66 7.27
## trt=2 170 57.36 49.8 1.16 7.27
##
## Chisq= 7.3 on 1 degrees of freedom, p= 0.007
wc$chisq > cv
## [1] TRUE
In this case both the chi-sq for statistic for both the Log-Rank and the Wilcoxon tests are greater than a the critical value of 3.8414 so we reject the null hypothesis that the 2 treatments have the same survival rate.
Q3 Fit an exponential model to the survival data (removing censored observations) without considering covariate information. Estimate the parameter λ and the conditional probability that a subject survives beyond time t=10 given that the subject survived beyond time t=7. [15 marks]
For this we first need to create a new data frame censoring observations.
new_Data<- subset(survdata, status==1)
knowing that we can estimate the rate parameter: \[\widehat{\lambda} = \frac{n}{\sum_{i=1}^n t_i}\] we get
n <- dim(new_Data)[1]
sum.ti <- sum(new_Data$time)
lambda.hat <- n/sum.ti
lambda.hat
## [1] 0.3119935
from here we can find S(t) for the given parameters using \[S(t) = \exp(-\lambda t)\] we can plot this via
t <- seq(0, max(new_Data$time))
S.t <- exp(-lambda.hat*t)
h.t <- lambda.hat
plot(S.t, type = 'l',
xlab = 'Time', ylab = '',
main = 'Exponential Model',
col = 1, lwd = 2)
legend("right",
legend = "S(t)", lwd = 2)
and using \[S(10|7) = \frac{S(10 \cap 7)}{S(7)}\]
S.10<- exp(-lambda.hat*10)
S.7 <- exp(-lambda.hat*7)
S.10given7 <- (S.7-S.10)/(S.7)
S.10given7
## [1] 0.6077989
So provided the person in this case survives past 7 years there is at 60% chance of them surviving past 10 years
Q4.Fit a PH Weibull regression model including trt and age to the survival data and interpret the parameter estimates obtained. [15 marks] The general expresion for the Survival functon of theWeibull Model is \[S(t) = \exp \{ -\exp \{\beta_1 \delta_1 Treatment2 + \beta_2 age\} \lambda t^{\gamma}\} \] and the Hazard Function is \[h(t) = \lambda \gamma t^{\gamma-1} \exp \{\beta_1 \delta_{trt2} + \beta_2 age\}\] Where \(\delta_{trt2}\) is 1 if and only if treatment 2 is given to the patient and 0 at all other times. To find the missing $ , , $
Wei <- WeibullReg(Surv(time, status)~factor(trt)+age, survdata)
Wei
## $formula
## Surv(time, status) ~ factor(trt) + age
##
## $coef
## Estimate SE
## lambda 32.3009034 22.41879351
## gamma 2.5719182 0.22713563
## factor(trt)2 -0.2976738 0.27590930
## age -0.3701676 0.04474674
##
## $HR
## HR LB UB
## factor(trt)2 0.7425435 0.4323819 1.2751942
## age 0.6906185 0.6326299 0.7539225
##
## $ETR
## ETR LB UB
## factor(trt)2 1.122704 0.9092788 1.386224
## age 1.154799 1.1292913 1.180884
##
## $summary
##
## Call:
## survival::survreg(formula = formula, data = data, dist = "weibull")
## Value Std. Error z p
## (Intercept) -1.3512 0.2350 -5.75 8.9e-09
## factor(trt)2 0.1157 0.1076 1.08 0.28
## age 0.1439 0.0114 12.63 < 2e-16
## Log(scale) -0.9447 0.0883 -10.70 < 2e-16
##
## Scale= 0.389
##
## Weibull distribution
## Loglik(model)= -203.8 Loglik(intercept only)= -275.7
## Chisq= 143.71 on 2 degrees of freedom, p= 6.2e-32
## Number of Newton-Raphson Iterations: 8
## n= 200
There is a lot of information here. The first key piece is the p-value of 6.2e-32 this suggests that the model is signficant in what is happening in this situation.
Filling the coefficent values into the formulae above we get the following \[S(t) = \exp \{ -\exp \{-0.2976738 * \delta_{trt2} -0.3701676 * age\} 32.3009034 t^{2.5719182} \] and the hazard function \[h(t) = 32.3009034 * 2.5719182 t^{\ 2.5719182-1} \exp \{-0.2976738 * \delta_{trt2} -0.3701676 * age\} \]
To calculate the hazard Ratio between treatment types we can use \[\widehat{HR}_{trt = 2\ \mathrm{vs}\ trt = 1} = \exp(\beta_{trt})=exp(-0.2976738)=0.7425441 .\] Knowing this we can say that we expect treatment 2 to be only 75% as effecttive as treatment 1 provided all other variables are kept constant which corresponds to our observations earlier.
Q5 Fit and compare two PH Cox models: one including age only, the other including trt and age. Interpret the parameter estimates obtained. [20 marks]]
coxph.fit<-coxph(Surv(time) ~ age , data = survdata)
summary(coxph.fit)
## Call:
## coxph(formula = Surv(time) ~ age, data = survdata)
##
## n= 200, number of events= 200
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age -2.71575 0.06616 0.17599 -15.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 0.06616 15.12 0.04686 0.0934
##
## Concordance= 0.956 (se = 0.003 )
## Likelihood ratio test= 859.1 on 1 df, p=<2e-16
## Wald test = 238.1 on 1 df, p=<2e-16
## Score (logrank) test = 161.9 on 1 df, p=<2e-16
Adding Trt into the mix
coxph.fit.trt<-coxph(Surv(time, status) ~ age + trt + age*trt , data = survdata)
summary(coxph.fit.trt)
## Call:
## coxph(formula = Surv(time, status) ~ age + trt + age * trt, data = survdata)
##
## n= 200, number of events= 98
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age -2.93233 0.05327 0.31886 -9.196 <2e-16 ***
## trt -1.47803 0.22809 1.42195 -1.039 0.299
## age:trt 0.09506 1.09973 0.08230 1.155 0.248
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 0.05327 18.7713 0.02852 0.09952
## trt 0.22809 4.3843 0.01405 3.70225
## age:trt 1.09973 0.9093 0.93590 1.29224
##
## Concordance= 0.957 (se = 0.004 )
## Likelihood ratio test= 419.5 on 3 df, p=<2e-16
## Wald test = 113.6 on 3 df, p=<2e-16
## Score (logrank) test = 98.97 on 3 df, p=<2e-16
We can see in this model that the interaction factor between age and treatment has a p-value of 0.248 which is not significant so we can disregard this term and look at a more simple verison
coxph.fit.trt<-coxph(formula = Surv(time, status) ~ age + trt , data = survdata)
summary(coxph.fit.trt)
## Call:
## coxph(formula = Surv(time, status) ~ age + trt, data = survdata)
##
## n= 200, number of events= 98
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age -2.7256 0.0655 0.2557 -10.66 <2e-16 ***
## trt 0.1328 1.1421 0.4028 0.33 0.742
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 0.0655 15.2663 0.03969 0.1081
## trt 1.1421 0.8756 0.51858 2.5152
##
## Concordance= 0.955 (se = 0.004 )
## Likelihood ratio test= 418.2 on 2 df, p=<2e-16
## Wald test = 113.8 on 2 df, p=<2e-16
## Score (logrank) test = 79.29 on 2 df, p=<2e-16
Here we see that the treatment factor is not significant p-value= 0.742. This suggests that the previous model with just age is sufficient however we should investigate the log likelihood of this.
lrtest(coxph.fit.trt, coxph.fit)
from this we can see that that logliklihood ratio for the 2 models we have looked at is 449.83 which for a \(\chi^2\) test with 1 degree of freedom (2 for the treatment model and 1 for the age alone model [2-1=1]) falls outside a 95% confidence interval for a chi-sq distribution so we can reject the hypothesis that age alone model will suffice
Fitting the age only models we see that \[log(h_{i}(t))=h_0(t)+\beta_1*age_i+\beta_2*\delta_{trt2}\]
The model is significant with a p-value of 2e-16 far below the 0.05 threshold.
S.Cox.T <- survfit(coxph.fit.trt)
plot(S.Cox.T, xlab = 'time', ylab = 'S(t)', conf.int = F)
Here we see the survival liklihood dropping with the model that we have hypothesised. To check the fit of this lets overlay the data that we have for the time variable and see how this syncs up
plot(S.Cox.T, xlab = 'time', ylab = 'S(t)', conf.int = F, col='Red', ylim=c(0,1.1), main= "Cox_ph V data")
par(new = T)
hist(time, axes=F, xlab=NA, ylab=NA, main=NA)
axis(side=4)
mtext(side=4, line=2, "Counts of subjects Dying at between time steps")
legend("topright", legend=c("Cox-ph model", "histogram of data" ), col=c("red", 'black'), cex=0.8, pch=16)
As we can see from the graph the model does a good job explaining the initial parts of the data however the presence of outliers in the time data( long tail) skews the model at longer time frames
Hazard ratios
ggforest(coxph.fit.trt, data= survdata)
These graphs show that, as we noticed treatment 1 is 14% more effictive as treatment 2 although the pvalue of .742 suggests that this is not significant
More significant however is age. The implication here is that each year of age effectively reduces your time in the study by 0.066 units. This is statistically significant and posiibly explains parts of the long tail of the histogram