The problems on this assignment require you to and compare survival between groups. Answer all questions specified on the problem but if you see something interesting and want to do more analysis please report it as well. Don’t forget to include discussion and to make your plots look nice and presentable. Submit your .rmd file with the knitted PDF (or knitted Word Document saved as a PDF). If you are still having trouble with .rmd, let us know and we will help you, but we are going to start requiring this from here on out.
#load libraries
library(survival)
library(dplyr)
library(tidyr)
library(coin)
library(ggplot2)
library(HSAUR3)
library(party)
An investigator collected data on survival of patients with lung cancer at Mayo Clinic. The investigator would like you, the statistician, to answer the following questions and provide some graphs. Use the cancer data located in the survival package.
What is the probability that someone will survive past 300 days? Answer - The probability that someone will surviving lung cancer past 300 days is \(\approx\) 0.5301.
#load the cancer data
data(cancer)
#build a survival model
surv.mod <- survfit(Surv(time,status==2)~1,data=cancer)
#compute the probability that someone will live past 300 days
summary(surv.mod,time = 300)[6]
## $surv
## [1] 0.5306081
Provide a graph, including 95% confidence limits, of the Kaplan-Meier estimate of the entire study.
par(mfrow=c(1,1),mar=c(5, 4, 4, 2) + 0.1)
plot(surv.mod,main='Kaplan-Meir Estimate',
xlab='Time in Days',
ylab='Probability of Surviving',
col=c(2,1,1),lty = c(2,1,1))
legend("bottomleft", c("Control", "RIT"),
lty = c(2,1,1), col = c(2,1,1))
Is there a difference in the survival rates between males and females? Provide formal statistical test with p-value and visual evidence.
Answer - Yes women have a higher probability of surviving lung cancer than men as show by the Kaplan-Meier estimate plots and by the probability of each surviving past 300 days. The difference in probability is substation and is \(\approx\) 0.233. The null hypothesis can also be rejected and the p-value is also significant at the 0.05 level.
#set the sex variable to a factor
cancer$sex <- as.factor(cancer$sex)
#create a subset for men and women
male <- cancer %>%
filter(sex=='1')
female <- cancer %>%
filter(sex=='2')
#build the two models
surv.male <- survfit(Surv(time,status==2)~1,data=male)
surv.female <- survfit(Surv(time,status==2)~1,data=female)
#compute the p-values
logrank_test(Surv(time,status==2)~sex,data=cancer)
##
## Asymptotic Two-Sample Logrank Test
##
## data: Surv(time, status == 2) by sex (1, 2)
## Z = -3.2779, p-value = 0.001046
## alternative hypothesis: true theta is not equal to 1
The probability of men surviving past 300 days is \(\approx\) 0.441 which is lower than for women.
summary(surv.male,time = 300)[6]
## $surv
## [1] 0.4410889
The probability of men surviving past 300 days is \(\approx\) 0.674 which is higher than for men.
summary(surv.female,time = 300)[6]
## $surv
## [1] 0.6742026
par(mfrow=c(1,2),oma = c(4, 4, 2, 2),mar = c(2, 2, 1, 1))
plot(surv.male,main="Male Patients",col=c(2,1,1),lty = c(2,1,1))
legend("topright", c("Control", "RIT"),
lty = c(2,1,1), col = c(2,1,1))
plot(surv.female,main='Female Patients',col=c(2,1,1),lty = c(2,1,1))
legend("topright", c("Control", "RIT"),
lty = c(2,1,1), col = c(2,1,1))
mtext('Kaplan-Meier Estimate by Gender', outer = TRUE, cex = 1.5)
mtext('Time in Days', side = 1, outer = TRUE, line = 2)
mtext('Probability of Surviving', side = 2, outer = TRUE, line = 2)
Is there a difference in the survival rates for the older half of the group versus the younger half? Provide a formal statistical test with p-value and visual evidence.
Answer - There is a very small difference in the probability of death in the two age groups. However, the p-value shows that the null hypotheses can be rejected, but not at the 0.05 level. The difference in probability of death is only .03.
Judging by the histogram the split should occur at 63 years.
#calculate the median
median_age <- median(cancer$age)
#plot the histogram of age
ggplot(cancer, aes(x=age))+
geom_histogram(binwidth=1,colour = "black") +
geom_vline(xintercept=median_age, colour="red",lwd=1.5) +
geom_text(label = paste(median_age,'years'), y = 4, x=median_age,
angle = 60, hjust = 0,color='red',size=15) +
labs(title='Histogram of Cancer Patient Age',
x='Age of Patient',
y='Count')
#create the agegroup variable splitting old & young by the median
cancer <- cancer %>%
mutate(agegroup = ifelse(age >= median_age,"Old","Young"))
#set the agegroup to a factor
cancer$agegroup <- as.factor(cancer$agegroup)
#create two new partitioned datasets
old <- cancer %>%
filter(agegroup=='Old')
young <- cancer %>%
filter(agegroup=='Young')
logrank_test(Surv(time,status==2)~agegroup,data=cancer)
##
## Asymptotic Two-Sample Logrank Test
##
## data: Surv(time, status == 2) by agegroup (Old, Young)
## Z = -0.8637, p-value = 0.3878
## alternative hypothesis: true theta is not equal to 1
The probability of an adult age 63 and older surviving lung cancer is \(\approx\) 0.518.
surv.old <- survfit(Surv(time,status==2)~1,data=old)
surv.young <- survfit(Surv(time,status==2)~1,data=young)
summary(surv.old,time = 300)[6]
## $surv
## [1] 0.5181564
The probability of an adult younger than 63 surviving lung cancer is \(\approx\) 0.5444.
summary(surv.young,time = 300)[6]
## $surv
## [1] 0.5444021
par(mfrow=c(1,2),oma = c(4, 4, 2, 2),mar = c(2, 2, 1, 1))
plot(surv.old, main='Over 63',col=c(2,1,1))
legend("topright", c("Control", "RIT"),
lty = c(2,1,1), col = c(2,1,1))
plot(surv.young,main='Under 63',col=c(2,1,1))
legend("topright", c("Control", "RIT"),
lty = c(2,1,1), col = c(2,1,1))
mtext('Kaplan-Meier Estimate by Age Group', outer = TRUE, cex = 1.5)
mtext('Time in Days', side = 1, outer = TRUE, line = 2)
mtext('Probability of Surviving', side = 2, outer = TRUE, line = 2)
As shown by the summary of the Cox Proportional Hazards model the ph.ecog, sex, inst, ph.karno, and wt.loss all contribute/have statistical significance with the survival rates of patients with advanced lung cancer.
cancer.cox <- coxph(Surv(time,status==2)~.,data=cancer)
summary(cancer.cox)
## Call:
## coxph(formula = Surv(time, status == 2) ~ ., data = cancer)
##
## n= 167, number of events= 120
## (61 observations deleted due to missingness)
##
## coef exp(coef) se(coef) z Pr(>|z|)
## inst -2.850e-02 9.719e-01 1.305e-02 -2.183 0.029004 *
## age 3.678e-02 1.037e+00 1.936e-02 1.900 0.057452 .
## sex2 -5.826e-01 5.584e-01 2.015e-01 -2.891 0.003840 **
## ph.ecog 8.755e-01 2.400e+00 2.375e-01 3.687 0.000227 ***
## ph.karno 2.354e-02 1.024e+00 1.157e-02 2.035 0.041819 *
## pat.karno -1.110e-02 9.890e-01 8.133e-03 -1.365 0.172353
## meal.cal 1.522e-05 1.000e+00 2.630e-04 0.058 0.953858
## wt.loss -1.680e-02 9.833e-01 7.777e-03 -2.161 0.030715 *
## agegroupYoung 5.283e-01 1.696e+00 3.315e-01 1.594 0.110945
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## inst 0.9719 1.0289 0.9474 0.9971
## age 1.0375 0.9639 0.9988 1.0776
## sex2 0.5584 1.7907 0.3762 0.8289
## ph.ecog 2.4000 0.4167 1.5069 3.8226
## ph.karno 1.0238 0.9767 1.0009 1.0473
## pat.karno 0.9890 1.0112 0.9733 1.0049
## meal.cal 1.0000 1.0000 0.9995 1.0005
## wt.loss 0.9833 1.0169 0.9685 0.9984
## agegroupYoung 1.6961 0.5896 0.8858 3.2478
##
## Concordance= 0.652 (se = 0.031 )
## Rsquare= 0.195 (max possible= 0.998 )
## Likelihood ratio test= 36.23 on 9 df, p=3.608e-05
## Wald test = 34.6 on 9 df, p=7.008e-05
## Score (logrank) test = 35.53 on 9 df, p=4.806e-05
The conditional inference tree clearly shows the increased probability of survival for female patients vs. male patients and also uses the highly significant ph.ecog variable for the first split of the tree.
cancer.tree <- ctree(Surv(time, status==2) ~ ., data = cancer)
plot(cancer.tree)
The mastectomy data from the HSAUR3 package are the survival times (in months) after mastectomy of women with breast cancer. The cancers are classiffied as having metastasized or not based on a histochemical marker.
Answer - There is a substantial difference in the probability of death for metastasized and un-metastasized patients. The difference is 0.2708 at 50 months. This can also be seen with the plot. Notice the much higher upper confidence interval for non-metastasized patients. There is a much higher probability of patients surviving if the cancer has not metastasized.
Plot the survivor functions of each group, estimated using the Kaplan-Meier estimate, on the same graph and comment on the differences.
#load the data
data(mastectomy)
#set the event variable to numeric
mastectomy <- mastectomy %>%
mutate(event = ifelse(event==T,2,1))
#partition the data into metastasized and not metastasized
meta.yes <- mastectomy[mastectomy$metastasized == 'yes',]
meta.no <- mastectomy[mastectomy$metastasized == 'no',]
#create two models
mast.yes <- survfit(Surv(time,event==2)~1,data=meta.yes)
mast.no <- survfit(Surv(time,event==2)~1,data=meta.no)
summary(mast.yes,time = 50)[6]
## $surv
## [1] 0.5625
summary(mast.no,time = 50)[6]
## $surv
## [1] 0.8333333
par(mfrow=c(1,2),oma = c(4, 4, 2, 2),mar = c(2, 2, 1, 1))
plot(mast.yes, main='Metastasized',col=c(2,1,1))
legend("topright", c("Control", "RIT"),
lty = c(2,1,1), col = c(2,1,1))
plot(mast.no,main='Not Metastasized',col=c(2,1,1))
legend("bottomleft", c("Control", "RIT"),
lty = c(2,1,1), col = c(2,1,1))
mtext('Kaplan-Meier Estimate of Mastectomy Data', outer = TRUE, cex = 1.5)
mtext('Time in Months', side = 1, outer = TRUE, line = 2)
mtext('Probability of Surviving', side = 2, outer = TRUE, line = 2)
The null hypothesis can be rejected and the p-value is significant but not at the 0.05 level.
logrank_test(Surv(time,event==2)~metastasized,data=mastectomy)
##
## Asymptotic Two-Sample Logrank Test
##
## data: Surv(time, event == 2) by metastasized (no, yes)
## Z = 1.8667, p-value = 0.06194
## alternative hypothesis: true theta is not equal to 1