Question 1. (Lung.csv)
(a)
coxph( Surv( Time , Status ) ~ Sex , data=Lung )
## Call:
## coxph(formula = Surv(Time, Status) ~ Sex, data = Lung)
##
##
## coef exp(coef) se(coef) z p
## Sex 0.531 1.7 0.167 3.18 0.0015
##
## Likelihood ratio test=10.6 on 1 df, p=0.00111 n= 228, number of events= 165
HR (male to female) = exp(0.531) = 1.7. So, at any time point, the male hazard function is 1.7 times that of the female hazard function.
(b)
m = coxph( Surv( Time , Status ) ~ Sex , data=Lung )
new = data.frame( Sex = c(0,1) )
new
## Sex
## 1 0
## 2 1
# Female curve is solid line and male curve is dashed line:
plot( survfit( m , newdata=new ) , lty=1:2 ,mark.time=F )

In the above graph, the solid line represents females, and the dashed line represents males, so we see that females have substantially longer survival with lung cancer than males.
(c)
KM = survfit( Surv( Time , Status ) ~ Sex , data=Lung )
plot( KM , fun='cloglog' , lty=1:2 , mark.time=F )

The gap between the curves should be interpreted as the log of the hazard ratio between males (dashed curve) and females (solid curve), that is, the gap is like the coefficient from the Cox PH model, except the gap here may stay constant, or may show a changing impact of the covariate, Sex.
(d)
Here it seems like the gap is slightly shrinking over time, that is, males are more likely to die at any time point compared to females, but the gap between the two groups is larger at small times and smaller at larger times. This tells us that the proportional hazards assumption made in Cox’s PH model should be called into question.
Note: From the above answer, we should be able to see that if we made the c-log-log graph using Cox’s PH model, the gap between the curves would be constant, and would be 0.531, that is, the coefficient on Sex from Cox’s PH model.
(e)
m = coxph( Surv( Time , Status ) ~ Sex , data=Lung )
plot( cox.zph(m) )

The y-axis of the Schoenfeld residuals graph can be interpreted as the log of the hazard ratio for Sex, i.e. the coefficient in Cox’s model if it were allow to vary over time. So, this graph estimates the impact of gender as a function of time. If it is flat, then the PH assumption is adequate.
Here, we see that the impact of “going from female to male” is roughly 1 at small times and goes essentially to 0 at larger times. This is exactly what is happening with the gap in the c-log-log graph. At small times, the gap is about 1 and then it goes to essentially 0.
This thus suggests that the PH assumption should be called into question, as before.
(f)
The coefficient on Sex from the Cox PH model is 0.531. This is what the estimated log of the hazard ratio would be if it is forced to be constant (i.e. what the Cox PH model assumes).
Thus, looking at the graph of the Schoenfeld residuals, it should make sense that this 0.531 represents a sort of average value of the smoothed curve shown. We can thus add the coefficient (0.531) to the Schoenfeld residual graph with a horizontal line:
m = coxph( Surv( Time , Status ) ~ Sex , data=Lung )
plot( cox.zph(m) )
abline( 0.531 , 0 , col='red')

Indeed, this shows exactly what we stated above.
(g)
cox.zph( m )
## rho chisq p
## Sex -0.131 2.77 0.0962
The p-value on the formal test of the PH assumption for the Sex variable is 0.0962, not quite significant at the 5% level. This does not agree with what we saw from the c-log-log graph, nor the Schoenfeld residual graph. The formal test will often, but not always, agree with the graphical assessment. In my opinion, using the Cox PH assumption for Sex in this data would be suspect, and the p-value is relatively small, even if not less than 0.05.
(h)
coxph( Surv( Time , Status ) ~ PhysEcog , data=Lung )
## Call:
## coxph(formula = Surv(Time, Status) ~ PhysEcog, data = Lung)
##
##
## coef exp(coef) se(coef) z p
## PhysEcog 0.476 1.61 0.113 4.2 2.7e-05
##
## Likelihood ratio test=17.6 on 1 df, p=2.77e-05 n= 227, number of events= 164
## (1 observation deleted due to missingness)
coxph( Surv( Time , Status ) ~ PhysKarno , data=Lung )
## Call:
## coxph(formula = Surv(Time, Status) ~ PhysKarno, data = Lung)
##
##
## coef exp(coef) se(coef) z p
## PhysKarno -0.0164 0.984 0.00585 -2.81 0.005
##
## Likelihood ratio test=7.56 on 1 df, p=0.00597 n= 227, number of events= 164
## (1 observation deleted due to missingness)
coxph( Surv( Time , Status ) ~ PatKarno , data=Lung )
## Call:
## coxph(formula = Surv(Time, Status) ~ PatKarno, data = Lung)
##
##
## coef exp(coef) se(coef) z p
## PatKarno -0.0199 0.98 0.00547 -3.63 0.00028
##
## Likelihood ratio test=12.5 on 1 df, p=0.000412 n= 225, number of events= 162
## (3 observations deleted due to missingness)
Yes, from the models/p-values above, each of the three measures is significantly related to survival time.
I didn’t ask for this, but you can also see that the direction of each relationship is intuitive based on the way the scales are defined. Higher PatKarno and higher PhysKarno is good, and these are associated with lower hazard, as seen by the negative coefficients above. Higher PhysEcog is bad, and this is associated with higher hazard, as seen by the positive coefficient above.
(i)
coxph( Surv( Time , Status ) ~ PhysEcog + PhysKarno + PatKarno , data=Lung )
## Call:
## coxph(formula = Surv(Time, Status) ~ PhysEcog + PhysKarno + PatKarno,
## data = Lung)
##
##
## coef exp(coef) se(coef) z p
## PhysEcog 0.5061 1.659 0.18822 2.69 0.0072
## PhysKarno 0.0116 1.012 0.01007 1.15 0.2500
## PatKarno -0.0111 0.989 0.00669 -1.66 0.0980
##
## Likelihood ratio test=20.2 on 3 df, p=0.000155 n= 223, number of events= 160
## (5 observations deleted due to missingness)
No, in the presence of the other two measures, the only performance measure which maintains significance is PhysEcog. Clearly, these measures are not independent, in fact, they are heavily correlated, as can be verified using basic linear models (lm). Knowing the value of one of the three measures makes knowing the value of the other two not very helpful.
(j) Interestingly, it is the simpler 5-point version of the performance measure (PhysEcog) which correlates best with survival, not either of the two 100-point scales. In fact, this agrees with some other literature on using performance scales which are too “fine” versus ones that are more “coarse”.
(k) In this scenario, you can fit a stratified Cox PH model, which does not make the PH assumption for Sex, but still controls for its effect on survival when evaluating the effect of PhysEcog (so as not to serve as a potential confounder between PhysEcog and survival). That is, we would fit the following:
coxph( Surv( Time , Status ) ~ PhysEcog + strata(Sex) , data=Lung )
## Call:
## coxph(formula = Surv(Time, Status) ~ PhysEcog + strata(Sex),
## data = Lung)
##
##
## coef exp(coef) se(coef) z p
## PhysEcog 0.485 1.62 0.113 4.27 1.9e-05
##
## Likelihood ratio test=18.1 on 1 df, p=2.04e-05 n= 227, number of events= 164
## (1 observation deleted due to missingness)
PhysEcog maintains its significance, even in the presence of Sex.
Moreover, since the coefficient and p-value of PhysEcog change very little when controlling for Sex compared to when we don’t control for Sex, then we can conclude that PhysEcog and Sex are not very correlated.
Question 2.
(a)
\(h_1(t) = h_0(t)*exp(b)\)
Note: In general, the multiplicative constant would be exp(bx), but here, x=1.
(b)
In general, \(-log( S(t) ) = \int_0^t h(u) du\)
Or, \(S(t) = exp( -\int_0^t h(u) du )\) , that is, survival = exp( \(-\)cumulative hazard )
Thus, we have the following series of steps to perform in order to go from the equation in (a) to a relationship between \(S_1(t)\) and \(S_0(t)\):
\(h_1(t) = h_0(t)*exp(b)\) , if and only if
\(\int_0^t h_1(u) du = \int_0^t [ h_0(u)*exp(b) du ]\) , if and only if
\(-\int_0^t h_1(u) du = exp(b)*(-\int_0^t h_0(u) du)\) , if and only if
\(exp( -\int_0^t h_1(u) du ) = exp( exp(b)*(-\int_0^t h_0(u) du) )\)
Now, the left-hand side above = \(S_1(t)\)
And, the right-hand side above = \(S_0(t)^{exp(b)}\)
Thus, \(S_1(t) = S_0(t)^{exp(b)}\). That is, multiplying the hazard by exp(b) is equivalent to raising the survival curve to the power of exp(b).
(c) We use the Cox PH model to test whether the log of the hazard ratio is 0 or not. This is identical to saying that we check whether b=0 or not. Thus, if b=0, then \(S_1(t)\) = \(S_0(t)\), and this shows that checking whether two hazard functions are identical is equivalent to checking whether two survival curves are identical.