PART ONE

1 Theory of survival analysis A clinical trial is intended to test a new treatment for malignant melanoma vs. current standard of care. The main outcome is disease-free survival. Each patient has a time 𝑡 in days after start of therapy that either represents the time of recurrence or death (if status = 1) or the end of the study (if status = 0). Each patient has two covariates we can use: Tx = 1 if they are on the new therapy or Tx = 0 if they are on standard of care. There are four subtypes of melanoma, which we will characterize as Type = A, B, C, or D. The effect of the new therapy may differ among the four subtypes.

1.1 Kaplan-Meier product limit estimator We can estimate the survival curve for patients on the new and standard therapy including all four subtypes using the Kaplan-Meier product limit estimator. Suppose at a given time t (in days after starting therapy), and for a particular subset of the patients, that there are 194 patients whose survival or censoring time is at least t. Suppose that 3 patients died or relapsed on day t and 2 patients had a censoring time of t. By what fraction does the estimated survival curve drop at time t? How many patients are in the risk set just before t and just after t?

The estimated survival curve drop at time t by the value of the point hazard at time t the point hazard can be estimated by taking the number of individual who experienced the event at time t divided by the population at risk at time t (d).

There are 194 individuals that lived at least to time t before experiencing the event or leaving the study. Therefore they are the population at risk just before time t. After time t, 3 patients experienced the event, and 2 patients left the study. Thus there are 189 individuals at risk just after time t.

1.2 Cox proportional hazards model Suppose that the reference levels of the covariates are Tx = 0 and Type = A. Write down the Cox model for predicting survival from Tx, Type, and the Tx by Type interaction including a definition of the coeﬀicients and their relationship to survival probabilities.

\[h(t|Covariates) = h_{0}*e^{\beta_{therapy}x_{1}+\beta_{typeA}x_2+\beta_{therapy:typeA}(x_1*x_3)} \] \(\beta_{therapy}\) = the average change in the log of the hazard ratio of having the new therapy type (x1=1) over the old therapy

\(\beta_{typeA}\) = the average change in the log of the hazard ratio of having type A melanoma over having a different type

\(\beta_{therapy:typeA}\) = the average change in the log of the hazard ratio caused by the interaction of type (A vs not A) and the new therapy

State the important assumptions. We assume that there is a baseline hazard in the population that depends on time (h0(t)) and is scaled by a function of the covariates. Thus, we assume that the hazard between the populations of interest in the study (type A vs not type A), (new therapy vs standard of care) is proportional between the groups scaled by a function of the covariates.

1.3 Partial likelihood If there are 20 patients at risk at a given time and 1 of them fails, write down the contribution (factor) to the partial likelihood from that failure time in terms of the model specification. Does it depend on the base hazard? The generic likelihood for a Cox proportional hazard model may be written as such

\[L(\beta) = \Pi_{i=1}^{f}\frac{h_{0}(t_{i})*e^{\beta^{T}X_{j}(i)}}{\Sigma_{j \in R_{i}}h_{0}(t_{i})*e^{\beta^{T}X_{j}(i)}} \] Where f is the set of chronologically ordered event times; and Ri identifies the set of subjects at risk at time t

We can see the base hazard cancels out from the numerator and denominator such that \[L(\beta) = \Pi_{i=1}^{f}\frac{e^{\beta^{T}X_{j}(i)}}{\Sigma_{j \in R_{i}}e^{\beta^{T}X_{j}(i)}} \] Then if we are just interested in a single event fi we can simplify to at time ti

\[L(\beta) = \frac{e^{\beta^{T}X_{j}(i)}}{\Sigma_{j \in R_{i}}e^{\beta^{T}X_{j}(i)}} \]

Then if we plug in our particular covariates

\[L(\beta) = \frac{e^{(\beta_{therapy}x_{1}+\beta_{typeA}x_2+\beta_{therapy:typeA}(x_1*x_3))_{i}}}{\Sigma_{j \in R_{i}}e^{(\beta_{therapy}x_{1}+\beta_{typeA}x_2+\beta_{therapy:typeA}(x_1*x_3))_{j}}} \]

1.4 Censoring and sample size Which will generate more accurate coeﬀicient estimates, a study with 1000 patients of whom 900 survive to the end of the study without recurrence, or a study with 500 patients 300 of whom survive? Why?

A study with 500 patients 300 of whom survive. This is because this study has more events occur over the study period. From the partial likelihood equation we see that in order to get a likelihood of failure at any particular time i where an event occurred at time i we need an event to have occured at time i (otherwise the numerator would be 0 and the likelihood would be 0). If we have more events occuring at various times i; we get more data point which leads to more precise estimates of the parameter or parameters. That subseuqently our confidence intervals for these parameter (in our case coefficients) should be narrower.

1.5 Hypothesis testing Describe the most appropriate hypothesis test for whether the interaction term is required in the model. How the test statistic be calculated? To what specific statistical distribution would the test statistic be compared?

You could perform a nested model test wherein you fit one model without the interaction term, and another model with the interaction term. Then you would compare the two models using a likelihood ratio test, wherein 2*(log likelihood of the larger model/log likelihood of the smaller model) should be approximately chi-square distributed with degrees of freedom equal to the difference in the number of parameters between the models which here would be 1.

1.6 Contrasts In terms of the model coeﬀicients listed previously, what would be the estimated log hazard ratio of a patient with Type B melanoma on Tx = 1 to a patient with Type C melanoma on Tx = 0? What would be the estimated hazard ratio? Since we coded ther type variable to be a difference between type A and non-type A we would be comparing two non-type A individuals here so the ratio of their log hazards would just be the ratio of the log hazards of their differing therapy statuses which on a log scale is a difference \[\beta_{\text{therapy}}*1 - \beta_{\text{therapy}}*(0)\] Subsequently this would reduce to \[\beta_{\text{therapy}}\]

1.7 Proportional hazards assumption How would you examine the proportionality assumption, graphically and/or with a statistical test? You could examine the proportional assumption graphically by two methods. First you could plot the hazard ratios for each group over time (if the hazard are proportional then the ratio of the hazards should be constant over time) so you see the hazard ratios over time being roughly straight lines being offset by some vertical component. Second you could plot the complementary log-log survival curves. On the x-axis would be time and on the y-axis we take a log of the cumulative hazard. The curves for eacg group should look similar just proportional scaled up or down by a function of the covaraites of the group.

1.8 Accomodating non-proportional hazards If it appears that the different subtype have non-proportional hazards, how would you change the model so that this could be accommodated? You could choose to exclude the subtypes which have non-proportional hazards and model those using a Kaplan-Meier (KM) survival curve. Alternatively you could model all the subtypes by a KM survival curve.

PART TWO

library(haven)
addicts =
"http://web1.sph.emory.edu/dkleinb/allDatasets/surv2datasets/addicts.dta" |>
haven::read_dta()
head(addicts)

## # A tibble: 6 × 6
##      id clinic status survt prison  dose
##   <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>
## 1     1      1      1   428      0    50
## 2     2      1      1   275      1    55
## 3     3      1      1   262      0    55
## 4     4      1      1   183      0    30
## 5     5      1      1   259      1    65
## 6     6      1      1   714      0    55

The variables are as follows: • id: Subject ID • clinic: Clinic (1 or 2) • status: Survival status (0 = censored, 1 = departed from clinic) • time: Survival time in days • prison: Prison record (0 = none, 1 = any) • methadone: Methadone dose (mg/day

##2.1 Kaplan-Meier curves Plot the Kaplan-Meier survival and cumulative hazard curves for the two clinics.

addicts.surv = Surv(time = addicts[["survt"]],
                 event = addicts[["status"]],
                 type = "right")
addicts.surv |> head()

## [1] 428 275 262 183 259 714

addicts.survfit = survfit(addicts.surv ~ 1)

options(max.print = 50)
addicts.surv |>
  summary()

##       time            status      
##  Min.   :   2.0   Min.   :0.0000  
##  1st Qu.: 171.2   1st Qu.:0.0000  
##  Median : 367.5   Median :1.0000  
##  Mean   : 402.6   Mean   :0.6303  
##  3rd Qu.: 585.5   3rd Qu.:1.0000  
##  Max.   :1076.0   Max.   :1.0000

ggsurvplot(addicts.survfit, data = addicts)

addicts.survfit.clinic = survfit(addicts.surv ~ addicts[["clinic"]])

ggsurvplot(addicts.survfit.clinic, 
           data = addicts,
           legend.title = "Clinic",
           legend.labs = c("Clinic 1", "Clinic 2"),
           conf.int = TRUE)

##2.2 Test for differences Test for whether the two survival curves could have come from the same process using survdiff.

addicts.clinic.survdiff = survdiff(addicts.surv~clinic, data = addicts)

addicts.clinic.survdiff  |> pander()

Call: addicts.surv ~ clinic Chisq = 27.892679 on 1 degrees of freedom, p = 0.000000 Here the p-value for the log-rank test is <0.000000 . Thus we reject the null hypothesis that survival probabilities are equal at all time points. Therefore we accept that the survivial probabilities are different at at least 1 time point.
	N	Observed	Expected	(O-E)^2/E	(O-E)^2/V
clinic=1	163	122	90.91	10.63	27.89
clinic=2	75	28	59.09	16.36	27.89

##2.3 Nelson-Aalen curves Plot the cumulative hazard and survival curves from the Nelson-Aalen estimator.

plot(addicts.survfit.clinic,
     col = c(2,4),
fun = "cumhaz",
xlab = "Time (Days)",
ylab = "Cumulative Hazard",
main = "Cumulative Hazard for Clinic Departure",)
legend("topleft", c("Clinic 1", "Clinic 2"), col = c(2,4), lwd = 2)

##2.4 Check the proportional hazards assumption A common comparison plot for proportional hazards is the complimentary log-log survival plot which plots ln(− ln[S(x)]) = ln(H(t)) against ln(t). If the hazards are proportional, then so are the cumulative hazards, and after taking logs, the curves should be parallel. If the lines are straight, then the Weibull model may be appropriate. Make this plot for the two clinics using the Nelson-Aalen estimator and comment on the results.

addicts.survfit.clinic

## Call: survfit(formula = addicts.surv ~ addicts[["clinic"]])
## 
##                         n events median 0.95LCL 0.95UCL
## addicts[["clinic"]]=1 163    122    428     348     514
## addicts[["clinic"]]=2  75     28     NA     661      NA

plot(survfit(addicts.surv~ clinic, data = addicts), 
     col = 1:2, 
     fun = "cloglog")
legend("topleft", 
       c("Clinic 1", "Clinic 2"), 
       col = 1:2, 
       lwd = 2)

There is some crossing over of the ln(H(t)) between the two clinics at early time points, it also seems like there is non-proportionality at the later time points as clinic 1’s hazard gets a lot larger faster than clinic 2.

##2.5 Cox proportional hazards model Construct a Cox model using only the clinic variable. Is there suﬀicient evidence to detect a difference in hazard rates at the two clinics? What is the estimated hazard (risk) ratio, a test for significance, and a 95% confidence inter- val?

model.addicts = coxph(addicts.surv~clinic, data = addicts)
summary(model.addicts)

## Call:
## coxph(formula = addicts.surv ~ clinic, data = addicts)
## 
##   n= 238, number of events= 150 
## 
##           coef exp(coef) se(coef)      z Pr(>|z|)    
## clinic -1.0754    0.3412   0.2127 -5.057 4.26e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##        exp(coef) exp(-coef) lower .95 upper .95
## clinic    0.3412      2.931    0.2249    0.5176
## 
## Concordance= 0.574  (se = 0.022 )
## Likelihood ratio test= 30.99  on 1 df,   p=3e-08
## Wald test            = 25.57  on 1 df,   p=4e-07
## Score (logrank) test = 27.92  on 1 df,   p=1e-07

#Cox Zph test
cox.zph(model.addicts) # this seems to suggest we can't split a cox proportional hazard; that the hazard interacting with clinic may vary over time.

##        chisq df      p
## clinic  10.5  1 0.0012
## GLOBAL  10.5  1 0.0012

Based on a cox-proportional hazard there appear to be a significant difference in hazard ratio based on the clinic. Clinic 1 compared to clinic 2 has a hazard ratio of 2.931; suggesting that the hazard is 2.931 times higher in those that are in clinic 1. The test we used to assess the significance of the hazard ratio parameter is the Wald test which gives us a p-value of 4.26*10^-07 significantly below a typical alpha of .05. The 95% confidence interval for this log of the hazard difference is 0.2249 to 0.5176.

##2.6 Comparisons with null model Pick one test for the null hypothesis that the clinics do not differ. Why would you depend on this test more than the others?

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

model.addicts = coxph(addicts.surv~clinic, data = addicts)
model.addicts.null = coxph(addicts.surv~ -1, data = addicts)
anova(model.addicts, model.addicts.null)

## Analysis of Deviance Table
##  Cox model: response is  addicts.surv
##  Model 1: ~ clinic
##  Model 2: ~ -1
##    loglik Chisq Df Pr(>|Chi|)    
## 1 -690.04                        
## 2 -705.54 30.99  1  2.594e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(model.addicts)

## Call:
## coxph(formula = addicts.surv ~ clinic, data = addicts)
## 
##   n= 238, number of events= 150 
## 
##           coef exp(coef) se(coef)      z Pr(>|z|)    
## clinic -1.0754    0.3412   0.2127 -5.057 4.26e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##        exp(coef) exp(-coef) lower .95 upper .95
## clinic    0.3412      2.931    0.2249    0.5176
## 
## Concordance= 0.574  (se = 0.022 )
## Likelihood ratio test= 30.99  on 1 df,   p=3e-08
## Wald test            = 25.57  on 1 df,   p=4e-07
## Score (logrank) test = 27.92  on 1 df,   p=1e-07

lrtest(model.addicts, model.addicts.null)

## Likelihood ratio test
## 
## Model 1: addicts.surv ~ clinic
## Model 2: addicts.surv ~ -1
##   #Df  LogLik Df Chisq Pr(>Chisq)    
## 1   1 -690.04                        
## 2   0 -705.54 -1 30.99  2.594e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the likelihood ratio test we can see that compared to the null model the model considering clinic; does significantly improve the likelihood. We can see this given that the p-value of the likelihood ratio test is 2.594e-08 therefore we reject the null hypothesis that the two models are equivalently likely.

##2.7 Adding more variables Consider adding the prison and methodone variables. Which of these covariates significantly improve the model?

addicts %>% group_by(clinic) %>% summarize(mean =  mean(dose))

## # A tibble: 2 × 2
##   clinic  mean
##    <dbl> <dbl>
## 1      1  59.0
## 2      2  63.5

addict_fix = addicts |>
mutate(
# We add factor labels to the categorical variables:
clinic = clinic |>
  case_match(
    1 ~ "Clinic_1",
    2 ~ "Clinic_2") |>
  factor() |> 
  relevel(ref = "Clinic_1"),
prison = prison |>
case_match(
0 ~ "No_Prison_History",
1 ~ "Prison_History") ,
)

addict_fix$mean_dose <- rep(0, times=nrow(addicts))
for (i in 1:nrow(addicts)){
  addict_fix$mean_dose[i] <- ifelse(addicts$clinic[i] == 1, 58.95706, 63.53333)
}

model.addicts_2 = coxph(addicts.surv~clinic + prison + dose, data = addict_fix)
summary(model.addicts_2)

## Call:
## coxph(formula = addicts.surv ~ clinic + prison + dose, data = addict_fix)
## 
##   n= 238, number of events= 150 
## 
##                           coef exp(coef)  se(coef)      z Pr(>|z|)    
## clinicClinic_2       -1.009896  0.364257  0.214889 -4.700 2.61e-06 ***
## prisonPrison_History  0.326555  1.386184  0.167225  1.953   0.0508 .  
## dose                 -0.035369  0.965249  0.006379 -5.545 2.94e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                      exp(coef) exp(-coef) lower .95 upper .95
## clinicClinic_2          0.3643     2.7453    0.2391    0.5550
## prisonPrison_History    1.3862     0.7214    0.9988    1.9238
## dose                    0.9652     1.0360    0.9533    0.9774
## 
## Concordance= 0.665  (se = 0.025 )
## Likelihood ratio test= 64.56  on 3 df,   p=6e-14
## Wald test            = 54.12  on 3 df,   p=1e-11
## Score (logrank) test = 56.32  on 3 df,   p=4e-12

lrtest(model.addicts, model.addicts_2)

## Likelihood ratio test
## 
## Model 1: addicts.surv ~ clinic
## Model 2: addicts.surv ~ clinic + prison + dose
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   1 -690.04                         
## 2   3 -673.26  2 33.571  5.131e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding prison history and dosage into the model we see that the likelihood of the model is improved over the null model; based on the likelihood ratio test given by the summary() function. Performing a likelihood ratio test between the model considering just clinic as a covariate and the model considering clinic, prison history, and dosage we see that the additional covariates do significantly improve the model as the p-value for this likelihood ratio test is <.05. Therefore, we may reject the null hypothesis that the two models have equivalent likelihood and accept that the additional covariates signifcantly improve the model.

2.8 Comparison with non-parametric model Plot the survival curves from your chosen Cox model and add the corresponding KM survival curves. Does it seem like you have found a good model for the data?

** May need to facet wrap your continuous variable in fitting a KM curve (in ggplot2 +facet_wrap(~variable) or take the mean(dose))**

# Create KM Curves
addict_fix = addicts |>
mutate(
# We add factor labels to the categorical variables:
clinic = clinic |>
  case_match(
    1 ~ "Clinic_1",
    2 ~ "Clinic_2") |>
  factor() |> 
  relevel(ref = "Clinic_1"),
prison = prison |>
case_match(
0 ~ "No_Prison_History",
1 ~ "Prison_History") ,
)



addicts.surv = Surv(time = addict_fix[["survt"]],
                 event = addict_fix[["status"]],
                 type = "right")

km_model = survfit(formula = Surv(survt, status) ~ clinic + prison, data = addict_fix)
library(ggfortify)
km_model |>
autoplot(conf.int = FALSE) +
theme_bw() + 
theme(
legend.position="bottom",
legend.title = element_blank(),
legend.text = element_text(size=16),
legend.box="vertical",
legend.margin=margin(),
) +
guides(col=guide_legend(ncol=2))

plot(survfit(addicts.surv ~ clinic + prison, data = addict_fix))

#lines(survfit(model.addicts_2,
#              data.frame(clinic=factor(c("Clinic_1", "Clinic_2")), prison = c( "No_Prison_History", "Prison_History"),conf.int=F),col="red"),
      
#    legend("bottomleft",c("Kaplan-Meier","Cox Model"),col=c("black","red"),lwd=1))

I could not figure out how to get the cox survival curve to plot correctly on the same plot as the KM curves.

#PART THREE Application, part 2 (extra credit)

##3.1 Diagnostics Fit the model for the addicts data using clinic, prison, and methadone. Then perform all the model checking procedures.

model.addicts_2 = coxph(addicts.surv~clinic + prison + dose, data = addict_fix)
model.addicts_2

## Call:
## coxph(formula = addicts.surv ~ clinic + prison + dose, data = addict_fix)
## 
##                           coef exp(coef)  se(coef)      z        p
## clinicClinic_2       -1.009896  0.364257  0.214889 -4.700 2.61e-06
## prisonPrison_History  0.326555  1.386184  0.167225  1.953   0.0508
## dose                 -0.035369  0.965249  0.006379 -5.545 2.94e-08
## 
## Likelihood ratio test=64.56  on 3 df, p=6.229e-14
## n= 238, number of events= 150

addict_zph <- cox.zph(model.addicts_2)
ggcoxzph(addict_zph, var="clinic")

ggcoxzph(addict_zph, var="prison")

#ggcoxzph(addict_zph, var="mean_dose")

The Schoenfield residuals look fine for prison and clinic. They did not work for mean dosage but that would be because the value of mean_dose is exactly determined by clinic. Martingale residuals do not work for categorical predictors

##3.2 Stratified models There is a problem with proportionality of hazards for the clinics. Fit the stratified model using clinics to define the strata.

coxph.stratified = 
  coxph( 
    formula = Surv(time = survt, event = status) ~
prison+dose+strata(clinic), data = addict_fix)

summary(coxph.stratified)

## Call:
## coxph(formula = Surv(time = survt, event = status) ~ prison + 
##     dose + strata(clinic), data = addict_fix)
## 
##   n= 238, number of events= 150 
## 
##                           coef exp(coef)  se(coef)      z Pr(>|z|)    
## prisonPrison_History  0.389605  1.476397  0.168930  2.306   0.0211 *  
## dose                 -0.035115  0.965495  0.006465 -5.432 5.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                      exp(coef) exp(-coef) lower .95 upper .95
## prisonPrison_History    1.4764     0.6773    1.0603    2.0559
## dose                    0.9655     1.0357    0.9533    0.9778
## 
## Concordance= 0.651  (se = 0.026 )
## Likelihood ratio test= 33.91  on 2 df,   p=4e-08
## Wald test            = 32.66  on 2 df,   p=8e-08
## Score (logrank) test = 33.33  on 2 df,   p=6e-08

##3.3 Interactions with strata Is there an interaction with the strata? We can assess this by first fitting a models stratified by sex (as in the above); then fitting two models seperately on clinic 1 and clinic 2.

anderson.coxph.clinic1 =
coxph(
formula = Surv(time = survt, event = status) ~ prison+dose,
subset = clinic == "Clinic_1",
data = addict_fix)
summary(anderson.coxph.clinic1)

## Call:
## coxph(formula = Surv(time = survt, event = status) ~ prison + 
##     dose, data = addict_fix, subset = clinic == "Clinic_1")
## 
##   n= 163, number of events= 122 
## 
##                           coef exp(coef)  se(coef)      z Pr(>|z|)    
## prisonPrison_History  0.502846  1.653421  0.188706  2.665  0.00771 ** 
## dose                 -0.035799  0.964834  0.007738 -4.626 3.72e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                      exp(coef) exp(-coef) lower .95 upper .95
## prisonPrison_History    1.6534     0.6048    1.1422    2.3934
## dose                    0.9648     1.0364    0.9503    0.9796
## 
## Concordance= 0.647  (se = 0.028 )
## Likelihood ratio test= 26.05  on 2 df,   p=2e-06
## Wald test            = 25.11  on 2 df,   p=4e-06
## Score (logrank) test = 25.57  on 2 df,   p=3e-06

anderson.coxph.clinic2 =
coxph(
formula = Surv(time = survt, event = status) ~ prison+dose,
subset = clinic == "Clinic_2",
data = addict_fix)
summary(anderson.coxph.clinic2)

## Call:
## coxph(formula = Surv(time = survt, event = status) ~ prison + 
##     dose, data = addict_fix, subset = clinic == "Clinic_2")
## 
##   n= 75, number of events= 28 
## 
##                          coef exp(coef) se(coef)      z Pr(>|z|)   
## prisonPrison_History -0.08014   0.92298  0.38430 -0.209  0.83481   
## dose                 -0.03696   0.96371  0.01235 -2.994  0.00275 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                      exp(coef) exp(-coef) lower .95 upper .95
## prisonPrison_History    0.9230      1.083    0.4346    1.9603
## dose                    0.9637      1.038    0.9407    0.9873
## 
## Concordance= 0.663  (se = 0.057 )
## Likelihood ratio test= 9.72  on 2 df,   p=0.008
## Wald test            = 8.98  on 2 df,   p=0.01
## Score (logrank) test = 9.41  on 2 df,   p=0.009

There’s some difference in the significance of coefficients based on subset. Let’s explicitly code an interaction term to assess

Int_model <- coxph( 
    formula = Surv(time = survt, event = status) ~
strata(clinic)*(prison+dose), data = addict_fix)
lrtest(Int_model, coxph.stratified)

## Likelihood ratio test
## 
## Model 1: Surv(time = survt, event = status) ~ strata(clinic) * (prison + 
##     dose)
## Model 2: Surv(time = survt, event = status) ~ prison + dose + strata(clinic)
##   #Df  LogLik Df Chisq Pr(>Chisq)
## 1   4 -596.59                    
## 2   2 -597.52 -2 1.859     0.3948

The likelihood ratio test reveals that there is not a significant difference between the likelihood of the interaction model versus the stratified model. Therefore we should prefer the smaller stratified model which fits a smaller number of paramters to the larger interaction model.

##3.4 Transformations of continuous covariates One of the model checks of the original model is to look for possible transformations of the methadone variable. What is your conclusion? We can examine if just dose (rather than mean dose) produces poor matingale residuals

model.addicts_2 <- coxph(formula = addicts.surv ~ clinic + prison + dose, data = addict_fix)

addict_fix$mres =
model.addicts_2 |>
residuals(,type="martingale")

addict_fix |>
ggplot(aes(x = dose, y = mres)) +
geom_point() +
geom_smooth(method = "loess", aes(col = "loess")) +
geom_smooth(method = 'lm', aes(col = "lm")) +
theme_classic() +
xlab("Dosage") +
ylab("Martingale Residuals") +
guides(col=guide_legend(title = ""))

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

It seems like there’s some curvature in the martingale residuals near the lower and higher doses. We could attempt a transformation dichotomizing dosage into a high or low dosage. Where high >= 60, low < 60.

addict_fix =
addict_fix |>
mutate(
dose = cut(
dose,c(0, 60, 110),
labels=c("low","high")))
addict_fix

## # A tibble: 238 × 7
##       id clinic   status survt prison            dose     mres
##    <dbl> <fct>     <dbl> <dbl> <chr>             <fct>   <dbl>
##  1     1 Clinic_1      1   428 No_Prison_History low    0.0864
##  2     2 Clinic_1      1   275 Prison_History    low    0.301 
##  3     3 Clinic_1      1   262 No_Prison_History low    0.529 
##  4     4 Clinic_1      1   183 No_Prison_History low    0.294 
##  5     5 Clinic_1      1   259 Prison_History    high   0.555 
##  6     6 Clinic_1      1   714 No_Prison_History low   -0.697 
##  7     7 Clinic_1      1   438 Prison_History    high   0.228 
##  8     8 Clinic_1      0   796 Prison_History    low   -2.49  
##  9     9 Clinic_1      1   892 No_Prison_History low   -2.90  
## 10    10 Clinic_1      1   393 Prison_History    high   0.294 
## # ℹ 228 more rows

model.cox2_transformed =
coxph(
formula =
Surv(survt, event = status) ~
clinic+dose+prison,
data = addict_fix)
model.addicts_2 |> drop1(test="Chisq")

## Single term deletions
## 
## Model:
## addicts.surv ~ clinic + prison + dose
##        Df    AIC    LRT  Pr(>Chi)    
## <none>    1352.5                     
## clinic  1 1387.3 36.747 1.345e-09 ***
## prison  1 1366.3 15.778 7.124e-05 ***
## dose    1 1381.3 30.782 2.887e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model.cox2_transformed |> drop1(test="Chisq")

## Single term deletions
## 
## Model:
## Surv(survt, event = status) ~ clinic + dose + prison
##        Df    AIC     LRT  Pr(>Chi)    
## <none>    1365.7                      
## clinic  1 1387.3 23.5546 1.214e-06 ***
## dose    1 1381.3 17.5894 2.741e-05 ***
## prison  1 1366.3  2.5851    0.1079    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The non-transformed model had lower AIC, we don’t have strong evidence to reccomend the model using a dichotomized dosage.

##3.5 Stratification after variable transformations If you considered a reformulation of the methadone variable, does this affect the decision to stratify.

You should consider examining interaction effects and stratifying again if you significantly changed a variable (in this case turning a stratified variable into a binary categorical variable); in our case we didn’t end up keeping the transformed variable as that did not improve our model. We can however examine if there was a significant interaction effect on the transformed variable even if we did not keep the transformed variable.

##3.6 Interpretation Interpret your final model and repeat the model checks.

addict_fix$dose <- as.numeric(addicts$dose)
coxph(formula = Surv(time = survt, event = status) ~ prison + 
    dose + strata(clinic), data = addict_fix)

## Call:
## coxph(formula = Surv(time = survt, event = status) ~ prison + 
##     dose + strata(clinic), data = addict_fix)
## 
##                           coef exp(coef)  se(coef)      z        p
## prisonPrison_History  0.389605  1.476397  0.168930  2.306   0.0211
## dose                 -0.035115  0.965495  0.006465 -5.432 5.59e-08
## 
## Likelihood ratio test=33.91  on 2 df, p=4.322e-08
## n= 238, number of events= 150

coxph.stratified

## Call:
## coxph(formula = Surv(time = survt, event = status) ~ prison + 
##     dose + strata(clinic), data = addict_fix)
## 
##                           coef exp(coef)  se(coef)      z        p
## prisonPrison_History  0.389605  1.476397  0.168930  2.306   0.0211
## dose                 -0.035115  0.965495  0.006465 -5.432 5.59e-08
## 
## Likelihood ratio test=33.91  on 2 df, p=4.322e-08
## n= 238, number of events= 150

cox.zph(coxph.stratified)

##         chisq df    p
## prison 0.0264  1 0.87
## dose   1.0552  1 0.30
## GLOBAL 1.1302  2 0.57

# We see that we meet the cox proportional hazard assumption
addict_zph <- cox.zph(coxph.stratified)
ggcoxzph(addict_zph, var="prison")

#Schoenfeld residuals look okay for prison

#Mart

addict_fix$mres =
coxph.stratified |>
residuals(,type="martingale")

addict_fix

## # A tibble: 238 × 7
##       id clinic   status survt prison             dose   mres
##    <dbl> <fct>     <dbl> <dbl> <chr>             <dbl>  <dbl>
##  1     1 Clinic_1      1   428 No_Prison_History    50  0.187
##  2     2 Clinic_1      1   275 Prison_History       55  0.351
##  3     3 Clinic_1      1   262 No_Prison_History    55  0.577
##  4     4 Clinic_1      1   183 No_Prison_History    30  0.387
##  5     5 Clinic_1      1   259 Prison_History       65  0.577
##  6     6 Clinic_1      1   714 No_Prison_History    55 -0.703
##  7     7 Clinic_1      1   438 Prison_History       65  0.256
##  8     8 Clinic_1      0   796 Prison_History       60 -2.83 
##  9     9 Clinic_1      1   892 No_Prison_History    50 -3.38 
## 10    10 Clinic_1      1   393 Prison_History       65  0.340
## # ℹ 228 more rows

addict_fix |>
ggplot(aes(x = dose, y = mres)) +
geom_point() +
geom_smooth(method = "loess", aes(col = "loess")) +
geom_smooth(method = 'lm', aes(col = "lm")) +
theme_classic() +
xlab("Dosage") +
ylab("Martingale Residuals") +
guides(col=guide_legend(title = ""))

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

addicts

## # A tibble: 238 × 6
##       id clinic status survt prison  dose
##    <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>
##  1     1      1      1   428      0    50
##  2     2      1      1   275      1    55
##  3     3      1      1   262      0    55
##  4     4      1      1   183      0    30
##  5     5      1      1   259      1    65
##  6     6      1      1   714      0    55
##  7     7      1      1   438      1    65
##  8     8      1      0   796      1    60
##  9     9      1      1   892      0    50
## 10    10      1      1   393      1    65
## # ℹ 228 more rows

These diagnostic plots look okay. The

HW6

Zachary Tobar

2023-06-01

PART ONE

PART TWO