This example will illustrate how to fit parametric the Cox Proportional hazards model to a discrete-time (longitudinal) data set and examine various model diagnostics to evaluate the overall model fit. The data example uses data from the ECLS-K. Specifically, we will examine the transition into poverty between kindergarten and third grade.

options("scipen"=999, "digits"=4)
#Load required libraries
library(foreign)
library(survival)
library(car)
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
library(muhaz)

Using Longitudinal Data

As in the other examples, I illustrate fitting these models to data that are longitudinal, instead of person-duration. In this example, we will examine how to fit the Cox model to a longitudinally collected data set.

First we load our data

load("~/Google Drive/dem7223/class2016/data/eclsk.Rdata")
names(eclsk)<-tolower(names(eclsk))
#get out only the variables I'm going to use for this example
myvars<-c( "childid","gender", "race", "r1_kage","r3age", "r4age", "r5age", "r6age", "r7age","c1r4mtsc", "c4r4mtsc", "c5r4mtsc", "c6r4mtsc", "c7r4mtsc","wkpov_r", "w1povrty","w3povrty", "w5povrty", "w8povrty","wkmomed","w1momed", "w3momed", "w5momed", "w8momed", "s2_id", "c1_7fp0", "c17fpstr", "c17fppsu")
eclsk<-eclsk[,myvars]


eclsk$age1<-ifelse(eclsk$r1_kage==-9, NA, eclsk$r1_kage/12)
eclsk$age2<-ifelse(eclsk$r3age==-9, NA, eclsk$r3age/12)
#for the later waves, the NCES group the ages into ranges of months, so 1= <105 months, 2=105 to 108 months. So, I fix the age at the midpoint of the interval they give, and make it into years by dividing by 12
eclsk$age3<-recode(eclsk$r5age,recodes="1=105; 2=107; 3=109; 4=112; 5=115; 6=117; -9=NA")/12
eclsk$age4<-recode(eclsk$r7age,recodes="1=155; 2=166; 3=172; 4=178; 5=192; -9=NA")/12


eclsk$pov1<-ifelse(eclsk$wkpov_r==1,1,0)
eclsk$pov2<-ifelse(eclsk$w3povrty==1,1,0)
eclsk$pov3<-ifelse(eclsk$w5povrty==1,1,0)
eclsk$pov4<-ifelse(eclsk$w8povrty==1,1,0)

#mother's education-time varying > hs ==1
eclsk$momedu1<-recode(eclsk$wkmomed, recodes = "1:2='primarysch';3='highschool';4:5='jcollege'; 6:9='college'; else =NA", as.factor.result = T); eclsk$momedu1<-relevel(eclsk$momedu1, ref = "highschool") 
eclsk$momedu2<-recode(eclsk$w3momed, recodes = "1:2='primarysch';3='highschool';4:5='jcollege'; 6:9='college'; else =NA", as.factor.result = T); eclsk$momedu2<-relevel(eclsk$momedu2, ref = "highschool")
eclsk$momedu3<-recode(eclsk$w5momed, recodes = "1:2='primarysch';3='highschool';4:5='jcollege'; 6:9='college'; else =NA", as.factor.result = T); eclsk$momedu3<-relevel(eclsk$momedu3, ref = "highschool") 


#Recode race with white, non Hispanic as reference using dummy vars
eclsk$hisp<-recode (eclsk$race, recodes="3:4=1;-9=NA; else=0")
eclsk$black<-recode (eclsk$race, recodes="2=1;-9=NA; else=0")
eclsk$asian<-recode (eclsk$race, recodes="5=1;-9=NA; else=0")
eclsk$nahn<-recode (eclsk$race, recodes="6:7=1;-9=NA; else=0")
eclsk$other<-recode (eclsk$race, recodes="8=1;-9=NA; else=0")
eclsk$race_gr<-recode(eclsk$race, recodes="3:4='hisp'; 2='nh black'; 5='nh asian'; 6:7='nahn'; 8='other'; 1='nh white'; else=NA", as.factor.result = T)
eclsk$race_gr<-relevel(eclsk$race_gr, ref = 'nh white')
eclsk$male<-recode(eclsk$gender, recodes="1=1; 2=0; -9=NA")

Now, I need to form the transition variable, this is my event variable, and in this case it will be 1 if a child enters poverty between the first wave of the data and the third grade wave, and 0 otherwise. NOTE I need to remove any children who are already in poverty age wave 1, because they are not at risk of experiencing this particular transition.

eclsk<-subset(eclsk, is.na(pov1)==F&is.na(pov2)==F&is.na(pov3)==F&is.na(age1)==F&is.na(age2)==F&is.na(age3)==F&pov1!=1&is.na(eclsk$c17fpstr)==F)

Now we do the entire data set. To analyze data longitudinally, we need to reshape the data from the current “wide” format (repeated measures in columns) to a “long” format (repeated observations in rows). The reshape() function allows us to do this easily. It allows us to specify our repeated measures, time varying covariates as well as time-constant covariates.

e.long<-reshape(eclsk, idvar="childid", varying=list(age=c("age1","age2", "age3"), age2=c("age2", "age3", "age4"), momedu=c("momedu1", "momedu2", "momedu3")), times=1:3, direction="long" , v.names = c("age_enter", "age_exit", "momedu"))
e.long<-e.long[order(e.long$childid, e.long$time),]
head(e.long)
##             childid gender race r1_kage r3age r4age r5age r6age r7age
## 0015006C.1 0015006C      1    1    76.8 87.83 95.53     6    NA     4
## 0015006C.2 0015006C      1    1    76.8 87.83 95.53     6    NA     4
## 0015006C.3 0015006C      1    1    76.8 87.83 95.53     6    NA     4
## 0015014C.1 0015014C      2    1    64.2 75.23 82.93     2     2     2
## 0015014C.2 0015014C      2    1    64.2 75.23 82.93     2     2     2
## 0015014C.3 0015014C      2    1    64.2 75.23 82.93     2     2     2
##            c1r4mtsc c4r4mtsc c5r4mtsc c6r4mtsc c7r4mtsc wkpov_r w1povrty
## 0015006C.1    56.19    51.94    52.68       NA    50.26       2        2
## 0015006C.2    56.19    51.94    52.68       NA    50.26       2        2
## 0015006C.3    56.19    51.94    52.68       NA    50.26       2        2
## 0015014C.1    57.74    55.18    51.98    59.29    60.00       2        2
## 0015014C.2    57.74    55.18    51.98    59.29    60.00       2        2
## 0015014C.3    57.74    55.18    51.98    59.29    60.00       2        2
##            w3povrty w5povrty w8povrty wkmomed w1momed w3momed w5momed
## 0015006C.1        2        2        2       5       5       6       8
## 0015006C.2        2        2        2       5       5       6       8
## 0015006C.3        2        2        2       5       5       6       8
## 0015014C.1        2        2        2       5       6       6       6
## 0015014C.2        2        2        2       5       6       6       6
## 0015014C.3        2        2        2       5       6       6       6
##            w8momed s2_id c1_7fp0 c17fpstr c17fppsu pov1 pov2 pov3 pov4
## 0015006C.1       8  0015   934.1       77        1    0    0    0    0
## 0015006C.2       8  0015   934.1       77        1    0    0    0    0
## 0015006C.3       8  0015   934.1       77        1    0    0    0    0
## 0015014C.1       6  0015   129.6       77        1    0    0    0    0
## 0015014C.2       6  0015   129.6       77        1    0    0    0    0
## 0015014C.3       6  0015   129.6       77        1    0    0    0    0
##            hisp black asian nahn other  race_gr male time age_enter
## 0015006C.1    0     0     0    0     0 nh white    1    1     6.400
## 0015006C.2    0     0     0    0     0 nh white    1    2     7.319
## 0015006C.3    0     0     0    0     0 nh white    1    3     9.750
## 0015014C.1    0     0     0    0     0 nh white    0    1     5.350
## 0015014C.2    0     0     0    0     0 nh white    0    2     6.269
## 0015014C.3    0     0     0    0     0 nh white    0    3     8.917
##            age_exit   momedu
## 0015006C.1    7.319 jcollege
## 0015006C.2    9.750  college
## 0015006C.3   14.833  college
## 0015014C.1    6.269 jcollege
## 0015014C.2    8.917  college
## 0015014C.3   13.833  college

Since I have 3 risk periods this time, the previous way of coding the transitions has become burdensome. So, instead I hard code it. To do this, I initialize the povtran variable as missing, then hard code the specific transitions I’m interested in. This lets me identify when values between waves stay the same, versus when they change. Also, since I’m initilizing the variable as missing, I dont’ have to recode the missing values as I’ve done in previous examples.

#find which kids failed in the first time period and remove them from the second risk period risk set
e.long$povtran<-NA

e.long$povtran[e.long$pov1==0&e.long$pov2==1&e.long$time==1]<-1
e.long$povtran[e.long$pov2==0&e.long$pov3==1&e.long$time==2]<-1
e.long$povtran[e.long$pov3==0&e.long$pov4==1&e.long$time==3]<-1
e.long$povtran[e.long$pov3==0&e.long$pov4==1&e.long$time==3]<-1

e.long$povtran[e.long$pov1==0&e.long$pov2==0&e.long$time==1]<-0
e.long$povtran[e.long$pov2==0&e.long$pov3==0&e.long$time==2]<-0
e.long$povtran[e.long$pov3==0&e.long$pov4==0&e.long$time==3]<-0

#find which kids failed in earlier time periods and remove them from the second & third period risk set
failed1<-which(is.na(e.long$povtran)==T)
e.long<-e.long[-failed1,]

#round age to the nearest year
e.long$age_enter_r<-round(e.long$age_enter, 0)
e.long$age_exit_r<-round(e.long$age_exit, 0)

head(e.long[, c("childid","time", "age_enter", "age_exit", "pov1", "pov2", "pov3", "time", "povtran")], n=10)
##             childid time age_enter age_exit pov1 pov2 pov3 time.1 povtran
## 0015006C.1 0015006C    1     6.400    7.319    0    0    0      1       0
## 0015006C.2 0015006C    2     7.319    9.750    0    0    0      2       0
## 0015006C.3 0015006C    3     9.750   14.833    0    0    0      3       0
## 0015014C.1 0015014C    1     5.350    6.269    0    0    0      1       0
## 0015014C.2 0015014C    2     6.269    8.917    0    0    0      2       0
## 0015014C.3 0015014C    3     8.917   13.833    0    0    0      3       0
## 0015019C.1 0015019C    1     5.686    6.606    0    0    0      1       0
## 0015019C.2 0015019C    2     6.606    9.333    0    0    0      2       0
## 0015019C.3 0015019C    3     9.333   14.333    0    0    0      3       0
## 0015020C.1 0015020C    1     6.583    7.802    0    0    0      1       0

Construct survey design and fit basic Cox model

Now we fit the Cox model using full survey design. In the ECLS-K, I use the longitudinal weight for waves 1-7, as well as the associated psu and strata id’s for the longitudinal data from these waves from the parents of the child, since no data from the child themselves are used in the outcome.

options(survey.lonely.psu = "adjust")
des2<-svydesign(ids = ~c17fppsu, strata = ~c17fpstr, weights=~c1_7fp0, data=e.long[complete.cases(e.long),], nest=T)

#Fit the model
fitl1<-svycoxph(Surv(time = time, event = povtran)~momedu+race_gr, design=des2)
summary(fitl1) 
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (160) clusters.
## svydesign(ids = ~c17fppsu, strata = ~c17fpstr, weights = ~c1_7fp0, 
##     data = e.long[complete.cases(e.long), ], nest = T)
## Call:
## svycoxph(formula = Surv(time = time, event = povtran) ~ momedu + 
##     race_gr, design = des2)
## 
##   n= 4509, number of events= 144 
## 
##                     coef exp(coef) se(coef)     z   Pr(>|z|)    
## momeducollege    -3.2099    0.0404   0.6253 -5.13 0.00000029 ***
## momedujcollege   -0.8280    0.4369   0.3153 -2.63     0.0086 ** 
## momeduprimarysch  0.5419    1.7192   0.3089  1.75     0.0794 .  
## race_grhisp       0.6453    1.9066   0.2471  2.61     0.0090 ** 
## race_grnahn       1.1693    3.2197   0.5782  2.02     0.0432 *  
## race_grnh asian  -0.9469    0.3879   0.8786 -1.08     0.2811    
## race_grnh black   0.7533    2.1240   0.4353  1.73     0.0836 .  
## race_grother      0.3652    1.4408   0.7386  0.49     0.6210    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                  exp(coef) exp(-coef) lower .95 upper .95
## momeducollege       0.0404     24.777    0.0118     0.137
## momedujcollege      0.4369      2.289    0.2355     0.810
## momeduprimarysch    1.7192      0.582    0.9385     3.150
## race_grhisp         1.9066      0.525    1.1746     3.095
## race_grnahn         3.2197      0.311    1.0366    10.000
## race_grnh asian     0.3879      2.578    0.0693     2.171
## race_grnh black     2.1240      0.471    0.9049     4.986
## race_grother        1.4408      0.694    0.3387     6.128
## 
## Concordance= 0.779  (se = 0.022 )
## Rsquare= NA   (max possible= NA )
## Likelihood ratio test= NA  on 8 df,   p=NA
## Wald test            = 71.4  on 8 df,   p=0.00000000000258
## Score (logrank) test = NA  on 8 df,   p=NA

Model Residuals

There are several types of residuals for the Cox model, and they are used for different purposes.

First, we will extract the Shoenfeld residuals, which are useful for examining non-proportional hazards with repect to time. This means that the covariate effect could exhibit time-dependency.

First we extract the residuals from the model, then we fit a linear model to the residual and the observed (uncensored) failure times, we DO NOT WANT TO SEE A SIGNIFICANT MODEL HERE!!!!!

schoenresid<-resid(fitl1, type="schoenfeld")

fit.sr<-lm(schoenresid~des2$variables$time[des2$variables$povtran==1])
summary(fit.sr)
## Response momeducollege :
## 
## Call:
## lm(formula = momeducollege ~ des2$variables$time[des2$variables$povtran == 
##     1])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.0504 -0.0396 -0.0348 -0.0273  0.9727 
## 
## Coefficients:
##                                                  Estimate Std. Error
## (Intercept)                                       -0.0119     0.0370
## des2$variables$time[des2$variables$povtran == 1]   0.0109     0.0188
##                                                  t value Pr(>|t|)
## (Intercept)                                        -0.32     0.75
## des2$variables$time[des2$variables$povtran == 1]    0.58     0.56
## 
## Residual standard error: 0.184 on 142 degrees of freedom
## Multiple R-squared:  0.00235,    Adjusted R-squared:  -0.00467 
## F-statistic: 0.335 on 1 and 142 DF,  p-value: 0.564
## 
## 
## Response momedujcollege :
## 
## Call:
## lm(formula = momedujcollege ~ des2$variables$time[des2$variables$povtran == 
##     1])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.518 -0.455 -0.412  0.526  0.588 
## 
## Coefficients:
##                                                  Estimate Std. Error
## (Intercept)                                        0.0863     0.1008
## des2$variables$time[des2$variables$povtran == 1]  -0.0213     0.0512
##                                                  t value Pr(>|t|)
## (Intercept)                                         0.86     0.39
## des2$variables$time[des2$variables$povtran == 1]   -0.42     0.68
## 
## Residual standard error: 0.501 on 142 degrees of freedom
## Multiple R-squared:  0.00122,    Adjusted R-squared:  -0.00582 
## F-statistic: 0.173 on 1 and 142 DF,  p-value: 0.678
## 
## 
## Response momeduprimarysch :
## 
## Call:
## lm(formula = momeduprimarysch ~ des2$variables$time[des2$variables$povtran == 
##     1])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.231 -0.205 -0.170 -0.135  0.873 
## 
## Coefficients:
##                                                  Estimate Std. Error
## (Intercept)                                        0.0857     0.0784
## des2$variables$time[des2$variables$povtran == 1]  -0.0352     0.0398
##                                                  t value Pr(>|t|)
## (Intercept)                                         1.09     0.28
## des2$variables$time[des2$variables$povtran == 1]   -0.88     0.38
## 
## Residual standard error: 0.39 on 142 degrees of freedom
## Multiple R-squared:  0.00548,    Adjusted R-squared:  -0.00153 
## F-statistic: 0.782 on 1 and 142 DF,  p-value: 0.378
## 
## 
## Response race_grhisp :
## 
## Call:
## lm(formula = race_grhisp ~ des2$variables$time[des2$variables$povtran == 
##     1])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.301 -0.293 -0.250  0.699  0.793 
## 
## Coefficients:
##                                                  Estimate Std. Error
## (Intercept)                                        0.0786     0.0890
## des2$variables$time[des2$variables$povtran == 1]  -0.0427     0.0452
##                                                  t value Pr(>|t|)
## (Intercept)                                         0.88     0.38
## des2$variables$time[des2$variables$povtran == 1]   -0.94     0.35
## 
## Residual standard error: 0.443 on 142 degrees of freedom
## Multiple R-squared:  0.00622,    Adjusted R-squared:  -0.000774 
## F-statistic: 0.889 on 1 and 142 DF,  p-value: 0.347
## 
## 
## Response race_grnahn :
## 
## Call:
## lm(formula = race_grnahn ~ des2$variables$time[des2$variables$povtran == 
##     1])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.0991 -0.0774 -0.0556 -0.0505  0.9495 
## 
## Coefficients:
##                                                  Estimate Std. Error
## (Intercept)                                      -0.00994    0.05137
## des2$variables$time[des2$variables$povtran == 1]  0.02176    0.02610
##                                                  t value Pr(>|t|)
## (Intercept)                                        -0.19     0.85
## des2$variables$time[des2$variables$povtran == 1]    0.83     0.41
## 
## Residual standard error: 0.255 on 142 degrees of freedom
## Multiple R-squared:  0.00487,    Adjusted R-squared:  -0.00214 
## F-statistic: 0.695 on 1 and 142 DF,  p-value: 0.406
## 
## 
## Response race_grnh asian :
## 
## Call:
## lm(formula = `race_grnh asian` ~ des2$variables$time[des2$variables$povtran == 
##     1])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.0653 -0.0638 -0.0366 -0.0079  0.9928 
## 
## Coefficients:
##                                                  Estimate Std. Error
## (Intercept)                                        0.0756     0.0402
## des2$variables$time[des2$variables$povtran == 1]  -0.0287     0.0204
##                                                  t value Pr(>|t|)  
## (Intercept)                                         1.88    0.062 .
## des2$variables$time[des2$variables$povtran == 1]   -1.41    0.162  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2 on 142 degrees of freedom
## Multiple R-squared:  0.0137, Adjusted R-squared:  0.00679 
## F-statistic: 1.98 on 1 and 142 DF,  p-value: 0.162
## 
## 
## Response race_grnh black :
## 
## Call:
## lm(formula = `race_grnh black` ~ des2$variables$time[des2$variables$povtran == 
##     1])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1569 -0.1501 -0.1229 -0.0822  0.9178 
## 
## Coefficients:
##                                                  Estimate Std. Error
## (Intercept)                                       -0.0299     0.0667
## des2$variables$time[des2$variables$povtran == 1]  -0.0340     0.0339
##                                                  t value Pr(>|t|)
## (Intercept)                                        -0.45     0.65
## des2$variables$time[des2$variables$povtran == 1]   -1.00     0.32
## 
## Residual standard error: 0.331 on 142 degrees of freedom
## Multiple R-squared:  0.00703,    Adjusted R-squared:  3.26e-05 
## F-statistic:    1 on 1 and 142 DF,  p-value: 0.318
## 
## 
## Response race_grother :
## 
## Call:
## lm(formula = race_grother ~ des2$variables$time[des2$variables$povtran == 
##     1])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.0199 -0.0180 -0.0138 -0.0077  0.9887 
## 
## Coefficients:
##                                                   Estimate Std. Error
## (Intercept)                                      -0.000202   0.023719
## des2$variables$time[des2$variables$povtran == 1] -0.006083   0.012049
##                                                  t value Pr(>|t|)
## (Intercept)                                        -0.01     0.99
## des2$variables$time[des2$variables$povtran == 1]   -0.50     0.61
## 
## Residual standard error: 0.118 on 142 degrees of freedom
## Multiple R-squared:  0.00179,    Adjusted R-squared:  -0.00524 
## F-statistic: 0.255 on 1 and 142 DF,  p-value: 0.614

And we jump up and down! None of our predictors appear to be associated with the residuals, which suggests the effects are constant over time

Not so soon

We can also get a formal test using weighted residuals in a nice pre-rolled form with a plot, a la Grambsch and Therneau (1994) :

fit.test<-cox.zph(fitl1)
fit.test
##                      rho  chisq     p
## momeducollege     0.0596 1.2413 0.265
## momedujcollege   -0.0148 0.1118 0.738
## momeduprimarysch -0.1196 3.8768 0.049
## race_grhisp       0.0437 0.5093 0.475
## race_grnahn      -0.0321 0.4636 0.496
## race_grnh asian   0.0432 1.4059 0.236
## race_grnh black  -0.0706 2.0386 0.153
## race_grother      0.0046 0.0038 0.951
## GLOBAL                NA 8.8204 0.358
par(mfrow=c(3,3))
plot(fit.test, df=2)
par(mfrow=c(1,1))

Here, we see that the primary school education category has a significant test in the formal testing procedure, which suggests a non-proportionality according to education.

Next we examine Martingale residuals. Martingale residuals are also useful for assessing the functional form of a covariate. A plot of the martingale residuals against the values of the covariate will indicate if the covariate is being modeled correctly, i.e. linearly in the Cox model. If a line fit to these residuals is a straight line, then the covariate has been modeled effectively, if it is curvilinear, you may need to enter the covariate as a quadratic, although this is not commonly a problem for dummy variables.

#extract Martingale residuals
res.mar<-resid(fitl1, type="martingale")

#plot vs maternal education
scatter.smooth(des2$variables$momedu, res.mar,degree = 2,span = 1, ylab="Martingale Residual",col=1,  cex=.25, lpars=list(col = "red", lwd = 3))
title(main="Martingale residuals for Mother' < High School's Education")