This example illustrates how to examine the fit of the Cox proportional hazards model to a discrete-time (longitudinal) data set and how to use various diagnostics to evaluate the overall model fit. The example uses data from the ECLS-K; specifically, we examine the transition into poverty between kindergarten and third grade.
options("scipen"=999, "digits"=4)
#Load required libraries
library(foreign)
library(survival)
library(car)
library(survey)
## Loading required package: grid
## Loading required package: Matrix
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
library(muhaz)
As in the other examples, I illustrate fitting these models to data that are longitudinal rather than person-duration; in this example, we examine how to fit the Cox model to a longitudinally collected data set.
First, we load our data.
load("~/Google Drive/classes/dem7223/class17//data/eclsk.Rdata")
names(eclsk)<-tolower(names(eclsk))
#get out only the variables I'm going to use for this example
myvars<-c( "childid","gender", "race", "r1_kage","r3age", "r4age", "r5age", "r6age", "r7age","c1r4mtsc", "c4r4mtsc", "c5r4mtsc", "c6r4mtsc", "c7r4mtsc","wkpov_r", "w1povrty","w3povrty", "w5povrty", "w8povrty","wkmomed","w1momed", "w3momed", "w5momed", "w8momed", "s2_id", "c1_7fp0", "c17fpstr", "c17fppsu")
eclsk<-eclsk[,myvars]
eclsk$age1<-ifelse(eclsk$r1_kage==-9, NA, eclsk$r1_kage/12)
eclsk$age2<-ifelse(eclsk$r3age==-9, NA, eclsk$r3age/12)
#for the later waves, NCES groups the ages into ranges of months (e.g., 1 = <105 months, 2 = 105 to 108 months), so I set the age at the midpoint of the interval they give and convert it to years by dividing by 12
eclsk$age3<-recode(eclsk$r5age,recodes="1=105; 2=107; 3=109; 4=112; 5=115; 6=117; -9=NA")/12
eclsk$age4<-recode(eclsk$r7age,recodes="1=155; 2=166; 3=172; 4=178; 5=192; -9=NA")/12
eclsk$pov1<-ifelse(eclsk$wkpov_r==1,1,0)
eclsk$pov2<-ifelse(eclsk$w3povrty==1,1,0)
eclsk$pov3<-ifelse(eclsk$w5povrty==1,1,0)
eclsk$pov4<-ifelse(eclsk$w8povrty==1,1,0)
#mother's education - time-varying, recoded to a four-level factor with high school as the reference
eclsk$momedu1<-recode(eclsk$wkmomed, recodes = "1:2='primarysch';3='highschool';4:5='jcollege'; 6:9='college'; else =NA", as.factor.result = T); eclsk$momedu1<-relevel(eclsk$momedu1, ref = "highschool")
eclsk$momedu2<-recode(eclsk$w3momed, recodes = "1:2='primarysch';3='highschool';4:5='jcollege'; 6:9='college'; else =NA", as.factor.result = T); eclsk$momedu2<-relevel(eclsk$momedu2, ref = "highschool")
eclsk$momedu3<-recode(eclsk$w5momed, recodes = "1:2='primarysch';3='highschool';4:5='jcollege'; 6:9='college'; else =NA", as.factor.result = T); eclsk$momedu3<-relevel(eclsk$momedu3, ref = "highschool")
#Recode race with non-Hispanic white as the reference, using dummy variables and a grouped factor
eclsk$hisp<-recode (eclsk$race, recodes="3:4=1;-9=NA; else=0")
eclsk$black<-recode (eclsk$race, recodes="2=1;-9=NA; else=0")
eclsk$asian<-recode (eclsk$race, recodes="5=1;-9=NA; else=0")
eclsk$nahn<-recode (eclsk$race, recodes="6:7=1;-9=NA; else=0")
eclsk$other<-recode (eclsk$race, recodes="8=1;-9=NA; else=0")
eclsk$race_gr<-recode(eclsk$race, recodes="3:4='hisp'; 2='nh black'; 5='nh asian'; 6:7='nahn'; 8='other'; 1='nh white'; else=NA", as.factor.result = T)
eclsk$race_gr<-relevel(eclsk$race_gr, ref = 'nh white')
eclsk$male<-recode(eclsk$gender, recodes="1=1; 2=0; -9=NA")
Now I need to form the transition variable; this is my event variable, and in this case it will be 1 if a child enters poverty between the first wave of the data and the third grade wave, and 0 otherwise. NOTE: I need to remove any children who are already in poverty at wave 1, because they are not at risk of experiencing this particular transition.
eclsk<-subset(eclsk, is.na(pov1)==F&is.na(pov2)==F&is.na(pov3)==F&is.na(age1)==F&is.na(age2)==F&is.na(age3)==F&pov1!=1&is.na(eclsk$c17fpstr)==F)
Now we prepare the entire data set. To analyze the data longitudinally, we need to reshape them from the current “wide” format (repeated measures in columns) to a “long” format (repeated observations in rows). The reshape() function allows us to do this easily: we specify our repeated measures and time-varying covariates, as well as the time-constant covariates.
e.long<-reshape(eclsk, idvar="childid", varying=list(age=c("age1","age2", "age3"), age2=c("age2", "age3", "age4"), momedu=c("momedu1", "momedu2", "momedu3")), times=1:3, direction="long" , v.names = c("age_enter", "age_exit", "momedu"))
e.long<-e.long[order(e.long$childid, e.long$time),]
head(e.long)
## childid gender race r1_kage r3age r4age r5age r6age r7age
## 0015006C.1 0015006C 1 1 76.8 87.83 95.53 6 NA 4
## 0015006C.2 0015006C 1 1 76.8 87.83 95.53 6 NA 4
## 0015006C.3 0015006C 1 1 76.8 87.83 95.53 6 NA 4
## 0015014C.1 0015014C 2 1 64.2 75.23 82.93 2 2 2
## 0015014C.2 0015014C 2 1 64.2 75.23 82.93 2 2 2
## 0015014C.3 0015014C 2 1 64.2 75.23 82.93 2 2 2
## c1r4mtsc c4r4mtsc c5r4mtsc c6r4mtsc c7r4mtsc wkpov_r w1povrty
## 0015006C.1 56.19 51.94 52.68 NA 50.26 2 2
## 0015006C.2 56.19 51.94 52.68 NA 50.26 2 2
## 0015006C.3 56.19 51.94 52.68 NA 50.26 2 2
## 0015014C.1 57.74 55.18 51.98 59.29 60.00 2 2
## 0015014C.2 57.74 55.18 51.98 59.29 60.00 2 2
## 0015014C.3 57.74 55.18 51.98 59.29 60.00 2 2
## w3povrty w5povrty w8povrty wkmomed w1momed w3momed w5momed
## 0015006C.1 2 2 2 5 5 6 8
## 0015006C.2 2 2 2 5 5 6 8
## 0015006C.3 2 2 2 5 5 6 8
## 0015014C.1 2 2 2 5 6 6 6
## 0015014C.2 2 2 2 5 6 6 6
## 0015014C.3 2 2 2 5 6 6 6
## w8momed s2_id c1_7fp0 c17fpstr c17fppsu pov1 pov2 pov3 pov4
## 0015006C.1 8 0015 934.1 77 1 0 0 0 0
## 0015006C.2 8 0015 934.1 77 1 0 0 0 0
## 0015006C.3 8 0015 934.1 77 1 0 0 0 0
## 0015014C.1 6 0015 129.6 77 1 0 0 0 0
## 0015014C.2 6 0015 129.6 77 1 0 0 0 0
## 0015014C.3 6 0015 129.6 77 1 0 0 0 0
## hisp black asian nahn other race_gr male time age_enter
## 0015006C.1 0 0 0 0 0 nh white 1 1 6.400
## 0015006C.2 0 0 0 0 0 nh white 1 2 7.319
## 0015006C.3 0 0 0 0 0 nh white 1 3 9.750
## 0015014C.1 0 0 0 0 0 nh white 0 1 5.350
## 0015014C.2 0 0 0 0 0 nh white 0 2 6.269
## 0015014C.3 0 0 0 0 0 nh white 0 3 8.917
## age_exit momedu
## 0015006C.1 7.319 jcollege
## 0015006C.2 9.750 college
## 0015006C.3 14.833 college
## 0015014C.1 6.269 jcollege
## 0015014C.2 8.917 college
## 0015014C.3 13.833 college
Since I have 3 risk periods this time, the previous way of coding the transitions has become burdensome, so instead I hard code it. To do this, I initialize the povtran variable as missing, then hard code the specific transitions I'm interested in. This lets me identify when values between waves stay the same versus when they change. Also, since I'm initializing the variable as missing, I don't have to recode the missing values as I've done in previous examples.
#code the poverty transition for each risk period: 1 = entered poverty during that interval, 0 = remained out of poverty
e.long$povtran<-NA
e.long$povtran[e.long$pov1==0&e.long$pov2==1&e.long$time==1]<-1
e.long$povtran[e.long$pov2==0&e.long$pov3==1&e.long$time==2]<-1
e.long$povtran[e.long$pov3==0&e.long$pov4==1&e.long$time==3]<-1
e.long$povtran[e.long$pov1==0&e.long$pov2==0&e.long$time==1]<-0
e.long$povtran[e.long$pov2==0&e.long$pov3==0&e.long$time==2]<-0
e.long$povtran[e.long$pov3==0&e.long$pov4==0&e.long$time==3]<-0
#drop person-periods where the transition is undefined (the child had already entered poverty in an earlier period), removing them from later risk sets
failed1<-which(is.na(e.long$povtran)==T)
e.long<-e.long[-failed1,]
#round age to the nearest year
e.long$age_enter_r<-round(e.long$age_enter, 0)
e.long$age_exit_r<-round(e.long$age_exit, 0)
e.long$time_start<-e.long$time-1
head(e.long[, c("childid","time_start" , "time", "age_enter", "age_exit", "pov1", "pov2", "pov3", "momedu", "povtran")], n=10)
## childid time_start time age_enter age_exit pov1 pov2 pov3
## 0015006C.1 0015006C 0 1 6.400 7.319 0 0 0
## 0015006C.2 0015006C 1 2 7.319 9.750 0 0 0
## 0015006C.3 0015006C 2 3 9.750 14.833 0 0 0
## 0015014C.1 0015014C 0 1 5.350 6.269 0 0 0
## 0015014C.2 0015014C 1 2 6.269 8.917 0 0 0
## 0015014C.3 0015014C 2 3 8.917 13.833 0 0 0
## 0015019C.1 0015019C 0 1 5.686 6.606 0 0 0
## 0015019C.2 0015019C 1 2 6.606 9.333 0 0 0
## 0015019C.3 0015019C 2 3 9.333 14.333 0 0 0
## 0015020C.1 0015020C 0 1 6.583 7.802 0 0 0
## momedu povtran
## 0015006C.1 jcollege 0
## 0015006C.2 college 0
## 0015006C.3 college 0
## 0015014C.1 jcollege 0
## 0015014C.2 college 0
## 0015014C.3 college 0
## 0015019C.1 jcollege 0
## 0015019C.2 college 0
## 0015019C.3 college 0
## 0015020C.1 college 0
Now we fit the Cox model using the full survey design. For the ECLS-K, I use the longitudinal weight for waves 1-7, as well as the associated PSU and stratum IDs for the longitudinal data from these waves, taken from the parents of the child, since no data from the children themselves are used in the outcome.
options(survey.lonely.psu = "adjust")
des2<-svydesign(ids = ~c17fppsu, strata = ~c17fpstr, weights=~c1_7fp0, data=e.long[complete.cases(e.long),], nest=T)
#Fit the model
fitl1<-svycoxph(Surv(time = time_start, time2=time, event = povtran)~momedu+race_gr, design=des2)
summary(fitl1)
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (160) clusters.
## svydesign(ids = ~c17fppsu, strata = ~c17fpstr, weights = ~c1_7fp0,
## data = e.long[complete.cases(e.long), ], nest = T)
## Call:
## svycoxph(formula = Surv(time = time_start, time2 = time, event = povtran) ~
## momedu + race_gr, design = des2)
##
## n= 4509, number of events= 144
##
## coef exp(coef) se(coef) z Pr(>|z|)
## momeducollege -3.0827 0.0458 0.6278 -4.91 0.00000091 ***
## momedujcollege -0.6965 0.4983 0.3233 -2.15 0.031 *
## momeduprimarysch 0.4596 1.5834 0.2994 1.54 0.125
## race_grhisp 0.6553 1.9258 0.2385 2.75 0.006 **
## race_grnahn 1.1065 3.0237 0.5310 2.08 0.037 *
## race_grnh asian -0.8537 0.4258 0.8263 -1.03 0.302
## race_grnh black 0.6766 1.9672 0.4412 1.53 0.125
## race_grother 0.2914 1.3383 0.7088 0.41 0.681
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## momeducollege 0.0458 21.817 0.0134 0.157
## momedujcollege 0.4983 2.007 0.2644 0.939
## momeduprimarysch 1.5834 0.632 0.8806 2.847
## race_grhisp 1.9258 0.519 1.2066 3.074
## race_grnahn 3.0237 0.331 1.0680 8.560
## race_grnh asian 0.4258 2.348 0.0843 2.151
## race_grnh black 1.9672 0.508 0.8286 4.671
## race_grother 1.3383 0.747 0.3336 5.369
##
## Concordance= 0.766 (se = 0.021 )
## Rsquare= NA (max possible= NA )
## Likelihood ratio test= NA on 8 df, p=NA
## Wald test = 64 on 8 df, p=0.0000000000751
## Score (logrank) test = NA on 8 df, p=NA
There are several types of residuals for the Cox model, and they are used for different purposes.
First, we will extract the Schoenfeld residuals, which are useful for examining non-proportional hazards with respect to time; that is, whether a covariate effect exhibits time-dependency.
We extract the residuals from the model, then fit a linear model of the residuals on the observed (uncensored) failure times. A significant association would indicate dependence between the residuals and the outcome, i.e. non-proportionality, similar to a test for heteroskedasticity in OLS.
schoenresid<-resid(fitl1, type="schoenfeld")
fit.sr<-lm(schoenresid~des2$variables$time[des2$variables$povtran==1])
summary(fit.sr)
## Response momeducollege :
##
## Call:
## lm(formula = momeducollege ~ des2$variables$time[des2$variables$povtran ==
## 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0527 -0.0418 -0.0324 -0.0292 0.9708
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.0120 0.0369
## des2$variables$time[des2$variables$povtran == 1] 0.0109 0.0188
## t value Pr(>|t|)
## (Intercept) -0.33 0.75
## des2$variables$time[des2$variables$povtran == 1] 0.58 0.56
##
## Residual standard error: 0.184 on 142 degrees of freedom
## Multiple R-squared: 0.00236, Adjusted R-squared: -0.00467
## F-statistic: 0.336 on 1 and 142 DF, p-value: 0.563
##
##
## Response momedujcollege :
##
## Call:
## lm(formula = momedujcollege ~ des2$variables$time[des2$variables$povtran ==
## 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.548 -0.467 -0.386 0.512 0.614
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 0.0851 0.1009
## des2$variables$time[des2$variables$povtran == 1] -0.0213 0.0512
## t value Pr(>|t|)
## (Intercept) 0.84 0.40
## des2$variables$time[des2$variables$povtran == 1] -0.42 0.68
##
## Residual standard error: 0.501 on 142 degrees of freedom
## Multiple R-squared: 0.00122, Adjusted R-squared: -0.00582
## F-statistic: 0.173 on 1 and 142 DF, p-value: 0.678
##
##
## Response momeduprimarysch :
##
## Call:
## lm(formula = momeduprimarysch ~ des2$variables$time[des2$variables$povtran ==
## 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.246 -0.195 -0.176 -0.124 0.887
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 0.0863 0.0781
## des2$variables$time[des2$variables$povtran == 1] -0.0352 0.0397
## t value Pr(>|t|)
## (Intercept) 1.10 0.27
## des2$variables$time[des2$variables$povtran == 1] -0.89 0.38
##
## Residual standard error: 0.388 on 142 degrees of freedom
## Multiple R-squared: 0.00551, Adjusted R-squared: -0.0015
## F-statistic: 0.786 on 1 and 142 DF, p-value: 0.377
##
##
## Response race_grhisp :
##
## Call:
## lm(formula = race_grhisp ~ des2$variables$time[des2$variables$povtran ==
## 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.303 -0.290 -0.247 0.697 0.795
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 0.0787 0.0891
## des2$variables$time[des2$variables$povtran == 1] -0.0427 0.0452
## t value Pr(>|t|)
## (Intercept) 0.88 0.38
## des2$variables$time[des2$variables$povtran == 1] -0.94 0.35
##
## Residual standard error: 0.443 on 142 degrees of freedom
## Multiple R-squared: 0.00622, Adjusted R-squared: -0.000774
## F-statistic: 0.889 on 1 and 142 DF, p-value: 0.347
##
##
## Response race_grnahn :
##
## Call:
## lm(formula = race_grnahn ~ des2$variables$time[des2$variables$povtran ==
## 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0996 -0.0778 -0.0561 -0.0511 0.9489
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.00987 0.05136
## des2$variables$time[des2$variables$povtran == 1] 0.02175 0.02609
## t value Pr(>|t|)
## (Intercept) -0.19 0.85
## des2$variables$time[des2$variables$povtran == 1] 0.83 0.41
##
## Residual standard error: 0.255 on 142 degrees of freedom
## Multiple R-squared: 0.00487, Adjusted R-squared: -0.00214
## F-statistic: 0.695 on 1 and 142 DF, p-value: 0.406
##
##
## Response race_grnh asian :
##
## Call:
## lm(formula = `race_grnh asian` ~ des2$variables$time[des2$variables$povtran ==
## 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0656 -0.0635 -0.0368 -0.0081 0.9927
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 0.0756 0.0402
## des2$variables$time[des2$variables$povtran == 1] -0.0287 0.0204
## t value Pr(>|t|)
## (Intercept) 1.88 0.062 .
## des2$variables$time[des2$variables$povtran == 1] -1.41 0.162
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2 on 142 degrees of freedom
## Multiple R-squared: 0.0137, Adjusted R-squared: 0.00679
## F-statistic: 1.98 on 1 and 142 DF, p-value: 0.162
##
##
## Response race_grnh black :
##
## Call:
## lm(formula = `race_grnh black` ~ des2$variables$time[des2$variables$povtran ==
## 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.1590 -0.1506 -0.1251 -0.0826 0.9174
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.0297 0.0667
## des2$variables$time[des2$variables$povtran == 1] -0.0340 0.0339
## t value Pr(>|t|)
## (Intercept) -0.45 0.66
## des2$variables$time[des2$variables$povtran == 1] -1.00 0.32
##
## Residual standard error: 0.331 on 142 degrees of freedom
## Multiple R-squared: 0.00704, Adjusted R-squared: 4.58e-05
## F-statistic: 1.01 on 1 and 142 DF, p-value: 0.317
##
##
## Response race_grother :
##
## Call:
## lm(formula = race_grother ~ des2$variables$time[des2$variables$povtran ==
## 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0209 -0.0172 -0.0148 -0.0087 0.9895
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.000174 0.023742
## des2$variables$time[des2$variables$povtran == 1] -0.006078 0.012061
## t value Pr(>|t|)
## (Intercept) -0.01 0.99
## des2$variables$time[des2$variables$povtran == 1] -0.50 0.62
##
## Residual standard error: 0.118 on 142 degrees of freedom
## Multiple R-squared: 0.00179, Adjusted R-squared: -0.00524
## F-statistic: 0.254 on 1 and 142 DF, p-value: 0.615
And we jump up and down! None of our predictors appear to be associated with the residuals, which suggests the effects are constant over time.
Not so fast.
We can also get a formal test using weighted residuals, in a nice pre-rolled form with a plot, a la Grambsch and Therneau (1994):
fit.test<-cox.zph(fitl1)
fit.test
## rho chisq p
## momeducollege 0.0493 0.8698 0.3510
## momedujcollege -0.0441 1.0507 0.3053
## momeduprimarysch -0.1089 3.1719 0.0749
## race_grhisp 0.0357 0.3253 0.5684
## race_grnahn -0.0138 0.0717 0.7889
## race_grnh asian 0.0292 0.5822 0.4455
## race_grnh black -0.0563 1.3073 0.2529
## race_grother 0.0415 0.2828 0.5949
## GLOBAL NA 8.0631 0.4273
par(mfrow=c(3,3))
plot(fit.test, df=2)
par(mfrow=c(1,1))
Here, the primary school education category comes close to significance in the formal test (p = 0.075), which suggests borderline non-proportionality with respect to maternal education.
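As an extra check (my addition, not in the original), we can see whether this conclusion is sensitive to the time scale used by the test; cox.zph() accepts transform = "km" (the default), "rank", or "identity".
#sensitivity of the proportional hazards test to the choice of time transform
cox.zph(fitl1, transform = "identity")
cox.zph(fitl1, transform = "rank")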
Next we examine martingale residuals, which are useful for assessing the functional form of a covariate. A plot of the martingale residuals against the values of the covariate indicates whether the covariate is being modeled correctly, i.e. linearly, in the Cox model. If a smooth line fit to these residuals is roughly straight, the covariate has been modeled adequately; if it is curvilinear, you may need to enter the covariate as, for example, a quadratic, although this is rarely a problem for dummy variables.
#extract Martingale residuals
res.mar<-resid(fitl1, type="martingale")
#plot vs maternal education
scatter.smooth(des2$variables$momedu, res.mar,degree = 2,span = 1, ylab="Martingale Residual",col=1, cex=.25, lpars=list(col = "red", lwd = 3))
title(main="Martingale residuals for Mother's Education")
This shows nothing in the way of nonlinearity in this case.
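The same check is more informative for a continuous covariate. As a purely illustrative sketch (age at the start of the interval is not in the fitted model, so this only demonstrates the mechanics), we could smooth the martingale residuals against a continuous variable and look for curvature:
#illustrative only: martingale residuals vs. a continuous variable (age at interval start)
#a roughly flat smoother would suggest no strong omitted nonlinear effect of this variable
scatter.smooth(des2$variables$age_enter, res.mar, degree = 2, span = 1, xlab="Age at interval start", ylab="Martingale Residual", col=1, cex=.25, lpars=list(col = "blue", lwd = 2))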
Above, we saw suggestive evidence of non-proportional effects by maternal education. There are a few standard ways of dealing with this in practice. The first is stratification of the model by the offending predictor: if one of the covariates exhibits non-proportionality, we can re-specify the model so that each group has its own baseline hazard rate. This is straightforward to do using the strata() function within the model formula. It is most useful when the covariate is categorical and not of direct importance for our model (i.e. a control variable).
fitl2<-svycoxph(Surv(time = time_start, time2 = time, event = povtran)~race_gr+strata(momedu), design=des2)
summary(fitl2)
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (160) clusters.
## svydesign(ids = ~c17fppsu, strata = ~c17fpstr, weights = ~c1_7fp0,
## data = e.long[complete.cases(e.long), ], nest = T)
## Call:
## svycoxph(formula = Surv(time = time_start, time2 = time, event = povtran) ~
## race_gr + strata(momedu), design = des2)
##
## n= 4509, number of events= 144
##
## coef exp(coef) se(coef) z Pr(>|z|)
## race_grhisp 0.658 1.932 0.240 2.75 0.006 **
## race_grnahn 1.094 2.987 0.512 2.14 0.033 *
## race_grnh asian -0.814 0.443 0.820 -0.99 0.321
## race_grnh black 0.690 1.994 0.436 1.58 0.113
## race_grother 0.302 1.353 0.710 0.43 0.671
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## race_grhisp 1.932 0.518 1.2078 3.09
## race_grnahn 2.987 0.335 1.0954 8.14
## race_grnh asian 0.443 2.256 0.0889 2.21
## race_grnh black 1.994 0.501 0.8491 4.68
## race_grother 1.353 0.739 0.3361 5.44
##
## Concordance= 0.594 (se = 0.033 )
## Rsquare= NA (max possible= NA )
## Likelihood ratio test= NA on 5 df, p=NA
## Wald test = 13.2 on 5 df, p=0.0216
## Score (logrank) test = NA on 5 df, p=NA
#plot the two models
dat<-expand.grid(momedu=levels(e.long$momedu), race_gr=levels(e.long$race_gr))
dat
## momedu race_gr
## 1 highschool nh white
## 2 college nh white
## 3 jcollege nh white
## 4 primarysch nh white
## 5 highschool hisp
## 6 college hisp
## 7 jcollege hisp
## 8 primarysch hisp
## 9 highschool nahn
## 10 college nahn
## 11 jcollege nahn
## 12 primarysch nahn
## 13 highschool nh asian
## 14 college nh asian
## 15 jcollege nh asian
## 16 primarysch nh asian
## 17 highschool nh black
## 18 college nh black
## 19 jcollege nh black
## 20 primarysch nh black
## 21 highschool other
## 22 college other
## 23 jcollege other
## 24 primarysch other
plot(survfit(fitl1, newdata = dat[c( 5:8),]), col = 1, lty=1:4, ylim=c(.6, 1))
legend("bottomleft", legend=levels(e.long$momedu), col=1, lty=1:4)
title(main="Survivorship function for poverty transition - Un-stratified Cox Model", sub="Effect of Education in Hispanics")
plot(survfit(fitl2,newdata = dat[c(5:8),]), col = 1,lty= 1:4, ylim=c(.6, 1))
legend("bottomleft", legend=levels(e.long$momedu), col=1, lty=1:4)
title(main="Survivorship function for poverty transition - Stratified Cox Model", sub="Effect of Education in Hispanics")
We can also include a time-by-covariate interaction term to model any time-dependence in a covariate effect directly. Different authors recommend different things: some advocate simply interacting time with the covariate, others a nonlinear function of time, e.g. log(time) × covariate, and others an indicator for time before/after a cut point × covariate, the so-called “heaviside function” approach, according to Mills.
In this example, time is so limited (three periods) that it doesn't make sense to do this; the heaviside approach is demonstrated in the DHS example below.
You can use the regTermTest() function in the survey package to do omnibus tests for variation across the levels of a factor variable; fit3 below is the same model as fitl1, refit for convenience.
fit3<-svycoxph(Surv(time = time_start, time2=time, event = povtran)~momedu+race_gr, design=des2)
summary(fit3)
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (160) clusters.
## svydesign(ids = ~c17fppsu, strata = ~c17fpstr, weights = ~c1_7fp0,
## data = e.long[complete.cases(e.long), ], nest = T)
## Call:
## svycoxph(formula = Surv(time = time_start, time2 = time, event = povtran) ~
## momedu + race_gr, design = des2)
##
## n= 4509, number of events= 144
##
## coef exp(coef) se(coef) z Pr(>|z|)
## momeducollege -3.0827 0.0458 0.6278 -4.91 0.00000091 ***
## momedujcollege -0.6965 0.4983 0.3233 -2.15 0.031 *
## momeduprimarysch 0.4596 1.5834 0.2994 1.54 0.125
## race_grhisp 0.6553 1.9258 0.2385 2.75 0.006 **
## race_grnahn 1.1065 3.0237 0.5310 2.08 0.037 *
## race_grnh asian -0.8537 0.4258 0.8263 -1.03 0.302
## race_grnh black 0.6766 1.9672 0.4412 1.53 0.125
## race_grother 0.2914 1.3383 0.7088 0.41 0.681
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## momeducollege 0.0458 21.817 0.0134 0.157
## momedujcollege 0.4983 2.007 0.2644 0.939
## momeduprimarysch 1.5834 0.632 0.8806 2.847
## race_grhisp 1.9258 0.519 1.2066 3.074
## race_grnahn 3.0237 0.331 1.0680 8.560
## race_grnh asian 0.4258 2.348 0.0843 2.151
## race_grnh black 1.9672 0.508 0.8286 4.671
## race_grother 1.3383 0.747 0.3336 5.369
##
## Concordance= 0.766 (se = 0.021 )
## Rsquare= NA (max possible= NA )
## Likelihood ratio test= NA on 8 df, p=NA
## Wald test = 64 on 8 df, p=0.0000000000751
## Score (logrank) test = NA on 8 df, p=NA
regTermTest(fit3, ~momedu, method="LRT")
## Working (Rao-Scott+F) LRT for momedu
## in svycoxph(formula = Surv(time = time_start, time2 = time, event = povtran) ~
## momedu + race_gr, design = des2)
## Working 2logLR = 54.39 p= 0.0000005
## (scale factors: 1.7 0.7 0.56 ); denominator df= 83
regTermTest(fit3, ~race_gr, method="LRT")
## Working (Rao-Scott+F) LRT for race_gr
## in svycoxph(formula = Surv(time = time_start, time2 = time, event = povtran) ~
## momedu + race_gr, design = des2)
## Working 2logLR = 11.17 p= 0.07
## (scale factors: 2.1 0.85 0.81 0.73 0.49 ); denominator df= 83
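For comparison (my addition), the same omnibus hypotheses can be tested with regTermTest()'s default Wald method, which uses only the estimated coefficients and their design-based covariance matrix:
#Wald-type omnibus tests, for comparison with the working likelihood ratio tests above
regTermTest(fit3, ~momedu, method="Wald")
regTermTest(fit3, ~race_gr, method="Wald")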
library(haven)
#load the data
model.dat<-read_dta("~/Google Drive//classes/dem7223/class17/data/zzir62dt/ZZIR62FL.DTA")
In the DHS individual recode file, information on every live birth is collected using a retrospective birth history survey mechanism. Since our outcome is the time between the first and second birth, we must restrict our risk set to women who have had a first birth. The bidx variable indexes the birth history, and if bidx_01 is not missing, then the woman is at risk of having a second birth (i.e. she has had a first birth, bidx_01==1). I also select only non-twin births (b0 == 0). The DHS records the date of each birth as a Century Month Code (CMC). For women who actually had a second birth, the interval is the difference between the CMC for the first birth, b3_01, and that of the second birth, b3_02; for women who had not had a second birth by the time of the interview, the censored time is the difference between b3_01 and v008, the CMC date of the interview.
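Century Month Codes count months since January 1900, i.e. CMC = (year − 1900) × 12 + month. As a quick illustration of the conversion back to a calendar date (not needed for the analysis itself; 1386 is just an example value, the interview CMC seen in the output further below):
#convert a Century Month Code back to a calendar year and month
cmc<-1386
year<-1900 + (cmc - 1) %/% 12
month<-cmc - (year - 1900)*12
c(year=year, month=month) #CMC 1386 works out to June 2015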
We have 6161 women who are at risk of a second birth.
table(is.na(model.dat$bidx_01))
##
## FALSE TRUE
## 6161 2187
#now we extract those women
sub<-subset(model.dat, model.dat$bidx_01==1&model.dat$b0_01==0)
#Here I keep only a few of the variables for the dates, and some characteristics of the women, and details of the survey
sub2<-data.frame(CASEID=sub$caseid,
int.cmc=sub$v008,
fbir.cmc=sub$b3_01,
sbir.cmc=sub$b3_02,
marr.cmc=sub$v509,
rural=sub$v025,
educ=sub$v106,
age=sub$v012/5,
partneredu=sub$v701,
partnerage=sub$v730,
weight=sub$v005/1000000,
psu=sub$v021, strata=sub$v022)
Now I need to calculate the birth intervals, both observed and censored, and the event indicator (i.e. did the woman have a second birth?).
sub2$secbi<-ifelse(is.na(sub2$sbir.cmc)==T, ((sub2$int.cmc))-((sub2$fbir.cmc)), (sub2$fbir.cmc-sub2$sbir.cmc))
sub2$b2event<-ifelse(is.na(sub2$sbir.cmc)==T,0,1)
Here, we create some predictor variables: woman's education (secondary or higher vs. less than secondary), woman's age squared, and partner's education (higher than secondary school).
sub2$educ.high<-ifelse(sub2$educ %in% c(2,3), 1, 0)
sub2$age2<-(sub2$age)^2
sub2$partnerhiedu<-ifelse(sub2$partneredu<3,0,ifelse(sub2$partneredu%in%c(8,9),NA,1 ))
options(survey.lonely.psu = "adjust")
#set up the survey design, restricting to women with non-missing partner education
des<-svydesign(ids=~psu, strata = ~strata , weights=~weight, data=sub2[is.na(sub2$partnerhiedu)==F,])
cox.s<-svycoxph(Surv(secbi,b2event)~educ.high+partnerhiedu+I(age/5)+age2, design=des)
summary(cox.s)
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (217) clusters.
## svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = sub2[is.na(sub2$partnerhiedu) ==
## F, ])
## Call:
## svycoxph(formula = Surv(secbi, b2event) ~ educ.high + partnerhiedu +
## I(age/5) + age2, design = des)
##
## n= 5289, number of events= 4527
##
## coef exp(coef) se(coef) z Pr(>|z|)
## educ.high -0.41790 0.65843 0.05698 -7.33 0.00000000000022 ***
## partnerhiedu -0.23520 0.79041 0.09123 -2.58 0.0099 **
## I(age/5) -0.98545 0.37327 0.52079 -1.89 0.0585 .
## age2 0.00981 1.00986 0.00756 1.30 0.1943
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## educ.high 0.658 1.52 0.589 0.736
## partnerhiedu 0.790 1.27 0.661 0.945
## I(age/5) 0.373 2.68 0.135 1.036
## age2 1.010 0.99 0.995 1.025
##
## Concordance= 0.555 (se = 0.005 )
## Rsquare= NA (max possible= NA )
## Likelihood ratio test= NA on 4 df, p=NA
## Wald test = 98.9 on 4 df, p=0
## Score (logrank) test = NA on 4 df, p=NA
#Schoenfeld test
fit.test<-cox.zph(cox.s)
fit.test
## rho chisq p
## educ.high 0.00812 0.591 0.4419
## partnerhiedu 0.01863 3.618 0.0571
## I(age/5) -0.00945 0.719 0.3966
## age2 0.00589 0.283 0.5949
## GLOBAL NA 9.664 0.0465
par(mfrow=c(3,3))
plot(fit.test, df=2)
par(mfrow=c(1,1))
#extract Martingale residuals
res.mar<-resid(cox.s, type="martingale")
#plot vs mother's age
scatter.smooth(des$variables$age, res.mar,degree = 2,span = 1, ylab="Martingale Residual",col=1, cex=.25, lpars=list(col = "red", lwd = 3))
title(main="Martingale residuals for Mother's Age")
As noted above, we can also include a time-by-covariate interaction term to model any time-dependence in a covariate effect directly; here I use the “heaviside function” approach, in which the covariate is interacted with indicators for time before and after a cut point, per Mills. Mills cites Allison for the interpretation rule: “If \(\beta_2\) is positive, then the effect of the covariate x increases over time, while if \(\beta_2\) is negative, the effect of x decreases over time.”
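Concretely, the two-piece (heaviside) specification lets a covariate's effect differ before and after a cut point. In my notation (\(\tau\) is the cut point passed to survSplit() below, and \(g_1\), \(g_2\) are period indicators):
\[
h(t \mid x) = h_0(t)\exp\{\beta_1\, x\, g_1(t) + \beta_2\, x\, g_2(t)\}, \qquad g_1(t)=\mathbf{1}(t<\tau),\; g_2(t)=\mathbf{1}(t\ge\tau),
\]
so \(\beta_1\) is the effect of \(x\) in the early period and \(\beta_2\) its effect in the later period.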
sub.split<-survSplit(Surv(secbi, b2event)~., data= sub2[sub2$secbi>0,], cut=36, episode = "timegroup")
sub.split<-sub.split[order(sub.split$CASEID, sub.split$timegroup),]
sub.split$hv1<-sub.split$age*(1-sub.split$timegroup)
sub.split$hv2<-sub.split$age*(sub.split$timegroup)
head(sub.split)
## CASEID int.cmc fbir.cmc sbir.cmc marr.cmc rural educ age
## 1 1 1 2 1386 1381 1349 1213 2 0 6.0
## 2 1 3 2 1386 1356 NA 1334 2 2 4.4
## 3 1 4 2 1386 1262 1230 1201 2 0 8.4
## 4 1 4 3 1386 1376 1356 1313 2 1 5.0
## 5 1 5 1 1386 1385 1352 1280 2 2 5.0
## 6 1 6 2 1386 1323 1226 1183 2 0 7.4
## partneredu partnerage weight psu strata educ.high age2 partnerhiedu
## 1 0 49 1.058 1 26 0 36.00 0
## 2 2 22 1.058 1 26 1 19.36 0
## 3 0 NA 1.058 1 26 0 70.56 0
## 4 0 35 1.058 1 26 0 25.00 0
## 5 0 31 1.058 1 26 1 25.00 0
## 6 0 45 1.058 1 26 0 54.76 0
## tstart secbi b2event timegroup hv1 hv2
## 1 0 32 1 1 0 6.0
## 2 0 30 0 1 0 4.4
## 3 0 32 1 1 0 8.4
## 4 0 20 1 1 0 5.0
## 5 0 33 1 1 0 5.0
## 6 0 36 0 1 0 7.4
des3<-svydesign(ids=~psu, strata = ~strata , weights=~weight, data=sub.split[is.na(sub.split$partnerhiedu)==F,])
cox.s2<-svycoxph(Surv(secbi,b2event)~educ.high+partnerhiedu+hv1+hv2, design=des3)
summary(cox.s2)
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (217) clusters.
## svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = sub.split[is.na(sub.split$partnerhiedu) ==
## F, ])
## Call:
## svycoxph(formula = Surv(secbi, b2event) ~ educ.high + partnerhiedu +
## hv1 + hv2, design = des3)
##
## n= 7812, number of events= 4527
##
## coef exp(coef) se(coef) z Pr(>|z|)
## educ.high -0.4138 0.6611 0.0607 -6.82 0.0000000000089 ***
## partnerhiedu -0.2628 0.7689 0.0953 -2.76 0.0058 **
## hv1 0.4637 1.5899 0.0169 27.45 < 0.0000000000000002 ***
## hv2 0.1107 1.1170 0.0131 8.46 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## educ.high 0.661 1.513 0.587 0.745
## partnerhiedu 0.769 1.301 0.638 0.927
## hv1 1.590 0.629 1.538 1.643
## hv2 1.117 0.895 1.089 1.146
##
## Concordance= 0.683 (se = 0.005 )
## Rsquare= NA (max possible= NA )
## Likelihood ratio test= NA on 4 df, p=NA
## Wald test = 1505 on 4 df, p=0
## Score (logrank) test = NA on 4 df, p=NA
So, for us, \(\beta_2\) in the heaviside function is positive, suggesting that the age effect increases over time.
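One optional robustness check (my own suggestion, not part of the original analysis): since survSplit() numbers its episode variable starting at 1, the period-specific terms can also be built with explicit indicators, which makes the coding unambiguous; des3 could then be rebuilt and cox.s2 refit with these terms to confirm the conclusion.
#alternative heaviside coding using explicit episode indicators (variable names are my own)
#hv1.alt carries the age effect in the first 36 months, hv2.alt afterwards
sub.split$hv1.alt<-sub.split$age*as.numeric(sub.split$timegroup==1)
sub.split$hv2.alt<-sub.split$age*as.numeric(sub.split$timegroup==2)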