This example illustrates how to examine the fit of the Cox proportional hazards model to a discrete-time (longitudinal) data set and how to use various diagnostics to evaluate the overall model fit. The example uses data from the ECLS-K; specifically, we examine the transition into poverty between kindergarten and third grade.
options("scipen"=999, "digits"=4)
#Load required libraries
library(foreign)
library(survival)
library(car)
library(survey)
## Loading required package: grid
## Loading required package: Matrix
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
library(muhaz)
As in the other examples, I illustrate fitting these models to data that are longitudinal rather than person-duration; in this example, we examine how to fit the Cox model to a longitudinally collected data set.
First, we load our data.
load("~/Google Drive/classes/dem7223/class17//data/eclsk.Rdata")
names(eclsk)<-tolower(names(eclsk))
#get out only the variables I'm going to use for this example
myvars<-c( "childid","gender", "race", "r1_kage","r3age", "r4age", "r5age", "r6age", "r7age","c1r4mtsc", "c4r4mtsc", "c5r4mtsc", "c6r4mtsc", "c7r4mtsc","wkpov_r", "w1povrty","w3povrty", "w5povrty", "w8povrty","wkmomed","w1momed", "w3momed", "w5momed", "w8momed", "s2_id", "c1_7fp0", "c17fpstr", "c17fppsu")
eclsk<-eclsk[,myvars]
eclsk$age1<-ifelse(eclsk$r1_kage==-9, NA, eclsk$r1_kage/12)
eclsk$age2<-ifelse(eclsk$r3age==-9, NA, eclsk$r3age/12)
#for the later waves, NCES groups the ages into ranges of months (e.g., 1 = <105 months, 2 = 105 to 108 months), so I set the age at the midpoint of the interval they give and convert it to years by dividing by 12
eclsk$age3<-recode(eclsk$r5age,recodes="1=105; 2=107; 3=109; 4=112; 5=115; 6=117; -9=NA")/12
eclsk$age4<-recode(eclsk$r7age,recodes="1=155; 2=166; 3=172; 4=178; 5=192; -9=NA")/12
eclsk$pov1<-ifelse(eclsk$wkpov_r==1,1,0)
eclsk$pov2<-ifelse(eclsk$w3povrty==1,1,0)
eclsk$pov3<-ifelse(eclsk$w5povrty==1,1,0)
eclsk$pov4<-ifelse(eclsk$w8povrty==1,1,0)
#mother's education - time-varying, recoded to a four-level factor with high school as the reference
eclsk$momedu1<-recode(eclsk$wkmomed, recodes = "1:2='primarysch';3='highschool';4:5='jcollege'; 6:9='college'; else =NA", as.factor.result = T); eclsk$momedu1<-relevel(eclsk$momedu1, ref = "highschool")
eclsk$momedu2<-recode(eclsk$w3momed, recodes = "1:2='primarysch';3='highschool';4:5='jcollege'; 6:9='college'; else =NA", as.factor.result = T); eclsk$momedu2<-relevel(eclsk$momedu2, ref = "highschool")
eclsk$momedu3<-recode(eclsk$w5momed, recodes = "1:2='primarysch';3='highschool';4:5='jcollege'; 6:9='college'; else =NA", as.factor.result = T); eclsk$momedu3<-relevel(eclsk$momedu3, ref = "highschool")
#Recode race with non-Hispanic white as the reference, using dummy variables and a grouped factor
eclsk$hisp<-recode (eclsk$race, recodes="3:4=1;-9=NA; else=0")
eclsk$black<-recode (eclsk$race, recodes="2=1;-9=NA; else=0")
eclsk$asian<-recode (eclsk$race, recodes="5=1;-9=NA; else=0")
eclsk$nahn<-recode (eclsk$race, recodes="6:7=1;-9=NA; else=0")
eclsk$other<-recode (eclsk$race, recodes="8=1;-9=NA; else=0")
eclsk$race_gr<-recode(eclsk$race, recodes="3:4='hisp'; 2='nh black'; 5='nh asian'; 6:7='nahn'; 8='other'; 1='nh white'; else=NA", as.factor.result = T)
eclsk$race_gr<-relevel(eclsk$race_gr, ref = 'nh white')
eclsk$male<-recode(eclsk$gender, recodes="1=1; 2=0; -9=NA")
Now I need to form the transition variable; this is my event variable, and in this case it will be 1 if a child enters poverty between the first wave of the data and the third grade wave, and 0 otherwise. NOTE: I need to remove any children who are already in poverty at wave 1, because they are not at risk of experiencing this particular transition.
eclsk<-subset(eclsk, is.na(pov1)==F&is.na(pov2)==F&is.na(pov3)==F&is.na(age1)==F&is.na(age2)==F&is.na(age3)==F&pov1!=1&is.na(eclsk$c17fpstr)==F)
Now we prepare the entire data set. To analyze the data longitudinally, we need to reshape them from the current “wide” format (repeated measures in columns) to a “long” format (repeated observations in rows). The reshape() function allows us to do this easily: we specify our repeated measures and time-varying covariates, as well as the time-constant covariates.
e.long<-reshape(eclsk, idvar="childid", varying=list(age=c("age1","age2", "age3"), age2=c("age2", "age3", "age4"), momedu=c("momedu1", "momedu2", "momedu3")), times=1:3, direction="long" , v.names = c("age_enter", "age_exit", "momedu"))
e.long<-e.long[order(e.long$childid, e.long$time),]
head(e.long)
## childid gender race r1_kage r3age r4age r5age r6age r7age
## 0015006C.1 0015006C 1 1 76.8 87.83 95.53 6 NA 4
## 0015006C.2 0015006C 1 1 76.8 87.83 95.53 6 NA 4
## 0015006C.3 0015006C 1 1 76.8 87.83 95.53 6 NA 4
## 0015014C.1 0015014C 2 1 64.2 75.23 82.93 2 2 2
## 0015014C.2 0015014C 2 1 64.2 75.23 82.93 2 2 2
## 0015014C.3 0015014C 2 1 64.2 75.23 82.93 2 2 2
## c1r4mtsc c4r4mtsc c5r4mtsc c6r4mtsc c7r4mtsc wkpov_r w1povrty
## 0015006C.1 56.19 51.94 52.68 NA 50.26 2 2
## 0015006C.2 56.19 51.94 52.68 NA 50.26 2 2
## 0015006C.3 56.19 51.94 52.68 NA 50.26 2 2
## 0015014C.1 57.74 55.18 51.98 59.29 60.00 2 2
## 0015014C.2 57.74 55.18 51.98 59.29 60.00 2 2
## 0015014C.3 57.74 55.18 51.98 59.29 60.00 2 2
## w3povrty w5povrty w8povrty wkmomed w1momed w3momed w5momed
## 0015006C.1 2 2 2 5 5 6 8
## 0015006C.2 2 2 2 5 5 6 8
## 0015006C.3 2 2 2 5 5 6 8
## 0015014C.1 2 2 2 5 6 6 6
## 0015014C.2 2 2 2 5 6 6 6
## 0015014C.3 2 2 2 5 6 6 6
## w8momed s2_id c1_7fp0 c17fpstr c17fppsu pov1 pov2 pov3 pov4
## 0015006C.1 8 0015 934.1 77 1 0 0 0 0
## 0015006C.2 8 0015 934.1 77 1 0 0 0 0
## 0015006C.3 8 0015 934.1 77 1 0 0 0 0
## 0015014C.1 6 0015 129.6 77 1 0 0 0 0
## 0015014C.2 6 0015 129.6 77 1 0 0 0 0
## 0015014C.3 6 0015 129.6 77 1 0 0 0 0
## hisp black asian nahn other race_gr male time age_enter
## 0015006C.1 0 0 0 0 0 nh white 1 1 6.400
## 0015006C.2 0 0 0 0 0 nh white 1 2 7.319
## 0015006C.3 0 0 0 0 0 nh white 1 3 9.750
## 0015014C.1 0 0 0 0 0 nh white 0 1 5.350
## 0015014C.2 0 0 0 0 0 nh white 0 2 6.269
## 0015014C.3 0 0 0 0 0 nh white 0 3 8.917
## age_exit momedu
## 0015006C.1 7.319 jcollege
## 0015006C.2 9.750 college
## 0015006C.3 14.833 college
## 0015014C.1 6.269 jcollege
## 0015014C.2 8.917 college
## 0015014C.3 13.833 college
Since I have 3 risk periods this time, the previous way of coding the transitions has become burdensome, so instead I hard code it. To do this, I initialize the povtran variable as missing, then hard code the specific transitions I'm interested in. This lets me identify when values between waves stay the same versus when they change. Also, since I'm initializing the variable as missing, I don't have to recode the missing values as I've done in previous examples.
#code the poverty transition for each risk period: 1 = entered poverty during that interval, 0 = remained out of poverty
e.long$povtran<-NA
e.long$povtran[e.long$pov1==0&e.long$pov2==1&e.long$time==1]<-1
e.long$povtran[e.long$pov2==0&e.long$pov3==1&e.long$time==2]<-1
e.long$povtran[e.long$pov3==0&e.long$pov4==1&e.long$time==3]<-1
e.long$povtran[e.long$pov1==0&e.long$pov2==0&e.long$time==1]<-0
e.long$povtran[e.long$pov2==0&e.long$pov3==0&e.long$time==2]<-0
e.long$povtran[e.long$pov3==0&e.long$pov4==0&e.long$time==3]<-0
#drop person-periods where the transition is undefined (the child had already entered poverty in an earlier period), removing them from later risk sets
failed1<-which(is.na(e.long$povtran)==T)
e.long<-e.long[-failed1,]
#round age to the nearest year
e.long$age_enter_r<-round(e.long$age_enter, 0)
e.long$age_exit_r<-round(e.long$age_exit, 0)
e.long$time_start<-e.long$time-1
head(e.long[, c("childid","time_start" , "time", "age_enter", "age_exit", "pov1", "pov2", "pov3", "momedu", "povtran")], n=10)
## childid time_start time age_enter age_exit pov1 pov2 pov3
## 0015006C.1 0015006C 0 1 6.400 7.319 0 0 0
## 0015006C.2 0015006C 1 2 7.319 9.750 0 0 0
## 0015006C.3 0015006C 2 3 9.750 14.833 0 0 0
## 0015014C.1 0015014C 0 1 5.350 6.269 0 0 0
## 0015014C.2 0015014C 1 2 6.269 8.917 0 0 0
## 0015014C.3 0015014C 2 3 8.917 13.833 0 0 0
## 0015019C.1 0015019C 0 1 5.686 6.606 0 0 0
## 0015019C.2 0015019C 1 2 6.606 9.333 0 0 0
## 0015019C.3 0015019C 2 3 9.333 14.333 0 0 0
## 0015020C.1 0015020C 0 1 6.583 7.802 0 0 0
## momedu povtran
## 0015006C.1 jcollege 0
## 0015006C.2 college 0
## 0015006C.3 college 0
## 0015014C.1 jcollege 0
## 0015014C.2 college 0
## 0015014C.3 college 0
## 0015019C.1 jcollege 0
## 0015019C.2 college 0
## 0015019C.3 college 0
## 0015020C.1 college 0
Now we fit the Cox model using the full survey design. For the ECLS-K, I use the longitudinal weight for waves 1-7, as well as the associated PSU and stratum IDs for the longitudinal data from these waves, taken from the parents of the child, since no data from the children themselves are used in the outcome.
options(survey.lonely.psu = "adjust")
des2<-svydesign(ids = ~c17fppsu, strata = ~c17fpstr, weights=~c1_7fp0, data=e.long[complete.cases(e.long),], nest=T)
#Fit the model
fitl1<-svycoxph(Surv(time = time_start, time2=time, event = povtran)~momedu+race_gr, design=des2)
summary(fitl1)
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (160) clusters.
## svydesign(ids = ~c17fppsu, strata = ~c17fpstr, weights = ~c1_7fp0,
## data = e.long[complete.cases(e.long), ], nest = T)
## Call:
## svycoxph(formula = Surv(time = time_start, time2 = time, event = povtran) ~
## momedu + race_gr, design = des2)
##
## n= 4509, number of events= 144
##
## coef exp(coef) se(coef) z Pr(>|z|)
## momeducollege -3.0827 0.0458 0.6278 -4.91 0.00000091 ***
## momedujcollege -0.6965 0.4983 0.3233 -2.15 0.031 *
## momeduprimarysch 0.4596 1.5834 0.2994 1.54 0.125
## race_grhisp 0.6553 1.9258 0.2385 2.75 0.006 **
## race_grnahn 1.1065 3.0237 0.5310 2.08 0.037 *
## race_grnh asian -0.8537 0.4258 0.8263 -1.03 0.302
## race_grnh black 0.6766 1.9672 0.4412 1.53 0.125
## race_grother 0.2914 1.3383 0.7088 0.41 0.681
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## momeducollege 0.0458 21.817 0.0134 0.157
## momedujcollege 0.4983 2.007 0.2644 0.939
## momeduprimarysch 1.5834 0.632 0.8806 2.847
## race_grhisp 1.9258 0.519 1.2066 3.074
## race_grnahn 3.0237 0.331 1.0680 8.560
## race_grnh asian 0.4258 2.348 0.0843 2.151
## race_grnh black 1.9672 0.508 0.8286 4.671
## race_grother 1.3383 0.747 0.3336 5.369
##
## Concordance= 0.766 (se = 0.021 )
## Rsquare= NA (max possible= NA )
## Likelihood ratio test= NA on 8 df, p=NA
## Wald test = 64 on 8 df, p=0.0000000000751
## Score (logrank) test = NA on 8 df, p=NA
There are several types of residuals for the Cox model, and they are used for different purposes.
First, we will extract the Schoenfeld residuals, which are useful for examining non-proportional hazards with respect to time; that is, whether a covariate effect exhibits time-dependency.
We extract the residuals from the model, then fit a linear model of the residuals on the observed (uncensored) failure times. A significant association would indicate dependence between the residuals and the outcome, i.e. non-proportionality, similar to a test for heteroskedasticity in OLS.
schoenresid<-resid(fitl1, type="schoenfeld")
fit.sr<-lm(schoenresid~des2$variables$time[des2$variables$povtran==1])
summary(fit.sr)
## Response momeducollege :
##
## Call:
## lm(formula = momeducollege ~ des2$variables$time[des2$variables$povtran ==
## 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0527 -0.0418 -0.0324 -0.0292 0.9708
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.0120 0.0369
## des2$variables$time[des2$variables$povtran == 1] 0.0109 0.0188
## t value Pr(>|t|)
## (Intercept) -0.33 0.75
## des2$variables$time[des2$variables$povtran == 1] 0.58 0.56
##
## Residual standard error: 0.184 on 142 degrees of freedom
## Multiple R-squared: 0.00236, Adjusted R-squared: -0.00467
## F-statistic: 0.336 on 1 and 142 DF, p-value: 0.563
##
##
## Response momedujcollege :
##
## Call:
## lm(formula = momedujcollege ~ des2$variables$time[des2$variables$povtran ==
## 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.548 -0.467 -0.386 0.512 0.614
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 0.0851 0.1009
## des2$variables$time[des2$variables$povtran == 1] -0.0213 0.0512
## t value Pr(>|t|)
## (Intercept) 0.84 0.40
## des2$variables$time[des2$variables$povtran == 1] -0.42 0.68
##
## Residual standard error: 0.501 on 142 degrees of freedom
## Multiple R-squared: 0.00122, Adjusted R-squared: -0.00582
## F-statistic: 0.173 on 1 and 142 DF, p-value: 0.678
##
##
## Response momeduprimarysch :
##
## Call:
## lm(formula = momeduprimarysch ~ des2$variables$time[des2$variables$povtran ==
## 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.246 -0.195 -0.176 -0.124 0.887
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 0.0863 0.0781
## des2$variables$time[des2$variables$povtran == 1] -0.0352 0.0397
## t value Pr(>|t|)
## (Intercept) 1.10 0.27
## des2$variables$time[des2$variables$povtran == 1] -0.89 0.38
##
## Residual standard error: 0.388 on 142 degrees of freedom
## Multiple R-squared: 0.00551, Adjusted R-squared: -0.0015
## F-statistic: 0.786 on 1 and 142 DF, p-value: 0.377
##
##
## Response race_grhisp :
##
## Call:
## lm(formula = race_grhisp ~ des2$variables$time[des2$variables$povtran ==
## 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.303 -0.290 -0.247 0.697 0.795
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 0.0787 0.0891
## des2$variables$time[des2$variables$povtran == 1] -0.0427 0.0452
## t value Pr(>|t|)
## (Intercept) 0.88 0.38
## des2$variables$time[des2$variables$povtran == 1] -0.94 0.35
##
## Residual standard error: 0.443 on 142 degrees of freedom
## Multiple R-squared: 0.00622, Adjusted R-squared: -0.000774
## F-statistic: 0.889 on 1 and 142 DF, p-value: 0.347
##
##
## Response race_grnahn :
##
## Call:
## lm(formula = race_grnahn ~ des2$variables$time[des2$variables$povtran ==
## 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0996 -0.0778 -0.0561 -0.0511 0.9489
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.00987 0.05136
## des2$variables$time[des2$variables$povtran == 1] 0.02175 0.02609
## t value Pr(>|t|)
## (Intercept) -0.19 0.85
## des2$variables$time[des2$variables$povtran == 1] 0.83 0.41
##
## Residual standard error: 0.255 on 142 degrees of freedom
## Multiple R-squared: 0.00487, Adjusted R-squared: -0.00214
## F-statistic: 0.695 on 1 and 142 DF, p-value: 0.406
##
##
## Response race_grnh asian :
##
## Call:
## lm(formula = `race_grnh asian` ~ des2$variables$time[des2$variables$povtran ==
## 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0656 -0.0635 -0.0368 -0.0081 0.9927
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 0.0756 0.0402
## des2$variables$time[des2$variables$povtran == 1] -0.0287 0.0204
## t value Pr(>|t|)
## (Intercept) 1.88 0.062 .
## des2$variables$time[des2$variables$povtran == 1] -1.41 0.162
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2 on 142 degrees of freedom
## Multiple R-squared: 0.0137, Adjusted R-squared: 0.00679
## F-statistic: 1.98 on 1 and 142 DF, p-value: 0.162
##
##
## Response race_grnh black :
##
## Call:
## lm(formula = `race_grnh black` ~ des2$variables$time[des2$variables$povtran ==
## 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.1590 -0.1506 -0.1251 -0.0826 0.9174
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.0297 0.0667
## des2$variables$time[des2$variables$povtran == 1] -0.0340 0.0339
## t value Pr(>|t|)
## (Intercept) -0.45 0.66
## des2$variables$time[des2$variables$povtran == 1] -1.00 0.32
##
## Residual standard error: 0.331 on 142 degrees of freedom
## Multiple R-squared: 0.00704, Adjusted R-squared: 4.58e-05
## F-statistic: 1.01 on 1 and 142 DF, p-value: 0.317
##
##
## Response race_grother :
##
## Call:
## lm(formula = race_grother ~ des2$variables$time[des2$variables$povtran ==
## 1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0209 -0.0172 -0.0148 -0.0087 0.9895
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.000174 0.023742
## des2$variables$time[des2$variables$povtran == 1] -0.006078 0.012061
## t value Pr(>|t|)
## (Intercept) -0.01 0.99
## des2$variables$time[des2$variables$povtran == 1] -0.50 0.62
##
## Residual standard error: 0.118 on 142 degrees of freedom
## Multiple R-squared: 0.00179, Adjusted R-squared: -0.00524
## F-statistic: 0.254 on 1 and 142 DF, p-value: 0.615
And we jump up and down! None of our predictors appear to be associated with the residuals, which suggests the effects are constant over time.
Not so fast.
We can also get a formal test using weighted residuals, in a nice pre-rolled form with a plot, a la Grambsch and Therneau (1994):
fit.test<-cox.zph(fitl1)
fit.test
## rho chisq p
## momeducollege 0.0493 0.8698 0.3510
## momedujcollege -0.0441 1.0507 0.3053
## momeduprimarysch -0.1089 3.1719 0.0749
## race_grhisp 0.0357 0.3253 0.5684
## race_grnahn -0.0138 0.0717 0.7889
## race_grnh asian 0.0292 0.5822 0.4455
## race_grnh black -0.0563 1.3073 0.2529
## race_grother 0.0415 0.2828 0.5949
## GLOBAL NA 8.0631 0.4273
par(mfrow=c(3,3))
plot(fit.test, df=2)
par(mfrow=c(1,1))
Here, the primary school education category comes close to significance in the formal test (p = 0.075), which suggests borderline non-proportionality with respect to maternal education.
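As an extra check (my addition, not in the original), we can see whether this conclusion is sensitive to the time scale used by the test; cox.zph() accepts transform = "km" (the default), "rank", or "identity".
#sensitivity of the proportional hazards test to the choice of time transform
cox.zph(fitl1, transform = "identity")
cox.zph(fitl1, transform = "rank")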
Next we examine martingale residuals, which are useful for assessing the functional form of a covariate. A plot of the martingale residuals against the values of the covariate indicates whether the covariate is being modeled correctly, i.e. linearly, in the Cox model. If a smooth line fit to these residuals is roughly straight, the covariate has been modeled adequately; if it is curvilinear, you may need to enter the covariate as, for example, a quadratic, although this is rarely a problem for dummy variables.
#extract Martingale residuals
res.mar<-resid(fitl1, type="martingale")
#plot vs maternal education
scatter.smooth(des2$variables$momedu, res.mar,degree = 2,span = 1, ylab="Martingale Residual",col=1, cex=.25, lpars=list(col = "red", lwd = 3))
title(main="Martingale residuals for Mother's Education")
This shows nothing in the way of nonlinearity in this case.
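The same check is more informative for a continuous covariate. As a purely illustrative sketch (age at the start of the interval is not in the fitted model, so this only demonstrates the mechanics), we could smooth the martingale residuals against a continuous variable and look for curvature:
#illustrative only: martingale residuals vs. a continuous variable (age at interval start)
#a roughly flat smoother would suggest no strong omitted nonlinear effect of this variable
scatter.smooth(des2$variables$age_enter, res.mar, degree = 2, span = 1, xlab="Age at interval start", ylab="Martingale Residual", col=1, cex=.25, lpars=list(col = "blue", lwd = 2))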
Above, we saw suggestive evidence of non-proportional effects by maternal education. There are a few standard ways of dealing with this in practice. The first is stratification of the model by the offending predictor: if one of the covariates exhibits non-proportionality, we can re-specify the model so that each group has its own baseline hazard rate. This is straightforward to do using the strata() function within the model formula. It is most useful when the covariate is categorical and not of direct importance for our model (i.e. a control variable).
fitl2<-svycoxph(Surv(time = time_start, time2 = time, event = povtran)~race_gr+strata(momedu), design=des2)
summary(fitl2)
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (160) clusters.
## svydesign(ids = ~c17fppsu, strata = ~c17fpstr, weights = ~c1_7fp0,
## data = e.long[complete.cases(e.long), ], nest = T)
## Call:
## svycoxph(formula = Surv(time = time_start, time2 = time, event = povtran) ~
## race_gr + strata(momedu), design = des2)
##
## n= 4509, number of events= 144
##
## coef exp(coef) se(coef) z Pr(>|z|)
## race_grhisp 0.658 1.932 0.240 2.75 0.006 **
## race_grnahn 1.094 2.987 0.512 2.14 0.033 *
## race_grnh asian -0.814 0.443 0.820 -0.99 0.321
## race_grnh black 0.690 1.994 0.436 1.58 0.113
## race_grother 0.302 1.353 0.710 0.43 0.671
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## race_grhisp 1.932 0.518 1.2078 3.09
## race_grnahn 2.987 0.335 1.0954 8.14
## race_grnh asian 0.443 2.256 0.0889 2.21
## race_grnh black 1.994 0.501 0.8491 4.68
## race_grother 1.353 0.739 0.3361 5.44
##
## Concordance= 0.594 (se = 0.033 )
## Rsquare= NA (max possible= NA )
## Likelihood ratio test= NA on 5 df, p=NA
## Wald test = 13.2 on 5 df, p=0.0216
## Score (logrank) test = NA on 5 df, p=NA
#plot the two models
dat<-expand.grid(momedu=levels(e.long$momedu), race_gr=levels(e.long$race_gr))
dat
## momedu race_gr
## 1 highschool nh white
## 2 college nh white
## 3 jcollege nh white
## 4 primarysch nh white
## 5 highschool hisp
## 6 college hisp
## 7 jcollege hisp
## 8 primarysch hisp
## 9 highschool nahn
## 10 college nahn
## 11 jcollege nahn
## 12 primarysch nahn
## 13 highschool nh asian
## 14 college nh asian
## 15 jcollege nh asian
## 16 primarysch nh asian
## 17 highschool nh black
## 18 college nh black
## 19 jcollege nh black
## 20 primarysch nh black
## 21 highschool other
## 22 college other
## 23 jcollege other
## 24 primarysch other
plot(survfit(fitl1, newdata = dat[c( 5:8),]), col = 1, lty=1:4, ylim=c(.6, 1))
legend("bottomleft", legend=levels(e.long$momedu), col=1, lty=1:4)
title(main="Survivorship function for poverty transition - Un-stratified Cox Model", sub="Effect of Education in Hispanics")
plot(survfit(fitl2,newdata = dat[c(5:8),]), col = 1,lty= 1:4, ylim=c(.6, 1))
legend("bottomleft", legend=levels(e.long$momedu), col=1, lty=1:4)
title(main="Survivorship function for poverty transition - Stratified Cox Model", sub="Effect of Education in Hispanics")
We can also include a time-by-covariate interaction term to model any time-dependence in a covariate effect directly. Different authors recommend different things: some advocate simply interacting time with the covariate, others a nonlinear function of time, e.g. log(time) × covariate, and others an indicator for time before/after a cut point × covariate, the so-called “heaviside function” approach, according to Mills.
In this example, time is so limited (three periods) that it doesn't make sense to do this; the heaviside approach is demonstrated in the DHS example below.
You can use the regTermTest() function in the survey package to do omnibus tests for variation across the levels of a factor variable; fit3 below is the same model as fitl1, refit for convenience.
fit3<-svycoxph(Surv(time = time_start, time2=time, event = povtran)~momedu+race_gr, design=des2)
summary(fit3)
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (160) clusters.
## svydesign(ids = ~c17fppsu, strata = ~c17fpstr, weights = ~c1_7fp0,
## data = e.long[complete.cases(e.long), ], nest = T)
## Call:
## svycoxph(formula = Surv(time = time_start, time2 = time, event = povtran) ~
## momedu + race_gr, design = des2)
##
## n= 4509, number of events= 144
##
## coef exp(coef) se(coef) z Pr(>|z|)
## momeducollege -3.0827 0.0458 0.6278 -4.91 0.00000091 ***
## momedujcollege -0.6965 0.4983 0.3233 -2.15 0.031 *
## momeduprimarysch 0.4596 1.5834 0.2994 1.54 0.125
## race_grhisp 0.6553 1.9258 0.2385 2.75 0.006 **
## race_grnahn 1.1065 3.0237 0.5310 2.08 0.037 *
## race_grnh asian -0.8537 0.4258 0.8263 -1.03 0.302
## race_grnh black 0.6766 1.9672 0.4412 1.53 0.125
## race_grother 0.2914 1.3383 0.7088 0.41 0.681
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## momeducollege 0.0458 21.817 0.0134 0.157
## momedujcollege 0.4983 2.007 0.2644 0.939
## momeduprimarysch 1.5834 0.632 0.8806 2.847
## race_grhisp 1.9258 0.519 1.2066 3.074
## race_grnahn 3.0237 0.331 1.0680 8.560
## race_grnh asian 0.4258 2.348 0.0843 2.151
## race_grnh black 1.9672 0.508 0.8286 4.671
## race_grother 1.3383 0.747 0.3336 5.369
##
## Concordance= 0.766 (se = 0.021 )
## Rsquare= NA (max possible= NA )
## Likelihood ratio test= NA on 8 df, p=NA
## Wald test = 64 on 8 df, p=0.0000000000751
## Score (logrank) test = NA on 8 df, p=NA
regTermTest(fit3, ~momedu, method="LRT")
## Working (Rao-Scott+F) LRT for momedu
## in svycoxph(formula = Surv(time = time_start, time2 = time, event = povtran) ~
## momedu + race_gr, design = des2)
## Working 2logLR = 54.39 p= 0.0000005
## (scale factors: 1.7 0.7 0.56 ); denominator df= 83
regTermTest(fit3, ~race_gr, method="LRT")
## Working (Rao-Scott+F) LRT for race_gr
## in svycoxph(formula = Surv(time = time_start, time2 = time, event = povtran) ~
## momedu + race_gr, design = des2)
## Working 2logLR = 11.17 p= 0.07
## (scale factors: 2.1 0.85 0.81 0.73 0.49 ); denominator df= 83
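For comparison (my addition), the same omnibus hypotheses can be tested with regTermTest()'s default Wald method, which uses only the estimated coefficients and their design-based covariance matrix:
#Wald-type omnibus tests, for comparison with the working likelihood ratio tests above
regTermTest(fit3, ~momedu, method="Wald")
regTermTest(fit3, ~race_gr, method="Wald")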
library(haven)
#load the data
model.dat<-read_dta("~/Google Drive//classes/dem7223/class17/data/zzir62dt/ZZIR62FL.DTA")
In the DHS individual recode file, information on every live birth is collected using a retrospective birth history survey mechanism. Since our outcome is the time between the first and second birth, we must restrict our risk set to women who have had a first birth. The bidx variable indexes the birth history, and if bidx_01 is not missing, then the woman is at risk of having a second birth (i.e. she has had a first birth, bidx_01==1). I also select only non-twin births (b0 == 0). The DHS records the date of each birth as a Century Month Code (CMC). For women who actually had a second birth, the interval is the difference between the CMC for the first birth, b3_01, and that of the second birth, b3_02; for women who had not had a second birth by the time of the interview, the censored time is the difference between b3_01 and v008, the CMC date of the interview.
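Century Month Codes count months since January 1900, i.e. CMC = (year − 1900) × 12 + month. As a quick illustration of the conversion back to a calendar date (not needed for the analysis itself; 1386 is just an example value, the interview CMC seen in the output further below):
#convert a Century Month Code back to a calendar year and month
cmc<-1386
year<-1900 + (cmc - 1) %/% 12
month<-cmc - (year - 1900)*12
c(year=year, month=month) #CMC 1386 works out to June 2015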
We have 6161 women who are at risk of a second birth.
table(is.na(model.dat$bidx_01))
##
## FALSE TRUE
## 6161 2187
#now we extract those women
sub<-subset(model.dat, model.dat$bidx_01==1&model.dat$b0_01==0)
#Here I keep only a few of the variables for the dates, and some characteristics of the women, and details of the survey
sub2<-data.frame(CASEID=sub$caseid,
int.cmc=sub$v008,
fbir.cmc=sub$b3_01,
sbir.cmc=sub$b3_02,
marr.cmc=sub$v509,
rural=sub$v025,
educ=sub$v106,
age=sub$v012/5,
partneredu=sub$v701,
partnerage=sub$v730,
weight=sub$v005/1000000,
psu=sub$v021, strata=sub$v022)
Now I need to calculate the birth intervals, both observed and censored, and the event indicator (i.e. did the woman have a second birth?).
sub2$secbi<-ifelse(is.na(sub2$sbir.cmc)==T, ((sub2$int.cmc))-((sub2$fbir.cmc)), (sub2$fbir.cmc-sub2$sbir.cmc))
sub2$b2event<-ifelse(is.na(sub2$sbir.cmc)==T,0,1)
Here, we create some predictor variables: woman's education (secondary or higher vs. less than secondary), woman's age squared, and partner's education (higher than secondary school).
sub2$educ.high<-ifelse(sub2$educ %in% c(2,3), 1, 0)
sub2$age2<-(sub2$age)^2
sub2$partnerhiedu<-ifelse(sub2$partneredu<3,0,ifelse(sub2$partneredu%in%c(8,9),NA,1 ))
options(survey.lonely.psu = "adjust")
#set up the survey design, restricting to women with non-missing partner education
des<-svydesign(ids=~psu, strata = ~strata , weights=~weight, data=sub2[is.na(sub2$partnerhiedu)==F,])
cox.s<-svycoxph(Surv(secbi,b2event)~educ.high+partnerhiedu+I(age/5)+age2, design=des)
summary(cox.s)
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (217) clusters.
## svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = sub2[is.na(sub2$partnerhiedu) ==
## F, ])
## Call:
## svycoxph(formula = Surv(secbi, b2event) ~ educ.high + partnerhiedu +
## I(age/5) + age2, design = des)
##
## n= 5289, number of events= 4527
##
## coef exp(coef) se(coef) z Pr(>|z|)
## educ.high -0.41790 0.65843 0.05698 -7.33 0.00000000000022 ***
## partnerhiedu -0.23520 0.79041 0.09123 -2.58 0.0099 **
## I(age/5) -0.98545 0.37327 0.52079 -1.89 0.0585 .
## age2 0.00981 1.00986 0.00756 1.30 0.1943
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## educ.high 0.658 1.52 0.589 0.736
## partnerhiedu 0.790 1.27 0.661 0.945
## I(age/5) 0.373 2.68 0.135 1.036
## age2 1.010 0.99 0.995 1.025
##
## Concordance= 0.555 (se = 0.005 )
## Rsquare= NA (max possible= NA )
## Likelihood ratio test= NA on 4 df, p=NA
## Wald test = 98.9 on 4 df, p=0
## Score (logrank) test = NA on 4 df, p=NA
#Schoenfeld test
fit.test<-cox.zph(cox.s)
fit.test
## rho chisq p
## educ.high 0.00812 0.591 0.4419
## partnerhiedu 0.01863 3.618 0.0571
## I(age/5) -0.00945 0.719 0.3966
## age2 0.00589 0.283 0.5949
## GLOBAL NA 9.664 0.0465
par(mfrow=c(3,3))
plot(fit.test, df=2)
par(mfrow=c(1,1))
#extract Martingale residuals
res.mar<-resid(cox.s, type="martingale")
#plot vs mother's age
scatter.smooth(des$variables$age, res.mar,degree = 2,span = 1, ylab="Martingale Residual",col=1, cex=.25, lpars=list(col = "red", lwd = 3))
title(main="Martingale residuals for Mother's Age")
As noted above, we can also include a time-by-covariate interaction term to model any time-dependence in a covariate effect directly; here I use the “heaviside function” approach, in which the covariate is interacted with indicators for time before and after a cut point, per Mills. Mills cites Allison for the interpretation rule: “If \(\beta_2\) is positive, then the effect of the covariate x increases over time, while if \(\beta_2\) is negative, the effect of x decreases over time.”
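Concretely, the two-piece (heaviside) specification lets a covariate's effect differ before and after a cut point. In my notation (\(\tau\) is the cut point passed to survSplit() below, and \(g_1\), \(g_2\) are period indicators):
\[
h(t \mid x) = h_0(t)\exp\{\beta_1\, x\, g_1(t) + \beta_2\, x\, g_2(t)\}, \qquad g_1(t)=\mathbf{1}(t<\tau),\; g_2(t)=\mathbf{1}(t\ge\tau),
\]
so \(\beta_1\) is the effect of \(x\) in the early period and \(\beta_2\) its effect in the later period.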
sub.split<-survSplit(Surv(secbi, b2event)~., data= sub2[sub2$secbi>0,], cut=36, episode = "timegroup")
sub.split<-sub.split[order(sub.split$CASEID, sub.split$timegroup),]
sub.split$hv1<-sub.split$age*(1-sub.split$timegroup)
sub.split$hv2<-sub.split$age*(sub.split$timegroup)
head(sub.split)
## CASEID int.cmc fbir.cmc sbir.cmc marr.cmc rural educ age
## 1 1 1 2 1386 1381 1349 1213 2 0 6.0
## 2 1 3 2 1386 1356 NA 1334 2 2 4.4
## 3 1 4 2 1386 1262 1230 1201 2 0 8.4
## 4 1 4 3 1386 1376 1356 1313 2 1 5.0
## 5 1 5 1 1386 1385 1352 1280 2 2 5.0
## 6 1 6 2 1386 1323 1226 1183 2 0 7.4
## partneredu partnerage weight psu strata educ.high age2 partnerhiedu
## 1 0 49 1.058 1 26 0 36.00 0
## 2 2 22 1.058 1 26 1 19.36 0
## 3 0 NA 1.058 1 26 0 70.56 0
## 4 0 35 1.058 1 26 0 25.00 0
## 5 0 31 1.058 1 26 1 25.00 0
## 6 0 45 1.058 1 26 0 54.76 0
## tstart secbi b2event timegroup hv1 hv2
## 1 0 32 1 1 0 6.0
## 2 0 30 0 1 0 4.4
## 3 0 32 1 1 0 8.4
## 4 0 20 1 1 0 5.0
## 5 0 33 1 1 0 5.0
## 6 0 36 0 1 0 7.4
des3<-svydesign(ids=~psu, strata = ~strata , weights=~weight, data=sub.split[is.na(sub.split$partnerhiedu)==F,])
cox.s2<-svycoxph(Surv(secbi,b2event)~educ.high+partnerhiedu+hv1+hv2, design=des3)
summary(cox.s2)
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (217) clusters.
## svydesign(ids = ~psu, strata = ~strata, weights = ~weight, data = sub.split[is.na(sub.split$partnerhiedu) ==
## F, ])
## Call:
## svycoxph(formula = Surv(secbi, b2event) ~ educ.high + partnerhiedu +
## hv1 + hv2, design = des3)
##
## n= 7812, number of events= 4527
##
## coef exp(coef) se(coef) z Pr(>|z|)
## educ.high -0.4138 0.6611 0.0607 -6.82 0.0000000000089 ***
## partnerhiedu -0.2628 0.7689 0.0953 -2.76 0.0058 **
## hv1 0.4637 1.5899 0.0169 27.45 < 0.0000000000000002 ***
## hv2 0.1107 1.1170 0.0131 8.46 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## educ.high 0.661 1.513 0.587 0.745
## partnerhiedu 0.769 1.301 0.638 0.927
## hv1 1.590 0.629 1.538 1.643
## hv2 1.117 0.895 1.089 1.146
##
## Concordance= 0.683 (se = 0.005 )
## Rsquare= NA (max possible= NA )
## Likelihood ratio test= NA on 4 df, p=NA
## Wald test = 1505 on 4 df, p=0
## Score (logrank) test = NA on 4 df, p=NA
So, for us, \(\beta_2\) in the heaviside function is positive, suggesting that the age effect increases over time.
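One optional robustness check (my own suggestion, not part of the original analysis): since survSplit() numbers its episode variable starting at 1, the period-specific terms can also be built with explicit indicators, which makes the coding unambiguous; des3 could then be rebuilt and cox.s2 refit with these terms to confirm the conclusion.
#alternative heaviside coding using explicit episode indicators (variable names are my own)
#hv1.alt carries the age effect in the first 36 months, hv2.alt afterwards
sub.split$hv1.alt<-sub.split$age*as.numeric(sub.split$timegroup==1)
sub.split$hv2.alt<-sub.split$age*as.numeric(sub.split$timegroup==2)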