This example will illustrate how to test for differences between survival functions estimated by the Kaplan-Meier product limit estimator. The tests all follow the methods described by Harrington and Fleming (1982) Link.

The first example will use as its outcome variable, the event of a child dying before age 1. The data for this example come from the model.data Demographic and Health Survey for 2012 children’s recode file. This file contains information for all births in the last 5 years prior to the survey.

The second example, we will examine how to calculate the survival function for a longitudinally collected data set. Here I use data from the ECLS-K. Specifically, we will examine the transition into poverty between kindergarten and fifth grade.

#load libraries
library(haven)
library(survival)
library(car)
library(muhaz)
model.dat<-read_dta("C:/Users/ozd504/Google Drive/zzkr62dt/ZZKR62FL.DTA")

Event - Infant Mortality

In the DHS, they record if a child is dead or alive and the age at death if the child is dead. This can be understood using a series of variables about each child.

If the child is alive at the time of interview, then the variable B5==1, and the age at death is censored.

If the age at death is censored, then the age at the date of interview (censored age at death) is the date of the interview - date of birth (in months).

If the child is dead at the time of interview,then the variable B5!=1, then the age at death in months is the variable B7. Here we code this:

model.dat$death.age<-ifelse(model.dat$b5==1,
                          ((((model.dat$v008))+1900)-(((model.dat$b3))+1900)) 
                          ,model.dat$b7)

#censoring indicator for death by age 1, in months (12 months)
model.dat$d.event<-ifelse(is.na(model.dat$b7)==T|model.dat$b7>12,0,1)
model.dat$d.eventfac<-factor(model.dat$d.event); levels(model.dat$d.eventfac)<-c("Alive at 1", "Dead by 1")
table(model.dat$d.eventfac)
## 
## Alive at 1  Dead by 1 
##       5434        534

Compairing Two Groups

We will now test for differences in survival by characteristics of the household. First we will examine whether the survival chances are the same for children in relatively high ses (in material terms) households, compared to those in relatively low-ses households.

This is the equvalent of doing a t-test, or Mann-Whitney U test for differences between two groups.

model.dat$highses<-recode(model.dat$v190, recodes ="1:3 = 0; 4:5=1; else=NA")
fit1<-survfit(Surv(death.age, d.event)~highses, data=model.dat)
plot(fit1, ylim=c(.85,1), xlim=c(0,12), col=c(1,2), conf.int=T)
title(main="Survival Function for Infant Mortality", sub="Low vs. High SES Households")
legend("topright", legend = c("Low SES  ", "High SES  "), col=c(1,2), lty=1)

fit1
## Call: survfit(formula = Surv(death.age, d.event) ~ highses, data = model.dat)
## 
##              n events median 0.95LCL 0.95UCL
## highses=0 4179    362     NA      NA      NA
## highses=1 1789    172     NA      NA      NA
summary(fit1)
## Call: survfit(formula = Surv(death.age, d.event) ~ highses, data = model.dat)
## 
##                 highses=0 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0   4179     134    0.968 0.00273        0.963        0.973
##     1   3992      18    0.964 0.00290        0.958        0.969
##     2   3914      28    0.957 0.00316        0.951        0.963
##     3   3808      24    0.951 0.00337        0.944        0.957
##     4   3709      10    0.948 0.00346        0.941        0.955
##     5   3625      15    0.944 0.00359        0.937        0.951
##     6   3520      20    0.939 0.00376        0.931        0.946
##     7   3414      14    0.935 0.00389        0.927        0.943
##     8   3325      21    0.929 0.00407        0.921        0.937
##     9   3238      17    0.924 0.00422        0.916        0.932
##    10   3159       3    0.923 0.00424        0.915        0.932
##    11   3090      10    0.920 0.00433        0.912        0.929
##    12   3015      48    0.906 0.00475        0.896        0.915
## 
##                 highses=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0   1789      75    0.958 0.00474        0.949        0.967
##     1   1698       8    0.954 0.00498        0.944        0.963
##     2   1659       9    0.948 0.00524        0.938        0.959
##     3   1615      14    0.940 0.00564        0.929        0.951
##     4   1573      15    0.931 0.00604        0.919        0.943
##     5   1536       7    0.927 0.00622        0.915        0.939
##     6   1501       6    0.923 0.00638        0.911        0.936
##     7   1466       4    0.921 0.00648        0.908        0.934
##     8   1430       5    0.918 0.00662        0.905        0.931
##     9   1383       6    0.914 0.00679        0.900        0.927
##    10   1348       2    0.912 0.00684        0.899        0.926
##    11   1315       3    0.910 0.00693        0.897        0.924
##    12   1288      18    0.897 0.00746        0.883        0.912

Gives us the basic survival plot.

Next we will use survtest() to test for differences between the two or more groups. The survdiff() function performs the log-rank test to compare the survival patterns of two or more groups.

#two group compairison
survdiff(Surv(death.age, d.event)~highses, data=model.dat)
## Call:
## survdiff(formula = Surv(death.age, d.event) ~ highses, data = model.dat)
## 
##              N Observed Expected (O-E)^2/E (O-E)^2/V
## highses=0 4179      362      374     0.401      1.37
## highses=1 1789      172      160     0.940      1.37
## 
##  Chisq= 1.4  on 1 degrees of freedom, p= 0.243

In this case, we see no difference in survival status based on household SES.

How about rural vs urban residence?

table(model.dat$v025)
## 
##    1    2 
## 1830 4138
model.dat$rural<-recode(model.dat$v025, recodes ="2 = 1; 1=0; else=NA")

fit2<-survfit(Surv(death.age, d.event)~rural, data=model.dat, conf.type = "log")
plot(fit2,ylim=c(.85, 1), xlim=c(0,12), col=c(1,2), conf.int=T)
title(main="Survival Function for Infant Mortality", sub="Rural vs Urban Residence")
legend("topright", legend = c("Urban","Rural" ), col=c(1,2), lty=1)
fit2
## Call: survfit(formula = Surv(death.age, d.event) ~ rural, data = model.dat, 
##     conf.type = "log")
## 
##            n events median 0.95LCL 0.95UCL
## rural=0 1830    188     NA      NA      NA
## rural=1 4138    346     NA      NA      NA
summary(fit2)
## Call: survfit(formula = Surv(death.age, d.event) ~ rural, data = model.dat, 
##     conf.type = "log")
## 
##                 rural=0 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0   1830      79    0.957 0.00475        0.948        0.966
##     1   1731      11    0.951 0.00506        0.941        0.961
##     2   1694       8    0.946 0.00528        0.936        0.957
##     3   1648      14    0.938 0.00566        0.927        0.949
##     4   1600      10    0.932 0.00592        0.921        0.944
##     5   1571       9    0.927 0.00615        0.915        0.939
##     6   1530       9    0.922 0.00637        0.909        0.934
##     7   1498       5    0.918 0.00650        0.906        0.931
##     8   1456       7    0.914 0.00668        0.901        0.927
##     9   1413       6    0.910 0.00683        0.897        0.924
##    10   1381       2    0.909 0.00689        0.895        0.922
##    11   1345       4    0.906 0.00700        0.893        0.920
##    12   1313      24    0.890 0.00764        0.875        0.905
## 
##                 rural=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0   4138     130    0.969 0.00271        0.963        0.974
##     1   3959      15    0.965 0.00286        0.959        0.971
##     2   3879      29    0.958 0.00314        0.952        0.964
##     3   3775      24    0.952 0.00336        0.945        0.958
##     4   3682      15    0.948 0.00349        0.941        0.955
##     5   3590      13    0.944 0.00360        0.937        0.951
##     6   3491      17    0.940 0.00375        0.932        0.947
##     7   3382      13    0.936 0.00387        0.929        0.944
##     8   3299      19    0.931 0.00404        0.923        0.939
##     9   3208      17    0.926 0.00419        0.918        0.934
##    10   3126       3    0.925 0.00422        0.917        0.933
##    11   3060       9    0.922 0.00430        0.914        0.931
##    12   2990      42    0.909 0.00469        0.900        0.918
#Two- sample test
survdiff(Surv(death.age, d.event)~rural, data=model.dat)
## Call:
## survdiff(formula = Surv(death.age, d.event) ~ rural, data = model.dat)
## 
##            N Observed Expected (O-E)^2/E (O-E)^2/V
## rural=0 1830      188      163      3.79      5.55
## rural=1 4138      346      371      1.67      5.55
## 
##  Chisq= 5.6  on 1 degrees of freedom, p= 0.0184
library(survminer)
## Loading required package: ggplot2
## Loading required package: ggpubr
## Loading required package: magrittr
ggsurvplot(fit2,conf.int = T, risk.table = F, title = "Survivorship Function for Infant Mortality", xlab = "Time in Months", xlim = c(0,12), ylim=c(.8, 1))

Which shows a statistically significant difference in survival between rural and urban children, with rural children showing lower survivorship at all ages.

We can also compare the 95% survival point for rural and urban residents

quantile(fit2, probs=.05)
## $quantile
##         5
## rural=0 2
## rural=1 4
## 
## $lower
##         5
## rural=0 0
## rural=1 3
## 
## $upper
##         5
## rural=0 3
## rural=1 6

We can also calculate the hazard function for each group using the kphaz.fit function in the muhaz library.

haz2<-kphaz.fit(model.dat$death.age, model.dat$d.event, model.dat$rural)
haz2
## $time
##  [1]  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5  0.5  1.5
## [15]  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5
## 
## $haz
##  [1] 0.0064084477 0.0047764602 0.0086460188 0.0063018400 0.0058154299
##  [6] 0.0059338979 0.0033648177 0.0048901211 0.0042929999 0.0014588521
## [11] 0.0030184080 0.0185979202 0.0038314014 0.0075716497 0.0064378810
## [16] 0.0041197145 0.0036742403 0.0049398949 0.0038930677 0.0058337248
## [21] 0.0053636408 0.0009670736 0.0029830591 0.0143373352
## 
## $var
##  [1] 3.733606e-06 2.852025e-06 5.339922e-06 3.971474e-06 3.757836e-06
##  [6] 3.912526e-06 2.264454e-06 3.416455e-06 3.071735e-06 1.064171e-06
## [11] 2.277779e-06 1.441308e-05 9.786715e-07 1.976995e-06 1.727037e-06
## [16] 1.131544e-06 1.038557e-06 1.435586e-06 1.165912e-06 1.791334e-06
## [21] 1.692389e-06 3.117631e-07 9.887982e-07 4.894885e-06
## 
## $strata
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
plot(y=haz2$haz[1:12], x=haz2$time[1:12], col=1, lty=1, type="s")
lines(y=haz2$haz[13:24], x=haz2$time[13:24], col=2, lty=1, type="s")

This may be suggestive that children in urban areas may live in poorer environmental conditions.

k- sample test

Next we illustrate a k-sample test. This would be the equivalent of the ANOVA if we were doing ordinary linear models.

In this example, I use the v024 variable, which corresponds to the region of residence in this data. Effectively we are testing for differences in risk of infant mortality by region.

model.dat
## # A tibble: 5,968 x 945
##             caseid  midx  v000  v001  v002  v003  v004    v005  v006  v007
##              <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
##  1         1  1  2     1   ZZ6     1     1     2     1 1057703     6  2015
##  2         1  1  2     2   ZZ6     1     1     2     1 1057703     6  2015
##  3         1  3  2     1   ZZ6     1     3     2     1 1057703     6  2015
##  4         1  4  3     1   ZZ6     1     4     3     1 1057703     6  2015
##  5         1  4  3     2   ZZ6     1     4     3     1 1057703     6  2015
##  6         1  5  1     1   ZZ6     1     5     1     1 1057703     6  2015
##  7         1  5  1     2   ZZ6     1     5     1     1 1057703     6  2015
##  8         1 15  1     1   ZZ6     1    15     1     1 1057703     6  2015
##  9         1 15  1     2   ZZ6     1    15     1     1 1057703     6  2015
## 10         1 15  1     3   ZZ6     1    15     1     1 1057703     6  2015
## # ... with 5,958 more rows, and 935 more variables: v008 <dbl>,
## #   v009 <dbl>, v010 <dbl>, v011 <dbl>, v012 <dbl>, v013 <dbl+lbl>,
## #   v014 <dbl+lbl>, v015 <dbl+lbl>, v016 <dbl>, v017 <dbl>,
## #   v018 <dbl+lbl>, v019 <dbl+lbl>, v019a <dbl+lbl>, v020 <dbl+lbl>,
## #   v021 <dbl>, v022 <dbl>, v023 <dbl+lbl>, v024 <dbl+lbl>,
## #   v025 <dbl+lbl>, v026 <dbl+lbl>, v027 <dbl>, v028 <dbl>, v029 <dbl>,
## #   v030 <dbl>, v031 <dbl>, v032 <dbl>, v034 <dbl+lbl>, v040 <dbl>,
## #   v042 <dbl+lbl>, v044 <dbl+lbl>, v101 <dbl+lbl>, v102 <dbl+lbl>,
## #   v103 <dbl+lbl>, v104 <dbl+lbl>, v105 <dbl+lbl>, v106 <dbl+lbl>,
## #   v107 <dbl+lbl>, v113 <dbl+lbl>, v115 <dbl+lbl>, v116 <dbl+lbl>,
## #   v119 <dbl+lbl>, v120 <dbl+lbl>, v121 <dbl+lbl>, v122 <dbl+lbl>,
## #   v123 <dbl+lbl>, v124 <dbl+lbl>, v125 <dbl+lbl>, v127 <dbl+lbl>,
## #   v128 <dbl+lbl>, v129 <dbl+lbl>, v130 <dbl+lbl>, v131 <dbl+lbl>,
## #   v133 <dbl+lbl>, v134 <dbl+lbl>, v135 <dbl+lbl>, v136 <dbl>,
## #   v137 <dbl>, v138 <dbl>, v139 <dbl+lbl>, v140 <dbl+lbl>,
## #   v141 <dbl+lbl>, v149 <dbl+lbl>, v150 <dbl+lbl>, v151 <dbl+lbl>,
## #   v152 <dbl+lbl>, v153 <dbl+lbl>, awfactt <dbl>, awfactu <dbl>,
## #   awfactr <dbl>, awfacte <dbl>, awfactw <dbl>, v155 <dbl+lbl>,
## #   v156 <dbl+lbl>, v157 <dbl+lbl>, v158 <dbl+lbl>, v159 <dbl+lbl>,
## #   v160 <dbl+lbl>, v161 <dbl+lbl>, v166 <dbl+lbl>, v167 <dbl+lbl>,
## #   v168 <dbl+lbl>, v190 <dbl+lbl>, v191 <dbl>, ml101 <dbl+lbl>,
## #   v201 <dbl>, v202 <dbl>, v203 <dbl>, v204 <dbl>, v205 <dbl>,
## #   v206 <dbl>, v207 <dbl>, v208 <dbl+lbl>, v209 <dbl+lbl>, v210 <dbl>,
## #   v211 <dbl>, v212 <dbl>, v213 <dbl+lbl>, v214 <dbl>, v215 <dbl+lbl>,
## #   v216 <dbl+lbl>, ...
table(model.dat$v024, model.dat$d.eventfac)
##    
##     Alive at 1 Dead by 1
##   1       2229       181
##   2       1435       141
##   3        631        94
##   4       1139       118
fit3<-survfit(Surv(death.age, d.event)~v024, data=model.dat)
fit3
## Call: survfit(formula = Surv(death.age, d.event) ~ v024, data = model.dat)
## 
##           n events median 0.95LCL 0.95UCL
## v024=1 2410    181     NA      NA      NA
## v024=2 1576    141     NA      NA      NA
## v024=3  725     94     NA      NA      NA
## v024=4 1257    118     NA      NA      NA
summary(fit3)
## Call: survfit(formula = Surv(death.age, d.event) ~ v024, data = model.dat)
## 
##                 v024=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0   2410      55    0.977 0.00304        0.971        0.983
##     1   2320      12    0.972 0.00336        0.966        0.979
##     2   2264      17    0.965 0.00377        0.957        0.972
##     3   2195      13    0.959 0.00407        0.951        0.967
##     4   2140       5    0.957 0.00418        0.949        0.965
##     5   2092       9    0.953 0.00438        0.944        0.961
##     6   2036      11    0.948 0.00462        0.939        0.957
##     7   1976       6    0.945 0.00476        0.935        0.954
##     8   1920      12    0.939 0.00502        0.929        0.949
##     9   1860      11    0.933 0.00527        0.923        0.944
##    10   1805       3    0.932 0.00533        0.921        0.942
##    11   1757       2    0.931 0.00538        0.920        0.941
##    12   1713      25    0.917 0.00595        0.905        0.929
## 
##                 v024=2 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0   1576      57    0.964 0.00470        0.955        0.973
##     1   1507       7    0.959 0.00498        0.950        0.969
##     2   1485       8    0.954 0.00527        0.944        0.965
##     3   1451      12    0.946 0.00570        0.935        0.958
##     4   1410       8    0.941 0.00598        0.929        0.953
##     5   1377       5    0.938 0.00615        0.926        0.950
##     6   1331       7    0.933 0.00639        0.920        0.945
##     7   1294       2    0.931 0.00646        0.919        0.944
##     8   1267       7    0.926 0.00671        0.913        0.939
##     9   1235       7    0.921 0.00696        0.907        0.934
##    11   1178       1    0.920 0.00700        0.906        0.934
##    12   1152      20    0.904 0.00774        0.889        0.919
## 
##                 v024=3 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0    725      48    0.934 0.00923        0.916        0.952
##     1    663       2    0.931 0.00942        0.913        0.950
##     2    650       3    0.927 0.00970        0.908        0.946
##     3    630      10    0.912 0.01060        0.891        0.933
##     4    612       6    0.903 0.01111        0.882        0.925
##     5    600       3    0.899 0.01135        0.877        0.921
##     6    585       5    0.891 0.01176        0.868        0.914
##     7    569       3    0.886 0.01201        0.863        0.910
##     8    549       2    0.883 0.01218        0.859        0.907
##     9    528       1    0.881 0.01227        0.858        0.906
##    10    517       2    0.878 0.01246        0.854        0.903
##    11    504       4    0.871 0.01284        0.846        0.896
##    12    490       5    0.862 0.01331        0.836        0.888
## 
##                 v024=4 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0   1257      49    0.961 0.00546        0.950        0.972
##     1   1200       5    0.957 0.00572        0.946        0.968
##     2   1174       9    0.950 0.00618        0.938        0.962
##     3   1147       3    0.947 0.00633        0.935        0.960
##     4   1120       6    0.942 0.00662        0.929        0.955
##     5   1092       5    0.938 0.00687        0.924        0.951
##     6   1069       3    0.935 0.00702        0.922        0.949
##     7   1041       7    0.929 0.00736        0.915        0.943
##     8   1019       5    0.924 0.00760        0.910        0.939
##     9    998       4    0.921 0.00779        0.905        0.936
##    11    966       6    0.915 0.00809        0.899        0.931
##    12    948      16    0.899 0.00882        0.882        0.917
quantile(fit3, probs=.05)
## $quantile
##        5
## v024=1 6
## v024=2 3
## v024=3 0
## v024=4 2
## 
## $lower
##        5
## v024=1 4
## v024=2 1
## v024=3 0
## v024=4 1
## 
## $upper
##        5
## v024=1 8
## v024=2 5
## v024=3 1
## v024=4 6
ggsurvplot(fit3,conf.int = T, risk.table = F, title = "Survivorship Function for Infant Mortality", xlab = "Time in Months", xlim = c(0,12), ylim=c(.8, 1))

#plot(fit3, ylim=c(.85,1), xlim=c(0,12), col=1:4, conf.int=F)
#title(main="Survival Function for Infant Mortality", sub="Household SES")
#legend("topright", legend = c("Region 1","Region 2", "Region 3", "Region 4" ), col=1:4, lty=1)


#K- sample test
survdiff(Surv(death.age, d.event)~v024, data=model.dat)
## Call:
## survdiff(formula = Surv(death.age, d.event) ~ v024, data = model.dat)
## 
##           N Observed Expected (O-E)^2/E (O-E)^2/V
## v024=1 2410      181    215.5   5.52534   9.43344
## v024=2 1576      141    141.9   0.00537   0.00745
## v024=3  725       94     62.9  15.33021  17.70233
## v024=4 1257      118    113.7   0.16401   0.21218
## 
##  Chisq= 21.4  on 3 degrees of freedom, p= 8.63e-05

Which shows significant variation in survival between regions. The biggest difference we see is between region 3 green) and region 1 (black line) groups.

Lastly, we examine comparing survival across multiple variables, in this case the education of the mother (secedu) and the rural/urban residence rural:

model.dat$secedu<-recode(model.dat$v106, recodes ="2:3 = 1; 0:1=0; else=NA")
table(model.dat$secedu, model.dat$d.eventfac)
##    
##     Alive at 1 Dead by 1
##   0       4470       430
##   1        964       104
fit4<-survfit(Surv(death.age, d.event)~rural+secedu, data=model.dat)
summary(fit4)
## Call: survfit(formula = Surv(death.age, d.event) ~ rural + secedu, 
##     data = model.dat)
## 
##                 rural=0, secedu=0 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0   1193      52    0.956 0.00591        0.945        0.968
##     1   1129       6    0.951 0.00623        0.939        0.964
##     2   1107       6    0.946 0.00655        0.933        0.959
##     3   1079       5    0.942 0.00680        0.929        0.955
##     4   1054       6    0.936 0.00711        0.923        0.950
##     5   1031       5    0.932 0.00736        0.918        0.946
##     6   1002       4    0.928 0.00756        0.913        0.943
##     7    985       4    0.924 0.00776        0.909        0.940
##     8    964       4    0.921 0.00796        0.905        0.936
##     9    939       4    0.917 0.00817        0.901        0.933
##    10    919       2    0.915 0.00827        0.899        0.931
##    11    896       3    0.912 0.00843        0.895        0.928
##    12    875      21    0.890 0.00948        0.871        0.908
## 
##                 rural=0, secedu=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0    637      27    0.958 0.00798        0.942        0.973
##     1    602       5    0.950 0.00867        0.933        0.967
##     2    587       2    0.946 0.00894        0.929        0.964
##     3    569       9    0.931 0.01010        0.912        0.951
##     4    546       4    0.925 0.01058        0.904        0.946
##     5    540       4    0.918 0.01104        0.896        0.940
##     6    528       5    0.909 0.01160        0.887        0.932
##     7    513       1    0.907 0.01172        0.885        0.931
##     8    492       3    0.902 0.01207        0.878        0.926
##     9    474       2    0.898 0.01232        0.874        0.922
##    11    449       1    0.896 0.01245        0.872        0.921
##    12    438       3    0.890 0.01286        0.865        0.915
## 
##                 rural=1, secedu=0 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0   3707     112    0.970 0.00281        0.964        0.975
##     1   3553      14    0.966 0.00298        0.960        0.972
##     2   3482      28    0.958 0.00330        0.952        0.965
##     3   3388      21    0.952 0.00352        0.945        0.959
##     4   3306      15    0.948 0.00368        0.941        0.955
##     5   3227      12    0.944 0.00380        0.937        0.952
##     6   3144      14    0.940 0.00395        0.932        0.948
##     7   3043      12    0.936 0.00408        0.929        0.945
##     8   2971      16    0.931 0.00424        0.923        0.940
##     9   2893      15    0.927 0.00440        0.918        0.935
##    10   2817       3    0.926 0.00443        0.917        0.934
##    11   2769       6    0.924 0.00450        0.915        0.932
##    12   2711      40    0.910 0.00492        0.900        0.920
## 
##                 rural=1, secedu=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0    431      18    0.958 0.00964        0.940        0.977
##     1    406       1    0.956 0.00990        0.937        0.975
##     2    397       1    0.953 0.01016        0.934        0.974
##     3    387       3    0.946 0.01094        0.925        0.968
##     5    363       1    0.943 0.01122        0.922        0.966
##     6    347       3    0.935 0.01207        0.912        0.959
##     7    339       1    0.933 0.01234        0.909        0.957
##     8    328       3    0.924 0.01318        0.899        0.950
##     9    315       2    0.918 0.01373        0.892        0.945
##    11    291       3    0.909 0.01464        0.880        0.938
##    12    279       2    0.902 0.01524        0.873        0.933
ggsurvplot(fit4,conf.int = T, risk.table = F, title = "Survivorship Function for Infant Mortality", xlab = "Time in Months", xlim = c(0,12), ylim=c(.8, 1))

#plot(fit4, ylim=c(.85,1), xlim=c(0,12), col=c(1,1,2,2),lty=c(1,2,1,2), conf.int=F)
#title(main="Survival Function for Infant Mortality", sub="Rural/Urban * Mother's Education")
#legend("topright", legend = c("Urban, Low Edu","Urban High Edu     ", "Rural, Low Edu","Rural High Edu     " ), col=c(1,1,2,2),lty=c(1,2,1,2))

# test
survdiff(Surv(death.age, d.event)~rural+secedu, data=model.dat)
## Call:
## survdiff(formula = Surv(death.age, d.event) ~ rural + secedu, 
##     data = model.dat)
## 
##                      N Observed Expected (O-E)^2/E (O-E)^2/V
## rural=0, secedu=0 1193      122    107.1   2.07801   2.64671
## rural=0, secedu=1  637       66     56.1   1.76227   2.00529
## rural=1, secedu=0 3707      308    333.3   1.92685   5.22186
## rural=1, secedu=1  431       38     37.5   0.00632   0.00693
## 
##  Chisq= 5.9  on 3 degrees of freedom, p= 0.118

Which shows a marginally significant difference between at least two of the groups, in this case, I would say that it’s most likely finding differences between the Urban, low Education and the Rural low education, because there have the higher ratio of observed vs expected.

Survival analysis using survey design

This example will cover the use of R functions for analyzing complex survey data. Most social and health surveys are not simple random samples of the population, but instead consist of respondents from a complex survey design. These designs often stratify the population based on one or more characteristics, including geography, race, age, etc. In addition the designs can be multi-stage, meaning that initial strata are created, then respondents are sampled from smaller units within those strata. An example would be if a school district was chosen as a sample strata, and then schools were then chosen as the primary sampling units (PSUs) within the district. From this 2 stage design, we could further sample classrooms within the school (3 stage design) or simply sample students (or whatever our unit of interest is).

A second feature of survey data we often want to account for is differential respondent weighting. This means that each respondent is given a weight to represent how common that particular respondent is within the population. This reflects the differenital probability of sampling based on respondent characteristics. As demographers, we are also often interested in making inference for the population, not just the sample, so our results must be generalizable to the population at large. Sample weights are used in the process as well.

When such data are analyzed, we must take into account this nesting structure (sample design) as well as the respondent sample weight in order to make valid estimates of ANY statistical parameter. If we do not account for design, the parameter standard errors will be incorrect, and if we do not account for weighting, the parameters themselves will be incorrect and biased.

In general there are typically three things we need to find in our survey data codebooks: The sample strata identifier, the sample primary sampling unit identifier (often called a cluster identifier) and the respondent survey weight. These will typically have one of these names and should be easily identifiable in the codebook.

Statistical software will have special routines for analyzing these types of data and you must be aware that the diversity of statistical routines that generally exists will be lower for analyzing complex survey data, and some forms of analysis may not be available!

In the DHS Recode manual, the sampling information for the data is found in variables v021 and v022, which are the primary sampling unit (PSU) and sample strata, respectively. The person weight is found in variable v005, and following DHS protocol, this has six implied decimal places, so we must divide it by 1000000, again, following the DHS manual.

library(survey)
## Loading required package: grid
## Loading required package: Matrix
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
model.dat$wt<-model.dat$v005/1000000

#create the design: ids == PSU, strata==strata, weights==weights.
options(survey.lonely.psu = "adjust")
des<-svydesign(ids=~v021, strata = ~v022, weights=~wt, data=model.dat)

fit.s<-svykm(Surv(death.age, d.event)~rural, design=des, se=T)

#use svyby to find the %of infants that die before age 1, by rural/urban status
svyby(~d.event, ~rural, des, svymean)
##   rural    d.event          se
## 0     0 0.10760897 0.009765244
## 1     1 0.08655918 0.005201127
#the plotting is a bit more of a challenge, as the survey version of the function isn't as nice

plot(fit.s[[1]], ylim=c(.8,1), xlim=c(0,12),col=1, ci=F )
lines(fit.s[[2]], col=2) 
title(main="Survival Function for Infant Mortality", sub="Rural vs Urban Residence")
legend("topright", legend = c("Urban","Rural" ), col=c(1,2), lty=1)

#test statistic
svylogrank(Surv(death.age, d.event)~rural, design=des)
## [[1]]
##          score                             
## [1,] -28.80354 15.1323 -1.903447 0.05698224
## 
## [[2]]
##      chisq          p 
## 3.62311066 0.05698224 
## 
## attr(,"class")
## [1] "svylogrank"

And we see the p-value is larger than assuming random sampling.

Using Longitudinal Data

In this example, we will examine how to calculate the survival function for a longitudinally collected data set. Here I use data from the ECLS-K. Specifically, we will examine the transition into poverty between kindergarten and third grade.

First we load our data

load("C:/Users/ozd504/Google Drive/classes/dem7223/class17/data/eclsk.Rdata")
names(eclsk)<-tolower(names(eclsk))
#get out only the variables I'm going to use for this example
myvars<-c( "childid","gender", "race", "r1_kage","r4age", "r5age", "r6age", "r7age", "w1povrty","w1povrty","w3povrty", "w5povrty", "w8povrty","wkmomed", "s2_id", "c1_5fp0", "c15fpstr", "c15fppsu")
eclsk<-eclsk[,myvars]


eclsk$age1<-ifelse(eclsk$r1_kage==-9, NA, eclsk$r1_kage/12)
eclsk$age2<-ifelse(eclsk$r4age==-9, NA, eclsk$r4age/12)
#for the later waves, the NCES group the ages into ranges of months, so 1= <105 months, 2=105 to 108 months. So, I fix the age at the midpoint of the interval they give, and make it into years by dividing by 12
eclsk$age3<-recode(eclsk$r5age,recodes="1=105; 2=107; 3=109; 4=112; 5=115; 6=117; -9=NA")/12

eclsk$pov1<-ifelse(eclsk$w1povrty==1,1,0)
eclsk$pov2<-ifelse(eclsk$w3povrty==1,1,0)
eclsk$pov3<-ifelse(eclsk$w5povrty==1,1,0)

#Recode race with white, non Hispanic as reference using dummy vars
eclsk$hisp<-recode (eclsk$race, recodes="3:4=1;-9=NA; else=0")
eclsk$race_rec<-recode (eclsk$race, recodes="1 = 'nhwhite';2='nhblack';3:4='hispanic';5='nhasian'; 6:8='other';-9=NA", as.factor.result = T)
eclsk$male<-recode(eclsk$gender, recodes="1=1; 2=0; -9=NA")
eclsk$mlths<-recode(eclsk$wkmomed, recodes = "1:2=1; 3:9=0; else = NA")
eclsk$mgths<-recode(eclsk$wkmomed, recodes = "1:3=0; 4:9=1; else =NA") 

Now, I need to form the transition variable, this is my event variable, and in this case it will be 1 if a child enters poverty between the first wave of the data and the third grade wave, and 0 otherwise.

NOTE I need to remove any children who are already in poverty age wave 1, because they are not at risk of experiencing this particular transition. Again, this is called forming the risk set

eclsk<-subset(eclsk, is.na(pov1)==F&is.na(pov2)==F&is.na(pov3)==F&is.na(age1)==F&is.na(age2)==F&is.na(age3)==F&pov1!=1)

Now we do the entire data set. To analyze data longitudinally, we need to reshape the data from the current “wide” format (repeated measures in columns) to a “long” format (repeated observations in rows). The reshape() function allows us to do this easily. It allows us to specify our repeated measures, time varying covariates as well as time-constant covariates.

e.long<-reshape(eclsk, idvar="childid", varying=list(c("age1","age2"),
                                                     c("age2", "age3")),
                v.names=c("age_enter", "age_exit"),
                times=1:2, direction="long" )
e.long<-e.long[order(e.long$childid, e.long$time),]

e.long$povtran<-NA

e.long$povtran[e.long$pov1==0&e.long$pov2==1&e.long$time==1]<-1
e.long$povtran[e.long$pov2==0&e.long$pov3==1&e.long$time==2]<-1

e.long$povtran[e.long$pov1==0&e.long$pov2==0&e.long$time==1]<-0
e.long$povtran[e.long$pov2==0&e.long$pov3==0&e.long$time==2]<-0

#find which kids failed in earlier time periods and remove them from the second & third period risk set
failed1<-which(is.na(e.long$povtran)==T)
e.long<-e.long[-failed1,]


e.long$age1r<-round(e.long$age_enter, 0)
e.long$age2r<-round(e.long$age_exit, 0)
head(e.long, n=10)
##             childid gender race r1_kage r4age r5age r6age r7age w1povrty
## 0001002C.1 0001002C      2    1   77.20 94.73     6     4     4        2
## 0001002C.2 0001002C      2    1   77.20 94.73     6     4     4        2
## 0001007C.1 0001007C      1    1   63.60 81.90     2     2     2        2
## 0001007C.2 0001007C      1    1   63.60 81.90     2     2     2        2
## 0001010C.1 0001010C      2    1   70.63 88.63     4     3     3        2
## 0001010C.2 0001010C      2    1   70.63 88.63     4     3     3        2
## 0002004C.1 0002004C      2    1   65.40 83.73     3     2     2        2
## 0002004C.2 0002004C      2    1   65.40 83.73     3     2     2        2
## 0002006C.1 0002006C      2    1   61.90 80.23     2     2     2        2
## 0002008C.1 0002008C      2    1   70.87 89.20     5     3     3        2
##            w1povrty.1 w3povrty w5povrty w8povrty wkmomed s2_id c1_5fp0
## 0001002C.1          2        2        2        2       3  0001  352.35
## 0001002C.2          2        2        2        2       3  0001  352.35
## 0001007C.1          2        2        2        2       6  0001  358.99
## 0001007C.2          2        2        2        2       6  0001  358.99
## 0001010C.1          2        2        2        2       7  0001  352.35
## 0001010C.2          2        2        2        2       7  0001  352.35
## 0002004C.1          2        2        2        2       6  0002  228.05
## 0002004C.2          2        2        2        2       6  0002  228.05
## 0002006C.1          2        1        1        2      -1  0002  287.25
## 0002008C.1          2        2        2        2       4  0002  259.19
##            c15fpstr c15fppsu pov1 pov2 pov3 hisp race_rec male mlths mgths
## 0001002C.1       63        1    0    0    0    0  nhwhite    0     0     0
## 0001002C.2       63        1    0    0    0    0  nhwhite    0     0     0
## 0001007C.1       63        1    0    0    0    0  nhwhite    1     0     1
## 0001007C.2       63        1    0    0    0    0  nhwhite    1     0     1
## 0001010C.1       63        1    0    0    0    0  nhwhite    0     0     1
## 0001010C.2       63        1    0    0    0    0  nhwhite    0     0     1
## 0002004C.1       63        1    0    0    0    0  nhwhite    0     0     1
## 0002004C.2       63        1    0    0    0    0  nhwhite    0     0     1
## 0002006C.1       63        1    0    1    1    0  nhwhite    0    NA    NA
## 0002008C.1       63        1    0    0    0    0  nhwhite    0     0     1
##            time age_enter age_exit povtran age1r age2r
## 0001002C.1    1  6.433333 7.894167       0     6     8
## 0001002C.2    2  7.894167 9.750000       0     8    10
## 0001007C.1    1  5.300000 6.825000       0     5     7
## 0001007C.2    2  6.825000 8.916667       0     7     9
## 0001010C.1    1  5.885833 7.385833       0     6     7
## 0001010C.2    2  7.385833 9.333333       0     7     9
## 0002004C.1    1  5.450000 6.977500       0     5     7
## 0002004C.2    2  6.977500 9.083333       0     7     9
## 0002006C.1    1  5.158333 6.685833       1     5     7
## 0002008C.1    1  5.905833 7.433333       0     6     7

So, this shows us the repeated measures nature of the longitudinal data set.

#poverty transition based on mother's education at time 1.
fit<-survfit(Surv(time = time, event = povtran)~mlths, e.long)
summary(fit)
## Call: survfit(formula = Surv(time = time, event = povtran) ~ mlths, 
##     data = e.long)
## 
## 256 observations deleted due to missingness 
##                 mlths=0 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1  12643     307    0.976 0.00137        0.973        0.978
##     2   6168     203    0.944 0.00258        0.939        0.949
## 
##                 mlths=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1    769     107    0.861  0.0125        0.837        0.886
##     2    331      49    0.733  0.0199        0.695        0.773
ggsurvplot(fit,conf.int = T, risk.table = F, title = "Survivorship Function for Infant Mortality", xlab = "Time in Months")

plot(fit, col=c(2,3),
     ylim=c(.5,1), lwd=2 ,
     main="Survival function for poverty transition, K-5th Grade",
     xlab="Wave of survey", ylab="% Not Experiencing Transition")

legend("bottomleft",col = c(2,3),
       lty=1,lwd=2 ,
       legend=c("Mom HS or more", "Mom < HS"))

survdiff(Surv(time = time, event = povtran)~mlths, e.long)
## Call:
## survdiff(formula = Surv(time = time, event = povtran) ~ mlths, 
##     data = e.long)
## 
## n=13412, 256 observations deleted due to missingness.
## 
##             N Observed Expected (O-E)^2/E (O-E)^2/V
## mlths=0 12643      510    629.4      22.7       427
## mlths=1   769      156     36.6     390.0       427
## 
##  Chisq= 427  on 1 degrees of freedom, p= 0
#poverty transition based on mother's education at time 1 and child's race/ethnicity
fit2<-survfit(Surv(time = time, event = povtran)~mlths+race_rec, e.long)
summary(fit2)
## Call: survfit(formula = Surv(time = time, event = povtran) ~ mlths + 
##     race_rec, data = e.long)
## 
## 258 observations deleted due to missingness 
##                 mlths=0, race_rec=hispanic 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1   1418      70    0.951 0.00575        0.939        0.962
##     2    674      42    0.891 0.01037        0.871        0.912
## 
##                 mlths=0, race_rec=nhasian  
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1    668      20     0.97 0.00659        0.957        0.983
##     2    324      10     0.94 0.01130        0.918        0.963
## 
##                 mlths=0, race_rec=nhblack  
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1    801      57    0.929 0.00908        0.911        0.947
##     2    372      33    0.846 0.01600        0.816        0.878
## 
##                 mlths=0, race_rec=nhwhite  
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1   9109     133    0.985 0.00126        0.983        0.988
##     2   4488      96    0.964 0.00246        0.960        0.969
## 
##                 mlths=0, race_rec=other    
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1    645      27    0.958 0.00789        0.943        0.974
##     2    309      22    0.890 0.01581        0.859        0.921
## 
##                 mlths=1, race_rec=hispanic 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1    347      53    0.847  0.0193        0.810        0.886
##     2    147      27    0.692  0.0313        0.633        0.756
## 
##                 mlths=1, race_rec=nhasian  
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1     60       6    0.900  0.0387        0.827        0.979
##     2     27       1    0.867  0.0496        0.775        0.970
## 
##                 mlths=1, race_rec=nhblack  
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1     62      12    0.806  0.0502        0.714        0.911
##     2     25       3    0.710  0.0685        0.587        0.858
## 
##                 mlths=1, race_rec=nhwhite  
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1    271      33    0.878  0.0199        0.840        0.918
##     2    119      13    0.782  0.0307        0.724        0.845
## 
##                 mlths=1, race_rec=other    
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1     29       3    0.897  0.0566        0.792        1.000
##     2     13       5    0.552  0.1259        0.353        0.863
cols<-RColorBrewer::brewer.pal(n=5, "Greys")
#ggsurvplot(fit2,conf.int = T, risk.table = F, title = "Survivorship Function for Infant Mortality", xlab = "Time in Months")

plot(fit2, 
     col=rep(cols,2),
     lty=c(1,1,1,1,1,2,2,2,2,2),
     ylim=c(.5,1), lwd=2 )

title(main="Survival function for poverty transition,  K-5th Grade",
      sub="By Race and Mother's Education",
      xlab="Wave of survey", ylab="% Not Experiencing Transition")

legend("bottomleft",col=rep(cols,2),
       lty=c(1,1,1,1,1,2,2,2,2,2),
       lwd=2 ,
       legend=c("Mom > HS & Hispanic",
                "Mom > HS & Asian",
                "Mom > HS & NH black",
                "Mom > HS & NH white",
                "Mom > HS & NH other",
                "Mom < HS & Hispanic", 
                "Mom <HS & Asian",
                "Mom < HS & NH black",
                "Mom <HS & NH white",
                "Mom < HS & NH other"), cex=.8)

survdiff(Surv(time = age2r, event = povtran)~mlths+race_rec, e.long)
## Call:
## survdiff(formula = Surv(time = age2r, event = povtran) ~ mlths + 
##     race_rec, data = e.long)
## 
## n=13410, 258 observations deleted due to missingness.
## 
##                               N Observed Expected (O-E)^2/E (O-E)^2/V
## mlths=0, race_rec=hispanic 1418      112    68.89   26.9817    30.836
## mlths=0, race_rec=nhasian   668       30    31.22    0.0474     0.051
## mlths=0, race_rec=nhblack   801       90    37.69   72.6061    78.872
## mlths=0, race_rec=nhwhite  9109      229   460.79  116.5993   388.285
## mlths=0, race_rec=other     645       49    31.11   10.2891    11.059
## mlths=1, race_rec=hispanic  347       80    15.77  261.5318   274.443
## mlths=1, race_rec=nhasian    60        7     2.78    6.3939     6.577
## mlths=1, race_rec=nhblack    62       15     2.81   52.8267    54.354
## mlths=1, race_rec=nhwhite   271       46    13.51   78.1156    81.713
## mlths=1, race_rec=other      29        8     1.43   30.3300    31.140
## 
##  Chisq= 672  on 9 degrees of freedom, p= 0

Which, again shows us that at least two of these groups are different from one another.

With survey design

options(survey.lonely.psu = "adjust")
e.long<-e.long[complete.cases(e.long$c15fppsu, e.long$race_rec, e.long$mlths),]
des2<-svydesign(ids = ~c15fppsu, strata = ~c15fpstr, weights=~c1_5fp0, data=e.long, nest=T)

fit.s<-svykm(Surv(time, povtran)~race_rec, design=des2, se=T)
summary(fit.s)
##          Length Class Mode
## hispanic 3      svykm list
## nhasian  3      svykm list
## nhblack  3      svykm list
## nhwhite  3      svykm list
## other    3      svykm list
#use svyby to find the %of infants that die before age 1, by rural/urban status
svyby(~povtran, ~race_rec+time, des2, svymean)
##            race_rec time    povtran          se
## hispanic.1 hispanic    1 0.15132915 0.015651082
## nhasian.1   nhasian    1 0.08385973 0.022923951
## nhblack.1   nhblack    1 0.17636749 0.028999576
## nhwhite.1   nhwhite    1 0.04393319 0.005601312
## other.1       other    1 0.10676774 0.039838817
## hispanic.2 hispanic    2 0.09195957 0.012241587
## nhasian.2   nhasian    2 0.01680560 0.006453805
## nhblack.2   nhblack    2 0.10575151 0.019892545
## nhwhite.2   nhwhite    2 0.02638244 0.003696736
## other.2       other    2 0.12424443 0.035564032
#test statistic
svylogrank(Surv(time, povtran)~mlths, design=des2)
## [[1]]
##         score                               
## [1,] 42480.37 5138.529 8.267029 1.373387e-16
## 
## [[2]]
##        chisq            p 
## 6.834377e+01 1.373387e-16 
## 
## attr(,"class")
## [1] "svylogrank"