e02: Parametric Proportional Hazards Model

0.1 Objective

The objective of this exercise is to fit the most appropriate parametric proportional hazards model for under-five mortality in Nigeria. This exercise builds on the previous ones on procedures for estimating Time-to-Event Functions.

0.2 Methods

0.2.1 Data

The pooled birth-recode data sets of 2003, 2008, 2013 and 2018 Nigeria Demographic and Health Surveys (NDHS) were employed:

hwdata <- read.dta("C:/A/02/01F20/DEM223 Event History Analysis/hwdata.dta")

hwdata <- hwdata %>% select(strata, psu, iwt, d_age, status_60, status_60b, status_12, 
    status_12b, bcat01, sex, bsiz, unionstat, education, wealth, place, region2, 
    dhsyearf) %>% filter(complete.cases(hwdata))

hwdata$bcat01 <- factor(hwdata$bcat01, levels = c("Group 01", "Group 02", "Group 03", 
    "Group 04", "Group 05"), labels = c("Grp01", "Grp02", "Grp03", "Grp04", "Grp05"))

hwdata$caseid <- c(1:nrow(hwdata))

0.2.2 Outcome

The event variable is whether the last live birth had by a woman in the 59 months preceding the survey is alive (uncensored) or has died (censored). Hence, the observed time-to-event variable is the duration of survival between birth and death within the 5 years in retrospect; while the censoring indicator is the occurrence of under-five death.

0.2.3 Predictor

The main grouping variable is birth category. This is operationalized as child’s demographic risk attribute at birth (DRAB); a composite variable derived from maternal age at birth, birth order and birth interval characteristics. The DRAB was measured and categorized as follow:

Grp01: Middle-maternal age (20-34), combined with birth order 2-4, birth interval 24-59 months
Grp02: Middle-maternal age (20-34), combined with birth order 1
Grp03: Early-maternal age (<20), combined with any birth order, and any birth interval**
Grp04: Middle-maternal age (20-34), combined with birth order 5+, and birth interval 24-59 months
Grp05: Advanced-maternal age (35+), combined with any birth order, and any birth interval**

The resultant composite was coded and compared as Grp01 (0) vs Grp02 (1), Grp03 (2), Grp04 (3), and Grp05 (4).

** It is worthy to note that the DRAB categorization as presented here is for illustration purpose and subject to further refinement. Meanwhile, the categories was so constructed as the my interest is to investigate the effect of stage of childbearing age, especially the ages at the end of childbearing age spectrum, on early-life mortality experience. This is mainly conceived to capture and measure the impact of the mostly avoidable demographic risk associated with extreme maternal age-related fertility behaviors and compare the risk with those characterized by middle maternal age. Also, there were too few cases in birth order 1, and birth interval outside 24-59 range for early-age and advanced-age mothers.

0.2.4 Covariates

For the purpose of analytical intuition, I controlled for a list of selected correlates which I hypothesized to influence the outcome event, on one hand, and the relationship between main grouping variable and the outcome, on the other hand. The variables are listed below:

Child specific:
- Sex: Female (0) vs Male (1)
- Birth size: Large (0) vs Small (1)
Mother specific:
- Union status: In union (0) vs Not in union (1)
- Level of education: None (0) vs Primary (1), Incomplete secondary (2), and Complete secondary/Higher (3)
- Wealth status: Low (0) vs (High)
Context specific:
- Place of residence: Rural (0) vs Urban (1)
- Region of residence: North (0) vs South (1)
- NDHS year: 2018 (0) vs 2013 (1), 2008 (2), and 2003 (3)

0.2.5 Analytical approach

The proportional hazards (PH) model is employed since the outcome of interest is a measure of under-five mortality risk. Meanwhile, to determine the most appropriate parametric model for the analysis, it is important to establish the “best-fit” theoretical parametric distribution of the hazard by comparing the hazard distributions from empirical non-parametric model(s) to those from theoretical parametric models. To aid graphical illustrations and comparisons of the underlying pattern of under-five mortality risks, I estimated empirical models using kernel-density method and Cox’s PH approach to assess the non-parametric hazard function and cumulative hazard function, respectively, vis-a-vis the counterpart functions from theoretical models based on exponential, weibull, log-normal, log-logistic and piecewise constant exponential distributions.

The shapes of the distribution of each empirical and theoretical function was graphically compared to determine the best parametric model fit. In practice, the shape of the hazard function of the best-fit parametric model should closely approximate that of the empirical model if the parametric model fits the data to the best possible extent. In addition, the Akaike Information Criterion (AIC) parameters of the comparative models were compared to statistically substantiate the suggestive evidence of best-fit observed from the graphical representation of the hazard distributions. Lastly, the best-fit parametric model was used to examine the under-five mortality risk differential across DRAB categories accounting for the influence of key correlates specific to the child (child’s sex and birth size), mother (union status, level of education and wealth status), and context (place and region of residence and year of the survey). [Note: The time of event was shifted above 0 (d_age>0), since parametric functions do not work with time 0.]

0.2.6 Working hypotheses

Following from the foregoing, two main hypotheses were advanced. First, it is conjectured that the patterns of each parametric model will approximate the empirical model in a significantly similar version. Second, it is hypothesized that the hazard of dying before age 5 would be similar for all categories of children, and the influence of other covariates on risk pattern will be significantly negligible.

0.3 Results

0.3.1 Data structure

First, I examined the structure of the variables and distributions of observations thereof:

glimpse(hwdata)

Rows: 61,213
Columns: 18
$ strata     <chr> "2003-north west - urban", "2003-north west - urban", "2...
$ psu        <chr> "2003-183", "2003-183", "2003-183", "2003-183", "2003-18...
$ iwt        <int> 474592, 474592, 474592, 474592, 474592, 474592, 474592, ...
$ d_age      <dbl> 14, 17, 19, 49, 27, 26, 1, 27, 8, 14, 32, 12, 10, 3, 46,...
$ status_60  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
$ status_60b <fct> Alive, Alive, Alive, Alive, Alive, Alive, Alive, Alive, ...
$ status_12  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
$ status_12b <fct> Alive, Alive, Alive, Alive, Alive, Alive, Alive, Alive, ...
$ bcat01     <fct> Grp02, Grp02, Grp01, Grp05, Grp04, Grp04, Grp03, Grp05, ...
$ sex        <fct> Male, Male, Female, Male, Female, Female, Male, Male, Fe...
$ bsiz       <fct> Large, Large, Large, Large, Small, Large, Large, Small, ...
$ unionstat  <fct> InUnion, InUnion, InUnion, NotInUnion, NotInUnion, NotIn...
$ education  <fct> None, None, CompSec/High, None, None, Primary, None, Non...
$ wealth     <fct> High, High, High, High, High, High, Low, Low, High, High...
$ place      <fct> Urban, Urban, Urban, Urban, Urban, Urban, Urban, Urban, ...
$ region2    <fct> North, North, North, North, North, North, North, North, ...
$ dhsyearf   <fct> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 20...
$ caseid     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...

# Distribution by censoring status
table(hwdata$status_60b)


Alive  Dead 
57516  3697

# Distribution by birth categories
table(hwdata$bcat01)


Grp01 Grp02 Grp03 Grp04 Grp05 
16317  6056  6914 19554 12372

# Distribution by child's sex
table(hwdata$sex)


Female   Male 
 30030  31183

# Distribution by child's birth size
table(hwdata$bsiz)


Large Small 
52108  9105

# Distribution by mother's union status
table(hwdata$unionstat)


   InUnion NotInUnion 
     57511       3702

# Distribution by mother's education
table(hwdata$education)


        None      Primary    IncompSec CompSec/High 
       28164        11989         7180        13880

# Distribution by mother's wealth status
table(hwdata$wealth)


  Low  High 
40762 20451

# Distribution by place of residence
table(hwdata$place)


Rural Urban 
41371 19842

# Distribution by region of residence
table(hwdata$region2)


North South 
41150 20063

# Distribution by survey year
table(hwdata$dhsyearf)


 2018  2013  2008  2003 
21350 19302 16946  3615

Also, I examined the characteristics of a few dead children in the data set:

# First ten censored observations
head(hwdata[hwdata$status_60 == 1, c("bcat01", "d_age", "sex", "bsiz", "unionstat", 
    "education", "wealth", "place", "region2")], n = 10)

# Last ten censored observations
tail(hwdata[hwdata$status_60 == 1, c("bcat01", "d_age", "sex", "bsiz", "unionstat", 
    "education", "wealth", "place", "region2")], n = 10)

0.3.2 Best-fit model determination

0.3.2.1 Empirical nmodels estimation

Product-limit unsmoothed hazard estimation:

fit.haz.km <- kphaz.fit(time = hwdata$d_age[hwdata$d_age > 0], status = hwdata$status_60[hwdata$d_age > 
    0], method = "product-limit")

Kernel-density smoothed hazard estimation:

fit.haz.sm <- muhaz(hwdata$d_age[hwdata$d_age > 0], hwdata$status_60[hwdata$d_age > 
    0], min.time = 1, max.time = 60)

Cox’s PH cumulative hazard estimation:

fit.cumhaz <- coxreg(Surv(d_age, status_60) ~ bcat01 + sex + education + place, data = hwdata[hwdata$d_age > 
    0, ])

The underlying life table distribution of hazards:

km0 <- data.frame(fit.haz.km)
km1 <- data.frame(km0[km0$time %in% c(1.5:10.5), c("time", "haz")])
km2 <- data.frame(km0[km0$time %in% c(11.5:20.5), c("time", "haz")])
km3 <- data.frame(km0[km0$time > 20, c("time", "haz")])
km <- cbind(km1, km2, km3)
km

Plot of unsmoothed vs smoothed empirical hazard functions:

kphaz.plot(fit.haz.km, main = "Hazards of dying in infancy, NDHS 2003-2018")
lines(fit.haz.sm, col = 2, lwd = 1.5)
legend("topleft", legend = c("Unsmoothed", "Smoothed"), col = c(1, 2), lty = c(1, 
    1))

0.3.2.2 Theoretical models estimation

Exponential hazard estimation:

fit.1 <- phreg(Surv(d_age, status_60) ~ bcat01 + sex + education + place, data = hwdata[hwdata$d_age > 
    0, ], dist = "weibull", shape = 1)

Plots of underlying exponential distribution functions:

plot(fit.1, col = 2)

Diagnostic plot of exponential hazard vs empirical “smoothed” HF:

plot(fit.1, fn = "haz", main = "Parametric exponential vs Non-parametric cox HF", 
    xlim = c(0, 60), ylim = c(0, 0.008))
lines(fit.haz.sm, col = 2, xlim = c(0, 60), ylim = c(0, 0.008))

Diagnostic plot of exponential hazard vs empirical CHF:

check.dist(sp = fit.cumhaz, pp = fit.1, main = "Parametric exponential vs Non-parametric cox CHF")

Weibull hazard estimation:

fit.2 <- phreg(Surv(d_age, status_60) ~ bcat01 + sex + education + place, data = hwdata[hwdata$d_age > 
    0, ], dist = "weibull")

Plots of underlying weibull distribution functions.

plot(fit.2, col = 2)

Diagnostic plot of weibull hazard vs empirical “smoothed” HF:

plot(fit.2, fn = "haz", main = "Parametric weibull vs Non-parametric cox HF", xlim = c(0, 
    60), ylim = c(0, 0.008))
lines(fit.haz.sm, col = 2, xlim = c(0, 60), ylim = c(0, 0.008))

Diagnostic plot of weibull hazard vs empirical CHF:

check.dist(sp = fit.cumhaz, pp = fit.2, main = "Parametric weibull vs Non-parametric cox CHF")

Log-normal hazard estimation:

fit.3 <- phreg(Surv(d_age, status_60) ~ bcat01 + sex + education + place, data = hwdata[hwdata$d_age > 
    0, ], dist = "lognormal", center = T)

Plots of underlying log-normal distribution functions.

plot(fit.3, col = 2)

Diagnostic plot of log-normal hazard vs empirical “smoothed” HF:

plot(fit.3, fn = "haz", main = "Parametric log-normal vs Non-parametric cox HF", 
    xlim = c(0, 60), ylim = c(0, 0.12))
lines(fit.haz.sm, col = 2, xlim = c(0, 60), ylim = c(0, 0.12))

Diagnostic plot of log-normal hazard vs empirical CHF.

check.dist(sp = fit.cumhaz, pp = fit.3, main = "Parametric log-normal vs Non-parametric cox CHF")

Log-logistic hazard estimation:

fit.4 <- phreg(Surv(d_age, status_60) ~ bcat01 + sex + education + place, data = hwdata[hwdata$d_age > 
    0, ], dist = "loglogistic", center = T)

Plots of underlying Log-logistic distribution functions:

plot(fit.4, col = 2)

Diagnostic plot of Log-logistic hazard vs empirical “smoothed” HF:

plot(fit.4, fn = "haz", main = "Parametric log-logistic vs Non-parametric cox HF", 
    xlim = c(0, 60), ylim = c(0, 0.09))
lines(fit.haz.sm, col = 2, xlim = c(0, 60), ylim = c(0, 0.008))

Diagnostic plot of Log-logistic hazard vs empirical CHF:

check.dist(sp = fit.cumhaz, pp = fit.4, main = "Parametric log-logistic vs Non-parametric cox CHF")

Piecewise Constant Exponential (PCE) hazard estimation:

fit.5 <- phreg(Surv(d_age, status_60) ~ bcat01 + sex + education + place, data = hwdata[hwdata$d_age > 
    0, ], dist = "pch", cuts = c(3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 
    42, 45, 48, 51, 54))

Plots of underlying PCE distribution functions:

plot(fit.5, col = 2)

Diagnostic plot of PCE vs empirical “smoothed” HF:

plot(fit.5, fn = "haz", main = "Parametric PCE vs Non-parametric cox HF", xlim = c(0, 
    60), ylim = c(0, 0.008))
lines(fit.haz.sm, col = 2, xlim = c(0, 60), ylim = c(0, 0.008))

Diagnostic plot of PCE hazard vs empirical CHF:

check.dist(sp = fit.cumhaz, pp = fit.5, main = "Parametric PCE vs Non-parametric cox CHF")

Comparison of parametric models’ AICs:

coxhaAIC <- AIC(fit.cumhaz)
exponAIC <- -2 * fit.1$loglik[2] + 2 * length(fit.1$coefficients)
weibuAIC <- -2 * fit.2$loglik[2] + 2 * length(fit.2$coefficients)
lognoAIC <- -2 * fit.3$loglik[2] + 2 * length(fit.3$coefficients)
logloAIC <- -2 * fit.4$loglik[2] + 2 * length(fit.4$coefficients)
pieceAIC <- -2 * fit.5$loglik[2] + 2 * length(fit.5$coefficients)

bindedAIC <- cbind(coxhaAIC, exponAIC, weibuAIC, lognoAIC, logloAIC, pieceAIC)
bindedAIC

     coxhaAIC exponAIC weibuAIC lognoAIC logloAIC pieceAIC
[1,]    45316    31464    31445    31210    31207    30286

0.3.3 Refitting the best-fit model

Unadjusted hazard ratio by birth category (DRAB):

fit.a <- phreg(Surv(d_age, status_60) ~ bcat01, data = hwdata[hwdata$d_age > 0, ], 
    dist = "pch", cuts = c(3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 
        48, 51, 54))
summary(fit.a)

Call:
phreg(formula = Surv(d_age, status_60) ~ bcat01, data = hwdata[hwdata$d_age > 
    0, ], dist = "pch", cuts = c(3, 6, 9, 12, 15, 18, 21, 24, 
    27, 30, 33, 36, 39, 42, 45, 48, 51, 54))

Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
bcat01 
           Grp01    0.257     0         1           (reference)
           Grp02    0.088    -0.028     0.972     0.096     0.769 
           Grp03    0.107     0.480     1.616     0.075     0.000 
           Grp04    0.321     0.326     1.385     0.061     0.000 
           Grp05    0.227     0.459     1.582     0.064     0.000 


Events                    2166 
Total time at risk        1230065 
Max. log. likelihood      -15266 
LR test statistic         81.51 
Degrees of freedom        4 
Overall p-value           1.11022e-16

Adjusted hazard by birth category (DRAB) controlling for characteristics of the child, mother and context:

fit.b <- phreg(Surv(d_age, status_60) ~ bcat01 + sex + bsiz + unionstat + education + 
    wealth + place + region2 + dhsyearf, data = hwdata[hwdata$d_age > 0, ], dist = "pch", 
    cuts = c(3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54))
summary(fit.b)

Call:
phreg(formula = Surv(d_age, status_60) ~ bcat01 + sex + bsiz + 
    unionstat + education + wealth + place + region2 + dhsyearf, 
    data = hwdata[hwdata$d_age > 0, ], dist = "pch", cuts = c(3, 
        6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 
        48, 51, 54))

Covariate          W.mean      Coef Exp(Coef)  se(Coef)    Wald p
bcat01 
           Grp01    0.257     0         1           (reference)
           Grp02    0.088     0.160     1.173     0.098     0.104 
           Grp03    0.107     0.261     1.298     0.076     0.001 
           Grp04    0.321     0.232     1.261     0.061     0.000 
           Grp05    0.227     0.400     1.492     0.065     0.000 
sex 
          Female    0.493     0         1           (reference)
            Male    0.507     0.096     1.101     0.043     0.026 
bsiz 
           Large    0.864     0         1           (reference)
           Small    0.136     0.169     1.184     0.057     0.003 
unionstat 
         InUnion    0.926     0         1           (reference)
      NotInUnion    0.074     0.444     1.559     0.079     0.000 
education 
            None    0.447     0         1           (reference)
         Primary    0.201    -0.091     0.913     0.059     0.124 
       IncompSec    0.115    -0.220     0.802     0.084     0.008 
    CompSec/High    0.237    -0.573     0.564     0.087     0.000 
wealth 
             Low    0.649     0         1           (reference)
            High    0.351    -0.293     0.746     0.068     0.000 
place 
           Rural    0.661     0         1           (reference)
           Urban    0.339    -0.179     0.836     0.060     0.003 
region2 
           North    0.655     0         1           (reference)
           South    0.345    -0.256     0.774     0.061     0.000 
dhsyearf 
            2018    0.367     0         1           (reference)
            2013    0.313    -0.122     0.885     0.056     0.029 
            2008    0.265     0.152     1.165     0.054     0.005 
            2003    0.055     0.456     1.577     0.082     0.000 


Events                    2166 
Total time at risk        1230065 
Max. log. likelihood      -15068 
LR test statistic         477.58 
Degrees of freedom        16 
Overall p-value           0

Comparison of refitted model’s AICs:

fit.aAIC <- -2 * fit.a$loglik[2] + 2 * length(fit.a$coefficients)
fit.bAIC <- -2 * fit.b$loglik[2] + 2 * length(fit.b$coefficients)
pceAIC <- cbind(fit.aAIC, fit.bAIC)
pceAIC

     fit.aAIC fit.bAIC
[1,]    30541    30169

Comparison of refitted model’s BICs:

obsN <- as.data.frame(freq(hwdata$status_60))
fit.aBIC <- 3 * log(dim(hwdata)[1]) - 2 * (fit.a$loglik[2])
fit.bBIC <- 9 * log(dim(hwdata)[1]) - 2 * (fit.b$loglik[2])
pceBIC <- cbind(fit.aBIC, fit.bBIC)
pceBIC

     fit.aBIC fit.bBIC
[1,]    30566    30236

0.4 Interpretations

The graphical and statistical comparisons of distributions of non-parametric and parametric hazard functions revealed the importance of model diagnostic, and of specifying the best-fit distribution function in ensuring the propriety of parametric model strategy. The results established the PCE parameterization as the best-fit modeling strategy for analysing the data. The PCE function’s lowest AIC value further evidenced its appropriateness over other parametric functions: PCE (30308) vs. exponential (31487), weibull (31468), log-normal (31233), and log-logistic (31229). This outcome deviates from the hypothesized distributions that there will be no significant distinction in the extent to which each parametric model approximates the empirical model. Likewise, the assumption of significantly negligible between-Grpdifference in hazard of under-five mortality by birth category, especially, regardless of the influence of other correlates did not hold. The results from the unadjusted hazard ratio (UHR) and adjusted hazard ratio (AHR) of PCE models revealed significant disparities in under-five mortality risk by birth category (Grp01 vs. Grp02, Grp03, Grp04, and Grp05), at ≤5% significant level.

Compared to those in reference Grp1, the adjusted hazard of dying is highest for children in Grp05 (AHR: +49%, p<.001), followed by those in Grp03 (AHR: +35%, p<.001), and Grp04 (AHR: +26%, p<.001). Further more, there were reductions in risk slopes for children in Grp03, Grp04 and Grp05, but an insignificant increase for those in Grp02 in the adjusted model with significant variations in hazard ratios of mortality by levels of control variables (child’s sex and birth size; mother’s union status, level of education and wealth status; and place of residence, region of residence and year of the survey). These suggest that the inclusion of the covariates had substantial effects on the hazard of dying before age five and that the adjusted model fit the outcome better than the undajusted model. In other words, the selected correlates offered some additional explanations in understanding under-five mortality circumstance in Nigeria. These findings were established by estimated model comparison parameters (AIC: Unadjusted, 30541 vs Adjusted, 30175; BIC: Unadjusted, 30566 vs Adjusted, 30242).

0.5 Acknowledgements

This exercise was inspired by Dr. Corey Sparks’ lab session on “DEM 7223: Event History Analysis” Fall 2020 coursework, and his compendium of easy-to-follow instructional materials accessible at https://rpubs.com/corey_sparks.
The Demographic and Health Survey data can be downloaded with permission from the DHS Program at https://www.dhsprogram.com/data/.
R package https://www.R-project.org/ was used for data wrangling and analysis.

Created on September 28, 2020