The objective of this exercise is to fit the most appropriate parametric proportional hazards model for under-five mortality in Nigeria. This exercise builds on the previous ones on procedures for estimating Time-to-Event Functions.
The pooled birth-recode data sets of 2003, 2008, 2013 and 2018 Nigeria Demographic and Health Surveys (NDHS) were employed:
hwdata <- read.dta("C:/A/02/01F20/DEM223 Event History Analysis/hwdata.dta")
hwdata <- hwdata %>% select(strata, psu, iwt, d_age, status_60, status_60b, status_12,
status_12b, bcat01, sex, bsiz, unionstat, education, wealth, place, region2,
dhsyearf) %>% filter(complete.cases(hwdata))
hwdata$bcat01 <- factor(hwdata$bcat01, levels = c("Group 01", "Group 02", "Group 03",
"Group 04", "Group 05"), labels = c("Grp01", "Grp02", "Grp03", "Grp04", "Grp05"))
hwdata$caseid <- c(1:nrow(hwdata))The event variable is whether the last live birth had by a woman in the 59 months preceding the survey is alive (uncensored) or has died (censored). Hence, the observed time-to-event variable is the duration of survival between birth and death within the 5 years in retrospect; while the censoring indicator is the occurrence of under-five death.
The main grouping variable is birth category. This is operationalized as child’s demographic risk attribute at birth (DRAB); a composite variable derived from maternal age at birth, birth order and birth interval characteristics. The DRAB was measured and categorized as follow:
The resultant composite was coded and compared as Grp01 (0) vs Grp02 (1), Grp03 (2), Grp04 (3), and Grp05 (4).
** It is worthy to note that the DRAB categorization as presented here is for illustration purpose and subject to further refinement. Meanwhile, the categories was so constructed as the my interest is to investigate the effect of stage of childbearing age, especially the ages at the end of childbearing age spectrum, on early-life mortality experience. This is mainly conceived to capture and measure the impact of the mostly avoidable demographic risk associated with extreme maternal age-related fertility behaviors and compare the risk with those characterized by middle maternal age. Also, there were too few cases in birth order 1, and birth interval outside 24-59 range for early-age and advanced-age mothers.
For the purpose of analytical intuition, I controlled for a list of selected correlates which I hypothesized to influence the outcome event, on one hand, and the relationship between main grouping variable and the outcome, on the other hand. The variables are listed below:
The proportional hazards (PH) model is employed since the outcome of interest is a measure of under-five mortality risk. Meanwhile, to determine the most appropriate parametric model for the analysis, it is important to establish the “best-fit” theoretical parametric distribution of the hazard by comparing the hazard distributions from empirical non-parametric model(s) to those from theoretical parametric models. To aid graphical illustrations and comparisons of the underlying pattern of under-five mortality risks, I estimated empirical models using kernel-density method and Cox’s PH approach to assess the non-parametric hazard function and cumulative hazard function, respectively, vis-a-vis the counterpart functions from theoretical models based on exponential, weibull, log-normal, log-logistic and piecewise constant exponential distributions.
The shapes of the distribution of each empirical and theoretical function was graphically compared to determine the best parametric model fit. In practice, the shape of the hazard function of the best-fit parametric model should closely approximate that of the empirical model if the parametric model fits the data to the best possible extent. In addition, the Akaike Information Criterion (AIC) parameters of the comparative models were compared to statistically substantiate the suggestive evidence of best-fit observed from the graphical representation of the hazard distributions. Lastly, the best-fit parametric model was used to examine the under-five mortality risk differential across DRAB categories accounting for the influence of key correlates specific to the child (child’s sex and birth size), mother (union status, level of education and wealth status), and context (place and region of residence and year of the survey). [Note: The time of event was shifted above 0 (d_age>0), since parametric functions do not work with time 0.]
Following from the foregoing, two main hypotheses were advanced. First, it is conjectured that the patterns of each parametric model will approximate the empirical model in a significantly similar version. Second, it is hypothesized that the hazard of dying before age 5 would be similar for all categories of children, and the influence of other covariates on risk pattern will be significantly negligible.
First, I examined the structure of the variables and distributions of observations thereof:
Rows: 61,213
Columns: 18
$ strata <chr> "2003-north west - urban", "2003-north west - urban", "2...
$ psu <chr> "2003-183", "2003-183", "2003-183", "2003-183", "2003-18...
$ iwt <int> 474592, 474592, 474592, 474592, 474592, 474592, 474592, ...
$ d_age <dbl> 14, 17, 19, 49, 27, 26, 1, 27, 8, 14, 32, 12, 10, 3, 46,...
$ status_60 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
$ status_60b <fct> Alive, Alive, Alive, Alive, Alive, Alive, Alive, Alive, ...
$ status_12 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,...
$ status_12b <fct> Alive, Alive, Alive, Alive, Alive, Alive, Alive, Alive, ...
$ bcat01 <fct> Grp02, Grp02, Grp01, Grp05, Grp04, Grp04, Grp03, Grp05, ...
$ sex <fct> Male, Male, Female, Male, Female, Female, Male, Male, Fe...
$ bsiz <fct> Large, Large, Large, Large, Small, Large, Large, Small, ...
$ unionstat <fct> InUnion, InUnion, InUnion, NotInUnion, NotInUnion, NotIn...
$ education <fct> None, None, CompSec/High, None, None, Primary, None, Non...
$ wealth <fct> High, High, High, High, High, High, Low, Low, High, High...
$ place <fct> Urban, Urban, Urban, Urban, Urban, Urban, Urban, Urban, ...
$ region2 <fct> North, North, North, North, North, North, North, North, ...
$ dhsyearf <fct> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 20...
$ caseid <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1...
Alive Dead
57516 3697
Grp01 Grp02 Grp03 Grp04 Grp05
16317 6056 6914 19554 12372
Female Male
30030 31183
Large Small
52108 9105
InUnion NotInUnion
57511 3702
None Primary IncompSec CompSec/High
28164 11989 7180 13880
Low High
40762 20451
Rural Urban
41371 19842
North South
41150 20063
2018 2013 2008 2003
21350 19302 16946 3615
Also, I examined the characteristics of a few dead children in the data set:
# First ten censored observations
head(hwdata[hwdata$status_60 == 1, c("bcat01", "d_age", "sex", "bsiz", "unionstat",
"education", "wealth", "place", "region2")], n = 10)# Last ten censored observations
tail(hwdata[hwdata$status_60 == 1, c("bcat01", "d_age", "sex", "bsiz", "unionstat",
"education", "wealth", "place", "region2")], n = 10)Product-limit unsmoothed hazard estimation:
fit.haz.km <- kphaz.fit(time = hwdata$d_age[hwdata$d_age > 0], status = hwdata$status_60[hwdata$d_age >
0], method = "product-limit")Kernel-density smoothed hazard estimation:
fit.haz.sm <- muhaz(hwdata$d_age[hwdata$d_age > 0], hwdata$status_60[hwdata$d_age >
0], min.time = 1, max.time = 60)Cox’s PH cumulative hazard estimation:
fit.cumhaz <- coxreg(Surv(d_age, status_60) ~ bcat01 + sex + education + place, data = hwdata[hwdata$d_age >
0, ])The underlying life table distribution of hazards:
km0 <- data.frame(fit.haz.km)
km1 <- data.frame(km0[km0$time %in% c(1.5:10.5), c("time", "haz")])
km2 <- data.frame(km0[km0$time %in% c(11.5:20.5), c("time", "haz")])
km3 <- data.frame(km0[km0$time > 20, c("time", "haz")])
km <- cbind(km1, km2, km3)
kmPlot of unsmoothed vs smoothed empirical hazard functions:
kphaz.plot(fit.haz.km, main = "Hazards of dying in infancy, NDHS 2003-2018")
lines(fit.haz.sm, col = 2, lwd = 1.5)
legend("topleft", legend = c("Unsmoothed", "Smoothed"), col = c(1, 2), lty = c(1,
1))Exponential hazard estimation:
fit.1 <- phreg(Surv(d_age, status_60) ~ bcat01 + sex + education + place, data = hwdata[hwdata$d_age >
0, ], dist = "weibull", shape = 1)Plots of underlying exponential distribution functions:
Diagnostic plot of exponential hazard vs empirical “smoothed” HF:
plot(fit.1, fn = "haz", main = "Parametric exponential vs Non-parametric cox HF",
xlim = c(0, 60), ylim = c(0, 0.008))
lines(fit.haz.sm, col = 2, xlim = c(0, 60), ylim = c(0, 0.008))Diagnostic plot of exponential hazard vs empirical CHF:
Weibull hazard estimation:
fit.2 <- phreg(Surv(d_age, status_60) ~ bcat01 + sex + education + place, data = hwdata[hwdata$d_age >
0, ], dist = "weibull")Plots of underlying weibull distribution functions.
Diagnostic plot of weibull hazard vs empirical “smoothed” HF:
plot(fit.2, fn = "haz", main = "Parametric weibull vs Non-parametric cox HF", xlim = c(0,
60), ylim = c(0, 0.008))
lines(fit.haz.sm, col = 2, xlim = c(0, 60), ylim = c(0, 0.008))Diagnostic plot of weibull hazard vs empirical CHF:
Log-normal hazard estimation:
fit.3 <- phreg(Surv(d_age, status_60) ~ bcat01 + sex + education + place, data = hwdata[hwdata$d_age >
0, ], dist = "lognormal", center = T)Plots of underlying log-normal distribution functions.
Diagnostic plot of log-normal hazard vs empirical “smoothed” HF:
plot(fit.3, fn = "haz", main = "Parametric log-normal vs Non-parametric cox HF",
xlim = c(0, 60), ylim = c(0, 0.12))
lines(fit.haz.sm, col = 2, xlim = c(0, 60), ylim = c(0, 0.12))Diagnostic plot of log-normal hazard vs empirical CHF.
Log-logistic hazard estimation:
fit.4 <- phreg(Surv(d_age, status_60) ~ bcat01 + sex + education + place, data = hwdata[hwdata$d_age >
0, ], dist = "loglogistic", center = T)Plots of underlying Log-logistic distribution functions:
Diagnostic plot of Log-logistic hazard vs empirical “smoothed” HF:
plot(fit.4, fn = "haz", main = "Parametric log-logistic vs Non-parametric cox HF",
xlim = c(0, 60), ylim = c(0, 0.09))
lines(fit.haz.sm, col = 2, xlim = c(0, 60), ylim = c(0, 0.008))Diagnostic plot of Log-logistic hazard vs empirical CHF:
Piecewise Constant Exponential (PCE) hazard estimation:
fit.5 <- phreg(Surv(d_age, status_60) ~ bcat01 + sex + education + place, data = hwdata[hwdata$d_age >
0, ], dist = "pch", cuts = c(3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39,
42, 45, 48, 51, 54))Plots of underlying PCE distribution functions:
Diagnostic plot of PCE vs empirical “smoothed” HF:
plot(fit.5, fn = "haz", main = "Parametric PCE vs Non-parametric cox HF", xlim = c(0,
60), ylim = c(0, 0.008))
lines(fit.haz.sm, col = 2, xlim = c(0, 60), ylim = c(0, 0.008))Diagnostic plot of PCE hazard vs empirical CHF:
Comparison of parametric models’ AICs:
coxhaAIC <- AIC(fit.cumhaz)
exponAIC <- -2 * fit.1$loglik[2] + 2 * length(fit.1$coefficients)
weibuAIC <- -2 * fit.2$loglik[2] + 2 * length(fit.2$coefficients)
lognoAIC <- -2 * fit.3$loglik[2] + 2 * length(fit.3$coefficients)
logloAIC <- -2 * fit.4$loglik[2] + 2 * length(fit.4$coefficients)
pieceAIC <- -2 * fit.5$loglik[2] + 2 * length(fit.5$coefficients)
bindedAIC <- cbind(coxhaAIC, exponAIC, weibuAIC, lognoAIC, logloAIC, pieceAIC)
bindedAIC coxhaAIC exponAIC weibuAIC lognoAIC logloAIC pieceAIC
[1,] 45316 31464 31445 31210 31207 30286
Unadjusted hazard ratio by birth category (DRAB):
fit.a <- phreg(Surv(d_age, status_60) ~ bcat01, data = hwdata[hwdata$d_age > 0, ],
dist = "pch", cuts = c(3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45,
48, 51, 54))
summary(fit.a)Call:
phreg(formula = Surv(d_age, status_60) ~ bcat01, data = hwdata[hwdata$d_age >
0, ], dist = "pch", cuts = c(3, 6, 9, 12, 15, 18, 21, 24,
27, 30, 33, 36, 39, 42, 45, 48, 51, 54))
Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
bcat01
Grp01 0.257 0 1 (reference)
Grp02 0.088 -0.028 0.972 0.096 0.769
Grp03 0.107 0.480 1.616 0.075 0.000
Grp04 0.321 0.326 1.385 0.061 0.000
Grp05 0.227 0.459 1.582 0.064 0.000
Events 2166
Total time at risk 1230065
Max. log. likelihood -15266
LR test statistic 81.51
Degrees of freedom 4
Overall p-value 1.11022e-16
Adjusted hazard by birth category (DRAB) controlling for characteristics of the child, mother and context:
fit.b <- phreg(Surv(d_age, status_60) ~ bcat01 + sex + bsiz + unionstat + education +
wealth + place + region2 + dhsyearf, data = hwdata[hwdata$d_age > 0, ], dist = "pch",
cuts = c(3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54))
summary(fit.b)Call:
phreg(formula = Surv(d_age, status_60) ~ bcat01 + sex + bsiz +
unionstat + education + wealth + place + region2 + dhsyearf,
data = hwdata[hwdata$d_age > 0, ], dist = "pch", cuts = c(3,
6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45,
48, 51, 54))
Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
bcat01
Grp01 0.257 0 1 (reference)
Grp02 0.088 0.160 1.173 0.098 0.104
Grp03 0.107 0.261 1.298 0.076 0.001
Grp04 0.321 0.232 1.261 0.061 0.000
Grp05 0.227 0.400 1.492 0.065 0.000
sex
Female 0.493 0 1 (reference)
Male 0.507 0.096 1.101 0.043 0.026
bsiz
Large 0.864 0 1 (reference)
Small 0.136 0.169 1.184 0.057 0.003
unionstat
InUnion 0.926 0 1 (reference)
NotInUnion 0.074 0.444 1.559 0.079 0.000
education
None 0.447 0 1 (reference)
Primary 0.201 -0.091 0.913 0.059 0.124
IncompSec 0.115 -0.220 0.802 0.084 0.008
CompSec/High 0.237 -0.573 0.564 0.087 0.000
wealth
Low 0.649 0 1 (reference)
High 0.351 -0.293 0.746 0.068 0.000
place
Rural 0.661 0 1 (reference)
Urban 0.339 -0.179 0.836 0.060 0.003
region2
North 0.655 0 1 (reference)
South 0.345 -0.256 0.774 0.061 0.000
dhsyearf
2018 0.367 0 1 (reference)
2013 0.313 -0.122 0.885 0.056 0.029
2008 0.265 0.152 1.165 0.054 0.005
2003 0.055 0.456 1.577 0.082 0.000
Events 2166
Total time at risk 1230065
Max. log. likelihood -15068
LR test statistic 477.58
Degrees of freedom 16
Overall p-value 0
Comparison of refitted model’s AICs:
fit.aAIC <- -2 * fit.a$loglik[2] + 2 * length(fit.a$coefficients)
fit.bAIC <- -2 * fit.b$loglik[2] + 2 * length(fit.b$coefficients)
pceAIC <- cbind(fit.aAIC, fit.bAIC)
pceAIC fit.aAIC fit.bAIC
[1,] 30541 30169
Comparison of refitted model’s BICs:
obsN <- as.data.frame(freq(hwdata$status_60))
fit.aBIC <- 3 * log(dim(hwdata)[1]) - 2 * (fit.a$loglik[2])
fit.bBIC <- 9 * log(dim(hwdata)[1]) - 2 * (fit.b$loglik[2])
pceBIC <- cbind(fit.aBIC, fit.bBIC)
pceBIC fit.aBIC fit.bBIC
[1,] 30566 30236
The graphical and statistical comparisons of distributions of non-parametric and parametric hazard functions revealed the importance of model diagnostic, and of specifying the best-fit distribution function in ensuring the propriety of parametric model strategy. The results established the PCE parameterization as the best-fit modeling strategy for analysing the data. The PCE function’s lowest AIC value further evidenced its appropriateness over other parametric functions: PCE (30308) vs. exponential (31487), weibull (31468), log-normal (31233), and log-logistic (31229). This outcome deviates from the hypothesized distributions that there will be no significant distinction in the extent to which each parametric model approximates the empirical model. Likewise, the assumption of significantly negligible between-Grpdifference in hazard of under-five mortality by birth category, especially, regardless of the influence of other correlates did not hold. The results from the unadjusted hazard ratio (UHR) and adjusted hazard ratio (AHR) of PCE models revealed significant disparities in under-five mortality risk by birth category (Grp01 vs. Grp02, Grp03, Grp04, and Grp05), at ≤5% significant level.
Compared to those in reference Grp1, the adjusted hazard of dying is highest for children in Grp05 (AHR: +49%, p<.001), followed by those in Grp03 (AHR: +35%, p<.001), and Grp04 (AHR: +26%, p<.001). Further more, there were reductions in risk slopes for children in Grp03, Grp04 and Grp05, but an insignificant increase for those in Grp02 in the adjusted model with significant variations in hazard ratios of mortality by levels of control variables (child’s sex and birth size; mother’s union status, level of education and wealth status; and place of residence, region of residence and year of the survey). These suggest that the inclusion of the covariates had substantial effects on the hazard of dying before age five and that the adjusted model fit the outcome better than the undajusted model. In other words, the selected correlates offered some additional explanations in understanding under-five mortality circumstance in Nigeria. These findings were established by estimated model comparison parameters (AIC: Unadjusted, 30541 vs Adjusted, 30175; BIC: Unadjusted, 30566 vs Adjusted, 30242).
This exercise was inspired by Dr. Corey Sparks’ lab session on “DEM 7223: Event History Analysis” Fall 2020 coursework, and his compendium of easy-to-follow instructional materials accessible at https://rpubs.com/corey_sparks.
The Demographic and Health Survey data can be downloaded with permission from the DHS Program at https://www.dhsprogram.com/data/.
R package https://www.R-project.org/ was used for data wrangling and analysis.
Created on September 28, 2020