HomeWork 3 Parametric Medols, DEM 7223 - Event History Analysis

Author

Assogba Albert

load required libraries

library(tidyverse)
library(haven)
library(survival)
library(car)
library(survey)
library(muhaz)
library(eha)
library(ggsurvfit)
library(dplyr)

load the data

tad<-read_dta("GHKR72DT/GHKR72FL.DTA")
tad<-zap_label(tad)

In the homework 3 assignment, I’ll construct various parametric models to estimate the effects of which covariates hypothesized will affect the outcome variable: the event of a child dying before age 5. This data is derived from the Ghana DHS 2014.

In this homework, I hypothesize that; - children from low SES households are more likely to die within the first 12 months of birth than children from high SES households. - the risk for a child dying before age 5 goes down for a highly educated mother compared to a child with a non-educated mother.

tub<-tad %>% 
    mutate(caseid=caseid,
           death.age=b7,
           intv_date=v008,
           birth_date=b3,
           child_alive=b5,
           highses=v190,
           educ=v106,
           partneredu=v701,
           rural=v025,
           agec=cut(v012, breaks = seq(15,50,5), include.lowest=T),
           weight=v005/1000000,
           psu=v021,
           strata=v022) %>% 
  select(caseid,death.age,intv_date,birth_date,child_alive,highses,educ,rural, agec,partneredu) %>% 
  mutate(highses =car::Recode(highses, recodes ="1:3 =0; 4:5 =1; else =NA"))
 tub$death.age<-ifelse(tub$child_alive==1,
                      ((((tub$intv_date)))- (((tub$birth_date))))
                      ,tub$death.age)

# censoring indicator for death by age 1, in months (12 months)

tub$d.event<-ifelse(is.na(tub$death.age) ==T| tub$death.age > 12,0,1)
tub$d.eventfac<-factor(tub$d.event)
levels(tub$d.eventfac)<-c("Alive at 1", "Death by 1")
fit<-survfit2(Surv(death.age, d.event) ~ 1, data=tub)

fit %>% 
  ggsurvfit() +
  labs(title = "Survival Function for Differences in SES, Ghana DHS 2014",
       y = "s(t)",
       x = "Months")

AFT or PH? Model

fit.haz.km<-kphaz.fit(tub$death.age, tub$d.event,
                      method="product-limit")

fit.haz.sm<-muhaz(tub$death.age, tub$d.event,min.time = 0,max.time = 21)

kphaz.plot(fit.haz.km,main="Hazard Plot of Dying before Age 5")

lines(fit.haz.sm,col=2,lwd=3)
legend("topright",legend=c("KM Hazard","Smoothed Hazard"),
       col=c(1,2),
       lty=c(1,1))

Did you choose an AFT or PH model and why?

I used the proportional hazard model as the PH model is more appropriate because it measures the hazard of an event outcome. Herein, the event of a child dying before their 5th birth date. AFT, on the other hand, is meant to measure the covariates’ effect on the time intervals between events.

Model fits

I will test some models (exponential, Weibull, lognormal, and piecewise constant) to see if they fit our empirical data.

Adding Covariates and Fitting the model

#Creating covariates 

tub$educ.high<-ifelse(tub$educ %in% c(2,3), 1, 0)
tub$partnerhiedu<-ifelse(tub$partneredu<3,0,
                           ifelse(tub$partneredu%in%c(8,9),NA,1 ))

Exponential Model

day <- 1/365


fit.exp<-phreg(Surv(death.age+day, d.event)~highses + rural + educ.high,
             data = tub,
             dist = "weibull",
             shape = 1)

summary(fit.exp)
Covariate             Mean       Coef     Rel.Risk   S.E.    LR p
highses               0.271    -0.055     0.947     0.074   0.4632 
rural                 1.606    -0.032     0.969     0.064   0.6211 
educ.high             0.440     0.075     1.078     0.056   0.1808 

Events                    1550 
Total time at risk        161462 
Max. log. likelihood      -8750.3 
LR test statistic         2.11 
Degrees of freedom        3 
Overall p-value           0.549739
plot(fit.exp,fn="haz",ylim =c(0,0.2),main = "Exponential")
lines(fit.haz.sm,col=2)

Weibull model

Now, we will test a Weibull model and compare it to the empirical data. Here are the main effects for Weibull.

fit.1<-phreg(Surv(death.age+day, d.event)~ highses + rural + educ.high,
                   data=tub,
                   dist="weibull")

summary(fit.1)
Covariate             Mean       Coef     Rel.Risk   S.E.    LR p
highses               0.271    -0.051     0.950     0.074   0.4907 
rural                 1.606    -0.024     0.977     0.064   0.7128 
educ.high             0.440     0.057     1.058     0.056   0.3123 

Events                    1550 
Total time at risk        161462 
Max. log. likelihood      -7531.7 
LR test statistic         1.23 
Degrees of freedom        3 
Overall p-value           0.745509
plot(fit.1,fn="haz",ylim =c(0,0.2))
lines(fit.haz.sm,col=2)

Lognormal

Here I will attempt a lognormal model and compare it to the empirical data. Here are the main effects for a Lognormal model.

fit.2<-phreg(Surv(death.age + day, d.event)~highses + rural + educ.high,
                     data = tub,
                     dist = "lognormal")

summary(fit.2)
Covariate             Mean       Coef     Rel.Risk   S.E.    LR p
highses               0.271     7.249  1407.142     3.847   0.4917 
rural                 1.606    -0.051     0.950     0.074   0.7160 
educ.high             0.440    -0.023     0.977     0.064   0.3176 

Events                    1550 
Total time at risk        161462 
Max. log. likelihood      -7529.1 
LR test statistic         1.24 
Degrees of freedom        3 
Overall p-value           0.744217
plot(fit.2,fn="haz",ylim =c(0,0.2))
lines(fit.haz.sm,col=2)

Piecewise constant exponent

Here, I’ll create a piecewise constant exponent model and compare it to our empirical data. The main effects for this model are as follows.

fit.3<-phreg(Surv(death.age+day, d.event)~highses + rural + educ.high,
                     data = tub,
                     cuts=seq(1, 300, 12))

summary(fit.3)
Covariate             Mean       Coef     Rel.Risk   S.E.    LR p
highses               0.271    -0.051     0.950     0.074   0.4907 
rural                 1.606    -0.024     0.977     0.064   0.7128 
educ.high             0.440     0.057     1.058     0.056   0.3123 

Events                    1550 
Total time at risk        161462 
Max. log. likelihood      -7531.7 
LR test statistic         1.23 
Degrees of freedom        3 
Overall p-value           0.745509
plot(fit.3,fn="haz",ylim =c(0,0.2))
lines(fit.haz.sm,col=2)

Model Fit and Diagnostics - AIC

AIC(fit.exp)
[1] 17508.54
AIC(fit.1)
[1] 15073.43
AIC(fit.2)
[1] 15070.24
AIC(fit.3)
[1] 15073.43

Clearly, the lognormal model has a lower AIC score (15070.24) than the other models, suggesting a better overall fit with the empirical data.

#Testing for Interaction between at least two predictors We can test interaction with the variables that i have selected. In this test, I’ll be creating an interaction term between mother’s education and highses using the log normal model.

fit.2.interact<-phreg(Surv(death.age + day, d.event)~highses + rural + educ.high + (highses*educ.high),
                     data = tub,
                     dist = "lognormal")

summary(fit.2.interact)
Single term deletions

Model:
Surv(death.age + day, d.event) ~ highses + rural + educ.high + 
    (highses * educ.high)
                  Df   AIC   LRT Pr(>Chi)
<none>               15071               
rural              1 15069 0.105     0.75
highses:educ.high  1 15070 0.929     0.34

Covariate             Mean       Coef     Rel.Risk   S.E.    Wald p
highses               0.271     7.236  1388.203     3.848   0.0600 
rural                 1.606     0.038     1.039     0.117   0.7464 
educ.high             0.440    -0.021     0.979     0.064   0.7458 
highses:educ.high 
                              0.085     1.088     0.063    0.1791 

Events                    1550 
Total time at risk        161462 
Max. log. likelihood      -7528.7 
LR test statistic         2.17 
Degrees of freedom        4 
Overall p-value           0.705369
plot(fit.2.interact)

fit.2.interact<-phreg(Surv(death.age + day, d.event)~highses + rural + educ.high + (highses*partnerhiedu),
                     data = tub,
                     dist = "lognormal")

summary(fit.2.interact)
Single term deletions

Model:
Surv(death.age + day, d.event) ~ highses + rural + educ.high + 
    (highses * partnerhiedu)
                     Df   AIC  LRT Pr(>Chi)  
<none>                  13616                
rural                 1 13615 0.58    0.444  
educ.high             1 13614 0.00    0.985  
highses:partnerhiedu  1 13619 5.00    0.025 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Covariate             Mean       Coef     Rel.Risk   S.E.    Wald p
highses               0.269     7.685  2174.715     4.467   0.0854 
rural                 1.614     0.024     1.024     0.085   0.7815 
educ.high             0.421    -0.053     0.949     0.069   0.4433 
partnerhiedu          0.086    -0.001     0.999     0.061   0.9848 
highses:partnerhiedu 
                              0.437     1.548     0.152    0.0040 

Events                    1393 
Total time at risk        149912 
Max. log. likelihood       -6800 
LR test statistic         8.81 
Degrees of freedom        5 
Overall p-value           0.116788
plot(fit.2.interact)

From this model, we see that the interaction between highses and educ.high and highses and partnerhiedu are not significant. Rural residence remains the only significant independent variable explaining variance in under 5 mortality.

Conclusion

According to the AIC log normal is the best for my data. The graphical check doesn’t affirm this, though. The interaction between highses and educ.high and highses and partnerhiedu are not significant. Rural residence remains the only significant independent variable explaining the variance of the distribution. This means that location (rural vs. urban) impacts the hazard ratio of a child surviving to age 5.