library(tidyverse)
library(haven)
library(survival)
library(car)
library(survey)
library(muhaz)
library(eha)
library(ggsurvfit)
library(dplyr)HomeWork 3 Parametric Medols, DEM 7223 - Event History Analysis
load required libraries
load the data
tad<-read_dta("GHKR72DT/GHKR72FL.DTA")
tad<-zap_label(tad)In the homework 3 assignment, I’ll construct various parametric models to estimate the effects of which covariates hypothesized will affect the outcome variable: the event of a child dying before age 5. This data is derived from the Ghana DHS 2014.
In this homework, I hypothesize that; - children from low SES households are more likely to die within the first 12 months of birth than children from high SES households. - the risk for a child dying before age 5 goes down for a highly educated mother compared to a child with a non-educated mother.
tub<-tad %>%
mutate(caseid=caseid,
death.age=b7,
intv_date=v008,
birth_date=b3,
child_alive=b5,
highses=v190,
educ=v106,
partneredu=v701,
rural=v025,
agec=cut(v012, breaks = seq(15,50,5), include.lowest=T),
weight=v005/1000000,
psu=v021,
strata=v022) %>%
select(caseid,death.age,intv_date,birth_date,child_alive,highses,educ,rural, agec,partneredu) %>%
mutate(highses =car::Recode(highses, recodes ="1:3 =0; 4:5 =1; else =NA")) tub$death.age<-ifelse(tub$child_alive==1,
((((tub$intv_date)))- (((tub$birth_date))))
,tub$death.age)
# censoring indicator for death by age 1, in months (12 months)
tub$d.event<-ifelse(is.na(tub$death.age) ==T| tub$death.age > 12,0,1)
tub$d.eventfac<-factor(tub$d.event)
levels(tub$d.eventfac)<-c("Alive at 1", "Death by 1")fit<-survfit2(Surv(death.age, d.event) ~ 1, data=tub)
fit %>%
ggsurvfit() +
labs(title = "Survival Function for Differences in SES, Ghana DHS 2014",
y = "s(t)",
x = "Months")AFT or PH? Model
fit.haz.km<-kphaz.fit(tub$death.age, tub$d.event,
method="product-limit")
fit.haz.sm<-muhaz(tub$death.age, tub$d.event,min.time = 0,max.time = 21)
kphaz.plot(fit.haz.km,main="Hazard Plot of Dying before Age 5")
lines(fit.haz.sm,col=2,lwd=3)
legend("topright",legend=c("KM Hazard","Smoothed Hazard"),
col=c(1,2),
lty=c(1,1))Did you choose an AFT or PH model and why?
I used the proportional hazard model as the PH model is more appropriate because it measures the hazard of an event outcome. Herein, the event of a child dying before their 5th birth date. AFT, on the other hand, is meant to measure the covariates’ effect on the time intervals between events.
Model fits
I will test some models (exponential, Weibull, lognormal, and piecewise constant) to see if they fit our empirical data.
Adding Covariates and Fitting the model
#Creating covariates
tub$educ.high<-ifelse(tub$educ %in% c(2,3), 1, 0)
tub$partnerhiedu<-ifelse(tub$partneredu<3,0,
ifelse(tub$partneredu%in%c(8,9),NA,1 ))Exponential Model
day <- 1/365
fit.exp<-phreg(Surv(death.age+day, d.event)~highses + rural + educ.high,
data = tub,
dist = "weibull",
shape = 1)
summary(fit.exp)Covariate Mean Coef Rel.Risk S.E. LR p
highses 0.271 -0.055 0.947 0.074 0.4632
rural 1.606 -0.032 0.969 0.064 0.6211
educ.high 0.440 0.075 1.078 0.056 0.1808
Events 1550
Total time at risk 161462
Max. log. likelihood -8750.3
LR test statistic 2.11
Degrees of freedom 3
Overall p-value 0.549739
plot(fit.exp,fn="haz",ylim =c(0,0.2),main = "Exponential")
lines(fit.haz.sm,col=2)Weibull model
Now, we will test a Weibull model and compare it to the empirical data. Here are the main effects for Weibull.
fit.1<-phreg(Surv(death.age+day, d.event)~ highses + rural + educ.high,
data=tub,
dist="weibull")
summary(fit.1)Covariate Mean Coef Rel.Risk S.E. LR p
highses 0.271 -0.051 0.950 0.074 0.4907
rural 1.606 -0.024 0.977 0.064 0.7128
educ.high 0.440 0.057 1.058 0.056 0.3123
Events 1550
Total time at risk 161462
Max. log. likelihood -7531.7
LR test statistic 1.23
Degrees of freedom 3
Overall p-value 0.745509
plot(fit.1,fn="haz",ylim =c(0,0.2))
lines(fit.haz.sm,col=2)Lognormal
Here I will attempt a lognormal model and compare it to the empirical data. Here are the main effects for a Lognormal model.
fit.2<-phreg(Surv(death.age + day, d.event)~highses + rural + educ.high,
data = tub,
dist = "lognormal")
summary(fit.2)Covariate Mean Coef Rel.Risk S.E. LR p
highses 0.271 7.249 1407.142 3.847 0.4917
rural 1.606 -0.051 0.950 0.074 0.7160
educ.high 0.440 -0.023 0.977 0.064 0.3176
Events 1550
Total time at risk 161462
Max. log. likelihood -7529.1
LR test statistic 1.24
Degrees of freedom 3
Overall p-value 0.744217
plot(fit.2,fn="haz",ylim =c(0,0.2))
lines(fit.haz.sm,col=2)Piecewise constant exponent
Here, I’ll create a piecewise constant exponent model and compare it to our empirical data. The main effects for this model are as follows.
fit.3<-phreg(Surv(death.age+day, d.event)~highses + rural + educ.high,
data = tub,
cuts=seq(1, 300, 12))
summary(fit.3)Covariate Mean Coef Rel.Risk S.E. LR p
highses 0.271 -0.051 0.950 0.074 0.4907
rural 1.606 -0.024 0.977 0.064 0.7128
educ.high 0.440 0.057 1.058 0.056 0.3123
Events 1550
Total time at risk 161462
Max. log. likelihood -7531.7
LR test statistic 1.23
Degrees of freedom 3
Overall p-value 0.745509
plot(fit.3,fn="haz",ylim =c(0,0.2))
lines(fit.haz.sm,col=2)Model Fit and Diagnostics - AIC
AIC(fit.exp)[1] 17508.54
AIC(fit.1)[1] 15073.43
AIC(fit.2)[1] 15070.24
AIC(fit.3)[1] 15073.43
Clearly, the lognormal model has a lower AIC score (15070.24) than the other models, suggesting a better overall fit with the empirical data.
#Testing for Interaction between at least two predictors We can test interaction with the variables that i have selected. In this test, I’ll be creating an interaction term between mother’s education and highses using the log normal model.
fit.2.interact<-phreg(Surv(death.age + day, d.event)~highses + rural + educ.high + (highses*educ.high),
data = tub,
dist = "lognormal")
summary(fit.2.interact)Single term deletions
Model:
Surv(death.age + day, d.event) ~ highses + rural + educ.high +
(highses * educ.high)
Df AIC LRT Pr(>Chi)
<none> 15071
rural 1 15069 0.105 0.75
highses:educ.high 1 15070 0.929 0.34
Covariate Mean Coef Rel.Risk S.E. Wald p
highses 0.271 7.236 1388.203 3.848 0.0600
rural 1.606 0.038 1.039 0.117 0.7464
educ.high 0.440 -0.021 0.979 0.064 0.7458
highses:educ.high
0.085 1.088 0.063 0.1791
Events 1550
Total time at risk 161462
Max. log. likelihood -7528.7
LR test statistic 2.17
Degrees of freedom 4
Overall p-value 0.705369
plot(fit.2.interact)fit.2.interact<-phreg(Surv(death.age + day, d.event)~highses + rural + educ.high + (highses*partnerhiedu),
data = tub,
dist = "lognormal")
summary(fit.2.interact)Single term deletions
Model:
Surv(death.age + day, d.event) ~ highses + rural + educ.high +
(highses * partnerhiedu)
Df AIC LRT Pr(>Chi)
<none> 13616
rural 1 13615 0.58 0.444
educ.high 1 13614 0.00 0.985
highses:partnerhiedu 1 13619 5.00 0.025 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Covariate Mean Coef Rel.Risk S.E. Wald p
highses 0.269 7.685 2174.715 4.467 0.0854
rural 1.614 0.024 1.024 0.085 0.7815
educ.high 0.421 -0.053 0.949 0.069 0.4433
partnerhiedu 0.086 -0.001 0.999 0.061 0.9848
highses:partnerhiedu
0.437 1.548 0.152 0.0040
Events 1393
Total time at risk 149912
Max. log. likelihood -6800
LR test statistic 8.81
Degrees of freedom 5
Overall p-value 0.116788
plot(fit.2.interact)From this model, we see that the interaction between highses and educ.high and highses and partnerhiedu are not significant. Rural residence remains the only significant independent variable explaining variance in under 5 mortality.
Conclusion
According to the AIC log normal is the best for my data. The graphical check doesn’t affirm this, though. The interaction between highses and educ.high and highses and partnerhiedu are not significant. Rural residence remains the only significant independent variable explaining the variance of the distribution. This means that location (rural vs. urban) impacts the hazard ratio of a child surviving to age 5.