HW3_Fitting SA Models
Instructions for HW 3
Define your outcome as in HW 1. Also consider what covariates are hypothesized to affect the outcome variable. Define these and construct a parametric model for your outcome.
Fit the parametric model of your choosing to the data.
Did you choose an AFT or PH model and why?
Justify what parametric distribution you choose
Carry out model fit diagnostics for the model
Include all main effects in the model
Test for an interaction between at least two of the predictors
Interpret your results and write them up
Provide tabular and graphical output to support your conclusions
First, load the libraries and prep the Data
I will be using the following data set: Brazil: Standard DHS, 1996
Now I am bringing in the individual recode data file for Brazil and joining the two data sets using CS code.
Show the code
#library(haven)
<- read_dta("C:/Users/codar/OneDrive - University of Texas at San Antonio/R 2022/BRIR31FL.DTA")
idat
$hh <- substr(idat$caseid, 1,12)
idat<- left_join(idat, bdat2, by=c("hh" ="hhid")) #you used bdat, not bdat2
mdat
# names(mdat)
tail(names(mdat))
[1] "s452ah_6" "s452ax_6" "s452ay_6" "s453a_6" "hh" "mgenhh"
Filtering for a few of the variables for the dates, and some characteristics of the women, and details of the survey.
Show the code
<- mdat %>%
mdat2 filter(bidx_01==1&b0_01==0)%>%
transmute(CASEID=caseid,
int.cmc=v008,
fbir.cmc=b3_01,
sbir.cmc=b3_02,
marr.cmc=v509,
rural=v025,
educ=v106,
age = v012,
agec=cut(v012, breaks = seq(15,50,5), include.lowest=T),
partneredu=v701,
partnerage=v730,
weight=v005/1000000,
psu=v021,
strata=v022, #by adding them into the transmute, I was able to get them to show up in the mdat2 data frame...why?
mgenhh = mgenhh,
hh = hh) %>%
select(CASEID, int.cmc, fbir.cmc, sbir.cmc, marr.cmc, rural, educ, age, agec, partneredu, partnerage, weight, psu, strata, mgenhh, hh, rural)%>%
mutate(agefb = (age - (int.cmc - fbir.cmc)/12))
#Calculating the birth intervals, observed and censored and the event indicator (did women have a second birth?)
<-mdat2%>%
mdat3mutate(secbi = ifelse(is.na(sbir.cmc)==T,
- fbir.cmc,
int.cmc - sbir.cmc),
fbir.cmc b2event = ifelse(is.na(sbir.cmc)==T,0,1))
library(ggsurvfit)
<-survfit2(Surv(secbi, b2event)~1, mdat3)
fit
%>%
fit ggsurvfit()+
labs(title = "Survival Function for Second Birth Interval,\nBrazil DHS 1996",
y = "S(t)",
x= "Months")
Estimate the empirical hazard function
Show the code
#since these functions don't work with durations of 0, we add a very small amount to the intervals
<-kphaz.fit(mdat3$secbi,
fit.haz.km$b2event ,
mdat3method = "product-limit")
#this is a version of the hazard that is smoothed using a kernel-density method
<-muhaz(mdat3$secbi, mdat3$b2event )
fit.haz.sm
#Empirical hazard function (product-limit estimate) plot
kphaz.plot(fit.haz.km,
main="Plot of the hazard of having a second birth")
#overlay the smoothed version
lines(fit.haz.sm, col=2, lwd=3)
legend("topleft",
legend = c("KM hazard", "Smoothed Hazard"),
col=c(1,2),
lty=c(1,1))
Adding Covariates and Fitting the model
Defining Covariates
The covariates I will look at include multgen (weather it is a multigenerational household) and rural locations.
First, I have to clean up the covariates I will use. This means recoding.
Show the code
#Creating covariates
$educ.high<-ifelse(mdat3$educ %in% c(2,3), 1, 0)
mdat3# mdat3$partnerhiedu<-ifelse(mdat3$partneredu<3,0,
# ifelse(mdat3$partneredu%in%c(8,9),NA,1 ))
#Recoding mgenhh from char to numeric
<- c("notmgen" = 0, "mgenhh" = 1)
lookup $new_mghh <- lookup[mdat3$mgenhh] mdat3
For my own reference, here is the code for rural/urban
Show the code
#recoding rural to 0/1 binary
$rural2 <- ifelse(mdat3$rural==2, 1, 0)
mdat3
#survey design
options(survey.lonely.psu = "adjust")
<-svydesign(ids=~psu, strata=~strata,
desdata=mdat3, weight=~weight)
# rep.des<-as.svrepdesign(des, type="bootstrap" )
<- 1/365
day
# mdat3 <- mdat3 %>%
# filter( is.na(partnerhiedu) == F)
1. AFT Model With Exponential Distribution
Show the code
1.aft<-survreg(Surv(secbi+day, b2event)~educ.high+rural2+new_mghh,
fit.data=mdat3,
dist = "exponential" )
summary(fit.1.aft)
Call:
survreg(formula = Surv(secbi + day, b2event) ~ educ.high + rural2 +
new_mghh, data = mdat3, dist = "exponential")
Value Std. Error z p
(Intercept) 4.0187 0.0225 178.98 < 2e-16
educ.high 0.2212 0.0263 8.41 < 2e-16
rural2 -0.1728 0.0319 -5.41 6.2e-08
new_mghh 0.1131 0.0262 4.31 1.6e-05
Scale fixed at 1
Exponential distribution
Loglik(model)= -32629.8 Loglik(intercept only)= -32710.2
Chisq= 160.74 on 3 degrees of freedom, p= 1.3e-34
Number of Newton-Raphson Iterations: 4
n= 8299
2. PH With Weibull Distribution
Show the code
.2<-phreg(Surv(secbi+day, b2event)~educ.high+rural2+new_mghh,
fitdata=mdat3,
dist="weibull")
summary(fit.2)
Covariate Mean Coef Rel.Risk S.E. LR p
educ.high 0.536 -0.227 0.797 0.026 0.0000
rural2 0.175 0.226 1.254 0.032 0.0000
new_mghh 0.386 -0.128 0.880 0.026 0.0000
Events 6366
Total time at risk 399109
Max. log. likelihood -32293
LR test statistic 200.62
Degrees of freedom 3
Overall p-value 0
Plot
Show the code
# plot(fit.2, fn = "haz")
# lines(fit.haz.sm, col=2)
3. PH with Log Normal Distribution - A No Go!
Show the code
.3<-phreg(Surv(secbi+day, b2event)~rural2+new_mghh,
fitdata=mdat3,
dist="lognormal")
fail in [dsytrf]; info = 4
Show the code
summary(fit.3)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1 1 1 1 1
4. PH with Log-Logistic Distribution
Show the code
#log-normal distribution for hazard
.4<-phreg(Surv(secbi+day, b2event)~educ.high+rural2+new_mghh,
fitdata=mdat3,
dist="loglogistic")
summary(fit.4)
Covariate Mean Coef Rel.Risk S.E. LR p
educ.high 0.536 -0.960 0.383 0.053 0.0000
rural2 0.175 -0.237 0.789 0.026 0.0000
new_mghh 0.386 0.178 1.195 0.032 0.0012
Events 6366
Total time at risk 399109
Max. log. likelihood -31267
LR test statistic 167.56
Degrees of freedom 3
Overall p-value 0
5. Piecewise Model with Exponential Distribution
Show the code
.5<-pchreg(Surv(secbi, b2event)~educ.high+rural2+new_mghh,
fitdata=mdat3,
cuts=seq(1, 300, 12))
summary(fit.5)
Covariate Mean Coef Rel.Risk S.E. LR p
educ.high 0.536 -0.237 0.789 0.026 0.0000
rural2 0.175 0.183 1.201 0.032 0.0000
new_mghh 0.386 -0.092 0.912 0.026 0.0004
Events 6366
Total time at risk 399086
Max. log. likelihood -31564
LR test statistic 172.98
Degrees of freedom 3
Overall p-value 0
Restricted mean survival: 55.41409 in (1, 289]
Plot for Piecewise
Show the code
# plot(fit.5, fn="haz")
# lines(fit.haz.sm, col=2)
Comparing the Models
To compare the models, I used AIC. Lower scores = better.
AFT Model with Exponential Distribution
Show the code
AIC(fit.1.aft)
[1] 65267.62
Proportional Hazards with Weibull Distribution
Show the code
AIC(fit.2)
[1] 64596.89
Proportional Hazards with Log-Logistic
Show the code
AIC(fit.4)
[1] 62545.1
Piecewise with Exponential
Show the code
-2*fit.5$loglik[2]+length(fit.5$coefficients) #have to construct this ourselves)
[1] 63131.18
Observations
Based on the AIC results, the proportional hazards model with the log-logistic distribution is the best one.
Graphical Checks
Empirical vs Weibull
Show the code
<-coxreg(Surv(secbi, b2event)~educ.high+rural2+new_mghh,
empdata=mdat3)
check.dist(sp=emp,pp=fit.2, main = "Empirical vs. Weibull")
Empirical vs Log-Logistic
Show the code
check.dist(sp=emp,pp=fit.4, main = "Empirical vs. Log-Logistic")
Empirical vs PCH or Piecewise
Show the code
check.dist(sp=emp,pp=fit.5, main = "Empirical vs. PCH")
Observations
Based on the graphical checks, the piecewise model with an exponential distribution would be the best match. It must be an issue with my code, but the log-logistic ph model had the lowest AIC values, even though the graphical check isn’t successful.
Testing for Interaction
Log-Rank Rest to test for interactions
Show the code
survdiff(Surv(secbi, b2event)~rural2, data=mdat3)
Call:
survdiff(formula = Surv(secbi, b2event) ~ rural2, data = mdat3)
N Observed Expected (O-E)^2/E (O-E)^2/V
rural2=0 6634 4986 5262 14.5 85.9
rural2=1 1665 1380 1104 69.3 85.9
Chisq= 85.9 on 1 degrees of freedom, p= <2e-16
Show the code
chisq.test(table(mdat3$b2event, mdat3$new_mghh))
Pearson's Chi-squared test with Yates' continuity correction
data: table(mdat3$b2event, mdat3$new_mghh)
X-squared = 136.07, df = 1, p-value < 2.2e-16
Observation
P-value is not less than .05 and therefore there are no significant interactions.
Also ran peto. No homoscedasity ?
Show the code
<- survdiff(Surv(secbi, b2event)~rural2, data=mdat3, rho = 1)
peto_peto peto_peto
Call:
survdiff(formula = Surv(secbi, b2event) ~ rural2, data = mdat3,
rho = 1)
N Observed Expected (O-E)^2/E (O-E)^2/V
rural2=0 6634 2839 3004 8.99 70.7
rural2=1 1665 836 672 40.23 70.7
Chisq= 70.7 on 1 degrees of freedom, p= <2e-16
Show the code
<- aov(secbi ~ new_mghh * rural2, data = mdat3)
anova summary(anova)
Df Sum Sq Mean Sq F value Pr(>F)
new_mghh 1 2069 2069 1.138 0.286
rural2 1 78711 78711 43.294 5e-11 ***
new_mghh:rural2 1 4474 4474 2.461 0.117
Residuals 8295 15080697 1818
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Betas
Show the code
library(dplyr)
library(gtsummary)
library(survival)
#plotting betas
%>%
mdat3 coxph(Surv(secbi, b2event) ~ rural2 * new_mghh*educ.high, data = .) %>%
tbl_regression() %>%
plot()
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
Show the code
#plotting hazard ratios
%>%
mdat3 coxph(Surv(secbi, b2event) ~ rural2 * new_mghh*educ.high, data = .) %>%
tbl_regression(exponentiate = TRUE) %>%
plot()
Conclusions
The best distribution for my data, according to AIC is the log-logistic. However, the graphical check doesn’t affirm this finding. There were no interactions between two of my predictor variables (rural2 & new_mghh). I fit my parametric model to proportional hazards with log-logistic distribution, and the results are not significant. This means that location (rural vs. urban) and weather the home is multigenerational or not does not have an impact on the hazard ratio of second birth.
Show the code
summary(fit.4)
Covariate Mean Coef Rel.Risk S.E. LR p
educ.high 0.536 -0.960 0.383 0.053 0.0000
rural2 0.175 -0.237 0.789 0.026 0.0000
new_mghh 0.386 0.178 1.195 0.032 0.0012
Events 6366
Total time at risk 399109
Max. log. likelihood -31267
LR test statistic 167.56
Degrees of freedom 3
Overall p-value 0