HW3_Fitting SA Models

Author

CRG

Instructions for HW 3

Define your outcome as in HW 1. Also consider what covariates are hypothesized to affect the outcome variable. Define these and construct a parametric model for your outcome.

Fit the parametric model of your choosing to the data.

Did you choose an AFT or PH model and why?
Justify what parametric distribution you choose
Carry out model fit diagnostics for the model
Include all main effects in the model
Test for an interaction between at least two of the predictors
Interpret your results and write them up
Provide tabular and graphical output to support your conclusions

First, load the libraries and prep the Data

I will be using the following data set: Brazil: Standard DHS, 1996

Now I am bringing in the individual recode data file for Brazil and joining the two data sets using CS code.

Show the code

#library(haven)
idat <- read_dta("C:/Users/codar/OneDrive - University of Texas at San Antonio/R 2022/BRIR31FL.DTA")


idat$hh <- substr(idat$caseid, 1,12)
mdat<- left_join(idat, bdat2, by=c("hh" ="hhid")) #you used bdat, not bdat2

# names(mdat)
tail(names(mdat))

[1] "s452ah_6" "s452ax_6" "s452ay_6" "s453a_6"  "hh"       "mgenhh"

Filtering for a few of the variables for the dates, and some characteristics of the women, and details of the survey.

Show the code

mdat2 <- mdat %>%
  filter(bidx_01==1&b0_01==0)%>%
  transmute(CASEID=caseid, 
                 int.cmc=v008,
                 fbir.cmc=b3_01,
                 sbir.cmc=b3_02,
                 marr.cmc=v509,
                 rural=v025,
                 educ=v106,
                 age = v012,
                 agec=cut(v012, breaks = seq(15,50,5), include.lowest=T),
                 partneredu=v701,
                 partnerage=v730,
                 weight=v005/1000000,
                 psu=v021,
                 strata=v022,   #by adding them into the transmute, I was able to get them to show up in the mdat2 data frame...why?
                mgenhh = mgenhh, 
            hh = hh) %>%
 select(CASEID, int.cmc, fbir.cmc, sbir.cmc, marr.cmc, rural, educ, age, agec, partneredu, partnerage, weight, psu, strata, mgenhh, hh, rural)%>%
  mutate(agefb = (age - (int.cmc - fbir.cmc)/12))

#Calculating the birth intervals, observed and censored and the event indicator (did women have a second birth?)

mdat3<-mdat2%>%
  mutate(secbi = ifelse(is.na(sbir.cmc)==T,
                   int.cmc - fbir.cmc, 
                   fbir.cmc - sbir.cmc),
         b2event = ifelse(is.na(sbir.cmc)==T,0,1))

library(ggsurvfit)
fit<-survfit2(Surv(secbi, b2event)~1, mdat3)

fit %>%
  ggsurvfit()+
  labs(title = "Survival Function for Second Birth Interval,\nBrazil DHS 1996",
       y = "S(t)", 
       x= "Months")

Estimate the empirical hazard function

Show the code

#since these functions don't work with durations of 0, we add a very small amount to the intervals
fit.haz.km<-kphaz.fit(mdat3$secbi,
                      mdat3$b2event , 
                      method = "product-limit")

#this is a version of the hazard that is smoothed using a kernel-density method
fit.haz.sm<-muhaz(mdat3$secbi, mdat3$b2event )

#Empirical hazard function (product-limit estimate) plot
kphaz.plot(fit.haz.km,
           main="Plot of the hazard of having a second birth")

#overlay the smoothed version
lines(fit.haz.sm, col=2, lwd=3)

legend("topleft",
       legend = c("KM hazard", "Smoothed Hazard"),
       col=c(1,2),
       lty=c(1,1))

Adding Covariates and Fitting the model

Defining Covariates

The covariates I will look at include multgen (weather it is a multigenerational household) and rural locations.

First, I have to clean up the covariates I will use. This means recoding.

Show the code

#Creating covariates 

mdat3$educ.high<-ifelse(mdat3$educ %in% c(2,3), 1, 0)
# mdat3$partnerhiedu<-ifelse(mdat3$partneredu<3,0,
#                           ifelse(mdat3$partneredu%in%c(8,9),NA,1 ))

#Recoding mgenhh from char to numeric 
lookup <- c("notmgen" = 0, "mgenhh" = 1)
mdat3$new_mghh <- lookup[mdat3$mgenhh]

For my own reference, here is the code for rural/urban

Show the code

#recoding rural to 0/1 binary

mdat3$rural2 <- ifelse(mdat3$rural==2, 1, 0)

                        
#survey design 
options(survey.lonely.psu = "adjust")
des<-svydesign(ids=~psu, strata=~strata,
               data=mdat3, weight=~weight)

# rep.des<-as.svrepdesign(des, type="bootstrap" )

day<- 1/365

# mdat3 <- mdat3 %>%
#   filter( is.na(partnerhiedu) == F)

1. AFT Model With Exponential Distribution

Show the code

fit.1.aft<-survreg(Surv(secbi+day, b2event)~educ.high+rural2+new_mghh,
                   data=mdat3,
                   dist = "exponential" )

summary(fit.1.aft)


Call:
survreg(formula = Surv(secbi + day, b2event) ~ educ.high + rural2 + 
    new_mghh, data = mdat3, dist = "exponential")
              Value Std. Error      z       p
(Intercept)  4.0187     0.0225 178.98 < 2e-16
educ.high    0.2212     0.0263   8.41 < 2e-16
rural2      -0.1728     0.0319  -5.41 6.2e-08
new_mghh     0.1131     0.0262   4.31 1.6e-05

Scale fixed at 1 

Exponential distribution
Loglik(model)= -32629.8   Loglik(intercept only)= -32710.2
    Chisq= 160.74 on 3 degrees of freedom, p= 1.3e-34 
Number of Newton-Raphson Iterations: 4 
n= 8299

2. PH With Weibull Distribution

Show the code

fit.2<-phreg(Surv(secbi+day, b2event)~educ.high+rural2+new_mghh,
             data=mdat3,
             dist="weibull")
summary(fit.2)

Covariate             Mean       Coef     Rel.Risk   S.E.    LR p
educ.high             0.536    -0.227     0.797     0.026   0.0000 
rural2                0.175     0.226     1.254     0.032   0.0000 
new_mghh              0.386    -0.128     0.880     0.026   0.0000 

Events                    6366 
Total time at risk        399109 
Max. log. likelihood      -32293 
LR test statistic         200.62 
Degrees of freedom        3 
Overall p-value           0

Plot

Show the code

# plot(fit.2, fn = "haz") 
# lines(fit.haz.sm, col=2)

3. PH with Log Normal Distribution - A No Go!

Show the code

fit.3<-phreg(Surv(secbi+day, b2event)~rural2+new_mghh,
             data=mdat3,
             dist="lognormal")

fail in [dsytrf]; info = 4

Show the code

summary(fit.3)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       1       1       1       1

4. PH with Log-Logistic Distribution

Show the code

#log-normal distribution for hazard
fit.4<-phreg(Surv(secbi+day, b2event)~educ.high+rural2+new_mghh,
             data=mdat3,
             dist="loglogistic")
summary(fit.4)

Covariate             Mean       Coef     Rel.Risk   S.E.    LR p
educ.high             0.536    -0.960     0.383     0.053   0.0000 
rural2                0.175    -0.237     0.789     0.026   0.0000 
new_mghh              0.386     0.178     1.195     0.032   0.0012 

Events                    6366 
Total time at risk        399109 
Max. log. likelihood      -31267 
LR test statistic         167.56 
Degrees of freedom        3 
Overall p-value           0

5. Piecewise Model with Exponential Distribution

Show the code

fit.5<-pchreg(Surv(secbi, b2event)~educ.high+rural2+new_mghh,
             data=mdat3,
             cuts=seq(1, 300, 12))
summary(fit.5)

Covariate             Mean       Coef     Rel.Risk   S.E.    LR p
educ.high             0.536    -0.237     0.789     0.026   0.0000 
rural2                0.175     0.183     1.201     0.032   0.0000 
new_mghh              0.386    -0.092     0.912     0.026   0.0004 

Events                    6366 
Total time at risk        399086 
Max. log. likelihood      -31564 
LR test statistic         172.98 
Degrees of freedom        3 
Overall p-value           0

Restricted mean survival:  55.41409 in (1, 289]

Plot for Piecewise

Show the code

# plot(fit.5, fn="haz")
# lines(fit.haz.sm, col=2)

Comparing the Models

To compare the models, I used AIC. Lower scores = better.

AFT Model with Exponential Distribution

Show the code

AIC(fit.1.aft)

[1] 65267.62

Proportional Hazards with Weibull Distribution

Show the code

AIC(fit.2)

[1] 64596.89

Proportional Hazards with Log-Logistic

Show the code

AIC(fit.4)

[1] 62545.1

Piecewise with Exponential

Show the code

-2*fit.5$loglik[2]+length(fit.5$coefficients) #have to construct this ourselves)

[1] 63131.18

Observations

Based on the AIC results, the proportional hazards model with the log-logistic distribution is the best one.

Graphical Checks

Empirical vs Weibull

Show the code

emp<-coxreg(Surv(secbi, b2event)~educ.high+rural2+new_mghh,
            data=mdat3)

check.dist(sp=emp,pp=fit.2, main = "Empirical vs. Weibull")

Empirical vs Log-Logistic

Show the code

check.dist(sp=emp,pp=fit.4, main = "Empirical vs. Log-Logistic")

Empirical vs PCH or Piecewise

Show the code

check.dist(sp=emp,pp=fit.5, main = "Empirical vs. PCH")

Observations

Based on the graphical checks, the piecewise model with an exponential distribution would be the best match. It must be an issue with my code, but the log-logistic ph model had the lowest AIC values, even though the graphical check isn’t successful.

Testing for Interaction

Log-Rank Rest to test for interactions

Show the code

survdiff(Surv(secbi, b2event)~rural2, data=mdat3)

Call:
survdiff(formula = Surv(secbi, b2event) ~ rural2, data = mdat3)

            N Observed Expected (O-E)^2/E (O-E)^2/V
rural2=0 6634     4986     5262      14.5      85.9
rural2=1 1665     1380     1104      69.3      85.9

 Chisq= 85.9  on 1 degrees of freedom, p= <2e-16

Show the code

chisq.test(table(mdat3$b2event, mdat3$new_mghh))


    Pearson's Chi-squared test with Yates' continuity correction

data:  table(mdat3$b2event, mdat3$new_mghh)
X-squared = 136.07, df = 1, p-value < 2.2e-16

Observation

P-value is not less than .05 and therefore there are no significant interactions.

Also ran peto. No homoscedasity ?

Show the code

peto_peto <- survdiff(Surv(secbi, b2event)~rural2, data=mdat3, rho = 1)
peto_peto

Call:
survdiff(formula = Surv(secbi, b2event) ~ rural2, data = mdat3, 
    rho = 1)

            N Observed Expected (O-E)^2/E (O-E)^2/V
rural2=0 6634     2839     3004      8.99      70.7
rural2=1 1665      836      672     40.23      70.7

 Chisq= 70.7  on 1 degrees of freedom, p= <2e-16

Show the code

anova <- aov(secbi ~ new_mghh * rural2, data = mdat3)
summary(anova)

                  Df   Sum Sq Mean Sq F value Pr(>F)    
new_mghh           1     2069    2069   1.138  0.286    
rural2             1    78711   78711  43.294  5e-11 ***
new_mghh:rural2    1     4474    4474   2.461  0.117    
Residuals       8295 15080697    1818                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Betas

Show the code

library(dplyr)
library(gtsummary)
library(survival)

#plotting betas

mdat3 %>%
  coxph(Surv(secbi, b2event) ~ rural2 * new_mghh*educ.high, data = .) %>% 
  tbl_regression() %>%
  plot()

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

Show the code

#plotting hazard ratios
    
mdat3 %>%
  coxph(Surv(secbi, b2event) ~ rural2 * new_mghh*educ.high, data = .) %>% 
  tbl_regression(exponentiate = TRUE) %>%
  plot()

Conclusions

The best distribution for my data, according to AIC is the log-logistic. However, the graphical check doesn’t affirm this finding. There were no interactions between two of my predictor variables (rural2 & new_mghh). I fit my parametric model to proportional hazards with log-logistic distribution, and the results are not significant. This means that location (rural vs. urban) and weather the home is multigenerational or not does not have an impact on the hazard ratio of second birth.

Show the code

summary(fit.4)

Covariate             Mean       Coef     Rel.Risk   S.E.    LR p
educ.high             0.536    -0.960     0.383     0.053   0.0000 
rural2                0.175    -0.237     0.789     0.026   0.0000 
new_mghh              0.386     0.178     1.195     0.032   0.0012 

Events                    6366 
Total time at risk        399109 
Max. log. likelihood      -31267 
LR test statistic         167.56 
Degrees of freedom        3 
Overall p-value           0