library(survival)
data1 <- read.csv("C:\\Users\\anami\\OneDrive\\Documents\\EHA\\mhas0121dummies.csv")
data2<-na.omit(data1)
data2$female <- as.factor(data2$female)
str(data2)
## 'data.frame':    7552 obs. of  15 variables:
##  $ locsize01: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ female   : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 1 2 2 2 ...
##  $ schooling: int  0 3 0 0 0 6 3 1 0 6 ...
##  $ died0121 : int  0 0 1 0 0 1 0 1 1 1 ...
##  $ educlevel: int  0 15 0 0 0 68 15 15 0 68 ...
##  $ ageint   : num  70.1 51.1 64.8 57.3 53.2 ...
##  $ agecensor: num  90.5 71.5 74.7 77.8 73.6 ...
##  $ locsiz1  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ locsiz2  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ locsiz3  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ locsiz4  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ educ1    : int  1 0 1 1 1 0 0 0 1 0 ...
##  $ educ2    : int  0 1 0 0 0 0 1 1 0 0 ...
##  $ educ3    : int  0 0 0 0 0 1 0 0 0 1 ...
##  $ educ4    : int  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "na.action")= 'omit' Named int [1:2084] 1 10 19 28 37 42 47 48 52 53 ...
##   ..- attr(*, "names")= chr [1:2084] "1" "10" "19" "28" ...
surv_obj <- Surv(time = data2$agecensor, event = data2$died0121)

I.Multivariate parametric proportional hazards models

  1. Estimate parametric proportional hazards model with an exponential (i.e., constant) with no covariates. Instead of hazard ratios, obtain coefficients. Exponentiate the intercept of the model.
exp_model_no_covariates <- survreg(surv_obj ~ 1, dist = "exponential")
summary(exp_model_no_covariates)
## 
## Call:
## survreg(formula = surv_obj ~ 1, dist = "exponential")
##              Value Std. Error   z      p
## (Intercept) 5.0002     0.0158 316 <2e-16
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= -23898.8   Loglik(intercept only)= -23898.8
## Number of Newton-Raphson Iterations: 4 
## n= 7552
exp_intercept_no_cov <- exp(coef(exp_model_no_covariates))
exp_intercept_no_cov
## (Intercept) 
##     148.442

The estimated intercept is 5.0002. The extremely small p-value (approximately zero) indicates the intercept is highly statistically significant. This suggests that the model provides a good fit to the data.The model is based on a sample of 7552 individuals. The exponential value of the intercept is 148.442.

I.b.

Estimate another exponential, proportional hazards model, using FEMALE as the only covariate. Obtain coefficients, not hazard ratios.

exp_model_female <- survreg(surv_obj ~ data2$female, dist = "exponential")
summary(exp_model_female)
## 
## Call:
## survreg(formula = surv_obj ~ data2$female, dist = "exponential")
##                Value Std. Error      z      p
## (Intercept)   4.9469     0.0230 215.17 <2e-16
## data2$female1 0.0993     0.0317   3.13 0.0018
## 
## Scale fixed at 1 
## 
## Exponential distribution
## Loglik(model)= -23893.9   Loglik(intercept only)= -23898.8
##  Chisq= 9.77 on 1 degrees of freedom, p= 0.0018 
## Number of Newton-Raphson Iterations: 4 
## n= 7552
exp_intercept_female <- exp(coef(exp_model_female)[1])  # Intercept
exp_female_coef <- exp(coef(exp_model_female)[2])       # Coefficient for FEMALE

exp_intercept_female
## (Intercept) 
##    140.7322
exp_female_coef
## data2$female1 
##      1.104352
# Sum the intercept and the coefficient for FEMALE, then exponentiate
exp_sum <- exp(coef(exp_model_female)[1] + coef(exp_model_female)[2])
exp_sum
## (Intercept) 
##     155.418

The estimated coefficient for females is 0.0993; exp⁡(0.0993)=1.104 when exponentiated. The exponentiated coefficient (1.104352) represents the hazard ratio for females compared to males. A hazard ratio of 1.10 means that, on average, females have a 10.4% higher hazard (death) than males. 155.418 is the hazard rate for females. This value represents the constant hazard for females, considering both the baseline hazard (intercept) and the additional risk associated with being female.

I.c.

  1. Estimate another PH parametric model controlling for FEMALE, SCHOOLING (or categories of EDUCLEVEL, whatever you prefer, justifying your choice), and LOCSIZE01. Assuming the baseline hazard is distributed…
library(eha)
library(flexsurv)
## 
## Attaching package: 'flexsurv'
## The following objects are masked from 'package:eha':
## 
##     dgompertz, dllogis, hgompertz, Hgompertz, hllogis, Hllogis, hlnorm,
##     Hlnorm, hweibull, Hweibull, pgompertz, pllogis, qgompertz, qllogis,
##     rgompertz, rllogis
# i. Exponential Distribution (Weibull with shape parameter = 1)
fitweibull_exp <- phreg(Surv(ageint, agecensor, died0121) ~ female + educ2 + educ3 + educ4 + locsiz2 + locsiz3 + locsiz4,
                        na.action = na.omit, dist = "weibull", shape = 1, data = data2)
summary(fitweibull_exp)
## Covariate             Mean       Coef     Rel.Risk   S.E.    LR p
## female                                                      0.0000 
##                0      0.447     0         1 (reference)
##                1      0.553    -0.171     0.843     0.032
## educ2                 0.349    -0.255     0.775     0.039   0.0000 
## educ3                 0.229    -0.522     0.593     0.047   0.0000 
## educ4                 0.179    -0.745     0.475     0.055   0.0000 
## locsiz2               0.155    -0.061     0.941     0.046   0.1815 
## locsiz3               0.096    -0.075     0.927     0.055   0.1641 
## locsiz4               0.174    -0.216     0.805     0.046   0.0000 
## 
## Events                    3983 
## Total time at risk        111700 
## Max. log. likelihood      -17138 
## LR test statistic         246.27 
## Degrees of freedom        7 
## Overall p-value           0
# ii. Gompertz Distribution
fitgomp <- phreg(Surv(ageint, agecensor, died0121) ~ female + educ2 + educ3 + educ4 + locsiz2 + locsiz3 + locsiz4,
                 na.action = na.omit, dist = "gompertz", data = data2)
summary(fitgomp)
## Covariate             Mean       Coef     Rel.Risk   S.E.    LR p
## female                                                      0.0000 
##                0      0.447     0         1 (reference)
##                1      0.553    -0.254     0.776     0.032
## educ2                 0.349    -0.016     0.984     0.039   0.6728 
## educ3                 0.229    -0.077     0.926     0.048   0.1100 
## educ4                 0.179    -0.187     0.830     0.057   0.0009 
## locsiz2               0.155    -0.010     0.990     0.046   0.8330 
## locsiz3               0.096    -0.100     0.905     0.055   0.0660 
## locsiz4               0.174    -0.187     0.829     0.046   0.0000 
## 
## Events                    3983 
## Total time at risk        111700 
## Max. log. likelihood      -15894 
## LR test statistic         78.74 
## Degrees of freedom        7 
## Overall p-value           2.4869e-14
# iii. Weibull Distribution
fitweibull <- phreg(Surv(ageint, agecensor, died0121) ~ female + educ2 + educ3 + educ4 + locsiz2 + locsiz3 + locsiz4,
                    na.action = na.omit, dist = "weibull", data = data2)
summary(fitweibull)
## Covariate             Mean       Coef     Rel.Risk   S.E.    LR p
## female                                                      0.0000 
##                0      0.447     0         1 (reference)
##                1      0.553    -0.256     0.774     0.032
## educ2                 0.349    -0.031     0.969     0.039   0.4221 
## educ3                 0.229    -0.092     0.912     0.048   0.0563 
## educ4                 0.179    -0.197     0.821     0.057   0.0005 
## locsiz2               0.155    -0.013     0.987     0.046   0.7738 
## locsiz3               0.096    -0.101     0.904     0.055   0.0629 
## locsiz4               0.174    -0.188     0.828     0.046   0.0000 
## 
## Events                    3983 
## Total time at risk        111700 
## Max. log. likelihood      -15899 
## LR test statistic         80.15 
## Degrees of freedom        7 
## Overall p-value           1.27676e-14
# iv. Log-logistic model using flexsurvreg
fitloglog <- flexsurvreg(Surv(ageint, agecensor, died0121) ~ female + schooling, 
                         dist = "llogis", data = data2)
summary(fitloglog)
# v. Log-normal model using flexsurvreg
fitlognorm <- flexsurvreg(Surv(ageint, agecensor, died0121) ~ female + schooling, 
                          dist = "lnorm", data = data2)
summary(fitlognorm)
# Extract coefficients (log hazard ratios) for log-logistic and log-normal models
coef_loglog <- coef(fitloglog)  # Log-logistic model coefficients
coef_lognorm <- coef(fitlognorm)  # Log-normal model coefficients

# Compute inverse hazard ratios (e^(-β)) for log-logistic and log-normal
inv_hr_loglog <- exp(-coef_loglog)  # Inverse hazard ratios for log-logistic
inv_hr_lognorm <- exp(-coef_lognorm)  # Inverse hazard ratios for log-normal
# Extract coefficients (log hazard ratios) and exponentiate them for all models except log-logistic and log-normal
exp_coef_exp <- exp(coef(fitweibull_exp))    # Exponential (Weibull with shape=1)
exp_coef_gomp <- exp(coef(fitgomp))  # Gompertz
exp_coef_weib <- exp(coef(fitweibull))  # Weibull
exp_coef_loglog <- exp(-coef(fitloglog)) # Inverse for log-logistic
exp_coef_lognorm <- exp(-coef(fitlognorm)) # Inverse for log-normal
# Print hazard ratios for all models
cat("Hazard Ratios (HRs):\n")
## Hazard Ratios (HRs):
cat("Exponential (Weibull with shape=1):\n", exp_coef_exp, "\n")
## Exponential (Weibull with shape=1):
##  0.8427085 0.775067 0.5934362 0.4746067 0.9412674 0.9273516 0.8054441 17.71075
cat("Gompertz:\n", exp_coef_gomp, "\n")
## Gompertz:
##  0.7759618 0.9836505 0.9257087 0.8295781 0.9904335 0.9050318 0.8294534 12.1842 0.001283197
cat("Weibull:\n", exp_coef_weib, "\n")
## Weibull:
##  0.7738267 0.9692153 0.912301 0.8214931 0.9869808 0.9039902 0.8283914 81.5389 7.296986
cat("Log-logistic (Inverse):\n", inv_hr_loglog, "\n")
## Log-logistic (Inverse):
##  0.09188302 0.01282522 0.9615755 0.9982825
cat("Log-normal (Inverse):\n", inv_hr_lognorm, "\n")
## Log-normal (Inverse):
##  0.0128991 6.42689 0.9627517 0.9978325
# Extract AIC values from models
aic_weibull_exp <- AIC(fitweibull_exp)
aic_gomp <- AIC(fitgomp)
aic_weibull <- AIC(fitweibull)
aic_loglog <- fitloglog$AIC  # flexsurv automatically includes AIC
aic_lognorm <- fitlognorm$AIC  # flexsurv automatically includes AIC

# Print AIC values
cat("AIC values:\n")
## AIC values:
cat("Exponential (Weibull with shape=1):", aic_weibull_exp, "\n")
## Exponential (Weibull with shape=1): 34292.61
cat("Gompertz:", aic_gomp, "\n")
## Gompertz: 31805.71
cat("Weibull:", aic_weibull, "\n")
## Weibull: 31816.33
cat("Log-logistic:", aic_loglog, "\n")
## Log-logistic: 32146.57
cat("Log-normal:", aic_lognorm, "\n")
## Log-normal: 32129.43

Based on the AIC values, the Gompertz model provides the best fit, closely followed by the Weibull model. Both models allow for time-varying hazards, offering greater flexibility and making them well-suited to this dataset.

In contrast, the Exponential model performs poorly, likely due to its restrictive assumption of a constant hazard over time. The Log-logistic and Log-normal models offer a moderate fit and may be more appropriate for situations where the hazard decreases after peaking. However, in this case, they do not outperform the Gompertz and Weibull models.

Cox regression

# Fit Cox proportional hazards model with left truncation
cox_model <- coxph(Surv(ageint, agecensor, died0121) ~ female + educ2 + educ3 + educ4 + locsiz2 + locsiz3 + locsiz4, data = data2)

# Display model summary
summary(cox_model)
## Call:
## coxph(formula = Surv(ageint, agecensor, died0121) ~ female + 
##     educ2 + educ3 + educ4 + locsiz2 + locsiz3 + locsiz4, data = data2)
## 
##   n= 7552, number of events= 3983 
## 
##             coef exp(coef) se(coef)      z Pr(>|z|)    
## female1 -0.25797   0.77262  0.03218 -8.017 1.08e-15 ***
## educ2   -0.03079   0.96968  0.03905 -0.788 0.430532    
## educ3   -0.08899   0.91486  0.04838 -1.839 0.065883 .  
## educ4   -0.20013   0.81862  0.05685 -3.520 0.000431 ***
## locsiz2 -0.01662   0.98352  0.04566 -0.364 0.715893    
## locsiz3 -0.09538   0.90902  0.05488 -1.738 0.082176 .  
## locsiz4 -0.19423   0.82347  0.04577 -4.243 2.20e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## female1    0.7726      1.294    0.7254    0.8229
## educ2      0.9697      1.031    0.8982    1.0468
## educ3      0.9149      1.093    0.8321    1.0059
## educ4      0.8186      1.222    0.7323    0.9151
## locsiz2    0.9835      1.017    0.8993    1.0756
## locsiz3    0.9090      1.100    0.8163    1.0122
## locsiz4    0.8235      1.214    0.7528    0.9008
## 
## Concordance= 0.55  (se = 0.005 )
## Likelihood ratio test= 81.55  on 7 df,   p=7e-15
## Wald test            = 81.72  on 7 df,   p=6e-15
## Score (logrank) test = 81.99  on 7 df,   p=5e-15

III. Prediction, diagnostics, time-varying covariates

  1. Choosing whatever model you prefer from I or II, assess whether the proportionality of hazards assumption is violated for FEMALE and SCHOOLING (separately). Does it hold for either/both?
cox_model <- coxph(Surv(ageint, agecensor, died0121) ~ female + schooling, data = data2)
ph_test <- cox.zph(cox_model)
print(ph_test)
##           chisq df      p
## female     6.35  1 0.0117
## schooling  6.79  1 0.0091
## GLOBAL    13.61  2 0.0011
plot(ph_test, var = "female")

plot(ph_test, var = "schooling")

The chi-squared statistic for female is 6.35 with a p-value of 0.0117, while the chi-squared statistic for schooling is 6.79 with a p-value of 0.0091 . Based on the chi-squared test results, the proportional hazards assumption is likely violated for both female and schooling. However, the visual residual plot for female shows no firm evidence of violation, so further assessment or transformations may be necessary. For schooling, both the chi-squared test and the residual plot indicate a clear violation, meaning the effect of schooling on hazard is not constant over time.

III b.

  1. Estimate a “Cox-stratified” model using a new dummy variable that divides people into whether they have less than 9 vs. 9 or more years of formal schooling as the stratification variable. Would your conclusions about the effect of education (linearly anyway) change doing this?
data2$schooling <- as.numeric(data2$schooling)
data2$school_strata <- ifelse(data2$schooling < 9, 1, 0)

# Fit the Cox stratified model using the new school_strata dummy variable
fit_stratified <- coxph(Surv(ageint, agecensor, died0121) ~ female + strata(school_strata), data = data2)
summary(fit_stratified)
## Call:
## coxph(formula = Surv(ageint, agecensor, died0121) ~ female + 
##     strata(school_strata), data = data2)
## 
##   n= 7552, number of events= 3983 
## 
##             coef exp(coef) se(coef)      z Pr(>|z|)    
## female1 -0.23973   0.78684  0.03186 -7.524 5.33e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## female1    0.7868      1.271    0.7392    0.8375
## 
## Concordance= 0.535  (se = 0.005 )
## Likelihood ratio test= 56.25  on 1 df,   p=6e-14
## Wald test            = 56.6  on 1 df,   p=5e-14
## Score (logrank) test = 56.87  on 1 df,   p=5e-14

he coefficient for female is -0.23973. This is the log hazard ratio for females compared to males. A negative value indicates that females have a lower hazard than males. The exponentiated coefficient (hazard ratio) for FEMALE is 0.78684. This indicates that females have approximately 21.3% lower hazard of the event (death) compared to males. Likelihood ratio test: A test for the overall significance of the model. The p-value (p = 6e-14) indicates that the model is highly significant.

In this stratified model, education is not directly included as a covariate. Instead, the model adjusts for differences in baseline hazards between the two groups (those with less than 9 years of schooling and those with 9 or more years). Since education is being used for stratification, we cannot directly interpret a linear effect of schooling from this model. The model adjusts for differences in baseline hazards but does not estimate how additional years of schooling affect the hazard ratio.

III.

c.“Recover” and graph the baseline hazard under the Cox model.

# Recover baseline hazard
baseline_hazard <- basehaz(cox_model, centered = FALSE)

# Plot baseline hazard
plot(baseline_hazard$time, baseline_hazard$hazard, type = "l", xlab = "Time", ylab = "Cumulative Baseline Hazard")

III d.

  1. Estimate predicted survival curves or smoothed hazards
library(muhaz)
hazard_men_0 <- muhaz(data2$agecensor[data2$female == 0 & data2$educlevel == 0],
                      data2$died0121[data2$female == 0 & data2$educlevel == 0])

hazard_women_0 <- muhaz(data2$agecensor[data2$female == 1 & data2$educlevel == 0],
                        data2$died0121[data2$female == 1 & data2$educlevel == 0])

hazard_men_68 <- muhaz(data2$agecensor[data2$female == 0 & data2$educlevel == 68],
                       data2$died0121[data2$female == 0 & data2$educlevel == 68])

hazard_women_68 <- muhaz(data2$agecensor[data2$female == 1 & data2$educlevel == 68],
                         data2$died0121[data2$female == 1 & data2$educlevel == 68])


# Plot the smoothed hazard functions
plot(hazard_men_0, col = "blue", lty = 1, lwd = 2, main = "Smoothed Hazard Functions", 
     xlab = "Time", ylab = "Hazard", ylim = c(0, 0.20))  # Set the line width (lwd)

lines(hazard_women_0, col = "red", lty = 2, lwd = 2)     # Women with 0 years of education
lines(hazard_men_68, col = "green", lty = 3, lwd = 2)    # Men with 6-8 years of education
lines(hazard_women_68, col = "purple", lty = 4, lwd = 2) # Women with 6-8 years of education

# Shift the legend to the top center of the plot
legend(x = "top", inset = c(0, -0.1), legend = c("Men, 0 years", "Women, 0 years", 
       "Men, 6-8 years", "Women, 6-8 years"), col = c("blue", "red", "green", "purple"), lty = 1:4, 
       lwd = 2, xpd = TRUE)

IV.

Write a short write up with a reflection on the problem set, either describing and interpreting the results or doing something more “meta” about your learning of the techniques, whatever is most useful to you.

Answer: Assignment 3 involved estimating and interpreting parametric and Cox proportional hazard models using different distributions and covariates. Initially, a parametric model with an exponential distribution and no covariates provided insight into the baseline hazard. Introducing ‘female’ as a covariate highlighted gender’s influence on survival, with the exponentiated coefficient representing the relative hazard for females compared to males. Adding schooling and location size further refined the model, offering a more detailed understanding of survival outcomes. We examined how distribution choice impacts hazard estimates by comparing models with various distributions (Exponential, Weibull, Gompertz, Log-logistic, and Log-normal). The Cox model, a semi-parametric approach, allowed for flexible estimation without assuming a specific hazard distribution. Stratifying the Cox model by education levels helped explore survival differences across subgroups. This exercise provided valuable experience in applying both parametric and semi-parametric methods for survival analysis. It emphasized the importance of model selection based on data characteristics, accurately interpreting results, and understanding each method’s assumptions to make informed decisions in real-world survival analysis.