According to the question, we generate the survival times \(\tilde{t}_{ij}\) as follows:
set.seed(1220)
n <- 2000
ni <- sample(2:4, n, replace = TRUE)
N <- sum(ni)
X1 <- runif(N)
X2 <- rgamma(n, 0.5, 0.5) %>% log() %>% rep(time = ni)
u <- runif(N)
t <- sqrt(-log(1-u) / exp(-2 * X1 + X2))
c <- runif(N, 3, 6)
delta <- ifelse(t <= c, 1, 0)
t[t > c] <- c[t > c]
The summary statistics of \(t_{ij}\) and the numbers of 0 and 1 for \(\delta_{ij}\) are as follows:
summary(t)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01915 0.93861 1.90999 2.33499 3.58516 5.99542
table(delta)
## delta
## 0 1
## 1540 4503
We fit the Cox PH model with both \(X1\) and \(X2\) as covariates as follows:
df <- data.frame(ID = rep(1:2000, ni), time = t, status = delta, X1 = X1, X2 = X2)
model1 <- coxph(Surv(time, status) ~ X1 + X2, data = df, ties = "breslow")
summary(model1)
## Call:
## coxph(formula = Surv(time, status) ~ X1 + X2, data = df, ties = "breslow")
##
## n= 6043, number of events= 4503
##
## coef exp(coef) se(coef) z Pr(>|z|)
## X1 -2.06392 0.12696 0.05653 -36.51 <2e-16 ***
## X2 1.00164 2.72274 0.01531 65.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## X1 0.127 7.8768 0.1136 0.1418
## X2 2.723 0.3673 2.6423 2.8057
##
## Concordance= 0.829 (se = 0.003 )
## Likelihood ratio test= 7021 on 2 df, p=<2e-16
## Wald test = 4566 on 2 df, p=<2e-16
## Score (logrank) test = 4041 on 2 df, p=<2e-16
We find that the estimates of \(\beta_1\) and \(\beta_2\) are given by \(\hat{\beta}_1=-2.06392\) and \(\hat{\beta}_2=1.00164\), which close to the true value \(\beta_1=-2\) and \(\beta_2=1\).
And we fit the Cox PH model with only \(X1\) as covariate as follows:
model2 <- coxph(Surv(time, status) ~ X1, data = df, ties = "breslow")
summary(model2)
## Call:
## coxph(formula = Surv(time, status) ~ X1, data = df, ties = "breslow")
##
## n= 6043, number of events= 4503
##
## coef exp(coef) se(coef) z Pr(>|z|)
## X1 -0.90709 0.40370 0.05223 -17.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## X1 0.4037 2.477 0.3644 0.4472
##
## Concordance= 0.593 (se = 0.004 )
## Likelihood ratio test= 303.6 on 1 df, p=<2e-16
## Wald test = 301.6 on 1 df, p=<2e-16
## Score (logrank) test = 305.5 on 1 df, p=<2e-16
We find that the estimate of \(\beta_1\) is given by \(\hat{\beta}_1=-0.90709\), which is different from true value. Therefore, \(X2\) is an important variable in this Cox PH model. If important variables are not observed and included in the model, the estimated coefficients of the observed variables are biased towards 0.
We fit the Cox model with shared frailty as follows:
model3 <- emfrail(Surv(time, status) ~ X1 + cluster(ID), data = df)
summary(model3)
## Call:
## emfrail(formula = Surv(time, status) ~ X1 + cluster(ID), data = df)
##
## Regression coefficients:
## coef exp(coef) se(coef) adj. se z p
## X1 -2.0817 0.1247 0.0715 0.0735 -28.3102 0
## Estimated distribution: gamma / left truncation: FALSE
##
## Fit summary:
## Commenges-Andersen test for heterogeneity: p-val 0
## no-frailty Log-likelihood: -36389.92
## Log-likelihood: -34950.11
## LRT: 1/2 * pchisq(2880), p-val 0
##
## Frailty summary:
## estimate lower 95% upper 95%
## Var[Z] 2.010 1.859 2.181
## Kendall's tau 0.501 0.482 0.522
## Median concordance 0.513 0.492 0.536
## E[logZ] -1.278 -1.404 -1.167
## Var[logZ] 4.977 4.359 5.727
## theta 0.498 0.459 0.538
## Confidence intervals based on the likelihood function
According to the result above, we find that the estimate of \(\beta_1\) is given by \(\hat{\beta}_1=-2.0817\), which is close to the true value. This means that the bias due to omitting important variables in the model is corrected by introducing a shared frailty into the model. And the \(p\)-value of likelihood ratio test show that the frailty term is highly significant. At last, we have \(\hat{\theta}=0.498\), which is close to the true value \(\theta=0.5\).