STAT 578 Final - Ryan Masson

Ryan Masson | Portland State University | 2/23/24

Problem 1

Prove the theorem of martingale under independence between event time \(T\) and censoring time \(C\). Particularly, you only need to prove the third verification, that is proving \[E\{M(t+s)\mid\mathcal{F}_t\}=M(t)\] for any \(t,s\ge 0\), if and only if \[\lambda(t)=\frac{-\frac{\partial}{\partial u}Pr(T\ge u,C\ge t)\mid_{u=t}}{Pr(T\ge t,C\ge t)}\] whenever \(P(X>t)>0\). The counting processes are defined as \[N(t)=I(X\le t,\delta=1),\] \[N_c(t)=I(X\le t, \delta=0),\] where \(X=min(T,C)\), \(\delta=I(T\le C)\). Then process \(M(t)\) is given by \[M(t) = N(t) -\int_0^t I(X\ge u)\lambda(u)du.\]

page1 page2 page3 page4 page5 page6

Problem 2

Let’s generate a survival data and estimate the coefficients and conduct inferences on them.

Assume a CoxPH model given by \[\lambda(t,\beta^T X) = te^{\beta_1X_1+\beta_2X_2}.\] Let parameter \(\beta_1=2,\beta_2=-1\), \(X_1\) is from random normal distribution with mean 0 and variance 1, \(X_2\) is from uniform distribution on \([0,2]\), and sample size is 200.

Generate the survival data with censoring

In order to generate the random event \(T\) from \(\lambda(t,\beta^T X)\), you need to derive the cumulative hazard function and survival function first. Then generate \(T\) from the solving \(u_i = e^{-\Lambda(T_i,\beta_1X_{i1}+\beta_2 X_{i2})},i = 1,2,...,200\), where \(u_i\)’s are i.i.d. uniform distributed on \([0,1]\). This is how we generate random variable from uniform distribution.

The event time you generated above is not censored. For generating a censored data set, one of the popular methods is generating censoring variable \(C_i\) from, for example, uniform distribution on \([0,k]\), \(k\) is some constant. Then you generate \(Y_i=min(T_i,C_i),i = 1,2,...,200\) as your observed event and censoring indicator \(\delta_i=I(T_i\le C_i),i=1,2,...,200\). You can adjust \(k\) to acquire different censoring rates.

Show the code
# The given hazard function from a Cox PH model
hazard <- function(t, x1, x2) {
  return(t * exp(2*x1 - x2))
}

# Derive the cumulative hazard function by integrating hazard from 0 to t
cumulative_hazard <- function(t, x1, x2) {
  return(integrate(Vectorize(function(u) hazard(u, x1, x2)), 
                   lower = 0, 
                   upper = t)$value)
}

# Derive the survival function
survival_function <- function(t, x1, x2) {
  exp(-cumulative_hazard(t, x1, x2))
}

# Generate event times through inverse transform sampling
generate_event_times <- function(n) {
  u <- runif(n)
  u <- as.data.frame(u)
  for (i in 1:n) {
    x1 <- rnorm(1, mean = 0, sd = 1)  # random instances of x1 and x2
    x2 <- runif(1, min = 0, max = 2)
    u$x1[i] <- x1
    u$x2[i] <- x2
    u$u[i] <- (-log(u$u[i]) / survival_function(1, x1, x2)) # inverse of survival
  }
  return(u)
}

# Generate 200 event times
event_times <- generate_event_times(200)  

# add censoring and delta column
event_times <- as.data.frame(event_times)
names(event_times)[names(event_times) == "u"] <- "T"
C <- runif(200, 0, 1)
event_times$C <- C
event_times$Y <- pmin(event_times$T, event_times$C)
event_times$delta <- ifelse(event_times$T < event_times$C, TRUE, FALSE)
event_times$delta <- as.numeric(event_times$delta)

Estimate the survival function and the hazard function using nonparametric method, denoted your estimations by \(\hat S\) and \(\hat\lambda\)

Estimate the survival function and the baseline hazard function \(\lambda_0(t)\) using nonparametric method, denoted your estimations by \(\hat S\) and \(\hat\lambda_0\)

I use the K-M estimator to estimate the survival function, pictured below. I first tried to do this using my own implementation, but it did not seem to work, as seen in the plot with the red line. I then use the R package survival to do the same estimation, as seen in the plot with the blue line.

Kaplan-Meier estimator for survival 1
Show the code
library(survival)

##### SURVIVAL FUNCTION ESTIMATION USING KAPLAN-MEIER ESTIMATOR

# order the events
event_times_sorted <- event_times[order(event_times$Y), ]

kaplan_meier <- function(t) {
  
  km_estimate <- 1

  # for each ordered event before time t
  for (i in 1:nrow(subset(event_times_sorted, Y <= t))) {
      # compute n_i, the number alive and not censored just before y_(i)
      n_i <- (sum(subset(event_times_sorted, 
                         Y > event_times_sorted[i,]$Y)$delta == 0))
      
      # skip calculation if n_i is 0
      if (n_i == 0) {
        next
      }

      # compute d_i, the number of died at y_(i) (it's a 0 or 1)
      d_i <- as.numeric(!event_times_sorted[i,]$delta)
  
      # multiply the probabilities
      km_estimate = (km_estimate * ((n_i - d_i) / n_i))
  }
  
  return(km_estimate)
}

# the following line computes the Kaplan-Meier estimates for 1500 time points; saving to file for faster access

#km_estimates_simulated <- lapply(c(1:1500), kaplan_meier)
#writeLines(as.character(km_estimates_simulated), "km_estimates_simulated.txt")

# plot these estimates against time 
time_points <- c(1:1500)
km_estimates <- as.numeric(readLines("km_estimates_simulated.txt"))
plot(time_points, km_estimates, xlab = "time to event", ylab = "proportion remaining without event", type = 's', col = "red", pch = 1, cex = 0.5)

Show the code
####### USING R PACKAGE SURVIVAL
# Fit a survival object
surv_obj <- Surv(time = event_times_sorted$Y, event = event_times_sorted$delta)

# Estimate the survival function using the Kaplan-Meier estimator
km_fit <- survfit(surv_obj ~ 1)

# Plot the Kaplan-Meier estimate of the survival function
plot(km_fit, xlab = "Time", ylab = "Survival Probability", col = "blue", lwd = 2)

Through the R package survival, I use this nonparametric way to estimate the hazard function:

Nonparametric estimation of hazard function 2
Show the code
##### NONPARAMETRIC BASELINE HAZARD FUNCTION ESTIMATION
library(survival)

# Fit a survival object
surv_obj <- Surv(time = event_times_sorted$Y, event = event_times_sorted$delta)

# Estimate the baseline survival function using the Kaplan-Meier estimator
baseline_survival <- survfit(surv_obj ~ 1)

# Plot the Kaplan-Meier estimate of the baseline hazard function
plot(baseline_survival, fun = "cumhaz", xlab = "Time", ylab = "Cumulative Hazard", col = "blue", lwd = 2)

Estimate the parameter \(\beta\) from a CoxPH model

My below code estimates the two beta coefficients using the R function coxph(). The p-values for the coefficients both indicate significance, and the likelihood ratio test for the model is also significant. However, the estimates for the coefficients of x1 and x2 are -0.6174 and 0.6806, respectively. This is a model fit to entirely simulated data based on a given hazard function, where the coefficients were 2 and -1. Shouldn’t the estimates here be close to those number? I am confused.

Show the code
# Fit a Cox proportional hazards model
cox_model <- coxph(Surv(Y, delta) ~ x1 + x2, data = event_times_sorted)

# Print the summary of the model
summary(cox_model)
Call:
coxph(formula = Surv(Y, delta) ~ x1 + x2, data = event_times_sorted)

  n= 200, number of events= 60 

      coef exp(coef) se(coef)      z Pr(>|z|)   
x1 -0.1444    0.8655   0.1204 -1.199  0.23036   
x2  0.6455    1.9069   0.2267  2.848  0.00441 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

   exp(coef) exp(-coef) lower .95 upper .95
x1    0.8655     1.1553    0.6836     1.096
x2    1.9069     0.5244    1.2229     2.974

Concordance= 0.608  (se = 0.04 )
Likelihood ratio test= 10.31  on 2 df,   p=0.006
Wald test            = 10  on 2 df,   p=0.007
Score (logrank) test = 10.29  on 2 df,   p=0.006

Save your results of \(\hat\beta\), \(\hat S\), and \(\hat\lambda\).

Show the code
km_fit
Call: survfit(formula = surv_obj ~ 1)

       n events median 0.95LCL 0.95UCL
[1,] 200     60  0.892   0.857      NA
Show the code
baseline_survival
Call: survfit(formula = surv_obj ~ 1)

       n events median 0.95LCL 0.95UCL
[1,] 200     60  0.892   0.857      NA
Show the code
cox_model
Call:
coxph(formula = Surv(Y, delta) ~ x1 + x2, data = event_times_sorted)

      coef exp(coef) se(coef)      z       p
x1 -0.1444    0.8655   0.1204 -1.199 0.23036
x2  0.6455    1.9069   0.2267  2.848 0.00441

Likelihood ratio test=10.31  on 2 df, p=0.005758
n= 200, number of events= 60 

Problem 3

Repeat what you did in Problem 2 for 1000 times. Because each time we generate data randomly, your 1000 results should be different. Store your results in big matrices with 1000 row and several columns each.

Your code:

Show the code
thousand_simulations <- data.frame(matrix(nrow = 1000, ncol = 0))

for (i in 1:1000) {
  # Generate 200 event times
  event_times <- generate_event_times(200)  

  # add censoring and delta column
  event_times <- as.data.frame(event_times)
  names(event_times)[names(event_times) == "u"] <- "T"
  C <- runif(200, 0, 1)
  event_times$C <- C
  event_times$Y <- pmin(event_times$T, event_times$C)
  event_times$delta <- ifelse(event_times$T < event_times$C, TRUE, FALSE)
  event_times$delta <- as.numeric(event_times$delta)
  
  event_times_sorted <- event_times[order(event_times$Y), ]

  # Fit a Cox proportional hazards model
  cox_model <- coxph(Surv(Y, delta) ~ x1 + x2, data = event_times_sorted)
  
  thousand_simulations$beta1[i] <- cox_model$coefficients[1]
  thousand_simulations$beta2[i] <- cox_model$coefficients[2]
}

Discussion about \(\beta\)

Compute the mean, median, and variance of each of your 1000 results of \(\hat\beta_1,\hat\beta_2\). Compare the mean with the true \(\beta_1,\beta_2\) you generated from Problem 2.

The mean of my thousand beta_ones is -0.46. The mean of my thousand beta_twos is 0.22. These means are far off from the initial means of 2 and -1 which were given in the original hazard function.

Your code:

Show the code
summary(thousand_simulations$beta1)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.88477 -0.54494 -0.46010 -0.46180 -0.37578  0.02956 
Show the code
summary(thousand_simulations$beta2)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.54735  0.07823  0.24051  0.24516  0.41193  0.91388 
Show the code
var(thousand_simulations$beta1)
[1] 0.01689194
Show the code
var(thousand_simulations$beta2)
[1] 0.06150233

Plot the histogram of your 1000 \(\hat\beta_1,\hat\beta_2\), what do you find?

I find that these histograms appear very normal, which is consistent with a central limit theorem effect produced by repeated simulated random samples of parameter estimates.

Show the code
hist(thousand_simulations$beta1)

Show the code
hist(thousand_simulations$beta2)

Discussion about the hazard function \(\lambda_0\)

Illustrate your 1000 results of \(\hat\lambda_0\) and the true \(\lambda_0\), which is known to us because you generated data from it, in one plot. What can you say about it?

The known baseline hazard in this case is just equal to t, the time variable. My plot of the cumulative hazard below seems consistent with this.

Your code:

Show the code
# Plot the Kaplan-Meier estimate of the baseline hazard function
plot(baseline_survival, fun = "cumhaz", xlab = "Time", ylab = "Cumulative Hazard", col = "blue", lwd = 2)

If you change your censoring rate by altering \(k\), you will have different results.

Footnotes

  1. Tableman, Kim, Survival Analysis Using S, 2005.↩︎

  2. Tableman, Kim, Survival Analysis Using S, 2005.↩︎