STAT 578 Survival Analysis HW 2 - HIV data

Ryan Masson | Portland State University | 2/23/24

This assignment analyzes HIV data from AIDS Clinical Trials Group Protocol 175 (ACTG175). The data include 1,076 HIV-positive subjects who were randomized to one of two treatments. There are also data on 12 other covariates. The survival time of interest is chosen as the time to having a larger than 50% decline in the CD4 count, or progressing to AIDS or death, whichever comes first.

Nonparametric estimation of the survival function, assuming no censored data.

The nonparametric estimation of a survival function, without assuming any censoring, is obtained through the empirical survival function. This computes the proportion of observations still surviving after a given time point, and uses this as the point estimate for the probability that the time of interest T will occur at that time point (the definition of the sruvivor function). This process forms a right-continous stepdown curve. ¹ The below code computes the empirical survivor function for the range of the time points in the data and plots the estimated survival curve.

Show the code

setwd('~/Documents/datawork/psu/STAT 578/')
hiv_data <- read.table("HIVData.txt", header = TRUE)

# for each time point, compute the proportion of subjects remaining to overall sample size, and make a vector with these proportions
esf <- function(t) {
  proportion_surviving <- (sum(hiv_data$time > t) / 1076)
  return(proportion_surviving)
}
emp_surv_proportions <- lapply(c(1:1500), esf)

# plot this proportion against time 
time_points <- c(1:1500)
plot(time_points, emp_surv_proportions, xlab = "time to 50% CD4 count / AIDS / death", ylab = "proportion remaining without event", type = 's', pch = 1, cex = 0.5)

Nonparametric estimation of the survival function, considering censored data.

In order to nonparametrically estimate the survival function while considering censored data, we use the Kaplan-Meier estimator. This estimator computes a joint probability by sequentially multiplying estimates of the proportion surviving (here of course “surviving” means not having AIDS or not having a 50% drop in CD4 count) through the current interval given that a subject was alive at the beginning of that interval. This proportion is estimated as one minus the proportion of “dead” subjects at the beginning of the current interval to the number alive and not censored just before the interval. This forms a right-continuous step function which steps down only at an uncensored observation.³

Show the code

# order the events
hiv_data_sorted <- hiv_data[order(hiv_data$time), ]

kaplan_meier <- function(t) {
  
  km_estimate <- 1

  # for each ordered event before time t
  for (i in 1:nrow(subset(hiv_data_sorted, time <= t))) {
      # compute n_i, the number alive and not censored just before y_(i)
      n_i <- (sum(subset(hiv_data_sorted, 
                         time > hiv_data_sorted[i,]$time)$censoring == 0))

      # compute d_i, the number of died at y_(i) (it's a 0 or 1)
      d_i <- as.numeric(!hiv_data_sorted[i,]$censoring)
  
      # multiply the probabilities
      km_estimate = (km_estimate * ((n_i - d_i) / n_i))
  }
  
  return(km_estimate)
}

# the following line computes the Kaplan-Meier estimates for 1500 time points; saving to file for faster access
#km_estimates_of_survival <- lapply(c(1:1500), kaplan_meier)
#writeLines(as.character(km_estimates_of_survival), "km_estimates.txt")

# plot these estimates against time 
time_points <- c(1:1500)
km_estimates <- as.numeric(readLines("km_estimates.txt"))
plot(time_points, km_estimates, xlab = "time to 50% CD4 count / AIDS / death", ylab = "proportion remaining without event", type = 's', col = "red", pch = 1, cex = 0.5)
# plot esf to compare 
lines(time_points, emp_surv_proportions, xlab = "time to 50% CD4 count / AIDS / death", ylab = "proportion remaining without event", type = 's', col = "blue", pch = 1, cex = 0.5)
legend("topright", legend=c("K-M estimates w/ censoring", "ESF estimates w/o censoring"), col=c("red", "blue"), lty=1)

MLE estimate of parameter vector for exponentially distributed survival time

As derived and discussed in class, the following partial likelihood expression can be maximized to estimate the parameter vector for an exponentially distributed survival time:

\[ \prod_{i=1}^{n} \left( \frac{e^{\beta^\top x_i}}{\sum_{j=1}^{i} e^{\beta^\top x_j}} \right)^{\delta_i} \] I have implemented this expression as an R function, as seen in my code below. However, I am not sure how to handle the calculations of exponentials with very large exponents resulting in values that become NaNs in the computer. I am interested in seeing how to compute this in code.

Show the code

#library(gmp)
#library(Rmpfr)

# writing the above expression as an R function
partial_survival_mle <- function(beta) {
  
  n <- 1076
  result <- numeric(n)
  
  for (i in 1:n) {
    numerator <- exp(as.bigz(beta %*% t(hiv_data[i,3:15])))
    denominator <- sum(exp(as.bigz(beta %*% t(hiv_data[1:i,3:15]))))
    result[i] <- (numerator / denominator) ^ hiv_data[i,]$censoring
  }
  
  return(-prod(result)) # negate for optimizer function
}

# run of the optimizer commented out currently
#initial_values <- rep(1, 13)
#result <- optim(initial_values, partial_survival_mle)
#print(result)

Cox PH model

A Cox proportional hazards model is a semiparametric model. This model’s hazard function is expressed as: ⁵

\[ h(t|x) = h_0(t) \cdot \exp(x^{\prime}\beta) \]

The baseline hazard function h-naught(t) is not specified in the Cox model, so the likelihood function cannot be fully specified. A special partial likelihood, defined on conditional probabilities free of the baseline hazard, is used to estimate effects of covariates in the Cox PH model.

Once again the treatment of very large numbers has foiled my attempt to attain the estimator of the parameter vector with this likelihood. I need to find out how to compute these.

Show the code

#library(gmp)
#library(Rmpfr)

# writing the above expression as an R function
cox_ph_likelihood <- function(beta) {
  
  n <- 1076
  result <- numeric(n)
  
  for (i in 1:n) {
    numerator <- exp(as.bigz(beta %*% t(hiv_data[i,3:15])))
    denominator <- sum(exp(as.bigz(beta %*% t(hiv_data_sorted[i:1076,3:15]))))
    result[i] <- (numerator / denominator) ^ hiv_data[i,]$censoring
  }
  
  return(-prod(result)) # negate for optimizer function
}

# run of the optimizer commented out currently
#initial_values <- rep(1, 13)
#result <- optim(initial_values, partial_survival_mle)
#print(result)

Comparing results with results using R packages

Below I compared my above results using the existing survival analysis R package “survival.” My Kaplan-Meier nonparametric survival curve looks similar to the one output by the R function. The p-values on the coefficients in the estimated parameter vector from the function coxph appear very large, possibly indicating poor model specification or incorrect implementation by me. Further model diagnostics and analysis is needed to verify this model.

Show the code

library(survival)

Warning: package 'survival' was built under R version 4.3.1

Show the code

# fit a Kaplan-Meier survival estimate for censored data
fit <- survfit(Surv(time = time, !censoring) ~ 1, data = hiv_data)
plot(fit, main = "K-M estimate of survival with R package 'survival'", xlab = "time to 50% CD4 count / AIDS / death", ylab = "proportion remaining without event", pch = 1, cex = 0.5)

Show the code

# Cox proportional hazards model
fit <- coxph(Surv(time = time, !censoring) ~ 
               age+weight+hemophilia+homosecual+drugs+karnof+race+
               gender+anti+symptom+cd40+cd80+treatment, data = hiv_data)
print(fit)

Call:
coxph(formula = Surv(time = time, !censoring) ~ age + weight + 
    hemophilia + homosecual + drugs + karnof + race + gender + 
    anti + symptom + cd40 + cd80 + treatment, data = hiv_data)

                 coef  exp(coef)   se(coef)      z      p
age         1.645e-03  1.002e+00  4.221e-03  0.390 0.6967
weight      3.602e-03  1.004e+00  2.735e-03  1.317 0.1878
hemophilia  2.581e-01  1.295e+00  1.753e-01  1.473 0.1408
homosecual  1.069e-01  1.113e+00  1.266e-01  0.844 0.3984
drugs      -1.956e-01  8.223e-01  1.100e-01 -1.778 0.0753
karnof      1.255e-02  1.013e+00  6.190e-03  2.028 0.0426
race       -1.767e-02  9.825e-01  8.652e-02 -0.204 0.8381
gender     -2.484e-01  7.800e-01  1.443e-01 -1.722 0.0851
anti        5.650e-05  1.000e+00  7.851e-05  0.720 0.4717
symptom     2.553e-02  1.026e+00  9.953e-02  0.257 0.7976
cd40        3.381e-05  1.000e+00  3.152e-04  0.107 0.9146
cd80       -1.415e-04  9.999e-01  7.843e-05 -1.804 0.0713
treatment  -7.452e-02  9.282e-01  7.168e-02 -1.040 0.2986

Likelihood ratio test=18.48  on 13 df, p=0.1402
n= 1076, number of events= 807

Footnotes

Tableman, Kim, Survival Analysis Using S, 2005.↩︎
Tableman, Kim, Survival Analysis Using S, 2005.↩︎
Tableman, Kim, Survival Analysis Using S, 2005.↩︎
Tableman, Kim, Survival Analysis Using S, 2005.↩︎
Tableman, Kim, Survival Analysis Using S, 2005.↩︎
Tableman, Kim, Survival Analysis Using S, 2005.↩︎