Survival Analysis I

This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.

introducation

The hazard function is also called conditional failure rate. The two defining features of hazard function is that it is always non negative and it has no upper bound.

The goals: 1. To estimate and interpret survivor or hazard functions from survival data 2. To compare survivor and or hazard functions 3. To assess the relationship of explanatory variables to survival time.

Definitions: 1. Median survival time: The time when survival probability reaches 50%. If the data cannot reach survival probability 50% (in Y axis), then, there will be no median survival time.

Features: 1. The data structure used to do KM estimation is ordered failure times. (data structure: ordered failure times, # of failures, # censored in the time interval, Risk set) 2. Number of people in risk vs. number of people has event. 3.The most important thing to note is that the risk set contains people who have not been censored during the time interval considered. 4.Thus KM estimation does not throw away the information.

leukemia<-read.table("https://data.princeton.edu/wws509/datasets/gehan.dat")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

summarize(leukemia, events=sum(relapse), exposure=sum(weeks))

library(DT)

## Warning: package 'DT' was built under R version 4.0.3

DT::datatable(leukemia, options=list(autoWidth=TRUE))

30 of 42 patients had relapse, with total 541 weeks. the weekly relapse rate is (42/541) 7.8%.

group is control

leukemia[,3]<-as.numeric(leukemia[,3])
grp1<-subset(leukemia, group=="control")
library(survival)

## 
## Attaching package: 'survival'

## The following object is masked _by_ '.GlobalEnv':
## 
##     leukemia

with(grp1, survfit(Surv(weeks, relapse)~1))

## Call: survfit(formula = Surv(weeks, relapse) ~ 1)
## 
##       n  events  median 0.95LCL 0.95UCL 
##      21      21       8       4      12

fit_1<-with(grp1, survfit(Surv(weeks, relapse)~1, conf.type="none"))
print(summary(fit_1))

## Call: survfit(formula = Surv(weeks, relapse) ~ 1, conf.type = "none")
## 
##  time n.risk n.event survival std.err
##     1     21       2   0.9048  0.0641
##     2     19       2   0.8095  0.0857
##     3     17       1   0.7619  0.0929
##     4     16       2   0.6667  0.1029
##     5     14       2   0.5714  0.1080
##     8     12       4   0.3810  0.1060
##    11      8       2   0.2857  0.0986
##    12      6       2   0.1905  0.0857
##    15      4       1   0.1429  0.0764
##    17      3       1   0.0952  0.0641
##    22      2       1   0.0476  0.0465
##    23      1       1   0.0000     NaN

group is treated

grp2<-subset(leukemia, group=="treated")
with(grp2, survfit(Surv(weeks, relapse)~ 1))

## Call: survfit(formula = Surv(weeks, relapse) ~ 1)
## 
##       n  events  median 0.95LCL 0.95UCL 
##      21       9      23      16      NA

fit_2<-with(grp2, survfit(Surv(weeks, relapse)~ 1, conf.type="none"))
print(summary(fit_2))

## Call: survfit(formula = Surv(weeks, relapse) ~ 1, conf.type = "none")
## 
##  time n.risk n.event survival std.err
##     6     21       3    0.857  0.0764
##     7     17       1    0.807  0.0869
##    10     15       1    0.753  0.0963
##    13     12       1    0.690  0.1068
##    16     11       1    0.627  0.1141
##    22      7       1    0.538  0.1282
##    23      6       1    0.448  0.1346

fit<-with(leukemia, survfit(Surv(weeks, relapse)~group))
plot(fit, lty=c(2,1)) #lty: line type
legend("topright", legend = c("treated", "control"),
lty =c(1, 2),bty = "n")
abline(h = 0.5, col = "sienna", lty = 3)

log rank test is to compare two survival curves. This is a large sample chi square statistics in which categories are the set of ordered failure times. The idea behind this test is that at each failure point, given the risk set of each group and total failures of each of group, an expected failure count is computed for each group. This expected count is compared to the observed count.

with(leukemia, survdiff(Surv(weeks, relapse)~group, rho=0))

## Call:
## survdiff(formula = Surv(weeks, relapse) ~ group, rho = 0)
## 
##                N Observed Expected (O-E)^2/E (O-E)^2/V
## group=control 21       21     10.7      9.77      16.8
## group=treated 21        9     19.3      5.46      16.8
## 
##  Chisq= 16.8  on 1 degrees of freedom, p= 4e-05

#rho 1 is peto-peto, 0 for log rank

# data_bladder <- read.table("D:\\Users\\EChen13\\OneDrive - JNJ\\Documents\\stat\\Survival data\\bladder.dat", header=TRUE, stringsAsFactors = FALSE)

Reference: https://data.princeton.edu/pop509/