In Chapter 5 of Intutive Biostatistics, (2nd ed), Motusky introduces survival analysis by demonstrating the construction of a Kaplan-Meir survival curve. Survival analysis is also known as “time to event” or “failure time” analysis and can be used to determine the surival or longevity anything, including humans, states (fighting, sick), and widgets. Kaplan-Meir methods are only one of many types of survival analysis. Motulsky further elaborates on these methods in Chapter 20 (2nd Ed).
The central point of survival analysis methods are that they allow you to use “censored data” where the final fate or state of the study individual or object isn’t known. For example, agronomists might be interested if a certain chemical treatment increase the rate of seed germination, but some seeds might not germinate by the end of the study. To determine mean time to germination, the researchers could take the mean germination date of those seeds that did germinate. This would biased by ingoring seeds that didn’t germinate. Survival analysis allows the “censored” seeds that did not germinate to be included in the analysis, yielding an unbiased estimate.
Survival analysis is frequently used in biomedical studies such as those testing the effects of medication; see Fisher & Linn (1999) for an overview. Survival analysis is less frequently used in other areas of biological research. Its utility for environmental biologists and ecologists has frequently been highlighted for tasks such as seed germination (McNair et al 2012), pollinator visitation rates (CITE), animal contests (Moya-Laraño Estación and Wise 2000), expansion of invasive species (Reino et al 2009), plant disease biology (Scherm and Ojiambo 2004) and leaf survival (Egli and Schmid 2001). In contrast, researchers using radio telemetry to study wildlife survival and movement frequently use methods based on survival analysis (CITE). Avian ecologist also use survival methods to study nest survival (Heisy te al 2007)
The issue of censored data arrises not just in studies recording the outcomes of binary events. For example, when chemical sensors or assays have a lower detection limit, data is censored by this lower limit. For an ecological focused introduction to general issues of censored data see Fox and Negrete-Yankelevich (2015).
R has a large array of basic and advanced methods for survival analysis. We will use the basic package ‘survival’ and its functions Surv() and survfit() R is a major platform for the development of new survival methods, such as mixed-effects survival models. See the CRAN Task View for more info.
R Functions Used
Key Vocab
Motulsky provides a short data table of survival times (Table 5.1). This table could be loaded from a .csv file, but for convience we will use the following code builds this table.
First, each column of data need to be made. (Note that I split up the date into month, data and year to make data entry easier).
#Parts of data table
## Starting date
start.month <- c(2,5,11,3,6,12, 12)
start.day <- c(7,19,14,4,15,1,15)
start.year <- c(1998,1998,1998,
1999,1999,1999,
1999)
##Ending data
###when individual died or was "censored"
end.month <- c(3,11,3, 5,5,9,8)
end.day <- c(2,30,3,4,4,4,15)
end.year <- c(2002,2004,2000,
2005,2005,2004,
2003)
## Fate
fate1 <- c("Died","moved","Died","Study ended","Died","Died","Died-car crash")
Next, I’ll assemble the dates using paste(). First the start date.
start.date <- paste(start.month,
start.day,
start.year,
sep = "/")
Then the end date.
end.date <- paste(end.month,
end.day,
end.year,
sep = "/")
Assemble the final table
dat.tab <- data.frame(start.date = start.date,
end.date,
fate = fate1)
| start.date | end.date | fate |
|---|---|---|
| 2/7/1998 | 3/2/2002 | Died |
| 5/19/1998 | 11/30/2004 | moved |
| 11/14/1998 | 3/3/2000 | Died |
| 3/4/1999 | 5/4/2005 | Study ended |
| 6/15/1999 | 5/4/2005 | Died |
| 12/1/1999 | 9/4/2004 | Died |
| 12/15/1999 | 8/15/2003 | Died-car crash |
Motulsky Table 5.1 “Sample survival-data details for the data plotted in Figure 5.1” (page 38, 2nd edition)
The data in table 5.1 need to be processed inorder to enter into R’s survival functions. Two things need to be done.
When we made the date columns and added to the datatable R automatically converted them “factor” format. The original vector was “character”, which is the format we need it to be in.
We can see what type of data soemthing is using is()
#start.date was charcter data...
is(start.date)[1]
## [1] "character"
#but data.frame() converted it to factor
is(dat.tab$start.date)[1]
## [1] "factor"
We can convert it back using as.charcter()
#convert start date
dat.tab$start.date <- as.character(dat.tab$start.date)
#convert end date
dat.tab$end.date <- as.character(dat.tab$end.date)
#convert start.date
dat.tab$start.date <- as.Date(dat.tab$start.date ,format = "%m/%d/%Y")
#convert end.date
dat.tab$end.date <- as.Date(dat.tab$end.date ,format = "%m/%d/%Y")
We can check what the conversion did using is()
is(dat.tab$end.date)
## [1] "Date" "oldClass"
Once in R’s “date”" format we can do math with the columns to calculate the time difference in days between the end and the start date.
dat.tab$days <- dat.tab$end.date -dat.tab$start.date
The time data has a special format in R , “difftime”
is(dat.tab$days)
## [1] "difftime"
For this to work properly we need to convert the data expliclity to numeric data. All this data conversion is kind of a pain but is essential to get R to work on the data.
dat.tab$days <- as.numeric(dat.tab$days )
Motulsky converts the data to years
dat.tab$years <- dat.tab$days/365
Motulsky list fates as detailed comments in table 5.1. To analyze the data I need to convert to
#recode fates
fate2 <- ifelse(fate1 == "Died",1,0)
Assemble the table
dat.tab2 <- data.frame(year = dat.tab$years,
fate.01 = fate2)
The final table
pander(dat.tab2)
| year | fate.01 |
|---|---|
| 4.066 | 1 |
| 6.54 | 0 |
| 1.301 | 1 |
| 6.173 | 0 |
| 5.89 | 1 |
| 4.764 | 1 |
| 3.668 | 0 |
Load the ‘survival’ package. The package can be downloaded from CRAN in RStudio by clicking on the “packages” tabe next to the “plot” tab. Next click on “Install” and type in “survival”.
library(survival)
The Surv() function formats the data for the final analysis. Note the capital “S”.
library(survival)
surv.object <- Surv(time = dat.tab2$year,
event = dat.tab2$fate.01)
The processed survival data has a unique format (this is a theme for survival analysis in R…)
surv.object
## [1] 4.065753 6.539726+ 1.301370 6.172603+ 5.890411 4.764384 3.668493+
surv.fitted.Motulsky <- survfit(surv.object ~ 1,
conf.type = "log-log")
Calling plot() on the fitted survival model, surv.fitted, produces a plot similar to the lower panels in Motulsky’s figure 5.1.
plot(surv.fitted.Motulsky)
R’s default setting produce a slightly similar output where the upper 95% confidence interval equals 1 for the entire curve
surv.fitted.deafult <- survfit(surv.object ~ 1)
plot(surv.fitted.deafult)
surfit() takes two arguements that can impact the shape of the curve and 95%CIs, “conf.type” as noted above and “type”. What these different setting do is beyond the scope of this tutorial. See ?survfit.formula for full details. (Please note, however, that adjusting the setting simply to adjust the confidence interval to suit your hypothesis would be a form of p-hacking)
This plot matches the lower left-hand plot in Motulsky’s figure 5.1.
#mark censored individuals
plot(surv.fitted.deafult,
mark.time = T,
mark = 16)
In Chapter 29 (2nd edition) Motulsky elaborates on survival analyses by discuss to test if two survival curves are significantly different using a log-rank test. In Chapter 29 he uses a different dataset thatn Chapter 5. Here, we’ll briefly introduce this method by modifying our current data in our dat.tab2 object.
First, we’ll make a duplicate of the data
dat.tab2.duplicate <- dat.tab2
Then we’ll change the data slightly by adding some random noise with the rnorm() comamnd. we’ll take the origina “year” column and add a random number drawn from a normal distribution with mean 0 and an SD of 3
dat.tab2.duplicate$year <- dat.tab2.duplicate$year + rnorm(n = 7,mean = 0, sd = 2)
rbind(dat.tab2.duplicate,dat.tab2)
## year fate.01
## 1 4.9705072 1
## 2 7.2672236 0
## 3 -0.1995911 1
## 4 7.6226245 0
## 5 8.0834898 1
## 6 6.1458614 1
## 7 2.2314634 0
## 8 4.0657534 1
## 9 6.5397260 0
## 10 1.3013699 1
## 11 6.1726027 0
## 12 5.8904110 1
## 13 4.7643836 1
## 14 3.6684932 0