Survival Analysis Assignment 1

library(survival)

## Loading required package: splines

cell=read.csv("Cell.csv")

Question 1

To make a non-parametric estimate:

ecdf(cell$Time)(1175)

## [1] 0.7059

Using non-parametric methods, we would estimate the chance of a cell lasting at most 1175 days to be about .706.

To fit the CDF of the exponential model and calculate the probability:

survreg(Surv(Time)~1, dist="exponential", data=cell)

## Call:
## survreg(formula = Surv(Time) ~ 1, data = cell, dist = "exponential")
## 
## Coefficients:
## (Intercept) 
##       6.903 
## 
## Scale fixed at 1 
## 
## Loglik(model)= -134.3   Loglik(intercept only)= -134.3
## n= 17

1/exp(6.902861) #calculate lambda

## [1] 0.001005

pexp(1175, 0.001004906)

## [1] 0.693

Using the exponential distribution, we estimate the probability of a cell lasting at most 1175 days to be approximately .693.

To fit the CDF of the Weibull model and calculate the probability:

survreg(Surv(Time)~1, dist="weibull", data=cell)

## Call:
## survreg(formula = Surv(Time) ~ 1, data = cell, dist = "weibull")
## 
## Coefficients:
## (Intercept) 
##       7.012 
## 
## Scale= 0.3012 
## 
## Loglik(model)= -122.8   Loglik(intercept only)= -122.8
## n= 17

1/.3012274 #a/scale

## [1] 3.32

exp(7.011883) #b/shape

## [1] 1110

pweibull(1175, shape=3.319751, scale=1109.742)

## [1] 0.7015

Using the Weibull distribution, we estimate the probability of a cell lasting at most 1175 days to be about .701.

To compare the ecdf to the CDFs of the other models:

plot(ecdf(cell$Time), do.points=F, verticals=T)
curve(pexp(x,0.001004906) , 0 , max(cell$Time), add=T, col='blue')
curve(pweibull(x, 3.319751, 1109.742), add=T, col='red')

plot of chunk unnamed-chunk-5

Based on these plots, we would not feel comfortable using the Exponential model.

I probably would have decided that the value obtained from (1b) was close enough to the one obtained from (1a) to use it. I now realize that even slight differences in estimates found by using different methods can be quite significant, and reaffirm that it is important to make sure you are representing the data using the most appropriate model.
To estimate the mean:

mean(cell$Time)

## [1] 995.1

This value corresponds to the 50th percentile of the ecdf function, as we see in the plot from (1c).

Question 2

cord=read.csv("Cord.csv")

We know that the actual strength of some of the cords are at least the number recorded, but we don’t know the actual values. This is right-censored data.
To fit and plot the Weibull model:

Status=1-(cord$Damaged==1) #All (Damaged==1) values are the right-censored data, so need to be equal to 0
survreg(Surv(Strength, Status)~1, dist="weibull", data=cord)

## Call:
## survreg(formula = Surv(Strength, Status) ~ 1, data = cord, dist = "weibull")
## 
## Coefficients:
## (Intercept) 
##       4.026 
## 
## Scale= 0.06132 
## 
## Loglik(model)= -115.8   Loglik(intercept only)= -115.8
## n= 48

1/0.06131651 #calculate shape

## [1] 16.31

exp(4.025853) #calculate scale

## [1] 56.03

curve(dweibull(x, 16.30882, 56.02808), 20, 80)

plot of chunk unnamed-chunk-8

The estimated density appears to be mildly skewed.

Estimate the mean by finding the area under the curve.

S=function(x) {1-pweibull(x, 16.30882, 56.02808)}
integrate(S, 0, Inf)

## 54.24 with absolute error < 0.0011

Estimate the median by using the qweibull function.

qweibull(.5, 16.30882, 56.02808)

## [1] 54.78

The mean and median appear to be relatively close in value, so a Normal model would be appropriate for this data.
To fit the Normal model, and compare estimated density of the Normal model to that of the Weibull model:

survreg(Surv(Strength)~1, dist="gaussian", data=cord)

## Call:
## survreg(formula = Surv(Strength) ~ 1, data = cord, dist = "gaussian")
## 
## Coefficients:
## (Intercept) 
##       51.44 
## 
## Scale= 8.178 
## 
## Loglik(model)= -169   Loglik(intercept only)= -169
## n= 48

mean(cord$Strength)

## [1] 51.44

sd(cord$Strength)

## [1] 8.265

curve(dnorm(x, mean=51.44375, sd=8.264714),20,80)

plot of chunk unnamed-chunk-11

curve(dweibull(x, 16.30882, 56.02808),20,80, ylim=c(0,1), color='fuschia')

## Warning: "color" is not a graphical parameter
## Warning: "color" is not a graphical parameter
## Warning: "color" is not a graphical parameter
## Warning: "color" is not a graphical parameter
## Warning: "color" is not a graphical parameter
## Warning: "color" is not a graphical parameter

plot of chunk unnamed-chunk-12

The two curves differ in that the Weibull density is more compact and concentrated than that of the Normal.

To find the estimated mean of the Normal model, find the area underneath the curve.

S=function(x) {1-pnorm(x, mean=51.44375, sd=8.264714)}
integrate(S, 0, Inf)

## 51.44 with absolute error < 0.0012

To estimate the median, use the qnorm function.

qnorm(.5, mean=51.44375, sd=8.264714)

## [1] 51.44

It appears that the Normal model is more restrictive in that the mean and median must be relatively close in value.

Question 3

AML=read.csv("AML.csv")

To plot the two ecdfs:

plot(ecdf(AML$DurationPlacebo), do.points=F, verticals=T)
plot(ecdf(AML$Duration6MP),verticals=T,do.points=F,add=T,col='red')

plot of chunk unnamed-chunk-16

The placebo group appears to have longer remission times.

kmplacebo=survfit(Surv(DurationPlacebo, StatusPlacebo)~1, data=AML)
km6mp=survfit(Surv(Duration6MP, Status6MP)~1, data=AML)
plot(kmplacebo,mark.time=F,conf.int=F)
lines(km6mp, mark.time=F, conf.int=F, add=T, col='green')

plot of chunk unnamed-chunk-17

The curve for the placebo group is equal to what we found in (3a), but the curve for the 6MP group takes more time to take a “step”, presumably due to the presence of censored data. The probability of survival stays the same when right-censored points are included.

The gap between the two Kaplan-Meier curves is larger than it was using the ecdfs. This is because the Kaplan-Meier curves have taken into account the censored data points (patients in remission).
The Kaplan-Meier curve for the 6-MP group does not drop to 0 because the largest value is a right-censored data point (i.e. some of the patients were still in remission at the end of the study).
Any means calculated using these methods will be an underestimate because the observed values of right-censored data may severely underestimate the actual values, thus causing an overall underestimate of the truth.

Question 4

recidivism=read.csv("Recidivism.csv")

status=1-(recidivism$week==52)
kmrec=survfit(Surv(week, status)~1, data=recidivism)
plot(kmrec,mark.time=F,conf.int=F)

plot of chunk unnamed-chunk-19

From just looking at the plot, we estimate the chance of a released prisoner being re-arrested within 26 weeks to be 10%.
We cannot make any reasonable estimates about the median time until re-arrest because, according to the plot, around 80% of the prisoners spend at least one year out of prison (i.e. are censored at 52). Thus, any estimate that we made about the median would be significantly lower than the actual value.

d(i). To analyze the relationship between time to re-arrest and level of education:

primary=subset(recidivism, educ==2)
kmprimary=survfit(Surv(primary$week)~1, data=primary)

secondary=subset(recidivism, educ==3)
kmsecondary=survfit(Surv(secondary$week)~1, data=secondary)

secondary2=subset(recidivism, educ==4)
kmsecondary2=survfit(Surv(secondary2$week)~1, data=secondary2)

hsgrad=subset(recidivism, educ=5)
kmhsgrad=survfit(Surv(hsgrad$week)~1, data=hsgrad)

coll=subset(recidivism, educ==6)
kmcoll=survfit(Surv(coll$week)~1, data=coll)

plot(kmprimary, mark.time=F, conf.int=F)
lines(kmsecondary, mark.time=F, conf.int=F, col='red')
lines(kmsecondary2, mark.time=F, conf.int=F, col='orange')
lines(kmhsgrad, mark.time=F, conf.int=F, col='green')
lines(kmcoll, mark.time=F, conf.int=F, col='blue')

plot of chunk unnamed-chunk-20

According to the plot, prisoners with at least some post-secondary education appear to have more time until re-arrest, whereas prisoners with an early middle school (7th, 8th or 9th grade) education seem to have the least time until re-arrest.

d(ii). To analyze the relationship between time to re-arrest and marital status:

married=subset(recidivism, mar==1)
unmarried=subset(recidivism, mar==0)
kmmar=survfit(Surv(married$week)~1, data=married)
kmunmar=survfit(Surv(unmarried$week)~1, data=unmarried)
plot(kmmar, mark.time=F, conf.int=F)
lines(kmunmar, mark.time=F, conf.int=F, col='cyan')

plot of chunk unnamed-chunk-21

According to the plot, married prisoners appear to have more time until re-arrest.

Question 5

Ignoring right-censoring leads to (potentially egregious) underestimates of the truth.
The advantage of performing a non-parametric analysys is that we can analyze data without making any assumptions about it.
Derivative
In an incident study, we wait for the event to occur, and then enroll the afflicted people in the study and follow them, but in a prevalent study, we identify people with the disease before starting the study. In short, in a prevalent study, we may not know the exact onset of the disease, whereas we always know in an incident study.

Question 6

False
True
False
False
True
True
False
False
True

j: False

k: False