Math 353-01: Assignment # 1 Solutions

1.

a.

Non-parametric estimate of P(X <= 1175) = 12/17 = 0.7058824, obbtained simply be counting how many observed data points are at most 1175 (note that there is no censoring in this data).

b.

Exponential Model:

survreg(Surv(Time,rep(1,17))~1,dist='exponential',data=Cell)

## Call:
## survreg(formula = Surv(Time, rep(1, 17)) ~ 1, data = Cell, dist = "exponential")
## 
## Coefficients:
## (Intercept) 
##       6.903 
## 
## Scale fixed at 1 
## 
## Loglik(model)= -134.3   Loglik(intercept only)= -134.3
## n= 17

Estimate of P(X <= 1175) = pexp(1175,1/exp(6.902861)) = 0.6929562

Weibull Model:

survreg(Surv(Time,rep(1,17))~1,dist='weibull',data=Cell)

## Call:
## survreg(formula = Surv(Time, rep(1, 17)) ~ 1, data = Cell, dist = "weibull")
## 
## Coefficients:
## (Intercept) 
##       7.012 
## 
## Scale= 0.3012 
## 
## Loglik(model)= -122.8   Loglik(intercept only)= -122.8
## n= 17

Estimate of P(X <= 1175) = pweibull(1175,shape=1/0.3012274,scale=exp(7.011883)) = 0.7014677

c.

plot(ecdf(Cell$Time),do.points=F,verticals=T,main="",ylab="Cumulative",xlab="Days in service")
curve(pexp(x,1/exp(6.902861)),add=T,col='red')
curve(pweibull(x,shape=1/0.3012274,scale=exp(7.011883)),add=T,col='blue')

plot of chunk unnamed-chunk-4 The Weibull and non-parametric estimates agree, while the Exponential model is very different. This would probably make me uneasy about using the Exponential for these failure times.

d.

No, notice that for the answers to (a) and (b), all three models agreed very nicely. I wanted to illustrate that simply because models agree for one quantity (the mean, say) or at one specific point (here, 1175), that does not mean they are identical, or even near identical models overall. The graph from © shows that close to 1175, the 3 curves nearly intersect, but that the Exponential model is very different from the other two models at almost all other places in the distribution.

e.

mean(Cell$Time)

## [1] 995.1

The value 995.1176 days corresponds to the area under the ecdf in the graph from ©. You could also have obtained this area using the AUCKM function that I showed in class.

2.

a.

The Strength variable is right-censored in this example. Right-censored data refers to a situation where the observations are incomplete in a way that we know a lower bound for their value. That is, we know that an observation is at least the listed value. This is the situation here: the value recorded for the damaged cords is a lower bound for the actual strength of those cords.

b.

Weibull Model:

survreg(Surv(Strength,(1-Damaged))~1,dist='weibull',data=Cord)

## Call:
## survreg(formula = Surv(Strength, (1 - Damaged)) ~ 1, data = Cord, 
##     dist = "weibull")
## 
## Coefficients:
## (Intercept) 
##       4.026 
## 
## Scale= 0.06132 
## 
## Loglik(model)= -115.8   Loglik(intercept only)= -115.8
## n= 48

curve(dweibull(x,shape=1/0.06131651,scale=exp(4.025853)),30,70,ylab='Density',xlab='Strength')

plot of chunk unnamed-chunk-6 Yes, the density shows a mild amount of left-skewness.

c.

To find the estimated mean strength, find the area under the estimated survival curve.

f=function(x) 1-pweibull(x,shape=1/0.06131651,scale=exp(4.025853))
integrate(f,0,Inf)

## 54.24 with absolute error < 0.0011

To find the estimated median strength, find the 50th percentile of the estimated Weibull:

qweibull(0.5,shape=1/0.06131651,scale=exp(4.025853))

## [1] 54.78

Indeed, the estimated median is slightly greater than the estimated median, as would be expected for mildly left-skewed data.

d.

From the graph in ©, it looks like the Normal might be a reasonable model since the estimated Weibull looks relatively bell-shaped and symmetric.

e.

Normal Model:

survreg(Surv(Strength,(1-Damaged))~1,dist='gaussian',data=Cord)

## Call:
## survreg(formula = Surv(Strength, (1 - Damaged)) ~ 1, data = Cord, 
##     dist = "gaussian")
## 
## Coefficients:
## (Intercept) 
##       54.15 
## 
## Scale= 4.751 
## 
## Loglik(model)= -122.1   Loglik(intercept only)= -122.1
## n= 48

Adding the estimated Normal density to the graph from (b):

curve(dnorm(x,mean=54.15332,sd=4.751413),add=T,col='red')

## Error: plot.new has not been called yet

Even though it seemed like the Normal model would be reasonable, since the estimated Weibull model looked roughly mound-shaped and symmetric, the two models differ somewhat substantially. In terms of central tendency, the estimated Normal is roughly the same as the estimated Weibull (maybe ever so slightly pushed to the left) with an estimated mean strength of 54.15332, compared to 54.24244 for the Weibull. But the Normal does seem to be a little more spread (or variable) than the estimated Weibull.

f.

You can find this estimated mean using the integrate command, but it is also just the coefficient returned by the Normal model in (e): Estimated Mean = 54.15332

You can find the estimated median using the qnorm command, but we know that it must be the same as the mean, since the Normal distribution is symmetric: Estimated Median = 54.15332

This illustrates another way in which the Normal model is said to be more restrictive than the Weibull: the mean and median are forced to be identical in the Normal model.

3.

a.

plot(ecdf(AML$DurationPlacebo),do.points=F,verticals=T,main="",ylab="Cumulative",xlab="Survival (in weeks)",xlim=c(0,35))
lines(ecdf(AML$Duration6MP),do.points=F,verticals=T,col='red')

plot of chunk unnamed-chunk-11 The red group, i.e. the 6MP group has longer remission times. Remember that for survival curves, one curve being above another indicates longer survival, but for cumulative distribution functions, a curve being below indicates longer survival (since it can be thought of as “taking longer”“ to get to 1, in other words, extending out to larger survival times).

b.

KMP=survfit(Surv(AML$DurationPlacebo,AML$StatusPlacebo)~1)
KM6=survfit(Surv(AML$Duration6MP,AML$Status6MP)~1)
plot(KMP,mark.time=F,conf.int=F,xlim=c(0,35))
lines(KM6,mark.time=F,conf.int=F,col='red')

plot of chunk unnamed-chunk-12 For the placebo group, the Kaplan-Meier curve is indeed simply "1-” the ecdf from (a), but this is not the case for the 6MP group.

The reason for this is that the placebo group actually had no censoring, so drawing the Kaplan-Meier is equivalent to drawing “1-” the ecdf. But, censoring is present for the 6MP group, thus, the Kaplan-Meier is clearly distinct from simple drawing “1-” the ecdf. That is, in the absence of censoring, the Kaplan-Meier reduce to the empirical survival curve.

c.

The gap between the Kaplan-Meier curves is larger than it was for the ecdfs. The reason for this follows from our observation in (b): there is no censoring for the placebo group, but there is for the 6MP group.Thus, ignoring censoring (and drawing ecdfs) as we did in (a) does not affect the estimates for the placebo group. However, doing the same for the 6MP will lead to an underestimation of survival. Thus, when the Kaplan-Meier accounts for censoring, it will not change the placebo survival curve, but it will increase the 6MP survival curve (i.e. push it even higher on the graph).

d.

The reason that the Kaplan-Meier curve drops all the way to 0 for the placebo group is because the largest observation is uncensored (in fact, all the placebo observations are uncensored). For the 6MP group, on the other hand, the largest observation is censored (at 35 weeks). Whenever the largest observation is censored, the Kaplan-Meier curve for that variable will not drop all the way down to 0.

Largest 6MP censored observation:

max(AML$Duration6MP[AML$Status6MP==0])

## [1] 35

Largest 6MP uncensored observation:

max(AML$Duration6MP[AML$Status6MP==1])

## [1] 23

e.

For estimating the medians, simply print out the Kaplan-Meier objects:

KMP

## Call: survfit(formula = Surv(AML$DurationPlacebo, AML$StatusPlacebo) ~ 
##     1)
## 
## records   n.max n.start  events  median 0.95LCL 0.95UCL 
##      21      21      21      21       8       4      12

KM6

## Call: survfit(formula = Surv(AML$Duration6MP, AML$Status6MP) ~ 1)
## 
## records   n.max n.start  events  median 0.95LCL 0.95UCL 
##      21      21      21       9      23      16      NA

Estimated difference in medians is: 23 – 8 = 15 weeks.

For estimating the means, use the AUCKM function from class:

Estimated mean survival in weeks for 6MP group:

AUCKM(KM6,AML$Duration6MP)

## [1] 23.29

Estimated mean survival in weeks for placebo group:

AUCKM(KMP,AML$DurationPlacebo)

## [1] 8.667

Estimated difference in means is: 23.28739 – 8.666667 = 14.62072 weeks

The difference in medians is not an underestimate, but the difference in means is an underestimate of the true difference in means. The reason is that, since the 6MP survival curve does not drop to 0, we need to make an assumption in order to find the area under that curve. So, we drop a perpendicular down to the x-axis. But this is a worst case scenario, and our estimated mean (area under curve) should truly be larger. This does not happen for the placebo group. Thus, 23.28739 weeks is an underestimate of the average remission time back to this form of leukemia when remission was induced with 6-MP.

4.

a.

KM = survfit(Surv(week,arrest)~1,data=Recidivism)
plot(KM,mark.time=F,conf.int=F,ylim=c(0.7,1),ylab="Survival",xlab="Weeks")

plot of chunk unnamed-chunk-18

b.

By eye, from the graph in (a), I would say that P(X <= 26) = 1 - 0.875 = 0.125 (approximately).

c.

No, since the Kaplan-Meier curve does not drop to 0.5, we cannot estimate the median time to re-arrest. We can say with absolute certainty that any estimate of the median time to re-arrest should be greater than 52 weeks since that is how long these subjects were followed for and the Kaplan-Meier curve had still not dropped below 0.5.

d.

There are 5 distinct education levels, described on the assignment sheet.

table(Recidivism$educ)

## 
##   2   3   4   5   6 
##  24 239 119  39  11

We fit separate Kaplan-Meier curves for each of these education levels:

KMe=survfit(Surv(week,arrest)~educ,data=Recidivism)

Then plot these curves on the same graph:

plot(KMe,mark.time=F,conf.int=F,main="",ylab="Survival",xlab="Weeks",ylim=c(0.65,1),col=1:5)

plot of chunk unnamed-chunk-21

There does not seem to be any discernible relationship between time to re-arrest and education level. Amount of education is increasing (according to my coloring) as follows: black, red, green, blue, and cyan. Although the highest education level (cyan line) does very well, it does not do much better than the lowest education level (black line). Moreover, the lowest education group performs better or similarly to the 2nd highest education level (blue line) at all time points.

There may be some evidence of a relationship if we combine the bottom three education levels (i.e. those individuals who are not high school graduates) and compare them to the top two levels. The reason for this is that the red and green lines represent, respectively, the 2nd and 3rd lowest education groups, and these groups, arguably, perform worst in terms of time to re-arrest. Also, keep in mind that they contain many more data points than the lowest education level, and so by combining categories in this way, they might dominate that lowest category. We can investigate this together, if you like. Soon we will be able to build models and perform hypothesis tests to answer these questions more formally.

Time to re-arrest and marital status:

KMm=survfit(Surv(week,arrest)~mar,data=Recidivism)
plot(KMm,mark.time=F,conf.int=F,main="",ylab="Survival",xlab="Weeks",ylim=c(0.65,1),col=1:2)

plot of chunk unnamed-chunk-22

Marital status, perhaps expectedly, seems to have a fairly strong relationship with time to re-arrest with married prisoners (at time of release) having longer times to re-arrest.

5.

a.

If we pretend that right-censored observations are actually exact, then we will underestimate survival. This makes sense since a right-censored value only provides a lower bound of the actual survival time.

b.

The main advantage of performing a non-parametric analysis is that it will unbiased estimate the true survival distribution without requiring any assumption about the form that distribution takes.

c.

Thus, the density function is the derivative of the CDF.

d.

In a prevalent follow-up study, subject are recruited who have already experienced the initiating event of interest (e.g. if we are interested in studying survival time with Alzheimer’s disease, we would recruit into the study individuals who have already had onset of Alzheimer's disease). Then, we could attempt to ascertain when the initiating even occurred and follow these subjects until the terminating event (or censoring, which ever happens first) occurs in order to obtain our data.

In an incident follow-up study, a cohort of subjects who have not experienced the initiating event are followed until some of them do experience this initiating event (e.g. Alzheimer’s free subjects are followed until some people develop Alzheimer’s disease). Those subjects are then followed until the terminating event (or censoring, which ever happens first) occurs in order to obtain our data.

6.

a.

False

b.

True

c.

False

d.

False

e.

True

f.

True

g.

False

h.

False

i.

True

j.

False

k.

False