Stats Assignment 2

library(survival)

## Loading required package: splines

cell=read.csv("Cell.csv")

Question 1

To plot a non-parametric estimate of the hazard:

LifeTable=function(time,status,breaks,finite) {
failed=c(0,hist(time[status==1],breaks=breaks,plot=F)$counts)
censored=c(0,hist(time[status==0],breaks=breaks,plot=F)$counts)
alivestart=length(time)-cumsum(failed[-length(failed)])-cumsum(censored[-length(censored)])
atrisk=alivestart-c(censored[-1]/2)
failrate=failed[-1]/atrisk
survrate=1-failrate
survest=c(1,cumprod(1-failrate)[1:(length(failrate))])
if (finite==0) return(list(Failed=failed[-1],Censored=censored[-1],AliveStart=alivestart,
AtRisk=atrisk[-length(atrisk)],FailRate=failrate[-length(failrate)],
SurvRate=survrate[-length(survrate)],SurvEst=survest[-length(survest)]))
if (finite==1) return(list(Failed=failed[-1],Censored=censored[-1],AliveStart=alivestart,
AtRisk=atrisk,FailRate=failrate,SurvRate=survrate,SurvEst=survest)) }

l=LifeTable(c(1:15), rep(1,16), c(1:15), 0)
SurvDrop = l$SurvEst[1:14] - l$SurvEst[2:15]
HazardEst = ( SurvDrop ) / l$SurvEst[2:15]
hazstep = stepfun( c(1:15) , c(0,HazardEst,0) )
plot(hazstep,do.points=F,ylab='h(k)',xlab='k',main="")

plot of chunk unnamed-chunk-2

We cannot fit an Exponential model to this data because we do not have the raw data itself. In order to fit any parametric model, we need the exact data. Estimates will not suffice.
This risk in this plot is not stable, indicating that an Exponential model is not a good fit for this data.

Question 2

We cannot fit a Kaplan-Meier curve to this data because we do now know when (in what time interval) subjects were lost to follow up.

Age Interval	# of CHD Events [Failure]	# Lost to Follow-Up	Alive at Start [Total-(Failure+# Lost to Follow-Up)]	At Risk [Alive at Start - .5(# Lost to Followup)]	Interval Failure Rate [Failures/At Risk]	Interval Survival Rate [1-Int. Failure Rate]	Survival at left endpt.
45-50	17	29	1571	1562.5	.011	.989	1
50-55	36	60	1525	1507	(36/1507=).024	.976	(.989x1=).989
55-60	62	83	1429	1398	(62/1398=).044	.956	(.976x.989=).965
60-65	76	441	1284	1246	(76/1284=).061	.939	(.956x.965=).922
65-70	50	439	767	742	(50/742=).067	.933	(.922x.939=).866
70-75	9	262	278	267.5	(9/267.5=).034	.966	(.933x.866=).808
75-80	0	7	7	–	–	–	(.966x.808=).781

To plot the actuarial estimate:

l=LifeTable(c(45:80), rep(1,8), c(0,45,50,55,60,65,70,75,80), 0)
step = stepfun( seq(50,80,5) , l$SurvEst )
plot(step , do.points=F , ylab="Survival" , xlab="Days" , main="" , ylim=c(0.4,1)) #step function method
lines( seq(45,80,5) , l$SurvEst , col='red' ) #straight-line method

plot of chunk unnamed-chunk-3

Question 3

\(h(t)\) measures the number of people who failed against the number of people who survived up to a certain point and \(f(t)\) measures the number of people who failed against the total number of people in the story. The only way these will be equal is if the total number of people in the study is equal to the number of people who survive up to any given point, i.e. there are no failures.
The difference between h(t) and f(t) will be negligible for small values of t, but increases as t increases.
For an Exponential function, \(h(t)=\lambda\). Previous events are not considered when calculating risk at a certain point for an Exponential model, so it makes sense that risk () would be static no matter where (in what interval) risk is being calculated.

Question 4

kmcell=survfit(Surv(cell$Time)~1, conf.type='plain', data=cell)
plot(kmcell, mark.time=F, conf.int=F)

plot of chunk unnamed-chunk-4

summary(kmcell)

## Call: survfit(formula = Surv(cell$Time) ~ 1, data = cell, conf.type = "plain")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##   296     17       1   0.9412  0.0571       0.8293        1.000
##   660     16       1   0.8824  0.0781       0.7292        1.000
##   670     15       1   0.8235  0.0925       0.6423        1.000
##   728     14       1   0.7647  0.1029       0.5631        0.966
##   775     13       1   0.7059  0.1105       0.4893        0.922
##   797     12       1   0.6471  0.1159       0.4199        0.874
##   841     11       1   0.5882  0.1194       0.3543        0.822
##   869     10       1   0.5294  0.1211       0.2921        0.767
##   999      9       1   0.4706  0.1211       0.2333        0.708
##  1006      8       1   0.4118  0.1194       0.1778        0.646
##  1035      7       1   0.3529  0.1159       0.1258        0.580
##  1169      6       1   0.2941  0.1105       0.0775        0.511
##  1193      5       1   0.2353  0.1029       0.0337        0.437
##  1415      4       1   0.1765  0.0925       0.0000        0.358
##  1424      3       1   0.1176  0.0781       0.0000        0.271
##  1500      2       1   0.0588  0.0571       0.0000        0.171
##  1540      1       1   0.0000     NaN          NaN          NaN

Our interval estimate of the three-year survival probability is [.1258, .0775], and our estimate of the median failure time is [728, 1193].

The derivative of the numerator is \(1-2S(k)\). Setting this equal to 0 (to find critical points) reveals that this value is maximized at S(k)=.5. The denominator is in proportion to the variance, so it does not affect our calculations.
To fit and plot the Weibull model:

survreg( Surv( Time ) ~ 1 , dist='weibull' , data=cell)

## Call:
## survreg(formula = Surv(Time) ~ 1, data = cell, dist = "weibull")
## 
## Coefficients:
## (Intercept) 
##       7.012 
## 
## Scale= 0.3012 
## 
## Loglik(model)= -122.8   Loglik(intercept only)= -122.8
## n= 17

plot(hazstep,do.points=F,ylab='h(k)',xlab='k',main="")
curve(dweibull(x,shape=1/0.3012274,scale=exp(7.011883))/(1-pweibull(x,shape=1/0.3012274, scale=exp(7.011883))),add=T,col='red')

plot of chunk unnamed-chunk-6

The Weibull model is flat, indicating no wear and tear over time.

Question 5

We assume that half of the censored observations survive any particular stage.
We could make our approximation better if we knew exactly how many points were censored, and at what times.
We can estimate the hazard function from a life-table by either a) finding how far S drops in a particular interval, calculating the density in that interval and then dividing by interval length or b) dividing the interval failure rate by the interval length.
When two groups S1 and S2 are stochastically ordered, S1(k) >= S2(k), for all time points k, i.e. survival is more likely at all points.

Question 6

False
False
True
False
True
True
True
True
False