Surivival Analysis: Lab 1

library(survival)

## Loading required package: splines

unemployment=read.csv("Unemployment.csv")

To fit a Weibull model:

Status=1-(unemployment$Censored==1)
survreg(Surv(UnemSpell,Status)~1,dist="weibull",data=unemployment)

## Call:
## survreg(formula = Surv(UnemSpell, Status) ~ 1, data = unemployment, 
##     dist = "weibull")
## 
## Coefficients:
## (Intercept) 
##       2.988 
## 
## Scale= 1.353 
## 
## Loglik(model)= -3379   Loglik(intercept only)= -3379
## n= 1055

1/1.353461 #calculate shape

## [1] 0.7388

exp(2.987603) #calculate scale

## [1] 19.84

It is safer to choose a Weibull model because, given the appropriate parameters, a Weibull model can behave like an exponential model (in fact, exponential models are really a subset of Weibull models). However, Weibull models require two parameters whereas exponential models only require one, and so may not be as cost- and/or time-efficient.
To plot the Weibull CDF model from (1):

curve(pweibull(x,shape=0.7388466,scale=19.83807),from=0, to=150, ylab='Cumulative' , xlab='Months')

plot of chunk unnamed-chunk-3

To add the ecdf model:

curve(pweibull(x,shape=0.7388466,scale=19.83807),from=0, to=150, ylab='Cumulative' , xlab='Months')
lines(ecdf(unemployment$UnemSpell), col='red', do.points=F, verticals=T)

plot of chunk unnamed-chunk-4

These curves do NOT match because the ecdf ignores the censored data, whereas the Weibull takes the censored data into account.
The Weibull is suggesting longer unemployment spells than the ecdf because the Weibull curve takes longer to increase, indicating that the rate of employment is rising more slowly.
To fit and plot the Kaplan-Meier curve:

kmemploy = survfit(Surv(unemployment$UnemSpell,Status)~1,data=unemployment)
plot( kmemploy, conf.int=F, mark.time=F, xlab='Months', ylab='Survival' )

plot of chunk unnamed-chunk-5

To add the Weibull survival curve:

kmemploy = survfit(Surv(unemployment$UnemSpell,Status)~1,data=unemployment)
plot( kmemploy, conf.int=F, mark.time=F, xlab='Months', ylab='Survival' )
curve(1-pweibull(x,shape=0.7388466,scale=19.83807), add=T, col='red')

plot of chunk unnamed-chunk-6

9i. The curves are similar until around 45 months, and then after that the Kaplan-Meier remains slightly above the Weibull.

9ii. The Kaplan-Meier takes censored data into account, and so is, in a way, slightly more true to the data than the Weibull.

To find the median of the Weibull:

qweibull(0.5,  shape=0.7388466,scale=19.83807)

## [1] 12.08

To find the mean of the Weibull:

S=function(x) {1-pweibull(x,shape=0.7388466,scale=19.83807)}
integrate(S, 0, Inf)

## 23.92 with absolute error < 0.002

To find the median of the Kaplan-Meier:

kmemploy

## Call: survfit(formula = Surv(unemployment$UnemSpell, Status) ~ 1, data = unemployment)
## 
## records   n.max n.start  events  median 0.95LCL 0.95UCL 
##    1055    1055    1055     848      10       8      12

The median is 10 units.

To find the mean of the Kaplan-Meier:

AUCKM = function(survobj,duration) {
base=c(0,summary(survobj)$time,max(duration))
heights=c(1,summary(survobj)$surv)
new=c()
for(i in 1:length(heights)) { new=c(new,(base[i+1]-base[i])*heights[i]) }
c(sum(new)) }
AUCKM(kmemploy, unemployment$UnemSpell)

## [1] 32.05

To compare unemployment durations of males and females:

male=subset(unemployment, Sex==0)
female=subset(unemployment, Sex==1)
kmmale=survfit(Surv(male$UnemSpell)~1,data=male)
kmfemale=survfit(Surv(female$UnemSpell)~1, data=female)
plot(kmmale, conf.int=F, mark.time=F, xlab='Months', ylab='Survival' )
lines(kmfemale, conf.int=F, mark.time=F, add=T, col='cyan')

plot of chunk unnamed-chunk-11

The line for males generally stays below the line for females, indicating that males are getting hired faster than females. In terms of duration, after about 99 months, both lines are essentially at 0. This suggests that employment rates are about the same over time.

Surivival Analysis: Lab 1

Chidinma Emenike

9/17/2014