library(survival)
## Loading required package: splines
unemployment=read.csv("Unemployment.csv")
Status=1-(unemployment$Censored==1)
survreg(Surv(UnemSpell,Status)~1,dist="weibull",data=unemployment)
## Call:
## survreg(formula = Surv(UnemSpell, Status) ~ 1, data = unemployment,
## dist = "weibull")
##
## Coefficients:
## (Intercept)
## 2.988
##
## Scale= 1.353
##
## Loglik(model)= -3379 Loglik(intercept only)= -3379
## n= 1055
1/1.353461 #calculate shape
## [1] 0.7388
exp(2.987603) #calculate scale
## [1] 19.84
It is safer to choose a Weibull model because, given the appropriate parameters, a Weibull model can behave like an exponential model (in fact, exponential models are really a subset of Weibull models). However, Weibull models require two parameters whereas exponential models only require one, and so may not be as cost- and/or time-efficient.
To plot the Weibull CDF model from (1):
curve(pweibull(x,shape=0.7388466,scale=19.83807),from=0, to=150, ylab='Cumulative' , xlab='Months')
curve(pweibull(x,shape=0.7388466,scale=19.83807),from=0, to=150, ylab='Cumulative' , xlab='Months')
lines(ecdf(unemployment$UnemSpell), col='red', do.points=F, verticals=T)
These curves do NOT match because the ecdf ignores the censored data, whereas the Weibull takes the censored data into account.
The Weibull is suggesting longer unemployment spells than the ecdf because the Weibull curve takes longer to increase, indicating that the rate of employment is rising more slowly.
To fit and plot the Kaplan-Meier curve:
kmemploy = survfit(Surv(unemployment$UnemSpell,Status)~1,data=unemployment)
plot( kmemploy, conf.int=F, mark.time=F, xlab='Months', ylab='Survival' )
kmemploy = survfit(Surv(unemployment$UnemSpell,Status)~1,data=unemployment)
plot( kmemploy, conf.int=F, mark.time=F, xlab='Months', ylab='Survival' )
curve(1-pweibull(x,shape=0.7388466,scale=19.83807), add=T, col='red')
9i. The curves are similar until around 45 months, and then after that the Kaplan-Meier remains slightly above the Weibull.
9ii. The Kaplan-Meier takes censored data into account, and so is, in a way, slightly more true to the data than the Weibull.
qweibull(0.5, shape=0.7388466,scale=19.83807)
## [1] 12.08
To find the mean of the Weibull:
S=function(x) {1-pweibull(x,shape=0.7388466,scale=19.83807)}
integrate(S, 0, Inf)
## 23.92 with absolute error < 0.002
To find the median of the Kaplan-Meier:
kmemploy
## Call: survfit(formula = Surv(unemployment$UnemSpell, Status) ~ 1, data = unemployment)
##
## records n.max n.start events median 0.95LCL 0.95UCL
## 1055 1055 1055 848 10 8 12
The median is 10 units.
To find the mean of the Kaplan-Meier:
AUCKM = function(survobj,duration) {
base=c(0,summary(survobj)$time,max(duration))
heights=c(1,summary(survobj)$surv)
new=c()
for(i in 1:length(heights)) { new=c(new,(base[i+1]-base[i])*heights[i]) }
c(sum(new)) }
AUCKM(kmemploy, unemployment$UnemSpell)
## [1] 32.05
male=subset(unemployment, Sex==0)
female=subset(unemployment, Sex==1)
kmmale=survfit(Surv(male$UnemSpell)~1,data=male)
kmfemale=survfit(Surv(female$UnemSpell)~1, data=female)
plot(kmmale, conf.int=F, mark.time=F, xlab='Months', ylab='Survival' )
lines(kmfemale, conf.int=F, mark.time=F, add=T, col='cyan')
The line for males generally stays below the line for females, indicating that males are getting hired faster than females. In terms of duration, after about 99 months, both lines are essentially at 0. This suggests that employment rates are about the same over time.