library(survival)
## Loading required package: splines
Nursing=read.csv("Nursing.csv")
Oscar=read.csv("Oscar.csv")
CoxSnell = function(cs,status,xlim=NULL,ylim=NULL) {
kmcs = survfit(Surv(jitter(cs,amount=(max(cs)-min(cs))/1000),status)~1)$surv
plot(log(-log(kmcs))~sort(log(cs)),xlab="log(Cox-Snell)",ylab="log(-log(S(Cox-Snell)))",xlim=xlim,ylim=ylim)
abline(0,1,col='red')
}
survreg(Surv(LOS+.001, Status)~Age+Treatment+Gender+Marriage+Health, dist='exponential', data=Nursing)
## Call:
## survreg(formula = Surv(LOS + 0.001, Status) ~ Age + Treatment +
## Gender + Marriage + Health, data = Nursing, dist = "exponential")
##
## Coefficients:
## (Intercept) Age Treatment Gender Marriage Health
## 6.029828 0.004851 0.282041 -0.432234 -0.191577 -0.221548
##
## Scale fixed at 1
##
## Loglik(model)= -8511 Loglik(intercept only)= -8582
## Chisq= 143.5 on 5 degrees of freedom, p= 0
## n= 1601
The p-value for the Treatment term (6.43e-07 << .05) indicates that there is a strong association between whether or not the hospitals receive bonuses for reducing a patient’s length of stay, but it is surprising that the Treatment variable has a positive coefficient (indicating that patients in hospitals that receive bonuses actually stay longer than patients in hospitals that don’t).
The two models are
survreg(Surv(LOS+.001, Status)~1, dist='exponential', data=Nursing) #HA
## Call:
## survreg(formula = Surv(LOS + 0.001, Status) ~ 1, data = Nursing,
## dist = "exponential")
##
## Coefficients:
## (Intercept)
## 5.71
##
## Scale fixed at 1
##
## Loglik(model)= -8582 Loglik(intercept only)= -8582
## n= 1601
and
survreg(Surv(LOS+.001, Status)~Age+Treatment+Gender+Marriage+Health, dist='exponential', data=Nursing) #H0
## Call:
## survreg(formula = Surv(LOS + 0.001, Status) ~ Age + Treatment +
## Gender + Marriage + Health, data = Nursing, dist = "exponential")
##
## Coefficients:
## (Intercept) Age Treatment Gender Marriage Health
## 6.029828 0.004851 0.282041 -0.432234 -0.191577 -0.221548
##
## Scale fixed at 1
##
## Loglik(model)= -8511 Loglik(intercept only)= -8582
## Chisq= 143.5 on 5 degrees of freedom, p= 0
## n= 1601
To perform the likelihood ratio test:
2*(-8510.7--8582.5) #Calculate how much better the model is with additional covariates
## [1] 143.6
1-pchisq(143.6,7) #calculate p-value
## [1] 0
The p-value is 0, so we reject the null hypothesis. There is strong evidence that having no covariates improves the quality of the model.
CS = -log( 1 - pexp(Nursing$LOS , 1/exp(6.02983 + .00485*Nursing$Age + .28204*Nursing$Treatment -.43223*Nursing$Gender -.19158*Nursing$Marriage -.22155*Nursing$Health ) ) )
CoxSnell(CS, 1-Nursing$Status, xlim=c(-5,1))
An Exponential model does not appear to be adequate for this data.
Question (b) asks us to consider which type of Exponential model would be a better way to represent the data, whereas (c) asks whether an Exponential model would be appropriate for the data at all.
To fit the Weibull model:
survreg(Surv(LOS+.001, Status)~Age+Treatment+Gender+Marriage+Health, dist='weibull', data=Nursing)
## Call:
## survreg(formula = Surv(LOS + 0.001, Status) ~ Age + Treatment +
## Gender + Marriage + Health, data = Nursing, dist = "weibull")
##
## Coefficients:
## (Intercept) Age Treatment Gender Marriage Health
## 5.996181 0.007726 0.191157 -0.580347 -0.246019 -0.294076
##
## Scale= 1.665
##
## Loglik(model)= -8208 Loglik(intercept only)= -8250
## Chisq= 82.53 on 5 degrees of freedom, p= 2.2e-16
## n= 1601
To create the Cox-Snell residuals:
CS = -log( 1 - pweibull(Nursing$LOS , shape=1/1.664759 , scale=exp(5.996180751 + 0.007726179*Nursing$Age + 0.191156845*Nursing$Treatment -0.580346981*Nursing$Gender -0.246018655*Nursing$Marriage -0.294076278*Nursing$Health) ) )
CoxSnell(CS, 1-Nursing$Status, xlim=c(-5,1))
We select the Weibull model since it appears to have slightly less disparity than the Exponential model.
To compare AIC values:
2*(7-8510.7) #Exponential AIC value/test statistic (# parameters - loglik(model))
## [1] -17007
2*(7-8208.3) #Weibull AIC value/test statistic
## [1] -16403
From these results, we would choose the Weibull model (slightly smaller absolute AIC value).
To perform the likelihood ratio test:
2*(-8510.7--8208.3) #Calculate test statistic (how much better the Exponential model is than the Weibull model)
## [1] -604.8
1-pchisq(604.8,7) #calculate p-value
## [1] 0
The p-value is 0, leading us to reject the null hypothesis (that the Exponential model is better) in favor of the alternative (that the Weibull model is better).
The results from both the AIC values and the likelihood ratio test match with our answer to (e).
survreg(Surv(LOS+.001, Status)~Age+Treatment+Gender+Marriage+Health, dist='lognormal', data=Nursing)
## Call:
## survreg(formula = Surv(LOS + 0.001, Status) ~ Age + Treatment +
## Gender + Marriage + Health, data = Nursing, dist = "lognormal")
##
## Coefficients:
## (Intercept) Age Treatment Gender Marriage Health
## 5.322090 0.008116 0.189477 -0.664649 -0.245244 -0.336093
##
## Scale= 2.179
##
## Loglik(model)= -8228 Loglik(intercept only)= -8265
## Chisq= 73.28 on 5 degrees of freedom, p= 2.1e-14
## n= 1601
To create the Cox-Snell residuals:
CS = -log( 1 - plnorm(Nursing$LOS , 5.322090023 + 0.008115688*Nursing$Age + 0.189477044*Nursing$Treatment-0.664649203*Nursing$Gender-0.245243533*Nursing$Marriage -0.336092546*Nursing$Health, 2.178653 ) )
CoxSnell(CS,1-Nursing$Status, xlim=c(-5,1))
Judging from the Cox-Snell residuals, we would prefer the lognormal model appear to the Weibull model.
Age=Oscar$Final - Oscar$Birth
EverWon=(Oscar$Wins > 0)+0
ml=survreg(Surv(Age, 1-Alive)~EverWon, dist='lognormal', data=Oscar)
ml
## Call:
## survreg(formula = Surv(Age, 1 - Alive) ~ EverWon, data = Oscar,
## dist = "lognormal")
##
## Coefficients:
## (Intercept) EverWon
## 4.32867 0.06777
##
## Scale= 0.2263
##
## Loglik(model)= -3604 Loglik(intercept only)= -3610
## Chisq= 11.69 on 1 degrees of freedom, p= 0.00063
## n= 1670
mw=survreg(Surv(Age, 1-Alive)~EverWon, dist='weibull', data=Oscar)
mw
## Call:
## survreg(formula = Surv(Age, 1 - Alive) ~ EverWon, data = Oscar,
## dist = "weibull")
##
## Coefficients:
## (Intercept) EverWon
## 4.39984 0.04097
##
## Scale= 0.1466
##
## Loglik(model)= -3504 Loglik(intercept only)= -3508
## Chisq= 7.55 on 1 degrees of freedom, p= 0.006
## n= 1670
According to the models, Oscar winners do tend to live longer than other actors. We know this because the EverWon term has a positive coefficient in both cases(.068 for the Log-Normal model and .041 for the Weibull model). The time ratio is exp(0.2263067) or about 1.254 for the Log-Normal model, and exp(0.1465575), or 1.158 for the Weibull model.
qlnorm(.5,4.32866905,0.2263067)
## [1] 75.84
qlnorm(.5,4.32866905+0.06776662,0.2263067)
## [1] 81.16
To calculate the median of mw for Oscar winners and non-winners:
qweibull(0.5, shape=4.39984197, 1/0.2263067) #non-winners
## [1] 4.066
qweibull(0.5, shape=4.39984197 + 0.04097455, 1/0.2263067) #winners
## [1] 4.069
We cannot perform a formal hypothesis test (i.e. likelihood ratio test) to choose between these models because they are not nested.
To create the Cox-Snell residuals:
CS1 = -log( 1 - plnorm(Age, 4.32866905 + 0.06776662*EverWon, 0.2263067 ) ) #using info from ml
CoxSnell(CS1, 1-Oscar$Alive, xlim=c(-5,1))
CS = -log( 1 - pweibull(Age, shape=1/0.1465575, scale=exp(4.39984197 + 0.04097455*EverWon) ) ) #using info from mw
CoxSnell(CS, 1-Oscar$Alive, xlim=c(-5,1))
2*(3-3604) #Log-Normal AIC value/test statistic (# parameters - loglik(model))
## [1] -7202
2*(3-3503.8) #Weibull AIC value/test statistic
## [1] -7002
According to the AIC values, the Weibull model performs better.
curve(dlnorm(x,4.32866905,0.2263067)/(1-plnorm(x,4.32866905)), 0, 75, ylab="h(k)")
curve(dlnorm(x,4.32866905+0.06776662,0.2263067)/(1-plnorm(x,4.32866905+0.06776662,0.2263067)),add=T, lty=2)
To plot the estimated hazard functions for the Weibull model:
curve(dweibull(x,1/0.1465575,exp(4.32866905))/(1-pweibull(x,1/0.1465575,exp(4.32866905))),0,150,ylab="h(k)")
curve(dweibull(x,1/0.1465575,exp(4.32866905 + 0.04097455))/(1-pweibull(x,1/0.1465575,exp(4.32866905 + 0.04097455))),add=T,lty=2)
The fact that Log-Normal models can have nonmonotonic hazards does not appear to give it an advantage over the Weibull model. The expected hazard increases linearly, which means that the Weibull model (for which hazards are always increasing) is a better fit.
Winning an Oscar could inspire an actor to create more pieces, or gloat about it to their friends, both of which require living. On the other hand, not winning an Oscar could depress an actor and make them want to give up on life (or at least their acting career).
An actor who has lived longer has (most likely) acted in more things, and thus has a higher chance of winning. For example, someone who has acted in 100 pieces over their 40-year-long career has more of a chance than winning at least one Oscar in that time than someone who has acted in only three pieces over their five-year-long career, by sheer numbers. To improve the comparison between Oscar winners and non-winners, we could perhaps change EverWon into a categorical variable (Yes/No), in order to control for actors who have won multiple Oscars.
The fancy algebra in question:
2(p+1)-2\(L_{M2}\) < 2p -2\(L_{M1}\) #Formula for AIC: 2(p-l), where p is the number of parameters and l is the log-likelihood value
2-\(L_{M2}\)< 2\(L_{M1}\) #distribute, and cancel out the 2p terms
2(\(L_{M2}\)-\(L_{M1}\))>2 # rearrange
\(L_{M2}\)-\(L_{M1}\)>1 #simplify
qchisq(0.95, 1)/2
## [1] 1.921
So, LRT will select M2 if the log-likelihood value increases by at least 1.920 units.
AIC is more willing to pick larger models because the acceptable difference in log-likelihood values is lower than when using LRT.
Because the threshold for selecting M2 using AIC is 2, we need to find the corresponding p-value at 2 that will cause LRT to select M2. We can do this by typing
1-pchisq(2,1)
## [1] 0.1573
So, using LRT to decide whether to include an additional covariate will be the same as using AIC if \(\alpha\) = .157 .