library(survival)
## Loading required package: splines
Nursing=read.csv("Nursing.csv")
Oscar=read.csv("Oscar.csv")
CoxSnell = function(cs,status,xlim=NULL,ylim=NULL)  {
kmcs = survfit(Surv(jitter(cs,amount=(max(cs)-min(cs))/1000),status)~1)$surv
plot(log(-log(kmcs))~sort(log(cs)),xlab="log(Cox-Snell)",ylab="log(-log(S(Cox-Snell)))",xlim=xlim,ylim=ylim)
abline(0,1,col='red')
}

Question 1

  1. To fit an Exponential AFT model using all five covariates:
survreg(Surv(LOS+.001, Status)~Age+Treatment+Gender+Marriage+Health, dist='exponential', data=Nursing)
## Call:
## survreg(formula = Surv(LOS + 0.001, Status) ~ Age + Treatment + 
##     Gender + Marriage + Health, data = Nursing, dist = "exponential")
## 
## Coefficients:
## (Intercept)         Age   Treatment      Gender    Marriage      Health 
##    6.029828    0.004851    0.282041   -0.432234   -0.191577   -0.221548 
## 
## Scale fixed at 1 
## 
## Loglik(model)= -8511   Loglik(intercept only)= -8582
##  Chisq= 143.5 on 5 degrees of freedom, p= 0 
## n= 1601

The p-value for the Treatment term (6.43e-07 << .05) indicates that there is a strong association between whether or not the hospitals receive bonuses for reducing a patient’s length of stay, but it is surprising that the Treatment variable has a positive coefficient (indicating that patients in hospitals that receive bonuses actually stay longer than patients in hospitals that don’t).

  1. Our null hypothesis is that the original model is better than the model with no covariates, and our alternative hypotheis is that the model with no covariates is better.

The two models are

survreg(Surv(LOS+.001, Status)~1, dist='exponential', data=Nursing) #HA
## Call:
## survreg(formula = Surv(LOS + 0.001, Status) ~ 1, data = Nursing, 
##     dist = "exponential")
## 
## Coefficients:
## (Intercept) 
##        5.71 
## 
## Scale fixed at 1 
## 
## Loglik(model)= -8582   Loglik(intercept only)= -8582
## n= 1601

and

survreg(Surv(LOS+.001, Status)~Age+Treatment+Gender+Marriage+Health, dist='exponential', data=Nursing) #H0
## Call:
## survreg(formula = Surv(LOS + 0.001, Status) ~ Age + Treatment + 
##     Gender + Marriage + Health, data = Nursing, dist = "exponential")
## 
## Coefficients:
## (Intercept)         Age   Treatment      Gender    Marriage      Health 
##    6.029828    0.004851    0.282041   -0.432234   -0.191577   -0.221548 
## 
## Scale fixed at 1 
## 
## Loglik(model)= -8511   Loglik(intercept only)= -8582
##  Chisq= 143.5 on 5 degrees of freedom, p= 0 
## n= 1601

To perform the likelihood ratio test:

2*(-8510.7--8582.5) #Calculate how much better the model is with additional covariates
## [1] 143.6
1-pchisq(143.6,7) #calculate p-value
## [1] 0

The p-value is 0, so we reject the null hypothesis. There is strong evidence that having no covariates improves the quality of the model.

  1. To create the Cox-Snell residuals:
CS = -log( 1 - pexp(Nursing$LOS , 1/exp(6.02983 + .00485*Nursing$Age + .28204*Nursing$Treatment -.43223*Nursing$Gender -.19158*Nursing$Marriage -.22155*Nursing$Health ) ) )
CoxSnell(CS, 1-Nursing$Status, xlim=c(-5,1))

plot of chunk unnamed-chunk-6

An Exponential model does not appear to be adequate for this data.

  1. Question (b) asks us to consider which type of Exponential model would be a better way to represent the data, whereas (c) asks whether an Exponential model would be appropriate for the data at all.

  2. To fit the Weibull model:

survreg(Surv(LOS+.001, Status)~Age+Treatment+Gender+Marriage+Health, dist='weibull', data=Nursing)
## Call:
## survreg(formula = Surv(LOS + 0.001, Status) ~ Age + Treatment + 
##     Gender + Marriage + Health, data = Nursing, dist = "weibull")
## 
## Coefficients:
## (Intercept)         Age   Treatment      Gender    Marriage      Health 
##    5.996181    0.007726    0.191157   -0.580347   -0.246019   -0.294076 
## 
## Scale= 1.665 
## 
## Loglik(model)= -8208   Loglik(intercept only)= -8250
##  Chisq= 82.53 on 5 degrees of freedom, p= 2.2e-16 
## n= 1601

To create the Cox-Snell residuals:

CS = -log( 1 - pweibull(Nursing$LOS , shape=1/1.664759 , scale=exp(5.996180751 + 0.007726179*Nursing$Age + 0.191156845*Nursing$Treatment -0.580346981*Nursing$Gender -0.246018655*Nursing$Marriage -0.294076278*Nursing$Health) ) )
CoxSnell(CS, 1-Nursing$Status, xlim=c(-5,1))

plot of chunk unnamed-chunk-8

We select the Weibull model since it appears to have slightly less disparity than the Exponential model.

  1. We could compare the AIC values of both models (appropriate for nested and non-nested models) or perform a likelihood ratio test (only appropriate for nested models, but okay here because Exponential models are nested within Weibull models).

To compare AIC values:

2*(7-8510.7) #Exponential AIC value/test statistic (# parameters - loglik(model))
## [1] -17007
2*(7-8208.3) #Weibull AIC value/test statistic
## [1] -16403

From these results, we would choose the Weibull model (slightly smaller absolute AIC value).

To perform the likelihood ratio test:

2*(-8510.7--8208.3) #Calculate test statistic (how much better the Exponential model is than the Weibull model)
## [1] -604.8
1-pchisq(604.8,7) #calculate p-value
## [1] 0

The p-value is 0, leading us to reject the null hypothesis (that the Exponential model is better) in favor of the alternative (that the Weibull model is better).

The results from both the AIC values and the likelihood ratio test match with our answer to (e).

  1. To fit the log-normal model:
survreg(Surv(LOS+.001, Status)~Age+Treatment+Gender+Marriage+Health, dist='lognormal', data=Nursing)
## Call:
## survreg(formula = Surv(LOS + 0.001, Status) ~ Age + Treatment + 
##     Gender + Marriage + Health, data = Nursing, dist = "lognormal")
## 
## Coefficients:
## (Intercept)         Age   Treatment      Gender    Marriage      Health 
##    5.322090    0.008116    0.189477   -0.664649   -0.245244   -0.336093 
## 
## Scale= 2.179 
## 
## Loglik(model)= -8228   Loglik(intercept only)= -8265
##  Chisq= 73.28 on 5 degrees of freedom, p= 2.1e-14 
## n= 1601

To create the Cox-Snell residuals:

CS = -log( 1 - plnorm(Nursing$LOS , 5.322090023 + 0.008115688*Nursing$Age + 0.189477044*Nursing$Treatment-0.664649203*Nursing$Gender-0.245243533*Nursing$Marriage -0.336092546*Nursing$Health, 2.178653 ) )
CoxSnell(CS,1-Nursing$Status, xlim=c(-5,1))

plot of chunk unnamed-chunk-12

Judging from the Cox-Snell residuals, we would prefer the lognormal model appear to the Weibull model.

  1. Given that the results from the LRT we performed in (b) indicated that a model with no covariates performed better, as well as our conflicting results from including Treatment in the model from (a), and the inability of any of the models chosen to model the data with a decent degree of accuracy, we conclude that Treatment does NOT affect LOS.

Question 2

Age=Oscar$Final - Oscar$Birth
EverWon=(Oscar$Wins > 0)+0
  1. To make the models:
ml=survreg(Surv(Age, 1-Alive)~EverWon, dist='lognormal', data=Oscar)
ml
## Call:
## survreg(formula = Surv(Age, 1 - Alive) ~ EverWon, data = Oscar, 
##     dist = "lognormal")
## 
## Coefficients:
## (Intercept)     EverWon 
##     4.32867     0.06777 
## 
## Scale= 0.2263 
## 
## Loglik(model)= -3604   Loglik(intercept only)= -3610
##  Chisq= 11.69 on 1 degrees of freedom, p= 0.00063 
## n= 1670
mw=survreg(Surv(Age, 1-Alive)~EverWon, dist='weibull', data=Oscar)
mw
## Call:
## survreg(formula = Surv(Age, 1 - Alive) ~ EverWon, data = Oscar, 
##     dist = "weibull")
## 
## Coefficients:
## (Intercept)     EverWon 
##     4.39984     0.04097 
## 
## Scale= 0.1466 
## 
## Loglik(model)= -3504   Loglik(intercept only)= -3508
##  Chisq= 7.55 on 1 degrees of freedom, p= 0.006 
## n= 1670

According to the models, Oscar winners do tend to live longer than other actors. We know this because the EverWon term has a positive coefficient in both cases(.068 for the Log-Normal model and .041 for the Weibull model). The time ratio is exp(0.2263067) or about 1.254 for the Log-Normal model, and exp(0.1465575), or 1.158 for the Weibull model.

  1. To calculate the median of ml for Oscar winners and non-winners:
qlnorm(.5,4.32866905,0.2263067)
## [1] 75.84
qlnorm(.5,4.32866905+0.06776662,0.2263067)
## [1] 81.16

To calculate the median of mw for Oscar winners and non-winners:

qweibull(0.5, shape=4.39984197, 1/0.2263067) #non-winners
## [1] 4.066
qweibull(0.5, shape=4.39984197 + 0.04097455, 1/0.2263067) #winners
## [1] 4.069
  1. We cannot perform a formal hypothesis test (i.e. likelihood ratio test) to choose between these models because they are not nested.

  2. To create the Cox-Snell residuals:

CS1 = -log( 1 - plnorm(Age, 4.32866905  + 0.06776662*EverWon, 0.2263067 ) ) #using info from ml
CoxSnell(CS1, 1-Oscar$Alive, xlim=c(-5,1))

plot of chunk unnamed-chunk-17

CS = -log( 1 - pweibull(Age, shape=1/0.1465575, scale=exp(4.39984197 + 0.04097455*EverWon) ) ) #using info from mw
CoxSnell(CS, 1-Oscar$Alive, xlim=c(-5,1))

plot of chunk unnamed-chunk-17

  1. To find the AIC values:
2*(3-3604) #Log-Normal AIC value/test statistic (# parameters - loglik(model))
## [1] -7202
2*(3-3503.8) #Weibull AIC value/test statistic
## [1] -7002

According to the AIC values, the Weibull model performs better.

  1. To plot the estimated hazard functions for the Log-Normal model:
curve(dlnorm(x,4.32866905,0.2263067)/(1-plnorm(x,4.32866905)), 0, 75, ylab="h(k)")
curve(dlnorm(x,4.32866905+0.06776662,0.2263067)/(1-plnorm(x,4.32866905+0.06776662,0.2263067)),add=T, lty=2)

plot of chunk unnamed-chunk-19

To plot the estimated hazard functions for the Weibull model:

curve(dweibull(x,1/0.1465575,exp(4.32866905))/(1-pweibull(x,1/0.1465575,exp(4.32866905))),0,150,ylab="h(k)")
curve(dweibull(x,1/0.1465575,exp(4.32866905 + 0.04097455))/(1-pweibull(x,1/0.1465575,exp(4.32866905 + 0.04097455))),add=T,lty=2)

plot of chunk unnamed-chunk-20

  1. The fact that Log-Normal models can have nonmonotonic hazards does not appear to give it an advantage over the Weibull model. The expected hazard increases linearly, which means that the Weibull model (for which hazards are always increasing) is a better fit.

  2. Winning an Oscar could inspire an actor to create more pieces, or gloat about it to their friends, both of which require living. On the other hand, not winning an Oscar could depress an actor and make them want to give up on life (or at least their acting career).

  3. An actor who has lived longer has (most likely) acted in more things, and thus has a higher chance of winning. For example, someone who has acted in 100 pieces over their 40-year-long career has more of a chance than winning at least one Oscar in that time than someone who has acted in only three pieces over their five-year-long career, by sheer numbers. To improve the comparison between Oscar winners and non-winners, we could perhaps change EverWon into a categorical variable (Yes/No), in order to control for actors who have won multiple Oscars.

Question 3

  1. If M1 is a model with p parameters and M2 is a model with p+1 parameters (so M1 is nested in M2), AIC will select M2 if and when its AIC value is less than that of M1. Do some fancy algebra (below), and we see that this happens when the log-likelihood of M2 minus that of M1 is greater than 1, i.e. it increases by more than one in M2.

The fancy algebra in question:

2(p+1)-2\(L_{M2}\) < 2p -2\(L_{M1}\) #Formula for AIC: 2(p-l), where p is the number of parameters and l is the log-likelihood value

2-\(L_{M2}\)< 2\(L_{M1}\) #distribute, and cancel out the 2p terms

2(\(L_{M2}\)-\(L_{M1}\))>2 # rearrange

\(L_{M2}\)-\(L_{M1}\)>1 #simplify

  1. LRT will select M2 when the p-value is less than .05. This happens when the LRT test statistic (2*(\(L_{M2}\)-\(L_{M1}\))) is greater than the chi-squared distribution value at .05. We can find this value by typing
qchisq(0.95, 1)/2
## [1] 1.921

So, LRT will select M2 if the log-likelihood value increases by at least 1.920 units.

  1. AIC is more willing to pick larger models because the acceptable difference in log-likelihood values is lower than when using LRT.

  2. Because the threshold for selecting M2 using AIC is 2, we need to find the corresponding p-value at 2 that will cause LRT to select M2. We can do this by typing

1-pchisq(2,1)
## [1] 0.1573

So, using LRT to decide whether to include an additional covariate will be the same as using AIC if \(\alpha\) = .157 .