Exercise 1

The liver data set is a subset of the ILPD (Indian Liver Patient Dataset) data set. It contains the first 10 variables described on the UCI Machine Learning Repository and a LiverPatient variable (indicating whether or not the individual is a liver patient. People with active liver disease are coded as LiverPatient=1 and people without disease are coded LiverPatient=0) for adults in the data set. Adults here are defined to be individuals who are at least 18 years of age. It is possible that there will be different significant predictors of being a liver patient for adult females and adult males.

a)

For only females in the data set, find and specify the best set of predictors via stepwise selection with AIC criteria for a logistic regression model predicting whether a female is a liver patient. NOTE: Specifying the full model using “LiverPatient~., data=…” will give an error message (due to only one level of factor – Female – in the data, I guess so). Suggest typing all variables manually for the full model

str(liver)
## spec_tbl_df [558 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Age         : num [1:558] 65 62 62 58 72 46 26 29 55 57 ...
##  $ Gender      : chr [1:558] "Female" "Male" "Male" "Male" ...
##  $ TB          : num [1:558] 0.7 10.9 7.3 1 3.9 1.8 0.9 0.9 0.7 0.6 ...
##  $ DB          : num [1:558] 0.1 5.5 4.1 0.4 2 0.7 0.2 0.3 0.2 0.1 ...
##  $ Alkphos     : num [1:558] 187 699 490 182 195 208 154 202 290 210 ...
##  $ Alamine     : num [1:558] 16 64 60 14 27 19 16 14 53 51 ...
##  $ Aspartate   : num [1:558] 18 100 68 20 59 14 12 11 58 59 ...
##  $ TP          : num [1:558] 6.8 7.5 7 6.8 7.3 7.6 7 6.7 6.8 5.9 ...
##  $ ALB         : num [1:558] 3.3 3.2 3.3 3.4 2.4 4.4 3.5 3.6 3.4 2.7 ...
##  $ LiverPatient: num [1:558] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Age = col_double(),
##   ..   Gender = col_character(),
##   ..   TB = col_double(),
##   ..   DB = col_double(),
##   ..   Alkphos = col_double(),
##   ..   Alamine = col_double(),
##   ..   Aspartate = col_double(),
##   ..   TP = col_double(),
##   ..   ALB = col_double(),
##   ..   LiverPatient = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
liverFemale = liver[which(liver$Gender == "Female"),]
glm.null.female = glm(LiverPatient ~ 1, data = liverFemale, family = "binomial") #null model: no predictor
glm.full.female = glm(LiverPatient ~ Age+ TB+ DB+ Alkphos+ Alamine+ Aspartate+ TP+ ALB, data = liverFemale, family = "binomial") # full model: all predictors
#Stepwise based on AIC for Female
stepwise.Female <- step(glm.null.female, scope = list(upper=glm.full.female), direction="both",test="Chisq", trace = F) 
summary(stepwise.Female)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Aspartate, family = "binomial", 
##     data = liverFemale)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8178  -1.2223   0.4402   1.1091   1.2049  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.32480    0.31013  -1.047   0.2950  
## DB           0.94479    0.55808   1.693   0.0905 .
## Aspartate    0.01106    0.00616   1.796   0.0726 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 175.72  on 134  degrees of freedom
## Residual deviance: 154.27  on 132  degrees of freedom
## AIC: 160.27
## 
## Number of Fisher Scoring iterations: 7

b)

Comment on the significance of parameter estimates under significance level alpha=0.1, what Hosmer-Lemeshow’s test tells us about goodness of fit, and point out any issues with diagnostics by checking residual plots and cook’s distance plot (with cut-off 0.25).

Comments: The variables DB and Aspartate both have values below a 0.1 significance level of 0.0905 and 0.0726, respectfully, making them the significant predictors for Female liver patients. Both of these predictors have a high chance of a relationship between Females patients and active liver disease.

hoslem.test(stepwise.Female$y, fitted(stepwise.Female), g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  stepwise.Female$y, fitted(stepwise.Female)
## X-squared = 7.7535, df = 8, p-value = 0.4579

Observation: The Hosmer and Lemeshow goodness of fit test provides a p-value of 0.4579 which is above the 0.1 significance level making the model adequate.

resid.d<-residuals(stepwise.Female, type = "deviance")
resid.p<-residuals(stepwise.Female, type = "pearson")
std.res.d<-residuals(stepwise.Female, type = "deviance")/sqrt(1 - hatvalues(stepwise.Female)) # standardized deviance residuals
std.res.p <-residuals(stepwise.Female, type = "pearson")/sqrt(1 - hatvalues(stepwise.Female)) # standardized pearson residuals

par(mfrow=c(1,2))
plot(std.res.d[stepwise.Female$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d[stepwise.Female$model$LiverPatient==1], col = "blue")

plot(std.res.p[stepwise.Female$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p[stepwise.Female$model$LiverPatient==1], col = "blue")

Observation: The diagnostic plots follow a similar pattern in both the standard deviance residual and Pearson residuals. The red line does not stray away from linearity and there is no large outstanding values it does not violate any assumptions.

plot(stepwise.Female, which = 4, id.n = 5)

(inf.id.1=which(cooks.distance(stepwise.Female)>0.25))
## named integer(0)

Observation: There is no observation with a Cook’s distance larger than 0.25.

c)

Interpret relationships between predictors in the final model and the odds of an adult female being a liver patient. (based on estimated Odds Ratio). NOTE: stepwise selection with AIC criteria can be performed by default step() function in R.

summary(stepwise.Female)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Aspartate, family = "binomial", 
##     data = liverFemale)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8178  -1.2223   0.4402   1.1091   1.2049  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.32480    0.31013  -1.047   0.2950  
## DB           0.94479    0.55808   1.693   0.0905 .
## Aspartate    0.01106    0.00616   1.796   0.0726 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 175.72  on 134  degrees of freedom
## Residual deviance: 154.27  on 132  degrees of freedom
## AIC: 160.27
## 
## Number of Fisher Scoring iterations: 7

Observation: The final model for Females is: \(\log(p/1-p)\)=-0.32480+0.94479DB+0.01106Aspartate

OR=exp(stepwise.Female$coefficients)
round(OR, 3)
## (Intercept)          DB   Aspartate 
##       0.723       2.572       1.011

Observation: 1. As DB increases by 0.94479 the odds of DB increases by exp=2.572. 2. As Aspartate increases by 0.01106 the odds of Aspartate increases by exp=1.011 3. Female patients with DB or Aspartate are more likely to have liver disease that is active.

Exercise 2:

Repeat exercise 1 for males. In addition to the previous questions, also d) comment on how the models for adult females and adult males differ. Use significance level alpha=0.1

NOTE: You will get an error message “glm.fit: fitted probabilities numerically 0 or 1 occurred” for this run. Ignore this and use the result for the interpretation. I will explain what this error means in Week 14 videos.

a)

liverMale = liver[which(liver$Gender == "Male"),]
glm.null.male = glm(LiverPatient ~ 1, data = liverMale, family = "binomial")
glm.full.male = glm(LiverPatient ~ Age + TB + DB + Alkphos + Alamine + Aspartate + TP + ALB, data=liverMale, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
stepwise.Male <- step(glm.null.male, scope = list(upper=glm.full.male),
                      direction="both", test="Chisq", trace = F)
summary(stepwise.Male)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial", 
##     data = liverMale)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3405  -0.5170   0.3978   0.8614   1.3756  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -1.476570   0.481336  -3.068  0.00216 **
## DB           0.512503   0.176066   2.911  0.00360 **
## Alamine      0.016218   0.005239   3.095  0.00197 **
## Age          0.020616   0.008095   2.547  0.01087 * 
## Alkphos      0.001740   0.001058   1.645  0.09992 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 476.28  on 422  degrees of freedom
## Residual deviance: 395.05  on 418  degrees of freedom
## AIC: 405.05
## 
## Number of Fisher Scoring iterations: 7

b)

Comments: All the values from the given stepwise selection have p-values lower than the significance level of 0.1. This allows us to state that there is a significant relationship between DB, Alamine, Age, and Alkphos in determining if a Male liver pationa has a liver disease.

hoslem.test(stepwise.Male$y, fitted(stepwise.Male), g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  stepwise.Male$y, fitted(stepwise.Male)
## X-squared = 7.043, df = 8, p-value = 0.532
resid.d.Male<-residuals(stepwise.Male, type = "deviance")
resid.p.Male<-residuals(stepwise.Male, type = "pearson")
std.res.d.Male<-residuals(stepwise.Male, type = "deviance")/sqrt(1 - hatvalues(stepwise.Male)) 
std.res.p.Male <-residuals(stepwise.Male, type = "pearson")/sqrt(1 - hatvalues(stepwise.Male))

par(mfrow=c(1,2))
plot(std.res.d.Male[stepwise.Male$model$LiverPatient==0], col = "red", 
     ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d.Male[stepwise.Male$model$LiverPatient==1], col = "blue")

plot(std.res.p.Male[stepwise.Male$model$LiverPatient==0], col = "red", 
     ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p.Male[stepwise.Male$model$LiverPatient==1], col = "blue")

plot(stepwise.Male, which = 4, id.n=5 )

(inf.id.2 = which(cooks.distance(stepwise.Male)>0.25))
## 86 
## 86
glm.male.final = glm(LiverPatient ~ DB + Alamine + Age + Alkphos, data = liverMale[-inf.id.2, ], family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
plot(glm.male.final, which = 4, id.n=5 )

Observation With our Hosmer-Lemeshow test we receieved a p-value greater than our significance level of 0.1, which means we dod not reject H\(_{0}\) and we assume our model is adequate. Now looking at our residual plots there is a parallel pattern, which implies there is a similar estimated probabilities for all observations. Finally in the cooks plot we do see one observation with a Cook’s distance larger than 0.25, observation 86.

c)

summary(glm.male.final)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial", 
##     data = liverMale[-inf.id.2, ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5166   0.0000   0.3301   0.8648   1.4696  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.902754   0.527386  -3.608 0.000309 ***
## DB           0.573104   0.198893   2.881 0.003958 ** 
## Alamine      0.015850   0.005466   2.900 0.003737 ** 
## Age          0.020418   0.008210   2.487 0.012883 *  
## Alkphos      0.003744   0.001477   2.534 0.011262 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 473.51  on 421  degrees of freedom
## Residual deviance: 381.31  on 417  degrees of freedom
## AIC: 391.31
## 
## Number of Fisher Scoring iterations: 8
OR.M = exp(glm.male.final$coefficients)
round(OR.M,3)
## (Intercept)          DB     Alamine         Age     Alkphos 
##       0.149       1.774       1.016       1.021       1.004

Observation As our final model is \(\log(p/1-p)\) = -1.902754 + 0.573104*DB + 0.015850*Alamine + 0.020418*Age + 0.003744*Alkphos we can compare it to the output of the Odds Ratios.

The odds of a male liver patient having active liver disease increases by a fact of 1.774 with every unit increase in DB.

Similary if we assume all other variables remain constant the odds of liver disease increase by a factor of the following,

  • odds of liver disease increase by a factor of 1.016 with a one unit increase in Alamine
  • odds of liver disease increase by a factor of 1.021 with a one unit increase in Age
  • odds of liver disease increase by a factor of 1.004 with a one unit increase in Alkphos

d)

Comment on how the models for adult females and adult males differ. Use significance level alpha=0.1 Comment: When comparing the models for females and males we note that the females only have two significant predictors (DB and Aspartate) while the males have four significant predictors (DB, Alamine, Age, and Alkphos). With each of these predictors increasing the odds of a patient having liver disease it may be easier to predict when a male is at risk easier than a female, simply because of the amount of predictors used.

Exercise 3:

Use the sleep data set which originates from http://lib.stat.cmu.edu/datasets/sleep. maxlife10 is 0 if the species maximum life span is less than 10 years and 1 if its maximum life span is greater than or equal to 10 years. Consider finding the best logistic model for predicting the probability that a species’ maximum lifespan will be at least 10 years. Consider all 6 variables as candidates (do not include species) and two index variables of them are categorical in nature. Treat two index variables as categorical variables (e.g. ignore the fact that they are ordinal). Use significance level alpha=0.1

a)

First find and specify the best set of predictors via stepwise selection with AIC criteria.

str(sleep)
## 'data.frame':    51 obs. of  8 variables:
##  $ species           : chr  "African" "African" "Arctic F" "Asian el" ...
##  $ bodyweight        : num  6654 1 3.38 2547 10.55 ...
##  $ brainweight       : num  5712 6.6 44.5 4603 179.5 ...
##  $ totalsleep        : num  3.3 8.3 12.5 3.9 9.8 19.7 6.2 14.5 9.7 12.5 ...
##  $ gestationtime     : num  645 42 60 624 180 35 392 63 230 112 ...
##  $ predationindex    : int  3 3 1 3 4 1 4 1 1 5 ...
##  $ sleepexposureindex: int  5 1 1 5 4 1 5 2 1 4 ...
##  $ maxlife10         : int  1 0 1 1 1 1 1 1 1 0 ...
glm.null.sleep1 <- glm(maxlife10 ~ 1, data = sleep, family = binomial)

glm.full.sleep1 <- glm(maxlife10 ~ bodyweight + brainweight + totalsleep + gestationtime + as.factor(predationindex) + as.factor(sleepexposureindex), data = sleep, family = binomial)
step.sleep1 <- step(glm.null.sleep1, scope = list(upper=glm.full.sleep1), direction = "both", test = "Chisq", trace = F)

summary(step.sleep1)
## 
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + as.factor(sleepexposureindex) + 
##     as.factor(predationindex), family = binomial, data = sleep)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.42528  -0.00004   0.00000   0.00013   2.37523  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                    -6.602e+00  4.864e+00  -1.357   0.1747  
## brainweight                     5.101e-02  5.084e-02   1.003   0.3157  
## totalsleep                      4.230e-01  2.647e-01   1.598   0.1100  
## as.factor(sleepexposureindex)2  4.998e+00  2.559e+00   1.953   0.0508 .
## as.factor(sleepexposureindex)3  3.636e+01  9.624e+03   0.004   0.9970  
## as.factor(sleepexposureindex)4  3.370e+01  1.037e+04   0.003   0.9974  
## as.factor(sleepexposureindex)5  7.341e+01  1.262e+04   0.006   0.9954  
## as.factor(predationindex)2     -2.535e+00  1.960e+00  -1.293   0.1960  
## as.factor(predationindex)3     -2.512e+01  1.253e+04  -0.002   0.9984  
## as.factor(predationindex)4     -1.826e+01  6.795e+03  -0.003   0.9979  
## as.factor(predationindex)5     -5.264e+01  1.143e+04  -0.005   0.9963  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.31  on 50  degrees of freedom
## Residual deviance: 15.88  on 40  degrees of freedom
## AIC: 37.88
## 
## Number of Fisher Scoring iterations: 20

Observation: Significant predictor is sleepexposure because it has p-values below the 0.1 significance level.

b)

Comment on the significance of parameter estimates, what Hosmer-Lemeshow’s test tells us about goodness of fit, and point out any issues with diagnostics by checking residual plots and cook’s distance plot. Do not remove influential points but just make comments on suspicious observations.

Comment: Specifically the sleepexposure levels 3-5 have a chance of an event not happening whereas level 2 has a chance of an event happening. Still, sleepexposure is a significant predictor.

hoslem.test(step.sleep1$y, fitted(step.sleep1), g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  step.sleep1$y, fitted(step.sleep1)
## X-squared = 7.0397, df = 8, p-value = 0.5324

Observation: Based on the Hosmer goodness of fit test, the p-value is 0.5324 which falls above the 0.1 significance level making the model adequate.

resid.d<-residuals(step.sleep1, type = "deviance")
resid.p<-residuals(step.sleep1, type = "pearson")
std.res.d<-residuals(step.sleep1, type = "deviance")/sqrt(1 - hatvalues(step.sleep1)) # standardized deviance residuals
std.res.p <-residuals(step.sleep1, type = "pearson")/sqrt(1 - hatvalues(step.sleep1)) # standardized pearson residuals

par(mfrow=c(1,2))
plot(std.res.d[step.sleep1$model$maxlife10==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d[step.sleep1$model$maxlife10==1], col = "blue")

plot(std.res.p[step.sleep1$model$maxlife10==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p[step.sleep1$model$maxlife10==1], col = "blue")

plot(step.sleep1, which = 4, id.n = 5)

(inf.id.3=which(cooks.distance(step.sleep1)>0.25))
## 35 40 
## 35 40

Observation: 1. Diagnostic plots: The Deviance Residual and Pearson Residual red plots do not follow a linearity line. Both appear to have values scattered and follow no pattern. As a result, this does not violate any linearity assumption. 2. Cook’s distance: There are two values, 35 and 40 above the Cook’s distance point of 0.25.

c)

Interpret what the model tells us about relationships between the predictors and the odds of a species’ maximum lifespan being at least 10 years. NOTE: stepwise selection with BIC criteria can be performed by step() function by adding an option k=log(n), where n is a sample size. No need to refit the model with only significant variables. Just use the final model from step() function, and for part (b), make comments based on this. But, for part (c), interpret the Odds Ratio only for significant level and no need to interpret other insignificant ones.

glm.sleep.final = glm(maxlife10 ~ brainweight + totalsleep + as.factor(predationindex) + as.factor(sleepexposureindex), data = sleep[-inf.id.3], family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.sleep.final)
## 
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + as.factor(predationindex) + 
##     as.factor(sleepexposureindex), family = "binomial", data = sleep[-inf.id.3])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.42528  -0.00004   0.00000   0.00013   2.37523  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                    -6.602e+00  4.864e+00  -1.357   0.1747  
## brainweight                     5.101e-02  5.084e-02   1.003   0.3157  
## totalsleep                      4.230e-01  2.647e-01   1.598   0.1100  
## as.factor(predationindex)2     -2.535e+00  1.960e+00  -1.293   0.1960  
## as.factor(predationindex)3     -2.512e+01  1.253e+04  -0.002   0.9984  
## as.factor(predationindex)4     -1.826e+01  6.795e+03  -0.003   0.9979  
## as.factor(predationindex)5     -5.264e+01  1.143e+04  -0.005   0.9963  
## as.factor(sleepexposureindex)2  4.998e+00  2.559e+00   1.953   0.0508 .
## as.factor(sleepexposureindex)3  3.636e+01  9.624e+03   0.004   0.9970  
## as.factor(sleepexposureindex)4  3.370e+01  1.037e+04   0.003   0.9974  
## as.factor(sleepexposureindex)5  7.341e+01  1.262e+04   0.006   0.9954  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.31  on 50  degrees of freedom
## Residual deviance: 15.88  on 40  degrees of freedom
## AIC: 37.88
## 
## Number of Fisher Scoring iterations: 20
round(exp(glm.sleep.final$coefficients),3)
##                    (Intercept)                    brainweight 
##                   1.000000e-03                   1.052000e+00 
##                     totalsleep     as.factor(predationindex)2 
##                   1.527000e+00                   7.900000e-02 
##     as.factor(predationindex)3     as.factor(predationindex)4 
##                   0.000000e+00                   0.000000e+00 
##     as.factor(predationindex)5 as.factor(sleepexposureindex)2 
##                   0.000000e+00                   1.480500e+02 
## as.factor(sleepexposureindex)3 as.factor(sleepexposureindex)4 
##                   6.173141e+15                   4.332708e+14 
## as.factor(sleepexposureindex)5 
##                   7.603846e+31

Observation: Based on the Odds Ratio, sleepexposure2 has a value of 1.480500e+02. Therefore, animals with a maxlife of Level 2 has a more likely chance to live past the maximum lifespan of 10yrs.

Exercise 4:

The index variables in the data set are ordinal, meaning they are categorical and they have a natural ordering. If we treat an index variable as a continuous variable, this will imply a linear change as the index changes. Repeat Exercise 3 by treating two index variables as continuous variables. Use significance level alpha=0.1

###a)

glm.null.sleep2 <- glm(maxlife10 ~ 1, data = sleep, family = "binomial")

glm.full.sleep2 <- glm(maxlife10 ~ bodyweight+brainweight+totalsleep+gestationtime
                       + predationindex + sleepexposureindex, data = sleep, family = "binomial")

step.sleep2 <- step(glm.null.sleep2, scope = list(upper=glm.full.sleep2),
                    direction="both",test="Chisq", trace = F)
summary(step.sleep2)
## 
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + sleepexposureindex + 
##     predationindex, family = "binomial", data = sleep)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.82148  -0.04746   0.00000   0.05811   2.41681  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)  
## (Intercept)        -6.16387    3.59301  -1.716   0.0863 .
## brainweight         0.06018    0.03544   1.698   0.0895 .
## totalsleep          0.35985    0.20995   1.714   0.0865 .
## sleepexposureindex  4.42111    1.97540   2.238   0.0252 *
## predationindex     -3.36917    1.51823  -2.219   0.0265 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.310  on 50  degrees of freedom
## Residual deviance: 19.212  on 46  degrees of freedom
## AIC: 29.212
## 
## Number of Fisher Scoring iterations: 11

Observation: With the stepwise selection treating the two index variables as continous variables we find the best predictors to be brainweight, totalsleep, sleepexposureindex, and predationindex.

b)

As mentioned in part A brainweight, totalsleep, sleepexposureindex, and predationindex have p-values lower than the significance level of 0.1. Implying they all have a significant relationship when determining maxlife10.

hoslem.test(step.sleep2$y, fitted(step.sleep2), g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  step.sleep2$y, fitted(step.sleep2)
## X-squared = 1.4406, df = 8, p-value = 0.9937
resid.d<-residuals(step.sleep2, type = "deviance")
resid.p<-residuals(step.sleep2, type = "pearson")
std.res.d<-residuals(step.sleep2, type = "deviance")/sqrt(1 - hatvalues(step.sleep2)) # standardized deviance residuals
std.res.p <-residuals(step.sleep2, type = "pearson")/sqrt(1 - hatvalues(step.sleep2)) # standardized pearson residuals

par(mfrow=c(1,2))
plot(std.res.d[step.sleep2$model$maxlife10==0], col = "red", 
     ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d[step.sleep2$model$maxlife10==1], col = "blue")

plot(std.res.p[step.sleep2$model$maxlife10==0], col = "red", 
     ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p[step.sleep2$model$maxlife10==1], col = "blue")

plot(step.sleep2, which=4, id.n=5)

Observation: Based on the Hosmer-Lemeshow test our model does appear to be adequate since the p-value is >0.1 and we do not have enough significant. The residual plots show overlap in the red (maxlife10 = 0) and blue (maxlife10 = 1) points. Finally there appears to be four influential points in the Cooks distance plot; 10, 35, 40, and 50.

(inf.id.4 = which(cooks.distance(step.sleep2)>0.25))
## 10 35 40 50 
## 10 35 40 50
glm.sleep2.final = glm(maxlife10 ~ brainweight + totalsleep + sleepexposureindex + predationindex, data = sleep[-inf.id.4], family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

c)

summary(glm.sleep2.final)
## 
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + sleepexposureindex + 
##     predationindex, family = "binomial", data = sleep[-inf.id.4])
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.82148  -0.04746   0.00000   0.05811   2.41681  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)  
## (Intercept)        -6.16387    3.59301  -1.716   0.0863 .
## brainweight         0.06018    0.03544   1.698   0.0895 .
## totalsleep          0.35985    0.20995   1.714   0.0865 .
## sleepexposureindex  4.42111    1.97540   2.238   0.0252 *
## predationindex     -3.36917    1.51823  -2.219   0.0265 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.310  on 50  degrees of freedom
## Residual deviance: 19.212  on 46  degrees of freedom
## AIC: 29.212
## 
## Number of Fisher Scoring iterations: 11
round(exp(glm.sleep2.final$coefficients),3)
##        (Intercept)        brainweight         totalsleep sleepexposureindex 
##              0.002              1.062              1.433             83.188 
##     predationindex 
##              0.034

Observation: As our final model is \(\log(p/1-p)\) = -6.16387 + 0.06018*brainweight + 0.35985*totalsleep + 4.42111*sleepexposureindex + -3.36917*predationindex we can compare it to the output of the Odds Ratios.

The odds of a species maximum lifespan being at least 10 years increases by a fact of 1.062 with every unit increase in brainweight.

Similary if we assume all other variables remain constant the odds of a species maximum lifespan being at least 10 years increase by a factor of the following,

  • odds of a species maximum lifespan being at least 10 years increase by a factor of 1.433 with a one unit increase in totalsleep
  • odds of a species maximum lifespan being at least 10 years increases by a factor of 83.188 with a one unit increase in sleepexposureindex
  • odds of a species maximum lifespan being at least 10 years decreases by a factor of 0.034 with a one unit increase in predationindex