Homework 4

#install.packages("ResourceSelection")
library(ResourceSelection)
## Warning: package 'ResourceSelection' was built under R version 4.3.2
## ResourceSelection 0.3-6   2023-06-27
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'dplyr' was built under R version 4.3.2
## Warning: package 'stringr' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
.bordered{
  border-style: solid;
  border-color: teal;
  padding: 5px;
  background-color: #DCDCDC;
}

Due: Saturday, Dec 2, at 11:59 pm

To complete this assignment, you will need to download liver.csv and sleep.csv.

We use significance level alpha=0.1 in HW4.

liver = read.csv("liver-1.csv")
sleep = read.csv("sleep-1.csv")

#Exercise 1

The liver data set is a subset of the ILPD (Indian Liver Patient Dataset) data set. It contains the first 10 variables described on the UCI Machine Learning Repository and a LiverPatient variable (indicating whether or not the individual is a liver patient. People with active liver disease are coded as LiverPatient=1 and people without disease are coded LiverPatient=0) for adults in the data set. Adults here are defined to be individuals who are at least 18 years of age. It is possible that there will be different significant predictors of being a liver patient for adult females and adult males.

(A) For only females in the data set, find and specify the best set of predictors via stepwise selection with AIC criteria for a logistic regression model predicting whether a female is a liver patient.

  • After performing the stepwise selection below, we can conclude that the best set of predictors for a logistic regression model predicting whether a female is a liver patient includes two variables: DB and Aspartate.


liver.f = liver[which(liver$Gender == "Female"),]
glm.null.f = glm(LiverPatient ~ 1, data = liver.f, family = "binomial")
glm.full.f = glm(LiverPatient ~ Age + TB + DB + Alkphos + Alamine + Aspartate + TP + ALB, data = liver.f, family = "binomial")
stepwise.f=step(glm.null.f, scope = list(upper=glm.full.f), direction="both",test="Chisq", trace = F) 
summary(stepwise.f)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Aspartate, family = "binomial", 
##     data = liver.f)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.32480    0.31013  -1.047   0.2950  
## DB           0.94479    0.55808   1.693   0.0905 .
## Aspartate    0.01106    0.00616   1.796   0.0726 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 175.72  on 134  degrees of freedom
## Residual deviance: 154.27  on 132  degrees of freedom
## AIC: 160.27
## 
## Number of Fisher Scoring iterations: 7

(B) Comment on the significance of parameter estimates under significance level alpha=0.1, what Hosmer-Lemeshow’s test tells us about goodness of fit and point out any issues with diagnostics by checking residual plots and cook’s distance plot (with cut-off 0.25).

  • In the model, DB and Aspartate are both significant, with p-values less than the significance level 0.1. The result of the HL (Hosmer-Lemeshow) test below has a p-value of 0.45, which is greater than 0.1. Thus, we accept the null hypothesis and conclude that the model fits the data well.

  • As presented in the following Influence Diagnostics plot, the highest cook’s d is around 0.07, which is less than the threshold 0.25. Thus, there is no unduly influential point. So, there is no need for refitting the model. The residual plots do not show any systematic pattern, and there are no observations with very large residuals. Thus, our model assumption seems valid.


hoslem.test(stepwise.f$y, fitted(stepwise.f), g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  stepwise.f$y, fitted(stepwise.f)
## X-squared = 7.7535, df = 8, p-value = 0.4579
resid1 = residuals(stepwise.f, type = "deviance")
resid2 = residuals(stepwise.f, type = "pearson")
std.res.1 = residuals(stepwise.f, type = "deviance")/sqrt(1 - hatvalues(stepwise.f))
std.res.2 = residuals(stepwise.f, type = "pearson")/sqrt(1 - hatvalues(stepwise.f))

par(mfrow=c(1,2))
plot(std.res.1[stepwise.f$model$LiverPatient==0],  
     ylim = c(-3.5,3.5), ylab = "Deviance residuals", xlab = "ID")
points(std.res.1[stepwise.f$model$LiverPatient==1], col = "blue")

plot(std.res.2[stepwise.f$model$LiverPatient==0],  
     ylim = c(-3.5,3.5), ylab = "Pearson residuals", xlab = "ID")
points(std.res.2[stepwise.f$model$LiverPatient==1], col = "blue")

plot(stepwise.f, which = 4, id.n = 5)

(inf.id.1=which(cooks.distance(stepwise.f)>0.25))
## named integer(0)

(C) Interpret relationships between predictors in the final model and the odds of an adult female being a liver patient. (based on estimated Odds Ratio). NOTE: stepwise selection with AIC criteria can be performed by default step() function in R.

  • The estimated of Odds Ratio (OR) for DB and Aspartate are 2.57 and 1.017, respectively. This means that, for each unit increasing of DB, there will be 2.57 (=exp(0.94)) times increasing of odds, and for each unit increasing of Aspartate, there will be 1.011 (=exp(0.011)) times increasing of odds of an adult female being a liver patient.


OR=exp(stepwise.f$coefficients)
round(OR, 3)
## (Intercept)          DB   Aspartate 
##       0.723       2.572       1.011

#Exercise 2

Repeat exercise 1 for males. In addition to the previous questions, also d) comment on how the models for adult females and adult males differ. Use significance level alpha=0.1

NOTE: You will get an error message “glm.fit: fitted probabilities numerically 0 or 1 occurred” for this run. Ignore this and use the result for the interpretation. I will explain what this error means in the class.

(A) For only males in the data set, find and specify the best set of predictors via stepwise selection with AIC criteria for a logistic regression model predicting whether a female is a liver patient.

  • According to the summary output of Stepwise selection below, we can conclude that the best set of predictors for a logistic regression model predicting whether a male is a liver patient are: DB, Alamine, Age and Alkphos.


liverm = liver[which(liver$Gender == "Male"),]
glm.null.m = glm(LiverPatient ~ 1, data = liverm, family = "binomial")
glm.full.m = glm(LiverPatient ~ Age + TB + DB + Alkphos + Alamine + Aspartate + TP + ALB, data=liverm, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
stepwise.m = step(glm.null.m, scope = list(upper=glm.full.m), direction="both", test="Chisq", trace = F)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(stepwise.m)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial", 
##     data = liverm)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -1.476570   0.481336  -3.068  0.00216 **
## DB           0.512503   0.176066   2.911  0.00360 **
## Alamine      0.016218   0.005239   3.095  0.00197 **
## Age          0.020616   0.008095   2.547  0.01087 * 
## Alkphos      0.001740   0.001058   1.645  0.09992 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 476.28  on 422  degrees of freedom
## Residual deviance: 395.05  on 418  degrees of freedom
## AIC: 405.05
## 
## Number of Fisher Scoring iterations: 7

(B) Comment on the significance of parameter estimates under significance level alpha=0.1, what Hosmer-Lemeshow’s test tells us about goodness of fit and point out any issues with diagnostics by checking residual plots and cook’s distance plot (with cut-off 0.25).

  • As presented in the following Influence Diagnostics plot, the highest cook’s d is greater than the threshold 0.25. Thus we refit the model without the high influential point.

  • All predictors (DB, Alamine, Age, and Alkphos) in the final refitted model without influential points are significant with all their p-value less than 0.1. The result of HL test below has a p-value of 0.467 which is greater than 0.1. Thus we accept the null hypothesis and conclude that the model fit the data well.

  • The cook’s d plot after removing the high influential point, and this time, there is no high influential point showed in the plot. Residual plots does not show any problematic patterns or large standardized residuals, thus model assumption seems valid.


hoslem.test(stepwise.m$y, fitted(stepwise.m), g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  stepwise.m$y, fitted(stepwise.m)
## X-squared = 7.043, df = 8, p-value = 0.532
resid.3 = residuals(stepwise.m, type = "deviance")
resid.4 = residuals(stepwise.m, type = "pearson")
std.res.3 = residuals(stepwise.m, type = "deviance")/sqrt(1 - hatvalues(stepwise.m))
std.res.4 = residuals(stepwise.m, type = "pearson")/sqrt(1 - hatvalues(stepwise.m))

par(mfrow=c(1,2))
plot(std.res.3[stepwise.m$model$LiverPatient==0],  
     ylim = c(-3.5,3.5), ylab = "Deviance residuals", xlab = "ID")
points(std.res.3[stepwise.m$model$LiverPatient==1], col = "blue")

plot(std.res.4[stepwise.m$model$LiverPatient==0], 
     ylim = c(-3.5,3.5), ylab = "Pearson residuals", xlab = "ID")
points(std.res.4[stepwise.m$model$LiverPatient==1], col = "blue")

plot(stepwise.m, which = 4, id.n = 5)

inf.id.2 = which(cooks.distance(stepwise.m)>0.25)
inf.id.2
## 111 
##  86

(C) Interpret relationships between predictors in the final model and the odds of an adult female being a liver patient. (based on estimated Odds Ratio). NOTE: stepwise selection with AIC criteria can be performed by default step() function in R.

  • The estimation of OR for DB is 1.774. This means that, for each unit increasing of DB, there will be 1.774(=exp(0.573)) times increasing of odds of an adult male being a liver patient. The estimation of OR for Alamine is 1.016. This means that, for each unit increasing of Alamine, there will be 1.016 (=exp(0.016)) times increasing of odds of an adult male being a liver patient. The estimation of OR for Age is 1.021. This means that, for each unit increasing of Age, there will be 1.021(=exp(0.02)) times increasing of odds of an adult male being a liver patient. The estimation of OR for Alkphos is 1,004. This means that, for each unit increasing of Alkphos, there will be


glm.liver.m.2 = glm(LiverPatient ~ DB + Alamine + Age + Alkphos, data = liverm[-inf.id.2, ], family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.liver.m.2)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial", 
##     data = liverm[-inf.id.2, ])
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.902754   0.527386  -3.608 0.000309 ***
## DB           0.573104   0.198893   2.881 0.003958 ** 
## Alamine      0.015850   0.005466   2.900 0.003737 ** 
## Age          0.020418   0.008210   2.487 0.012883 *  
## Alkphos      0.003744   0.001477   2.534 0.011262 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 473.51  on 421  degrees of freedom
## Residual deviance: 381.31  on 417  degrees of freedom
## AIC: 391.31
## 
## Number of Fisher Scoring iterations: 8
round(exp(glm.liver.m.2$coefficients),3)
## (Intercept)          DB     Alamine         Age     Alkphos 
##       0.149       1.774       1.016       1.021       1.004

(D) Comment on how the models for adult females and adult males differ. Use significance level alpha=0.1

  • For DB, the odds of being a liver patient are expected to increase by approximately DB = 1.774. The higher values of DB are associated with higher odds of being a liver patient.

  • For Alamine, the odds of being a liver patient are expected to increase by approximately Alamine = 1.016. The higher values of Aspartate are associated with slightly higher odds of being a liver patient.

  • For Age, the odds of being a liver patient are expected to increase by approximately Age = 1.021. The higher values of Age are associated with slightly higher odds of being a liver patient.

  • For Alkphos, the odds of being a liver patient are expected to increase by approximately Age = 1.004. The higher values of Alkphos are associated with slightly higher odds of being a liver patient.

  • In other words, Male patients with DB or Alamine or Age or Alkphosare more likely to have liver disease that is active.


#Exercise 3

Use the sleep data set which originates from http://lib.stat.cmu.edu/datasets/sleep. maxlife10 is 0 if the species maximum life span is less than 10 years and 1 if its maximum life span is greater than or equal to 10 years. Consider finding the best logistic model for predicting the probability that a species’ maximum lifespan will be at least 10 years. Consider all 6 variables as candidates (do not include species) and two index variables of them are categorical in nature. Treat two index variables as categorical variables (e.g. ignore the fact that they are ordinal). Use significance level alpha=0.1

(A)First find and specify the best set of predictors via stepwise selection with BIC criteria.

  • The best model from stepwise selection via AIC criteria contains brainweight, totalsleep, sleepexposureindex, and predationindex.


glm.null.sleep1 <- glm(maxlife10 ~ 1, data = sleep, family = binomial)

glm.full.sleep1 <- glm(maxlife10 ~ bodyweight + brainweight + totalsleep + gestationtime + as.factor(predationindex) + as.factor(sleepexposureindex), data = sleep, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
stepwise.sleep1 <- step(glm.null.sleep1, scope = list(upper=glm.full.sleep1), direction = "both", test = "Chisq", trace = F)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(stepwise.sleep1)
## 
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + as.factor(sleepexposureindex) + 
##     as.factor(predationindex), family = binomial, data = sleep)
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                    -6.602e+00  4.864e+00  -1.357   0.1747  
## brainweight                     5.101e-02  5.084e-02   1.003   0.3157  
## totalsleep                      4.230e-01  2.647e-01   1.598   0.1100  
## as.factor(sleepexposureindex)2  4.998e+00  2.559e+00   1.953   0.0508 .
## as.factor(sleepexposureindex)3  3.636e+01  9.624e+03   0.004   0.9970  
## as.factor(sleepexposureindex)4  3.370e+01  1.037e+04   0.003   0.9974  
## as.factor(sleepexposureindex)5  7.341e+01  1.262e+04   0.006   0.9954  
## as.factor(predationindex)2     -2.535e+00  1.960e+00  -1.293   0.1960  
## as.factor(predationindex)3     -2.512e+01  1.253e+04  -0.002   0.9984  
## as.factor(predationindex)4     -1.826e+01  6.795e+03  -0.003   0.9979  
## as.factor(predationindex)5     -5.264e+01  1.143e+04  -0.005   0.9963  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.31  on 50  degrees of freedom
## Residual deviance: 15.88  on 40  degrees of freedom
## AIC: 37.88
## 
## Number of Fisher Scoring iterations: 20

(B) What does Hosmer-Lemeshow’s test tells us about goodness of fit? And point out any issues with diagnostics by checking residual plots and cook’s distance plot. Do not remove influential points but just make comments on suspicious observations.

  • Among 4 chosen predictors, only sleepexposureindex is significant with p-value for sleepexposureindex2 less than 0.1. The goodness of fit test for the model has p-value of 0.53, which indicates the model fit is adequate.

  • In the diagnostic plots, we find two observations with relatively large cook’s d compared to others.Residual plots looks okay without problematic issues.


hoslem.test(stepwise.sleep1$y, fitted(stepwise.sleep1), g=10)
## Warning in hoslem.test(stepwise.sleep1$y, fitted(stepwise.sleep1), g = 10): The
## data did not allow for the requested number of bins.
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  stepwise.sleep1$y, fitted(stepwise.sleep1)
## X-squared = 7.0397, df = 7, p-value = 0.4248
resid.d = residuals(stepwise.sleep1, type = "deviance")
resid.p = residuals(stepwise.sleep1, type = "pearson")
std.res.d = residuals(stepwise.sleep1, type = "deviance")/sqrt(1 - hatvalues(stepwise.sleep1))
std.res.p = residuals(stepwise.sleep1, type = "pearson")/sqrt(1 - hatvalues(stepwise.sleep1))

par(mfrow=c(1,2))
plot(std.res.d[stepwise.sleep1$model$maxlife10==0], col = "red", 
     ylim = c(-3.5,3.5), ylab = "deviance residuals", xlab = "ID")
points(std.res.d[stepwise.sleep1$model$maxlife10==1], col = "black")

plot(std.res.p[stepwise.sleep1$model$maxlife10==0], col = "red", 
     ylim = c(-3.5,3.5), ylab = "Pearson residuals", xlab = "ID")
points(std.res.p[stepwise.sleep1$model$maxlife10==1], col = "black")

plot(stepwise.sleep1, which = 4, id.n = 5)

infsleep3 = which(cooks.distance(stepwise.sleep1)>0.25)
infsleep3
## 35 40 
## 35 40

(C) Interpret what the model tells us about relationships between the predictors and the odds of a species’ maximum lifespan being at least 10 years.

  • We only interpret the significant one, estimated OR for sleepexposureindex2. The odds ratio is estimated as 148.05, so we can say that the odds of having a maximum lifespan of at least 10 years for a species with sleepexposureindex2 is 148.05 (=exp(4.99)) times the odds for a species with sleepexposureindex1. A species under other sleepexposureindex levels (3,4, and 5) does not have significantly different odds comapred to a species with sleepexposureindex1.


glm.sleep.2 = glm(maxlife10 ~ brainweight + totalsleep + as.factor(predationindex) + as.factor(sleepexposureindex), data = sleep[-infsleep3], family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.sleep.2)
## 
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + as.factor(predationindex) + 
##     as.factor(sleepexposureindex), family = "binomial", data = sleep[-infsleep3])
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                    -6.602e+00  4.864e+00  -1.357   0.1747  
## brainweight                     5.101e-02  5.084e-02   1.003   0.3157  
## totalsleep                      4.230e-01  2.647e-01   1.598   0.1100  
## as.factor(predationindex)2     -2.535e+00  1.960e+00  -1.293   0.1960  
## as.factor(predationindex)3     -2.512e+01  1.253e+04  -0.002   0.9984  
## as.factor(predationindex)4     -1.826e+01  6.795e+03  -0.003   0.9979  
## as.factor(predationindex)5     -5.264e+01  1.143e+04  -0.005   0.9963  
## as.factor(sleepexposureindex)2  4.998e+00  2.559e+00   1.953   0.0508 .
## as.factor(sleepexposureindex)3  3.636e+01  9.624e+03   0.004   0.9970  
## as.factor(sleepexposureindex)4  3.370e+01  1.037e+04   0.003   0.9974  
## as.factor(sleepexposureindex)5  7.341e+01  1.262e+04   0.006   0.9954  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.31  on 50  degrees of freedom
## Residual deviance: 15.88  on 40  degrees of freedom
## AIC: 37.88
## 
## Number of Fisher Scoring iterations: 20
round(exp(glm.sleep.2$coefficients),3)
##                    (Intercept)                    brainweight 
##                   1.000000e-03                   1.052000e+00 
##                     totalsleep     as.factor(predationindex)2 
##                   1.527000e+00                   7.900000e-02 
##     as.factor(predationindex)3     as.factor(predationindex)4 
##                   0.000000e+00                   0.000000e+00 
##     as.factor(predationindex)5 as.factor(sleepexposureindex)2 
##                   0.000000e+00                   1.480500e+02 
## as.factor(sleepexposureindex)3 as.factor(sleepexposureindex)4 
##                   6.173141e+15                   4.332708e+14 
## as.factor(sleepexposureindex)5 
##                   7.603846e+31

Exercise 4

The index variables in the data set are ordinal, meaning they are categorical and they have a natural ordering. If we treat an index variable as a continuous variable, this will imply a linear change as the index changes. Repeat Exercise 3 a)-c) by treating two index variables as continuous variables.

(A) First find and specify the best set of predictors via stepwise selection with BIC criteria.

  • Treating the index variables as continuous, stepwise select brainweight, totalsleep, sleepexposureindex and predationindex.


glm.null.sleep2 <- glm(maxlife10 ~ 1, data = sleep, family = "binomial")

glm.full.sleep2 <- glm(maxlife10 ~ bodyweight+brainweight+totalsleep+gestationtime
                       + predationindex + sleepexposureindex, data = sleep, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
step.sleep2 <- step(glm.null.sleep2, scope = list(upper=glm.full.sleep2),
                    direction="both",test="Chisq", trace = F)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(step.sleep2)
## 
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + sleepexposureindex + 
##     predationindex, family = "binomial", data = sleep)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)  
## (Intercept)        -6.16387    3.59301  -1.716   0.0863 .
## brainweight         0.06018    0.03544   1.698   0.0895 .
## totalsleep          0.35985    0.20995   1.714   0.0865 .
## sleepexposureindex  4.42111    1.97540   2.238   0.0252 *
## predationindex     -3.36917    1.51823  -2.219   0.0265 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.310  on 50  degrees of freedom
## Residual deviance: 19.212  on 46  degrees of freedom
## AIC: 29.212
## 
## Number of Fisher Scoring iterations: 11

(B) What does Hosmer-Lemeshow’s test tells us about goodness of fit? And point out any issues with diagnostics by checking residual plots and cook’s distance plot. Do not remove influential points but just make comments on suspicious observations.

  • All chosen predictors in the final model are statistically significant with p-values less than 0.1, meaning its estimated coefficient is far from 0. Thus, it aids in predicting whether the maximum lifespan of a species will be at least 10 years. The goodness of fit test for the model has a p-value of 0.99, which indicates the model fit is reasonable.

  • We observe a few observations have cook’s d relatively larger than others and residual plots does not show problematic issues.


hoslem.test(step.sleep2$y, fitted(step.sleep2), g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  step.sleep2$y, fitted(step.sleep2)
## X-squared = 1.4406, df = 8, p-value = 0.9937
resid.d<-residuals(step.sleep2, type = "deviance")
resid.p<-residuals(step.sleep2, type = "pearson")
std.res.d<-residuals(step.sleep2, type = "deviance")/sqrt(1 - hatvalues(step.sleep2)) # standardized deviance residuals
std.res.p <-residuals(step.sleep2, type = "pearson")/sqrt(1 - hatvalues(step.sleep2)) # standardized pearson residuals

par(mfrow=c(1,2))
plot(std.res.d[step.sleep2$model$maxlife10==0], col = "red", 
     ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d[step.sleep2$model$maxlife10==1], col = "blue")

plot(std.res.p[step.sleep2$model$maxlife10==0], col = "red", 
     ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p[step.sleep2$model$maxlife10==1], col = "blue")

plot(step.sleep2, which=4, id.n=5)

(inf.id.4 = which(cooks.distance(step.sleep2)>0.25))
## 10 35 40 50 
## 10 35 40 50
glm.sleep2.final = glm(maxlife10 ~ brainweight + totalsleep + sleepexposureindex + predationindex, data = sleep[-inf.id.4], family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

(C) Interpret what the model tells us about relationships between the predictors and the odds of a species’ maximum lifespan being at least 10 years.

  • The estimated OR for brainweight is 1.062 and it implies that the odds of a species having maximum lifespan at least 10 years is expected to increase by 1.062 (=exp(0.06)) times with one unit increase in brainweight. The estimated OR for totalsleep is 1.433 and it implies that the odds of a species having maximum lifespan at least 10 years is expected to increase by 1.433 (=exp(0.36)) with one unit increase in totalsleep. The odds ratio is estimated as 83.18 for sleepexposure so we can say for a one-unit increase in sleep exposure index, we expect to see an increase in the odds of a species having maximum lifespan at least 10 years by 83.18 (=exp(4.42)) times. The odds ratio for predation index is estimated as 0.034. Thus we expect the odds of a species having max lifespan at least 10 years change by 0.034 (=exp(-3.37)) multiplicative factor with one unit increase in predation index.

  • Estimated odds ratio is very large for sleep exposure index. We need to be careful to see if this result is presumable. But it is different issue and above is what we get from data. Also we see different result for the significance of variables from Exercise3 and 4. The reason is due to small sample size with relatively large number of parameters to estimate in Exercise3 by treating index variables as catetorial variables.


summary(glm.sleep2.final)
## 
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + sleepexposureindex + 
##     predationindex, family = "binomial", data = sleep[-inf.id.4])
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)  
## (Intercept)        -6.16387    3.59301  -1.716   0.0863 .
## brainweight         0.06018    0.03544   1.698   0.0895 .
## totalsleep          0.35985    0.20995   1.714   0.0865 .
## sleepexposureindex  4.42111    1.97540   2.238   0.0252 *
## predationindex     -3.36917    1.51823  -2.219   0.0265 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.310  on 50  degrees of freedom
## Residual deviance: 19.212  on 46  degrees of freedom
## AIC: 29.212
## 
## Number of Fisher Scoring iterations: 11
round(exp(glm.sleep2.final$coefficients),3)
##        (Intercept)        brainweight         totalsleep sleepexposureindex 
##              0.002              1.062              1.433             83.188 
##     predationindex 
##              0.034