Homework 4
#install.packages("ResourceSelection")
library(ResourceSelection)
## Warning: package 'ResourceSelection' was built under R version 4.3.2
## ResourceSelection 0.3-6 2023-06-27
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'dplyr' was built under R version 4.3.2
## Warning: package 'stringr' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
.bordered{
border-style: solid;
border-color: teal;
padding: 5px;
background-color: #DCDCDC;
}
Due: Saturday, Dec 2, at 11:59 pm
To complete this assignment, you will need to download liver.csv and sleep.csv.
We use significance level alpha=0.1 in HW4.
liver = read.csv("liver-1.csv")
sleep = read.csv("sleep-1.csv")
#Exercise 1
The liver data set is a subset of the ILPD (Indian Liver Patient Dataset) data set. It contains the first 10 variables described on the UCI Machine Learning Repository and a LiverPatient variable (indicating whether or not the individual is a liver patient. People with active liver disease are coded as LiverPatient=1 and people without disease are coded LiverPatient=0) for adults in the data set. Adults here are defined to be individuals who are at least 18 years of age. It is possible that there will be different significant predictors of being a liver patient for adult females and adult males.
(A) For only females in the data set, find and specify the best set of predictors via stepwise selection with AIC criteria for a logistic regression model predicting whether a female is a liver patient.
liver.f = liver[which(liver$Gender == "Female"),]
glm.null.f = glm(LiverPatient ~ 1, data = liver.f, family = "binomial")
glm.full.f = glm(LiverPatient ~ Age + TB + DB + Alkphos + Alamine + Aspartate + TP + ALB, data = liver.f, family = "binomial")
stepwise.f=step(glm.null.f, scope = list(upper=glm.full.f), direction="both",test="Chisq", trace = F)
summary(stepwise.f)
##
## Call:
## glm(formula = LiverPatient ~ DB + Aspartate, family = "binomial",
## data = liver.f)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.32480 0.31013 -1.047 0.2950
## DB 0.94479 0.55808 1.693 0.0905 .
## Aspartate 0.01106 0.00616 1.796 0.0726 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 175.72 on 134 degrees of freedom
## Residual deviance: 154.27 on 132 degrees of freedom
## AIC: 160.27
##
## Number of Fisher Scoring iterations: 7
(B) Comment on the significance of parameter estimates under significance level alpha=0.1, what Hosmer-Lemeshow’s test tells us about goodness of fit and point out any issues with diagnostics by checking residual plots and cook’s distance plot (with cut-off 0.25).
In the model, DB and Aspartate are both significant, with p-values less than the significance level 0.1. The result of the HL (Hosmer-Lemeshow) test below has a p-value of 0.45, which is greater than 0.1. Thus, we accept the null hypothesis and conclude that the model fits the data well.
hoslem.test(stepwise.f$y, fitted(stepwise.f), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: stepwise.f$y, fitted(stepwise.f)
## X-squared = 7.7535, df = 8, p-value = 0.4579
resid1 = residuals(stepwise.f, type = "deviance")
resid2 = residuals(stepwise.f, type = "pearson")
std.res.1 = residuals(stepwise.f, type = "deviance")/sqrt(1 - hatvalues(stepwise.f))
std.res.2 = residuals(stepwise.f, type = "pearson")/sqrt(1 - hatvalues(stepwise.f))
par(mfrow=c(1,2))
plot(std.res.1[stepwise.f$model$LiverPatient==0],
ylim = c(-3.5,3.5), ylab = "Deviance residuals", xlab = "ID")
points(std.res.1[stepwise.f$model$LiverPatient==1], col = "blue")
plot(std.res.2[stepwise.f$model$LiverPatient==0],
ylim = c(-3.5,3.5), ylab = "Pearson residuals", xlab = "ID")
points(std.res.2[stepwise.f$model$LiverPatient==1], col = "blue")
plot(stepwise.f, which = 4, id.n = 5)
(inf.id.1=which(cooks.distance(stepwise.f)>0.25))
## named integer(0)
(C) Interpret relationships between predictors in the final model and the odds of an adult female being a liver patient. (based on estimated Odds Ratio). NOTE: stepwise selection with AIC criteria can be performed by default step() function in R.
OR=exp(stepwise.f$coefficients)
round(OR, 3)
## (Intercept) DB Aspartate
## 0.723 2.572 1.011
#Exercise 2
NOTE: You will get an error message “glm.fit: fitted probabilities numerically 0 or 1 occurred” for this run. Ignore this and use the result for the interpretation. I will explain what this error means in the class.
(A) For only males in the data set, find and specify the best set of predictors via stepwise selection with AIC criteria for a logistic regression model predicting whether a female is a liver patient.
liverm = liver[which(liver$Gender == "Male"),]
glm.null.m = glm(LiverPatient ~ 1, data = liverm, family = "binomial")
glm.full.m = glm(LiverPatient ~ Age + TB + DB + Alkphos + Alamine + Aspartate + TP + ALB, data=liverm, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
stepwise.m = step(glm.null.m, scope = list(upper=glm.full.m), direction="both", test="Chisq", trace = F)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(stepwise.m)
##
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial",
## data = liverm)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.476570 0.481336 -3.068 0.00216 **
## DB 0.512503 0.176066 2.911 0.00360 **
## Alamine 0.016218 0.005239 3.095 0.00197 **
## Age 0.020616 0.008095 2.547 0.01087 *
## Alkphos 0.001740 0.001058 1.645 0.09992 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 476.28 on 422 degrees of freedom
## Residual deviance: 395.05 on 418 degrees of freedom
## AIC: 405.05
##
## Number of Fisher Scoring iterations: 7
(B) Comment on the significance of parameter estimates under significance level alpha=0.1, what Hosmer-Lemeshow’s test tells us about goodness of fit and point out any issues with diagnostics by checking residual plots and cook’s distance plot (with cut-off 0.25).
As presented in the following Influence Diagnostics plot, the highest cook’s d is greater than the threshold 0.25. Thus we refit the model without the high influential point.
All predictors (DB, Alamine, Age, and Alkphos) in the final refitted model without influential points are significant with all their p-value less than 0.1. The result of HL test below has a p-value of 0.467 which is greater than 0.1. Thus we accept the null hypothesis and conclude that the model fit the data well.
hoslem.test(stepwise.m$y, fitted(stepwise.m), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: stepwise.m$y, fitted(stepwise.m)
## X-squared = 7.043, df = 8, p-value = 0.532
resid.3 = residuals(stepwise.m, type = "deviance")
resid.4 = residuals(stepwise.m, type = "pearson")
std.res.3 = residuals(stepwise.m, type = "deviance")/sqrt(1 - hatvalues(stepwise.m))
std.res.4 = residuals(stepwise.m, type = "pearson")/sqrt(1 - hatvalues(stepwise.m))
par(mfrow=c(1,2))
plot(std.res.3[stepwise.m$model$LiverPatient==0],
ylim = c(-3.5,3.5), ylab = "Deviance residuals", xlab = "ID")
points(std.res.3[stepwise.m$model$LiverPatient==1], col = "blue")
plot(std.res.4[stepwise.m$model$LiverPatient==0],
ylim = c(-3.5,3.5), ylab = "Pearson residuals", xlab = "ID")
points(std.res.4[stepwise.m$model$LiverPatient==1], col = "blue")
plot(stepwise.m, which = 4, id.n = 5)
inf.id.2 = which(cooks.distance(stepwise.m)>0.25)
inf.id.2
## 111
## 86
(C) Interpret relationships between predictors in the final model and the odds of an adult female being a liver patient. (based on estimated Odds Ratio). NOTE: stepwise selection with AIC criteria can be performed by default step() function in R.
glm.liver.m.2 = glm(LiverPatient ~ DB + Alamine + Age + Alkphos, data = liverm[-inf.id.2, ], family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.liver.m.2)
##
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial",
## data = liverm[-inf.id.2, ])
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.902754 0.527386 -3.608 0.000309 ***
## DB 0.573104 0.198893 2.881 0.003958 **
## Alamine 0.015850 0.005466 2.900 0.003737 **
## Age 0.020418 0.008210 2.487 0.012883 *
## Alkphos 0.003744 0.001477 2.534 0.011262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 473.51 on 421 degrees of freedom
## Residual deviance: 381.31 on 417 degrees of freedom
## AIC: 391.31
##
## Number of Fisher Scoring iterations: 8
round(exp(glm.liver.m.2$coefficients),3)
## (Intercept) DB Alamine Age Alkphos
## 0.149 1.774 1.016 1.021 1.004
(D) Comment on how the models for adult females and adult males differ. Use significance level alpha=0.1
For DB, the odds of being a liver patient are expected to increase by approximately DB = 1.774. The higher values of DB are associated with higher odds of being a liver patient.
For Alamine, the odds of being a liver patient are expected to increase by approximately Alamine = 1.016. The higher values of Aspartate are associated with slightly higher odds of being a liver patient.
For Age, the odds of being a liver patient are expected to increase by approximately Age = 1.021. The higher values of Age are associated with slightly higher odds of being a liver patient.
For Alkphos, the odds of being a liver patient are expected to increase by approximately Age = 1.004. The higher values of Alkphos are associated with slightly higher odds of being a liver patient.
#Exercise 3
Use the sleep data set which originates from http://lib.stat.cmu.edu/datasets/sleep. maxlife10 is 0 if the species maximum life span is less than 10 years and 1 if its maximum life span is greater than or equal to 10 years. Consider finding the best logistic model for predicting the probability that a species’ maximum lifespan will be at least 10 years. Consider all 6 variables as candidates (do not include species) and two index variables of them are categorical in nature. Treat two index variables as categorical variables (e.g. ignore the fact that they are ordinal). Use significance level alpha=0.1
(A)First find and specify the best set of predictors via stepwise selection with BIC criteria.
glm.null.sleep1 <- glm(maxlife10 ~ 1, data = sleep, family = binomial)
glm.full.sleep1 <- glm(maxlife10 ~ bodyweight + brainweight + totalsleep + gestationtime + as.factor(predationindex) + as.factor(sleepexposureindex), data = sleep, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
stepwise.sleep1 <- step(glm.null.sleep1, scope = list(upper=glm.full.sleep1), direction = "both", test = "Chisq", trace = F)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(stepwise.sleep1)
##
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + as.factor(sleepexposureindex) +
## as.factor(predationindex), family = binomial, data = sleep)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.602e+00 4.864e+00 -1.357 0.1747
## brainweight 5.101e-02 5.084e-02 1.003 0.3157
## totalsleep 4.230e-01 2.647e-01 1.598 0.1100
## as.factor(sleepexposureindex)2 4.998e+00 2.559e+00 1.953 0.0508 .
## as.factor(sleepexposureindex)3 3.636e+01 9.624e+03 0.004 0.9970
## as.factor(sleepexposureindex)4 3.370e+01 1.037e+04 0.003 0.9974
## as.factor(sleepexposureindex)5 7.341e+01 1.262e+04 0.006 0.9954
## as.factor(predationindex)2 -2.535e+00 1.960e+00 -1.293 0.1960
## as.factor(predationindex)3 -2.512e+01 1.253e+04 -0.002 0.9984
## as.factor(predationindex)4 -1.826e+01 6.795e+03 -0.003 0.9979
## as.factor(predationindex)5 -5.264e+01 1.143e+04 -0.005 0.9963
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.31 on 50 degrees of freedom
## Residual deviance: 15.88 on 40 degrees of freedom
## AIC: 37.88
##
## Number of Fisher Scoring iterations: 20
(B) What does Hosmer-Lemeshow’s test tells us about goodness of fit? And point out any issues with diagnostics by checking residual plots and cook’s distance plot. Do not remove influential points but just make comments on suspicious observations.
Among 4 chosen predictors, only sleepexposureindex is significant with p-value for sleepexposureindex2 less than 0.1. The goodness of fit test for the model has p-value of 0.53, which indicates the model fit is adequate.
hoslem.test(stepwise.sleep1$y, fitted(stepwise.sleep1), g=10)
## Warning in hoslem.test(stepwise.sleep1$y, fitted(stepwise.sleep1), g = 10): The
## data did not allow for the requested number of bins.
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: stepwise.sleep1$y, fitted(stepwise.sleep1)
## X-squared = 7.0397, df = 7, p-value = 0.4248
resid.d = residuals(stepwise.sleep1, type = "deviance")
resid.p = residuals(stepwise.sleep1, type = "pearson")
std.res.d = residuals(stepwise.sleep1, type = "deviance")/sqrt(1 - hatvalues(stepwise.sleep1))
std.res.p = residuals(stepwise.sleep1, type = "pearson")/sqrt(1 - hatvalues(stepwise.sleep1))
par(mfrow=c(1,2))
plot(std.res.d[stepwise.sleep1$model$maxlife10==0], col = "red",
ylim = c(-3.5,3.5), ylab = "deviance residuals", xlab = "ID")
points(std.res.d[stepwise.sleep1$model$maxlife10==1], col = "black")
plot(std.res.p[stepwise.sleep1$model$maxlife10==0], col = "red",
ylim = c(-3.5,3.5), ylab = "Pearson residuals", xlab = "ID")
points(std.res.p[stepwise.sleep1$model$maxlife10==1], col = "black")
plot(stepwise.sleep1, which = 4, id.n = 5)
infsleep3 = which(cooks.distance(stepwise.sleep1)>0.25)
infsleep3
## 35 40
## 35 40
(C) Interpret what the model tells us about relationships between the predictors and the odds of a species’ maximum lifespan being at least 10 years.
glm.sleep.2 = glm(maxlife10 ~ brainweight + totalsleep + as.factor(predationindex) + as.factor(sleepexposureindex), data = sleep[-infsleep3], family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.sleep.2)
##
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + as.factor(predationindex) +
## as.factor(sleepexposureindex), family = "binomial", data = sleep[-infsleep3])
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.602e+00 4.864e+00 -1.357 0.1747
## brainweight 5.101e-02 5.084e-02 1.003 0.3157
## totalsleep 4.230e-01 2.647e-01 1.598 0.1100
## as.factor(predationindex)2 -2.535e+00 1.960e+00 -1.293 0.1960
## as.factor(predationindex)3 -2.512e+01 1.253e+04 -0.002 0.9984
## as.factor(predationindex)4 -1.826e+01 6.795e+03 -0.003 0.9979
## as.factor(predationindex)5 -5.264e+01 1.143e+04 -0.005 0.9963
## as.factor(sleepexposureindex)2 4.998e+00 2.559e+00 1.953 0.0508 .
## as.factor(sleepexposureindex)3 3.636e+01 9.624e+03 0.004 0.9970
## as.factor(sleepexposureindex)4 3.370e+01 1.037e+04 0.003 0.9974
## as.factor(sleepexposureindex)5 7.341e+01 1.262e+04 0.006 0.9954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.31 on 50 degrees of freedom
## Residual deviance: 15.88 on 40 degrees of freedom
## AIC: 37.88
##
## Number of Fisher Scoring iterations: 20
round(exp(glm.sleep.2$coefficients),3)
## (Intercept) brainweight
## 1.000000e-03 1.052000e+00
## totalsleep as.factor(predationindex)2
## 1.527000e+00 7.900000e-02
## as.factor(predationindex)3 as.factor(predationindex)4
## 0.000000e+00 0.000000e+00
## as.factor(predationindex)5 as.factor(sleepexposureindex)2
## 0.000000e+00 1.480500e+02
## as.factor(sleepexposureindex)3 as.factor(sleepexposureindex)4
## 6.173141e+15 4.332708e+14
## as.factor(sleepexposureindex)5
## 7.603846e+31
The index variables in the data set are ordinal, meaning they are categorical and they have a natural ordering. If we treat an index variable as a continuous variable, this will imply a linear change as the index changes. Repeat Exercise 3 a)-c) by treating two index variables as continuous variables.
(A) First find and specify the best set of predictors via stepwise selection with BIC criteria.
glm.null.sleep2 <- glm(maxlife10 ~ 1, data = sleep, family = "binomial")
glm.full.sleep2 <- glm(maxlife10 ~ bodyweight+brainweight+totalsleep+gestationtime
+ predationindex + sleepexposureindex, data = sleep, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
step.sleep2 <- step(glm.null.sleep2, scope = list(upper=glm.full.sleep2),
direction="both",test="Chisq", trace = F)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(step.sleep2)
##
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + sleepexposureindex +
## predationindex, family = "binomial", data = sleep)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.16387 3.59301 -1.716 0.0863 .
## brainweight 0.06018 0.03544 1.698 0.0895 .
## totalsleep 0.35985 0.20995 1.714 0.0865 .
## sleepexposureindex 4.42111 1.97540 2.238 0.0252 *
## predationindex -3.36917 1.51823 -2.219 0.0265 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.310 on 50 degrees of freedom
## Residual deviance: 19.212 on 46 degrees of freedom
## AIC: 29.212
##
## Number of Fisher Scoring iterations: 11
(B) What does Hosmer-Lemeshow’s test tells us about goodness of fit? And point out any issues with diagnostics by checking residual plots and cook’s distance plot. Do not remove influential points but just make comments on suspicious observations.
All chosen predictors in the final model are statistically significant with p-values less than 0.1, meaning its estimated coefficient is far from 0. Thus, it aids in predicting whether the maximum lifespan of a species will be at least 10 years. The goodness of fit test for the model has a p-value of 0.99, which indicates the model fit is reasonable.
hoslem.test(step.sleep2$y, fitted(step.sleep2), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: step.sleep2$y, fitted(step.sleep2)
## X-squared = 1.4406, df = 8, p-value = 0.9937
resid.d<-residuals(step.sleep2, type = "deviance")
resid.p<-residuals(step.sleep2, type = "pearson")
std.res.d<-residuals(step.sleep2, type = "deviance")/sqrt(1 - hatvalues(step.sleep2)) # standardized deviance residuals
std.res.p <-residuals(step.sleep2, type = "pearson")/sqrt(1 - hatvalues(step.sleep2)) # standardized pearson residuals
par(mfrow=c(1,2))
plot(std.res.d[step.sleep2$model$maxlife10==0], col = "red",
ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d[step.sleep2$model$maxlife10==1], col = "blue")
plot(std.res.p[step.sleep2$model$maxlife10==0], col = "red",
ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p[step.sleep2$model$maxlife10==1], col = "blue")
plot(step.sleep2, which=4, id.n=5)
(inf.id.4 = which(cooks.distance(step.sleep2)>0.25))
## 10 35 40 50
## 10 35 40 50
glm.sleep2.final = glm(maxlife10 ~ brainweight + totalsleep + sleepexposureindex + predationindex, data = sleep[-inf.id.4], family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
(C) Interpret what the model tells us about relationships between the predictors and the odds of a species’ maximum lifespan being at least 10 years.
The estimated OR for brainweight is 1.062 and it implies that the odds of a species having maximum lifespan at least 10 years is expected to increase by 1.062 (=exp(0.06)) times with one unit increase in brainweight. The estimated OR for totalsleep is 1.433 and it implies that the odds of a species having maximum lifespan at least 10 years is expected to increase by 1.433 (=exp(0.36)) with one unit increase in totalsleep. The odds ratio is estimated as 83.18 for sleepexposure so we can say for a one-unit increase in sleep exposure index, we expect to see an increase in the odds of a species having maximum lifespan at least 10 years by 83.18 (=exp(4.42)) times. The odds ratio for predation index is estimated as 0.034. Thus we expect the odds of a species having max lifespan at least 10 years change by 0.034 (=exp(-3.37)) multiplicative factor with one unit increase in predation index.
summary(glm.sleep2.final)
##
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + sleepexposureindex +
## predationindex, family = "binomial", data = sleep[-inf.id.4])
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.16387 3.59301 -1.716 0.0863 .
## brainweight 0.06018 0.03544 1.698 0.0895 .
## totalsleep 0.35985 0.20995 1.714 0.0865 .
## sleepexposureindex 4.42111 1.97540 2.238 0.0252 *
## predationindex -3.36917 1.51823 -2.219 0.0265 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.310 on 50 degrees of freedom
## Residual deviance: 19.212 on 46 degrees of freedom
## AIC: 29.212
##
## Number of Fisher Scoring iterations: 11
round(exp(glm.sleep2.final$coefficients),3)
## (Intercept) brainweight totalsleep sleepexposureindex
## 0.002 1.062 1.433 83.188
## predationindex
## 0.034