The liver data set is a subset of the ILPD (Indian Liver Patient Dataset) data set. It contains the first 10 variables described on the UCI Machine Learning Repository and a LiverPatient variable (indicating whether or not the individual is a liver patient. People with active liver disease are coded as LiverPatient=1 and people without disease are coded LiverPatient=0) for adults in the data set. Adults here are defined to be individuals who are at least 18 years of age. It is possible that there will be different significant predictors of being a liver patient for adult females and adult males.
For only females in the data set, find and specify the best set of predictors via stepwise selection with AIC criteria for a logistic regression model predicting whether a female is a liver patient. NOTE: Specifying the full model using “LiverPatient~., data=…” will give an error message (due to only one level of factor – Female – in the data, I guess so). Suggest typing all variables manually for the full model
str(liver)
## spec_tbl_df [558 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : num [1:558] 65 62 62 58 72 46 26 29 55 57 ...
## $ Gender : chr [1:558] "Female" "Male" "Male" "Male" ...
## $ TB : num [1:558] 0.7 10.9 7.3 1 3.9 1.8 0.9 0.9 0.7 0.6 ...
## $ DB : num [1:558] 0.1 5.5 4.1 0.4 2 0.7 0.2 0.3 0.2 0.1 ...
## $ Alkphos : num [1:558] 187 699 490 182 195 208 154 202 290 210 ...
## $ Alamine : num [1:558] 16 64 60 14 27 19 16 14 53 51 ...
## $ Aspartate : num [1:558] 18 100 68 20 59 14 12 11 58 59 ...
## $ TP : num [1:558] 6.8 7.5 7 6.8 7.3 7.6 7 6.7 6.8 5.9 ...
## $ ALB : num [1:558] 3.3 3.2 3.3 3.4 2.4 4.4 3.5 3.6 3.4 2.7 ...
## $ LiverPatient: num [1:558] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. Age = col_double(),
## .. Gender = col_character(),
## .. TB = col_double(),
## .. DB = col_double(),
## .. Alkphos = col_double(),
## .. Alamine = col_double(),
## .. Aspartate = col_double(),
## .. TP = col_double(),
## .. ALB = col_double(),
## .. LiverPatient = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
liverFemale = liver[which(liver$Gender == "Female"),]
glm.null.female = glm(LiverPatient ~ 1, data = liverFemale, family = "binomial") #null model: no predictor
glm.full.female = glm(LiverPatient ~ Age+ TB+ DB+ Alkphos+ Alamine+ Aspartate+ TP+ ALB, data = liverFemale, family = "binomial") # full model: all predictors
#Stepwise based on AIC for Female
stepwise.Female <- step(glm.null.female, scope = list(upper=glm.full.female), direction="both",test="Chisq", trace = F)
summary(stepwise.Female)
##
## Call:
## glm(formula = LiverPatient ~ DB + Aspartate, family = "binomial",
## data = liverFemale)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8178 -1.2223 0.4402 1.1091 1.2049
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.32480 0.31013 -1.047 0.2950
## DB 0.94479 0.55808 1.693 0.0905 .
## Aspartate 0.01106 0.00616 1.796 0.0726 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 175.72 on 134 degrees of freedom
## Residual deviance: 154.27 on 132 degrees of freedom
## AIC: 160.27
##
## Number of Fisher Scoring iterations: 7
Comment on the significance of parameter estimates under significance level alpha=0.1, what Hosmer-Lemeshow’s test tells us about goodness of fit, and point out any issues with diagnostics by checking residual plots and cook’s distance plot (with cut-off 0.25).
Comments: The variables DB and Aspartate both have values below a 0.1 significance level of 0.0905 and 0.0726, respectfully, making them the significant predictors for Female liver patients. Both of these predictors have a high chance of a relationship between Females patients and active liver disease.
hoslem.test(stepwise.Female$y, fitted(stepwise.Female), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: stepwise.Female$y, fitted(stepwise.Female)
## X-squared = 7.7535, df = 8, p-value = 0.4579
Observation: The Hosmer and Lemeshow goodness of fit test provides a p-value of 0.4579 which is above the 0.1 significance level making the model adequate.
resid.d<-residuals(stepwise.Female, type = "deviance")
resid.p<-residuals(stepwise.Female, type = "pearson")
std.res.d<-residuals(stepwise.Female, type = "deviance")/sqrt(1 - hatvalues(stepwise.Female)) # standardized deviance residuals
std.res.p <-residuals(stepwise.Female, type = "pearson")/sqrt(1 - hatvalues(stepwise.Female)) # standardized pearson residuals
par(mfrow=c(1,2))
plot(std.res.d[stepwise.Female$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d[stepwise.Female$model$LiverPatient==1], col = "blue")
plot(std.res.p[stepwise.Female$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p[stepwise.Female$model$LiverPatient==1], col = "blue")
Observation: The diagnostic plots follow a similar pattern in both the standard deviance residual and Pearson residuals. The red line does not stray away from linearity and there is no large outstanding values it does not violate any assumptions.
plot(stepwise.Female, which = 4, id.n = 5)
(inf.id.1=which(cooks.distance(stepwise.Female)>0.25))
## named integer(0)
Observation: There is no observation with a Cook’s distance larger than 0.25.
Interpret relationships between predictors in the final model and the odds of an adult female being a liver patient. (based on estimated Odds Ratio). NOTE: stepwise selection with AIC criteria can be performed by default step() function in R.
summary(stepwise.Female)
##
## Call:
## glm(formula = LiverPatient ~ DB + Aspartate, family = "binomial",
## data = liverFemale)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8178 -1.2223 0.4402 1.1091 1.2049
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.32480 0.31013 -1.047 0.2950
## DB 0.94479 0.55808 1.693 0.0905 .
## Aspartate 0.01106 0.00616 1.796 0.0726 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 175.72 on 134 degrees of freedom
## Residual deviance: 154.27 on 132 degrees of freedom
## AIC: 160.27
##
## Number of Fisher Scoring iterations: 7
Observation: The final model for Females is: \(\log(p/1-p)\)=-0.32480+0.94479DB+0.01106Aspartate
OR=exp(stepwise.Female$coefficients)
round(OR, 3)
## (Intercept) DB Aspartate
## 0.723 2.572 1.011
Observation: 1. As DB increases by 0.94479 the odds of DB increases by exp=2.572. 2. As Aspartate increases by 0.01106 the odds of Aspartate increases by exp=1.011 3. Female patients with DB or Aspartate are more likely to have liver disease that is active.
Repeat exercise 1 for males. In addition to the previous questions, also d) comment on how the models for adult females and adult males differ. Use significance level alpha=0.1
NOTE: You will get an error message “glm.fit: fitted probabilities numerically 0 or 1 occurred” for this run. Ignore this and use the result for the interpretation. I will explain what this error means in Week 14 videos.
liverMale = liver[which(liver$Gender == "Male"),]
glm.null.male = glm(LiverPatient ~ 1, data = liverMale, family = "binomial")
glm.full.male = glm(LiverPatient ~ Age + TB + DB + Alkphos + Alamine + Aspartate + TP + ALB, data=liverMale, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
stepwise.Male <- step(glm.null.male, scope = list(upper=glm.full.male),
direction="both", test="Chisq", trace = F)
summary(stepwise.Male)
##
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial",
## data = liverMale)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3405 -0.5170 0.3978 0.8614 1.3756
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.476570 0.481336 -3.068 0.00216 **
## DB 0.512503 0.176066 2.911 0.00360 **
## Alamine 0.016218 0.005239 3.095 0.00197 **
## Age 0.020616 0.008095 2.547 0.01087 *
## Alkphos 0.001740 0.001058 1.645 0.09992 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 476.28 on 422 degrees of freedom
## Residual deviance: 395.05 on 418 degrees of freedom
## AIC: 405.05
##
## Number of Fisher Scoring iterations: 7
Comments: All the values from the given stepwise selection have p-values lower than the significance level of 0.1. This allows us to state that there is a significant relationship between DB, Alamine, Age, and Alkphos in determining if a Male liver pationa has a liver disease.
hoslem.test(stepwise.Male$y, fitted(stepwise.Male), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: stepwise.Male$y, fitted(stepwise.Male)
## X-squared = 7.043, df = 8, p-value = 0.532
resid.d.Male<-residuals(stepwise.Male, type = "deviance")
resid.p.Male<-residuals(stepwise.Male, type = "pearson")
std.res.d.Male<-residuals(stepwise.Male, type = "deviance")/sqrt(1 - hatvalues(stepwise.Male))
std.res.p.Male <-residuals(stepwise.Male, type = "pearson")/sqrt(1 - hatvalues(stepwise.Male))
par(mfrow=c(1,2))
plot(std.res.d.Male[stepwise.Male$model$LiverPatient==0], col = "red",
ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d.Male[stepwise.Male$model$LiverPatient==1], col = "blue")
plot(std.res.p.Male[stepwise.Male$model$LiverPatient==0], col = "red",
ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p.Male[stepwise.Male$model$LiverPatient==1], col = "blue")
plot(stepwise.Male, which = 4, id.n=5 )
(inf.id.2 = which(cooks.distance(stepwise.Male)>0.25))
## 86
## 86
glm.male.final = glm(LiverPatient ~ DB + Alamine + Age + Alkphos, data = liverMale[-inf.id.2, ], family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
plot(glm.male.final, which = 4, id.n=5 )
Observation With our Hosmer-Lemeshow test we receieved a p-value greater than our significance level of
0.1, which means we dod not reject H\(_{0}\) and we assume our model is adequate. Now looking at our residual plots there is a parallel pattern, which implies there is a similar estimated probabilities for all observations. Finally in the cooks plot we do see one observation with a Cook’s distance larger than 0.25, observation 86.
summary(glm.male.final)
##
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial",
## data = liverMale[-inf.id.2, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5166 0.0000 0.3301 0.8648 1.4696
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.902754 0.527386 -3.608 0.000309 ***
## DB 0.573104 0.198893 2.881 0.003958 **
## Alamine 0.015850 0.005466 2.900 0.003737 **
## Age 0.020418 0.008210 2.487 0.012883 *
## Alkphos 0.003744 0.001477 2.534 0.011262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 473.51 on 421 degrees of freedom
## Residual deviance: 381.31 on 417 degrees of freedom
## AIC: 391.31
##
## Number of Fisher Scoring iterations: 8
OR.M = exp(glm.male.final$coefficients)
round(OR.M,3)
## (Intercept) DB Alamine Age Alkphos
## 0.149 1.774 1.016 1.021 1.004
Observation As our final model is \(\log(p/1-p)\) = -1.902754 + 0.573104*DB + 0.015850*Alamine + 0.020418*Age + 0.003744*Alkphos we can compare it to the output of the Odds Ratios.
The odds of a male liver patient having active liver disease increases by a fact of 1.774 with every unit increase in DB.
Similary if we assume all other variables remain constant the odds of liver disease increase by a factor of the following,
1.016 with a one unit increase in Alamine1.021 with a one unit increase in Age1.004 with a one unit increase in AlkphosComment on how the models for adult females and adult males differ. Use significance level alpha=0.1 Comment: When comparing the models for females and males we note that the females only have two significant predictors (DB and Aspartate) while the males have four significant predictors (DB, Alamine, Age, and Alkphos). With each of these predictors increasing the odds of a patient having liver disease it may be easier to predict when a male is at risk easier than a female, simply because of the amount of predictors used.
Use the sleep data set which originates from http://lib.stat.cmu.edu/datasets/sleep. maxlife10 is 0 if the species maximum life span is less than 10 years and 1 if its maximum life span is greater than or equal to 10 years. Consider finding the best logistic model for predicting the probability that a species’ maximum lifespan will be at least 10 years. Consider all 6 variables as candidates (do not include species) and two index variables of them are categorical in nature. Treat two index variables as categorical variables (e.g. ignore the fact that they are ordinal). Use significance level alpha=0.1
First find and specify the best set of predictors via stepwise selection with AIC criteria.
str(sleep)
## 'data.frame': 51 obs. of 8 variables:
## $ species : chr "African" "African" "Arctic F" "Asian el" ...
## $ bodyweight : num 6654 1 3.38 2547 10.55 ...
## $ brainweight : num 5712 6.6 44.5 4603 179.5 ...
## $ totalsleep : num 3.3 8.3 12.5 3.9 9.8 19.7 6.2 14.5 9.7 12.5 ...
## $ gestationtime : num 645 42 60 624 180 35 392 63 230 112 ...
## $ predationindex : int 3 3 1 3 4 1 4 1 1 5 ...
## $ sleepexposureindex: int 5 1 1 5 4 1 5 2 1 4 ...
## $ maxlife10 : int 1 0 1 1 1 1 1 1 1 0 ...
glm.null.sleep1 <- glm(maxlife10 ~ 1, data = sleep, family = binomial)
glm.full.sleep1 <- glm(maxlife10 ~ bodyweight + brainweight + totalsleep + gestationtime + as.factor(predationindex) + as.factor(sleepexposureindex), data = sleep, family = binomial)
step.sleep1 <- step(glm.null.sleep1, scope = list(upper=glm.full.sleep1), direction = "both", test = "Chisq", trace = F)
summary(step.sleep1)
##
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + as.factor(sleepexposureindex) +
## as.factor(predationindex), family = binomial, data = sleep)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.42528 -0.00004 0.00000 0.00013 2.37523
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.602e+00 4.864e+00 -1.357 0.1747
## brainweight 5.101e-02 5.084e-02 1.003 0.3157
## totalsleep 4.230e-01 2.647e-01 1.598 0.1100
## as.factor(sleepexposureindex)2 4.998e+00 2.559e+00 1.953 0.0508 .
## as.factor(sleepexposureindex)3 3.636e+01 9.624e+03 0.004 0.9970
## as.factor(sleepexposureindex)4 3.370e+01 1.037e+04 0.003 0.9974
## as.factor(sleepexposureindex)5 7.341e+01 1.262e+04 0.006 0.9954
## as.factor(predationindex)2 -2.535e+00 1.960e+00 -1.293 0.1960
## as.factor(predationindex)3 -2.512e+01 1.253e+04 -0.002 0.9984
## as.factor(predationindex)4 -1.826e+01 6.795e+03 -0.003 0.9979
## as.factor(predationindex)5 -5.264e+01 1.143e+04 -0.005 0.9963
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.31 on 50 degrees of freedom
## Residual deviance: 15.88 on 40 degrees of freedom
## AIC: 37.88
##
## Number of Fisher Scoring iterations: 20
Observation: Significant predictor is sleepexposure because it has p-values below the 0.1 significance level.
Comment on the significance of parameter estimates, what Hosmer-Lemeshow’s test tells us about goodness of fit, and point out any issues with diagnostics by checking residual plots and cook’s distance plot. Do not remove influential points but just make comments on suspicious observations.
Comment: Specifically the sleepexposure levels 3-5 have a chance of an event not happening whereas level 2 has a chance of an event happening. Still, sleepexposure is a significant predictor.
hoslem.test(step.sleep1$y, fitted(step.sleep1), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: step.sleep1$y, fitted(step.sleep1)
## X-squared = 7.0397, df = 8, p-value = 0.5324
Observation: Based on the Hosmer goodness of fit test, the p-value is 0.5324 which falls above the 0.1 significance level making the model adequate.
resid.d<-residuals(step.sleep1, type = "deviance")
resid.p<-residuals(step.sleep1, type = "pearson")
std.res.d<-residuals(step.sleep1, type = "deviance")/sqrt(1 - hatvalues(step.sleep1)) # standardized deviance residuals
std.res.p <-residuals(step.sleep1, type = "pearson")/sqrt(1 - hatvalues(step.sleep1)) # standardized pearson residuals
par(mfrow=c(1,2))
plot(std.res.d[step.sleep1$model$maxlife10==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d[step.sleep1$model$maxlife10==1], col = "blue")
plot(std.res.p[step.sleep1$model$maxlife10==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p[step.sleep1$model$maxlife10==1], col = "blue")
plot(step.sleep1, which = 4, id.n = 5)
(inf.id.3=which(cooks.distance(step.sleep1)>0.25))
## 35 40
## 35 40
Observation: 1. Diagnostic plots: The Deviance Residual and Pearson Residual red plots do not follow a linearity line. Both appear to have values scattered and follow no pattern. As a result, this does not violate any linearity assumption. 2. Cook’s distance: There are two values, 35 and 40 above the Cook’s distance point of 0.25.
Interpret what the model tells us about relationships between the predictors and the odds of a species’ maximum lifespan being at least 10 years. NOTE: stepwise selection with BIC criteria can be performed by step() function by adding an option k=log(n), where n is a sample size. No need to refit the model with only significant variables. Just use the final model from step() function, and for part (b), make comments based on this. But, for part (c), interpret the Odds Ratio only for significant level and no need to interpret other insignificant ones.
glm.sleep.final = glm(maxlife10 ~ brainweight + totalsleep + as.factor(predationindex) + as.factor(sleepexposureindex), data = sleep[-inf.id.3], family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.sleep.final)
##
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + as.factor(predationindex) +
## as.factor(sleepexposureindex), family = "binomial", data = sleep[-inf.id.3])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.42528 -0.00004 0.00000 0.00013 2.37523
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.602e+00 4.864e+00 -1.357 0.1747
## brainweight 5.101e-02 5.084e-02 1.003 0.3157
## totalsleep 4.230e-01 2.647e-01 1.598 0.1100
## as.factor(predationindex)2 -2.535e+00 1.960e+00 -1.293 0.1960
## as.factor(predationindex)3 -2.512e+01 1.253e+04 -0.002 0.9984
## as.factor(predationindex)4 -1.826e+01 6.795e+03 -0.003 0.9979
## as.factor(predationindex)5 -5.264e+01 1.143e+04 -0.005 0.9963
## as.factor(sleepexposureindex)2 4.998e+00 2.559e+00 1.953 0.0508 .
## as.factor(sleepexposureindex)3 3.636e+01 9.624e+03 0.004 0.9970
## as.factor(sleepexposureindex)4 3.370e+01 1.037e+04 0.003 0.9974
## as.factor(sleepexposureindex)5 7.341e+01 1.262e+04 0.006 0.9954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.31 on 50 degrees of freedom
## Residual deviance: 15.88 on 40 degrees of freedom
## AIC: 37.88
##
## Number of Fisher Scoring iterations: 20
round(exp(glm.sleep.final$coefficients),3)
## (Intercept) brainweight
## 1.000000e-03 1.052000e+00
## totalsleep as.factor(predationindex)2
## 1.527000e+00 7.900000e-02
## as.factor(predationindex)3 as.factor(predationindex)4
## 0.000000e+00 0.000000e+00
## as.factor(predationindex)5 as.factor(sleepexposureindex)2
## 0.000000e+00 1.480500e+02
## as.factor(sleepexposureindex)3 as.factor(sleepexposureindex)4
## 6.173141e+15 4.332708e+14
## as.factor(sleepexposureindex)5
## 7.603846e+31
Observation: Based on the Odds Ratio, sleepexposure2 has a value of 1.480500e+02. Therefore, animals with a maxlife of Level 2 has a more likely chance to live past the maximum lifespan of 10yrs.
The index variables in the data set are ordinal, meaning they are categorical and they have a natural ordering. If we treat an index variable as a continuous variable, this will imply a linear change as the index changes. Repeat Exercise 3 by treating two index variables as continuous variables. Use significance level alpha=0.1
###a)
glm.null.sleep2 <- glm(maxlife10 ~ 1, data = sleep, family = "binomial")
glm.full.sleep2 <- glm(maxlife10 ~ bodyweight+brainweight+totalsleep+gestationtime
+ predationindex + sleepexposureindex, data = sleep, family = "binomial")
step.sleep2 <- step(glm.null.sleep2, scope = list(upper=glm.full.sleep2),
direction="both",test="Chisq", trace = F)
summary(step.sleep2)
##
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + sleepexposureindex +
## predationindex, family = "binomial", data = sleep)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.82148 -0.04746 0.00000 0.05811 2.41681
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.16387 3.59301 -1.716 0.0863 .
## brainweight 0.06018 0.03544 1.698 0.0895 .
## totalsleep 0.35985 0.20995 1.714 0.0865 .
## sleepexposureindex 4.42111 1.97540 2.238 0.0252 *
## predationindex -3.36917 1.51823 -2.219 0.0265 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.310 on 50 degrees of freedom
## Residual deviance: 19.212 on 46 degrees of freedom
## AIC: 29.212
##
## Number of Fisher Scoring iterations: 11
Observation: With the stepwise selection treating the two index variables as continous variables we find the best predictors to be brainweight, totalsleep, sleepexposureindex, and predationindex.
As mentioned in part A brainweight, totalsleep, sleepexposureindex, and predationindex have p-values lower than the significance level of 0.1. Implying they all have a significant relationship when determining maxlife10.
hoslem.test(step.sleep2$y, fitted(step.sleep2), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: step.sleep2$y, fitted(step.sleep2)
## X-squared = 1.4406, df = 8, p-value = 0.9937
resid.d<-residuals(step.sleep2, type = "deviance")
resid.p<-residuals(step.sleep2, type = "pearson")
std.res.d<-residuals(step.sleep2, type = "deviance")/sqrt(1 - hatvalues(step.sleep2)) # standardized deviance residuals
std.res.p <-residuals(step.sleep2, type = "pearson")/sqrt(1 - hatvalues(step.sleep2)) # standardized pearson residuals
par(mfrow=c(1,2))
plot(std.res.d[step.sleep2$model$maxlife10==0], col = "red",
ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d[step.sleep2$model$maxlife10==1], col = "blue")
plot(std.res.p[step.sleep2$model$maxlife10==0], col = "red",
ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p[step.sleep2$model$maxlife10==1], col = "blue")
plot(step.sleep2, which=4, id.n=5)
Observation: Based on the Hosmer-Lemeshow test our model does appear to be adequate since the p-value is >0.1 and we do not have enough significant. The residual plots show overlap in the red (maxlife10 = 0) and blue (maxlife10 = 1) points. Finally there appears to be four influential points in the Cooks distance plot; 10, 35, 40, and 50.
(inf.id.4 = which(cooks.distance(step.sleep2)>0.25))
## 10 35 40 50
## 10 35 40 50
glm.sleep2.final = glm(maxlife10 ~ brainweight + totalsleep + sleepexposureindex + predationindex, data = sleep[-inf.id.4], family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(glm.sleep2.final)
##
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + sleepexposureindex +
## predationindex, family = "binomial", data = sleep[-inf.id.4])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.82148 -0.04746 0.00000 0.05811 2.41681
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.16387 3.59301 -1.716 0.0863 .
## brainweight 0.06018 0.03544 1.698 0.0895 .
## totalsleep 0.35985 0.20995 1.714 0.0865 .
## sleepexposureindex 4.42111 1.97540 2.238 0.0252 *
## predationindex -3.36917 1.51823 -2.219 0.0265 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.310 on 50 degrees of freedom
## Residual deviance: 19.212 on 46 degrees of freedom
## AIC: 29.212
##
## Number of Fisher Scoring iterations: 11
round(exp(glm.sleep2.final$coefficients),3)
## (Intercept) brainweight totalsleep sleepexposureindex
## 0.002 1.062 1.433 83.188
## predationindex
## 0.034
Observation: As our final model is \(\log(p/1-p)\) = -6.16387 + 0.06018*brainweight + 0.35985*totalsleep + 4.42111*sleepexposureindex + -3.36917*predationindex we can compare it to the output of the Odds Ratios.
The odds of a species maximum lifespan being at least 10 years increases by a fact of 1.062 with every unit increase in brainweight.
Similary if we assume all other variables remain constant the odds of a species maximum lifespan being at least 10 years increase by a factor of the following,
1.433 with a one unit increase in totalsleep83.188 with a one unit increase in sleepexposureindex0.034 with a one unit increase in predationindex