Use the sleep data set which originates from http://lib.stat.cmu.edu/datasets/sleep. maxlife10 is 0 if the species maximum life span is less than 10 years and 1 if its maximum life span is greater than or equal to 10 years. Consider finding the best logistic model for predicting the probability that a species’ maximum lifespan will be at least 10 years. Consider all 6 variables as candidates (do not include species) and two index variables of them are categorical in nature. Treat two index variables as categorical variables (e.g. ignore the fact that they are ordinal). Use significance level alpha=0.1
d2 <- read.csv("sleep.csv")
#str(d2)
First find and specify the best set of predictors via stepwise selection with AIC criteria.
Step 1)
Change sleepexposureindex & predationindex to categorical variables
d2$sleepexposureindex <- as.factor(d2$sleepexposureindex)
d2$predationindex <- as.factor(d2$predationindex)
#str(d2)
Now, running the model
e3.glm.sleep.full <- glm(maxlife10 ~ . , data=d2, family = "binomial")
e3.glm.sleep.null <- glm(maxlife10 ~ 1 , data=d2, family = "binomial")
#summary(glm.sleep.full)
Step 2)
e3.step.model.AIC <- step(e3.glm.sleep.null,scope = list(upper=e3.glm.sleep.full), direction="both",test="Chisq", trace = F)
summary(e3.step.model.AIC)
##
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + sleepexposureindex +
## predationindex, family = "binomial", data = d2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.42528 -0.00004 0.00000 0.00013 2.37523
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.602e+00 4.864e+00 -1.357 0.1747
## brainweight 5.101e-02 5.084e-02 1.003 0.3157
## totalsleep 4.230e-01 2.647e-01 1.598 0.1100
## sleepexposureindex2 4.998e+00 2.559e+00 1.953 0.0508 .
## sleepexposureindex3 3.636e+01 9.624e+03 0.004 0.9970
## sleepexposureindex4 3.370e+01 1.037e+04 0.003 0.9974
## sleepexposureindex5 7.341e+01 1.262e+04 0.006 0.9954
## predationindex2 -2.535e+00 1.960e+00 -1.293 0.1960
## predationindex3 -2.512e+01 1.253e+04 -0.002 0.9984
## predationindex4 -1.826e+01 6.795e+03 -0.003 0.9979
## predationindex5 -5.264e+01 1.143e+04 -0.005 0.9963
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.31 on 50 degrees of freedom
## Residual deviance: 15.88 on 40 degrees of freedom
## AIC: 37.88
##
## Number of Fisher Scoring iterations: 20
Comment on the significance of parameter estimates, what Hosmer-Lemeshow’s test tells us about goodness of fit, and point out any issues with diagnostics by checking residual plots and cook’s distance plot. Do not remove influential points but just make comments on suspicious observations.
Since alpha = 0.1, thus:
Step 1)
## Hosmer Lemeshow test
hoslem.test(e3.step.model.AIC$y, fitted(e3.step.model.AIC), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: e3.step.model.AIC$y, fitted(e3.step.model.AIC)
## X-squared = 7.0397, df = 8, p-value = 0.5324
Based on P-Value, since alpha level=0.1, and P-Value > 0.1, thus this is our conclusion:
Step 2)
We will use both Deviance and Person methods:
e3.res.deviance <-residuals(e3.step.model.AIC, type = "deviance")
e3.res.pearson <-residuals(e3.step.model.AIC, type = "pearson")
e3.std.res.deviance <-residuals(e3.step.model.AIC, type = "deviance")/sqrt(1 - hatvalues(e3.step.model.AIC)) # standardized deviance residuals
e3.std.res.pearson <-residuals(e3.step.model.AIC, type = "pearson")/sqrt(1 - hatvalues(e3.step.model.AIC)) # standardized pearson residuals
Show the plots:
dev.new(width = 1000, height = 1000)
par(mfrow=c(1,2))
plot(e3.std.res.deviance[e3.step.model.AIC$model$maxlife10==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(e3.std.res.deviance[e3.step.model.AIC$model$maxlife10==1], col = "blue")
plot(e3.std.res.pearson[e3.step.model.AIC$model$maxlife10==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(e3.std.res.pearson[e3.step.model.AIC$model$maxlife10==1], col = "blue")
Step 3)
# visual inspection
dev.new(width = 1000, height = 1000)
plot(e3.step.model.AIC, which = 4, id.n = 5)
# which observation has cook's d larger than 0.25 ?
e3.inf.id=which(cooks.distance(e3.step.model.AIC) > 0.25)
e3.inf.id
## 35 40
## 35 40
Conclusion:
interpret what the model tells us about relationships between the predictors and the odds of a species maximum lifespan being at least 10 years.
Step 1)
## odds ratios
e3.OR <- exp(e3.step.model.AIC$coefficients)
round(e3.OR, 3)
## (Intercept) brainweight totalsleep sleepexposureindex2
## 1.000000e-03 1.052000e+00 1.527000e+00 1.480500e+02
## sleepexposureindex3 sleepexposureindex4 sleepexposureindex5 predationindex2
## 6.173141e+15 4.332708e+14 7.603846e+31 7.900000e-02
## predationindex3 predationindex4 predationindex5
## 0.000000e+00 0.000000e+00 0.000000e+00
Based on OR results:
OR (predationindex2 Vs. predationindex1) = exp(-2.535e+00) = 0.07926 < 1
==> odds(Y=1|predationindex = 1) > odds(Y=1|predationindex = 2)
OR (predationindex3 Vs. predationindex1) = exp(-2.512e+01) = 1.2e-11 < 1
==> odds(Y=1|predationindex = 1) > odds(Y=1|predationindex = 3)
OR (predationindex4 Vs. predationindex1) = exp(-1.826e+01) = 1.1e-08 < 1
==> odds(Y=1|predationindex = 1) > odds(Y=1|predationindex = 4)
OR (predationindex5 Vs. predationindex1) = exp(-5.264e+01) = 1.3e-23 < 1
==> odds(Y=1|predationindex = 1) > odds(Y=1|predationindex = 5)
OR (sleepexposureindex2 Vs.sleepexposureindex1) = exp(4.998e+00) = 148.1166 > 1
==> odds(Y=1|sleepexposureindex = 2) > odds(Y=1|sleepexposureindex = 1)
OR (sleepexposureindex3 Vs.sleepexposureindex1) = exp(3.636e+01) = 6.174e15 > 1
==> odds(Y=1|sleepexposureindex = 3) > odds(Y=1|sleepexposureindex = 1)
OR (sleepexposureindex4 Vs.sleepexposureindex1) = exp(3.370e+01) = 4.322e14 > 1
==> odds(Y=1|sleepexposureindex = 4) > odds(Y=1|sleepexposureindex = 1)
OR (sleepexposureindex5 Vs.sleepexposureindex1) = exp(7.341e+01) = 7.613e31 > 1
==> odds(Y=1|sleepexposureindex = 5) > odds(Y=1|sleepexposureindex = 1)
OR (brainweight) = exp(5.101e-02) = 1.052330 = 1
==> No brainweight effect
OR (totalsleep) = exp(4.230e-01) = 1.526534 > 1
==> chance of having an event will increase 1.5 times with one unit change in totalsleep
So,
odds(Y=1|predationindex = 1) > odds(Y=1|predationindex = 2) > odds(Y=1|predationindex = 4) > odds(Y=1|predationindex = 3) > odds(Y=1|predationindex = 5)
&
odds(Y=1|sleepexposureindex = 5) > odds(Y=1|sleepexposureindex = 3) > odds(Y=1|sleepexposureindex = 4) > odds(Y=1|sleepexposureindex = 2) > odds(Y=1|sleepexposureindex = 1)
Conclusion:
The index variables in the data set are ordinal, meaning they are categorical and they have a natural ordering. If we treat an index variable as a continuous variable, this will imply a linear change as the index changes. Repeat Part I by treating two index variables as continuous variables. Use significance level alpha=0.1
d3 <- read.csv("sleep.csv")
#str(d3)
Step 1)
e4.glm.sleep.full <- glm(maxlife10 ~ bodyweight+brainweight+totalsleep+gestationtime + predationindex + sleepexposureindex , data=d3, family = "binomial")
e4.glm.sleep.null <- glm(maxlife10 ~ 1 , data=d3, family = "binomial")
#summary(e4.glm.sleep.full)
Step 2)
e4.step.model.AIC <- step(e4.glm.sleep.null,scope = list(upper=e4.glm.sleep.full), direction="both",test="Chisq", trace = F)
summary(e4.step.model.AIC)
##
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + sleepexposureindex +
## predationindex, family = "binomial", data = d3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.82148 -0.04746 0.00000 0.05811 2.41681
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.16387 3.59301 -1.716 0.0863 .
## brainweight 0.06018 0.03544 1.698 0.0895 .
## totalsleep 0.35985 0.20995 1.714 0.0865 .
## sleepexposureindex 4.42111 1.97540 2.238 0.0252 *
## predationindex -3.36917 1.51823 -2.219 0.0265 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.310 on 50 degrees of freedom
## Residual deviance: 19.212 on 46 degrees of freedom
## AIC: 29.212
##
## Number of Fisher Scoring iterations: 11
Comment on the significance of parameter estimates, what Hosmer-Lemeshow’s test tells us about goodness of fit, and point out any issues with diagnostics by checking residual plots and cook’s distance plot. Do not remove influential points but just make comments on suspicious observations.
Log(maxlife10) = -6.16387 + 0.06018 * brainweight + 0.35985 * totalsleep + 4.42111 * sleepexposureindex - 3.36917 * predationindex
Since alpha = 0.1, thus:
Step 1)
## Hosmer Lemeshow test
hoslem.test(e4.step.model.AIC$y, fitted(e4.step.model.AIC), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: e4.step.model.AIC$y, fitted(e4.step.model.AIC)
## X-squared = 1.4406, df = 8, p-value = 0.9937
Step 2)
We will use both Deviance and Person methods:
e4.res.deviance <-residuals(e4.step.model.AIC, type = "deviance")
e4.res.pearson <-residuals(e4.step.model.AIC, type = "pearson")
e4.std.res.deviance <-residuals(e4.step.model.AIC, type = "deviance")/sqrt(1 - hatvalues(e4.step.model.AIC)) # standardized deviance residuals
e4.std.res.pearson <-residuals(e4.step.model.AIC, type = "pearson")/sqrt(1 - hatvalues(e4.step.model.AIC)) # standardized pearson residuals
Show the plots:
dev.new(width = 1000, height = 1000)
par(mfrow=c(1,2))
plot(e4.std.res.deviance[e4.step.model.AIC$model$maxlife10==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(e4.std.res.deviance[e4.step.model.AIC$model$maxlife10==1], col = "blue")
plot(e4.std.res.pearson[e4.step.model.AIC$model$maxlife10==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(e4.std.res.pearson[e4.step.model.AIC$model$maxlife10==1], col = "blue")
Step 3)
# visual inspection
dev.new(width = 1000, height = 1000)
plot(e4.step.model.AIC, which = 4, id.n = 5)
# which observation has cook's d larger than 0.25 ?
e4.inf.id=which(cooks.distance(e4.step.model.AIC) > 0.25)
e4.inf.id
## 10 35 40 50
## 10 35 40 50
Conclusion
interpret what the model tells us about relationships between the predictors and the odds of a species maximum lifespan being at least 10 years.
Step 1)
## odds ratios
e4.OR <- exp(e4.step.model.AIC$coefficients)
round(e4.OR, 3)
## (Intercept) brainweight totalsleep sleepexposureindex
## 0.002 1.062 1.433 83.188
## predationindex
## 0.034
Conclusion
Based on previous part, our final model is:
Log(maxlife10) = -6.16387 + 0.06018 * brainweight + 0.35985 * totalsleep + 4.42111 * sleepexposureindex - 3.36917 * predationindex
Since all predictors are continuous variables, thus we have Multiplicative change.
OR(brainweight) = 1 (Almost is equal 1)
Thus, there is no brainweight effect.
OR(totalsleep) =1.433 > 1
Thus, there is positive relationship (increase),
It means one unit increasing in “totalsleep” will cause multiple of 1.433 change in maxlife.
OR(sleepexposureindex) = 83.188 > 1
Thus, there is positive relationship (increase),
It means one unit increasing in “sleepexposureindex” will cause multiple of 83.188 change in maxlife.
OR(predationindex) = 0.034 < 1
Thus, there is Negative relationship (Decrease),
It means one unit increasing in “predationindex” will cause multiple of 0.034 change in maxlife.
In the other word, increasing in the “totalsleep” and “sleepexposureindex” and decreasing in the “predationindex” will cause more chance to maxlife.