Begin by loading the liver.csv data into the environment as a data frame (d1).
Perform a basic inspection of the data frame.
Find and specify the best set of predictors for the adult females in d1.
Create a subset of d1 containing only observations with Gender == "Female".
Fit the null and full logistic regression models.
Perform stepwise selection using the AIC criterion.
Conclusion: Based on the stepwise selection results, the best predictors are "DB" and "Aspartate".
Note: This fulfills the requirement of exercise 1A.
# 1
d1 = read.csv("liver.csv", header = T)
# 2
str(d1)
## 'data.frame': 558 obs. of 10 variables:
## $ Age : int 65 62 62 58 72 46 26 29 55 57 ...
## $ Gender : chr "Female" "Male" "Male" "Male" ...
## $ TB : num 0.7 10.9 7.3 1 3.9 1.8 0.9 0.9 0.7 0.6 ...
## $ DB : num 0.1 5.5 4.1 0.4 2 0.7 0.2 0.3 0.2 0.1 ...
## $ Alkphos : int 187 699 490 182 195 208 154 202 290 210 ...
## $ Alamine : int 16 64 60 14 27 19 16 14 53 51 ...
## $ Aspartate : int 18 100 68 20 59 14 12 11 58 59 ...
## $ TP : num 6.8 7.5 7 6.8 7.3 7.6 7 6.7 6.8 5.9 ...
## $ ALB : num 3.3 3.2 3.3 3.4 2.4 4.4 3.5 3.6 3.4 2.7 ...
## $ LiverPatient: int 1 1 1 1 1 1 1 1 1 1 ...
# 3
# 3a)
d1_female = subset(d1, Gender == "Female")
# 3b)
glm.null.F = glm(LiverPatient ~ 1, data = d1_female, family = "binomial")
glm.full.F = glm(LiverPatient ~ Age+TB+DB+Alkphos+Alamine+Aspartate+TP+ALB,
data = d1_female, family = "binomial")
# 3c)
(step.model.1 = step(glm.null.F, scope = list(upper=glm.full.F),
direction="both",test="Chisq", trace = F))
##
## Call: glm(formula = LiverPatient ~ DB + Aspartate, family = "binomial",
## data = d1_female)
##
## Coefficients:
## (Intercept) DB Aspartate
## -0.32480 0.94479 0.01106
##
## Degrees of Freedom: 134 Total (i.e. Null); 132 Residual
## Null Deviance: 175.7
## Residual Deviance: 154.3 AIC: 160.3
summary(step.model.1)
##
## Call:
## glm(formula = LiverPatient ~ DB + Aspartate, family = "binomial",
## data = d1_female)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8178 -1.2223 0.4402 1.1091 1.2049
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.32480 0.31013 -1.047 0.2950
## DB 0.94479 0.55808 1.693 0.0905 .
## Aspartate 0.01106 0.00616 1.796 0.0726 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 175.72 on 134 degrees of freedom
## Residual deviance: 154.27 on 132 degrees of freedom
## AIC: 160.27
##
## Number of Fisher Scoring iterations: 7
We see that "DB" (p-value = 0.0905) and "Aspartate" (p-value = 0.0726) are significant at the 0.1 level. In other words, there is a significant relationship between the probability of having liver disease and DB, and likewise between the probability of having liver disease and Aspartate.
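As a complementary check on these borderline p-values, approximate 95% Wald confidence intervals for the odds ratios can be computed directly from the fitted model. The sketch below is not part of the original output; intervals that include 1 indicate effects not distinguishable from zero at that level.
# Approximate 95% Wald confidence intervals on the odds-ratio scale (not in original output)
round(exp(confint.default(step.model.1)), 3)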
Perform Hosmer-Lemeshow's goodness-of-fit test.
Conclusion: The Hosmer-Lemeshow goodness-of-fit test gives a p-value of 0.4579, which is greater than 0.1. Therefore, we cannot reject the null hypothesis and conclude that the model is adequate, i.e. the fit is good.
Conclusion: The standardized deviance and Pearson residual plots show no large patterns for either Y = 1 or Y = 0, and all observations fall between -2 and 2, so the assumption remains valid.
Conclusion: Using the Cook's distance cutoff of 0.25, no observations exceed the cutoff, so there are no influential observations.
Note: This fulfills the requirement of exercise 1B.
# 1
# Comment shown above.
# 2
library(ResourceSelection)  # provides hoslem.test()
hoslem.test(step.model.1$y, fitted(step.model.1), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: step.model.1$y, fitted(step.model.1)
## X-squared = 7.7535, df = 8, p-value = 0.4579
# 3
# a)
resid.d<-residuals(step.model.1, type = "deviance")
resid.p<-residuals(step.model.1, type = "pearson")
std.res.d<-residuals(step.model.1, type = "deviance")/
sqrt(1 - hatvalues(step.model.1))
std.res.p <-residuals(step.model.1, type = "pearson")/
sqrt(1 - hatvalues(step.model.1))
par(mfrow=c(1,2))
plot(std.res.d[step.model.1$model$LiverPatient==0], col = "red",
ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d[step.model.1$model$LiverPatient==1], col = "blue")
plot(std.res.p[step.model.1$model$LiverPatient==0], col = "red",
ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p[step.model.1$model$LiverPatient==1], col = "blue")
# b)
plot(step.model.1, which = 4, id.n = 5)
(inf.id=which(cooks.distance(step.model.1)>0.25))
## named integer(0)
a) Final Model: P(hat) = e^(-0.32480 + 0.94479*DB + 0.01106*Aspartate) / (1 + e^(-0.32480 + 0.94479*DB + 0.01106*Aspartate))
b) Conclusion DB: Based on the odds ratio, the odds of an adult female being a liver patient increase by a factor of e^(0.94479) = 2.572 for each one-unit increase in DB, holding Aspartate constant.
Conclusion Aspartate: Based on the odds ratio, the odds of an adult female being a liver patient increase by a factor of e^(0.01106) = 1.011 for each one-unit increase in Aspartate, holding DB constant.
Conclusion Final: Given these positive relationships, adult females with higher levels of DB or Aspartate have higher odds of having liver disease.
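To illustrate how the fitted model is used, the predicted probability of liver disease for a hypothetical adult female can be obtained with predict(); the DB and Aspartate values below are illustrative only, not taken from the data.
# Predicted probability of being a liver patient for a hypothetical adult female
new.female = data.frame(DB = 1.0, Aspartate = 50)  # illustrative values
predict(step.model.1, newdata = new.female, type = "response")
# equivalent hand calculation: plogis(-0.32480 + 0.94479*1.0 + 0.01106*50), roughly 0.76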
# 1
# a)
summary(step.model.1)
##
## Call:
## glm(formula = LiverPatient ~ DB + Aspartate, family = "binomial",
## data = d1_female)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8178 -1.2223 0.4402 1.1091 1.2049
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.32480 0.31013 -1.047 0.2950
## DB 0.94479 0.55808 1.693 0.0905 .
## Aspartate 0.01106 0.00616 1.796 0.0726 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 175.72 on 134 degrees of freedom
## Residual deviance: 154.27 on 132 degrees of freedom
## AIC: 160.27
##
## Number of Fisher Scoring iterations: 7
# b)
round(exp(step.model.1$coefficients),3)
## (Intercept) DB Aspartate
## 0.723 2.572 1.011
Create a subset of d1 containing only observations with Gender == "Male".
Fit the null and full logistic regression models.
Perform stepwise selection using the AIC criterion.
Conclusion: Based on the stepwise selection results, the best predictors are "DB", "Alamine", "Age", and "Alkphos".
Note: This fulfills the requirement of exercise 2A.
# 1
# 1a)
d1_male = subset(d1, Gender == "Male")
# 1b)
glm.null.M = glm(LiverPatient ~ 1, data = d1_male, family = "binomial")
glm.full.M = glm(LiverPatient ~ Age+TB+DB+Alkphos+Alamine+Aspartate+TP+ALB,
data = d1_male, family = "binomial")
# 1c)
(step.model.2 = step(glm.null.M, scope = list(upper=glm.full.M),
direction="both",test="Chisq", trace = F))
##
## Call: glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial",
## data = d1_male)
##
## Coefficients:
## (Intercept) DB Alamine Age Alkphos
## -1.47657 0.51250 0.01622 0.02062 0.00174
##
## Degrees of Freedom: 422 Total (i.e. Null); 418 Residual
## Null Deviance: 476.3
## Residual Deviance: 395.1 AIC: 405.1
summary(step.model.2)
##
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial",
## data = d1_male)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3405 -0.5170 0.3978 0.8614 1.3756
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.476570 0.481336 -3.068 0.00216 **
## DB 0.512503 0.176066 2.911 0.00360 **
## Alamine 0.016218 0.005239 3.095 0.00197 **
## Age 0.020616 0.008095 2.547 0.01087 *
## Alkphos 0.001740 0.001058 1.645 0.09992 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 476.28 on 422 degrees of freedom
## Residual deviance: 395.05 on 418 degrees of freedom
## AIC: 405.05
##
## Number of Fisher Scoring iterations: 7
We see that "DB" (p-value = 0.00360), "Alamine" (p-value = 0.00197), "Age" (p-value = 0.01087), and "Alkphos" (p-value = 0.09992) are significant at the 0.1 level. In other words, there is a significant relationship between the probability of having liver disease and each of DB, Alamine, Age, and Alkphos.
Perform Hosmer-Lemeshow's goodness-of-fit test.
Conclusion: The Hosmer-Lemeshow goodness-of-fit test gives a p-value of 0.532, which is greater than 0.1. Therefore, we cannot reject the null hypothesis and conclude that the model is adequate, i.e. the fit is good; note, however, that this assessment is made before the influential observations identified below are addressed.
Conclusion: The standardized deviance and Pearson residual plots show no large patterns for either Y = 1 or Y = 0, and most observations fall between -2 and 2, so the assumption remains valid.
Conclusion: Using the Cook's distance cutoff of 0.25, two observations exceed the cutoff: observations 111 and 86.
Re-check the diagnostics after removing the influential observations, using the residual plots and the Cook's distance plot (with cutoff 0.25).
Perform Hosmer-Lemeshow's goodness-of-fit test on the refitted model.
Conclusion: For the refitted model, the standardized deviance and Pearson residual plots show no large patterns for either Y = 1 or Y = 0, and most observations fall between -2 and 2, so the assumption remains valid.
Conclusion: Using the Cook's distance cutoff of 0.25, no observations in the refitted model exceed the cutoff.
Conclusion: The Hosmer-Lemeshow goodness-of-fit test on the refitted model gives a p-value of 0.4669, which is greater than 0.1. Therefore, we cannot reject the null hypothesis and conclude that the model is still adequate, i.e. the fit is good.
Note: This fulfills the requirement of exercise 2B.
# 1
# Comment shown above.
# 2
hoslem.test(step.model.2$y, fitted(step.model.2), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: step.model.2$y, fitted(step.model.2)
## X-squared = 7.043, df = 8, p-value = 0.532
# 3
# a)
resid.d_2<-residuals(step.model.2, type = "deviance")
resid.p_2<-residuals(step.model.2, type = "pearson")
std.res.d_2<-residuals(step.model.2, type = "deviance")/
sqrt(1 - hatvalues(step.model.2))
std.res.p_2 <-residuals(step.model.2, type = "pearson")/
sqrt(1 - hatvalues(step.model.2))
par(mfrow=c(1,2))
plot(std.res.d_2[step.model.2$model$LiverPatient==0], col = "red",
ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d_2[step.model.2$model$LiverPatient==1], col = "blue")
plot(std.res.p_2[step.model.2$model$LiverPatient==0], col = "red",
ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p_2[step.model.2$model$LiverPatient==1], col = "blue")
# b)
plot(step.model.2, which = 4, id.n = 5)
(inf.id_2=which(cooks.distance(step.model.2)>0.25))
## 111
## 86
# c)
glm.liver.M_2 = glm(LiverPatient ~ DB + Alamine + Age + Alkphos,
data = d1_male[-inf.id_2, ], family = "binomial")
summary(glm.liver.M_2)
##
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial",
## data = d1_male[-inf.id_2, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5166 0.0000 0.3301 0.8648 1.4696
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.902754 0.527386 -3.608 0.000309 ***
## DB 0.573104 0.198893 2.881 0.003958 **
## Alamine 0.015850 0.005466 2.900 0.003737 **
## Age 0.020418 0.008210 2.487 0.012883 *
## Alkphos 0.003744 0.001477 2.534 0.011262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 473.51 on 421 degrees of freedom
## Residual deviance: 381.31 on 417 degrees of freedom
## AIC: 391.31
##
## Number of Fisher Scoring iterations: 8
# 4 Recheck diagnostics after removal of influential points
plot(glm.liver.M_2, which = 4, id.n = 5)
(inf.id_z=which(cooks.distance(glm.liver.M_2)>0.25))
## named integer(0)
# 5 Refitted model goodness of fit check
hoslem.test(glm.liver.M_2$y, fitted(glm.liver.M_2), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: glm.liver.M_2$y, fitted(glm.liver.M_2)
## X-squared = 7.6642, df = 8, p-value = 0.4669
Final Model: log(p/(1-p)) = -1.902754 + 0.573104*DB + 0.015850*Alamine + 0.020418*Age + 0.003744*Alkphos
Conclusion DB: Based on the odds ratio, the odds of an adult male being a liver patient increase by a factor of e^(0.573104) = 1.774 for each one-unit increase in DB, holding Alamine, Age, and Alkphos constant.
Conclusion Alamine: Based on the odds ratio, the odds of an adult male being a liver patient increase by a factor of e^(0.015850) = 1.016 for each one-unit increase in Alamine, holding DB, Age, and Alkphos constant.
Conclusion Age: Based on the odds ratio, the odds of an adult male being a liver patient increase by a factor of e^(0.020418) = 1.021 for each one-unit increase in Age, holding DB, Alamine, and Alkphos constant.
Conclusion Alkphos: Based on the odds ratio, the odds of an adult male being a liver patient increase by a factor of e^(0.003744) = 1.004 for each one-unit increase in Alkphos, holding DB, Alamine, and Age constant.
Conclusion Final: Given these positive relationships, adult males with higher levels of DB, Alamine, and Alkphos, and who are older in age, have higher odds of having liver disease.
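As with the female model, the refitted male model can be used to estimate the probability of liver disease for a hypothetical adult male; the predictor values below are illustrative only, not taken from the data.
# Predicted probability of being a liver patient for a hypothetical adult male,
# using the refitted model (influential observations removed)
new.male = data.frame(DB = 1.0, Alamine = 40, Age = 45, Alkphos = 200)  # illustrative values
predict(glm.liver.M_2, newdata = new.male, type = "response")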
Note: This fulfills the requirement of exercise 2C.
# 1
# a)
summary(glm.liver.M_2)
##
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial",
## data = d1_male[-inf.id_2, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5166 0.0000 0.3301 0.8648 1.4696
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.902754 0.527386 -3.608 0.000309 ***
## DB 0.573104 0.198893 2.881 0.003958 **
## Alamine 0.015850 0.005466 2.900 0.003737 **
## Age 0.020418 0.008210 2.487 0.012883 *
## Alkphos 0.003744 0.001477 2.534 0.011262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 473.51 on 421 degrees of freedom
## Residual deviance: 381.31 on 417 degrees of freedom
## AIC: 391.31
##
## Number of Fisher Scoring iterations: 8
# b)
round(exp(glm.liver.M_2$coefficients),3)
## (Intercept) DB Alamine Age Alkphos
## 0.149 1.774 1.016 1.021 1.004
Conclusion: Comparing the results of exercises 1 and 2, adult females have fewer predictors significantly related to having liver disease than adult males do. Specifically, the adult female predictors are DB and Aspartate, while the adult male predictors are DB, Alamine, Age, and Alkphos.
Note: This fulfills the requirement of exercise 2D.
#### Adult Female Predictors: DB and Aspartate.
#### Adult Male Predictors: DB, Alamine, Age, and Alkphos.
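For reference, the selected predictors can also be read directly off the two fitted model objects used above; a minimal sketch:
# Predictors retained by stepwise selection, taken from the fitted model objects
setdiff(names(coef(step.model.1)), "(Intercept)")   # adult female model
setdiff(names(coef(glm.liver.M_2)), "(Intercept)")  # adult male model (refit without influential points)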
Using the sleep data (http://lib.stat.cmu.edu/datasets/sleep), consider finding the best logistic model for predicting the probability that a species' maximum lifespan will be at least 10 years. Consider all 6 variables as candidates (do not include species); two of them are index variables that are categorical in nature. Treat the two index variables as categorical variables (e.g. ignore the fact that they are ordinal).
Begin by loading the sleep.csv data into the environment as a data frame (d2).
Perform a basic inspection of the data frame.
Find and specify the best set of predictors in d2.
Fit the null and full logistic regression models.
Perform stepwise selection using the AIC criterion.
Conclusion: Based on the stepwise selection results, the best predictors are brainweight, totalsleep, sleepexposureindex, and predationindex.
Note: This fulfills the requirement of exercise 3A.
# 1
d2 = read.csv("sleep.csv", header = T)
# 2
str(d2)
## 'data.frame': 51 obs. of 8 variables:
## $ species : chr "African" "African" "Arctic F" "Asian el" ...
## $ bodyweight : num 6654 1 3.38 2547 10.55 ...
## $ brainweight : num 5712 6.6 44.5 4603 179.5 ...
## $ totalsleep : num 3.3 8.3 12.5 3.9 9.8 19.7 6.2 14.5 9.7 12.5 ...
## $ gestationtime : num 645 42 60 624 180 35 392 63 230 112 ...
## $ predationindex : int 3 3 1 3 4 1 4 1 1 5 ...
## $ sleepexposureindex: int 5 1 1 5 4 1 5 2 1 4 ...
## $ maxlife10 : int 1 0 1 1 1 1 1 1 1 0 ...
# 3
# 3a)
glm.null.sleep1 = glm(maxlife10 ~ 1, data = d2, family = "binomial")
glm.full.sleep1 = glm(maxlife10 ~ bodyweight + brainweight +
totalsleep + gestationtime +
as.factor(predationindex) +
as.factor(sleepexposureindex),
data = d2, family = "binomial")
# 3b)
(step.model.sleep1 = step(glm.null.sleep1, scope = list(upper=glm.full.sleep1),
direction="both",test="Chisq", trace = F))
##
## Call: glm(formula = maxlife10 ~ brainweight + totalsleep + as.factor(sleepexposureindex) +
## as.factor(predationindex), family = "binomial", data = d2)
##
## Coefficients:
## (Intercept) brainweight
## -6.60173 0.05101
## totalsleep as.factor(sleepexposureindex)2
## 0.42302 4.99755
## as.factor(sleepexposureindex)3 as.factor(sleepexposureindex)4
## 36.35898 33.70238
## as.factor(sleepexposureindex)5 as.factor(predationindex)2
## 73.40879 -2.53463
## as.factor(predationindex)3 as.factor(predationindex)4
## -25.11650 -18.26080
## as.factor(predationindex)5
## -52.63970
##
## Degrees of Freedom: 50 Total (i.e. Null); 40 Residual
## Null Deviance: 68.31
## Residual Deviance: 15.88 AIC: 37.88
summary(step.model.sleep1)
##
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + as.factor(sleepexposureindex) +
## as.factor(predationindex), family = "binomial", data = d2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.42528 -0.00004 0.00000 0.00013 2.37523
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.602e+00 4.864e+00 -1.357 0.1747
## brainweight 5.101e-02 5.084e-02 1.003 0.3157
## totalsleep 4.230e-01 2.647e-01 1.598 0.1100
## as.factor(sleepexposureindex)2 4.998e+00 2.559e+00 1.953 0.0508 .
## as.factor(sleepexposureindex)3 3.636e+01 9.624e+03 0.004 0.9970
## as.factor(sleepexposureindex)4 3.370e+01 1.037e+04 0.003 0.9974
## as.factor(sleepexposureindex)5 7.341e+01 1.262e+04 0.006 0.9954
## as.factor(predationindex)2 -2.535e+00 1.960e+00 -1.293 0.1960
## as.factor(predationindex)3 -2.512e+01 1.253e+04 -0.002 0.9984
## as.factor(predationindex)4 -1.826e+01 6.795e+03 -0.003 0.9979
## as.factor(predationindex)5 -5.264e+01 1.143e+04 -0.005 0.9963
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.31 on 50 degrees of freedom
## Residual deviance: 15.88 on 40 degrees of freedom
## AIC: 37.88
##
## Number of Fisher Scoring iterations: 20
We see from the results above that the best predictor is "sleepexposureindex": it has at least one level that is statistically significant with a p-value < 0.1. Specifically, sleepexposureindex level 2 is significant (p-value = 0.0508) relative to the reference level 1, while levels 3, 4, and 5 are not significant and show no detectable effect on the probability of the event. In other words, there is a significant relationship between having a maximum lifespan of at least 10 years and sleepexposureindex level 2.
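Note that several factor-level coefficients in the summary above have very large estimates and standard errors (on the order of 10^4), which usually signals quasi-complete separation for those levels. One quick way to look into this, not part of the original analysis, is to cross-tabulate the response against each index:
# Cross-tabulations of the response against the two index variables;
# a level whose cells are all 0 or all 1 for maxlife10 separates the outcome perfectly
with(d2, table(sleepexposureindex, maxlife10))
with(d2, table(predationindex, maxlife10))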
Perform Hosmer-Lemeshow's goodness-of-fit test.
Conclusion: The Hosmer-Lemeshow goodness-of-fit test gives a p-value of 0.5324, which is greater than 0.1. Therefore, we cannot reject the null hypothesis and conclude that the model is adequate, i.e. the fit is good.
Conclusion: The standardized deviance and Pearson residual plots show no large patterns for either Y = 1 or Y = 0, although there is some overlap between the two groups in the plots. Most observations fall between -2 and 2, so the assumption remains valid.
Conclusion: Using the Cook's distance cutoff of 0.25, two observations exceed the cutoff: observations 35 and 40.
Note: This fulfills the requirement of exercise 3B.
# 1
# Comment shown above.
# 2
hoslem.test(step.model.sleep1$y, fitted(step.model.sleep1), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: step.model.sleep1$y, fitted(step.model.sleep1)
## X-squared = 7.0397, df = 8, p-value = 0.5324
# 3
# a)
resid.d_3<-residuals(step.model.sleep1, type = "deviance")
resid.p_3<-residuals(step.model.sleep1, type = "pearson")
std.res.d_3<-residuals(step.model.sleep1, type = "deviance")/
sqrt(1 - hatvalues(step.model.sleep1))
std.res.p_3 <-residuals(step.model.sleep1, type = "pearson")/
sqrt(1 - hatvalues(step.model.sleep1))
par(mfrow=c(1,2))
plot(std.res.d_3[step.model.sleep1$model$maxlife10==0], col = "red",
ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d_3[step.model.sleep1$model$maxlife10==1], col = "blue")
plot(std.res.p_3[step.model.sleep1$model$maxlife10==0], col = "red",
ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p_3[step.model.sleep1$model$maxlife10==1], col = "blue")
# b)
plot(step.model.sleep1, which = 4, id.n = 5)
(inf.id_3=which(cooks.distance(step.model.sleep1)>0.25))
## 35 40
## 35 40
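Unlike exercise 2, the flagged observations (35 and 40) are not removed here. If one wanted to mirror that earlier approach, the selected model could be refit without them; a minimal sketch (the object name glm.sleep1.refit is introduced purely for illustration):
# Optional: refit the selected model without the flagged observations,
# mirroring the refit performed for the male liver model above
glm.sleep1.refit = glm(formula(step.model.sleep1), data = d2[-inf.id_3, ],
                       family = "binomial")
summary(glm.sleep1.refit)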
P(hat) = e^(-6.602 + 4.998*(sleepexposureindex2) + 36.36*(sleepexposureindex3) + 33.70*(sleepexposureindex4) + 73.41*(sleepexposureindex5)) / (1 + e^(-6.602 + 4.998*(sleepexposureindex2) + 36.36*(sleepexposureindex3) + 33.70*(sleepexposureindex4) + 73.41*(sleepexposureindex5))), where the brainweight, totalsleep, and predationindex terms of the fitted model are omitted for brevity.
Specifically, for sleepexposureindex level 2, the odds of reaching a maximum lifespan of at least 10 years are e^(4.99755) = 1.480500e+02 (about 148) times the odds for the reference level 1, holding the other predictors constant.
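Given how imprecise the other factor-level estimates are, it is also worth looking at the uncertainty around this particular odds ratio; the sketch below computes an approximate 95% Wald interval and is not part of the original output.
# Odds ratio and approximate 95% Wald interval for sleepexposureindex level 2
b  = coef(step.model.sleep1)["as.factor(sleepexposureindex)2"]
se = sqrt(diag(vcov(step.model.sleep1)))["as.factor(sleepexposureindex)2"]
round(exp(c(b, b - 1.96*se, b + 1.96*se)), 3)  # estimate, lower bound, upper bound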
Note: This fulfills the requirement of exercise 3C.
# 1
# a)
summary(step.model.sleep1)
##
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + as.factor(sleepexposureindex) +
## as.factor(predationindex), family = "binomial", data = d2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.42528 -0.00004 0.00000 0.00013 2.37523
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.602e+00 4.864e+00 -1.357 0.1747
## brainweight 5.101e-02 5.084e-02 1.003 0.3157
## totalsleep 4.230e-01 2.647e-01 1.598 0.1100
## as.factor(sleepexposureindex)2 4.998e+00 2.559e+00 1.953 0.0508 .
## as.factor(sleepexposureindex)3 3.636e+01 9.624e+03 0.004 0.9970
## as.factor(sleepexposureindex)4 3.370e+01 1.037e+04 0.003 0.9974
## as.factor(sleepexposureindex)5 7.341e+01 1.262e+04 0.006 0.9954
## as.factor(predationindex)2 -2.535e+00 1.960e+00 -1.293 0.1960
## as.factor(predationindex)3 -2.512e+01 1.253e+04 -0.002 0.9984
## as.factor(predationindex)4 -1.826e+01 6.795e+03 -0.003 0.9979
## as.factor(predationindex)5 -5.264e+01 1.143e+04 -0.005 0.9963
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.31 on 50 degrees of freedom
## Residual deviance: 15.88 on 40 degrees of freedom
## AIC: 37.88
##
## Number of Fisher Scoring iterations: 20
# b)
round(exp(step.model.sleep1$coefficients),3)
## (Intercept) brainweight
## 1.000000e-03 1.052000e+00
## totalsleep as.factor(sleepexposureindex)2
## 1.527000e+00 1.480500e+02
## as.factor(sleepexposureindex)3 as.factor(sleepexposureindex)4
## 6.173141e+15 4.332708e+14
## as.factor(sleepexposureindex)5 as.factor(predationindex)2
## 7.603846e+31 7.900000e-02
## as.factor(predationindex)3 as.factor(predationindex)4
## 0.000000e+00 0.000000e+00
## as.factor(predationindex)5
## 0.000000e+00
Using the same sleep data (http://lib.stat.cmu.edu/datasets/sleep), consider finding the best logistic model for predicting the probability that a species' maximum lifespan will be at least 10 years. Consider all 6 variables as candidates (do not include species); two of them are index variables that are categorical in nature. This time, treat the two index variables as continuous variables.
Fit the null and full logistic regression models.
Perform stepwise selection using the AIC criterion.
Conclusion: Based on the stepwise selection results, the best predictors are brainweight, totalsleep, sleepexposureindex, and predationindex.
Note: This fulfills the requirement of exercise 4A.
# 1
# 1a)
glm.null.sleep2 = glm(maxlife10 ~ 1, data = d2, family = "binomial")
glm.full.sleep2 = glm(maxlife10 ~ bodyweight + brainweight +
totalsleep + gestationtime +
predationindex +
sleepexposureindex,
data = d2, family = "binomial")
# 1b)
step.model.sleep2 = step(glm.null.sleep2, scope = list(upper=glm.full.sleep2),
direction="both",test="Chisq", trace = F)
summary(step.model.sleep2)
##
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + sleepexposureindex +
## predationindex, family = "binomial", data = d2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.82148 -0.04746 0.00000 0.05811 2.41681
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.16387 3.59301 -1.716 0.0863 .
## brainweight 0.06018 0.03544 1.698 0.0895 .
## totalsleep 0.35985 0.20995 1.714 0.0865 .
## sleepexposureindex 4.42111 1.97540 2.238 0.0252 *
## predationindex -3.36917 1.51823 -2.219 0.0265 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.310 on 50 degrees of freedom
## Residual deviance: 19.212 on 46 degrees of freedom
## AIC: 29.212
##
## Number of Fisher Scoring iterations: 11
We see that brainweight (p-value = 0.0895), totalsleep (p-value = 0.0865), sleepexposureindex (p-value = 0.0252), and predationindex (p-value = 0.0265) are all significant with p-values < 0.1. In other words, there is a significant relationship between having a maximum lifespan of at least 10 years and each of brainweight, totalsleep, sleepexposureindex, and predationindex.
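Since both sleep models are fit to the same response, their AIC values (37.88 for the categorical-index model above and 29.212 here) can also be pulled side by side directly from the fitted objects; a minimal sketch:
# Side-by-side AIC comparison of the categorical-index and continuous-index sleep models
AIC(step.model.sleep1, step.model.sleep2)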
Perform Hosmer-Lemeshow's goodness-of-fit test.
Conclusion: The Hosmer-Lemeshow goodness-of-fit test gives a p-value of 0.9937, which is greater than 0.1. Therefore, we cannot reject the null hypothesis and conclude that the model is adequate, i.e. the fit is good.
Conclusion: The standardized deviance and Pearson residual plots show no large patterns for either Y = 1 or Y = 0, although there is some overlap between the two groups in the plots. Most observations fall between -2 and 2, so the assumption remains valid.
Conclusion: Using the Cook's distance cutoff of 0.25, four observations exceed the cutoff: observations 10, 35, 40, and 50.
Note: This fulfills the requirement of exercise 4B.
# 1
# Comment shown above.
# 2
hoslem.test(step.model.sleep2$y, fitted(step.model.sleep2), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: step.model.sleep2$y, fitted(step.model.sleep2)
## X-squared = 1.4406, df = 8, p-value = 0.9937
# 3
# a)
resid.d_4<-residuals(step.model.sleep2, type = "deviance")
resid.p_4<-residuals(step.model.sleep2, type = "pearson")
std.res.d_4<-residuals(step.model.sleep2, type = "deviance")/
sqrt(1 - hatvalues(step.model.sleep2))
std.res.p_4 <-residuals(step.model.sleep2, type = "pearson")/
sqrt(1 - hatvalues(step.model.sleep2))
par(mfrow=c(1,2))
plot(std.res.d_4[step.model.sleep2$model$maxlife10==0], col = "red",
ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d_4[step.model.sleep2$model$maxlife10==1], col = "blue")
plot(std.res.p_4[step.model.sleep2$model$maxlife10==0], col = "red",
ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p_4[step.model.sleep2$model$maxlife10==1], col = "blue")
# b)
plot(step.model.sleep2, which = 4, id.n = 5)
(inf.id_4=which(cooks.distance(step.model.sleep2)>0.25))
## 10 35 40 50
## 10 35 40 50
Final Model: log(p/(1-p)) = -6.16387 + 0.06018*brainweight + 0.35985*totalsleep + 4.42111*sleepexposureindex - 3.36917*predationindex
Conclusion brainweight: Based on the odds ratio, the odds of a species achieving a maximum lifespan of at least 10 years increase by a factor of e^(0.06018) = 1.062 for each one-unit increase in brainweight, holding totalsleep, predationindex, and sleepexposureindex constant.
Conclusion totalsleep: Based on the odds ratio, the odds of a species achieving a maximum lifespan of at least 10 years increase by a factor of e^(0.35985) = 1.433 for each one-unit increase in totalsleep, holding brainweight, predationindex, and sleepexposureindex constant.
Conclusion sleepexposureindex: Based on the odds ratio, the odds of a species achieving a maximum lifespan of at least 10 years increase by a factor of e^(4.42111) = 83.188 for each one-unit increase in sleepexposureindex, holding brainweight, totalsleep, and predationindex constant.
Conclusion predationindex: Based on the odds ratio, the odds of a species achieving a maximum lifespan of at least 10 years are multiplied by a factor of e^(-3.36917) = 0.034 (i.e. they decrease) for each one-unit increase in predationindex, holding brainweight, totalsleep, and sleepexposureindex constant.
Conclusion Final: Overall, species with larger brainweight, more totalsleep, a higher sleep exposure index, and a lower predation index are estimated to have higher odds of achieving a maximum lifespan of at least 10 years.
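To illustrate how this final model would be used, the predicted probability for a hypothetical species can again be obtained with predict(); the predictor values below are illustrative only, not taken from the data.
# Predicted probability that a hypothetical species' maximum lifespan is at least 10 years
new.species = data.frame(brainweight = 50, totalsleep = 10,
                         sleepexposureindex = 2, predationindex = 2)  # illustrative values
predict(step.model.sleep2, newdata = new.species, type = "response")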
Note: This fulfills the requirement of exercise 4C.
# 1
# a)
summary(step.model.sleep2)
##
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + sleepexposureindex +
## predationindex, family = "binomial", data = d2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.82148 -0.04746 0.00000 0.05811 2.41681
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.16387 3.59301 -1.716 0.0863 .
## brainweight 0.06018 0.03544 1.698 0.0895 .
## totalsleep 0.35985 0.20995 1.714 0.0865 .
## sleepexposureindex 4.42111 1.97540 2.238 0.0252 *
## predationindex -3.36917 1.51823 -2.219 0.0265 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 68.310 on 50 degrees of freedom
## Residual deviance: 19.212 on 46 degrees of freedom
## AIC: 29.212
##
## Number of Fisher Scoring iterations: 11
# b)
round(exp(step.model.sleep2$coefficients),3)
## (Intercept) brainweight totalsleep sleepexposureindex
## 0.002 1.062 1.433 83.188
## predationindex
## 0.034