Exercise 1: The liver data set is a subset of the ILPD

(Indian Liver Patient Dataset) data set.

Exercise 1A: For only females in the data set, find and specify the best set

of predictors via stepwise selection with AIC criteria for a logistic

regression model predicting whether a female is a liver patient.

  1. Begin by loading the liver.csv data into the environment as a data frame (d1).

  2. Perform a basic inspection of the data frame.

  3. Find and specify the best set of predictors for the females in d1.

  a. Create a subset of d1 containing only rows with Gender = “Female”.

  b. Fit the null and full logistic regression models.

  c. Perform stepwise selection with the AIC criterion.

Conclusion: Based on the results above, the best set of predictors is “DB” and “Aspartate”.

Note: This fulfills the requirement of exercise 1A.

# 1
d1 = read.csv("liver.csv", header = T)

# 2 
str(d1)
## 'data.frame':    558 obs. of  10 variables:
##  $ Age         : int  65 62 62 58 72 46 26 29 55 57 ...
##  $ Gender      : chr  "Female" "Male" "Male" "Male" ...
##  $ TB          : num  0.7 10.9 7.3 1 3.9 1.8 0.9 0.9 0.7 0.6 ...
##  $ DB          : num  0.1 5.5 4.1 0.4 2 0.7 0.2 0.3 0.2 0.1 ...
##  $ Alkphos     : int  187 699 490 182 195 208 154 202 290 210 ...
##  $ Alamine     : int  16 64 60 14 27 19 16 14 53 51 ...
##  $ Aspartate   : int  18 100 68 20 59 14 12 11 58 59 ...
##  $ TP          : num  6.8 7.5 7 6.8 7.3 7.6 7 6.7 6.8 5.9 ...
##  $ ALB         : num  3.3 3.2 3.3 3.4 2.4 4.4 3.5 3.6 3.4 2.7 ...
##  $ LiverPatient: int  1 1 1 1 1 1 1 1 1 1 ...
# 3
# 3a)
d1_female = subset(d1, Gender == "Female")

# 3b)
glm.null.F = glm(LiverPatient ~ 1, data = d1_female, family = "binomial")
glm.full.F = glm(LiverPatient ~ Age+TB+DB+Alkphos+Alamine+Aspartate+TP+ALB, 
                  data = d1_female, family = "binomial")

# 3c)
(step.model.1 = step(glm.null.F, scope = list(upper=glm.full.F), 
                     direction="both",test="Chisq", trace = F)) 
## 
## Call:  glm(formula = LiverPatient ~ DB + Aspartate, family = "binomial", 
##     data = d1_female)
## 
## Coefficients:
## (Intercept)           DB    Aspartate  
##    -0.32480      0.94479      0.01106  
## 
## Degrees of Freedom: 134 Total (i.e. Null);  132 Residual
## Null Deviance:       175.7 
## Residual Deviance: 154.3     AIC: 160.3
summary(step.model.1)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Aspartate, family = "binomial", 
##     data = d1_female)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8178  -1.2223   0.4402   1.1091   1.2049  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.32480    0.31013  -1.047   0.2950  
## DB           0.94479    0.55808   1.693   0.0905 .
## Aspartate    0.01106    0.00616   1.796   0.0726 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 175.72  on 134  degrees of freedom
## Residual deviance: 154.27  on 132  degrees of freedom
## AIC: 160.27
## 
## Number of Fisher Scoring iterations: 7
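
As a quick sanity check of the selection, the AIC of the stepwise model can be compared with that of the full model; a small sketch, assuming the objects fitted above are still in the environment:

# The stepwise-selected model (DB + Aspartate) should have the lower AIC
AIC(glm.full.F, step.model.1)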

Exercise 1B: Comment on the significance of parameter estimates under

significance level alpha=0.1, what Hosmer-Lemeshow’s test tells us about

goodness of fit, and point out any issues with diagnostics by checking

residual plots and Cook's distance plot (with cut-off 0.25).

  1. We see that “DB” (p-value = 0.0905) and “Aspartate” (p-value = 0.0726) are significant at the 0.1 level, since both p-values are less than 0.1. In other words, there is a significant relationship between the probability of being a liver patient and DB, and likewise between that probability and Aspartate.

  2. Perform Hosmer-Lemeshow’s Goodness of fit Test

Conclusion: The Hosmer-Lemeshow goodness-of-fit test gives a p-value of 0.4579, which is greater than 0.1. We therefore fail to reject the null hypothesis and conclude that the model is adequate, i.e. the fit is good.

  1. Check for issues in the diagnostics using residual plots and the Cook's distance plot (with cut-off 0.25).
  1. Residual Diagnostics

Conclusion: The deviance and Pearson residual plots show no clear patterns for either Y = 1 or Y = 0, and all observations fall between -2 and 2, so the model assumptions appear valid.

  1. Influence Diagnostics

Conclusion: Using the Cook's distance cut-off of 0.25, no observation exceeds the criterion, so there are no influential observations.

Note: This fulfills the requirement of exercise 1B.

# 1
# Comment shown above.

# 2
library(ResourceSelection)  # provides hoslem.test()
hoslem.test(step.model.1$y, fitted(step.model.1), g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  step.model.1$y, fitted(step.model.1)
## X-squared = 7.7535, df = 8, p-value = 0.4579
# 3
# a)
resid.d<-residuals(step.model.1, type = "deviance")
resid.p<-residuals(step.model.1, type = "pearson")
std.res.d<-residuals(step.model.1, type = "deviance")/ 
  sqrt(1 - hatvalues(step.model.1)) 
std.res.p <-residuals(step.model.1, type = "pearson")/
  sqrt(1 - hatvalues(step.model.1)) 

par(mfrow=c(1,2))
plot(std.res.d[step.model.1$model$LiverPatient==0], col = "red", 
     ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d[step.model.1$model$LiverPatient==1], col = "blue")

plot(std.res.p[step.model.1$model$LiverPatient==0], col = "red", 
     ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p[step.model.1$model$LiverPatient==1], col = "blue")

# b)
plot(step.model.1, which = 4, id.n = 5)

(inf.id=which(cooks.distance(step.model.1)>0.25))
## named integer(0)
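
For reference, the Hosmer-Lemeshow statistic reported above can be reproduced by hand. The sketch below (assuming step.model.1 from above, and assuming the decile break points of the fitted probabilities are distinct) groups observations into g = 10 bins by fitted probability and compares observed with expected counts; hoslem.test() handles the binning details more carefully.

# Manual sketch of the Hosmer-Lemeshow statistic for step.model.1
p.hat <- fitted(step.model.1)
y     <- step.model.1$y
g     <- 10
bins  <- cut(p.hat, breaks = quantile(p.hat, probs = seq(0, 1, length.out = g + 1)),
             include.lowest = TRUE)
obs1  <- tapply(y, bins, sum)           # observed events per bin
exp1  <- tapply(p.hat, bins, sum)       # expected events per bin
obs0  <- tapply(1 - y, bins, sum)       # observed non-events per bin
exp0  <- tapply(1 - p.hat, bins, sum)   # expected non-events per bin
chisq <- sum((obs1 - exp1)^2 / exp1 + (obs0 - exp0)^2 / exp0)
pchisq(chisq, df = g - 2, lower.tail = FALSE)  # close to the hoslem.test p-value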

Exercise 1C: Interpret relationships between predictors in the final model

and the odds of an adult female being a liver patient.

(based on estimated Odds Ratio).

  1. Obtain Final Model and provide interpretation based on estimated Odds Ratio.
  a. Final Model: Log(p(hat) / (1 - p(hat))) = -0.32480 + 0.94479 * DB + 0.01106 * Aspartate

P(hat) = e^(-0.32480 + 0.94479 * DB + 0.01106 * Aspartate) / (1 + e^(-0.32480 + 0.94479 * DB + 0.01106 * Aspartate))

  b. Conclusion DB: Based on the odds ratio, the odds of an adult female being a liver patient increase by a factor of e^(0.94479) = 2.572 for each one-unit increase in DB, holding Aspartate constant.

Conclusion Aspartate: Based on the odds ratio, the odds of an adult female being a liver patient increase by a factor of e^(0.01106) = 1.011 for each one-unit increase in Aspartate, holding DB constant.

Conclusion Final: These positive relationships indicate that adult females with higher levels of DB or Aspartate have higher odds of being liver patients.

# 1
# a)
summary(step.model.1)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Aspartate, family = "binomial", 
##     data = d1_female)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8178  -1.2223   0.4402   1.1091   1.2049  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.32480    0.31013  -1.047   0.2950  
## DB           0.94479    0.55808   1.693   0.0905 .
## Aspartate    0.01106    0.00616   1.796   0.0726 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 175.72  on 134  degrees of freedom
## Residual deviance: 154.27  on 132  degrees of freedom
## AIC: 160.27
## 
## Number of Fisher Scoring iterations: 7
# b)
round(exp(step.model.1$coefficients),3)
## (Intercept)          DB   Aspartate 
##       0.723       2.572       1.011
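
To see what these odds ratios imply on the probability scale, the fitted log-odds can be converted to predicted probabilities with predict(); a small sketch, where the DB and Aspartate values are hypothetical and chosen only for illustration:

# Predicted probability of being a liver patient for two hypothetical females;
# the DB and Aspartate values below are illustrative, not taken from the data.
new.females <- data.frame(DB = c(0.2, 2.0), Aspartate = c(20, 100))
round(predict(step.model.1, newdata = new.females, type = "response"), 3)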

Exercise 2: The liver data set is a subset of the ILPD

(Indian Liver Patient Dataset) data set.

Exercise 2A: For only Males in the data set, find and specify the best set

of predictors via stepwise selection with AIC criteria for a logistic

regression model predicting whether a male is a liver patient.

  1. Find and specify the best set of predictors for the males in d1.
  a. Create a subset of d1 containing only rows with Gender = “Male”.

  b. Fit the null and full logistic regression models.

  c. Perform stepwise selection with the AIC criterion.

Conclusion: Based on the results above, the best set of predictors is “DB”, “Alamine”, “Age”, and “Alkphos”.

Note: This fulfills the requirement of exercise 2A.

# 1
# 1a)
d1_male = subset(d1, Gender == "Male")

# 1b)
glm.null.M = glm(LiverPatient ~ 1, data = d1_male, family = "binomial")
glm.full.M = glm(LiverPatient ~ Age+TB+DB+Alkphos+Alamine+Aspartate+TP+ALB, 
                  data = d1_male, family = "binomial")

# 1c)
(step.model.2 = step(glm.null.M, scope = list(upper=glm.full.M), 
                     direction="both",test="Chisq", trace = F)) 
## 
## Call:  glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial", 
##     data = d1_male)
## 
## Coefficients:
## (Intercept)           DB      Alamine          Age      Alkphos  
##    -1.47657      0.51250      0.01622      0.02062      0.00174  
## 
## Degrees of Freedom: 422 Total (i.e. Null);  418 Residual
## Null Deviance:       476.3 
## Residual Deviance: 395.1     AIC: 405.1
summary(step.model.2)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial", 
##     data = d1_male)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3405  -0.5170   0.3978   0.8614   1.3756  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -1.476570   0.481336  -3.068  0.00216 **
## DB           0.512503   0.176066   2.911  0.00360 **
## Alamine      0.016218   0.005239   3.095  0.00197 **
## Age          0.020616   0.008095   2.547  0.01087 * 
## Alkphos      0.001740   0.001058   1.645  0.09992 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 476.28  on 422  degrees of freedom
## Residual deviance: 395.05  on 418  degrees of freedom
## AIC: 405.05
## 
## Number of Fisher Scoring iterations: 7

Exercise 2B: Comment on the significance of parameter estimates under

significance level alpha=0.1, what Hosmer-Lemeshow’s test tells us about

goodness of fit, and point out any issues with diagnostics by checking

residual plots and Cook's distance plot (with cut-off 0.25).

  1. We see that “DB” (p-value = 0.00360), “Alamine” (p-value = 0.00197), “Age” (p-value = 0.01087), and “Alkphos” (p-value = 0.09992) are all significant at the 0.1 level, since their p-values are less than 0.1. In other words, there is a significant relationship between the probability of being a liver patient and each of DB, Alamine, Age, and Alkphos.

  2. Perform Hosmer-Lemeshow’s Goodness of fit Test

Conclusion: The Hosmer-Lemeshow goodness-of-fit test gives a p-value of 0.532, which is greater than 0.1. We therefore fail to reject the null hypothesis and conclude that the model is adequate, i.e. the fit is good; however, this result is based on a fit that still contains influential points.

  1. Check for issues in the diagnostics using residual plots and the Cook's distance plot (with cut-off 0.25).
  1. Residual Diagnostics

Conclusion: The deviance and Pearson residual plots show no clear patterns for either Y = 1 or Y = 0, and most observations fall between -2 and 2, so the model assumptions appear valid.

  1. Influence Diagnostics

Conclusion: Using the Cook's distance cut-off of 0.25, two observations exceed the criterion: observations 111 and 86.

  1. Refit the model without influential observations
  1. Re-check the diagnostics using residual plots and the Cook's distance plot (with cut-off 0.25).

  2. Perform Hosmer-Lemeshow’s Goodness of fit Test

  1. Residual Diagnostics

Conclusion: After refitting, the deviance and Pearson residual plots show no clear patterns for either Y = 1 or Y = 0, and most observations fall between -2 and 2, so the model assumptions remain valid.

  1. Influence Diagnostics

Conclusion: Using the Cook's distance cut-off of 0.25, no observation in the refitted model exceeds the criterion.

  1. We do not need to refit the model any further as we have removed all influential observations.

Conclusion: The Hosmer-Lemeshow goodness-of-fit test on the refitted model gives a p-value of 0.4669, which is greater than 0.1. We therefore fail to reject the null hypothesis and conclude that the refitted model is still adequate, i.e. the fit is good.

Note: This fulfills the requirement of exercise 2B.

# 1
# Comment shown above.

# 2
hoslem.test(step.model.2$y, fitted(step.model.2), g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  step.model.2$y, fitted(step.model.2)
## X-squared = 7.043, df = 8, p-value = 0.532
# 3
# a)
resid.d_2<-residuals(step.model.2, type = "deviance")
resid.p_2<-residuals(step.model.2, type = "pearson")
std.res.d_2<-residuals(step.model.2, type = "deviance")/ 
  sqrt(1 - hatvalues(step.model.2)) 
std.res.p_2 <-residuals(step.model.2, type = "pearson")/
  sqrt(1 - hatvalues(step.model.2)) 

par(mfrow=c(1,2))
plot(std.res.d_2[step.model.2$model$LiverPatient==0], col = "red", 
     ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d_2[step.model.2$model$LiverPatient==1], col = "blue")

plot(std.res.p_2[step.model.2$model$LiverPatient==0], col = "red", 
     ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p_2[step.model.2$model$LiverPatient==1], col = "blue")

# b)
plot(step.model.2, which = 4, id.n = 5)

(inf.id_2=which(cooks.distance(step.model.2)>0.25))
## 111 
##  86
# c)
glm.liver.M_2 = glm(LiverPatient ~ DB + Alamine + Age + Alkphos, 
                    data = d1_male[-inf.id_2, ], family = "binomial")
summary(glm.liver.M_2)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial", 
##     data = d1_male[-inf.id_2, ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5166   0.0000   0.3301   0.8648   1.4696  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.902754   0.527386  -3.608 0.000309 ***
## DB           0.573104   0.198893   2.881 0.003958 ** 
## Alamine      0.015850   0.005466   2.900 0.003737 ** 
## Age          0.020418   0.008210   2.487 0.012883 *  
## Alkphos      0.003744   0.001477   2.534 0.011262 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 473.51  on 421  degrees of freedom
## Residual deviance: 381.31  on 417  degrees of freedom
## AIC: 391.31
## 
## Number of Fisher Scoring iterations: 8
# 4 Recheck diagnostics after removal of influential points
plot(glm.liver.M_2, which = 4, id.n = 5)

(inf.id_z=which(cooks.distance(glm.liver.M_2)>0.25))
## named integer(0)
# 5 Refitted model goodness of fit check
hoslem.test(glm.liver.M_2$y, fitted(glm.liver.M_2), g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  glm.liver.M_2$y, fitted(glm.liver.M_2)
## X-squared = 7.6642, df = 8, p-value = 0.4669

Exercise 2C: Interpret relationships between predictors in the final model

and the odds of an adult male being a liver patient.

(based on estimated Odds Ratio).

  1. Obtain Final Model and provide interpretation based on estimated Odds Ratio.
  1. Final Model: Log(p(hat) / (1 - p(hat))) = -1.902754 + 0.573104 * DB + 0.015850 * Alamine + 0.020418 * Age + 0.003744 * Alkphos

  2. Conclusion DB: Based on the odds ratio, the odds of an adult male being a liver patient increase by a factor of e^(0.573104) = 1.774 for each one-unit increase in DB, holding Alamine, Age, and Alkphos constant.

Conclusion Alamine: Based on the odds ratio, the odds of an adult male being a liver patient increase by a factor of e^(0.015850) = 1.016 for each one-unit increase in Alamine, holding DB, Age, and Alkphos constant.

Conclusion Age: Based on the odds ratio, the odds of an adult male being a liver patient increase by a factor of e^(0.020418) = 1.021 for each one-year increase in Age, holding DB, Alamine, and Alkphos constant.

Conclusion Alkphos: Based on the odds ratio, the odds of an adult male being a liver patient increase by a factor of e^(0.003744) = 1.004 for each one-unit increase in Alkphos, holding DB, Alamine, and Age constant.

Conclusion Final: These positive relationships indicate that adult males with higher levels of DB, Alamine, or Alkphos, or who are older, have higher odds of being liver patients.

Note: This fulfills the requirement of exercise 2C.

# 1
# a)
summary(glm.liver.M_2)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial", 
##     data = d1_male[-inf.id_2, ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5166   0.0000   0.3301   0.8648   1.4696  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.902754   0.527386  -3.608 0.000309 ***
## DB           0.573104   0.198893   2.881 0.003958 ** 
## Alamine      0.015850   0.005466   2.900 0.003737 ** 
## Age          0.020418   0.008210   2.487 0.012883 *  
## Alkphos      0.003744   0.001477   2.534 0.011262 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 473.51  on 421  degrees of freedom
## Residual deviance: 381.31  on 417  degrees of freedom
## AIC: 391.31
## 
## Number of Fisher Scoring iterations: 8
# b)
round(exp(glm.liver.M_2$coefficients),3)
## (Intercept)          DB     Alamine         Age     Alkphos 
##       0.149       1.774       1.016       1.021       1.004
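
As a supplement to these point estimates, approximate 95% Wald confidence intervals for the odds ratios can be obtained; a quick sketch using confint.default() (Wald rather than profile-likelihood intervals):

# Wald 95% confidence intervals for the odds ratios of the refitted male model
round(exp(confint.default(glm.liver.M_2)), 3)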

Exercise 2D: Comment on how the models for adult females and

adult males differ.

Conclusion: Based on the results of Exercises 1 and 2, the model for adult females has fewer predictors significantly related to being a liver patient than the model for adult males. Adult female predictors: DB and Aspartate; adult male predictors: DB, Alamine, Age, and Alkphos.

Note: This fulfills the requirement of exercise 2D.

#### Adult Female Predictors: DB and Aspartate.

#### Adult Male Predictors: DB, Alamine, Age, and Alkphos.
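
The two final models can also be compared programmatically; a small sketch, assuming step.model.1 (females) and glm.liver.M_2 (males, refitted without influential points) are still in the environment:

# List the selected predictors and the estimated odds ratios for each model
predictors <- list(
  Female = setdiff(names(coef(step.model.1)), "(Intercept)"),
  Male   = setdiff(names(coef(glm.liver.M_2)), "(Intercept)")
)
predictors
lapply(list(Female = step.model.1, Male = glm.liver.M_2),
       function(m) round(exp(coef(m)), 3))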

Exercise 3: Use the sleep data set, which originates from

http://lib.stat.cmu.edu/datasets/sleep. Consider finding the best logistic model for predicting the probability that a species’ maximum lifespan will be at least 10 years. Consider all 6 variables as candidates (do not include species); two of them are index variables that are categorical in nature. Treat the two index variables as categorical variables (e.g. ignore the fact that they are ordinal).

Exercise 3A: First find and specify the best set of predictors via

stepwise selection with AIC criteria.

  1. Begin by loading the sleep.csv data into the environment as a data frame (d2).

  2. Perform a basic inspection of the data frame.

  3. Find and specify the best set of predictors in d2.

  a. Fit the null and full logistic regression models.

  b. Perform stepwise selection with the AIC criterion.

Conclusion: Based on the results above, the best set of predictors is brainweight, totalsleep, sleepexposureindex (as a factor), and predationindex (as a factor).

Note: This fulfills the requirement of exercise 3A.

# 1
d2 = read.csv("sleep.csv", header = T)

# 2 
str(d2)
## 'data.frame':    51 obs. of  8 variables:
##  $ species           : chr  "African" "African" "Arctic F" "Asian el" ...
##  $ bodyweight        : num  6654 1 3.38 2547 10.55 ...
##  $ brainweight       : num  5712 6.6 44.5 4603 179.5 ...
##  $ totalsleep        : num  3.3 8.3 12.5 3.9 9.8 19.7 6.2 14.5 9.7 12.5 ...
##  $ gestationtime     : num  645 42 60 624 180 35 392 63 230 112 ...
##  $ predationindex    : int  3 3 1 3 4 1 4 1 1 5 ...
##  $ sleepexposureindex: int  5 1 1 5 4 1 5 2 1 4 ...
##  $ maxlife10         : int  1 0 1 1 1 1 1 1 1 0 ...
# 3
# 3a)
glm.null.sleep1 = glm(maxlife10 ~ 1, data = d2, family = "binomial")
glm.full.sleep1 = glm(maxlife10 ~ bodyweight + brainweight + 
                           totalsleep + gestationtime + 
                           as.factor(predationindex) + 
                           as.factor(sleepexposureindex), 
                         data = d2, family = "binomial")

# 3b)
(step.model.sleep1 = step(glm.null.sleep1, scope = list(upper=glm.full.sleep1),
     direction="both",test="Chisq", trace = F))
## 
## Call:  glm(formula = maxlife10 ~ brainweight + totalsleep + as.factor(sleepexposureindex) + 
##     as.factor(predationindex), family = "binomial", data = d2)
## 
## Coefficients:
##                    (Intercept)                     brainweight  
##                       -6.60173                         0.05101  
##                     totalsleep  as.factor(sleepexposureindex)2  
##                        0.42302                         4.99755  
## as.factor(sleepexposureindex)3  as.factor(sleepexposureindex)4  
##                       36.35898                        33.70238  
## as.factor(sleepexposureindex)5      as.factor(predationindex)2  
##                       73.40879                        -2.53463  
##     as.factor(predationindex)3      as.factor(predationindex)4  
##                      -25.11650                       -18.26080  
##     as.factor(predationindex)5  
##                      -52.63970  
## 
## Degrees of Freedom: 50 Total (i.e. Null);  40 Residual
## Null Deviance:       68.31 
## Residual Deviance: 15.88     AIC: 37.88
summary(step.model.sleep1)
## 
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + as.factor(sleepexposureindex) + 
##     as.factor(predationindex), family = "binomial", data = d2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.42528  -0.00004   0.00000   0.00013   2.37523  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                    -6.602e+00  4.864e+00  -1.357   0.1747  
## brainweight                     5.101e-02  5.084e-02   1.003   0.3157  
## totalsleep                      4.230e-01  2.647e-01   1.598   0.1100  
## as.factor(sleepexposureindex)2  4.998e+00  2.559e+00   1.953   0.0508 .
## as.factor(sleepexposureindex)3  3.636e+01  9.624e+03   0.004   0.9970  
## as.factor(sleepexposureindex)4  3.370e+01  1.037e+04   0.003   0.9974  
## as.factor(sleepexposureindex)5  7.341e+01  1.262e+04   0.006   0.9954  
## as.factor(predationindex)2     -2.535e+00  1.960e+00  -1.293   0.1960  
## as.factor(predationindex)3     -2.512e+01  1.253e+04  -0.002   0.9984  
## as.factor(predationindex)4     -1.826e+01  6.795e+03  -0.003   0.9979  
## as.factor(predationindex)5     -5.264e+01  1.143e+04  -0.005   0.9963  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.31  on 50  degrees of freedom
## Residual deviance: 15.88  on 40  degrees of freedom
## AIC: 37.88
## 
## Number of Fisher Scoring iterations: 20
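
To make the categorical treatment concrete, the sketch below (assuming d2 from above) shows how as.factor() expands the two index variables into 0/1 dummy columns, with level 1 of each index serving as the reference level:

# First few rows of the design matrix used by the selected model;
# each non-reference level of the two indices gets its own indicator column.
head(model.matrix(~ brainweight + totalsleep +
                    as.factor(sleepexposureindex) + as.factor(predationindex),
                  data = d2))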

Exercise 3B: Comment on the significance of parameter estimates under

significance level alpha=0.1, what Hosmer-Lemeshow’s test tells us about

goodness of fit, and point out any issues with diagnostics by checking

residual plots and Cook's distance plot.

  1. We see that only “sleepexposureindex” has a level that is statistically significant at the 0.1 level: sleepexposureindex level 2 (p-value = 0.0508). The remaining sleepexposureindex levels (3, 4, and 5), the predationindex levels, brainweight, and totalsleep are not significant (level 1 of each index is the reference level and therefore has no coefficient). In other words, there is a significant relationship between having a maximum lifespan of at least 10 years and sleepexposureindex level 2, relative to level 1.

  2. Perform Hosmer-Lemeshow’s Goodness of fit Test

Conclusion: The Hosmer-Lemeshow goodness-of-fit test gives a p-value of 0.5324, which is greater than 0.1. We therefore fail to reject the null hypothesis and conclude that the model is adequate, i.e. the fit is good.

  1. Check for issues in the diagnostics using residual plots and the Cook's distance plot (with cut-off 0.25).
  1. Residual Diagnostics

Conclusion: The deviance and Pearson residual plots show no clear patterns for either Y = 1 or Y = 0, although there is some overlap between the two groups. Nearly all observations fall between -2 and 2, so the model assumptions appear valid.

  1. Influence Diagnostics

Conclusion: Using the Cook's distance cut-off of 0.25, two observations exceed the criterion: observations 35 and 40.

Note: This fulfills the requirement of exercise 3B.

# 1
# Comment shown above.

# 2
hoslem.test(step.model.sleep1$y, fitted(step.model.sleep1), g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  step.model.sleep1$y, fitted(step.model.sleep1)
## X-squared = 7.0397, df = 8, p-value = 0.5324
# 3
# a)
resid.d_3<-residuals(step.model.sleep1, type = "deviance")
resid.p_3<-residuals(step.model.sleep1, type = "pearson")
std.res.d_3<-residuals(step.model.sleep1, type = "deviance")/ 
  sqrt(1 - hatvalues(step.model.sleep1)) 
std.res.p_3 <-residuals(step.model.sleep1, type = "pearson")/
  sqrt(1 - hatvalues(step.model.sleep1)) 

par(mfrow=c(1,2))
plot(std.res.d_3[step.model.sleep1$model$maxlife10==0], col = "red", 
     ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d_3[step.model.sleep1$model$maxlife10==1], col = "blue")

plot(std.res.p_3[step.model.sleep1$model$maxlife10==0], col = "red", 
     ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p_3[step.model.sleep1$model$maxlife10==1], col = "blue")

# b)
plot(step.model.sleep1, which = 4, id.n = 5)

(inf.id_3=which(cooks.distance(step.model.sleep1)>0.25))
## 35 40 
## 35 40

Exercise 3C: Interpret what the model tells us about relationships

between the predictors and the odds of a species’ maximum lifespan

being at least 10 years.

  1. Obtain Final Model and provide interpretation based on estimated Odds Ratio.
  1. Final Model: Log(p(hat) / (1 - p(hat))) = -6.60173 + 0.05101 * brainweight + 0.42302 * totalsleep + 4.99755 * (sleepexposureindex2) + 36.35898 * (sleepexposureindex3) + 33.70238 * (sleepexposureindex4) + 73.40879 * (sleepexposureindex5) - 2.53463 * (predationindex2) - 25.11650 * (predationindex3) - 18.26080 * (predationindex4) - 52.63970 * (predationindex5)

P(hat) = e^(linear predictor) / (1 + e^(linear predictor)), where the linear predictor is the right-hand side of the equation above.

  1. Conclusion Final (significant predictors): Based on the above, the comparison groups sleepexposureindex 2, 3, 4, and 5 all have increased estimated odds of reaching a 10-year maximum lifespan compared with sleepexposureindex level 1, the reference group.

Specifically, for sleepexposureindex level 2 the estimated odds of reaching a 10-year maximum lifespan are about 148.05 times (1.4805e+02) those of sleepexposureindex level 1.

Note: This fulfills the requirement of exercise 3C.

# 1
# a)
summary(step.model.sleep1)
## 
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + as.factor(sleepexposureindex) + 
##     as.factor(predationindex), family = "binomial", data = d2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.42528  -0.00004   0.00000   0.00013   2.37523  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                    -6.602e+00  4.864e+00  -1.357   0.1747  
## brainweight                     5.101e-02  5.084e-02   1.003   0.3157  
## totalsleep                      4.230e-01  2.647e-01   1.598   0.1100  
## as.factor(sleepexposureindex)2  4.998e+00  2.559e+00   1.953   0.0508 .
## as.factor(sleepexposureindex)3  3.636e+01  9.624e+03   0.004   0.9970  
## as.factor(sleepexposureindex)4  3.370e+01  1.037e+04   0.003   0.9974  
## as.factor(sleepexposureindex)5  7.341e+01  1.262e+04   0.006   0.9954  
## as.factor(predationindex)2     -2.535e+00  1.960e+00  -1.293   0.1960  
## as.factor(predationindex)3     -2.512e+01  1.253e+04  -0.002   0.9984  
## as.factor(predationindex)4     -1.826e+01  6.795e+03  -0.003   0.9979  
## as.factor(predationindex)5     -5.264e+01  1.143e+04  -0.005   0.9963  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.31  on 50  degrees of freedom
## Residual deviance: 15.88  on 40  degrees of freedom
## AIC: 37.88
## 
## Number of Fisher Scoring iterations: 20
# b)
round(exp(step.model.sleep1$coefficients),3)
##                    (Intercept)                    brainweight 
##                   1.000000e-03                   1.052000e+00 
##                     totalsleep as.factor(sleepexposureindex)2 
##                   1.527000e+00                   1.480500e+02 
## as.factor(sleepexposureindex)3 as.factor(sleepexposureindex)4 
##                   6.173141e+15                   4.332708e+14 
## as.factor(sleepexposureindex)5     as.factor(predationindex)2 
##                   7.603846e+31                   7.900000e-02 
##     as.factor(predationindex)3     as.factor(predationindex)4 
##                   0.000000e+00                   0.000000e+00 
##     as.factor(predationindex)5 
##                   0.000000e+00
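
The coefficients for sleepexposureindex levels 3-5 and predationindex levels 3-5 have very large standard errors in the summary above, so their odds ratios should not be over-interpreted; a quick sketch of approximate 95% Wald intervals (via confint.default()) makes this visible, as those intervals are essentially unbounded:

# Wald 95% confidence intervals for the odds ratios of the categorical-index model
round(exp(confint.default(step.model.sleep1)), 3)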

Exercise 4: Use the sleep data set, which originates from

http://lib.stat.cmu.edu/datasets/sleep. Consider finding the best logistic model for predicting the probability that a species’ maximum lifespan will be at least 10 years. Consider all 6 variables as candidates (do not include species); two of them are index variables that are categorical in nature. Treat the two index variables as continuous variables.

Exercise 4A: First find and specify the best set of predictors via

stepwise selection with AIC criteria.

  1. Find and specify the best set of predictors in d2.
  a. Fit the null and full logistic regression models.

  b. Perform stepwise selection with the AIC criterion.

Conclusion: Based on the results above, the best set of predictors is brainweight, totalsleep, sleepexposureindex, and predationindex, with the two indices now treated as continuous variables.

Note: This fulfills the requirement of exercise 4A.

# 1
# 1a)
glm.null.sleep2 = glm(maxlife10 ~ 1, data = d2, family = "binomial")
glm.full.sleep2 = glm(maxlife10 ~ bodyweight + brainweight + 
                           totalsleep + gestationtime + 
                           predationindex + 
                           sleepexposureindex, 
                         data = d2, family = "binomial")

# 1b)
step.model.sleep2 = step(glm.null.sleep2, scope = list(upper=glm.full.sleep2),
     direction="both",test="Chisq", trace = F)

summary(step.model.sleep2)
## 
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + sleepexposureindex + 
##     predationindex, family = "binomial", data = d2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.82148  -0.04746   0.00000   0.05811   2.41681  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)  
## (Intercept)        -6.16387    3.59301  -1.716   0.0863 .
## brainweight         0.06018    0.03544   1.698   0.0895 .
## totalsleep          0.35985    0.20995   1.714   0.0865 .
## sleepexposureindex  4.42111    1.97540   2.238   0.0252 *
## predationindex     -3.36917    1.51823  -2.219   0.0265 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.310  on 50  degrees of freedom
## Residual deviance: 19.212  on 46  degrees of freedom
## AIC: 29.212
## 
## Number of Fisher Scoring iterations: 11
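
Since the only difference from Exercise 3 is whether the two indices are treated as factors or as numeric variables, the two selected models can be compared directly on AIC; a quick sketch, assuming both stepwise models from above are in the environment:

# Compare the categorical-index model (Exercise 3) with the continuous-index
# model (Exercise 4); the continuous version has the lower AIC (29.21 vs 37.88).
AIC(step.model.sleep1, step.model.sleep2)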

Exercise 4B: Comment on the significance of parameter estimates under

significance level alpha=0.1, what Hosmer-Lemeshow’s test tells us about

goodness of fit, and point out any issues with diagnostics by checking

residual plots and Cook's distance plot.

  1. We see that brainweight (p-value = 0.0895), totalsleep (p-value = 0.0865), sleepexposureindex (p-value = 0.0252), and predationindex (p-value = 0.0265) are all significant at the 0.1 level, since their p-values are less than 0.1. In other words, there is a significant relationship between having a maximum lifespan of at least 10 years and each of brainweight, totalsleep, sleepexposureindex, and predationindex.

  2. Perform Hosmer-Lemeshow’s Goodness of fit Test

Conclusion: The Hosmer-Lemeshow goodness-of-fit test gives a p-value of 0.9937, which is greater than 0.1. We therefore fail to reject the null hypothesis and conclude that the model is adequate, i.e. the fit is good.

  1. Check for issues in the diagnostics using residual plots and the Cook's distance plot (with cut-off 0.25).
  1. Residual Diagnostics

Conclusion: The deviance and Pearson residual plots show no clear patterns for either Y = 1 or Y = 0, although there is some overlap between the two groups. Nearly all observations fall between -2 and 2, so the model assumptions appear valid.

  1. Influence Diagnostics

Conclusion: Using the Cook's distance cut-off of 0.25, four observations exceed the criterion: observations 10, 35, 40, and 50.

Note: This fulfills the requirement of exercise 4B.

# 1
# Comment shown above.

# 2
hoslem.test(step.model.sleep2$y, fitted(step.model.sleep2), g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  step.model.sleep2$y, fitted(step.model.sleep2)
## X-squared = 1.4406, df = 8, p-value = 0.9937
# 3
# a)
resid.d_4<-residuals(step.model.sleep2, type = "deviance")
resid.p_4<-residuals(step.model.sleep2, type = "pearson")
std.res.d_4<-residuals(step.model.sleep2, type = "deviance")/ 
  sqrt(1 - hatvalues(step.model.sleep2)) 
std.res.p_4 <-residuals(step.model.sleep2, type = "pearson")/
  sqrt(1 - hatvalues(step.model.sleep2)) 

par(mfrow=c(1,2))
plot(std.res.d_4[step.model.sleep2$model$maxlife10==0], col = "red", 
     ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(std.res.d_4[step.model.sleep2$model$maxlife10==1], col = "blue")

plot(std.res.p_4[step.model.sleep2$model$maxlife10==0], col = "red", 
     ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(std.res.p_4[step.model.sleep2$model$maxlife10==1], col = "blue")

# b)
plot(step.model.sleep2, which = 4, id.n = 5)

(inf.id_4=which(cooks.distance(step.model.sleep2)>0.25))
## 10 35 40 50 
## 10 35 40 50

Exercise 4C: Interpret what the model tells us about relationships

between the predictors and the odds of a species’ maximum lifespan

being at least 10 years.

  1. Obtain Final Model and provide interpretation based on estimated Odds Ratio.
  1. Final Model: Log(p(hat) / (1 - p(hat))) = -6.16387 + 0.06018 * brainweight + 0.35985 * totalsleep + 4.42111 * sleepexposureindex - 3.36917 * predationindex

  2. Conclusion brainweight: Based on the odds ratio, the odds of a species achieving a maximum lifespan of at least 10 years increase by a factor of e^(0.06018) = 1.062 for each one-unit increase in brainweight, holding totalsleep, predationindex, and sleepexposureindex constant.

Conclusion totalsleep: Based on the odds ratio, the odds of a species achieving a maximum lifespan of at least 10 years increase by a factor of e^(0.35985) = 1.433 for each one-unit increase in totalsleep, holding brainweight, predationindex, and sleepexposureindex constant.

Conclusion sleepexposureindex: Based on the odds ratio, the odds of a species achieving a maximum lifespan of at least 10 years increase by a factor of e^(4.42111) = 83.188 for each one-unit increase in sleepexposureindex, holding brainweight, totalsleep, and predationindex constant.

Conclusion predationindex: Based on the odds ratio, the odds of a species achieving a maximum lifespan of at least 10 years decrease by a factor of e^(-3.36917) = 0.034 for each one-unit increase in predationindex, holding brainweight, totalsleep, and sleepexposureindex constant.

Conclusion Final: Taken together, species with larger brainweight, more totalsleep, a higher sleepexposureindex, and a lower predationindex have higher odds of achieving a maximum lifespan of at least 10 years.

Note: This fulfills the requirement of exercise 4C.

# 1
# a)
summary(step.model.sleep2)
## 
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + sleepexposureindex + 
##     predationindex, family = "binomial", data = d2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.82148  -0.04746   0.00000   0.05811   2.41681  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)  
## (Intercept)        -6.16387    3.59301  -1.716   0.0863 .
## brainweight         0.06018    0.03544   1.698   0.0895 .
## totalsleep          0.35985    0.20995   1.714   0.0865 .
## sleepexposureindex  4.42111    1.97540   2.238   0.0252 *
## predationindex     -3.36917    1.51823  -2.219   0.0265 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.310  on 50  degrees of freedom
## Residual deviance: 19.212  on 46  degrees of freedom
## AIC: 29.212
## 
## Number of Fisher Scoring iterations: 11
# b)
round(exp(step.model.sleep2$coefficients),3)
##        (Intercept)        brainweight         totalsleep sleepexposureindex 
##              0.002              1.062              1.433             83.188 
##     predationindex 
##              0.034
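
To illustrate what these odds ratios mean on the probability scale, the short sketch below computes predicted probabilities for two hypothetical species; all predictor values are illustrative and not taken from the data:

# Predicted probability that maximum lifespan is at least 10 years for two
# hypothetical species (illustrative predictor values only).
new.species <- data.frame(brainweight = c(10, 200), totalsleep = c(8, 12),
                          sleepexposureindex = c(1, 4), predationindex = c(4, 2))
round(predict(step.model.sleep2, newdata = new.species, type = "response"), 3)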