Exercise 1:

The liver data set is a subset of the ILPD (Indian Liver Patient Dataset) data set. It contains the first 10 variables described on the UCI Machine Learning Repository and a LiverPatient variable for adults in the data set. LiverPatient indicates whether or not the individual is a liver patient: people with active liver disease are coded as LiverPatient=1 and people without disease are coded as LiverPatient=0. Adults here are defined to be individuals who are at least 18 years of age. It is possible that there will be different significant predictors of being a liver patient for adult females and adult males.

Exercise 1A:

For only females in the data set, find and specify the best set of predictors via stepwise selection with AIC criteria for a logistic regression model predicting whether a female is a liver patient. NOTE: Specifying the full model using “LiverPatient~., data=…” will give an error message (probably because a factor has only one level, Female, in this subset). It is suggested to type all variables manually for the full model.

Fit Logistic Regression Model (Female)

Stepwise Selection (AIC)

## 
## Call:
## glm(formula = LiverPatient ~ DB + Aspartate, family = "binomial", 
##     data = liverF)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8178  -1.2223   0.4402   1.1091   1.2049  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.32480    0.31013  -1.047   0.2950  
## DB           0.94479    0.55808   1.693   0.0905 .
## Aspartate    0.01106    0.00616   1.796   0.0726 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 175.72  on 134  degrees of freedom
## Residual deviance: 154.27  on 132  degrees of freedom
## AIC: 160.27
## 
## Number of Fisher Scoring iterations: 7

Conclusion: Stepwise selection retains DB and Aspartate. Both are significant at the 0.1 level (p = 0.0905 and p = 0.0726), although neither would be significant at the 0.05 level.
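The selection step above could be reproduced along the following lines. This is a minimal sketch on simulated data, because the liver data file is not part of this report: the data frame `sim`, the simulated coefficients, and the three-variable candidate set are assumptions made for illustration, and only the variable names are borrowed from the output above.

```r
# Simulated stand-in for the female subset (values are made up for illustration)
set.seed(1)
n <- 135
sim <- data.frame(
  DB        = rexp(n, rate = 2),
  Aspartate = rexp(n, rate = 1 / 50),
  Age       = round(runif(n, 18, 80))
)
sim$LiverPatient <- rbinom(n, 1, plogis(-0.3 + 0.9 * sim$DB + 0.01 * sim$Aspartate))

# Full model with every candidate typed out explicitly (no "~ ." shortcut)
full.sim <- glm(LiverPatient ~ DB + Aspartate + Age, data = sim, family = "binomial")
null.sim <- glm(LiverPatient ~ 1, data = sim, family = "binomial")

# Forward-backward stepwise search between the null and full models, scored by AIC
step.sim <- step(null.sim,
                 scope = list(lower = formula(null.sim), upper = formula(full.sim)),
                 direction = "both", trace = 0)
summary(step.sim)
```

Typing the candidates out, as the exercise note suggests, avoids passing a one-level factor to glm().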

Exercise 1B:

Comment on the significance of parameter estimates under significance level alpha=0.1, what Hosmer-Lemeshow’s test tells us about goodness of fit, and point out any issues with diagnostics by checking residual plots and cook’s distance plot (with cut-off 0.25).

  • Significance of Parameter Estimates: DB has a p-value of 0.0905, which is below the significance level of 0.1, so there is a significant relationship between DB and whether an adult female is a liver patient. Aspartate has a p-value of 0.0726, which is also below 0.1, so there is likewise a significant relationship between Aspartate and whether an adult female is a liver patient.

Hosmer-Lemeshow Test

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  step.model.1$y, fitted(step.model.1)
## X-squared = 7.7535, df = 8, p-value = 0.4579

Goodness of Fit: The Hosmer-Lemeshow test yielded a p-value of 0.4579, which is above 0.1, so we do not reject the null hypothesis that the model fits the data. There is no evidence of lack of fit, and the model can be considered adequate.
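To make the statistic behind this test concrete, here is a base R sketch of the usual Hosmer-Lemeshow calculation with 10 groups. The helper name `hl_stat` and the simulated data are assumptions for illustration; it is not necessarily the function used to produce the output above, which looks like the format printed by a packaged test.

```r
# Hypothetical helper: Hosmer-Lemeshow statistic with g groups (base R only)
hl_stat <- function(y, p, g = 10) {
  # Group observations by quantiles of the fitted probabilities
  breaks <- unique(quantile(p, probs = seq(0, 1, length.out = g + 1)))
  grp <- droplevels(cut(p, breaks = breaks, include.lowest = TRUE))
  n_g  <- tapply(y, grp, length)
  obs1 <- tapply(y, grp, sum)   # observed number of 1s per group
  exp1 <- tapply(p, grp, sum)   # expected number of 1s per group
  obs0 <- n_g - obs1
  exp0 <- n_g - exp1
  chisq <- sum((obs1 - exp1)^2 / exp1 + (obs0 - exp0)^2 / exp0)
  df <- length(n_g) - 2
  c(statistic = chisq, df = df,
    p.value = pchisq(chisq, df = df, lower.tail = FALSE))
}

# Simulated example (not the liver data)
set.seed(3)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + x))
fit <- glm(y ~ x, family = "binomial")
res <- hl_stat(fit$y, fitted(fit))
res
```

With 10 groups the reference distribution has 10 - 2 = 8 degrees of freedom, which matches the df = 8 reported above.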

Residual Plots

Conclusion: Both residual plots show two parallel bands. This is expected for a binary response: Pearson and deviance residuals are both based on (Y - P-hat), so observations with Y=1 have positive residuals and observations with Y=0 have negative residuals, and when the estimated probabilities are similar across observations each group forms a band. The parallel pattern is therefore a feature of the data, not a violation of assumptions. The plotted points fall roughly between 0 and 1 (blue) and between -1 and -2 (red).

Because there are no points with very large residuals, the Bernoulli assumption appears valid. Because there is no systematic pattern in the plots, the homoscedasticity assumption also appears valid.
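The two-band pattern can be checked numerically. The sketch below uses simulated data (an assumption, not the liver data) to show that the Pearson residual is (Y - P-hat) scaled by its standard deviation and that the sign of each residual is fixed by the observed response.

```r
# Simulated binary-response example
set.seed(4)
x <- rnorm(150)
y <- rbinom(150, 1, plogis(0.3 + 0.4 * x))
fit <- glm(y ~ x, family = "binomial")
p.hat <- fitted(fit)

r.pearson  <- residuals(fit, type = "pearson")
r.deviance <- residuals(fit, type = "deviance")

# Pearson residual written out by hand: (Y - P-hat) / sqrt(P-hat * (1 - P-hat))
r.manual <- (y - p.hat) / sqrt(p.hat * (1 - p.hat))

# Residual signs are determined entirely by the observed response
table(response = y, sign = sign(r.deviance))
```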

Cook’s Distance

## named integer(0)

Conclusion: No observation has a Cook’s distance larger than 0.25 (the result above is an empty vector).
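When reading output of this kind, it helps to remember that which() applied to a named vector returns names and positions together. The sketch below uses simulated data and hypothetical object names (`d`, `sub`, `top`) to show the difference between an observation's row name and its position after subsetting, and how the position is used to refit without that row.

```r
set.seed(2)
d <- data.frame(x = rnorm(80))
d$y <- rbinom(80, 1, plogis(0.5 * d$x))
sub <- d[d$x > -1, ]   # subsetting keeps the original row names

fit <- glm(y ~ x, data = sub, family = "binomial")
cd <- cooks.distance(fit)

inf.id <- which(cd > 0.25)   # may be empty, printed as "named integer(0)"
top <- which.max(cd)         # name = row name in d, value = position in sub
top

# Refit without the most influential row: negative indexing uses the position
refit <- glm(y ~ x, data = sub[-top, ], family = "binomial")
```

In the printed result, the first line is the row name and the second line is the position, so a one-column result describes a single observation.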

Exercise 1C:

Interpret relationships between predictors in the final model and the odds of an adult female being a liver patient. (based on estimated Odds Ratio). NOTE: stepwise selection with AIC criteria can be performed by default step() function in R.

## 
## Call:
## glm(formula = LiverPatient ~ DB + Aspartate, family = "binomial", 
##     data = liverF)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8178  -1.2223   0.4402   1.1091   1.2049  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.32480    0.31013  -1.047   0.2950  
## DB           0.94479    0.55808   1.693   0.0905 .
## Aspartate    0.01106    0.00616   1.796   0.0726 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 175.72  on 134  degrees of freedom
## Residual deviance: 154.27  on 132  degrees of freedom
## AIC: 160.27
## 
## Number of Fisher Scoring iterations: 7

Final Model: log(p/(1-p)) = -0.32480 + 0.94479 * DB + 0.01106 * Aspartate
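As a worked example of how this equation turns into a probability, the sketch below plugs in a hypothetical patient. The values DB = 1 and Aspartate = 50 are assumptions chosen only for illustration; the coefficients are copied from the summary above.

```r
# Coefficients copied from the fitted female model above
b <- c(intercept = -0.32480, DB = 0.94479, Aspartate = 0.01106)

# Hypothetical patient, values chosen only for illustration
DB <- 1
Aspartate <- 50

log.odds <- b[["intercept"]] + b[["DB"]] * DB + b[["Aspartate"]] * Aspartate
p <- plogis(log.odds)   # inverse logit: exp(eta) / (1 + exp(eta))
p
```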

Odds Ratio

## (Intercept)          DB   Aspartate 
##       0.723       2.572       1.011

Conclusion:

  • The odds of an adult female being a liver patient with active liver disease increase by a factor of exp(0.94479) = 2.572 with a one-unit increase in DB when Aspartate is held constant.

  • The odds of an adult female being a liver patient with active liver disease increase by a factor of exp(0.01106) = 1.011 with a one-unit increase in Aspartate when DB is held constant.

Therefore, an adult female with higher levels of Direct Bilirubin (DB) and Aspartate Aminotransferase (Aspartate) is more likely to be a liver patient with active liver disease.
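The odds ratios quoted above can be recomputed from the printed estimates. The sketch also shows the odds ratio for a larger change in Aspartate, since a per-unit ratio close to 1 can still add up over many units; the choice of a 10-unit change is an assumption for illustration.

```r
b  <- c(DB = 0.94479, Aspartate = 0.01106)   # estimates from the summary above
or <- round(exp(b), 3)                       # per-unit odds ratios
or

# Odds ratio for a k-unit increase is exp(k * b); k = 10 is an illustrative choice
or.10 <- exp(10 * b[["Aspartate"]])
or.10
```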

Exercise 2:

Repeat exercise 1 for males. In addition to the previous questions, also d) comment on how the models for adult females and adult males differ. Use significance level alpha=0.1. NOTE: You will get a warning message “glm.fit: fitted probabilities numerically 0 or 1 occurred” for this run. Ignore this and use the result for the interpretation. I will explain what this message means in Week 14 videos.

Exercise 2A:

For only males in the data set, find and specify the best set of predictors via stepwise selection with AIC criteria for a logistic regression model predicting whether a male is a liver patient.

Fit Logistic Regression Model (Male)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Stepwise Selection (AIC)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## 
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial", 
##     data = liverM)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3405  -0.5170   0.3978   0.8614   1.3756  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -1.476570   0.481336  -3.068  0.00216 **
## DB           0.512503   0.176066   2.911  0.00360 **
## Alamine      0.016218   0.005239   3.095  0.00197 **
## Age          0.020616   0.008095   2.547  0.01087 * 
## Alkphos      0.001740   0.001058   1.645  0.09992 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 476.28  on 422  degrees of freedom
## Residual deviance: 395.05  on 418  degrees of freedom
## AIC: 405.05
## 
## Number of Fisher Scoring iterations: 7

Conclusion: Stepwise selection retains DB, Alamine, Age, and Alkphos. All four are significant at the 0.1 level, although Alkphos (p = 0.09992) is only marginally so.

Exercise 2B:

Comment on the significance of parameter estimates under significance level alpha=0.1, what Hosmer-Lemeshow’s test tells us about goodness of fit, and point out any issues with diagnostics by checking residual plots and cook’s distance plot (with cut-off 0.25).

  • Significance of Parameter Estimates: DB (p = 0.00360), Alamine (p = 0.00197), Age (p = 0.01087), and Alkphos (p = 0.09992) all have p-values below the significance level of 0.1. There is therefore a significant relationship between each of DB, Alamine, Age, and Alkphos and whether an adult male is a liver patient.

Hosmer-Lemeshow Test

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  step.model.2$y, fitted(step.model.2)
## X-squared = 7.043, df = 8, p-value = 0.532

Goodness of Fit: The Hosmer-Lemeshow test yielded a p-value of 0.532 (> 0.1), so we do not reject the null hypothesis that the model fits the data. There is no evidence of lack of fit, and the model can be considered adequate.

Residual Plots

Conclusion: Both residual plots show two parallel bands. This is expected for a binary response: both plots are based on (Y - P-hat), so when the estimated probabilities are similar across observations, the points with Y=1 and the points with Y=0 each form a band. The parallel pattern is therefore a feature of the data, not a violation of assumptions. The majority of the plotted points fall between 0 and 1 (blue) and between -1 and -2 (red), with a couple of outlying red points near -3.

Apart from those few points, there are no very large residuals, so the Bernoulli assumption appears valid. Because there is no systematic pattern in the plots, the homoscedasticity assumption is not violated.

Cook’s Distance

## 111 
##  86

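Conclusion: One observation has a Cook’s distance larger than 0.25. In the output above, 111 is the observation’s row name and 86 is its position in liverM, so this is a single observation rather than two. This is consistent with the null deviance degrees of freedom dropping from 422 to 421 in the refitted model.

Refitted Model without the Influential Observation (row name 111)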

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Exercise 2C:

Interpret relationships between predictors in the final model and the odds of an adult male being a liver patient. (based on estimated Odds Ratio).

Final Model

## 
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial", 
##     data = liverM[-inf.id.2, ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5166   0.0000   0.3301   0.8648   1.4696  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.902754   0.527386  -3.608 0.000309 ***
## DB           0.573104   0.198893   2.881 0.003958 ** 
## Alamine      0.015850   0.005466   2.900 0.003737 ** 
## Age          0.020418   0.008210   2.487 0.012883 *  
## Alkphos      0.003744   0.001477   2.534 0.011262 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 473.51  on 421  degrees of freedom
## Residual deviance: 381.31  on 417  degrees of freedom
## AIC: 391.31
## 
## Number of Fisher Scoring iterations: 8

Final Model: log(p/(1-p)) = -1.902754 + 0.573104 * DB + 0.015850 * Alamine + 0.020418 * Age + 0.003744 * Alkphos

Odds Ratio

## (Intercept)          DB     Alamine         Age     Alkphos 
##       0.149       1.774       1.016       1.021       1.004

Conclusion:

  • The odds of an adult male being a liver patient with active liver disease increase by a factor of exp(0.573104) = 1.774 with a one-unit increase in DB when Alamine, Age, and Alkphos are held constant.

  • The odds of an adult male being a liver patient with active liver disease increase by a factor of exp(0.015850) = 1.016 with a one-unit increase in Alamine when DB, Age, and Alkphos are held constant.

  • The odds of an adult male being a liver patient with active liver disease increase by a factor of exp(0.020418) = 1.021 with a one-unit (one-year) increase in Age when DB, Alamine, and Alkphos are held constant.

  • The odds of an adult male being a liver patient with active liver disease increase by a factor of exp(0.003744) = 1.004 with a one-unit increase in Alkphos when DB, Age, and Alamine are held constant.

Therefore, an adult male who is older and has higher levels of Direct Bilirubin (DB), Alanine Aminotransferase (Alamine), and Alkaline Phosphatase (Alkphos) is more likely to be a liver patient with active liver disease.

Exercise 2D:

Comment on how the models for adult females and adult males differ. Use significance level alpha=0.1

  • The model for adult females retains only two significant predictors, DB and Aspartate, while the model for adult males retains four: DB, Alamine, Age, and Alkphos. DB is the only predictor common to both models, and its estimated odds ratio is larger for females (2.572) than for males (1.774). In both models every retained predictor has a positive coefficient, so higher values increase the odds of being a liver patient.

Exercise 3:

Use the sleep data set which originates from http://lib.stat.cmu.edu/datasets/sleep. maxlife10 is 0 if the species’ maximum life span is less than 10 years and 1 if its maximum life span is greater than or equal to 10 years. Consider finding the best logistic model for predicting the probability that a species’ maximum lifespan will be at least 10 years. Consider all 6 variables as candidates (do not include species); two of them are index variables that are categorical in nature. Treat the two index variables as categorical variables (e.g. ignore the fact that they are ordinal). Use significance level alpha=0.1

## 'data.frame':    51 obs. of  8 variables:
##  $ species           : chr  "African" "African" "Arctic F" "Asian el" ...
##  $ bodyweight        : num  6654 1 3.38 2547 10.55 ...
##  $ brainweight       : num  5712 6.6 44.5 4603 179.5 ...
##  $ totalsleep        : num  3.3 8.3 12.5 3.9 9.8 19.7 6.2 14.5 9.7 12.5 ...
##  $ gestationtime     : num  645 42 60 624 180 35 392 63 230 112 ...
##  $ predationindex    : int  3 3 1 3 4 1 4 1 1 5 ...
##  $ sleepexposureindex: int  5 1 1 5 4 1 5 2 1 4 ...
##  $ maxlife10         : int  1 0 1 1 1 1 1 1 1 0 ...

Exercise 3A:

First find and specify the best set of predictors via stepwise selection with AIC criteria.

Fit the Logistic Regression model for maxlife10

Stepwise model selection with AIC Criteria

## 
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + as.factor(sleepexposureindex) + 
##     as.factor(predationindex), family = "binomial", data = sleep)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.42528  -0.00004   0.00000   0.00013   2.37523  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                    -6.602e+00  4.864e+00  -1.357   0.1747  
## brainweight                     5.101e-02  5.084e-02   1.003   0.3157  
## totalsleep                      4.230e-01  2.647e-01   1.598   0.1100  
## as.factor(sleepexposureindex)2  4.998e+00  2.559e+00   1.953   0.0508 .
## as.factor(sleepexposureindex)3  3.636e+01  9.624e+03   0.004   0.9970  
## as.factor(sleepexposureindex)4  3.370e+01  1.037e+04   0.003   0.9974  
## as.factor(sleepexposureindex)5  7.341e+01  1.262e+04   0.006   0.9954  
## as.factor(predationindex)2     -2.535e+00  1.960e+00  -1.293   0.1960  
## as.factor(predationindex)3     -2.512e+01  1.253e+04  -0.002   0.9984  
## as.factor(predationindex)4     -1.826e+01  6.795e+03  -0.003   0.9979  
## as.factor(predationindex)5     -5.264e+01  1.143e+04  -0.005   0.9963  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.31  on 50  degrees of freedom
## Residual deviance: 15.88  on 40  degrees of freedom
## AIC: 37.88
## 
## Number of Fisher Scoring iterations: 20

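Model Significance: At least one estimate (sleepexposureindex level 2) is significant at the 0.1 level, and the residual deviance (15.88 on 40 degrees of freedom) is far below the null deviance (68.31 on 50 degrees of freedom), so the model can be considered significant.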
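One way to back up the overall significance statement is a likelihood-ratio test built from the two deviances printed in the summary. The sketch below uses only those printed numbers. With sparse categories the chi-square approximation is rough, so the p-value should be read as indicative rather than exact.

```r
# Deviances and degrees of freedom copied from the summary above
null.dev  <- 68.31;  null.df  <- 50
resid.dev <- 15.88;  resid.df <- 40

lr.stat <- null.dev - resid.dev   # drop in deviance
lr.df   <- null.df - resid.df     # number of estimated slopes
p.value <- pchisq(lr.stat, df = lr.df, lower.tail = FALSE)
c(statistic = lr.stat, df = lr.df, p.value = p.value)
```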

Individual Parameter Significance:

The stepwise (forward-backward) selection started from the null model, with the full model as the upper scope, and used the AIC criterion at each step. In the selected model, only level 2 of sleepexposureindex (1 = least exposed, e.g. the animal sleeps in a well-protected den; 5 = most exposed) has a significant effect on maxlife10 (life span of at least 10 years) at the 0.1 level (p = 0.0508). brainweight and totalsleep are retained by AIC but are not individually significant.

Note: Because the index variables are treated as categorical, the model needs enough observations in every category to estimate each level reliably. With only 51 observations in total, some categories are sparse, which shows up as the very large estimates and standard errors (roughly 7,000 to 13,000) for levels 3 to 5 of both index variables.
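Very large estimates paired with far larger standard errors, as seen for the higher index levels, are a typical symptom of separation, where some combination of predictors predicts the outcome perfectly. It is also one common cause of the warning “fitted probabilities numerically 0 or 1 occurred”. The toy data below are an assumption chosen to make the effect obvious; they are not taken from the sleep data.

```r
# Toy data in which x separates the two outcomes perfectly
x <- 1:8
y <- c(0, 0, 0, 0, 1, 1, 1, 1)

msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = "binomial"),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))   # record the warning text
    invokeRestart("muffleWarning")
  }
)

msgs
coef(summary(fit))   # slope estimate is large and its standard error is far larger
```

In such a fit the Wald p-values are close to 1 even though the predictor separates the outcomes, so non-significant levels in the table above should not be read as "no effect".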

Exercise 3B:

Comment on the significance of parameter estimates, what Hosmer-Lemeshow’s test tells us about goodness of fit, and point out any issues with diagnostics by checking residual plots and cook’s distance plot. Do not remove influential points but just make comments on suspicious observations.

Goodness of fit- Hosmer-Lemeshow’s test

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  step.sleep1$y, fitted(step.sleep1)
## X-squared = 7.0397, df = 8, p-value = 0.5324

H0: The model is a good fit (adequate)

Ha: The model is not a good fit (not adequate)

From the Hosmer-Lemeshow test results, the p-value is 0.5324, which is above 0.1, so we do not reject H0; there is no evidence of lack of fit.

Model Diagnostics

From both the deviance and Pearson residual plots, we can see that the data points are distributed roughly between -2 and 2 (the largest deviance residual is about 2.38). There is no pattern in the distribution of the residuals, hence the model assumptions of a Bernoulli response and homoscedasticity appear valid.

Cook’s Distance

## 35 40 
## 35 40

From the Cook’s distance plot, observations 35 and 40 have a Cook’s distance larger than 0.25, hence they are suspicious observations.

Exercise 3C:

Interpret what the model tells us about relationships between the predictors and the odds of a species’ maximum lifespan being at least 10 years.

## 
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + as.factor(sleepexposureindex) + 
##     as.factor(predationindex), family = "binomial", data = sleep)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.42528  -0.00004   0.00000   0.00013   2.37523  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                    -6.602e+00  4.864e+00  -1.357   0.1747  
## brainweight                     5.101e-02  5.084e-02   1.003   0.3157  
## totalsleep                      4.230e-01  2.647e-01   1.598   0.1100  
## as.factor(sleepexposureindex)2  4.998e+00  2.559e+00   1.953   0.0508 .
## as.factor(sleepexposureindex)3  3.636e+01  9.624e+03   0.004   0.9970  
## as.factor(sleepexposureindex)4  3.370e+01  1.037e+04   0.003   0.9974  
## as.factor(sleepexposureindex)5  7.341e+01  1.262e+04   0.006   0.9954  
## as.factor(predationindex)2     -2.535e+00  1.960e+00  -1.293   0.1960  
## as.factor(predationindex)3     -2.512e+01  1.253e+04  -0.002   0.9984  
## as.factor(predationindex)4     -1.826e+01  6.795e+03  -0.003   0.9979  
## as.factor(predationindex)5     -5.264e+01  1.143e+04  -0.005   0.9963  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.31  on 50  degrees of freedom
## Residual deviance: 15.88  on 40  degrees of freedom
## AIC: 37.88
## 
## Number of Fisher Scoring iterations: 20

Interpretation of Odds ratio for significant parameters

##                    (Intercept)                    brainweight 
##                   1.000000e-03                   1.052000e+00 
##                     totalsleep as.factor(sleepexposureindex)2 
##                   1.527000e+00                   1.480500e+02 
## as.factor(sleepexposureindex)3 as.factor(sleepexposureindex)4 
##                   6.173141e+15                   4.332708e+14 
## as.factor(sleepexposureindex)5     as.factor(predationindex)2 
##                   7.603846e+31                   7.900000e-02 
##     as.factor(predationindex)3     as.factor(predationindex)4 
##                   0.000000e+00                   0.000000e+00 
##     as.factor(predationindex)5 
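##                   0.000000e+00

  • The odds of a species having a maximum lifespan of at least 10 years when the sleep exposure index is 2 are about 148 times the odds when the sleep exposure index is 1 (exp(4.998) is approximately 148), with the other predictors held constant. The odds ratios for the remaining index levels are either astronomically large or essentially zero and have p-values near 1, so they are not interpreted.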

Exercise 4:

The index variables in the data set are ordinal, meaning they are categorical and they have a natural ordering. If we treat an index variable as a continuous variable, this will imply a linear change as the index changes. Repeat Exercise 3 by treating two index variables as continuous variables. Use significance level alpha=0.1

Exercise 4A:

First find and specify the best set of predictors via stepwise selection with AIC criteria.

Fit the Logistic Regression model for maxlife10 considering index variables as continuous

Stepwise model selection with AIC Criteria

## 
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + sleepexposureindex + 
##     predationindex, family = "binomial", data = sleep)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.82148  -0.04746   0.00000   0.05811   2.41681  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)  
## (Intercept)        -6.16387    3.59301  -1.716   0.0863 .
## brainweight         0.06018    0.03544   1.698   0.0895 .
## totalsleep          0.35985    0.20995   1.714   0.0865 .
## sleepexposureindex  4.42111    1.97540   2.238   0.0252 *
## predationindex     -3.36917    1.51823  -2.219   0.0265 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.310  on 50  degrees of freedom
## Residual deviance: 19.212  on 46  degrees of freedom
## AIC: 29.212
## 
## Number of Fisher Scoring iterations: 11

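Model Significance: All four retained estimates are significant at the 0.1 level, and the residual deviance (19.212 on 46 degrees of freedom) is far below the null deviance (68.310 on 50 degrees of freedom), so the model can be considered significant.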

Individual Parameter Significance:

From the model, we can see that brainweight, totalsleep, sleepexposureindex, and predationindex are all significant, as each p-value is less than the significance level of 0.1. The two index variables are also significant at the 0.05 level.

NOTE: We have treated sleep exposure index and predation index as continuous variables, which avoids the sparse-category issue discussed in Exercise 3.
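The difference between the two codings can be seen in the design matrix. In the sketch below, `idx` is a toy five-level index (an assumption, not the sleep data): the factor coding spends four parameters per index variable, the numeric coding spends one. That accounts for the residual degrees of freedom rising from 40 in Exercise 3 to 46 here.

```r
idx <- rep(1:5, times = 4)   # toy ordinal index with 5 levels

X.factor  <- model.matrix(~ as.factor(idx))   # intercept + 4 dummy columns
X.numeric <- model.matrix(~ idx)              # intercept + 1 slope column

ncol(X.factor)
ncol(X.numeric)

# Under the numeric coding, moving k levels multiplies the odds by exp(k * b)
b.sleepexposure <- 4.42111        # estimate from the summary above
exp((5 - 1) * b.sleepexposure)    # implied odds ratio, level 5 vs level 1
```

The last line makes the linearity assumption explicit: every one-level step is assumed to change the log-odds by the same amount.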

Exercise 4B:

Comment on the significance of parameter estimates, what Hosmer-Lemeshow’s test tells us about goodness of fit, and point out any issues with diagnostics by checking residual plots and cook’s distance plot. Do not remove influential points but just make comments on suspicious observations.

Goodness of fit- Hosmer-Lemeshow’s test

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  step.sleep2$y, fitted(step.sleep2)
## X-squared = 1.4406, df = 8, p-value = 0.9937

H0: The model is a good fit (adequate)

Ha: The model is not a good fit (not adequate)

From the Hosmer-Lemeshow test results, the p-value is 0.9937, which is above 0.1, so we do not reject H0; there is no evidence of lack of fit.

Model Diagnostics

From both the deviance and Pearson residual plots, we can see that the data points are distributed roughly between -2 and 2 (the largest deviance residual is about 2.42). There is no pattern in the distribution of the residuals, hence the model assumptions of a Bernoulli response and homoscedasticity appear valid.

Cook’s Distance

## 10 35 40 50 
## 10 35 40 50

From the Cook’s distance plot, observations 10, 35, 40, and 50 have a Cook’s distance larger than 0.25, hence they are suspicious observations.

Exercise 4C:

Interpret what the model tells us about relationships between the predictors and the odds of a species’ maximum lifespan being at least 10 years.

## 
## Call:
## glm(formula = maxlife10 ~ brainweight + totalsleep + sleepexposureindex + 
##     predationindex, family = "binomial", data = sleep)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.82148  -0.04746   0.00000   0.05811   2.41681  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)  
## (Intercept)        -6.16387    3.59301  -1.716   0.0863 .
## brainweight         0.06018    0.03544   1.698   0.0895 .
## totalsleep          0.35985    0.20995   1.714   0.0865 .
## sleepexposureindex  4.42111    1.97540   2.238   0.0252 *
## predationindex     -3.36917    1.51823  -2.219   0.0265 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 68.310  on 50  degrees of freedom
## Residual deviance: 19.212  on 46  degrees of freedom
## AIC: 29.212
## 
## Number of Fisher Scoring iterations: 11

Interpretation of Odds ratio for significant parameters

##        (Intercept)        brainweight         totalsleep sleepexposureindex 
##              0.002              1.062              1.433             83.188 
##     predationindex 
##              0.034

  • The odds of a species’ maximum lifespan being at least 10 years increase by a factor of exp(0.06018) = 1.062 (a 6.2% increase) with a one-unit increase in brain weight, when total sleep, sleep exposure index, and predation index are held constant.

  • The odds of a species’ maximum lifespan being at least 10 years increase by a factor of 1.433 (a 43.3% increase) with a one-unit increase in total sleep, when brain weight, sleep exposure index, and predation index are held constant.

  • The odds of a species’ maximum lifespan being at least 10 years increase by a factor of 83.188 with a one-unit increase in sleep exposure index, when brain weight, total sleep, and predation index are held constant.

  • The odds of a species’ maximum lifespan being at least 10 years are multiplied by a factor of 0.034 (a 96.6% decrease) with a one-unit increase in predation index, when brain weight, total sleep, and sleep exposure index are held constant.

Conclusion: Higher brain weight, more total sleep, a higher sleep exposure index, and a lower predation index are associated with higher odds of a species’ maximum lifespan being at least 10 years.
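Converting an odds ratio into a percent change is a common source of slips, so the arithmetic is sketched below using the printed estimates. Because the printed estimates are rounded, the last digit can differ slightly from the odds-ratio output above.

```r
# Estimates copied from the summary above
b  <- c(brainweight = 0.06018, totalsleep = 0.35985,
        sleepexposureindex = 4.42111, predationindex = -3.36917)
or  <- exp(b)
pct <- 100 * (or - 1)   # percent change in the odds per one-unit increase
round(cbind(odds.ratio = or, percent.change = pct), 3)
```

An odds ratio below 1 corresponds to a decrease of 100 * (1 - OR) percent, so 0.034 means the odds fall by about 96.6%, not by 3.4%.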