Logistic Regression, Stepwise Model Selection with AIC


The liver data set is a subset of the ILPD (Indian Liver Patient Dataset) data set. It contains the first 10 variables described on the UCI Machine Learning Repository and a LiverPatient variable indicating whether or not the individual is a liver patient (people with active liver disease are coded LiverPatient = 1 and people without disease are coded LiverPatient = 0), restricted to adults in the data set. Adults here are defined to be individuals who are at least 18 years of age. It is possible that there are different significant predictors of being a liver patient for adult females and adult males.
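The liver.csv file used below is assumed to already contain only these adult records. For reference, a minimal, hypothetical sketch of how such a subset could be produced from the raw ILPD file (the file name ILPD.csv and its column names are assumptions, not part of the original analysis):

# Hypothetical preparation step (not run): restrict the raw ILPD data to adults (Age >= 18)
# raw <- read.csv("ILPD.csv")                         # assumed raw file name
# d1.adults <- raw[raw$Age >= 18, ]                   # adults as defined above
# write.csv(d1.adults, "liver.csv", row.names = FALSE)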

# Read the CSV file
d1 <- read.csv("liver.csv") 

# Check data format
#str(d1)

Question (a):

For females only in the data set, find and specify the best set of predictors via stepwise selection with the AIC criterion for a logistic regression model predicting whether a female is a liver patient.

Answer -:

# Filter Gender for Female and Male

d1.female <- d1[which(d1$Gender=="Female"), ]
d1.male <- d1[which(d1$Gender=="Male"), ]


Step 1)

Female

Run full and null models:

glm.female.full <- glm(LiverPatient ~ Age+TB+DB+Alkphos+Alamine+Aspartate+TP+ALB , data=d1.female, family = "binomial")
glm.female.null <- glm(LiverPatient ~ 1 , data=d1.female, family = "binomial")

#summary(glm.female.full)


Step 2)

Run Stepwise with AIC

e1.step.model.AIC <- step(glm.female.null,scope = list(upper=glm.female.full), direction="both",test="Chisq", trace = F)
summary(e1.step.model.AIC)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Aspartate, family = "binomial", 
##     data = d1.female)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8178  -1.2223   0.4402   1.1091   1.2049  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.32480    0.31013  -1.047   0.2950  
## DB           0.94479    0.55808   1.693   0.0905 .
## Aspartate    0.01106    0.00616   1.796   0.0726 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 175.72  on 134  degrees of freedom
## Residual deviance: 154.27  on 132  degrees of freedom
## AIC: 160.27
## 
## Number of Fisher Scoring iterations: 7


Conclusion:

  • The final model (for females only) is:

    logit(P(LiverPatient = 1)) = -0.32480 + 0.94479 * DB + 0.01106 * Aspartate

  • Both predictors are significant at alpha = 0.1, although a model selected by AIC can, in general, retain variables that are not individually significant. (A quick illustration of turning this log-odds equation into a probability follows below.)
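To make this log-odds equation concrete, here is a minimal sketch that converts it into a predicted probability for one hypothetical adult female; the DB and Aspartate values below are illustrative, not taken from the data:

# Hypothetical predictor values, for illustration only
new.female <- data.frame(DB = 2, Aspartate = 60)

# Log-odds (linear predictor) from the fitted model, and the corresponding probability
lp <- predict(e1.step.model.AIC, newdata = new.female)                     # log-odds scale
pr <- predict(e1.step.model.AIC, newdata = new.female, type = "response")  # P(LiverPatient = 1)
c(log.odds = unname(lp), probability = unname(pr))  # probability = plogis(log.odds)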


Question (b):

Comment on the significance of the parameter estimates under significance level alpha = 0.1, explain what the Hosmer-Lemeshow test tells us about goodness of fit, and point out any issues with the diagnostics by checking the residual plots and the Cook's distance plot (with cut-off 0.25).

Answer -:

Based on the p-values from part (a), at significance level alpha = 0.1:
- DB is significant (p = 0.0905 < 0.1)
- Aspartate is significant (p = 0.0726 < 0.1)


Step 1)

Calculate Goodness-of-fit

  • H0: The model is adequate (fits well)
  • H1: The model is not adequate (does not fit well)
## Hosmer Lemeshow test (requires the ResourceSelection package)
library(ResourceSelection)

hoslem.test(e1.step.model.AIC$y, fitted(e1.step.model.AIC), g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  e1.step.model.AIC$y, fitted(e1.step.model.AIC)
## X-squared = 7.7535, df = 8, p-value = 0.4579

Since the p-value (0.4579) is greater than alpha = 0.1, our conclusion is:

  • We do not have evidence to reject the null hypothesis, so the model is adequate and fits well.


Step 2)

Run Diagnostics, Residual Plots

We will use both deviance and Pearson residuals:

e1.res.deviance <-residuals(e1.step.model.AIC, type = "deviance")
e1.res.pearson <-residuals(e1.step.model.AIC, type = "pearson")

e1.std.res.deviance <-residuals(e1.step.model.AIC, type = "deviance")/sqrt(1 - hatvalues(e1.step.model.AIC)) # standardized deviance residuals
e1.std.res.pearson <-residuals(e1.step.model.AIC, type = "pearson")/sqrt(1 - hatvalues(e1.step.model.AIC)) # standardized pearson residuals

Show Plots:

dev.new(width = 1000, height = 1000)
par(mfrow=c(1,2))
plot(e1.std.res.deviance[e1.step.model.AIC$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(e1.std.res.deviance[e1.step.model.AIC$model$LiverPatient==1], col = "blue")

plot(e1.std.res.pearson[e1.step.model.AIC$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(e1.std.res.pearson[e1.step.model.AIC$model$LiverPatient==1], col = "blue")
  • The standardized residual plots show no systematic pattern, and all values (for both 0 and 1) lie within [-2, 2], so the assumption is valid.


Step 3)

Run Diagnostics, Influence Diagnostics, Cook's Distance

Using the given cut-off of 0.25:

# visual inspection

dev.new(width = 1000, height = 1000)
plot(e1.step.model.AIC, which = 4, id.n = 5) 
  • Find all points with Cook's distance greater than the cut-off:
# which observation has cook's d larger than 0.25 ?
e1.inf.id=which(cooks.distance(e1.step.model.AIC) > 0.25)
e1.inf.id
## named integer(0)
  • There are no influential points with Cook's distance larger than 0.25, so no observations need to be removed.


Conclusion

  • Based on the Hosmer-Lemeshow goodness-of-fit p-value: the model is adequate and fits well.
  • Based on the diagnostics (standardized residual plots): the assumption is valid.
  • Based on the diagnostics (Cook's distance): there are no influential points (none larger than 0.25).

Question (c):

Interpret the relationships between the predictors in the final model and the odds of an adult female being a liver patient (based on the estimated odds ratios).

Answer -:


Step 1)

Calculate Odds Ratios

## odds ratios 

e1.OR <- exp(e1.step.model.AIC$coefficients)
round(e1.OR, 3)
## (Intercept)          DB   Aspartate 
##       0.723       2.572       1.011

From part (a), our final model is:

logit(P(LiverPatient = 1)) = -0.32480 + 0.94479 * DB + 0.01106 * Aspartate


Conclusion:

  • Since DB and Aspartate are continuous variables, each coefficient corresponds to a multiplicative change in the odds.

  • OR(DB) = exp(0.94479) ≈ 2.572 > 1
    There is a positive relationship: each one-unit increase in DB multiplies the estimated odds of being a liver patient by about 2.57, holding Aspartate fixed.

  • OR(Aspartate) = exp(0.01106) ≈ 1.011 (close to 1)
    Each one-unit increase in Aspartate multiplies the odds by only about 1.011, so the per-unit effect is very small.

In other words, among adult females, higher DB is associated with higher odds of being a liver patient. (Confidence intervals for these odds ratios are sketched below.)
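As a complement to these point estimates (not part of the original output), profile-likelihood confidence intervals for the odds ratios can be obtained by exponentiating confint(); 90% intervals are used here to match alpha = 0.1:

# 90% profile-likelihood confidence intervals for the odds ratios
round(exp(confint(e1.step.model.AIC, level = 0.90)), 3)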


Male

Repeat exercise 1 for males. In addition to the previous questions, also d) comment on how the models for adult females and adult males differ. Use significance level alpha=0.1

Question (a):

For males only in the data set, find and specify the best set of predictors via stepwise selection with the AIC criterion for a logistic regression model predicting whether a male is a liver patient.

Answer -:


Step 1)

Run full and null models:

glm.male.full <- glm(LiverPatient ~ Age+TB+DB+Alkphos+Alamine+Aspartate+TP+ALB , data=d1.male, family = "binomial")
glm.male.null <- glm(LiverPatient ~ 1 , data=d1.male, family = "binomial")

#summary(glm.male.full)


Step 2)

Run Stepwise with AIC

e2.step.model.AIC <- step(glm.male.null,scope = list(upper=glm.male.full), direction="both",test="Chisq", trace = F)
summary(e2.step.model.AIC)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial", 
##     data = d1.male)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3405  -0.5170   0.3978   0.8614   1.3756  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -1.476570   0.481336  -3.068  0.00216 **
## DB           0.512503   0.176066   2.911  0.00360 **
## Alamine      0.016218   0.005239   3.095  0.00197 **
## Age          0.020616   0.008095   2.547  0.01087 * 
## Alkphos      0.001740   0.001058   1.645  0.09992 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 476.28  on 422  degrees of freedom
## Residual deviance: 395.05  on 418  degrees of freedom
## AIC: 405.05
## 
## Number of Fisher Scoring iterations: 7


Conclusion

  • The final model is:

    logit(P(LiverPatient = 1)) = -1.476570 + 0.512503 * DB + 0.016218 * Alamine + 0.020616 * Age + 0.001740 * Alkphos

All predictors are significant at alpha = 0.1, although a model selected by AIC can, in general, retain variables that are not individually significant.


Question (b):

Comment on the significance of the parameter estimates under significance level alpha = 0.1, explain what the Hosmer-Lemeshow test tells us about goodness of fit, and point out any issues with the diagnostics by checking the residual plots and the Cook's distance plot (with cut-off 0.25).

Answer -:

Based on the p-values from part (a), at significance level alpha = 0.1:
- DB is significant (p = 0.00360 < 0.1)
- Alamine is significant (p = 0.00197 < 0.1)
- Age is significant (p = 0.01087 < 0.1)
- Alkphos is significant (p = 0.09992 < 0.1)


Step 1)

Calculate Goodness-of-fit

  • H0: The model is adequate (fits well)
  • H1: The model is not adequate (does not fit well)
## Hosmer Lemeshow test

hoslem.test(e2.step.model.AIC$y, fitted(e2.step.model.AIC), g=10)  
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  e2.step.model.AIC$y, fitted(e2.step.model.AIC)
## X-squared = 7.043, df = 8, p-value = 0.532

Since the p-value (0.532) is greater than alpha = 0.1, our conclusion is:

  • We do not have evidence to reject the null hypothesis, so the model is adequate and fits well.


Step 2)

Run Diagnostics, Residual Plots

We will use both deviance and Pearson residuals:

e2.res.deviance <-residuals(e2.step.model.AIC, type = "deviance")
e2.res.pearson <-residuals(e2.step.model.AIC, type = "pearson")

e2.std.res.deviance <-residuals(e2.step.model.AIC, type = "deviance")/sqrt(1 - hatvalues(e2.step.model.AIC)) # standardized deviance residuals
e2.std.res.pearson <-residuals(e2.step.model.AIC, type = "pearson")/sqrt(1 - hatvalues(e2.step.model.AIC)) # standardized pearson residuals

Show the plots:

dev.new(width = 1000, height = 1000)
par(mfrow=c(1,2))
plot(e2.std.res.deviance[e2.step.model.AIC$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(e2.std.res.deviance[e2.step.model.AIC$model$LiverPatient==1], col = "blue")

plot(e2.std.res.pearson[e2.step.model.AIC$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(e2.std.res.pearson[e2.step.model.AIC$model$LiverPatient==1], col = "blue")
  • The standardized residual plots show no systematic pattern, and most of the values (for both 0 and 1) lie within [-2, 2], so the assumption is valid.


Step 3)

Run Diagnostics, Influence Diagnostics, Cook's Distance

Using the given cut-off of 0.25:

# visual inspection

dev.new(width = 1000, height = 1000)
plot(e2.step.model.AIC, which = 4, id.n = 5) 
  • Find all points with Cook's distance greater than the cut-off:
# which observation has cook's d larger than 0.25 ?
e2.inf.id=which(cooks.distance(e2.step.model.AIC) > 0.25)
e2.inf.id
## 111 
##  86
#d1.male[e2.inf.id,]
  • There is one influential observation with Cook's distance larger than 0.25, so we remove it and fit the model again.


Step 4)

Delete the influential observation and fit the model again

## fit model without observation with cook's d larger than 0.25

e2.glm.male.AIC.WO.Cook = glm(LiverPatient ~ DB + Alamine + Age + Alkphos, data =d1.male[-e2.inf.id, ], family = "binomial")
summary(e2.glm.male.AIC.WO.Cook)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial", 
##     data = d1.male[-e2.inf.id, ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5166   0.0000   0.3301   0.8648   1.4696  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.902754   0.527386  -3.608 0.000309 ***
## DB           0.573104   0.198893   2.881 0.003958 ** 
## Alamine      0.015850   0.005466   2.900 0.003737 ** 
## Age          0.020418   0.008210   2.487 0.012883 *  
## Alkphos      0.003744   0.001477   2.534 0.011262 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 473.51  on 421  degrees of freedom
## Residual deviance: 381.31  on 417  degrees of freedom
## AIC: 391.31
## 
## Number of Fisher Scoring iterations: 8

Here is the final model after removing the influential observation:

logit(P(LiverPatient = 1)) = -1.902754 + 0.573104 * DB + 0.015850 * Alamine + 0.020418 * Age + 0.003744 * Alkphos

  • All predictors in the model are significant at alpha = 0.1.


Step 5)

Calculate Goodness-of-fit (without the influential observation)

  • H0: The model is adequate (fits well)
  • H1: The model is not adequate (does not fit well)
## Hosmer Lemeshow test

hoslem.test(e2.glm.male.AIC.WO.Cook$y, fitted(e2.glm.male.AIC.WO.Cook), g=10)  
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  e2.glm.male.AIC.WO.Cook$y, fitted(e2.glm.male.AIC.WO.Cook)
## X-squared = 7.6642, df = 8, p-value = 0.4669
  • Since the p-value (0.4669) is greater than alpha = 0.1, the model is still adequate and fits well.


Step 6)

Run Diagnostics, Residual Plots (without the influential observation)

We will use both deviance and Pearson residuals:

e2.res.deviance.wo.cook <-residuals(e2.glm.male.AIC.WO.Cook, type = "deviance")
e2.res.pearson.wo.cook <-residuals(e2.glm.male.AIC.WO.Cook, type = "pearson")

e2.std.res.deviance.wo.cook <-residuals(e2.glm.male.AIC.WO.Cook, type = "deviance")/sqrt(1 - hatvalues(e2.glm.male.AIC.WO.Cook)) # standardized deviance residuals
e2.std.res.pearson.wo.cook <-residuals(e2.glm.male.AIC.WO.Cook, type = "pearson")/sqrt(1 - hatvalues(e2.glm.male.AIC.WO.Cook)) # standardized pearson residuals

Show the plots:

dev.new(width = 1000, height = 1000)
par(mfrow=c(1,2))
plot(e2.std.res.deviance.wo.cook[e2.glm.male.AIC.WO.Cook$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(e2.std.res.deviance.wo.cook[e2.glm.male.AIC.WO.Cook$model$LiverPatient==1], col = "blue")

plot(e2.std.res.pearson.wo.cook[e2.glm.male.AIC.WO.Cook$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(e2.std.res.pearson.wo.cook[e2.glm.male.AIC.WO.Cook$model$LiverPatient==1], col = "blue")
  • The standardized residual plots show no systematic pattern, and most of the values (for both 0 and 1) lie within [-2, 2], so the assumption is valid.


Step 7)

Run Diagnostics, Influence Diagnostics, Cook's Distance (after removal)

Using the given cut-off of 0.25:

# visual inspection

dev.new(width = 1000, height = 1000)
plot(e2.glm.male.AIC.WO.Cook, which = 4, id.n = 5) 
  • Find all points with Cook's distance greater than the cut-off:
# which observation has cook's d larger than 0.25 ?
e2.inf.id=which(cooks.distance(e2.glm.male.AIC.WO.Cook) > 0.25)
e2.inf.id
## named integer(0)
  • There are no more influential points with Cook's distance larger than 0.25.


Conclusion:

  • Based on the Hosmer-Lemeshow goodness-of-fit p-value: the model is adequate and fits well.
  • Based on the diagnostics (standardized residual plots): the assumption is valid.
  • Based on the diagnostics (Cook's distance): after removing the single influential observation, no further influential points remain.

Question (c):

Interpret the relationships between the predictors in the final model and the odds of an adult male being a liver patient (based on the estimated odds ratios).

Answer -:


Step 1)

Calculate Odds Ratio

## odds ratios 

e2.OR <- exp(e2.glm.male.AIC.WO.Cook$coefficients)
round(e2.OR, 3)
## (Intercept)          DB     Alamine         Age     Alkphos 
##       0.149       1.774       1.016       1.021       1.004

From part (b), our final model (after removing the influential observation) is:

  • logit(P(LiverPatient = 1)) = -1.902754 + 0.573104 * DB + 0.015850 * Alamine + 0.020418 * Age + 0.003744 * Alkphos

Since all predictors are continuous, each coefficient corresponds to a multiplicative change in the odds.

  • OR(DB) = exp(0.573104) ≈ 1.774 > 1
    There is a positive relationship: each one-unit increase in DB multiplies the estimated odds of being a liver patient by about 1.77, holding the other predictors fixed.

  • OR(Alamine) = exp(0.015850) ≈ 1.016 (close to 1)
    Each one-unit increase in Alamine multiplies the odds by only about 1.016, so the per-unit effect is small.

  • OR(Age) = exp(0.020418) ≈ 1.021 (close to 1)
    Each additional year of age multiplies the odds by only about 1.021, so the per-unit effect is small.

  • OR(Alkphos) = exp(0.003744) ≈ 1.004 (close to 1)
    Each one-unit increase in Alkphos multiplies the odds by only about 1.004, so the per-unit effect is small.

In other words, among adult males as well, higher DB is associated with noticeably higher odds of being a liver patient. (A sketch re-expressing the near-1 per-unit odds ratios over larger increments follows below.)
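Because the odds change multiplicatively, a per-unit odds ratio close to 1 can be re-expressed over a larger increment as exp(increment * coefficient). A small sketch for the male model; the 10-unit increments are arbitrary and chosen only to illustrate the multiplicative interpretation:

# Odds ratios for a 10-unit increase in each predictor (increment chosen arbitrarily)
male.coefs <- coef(e2.glm.male.AIC.WO.Cook)[c("Alamine", "Age", "Alkphos")]
round(exp(10 * male.coefs), 3)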

Question (d):

Comment on how the models for adult females and adult males differ.

Answer -:

This is the final model for adult females:

  • logit(P(LiverPatient = 1)) = -0.32480 + 0.94479 * DB + 0.01106 * Aspartate

This is the final model for adult males:

  • logit(P(LiverPatient = 1)) = -1.902754 + 0.573104 * DB + 0.015850 * Alamine + 0.020418 * Age + 0.003744 * Alkphos


Conclusion:

  • Although the logistic regression models for males and females include different variables and the final models are not the same, in both models an increase in DB is associated with higher odds of being a liver patient.

This is because the per-unit odds ratios of the other selected variables (Aspartate for females; Alamine, Age, and Alkphos for males) are close to 1, while the DB odds ratio is clearly above 1 in both models. (A short side-by-side comparison of the DB odds ratios follows below.)
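For a direct side-by-side look at the shared DB effect, a short sketch using the two model objects fitted above:

# Estimated DB odds ratio in each final model
db.or <- c(Female = unname(exp(coef(e1.step.model.AIC)["DB"])),
           Male   = unname(exp(coef(e2.glm.male.AIC.WO.Cook)["DB"])))
round(db.or, 3)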