Logistic Regression, Stepwise Model Selection with AIC


The liver data set is a subset of the ILPD (Indian Liver Patient Dataset) data set. It contains the first 10 variables described on the UCI Machine Learning Repository and a LiverPatient variable indicating whether or not the individual is a liver patient (people with active liver disease are coded LiverPatient = 1 and people without disease are coded LiverPatient = 0), restricted to adults in the data set. Adults here are defined to be individuals who are at least 18 years of age. It is possible that there are different significant predictors of being a liver patient for adult females and adult males.
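The liver.csv file used below is assumed to already contain only these adult records. For reference, a minimal, hypothetical sketch of how such a subset could be produced from the raw ILPD file (the file name ILPD.csv and its column names are assumptions, not part of the original analysis):

# Hypothetical preparation step (not run): restrict the raw ILPD data to adults (Age >= 18)
# raw <- read.csv("ILPD.csv")                         # assumed raw file name
# d1.adults <- raw[raw$Age >= 18, ]                   # adults as defined above
# write.csv(d1.adults, "liver.csv", row.names = FALSE)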

# Read the CSV file
d1 <- read.csv("liver.csv") 

# Check data format
#str(d1)

Question (a):

For females only in the data set, find and specify the best set of predictors via stepwise selection with the AIC criterion for a logistic regression model predicting whether a female is a liver patient.

Answer -:

# Filter Gender for Female and Male

d1.female <- d1[which(d1$Gender=="Female"), ]
d1.male <- d1[which(d1$Gender=="Male"), ]


Step 1)

Female

Run full and null models:

glm.female.full <- glm(LiverPatient ~ Age+TB+DB+Alkphos+Alamine+Aspartate+TP+ALB , data=d1.female, family = "binomial")
glm.female.null <- glm(LiverPatient ~ 1 , data=d1.female, family = "binomial")

#summary(glm.female.full)


Step 2)

Run Stepwise with AIC

e1.step.model.AIC <- step(glm.female.null,scope = list(upper=glm.female.full), direction="both",test="Chisq", trace = F)
summary(e1.step.model.AIC)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Aspartate, family = "binomial", 
##     data = d1.female)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8178  -1.2223   0.4402   1.1091   1.2049  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.32480    0.31013  -1.047   0.2950  
## DB           0.94479    0.55808   1.693   0.0905 .
## Aspartate    0.01106    0.00616   1.796   0.0726 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 175.72  on 134  degrees of freedom
## Residual deviance: 154.27  on 132  degrees of freedom
## AIC: 160.27
## 
## Number of Fisher Scoring iterations: 7


Conclusion:

  • The final model (for females only) is:

    logit(P(LiverPatient = 1)) = -0.32480 + 0.94479 * DB + 0.01106 * Aspartate

  • Both predictors are significant at alpha = 0.1, although a model selected by AIC can, in general, retain variables that are not individually significant. (A quick illustration of turning this log-odds equation into a probability follows below.)
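To make this log-odds equation concrete, here is a minimal sketch that converts it into a predicted probability for one hypothetical adult female; the DB and Aspartate values below are illustrative, not taken from the data:

# Hypothetical predictor values, for illustration only
new.female <- data.frame(DB = 2, Aspartate = 60)

# Log-odds (linear predictor) from the fitted model, and the corresponding probability
lp <- predict(e1.step.model.AIC, newdata = new.female)                     # log-odds scale
pr <- predict(e1.step.model.AIC, newdata = new.female, type = "response")  # P(LiverPatient = 1)
c(log.odds = unname(lp), probability = unname(pr))  # probability = plogis(log.odds)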


Question (b):

Comment on the significance of the parameter estimates under significance level alpha = 0.1, explain what the Hosmer-Lemeshow test tells us about goodness of fit, and point out any issues with the diagnostics by checking the residual plots and the Cook's distance plot (with cut-off 0.25).

Answer -:

Based on the p-values from part (a), at significance level alpha = 0.1:
- DB is significant (p = 0.0905 < 0.1)
- Aspartate is significant (p = 0.0726 < 0.1)


Step 1)

Calculate Goodness-of-fit

  • H0: The model is adequate (fits well)
  • H1: The model is not adequate (does not fit well)
## Hosmer Lemeshow test (requires the ResourceSelection package)
library(ResourceSelection)

hoslem.test(e1.step.model.AIC$y, fitted(e1.step.model.AIC), g=10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  e1.step.model.AIC$y, fitted(e1.step.model.AIC)
## X-squared = 7.7535, df = 8, p-value = 0.4579

Since the p-value (0.4579) is greater than alpha = 0.1, our conclusion is:

  • We do not have evidence to reject the null hypothesis, so the model is adequate and fits well.


Step 2)

Run Diagnostics, Residual Plots

We will use both deviance and Pearson residuals:

e1.res.deviance <-residuals(e1.step.model.AIC, type = "deviance")
e1.res.pearson <-residuals(e1.step.model.AIC, type = "pearson")

e1.std.res.deviance <-residuals(e1.step.model.AIC, type = "deviance")/sqrt(1 - hatvalues(e1.step.model.AIC)) # standardized deviance residuals
e1.std.res.pearson <-residuals(e1.step.model.AIC, type = "pearson")/sqrt(1 - hatvalues(e1.step.model.AIC)) # standardized pearson residuals

Show Plots:

dev.new(width = 1000, height = 1000)
par(mfrow=c(1,2))
plot(e1.std.res.deviance[e1.step.model.AIC$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(e1.std.res.deviance[e1.step.model.AIC$model$LiverPatient==1], col = "blue")

plot(e1.std.res.pearson[e1.step.model.AIC$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(e1.std.res.pearson[e1.step.model.AIC$model$LiverPatient==1], col = "blue")
  • The standardized residual plots show no systematic pattern, and all values (for both 0 and 1) lie within [-2, 2], so the assumption is valid.


Step 3)

Run Diagnostics, Influence Diagnostics, Cook's Distance

Using the given cut-off of 0.25:

# visual inspection

dev.new(width = 1000, height = 1000)
plot(e1.step.model.AIC, which = 4, id.n = 5) 
  • Find all points with Cook's distance greater than the cut-off:
# which observation has cook's d larger than 0.25 ?
e1.inf.id=which(cooks.distance(e1.step.model.AIC) > 0.25)
e1.inf.id
## named integer(0)
  • There are no influential points with Cook's distance larger than 0.25, so no observations need to be removed.


Conclusion

  • Based on the Hosmer-Lemeshow goodness-of-fit p-value: the model is adequate and fits well.
  • Based on the diagnostics (standardized residual plots): the assumption is valid.
  • Based on the diagnostics (Cook's distance): there are no influential points (none larger than 0.25).

Question (c):

Interpret the relationships between the predictors in the final model and the odds of an adult female being a liver patient (based on the estimated odds ratios).

Answer -:


Step 1)

Calculate Odds Ratios

## odds ratios 

e1.OR <- exp(e1.step.model.AIC$coefficients)
round(e1.OR, 3)
## (Intercept)          DB   Aspartate 
##       0.723       2.572       1.011

From part (a), our final model is:

logit(P(LiverPatient = 1)) = -0.32480 + 0.94479 * DB + 0.01106 * Aspartate


Conclusion:

  • Since DB and Aspartate are continuous variables, each coefficient corresponds to a multiplicative change in the odds.

  • OR(DB) = exp(0.94479) ≈ 2.572 > 1
    There is a positive relationship: each one-unit increase in DB multiplies the estimated odds of being a liver patient by about 2.57, holding Aspartate fixed.

  • OR(Aspartate) = exp(0.01106) ≈ 1.011 (close to 1)
    Each one-unit increase in Aspartate multiplies the odds by only about 1.011, so the per-unit effect is very small.

In other words, among adult females, higher DB is associated with higher odds of being a liver patient. (Confidence intervals for these odds ratios are sketched below.)
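As a complement to these point estimates (not part of the original output), profile-likelihood confidence intervals for the odds ratios can be obtained by exponentiating confint(); 90% intervals are used here to match alpha = 0.1:

# 90% profile-likelihood confidence intervals for the odds ratios
round(exp(confint(e1.step.model.AIC, level = 0.90)), 3)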


Male

Repeat exercise 1 for males. In addition to the previous questions, also d) comment on how the models for adult females and adult males differ. Use significance level alpha=0.1

Question (a):

For males only in the data set, find and specify the best set of predictors via stepwise selection with the AIC criterion for a logistic regression model predicting whether a male is a liver patient.

Answer -:


Step 1)

Run full and null models:

glm.male.full <- glm(LiverPatient ~ Age+TB+DB+Alkphos+Alamine+Aspartate+TP+ALB , data=d1.male, family = "binomial")
glm.male.null <- glm(LiverPatient ~ 1 , data=d1.male, family = "binomial")

#summary(glm.male.full)


Step 2)

Run Stepwise with AIC

e2.step.model.AIC <- step(glm.male.null,scope = list(upper=glm.male.full), direction="both",test="Chisq", trace = F)
summary(e2.step.model.AIC)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial", 
##     data = d1.male)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3405  -0.5170   0.3978   0.8614   1.3756  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -1.476570   0.481336  -3.068  0.00216 **
## DB           0.512503   0.176066   2.911  0.00360 **
## Alamine      0.016218   0.005239   3.095  0.00197 **
## Age          0.020616   0.008095   2.547  0.01087 * 
## Alkphos      0.001740   0.001058   1.645  0.09992 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 476.28  on 422  degrees of freedom
## Residual deviance: 395.05  on 418  degrees of freedom
## AIC: 405.05
## 
## Number of Fisher Scoring iterations: 7


Conclusion

  • The final model is:

    logit(P(LiverPatient = 1)) = -1.476570 + 0.512503 * DB + 0.016218 * Alamine + 0.020616 * Age + 0.001740 * Alkphos

All predictors are significant at alpha = 0.1, although a model selected by AIC can, in general, retain variables that are not individually significant.


Question (b):

Comment on the significance of the parameter estimates under significance level alpha = 0.1, explain what the Hosmer-Lemeshow test tells us about goodness of fit, and point out any issues with the diagnostics by checking the residual plots and the Cook's distance plot (with cut-off 0.25).

Answer -:

Based on the p-values from part (a), at significance level alpha = 0.1:
- DB is significant (p = 0.00360 < 0.1)
- Alamine is significant (p = 0.00197 < 0.1)
- Age is significant (p = 0.01087 < 0.1)
- Alkphos is significant (p = 0.09992 < 0.1)


Step 1)

Calculate Goodness-of-fit

  • H0: The model is adequate (fits well)
  • H1: The model is not adequate (does not fit well)
## Hosmer Lemeshow test

hoslem.test(e2.step.model.AIC$y, fitted(e2.step.model.AIC), g=10)  
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  e2.step.model.AIC$y, fitted(e2.step.model.AIC)
## X-squared = 7.043, df = 8, p-value = 0.532

Since the p-value (0.532) is greater than alpha = 0.1, our conclusion is:

  • We do not have evidence to reject the null hypothesis, so the model is adequate and fits well.


Step 2)

Run Diagnostics, Residual Plots

We will use both deviance and Pearson residuals:

e2.res.deviance <-residuals(e2.step.model.AIC, type = "deviance")
e2.res.pearson <-residuals(e2.step.model.AIC, type = "pearson")

e2.std.res.deviance <-residuals(e2.step.model.AIC, type = "deviance")/sqrt(1 - hatvalues(e2.step.model.AIC)) # standardized deviance residuals
e2.std.res.pearson <-residuals(e2.step.model.AIC, type = "pearson")/sqrt(1 - hatvalues(e2.step.model.AIC)) # standardized pearson residuals

Show the plots:

dev.new(width = 1000, height = 1000)
par(mfrow=c(1,2))
plot(e2.std.res.deviance[e2.step.model.AIC$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(e2.std.res.deviance[e2.step.model.AIC$model$LiverPatient==1], col = "blue")

plot(e2.std.res.pearson[e2.step.model.AIC$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(e2.std.res.pearson[e2.step.model.AIC$model$LiverPatient==1], col = "blue")
  • The standardized residual plots show no systematic pattern, and most of the values (for both 0 and 1) lie within [-2, 2], so the assumption is valid.


Step 3)

Run Diagnostics, Influence Diagnostics, Cook's Distance

Using the given cut-off of 0.25:

# visual inspection

dev.new(width = 1000, height = 1000)
plot(e2.step.model.AIC, which = 4, id.n = 5) 
  • Find all points with Cook's distance greater than the cut-off:
# which observation has cook's d larger than 0.25 ?
e2.inf.id=which(cooks.distance(e2.step.model.AIC) > 0.25)
e2.inf.id
## 111 
##  86
#d1.male[e2.inf.id,]
  • There is one influential observation with Cook's distance larger than 0.25, so we remove it and fit the model again.


Step 4)

Delete the influential observation and fit the model again

## fit model without observation with cook's d larger than 0.25

e2.glm.male.AIC.WO.Cook = glm(LiverPatient ~ DB + Alamine + Age + Alkphos, data =d1.male[-e2.inf.id, ], family = "binomial")
summary(e2.glm.male.AIC.WO.Cook)
## 
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial", 
##     data = d1.male[-e2.inf.id, ])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5166   0.0000   0.3301   0.8648   1.4696  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.902754   0.527386  -3.608 0.000309 ***
## DB           0.573104   0.198893   2.881 0.003958 ** 
## Alamine      0.015850   0.005466   2.900 0.003737 ** 
## Age          0.020418   0.008210   2.487 0.012883 *  
## Alkphos      0.003744   0.001477   2.534 0.011262 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 473.51  on 421  degrees of freedom
## Residual deviance: 381.31  on 417  degrees of freedom
## AIC: 391.31
## 
## Number of Fisher Scoring iterations: 8

Here is the final model after removing the influential observation:

logit(P(LiverPatient = 1)) = -1.902754 + 0.573104 * DB + 0.015850 * Alamine + 0.020418 * Age + 0.003744 * Alkphos

  • All predictors in the model are significant at alpha = 0.1.


Step 5)

Calculate Goodness-of-fit (without the influential observation)

  • H0: The model is adequate (fits well)
  • H1: The model is not adequate (does not fit well)
## Hosmer Lemeshow test

hoslem.test(e2.glm.male.AIC.WO.Cook$y, fitted(e2.glm.male.AIC.WO.Cook), g=10)  
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  e2.glm.male.AIC.WO.Cook$y, fitted(e2.glm.male.AIC.WO.Cook)
## X-squared = 7.6642, df = 8, p-value = 0.4669
  • Since the p-value (0.4669) is greater than alpha = 0.1, the model is still adequate and fits well.


Step 6)

Run Diagnostics, Residual Plots (without the influential observation)

We will use both deviance and Pearson residuals:

e2.res.deviance.wo.cook <-residuals(e2.glm.male.AIC.WO.Cook, type = "deviance")
e2.res.pearson.wo.cook <-residuals(e2.glm.male.AIC.WO.Cook, type = "pearson")

e2.std.res.deviance.wo.cook <-residuals(e2.glm.male.AIC.WO.Cook, type = "deviance")/sqrt(1 - hatvalues(e2.glm.male.AIC.WO.Cook)) # standardized deviance residuals
e2.std.res.pearson.wo.cook <-residuals(e2.glm.male.AIC.WO.Cook, type = "pearson")/sqrt(1 - hatvalues(e2.glm.male.AIC.WO.Cook)) # standardized pearson residuals

Show the plots:

dev.new(width = 1000, height = 1000)
par(mfrow=c(1,2))
plot(e2.std.res.deviance.wo.cook[e2.glm.male.AIC.WO.Cook$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(e2.std.res.deviance.wo.cook[e2.glm.male.AIC.WO.Cook$model$LiverPatient==1], col = "blue")

plot(e2.std.res.pearson.wo.cook[e2.glm.male.AIC.WO.Cook$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(e2.std.res.pearson.wo.cook[e2.glm.male.AIC.WO.Cook$model$LiverPatient==1], col = "blue")
  • The standardized residual plots show no systematic pattern, and most of the values (for both 0 and 1) lie within [-2, 2], so the assumption is valid.


Step 7)

Run Diagnostics, Influence Diagnostics, Cook's Distance (after removal)

Using the given cut-off of 0.25:

# visual inspection

dev.new(width = 1000, height = 1000)
plot(e2.glm.male.AIC.WO.Cook, which = 4, id.n = 5) 
  • Find all points with Cook's distance greater than the cut-off:
# which observation has cook's d larger than 0.25 ?
e2.inf.id=which(cooks.distance(e2.glm.male.AIC.WO.Cook) > 0.25)
e2.inf.id
## named integer(0)
  • There are no more influential points with Cook's distance larger than 0.25.


Conclusion:

  • Based on the Hosmer-Lemeshow goodness-of-fit p-value: the model is adequate and fits well.
  • Based on the diagnostics (standardized residual plots): the assumption is valid.
  • Based on the diagnostics (Cook's distance): after removing the single influential observation, no further influential points remain.

Question (c):

Interpret the relationships between the predictors in the final model and the odds of an adult male being a liver patient (based on the estimated odds ratios).

Answer -:


Step 1)

Calculate Odds Ratio

## odds ratios 

e2.OR <- exp(e2.glm.male.AIC.WO.Cook$coefficients)
round(e2.OR, 3)
## (Intercept)          DB     Alamine         Age     Alkphos 
##       0.149       1.774       1.016       1.021       1.004

From part (b), our final model (after removing the influential observation) is:

  • logit(P(LiverPatient = 1)) = -1.902754 + 0.573104 * DB + 0.015850 * Alamine + 0.020418 * Age + 0.003744 * Alkphos

Since all predictors are continuous, each coefficient corresponds to a multiplicative change in the odds.

  • OR(DB) = exp(0.573104) ≈ 1.774 > 1
    There is a positive relationship: each one-unit increase in DB multiplies the estimated odds of being a liver patient by about 1.77, holding the other predictors fixed.

  • OR(Alamine) = exp(0.015850) ≈ 1.016 (close to 1)
    Each one-unit increase in Alamine multiplies the odds by only about 1.016, so the per-unit effect is small.

  • OR(Age) = exp(0.020418) ≈ 1.021 (close to 1)
    Each additional year of age multiplies the odds by only about 1.021, so the per-unit effect is small.

  • OR(Alkphos) = exp(0.003744) ≈ 1.004 (close to 1)
    Each one-unit increase in Alkphos multiplies the odds by only about 1.004, so the per-unit effect is small.

In other words, among adult males as well, higher DB is associated with noticeably higher odds of being a liver patient. (A sketch re-expressing the near-1 per-unit odds ratios over larger increments follows below.)
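Because the odds change multiplicatively, a per-unit odds ratio close to 1 can be re-expressed over a larger increment as exp(increment * coefficient). A small sketch for the male model; the 10-unit increments are arbitrary and chosen only to illustrate the multiplicative interpretation:

# Odds ratios for a 10-unit increase in each predictor (increment chosen arbitrarily)
male.coefs <- coef(e2.glm.male.AIC.WO.Cook)[c("Alamine", "Age", "Alkphos")]
round(exp(10 * male.coefs), 3)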

Question (d):

Comment on how the models for adult females and adult males differ.

Answer -:

This is the final model for adult females:

  • logit(P(LiverPatient = 1)) = -0.32480 + 0.94479 * DB + 0.01106 * Aspartate

This is the final model for adult males:

  • logit(P(LiverPatient = 1)) = -1.902754 + 0.573104 * DB + 0.015850 * Alamine + 0.020418 * Age + 0.003744 * Alkphos


Conclusion:

  • Although the logistic regression models for males and females include different variables and the final models are not the same, in both models an increase in DB is associated with higher odds of being a liver patient.

This is because the per-unit odds ratios of the other selected variables (Aspartate for females; Alamine, Age, and Alkphos for males) are close to 1, while the DB odds ratio is clearly above 1 in both models. (A short side-by-side comparison of the DB odds ratios follows below.)
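For a direct side-by-side look at the shared DB effect, a short sketch using the two model objects fitted above:

# Estimated DB odds ratio in each final model
db.or <- c(Female = unname(exp(coef(e1.step.model.AIC)["DB"])),
           Male   = unname(exp(coef(e2.glm.male.AIC.WO.Cook)["DB"])))
round(db.or, 3)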