The liver data set is a subset of the ILPD (Indian Liver Patient Dataset) data set. It contains the first 10 variables described on the UCI Machine Learning Repository plus a LiverPatient variable indicating whether the individual is a liver patient (people with active liver disease are coded LiverPatient = 1; people without disease are coded LiverPatient = 0), restricted to adults, defined here as individuals who are at least 18 years of age. It is possible that the significant predictors of being a liver patient differ between adult females and adult males.
# Read the CSV file
d1 <- read.csv("liver.csv")
# Check data format
#str(d1)
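As a quick sanity check (a minimal sketch; it assumes the columns are named Age, Gender, and LiverPatient as described above), we can confirm the 0/1 coding of LiverPatient and that only adults appear in the data:
# Sanity checks on the coding described above (column names assumed from the description)
table(d1$LiverPatient)   # counts of non-patients (0) and patients (1)
summary(d1$Age)          # minimum age should be at least 18 (adults only)
table(d1$Gender)         # counts of Female and Male records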
Considering only the females in the data set, we want to find and specify the best set of predictors, via stepwise selection with the AIC criterion, for a logistic regression model predicting whether a female is a liver patient.
# Filter Gender for Female and Male
d1.female <- d1[which(d1$Gender=="Female"), ]
d1.male <- d1[which(d1$Gender=="Male"), ]
Step 1)
glm.female.full <- glm(LiverPatient ~ Age+TB+DB+Alkphos+Alamine+Aspartate+TP+ALB , data=d1.female, family = "binomial")
glm.female.null <- glm(LiverPatient ~ 1 , data=d1.female, family = "binomial")
#summary(glm.female.full)
Step 2)
e1.step.model.AIC <- step(glm.female.null,scope = list(upper=glm.female.full), direction="both",test="Chisq", trace = F)
summary(e1.step.model.AIC)
##
## Call:
## glm(formula = LiverPatient ~ DB + Aspartate, family = "binomial",
## data = d1.female)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8178 -1.2223 0.4402 1.1091 1.2049
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.32480 0.31013 -1.047 0.2950
## DB 0.94479 0.55808 1.693 0.0905 .
## Aspartate 0.01106 0.00616 1.796 0.0726 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 175.72 on 134 degrees of freedom
## Residual deviance: 154.27 on 132 degrees of freedom
## AIC: 160.27
##
## Number of Fisher Scoring iterations: 7
Conclusion:
The final model (for female only) is:
logit(P(LiverPatient = 1)) = -0.32480 + 0.94479 * DB + 0.01106 * Aspartate
Both selected predictors are significant at alpha = 0.1 (see part b), although we know that a final model chosen by AIC may still include insignificant variables.
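To see what the fitted logit implies on the probability scale, here is a minimal sketch; the DB and Aspartate values are hypothetical and chosen only for illustration:
# Predicted probability of being a liver patient for hypothetical predictor values
new.female <- data.frame(DB = 1.0, Aspartate = 40)   # hypothetical values, for illustration only
predict(e1.step.model.AIC, newdata = new.female, type = "response")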
Comment on the significance of parameter estimates under significance level alpha=0.1, what Hosmer-Lemeshow’s test tells us about goodness of fit, and point out any issues with diagnostics by checking residual plots and cook’s distance plot (with cut-off 0.25).
Based on the p-values from part (a), at significance level alpha = 0.1:
- DB is significant (p = 0.0905 < 0.1)
- Aspartate is significant (p = 0.0726 < 0.1)
Step 1)
library(ResourceSelection)  # provides hoslem.test()
## Hosmer-Lemeshow test
hoslem.test(e1.step.model.AIC$y, fitted(e1.step.model.AIC), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: e1.step.model.AIC$y, fitted(e1.step.model.AIC)
## X-squared = 7.7535, df = 8, p-value = 0.4579
Since the p-value (0.4579) is greater than alpha = 0.1, we fail to reject the null hypothesis that the model fits the data; the Hosmer-Lemeshow test indicates no evidence of lack of fit.
Step 2)
We will compute both deviance and Pearson residuals (raw and standardized):
e1.res.deviance <-residuals(e1.step.model.AIC, type = "deviance")
e1.res.pearson <-residuals(e1.step.model.AIC, type = "pearson")
e1.std.res.deviance <-residuals(e1.step.model.AIC, type = "deviance")/sqrt(1 - hatvalues(e1.step.model.AIC)) # standardized deviance residuals
e1.std.res.pearson <-residuals(e1.step.model.AIC, type = "pearson")/sqrt(1 - hatvalues(e1.step.model.AIC)) # standardized pearson residuals
Show Plots:
dev.new(width = 1000, height = 1000)
par(mfrow=c(1,2))
plot(e1.std.res.deviance[e1.step.model.AIC$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(e1.std.res.deviance[e1.step.model.AIC$model$LiverPatient==1], col = "blue")
plot(e1.std.res.pearson[e1.step.model.AIC$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(e1.std.res.pearson[e1.step.model.AIC$model$LiverPatient==1], col = "blue")
Step 3)
As stated, we use a cut-off of 0.25 for Cook's distance.
# visual inspection
dev.new(width = 1000, height = 1000)
plot(e1.step.model.AIC, which = 4, id.n = 5)
# which observation has cook's d larger than 0.25 ?
e1.inf.id=which(cooks.distance(e1.step.model.AIC) > 0.25)
e1.inf.id
## named integer(0)
Conclusion:
No observations have Cook's distance greater than 0.25 (the result above is an empty set), so there are no influential observations and no adjustment to the model is needed.
Interpret relationships between predictors in the final model and the odds of an adult female being a liver patient. (based on estimated Odds Ratio).
Step 1)
## odds ratios
e1.OR <- exp(e1.step.model.AIC$coefficients)
round(e1.OR, 3)
## (Intercept) DB Aspartate
## 0.723 2.572 1.011
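As a supplement, a minimal sketch of 90% Wald confidence intervals for these odds ratios (matching alpha = 0.1), using confint.default():
## 90% Wald confidence intervals for the odds ratios
round(exp(confint.default(e1.step.model.AIC, level = 0.90)), 3)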
From part (a), our final model is:
logit(P(LiverPatient = 1)) = -0.32480 + 0.94479 * DB + 0.01106 * Aspartate
Conclusion:
Since DB and Aspartate are continuous variables, each odds ratio represents a multiplicative change in the odds.
OR(DB) = exp(0.94479) ≈ 2.572 > 1
Thus there is a positive relationship: a one-unit increase in DB multiplies the odds of being a liver patient by about 2.572, holding Aspartate fixed.
OR(Aspartate) = exp(0.01106) ≈ 1.011 (very close to 1)
Thus the Aspartate effect is very small: a one-unit increase in Aspartate multiplies the odds by only about 1.011.
In other words, among adult females, higher DB is associated with substantially higher odds of being a liver patient.
Repeat exercise 1 for males. In addition to the previous questions, also d) comment on how the models for adult females and adult males differ. Use significance level alpha=0.1
Considering only the males in the data set, find and specify the best set of predictors, via stepwise selection with the AIC criterion, for a logistic regression model predicting whether a male is a liver patient.
Step 1)
glm.male.full <- glm(LiverPatient ~ Age+TB+DB+Alkphos+Alamine+Aspartate+TP+ALB , data=d1.male, family = "binomial")
glm.male.null <- glm(LiverPatient ~ 1 , data=d1.male, family = "binomial")
#summary(glm.male.full)
Step 2)
e2.step.model.AIC <- step(glm.male.null,scope = list(upper=glm.male.full), direction="both",test="Chisq", trace = F)
summary(e2.step.model.AIC)
##
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial",
## data = d1.male)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3405 -0.5170 0.3978 0.8614 1.3756
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.476570 0.481336 -3.068 0.00216 **
## DB 0.512503 0.176066 2.911 0.00360 **
## Alamine 0.016218 0.005239 3.095 0.00197 **
## Age 0.020616 0.008095 2.547 0.01087 *
## Alkphos 0.001740 0.001058 1.645 0.09992 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 476.28 on 422 degrees of freedom
## Residual deviance: 395.05 on 418 degrees of freedom
## AIC: 405.05
##
## Number of Fisher Scoring iterations: 7
Conclusion
The final model (for males only) is:
logit(P(LiverPatient = 1)) = -1.476570 + 0.512503 * DB + 0.016218 * Alamine + 0.020616 * Age + 0.001740 * Alkphos
All selected predictors are significant at alpha = 0.1 (see part b), although we know that a final model chosen by AIC may still include insignificant variables.
Comment on the significance of parameter estimates under significance level alpha=0.1, what Hosmer-Lemeshow’s test tells us about goodness of fit, and point out any issues with diagnostics by checking residual plots and cook’s distance plot (with cut-off 0.25).
Based on the p-values from part (a), at significance level alpha = 0.1:
- DB is significant (p = 0.00360)
- Alamine is significant (p = 0.00197)
- Age is significant (p = 0.01087)
- Alkphos is significant (p = 0.09992 < 0.1)
Step 1)
## Hosmer Lemeshow test
hoslem.test(e2.step.model.AIC$y, fitted(e2.step.model.AIC), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: e2.step.model.AIC$y, fitted(e2.step.model.AIC)
## X-squared = 7.043, df = 8, p-value = 0.532
Since the p-value (0.532) is greater than alpha = 0.1, we fail to reject the null hypothesis that the model fits the data; the Hosmer-Lemeshow test indicates no evidence of lack of fit.
Step 2)
We will compute both deviance and Pearson residuals (raw and standardized):
e2.res.deviance <-residuals(e2.step.model.AIC, type = "deviance")
e2.res.pearson <-residuals(e2.step.model.AIC, type = "pearson")
e2.std.res.deviance <-residuals(e2.step.model.AIC, type = "deviance")/sqrt(1 - hatvalues(e2.step.model.AIC)) # standardized deviance residuals
e2.std.res.pearson <-residuals(e2.step.model.AIC, type = "pearson")/sqrt(1 - hatvalues(e2.step.model.AIC)) # standardized pearson residuals
Show the plots:
dev.new(width = 1000, height = 1000)
par(mfrow=c(1,2))
plot(e2.std.res.deviance[e2.step.model.AIC$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(e2.std.res.deviance[e2.step.model.AIC$model$LiverPatient==1], col = "blue")
plot(e2.std.res.pearson[e2.step.model.AIC$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(e2.std.res.pearson[e2.step.model.AIC$model$LiverPatient==1], col = "blue")
Step 3)
As stated, we use a cut-off of 0.25 for Cook's distance.
# visual inspection
dev.new(width = 1000, height = 1000)
plot(e2.step.model.AIC, which = 4, id.n = 5)
# which observation has cook's d larger than 0.25 ?
e2.inf.id=which(cooks.distance(e2.step.model.AIC) > 0.25)
e2.inf.id
## 111
## 86
#d1.male[e2.inf.id,]
Step 4)
## fit model without observation with cook's d larger than 0.25
e2.glm.male.AIC.WO.Cook = glm(LiverPatient ~ DB + Alamine + Age + Alkphos, data =d1.male[-e2.inf.id, ], family = "binomial")
summary(e2.glm.male.AIC.WO.Cook)
##
## Call:
## glm(formula = LiverPatient ~ DB + Alamine + Age + Alkphos, family = "binomial",
## data = d1.male[-e2.inf.id, ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5166 0.0000 0.3301 0.8648 1.4696
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.902754 0.527386 -3.608 0.000309 ***
## DB 0.573104 0.198893 2.881 0.003958 **
## Alamine 0.015850 0.005466 2.900 0.003737 **
## Age 0.020418 0.008210 2.487 0.012883 *
## Alkphos 0.003744 0.001477 2.534 0.011262 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 473.51 on 421 degrees of freedom
## Residual deviance: 381.31 on 417 degrees of freedom
## AIC: 391.31
##
## Number of Fisher Scoring iterations: 8
Here is the final model for males, after removing the influential observation (Cook's distance > 0.25):
logit(P(LiverPatient = 1)) = -1.902754 + 0.573104 * DB + 0.015850 * Alamine + 0.020418 * Age + 0.003744 * Alkphos
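To see how much the removed observation influenced the fit, a minimal sketch comparing the coefficient estimates of the two models fitted above:
## Coefficients with vs. without the influential observation
round(cbind(with.obs = coef(e2.step.model.AIC),
            without.obs = coef(e2.glm.male.AIC.WO.Cook)), 6)
The Alkphos coefficient roughly doubles after the removal, consistent with Alkphos becoming clearly significant in the refitted model.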
Step 5)
## Hosmer Lemeshow test
hoslem.test(e2.glm.male.AIC.WO.Cook$y, fitted(e2.glm.male.AIC.WO.Cook), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: e2.glm.male.AIC.WO.Cook$y, fitted(e2.glm.male.AIC.WO.Cook)
## X-squared = 7.6642, df = 8, p-value = 0.4669
Since the p-value (0.4669) is greater than alpha = 0.1, the refitted model also shows no evidence of lack of fit.
Step 6)
We will compute both deviance and Pearson residuals (raw and standardized):
e2.res.deviance.wo.cook <-residuals(e2.glm.male.AIC.WO.Cook, type = "deviance")
e2.res.pearson.wo.cook <-residuals(e2.glm.male.AIC.WO.Cook, type = "pearson")
e2.std.res.deviance.wo.cook <-residuals(e2.glm.male.AIC.WO.Cook, type = "deviance")/sqrt(1 - hatvalues(e2.glm.male.AIC.WO.Cook)) # standardized deviance residuals
e2.std.res.pearson.wo.cook <-residuals(e2.glm.male.AIC.WO.Cook, type = "pearson")/sqrt(1 - hatvalues(e2.glm.male.AIC.WO.Cook)) # standardized pearson residuals
Show the plots:
dev.new(width = 1000, height = 1000)
par(mfrow=c(1,2))
plot(e2.std.res.deviance.wo.cook[e2.glm.male.AIC.WO.Cook$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(e2.std.res.deviance.wo.cook[e2.glm.male.AIC.WO.Cook$model$LiverPatient==1], col = "blue")
plot(e2.std.res.pearson.wo.cook[e2.glm.male.AIC.WO.Cook$model$LiverPatient==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(e2.std.res.pearson.wo.cook[e2.glm.male.AIC.WO.Cook$model$LiverPatient==1], col = "blue")
Step 7)
As stated, we use a cut-off of 0.25 for Cook's distance.
# visual inspection
dev.new(width = 1000, height = 1000)
plot(e2.glm.male.AIC.WO.Cook, which = 4, id.n = 5)
# which observation has cook's d larger than 0.25 ?
e2.inf.id=which(cooks.distance(e2.glm.male.AIC.WO.Cook) > 0.25)
e2.inf.id
## named integer(0)
Conclusion:
After refitting, no observations have Cook's distance greater than 0.25, so no further influential observations remain.
Interpret relationships between predictors in the final model and the odds of an adult male being a liver patient (based on the estimated odds ratios).
Step 1)
## odds ratios
e2.OR <- exp(e2.glm.male.AIC.WO.Cook$coefficients)
round(e2.OR, 3)
## (Intercept) DB Alamine Age Alkphos
## 0.149 1.774 1.016 1.021 1.004
From the previous part, our final model is:
logit(P(LiverPatient = 1)) = -1.902754 + 0.573104 * DB + 0.015850 * Alamine + 0.020418 * Age + 0.003744 * Alkphos
Since all predictors are continuous, each odds ratio represents a multiplicative change in the odds.
OR(DB) = exp(0.573104) ≈ 1.774 > 1
Thus there is a positive relationship: a one-unit increase in DB multiplies the odds of being a liver patient by about 1.774, holding the other predictors fixed.
OR(Alamine) = exp(0.015850) ≈ 1.016 (very close to 1)
Thus the Alamine effect is very small.
OR(Age) = exp(0.020418) ≈ 1.021 (very close to 1)
Thus the Age effect is very small.
OR(Alkphos) = exp(0.003744) ≈ 1.004 (very close to 1)
Thus the Alkphos effect is very small.
In other words, for adult males as well, higher DB is associated with substantially higher odds of being a liver patient.
Comment on how the models for adult females and adult males differ (use significance level alpha = 0.1).
This is the model for adult females:
logit(P(LiverPatient = 1)) = -0.32480 + 0.94479 * DB + 0.01106 * Aspartate
This is the model for adult males (after removing the influential observation):
logit(P(LiverPatient = 1)) = -1.902754 + 0.573104 * DB + 0.015850 * Alamine + 0.020418 * Age + 0.003744 * Alkphos
Conclusion:
The male model selects more predictors (Alamine, Age, Alkphos in addition to DB) than the female model (DB and Aspartate), but in both models DB is the only variable whose odds ratio is far from 1; the other selected variables (Aspartate, Alamine, Age, Alkphos) have odds ratios of approximately 1, so their effects on the odds are negligible. In both sexes, higher DB is the main driver of higher odds of being a liver patient.
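A minimal sketch placing the two final models side by side via their estimated odds ratios (the female model from exercise 1 and the refitted male model from above):
## Odds ratios from the two final models
round(exp(coef(e1.step.model.AIC)), 3)         # female model: DB, Aspartate
round(exp(coef(e2.glm.male.AIC.WO.Cook)), 3)   # male model: DB, Alamine, Age, Alkphos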