Use significance levels of .05 unless the instructions state otherwise. Data Sets: You need to download dataset birthweight_final.csv. The data record live, singleton births to mothers between the ages of 18 and 45 in the United States who were classified as black or white. There are total of 400 observations in birthweight, and variables are: - Weight: Infant birth weight (gram) - Weight_Gr; Categorical variable for indication of low birthweight; 0 is normal, 1 is low birthweight - Black: Categorical variable; 0 is white, 1 is black - Married: Categorical variable; 0 is not married, 1 is married - Boy: Categorical variable; 0 is girl, 1 is boy - MomSmoke: Categorical variable; 0 is non-smoking mom, 1 is smoking mom - Ed: Categorical variable for Mother’s education Level; 0 is high-school grad or less; 1 is college grad or above - MomAge: Mother’s age (centered to zero) - MomWtGain: Mother’s weight gain during pregnancy (centered to zero) - Visit: number of prenatal visits
# Read the CSV file
d1 <- read.csv("birthweight_final.csv")
d1$Black = as.factor(d1$Black)
d1$Married = as.factor(d1$Married)
d1$Boy = as.factor(d1$Boy)
d1$MomSmoke = as.factor(d1$MomSmoke)
d1$Ed = as.factor(d1$Ed)
# Check data format
#str(d1)
Now we consider fitting a logistic regression for low birthweight (Weight_Gr=1). Again consider Black, Married, Boy, MomSmoke, Ed, MomAge, MomWtGain, and Visit as possible explanatory variables.
(1) Perform following model selection methods and compare their best models. Comment how they differ or similar in terms of selected variables - Stepwise selection with AIC criteria
- Stepwise selection with BIC criteria
NOTE: stepwise selection with BIC criteria can be performed by step() function by adding an option k=log(n), where n is a sample size. Check Week 15 respiratory data example - how to use this option.
e2.glm.full <- glm(Weight_Gr ~ Black + Married + Boy + MomSmoke + Ed + MomAge + MomWtGain + Visit , data=d1, family = "binomial")
e2.glm.null <- glm(Weight_Gr ~ 1 , data=d1, family = "binomial")
e2.stepwise.AIC <- step(e2.glm.null,scope = list(upper=e2.glm.full), direction="both",test="Chisq", trace = F)
summary(e2.stepwise.AIC)
##
## Call:
## glm(formula = Weight_Gr ~ MomWtGain + MomSmoke + MomAge + Boy +
## Ed, family = "binomial", data = d1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9790 -1.0470 -0.6085 1.0966 2.0012
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.240486 0.188075 1.279 0.20101
## MomWtGain -0.038047 0.008471 -4.492 7.07e-06 ***
## MomSmoke1 0.818590 0.310227 2.639 0.00832 **
## MomAge -0.044444 0.019040 -2.334 0.01959 *
## Boy1 -0.407560 0.212600 -1.917 0.05523 .
## Ed1 -0.366259 0.217910 -1.681 0.09280 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 554.43 on 399 degrees of freedom
## Residual deviance: 510.15 on 394 degrees of freedom
## AIC: 522.15
##
## Number of Fisher Scoring iterations: 4
The final model, based on stepwise with AIC criteria is:
Log(Weight_Gr) = 0.240486 + 0.038047 * MomWtGain + 0.818590 * MomSmoke - 0.0444444 * MomAge - 0.407560 * Boy - 0.366259 * Ed
With alpha level=0.05, thus except Boy predictor, other predictors are significant, although we know that out final model with AIC may include the insignificant variables.
e2.stepwise.BIC <- step(e2.glm.null,scope = list(upper=e2.glm.full), direction="both",test="Chisq", trace = F,k=log(nrow(d1)))
summary(e2.stepwise.BIC)
##
## Call:
## glm(formula = Weight_Gr ~ MomWtGain + MomSmoke + MomAge, family = "binomial",
## data = d1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.016 -1.073 -0.669 1.103 2.000
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.132541 0.112817 -1.175 0.24006
## MomWtGain -0.036819 0.008389 -4.389 1.14e-05 ***
## MomSmoke1 0.865786 0.309944 2.793 0.00522 **
## MomAge -0.048266 0.018730 -2.577 0.00997 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 554.43 on 399 degrees of freedom
## Residual deviance: 516.39 on 396 degrees of freedom
## AIC: 524.39
##
## Number of Fisher Scoring iterations: 4
The final model, based on stepwise with BIC criteria is:
Log(Weight_Gr) = -0.132541 - 0.036819 * MomWtGain + 0.865786 * MomSmoke - 0.048266 * MomAge
With alpha level=0.05, all predictors are significant, although we know that out final model with AIC may include the insignificant variables.
Answer following questions from the best model determined by stepwise selection with BIC criteria
(2) Fit the logistic regression with the best model determined by stepwise selection with BIC criteria. Do not leave observation which has cook’s d larger than 0.1. Re-fit the model if necessary. Finally how many observations you use in the final model?
We will use both Deviance and Person methods:
e2.res.deviance <-residuals(e2.stepwise.BIC, type = "deviance")
e2.res.pearson <-residuals(e2.stepwise.BIC, type = "pearson")
e2.std.res.deviance <-residuals(e2.stepwise.BIC, type = "deviance")/sqrt(1 - hatvalues(e2.stepwise.BIC)) # standardized deviance residuals
e2.std.res.pearson <-residuals(e2.stepwise.BIC, type = "pearson")/sqrt(1 - hatvalues(e2.stepwise.BIC)) # standardized pearson residuals
Show Plots:
dev.new(width = 1000, height = 1000)
par(mfrow=c(1,2))
plot(e2.res.deviance[e2.stepwise.BIC$model$Weight_Gr==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(e2.res.deviance[e2.stepwise.BIC$model$Weight_Gr==1], col = "blue")
plot(e2.res.pearson[e2.stepwise.BIC$model$Weight_Gr==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(e2.res.pearson[e2.stepwise.BIC$model$Weight_Gr==1], col = "blue")
As an assumption, cut-of = 0.1
# visual inspection
dev.new(width = 1000, height = 1000)
plot(e2.stepwise.BIC, which = 4, id.n = 5)
e2.inf.id=which(cooks.distance(e2.stepwise.BIC) > 0.1)
e2.inf.id
## named integer(0)
dim(d1)
## [1] 400 10
(3) Based on your final model, interpret the explicit relationship between response and predictors using Odds Ratio.
## odds ratios
e2.OR <- exp(e2.stepwise.BIC$coefficients)
round(e2.OR, 3)
## (Intercept) MomWtGain MomSmoke1 MomAge
## 0.876 0.964 2.377 0.953
Based on previous part, our final model is:
Log(Weight_Gr) = -0.132541 - 0.036819 * MomWtGain + 0.865786 * MomSmoke - 0.048266 * MomAge
OR(MomWtGain) = exp(-0.036819) < 1
Thus, there exist negative relationship (decrease) and high probability chance for not having Weight_Gr event (low birthweight).
OR(MomSmoke) = exp(0.865786) > 1
Thus, the estimation of OR for MomSmoke is 2.377 this means for Smoking mothers has higher effect on having infant low birthweight.
OR(MomAge) = exp(-0.048266) < 1
Thus, there exist negative relationship (decrease) and high probability chance for not having Weight_Gr event (low birthweight).
(4) Which woman has the high chance to deliver a low birthweight infant? For example, answer will be like “a married, high-educated, and older woman has the high chance to deliver a low birthweight infant.”
Based on previous part, including the final model and OR results:
- Women who Smoke more and have a lower age and have lower Weight Gain have high chance to deliver a low birthweight infants (Happened event).
In Brief:
- More Smoke
- Lower Age
- Lower WeightGain
Will cause probablibity low birthweight infants.
(5) What is the sample proportion of low birthweight infant in dataset?
k=mean(d1$Weight_Gr)
print(k)
## [1] 0.4925
(6) Perform classification with probability cut-off set as sample proportion you answer in (5). What is misclassification rate?
e2.predict = predict(e2.stepwise.BIC, type = "response")
e2.sample.proportion = ifelse(e2.predict > k,1,0)
d1=cbind(d1,e2.predict,e2.sample.proportion)
mean(d1$Weight_Gr != d1$e2.sample.proportion) #misclassification rate from sample proportion threshold
## [1] 0.355
(7) Comment on Goodness of fit test and make a conclusion
## Hosmer Lemeshow test
hoslem.test(e2.stepwise.BIC$y, fitted(e2.stepwise.BIC), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: e2.stepwise.BIC$y, fitted(e2.stepwise.BIC)
## X-squared = 9.2068, df = 8, p-value = 0.3252
Based on P-Value, since alpha level=0.1, and P-Value > 0.05, thus this is our conclusion: