You need to download dataset birthweight_final.csv. The data record live, singleton births to mothers between the ages of 18 and 45 in the United States who were classified as black or white. There are total of 400 observations in birthweight, and variables are:
Weight: Infant birth weight (gram)
Weight_Gr; Categorical variable for indication of low birthweight; 0 is normal, 1 is low birthweight
Black: Categorical variable; 0 is white, 1 is black
Married: Categorical variable; 0 is not married, 1 is married
Boy: Categorical variable; 0 is girl, 1 is boy
MomSmoke: Categorical variable; 0 is non-smoking mom, 1 is smoking mom
Ed: Categorical variable for Mother’s education Level; 0 is high-school grad or less; 1 is college grad or above
MomAge: Mother’s age (centered to zero)
MomWtGain: Mother’s weight gain during pregnancy (centered to zero)
Visit: number of prenatal visits
Consider to fit a multiple linear regression to model Weight using possible explanatory variables; Black, Married, Boy, MomSmoke, Ed, MomAge, MomWtGain, and Visit (all predictors excluding Weight_Gr).
Perform the following four model selection methods and compare their best models. Comment on how they differ or similar in terms of selected variables in the final model. No need to interpret outputs.
NOTE: R output from Backward selection displays variables “removed” from each step
Multiple Linear Regression
lm.birth <- lm(Weight ~ Black + Married + Boy + MomSmoke + Ed + MomAge + MomWtGain + Visit, data = bweight)
summary(lm.birth)
##
## Call:
## lm(formula = Weight ~ Black + Married + Boy + MomSmoke + Ed +
## MomAge + MomWtGain + Visit, data = bweight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2433.18 -312.36 14.72 323.08 1562.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3249.006 120.298 27.008 < 2e-16 ***
## Black1 -189.710 83.770 -2.265 0.0241 *
## Married1 63.281 69.819 0.906 0.3653
## Boy1 118.816 55.077 2.157 0.0316 *
## MomSmoke1 -198.047 79.065 -2.505 0.0127 *
## Ed1 71.241 56.300 1.265 0.2065
## MomAge 3.048 5.305 0.574 0.5660
## MomWtGain 12.136 2.117 5.733 1.98e-08 ***
## Visit 13.626 37.691 0.362 0.7179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 547.1 on 391 degrees of freedom
## Multiple R-squared: 0.1456, Adjusted R-squared: 0.1281
## F-statistic: 8.328 on 8 and 391 DF, p-value: 1.901e-10
Stepwise Selection
stepwise = ols_step_both_p(lm.birth, pent = 0.01, prem = 0.01, details = F)
stepwise
##
## Stepwise Selection Summary
## -----------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## -----------------------------------------------------------------------------------------
## 1 MomWtGain addition 0.089 0.087 20.7600 6201.2327 559.8163
## 2 MomSmoke addition 0.107 0.102 14.6830 6195.4042 555.0626
## 3 Black addition 0.127 0.120 7.4660 6188.2794 549.4600
## -----------------------------------------------------------------------------------------
Forward Selection
forward = olsrr::ols_step_forward_p(lm.birth, penter = 0.01, details = F)
forward
##
## Selection Summary
## -----------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## -----------------------------------------------------------------------------
## 1 MomWtGain 0.0893 0.0870 20.7604 6201.2327 559.8163
## 2 MomSmoke 0.1069 0.1024 14.6832 6195.4042 555.0626
## 3 Black 0.1271 0.1205 7.4659 6188.2794 549.4600
## -----------------------------------------------------------------------------
Backward Selection
backward = olsrr::ols_step_backward_p(lm.birth, prem = 0.01, details = F)
backward
##
##
## Elimination Summary
## ---------------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## ---------------------------------------------------------------------------
## 1 Visit 0.1453 0.130 7.1307 6187.8448 546.4642
## 2 MomAge 0.1445 0.1314 5.5158 6186.2385 546.0372
## 3 Married 0.1409 0.130 5.1599 6185.9146 546.4876
## 4 Ed 0.1364 0.1277 5.1841 6185.9688 547.1986
## 5 Boy 0.1271 0.1205 7.4659 6188.2794 549.4600
## ---------------------------------------------------------------------------
Adjusted R-square
best = ols_step_best_subset(lm.birth)
best
## Best Subsets Regression
## -------------------------------------------------------------------
## Model Index Predictors
## -------------------------------------------------------------------
## 1 MomWtGain
## 2 MomSmoke MomWtGain
## 3 Black MomSmoke MomWtGain
## 4 Black Boy MomSmoke MomWtGain
## 5 Black Boy MomSmoke Ed MomWtGain
## 6 Black Married Boy MomSmoke Ed MomWtGain
## 7 Black Married Boy MomSmoke Ed MomAge MomWtGain
## 8 Black Married Boy MomSmoke Ed MomAge MomWtGain Visit
## -------------------------------------------------------------------
##
## Subsets Regression Summary
## --------------------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## --------------------------------------------------------------------------------------------------------------------------------------------------
## 1 0.0893 0.0870 0.0775 20.7604 6201.2327 5065.9178 6213.2071 125357705.0611 314961.2141 789.4062 0.9199
## 2 0.1069 0.1024 0.091 14.6832 6195.4042 5060.1250 6211.3700 123238567.9566 310405.1540 778.0163 0.9066
## 3 0.1271 0.1205 0.1059 7.4659 6188.2794 5053.1393 6208.2368 120764046.4868 304925.3190 764.3196 0.8906
## 4 0.1364 0.1277 0.1114 5.1841 6185.9688 5050.9395 6209.9176 119772826.2237 303169.1473 759.9653 0.8854
## 5 0.1409 0.1300 0.1116 5.1599 6185.9146 5050.9720 6213.8549 119462530.2702 303128.3995 759.9203 0.8853
## 6 0.1445 0.1314 0.1103 5.5158 6186.2385 5051.3901 6218.1702 119266462.8033 303374.3231 760.6035 0.8860
## 7 0.1453 0.1300 0.1052 7.1307 6187.8448 5053.0558 6223.7680 119453866.2656 304595.5958 763.7420 0.8896
## 8 0.1456 0.1281 0.1014 9.0000 6189.7111 5054.9736 6229.6258 119720142.0535 306020.7931 767.4022 0.8938
## --------------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
Conclusion: We can see from the various methods of model selection used above that the stepwise, forward, and backward models all produce the same predictors (MomWtGain, MomSmoke, and Black) for our final model (MomWtGain, MomSmoke, and Black). We receive Black, Married, Boy MomSmoke, Ed, and MomWtGain as predictors for our final model using the best subset strategy is model 6, because it has the greatest Adjusted R-Square(0.1314) among the eight models.
Fit the linear regression with the best model determined by stepwise selection and comment on diagnostics plot. Do not leave observation which has Cook’s distance larger than 0.115. Re-fit the model if necessary. Finally how many observations you use in the final model?
Linear model for stepwise selection
lm.step <- lm(Weight ~ MomWtGain + MomSmoke + Black, data=bweight)
par(mfrow=c(2,2))
plot(lm.step, which=c(1:4))
Model Significance: By looking at the Normal QQ Plot, we can see that for the most part, the points fall along the line, with a few points that curve off the line. Also, we can see from the Squareroot(Standardized Residual) plot that certain data points are greater than 2. With that in mind, we may fairly conclude that the assumption of normality is not reasonable.
Equal Variance Check: There is a pattern in the residual plot and from this we can say that it follows heteroscedasticity
Influential Points: We can see that in the Cook’s distance plot that 129 larger than 0.115.
Cook’s Difference
inf.id <- which(cooks.distance(lm.step) > 0.115)
bweight[inf.id, ]
## Weight Weight_Gr Black Married Boy MomSmoke Ed MomAge MomWtGain Visit
## 129 1804 1 1 1 0 0 1 9 35 3
Refitted Model
lm.refitted.step = lm(Weight ~ MomWtGain + MomSmoke + Black, data=bweight[-inf.id, ])
How much of the variation in Weight is explained by the final model?
summary(lm.refitted.step)
##
## Call:
## lm(formula = Weight ~ MomWtGain + MomSmoke + Black, data = bweight[-inf.id,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2427.02 -309.20 2.98 315.40 1472.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3434.252 32.078 107.059 < 2e-16 ***
## MomWtGain 13.112 2.113 6.204 1.39e-09 ***
## MomSmoke1 -238.923 76.251 -3.133 0.00186 **
## Black1 -198.519 78.022 -2.544 0.01133 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 542.2 on 395 degrees of freedom
## Multiple R-squared: 0.1366, Adjusted R-squared: 0.1301
## F-statistic: 20.84 on 3 and 395 DF, p-value: 1.493e-12
Conclusion: According to the summary table above, the final model can explain 13.66% of the variation in Weight, which is a low percentage. As a result, this model has a low prediction power for variation of weight.
Interpret the relationship between predictor variables (in the final model) and Weight value specifically.
Model Significance: Our model p-value is 1.493e-12, which is below than the significance level of 0.05. As a result, we can reject null hypothesis and conclude that the model is beneficial in explaining behavior of Weight.
Individual Term Significance: All of the p-values in the T-test for each predictor are less than 0.05. Hence, we can conclude that there’s a significant relationship between Weight and all three predictors: MomWtGain, MomSmoke, and Black.
Estimated Regression Line: On average, a one unit increase in MomWtGain is associated with a 13.112 increase in MomWtGain, while MomSmoke and Black are being held constant when there is a one unit increase.
R-squared: The final model can explain 13.66 % of variation of weight . As a result, the final model’s predictive power for Weight low.
Now we consider fitting a logistic regression for low birthweight (Weight_Gr=1). Again consider Black, Married, Boy, MomSmoke, Ed, MomAge, MomWtGain, and Visit as possible explanatory variables.
Perform following model selection methods and compare their best models. Comment how they differ or similar in terms of selected variables - Stepwise selection with AIC criteria - Stepwise selection with BIC criteria
NOTE: stepwise selection with BIC criteria can be performed by step() function by adding an option k=log(n), where n is a sample size. Check Week 15 respiratory data example - how to use this option.
Fit Logistic Regression Model for low birthweight
model.null <- glm(Weight_Gr ~ 1, data = bweight, family = "binomial")
model.full <- glm(Weight_Gr ~ Black + Married + Boy + MomSmoke + Ed +
MomAge + MomWtGain + Visit, data = bweight, family = "binomial")
AIC Stepwise Selection
step.AIC<-step(model.null, scope = list(upper=model.full),
direction="both",test="Chisq", trace = F)
summary(step.AIC)
##
## Call:
## glm(formula = Weight_Gr ~ MomWtGain + MomSmoke + MomAge + Boy +
## Ed, family = "binomial", data = bweight)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9790 -1.0470 -0.6085 1.0966 2.0012
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.240486 0.188075 1.279 0.20101
## MomWtGain -0.038047 0.008471 -4.492 7.07e-06 ***
## MomSmoke1 0.818590 0.310227 2.639 0.00832 **
## MomAge -0.044444 0.019040 -2.334 0.01959 *
## Boy1 -0.407560 0.212600 -1.917 0.05523 .
## Ed1 -0.366259 0.217910 -1.681 0.09280 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 554.43 on 399 degrees of freedom
## Residual deviance: 510.15 on 394 degrees of freedom
## AIC: 522.15
##
## Number of Fisher Scoring iterations: 4
BIC Stepwise Selection
step.BIC<-step(model.full,direction="both",test="Chisq",
trace = F, k=log(nrow(bweight)))
summary(step.BIC)
##
## Call:
## glm(formula = Weight_Gr ~ MomSmoke + MomAge + MomWtGain, family = "binomial",
## data = bweight)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.016 -1.073 -0.669 1.103 2.000
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.132541 0.112817 -1.175 0.24006
## MomSmoke1 0.865786 0.309944 2.793 0.00522 **
## MomAge -0.048266 0.018730 -2.577 0.00997 **
## MomWtGain -0.036819 0.008389 -4.389 1.14e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 554.43 on 399 degrees of freedom
## Residual deviance: 516.39 on 396 degrees of freedom
## AIC: 524.39
##
## Number of Fisher Scoring iterations: 4
Conclusion: We can see that both AIC and BIC based models revealed the same predictors: MomWtGain, MomSmoke, and MomAge, as these predictors have p-values less than the significance level of 0.05, as shown by the Stepwise technique of model selection with AIC and BIC criteria used above.
Fit the logistic regression with the best model determined by stepwise selection with BIC criteria. Do not leave observation which has cook’s d larger than 0.1. Re-fit the model if necessary. Finally how many observations you use in the final model?
glm.model <- glm(Weight_Gr ~ MomWtGain + MomSmoke + MomAge, data = bweight, family = binomial)
summary(glm.model)
##
## Call:
## glm(formula = Weight_Gr ~ MomWtGain + MomSmoke + MomAge, family = binomial,
## data = bweight)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.016 -1.073 -0.669 1.103 2.000
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.132541 0.112817 -1.175 0.24006
## MomWtGain -0.036819 0.008389 -4.389 1.14e-05 ***
## MomSmoke1 0.865786 0.309944 2.793 0.00522 **
## MomAge -0.048266 0.018730 -2.577 0.00997 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 554.43 on 399 degrees of freedom
## Residual deviance: 516.39 on 396 degrees of freedom
## AIC: 524.39
##
## Number of Fisher Scoring iterations: 4
(inf.id=which(cooks.distance(glm.model)>0.1))
## named integer(0)
Conclusion: When we look at the p-values for each predictor individually, we can see that they all fall below 0.05. This indicates that they are important enough to include in the model. There has also been no observation with a Cook’s Distance greater than 0.1.
Based on your final model, interpret the explicit relationship between response and predictors using Odds Ratio.
round(exp(glm.model$coefficients), 3)
## (Intercept) MomWtGain MomSmoke1 MomAge
## 0.876 0.964 2.377 0.953
Conclusion:
The odds of Weight_Gr change by a factor of 0.964 with a one unit increase in MomWtGain as MomAge and MomSmoke are being held constant
The odds of Weight_Gr change by a factor of 2.377 with a one unit increase in MomSmoke as MomWtGain and MomAge are being held constant
The odds of Weight_Gr change by a factor of 0.953 with a one unit increase in MomAge as MomWtGain and MomSmoke are being held constant
Which woman has the high chance to deliver a low birthweight infant? For example, answer will be like “a married, high-educated, and older woman has the high chance to deliver a lower birthweight infant.”
Conclusion: Younger women with higher smoking levels, and lower weightgain are more likely to deliver a low birthweight infant
What is the sample proportion of low birthweight infant in dataset?
sample.prop <- mean(bweight$Weight_Gr)
sample.prop
## [1] 0.4925
fit.prob <- predict(glm.model, type = "response")
pred.class.2 <- ifelse(fit.prob > sample.prop, 1, 0)
Conclusion: The sample proportion of low birthweight infant in dataset is 49.25%
Perform classification with probability cut-off set as sample proportion you answer in (5). What is misclassification rate?
mean(bweight$Weight_Gr != pred.class.2)
## [1] 0.355
Conclusion: The misclassification rate is 35.5%
Comment on Goodness of fit test and make a conclusion
Hosmer-Lemeshow Test
hoslem.test(glm.model$y, fitted(glm.model), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: glm.model$y, fitted(glm.model)
## X-squared = 9.2068, df = 8, p-value = 0.3252
Conclusion: From the Hosmer Lemeshow’s test performed above we can see that the p-value(0.3252) is above the significant value of 0.1, and we can say that there is enough evidence to accept the null hypothesis reject the alternative hypothesis and say the model is adequate.
Compare results from Exercise 1-2 and comment on different or similar conclusions from each analysis.
Conclusion:
Exercise 1: Only 13.66% of Weight variance can be explained by this model when MomWtGain, Black, and MomSmoke are used as predictors. Hence, this is a poor model for the Weight prediction.
Exercise 2: We see that 35.5% of weight variance can be explained by this model when MomWtGain, MomAge, and MomSmoke are used as predicators. Hence, this is a better model for the prediction of Weight.
Low birthweight is a risk factor that can lead infant mortality. If you want to implement a low-birthweight prevention program, what would you suggest to pregnant women?
Conclusion: When we compare these two models, we can see that MomWtGain and MomSmoke are both present. It’s safe to say that these two predictors are important when it comes to forecasting Weight fluctuation. As a result, it’s recommended for women to avoid smoking and maintain a healthy weight during pregnancy while executing a low-birthweight prevention program.