lm.birth <- lm(Weight ~ Black + Married + Boy + MomSmoke + Ed + MomAge + MomWtGain + Visit, data = birthweight)
summary(lm.birth)
##
## Call:
## lm(formula = Weight ~ Black + Married + Boy + MomSmoke + Ed +
## MomAge + MomWtGain + Visit, data = birthweight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2433.18 -312.36 14.72 323.08 1562.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3249.006 120.298 27.008 < 2e-16 ***
## Black1 -189.710 83.770 -2.265 0.0241 *
## Married1 63.281 69.819 0.906 0.3653
## Boy1 118.816 55.077 2.157 0.0316 *
## MomSmoke1 -198.047 79.065 -2.505 0.0127 *
## Ed1 71.241 56.300 1.265 0.2065
## MomAge 3.048 5.305 0.574 0.5660
## MomWtGain 12.136 2.117 5.733 1.98e-08 ***
## Visit 13.626 37.691 0.362 0.7179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 547.1 on 391 degrees of freedom
## Multiple R-squared: 0.1456, Adjusted R-squared: 0.1281
## F-statistic: 8.328 on 8 and 391 DF, p-value: 1.901e-10
model.stepwise <- ols_step_both_p(lm.birth, pent = 0.01, prem = 0.01, details = F)
model.stepwise
##
## Stepwise Selection Summary
## -----------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## -----------------------------------------------------------------------------------------
## 1 MomWtGain addition 0.089 0.087 20.7600 6201.2327 559.8163
## 2 MomSmoke addition 0.107 0.102 14.6830 6195.4042 555.0626
## 3 Black addition 0.127 0.120 7.4660 6188.2794 549.4600
## -----------------------------------------------------------------------------------------
model.forward<-ols_step_forward_p(lm.birth, penter = 0.01, details = F)
model.forward
##
## Selection Summary
## -----------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## -----------------------------------------------------------------------------
## 1 MomWtGain 0.0893 0.0870 20.7604 6201.2327 559.8163
## 2 MomSmoke 0.1069 0.1024 14.6832 6195.4042 555.0626
## 3 Black 0.1271 0.1205 7.4659 6188.2794 549.4600
## -----------------------------------------------------------------------------
model.backward<-ols_step_backward_p(lm.birth, prem = 0.01, details = F)
model.backward
##
##
## Elimination Summary
## ---------------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## ---------------------------------------------------------------------------
## 1 Visit 0.1453 0.130 7.1307 6187.8448 546.4642
## 2 MomAge 0.1445 0.1314 5.5158 6186.2385 546.0372
## 3 Married 0.1409 0.130 5.1599 6185.9146 546.4876
## 4 Ed 0.1364 0.1277 5.1841 6185.9688 547.1986
## 5 Boy 0.1271 0.1205 7.4659 6188.2794 549.4600
## ---------------------------------------------------------------------------
model.best.subset<-ols_step_best_subset(lm.birth)
model.best.subset
Conclusion: From the various methods of model selection executed above, we can see that the stepwise, forward and backward models all yield the same predictors (MomWtGain, MomSmoke and Black) for our final model. Using the best subset approach (Adjusted R-squared), we get Black, Married, Boy MomSmoke, Ed, and MomWtGain as predictors for our final model (Model 6) This is reasonable because it has the highest Adjusted R-Square out of the eight models (0.1314)
lm.step <- lm(Weight ~ MomWtGain + MomSmoke + Black, data=birthweight)
par(mfrow=c(2,2))
plot(lm.step, which=c(1:4))
Conclusion:
Normality Check: Looking at the Normal QQ Plot, we can see that most of the points fall along the line for the majority of the graph with some points that curve off the line (though they don’t seem to be serious) Moreover, looking at the sqrt(Standardized Residual) Plot, we can see that there are some data points that fall above 2. With that being said, we can safely conclue that the assumption of normality is not reasonable.
Equal Variance Check: Looking at the residual plot, we can see a pattern in the plot. Hence, this supports heteroscedasticity.
Influential Points: Looking at the observations of Cook’s Distance, we can see that there’s one observation that exceeds the 0.115 threshold: Observation 129. To confirm its status, we can check below to see if we need to refit the model.
influential.id <- which(cooks.distance(lm.step) > 0.115)
birthweight[influential.id, ]
lm.step2 <- lm(Weight ~ MomWtGain + MomSmoke + Black, data=birthweight[-influential.id, ])
summary(lm.step2)
##
## Call:
## lm(formula = Weight ~ MomWtGain + MomSmoke + Black, data = birthweight[-influential.id,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2427.02 -309.20 2.98 315.40 1472.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3434.252 32.078 107.059 < 2e-16 ***
## MomWtGain 13.112 2.113 6.204 1.39e-09 ***
## MomSmoke1 -238.923 76.251 -3.133 0.00186 **
## Black1 -198.519 78.022 -2.544 0.01133 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 542.2 on 395 degrees of freedom
## Multiple R-squared: 0.1366, Adjusted R-squared: 0.1301
## F-statistic: 20.84 on 3 and 395 DF, p-value: 1.493e-12
Conclusion: From the summary table above, we can see that 13.66% of the variation of Weight can be explained by the final model, which is fairly low. Hence, this model has a fairly low prediction power for the variation of Weight.
Model Significance: Looking at the F-statistic, our model p-value is 1.493e-12, which falls below the significance level of 0.05. With that being said, we can reject the null hypothesis and conclude that the model is useful to explain the behavior of Weight
Individual Term Significance: Looking at the T-test for each predictor, all the p-values fall below 0.05. With that being said, we can conclude that there’s a significant relationship between Weight and al three predictors: MomWtGain, MomSmoke, and Black
Estimated Regression Line: Y = 3434.252 + 13.112(MomWtGain) - 238.923(MomSmoke) - 198.519(Black) + E. This means that on average, a one unit increase in MomWtGain is associated with a 13.112 increase in MomWtGain, while MomSmoke and Black are being held constant. A one unit increase
R-squared: 13.66% of Weight variation can be explaiend by the final model. Hence, the final model has a fairly low predictive power for Weight.
model.null <- glm(Weight_Gr ~ 1, data = birthweight, family = "binomial")
model.full <- glm(Weight_Gr ~ Black + Married + Boy + MomSmoke + Ed + MomAge + MomWtGain + Visit, data = birthweight, family = "binomial")
step.model.AIC<-step(model.null, scope = list(upper=model.full),
direction="both",test="Chisq", trace = F)
summary(step.model.AIC)
##
## Call:
## glm(formula = Weight_Gr ~ MomWtGain + MomSmoke + MomAge + Boy +
## Ed, family = "binomial", data = birthweight)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9790 -1.0470 -0.6085 1.0966 2.0012
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.240486 0.188075 1.279 0.20101
## MomWtGain -0.038047 0.008471 -4.492 7.07e-06 ***
## MomSmoke1 0.818590 0.310227 2.639 0.00832 **
## MomAge -0.044444 0.019040 -2.334 0.01959 *
## Boy1 -0.407560 0.212600 -1.917 0.05523 .
## Ed1 -0.366259 0.217910 -1.681 0.09280 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 554.43 on 399 degrees of freedom
## Residual deviance: 510.15 on 394 degrees of freedom
## AIC: 522.15
##
## Number of Fisher Scoring iterations: 4
step.model.BIC<-step(model.full,
direction="both",test="Chisq", trace = F, k=log(nrow(birthweight)))
summary(step.model.BIC)
##
## Call:
## glm(formula = Weight_Gr ~ MomSmoke + MomAge + MomWtGain, family = "binomial",
## data = birthweight)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.016 -1.073 -0.669 1.103 2.000
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.132541 0.112817 -1.175 0.24006
## MomSmoke1 0.865786 0.309944 2.793 0.00522 **
## MomAge -0.048266 0.018730 -2.577 0.00997 **
## MomWtGain -0.036819 0.008389 -4.389 1.14e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 554.43 on 399 degrees of freedom
## Residual deviance: 516.39 on 396 degrees of freedom
## AIC: 524.39
##
## Number of Fisher Scoring iterations: 4
Conclusion: From the Stepwise method of model selection with AIC and BIC criteria executed above, we can see that the both AIC and BIC based models all yielded the same predictors: MomWtGain, MomSmoke, and MomAge, as these predictors have p-values lower than significance level of 0.05.
glm.model <- glm(Weight_Gr ~ MomWtGain + MomSmoke + MomAge, data = birthweight, family = binomial)
summary(glm.model)
##
## Call:
## glm(formula = Weight_Gr ~ MomWtGain + MomSmoke + MomAge, family = binomial,
## data = birthweight)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.016 -1.073 -0.669 1.103 2.000
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.132541 0.112817 -1.175 0.24006
## MomWtGain -0.036819 0.008389 -4.389 1.14e-05 ***
## MomSmoke1 0.865786 0.309944 2.793 0.00522 **
## MomAge -0.048266 0.018730 -2.577 0.00997 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 554.43 on 399 degrees of freedom
## Residual deviance: 516.39 on 396 degrees of freedom
## AIC: 524.39
##
## Number of Fisher Scoring iterations: 4
(inf.id=which(cooks.distance(glm.model)>0.1))
## named integer(0)
Conclusion: Looking at the p-value for each individual predictor, we can see that they all fall under the threshold of 0.05. This means that they’re significant enough to be used in the model. Also, there’s no observation with Cook’s Distance higher than 0.1
##EXERCISE 2.3 #### Final Model Log(p/1-p) = -0.1325 - 0.0368 * MomWtGain + 0.8658 * MomSmoke - 0.0483 * MomAge
round(exp(glm.model$coefficients),3)
## (Intercept) MomWtGain MomSmoke1 MomAge
## 0.876 0.964 2.377 0.953
Conclusion:
The odds of Weight_Gr change by a factor of exp(-0.0368) = 0.964 with a one unit increase in MomWtGain as MomAge and MomSmoke are being held constant
The odds of Weight_Gr change by a factor of exp(0.8658) = 2.377 with a one unit increase in MomSmoke as MomWtGain and MomAge are being held constant
The odds of Weight_Gr change by a factor of exp(-0.0483) = 0.953 with a one unit increase in MomAge as MomWtGain and MomSmoke are being held constant
Simply put, younger women with higher smoking levels, and lower weight are more likely to deliver a low birthweight infant
fit.prob <- predict(glm.model, type = "response")
sample.prop <- mean(birthweight$Weight_Gr)
pred.class.2 <- ifelse(fit.prob > sample.prop, 1, 0)
head(pred.class.2, 10)
## 1 2 3 4 5 6 7 8 9 10
## 0 1 1 0 0 1 0 1 0 1
Conclusion: The sample proportion in the dataset is 5
mean(birthweight$Weight_Gr != pred.class.2)
## [1] 0.355
Conclusion: The misclassification rate is 35.5%. Although the misclassification rate is fairly high, since the cutoff point that we’re using is lower than 0.5, there will be more false positive cases than false negative ones, which is acceptable.
####Goodness-of-Fit Test (Hosmer-Lemeshow)
hoslem.test(glm.model$y, fitted(glm.model), g=10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: glm.model$y, fitted(glm.model)
## X-squared = 9.2068, df = 8, p-value = 0.3252
Conclusion: From the Hosmer-Lemeshow (Goodness-of-Fit) test result, we can see it yielded a p-value of 0.1722. Since this is above the significance level of 0.05, we can retain our null hypothesis and conclude that this model is in fact adequate.
Exercise 1: Using MomWtGain, Black and MomSmoke as the predictors for our final model, only 13.66% of Weight variation can be explained by this model. With that being said, the model from Ex. 1 has a fairly low predictive power. Simply put, this is not a good model for the prediction model of Weight.
Exercise 2: Using MomWtGain, MomAge and MomSmoke as the predictors for our final model, we see that 35.5% of Weight variation can be explained by this model. With that being said, the model from Ex. 2 has a better predictive power compared to the model from Exercise 1. Simply put, this is a better model for the prediction model of Weight.
Conclusion: When comparing these two models, we see that predictors MomWtGain and MomSmoke are present in both. It’s safe to say that these two predictors are significant when predicting the variation of Weight. Thus, when implementing a low-birthweight prevention program, it’s best for women to avoid smoking and maintain a healthy weight during pregnancy.