lm.birth <- lm(Weight ~ Black + Married + Boy + MomSmoke + Ed + MomAge + MomWtGain + Visit, data = birthweight)
summary(lm.birth)
##
## Call:
## lm(formula = Weight ~ Black + Married + Boy + MomSmoke + Ed +
## MomAge + MomWtGain + Visit, data = birthweight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2433.18 -312.36 14.72 323.08 1562.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3249.006 120.298 27.008 < 2e-16 ***
## Black1 -189.710 83.770 -2.265 0.0241 *
## Married1 63.281 69.819 0.906 0.3653
## Boy1 118.816 55.077 2.157 0.0316 *
## MomSmoke1 -198.047 79.065 -2.505 0.0127 *
## Ed1 71.241 56.300 1.265 0.2065
## MomAge 3.048 5.305 0.574 0.5660
## MomWtGain 12.136 2.117 5.733 1.98e-08 ***
## Visit 13.626 37.691 0.362 0.7179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 547.1 on 391 degrees of freedom
## Multiple R-squared: 0.1456, Adjusted R-squared: 0.1281
## F-statistic: 8.328 on 8 and 391 DF, p-value: 1.901e-10
model.stepwise <- ols_step_both_p(lm.birth, pent = 0.01, prem = 0.01, details = F)
model.stepwise
##
## Stepwise Selection Summary
## -----------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## -----------------------------------------------------------------------------------------
## 1 MomWtGain addition 0.089 0.087 20.7600 6201.2327 559.8163
## 2 MomSmoke addition 0.107 0.102 14.6830 6195.4042 555.0626
## 3 Black addition 0.127 0.120 7.4660 6188.2794 549.4600
## -----------------------------------------------------------------------------------------
model.forward <- ols_step_forward_p(lm.birth, penter = 0.01, details = F)
model.forward
##
## Selection Summary
## -----------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## -----------------------------------------------------------------------------
## 1 MomWtGain 0.0893 0.0870 20.7604 6201.2327 559.8163
## 2 MomSmoke 0.1069 0.1024 14.6832 6195.4042 555.0626
## 3 Black 0.1271 0.1205 7.4659 6188.2794 549.4600
## -----------------------------------------------------------------------------
model.backward <- ols_step_backward_p(lm.birth, prem = 0.01, details = F)
model.backward
##
##
## Elimination Summary
## ---------------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## ---------------------------------------------------------------------------
## 1 Visit 0.1453 0.130 7.1307 6187.8448 546.4642
## 2 MomAge 0.1445 0.1314 5.5158 6186.2385 546.0372
## 3 Married 0.1409 0.130 5.1599 6185.9146 546.4876
## 4 Ed 0.1364 0.1277 5.1841 6185.9688 547.1986
## 5 Boy 0.1271 0.1205 7.4659 6188.2794 549.4600
## ---------------------------------------------------------------------------
model.best.subset <- ols_step_best_subset(lm.birth)
model.best.subset
From the above results, stepwise, forward, and backward models produce the same predictors: MomWtGain, MomSmoke, & Black. The best subset approach produces Black, Married, Boy, MomSmoke, Ed, & MomWtGain as the predictors. This is a reasonable model because it gives the highest adjusted r-square value out of the eight models
lm.step <- lm(Weight ~ MomWtGain + MomSmoke + Black, data = birthweight)
par(mfrow = c(2,2))
plot(lm.step, which = c(1:4))
Normality: the QQPlot shows most of the points falling along the line for the majority of the graph. Also, the sqrt(Standardized Residual) plot shows some points fall above 2. We can conclude the assumption of normality is NOT reasonable. Equal Variance: the Residual plot shows a pattern - supporting heteroscedasticity. Influential points: Cook’s Distance shows there’s one point that exceeds 0.115 - Obs 129. To confirm, we need to check if refitting the model is needed (test below)
influential.id = which(cooks.distance(lm.step) > 0.115)
birthweight[influential.id, ]
lm.step2 = lm(Weight ~ MomWtGain + MomSmoke + Black, data = birthweight[-influential.id, ])
summary(lm.step2)
##
## Call:
## lm(formula = Weight ~ MomWtGain + MomSmoke + Black, data = birthweight[-influential.id,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2427.02 -309.20 2.98 315.40 1472.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3434.252 32.078 107.059 < 2e-16 ***
## MomWtGain 13.112 2.113 6.204 1.39e-09 ***
## MomSmoke1 -238.923 76.251 -3.133 0.00186 **
## Black1 -198.519 78.022 -2.544 0.01133 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 542.2 on 395 degrees of freedom
## Multiple R-squared: 0.1366, Adjusted R-squared: 0.1301
## F-statistic: 20.84 on 3 and 395 DF, p-value: 1.493e-12
From the results above, 13.66% of the variation in Weight can be explained by the final model. This is a fairly low percentage and shows the model has low prediction power for Weight.
Model Significance: the model p-value is lower than the significant value of 0.05 – we can REJECt the null and conclude that the final model is useful in explaining the behavior of Weight. Individual Term Significance: All the p-values fall below 0.05 – there is a significant relationship between Weight and all three predictors (MomWtGain, MomSmoke, & Black) Estimated Regression Line: Y = 3434.252 + 13.112(MomWtGain) - 238.923(MomSmoke) - 198.519(Black) + E R-Squared: 13.66% of Weight variation can be explained by the model. This means the model has low prediciton power.
model.null = glm(Weight_Gr ~ 1, data = birthweight, family = "binomial")
model.full = glm(Weight_Gr ~ Black + Married + Boy + MomSmoke + Ed + MomAge + MomWtGain + Visit, data = birthweight, family = "binomial")
step.model.aic = step(model.null, scope = list(upper = model.full),
direction = "both", test = "Chisq", trace = F)
summary(step.model.aic)
##
## Call:
## glm(formula = Weight_Gr ~ MomWtGain + MomSmoke + MomAge + Boy +
## Ed, family = "binomial", data = birthweight)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9790 -1.0470 -0.6085 1.0966 2.0012
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.240486 0.188075 1.279 0.20101
## MomWtGain -0.038047 0.008471 -4.492 7.07e-06 ***
## MomSmoke1 0.818590 0.310227 2.639 0.00832 **
## MomAge -0.044444 0.019040 -2.334 0.01959 *
## Boy1 -0.407560 0.212600 -1.917 0.05523 .
## Ed1 -0.366259 0.217910 -1.681 0.09280 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 554.43 on 399 degrees of freedom
## Residual deviance: 510.15 on 394 degrees of freedom
## AIC: 522.15
##
## Number of Fisher Scoring iterations: 4
step.model.bic = step(model.full,
direction = "both", test = "Chisq", trace = F, k = log(nrow(birthweight)))
summary(step.model.bic)
##
## Call:
## glm(formula = Weight_Gr ~ MomSmoke + MomAge + MomWtGain, family = "binomial",
## data = birthweight)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.016 -1.073 -0.669 1.103 2.000
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.132541 0.112817 -1.175 0.24006
## MomSmoke1 0.865786 0.309944 2.793 0.00522 **
## MomAge -0.048266 0.018730 -2.577 0.00997 **
## MomWtGain -0.036819 0.008389 -4.389 1.14e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 554.43 on 399 degrees of freedom
## Residual deviance: 516.39 on 396 degrees of freedom
## AIC: 524.39
##
## Number of Fisher Scoring iterations: 4
From the above results, both AIC and BIC models produced the same predictors that have p-values less than 0.05: MomWtGain, MomSmoke, & MomAge
glm.model = glm(Weight_Gr ~ MomWtGain + MomSmoke + MomAge, data = birthweight, family = "binomial")
summary(glm.model)
##
## Call:
## glm(formula = Weight_Gr ~ MomWtGain + MomSmoke + MomAge, family = "binomial",
## data = birthweight)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.016 -1.073 -0.669 1.103 2.000
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.132541 0.112817 -1.175 0.24006
## MomWtGain -0.036819 0.008389 -4.389 1.14e-05 ***
## MomSmoke1 0.865786 0.309944 2.793 0.00522 **
## MomAge -0.048266 0.018730 -2.577 0.00997 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 554.43 on 399 degrees of freedom
## Residual deviance: 516.39 on 396 degrees of freedom
## AIC: 524.39
##
## Number of Fisher Scoring iterations: 4
(inf.id = which(cooks.distance(glm.model) > 0.1))
## named integer(0)
All the p-values for the predictors are below 0.05 meaning they are all significant enough to be used in the model. There is also no observations with a Cook’s Distance greater than 0.1
Log(p/1-p) = -0.1325 - 0.0368(MomWtGain) + 0.8658(MomSmoke) - 0.0483(MomAge)
round(exp(glm.model$coefficients), 3)
## (Intercept) MomWtGain MomSmoke1 MomAge
## 0.876 0.964 2.377 0.953
Odds of Weight_Gr change by exp(-0.0368) = 0.964 with one unit increase in MomWtGain (all others constant) Odds of Weight_Gr change by exp(0.8658) = 2.377 with one unit increase in MomSmoke (all others constant) Odds of Weight_Gr change by exp(-0.0483) = 0.953with one unit increase in MomAge (all others constant)
Younger women with higher smoking levels, and lower weight are more likely to deliver an infant with a low birthweight
fit.prob = predict(glm.model, type = "response")
sample.prob = mean(birthweight$Weight_Gr)
pred.class.2 = ifelse(fit.prob > sample.prob, 1, 0)
head(pred.class.2, 10)
## 1 2 3 4 5 6 7 8 9 10
## 0 1 1 0 0 1 0 1 0 1
The sample proportion in the dataset is 5
mean(birthweight$Weight_Gr != pred.class.2)
## [1] 0.355
Misclassification Rate = 35.5% Even though this is high, the cutoff point used is lower than 0.5 so there will be more false positive cases than false negatives - which is acceptable.
hoslem.test(glm.model$y, fitted(glm.model), g = 10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: glm.model$y, fitted(glm.model)
## X-squared = 9.2068, df = 8, p-value = 0.3252
The test above produced a p-value of 0.1722 which is above the 0.05 significance level. We do NOT reject the null and the model is adequate.
The model in Ex1 has low predictive power and is not a good model for predicting Weight
The model has much better predictive power over Ex1 and is therefore a better model for predicting Weight
In both, we see MomWtGain & MomSmoke as predictors so it is probable these are both significant when predicting Weight. In conclusion, when implementing a low-birthweight prevention program, it’s best for expecting women to avoid smoking and maintain healthy weight.