Use significance levels of .05 unless the instructions state otherwise. Data Sets: You need to download dataset birthweight_final.csv. The data record live, singleton births to mothers between the ages of 18 and 45 in the United States who were classified as black or white. There are total of 400 observations in birthweight, and variables are: - Weight: Infant birth weight (gram) - Weight_Gr; Categorical variable for indication of low birthweight; 0 is normal, 1 is low birthweight - Black: Categorical variable; 0 is white, 1 is black - Married: Categorical variable; 0 is not married, 1 is married - Boy: Categorical variable; 0 is girl, 1 is boy - MomSmoke: Categorical variable; 0 is non-smoking mom, 1 is smoking mom - Ed: Categorical variable for Mother’s education Level; 0 is high-school grad or less; 1 is college grad or above - MomAge: Mother’s age (centered to zero) - MomWtGain: Mother’s weight gain during pregnancy (centered to zero) - Visit: number of prenatal visits
Consider to fit a multiple linear regression to model Weight using possible explanatory variables; Black, Married, Boy, MomSmoke, Ed, MomAge, MomWtGain, and Visit (all predictors excluding Weight_Gr).
NOTE: R output from Backward selection displays variables “removed” from each step.
# Read the CSV file
d1 <- read.csv("birthweight_final.csv")
d1$Black = as.factor(d1$Black)
d1$Married = as.factor(d1$Married)
d1$Boy = as.factor(d1$Boy)
d1$MomSmoke = as.factor(d1$MomSmoke)
d1$Ed = as.factor(d1$Ed)
# Check data format
#str(d1)
Consider to fit a multiple linear regression to model Weight using possible explanatory variables; Black, Married, Boy, MomSmoke, Ed, MomAge, MomWtGain, and Visit (all predictors excluding Weight_Gr).
e1.lm <- lm(Weight ~ Black+Married+Boy+MomSmoke+Ed+MomAge+MomWtGain+Visit, data=d1)
summary(e1.lm)
##
## Call:
## lm(formula = Weight ~ Black + Married + Boy + MomSmoke + Ed +
## MomAge + MomWtGain + Visit, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2433.18 -312.36 14.72 323.08 1562.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3249.006 120.298 27.008 < 2e-16 ***
## Black1 -189.710 83.770 -2.265 0.0241 *
## Married1 63.281 69.819 0.906 0.3653
## Boy1 118.816 55.077 2.157 0.0316 *
## MomSmoke1 -198.047 79.065 -2.505 0.0127 *
## Ed1 71.241 56.300 1.265 0.2065
## MomAge 3.048 5.305 0.574 0.5660
## MomWtGain 12.136 2.117 5.733 1.98e-08 ***
## Visit 13.626 37.691 0.362 0.7179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 547.1 on 391 degrees of freedom
## Multiple R-squared: 0.1456, Adjusted R-squared: 0.1281
## F-statistic: 8.328 on 8 and 391 DF, p-value: 1.901e-10
e1.stepwise <- ols_step_both_p(e1.lm,pent = 0.01, prem = 0.01, details = F)
e1.stepwise
##
## Stepwise Selection Summary
## -----------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## -----------------------------------------------------------------------------------------
## 1 MomWtGain addition 0.089 0.087 20.7600 6201.2327 559.8163
## 2 MomSmoke addition 0.107 0.102 14.6830 6195.4042 555.0626
## 3 Black addition 0.127 0.120 7.4660 6188.2794 549.4600
## -----------------------------------------------------------------------------------------
Run Model:
e1.lm.stepwise <- lm(Weight ~ MomWtGain+MomSmoke+Black, data=d1)
summary(e1.lm.stepwise)
##
## Call:
## lm(formula = Weight ~ MomWtGain + MomSmoke + Black, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2450.50 -312.94 7.73 325.68 1471.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3435.615 32.505 105.695 < 2e-16 ***
## MomWtGain 12.006 2.116 5.673 2.71e-08 ***
## MomSmoke1 -237.799 77.271 -3.077 0.00223 **
## Black1 -236.556 78.257 -3.023 0.00267 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 549.5 on 396 degrees of freedom
## Multiple R-squared: 0.1271, Adjusted R-squared: 0.1205
## F-statistic: 19.22 on 3 and 396 DF, p-value: 1.185e-11
e1.forward <- ols_step_forward_p(e1.lm,pent = 0.01, details = F)
e1.forward
##
## Selection Summary
## -----------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## -----------------------------------------------------------------------------
## 1 MomWtGain 0.0893 0.0870 20.7604 6201.2327 559.8163
## 2 MomSmoke 0.1069 0.1024 14.6832 6195.4042 555.0626
## 3 Black 0.1271 0.1205 7.4659 6188.2794 549.4600
## -----------------------------------------------------------------------------
Run Model:
e1.lm.forward<- lm(Weight ~ MomWtGain+MomSmoke+Black, data=d1)
summary(e1.lm.forward)
##
## Call:
## lm(formula = Weight ~ MomWtGain + MomSmoke + Black, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2450.50 -312.94 7.73 325.68 1471.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3435.615 32.505 105.695 < 2e-16 ***
## MomWtGain 12.006 2.116 5.673 2.71e-08 ***
## MomSmoke1 -237.799 77.271 -3.077 0.00223 **
## Black1 -236.556 78.257 -3.023 0.00267 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 549.5 on 396 degrees of freedom
## Multiple R-squared: 0.1271, Adjusted R-squared: 0.1205
## F-statistic: 19.22 on 3 and 396 DF, p-value: 1.185e-11
e1.backward <- ols_step_backward_p(e1.lm,prem = 0.01, details = F)
e1.backward
##
##
## Elimination Summary
## ---------------------------------------------------------------------------
## Variable Adj.
## Step Removed R-Square R-Square C(p) AIC RMSE
## ---------------------------------------------------------------------------
## 1 Visit 0.1453 0.130 7.1307 6187.8448 546.4642
## 2 MomAge 0.1445 0.1314 5.5158 6186.2385 546.0372
## 3 Married 0.1409 0.130 5.1599 6185.9146 546.4876
## 4 Ed 0.1364 0.1277 5.1841 6185.9688 547.1986
## 5 Boy 0.1271 0.1205 7.4659 6188.2794 549.4600
## ---------------------------------------------------------------------------
Run Model (After removing above variables):
e1.lm.backward<- lm(Weight ~ Black + MomSmoke + MomWtGain , data=d1)
summary(e1.lm.backward)
##
## Call:
## lm(formula = Weight ~ Black + MomSmoke + MomWtGain, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2450.50 -312.94 7.73 325.68 1471.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3435.615 32.505 105.695 < 2e-16 ***
## Black1 -236.556 78.257 -3.023 0.00267 **
## MomSmoke1 -237.799 77.271 -3.077 0.00223 **
## MomWtGain 12.006 2.116 5.673 2.71e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 549.5 on 396 degrees of freedom
## Multiple R-squared: 0.1271, Adjusted R-squared: 0.1205
## F-statistic: 19.22 on 3 and 396 DF, p-value: 1.185e-11
e1.lm.subset <- ols_step_best_subset(e1.lm)
e1.lm.subset
## Best Subsets Regression
## -------------------------------------------------------------------
## Model Index Predictors
## -------------------------------------------------------------------
## 1 MomWtGain
## 2 MomSmoke MomWtGain
## 3 Black MomSmoke MomWtGain
## 4 Black Boy MomSmoke MomWtGain
## 5 Black Boy MomSmoke Ed MomWtGain
## 6 Black Married Boy MomSmoke Ed MomWtGain
## 7 Black Married Boy MomSmoke Ed MomAge MomWtGain
## 8 Black Married Boy MomSmoke Ed MomAge MomWtGain Visit
## -------------------------------------------------------------------
##
## Subsets Regression Summary
## --------------------------------------------------------------------------------------------------------------------------------------------------
## Adj. Pred
## Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
## --------------------------------------------------------------------------------------------------------------------------------------------------
## 1 0.0893 0.0870 0.0775 20.7604 6201.2327 5065.9178 6213.2071 125357705.0611 314961.2141 789.4062 0.9199
## 2 0.1069 0.1024 0.091 14.6832 6195.4042 5060.1250 6211.3700 123238567.9566 310405.1540 778.0163 0.9066
## 3 0.1271 0.1205 0.1059 7.4659 6188.2794 5053.1393 6208.2368 120764046.4868 304925.3190 764.3196 0.8906
## 4 0.1364 0.1277 0.1114 5.1841 6185.9688 5050.9395 6209.9176 119772826.2237 303169.1473 759.9653 0.8854
## 5 0.1409 0.1300 0.1116 5.1599 6185.9146 5050.9720 6213.8549 119462530.2702 303128.3995 759.9203 0.8853
## 6 0.1445 0.1314 0.1103 5.5158 6186.2385 5051.3901 6218.1702 119266462.8033 303374.3231 760.6035 0.8860
## 7 0.1453 0.1300 0.1052 7.1307 6187.8448 5053.0558 6223.7680 119453866.2656 304595.5958 763.7420 0.8896
## 8 0.1456 0.1281 0.1014 9.0000 6189.7111 5054.9736 6229.6258 119720142.0535 306020.7931 767.4022 0.8938
## --------------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria
## SBIC: Sawa's Bayesian Information Criteria
## SBC: Schwarz Bayesian Criteria
## MSEP: Estimated error of prediction, assuming multivariate normality
## FPE: Final Prediction Error
## HSP: Hocking's Sp
## APC: Amemiya Prediction Criteria
For Adj. R Square, bigger is better, thus Adj. R-Square=0.1314 is related to Model #6,
thus the best selected model is:
Model #6: Black Married Boy MomSmoke Ed MomWtGain
Run Model:
e1.lm.subset <- lm(Weight ~ Black + Married + Boy + MomSmoke + Ed + MomWtGain , data=d1)
summary(e1.lm.subset)
##
## Call:
## lm(formula = Weight ~ Black + Married + Boy + MomSmoke + Ed +
## MomWtGain, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2401.07 -313.80 19.42 323.47 1547.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3270.007 72.733 44.959 < 2e-16 ***
## Black1 -193.724 81.970 -2.363 0.0186 *
## Married1 82.540 64.252 1.285 0.1997
## Boy1 120.232 54.936 2.189 0.0292 *
## MomSmoke1 -201.548 78.751 -2.559 0.0109 *
## Ed1 75.466 55.803 1.352 0.1770
## MomWtGain 12.209 2.106 5.797 1.38e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 546 on 393 degrees of freedom
## Multiple R-squared: 0.1445, Adjusted R-squared: 0.1314
## F-statistic: 11.06 on 6 and 393 DF, p-value: 2.107e-11
Final model of stepwise:
ŷ = 3435.615 + 12.006 * MomWtGain - 237.799 * MomSmoke - 236.556 * Black
Final model of Forward selection:
ŷ = 3435.615 + 12.006 * MomWtGain - 237.799 * MomSmoke - 236.556 * Black
Final model of Backward Selection:
ŷ = 3435.615 -236.556 * Black -237.799 * MomSmoke + 12.00697 * MomWtGain
Final model of Subset (Adj R Square):
ŷ = 3270.007 - 193.724 * Black + 82.540 * Married + 120.232 * Boy - 201.548 * MomSmoke + 75.466 * Ed + 12.209 * MomWtGain
Answer following questions from the best model determined by Stepwise selection with 0.01 p-value criteria
The final model based on stepwise selection is:
- ŷ = 3435.615 + 12.006 * MomWtGain - 237.799 * MomSmoke - 236.556 * Black
(2) Fit the linear regression with the best model determined by stepwise selection and comment on diagnostics plot. Do not leave observation which has Cook’s distance larger than 0.115. Re-fit the model if necessary. Finally how many observations you use in the final model?
e1.lm.final.stepwise <- lm(Weight ~ MomWtGain+MomSmoke+Black, data=d1)
summary(e1.lm.final.stepwise)
##
## Call:
## lm(formula = Weight ~ MomWtGain + MomSmoke + Black, data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2450.50 -312.94 7.73 325.68 1471.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3435.615 32.505 105.695 < 2e-16 ***
## MomWtGain 12.006 2.116 5.673 2.71e-08 ***
## MomSmoke1 -237.799 77.271 -3.077 0.00223 **
## Black1 -236.556 78.257 -3.023 0.00267 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 549.5 on 396 degrees of freedom
## Multiple R-squared: 0.1271, Adjusted R-squared: 0.1205
## F-statistic: 19.22 on 3 and 396 DF, p-value: 1.185e-11
par(mfrow=c(2,2))
plot(e1.lm.final.stepwise,which=1:4,col="darkblue")
It seems a normal distribution in Normal QQ plot, in Standardized residual, 95% data are between 0-1.5, thus data follow normal distribution.
In the Residuals plot, there is no pattern, thus there exist a Homoscedasticity.
In the cook distance, there exist some pints greater than 0.115.
e1.cook <- cooks.distance(e1.lm.final.stepwise)
plot(e1.cook,col="darkblue",pch=19,cex=1)
e1.inf.id <- which(cooks.distance(e1.lm.final.stepwise) > 0.115)
e1.inf.id
## 129
## 129
d1.remove.inf <- d1[-e1.inf.id,]
dim(d1.remove.inf)
## [1] 399 10
e1.lm.final.stepwise.after.elimination <- lm(Weight ~ MomWtGain+MomSmoke+Black , data=d1.remove.inf)
summary(e1.lm.final.stepwise.after.elimination)
##
## Call:
## lm(formula = Weight ~ MomWtGain + MomSmoke + Black, data = d1.remove.inf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2427.02 -309.20 2.98 315.40 1472.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3434.252 32.078 107.059 < 2e-16 ***
## MomWtGain 13.112 2.113 6.204 1.39e-09 ***
## MomSmoke1 -238.923 76.251 -3.133 0.00186 **
## Black1 -198.519 78.022 -2.544 0.01133 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 542.2 on 395 degrees of freedom
## Multiple R-squared: 0.1366, Adjusted R-squared: 0.1301
## F-statistic: 20.84 on 3 and 395 DF, p-value: 1.493e-12
par(mfrow=c(2,2))
plot(e1.lm.final.stepwise.after.elimination,which=1:4,col="darkblue")
3) How much of the variation in Weight is explained by the final model?
Based on previous step:
It means 13.66% of variation in Weight explained by the model.
(4) Interpret the relationship between predictor variables (in the final model) and Weight value specifically.
Final linear regression model is:
ŷ = 3434.252 + 13.112 * MomWtGain - 238.923 * MomSmoke - 198.519 * Black
β0 = 3434.252 –> Intercept
β1 = 13.112 –> Indicates how much increasing/decreasing in ŷ when one unit increase/decrease in β1 (MomWtGain).
β2 = -238.923 –> Indicates how much increasing/decreasing in ŷ when one unit increase/decrease in β2 (MomSmoke).
β3 = - 198.519 –> Indicates how much increasing/decreasing in ŷ when one unit increase/decrease in β3 (Black).
MomWtGain: MomWtGain has positive effect on weight .There is 13.112 units of increase in infant weight per 1 unit increase in MomWtGain.
MomSmoke1: MomSmoke1 has negative effect on weight. There is -238.923 units of decrease in infant weight for smoking mothers.
Black1 : Black1 has negative effect on weight . There is -198.519 units of decrease in infant weight for black mothers.