Use significance levels of .05 unless the instructions state otherwise. Data Sets: You need to download dataset birthweight_final.csv. The data record live, singleton births to mothers between the ages of 18 and 45 in the United States who were classified as black or white. There are total of 400 observations in birthweight, and variables are: - Weight: Infant birth weight (gram) - Weight_Gr; Categorical variable for indication of low birthweight; 0 is normal, 1 is low birthweight - Black: Categorical variable; 0 is white, 1 is black - Married: Categorical variable; 0 is not married, 1 is married - Boy: Categorical variable; 0 is girl, 1 is boy - MomSmoke: Categorical variable; 0 is non-smoking mom, 1 is smoking mom - Ed: Categorical variable for Mother’s education Level; 0 is high-school grad or less; 1 is college grad or above - MomAge: Mother’s age (centered to zero) - MomWtGain: Mother’s weight gain during pregnancy (centered to zero) - Visit: number of prenatal visits

Multiple Linear Regression, Model Selection:


Consider to fit a multiple linear regression to model Weight using possible explanatory variables; Black, Married, Boy, MomSmoke, Ed, MomAge, MomWtGain, and Visit (all predictors excluding Weight_Gr).

  1. Perform the following four model selection methods and compare their best models. Comment on how they differ or similar in terms of selected variables in the final model. No need to interpret outputs.
  • Stepwise selection with 0.01 p-value criteria for both entry and stay
  • Forward selection with 0.01 p-value criteria for entry
  • Backward selection with 0.01 p-value criteria for stay
  • Adjusted R-squared criteria

NOTE: R output from Backward selection displays variables “removed” from each step.

# Read the CSV file
d1 <- read.csv("birthweight_final.csv") 

d1$Black = as.factor(d1$Black)
d1$Married = as.factor(d1$Married)
d1$Boy = as.factor(d1$Boy)
d1$MomSmoke = as.factor(d1$MomSmoke)
d1$Ed = as.factor(d1$Ed)


# Check data format
#str(d1)

Question (1):

Consider to fit a multiple linear regression to model Weight using possible explanatory variables; Black, Married, Boy, MomSmoke, Ed, MomAge, MomWtGain, and Visit (all predictors excluding Weight_Gr).

  1. Perform the following four model selection methods and compare their best models. Comment on how they differ or similar in terms of selected variables in the final model. No need to interpret outputs.
  • Stepwise selection with 0.01 p-value criteria for both entry and stay
  • Forward selection with 0.01 p-value criteria for entry
  • Backward selection with 0.01 p-value criteria for stay
  • Adjusted R-squared criteria

Answer -:

Run LR:

e1.lm <- lm(Weight ~ Black+Married+Boy+MomSmoke+Ed+MomAge+MomWtGain+Visit, data=d1)
summary(e1.lm)
## 
## Call:
## lm(formula = Weight ~ Black + Married + Boy + MomSmoke + Ed + 
##     MomAge + MomWtGain + Visit, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2433.18  -312.36    14.72   323.08  1562.40 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3249.006    120.298  27.008  < 2e-16 ***
## Black1      -189.710     83.770  -2.265   0.0241 *  
## Married1      63.281     69.819   0.906   0.3653    
## Boy1         118.816     55.077   2.157   0.0316 *  
## MomSmoke1   -198.047     79.065  -2.505   0.0127 *  
## Ed1           71.241     56.300   1.265   0.2065    
## MomAge         3.048      5.305   0.574   0.5660    
## MomWtGain     12.136      2.117   5.733 1.98e-08 ***
## Visit         13.626     37.691   0.362   0.7179    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 547.1 on 391 degrees of freedom
## Multiple R-squared:  0.1456, Adjusted R-squared:  0.1281 
## F-statistic: 8.328 on 8 and 391 DF,  p-value: 1.901e-10

Stepwise selection:

  • Stepwise selection with 0.01 p-value criteria for both entry and stay
e1.stepwise <- ols_step_both_p(e1.lm,pent = 0.01, prem = 0.01, details = F)
e1.stepwise
## 
##                                Stepwise Selection Summary                                 
## -----------------------------------------------------------------------------------------
##                       Added/                   Adj.                                          
## Step    Variable     Removed     R-Square    R-Square     C(p)         AIC         RMSE      
## -----------------------------------------------------------------------------------------
##    1    MomWtGain    addition       0.089       0.087    20.7600    6201.2327    559.8163    
##    2    MomSmoke     addition       0.107       0.102    14.6830    6195.4042    555.0626    
##    3      Black      addition       0.127       0.120     7.4660    6188.2794    549.4600    
## -----------------------------------------------------------------------------------------

Run Model:

e1.lm.stepwise <- lm(Weight ~ MomWtGain+MomSmoke+Black, data=d1)
summary(e1.lm.stepwise)
## 
## Call:
## lm(formula = Weight ~ MomWtGain + MomSmoke + Black, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2450.50  -312.94     7.73   325.68  1471.38 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3435.615     32.505 105.695  < 2e-16 ***
## MomWtGain     12.006      2.116   5.673 2.71e-08 ***
## MomSmoke1   -237.799     77.271  -3.077  0.00223 ** 
## Black1      -236.556     78.257  -3.023  0.00267 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 549.5 on 396 degrees of freedom
## Multiple R-squared:  0.1271, Adjusted R-squared:  0.1205 
## F-statistic: 19.22 on 3 and 396 DF,  p-value: 1.185e-11

Forward selection:

  • Forward selection with 0.01 p-value criteria for entry
e1.forward <- ols_step_forward_p(e1.lm,pent = 0.01, details = F)
e1.forward
## 
##                               Selection Summary                               
## -----------------------------------------------------------------------------
##         Variable                   Adj.                                          
## Step     Entered     R-Square    R-Square     C(p)         AIC         RMSE      
## -----------------------------------------------------------------------------
##    1    MomWtGain      0.0893      0.0870    20.7604    6201.2327    559.8163    
##    2    MomSmoke       0.1069      0.1024    14.6832    6195.4042    555.0626    
##    3    Black          0.1271      0.1205     7.4659    6188.2794    549.4600    
## -----------------------------------------------------------------------------

Run Model:

e1.lm.forward<- lm(Weight ~ MomWtGain+MomSmoke+Black, data=d1)
summary(e1.lm.forward)
## 
## Call:
## lm(formula = Weight ~ MomWtGain + MomSmoke + Black, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2450.50  -312.94     7.73   325.68  1471.38 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3435.615     32.505 105.695  < 2e-16 ***
## MomWtGain     12.006      2.116   5.673 2.71e-08 ***
## MomSmoke1   -237.799     77.271  -3.077  0.00223 ** 
## Black1      -236.556     78.257  -3.023  0.00267 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 549.5 on 396 degrees of freedom
## Multiple R-squared:  0.1271, Adjusted R-squared:  0.1205 
## F-statistic: 19.22 on 3 and 396 DF,  p-value: 1.185e-11

Backward selection:

  • Backward selection with 0.01 p-value criteria for stay
e1.backward <- ols_step_backward_p(e1.lm,prem = 0.01, details = F)
e1.backward
## 
## 
##                             Elimination Summary                             
## ---------------------------------------------------------------------------
##         Variable                  Adj.                                         
## Step    Removed     R-Square    R-Square     C(p)        AIC         RMSE      
## ---------------------------------------------------------------------------
##    1    Visit         0.1453       0.130    7.1307    6187.8448    546.4642    
##    2    MomAge        0.1445      0.1314    5.5158    6186.2385    546.0372    
##    3    Married       0.1409       0.130    5.1599    6185.9146    546.4876    
##    4    Ed            0.1364      0.1277    5.1841    6185.9688    547.1986    
##    5    Boy           0.1271      0.1205    7.4659    6188.2794    549.4600    
## ---------------------------------------------------------------------------

Run Model (After removing above variables):

e1.lm.backward<- lm(Weight ~ Black + MomSmoke + MomWtGain , data=d1)
summary(e1.lm.backward)
## 
## Call:
## lm(formula = Weight ~ Black + MomSmoke + MomWtGain, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2450.50  -312.94     7.73   325.68  1471.38 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3435.615     32.505 105.695  < 2e-16 ***
## Black1      -236.556     78.257  -3.023  0.00267 ** 
## MomSmoke1   -237.799     77.271  -3.077  0.00223 ** 
## MomWtGain     12.006      2.116   5.673 2.71e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 549.5 on 396 degrees of freedom
## Multiple R-squared:  0.1271, Adjusted R-squared:  0.1205 
## F-statistic: 19.22 on 3 and 396 DF,  p-value: 1.185e-11

Subset selection (Adj. R Square):

  • Adjusted R-squared criteria
e1.lm.subset <- ols_step_best_subset(e1.lm) 
e1.lm.subset
##                       Best Subsets Regression                      
## -------------------------------------------------------------------
## Model Index    Predictors
## -------------------------------------------------------------------
##      1         MomWtGain                                            
##      2         MomSmoke MomWtGain                                   
##      3         Black MomSmoke MomWtGain                             
##      4         Black Boy MomSmoke MomWtGain                         
##      5         Black Boy MomSmoke Ed MomWtGain                      
##      6         Black Married Boy MomSmoke Ed MomWtGain              
##      7         Black Married Boy MomSmoke Ed MomAge MomWtGain       
##      8         Black Married Boy MomSmoke Ed MomAge MomWtGain Visit 
## -------------------------------------------------------------------
## 
##                                                             Subsets Regression Summary                                                            
## --------------------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                            
## Model    R-Square    R-Square    R-Square     C(p)         AIC         SBIC          SBC            MSEP             FPE          HSP        APC  
## --------------------------------------------------------------------------------------------------------------------------------------------------
##   1        0.0893      0.0870      0.0775    20.7604    6201.2327    5065.9178    6213.2071    125357705.0611    314961.2141    789.4062    0.9199 
##   2        0.1069      0.1024       0.091    14.6832    6195.4042    5060.1250    6211.3700    123238567.9566    310405.1540    778.0163    0.9066 
##   3        0.1271      0.1205      0.1059     7.4659    6188.2794    5053.1393    6208.2368    120764046.4868    304925.3190    764.3196    0.8906 
##   4        0.1364      0.1277      0.1114     5.1841    6185.9688    5050.9395    6209.9176    119772826.2237    303169.1473    759.9653    0.8854 
##   5        0.1409      0.1300      0.1116     5.1599    6185.9146    5050.9720    6213.8549    119462530.2702    303128.3995    759.9203    0.8853 
##   6        0.1445      0.1314      0.1103     5.5158    6186.2385    5051.3901    6218.1702    119266462.8033    303374.3231    760.6035    0.8860 
##   7        0.1453      0.1300      0.1052     7.1307    6187.8448    5053.0558    6223.7680    119453866.2656    304595.5958    763.7420    0.8896 
##   8        0.1456      0.1281      0.1014     9.0000    6189.7111    5054.9736    6229.6258    119720142.0535    306020.7931    767.4022    0.8938 
## --------------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria

For Adj. R Square, bigger is better, thus Adj. R-Square=0.1314 is related to Model #6,
thus the best selected model is:
Model #6: Black Married Boy MomSmoke Ed MomWtGain

Run Model:

e1.lm.subset <- lm(Weight ~ Black + Married + Boy + MomSmoke + Ed + MomWtGain , data=d1)
summary(e1.lm.subset)
## 
## Call:
## lm(formula = Weight ~ Black + Married + Boy + MomSmoke + Ed + 
##     MomWtGain, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2401.07  -313.80    19.42   323.47  1547.41 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3270.007     72.733  44.959  < 2e-16 ***
## Black1      -193.724     81.970  -2.363   0.0186 *  
## Married1      82.540     64.252   1.285   0.1997    
## Boy1         120.232     54.936   2.189   0.0292 *  
## MomSmoke1   -201.548     78.751  -2.559   0.0109 *  
## Ed1           75.466     55.803   1.352   0.1770    
## MomWtGain     12.209      2.106   5.797 1.38e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 546 on 393 degrees of freedom
## Multiple R-squared:  0.1445, Adjusted R-squared:  0.1314 
## F-statistic: 11.06 on 6 and 393 DF,  p-value: 2.107e-11

Conclusion

  • Final model of stepwise:

    ŷ = 3435.615 + 12.006 * MomWtGain - 237.799 * MomSmoke - 236.556 * Black

  • Final model of Forward selection:

    ŷ = 3435.615 + 12.006 * MomWtGain - 237.799 * MomSmoke - 236.556 * Black

  • Final model of Backward Selection:

    ŷ = 3435.615 -236.556 * Black -237.799 * MomSmoke + 12.00697 * MomWtGain

  • Final model of Subset (Adj R Square):

    ŷ = 3270.007 - 193.724 * Black + 82.540 * Married + 120.232 * Boy - 201.548 * MomSmoke + 75.466 * Ed + 12.209 * MomWtGain

Comparison:

  • Stepwise and Forward are the same.
  • Black & MomSmoke & MomWtGain are the common variables in all models.

Answer following questions from the best model determined by Stepwise selection with 0.01 p-value criteria

The final model based on stepwise selection is:
- ŷ = 3435.615 + 12.006 * MomWtGain - 237.799 * MomSmoke - 236.556 * Black

Question (2):

(2) Fit the linear regression with the best model determined by stepwise selection and comment on diagnostics plot. Do not leave observation which has Cook’s distance larger than 0.115. Re-fit the model if necessary. Finally how many observations you use in the final model?

Answer -:

Fit the Model:

e1.lm.final.stepwise <- lm(Weight ~ MomWtGain+MomSmoke+Black, data=d1)
summary(e1.lm.final.stepwise)
## 
## Call:
## lm(formula = Weight ~ MomWtGain + MomSmoke + Black, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2450.50  -312.94     7.73   325.68  1471.38 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3435.615     32.505 105.695  < 2e-16 ***
## MomWtGain     12.006      2.116   5.673 2.71e-08 ***
## MomSmoke1   -237.799     77.271  -3.077  0.00223 ** 
## Black1      -236.556     78.257  -3.023  0.00267 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 549.5 on 396 degrees of freedom
## Multiple R-squared:  0.1271, Adjusted R-squared:  0.1205 
## F-statistic: 19.22 on 3 and 396 DF,  p-value: 1.185e-11
  • The model is significant and each individual variable is also significant.

Perform Diagnostic Plot

par(mfrow=c(2,2))
plot(e1.lm.final.stepwise,which=1:4,col="darkblue")

  • It seems a normal distribution in Normal QQ plot, in Standardized residual, 95% data are between 0-1.5, thus data follow normal distribution.

  • In the Residuals plot, there is no pattern, thus there exist a Homoscedasticity.

  • In the cook distance, there exist some pints greater than 0.115.

Influential points

e1.cook <- cooks.distance(e1.lm.final.stepwise)
plot(e1.cook,col="darkblue",pch=19,cex=1)

  • Count the influential points:
e1.inf.id <- which(cooks.distance(e1.lm.final.stepwise) > 0.115) 
e1.inf.id
## 129 
## 129
  • There is one influential point Here, wee will eliminate influential points.
d1.remove.inf <- d1[-e1.inf.id,]
dim(d1.remove.inf)
## [1] 399  10
  • There are 399 observations after elimination the influential points.

Refit the model, again

e1.lm.final.stepwise.after.elimination <- lm(Weight ~ MomWtGain+MomSmoke+Black  , data=d1.remove.inf)
summary(e1.lm.final.stepwise.after.elimination)
## 
## Call:
## lm(formula = Weight ~ MomWtGain + MomSmoke + Black, data = d1.remove.inf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2427.02  -309.20     2.98   315.40  1472.75 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3434.252     32.078 107.059  < 2e-16 ***
## MomWtGain     13.112      2.113   6.204 1.39e-09 ***
## MomSmoke1   -238.923     76.251  -3.133  0.00186 ** 
## Black1      -198.519     78.022  -2.544  0.01133 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 542.2 on 395 degrees of freedom
## Multiple R-squared:  0.1366, Adjusted R-squared:  0.1301 
## F-statistic: 20.84 on 3 and 395 DF,  p-value: 1.493e-12

Perform Diagnostic Plot again (After elimination influential points)

par(mfrow=c(2,2))
plot(e1.lm.final.stepwise.after.elimination,which=1:4,col="darkblue")

Question (3):

3) How much of the variation in Weight is explained by the final model?

Answer -:

Based on previous step:

  • R-Squared is: 0.1366

It means 13.66% of variation in Weight explained by the model.


Question (4):

(4) Interpret the relationship between predictor variables (in the final model) and Weight value specifically.

Answer -:

  • Final linear regression model is:

    ŷ = 3434.252 + 13.112 * MomWtGain - 238.923 * MomSmoke - 198.519 * Black

β0 = 3434.252 –> Intercept
β1 = 13.112 –> Indicates how much increasing/decreasing in ŷ when one unit increase/decrease in β1 (MomWtGain).
β2 = -238.923 –> Indicates how much increasing/decreasing in ŷ when one unit increase/decrease in β2 (MomSmoke).
β3 = - 198.519 –> Indicates how much increasing/decreasing in ŷ when one unit increase/decrease in β3 (Black).

  • MomWtGain: MomWtGain has positive effect on weight .There is 13.112 units of increase in infant weight per 1 unit increase in MomWtGain.

  • MomSmoke1: MomSmoke1 has negative effect on weight. There is -238.923 units of decrease in infant weight for smoking mothers.

  • Black1 : Black1 has negative effect on weight . There is -198.519 units of decrease in infant weight for black mothers.