Final Exam

Data Sets:

You need to download dataset birthweight_final.csv. The data record live, singleton births to mothers between the ages of 18 and 45 in the United States who were classified as black or white. There are total of 400 observations in birthweight, and variables are:

Weight: Infant birth weight (gram)
Weight_Gr; Categorical variable for indication of low birthweight; 0 is normal, 1 is low birthweight
Black: Categorical variable; 0 is white, 1 is black
Married: Categorical variable; 0 is not married, 1 is married
Boy: Categorical variable; 0 is girl, 1 is boy
MomSmoke: Categorical variable; 0 is non-smoking mom, 1 is smoking mom
Ed: Categorical variable for Mother’s education Level; 0 is high-school grad or less; 1 is college grad or above
MomAge: Mother’s age (centered to zero)
MomWtGain: Mother’s weight gain during pregnancy (centered to zero)
Visit: number of prenatal visits

Exercise 1

Consider to fit a multiple linear regression to model Weight using possible explanatory variables; Black, Married, Boy, MomSmoke, Ed, MomAge, MomWtGain, and Visit (all predictors excluding Weight_Gr).

Exercise 1 A

Perform the following four model selection methods and compare their best models. Comment on how they differ or similar in terms of selected variables in the final model. No need to interpret outputs.

Stepwise selection with 0.01 p-value criteria for both entry and stay
Forward selection with 0.01 p-value criteria for entry
Backward selection with 0.01 p-value criteria for stay
Adjusted R-squared criteria

NOTE: R output from Backward selection displays variables “removed” from each step

Multiple Linear Regression

lm.birth <- lm(Weight ~ Black + Married + Boy + MomSmoke + Ed + MomAge + MomWtGain + Visit, data = bweight)
summary(lm.birth)

## 
## Call:
## lm(formula = Weight ~ Black + Married + Boy + MomSmoke + Ed + 
##     MomAge + MomWtGain + Visit, data = bweight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2433.18  -312.36    14.72   323.08  1562.40 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3249.006    120.298  27.008  < 2e-16 ***
## Black1      -189.710     83.770  -2.265   0.0241 *  
## Married1      63.281     69.819   0.906   0.3653    
## Boy1         118.816     55.077   2.157   0.0316 *  
## MomSmoke1   -198.047     79.065  -2.505   0.0127 *  
## Ed1           71.241     56.300   1.265   0.2065    
## MomAge         3.048      5.305   0.574   0.5660    
## MomWtGain     12.136      2.117   5.733 1.98e-08 ***
## Visit         13.626     37.691   0.362   0.7179    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 547.1 on 391 degrees of freedom
## Multiple R-squared:  0.1456, Adjusted R-squared:  0.1281 
## F-statistic: 8.328 on 8 and 391 DF,  p-value: 1.901e-10

Stepwise Selection

stepwise = ols_step_both_p(lm.birth, pent = 0.01, prem = 0.01, details = F)
stepwise

## 
##                                Stepwise Selection Summary                                 
## -----------------------------------------------------------------------------------------
##                       Added/                   Adj.                                          
## Step    Variable     Removed     R-Square    R-Square     C(p)         AIC         RMSE      
## -----------------------------------------------------------------------------------------
##    1    MomWtGain    addition       0.089       0.087    20.7600    6201.2327    559.8163    
##    2    MomSmoke     addition       0.107       0.102    14.6830    6195.4042    555.0626    
##    3      Black      addition       0.127       0.120     7.4660    6188.2794    549.4600    
## -----------------------------------------------------------------------------------------

Forward Selection

forward = olsrr::ols_step_forward_p(lm.birth, penter = 0.01, details = F)
forward

## 
##                               Selection Summary                               
## -----------------------------------------------------------------------------
##         Variable                   Adj.                                          
## Step     Entered     R-Square    R-Square     C(p)         AIC         RMSE      
## -----------------------------------------------------------------------------
##    1    MomWtGain      0.0893      0.0870    20.7604    6201.2327    559.8163    
##    2    MomSmoke       0.1069      0.1024    14.6832    6195.4042    555.0626    
##    3    Black          0.1271      0.1205     7.4659    6188.2794    549.4600    
## -----------------------------------------------------------------------------

Backward Selection

backward = olsrr::ols_step_backward_p(lm.birth, prem = 0.01, details = F) 
backward

## 
## 
##                             Elimination Summary                             
## ---------------------------------------------------------------------------
##         Variable                  Adj.                                         
## Step    Removed     R-Square    R-Square     C(p)        AIC         RMSE      
## ---------------------------------------------------------------------------
##    1    Visit         0.1453       0.130    7.1307    6187.8448    546.4642    
##    2    MomAge        0.1445      0.1314    5.5158    6186.2385    546.0372    
##    3    Married       0.1409       0.130    5.1599    6185.9146    546.4876    
##    4    Ed            0.1364      0.1277    5.1841    6185.9688    547.1986    
##    5    Boy           0.1271      0.1205    7.4659    6188.2794    549.4600    
## ---------------------------------------------------------------------------

Adjusted R-square

best = ols_step_best_subset(lm.birth)
best

##                       Best Subsets Regression                      
## -------------------------------------------------------------------
## Model Index    Predictors
## -------------------------------------------------------------------
##      1         MomWtGain                                            
##      2         MomSmoke MomWtGain                                   
##      3         Black MomSmoke MomWtGain                             
##      4         Black Boy MomSmoke MomWtGain                         
##      5         Black Boy MomSmoke Ed MomWtGain                      
##      6         Black Married Boy MomSmoke Ed MomWtGain              
##      7         Black Married Boy MomSmoke Ed MomAge MomWtGain       
##      8         Black Married Boy MomSmoke Ed MomAge MomWtGain Visit 
## -------------------------------------------------------------------
## 
##                                                             Subsets Regression Summary                                                            
## --------------------------------------------------------------------------------------------------------------------------------------------------
##                        Adj.        Pred                                                                                                            
## Model    R-Square    R-Square    R-Square     C(p)         AIC         SBIC          SBC            MSEP             FPE          HSP        APC  
## --------------------------------------------------------------------------------------------------------------------------------------------------
##   1        0.0893      0.0870      0.0775    20.7604    6201.2327    5065.9178    6213.2071    125357705.0611    314961.2141    789.4062    0.9199 
##   2        0.1069      0.1024       0.091    14.6832    6195.4042    5060.1250    6211.3700    123238567.9566    310405.1540    778.0163    0.9066 
##   3        0.1271      0.1205      0.1059     7.4659    6188.2794    5053.1393    6208.2368    120764046.4868    304925.3190    764.3196    0.8906 
##   4        0.1364      0.1277      0.1114     5.1841    6185.9688    5050.9395    6209.9176    119772826.2237    303169.1473    759.9653    0.8854 
##   5        0.1409      0.1300      0.1116     5.1599    6185.9146    5050.9720    6213.8549    119462530.2702    303128.3995    759.9203    0.8853 
##   6        0.1445      0.1314      0.1103     5.5158    6186.2385    5051.3901    6218.1702    119266462.8033    303374.3231    760.6035    0.8860 
##   7        0.1453      0.1300      0.1052     7.1307    6187.8448    5053.0558    6223.7680    119453866.2656    304595.5958    763.7420    0.8896 
##   8        0.1456      0.1281      0.1014     9.0000    6189.7111    5054.9736    6229.6258    119720142.0535    306020.7931    767.4022    0.8938 
## --------------------------------------------------------------------------------------------------------------------------------------------------
## AIC: Akaike Information Criteria 
##  SBIC: Sawa's Bayesian Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
##  MSEP: Estimated error of prediction, assuming multivariate normality 
##  FPE: Final Prediction Error 
##  HSP: Hocking's Sp 
##  APC: Amemiya Prediction Criteria

Conclusion: We can see from the various methods of model selection used above that the stepwise, forward, and backward models all produce the same predictors (MomWtGain, MomSmoke, and Black) for our final model (MomWtGain, MomSmoke, and Black). We receive Black, Married, Boy MomSmoke, Ed, and MomWtGain as predictors for our final model using the best subset strategy is model 6, because it has the greatest Adjusted R-Square(0.1314) among the eight models.

Exercise 1 B

Fit the linear regression with the best model determined by stepwise selection and comment on diagnostics plot. Do not leave observation which has Cook’s distance larger than 0.115. Re-fit the model if necessary. Finally how many observations you use in the final model?

Linear model for stepwise selection

lm.step <- lm(Weight ~ MomWtGain + MomSmoke + Black, data=bweight)
par(mfrow=c(2,2))
plot(lm.step, which=c(1:4))

Model Significance: By looking at the Normal QQ Plot, we can see that for the most part, the points fall along the line, with a few points that curve off the line. Also, we can see from the Squareroot(Standardized Residual) plot that certain data points are greater than 2. With that in mind, we may fairly conclude that the assumption of normality is not reasonable.

Equal Variance Check: There is a pattern in the residual plot and from this we can say that it follows heteroscedasticity

Influential Points: We can see that in the Cook’s distance plot that 129 larger than 0.115.

Cook’s Difference

inf.id <- which(cooks.distance(lm.step) > 0.115)
bweight[inf.id, ]

##     Weight Weight_Gr Black Married Boy MomSmoke Ed MomAge MomWtGain Visit
## 129   1804         1     1       1   0        0  1      9        35     3

Refitted Model

lm.refitted.step = lm(Weight ~ MomWtGain + MomSmoke + Black, data=bweight[-inf.id, ])

Exercise 1 C

How much of the variation in Weight is explained by the final model?

summary(lm.refitted.step)

## 
## Call:
## lm(formula = Weight ~ MomWtGain + MomSmoke + Black, data = bweight[-inf.id, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2427.02  -309.20     2.98   315.40  1472.75 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3434.252     32.078 107.059  < 2e-16 ***
## MomWtGain     13.112      2.113   6.204 1.39e-09 ***
## MomSmoke1   -238.923     76.251  -3.133  0.00186 ** 
## Black1      -198.519     78.022  -2.544  0.01133 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 542.2 on 395 degrees of freedom
## Multiple R-squared:  0.1366, Adjusted R-squared:  0.1301 
## F-statistic: 20.84 on 3 and 395 DF,  p-value: 1.493e-12

Conclusion: According to the summary table above, the final model can explain 13.66% of the variation in Weight, which is a low percentage. As a result, this model has a low prediction power for variation of weight.

Exercise 1 D

Interpret the relationship between predictor variables (in the final model) and Weight value specifically.

Model Significance: Our model p-value is 1.493e-12, which is below than the significance level of 0.05. As a result, we can reject null hypothesis and conclude that the model is beneficial in explaining behavior of Weight.

Individual Term Significance: All of the p-values in the T-test for each predictor are less than 0.05. Hence, we can conclude that there’s a significant relationship between Weight and all three predictors: MomWtGain, MomSmoke, and Black.

Estimated Regression Line: On average, a one unit increase in MomWtGain is associated with a 13.112 increase in MomWtGain, while MomSmoke and Black are being held constant when there is a one unit increase.

R-squared: The final model can explain 13.66 % of variation of weight . As a result, the final model’s predictive power for Weight low.

Exercise 2

Now we consider fitting a logistic regression for low birthweight (Weight_Gr=1). Again consider Black, Married, Boy, MomSmoke, Ed, MomAge, MomWtGain, and Visit as possible explanatory variables.

Exercise 2 A

Perform following model selection methods and compare their best models. Comment how they differ or similar in terms of selected variables - Stepwise selection with AIC criteria - Stepwise selection with BIC criteria

NOTE: stepwise selection with BIC criteria can be performed by step() function by adding an option k=log(n), where n is a sample size. Check Week 15 respiratory data example - how to use this option.

Fit Logistic Regression Model for low birthweight

model.null <- glm(Weight_Gr ~ 1, data = bweight, family = "binomial")
model.full <- glm(Weight_Gr ~ Black + Married + Boy + MomSmoke + Ed + 
                  MomAge + MomWtGain + Visit, data = bweight, family = "binomial")

AIC Stepwise Selection

step.AIC<-step(model.null, scope = list(upper=model.full),
                     direction="both",test="Chisq", trace = F) 
summary(step.AIC)

## 
## Call:
## glm(formula = Weight_Gr ~ MomWtGain + MomSmoke + MomAge + Boy + 
##     Ed, family = "binomial", data = bweight)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9790  -1.0470  -0.6085   1.0966   2.0012  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.240486   0.188075   1.279  0.20101    
## MomWtGain   -0.038047   0.008471  -4.492 7.07e-06 ***
## MomSmoke1    0.818590   0.310227   2.639  0.00832 ** 
## MomAge      -0.044444   0.019040  -2.334  0.01959 *  
## Boy1        -0.407560   0.212600  -1.917  0.05523 .  
## Ed1         -0.366259   0.217910  -1.681  0.09280 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 554.43  on 399  degrees of freedom
## Residual deviance: 510.15  on 394  degrees of freedom
## AIC: 522.15
## 
## Number of Fisher Scoring iterations: 4

BIC Stepwise Selection

step.BIC<-step(model.full,direction="both",test="Chisq", 
               trace = F, k=log(nrow(bweight))) 
summary(step.BIC)

## 
## Call:
## glm(formula = Weight_Gr ~ MomSmoke + MomAge + MomWtGain, family = "binomial", 
##     data = bweight)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.016  -1.073  -0.669   1.103   2.000  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.132541   0.112817  -1.175  0.24006    
## MomSmoke1    0.865786   0.309944   2.793  0.00522 ** 
## MomAge      -0.048266   0.018730  -2.577  0.00997 ** 
## MomWtGain   -0.036819   0.008389  -4.389 1.14e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 554.43  on 399  degrees of freedom
## Residual deviance: 516.39  on 396  degrees of freedom
## AIC: 524.39
## 
## Number of Fisher Scoring iterations: 4

Conclusion: We can see that both AIC and BIC based models revealed the same predictors: MomWtGain, MomSmoke, and MomAge, as these predictors have p-values less than the significance level of 0.05, as shown by the Stepwise technique of model selection with AIC and BIC criteria used above.

Exercise 2 B

Fit the logistic regression with the best model determined by stepwise selection with BIC criteria. Do not leave observation which has cook’s d larger than 0.1. Re-fit the model if necessary. Finally how many observations you use in the final model?

glm.model <- glm(Weight_Gr ~ MomWtGain + MomSmoke + MomAge, data = bweight, family = binomial)
summary(glm.model)

## 
## Call:
## glm(formula = Weight_Gr ~ MomWtGain + MomSmoke + MomAge, family = binomial, 
##     data = bweight)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.016  -1.073  -0.669   1.103   2.000  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.132541   0.112817  -1.175  0.24006    
## MomWtGain   -0.036819   0.008389  -4.389 1.14e-05 ***
## MomSmoke1    0.865786   0.309944   2.793  0.00522 ** 
## MomAge      -0.048266   0.018730  -2.577  0.00997 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 554.43  on 399  degrees of freedom
## Residual deviance: 516.39  on 396  degrees of freedom
## AIC: 524.39
## 
## Number of Fisher Scoring iterations: 4

(inf.id=which(cooks.distance(glm.model)>0.1))

## named integer(0)

Conclusion: When we look at the p-values for each predictor individually, we can see that they all fall below 0.05. This indicates that they are important enough to include in the model. There has also been no observation with a Cook’s Distance greater than 0.1.

Exercise 2 C

Based on your final model, interpret the explicit relationship between response and predictors using Odds Ratio.

round(exp(glm.model$coefficients), 3)

## (Intercept)   MomWtGain   MomSmoke1      MomAge 
##       0.876       0.964       2.377       0.953

Conclusion:

The odds of Weight_Gr change by a factor of 0.964 with a one unit increase in MomWtGain as MomAge and MomSmoke are being held constant
The odds of Weight_Gr change by a factor of 2.377 with a one unit increase in MomSmoke as MomWtGain and MomAge are being held constant
The odds of Weight_Gr change by a factor of 0.953 with a one unit increase in MomAge as MomWtGain and MomSmoke are being held constant

Exercise 2 D

Which woman has the high chance to deliver a low birthweight infant? For example, answer will be like “a married, high-educated, and older woman has the high chance to deliver a lower birthweight infant.”

Conclusion: Younger women with higher smoking levels, and lower weightgain are more likely to deliver a low birthweight infant

Exercise 2 E

What is the sample proportion of low birthweight infant in dataset?

sample.prop <- mean(bweight$Weight_Gr)
sample.prop

## [1] 0.4925

fit.prob <- predict(glm.model, type = "response")
pred.class.2 <- ifelse(fit.prob > sample.prop, 1, 0)

Conclusion: The sample proportion of low birthweight infant in dataset is 49.25%

Exercise 2 F

Perform classification with probability cut-off set as sample proportion you answer in (5). What is misclassification rate?

mean(bweight$Weight_Gr != pred.class.2)

## [1] 0.355

Conclusion: The misclassification rate is 35.5%

Exercise 2 G

Comment on Goodness of fit test and make a conclusion

Hosmer-Lemeshow Test

hoslem.test(glm.model$y, fitted(glm.model), g=10)

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  glm.model$y, fitted(glm.model)
## X-squared = 9.2068, df = 8, p-value = 0.3252

Conclusion: From the Hosmer Lemeshow’s test performed above we can see that the p-value(0.3252) is above the significant value of 0.1, and we can say that there is enough evidence to accept the null hypothesis reject the alternative hypothesis and say the model is adequate.

Exercise 3

Compare results from Exercise 1-2 and comment on different or similar conclusions from each analysis.

Conclusion:

Exercise 1: Only 13.66% of Weight variance can be explained by this model when MomWtGain, Black, and MomSmoke are used as predictors. Hence, this is a poor model for the Weight prediction.

Exercise 2: We see that 35.5% of weight variance can be explained by this model when MomWtGain, MomAge, and MomSmoke are used as predicators. Hence, this is a better model for the prediction of Weight.

Low birthweight is a risk factor that can lead infant mortality. If you want to implement a low-birthweight prevention program, what would you suggest to pregnant women?

Conclusion: When we compare these two models, we can see that MomWtGain and MomSmoke are both present. It’s safe to say that these two predictors are important when it comes to forecasting Weight fluctuation. As a result, it’s recommended for women to avoid smoking and maintain a healthy weight during pregnancy while executing a low-birthweight prevention program.

Final Exam

Abha Jha

12/6/2021

Data Sets:

Exercise 1

Exercise 1 A

Exercise 1 B

Exercise 1 C

Exercise 1 D

Exercise 2

Exercise 2 A

Exercise 2 B

Exercise 2 C

Exercise 2 D

Exercise 2 E

Exercise 2 F

Exercise 2 G

Exercise 3