Exercise 1:

Ex 1.1

Multiple Linear Regression

lm.birth <- lm(Weight ~ Black + Married + Boy + MomSmoke + Ed + MomAge + MomWtGain + Visit, data = birthweight)
summary(lm.birth)
## 
## Call:
## lm(formula = Weight ~ Black + Married + Boy + MomSmoke + Ed + 
##     MomAge + MomWtGain + Visit, data = birthweight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2433.18  -312.36    14.72   323.08  1562.40 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3249.006    120.298  27.008  < 2e-16 ***
## Black1      -189.710     83.770  -2.265   0.0241 *  
## Married1      63.281     69.819   0.906   0.3653    
## Boy1         118.816     55.077   2.157   0.0316 *  
## MomSmoke1   -198.047     79.065  -2.505   0.0127 *  
## Ed1           71.241     56.300   1.265   0.2065    
## MomAge         3.048      5.305   0.574   0.5660    
## MomWtGain     12.136      2.117   5.733 1.98e-08 ***
## Visit         13.626     37.691   0.362   0.7179    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 547.1 on 391 degrees of freedom
## Multiple R-squared:  0.1456, Adjusted R-squared:  0.1281 
## F-statistic: 8.328 on 8 and 391 DF,  p-value: 1.901e-10

Stepwise Model Selection

model.stepwise <- ols_step_both_p(lm.birth, pent = 0.01, prem = 0.01, details = F)
model.stepwise
## 
##                                Stepwise Selection Summary                                 
## -----------------------------------------------------------------------------------------
##                       Added/                   Adj.                                          
## Step    Variable     Removed     R-Square    R-Square     C(p)         AIC         RMSE      
## -----------------------------------------------------------------------------------------
##    1    MomWtGain    addition       0.089       0.087    20.7600    6201.2327    559.8163    
##    2    MomSmoke     addition       0.107       0.102    14.6830    6195.4042    555.0626    
##    3      Black      addition       0.127       0.120     7.4660    6188.2794    549.4600    
## -----------------------------------------------------------------------------------------

Forward Model Selection

model.forward <- ols_step_forward_p(lm.birth, penter = 0.01, details = F)
model.forward
## 
##                               Selection Summary                               
## -----------------------------------------------------------------------------
##         Variable                   Adj.                                          
## Step     Entered     R-Square    R-Square     C(p)         AIC         RMSE      
## -----------------------------------------------------------------------------
##    1    MomWtGain      0.0893      0.0870    20.7604    6201.2327    559.8163    
##    2    MomSmoke       0.1069      0.1024    14.6832    6195.4042    555.0626    
##    3    Black          0.1271      0.1205     7.4659    6188.2794    549.4600    
## -----------------------------------------------------------------------------

Backward Model Selection

model.backward <- ols_step_backward_p(lm.birth, prem = 0.01, details = F)
model.backward
## 
## 
##                             Elimination Summary                             
## ---------------------------------------------------------------------------
##         Variable                  Adj.                                         
## Step    Removed     R-Square    R-Square     C(p)        AIC         RMSE      
## ---------------------------------------------------------------------------
##    1    Visit         0.1453       0.130    7.1307    6187.8448    546.4642    
##    2    MomAge        0.1445      0.1314    5.5158    6186.2385    546.0372    
##    3    Married       0.1409       0.130    5.1599    6185.9146    546.4876    
##    4    Ed            0.1364      0.1277    5.1841    6185.9688    547.1986    
##    5    Boy           0.1271      0.1205    7.4659    6188.2794    549.4600    
## ---------------------------------------------------------------------------

Adjusted R-Square Model Selection

model.best.subset <- ols_step_best_subset(lm.birth)
model.best.subset

Based on Results

From the above results, stepwise, forward, and backward models produce the same predictors: MomWtGain, MomSmoke, & Black. The best subset approach produces Black, Married, Boy, MomSmoke, Ed, & MomWtGain as the predictors. This is a reasonable model because it gives the highest adjusted r-square value out of the eight models

Ex 1.2

Stepwise Selection Linear Model

lm.step <- lm(Weight ~ MomWtGain + MomSmoke + Black, data = birthweight)
par(mfrow = c(2,2))
plot(lm.step, which = c(1:4))

Based on Results

Normality: the QQPlot shows most of the points falling along the line for the majority of the graph. Also, the sqrt(Standardized Residual) plot shows some points fall above 2. We can conclude the assumption of normality is NOT reasonable. Equal Variance: the Residual plot shows a pattern - supporting heteroscedasticity. Influential points: Cook’s Distance shows there’s one point that exceeds 0.115 - Obs 129. To confirm, we need to check if refitting the model is needed (test below)

Cook’s Distance

influential.id = which(cooks.distance(lm.step) > 0.115)
birthweight[influential.id, ]

Refit Model

lm.step2 = lm(Weight ~ MomWtGain + MomSmoke + Black, data = birthweight[-influential.id, ])

Ex 1.3

summary(lm.step2)
## 
## Call:
## lm(formula = Weight ~ MomWtGain + MomSmoke + Black, data = birthweight[-influential.id, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2427.02  -309.20     2.98   315.40  1472.75 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3434.252     32.078 107.059  < 2e-16 ***
## MomWtGain     13.112      2.113   6.204 1.39e-09 ***
## MomSmoke1   -238.923     76.251  -3.133  0.00186 ** 
## Black1      -198.519     78.022  -2.544  0.01133 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 542.2 on 395 degrees of freedom
## Multiple R-squared:  0.1366, Adjusted R-squared:  0.1301 
## F-statistic: 20.84 on 3 and 395 DF,  p-value: 1.493e-12

Based on Results

From the results above, 13.66% of the variation in Weight can be explained by the final model. This is a fairly low percentage and shows the model has low prediction power for Weight.

Ex 1.4

Based on Results

Model Significance: the model p-value is lower than the significant value of 0.05 – we can REJECt the null and conclude that the final model is useful in explaining the behavior of Weight. Individual Term Significance: All the p-values fall below 0.05 – there is a significant relationship between Weight and all three predictors (MomWtGain, MomSmoke, & Black) Estimated Regression Line: Y = 3434.252 + 13.112(MomWtGain) - 238.923(MomSmoke) - 198.519(Black) + E R-Squared: 13.66% of Weight variation can be explained by the model. This means the model has low prediciton power.

Exercise 2:

Ex 2.1

Low Birthweight Fit Logistic Regression

model.null = glm(Weight_Gr ~ 1, data = birthweight, family = "binomial")
model.full = glm(Weight_Gr ~ Black + Married + Boy + MomSmoke + Ed + MomAge + MomWtGain + Visit, data = birthweight, family = "binomial")

Stepwise AIC Model Selection

step.model.aic = step(model.null, scope = list(upper = model.full),
                      direction = "both", test = "Chisq", trace = F)
summary(step.model.aic)
## 
## Call:
## glm(formula = Weight_Gr ~ MomWtGain + MomSmoke + MomAge + Boy + 
##     Ed, family = "binomial", data = birthweight)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9790  -1.0470  -0.6085   1.0966   2.0012  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.240486   0.188075   1.279  0.20101    
## MomWtGain   -0.038047   0.008471  -4.492 7.07e-06 ***
## MomSmoke1    0.818590   0.310227   2.639  0.00832 ** 
## MomAge      -0.044444   0.019040  -2.334  0.01959 *  
## Boy1        -0.407560   0.212600  -1.917  0.05523 .  
## Ed1         -0.366259   0.217910  -1.681  0.09280 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 554.43  on 399  degrees of freedom
## Residual deviance: 510.15  on 394  degrees of freedom
## AIC: 522.15
## 
## Number of Fisher Scoring iterations: 4

Stepwise BIC Selection Model

step.model.bic = step(model.full,
                      direction = "both", test = "Chisq", trace = F, k = log(nrow(birthweight)))
summary(step.model.bic)
## 
## Call:
## glm(formula = Weight_Gr ~ MomSmoke + MomAge + MomWtGain, family = "binomial", 
##     data = birthweight)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.016  -1.073  -0.669   1.103   2.000  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.132541   0.112817  -1.175  0.24006    
## MomSmoke1    0.865786   0.309944   2.793  0.00522 ** 
## MomAge      -0.048266   0.018730  -2.577  0.00997 ** 
## MomWtGain   -0.036819   0.008389  -4.389 1.14e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 554.43  on 399  degrees of freedom
## Residual deviance: 516.39  on 396  degrees of freedom
## AIC: 524.39
## 
## Number of Fisher Scoring iterations: 4

Based on Results

From the above results, both AIC and BIC models produced the same predictors that have p-values less than 0.05: MomWtGain, MomSmoke, & MomAge

Ex 2.2

glm.model = glm(Weight_Gr ~ MomWtGain + MomSmoke + MomAge, data = birthweight, family = "binomial")
summary(glm.model)
## 
## Call:
## glm(formula = Weight_Gr ~ MomWtGain + MomSmoke + MomAge, family = "binomial", 
##     data = birthweight)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.016  -1.073  -0.669   1.103   2.000  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.132541   0.112817  -1.175  0.24006    
## MomWtGain   -0.036819   0.008389  -4.389 1.14e-05 ***
## MomSmoke1    0.865786   0.309944   2.793  0.00522 ** 
## MomAge      -0.048266   0.018730  -2.577  0.00997 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 554.43  on 399  degrees of freedom
## Residual deviance: 516.39  on 396  degrees of freedom
## AIC: 524.39
## 
## Number of Fisher Scoring iterations: 4
(inf.id = which(cooks.distance(glm.model) > 0.1))
## named integer(0)

Based on Results

All the p-values for the predictors are below 0.05 meaning they are all significant enough to be used in the model. There is also no observations with a Cook’s Distance greater than 0.1

Ex 2.3

Final Model

Log(p/1-p) = -0.1325 - 0.0368(MomWtGain) + 0.8658(MomSmoke) - 0.0483(MomAge)

Odds Ratio

round(exp(glm.model$coefficients), 3)
## (Intercept)   MomWtGain   MomSmoke1      MomAge 
##       0.876       0.964       2.377       0.953

Based on Results

Odds of Weight_Gr change by exp(-0.0368) = 0.964 with one unit increase in MomWtGain (all others constant) Odds of Weight_Gr change by exp(0.8658) = 2.377 with one unit increase in MomSmoke (all others constant) Odds of Weight_Gr change by exp(-0.0483) = 0.953with one unit increase in MomAge (all others constant)

Ex 2.4

Younger women with higher smoking levels, and lower weight are more likely to deliver an infant with a low birthweight

Ex. 2.5

fit.prob = predict(glm.model, type = "response")
sample.prob = mean(birthweight$Weight_Gr)
pred.class.2 = ifelse(fit.prob > sample.prob, 1, 0)
head(pred.class.2, 10)
##  1  2  3  4  5  6  7  8  9 10 
##  0  1  1  0  0  1  0  1  0  1

Based on Results

The sample proportion in the dataset is 5

Ex. 2.6

mean(birthweight$Weight_Gr != pred.class.2)
## [1] 0.355

Based on Results

Misclassification Rate = 35.5% Even though this is high, the cutoff point used is lower than 0.5 so there will be more false positive cases than false negatives - which is acceptable.

Ex. 2.7

Hosmer-Lemeshow Test for Goodness of Fit

hoslem.test(glm.model$y, fitted(glm.model), g = 10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  glm.model$y, fitted(glm.model)
## X-squared = 9.2068, df = 8, p-value = 0.3252

Based on Results

The test above produced a p-value of 0.1722 which is above the 0.05 significance level. We do NOT reject the null and the model is adequate.

Exercise 3:

Exercise 1

The model in Ex1 has low predictive power and is not a good model for predicting Weight

Exercise 2

The model has much better predictive power over Ex1 and is therefore a better model for predicting Weight

Based on All Results

In both, we see MomWtGain & MomSmoke as predictors so it is probable these are both significant when predicting Weight. In conclusion, when implementing a low-birthweight prevention program, it’s best for expecting women to avoid smoking and maintain healthy weight.