Data Algorithms I

EXERCISE 1.1

Multiple Linear Regression

lm.birth <- lm(Weight ~ Black + Married + Boy + MomSmoke + Ed + MomAge + MomWtGain + Visit, data = birthweight)
summary(lm.birth)

## 
## Call:
## lm(formula = Weight ~ Black + Married + Boy + MomSmoke + Ed + 
##     MomAge + MomWtGain + Visit, data = birthweight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2433.18  -312.36    14.72   323.08  1562.40 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3249.006    120.298  27.008  < 2e-16 ***
## Black1      -189.710     83.770  -2.265   0.0241 *  
## Married1      63.281     69.819   0.906   0.3653    
## Boy1         118.816     55.077   2.157   0.0316 *  
## MomSmoke1   -198.047     79.065  -2.505   0.0127 *  
## Ed1           71.241     56.300   1.265   0.2065    
## MomAge         3.048      5.305   0.574   0.5660    
## MomWtGain     12.136      2.117   5.733 1.98e-08 ***
## Visit         13.626     37.691   0.362   0.7179    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 547.1 on 391 degrees of freedom
## Multiple R-squared:  0.1456, Adjusted R-squared:  0.1281 
## F-statistic: 8.328 on 8 and 391 DF,  p-value: 1.901e-10

Model Selection - Stepwise

model.stepwise <- ols_step_both_p(lm.birth, pent = 0.01, prem = 0.01, details = F)
model.stepwise

## 
##                                Stepwise Selection Summary                                 
## -----------------------------------------------------------------------------------------
##                       Added/                   Adj.                                          
## Step    Variable     Removed     R-Square    R-Square     C(p)         AIC         RMSE      
## -----------------------------------------------------------------------------------------
##    1    MomWtGain    addition       0.089       0.087    20.7600    6201.2327    559.8163    
##    2    MomSmoke     addition       0.107       0.102    14.6830    6195.4042    555.0626    
##    3      Black      addition       0.127       0.120     7.4660    6188.2794    549.4600    
## -----------------------------------------------------------------------------------------

Model Selection - Forward

model.forward<-ols_step_forward_p(lm.birth, penter = 0.01, details = F)
model.forward

## 
##                               Selection Summary                               
## -----------------------------------------------------------------------------
##         Variable                   Adj.                                          
## Step     Entered     R-Square    R-Square     C(p)         AIC         RMSE      
## -----------------------------------------------------------------------------
##    1    MomWtGain      0.0893      0.0870    20.7604    6201.2327    559.8163    
##    2    MomSmoke       0.1069      0.1024    14.6832    6195.4042    555.0626    
##    3    Black          0.1271      0.1205     7.4659    6188.2794    549.4600    
## -----------------------------------------------------------------------------

Model Selection - Backward

model.backward<-ols_step_backward_p(lm.birth, prem = 0.01, details = F) 
model.backward

## 
## 
##                             Elimination Summary                             
## ---------------------------------------------------------------------------
##         Variable                  Adj.                                         
## Step    Removed     R-Square    R-Square     C(p)        AIC         RMSE      
## ---------------------------------------------------------------------------
##    1    Visit         0.1453       0.130    7.1307    6187.8448    546.4642    
##    2    MomAge        0.1445      0.1314    5.5158    6186.2385    546.0372    
##    3    Married       0.1409       0.130    5.1599    6185.9146    546.4876    
##    4    Ed            0.1364      0.1277    5.1841    6185.9688    547.1986    
##    5    Boy           0.1271      0.1205    7.4659    6188.2794    549.4600    
## ---------------------------------------------------------------------------

Model Selection - Adjusted R-squared

model.best.subset<-ols_step_best_subset(lm.birth)
model.best.subset

Conclusion: From the various methods of model selection executed above, we can see that the stepwise, forward and backward models all yield the same predictors (MomWtGain, MomSmoke and Black) for our final model. Using the best subset approach (Adjusted R-squared), we get Black, Married, Boy MomSmoke, Ed, and MomWtGain as predictors for our final model (Model 6) This is reasonable because it has the highest Adjusted R-Square out of the eight models (0.1314)

EXCERISE 1.2

Linear Model determined by Stepwise Selection

lm.step <- lm(Weight ~ MomWtGain + MomSmoke + Black, data=birthweight)
par(mfrow=c(2,2))
plot(lm.step, which=c(1:4))

Conclusion:

Normality Check: Looking at the Normal QQ Plot, we can see that most of the points fall along the line for the majority of the graph with some points that curve off the line (though they don’t seem to be serious) Moreover, looking at the sqrt(Standardized Residual) Plot, we can see that there are some data points that fall above 2. With that being said, we can safely conclue that the assumption of normality is not reasonable.
Equal Variance Check: Looking at the residual plot, we can see a pattern in the plot. Hence, this supports heteroscedasticity.
Influential Points: Looking at the observations of Cook’s Distance, we can see that there’s one observation that exceeds the 0.115 threshold: Observation 129. To confirm its status, we can check below to see if we need to refit the model.

Cook’s Distance

influential.id <- which(cooks.distance(lm.step) > 0.115)
birthweight[influential.id, ]

Refitted Model

lm.step2 <- lm(Weight ~ MomWtGain + MomSmoke + Black, data=birthweight[-influential.id, ])

EXCERISE 1.3

summary(lm.step2)

## 
## Call:
## lm(formula = Weight ~ MomWtGain + MomSmoke + Black, data = birthweight[-influential.id, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2427.02  -309.20     2.98   315.40  1472.75 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3434.252     32.078 107.059  < 2e-16 ***
## MomWtGain     13.112      2.113   6.204 1.39e-09 ***
## MomSmoke1   -238.923     76.251  -3.133  0.00186 ** 
## Black1      -198.519     78.022  -2.544  0.01133 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 542.2 on 395 degrees of freedom
## Multiple R-squared:  0.1366, Adjusted R-squared:  0.1301 
## F-statistic: 20.84 on 3 and 395 DF,  p-value: 1.493e-12

Conclusion: From the summary table above, we can see that 13.66% of the variation of Weight can be explained by the final model, which is fairly low. Hence, this model has a fairly low prediction power for the variation of Weight.

EXERCISE 1.4

Model Significance: Looking at the F-statistic, our model p-value is 1.493e-12, which falls below the significance level of 0.05. With that being said, we can reject the null hypothesis and conclude that the model is useful to explain the behavior of Weight
Individual Term Significance: Looking at the T-test for each predictor, all the p-values fall below 0.05. With that being said, we can conclude that there’s a significant relationship between Weight and al three predictors: MomWtGain, MomSmoke, and Black
Estimated Regression Line: Y = 3434.252 + 13.112(MomWtGain) - 238.923(MomSmoke) - 198.519(Black) + E. This means that on average, a one unit increase in MomWtGain is associated with a 13.112 increase in MomWtGain, while MomSmoke and Black are being held constant. A one unit increase
R-squared: 13.66% of Weight variation can be explaiend by the final model. Hence, the final model has a fairly low predictive power for Weight.

EXERCISE 2.1

Fit Logistic Regression Model (Low Birthweight)

model.null <- glm(Weight_Gr ~ 1, data = birthweight, family = "binomial")
model.full <- glm(Weight_Gr ~ Black + Married + Boy + MomSmoke + Ed + MomAge + MomWtGain + Visit, data = birthweight, family = "binomial")

Model Selection - Stepwise (AIC)

step.model.AIC<-step(model.null, scope = list(upper=model.full),
                     direction="both",test="Chisq", trace = F) 
summary(step.model.AIC)

## 
## Call:
## glm(formula = Weight_Gr ~ MomWtGain + MomSmoke + MomAge + Boy + 
##     Ed, family = "binomial", data = birthweight)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9790  -1.0470  -0.6085   1.0966   2.0012  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.240486   0.188075   1.279  0.20101    
## MomWtGain   -0.038047   0.008471  -4.492 7.07e-06 ***
## MomSmoke1    0.818590   0.310227   2.639  0.00832 ** 
## MomAge      -0.044444   0.019040  -2.334  0.01959 *  
## Boy1        -0.407560   0.212600  -1.917  0.05523 .  
## Ed1         -0.366259   0.217910  -1.681  0.09280 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 554.43  on 399  degrees of freedom
## Residual deviance: 510.15  on 394  degrees of freedom
## AIC: 522.15
## 
## Number of Fisher Scoring iterations: 4

Model Selection - Stepwise (BIC)

step.model.BIC<-step(model.full,
                      direction="both",test="Chisq", trace = F, k=log(nrow(birthweight))) 
summary(step.model.BIC)

## 
## Call:
## glm(formula = Weight_Gr ~ MomSmoke + MomAge + MomWtGain, family = "binomial", 
##     data = birthweight)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.016  -1.073  -0.669   1.103   2.000  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.132541   0.112817  -1.175  0.24006    
## MomSmoke1    0.865786   0.309944   2.793  0.00522 ** 
## MomAge      -0.048266   0.018730  -2.577  0.00997 ** 
## MomWtGain   -0.036819   0.008389  -4.389 1.14e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 554.43  on 399  degrees of freedom
## Residual deviance: 516.39  on 396  degrees of freedom
## AIC: 524.39
## 
## Number of Fisher Scoring iterations: 4

Conclusion: From the Stepwise method of model selection with AIC and BIC criteria executed above, we can see that the both AIC and BIC based models all yielded the same predictors: MomWtGain, MomSmoke, and MomAge, as these predictors have p-values lower than significance level of 0.05.

EXERCISE 2.2

glm.model <- glm(Weight_Gr ~ MomWtGain + MomSmoke + MomAge, data = birthweight, family = binomial)
summary(glm.model)

## 
## Call:
## glm(formula = Weight_Gr ~ MomWtGain + MomSmoke + MomAge, family = binomial, 
##     data = birthweight)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.016  -1.073  -0.669   1.103   2.000  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.132541   0.112817  -1.175  0.24006    
## MomWtGain   -0.036819   0.008389  -4.389 1.14e-05 ***
## MomSmoke1    0.865786   0.309944   2.793  0.00522 ** 
## MomAge      -0.048266   0.018730  -2.577  0.00997 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 554.43  on 399  degrees of freedom
## Residual deviance: 516.39  on 396  degrees of freedom
## AIC: 524.39
## 
## Number of Fisher Scoring iterations: 4

(inf.id=which(cooks.distance(glm.model)>0.1))

## named integer(0)

Conclusion: Looking at the p-value for each individual predictor, we can see that they all fall under the threshold of 0.05. This means that they’re significant enough to be used in the model. Also, there’s no observation with Cook’s Distance higher than 0.1

##EXERCISE 2.3 #### Final Model Log(p/1-p) = -0.1325 - 0.0368 * MomWtGain + 0.8658 * MomSmoke - 0.0483 * MomAge

Odds Ratio

round(exp(glm.model$coefficients),3)

## (Intercept)   MomWtGain   MomSmoke1      MomAge 
##       0.876       0.964       2.377       0.953

Conclusion:

The odds of Weight_Gr change by a factor of exp(-0.0368) = 0.964 with a one unit increase in MomWtGain as MomAge and MomSmoke are being held constant
The odds of Weight_Gr change by a factor of exp(0.8658) = 2.377 with a one unit increase in MomSmoke as MomWtGain and MomAge are being held constant
The odds of Weight_Gr change by a factor of exp(-0.0483) = 0.953 with a one unit increase in MomAge as MomWtGain and MomSmoke are being held constant

EXERCISE 2.4

Simply put, younger women with higher smoking levels, and lower weight are more likely to deliver a low birthweight infant

EXERCISE 2.5

fit.prob <- predict(glm.model, type = "response")
sample.prop <- mean(birthweight$Weight_Gr)
pred.class.2 <- ifelse(fit.prob > sample.prop, 1, 0)
head(pred.class.2, 10)

##  1  2  3  4  5  6  7  8  9 10 
##  0  1  1  0  0  1  0  1  0  1

Conclusion: The sample proportion in the dataset is 5

EXERCISE 2.6

mean(birthweight$Weight_Gr != pred.class.2)

## [1] 0.355

Conclusion: The misclassification rate is 35.5%. Although the misclassification rate is fairly high, since the cutoff point that we’re using is lower than 0.5, there will be more false positive cases than false negative ones, which is acceptable.

EXERCISE 2.7

####Goodness-of-Fit Test (Hosmer-Lemeshow)

hoslem.test(glm.model$y, fitted(glm.model), g=10)

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  glm.model$y, fitted(glm.model)
## X-squared = 9.2068, df = 8, p-value = 0.3252

Conclusion: From the Hosmer-Lemeshow (Goodness-of-Fit) test result, we can see it yielded a p-value of 0.1722. Since this is above the significance level of 0.05, we can retain our null hypothesis and conclude that this model is in fact adequate.

EXERCISE 3

Exercise 1: Using MomWtGain, Black and MomSmoke as the predictors for our final model, only 13.66% of Weight variation can be explained by this model. With that being said, the model from Ex. 1 has a fairly low predictive power. Simply put, this is not a good model for the prediction model of Weight.
Exercise 2: Using MomWtGain, MomAge and MomSmoke as the predictors for our final model, we see that 35.5% of Weight variation can be explained by this model. With that being said, the model from Ex. 2 has a better predictive power compared to the model from Exercise 1. Simply put, this is a better model for the prediction model of Weight.
Conclusion: When comparing these two models, we see that predictors MomWtGain and MomSmoke are present in both. It’s safe to say that these two predictors are significant when predicting the variation of Weight. Thus, when implementing a low-birthweight prevention program, it’s best for women to avoid smoking and maintain a healthy weight during pregnancy.

Data Algorithms I - Final

Adrianne Kristianto

12/7/2020