Multiple Linear Regression, Model Selection

Use significance levels of .05 unless the instructions state otherwise. Data Sets: You need to download dataset birthweight_final.csv. The data record live, singleton births to mothers between the ages of 18 and 45 in the United States who were classified as black or white. There are total of 400 observations in birthweight, and variables are: - Weight: Infant birth weight (gram) - Weight_Gr; Categorical variable for indication of low birthweight; 0 is normal, 1 is low birthweight - Black: Categorical variable; 0 is white, 1 is black - Married: Categorical variable; 0 is not married, 1 is married - Boy: Categorical variable; 0 is girl, 1 is boy - MomSmoke: Categorical variable; 0 is non-smoking mom, 1 is smoking mom - Ed: Categorical variable for Mother’s education Level; 0 is high-school grad or less; 1 is college grad or above - MomAge: Mother’s age (centered to zero) - MomWtGain: Mother’s weight gain during pregnancy (centered to zero) - Visit: number of prenatal visits

Multiple Linear Regression, Model Selection:

# Read the CSV file
d1 <- read.csv("birthweight_final.csv") 

d1$Black = as.factor(d1$Black)
d1$Married = as.factor(d1$Married)
d1$Boy = as.factor(d1$Boy)
d1$MomSmoke = as.factor(d1$MomSmoke)
d1$Ed = as.factor(d1$Ed)


# Check data format
#str(d1)

Now we consider fitting a logistic regression for low birthweight (Weight_Gr=1). Again consider Black, Married, Boy, MomSmoke, Ed, MomAge, MomWtGain, and Visit as possible explanatory variables.

Question (1):

(1) Perform following model selection methods and compare their best models. Comment how they differ or similar in terms of selected variables - Stepwise selection with AIC criteria
- Stepwise selection with BIC criteria

NOTE: stepwise selection with BIC criteria can be performed by step() function by adding an option k=log(n), where n is a sample size. Check Week 15 respiratory data example - how to use this option.

Answer -:

Run full and null models:

e2.glm.full <- glm(Weight_Gr ~ Black + Married + Boy + MomSmoke + Ed + MomAge + MomWtGain + Visit , data=d1, family = "binomial")
e2.glm.null <- glm(Weight_Gr ~ 1 , data=d1, family = "binomial")

Stepwise selection with AIC:

Run Stepwise with AIC criteria

e2.stepwise.AIC <- step(e2.glm.null,scope = list(upper=e2.glm.full), direction="both",test="Chisq", trace = F)
summary(e2.stepwise.AIC)

## 
## Call:
## glm(formula = Weight_Gr ~ MomWtGain + MomSmoke + MomAge + Boy + 
##     Ed, family = "binomial", data = d1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9790  -1.0470  -0.6085   1.0966   2.0012  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.240486   0.188075   1.279  0.20101    
## MomWtGain   -0.038047   0.008471  -4.492 7.07e-06 ***
## MomSmoke1    0.818590   0.310227   2.639  0.00832 ** 
## MomAge      -0.044444   0.019040  -2.334  0.01959 *  
## Boy1        -0.407560   0.212600  -1.917  0.05523 .  
## Ed1         -0.366259   0.217910  -1.681  0.09280 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 554.43  on 399  degrees of freedom
## Residual deviance: 510.15  on 394  degrees of freedom
## AIC: 522.15
## 
## Number of Fisher Scoring iterations: 4

Conclusion:

The final model, based on stepwise with AIC criteria is:

Log(Weight_Gr) = 0.240486 + 0.038047 * MomWtGain + 0.818590 * MomSmoke - 0.0444444 * MomAge - 0.407560 * Boy - 0.366259 * Ed
With alpha level=0.05, thus except Boy predictor, other predictors are significant, although we know that out final model with AIC may include the insignificant variables.

Stepwise selection with BIC:

Run Stepwise with BIC criteria

e2.stepwise.BIC <- step(e2.glm.null,scope = list(upper=e2.glm.full), direction="both",test="Chisq", trace = F,k=log(nrow(d1)))
summary(e2.stepwise.BIC)

## 
## Call:
## glm(formula = Weight_Gr ~ MomWtGain + MomSmoke + MomAge, family = "binomial", 
##     data = d1)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.016  -1.073  -0.669   1.103   2.000  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.132541   0.112817  -1.175  0.24006    
## MomWtGain   -0.036819   0.008389  -4.389 1.14e-05 ***
## MomSmoke1    0.865786   0.309944   2.793  0.00522 ** 
## MomAge      -0.048266   0.018730  -2.577  0.00997 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 554.43  on 399  degrees of freedom
## Residual deviance: 516.39  on 396  degrees of freedom
## AIC: 524.39
## 
## Number of Fisher Scoring iterations: 4

Conclusion:

The final model, based on stepwise with BIC criteria is:

Log(Weight_Gr) = -0.132541 - 0.036819 * MomWtGain + 0.865786 * MomSmoke - 0.048266 * MomAge
With alpha level=0.05, all predictors are significant, although we know that out final model with AIC may include the insignificant variables.

Answer following questions from the best model determined by stepwise selection with BIC criteria

Question (2):

(2) Fit the logistic regression with the best model determined by stepwise selection with BIC criteria. Do not leave observation which has cook’s d larger than 0.1. Re-fit the model if necessary. Finally how many observations you use in the final model?

Answer -:

Run Diagnistics, Residuals plots

We will use both Deviance and Person methods:

e2.res.deviance <-residuals(e2.stepwise.BIC, type = "deviance")
e2.res.pearson <-residuals(e2.stepwise.BIC, type = "pearson")

e2.std.res.deviance <-residuals(e2.stepwise.BIC, type = "deviance")/sqrt(1 - hatvalues(e2.stepwise.BIC)) # standardized deviance residuals
e2.std.res.pearson <-residuals(e2.stepwise.BIC, type = "pearson")/sqrt(1 - hatvalues(e2.stepwise.BIC)) # standardized pearson residuals

Show Plots:

dev.new(width = 1000, height = 1000)
par(mfrow=c(1,2))
plot(e2.res.deviance[e2.stepwise.BIC$model$Weight_Gr==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. deviance residuals", xlab = "ID")
points(e2.res.deviance[e2.stepwise.BIC$model$Weight_Gr==1], col = "blue")

plot(e2.res.pearson[e2.stepwise.BIC$model$Weight_Gr==0], col = "red", ylim = c(-3.5,3.5), ylab = "std. Pearson residuals", xlab = "ID")
points(e2.res.pearson[e2.stepwise.BIC$model$Weight_Gr==1], col = "blue")

Since in std. residuals plot, we can not find any pattern and all values (0/1) are in range [-2,2], thus the assumption is valid.

Run Diagnostics, Influence diagnostics, cook’s distance

As an assumption, cut-of = 0.1

# visual inspection

dev.new(width = 1000, height = 1000)
plot(e2.stepwise.BIC, which = 4, id.n = 5)

Find all pints > cut-off

e2.inf.id=which(cooks.distance(e2.stepwise.BIC) > 0.1)
e2.inf.id

## named integer(0)

There is not any influential points with larger value than 0.1, so no need to eliminate any values. Also no need to refit model again.

dim(d1)

## [1] 400  10

Since there is no influential points, so 400 observations are in the final model.

Question (3):

(3) Based on your final model, interpret the explicit relationship between response and predictors using Odds Ratio.

Answer -:

Calcualte Odds Ratio

## odds ratios 

e2.OR <- exp(e2.stepwise.BIC$coefficients)
round(e2.OR, 3)

## (Intercept)   MomWtGain   MomSmoke1      MomAge 
##       0.876       0.964       2.377       0.953

Based on previous part, our final model is:

Log(Weight_Gr) = -0.132541 - 0.036819 * MomWtGain + 0.865786 * MomSmoke - 0.048266 * MomAge

Conclusion:

OR(MomWtGain) = exp(-0.036819) < 1
Thus, there exist negative relationship (decrease) and high probability chance for not having Weight_Gr event (low birthweight).
OR(MomSmoke) = exp(0.865786) > 1
Thus, the estimation of OR for MomSmoke is 2.377 this means for Smoking mothers has higher effect on having infant low birthweight.
OR(MomAge) = exp(-0.048266) < 1
Thus, there exist negative relationship (decrease) and high probability chance for not having Weight_Gr event (low birthweight).

Question (4):

(4) Which woman has the high chance to deliver a low birthweight infant? For example, answer will be like “a married, high-educated, and older woman has the high chance to deliver a low birthweight infant.”

Answer -:

Based on previous part, including the final model and OR results:
- Women who Smoke more and have a lower age and have lower Weight Gain have high chance to deliver a low birthweight infants (Happened event).

In Brief:
- More Smoke
- Lower Age
- Lower WeightGain

Will cause probablibity low birthweight infants.

Question (5):

(5) What is the sample proportion of low birthweight infant in dataset?

Answer -:

Calculate Sample Proportion value:

This is sample proportion value:

k=mean(d1$Weight_Gr)
print(k)

## [1] 0.4925

Question (6):

(6) Perform classification with probability cut-off set as sample proportion you answer in (5). What is misclassification rate?

Answer -:

Run Prediction model:

e2.predict = predict(e2.stepwise.BIC, type = "response")

Classification with sample proportion (k):

e2.sample.proportion = ifelse(e2.predict > k,1,0)

Add new variables to the dataset:

d1=cbind(d1,e2.predict,e2.sample.proportion)

calculate Misclassification:

mean(d1$Weight_Gr != d1$e2.sample.proportion)  #misclassification rate from sample proportion threshold

## [1] 0.355

Question (7):

(7) Comment on Goodness of fit test and make a conclusion

Answer -:

Calculate Goodness-of-fit

H0: Model is Adequate (Fits Well)
H1: Model is not Adequate (Not fits well)

## Hosmer Lemeshow test

hoslem.test(e2.stepwise.BIC$y, fitted(e2.stepwise.BIC), g=10)

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  e2.stepwise.BIC$y, fitted(e2.stepwise.BIC)
## X-squared = 9.2068, df = 8, p-value = 0.3252

Conclussion:

Based on P-Value, since alpha level=0.1, and P-Value > 0.05, thus this is our conclusion:

We don’t have any evidence to reject the Null, thus Model is Adequate and fits well.

Multiple Linear Regression, Model Selection

Arash Hatamirad, ahatamirad@gmail.com

9/1/2021