Data files to reporoduce this analysis can be found on my GitHub

Set reference levels

The reference levels can be a little confusing for the regression, so I’ll explain it here.

For predictors (i.e. independent variables), the reference is exactly what it sound like. So if I set the reference level for race to be white, the estimate for coefficients raceXYZ is in reference to white subjects, as shown below:

BIRTH %>%
  mutate(
    race = relevel(race, "White")
  )

# Coefficients:
#                Estimate Std. Error z value Pr(>|z|)    
# ...              ...        ...       ...   ...
# raceHispanic  3.476e-02     ...       ...   ...

For our our outcomes it’s the second level of the factor that becomes our outcome (or “success”). So if the reference is set to having a normal birthweight, then the odds ratio is representing the odds ratio of being born low birth weight.

Below are the reference levels for the BIRTH dataset

## Classes 'tbl_df', 'tbl' and 'data.frame':    4820 obs. of  4 variables:
##  $ LBW   : Factor w/ 2 levels "Normal","Low": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PreMe : Factor w/ 2 levels "Term","Premature": 1 1 1 1 1 1 1 1 1 1 ...
##  $ race  : Factor w/ 4 levels "White","Black",..: 1 1 2 1 1 3 2 2 3 3 ...
##  $ Wanted: Factor w/ 2 levels "Yes","No": 2 1 1 1 1 1 1 1 2 1 ...

Below are the reference levels for the PN dataset

## Classes 'tbl_df', 'tbl' and 'data.frame':    1518 obs. of  6 variables:
##  $ KnowPreg : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gotPNcare: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LBW      : Factor w/ 2 levels "Normal","Low": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PreMe    : Factor w/ 2 levels "Term","Premature": 1 1 1 1 1 1 1 2 1 1 ...
##  $ race     : Factor w/ 4 levels "White","Black",..: 1 2 3 2 2 3 3 1 1 2 ...
##  $ Wanted   : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 2 1 2 2 2 ...

Low Birth Weight

First, a refresher on the number of subjects who were low birth weight

Counts of subjects with normal and low birth weights
Normal Low
4362 458

This model predicts being born as low birth weight for their gestational age with the formula below

Fitting generalized (binomial/logit) linear model: LBW ~ GA + BMI + age + income + race + YrEdu + Wanted
  Estimate Std. Error z value Pr(>|z|)
(Intercept) 18.67 1.019 18.32 5.764e-75
GA -0.5461 0.02308 -23.66 1.009e-123
BMI -0.02369 0.009261 -2.558 0.01053
age 0.0223 0.01313 1.698 0.08955
income -0.001564 0.0004851 -3.225 0.00126
raceBlack 0.5381 0.1523 3.534 0.0004093
raceHispanic -0.03476 0.1635 -0.2126 0.8317
raceOther 0.07275 0.2842 0.256 0.798
YrEdu -0.01591 0.02749 -0.5788 0.5627
WantedNo 0.04859 0.1221 0.3978 0.6908

Note that the estimate above is the Log-odds. The table below transforms the log-odds into odds ratios

term Odds.Ratio conf.low conf.high p.value
GA 0.5792 0.5531 0.6055 0
BMI 0.9766 0.9588 0.9943 0.0105
age 1.0226 0.9964 1.0491 0.0896
income 0.9984 0.9975 0.9994 0.0013
raceBlack 1.7127 1.2702 2.3082 4e-04
raceHispanic 0.9658 0.6987 1.3273 0.8317
raceOther 1.0755 0.5972 1.8290 0.798
YrEdu 0.9842 0.9324 1.0385 0.5627
WantedNo 1.0498 0.8257 1.3332 0.6908
null.deviance df.null logLik AIC BIC deviance df.residual
3027 4819 -1037 2095 2160 2075 4810

Type III Anova

## Analysis of Deviance Table (Type III tests)
## 
## Response: LBW
##        LR Chisq Df Pr(>Chisq)    
## GA       882.99  1  < 2.2e-16 ***
## BMI        6.71  1   0.009566 ** 
## age        2.85  1   0.091354 .  
## income    10.72  1   0.001063 ** 
## race      15.50  3   0.001433 ** 
## YrEdu      0.34  1   0.562376    
## Wanted     0.16  1   0.690907    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hosmer-Lemeshow & more

Fitting logistic regression model: mod.LBW[[“formula”]]
  Model Likelihood Ratio Test Discrimination Indexes Rank Discrim. Indexes
Obs LR chi2 R2 C
4820 952.1 0.3844 0.8621
Normal d.f. g Dxy
4362 9 1.382 0.7242
Low Pr(> chi2) gr gamma
458 0 3.982 0.7246
max |deriv| gp tau-a
6e-10 0.1174 0.1246
Brier
0.05891
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  ifelse(BIRTH$LBW == "Normal", 0, 1), fitted(mod.LBW)
## X-squared = 11.667, df = 8, p-value = 0.1667

Impression: Everything looks good here

Multicolinearity / VIF

Variance inflation factors should be below 5 or 10

##            GVIF Df GVIF^(1/(2*Df))
## GA     1.024522  1        1.012187
## BMI    1.085350  1        1.041802
## age    1.140426  1        1.067907
## income 1.472605  1        1.213509
## race   1.225970  3        1.034538
## YrEdu  1.489736  1        1.220548
## Wanted 1.035868  1        1.017776

Impression: Looks great, all VIFs are well below 5 (none are even above 2.5)

Influential values

A observations have standardized residuals with an absolute value greater than 3, shown in the table below:

standardized residuals with abs value > 3 (continued below)
index .std.resid .cooksd LBW GA BMI age income race
889 -3.06 0.00849 Normal 26 29 24 57 Black
1214 3.42 0.0079 Low 42 43 20 500 White
1814 3.46 0.00778 Low 44 42 30 46 Hispanic
3081 3.11 0.00448 Low 41 38 20 291 White
YrEdu Wanted .hat
10 No 0.000804
15 Yes 0.000225
15 No 0.000197
12 Yes 0.000361

Linearity

Check for a linear relationship between continuous predictors and logit of outcome

Impression: GA shows a very pretty linear relationship, but all the other relationships don’t really look linear at all. This is certainly a problem for this regression, but is less of an issue for the other models.

Note: one observation has been removed, with a gestational age of 44 weeks

Prediction

Using adjusted rates (assuming other variables are at their means)

This prediction assumes Wanted=="Yes

Cross-Validation

Ignore this section for grading purposes

Confusion matrix using test & training data

First, create a training model using 4/5ths of the original dataset. Below are the odds ratios of the original and training model

## # A tibble: 10 x 3
##    coefficient  `Original OR` `Training OR`
##    <chr>                <dbl>         <dbl>
##  1 (Intercept)  128171564.    721854437.   
##  2 GA                   0.580         0.570
##  3 BMI                  0.98          0.96 
##  4 age                  1.02          1.01 
##  5 income               1             1    
##  6 raceBlack            1.71          1.75 
##  7 raceHispanic         0.97          1.04 
##  8 raceOther            1.08          0.43 
##  9 YrEdu                0.98          1    
## 10 WantedNo             1.05          0.82

Next find the misclassifiaction error rate and plot the ROC curve

## [1] 0.0704

Finally, the confusion matrix using test & training data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Normal  Low
##     Normal   1946  118
##     Low        36   86
##                                          
##                Accuracy : 0.9296         
##                  95% CI : (0.918, 0.9399)
##     No Information Rate : 0.9067         
##     P-Value [Acc > NIR] : 7.930e-05      
##                                          
##                   Kappa : 0.4921         
##                                          
##  Mcnemar's Test P-Value : 6.703e-11      
##                                          
##             Sensitivity : 0.42157        
##             Specificity : 0.98184        
##          Pos Pred Value : 0.70492        
##          Neg Pred Value : 0.94283        
##              Prevalence : 0.09332        
##          Detection Rate : 0.03934        
##    Detection Prevalence : 0.05581        
##       Balanced Accuracy : 0.70170        
##                                          
##        'Positive' Class : Low            
## 

Premature birth

Again, a refresher on the number of subjects who were born before 37 weeks

Counts of subjects born at term and prematurely
Term Premature
4234 586

This model predicts being born prematurely with the formula below

Fitting generalized (binomial/logit) linear model: PreMe ~ BMI + age + income + race + YrEdu + Wanted
  Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.21 0.3859 -5.727 1.025e-08
BMI 0.01078 0.006525 1.652 0.09846
age 0.01377 0.00973 1.416 0.1569
income -0.0009885 0.0003437 -2.876 0.004027
raceBlack 0.01558 0.1135 0.1373 0.8908
raceHispanic -0.2512 0.1197 -2.099 0.03584
raceOther -0.2384 0.2177 -1.095 0.2736
YrEdu -0.02091 0.02009 -1.04 0.2981
WantedNo 0.2254 0.09006 2.502 0.01234

Again, table with the transformed odds ratios

term Odds.Ratio conf.low conf.high p.value
BMI 1.0108 0.9979 1.0238 0.0985
age 1.0139 0.9946 1.0333 0.1569
income 0.9990 0.9983 0.9997 0.004
raceBlack 1.0157 0.8120 1.2675 0.8908
raceHispanic 0.7779 0.6137 0.9814 0.0358
raceOther 0.7879 0.5033 1.1859 0.2736
YrEdu 0.9793 0.9414 1.0186 0.2981
WantedNo 1.2528 1.0498 1.4946 0.0123
null.deviance df.null logLik AIC BIC deviance df.residual
3567 4819 -1766 3550 3609 3532 4811

Type III Anova

## Analysis of Deviance Table (Type III tests)
## 
## Response: PreMe
##        LR Chisq Df Pr(>Chisq)   
## BMI      2.7018  1   0.100232   
## age      1.9845  1   0.158915   
## income   8.4442  1   0.003662 **
## race     6.3722  3   0.094843 . 
## YrEdu    1.0848  1   0.297626   
## Wanted   6.2420  1   0.012476 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hosmer-Lemeshow & more

Fitting logistic regression model: mod.PreMe[[“formula”]]
  Model Likelihood Ratio Test Discrimination Indexes Rank Discrim. Indexes
Obs LR chi2 R2 C
4820 34.92 0.01381 0.5778
Term d.f. g Dxy
4234 8 0.2996 0.1556
Premature Pr(> chi2) gr gamma
586 2.762e-05 1.349 0.1559
max |deriv| gp tau-a
4e-10 0.03165 0.03325
Brier
0.106
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  ifelse(BIRTH$PreMe == "Term", 0, 1), fitted(mod.PreMe)
## X-squared = 13.275, df = 8, p-value = 0.1027

Impression: Everything looks good here

Multicolinearity / VIF

Variance inflation factors should be below 5 or 10

##            GVIF Df GVIF^(1/(2*Df))
## BMI    1.071061  1        1.034921
## age    1.147842  1        1.071374
## income 1.434414  1        1.197670
## race   1.208518  3        1.032069
## YrEdu  1.481480  1        1.217161
## Wanted 1.034409  1        1.017059

Impression: Looks great, all VIFs are well below 5 (none are even above 2.5)

Influential values

Linearity

Check for a linear relationship between continuous predictors and logit of outcome

Impression: Well this looks better than the first regression.

Income looks curvilinear, perhaps because the artificial grouping around 500% of the FPL and some effects for those in deep poverty (i.e. below 100%). BMI and age also have some problems at higher logit predicted values. Somewhat surprisingly, YrEdu doesn’t look bad at all, and this is the only variable I’d really consider making categorical.

Prediction

Using adjusted rates (assuming other variables are at their means)

Knew pregnant

Counts of who knew they were pregnant within the first 6 weeks of the pregnancy

Counts of if mother knew pregnant or not within 6 weeks
No Yes
422 1096

This model predicts the chance of subjects knowing they were pregnant within the first 6 weeks

Fitting generalized (binomial/logit) linear model: KnowPreg ~ PregNum + age + income + race + YrEdu + PregNum + Wanted
  Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.6286 0.446 -1.41 0.1587
PregNum -0.04267 0.03898 -1.095 0.2737
age 0.0235 0.01376 1.708 0.08772
income 0.001165 0.0005047 2.308 0.02103
raceBlack -0.5558 0.151 -3.681 0.0002322
raceHispanic -0.1019 0.1584 -0.643 0.5202
raceOther -0.5359 0.2506 -2.138 0.03249
YrEdu 0.09106 0.02829 3.219 0.001286
WantedNo -0.3587 0.1212 -2.96 0.003079

Odds ratios

term Odds.Ratio conf.low conf.high p.value
PregNum 0.9582 0.8877 1.0346 0.2737
age 1.0238 0.9967 1.0520 0.0877
income 1.0012 1.0002 1.0022 0.021
raceBlack 0.5736 0.4266 0.7713 2e-04
raceHispanic 0.9031 0.6631 1.2345 0.5202
raceOther 0.5851 0.3611 0.9676 0.0325
YrEdu 1.0953 1.0365 1.1581 0.0013
WantedNo 0.6986 0.5509 0.8862 0.0031
null.deviance df.null logLik AIC BIC deviance df.residual
1794 1517 -843.6 1705 1753 1687 1509

Type III Anova

## Analysis of Deviance Table (Type III tests)
## 
## Response: KnowPreg
##         LR Chisq Df Pr(>Chisq)    
## PregNum   1.1944  1  0.2744379    
## age       2.9460  1  0.0860917 .  
## income    5.4060  1  0.0200680 *  
## race     16.6449  3  0.0008361 ***
## YrEdu    10.4732  1  0.0012112 ** 
## Wanted    8.7215  1  0.0031447 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hosmer-Lemeshow & more

Fitting logistic regression model: mod.KnowPreg[[“formula”]]
  Model Likelihood Ratio Test Discrimination Indexes Rank Discrim. Indexes
Obs LR chi2 R2 C
1518 107.3 0.09843 0.67
No d.f. g Dxy
422 8 0.7196 0.3399
Yes Pr(> chi2) gr gamma
1096 0 2.054 0.3401
max |deriv| gp tau-a
3e-06 0.1349 0.1366
Brier
0.1872
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  ifelse(PN$KnowPreg == "Yes", 1, 0), fitted(mod.KnowPreg)
## X-squared = 10.765, df = 8, p-value = 0.2154

Impression: Everything looks good here

Multicolinearity / VIF

Variance inflation factors should be below 5 or 10

##             GVIF Df GVIF^(1/(2*Df))
## PregNum 1.174903  1        1.083930
## age     1.284593  1        1.133399
## income  1.564165  1        1.250666
## race    1.177839  3        1.027656
## YrEdu   1.528460  1        1.236309
## Wanted  1.035627  1        1.017658

Impression: Looks great, all VIFs are well below 5 (none are even above 2.5)

Influential values

Linearity

Check for a linear relationship between continuous predictors and logit of outcome

Impression: Unlike the last two regressions, age and BMI look pretty linear. Again income appears to deviate from linearity once it’s below 100% of the FPL, and it looks like there might be a difference in YrEdu for those with less than a high school education (YrEdu < 12) and everyone else. Finally, PregNum looks to be linear despite the few outliers (who were having their 10th or greater pregnancy).

Prediction

Got prenatal care

Counts of who got prenatal care in the first trimester

Counts of if mother got prenatal care in the first trimester
No Yes
172 1346
Fitting generalized (binomial/logit) linear model: gotPNcare ~ KnowPreg + age + income + race + YrEdu + PregNum + Wanted
  Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.157 0.6678 1.733 0.08318
KnowPregYes 2.254 0.1954 11.53 8.937e-31
age 0.01162 0.02046 0.5679 0.5701
income 0.00208 0.000814 2.555 0.01062
raceBlack -0.05857 0.2275 -0.2575 0.7968
raceHispanic -0.2578 0.2364 -1.09 0.2756
raceOther -0.1844 0.3868 -0.4768 0.6335
YrEdu -0.01792 0.04308 -0.4159 0.6775
PregNum -0.1317 0.0546 -2.412 0.01588
WantedNo -0.3936 0.1821 -2.162 0.03065

Odds ratios

term Odds.Ratio conf.low conf.high p.value
KnowPregYes 9.5245 6.5468 14.1062 0
age 1.0117 0.9723 1.0536 0.5701
income 1.0021 1.0005 1.0037 0.0106
raceBlack 0.9431 0.6042 1.4759 0.7968
raceHispanic 0.7728 0.4870 1.2326 0.2756
raceOther 0.8316 0.4009 1.8463 0.6335
YrEdu 0.9822 0.9026 1.0689 0.6775
PregNum 0.8766 0.7873 0.9757 0.0159
WantedNo 0.6746 0.4715 0.9637 0.0307
null.deviance df.null logLik AIC BIC deviance df.residual
1073 1517 -424 868.1 921.3 848.1 1508

Type III Anova

## Analysis of Deviance Table (Type III tests)
## 
## Response: gotPNcare
##          LR Chisq Df Pr(>Chisq)    
## KnowPreg  156.861  1  < 2.2e-16 ***
## age         0.325  1   0.568874    
## income      6.913  1   0.008556 ** 
## race        1.318  3   0.724950    
## YrEdu       0.173  1   0.677462    
## PregNum     5.802  1   0.016008 *  
## Wanted      4.679  1   0.030533 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hosmer-Lemeshow & more

Fitting logistic regression model: mod.gotPNcare[[“formula”]]
  Model Likelihood Ratio Test Discrimination Indexes Rank Discrim. Indexes
Obs LR chi2 R2 C
1518 224.8 0.2716 0.8227
No d.f. g Dxy
172 9 1.342 0.6455
Yes Pr(> chi2) gr gamma
1346 0 3.828 0.646
max |deriv| gp tau-a
8e-06 0.1275 0.1298
Brier
0.08336
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  ifelse(PN$gotPNcare == "Yes", 1, 0), fitted(mod.gotPNcare)
## X-squared = 12.33, df = 8, p-value = 0.1371

Impression: Everything looks good here

Multicolinearity / VIF

Variance inflation factors should be below 5 or 10

##              GVIF Df GVIF^(1/(2*Df))
## KnowPreg 1.041892  1        1.020731
## age      1.278995  1        1.130927
## income   1.472748  1        1.213568
## race     1.178061  3        1.027688
## YrEdu    1.421861  1        1.192418
## PregNum  1.184046  1        1.088139
## Wanted   1.040265  1        1.019934

Impression: Looks great, all VIFs are well below 5 (none are even above 2.5)

Influential values

Linearity

Check for a linear relationship between continuous predictors and logit of outcome

Impression: Because the x-axis is a function of the predicted values and we included KnowPreg in our regression (with a massive effect size), these plots spread into two distict groups based on this predictor. The overall loess line (red) clearly doesn’t demonstrate a linear relationship, but within each group it looks like there’s better linearity. At this point I’m not sure what my impression should be, so I’ll just leave it there and let you tell me the answer. Is this too problematic to accept regression 4 as valid?

Note: Two pregnancies (with pregnum=14) were removed

Prediction

A bit on linearity

I’ve read mixed information on what is important for the assumption of linearity, and in the plots above I’ve look at linear relationships between predictors and the logit of the predicted outcome. After the presentation in class (and more reading) I’m adding this section looking at the linear relationships between the predictors and their pearson residuals using the car package.

From what I’m able to gather from reading up on the issue, the pearson residuals vs predictor line should be straight and have a slope of zero. Additionally, the car::residualPlots() function provides a lack of fit test for each variables’ relationship with the residuals.

Low birth weight

##        Test stat Pr(>|Test stat|)   
## GA        9.0744         0.002592 **
## BMI       0.0093         0.923035   
## age       3.8886         0.048614 * 
## income    3.0685         0.079825 . 
## race                                
## YrEdu     4.4824         0.034246 * 
## Wanted                              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Premature birth

##        Test stat Pr(>|Test stat|)  
## BMI       4.4619          0.03466 *
## age       0.3181          0.57276  
## income    0.0207          0.88559  
## race                               
## YrEdu     1.1109          0.29188  
## Wanted                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Knew pregnant

##         Test stat Pr(>|Test stat|)  
## PregNum    0.0247          0.87503  
## age        2.1298          0.14446  
## income     0.3208          0.57114  
## race                                
## YrEdu      4.5145          0.03361 *
## Wanted                              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Got prenatal care

##          Test stat Pr(>|Test stat|)
## KnowPreg                           
## age         1.4531           0.2280
## income      1.1103           0.2920
## race                               
## YrEdu       0.5658           0.4520
## PregNum     0.6566           0.4178
## Wanted