Data files to reporoduce this analysis can be found on my GitHub
The reference levels can be a little confusing for the regression, so I’ll explain it here.
For predictors (i.e. independent variables), the reference is exactly what it sound like. So if I set the reference level for race to be white, the estimate for coefficients raceXYZ is in reference to white subjects, as shown below:
BIRTH %>%
mutate(
race = relevel(race, "White")
)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# ... ... ... ... ...
# raceHispanic 3.476e-02 ... ... ...
For our our outcomes it’s the second level of the factor that becomes our outcome (or “success”). So if the reference is set to having a normal birthweight, then the odds ratio is representing the odds ratio of being born low birth weight.
Below are the reference levels for the BIRTH dataset
## Classes 'tbl_df', 'tbl' and 'data.frame': 4820 obs. of 4 variables:
## $ LBW : Factor w/ 2 levels "Normal","Low": 1 1 1 1 1 1 1 1 1 1 ...
## $ PreMe : Factor w/ 2 levels "Term","Premature": 1 1 1 1 1 1 1 1 1 1 ...
## $ race : Factor w/ 4 levels "White","Black",..: 1 1 2 1 1 3 2 2 3 3 ...
## $ Wanted: Factor w/ 2 levels "Yes","No": 2 1 1 1 1 1 1 1 2 1 ...
Below are the reference levels for the PN dataset
## Classes 'tbl_df', 'tbl' and 'data.frame': 1518 obs. of 6 variables:
## $ KnowPreg : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ gotPNcare: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ LBW : Factor w/ 2 levels "Normal","Low": 1 1 1 1 1 1 1 1 1 1 ...
## $ PreMe : Factor w/ 2 levels "Term","Premature": 1 1 1 1 1 1 1 2 1 1 ...
## $ race : Factor w/ 4 levels "White","Black",..: 1 2 3 2 2 3 3 1 1 2 ...
## $ Wanted : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 2 1 2 2 2 ...
First, a refresher on the number of subjects who were low birth weight
| Normal | Low |
|---|---|
| 4362 | 458 |
This model predicts being born as low birth weight for their gestational age with the formula below
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 18.67 | 1.019 | 18.32 | 5.764e-75 |
| GA | -0.5461 | 0.02308 | -23.66 | 1.009e-123 |
| BMI | -0.02369 | 0.009261 | -2.558 | 0.01053 |
| age | 0.0223 | 0.01313 | 1.698 | 0.08955 |
| income | -0.001564 | 0.0004851 | -3.225 | 0.00126 |
| raceBlack | 0.5381 | 0.1523 | 3.534 | 0.0004093 |
| raceHispanic | -0.03476 | 0.1635 | -0.2126 | 0.8317 |
| raceOther | 0.07275 | 0.2842 | 0.256 | 0.798 |
| YrEdu | -0.01591 | 0.02749 | -0.5788 | 0.5627 |
| WantedNo | 0.04859 | 0.1221 | 0.3978 | 0.6908 |
Note that the estimate above is the Log-odds. The table below transforms the log-odds into odds ratios
| term | Odds.Ratio | conf.low | conf.high | p.value |
|---|---|---|---|---|
| GA | 0.5792 | 0.5531 | 0.6055 | 0 |
| BMI | 0.9766 | 0.9588 | 0.9943 | 0.0105 |
| age | 1.0226 | 0.9964 | 1.0491 | 0.0896 |
| income | 0.9984 | 0.9975 | 0.9994 | 0.0013 |
| raceBlack | 1.7127 | 1.2702 | 2.3082 | 4e-04 |
| raceHispanic | 0.9658 | 0.6987 | 1.3273 | 0.8317 |
| raceOther | 1.0755 | 0.5972 | 1.8290 | 0.798 |
| YrEdu | 0.9842 | 0.9324 | 1.0385 | 0.5627 |
| WantedNo | 1.0498 | 0.8257 | 1.3332 | 0.6908 |
| null.deviance | df.null | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|
| 3027 | 4819 | -1037 | 2095 | 2160 | 2075 | 4810 |
## Analysis of Deviance Table (Type III tests)
##
## Response: LBW
## LR Chisq Df Pr(>Chisq)
## GA 882.99 1 < 2.2e-16 ***
## BMI 6.71 1 0.009566 **
## age 2.85 1 0.091354 .
## income 10.72 1 0.001063 **
## race 15.50 3 0.001433 **
## YrEdu 0.34 1 0.562376
## Wanted 0.16 1 0.690907
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes | |
|---|---|---|---|
| Obs | LR chi2 | R2 | C |
| 4820 | 952.1 | 0.3844 | 0.8621 |
| Normal | d.f. | g | Dxy |
| 4362 | 9 | 1.382 | 0.7242 |
| Low | Pr(> chi2) | gr | gamma |
| 458 | 0 | 3.982 | 0.7246 |
| max |deriv| | gp | tau-a | |
| 6e-10 | 0.1174 | 0.1246 | |
| Brier | |||
| 0.05891 |
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: ifelse(BIRTH$LBW == "Normal", 0, 1), fitted(mod.LBW)
## X-squared = 11.667, df = 8, p-value = 0.1667
Impression: Everything looks good here
Variance inflation factors should be below 5 or 10
## GVIF Df GVIF^(1/(2*Df))
## GA 1.024522 1 1.012187
## BMI 1.085350 1 1.041802
## age 1.140426 1 1.067907
## income 1.472605 1 1.213509
## race 1.225970 3 1.034538
## YrEdu 1.489736 1 1.220548
## Wanted 1.035868 1 1.017776
Impression: Looks great, all VIFs are well below 5 (none are even above 2.5)
A observations have standardized residuals with an absolute value greater than 3, shown in the table below:
| index | .std.resid | .cooksd | LBW | GA | BMI | age | income | race |
|---|---|---|---|---|---|---|---|---|
| 889 | -3.06 | 0.00849 | Normal | 26 | 29 | 24 | 57 | Black |
| 1214 | 3.42 | 0.0079 | Low | 42 | 43 | 20 | 500 | White |
| 1814 | 3.46 | 0.00778 | Low | 44 | 42 | 30 | 46 | Hispanic |
| 3081 | 3.11 | 0.00448 | Low | 41 | 38 | 20 | 291 | White |
| YrEdu | Wanted | .hat |
|---|---|---|
| 10 | No | 0.000804 |
| 15 | Yes | 0.000225 |
| 15 | No | 0.000197 |
| 12 | Yes | 0.000361 |
Check for a linear relationship between continuous predictors and logit of outcome
Impression: GA shows a very pretty linear relationship, but all the other relationships don’t really look linear at all. This is certainly a problem for this regression, but is less of an issue for the other models.
Note: one observation has been removed, with a gestational age of 44 weeks
Using adjusted rates (assuming other variables are at their means)
This prediction assumes Wanted=="Yes
Ignore this section for grading purposes
Confusion matrix using test & training data
First, create a training model using 4/5ths of the original dataset. Below are the odds ratios of the original and training model
## # A tibble: 10 x 3
## coefficient `Original OR` `Training OR`
## <chr> <dbl> <dbl>
## 1 (Intercept) 128171564. 721854437.
## 2 GA 0.580 0.570
## 3 BMI 0.98 0.96
## 4 age 1.02 1.01
## 5 income 1 1
## 6 raceBlack 1.71 1.75
## 7 raceHispanic 0.97 1.04
## 8 raceOther 1.08 0.43
## 9 YrEdu 0.98 1
## 10 WantedNo 1.05 0.82
Next find the misclassifiaction error rate and plot the ROC curve
## [1] 0.0704
Finally, the confusion matrix using test & training data
## Confusion Matrix and Statistics
##
## Reference
## Prediction Normal Low
## Normal 1946 118
## Low 36 86
##
## Accuracy : 0.9296
## 95% CI : (0.918, 0.9399)
## No Information Rate : 0.9067
## P-Value [Acc > NIR] : 7.930e-05
##
## Kappa : 0.4921
##
## Mcnemar's Test P-Value : 6.703e-11
##
## Sensitivity : 0.42157
## Specificity : 0.98184
## Pos Pred Value : 0.70492
## Neg Pred Value : 0.94283
## Prevalence : 0.09332
## Detection Rate : 0.03934
## Detection Prevalence : 0.05581
## Balanced Accuracy : 0.70170
##
## 'Positive' Class : Low
##
Again, a refresher on the number of subjects who were born before 37 weeks
| Term | Premature |
|---|---|
| 4234 | 586 |
This model predicts being born prematurely with the formula below
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -2.21 | 0.3859 | -5.727 | 1.025e-08 |
| BMI | 0.01078 | 0.006525 | 1.652 | 0.09846 |
| age | 0.01377 | 0.00973 | 1.416 | 0.1569 |
| income | -0.0009885 | 0.0003437 | -2.876 | 0.004027 |
| raceBlack | 0.01558 | 0.1135 | 0.1373 | 0.8908 |
| raceHispanic | -0.2512 | 0.1197 | -2.099 | 0.03584 |
| raceOther | -0.2384 | 0.2177 | -1.095 | 0.2736 |
| YrEdu | -0.02091 | 0.02009 | -1.04 | 0.2981 |
| WantedNo | 0.2254 | 0.09006 | 2.502 | 0.01234 |
Again, table with the transformed odds ratios
| term | Odds.Ratio | conf.low | conf.high | p.value |
|---|---|---|---|---|
| BMI | 1.0108 | 0.9979 | 1.0238 | 0.0985 |
| age | 1.0139 | 0.9946 | 1.0333 | 0.1569 |
| income | 0.9990 | 0.9983 | 0.9997 | 0.004 |
| raceBlack | 1.0157 | 0.8120 | 1.2675 | 0.8908 |
| raceHispanic | 0.7779 | 0.6137 | 0.9814 | 0.0358 |
| raceOther | 0.7879 | 0.5033 | 1.1859 | 0.2736 |
| YrEdu | 0.9793 | 0.9414 | 1.0186 | 0.2981 |
| WantedNo | 1.2528 | 1.0498 | 1.4946 | 0.0123 |
| null.deviance | df.null | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|
| 3567 | 4819 | -1766 | 3550 | 3609 | 3532 | 4811 |
## Analysis of Deviance Table (Type III tests)
##
## Response: PreMe
## LR Chisq Df Pr(>Chisq)
## BMI 2.7018 1 0.100232
## age 1.9845 1 0.158915
## income 8.4442 1 0.003662 **
## race 6.3722 3 0.094843 .
## YrEdu 1.0848 1 0.297626
## Wanted 6.2420 1 0.012476 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes | |
|---|---|---|---|
| Obs | LR chi2 | R2 | C |
| 4820 | 34.92 | 0.01381 | 0.5778 |
| Term | d.f. | g | Dxy |
| 4234 | 8 | 0.2996 | 0.1556 |
| Premature | Pr(> chi2) | gr | gamma |
| 586 | 2.762e-05 | 1.349 | 0.1559 |
| max |deriv| | gp | tau-a | |
| 4e-10 | 0.03165 | 0.03325 | |
| Brier | |||
| 0.106 |
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: ifelse(BIRTH$PreMe == "Term", 0, 1), fitted(mod.PreMe)
## X-squared = 13.275, df = 8, p-value = 0.1027
Impression: Everything looks good here
Variance inflation factors should be below 5 or 10
## GVIF Df GVIF^(1/(2*Df))
## BMI 1.071061 1 1.034921
## age 1.147842 1 1.071374
## income 1.434414 1 1.197670
## race 1.208518 3 1.032069
## YrEdu 1.481480 1 1.217161
## Wanted 1.034409 1 1.017059
Impression: Looks great, all VIFs are well below 5 (none are even above 2.5)
Check for a linear relationship between continuous predictors and logit of outcome
Impression: Well this looks better than the first regression.
Income looks curvilinear, perhaps because the artificial grouping around 500% of the FPL and some effects for those in deep poverty (i.e. below 100%). BMI and age also have some problems at higher logit predicted values. Somewhat surprisingly, YrEdu doesn’t look bad at all, and this is the only variable I’d really consider making categorical.
Using adjusted rates (assuming other variables are at their means)
Counts of who knew they were pregnant within the first 6 weeks of the pregnancy
| No | Yes |
|---|---|
| 422 | 1096 |
This model predicts the chance of subjects knowing they were pregnant within the first 6 weeks
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -0.6286 | 0.446 | -1.41 | 0.1587 |
| PregNum | -0.04267 | 0.03898 | -1.095 | 0.2737 |
| age | 0.0235 | 0.01376 | 1.708 | 0.08772 |
| income | 0.001165 | 0.0005047 | 2.308 | 0.02103 |
| raceBlack | -0.5558 | 0.151 | -3.681 | 0.0002322 |
| raceHispanic | -0.1019 | 0.1584 | -0.643 | 0.5202 |
| raceOther | -0.5359 | 0.2506 | -2.138 | 0.03249 |
| YrEdu | 0.09106 | 0.02829 | 3.219 | 0.001286 |
| WantedNo | -0.3587 | 0.1212 | -2.96 | 0.003079 |
Odds ratios
| term | Odds.Ratio | conf.low | conf.high | p.value |
|---|---|---|---|---|
| PregNum | 0.9582 | 0.8877 | 1.0346 | 0.2737 |
| age | 1.0238 | 0.9967 | 1.0520 | 0.0877 |
| income | 1.0012 | 1.0002 | 1.0022 | 0.021 |
| raceBlack | 0.5736 | 0.4266 | 0.7713 | 2e-04 |
| raceHispanic | 0.9031 | 0.6631 | 1.2345 | 0.5202 |
| raceOther | 0.5851 | 0.3611 | 0.9676 | 0.0325 |
| YrEdu | 1.0953 | 1.0365 | 1.1581 | 0.0013 |
| WantedNo | 0.6986 | 0.5509 | 0.8862 | 0.0031 |
| null.deviance | df.null | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|
| 1794 | 1517 | -843.6 | 1705 | 1753 | 1687 | 1509 |
## Analysis of Deviance Table (Type III tests)
##
## Response: KnowPreg
## LR Chisq Df Pr(>Chisq)
## PregNum 1.1944 1 0.2744379
## age 2.9460 1 0.0860917 .
## income 5.4060 1 0.0200680 *
## race 16.6449 3 0.0008361 ***
## YrEdu 10.4732 1 0.0012112 **
## Wanted 8.7215 1 0.0031447 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes | |
|---|---|---|---|
| Obs | LR chi2 | R2 | C |
| 1518 | 107.3 | 0.09843 | 0.67 |
| No | d.f. | g | Dxy |
| 422 | 8 | 0.7196 | 0.3399 |
| Yes | Pr(> chi2) | gr | gamma |
| 1096 | 0 | 2.054 | 0.3401 |
| max |deriv| | gp | tau-a | |
| 3e-06 | 0.1349 | 0.1366 | |
| Brier | |||
| 0.1872 |
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: ifelse(PN$KnowPreg == "Yes", 1, 0), fitted(mod.KnowPreg)
## X-squared = 10.765, df = 8, p-value = 0.2154
Impression: Everything looks good here
Variance inflation factors should be below 5 or 10
## GVIF Df GVIF^(1/(2*Df))
## PregNum 1.174903 1 1.083930
## age 1.284593 1 1.133399
## income 1.564165 1 1.250666
## race 1.177839 3 1.027656
## YrEdu 1.528460 1 1.236309
## Wanted 1.035627 1 1.017658
Impression: Looks great, all VIFs are well below 5 (none are even above 2.5)
Check for a linear relationship between continuous predictors and logit of outcome
Impression: Unlike the last two regressions, age and BMI look pretty linear. Again income appears to deviate from linearity once it’s below 100% of the FPL, and it looks like there might be a difference in YrEdu for those with less than a high school education (YrEdu < 12) and everyone else. Finally, PregNum looks to be linear despite the few outliers (who were having their 10th or greater pregnancy).
Counts of who got prenatal care in the first trimester
| No | Yes |
|---|---|
| 172 | 1346 |
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | 1.157 | 0.6678 | 1.733 | 0.08318 |
| KnowPregYes | 2.254 | 0.1954 | 11.53 | 8.937e-31 |
| age | 0.01162 | 0.02046 | 0.5679 | 0.5701 |
| income | 0.00208 | 0.000814 | 2.555 | 0.01062 |
| raceBlack | -0.05857 | 0.2275 | -0.2575 | 0.7968 |
| raceHispanic | -0.2578 | 0.2364 | -1.09 | 0.2756 |
| raceOther | -0.1844 | 0.3868 | -0.4768 | 0.6335 |
| YrEdu | -0.01792 | 0.04308 | -0.4159 | 0.6775 |
| PregNum | -0.1317 | 0.0546 | -2.412 | 0.01588 |
| WantedNo | -0.3936 | 0.1821 | -2.162 | 0.03065 |
Odds ratios
| term | Odds.Ratio | conf.low | conf.high | p.value |
|---|---|---|---|---|
| KnowPregYes | 9.5245 | 6.5468 | 14.1062 | 0 |
| age | 1.0117 | 0.9723 | 1.0536 | 0.5701 |
| income | 1.0021 | 1.0005 | 1.0037 | 0.0106 |
| raceBlack | 0.9431 | 0.6042 | 1.4759 | 0.7968 |
| raceHispanic | 0.7728 | 0.4870 | 1.2326 | 0.2756 |
| raceOther | 0.8316 | 0.4009 | 1.8463 | 0.6335 |
| YrEdu | 0.9822 | 0.9026 | 1.0689 | 0.6775 |
| PregNum | 0.8766 | 0.7873 | 0.9757 | 0.0159 |
| WantedNo | 0.6746 | 0.4715 | 0.9637 | 0.0307 |
| null.deviance | df.null | logLik | AIC | BIC | deviance | df.residual |
|---|---|---|---|---|---|---|
| 1073 | 1517 | -424 | 868.1 | 921.3 | 848.1 | 1508 |
## Analysis of Deviance Table (Type III tests)
##
## Response: gotPNcare
## LR Chisq Df Pr(>Chisq)
## KnowPreg 156.861 1 < 2.2e-16 ***
## age 0.325 1 0.568874
## income 6.913 1 0.008556 **
## race 1.318 3 0.724950
## YrEdu 0.173 1 0.677462
## PregNum 5.802 1 0.016008 *
## Wanted 4.679 1 0.030533 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes | |
|---|---|---|---|
| Obs | LR chi2 | R2 | C |
| 1518 | 224.8 | 0.2716 | 0.8227 |
| No | d.f. | g | Dxy |
| 172 | 9 | 1.342 | 0.6455 |
| Yes | Pr(> chi2) | gr | gamma |
| 1346 | 0 | 3.828 | 0.646 |
| max |deriv| | gp | tau-a | |
| 8e-06 | 0.1275 | 0.1298 | |
| Brier | |||
| 0.08336 |
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: ifelse(PN$gotPNcare == "Yes", 1, 0), fitted(mod.gotPNcare)
## X-squared = 12.33, df = 8, p-value = 0.1371
Impression: Everything looks good here
Variance inflation factors should be below 5 or 10
## GVIF Df GVIF^(1/(2*Df))
## KnowPreg 1.041892 1 1.020731
## age 1.278995 1 1.130927
## income 1.472748 1 1.213568
## race 1.178061 3 1.027688
## YrEdu 1.421861 1 1.192418
## PregNum 1.184046 1 1.088139
## Wanted 1.040265 1 1.019934
Impression: Looks great, all VIFs are well below 5 (none are even above 2.5)
Check for a linear relationship between continuous predictors and logit of outcome
Impression: Because the x-axis is a function of the predicted values and we included KnowPreg in our regression (with a massive effect size), these plots spread into two distict groups based on this predictor. The overall loess line (red) clearly doesn’t demonstrate a linear relationship, but within each group it looks like there’s better linearity. At this point I’m not sure what my impression should be, so I’ll just leave it there and let you tell me the answer. Is this too problematic to accept regression 4 as valid?
Note: Two pregnancies (with pregnum=14) were removed
I’ve read mixed information on what is important for the assumption of linearity, and in the plots above I’ve look at linear relationships between predictors and the logit of the predicted outcome. After the presentation in class (and more reading) I’m adding this section looking at the linear relationships between the predictors and their pearson residuals using the car package.
From what I’m able to gather from reading up on the issue, the pearson residuals vs predictor line should be straight and have a slope of zero. Additionally, the car::residualPlots() function provides a lack of fit test for each variables’ relationship with the residuals.
## Test stat Pr(>|Test stat|)
## GA 9.0744 0.002592 **
## BMI 0.0093 0.923035
## age 3.8886 0.048614 *
## income 3.0685 0.079825 .
## race
## YrEdu 4.4824 0.034246 *
## Wanted
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Test stat Pr(>|Test stat|)
## BMI 4.4619 0.03466 *
## age 0.3181 0.57276
## income 0.0207 0.88559
## race
## YrEdu 1.1109 0.29188
## Wanted
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Test stat Pr(>|Test stat|)
## PregNum 0.0247 0.87503
## age 2.1298 0.14446
## income 0.3208 0.57114
## race
## YrEdu 4.5145 0.03361 *
## Wanted
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Test stat Pr(>|Test stat|)
## KnowPreg
## age 1.4531 0.2280
## income 1.1103 0.2920
## race
## YrEdu 0.5658 0.4520
## PregNum 0.6566 0.4178
## Wanted