Data files to reporoduce this analysis can be found on my GitHub

Set reference levels

The reference levels can be a little confusing for the regression, so I’ll explain it here.

For predictors (i.e. independent variables), the reference is exactly what it sound like. So if I set the reference level for race to be white, the estimate for coefficients raceXYZ is in reference to white subjects, as shown below:

BIRTH %>%
  mutate(
    race = relevel(race, "White")
  )

# Coefficients:
#                Estimate Std. Error z value Pr(>|z|)    
# ...              ...        ...       ...   ...
# raceHispanic  3.476e-02     ...       ...   ...

For our our outcomes it’s the second level of the factor that becomes our outcome (or “success”). So if the reference is set to having a normal birthweight, then the odds ratio is representing the odds ratio of being born low birth weight.

Below are the reference levels for the BIRTH dataset

## Classes 'tbl_df', 'tbl' and 'data.frame':    4820 obs. of  4 variables:
##  $ LBW   : Factor w/ 2 levels "Normal","Low": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PreMe : Factor w/ 2 levels "Term","Premature": 1 1 1 1 1 1 1 1 1 1 ...
##  $ race  : Factor w/ 4 levels "White","Black",..: 1 1 2 1 1 3 2 2 3 3 ...
##  $ Wanted: Factor w/ 2 levels "Yes","No": 2 1 1 1 1 1 1 1 2 1 ...

Below are the reference levels for the PN dataset

## Classes 'tbl_df', 'tbl' and 'data.frame':    1518 obs. of  6 variables:
##  $ KnowPreg : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gotPNcare: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ LBW      : Factor w/ 2 levels "Normal","Low": 1 1 1 1 1 1 1 1 1 1 ...
##  $ PreMe    : Factor w/ 2 levels "Term","Premature": 1 1 1 1 1 1 1 2 1 1 ...
##  $ race     : Factor w/ 4 levels "White","Black",..: 1 2 3 2 2 3 3 1 1 2 ...
##  $ Wanted   : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 2 1 2 2 2 ...

Low Birth Weight

First, a refresher on the number of subjects who were low birth weight

Counts of subjects with normal and low birth weights
Normal	Low
4362	458

This model predicts being born as low birth weight for their gestational age with the formula below

Fitting generalized (binomial/logit) linear model: LBW ~ GA + BMI + age + income + race + YrEdu + Wanted
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	18.67	1.019	18.32	5.764e-75
GA	-0.5461	0.02308	-23.66	1.009e-123
BMI	-0.02369	0.009261	-2.558	0.01053
age	0.0223	0.01313	1.698	0.08955
income	-0.001564	0.0004851	-3.225	0.00126
raceBlack	0.5381	0.1523	3.534	0.0004093
raceHispanic	-0.03476	0.1635	-0.2126	0.8317
raceOther	0.07275	0.2842	0.256	0.798
YrEdu	-0.01591	0.02749	-0.5788	0.5627
WantedNo	0.04859	0.1221	0.3978	0.6908

Note that the estimate above is the Log-odds. The table below transforms the log-odds into odds ratios

term	Odds.Ratio	conf.low	conf.high	p.value
GA	0.5792	0.5531	0.6055	0
BMI	0.9766	0.9588	0.9943	0.0105
age	1.0226	0.9964	1.0491	0.0896
income	0.9984	0.9975	0.9994	0.0013
raceBlack	1.7127	1.2702	2.3082	4e-04
raceHispanic	0.9658	0.6987	1.3273	0.8317
raceOther	1.0755	0.5972	1.8290	0.798
YrEdu	0.9842	0.9324	1.0385	0.5627
WantedNo	1.0498	0.8257	1.3332	0.6908

null.deviance	df.null	logLik	AIC	BIC	deviance	df.residual
3027	4819	-1037	2095	2160	2075	4810

Type III Anova

## Analysis of Deviance Table (Type III tests)
## 
## Response: LBW
##        LR Chisq Df Pr(>Chisq)    
## GA       882.99  1  < 2.2e-16 ***
## BMI        6.71  1   0.009566 ** 
## age        2.85  1   0.091354 .  
## income    10.72  1   0.001063 ** 
## race      15.50  3   0.001433 ** 
## YrEdu      0.34  1   0.562376    
## Wanted     0.16  1   0.690907    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hosmer-Lemeshow & more

Fitting logistic regression model: mod.LBW[[“formula”]]
	Model Likelihood Ratio Test	Discrimination Indexes	Rank Discrim. Indexes
Obs	LR chi2	R2	C
4820	952.1	0.3844	0.8621
Normal	d.f.	g	Dxy
4362	9	1.382	0.7242
Low	Pr(> chi2)	gr	gamma
458	0	3.982	0.7246
max \|deriv\|		gp	tau-a
6e-10		0.1174	0.1246
		Brier
		0.05891

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  ifelse(BIRTH$LBW == "Normal", 0, 1), fitted(mod.LBW)
## X-squared = 11.667, df = 8, p-value = 0.1667

Impression: Everything looks good here

Multicolinearity / VIF

Variance inflation factors should be below 5 or 10

##            GVIF Df GVIF^(1/(2*Df))
## GA     1.024522  1        1.012187
## BMI    1.085350  1        1.041802
## age    1.140426  1        1.067907
## income 1.472605  1        1.213509
## race   1.225970  3        1.034538
## YrEdu  1.489736  1        1.220548
## Wanted 1.035868  1        1.017776

Impression: Looks great, all VIFs are well below 5 (none are even above 2.5)

Influential values

A observations have standardized residuals with an absolute value greater than 3, shown in the table below:

standardized residuals with abs value > 3 (continued below)
index	.std.resid	.cooksd	LBW	GA	BMI	age	income	race
889	-3.06	0.00849	Normal	26	29	24	57	Black
1214	3.42	0.0079	Low	42	43	20	500	White
1814	3.46	0.00778	Low	44	42	30	46	Hispanic
3081	3.11	0.00448	Low	41	38	20	291	White

YrEdu	Wanted	.hat
10	No	0.000804
15	Yes	0.000225
15	No	0.000197
12	Yes	0.000361

Linearity

Check for a linear relationship between continuous predictors and logit of outcome

Impression: GA shows a very pretty linear relationship, but all the other relationships don’t really look linear at all. This is certainly a problem for this regression, but is less of an issue for the other models.

Note: one observation has been removed, with a gestational age of 44 weeks

Prediction

Using adjusted rates (assuming other variables are at their means)

This prediction assumes Wanted=="Yes

Cross-Validation

Ignore this section for grading purposes

Confusion matrix using test & training data

First, create a training model using 4/5ths of the original dataset. Below are the odds ratios of the original and training model

## # A tibble: 10 x 3
##    coefficient  `Original OR` `Training OR`
##    <chr>                <dbl>         <dbl>
##  1 (Intercept)  128171564.    721854437.   
##  2 GA                   0.580         0.570
##  3 BMI                  0.98          0.96 
##  4 age                  1.02          1.01 
##  5 income               1             1    
##  6 raceBlack            1.71          1.75 
##  7 raceHispanic         0.97          1.04 
##  8 raceOther            1.08          0.43 
##  9 YrEdu                0.98          1    
## 10 WantedNo             1.05          0.82

Next find the misclassifiaction error rate and plot the ROC curve

## [1] 0.0704

Finally, the confusion matrix using test & training data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Normal  Low
##     Normal   1946  118
##     Low        36   86
##                                          
##                Accuracy : 0.9296         
##                  95% CI : (0.918, 0.9399)
##     No Information Rate : 0.9067         
##     P-Value [Acc > NIR] : 7.930e-05      
##                                          
##                   Kappa : 0.4921         
##                                          
##  Mcnemar's Test P-Value : 6.703e-11      
##                                          
##             Sensitivity : 0.42157        
##             Specificity : 0.98184        
##          Pos Pred Value : 0.70492        
##          Neg Pred Value : 0.94283        
##              Prevalence : 0.09332        
##          Detection Rate : 0.03934        
##    Detection Prevalence : 0.05581        
##       Balanced Accuracy : 0.70170        
##                                          
##        'Positive' Class : Low            
##

Premature birth

Again, a refresher on the number of subjects who were born before 37 weeks

Counts of subjects born at term and prematurely
Term	Premature
4234	586

This model predicts being born prematurely with the formula below

Fitting generalized (binomial/logit) linear model: PreMe ~ BMI + age + income + race + YrEdu + Wanted
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-2.21	0.3859	-5.727	1.025e-08
BMI	0.01078	0.006525	1.652	0.09846
age	0.01377	0.00973	1.416	0.1569
income	-0.0009885	0.0003437	-2.876	0.004027
raceBlack	0.01558	0.1135	0.1373	0.8908
raceHispanic	-0.2512	0.1197	-2.099	0.03584
raceOther	-0.2384	0.2177	-1.095	0.2736
YrEdu	-0.02091	0.02009	-1.04	0.2981
WantedNo	0.2254	0.09006	2.502	0.01234

Again, table with the transformed odds ratios

term	Odds.Ratio	conf.low	conf.high	p.value
BMI	1.0108	0.9979	1.0238	0.0985
age	1.0139	0.9946	1.0333	0.1569
income	0.9990	0.9983	0.9997	0.004
raceBlack	1.0157	0.8120	1.2675	0.8908
raceHispanic	0.7779	0.6137	0.9814	0.0358
raceOther	0.7879	0.5033	1.1859	0.2736
YrEdu	0.9793	0.9414	1.0186	0.2981
WantedNo	1.2528	1.0498	1.4946	0.0123

null.deviance	df.null	logLik	AIC	BIC	deviance	df.residual
3567	4819	-1766	3550	3609	3532	4811

Type III Anova

## Analysis of Deviance Table (Type III tests)
## 
## Response: PreMe
##        LR Chisq Df Pr(>Chisq)   
## BMI      2.7018  1   0.100232   
## age      1.9845  1   0.158915   
## income   8.4442  1   0.003662 **
## race     6.3722  3   0.094843 . 
## YrEdu    1.0848  1   0.297626   
## Wanted   6.2420  1   0.012476 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hosmer-Lemeshow & more

Fitting logistic regression model: mod.PreMe[[“formula”]]
	Model Likelihood Ratio Test	Discrimination Indexes	Rank Discrim. Indexes
Obs	LR chi2	R2	C
4820	34.92	0.01381	0.5778
Term	d.f.	g	Dxy
4234	8	0.2996	0.1556
Premature	Pr(> chi2)	gr	gamma
586	2.762e-05	1.349	0.1559
max \|deriv\|		gp	tau-a
4e-10		0.03165	0.03325
		Brier
		0.106

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  ifelse(BIRTH$PreMe == "Term", 0, 1), fitted(mod.PreMe)
## X-squared = 13.275, df = 8, p-value = 0.1027

Impression: Everything looks good here

Multicolinearity / VIF

Variance inflation factors should be below 5 or 10

##            GVIF Df GVIF^(1/(2*Df))
## BMI    1.071061  1        1.034921
## age    1.147842  1        1.071374
## income 1.434414  1        1.197670
## race   1.208518  3        1.032069
## YrEdu  1.481480  1        1.217161
## Wanted 1.034409  1        1.017059

Impression: Looks great, all VIFs are well below 5 (none are even above 2.5)

Influential values

Linearity

Check for a linear relationship between continuous predictors and logit of outcome

Impression: Well this looks better than the first regression.

Income looks curvilinear, perhaps because the artificial grouping around 500% of the FPL and some effects for those in deep poverty (i.e. below 100%). BMI and age also have some problems at higher logit predicted values. Somewhat surprisingly, YrEdu doesn’t look bad at all, and this is the only variable I’d really consider making categorical.

Prediction

Using adjusted rates (assuming other variables are at their means)

Knew pregnant

Counts of who knew they were pregnant within the first 6 weeks of the pregnancy

Counts of if mother knew pregnant or not within 6 weeks
No	Yes
422	1096

This model predicts the chance of subjects knowing they were pregnant within the first 6 weeks

Fitting generalized (binomial/logit) linear model: KnowPreg ~ PregNum + age + income + race + YrEdu + PregNum + Wanted
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	-0.6286	0.446	-1.41	0.1587
PregNum	-0.04267	0.03898	-1.095	0.2737
age	0.0235	0.01376	1.708	0.08772
income	0.001165	0.0005047	2.308	0.02103
raceBlack	-0.5558	0.151	-3.681	0.0002322
raceHispanic	-0.1019	0.1584	-0.643	0.5202
raceOther	-0.5359	0.2506	-2.138	0.03249
YrEdu	0.09106	0.02829	3.219	0.001286
WantedNo	-0.3587	0.1212	-2.96	0.003079

Odds ratios

term	Odds.Ratio	conf.low	conf.high	p.value
PregNum	0.9582	0.8877	1.0346	0.2737
age	1.0238	0.9967	1.0520	0.0877
income	1.0012	1.0002	1.0022	0.021
raceBlack	0.5736	0.4266	0.7713	2e-04
raceHispanic	0.9031	0.6631	1.2345	0.5202
raceOther	0.5851	0.3611	0.9676	0.0325
YrEdu	1.0953	1.0365	1.1581	0.0013
WantedNo	0.6986	0.5509	0.8862	0.0031

null.deviance	df.null	logLik	AIC	BIC	deviance	df.residual
1794	1517	-843.6	1705	1753	1687	1509

Type III Anova

## Analysis of Deviance Table (Type III tests)
## 
## Response: KnowPreg
##         LR Chisq Df Pr(>Chisq)    
## PregNum   1.1944  1  0.2744379    
## age       2.9460  1  0.0860917 .  
## income    5.4060  1  0.0200680 *  
## race     16.6449  3  0.0008361 ***
## YrEdu    10.4732  1  0.0012112 ** 
## Wanted    8.7215  1  0.0031447 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hosmer-Lemeshow & more

Fitting logistic regression model: mod.KnowPreg[[“formula”]]
	Model Likelihood Ratio Test	Discrimination Indexes	Rank Discrim. Indexes
Obs	LR chi2	R2	C
1518	107.3	0.09843	0.67
No	d.f.	g	Dxy
422	8	0.7196	0.3399
Yes	Pr(> chi2)	gr	gamma
1096	0	2.054	0.3401
max \|deriv\|		gp	tau-a
3e-06		0.1349	0.1366
		Brier
		0.1872

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  ifelse(PN$KnowPreg == "Yes", 1, 0), fitted(mod.KnowPreg)
## X-squared = 10.765, df = 8, p-value = 0.2154

Impression: Everything looks good here

Multicolinearity / VIF

Variance inflation factors should be below 5 or 10

##             GVIF Df GVIF^(1/(2*Df))
## PregNum 1.174903  1        1.083930
## age     1.284593  1        1.133399
## income  1.564165  1        1.250666
## race    1.177839  3        1.027656
## YrEdu   1.528460  1        1.236309
## Wanted  1.035627  1        1.017658

Impression: Looks great, all VIFs are well below 5 (none are even above 2.5)

Influential values

Linearity

Check for a linear relationship between continuous predictors and logit of outcome

Impression: Unlike the last two regressions, age and BMI look pretty linear. Again income appears to deviate from linearity once it’s below 100% of the FPL, and it looks like there might be a difference in YrEdu for those with less than a high school education (YrEdu < 12) and everyone else. Finally, PregNum looks to be linear despite the few outliers (who were having their 10th or greater pregnancy).

Prediction

Got prenatal care

Counts of who got prenatal care in the first trimester

Counts of if mother got prenatal care in the first trimester
No	Yes
172	1346

Fitting generalized (binomial/logit) linear model: gotPNcare ~ KnowPreg + age + income + race + YrEdu + PregNum + Wanted
	Estimate	Std. Error	z value	Pr(>\|z\|)
(Intercept)	1.157	0.6678	1.733	0.08318
KnowPregYes	2.254	0.1954	11.53	8.937e-31
age	0.01162	0.02046	0.5679	0.5701
income	0.00208	0.000814	2.555	0.01062
raceBlack	-0.05857	0.2275	-0.2575	0.7968
raceHispanic	-0.2578	0.2364	-1.09	0.2756
raceOther	-0.1844	0.3868	-0.4768	0.6335
YrEdu	-0.01792	0.04308	-0.4159	0.6775
PregNum	-0.1317	0.0546	-2.412	0.01588
WantedNo	-0.3936	0.1821	-2.162	0.03065

Odds ratios

term	Odds.Ratio	conf.low	conf.high	p.value
KnowPregYes	9.5245	6.5468	14.1062	0
age	1.0117	0.9723	1.0536	0.5701
income	1.0021	1.0005	1.0037	0.0106
raceBlack	0.9431	0.6042	1.4759	0.7968
raceHispanic	0.7728	0.4870	1.2326	0.2756
raceOther	0.8316	0.4009	1.8463	0.6335
YrEdu	0.9822	0.9026	1.0689	0.6775
PregNum	0.8766	0.7873	0.9757	0.0159
WantedNo	0.6746	0.4715	0.9637	0.0307

null.deviance	df.null	logLik	AIC	BIC	deviance	df.residual
1073	1517	-424	868.1	921.3	848.1	1508

Type III Anova

## Analysis of Deviance Table (Type III tests)
## 
## Response: gotPNcare
##          LR Chisq Df Pr(>Chisq)    
## KnowPreg  156.861  1  < 2.2e-16 ***
## age         0.325  1   0.568874    
## income      6.913  1   0.008556 ** 
## race        1.318  3   0.724950    
## YrEdu       0.173  1   0.677462    
## PregNum     5.802  1   0.016008 *  
## Wanted      4.679  1   0.030533 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hosmer-Lemeshow & more

Fitting logistic regression model: mod.gotPNcare[[“formula”]]
	Model Likelihood Ratio Test	Discrimination Indexes	Rank Discrim. Indexes
Obs	LR chi2	R2	C
1518	224.8	0.2716	0.8227
No	d.f.	g	Dxy
172	9	1.342	0.6455
Yes	Pr(> chi2)	gr	gamma
1346	0	3.828	0.646
max \|deriv\|		gp	tau-a
8e-06		0.1275	0.1298
		Brier
		0.08336

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  ifelse(PN$gotPNcare == "Yes", 1, 0), fitted(mod.gotPNcare)
## X-squared = 12.33, df = 8, p-value = 0.1371

Impression: Everything looks good here

Multicolinearity / VIF

Variance inflation factors should be below 5 or 10

##              GVIF Df GVIF^(1/(2*Df))
## KnowPreg 1.041892  1        1.020731
## age      1.278995  1        1.130927
## income   1.472748  1        1.213568
## race     1.178061  3        1.027688
## YrEdu    1.421861  1        1.192418
## PregNum  1.184046  1        1.088139
## Wanted   1.040265  1        1.019934

Impression: Looks great, all VIFs are well below 5 (none are even above 2.5)

Influential values

Linearity

Check for a linear relationship between continuous predictors and logit of outcome

Impression: Because the x-axis is a function of the predicted values and we included KnowPreg in our regression (with a massive effect size), these plots spread into two distict groups based on this predictor. The overall loess line (red) clearly doesn’t demonstrate a linear relationship, but within each group it looks like there’s better linearity. At this point I’m not sure what my impression should be, so I’ll just leave it there and let you tell me the answer. Is this too problematic to accept regression 4 as valid?

Note: Two pregnancies (with pregnum=14) were removed

Prediction

A bit on linearity

I’ve read mixed information on what is important for the assumption of linearity, and in the plots above I’ve look at linear relationships between predictors and the logit of the predicted outcome. After the presentation in class (and more reading) I’m adding this section looking at the linear relationships between the predictors and their pearson residuals using the car package.

From what I’m able to gather from reading up on the issue, the pearson residuals vs predictor line should be straight and have a slope of zero. Additionally, the car::residualPlots() function provides a lack of fit test for each variables’ relationship with the residuals.

Low birth weight

##        Test stat Pr(>|Test stat|)   
## GA        9.0744         0.002592 **
## BMI       0.0093         0.923035   
## age       3.8886         0.048614 * 
## income    3.0685         0.079825 . 
## race                                
## YrEdu     4.4824         0.034246 * 
## Wanted                              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Premature birth

##        Test stat Pr(>|Test stat|)  
## BMI       4.4619          0.03466 *
## age       0.3181          0.57276  
## income    0.0207          0.88559  
## race                               
## YrEdu     1.1109          0.29188  
## Wanted                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Knew pregnant

##         Test stat Pr(>|Test stat|)  
## PregNum    0.0247          0.87503  
## age        2.1298          0.14446  
## income     0.3208          0.57114  
## race                                
## YrEdu      4.5145          0.03361 *
## Wanted                              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Got prenatal care

##          Test stat Pr(>|Test stat|)
## KnowPreg                           
## age         1.4531           0.2280
## income      1.1103           0.2920
## race                               
## YrEdu       0.5658           0.4520
## PregNum     0.6566           0.4178
## Wanted

Some models

Hunter Ratliff

11/17/2019

Set reference levels

Low Birth Weight

Type III Anova

Hosmer-Lemeshow & more

Multicolinearity / VIF

Influential values

Linearity

Prediction

Cross-Validation

Premature birth

Type III Anova

Hosmer-Lemeshow & more

Multicolinearity / VIF

Influential values

Linearity

Prediction

Knew pregnant

Type III Anova

Hosmer-Lemeshow & more

Multicolinearity / VIF

Influential values

Linearity

Prediction

Got prenatal care

Type III Anova

Hosmer-Lemeshow & more

Multicolinearity / VIF

Influential values

Linearity

Prediction

A bit on linearity

Low birth weight

Premature birth

Knew pregnant

Got prenatal care