First, a short overview of the variables we have. We have two datasets, BIRTH and PN (more on this later), and the breakdown of the variables between the two datasets is below:
Now here are our two datasets. You’ll see that BIRTH contains 4,820 observations, while PN contains only 1,518.
glimpse(BIRTH)
## Observations: 4,820
## Variables: 11
## $ CASEID <dbl> 70627, 70627, 70628, 70632, 70633, 70637, 70641, 70641, 70654,…
## $ LBW <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ PreMe <fct> Term, Term, Term, Term, Term, Term, Term, Term, Term, Term, Te…
## $ GA <dbl> 40, 39, 39, 41, 39, 40, 40, 40, 38, 39, 39, 39, 35, 39, 37, 36…
## $ BMI <dbl> 39, 39, 26, 28, 26, 25, 26, 26, 27, 27, 36, 34, 30, 30, 26, 20…
## $ age <dbl> 28, 33, 23, 24, 23, 22, 20, 21, 27, 29, 24, 30, 26, 27, 32, 27…
## $ income <dbl> 500, 500, 189, 500, 112, 113, 9, 9, 361, 361, 108, 500, 356, 3…
## $ race <fct> White, White, Black, White, White, Hispanic, Black, Black, His…
## $ YrEdu <dbl> 16, 16, 15, 12, 10, 13, 10, 10, 18, 18, 10, 12, 11, 11, 19, 12…
## $ eduCat <fct> Bachelors, Bachelors, Associates, HS or GED, HS or GED, Some c…
## $ Wanted <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, …
glimpse(PN)
## Observations: 1,518
## Variables: 14
## $ CASEID <dbl> 70627, 70628, 70637, 70641, 70641, 70654, 70654, 70662, 706…
## $ KnowPreg <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, …
## $ gotPNcare <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
## $ LBW <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ PreMe <fct> Term, Term, Term, Term, Term, Term, Term, Premature, Term, …
## $ GA <dbl> 39, 39, 40, 40, 40, 38, 39, 35, 39, 39, 39, 40, 39, 39, 41,…
## $ BMI <dbl> 39, 26, 25, 26, 26, 27, 27, 30, 30, 20, 20, 22, 27, 27, 32,…
## $ PregNum <dbl> 3, 3, 2, 4, 4, 2, 2, 4, 4, 3, 3, 1, 3, 3, 4, 4, 2, 1, 4, 4,…
## $ age <dbl> 33, 23, 22, 20, 21, 27, 29, 26, 27, 24, 27, 31, 22, 23, 27,…
## $ income <dbl> 500, 189, 113, 9, 9, 361, 361, 356, 356, 172, 172, 118, 9, …
## $ race <fct> White, Black, Hispanic, Black, Black, Hispanic, Hispanic, W…
## $ YrEdu <dbl> 16, 15, 13, 10, 10, 18, 18, 11, 11, 13, 13, 12, 11, 11, 17,…
## $ eduCat <fct> Bachelors, Associates, Some college, <HS, <HS, Bachelors, B…
## $ Wanted <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FA…
This occurred because we handled missing values by deleting every observation that had one. Since many women couldn’t recall the week they first knew they were pregnant or the week they first sought prenatal care, our PN dataset is much smaller. The two figures below show the patterns of missing data (before the data were split into their respective sets).
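As a rough illustration of why deleting incomplete observations shrinks PN more than BIRTH, here is a minimal sketch on made-up data. The data frame `raw` and the column `KnowPregWeek` are hypothetical stand-ins, not the actual survey variables or the code used for this analysis.

```r
# Toy data standing in for the raw survey file (all values are made up).
raw <- data.frame(
  CASEID       = c(1, 2, 3, 4, 5),
  GA           = c(40, 39, NA, 38, 41),
  KnowPregWeek = c(6, NA, 4, NA, 5)  # hypothetical recall variable
)

# BIRTH-style subset: only the birth variables must be complete.
birth_cols <- c("CASEID", "GA")
birth_toy  <- raw[complete.cases(raw[, birth_cols]), birth_cols]

# PN-style subset: the recall variable must be complete too, so more rows drop.
pn_toy <- raw[complete.cases(raw), ]

nrow(birth_toy)
nrow(pn_toy)
```

The more variables a subset requires to be non-missing, the fewer rows survive, which is the pattern seen between BIRTH and PN.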
These are our continuous columns:
## Observations: 4,820
## Variables: 5
## $ GA <dbl> 40, 39, 39, 41, 39, 40, 40, 40, 38, 39, 39, 39, 35, 39, 37, 36…
## $ BMI <dbl> 39, 39, 26, 28, 26, 25, 26, 26, 27, 27, 36, 34, 30, 30, 26, 20…
## $ age <dbl> 28, 33, 23, 24, 23, 22, 20, 21, 27, 29, 24, 30, 26, 27, 32, 27…
## $ income <dbl> 500, 500, 189, 500, 112, 113, 9, 9, 361, 361, 108, 500, 356, 3…
## $ YrEdu <dbl> 16, 16, 15, 12, 10, 13, 10, 10, 18, 18, 10, 12, 11, 11, 19, 12…
Age: The distribution follows an odd pattern, possibly because some mothers contribute more than one pregnancy (note the repeated CASEID values above).
GA: There is one outlier, a birth recorded at 9 weeks of gestation, which can’t be right. For now it stays in our dataset.
Income: You can clearly see the artificial spike created by top-coding: everyone at or above 500% of the poverty line is recorded as exactly 500%.
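The effect of capping income at 500 can be sketched as follows. The vector `true_income` is invented for illustration; the survey itself only provides the capped values.

```r
# Made-up "true" incomes as a percent of the poverty line.
true_income <- c(9, 113, 361, 500, 742, 1200)

# Top-coding: every value above the cap is replaced by the cap itself.
income <- pmin(true_income, 500)

income
sum(income == 500)  # how many observations pile up at the cap
```

Because all values above the cap collapse onto one number, a histogram of `income` shows a spike at 500 that says nothing about how far above the cap those households really are.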
| | age | BMI | GA | income | YrEdu |
|---|---|---|---|---|---|
| Mean | 26.54 | 28.88 | 38.57 | 216.96 | 13.53 |
| Std.Dev | 4.84 | 6.82 | 2.46 | 160.62 | 2.74 |
| Min | 20.00 | 16.00 | 9.00 | 5.00 | 9.00 |
| Max | 40.00 | 51.00 | 44.00 | 500.00 | 19.00 |
| Median | 26.00 | 27.00 | 39.00 | 172.00 | 13.00 |
| Q1 | 22.00 | 23.00 | 38.00 | 78.00 | 12.00 |
| Q3 | 30.00 | 33.00 | 40.00 | 356.00 | 16.00 |
| IQR | 8.00 | 10.00 | 2.00 | 278.00 | 4.00 |
| Skewness | 0.58 | 0.72 | -2.53 | 0.54 | 0.24 |
| Kurtosis | -0.50 | -0.18 | 11.69 | -1.03 | -0.69 |
| CV | 0.18 | 0.24 | 0.06 | 0.74 | 0.20 |
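Most rows of a table like this can be reproduced with base R. The sketch below uses a made-up data frame `toy` and a hypothetical helper `describe_num`; skewness and kurtosis are left out because base R has no built-in functions for them and packages differ in the formula they use.

```r
# Hypothetical helper: summary statistics for one numeric vector.
describe_num <- function(x) {
  c(
    Mean    = mean(x),
    Std.Dev = sd(x),
    Min     = min(x),
    Max     = max(x),
    Median  = median(x),
    Q1      = unname(quantile(x, 0.25)),
    Q3      = unname(quantile(x, 0.75)),
    IQR     = IQR(x),
    CV      = sd(x) / mean(x)  # coefficient of variation
  )
}

# Made-up values, not the survey data.
toy <- data.frame(
  age   = c(20, 22, 26, 30, 40),
  YrEdu = c(9, 12, 13, 16, 19)
)

stats <- sapply(toy, describe_num)
round(stats, 2)
```

As a sanity check against the table above, the CV for age is 4.84 / 26.54, which rounds to the reported 0.18.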
These are our discrete columns:
## Observations: 4,820
## Variables: 5
## $ LBW <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ PreMe <fct> Term, Term, Term, Term, Term, Term, Term, Term, Term, Term, Te…
## $ race <fct> White, White, Black, White, White, Hispanic, Black, Black, His…
## $ eduCat <fct> Bachelors, Bachelors, Associates, HS or GED, HS or GED, Some c…
## $ Wanted <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, …
We can see that relatively few pregnancies ended in a premature birth (586 of 4,820, about 12%) or a low-birth-weight baby (458, about 10%), but quite a few were unwanted (2,056, about 43%).
Counts of categorical variables
| LBW | PreMe | Wanted = FALSE | Wanted = TRUE | Sum |
|---|---|---|---|---|
| FALSE | Premature | 145 | 158 | 303 |
| FALSE | Term | 1689 | 2370 | 4059 |
| FALSE | Sum | 1834 | 2528 | 4362 |
| TRUE | Premature | 141 | 142 | 283 |
| TRUE | Term | 81 | 94 | 175 |
| TRUE | Sum | 222 | 236 | 458 |
| Sum | Premature | 286 | 300 | 586 |
| Sum | Term | 1770 | 2464 | 4234 |
| Sum | Sum | 2056 | 2764 | 4820 |
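A three-way count table with margins like the one above can be built with base R. This is a sketch on a tiny made-up data frame, not necessarily how the table in this post was produced.

```r
# Made-up data with the same three columns.
toy <- data.frame(
  LBW    = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE),
  PreMe  = factor(c("Term", "Premature", "Premature", "Term", "Term", "Premature")),
  Wanted = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

# Cross-tabulate all three variables, then add a "Sum" level to every dimension.
counts       <- xtabs(~ LBW + PreMe + Wanted, data = toy)
with_margins <- addmargins(counts)

# Flatten to two dimensions: LBW and PreMe as rows, Wanted as columns.
ftable(with_margins, row.vars = c("LBW", "PreMe"))
```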
By gestational age
| LBW | n | mean | sd | se_mean | IQR | skewness | kurtosis | p50 |
|---|---|---|---|---|---|---|---|---|
| Low Wt | 458 | 35 | 4.3 | 0.2 | 6 | -0.98 | 2.3 | 35 |
| Normal Wt | 4362 | 39 | 1.7 | 0.026 | 2 | -1.6 | 6.6 | 39 |
| total | 4820 | 39 | 2.5 | 0.035 | 2 | -2.5 | 12 | 39 |
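The per-group columns in these tables follow standard definitions; in particular `se_mean` is the standard deviation divided by the square root of the group size. A minimal base R sketch on made-up values (the data frame `toy` is hypothetical):

```r
# Made-up gestational ages for two birth-weight groups.
toy <- data.frame(
  LBW = c("Low Wt", "Low Wt", "Low Wt",
          "Normal Wt", "Normal Wt", "Normal Wt", "Normal Wt"),
  GA  = c(30, 35, 37, 38, 39, 40, 41)
)

# Split GA by group and compute one summary row per group.
by_lbw <- do.call(rbind, lapply(split(toy$GA, toy$LBW), function(x) {
  data.frame(
    n       = length(x),
    mean    = mean(x),
    sd      = sd(x),
    se_mean = sd(x) / sqrt(length(x)),  # standard error of the mean
    IQR     = IQR(x),
    p50     = median(x)
  )
}))

by_lbw
```

The reported values are consistent with this definition: for the Low Wt group, 4.3 / sqrt(458) is about 0.2.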
By BMI
| LBW | n | mean | sd | se_mean | IQR | skewness | kurtosis | p50 |
|---|---|---|---|---|---|---|---|---|
| Low Wt | 458 | 29 | 6.7 | 0.31 | 9 | 0.72 | 0.0027 | 28 |
| Normal Wt | 4362 | 29 | 6.8 | 0.1 | 10 | 0.72 | -0.2 | 27 |
| total | 4820 | 29 | 6.8 | 0.098 | 10 | 0.72 | -0.18 | 27 |
By age
| LBW | n | mean | sd | se_mean | IQR | skewness | kurtosis | p50 |
|---|---|---|---|---|---|---|---|---|
| Low Wt | 458 | 27 | 5.2 | 0.24 | 9 | 0.63 | -0.62 | 25 |
| Normal Wt | 4362 | 27 | 4.8 | 0.073 | 7.8 | 0.57 | -0.48 | 26 |
| total | 4820 | 27 | 4.8 | 0.07 | 8 | 0.58 | -0.5 | 26 |
By income as percent of poverty line
| LBW | n | mean | sd | se_mean | IQR | skewness | kurtosis | p50 |
|---|---|---|---|---|---|---|---|---|
| Low Wt | 458 | 173 | 151 | 7.1 | 181 | 0.99 | -0.27 | 113 |
| Normal Wt | 4362 | 222 | 161 | 2.4 | 271 | 0.51 | -1.1 | 183 |
| total | 4820 | 217 | 161 | 2.3 | 278 | 0.55 | -1 | 172 |
By years of education
| LBW | n | mean | sd | se_mean | IQR | skewness | kurtosis | p50 |
|---|---|---|---|---|---|---|---|---|
| Low Wt | 458 | 13 | 2.7 | 0.13 | 3 | 0.44 | -0.48 | 12 |
| Normal Wt | 4362 | 14 | 2.7 | 0.041 | 4 | 0.22 | -0.7 | 13 |
| total | 4820 | 14 | 2.7 | 0.039 | 4 | 0.24 | -0.69 | 13 |
By Race
| | Black | Hispanic | Other | White |
|---|---|---|---|---|
| Low Wt | 155 | 94 | 22 | 187 |
| Normal Wt | 991 | 1054 | 246 | 2071 |
## Call: xtabs(formula = formula_str, data = data, addNA = TRUE)
## Number of cases in table: 4820
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 28.308, df = 3, p-value = 3.129e-06
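The independence test can be re-run directly from the published counts. The sketch below rebuilds the LBW by race table by hand (the object name `race_by_lbw` is arbitrary) and applies `chisq.test`.

```r
# Counts copied from the LBW by race table.
race_by_lbw <- matrix(
  c(155,   94,  22,  187,
    991, 1054, 246, 2071),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    LBW  = c("Low Wt", "Normal Wt"),
    race = c("Black", "Hispanic", "Other", "White")
  )
)

res <- chisq.test(race_by_lbw)

res$statistic           # Pearson chi-squared statistic
res$parameter           # degrees of freedom: (2 - 1) * (4 - 1)
round(res$expected, 1)  # counts expected if LBW and race were independent
```

Comparing `res$expected` with the observed counts shows where the association comes from: the Low Wt count for Black mothers (155) is well above the roughly 109 expected under independence.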
By Wantedness
| | Unwanted | Wanted |
|---|---|---|
| Low Wt | 222 | 236 |
| Normal Wt | 1834 | 2528 |
## Call: xtabs(formula = formula_str, data = data, addNA = TRUE)
## Number of cases in table: 4820
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 6.999, df = 1, p-value = 0.008157
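One detail matters when re-checking a 2 x 2 table like this one: the statistic printed by `summary()` on a table is the plain Pearson chi-squared, while `chisq.test` applies Yates' continuity correction to 2 x 2 tables by default. A sketch using the counts above (object names are arbitrary):

```r
# Counts copied from the LBW by wantedness table.
wanted_by_lbw <- matrix(
  c( 222,  236,
    1834, 2528),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    LBW    = c("Low Wt", "Normal Wt"),
    Wanted = c("Unwanted", "Wanted")
  )
)

# Without the continuity correction: comparable to the summary() output.
uncorrected <- chisq.test(wanted_by_lbw, correct = FALSE)

# Default for 2 x 2 tables: Yates' correction, giving a slightly smaller statistic.
corrected <- chisq.test(wanted_by_lbw)

c(uncorrected = unname(uncorrected$statistic),
  corrected   = unname(corrected$statistic))
```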
By prematurity
| | Premature | Term |
|---|---|---|
| Low Wt | 283 | 175 |
| Normal Wt | 303 | 4059 |
## Call: xtabs(formula = formula_str, data = data, addNA = TRUE)
## Number of cases in table: 4820
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 1167.4, df = 1, p-value = 7.53e-256
Note that premature birth is a direct function of gestational age, so PreMe carries no information beyond what GA already holds.
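To make the dependence concrete, here is a sketch of deriving a prematurity flag from gestational age. The 37-week cutoff is an assumption based on the common clinical definition; the threshold actually used to build PreMe is not stated here.

```r
# Made-up gestational ages in weeks.
GA <- c(40, 39, 35, 37, 36, 41)

# Assumed cutoff (common clinical definition); not confirmed for this dataset.
cutoff <- 37

PreMe <- factor(
  ifelse(GA < cutoff, "Premature", "Term"),
  levels = c("Premature", "Term")
)

table(PreMe)
```

Since the flag is computed from GA alone, putting both GA and PreMe in the same model would feed it the same information twice.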
By BMI
| PreMe | n | mean | sd | se_mean | IQR | skewness | kurtosis | p50 |
|---|---|---|---|---|---|---|---|---|
| Premature | 586 | 30 | 6.7 | 0.28 | 10 | 0.55 | -0.31 | 29 |
| Term | 4234 | 29 | 6.8 | 0.1 | 10 | 0.75 | -0.15 | 27 |
| total | 4820 | 29 | 6.8 | 0.098 | 10 | 0.72 | -0.18 | 27 |
By age
| PreMe | n | mean | sd | se_mean | IQR | skewness | kurtosis | p50 |
|---|---|---|---|---|---|---|---|---|
| Premature | 586 | 26 | 4.8 | 0.2 | 8 | 0.58 | -0.48 | 26 |
| Term | 4234 | 27 | 4.8 | 0.074 | 8 | 0.58 | -0.5 | 26 |
| total | 4820 | 27 | 4.8 | 0.07 | 8 | 0.58 | -0.5 | 26 |
By income as percent of the poverty line
| PreMe | n | mean | sd | se_mean | IQR | skewness | kurtosis | p50 |
|---|---|---|---|---|---|---|---|---|
| Premature | 586 | 193 | 154 | 6.3 | 208 | 0.83 | -0.55 | 144 |
| Term | 4234 | 220 | 161 | 2.5 | 272 | 0.51 | -1.1 | 183 |
| total | 4820 | 217 | 161 | 2.3 | 278 | 0.55 | -1 | 172 |
By years of education
| PreMe | n | mean | sd | se_mean | IQR | skewness | kurtosis | p50 |
|---|---|---|---|---|---|---|---|---|
| Premature | 586 | 13 | 2.6 | 0.11 | 3 | 0.28 | -0.52 | 13 |
| Term | 4234 | 14 | 2.8 | 0.042 | 4 | 0.23 | -0.71 | 13 |
| total | 4820 | 14 | 2.7 | 0.039 | 4 | 0.24 | -0.69 | 13 |
By Race
| | Black | Hispanic | Other | White |
|---|---|---|---|---|
| Premature | 163 | 125 | 26 | 272 |
| Term | 983 | 1023 | 242 | 1986 |
## Call: xtabs(formula = formula_str, data = data, addNA = TRUE)
## Number of cases in table: 4820
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 7.851, df = 3, p-value = 0.0492
By Wantedness
| | Unwanted | Wanted |
|---|---|---|
| Premature | 286 | 300 |
| Term | 1770 | 2464 |
## Call: xtabs(formula = formula_str, data = data, addNA = TRUE)
## Number of cases in table: 4820
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 10.315, df = 1, p-value = 0.00132
By LBW
| | Low Wt | Normal Wt |
|---|---|---|
| Premature | 283 | 303 |
| Term | 175 | 4059 |
## Call: xtabs(formula = formula_str, data = data, addNA = TRUE)
## Number of cases in table: 4820
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 1167.4, df = 1, p-value = 7.53e-256
| | Black | Hispanic | Other | White |
|---|---|---|---|---|
| Unwanted | 622 | 470 | 92 | 872 |
| Wanted | 524 | 678 | 176 | 1386 |
## Call: xtabs(formula = formula_str, data = data, addNA = TRUE)
## Number of cases in table: 4820
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 87.29, df = 3, p-value = 8.382e-19
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 25.23 | 0.1408 | 179.1 | 0 |
| raceHispanic | 1.242 | 0.1991 | 6.237 | 4.843e-10 |
| raceOther | 2.895 | 0.3235 | 8.95 | 4.994e-19 |
| raceWhite | 1.835 | 0.1729 | 10.61 | 5.069e-26 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 160.2 | 4.506 | 35.56 | 3.799e-246 |
| raceHispanic | 8.239 | 6.369 | 1.294 | 0.1959 |
| raceOther | 84.49 | 10.35 | 8.164 | 4.115e-16 |
| raceWhite | 106.9 | 5.532 | 19.32 | 3.499e-80 |
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 12.92 | 0.0777 | 166.2 | 0 |
| raceHispanic | -0.3561 | 0.1098 | -3.242 | 0.001193 |
| raceOther | 1.697 | 0.1785 | 9.506 | 3.027e-21 |
| raceWhite | 1.294 | 0.0954 | 13.57 | 3.62e-41 |
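The coefficient names in the three tables above (raceHispanic, raceOther, raceWhite) suggest linear models of a continuous variable on race, with Black as the reference level. The output does not name the response for each table, so the sketch below uses a made-up `age` column purely to show how such a table is read.

```r
# Made-up data: three race groups, two observations each.
toy <- data.frame(
  race = factor(c("Black", "Black", "Hispanic", "Hispanic", "White", "White")),
  age  = c(24, 26, 26, 28, 27, 29)
)

fit <- lm(age ~ race, data = toy)

# Intercept = mean of the reference level (Black);
# every other coefficient = that group's mean minus the reference mean.
coef(fit)
tapply(toy$age, toy$race, mean)

# Estimate, Std. Error, t value and Pr(>|t|), as in the tables above.
summary(fit)$coefficients
```

Read this way, each row of the tables above is a comparison with the reference group, not a standalone group mean.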
Saving this resource for later: Logistic regression diagnostic plots in R