Definitions & Structure

First, a short overview of the variables. We have two datasets, BIRTH and PN (more on this later); the variables in each are listed below:

  • CASEID (both): Case identification number (PRIMARY KEY)
  • LBW (BIRTH only, outcome): Was the first baby of this pregnancy of low birth weight? codebook
  • PreMe (BIRTH only, outcome): Was the GA < 37 weeks?
  • GA (BIRTH only): Gestational age in weeks wksgest
  • BMI (BIRTH only): Body Mass Index BMI
  • age (both): Age at time of conception AGECON
  • income (both): Expressed as percent of poverty level, with 100 being the poverty line. Responses above 500% were rounded down to 500 POVERTY
  • race (both): Race & Hispanic origin of respondent HISPRACE
  • YrEdu (both): Number of years of schooling, see EDUCAT for details. Note the limits on the upper and lower bounds (e.g. someone with over 7 years of college is still classified as 19 years of schooling)
  • Wanted (both): Binary variable that is TRUE when NEWWANTR is “Right time”, otherwise FALSE
  • KnowPreg (PN only, outcome): TRUE if KNEWPREG <= 6 weeks
  • gotPNcare (PN only, outcome): TRUE if BGNPRENA < 13 weeks
  • PregNum (PN only): Number of lifetime pregnancies PREGNUM
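The derived variables in the list above can be sketched in a few lines of R. This is only an illustration on made-up rows: the data frame `raw` and its values are hypothetical, the non-"Right time" labels for NEWWANTR are placeholders, and I am assuming the raw columns are coded as plain numbers of weeks.

```r
# Toy rows standing in for the raw survey extract (all values are made up).
raw <- data.frame(
  wksgest  = c(40, 35, 37, 9),
  NEWWANTR = c("Right time", "Later", "Right time", "Unwanted"),  # placeholder labels
  KNEWPREG = c(4, 8, 6, NA),    # week the respondent learned she was pregnant
  BGNPRENA = c(8, 14, 12, NA),  # week prenatal care began
  stringsAsFactors = FALSE
)

# Apply the cutoffs stated in the variable list.
derived <- data.frame(
  GA        = raw$wksgest,
  PreMe     = factor(ifelse(raw$wksgest < 37, "Premature", "Term"),
                     levels = c("Premature", "Term")),
  Wanted    = raw$NEWWANTR == "Right time",
  KnowPreg  = raw$KNEWPREG <= 6,   # NA stays NA when the week is unknown
  gotPNcare = raw$BGNPRENA < 13
)

print(derived)
```

Note that a GA of exactly 37 weeks counts as term under the `< 37` rule, and that a missing week propagates as NA rather than FALSE.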

Now here are our two datasets. You’ll see that BIRTH contains 4,820 observations, while PN contains only 1,518 observations.

glimpse(BIRTH)
## Observations: 4,820
## Variables: 11
## $ CASEID <dbl> 70627, 70627, 70628, 70632, 70633, 70637, 70641, 70641, 70654,…
## $ LBW    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ PreMe  <fct> Term, Term, Term, Term, Term, Term, Term, Term, Term, Term, Te…
## $ GA     <dbl> 40, 39, 39, 41, 39, 40, 40, 40, 38, 39, 39, 39, 35, 39, 37, 36…
## $ BMI    <dbl> 39, 39, 26, 28, 26, 25, 26, 26, 27, 27, 36, 34, 30, 30, 26, 20…
## $ age    <dbl> 28, 33, 23, 24, 23, 22, 20, 21, 27, 29, 24, 30, 26, 27, 32, 27…
## $ income <dbl> 500, 500, 189, 500, 112, 113, 9, 9, 361, 361, 108, 500, 356, 3…
## $ race   <fct> White, White, Black, White, White, Hispanic, Black, Black, His…
## $ YrEdu  <dbl> 16, 16, 15, 12, 10, 13, 10, 10, 18, 18, 10, 12, 11, 11, 19, 12…
## $ eduCat <fct> Bachelors, Bachelors, Associates, HS or GED, HS or GED, Some c…
## $ Wanted <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, …
glimpse(PN)
## Observations: 1,518
## Variables: 14
## $ CASEID    <dbl> 70627, 70628, 70637, 70641, 70641, 70654, 70654, 70662, 706…
## $ KnowPreg  <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, No, Yes, …
## $ gotPNcare <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
## $ LBW       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ PreMe     <fct> Term, Term, Term, Term, Term, Term, Term, Premature, Term, …
## $ GA        <dbl> 39, 39, 40, 40, 40, 38, 39, 35, 39, 39, 39, 40, 39, 39, 41,…
## $ BMI       <dbl> 39, 26, 25, 26, 26, 27, 27, 30, 30, 20, 20, 22, 27, 27, 32,…
## $ PregNum   <dbl> 3, 3, 2, 4, 4, 2, 2, 4, 4, 3, 3, 1, 3, 3, 4, 4, 2, 1, 4, 4,…
## $ age       <dbl> 33, 23, 22, 20, 21, 27, 29, 26, 27, 24, 27, 31, 22, 23, 27,…
## $ income    <dbl> 500, 189, 113, 9, 9, 361, 361, 356, 356, 172, 172, 118, 9, …
## $ race      <fct> White, Black, Hispanic, Black, Black, Hispanic, Hispanic, W…
## $ YrEdu     <dbl> 16, 15, 13, 10, 10, 18, 18, 11, 11, 13, 13, 12, 11, 11, 17,…
## $ eduCat    <fct> Bachelors, Associates, Some college, <HS, <HS, Bachelors, B…
## $ Wanted    <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FA…

This occurred because we handled missing values by deleting any observation that had one. Since many women couldn’t recall the week they first knew they were pregnant or when they first sought prenatal care, our PN dataset is much smaller. The two figures below show the patterns of missing data (before I split the data into the two sets).
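To make the size difference concrete, here is a minimal sketch of listwise deletion on a toy data frame. The rows are invented, and I am assuming the deletion was applied separately to each dataset's own set of columns; the actual cleaning code may have been organised differently.

```r
# Toy data: every row has GA, but the recall questions are often missing.
full <- data.frame(
  CASEID   = c(1, 2, 3, 4, 5),
  GA       = c(40, 39, 35, 38, 41),
  KNEWPREG = c(4, NA, 6, NA, 5),
  BGNPRENA = c(8, 10, NA, NA, 9)
)

birth_cols <- c("CASEID", "GA")
pn_cols    <- c("CASEID", "GA", "KNEWPREG", "BGNPRENA")

# Keep only rows that are complete on the columns each dataset needs.
BIRTH_toy <- full[complete.cases(full[, birth_cols]), birth_cols]
PN_toy    <- full[complete.cases(full[, pn_cols]), pn_cols]

colSums(is.na(full))  # missing values per column
nrow(BIRTH_toy)       # 5 rows survive
nrow(PN_toy)          # 2 rows survive
```

Every extra column with missing values can only shrink the complete-case set, which is why the dataset that depends on the recall questions ends up smaller.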

Birth dataset

Univariate

Continuous

These are our continuous columns

## Observations: 4,820
## Variables: 5
## $ GA     <dbl> 40, 39, 39, 41, 39, 40, 40, 40, 38, 39, 39, 39, 35, 39, 37, 36…
## $ BMI    <dbl> 39, 39, 26, 28, 26, 25, 26, 26, 27, 27, 36, 34, 30, 30, 26, 20…
## $ age    <dbl> 28, 33, 23, 24, 23, 22, 20, 21, 27, 29, 24, 30, 26, 27, 32, 27…
## $ income <dbl> 500, 500, 189, 500, 112, 113, 9, 9, 361, 361, 108, 500, 356, 3…
## $ YrEdu  <dbl> 16, 16, 15, 12, 10, 13, 10, 10, 18, 18, 10, 12, 11, 11, 19, 12…

Age: This follows an odd pattern, but perhaps it’s a function of each mom (possibly) having more than one pregnancy?

GA: This one has an outlier recorded as born at 9 weeks, which can’t be right. But for now it’s in our dataset.

Income: You can clearly see the artificial pile-up at the top, caused by recording everyone at or above 500% of the poverty line as exactly 500%.

            age    BMI     GA  income  YrEdu
Mean      26.54  28.88  38.57  216.96  13.53
Std.Dev    4.84   6.82   2.46  160.62   2.74
Min       20.00  16.00   9.00    5.00   9.00
Max       40.00  51.00  44.00  500.00  19.00
Median    26.00  27.00  39.00  172.00  13.00
Q1        22.00  23.00  38.00   78.00  12.00
Q3        30.00  33.00  40.00  356.00  16.00
IQR        8.00  10.00   2.00  278.00   4.00
Skewness   0.58   0.72  -2.53    0.54   0.24
Kurtosis  -0.50  -0.18  11.69   -1.03  -0.69
CV         0.18   0.24   0.06    0.74   0.20
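For reference, most rows of this table can be reproduced with base R. The sketch below uses only the first 16 GA values visible in the glimpse output, so the numbers will not match the full-data table; `describe` is a hypothetical helper, not the function that produced the table, and skewness and kurtosis are left out because they depend on which estimator was used.

```r
# First 16 GA values shown in the glimpse(BIRTH) output above.
ga <- c(40, 39, 39, 41, 39, 40, 40, 40, 38, 39, 39, 39, 35, 39, 37, 36)

# Hypothetical helper: basic descriptive statistics for one numeric vector.
describe <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  c(Mean    = mean(x),
    Std.Dev = sd(x),
    Min     = min(x),
    Max     = max(x),
    Median  = median(x),
    Q1      = q[1],
    Q3      = q[2],
    IQR     = IQR(x),
    CV      = sd(x) / mean(x))  # coefficient of variation
}

round(describe(ga), 2)
```

The CV row in the table is consistent with this definition: for income, 160.62 / 216.96 is about 0.74.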

Discrete

These are our discrete columns

## Observations: 4,820
## Variables: 5
## $ LBW    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
## $ PreMe  <fct> Term, Term, Term, Term, Term, Term, Term, Term, Term, Term, Te…
## $ race   <fct> White, White, Black, White, White, Hispanic, Black, Black, His…
## $ eduCat <fct> Bachelors, Bachelors, Associates, HS or GED, HS or GED, Some c…
## $ Wanted <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, …

We can see that relatively few subjects had babies that were either premature (586 of 4,820) or low birth weight (458 of 4,820), but quite a few (2,056) had unwanted pregnancies.

Counts of categorical variables

                  Wanted  FALSE   TRUE    Sum
LBW    PreMe
FALSE  Premature            145    158    303
       Term                1689   2370   4059
       Sum                 1834   2528   4362
TRUE   Premature            141    142    283
       Term                  81     94    175
       Sum                  222    236    458
Sum    Premature            286    300    586
       Term                1770   2464   4234
       Sum                 2056   2764   4820

Bivariate

Outcome: Low Birth Weight

Continuous

By gestational age

LBW n mean sd se_mean IQR skewness kurtosis p50
Low Wt 458 35 4.3 0.2 6 -0.98 2.3 35
Normal Wt 4362 39 1.7 0.026 2 -1.6 6.6 39
total 4820 39 2.5 0.035 2 -2.5 12 39
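The grouped summaries in this section all have the same shape, so a small sketch may help when reading them. The data here are invented and this is not the function that produced the tables; it just shows how `n`, `mean`, `sd`, `se_mean` and `p50` relate to each other within each LBW group.

```r
# Invented toy data: gestational age by low-birth-weight status.
toy <- data.frame(
  LBW = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  GA  = c(35, 30, 37, 39, 40, 38, 41)
)

# One summary row per group; se_mean is sd / sqrt(n).
by_group <- do.call(rbind, lapply(split(toy$GA, toy$LBW), function(x) {
  data.frame(n       = length(x),
             mean    = mean(x),
             sd      = sd(x),
             se_mean = sd(x) / sqrt(length(x)),
             p50     = median(x))
}))

by_group
```

As a check against the table above, the Low Wt row has sd 4.3 and n 458, and 4.3 / sqrt(458) is about 0.2, which matches the reported se_mean.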

By BMI

LBW n mean sd se_mean IQR skewness kurtosis p50
Low Wt 458 29 6.7 0.31 9 0.72 0.0027 28
Normal Wt 4362 29 6.8 0.1 10 0.72 -0.2 27
total 4820 29 6.8 0.098 10 0.72 -0.18 27

By age

LBW n mean sd se_mean IQR skewness kurtosis p50
Low Wt 458 27 5.2 0.24 9 0.63 -0.62 25
Normal Wt 4362 27 4.8 0.073 7.8 0.57 -0.48 26
total 4820 27 4.8 0.07 8 0.58 -0.5 26

By income as percent of poverty line

LBW n mean sd se_mean IQR skewness kurtosis p50
Low Wt 458 173 151 7.1 181 0.99 -0.27 113
Normal Wt 4362 222 161 2.4 271 0.51 -1.1 183
total 4820 217 161 2.3 278 0.55 -1 172

By years of education

LBW n mean sd se_mean IQR skewness kurtosis p50
Low Wt 458 13 2.7 0.13 3 0.44 -0.48 12
Normal Wt 4362 14 2.7 0.041 4 0.22 -0.7 13
total 4820 14 2.7 0.039 4 0.24 -0.69 13

Discrete

By Race

  Black Hispanic Other White
Low Wt 155 94 22 187
Normal Wt 991 1054 246 2071
## Call: xtabs(formula = formula_str, data = data, addNA = TRUE)
## Number of cases in table: 4820 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 28.308, df = 3, p-value = 3.129e-06

By Wantedness

  Unwanted Wanted
Low Wt 222 236
Normal Wt 1834 2528
## Call: xtabs(formula = formula_str, data = data, addNA = TRUE)
## Number of cases in table: 4820 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 6.999, df = 1, p-value = 0.008157

By prematurity

  Premature Term
Low Wt 283 175
Normal Wt 303 4059
## Call: xtabs(formula = formula_str, data = data, addNA = TRUE)
## Number of cases in table: 4820 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 1167.4, df = 1, p-value = 7.53e-256
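The independence tests above can be re-run directly from the printed counts. This sketch uses `chisq.test` rather than the `summary(xtabs(...))` call shown in the output, and the object names are my own. For the 2x2 table I am assuming no continuity correction was applied, since that is what reproduces the reported statistic.

```r
# Counts copied from the "By Race" table above.
race_tab <- matrix(c(155,   94,  22,  187,
                     991, 1054, 246, 2071),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(LBW  = c("Low Wt", "Normal Wt"),
                                   race = c("Black", "Hispanic", "Other", "White")))

# Counts copied from the "By Wantedness" table above.
want_tab <- matrix(c( 222,  236,
                     1834, 2528),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(LBW    = c("Low Wt", "Normal Wt"),
                                   Wanted = c("Unwanted", "Wanted")))

race_test <- chisq.test(race_tab)                 # compare with Chisq = 28.308, df = 3
want_test <- chisq.test(want_tab, correct = FALSE) # compare with Chisq = 6.999, df = 1

race_test
want_test

# Share of low-birth-weight babies within each race group.
round(prop.table(race_tab, margin = 2)["Low Wt", ], 3)
```

With R's default `correct = TRUE`, the 2x2 statistic would come out slightly smaller than the value reported above, so the two calls are not interchangeable for 2x2 tables.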

Outcome: Premature birth

Continuous

Note that premature birth is a direct function of gestational age (GA < 37 weeks), so there is no breakdown by GA here.

By BMI

PreMe n mean sd se_mean IQR skewness kurtosis p50
Premature 586 30 6.7 0.28 10 0.55 -0.31 29
Term 4234 29 6.8 0.1 10 0.75 -0.15 27
total 4820 29 6.8 0.098 10 0.72 -0.18 27

By age

PreMe n mean sd se_mean IQR skewness kurtosis p50
Premature 586 26 4.8 0.2 8 0.58 -0.48 26
Term 4234 27 4.8 0.074 8 0.58 -0.5 26
total 4820 27 4.8 0.07 8 0.58 -0.5 26

By income as percent of the poverty line

PreMe n mean sd se_mean IQR skewness kurtosis p50
Premature 586 193 154 6.3 208 0.83 -0.55 144
Term 4234 220 161 2.5 272 0.51 -1.1 183
total 4820 217 161 2.3 278 0.55 -1 172

By years of education

PreMe n mean sd se_mean IQR skewness kurtosis p50
Premature 586 13 2.6 0.11 3 0.28 -0.52 13
Term 4234 14 2.8 0.042 4 0.23 -0.71 13
total 4820 14 2.7 0.039 4 0.24 -0.69 13

Discrete

By Race

  Black Hispanic Other White
Premature 163 125 26 272
Term 983 1023 242 1986
## Call: xtabs(formula = formula_str, data = data, addNA = TRUE)
## Number of cases in table: 4820 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 7.851, df = 3, p-value = 0.0492

By Wantedness

  Unwanted Wanted
Premature 286 300
Term 1770 2464
## Call: xtabs(formula = formula_str, data = data, addNA = TRUE)
## Number of cases in table: 4820 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 10.315, df = 1, p-value = 0.00132

By LBW

  Low Wt Normal Wt
Premature 283 303
Term 175 4059
## Call: xtabs(formula = formula_str, data = data, addNA = TRUE)
## Number of cases in table: 4820 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 1167.4, df = 1, p-value = 7.53e-256


Selected highlights

Race

Wantedness

  Black Hispanic Other White
Unwanted 622 470 92 872
Wanted 524 678 176 1386
## Call: xtabs(formula = formula_str, data = data, addNA = TRUE)
## Number of cases in table: 4820 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 87.29, df = 3, p-value = 8.382e-19

Age

Fitting linear model: formula(formula_str)
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.23 0.1408 179.1 0
raceHispanic 1.242 0.1991 6.237 4.843e-10
raceOther 2.895 0.3235 8.95 4.994e-19
raceWhite 1.835 0.1729 10.61 5.069e-26

Income

Fitting linear model: formula(formula_str)
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 160.2 4.506 35.56 3.799e-246
raceHispanic 8.239 6.369 1.294 0.1959
raceOther 84.49 10.35 8.164 4.115e-16
raceWhite 106.9 5.532 19.32 3.499e-80

Years of education

Fitting linear model: formula(formula_str)
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.92 0.0777 166.2 0
raceHispanic -0.3561 0.1098 -3.242 0.001193
raceOther 1.697 0.1785 9.506 3.027e-21
raceWhite 1.294 0.0954 13.57 3.62e-41
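A note on reading the three regression tables above: with a factor predictor, R uses the first level as the reference, so the intercept is the mean for that group (here presumably Black, the alphabetically first level) and each other coefficient is a difference from it. The toy example below illustrates this; the data are invented, and I am assuming the formula was of the form `age ~ race`.

```r
# Invented toy data with three race groups.
toy <- data.frame(
  race = factor(c("Black", "Black", "Hispanic", "Hispanic", "White", "White"),
                levels = c("Black", "Hispanic", "White")),
  age  = c(24, 26, 26, 28, 27, 29)
)

fit <- lm(age ~ race, data = toy)
coef(fit)

# Group means, for comparison with the coefficients.
group_means <- tapply(toy$age, toy$race, mean)
group_means

# Intercept = reference-group mean; other coefficients = differences from it.
group_means - group_means[["Black"]]
```

Read this way, the age table says the mean age at conception is about 25.2 for Black respondents and about 1.8 years higher for White respondents.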

Saving this resource for later: Logistic regression diagnostic plots in R