Executive Summary

In this analysis I focus on how socioeconomic and health factors relate to birth weight in newborn babies. The dataset contains information on 3.8M births in the United States in 2018. It is made available by the CDC / National Center for Health Statistics (NCHS) and is publicly accessible through Kaggle. The dataset can be found here: https://www.kaggle.com/datasets/des137/us-births-2018

It is well known that smoking is terrible for health, and this analysis is just another example of that. Across the health and socioeconomic factors examined, mothers who smoke have, on average, lighter babies. More importantly, a trend appears when smoking, infection and risk are involved in a pregnancy. Taken independently, smoking, infection, and risk are each associated with lower birth weight across all variables of interest. For example, smoking was associated with lower baby weight in each category of mother’s age, whether the mother was younger or older. The combination of the three variables along with a low BMI is rare, but it is associated with an average birth weight 1.34 lb lower than for healthy-BMI nonsmokers with no infection or risk.

library(data.table)
library(tidyverse)
library(knitr)
library(dplyr)
library(kableExtra)
dt = fread("C:/Users/Alexei/Desktop/Projects/births/births_2018.csv")
head(dt) %>% kable("html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
ATTEND BFACIL BMI CIG_0 DBWT DLMP_MM DLMP_YY DMAR DOB_MM DOB_TT DOB_WK DOB_YY DWgt_R FAGECOMB FEDUC FHISPX FRACE15 FRACE31 FRACE6 ILLB_R ILOP_R ILP_R IMP_SEX IP_GON LD_INDL MAGER MAGE_IMPFLG MAR_IMP MBSTATE_REC MEDUC MHISPX MM_AICU MRACE15 MRACE31 MRACEIMP MRAVE6 MTRAN M_Ht_In NO_INFEC NO_MMORB NO_RISKS PAY PAY_REC PRECARE PREVIS PRIORDEAD PRIORLIVE PRIORTERM PWgt_R RDMETH_REC RESTATUS RF_CESAR RF_CESARN SEX WTGAIN
1 1 30.7 0 3657 4 2017 1 1 1227 2 2018 231 31 3 1 1 1 1 16 33 16 NA N N 30 NA NA 1 6 0 N 1 1 NA 1 N 66 1 1 1 2 2 3 8 0 1 2 190 1 2 N 0 M 41
1 1 33.3 2 3242 99 9999 2 1 1704 2 2018 185 35 4 0 3 3 3 180 888 180 NA N N 35 NA NA 1 9 0 N 3 3 NA 3 N 63 1 1 0 1 1 3 9 0 2 0 188 4 2 Y 2 F 0
1 1 30.0 0 3470 4 2017 1 1 336 2 2018 273 31 4 0 1 1 1 999 888 999 NA N N 28 NA NA 1 6 0 N 1 1 NA 1 N 71 1 1 0 5 4 5 17 0 1 0 215 1 1 N 0 M 58
3 1 23.7 0 3140 5 2017 2 1 938 2 2018 138 26 2 0 3 3 3 43 888 43 NA N N 23 NA NA 1 2 0 N 3 3 NA 3 N 64 1 1 1 1 1 5 6 0 2 0 138 1 2 N 0 F 0
1 1 35.5 0 2125 99 9999 1 1 830 3 2018 219 35 3 0 2 2 2 999 999 999 NA N N 37 NA NA 1 4 0 N 1 1 NA 1 N 66 1 1 1 1 1 5 15 0 1 4 220 3 1 N 0 M 0
4 2 31.3 0 4082 3 2017 1 1 28 2 2018 247 28 6 6 1 1 1 39 888 39 NA N N 26 NA NA 1 6 0 N 1 1 NA 1 N 67 1 1 1 2 2 2 13 0 1 0 200 1 1 N 0 F 47

Data cleaning


There are quite a few variables in the dataset that one could investigate, but I focused the analysis on 7 main variables (mother’s age, mother’s education, marital status, smoking, BMI, known risk factors and known infections), how they interact with each other, and how they relate to newborn weight.

All of the columns are coded as numeric values. For instance, the column meduc is coded 1-9 and each number means a different level of education: 1 is the least amount of education, 8 is the highest, and 9 means unknown. To make the columns easier to work with, I recoded them using the column key that accompanies the dataset, which can be found here: https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/DVS/natality/UserGuide2018-508.pdf

cols = c("MAGER", "DMAR", "MEDUC", "CIG_0", "BMI", "DBWT", "NO_RISKS", "NO_INFEC")
dt = dt[ , cols, with = FALSE]

colnames(dt) = tolower(colnames(dt))

Babyweight

# dbwt = baby weight at birth, recorded in grams (9999 = not stated).
# Convert grams to ounces (28.35 g per oz), then ounces to pounds (16 oz per lb).
dt[dbwt == 9999, dbwt := NA]
dt[ , babyweight_oz := dbwt / 28.35]
dt[ , dbwt := NULL]
dt[ , babyweight_lbs := babyweight_oz / 16]
# Remove outliers for baby weight. Babies weighing in at > 15 lbs are very, very rare. https://www.nationwidechildrens.org/conditions/health-library/newborn-measurements
# This filter also drops rows where the weight is missing (NA).

dt = dt[babyweight_lbs < 15.0]

dt[babyweight_lbs > 8.8, babyweight_category := "c.macrosomia"]
dt[babyweight_lbs >= 5.5 & babyweight_lbs <= 8.8, babyweight_category := "b.normal birthweight"]
dt[babyweight_lbs < 5.5, babyweight_category := "a.low birthweight"]
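The conversion above divides by 28.35 and then by 16, which only makes sense if the raw dbwt values are grams. A minimal worked sketch of that arithmetic follows, using two dbwt values from the preview rows at the top; `grams_to_lbs` is a hypothetical helper name, not part of the original analysis.

```r
# Hypothetical helper; the 28.35 g/oz and 16 oz/lb factors are the ones used above
grams_to_lbs <- function(grams) {
  ounces <- grams / 28.35  # grams -> ounces
  ounces / 16              # ounces -> pounds
}

weights_g <- c(3657, 2125, 9999)      # 9999 is the "not stated" code
weights_g[weights_g == 9999] <- NA    # recode the sentinel before converting
weights_lbs <- grams_to_lbs(weights_g)
print(round(weights_lbs, 2))
```

Recoding the sentinel first keeps it from turning into an implausible weight of roughly 22 lbs.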

Smoking

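# Smoker = cig_0: coded for cigarettes smoked before pregnancy (99 = unknown).
# Note: 99 also satisfies cig_0 > 0, so as written the unknowns end up TRUE
# and the smoker tables below contain no NA group.
dt[cig_0 == 99, smoker := NA]
dt[cig_0 == 0, smoker := FALSE]
dt[cig_0 > 0, smoker := TRUE]
dt[ , cig_0 := NULL]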
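Because the three assignments above run in order, the last rule (`cig_0 > 0`) also matches the unknown code 99. The sketch below illustrates this on toy values (not the births data) and shows one possible alternative; the column names `smoker_as_written` and `smoker_alt` are made up for the illustration.

```r
library(data.table)

toy <- data.table(cig_0 = c(0, 5, 20, 99))  # toy values; 99 = unknown

# Same order as above: 99 is set to NA, then overwritten by the > 0 rule
toy[cig_0 == 99, smoker_as_written := NA]
toy[cig_0 == 0, smoker_as_written := FALSE]
toy[cig_0 > 0, smoker_as_written := TRUE]

# One alternative: exclude the unknown code from the > 0 rule,
# so rows with 99 keep the default NA
toy[cig_0 == 0, smoker_alt := FALSE]
toy[cig_0 > 0 & cig_0 != 99, smoker_alt := TRUE]

print(toy)
```

Switching to the alternative on the real data would add an NA group to every smoker table, so the figures reported in this analysis would shift somewhat.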

Risk factors

# no_risk = no risk factors reported. The guide does not specify what a risk is. 
dt[no_risks == 1, has_risk := FALSE]
dt[no_risks == 0, has_risk := TRUE]
dt[no_risks == 9, has_risk := NA]

Infection

# no_infec = no infection reported. The guide does not specify what an infection could be. 
dt[no_infec == 1, has_infection := FALSE]
dt[no_infec == 0, has_infection := TRUE]
dt[no_infec == 9, has_infection := NA]

BMI

#bmi
dt[ , bmi := ifelse(bmi == 99.9, NA, bmi)]

dt[bmi < 18.5, bmi_category := "a. mom underweight"]
dt[bmi >= 18.5 & bmi < 25.0, bmi_category := "b. mom healthy"]
dt[bmi >= 25.0 & bmi < 30.0, bmi_category := "c. mom overweight"]
dt[bmi >= 30, bmi_category := "e. mom obese"]
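The same four bins could also be produced in a single call with base R's `cut()`. This is only a sketch on toy values, assuming the same cut points and labels as above; it is not how the original analysis built the column.

```r
bmi <- c(17.2, 18.5, 24.9, 25.0, 29.9, 30.0, NA)  # toy values, not the births data

bmi_category <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("a. mom underweight", "b. mom healthy",
             "c. mom overweight", "e. mom obese"),
  right = FALSE  # intervals are [low, high), matching the >= / < rules above
)

print(data.frame(bmi, bmi_category))
```

With `right = FALSE` a BMI of exactly 18.5 lands in the healthy bin and exactly 30 in the obese bin, and missing values stay NA.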

Mother’s age

# mager = mothers age 
dt[mager >= 13 & mager < 19, mother_age_category := "a. teen mom"]
dt[mager >= 19 & mager < 25, mother_age_category := "b. young aged mom"]
dt[mager >= 25 & mager < 35, mother_age_category := "c. standard aged mom"]
dt[mager  >= 35, mother_age_category := "d. advanced aged mom"]

Mother’s education

# meduc = mothers education level 
dt[meduc == 1, mother_ed := "a. <= 8th grade"]
dt[meduc == 2, mother_ed := "b. high school no diploma"]
dt[meduc == 3, mother_ed := "c. high school"]
dt[meduc == 4, mother_ed := "d. college credit"]
dt[meduc %in% c(5,6), mother_ed := "e. undergraduate degree"]
dt[meduc %in% c(7,8), mother_ed := "f. graduate degree"]
dt[meduc == 9, mother_ed := "g. NA"]
dt[ , meduc := NULL]

Marital status

# dmar = marriage status 
dt[dmar == 1, married :=  TRUE]
dt[dmar == 2, married :=  FALSE]
dt[ , dmar := NULL]

Analyzing key variables


Birth weight

Now that the data is nice and tidy, I wanted to understand how each variable relates to birth weight. First I look at the health factors, and then at the socioeconomic factors.

ggplot(dt, aes(x = babyweight_lbs)) +
  geom_density(fill = "light blue", alpha = 0.7)

Birth weight is roughly normally distributed, with a mean around 7 lbs. The left tail is slightly heavier, reflecting more lighter-weight babies.

Smoking

# smoking and BW 
ggplot(dt, aes(x = smoker, y = babyweight_lbs, fill = smoker)) + 
  geom_violin()

This violin plot compares birth weight for mothers who smoke and mothers who do not. It highlights that the median birth weight for smokers is lower than that of nonsmokers.

dt[ , .(.N, avg_weight = mean(babyweight_lbs, na.rm = TRUE)), by = smoker] 
##    smoker       N avg_weight
## 1:  FALSE 3462936   7.221064
## 2:   TRUE  335587   6.874621

Babies of mothers who smoke weigh on average 0.35 lb less than babies of nonsmokers.

Risk factors

#risk factors 
ggplot(dt, aes(x= has_risk, y = babyweight_lbs, fill = has_risk)) +
  geom_violin()

This violin plot compares mothers who have a known risk factor with mothers who do not. It highlights that the median birth weight for mothers with a known risk is lower than for mothers with no risk. The thicker lower tail also shows that there are more low-weight births when a risk factor is known.

Infection

ggplot(dt, aes(x= has_infection, y = babyweight_lbs, fill = has_infection)) +
  geom_violin()

This violin plot compares mothers with a known infection to mothers without infection. The plot highlights that the median birth weight for mothers with a known infection is lower than for mothers without an infection.

BMI

#mothers BMI
ggplot(dt, aes(x = bmi)) +
  geom_density(fill = "light blue", alpha = 0.7)
## Warning: Removed 85199 rows containing non-finite values (`stat_density()`).

The density plot shows how BMI is distributed in the dataset. The largest group of women falls within normal limits (18.5 - 24.9 is considered a healthy BMI), with spikes around a BMI of 30 (the boundary between overweight and obese).

ggplot(dt, aes(x = bmi, y = babyweight_lbs)) +
  geom_bin2d(bins = 100) +
  scale_fill_continuous(type = "viridis") 
## Warning: Removed 85199 rows containing non-finite values (`stat_bin2d()`).

The density map shows that a substantial number of women have a higher BMI, but it does not show any clear correlation with low birth weight.

dt[ , mean(babyweight_lbs), by = bmi_category][order(bmi_category)]
##          bmi_category       V1
## 1: a. mom underweight 6.737027
## 2:     b. mom healthy 7.138015
## 3:  c. mom overweight 7.254280
## 4:       e. mom obese 7.280933
## 5:               <NA> 6.936613

Babies of underweight mothers weigh on average 0.4 lb less than babies of mothers with a healthy BMI, which suggests a relation between low maternal BMI and lower birth weight. As the mother’s BMI category increases, so does the average weight of the newborn.

Mother’s age

# mothers age 
ggplot(dt, aes(x = mager, y = babyweight_lbs)) +
  geom_bin2d(bins = 100) +
  scale_fill_continuous(type = "viridis")

This heat map suggests that there is no strong correlation between mother’s age and birth weight. The densest region for birth weight sits around 7 lbs and does not drop as the mother’s age increases.

ggplot(dt, aes(x = mother_age_category, y = babyweight_lbs, fill = mother_age_category)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = "white", color = "black") +
    theme(axis.text.x = element_text(angle=45))

As the mother’s age increases, baby weight tends to increase. Teen mothers have a lower median baby weight than older mothers.

Mother’s education

# mothers education
dt_mother_ed = dt[ , .N, by = mother_ed][order(mother_ed)]
dt_mother_ed[ , pct := N / sum(N)] %>% kable("html") %>% kable_styling("striped")
mother_ed                        N        pct
a. <= 8th grade             118024  0.0310710
b. high school no diploma   358451  0.0943659
c. high school              967710  0.2547595
d. college credit           751781  0.1979140
e. undergraduate degree    1090837  0.2871740
f. graduate degree          463153  0.1219298
g. NA                        48567  0.0127858

The table shows the share of births by mother’s education level. The majority of women (about 74%) have between a high school diploma and an undergraduate degree. A surprise in the data is how many women did not finish high school (about 12.5%).
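The shares behind this reading of the table can be recomputed directly from the counts. A small sketch in base R follows; the counts are copied from the table above and the vector names are only illustrative.

```r
# Counts copied from the mother_ed table above
n <- c(le_8th = 118024, hs_no_diploma = 358451, high_school = 967710,
       college_credit = 751781, undergraduate = 1090837, graduate = 463153,
       unknown = 48567)

pct <- n / sum(n)

# Share with a high school diploma up to an undergraduate degree
hs_to_undergrad <- sum(pct[c("high_school", "college_credit", "undergraduate")])
# Share that did not finish high school
below_hs <- sum(pct[c("le_8th", "hs_no_diploma")])

print(round(c(hs_to_undergrad = hs_to_undergrad, below_hs = below_hs), 3))
```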

ggplot(dt, aes(x = mother_ed, y = babyweight_lbs, fill = mother_ed)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = "white", color = "black") +
  theme(axis.text.x = element_text(angle=45))

The median birth weight for mothers with a high school diploma or less is lower than for mothers with higher educational levels.

dt[ , mean(babyweight_lbs), by = mother_ed][order(mother_ed)]
##                    mother_ed       V1
## 1:           a. <= 8th grade 7.208248
## 2: b. high school no diploma 6.981034
## 3:            c. high school 7.086086
## 4:         d. college credit 7.174980
## 5:   e. undergraduate degree 7.315141
## 6:        f. graduate degree 7.313910
## 7:                     g. NA 7.034306

Mothers with some high school but no diploma have the lowest average baby weight (6.98 lb). What is interesting is that mothers with an 8th grade education or less have babies of about average weight (7.21 lb).

Marital status

#married
ggplot(dt, aes(x = married, y = babyweight_lbs, fill = married)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = "white", color = "black") 

Marriage seems to be a social tie that is associated with healthier weight for babies.

Interaction of key variables with smoking on birth weight


Let us compare the health factors smoking + risk and smoking + infection in relation to BMI, education, age and marital status.

# Smoking x risk (rows with a missing has_risk stay NA)
dt[smoker == FALSE & has_risk == FALSE, smoke_risk_category := "Not a smoker and no risk"]
dt[smoker == FALSE & has_risk == TRUE, smoke_risk_category := "Not a smoker and has risk"]
dt[smoker == TRUE & has_risk == FALSE, smoke_risk_category := "smoker and no risk"]
dt[smoker == TRUE & has_risk == TRUE, smoke_risk_category := "smoker and has risk"]
# Smoking x infection (rows with a missing has_infection stay NA)
dt[smoker == FALSE & has_infection == FALSE, smoke_infect_category := "Not a smoker and no infection"]
dt[smoker == FALSE & has_infection == TRUE, smoke_infect_category := "Not a smoker and has infection"]
dt[smoker == TRUE & has_infection == FALSE, smoke_infect_category := "smoker and no infection"]
dt[smoker == TRUE & has_infection == TRUE, smoke_infect_category := "smoker and has infection"]
# Smoking x infection x risk: only the "neither" and "both" cases are labeled,
# so mothers with exactly one of infection or risk fall into the NA group
dt[smoker == FALSE & has_infection == FALSE & has_risk == FALSE, smoke_i_r_category := "Nonsmoker:no infection or risk"]
dt[smoker == FALSE & has_infection == TRUE & has_risk == TRUE, smoke_i_r_category := "Nonsmoker:has infection & risk"]
dt[smoker == TRUE & has_infection == FALSE & has_risk == FALSE, smoke_i_r_category := "Smoker:no infection or risk"]
dt[smoker == TRUE & has_infection == TRUE & has_risk == TRUE, smoke_i_r_category := "Smoker: has infection & risk"]
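The three-way category only labels the "neither" and "both" cases, which explains the large NA group in the tables that follow. The sketch below illustrates this on a toy table holding every TRUE/FALSE combination once; it is an illustration, not the births data.

```r
library(data.table)

# Toy rows: every combination of the three flags, one row each
toy <- CJ(smoker = c(FALSE, TRUE),
          has_infection = c(FALSE, TRUE),
          has_risk = c(FALSE, TRUE))

# Same four rules as above
toy[smoker == FALSE & has_infection == FALSE & has_risk == FALSE, smoke_i_r_category := "Nonsmoker:no infection or risk"]
toy[smoker == FALSE & has_infection == TRUE & has_risk == TRUE, smoke_i_r_category := "Nonsmoker:has infection & risk"]
toy[smoker == TRUE & has_infection == FALSE & has_risk == FALSE, smoke_i_r_category := "Smoker:no infection or risk"]
toy[smoker == TRUE & has_infection == TRUE & has_risk == TRUE, smoke_i_r_category := "Smoker: has infection & risk"]

print(toy)
n_unlabeled <- toy[, sum(is.na(smoke_i_r_category))]
print(n_unlabeled)  # combinations with exactly one of infection / risk
```

Half of the eight combinations stay unlabeled, so the NA rows in the three-way tables mix "infection only", "risk only" and genuinely missing values.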

BMI

dt[ , .(mean(babyweight_lbs)), by = .(bmi_category, smoke_risk_category)][order(smoke_risk_category, bmi_category)] %>% kable("html") %>% kable_styling("striped") %>% scroll_box(height = "400px")
bmi_category        smoke_risk_category              V1
a. mom underweight  Not a smoker and has risk  6.557711
b. mom healthy      Not a smoker and has risk  6.968372
c. mom overweight   Not a smoker and has risk  7.101688
e. mom obese        Not a smoker and has risk  7.185557
NA                  Not a smoker and has risk  6.829529
a. mom underweight  Not a smoker and no risk   6.846959
b. mom healthy      Not a smoker and no risk   7.234404
c. mom overweight   Not a smoker and no risk   7.362274
e. mom obese        Not a smoker and no risk   7.395239
NA                  Not a smoker and no risk   7.072394
a. mom underweight  smoker and has risk        6.074468
b. mom healthy      smoker and has risk        6.507460
c. mom overweight   smoker and has risk        6.773441
e. mom obese        smoker and has risk        6.985766
NA                  smoker and has risk        6.339734
a. mom underweight  smoker and no risk         6.484968
b. mom healthy      smoker and no risk         6.834372
c. mom overweight   smoker and no risk         7.051636
e. mom obese        smoker and no risk         7.159840
NA                  smoker and no risk         6.537244
a. mom underweight  NA                         6.445976
b. mom healthy      NA                         6.781874
c. mom overweight   NA                         7.069293
e. mom obese        NA                         7.186677
NA                  NA                         6.504703

WOW! Underweight mothers who smoke and have a known risk have, on average, a baby 1.16 lb lighter than healthy-BMI nonsmokers with no risk (6.07 lb vs 7.23 lb).

dt[ , .(mean(babyweight_lbs)), by = .(bmi_category, smoke_infect_category)][order(smoke_infect_category, bmi_category)] %>% kable("html") %>% kable_styling("striped") %>% scroll_box(height = "400px")
bmi_category        smoke_infect_category                 V1
a. mom underweight  Not a smoker and has infection  6.557840
b. mom healthy      Not a smoker and has infection  6.879859
c. mom overweight   Not a smoker and has infection  7.037077
e. mom obese        Not a smoker and has infection  7.112889
NA                  Not a smoker and has infection  6.625991
a. mom underweight  Not a smoker and no infection   6.802793
b. mom healthy      Not a smoker and no infection   7.179467
c. mom overweight   Not a smoker and no infection   7.286381
e. mom obese        Not a smoker and no infection   7.308549
NA                  Not a smoker and no infection   7.012842
a. mom underweight  smoker and has infection        6.267857
b. mom healthy      smoker and has infection        6.567159
c. mom overweight   smoker and has infection        6.737807
e. mom obese        smoker and has infection        6.931946
NA                  smoker and has infection        6.208082
a. mom underweight  smoker and no infection         6.409434
b. mom healthy      smoker and no infection         6.770930
c. mom overweight   smoker and no infection         6.980036
e. mom obese        smoker and no infection         7.090450
NA                  smoker and no infection         6.508547
a. mom underweight  NA                              6.053150
b. mom healthy      NA                              6.542826
c. mom overweight   NA                              6.732803
e. mom obese        NA                              6.736082
NA                  NA                              6.409067

An underweight mother who smokes and has a known infection has, on average, a baby 0.91 lb lighter than a healthy-BMI nonsmoker with no infection (6.27 lb vs 7.18 lb).

dt[ , .(n = .N, babyweight_lbs = mean(babyweight_lbs)), by = .(bmi_category, smoke_i_r_category)][order(smoke_i_r_category, bmi_category)] %>% kable("html") %>% kable_styling("striped") %>% scroll_box(height = "400px")
bmi_category        smoke_i_r_category                    n  babyweight_lbs
a. mom underweight  Nonsmoker:has infection & risk      650        6.302164
b. mom healthy      Nonsmoker:has infection & risk     7251        6.640815
c. mom overweight   Nonsmoker:has infection & risk     5867        6.812269
e. mom obese        Nonsmoker:has infection & risk     8923        6.987495
NA                  Nonsmoker:has infection & risk      574        6.450014
a. mom underweight  Nonsmoker:no infection or risk    79910        6.855686
b. mom healthy      Nonsmoker:no infection or risk  1076981        7.241397
c. mom overweight   Nonsmoker:no infection or risk   607300        7.368130
e. mom obese        Nonsmoker:no infection or risk   512472        7.400858
NA                  Nonsmoker:no infection or risk    50664        7.086106
a. mom underweight  Smoker: has infection & risk        357        5.902515
b. mom healthy      Smoker: has infection & risk       3519        6.354822
c. mom overweight   Smoker: has infection & risk       2186        6.575385
e. mom obese        Smoker: has infection & risk       2573        6.827571
NA                  Smoker: has infection & risk        318        6.086084
a. mom underweight  Smoker:no infection or risk       11774        6.497483
b. mom healthy      Smoker:no infection or risk       83970        6.855321
c. mom overweight   Smoker:no infection or risk       48894        7.072248
e. mom obese        Smoker:no infection or risk       52647        7.170103
NA                  Smoker:no infection or risk        5719        6.571920
a. mom underweight  NA                                25529        6.498823
b. mom healthy      NA                               393543        6.931580
c. mom overweight   NA                               323837        7.080848
e. mom obese        NA                               465141        7.169487
NA                  NA                                27924        6.759759

This is the lowest average weight seen so far in the analysis. Underweight mothers who smoke and have both a risk and an infection (n = 357) average 5.9 lbs, a 1.34 lb difference from healthy-BMI nonsmokers with no infection or risk (7.24 lbs).

Mother’s age

dt[ , .(mean(babyweight_lbs)), by = .(smoke_risk_category, mother_age_category)][order(mother_age_category, smoke_risk_category)] %>% kable("html") %>% kable_styling("striped") %>% scroll_box(height = "400px")
smoke_risk_category        mother_age_category         V1
Not a smoker and has risk  a. teen mom           6.643661
Not a smoker and no risk   a. teen mom           6.946099
smoker and has risk        a. teen mom           6.629075
smoker and no risk         a. teen mom           6.886098
NA                         a. teen mom           6.443932
Not a smoker and has risk  b. young aged mom     6.910346
Not a smoker and no risk   b. young aged mom     7.146110
smoker and has risk        b. young aged mom     6.802486
smoker and no risk         b. young aged mom     6.961263
NA                         b. young aged mom     6.784995
Not a smoker and has risk  c. standard aged mom  7.118856
Not a smoker and no risk   c. standard aged mom  7.348293
smoker and has risk        c. standard aged mom  6.774697
smoker and no risk         c. standard aged mom  6.951784
NA                         c. standard aged mom  6.901465
Not a smoker and has risk  d. advanced aged mom  7.087011
Not a smoker and no risk   d. advanced aged mom  7.338563
smoker and has risk        d. advanced aged mom  6.602384
smoker and no risk         d. advanced aged mom  6.810126
NA                         d. advanced aged mom  6.850426
Not a smoker and has risk  NA                    6.560112
Not a smoker and no risk   NA                    6.224721

Regardless of age, babies of mothers who smoke and have a known risk weigh less on average than babies of nonsmokers with no risk. The gap widens with age: about 0.32 lb for teen mothers versus about 0.74 lb for mothers of advanced age.

Advanced maternal age coupled with smoking and a known infection seems to be a key factor when determining lower birth weight.

dt[ , .(mean(babyweight_lbs)), by = .(smoke_infect_category, mother_age_category)][order(smoke_infect_category, mother_age_category)] %>% kable("html") %>% kable_styling("striped") %>% scroll_box(height = "400px")
smoke_infect_category           mother_age_category         V1
Not a smoker and has infection  a. teen mom           6.825696
Not a smoker and has infection  b. young aged mom     6.932546
Not a smoker and has infection  c. standard aged mom  7.024985
Not a smoker and has infection  d. advanced aged mom  6.998519
Not a smoker and has infection  NA                    5.936949
Not a smoker and no infection   a. teen mom           6.918317
Not a smoker and no infection   b. young aged mom     7.106121
Not a smoker and no infection   c. standard aged mom  7.281290
Not a smoker and no infection   d. advanced aged mom  7.230882
Not a smoker and no infection   NA                    6.287104
smoker and has infection        a. teen mom           6.871506
smoker and has infection        b. young aged mom     6.781263
smoker and has infection        c. standard aged mom  6.611574
smoker and has infection        d. advanced aged mom  6.424698
smoker and no infection         a. teen mom           6.854834
smoker and no infection         b. young aged mom     6.939886
smoker and no infection         c. standard aged mom  6.910951
smoker and no infection         d. advanced aged mom  6.730659
NA                              a. teen mom           6.330203
NA                              b. young aged mom     6.477386
NA                              c. standard aged mom  6.658244
NA                              d. advanced aged mom  6.577888

Among mothers aged 19 and older, babies of smokers with a known infection weigh noticeably less than babies of nonsmokers without one; for teen mothers the gap is small. Older mothers who smoke and have an infection show the lowest average at 6.4 lbs.

dt[ , .(mean(babyweight_lbs)), by = .(smoke_i_r_category, mother_age_category)][order(smoke_i_r_category, mother_age_category)] %>% kable("html") %>% kable_styling("striped") %>% scroll_box(height = "400px")
smoke_i_r_category              mother_age_category         V1
Nonsmoker:has infection & risk  a. teen mom           6.565531
Nonsmoker:has infection & risk  b. young aged mom     6.731292
Nonsmoker:has infection & risk  c. standard aged mom  6.865472
Nonsmoker:has infection & risk  d. advanced aged mom  6.835796
Nonsmoker:no infection or risk  a. teen mom           6.953718
Nonsmoker:no infection or risk  b. young aged mom     7.154507
Nonsmoker:no infection or risk  c. standard aged mom  7.352264
Nonsmoker:no infection or risk  d. advanced aged mom  7.341065
Nonsmoker:no infection or risk  NA                    6.231261
Smoker: has infection & risk    a. teen mom           6.558642
Smoker: has infection & risk    b. young aged mom     6.662261
Smoker: has infection & risk    c. standard aged mom  6.497953
Smoker: has infection & risk    d. advanced aged mom  6.349676
Smoker:no infection or risk     a. teen mom           6.883792
Smoker:no infection or risk     b. young aged mom     6.977089
Smoker:no infection or risk     c. standard aged mom  6.976632
Smoker:no infection or risk     d. advanced aged mom  6.834178
NA                              a. teen mom           6.737533
NA                              b. young aged mom     6.911568
NA                              c. standard aged mom  7.087237
NA                              d. advanced aged mom  7.059787
NA                              NA                    6.497795

Smokers of advanced age who have both a known risk and an infection have, on average, lower-weight babies (6.35 lbs).

Mother’s education

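# Average birth weight per cell first; otherwise geom_tile() draws one overlapping tile per birth
dt_tile_infect = dt[ , .(babyweight_lbs = mean(babyweight_lbs)), by = .(mother_ed, smoke_infect_category)]
ggplot(dt_tile_infect, aes(x = mother_ed, y = smoke_infect_category, fill = babyweight_lbs)) +
  geom_tile() +
  scale_fill_gradient(low="white", high="blue") +
  theme(axis.text.x = element_text(angle=45))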

A mother’s level of education also seems to play a role. Mothers who smoke or have a known infection and have a high school education or less seem to have lower than average birth weights.

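# Average birth weight per cell first; otherwise geom_tile() draws one overlapping tile per birth
dt_tile_risk = dt[ , .(babyweight_lbs = mean(babyweight_lbs)), by = .(mother_ed, smoke_risk_category)]
ggplot(dt_tile_risk, aes(x = mother_ed, y = smoke_risk_category, fill = babyweight_lbs)) +
  geom_tile() +
  scale_fill_gradient(low="white", high="blue") +
  theme(axis.text.x = element_text(angle=45))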

Risk coupled with smoking also appears to be associated with lower weight for mothers who have a high school education or less.

dt[ , .(mean(babyweight_lbs)), by = .(mother_ed, smoke_risk_category)][order(smoke_risk_category, mother_ed)] %>% kable("html") %>% kable_styling("striped") %>% scroll_box(height = "400px")
mother_ed                  smoke_risk_category              V1
a. <= 8th grade            Not a smoker and has risk  7.128094
b. high school no diploma  Not a smoker and has risk  6.942638
c. high school             Not a smoker and has risk  7.000002
d. college credit          Not a smoker and has risk  7.059247
e. undergraduate degree    Not a smoker and has risk  7.152244
f. graduate degree         Not a smoker and has risk  7.148772
g. NA                      Not a smoker and has risk  6.925055
a. <= 8th grade            Not a smoker and no risk   7.282870
b. high school no diploma  Not a smoker and no risk   7.083236
c. high school             Not a smoker and no risk   7.175663
d. college credit          Not a smoker and no risk   7.271049
e. undergraduate degree    Not a smoker and no risk   7.400901
f. graduate degree         Not a smoker and no risk   7.391967
g. NA                      Not a smoker and no risk   7.164486
a. <= 8th grade            smoker and has risk        6.546461
b. high school no diploma  smoker and has risk        6.576046
c. high school             smoker and has risk        6.752725
d. college credit          smoker and has risk        6.810587
e. undergraduate degree    smoker and has risk        6.940088
f. graduate degree         smoker and has risk        7.003810
g. NA                      smoker and has risk        6.313459
a. <= 8th grade            smoker and no risk         6.773087
b. high school no diploma  smoker and no risk         6.780626
c. high school             smoker and no risk         6.924282
d. college credit          smoker and no risk         7.016880
e. undergraduate degree    smoker and no risk         7.154075
f. graduate degree         smoker and no risk         7.306444
g. NA                      smoker and no risk         6.595227
a. <= 8th grade            NA                         7.430649
b. high school no diploma  NA                         6.659285
c. high school             NA                         6.594177
d. college credit          NA                         6.864671
e. undergraduate degree    NA                         7.258410
f. graduate degree         NA                         7.236790
g. NA                      NA                         6.292635

Smoking combined with a known risk is associated with lower weight at every education level.

dt[ , .(mean(babyweight_lbs)), by = .(mother_ed, smoke_infect_category)][order(smoke_infect_category, mother_ed)] %>% kable("html") %>% kable_styling("striped") %>% scroll_box(height = "400px")
mother_ed                  smoke_infect_category                 V1
a. <= 8th grade            Not a smoker and has infection  7.085872
b. high school no diploma  Not a smoker and has infection  6.895973
c. high school             Not a smoker and has infection  6.928241
d. college credit          Not a smoker and has infection  6.987702
e. undergraduate degree    Not a smoker and has infection  7.099854
f. graduate degree         Not a smoker and has infection  7.180005
g. NA                      Not a smoker and has infection  6.776868
a. <= 8th grade            Not a smoker and no infection   7.238241
b. high school no diploma  Not a smoker and no infection   7.050682
c. high school             Not a smoker and no infection   7.131319
d. college credit          Not a smoker and no infection   7.209835
e. undergraduate degree    Not a smoker and no infection   7.325956
f. graduate degree         Not a smoker and no infection   7.316162
g. NA                      Not a smoker and no infection   7.096050
a. <= 8th grade            smoker and has infection        6.489807
b. high school no diploma  smoker and has infection        6.570212
c. high school             smoker and has infection        6.697327
d. college credit          smoker and has infection        6.727684
e. undergraduate degree    smoker and has infection        6.737024
f. graduate degree         smoker and has infection        6.910538
g. NA                      smoker and has infection        6.138921
a. <= 8th grade            smoker and no infection         6.719711
b. high school no diploma  smoker and no infection         6.733619
c. high school             smoker and no infection         6.885887
d. college credit          smoker and no infection         6.963012
e. undergraduate degree    smoker and no infection         7.089085
f. graduate degree         smoker and no infection         7.203003
g. NA                      smoker and no infection         6.548371
a. <= 8th grade            NA                              7.101600
b. high school no diploma  NA                              6.388365
c. high school             NA                              6.424440
d. college credit          NA                              6.577856
e. undergraduate degree    NA                              6.837060
f. graduate degree         NA                              6.805563
g. NA                      NA                              6.270521

Infection is also associated with lower weight among mothers with less education. The trend still holds that smoking and infection are associated with lower weight regardless of education level.

dt[ , .(mean(babyweight_lbs)), by = .(smoke_i_r_category, mother_ed)][order(smoke_i_r_category, mother_ed)] %>% kable("html") %>% kable_styling("striped") %>% scroll_box(height = "400px")
smoke_i_r_category              mother_ed                        V1
Nonsmoker:has infection & risk  a. <= 8th grade            7.010168
Nonsmoker:has infection & risk  b. high school no diploma  6.738782
Nonsmoker:has infection & risk  c. high school             6.767663
Nonsmoker:has infection & risk  d. college credit          6.819223
Nonsmoker:has infection & risk  e. undergraduate degree    6.891021
Nonsmoker:has infection & risk  f. graduate degree         7.030520
Nonsmoker:has infection & risk  g. NA                      6.541675
Nonsmoker:no infection or risk  a. <= 8th grade            7.287876
Nonsmoker:no infection or risk  b. high school no diploma  7.091954
Nonsmoker:no infection or risk  c. high school             7.183927
Nonsmoker:no infection or risk  d. college credit          7.277111
Nonsmoker:no infection or risk  e. undergraduate degree    7.403139
Nonsmoker:no infection or risk  f. graduate degree         7.393043
Nonsmoker:no infection or risk  g. NA                      7.170110
Smoker: has infection & risk    a. <= 8th grade            6.440378
Smoker: has infection & risk    b. high school no diploma  6.407769
Smoker: has infection & risk    c. high school             6.552478
Smoker: has infection & risk    d. college credit          6.586868
Smoker: has infection & risk    e. undergraduate degree    6.650195
Smoker: has infection & risk    f. graduate degree         6.842458
Smoker: has infection & risk    g. NA                      6.000490
Smoker:no infection or risk     a. <= 8th grade            6.807455
Smoker:no infection or risk     b. high school no diploma  6.798876
Smoker:no infection or risk     c. high school             6.941398
Smoker:no infection or risk     d. college credit          7.034957
Smoker:no infection or risk     e. undergraduate degree    7.168269
Smoker:no infection or risk     f. graduate degree         7.309864
Smoker:no infection or risk     g. NA                      6.634552
NA                              a. <= 8th grade            7.098757
NA                              b. high school no diploma  6.881254
NA                              c. high school             6.967949
NA                              d. college credit          7.033591
NA                              e. undergraduate degree    7.146604
NA                              f. graduate degree         7.148474
NA                              g. NA                      6.864992

Marital status

dt[ , .(mean(babyweight_lbs)), by = .(married, smoke_risk_category)][order(smoke_risk_category, married)] %>% kable("html") %>% kable_styling("striped") %>% scroll_box(height = "400px")
married  smoke_risk_category              V1
FALSE    Not a smoker and has risk  6.883207
TRUE     Not a smoker and has risk  7.159108
NA       Not a smoker and has risk  7.163161
FALSE    Not a smoker and no risk   7.091426
TRUE     Not a smoker and no risk   7.404219
NA       Not a smoker and no risk   7.291443
FALSE    smoker and has risk        6.675995
TRUE     smoker and has risk        6.879813
NA       smoker and has risk        6.952230
FALSE    smoker and no risk         6.882694
TRUE     smoker and no risk         7.079143
NA       smoker and no risk         7.023344
FALSE    NA                         6.561099
TRUE     NA                         7.212676
NA       NA                         6.985597

Unmarried mothers who have a known risk and/or smoke have, on average, lower birth weights than married mothers in the same category.

dt[ , .(mean(babyweight_lbs)), by = .(married, smoke_infect_category)][order(smoke_infect_category, married)] %>% kable("html") %>% kable_styling("striped") %>% scroll_box(height = "400px")
married  smoke_infect_category                 V1
FALSE    Not a smoker and has infection  6.894787
TRUE     Not a smoker and has infection  7.139267
NA       Not a smoker and has infection  7.137239
FALSE    Not a smoker and no infection   7.039423
TRUE     Not a smoker and no infection   7.325896
NA       Not a smoker and no infection   7.256991
FALSE    smoker and has infection        6.646922
TRUE     smoker and has infection        6.710516
NA       smoker and has infection        6.812935
FALSE    smoker and no infection         6.837096
TRUE     smoker and no infection         7.017957
NA       smoker and no infection         7.010407
FALSE    NA                              6.367787
TRUE     NA                              6.848663
NA       NA                              6.672506

Unmarried mothers who smoke and have a known infection have the lowest average weight of the labeled groups (6.65 lbs). However, the pattern holds regardless of marital status: birth weight is lower when there is a known infection and the mother smokes.

dt[ , .(mean(babyweight_lbs)), by = .(married, smoke_i_r_category)][order(smoke_i_r_category, married)] %>% kable("html") %>% kable_styling("striped") %>% scroll_box(height = "400px")
married  smoke_i_r_category                    V1
FALSE    Nonsmoker:has infection & risk  6.713503
TRUE     Nonsmoker:has infection & risk  6.977176
NA       Nonsmoker:has infection & risk  7.004074
FALSE    Nonsmoker:no infection or risk  7.099507
TRUE     Nonsmoker:no infection or risk  7.406384
NA       Nonsmoker:no infection or risk  7.292522
FALSE    Smoker: has infection & risk    6.499502
TRUE     Smoker: has infection & risk    6.580918
NA       Smoker: has infection & risk    6.615853
FALSE    Smoker:no infection or risk     6.903229
TRUE     Smoker:no infection or risk     7.092864
NA       Smoker:no infection or risk     7.029073
FALSE    NA                              6.859020
TRUE     NA                              7.146295
NA       NA                              7.160098

Conclusion


A trend is seen when smoking, infection and risk are involved in pregnancy. Taken independently, smoking, infection, and risk are each associated with lower birth weight across all variables of interest. For example, smoking was associated with lower baby weight in each category of mother’s age, whether the mother was younger or older. The combination of the three variables along with a low BMI is rare, but it is associated with an average birth weight 1.34 lb lower than for healthy-BMI nonsmokers with no infection or risk.

dt[ , .(n = .N, babyweight_lbs = mean(babyweight_lbs)), by = smoker]
##    smoker       n babyweight_lbs
## 1:  FALSE 3462936       7.221064
## 2:   TRUE  335587       6.874621

On average, babies of smokers weigh 0.346443 lb less than babies of nonsmokers.

dt[ , .(n = .N, babyweight_lbs = mean(babyweight_lbs)), by = has_risk]
##    has_risk       n babyweight_lbs
## 1:    FALSE 2606982       7.256789
## 2:     TRUE 1189034       7.045747
## 3:       NA    2507       6.846969

Babies from at-risk pregnancies weigh on average 0.211042 lb less than babies from pregnancies with no known risk.

dt[ , .(n = .N, babyweight_lbs = mean(babyweight_lbs)), by = has_infection]
##    has_infection       n babyweight_lbs
## 1:         FALSE 3685781       7.200389
## 2:          TRUE  104663       6.886994
## 3:            NA    8079       6.590603

Babies of mothers with an infection during pregnancy weigh on average 0.313395 lb less than babies of mothers without one.
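The three differences can be reproduced from the group means printed above. The sketch below copies those means by hand (it does not touch the births data) and adds the difference in ounces; the data frame and its column names are only illustrative.

```r
# Group means in lbs, copied from the three tables above
means <- data.frame(
  variable = c("smoker", "has_risk", "has_infection"),
  mean_no  = c(7.221064, 7.256789, 7.200389),
  mean_yes = c(6.874621, 7.045747, 6.886994)
)

means$diff_lbs <- means$mean_no - means$mean_yes
means$diff_oz  <- means$diff_lbs * 16  # 16 oz per lb

# Largest difference first
print(means[order(-means$diff_lbs), ])
```

Ranked this way, smoking shows the largest single-factor gap, followed by infection and then risk.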

dt[ , unhealthy_category := "all other"]
dt[smoker == TRUE & has_infection == TRUE & has_risk == TRUE & bmi_category == "a. mom underweight", unhealthy_category := "underweight smoker w/ risk & infec"]
dt[ , .(mean(babyweight_lbs)), by = unhealthy_category] %>% kable("html") %>% kable_styling("striped") 
unhealthy_category                        V1
all other                           7.190578
underweight smoker w/ risk & infec  5.902515

The combination of the three variables along with a low BMI is rare, but it is associated with an average birth weight 1.288063 lb lower than for all other births.

ggplot(dt, aes(x = unhealthy_category, y = babyweight_lbs, fill = unhealthy_category)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = "white", color = "black") +
  theme(axis.text.x = element_text(angle=45))


Further thoughts

Next steps would be to fit a linear regression model to see whether infection, risk or smoking is the stronger driver of weight. Other factors included in the data could also be considered.
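As a rough idea of what that regression could look like, here is a minimal sketch using base R's `lm()`. It runs on simulated stand-in data, and the effect sizes fed into the simulation are arbitrary assumptions, not estimates from the births data; on the real table the same formula would be fit with `data = dt`.

```r
set.seed(1)
n <- 1000

# Simulated stand-in data (assumed prevalences, not taken from the births data)
sim <- data.frame(
  smoker        = rbinom(n, 1, 0.10) == 1,
  has_risk      = rbinom(n, 1, 0.30) == 1,
  has_infection = rbinom(n, 1, 0.05) == 1
)

# Arbitrary effect sizes, chosen only so the model has something to recover
sim$babyweight_lbs <- 7.2 - 0.3 * sim$smoker - 0.2 * sim$has_risk -
  0.3 * sim$has_infection + rnorm(n, sd = 1.2)

# One coefficient per factor, each adjusted for the other two
fit <- lm(babyweight_lbs ~ smoker + has_risk + has_infection, data = sim)
print(summary(fit)$coefficients)
```

Each coefficient estimates the average weight difference for one factor while holding the other two fixed, which is what comparing the raw group means cannot do.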