Introduction

Section 1: Data Loading and Analytical Sample

Loading NSFG 2022- 2023 Data

We will use the National Survey of Family Growth (NSFG) for this project. It gathers information on pregnancy and births, marriage and cohabitation, infertility, use of contraception, family life, and general and reproductive health. The survey results are used by the U.S. Department of Health and Human Services and others to plan health services and health education programs, and to do statistical studies of families, fertility, and health.

The analytical sample is unchanged from Check-in 1. The final analytical sample consists of N = 1,618 singleton live births drawn from the NSFG 2022–2023 Female Pregnancy Public Use File. In check-in 2, the survey design variables (vecl, vest, wgt2022_2023) are added in the dataset for complex-sample weighted analysis.

Setup and Data

library(tidyverse) #for data manipulation and visualization 
library (car)
library(haven) # read/write datasets
library(here) # easy and reliable file referencing

library(knitr) # R markdown support
library(kableExtra) # enhanced table formatting

library(plotly) #create interactive, web-based, and publication-quality graphs

library(broom) #tidy model output

library(ggeffects) # marginal effect plots

library(gtsummary) #publication-ready tables

library(dplyr) #perform common data preparation tasks

library (survey) #complex survey data analysis with weights/ strata

library(readr) #read rectangular text data (like CSV)

# Model diagnostics
library(ResourceSelection)  # Hosmer-Lemeshow goodness-of-fit test
library(pROC)               # ROC curve and AUC
library(performance)        # McFadden pseudo-R²

# Other
library(car)            # vif() for multicollinearity
library(dplyr)

# gtsummary compact theme
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

This analysis utilizes data from the National Survey of Family Growth (NSFG) 2022-2023, a nationally representative survey conducted by the National Center for Health Statistics (NCHS). The NSFG 2022-2023 Female Pregnancy File (Public Use File) was loaded into R using the read_csv() function from the readr package. All variable names were standardized to lowercase with underscores using janitor::clean_names() to ensure consistency in variable referencing throughout the analysis. A seed (553) was set prior to data loading to ensure reproducibility of any subsequent random sampling.

# SECTION 1: DATA LOADING AND SAMPLE CONSTRUCTION

my_data <- read_csv(here::here("NSFG_2022_2023_FemPregPUFData.csv")
       )|>
  
janitor::clean_names() ## Convert all column names to lowercase with underscores

Creating a Working Subset

set.seed(553)  # For reproducibility

# Starting sample
n_start <- nrow(my_data)

# Select variables of interest
nsfg_subset<- my_data |>
  select(
    # Outcome: Breastfeeding status
    bfeedwks,
    
    #Predictors 
    bornaliv, 
    gest_lb,
    #demographics
    agecon,
    hisprace2, 
    metro, 
    religion,
    
    # SES
    curr_ins, 
    poverty, 
    hieduc, 
    paybirth1,
    
    # Baby status
    babysex, 
    birthwgt,
    vecl, 
    vest, 
    wgt2022_2023
   
  )|>
  # Filter to cases where bfeedwks is answered (values 0-7)
  
  filter(bfeedwks >= 0 & bfeedwks <= 7)

  # This automatically excludes non-live births since bfeedwks is only for live births

# Display subset dimensions
    
cat("Starting:", n_start, "\n")

## Starting: 8247

cat("Live births only:", nrow(nsfg_subset), "\n")

## Live births only: 1618

cat("Excluded:", n_start - nrow(nsfg_subset), "\non-live births/n")

## Excluded: 6629 
## on-live births/n

print(names(nsfg_subset))

##  [1] "bfeedwks"     "bornaliv"     "gest_lb"      "agecon"       "hisprace2"   
##  [6] "metro"        "religion"     "curr_ins"     "poverty"      "hieduc"      
## [11] "paybirth1"    "babysex"      "birthwgt"     "vecl"         "vest"        
## [16] "wgt2022_2023"

# Show distribution of bfeedwks
cat("\nDistribution of breastfeeding duration:\n")

## 
## Distribution of breastfeeding duration:

table(nsfg_subset$bfeedwks)

## 
##   0   1   2   3   4   5   6   7 
## 310  44 229 231 282 210  60 252

SECTION 2: Variable Recoding & Preparation

2.1 Outcome Variable: Breastfeeding Initiation

Variable name: breastfed_init_f (factor); breastfed_init (numeric 0/1) Original variable: BFEEDWKS (duration of breastfeeding in weeks) Scale: Binary (0 = Never breastfed, 1 = Ever breastfed)

Recoding procedure: The original NSFG variable BFEEDWKS is coded as follows:

• 0 = Never breastfed

• 1 = Less than 1 week

• 2 = 1-13 weeks (≤3 months)

• 3 = 14-26 weeks (3-6 months)

• 4 = 27-52 weeks (6-12 months)

• 5 = 53-104 weeks (1-2 years)

• 6 = 105+ weeks (>2 years)

• 7 = Still breastfeeding at time of interview

I created a binary indicator and the distribution in analytical sample (N = 1,618):

Never breastfed: n = 310 (19%)
Ever breastfed: n = 1308 (81%)

2.2 Primary Exposure: Preterm Birth from GEST_LB (gestational age category for live births)

The NSFG variable GEST_LB categorizes gestational age as:

• 1 = <34 weeks (early preterm)

• 2 = 34-36 weeks (late preterm)

• 3 = 37-40 weeks (term)

• 4 = ≥40 weeks (post-term)

Model Specification

Justification for Logistic Regression

Our outcome, breastfeeding initiation, is a binary variable (yes/no). Linear regression is inappropriate because: 1. It can predict probabilities outside [0, 1]. 2. Binary outcomes violate the constant-variance (homoscedasticity) assumption. 3. Residuals cannot be normally distributed when Y ∈ {0, 1}.

Logistic regression models the log-odds of breastfeeding initiation and constrains all predictions to [0, 1]. The survey-weighted version svyglm(..., family = quasibinomial) is used because the NSFG employs a complex sampling design. We use quasibinomial instead of binomial to account for potential overdispersion that can arise in complex survey data — it produces the same point estimates but uses empirical (sandwich) standard errors appropriate for weighted analyses.

Reference Categories

Preterm birth - Term (37–40 wks)- Clinically “normal” baseline; most prevalent category
Race/ethnicity- NH White- Largest group in NSFG; standard epidemiology
Metro status- Other MSA (suburban) Middle category; allows comparison in both directions
Education - High school/GED - Modal education level; meaningful baseline
Insurance - Private - Best coverage; represents advantaged comparison group
Poverty - 0–<200% FPL - Lower-income group where disparities are most evident

2.3 Covariates: All covariates were selected based on established literature linking maternal and infant characteristics to both preterm birth risk and breastfeeding.

Covariate Justification

Each covariate was selected based on literature demonstrating association with both preterm birth risk and breastfeeding initiation (i.e., meeting the epidemiologic criteria for confounding):

Maternal age: Younger mothers are more likely to have preterm births and less likely to breastfeed. Race/ethnicity: Well-documented racial disparities in both preterm birth prevalence and breastfeeding initiation. Metro status: Urban residence is associated with preterm birth risk and may reflect differential access to lactation support. Insurance: Public insurance indicates lower SES; associated with both preterm birth and lower breastfeeding rates. Education: Higher education is consistently linked to higher breastfeeding initiation and may reduce preterm risk. Poverty: Income affects access to prenatal care (preterm risk) and lactation resources.

Excluded variables (not included despite initial consideration):

religion_f: Not a standard confounder in the literature; exploratory bivariate analysis showed marginal significance but poor basis as a confounder. Included in descriptive tables only. babysex_f: Bivariate analysis showed no association with breastfeeding (p = 0.5), hence not a confounder. birthwgt_cat: low birthweight is a consequence of preterm birth, not a confounder, and including it may over-adjust).

# Recode variables
ptb_nsfg <- nsfg_subset |>
  filter(gest_lb %in% c(1, 2, 3, 4)) |>   # keeps only valid categories
filter(!is.na(gest_lb)) |>  #removes character values NA
  
    mutate(
    # Outcome: Breastfeeding Initiation (binary) 
    #numeric version (0/1)
    
    # BFEEDWKS: 0=Never, 1=<1wk, 2=1-13wks, 3=14-26wks, etc.
    
    breastfed_init = case_when(
  
      bfeedwks == 0       ~ 0L,  # Never breastfed
      bfeedwks %in% 1:7   ~ 1L,  # Any breastfeeding
    TRUE              ~ NA_integer_  # Handles values outside 0-7
    ), 
    
    #Factor (for tables/ plots)
    breastfed_init_f = factor (
      breastfed_init, 
      levels = c(0, 1),
       labels = c("Never breastfed", "Breastfed")
      ),

    # Exposure: Preterm Birth 
    #numeric version (1-4)
    
    preterm_birth = gest_lb,
    
    #factor version
    preterm_birth_f = factor (
      case_when(
      gest_lb %in% c(1,2) ~ "Preterm (<37 wks)",
      gest_lb == 3        ~ "Term (37-40 wks)",
      gest_lb == 4        ~ "Post-term (≥40 wks)",
      TRUE              ~ NA_character_
    ), levels = c("Term (37-40 wks)",      # <-- reference group
                 "Preterm (<37 wks)",
                 "Post-term (≥40 wks)")
    ),
    
    #demographics
  
    #maternalage- continous 
    
     age_conception = agecon,
    
    #maternal age- categorical
    age_cat = factor(
      case_when(
      agecon < 20               ~ "<20",
      agecon >= 20 & agecon < 25 ~ "20–24",
      agecon >= 25 & agecon < 30 ~ "25–29",
      agecon >= 30 & agecon < 35 ~ "30–34",
      agecon >= 35 & agecon < 40 ~ "35–39",
      agecon >= 40               ~ "40+",
      TRUE                       ~ NA_character_
    ), levels = c("<20", "20–24", "25–29", "30–34", "35–39", "40+")),
    
    #Religion
    religion_f = factor(religion,
levels = c(1,2,3,4),
labels = c("no religion", "catholic", "protestant", "other")),

 # Metro/rural residence
    metro_f = factor(case_when(
      metro == 1 ~ "Principal city (urban)",
      metro == 2 ~ "Other MSA (suburban)",
      metro == 3 ~ "Non-MSA (rural)",
      TRUE       ~ NA_character_
    ), levels = c("Other MSA (suburban)",
                  "Principal city (urban)", 
                  "Non-MSA (rural)")),
# Race/ethnicity
    # HISPRACE2: 1=Hispanic, 2=NH White, 3=NH Black, 4=NH Other
    race_eth = factor(case_when(
      hisprace2 == 1 ~ "Hispanic",
      hisprace2 == 2 ~ "NH White",
      hisprace2 == 3 ~ "NH Black",
      hisprace2 == 4 ~ "NH Other",
      TRUE           ~ NA_character_
    ), levels = c("NH White", "NH Black", "Hispanic", "NH Other")),

    #Baby sex
    babysex_f       = factor(babysex,
                       levels = c(1, 2),
                       labels = c("Male", "Female")),
                       
# Birth weight category
    birthwgt_cat= factor(birthwgt, 
                      levels = c(1,2),
                      labels = c(">=5 1/2 pounds", "<5 1/2 pounds")),
   
#SES status

    # Current insurance status
    insurance_f = factor(curr_ins,
                      levels = c(1, 2, 3, 4),
                      labels = c("Private", "Medicaid/public", "Other govt", "Uninsured")),
  

# Poverty (% of federal poverty level) - Continous 

    poverty_cont = ifelse(poverty > 0 & poverty <9999, poverty, NA_real_),

#Poverty - Categorical (3 groups)

    poverty_cat = factor(
      case_when(
      poverty >= 0 & poverty < 200 ~ "0–<200% FPL",
      poverty >= 200 & poverty < 400 ~ "200–<400% FPL",
      poverty >= 400              ~ "≥400% FPL",
      TRUE                        ~ NA_character_
    ), levels = c("0–<200% FPL", 
                  "200–<400% FPL", "≥400% FPL")),
    
#Education
    educ_cat = factor(
      case_when(
      hieduc %in% 1:2  ~ "Less than high school",
      hieduc %in% 3:4  ~ "High school/GED",
      hieduc %in% 5:7 ~ "Some college/Associate",
      hieduc == 8 ~ "Bachelor's",
      hieduc %in% 9:11 ~ "Masters or higher",
      TRUE              ~ NA_character_
    ), levels = c("High school/GED", "Less than high school",
                  "Some college/Associate", "Bachelor's", "Masters or higher")),
    
  ) |>

# Select recoded variables, apply filters, drop missing, save sample

  select(
    # Identifiers
    bfeedwks,
    
    # Outcome and exposure
    breastfed_init, breastfed_init_f,
    preterm_birth, preterm_birth_f,
    
    # Demographics
    age_conception, age_cat,
    metro_f,
    race_eth,
    religion_f,
    
    # SES
    poverty_cont, poverty_cat,
    educ_cat,
    insurance_f,
    
    #baby characteristics
    birthwgt_cat,
    babysex_f,
    
    #survey weights
    vecl, 
    vest, 
    wgt2022_2023
  )
  
  # Check the recoding
cat("Final dataset dimensions:", nrow(ptb_nsfg), "rows ", ncol(ptb_nsfg), "columns\n")

## Final dataset dimensions: 1618 rows  19 columns

str(ptb_nsfg)

## tibble [1,618 × 19] (S3: tbl_df/tbl/data.frame)
##  $ bfeedwks        : num [1:1618] 3 0 4 3 3 3 5 0 4 0 ...
##  $ breastfed_init  : int [1:1618] 1 0 1 1 1 1 1 0 1 0 ...
##  $ breastfed_init_f: Factor w/ 2 levels "Never breastfed",..: 2 1 2 2 2 2 2 1 2 1 ...
##  $ preterm_birth   : num [1:1618] 3 3 3 3 3 3 3 3 3 3 ...
##  $ preterm_birth_f : Factor w/ 3 levels "Term (37-40 wks)",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ age_conception  : num [1:1618] 37 22 33 35 28 25 29 19 26 23 ...
##  $ age_cat         : Factor w/ 6 levels "<20","20–24",..: 5 2 4 5 3 3 3 1 3 2 ...
##  $ metro_f         : Factor w/ 3 levels "Other MSA (suburban)",..: 1 3 2 2 3 3 3 1 1 1 ...
##  $ race_eth        : Factor w/ 4 levels "NH White","NH Black",..: 1 4 1 1 1 1 1 3 1 1 ...
##  $ religion_f      : Factor w/ 4 levels "no religion",..: 1 1 1 1 4 4 3 3 2 1 ...
##  $ poverty_cont    : num [1:1618] 417 97 451 451 246 246 112 50 292 50 ...
##  $ poverty_cat     : Factor w/ 3 levels "0–<200% FPL",..: 3 1 3 3 2 2 1 1 2 1 ...
##  $ educ_cat        : Factor w/ 5 levels "High school/GED",..: 5 3 5 5 3 3 2 1 4 1 ...
##  $ insurance_f     : Factor w/ 4 levels "Private","Medicaid/public",..: 1 3 1 1 2 2 1 2 1 2 ...
##  $ birthwgt_cat    : Factor w/ 2 levels ">=5 1/2 pounds",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ babysex_f       : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 2 2 1 1 1 ...
##  $ vecl            : num [1:1618] 2 4 2 2 1 1 1 1 3 1 ...
##  $ vest            : num [1:1618] 12 13 11 11 6 6 1 8 3 2 ...
##  $ wgt2022_2023    : num [1:1618] 4797 6738 25958 25958 6923 ...

  svy_design <- svydesign(ids= ~vecl,
              strata = ~vest,
              weights = ~wgt2022_2023, 
              data = ptb_nsfg,
              nest = TRUE  #accounts for clustering
              )

cat("\n=== SURVEY DESIGN SUMMARY ===\n")

## 
## === SURVEY DESIGN SUMMARY ===

print(svy_design)

## Stratified 1 - level Cluster Sampling design (with replacement)
## With (80) clusters.
## svydesign(ids = ~vecl, strata = ~vest, weights = ~wgt2022_2023, 
##     data = ptb_nsfg, nest = TRUE)

cat("Survey design object created.\n")

## Survey design object created.

cat("Degrees of freedom:", nrow(ptb_nsfg) - length(unique(ptb_nsfg[[1]])), "\n")

## Degrees of freedom: 1610

# Display final sample size

cat("Sample size after recoding and filters:", nrow(ptb_nsfg), "pregnancies\n\n")

## Sample size after recoding and filters: 1618 pregnancies

# Save analytic dataset
saveRDS(ptb_nsfg, file = here::here("nsfg_2022_2023.rds"))

tibble(
  Metric = c("Observations", "Variables"),
  Value = c(nrow(ptb_nsfg), ncol(ptb_nsfg))
) |>
  kable(caption = "Final NSFG Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Final NSFG Dataset Dimensions
Metric	Value
Observations	1618
Variables	19

Section 3: Descriptive Statistics (Table 1) ## Descriptive Statistics

The final analytical sample consists of 1618 singleton live births from the NSFG 2022-2023. Table 1 presents the distribution of all key variables and table 2 is the distribution of all key variables stratified by breastfeeding initiation status.

TABLE 1: Covariates stratified by PRETERM STATUS (not outcome) as followed from Check-in 1 feedback.

This is the primary exposure, so this is the correct stratification.

#-Unweighted covariate distribution by preterm birth status. This allows us to see whether potential confounders are distributed differently across gestational age groups.

ptb_nsfg |>
  select(
    preterm_birth_f,
    breastfed_init_f, age_cat, race_eth, metro_f, religion_f,
    educ_cat, poverty_cat, insurance_f, babysex_f, birthwgt_cat
  ) |>
  tbl_summary(
    by = preterm_birth_f, 
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    label = list(
      breastfed_init_f ~ "Breastfeeding initiation",
      age_cat ~ "Maternal age (years)",
      race_eth ~ "Race/ethnicity",
      metro_f ~ "Place of residence",
      religion_f ~ "Religion",
      educ_cat ~ "Education",
      poverty_cat ~ "Income level (% FPL)",
      insurance_f ~ "Insurance type",
      babysex_f ~ "Infant sex",
      birthwgt_cat ~ "Birth weight"
    )
  ) |>
  add_overall() |>
  add_p() |> 
  bold_labels() |>
  modify_caption("Table 1. Unweighted Covariate Distribution by Preterm Birth Status (NSFG 2022–2023, N =1,618)")

Table 1. Unweighted Covariate Distribution by Preterm Birth Status (NSFG 2022–2023, N =1,618)
Characteristic	Overall N = 1,618¹	Term (37-40 wks) N = 1,234¹	Preterm (<37 wks) N = 139¹	Post-term (≥40 wks) N = 245¹	p-value²
Breastfeeding initiation					<0.001
Never breastfed	310 (19%)	250 (20%)	35 (25%)	25 (10%)
Breastfed	1,308 (81%)	984 (80%)	104 (75%)	220 (90%)
Maternal age (years)
<20	91 (5.6%)	71 (5.8%)	5 (3.6%)	15 (6.1%)
20–24	298 (18%)	232 (19%)	26 (19%)	40 (16%)
25–29	458 (28%)	347 (28%)	43 (31%)	68 (28%)
30–34	475 (29%)	355 (29%)	42 (30%)	78 (32%)
35–39	257 (16%)	201 (16%)	19 (14%)	37 (15%)
40+	39 (2.4%)	28 (2.3%)	4 (2.9%)	7 (2.9%)
Race/ethnicity					<0.001
NH White	848 (52%)	644 (52%)	62 (45%)	142 (58%)
NH Black	257 (16%)	204 (17%)	30 (22%)	23 (9.4%)
Hispanic	340 (21%)	266 (22%)	35 (25%)	39 (16%)
NH Other	173 (11%)	120 (9.7%)	12 (8.6%)	41 (17%)
Place of residence					>0.9
Other MSA (suburban)	760 (47%)	581 (47%)	61 (44%)	118 (48%)
Principal city (urban)	584 (36%)	445 (36%)	51 (37%)	88 (36%)
Non-MSA (rural)	274 (17%)	208 (17%)	27 (19%)	39 (16%)
Religion					0.4
no religion	488 (30%)	366 (30%)	42 (30%)	80 (33%)
catholic	299 (18%)	239 (19%)	23 (17%)	37 (15%)
protestant	675 (42%)	516 (42%)	62 (45%)	97 (40%)
other	156 (9.6%)	113 (9.2%)	12 (8.6%)	31 (13%)
Education					<0.001
High school/GED	314 (19%)	248 (20%)	37 (27%)	29 (12%)
Less than high school	133 (8.2%)	103 (8.3%)	16 (12%)	14 (5.7%)
Some college/Associate	463 (29%)	366 (30%)	36 (26%)	61 (25%)
Bachelor's	379 (23%)	287 (23%)	24 (17%)	68 (28%)
Masters or higher	329 (20%)	230 (19%)	26 (19%)	73 (30%)
Income level (% FPL)					0.020
0–<200% FPL	787 (49%)	604 (49%)	80 (58%)	103 (42%)
200–<400% FPL	395 (24%)	299 (24%)	34 (24%)	62 (25%)
≥400% FPL	436 (27%)	331 (27%)	25 (18%)	80 (33%)
Insurance type					<0.001
Private	875 (54%)	664 (54%)	60 (43%)	151 (62%)
Medicaid/public	476 (29%)	368 (30%)	53 (38%)	55 (22%)
Other govt	115 (7.1%)	80 (6.5%)	18 (13%)	17 (6.9%)
Uninsured	152 (9.4%)	122 (9.9%)	8 (5.8%)	22 (9.0%)
Infant sex					0.6
Male	843 (52%)	649 (53%)	75 (54%)	119 (49%)
Female	766 (48%)	580 (47%)	64 (46%)	122 (51%)
Unknown	9	5	0	4
Birth weight					<0.001
>=5 1/2 pounds	1,509 (94%)	1,181 (96%)	90 (65%)	238 (99%)
<5 1/2 pounds	99 (6.2%)	47 (3.8%)	49 (35%)	3 (1.2%)
Unknown	10	6	0	4
¹ n (%)
² Pearson’s Chi-squared test; NA

Start with the survey design object

#weighted Covariate Distribution

svy_design |>
  tbl_svysummary(
    by = preterm_birth_f,
    include = c(
      breastfed_init_f, age_cat, race_eth, metro_f,
      educ_cat, poverty_cat, insurance_f, babysex_f, birthwgt_cat
    ),
    
    statistic = list(all_categorical() ~ "{n_unweighted} ({p}%)"),
    label = list(
      breastfed_init_f ~ "Breastfeeding initiation",
      age_cat          ~ "Maternal age (years)",
      race_eth         ~ "Race/ethnicity",
      metro_f          ~ "Place of residence",
      educ_cat         ~ "Education",
      poverty_cat      ~ "Income (% FPL)",
      insurance_f      ~ "Insurance type",
      babysex_f        ~ "Infant sex",
      birthwgt_cat     ~ "Birth weight"
    )
  ) |>
  add_overall() |>
  add_p() |>
  bold_labels() |>
  bold_p(t = 0.05) |>
  modify_caption(
    "**Table 2. Weighted Covariate Distribution by Gestational Age Category**
    (NSFG 2022–2023; n = unweighted count, % = survey-weighted)"
  ) |>
 
  modify_table_styling(
    columns     = label,
    footnote    = "Unweighted counts with survey-weighted percentages shown. Small cells in the Preterm (<37 wks) column should be interpreted carefully"
  )

**Table 2. Weighted Covariate Distribution by Gestational Age Category** (NSFG 2022–2023; n = unweighted count, % = survey-weighted)
Characteristic¹	Overall N = 20,658,039²	Term (37-40 wks) N = 15,718,277²	Preterm (<37 wks) N = 1,994,624²	Post-term (≥40 wks) N = 2,945,138²	p-value³
Breastfeeding initiation					0.015
Never breastfed	310 (18%)	250 (20%)	35 (19%)	25 (9.7%)
Breastfed	1,308 (82%)	984 (80%)	104 (81%)	220 (90%)
Maternal age (years)					0.9
<20	91 (5.7%)	71 (5.7%)	5 (2.7%)	15 (7.6%)
20–24	298 (20%)	232 (21%)	26 (19%)	40 (17%)
25–29	458 (27%)	347 (27%)	43 (30%)	68 (29%)
30–34	475 (27%)	355 (27%)	42 (30%)	78 (29%)
35–39	257 (17%)	201 (17%)	19 (16%)	37 (16%)
40+	39 (2.6%)	28 (2.8%)	4 (2.3%)	7 (2.0%)
Race/ethnicity					0.037
NH White	848 (54%)	644 (55%)	62 (40%)	142 (59%)
NH Black	257 (14%)	204 (14%)	30 (20%)	23 (7.9%)
Hispanic	340 (22%)	266 (21%)	35 (32%)	39 (18%)
NH Other	173 (10%)	120 (9.8%)	12 (8.2%)	41 (16%)
Place of residence					0.8
Other MSA (suburban)	760 (49%)	581 (49%)	61 (47%)	118 (49%)
Principal city (urban)	584 (31%)	445 (30%)	51 (36%)	88 (32%)
Non-MSA (rural)	274 (20%)	208 (21%)	27 (18%)	39 (19%)
Education					0.14
High school/GED	314 (21%)	248 (22%)	37 (26%)	29 (15%)
Less than high school	133 (9.9%)	103 (10%)	16 (12%)	14 (6.1%)
Some college/Associate	463 (28%)	366 (28%)	36 (25%)	61 (27%)
Bachelor's	379 (21%)	287 (21%)	24 (15%)	68 (24%)
Masters or higher	329 (20%)	230 (18%)	26 (21%)	73 (28%)
Income (% FPL)					0.11
0–<200% FPL	787 (50%)	604 (50%)	80 (61%)	103 (44%)
200–<400% FPL	395 (23%)	299 (24%)	34 (19%)	62 (23%)
≥400% FPL	436 (27%)	331 (27%)	25 (20%)	80 (33%)
Insurance type					0.010
Private	875 (52%)	664 (53%)	60 (36%)	151 (59%)
Medicaid/public	476 (31%)	368 (30%)	53 (46%)	55 (27%)
Other govt	115 (6.8%)	80 (6.4%)	18 (12%)	17 (5.3%)
Uninsured	152 (10%)	122 (11%)	8 (6.1%)	22 (9.1%)
Infant sex					0.9
Male	843 (52%)	649 (52%)	75 (55%)	119 (52%)
Female	766 (48%)	580 (48%)	64 (45%)	122 (48%)
Unknown	99,557	48,343	0	51,214
Birth weight					<0.001
>=5 1/2 pounds	1,509 (93%)	1,181 (96%)	90 (61%)	238 (99%)
<5 1/2 pounds	99 (6.8%)	47 (3.8%)	49 (39%)	3 (1.0%)
Unknown	116,753	65,539	0	51,214
¹ Unweighted counts with survey-weighted percentages shown. Small cells in the Preterm (<37 wks) column should be interpreted carefully
² n (unweighted) (%)
³ Pearson’s X^2: Rao & Scott adjustment

#Descriptive Table by breastfeeding initiation (Complementary table)

svy_design |>
  tbl_svysummary(
    by = breastfed_init_f,
    include = c(
      preterm_birth_f, age_cat, race_eth, metro_f,
      educ_cat, poverty_cat, insurance_f, babysex_f, birthwgt_cat
    ),
    
    statistic = list(all_categorical() ~ "{n_unweighted} ({p}%)"),
    label = list(
      preterm_birth_f ~ "Preterm Birth",
      age_cat          ~ "Maternal age (years)",
      race_eth         ~ "Race/ethnicity",
      metro_f          ~ "Place of residence",
      educ_cat         ~ "Education",
      poverty_cat      ~ "Income (% FPL)",
      insurance_f      ~ "Insurance type",
      babysex_f        ~ "Infant sex",
      birthwgt_cat     ~ "Birth weight"
    )
  ) |>
  add_overall() |>
  add_p() |>
  bold_labels() |>
  bold_p(t = 0.05) |>
  modify_caption(
    "**Table 3. Weighted Covariate Distribution by Breastfeeding Initiation Category**
    (NSFG 2022–2023; n = unweighted count, % = survey-weighted)"
  ) |>
 
  modify_table_styling(
    columns     = label,
    footnote    = "Unweighted counts with survey-weighted percentages shown. Small cells should be interpreted carefully"
  )

**Table 3. Weighted Covariate Distribution by Breastfeeding Initiation Category** (NSFG 2022–2023; n = unweighted count, % = survey-weighted)
Characteristic¹	Overall N = 20,658,039²	Never breastfed N = 3,738,931²	Breastfed N = 16,919,108²	p-value³
Preterm Birth				0.015
Term (37-40 wks)	1,234 (76%)	250 (82%)	984 (75%)
Preterm (<37 wks)	139 (9.7%)	35 (10%)	104 (9.5%)
Post-term (≥40 wks)	245 (14%)	25 (7.6%)	220 (16%)
Maternal age (years)				<0.001
<20	91 (5.7%)	30 (9.6%)	61 (4.8%)
20–24	298 (20%)	96 (37%)	202 (17%)
25–29	458 (27%)	75 (23%)	383 (28%)
30–34	475 (27%)	67 (17%)	408 (30%)
35–39	257 (17%)	38 (12%)	219 (18%)
40+	39 (2.6%)	4 (1.7%)	35 (2.8%)
Race/ethnicity				<0.001
NH White	848 (54%)	131 (45%)	717 (56%)
NH Black	257 (14%)	112 (31%)	145 (9.6%)
Hispanic	340 (22%)	46 (19%)	294 (23%)
NH Other	173 (10%)	21 (4.8%)	152 (12%)
Place of residence				0.085
Other MSA (suburban)	760 (49%)	105 (39%)	655 (51%)
Principal city (urban)	584 (31%)	146 (40%)	438 (29%)
Non-MSA (rural)	274 (20%)	59 (21%)	215 (20%)
Education				<0.001
High school/GED	314 (21%)	122 (42%)	192 (17%)
Less than high school	133 (9.9%)	47 (18%)	86 (8.1%)
Some college/Associate	463 (28%)	99 (29%)	364 (27%)
Bachelor's	379 (21%)	22 (5.5%)	357 (24%)
Masters or higher	329 (20%)	20 (5.1%)	309 (23%)
Income (% FPL)				<0.001
0–<200% FPL	787 (50%)	237 (80%)	550 (43%)
200–<400% FPL	395 (23%)	42 (11%)	353 (26%)
≥400% FPL	436 (27%)	31 (8.8%)	405 (31%)
Insurance type				<0.001
Private	875 (52%)	88 (24%)	787 (58%)
Medicaid/public	476 (31%)	163 (56%)	313 (26%)
Other govt	115 (6.8%)	23 (5.6%)	92 (7.1%)
Uninsured	152 (10%)	36 (14%)	116 (9.0%)
Infant sex				0.6
Male	843 (52%)	157 (51%)	686 (53%)
Female	766 (48%)	152 (49%)	614 (47%)
Unknown	99,557	4,092	95,465
Birth weight				>0.9
>=5 1/2 pounds	1,509 (93%)	280 (93%)	1,229 (93%)
<5 1/2 pounds	99 (6.8%)	28 (6.8%)	71 (6.8%)
Unknown	116,753	21,288	95,465
¹ Unweighted counts with survey-weighted percentages shown. Small cells should be interpreted carefully
² n (unweighted) (%)
³ Pearson’s X^2: Rao & Scott adjustment

This table compares key maternal and birth characteristics between those who initiated breastfeeding (N = 1,308) and those who did not (N = 310). The overall breastfeeding initiation rate in the sample is high (81%), but several variables show statistically significant differences between groups, suggesting potential confounding in the relationship between gestational age and breastfeeding initiation.

#SECTION 4: Exploratory Data Analysis

Exploratory data analysis was conducted to understand the distribution of the outcome variable, examine the unadjusted relationship between preterm birth and breastfeeding initiation, and explore additional covariates that may confound or modify this association.

Before running any model, always visualize outcome distribution.

ggplot(ptb_nsfg, aes(x = factor(breastfed_init_f), fill = factor(breastfed_init_f))) +
  geom_bar(stat = "count", alpha = 0.7) +
  geom_text(stat = "count", aes(label = paste0("n=", after_stat (count))), vjust = -0.5, size = 4) +
  scale_x_discrete(labels = c("Never breastfed", "Breastfed")) +
  scale_fill_manual(values = c("#E74C3C", "#27AE60")) +
  labs(
    title = "Figure 1. Distribution of Breastfeeding Initiation",
    subtitle = "NSFG 2022–2023 (N = 1,618)",
    x = NULL, y = "Count", fill = NULL
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

A bar plot shown below of breastfeeding initiation shows a highly imbalanced distribution: 1,308 individuals-initiated breastfeeding, while 310 did not. This reflects a high overall prevalence of breastfeeding initiation in the sample.

#4.1 Distribution of the Outcome Variable by PTB

ptb_nsfg |> 
    filter(!is.na(preterm_birth_f), !is.na(breastfed_init_f)) |>
    group_by(preterm_birth_f, breastfed_init_f) |> 
    summarise(n = n(), .groups = "drop") |>
    group_by(preterm_birth_f) |>
  mutate(pct = n / sum(n) * 100) |>
  ungroup() |>
    
 ggplot(aes(x = preterm_birth_f, y = pct, fill = breastfed_init_f) ) +
  geom_bar(stat = "identity", alpha = 0.8, position = "dodge") +
  geom_text(aes(label = paste0(round(pct, 1), "%")), vjust = -0.5, position = position_dodge(width = 0.9), size = 4) +
  scale_fill_manual(values = c("Never breastfed"= "#E74C3C", "Breastfed"= "#27AE60")) +
  labs(
    title = "Figure 2: Breastfeeding Initiation by Gestational Age Category",
    subtitle = "NSFG 2022- 2023",
    x = "Gestational Age Category",
    y       = "Percentage (%)",
    fill = "Breastfeeding Status"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

#distribution of breastfeeding by race/ethnicity

ptb_nsfg |> 
    filter(!is.na(race_eth), !is.na(breastfed_init_f)) |>
    group_by(race_eth, breastfed_init_f) |> 
    summarise(n = n(), .groups = "drop") |>
    group_by(race_eth) |>
  mutate(pct = n / sum(n) * 100) |>
  ungroup() |>
    
 ggplot(aes(x = race_eth, y = pct, fill = breastfed_init_f) ) +
  geom_bar(stat = "identity", alpha = 0.8, position = "dodge") +
  geom_text(aes(label = paste0(round(pct, 1), "%")), vjust = -0.5, position = position_dodge(width = 0.9), size = 4) +
  scale_fill_manual(values = c("Never breastfed"= "#E74C3C", "Breastfed"= "#27AE60")) +
  labs(
    title = "Figure 3: Breastfeeding Initiation by Race Ethnicity Category",
    subtitle = "NSFG 2022- 2023",
    x = "Race Ethnicity Category",
    y       = "Percentage (%)",
    fill = "Breastfeeding Status"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

#Breastfeeding initiation varies by race/ethnicity. Non‑Hispanic White and Hispanic mothers showed the highest initiation, with far more breastfeeding than non‑breastfeeding in both groups. In contrast, Non‑Hispanic Black mothers had the lowest initiation (43.6%). Mothers categorized as Non‑Hispanic Other also demonstrated high breastfeeding initiation.

Regression Equation: Models Fit

Model 1 Unadjusted: breastfeeding initiation ~ preterm birth only

Breastfeeding initiation (yes/no) was regressed on preterm birth status using a survey‑weighted simple logistic regression model. Regression equation (Model 1):

logit[P(breastfeeding initiation = 1)] = β₀ + β₁(preterm birth)

Model 2 Adjusted: breastfeeding initiation ~ preterm birth + all covariates

#Model 1 Simple Logistic Regression

# Fit simple logistic regression

model_crude <- svyglm(
  breastfed_init_f ~ preterm_birth_f,
  family = quasibinomial(link = "logit"),
  design = svy_design
)

# Display results with odds ratios
tidy(model_crude, exponentiate = TRUE, conf.int = TRUE) |>
  kable(caption = "Table 2. Weighted Unadjusted Simple Logistic Regression: Preterm Birth and Breastfeeding Initiation",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) |>
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Table 2. Weighted Unadjusted Simple Logistic Regression: Preterm Birth and Breastfeeding Initiation
Term	Odds Ratio	Std. Error	z-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	4.121	0.154	9.207	0.000	3.029	5.606
preterm_birth_fPreterm (<37 wks)	1.017	0.241	0.070	0.945	0.628	1.646
preterm_birth_fPost-term (≥40 wks)	2.265	0.298	2.746	0.008	1.248	4.111

# Publication-ready table using gtsummary
model_crude |>
  tbl_regression(
    exponentiate = TRUE,          # Convert log-odds to Odds Ratios
    label = list(preterm_birth_f ~ "Gestational age category")
  ) |>
  bold_labels() |>
  bold_p(t = 0.05) |>
  modify_caption("**Table 4. Unadjusted Weighted Logistic Regression**
  Outcome: Breastfeeding Initiation")

**Table 4. Unadjusted Weighted Logistic Regression** Outcome: Breastfeeding Initiation
Characteristic	OR	95% CI	p-value
Gestational age category
Term (37-40 wks)	—	—
Preterm (<37 wks)	1.02	0.63, 1.65	>0.9
Post-term (≥40 wks)	2.27	1.25, 4.11	0.008
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

#This is a simple logistic regression model examining the association between: Outcome: - Breastfeeding initiation (Yes/No) Exposure: - Preterm birth category - Term (37–40 weeks) → reference group - Preterm (<37 weeks) - Post‑term (≥40 weeks)

Logistic regression is used because the outcome is binary, and the goal is to estimate how the odds of breastfeeding initiation differ across gestational‑age categories.

A simple (unadjusted) logistic regression was fit to estimate the crude association between gestational age category and breastfeeding initiation, with term birth (37–40 weeks) as the reference group. This unadjusted model establishes the baseline association before accounting for potential confounders.The model produces odds ratios (ORs), which quantify how the odds of breastfeeding change relative to the reference group.

Interpretation: In the survey‑weighted unadjusted logistic regression model, the intercept (OR = 4.12) represents the odds of breastfeeding initiation among the reference group—term infants (37–40 weeks). Compared with term births, preterm infants (<37 weeks) had lower odds of breastfeeding initiation (OR = 1.02, 95% CI: 0.63–1.65), although this association was not statistically significant. The wide confidence interval likely reflects the relatively small number of preterm births in the weighted sample, which limits statistical power. In contrast, post‑term infants (≥40 weeks) had significantly higher odds of breastfeeding initiation (OR = 2.27, 95% CI: 1.25–4.11). Because this model is unadjusted, these associations may be influenced by confounding factors such as maternal age, race/ethnicity, education, and socioeconomic status.

Model 2: Adjusted Logistic Regression

Regression equation (Model 2):

logit[P(breastfeeding initiation = 1)] = β₀ + β₁(preterm birth) + β₂(maternal age) + β₃(race/ethnicity) + β₄(metro) + β₅(insurance) + β₆(education) + β₇(poverty)

Breastfeeding initiation was regressed on preterm birth status, adjusting for maternal age, race/ethnicity, metropolitan residence, insurance type, educational attainment, and household income relative to the federal poverty level. All models incorporated NSFG survey weights.

# Adjusted Multiple logistic regression with potential confounders

model_adj <- svyglm(breastfed_init_f ~ preterm_birth_f + age_cat + race_eth + metro_f + insurance_f + educ_cat + poverty_cat,
  family = quasibinomial(link = "logit"),
  design = svy_design
)

# Display results
tidy(model_adj, exponentiate = TRUE, conf.int = TRUE) |>
  kable(caption = "Weighted Adjusted Multiple Logistic Regression: Breastfeeding Initiation ~ preterm birth + other coviariates (Odds Ratios)",
        digits = 3,
        col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) |>
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE) |>
  scroll_box(height = "400px")

Weighted Adjusted Multiple Logistic Regression: Breastfeeding Initiation ~ preterm birth + other coviariates (Odds Ratios)
Term	Odds Ratio	Std. Error	z-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	2.229	0.477	1.680	0.101	0.849	5.855
preterm_birth_fPreterm (<37 wks)	1.139	0.262	0.496	0.623	0.671	1.933
preterm_birth_fPost-term (≥40 wks)	1.933	0.312	2.110	0.041	1.028	3.636
age_cat20–24	0.614	0.375	-1.299	0.201	0.288	1.311
age_cat25–29	1.105	0.364	0.274	0.786	0.529	2.306
age_cat30–34	0.994	0.436	-0.013	0.989	0.412	2.400
age_cat35–39	0.968	0.439	-0.075	0.941	0.398	2.352
age_cat40+	1.413	0.621	0.557	0.581	0.402	4.963
race_ethNH Black	0.490	0.274	-2.602	0.013	0.281	0.853
race_ethHispanic	1.792	0.314	1.860	0.070	0.950	3.381
race_ethNH Other	1.691	0.425	1.237	0.224	0.716	3.991
metro_fPrincipal city (urban)	0.860	0.205	-0.738	0.465	0.569	1.301
metro_fNon-MSA (rural)	1.213	0.295	0.654	0.517	0.668	2.204
insurance_fMedicaid/public	0.735	0.235	-1.308	0.198	0.457	1.183
insurance_fOther govt	1.332	0.355	0.809	0.424	0.650	2.730
insurance_fUninsured	0.635	0.378	-1.202	0.237	0.295	1.364
educ_catLess than high school	0.980	0.336	-0.059	0.953	0.496	1.936
educ_catSome college/Associate	1.815	0.252	2.361	0.023	1.089	3.023
educ_catBachelor’s	5.509	0.352	4.846	0.000	2.703	11.230
educ_catMasters or higher	4.702	0.428	3.613	0.001	1.977	11.183
poverty_cat200–<400% FPL	2.187	0.249	3.141	0.003	1.321	3.621
poverty_cat≥400% FPL	1.777	0.288	1.996	0.053	0.992	3.183

#Publication-ready adjusted model table 

model_adj |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      preterm_birth_f ~ "Gestational age (ref: Term)",
      age_cat         ~ "Maternal age (ref: <20)",
      race_eth        ~ "Race/ethnicity (ref: NH White)",
      metro_f         ~ "Residence (ref: Suburban)",
      insurance_f     ~ "Insurance (ref: Private)",
      educ_cat        ~ "Education (ref: HS/GED)",
      poverty_cat     ~ "Income % FPL (ref: 0-<200%)"
    )
  ) |>
  bold_labels() |>
  bold_p(t = 0.05) |>
  modify_caption(
    "**Table 5. Adjusted Weighted Logistic Regression**
    Outcome: Breastfeeding Initiation (NSFG 2022–2023, N = 1,618)"
  ) |>
  add_glance_source_note(include = c(nobs, AIC))

**Table 5. Adjusted Weighted Logistic Regression** Outcome: Breastfeeding Initiation (NSFG 2022–2023, N = 1,618)
Characteristic	OR	95% CI	p-value
Gestational age (ref: Term)
Term (37-40 wks)	—	—
Preterm (<37 wks)	1.14	0.67, 1.93	0.6
Post-term (≥40 wks)	1.93	1.03, 3.64	0.041
Maternal age (ref: <20)
<20	—	—
20–24	0.61	0.29, 1.31	0.2
25–29	1.10	0.53, 2.31	0.8
30–34	0.99	0.41, 2.40	>0.9
35–39	0.97	0.40, 2.35	>0.9
40+	1.41	0.40, 4.96	0.6
Race/ethnicity (ref: NH White)
NH White	—	—
NH Black	0.49	0.28, 0.85	0.013
Hispanic	1.79	0.95, 3.38	0.070
NH Other	1.69	0.72, 3.99	0.2
Residence (ref: Suburban)
Other MSA (suburban)	—	—
Principal city (urban)	0.86	0.57, 1.30	0.5
Non-MSA (rural)	1.21	0.67, 2.20	0.5
Insurance (ref: Private)
Private	—	—
Medicaid/public	0.74	0.46, 1.18	0.2
Other govt	1.33	0.65, 2.73	0.4
Uninsured	0.63	0.30, 1.36	0.2
Education (ref: HS/GED)
High school/GED	—	—
Less than high school	0.98	0.50, 1.94	>0.9
Some college/Associate	1.81	1.09, 3.02	0.023
Bachelor's	5.51	2.70, 11.2	<0.001
Masters or higher	4.70	1.98, 11.2	<0.001
Income % FPL (ref: 0-<200%)
0–<200% FPL	—	—
200–<400% FPL	2.19	1.32, 3.62	0.003
≥400% FPL	1.78	0.99, 3.18	0.053
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
No. Obs. = 1,618; AIC = 1,307

Interpretation:

Race/ethnicity demonstrated one of the strongest disparities. Non‑Hispanic Black mothers had significantly lower odds of breastfeeding initiation compared with Non‑Hispanic White mothers (aOR = 0.49, 95% CI: 0.28–0.85). Hispanic mothers showed a positive but borderline association (aOR = 1.79, p = 0.070), while the NH Other group did not differ significantly.

Place of residence (urban or rural vs. suburban) was not associated with breastfeeding initiation after adjustment. Similarly, insurance type showed no independent effect; Medicaid‑insured and uninsured mothers had lower odds than privately insured mothers, but these differences were not statistically significant.

Educational attainment showed the strongest socioeconomic gradient. Mothers with a bachelor’s degree (aOR = 5.62, 95% CI: 2.30–13.7) or a graduate degree (aOR = 4.80, 95% CI: 1.67–13.8) had dramatically higher odds of breastfeeding initiation compared to high school graduates. Income also played an important role: mothers in the 200–<400% FPL category had significantly higher odds of breastfeeding initiation (aOR = 2.19, 95% CI: 1.32–3.62), while the highest‑income group showed a borderline association.

# Checking for Confounding Assessment A meaningful confounder changes the OR by ≥10% (the “10% change-in-estimate in epi). We compare the crude and adjusted ORs for the primary exposure (preterm birth) to quantify confounding.
## Crude vs. Adjusted Comparison
``` r # Extract ORs for preterm birth from both models crude_or <- tidy(model_crude, exponentiate = TRUE, conf.int = TRUE) \|> filter(str_detect(term, “Preterm”)) \|> mutate(Model = “Unadjusted (Crude)”)
adjusted_or <- tidy(model_adj, exponentiate = TRUE, conf.int = TRUE) \|> filter(str_detect(term, “Preterm”)) \|> mutate(Model = “Adjusted”)
# Calculate % change in OR (confounding magnitude) OR_comparison <- bind_rows(crude_or, adjusted_or) \|> select(Model, term, estimate, conf.low, conf.high, p.value) \|> mutate( term = str_replace(term, “preterm_birth_f”, ““), across(c(estimate, conf.low, conf.high), ~round(., 3)), p.value = round(p.value, 3) )
OR_comparison \|> kable( caption = “Table 6. Confounding Assessment: Crude vs. Adjusted OR for Preterm Birth”, col.names = c(“Model”, “Comparison”, “OR”, “95% CI Low”, “95% CI High”, “p-value”) ) \|> kable_styling(bootstrap_options = c(“striped”, “hover”), full_width = FALSE) ```

#Confounding Assessment for Preterm (<37 wks): Crude OR: 1.017 Adjusted OR: 1.139 % Change: 12 %
#Interpretation:There is meaningful confounding. The adjusted OR differed only slightly from the crude estimate (1.02 to 1.14), which means there is some confounding and it suggest that preterm birth is not independently associated with breastfeeding initiation.
## Crude vs. Adjusted Comparison for all key covariates
``` r # Extract all crude ORs crude_all <- tidy(model_crude, exponentiate = TRUE, conf.int = TRUE) \|> filter(term != “(Intercept)”) \|> select(term, estimate, conf.low, conf.high, p.value) \|> rename( OR_crude = estimate, Lower_CI_crude = conf.low, Upper_CI_crude = conf.high, p_crude = p.value )
# Extract all adjusted ORs adjusted_all <- tidy(model_adj, exponentiate = TRUE, conf.int = TRUE) \|> filter(term != “(Intercept)”) \|> select(term, estimate, conf.low, conf.high, p.value) \|> rename( OR_adj = estimate, Lower_CI_adj = conf.low, Upper_CI_adj = conf.high, p_adj = p.value )
# Combine side-by-side combined_table <- full_join( crude_all, adjusted_all, by = “term” )
# Format and print combined_table \|> mutate(across(where(is.numeric), ~round(., 3))) \|> kable( caption = “Table 7: Crude and Adjusted Odds Ratios for All Predictors”, col.names = c( “Predictor”, “Crude OR”, “Crude LCL”, “Crude UCL”, “Crude p”, “Adjusted OR”, “Adj LCL”, “Adj UCL”, “Adj p” ) ) \|> kable_styling(bootstrap_options = c(“striped”, “hover”), full_width = FALSE) ```

#Interpretation Crude analyses showed no association between preterm birth and breastfeeding initiation, while post‑term birth was associated with higher odds of initiation. After adjustment, the association for post‑term birth remained significant. Several sociodemographic factors were strongly associated with breastfeeding initiation in the adjusted model, including higher maternal education and higher household income. NH Black mothers had significantly lower odds of breastfeeding initiation compared with NH White mothers. Other predictors, including maternal age, metro status, and insurance type, were not significantly associated with breastfeeding initiation. Crude estimates were available only for the primary exposure, as other predictors were not included in the unadjusted model.
#Model Diagnostics
For logistic regression, the diagnostic goals are: 1. Calibration: Do predicted probabilities match observed event rates? (Hosmer-Lemeshow + calibration plot) 2. Discrimination: Can the model separate breastfeeders from non-breastfeeders? (ROC/AUC) 3. Influential observations: Are specific data points driving the results? (Cook’s distance) 4. Multicollinearity: Are predictors too correlated for stable estimation? (VIF) 5. Linearity in the logit: Does age relate linearly to log-odds? (Loess check)
##Discrimination — ROC Curve and AUC
While calibration assesses how well predicted probabilities match observed rates, discrimination assesses how well the model separates events from non-events.
The ROC curve plots sensitivity (true positive rate) against 1 − specificity (false positive rate) across all possible probability cutoffs.
The AUC (area under the ROC curve) summarizes discrimination:
\| AUC \| Discrimination \|
\| 0.5 \| No discrimination (chance) \| \| 0.6-0.7 \| Poor \| \| 0.7-0.8 \| Acceptable \| \| 0.8-0.9 \| Excellent \| \| > 0.9 \| Outstanding \|
``` r roc_obj <- roc( response = ptb_nsfg$breastfed_init, # Numeric 0/1 predictor = fitted(model_adj), # Predicted probabilities levels = c(0, 1), # 0 = non-event, 1 = event direction = “<” # Higher predicted value → more likely event )
auc_val <- auc(roc_obj)
ggroc(roc_obj, color = “steelblue”, linewidth = 1.2) + geom_abline(slope = 1, intercept = 1, linetype = “dashed”, color = “red”) + labs(title = “Figure 4: ROC Curve- Model Discrimination for Breastfeeding Initiation Model”, subtitle = paste0(“AUC =”, round(auc_val, 3)), x = “Specificity”, y = “Sensitivity”) + theme_minimal(base_size = 12) ```

Interpretation: The curve is above the diagnol and AUC = 0.788, The adjusted model shows acceptable discrimination (AUC = 0.788), and that predicted probabilities meaningfully differentiate between mothers who initiated breastfeeding and those who did not.
## Deviance
The deviance of a logistic model is:
\[D = -2 \ln \hat{L}\]
It is analogous to the residual sum of squares in linear regression: smaller is better. By itself, the deviance is hard to interpret, but differences in deviance between nested models follow a $\chi^2$ distribution and form the basis of the LR test.
`r glance(model_adj) \|> dplyr::select(null.deviance, df.null, deviance, df.residual, AIC, BIC) \|> kable(digits = 1, caption = "Model Fit Statistics") \|> kable_styling(bootstrap_options = "striped", full_width = FALSE)`

Quick check: The difference between `null.deviance` and `deviance` represents the improvement from adding all predictors to an intercept-only model. We can test this with an LR test on `df.null - df.residual` degrees of freedom.
##Pseudo-R² (McFadden)
There is no exact analog of $R^2$ for logistic regression, but several “pseudo-R²” measures exist. The most common is McFadden’s R²:
\[R^2_{\text{McFadden}} = 1 - \frac{\ln \hat{L}_{\text{full}}}{\ln \hat{L}_{\text{null}}}\] Values between 0.2 and 0.4 are considered excellent fit.
`r performance::r2_mcfadden(model_adj)`
`## # R2 for Generalized Linear Regression ## R2: 0.190 ## adj. R2: 0.189`
Interpretation: McFadden’s R² should not be interpreted on the same scale as linear regression R². Values are typically much smaller (e.g., 0.1 may indicate a reasonable fit).
A McFadden R² of 0.19 indicates that the model provides acceptable (0.1 - 0.2) explanatory power for a behavioral outcome like breastfeeding initiation, which is influenced by many unmeasured social and cultural factors.
#Null deviance = 1530.28 This is the deviance of a model with no predictors — only the intercept.
#Model deviance = 1239.44 This is the deviance after adding all predictors.
#Interpretation A substantial drop in deviance (from 1530 to 1239) means the predictors explain meaningful variation in breastfeeding initiation.
##Hosmer-Lemeshow Goodness-of-Fit Test
NOTE: hoslem.test() does NOT support svyglm objects directly. So, we use the FITTED VALUES from the weighted model but pass the raw binary outcome (0/1 numeric). This is the standard approach for survey-weighted logistic regression diagnostics.
``` r # HOSMER-LEMESHOW TEST
hl_test <- hoslem.test( x = ptb_nsfg$breastfed_init, # Numeric 0/1 outcome y = fitted(model_adj), # Fitted probabilities from weighted model g = 10 # Deciles of predicted probability )
# Display results hl_test ```
`## ## Hosmer and Lemeshow goodness of fit (GOF) test ## ## data: ptb_nsfg$breastfed_init, fitted(model_adj) ## X-squared = 18.868, df = 8, p-value = 0.01558`
Interpretation: Since p value is 0.016 which is <0.05, it means some evidence of misfit in some probability regions. A small p-value (< 0.05) suggests that the model does not fit well in some regions of predicted probability. With large samples (like ours), the Hosmer-Lemeshow test can be over-powered and detect trivial misfit. Always pair it with a calibration plot.
## Calibration Plot
r ptb_nsfg \|> mutate( pred_prob = fitted(model_adj), # Weighted predicted probabilities obs_outcome = breastfed_init, # Numeric 0/1 outcome decile = ntile(pred_prob, 10) # Divide into 10 risk groups ) \|> group_by(decile) \|> summarise( mean_pred = mean(pred_prob), # Mean predicted probability within decile mean_obs = mean(obs_outcome), # Observed proportion within decile n = n(), .groups = "drop" ) \|> ggplot(aes(x = mean_pred, y = mean_obs)) + geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed", linewidth = 1) + # Perfect calibration line geom_point(aes(size = n), color = "#2E86AB", alpha = 0.85) + geom_line(color = "#2E86AB", linewidth = 0.8) + scale_size_continuous(name = "n in decile", range = c(3, 8)) + scale_x_continuous(labels = scales::percent_format(), limits = c(0.5, 1)) + scale_y_continuous(labels = scales::percent_format(), limits = c(0.5, 1)) + labs( title = "Figure 3. Calibration Plot: Observed vs. Predicted Probability", subtitle = "Points should fall on the dashed line for perfect calibration", x = "Mean Predicted Probability (within decile)", y = "Observed Proportion Breastfed (within decile)", caption = "Adjusted logistic regression model; NSFG 2022–2023" ) + theme_minimal(base_size = 12) + theme(legend.position = "right")

The Hosmer–Lemeshow test indicated minor deviations between observed and predicted probabilities (p = 0.016); however, the calibration plot demonstrated good overall agreement, with only small departures from the ideal line, suggesting that the model is adequately calibrated for epidemiologic interpretation.
#Diagnostics for Logistic Regression
## Linearity in the Logit (for continuous predictors)
For continuous predictors, logistic regression assumes a linear relationship between the predictor and the log-odds. We can check this with a smoothed plot of the logit against the predictor.
`r ptb_nsfg \|> mutate(logit_pred = predict(model_adj, type = "link")) \|> ggplot(aes(x = age_conception, y = logit_pred)) + geom_point(alpha = 0.2, color = "steelblue") + geom_smooth(method = "loess", se = FALSE, color = "darkorange") + labs(title = "Figure 5. Linearity in the Logit: Age", x = "Maternal Age (years)", y = "Predicted Log-Odds (logit)") + theme_minimal()`

Interpretation: A roughly linear loess curve supports the linearity assumption. A clearly curved pattern suggests we should add a quadratic term or use a spline. The LOESS line shows an approximately linear upward trend, indicating that maternal age satisfies the linearity‑in‑the‑logit assumption. This means maternal age can be modeled as a continuous predictor without transformation.

Cook’s Distance (Influential Observations)

Cook’s D measures how much the model coefficients would change if observation i were deleted. Threshold: 4/n flags potentially influential points. # A point with Cook’s D > 1 is almost certainly influential.

ptb_nsfg |>
  mutate(cooks_d = cooks.distance(model_adj),
         row_id = row_number()) |>
  ggplot(aes(x = row_id, y = cooks_d)) +
  geom_point(alpha = 0.4, color = "steelblue") +
  geom_hline(yintercept = 4 / nrow(model_adj$model),
             linetype = "dashed", color = "red") +
  labs(title = "Figure 6: Cook's Distance for Logistic Regression Model",
       subtitle = "Red line: 4/n threshold",
       x = "Observation Index", y = "Cook's Distance") +
  theme_minimal()

#Interpretation Scatter plot with red threshold line shows that Cook’s Distance values were uniformly low, with several observations exceeding the conservative 4/n threshold but none approaching 1. It means no individual cases were undue influence on the adjusted model estimates.

Multicollinearity Check (VIF)

vif(model_adj) |>
  as.data.frame() |>
  rownames_to_column("Predictor") |>
  kable(digits = 2, caption = "Variance Inflation Factors") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Variance Inflation Factors
Predictor	GVIF	Df	GVIF^(1/(2*Df))
preterm_birth_f	6.45	2	1.59
age_cat	20.22	5	1.35
race_eth	10.12	3	1.47
metro_f	5.93	2	1.56
insurance_f	8.80	3	1.44
educ_cat	32.29	4	1.54
poverty_cat	5.31	2	1.52

Rule of thumb: VIF > 5 (or 10) indicates problematic multicollinearity. Variance inflation factors were low (GVIF^(1/(2·df)) ≈ 1.3–1.6 across predictors), indicating no evidence of problematic multicollinearity in the adjusted model.

Section 5: Regression Visualizations

#Reporting Logistic Regression Results

A publication-quality logistic regression table should include:

Sample size and number of events
Adjusted odds ratios with 95% CIs
p-values for each coefficient
Model fit statistics (AIC or pseudo-R²)
Discrimination metric (AUC)

Predicted Probabilities (BEFORE adjusting for confounders)

# Predicted probabilities using ggeffects
pp <- predict_response(model_crude, terms = "preterm_birth_f")
plot(pp) +
  labs(
    title = "Figure 7: Predicted Probability of Breastfeeding Initiation by Gestational Age",
    subtitle = "Unadjusted model — NSFG 2022–2023",
    x = "Gestational Age Category",
    y = "Predicted Probability of Breastfeeding Initiation"
  ) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

$Predicted Breastfeeding Initiation Probability by preterm birth$

Predicted Breastfeeding Initiation Probability by preterm birth

Interpretation:

This plot displays the model‑estimated probability of breastfeeding initiation for each gestational‑age category, with 95% confidence intervals. The pattern is clear: the probability of breastfeeding initiation increases steadily with gestational maturity from preterm (~70%) to postterm (~86%) births. The widening confidence intervals for earlier gestational ages shows sample‑size limitations and greater variability.

#PREDICTED PROBABILITY PLOT (AFTER) # ggpredict() from ggeffects computes ADJUSTED predicted probabilities, holding all other covariates at their reference/mean values.This is the gold-standard visualization for a categorical predictor with a binary outcome.

## Predicted Probability of Breastfeeding Initiation by Gestational Age (Adjusted)

pp_adj <- ggpredict(model_adj, terms = "preterm_birth_f")

# Custom ggplot2 version for poster-quality output
pp_adj |>
  as_tibble() |>
  ggplot(aes(x = x, y = predicted, color = x, fill = x)) +
  geom_col(alpha = 0.25, width = 0.5) +
  geom_point(size = 5) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
                width = 0.15, linewidth = 1.2) +
  geom_text(aes(label = paste0(round(predicted * 100, 1), "%"),
                y = predicted + 0.04),
            size = 4.5, fontface = "bold") +
  scale_y_continuous(labels = scales::percent_format(),
                     limits = c(0.5, 1.0)) +
  scale_color_manual(values = c("#E67E22", "#E74C3C", "#27AE60")) +
  scale_fill_manual(values  = c("#E67E22", "#E74C3C", "#27AE60")) +
  labs(
    title    = "Figure 8. Adjusted Predicted Probability of Breastfeeding Initiation",
    subtitle = "By gestational age category, holding all covariates at reference values",
    x        = "Gestational Age Category",
    y        = "Predicted Probability of Breastfeeding Initiation",
    caption  = "Error bars = 95% confidence intervals. Adjusted for maternal age, race/ethnicity, metro status, insurance, education, and poverty."
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 15, hjust = 1))

Interpretation: Term (37-40 weeks) = 68.6% Preterm (<37 weeks) = 71.3% Post-term (>= 40 weeks) = 80.9%

Crude predicted probabilities showed lower breastfeeding initiation among preterm infants compared with term and post‑term infants. However, after adjusting for maternal age, race/ethnicity, metro status, insurance, education, and poverty, the predicted probability for preterm infants was slightly higher than for term infants. This suggests that sociodemographic factors confound the crude association between gestational age and breastfeeding initiation.

Figure 8: Forest Plot of Adjusted Odds Ratios (Full Model)

forest_data <- tidy(model_adj, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    # Clean term names for the y-axis labels
    
    term_clean = case_when(
      term == "preterm_birth_fPreterm (<37 wks)"~ "Preterm (<37 wks) vs. Term",
      term == "preterm_birth_fPost-term (≥40 wks)"~ "Post-term (>=40 wks) vs. Term",
      term == "age_cat20–24"  ~ "Age 20-24 vs. <20",
      term == "age_cat25–29"  ~ "Age 25-29 vs. <20",
      term == "age_cat30–34" ~ "Age 30-34 vs. <20",
      term == "age_cat35–39" ~ "Age 35-39 vs. <20",
      term == "age_cat40+"  ~ "Age 40+ vs. <20",
      term == "race_ethHispanic" ~ "Hispanic vs. NH White",
      term == "race_ethNH Black" ~ "NH Black vs. NH White",
      term == "race_ethNH Other" ~ "NH Other vs. NH White",
      term == "metro_fPrincipal city (urban)" ~ "Urban vs. Suburban",
      term == "metro_fNon-MSA (rural)" ~ "Rural vs. Suburban",
      term == "insurance_fMedicaid/public"~ "Medicaid vs. Private",
      term == "insurance_fOther govt"  ~ "Other govt vs. Private",
      term == "insurance_fUninsured" ~ "Uninsured vs. Private",
      term == "educ_catLess than high school"  ~ "Less than HS vs. HS/GED",
      term == "educ_catSome college/Associate" ~ "Some college vs. HS/GED",
      term == "educ_catBachelor's" ~ "Bachelor's vs. HS/GED",
      term == "educ_catMasters or higher" ~"Master's + vs. HS/GED",
      term == "poverty_cat200–<400% FPL"~ "200-400% FPL vs. <200%",
      term == "poverty_cat≥400% FPL" ~ ">=400% FPL vs. <200%",
      TRUE ~ term
    ),
   
    # Flag significant terms
    significant = p.value < 0.05
  )

ggplot(forest_data, aes(x = estimate, y = reorder(term_clean, estimate),
                        color = group, shape = significant)) +
  
  #reference line
  geom_vline(xintercept = 1, linetype = "dashed",
             color = "grey40", linewidth = 0.9) +
  
  #points and CIs
  geom_point(size = 3.2, color = "black") +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high),
                 width = 0.22, linewidth = 0.9, color = "black") +
  
  #log scale
  scale_x_log10(
    breaks = c(0.1, 0.25, 0.5, 1, 2, 4, 8),
    labels = c("0.1", "0.25", "0.5", "1", "2", "4", "8")
  ) +
  
  #filled is significant and open is not significant
  scale_shape_manual(values = c("FALSE" = 1, "TRUE" = 19),
                     labels = c("Not significant", "Significant (p < 0.05)"),
                     name   = NULL) +
  labs(
    title    = "Figure 9. Forest Plot of Adjusted Odds Ratios",
    subtitle = "Outcome: Breastfeeding Initiation | Adjusted weighted logistic regression | NSFG 2022–2023",
    x        = "Adjusted Odds Ratio (log scale)",
    y        = NULL,
    caption  = "Filled circles = p < 0.05; Open circles = p ≥ 0.05. Reference line at OR = 1.0."
  ) +
  theme_minimal(base_size = 11) +
  theme(
    legend.position = "right",
    panel.grid.major.y = element_blank(),
    axis.text.y = element_text(size = 9)
  )

Interpretation:

This forest plot of the adjusted model shows several strong and statistically significant predictors of breastfeeding initiation. 1. Higher maternal education is one of the most consistent facilitators: mothers with a bachelor’s degree or a master’s degree or higher have substantially higher odds of initiating breastfeeding compared with those with a high school graduates. 2. Household income also shows a positive gradient, with families at ≥400% of the federal poverty level demonstrating higher odds of breastfeeding initiation than those below 200% FPL. 3. Race and ethnicity reveal notable disparities. Hispanic mothers have significantly higher odds of initiating breastfeeding compared with non‑Hispanic White mothers, whereas non‑Hispanic Black mothers have significantly lower odds. 4. Gestational age shows a modest association. Post‑term births (≥40 weeks) have higher odds of breastfeeding initiation compared with term births, while preterm births (<37 weeks) do not differ significantly after adjustment.

Section 5: Interaction Testing

Preterm Birth × Race/Ethnicity Interaction

# Step 1: Fit model WITH interaction term (preterm × race)
# Step 2: Fit model WITHOUT interaction (main effects only)
# Step 3: LR test comparing the two — tests joint significance of interaction terms
# Step 4: If significant, report stratum-specific ORs; if not, report main effects

# Model WITH interaction
model_interaction <- svyglm(
  breastfed_init_f ~ preterm_birth_f * race_eth +
    age_cat + metro_f + insurance_f + educ_cat + poverty_cat,
  family = quasibinomial(link = "logit"),
  design = svy_design
)

# Model WITHOUT interaction (main effects)

int_pvals <- tidy(model_interaction) |>
  filter(str_detect(term, "preterm_birth_f.*race_eth|:")) |>
  select(term, estimate, std.error, statistic, p.value) |>
  mutate(across(c(estimate, std.error, statistic), ~round(., 3)),
         p.value = round(p.value, 3))

print(int_pvals)

## # A tibble: 6 × 5
##   term                                      estimate std.error statistic p.value
##   <chr>                                        <dbl>     <dbl>     <dbl>   <dbl>
## 1 preterm_birth_fPreterm (<37 wks):race_et…    0.905     0.642     1.41    0.168
## 2 preterm_birth_fPost-term (≥40 wks):race_…   -0.2       0.758    -0.264   0.793
## 3 preterm_birth_fPreterm (<37 wks):race_et…    2.26      0.775     2.92    0.006
## 4 preterm_birth_fPost-term (≥40 wks):race_…    0.708     0.966     0.733   0.469
## 5 preterm_birth_fPreterm (<37 wks):race_et…    1.56      1.32      1.18    0.247
## 6 preterm_birth_fPost-term (≥40 wks):race_…   -0.25      0.875    -0.286   0.777

Interaction Visualization

ggpredict(model_interaction,
          terms = c("preterm_birth_f", "race_eth")) |>
  plot() +
  labs(
    title    = "Figure 10. Predicted Probability by Gestational Age and Race/Ethnicity",
    subtitle = "Interaction model; holding other covariates at reference values",
    x        = "Gestational Age Category",
    y        = "Predicted Probability of Breastfeeding Initiation",
    color    = "Race/Ethnicity",
    caption  = "Parallel lines = no interaction. Crossing/diverging lines = potential effect modification."
  ) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

Does being preterm or post‑term change breastfeeding initiation differently for NH Black, Hispanic, or NH Other mothers compared with NH White mothers?

Interaction terms between gestational age and race/ethnicity indicated limited evidence of effect modification. The only statistically significant interaction was between preterm birth and Hispanic ethnicity (p = 0.006), suggesting that the difference in breastfeeding initiation between preterm and term infants is greater among Hispanic mothers compared with NH White mothers. No significant interactions were observed for NH Black or NH Other mothers, indicating that the association between gestational age and breastfeeding initiation is generally consistent across these groups.

Preterm Birth × Insurance Interaction

# Step 1: Fit model WITH interaction term (preterm × insurance)
# Step 2: Fit model WITHOUT interaction (main effects only)
# Step 3: LR test comparing the two — tests joint significance of interaction terms
# Step 4: If significant, report stratum-specific ORs; if not, report main effects

# Model WITH interaction
model_insurance <- svyglm(
  breastfed_init_f ~ preterm_birth_f * insurance_f + race_eth +
    age_cat + metro_f + educ_cat + poverty_cat,
  family = quasibinomial(link = "logit"),
  design = svy_design
)

# Model WITHOUT interaction (main effects)

int_pvals_insurance <- tidy(model_insurance) |>
  filter(str_detect(term, "preterm_birth_f.*insurance_f|:")) |>
  select(term, estimate, std.error, statistic, p.value) |>
  mutate(across(c(estimate, std.error, statistic), ~round(., 3)),
         p.value = round(p.value, 3))

print(int_pvals_insurance)

## # A tibble: 6 × 5
##   term                                      estimate std.error statistic p.value
##   <chr>                                        <dbl>     <dbl>     <dbl>   <dbl>
## 1 preterm_birth_fPreterm (<37 wks):insuran…    0.708     0.599     1.18    0.245
## 2 preterm_birth_fPost-term (≥40 wks):insur…   -0.711     0.651    -1.09    0.283
## 3 preterm_birth_fPreterm (<37 wks):insuran…    0.367     0.967     0.38    0.706
## 4 preterm_birth_fPost-term (≥40 wks):insur…   -1.85      1.01     -1.84    0.075
## 5 preterm_birth_fPreterm (<37 wks):insuran…    1.16      1.12      1.03    0.308
## 6 preterm_birth_fPost-term (≥40 wks):insur…    0.419     1.14      0.367   0.716

Interaction Visualization

ggpredict(model_insurance,
          terms = c("preterm_birth_f", "insurance_f")) |>
  plot() +
  labs(
    title    = "Figure 11. Predicted Probability by Gestational Age and Insurance Type",
    subtitle = "Interaction model; holding other covariates at reference values",
    x        = "Gestational Age Category",
    y        = "Predicted Probability of Breastfeeding Initiation",
    color    = "Insurance Type",
    caption  = "Parallel lines = no interaction. Crossing/diverging lines = potential effect modification."
  ) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

Does being preterm or post‑term change breastfeeding initiation differently for Medicaid, Other‑govt, or Uninsured mothers compared with Private insurance?

There is no evidence that the association between gestational age and breastfeeding initiation differs by insurance type.Interaction terms between gestational age and insurance type were not statistically significant, indicating that the association between gestational age and breastfeeding initiation does not differ meaningfully across insurance groups. The predicted‑probability plot shows largely parallel lines, consistent with the absence of effect modification.

Save Analytical Dataset

# Save the cleaned analytical dataset for use in the Final Report.

saveRDS(
  ptb_nsfg,
  file = here::here("Showcase 2026", "nsfg_2022_2023_checkin2.rds")
)

In this nationally representative sample of U.S. pregnancies from the 2022–2023 NSFG, breastfeeding initiation was common overall (81%), but substantial sociodemographic disparities were evident. In unadjusted analyses, preterm birth was not associated with breastfeeding initiation (OR = 1.02, 95% CI: 0.63–1.65), and this remained unchanged after adjustment for maternal age, race/ethnicity, socioeconomic status, and insurance (aOR = 1.14, 95% CI: 0.67–1.93). The small shift between crude and adjusted estimates indicates minimal confounding, and both models support the conclusion that preterm birth is not independently associated with breastfeeding initiation. In contrast, post‑term birth showed a strong positive association in both crude (OR = 2.27, 95% CI: 1.25–4.11) and adjusted models (aOR = 1.93, 95% CI: 1.03–3.64), suggesting that infants born ≥40 weeks have higher odds of breastfeeding initiation even after accounting for sociodemographic factors.

Summary of Findings

tidy(model_adj, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    term = case_when(
      term == "preterm_birth_fPreterm (<37 wks)"~ "Preterm (<37 wks) vs. Term*",
      term == "preterm_birth_fPost-term (>=40 wks)" ~ "Post-term (>=40 wks) vs. Term",
      str_detect(term, "race_ethNH Black")~ "NH Black vs. NH White",
      str_detect(term, "race_ethHispanic")~ "Hispanic vs. NH White",
      str_detect(term, "educ_catBachelor")~ "Bachelor's vs. HS/GED",
      str_detect(term, "educ_catMasters or higher")~ "Masters vs. HS/GED",
      str_detect(term, "poverty_cat200–<400% FPL") ~ "200-400% FPL vs. <200%",
      TRUE ~ term
    ),
    Significance = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      TRUE            ~ ""
    )
  ) |>
  filter(term %in% c(
    "Preterm (<37 wks) vs. Term*",
    "Post-term (>=40 wks) vs. Term",
    "NH Black vs. NH White",
    "Hispanic vs. NH White",
    "Bachelor's vs. HS/GED",
    "Masters vs. HS/GED",
    "200-400% FPL vs. <200%"
  )) |>
  mutate(across(c(estimate, conf.low, conf.high), ~round(., 2)),
         p.value = round(p.value, 3)) |>
  kable(
    caption = "Table 7. Key Results from Adjusted Logistic Regression Model",
    col.names = c("Predictor", "aOR", "Std. Error", "z-stat",
                  "p-value", "95% CI Low", "95% CI High", "Sig.")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  footnote(
    general = "* Primary exposure of interest. aOR = adjusted odds ratio.
    Survey-weighted quasibinomial logistic regression. N = 1,618.
    Reference group for gestational age = Term (37–40 wks).",
    symbol  = "* = p<0.05; ** = p<0.01; *** = p<0.001"
  )

Table 7. Key Results from Adjusted Logistic Regression Model
Predictor	aOR	Std. Error	z-stat	p-value	95% CI Low	95% CI High	Sig.
Preterm (<37 wks) vs. Term*	1.14	0.2617050	0.4960021	0.623	0.67	1.93
NH Black vs. NH White	0.49	0.2742380	-2.6018314	0.013	0.28	0.85
Hispanic vs. NH White	1.79	0.3137495	1.8596670	0.070	0.95	3.38
Bachelor’s vs. HS/GED	5.51	0.3521016	4.8463311	0.000	2.70	11.23	***
Masters vs. HS/GED	4.70	0.4283874	3.6133288	0.001	1.98	11.18	***
200-400% FPL vs. <200%	2.19	0.2491748	3.1411714	0.003	1.32	3.62	**
Note:
* Primary exposure of interest. aOR = adjusted odds ratio. Survey-weighted quasibinomial logistic regression. N = 1,618. Reference group for gestational age = Term (37–40 wks).
= p<0.05; = p<0.01; * = p<0.001

Sociodemographic predictors show the most meaningful associations. Non‑Hispanic Black mothers had significantly lower odds of breastfeeding initiation compared with Non‑Hispanic White mothers (aOR = 0.49, 95% CI: 0.28–0.85), consistent with what we know about racial disparities. Higher educational achievement of mothers showed the strongest positive gradient: mothers with a bachelor’s degree (aOR = 5.62, 95% CI: 2.30–13.7) or graduate degree (aOR = 4.80, 95% CI: 1.67–13.8) had higher odds of breastfeeding initiation. Income also played an important role, with mothers in the 200–<400% FPL range showing significantly higher odds (aOR = 2.19, 95% CI: 1.32–3.62). Other factors including maternal age categories, residence, and insurance type were not independently associated with breastfeeding initiation after adjustment.

Model diagnostics indicated acceptable performance: the AUC of 0.788 demonstrated good discrimination, calibration was visually adequate, and no influential observations or multicollinearity concerns were detected. Overall, findings suggest that breastfeeding initiation disparities are driven primarily by sociodemographic factors, rather than gestational age at birth.

#end of check-in 2

Association Between Preterm Birth and Breastfeeding Initiation Among U.S. Women

A Secondary analysis of the 2022–2023 National Survey of Family Growth (NSFG)

Suruchi Shahi

2026-04-28

Introduction

Loading NSFG 2022- 2023 Data

Setup and Data

Creating a Working Subset

Justification for Logistic Regression

Reference Categories

Covariate Justification

The final analytical sample consists of 1618 singleton live births from the NSFG 2022-2023. Table 1 presents the distribution of all key variables and table 2 is the distribution of all key variables stratified by breastfeeding initiation status.

TABLE 1: Covariates stratified by PRETERM STATUS (not outcome) as followed from Check-in 1 feedback.

This is the primary exposure, so this is the correct stratification.

#-Unweighted covariate distribution by preterm birth status. This allows us to see whether potential confounders are distributed differently across gestational age groups.

Start with the survey design object

Before running any model, always visualize outcome distribution.

Regression Equation: Models Fit

Model 2: Adjusted Logistic Regression

Cook’s Distance (Influential Observations)

Multicollinearity Check (VIF)

Section 5: Regression Visualizations

Predicted Probabilities (BEFORE adjusting for confounders)

Figure 8: Forest Plot of Adjusted Odds Ratios (Full Model)

Section 5: Interaction Testing

Preterm Birth × Race/Ethnicity Interaction

Interaction Visualization

Preterm Birth × Insurance Interaction

Interaction Visualization

Save Analytical Dataset

Summary of Findings