Research Question: Do maternal age, weight gain during pregnancy, and smoking habits significantly predict a baby’s birth weight?

Significance: Understanding these predictors allows healthcare providers to identify at-risk pregnancies earlier.

1 Introduction

The births14 dataset is a random sample of 1,000 cases from US birth records in 2014. This project investigates whether maternal age (mage), weight gain during pregnancy (gained), and smoking habits (habit) significantly predict a baby’s birth weight (weight).

Dataset Overview

Dataset: births14 (US Births, 2014). Observations: 1,000. Variables used: habit (smoker/nonsmoker), gained (weight gain), mage (mother’s age), and weight (baby’s birth weight).

Variables:
Dependent (Y): weight (numeric, birth weight in pounds).
Independent (X1): mage (numeric, mother’s age in years).
Independent (X2): gained (numeric, mother’s weight gain in pounds).
Independent (X3): habit (categorical, smoker/nonsmoker).

Setup

2 Data Analysis (Wrangling)

To ensure my statistical results are valid, I cleaned the data by filtering out any observations with missing values in my target variables. This complete-case filter keeps 941 of the 1,000 observations.

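# 1. Loading the packages used for import, wrangling, and plotting
library(readr)    # read_csv()
library(dplyr)    # filter(), select(), %>%
library(ggplot2)  # ggplot()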

# 2. Importing my data set

births_clean <- read_csv("births14.csv") %>% 
  # Filtering out rows with missing critical data
  filter(!is.na(weight), !is.na(gained), !is.na(habit), !is.na(mage)) %>% 
  # Selecting only relevant variables
  select(weight, gained, mage, habit)

# 3. Viewing summary and dimensions

summary(births_clean)
##      weight           gained           mage          habit          
##  Min.   : 1.120   Min.   : 0.00   Min.   :14.00   Length:941        
##  1st Qu.: 6.560   1st Qu.:20.00   1st Qu.:24.00   Class :character  
##  Median : 7.310   Median :30.00   Median :28.00   Mode  :character  
##  Mean   : 7.209   Mean   :30.48   Mean   :28.41                     
##  3rd Qu.: 8.000   3rd Qu.:39.00   3rd Qu.:33.00                     
##  Max.   :10.420   Max.   :98.00   Max.   :47.00
dim(births_clean)
## [1] 941   4
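The filter above drops 59 of the 1,000 rows, and it can help to see which variables are responsible. The following is a minimal sketch on a small made-up data frame (`births_raw` and its values are hypothetical stand-ins, not the real file) that counts missing values per variable and the rows that survive the same filter.

```r
library(dplyr)

# Hypothetical toy data standing in for the raw births14 file
births_raw <- data.frame(
  weight = c(7.5, 6.8, NA, 8.1, 5.9),
  gained = c(30, NA, 25, 40, 12),
  mage   = c(28, 31, 24, 35, 19),
  habit  = c("nonsmoker", "smoker", "nonsmoker", NA, "nonsmoker")
)

# Count missing values in each modelling variable
na_counts <- births_raw %>%
  summarise(across(c(weight, gained, mage, habit), ~ sum(is.na(.x))))

# Count rows that survive the complete-case filter used in this report
n_complete <- births_raw %>%
  filter(!is.na(weight), !is.na(gained), !is.na(habit), !is.na(mage)) %>%
  nrow()

print(na_counts)
cat("Rows kept:", n_complete, "of", nrow(births_raw), "\n")
```

Running the same two steps on the imported births14 data, before filtering, would show how the 59 dropped rows are spread across the four variables.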

3 Exploratory Data Analysis (EDA)

Before modeling, I look for trends. The scatterplot shows birth weight against maternal weight gain, with a separate fitted line for smokers and nonsmokers.

ggplot(births_clean, aes(x = gained, y = weight, color = habit)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Birth Weight vs. Maternal Weight Gain",
       x = "Weight Gained by Mother (lbs)",
       y = "Birth Weight of Baby (lbs)") +
  theme_minimal()
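A boxplot is a natural companion to this plot when the goal is to compare smokers and nonsmokers directly. The sketch below uses simulated data (`toy`, with made-up means and spreads) so that it runs on its own; swapping `toy` for `births_clean` should give the corresponding plot for the real data.

```r
library(ggplot2)

# Simulated stand-in data; the group means and spread are assumptions
set.seed(1)
toy <- data.frame(
  habit  = rep(c("nonsmoker", "smoker"), times = c(80, 20)),
  weight = c(rnorm(80, mean = 7.3, sd = 1.2),
             rnorm(20, mean = 6.8, sd = 1.2))
)

# Side-by-side boxplots of birth weight by smoking habit
p <- ggplot(toy, aes(x = habit, y = weight, fill = habit)) +
  geom_boxplot(alpha = 0.6) +
  labs(title = "Birth Weight by Smoking Habit",
       x = "Smoking Habit of Mother",
       y = "Birth Weight of Baby (lbs)") +
  theme_minimal() +
  theme(legend.position = "none")

print(p)
```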

4 Statistical Analysis (Multiple Regression)

I used the lm() function to fit a Multiple Linear Regression model. This allowed me to see the impact of one variable while holding others constant.

Statistical Analysis Choice

Selected Method: Multiple Regression.

Why: Since weight (baby’s birth weight) is a continuous numerical variable, I used Multiple Linear Regression to see how several independent variables predict it.
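Written out, the model being fitted takes roughly the following form. The symbols are notation introduced here for illustration; the indicator equals 1 for smokers and 0 for nonsmokers, on the assumption that R treats nonsmoker as the baseline level of habit.

```latex
\[
\text{weight}_i = \beta_0 + \beta_1\,\text{gained}_i + \beta_2\,\text{mage}_i
  + \beta_3\,\mathbf{1}[\text{habit}_i = \text{smoker}] + \varepsilon_i
\]
```

Each slope describes the expected change in birth weight for a one-unit change in that predictor while the other two are held fixed.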

# Fitting my multiple linear regression model
model <- lm(weight ~ gained + mage + habit, data = births_clean)

# Showing the summary of the model
summary(model)
## 
## Call:
## lm(formula = weight ~ gained + mage + habit, data = births_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1642 -0.6541  0.1158  0.7734  3.1819 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.488198   0.225160  28.816  < 2e-16 ***
## gained       0.012172   0.002654   4.587 5.11e-06 ***
## mage         0.014375   0.007068   2.034   0.0422 *  
## habitsmoker -0.493991   0.125655  -3.931 9.07e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.245 on 937 degrees of freedom
## Multiple R-squared:  0.04317,    Adjusted R-squared:  0.04011 
## F-statistic: 14.09 on 3 and 937 DF,  p-value: 5.45e-09

Interpretation:

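Smoking has the largest effect and is the only negative one: holding weight gain and age constant, babies of smokers weigh about 0.49 lbs less on average than babies of nonsmokers. Each additional pound a mother gains is associated with a 0.012 lb higher birth weight, and each additional year of maternal age with a 0.014 lb higher birth weight. All three predictors are significant at the 0.05 level, although maternal age only marginally (p = 0.042).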
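To make the coefficients concrete, the sketch below plugs them into a worked prediction and builds approximate 95% confidence intervals. The estimates, standard errors, and degrees of freedom are copied from the summary output above; the example mother (28 years old, 30 lbs gained, the sample medians) is chosen only for illustration.

```r
# Estimates and standard errors copied from the summary(model) output
est <- c(intercept = 6.488198, gained = 0.012172,
         mage = 0.014375, habitsmoker = -0.493991)
se  <- c(intercept = 0.225160, gained = 0.002654,
         mage = 0.007068, habitsmoker = 0.125655)
df_resid <- 937  # residual degrees of freedom

# Illustrative mother: 28 years old, gained 30 lbs (the sample medians)
pred_nonsmoker <- est[["intercept"]] + est[["gained"]] * 30 + est[["mage"]] * 28
pred_smoker    <- pred_nonsmoker + est[["habitsmoker"]]

# Approximate 95% confidence intervals: estimate +/- t critical value * SE
t_crit <- qt(0.975, df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

print(round(c(nonsmoker = pred_nonsmoker, smoker = pred_smoker), 3))
print(round(ci, 3))
```

The predicted birth weights come out at about 7.26 lbs for the nonsmoker and 6.76 lbs for the smoker. With the fitted model object available, `predict(model, newdata = ...)` and `confint(model)` would give the same quantities without copying numbers by hand.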

Model Fit: The Adjusted R-squared is 0.04, so the model explains only about 4% of the variation in birth weight. The predictors are statistically significant, but most of the variation comes from factors not in my model.
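The adjusted value can be checked by hand from the reported R-squared, the sample size, and the number of predictors. This is a small sketch using the standard adjustment formula, with inputs taken from the output above.

```r
r2 <- 0.04317  # Multiple R-squared from the summary output
n  <- 941      # observations after cleaning
p  <- 3        # predictors: gained, mage, habit

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adj_r2, 5))
```

The result rounds to 0.04011, matching the reported value, and `n - p - 1` equals the 937 residual degrees of freedom shown in the output.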

5 Model Assumptions and Diagnostics

To ensure my model is valid, I’ll check for linearity, normality and constant variance of residuals, and influential observations.

# Generating diagnostic plots
par(mfrow = c(2, 2))
plot(model)

Linearity: The Residuals vs Fitted plot shows a roughly flat line, which is consistent with a linear relationship.

Normality: The Q-Q Plot shows most residuals following the diagonal line. The residual summary (minimum -6.16, maximum 3.18) points to a longer lower tail, so normality holds only approximately.

Homoscedasticity: The Scale-Location plot shows a roughly consistent spread, suggesting the error variance is approximately constant.

Leverage: The Residuals vs Leverage plot suggests that no influential outliers are biasing the coefficients.
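The visual checks can be complemented with numeric ones. The sketch below is run on simulated data so that it is self-contained (`toy`, its coefficients, and the `4 / n` cutoff for Cook’s distance are assumptions, the cutoff being a common rule of thumb rather than a strict test); the same calls could be applied to the fitted `model` object.

```r
# Simulated stand-in data; coefficients and noise level are assumptions
set.seed(42)
n <- 200
toy <- data.frame(
  gained = runif(n, 0, 60),
  mage   = round(runif(n, 16, 45)),
  habit  = sample(c("nonsmoker", "smoker"), n, replace = TRUE,
                  prob = c(0.9, 0.1))
)
toy$weight <- 6.5 + 0.012 * toy$gained + 0.014 * toy$mage -
  0.5 * (toy$habit == "smoker") + rnorm(n, sd = 1.2)

toy_model <- lm(weight ~ gained + mage + habit, data = toy)

# Normality: Shapiro-Wilk test on the residuals
sw <- shapiro.test(residuals(toy_model))

# Influence: Cook's distance, flagged with the 4 / n rule of thumb
cooks <- cooks.distance(toy_model)
n_flagged <- sum(cooks > 4 / n)

print(sw)
cat("Observations with Cook's distance above 4/n:", n_flagged, "\n")
```

With a sample as large as 941, the Shapiro-Wilk test can reject normality for small departures, so its result is best read alongside the Q-Q plot.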

Final Conclusion

The study concludes that smoking habit, maternal weight gain, and maternal age are statistically significant predictors of birth weight, although together they explain only about 4% of its variation. Smoking is associated with the largest difference, about half a pound lower birth weight, while maternal weight gain and age have smaller, positive associations with birth weight. These findings support the need for smoking cessation programs in prenatal care.

References

CDC National Center for Health Statistics. Natality Detail File, 2014.