Project1-Data110: Analysis of Birth Data in the United States for the Year 2014

Author

Merveille Kuendzong

Published

March 10, 2024

Introduction

Every year, the United States Department of Health and Human Services publicly releases a large dataset containing information on recorded births in the country. This project explores the ‘births14’ dataset, a random sample of 1,000 cases from the 2014 release. The focus is on the weights of babies at birth, maternal weight gain during pregnancy, and potentially influential factors such as race and smoking habits.

Data

The dataset I am using comprises the following 13 variables: fage (father’s age in years), mage (mother’s age in years), mature (maturity status of the mother: mature or younger), weeks (length of pregnancy in weeks), premie (whether the birth was classified as premature (premie) or full-term), visits (number of hospital visits during pregnancy), gained (weight gained by the mother during pregnancy, in pounds), weight (weight of the baby at birth, in pounds), lowbirthweight (whether the baby was classified as low birthweight (low) or not (not low)), sex (sex of the baby: female or male), habit (status of the mother as a nonsmoker or a smoker), marital (whether the mother was married or not married at the time of birth), and whitemom (whether the mother is white or not white).

Dataset Source: United States Department of Health and Human Services. Centers for Disease Control and Prevention. National Center for Health Statistics. Natality Detail File, 2014 United States. Inter-university Consortium for Political and Social Research, 2016-10-07.

Load Libraries

library(tidyverse)
Warning: package 'dplyr' was built under R version 4.3.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(treemap)
Warning: package 'treemap' was built under R version 4.3.2
library(ggfortify)

Load data

# set working directory
setwd("C:/Users/kmerv_6exilcx/Dropbox/SPRING 2024/Data 110/week7/project1")
births <- read_csv('births14.csv')

# display the first six rows
head(births)
# A tibble: 6 × 13
   fage  mage mature      weeks premie visits gained weight lowbirthweight sex  
  <dbl> <dbl> <chr>       <dbl> <chr>   <dbl>  <dbl>  <dbl> <chr>          <chr>
1    34    34 younger mom    37 full …     14     28   6.96 not low        male 
2    36    31 younger mom    41 full …     12     41   8.86 not low        fema…
3    37    36 mature mom     37 full …     10     28   7.51 not low        fema…
4    NA    16 younger mom    38 full …     NA     29   6.19 not low        male 
5    32    31 younger mom    36 premie     12     48   6.75 not low        fema…
6    32    26 younger mom    39 full …     14     45   6.69 not low        fema…
# ℹ 3 more variables: habit <chr>, marital <chr>, whitemom <chr>
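
The first rows already show missing values (NA) in fage and visits. A quick way to see how many values are missing in each column is sketched below in base R. The small `toy` data frame is an invented stand-in for `births`, so the counts are illustrative only; with the real data the same call would be `colSums(is.na(births))`.

```r
# Hypothetical stand-in for `births` (values invented for illustration)
toy <- data.frame(
  fage   = c(34, 36, NA, 32),
  visits = c(14, NA, NA, 12),
  gained = c(28, 41, 29, 48)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)
```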

Scatterplot of Pregnancy Length to Birth Weight

ggplot(births, aes(x = weeks, y = weight, color = sex))+
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # add color by sex
  scale_color_manual(values = c("male" = "skyblue", "female" = "pink")) +
  theme_minimal(base_size = 12)+
  labs(x="Pregnancy Length (weeks)", 
       y="Birth Weight (pounds)",
       title = "Scatterplot of Pregnancy Length to Birth Weight",
       caption = "Source: United States Department of Health and Human Services.")
`geom_smooth()` using formula = 'y ~ x'

Correlation and Linear Regression Model

cor(births$weeks, births$weight)
[1] 0.5480481
fit <- lm(weight ~ weeks, data = births)
summary(fit)

Call:
lm(formula = weight ~ weeks, data = births)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0175 -0.7187  0.0190  0.6886  3.6078 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.59797    0.52273  -6.883 1.03e-11 ***
weeks        0.27922    0.01349  20.699  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.094 on 998 degrees of freedom
Multiple R-squared:  0.3004,    Adjusted R-squared:  0.2997 
F-statistic: 428.4 on 1 and 998 DF,  p-value: < 2.2e-16

The correlation is about 0.55, which indicates a moderate positive linear relationship between the length of pregnancy and the weight of the baby at birth.

The model has the equation: weight = 0.27922(weeks) - 3.59797

The slope may be interpreted as follows: for each additional week of pregnancy, the model predicts an increase of 0.27922 pounds (about 0.28 pounds) in birth weight.
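
As a small worked example of how the equation is used, the sketch below plugs two pregnancy lengths into the fitted line. The coefficients are copied from the regression output; `predict_weight` is a hypothetical helper name introduced only for this illustration.

```r
# Coefficients copied from the summary(fit) output
intercept <- -3.59797
slope     <- 0.27922

# Hypothetical helper: predicted birth weight (pounds) for a pregnancy length (weeks)
predict_weight <- function(weeks) intercept + slope * weeks

predict_weight(39)                        # about 7.29 pounds
predict_weight(40) - predict_weight(39)   # equals the slope, about 0.28 pounds
```

With the fitted model object itself, `predict(fit, newdata = data.frame(weeks = 39))` should give the same prediction up to rounding of the coefficients.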

The three asterisks next to the p-value for weeks (< 2e-16) indicate that the coefficient is statistically significant at the 0.001 level, which suggests weeks is a meaningful variable for explaining the linear increase in weight. The adjusted R-squared value (0.2997) states that about 30% of the variation in birth weight may be explained by the model. In other words, about 70% of the variation in the data is not explained by pregnancy length alone.
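
In a regression with a single predictor, R-squared is simply the square of the correlation coefficient, so the two outputs can be checked against each other. The sketch below does that arithmetic with the numbers reported in the output. The sample size of 1,000 is inferred from the 998 residual degrees of freedom, and the adjusted value uses the standard formula for one predictor.

```r
r <- 0.5480481   # correlation reported by cor()
n <- 1000        # 998 residual degrees of freedom + 2 estimated coefficients

r_squared     <- r^2
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - 2)

round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 4)
# r_squared 0.3004, adj_r_squared 0.2997, matching the summary(fit) output
```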

Diagnostic plots

autoplot(fit, 1:4, nrow=2, ncol=2)

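The non-horizontal trend in the Residuals vs Fitted plot suggests that the relationship may not be strictly linear, and any uneven spread of the residuals around that trend would point to a violation of the constant-variance assumption. Observations 246, 155, and 976 are flagged in both the Residuals vs Fitted and Normal Q-Q plots and also have high Scale-Location values, which marks them as potential outliers.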
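
One common numeric complement to these plots is Cook's distance, which measures how much each observation influences the fitted line. The sketch below is only an illustration on simulated data, not on `births`: the simulated values, the artificial outlier in row 101, and the 4/n cutoff (a rule of thumb, not a formal test) are all assumptions of the example.

```r
set.seed(110)
n <- 100

# Simulated data loosely shaped like the weeks/weight relationship
weeks  <- round(rnorm(n, mean = 39, sd = 2))
weight <- -3.6 + 0.28 * weeks + rnorm(n, sd = 1)

# Append one artificial, clearly unusual birth (short pregnancy, heavy baby)
sim <- data.frame(weeks = c(weeks, 25), weight = c(weight, 12))

sim_fit <- lm(weight ~ weeks, data = sim)
cooks   <- cooks.distance(sim_fit)

# Flag observations whose Cook's distance exceeds the rule-of-thumb cutoff
flagged <- which(cooks > 4 / nrow(sim))
flagged   # row 101, the artificial point, is among the flagged rows
```

Applied to the real model, `cooks.distance(fit)` could be inspected in the same way to see whether observations 246, 155, and 976 stand out.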

Treemap of Mean Gained Weight by Pregnancy Length

# group the births dataset by 'weeks' and calculate the mean of 'gained' and 'weight' within each group
births1 <- births|>
  group_by(weeks)|>
  summarize(gained = mean(gained, na.rm = TRUE), weight = mean(weight, na.rm = TRUE)) 

# Treemap 
# treemap() draws with grid graphics rather than ggplot2, so a ggplot2 theme cannot be added with `+`
treemap(births1, index = "weeks", vSize = "gained", vColor = "weight",
        type = "manual", palette = "RdYlBu",
        title = "Mean gained weight by pregnancy length")

Each rectangle represents a duration (in weeks) of pregnancy, sized based on the mean ‘gained’ weight of mothers, colored by the mean ‘weight’ of babies at birth.

This treemap illustrates that the highest mean gained weight was observed for pregnancies lasting 26 weeks (an atypically short duration), followed by 46 weeks (which is longer than a typical full-term pregnancy), and then 31 weeks, and so on.
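
A mean for an unusual pregnancy length may rest on very few births, in which case a single mother can dominate it. A minimal sketch of how to check this is to report the number of births next to each group mean. The `toy` values below are invented for illustration and are not taken from `births`.

```r
# Hypothetical stand-in for `births` (values invented for illustration)
toy <- data.frame(
  weeks  = c(26, 39, 39, 39, 40, 40),
  gained = c(60, 30, 28, 35, 25, 33)
)

# Number of births and mean weight gain for each pregnancy length
n_births  <- tapply(toy$gained, toy$weeks, length)
mean_gain <- tapply(toy$gained, toy$weeks, mean)

by_week <- data.frame(
  weeks       = as.numeric(names(n_births)),
  n           = as.vector(n_births),
  mean_gained = as.vector(mean_gain)
)
by_week
# In this toy data the largest mean (60, at 26 weeks) comes from a single birth
```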

Weight Gains by Maternal Habit and Ethnicity

Create the new variable wgained

# create new dataset births2: a copy of births without missing 'gained' or 'habit' values, plus the new variable wgained
births2 <- births |>
  # remove na's
  filter(!is.na(gained) & !is.na(habit)) |>
  mutate(wgained = case_when(
    gained < 25 ~ "low",
    between(gained, 25, 40) ~ "medium",
    gained > 40 ~ "high"
  ))
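
To make the category boundaries explicit, the sketch below applies the same rules to a few test values in base R. The function name `categorize_gain` is hypothetical and is not part of the analysis. Because `between(gained, 25, 40)` includes both endpoints, gains of exactly 25 and 40 pounds both count as "medium".

```r
# Hypothetical base R equivalent of the case_when() rules
categorize_gain <- function(gained) {
  labels <- ifelse(gained < 25, "low",
                   ifelse(gained <= 40, "medium", "high"))
  factor(labels, levels = c("low", "medium", "high"))
}

categorize_gain(c(24, 25, 40, 41, NA))
# low, medium, medium, high, NA
```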

Bar Chart of Weight Gain Categories by Habit and Race

# Reorder the factor levels of 'wgained' to reflect the desired order
births2$wgained <- factor(births2$wgained, levels = c("low", "medium", "high"))

# Create a grouped bar chart with facets by 'whitemom'
plot <- ggplot(births2, aes(x = habit, fill = wgained)) +
  geom_bar(position = "dodge", stat = "count") +
  scale_fill_manual(values = c("low" = "pink", "medium" = "salmon", "high" = "skyblue")) +
  labs(x = "Habit", y = "Count", fill = "Weight Gained",
       title = "Counts of Weight Gain Categories by Habit and Race") +
  facet_wrap(~whitemom, scales = "fixed") +
  theme_dark()

plot

The dataset is imbalanced: white mothers are more heavily represented than non-white mothers, so the counts are higher in the white panel. In both groups, medium weight gain is the most common category, followed by low weight gain, with high weight gain the least common. In both groups, non-smokers also outnumber smokers.
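
Because the groups differ so much in size, raw counts are hard to compare across them. One option is to convert the counts into within-group proportions, as sketched below. The numbers in `toy_counts` are invented for illustration and are not the counts from `births2`.

```r
# Hypothetical counts of weight gain categories by habit (invented values)
toy_counts <- matrix(
  c(20, 50, 30,
     5, 10,  5),
  nrow = 2, byrow = TRUE,
  dimnames = list(habit   = c("nonsmoker", "smoker"),
                  wgained = c("low", "medium", "high"))
)

# Share of each weight gain category within each habit group (each row sums to 1)
shares <- prop.table(toy_counts, margin = 1)
round(shares, 2)
```

With the real data, the same idea could be applied to `table(births2$habit, births2$wgained)`, optionally split by `whitemom`.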

Conclusion

This dataset was suitable for examining the relationship between expectant mothers’ habits and practices and the birth of their children. The scatterplot revealed a moderate positive association between pregnancy length and baby weight at birth, and the linear model confirmed that the number of weeks is a statistically significant predictor of birth weight, although it explains only about 30% of the variation. The scatterplot was limited to two colors because ‘sex’, like the other categorical variables in the dataset, has only two possible values.

The treemap demonstrated that high weight gain is not necessarily tied to greater pregnancy length. An unexpected finding was that the highest mean weight gain was observed in women with pregnancies lasting only 26 weeks. The bar chart highlighted the prevalence of white women in the dataset and a lower number of smokers compared to non-smokers.

Cleaning the dataset before working was largely unnecessary, as all variable names were already lowercase and clearly named. However, I had to remove NA values when computing group means and before generating the bar chart. Although I wished to explore the influence of the ‘visits’ variable on the weight gained by the mother or the baby's weight at birth, the complexity involved would extend the analysis beyond the necessary scope of this assignment.