Research Question

Do babies born to mothers who smoke have a lower average birth weight than those born to non-smoking mothers?

Introduction

The data set I am using is titled “births14,” and it comes from OpenIntro (https://www.openintro.org/data/index.php?data=births14). It contains births recorded in the US in 2014 and has been of interest to medical researchers studying the relationship between the habits and practices of expectant mothers and the birth of their children. The data set has 1,000 observations and 13 variables.

For this project, I focused on two specific columns from the data set:

weight habit

The weight column shows the baby’s birth weight (in pounds) and is a quantitative variable. The habit column shows the mother’s smoking status (smoker or nonsmoker) and is a categorical variable.

Source

United States Department of Health and Human Services. Centers for Disease Control and Prevention. National Center for Health Statistics. Natality Detail File, 2014 United States. Inter-university Consortium for Political and Social Research, 2016-10-07

Data Analysis

Before running the hypothesis test, I cleaned and explored the data set to make sure I was working with accurate information. First, I renamed two variables (father’s age and mother’s age) to make the data set easier to understand. When I checked for missing values, I found that the habit column had 19 NAs, while the weight column had none. Since both variables are required for my comparison, I removed the rows with missing smoking status using the filter function.

After cleaning, I used group_by and summarize to calculate summary statistics for the two groups: the number of observations, the mean birth weight, and the standard deviation for smokers and non-smokers. For visualization, I created a box plot to compare the distribution of birth weights between smoking and non-smoking mothers. I chose a box plot because it clearly shows differences in medians and spread.

Load the libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Loading the data set

setwd("C:/Users/thilo/OneDrive/Desktop/DATA 101")
births <- read_csv("births14.csv")

Inspecting the data types and the first 6 rows

str(births)
## spc_tbl_ [1,000 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ fage          : num [1:1000] 34 36 37 NA 32 32 37 29 30 29 ...
##  $ mage          : num [1:1000] 34 31 36 16 31 26 36 24 32 26 ...
##  $ mature        : chr [1:1000] "younger mom" "younger mom" "mature mom" "younger mom" ...
##  $ weeks         : num [1:1000] 37 41 37 38 36 39 36 40 39 39 ...
##  $ premie        : chr [1:1000] "full term" "full term" "full term" "full term" ...
##  $ visits        : num [1:1000] 14 12 10 NA 12 14 10 13 15 11 ...
##  $ gained        : num [1:1000] 28 41 28 29 48 45 20 65 25 22 ...
##  $ weight        : num [1:1000] 6.96 8.86 7.51 6.19 6.75 6.69 6.13 6.74 8.94 9.12 ...
##  $ lowbirthweight: chr [1:1000] "not low" "not low" "not low" "not low" ...
##  $ sex           : chr [1:1000] "male" "female" "female" "male" ...
##  $ habit         : chr [1:1000] "nonsmoker" "nonsmoker" "nonsmoker" "nonsmoker" ...
##  $ marital       : chr [1:1000] "married" "married" "married" "not married" ...
##  $ whitemom      : chr [1:1000] "white" "white" "not white" "white" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   fage = col_double(),
##   ..   mage = col_double(),
##   ..   mature = col_character(),
##   ..   weeks = col_double(),
##   ..   premie = col_character(),
##   ..   visits = col_double(),
##   ..   gained = col_double(),
##   ..   weight = col_double(),
##   ..   lowbirthweight = col_character(),
##   ..   sex = col_character(),
##   ..   habit = col_character(),
##   ..   marital = col_character(),
##   ..   whitemom = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(births)
## # A tibble: 6 × 13
##    fage  mage mature      weeks premie visits gained weight lowbirthweight sex  
##   <dbl> <dbl> <chr>       <dbl> <chr>   <dbl>  <dbl>  <dbl> <chr>          <chr>
## 1    34    34 younger mom    37 full …     14     28   6.96 not low        male 
## 2    36    31 younger mom    41 full …     12     41   8.86 not low        fema…
## 3    37    36 mature mom     37 full …     10     28   7.51 not low        fema…
## 4    NA    16 younger mom    38 full …     NA     29   6.19 not low        male 
## 5    32    31 younger mom    36 premie     12     48   6.75 not low        fema…
## 6    32    26 younger mom    39 full …     14     45   6.69 not low        fema…
## # ℹ 3 more variables: habit <chr>, marital <chr>, whitemom <chr>

Cleaning the data set

births1 <- births |>
  rename(father_age = fage,
         mother_age = mage) # renamed so the column names are easier to understand
births1
## # A tibble: 1,000 × 13
##    father_age mother_age mature weeks premie visits gained weight lowbirthweight
##         <dbl>      <dbl> <chr>  <dbl> <chr>   <dbl>  <dbl>  <dbl> <chr>         
##  1         34         34 young…    37 full …     14     28   6.96 not low       
##  2         36         31 young…    41 full …     12     41   8.86 not low       
##  3         37         36 matur…    37 full …     10     28   7.51 not low       
##  4         NA         16 young…    38 full …     NA     29   6.19 not low       
##  5         32         31 young…    36 premie     12     48   6.75 not low       
##  6         32         26 young…    39 full …     14     45   6.69 not low       
##  7         37         36 matur…    36 premie     10     20   6.13 not low       
##  8         29         24 young…    40 full …     13     65   6.74 not low       
##  9         30         32 young…    39 full …     15     25   8.94 not low       
## 10         29         26 young…    39 full …     11     22   9.12 not low       
## # ℹ 990 more rows
## # ℹ 4 more variables: sex <chr>, habit <chr>, marital <chr>, whitemom <chr>

Checking for NAs

colSums(is.na(births1))
##     father_age     mother_age         mature          weeks         premie 
##            114              0              0              0              0 
##         visits         gained         weight lowbirthweight            sex 
##             56             42              0              0              0 
##          habit        marital       whitemom 
##             19              0              0

Handling NAs

births_clean <- births1 |>
  filter(!is.na(habit))
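
As a small illustration of what this filtering step does, the sketch below applies the same pattern to a made-up four-row data frame (the values are hypothetical, not rows from births14). It also shows tidyr's drop_na() as an equivalent way to drop rows with a missing habit; that alternative is not part of the original analysis.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data for illustration only (not taken from births14)
toy <- data.frame(
  weight = c(7.1, 6.4, 8.0, 5.9),
  habit  = c("nonsmoker", NA, "smoker", "nonsmoker")
)

# Same pattern as above: keep only rows where habit is not missing
kept_filter <- toy |>
  filter(!is.na(habit))

# Alternative: drop_na() removes rows with NA in the named column
kept_drop_na <- toy |>
  drop_na(habit)

# Number of rows removed by the filter (one row has a missing habit)
rows_removed <- nrow(toy) - nrow(kept_filter)
rows_removed
```

Comparing the row count before and after the filter is a quick way to confirm that only the rows with a missing habit were dropped.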

Summary for Smokers vs Non-Smokers

summary_birth <- births_clean |>
  group_by(habit) |>
  summarize(
    Count = n(),
    mean_weight = mean(weight),
    sd_weight = sd(weight)) |>
  arrange(desc(Count))
summary_birth
## # A tibble: 2 × 4
##   habit     Count mean_weight sd_weight
##   <chr>     <int>       <dbl>     <dbl>
## 1 nonsmoker   867        7.27      1.23
## 2 smoker      114        6.68      1.60

Visualization

ggplot(births_clean, aes(x = habit, y = weight, fill = habit)) +
  geom_boxplot() +
  labs(title = "Baby Birth Weight by Mother's Smoking Habit",
       x = "Smoking Status",
       y = "Birth Weight (lb)") +
  theme_minimal()

Statistical Analysis

\(H_0\): \(\mu_1 = \mu_2\)

\(H_a\): \(\mu_1 < \mu_2\)

Where,

\(\mu_1\) = The mean birth weight of babies born to smoking mothers

\(\mu_2\) = The mean birth weight of babies born to non-smoking mothers

t.test(births_clean$weight[births_clean$habit == "smoker"],
       births_clean$weight[births_clean$habit == "nonsmoker"],
       alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  births_clean$weight[births_clean$habit == "smoker"] and births_clean$weight[births_clean$habit == "nonsmoker"]
## t = -3.8166, df = 131.31, p-value = 0.0001038
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##        -Inf -0.3354351
## sample estimates:
## mean of x mean of y 
##  6.677193  7.269873
  1. p-value = 0.0001038. This is statistically significant at α = 0.05. There is strong evidence that the mean birth weight of babies born to smoking mothers is less than the mean birth weight of babies born to non-smoking mothers.

  2. One-sided 95% CI for the difference (smoker minus nonsmoker) = (-Inf, -0.335). The interval lies entirely below 0, which is consistent with the mean birth weight of babies born to smoking mothers being less than that of babies born to non-smoking mothers.

  3. Therefore, we reject the null hypothesis.
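
As a rough check on the test output, the Welch t statistic and its degrees of freedom can be recomputed from the group summaries reported earlier. This is only a sketch: the means are copied from the t.test() output and the standard deviations from the summary table, where they are rounded to two decimals, so the results will differ slightly from what t.test() reports. The variable names are my own.

```r
# Values copied from the summary table and the t.test() output.
# The standard deviations are rounded, so results are approximate.
n_smoker <- 114
mean_smoker <- 6.677193
sd_smoker <- 1.60

n_non <- 867
mean_non <- 7.269873
sd_non <- 1.23

# Squared standard error contributed by each group
v_smoker <- sd_smoker^2 / n_smoker
v_non <- sd_non^2 / n_non

# Welch t statistic: difference in means over the combined standard error
t_stat <- (mean_smoker - mean_non) / sqrt(v_smoker + v_non)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (v_smoker + v_non)^2 /
  (v_smoker^2 / (n_smoker - 1) + v_non^2 / (n_non - 1))

# One-sided (lower-tail) p-value, matching alternative = "less"
p_value <- pt(t_stat, df = df_welch)

c(t = t_stat, df = df_welch, p = p_value)
```

With the rounded standard deviations, this gives a t statistic of roughly -3.81 and about 131 degrees of freedom, close to the -3.8166 and 131.31 reported by t.test(). The small gap comes from the rounding.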

Conclusion

Overall, my results showed a clear pattern: babies born to mothers who smoke weigh less on average than babies born to non-smoking mothers. After cleaning the data and running the two-sample t-test, the p-value was 0.0001038, which is much lower than the 0.05 significance level. I therefore rejected the null hypothesis; there is strong evidence that smoking during pregnancy is linked to lower birth weight. The upper bound of the one-sided 95% confidence interval for the difference (-0.335) was also below zero, which supports the same conclusion. Looking at the sample means, babies of smoking mothers weighed about 0.6 pounds less (6.68 lb versus 7.27 lb), which is a noticeable difference.
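
To put a lower and an upper bound on the size of the difference, one option is a two-sided Welch test, whose confidence interval is finite on both ends. The sketch below uses simulated data (hypothetical values generated with rnorm(), not births14) and the formula interface of t.test(). With the formula interface the groups are ordered alphabetically, so the interval is for the nonsmoker mean minus the smoker mean.

```r
# Simulated, hypothetical data for illustration only (not births14)
set.seed(101)
toy <- data.frame(
  habit  = rep(c("nonsmoker", "smoker"), times = c(80, 20)),
  weight = c(rnorm(80, mean = 7.3, sd = 1.2),
             rnorm(20, mean = 6.7, sd = 1.6))
)

# Two-sided Welch t-test (the default alternative is "two.sided")
two_sided <- t.test(weight ~ habit, data = toy)

# Group means, in alphabetical order: nonsmoker, then smoker
two_sided$estimate

# 95% confidence interval for mean(nonsmoker) - mean(smoker)
two_sided$conf.int
```

Running the same call on births_clean instead of toy would give a two-sided interval for the real data.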

These results matter because birth weight is an important part of a newborn’s health, and even small differences can be meaningful. For future directions, there is a lot more this data set could tell us. I only focused on smoking and birth weight, but other variables, such as prenatal visits, mother’s age, weight gain, and racial background, could also play a role. It would be interesting to explore whether any of those factors change or strengthen the relationship between smoking and birth weight.

References

https://bookdown.org/rwnahhas/IntroToR/rename.html (I used this as a reminder of how to rename variables.)