Do babies born to mothers who smoke have a lower average birth weight than those born to non-smoking mothers?
The data set I am using is titled “births14” and comes from OpenIntro (https://www.openintro.org/data/index.php?data=births14). It contains births recorded in the US in 2014 and has been of interest to medical researchers who study the relation between the habits and practices of expectant mothers and the birth of their children. It has 1000 observations and 13 variables.
For this project, I focused on two columns from the data set: weight and habit.
The weight column shows the baby’s birth weight (in pounds) and is a quantitative variable. The habit column shows the mother’s smoking status (smoker or nonsmoker) and is a categorical variable.
United States Department of Health and Human Services. Centers for Disease Control and Prevention. National Center for Health Statistics. Natality Detail File, 2014 United States. Inter-university Consortium for Political and Social Research, 2016-10-07
Before running the hypothesis test, I cleaned and explored the data set to make sure I was working with accurate information. First, I renamed two variables (father’s age and mother’s age) to make the data set easier to read. When I checked for missing values, I found that the habit column had 19 NAs, while the weight column had none. Since both variables are required for my comparison, I removed the rows with missing smoking status using the filter function, which left 981 observations. After cleaning, I used group_by and summarize to calculate summary statistics for the two groups: the number of observations, the mean birth weight, and the standard deviation for smokers and non-smokers. For visualization, I created a box plot to compare the distribution of birth weights between smoking and non-smoking mothers. I chose a box plot because it clearly shows differences in medians and spread.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/thilo/OneDrive/Desktop/DATA 101")
births <- read_csv("births14.csv")
str(births)
## spc_tbl_ [1,000 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ fage : num [1:1000] 34 36 37 NA 32 32 37 29 30 29 ...
## $ mage : num [1:1000] 34 31 36 16 31 26 36 24 32 26 ...
## $ mature : chr [1:1000] "younger mom" "younger mom" "mature mom" "younger mom" ...
## $ weeks : num [1:1000] 37 41 37 38 36 39 36 40 39 39 ...
## $ premie : chr [1:1000] "full term" "full term" "full term" "full term" ...
## $ visits : num [1:1000] 14 12 10 NA 12 14 10 13 15 11 ...
## $ gained : num [1:1000] 28 41 28 29 48 45 20 65 25 22 ...
## $ weight : num [1:1000] 6.96 8.86 7.51 6.19 6.75 6.69 6.13 6.74 8.94 9.12 ...
## $ lowbirthweight: chr [1:1000] "not low" "not low" "not low" "not low" ...
## $ sex : chr [1:1000] "male" "female" "female" "male" ...
## $ habit : chr [1:1000] "nonsmoker" "nonsmoker" "nonsmoker" "nonsmoker" ...
## $ marital : chr [1:1000] "married" "married" "married" "not married" ...
## $ whitemom : chr [1:1000] "white" "white" "not white" "white" ...
## - attr(*, "spec")=
## .. cols(
## .. fage = col_double(),
## .. mage = col_double(),
## .. mature = col_character(),
## .. weeks = col_double(),
## .. premie = col_character(),
## .. visits = col_double(),
## .. gained = col_double(),
## .. weight = col_double(),
## .. lowbirthweight = col_character(),
## .. sex = col_character(),
## .. habit = col_character(),
## .. marital = col_character(),
## .. whitemom = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
head(births)
## # A tibble: 6 × 13
## fage mage mature weeks premie visits gained weight lowbirthweight sex
## <dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 34 34 younger mom 37 full … 14 28 6.96 not low male
## 2 36 31 younger mom 41 full … 12 41 8.86 not low fema…
## 3 37 36 mature mom 37 full … 10 28 7.51 not low fema…
## 4 NA 16 younger mom 38 full … NA 29 6.19 not low male
## 5 32 31 younger mom 36 premie 12 48 6.75 not low fema…
## 6 32 26 younger mom 39 full … 14 45 6.69 not low fema…
## # ℹ 3 more variables: habit <chr>, marital <chr>, whitemom <chr>
births1 <- births |>
  rename(father_age = fage,
         mother_age = mage)  # renamed only to make the column names easier to read
births1
## # A tibble: 1,000 × 13
## father_age mother_age mature weeks premie visits gained weight lowbirthweight
## <dbl> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 34 34 young… 37 full … 14 28 6.96 not low
## 2 36 31 young… 41 full … 12 41 8.86 not low
## 3 37 36 matur… 37 full … 10 28 7.51 not low
## 4 NA 16 young… 38 full … NA 29 6.19 not low
## 5 32 31 young… 36 premie 12 48 6.75 not low
## 6 32 26 young… 39 full … 14 45 6.69 not low
## 7 37 36 matur… 36 premie 10 20 6.13 not low
## 8 29 24 young… 40 full … 13 65 6.74 not low
## 9 30 32 young… 39 full … 15 25 8.94 not low
## 10 29 26 young… 39 full … 11 22 9.12 not low
## # ℹ 990 more rows
## # ℹ 4 more variables: sex <chr>, habit <chr>, marital <chr>, whitemom <chr>
colSums(is.na(births1))
## father_age mother_age mature weeks premie
## 114 0 0 0 0
## visits gained weight lowbirthweight sex
## 56 42 0 0 0
## habit marital whitemom
## 19 0 0
births_clean <- births1 |>
  filter(!is.na(habit))
summary_birth <- births_clean |>
  group_by(habit) |>
  summarize(
    Count = n(),
    mean_weight = mean(weight),
    sd_weight = sd(weight)) |>
  arrange(desc(Count))
summary_birth
## # A tibble: 2 × 4
## habit Count mean_weight sd_weight
## <chr> <int> <dbl> <dbl>
## 1 nonsmoker 867 7.27 1.23
## 2 smoker 114 6.68 1.60
ggplot(births_clean, aes(x = habit, y = weight, fill = habit)) +
  geom_boxplot() +
  labs(title = "Baby Birth Weight by Mother's Smoking Habit",
       x = "Smoking Status",
       y = "Birth Weight (lb)") +
  theme_minimal()
\(H_0: \mu_1 = \mu_2\)
\(H_a: \mu_1 < \mu_2\)
where
\(\mu_1\) = the mean birth weight of babies born to smoking mothers
\(\mu_2\) = the mean birth weight of babies born to non-smoking mothers
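For reference, these hypotheses are tested here with a Welch two-sample t-test, which is what t.test runs when var.equal is left at its default of FALSE. The formulas below are the standard textbook form, written with notation I am assuming for this report: \(\bar{x}_1, s_1, n_1\) are the sample mean, standard deviation, and size for smokers, and \(\bar{x}_2, s_2, n_2\) are the same quantities for non-smokers.

\[
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\text{df} \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
\]

Because \(H_a\) is \(\mu_1 < \mu_2\), the p-value is the area to the left of \(t\) under a t distribution with these degrees of freedom.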
t.test(births_clean$weight[births_clean$habit=="smoker"],
births_clean$weight[births_clean$habit == "nonsmoker"],alternative = "less")
##
## Welch Two Sample t-test
##
## data: births_clean$weight[births_clean$habit == "smoker"] and births_clean$weight[births_clean$habit == "nonsmoker"]
## t = -3.8166, df = 131.31, p-value = 0.0001038
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -0.3354351
## sample estimates:
## mean of x mean of y
## 6.677193 7.269873
The p-value is 0.0001038, which is statistically significant at α = 0.05. There is strong evidence that the mean birth weight of babies born to smoking mothers is less than the mean birth weight of babies born to non-smoking mothers.
The one-sided 95% CI for the difference in means (smoker minus nonsmoker) is (-Inf, -0.335). The interval lies entirely below 0, which is consistent with the mean birth weight of babies born to smoking mothers being lower than that of babies born to non-smoking mothers.
Therefore, we reject the null hypothesis.
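As a rough check on the t-test output, the test statistic, degrees of freedom, p-value, and upper confidence bound can be recomputed by hand from the summary statistics. This is only a sketch. It uses the group sizes and standard deviations from the summary table and the means from the t-test output, and since the standard deviations are rounded to two decimals, the results will match the reported values only approximately. The variable names are my own.

```r
# Summary statistics copied from the grouped summary and the t-test output.
# The standard deviations are rounded, so the results are approximate.
n_s <- 114; mean_s <- 6.677193; sd_s <- 1.60   # smokers
n_n <- 867; mean_n <- 7.269873; sd_n <- 1.23   # nonsmokers

# Variance of each group's sample mean
v_s <- sd_s^2 / n_s
v_n <- sd_n^2 / n_n

se     <- sqrt(v_s + v_n)             # standard error of the difference in means
t_stat <- (mean_s - mean_n) / se      # Welch t statistic

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (v_s + v_n)^2 / (v_s^2 / (n_s - 1) + v_n^2 / (n_n - 1))

p_value <- pt(t_stat, df_welch)       # lower-tail p-value for H_a: mu_1 < mu_2
upper   <- (mean_s - mean_n) + qt(0.95, df_welch) * se  # upper bound, one-sided 95% CI

round(c(t = t_stat, df = df_welch, p = p_value, upper = upper), 4)
```

The values should land close to the reported t = -3.8166, df = 131.31, and upper bound -0.3354; any small gap comes from the rounded standard deviations.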
Overall, my results showed a clear pattern: babies born to mothers who smoke weigh less on average than babies born to non-smoking mothers. After cleaning the data and running the Welch two-sample t-test, the p-value came out to 0.0001038, which is much lower than the 0.05 significance level. This means I rejected the null hypothesis, and there is strong evidence that smoking during pregnancy is linked to lower birth weight. The confidence interval was also entirely below zero, which supports the same conclusion. Looking at the actual averages, babies of smoking mothers weighed about 0.6 pounds less (6.68 lb versus 7.27 lb), which is a noticeable difference.
These results matter because birth weight is an important part of a newborn’s health, and even a small difference can be meaningful. For future directions, there is a lot more this data set could tell us. I only focused on smoking and baby weight, but there are other variables, like prenatal visits, mother’s age, weight gain, and even racial background, that could also play a role. It would be interesting to explore whether any of those factors change or strengthen the relationship between smoking and birth weight.
https://bookdown.org/rwnahhas/IntroToR/rename.html (I used this as a reference for how to rename variables.)