Data101 Project 2

A- Introduction (1-2 paragraphs):

My research question:

Is maternal smoking associated with lower birth-weight?

My Data-set:

The data-set I will be using is called “babies”. It contains 1236 observations and 8 variables. Each observation records a case number, a baby’s birth-weight, the length of gestation, parity, the mother’s age, height, and weight, and whether the mother smoked. I will answer my question by comparing the babies of mothers who smoked with the babies of mothers who did not, and checking whether smoking status is associated with the child’s birth-weight.

Why am I looking into this? I want to keep challenging myself to understand topics I don’t know much about, and since I am taking a health class, smoking during pregnancy seemed like a good topic to choose. Smoking while pregnant is already a known risk factor for infant health: it has been linked to higher infant mortality, developmental delays, a weaker immune system, and long-term health issues, which is why doctors recommend not smoking while pregnant. Because smoking during pregnancy is linked to developmental delays, I predict that my visualizations will show a clear association between maternal smoking and lower birth-weight.

Here are some facts to help readers interpret the birth-weights, which the data-set records in ounces:

Here’s the conversion:

  • 1 pound = 16 ounces

  • 7.5 lbs × 16 oz per lb = 120 ounces

A healthy baby’s birth-weight typically falls in this range:

  • 5.5 lbs to 8.8 lbs

  • 88 oz to 141 oz
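
The conversions above can be double-checked with a few lines of R. This is only a small sketch: the helper name `lb_to_oz` is my own invention for illustration and does not come from the data-set or from any package.

```r
# Hypothetical helper (not from any package): convert pounds to ounces.
lb_to_oz <- function(lb) lb * 16

# Example from the text: a 7.5 lb baby
example_oz <- lb_to_oz(7.5)

# Healthy range quoted above (5.5 lb to 8.8 lb), expressed in ounces
healthy_range_oz <- lb_to_oz(c(5.5, 8.8))

print(example_oz)               # 120
print(round(healthy_range_oz))  # 88 141
```

Rounding matters for the upper bound, since 8.8 lb is 140.8 oz before rounding.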

Data-set link: https://www.openintro.org/data/index.php?data=babies

B- Data Analysis (1 paragraph and 3-5 chunks of code): In your paragraph, describe the type of data analysis you will perform and the types of plots you will generate to address your research question.

For my data analysis, I will first select the two variables I need: bwt (the baby’s birth-weight in ounces) and smoke (whether the mother smoked during pregnancy). Next, I will use filter to remove rows with NA’s; the summary output shows 10 NA’s in smoke and none in bwt, but I filter on both to be safe. Then I will use mutate to turn the 0’s and 1’s in smoke into “Non-smoker” and “Smoker”. After that, I will use group_by and summarize to build a table of the mean birth-weight for babies of smoking and non-smoking mothers. Finally, I will make a box plot of birth-weight by smoking status to compare the median and spread of the two groups and see whether birth-weight is associated with smoking.

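library(tidyverse)  # core tidyverse packages, including ggplot2 and dplyr
library(ggplot2)    # already attached by tidyverse; listed to be explicit
library(dplyr)      # already attached by tidyverse; listed to be explicit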

# Set the working directory (this path is specific to my computer)
setwd("C:/Users/Joanne G/OneDrive/Data101(Fall 2025)/Datasets")

# Read babies.csv into a data frame
babies_df <- read.csv("babies.csv")

Clean the data-set and conduct exploratory data analysis (EDA) to better understand the data (2 functions minimum)

# EDA Data-set Chunk

#dimensions
dim(babies_df)
## [1] 1236    8
#head
head(babies_df)
##   case bwt gestation parity age height weight smoke
## 1    1 120       284      0  27     62    100     0
## 2    2 113       282      0  33     64    135     0
## 3    3 128       279      0  28     64    115     1
## 4    4 123        NA      0  36     69    190     0
## 5    5 108       282      0  23     67    125     1
## 6    6 136       286      0  25     62     93     0
summary(babies_df)
##       case             bwt          gestation         parity      
##  Min.   :   1.0   Min.   : 55.0   Min.   :148.0   Min.   :0.0000  
##  1st Qu.: 309.8   1st Qu.:108.8   1st Qu.:272.0   1st Qu.:0.0000  
##  Median : 618.5   Median :120.0   Median :280.0   Median :0.0000  
##  Mean   : 618.5   Mean   :119.6   Mean   :279.3   Mean   :0.2549  
##  3rd Qu.: 927.2   3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000  
##  Max.   :1236.0   Max.   :176.0   Max.   :353.0   Max.   :1.0000  
##                                   NA's   :13                      
##       age            height          weight          smoke       
##  Min.   :15.00   Min.   :53.00   Min.   : 87.0   Min.   :0.0000  
##  1st Qu.:23.00   1st Qu.:62.00   1st Qu.:114.8   1st Qu.:0.0000  
##  Median :26.00   Median :64.00   Median :125.0   Median :0.0000  
##  Mean   :27.26   Mean   :64.05   Mean   :128.6   Mean   :0.3948  
##  3rd Qu.:31.00   3rd Qu.:66.00   3rd Qu.:139.0   3rd Qu.:1.0000  
##  Max.   :45.00   Max.   :72.00   Max.   :250.0   Max.   :1.0000  
##  NA's   :2       NA's   :22      NA's   :36      NA's   :10
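
The NA counts in the summary can also be pulled out directly, one number per column. Below is a minimal sketch on a made-up data frame (`toy_df` and its values are invented for illustration, not taken from babies.csv); the same two calls should work on `babies_df`.

```r
# Toy data frame standing in for babies_df (values are made up for illustration)
toy_df <- data.frame(
  bwt   = c(120, 113, 128, 123, 108),
  smoke = c(0, 0, 1, NA, 1),
  age   = c(27, 33, NA, 36, NA)
)

# is.na() marks each missing cell; colSums() adds up the TRUE values per column
na_counts <- colSums(is.na(toy_df))
print(na_counts)

# Number of rows that would survive filtering on bwt and smoke
complete_rows <- sum(!is.na(toy_df$bwt) & !is.na(toy_df$smoke))
print(complete_rows)
```

Checking this before filtering shows how many observations the filter step will drop and from which variable.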

Use a minimum of three dplyr functions (filter, select, mutate, summary, mean, max, etc.,) to manipulate the data-set and prepare it for analysis.

# Use select to keep the two variables I need, filter to remove NA's in
# bwt (birth-weight) and smoke, and mutate to label the smoke codes
filtered_babies_df <- babies_df |>
  select(bwt, smoke) |>
  filter(!is.na(bwt), !is.na(smoke)) |>
  mutate(smoke = factor(smoke, levels = c(0, 1), labels = c("Non-smoker", "Smoker")))
# Summary table: mean birth-weight (in ounces) for babies of non-smoking vs smoking mothers
filtered_babies_df |>
  group_by(smoke) |>
  summarize(mean_bwt = mean(bwt))
## # A tibble: 2 × 2
##   smoke      mean_bwt
##   <fct>         <dbl>
## 1 Non-smoker     123.
## 2 Smoker         114.
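
A t-test depends on the group sizes and spreads as well as the means, so it can help to put those in the same table. Here is a hedged sketch on made-up numbers (`toy_df` is invented for illustration). I wrote it in base R with `tapply` rather than dplyr so that it runs without extra packages; it is one possible way to extend the summary, not the method used above.

```r
# Toy data standing in for filtered_babies_df (values are made up)
toy_df <- data.frame(
  bwt   = c(120, 124, 128, 110, 114, 118),
  smoke = factor(c(0, 0, 0, 1, 1, 1), levels = c(0, 1),
                 labels = c("Non-smoker", "Smoker"))
)

# One row per smoking group: sample size, mean, and standard deviation of bwt
group_summary <- data.frame(
  smoke    = levels(toy_df$smoke),
  n        = as.vector(table(toy_df$smoke)),
  mean_bwt = as.vector(tapply(toy_df$bwt, toy_df$smoke, mean)),
  sd_bwt   = as.vector(tapply(toy_df$bwt, toy_df$smoke, sd))
)
print(group_summary)
```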

Create visualizations (e.g., histograms, Box plots, etc.) to visualize the data’s distribution and relationships. Use codes we covered in this class or code you learned in previous courses.

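# smoke is already a factor, so it can be mapped directly to x and to fill
ggplot(filtered_babies_df, aes(x = smoke, y = bwt)) +
  geom_boxplot(aes(fill = smoke), show.legend = FALSE) +
  scale_fill_manual(values = c("Non-smoker" = "palegreen2", "Smoker" = "chocolate1")) +
  labs(
    x = "Smoking Status",
    y = "Birthweight (ounces)",
    title = "Birthweight by Maternal Smoking Status"
  ) +
  theme_minimal()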

This is an extra visualization that overlays each baby’s birth-weight as a point, to show the overall trend in the individual observations.

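ggplot(filtered_babies_df, aes(x = smoke, y = bwt)) +
  # outlier.shape = NA hides the box plot's own outlier points so the jitter layer does not draw them twice
  geom_boxplot(aes(fill = smoke), outlier.shape = NA, show.legend = FALSE) +
  scale_fill_manual(values = c("Non-smoker" = "lightblue", "Smoker" = "salmon1")) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.3) +   # adds individual points, jittered horizontally only
  labs(
    x = "Smoking Status",
    y = "Birthweight (ounces)",
    title = "Birthweight by Maternal Smoking Status"
  ) +
  theme_minimal()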
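
To make clear what a box plot draws, here is a small sketch using base R's `boxplot.stats` on made-up weights (`bwt_toy` is invented for illustration). It returns the whisker ends, the box edges, the median, and the points flagged as outliers by the 1.5 × IQR rule. Note that ggplot2 computes the box edges with quantiles rather than hinges, so on real data its numbers can differ slightly from these.

```r
# Made-up birth-weights in ounces, for illustration only
bwt_toy <- c(60, 100, 108, 112, 116, 120, 124, 128, 132, 140, 180)

# Default coef = 1.5: points more than 1.5 box-lengths beyond the box are outliers
bp <- boxplot.stats(bwt_toy)

print(bp$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$out)    # values that would be drawn as individual outlier points
```

The middle value of `bp$stats` is the median, which is the line a box plot shows inside the box; the mean is not drawn.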

C- Statistical Analysis (1 paragraph and 1-3 chunks of code):

State your hypothesis clearly, use the correct notation, and type them properly. Perform the appropriate test (e.g., t-test, chi-square test, etc.)

  • Hypothesis:

    \(\mu_0\) = mean birth-weight (in ounces) of babies whose mothers did not smoke

    \(\mu_1\) = mean birth-weight (in ounces) of babies whose mothers smoked

    \(H_0: \mu_0 = \mu_1\) (the mean birth-weight is the same for babies of smoking and non-smoking mothers; there is NO difference)

    \(H_a: \mu_0 > \mu_1\) (the mean birth-weight is lower for babies of mothers who smoke)

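  • I will be doing an independent two-sample t-test. By default, R’s t.test runs Welch’s version, which does not assume the two groups have equal variances: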

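# One-sided test: "greater" tests whether the Non-smoker mean (first factor level) exceeds the Smoker mean
t.test(bwt ~ smoke, data = filtered_babies_df, alternative = "greater")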
## 
##  Welch Two Sample t-test
## 
## data:  bwt by smoke
## t = 8.5813, df = 1003.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Non-smoker and group Smoker is greater than 0
## 95 percent confidence interval:
##  7.222928      Inf
## sample estimates:
## mean in group Non-smoker     mean in group Smoker 
##                 123.0472                 114.1095
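
To see where the t, df, and p-value in this output come from, the Welch test can be worked by hand. The sketch below uses two short made-up samples (`non_smoker` and `smoker` are invented for illustration, not the real data) and cross-checks the hand calculation against `t.test`.

```r
# Made-up birth-weights in ounces, for illustration only
non_smoker <- c(120, 124, 128, 131, 117, 126)
smoker     <- c(110, 114, 118, 109, 121)

# Squared standard error of each group mean: variance divided by sample size
se2_non <- var(non_smoker) / length(non_smoker)
se2_smk <- var(smoker) / length(smoker)

# Welch's t statistic: difference in means over its estimated standard error
t_stat <- (mean(non_smoker) - mean(smoker)) / sqrt(se2_non + se2_smk)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (se2_non + se2_smk)^2 /
  (se2_non^2 / (length(non_smoker) - 1) + se2_smk^2 / (length(smoker) - 1))

# One-sided p-value for H_a: mu_0 > mu_1 (area in the upper tail)
p_value <- pt(t_stat, df = df_welch, lower.tail = FALSE)

# Cross-check against R's built-in Welch test
built_in <- t.test(non_smoker, smoker, alternative = "greater")
print(c(by_hand = t_stat, built_in = unname(built_in$statistic)))
print(c(by_hand = p_value, built_in = built_in$p.value))
```

The degrees of freedom are usually not a whole number, which is why the output above reports df = 1003.2.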

Interpret the results, including the p-value, alpha, and other relevant statistics, and discuss their significance. You need to include statements for the null and the alternative hypothesis.

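The null hypothesis says that the mean birth-weight is the same for babies of smoking and non-smoking mothers, and the alternative says that the mean is lower for babies of mothers who smoke. In the sample, babies of non-smoking mothers have a mean birth-weight of 123.05 ounces, while babies of smoking mothers have a mean of 114.11 ounces, a difference of about 8.94 ounces. The Welch t-test gives t = 8.58 with about 1003 degrees of freedom and a p-value below 2.2 × 10^-16, which is far below a significance level of α = 0.05 (the conventional choice, matching the 95 percent confidence level in the output). I therefore reject the null hypothesis: there is strong evidence that babies of mothers who smoke have a lower mean birth-weight. The one-sided 95 percent confidence interval says the true difference in means is at least about 7.22 ounces. Because this is observational data, the test shows an association and does not by itself prove that smoking causes the lower birth-weight.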

D- Conclusion and Future Directions (1-2 paragraphs): Summarize the key findings of your analysis, discuss the implications of your results and their relevance to the research question, and suggest potential avenues for future research or further analysis.

To conclude, the answer to my research question, whether maternal smoking is associated with lower birth-weight, is YES. From my research, I learned that a healthy birth-weight falls between 88 oz and 141 oz. Both group means fall inside that range, but the mean for babies of smoking mothers (about 114 oz) is roughly 9 oz lower than the mean for babies of non-smoking mothers (about 123 oz). Looking at the visualizations, the birth-weights for babies of smoking mothers sit lower on the graph, with some outliers. The t-test gave strong evidence that the mean birth-weight is lower for babies of mothers who smoked during pregnancy, so the data support my hypothesis.

I had predicted this association because I learned in my health class that babies born to mothers who smoke are more likely to have health issues, but being able to visualize it for people to see and understand was a great challenge for me. Moving forward to project 3 and the final project, I hope to keep using my data analysis skills to build visualizations on topics I don’t know much about, as I’ve been challenging myself to dive deeper into subjects I am less knowledgeable in.