Problem 1: Use the color picker app from the colorspace package (colorspace::choose_color()) to create a qualitative color scale containing five colors. You will use these colors in the following questions.

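library(tidyverse)   # read_csv(), mutate(), ggplot()
library(colorspace)  # choose_color(), swatchplot()

# replace "#FFFFFF" with your own colors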
colors <- c("#ffbee7", "#9261e8", "#c52222", "#7cffae", "#407bdb")

swatchplot(colors)

Problem 2: We will be using this dataset

NCbirths <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")
## Rows: 1409 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (10): Plural, Sex, MomAge, Weeks, Gained, Smoke, BirthWeightGm, Low, Pre...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(NCbirths)
## # A tibble: 6 × 10
##   Plural   Sex MomAge Weeks Gained Smoke BirthWeightGm   Low Premie Marital
##    <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>         <dbl> <dbl>  <dbl>   <dbl>
## 1      1     1     32    40     38     0         3147.     0      0       0
## 2      1     2     32    37     34     0         3289.     0      0       0
## 3      1     1     27    39     12     0         3912.     0      0       0
## 4      1     1     27    39     15     0         3856.     0      0       0
## 5      1     1     25    39     32     0         3430.     0      0       0
## 6      1     1     28    43     32     0         3317.     0      0       0

Question: Is there a relationship between whether a mother smokes or not and her baby’s weight at birth?

To answer this question, you will plot the distribution of birth weight by smoking status, and you will also plot the number of mothers who are smokers and non-smokers, respectively.

Use a boxplot for the first part of the question and a faceted bar plot for the second part of the question.

Hints:

`scale_x_discrete(
    name = "Mother",
    labels = c("non-smoker", "smoker"))
geom_bar(
    position = position_stack(reverse = TRUE))

Introduction: The dataset used is NCbirths, which contains North Carolina birth records from 2001. The dataset is described as containing 1,450 birth records, which were selected by statistician John Holcomb from the North Carolina State Center for Health and Environmental Statistics, although the file loaded above has 1,409 rows. The question I will be answering is “Is there a relationship between whether a mother smokes or not and her baby’s weight at birth?” In order to answer this question, I will be analyzing three variables in the NCbirths dataset:

- Smoke: A binary numerical variable that indicates whether the mother smoked during pregnancy. 1 = smoker and 0 = non-smoker.
- BirthWeightGm: A numerical variable that represents the babies’ birth weights in grams.
- Plural: A numerical variable indicating whether the birth was a singleton (1), twin (2), or triplet (3).

Approach: I will create the following graphs in order to find if there is a relationship between whether a mother smokes or not and her baby’s weight at birth:

  1. A boxplot to compare the distribution of babies’ birth weights for smoking versus non-smoking mothers.

  2. A faceted stacked bar plot to display the number of births in different weight categories, faceted by smoking status. This helps us understand how birth weight is distributed among different groups.

  3. A faceted side-by-side bar plot to show the same distribution as plot 2, but with the bars side-by-side instead of stacked so that the relative numbers of singletons, twins, and triplets are easier to compare.

In order to create the required plots, I will use as.factor() to turn the variables “Smoke” and “Plural” into categorical variables. I will also use mutate() to create a new column called “BirthWeightCat” (grouping birth weights into ranges) so that I can display birth weight in my bar plots without clutter.
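The two preparation steps described above can be sketched on a few made-up values. The vectors `smoke` and `weight_gm` are illustrative and are not rows from NCbirths. Base R's `cut()` is used here only as one possible way to build ordered weight ranges.

```r
# Illustrative values only; not rows from the NCbirths dataset
smoke     <- c(0, 1, 0, 1, 0)
weight_gm <- c(2400, 2500, 3200, 3999, 4100)

# Turn the 0/1 indicator into a labeled categorical variable
smoke_f <- factor(smoke, levels = c(0, 1), labels = c("non-smoker", "smoker"))

# Group weights into ranges; right = FALSE makes each interval [lower, upper)
weight_cat <- cut(
  weight_gm,
  breaks = c(-Inf, 2500, 3000, 3500, 4000, Inf),
  labels = c("<2.5k", "2.5k-2999", "3k-3499", "3.5k-3999", "4k+"),
  right  = FALSE
)

# Cross-tabulate smoking status against weight range
table(smoke_f, weight_cat)
```

Because `cut()` returns a factor whose levels follow the order of `breaks`, the categories keep their lightest-to-heaviest order when plotted.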

Analysis:

plot1 <- NCbirths |>
  ggplot(aes(as.factor(Smoke), BirthWeightGm, fill = as.factor(Smoke))) + 
  labs(y = "Birth Weight (Grams)", 
       title = "Distribution of Birth Weight by Smoking Status",
       caption = "North Carolina State Center for Health and Environmental Statistics") +
  geom_boxplot() +
  scale_x_discrete(
    name = "Mother",
    labels = c("non-smoker", "smoker") #Change labels so smoking status doesn't show up as "0" and "1"
    ) +
  scale_fill_manual(name = "Mother", labels = c("non-smoker", "smoker"), values = c("#ffbee7", "#9261e8")) #Change labels so smoking status doesn't show up as "0" and "1" and color code
 
plot1

#Create a new variable that groups birth weight into ranges so that birth weight becomes a categorical variable
NCbirths <- NCbirths |>
  mutate(
    BirthWeightCat = ifelse(BirthWeightGm < 2500, "<2.5k",
                     ifelse(BirthWeightGm < 3000, "2.5k-2999",
                     ifelse(BirthWeightGm < 3500, "3k-3499",
                     ifelse(BirthWeightGm < 4000, "3.5k-3999", "4k+")))),
    #Fix the level order so the x-axis runs from lightest to heaviest instead of alphabetically
    BirthWeightCat = factor(BirthWeightCat, levels = c("<2.5k", "2.5k-2999", "3k-3499", "3.5k-3999", "4k+"))
  )

plot2 <- NCbirths |>
  ggplot(aes(x = BirthWeightCat, fill = as.factor(Plural))) + 
  labs(x = "Birth Weight in Grams",
       y = "Count",
       fill = "Plurality",
       title = "Non-Smoker vs. Smoker Mothers' Births by Weight and Plurality",
       caption = "North Carolina State Center for Health and Environmental Statistics") +
  geom_bar(
    position = position_stack(reverse = TRUE)) + #Stacking from 1-3
  facet_wrap(vars(Smoke), labeller = as_labeller(c("0" = "non-smoker", "1" = "smoker"))) + #Facet by Smoke
  scale_fill_manual(labels = c("Single", "Twins", "Triplets"), values = c("#c52222", "#7cffae", "#407bdb")) # Color code and label "Plural" variable

plot2

plot3 <- NCbirths |>
  ggplot(aes(x = BirthWeightCat, fill = as.factor(Plural))) +
  labs(x = "Birth Weight in Grams",
       y = "Count",
       fill = "Plurality",
       title = "Non-Smoker vs. Smoker Mothers' Births by Weight and Plurality (Side-By-Side)",
       caption = "North Carolina State Center for Health and Environmental Statistics") +
  geom_bar(
    position = position_dodge(preserve = "single")) + #Side by side instead of stacked
  facet_wrap(vars(Smoke), labeller = as_labeller(c("0" = "non-smoker", "1" = "smoker"))) +
  scale_fill_manual(labels = c("Single", "Twins", "Triplets"), values = c("#c52222", "#7cffae", "#407bdb"))
  
plot3

`

Discussion: Interpretation of Results

The first plot, a boxplot, shows a difference in birth weight distribution between babies of smoking and non-smoking mothers. Babies of non-smoking mothers tend to have a higher median birth weight than babies of smoking mothers (the line inside each box marks the median, not the mean). This suggests that smoking during pregnancy may be associated with lower birth weight. Additionally, the boxplot shows many outliers in the birth weight distribution for non-smoking mothers; the baby with the lowest birth weight was born to a non-smoking mother, but so was the baby with the highest birth weight. This wide spread means that smoking status alone does not determine a baby’s birth weight. Still, since the typical birth weight for babies of non-smoking mothers is higher than that for babies of smoking mothers, the plot suggests some association between smoking and birth weight.
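The comparison read off the boxplot can also be checked numerically. The following is a minimal sketch, assuming a data frame with the columns Smoke and BirthWeightGm. The helper name `summarise_weight` is hypothetical, and the `toy` data frame holds made-up values, not rows from NCbirths.

```r
# Hypothetical helper: per-group summaries of birth weight.
# `data` is assumed to have the columns Smoke (0/1) and BirthWeightGm.
summarise_weight <- function(data) {
  groups <- split(data$BirthWeightGm, data$Smoke)
  data.frame(
    Smoke  = names(groups),
    n      = vapply(groups, function(w) sum(!is.na(w)), integer(1)),
    median = vapply(groups, median, numeric(1), na.rm = TRUE),
    mean   = vapply(groups, mean, numeric(1), na.rm = TRUE),
    iqr    = vapply(groups, IQR, numeric(1), na.rm = TRUE),
    row.names = NULL
  )
}

# Made-up values for illustration only; not rows from NCbirths
toy <- data.frame(
  Smoke         = c(0, 0, 0, 1, 1, 1),
  BirthWeightGm = c(3000, 3400, 3800, 2800, 3100, 3300)
)
weight_summary <- summarise_weight(toy)
weight_summary
```

With the real data, the call would be `summarise_weight(NCbirths)`. Reporting the median next to the mean shows whether the outliers visible in the boxplot pull the two apart.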

The second and third plots break birth weights into categories (“<2.5k”, “2.5k-2999”, “3k-3499”, “3.5k-3999”, “4k+”), which were created with mutate() and ifelse(). Birth plurality is also introduced as a factor that could affect the relationship between smoking and a baby’s birth weight. The two plots show that for both non-smoking and smoking mothers, the most common birth weight category is 3,000-3,499 grams, but the non-smoker bar in that category is much taller than the smoker bar. The plots also show that most of the mothers in the dataset are non-smokers, which means the sample may not represent both groups equally. Since the number of non-smoking mothers is much higher than that of smoking mothers, raw counts cannot be compared directly across the two facets, making it harder to generalize the results. Additionally, because the sample of smoking mothers is smaller, summaries of birth weight for that group are more sensitive to individual observations.
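One way to work around the unequal group sizes is to compare proportions within each smoking group instead of raw counts. The following is a minimal sketch, assuming a data frame with the columns Smoke and BirthWeightCat. The helper name `weight_share_by_smoke` is hypothetical, and the `toy` values are made up for illustration.

```r
# Hypothetical helper: share of births in each weight category within each smoking group.
# `data` is assumed to have the columns Smoke and BirthWeightCat.
weight_share_by_smoke <- function(data) {
  counts <- table(Smoke = data$Smoke, BirthWeightCat = data$BirthWeightCat)
  # margin = 1 normalizes each row, so every smoking group sums to 1
  prop.table(counts, margin = 1)
}

# Made-up values for illustration only; not rows from NCbirths
toy <- data.frame(
  Smoke          = c(0, 0, 0, 0, 1, 1),
  BirthWeightCat = c("<2.5k", "3k-3499", "3k-3499", "4k+", "<2.5k", "3k-3499")
)
shares <- weight_share_by_smoke(toy)
round(shares, 2)
```

Each row of the result sums to 1, so the rows for smokers and non-smokers can be compared even when one group is much larger than the other.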

Since the third plot shows plurality side-by-side instead of stacked like the second plot, it provides a clearer view of how birth weight differs based on plurality and smoking status. It can be seen that there were no triplet births for smoking mothers, and only a small number of twins. The plot also shows that the twins born to smoking mothers weighed less than 2,500 grams at birth, which suggests that multiple births may be associated with lower birth weights. This is further supported on the non-smoker side, where most twins and triplets weighed less than 3,000 grams at birth. Overall, my analysis suggests that smoking during pregnancy is associated with lower birth weights, but other factors, such as plurality, may also affect these differences. Further research would be needed to confirm the association.
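As one possible follow-up, the comparison could be restricted to singleton births so that plurality is held fixed, and then tested formally. The following is a minimal sketch, not part of the analysis above. The helper name `compare_singletons` is hypothetical, the `toy` values are made up, and a Welch t-test is only one of several reasonable choices.

```r
# Hypothetical helper: Welch two-sample t-test of birth weight by smoking status,
# restricted to singleton births so that plurality is held fixed.
# `data` is assumed to have the columns Plural, Smoke (0/1), and BirthWeightGm.
compare_singletons <- function(data) {
  singles <- data[!is.na(data$Plural) & data$Plural == 1, ]
  # t.test() with a formula uses Welch's unequal-variance test by default
  t.test(BirthWeightGm ~ Smoke, data = singles)
}

# Made-up values for illustration only; not rows from NCbirths
toy <- data.frame(
  Plural        = c(1, 1, 1, 1, 1, 1, 2, 2),
  Smoke         = c(0, 0, 0, 1, 1, 1, 0, 1),
  BirthWeightGm = c(3300, 3500, 3700, 3000, 3100, 3200, 2300, 2200)
)
result <- compare_singletons(toy)
result$estimate  # group means among singletons
```

With the real data, the call would be `compare_singletons(NCbirths)`. The returned object also carries a confidence interval (`result$conf.int`) and a p-value (`result$p.value`) for the difference in means.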