Project 1

Problem 1: Use the color picker app from the colorspace package (colorspace::choose_color()) to create a qualitative color scale containing five colors. You will use these colors in the following problems.

library(colorspace)  # provides choose_color() and swatchplot()

# five colors picked with colorspace::choose_color()
colors <- c("#EF436B", "#FFCE5C", "#05C793", "#26547D", "#8951a3")

swatchplot(colors)

Problem 2: We will be using this dataset

library(tidyverse)  # provides read_csv(), the dplyr verbs, and ggplot2
NCbirths <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")

## Rows: 1409 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (10): Plural, Sex, MomAge, Weeks, Gained, Smoke, BirthWeightGm, Low, Pre...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(NCbirths)

## # A tibble: 6 × 10
##   Plural   Sex MomAge Weeks Gained Smoke BirthWeightGm   Low Premie Marital
##    <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>         <dbl> <dbl>  <dbl>   <dbl>
## 1      1     1     32    40     38     0         3147.     0      0       0
## 2      1     2     32    37     34     0         3289.     0      0       0
## 3      1     1     27    39     12     0         3912.     0      0       0
## 4      1     1     27    39     15     0         3856.     0      0       0
## 5      1     1     25    39     32     0         3430.     0      0       0
## 6      1     1     28    43     32     0         3317.     0      0       0

NCbirths |>
  group_by(Smoke) |>
  summarize(count = n())

## # A tibble: 2 × 2
##   Smoke count
##   <dbl> <int>
## 1     0  1203
## 2     1   206

Question: Is there a relationship between whether a mother smokes or not and her baby’s weight at birth?

To answer this question, you will plot the distribution of birth weight by smoking status, and also the number of mothers who are smokers and non-smokers, respectively.

Use a boxplot for the first part of the question and a faceted bar plot for the second part.

Hints:

When doing the first part using boxplots, use labels = to provide explicit labels so ggplot2 doesn’t write 0 and 1, like this:

scale_x_discrete(
    name = "Mother",
    labels = c("non-smoker", "smoker")
)

Use scale_fill_manual() to set the fill colors. Your colors need to be two of the five colors you picked in Problem 1.
For the second part with the bar plot, use position = position_stack(reverse = TRUE). This stacks in reverse order so singletons come first, then twins, then triplets. Your colors should be the other three colors picked in Problem 1. Finally, facet by Smoke.

geom_bar(
    position = position_stack(reverse = TRUE)
)
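
One way to keep the colors of the two plots disjoint, as the hints require, is to split the five-color vector by position. This is only a minimal sketch in base R; the names box_colors and bar_colors are illustrative and not part of the assignment, and which two positions go to the boxplot is an arbitrary choice.

```r
# The five colors picked in Problem 1
colors <- c("#EF436B", "#FFCE5C", "#05C793", "#26547D", "#8951a3")

# Two colors for the boxplot fill (illustrative choice of positions)
box_colors <- colors[c(3, 5)]

# The remaining three colors, in their original order, for the bar plot
bar_colors <- setdiff(colors, box_colors)

# Sanity check: two colors for one plot, three for the other
stopifnot(length(box_colors) == 2, length(bar_colors) == 3)
```

The two vectors can then be passed as values = box_colors and values = bar_colors to scale_fill_manual().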

Extra credit: Add a third plot showing the bars side by side instead of stacked.

Introduction:

The NCbirths dataset contains birth records from North Carolina in the year 2001. The version loaded above has 1409 observations of 10 variables. All information was collected by statistician John Holcomb from the North Carolina State Center for Health and Environmental Statistics. In this study we examine whether there is a relationship between whether a mother smokes and the birth weight of her child. Specifically, we compare birth weight by smoking status, and then compare the number of mothers by smoking status and by the number of children each mother is carrying. The variables used are as follows:

Smoke as the mother’s smoking status
BirthWeightGm as the child’s birth weight in grams
Plural as the number of children the mother is carrying.

Approach:

The general method I took when graphing was to lay out my data first, then make aesthetic changes as needed. I began by passing the data to ggplot() and adding the corresponding geom, converting Smoke and Plural to factors where needed so that ggplot2 treats them as discrete categories rather than continuous numbers. For the graphs requiring faceting, I added the facets next, so that I had the skeleton of each graph in place. I then set the fill colors, since they determine how easily the groups can be told apart. Any good graph should have proper labels, so those came next: I renamed the axis and legend titles, added a graph title, and added a caption. At that point everything looked either too cramped or too messy, so I added a block of theme() adjustments that shrinks the caption, centers the title, and adjusts the margins to suit my graphs and titles better. For the bar graphs, I also repositioned the x-axis labels to make them easier to read.
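
The factor conversion described above could also be done once, up front, so that every plot inherits readable labels. This is a minimal sketch assuming dplyr is installed. The toy rows and the name births_labeled are made up for illustration, and the 1/2/3 coding of Plural is an assumption based on the singleton, twin, and triplet ordering in the hints.

```r
library(dplyr)

# Toy stand-in with the same column names as NCbirths
# (values are invented for illustration only)
toy <- data.frame(
  Smoke = c(0, 1, 0, 1),
  Plural = c(1, 1, 2, 3)
)

# Recode the numeric columns as labeled factors
births_labeled <- toy |>
  mutate(
    Smoke = factor(Smoke, levels = c(0, 1), labels = c("Non-smoker", "Smoker")),
    Plural = factor(Plural, levels = 1:3, labels = c("Single", "Twins", "Triplets"))
  )

levels(births_labeled$Smoke)
```

With the data recoded this way, the labels = arguments in the scales become optional, at the cost of one extra step before plotting.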

Analysis:

ggplot(NCbirths, aes(as.factor(Smoke), BirthWeightGm, fill = as.factor(Smoke))) +
  geom_boxplot() +
  scale_fill_manual(
    values = c("#05C793", "#8951a3"), 
    labels = c("Non-Smoker", "Smoker"),
    name = "Mother Smoking Status") +
  scale_x_discrete(
    name = "Mother Smoking Status",
    labels = c("non-smoker", "smoker")) +
  labs(
    y = "Child Birth Weight in Grams",
    title = "Mother Smoking Effect on Child Birth Weight",
    caption = "North Carolina births in 2001, selected by John Holcomb of the North Carolina State Center for Health and Environmental Statistics") +
  theme(        #Positioning adjustments
    plot.caption = element_text(hjust = 0.5, size = 6),
    plot.caption.position = "plot",
    plot.margin = margin(t = 20, r = 10, b = 10, l = 10),
    plot.title = element_text(hjust = 0.5, margin = margin(b=10)))

THANK YOU https://ggplot2.tidyverse.org/reference/facet_wrap.html https://ggplot2.tidyverse.org/reference/theme.html

SHOUTOUT FOR SAVING ME WITH THE FACET STUFF AND POSITIONING :D

ggplot(NCbirths, aes(x = factor(Smoke), fill = factor(Plural))) +
  geom_bar(position = position_stack(reverse = TRUE)) +
  scale_fill_manual(
    values = c("#EF436B", "#FFCE5C", "#26547D"), 
    labels = c("Single", "Twins", "Triplets"),
    name = "Child Count") +
  facet_wrap(
    ~Smoke, 
    strip.position = "bottom", 
    labeller = as_labeller(c(`0` = "Non-Smoker", `1` = "Smoker")), 
    scales = "free_x")+
  scale_x_discrete(
    name = "Mother Smoking Status",
    labels = c("")) +
  labs(
    title = "Number of Mothers by Smoking Status & Child Counts",
    y = "Total Mothers",
    caption = "North Carolina births in 2001, selected by John Holcomb of the North Carolina State Center for Health and Environmental Statistics") +
  theme(        #Positioning adjustments
    plot.caption = element_text(hjust = 0.5, size = 6),
    plot.caption.position = "plot",
    strip.background = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.x = element_text(vjust = 6),
    plot.margin = margin(t = 20, r = 10, b = 10, l = 10),
    plot.title = element_text(hjust = 0.5, margin = margin(b=10)))

ggplot(NCbirths, aes(x = factor(Plural), fill = factor(Plural))) +
  geom_bar() +
  scale_fill_manual(
    values = c("#EF436B", "#FFCE5C", "#26547D"), 
    labels = c("Single", "Twins", "Triplets"),
    name = "Child Count") +
  labs(title = "Number of Mothers by Smoking Status & Child Counts",
       x = "Number of Babies",
       y = "Count of Mothers",
       caption = "North Carolina births in 2001, selected by John Holcomb of the North Carolina State Center for Health and Environmental Statistics") +
  facet_grid(Smoke ~ ., labeller = as_labeller(c(`0` = "Non-smoker",`1` = "Smoker"))) +
  coord_flip() +
  theme(        #Positioning adjustments
    plot.caption = element_text(hjust = 0.5, size = 6),
    plot.caption.position = "plot",
    plot.margin = margin(t = 20, r = 10, b = 10, l = 10),
    plot.title = element_text(hjust = 0.5, margin = margin(b=10)))

Discussion:

First of all, as a general note, far more mothers in this sample did not smoke while pregnant than did. Only 206 of the 1409 mothers (about 15%) smoked, so the two groups are unbalanced and any estimate for smokers rests on much less data than the estimate for non-smokers.
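
The imbalance can be checked directly from the counts in the summary table shown earlier. This is a small base R sketch; the names smoke_counts and smoke_share are illustrative.

```r
# Counts taken from the group_by(Smoke) summary earlier in this report
smoke_counts <- c(non_smoker = 1203, smoker = 206)

# Share of each group among all mothers, as a percentage
smoke_share <- smoke_counts / sum(smoke_counts)
round(100 * smoke_share, 1)
```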

Moving on from that, the box plot does suggest that smoking is associated with lighter children. The box for mothers who smoked generally sits lower, with a much smaller maximum. In addition, the bar plots suggest that mothers who do not smoke are more likely to have twins or triplets, since there are almost no twins among smoking mothers and not a single set of triplets. However, with so few smokers in the sample, this pattern could easily be due to chance: there are roughly six times as many non-smokers as smokers, so rare outcomes such as triplets are far more likely to show up in the non-smoking group. More data would be needed to confirm it.
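
Because the groups differ so much in size, raw counts of twins and triplets are hard to compare, and shares within each smoking group may be more informative. This is a sketch under that assumption, using dplyr. plural_share() is a hypothetical helper, and the toy rows are invented only to show the shape of the output.

```r
library(dplyr)

# Hypothetical helper: share of each Plural value within each Smoke group
plural_share <- function(df) {
  df |>
    count(Smoke, Plural) |>
    group_by(Smoke) |>
    mutate(share = n / sum(n)) |>
    ungroup()
}

# Toy data (invented), not taken from NCbirths
toy <- data.frame(
  Smoke = c(0, 0, 0, 0, 1, 1),
  Plural = c(1, 1, 1, 2, 1, 2)
)
plural_share(toy)
```

On the real data the call would be plural_share(NCbirths).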

To answer the question based on the data given, mothers who smoke appear to give birth to lighter children, but this cannot be stated with certainty without either more data or a more balanced sample.
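
The visual impression from the box plot could be backed up with per-group numbers. This is a minimal sketch assuming dplyr. weight_by_smoke() is a hypothetical helper, and the toy weights are invented, so its output says nothing about the real data.

```r
library(dplyr)

# Hypothetical helper: group size, median, and IQR of birth weight by Smoke
weight_by_smoke <- function(df) {
  df |>
    group_by(Smoke) |>
    summarize(
      n = n(),
      median_g = median(BirthWeightGm),
      iqr_g = IQR(BirthWeightGm)
    )
}

# Toy data (invented), not taken from NCbirths
toy <- data.frame(
  Smoke = c(0, 0, 0, 1, 1, 1),
  BirthWeightGm = c(3100, 3300, 3500, 2900, 3000, 3400)
)
weight_by_smoke(toy)
```

On the real data the call would be weight_by_smoke(NCbirths). The median and IQR are the same quantities a boxplot draws, so the table should agree with the plot.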

R.R.

2025-02-09