Problem 1: Use the color picker app from the colorspace package (colorspace::choose_color()) to create a qualitative color scale containing five colors. You will use these colors in the following questions.

# replace "#FFFFFF" with your own colors
colors <- c("#9C2555", "#E2BA4A", "#3BADCF", "#34B586", "#9744C0")

swatchplot(colors)
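
As an alternative to picking colors by hand, the colorspace package can also generate a qualitative palette programmatically. The sketch below assumes the package's built-in "Dark 3" palette; the palette choice and the variable name `palette_hcl` are illustrative and not part of the assignment.

```r
library(colorspace)

# Generate five qualitative colors from the built-in "Dark 3" HCL palette
# (assumption: any other qualitative palette name would work the same way)
palette_hcl <- qualitative_hcl(5, palette = "Dark 3")

# Show the generated colors as swatches, as done for the hand-picked colors
swatchplot(palette_hcl)
```

Colors generated this way share similar chroma and luminance, which keeps any single category from standing out more than the others.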

Problem 2: We will be using the following dataset:

NCbirths <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")
## Rows: 1409 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (10): Plural, Sex, MomAge, Weeks, Gained, Smoke, BirthWeightGm, Low, Pre...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(NCbirths)
## # A tibble: 6 × 10
##   Plural   Sex MomAge Weeks Gained Smoke BirthWeightGm   Low Premie Marital
##    <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>         <dbl> <dbl>  <dbl>   <dbl>
## 1      1     1     32    40     38     0         3147.     0      0       0
## 2      1     2     32    37     34     0         3289.     0      0       0
## 3      1     1     27    39     12     0         3912.     0      0       0
## 4      1     1     27    39     15     0         3856.     0      0       0
## 5      1     1     25    39     32     0         3430.     0      0       0
## 6      1     1     28    43     32     0         3317.     0      0       0

Question: Is there a relationship between whether a mother smokes or not and her baby’s weight at birth?

To answer this question, you will plot the distribution of birth weight by smoking status, and you will also plot the number of mothers who are smokers and non-smokers, respectively.

Use a boxplot for the first part of the question and a faceted bar plot for the second part of the question.

Hints:

scale_x_discrete(
    name = "Mother",
    labels = c("non-smoker", "smoker")
)
geom_bar(
    position = position_stack(reverse = TRUE)
)

Introduction: The question at hand, “Is there a relationship between whether a mother smokes and her baby’s weight at birth?”, can be addressed with statistical graphs that relate a baby’s birth weight to the mother’s smoking status. My hypothesis is that there is an association between a baby’s weight and the mother’s smoking status due to the unhealthy effects of smoking. The data set is the North Carolina birth records from the year 2001. We will use the variables “Smoke” for the mother’s smoking status, “BirthWeightGm” for the baby’s weight, and “Plural” for whether the baby was a singleton, twin, or triplet.

Approach: First I will compare the distribution of birth weight for non-smokers and for smokers, using two side-by-side box plots to see the difference between the groups. Because a box plot displays the median rather than the mean, the comparison is between medians: if the median birth weight of non-smokers’ babies is higher than that of smokers’ babies, this suggests that smoking status is associated with a lower newborn weight. Afterward I will create a bar graph to show how many smoking and non-smoking mothers there are. The group sizes indicate how much confidence we can place in the comparison drawn from the box plot, since a small group gives a less precise estimate.
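
The comparison described in the approach can also be checked numerically before plotting. The following is a minimal sketch in base R, assuming a data frame with the columns `Smoke` (coded 0/1) and `BirthWeightGm` as in NCbirths; the helper name `summarise_by_smoke` is hypothetical and introduced only for illustration.

```r
# Summarise birth weight by smoking status.
# Assumes `df` has the columns Smoke (0 = non-smoker, 1 = smoker)
# and BirthWeightGm, as in the NCbirths data set.
summarise_by_smoke <- function(df) {
  groups <- factor(df$Smoke, levels = c(0, 1),
                   labels = c("Non-Smoker", "Smoker"))
  data.frame(
    status = levels(groups),
    n      = as.vector(table(groups)),  # number of rows per group
    median = as.vector(tapply(df$BirthWeightGm, groups, median, na.rm = TRUE)),
    mean   = as.vector(tapply(df$BirthWeightGm, groups, mean, na.rm = TRUE))
  )
}

# Only runs if the data set has already been loaded as shown earlier
if (exists("NCbirths")) {
  print(summarise_by_smoke(NCbirths))
}
```

The `median` column corresponds to the center line of each box in a box plot, and the `n` column corresponds to the bar heights in a bar plot of smoking status.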

Analysis:

# Your R code here

ggplot(NCbirths, aes(x = factor(Smoke), y = BirthWeightGm, fill = factor(Smoke))) + # factor() makes ggplot treat "Smoke" as a categorical variable
  geom_boxplot() +
  scale_x_discrete(
    name = "Smoking Status",  # label the x-axis
    labels = c("0" = "Non-Smoker", "1" = "Smoker")  # relabel 0 as Non-Smoker and 1 as Smoker under each box
  ) +
  scale_fill_manual(
    name = "Smoking Status",  # legend title (otherwise shown as "factor(Smoke)")
    values = c("0" = "#9C2555", "1" = "#E2BA4A"),  # colors for non-smokers and smokers
    labels = c("0" = "Non-Smoker", "1" = "Smoker")  # legend labels shown to the right of the graph
  ) +
  labs(
    title = "Birth Weight Distribution by Smoking Status",
    y = "Birth Weight (grams)"
  ) +
  theme_minimal()

ggplot(NCbirths, aes(x = factor(Smoke), fill = factor(Plural))) + #Use factor function for ggplot to recognize variables as categorical 
  geom_bar(position = position_stack(reverse = TRUE)) +
  scale_x_discrete(
    name = "Smoking Status",
    labels = c("0" = "Non-Smoker", "1" = "Smoker") # Label x axis to show non smoker bar and smoker bar
  ) +
  scale_fill_manual(
    name = "Birth Type",  # legend title (otherwise shown as "factor(Plural)")
    values = c("1" = "#3BADCF", "2" = "#E2BA4A", "3" = "#9744C0"),  # colors for singletons, twins, and triplets
    labels = c("1" = "Singletons", "2" = "Twins", "3" = "Triplets")  # legend labels shown to the right of the graph
  ) +
  labs(
    title = "Count of Mothers by Smoking Status and Birth Type",
    x = "Smoking Status",
    y = "Count of Mothers"
  ) +
  theme_minimal()

Extra Credit

# Your R code here
ggplot(NCbirths, aes(x = factor(Smoke), fill = factor(Plural))) + # use factor() for both "Smoke" and "Plural" so ggplot treats them as categorical variables
  geom_bar(position = position_stack(reverse = TRUE)) +
  scale_x_discrete(
    name = "Smoking Status",  # Label the x-axis
    labels = c("0" = "Non-Smoker", "1" = "Smoker")  # Show 0 as Non-Smoker, 1 as Smoker
  ) +
  scale_fill_manual(
    name = "Birth Type",  # legend title (otherwise shown as "factor(Plural)")
    values = c("1" = "#3BADCF", "2" = "#E2BA4A", "3" = "#9744C0"),  # custom colors for singletons, twins, and triplets
    labels = c("1" = "Singletons", "2" = "Twins", "3" = "Triplets")  # legend labels shown to the right of the graph
  ) +
  labs(
    title = "Count of Mothers by Smoking Status and Birth Type",
    x = "Smoking Status",
    y = "Count of Mothers"
  ) +
  facet_wrap(~ Plural, labeller = labeller(Plural = c("1" = "Singletons", "2" = "Twins", "3" = "Triplets"))) +  # Change facet level names for corresponding amount of children
  theme_minimal()

Discussion: The box plot I’ve created supports my hypothesis that smoking status is related to a baby’s birth weight, because the median birth weight of non-smokers’ babies is higher than that of smokers’ babies. However, the bar graphs show that there are far more non-smoking mothers than smoking mothers, so the estimate for the smoker group is less precise and the comparison carries more statistical uncertainty. To dive deeper, a formal test such as an ANOVA (or a t-test, since there are only two groups) would be useful for comparing the difference in means between the two groups, rather than relying on the box plot alone.
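
The follow-up test suggested in the discussion could be sketched as below. This is a minimal sketch using base R's `t.test()` and `aov()`, assuming the same `Smoke` and `BirthWeightGm` columns as in NCbirths; the helper name `compare_birth_weight` and the column `SmokeStatus` are hypothetical, and this was not run as part of the analysis above.

```r
# Compare mean birth weight between non-smokers and smokers.
# Assumes `df` has the columns Smoke (0/1) and BirthWeightGm.
compare_birth_weight <- function(df) {
  df$SmokeStatus <- factor(df$Smoke, levels = c(0, 1),
                           labels = c("Non-Smoker", "Smoker"))
  list(
    # Welch two-sample t-test (does not assume equal group variances)
    welch_t = t.test(BirthWeightGm ~ SmokeStatus, data = df),
    # One-way ANOVA (assumes equal group variances)
    anova = summary(aov(BirthWeightGm ~ SmokeStatus, data = df))
  )
}

# Only runs if the data set has already been loaded as shown earlier
if (exists("NCbirths")) {
  results <- compare_birth_weight(NCbirths)
  print(results$welch_t)
  print(results$anova)
}
```

The Welch test is included alongside the ANOVA because the two groups differ greatly in size, and it does not rely on the groups having equal variances.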