Problem 1: Use the color picker app from the colorspace package (colorspace::choose_color()) to create a qualitative color scale containing five colors. You will use these colors in the following problems.
library(colorspace) # provides choose_color() and swatchplot()
# five qualitative colors picked with colorspace::choose_color()
colors <- c("#9C2555", "#E2BA4A", "#3BADCF", "#34B586", "#9744C0")
swatchplot(colors)
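When the interactive picker is not available (for example in a non-interactive session), a qualitative palette can also be generated programmatically. The following is a minimal sketch, assuming the colorspace package is installed; the "Dark 3" palette and the name auto_colors are illustrative choices, not part of the assignment.

```r
library(colorspace)

# Generate five qualitative colors from an HCL-based palette.
# "Dark 3" is one of colorspace's built-in qualitative palettes,
# used here only as an example.
auto_colors <- qualitative_hcl(5, palette = "Dark 3")

print(auto_colors)      # five hex color strings
swatchplot(auto_colors) # visual check, as with the hand-picked colors
```

Hand-picked colors are still what the problem asks for; a generated palette is mainly useful as a starting point or for comparison.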
Problem 2: We will be using the following dataset:
library(tidyverse) # provides read_csv() and ggplot2
NCbirths <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")
## Rows: 1409 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (10): Plural, Sex, MomAge, Weeks, Gained, Smoke, BirthWeightGm, Low, Pre...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(NCbirths)
## # A tibble: 6 × 10
##   Plural   Sex MomAge Weeks Gained Smoke BirthWeightGm   Low Premie Marital
##    <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>         <dbl> <dbl>  <dbl>   <dbl>
## 1      1     1     32    40     38     0         3147.     0      0       0
## 2      1     2     32    37     34     0         3289.     0      0       0
## 3      1     1     27    39     12     0         3912.     0      0       0
## 4      1     1     27    39     15     0         3856.     0      0       0
## 5      1     1     25    39     32     0         3430.     0      0       0
## 6      1     1     28    43     32     0         3317.     0      0       0
Question: Is there a relationship between whether a mother smokes or not and her baby’s weight at birth?
To answer this question, you will plot the distribution of birth weight by smoking status, and you will also plot the number of mothers who are smokers and non-smokers, respectively.
Use a boxplot for the first part of the question and a faceted bar plot for the second part.
Hints:
Use labels = to provide explicit labels so ggplot2 doesn’t write 0 and 1, like:
    scale_x_discrete(
      name = "Mother",
      labels = c("non-smoker", "smoker")
    )
Use scale_fill_manual to set the fill colors. Your colors need to be two of the five colors you picked in Problem 1.
For the second part with the bar plot, use position = position_stack(reverse = TRUE). This will stack in reverse order so singletons come first, then twins, then triplets. Your colors should be the other three colors you picked in Problem 1. Finally, facet by Smoke.
    geom_bar(
      position = position_stack(reverse = TRUE)
    )
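An alternative to supplying labels = in each scale is to recode a 0/1 column as a labeled factor once, so every axis and legend picks up the text labels automatically. This is a sketch on a hypothetical toy vector (smoke_codes), not on the NCbirths data:

```r
# Hypothetical 0/1 codes standing in for a column such as Smoke
smoke_codes <- c(0, 1, 0, 0, 1)

# levels lists the raw codes; labels gives the text shown in their place
smoke_factor <- factor(
  smoke_codes,
  levels = c(0, 1),
  labels = c("non-smoker", "smoker")
)

print(levels(smoke_factor))
print(table(smoke_factor))
```

The hints above use the labels = approach, which leaves the data untouched; recoding is simply another option when the same labels are needed in several plots.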
Introduction: The question at hand, “Is there a relationship between whether a mother smokes or not and her baby’s weight at birth?”, can be addressed with statistical graphs that show how a baby’s birth weight relates to the mother’s smoking status. My hypothesis is that a baby’s birth weight is associated with the mother’s smoking status because of the unhealthy effects of smoking. The data set is the North Carolina birth records from the year 2001. We will use the variables “Smoke” for the mother’s smoking status, “BirthWeightGm” for the baby’s weight in grams, and “Plural” for whether the birth was a singleton, twins, or triplets.
Approach: First I will compare the distribution of birth weight for non-smokers and smokers using two side-by-side box plots. If the median birth weight (the center line of each box) is higher for non-smokers’ babies than for smokers’ babies, that would be consistent with smoking status being related to a newborn’s weight. Afterward I will create a bar graph showing how many smoking and non-smoking mothers there are. The group sizes indicate how much confidence we can place in the comparison suggested by the box plots, since a smaller group gives a less precise estimate.
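The comparison described in the approach can also be checked numerically. The following is a minimal sketch, assuming the dplyr package is installed and that the data has columns named Smoke and BirthWeightGm as in NCbirths; the helper name summarize_by_smoke is hypothetical.

```r
library(dplyr)

# Hypothetical helper: group size, median, and mean birth weight per smoking status.
# Assumes columns named Smoke and BirthWeightGm, as in NCbirths.
summarize_by_smoke <- function(births) {
  births %>%
    group_by(Smoke) %>%
    summarize(
      n = n(),                                             # number of records in the group
      median_weight = median(BirthWeightGm, na.rm = TRUE), # what the box plot's center line shows
      mean_weight = mean(BirthWeightGm, na.rm = TRUE),
      .groups = "drop"
    )
}

# Runs only if NCbirths has already been loaded as in Problem 2
if (exists("NCbirths")) print(summarize_by_smoke(NCbirths))
```

Reporting the median alongside the mean makes it explicit which statistic the box plot is displaying, and the n column gives the group sizes that the bar graph visualizes.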
Analysis:
# Your R code here
# factor() makes ggplot2 treat the numeric "Smoke" column as a categorical variable
ggplot(NCbirths, aes(x = factor(Smoke), y = BirthWeightGm, fill = factor(Smoke))) +
  geom_boxplot() +
  scale_x_discrete(
    name = "Smoking Status", # x-axis title
    labels = c("0" = "Non-Smoker", "1" = "Smoker") # show 0 as Non-Smoker and 1 as Smoker
  ) +
  scale_fill_manual(
    name = "Smoking Status", # legend title (would otherwise read "factor(Smoke)")
    values = c("0" = "#9C2555", "1" = "#E2BA4A"), # fill colors for non-smokers and smokers
    labels = c("0" = "Non-Smoker", "1" = "Smoker") # legend labels
  ) +
  labs(
    title = "Birth Weight Distribution by Smoking Status",
    y = "Birth Weight (grams)"
  ) +
  theme_minimal()
# factor() makes ggplot2 treat "Smoke" and "Plural" as categorical variables
ggplot(NCbirths, aes(x = factor(Smoke), fill = factor(Plural))) +
  geom_bar(position = position_stack(reverse = TRUE)) + # singletons at the bottom, then twins, then triplets
  scale_x_discrete(
    name = "Smoking Status",
    labels = c("0" = "Non-Smoker", "1" = "Smoker") # label the non-smoker and smoker bars
  ) +
  scale_fill_manual(
    name = "Birth Type", # legend title (would otherwise read "factor(Plural)")
    # the three Problem 1 colors not used in the box plot
    values = c("1" = "#3BADCF", "2" = "#34B586", "3" = "#9744C0"),
    labels = c("1" = "Singletons", "2" = "Twins", "3" = "Triplets") # legend labels
  ) +
  labs(
    title = "Count of Mothers by Smoking Status and Birth Type",
    y = "Count of Mothers"
  ) +
  theme_minimal()
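The hints for this problem ask for the bar plot to be faceted by Smoke. One way this could look is sketched below, assuming ggplot2 is installed and that the data has columns Smoke (0/1) and Plural (1/2/3) as in NCbirths. The helper name plot_plural_by_smoke is hypothetical, and the fill colors are the three Problem 1 colors not used in the box plot.

```r
library(ggplot2)

# Hypothetical helper: stacked bar of birth type, one facet per smoking status.
# Assumes columns Smoke (0/1) and Plural (1/2/3), as in NCbirths.
plot_plural_by_smoke <- function(births) {
  ggplot(births, aes(x = factor(Smoke), fill = factor(Plural))) +
    geom_bar(position = position_stack(reverse = TRUE)) +
    scale_x_discrete(name = NULL, labels = NULL) + # facet strips already name the groups
    scale_fill_manual(
      name = "Birth Type",
      values = c("1" = "#3BADCF", "2" = "#34B586", "3" = "#9744C0"),
      labels = c("1" = "Singletons", "2" = "Twins", "3" = "Triplets")
    ) +
    facet_wrap(
      ~ Smoke,
      scales = "free_x", # each panel shows only its own bar
      labeller = labeller(Smoke = c("0" = "Non-Smoker", "1" = "Smoker"))
    ) +
    labs(y = "Count of Mothers") +
    theme_minimal()
}

# Runs only if NCbirths has already been loaded as in Problem 2
if (exists("NCbirths")) print(plot_plural_by_smoke(NCbirths))
```

Because the facet strips carry the Non-Smoker and Smoker labels, the x-axis text is dropped here to avoid showing the same information twice.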
Extra Credit
# Your R code here
# factor() makes ggplot2 treat both "Smoke" and "Plural" as categorical variables
ggplot(NCbirths, aes(x = factor(Smoke), fill = factor(Plural))) +
  geom_bar(position = position_stack(reverse = TRUE)) +
  scale_x_discrete(
    name = "Smoking Status", # x-axis title
    labels = c("0" = "Non-Smoker", "1" = "Smoker") # show 0 as Non-Smoker and 1 as Smoker
  ) +
  scale_fill_manual(
    name = "Birth Type", # legend title (would otherwise read "factor(Plural)")
    # the three Problem 1 colors not used in the box plot
    values = c("1" = "#3BADCF", "2" = "#34B586", "3" = "#9744C0"),
    labels = c("1" = "Singletons", "2" = "Twins", "3" = "Triplets") # legend labels
  ) +
  labs(
    title = "Count of Mothers by Smoking Status and Birth Type",
    y = "Count of Mothers"
  ) +
  # one panel per birth type, with readable strip labels
  facet_wrap(~ Plural, labeller = labeller(Plural = c("1" = "Singletons", "2" = "Twins", "3" = "Triplets"))) +
  theme_minimal()
Discussion: The box plot I’ve created supports my hypothesis that smoking status is related to a baby’s birth weight, because the median birth weight (the center line of each box) is higher for non-smokers’ babies than for smokers’ babies. However, the bar graphs show that there are far more non-smoking mothers than smoking mothers, so the estimate for the smoking group rests on a smaller sample and is less precise. To dive deeper, an ANOVA (or, with only two groups, a two-sample t-test) could be used to test whether the difference in mean birth weight between the two groups is statistically significant, rather than relying on a box plot alone.
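The follow-up test mentioned in the discussion could be run with base R. The following is a minimal sketch, assuming columns named Smoke and BirthWeightGm as in NCbirths; the helper name compare_by_smoke is hypothetical. With only two groups, a one-way ANOVA gives the same p-value as an equal-variance two-sample t-test, so the sketch also reports Welch's t-test, which does not assume equal variances.

```r
# Hypothetical helper: one-way ANOVA and Welch t-test of birth weight by smoking status.
# Assumes columns named Smoke and BirthWeightGm, as in NCbirths.
compare_by_smoke <- function(births) {
  births$Smoke <- factor(births$Smoke) # treat the 0/1 codes as groups
  list(
    anova = summary(aov(BirthWeightGm ~ Smoke, data = births)),
    welch = t.test(BirthWeightGm ~ Smoke, data = births) # unequal variances allowed
  )
}

# Runs only if NCbirths has already been loaded as in Problem 2
if (exists("NCbirths")) print(compare_by_smoke(NCbirths))
```

Either test addresses the difference in means only; because the data are observational, a significant result would indicate an association rather than establish that smoking causes lower birth weight.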