Problem 1: Use the color picker app from the colorspace package (colorspace::choose_color()) to create a qualitative color scale containing five colors. You will use these colors in the following problems.
# replace "#FFFFFF" with your own colors
colors <- c("#EF436B", "#FFCE5C", "#05C793", "#26547D", "#8951a3")
swatchplot(colors)
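Before using the palette, it can help to confirm that all five entries are well-formed hex codes and that none is repeated. The following is a minimal base-R sketch of that check. The names is_hex and rgb_matrix are only illustrative and are not part of the assignment.

```r
# The five colors picked in Problem 1
colors <- c("#EF436B", "#FFCE5C", "#05C793", "#26547D", "#8951a3")

# Each entry should be "#" followed by exactly six hexadecimal digits
is_hex <- grepl("^#[0-9A-Fa-f]{6}$", colors)

# Stop early if a code is malformed or a color appears twice
stopifnot(all(is_hex), length(unique(toupper(colors))) == 5)

# col2rgb() turns the codes into a 3 x 5 matrix of red/green/blue values
rgb_matrix <- grDevices::col2rgb(colors)
rgb_matrix
```

If stopifnot() passes silently, the vector is safe to hand to swatchplot() and to the manual fill scales used later.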
Problem 2: We will be using this dataset
NCbirths <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")
## Rows: 1409 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (10): Plural, Sex, MomAge, Weeks, Gained, Smoke, BirthWeightGm, Low, Pre...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(NCbirths)
## # A tibble: 6 × 10
## Plural Sex MomAge Weeks Gained Smoke BirthWeightGm Low Premie Marital
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 32 40 38 0 3147. 0 0 0
## 2 1 2 32 37 34 0 3289. 0 0 0
## 3 1 1 27 39 12 0 3912. 0 0 0
## 4 1 1 27 39 15 0 3856. 0 0 0
## 5 1 1 25 39 32 0 3430. 0 0 0
## 6 1 1 28 43 32 0 3317. 0 0 0
NCbirths |>
group_by(Smoke) |>
summarize(count = n())
## # A tibble: 2 × 2
## Smoke count
## <dbl> <int>
## 1 0 1203
## 2 1 206
Question: Is there a relationship between whether a mother smokes or not and her baby’s weight at birth?
To answer this question, you will plot the distribution of birth weight by smoking status, and you will also plot the number of mothers who are smokers and non-smokers, respectively.
Use a boxplot for the first part of the question and a faceted bar plot for the second part.
Hints:
Use labels = to provide explicit labels so ggplot2 doesn’t write 0 and 1, like:
scale_x_discrete(
  name = "Mother",
  labels = c("non-smoker", "smoker")
)
Use scale_fill_manual to set the fill colors. Choose two of the five colors you picked in Problem 1.
For the second part with the bar plot, use position = position_stack(reverse = TRUE). This will stack in reverse order so singletons come first, then twins, then triplets. Your colors should be the other three colors you picked in Problem 1. Finally, facet by Smoke.
geom_bar(
  position = position_stack(reverse = TRUE)
)
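One way to follow the color hints is to split the five-color palette once and reuse the two pieces, so no color is shared between the two plots. This is a small sketch. The names box_colors and bar_colors are made up for illustration, and the choice of the third and fifth colors for the boxplot is arbitrary.

```r
# The five colors picked in Problem 1
colors <- c("#EF436B", "#FFCE5C", "#05C793", "#26547D", "#8951a3")

# Two colors for the boxplot (non-smoker, smoker); any two would do
box_colors <- colors[c(3, 5)]

# The remaining three colors for the bar plot (singletons, twins, triplets);
# setdiff() keeps the order in which they appear in `colors`
bar_colors <- setdiff(colors, box_colors)

# The two groups together should cover the whole palette exactly once
stopifnot(length(box_colors) == 2, length(bar_colors) == 3)
```

box_colors can then be passed to scale_fill_manual(values = ...) for the boxplot and bar_colors for the bar plot.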
Introduction:
The NCbirths dataset contains birth records from North Carolina in the year 2001. The version loaded above has 1409 observations of 10 variables. All information was collected by statistician John Holcomb from the North Carolina State Center for Health and Environmental Statistics. In this study we examine whether there is a relationship between a mother's smoking status and the birth weight of her child. Specifically, we compare birth weight by smoking status, and then compare the number of mothers by smoking status and by the number of children each mother is carrying. The names for these variables are as follows:
Smoke as smoking status
BirthWeightGm as the child’s weight in grams
Plural as the number of children the mother is carrying
Approach:
My general method when graphing was to lay out the data first and then make aesthetic changes as needed. As you may notice, a lot of aesthetic choices were made. I started by plugging in the data and adding the corresponding ggplot2 geom, converting the variables Smoke and Plural into factors where needed so that ggplot2 would treat them as discrete categories. For the graphs requiring faceting, I added the facets between this step and the next so that I had the skeleton of each graph in place. Next I focused on the fill colors, as those play a vital part in how the graphs read. Any good graph should have proper labels, so those came next: I changed the axis names, graph title, and legend names, and added a caption. It was at this point I realized everything was either too cramped or too messy, so I added a block of pure formatting in which I changed the caption size, centered the title, and adjusted the margins to suit my graphs and titles better. For the bar graphs, I also moved the x-axis labels so they would be a bit easier to understand.
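As an alternative to converting Smoke and Plural inside each aes() call, the factors could be created once with readable labels before plotting. This is only a sketch of that idea and not the method used in the plots here. The helper name recode_births is hypothetical, and it assumes Smoke is coded 0/1 and Plural is coded 1/2/3, which matches the rows shown by head(NCbirths) and the stacking order described in the hints.

```r
# Hypothetical helper: turn the numeric codes into labelled factors once,
# so every later plot can use the columns directly
recode_births <- function(df) {
  # Assumed coding: 0 = non-smoker, 1 = smoker
  df$Smoke <- factor(df$Smoke, levels = c(0, 1),
                     labels = c("Non-smoker", "Smoker"))
  # Assumed coding: 1 = single birth, 2 = twins, 3 = triplets
  df$Plural <- factor(df$Plural, levels = c(1, 2, 3),
                      labels = c("Single", "Twins", "Triplets"))
  df
}

# Usage, assuming NCbirths has been read in as shown in Problem 2:
# births_labelled <- recode_births(NCbirths)
# levels(births_labelled$Smoke)
```

With labelled factors, the legend and facet strips pick up the readable names on their own. The trade-off is that the raw 0/1 codes are no longer visible in the plotting data.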
Analysis:
ggplot(NCbirths, aes(as.factor(Smoke), BirthWeightGm, fill = as.factor(Smoke))) +
  geom_boxplot() +
  scale_fill_manual(
    values = c("#05C793", "#8951a3"),
    labels = c("Non-Smoker", "Smoker"),
    name = "Mother Smoking Status") +
  scale_x_discrete(
    name = "Mother Smoking Status",
    labels = c("non-smoker", "smoker")) +
  labs(
    y = "Child Birth Weight in Grams",
    title = "Effect of Mother's Smoking on Child Birth Weight",
    caption = "North Carolina births in 2001, selected by John Holcomb of the North Carolina State Center for Health and Environmental Statistics") +
  theme( # Positioning adjustments
    plot.caption = element_text(hjust = 0.5, size = 6),
    plot.caption.position = "plot",
    plot.margin = margin(t = 20, r = 10, b = 10, l = 10),
    plot.title = element_text(hjust = 0.5, margin = margin(b = 10)))
Thank you to the ggplot2 reference pages for facet_wrap (https://ggplot2.tidyverse.org/reference/facet_wrap.html) and theme (https://ggplot2.tidyverse.org/reference/theme.html), which saved me with the faceting and positioning.
ggplot(NCbirths, aes(x = factor(Smoke), fill = factor(Plural))) +
  geom_bar(position = position_stack(reverse = TRUE)) +
  scale_fill_manual(
    values = c("#EF436B", "#FFCE5C", "#26547D"),
    labels = c("Single", "Twins", "Triplets"),
    name = "Child Count") +
  facet_wrap(
    ~Smoke,
    strip.position = "bottom",
    labeller = as_labeller(c(`0` = "Non-Smoker", `1` = "Smoker")),
    scales = "free_x") +
  scale_x_discrete(
    name = "Mother Smoking Status",
    labels = c("")) +
  labs(
    title = "Number of Mothers by Smoking Status & Child Count",
    y = "Total Mothers",
    caption = "North Carolina births in 2001, selected by John Holcomb of the North Carolina State Center for Health and Environmental Statistics") +
  theme( # Positioning adjustments
    plot.caption = element_text(hjust = 0.5, size = 6),
    plot.caption.position = "plot",
    strip.background = element_blank(),
    axis.ticks.x = element_blank(),
    axis.title.x = element_text(vjust = 6),
    plot.margin = margin(t = 20, r = 10, b = 10, l = 10),
    plot.title = element_text(hjust = 0.5, margin = margin(b = 10)))
ggplot(NCbirths, aes(x = factor(Plural), fill = factor(Plural))) +
  geom_bar() +
  scale_fill_manual(
    values = c("#EF436B", "#FFCE5C", "#26547D"),
    labels = c("Single", "Twins", "Triplets"),
    name = "Child Count") +
  labs(
    title = "Number of Mothers by Smoking Status & Child Count",
    x = "Number of Babies",
    y = "Count of Mothers",
    caption = "North Carolina births in 2001, selected by John Holcomb of the North Carolina State Center for Health and Environmental Statistics") +
  facet_grid(Smoke ~ ., labeller = as_labeller(c(`0` = "Non-smoker", `1` = "Smoker"))) +
  coord_flip() +
  theme( # Positioning adjustments
    plot.caption = element_text(hjust = 0.5, size = 6),
    plot.caption.position = "plot",
    plot.margin = margin(t = 20, r = 10, b = 10, l = 10),
    plot.title = element_text(hjust = 0.5, margin = margin(b = 10)))
Discussion:
First of all, as a general note, far more mothers in this sample did not smoke while pregnant than did. Only 206 of the 1409 mothers (about 15%) smoked, so the smoking group is much smaller and anything estimated from it is less precise.
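The imbalance between the two groups can be put into numbers using the counts from the summary table in Problem 2. This is a small worked calculation. The names smoke_counts and smoke_share are only illustrative.

```r
# Counts taken from the group_by(Smoke) summary shown in Problem 2
smoke_counts <- c("non-smoker" = 1203, "smoker" = 206)

# Total number of mothers in the sample
total_mothers <- sum(smoke_counts)

# Share of each group, as a percentage rounded to one decimal place
smoke_share <- smoke_counts / total_mothers
round(100 * smoke_share, 1)
# non-smoker     smoker
#       85.4       14.6
```

So roughly one mother in seven in this sample is a smoker.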
Moving on from that, the box plot does suggest that smoking may be associated with lighter children. The box for mothers who smoked sits lower overall, with a much smaller maximum. In addition, mothers who do not smoke seem to have a higher chance of having twins or triplets: there are almost no twins among smoking mothers and not a single set of triplets. However, with so few smokers in the sample, this difference could easily be due to chance. The smoking and non-smoking groups are too unequal in size for a comparison of raw counts to be fair. With more data, this may be better supported.
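Because the two groups differ so much in size, comparing shares within each group is fairer than comparing raw counts. The sketch below shows one way to do that with base R. The helper name plural_share_by_smoke is hypothetical, and the code assumes a data frame with Smoke and Plural columns like NCbirths.

```r
# Hypothetical helper: for each smoking status, the fraction of mothers
# carrying one, two, or three children
plural_share_by_smoke <- function(df) {
  # Cross-tabulate smoking status (rows) against number of children (columns)
  counts <- table(Smoke = df$Smoke, Plural = df$Plural)
  # margin = 1 divides each row by its own total, so every row sums to 1
  prop.table(counts, margin = 1)
}

# Usage, assuming NCbirths has been read in as shown in Problem 2:
# round(100 * plural_share_by_smoke(NCbirths), 1)
```

If the share of twins among smokers turns out to be close to the share among non-smokers, the difference seen in the bar plot would mostly reflect the group sizes.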
To answer the question based on the data given, it seems that mothers who smoke give birth to lighter children, but nothing can be certain without either more data or a more balanced sample.
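One possible way to check whether the difference in birth weight could be due to chance is a two-sample t-test, which was not part of this assignment. The sketch below uses t.test() from base R, which applies the Welch correction for unequal group variances by default. The helper name compare_birth_weight is hypothetical, and the code assumes a data frame with BirthWeightGm and Smoke columns like NCbirths.

```r
# Hypothetical helper: compare mean birth weight between the two
# smoking groups with a Welch two-sample t-test
compare_birth_weight <- function(df) {
  # The formula interface splits BirthWeightGm by the two levels of Smoke;
  # var.equal defaults to FALSE, which gives the Welch version of the test
  t.test(BirthWeightGm ~ Smoke, data = df)
}

# Usage, assuming NCbirths has been read in as shown in Problem 2:
# result <- compare_birth_weight(NCbirths)
# result$estimate   # mean birth weight in each group
# result$conf.int   # confidence interval for the difference in means
# result$p.value
```

A test like this describes an association in this sample only. It does not by itself show that smoking causes lower birth weight.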