Problem 1: Use the color picker app from the colorspace package (colorspace::choose_color()) to create a qualitative color scale containing five colors. You will be using these colors in the next questions.
library(colorspace)  # provides choose_color() and swatchplot()
# replace "#FFFFFF" with your own colors
colors <- c("#bc0563", "#cd4f62", "#cdba4f", "#055dbc", "#914fcd")
swatchplot(colors)
Problem 2: We will be using this dataset:
library(tidyverse)  # read_csv(), the dplyr verbs, and ggplot2
NCbirths <- read_csv("https://wilkelab.org/classes/SDS348/data_sets/NCbirths.csv")
## Rows: 1409 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (10): Plural, Sex, MomAge, Weeks, Gained, Smoke, BirthWeightGm, Low, Pre...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(NCbirths)
## # A tibble: 6 × 10
##   Plural   Sex MomAge Weeks Gained Smoke BirthWeightGm   Low Premie Marital
##    <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>         <dbl> <dbl>  <dbl>   <dbl>
## 1      1     1     32    40     38     0         3147.     0      0       0
## 2      1     2     32    37     34     0         3289.     0      0       0
## 3      1     1     27    39     12     0         3912.     0      0       0
## 4      1     1     27    39     15     0         3856.     0      0       0
## 5      1     1     25    39     32     0         3430.     0      0       0
## 6      1     1     28    43     32     0         3317.     0      0       0
Question: Is there a relationship between whether a mother smokes or not and her baby’s weight at birth?
To answer this question, you will plot the distribution of birth weight by smoking status, and also the number of mothers who are smokers and non-smokers.
Use a boxplot for the first part of the question and a faceted bar plot for the second part.
Hints:
Use labels = to provide explicit labels so ggplot2 doesn’t write 0 and 1, like:
scale_x_discrete(
  name = "Mother",
  labels = c("non-smoker", "smoker")
)
Use scale_fill_manual to fill the colors. Your colors need to be two colors from the 5 colors you picked in Problem 1.
For the second part with the bar plot, use position = position_stack(reverse = TRUE). This will stack in reverse order so singletons come first, then twins, then triplets. Your colors should be the other three colors from the previously picked colors in Problem 1. Finally, facet by Smoke.
geom_bar(
  position = position_stack(reverse = TRUE)
)
Introduction:
The NCbirths data set contains information on child births. It has 10 columns and 1,409 observations. Using this data set, I will investigate whether there is a relationship between a baby’s birth weight and the mother’s smoking status. To do this, I will use the following variables: Smoke, a binary variable indicating whether the mother smokes; BirthWeightGm, a continuous numeric variable giving the baby’s birth weight in grams; and Plural, a numeric variable giving the number of babies the mother gave birth to.
Approach:
I’ll start by recoding the relevant variables. I’ll convert the Smoke and Plural variables from the original data set into labeled categories, describing each level with words instead of numbers. Using the recoded data, I’ll create two types of graphs: a boxplot of the distribution of birth weight by smoking status, and bar plots of the number of mothers who are smokers and non-smokers. I’ll also count the mothers in each smoking category, since smoking status is the main variable I’ll be using.
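The recoding step described above can be sketched with base R's factor(), which attaches a label to each numeric code and fixes the level order in one call. This is only an illustration of the idea: the vectors smoke_codes and plural_codes are made-up toy values, not rows from NCbirths, and the analysis below uses ifelse() and case_when() instead.

```r
# Toy codes for illustration only; these are NOT rows from NCbirths.
smoke_codes  <- c(0, 1, 0, 0, 1)
plural_codes <- c(1, 3, 2, 1, 1)

# factor() maps each numeric code to a label and fixes the level order,
# so plots and tables follow this order instead of alphabetical order.
smoker <- factor(smoke_codes, levels = c(0, 1),
                 labels = c("Non-smoker", "Smoker"))
babies <- factor(plural_codes, levels = c(1, 2, 3),
                 labels = c("Single", "Twins", "Triplets"))

levels(babies)  # level order used by ggplot2 for stacking and legends
table(smoker)   # counts per smoking category
```

Setting the level order explicitly is what keeps "Triplets" from sorting ahead of "Twins" when the labels would otherwise be ordered alphabetically.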
Analysis:
births <- NCbirths |>
  mutate(
    Smoker = ifelse(Smoke == 1, "Smoker", "Non-smoker"),
    Babies = case_when(
      Plural == 1 ~ "Single",
      Plural == 2 ~ "Twins",
      Plural == 3 ~ "Triplets"
    )
  ) |>
  select(-Smoke, -Plural)
# Reorder the Babies levels so that "Triplets" doesn't sort before "Twins"
# (the alphabetical default), which would scramble the stacking order in the
# bar plots. See ?position_stack. Stored as a column so it stays tied to births.
births$Babies2 <- factor(births$Babies, levels = c("Single", "Twins", "Triplets"))
births |>
  count(Smoker)
## # A tibble: 2 × 2
##   Smoker         n
##   <chr>      <int>
## 1 Non-smoker  1203
## 2 Smoker       206
ggplot(births, aes(Smoker, BirthWeightGm, fill = Smoker)) +
  geom_boxplot() +
  scale_fill_manual(
    values = c(`Non-smoker` = "#055dbc", Smoker = "#bc0563")
  ) +
  labs(
    title = "Birth Weight Given Smoking Status",
    x = "Smoking Status",
    y = "Birth Weight (g)",
    fill = NULL
  ) +
  theme_light()
ggplot(births, aes(Smoker, fill = Babies2)) +
  scale_fill_manual(
    values = c(Single = "#cd4f62", Twins = "#cdba4f", Triplets = "#914fcd")
  ) +
  facet_wrap(~Smoker) +
  labs(
    title = "Count of Mothers Faceted by Smoking Status",
    subtitle = "Filled by Number of Babies",
    x = "Smoking Status",
    y = NULL,
    fill = "Number of Babies"
  ) +
  geom_bar(position = position_stack(reverse = TRUE)) +
  theme_test()
ggplot(births, aes(Smoker, fill = Babies2)) +
  scale_fill_manual(
    values = c(Single = "#cd4f62", Twins = "#cdba4f", Triplets = "#914fcd")
  ) +
  facet_wrap(~Smoker) +
  labs(
    title = "Count of Mothers Faceted by Smoking Status (Dodged)",
    subtitle = "Filled by Number of Babies",
    fill = "Number of Babies",
    y = NULL
  ) +
  scale_x_discrete(
    name = "Smoking Status"
  ) +
  geom_bar(position = position_dodge()) +
  theme_test()
Discussion:
The purpose of these plots was to investigate whether there is a relationship between a baby’s birth weight and the mother’s smoking status. The first plot, a boxplot, showed that non-smoking mothers gave birth to slightly heavier babies. Although there were far more outliers among non-smokers, this is probably because there are far more non-smoking mothers in this data set than smokers (1,203 versus 206). The second plot, the bar graph, mainly showed the count of mothers. One can observe that smoking mothers didn’t give birth to any triplets while non-smokers did, but this doesn’t necessarily mean anything given the size difference between the two groups. Overall, these graphs alone do not provide strong evidence of a relationship between birth weight and smoking status. That said, the difference in medians shown in the boxplot makes me want to investigate further with more formal methods such as hypothesis testing.
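As a minimal sketch of the follow-up mentioned above, one option is Welch's two-sample t-test through base R's t.test(), reported alongside the group medians. The helper name compare_weights and the toy data frame are assumptions made for illustration; they are not part of the original analysis, and the toy values are not NCbirths rows.

```r
# Hypothetical helper (not part of the original analysis): summarises birth
# weight by smoking status and runs Welch's two-sample t-test.
# Expects columns named Smoker (two groups) and BirthWeightGm.
compare_weights <- function(data) {
  medians <- tapply(data$BirthWeightGm, data$Smoker, median, na.rm = TRUE)
  # t.test() with a formula uses the Welch (unequal variance) test by default
  test <- t.test(BirthWeightGm ~ Smoker, data = data)
  list(medians = medians, p_value = test$p.value)
}

# Toy data for illustration only; these are NOT NCbirths values.
toy <- data.frame(
  Smoker = rep(c("Non-smoker", "Smoker"), each = 4),
  BirthWeightGm = c(3400, 3300, 3500, 3450, 3100, 3200, 3000, 3250)
)

result <- compare_weights(toy)
result$medians
result$p_value
```

With the real data, calling compare_weights(births) would give the two medians and a p-value for the difference in mean birth weight. Welch's test is chosen here because the two groups differ greatly in size, so assuming equal variances seems harder to justify.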