We are interested in the Palmer Penguins data set, and we will use two inferential tools to study group differences. In this project, we will work with a randomization test to ask whether an observed difference between two groups is real or plausibly due to chance, and a bootstrap confidence interval to estimate how large that difference plausibly is.
We will do this twice. First, we will work through a complete example together: studying the difference in body mass between male and female Gentoo penguins. Then, you will carry out the same analysis yourself for a second comparison: bill depth in Adelie versus Chinstrap penguins.
We begin by filtering the penguins data to Gentoo penguins only, keeping only rows with complete data for body mass and sex.
gentoo <- penguins %>%
filter(species == "Gentoo", !is.na(body_mass_g), !is.na(sex))
nrow(gentoo)
## [1] 119
table(gentoo$sex)
##
## female male
## 58 61
We can visualize the raw data to get a sense of whether a difference exists before doing any inference.
ggplot(gentoo, aes(x = sex, y = body_mass_g, fill = sex)) +
geom_boxplot(alpha = 0.7) +
labs(title = "Body mass of Gentoo penguins by sex",
x = "Sex", y = "Body mass (g)") +
theme_minimal() +
theme(legend.position = "none")
There appears to be a visible difference. But could this have arisen by chance? We use a randomization test to find out.
Our test statistic is the difference in mean body mass: male mean minus female mean.
gentoo %>%
group_by(sex) %>%
summarise(mean_body_mass = mean(body_mass_g))
## # A tibble: 2 × 2
## sex mean_body_mass
## <fct> <dbl>
## 1 female 4680.
## 2 male 5485.
observed_diff <- mean((gentoo %>% filter(sex == "male"))$body_mass_g) -
mean((gentoo %>% filter(sex == "female"))$body_mass_g)
observed_diff
## [1] 805.0947
The observed difference in mean body mass is approximately 805 grams, with males heavier on average.
To ask whether this difference is real, we simulate a world in which sex has no relationship to body mass. We do this by shuffling the sex labels and recomputing the difference in means, repeating the process 1,000 times to build a null distribution.
one_shuffle_diff <- function() {
gentoo %>%
mutate(sex_shuffle = sample(sex)) %>%
group_by(sex_shuffle) %>%
summarise(mean_mass = mean(body_mass_g)) %>%
summarise(diff = mean_mass[sex_shuffle == "male"] -
mean_mass[sex_shuffle == "female"]) %>%
pull(diff)
}
null_distribution <- replicate(1000, one_shuffle_diff())
We can visualize the null distribution and mark where our observed difference falls.
null_df <- data.frame(sim_diff = null_distribution)
ggplot(null_df, aes(x = sim_diff)) +
geom_histogram(bins = 40, fill = "steelblue", color = "white", alpha = 0.8) +
geom_vline(xintercept = observed_diff, color = "firebrick", linewidth = 1.2,
linetype = "dashed") +
annotate("text", x = observed_diff - 200, y = 60,
label = paste0("Observed\nDifference\n", round(observed_diff, 0), "g"),
color = "firebrick", hjust = 0, size = 3.5) +
labs(title = "Null distribution: 1,000 shuffles of sex labels",
x = "Simulated difference in mean body mass (g)",
y = "Count") +
theme_minimal()
The observed difference falls far out in the tail of the null distribution — none of the 1,000 shuffles came close to producing a difference as large as what we observed.
We can quantify this with an informal p-value: the proportion of simulated differences that were as large or larger than what we observed.
p_value <- sum(null_distribution >= observed_diff) / length(null_distribution)
p_value
## [1] 0
The p-value is 0. Out of 1,000 random shuffles, zero produced a difference as large as the observed difference of 805 grams. This is strong evidence that the difference in body mass between male and female Gentoo penguins is not due to chance.
The randomization test tells us the difference is real. Now we ask: how large is it? We use bootstrapping to estimate a plausible range for the true difference. Unlike the randomization test, bootstrapping resamples from the real data with replacement.
gentoo_males <- gentoo %>% filter(sex == "male")
gentoo_females <- gentoo %>% filter(sex == "female")
one_boot_diff <- function() {
boot_males <- sample(gentoo_males$body_mass_g, size = nrow(gentoo_males), replace = TRUE)
boot_females <- sample(gentoo_females$body_mass_g, size = nrow(gentoo_females), replace = TRUE)
mean(boot_males) - mean(boot_females)
}
boot_diffs <- replicate(1000, one_boot_diff())
Once we have our bootstrapped distribution, we can look at it, and calculate the 95% confidence interval by chopping off the most extreme values (2.5% on each side).
data.frame(boot_diff = boot_diffs) %>%
ggplot(aes(x = boot_diff)) +
geom_histogram(bins = 40, fill = "steelblue", color = "white") +
geom_vline(xintercept = observed_diff, color = "firebrick", linewidth = 1.2) +
labs(title = "Bootstrap distribution: difference in mean body mass",
subtitle = "Red = observed difference",
x = "Bootstrap difference in mean body mass (male - female)",
y = "Count") +
theme_minimal()
ci_diff <- quantile(boot_diffs, probs = c(0.025, 0.975))
ci_diff
## 2.5% 97.5%
## 695.3692 917.7067
The 95% confidence interval runs from approximately 710g to 910g (although exact values will change depending on the randomness in our specific simulation). We are 95% confident that male Gentoo penguins are, on average, somewhere between 710 and 910 grams heavier than female Gentoo penguins. Notice that this interval does not include zero, which is consistent with our randomization test conclusion that the difference is real.
Now it is your turn. You will carry out the same two-part analysis for a different comparison. Pick a different species (Adelie or Chinstrap) and a quantitative variable (body mass is fine to do, but you may find more interesting results with bill depth, bill length, or flipper length) and address the following questions:
Question A: Consider the difference in mean between the male and female penguins of your chosen species. Is that difference real, or is it plausibly due to chance?
The p-value is 0. Out of 1,000 random shuffles, zero produced a difference as large as the observed difference of 412 grams. This is strong evidence that the difference in body mass between male and female Chinstrap penguins is not due to chance.
Question B: What is a plausible range of values for the size of that difference?
The 95% confidence interval runs from approximately 254g to 566g (although exact values will change depending on the randomness in our specific simulation). We are 95% confident that male Chinstrap penguins are, on average, somewhere between 254 and 566 grams heavier than female Chinstrap penguins. Notice that this interval does not include zero, which is consistent with our randomization test conclusion that the difference is real.
253.6581 566.1949
Include the following:
(1) An explicit calculation of the observed statistic. (This is a difference in mean between males and females of your species. For the Gentoo body mass example, this was roughly 805.)
chinstrap <- penguins %>%
filter(species == "Chinstrap", !is.na(body_mass_g), !is.na(sex))
chinstrap %>%
group_by(sex) %>%
summarise(mean_body = mean(body_mass_g))
## # A tibble: 2 × 2
## sex mean_body
## <fct> <dbl>
## 1 female 3527.
## 2 male 3939.
# The female mean body mass is 3527
# The male mean body mass is 3939
observed_diff <- 3939 - 3527
observed_diff <- mean((chinstrap%>%filter(sex=="male"))$body_mass_g) - mean((chinstrap%>%filter(sex=="female"))$body_mass_g)
observed_diff
## [1] 411.7647
(2) A histogram showing the difference in mean that is created using the simulated randomization test. (This comes from Script 2)
chinstrap_shuffle <- chinstrap %>%
mutate(sex_shuffle = sample(sex))
one_shuffle_diff <- function() {
chinstrap %>%
mutate(sex_shuffle = sample(sex)) %>% # shuffle sex labels
group_by(sex_shuffle) %>%
summarise(mean_mass = mean(body_mass_g)) %>%
summarise(diff = mean_mass[sex_shuffle == "male"] -
mean_mass[sex_shuffle == "female"]) %>%
pull(diff)
}
null_distribution <- replicate(1000, one_shuffle_diff())
null_df <- data.frame(sim_diff = null_distribution)
ggplot(null_df, aes(x = sim_diff)) +
geom_histogram(bins = 40, fill = "steelblue", color = "white", alpha = 0.8)
ggplot(null_df, aes(x = sim_diff)) +
geom_histogram(bins = 40, fill = "steelblue", color = "white", alpha = 0.8) +
geom_vline(xintercept = observed_diff, color = "firebrick", linewidth = 1.2,
linetype = "dashed") +
annotate("text", x = observed_diff - 200, y = 60,
label = paste0("Observed\nDifference\n", round(observed_diff, 0), "g"),
color = "firebrick", hjust = 0, size = 3.5) +
labs(title = "Null distribution",
x = "Simulated difference in mean body mass (g)",
y = "Count") +
theme_minimal()
# The p-value is the proportion of simulated differences that were AS LARGE OR LARGER than what we actually observed.
p_value <- sum(null_distribution >= observed_diff)/length(null_distribution)
p_value
## [1] 0
(3) A histogram showing the bootstrapped distribution via resampling. (This comes from Script 3)
chinstrap_males <- chinstrap %>% filter(sex == "male")
chinstrap_females <- chinstrap %>% filter(sex == "female")
one_boot_diff <- function() {
boot_males <- sample(chinstrap_males$body_mass_g, size = nrow(chinstrap_males), replace = TRUE)
boot_females <- sample(chinstrap_females$body_mass_g, size = nrow(chinstrap_females), replace = TRUE)
mean(boot_males) - mean(boot_females)
}
# Build the bootstrap distribution
boot_diffs <- replicate(1000, one_boot_diff())
obs_diff <- mean(chinstrap_males$body_mass_g) - mean(chinstrap_females$body_mass_g)
# Visualize the bootstrapped sampling distribution
data.frame(boot_diff = boot_diffs) %>%
ggplot(aes(x = boot_diff)) +
geom_histogram(bins = 40, fill = "steelblue", color = "white") +
geom_vline(xintercept = obs_diff, color = "firebrick", linewidth = 1.2) +
labs(title = "Bootstrap distribution: difference in mean body mass",
subtitle = "Red = observed difference",
x = "Bootstrap difference in mean body mass (male - female)",
y = "Count") +
theme_minimal()
# Calculate a 95\% confidence interval
ci_diff <- quantile(boot_diffs, probs = c(0.025, 0.975))
ci_diff # print the interval
## 2.5% 97.5%
## 261.7096 556.6912
data.frame(boot_diff = boot_diffs) %>%
ggplot(aes(x = boot_diff)) +
geom_histogram(bins = 40, fill = "steelblue", color = "white") +
geom_vline(xintercept = obs_diff, color = "firebrick", linewidth = 1.2) +
geom_vline(xintercept = ci_diff[1], color = "darkorange", linewidth = 1, linetype = "dashed") +
geom_vline(xintercept = ci_diff[2], color = "darkorange", linewidth = 1, linetype = "dashed") +
annotate("text", x = ci_diff[1] - 10, y = 65,
label = paste0("2.5%\n", round(ci_diff[1], 0), "g"),
color = "darkorange", hjust = 1, size = 3.2) +
annotate("text", x = ci_diff[2] + 10, y = 65,
label = paste0("97.5%\n", round(ci_diff[2], 0), "g"),
color = "darkorange", hjust = 0, size = 3.2) +
labs(title = "Bootstrap distribution: difference in mean body mass",
subtitle = "Red = observed difference | Orange dashed = 95% CI bounds",
x = "Bootstrap difference in mean body mass (g, male - female)",
y = "Count") +
theme_minimal()
(4) Appropriate analysis and conclusion. (How do these two graphs compare to each other? What can you conclude about the difference between male and female penguins of your species – is the observed difference likely due to an underlying difference, or is it possible that it’s due to random chance?)
# Null distribution (shuffle sex labels)
null_diffs <- replicate(1000, {
chinstrap %>%
mutate(sex_shuffle = sample(sex)) %>%
group_by(sex_shuffle) %>%
summarise(m = mean(body_mass_g)) %>%
summarise(d = m[sex_shuffle == "male"] - m[sex_shuffle == "female"]) %>%
pull(d)
})
chinstrap_males <- chinstrap %>% filter(sex == "male")
chinstrap_females <- chinstrap %>% filter(sex == "female")
obs_diff <- mean(chinstrap_males$body_mass_g) - mean(chinstrap_females$body_mass_g)
one_boot_diff <- function() {
boot_males <- sample(chinstrap_males$body_mass_g, size = nrow(chinstrap_males), replace = TRUE)
boot_females <- sample(chinstrap_females$body_mass_g, size = nrow(chinstrap_females), replace = TRUE)
mean(boot_males) - mean(boot_females)
}
# Build the bootstrap distribution
boot_diffs <- replicate(1000, one_boot_diff())
# Combine both distributions for plotting
comparison_df <- bind_rows(
data.frame(diff = null_diffs, type = "Null distribution\n(shuffled labels)"),
data.frame(diff = boot_diffs, type = "Bootstrap distribution\n(resampled real data)")
)
ggplot(comparison_df, aes(x = diff, fill = type)) +
geom_histogram(bins = 40, color = "white", alpha = 0.8) +
geom_vline(xintercept = obs_diff, color = "firebrick", linewidth = 1.2, linetype = "dashed") +
facet_wrap(~ type, scales = "free_y") +
labs(title = "Null distribution vs. bootstrap distribution",
subtitle = "Red dashed line = observed difference",
x = "Difference in mean body mass (g)", y = "Count") +
theme_minimal() +
theme(legend.position = "none")