author: “Mikey Cariaga” date: “03/5/26” —

Scenario: Inference for Penguin Data using Randomization and Bootstrapping

We are interested in the Palmer Penguins data set, and we will use two inferential tools to study group differences. In this project, we will work with a randomization test to ask whether an observed difference between two groups is real or plausibly due to chance, and a bootstrap confidence interval to estimate how large that difference plausibly is.

We will do this twice. First, we will work through a complete example together: studying the difference in body mass between male and female Gentoo penguins. Then, you will carry out the same analysis yourself for a second comparison: bill depth in Adelie versus Chinstrap penguins.


Part 1: Body Mass in Male vs. Female Gentoo Penguins

Setting up the data

We begin by filtering the penguins data to Gentoo penguins only, keeping only rows with complete data for body mass and sex.

gentoo <- penguins %>%
  filter(species == "Gentoo", !is.na(body_mass_g), !is.na(sex))

nrow(gentoo)
## [1] 119
table(gentoo$sex)
## 
## female   male 
##     58     61

We can visualize the raw data to get a sense of whether a difference exists before doing any inference.

ggplot(gentoo, aes(x = sex, y = body_mass_g, fill = sex)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Body mass of Gentoo penguins by sex",
       x = "Sex", y = "Body mass (g)") +
  theme_minimal() +
  theme(legend.position = "none")

There appears to be a visible difference. But could this have arisen by chance? We use a randomization test to find out.

The Observed Difference

Our test statistic is the difference in mean body mass: male mean minus female mean.

gentoo %>%
  group_by(sex) %>%
  summarise(mean_body_mass = mean(body_mass_g))
## # A tibble: 2 × 2
##   sex    mean_body_mass
##   <fct>           <dbl>
## 1 female          4680.
## 2 male            5485.
observed_diff <- mean((gentoo %>% filter(sex == "male"))$body_mass_g) -
                 mean((gentoo %>% filter(sex == "female"))$body_mass_g)

observed_diff
## [1] 805.0947

The observed difference in mean body mass is approximately 805 grams, with males heavier on average.

The Randomization Test

To ask whether this difference is real, we simulate a world in which sex has no relationship to body mass. We do this by shuffling the sex labels and recomputing the difference in means, repeating the process 1,000 times to build a null distribution.

one_shuffle_diff <- function() {
  gentoo %>%
    mutate(sex_shuffle = sample(sex)) %>%
    group_by(sex_shuffle) %>%
    summarise(mean_mass = mean(body_mass_g)) %>%
    summarise(diff = mean_mass[sex_shuffle == "male"] -
                mean_mass[sex_shuffle == "female"]) %>%
    pull(diff)
}

null_distribution <- replicate(1000, one_shuffle_diff())

We can visualize the null distribution and mark where our observed difference falls.

null_df <- data.frame(sim_diff = null_distribution)

ggplot(null_df, aes(x = sim_diff)) +
  geom_histogram(bins = 40, fill = "steelblue", color = "white", alpha = 0.8) +
  geom_vline(xintercept = observed_diff, color = "firebrick", linewidth = 1.2,
             linetype = "dashed") +
  annotate("text", x = observed_diff - 200, y = 60,
           label = paste0("Observed\nDifference\n", round(observed_diff, 0), "g"),
           color = "firebrick", hjust = 0, size = 3.5) +
  labs(title = "Null distribution: 1,000 shuffles of sex labels",
       x = "Simulated difference in mean body mass (g)",
       y = "Count") +
  theme_minimal()

The observed difference falls far out in the tail of the null distribution — none of the 1,000 shuffles came close to producing a difference as large as what we observed.

We can quantify this with an informal p-value: the proportion of simulated differences that were as large or larger than what we observed.

p_value <- sum(null_distribution >= observed_diff) / length(null_distribution)
p_value
## [1] 0

The p-value is 0. Out of 1,000 random shuffles, zero produced a difference as large as the observed difference of 805 grams. This is strong evidence that the difference in body mass between male and female Gentoo penguins is not due to chance.

The Bootstrap Confidence Interval

The randomization test tells us the difference is real. Now we ask: how large is it? We use bootstrapping to estimate a plausible range for the true difference. Unlike the randomization test, bootstrapping resamples from the real data with replacement.

gentoo_males   <- gentoo %>% filter(sex == "male")
gentoo_females <- gentoo %>% filter(sex == "female")

one_boot_diff <- function() {
  boot_males   <- sample(gentoo_males$body_mass_g,   size = nrow(gentoo_males),   replace = TRUE)
  boot_females <- sample(gentoo_females$body_mass_g, size = nrow(gentoo_females), replace = TRUE)
  mean(boot_males) - mean(boot_females)
}

boot_diffs <- replicate(1000, one_boot_diff())

Once we have our bootstrapped distribution, we can look at it, and calculate the 95% confidence interval by chopping off the most extreme values (2.5% on each side).

data.frame(boot_diff = boot_diffs) %>%
  ggplot(aes(x = boot_diff)) +
  geom_histogram(bins = 40, fill = "steelblue", color = "white") +
  geom_vline(xintercept = observed_diff,    color = "firebrick",   linewidth = 1.2) + 
  labs(title = "Bootstrap distribution: difference in mean body mass",
       subtitle = "Red = observed difference",
       x = "Bootstrap difference in mean body mass (male - female)",
       y = "Count") +
  theme_minimal()

ci_diff <- quantile(boot_diffs, probs = c(0.025, 0.975))
ci_diff
##     2.5%    97.5% 
## 699.3342 916.8427

The 95% confidence interval runs from approximately 710g to 910g (although exact values will change depending on the randomness in our specific simulation). We are 95% confident that male Gentoo penguins are, on average, somewhere between 710 and 910 grams heavier than female Gentoo penguins. Notice that this interval does not include zero, which is consistent with our randomization test conclusion that the difference is real.


Part 2: Do a Similar Analysis for another Penguin Species!

Now it is your turn. You will carry out the same two-part analysis for a different comparison. Pick a different species (Adelie or Chinstrap) and a quantitative variable (body mass is fine to do, but you may find more interesting results with bill depth, bill length, or flipper length) and address the following questions:

Question A: Consider the difference in mean between the male and female penguins of your chosen species. Is that difference real, or is it plausibly due to chance?

Question B: What is a plausible range of values for the size of that difference?

To do this, you will want to set up and filter your data similar to what we’ve done already; shuffle the sexes of the penguins to create a randomization test; and create a bootstrapped confidence interval. Most (all?) of your code can be drawn from Script 2 and Script 3, with key values changed so that we’re looking at a different penguin species and, possibly, a different quantitative variable.

Include the following:

(1) An explicit calculation of the observed statistic. -0.07 for difference of bill depth (This is a difference in mean between males and females of your species. For the Gentoo body mass example, this was roughly 805.)

(2) A histogram showing the difference in mean that is created using the simulated randomization test. (This comes from Script 2)

(3) A histogram showing the bootstrapped distribution via resampling. (This comes from Script 3)

(4) Appropriate analysis and conclusion. (How do these two graphs compare to each other? What can you conclude about the difference between male and female penguins of your species – is the observed difference likely due to an underlying difference, or is it possible that it’s due to random chance?) I think for the two graphs for bill depth I don’t believe it’s random because There’s a pretty big change from 97% and then the other one is 2.3% T The difference between male and females is that males often have a bigger bill size I I believe that this has to do with genetic An And it’s a pretty big difference. I don’t believe it can be random chance.

Unit 7, ALTERNATIVE Script 2: The Randomization Test

In Script 1 we shuffled the sex labels a handful of

times by hand. Now we use replicate() to do it 1,000

times and build a NULL DISTRIBUTION: the range of

differences we’d expect if species had no effect on

bill depth.

We’ll then compare our OBSERVED difference to that null

distribution and compute an informal p-value.

library(dplyr) library(ggplot2) library(palmerpenguins)

STEP 1: Reconstruct the observed difference (Script 1)

two_species <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap"),
         !is.na(bill_depth_mm))

# Calculate the mean bill depth for each species
two_species %>%
  group_by(species) %>%
  summarise(mean_bill_depth = mean(bill_depth_mm))
## # A tibble: 2 × 2
##   species   mean_bill_depth
##   <fct>               <dbl>
## 1 Adelie               18.3
## 2 Chinstrap            18.4
# The Adelie mean bill depth is around 18.3 mm
# The Chinstrap mean bill depth is around 18.4 mm

# Pull out the observed difference: Adelie mean minus Chinstrap mean
observed_diff <- mean((two_species %>% filter(species == "Adelie"))$bill_depth_mm) -
  mean((two_species %>% filter(species == "Chinstrap"))$bill_depth_mm)

observed_diff
## [1] -0.07423062
########################################################## 
# STEP 2: Write a single-shuffle function
########################################################## 

# Before using replicate(), let's write out the 
# single-shuffle logic cleanly as one expression. 
# This is exactly what replicate() will repeat.

one_shuffle_diff <- function() {
  two_species %>%
    mutate(species_shuffle = sample(species)) %>%        # shuffle species labels
    group_by(species_shuffle) %>%
    summarise(mean_depth = mean(bill_depth_mm)) %>%
    summarise(diff = mean_depth[species_shuffle == "Adelie"] -
                mean_depth[species_shuffle == "Chinstrap"]) %>%
    pull(diff)
}

# Test it — run this a few times and confirm you get different small values
one_shuffle_diff()
## [1] -0.02944098
one_shuffle_diff()
## [1] 0.2712894
one_shuffle_diff()
## [1] -0.2256623
one_shuffle_diff()
## [1] -0.3067102
########################################################## 
# STEP 3: Use replicate() to build the null distribution
########################################################## 

# replicate(n, expr) runs expr n times and collects the results into a vector.

set.seed(42)

null_distribution <- replicate(1000, one_shuffle_diff())

# Check: you should have a numeric vector of length 1000
length(null_distribution)
## [1] 1000
head(null_distribution)
## [1]  0.038809895  0.058005454  0.004684457 -0.289647448  0.026012855
## [6]  0.045208414
########################################################## 
# STEP 4: Visualise the null distribution
########################################################## 

# Convert to a data frame for ggplot
null_df <- data.frame(sim_diff = null_distribution)

# Look at a graph of the null distribution
ggplot(null_df, aes(x = sim_diff)) +
  geom_histogram(bins = 40, fill = "steelblue", color = "white", alpha = 0.8)

ggplot(null_df, aes(x = sim_diff)) +
  geom_histogram(bins = 40, fill = "steelblue", color = "white", alpha = 0.8) +
  geom_vline(xintercept = observed_diff, color = "firebrick", linewidth = 1.2,
             linetype = "dashed") +
  annotate("text", x = observed_diff + 0.05, y = 60,
           label = paste0("Observed\nDifference\n", round(observed_diff, 2), " mm"),
           color = "firebrick", hjust = 0, size = 3.5) +
  labs(title = "Null distribution",
       x = "Simulated difference in mean bill depth (mm)",
       y = "Count") +
  theme_minimal()

# QUESTION 1: Describe the shape of the null distribution. Where is it
# centered, and why does that make sense?

# QUESTION 2: Where does the observed difference (red line) fall relative
# to the null distribution? What does this suggest?


########################################################## 
# STEP 5: Compute an informal p-value
########################################################## 

# The p-value is the proportion of simulated differences that were AS LARGE
# OR LARGER (in absolute value) than what we actually observed.
# We use abs() here because a difference in either direction is equally
# "surprising" — Adelie could plausibly have deeper or shallower bills.

p_value <- sum(abs(null_distribution) >= abs(observed_diff)) / length(null_distribution)
p_value
## [1] 0.658

Unit 7, ALTERNATIVE Script 3: Bootstrapping a Confidence Interval

The randomization test told us the bill depth difference

between Adelie and Chinstrap penguins is real (or is it?).

Now we ask: how big is that difference, and what range of

values is plausible given our data?

We’ll use BOOTSTRAPPING to answer this.

library(dplyr)
library(ggplot2)
library(palmerpenguins)

two_species <- penguins %>%
  filter(species %in% c("Adelie", "Chinstrap"),
         !is.na(bill_depth_mm))

########################################################## 
# STEP 1: Warm up — bootstrap a single mean
########################################################## 

# Before the two-group case, let's bootstrap something simpler:
# the mean bill depth of Chinstrap penguins only.

chinstrap <- two_species %>% filter(species == "Chinstrap")

# A bootstrap resample draws n values FROM the data, WITH REPLACEMENT.
# Some penguins appear twice; some don't appear at all. That's intentional.

one_boot_sample <- sample(chinstrap$bill_depth_mm,
                          size = nrow(chinstrap),
                          replace = TRUE)

mean(one_boot_sample)   # run several times — notice it varies slightly
## [1] 18.52941
# QUESTION 1: Why do we sample WITH replacement? What would happen if we
# sampled without replacement (as in Script 2)?


# Now use replicate() to build a bootstrap distribution of the Chinstrap
# mean bill depth.

boot_chinstrap_depth <- replicate(1000, {
  sample(chinstrap$bill_depth_mm,
         size = nrow(chinstrap),
         replace = TRUE) %>% mean()
})

data.frame(boot_mean = boot_chinstrap_depth) %>%
  ggplot(aes(x = boot_mean)) +
  geom_histogram(bins = 40, fill = "mediumpurple", color = "white") +
  geom_vline(xintercept = mean(chinstrap$bill_depth_mm),
             color = "black", linewidth = 1, linetype = "dashed") +
  labs(title = "Bootstrap Distribution of Mean Bill Depth",
       subtitle = "Dashed line = observed sample mean",
       x = "Bootstrapped mean bill depth for Chinstrap penguins (mm)", y = "Count") +
  theme_minimal()

# Extract the 95% confidence interval using the middle 95% of the distribution
ci_depth <- quantile(boot_chinstrap_depth, probs = c(0.025, 0.975))
ci_depth
##     2.5%    97.5% 
## 18.17493 18.70151
# QUESTION 2: Interpret the confidence interval. What does it tell you
# about Chinstrap bill depth?


########################################################## 
# STEP 2: Bootstrap the "Adelie - Chinstrap" difference in means
########################################################## 

# Now we bootstrap the quantity we care about most: the difference in mean
# bill depth between Adelie and Chinstrap penguins.
#
#   1. Resamples Adelie penguins with replacement
#   2. Resamples Chinstrap penguins with replacement
#   3. Returns the difference in means (Adelie minus Chinstrap)

adelie    <- two_species %>% filter(species == "Adelie")
chinstrap <- two_species %>% filter(species == "Chinstrap")

one_boot_diff <- function() {
  boot_adelie    <- sample(adelie$bill_depth_mm,    size = nrow(adelie),    replace = TRUE)
  boot_chinstrap <- sample(chinstrap$bill_depth_mm, size = nrow(chinstrap), replace = TRUE)
  mean(boot_adelie) - mean(boot_chinstrap)
}

# Test it
one_boot_diff()
## [1] -0.03080444
one_boot_diff()
## [1] 0.1901441
# Build the bootstrap distribution
boot_diffs <- replicate(1000, one_boot_diff())

########################################################## 
# STEP 3: Visualize and extract the CI
########################################################## 

obs_diff <- mean(adelie$bill_depth_mm) - mean(chinstrap$bill_depth_mm)

# Visualize the bootstrapped sampling distribution

data.frame(boot_diff = boot_diffs) %>%
  ggplot(aes(x = boot_diff)) +
  geom_histogram(bins = 40, fill = "steelblue", color = "white") +
  geom_vline(xintercept = obs_diff, color = "firebrick", linewidth = 1.2) +
  labs(title = "Bootstrap distribution: difference in mean bill depth",
       subtitle = "Red = observed difference",
       x = "Bootstrap difference in mean bill depth (Adelie - Chinstrap)",
       y = "Count") +
  theme_minimal()

# Calculate a 95% confidence interval

ci_diff <- quantile(boot_diffs, probs = c(0.025, 0.975))

ci_diff   # print the interval
##       2.5%      97.5% 
## -0.3993283  0.2368801
data.frame(boot_diff = boot_diffs) %>%
  ggplot(aes(x = boot_diff)) +
  geom_histogram(bins = 40, fill = "steelblue", color = "white") +
  geom_vline(xintercept = obs_diff,   color = "firebrick",  linewidth = 1.2) +
  geom_vline(xintercept = ci_diff[1], color = "darkorange", linewidth = 1, linetype = "dashed") +
  geom_vline(xintercept = ci_diff[2], color = "darkorange", linewidth = 1, linetype = "dashed") +
  annotate("text", x = ci_diff[1] - 0.05, y = 65,
           label = paste0("2.5%\n", round(ci_diff[1], 2), " mm"),
           color = "darkorange", hjust = 1, size = 3.2) +
  annotate("text", x = ci_diff[2] + 0.05, y = 65,
           label = paste0("97.5%\n", round(ci_diff[2], 2), " mm"),
           color = "darkorange", hjust = 0, size = 3.2) +
  labs(title = "Bootstrap distribution: difference in mean bill depth",
       subtitle = "Red = observed difference | Orange dashed = 95% CI bounds",
       x = "Bootstrap difference in mean bill depth (mm, Adelie - Chinstrap)",
       y = "Count") +
  theme_minimal()