BootstrapPenguins

Estimating the Proportion of Gentoo penguins that are male using bootstrapping

We will filter the data to Gentoo penguins and create an indicator variable for “male” (1 = male, 0 = not male). We treat our original set of data as a single sample, and this code calculates the proportion of the Gentoo sample that is male.

Our first chunk of code creates a vector of 1’s and 0’s, representing the male and female Gentoo penguins. This code filters for the Gentoo species; removes any penguins whose sex is unknown; creates a new column that replaces “male” and “female” with 1 and 0; and then “pulls” this new column to focus exclusively on it.

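# Assumes the Penguins data frame is already loaded, with species and sex columns
library(dplyr)   # provides %>%, filter(), mutate(), pull()

# This code builds a 0/1 vector for "male" among Gentoo penguins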
GentooSex <- Penguins %>%
  filter(species == "Gentoo") %>%
  filter(!is.na(sex)) %>%
  mutate(male = ifelse(sex == "male", 1, 0)) %>%
  pull(male)

length(GentooSex)          # sample size n

## [1] 119

ObsProp_GentooMale <- sum(GentooSex)/length(GentooSex)   # observed proportion
ObsProp_GentooMale

## [1] 0.512605

Next, we use a for loop to draw bootstrap samples with replacement from the observed data. For each bootstrap sample we compute the sample proportion; the collection of those proportions forms the bootstrap distribution, from which we read off a 95% percentile confidence interval (its 2.5th and 97.5th percentiles).

# Make a blank vector for the bootstrap statistics
# We let "B" be the number of bootstrap samples we draw
# (each bootstrap sample has the same size as the original sample)
# It's currently set to 1000, but we could change it
B <- 1000
BootProps_Gentoo <- rep(NA, B)

# This is our "For Loop" bootstrap: resample WITH replacement from GentooSex
for(i in 1:B) {
  BootSample <- sample(GentooSex, size = length(GentooSex), replace = TRUE)
  BootProps_Gentoo[i] <- sum(BootSample)/length(GentooSex)
}
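The same resampling can also be written with replicate(), which evaluates an expression a fixed number of times and collects the results. The sketch below is one possible equivalent, not the approach used in this document. So that it runs on its own, it rebuilds the 0/1 vector from the counts implied by the output above (61 males out of 119, since 61/119 = 0.512605). The names gentoo_sex_demo, B_demo and boot_props_demo are illustrative, and the seed value is an arbitrary assumption added only for reproducibility.

```r
# Stand-in for GentooSex: 61 ones (male) and 58 zeros (not male).
# These counts are inferred from n = 119 and the observed proportion 0.512605.
gentoo_sex_demo <- rep(c(1, 0), times = c(61, 58))

set.seed(2024)    # arbitrary seed (an assumption) so reruns give the same draws
B_demo <- 1000    # number of bootstrap samples

# Each replicate resamples n values WITH replacement and stores the proportion of 1s
boot_props_demo <- replicate(
  B_demo,
  mean(sample(gentoo_sex_demo, size = length(gentoo_sex_demo), replace = TRUE))
)

length(boot_props_demo)                      # one proportion per bootstrap sample
quantile(boot_props_demo, c(0.025, 0.975))   # 95% percentile CI
```

Because no seed is set in the loop above, rerunning it gives slightly different endpoints each time; fixing a seed is one way to make a reported interval repeatable.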

We now visualize the bootstrap distribution and compute our 95% percentile confidence interval.

library(ggplot2)   # provides ggplot(), geom_histogram(), labs()

# Visualize the bootstrap distribution
ggplot(data.frame(BootProps_Gentoo), aes(x = BootProps_Gentoo)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Bootstrap Distribution: Proportion Male (Gentoo)",
    x = "Bootstrap p-hat", y = "Count"
  )

# 95% percentile CI
quantile(BootProps_Gentoo, c(0.025, 0.975))

##      2.5%     97.5% 
## 0.4285714 0.6050420
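As a rough cross-check on the percentile interval, the large-sample formula p-hat ± z * sqrt(p-hat * (1 - p-hat) / n) can be evaluated directly from the observed values. This is a different technique (a normal-approximation, or Wald, interval), shown only for comparison. The variable names are illustrative, and the count of 61 males is inferred from the observed proportion 0.512605 with n = 119.

```r
n     <- 119        # Gentoo penguins with known sex (from the output above)
p_hat <- 61 / n     # observed proportion male, 0.512605

se_hat <- sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error of p-hat
z      <- qnorm(0.975)                    # about 1.96 for a 95% interval

wald_ci <- c(lower = p_hat - z * se_hat, upper = p_hat + z * se_hat)
round(wald_ci, 3)   # roughly 0.423 to 0.602
```

Both endpoints land within about 0.01 of the bootstrap percentile interval reported above, which is plausible for a sample of this size with a proportion near 0.5.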

Estimating the Proportion of Biscoe Island penguins that are female using bootstrapping

Our goal now is to repeat the analysis above for penguins on Biscoe island, estimating the proportion that are female. Our first chunk of code creates a vector of 1’s and 0’s, representing the female and male penguins. It filters for Biscoe island; removes any penguins whose sex is unknown; creates a new column that replaces “female” and “male” with 1 and 0; and then “pulls” this new column to focus exclusively on it. Note that the code also filters for the Gentoo species, so the vector, and every result in this section, describes Gentoo penguins on Biscoe rather than all species on the island.

GentooFemale <- Penguins %>%
  filter(island == "Biscoe") %>%
  filter(species == "Gentoo") %>%
  filter(!is.na(sex)) %>%
  mutate(female = ifelse(sex == "female", 1, 0)) %>%
  pull(female)
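If the analysis is meant to cover every species on Biscoe, one option is to filter on island alone. The sketch below illustrates that idea on a small made-up data frame: toy_penguins and BiscoeFemale are hypothetical names, and the six rows are invented purely for illustration. It uses base R's subset() so that it runs without loading dplyr.

```r
# Hypothetical toy data, NOT the real Penguins data frame
toy_penguins <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo", "Adelie"),
  island  = c("Biscoe", "Biscoe", "Biscoe", "Biscoe", "Biscoe", "Dream"),
  sex     = c("female", "male",   "female", NA,       "male",   "female"),
  stringsAsFactors = FALSE
)

# Keep every species on Biscoe and drop rows with unknown sex
biscoe <- subset(toy_penguins, island == "Biscoe" & !is.na(sex))

# 1 = female, 0 = male
BiscoeFemale <- ifelse(biscoe$sex == "female", 1, 0)

length(BiscoeFemale)   # 4 penguins remain in the toy data
mean(BiscoeFemale)     # 0.5 of them are female
```

With the real data, the same idea would amount to removing the filter(species == "Gentoo") line from the pipeline above. The bootstrap output shown in this section would then change, since it was produced from the Gentoo-only vector.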

Next, we use a for loop to draw bootstrap samples with replacement from the observed data, this time with B = 10000. For each bootstrap sample we compute the sample proportion with mean(), which for a vector of 0’s and 1’s equals the sum divided by the length. The collection of those proportions forms the bootstrap distribution, from which we read off a 95% percentile CI.

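# B is the number of bootstrap samples
B <- 10000
boot_props <- numeric(B)
for (i in 1:B) {
  # With size omitted, sample() draws length(GentooFemale) values
  boot_sample <- sample(GentooFemale, replace = TRUE)
  boot_props[i] <- mean(boot_sample)   # proportion female in this resample
}

# 95% percentile CI
quantile(boot_props, probs = c(0.025, 0.975))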

##      2.5%     97.5% 
## 0.4033613 0.5798319

ObsProp_GentooFemale <- mean(GentooFemale)   # observed proportion female

Finally, we visualize the bootstrap distribution; the 95% percentile confidence interval was computed above.

ggplot(data.frame(boot_props), aes(x = boot_props)) +
  geom_histogram(bins = 30, fill = "lightpink", color = "white") +
  labs(
    title = "Bootstrap Distribution: Proportion Female (Gentoo)",
    x = "Bootstrap p-hat (Proportion Female)",
    y = "Count"
  )