We will create an indicator variable for “male” (1 = male, 0 = not male) after filtering to the Gentoo species. We treat our original set of data as a single sample, and the code below calculates the proportion of the Gentoo sample that is male.
Our first block of code creates a vector of 1’s and 0’s, representing the male and female Gentoo penguins. It filters for the Gentoo species; removes any penguins whose sex is unknown; creates a new column that recodes “male” and “female” as 1 and 0; and then “pulls” this new column out as a vector so we can work with it alone.
# This code builds a 0/1 vector for "male" among Gentoo penguins
GentooSex <- Penguins %>%
  filter(species == "Gentoo") %>%
  filter(!is.na(sex)) %>%
  mutate(male = ifelse(sex == "male", 1, 0)) %>%
  pull(male)
length(GentooSex) # sample size n
## [1] 119
ObsProp_GentooMale <- sum(GentooSex)/length(GentooSex) # observed proportion
ObsProp_GentooMale
## [1] 0.512605
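Because the indicator is coded 0/1, its sum counts the males and its mean is the sample proportion, so `mean()` gives the same value as `sum()/length()`. A small sketch with a made-up vector (`toy_sex` is a hypothetical name, not the penguin data):

```r
# Hypothetical toy data: 1 = male, 0 = female (not the penguin data)
toy_sex <- c(1, 0, 1, 1, 0, 0, 1, 0)

prop_by_sum  <- sum(toy_sex) / length(toy_sex)  # count of 1's divided by n
prop_by_mean <- mean(toy_sex)                   # mean of a 0/1 vector

prop_by_sum
prop_by_mean
```

Both lines print the same proportion, which is why either form can be used for the observed and bootstrap proportions.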
Next, we use a for loop to draw bootstrap samples with replacement from the observed data. For each bootstrap sample we compute a sample proportion; the collection of those proportions forms the bootstrap distribution, from which we read off a 95% percentile CI.
# Make a blank vector for the bootstrap statistics
# We let "B" be the number of bootstrap samples (resamples) we draw
# It's currently set to 1000, but we could change it
B <- 1000
BootProps_Gentoo <- rep(NA, B)
# This is our "For Loop" bootstrap: resample WITH replacement from GentooSex
for(i in 1:B) {
  BootSample <- sample(GentooSex, size = length(GentooSex), replace = TRUE)
  BootProps_Gentoo[i] <- sum(BootSample)/length(GentooSex)
}
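The same resampling can be wrapped in a small helper so it can be reused for other 0/1 vectors. The sketch below is one possible way to do this, not the method used in this walkthrough; `boot_prop()`, `toy_sex`, and `toy_boot` are hypothetical names, and `set.seed()` is added only so the run is reproducible.

```r
# Hypothetical helper: bootstrap the proportion of 1's in a 0/1 vector
boot_prop <- function(x, B = 1000) {
  # replicate() evaluates the expression B times and collects the results
  replicate(B, mean(sample(x, size = length(x), replace = TRUE)))
}

set.seed(1)  # assumption: any seed works; fixed here for reproducibility
toy_sex <- c(rep(1, 12), rep(0, 8))  # made-up data, 60% ones
toy_boot <- boot_prop(toy_sex, B = 500)

length(toy_boot)  # one bootstrap proportion per resample
range(toy_boot)   # every value lies between 0 and 1
```

The explicit for loop and `replicate()` do the same work; the loop makes each step visible, while the helper is shorter to reuse.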
We now visualize the bootstrap distribution and compute our 95% percentile confidence interval.
# Visualize and compute a 95% percentile CI
ggplot(data.frame(BootProps_Gentoo), aes(x = BootProps_Gentoo)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Bootstrap Distribution: Proportion Male (Gentoo)",
    x = "Bootstrap p-hat", y = "Count"
  )
# 95% percentile CI
quantile(BootProps_Gentoo, c(0.025, 0.975))
## 2.5% 97.5%
## 0.4285714 0.6050420
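The percentile interval is the middle 95% of the bootstrap proportions: `quantile()` cuts off 2.5% in each tail. The following sketch, using made-up data (`toy_sex`, `toy_boot`, `ci`, and `inside` are hypothetical names), checks this by computing the fraction of bootstrap proportions that fall inside the interval. Because proportions are discrete and tie often, that fraction is typically at or a little above 0.95.

```r
set.seed(2)  # fixed only for reproducibility
toy_sex <- c(rep(1, 12), rep(0, 8))  # made-up 0/1 data, not the penguins

B <- 2000
toy_boot <- numeric(B)
for (i in 1:B) {
  # Resample with replacement and store the proportion of 1's
  toy_boot[i] <- mean(sample(toy_sex, replace = TRUE))
}

# Endpoints of the 95% percentile interval
ci <- unname(quantile(toy_boot, c(0.025, 0.975)))

# Fraction of bootstrap proportions lying inside the interval (inclusive)
inside <- mean(toy_boot >= ci[1] & toy_boot <= ci[2])

ci
inside
```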
Our goal now is to do the same type of analysis as above, but focusing on penguins (of any species) on Biscoe island, and estimating the proportion that are female. Our first block of code should create a vector of 1’s and 0’s, representing the female and male penguins. This code should specifically filter for Biscoe island; remove any penguins whose sex is unknown; create a new column of data that recodes “female” and “male” as 1 and 0; and then “pull” this new column of data to focus exclusively on it.
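# Build a 0/1 vector for "female" (1 = female, 0 = male)
# Note: the species filter below narrows the sample to Gentoo penguins on Biscoe
GentooFemale <- Penguins %>%
  filter(island == "Biscoe") %>%
  filter(species == "Gentoo") %>%
  filter(!is.na(sex)) %>%
  mutate(female = ifelse(sex == "female", 1, 0)) %>%
  pull(female)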
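Since the stated goal covers penguins of any species on Biscoe, a version without the species filter may be what is wanted. The sketch below illustrates that filtering logic in base R on a small made-up data frame, so it runs without extra packages. `toy_penguins`, `keep`, and `BiscoeFemale` are hypothetical names, and the rows are invented, not taken from `Penguins`.

```r
# Made-up rows for illustration only (not the real Penguins data)
toy_penguins <- data.frame(
  species = c("Gentoo", "Adelie", "Adelie", "Gentoo", "Chinstrap", "Adelie"),
  island  = c("Biscoe", "Biscoe", "Torgersen", "Biscoe", "Dream", "Biscoe"),
  sex     = c("female", "male", "female", NA, "female", "female"),
  stringsAsFactors = FALSE
)

# Keep every species on Biscoe with a known sex (no species filter)
keep <- toy_penguins$island == "Biscoe" & !is.na(toy_penguins$sex)

# Indicator: 1 = female, 0 = male
BiscoeFemale <- ifelse(toy_penguins$sex[keep] == "female", 1, 0)

BiscoeFemale
mean(BiscoeFemale)  # proportion female among the kept rows
```

In the dplyr pipeline, the equivalent change would be to drop the `filter(species == "Gentoo")` line; the bootstrap output would then need to be regenerated.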
Next, we want to use a for loop to draw bootstrap samples with replacement from the observed data. For each bootstrap sample we compute a sample proportion; the collection of those proportions forms the bootstrap distribution, from which we read off a 95% percentile CI.
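B <- 10000               # number of bootstrap resamples
boot_props <- numeric(B) # storage for the bootstrap proportions
for (i in 1:B) {
  # Resample WITH replacement; the default size is length(GentooFemale)
  boot_sample <- sample(GentooFemale, replace = TRUE)
  boot_props[i] <- mean(boot_sample) # proportion of 1's (female)
}
# 95% percentile CI
quantile(boot_props, probs = c(0.025, 0.975))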
## 2.5% 97.5%
## 0.4033613 0.5798319
ObsProp_GentooFemale <- mean(GentooFemale)
Finally, we visualize the bootstrap distribution. The 95% percentile confidence interval was computed with quantile() above.
ggplot(data.frame(boot_props), aes(x = boot_props)) +
  geom_histogram(bins = 30, fill = "lightpink", color = "white") +
  labs(
    title = "Bootstrap Distribution: Proportion Female (Gentoo)",
    x = "Bootstrap p-hat (Proportion Female)",
    y = "Count"
  )