BootstrapPenguins

Estimating the Proportion of Gentoo penguins that are male using bootstrapping

We will filter the data to Gentoo penguins and create an indicator variable for “male” (1 = male, 0 = female). Treating our original set of data as a single sample, the code below calculates the proportion of the Gentoo sample that is male.

Our first block of code creates a vector of 1’s and 0’s, representing the male and female Gentoo penguins. This code filters for the Gentoo species; removes any penguins whose sex is unknown; creates a new column that codes “male” as 1 and “female” as 0; and then “pulls” this new column out as a vector so we can work with it on its own.

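# Load the packages that provide %>%, filter(), mutate(), pull(), and ggplot()
# (the Penguins data frame is assumed to be loaded already)
library(dplyr)
library(ggplot2)

# This code builds a 0/1 vector for "male" among Gentoo penguins
GentooSex <- Penguins %>%
  filter(species == "Gentoo") %>%             # only Gentoo penguins
  filter(!is.na(sex)) %>%                     # remove missing data
  mutate(male = ifelse(sex == "male", 1, 0)) %>%  # 1 = male, 0 = female
  pull(male)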

length(GentooSex)          # sample size n

## [1] 119

ObsProp_GentooMale <- sum(GentooSex)/length(GentooSex)   # observed proportion
ObsProp_GentooMale

## [1] 0.512605
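
Because the vector holds only 1’s and 0’s, its sum counts the males, so the proportion is the same as the mean of the vector. The short check below is a sketch that rebuilds a stand-in vector from the reported sample size (119) and observed proportion (0.512605, which implies 61 males); the name GentooSex_demo is hypothetical and is not part of the original analysis.

```r
# Stand-in for GentooSex: 61 ones and 58 zeros, inferred from n = 119 and
# the observed proportion 0.512605 (an assumption; the real vector comes
# from the Penguins data)
GentooSex_demo <- c(rep(1, 61), rep(0, 58))

# Two equivalent ways to get the sample proportion of a 0/1 vector
prop_by_sum  <- sum(GentooSex_demo) / length(GentooSex_demo)
prop_by_mean <- mean(GentooSex_demo)

# TRUE when the two calculations agree
all.equal(prop_by_sum, prop_by_mean)
```

Using mean() is a common shorthand for this calculation, but either form gives the same number.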

Next, we use a for loop to draw bootstrap samples with replacement from the observed data. Each bootstrap sample has the same size as the original sample, and for each one we compute a sample proportion. The collection of those proportions forms the bootstrap distribution, from which we read off a 95% percentile CI.

# Make a blank vector for the bootstrap statistics
# We let "B" be the number of bootstrap samples we draw
# (each sample has the same size as the original data)
# It's currently set to 1000, but we could change it
B <- 1000
BootProps_Gentoo <- rep(NA, B)

# This is our "For Loop" bootstrap: resample WITH replacement from GentooSex
for(i in 1:B) {
  BootSample <- sample(GentooSex, size = length(GentooSex), replace = TRUE)
  BootProps_Gentoo[i] <- sum(BootSample)/length(GentooSex)
}
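
The same resampling can also be written without an explicit loop. The sketch below uses base R’s replicate() and fixes a random seed so the results can be reproduced; the seed value, the stand-in vector GentooSex_demo (61 ones and 58 zeros, inferred from the reported n and proportion), and the name BootProps_demo are all assumptions made for illustration.

```r
# Fix the random number generator so the resampling is reproducible
# (the seed value is arbitrary)
set.seed(2024)

# Stand-in for GentooSex, inferred from n = 119 and proportion 0.512605
GentooSex_demo <- c(rep(1, 61), rep(0, 58))

B <- 1000  # number of bootstrap samples

# replicate() repeats the expression B times and collects the results;
# each repetition resamples the data with replacement and takes the mean
BootProps_demo <- replicate(
  B,
  mean(sample(GentooSex_demo, size = length(GentooSex_demo), replace = TRUE))
)

length(BootProps_demo)  # one bootstrap proportion per resample
range(BootProps_demo)   # every value lies between 0 and 1
```

The for loop and replicate() do the same work; the loop makes each step visible, while replicate() is more compact. Without a call to set.seed(), the interval endpoints will change slightly from run to run.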

We now visualize the bootstrap distribution and compute our 95% percentile confidence interval.

# Visualize and compute a 95% percentile CI
ggplot(data.frame(BootProps_Gentoo), aes(x = BootProps_Gentoo)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Bootstrap Distribution: Proportion Male (Gentoo)",
    x = "Bootstrap p-hat", y = "Count"
  )

# 95% percentile CI
quantile(BootProps_Gentoo, c(0.025, 0.975))

##      2.5%     97.5% 
## 0.4201681 0.5966387
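
As a rough cross-check on the percentile interval, the sketch below computes the textbook normal-approximation interval, p-hat plus or minus z times sqrt(p-hat(1 - p-hat)/n). This is a different technique from the bootstrap and is shown only for comparison. It uses the reported n = 119 and the 61 males implied by the observed proportion; the names n, p_hat, se_formula, and ci_normal are illustrative and do not appear in the original analysis.

```r
# Values taken from the output above: n = 119, with 61 males implied by
# the observed proportion 0.512605
n     <- 119
p_hat <- 61 / n

# Formula-based standard error of a sample proportion
se_formula <- sqrt(p_hat * (1 - p_hat) / n)

# 95% normal-approximation interval: p_hat +/- z * SE
z         <- qnorm(0.975)
ci_normal <- p_hat + c(-1, 1) * z * se_formula

se_formula
ci_normal
```

The standard error comes out near 0.046, so this interval should land close to the bootstrap percentile interval, which is what we would expect for a proportion near 0.5 with more than 100 observations.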

Estimating the Proportion of Biscoe Island penguins that are female using bootstrapping

Our goal now is to do the same type of analysis as above, but focusing on penguins (of any species) that live on Biscoe Island, and estimating the proportion that are female. Our first block of code should create a vector of 1’s and 0’s, representing the female and male Biscoe penguins. This code should filter for Biscoe Island; remove any penguins whose sex is unknown; create a new column that codes “female” as 1 and “male” as 0; and then “pull” this new column out as a vector so we can work with it on its own.

# This code builds a 0/1 vector for "female" among Biscoe penguins
BiscoeSex <- Penguins %>%
  filter(island == "Biscoe") %>%        # only Biscoe island
  filter(!is.na(sex)) %>%               # remove missing data
  mutate(female = ifelse(sex == "female", 1, 0)) %>%  # 1 = female, 0 = male
  pull(female)

length(BiscoeSex)                          # sample size n

## [1] 163

ObsProp_BiscoeFemale <- sum(BiscoeSex) / length(BiscoeSex)  # observed proportion
ObsProp_BiscoeFemale

## [1] 0.4907975

Next, we use a for loop to draw bootstrap samples with replacement from the observed data. Each bootstrap sample has the same size as the original sample, and for each one we compute a sample proportion. The collection of those proportions forms the bootstrap distribution, from which we read off a 95% percentile CI.

# Bootstrap setup: B is the number of bootstrap samples we draw
B <- 1000
BootProps_Biscoe <- rep(NA, B)

# For loop to generate bootstrap samples
for(i in 1:B) {
  BootSample <- sample(BiscoeSex, size = length(BiscoeSex), replace = TRUE)
  BootProps_Biscoe[i] <- sum(BootSample) / length(BiscoeSex)
}

Finally, we visualize the bootstrap distribution and compute our 95% percentile confidence interval.

# Histogram of bootstrap proportions
ggplot(data.frame(BootProps_Biscoe), aes(x = BootProps_Biscoe)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Bootstrap Distribution: Proportion Female (Biscoe)",
    x = "Bootstrap p-hat", y = "Count"
  )

# 95% percentile confidence interval
quantile(BootProps_Biscoe, c(0.025, 0.975))

##      2.5%     97.5% 
## 0.4171779 0.5644172
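
Since the Gentoo and Biscoe analyses follow the same steps, they could be wrapped in one small function. The sketch below is one possible way to do that; the function name boot_prop_ci, its arguments, the seed, and the stand-in vector BiscoeSex_demo (80 ones and 83 zeros, inferred from the reported n = 163 and proportion 0.4907975) are all hypothetical and not part of the original analysis.

```r
# Hypothetical helper: percentile bootstrap CI for the proportion of 1's
# in a 0/1 vector x
boot_prop_ci <- function(x, B = 1000, level = 0.95) {
  # Guard against input that is not a 0/1 indicator vector
  stopifnot(length(x) > 1, all(x %in% c(0, 1)))

  # Resample with replacement B times, keeping each sample proportion
  boot_props <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

  # Cut off (1 - level) / 2 in each tail of the bootstrap distribution
  alpha <- 1 - level
  quantile(boot_props, c(alpha / 2, 1 - alpha / 2))
}

# Stand-in for BiscoeSex, inferred from n = 163 and proportion 0.4907975
BiscoeSex_demo <- c(rep(1, 80), rep(0, 83))

set.seed(1)  # arbitrary seed so the result can be reproduced
ci_biscoe <- boot_prop_ci(BiscoeSex_demo)
ci_biscoe
```

Calling boot_prop_ci(GentooSex) or boot_prop_ci(BiscoeSex) on the real vectors would then replace the repeated setup, loop, and quantile steps, at the cost of hiding the individual steps that the for loop makes explicit.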