Setup

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(infer)
library(openintro)

## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata

load("cdc.Rdata")

cdc$age - The true mean

Compute the true mean of cdc$age and store it in true_mean.

true_mean = mean(cdc$age)
true_mean

## [1] 45.06825

cdc500

Get a random sample of 500 cases from cdc. call it cdc500. Use a seed, 123.

Solution

set.seed(123)
cdc500 = cdc %>% 
  sample_n(500)
str(cdc)

## 'data.frame':    20000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
##  $ exerany : num  0 0 1 1 0 1 1 0 0 1 ...
##  $ hlthplan: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100: num  0 1 1 0 0 0 0 0 1 0 ...
##  $ height  : num  70 64 60 66 61 64 71 67 65 70 ...
##  $ weight  : int  175 125 105 132 150 114 194 170 150 180 ...
##  $ wtdesire: int  175 115 105 124 130 114 185 160 130 170 ...
##  $ age     : int  77 33 49 42 55 55 31 45 27 44 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...

Bootstrap

Create a bootstrap distribution of the mean of cdc500$age. Use 1,000 reps in the infer package.

Solution

bootstrap_distribution <- cdc500 %>%
  specify(response = age) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

Upper and Lower Bounds

Create the upper and lower bounds of this distribution appropriate for a 95% CI.

Solution

lower = quantile(bootstrap_distribution$stat,.025)
upper = quantile(bootstrap_distribution$stat,.975)
lower

##    2.5% 
## 43.3419

upper

##    97.5% 
## 46.37025

Compare with get_ci()

Solution

bootstrap_distribution %>% 
  get_ci

## # A tibble: 1 × 2
##   lower_ci upper_ci
##      <dbl>    <dbl>
## 1     43.3     46.4

lower

##    2.5% 
## 43.3419

upper

##    97.5% 
## 46.37025

Compare with t.test()

Solution

t.test(cdc500$age)

## 
##  One Sample t-test
## 
## data:  cdc500$age
## t = 58.863, df = 499, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  43.30081 46.29119
## sample estimates:
## mean of x 
##    44.796

lower

##    2.5% 
## 43.3419

upper

##    97.5% 
## 46.37025

Examine Bootstrap

Use geom_density() and geom_vline(). Include the true mean value.

Solution

bootstrap_distribution %>% 
  ggplot(aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = lower) +
  geom_vline(xintercept = upper) +
  geom_vline(xintercept = true_mean,color = "red")

Bootstrap CI

Setup

cdc$age - The true mean

cdc500

Solution

Bootstrap

Solution

Upper and Lower Bounds

Solution

Compare with get_ci()

Solution

Compare with t.test()

Solution

Examine Bootstrap

Solution