Bootstrapping is a resampling technique. We will discuss how it works!
set.seed(356) # any number is fine
penguins_intervals <- reg_intervals(bill_depth_mm ~ bill_length_mm * species * sex,
                                    data = penguins, type = "percentile", keep_reps = FALSE)
penguins_intervals
# A tibble: 10 × 4
species sex year bill_length_mm
<fct> <fct> <int> <dbl>
1 Gentoo female 2007 48.7
2 Adelie male 2008 39.6
3 Gentoo female 2009 50.5
4 Adelie female 2007 40.3
5 Gentoo male 2008 44.4
6 Adelie female 2009 40.2
7 Gentoo female 2007 46.2
8 Gentoo female 2007 45.1
9 Chinstrap female 2007 58
10 Gentoo male 2009 52.5
# let's turn on sampling with replacement: every draw is made from the entire dataset again, so the same row can be picked more than once
lilpen2 <- penguins %>%
  slice_sample(n = 10, replace = TRUE) %>%
  select(species, sex, year, bill_length_mm)
lilpen2 # if we run this enough times we will eventually see duplicates! This is the concept upon which bootstrapping is based
# A tibble: 10 × 4
species sex year bill_length_mm
<fct> <fct> <int> <dbl>
1 Adelie male 2008 40.1
2 Gentoo female 2008 45.5
3 Chinstrap male 2007 51.3
4 Gentoo female 2007 46.5
5 Adelie female 2008 33.1
6 Adelie female 2008 35.5
7 Adelie female 2007 39.5
8 Chinstrap male 2007 48.5
9 Adelie female 2009 39.6
10 Gentoo male 2009 46.8
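The difference between sampling without and with replacement can be checked directly. The following is a minimal base R sketch, assuming a toy vector `ids` (a hypothetical stand-in for ten penguin rows, not part of the original analysis):

```r
# Hypothetical toy data: row ids 1..10 stand in for ten penguins
set.seed(356)
ids <- 1:10

without_repl <- sample(ids, size = 10, replace = FALSE) # a permutation: every id appears exactly once
with_repl    <- sample(ids, size = 10, replace = TRUE)  # ids may repeat, and some may be left out

anyDuplicated(without_repl) > 0 # always FALSE when sampling without replacement
table(with_repl)                # counts per id; repeated draws show up as counts above 1
```

With `replace = TRUE`, any count above 1 in the table is a duplicated row, which is exactly what a bootstrap resample relies on.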
Now we can scale up (working towards bootstrapping)
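# with this sample in hand we can draw a resample of the same size as the sample and calculate the mean bill length
# (mean() returns NA whenever the resample contains a missing bill_length_mm; add na.rm = TRUE to skip missing values)
orig_sample %>%
  slice_sample(n = n, replace = TRUE) %>%
  summarize(meanbill = mean(bill_length_mm))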
# A tibble: 1 × 1
meanbill
<dbl>
1 NA
# 44.2
# compare to the original dataset
penguins %>%
  summarize(meanbill = mean(bill_length_mm))
# A tibble: 1 × 1
meanbill
<dbl>
1 NA
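The means above print as NA because `mean()` returns NA as soon as its input contains a missing value. A minimal sketch of that behaviour, assuming a toy vector `bill` (hypothetical values standing in for `bill_length_mm`):

```r
# Hypothetical toy vector standing in for bill_length_mm, with one missing value
bill <- c(39.1, 39.5, NA, 40.3, 36.7)

mean(bill)               # NA: a single missing value makes the mean missing
mean(bill, na.rm = TRUE) # 38.9: mean of the four observed values
sum(is.na(bill))         # 1: number of missing values
```

Passing `na.rm = TRUE` to `mean()` inside `summarize()` would apply the same fix to the pipelines in this section.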
# 44.0 -- different because n=150 in the df but we sampled extra (n=200)
# by repeating this process many times we can see how much variation there is from sample to sample
pen_200_bs <- 1:1000 %>% # 1000 = number of trials / resamples
  map_dfr(~ orig_sample %>%
    slice_sample(n = n, replace = TRUE) %>%
    summarize(meanbill = mean(bill_length_mm))) %>%
  mutate(n = n)
pen_200_bs # you will see we now have means for 1000 trials!
# A tibble: 1,000 × 2
meanbill n
<dbl> <dbl>
1 NA 200
2 NA 200
3 NA 200
4 NA 200
5 43.6 200
6 NA 200
7 NA 200
8 NA 200
9 44.0 200
10 NA 200
# … with 990 more rows
# check against the original df
pen_df_bs <- 1:1000 %>% # 1000 = number of trials / resamples
  map_dfr(~ penguins %>%
    slice_sample(n = n, replace = TRUE) %>%
    summarize(meanbill = mean(bill_length_mm))) %>%
  mutate(n = n)
pen_df_bs
# A tibble: 1,000 × 2
meanbill n
<dbl> <dbl>
1 NA 200
2 NA 200
3 NA 200
4 NA 200
5 NA 200
6 44.1 200
7 NA 200
8 NA 200
9 NA 200
10 NA 200
# … with 990 more rows
-The distribution of values we get when we build a series of bootstrap trials is called the bootstrap distribution. It is not exactly the same as the sampling distribution, but for sufficiently large n it is a good approximation!
-Remember that if we have a roughly normal distribution, we can get a 95% CI with the rule of thumb CI = estimate ± 2 SE, where SE is the standard error of the mean (the "real" multiplier here is 1.96).
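Written out, the rule of thumb looks as follows. This is a sketch under the assumption that the bootstrap distribution is roughly normal; the symbols $B$ (number of resamples), $\bar{x}^{*}_{b}$ (mean of the $b$-th resample) and $\bar{x}^{*}$ (average of those means) are notation introduced here, not taken from the original text:

```latex
\widehat{\mathrm{SE}} = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\bar{x}^{*}_{b} - \bar{x}^{*}\right)^{2}},
\qquad
\mathrm{CI}_{95\%} = \bar{x}^{*} \pm 1.96\,\widehat{\mathrm{SE}}
```

Here $\widehat{\mathrm{SE}}$ is simply the standard deviation of the $B$ bootstrap means.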
Calculating bootstrapped CIs could thus look like this:
pen_200_bs <- 1:1000 %>% # 1000 = number of trials / resamples
  map_dfr(~ orig_sample %>%
    slice_sample(n = n, replace = TRUE) %>%
    summarize(meanbill = mean(bill_length_mm))) %>%
  mutate(n = n)
# the sd of the bootstrap means estimates the SE, so CI here is the half-width of the 95% interval
calc_CIs <- pen_200_bs %>%
  summarize(meanbillboot = mean(meanbill), CI = 1.96 * sd(meanbill))
calc_CIs
# A tibble: 1 × 2
meanbillboot CI
<dbl> <dbl>
1 NA NA
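Both summary values are NA because most of the resample means are NA. The whole procedure can be sketched end to end in base R with missing values dropped. This is an illustration under stated assumptions, not the analysis above: `bill` is simulated data standing in for `orig_sample$bill_length_mm`, and `boot_means`, `boot_se`, `normal_ci` and `percentile_ci` are names introduced here.

```r
set.seed(356)

# Hypothetical stand-in for orig_sample$bill_length_mm: 198 simulated values plus 2 missing
bill <- c(rnorm(198, mean = 44, sd = 5.5), NA, NA)
n <- length(bill)

# 1000 bootstrap resamples: draw n values with replacement, take the mean, skip NAs
boot_means <- replicate(1000, mean(sample(bill, size = n, replace = TRUE), na.rm = TRUE))

boot_se <- sd(boot_means) # bootstrap estimate of the standard error of the mean

# Normal-approximation 95% interval: centre +/- 1.96 * SE
normal_ci <- mean(boot_means) + c(-1.96, 1.96) * boot_se

# Percentile 95% interval: the middle 95% of the bootstrap distribution
percentile_ci <- quantile(boot_means, probs = c(0.025, 0.975))

normal_ci
percentile_ci
```

When the bootstrap distribution is roughly normal the two intervals should be close. The percentile version makes no normality assumption, and it is the same kind of interval requested with `type = 'percentile'` in the `reg_intervals()` call at the top of this section.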