Resampling based procedures are ways to perform population based statistical inferences, while living within our data. Data Scientists tend to really like resampling based inferences, since they are very data centric procedures, they scale well to large studies and they often make very few assumptions.
Here we look at creating a confidence interval and estimating standard errors with bootstraping
boot( ) calls the statistic function R times. Each time, it generates a set of random indices, with replacement, from the integers 1:nrow(data). These indices are used within the statistic function to select a sample. The statistics are calculated on the sample and the results are accumulated in the bootobject. The bootobject structure includes
You can access these as bootobject\(t0 and bootobject\)t.
Once you generate the bootstrap samples, print(bootobject) and plot(bootobject) can be used to examine the results. If the results look reasonable, you can use boot.ci( ) function to obtain confidence intervals for the statistic(s).
The format is boot.ci(bootobject, conf=, type= ) where ## parameter description - bootobject: The object returned by the boot function - conf: The desired confidence interval (default: conf=0.95) - type: The type of confidence interval returned. Possible values are “norm”, “basic”, “stud”, “perc”, “bca” and “all” (default: type=“all”)
The following example generates the bootstrapped 95% confidence interval for R-squared in the linear regression of miles per gallon (mpg) on car weight (wt) and displacement (disp). The data source is mtcars. The bootstrapped confidence interval is based on 1000 replications.
# Bootstrap 95% CI for R-Squared
library(boot)
# function to obtain R-Squared from the data
rsq <- function(formula, data, indices) {
d <- data[indices,] # allows boot to select sample
fit <- lm(formula, data=d)
return(summary(fit)$r.square)
}
# bootstrapping with 1000 replications
results <- boot(data=mtcars, statistic=rsq,
R=1000, formula=mpg~wt+disp)
# view results
results
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = mtcars, statistic = rsq, R = 1000, formula = mpg ~
## wt + disp)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.7809306 0.01083345 0.04859227
plot(results)
# get 95% confidence interval
boot.ci(results, type="bca")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "bca")
##
## Intervals :
## Level BCa
## 95% ( 0.6378, 0.8522 )
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable
In example above, the function rsq returned a number and boot.ci returned a single confidence interval. The statistics function you provide can also return a vector. In the next example we get the 95% CI for the three model regression coefficients (intercept, car weight, displacement). In this case we add an index parameter to plot( ) and boot.ci( ) to indicate which column in bootobject$t is to analyzed.
# Bootstrap 95% CI for regression coefficients
library(boot)
# function to obtain regression weights
bs <- function(formula, data, indices) {
d <- data[indices,] # allows boot to select sample
fit <- lm(formula, data=d)
return(coef(fit))
}
# bootstrapping with 1000 replications
results <- boot(data=mtcars, statistic=bs,
R=1000, formula=mpg~wt+disp)
# view results
results
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = mtcars, statistic = bs, R = 1000, formula = mpg ~
## wt + disp)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 34.96055404 1.339763e-01 2.601566258
## t2* -3.35082533 -6.621430e-02 1.158595156
## t3* -0.01772474 -7.087857e-05 0.008421491
plot(results, index=1) # intercept
plot(results, index=2) # wt
plot(results, index=3) # disp
# get 95% confidence intervals
boot.ci(results, type="bca", index=1) # intercept
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "bca", index = 1)
##
## Intervals :
## Level BCa
## 95% (29.71, 39.85 )
## Calculations and Intervals on Original Scale
boot.ci(results, type="bca", index=2) # wt
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "bca", index = 2)
##
## Intervals :
## Level BCa
## 95% (-5.511, -0.813 )
## Calculations and Intervals on Original Scale
boot.ci(results, type="bca", index=3) # disp
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = results, type = "bca", index = 3)
##
## Intervals :
## Level BCa
## 95% (-0.0349, -0.0007 )
## Calculations and Intervals on Original Scale
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
COURSE NOTES: ## The jackknife
library(UsingR)
## Warning: package 'UsingR' was built under R version 4.0.3
## Loading required package: MASS
## Loading required package: HistData
## Warning: package 'HistData' was built under R version 4.0.3
## Loading required package: Hmisc
## Warning: package 'Hmisc' was built under R version 4.0.3
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:boot':
##
## aml
## Loading required package: Formula
## Warning: package 'Formula' was built under R version 4.0.3
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
##
## Attaching package: 'UsingR'
## The following object is masked from 'package:survival':
##
## cancer
data(father.son)
x <- father.son$sheight
n <- length(x)
theta <- median(x)
jk <- sapply(1 : n,
function(i) median(x[-i])
)
thetaBar <- mean(jk)
biasEst <- (n - 1) * (thetaBar - theta)
seEst <- sqrt((n - 1) * mean((jk - thetaBar)^2))
c(biasEst, seEst)
## [1] 0.0000000 0.1014066
library(bootstrap)
## Warning: package 'bootstrap' was built under R version 4.0.3
temp <- jackknife(x, median)
c(temp$jack.bias, temp$jack.se)
## [1] 0.0000000 0.1014066
The general procedure follows by first simulating complete data sets from the observed data with replacement
Use the simulated statistics to either define a confidence interval or take the standard deviation to calculate a standard error
B <- 1000
resamples <- matrix(sample(x,
n * B,
replace = TRUE),
B, n)
medians <- apply(resamples, 1, median)
sd(medians)
## [1] 0.08536117
quantile(medians, c(.025, .975))
## 2.5% 97.5%
## 68.44665 68.81588
subdata <- InsectSprays[InsectSprays$spray %in% c("B", "C"),]
y <- subdata$count
group <- as.character(subdata$spray)
testStat <- function(w, g) mean(w[g == "B"]) - mean(w[g == "C"])
observedStat <- testStat(y, group)
permutations <- sapply(1 : 10000, function(i) testStat(y, sample(group)))
observedStat
## [1] 13.25
mean(permutations > observedStat)
## [1] 0