Bootstrap: general concept:

The bootstrap is used in several contexts, most commonly to provide a measure of accuracy of a parameter estimate or of a given statistical learning method.

It is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.

Example: it can be used to estimate the std. errors of the coefficients of a linear regression model.

Bootstrap sample: draw observations at random from the original data set, putting each one back before the next draw (sampling with replacement), so the same observation can appear more than once. The resulting set is a bootstrap sample.

One advantage of the bootstrap approach is that it can be applied in almost all situations.

No complicated mathematical calculations are required.

TWO-STEP PROCESS IN “R”:

(1) Create a function that computes the statistic of interest.

(2) Use the boot() function (from library(boot)) to perform the bootstrap: repeatedly sample observations from the data set with replacement, run the statistic R times to get R estimates of the statistic of interest, and compute the std. error. A minimal skeleton of this pattern is sketched below.
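As a minimal sketch of this two-step pattern (my.stat, my.data, and some.column are placeholder names, not from ISLR):

library(boot)

# step 1: a statistic function that takes the data and a vector of row
# indices, and returns the statistic computed on those rows only
my.stat = function(data, index) {
        mean(data$some.column[index])   # placeholder statistic
}

# step 2: boot() resamples rows with replacement R times, calls
# my.stat() on each resample, and reports the resulting std. error
# boot(my.data, my.stat, R = 1000)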

CASE-1: data : “Portfolio” from ISLR

Objective: Suppose we want to invest a fixed sum of money in two financial assets with yields X and Y. We will invest a fraction of our money, alpha, in X, and the remaining fraction, 1 - alpha, in Y. Since there is variability (risk) associated with the returns on these two assets, we wish to choose alpha so as to minimize the total variance (risk) of our investment. Let's estimate alpha and the std. error of the estimate.
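One can show (see ISLR, Ch. 5) that the fraction minimizing var(alpha*X + (1 - alpha)*Y) is

alpha = (var(Y) - cov(X,Y)) / (var(X) + var(Y) - 2*cov(X,Y))

which is exactly the quantity that alpha.fun() below computes from the sample variances and covariance.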

library(ISLR)

data(Portfolio)

alpha.fun() is a function which takes the data (X, Y) as input, as well as a vector (index) indicating which observations should be used to estimate alpha. The function then outputs the estimate of alpha based on the selected observations.

alpha.fun = function(data, index) {
        x = data$X[index]   # returns of asset X for the selected observations
        y = data$Y[index]   # returns of asset Y for the selected observations
        # sample version of the variance-minimizing fraction alpha
        return((var(y)-cov(x,y)) / (var(x)+var(y)-2*cov(x,y)))
}

Example of how to call alpha.fun():

NO BOOTSTRAP:

round(alpha.fun(Portfolio, 1:100),3)
## [1] 0.576
# No random sampling here: we passed the fixed index 1:100, so no matter how many times you run this, you get the same result. This is not bootstrapping; bootstrapping is random sampling with replacement.

Example of how to call alpha.fun() using sample():

SINGLE BOOTSTRAP ITERATION:

round(alpha.fun(Portfolio, sample(100,100,replace = T)),3)
## [1] 0.583
# Run this multiple times and you will see a different estimate each time; each run is one bootstrap estimate. Below we repeat this R=1000 times using the boot() function.

We can now implement the bootstrap analysis by performing the above command many times, recording all of the corresponding estimates of alpha, and computing the resulting std. error (the std. dev. of those estimates). A hand-rolled version is sketched below.
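As a sketch of what this looks like by hand (a manual resampling loop, not the boot package's internal code):

set.seed(1)   # for reproducibility
boot.estimates = replicate(1000, alpha.fun(Portfolio, sample(100, 100, replace = TRUE)))
sd(boot.estimates)   # the bootstrap std. error of alpha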

The boot() function automates this process for us.

Below we produce R=1000 bootstrap estimates of alpha.

library(boot)
boot(Portfolio, alpha.fun, R=1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Portfolio, statistic = alpha.fun, R = 1000)
## 
## 
## Bootstrap Statistics :
##      original      bias    std. error
## t1* 0.5758321 0.003327537  0.09171213

The estimate of alpha using the original data = 0.5758.

The bootstrap estimate of SE(alpha) = 0.0917.
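If you store the boot() result, the individual replicates are available for further use; a small sketch (boot.out is a placeholder name):

boot.out = boot(Portfolio, alpha.fun, R = 1000)
sd(boot.out$t[, 1])                # the std. error that boot() prints as "std. error"
boot.ci(boot.out, type = "perc")   # a 95% percentile confidence interval for alpha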

CASE-2: data : “Auto” from ISLR

Objective: The bootstrap approach can be used to assess the variability of the coefficient estimates and predictions from a statistical learning method. Here we will use the bootstrap to assess the variability of the estimates of B0 and B1, the intercept and slope terms of the linear regression model mpg ~ horsepower. Then we will compare the estimates obtained using the bootstrap to those obtained using the formulas for SE(B0) and SE(B1).

with(Auto, plot(mpg ~ horsepower))

# The relationship does not look perfectly linear.
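To make the curvature easier to see, you can overlay the least-squares line on the scatterplot (a quick base-R sketch):

with(Auto, plot(mpg ~ horsepower))
abline(lm(mpg ~ horsepower, data = Auto), col = "red", lwd = 2)   # fitted linear trend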

First, let's create a simple function, boot.fun(), which takes in the Auto data set as well as a set of indices for the observations, and returns the intercept and slope estimates for the linear regression model.

boot.fun = function(data, index) {
        # fit the linear model on the selected observations and return (B0, B1)
        return(coef(lm(mpg ~ horsepower, data=data, subset=index)))
}

NO BOOTSTRAP:

round(boot.fun(Auto, 1:392),3)
## (Intercept)  horsepower 
##      39.936      -0.158
# No random sampling here: we passed the fixed index 1:392, so no matter how many times you run this, you get the same result. This is not bootstrapping; bootstrapping is random sampling with replacement.

SINGLE BOOTSTRAP ITERATION:

round(boot.fun(Auto, sample(392,392,replace = T)),3)
## (Intercept)  horsepower 
##      38.648      -0.147
# Run this multiple times and you will see a different estimate each time; each run is one bootstrap estimate. Below we repeat this R=1000 times using the boot() function.

Now, let's produce R=1000 bootstrap estimates using boot():

boot(Auto, boot.fun, R=1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot.fun, R = 1000)
## 
## 
## Bootstrap Statistics :
##       original        bias    std. error
## t1* 39.9358610  0.0400456643 0.848122750
## t2* -0.1578447 -0.0003312314 0.007349799

The final BOOTSTRAP estimate of the INTERCEPT = 39.936 (SE = 0.8481).

The final BOOTSTRAP estimate of the SLOPE = -0.158 (SE = 0.0073).

Now we use the full set of 392 observations to compute the estimates of B0 and B1 on the entire data set, using the usual linear regression coefficient formulas.

summary(lm(mpg ~ horsepower, data=Auto))$coef
##               Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) 39.9358610 0.717498656  55.65984 1.220362e-187
## horsepower  -0.1578447 0.006445501 -24.48914  7.031989e-81

The final FORMULA estimate of the INTERCEPT = 39.936 (SE = 0.7175).

The final FORMULA estimate of the SLOPE = -0.158 (SE = 0.0064).

CONCLUSION:

The coefficient estimates of B0 and B1 obtained from the formulas and from the bootstrap are the same, but the std. error estimates are not.

This actually suggests the BOOTSTRAP std. errors are better than the formula-based ones, and here is why:

Recall that the standard formulas for SE(B0) and SE(B1) depend on the unknown parameter sigma^2, the noise variance.

We estimate sigma^2 using the RSS; for simple linear regression, sigma-hat^2 = RSS / (n - 2).

Although the formulas for the std. errors do not depend on the linear model being correct, the estimate of sigma^2 does.

If there is a non-linear relationship in the data, as there is between mpg ~ horsepower in the Auto data (see the plot above), the residuals from a linear fit will be inflated, and so will the estimate of sigma^2.
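You can check this inflation directly: summary()'s residual standard error is the estimate of sigma, and it shrinks once the curvature is modeled (values are approximate; the quadratic one matches the CASE-3 output below):

summary(lm(mpg ~ horsepower, data = Auto))$sigma                      # about 4.91 for the linear fit
summary(lm(mpg ~ horsepower + I(horsepower^2), data = Auto))$sigma    # about 4.37 for the quadratic fit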

Secondly, the standard formulas assume (somewhat unrealistically) that the x_i are fixed, and that all the variability comes from the variation in the errors e_i.

The bootstrap approach does not rely on either of these assumptions, so it likely gives a more accurate estimate of the std. errors of B0 and B1 than the formulas used by the summary() function.

CASE-3: data : “Auto” from ISLR

Objective: Below we compute the bootstrap std. error estimates and the standard linear regression estimates that result from fitting the quadratic model to the data.

boot.fun = function(data, index){
        # fit the quadratic model on the selected observations and return (B0, B1, B2)
        return(coef(lm(mpg ~ horsepower+I(horsepower^2), data=data, subset=index)))
}
set.seed(1)

boot(Auto, boot.fun, R=1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Auto, statistic = boot.fun, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original        bias     std. error
## t1* 56.900099702  6.098115e-03 2.0944855842
## t2* -0.466189630 -1.777108e-04 0.0334123802
## t3*  0.001230536  1.324315e-06 0.0001208339
summary(lm(mpg~horsepower+I(horsepower^2), data=Auto))
## 
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
## horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
## I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

Since this model provides a good fit to the data, there is now a better correspondence between the bootstrap estimates and the standard estimates of SE(B0), SE(B1), and SE(B2).
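A quick way to put the two sets of std. errors side by side (boot.out and quad.fit are placeholder names, not from the book):

set.seed(1)
boot.out = boot(Auto, boot.fun, R = 1000)
quad.fit = lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
rbind(bootstrap = apply(boot.out$t, 2, sd),               # SE of each coefficient across resamples
      formula = summary(quad.fit)$coef[, "Std. Error"])   # the usual formula-based SEs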