Bootstrap: general concept
The bootstrap is a widely applicable and extremely powerful statistical tool for quantifying the uncertainty associated with a given estimator or statistical learning method.
It is used in several contexts, most commonly to provide a measure of accuracy of a parameter estimate or of a given statistical learning method.
Example: it can be used to estimate the std. errors of the coefficients of a linear regression model.
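Bootstrap sample: draw n observations at random, with replacement, from the original data set of n observations. Because the sampling is with replacement, an observation can appear more than once or not at all. The resulting set is a bootstrap sample.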
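As a minimal sketch of what drawing one bootstrap sample looks like in R, the block below resamples row indices with sample(). The toy data frame and the names original, index and boot.sample are made up for illustration and are not part of the original analysis.

```r
# Toy data set of n = 10 observations (values are illustrative only).
set.seed(1)
original <- data.frame(id = 1:10, x = rnorm(10))

# Draw n row indices with replacement: some rows repeat, others are left out.
n <- nrow(original)
index <- sample(n, n, replace = TRUE)

# The bootstrap sample has the same number of rows as the original data set.
boot.sample <- original[index, ]
boot.sample
```

Sampling indices rather than rows directly is the same convention the statistic functions in the cases that follow rely on.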
One of the advantages of the bootstrap approach is that it can be applied in almost all situations.
No complicated mathematical calculations are required.
TWO-STEP PROCESS IN “R”:
(1) Create a function that computes the statistic of interest.
(2) Use the boot() function (from library(boot)) to perform the bootstrap by repeatedly sampling observations from the data set with replacement. Running the statistic function R times gives R estimates of the statistic of interest, from which the std. error is calculated.
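The two steps can be sketched on a toy example. This assumes the boot package is installed (it ships with standard R distributions); median.fun and the toy data frame are hypothetical names and values used only here.

```r
library(boot)

# Step 1: a statistic function that takes the data and a vector of row indices.
# median.fun is a hypothetical name used only for this sketch.
median.fun <- function(data, index) {
  median(data$x[index])
}

# Toy data (illustrative values only).
set.seed(1)
toy <- data.frame(x = rexp(50))

# Step 2: boot() resamples the row indices R times and calls median.fun each time.
res <- boot(toy, median.fun, R = 200)

res$t0          # statistic computed on the original data
sd(res$t[, 1])  # standard deviation of the R bootstrap estimates (the std. error)
```

The statistic function must accept the index vector as its second argument, because that is how boot() passes each resample to it by default.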
CASE-1: data : “Portfolio” from ISLR
Objective: suppose we want to invest a fixed sum of money in two financial assets that yield returns of X and Y. We will invest a fraction of our money, alpha, in X, and the remaining fraction, 1 - alpha, in Y. Since there is variability (risk) associated with the returns on these two assets, we wish to choose alpha to minimize the total variance (risk) of our investment, Var(alpha*X + (1 - alpha)*Y). Let's estimate alpha and the std. error of that estimate.
library(ISLR)
data(Portfolio)
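The definition of alpha.fun() does not appear in this section. A plausible version, assuming it implements the usual minimum-variance solution alpha = (Var(Y) - Cov(X,Y)) / (Var(X) + Var(Y) - 2*Cov(X,Y)) on the rows selected by index, would be:

```r
# Assumed definition (not shown in the original notes): the alpha that
# minimizes Var(alpha*X + (1 - alpha)*Y), estimated from the rows in `index`.
alpha.fun <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}
```

The function expects a data frame with columns named X and Y, as in Portfolio.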
example of how to call alpha.fun()
NO BOOTSTRAP:
round(alpha.fun(Portfolio, 1:100),3)
## [1] 0.576
example of how to call alpha.fun() using sample()
SINGLE BOOTSTRAP ITERATION:
round(alpha.fun(Portfolio, sample(100, 100, replace = TRUE)), 3)
## [1] 0.583
We can now implement the bootstrap analysis by running the above command many times, recording all of the corresponding estimates of alpha, and computing their std. deviation, which is the bootstrap std. error.
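Done by hand, that repetition might look like the sketch below. The alpha.fun() definition is an assumed one, repeated so the block runs on its own, and returns is a simulated stand-in for Portfolio so that ISLR is not required.

```r
# Assumed definition of alpha.fun (the minimum-variance formula).
alpha.fun <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}

# Simulated stand-in with columns X and Y; replace `returns` with Portfolio
# to repeat the analysis on the real data.
set.seed(1)
returns <- data.frame(X = rnorm(100), Y = rnorm(100))

R <- 1000
n <- nrow(returns)

# One alpha estimate per bootstrap sample.
alpha.star <- replicate(R, alpha.fun(returns, sample(n, n, replace = TRUE)))

# The bootstrap std. error is the std. deviation of those estimates.
se.alpha <- sd(alpha.star)
se.alpha
```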
The boot() function automates this process for us.
Below we produce R=1000 bootstrap estimates of alpha.
library(boot)
boot(Portfolio, alpha.fun, R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Portfolio, statistic = alpha.fun, R = 1000)
##
##
## Bootstrap Statistics :
##      original         bias  std. error
## t1* 0.5758321  0.003327537  0.09171213
The estimate of alpha from the original data = 0.5758
The bootstrap estimate of SE(alpha) = 0.0917 (the std. error column above; this value shifts slightly from run to run unless a seed is set with set.seed())
CASE-2: data : “Auto” from ISLR
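The boot.fun() used for this case is not defined in this section. The two-row output (t1*, t2*) suggests a simple linear regression of mpg on horsepower, so a definition along these lines is assumed:

```r
# Assumed definition for this case: intercept and slope of mpg on horsepower,
# refit on the rows selected by `index`.
boot.fun <- function(data, index) {
  coef(lm(mpg ~ horsepower, data = data, subset = index))
}
```

Calling boot.fun(Auto, 1:nrow(Auto)) would then return the coefficient estimates on the full data set.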
Now, let’s produce R=1000 bootstrap estimates using boot():
boot(Auto, boot.fun, R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Auto, statistic = boot.fun, R = 1000)
##
##
## Bootstrap Statistics :
##       original           bias   std. error
## t1* 39.9358610   0.0400456643  0.848122750
## t2* -0.1578447  -0.0003312314  0.007349799
The final BOOTSTRAP estimate of the INTERCEPT = 39.936 (SE = 0.8481)
The final BOOTSTRAP estimate of the SLOPE = -0.158 (SE = 0.0073)
The final FORMULA estimate of the INTERCEPT = 39.936 (SE = 0.7175)
CASE-3: data : “Auto” from ISLR
Objective: below we compute the bootstrap std. error estimates and the standard linear regression estimates that result from fitting the quadratic model to the data.
# statistic for boot(): coefficients of the quadratic fit on the rows in index
boot.fun = function(data, index){
  return(coef(lm(mpg ~ horsepower+I(horsepower^2), data=data, subset=index)))
}
set.seed(1)
boot(Auto, boot.fun, R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Auto, statistic = boot.fun, R = 1000)
##
##
## Bootstrap Statistics :
##         original           bias    std. error
## t1* 56.900099702   6.098115e-03  2.0944855842
## t2* -0.466189630  -1.777108e-04  0.0334123802
## t3*  0.001230536   1.324315e-06  0.0001208339
summary(lm(mpg~horsepower+I(horsepower^2), data=Auto))
##
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -14.7135  -2.5943  -0.0859   2.2868  15.8961
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
## horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
## I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
Since this model provides a good fit to the data, there is now a better correspondence between the bootstrap estimates and the standard formula estimates of SE(B0), SE(B1) and SE(B2): 2.094 vs 1.800, 0.0334 vs 0.0311, and 0.000121 vs 0.000122.
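To view the two sets of std. errors side by side instead of reading them off two printouts, something like the following sketch could be used. The names auto.like, quad.fun and se.table are hypothetical, and auto.like is a stand-in built from the built-in mtcars data so the block runs without ISLR. Its numbers will therefore differ from those above; replace it with Auto to reproduce them.

```r
library(boot)

# Stand-in data with the same column names as Auto (illustration only).
auto.like <- data.frame(mpg = mtcars$mpg, horsepower = mtcars$hp)

# Same quadratic statistic as in this case, under a hypothetical name.
quad.fun <- function(data, index) {
  coef(lm(mpg ~ horsepower + I(horsepower^2), data = data, subset = index))
}

set.seed(1)
res <- boot(auto.like, quad.fun, R = 500)

fit <- lm(mpg ~ horsepower + I(horsepower^2), data = auto.like)

# One row per coefficient: bootstrap SE next to the formula-based SE.
se.table <- cbind(
  bootstrap = apply(res$t, 2, sd),
  formula   = summary(fit)$coefficients[, "Std. Error"]
)
se.table
```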