Resampling methods involve repeatedly drawing samples from a training set and refitting a model on each sample in order to obtain additional information about the fitted model. For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ, which we will do in the second part of this lab.
The bootstrap is used in several contexts, most commonly to provide a measure of accuracy of a parameter estimate or of a given statistical learning method. The power of the bootstrap is that it can easily be applied to a wide range of statistical learning methods, including some in which a measure of variability is otherwise difficult to obtain.
In real life, we usually can’t just get new data by resampling the population over and over. However, the bootstrap approach allows us to mimic the process of obtaining new data sets, so that we can estimate the variability of our estimate without generating additional samples.
Rather than repeatedly obtaining independent data sets from the population, the bootstrap obtains distinct data sets by repeatedly sampling observations from the original data set, each of the same size as the original, treating the sample itself as the population to resample from.
Each of these “bootstrap data sets” is created by sampling with replacement, and as a result some observations may appear more than once in a given bootstrap data set while others might not appear at all.
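To make that concrete, here is a minimal sketch of drawing one bootstrap data set by hand (the toy data frame df is invented just for illustration):

set.seed(42)                                   # reproducible toy example
df = data.frame(x = rnorm(10), y = rnorm(10))  # stand-in for a real data set
n = nrow(df)
idx = sample(1:n, n, replace = TRUE)           # n row indices, drawn with replacement
table(idx)                                     # some rows show up repeatedly, others not at all
zstar = df[idx, ]                              # one bootstrap data set, same size as the original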
One of the great advantages of the bootstrap approach is that it can be applied in almost all situations. No complicated math is involved; there are only two steps in R, which we will go over below.
“Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of X and Y , respectively, where X and Y are random quantities. We will invest a fraction \({\alpha}\) of our money in X, and will invest the remaining 1-\({\alpha}\) in Y . Since there is variability associated with the returns on these two assets, we wish to choose \({\alpha}\) to minimize the total risk, or variance, of our investment to get the optimal return. In other words, we want to minimize Var(\({\alpha}\)X + (1-\({\alpha}\))Y).”
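Expanding that variance makes the estimator below drop out:
\[\mathrm{Var}(\alpha X + (1-\alpha)Y) = \alpha^2\sigma^2_X + (1-\alpha)^2\sigma^2_Y + 2\alpha(1-\alpha)\sigma_{XY}\]
Differentiating with respect to \({\alpha}\), setting the result to zero, and solving yields the minimizer \({\alpha} = (\sigma^2_Y - \sigma_{XY})/(\sigma^2_X + \sigma^2_Y - 2\sigma_{XY})\).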
We can estimate the value of \({\alpha}\) that minimizes the variance of our investment using computed estimates of the population variances and covariances from our sample of past measurements of X and Y:
\[\hat{\alpha}=\frac{\hat{\sigma}^2_Y-\hat{\sigma}_{XY}}{\hat{\sigma}^2_X + \hat{\sigma}^2_Y-2\hat{\sigma}_{XY}}\]
So, this is a non-linear function of the sample variances and covariance, and its sampling distribution would be hard to derive analytically if we want to know the standard error/variability of our estimate of \({\alpha}\).
library(ISLR)
setwd("~/Predictive Analytics/ISLR/data")
fix(Portfolio)
alpha = function(x, y){
  vx = var(x)
  vy = var(y)
  cxy = cov(x, y)
  (vy - cxy) / (vx + vy - 2 * cxy)  # the formula for alpha-hat given above
}
alpha(Portfolio$X, Portfolio$Y)
## [1] 0.5758321
We randomly select n observations from the data set in order to produce a bootstrap data set, \(Z^{*1}\). Note that if an observation is contained in \(Z^{*1}\), then both its X and Y values are included. We can use \(Z^{*1}\) to produce a new bootstrap estimate for \({\alpha}\), which we call \(\hat{\alpha}^{*1}\).
alpha.fn = function(data, index){
  with(data[index,], alpha(X, Y))
}
##arguments: the Portfolio data frame and a set of row indices (values from 1 to n)
##using only the rows named in index, compute alpha for X and Y
alpha.fn(Portfolio, 1:100)
## [1] 0.5758321
##the original estimate recomputed from observations 1 to n (here n = 100); matching the value above confirms the function works
set.seed(1)
##since using random sampling, set seed for reproducible results
##The next command uses the sample() function to randomly select 100 observations from the range 1 to 100, with replacement. This is equivalent to constructing a new bootstrap data set and recomputing alpha based on the new data set
alpha.fn(Portfolio, sample(1:100, 100, replace=TRUE))
## [1] 0.5963833
##that's one bootstrap sample of size n = 100; now let's repeat it 1000 times
This procedure is repeated B times for some large value of B (we'll use B = 1000) in order to produce B different bootstrap data sets, \(Z^{*1}\), \(Z^{*2}\),…, \(Z^{*B}\), and B corresponding estimates, \(\hat{\alpha}^{*1}\), \(\hat{\alpha}^{*2}\),…, \(\hat{\alpha}^{*B}\). We can compute the standard error of these bootstrap estimates using the formula:
\[SE_{B}(\hat{\alpha})= \sqrt{\frac{1}{B-1}\sum_{r=1}^B \left(\hat{\alpha}^{*r} - \frac{1}{B}\sum_{r'=1}^B \hat{\alpha}^{*r'} \right)^2}\]
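This formula is simple enough to compute without any packages; as a sanity check, here is a minimal hand-rolled version (assuming the alpha.fn() function and Portfolio data defined above):

set.seed(1)   # seed only for reproducibility of this sketch
B = 1000
n = nrow(Portfolio)
alpha.star = replicate(B, alpha.fn(Portfolio, sample(1:n, n, replace = TRUE)))
sd(alpha.star)  # sd() divides by B-1, so this is exactly the SE formula above

In practice the boot package handles the looping (and a bias estimate) for us: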
library(boot)
boot.out=boot(Portfolio, alpha.fn, R=1000)
boot.out
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Portfolio, statistic = alpha.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.5758321 -7.315422e-05 0.08861826
plot(boot.out)
##histogram shows a slight skew (the nonparametric bootstrap imposes no distributional shape) but is relatively normal
##Q-Q plot doesn't fall exactly on a straight line, so the bootstrap distribution isn't perfectly normal, but it's not bad
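Beyond a standard error, the boot package's boot.ci() function can also turn boot.out into a confidence interval for \({\alpha}\); the percentile interval is one assumption-light choice:

boot.ci(boot.out, type = "perc")  # percentile bootstrap CI for alpha-hat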
The bootstrap approach can be used to assess the variability of the coefficient estimates and predictions from a statistical learning method. Here we use the bootstrap approach in order to assess the variability of the estimates for \({\beta_0}\) and \({\beta_1}\), two constants that represent the intercept and slope terms in a linear model.
setwd("~/Predictive Analytics/bootstrap lab")
studdata=read.csv ("intro to psych student data.csv", header =T)
dim(studdata)
## [1] 440 40
studdata=read.csv ("intro to psych student data.csv", header =T, na.strings = c("99","999"))
studdata=na.omit (studdata)
dim(studdata)
## [1] 436 40
head (studdata)
## subject age sex race yearsch leavehom befoleav job income facescoh
## 1 1 18 2 1 1 1 1 840 4 22
## 2 2 18 1 1 1 1 1 650 3 28
## 3 3 18 2 1 1 1 1 87 4 44
## 4 4 18 2 1 1 2 9 437 3 42
## 5 5 18 2 1 1 1 1 405 6 48
## 6 7 21 2 1 2 2 9 921 6 12
## facesada hlcssg hlcssa hlcsed hlcsd hlcsfi hlcssf hlcssr hlcsg eamtotal
## 1 34 27 17.0 7 20 8 2 6 6.00 56.00
## 2 32 25 9.0 30 21 17 10 18 10.00 56.00
## 3 25 36 18.0 6 14 24 6 3 11.00 46.22
## 4 22 38 12.0 25 24 25 11 12 14.00 53.00
## 5 31 32 14.4 6 13 4 2 3 11.76 35.22
## 6 22 42 9.0 24 35 28 5 21 8.00 64.03
## eaftotal rssep rseng rsenm rsnur rsselfc rsdd rshs prqmtotl prqftotl
## 1 68 26 19 34 22 25 24 41 180 148
## 2 57 25 21 40 23 31 30 43 179 174
## 3 43 32 18 27 26 30 28 42 212 212
## 4 56 30 32 39 23 30 26 41 189 159
## 5 34 17 11 36 24 39 19 52 205 206
## 6 58 17 30 34 8 35 15 40 66 91
## atctot iseltot rsetotal paqm paqf ulstotal bd1total felnowtl felgentl
## 1 69.00 74 24 26.79 34 43 19.00 48.00 52
## 2 76.00 76 27 28.00 32 33 13.00 42.00 52
## 3 82.00 73 31 26.00 37 37 1.00 35.83 34
## 4 78.00 79 28 26.00 32 38 18.00 40.00 51
## 5 80.39 78 39 28.00 39 20 5.41 32.00 33
## 6 81.78 78 37 33.00 39 24 1.00 30.00 30
## wahlertl
## 1 43.00
## 2 74.00
## 3 54.49
## 4 75.00
## 5 50.00
## 6 23.00
LEAVEHOM = whether participant left home to attend college
FAMSTRU = family structure
JOB = social class (higher value = higher social class; range = 71-960)
FACESCOH = family cohesion scale
FACESADA = family adaptability scale
HLCS = home-leaving cognitions scale
* SG = self-governance
* SA = school affiliation
* ED = emotional detachment
* D = disengagement
* FI = financial independence
RS = relationship survey
* SEP= separation anxiety
* ENG= engulfment anxiety
* ENM = enmeshment seeking
* DD = dependency denial
EAMTOTAL = emotional autonomy from mother
PRQMTOTL = attachment to mother
ATCTOT = adaptability to change
ISELTOT = social support
EAFTOTAL = emotional autonomy from father
PRQFTOTL = attachment to father
RSETOTAL = self-esteem
ULSTOTAL = loneliness
BD1TOTAL = Beck Depression Inventory
FELNOWTL = state anxiety
FELGENTL = trait anxiety
Simple Linear Regression
Predictor: attachment to mother
Response: relationship survey, engulfment anxiety
lm.fit = lm(rseng ~ prqmtotl, data = studdata)
summary(lm.fit)
##
## Call:
## lm(formula = rseng ~ prqmtotl, data = studdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.948 -3.581 0.305 3.402 17.305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.046240 1.505254 25.94 <2e-16 ***
## prqmtotl -0.097097 0.008385 -11.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.908 on 434 degrees of freedom
## Multiple R-squared: 0.2361, Adjusted R-squared: 0.2343
## F-statistic: 134.1 on 1 and 434 DF, p-value: < 2.2e-16
##[x] p-value significant
##[x] RSE (lack of fit) small
##[x] 24% variability explained
##[x] there's a relationship between predictor and response
with(studdata, plot(prqmtotl, rseng))
abline(lm.fit)
## as self-reported attachment to mother increases, engulfment anxiety decreases
We first create a simple function, boot.fn(), which takes in the studdata data set as well as a set of indices for the observations, and returns the intercept and slope estimates for the linear regression model. We then apply this function to the full set of 436 observations in order to compute the estimates of \({\beta_0}\) and \({\beta_1}\) on the entire data set.
boot.fn = function(data, index)
  return(coef(lm(rseng ~ prqmtotl, data = data, subset = index)))
boot.fn(studdata, 1:436)
## (Intercept) prqmtotl
## 39.0462397 -0.0970967
##Same values for intercept and slope obtained in summary of regression above
The boot.fn() function can also be used to create bootstrap estimates for the intercept and slope terms by randomly sampling from among the observations with replacement. Let's do a couple of individual examples:
set.seed(1)
boot.fn(studdata, sample(436, 436, replace = TRUE))
## (Intercept) prqmtotl
## 35.63248042 -0.07856354
boot.fn(studdata, sample(436, 436, replace = TRUE))
## (Intercept) prqmtotl
## 41.0146992 -0.1082734
Next, we can use the boot() function to compute the standard errors of 1,000 bootstrap estimates for the intercept and slope terms:
boot(studdata, boot.fn, R = 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = studdata, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 39.0462397 0.0796258414 1.807120559
## t2* -0.0970967 -0.0004083515 0.009887761
Let's compare these bootstrapped SEs to the SEs obtained using the standard formulas for the intercept and slope: \[SE(\hat{\beta_0})^2 = {\sigma}^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right]\]
\[SE(\hat{\beta_1})^2 = \frac{{\sigma}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\]
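These formulas are easy to check by hand; here is a small sketch (using the lm.fit and studdata objects from above, with \(\sigma^2\) estimated from the residuals on n - 2 degrees of freedom):

x = studdata$prqmtotl
n = length(x)
sigma2 = sum(residuals(lm.fit)^2) / (n - 2)              # estimate of the noise variance
sxx = sum((x - mean(x))^2)
sqrt(c(sigma2 * (1/n + mean(x)^2 / sxx), sigma2 / sxx))  # SE(beta0-hat), SE(beta1-hat)

These should reproduce the Std. Error column that summary() reports: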
summary(lm(rseng ~ prqmtotl, data = studdata))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.0462397 1.505253723 25.93997 2.855331e-90
## prqmtotl -0.0970967 0.008384686 -11.58024 3.296303e-27
These are somewhat different from the estimates obtained using the bootstrap. The standard formulas given above rely on certain assumptions. For example, they depend on the unknown parameter \(\sigma^2\), the noise variance. The standard formulas also assume (somewhat unrealistically) that the \(x_i\) are fixed, and all the variability comes from the variation in the errors \(\epsilon_i\). The bootstrap approach does not rely on any of these assumptions, and so it is likely giving a more accurate estimate of the standard errors of \(\hat{\beta_0}\) and \(\hat{\beta_1}\) than is the summary() function.
Below we compute the bootstrap standard error estimates and the standard linear regression estimates that result from fitting the quadratic model to the data.
lm.fit2 = lm(rseng ~ prqmtotl + I(prqmtotl^2), data = studdata)
summary(lm.fit2)
##
## Call:
## lm(formula = rseng ~ prqmtotl + I(prqmtotl^2), data = studdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.9410 -3.5715 0.4044 3.2862 17.1606
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.2201157 5.2793292 4.967 9.82e-07 ***
## prqmtotl 0.0681736 0.0657684 1.037 0.3005
## I(prqmtotl^2) -0.0005113 0.0002018 -2.533 0.0117 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.878 on 433 degrees of freedom
## Multiple R-squared: 0.2472, Adjusted R-squared: 0.2437
## F-statistic: 71.1 on 2 and 433 DF, p-value: < 2.2e-16
## the linear term's SE jumped and its significance disappeared (it is highly collinear with its square); the quadratic term is significant; RSE dropped slightly and R-squared rose slightly; F went down, as expected with an extra term
boot.fn = function(data, index)
  coefficients(lm(rseng ~ prqmtotl + I(prqmtotl^2), data = data, subset = index))
set.seed(1)
boot(studdata, boot.fn, R = 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = studdata, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 26.2201156975 4.728690e-01 5.967709971
## t2* 0.0681736160 -5.487872e-03 0.071656648
## t3* -0.0005113391 1.572339e-05 0.000213157
Since this model actually fits the data slightly better (lower RSE, higher R-squared), there is now closer correspondence between the bootstrap estimates and the standard estimates of SE(\(\hat{\beta_0}\)), SE(\(\hat{\beta_1}\)) and SE(\(\hat{\beta_2}\)): the bootstrap SEs run only modestly above the formula-based ones.
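One convenient way to put the two sets of SEs side by side is to store the boot() result and bind its replicate standard deviations to the summary() table; boot.out2 is a name introduced here just for this sketch:

set.seed(1)
boot.out2 = boot(studdata, boot.fn, R = 1000)     # same call as above, result stored
cbind(formula.SE = summary(lm.fit2)$coef[, "Std. Error"],
      bootstrap.SE = apply(boot.out2$t, 2, sd))   # boot.out2$t is the 1000 x 3 matrix of replicates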