Bootstrapping Lab

Resampling Refresher

Resampling methods involve repeatedly drawing samples from a training set and refitting a model on each sample in order to obtain additional information about the fitted model. For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ, which we will do in the second part of this lab.

About Bootstrapping

The bootstrap is used in several contexts, most commonly to provide a measure of accuracy of a parameter estimate or of a given statistical learning method. The power of the bootstrap is that it can easily be applied to a wide range of statistical learning methods, including some in which a measure of variability is otherwise difficult to obtain.

In real life, we usually can’t just get new data by resampling the population over and over. However, the bootstrap approach allows us to mimic the process of obtaining new data sets, so that we can estimate the variability of our estimate without generating additional samples.

Rather than repeatedly obtaining independent data sets from the population, bootstrapping has us obtain distinct data sets by repeatedly sampling observations from the original data set, of the same size as the original data set, treating the sample as a population to resample from.

Each of these “bootstrap data sets” is created by sampling with replacement, and as a result some observations may appear more than once in a given bootstrap data set while others might not appear at all.

From slides.

One of the great advantages of the bootstrap approach is that it can be applied in almost all situations. No complicated math is involved, there is only two steps in R, which we will go over below.

Estimating the Accuracy of a Statistic of Interest

“Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of X and Y , respectively, where X and Y are random quantities. We will invest a fraction \({\alpha}\) of our money in X, and will invest the remaining 1-\({\alpha}\) in Y . Since there is variability associated with the returns on these two assets, we wish to choose \({\alpha}\) to minimize the total risk, or variance, of our investment to get the optimal return. In other words, we want to minimize Var(\({\alpha}\)X + (1-\({\alpha}\))Y).”

We can estimate the value of \({\alpha}\) that minimizes the variance of our investment using computed estimates of the population variances and covariances from our sample of past measurements of X and Y:

\[\hat{\alpha}=\frac{\hat{\sigma}^2_Y-\hat{\sigma}_{XY}}{\hat{\sigma}^2_X + \hat{\sigma}^2_Y-2\hat{\sigma}_{XY}}\]

So, this is a non-linear equation which might be hard to get at the sampling distrubiton of if we want to know the standard error/variability of our estimate of \({\alpha}\).

First, we must create a function that computes the statistic of interest:

library (ISLR)
setwd("~/Predictive Analytics/ISLR/data")
fix (Portfolio)

alpha = function(x, y){
  vx = var(x)
  vy = var(y)
  cxy = cov(x, y)
  (vy - cxy) / (vx - vy - 2 * cxy)
}

alpha(Portfolio$X, Portfolio$Y)

## [1] -0.476069

What’s the standard error of alpha?

We randomly select n observations from the data set in order to produce a bootstrap data set, \(Z^{*1}\). Note that if an observation is contained in \(Z^{*1}\), then both its X and Y values are included. We can use \(Z^{*1}\) to produce a new bootstrap estimate for \({\alpha}\), which we call \(\hat{\alpha}^{*1}\).

alpha.fn = function(data, index){
  with(data[index,], alpha(X, Y))
}
##portfolio data frame, rows of the data frame (Values of 1 to N) 
##with the data from the index, compute alpha for X and Y

alpha.fn(Portfolio, 1:100)

## [1] -0.476069

##original function using data 1 to N (100), if same as before, working 

set.seed(1)
##since using random sampling, set seed for reproducible results 

##The next command uses the sample() function to randomly select 100 observations from the range 1 to 100, with replacement. This is equivalent to constructing a new bootstrap data set and recomputing alpha based on the new data set
alpha.fn(Portfolio, sample(1:100, 100, replace=TRUE))

## [1] -0.5332013

##single bootstrap sampe of size one, now let's do it 1000x

Second, we use the boot function, which is part of the boot library, to perform the bootstrap by repeatedly sampling observations from the data set with replacement:

This procedure is repeated B times for some large value of B, we’ll do 1000, in order to produce B different bootstrap data sets, \(Z^{*1}\), \(Z^{*2}\),…, \(Z^{*B}\), and B corresponding \({\alpha}\) estimates, \(\hat{\alpha}^{*1}\), \(\hat{\alpha}^{*2}\),…, \(\hat{\alpha}^{*B}\). We can compute the standard error of these bootstrap estimates using the formula:

\[SE_{B}(\hat{\alpha})= \sqrt{\frac{1}{B-1}\sum_{r=1}^B \left(\hat{\alpha}^{*r} - \frac{1}{B}\sum_{r'=1}^B \hat{\alpha}^{*r'} \right)}\]

library(boot)
boot.out=boot(Portfolio, alpha.fn, R=1000)
boot.out

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Portfolio, statistic = alpha.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##      original     bias    std. error
## t1* -0.476069 -0.0226729   0.1275871

plot (boot.out)

##slight skew b/c no parameters but relatively normal
##Q-Q plot doesn't exactly line up in a straight line so distribution isn't perfectly normal but not too bad

Estimating the Accuracy of a Linear Regression Model

The bootstrap approach can be used to assess the variability of the coefficient estimates and predictions from a statistical learning method. Here we use the bootstrap approach in order to assess the variability of the estimates for \({\beta_0}\) and \({\beta_1}\), two constants that represent the intercept and slope terms in a linear model.

Starting Out

setwd("~/Predictive Analytics/bootstrap lab")
studdata=read.csv ("intro to psych student data.csv", header =T) 
dim(studdata)

## [1] 440  40

studdata=read.csv ("intro to psych student data.csv", header =T, na.strings = c("99","999")) 
studdata=na.omit (studdata)
dim(studdata)

## [1] 436  40

head (studdata)

##   subject age sex race yearsch leavehom befoleav job income facescoh
## 1       1  18   2    1       1        1        1 840      4       22
## 2       2  18   1    1       1        1        1 650      3       28
## 3       3  18   2    1       1        1        1  87      4       44
## 4       4  18   2    1       1        2        9 437      3       42
## 5       5  18   2    1       1        1        1 405      6       48
## 6       7  21   2    1       2        2        9 921      6       12
##   facesada hlcssg hlcssa hlcsed hlcsd hlcsfi hlcssf hlcssr hlcsg eamtotal
## 1       34     27   17.0      7    20      8      2      6  6.00    56.00
## 2       32     25    9.0     30    21     17     10     18 10.00    56.00
## 3       25     36   18.0      6    14     24      6      3 11.00    46.22
## 4       22     38   12.0     25    24     25     11     12 14.00    53.00
## 5       31     32   14.4      6    13      4      2      3 11.76    35.22
## 6       22     42    9.0     24    35     28      5     21  8.00    64.03
##   eaftotal rssep rseng rsenm rsnur rsselfc rsdd rshs prqmtotl prqftotl
## 1       68    26    19    34    22      25   24   41      180      148
## 2       57    25    21    40    23      31   30   43      179      174
## 3       43    32    18    27    26      30   28   42      212      212
## 4       56    30    32    39    23      30   26   41      189      159
## 5       34    17    11    36    24      39   19   52      205      206
## 6       58    17    30    34     8      35   15   40       66       91
##   atctot iseltot rsetotal  paqm paqf ulstotal bd1total felnowtl felgentl
## 1  69.00      74       24 26.79   34       43    19.00    48.00       52
## 2  76.00      76       27 28.00   32       33    13.00    42.00       52
## 3  82.00      73       31 26.00   37       37     1.00    35.83       34
## 4  78.00      79       28 26.00   32       38    18.00    40.00       51
## 5  80.39      78       39 28.00   39       20     5.41    32.00       33
## 6  81.78      78       37 33.00   39       24     1.00    30.00       30
##   wahlertl
## 1    43.00
## 2    74.00
## 3    54.49
## 4    75.00
## 5    50.00
## 6    23.00

Variables

LEAVEHOM = whether participant left home to attend college
FAMSTRU = family structure
JOB = social class (higher value =higher social class; range = 71 - 960)
FACESCOH = family cohesion scale
FACESADA = family adaptability scale
HLCS = home-leaving cognitions scale
* SG = self-governance
* SA = school affiliation
* ED = emotional detachment
* D = disengagement
* FI = financial independence
RS = relationship survey
* SEP= separation anxiety
* ENG= engulfment anxiety
* ENM = enmeshment seeking
* DD = dependency denial
EAMTOTAL = emotional autonomy from mother
PRQMTOTL = attachment to mother
ATCTOT = adaptability to change
ISELTOT = social support
EAFTOTAL = emotional autonomy from father
PRQFTOTL = attachment to father
RSETOTAL = self-esteem
ULSTOTAL = loneliness
BD1TOTAL = Beck Depression Inventory
FELNOWTL = state anxiety
FELGENTL = trait anxiety

Simple Linear Regression
Predictor: attachment to mother
Response: relationship survey, engulfment anxiety

lm.fit =lm(rseng~prqmtotl, data=studdata)

summary(lm.fit)

## 
## Call:
## lm(formula = rseng ~ prqmtotl, data = studdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.948  -3.581   0.305   3.402  17.305 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.046240   1.505254   25.94   <2e-16 ***
## prqmtotl    -0.097097   0.008385  -11.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.908 on 434 degrees of freedom
## Multiple R-squared:  0.2361, Adjusted R-squared:  0.2343 
## F-statistic: 134.1 on 1 and 434 DF,  p-value: < 2.2e-16

##[x] p-value significant 
##[x] RSE (lack of fit) small 
##[x] 24% variability explained 
##[x] there's a relationship between predictor and response 

with(studdata,plot(prqmtotl, rseng))
abline (lm.fit)

## as self reported attachment to mother increases, engulfment anxiety decreases

Bootstrapping

We first create a simple function, boot.fn(), which takes in the studdata data set as well as a set of indices for the observations, and returns the intercept and slope estimates for the linear regression model. We then apply this function to the full set of 436 observations in order to compute the estimates of \({\beta_0}\) and \({\beta_1}\) on the entire data set.

boot.fn=function (data ,index)
return(coef(lm(rseng~prqmtotl,data=data,subset=index)))
boot.fn(studdata ,1:436)

## (Intercept)    prqmtotl 
##  39.0462397  -0.0970967

##Same values for intercept and slope obtained in summary of regression above

The boot.fn() function can also be used in order to create bootstrap estimates for the intercept and slope terms by randomly sampling from among the observations with replacement, let’s do a couple individual examples:

set.seed (1)
boot.fn(studdata ,sample (436 ,436 , replace =T))

## (Intercept)    prqmtotl 
## 35.63248042 -0.07856354

boot.fn(studdata ,sample (436 ,436 , replace =T))

## (Intercept)    prqmtotl 
##  41.0146992  -0.1082734

Next,we can use the boot() function to compute the standard errors of 1,000 of these bootstrap estimates for the intercept and slope terms:

boot(studdata ,boot.fn ,1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = studdata, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##       original        bias    std. error
## t1* 39.0462397  0.0796258414 1.807120559
## t2* -0.0970967 -0.0004083515 0.009887761

Let’s compare these bootstrapped SEs to the SEs obtained using the standard SE equations for slope and intercept: \[SE (\hat{\beta_0}) = {\sigma}^2 \left (\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x}^2)}\right)\]

\[SE (\hat{\beta_1}) = \left (\frac{{\sigma}^2}{\sum_{i=1}^n (x_i - \bar{x}^2)}\right)\]

summary (lm(rseng~prqmtotl ,data=studdata))$coef

##               Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) 39.0462397 1.505253723  25.93997 2.855331e-90
## prqmtotl    -0.0970967 0.008384686 -11.58024 3.296303e-27

These are somewhat different from the estimates obtained using the bootstrap. The standard formulas given above rely on certain assumptions. For example, they depend on the unknown parameter \(\sigma^2\), the noise variance. The standard formulas also assume (somewhat unrealistically) that the \(x_i\) are fixed, and all the variability comes from the variation in the errors \(\epsilon_i\). The bootstrap approach does not rely on any of these assumptions, and so it is likely giving a more accurate estimate of the standard errors of \(\hat{\beta_0}\) and \(\hat{\beta_1}\) than is the summary() function.

Below we compute the bootstrap standard error estimates and the standard linear regression estimates that result from fitting the quadratic model to the data.

lm.fit2=lm(rseng~prqmtotl +I(prqmtotl^2),  data=studdata)
summary (lm.fit2)

## 
## Call:
## lm(formula = rseng ~ prqmtotl + I(prqmtotl^2), data = studdata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9410  -3.5715   0.4044   3.2862  17.1606 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   26.2201157  5.2793292   4.967 9.82e-07 ***
## prqmtotl       0.0681736  0.0657684   1.037   0.3005    
## I(prqmtotl^2) -0.0005113  0.0002018  -2.533   0.0117 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.878 on 433 degrees of freedom
## Multiple R-squared:  0.2472, Adjusted R-squared:  0.2437 
## F-statistic:  71.1 on 2 and 433 DF,  p-value: < 2.2e-16

## worse fit, standard error went up, significance went away for predictor, F went down

boot.fn=function (data ,index )
 coefficients(lm(rseng~prqmtotl +I(prqmtotl^2),data=data, subset=index))
set.seed (1)
boot(studdata,boot.fn,1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = studdata, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias    std. error
## t1* 26.2201156975  4.728690e-01 5.967709971
## t2*  0.0681736160 -5.487872e-03 0.071656648
## t3* -0.0005113391  1.572339e-05 0.000213157

Since this model provides a worse fit to the data, there is now less correspondence between the bootstrap estimates and the standard estimates of SE(\(\hat{\beta_0}\)), SE(\(\hat{\beta_1}\)) and SE(\(\hat{\beta_2}\)).