Abstract


In this report I will review resampling methods and provide examples using the boot package. In a previous report, A Multiple Regression Analysis of NYC’s 2012 Average SAT Math Scores, I demonstrated resampling using the caret package. That report can be found here:

https://rpubs.com/pjozefek/574750.

What is Resampling and Cross-Validation?


Resampling

Resampling is very much what it sounds like: repeatedly drawing samples from a dataset and refitting a model on each sample. This helps with model assessment (evaluating a model’s performance) and model selection (selecting a model’s ideal level of flexibility).

Cross-Validation

We typically build a model using sample data, but our objective is to find a model that performs well on new data. Since we usually do not have access to new data we have to simulate it by splitting our sample dataset into a training set and a test set. We will build the model using the training set and evaluate its performance on the test set. In this case, the test set will be a substitute for new data.

Validation Set Method


The Validation Set Method is the simplest approach. We randomly divide the dataset into two parts, training and test, and use the MSE (mean squared error) to evaluate model performance. It is easy to implement, but it has two main problems:

  • MSE can vary widely based on the random split.

  • We fit the model to a subset of data. Models tend to perform worse when trained on fewer observations.

Validation Set Example

> #Validation Set Approach
> library(ISLR) #load library for data
> library(knitr) #load library for table formatting
> library(kableExtra) #load library for table formatting
> 
> set.seed(1) #make random number generator replicable
> train=sample(392,196) #create random subset of 196 out of 392
> 
> # I will be using the Auto dataset
> kable(head(Auto)) %>%
+   kable_styling(bootstrap_options = c("striped","condensed"), 
+   full_width = F, font_size = 10, position = "left")%>%
+   row_spec(0,background="#E6E6FA") #format and display first 6
mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin  name
 18          8           307         130    3504          12.0    70       1  chevrolet chevelle malibu
 15          8           350         165    3693          11.5    70       1  buick skylark 320
 18          8           318         150    3436          11.0    70       1  plymouth satellite
 16          8           304         150    3433          12.0    70       1  amc rebel sst
 17          8           302         140    3449          10.5    70       1  ford torino
 15          8           429         198    4341          10.0    70       1  ford galaxie 500

I first created a random sample of 196 indices out of the 392 observations in the Auto dataset to serve as the training set.

> #fit model with sample of 196
> lm.fit=lm(mpg~horsepower,data=Auto,subset=train) 
> 
> #Use the predict function to estimate the response for 
> #all 392 observations
> #Calculate the squared error for observations not in
> #the training set (-train)
> #Use the mean function to calculate the Mean Squared Error (MSE)
> 
> MSEvs <- mean((Auto$mpg-predict(lm.fit,Auto))[-train]^2)
> MSEvs
[1] 23.26601

I then fit a linear regression with mpg as the dependent variable and horsepower as the independent variable, using the random subset (the training set).

We can then use the test set to compare the actual mpg values to the predicted values. We arrive at an MSE value of 23.2660086.

> #Use the poly function to compare with quadratic and 
> #cubic regressions
> 
> lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train)
> 
> #quadratic MSE
> MSEq <-mean((Auto$mpg-predict(lm.fit2,Auto))[-train]^2) 
> MSEq
[1] 18.71646
> lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train)
> 
> #cubic MSE
> MSEc <-mean((Auto$mpg-predict(lm.fit3,Auto))[-train]^2) 
> MSEc
[1] 18.79401

We can then compare it to quadratic and cubic regressions:

\[mpg = \beta_0+\beta_1 \times horsepower + \beta_2 \times horsepower^2 +\epsilon\]

\[mpg = \beta_0+\beta_1 \times horsepower + \beta_2 \times horsepower^2 + \beta_3 \times horsepower^3 +\epsilon\]

The quadratic regression had an MSE of 18.7164595 and the cubic regression had an MSE of 18.7940068. Both are superior to the linear regression.

> #Choose a different random split
> set.seed(2) #make random number generator replicable
> train=sample(392,196) #create random subset of 196 out of 392
> 
> lm.fit=lm(mpg~horsepower,data=Auto,subset=train) 
> #MSE linear
> mean((Auto$mpg-predict(lm.fit,Auto))[-train]^2)
[1] 25.72651
> lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train)
> #MSE quadratic
> mean((Auto$mpg-predict(lm.fit2,Auto))[-train]^2)
[1] 20.43036
> lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train)
> #MSE cubic
> mean((Auto$mpg-predict(lm.fit3,Auto))[-train]^2)
[1] 20.38533

If we recalculate the MSE values using a different random split, we obtain different values, illustrating how sensitive the Validation Set approach is to the particular split.

Leave-One-Out Cross-Validation (LOOCV)


This method leaves out one observation for the test set and uses the remaining \((n-1)\) observations for the training set. Since the MSE is based on a single observation it will be extremely variable. However, we can repeat the process \(n\) times and average the \(n\) squared errors.

\[CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}MSE_i\] It has two advantages over the Validation Set method:

  • It uses almost the entire dataset to build the model, so it will have less bias.

  • Since it uses every data point as a test observation, rather than relying on a random split, it will always yield the same results.
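
To make the \(CV_{(n)}\) formula above concrete, here is a minimal by-hand sketch of LOOCV for the simple linear regression of mpg on horsepower. This is my own illustration (the loop and variable names are not part of the original lab); the cv.glm function used in the next example automates all of it.

> #Manual LOOCV sketch - cv.glm (used below) automates this
> n <- nrow(Auto)
> loocv.errors <- rep(0, n)
> for (i in 1:n){
+   fit <- lm(mpg~horsepower, data=Auto[-i, ]) #fit on all but observation i
+   pred <- predict(fit, Auto[i, ]) #predict the held-out observation
+   loocv.errors[i] <- (Auto$mpg[i] - pred)^2 #squared error for observation i
+ }
> loocv.mse <- mean(loocv.errors) #CV(n); should match the cv.glm value below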

LOOCV Example

> #Leave-One-Out Cross-Validation
> library(boot) #load library for cross validation
> #Use glm instead of lm because it can be used 
> #with the cv.glm function
> 
> glm.fit = glm(mpg~horsepower,data=Auto)
> 
> #cross validation
> cv.err = cv.glm(Auto,glm.fit) #default K = number of observations
> 
> MSEl <-cv.err$delta[1] #Mean squared error
> MSEl
[1] 24.23151

We can use the cv.glm function from the boot package to perform cross validation on a model (in this case a linear regression). The default value for \(k\) is \(n\), the number of observations, which represents leave-one-out cross-validation. The resulting MSE value is 24.2315135.

> #Repeat for different polynomial fits using a loop
> cv.error=rep(0,5) #repeat 0 five times
> for (i in 1:5){ #loop 5 times
+   glm.fit=glm(mpg~poly(horsepower,i),data=Auto) #fit poly 1 to 5
+   cv.error[i]=cv.glm(Auto,glm.fit)$delta[1] #calculate and record MSE
+ }
> cv.error #output MSE for polynomials of order 1 to 5
[1] 24.23151 19.24821 19.33498 19.42443 19.03321

We can compare the MSE values for the first 5 polynomial regressions by using a loop. The linear regression performs the worst.

k-Fold Cross-Validation


To apply this method the dataset must be randomly split into \(k\) groups (\(folds\)) of approximately equal size. One fold is used for the test set and the remaining \(k-1\) are used for training.

The MSE is calculated on the test set and the process is repeated \(k\) times (each time with a different test fold).

Each of the \(k\) MSE values are then averaged.

\[CV_{(k)}=\frac{1}{k}\sum_{i=1}^{k}MSE_i\] LOOCV is a form of k-fold cross-validation where \(k=n\).
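
As a sketch of the mechanics, here is one possible by-hand version of 10-fold CV for the same linear regression. The fold assignment, seed, and variable names are my own illustration; the cv.glm function with K set, shown in the example further below, automates this.

> #Manual 10-fold CV sketch - cv.glm with K=10 (used below) automates this
> set.seed(3) #seed chosen only for this illustration
> k <- 10
> folds <- sample(rep(1:k, length.out=nrow(Auto))) #randomly assign each row to a fold
> fold.mse <- rep(0, k)
> for (j in 1:k){
+   fit <- lm(mpg~horsepower, data=Auto[folds!=j, ]) #train on the other k-1 folds
+   pred <- predict(fit, Auto[folds==j, ]) #predict the held-out fold
+   fold.mse[j] <- mean((Auto$mpg[folds==j] - pred)^2) #MSE for fold j
+ }
> cv.k <- mean(fold.mse) #CV(k) = average of the k fold MSEs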

The Bias-Variance Trade-Off for k-Fold Cross-Validation

Variance: The amount by which the model would change if we estimated it with a different training dataset. More flexible statistical methods tend to have higher variance.

Bias: The error that is introduced by approximating the true model (unknown) with a simpler model. More flexible statistical methods tend to have lower bias.

As mentioned previously (when discussing the Validation Set approach), models tend to perform worse when trained on fewer observations. Since k-fold CV models are trained on fewer observations than LOOCV models, k-fold CV tends to have more bias.

However, k-fold CV tends to have lower variance. Each LOOCV model is trained on nearly the same set of observations, so the \(n\) MSE values being averaged are highly correlated, and the mean of highly correlated quantities has relatively high variance. The \(k\) MSE values averaged in k-fold CV come from training sets that overlap less, so they are less correlated and their average has lower variance.

In general, k-fold CV gives more accurate estimates of the error rate than LOOCV.

k-Fold CV Example

> #K-fold cross validation
> set.seed(20) #make random number generator replicable
> cv.error.10=rep(0,10) #repeat 0 ten times
> for (i in 1:10){
+   glm.fit=glm(mpg~poly(horsepower,i),data=Auto) #fit poly 1 to 10
+   #Set K=10
+   cv.error.10[i]=cv.glm(Auto,glm.fit, K=10)$delta[1] #calculate and record MSE
+ }
> cv.error.10
 [1] 24.23568 19.17760 19.21438 19.63799 19.08680 19.70131 19.12513 19.31875
 [9] 21.07359 19.24301

In cv.glm the default value for \(k\) is \(n\), but we can set the K argument to perform k-fold cross-validation with fewer folds. In this case I used \(k=10\) and again used a loop to evaluate different polynomial regressions.

Cross-Validation on Classification Problems


When the response is qualitative rather than quantitative, the MSE can be replaced with the error rate. For example, the LOOCV error rate can be calculated by averaging the error for each held-out observation.

\[CV_{(n)}=\frac{1}{n}\sum_{i=1}^{n}Err_i\] where \(Err_i=1\) if the predicted value is incorrect and \(0\) otherwise.

Similar adjustments can be made for k-fold cross-validation.
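
As a sketch of how this works in practice, cv.glm accepts a cost argument, so a cross-validated error rate can be computed for a logistic regression. The example below is illustrative only; it assumes the Default dataset from the ISLR package (not analyzed elsewhere in this report) and a 0.5 probability cutoff.

> #Cross-validated error rate for a logistic regression (sketch)
> #cost returns the proportion of misclassified observations at a 0.5 cutoff
> cost <- function(r, pi) mean(abs(r - pi) > 0.5)
> 
> glm.class <- glm(default~balance, data=Default, family=binomial)
> cv.class <- cv.glm(Default, glm.class, cost=cost, K=10)$delta[1] #10-fold error rate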

Bootstrapping


Bootstrapping relies on repeatedly drawing random samples with replacement, each of the same size as the observed dataset, and recomputing a statistic on each sample. It is used to estimate properties of a statistic, such as its standard error.
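
As a minimal sketch of the idea, each bootstrap sample is drawn with replacement and has the same size as the original data, and the spread of the statistic across resamples estimates its standard error. The vector x and the variable names below are purely illustrative (simulated data, not from the datasets used in this report).

> #Bootstrap sketch: estimate the standard error of a sample mean
> set.seed(10) #seed chosen only for this illustration
> x <- rnorm(30) #a small simulated dataset
> #Draw 1000 resamples with replacement (each the same size as x)
> #and recompute the mean on each one
> boot.means <- replicate(1000, mean(sample(x, length(x), replace=TRUE)))
> se.boot <- sd(boot.means) #bootstrap estimate of the SE of the sample mean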

Bootstrapping Example Using Portfolio

> # I will be using the Portfolio dataset
> kable(head(Portfolio)) %>%
+   kable_styling(bootstrap_options = c("striped","condensed"), 
+   full_width = F, font_size = 12, position="left")%>%
+   row_spec(0,background="#E6E6FA") #format and display first 6
         X           Y
-0.8952509  -0.2349235
-1.5624543  -0.8851760
-0.4170899   0.2718880
 1.0443557  -0.7341975
-0.3155684   0.8419834
-1.7371238  -2.0371910

In the Portfolio dataset \(X\) and \(Y\) represent the returns of two financial assets.

Suppose we want to invest a fraction \(\alpha\) of our money in \(X\) and the remaining \(1-\alpha\) in \(Y\). To choose the value of \(\alpha\) that minimizes the portfolio risk (variance), we can solve for it using this formula:

\[\alpha=\frac{\sigma_Y^2-\sigma_{XY}}{\sigma_X^2+\sigma_Y^2-2\sigma_{XY}}\] where \(\sigma_X^2 = VAR(X)\), \(\sigma_Y^2 = VAR(Y)\), and \(\sigma_{XY}=COV(X,Y)\)

> #Function takes a dataset and an index vector which
> #indicates which observations should be used
> #Outputs the estimate for alpha based on the selected observations
> 
> alpha.fn=function(data,index){
+   X=data$X[index]
+   Y=data$Y[index]
+   return((var(Y)-cov(X,Y))/(var(X)+var(Y)-2*cov(X,Y)))
+ }

We can start by creating a function which will calculate \(\alpha\).

> Alp <- alpha.fn(Portfolio,1:100) #using all 100 observations
> Alp
[1] 0.5758321

Using all 100 observations we can see that \(\hat\alpha\) equals 0.5758321.

> set.seed(1)
> #randomly select 100 observations with replacement
> Alphat <- alpha.fn(Portfolio,sample(100,100,replace=TRUE))
> Alphat
[1] 0.7368375

If we randomly select 100 observations with replacement, we obtain an \(\hat\alpha\) estimate of 0.7368375.

> #boot( ) calls the statistic function R times. 
> #Each time, it generates a set of random indices, with replacement, 
> #from the integers 1:nrow(data). These indices are used within 
> #the statistic function to select a sample. The statistics are 
> #calculated on the sample and the results are accumulated in 
> #the bootobject.
> 
> bootcorr<- boot(Portfolio,alpha.fn,R=1000)
> bootcorr #summary

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Portfolio, statistic = alpha.fn, R = 1000)


Bootstrap Statistics :
     original       bias    std. error
t1* 0.5758321 -0.001695873  0.09366347
> bootcorr$t0 # alpha estimate
[1] 0.5758321
> mean(bootcorr$t) - bootcorr$t0 #bias
[1] -0.001695873
> sd(bootcorr$t) # standard error
[1] 0.09366347

We can implement a bootstrap analysis by calling the alpha.fn function many times, each time on a random sample drawn with replacement, and computing the standard deviation of all the resulting \(\hat\alpha\) values. Fortunately, the boot function automates this.

The function results in an \(\hat\alpha\) estimate of 0.5758321 and an \(SE(\hat\alpha)\) estimate of 0.0936635.
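
For comparison, a by-hand version of the same analysis (my own sketch, not part of the original lab) simply repeats the resampling step with replicate and takes the standard deviation of the replicates; it should produce a standard error close to the one reported by boot above.

> #Manual equivalent of the boot() call above (sketch)
> set.seed(1)
> alpha.reps <- replicate(1000, alpha.fn(Portfolio, sample(100, 100, replace=TRUE)))
> se.alpha <- sd(alpha.reps) #should be close to the std. error reported by boot()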

Bootstrapping Example Using Auto

We can also use bootstrapping to estimate the variability of \(\beta_0\) and \(\beta_1\) for a linear regression.

> #Create function which takes a dataset and set of indices
> #Returns the intercept and slope estimates for linear regression
> 
> boot.fn=function(data,index)
+   return(coef(lm(mpg~horsepower,data=data,subset=index)))

We start by creating a function that returns the intercept and slope values for a simple linear regression.

> #Coefficients for all 392 observations
> boot.fn(Auto,1:392)
(Intercept)  horsepower 
 39.9358610  -0.1578447 

We can then evaluate the output using all observations.

> set.seed(1)
> boot.fn(Auto,sample(392,392,replace=TRUE))
(Intercept)  horsepower 
 40.3404517  -0.1634868 
> set.seed(2)
> boot.fn(Auto,sample(392,392,replace=TRUE))
(Intercept)  horsepower 
 39.7271157  -0.1538265 

We can create bootstrap estimates by randomly sampling with replacement.

> # Use boot function to automate many samples
> bootcorr2 <- boot(Auto,boot.fn,1000)
> bootcorr2

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Auto, statistic = boot.fn, R = 1000)


Bootstrap Statistics :
      original        bias    std. error
t1* 39.9358610  0.0743759879 0.848880499
t2* -0.1578447 -0.0006799674 0.007360085

If we use the boot function to automate many samples, we arrive at an estimate for \(SE(\hat\beta_0)\) of 0.8488805 and for \(SE(\hat\beta_1)\) of 0.0073601.
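
For comparison, the standard errors produced by the usual linear-model formulas can be obtained from summary(). They will generally differ somewhat from the bootstrap estimates, since the formula-based standard errors rest on model assumptions (such as correct linearity and constant error variance) that the bootstrap does not require. The variable name below is my own.

> #Formula-based standard errors for the same regression, for comparison
> se.formula <- summary(lm(mpg~horsepower, data=Auto))$coef[, "Std. Error"]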

> #Create function using quadratic regression
> boot.fn=function(data,index)
+   coefficients(lm(mpg~horsepower+I(horsepower^2),data=data,subset=index))
> 
> set.seed(1) #Make replicable
> boot(Auto,boot.fn,1000) #automate with boot

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Auto, statistic = boot.fn, R = 1000)


Bootstrap Statistics :
        original        bias     std. error
t1* 56.900099702  3.511640e-02 2.0300222526
t2* -0.466189630 -7.080834e-04 0.0324241984
t3*  0.001230536  2.840324e-06 0.0001172164

We can also calculate standard errors for quadratic or higher-order polynomials.