In this report I will review resampling methods and provide examples using the boot package. In a previous report, A Multiple Regression Analysis of NYC’s 2012 Average SAT Math Scores, I demonstrated resampling using the caret package. That report can be found here:
Resampling
Resampling is very much what it sounds like - repeatedly drawing samples from a dataset and refitting a model on each sample. This helps with model assessment (evaluating a model’s performance) and model selection (selecting a model’s ideal level of flexibility).
Cross-Validation
We typically build a model using sample data, but our objective is to find a model that performs well on new data. Since we usually do not have access to new data we have to simulate it by splitting our sample dataset into a training set and a test set. We will build the model using the training set and evaluate its performance on the test set. In this case, the test set will be a substitute for new data.
The validation set approach is the simplest method. We randomly divide the dataset into two parts, training and test, and use the MSE (mean squared error) to evaluate model performance. It is easy to implement, but it has two main problems:
1. The MSE can vary widely based on the random split.
2. The model is fit to only a subset of the data, and models tend to perform worse when trained on fewer observations.
> #Validation Set Approach
> library(ISLR) #load library for data
> library(knitr) #load library for table formatting
> library(kableExtra) #load library for table formatting
>
> set.seed(1) #make random number generator replicable
> train=sample(392,196) #create random subset of 196 out of 392
>
> # I will be using the Auto dataset
> kable(head(Auto)) %>%
+ kable_styling(bootstrap_options = c("striped","condensed"),
+ full_width = F, font_size = 10, position = "left")%>%
+ row_spec(0,background="#E6E6FA") #format and display first 6 rows

| mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|---|---|---|---|---|---|---|---|
| 18 | 8 | 307 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 15 | 8 | 350 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 18 | 8 | 318 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 16 | 8 | 304 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 17 | 8 | 302 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
| 15 | 8 | 429 | 198 | 4341 | 10.0 | 70 | 1 | ford galaxie 500 |
I first created a random sample of 196 of the 392 observations in the Auto dataset to serve as the training set.
> #fit model with sample of 196
> lm.fit=lm(mpg~horsepower,data=Auto,subset=train)
>
> #Use the predict function to estimate the response for
> #all 392 observations
> #Calculate the squared error for observations not in the training set (-train)
> #Use the mean function to calculate the Mean Squared Error (MSE)
>
> MSEvs <- mean((Auto$mpg-predict(lm.fit,Auto))[-train]^2)
> MSEvs
[1] 23.26601
I then fit a linear regression with mpg as the dependent variable and horsepower as the independent variable, using the random subset (the training set).
We can then use the test set to compare the actual mpg values to the predicted values. We arrive at an MSE value of 23.2660086.
> #Use the poly function to compare with quadratic and
> #cubic regressions
>
> lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train)
>
> #quadratic MSE
> MSEq <-mean((Auto$mpg-predict(lm.fit2,Auto))[-train]^2)
> MSEq
[1] 18.71646
> lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train)
>
> #cubic MSE
> MSEc <-mean((Auto$mpg-predict(lm.fit3,Auto))[-train]^2)
> MSEc
[1] 18.79401
We can then compare it to quadratic and cubic regressions:
\[mpg = \beta_0+\beta_1 \times horsepower + \beta_2 \times horsepower^2 +\epsilon\] \[mpg = \beta_0+\beta_1 \times horsepower + \beta_2 \times horsepower^2 + \beta_3 \times horsepower^3 +\epsilon\]
The quadratic regression had an MSE of 18.7164595 and the cubic regression had an MSE of 18.7940068. Both are superior to the linear regression.
> #Choose a different random split
> set.seed(2) #make random number generator replicable
> train=sample(392,196) #create random subset of 196 out of 392
>
> lm.fit=lm(mpg~horsepower,data=Auto,subset=train)
> #MSE linear
> mean((Auto$mpg-predict(lm.fit,Auto))[-train]^2)
[1] 25.72651
> lm.fit2=lm(mpg~poly(horsepower,2),data=Auto,subset=train)
> #MSE quadratic
> mean((Auto$mpg-predict(lm.fit2,Auto))[-train]^2)
[1] 20.43036
> lm.fit3=lm(mpg~poly(horsepower,3),data=Auto,subset=train)
> #MSE cubic
> mean((Auto$mpg-predict(lm.fit3,Auto))[-train]^2)
[1] 20.38533
If we recalculate the MSE values using a different random split we will obtain different values.
Leave-one-out cross-validation (LOOCV) leaves out a single observation for the test set and uses the remaining \((n-1)\) observations for the training set. Since each MSE is based on a single observation it is extremely variable. However, we can repeat the process \(n\) times and average the \(n\) squared errors.
\[CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}MSE_i\] It has two advantages over the validation set approach:
1. It uses almost the entire dataset to build the model, so it has less bias.
2. It uses every data point as a test observation rather than relying on a random split, so it always yields the same results.
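To make the formula concrete, here is a minimal sketch of computing \(CV_{(n)}\) by hand for the regression of mpg on horsepower; it is for illustration only, since the cv.glm function used below automates the same calculation.

```r
# Illustrative LOOCV by hand: fit the model n times, each time leaving out
# one observation, and average the n squared prediction errors
n <- nrow(Auto)
loocv.err <- rep(0, n)
for (i in 1:n) {
  fit.i <- lm(mpg ~ horsepower, data = Auto[-i, ])   # fit without observation i
  pred.i <- predict(fit.i, newdata = Auto[i, ])      # predict the held-out row
  loocv.err[i] <- (Auto$mpg[i] - pred.i)^2           # squared error for row i
}
mean(loocv.err)  # CV_(n); should agree with cv.glm's delta[1] below
```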
> #Leave-One-Out Cross-Validation
> library(boot) #load library for cross validation
> #Use glm instead of lm because it can be used
> #with the cv.glm function
>
> glm.fit = glm(mpg~horsepower,data=Auto)
>
> #cross validation
> cv.err = cv.glm(Auto,glm.fit) #default K = number of observations
>
> MSEl <-cv.err$delta[1] #Mean squared error
> MSEl
[1] 24.23151
We can use the cv.glm function from the boot package to perform cross validation on a model (in this case a linear regression). The default value for \(k\) is \(n\), the number of observations, which represents leave-one-out cross-validation. The resulting MSE value is 24.2315135.
> #Repeat for different polynomial fits using a loop
> cv.error=rep(0,5) #repeat 0 five times
> for (i in 1:5){ #loop 5 times
+ glm.fit=glm(mpg~poly(horsepower,i),data=Auto) #fit poly 1 to 5
+ cv.error[i]=cv.glm(Auto,glm.fit)$delta[1] #calculate and record MSE
+ }
> cv.error #output MSE for polynomials of order 1 to 5
[1] 24.23151 19.24821 19.33498 19.42443 19.03321
We can compare the MSE values for the first 5 polynomial regressions by using a loop. The linear regression performs the worst.
k-fold cross-validation randomly splits the dataset into \(k\) groups (\(folds\)) of approximately equal size. One fold is used for the test set and the remaining \(k-1\) folds are used for training.
The MSE is calculated on the test set and the process is repeated \(k\) times (each time with a different test fold).
Each of the \(k\) MSE values are then averaged.
\[CV_{(k)}=\frac{1}{k}\sum_{i=1}^{k}MSE_i\] LOOCV is a form of k-fold cross-validation where \(k=n\).
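As an illustrative sketch of the formula (the cv.glm call further below does the same thing automatically), k-fold CV can be computed by randomly assigning each observation to one of \(k\) folds and averaging the \(k\) test-fold MSEs:

```r
# Illustrative 10-fold CV by hand for mpg ~ horsepower
set.seed(20)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(Auto)))           # random fold assignment
fold.mse <- rep(0, k)
for (j in 1:k) {
  fit.j <- lm(mpg ~ horsepower, data = Auto[folds != j, ])   # train on the other k-1 folds
  pred.j <- predict(fit.j, newdata = Auto[folds == j, ])     # predict the held-out fold
  fold.mse[j] <- mean((Auto$mpg[folds == j] - pred.j)^2)     # MSE on fold j
}
mean(fold.mse)  # CV_(k) estimate
```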
The choice of \(k\) involves a bias-variance trade-off:
Variance: the amount by which the model would change if we estimated it with a different training dataset. More flexible statistical methods tend to have higher variance.
Bias: the error introduced by approximating the true (unknown) model with a simpler model. More flexible statistical methods tend to have lower bias.
As mentioned previously (when discussing the validation set approach), models tend to perform worse when trained on fewer observations. Since each k-fold CV model is trained on fewer observations than LOOCV, k-fold CV tends to have more bias.
However, k-fold CV tends to have lower variance. The \(n\) models fit by LOOCV are trained on nearly identical data, so their MSE values are highly correlated with one another, and the average of highly correlated quantities has higher variance. The \(k\) MSE values averaged in k-fold CV are less correlated, so their mean is less variable.
In general, k-fold CV gives more accurate estimates of the error rate than LOOCV.
> #K-fold cross validation
> set.seed(20) #make random number generator replicable
> cv.error.10=rep(0,10) #repeat 0 ten times
> for (i in 1:10){
+ glm.fit=glm(mpg~poly(horsepower,i),data=Auto) #fit poly 1 to 10
+ #Set K=10
+ cv.error.10[i]=cv.glm(Auto,glm.fit, K=10)$delta[1] #calculate and record MSE
+ }
> cv.error.10
[1] 24.23568 19.17760 19.21438 19.63799 19.08680 19.70131 19.12513 19.31875
[9] 21.07359 19.24301
In cv.glm the default value for \(k\) is \(n\), but we can set the value to perform k-fold cross-validation. In this case I used \(k=10\) and again used a loop to evaluate different polynomial regressions.
When using qualitative rather than quantitative data, the MSE can be replaced with the error rate. For example, the LOOCV estimate can be calculated by averaging the error rate over each test observation.
\[CV_{(n)}=\frac{1}{n}\sum_{i=1}^{n}Err_i\] where \(Err_i=1\) if the predicted value is incorrect and \(0\) otherwise.
Similar adjustments can be made for k-fold cross-validation.
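As a sketch of this idea, the cost argument of cv.glm can swap the default squared-error loss for a misclassification rate. The binary response mpg01 below (whether mpg exceeds its median) is invented purely for illustration, and the cost function follows the form shown in the boot package documentation:

```r
# Sketch: 10-fold CV error rate for a logistic regression (illustrative only)
Auto$mpg01 <- as.numeric(Auto$mpg > median(Auto$mpg))  # hypothetical binary response
glm.class <- glm(mpg01 ~ horsepower + weight, data = Auto, family = binomial)

# cost() returns the misclassification rate at a 0.5 probability threshold
cost <- function(y, pred) mean(abs(y - pred) > 0.5)
set.seed(20)
cv.glm(Auto, glm.class, cost = cost, K = 10)$delta[1]  # estimated CV error rate
```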
Bootstrapping relies on repeatedly drawing random samples, with replacement, of the same size as the observed dataset. It is used to estimate properties of a statistic, such as its standard error.
> # I will be using the Portfolio dataset
> kable(head(Portfolio)) %>%
+ kable_styling(bootstrap_options = c("striped","condensed"),
+ full_width = F, font_size = 12, position="left")%>%
+ row_spec(0,background="#E6E6FA") #format and display first 6 rows

| X | Y |
|---|---|
| -0.8952509 | -0.2349235 |
| -1.5624543 | -0.8851760 |
| -0.4170899 | 0.2718880 |
| 1.0443557 | -0.7341975 |
| -0.3155684 | 0.8419834 |
| -1.7371238 | -2.0371910 |
In the Portfolio dataset \(X\) and \(Y\) represent the returns of two financial assets.
Suppose we wanted to invest a fraction \(\alpha\) of money in \(X\) and the remaining \(1-\alpha\) in \(Y\). If we wanted to choose the value of \(\alpha\) to minimize the portfolio risk we can solve for it by using this formula:
\[\alpha=\frac{\sigma_Y^2-\sigma_{XY}}{\sigma_X^2+\sigma_Y^2-2\sigma_{XY}}\] where \(\sigma_X^2 = VAR(X)\), \(\sigma_Y^2 = VAR(Y)\), and \(\sigma_{XY}=COV(X,Y)\)
> #Function takes inputs X and Y and a
> #vector which indicates which observations should be used
> #Outputs the estimate for alpha based on selected observations
>
> alpha.fn=function(data,index){
+ X=data$X[index]
+ Y=data$Y[index]
+ return((var(Y)-cov(X,Y))/(var(X)+var(Y)-2*cov(X,Y)))
+ }
We can start by creating a function which will calculate \(\alpha\).
> alpha.fn(Portfolio,1:100) #calculate alpha using all 100 observations
[1] 0.5758321
Using all 100 observations we can see that \(\alpha\) equals 0.5758321.
> set.seed(1)
> #randomly select 100 observations with replacement
> Alphat <- alpha.fn(Portfolio,sample(100,100,replace=TRUE))
> Alphat
[1] 0.7368375
If we randomly select 100 observations with replacement, we obtain an \(\hat\alpha\) of 0.7368375.
> #boot( ) calls the statistic function R times.
> #Each time, it generates a set of random indices, with replacement,
> #from the integers 1:nrow(data). These indices are used within
> #the statistic function to select a sample. The statistics are
> #calculated on the sample and the results are accumulated in
> #the bootobject.
>
> bootcorr<- boot(Portfolio,alpha.fn,R=1000)
> bootcorr #summary
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Portfolio, statistic = alpha.fn, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 0.5758321 -0.001695873 0.09366347
We can implement a bootstrap analysis by applying the alpha.fn function many times (each time to a random sample drawn with replacement) and computing the standard deviation of the resulting \(\hat\alpha\) values. Fortunately, the boot function automates this.
The function returns an \(\hat\alpha\) estimate of 0.5758321 and an \(SE(\hat\alpha)\) estimate of 0.0936635.
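For intuition, here is a minimal sketch of what boot() is automating; the numbers will differ slightly from the boot output above because the resamples are drawn independently.

```r
# Manual bootstrap sketch for SE(alpha-hat)
set.seed(1)
R <- 1000
alpha.boot <- rep(0, R)
for (r in 1:R) {
  idx <- sample(nrow(Portfolio), replace = TRUE)  # resample row indices with replacement
  alpha.boot[r] <- alpha.fn(Portfolio, idx)       # recompute alpha on the resample
}
sd(alpha.boot)  # bootstrap estimate of SE(alpha-hat)
```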
We can also use bootstrapping to estimate the variability of \(\beta_0\) and \(\beta_1\) for a linear regression.
> #Create function which takes a dataset and set of indices
> #Returns the intercept and slope estimates for linear regression
>
> boot.fn=function(data,index)
+ return(coef(lm(mpg~horsepower,data=data,subset=index)))
We start by creating a function that returns the intercept and slope estimates for a simple linear regression.
> boot.fn(Auto,1:392) #coefficient estimates using all 392 observations
(Intercept) horsepower
39.9358610 -0.1578447
We can then evaluate the output using all observations.
> boot.fn(Auto,sample(392,392,replace=TRUE)) #coefficients from one bootstrap sample
(Intercept) horsepower
40.3404517 -0.1634868
> boot.fn(Auto,sample(392,392,replace=TRUE)) #coefficients from a second bootstrap sample
(Intercept) horsepower
39.7271157 -0.1538265
We can create bootstrap estimates by randomly sampling with replacement.
> boot(Auto,boot.fn,1000) #automate with boot
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Auto, statistic = boot.fn, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 39.9358610 0.0743759879 0.848880499
t2* -0.1578447 -0.0006799674 0.007360085
If we use the boot function to automate many samples, we arrive at an estimate for \(SE(\hat\beta_0)\) of 0.8488805 and for \(SE(\hat\beta_1)\) of 0.0073601.
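As a point of comparison (not part of the bootstrap itself), the formula-based standard errors reported by summary() can be placed next to the bootstrap estimates; they differ somewhat because the formulas rely on assumptions, such as a correct linear mean and constant error variance, that the bootstrap does not.

```r
# Formula-based coefficient standard errors for the same regression,
# to compare against the bootstrap estimates above
summary(lm(mpg ~ horsepower, data = Auto))$coef
```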
> #Create function using quadratic regression
> boot.fn=function(data,index)
+ coefficients(lm(mpg~horsepower+I(horsepower^2),data=data,subset=index))
>
> set.seed(1) #Make replicable
> boot(Auto,boot.fn,1000) #automate with boot
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Auto, statistic = boot.fn, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 56.900099702 3.511640e-02 2.0300222526
t2* -0.466189630 -7.080834e-04 0.0324241984
t3* 0.001230536 2.840324e-06 0.0001172164
We can also calculate bootstrap standard errors for quadratic or higher-order polynomial fits in the same way.