Chapter 05: Resampling Methods

3

3a

K-fold validation is the approach taken to validate a model by separating the available data into k sets, where one of the folds is selected as the test set and the remaining are used for training. In doing so, we obtain the error of using this particular fold for testing. The process is then repeated once for each fold, and all the errors obtained are averaged.

3b

i

The main disadvantage of using an approach such as k-fold is the computational cost involved. This may provide too time consuming or costly for some models. The validation set approach has a clear benefit in this case, since the model only needs be learnt from the data once, and tested once. However, a validation set may not be available (due to only a small number of observations being available) or may be too costly to obtain in practice. In these cases, the k-fold cross validation is a clear winner, for it allows us to fine tune our model. Furthermore, a validation set may tend to over estimate the error since less data is used for training.

ii

The Leave-One-Out Cross Validation (LOOCV) approach has a worse (or the same, in the case k=n) computational cost to K-Fold Cross Validation since the model needs to be trained and tested n times, instead of k. Furthermore, LOOCV suffers from a higher variance in result, since they are typically highly correlated (most models trained are very similar). There are no significant advance to using LOOCV since 5-fold or 10-fold will significantly reduce the computational cost and will neither suffer from high bias or variance.

Applied Exercises

5

5a

glm.fit=glm(default~income+balance,family='binomial',data=Default)

5b

set.seed(3)
subset=sample(1:1000,500)                                  # i. done by the subset parameter.
glm.fit=glm(default~income+balance,family='binomial',      # ii. fitting the model.
            data=Default,subset=subset) 

glm.resp=predict(glm.fit,Default[-subset,],type='response') # iii. computing the posterior.
glm.pred=ifelse(glm.resp>0.5,'Yes','No')

mean(glm.pred!=Default[-subset,'default'])  
## [1] 0.02631579

5c

  DefaultValidationFn = function(formula=default~income+balance,n=1000,s=500,seed){
  set.seed(seed)
  subset=sample(n,s)                                
  glm.fit=glm(formula,family='binomial',     
            data=Default,subset=subset) 

  glm.resp=predict(glm.fit,Default[-subset,],type='response') 

  mean(glm.pred!=Default[-subset,'default'])                 
  }

for(i in 1:3) print(DefaultValidationFn(seed=i))
## [1] 0.02715789
## [1] 0.02621053
## [1] 0.02631579

5d

for(i in 1:3) print(DefaultValidationFn(formula=default~income+balance+student,seed=i))
## [1] 0.02715789
## [1] 0.02621053
## [1] 0.02631579

A reduction is not observed when including the variable student.

6

6a

summary(glm(default~income+balance,data=Default,family='binomial'))
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

6b

boot.fn = function(data,index){
  coef(glm(default~income+balance,data=data,subset=index,family='binomial'))
}

set.seed(1)
boot.fn(Default,sample(1000,500,replace = T))
##   (Intercept)        income       balance 
## -1.029272e+01  1.558656e-05  5.236697e-03

6c

boot(Default,boot.fn,R=1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -3.933672e-02 4.343265e-01
## t2*  2.080898e-05  1.671659e-07 4.867306e-06
## t3*  5.647103e-03  1.850346e-05 2.298227e-04

9

9a

mu=mean(medv) # estimate of population mean
print(mu)
## [1] 22.53281

9b

sterr = function(x,index){ sd(x[index])/sqrt(length(x[index])) }  # estimate of standard error.
print(sterr(medv)) 
## [1] 0.4088611

9c

set.seed(456)
boot(medv,function(x,index){mean(x[index])},R=1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = function(x, index) {
##     mean(x[index])
## }, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original     bias    std. error
## t1* 22.53281 0.01222925   0.3939195

The results are different but approximate; 0.404878 for the boot estimate and 0.4088611 for the sample estimate.

9d

mu-2*0.404878  
## [1] 21.72305
mu+2*0.404878  
## [1] 23.34256

9e

print(median(medv))
## [1] 21.2

9f

boot(medv,function(data,index){median(data[index])},1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = function(data, index) {
##     median(data[index])
## }, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original   bias    std. error
## t1*     21.2 -0.02755    0.375179

There is a standard error or 0.3756641 for the estimating the median value.

9g

quantile(medv,.1)
##   10% 
## 12.75

9h

boot(medv,function(data,index){quantile(medv[index],.1)},R=1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = function(data, index) {
##     quantile(medv[index], 0.1)
## }, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75 0.01485     0.50522

There is a standard error of 0.5166154. One can interpret this value to mean the amount that the quarterly deviates from the estimate in each sample.