This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
In this tutorial, we will explore Cross-Validation and the Bootstrap.
One major task of machine learning is to evaluate the expected performance of a learning algorithm. In general, we choose a loss function \(L\) (e.g. \(L_2\) loss, 0-1 loss, binomial deviance), when the outcome \(Y\) and predictors \(X\) come from an unknown distribution \(p(Y, X)\), we want to estimate the expected loss for any prediction \(\hat y =\hat{f}(\mathbf{x})\), \[\mbox{test error of }\hat{f}=\mathbb{E} [L(Y, \hat{f}(X)].\]
It is crucial especially when there are tuning parameters in the model– we want to select the optimal model that leads to the lowest test error.
One method we can use is to split the data into training set and an independent test set, and use the average loss on the test set as an unbiased estimator of test error. But what if we have an extra model-selection procedure? Clearly, we cannot use the same data set for both model-selection and model-assessment!
In order to evaluate the performance of different models and choose the optimal one, we randomly split data into three sets: a training set, a validation set and a test set.
For example, let’s consider a regression problem where we have \(n=50\) observations and \(d=30\) predictors, while most of them are not predictive. To adapt to the sparsity nature of the model, we use the lasso regression with a tuning parameter \(\lambda\). We want to choose the optimal \(\lambda\), and estimate the test error.
To this end, we carry out K-fold cross-validation. We first set aside the test set that contains 30% of the data. Then split the remaining data into K blocks. Use each block in turn as the validation set and perform cross-validation and average the results over all K combinations.
We will use the R package glmnet.
library(glmnet)
The data is generated from the following model \[ Y = \sum_{j=1}^d \beta_j X_j Z_j + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1),\] where \(Z_j\)’s are Bernoulli random variables with parameter 0.1 and \(\beta_j\)’s are generated from the uniform distribution on \([-3, 3]\). We use the following code generate 50 observation.
set.seed(123)
d=30
n=50
x=matrix(rnorm(d*n, 1,1), n,d)
beta=rbinom(d,size=1,prob=0.1)*runif(d,-3, 3)
eps=rnorm(n,0,1)
y=x%*%beta+eps
We fist set aside a test set, which contains 30% data points.
train = sample(1:n, n*0.7)
test= c(1:n)[-train]
ytest = y[test]
xtrain=x[train,]
ytrain=y[train]
This function is to calculate the average loss of a model on a data set
cv_error=function(fitted_model,newy,newx){
pred <- predict(fitted_model, newx = newx)
error=colMeans((pred-newy)^2)
return(error)
}
The following function carries out \(K\)-fold cross-validation for the lasso
lasso_cv=function(k, lambda){
nlambda = length(lambda)
validation_error_mat=matrix(NA,k,nlambda)
for( j in 1:k){
validation_index=c((j-1)*(length(train)%/%k)+1:(length(train)%/%k)) ## the training data has already been shuffled.
train_index=c(1:length(train))[-validation_index]
lasso.mod = glmnet(xtrain[train_index,], ytrain[train_index], alpha = 1, lambda=lambda)
validation_error_mat[j,]=cv_error(fitted_model=lasso.mod,newy=ytrain[validation_index],newx= if
(length(validation_index)==1){ t(xtrain[validation_index,]) } else
{xtrain[validation_index,]} )
}
return(validation_error_mat)
}
The following function is for the visualision of cross-validation result.
lasso_cv_graph=function(legend_1,k,lambda, tit){
k_fold_cv=lasso_cv(k, lambda)
validation_error=apply(k_fold_cv, 2,mean)
validation_error_sd=apply(k_fold_cv, 2,sd)/sqrt(k)
lasso.mod <- glmnet(x[train,], y[train], alpha = 1, lambda = lambda)
test_error=cv_error(lasso.mod,newy=ytest,newx= x[test,])
best_lambda=which.min(validation_error)
print("The optimal lambda that minimizes CV error is" )
print(lambda[best_lambda])
print("The 'rule of thumb' lambda is" )
print(max(lambda[ validation_error<=(validation_error[best_lambda]+ validation_error_sd[best_lambda])]))
plot(log(lambda),validation_error ,ylim=range(validation_error-validation_error_sd,
validation_error+validation_error_sd,
test_error),pch=20,cex=0.5,col='blue',ylab="MSE",main =tit)
lines(log(lambda),validation_error,col='blue' )
lines(log(lambda),test_error)
for( i in 1:nlambda){
lines(x=rep(log(lambda)[i],2),
y=c(validation_error[i]-validation_error_sd[i],
validation_error[i]+validation_error_sd[i]),lwd=0.3,col='blue')
}
legend("topleft", col=c("blue", 1), lwd=2, legend=c(legend_1, "test error"))
points( log(lambda)[best_lambda],validation_error[best_lambda],col=2 )
}
For example, we perform 10-fold CV.
nlambda=30
lambda=exp(seq(0.5, -6, length = nlambda))
lasso_cv_graph("ten-fold CV", 10, lambda, tit=paste("ten-fold CV, d=",d))
## [1] "The optimal lambda that minimizes CV error is"
## [1] 0.1400834
## [1] "The 'rule of thumb' lambda is"
## [1] 0.3433635
### choose the optimal tuning parameter. Here we also pick up the tuning parameter that minimizes the CV error. Accounting for the uncertainty in CV-estimation, we apply the : Choose the simplest model whose CV error is no more than one standard error above the model with the lowest CV error.
When \(K=n\), each validation set is of size 1, this is also termed as leave-one-out (LOO) CV.
lasso_cv_graph("LOO-CV", length(train),lambda,tit=paste("LOO-CV, d=",d))
## [1] "The optimal lambda that minimizes CV error is"
## [1] 0.1119553
## [1] "The 'rule of thumb' lambda is"
## [1] 0.3433635
##How to choose K? There is no general rule to choose \(K\), as it depends on the problem. In \(K\)-fold CV with a relatively small \(K\), we train the model on less data than what is available. This introduces bias into the estimates of test error. In LOO-CV, the training samples highly resemble each other. This increases the variance of the test error estimate.
In the previous example, LOO and 10-fold CV have similar result. But if we increases the dimension to make the learning curve steeper锛宼hen the bias of 10-fold CV estimation error becomes significant.
set.seed(123)
d=200
n=50
x=matrix(rnorm(d*n, 1,1), n,d)
beta=rbinom(d,size=1,prob=0.1)*runif(d,-3, 3)
eps=rnorm(n,0,1)
y=x%*%beta+eps
train = sample(1:n, n*0.7)
test= c(1:n)[-train]
ytest = y[test]
nlambda=30
lambda=exp(seq(1, -8, length = nlambda))
xtrain=x[train,]
ytrain=y[train]
lasso_cv_graph("10-fold CV", 10, lambda, tit=paste("10-fold CV, d=",d))
## [1] "The optimal lambda that minimizes CV error is"
## [1] 0.0003354626
## [1] "The 'rule of thumb' lambda is"
## [1] 0.1664428
lasso_cv_graph("LOO-CV", length(train), lambda, tit=paste("LOO-CV, d=",d))
## [1] "The optimal lambda that minimizes CV error is"
## [1] 0.0003354626
## [1] "The 'rule of thumb' lambda is"
## [1] 0.1664428
Rough idea: Each time we re-sample a bootstrap sample uniformly from the observation (with replacement), estimate the parameter of interest according to this bootstrap sample. Repeat this procedure \(B\) times and obtain the distribution of that estimate. The function in R can do this.
Suppose we are interested in the distribution, or at least the variance of the p-value in a linear regression. We use the 鈥渘uclear鈥? data set (included in R-package “boot”). The data relates to the construction of 32 light water reactor (LWR) plants constructed in the U.S.A in the late 1960’s and early 1970’s. It contains 11 variables: cost ( in millions of dollars adjusted to 1976 base), date (the date on which the construction permit was issued in years), t1 (the time between application for and issue of the construction permit), t2 (the time between issue of operating license and construction permit), cap (the net capacity of the power plant (MWe)), bw (the cumulative number of power plants constructed by each architect-engineer) and five binary variables pr (the prior existence of a LWR plant at the same site), ne (the plant was constructed in the north-east region ), ct (the use of a cooling tower in the plant), bw ( the nuclear steam supply system was manufactured by Babcock-Wilcox.), pt ( plants with partial turnkey guarantees). Are those variables predictive of the outcome (cost)? We can run a simple linear regression:
library(boot)
summary(lm(cost~.,data = nuclear) )
##
## Call:
## lm(formula = cost ~ ., data = nuclear)
##
## Residuals:
## Min 1Q Median 3Q Max
## -128.608 -46.736 -2.668 39.782 180.365
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.135e+03 2.788e+03 -2.918 0.008222 **
## date 1.155e+02 4.226e+01 2.733 0.012470 *
## t1 5.928e+00 1.089e+01 0.545 0.591803
## t2 4.571e+00 2.243e+00 2.038 0.054390 .
## cap 4.217e-01 8.844e-02 4.768 0.000104 ***
## pr -8.112e+01 4.077e+01 -1.990 0.059794 .
## ne 1.375e+02 3.869e+01 3.553 0.001883 **
## ct 4.327e+01 3.431e+01 1.261 0.221008
## bw -8.238e+00 5.188e+01 -0.159 0.875354
## cum.n -6.989e+00 3.822e+00 -1.829 0.081698 .
## pt -1.925e+01 6.367e+01 -0.302 0.765401
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82.83 on 21 degrees of freedom
## Multiple R-squared: 0.8394, Adjusted R-squared: 0.763
## F-statistic: 10.98 on 10 and 21 DF, p-value: 2.844e-06
But the p-value itself is noisy, particularly we have only a few observations. What is the distribution of those p-values?
We then run the bootstrap on the p-value estimates:
set.seed(123)
variable_name=names(nuclear)[-1]
p_v_estimate = function(data, indices){
d = data[indices, ]
l = lm(cost~., data = d)
p_v =c( summary(l)$coefficients[-1,4] )
p_v_arg=rep(NA, length(variable_name))
for(i in 1:length(variable_name))
if( sum(names( p_v)==variable_name[i]) >=1)
p_v_arg[i]=p_v[which(names( p_v)==variable_name[i]) ]
return(p_v_arg)
}
results = boot(data = data.frame( as.matrix(nuclear)), statistic = p_v_estimate, R = 1000)
par(mfrow=c(2,5))
for(i in 1:10)
{
hist(results$t[,i], main=paste("p_value of", names(nuclear)[i+1] ),xlim=c(0,1) ,prob=T, breaks=20, xlab = names(nuclear)[i+1])
#abline(v=0.05,col=2)
}
We can also obtain the bootsrap estimate of the variance of regression coefficients:
set.seed(123)
coef_estimate = function(data, indices){
d = data[indices, ]
l = lm(cost~., data = d)
p_v =c( summary(l)$coefficients[-1,1] )
p_v_arg=rep(NA, length(variable_name))
for(i in 1:length(variable_name))
if( sum(names( p_v)==variable_name[i]) >=1)
p_v_arg[i]=p_v[which(names( p_v)==variable_name[i]) ]
return(p_v_arg)
}
print( boot(data = data.frame( as.matrix(nuclear)), statistic = coef_estimate, R = 1000))
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data.frame(as.matrix(nuclear)), statistic = coef_estimate,
## R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 115.483182 -0.08975471 63.3446775
## t2* 5.928437 0.41652399 12.8858036
## t3* 4.570921 0.37143847 3.1022225
## t4* 0.421648 -0.01725226 0.1199543
## t5* -81.121120 4.31215505 57.2997340
## t6* 137.450223 -4.91974013 52.4472508
## t7* 43.273270 -0.16818232 47.4804135
## t8* -8.238443 -3.47844425 60.6424711
## t9* -6.988608 0.90592307 6.2320482
## t10* -19.247631 -4.47454364 96.5764560
The corss-validation can also contain tunning parameter, for example, the number of folds K. How can we use corss-validation to choose K?
The bootstap estimate also contains uncertainity. How can we quantify that uncertainity? Can we bootstrap the bootstraped-standard-error?