K-fold cross-validation is one of the resampling approaches used to estimate the test error.
The approach involves:
* Randomly divide the data set into K folds (groups) of approximately equal size.
* Hold out the 1st fold as the validation set and fit the model on the remaining K-1 folds.
* Compute the Mean Squared Error (MSE) on the observations in the held-out fold.
* Repeat this procedure K times, holding out a different fold each time, resulting in K different MSEs.
* Average the K MSEs to get the cross-validation estimate of the test error (see the formula below).
For classification problems, use the misclassification error rate instead of the MSE.
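In symbols, the K-fold cross-validation estimate of the test error is simply the average of the K held-out MSEs: \(CV_{(K)} = \frac{1}{K}\sum_{i=1}^{K} \mathrm{MSE}_i\).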
Validation set approach: The validation set approach has two main drawbacks compared to K-fold cross-validation. First, although it is computationally less intensive than K-fold, its test error estimate is highly variable, depending on which observations happen to land in the validation set. Second, only a subset of the observations is used to fit the model. Since statistical methods tend to perform worse when trained on fewer observations, the validation set error rate tends to overestimate the test error rate of the model fit on the entire data set.
LOOCV (leave-one-out cross-validation) is a special case of K-fold in which K = n. LOOCV has two drawbacks compared to K-fold with K < n. First, it is more computationally intensive, since the model has to be fitted n times whereas K-fold requires only K fits. Second, it has higher variance: we are averaging the outputs of n fitted models trained on almost identical sets of observations, so these outputs are highly correlated, and the mean of highly correlated quantities has higher variance than the mean of less correlated ones.
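To see why, note that for n identically distributed estimates with common variance \(\sigma^2\) and pairwise correlation \(\rho\), the variance of their average is \(\frac{\sigma^2}{n} + \frac{n-1}{n}\rho\sigma^2\), which grows with \(\rho\).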
Here is a comparison of all three approaches:
| Category | Validation set approach | LOOCV | K-Fold (K<n) |
|---|---|---|---|
| Computation | Relatively less | Very High | High |
| Bias | Very High | Relatively less | High |
| Variance | Very High | High | Relatively less |
| Randomness | Very High | Almost none | Minimal |
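For reference, K-fold CV and LOOCV can also be run directly with the cv.glm() function from the boot package rather than coded by hand. Below is a minimal sketch (not part of the analysis that follows) on the Default data set from ISLR; the cost function and the names fit.cv and cost are only illustrative.
# Sketch only: K-fold CV and LOOCV via cv.glm() from the boot package
library(ISLR)
library(boot)
# Misclassification rate at a 0.5 probability threshold
cost <- function(y, prob) mean(abs(y - prob) > 0.5)
fit.cv <- glm(default ~ income + balance, data = Default, family = binomial)
set.seed(1)
cv.glm(Default, fit.cv, cost = cost, K = 10)$delta[1]  # 10-fold CV error estimate
# cv.glm(Default, fit.cv, cost = cost)                 # K defaults to n, i.e. LOOCV (slow on 10,000 rows)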
# Load the ISLR library
library(ISLR)
# set the seed
set.seed(1)
# See the structure of dataset
str(Default)
## 'data.frame': 10000 obs. of 4 variables:
## $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ balance: num 730 817 1074 529 786 ...
## $ income : num 44362 12106 31767 35704 38463 ...
# Create a model formula so that we can reuse it later
model_vars <- default~income+balance
# Fit a logistic regression using glm() with family = binomial (lm() would fit an ordinary linear model instead)
fit.glm = glm(model_vars,data = Default, family = binomial)
summary(fit.glm)
##
## Call:
## glm(formula = model_vars, family = binomial, data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Both income and balance are significant.
Validation set approach:
# Create a function, since we have to repeat this several times and can reuse the code. The seed is an input so reruns are reproducible.
# Example split: sample half of the row indices as the training set (the function below draws its own split)
train = sample(dim(Default)[1], dim(Default)[1] / 2)
# Create a multiple logistic regression model function to reuse later
mlogit = function(mVars, seed) {
  # Set the seed so the split is reproducible
  set.seed(seed)
  # 5(b)i. Split the sample into a training set and a validation set, each with half of the rows
  train = sample(dim(Default)[1], dim(Default)[1] / 2)
  # 5(b)ii. Fit the multiple logistic regression model using only the training observations
  fit.glm <- glm(mVars, data = Default, subset = train, family = "binomial")
  # 5(b)iii. Predict probabilities on the validation set and classify with a 0.5 threshold
  probs.glm <- predict(fit.glm, Default[-train, ], type = "response")
  pred.glm <- rep("No", length(probs.glm))
  pred.glm[probs.glm > 0.5] <- "Yes"
  # Return the validation set error rate
  return(mean(pred.glm != Default[-train, ]$default))
}
mlogit(model_vars,1)
## [1] 0.0254
This gives a test error rate of 2.54%.
# Instantiate a vector to capture the test error rate
cv.error = rep(0,3)
# Iterate over the multiple logistic regression function in previous step 3 times and capture the error rate
for (i in 1:3) {
# Use seed i+1, since seed 1 was already used above; fixing the seed keeps the results reproducible on reruns
cv.error[i] = mlogit(model_vars,i+1)
}
# Print the vector results
cv.error
## [1] 0.0246 0.0266 0.0252
Each split gives a slightly different result, but the error rates are not far apart.
# Create a new model formula that adds the student predictor
model_vars2 <- default~income+balance+student
# Run the model again
# Instantiate a vector to capture the test error rate
cv2.error = rep(0,3)
# Iterate over the multiple logistic regression function in previous step 3 times and capture the error rate
for (i in 1:3) {
# Use seeds 1 through 3 here; fixing the seed keeps the results reproducible on reruns
cv2.error[i] = mlogit(model_vars2,i)
}
# Print the vector results
cv2.error
## [1] 0.0260 0.0246 0.0272
Adding the student predictor did not reduce the test error rate, although the error rates are not very different. Note that these runs used different seeds (and hence different splits) than the previous ones, so a fairer comparison would reuse the same splits, as sketched below.
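A minimal sketch of such a comparison, reusing the mlogit() helper defined above (the seed values 1:5 are arbitrary):
# Compare both model formulas on identical validation splits (same seeds)
seeds <- 1:5
err.base    <- sapply(seeds, function(s) mlogit(model_vars, s))
err.student <- sapply(seeds, function(s) mlogit(model_vars2, s))
round(cbind(base = err.base, with_student = err.student), 4)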
# Fit the logistic regression on the full data set to get the coefficient standard errors from glm()
fit2.glm = glm(model_vars,data = Default, family = binomial)
summary(fit2.glm)
##
## Call:
## glm(formula = model_vars, family = binomial, data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
The standard errors estimated by glm() are 4.348e-01, 4.985e-06, and 2.274e-04 for the intercept, income, and balance coefficients, respectively.
library(boot)
## Warning: package 'boot' was built under R version 3.6.2
set.seed(1)
boot.fn <- function(data, index) {
  # Refit the logistic regression on the bootstrap sample given by index and return its coefficients
  fit3.glm <- glm(default ~ income + balance, data = data, subset = index, family = binomial)
  return(coef(fit3.glm))
}
boot(Default, boot.fn, R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2* 2.080898e-05 1.680317e-07 4.866284e-06
## t3* 5.647103e-03 1.855765e-05 2.298949e-04
The standard errors estimated by the bootstrap are 4.344722e-01, 4.866284e-06, and 2.298949e-04 for the intercept, income, and balance coefficients, respectively.
The standard errors estimated by glm() and by the bootstrap are very close. Since the bootstrap does not rely on the model-based formulas (and their assumptions) that glm() uses, this agreement suggests the glm() standard error estimates are reliable here.
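For a quick side-by-side check, the model-based standard errors can also be extracted from the glm fit programmatically (a small sketch, assuming the fit.glm object created earlier):
# Extract the model-based standard errors from the glm fit for direct comparison
summary(fit.glm)$coefficients[, "Std. Error"]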
# Load the MASS library
library(MASS)
## Warning: package 'MASS' was built under R version 3.6.2
# Get the mean of medv
mu.hat <- mean(Boston$medv)
mu.hat
## [1] 22.53281
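The standard error of the sample mean can be estimated with the usual plug-in formula \(\mathrm{SE}(\hat{\mu}) = s/\sqrt{n}\), where \(s\) is the sample standard deviation and \(n\) is the number of observations; that is what is computed below.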
SE = sd(Boston$medv)/sqrt(nrow(Boston))
SE
## [1] 0.4088611
The standard error is 0.4088611. This is small relative to the mean, so we can expect the sample mean mu.hat to be close to the true population mean.
# set the seed for repeatability
set.seed(1)
# Define boot function to return the mean
boot.fn = function(data, index) {
  mu <- mean(data[index])
  return(mu)
}
# call the boot function with R = 1000
SE_boot = boot(Boston$medv, boot.fn, R=1000)
# Print the statistics
SE_boot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
The bootstrap estimate of the standard error of \(\hat{\mu}\) is 0.4106622, whereas the formula-based standard error of the sample mean is 0.4088611. The values are very close.
Let's calculate a confidence interval from the bootstrap estimate.
# Get the μˆ and SE(μˆ) from above bootstrap
mu.hat_b = SE_boot$t0
SE_b = sd(SE_boot$t)
# calculate the 95% confidence interval using the formula [μˆ − 2SE(μˆ), μˆ + 2SE(μˆ)]
CI <- c(mu.hat_b - 2 * SE_b,mu.hat_b + 2 * SE_b)
CI
## [1] 21.71148 23.35413
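The boot package can also compute confidence intervals directly from the bootstrap replicates via boot.ci(); a minimal sketch using the SE_boot object from above (the normal and percentile interval types are just two common choices):
# Normal-approximation and percentile bootstrap confidence intervals for the mean
boot.ci(SE_boot, conf = 0.95, type = c("norm", "perc"))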
Let's run a one-sample t-test on Boston$medv for comparison.
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
The 95% confidence interval from the t-test of Boston$medv is (21.72953, 23.33608), whereas the 95% confidence interval calculated from the bootstrap estimate is (21.71148, 23.35413). They are very close.
# use the median function and get the μˆmed
med.hat <- median(Boston$medv)
med.hat
## [1] 21.2
# Estimate the standard error of μˆmed using the bootstrap
# set the seed for repeatability
set.seed(1)
# Define boot function to return the median
boot.fn = function(data, index) {
  med <- median(data[index])
  return(med)
}
# call the boot function with R = 1000
SE_boot_med = boot(Boston$medv, boot.fn, R=1000)
# Print the statistics
SE_boot_med
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.02295 0.3778075
The bootstrap estimate of the standard error of the median, 0.3778075, is relatively small compared to the median estimate of 21.2.
# Get the tenth percentile of medv in Boston suburbs using Quantile function with .1
mu.hat.0.1 = quantile(Boston$medv,.1)
mu.hat.0.1
## 10%
## 12.75
# Use bootstrap and estimate the standard error of μˆ0.1
# set the seed for repeatability
set.seed(1)
# Define boot function to return the quantile at 0.1 i.e. tenth percentile
boot.fn = function(data, index) {
  mu.hat.0.1.b <- quantile(data[index], c(0.1))
  return(mu.hat.0.1.b)
}
# call the boot function with R = 1000
SE_boot_quant = boot(Boston$medv, boot.fn, R=1000)
# Print the statistics
SE_boot_quant
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0339 0.4767526
The standard error of the 10th percentile is 0.4767526, which is relatively small compared to the estimate of 12.75.