This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, MSE1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as the validation set. This process results in k estimates of the test error, MSE1, MSE2, …, MSEk. The k-fold CV estimate is computed by averaging these values: CV(k) = (MSE1 + MSE2 + … + MSEk) / k.
The validation set approach randomly splits the available observations into a training set and a validation set. However, it has two drawbacks: (1) the estimate of the test error rate can be highly variable, depending on which observations are included in the training and validation sets; (2) because only a subset of the observations is used to fit the model, the validation set error rate tends to overestimate the test error rate for the model fit on the entire data set.
Advantages: LOOCV has less bias. The validation approach produces different MSE estimates when applied repeatedly, due to the randomness in the splitting process, while performing LOOCV multiple times always yields the same results, because each split holds out exactly one observation and involves no randomness.
Disadvantage: LOOCV is computationally intensive. It is a special case of k-fold cross-validation with k = n, so the model must be fit n times, making LOOCV the most computationally expensive of these methods. LOOCV also has higher variance, but lower bias, than k-fold CV.
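As a minimal sketch of how these estimates can be computed, the boot package's cv.glm() performs both k-fold CV and LOOCV; the data set, fold count, and seed below are illustrative choices:
library(ISLR)
library(boot)
set.seed(1)
fit <- glm(default ~ income + balance, data = Default, family = "binomial")
# 10-fold CV; delta[1] is the raw cross-validation estimate of the prediction error
cv_10 <- cv.glm(Default, fit, K = 10)
cv_10$delta[1]
# LOOCV is the special case K = n; with 10,000 observations it refits the model
# 10,000 times, so the call is shown but commented out:
# cv_loo <- cv.glm(Default, fit)  # K defaults to n, i.e. LOOCV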
library(ISLR)
set.seed(1)
# Logistic regression of default on income and balance, fit to the full Default data
glm_model <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(glm_model)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
# Randomly assign half of the observations to a training set
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
glm_model <- glm(default ~ income + balance, data = Default[train, ], family = "binomial")
glm_model <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
# Both calls above fit the same model: indexing the data and using subset = train are equivalent.
summary(glm_model)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5830 -0.1428 -0.0573 -0.0213 3.3395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.194e+01 6.178e-01 -19.333 < 2e-16 ***
## income 3.262e-05 7.024e-06 4.644 3.41e-06 ***
## balance 5.689e-03 3.158e-04 18.014 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.8 on 4999 degrees of freedom
## Residual deviance: 803.3 on 4997 degrees of freedom
## AIC: 809.3
##
## Number of Fisher Scoring iterations: 8
# Posterior probabilities for the held-out half, classified with a 0.5 threshold
glm.probs <- predict(glm_model, newdata = Default[-train, ], type = "response")
glm.pred <- rep("No", length(glm.probs))
glm.pred[glm.probs > 0.5] <- "Yes"
# Validation set estimate of the test error rate
mean(glm.pred != Default[-train, ]$default)
## [1] 0.0254
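To see how these errors break down by class, the held-out predictions can be cross-tabulated against the observed responses; a minimal sketch reusing the objects defined above:
# Confusion matrix: rows are predicted classes, columns are observed classes
table(predicted = glm.pred, observed = Default[-train, ]$default)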
# Repeat the validation-set approach with new random splits
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
glm_model <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
probs <- predict(glm_model, newdata = Default[-train, ], type = "response")
pred.glm <- rep("No", length(probs))
pred.glm[probs > 0.5] <- "Yes"
mean(pred.glm != Default[-train, ]$default)
## [1] 0.0274
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
glm_model <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
probs <- predict(glm_model, newdata = Default[-train, ], type = "response")
pred.glm <- rep("No", length(probs))
pred.glm[probs > 0.5] <- "Yes"
mean(pred.glm != Default[-train, ]$default)
## [1] 0.0244
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
glm_model <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
probs <- predict(glm_model, newdata = Default[-train, ], type = "response")
pred.glm <- rep("No", length(probs))
pred.glm[probs > 0.5] <- "Yes"
mean(pred.glm != Default[-train, ]$default)
## [1] 0.0244
The validation estimate of the test error rate can be quite variable, depending on which observations are included in the training and validation sets: across the three random splits above it ranges from 0.0244 to 0.0274.
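The three hand-run splits above can be condensed into a loop to see the spread directly; a sketch, where the seed and the number of repetitions are arbitrary choices:
set.seed(2)  # arbitrary seed, for reproducibility only
errs <- sapply(1:10, function(i) {
  train <- sample(dim(Default)[1], dim(Default)[1] / 2)
  fit <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
  probs <- predict(fit, newdata = Default[-train, ], type = "response")
  pred <- ifelse(probs > 0.5, "Yes", "No")
  mean(pred != Default[-train, ]$default)  # validation error for this split
})
summary(errs)  # spread of the validation estimates across random splits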
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
glm_model <- glm(default ~ income + balance + student, data = Default, family = "binomial", subset = train)
probs <- predict(glm_model, newdata = Default[-train, ], type = "response")
pred.glm <- rep("No", length(probs))
pred.glm[probs > 0.5] <- "Yes"
mean(pred.glm != Default[-train, ]$default)
## [1] 0.0278
Adding the “student” dummy variable does not lead to a reduction in the validation set estimate of the test error rate.
set.seed(1)
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
glm_model <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
summary(glm_model)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5830 -0.1428 -0.0573 -0.0213 3.3395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.194e+01 6.178e-01 -19.333 < 2e-16 ***
## income 3.262e-05 7.024e-06 4.644 3.41e-06 ***
## balance 5.689e-03 3.158e-04 18.014 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.8 on 4999 degrees of freedom
## Residual deviance: 803.3 on 4997 degrees of freedom
## AIC: 809.3
##
## Number of Fisher Scoring iterations: 8
# Returns the logistic regression coefficients fit on a bootstrap sample
boot.fn <- function(data, index) {
  fit <- glm(default ~ income + balance, data = data, family = "binomial", subset = index)
  return(coef(fit))
}
library(boot)
boot(Default, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.912114e-02 4.347403e-01
## t2* 2.080898e-05 1.585717e-07 4.858722e-06
## t3* 5.647103e-03 1.856917e-05 2.300758e-04
The bootstrap estimates of the standard errors for the coefficients β0, β1, and β2 are 4.347 × 10^(-1), 4.858 × 10^(-6), and 2.300 × 10^(-4), respectively.
These are very close to the formula-based standard errors reported by summary() for the full-data glm() fit at the start of this section (4.348 × 10^(-1), 4.985 × 10^(-6), and 2.274 × 10^(-4)).
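For a direct side-by-side comparison, the formula-based standard errors can be pulled out of the summary table; a small sketch, refitting the full-data model in a variable of our own naming:
fit_full <- glm(default ~ income + balance, data = Default, family = "binomial")
# Formula-based standard errors, to set against the bootstrap std. error column above
summary(fit_full)$coefficients[, "Std. Error"]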
library(MASS)
# Sample mean of medv: the estimate of the population mean
mu.hat <- mean(Boston$medv)
mu.hat
## [1] 22.53281
# Standard error of the sample mean: s / sqrt(n)
se.hat <- sd(Boston$medv) / sqrt(dim(Boston)[1])
se.hat
## [1] 0.4088611
set.seed(1)
# Statistic for boot(): mean of the observations selected by the bootstrap index
boot.fn <- function(data, index) {
  mu <- mean(data[index])
  return(mu)
}
boot(Boston$medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
# Approximate 95% CI: mu.hat ± 2 × SE, with the bootstrap SE plugged in by hand;
# compare with the t.test() interval (21.730, 23.336)
CI.mu.hat <- c(22.53 - 2 * 0.4119, 22.53 + 2 * 0.4119)
CI.mu.hat
## [1] 21.7062 23.3538
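The boot package can also build confidence intervals directly from the bootstrap replicates via boot.ci(); a sketch that stores the boot() result in a variable (the name boot_res and the interval types are our choices):
boot.fn <- function(data, index) mean(data[index])
boot_res <- boot(Boston$medv, boot.fn, R = 1000)
# Normal-approximation and percentile 95% intervals from the same replicates
boot.ci(boot_res, type = c("norm", "perc"))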
med.hat <- median(Boston$medv)
med.hat
## [1] 21.2
# Statistic for boot(): median of the bootstrap sample
boot.fn <- function(data, index) {
  med <- median(data[index])
  return(med)
}
boot(Boston$medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0386 0.3770241
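We get an estimated median value of 21.2, equal to the value obtained in (e), with a standard error of 0.377, which is small relative to the median.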
percent.hat <- quantile(Boston$medv, c(0.1))
percent.hat
## 10%
## 12.75
# Statistic for boot(): tenth percentile of the bootstrap sample
boot.fn <- function(data, index) {
  p10 <- quantile(data[index], 0.1)
  return(p10)
}
boot(Boston$medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0186 0.4925766
We get an estimated tenth percentile value of 12.75, again equal to the value obtained in (g), with a standard error of 0.4925, which is relatively small compared to the percentile value itself.