It’s done by splitting the data into k in roughly equal groups. Then it’s trained on all but one group then tested on the group that was left out. Then this is repeated “k” times, then it round the test errors to give one big overall estimate of the model’s test error.
The validation set approach has the ability to use data more efficiently. It also separates some of the data and doesn’t use it to train the model so the results depend a lot of the data in the training sets. Cross-validation wouldn’t have this problem because it uses all data in the training and testing at some point.However, cross-validation would take longer to run because it has to fit “k” amount of times.
LOOCV compared to k-fold cross-validation is usually less computationally expensive. LOOCV leaves out one single observation at a time, so it has to fit n times, in which n is the number of observations. This can take a long time for a large data set, while k-fold cross-validation only has to do it k number of times. However LOOCV would be less bias because each model is trained on larger portions of the data. Additionally, LOOCV may be more respective of outliers in the data set making more variance than k-fold cross-validation would.
library(ISLR2)
data(Default)
Default <- na.omit(Default)
Default$default <- as.factor(Default$default)
Default$student <- as.factor(Default$student)
set.seed(146)
glm.fit <- glm(default ~ income + balance,
data = Default,
family = binomial)
summary(glm.fit)
Call:
glm(formula = default ~ income + balance, family = binomial,
data = Default)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1579.0 on 9997 degrees of freedom
AIC: 1585
Number of Fisher Scoring iterations: 8
library(rsample)
set.seed(146)
data_split <- initial_split(Default, prop = 0.80, strata = default)
train_data1 <- training(data_split)
test_data1 <- testing(data_split)
glm.fit.train <- glm(default ~ income + balance,
data = train_data1,
family = binomial)
glm.probs <- predict(glm.fit.train,
newdata = test_data1,
type = "response")
glm.pred <- rep("No", nrow(test_data1))
glm.pred[glm.probs > 0.5] <- "Yes"
validation.error <- mean(glm.pred != test_data1$default)
validation.error
[1] 0.0215
# Test 2
set.seed(501)
data_split1 <- initial_split(Default, prop = 0.70, strata = default)
train_data2 <- training(data_split1)
test_data2 <- testing(data_split1)
glm.fit.train <- glm(default ~ income + balance,
data = train_data2,
family = binomial)
glm.probs <- predict(glm.fit.train,
newdata = test_data2,
type = "response")
glm.pred <- rep("No", nrow(test_data2))
glm.pred[glm.probs > 0.5] <- "Yes"
validation.error <- mean(glm.pred != test_data2$default)
validation.error
[1] 0.025
#Test 3
set.seed(4)
data_split3 <- initial_split(Default, prop = 0.60, strata = default)
train_data3 <- training(data_split3)
test_data3 <- testing(data_split3)
glm.fit.train <- glm(default ~ income + balance,
data = train_data3,
family = binomial)
glm.probs <- predict(glm.fit.train,
newdata = test_data3,
type = "response")
glm.pred <- rep("No", nrow(test_data3))
glm.pred[glm.probs > 0.5] <- "Yes"
validation.error <- mean(glm.pred != test_data3$default)
validation.error
[1] 0.0255
#Test 4
set.seed(4)
data_split4 <- initial_split(Default, prop = 0.85, strata = default)
train_data4 <- training(data_split4)
test_data4 <- testing(data_split4)
glm.fit.train <- glm(default ~ income + balance,
data = train_data4,
family = binomial)
glm.probs <- predict(glm.fit.train,
newdata = test_data4,
type = "response")
glm.pred <- rep("No", nrow(test_data4))
glm.pred[glm.probs > 0.5] <- "Yes"
validation.error <- mean(glm.pred != test_data4$default)
validation.error
[1] 0.026
I decided to change up both the split sizes and the seeds, it seems like first split and training size was the best but all the rest stay around 2.5% so that’s not that bad I would imagine.
set.seed(146)
data_split5 <- initial_split(Default, prop = 0.80, strata = default)
train_data5 <- training(data_split5)
test_data5 <- testing(data_split5)
glm.fit.train <- glm(default ~ income + balance + student,
data = train_data5,
family = binomial)
glm.probs <- predict(glm.fit.train,
newdata = test_data5,
type = "response")
glm.pred <- rep("No", nrow(test_data5))
glm.pred[glm.probs > 0.5] <- "Yes"
validation.error <- mean(glm.pred != test_data5$default)
validation.error
[1] 0.022
For me it didn’t actually make the model any better, it made it .05% worse which isn’t much but I would test it in each other model to see if it’s consistently that and then I would remove it if so.
set.seed(146)
glm.fit <- glm(default ~ income + balance,
data = Default,
family = binomial)
summary(glm.fit)
Call:
glm(formula = default ~ income + balance, family = binomial,
data = Default)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1579.0 on 9997 degrees of freedom
AIC: 1585
Number of Fisher Scoring iterations: 8
library(boot)
boot.fn <- function(data, index) {
glm.fit <- glm(default ~ income + balance,
data = data,
family = binomial,
subset = index)
return(coef(glm.fit)[c("income", "balance")])
}
boot.results <- boot(Default,
boot.fn,
R = 1000)
boot.results
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Default, statistic = boot.fn, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 2.080898e-05 1.893728e-07 5.043453e-06
t2* 5.647103e-03 1.224738e-05 2.222933e-04
The results of the standard error were extremely similar between the glm() function and the bootstrap function. For income, in the glm() function the SE was 4.99e-06 and for bootstrap it was 4.72e-06. As for balance it was 2.24e-04 for glm(), and for bootstrap it was 2.27e-06. So it seems like the SE is consistent with what the bootstrap estimates for the model.
data(Boston)
Boston <- na.omit(Boston)
mu_hat <- mean(Boston$medv)
print(mu_hat)
[1] 22.53281
se_mu_hat <- sd(Boston$medv) / sqrt(length(Boston$medv))
se_mu_hat
[1] 0.4088611
The standard error for μ̂ is .409. Which means if we decided to repeatedly take samples and calculated the sample mean each time, it would be vary around the value of .409 from the true population mean.
set.seed(146)
boot.fn.mean <- function(data, index) {
return(mean(data$medv[index]))
}
boot.mean <- boot(Boston,
boot.fn.mean,
R = 1000)
boot.mean
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Boston, statistic = boot.fn.mean, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 22.53281 0.009530632 0.4083056
My standard error was .408 so it was only .001 different so extremely similar to the formula based estimate.
boot_se_mean <- sd(boot.mean$t)
ci_lower <- mu_hat - 2 * boot_se_mean
ci_upper <- mu_hat + 2 * boot_se_mean
c(ci_lower, ci_upper)
[1] 21.71620 23.34942
t.test(Boston$medv)
One Sample t-test
data: Boston$medv
t = 55.111, df = 505, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
21.72953 23.33608
sample estimates:
mean of x
22.53281
The two versions giving 95% CI for the mean of medv are extremely similar with bootstrap being 21.72, 23.35 and the t.test being 21.73, 23.35 so almost identicial
mu_hat_med <- median(Boston$medv)
mu_hat_med
[1] 21.2
boot.fn.median <- function(data, index) {
return(median(data$medv[index]))
}
boot.median <- boot(Boston,
boot.fn.median,
R = 1000)
boot.median
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Boston, statistic = boot.fn.median, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 21.2 0.0052 0.384761
This bootstrap standard error would allow me to know that the sample median of 21.2 would vary by ~.385 across repeated sampling
mu_hat_0.1 <- quantile(Boston$medv, 0.1)
mu_hat_0.1
10%
12.75
boot.fn.quantile <- function(data, index) {
return(quantile(data$medv[index], 0.1))
}
boot.quantile <- boot(Boston,
boot.fn.quantile,
R = 1000)
boot.quantile
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = Boston, statistic = boot.fn.quantile, R = 1000)
Bootstrap Statistics :
original bias std. error
t1* 12.75 0.00405 0.5040537
This shows that the bottom 10% quantile of medv is 12.75 and that the SE is .504 which does mean it has more variablility than the median.