ISLR Chapter 05 (page 219): 3, 5, 6, 9
We now review k-fold cross-validation.
(a) Explain how k-fold cross-validation is implemented.
K-fold cross-valdiation maps out entire dataset and splits our data into smaller groups then test them all. These smaller group counts also called folds. the process runs in a loop, training on all groups except one and uses that final group as a test to see how well it learned.
(b) What are the advantages and disadvantages of k-fold cross validation relative to:
i. The validation set approach?
In validation set approach, we split data into 2 portions test and train. Only one part of the data gets to be train or set but on k-fold cross validation every part of the data eventually plays test and train role. This provide less bias approach estimating prediction parameters on the other hand it can be computationally heavy since it is run on loops over the partioning dataset and testing over and over again.
ii. LOOCV?
In LOOCV , we take k-fold to the extreme where k equals the total number of data points. This means it tests on only one single data point at a time and trains on all the rest. Compared to standard k-fold, LOOCV is much more computationally heavy because it has to run a loop for thousands or millions of individual points instead of just 5 or 10 groups. However, k-fold actually provides a lower-variance, because LOOCV’s training loops use almost identical data, which makes its test errors highly correlated and less stable overall.
In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.
(a) Fit a logistic regression model that uses income and balance to predict default.
First, we need to get the
Defaultdataset in ISLR library.
now lets fit a logistic regression with given parameters.
set.seed(35)
log_model_full <- glm(default ~ income + balance, data = Default, family = binomial)
summary(log_model_full)(b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:
i. Split the sample set into a training set and a validation set.
train_sample <- sample(nrow(Default), nrow(Default) * 0.5)
train_set <- Default[train_sample, ]
test_set <- Default[-train_sample, ] ii. Fit a multiple logistic regression model using only the training observations.
iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
probs_log <- predict(log_model_train, newdata = test_set, type = "response")
preds_log <- ifelse(probs_log > 0.5, "Yes","No")iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
(c) Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.
for (i in 1:3)
{
loop_samples <- sample(nrow(Default), nrow(Default) / 2)
loop_train <- Default[loop_samples, ]
loop_test <- Default[-loop_samples, ]
loop_model <- glm(default ~ income + balance, data = loop_train, family = binomial)
loop_probs <- predict(loop_model, newdata = loop_test, type = "response")
loop_preds <- ifelse(loop_probs > 0.5, "Yes", "No")
loop_error <- mean(loop_preds != loop_test$default)
cat( i, "Error Rate:", loop_error, "\n")
}We tried same train test split method 3 times and obtained different results. This show error rate might fluctuate depending on which part of the data is being utilized as train or test.
(d) Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.
log_model_dummy <- glm(default ~ income + balance + student, data = train_set, family = binomial)
probs_dummy <- predict(log_model_dummy, newdata=test_set, type="response")
preds_dummy <- ifelse(probs_dummy > 0.5, "Yes", "No")
log_model_dummy_error <- mean(preds_dummy != test_set$default)
print(log_model_dummy_error)Including
studentvariable increased error rate slightly.
We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.
(a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.
model_glm <- glm(default ~ income + balance, data = Default, family = binomial)
summary(model_glm)$coefficients(b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.
boot.fn <- function(data, index)
{ boot_fit <- glm(default ~ income + balance, data = data, family = binomial, subset = index)
return(coef(boot_fit))}(c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.
library(boot)
boot_results <- boot(data = Default, statistic = boot.fn, R = 1000)
print(boot_results)(d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.
The standard errors obtained from glm() and the bootstrap are nearly identical. For income, glm() gives 4.99e-06 compared to the bootstrap’s 4.74e-06, and for balance, glm() gives 2.27e-04 compared to 2.19e-04. Because these numbers are so close, it proves the standard glm() formula works perfectly for our data and we can completely trust our error estimates. ***
We will now consider the Boston housing data set, from the ISLR2 library.
(a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate \(\hat{\mu}\).
(b) Provide an estimate of the standard error of \(\hat{\mu}\). Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.
stand_error <- sd(Boston$medv)
n <- nrow(Boston)
se_formula <- stand_error / sqrt(n)
cat("Formula Standard Error:", se_formula)(c) Now estimate the standard error of \(\hat{\mu}\) using the bootstrap. How does this compare to your answer from (b)?
mean.fn <-function(data, index)
{
return(mean(Boston$medv[index]))
}
boot_mean <- boot(data = Boston, statistic = mean.fn, R=1000)
print(boot_mean)The standard error from the standard formula (0.409) and the bootstrap method (0.403) are almost exactly the same. Because these two numbers match so closely, it proves our estimate for the average home value is highly stable and reliable.
(d) Based on your bootstrap estimate from (c),
provide a 95% confidence interval for the mean of medv. Compare it to
the results obtained using t.test(Boston$medv).
Hint: You can approximate a 95% confidence interval using the formula \([\hat{\mu} - 2\text{SE}(\hat{\mu}), \hat{\mu} + 2\text{SE}(\hat{\mu})]\).
sd_boot <- sd(boot_mean$t)
conf_lower <- hat - 2 * sd_boot
conf_upper <- hat + 2 * sd_boot
cat("Bootstrap 95%: [", conf_lower, ",", conf_upper, "]")The 95% confidence intervals from both methods are almost identical, with the bootstrap interval at [21.73, 23.34] and the t-test interval at [21.73, 23.34].
(e) Based on this data set, provide an estimate, \(\hat{\mu}_{\text{med}}\), for the median value of medv in the population.
(f) We now would like to estimate the standard error of \(\hat{\mu}_{\text{med}}\). Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.
median.fn <- function(data, index) {
return(median(data$medv[index]))
}
boot_median <- boot(data = Boston, statistic = median.fn, R = 1000)
print(boot_median)The estimated population median home value is 21.2 and the bootstrap gives this median a standard error of 0.372. Since there is no simple textbook formula to find the standard error of a median, the bootstrap is essential here. This low error value shows that our median estimate is highly stable and reliable.
(g) Based on this data set, provide an estimate for
the tenth percentile of medv in Boston census tracts. Call this quantity
\(\hat{\mu}_{0.1}\). (You can use
the quantile() function.)
(h) Use the bootstrap to estimate the standard error of \(\hat{\mu}_{0.1}\). Comment on your findings.
quantile_10.fn <- function(data, index) {
return(quantile(data$medv[index], 0.1))
}
boot_quantile <- boot(data = Boston, statistic = quantile_10.fn, R = 1000)
print(boot_quantile)The bootstrap gives our 10th percentile estimate (12.75) and standard error of 0.509. Because percentiles don’t have a simple, standard formula for calculating error, the bootstrap is incredibly valuable here. This low standard error shows that our estimate for the lowest 10% of home values is highly stable and won’t fluctuate much with a different sample.