K-fold cross-validation works by dividing the dataset into k equal-sized folds; the model is then fit k times, each time training on k-1 folds and validating on the remaining fold.
Advantages relative to the validation set approach:
-Better data utilization: k-fold cross-validation ensures that every data point is used for both training and validation, unlike the validation set approach, where part of the data is never used for training.
Disadvantages relative to the validation set approach:
-Higher computational cost: training the model k times takes considerably longer than a single training/validation split.
-More complexity: implementing k-fold cross-validation is substantially more complex than implementing a simple validation set split.
Advantages relative to LOOCV:
-Lower variance: LOOCV trains on nearly the entire dataset in every iteration, so the n fitted models are highly similar and the resulting error estimate can have high variance; k-fold cross-validation typically gives a more stable estimate.
-Lower computational cost: LOOCV requires n model fits, while k-fold cross-validation only requires k fits, making it computationally more efficient.
Disadvantages relative to LOOCV:
-Potential bias: k-fold cross-validation usually has higher bias than LOOCV, because each training set is smaller.
-The choice of k: the results can vary depending on the choice of k, which makes this parameter choice more critical. (A short cv.glm() sketch illustrating both procedures follows below.)
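As a rough sketch (assuming the ISLR Default data used later in this document and boot::cv.glm(); the cost function is the standard misclassification cost from the cv.glm documentation), both procedures could be run as follows: K = 10 needs only 10 model fits, while leaving K at its default of n gives LOOCV.
library(ISLR)   # Default data
library(boot)   # cv.glm()
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)   # misclassification cost
fit <- glm(default ~ income + balance, data = Default, family = binomial)
cv.glm(Default, fit, cost = cost, K = 10)$delta[1]    # 10-fold CV: 10 fits
# cv.glm(Default, fit, cost = cost)$delta[1]          # LOOCV: n fits, much slower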
install.packages("ISLR")
## Installing package into 'C:/Users/austr/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'ISLR' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\austr\AppData\Local\Temp\RtmpUHnmTh\downloaded_packages
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.4.3
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.2
## Loading required package: lattice
set.seed(123)
glm_fit <- glm(default ~ income + balance, data = Default, family = binomial)
summary(glm_fit)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
train_index <- createDataPartition(Default$default, p = 0.5, list = FALSE)
train_data <- Default[train_index, ]
valid_data <- Default[-train_index, ]
glm_train <- glm(default ~ income + balance, data = train_data, family = binomial)
probabilities <- predict(glm_train, valid_data, type = "response")
predictions <- ifelse(probabilities > 0.5, "Yes", "No")
error_rate <- mean(predictions != valid_data$default)
print(paste("Validation Set Error:", round(error_rate, 4)))
## [1] "Validation Set Error: 0.0274"
set.seed(456)
train_index2 <- createDataPartition(Default$default, p = 0.5, list = FALSE)
train_data2 <- Default[train_index2, ]
valid_data2 <- Default[-train_index2, ]
glm_train2 <- glm(default ~ income + balance, data = train_data2, family = binomial)
probabilities2 <- predict(glm_train2, valid_data2, type = "response")
predictions2 <- ifelse(probabilities2 > 0.5, "Yes", "No")
error_rate2 <- mean(predictions2 != valid_data2$default)
print(paste("Validation Set Error (2nd Split):", round(error_rate2, 4)))
## [1] "Validation Set Error (2nd Split): 0.0242"
set.seed(789)
train_index3 <- createDataPartition(Default$default, p = 0.5, list = FALSE)
train_data3 <- Default[train_index3, ]
valid_data3 <- Default[-train_index3, ]
glm_train3 <- glm(default ~ income + balance, data = train_data3, family = binomial)
probabilities3 <- predict(glm_train3, valid_data3, type = "response")
predictions3 <- ifelse(probabilities3 > 0.5, "Yes", "No")
error_rate3 <- mean(predictions3 != valid_data3$default)
print(paste("Validation Set Error (3rd Split):", round(error_rate3, 4)))
## [1] "Validation Set Error (3rd Split): 0.0266"
Looking at the results of the three splits, all three error rates are relatively close to each other (2.42%-2.74%), which suggests that the model's performance is fairly stable across different training/validation partitions. These results also suggest that logistic regression with income and balance as predictors provides a consistent and reasonably low misclassification rate for predicting default. A small helper that reproduces this procedure for any seed is sketched below.
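As a sketch (assuming Default and caret are already loaded, as above), the repeated split/fit/predict steps could be wrapped in a helper so additional seeds are easy to try:
validation_error <- function(seed, formula = default ~ income + balance) {
  set.seed(seed)
  # Same 50/50 stratified split used above
  idx <- createDataPartition(Default$default, p = 0.5, list = FALSE)
  fit <- glm(formula, data = Default[idx, ], family = binomial)
  prob <- predict(fit, Default[-idx, ], type = "response")
  # Misclassification rate on the held-out half
  mean(ifelse(prob > 0.5, "Yes", "No") != Default[-idx, ]$default)
}
sapply(c(123, 456, 789), validation_error)   # mirrors the three splits above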
set.seed(123)
train_index4 <- createDataPartition(Default$default, p = 0.5, list = FALSE)
train_data4 <- Default[train_index4, ]
valid_data4 <- Default[-train_index4, ]
glm_train4 <- glm(default ~ income + balance + student, data = train_data4, family = binomial)
probabilities4 <- predict(glm_train4, valid_data4, type = "response")
predictions4 <- ifelse(probabilities4 > 0.5, "Yes", "No")
error_rate4 <- mean(predictions4 != valid_data4$default)
print(paste("Validation Test Error with Student Variable:", round(error_rate4, 4)))
## [1] "Validation Test Error with Student Variable: 0.0278"
We can see that the validation set error with the student dummy variable included is 2.78%, which is slightly higher than the errors obtained without it (2.42%-2.74%). This suggests that adding the dummy variable does not lead to a meaningful reduction in test error; it adds complexity without improving predictive performance, which indicates that student is not a strong predictor once income and balance are already in the model.
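For an apples-to-apples check (a sketch reusing the seed-123 split objects train_data4 and valid_data4 from above), the model without student can be evaluated on the same validation data:
glm_base <- glm(default ~ income + balance, data = train_data4, family = binomial)
prob_base <- predict(glm_base, valid_data4, type = "response")
# Error rate without student on the identical validation set
mean(ifelse(prob_base > 0.5, "Yes", "No") != valid_data4$default)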
set.seed(123)
glm_fit <- glm(default ~ income + balance, data = Default, family = binomial)
summary(glm_fit)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
## melanoma
boot.fn <- function(data, indices) {
  # Refit the logistic regression on the bootstrap sample defined by 'indices'
  data_boot <- data[indices, ]
  glm_boot <- glm(default ~ income + balance, data = data_boot, family = binomial)
  # Return only the income and balance coefficients
  return(coef(glm_boot)[c("income", "balance")])
}
set.seed(456)
boot_results <- boot(Default, boot.fn, R = 1000)
boot_results
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.080898e-05 9.751512e-08 4.869472e-06
## t2* 5.647103e-03 1.565097e-05 2.238660e-04
From the glm() summary, the standard error of income is 4.985e-06 and the standard error of balance is 2.274e-04. From the bootstrap, the standard error for income is about 4.87e-06 and the standard error for balance is about 2.24e-04. The bootstrap standard errors are slightly smaller than the formula-based values from glm(), but the difference is very small, which suggests that the logistic regression model is well behaved.
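The two sets of standard errors can also be pulled side by side (a sketch that assumes glm_fit and boot_results from above are still in the workspace):
formula_se <- summary(glm_fit)$coefficients[c("income", "balance"), "Std. Error"]
boot_se <- apply(boot_results$t, 2, sd)   # column-wise SD of the bootstrap replicates
rbind(formula_se, boot_se)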
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.2
##
## Attaching package: 'ISLR2'
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
mean_medv <- mean(Boston$medv)
mean_medv
## [1] 22.53281
sd_medv <- sd(Boston$medv)
n <- length(Boston$medv)
se_mean_medv <- sd_medv / sqrt(n)
se_mean_medv
## [1] 0.4088611
The estimate of the standard error is 0.4089. This means that if we repeatedly drew samples of this size, the sample mean of medv would typically vary by about 0.41 units around the true population mean.
## c.
library(boot)
boot_fn <- function(data, indices) {
  # Mean of medv for the bootstrap sample defined by 'indices'
  sample_data <- data[indices]
  return(mean(sample_data))
}
boot_results <- boot(Boston$medv, boot_fn, R = 1000)
se_bootstrap <- sd(boot_results$t)
se_bootstrap
## [1] 0.4007338
We can see that the estimated standard error using the bootstrap is 0.4007, which is slightly lower than the formula-based estimate from the previous part (0.4089). Since the two values are very similar, the bootstrap supports the conclusion that the sample mean is a good estimate of the population mean.
ci_lower <- mean_medv - 2 * se_bootstrap
ci_upper <- mean_medv + 2 * se_bootstrap
ci_lower
## [1] 21.73134
ci_upper
## [1] 23.33427
t_test_result <- t.test(Boston$medv)
t_test_result$conf.int
## [1] 21.72953 23.33608
## attr(,"conf.level")
## [1] 0.95
We can see that the two intervals (21.73 to 23.33 from the bootstrap, and 21.73 to 23.34 from t.test()) are very close in value, which suggests that the normality assumption underlying the t-test is reasonable for this dataset.
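As an optional cross-check (a sketch that reuses the boot_results object for the mean from above), boot.ci() can produce a percentile interval that does not rely on the mean plus or minus 2*SE approximation:
boot.ci(boot_results, type = "perc")   # percentile bootstrap CI for mean(medv)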
median_medv <- median(Boston$medv)
median_medv
## [1] 21.2
boot_fn_median <- function(data, indices) {
  # Median of medv for the bootstrap sample
  sample_data <- data[indices]
  return(median(sample_data))
}
boot_results_median <- boot(Boston$medv, boot_fn_median, R = 1000)
se_median_bootstrap <- sd(boot_results_median$t)
se_median_bootstrap
## [1] 0.3903302
The estimated standard error of the median using the bootstrap is about 0.39. This value is relatively small, which means that the sample provides a stable estimate of the median.
percentile_10 <- quantile(Boston$medv, 0.1)
percentile_10
## 10%
## 12.75
boot_fn_percentile_10 <- function(data, indices) {
  # 10th percentile of medv for the bootstrap sample
  sample_data <- data[indices]
  return(quantile(sample_data, 0.1))
}
boot_results_percentile_10 <- boot(Boston$medv, boot_fn_percentile_10, R = 1000)
se_percentile_10_bootstrap <- sd(boot_results_percentile_10$t)
se_percentile_10_bootstrap
## [1] 0.5100265
The estimated standard error of the 10th percentile (about 0.51) is somewhat larger than the bootstrap standard error of the median (about 0.39). Percentiles toward the edges of the distribution usually have higher variability because they are more influenced by small changes in the sample than the median is.
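To illustrate this point (a sketch assuming boot and ISLR2 remain loaded; the helper quantile_se and the seed are hypothetical additions, not part of the exercise), the bootstrap standard error can be computed for several quantiles at once:
quantile_se <- function(p, R = 1000) {
  # Bootstrap SE of the p-th sample quantile of medv
  b <- boot(Boston$medv, function(d, i) quantile(d[i], p), R = R)
  sd(b$t)
}
set.seed(1)   # hypothetical seed, only for reproducibility of this sketch
sapply(c(0.1, 0.25, 0.5), quantile_se)   # SEs for the 10th and 25th percentiles and the median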