K-fold cross-validation is a method used to evaluate the performance of a machine learning model by repeatedly splitting the dataset into different training and testing sets. The dataset is first divided into K equal-sized groups, also referred to as folds. In the first iteration, K − 1 folds are used to train the model, while the remaining fold is used to test its performance. The process is then repeated K times, each time selecting a different fold as the test set and using the remaining folds for training. This is done to ensure that every observation in the dataset is used for testing just once, and multiple times for training. After all iterations are completed, the results from each round are averaged to produce an overall estimate of the model’s performance.
(i). The validation set approach?
Compared to the validation set approach, k-fold cross-validation provides gives us a more reliable estimate of a model’s performance because it allows every observation to be used for both training and testing. This makes better use of the available data, unlike the validation set approach where a portion of the dataset is permanently reserved for testing and cannot be used for training. Additionally, k-fold cross-validation reduces the risk of results being overly dependent on a single random split of the data. K-fold cross-validation has some disadvantages. It is more computationally expensive since the model must be trained multiple times rather than only once. The results may also vary depending on the choice of k, and for very large datasets or complex models, the repeated training process can require significant computational resources.
(ii). LOOCV? K-fold cross-validation is generally less computationally intensive comapared to Leave-One-Out Cross-Validation (LOOCV). This is because the model is trained only k times instead of once for every observation in the dataset. It also tends to have lower variance in performance estimates, making the evaluation more stable across different samples. On the other hand, LOOCV has the advantage of using almost the entire dataset for training in each iteration, since only one observation is left out for testing each time. This can sometimes provide a less biased estimate of model performance, especially with smaller datasets. As a result, LOOCV is much slower because the model must be fit n times, where n is the total number of observations, making it less practical for larger datasets.
Read Explore data
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
(a)Use logistic regression model that uses income and balance to predict default.
##
## Call:
## glm(formula = default ~ balance + income, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
# i)
trainDefault = sample(dim(Default)[1], dim(Default)[1]*0.50)
testDefault = Default[-trainDefault, ]
#ii
LOGmodel2 = glm(default ~ balance + income, data = Default, family = binomial, subset = trainDefault)
#iii
log.prob_def = predict(LOGmodel2, testDefault, type = "response")
log.pred_def = rep("No", dim(Default)[1]*0.50)
log.pred_def[log.prob_def > 0.5] = "Yes"
table(log.pred_def, testDefault$default)##
## log.pred_def No Yes
## No 4815 105
## Yes 29 51
## [1] 0.0268
(C)Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.
Trial #1 , using 0.6/0.4
set.seed(1)
errors <- c()
for (i in 1:3) {
train <- sample(nrow(Default), nrow(Default) * 0.5)
fit <- glm(default ~ income + balance,
data = Default,
family = binomial,
subset = train)
probs <- predict(fit, Default[-train, ], type = "response")
pred <- rep("No", nrow(Default[-train, ]))
pred[probs > 0.5] <- "Yes"
errors[i] <- mean(pred != Default[-train, ]$default)
}
errors## [1] 0.0254 0.0274 0.0244
## [1] 0.02573333
The validation set approach gave us a 2.5%, 2.7%, 2.4% test error rates for the tree trials. They seem to be close with small variation.
#i
trainDefault = sample(dim(Default)[1], dim(Default)[1]*0.50)
testDefault = Default[-trainDefault, ]
#ii
LOGmodel2 = glm(default ~ balance + income + student, data = Default, family = binomial, subset = trainDefault)
#iii
log.prob_def = predict(LOGmodel2, testDefault, type = "response")
log.pred_def = rep("No", dim(Default)[1]*0.50)
log.pred_def[log.prob_def > 0.5] = "Yes"
table(log.pred_def, testDefault$default)##
## log.pred_def No Yes
## No 4825 106
## Yes 15 54
## [1] 0.0242
Including the student variable in the model results in a test error rate of approximately 2.76%. This is very similar to the error rate obtained from the model without the student variable, shows that adding student status did not improve predictive performance.
set.seed(4)
glm.Default <- glm(default~income+balance, data=Default,family=binomial )
summary(glm.Default)##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
boot.fn = function(data, index) return(coef(glm(default ~ balance + income, data = data, family = binomial, subset = index)))##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 1.381094e-02 4.250016e-01
## t2* 5.647103e-03 -1.285551e-06 2.240173e-04
## t3* 2.080898e-05 -3.311297e-07 4.366588e-06
The estimated standard errors for the coefficients for income and balance are 4.240036e-01, 2.294044e-04, and 4.652618e-06
Explore Dataset
## [1] 22.53281
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
estimate for the population mean of medv ˆµ= 22.53281
## [1] 0.4088611
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn2, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 -0.0008359684 0.398165
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
## [1] 21.71 23.35
A 95 % confidence interval using bootstrap is between 21.71 and 23.35, using t test we have a CI of 22.53. The two result are approximately close.
## [1] 21.2
The estimated median value of 21.2 from bootstrap which is same as the value in (e), and the the standard error of 0.3874 which is relatively small compared to median value
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn2, R = 500)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.0104 0.3568986
## 10%
## 12.75
boot.fn3=function(data, index){return(quantile(data[index], c(0.1)))}
library(boot)
boot(medv,boot.fn3,500)##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn3, R = 500)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 -0.0031 0.5138614
Estimated quantile value of 12.75 from bootstap which is same as the value in (g), and the standard error of 0.5300358