R Markdown
3.We now review k-fold cross-validation.
The K-fold cross-validation is performed by randomly dividing the obervations into fold or k groups of equal sizes. Our first fold is the “validation set”. The cross-validation is used in order to estimate our model and how it is expected to perform. It consists of a summary of the mean of our model skill scores, the measure of the variance: standard deviation/standard error.
One of the disadvantages of validation set approach is that depending how many observations are within our model the test error rate estimare might be extremely varied. Another disadvantage is that the test error rate for our model fit on the full data set may tend to overstate the validation set error rate. For advantage, the validation set technique is both theoretically and practically straighforward.
One of the advantage of LOOCV is that it is completely random, because the model is trained on the full dataset, the bias will be minimized and the test error rate will not be overestimated. For disadvantage, in order to perfomr LOOCV requires much computing time.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
attach(Default)
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
set.seed(1)
fit.glm.1 <- glm(default ~ income + balance, data = Default, family = binomial)
summary(fit.glm.1)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
train <- sample(dim(Default)[1], dim(Default)[1]/2)
test <- Default[-train, ]
fit.glm.2 <- glm(default ~ balance + income, data = Default, family = binomial, subset = train)
predict.five <-predict(fit.glm.2, test, type= "response")
pred.glm.five <- rep("No", dim(Default)[1]*0.50)
pred.glm.five[predict.five > 0.5] <- "Yes"
table(pred.glm.five, test$default)
##
## pred.glm.five No Yes
## No 4824 108
## Yes 19 49
mean(pred.glm.five !=test$default)
## [1] 0.0254
1:
train1 <- sample(dim(Default)[1], dim(Default)[1]/2)
test1 <- Default[-train1, ]
fit.glm.c1 <- glm(default ~ balance + income, data = Default, family = binomial, subset = train1)
predict.c1 <-predict(fit.glm.c1, test1, type= "response")
pred.glm.c1 <- rep("No", dim(Default)[1]*0.50)
pred.glm.c1[predict.c1 > 0.5] <- "Yes"
table(pred.glm.c1, test1$default)
##
## pred.glm.c1 No Yes
## No 4813 112
## Yes 25 50
mean(pred.glm.c1 !=test1$default)
## [1] 0.0274
train2 <- sample(dim(Default)[1], dim(Default)[1]/2)
test2 <- Default[-train2, ]
fit.glm.c2 <- glm(default ~ balance + income, data = Default, family = binomial, subset = train2)
predict.c2 <-predict(fit.glm.c2, test2, type= "response")
pred.glm.c2 <- rep("No", dim(Default)[1]*0.50)
pred.glm.c2[predict.c2 > 0.5] <- "Yes"
table(pred.glm.c2, test2$default)
##
## pred.glm.c2 No Yes
## No 4827 103
## Yes 19 51
mean(pred.glm.c2 !=test2$default)
## [1] 0.0244
3:
train3 <- sample(dim(Default)[1], dim(Default)[1]/2)
test3 <- Default[-train3, ]
fit.glm.c3 <- glm(default ~ balance + income, data = Default, family = binomial, subset = train3)
predict.c3 <-predict(fit.glm.c3, test3, type= "response")
pred.glm.c3 <- rep("No", dim(Default)[1]*0.50)
pred.glm.c3[predict.c3 > 0.5] <- "Yes"
table(pred.glm.c3, test3$default)
##
## pred.glm.c3 No Yes
## No 4824 106
## Yes 16 54
mean(pred.glm.c3 !=test3$default)
## [1] 0.0244
First: 0.0258 Second:0.0270 Third: 0.0252
After performing model three times, we can see how our error rate change every time, for .c1 it was 0.0258, .c2 it was 0.0270 and .c3 it was 0.0252. The test error rate is variable, so it depends on what observations are used in training and validation set.
traind <- sample(dim(Default)[1], dim(Default)[1]/2)
testd <- Default[-traind, ]
fit.glm.d <- glm(default ~ balance + income + student, data = Default, family = binomial, subset = traind)
predict.d <-predict(fit.glm.d, testd, type= "response")
pred.glm.d <- rep("No", dim(Default)[1]*0.50)
pred.glm.d[predict.d > 0.5] <- "Yes"
table(pred.glm.d, testd$default)
##
## pred.glm.d No Yes
## No 4808 121
## Yes 18 53
mean(pred.glm.d !=testd$default)
## [1] 0.0278
After including student, it gave us a error rate of 0.026 which is close to the other error rates we ran.The addition of the student variable has little effect on the error rate.
set.seed(1)
attach(Default)
## The following objects are masked from Default (pos = 3):
##
## balance, default, income, student
fit.glm6 <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(fit.glm6)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
The intercept: 4.348e-01 with p-value: < 2e-16 Income: 4.985e-06 with p-value: 2.99e-05 Balance: 2.274e-04 with p-value: 2e-16
boot.fn = function(data, index) return(coef(glm(default ~ income + balance,
data = data, family = binomial,
subset = index)))
library(boot)
boot6 <- boot(Default, boot.fn, 1000)
boot6
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2* 2.080898e-05 1.680317e-07 4.866284e-06
## t3* 5.647103e-03 1.855765e-05 2.298949e-04
The standard errors obtained using the glm function compared to bootstrap are very very close to each other,
library(MASS)
attach(Boston)
mu.hat <- mean(medv)
mu.hat
## [1] 22.53281
staerr.hat <- sd(medv) / sqrt(dim(Boston)[1])
staerr.hat
## [1] 0.4088611
set.seed(1)
boot.fn9 <- function(data, index) return(mean(data[index]))
boot9 <- boot(medv, boot.fn9, 1000)
boot9
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn9, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
t.test(medv)
##
## One Sample t-test
##
## data: medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
conf.int.mu.hat <- c(22.53 - 2 * 0.414028, 22.53 + 2 * 0.414028)
conf.int.mu.hat
## [1] 21.70194 23.35806
The t.test method provides a confidence interval that is extremely close to the bootstrap confidence interval.
umed.hat <- median(medv)
umed.hat
## [1] 21.2
boot.fn9f <- function(data, index) return(median(data[index]))
boot9f <- boot(medv, boot.fn9f, 1000)
boot9f
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn9f, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0386 0.3770241
Our estimated median value is 21.2 which is the same result as our umed.hat on e).
tenth.perc <- quantile(medv, c(0.1))
tenth.perc
## 10%
## 12.75
boot.fn9h <- function(data, index) return(quantile(data[index], c(0.1)))
boot9h <- boot(medv, boot.fn9h, 1000)
boot9h
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn9h, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0186 0.4925766
Our estimated median value is 12.75 which is the same result as our tenth.perc on g).And our standard error is 0.5141291 which is smaller.