Explain how k-fold cross-validation is implemented. K-fold cross-validation is implemented by randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k - 1 folds. The mean squared error is then computed on the observations in the held-out fold. This procedure is repeated k times, holding out a different fold each time, which yields k estimates of the test error. The k-fold CV estimate is the average of these k values.
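To make the procedure concrete, here is a minimal sketch of manual k-fold CV in R; the data frame df and response y are hypothetical placeholders, and in practice boot::cv.glm() automates this.
k <- 5
n <- nrow(df)
folds <- sample(rep(1:k, length.out = n))  # random fold assignment
cv.errors <- rep(NA, k)
for (j in 1:k) {
  fit <- lm(y ~ ., data = df[folds != j, ])         # fit on the other k - 1 folds
  pred <- predict(fit, newdata = df[folds == j, ])  # predict the held-out fold
  cv.errors[j] <- mean((df$y[folds == j] - pred)^2) # MSE on the held-out fold
}
mean(cv.errors)  # the k-fold CV estimate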
What are the advantages and disadvantages of k-fold cross-validation relative to (i) the validation set approach and (ii) LOOCV?
Advantages of k-fold CV relative to the validation set approach:
- The validation estimate of the test error rate can be highly variable, depending on which observations fall in the training and validation sets; k-fold CV averages over several splits and is more stable.
- The validation set approach fits the model on only a subset of the observations, so the validation set error tends to overestimate the test error rate for a model fit on the entire data set.
Disadvantage of k-fold CV relative to the validation set approach:
- The validation set approach is conceptually simpler and easier to implement.
K-fold CV relative to LOOCV:
- Disadvantage: LOOCV gives approximately unbiased estimates of the test error, whereas k-fold CV with k < n has some bias.
- Advantage: LOOCV is far more computationally intensive, since it requires fitting the model n times; k-fold CV requires only k fits and often has lower variance (see the sketch below).
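For reference, a hedged sketch of how both estimates can be obtained with boot::cv.glm(), using the Default logistic regression fit later in this document; K defaults to n, i.e. LOOCV.
library(ISLR)
library(boot)
fit <- glm(default ~ balance + income, data = Default, family = binomial)
cv.glm(Default, fit, K = 10)$delta  # 10-fold CV estimate (10 model fits)
# cv.glm(Default, fit)$delta        # LOOCV: n = 10000 model fits, very slow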
library(ISLR)
set.seed(1)
attach(Default)
head(Default)
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
names(Default)
## [1] "default" "student" "balance" "income"
glm.default <- glm(default ~ balance + income, data = Default, family = binomial)
summary(glm.default)
##
## Call:
## glm(formula = default ~ balance + income, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Using the validation set approach, estimate the test error of this model: split the observations into a training set and a validation set, fit the model on the training set, and classify each validation observation as default if the posterior probability exceeds 0.5.
train <- sample(10000,2000)
fit.glm <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
summary(fit.glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5698 -0.1442 -0.0574 -0.0245 3.3545
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.202e+01 9.452e-01 -12.72 < 2e-16 ***
## income 4.689e-05 1.111e-05 4.22 2.44e-05 ***
## balance 5.407e-03 4.725e-04 11.44 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 606.86 on 1999 degrees of freedom
## Residual deviance: 317.13 on 1997 degrees of freedom
## AIC: 323.13
##
## Number of Fisher Scoring iterations: 8
probs <- predict(fit.glm, newdata = Default[-train, ], type = "response")
pred.glm <- rep("No", length(probs))
pred.glm[probs > 0.5] <- "Yes"
mean(pred.glm != Default[-train, ]$default)
## [1] 0.02675
The validation set error is 2.7%.
train_2 <- sample(10000,3000)
fit.glm2 <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train_2)
summary(fit.glm2)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train_2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8820 -0.1406 -0.0533 -0.0188 3.3745
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.210e+01 8.412e-01 -14.384 < 2e-16 ***
## income 2.787e-05 9.065e-06 3.074 0.00211 **
## balance 5.863e-03 4.393e-04 13.349 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 870.12 on 2999 degrees of freedom
## Residual deviance: 470.72 on 2997 degrees of freedom
## AIC: 476.72
##
## Number of Fisher Scoring iterations: 8
probs2 <- predict(fit.glm2, newdata = Default[-train_2, ], type = "response")
pred.glm2 <- rep("No", length(probs2))
pred.glm2[probs2 > 0.5] <- "Yes"
mean(pred.glm2 != Default[-train_2, ]$default)
## [1] 0.02628571
The validation set error is 2.6%.
train_3 <- sample(10000,1000)
fit.glm3 <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train_3)
summary(fit.glm3)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train_3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0782 -0.1333 -0.0544 -0.0216 3.3497
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.166e+01 1.476e+00 -7.902 2.76e-15 ***
## income 2.401e-05 1.707e-05 1.407 0.159
## balance 5.604e-03 7.841e-04 7.148 8.84e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 241.10 on 999 degrees of freedom
## Residual deviance: 139.07 on 997 degrees of freedom
## AIC: 145.07
##
## Number of Fisher Scoring iterations: 8
probs3 <- predict(fit.glm3, newdata = Default[-train_3, ], type = "response")
pred.glm3 <- rep("No", length(probs3))
pred.glm3[probs3 > 0.5] <- "Yes"
mean(pred.glm3 != Default[-train_3, ]$default)
## [1] 0.02755556
The validation set error is 2.8%.
train_4 <- sample(10000,4500)
fit.glm4 <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train_4)
summary(fit.glm4)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train_4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4884 -0.1449 -0.0582 -0.0206 3.7263
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.149e+01 6.366e-01 -18.048 <2e-16 ***
## income 1.762e-05 7.172e-06 2.457 0.014 *
## balance 5.710e-03 3.404e-04 16.775 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1342.13 on 4499 degrees of freedom
## Residual deviance: 722.79 on 4497 degrees of freedom
## AIC: 728.79
##
## Number of Fisher Scoring iterations: 8
probs4 <- predict(fit.glm4, newdata = Default[-train_4, ], type = "response")
pred.glm4 <- rep("No", length(probs4))
pred.glm4[probs4 > 0.5] <- "Yes"
mean(pred.glm4 != Default[-train_4, ]$default)
## [1] 0.02672727
The validation set error is 2.7%.
Comment on the results obtained. The three additional random splits give validation set error estimates of 0.0263, 0.0276, and 0.0267. Changing the size and composition of the training set changes the validation set error only slightly: all of the estimates lie near 2.6-2.8%, as the loop sketch below also illustrates.
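The repetition above can be condensed into a helper function (a hypothetical convenience, not part of the original analysis) that draws a fresh split and returns the validation error:
val.err <- function(n.train) {
  tr <- sample(10000, n.train)
  fit <- glm(default ~ income + balance, data = Default,
             family = "binomial", subset = tr)
  p <- predict(fit, newdata = Default[-tr, ], type = "response")
  mean(ifelse(p > 0.5, "Yes", "No") != Default$default[-tr])
}
replicate(5, val.err(5000))  # five fresh splits, five error estimates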
Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate its test error with the validation set approach.
train <- sample(10000,2000)
glm <- glm(default ~ income + balance + student, data = Default, family = "binomial", subset = train)
summary(glm)
##
## Call:
## glm(formula = default ~ income + balance + student, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.67265 -0.17739 -0.07445 -0.02897 3.12567
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.033e+01 1.018e+00 -10.152 <2e-16 ***
## income 8.456e-06 1.746e-05 0.484 0.628
## balance 5.302e-03 4.735e-04 11.197 <2e-16 ***
## studentYes -5.354e-01 5.070e-01 -1.056 0.291
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 586.82 on 1999 degrees of freedom
## Residual deviance: 353.13 on 1996 degrees of freedom
## AIC: 361.13
##
## Number of Fisher Scoring iterations: 8
probs <- predict(glm, newdata = Default[-train, ], type = "response")
pred.glm <- rep("No", length(probs))
pred.glm[probs > 0.5] <- "Yes"
mean(pred.glm != Default[-train, ]$default)
## [1] 0.02575
After adding the dummy variable student, the validation set error is 2.6%. Including student does not noticeably change the validation set error.
Do not forget to set a random seed before beginning your analysis.
set.seed(1)
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
fit.glm <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
summary(fit.glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5830 -0.1428 -0.0573 -0.0213 3.3395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.194e+01 6.178e-01 -19.333 < 2e-16 ***
## income 3.262e-05 7.024e-06 4.644 3.41e-06 ***
## balance 5.689e-03 3.158e-04 18.014 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.8 on 4999 degrees of freedom
## Residual deviance: 803.3 on 4997 degrees of freedom
## AIC: 809.3
##
## Number of Fisher Scoring iterations: 8
The glm() estimates of the standard errors are 0.6178 for β0 (intercept), 7.024e-06 for β1 (income), and 3.158e-04 for β2 (balance).
Next, use the bootstrap to estimate the standard errors of these coefficients.
default <- data.frame(Default)
library(boot)
boot.fn <- function(data, index) {
  fit <- glm(default ~ income + balance, data = data, family = "binomial", subset = index)
  return(coef(fit))
}
boot(default,boot.fn,100)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = default, statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 7.265692e-03 4.116057e-01
## t2* 2.080898e-05 -4.396504e-07 4.204169e-06
## t3* 5.647103e-03 -2.197193e-06 2.219810e-04
The estimated standard errors from the glm() function were 0.6178, 7.024e-06, and 3.158e-04. Using the bootstrap with R = 100 replicates, the standard errors are 4.116e-01, 4.204e-06, and 2.220e-04, which are quite close to the glm() estimates.
We now consider the Boston housing data set from the MASS library. First, provide an estimate for the population mean of medv; call this estimate µ_hat.
library(MASS)
attach(Boston)
µ_hat <- mean(Boston$medv)
µ_hat
## [1] 22.53281
Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.
µ_hat_sd <- sd(Boston$medv)
µ_hat_obs <- sqrt(dim(Boston)[1])
µ_hat_sd/µ_hat_obs
## [1] 0.4088611
The estimate of the standard error of µ_hat is 0.4088611.
boot.fn2<-function(data,index){
mu<-mean(data[index])
return(mu)
}
set.seed(1)
boot(Boston$medv,boot.fn2,100)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn2, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.009027668 0.3482331
The bootstrap estimate of the standard error of µ_hat is 0.3482331, reasonably close to the formula-based estimate of 0.4088611 obtained above.
Hint: You can approximate a 95% confidence interval using the formula [µ̂ − 2·SE(µ̂), µ̂ + 2·SE(µ̂)].
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
# 95% CI: [µ̂ − 2·SE(µ̂), µ̂ + 2·SE(µ̂)]
confid<- c(µ_hat - 2*0.3482331, µ_hat + 2*0.3482331)
confid
## [1] 21.83634 23.22927
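This interval, (21.84, 23.23), is very close to the 95% confidence interval reported by t.test(), (21.73, 23.34).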
µ_med <-median(Boston$medv)
µ_med
## [1] 21.2
The estimate for the median value of medv is 21.2.
boot.fn3<-function(data,index){
mu_median<-median(data[index])
return(mu_median)
}
set.seed(1)
boot(Boston$medv,boot.fn3,100)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn3, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.029 0.3461316
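The bootstrap estimate of the standard error of the median is 0.3461316, which is small relative to the median estimate of 21.2, so the median is estimated fairly precisely.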
µ0.1 <- quantile(Boston$medv, c(0.1))
µ0.1
## 10%
## 12.75
The estimate for the tenth percentile of medv is 12.75.
boot.fn4 <- function(data, index) {
mu.quant <- quantile(data[index], c(0.1))
return (mu.quant)
}
boot(medv, boot.fn4, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn4, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0285 0.4861472
We get an estimated tenth percentile value of 12.75, which is equal to the value obtained in (g). We also get a standard error of 0.4861472, which is relatively small compared to the percentile value.