This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, MSE1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error,MSE1, MSE2,…, MSEk. The k-fold CV estimate is computed by averaging these values.
Advantages:The validation set approach is conceptually simple and is easy to implement
Disadvantages: The validation MSE can be highly variable and Only a subset of observations are used to fit the model (training data).
Advantages: LOOCV have less bias. The validation approach produces different MSE when applied repeatedly due to randomness in the splitting process, while performing LOOCV multiple times will always yield the same results, because we split based on 1 obs. each time.
Disadvantage: LOOCV is computationally intensive.
library(ISLR2)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
library(boot)
default using income and
balance on the Default data set. We will now estimate the
test error of this logistic regression model using the validation set
approach. Do not forget to set a random seed before beginning your
analysis.income
and balance to predict defaultset.seed(1)
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
attach(Default)
fit.glm <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(fit.glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Split the sample set into a training set and a validation set.
Fit a multiple logistic regression model using only the training observations.
Obtain a prediction of default status for each individual in the
validation set by computing the posterior probability of default for
that individual, and classifying the individual to the
default category if the posterior probability is greater
than 0.5.
Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
#i
train=sample(dim(Default)[1], dim(Default)[1]*0.50)
test=Default[-train, ]
#ii
glm.fit=glm(default ~ income + balance, data = Default, family = binomial, subset = train)
#iii
glm.probs = predict(glm.fit, test, type = "response")
glm.pred = rep("No", dim(Default)[1]*0.50)
glm.pred[glm.probs > 0.5] = "Yes"
table(glm.pred, test$default)
##
## glm.pred No Yes
## No 4824 108
## Yes 19 49
#iv
mean(glm.pred != test$default)
## [1] 0.0254
#i
train=sample(dim(Default)[1], dim(Default)[1]*0.50)
test=Default[-train, ]
#ii
glm.fit1=glm(default ~ income + balance, data = Default, family = binomial, subset = train)
#iii
glm.probs1 = predict(glm.fit1, test, type = "response")
glm.pred1 = rep("No", dim(Default)[1]*0.50)
glm.pred1[glm.probs1 > 0.5] = "Yes"
table(glm.pred1, test$default)
##
## glm.pred1 No Yes
## No 4813 112
## Yes 25 50
#iv
mean(glm.pred1 != test$default)
## [1] 0.0274
#i
train=sample(dim(Default)[1], dim(Default)[1]*0.50)
test=Default[-train, ]
#ii
glm.fit2=glm(default ~ income + balance, data = Default, family = binomial, subset = train)
#iii
glm.probs2 = predict(glm.fit2, test, type = "response")
glm.pred2 = rep("No", dim(Default)[1]*0.50)
glm.pred2[glm.probs2 > 0.5] = "Yes"
table(glm.pred2, test$default)
##
## glm.pred2 No Yes
## No 4827 103
## Yes 19 51
#iv
mean(glm.pred2 != test$default)
## [1] 0.0244
#i
train=sample(dim(Default)[1], dim(Default)[1]*0.50)
test=Default[-train, ]
#ii
glm.fit3=glm(default ~ income + balance, data = Default, family = binomial, subset = train)
#iii
glm.probs3 = predict(glm.fit3, test, type = "response")
glm.pred3 = rep("No", dim(Default)[1]*0.50)
glm.pred3[glm.probs3 > 0.5] = "Yes"
table(glm.pred3, test$default)
##
## glm.pred3 No Yes
## No 4824 106
## Yes 16 54
#iv
mean(glm.pred3 != test$default)
## [1] 0.0244
All three different splits have the different testing error that means the observations are different for each training set and validation set.
train=sample(dim(Default)[1], dim(Default)[1]*0.50)
test=Default[-train, ]
glm.fit4=glm(default ~ income + balance + student, data = Default, family = binomial, subset = train)
glm.probs4 = predict(glm.fit4, test, type = "response")
glm.pred4 = rep("No", dim(Default)[1]*0.50)
glm.pred4[glm.probs4 > 0.5] = "Yes"
table(glm.pred4, test$default)
##
## glm.pred4 No Yes
## No 4808 121
## Yes 18 53
mean(glm.pred4 != test$default)
## [1] 0.0278
Including a student dummy variable does not lead to a
reduction in the test error rate.
default using
income and balance on the Default
data set. In particular, we will now compute estimates for the standard
errors of the income and balance logistic
regression coefficients in two different ways: (1) using the bootstrap,
and (2) using the standard formula for computing the standard errors in
the glm() function. Do not forget to set a random seed
before beginning your analysis.summary() and glm()
functions, determine the estimated standard errors for the coefficients
associated with income and balance in a
multiple logistic regression model that uses both predictors.set.seed(1)
glm.fit5=glm(default ~ income + balance, data = Default, family = "binomial")
summary(glm.fit5)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
boot.fn(), that takes as input
the Default data set as well as an index of the
observations, and that outputs the coefficient estimates for
income and balance in the multiple logistic regression
model.boot.fn <- function(data, index){
return(coef(glm(default ~ income + balance, data = data, family = binomial, subset = index)))
}
boot() function together with your
boot.fn() function to estimate the standard errors of the
logistic regression coefficients for income and
balance.boot(Default, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2* 2.080898e-05 1.680317e-07 4.866284e-06
## t3* 5.647103e-03 1.855765e-05 2.298949e-04
The estimated standard errors for the coefficients obtained using the glm() for income and balance are 4.985e-06 and 2.274e-04, respectively.
The estimated standard errors for the coefficients obtained using the bootstrap function for income and balance are 4.988746e-06 and 2.347490e-04, respectively.
Boston housing data set, from
the ISLR2 library.summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
attach(Boston)
mean.medv = mean(medv)
mean.medv
## [1] 22.53281
stderr.medv = sd(medv) / sqrt(dim(Boston)[1])
stderr.medv
## [1] 0.4088611
boot.fn1 <- function(data, index){
return (mean(data[index]))
}
boot(medv, boot.fn1, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn1, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.0109415 0.414028
It is very closed to the estimation from (b)
medv. Compare it to the
results obtained using t.test(Boston$medv).t.test(medv)
##
## One Sample t-test
##
## data: medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
CI = c(22.53281 - 2 * 0.4064926, 22.53281 + 2 * 0.4064926)
CI
## [1] 21.71982 23.34580
t.test confidence interval is simmilar to the bootstap confidence interval.
medv in the population.median.medv = median(medv)
median.medv
## [1] 21.2
boot.fn2 <- function(data, index) {
return (median(data[index]))
}
boot(medv, boot.fn2, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn2, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.02095 0.3707169
The median value is 21.2 which is similar from part(e), and the standard error is 0.3705591.
medv in Boston census tracts. Call this
quantity ˆµ0.1. (You can use the quantile() function.)tenth.medv = quantile(medv, c(0.1))
tenth.medv
## 10%
## 12.75
boot.fn3 <- function(data, index) {
return (quantile(data[index], c(0.1)))
}
boot(medv, boot.fn3, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn3, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.02795 0.5077737
The estimate for the tenth percentile of medv is 12.75,
which is the same with part(g), and the standard error is 0.4842804.