K-fold CV gives a more stable (lower-variance) estimate of the test mean squared error than the validation set approach, and it also has less bias, since each training fold uses a larger share of the observations. K-fold is somewhat more computationally expensive because the model must be fit k times, but compute power is cheap these days.
K-fold CV is less computationally expensive than LOOCV, which is just the special case of k-fold with k = n. When k < n, k-fold may have somewhat higher bias than LOOCV, but it will also have less variance under those conditions.
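As a quick illustration of the computational point (a sketch, not part of the exercise; it assumes the Auto data from the ISLR package is available, as Default is below), cv.glm() in the boot package performs LOOCV by default and k-fold CV when K is set:
library(ISLR) # for the Auto data (assumption for this sketch)
library(boot) # for cv.glm()
glm.fit <- glm(mpg ~ horsepower, data = Auto)
cv.glm(Auto, glm.fit)$delta[1] # K defaults to n, i.e. LOOCV: n model fits
cv.glm(Auto, glm.fit, K = 10)$delta[1] # 10-fold CV: only 10 model fits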
In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.
(a) Fit a logistic regression model that uses income and balance to predict default.
set.seed(1)
Default.data <- Default
glm.5A <- glm(default~income+balance, data=Default.data, family = "binomial")
set.seed(1)
index.5B <- sample(1:nrow(Default.data), 0.5*nrow(Default.data))
Default.train <- Default.data[index.5B, ]
Default.test <- Default.data[-index.5B, ]
Default.true <- Default.test$default
set.seed(1)
glm.5B <- glm(default ~ income + balance, data = Default.train, family = "binomial")
Obtain a prediction of default status for each individual in the validation set by computing its posterior probability of default, classifying the individual to the default category if the posterior probability is greater than 0.5.
glm.5B.probs <- predict(glm.5B, Default.test, type = "response")
glm.5B.preds <- rep("No", length(glm.5B.probs))
glm.5B.preds[glm.5B.probs > 0.5] <- "Yes"
mean(glm.5B.preds != Default.true)
## [1] 0.0254
The validation set estimate of the test error rate is about 2.54%.
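For a little more detail than the single error rate, a confusion matrix of the same predictions (a quick sketch):
table(predicted = glm.5B.preds, actual = Default.true)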
set.seed(1)
fn.5C <- function() { # my first function
index.5C <- sample(1:nrow(Default.data), 0.5*nrow(Default.data))
Default.train.5C <- Default.data[index.5C, ]
Default.test.5C <- Default.data[-index.5C, ]
Default.true.5C <- Default.test.5C$default
glm.5C <- glm(default ~ income + balance, data = Default.train.5C, family = "binomial")
glm.5C.probs <- predict(glm.5C, Default.test.5C, type = "response")
glm.5C.preds <- rep("No", length(glm.5C.probs))
glm.5C.preds[glm.5C.probs > 0.5] <- "Yes"
return(mean(glm.5C.preds != Default.true.5C))
}
fn.5C()
## [1] 0.0254
fn.5C()
## [1] 0.0274
fn.5C()
## [1] 0.0244
mean(c(0.0254, 0.0274, 0.0244))
## [1] 0.02573333
The three splits average out to an error rate of about 0.0257, i.e. roughly 2.57%.
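A tidier way to do the same thing (a sketch; with the same seed, replicate() reproduces the three splits above):
set.seed(1)
errs.5C <- replicate(3, fn.5C()) # three different random splits
mean(errs.5C)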
(d) Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.
set.seed(1)
index.5D <- sample(1:nrow(Default.data), 0.5*nrow(Default.data))
Default.train.5D <- Default.data[index.5D, ]
Default.test.5D <- Default.data[-index.5D, ]
Default.true.5D <- Default.test.5D$default
glm.5D <- glm(default ~ income + balance + student, data = Default.train.5D, family = "binomial")
glm.5D.probs <- predict(glm.5D, Default.test.5D, type = "response")
glm.5D.preds <- rep("No", length(glm.5D.probs))
glm.5D.preds[glm.5D.probs > 0.5] <- "Yes"
mean(glm.5D.preds != Default.true.5D)
## [1] 0.026
The error rate is 2.6% with the student variable included, so adding the student dummy variable does not appear to improve the test error rate.
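Because a single split is noisy, a firmer comparison averages both models over the same set of random splits (a sketch; fn.5D is a hypothetical helper mirroring fn.5C with student added):
fn.5D <- function() { # hypothetical: same procedure as fn.5C, plus student
idx <- sample(1:nrow(Default.data), 0.5*nrow(Default.data))
fit <- glm(default ~ income + balance + student, data = Default.data[idx, ], family = "binomial")
probs <- predict(fit, Default.data[-idx, ], type = "response")
preds <- ifelse(probs > 0.5, "Yes", "No")
return(mean(preds != Default.data$default[-idx]))
}
set.seed(1)
mean(replicate(10, fn.5C())) # without student
set.seed(1)
mean(replicate(10, fn.5D())) # with student, over the same ten splits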
We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.
(a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.
set.seed(1)
glm.6A <- glm(default ~ income + balance, data = Default.data, family = "binomial")
summary(glm.6A)$coefficients[ ,2]
## (Intercept) income balance
## 4.347564e-01 4.985167e-06 2.273731e-04
(b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.
boot.fn <- function(data, index){
return(coef(glm(default ~ income + balance, data=data, family="binomial", subset=index)))
}
library(boot)
boot(Default.data, boot.fn, 100)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default.data, statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 8.556378e-03 4.122015e-01
## t2* 2.080898e-05 -3.993598e-07 4.186088e-06
## t3* 5.647103e-03 -4.116657e-06 2.226242e-04
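To line the two sets of standard errors up directly (a sketch; boot.6C is just a hypothetical name for a saved run of the same call):
boot.6C <- boot(Default.data, boot.fn, 100) # same call as above, saved this time
rbind(glm.formula = summary(glm.6A)$coefficients[, 2], # formula-based SEs
bootstrap = apply(boot.6C$t, 2, sd)) # SD of the bootstrap replicates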
We will now consider the Boston housing data set, from the MASS library.
library(MASS)
Boston.data <- Boston
(a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate μ̂.
set.seed(1)
mu <- mean(Boston.data$medv)
mu
## [1] 22.53281
mu.std <- sd(Boston.data$medv)/sqrt(length(Boston.data$medv))
mu.std
## [1] 0.4088611
boot.fn <- function(data, index) {
return(mean(data[index]))
}
bootstrp.9C <- boot(Boston.data$medv, boot.fn, 5000)
bootstrp.9C
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston.data$medv, statistic = boot.fn, R = 5000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 -0.002824941 0.4042429
The bootstrap std. error is fairly close to the formula-based value calculated in 9-B: 0.40424 versus 0.40886.
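For intuition, this is essentially what boot() does under the hood (a base-R sketch): resample medv with replacement, recompute the mean, and take the standard deviation of the replicates.
set.seed(1)
boot.means <- replicate(5000, mean(sample(Boston.data$medv, replace = TRUE)))
sd(boot.means) # should land near the 0.404 std. error reported above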
(d) Based on the bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95% confidence interval using the formula [μ̂ − 2·SE(μ̂), μ̂ + 2·SE(μ̂)].
t.test(Boston.data$medv)
##
## One Sample t-test
##
## data: Boston.data$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
The goal is to come close to the t.test interval of (21.729, 23.336).
bootstrp.9C$t0 - 2 * mu.std
## [1] 21.71508
bootstrp.9C$t0 + 2 * mu.std
## [1] 23.35053
c(bootstrp.9C$t0 - 2 * mu.std, bootstrp.9C$t0 + 2 * mu.std)
## [1] 21.71508 23.35053
The bootstrap-based interval is very close to the t.test interval. (Strictly speaking, the interval above plugs in mu.std, the formula-based SE from 9-B; substituting the bootstrap standard error of about 0.404 would give a nearly identical interval.)
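The boot package can also build an interval directly from the replicates; for example, a percentile interval from the 9-C bootstrap object (a sketch):
boot.ci(bootstrp.9C, conf = 0.95, type = "perc")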
(e) Based on this data set, provide an estimate, μ̂med, for the median value of medv in the population.
med.medv <- median(Boston.data$medv)
med.medv
## [1] 21.2
There is no simple formula for the standard error of the median, so we estimate it with the bootstrap. First, an unseeded run just for fun:
boot.fn <- function(data, index) {
return(median(data[index]))
}
bootstrp.9E <- boot(Boston.data$medv, boot.fn, 1000)
bootstrp.9E
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston.data$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0342 0.3644972
set.seed(1)
boot.fn <- function(data, index) {
return(median(data[index]))
}
bootstrp.9F <- boot(Boston.data$medv, boot.fn, 5000)
bootstrp.9F
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston.data$medv, statistic = boot.fn, R = 5000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.01064 0.3716011
The estimated median is 21.2, with a bootstrap standard error of about 0.372 (in the units of medv, not a percentage).
(g) Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity μ̂0.1. (You can use the quantile() function.)
?quantile
tenth.medv <- quantile(Boston.data$medv, probs = c(0.1))
tenth.medv
## 10%
## 12.75
Tenth percentile of medv is 12.75.
set.seed(1) #Because why not
boot.fn <- function(data, index) {
return(quantile(data[index], c(0.1)))
}
bootstrp.9H <- boot(Boston.data$medv, boot.fn, 5000)
bootstrp.9H
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston.data$medv, statistic = boot.fn, R = 5000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.01523 0.4959675
The bootstrap estimate of the standard error of μ̂0.1 (the tenth percentile of medv in Boston suburbs) is about 0.496.
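Since boot() accepts a vector-valued statistic, the same machinery can bootstrap several quantiles in one run (a sketch; each ti* row of the output then corresponds to one quantile):
set.seed(1)
boot.fn.q <- function(data, index) { # hypothetical: 10th, 50th, and 90th percentiles at once
return(quantile(data[index], probs = c(0.1, 0.5, 0.9)))
}
boot(Boston.data$medv, boot.fn.q, 1000)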
Finally, commenting on the estimated standard errors obtained above using the glm() function and the bootstrap function: the two sets are approximately equal, with only minor differences after the decimal point.