“It involves randomly dividing the available set of observations into two parts, a training set and a validation set or hold-out set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation set error rate (typically assessed using MSE in the case of a quantitative response) provides an estimate of the test error rate.”
Easy to use and implement
1. The validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set.
2. In the validation approach, only a subset of the observations (those that are included in the training set rather than in the validation set) are used to fit the model. Since statistical methods tend to perform worse when trained on fewer observations, this suggests that the validation set error rate may tend to overestimate the test error rate for the model fit on the entire data set.
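The first drawback can be illustrated with a minimal sketch. It assumes simulated data rather than any data set used in this homework; the data frame `sim`, the helper `validation_mse`, and the quadratic model are all hypothetical choices made only for illustration.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
n <- 200
x <- runif(n, -2, 2)
y <- 1 + 2 * x - x^2 + rnorm(n)   # simulated (hypothetical) response
sim <- data.frame(x = x, y = y)

# One validation-set estimate of the test MSE from a random half/half split
validation_mse <- function(data) {
  train_id <- sample(nrow(data), nrow(data) / 2)
  fit <- lm(y ~ poly(x, 2), data = data[train_id, ])
  pred <- predict(fit, newdata = data[-train_id, ])
  mean((data$y[-train_id] - pred)^2)
}

# Repeat the split ten times: same data, same model, different estimates
mse <- replicate(10, validation_mse(sim))
range(mse)
```

The spread of `mse` comes only from which observations land in each half, which is the variability described in point 1.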
“First, it has far less bias. In LOOCV, we repeatedly fit the statistical learning method using training sets that contain n − 1 observations, almost as many as are in the entire data set. This is in contrast to the validation set approach, in which the training set is typically around half the size of the original data set. Consequently, the LOOCV approach tends not to overestimate the test error rate as much as the validation set approach does. Second, in contrast to the validation approach which will yield different results when applied repeatedly due to randomness in the training/validation set splits, performing LOOCV multiple times will always yield the same results: there is no randomness in the training/validation set splits”
“LOOCV has the potential to be expensive to implement, since the model has to be fit n times. This can be very time consuming if n is large, and if each individual model is slow to fit. With least squares linear or polynomial regression, an amazing shortcut makes the cost of LOOCV the same as that of a single model fit!”
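For least squares, the shortcut mentioned in the quotation is $\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i-\hat{y}_i}{1-h_i}\right)^2$, where $h_i$ is the leverage of observation $i$. The sketch below checks it against the brute-force loop. It assumes the built-in `mtcars` data and the model `mpg ~ wt` purely as a stand-in; neither appears in this homework.

```r
# Stand-in data and model (assumption): built-in mtcars, simple linear regression
fit <- lm(mpg ~ wt, data = mtcars)

# Shortcut: one fit, residuals rescaled by the leverages h_i
h <- hatvalues(fit)
cv_shortcut <- mean((residuals(fit) / (1 - h))^2)

# Brute force: refit n times, each time leaving one observation out
n <- nrow(mtcars)
errs <- numeric(n)
for (i in seq_len(n)) {
  fit_i <- lm(mpg ~ wt, data = mtcars[-i, ])
  errs[i] <- mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ])
}
cv_brute <- mean(errs^2)

# The two estimates agree up to floating-point error
all.equal(cv_shortcut, cv_brute)
```

The identity holds for least squares fits only, so it does not apply to the logistic regressions fitted with `glm` in this homework.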
require(ISLR)
## Loading required package: ISLR
## Warning: package 'ISLR' was built under R version 4.3.2
getwd()
## [1] "C:/Masters/Predictive Modeling/hw3"
setwd("C:/Users/monee/OneDrive/Pictures/Documents")
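# Note: this copy is not used below; the models are fit on the Default data frame from the ISLR package.
auto_data<-read.csv("Default.csv",header=TRUE)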
glm.fit <- glm(default ~ income + balance, data=Default, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
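set.seed(33)
# Randomly assign half of the 10,000 observations to the training set
train_id <- sample(10000, 5000)
train <- Default[train_id, ]
# The remaining 5,000 observations form the validation set
test <- Default[-train_id, ]
test.Default <- test$default
# Fit the logistic regression on the training half only
glm.fit.train <- glm(default ~ income + balance, data=train, family = binomial)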
summary(glm.fit.train)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = train)
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.156e+01  6.180e-01 -18.702  < 2e-16 ***
## income       2.319e-05  6.988e-06   3.319 0.000903 ***
## balance      5.660e-03  3.213e-04  17.617  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1483.83 on 4999 degrees of freedom
## Residual deviance: 819.08 on 4997 degrees of freedom
## AIC: 825.08
##
## Number of Fisher Scoring iterations: 8
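# Predicted probability of default for each validation observation
glm.fit.prob <- predict(glm.fit.train, test, type = "response")
# Classify as a default when the predicted probability exceeds 0.5
glm.fit.pred <- ifelse(glm.fit.prob>0.5, "Yes", "No")
# Confusion matrix: predictions (rows) against true status (columns)
table(glm.fit.pred, test.Default)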
##             test.Default
## glm.fit.pred   No  Yes
##          No  4814  105
##          Yes   23   58
# Validation set error rate: fraction of validation observations misclassified
mean(glm.fit.pred != test.Default)
## [1] 0.0256
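The error rate printed above can also be recovered by hand from the confusion matrix, which makes it easier to see where the errors come from. The sketch below copies the four counts from the table above; the names `cm`, `error_rate`, `sensitivity`, and `null_rate` are illustrative and not part of the original analysis.

```r
# Counts copied from the confusion matrix above (rows: prediction, columns: truth)
cm <- matrix(c(4814, 23, 105, 58), nrow = 2,
             dimnames = list(pred = c("No", "Yes"), truth = c("No", "Yes")))

# Overall error rate: off-diagonal counts over the total, (105 + 23) / 5000
error_rate <- (cm["No", "Yes"] + cm["Yes", "No"]) / sum(cm)

# Share of true defaulters that the model flags, 58 / 163
sensitivity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])

# Error rate of a trivial rule that always predicts "No", 163 / 5000
null_rate <- sum(cm[, "Yes"]) / sum(cm)

c(error_rate = error_rate, sensitivity = sensitivity, null_rate = null_rate)
```

Here `error_rate` is 0.0256, matching the output above, while the always-"No" rule would give 0.0326 and only about 36% of the true defaulters are detected at the 0.5 threshold.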
Sample 1:
set.seed(3)
train_id <- sample(10000, 5000)
train <- Default[train_id, ]
test <- Default[-train_id, ]
test.Default <- test$default
glm.fit.1 <- glm(default ~ income + balance, data=train, family = binomial)
glm.fit.prob <- predict(glm.fit.1, test, type = "response")
glm.fit.pred <- ifelse(glm.fit.prob>0.5, "Yes", "No")
table(glm.fit.pred, test.Default)
##             test.Default
## glm.fit.pred   No  Yes
##          No  4822  109
##          Yes   23   46
mean(glm.fit.pred != test.Default)
## [1] 0.0264
Sample 2:
set.seed(2)
train_id <- sample(10000, 5000)
train <- Default[train_id, ]
test <- Default[-train_id, ]
test.Default <- test$default
glm.fit.2 <- glm(default ~ income + balance, data=train, family = binomial)
glm.fit.prob <- predict(glm.fit.2, test, type = "response")
glm.fit.pred <- ifelse(glm.fit.prob>0.5, "Yes", "No")
table(glm.fit.pred, test.Default)
##             test.Default
## glm.fit.pred   No  Yes
##          No  4819  101
##          Yes   18   62
mean(glm.fit.pred != test.Default)
## [1] 0.0238
Sample 3:
set.seed(1)
train_id <- sample(10000, 5000)
train <- Default[train_id, ]
test <- Default[-train_id, ]
test.Default <- test$default
glm.fit.3 <- glm(default ~ income + balance, data=train, family = binomial)
glm.fit.prob <- predict(glm.fit.3, test, type = "response")
glm.fit.pred <- ifelse(glm.fit.prob>0.5, "Yes", "No")
table(glm.fit.pred, test.Default)
##             test.Default
## glm.fit.pred   No  Yes
##          No  4824  108
##          Yes   19   49
mean(glm.fit.pred != test.Default)
## [1] 0.0254
set.seed(6)
train_id <- sample(10000, 5000)
train <- Default[train_id, ]
test <- Default[-train_id, ]
test.Default <- test$default
# Same model with the student indicator added as a predictor
glm.fit.4 <- glm(default ~ income + balance + student, data = train, family = binomial)
glm.fit.prob <- predict(glm.fit.4, test, type = "response")
glm.fit.pred <- ifelse(glm.fit.prob>0.5, "Yes", "No")
table(glm.fit.pred, test.Default)
##             test.Default
## glm.fit.pred   No  Yes
##          No  4825  109
##          Yes   21   45
mean(glm.fit.pred != test.Default)
## [1] 0.026
glm.fit.5 <- glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit.5)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
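# Statistic for boot(): coefficient estimates from the logistic regression
# refit on the bootstrap sample selected by index
boot.fn <- function(data, index) {
  return(coef(glm(default ~ income + balance, data = data, family = binomial,
                  subset = index)))
}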
library(boot)
set.seed(1)
boot(Default, boot.fn, R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
##          original          bias   std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2*  2.080898e-05  1.680317e-07 4.866284e-06
## t3*  5.647103e-03  1.855765e-05 2.298949e-04
require(MASS)
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 4.3.1
getwd()
## [1] "C:/Masters/Predictive Modeling/hw3"
setwd("C:/Users/monee/OneDrive/Pictures/Documents")
Boston<-read.csv("Boston.csv",header=TRUE)
u.hat <- mean(Boston$medv)
u.hat
## [1] 22.53281
# Standard error of the sample mean: s / sqrt(n)
sd(Boston$medv)/sqrt(nrow(Boston))
## [1] 0.4088611
# Statistic for boot(): mean of the resampled observations
boot.fn <- function(data, index) {
  return(mean(data[index]))
}
set.seed(5)
boot(Boston$medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
##     original       bias std. error
## t1* 22.53281 0.01637984  0.4096144
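To make explicit what `boot()` is doing for the mean, here is a minimal hand-rolled version. It assumes a simulated vector `x` as a stand-in for `Boston$medv` (the mean and standard deviation passed to `rnorm` are arbitrary), and the names `B`, `boot_means`, `se_formula`, and `se_boot` are illustrative.

```r
set.seed(1)                            # arbitrary seed, for reproducibility only
x <- rnorm(500, mean = 22, sd = 9)     # simulated stand-in, not the Boston data

# Formula-based standard error of the mean: s / sqrt(n)
se_formula <- sd(x) / sqrt(length(x))

# Bootstrap: resample x with replacement B times and recompute the mean each time
B <- 1000
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))

# The standard deviation of the replicates estimates the standard error
se_boot <- sd(boot_means)

c(formula = se_formula, bootstrap = se_boot)
```

The two values should be close, in the same way that the bootstrap standard error above (0.4096) is close to the formula value (0.4089).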
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
Hint: You can approximate a 95% confidence interval using the formula $[\hat{\mu} - 2\,\mathrm{SE}(\hat{\mu}),\ \hat{\mu} + 2\,\mathrm{SE}(\hat{\mu})]$.
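# Note: the SE used here (0.412335) is hard-coded and differs slightly from the bootstrap SE printed above (0.4096144).
u.hat - 2*0.412335 ; u.hat + 2*0.412335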
## [1] 21.70814
## [1] 23.35748
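The interval from the hint is slightly wider than the `t.test` interval because it uses a multiplier of 2 where the t-based interval uses a quantile just below 2. A small sketch of that comparison follows, assuming a simulated vector `x` in place of `Boston$medv` and the formula-based standard error; all names are illustrative.

```r
set.seed(1)                              # arbitrary seed, for reproducibility only
x <- rnorm(506, mean = 22.5, sd = 9.2)   # simulated stand-in, not the Boston data

mu_hat <- mean(x)
se_hat <- sd(x) / sqrt(length(x))

# Approximate 95% interval from the hint: estimate plus or minus two standard errors
ci_2se <- c(lower = mu_hat - 2 * se_hat, upper = mu_hat + 2 * se_hat)

# Exact t-based 95% interval, as reported by t.test()
ci_t <- t.test(x)$conf.int

# Multiplier used by the t interval; slightly below 2 for this sample size
t_mult <- qt(0.975, df = length(x) - 1)

rbind(two_se = ci_2se, t_test = as.numeric(ci_t))
```

With roughly 500 observations the two intervals differ only in the second decimal place, which matches the pattern in the output above.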
med.hat <- median(Boston$medv)
med.hat
## [1] 21.2
# Statistic for boot(): median of the resampled observations
boot.fn.med <- function(data, index) {
  return(median(data[index]))
}
set.seed(5)
boot(Boston$medv, boot.fn.med, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn.med, R = 1000)
##
##
## Bootstrap Statistics :
##     original    bias std. error
## t1*     21.2 -0.0026  0.3847418
u.0.1 <- quantile(Boston$medv, 0.1)
u.0.1
## 10%
## 12.75
# Statistic for boot(): 10th percentile of the resampled observations
boot.fn.q <- function(data, index) {
  u.0.1 <- quantile(data[index], c(0.1))
  return(u.0.1)
}
set.seed(5)
boot(Boston$medv, boot.fn.q, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn.q, R = 1000)
##
##
## Bootstrap Statistics :
##     original    bias std. error
## t1*    12.75 0.01505  0.4911489
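The estimated 10th percentile of medv is 12.75, with a bootstrap standard error of about 0.49. This standard error is in the same units as medv, not a percentage, and it is small relative to the estimate.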