k-fold cross-validation involves randomly dividing the set of observations into k groups (or folds) of approximately equal size. The first fold is treated as a validation set, and the model is fit on the remaining k - 1 folds, which serve as the training set. This process is repeated k times, each time treating a different fold as the validation set. The test error is then estimated by averaging the resulting k MSE estimates.
i. The validation set approach?
Advantages: The validation set approach is conceptually simple and easy to implement.
Disadvantages: The validation estimate of the test error can be highly variable, depending on which observations are included in the training and validation sets. Also, only a subset of the observations (the training set) is used to fit the model, so the validation set approach tends to overestimate the test error rate.
ii. LOOCV?
Advantages: LOOCV has less bias, since the statistical learning method is repeatedly fit on training sets containing n-1 observations, i.e. almost the entire data set. It also does not depend on a random split, so it produces a less variable estimate of the test MSE than the validation set approach.
Disadvantage: LOOCV is computationally intensive, since the model must be fit n times.
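For concreteness, here is a minimal sketch (our own illustration, not part of the original answer) of how k-fold CV could be run with cv.glm() from the boot package on the Default model analyzed below; the misclassification cost function at a 0.5 threshold is an assumption:
library(ISLR)
library(boot)
# cost: misclassification rate; r is the observed 0/1 response, pi the fitted probability
cost <- function(r, pi) mean(abs(r - pi) > 0.5)
fit.all <- glm(default ~ income + balance, data = Default, family = "binomial")
set.seed(1)
cv.glm(Default, fit.all, cost, K = 10)$delta[1]  # 10-fold CV estimate of the error rate
# cv.glm(Default, fit.all, cost)$delta[1]        # LOOCV (K = n); slow for n = 10,000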
library(ISLR)
str(Default)
## 'data.frame': 10000 obs. of 4 variables:
## $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ balance: num 730 817 1074 529 786 ...
## $ income : num 44362 12106 31767 35704 38463 ...
set.seed(1)
fit.glm <- glm(default ~ income + balance, data=Default, family="binomial")
summary(fit.glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
i. Split the sample set into a training set and a validation set.
set.seed(1)
tr_ind <- sample(nrow(Default), 0.8*nrow(Default), replace = F)
train <- Default[tr_ind,]
test <- Default[-tr_ind,]
ii. Fit a multiple logistic regression model using only the training observations.
fit.glm <- glm(default ~ income + balance, data=train, family="binomial")
summary(fit.glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4758 -0.1413 -0.0563 -0.0210 3.4620
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.168e+01 4.893e-01 -23.879 < 2e-16 ***
## income 2.547e-05 5.631e-06 4.523 6.1e-06 ***
## balance 5.613e-03 2.531e-04 22.176 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2313.6 on 7999 degrees of freedom
## Residual deviance: 1239.2 on 7997 degrees of freedom
## AIC: 1245.2
##
## Number of Fisher Scoring iterations: 8
iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
glm.prob = predict.glm(fit.glm, newdata = test, type = "response") #create a prediction
glm.pred = rep("No",length(glm.prob)) #use these for confusion Matrix or table
glm.pred[glm.prob > 0.5] = "Yes"
caret::confusionMatrix(as.factor(glm.pred), as.factor(test$default))
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1928 50
## Yes 2 20
##
## Accuracy : 0.974
## 95% CI : (0.966, 0.9805)
## No Information Rate : 0.965
## P-Value [Acc > NIR] : 0.01371
##
## Kappa : 0.4252
##
## Mcnemar's Test P-Value : 7.138e-11
##
## Sensitivity : 0.9990
## Specificity : 0.2857
## Pos Pred Value : 0.9747
## Neg Pred Value : 0.9091
## Prevalence : 0.9650
## Detection Rate : 0.9640
## Detection Prevalence : 0.9890
## Balanced Accuracy : 0.6423
##
## 'Positive' Class : No
##
iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
mean(glm.pred != test$default)
## [1] 0.026
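This matches the confusion matrix above: (50 + 2)/2000 = 0.026, i.e. a 2.6% validation error rate, consistent with the reported accuracy of 0.974.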
The error rate changes as we vary the split of the data, as the chunks below show: 2.67% at 70/30, 2.50% at 60/40, and 2.54% at 50/50, versus 2.60% for the 80/20 split above. All four estimates lie between 2.5% and 2.7%, illustrating how the validation error depends on the particular split.
set.seed(1)
tr_ind <- sample(nrow(Default), 0.7*nrow(Default), replace = F)
train <- Default[tr_ind,]
test <- Default[-tr_ind,]
fit.glm <- glm(default ~ income + balance, data=train, family="binomial")
glm.prob = predict.glm(fit.glm, newdata = test, type = "response")
glm.pred = rep("No",length(glm.prob))
glm.pred[glm.prob > 0.5] = "Yes"
mean(glm.pred != test$default)
## [1] 0.02666667
set.seed(1)
tr_ind <- sample(nrow(Default), 0.6*nrow(Default), replace = F)
train <- Default[tr_ind,]
test <- Default[-tr_ind,]
fit.glm <- glm(default ~ income + balance, data=train, family="binomial")
glm.prob = predict.glm(fit.glm, newdata = test, type = "response")
glm.pred = rep("No",length(glm.prob))
glm.pred[glm.prob > 0.5] = "Yes"
mean(glm.pred != test$default)
## [1] 0.025
set.seed(1)
tr_ind <- sample(nrow(Default), 0.5*nrow(Default), replace = F)
train <- Default[tr_ind,]
test <- Default[-tr_ind,]
fit.glm <- glm(default ~ income + balance, data=train, family="binomial")
glm.prob = predict.glm(fit.glm, newdata = test, type = "response")
glm.pred = rep("No",length(glm.prob))
glm.pred[glm.prob > 0.5] = "Yes"
mean(glm.pred != test$default)
## [1] 0.0254
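The three chunks above repeat the same steps; as a sketch, a small helper (our own refactoring, with the hypothetical name val_err) would make the split comparison concise and should reproduce the four error rates above:
val_err <- function(frac, seed = 1) {
  set.seed(seed)
  tr_ind <- sample(nrow(Default), frac * nrow(Default), replace = FALSE)
  fit <- glm(default ~ income + balance, data = Default[tr_ind, ], family = "binomial")
  prob <- predict(fit, newdata = Default[-tr_ind, ], type = "response")
  mean(ifelse(prob > 0.5, "Yes", "No") != Default$default[-tr_ind])
}
sapply(c(0.8, 0.7, 0.6, 0.5), val_err)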
Adding the dummy variable for student increases the test error rate, but only slightly: the new test error rate is 2.75%, versus 2.60% without it.
set.seed(1)
tr_ind <- sample(nrow(Default), 0.8*nrow(Default), replace = F)
train <- Default[tr_ind,]
test <- Default[-tr_ind,]
fit.glm <- glm(default ~ income + balance + student, data=train, family="binomial")
glm.prob = predict.glm(fit.glm, newdata = test, type = "response")
glm.pred = rep("No",length(glm.prob))
glm.pred[glm.prob > 0.5] = "Yes"
mean(glm.pred != test$default)
## [1] 0.0275
The estimated standard errors for the coefficients are 4.348e-01 (intercept), 4.985e-06 (income), and 2.274e-04 (balance).
fit.glm <- glm(default ~ income + balance, data=Default, family="binomial")
summary(fit.glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
library(boot)
boot.fn <- function(data, index) {
  # refit the logistic regression on the bootstrap sample given by index
  fit <- glm(default ~ income + balance, data = data, family = "binomial", subset = index)
  return(coef(fit))
}
boot(Default,boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.890536e-02 4.347841e-01
## t2* 2.080898e-05 1.566550e-07 4.864757e-06
## t3* 5.647103e-03 1.843880e-05 2.301132e-04
Compared to the output of the glm() function, the bootstrap gives the same coefficient estimates.
summary(fit.glm)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income 2.080898e-05 4.985167e-06 4.174178 2.990638e-05
## balance 5.647103e-03 2.273731e-04 24.836280 3.638120e-136
The estimated standard errors are also very similar: the bootstrap standard errors for the intercept and balance are slightly higher, while the one for income is slightly lower. Since the bootstrap does not rely on the model-based assumptions behind the formula standard errors from glm(), the close agreement is reassuring.
library(ISLR)
library(MASS)
u = mean(Boston$medv)
u
## [1] 22.53281
dim(Boston)
## [1] 506 14
se.u <- sd(Boston$medv)/sqrt(dim(Boston)[1])
se.u
## [1] 0.4088611
The estimated standard error of the sample mean is 0.4089.
set.seed(1)
boot.fn <- function(data, index) {
u <- mean(data[index])
return (u)
}
boot(Boston$medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
The bootstrap standard error (0.4107) is very close to the formula-based estimate from b) (0.4089).
#Confidence interval from t-test
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
#Approximate 95% confidence interval: mean +/- 2 standard errors
#(this uses se.u from b); the bootstrap SE of 0.4107 gives a nearly identical interval)
CI.boot.mean <- c(u - 2*se.u, u + 2*se.u)
CI.boot.mean
## [1] 21.71508 23.35053
The confidence intervals from the t-test and from the two-standard-error rule are very close to each other.
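As a side note, the boot package can also compute intervals directly from the bootstrap replicates with boot.ci(); a minimal sketch, reusing the boot.fn for the mean defined above (the choice of interval types is ours):
set.seed(1)
boot.out <- boot(Boston$medv, boot.fn, R = 1000)
boot.ci(boot.out, type = c("norm", "perc"))  # normal-approximation and percentile CIs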
med <- median(Boston$medv)
med
## [1] 21.2
set.seed(1)
boot.fn = function(data, index) {
med <- median(data[index])
return (med)
}
SE.boot.med = boot(Boston$medv, boot.fn, R=1000)
SE.boot.med
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.02295 0.3778075
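The bootstrap estimate of the median equals the sample median of 21.2, and its estimated standard error, about 0.38, is somewhat smaller than the bootstrap standard error of the mean (0.41).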
u0.1 <- quantile(Boston$medv,.1)
u0.1
## 10%
## 12.75
set.seed(1)
boot.fn = function(data, index) {
u0.1 <- quantile(data[index],c(0.1))
return (u0.1)
}
SE.boot.u0.1 = boot(Boston$medv, boot.fn, R=1000)
SE.boot.u0.1
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0339 0.4767526
The bootstrap estimate of the tenth percentile is also 12.75, matching the value in g), with a standard error of about 0.48, which is small relative to the estimate.