3 We now review k-fold cross-validation.
In k-fold cross-validation, the observations are randomly divided into k groups (folds) of roughly equal size. The first fold is held out as a validation set, the model is fit on the remaining k-1 folds, and the mean squared error (MSE) is computed on the held-out fold. This process is repeated k times, with a different fold held out each time, producing k estimates of the test MSE. These MSE values are then averaged to give the k-fold cross-validation estimate, CVk.
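For illustration, the procedure can be written out directly. This is only a minimal sketch, using the Auto data from the ISLR package purely as a stand-in example; the dataset, model, and variable names below are not part of this exercise.
library(ISLR)   # Auto data, used only for illustration
set.seed(1)
k <- 10
n <- nrow(Auto)
folds <- sample(rep(1:k, length.out = n))   # randomly assign each observation to one of k groups
cv.errors <- rep(0, k)
for (j in 1:k) {
  fit <- lm(mpg ~ horsepower, data = Auto[folds != j, ])   # fit on the other k-1 folds
  pred <- predict(fit, Auto[folds == j, ])                 # predict the held-out fold
  cv.errors[j] <- mean((Auto$mpg[folds == j] - pred)^2)    # MSE for this fold
}
mean(cv.errors)   # CVk: the average of the k MSE values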
The validation set approach? The validation set approach gives a test error estimate that fluctuates considerably compared to the k-fold cross-validation estimate, because the estimate depends on which observations happen to be randomly split into the training and validation sets. In addition, only a subset of the observations (typically half) is used to fit the model, so the validation set approach tends to overestimate the test error.
LOOCV? LOOCV can be problematic when n is large because it requires fitting the statistical learning method n times. Moreover, the n fitted models are trained on nearly identical sets of observations, so their errors are highly correlated and the averaged test error estimate has relatively high variance. K-fold cross-validation with an intermediate value of k (such as 5 or 10) requires only k fits and yields a test error estimate with lower variance, at the cost of somewhat more bias.
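The same trade-off can be seen with the cv.glm() function from the boot package, which performs LOOCV by default (K = n) and k-fold CV when K is specified. Again, this is only a sketch using the Auto data as a stand-in, not part of the exercise.
library(ISLR)   # Auto data, used only for illustration
library(boot)
glm.fit <- glm(mpg ~ horsepower, data = Auto)   # cv.glm() requires a glm() fit
cv.glm(Auto, glm.fit)$delta[1]                  # K defaults to n: LOOCV, n model fits
cv.glm(Auto, glm.fit, K = 10)$delta[1]          # 10-fold CV: only 10 model fits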
5 In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
attach(Default)
set.seed(8)
default.fit<-glm(default~income+balance, data=Default, family="binomial")
summary(default.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
train.default<-sample(dim(Default)[1],dim(Default)[1]/2)   # randomly select half of the observations as the training set
spl.default<-glm(default~income+balance, data=Default, family="binomial", subset = train.default)   # fit the model on the training half only
default.probs=predict(spl.default,Default[-train.default,], type="response")   # predicted default probabilities for the validation half
spldefault.pred=rep("No",length(default.probs))
spldefault.pred[default.probs>0.5]="Yes"   # classify as "Yes" when the predicted probability exceeds 0.5
mean(spldefault.pred != Default[-train.default, ]$default)   # validation set error rate
## [1] 0.0272
Validation set approach test error rate = 2.72%.
train.default<-sample(dim(Default)[1],dim(Default)[1]/2)
spl.default<-glm(default~income+balance, data=Default, family="binomial", subset = train.default)
default.probs=predict(spl.default,Default[-train.default,], type="response")
spldefault.pred=rep("No",length(default.probs))
spldefault.pred[default.probs>0.5]="Yes"
mean(spldefault.pred != Default[-train.default, ]$default)
## [1] 0.0258
train.default<-sample(dim(Default)[1],dim(Default)[1]/2)
spl.default<-glm(default~income+balance, data=Default, family="binomial", subset = train.default)
default.probs=predict(spl.default,Default[-train.default,], type="response")
spldefault.pred=rep("No",length(default.probs))
spldefault.pred[default.probs>0.5]="Yes"
mean(spldefault.pred != Default[-train.default, ]$default)
## [1] 0.0288
train.default<-sample(dim(Default)[1],dim(Default)[1]/2)
spl.default<-glm(default~income+balance, data=Default, family="binomial", subset = train.default)
default.probs=predict(spl.default,Default[-train.default,], type="response")
spldefault.pred=rep("No",length(default.probs))
spldefault.pred[default.probs>0.5]="Yes"
mean(spldefault.pred != Default[-train.default, ]$default)
## [1] 0.0268
Validation set approach test error rates = 2.58%, 2.88%, and 2.68%, each slightly different because of the random split of observations into training and validation sets.
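Rather than copying the chunk repeatedly, the split-and-evaluate step can be wrapped in a function and repeated. This is only a sketch; the function name val.err and its internals are ours, not part of the exercise.
val.err <- function() {
  train <- sample(nrow(Default), nrow(Default) / 2)   # fresh random split
  fit <- glm(default ~ income + balance, data = Default,
             family = "binomial", subset = train)
  probs <- predict(fit, Default[-train, ], type = "response")
  pred <- ifelse(probs > 0.5, "Yes", "No")
  mean(pred != Default$default[-train])               # validation error rate
}
replicate(3, val.err())   # three new splits; the values vary with the random seed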
train.default<-sample(dim(Default)[1],dim(Default)[1]/2)
spl.default<-glm(default~income+balance+student, data=Default, family="binomial", subset = train.default)
default.probs=predict(spl.default,Default[-train.default,], type="response")
spldefault.pred=rep("No",length(default.probs))
spldefault.pred[default.probs>0.5]="Yes"
mean(spldefault.pred != Default[-train.default, ]$default)
## [1] 0.027
There is no need to create a separate dummy variable because student is already a binary categorical variable, which glm() encodes automatically. Adding student to the model does not appear to reduce the test error rate: the resulting 2.7% falls within the range of the previous estimates.
6 We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.
library(boot)
set.seed(12)
default.fit<-glm(default~income+balance, data=Default, family="binomial")
summary(default.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Estimated standard errors from glm(): intercept (β0) = 0.4348; income (β1) = 4.985e−06; balance (β2) = 2.274e−04.
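These values can also be pulled directly from the fitted model object; the line below is just a convenience for extracting them, not an additional analysis.
summary(default.fit)$coefficients[, "Std. Error"]   # formula-based standard errors from glm()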
boot.fn <- function(data, index){
  def.fit<-glm(default~income+balance, data=data, family="binomial", subset=index)   # refit the model on the bootstrap sample indexed by 'index'
  return(coef(def.fit))}   # return the coefficient estimates
boot(Default,boot.fn,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -8.387469e-03 4.484914e-01
## t2* 2.080898e-05 -1.550111e-07 5.010512e-06
## t3* 5.647103e-03 6.338097e-06 2.313711e-04
detach(Default)
Bootstrap estimates of the coefficient standard errors: intercept = 0.4485; income = 5.0105e−06; balance = 2.3137e−04.
The bootstrap standard error estimates are slightly larger than the standard errors reported by glm(), but the two approaches agree closely.
9 We will now consider the Boston housing data set, from the MASS library.
library(MASS)
attach(Boston)
mu.hat <- mean(medv)
mu.hat
## [1] 22.53281
Estimate of the population mean of medv: μ̂ = 22.5328.
mu.hat.err= sd(medv)/sqrt(length(medv))   # SE of the sample mean: s / sqrt(n)
mu.hat.err
## [1] 0.4088611
Estimated standard error of μ̂ = 0.4089, which is quite small relative to the mean of medv (22.5328).
bootmu.fn<- function(data, index){
mu<-mean(data[index])
return(mu)}
boot(medv, bootmu.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = bootmu.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.005427866 0.3967152
The bootstrap estimate of the standard error of μ̂ is 0.3967, close to the formula-based estimate of 0.4089 computed above.
t.test(medv)
##
## One Sample t-test
##
## data: medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
CI.muhat<-c(22.53281 - 2 * 0.3967, 22.53281 + 2 * 0.3967)   # approximate 95% CI: mu.hat +/- 2 * bootstrap SE
CI.muhat
## [1] 21.73941 23.32621
Using the bootstrap estimate of the standard error, the approximate 95% confidence interval μ̂ ± 2·SE(μ̂) is [21.739, 23.326], very close to the 95% confidence interval from the t-test; the endpoints differ by only about 0.01.
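Alternatively, boot.ci() from the boot package constructs confidence intervals directly from a stored boot object. The object name boot.mu below is ours, since the earlier boot() call was not saved, and no output is shown because it depends on the bootstrap run.
boot.mu <- boot(medv, bootmu.fn, 1000)   # rerun the bootstrap and keep the result
boot.ci(boot.mu, type = "norm")          # normal-approximation 95% CI based on the bootstrap SE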
mu.med<-median(medv)
mu.med
## [1] 21.2
bootmed.fn <-function(data,index){
med.mu<-(median(data[index]))
return(med.mu)}
boot(medv,bootmed.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = bootmed.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.02815 0.3773028
The bootstrap estimate of the median equals the sample median, 21.2. The bootstrap estimate of the standard error of the median is 0.3773, which is small relative to the median itself.
mu.01 <- quantile(medv, c(0.1))
mu.01
## 10%
## 12.75
The estimated tenth percentile of medv is μ̂0.1 = 12.75.
boot01.fn<-function(data,index){
mu01<-quantile(data[index],c(0.1))
return(mu01)
}
boot(medv,boot01.fn,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot01.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0326 0.4849664
The bootstrap estimate of the tenth percentile is 12.75 (the same value as in part (g)), with an estimated standard error of 0.4850, which is small relative to the percentile value itself.