K-fold cross validation is implemented by splitting the data set into k segments of equal length. K models are developed leaving out each one of the folds. Then, the test error is calculated for each model, and averaged to create a test error for the data set.
The advantages of k-fold cross validation compared to the validation approach are that it is less susceptible to randomness caused by the seed and is less susceptible to overestimating the test error. The disadvantage is the extra computation power required to run k number of models.
The advantages of k-fold cross validation compared to LOOCV is the reduced computational power required (k models instead of n models) and lower variance. The disadvantage is the increased bias.
Default = Default
Default$default = ifelse(Default$default == 'Yes', 1, 0)
logit = glm(default ~ income + balance, data = Default, family = 'binomial')
set.seed(1)
train = sample(nrow(Default), .7 * nrow(Default))
cv.logit = glm(default ~ income + balance, data = Default, subset = train, family = 'binomial')
cv.probs = predict(cv.logit, Default[-train, ])
cv.preds = ifelse(cv.probs > .5, 1, 0)
table(Default[-train,]$default, cv.preds)
## cv.preds
## 0 1
## 0 2897 1
## 1 89 13
val.err = 1 - mean(Default[-train,]$default == cv.preds)
print(val.err)
## [1] 0.03
Repeating the process for seed 2-4 results in test errors ranging from .0227 to .03. This is a ~30% variance which does not give me confidence in what the actual test error is.
set.seed(2)
train = sample(nrow(Default), .7 * nrow(Default))
cv.logit = glm(default ~ income + balance, data = Default, subset = train, family = 'binomial')
cv.probs = predict(cv.logit, Default[-train, ])
cv.preds = ifelse(cv.probs > .5, 1, 0)
table(Default[-train,]$default, cv.preds)
## cv.preds
## 0 1
## 0 2900 5
## 1 67 28
val.err = 1 - mean(Default[-train,]$default == cv.preds)
print(val.err)
## [1] 0.024
set.seed(3)
train = sample(nrow(Default), .7 * nrow(Default))
cv.logit = glm(default ~ income + balance, data = Default, subset = train, family = 'binomial')
cv.probs = predict(cv.logit, Default[-train, ])
cv.preds = ifelse(cv.probs > .5, 1, 0)
table(Default[-train,]$default, cv.preds)
## cv.preds
## 0 1
## 0 2910 7
## 1 68 15
val.err = 1 - mean(Default[-train,]$default == cv.preds)
print(val.err)
## [1] 0.025
set.seed(4)
train = sample(nrow(Default), .7 * nrow(Default))
cv.logit = glm(default ~ income + balance, data = Default, subset = train, family = 'binomial')
cv.probs = predict(cv.logit, Default[-train, ])
cv.preds = ifelse(cv.probs > .5, 1, 0)
table(Default[-train,]$default, cv.preds)
## cv.preds
## 0 1
## 0 2906 8
## 1 60 26
val.err = 1 - mean(Default[-train,]$default == cv.preds)
print(val.err)
## [1] 0.02266667
Adding the student predictor decreased model performance if you compare to identical seeds! I recommend leaving it out of the model.
set.seed(1)
train = sample(nrow(Default), .7 * nrow(Default))
cv.logit = glm(default ~ income + balance + student, data = Default, subset = train, family = 'binomial')
cv.probs = predict(cv.logit, Default[-train, ])
cv.preds = ifelse(cv.probs > .5, 1, 0)
table(Default[-train,]$default, cv.preds)
## cv.preds
## 0 1
## 0 2897 1
## 1 91 11
val.err = 1 - mean(Default[-train,]$default == cv.preds)
print(val.err)
## [1] 0.03066667
The standard errors for income and balance are 4.985 x 10^-6 and 2.274 x 10^-4, respectively.
logit = glm(default ~ income + balance, data = Default, family = 'binomial')
summary(logit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
income.std.err = summary(logit)$coefficients[2,2]
balance.std.err = summary(logit)$coefficients[3,2]
set.seed(1)
boot.fn = function(data, index) + return(coef(glm(default ~ income + balance, data = data, subset = index, family = 'binomial')))
boot(Default, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2* 2.080898e-05 1.680317e-07 4.866284e-06
## t3* 5.647103e-03 1.855765e-05 2.298949e-04
The standard errors calculated by the glm and the bootstrap function are very similar. With a standard error greater than the coefficient estimate, we can assume that Income does not effect whether or not an individual will Default. But, Balance’s standard error is an order of magnitude less than the coefficent, so we should keep that predictor as for whether or not an individual might Default.
The mean Median Value of owner-occupied homes is $225,330.
boston = Boston
mu = mean(Boston$medv)
mu
## [1] 22.53281
The standard error is .409. This means I am 95% confident the true population mean is between $217,314 and $233,346.
se.mu = sd(Boston[['medv']]) / (nrow(Boston)^.5)
se.mu
## [1] 0.4088611
The standard error calculated using bootstrap is the same as the one calculated using the entire sample set.
set.seed(1)
se.fn = function(data,index){
X = data$medv[index]
return(sd(X) / (length(X) ^ .5))
}
boot(boston, se.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = boston, statistic = se.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.4088611 -0.001898635 0.01735132
Using the bootstrap standard error of .409 we calculate the same confidence interval as in Part B. I am 95% confident the true population mean is between $217,314 and $233,346. This is nearly identical to the confidence interval calculated using a t-test on the full sample.
t.test(boston$medv)
##
## One Sample t-test
##
## data: boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
The median Median Home Value for Owner-Occupied homes is $212,000.
mu.med = median(boston$medv)
mu.med
## [1] 21.2
The bootstrap estimate for standard error is .377. This means we are 95% confident the population median is between $204,611 and $219,389.
set.seed(1)
med.fn = function(data,index){
X = data$medv[index]
return(median(X))
}
boot(boston, med.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = boston, statistic = med.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.02295 0.3778075
The top ten percentile has a median home value of $348,000.
mu.10 = quantile(boston$medv, probs = .9)
mu.10
## 90%
## 34.8
The bootstrap estimate for the top 10 percentile is 1.148. This means we are 95% confident the population top 10 percentile is between $325,499 and $370,501.
set.seed(1)
se.10.fn = function(data,index){
X = data$medv[index]
return(quantile(X, probs = .9))
}
boot(boston, se.10.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = boston, statistic = se.10.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 34.8 -0.40475 1.14822