This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, the method is fit on the remaining k − 1 folds, and the mean squared error, MSE1, is computed on the observations in the held-out fold. This procedure is repeated k times, with a different fold treated as the validation set each time, and the k resulting MSE estimates are averaged to obtain the k-fold estimate of the test error.
k-fold CV is more complex than the validation set approach, but its error estimates have much lower variability.
k-fold CV with k < n also has a computational advantage over LOOCV, and it often gives more accurate estimates of the test error rate than LOOCV does, because the reduction in variance outweighs the slight increase in bias.
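As a concrete illustration, here is a minimal sketch of 10-fold CV for the logistic model used below, via cv.glm() from the boot package; the choice K = 10 and the 0.5-cutoff cost function are my own (the cost function follows the example in ?cv.glm):
library(ISLR)   # Default data set
library(boot)   # cv.glm()
set.seed(1)
fit = glm(default ~ balance + income, data = Default, family = binomial)
# Misclassification rate at a 0.5 cutoff (see ?cv.glm for this cost form)
cost = function(r, pi) mean(abs(r - pi) > 0.5)
cv.glm(Default, fit, cost = cost, K = 10)$delta[1]   # 10-fold estimate of the test error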
library(ISLR)   # Default data set
library(MASS)   # Boston data set
library(boot)   # boot() for the bootstrap
set.seed(752)
summary(Default)
##  default    student       balance           income
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340
##                        Median : 823.6   Median :34553
##                        Mean   : 835.4   Mean   :33517
##                        3rd Qu.:1166.3   3rd Qu.:43808
##                        Max.   :2654.3   Max.   :73554
glm.fit = glm(default ~ balance + income, data = Default, family = binomial)
glm.fit
##
## Call: glm(formula = default ~ balance + income, family = binomial,
## data = Default)
##
## Coefficients:
## (Intercept)      balance       income
##  -1.154e+01    5.647e-03    2.081e-05
##
## Degrees of Freedom: 9999 Total (i.e. Null); 9997 Residual
## Null Deviance: 2921
## Residual Deviance: 1579 AIC: 1585
# Split the observations into equal halves: training and validation
train = sample(10000, 5000)
# Fit the logistic regression on the training half only
glm.fit = glm(default ~ balance + income, data = Default, family = binomial, subset = train)
# Predict for the held-out half and classify at a 0.5 cutoff
glm.probs = predict(glm.fit, Default[-train,], type = 'response')
glm.pred = rep("No", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Yes"
# Validation-set estimate of the test error rate
mean(glm.pred != Default[-train,]$default)
## [1] 0.025
# i. Split the observations into a training set and a validation set
train = sample(10000, 5000)
# ii. Fit the logistic regression using only the training observations
glm.fit = glm(default ~ balance + income, data = Default, family = binomial, subset = train)
# iii. Classify each validation observation as "Yes" if its predicted probability exceeds 0.5
glm.probs = predict(glm.fit, Default[-train,], type = 'response')
glm.pred = rep("No", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Yes"
# iv. Compute the validation-set error rate
mean(glm.pred != Default[-train,]$default)
## [1] 0.0272
# i. Split the observations into a training set and a validation set
train = sample(10000, 5000)
# ii. Fit the logistic regression using only the training observations
glm.fit = glm(default ~ balance + income, data = Default, family = binomial, subset = train)
# iii. Classify each validation observation as "Yes" if its predicted probability exceeds 0.5
glm.probs = predict(glm.fit, Default[-train,], type = 'response')
glm.pred = rep("No", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Yes"
# iv. Compute the validation-set error rate
mean(glm.pred != Default[-train,]$default)
## [1] 0.0246
# i. Split the observations into a training set and a validation set
train = sample(10000, 5000)
# ii. Fit the logistic regression using only the training observations
glm.fit = glm(default ~ balance + income, data = Default, family = binomial, subset = train)
# iii. Classify each validation observation as "Yes" if its predicted probability exceeds 0.5
glm.probs = predict(glm.fit, Default[-train,], type = 'response')
glm.pred = rep("No", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Yes"
# iv. Compute the validation-set error rate
mean(glm.pred != Default[-train,]$default)
## [1] 0.0252
All three splits give a test error of around 2.5%, with small variation from split to split.
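Rather than repeating the four steps by hand, the split-fit-predict-score sequence can be wrapped in a small helper and replicated; val.err() is a hypothetical name of my own, not part of the exercise code:
# Hypothetical helper (not part of the exercise code): one validation-set
# estimate of the test error for a given model formula.
val.err = function(formula) {
  train = sample(10000, 5000)
  fit = glm(formula, data = Default, family = binomial, subset = train)
  probs = predict(fit, Default[-train,], type = 'response')
  pred = ifelse(probs > 0.5, "Yes", "No")
  mean(pred != Default[-train,]$default)
}
replicate(3, val.err(default ~ balance + income))   # three fresh splits, as in steps i-iv above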
# i. Split the observations into a training set and a validation set
train = sample(10000, 5000)
# ii. Fit the logistic regression, now including student as a predictor
glm.fit = glm(default ~ balance + income + student, data = Default, family = binomial, subset = train)
# iii. Classify each validation observation as "Yes" if its predicted probability exceeds 0.5
glm.probs = predict(glm.fit, Default[-train,], type = 'response')
glm.pred = rep("No", length(glm.probs))
glm.pred[glm.probs > 0.5] = "Yes"
# iv. Compute the validation-set error rate
mean(glm.pred != Default[-train,]$default)
## [1] 0.0274
With student included as a predictor, the test error on this split is around 2.7%. If we run it a few more times, though, we see it averages closer to 2.5%, so including student does not appear to reduce the test error; the sketch below averages both models over many splits to check this.
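A sketch reusing the hypothetical val.err() helper from above; the number of splits, 50, is an arbitrary choice of mine:
# Average each model's validation error over 50 fresh splits
mean(replicate(50, val.err(default ~ balance + income)))
mean(replicate(50, val.err(default ~ balance + income + student)))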
glm.fit = glm(default ~ balance + income, data = Default, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = default ~ balance + income, family = binomial,
## data = Default)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
# Statistic for boot(): coefficient estimates fit to the observations indexed by i
boot.fn = function(dataset, i)
  return(coef(glm(default ~ balance + income, data = dataset, family = binomial, subset = i)))
boot(Default, boot.fn, 100)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.861562e-02 4.281434e-01
## t2* 5.647103e-03 1.241182e-05 2.336970e-04
## t3* 2.080898e-05 4.518311e-07 4.416986e-06
The bootstrap standard error estimates look roughly the same as the formula-based estimates from parts A & B.
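To make the comparison explicit, the two sets of standard errors can be printed side by side; a small sketch, where boot.out is my own name and, being a fresh boot() run, the numbers will differ slightly from the output above:
# Formula-based vs. bootstrap standard errors for the three coefficients
boot.out = boot(Default, boot.fn, 100)
cbind(formula = summary(glm.fit)$coefficients[, "Std. Error"],
      bootstrap = apply(boot.out$t, 2, sd))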
summary(Boston)
##       crim                zn             indus            chas
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000
##       nox               rm             age              dis
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127
##       rad              tax           ptratio          black
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90
##      lstat            medv
##  Min.   : 1.73   Min.   : 5.00
##  1st Qu.: 6.95   1st Qu.:17.02
##  Median :11.36   Median :21.20
##  Mean   :12.65   Mean   :22.53
##  3rd Qu.:16.95   3rd Qu.:25.00
##  Max.   :37.97   Max.   :50.00
medv.mean = mean(Boston$medv)
medv.mean
## [1] 22.53281
medv.stderr = sd(Boston$medv)/sqrt(length(Boston$medv))
medv.stderr
## [1] 0.4088611
# Statistic for boot(): the sample mean of the observations indexed by i
boot.fn = function(dataset, i)
  return(mean(dataset[i]))
medv.boot = boot(Boston$medv, boot.fn, 100)
medv.boot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 -0.03045059 0.403499
Parts B and C give very close standard errors: 0.409 from the formula and 0.403 from the bootstrap.
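The same comparison in code, recomputing the bootstrap standard error from the stored replicates (the column names are my own):
# Formula-based SE (part B) next to the bootstrap SE (part C)
c(formula = medv.stderr, bootstrap = sd(medv.boot$t[, 1]))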
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
# Approximate 95% CI: bootstrap estimate plus/minus two standard errors
medv.boot$t0 - 2 * .403
## [1] 21.72681
medv.boot$t0 + 2 * .403
## [1] 23.33881
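This rough two-standard-error interval is nearly identical to the t.test() interval above. The boot package can also produce the interval directly; a sketch applying boot.ci() with the normal approximation to the same medv.boot object:
# Normal-approximation bootstrap confidence interval for the mean of medv
boot.ci(medv.boot, type = "norm")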
medv.med = median(Boston$medv)
medv.med
## [1] 21.2
# Statistic for boot(): the sample median of the observations indexed by i
boot.fn = function(dataset, i)
  return(median(dataset[i]))
medv.boot = boot(Boston$medv, boot.fn, 100)
medv.boot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.046 0.3900065
The bootstrap estimate of the median is 21.2, matching the sample median, with a standard error of about 0.39, small relative to the estimate itself.
medv.tenth = quantile(Boston$medv, .1)
medv.tenth
## 10%
## 12.75
# Statistic for boot(): the tenth percentile of the observations indexed by i
boot.fn = function(dataset, i)
  return(quantile(dataset[i], .1))
medv.boot = boot(Boston$medv, boot.fn, 100)
medv.boot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.08 0.5351767
12.75 is the tenth percentile of medv in the Boston suburbs, with a bootstrap standard error of about 0.54, again small relative to the estimate.
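As a final check, boot.ci() can turn the replicates in the medv.boot object just above into a percentile interval for the tenth percentile; a sketch (with only R = 100 replicates the endpoints will be coarse):
# Percentile bootstrap confidence interval for the tenth percentile of medv
boot.ci(medv.boot, type = "perc")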