#3. a) Explain how k-fold cross-validation is implemented
K-fold CV is implemented by taking a data set and randomly partitioning it into ‘k’ number of equally sized folds. These folds allows us to create train/test splits on data sets that do not really offer that much data (<1000, or more if we are willing to use the computing power). The training data is the ‘k-1’ folds and the first fold is the test set in that specific cross-section. We then iterate over the same data set and “cross-validate” our results by taking the average accuracy or test MSE of the folds to get our results for overall test, rather than just taking the highest fold.
Leave one out CV (LOOCV): Advantages of k-fold CV compared to LOOCV is that it has much lower computational costs since we are not doing k=n, we are usually doing k=5 or k=10. When using an optimal k, k-fold CV will come rather close to LOOCV at a much lower cost because we are not treating each observation as its own fold. A disadvantage of k-fold CV compared to LOOCV would be that it can be slightly more biased if we use a sub-optimal k for the specific data set like k=2.
#5. a) Fit a logistic regression model that uses income and balance to predict default.
Default = ISLR2::Default
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
set.seed(1)
default_glm1 = glm(default ~ income + balance, data = Default, family = 'binomial')
summary(default_glm1)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Both income and balance are significant at an alpha level of 0.05 on predicting if someone is likely to default.
index = sample(1:nrow(Default), 0.7*nrow(Default))
# 70/30 Train/Test Split
def_train = Default[index,]
def_test = Default[-index,]
default_glm2 = glm(default ~ income + balance, data = def_train, family = 'binomial')
summary(default_glm2)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = def_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.167e+01 5.214e-01 -22.379 < 2e-16 ***
## income 2.560e-05 6.012e-06 4.258 2.06e-05 ***
## balance 5.574e-03 2.678e-04 20.816 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2030.3 on 6999 degrees of freedom
## Residual deviance: 1079.6 on 6997 degrees of freedom
## AIC: 1085.6
##
## Number of Fisher Scoring iterations: 8
probabilities = predict(default_glm2,newdata = def_test, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 'Yes', 'No')
mse = mean(def_test$default != predicted.classes)
acc = 1 - mse
print(mse)
## [1] 0.02666667
print(acc)
## [1] 0.9733333
We get an accuracy of 97.33% or an error rate or 2.67%. This is with a 70/30 split.
80/20 split
index = sample(1:nrow(Default), 0.8*nrow(Default))
def_train = Default[index,]
def_test = Default[-index,]
default_glm3 = glm(default ~ income + balance, data = def_train, family = 'binomial')
summary(default_glm3)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = def_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.164e+01 4.908e-01 -23.725 < 2e-16 ***
## income 2.006e-05 5.546e-06 3.616 0.000299 ***
## balance 5.731e-03 2.592e-04 22.111 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2340.6 on 7999 degrees of freedom
## Residual deviance: 1252.2 on 7997 degrees of freedom
## AIC: 1258.2
##
## Number of Fisher Scoring iterations: 8
probabilities = predict(default_glm3,newdata = def_test, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 'Yes', 'No')
mse = mean(def_test$default != predicted.classes)
acc = 1 - mse
print(mse)
## [1] 0.0265
print(acc)
## [1] 0.9735
The 80/20 split is 0.0002 better than the 70/30 split with an accuracy of 97.35%.
90/10
index = sample(1:nrow(Default), 0.9*nrow(Default))
def_train = Default[index,]
def_test = Default[-index,]
default_glm4 = glm(default ~ income + balance, data = def_train, family = 'binomial')
summary(default_glm4)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = def_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.168e+01 4.642e-01 -25.164 < 2e-16 ***
## income 1.960e-05 5.265e-06 3.723 0.000197 ***
## balance 5.764e-03 2.437e-04 23.656 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2650.8 on 8999 degrees of freedom
## Residual deviance: 1410.3 on 8997 degrees of freedom
## AIC: 1416.3
##
## Number of Fisher Scoring iterations: 8
probabilities = predict(default_glm4,newdata = def_test, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 'Yes', 'No')
mse = mean(def_test$default != predicted.classes)
acc = 1 - mse
print(mse)
## [1] 0.03
print(acc)
## [1] 0.97
The 90/10 split is worse than the previous two at an accuracy of 97%
50/50
index = sample(1:nrow(Default), 0.5*nrow(Default))
def_train = Default[index,]
def_test = Default[-index,]
default_glm5 = glm(default ~ income + balance, data = def_train, family = 'binomial')
summary(default_glm5)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = def_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.165e+01 6.311e-01 -18.460 < 2e-16 ***
## income 2.561e-05 7.211e-06 3.551 0.000384 ***
## balance 5.568e-03 3.212e-04 17.335 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1429.88 on 4999 degrees of freedom
## Residual deviance: 781.26 on 4997 degrees of freedom
## AIC: 787.26
##
## Number of Fisher Scoring iterations: 8
probabilities = predict(default_glm5,newdata = def_test, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 'Yes', 'No')
mse = mean(def_test$default != predicted.classes)
acc = 1 - mse
print(mse)
## [1] 0.0262
print(acc)
## [1] 0.9738
The 50/50 split is the best model at 97.38% accuracy which is 0.0003 better than the 80/20 split.
default_glm6 = glm(default~., data=def_train, family = 'binomial')
summary(default_glm6)
##
## Call:
## glm(formula = default ~ ., family = "binomial", data = def_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.070e+01 6.975e-01 -15.338 < 2e-16 ***
## studentYes -9.689e-01 3.327e-01 -2.912 0.00359 **
## balance 5.714e-03 3.304e-04 17.295 < 2e-16 ***
## income -3.199e-07 1.146e-05 -0.028 0.97774
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1429.88 on 4999 degrees of freedom
## Residual deviance: 772.86 on 4996 degrees of freedom
## AIC: 780.86
##
## Number of Fisher Scoring iterations: 8
probabilities = predict(default_glm6,newdata = def_test, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 'Yes', 'No')
mse = mean(def_test$default != predicted.classes)
acc = 1 - mse
print(mse)
## [1] 0.0266
print(acc)
## [1] 0.9734
Using the 50/50 split, adding student caused income to no longer be significant but studentYes was significant at an alpha level of 0.05. Accuracy went down by 0.0004 when adding student but this could potentially be the case because of the split and the difference is rather small. Student did not reduce error rate.
#6. a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.
summary(default_glm1)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Standard error of coefficients income: 4.985e-06 balance: 2.274e-04
b)Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.
boot.fn = function(data,index){
coef(glm(default~income+balance,data=data,subset=index,family = 'binomial'))
}
boot.fn(Default, 1:nrow(Default))
## (Intercept) income balance
## -1.154047e+01 2.080898e-05 5.647103e-03
c)Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefcients for income and balance.
library(boot)
## Warning: package 'boot' was built under R version 4.3.3
##
## Attaching package: 'boot'
## The following object is masked from 'package:car':
##
## logit
set.seed(1)
boot(Default, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2* 2.080898e-05 1.680317e-07 4.866284e-06
## t3* 5.647103e-03 1.855765e-05 2.298949e-04
Intercept 4.344722e-01 income 4.866284e-06 balance 2.298949e-04
d)Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.
From using boot() and boot.fn we got income:4.866284e-06, balance:2.298949e-04 for our estimated standard errors. When using the glm() function we got income: 4.985e-06, balance: 2.274e-04. These estimated standard errors are extremely close to each other so we can state that this model provides a good fit to the data and our assumptions are satisfied.
#9. a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate µˆ.
boston = ISLR2::Boston
mu_hat = mean(boston$medv)
mu_hat
## [1] 22.53281
We get a sample mean of 22.53 for medv (mu_hat / µˆ).
(sd(boston$medv)/sqrt(length(boston$medv)))
## [1] 0.4088611
We get 0.4089 as our estimated standard error of mu_hat / sample mean. This means that, on average, the sample mean of medv will vary about 0.4089 from the true population mean.
set.seed(1)
boot.fn2 = function(data, index) {
mean(data[index])
}
boot(boston$medv, boot.fn2, 10000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = boston$medv, statistic = boot.fn2, R = 10000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 -0.002350771 0.4069546
We estimated the standard error of the sample mean to be about 0.4089 in part b) and when bootstrapping got a standard error of 0.4070.The answers are within 0.0019 of each other so they similar in estimating the error of the sample mean in relation to the true mean.
d)Based on your bootstrap estimate from (c), provide a 95 % confdence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95 % confdence interval using the formula [ˆµ − 2SE(ˆµ), µˆ + 2SE(ˆµ)].
lower_bound = mu_hat - 2*(0.4069546)
upper_bound = mu_hat + 2*(0.4069546)
print(c(lower_bound, upper_bound))
## [1] 21.71890 23.34672
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
Confidence Interval: 21.71890 23.34672 t-test: 21.72953 23.33608 The confidence interval and t-test are almost exactly the same, they only differ by about 0.01.
e)Based on this data set, provide an estimate, µˆmed, for the median value of medv in the population.
median(boston$medv)
## [1] 21.2
The sample median of medv is 21.2.
f)We now would like to estimate the standard error of µˆmed. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.
set.seed(1)
boot.fn2 = function(data, index) {
median(data[index])
}
boot(boston$medv, boot.fn2, 10000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = boston$medv, statistic = boot.fn2, R = 10000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.014705 0.3780355
When bootstrapping the median, we get a standard error of 0.3780 to estimate the distance of the sample median from the true median. We can create a confidence interval to see the estimated range of the true median:
lower_bound = 21.2 - 2*(0.3780355)
upper_bound = 21.2 + 2*(0.3780355)
print(c(lower_bound, upper_bound))
## [1] 20.44393 21.95607
Which would be that the true median estimate should fall between 20.44393, 21.95607 at 95% confidence.
quantile(boston$medv, 0.1)
## 10%
## 12.75
The 10th percentile occurs at 12.75 medv.
h)Use the bootstrap to estimate the standard error of µˆ0.1. Comment on your findings
set.seed(1)
boot.fn3 = function(data, index) {
quantile(data[index], 0.1)
}
boot(boston$medv, boot.fn3, 10000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = boston$medv, statistic = boot.fn3, R = 10000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.007625 0.5014507
Using the bootstrapping method on the 10th percentile, we get a standard error of 0.5015. This gave us the largest standard error value compared to our sample mean and sample medians, but the estimate is still rather small. Getting similar error spreads through different tests shows the power of bootstrapping has on estimating variability.