1. Randomly divide the dataset into k roughly equal-sized folds.
2. For each fold i, fit the model using the other k-1 folds as the training set and evaluate it on the i-th fold, which serves as the validation set.
3. Calculate the mean of the k resulting test errors to estimate the overall test error rate.
Advantages: The test error estimate is more stable, since every observation is used for both training and validation; the estimate also varies less across runs because it does not depend on a single arbitrary split.
Disadvantages: It requires much more computation, since the model must be refit k times.
Advantages: It is a faster approach on larger datasets, since the model is fit k times rather than n times; it also tends to yield test error estimates with lower variance.
Disadvantages: It is somewhat more biased than LOOCV, since each training set contains fewer observations; a biased test error estimate could lead to false positives or false negatives and mislead decision making.
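To make these steps concrete, here is a minimal hand-rolled sketch of k-fold cross-validation, assuming the Default data from ISLR2 (used throughout this section) and an illustrative choice of k = 10; cv.glm() from the boot package automates the same procedure.

library(ISLR2)
set.seed(1)
k <- 10
# Step 1: assign each observation to one of k roughly equal folds at random
folds <- sample(rep(1:k, length.out = nrow(Default)))
cv.errors <- numeric(k)
for (i in 1:k) {
  # Step 2: fit on the other k-1 folds, then validate on fold i
  fit <- glm(default ~ income + balance,
             data = Default[folds != i, ], family = binomial)
  probs <- predict(fit, newdata = Default[folds == i, ], type = "response")
  preds <- ifelse(probs > 0.5, "Yes", "No")
  cv.errors[i] <- mean(preds != Default$default[folds == i])
}
# Step 3: average the k fold-specific error rates
mean(cv.errors)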
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.3
data("Default")
set.seed(1)
default_lrm.fit <- glm(default ~ income + balance, data = Default, family = binomial)
summary(default_lrm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
# Validation set approach: split the observations in half at random
default.train <- sample(1:nrow(Default), nrow(Default) / 2)
default.training <- Default[default.train, ]
default.valid <- Default[-default.train, ]
# Fit on the training half, predict posterior probabilities on the validation half
lrmfit.train <- glm(default ~ income + balance, data = default.training, family = binomial)
default.probs <- predict(lrmfit.train, newdata = default.valid, type = "response")
default.pred <- ifelse(default.probs > 0.5, "Yes", "No")
default.actual <- default.valid$default
# Validation set error rate: fraction of misclassified observations
mean(default.pred != default.actual)
## [1] 0.0254
# Repeat the validation set approach with three different random splits
errorcount <- numeric(3)
seed <- c(2, 3, 4)
for (i in 1:3) {
  set.seed(seed[i])
  repeat.train <- sample(1:nrow(Default), nrow(Default) / 2)
  repeat.fit <- glm(default ~ income + balance, data = Default[repeat.train, ], family = binomial)
  repeat.prob <- predict(repeat.fit, newdata = Default[-repeat.train, ], type = "response")
  repeat.pred <- ifelse(repeat.prob > 0.5, "Yes", "No")
  errorcount[i] <- mean(repeat.pred != Default[-repeat.train, ]$default)
}
errorcount
## [1] 0.0238 0.0264 0.0256
mean(errorcount)
## [1] 0.02526667
Comment on the results obtained.
The average validation set error across the three splits was 2.53%, and the initial validation error was 2.54%, indicating the validation set approach is fairly stable here. It is worth noting that the slight variability across the repeated splits (2.38%, 2.64%, and 2.56%) indicates some sensitivity to how the data are split; the optional sketch below makes this spread easier to see.
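As a small illustration that is not part of the original output, one can repeat the random split many times and summarize the spread of the resulting validation error estimates; the 100 repetitions and the seed are arbitrary choices, and the code assumes the Default data loaded above.

set.seed(10)  # arbitrary seed for this illustration
many.errors <- replicate(100, {
  tr <- sample(1:nrow(Default), nrow(Default) / 2)
  fit <- glm(default ~ income + balance, data = Default[tr, ], family = binomial)
  p <- predict(fit, newdata = Default[-tr, ], type = "response")
  # validation error for this particular split
  mean(ifelse(p > 0.5, "Yes", "No") != Default$default[-tr])
})
summary(many.errors)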
set.seed(1)
# Refit the model with a dummy variable for student added
dummy.train <- sample(1:nrow(Default), nrow(Default) / 2)
dummy.fit <- glm(default ~ income + balance + student, data = Default[dummy.train, ], family = binomial)
dummy.prob <- predict(dummy.fit, newdata = Default[-dummy.train, ], type = "response")
dummy.pred <- ifelse(dummy.prob > 0.5, "Yes", "No")
mean(dummy.pred != Default[-dummy.train, ]$default)
## [1] 0.026
Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.
No, adding the student dummy variable does not reduce the test error rate. It remains essentially unchanged, rising slightly from 2.54% to 2.60%, i.e. worsening by 0.06 percentage points.
set.seed(1)
default6.fit <- glm(default ~ income + balance, data = Default, family = binomial)
summary(default6.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
# Statistic for boot(): refit the logistic model on the bootstrap sample
# given by 'index' and return its coefficient estimates
default.boot <- function(data, index) {
  coef(glm(default ~ income + balance, data = data[index, ], family = binomial))
}
library(boot)
set.seed(1)
bootstrap <- boot(data = Default, statistic = default.boot, R = 1000)
bootstrap
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = default.boot, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2* 2.080898e-05 1.680317e-07 4.866284e-06
## t3* 5.647103e-03 1.855765e-05 2.298949e-04
The estimated standard errors from glm() and from the bootstrap are extremely close (income: 4.985e-06 vs 4.866e-06; balance: 2.274e-04 vs 2.299e-04), which suggests the logistic regression model is stable and that the model assumptions underlying the glm() formula-based standard errors are reasonable.
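As an illustrative cross-check that is not part of the original output, the two sets of standard errors can be placed side by side; this assumes the default6.fit and bootstrap objects created above.

# Formula-based SEs from summary() next to bootstrap SEs from the replicate matrix
se.glm <- summary(default6.fit)$coefficients[, "Std. Error"]
se.boot <- apply(bootstrap$t, 2, sd)
cbind(glm = se.glm, bootstrap = se.boot)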
data("Boston")
mu_hat <- mean(Boston$medv)
mu_hat
## [1] 22.53281
# Standard error of the sample mean: s / sqrt(n)
n <- nrow(Boston)
stddev <- sd(Boston$medv)
mu_hat.se <- stddev / sqrt(n)
mu_hat.se
## [1] 0.4088611
Interpret this result.
Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.
The standard error of about 0.409 means the sample mean of medv would be expected to vary by roughly $409 (medv is recorded in $1000s) from sample to sample of this size; it quantifies the uncertainty in μ̂ as an estimate of the population mean.
# Statistic for boot(): the mean of a bootstrap sample of medv
mu_hat.boot <- function(data, index) {
  mean(data[index])
}
set.seed(1)
mu_hat.bootmean <- boot(Boston$medv, statistic = mu_hat.boot, R = 1000)
mu_hat.bootmean
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = mu_hat.boot, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
How does this compare to your answer from (b)?
Using the bootstrap to estimate the standard error of μ̂ gives 0.4107 (about $411), very close to the value from the theoretical formula, 0.4089 (about $409). This minimal difference confirms the reliability of the theoretical standard error for this dataset.
# Approximate 95% CI: mu_hat plus/minus 2 bootstrap standard errors
mu_hat.bootse <- sd(mu_hat.bootmean$t)
boot_ci <- c(mu_hat - 2 * mu_hat.bootse, mu_hat + 2 * mu_hat.bootse)
boot_ci
## [1] 21.71148 23.35413
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
Compare it to the results obtained using t.test(Boston$medv).
Hint: You can approximate a 95% confidence interval using the formula [μ̂ − 2·SE(μ̂), μ̂ + 2·SE(μ̂)].
The t.test() returned a 95% confidence interval of (21.73, 23.34) for the population mean of medv. This is statistically similar to the bootstrap confidence interval of (21.71, 23.35), suggesting the normality assumption of the t-test is reasonable and that both methods provide consistent estimates; the optional boot.ci() cross-check below gives the same picture.
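As an optional cross-check that is not part of the original output, boot.ci() from the boot package can compute a percentile interval directly from the bootstrap replicates, with no normality assumption.

# Percentile bootstrap CI from the replicates stored in mu_hat.bootmean
boot.ci(mu_hat.bootmean, type = "perc")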
mu_hat.med <- median(Boston$medv)
mu_hat.med
## [1] 21.2
# Statistic for boot(): the median of a bootstrap sample of medv
mu_hat.bootmed <- function(data, index) {
  median(data[index])
}
set.seed(1)
mu_hat.medboot <- boot(Boston$medv, statistic = mu_hat.bootmed, R = 1000)
mu_hat.medboot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = mu_hat.bootmed, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.02295 0.3778075
Comment on your findings.
The sample median of medv is $21,200, with a bootstrap standard error of roughly $378. The small bootstrap bias (about $23) indicates the sample median sits close to the center of the bootstrap distribution. Since there is no simple formula for the standard error of a median, the bootstrap provides a reliable way of quantifying this variability; a rough confidence interval built from it is sketched below.
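As a hypothetical follow-up that is not part of the original output, the same μ̂ ± 2·SE rule used earlier for the mean can be applied to the median, using the replicates stored in mu_hat.medboot.

# Approximate 95% CI for the median: median plus/minus 2 bootstrap SEs
med.se <- sd(mu_hat.medboot$t)
c(mu_hat.med - 2 * med.se, mu_hat.med + 2 * med.se)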
mu_hat0.1 <- quantile(Boston$medv, 0.1)
mu_hat0.1
## 10%
## 12.75
# Statistic for boot(): the 10th percentile of a bootstrap sample of medv
mu_hat0.1.fnboot <- function(data, index) {
  quantile(data[index], 0.1)
}
set.seed(1)
mu_hat0.1.boot <- boot(Boston$medv, statistic = mu_hat0.1.fnboot, R = 1000)
mu_hat0.1.boot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = mu_hat0.1.fnboot, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0339 0.4767526
Comment on your findings.
The 10th percentile of medv is estimated to be $12,750, with a bootstrap-estimated standard error of about $477, indicating a moderate level of uncertainty in the estimate. However, the small bias (about $34) suggests the sample estimate is a good representation of the corresponding population quantile.