# Import libraries & packages

library(ISLR2)
library(ISLR)
attach(Weekly)
library(MASS)
library(class)
library(e1071)
attach(Auto)
attach(Boston)

Question 3 : We now review k-fold cross-validation.

Explain how k-fold cross-validation is implemented.

K-fold cross-validation is a method used to evaluate the performance of a machine learning model by repeatedly splitting the dataset into different training and testing sets. The dataset is first divided into K equal-sized groups, also referred to as folds. In the first iteration, K − 1 folds are used to train the model, while the remaining fold is used to test its performance. The process is then repeated K times, each time selecting a different fold as the test set and using the remaining folds for training. This is done to ensure that every observation in the dataset is used for testing just once, and multiple times for training. After all iterations are completed, the results from each round are averaged to produce an overall estimate of the model’s performance.

What are the advantages and disadvantages of k-fold cross-validation relative to:

(i). The validation set approach?

Compared to the validation set approach, k-fold cross-validation provides gives us a more reliable estimate of a model’s performance because it allows every observation to be used for both training and testing. This makes better use of the available data, unlike the validation set approach where a portion of the dataset is permanently reserved for testing and cannot be used for training. Additionally, k-fold cross-validation reduces the risk of results being overly dependent on a single random split of the data. K-fold cross-validation has some disadvantages. It is more computationally expensive since the model must be trained multiple times rather than only once. The results may also vary depending on the choice of k, and for very large datasets or complex models, the repeated training process can require significant computational resources.

(ii). LOOCV? K-fold cross-validation is generally less computationally intensive comapared to Leave-One-Out Cross-Validation (LOOCV). This is because the model is trained only k times instead of once for every observation in the dataset. It also tends to have lower variance in performance estimates, making the evaluation more stable across different samples. On the other hand, LOOCV has the advantage of using almost the entire dataset for training in each iteration, since only one observation is left out for testing each time. This can sometimes provide a less biased estimate of model performance, especially with smaller datasets. As a result, LOOCV is much slower because the model must be fit n times, where n is the total number of observations, making it less practical for larger datasets.

Question 5 — we used logistic regression to predict the probability of default using income and balance on the Default data set.

Read Explore data

data("Default")
summary(Default)

##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554

(a)Use logistic regression model that uses income and balance to predict default.

logmodel1 = glm(default ~ balance + income, data = Default, family = binomial)

summary(logmodel1)

## 
## Call:
## glm(formula = default ~ balance + income, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Using the validation set approach, estimate the test error of this model.

# i)
trainDefault = sample(dim(Default)[1], dim(Default)[1]*0.50)
testDefault = Default[-trainDefault, ]

#ii
LOGmodel2 = glm(default ~ balance + income, data = Default, family = binomial, subset = trainDefault)

#iii
log.prob_def = predict(LOGmodel2, testDefault, type = "response")
log.pred_def = rep("No", dim(Default)[1]*0.50)
log.pred_def[log.prob_def > 0.5] = "Yes"
table(log.pred_def, testDefault$default)

##             
## log.pred_def   No  Yes
##          No  4815  105
##          Yes   29   51

#iv
mean(log.pred_def !=testDefault$default)

## [1] 0.0268

(C)Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

Trial #1 , using 0.6/0.4

set.seed(1)

errors <- c()

for (i in 1:3) {
  
  train <- sample(nrow(Default), nrow(Default) * 0.5)
  
  fit <- glm(default ~ income + balance,
             data = Default,
             family = binomial,
             subset = train)
  
  probs <- predict(fit, Default[-train, ], type = "response")
  
  pred <- rep("No", nrow(Default[-train, ]))
  pred[probs > 0.5] <- "Yes"
  
  errors[i] <- mean(pred != Default[-train, ]$default)
}

errors

## [1] 0.0254 0.0274 0.0244

mean(errors)

## [1] 0.02573333

The validation set approach gave us a 2.5%, 2.7%, 2.4% test error rates for the tree trials. They seem to be close with small variation.

Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student.

#i
trainDefault = sample(dim(Default)[1], dim(Default)[1]*0.50)
testDefault = Default[-trainDefault, ]

#ii
LOGmodel2 = glm(default ~ balance + income + student, data = Default, family = binomial, subset = trainDefault)

#iii
log.prob_def = predict(LOGmodel2, testDefault, type = "response")
log.pred_def = rep("No", dim(Default)[1]*0.50)
log.pred_def[log.prob_def > 0.5] = "Yes"
table(log.pred_def, testDefault$default)

##             
## log.pred_def   No  Yes
##          No  4825  106
##          Yes   15   54

mean(log.pred_def !=testDefault$default)

## [1] 0.0242

Including the student variable in the model results in a test error rate of approximately 2.76%. This is very similar to the error rate obtained from the model without the student variable, shows that adding student status did not improve predictive performance.

Question 6 — We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set.

Using the summary() and glm() functions

set.seed(4)
glm.Default <- glm(default~income+balance, data=Default,family=binomial )
summary(glm.Default)

## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Write a function, boot.fn()

boot.fn = function(data, index) return(coef(glm(default ~ balance + income, data = data, family = binomial, subset = index)))

Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coeﬀicients for income and balance.

library(boot)
boot(Default, boot.fn, 100)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 100)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01  1.381094e-02 4.250016e-01
## t2*  5.647103e-03 -1.285551e-06 2.240173e-04
## t3*  2.080898e-05 -3.311297e-07 4.366588e-06

Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

The estimated standard errors for the coefficients for income and balance are 4.240036e-01, 2.294044e-04, and 4.652618e-06

Question 9 — We will now consider the Boston housing data set, from the ISLR2 library.

Attach

attach(Boston)

Explore Dataset

mean.M =mean(medv)
mean.M

## [1] 22.53281

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

estimate for the population mean of medv ˆµ= 22.53281

Provide an estimate of the standard error ofˆµ. Interpret this result.

stderr.mean = sd(medv)/sqrt(length(medv))
stderr.mean

## [1] 0.4088611

Now estimate the standard error ofˆµ using the bootstrap. How does this compare to your answer from (b)?

boot.fn2 = function(data, index) return(mean(data[index]))
boot2 = boot(medv, boot.fn2, 100)
boot2

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.fn2, R = 100)
## 
## 
## Bootstrap Statistics :
##     original        bias    std. error
## t1* 22.53281 -0.0008359684    0.398165

Based on your bootstrap estimate from (c), provide a 95 % confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv).

t.test(Boston$medv)

## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

CI.Boston <- c(22.53 - 2 * 0.41, 22.53 + 2 * 0.41)
CI.Boston

## [1] 21.71 23.35

A 95 % confidence interval using bootstrap is between 21.71 and 23.35, using t test we have a CI of 22.53. The two result are approximately close.

Based on this data set, provide an estimate,ˆ µmed, for the median value of medv in the population.

median.medv = median(medv)
median.medv

## [1] 21.2

We now would like to estimate the standard error ofˆµmed.

The estimated median value of 21.2 from bootstrap which is same as the value in (e), and the the standard error of 0.3874 which is relatively small compared to median value

boot.fn2=function(data, index){return(median(data[index]))}
library(boot)
boot(medv,boot.fn2,500)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.fn2, R = 500)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2  0.0104   0.3568986

Based on this data set, provide an estimate for the tenth percentile of medv in Boston census tracts.

tenth.medv = quantile(medv, c(0.1))
tenth.medv

##   10% 
## 12.75

Use the bootstrap to estimate the standard error ofˆ µ0.1. Comment on your findings.

boot.fn3=function(data, index){return(quantile(data[index], c(0.1)))}
library(boot)
boot(medv,boot.fn3,500)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.fn3, R = 500)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75 -0.0031   0.5138614

Estimated quantile value of 12.75 from bootstap which is same as the value in (g), and the standard error of 0.5300358

Predictive Modeling Assignment #4

Christiane Umutoni

2026-06-28

# Import libraries & packages

Question 3 : We now review k-fold cross-validation.

Question 5 — we used logistic regression to predict the probability of default using income and balance on the Default data set.

Question 6 — We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set.

Question 9 — We will now consider the Boston housing data set, from the ISLR2 library.