# Load dependencies
pacman::p_load(ISLR, ISLR2, dplyr, car, caret, boot)

Question 3

Part a)

Explain how k-fold cross-validation is implemented.


K-fold cross-validation is a technique for assessing the performance of a machine learning model. The dataset is randomly divided into k roughly equal subsets (folds). The model is then trained and evaluated k times: on each iteration, one fold serves as the test set while the remaining k-1 folds serve as the training set. The k test-error estimates are then averaged to produce the overall cross-validation estimate.
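A minimal sketch of the procedure in R, applied to the Default data used below (an illustration only; cv.glm() from the boot package automates this):

set.seed(1)
k = 5
# Randomly assign each observation to one of k folds
folds = sample(rep(1:k, length.out = nrow(Default)))
cv_errors = numeric(k)
for (j in 1:k) {
  # Train on all folds except fold j
  fit = glm(default ~ income + balance, data = Default[folds != j, ],
            family = "binomial")
  # Evaluate on the held-out fold j
  prob = predict(fit, Default[folds == j, ], type = "response")
  pred = ifelse(prob > 0.5, "Yes", "No")
  cv_errors[j] = mean(pred != Default$default[folds == j])
}
mean(cv_errors)  # average of the k fold errors = CV estimate of the test error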

Part b)

What are the advantages and disadvantages of k-fold cross-validation relative to:

i. The validation set approach?

ii. LOOCV?


  1. Relative to the validation set approach, the advantage of k-fold cross-validation is that every observation is used for both training and testing across the k iterations, so the test-error estimate is more stable and does not depend on a single arbitrary split; each model is also fit on a larger share of the data, which tends to give a less pessimistic error estimate. The disadvantages are that it is computationally more expensive (k model fits instead of one) and slightly more complex to implement.

  2. Compared to leave-one-out cross-validation (LOOCV), k-fold is more computationally efficient: LOOCV requires n model fits, while k-fold requires only k (typically 5 or 10). K-fold also tends to give a lower-variance estimate of the test error, because the n LOOCV models are trained on nearly identical data and are therefore highly correlated. The trade-off is that k-fold is slightly more biased, since each model is trained on fewer observations than in LOOCV. A sketch of the computational difference follows below.
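To make the cost difference concrete, cv.glm() from the boot package (loaded above) implements both; a sketch on the Default data, using the misclassification cost function from the boot documentation (the LOOCV call is commented out because it refits the model 10,000 times):

# Misclassification rate as the CV cost function
cost = function(r, pi) mean(abs(r - pi) > 0.5)
fit = glm(default ~ income + balance, data = Default, family = "binomial")
cv.glm(Default, fit, cost, K = 10)$delta[1]       # k-fold: 10 model fits
# cv.glm(Default, fit, cost, K = nrow(Default))   # LOOCV: n = 10,000 model fits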

Question 5

Part a)

Fit a logistic regression model that uses income and balance to predict default.

# Read in data from ISLR package
df = Default
head(df)
##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559
# Logistic Regression Model
logit_model = glm(default ~ income + balance, data = df, family = "binomial")

summary(logit_model)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = df)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8
# Check for multicollinearity: VIF values near 1 indicate negligible collinearity
vif(logit_model)
##   income  balance 
## 1.045605 1.045605

Part b)

Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

i. Split the sample set into a training set and a validation set.

ii. Fit a multiple logistic regression model using only the training observations.

iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.

set.seed(42)

# Split data into train and test
train_index = createDataPartition(df$default, p = 0.8, list = FALSE)

train = df[train_index, ]
test = df[-train_index, ]

# Train logistic regression model
logit_model = glm(default ~ income + balance, data = train, family = "binomial")

# Make predictions and classify using a 0.5 threshold
pred_prob = predict(logit_model, test, type = "response")
pred_labels = ifelse(pred_prob > 0.5, "Yes", "No")
pred_labels = as.factor(pred_labels)

# Compute test error (misclassification rate)
mean(pred_labels != test$default)
## [1] 0.02451226
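Beyond the overall error rate, a class-level breakdown can be informative since defaults are rare; a quick check using confusionMatrix() from caret (loaded above):

# Confusion matrix for the validation-set predictions
confusionMatrix(pred_labels, test$default, positive = "Yes")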

Part c)

Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

set.seed(42)

# Store test errors
test_errors = numeric(3)

for (i in 1:3) {
  # Split data into train and test
  train_index = createDataPartition(df$default, p = 0.8, list = FALSE)
  train = df[train_index, ]
  test  = df[-train_index, ]
  
  # Train logistic regression model
  logit_model = glm(default ~ income + balance, data = train, family = "binomial")
  
  # Make predictions
  pred_prob = predict(logit_model, test, type = "response")
  pred_labels = ifelse(pred_prob > 0.5, "Yes", "No")
  pred_labels = as.factor(pred_labels)
  
  # Compute test error
  test_errors[i] = mean(pred_labels != test$default)
}

# Display test errors
test_errors
## [1] 0.02451226 0.02651326 0.02301151
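The three estimates are all close, ranging from about 2.30% to 2.65%, so the validation-set error depends somewhat on the particular split but stays near 2.5%. Summarizing the spread:

mean(test_errors)  # average error across the three splits
sd(test_errors)    # split-to-split variability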

Part d)

Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.


Including the student dummy variable resulted in a slightly lower misclassification rate, from 0.0245 to 0.0240, a decrease of 0.0005. This reduction is smaller than the split-to-split variation observed in (c), so it is weak evidence at best that the dummy variable reduces the test error; one way to check is sketched after the code below.

set.seed(42)

# Recode student as a 0/1 dummy variable
df$student = ifelse(df$student == "Yes", 1, 0)

# Split data into train and test
train_index = createDataPartition(df$default, p = 0.8, list = FALSE)
train = df[train_index, ]
test  = df[-train_index, ]

# Train logistic regression model
logit_model = glm(default ~ ., data = train, family = "binomial")

# Make predictions
pred_prob = predict(logit_model, test, type = "response")
pred_labels = ifelse(pred_prob > 0.5, "Yes", "No")
pred_labels = as.factor(pred_labels)

# Compute test error
mean(pred_labels != test$default)
## [1] 0.02401201
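Since the 0.0005 reduction is within the split-to-split variation seen in (c), a more convincing check (a sketch, not part of the assignment's required steps) is to average both models' validation errors over several random splits:

# Compare the models with and without student over several splits
set.seed(42)
err_base = err_student = numeric(5)
for (i in 1:5) {
  idx = createDataPartition(df$default, p = 0.8, list = FALSE)
  tr = df[idx, ]
  te = df[-idx, ]
  m1 = glm(default ~ income + balance, data = tr, family = "binomial")
  m2 = glm(default ~ income + balance + student, data = tr, family = "binomial")
  err_base[i] = mean((predict(m1, te, type = "response") > 0.5) != (te$default == "Yes"))
  err_student[i] = mean((predict(m2, te, type = "response") > 0.5) != (te$default == "Yes"))
}
mean(err_base)
mean(err_student)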

Question 6

Part a)

Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.


The estimated standard error for the income coefficient is 4.985E-06, and the standard error for the balance coefficient is 2.274E-04.

# Logistic Regression Model
logit_model = glm(default ~ income + balance, data = df, family = "binomial")

summary(logit_model)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = df)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Part b)

Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

set.seed(42)

# boot.fn() refits the logistic regression on the indexed (resampled)
# observations and returns the income and balance coefficient estimates
boot.fn = function(data, index) {
  logit_model = glm(default ~ income + balance, data = data[index, ], family = "binomial")
  return(coef(logit_model)[c("income", "balance")])
}
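As a quick sanity check, applying boot.fn() to the full sample (indices 1 through n) should reproduce the coefficients from part (a):

boot.fn(df, 1:nrow(df))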

Part c)

Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

# Perform bootstrapping
boot_results = boot(data = df, statistic = boot.fn, R = 1000)
# Compute the standard errors from bootstrap
apply(boot_results$t, 2, sd)
## [1] 5.073444e-06 2.299133e-04

Part d)

Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.


The estimated standard errors are very similar across the two methods: glm() gives 4.985E-06 for income and 2.274E-04 for balance, while the bootstrap gives 5.073E-06 and 2.299E-04, respectively. This agreement suggests that the model-based formula is reliable here and that both approaches provide consistent, stable estimates of uncertainty.

Question 9

Part a)

Based on this data set, provide an estimate for the population mean of medv. Call this estimate µ̂.


The estimate for the population mean is 22.53 or $22,530.

# Read in the data from ISLR2 package
boston = Boston
head(boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3  4.98 24.0
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8  9.14 21.6
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8  4.03 34.7
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7  2.94 33.4
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7  5.33 36.2
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7  5.21 28.7
# Estimate population mean
mean(boston$medv)
## [1] 22.53281

Part b)

Provide an estimate of the standard error of µ̂. Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.


The standard error of µ̂ is 0.409 (or $409). This means the estimated mean home value is expected to vary by approximately $409 across different samples of the same size (506). This small value indicates that the sample mean is a precise and reliable estimate of the population mean.

sd(boston$medv)/sqrt(nrow(boston))
## [1] 0.4088611

Part c)

Now estimate the standard error of µ̂ using the bootstrap. How does this compare to your answer from (b)?


The bootstrap standard error (0.4009) is slightly smaller than the formula-based estimate from (b) (0.4089), a difference of about 0.0079. The two approaches agree closely, which supports the analytic estimate of the sampling variability of µ̂.

set.seed(42)

# Create a bootstrap function to compute the mean
boot.mean = function(data, indices) {
  return(mean(data[indices]))
}

# Perform bootstrapping
boot_results = boot(boston$medv, boot.mean, R = 1000)

# Print the standard error
sd(boot_results$t)
## [1] 0.4009216

Part d)

Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95% confidence interval using the formula [µ̂ − 2·SE(µ̂), µ̂ + 2·SE(µ̂)].


The two approaches produce nearly identical intervals. The 95% confidence interval from the one-sample t-test is (21.72953, 23.33608), while the bootstrap-based interval is (21.73096, 23.33465). The bootstrap interval is slightly narrower, with a width of 1.60369 versus 1.60655 (a difference of 0.00286); the t-test interval starts slightly lower and ends slightly higher.

# Find 95% CI of population mean using t-test
t.test(boston$medv)
## 
##  One Sample t-test
## 
## data:  boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281
# Store population mean and sd
pop_mean = mean(boston$medv)  
boot_sd = sd(boot_results$t)

# Lower bound confidence interval
pop_mean - 2 * boot_sd
## [1] 21.73096
# Upper bound confidence interval
pop_mean + 2 * boot_sd
## [1] 23.33465
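As an aside, the boot package can also compute the interval directly from the bootstrap replicates; for example, the percentile interval (a sketch; its endpoints will differ slightly from the ±2·SE approximation):

# Percentile bootstrap CI for the mean of medv
boot.ci(boot_results, type = "perc")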

Part e)

Based on this data set, provide an estimate, µ̂_med, for the median value of medv in the population.


The estimate for the population median is 21.2 or $21,200.

median(boston$medv)
## [1] 21.2

Part f)

We now would like to estimate the standard error of µ̂_med. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.


The bootstrap standard error for the median is 0.366 (or $366). This suggests that the estimated median home value is expected to vary by approximately $366 across different samples of the same size (506). This small value indicates that the sample median is a precise and reliable estimate of the population median.

set.seed(42)

# Create a bootstrap function to compute the median
boot.median = function(data, indices) {
  return(median(data[indices]))
}

# Perform bootstrapping
boot_results = boot(boston$medv, boot.median, R = 1000)

# Print the standard error
sd(boot_results$t)
## [1] 0.3661785

Part g)

Based on this data set, provide an estimate for the tenth percentile of medv in Boston census tracts. Call this quantity µ̂_0.1. (You can use the quantile() function.)

The estimated tenth percentile of medv is 12.75, or $12,750.

quantile(boston$medv, probs = 0.1)
##   10% 
## 12.75

Part h)

Use the bootstrap to estimate the standard error of µ̂_0.1. Comment on your findings.


The bootstrap standard error for the 10th percentile is 0.495 (or $495). This suggests that the estimated 10th percentile value of $12,750 is expected to vary by approximately $495 across different samples of the same size (506). While this variation is not particularly large, it indicates some level of uncertainty in the estimate.

set.seed(42)

# Create a bootstrap function to compute the 10th percentile
boot.10percentile = function(data, indices) {
  return(quantile(data[indices], probs = 0.1))
}

# Perform bootstrapping
boot_results = boot(boston$medv, boot.10percentile, R = 1000)

# Print the standard error
sd(boot_results$t)
## [1] 0.4948966