knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
3.
a. K-fold cross-validation is implemented by dividing the full dataset into k roughly equal parts, or “folds”. The model is then fit k times: each fit uses k-1 folds for training and holds out the remaining fold for testing, rotating through the folds so that each one serves as the test set exactly once. Averaging the k test-error estimates gives an overall measure of how well the model performs, which approximates how it would do on new data rather than just on the data it was trained on (see the sketch after part b).
b. (i) The validation set approach is simpler: it splits the data once into a training set and a validation set. Its error estimate can be unreliable, though, because it depends heavily on that single random split, and it tends to overestimate the test error since the model is fit on only part of the data. K-fold cross-validation gives a more stable and accurate estimate because it averages over several different splits, at the cost of fitting the model k times.
(ii) LOOCV is the extreme case of k-fold cross-validation in which k equals the number of observations n. The model is fit n times, each time training on all but one data point and testing on the single point left out. This gives a nearly unbiased estimate of the test error, but it is computationally expensive for large datasets, and because the n training sets overlap almost completely, the estimate can have high variance. K-fold cross-validation with a moderate k, such as 5 or 10, still gives reliable results at a fraction of the cost because it requires far fewer model fits.
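As a concrete illustration of the procedure described in part a, below is a minimal hand-coded sketch of 10-fold cross-validation, applied to the Default data that question 5 analyzes. The seed, the 0.5 classification threshold, and the use of the misclassification rate as the error metric are illustrative choices, not part of the original answer.
library(ISLR2)
data(Default)
set.seed(1)
k <- 10
n <- nrow(Default)
# Randomly assign each observation to one of k folds
folds <- sample(rep(1:k, length.out = n))
fold_errors <- numeric(k)
for (j in 1:k) {
  # Train on the k-1 folds other than fold j, test on fold j
  fit <- glm(default ~ income + balance,
             data = Default[folds != j, ], family = binomial)
  probs <- predict(fit, newdata = Default[folds == j, ], type = "response")
  preds <- ifelse(probs > 0.5, "Yes", "No")
  fold_errors[j] <- mean(preds != Default$default[folds == j])
}
# The CV error estimate is the average of the k held-out error rates
mean(fold_errors)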
5.
library(ISLR2)
library(GGally)
library(MASS)
library(class)
library(e1071)
data(Default)
a.
log_model <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(log_model)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
##     data = Default)
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
b.
set.seed(16)
train_index <- sample(1:nrow(Default), nrow(Default) / 2)
train_data <- Default[train_index, ]
validation_data <- Default[-train_index, ]
logt_model <- glm(default ~ income + balance, data = train_data, family = binomial)
probabilities <- predict(logt_model, newdata = validation_data, type = "response")
predicted_default <- ifelse(probabilities > 0.5, "Yes", "No")
actual_default <- validation_data$default
error_rate <- mean(predicted_default != actual_default)
print(paste("Validation Set Error Rate:", round(error_rate, 4)))
## [1] "Validation Set Error Rate: 0.0262"
c. Across the three splits below, the validation error rate stays close to 2.6%, ranging only from 2.56% to 2.72% depending on the particular random split. That is a low error rate, suggesting the model predicts default well from income and balance, though it is worth judging it against the base rate of defaulters, checked just after the output below.
compute_validation_error <- function(seed_value) {
  set.seed(seed_value)
  train_index1 <- sample(1:nrow(Default), nrow(Default) / 2)
  train_data1 <- Default[train_index1, ]
  validation_data1 <- Default[-train_index1, ]
  model <- glm(default ~ income + balance, data = train_data1, family = binomial)
  probs <- predict(model, newdata = validation_data1, type = "response")
  preds <- ifelse(probs > 0.5, "Yes", "No")
  actual <- validation_data1$default
  error_rate1 <- mean(preds != actual)
  return(error_rate1)
}
error1 <- compute_validation_error(4)
error2 <- compute_validation_error(6)
error3 <- compute_validation_error(8)
cat("Validation Error Rates:\n")
## Validation Error Rates:
cat("Split 1 (Seed 1):", round(error1, 4), "\n")
## Split 1 (Seed 1): 0.0256
cat("Split 2 (Seed 2):", round(error2, 4), "\n")
## Split 2 (Seed 2): 0.0266
cat("Split 3 (Seed 3):", round(error3, 4), "\n")
## Split 3 (Seed 3): 0.0272
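For context on how low 2.6% really is (an added check, not part of the original output): only a small share of the customers in Default actually default, so even a trivial classifier that always predicts "No" achieves a low error rate.
# Base rate of default; always predicting "No" would be wrong this often
mean(Default$default == "Yes")
This base rate is about 3.3%, so the fitted model improves on the trivial classifier, though modestly.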
d. Including a dummy variable for student does not reduce the validation error rate: with seed 4, the error rate is 0.0262 with student included, no better than the 0.0256 obtained on the same split without it.
set.seed(4)
train_index2 <- sample(1:nrow(Default), nrow(Default) / 2)
train_data2 <- Default[train_index2, ]
validation_data2 <- Default[-train_index2, ]
model_with_student <- glm(default ~ income + balance + student,
                          data = train_data2,
                          family = binomial)
probs1 <- predict(model_with_student, newdata = validation_data2, type = "response")
preds1 <- ifelse(probs1 > 0.5, "Yes", "No")
actual2 <- validation_data2$default
error_rate2 <- mean(preds1 != actual2)
cat("Validation Set Error Rate (with student variable):", round(error_rate2, 4), "\n")
## Validation Set Error Rate (with student variable): 0.0262
6.
a.
logit_model <- glm(default ~ income + balance, data = Default, family = binomial)
summary(logit_model)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
##     data = Default)
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
b.
get_coefficients <- function(data, index) {
  subset_data <- data[index, ]
  model <- glm(default ~ income + balance, data = subset_data, family = binomial)
  return(coef(model)[c("income", "balance")])
}
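As a quick sanity check (not in the original), passing the full row index to the function should reproduce the income and balance coefficients from the model fit in part a.
# Selecting every row exactly once matches the full-data fit above
get_coefficients(Default, 1:nrow(Default))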
c.
set.seed(1)
n_boot <- 1000
n <- nrow(Default)
boot_results <- matrix(NA, nrow = n_boot, ncol = 2)
for (i in 1:n_boot) {
  sample_index <- sample(1:n, n, replace = TRUE)
  boot_results[i, ] <- get_coefficients(Default, sample_index)
}
se_income <- sd(boot_results[, 1])
se_balance <- sd(boot_results[, 2])
cat("Estimated SE for income:", round(se_income, 6), "\n")
## Estimated SE for income: 5e-06
cat("Estimated SE for balance:", round(se_balance, 6), "\n")
## Estimated SE for balance: 0.000232
d. The bootstrap standard errors, about 0.000005 for income and 0.000232 for balance, are very close to the formula-based standard errors that glm() reported in part a (4.985e-06 and 2.274e-04). This agreement indicates that the standard model assumptions are reasonable here. Both values are also small relative to the coefficient estimates, so the estimated effects of income and balance on default are stable and reliable.
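The same standard errors could also be estimated with the boot package, whose boot() function expects a statistic with exactly the (data, index) signature that get_coefficients() already has. A minimal sketch, assuming the function from part b is in scope:
library(boot)
set.seed(1)
# R = 1000 bootstrap replications; printing the result reports the
# bootstrap standard error of each coefficient
boot(data = Default, statistic = get_coefficients, R = 1000)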
9.
data(Boston)
a.
mean(Boston$medv)
## [1] 22.53281
b. The standard error of the mean measures how much the average home value would vary from one random sample of Boston neighborhoods to another. Here it is 0.4089; since medv is recorded in $1000s, means from different samples would typically fall within about $410 of the mean computed from the current sample.
n_obs <- nrow(Boston)
sd_medv <- sd(Boston$medv)
se_mean_medv <- sd_medv / sqrt(n_obs)
cat("Standard Error of the Mean (medv):", round(se_mean_medv, 4), "\n")
## Standard Error of the Mean (medv): 0.4089
c. The two estimates are very close: 0.4089 from the formula and 0.3975 from the bootstrap, so the estimate of how precisely the average home value is measured can be trusted. The small difference is expected because the bootstrap is based on simulation, while the formula s/sqrt(n) is an analytic approximation.
set.seed(1)
n_boot <- 1000
boot_means <- numeric(n_boot)
for (i in 1:n_boot) {
  sample_index1 <- sample(1:nrow(Boston), nrow(Boston), replace = TRUE)
  boot_sample <- Boston[sample_index1, ]
  boot_means[i] <- mean(boot_sample$medv)
}
boot_se <- sd(boot_means)
cat("Bootstrap Estimate of Standard Error of the Mean (medv):", round(boot_se, 4), "\n")
## Bootstrap Estimate of Standard Error of the Mean (medv): 0.3975
d.
ci_lower <- quantile(boot_means, 0.025)
ci_upper <- quantile(boot_means, 0.975)
cat("Bootstrap 95% Confidence Interval for mean of medv: (",
round(ci_lower, 4), ",", round(ci_upper, 4), ")\n")
## Bootstrap 95% Confidence Interval for mean of medv: ( 21.7721 , 23.3036 )
mean_medv <- mean(Boston$medv)
sd_medv <- sd(Boston$medv)
n_obs <- nrow(Boston)
se_medv <- sd_medv / sqrt(n_obs)
ci_lower_se <- mean_medv - 2 * se_medv
ci_upper_se <- mean_medv + 2 * se_medv
cat("±2 SE Confidence Interval for mean of medv: (",
    round(ci_lower_se, 4), ",", round(ci_upper_se, 4), ")\n")
## ±2 SE Confidence Interval for mean of medv: ( 21.7151 , 23.3505 )
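As one more comparison (not required by the exercise), the t-based interval from t.test() should agree closely with the ±2 SE interval above, since with n = 506 the relevant t quantile is close to 2:
# t-based 95% confidence interval for the mean of medv
t.test(Boston$medv)$conf.int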
e.
median(Boston$medv)
## [1] 21.2
f. The bootstrap standard error of the median is 0.3693, meaning the sample median would typically vary by about $369 (medv is in $1000s) from one sample to another. That is small relative to the median of $21,200, so the estimate of the typical home value is fairly stable and reliable.
set.seed(1)
n_boot <- 1000
boot_medians <- numeric(n_boot)
for (i in 1:n_boot) {
  sample_index1 <- sample(1:nrow(Boston), nrow(Boston), replace = TRUE)
  boot_sample <- Boston[sample_index1, ]
  boot_medians[i] <- median(boot_sample$medv)
}
se_median <- sd(boot_medians)
cat("Bootstrap Estimate of Standard Error of the Median (medv):", round(se_median, 4), "\n")
## Bootstrap Estimate of Standard Error of the Median (medv): 0.3693
g.
tenth_percentile <- quantile(Boston$medv, 0.10)
cat("Estimated 10th Percentile of medv:", tenth_percentile, "\n")
## Estimated 10th Percentile of medv: 12.75
h. The standard error of the 10th percentile is 0.4794, about $479 since medv is in $1000s. Relative to the estimate of 12.75 ($12,750), that is small, so the estimated 10th percentile is fairly stable and should be close to the true value for the full population of Boston neighborhoods.
set.seed(1)
n_boot <- 1000
boot_percentiles <- numeric(n_boot)
for (i in 1:n_boot) {
  sample_index1 <- sample(1:nrow(Boston), nrow(Boston), replace = TRUE)
  boot_sample <- Boston[sample_index1, ]
  boot_percentiles[i] <- quantile(boot_sample$medv, 0.10)
}
se_percentile <- sd(boot_percentiles)
cat("Bootstrap Estimate of Standard Error of the 10th Percentile (medv):", round(se_percentile, 4), "\n")
## Bootstrap Estimate of Standard Error of the 10th Percentile (medv): 0.4794
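Mirroring part d, a 95% percentile interval for the 10th percentile can be read directly off the same bootstrap distribution (an added check, not part of the original answer):
# 95% bootstrap percentile interval for the 10th percentile of medv
quantile(boot_percentiles, c(0.025, 0.975))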