Assignment 4

Author

Jonathan McCanlas

Problem 3

Part A

In k-fold cross-validation, the dataset is randomly split into k roughly equal parts (called “folds”). The model is trained on k−1 folds and tested on the remaining fold. This process is repeated k times, with each fold used exactly once as the test set. Finally, the average of the k test errors is computed to estimate the model’s test performance.
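
As a concrete illustration, here is a minimal sketch of 5-fold cross-validation for a logistic regression on the Default data from ISLR; the choice of k = 5, the random fold assignment, and the misclassification error metric are illustrative assumptions, not the only options.

library(ISLR)
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(Default)))  # random fold label for each row
cv_errors <- rep(NA, k)
for (i in 1:k) {
  fit <- glm(default ~ income + balance,
             data = Default[folds != i, ], family = binomial)   # train on k-1 folds
  probs <- predict(fit, newdata = Default[folds == i, ], type = "response")
  preds <- ifelse(probs > 0.5, "Yes", "No")
  cv_errors[i] <- mean(preds != Default$default[folds == i])    # test on the held-out fold
}
mean(cv_errors)  # average of the k test errors estimates the test error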

Part B

Section 1

k-fold CV vs. the validation set approach

Advantages: Uses the data more efficiently than a single validation split (every observation is used for training and, exactly once, for testing).

Gives a more reliable and stable estimate of the test error than a single random split, since it averages over k splits.

Disadvantages: More computationally intensive (k models are trained instead of 1).

Section 2

k-fold CV vs. Leave-One-Out Cross-Validation (LOOCV)

Advantages: Faster than LOOCV, which fits n models (one per observation) instead of k.

Lower variance than LOOCV, whose n fitted models are trained on nearly identical data sets and therefore give highly correlated error estimates.

Disadvantages: Slightly more bias than LOOCV. LOOCV trains each model on n−1 observations, so its error estimate is nearly unbiased, while each k-fold model trains on only about (k−1)/k of the data. A quick comparison of the two with cv.glm() is sketched below.
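
To make the comparison concrete, here is a rough sketch using cv.glm() from the boot package on the Default data; the misclassification cost function follows the one in the boot documentation, and the exact numbers will depend on the seed.

library(ISLR)
library(boot)
set.seed(1)
fit <- glm(default ~ income + balance, data = Default, family = binomial)
cost <- function(r, pi) mean(abs(r - pi) > 0.5)      # misclassification cost
cv.glm(Default, fit, cost = cost, K = 10)$delta[1]   # 10-fold CV: only 10 refits
# cv.glm(Default, fit, cost = cost)$delta[1]         # LOOCV: 10,000 refits (much slower)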

Problem 5

Part A

library(ISLR)
set.seed(1)
log_model <- glm(default ~ income + balance, data = Default, family = binomial)
summary(log_model)

Call:
glm(formula = default ~ income + balance, family = binomial, 
    data = Default)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1579.0  on 9997  degrees of freedom
AIC: 1585

Number of Fisher Scoring iterations: 8

Part B

set.seed(1)

# Step (i): Split data into training and validation sets
train_index <- sample(1:nrow(Default), nrow(Default) / 2)
train_data <- Default[train_index, ]
validation_data <- Default[-train_index, ]

# Step (ii): Fit logistic regression on training data
log_model <- glm(default ~ income + balance, data = train_data, family = binomial)

# Step (iii): Predict default on validation set
pred_probs <- predict(log_model, newdata = validation_data, type = "response")
pred_class <- ifelse(pred_probs > 0.5, "Yes", "No")

# Step (iv): Compute validation set error
actual_class <- validation_data$default
mean(pred_class != actual_class)
[1] 0.0254

Part C

# Convert 'default' to a 0/1 indicator (not actually used below; kept only for reference)
Default$default_num <- ifelse(Default$default == "Yes", 1, 0)

# Function to run logistic regression and compute test error
run_validation <- function(seed) {
  set.seed(seed)
  train_index <- sample(1:nrow(Default), nrow(Default) / 2)
  
  train_data <- Default[train_index, ]
  test_data <- Default[-train_index, ]
  
  # Fit logistic regression on training set
  model <- glm(default ~ income + balance, data = train_data, family = "binomial")
  
  # Predict on validation set
  probs <- predict(model, newdata = test_data, type = "response")
  preds <- ifelse(probs > 0.5, "Yes", "No")
  
  # Actual values
  actual <- test_data$default
  
  # Misclassification rate
  mean(preds != actual)
}

# Run three times with different seeds
error1 <- run_validation(1)
error2 <- run_validation(2)
error3 <- run_validation(3)

# Print results
cat("Test Errors:\n")
Test Errors:
cat("Run 1:", error1, "\n")
Run 1: 0.0254 
cat("Run 2:", error2, "\n")
Run 2: 0.0238 
cat("Run 3:", error3, "\n")
Run 3: 0.0264 
cat("Average Error:", mean(c(error1, error2, error3)), "\n")
Average Error: 0.0252 

Multiple tests

I ran the validation set approach three times, each time using a different random split of the data. The test error rates were:

Run 1: 2.54%

Run 2: 2.38%

Run 3: 2.64%

The average error was about 2.52%.

All three runs gave similar results, which suggests the error estimate is not very sensitive to the particular random split and that the model predicts default fairly accurately overall. A quick check over a few more seeds is sketched below.
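
To push on this a bit, the run_validation() helper defined above can be repeated over more seeds; this sketch simply summarizes the spread of the error estimates (the exact values depend on the seeds chosen).

more_errors <- sapply(1:10, run_validation)  # validation error for seeds 1 through 10
mean(more_errors)                            # average error across splits
sd(more_errors)                              # how much the estimate varies with the split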

Part D

set.seed(1)

# Create training and validation sets
train_index <- sample(1:nrow(Default), nrow(Default) / 2)
train_data <- Default[train_index, ]
valid_data <- Default[-train_index, ]

# Fit logistic regression model with student dummy variable
glm_student <- glm(default ~ income + balance + student, data = train_data, family = "binomial")

# Predict probabilities on validation set
pred_probs <- predict(glm_student, newdata = valid_data, type = "response")

# Classify using 0.5 threshold
pred_class <- ifelse(pred_probs > 0.5, "Yes", "No")

# Calculate test error
actual_class <- valid_data$default
test_error <- mean(pred_class != actual_class)
test_error
[1] 0.026

Adding a variable

Adding the student variable to the model does not meaningfully improve accuracy. The test error went from about 2.5% to 2.6%, which is a very small change. This suggests that being a student adds little when predicting default once balance and income are already in the model.

Problem 6

Part A

# Set seed for reproducibility
set.seed(1)

# Load the ISLR library (if you haven't already)
library(ISLR)

# Fit logistic regression model
logit_model <- glm(default ~ income + balance, data = Default, family = binomial)

# View the model summary to get standard errors
summary(logit_model)

Call:
glm(formula = default ~ income + balance, family = binomial, 
    data = Default)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1579.0  on 9997  degrees of freedom
AIC: 1585

Number of Fisher Scoring iterations: 8

Part B

library(boot)

boot.fn <- function(data, index) {
  fit <- glm(default ~ income + balance, data = data, family = binomial, subset = index)
  return(coef(fit)[c("income", "balance")])
}

Part C

# Perform bootstrap with 1000 replications
set.seed(1)  # for reproducibility
boot_results <- boot(data = Default, statistic = boot.fn, R = 1000)

# View results
boot_results

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Default, statistic = boot.fn, R = 1000)


Bootstrap Statistics :
        original       bias     std. error
t1* 2.080898e-05 1.680317e-07 4.866284e-06
t2* 5.647103e-03 1.855765e-05 2.298949e-04

The standard errors from the bootstrap and the ones from the glm() model are very close; for both income and balance the differences are tiny. This tells me the two approaches agree, so the formula-based standard errors from glm() look trustworthy here, and the bootstrap did not reveal anything surprising. A side-by-side comparison is sketched below.
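
For the side-by-side look, the formula-based and bootstrap standard errors can be pulled out directly; this sketch assumes the logit_model and boot_results objects created above.

glm_se  <- summary(logit_model)$coefficients[c("income", "balance"), "Std. Error"]
boot_se <- apply(boot_results$t, 2, sd)       # SD of the 1,000 bootstrap estimates
rbind(glm = glm_se, bootstrap = boot_se)      # formula-based vs. bootstrap SEs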

Problem 9

Part A

library(ISLR2)

Attaching package: 'ISLR2'
The following object is masked _by_ '.GlobalEnv':

    Default
The following objects are masked from 'package:ISLR':

    Auto, Credit
data("Boston")

mu_hat <- mean(Boston$medv)
mu_hat
[1] 22.53281

Part B

# Load data
library(ISLR2)
data("Boston")

# Calculate standard deviation and number of observations
s <- sd(Boston$medv)
n <- length(Boston$medv)

# Standard Error
se_mu_hat <- s / sqrt(n)
se_mu_hat
[1] 0.4088611

The standard error came out to about 0.4089, which means the sample mean of medv would typically vary by only about 0.41 (roughly $409, since medv is in thousands of dollars) across repeated samples. Because this error is small, my estimate of the population mean should be fairly precise.

Part C

# Define bootstrap function
boot.fn <- function(data, index) {
  mean(data[index])  # compute mean of medv for bootstrap sample
}

# Apply bootstrap
set.seed(1)  # for reproducibility
boot_result <- boot(Boston$medv, boot.fn, R = 1000)

# View result
boot_result

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Boston$medv, statistic = boot.fn, R = 1000)


Bootstrap Statistics :
    original      bias    std. error
t1* 22.53281 0.007650791   0.4106622

I used the bootstrap method to estimate the standard error of the sample mean of medv, and the result was approximately 0.4107. This is very close to the value I got earlier using the formula (about 0.4089), which shows that both methods agree. The small bias also suggests the bootstrap is giving a stable estimate.

Part D

t.test(Boston$medv)

    One Sample t-test

data:  Boston$medv
t = 55.111, df = 505, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 21.72953 23.33608
sample estimates:
mean of x 
 22.53281 

I applied the t.test() function to the medv variable in the Boston dataset to compute a 95% confidence interval for the population mean of median home values. The resulting interval was approximately [21.73, 23.34], with a sample mean of 22.53.

I then compared this to the confidence interval derived from my bootstrap procedure, which yielded a nearly identical estimate. This close agreement between the two methods suggests that both the parametric (t-test) and non-parametric (bootstrap) approaches are providing a stable and reliable estimate for the population mean. Overall, the results reinforce the conclusion that the average median home value in this dataset is around 22.5.
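
For reference, here is one way the bootstrap interval mentioned above could be formed, assuming the mu_hat and boot_result objects from earlier; the mean plus or minus two standard errors rule is a rough approximation, and boot.ci() offers a percentile alternative.

se_boot <- sd(boot_result$t[, 1])              # bootstrap SE of the sample mean
c(mu_hat - 2 * se_boot, mu_hat + 2 * se_boot)  # rough 95% CI for the mean of medv
# boot.ci(boot_result, type = "perc")          # percentile bootstrap interval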

Part E

data("Boston")

median(Boston$medv)
[1] 21.2

Part F

# Bootstrap function to compute the median
boot.fn <- function(data, index) {
  return(median(data[index]))
}

# Run bootstrap with 1000 resamples
set.seed(1)
boot.out <- boot(data = Boston$medv, statistic = boot.fn, R = 1000)

# Print results
boot.out

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Boston$medv, statistic = boot.fn, R = 1000)


Bootstrap Statistics :
    original  bias    std. error
t1*     21.2 0.02295   0.3778075

Using the bootstrap method, I estimated the median home value (medv) to be $21,200, with a standard error of approximately 0.378 (about $378, since medv is recorded in thousands of dollars). This tells me that if I repeatedly sampled from the population and calculated the median each time, those medians would typically vary by about $378. The small standard error suggests the median estimate is quite stable, even though there is no simple closed-form formula for the SE of a median.
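
If an interval is wanted as well, an approximate percentile interval for the median can be read off the same boot.out object; this is just a sketch using boot.ci() from the boot package.

boot.ci(boot.out, type = "perc")   # approximate 95% percentile interval for the median of medv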

Part G

quantile(Boston$medv, 0.1)
  10% 
12.75 

This means that 10% of the Boston census tracts have median home values below $12,750.

Part H

boot.fn <- function(data, index) {
  return(quantile(data[index], 0.1))
}
set.seed(1)
boot_result <- boot(data = Boston$medv, statistic = boot.fn, R = 1000)
boot_result

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Boston$medv, statistic = boot.fn, R = 1000)


Bootstrap Statistics :
    original  bias    std. error
t1*    12.75  0.0339   0.4767526

We estimated that the 10th percentile of home values (medv) is about $12.75k, meaning 10% of the Boston census tracts have median home values below this amount.

Using the bootstrap, we also found the standard error of this estimate to be about 0.477, which tells us how much the estimated 10th percentile would vary across repeated samples from the population.

The small bias (0.034) indicates that the bootstrap distribution is centered close to the sample estimate, which adds confidence to the result.