library(ISLR)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(boot)

set.seed(1)  # Set seed for reproducibility
data(Default)

5 (a) Fit logistic regression model using income and balance

I began by fitting a logistic regression model to predict whether an individual defaults or not, using their income and balance as predictors. I used the glm() function in R with the binomial family, which is appropriate for logistic regression. This model allowed me to estimate the probability of default based on the two variables. I also checked the summary of the model to understand the significance and effect of each predictor.

model1 <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(model1)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

(b) Estimate the test error using the validation set approach

To estimate the model’s performance on unseen data, I used the validation set approach. Here’s how I did it:

set.seed(1)
train_index <- sample(1:nrow(Default), nrow(Default) / 2)
train_data <- Default[train_index, ]
validation_data <- Default[-train_index, ]

# Fit model on training data
model_val <- glm(default ~ income + balance, data = train_data, family = "binomial")

# Predict on validation set
pred_probs <- predict(model_val, newdata = validation_data, type = "response")
pred_class <- ifelse(pred_probs > 0.5, "Yes", "No")

# Compute misclassification error
mean(pred_class != validation_data$default)
## [1] 0.0254

I randomly split the dataset into two halves — a training set and a validation set. I fit the logistic regression model on the training set using income and balance as predictors. Using the model, I predicted the probability of default for each observation in the validation set. I classified individuals as “Yes” (default) if their predicted probability was greater than 0.5, and “No” otherwise. Finally, I computed the validation error rate as the fraction of incorrect classifications in the validation set. This gave me an estimate of how well the model might perform on new data.

(c) Repeat the process three times with different splits

set.seed(2)
train_index2 <- sample(1:nrow(Default), nrow(Default) / 2)
model2 <- glm(default ~ income + balance, data = Default[train_index2, ], family = "binomial")
pred2 <- predict(model2, newdata = Default[-train_index2, ], type = "response")
pred_class2 <- ifelse(pred2 > 0.5, "Yes", "No")
error2 <- mean(pred_class2 != Default[-train_index2, "default"])

set.seed(3)
train_index3 <- sample(1:nrow(Default), nrow(Default) / 2)
model3 <- glm(default ~ income + balance, data = Default[train_index3, ], family = "binomial")
pred3 <- predict(model3, newdata = Default[-train_index3, ], type = "response")
pred_class3 <- ifelse(pred3 > 0.5, "Yes", "No")
error3 <- mean(pred_class3 != Default[-train_index3, "default"])

set.seed(4)
train_index4 <- sample(1:nrow(Default), nrow(Default) / 2)
model4 <- glm(default ~ income + balance, data = Default[train_index4, ], family = "binomial")
pred4 <- predict(model4, newdata = Default[-train_index4, ], type = "response")
pred_class4 <- ifelse(pred4 > 0.5, "Yes", "No")
error4 <- mean(pred_class4 != Default[-train_index4, "default"])

# Print all 3 error rates
c(error2, error3, error4)
## [1] 0.0238 0.0264 0.0256

To assess the stability and robustness of my model’s performance, I repeated the validation set approach three times, each time with a different random split of the dataset into training and validation sets (using different seeds). After fitting the model and computing the validation error for each split, I noticed that:

The error rates were fairly consistent, with only minor variations. This indicates that the model’s performance is not highly sensitive to a specific split of the data, which is a good sign. By repeating the process, I ensured that my test error estimate is not biased by a single random partition.
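
The three repetitions above use nearly identical code; a compact alternative is to wrap the calculation in a small helper and call it once per seed. This is a sketch of my own, not part of the original code, and validation_error is a name I introduce for illustration.

validation_error <- function(seed) {
  set.seed(seed)
  idx <- sample(1:nrow(Default), nrow(Default) / 2)
  fit <- glm(default ~ income + balance, data = Default[idx, ], family = "binomial")
  probs <- predict(fit, newdata = Default[-idx, ], type = "response")
  preds <- ifelse(probs > 0.5, "Yes", "No")
  mean(preds != Default[-idx, "default"])  # validation misclassification rate
}

# Same seeds as above, so this should reproduce the three error rates
sapply(c(2, 3, 4), validation_error)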

(d) Add a dummy variable for student and check if error improves

set.seed(5)
train_index5 <- sample(1:nrow(Default), nrow(Default) / 2)
train5 <- Default[train_index5, ]
valid5 <- Default[-train_index5, ]

# Fit logistic regression with student variable
model5 <- glm(default ~ income + balance + student, data = train5, family = "binomial")

# Predict and classify
pred5 <- predict(model5, newdata = valid5, type = "response")
pred_class5 <- ifelse(pred5 > 0.5, "Yes", "No")
error5 <- mean(pred_class5 != valid5$default)

error5
## [1] 0.029

Next, I extended my model by including a third predictor — a dummy variable for student status. I wanted to see if being a student has any additional predictive power when combined with income and balance.

I followed the same validation set approach as before:

  1. Randomly split the data into training and validation sets,
  2. Fit the logistic regression model using income, balance, and student as predictors,
  3. Predicted the default probabilities and computed the validation set error.

After comparing the new error rate with the earlier models (which did not include the student variable), I observed that:

The error rate did not decrease (on this split it was in fact slightly higher), meaning the student variable did not improve predictive performance. Part of that difference also reflects using a different random split from parts (b) and (c), so a like-for-like comparison on the same split is more informative (sketched below). Overall, this suggests that most of the predictive power comes from balance, and possibly income, while student status contributes little once those are accounted for.
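
As that like-for-like check (a sketch of my own, not in the original analysis), the two-predictor model can be refit on the training set used for model5 and evaluated on the same validation set; model5_base and pred5_base are names I introduce for illustration.

# Refit the income + balance model on the same split as model5
model5_base <- glm(default ~ income + balance, data = train5, family = "binomial")
pred5_base <- ifelse(predict(model5_base, newdata = valid5, type = "response") > 0.5, "Yes", "No")

# Validation errors without and with the student dummy, on the same split
c(without_student = mean(pred5_base != valid5$default), with_student = error5)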

6 (a) Use glm() and summary() to get standard errors

set.seed(1)

# Fit logistic regression model
model <- glm(default ~ income + balance, data = Default, family = "binomial")

# View coefficient estimates and standard errors
summary(model)$coefficients
##                  Estimate   Std. Error    z value      Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income       2.080898e-05 4.985167e-06   4.174178  2.990638e-05
## balance      5.647103e-03 2.273731e-04  24.836280 3.638120e-136

I started by fitting a logistic regression model using the glm() function, with default as the response variable and income and balance as predictors. I then used the summary() function to extract the estimated standard errors of the coefficients for income and balance.
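
If only the standard errors are needed, they can be pulled directly from the coefficient table (a small illustrative snippet, not part of the original output):

# Standard errors of the income and balance coefficients only
summary(model)$coefficients[c("income", "balance"), "Std. Error"]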

(b) Write a boot.fn() function

boot.fn <- function(data, index) {
  fit <- glm(default ~ income + balance, data = data, family = "binomial", subset = index)
  return(coef(fit)[c("income", "balance")])
}

Next, I wrote a function called boot.fn() that takes in two arguments: the Default dataset and an index vector indicating which rows to include in the sample. The function fits the logistic regression model using the subset of observations and returns the estimated coefficients for income and balance.
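
As a quick sanity check (illustrative only, not part of the original code), boot.fn() can be called directly with a bootstrap-style index, i.e. a resample of the row numbers drawn with replacement:

# One bootstrap resample of the rows, then the corresponding coefficient estimates
set.seed(1)
boot.fn(Default, sample(nrow(Default), replace = TRUE))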

(c) Use boot() to estimate standard errors

I then used the boot() function from the boot package to perform bootstrapping. I set the number of bootstrap replications to 1000 to get a stable estimate of the standard errors.

# Perform bootstrap
boot_results <- boot(Default, statistic = boot.fn, R = 1000)

# View bootstrap standard errors
boot_results
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original       bias     std. error
## t1* 2.080898e-05 1.680317e-07 4.866284e-06
## t2* 5.647103e-03 1.855765e-05 2.298949e-04

The bootstrap standard errors are the standard deviations of the bootstrap estimates stored in boot_results$t; they also appear in the std. error column when boot_results is printed.

To extract just the standard errors:

apply(boot_results$t, 2, sd)
## [1] 4.866284e-06 2.298949e-04

(d) Comment on glm vs bootstrap standard errors

Finally, I compared the standard errors obtained from the summary() output in part (a) with those obtained via bootstrapping in part (c); a side-by-side view is sketched after the list below. I found that:

  1. The bootstrap standard errors were very close to those from the glm() function.
  2. This suggests that the normal approximation used by glm() is fairly accurate in this case.
  3. Bootstrapping provides a data-driven estimate that does not rely on theoretical assumptions, so it can be useful for validation or when the model assumptions are questionable.
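
The side-by-side view below makes point 1 concrete (an illustrative sketch, not part of the original output):

# glm() standard errors next to the bootstrap standard errors
cbind(glm_se = summary(model)$coefficients[c("income", "balance"), "Std. Error"],
      bootstrap_se = apply(boot_results$t, 2, sd))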

7 (a) Fit a logistic regression model that predicts Direction using Lag1 and Lag2

I started by fitting a logistic regression model using the Weekly dataset. My goal was to predict the direction of the stock market using Lag1 and Lag2 as predictors.

data("Weekly")

# Fit logistic regression model
model_full <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = "binomial")
summary(model_full)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = "binomial", data = Weekly)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.22122    0.06147   3.599 0.000319 ***
## Lag1        -0.03872    0.02622  -1.477 0.139672    
## Lag2         0.06025    0.02655   2.270 0.023232 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1488.2  on 1086  degrees of freedom
## AIC: 1494.2
## 
## Number of Fisher Scoring iterations: 4

(b) Fit the model using all but the first observation

To begin simulating LOOCV manually, I excluded the first observation and fit the model using the remaining data.

model_exclude1 <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-1, ], family = "binomial")

(c) Predict the direction for the first observation

Next, I used the model trained in part (b) to predict the direction of the first observation. I classified it as “Up” if the predicted probability was greater than 0.5.

# Predict for first observation
prob1 <- predict(model_exclude1, newdata = Weekly[1, ], type = "response")
pred1 <- ifelse(prob1 > 0.5, "Up", "Down")

# Check if prediction matches actual
actual1 <- Weekly$Direction[1]
correct1 <- pred1 == actual1
correct1
## [1] FALSE

(d) Perform LOOCV using a for loop

I wrote a for loop to implement LOOCV manually. For each observation i, I:

  1. Fit a logistic model excluding the i-th observation,
  2. Predicted the direction for the i-th observation,
  3. Recorded whether the prediction was correct or not.

n <- nrow(Weekly)
errors <- rep(0, n)

for (i in 1:n) {
  # Fit model excluding the i-th observation
  model_loocv <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = "binomial")
  
  # Predict for i-th observation
  prob_i <- predict(model_loocv, newdata = Weekly[i, ], type = "response")
  pred_i <- ifelse(prob_i > 0.5, "Up", "Down")
  
  # Check if prediction was incorrect
  actual_i <- Weekly$Direction[i]
  errors[i] <- ifelse(pred_i != actual_i, 1, 0)
}

(e) Compute LOOCV estimate of test error

To compute the LOOCV estimate of test error, I simply averaged the misclassification indicators from the loop.

loocv_error <- mean(errors)
loocv_error
## [1] 0.4499541

Conclusion: The LOOCV error estimate tells me how well my logistic regression model (using Lag1 and Lag2) is expected to perform on unseen data; it is a less biased estimate of the test error than the training error. Here the error is about 45%, which is relatively high and suggests that the model may not generalize well, most likely because Lag1 and Lag2 are weak predictors of weekly market direction.
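
For context (an illustrative sketch of my own, not part of the original analysis), the LOOCV error can be compared with the error of a naive rule that always predicts the more common class, "Up"; that baseline error is simply the proportion of "Down" weeks.

# Error rate of a naive classifier that always predicts "Up"
mean(Weekly$Direction != "Up")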

9 We will now consider the Boston housing data set, from the ISLR2 library.

library(ISLR2)
## 
## Attaching package: 'ISLR2'
## The following objects are masked from 'package:ISLR':
## 
##     Auto, Credit
library(boot)    # For bootstrap functions
data("Boston")

(a) Estimate the population mean of medv (μ̂)

To estimate the population mean of the median house value (medv), I simply calculated the sample mean:

mu_hat <- mean(Boston$medv)
mu_hat
## [1] 22.53281

(b) Estimate the standard error of μ̂

The standard error of the sample mean is the sample standard deviation divided by the square root of the number of observations, so I computed it directly:

se_formula <- sd(Boston$medv) / sqrt(nrow(Boston))
se_formula
## [1] 0.4088611

This means that if I were to repeatedly take random samples from the population and compute the sample mean of medv each time, the standard deviation of those sample means would be around 0.41.

(c) Estimate the standard error of μ̂ using the bootstrap

I used the bootstrap method to get a data-driven estimate of the standard error:

# Define bootstrap function to return the mean
boot.fn <- function(data, index) {
  return(mean(data[index]))
}

set.seed(1)
boot_mean <- boot(Boston$medv, statistic = boot.fn, R = 1000)
boot_mean
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 0.007650791   0.4106622

# Bootstrap SE estimate
se_bootstrap <- sd(boot_mean$t)
se_bootstrap
## [1] 0.4106622

Comparison:

Method               Standard Error
Formula (Part b)     0.409
Bootstrap (Part c)   0.4107

Interpretation:

The bootstrap standard error is very close to the one calculated using the analytical formula, differing by less than 0.002.

This small difference shows that:

  1. The bootstrap method confirms the reliability of the formula-based estimate.
  2. It gives me confidence that my sample mean is a stable and accurate estimate of the population mean.
  3. The bootstrap also serves as a good alternative when the theoretical standard error is difficult to compute or when model assumptions might be violated.

(d) 95% Confidence Interval for the Mean of medv Using Bootstrap SE

# Values from part (c)
mu_hat <- mean(Boston$medv)            # 22.53281
se_boot <- 0.4106622                   # From bootstrap

# 95% CI using normal approximation
ci_bootstrap <- c(mu_hat - 2 * se_boot, mu_hat + 2 * se_boot)
ci_bootstrap
## [1] 21.71148 23.35413

Using the bootstrap standard error I obtained in part (c), approximately 0.4107, together with the sample mean of medv, 22.53281, I constructed an approximate 95% confidence interval as the mean plus or minus two standard errors.

Compare with t.test() Output

Now I compared this with the 95% confidence interval produced by the built-in t.test() function:

t.test(Boston$medv)$conf.int
## [1] 21.72953 23.33608
## attr(,"conf.level")
## [1] 0.95

Interpretation: The bootstrap confidence interval [21.71, 23.35] and the t-test interval [21.73, 23.34] are extremely close.

This tells me that:

The bootstrap method and the t-distribution–based method both lead to very similar conclusions.

The slight differences are due to how the intervals are constructed (bootstrap uses empirical resampling; t.test assumes normality and uses the t-distribution).

Both indicate with 95% confidence that the true mean of medv lies between roughly 21.7 and 23.4.
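
Stacking the two intervals makes the comparison explicit (an illustrative sketch using the objects defined above):

# Bootstrap (normal-approximation) interval versus the t-based interval
rbind(bootstrap = ci_bootstrap, t_test = t.test(Boston$medv)$conf.int)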

(e) Estimate of the Population Median of medv (μ̂ₘₑd)

To estimate the population median of the medv variable (which represents the median value of owner-occupied homes in thousands of dollars), I computed the sample median. This gives me a point estimate of the population median based on the data.

mu_med <- median(Boston$medv)
mu_med
## [1] 21.2

(f) Estimate the Standard Error of μ̂ₘₑd Using Bootstrap

Since there’s no straightforward formula for computing the standard error of the median, I used the bootstrap method to estimate it. I resampled the medv values from the Boston dataset 1000 times, each time computing the median, and then calculated the standard deviation of those bootstrap medians.

# Define a bootstrap function for the median
boot.fn.median <- function(data, index) {
  return(median(data[index]))
}

# Set seed for reproducibility
set.seed(123)

# Perform bootstrap with 1000 resamples
boot_median <- boot(Boston$medv, statistic = boot.fn.median, R = 1000)

# Bootstrap standard error estimate
se_median <- sd(boot_median$t)
se_median
## [1] 0.3676453

I used the bootstrap to estimate the standard error of the median, and it came out to about 0.37. This means the median is quite stable across different samples; its standard error is in fact slightly smaller than the standard error of the mean. Since the distribution of medv is somewhat skewed, the median is a better measure of center in this case, and the bootstrap gave me a way to measure its uncertainty without needing a formula.

(g) Estimate the 10th Percentile of medv (μ̂₀.₁)

To estimate the 10th percentile of medv, I used the quantile() function in R. This tells me the value below which the lowest 10% of home values fall.

mu_0.1 <- quantile(Boston$medv, 0.1)
mu_0.1
##   10% 
## 12.75

So, the estimate for the 10th percentile (μ̂₀.₁) of medv is 12.75. This means that 10% of the Boston census tracts have a median home value below $12,750.

(h) Bootstrap Standard Error of μ̂₀.₁ (10th Percentile)

Since there’s no simple formula for the standard error of a percentile, I used the bootstrap method to estimate it. I resampled the medv values 1000 times, calculated the 10th percentile each time, and then found the standard deviation of those estimates.

# Define a function to compute 10th percentile
boot.fn.percentile <- function(data, index) {
  return(quantile(data[index], 0.1))
}

# Perform bootstrap
set.seed(123)
boot_percentile <- boot(Boston$medv, statistic = boot.fn.percentile, R = 1000)

# Bootstrap standard error
se_0.1 <- sd(boot_percentile$t)
se_0.1
## [1] 0.527868

The bootstrap standard error of the 10th percentile is about 0.53. This tells me the estimate of the 10th percentile (12.75) is fairly stable, but it varies a bit more than the mean or the median. That is expected, because quantiles toward the tails of the distribution tend to have more sampling variability.