library(ISLR2)
logistic_model <- glm(default ~ income + balance,
                      data = Default,
                      family = binomial)
summary(logistic_model)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Above, a logistic regression model is fit using income and balance to predict default. Both coefficients are positive and highly statistically significant, as evidenced by the very small p-values. Note that the small coefficient estimates reflect the scale of the predictors, which are measured in dollars, so a one-unit (one-dollar) change is tiny; they do not imply that the predictors have a weak effect.
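To make the effect sizes easier to read, we can move to the odds scale. The sketch below (not part of the original output) exponentiates each coefficient times a $1,000 change in the predictor:
# Multiplicative change in the odds of default per $1,000 increase
# in each predictor (the coefficients themselves are per dollar):
exp(1000 * coef(logistic_model)["balance"])
exp(1000 * coef(logistic_model)["income"])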
set.seed(1)
train_indices <- sample(1:nrow(Default), nrow(Default) / 2)
train_set <- Default[train_indices, ]
validation_set <- Default[-train_indices, ]
The above code splits the data in half: a training set and a validation set of equal size.
model <- glm(default ~ income + balance,
             data = train_set,
             family = binomial)
summary(model)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = train_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.194e+01 6.178e-01 -19.333 < 2e-16 ***
## income 3.262e-05 7.024e-06 4.644 3.41e-06 ***
## balance 5.689e-03 3.158e-04 18.014 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.8 on 4999 degrees of freedom
## Residual deviance: 803.3 on 4997 degrees of freedom
## AIC: 809.3
##
## Number of Fisher Scoring iterations: 8
The model fit on the training half produces coefficient estimates similar to those from the fit on the full data set.
posterior_probs <- predict(model, newdata = validation_set, type = "response")
# Classifying based on a threshold of 0.5:
predicted_default <- ifelse(posterior_probs > 0.5, "Yes", "No")
Here each validation observation is classified as "Yes" (default) when its posterior probability exceeds 0.5, and "No" otherwise.
actual_default <- validation_set$default
validation_error <- mean(predicted_default != actual_default)
print(validation_error)
## [1] 0.0254
The model misclassifies 2.54% of the validation-set observations, for an accuracy of 97.46%.
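For a finer-grained view, a confusion matrix (a quick sketch reusing the objects above) splits that 2.54% into false positives and false negatives:
# Rows are predicted classes, columns are actual classes
table(Predicted = predicted_default, Actual = actual_default)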
set.seed(123)
n_splits <- 3
error_rates <- numeric(n_splits)
for (i in 1:n_splits) {
  train_indices <- sample(1:nrow(Default), nrow(Default) / 2)
  train_set <- Default[train_indices, ]
  validation_set <- Default[-train_indices, ]
  model <- glm(default ~ balance + income, data = train_set, family = binomial)
  posterior_probs <- predict(model, newdata = validation_set, type = "response")
  predicted_default <- ifelse(posterior_probs > 0.5, "Yes", "No")
  error_rates[i] <- mean(predicted_default != validation_set$default)
  cat("Split", i, "error rate:", error_rates[i], "\n")
}
## Split 1 error rate: 0.0276
## Split 2 error rate: 0.0246
## Split 3 error rate: 0.0262
cat("Average error rate:", mean(error_rates), "\n")
## Average error rate: 0.02613333
Overall, the model's accuracy changes little across randomly chosen training/validation splits. Averaged over the three random splits, 2.61% of the validation observations are misclassified, for an accuracy of 97.39%.
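As a quick check on that stability claim (not in the original write-up), the standard deviation of the three error rates quantifies the split-to-split variability:
# A small spread across splits indicates a stable validation error
sd(error_rates)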
set.seed(456)
train_indices <- sample(1:nrow(Default), nrow(Default) / 2)
train_set <- Default[train_indices, ]
validation_set <- Default[-train_indices, ]
# Fitting the logistic regression model using income, balance, and student:
model_student <- glm(default ~ income + balance + student,
                     data = train_set, family = binomial)
posterior_probs <- predict(model_student, newdata = validation_set, type = "response")
predicted_default <- ifelse(posterior_probs > 0.5, "Yes", "No")
# Computing the validation set error:
validation_error <- mean(predicted_default != validation_set$default)
print(validation_error)
## [1] 0.0296
From the above output, including a dummy variable for student gives a validation error of 2.96%, a slight decrease in accuracy (to 97.04%). Hence there is no evidence that introducing the student dummy reduces the test error. Note, however, that this figure comes from a different random split (seed 456) than the earlier 2.54% (seed 1), so the comparison is only approximate; see the like-for-like check below.
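As a like-for-like check (a sketch), the code below fits the model without the student dummy on the identical seed-456 split and compares validation errors directly:
# Baseline model on the same split used for model_student
model_base <- glm(default ~ income + balance,
                  data = train_set, family = binomial)
probs_base <- predict(model_base, newdata = validation_set, type = "response")
pred_base <- ifelse(probs_base > 0.5, "Yes", "No")
mean(pred_base != validation_set$default)  # error without the student dummy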
set.seed(1)
# Fitting logistic regression model:
glm_fit <- glm(default ~ income + balance, data = Default, family = binomial)
summary(glm_fit)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
boot.fn <- function(data, index) {
  fit <- glm(default ~ income + balance, data = data[index, ], family = binomial)
  return(coef(fit)[2:3])
}
library(boot)
set.seed(1)
# Performing bootstrap with 1000 resamples
boot_results <- boot(data = Default, statistic = boot.fn, R = 1000)
boot_results
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.080898e-05 1.680317e-07 4.866284e-06
## t2* 5.647103e-03 1.855765e-05 2.298949e-04
boot_results$t0 # original estimates
## income balance
## 2.080898e-05 5.647103e-03
apply(boot_results$t, 2, sd) # bootstrap standard errors
## [1] 4.866284e-06 2.298949e-04
The standard errors reported by glm() for the income and balance coefficients closely match the bootstrap estimates. This agreement is strong evidence that the asymptotic (formula-based) standard errors from glm() are reliable in this instance. Because the bootstrap is nonparametric, it verifies these estimates without imposing rigid distributional assumptions.
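The same bootstrap output can also be turned into confidence intervals. As a sketch (not part of the original analysis), boot.ci() with the percentile method gives an interval for each coefficient; index selects the statistic:
boot.ci(boot_results, type = "perc", index = 1)  # income coefficient
boot.ci(boot_results, type = "perc", index = 2)  # balance coefficient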
library(ISLR)
##
## Attaching package: 'ISLR'
## The following objects are masked from 'package:ISLR2':
##
## Auto, Credit
data("Weekly")
head(Weekly)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 1990 0.816 1.572 -3.936 -0.229 -3.484 0.1549760 -0.270 Down
## 2 1990 -0.270 0.816 1.572 -3.936 -0.229 0.1485740 -2.576 Down
## 3 1990 -2.576 -0.270 0.816 1.572 -3.936 0.1598375 3.514 Up
## 4 1990 3.514 -2.576 -0.270 0.816 1.572 0.1616300 0.712 Up
## 5 1990 0.712 3.514 -2.576 -0.270 0.816 0.1537280 1.178 Up
## 6 1990 1.178 0.712 3.514 -2.576 -0.270 0.1544440 -1.372 Down
glm_full <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
summary(glm_full)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22122 0.06147 3.599 0.000319 ***
## Lag1 -0.03872 0.02622 -1.477 0.139672
## Lag2 0.06025 0.02655 2.270 0.023232 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1488.2 on 1086 degrees of freedom
## AIC: 1494.2
##
## Number of Fisher Scoring iterations: 4
glm_minus1 <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-1, ], family = binomial)
summary(glm_minus1)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly[-1,
## ])
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22324 0.06150 3.630 0.000283 ***
## Lag1 -0.03843 0.02622 -1.466 0.142683
## Lag2 0.06085 0.02656 2.291 0.021971 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1494.6 on 1087 degrees of freedom
## Residual deviance: 1486.5 on 1085 degrees of freedom
## AIC: 1492.5
##
## Number of Fisher Scoring iterations: 4
# Computing posterior probability for first observation
prob1 <- predict(glm_minus1, newdata = Weekly[1, ], type = "response")
# Classifying as "Up" if prob > 0.5
pred1 <- ifelse(prob1 > 0.5, "Up", "Down")
# Checking if prediction matches actual
actual1 <- Weekly$Direction[1]
correct1 <- pred1 == actual1
print(correct1)
## [1] FALSE
n <- nrow(Weekly)
errors <- rep(0, n)
for (i in 1:n) {
  # (i) Fit the model without the ith observation
  glm_loo <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = binomial)
  # (ii) and (iii) Predict the direction for the ith observation
  prob_i <- predict(glm_loo, newdata = Weekly[i, ], type = "response")
  pred_i <- ifelse(prob_i > 0.5, "Up", "Down")
  # (iv) Compare the prediction with the actual direction
  actual_i <- Weekly$Direction[i]
  errors[i] <- ifelse(pred_i != actual_i, 1, 0)
}
loocv_error <- mean(errors)
cat("LOOCV estimate of test error:", loocv_error, "\n")
## LOOCV estimate of test error: 0.4499541
An error rate close to 0.45 means the model performs only slightly better than random guessing, which would yield about 50% error on a binary classification problem like this. A likely reason is that the relationship between Lag1, Lag2, and Direction is weak: in the full-data fit, Lag1 is not significant (p = 0.1397) and Lag2 is only marginally significant (p = 0.0232).
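As a cross-check (a sketch, using the boot package loaded earlier), cv.glm() computes the same LOOCV error automatically; the cost function converts fitted probabilities into a misclassification rate. The last line gives, for context, the error of always predicting the majority class, "Up":
# cv.glm() defaults to K = n, i.e. LOOCV, so this refits the model n times
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
cv.glm(Weekly, glm_full, cost = cost)$delta[1]
# Error rate of the naive always-"Up" classifier
mean(Weekly$Direction != "Up")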
library(ISLR2)
data("Boston")
# Estimating the population mean of medv:
mu_hat <- mean(Boston$medv)
mu_hat
## [1] 22.53281
So, across the census tracts in the data set, the median value of owner-occupied homes averages approximately $22,533 (medv is recorded in $1000s).
n <- nrow(Boston)
se_mu <- sd(Boston$medv) / sqrt(n)
se_mu
## [1] 0.4088611
This means the estimated mean home value of $22,533 would typically vary by about $409 from sample to sample due to random sampling variation.
library(boot)
boot.fn <- function(data, index) {
  return(mean(data[index]))
}
# Performing bootstrap with 1000 resamples:
set.seed(123)
boot_mean <- boot(Boston$medv, boot.fn, R = 1000)
boot_mean
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 -0.01607372 0.4045557
The bootstrap standard error (0.4046) is very similar to the formula-based estimate from part (b) (0.4089).
# Approximating 95% CI:
boot_ci <- c(mean(Boston$medv) - 2 * sd(boot_mean$t),
             mean(Boston$medv) + 2 * sd(boot_mean$t))
boot_ci
## [1] 21.72369 23.34192
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
The bootstrap approximation and the t-test yield very similar 95% confidence intervals: (21.72, 23.34) versus (21.73, 23.34).
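A third comparison point (a sketch) is the percentile bootstrap interval from boot.ci(), which avoids the mean ± 2 SE approximation used above:
boot.ci(boot_mean, type = "perc")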
mu_med <- median(Boston$medv)
mu_med
## [1] 21.2
This is the point estimate of the median home value (in $1000s) in the Boston data set: half of the census tracts have a median value below $21,200, and half above.
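Incidentally, this median (21.2) sits below the mean computed earlier (22.53), which hints that the distribution of medv is right-skewed; a quick check:
mean(Boston$medv) - median(Boston$medv)  # a positive difference suggests right skew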
# Bootstrap function for median:
boot.median.fn <- function(data, index) {
  return(median(data[index]))
}
set.seed(123)
boot_median <- boot(Boston$medv, boot.median.fn, R = 1000)
boot_median
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.median.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0203 0.3676453
The bootstrap standard error for the median of medv is 0.3676, which is fairly small; in fact, it is slightly smaller than the standard error of the mean. This suggests the sample median is a stable estimate that is not overly sensitive to sampling variation. Unlike the mean, the median has no simple closed-form standard error formula, which is precisely the kind of situation where the bootstrap is useful.
mu_0.1 <- quantile(Boston$medv, 0.1)
mu_0.1
## 10%
## 12.75
This estimate says that 10% of Boston census tracts have a median home value below $12,750.
boot.p10.fn <- function(data, index) {
  return(quantile(data[index], 0.1))
}
set.seed(123)
boot_p10 <- boot(Boston$medv, boot.p10.fn, R = 1000)
boot_p10
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.p10.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 -0.012 0.527868
This tells us how much the 10th percentile would vary from sample to sample: if we collected another sample of the same size, the estimated 10th percentile would typically fluctuate by about $528 due to random sampling variation.
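Mirroring the mean ± 2 SE approximation used earlier, a rough 95% confidence interval for the 10th percentile would be (a sketch):
c(mu_0.1 - 2 * sd(boot_p10$t), mu_0.1 + 2 * sd(boot_p10$t))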