library(ISLR2)
logistic_model <- glm(default ~ income + balance,
                      data = Default,
                      family = binomial)
summary(logistic_model)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Above, a logistic regression model is fit using income and balance to predict default. Both coefficients are positive and highly statistically significant, as evidenced by the very small p-values. Note that the small coefficient estimates reflect the scale of the predictors, which are measured in dollars, so a one-unit (one-dollar) change is tiny; they do not imply that the predictors have a weak effect.
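To make the effect sizes easier to read, we can move to the odds scale. The sketch below (not part of the original output) exponentiates each coefficient times a $1,000 change in the predictor:
# Multiplicative change in the odds of default per $1,000 increase
# in each predictor (the coefficients themselves are per dollar):
exp(1000 * coef(logistic_model)["balance"])
exp(1000 * coef(logistic_model)["income"])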
set.seed(1)
train_indices <- sample(1:nrow(Default), nrow(Default) / 2)
train_set <- Default[train_indices, ]
validation_set <- Default[-train_indices, ]
The above code splits the data in half: a training set and a validation set of equal size.
model <- glm(default ~ income + balance,
             data = train_set,
             family = binomial)
summary(model)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = train_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.194e+01 6.178e-01 -19.333 < 2e-16 ***
## income 3.262e-05 7.024e-06 4.644 3.41e-06 ***
## balance 5.689e-03 3.158e-04 18.014 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.8 on 4999 degrees of freedom
## Residual deviance: 803.3 on 4997 degrees of freedom
## AIC: 809.3
##
## Number of Fisher Scoring iterations: 8
The model fit on the training half produces coefficient estimates similar to those from the fit on the full data set.
posterior_probs <- predict(model, newdata = validation_set, type = "response")
# Classifying based on a threshold of 0.5:
predicted_default <- ifelse(posterior_probs > 0.5, "Yes", "No")
Here each validation observation is classified as "Yes" (default) when its posterior probability exceeds 0.5, and "No" otherwise.
actual_default <- validation_set$default
validation_error <- mean(predicted_default != actual_default)
print(validation_error)
## [1] 0.0254
The model misclassifies 2.54% of the validation-set observations, for an accuracy of 97.46%.
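For a finer-grained view, a confusion matrix (a quick sketch reusing the objects above) splits that 2.54% into false positives and false negatives:
# Rows are predicted classes, columns are actual classes
table(Predicted = predicted_default, Actual = actual_default)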
set.seed(123)
n_splits <- 3
error_rates <- numeric(n_splits)
for (i in 1:n_splits) {
  train_indices <- sample(1:nrow(Default), nrow(Default) / 2)
  train_set <- Default[train_indices, ]
  validation_set <- Default[-train_indices, ]
  model <- glm(default ~ balance + income, data = train_set, family = binomial)
  posterior_probs <- predict(model, newdata = validation_set, type = "response")
  predicted_default <- ifelse(posterior_probs > 0.5, "Yes", "No")
  error_rates[i] <- mean(predicted_default != validation_set$default)
  cat("Split", i, "error rate:", error_rates[i], "\n")
}
## Split 1 error rate: 0.0276
## Split 2 error rate: 0.0246
## Split 3 error rate: 0.0262
cat("Average error rate:", mean(error_rates), "\n")
## Average error rate: 0.02613333
Overall, the model's accuracy changes little across randomly chosen training/validation splits. Averaged over the three random splits, 2.61% of the validation observations are misclassified, for an accuracy of 97.39%.
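As a quick check on that stability claim (not in the original write-up), the standard deviation of the three error rates quantifies the split-to-split variability:
# A small spread across splits indicates a stable validation error
sd(error_rates)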
set.seed(456)
train_indices <- sample(1:nrow(Default), nrow(Default) / 2)
train_set <- Default[train_indices, ]
validation_set <- Default[-train_indices, ]
# Fitting the logistic regression model using income, balance, and student:
model_student <- glm(default ~ income + balance + student,
                     data = train_set, family = binomial)
posterior_probs <- predict(model_student, newdata = validation_set, type = "response")
predicted_default <- ifelse(posterior_probs > 0.5, "Yes", "No")
# Computing the validation set error:
validation_error <- mean(predicted_default != validation_set$default)
print(validation_error)
## [1] 0.0296
From the above output, including a dummy variable for student gives a validation error of 2.96%, a slight decrease in accuracy (to 97.04%). Hence there is no evidence that introducing the student dummy reduces the test error. Note, however, that this figure comes from a different random split (seed 456) than the earlier 2.54% (seed 1), so the comparison is only approximate; see the like-for-like check below.
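As a like-for-like check (a sketch), the code below fits the model without the student dummy on the identical seed-456 split and compares validation errors directly:
# Baseline model on the same split used for model_student
model_base <- glm(default ~ income + balance,
                  data = train_set, family = binomial)
probs_base <- predict(model_base, newdata = validation_set, type = "response")
pred_base <- ifelse(probs_base > 0.5, "Yes", "No")
mean(pred_base != validation_set$default)  # error without the student dummy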
set.seed(1)
# Fitting logistic regression model:
glm_fit <- glm(default ~ income + balance, data = Default, family = binomial)
summary(glm_fit)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
boot.fn <- function(data, index) {
  fit <- glm(default ~ income + balance, data = data[index, ], family = binomial)
  return(coef(fit)[2:3])
}
library(boot)
set.seed(1)
# Performing bootstrap with 1000 resamples
boot_results <- boot(data = Default, statistic = boot.fn, R = 1000)
boot_results
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.080898e-05 1.680317e-07 4.866284e-06
## t2* 5.647103e-03 1.855765e-05 2.298949e-04
boot_results$t0 # original estimates
## income balance
## 2.080898e-05 5.647103e-03
apply(boot_results$t, 2, sd) # bootstrap standard errors
## [1] 4.866284e-06 2.298949e-04
The standard errors reported by glm() for the income and balance coefficients closely match the bootstrap estimates. This agreement is strong evidence that the asymptotic (formula-based) standard errors from glm() are reliable in this instance. Because the bootstrap is nonparametric, it verifies these estimates without imposing rigid distributional assumptions.
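The same bootstrap output can also be turned into confidence intervals. As a sketch (not part of the original analysis), boot.ci() with the percentile method gives an interval for each coefficient; index selects the statistic:
boot.ci(boot_results, type = "perc", index = 1)  # income coefficient
boot.ci(boot_results, type = "perc", index = 2)  # balance coefficient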
library(ISLR)
##
## Attaching package: 'ISLR'
## The following objects are masked from 'package:ISLR2':
##
## Auto, Credit
data("Weekly")
head(Weekly)
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 1990 0.816 1.572 -3.936 -0.229 -3.484 0.1549760 -0.270 Down
## 2 1990 -0.270 0.816 1.572 -3.936 -0.229 0.1485740 -2.576 Down
## 3 1990 -2.576 -0.270 0.816 1.572 -3.936 0.1598375 3.514 Up
## 4 1990 3.514 -2.576 -0.270 0.816 1.572 0.1616300 0.712 Up
## 5 1990 0.712 3.514 -2.576 -0.270 0.816 0.1537280 1.178 Up
## 6 1990 1.178 0.712 3.514 -2.576 -0.270 0.1544440 -1.372 Down
glm_full <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = binomial)
summary(glm_full)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22122 0.06147 3.599 0.000319 ***
## Lag1 -0.03872 0.02622 -1.477 0.139672
## Lag2 0.06025 0.02655 2.270 0.023232 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1488.2 on 1086 degrees of freedom
## AIC: 1494.2
##
## Number of Fisher Scoring iterations: 4
glm_minus1 <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-1, ], family = binomial)
summary(glm_minus1)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = binomial, data = Weekly[-1,
## ])
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22324 0.06150 3.630 0.000283 ***
## Lag1 -0.03843 0.02622 -1.466 0.142683
## Lag2 0.06085 0.02656 2.291 0.021971 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1494.6 on 1087 degrees of freedom
## Residual deviance: 1486.5 on 1085 degrees of freedom
## AIC: 1492.5
##
## Number of Fisher Scoring iterations: 4
# Computing posterior probability for first observation
prob1 <- predict(glm_minus1, newdata = Weekly[1, ], type = "response")
# Classifying as "Up" if prob > 0.5
pred1 <- ifelse(prob1 > 0.5, "Up", "Down")
# Checking if prediction matches actual
actual1 <- Weekly$Direction[1]
correct1 <- pred1 == actual1
print(correct1)
## [1] FALSE
n <- nrow(Weekly)
errors <- rep(0, n)
for (i in 1:n) {
  # (i) Fit the model without the ith observation
  glm_loo <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = binomial)
  # (ii) and (iii) Predict the direction for the ith observation
  prob_i <- predict(glm_loo, newdata = Weekly[i, ], type = "response")
  pred_i <- ifelse(prob_i > 0.5, "Up", "Down")
  # (iv) Compare the prediction with the actual direction
  actual_i <- Weekly$Direction[i]
  errors[i] <- ifelse(pred_i != actual_i, 1, 0)
}
loocv_error <- mean(errors)
cat("LOOCV estimate of test error:", loocv_error, "\n")
## LOOCV estimate of test error: 0.4499541
An error rate close to 0.45 means the model performs only slightly better than random guessing, which would yield about 50% error on a binary classification problem like this. A likely reason is that the relationship between Lag1, Lag2, and Direction is weak: in the full-data fit, Lag1 is not significant (p = 0.1397) and Lag2 is only marginally significant (p = 0.0232).
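As a cross-check (a sketch, using the boot package loaded earlier), cv.glm() computes the same LOOCV error automatically; the cost function converts fitted probabilities into a misclassification rate. The last line gives, for context, the error of always predicting the majority class, "Up":
# cv.glm() defaults to K = n, i.e. LOOCV, so this refits the model n times
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
cv.glm(Weekly, glm_full, cost = cost)$delta[1]
# Error rate of the naive always-"Up" classifier
mean(Weekly$Direction != "Up")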
library(ISLR2)
data("Boston")
# Estimating the population mean of medv:
mu_hat <- mean(Boston$medv)
mu_hat
## [1] 22.53281
So, across the census tracts in the data set, the median value of owner-occupied homes averages approximately $22,533 (medv is recorded in $1000s).
n <- nrow(Boston)
se_mu <- sd(Boston$medv) / sqrt(n)
se_mu
## [1] 0.4088611
This means the estimated mean home value of $22,533 would typically vary by about $409 from sample to sample due to random sampling variation.
library(boot)
boot.fn <- function(data, index) {
  return(mean(data[index]))
}
# Performing bootstrap with 1000 resamples:
set.seed(123)
boot_mean <- boot(Boston$medv, boot.fn, R = 1000)
boot_mean
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 -0.01607372 0.4045557
The bootstrap standard error (0.4046) is very similar to the formula-based estimate from part (b) (0.4089).
# Approximating 95% CI:
boot_ci <- c(mean(Boston$medv) - 2 * sd(boot_mean$t),
             mean(Boston$medv) + 2 * sd(boot_mean$t))
boot_ci
## [1] 21.72369 23.34192
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
The bootstrap approximation and the t-test yield very similar 95% confidence intervals: (21.72, 23.34) versus (21.73, 23.34).
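A third comparison point (a sketch) is the percentile bootstrap interval from boot.ci(), which avoids the mean ± 2 SE approximation used above:
boot.ci(boot_mean, type = "perc")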
mu_med <- median(Boston$medv)
mu_med
## [1] 21.2
This is the point estimate of the median home value (in $1000s) in the Boston data set: half of the census tracts have a median value below $21,200, and half above.
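Incidentally, this median (21.2) sits below the mean computed earlier (22.53), which hints that the distribution of medv is right-skewed; a quick check:
mean(Boston$medv) - median(Boston$medv)  # a positive difference suggests right skew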
# Bootstrap function for median:
boot.median.fn <- function(data, index) {
  return(median(data[index]))
}
set.seed(123)
boot_median <- boot(Boston$medv, boot.median.fn, R = 1000)
boot_median
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.median.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0203 0.3676453
The bootstrap standard error for the median of medv is 0.3676, which is fairly small; in fact, it is slightly smaller than the standard error of the mean. This suggests the sample median is a stable estimate that is not overly sensitive to sampling variation. Unlike the mean, the median has no simple closed-form standard error formula, which is precisely the kind of situation where the bootstrap is useful.
mu_0.1 <- quantile(Boston$medv, 0.1)
mu_0.1
## 10%
## 12.75
This estimate says that 10% of Boston census tracts have a median home value below $12,750.
boot.p10.fn <- function(data, index) {
  return(quantile(data[index], 0.1))
}
set.seed(123)
boot_p10 <- boot(Boston$medv, boot.p10.fn, R = 1000)
boot_p10
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.p10.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 -0.012 0.527868
This tells us how much the 10th percentile would vary from sample to sample: if we collected another sample of the same size, the estimated 10th percentile would typically fluctuate by about $528 due to random sampling variation.
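Mirroring the mean ± 2 SE approximation used earlier, a rough 95% confidence interval for the 10th percentile would be (a sketch):
c(mu_0.1 - 2 * sd(boot_p10$t), mu_0.1 + 2 * sd(boot_p10$t))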