Chapter 5: (5), (6), (7), and (9)
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(boot)
## Warning: package 'boot' was built under R version 4.3.3
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
library(tidyr)
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.3
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
glimpse(Default)
## Rows: 10,000
## Columns: 4
## $ default <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No…
## $ student <fct> No, Yes, No, No, No, Yes, No, Yes, No, No, Yes, Yes, No, No, N…
## $ balance <dbl> 729.5265, 817.1804, 1073.5492, 529.2506, 785.6559, 919.5885, 8…
## $ income <dbl> 44361.625, 12106.135, 31767.139, 35704.494, 38463.496, 7491.55…
log_mo <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(log_mo)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
set.seed(234, sample.kind = "Rounding")
## Warning in set.seed(234, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
index <- createDataPartition(y = Default$default, p = 0.7, list = F)
train <- Default[index, ]
test <- Default[-index, ]
nrow(train) / nrow(Default)
## [1] 0.7001
Using income and balance variables
log_mo_2 <- glm(default ~ income + balance, data = train, family = "binomial")
summary(log_mo_2)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.191e+01 5.405e-01 -22.041 < 2e-16 ***
## income 2.042e-05 6.081e-06 3.358 0.000785 ***
## balance 5.876e-03 2.824e-04 20.806 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2050.6 on 7000 degrees of freedom
## Residual deviance: 1065.7 on 6998 degrees of freedom
## AIC: 1071.7
##
## Number of Fisher Scoring iterations: 8
test_pred <- factor(ifelse(predict(log_mo_2, newdata = test, type = "response") > 0.5, "Yes", "No"))
test_pred[1:10]
## 2 5 7 13 14 17 22 24 25 29
## No No No No No No No No No No
## Levels: No Yes
round(mean(test$default != test_pred), 5)
## [1] 0.02934
The process involves fitting logistic regression models on a training dataset and then performing three additional train/test splits iteratively within a loop. For each iteration, the model is trained on the newly created training set and evaluated on the corresponding test set. The error rates from testing on each of these test sets are recorded in a vector named log_def_errors. The results from this procedure are subsequently presented.
log_mo_errors <- c()
log_mo_errors[1] <- mean(test$default != test_pred)
set.seed(42, sample.kind = "Rounding")
## Warning in set.seed(42, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
for (i in 2:4) {
index <- createDataPartition(y = Default$default, p = 0.7, list = F)
train <- Default[index, ]
test <- Default[-index, ]
log_mo <- glm(default ~ income + balance, data = train, family = "binomial")
test_pred <- factor(ifelse(predict(log_mo, newdata = test, type = "response") > 0.5, "Yes", "No"))
log_mo_errors[i] <- mean(test$default != test_pred)
}
round(log_mo_errors, 5)
## [1] 0.02934 0.02534 0.02601 0.02534
There is some variation, as we would expect with the validation set approach and cross-validation. The average test error is 0.02651.
set.seed(55, sample.kind = "Rounding")
## Warning in set.seed(55, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
index <- createDataPartition(y = Default$default, p = 0.7, list = F)
train <- Default[index, ]
test <- Default[-index, ]
log_def <- glm(default ~ ., data = train, family = "binomial")
summary(log_def)
##
## Call:
## glm(formula = default ~ ., family = "binomial", data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.052e+01 5.718e-01 -18.397 <2e-16 ***
## studentYes -5.478e-01 2.748e-01 -1.993 0.0462 *
## balance 5.511e-03 2.643e-04 20.851 <2e-16 ***
## income 3.570e-06 9.599e-06 0.372 0.7100
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2050.6 on 7000 degrees of freedom
## Residual deviance: 1144.9 on 6997 degrees of freedom
## AIC: 1152.9
##
## Number of Fisher Scoring iterations: 8
test_pred <- factor(ifelse(predict(log_def, newdata = test, type = "response") > 0.5, "Yes", "No"))
The studentYes variable is significant, which has led the income variable to lose significance (correlated features). However, we are primarily concerned with the model’s out-of-sample performance. The test error is shown below.
round(mean(test$default != test_pred), 5)
## [1] 0.02434
log_def <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(log_def)$coefficients[, 2]
## (Intercept) income balance
## 4.347564e-01 4.985167e-06 2.273731e-04
I write the boot.fn() method and test it with the default data. Its default action is to estimate the coefficients for all of the supplied data if the index parameter is omitted.
boot.fn <- function(data, index = 1:nrow(data)) {
coef(glm(default ~ income + balance, data = data, subset = index, family = "binomial"))[-1]
}
boot.fn(Default)
## income balance
## 2.080898e-05 5.647103e-03
set.seed(121, sample.kind = "Rounding")
## Warning in set.seed(121, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
(boot_results <- boot(data = Default, statistic = boot.fn, R = 1000))
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.080898e-05 1.141114e-07 4.774037e-06
## t2* 5.647103e-03 1.616992e-05 2.205527e-04
as.data.frame(boot_results$t) %>%
rename(income = V1, balance = V2) %>%
gather(key = "variable", value = "estimate") %>%
ggplot(aes(x = estimate, fill = factor(variable))) +
geom_histogram(bins = 20) +
facet_wrap(~ variable, scales = "free_x") +
labs(title = "1,000 Bootstrap Parameter Estimates - 'balance' & 'income'",
subtitle = paste0("SE(balance) = ", formatC(sd(boot_results$t[ ,2]), format = "e", digits = 6),
", SE(income) = ", formatC(sd(boot_results$t[ ,1]), format = "e", digits = 6)),
x = "Parameter Estimate",
y = "Count") +
theme(legend.position = "none")
sapply(data.frame(income = boot_results$t[ ,1], balance = boot_results$t[ ,2]), sd)
## income balance
## 4.774037e-06 2.205527e-04
summary(log_def)$coefficients[2:3, 2]
## income balance
## 4.985167e-06 2.273731e-04
glimpse(Weekly)
## Rows: 1,089
## Columns: 9
## $ Year <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, …
## $ Lag1 <dbl> 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807, 0…
## $ Lag2 <dbl> 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0…
## $ Lag3 <dbl> -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, 1.178, -…
## $ Lag4 <dbl> -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.514, 0.712, …
## $ Lag5 <dbl> -3.484, -0.229, -3.936, 1.572, 0.816, -0.270, -2.576, 3.514,…
## $ Volume <dbl> 0.1549760, 0.1485740, 0.1598375, 0.1616300, 0.1537280, 0.154…
## $ Today <dbl> -0.270, -2.576, 3.514, 0.712, 1.178, -1.372, 0.807, 0.041, 1…
## $ Direction <fct> Down, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, Up…
log_dir <- glm(Direction ~ Lag1 + Lag2, data = Weekly, family = "binomial")
summary(log_dir)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = "binomial", data = Weekly)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22122 0.06147 3.599 0.000319 ***
## Lag1 -0.03872 0.02622 -1.477 0.139672
## Lag2 0.06025 0.02655 2.270 0.023232 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1496.2 on 1088 degrees of freedom
## Residual deviance: 1488.2 on 1086 degrees of freedom
## AIC: 1494.2
##
## Number of Fisher Scoring iterations: 4
log_dir_2 <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-1, ], family = "binomial")
summary(log_dir_2)
##
## Call:
## glm(formula = Direction ~ Lag1 + Lag2, family = "binomial", data = Weekly[-1,
## ])
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22324 0.06150 3.630 0.000283 ***
## Lag1 -0.03843 0.02622 -1.466 0.142683
## Lag2 0.06085 0.02656 2.291 0.021971 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1494.6 on 1087 degrees of freedom
## Residual deviance: 1486.5 on 1085 degrees of freedom
## AIC: 1492.5
##
## Number of Fisher Scoring iterations: 4
P(Direction = Up| Lag1, Lag2) > 0.5. Was
this observation correctly classified?ifelse(predict(log_dir_2, newdata = Weekly[1, ], type = "response") > 0.5, "Up", "Down")
## 1
## "Up"
The actual Direction:
as.character(Weekly[1, "Direction"])
## [1] "Down"
Therefore it was classified wrong.
error <- c()
for (i in 1:nrow(Weekly)) {
log_dir <- glm(Direction ~ Lag1 + Lag2, data = Weekly[-i, ], family = "binomial") # i.
prediction <- ifelse(predict(log_dir, newdata = Weekly[i, ], type = "response") > 0.5, "Up", "Down") # ii. & iii.
error[i] <- as.numeric(prediction != Weekly[i, "Direction"]) # iv.
}
error[1:10]
## [1] 1 1 0 1 0 1 0 0 0 1
mean(error)
## [1] 0.4499541
prop.table(table(Weekly$Direction))
##
## Down Up
## 0.4444444 0.5555556
glimpse(Boston)
## Rows: 506
## Columns: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
mean(Boston$medv)
## [1] 22.53281
sd(Boston$medv) / sqrt(length(Boston$medv))
## [1] 0.4088611
boot.fn <- function(vector, index) {
mean(vector[index])
}
set.seed(66, sample.kind = "Rounding")
## Warning in set.seed(66, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
(boot_results <- boot(data = Boston$medv, statistic = boot.fn, R = 1000))
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.0116587 0.4081538
Since it’s not easy to directly extract the bootstrap standard error from the bootstrap output, we can calculate it ourselves by taking the standard deviation of the bootstrap estimates.
From here, it is simple to calculate the bootstrap 95% confidence interval using the hint:
boot_results_SE <- sd(boot_results$t)
round(c(mean(Boston$medv) - 2*boot_results_SE, mean(Boston$medv) + 2*boot_results_SE), 4)
## [1] 21.7165 23.3491
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
median(Boston$medv)
## [1] 21.2
boot.fn <- function(vector, index) {
median(vector[index])
}
set.seed(77, sample.kind = "Rounding")
## Warning in set.seed(77, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
(boot_results <- boot(data = Boston$medv, statistic = boot.fn, R = 1000))
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.0094 0.3700318
As with the mean, the standard error is quite small relative to the estimate.
quantile(Boston$medv, 0.1)
## 10%
## 12.75
boot.fn <- function(vector, index) {
quantile(vector[index], 0.1)
}
set.seed(77, sample.kind = "Rounding")
## Warning in set.seed(77, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
(boot_results <- boot(data = Boston$medv, statistic = boot.fn, R = 1000))
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.02085 0.488873
The standard error is slightly larger relative to \(\hat{μ}\) 0.1, but it is still small.