ANSWER 3A: This approach involves randomly k-fold CV dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds. The mean squared error, MSE1, is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error,MSE1,MSE2, . . . ,MSEk.
ANSWER 3B:
Advantages/Disadvantages of k-fold relative to the validation set approach:
Performing 10-fold CV requires fitting the learning procedure only ten times, which may be much more feasible.
The validation set approach can lead to overestimates of the test error rate, since in this approach the training set used to fit the statistical learning method contains only half the observations of the entire data set. Using this logic, it is not hard to see that LOOCV will give approximately unbiased estimates of the test error, since each training set contains n − 1 observations, which is almost as many as the number of observations in the full data set.
Performing k-fold CV for, say, k = 5 or k = 10 will lead to an intermediate level of bias, since each training set contains (k − 1)n/k observations—fewer than in the LOOCV approach, but substantially more than in the validation set approach. Therefore, from the perspective of bias reduction, it is clear that LOOCV is to be preferred to k-fold CV.
Advantages/Disadvantages of k-fold relative to LOOCV:
The most obvious advantage is computational. LOOCV requires fitting the statistical learning method n times. This has the potential to be computationally expensive (except for linear models fit by least squares, in which case formula (5.2) can be used).
Some statistical learning methods have computationally intensive fitting procedures, and so performing LOOCV may pose computational
problems, especially if n is extremely large.
k-fold CV gives more accurate estimates of the test error rate than does LOOCV
LOOCV has higher variance than does k-fold CV with k < n.
ANSWER 5A:
require(ISLR)
## Loading required package: ISLR
log_res <- glm(formula = default ~ income + balance, data = Default, family ="binomial")
summary(log_res)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
ANSWER 5B-i:
I have split 50/50 for train/test as shown:
require(ISLR)
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(123, sample.kind = "Rounding")
index <- createDataPartition(y = Default$default, p = 0.5, list = F)
train <- Default[index, ]
test <- Default[-index, ]
nrow(train) / nrow(Default)
## [1] 0.5001
ANSWER 5B-ii:
require(ISLR)
require(caret)
log_res2 <- glm(default ~ income + balance, data = train, family = "binomial")
summary(log_res2)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1929 -0.1502 -0.0606 -0.0230 3.2824
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.121e+01 5.923e-01 -18.927 < 2e-16 ***
## income 1.923e-05 6.989e-06 2.752 0.00593 **
## balance 5.507e-03 3.132e-04 17.582 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1463.76 on 5000 degrees of freedom
## Residual deviance: 814.48 on 4998 degrees of freedom
## AIC: 820.48
##
## Number of Fisher Scoring iterations: 8
ANSWER 5B-iii:
The following shows the first 15 predictions stored in test_preds:
require(ISLR)
require(caret)
test_preds <- factor(ifelse(predict(log_res2, newdata = test, type = "response") > 0.5, "Yes", "No"))
test_preds[1:15]
## 2 3 4 7 12 13 14 15 17 18 19 20 21 25 30
## No No No No No No No No No No No No No No No
## Levels: No Yes
ANSWER 5B-iv:
The validation error = 0.02501 as shown below:
require(ISLR)
require(caret)
round(mean(test$default != test_preds), 5)
## [1] 0.02501
ANSWER 5C:
As shown below, I used a loop to perform the additional train/test splits, fit the logistic regression model on train and store the test error in vactor log_df_err: The average value for test error is 0.02611
require(ISLR)
require(caret)
log_df_err <- c()
log_df_err[1] <- mean(test$default != test_preds)
set.seed(42, sample.kind = "Rounding")
for (i in 2:4) {
index <- createDataPartition(y = Default$default, p = 0.5, list = F)
train <- Default[index, ]
test <- Default[-index, ]
log_res <- glm(default ~ income + balance, data = train, family = "binomial")
test_preds <- factor(ifelse(predict(log_res, newdata = test, type = "response") > 0.5, "Yes", "No"))
log_df_err[i] <- mean(test$default != test_preds)
}
round(log_df_err, 5)
## [1] 0.02501 0.02581 0.02741 0.02621
mean(round(log_df_err, 5)) #average value for test error
## [1] 0.02611
ANSWER 5D:
As shown below we can see that the studentYes variable is not significant which has an impact on the income variable (since the income variable is not significant anymore). The test error is 0.02681
require(ISLR)
require(caret)
set.seed(43, sample.kind = "Rounding")
index <- createDataPartition(y = Default$default, p = 0.5, list = F)
train <- Default[index, ]
test <- Default[-index, ]
log_res5d <- glm(default ~ ., data = train, family = "binomial")
summary(log_res5d)
##
## Call:
## glm(formula = default ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2165 -0.1350 -0.0509 -0.0183 3.2414
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.166e+01 7.470e-01 -15.603 <2e-16 ***
## studentYes -1.845e-01 3.459e-01 -0.533 0.594
## balance 5.858e-03 3.371e-04 17.379 <2e-16 ***
## income 1.329e-05 1.207e-05 1.100 0.271
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1463.76 on 5000 degrees of freedom
## Residual deviance: 762.82 on 4997 degrees of freedom
## AIC: 770.82
##
## Number of Fisher Scoring iterations: 8
test_preds <- factor(ifelse(predict(log_res5d, newdata = test, type = "response") > 0.5, "Yes", "No"))
round(mean(test$default != test_preds), 5)
## [1] 0.02681
ANSWER 6A:
The estimated standard errors for the coefficients associated with income(4.985167e-06) and balance(2.273731e-04) are shown:
require(ISLR)
require(caret)
log_res6 <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(log_res6)$coefficients[, 2]
## (Intercept) income balance
## 4.347564e-01 4.985167e-06 2.273731e-04
ANSWER 6B:
The coefficient estimates for income(2.080898e-05) and balance(5.647103e-03) in the multiple logistic regression model are shown:
require(ISLR)
require(caret)
boot.fn <- function(data, index = 1:nrow(data)) {coef(glm(default ~ income + balance, data = data, subset = index, family = "binomial"))[-1]}
boot.fn(Default)
## income balance
## 2.080898e-05 5.647103e-03
ANSWER 6C:
The SE(income) and SE(balance) coefficients are shown below:
require(tidyr)
## Loading required package: tidyr
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#require(ISLR)
#require(caret)
require(boot)
## Loading required package: boot
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
## melanoma
require(ggplot2)
set.seed(101, sample.kind = "Rounding")
boot_res <- boot(data = Default, statistic = boot.fn, R = 1000)
boot_res
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.080898e-05 9.797648e-08 4.710812e-06
## t2* 5.647103e-03 1.345215e-05 2.293757e-04
as.data.frame(boot_res$t) %>% rename(income = V1, balance = V2) %>% gather(key = "variable", value = "estimate") %>%
ggplot(aes(x = estimate, fill = factor(variable))) +
geom_histogram(bins = 20, color="blue",fill= "orange") +
facet_wrap(~ variable, scales = "free_x") +
labs(title = "1000 Bootstrap Parameter Estimates - 'balance' & 'income'",
subtitle = paste0("SE(income) = ", formatC(sd(boot_res$t[ ,1]), format = "e", digits = 6),
" ,SE(balance) = ", formatC(sd(boot_res$t[ ,2]), format = "e", digits = 6)),
x = "Parameter Estimate",
y = "Count") +
theme(legend.position = "none")
ANSWER 6D:
The boot() function determines the bootstrap estimate for standard errors and we can use the standard error equation on pg.197
There is not a big difference on results for standard error estimates for income and balance when using the boot() or glm() methods as shown. Based on these results we can assume that the logistic standard errors are satisfied.
require(tidyr)
require(dplyr)
#require(ISLR)
#require(caret)
require(boot)
require(ggplot2)
#boot():
sapply(data.frame(income = boot_res$t[ ,1], balance = boot_res$t[ ,2]), sd)
## income balance
## 4.710812e-06 2.293757e-04
#glm():
summary(log_res6)$coefficients[2:3, 2]
## income balance
## 4.985167e-06 2.273731e-04
ANSWER 9A:
The populating mean of medv is estimated by taking the sample mean ˆμ(22.53281) as shown:
require(MASS)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
mean(Boston$medv)
## [1] 22.53281
ANSWER 9B:
SE(ˆμ) = 0.4088611, This indicates that there is about 41% variation for the standard error estimate across multiple samples of a population.
require(MASS)
sd(Boston$medv) / sqrt(length(Boston$medv))
## [1] 0.4088611
ANSWER 9C:
The estimate of the standard error of ˆμ using the bootstrap is shown below(0.4081538). When comparing to 9b SE results, we find that both standard errors are very similar, about 0.41
require(MASS)
require(boot)
boot.fn <- function(vector, index) {mean(vector[index])}
set.seed(66, sample.kind = "Rounding")
boot_res <- boot(data = Boston$medv, statistic = boot.fn, R = 1000)
boot_res
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.0116587 0.4081538
ANSWER 9D:
The confidence interval estimates have the same results for bootstrap and t.test as shown below: 95 percent confidence interval: 21.72953 23.33608
require(MASS)
require(boot)
boot_resSE <- sd(boot_res$t)
round(c(mean(Boston$medv) - 2*boot_resSE, mean(Boston$medv) + 2*boot_resSE), 4)
## [1] 21.7165 23.3491
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
ANSWER 9E:
The population median of medv estimate is shown:
require(MASS)
median(Boston$medv)
## [1] 21.2
ANSWER 9F:
The standard error estimate of the median using bootstrap(0.3700318) is shown below:
require(MASS)
require(boot)
boot.fn <- function(vector, index) {median(vector[index])}
set.seed(77, sample.kind = "Rounding")
(boot_res <- boot(data = Boston$medv, statistic = boot.fn, R = 1000))
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.0094 0.3700318
ANSWER 9G:
The estimate for the tenth percentile of medv in Boston suburbs is (12.75) shown:
require(MASS)
require(ISLR)
quantile(Boston$medv, 0.1)
## 10%
## 12.75
ANSWER 9H:
The standard error estimate using bootstrap is a bit larger(0.488873) but not by much as shown:
require(MASS)
require(ISLR)
require(boot)
boot.fn <- function(vector, index) {quantile(vector[index], 0.1)}
set.seed(77, sample.kind = "Rounding")
boot_res <- boot(data = Boston$medv, statistic = boot.fn, R = 1000)
boot_res
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.02085 0.488873