Data Mining HW #4

chapter 5: 3, 5, 6, 9

3. a) K-Fold cross validation involves separating the data into k “folds” or subsets. Then, one of the folds is used to validate the model and the rest are used to train. This process is repeated until each fold has been used as a validation set, and the errors of each iteration are averaged together.

b) The advantages and disadvantages of k-fold validation relative to:

(i) The Validation Set approach: This method is not as resource intensive as LOOCV or k-fold validation approaches, but it also leaves room for lots of variation in error rates when evaluating validation sets. That is, the success of this approach is dependent on which data points are included in each partition of the overall dataset.

(ii) The LOOCV approach: k-fold validation is less time and resource intensive than the LOOCV, but it is still costly. LOOCV does not allow for any randomness in the validation set since it is based on a single test case, so k-fold validation helps alleviate some of the issues that accompany that. LOOCV helps to mitigate some of the fluctuations in the error rates of general validation set approaches, however.

5. a)

library(ISLR)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.5     ✓ dplyr   1.0.3
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
data("Default")

a<- glm(default~income + balance, family = "binomial", data = Default)

b)

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
set.seed(500)
index <- createDataPartition(y = Default$default, p = 0.7, list = F)
train <- Default[index, ]
test <- Default[-index, ]
b<- glm(default~ income + balance, family = "binomial", data = train)
test_pred <- factor(ifelse(predict(b, newdata = test, type = "response") > 0.5, "Yes", "No"))
round(mean(test$default != test_pred), 5)
## [1] 0.02601

c) The error rates are roughly the same even when using different data partitions.

set.seed(500)
index <- createDataPartition(y = Default$default, p = 0.5, list = F)
train <- Default[index, ]
test <- Default[-index, ]
b<- glm(default~ income + balance, family = "binomial", data = train)
test_pred <- factor(ifelse(predict(b, newdata = test, type = "response") > 0.5, "Yes", "No"))
round(mean(test$default != test_pred), 5)
## [1] 0.02621
set.seed(500)
index <- createDataPartition(y = Default$default, p = 0.6, list = F)
train <- Default[index, ]
test <- Default[-index, ]
b<- glm(default~ income + balance, family = "binomial", data = train)
test_pred <- factor(ifelse(predict(b, newdata = test, type = "response") > 0.5, "Yes", "No"))
round(mean(test$default != test_pred), 5)
## [1] 0.02526
set.seed(500)
index <- createDataPartition(y = Default$default, p = 0.8, list = F)
train <- Default[index, ]
test <- Default[-index, ]
b<- glm(default~ income + balance, family = "binomial", data = train)
test_pred <- factor(ifelse(predict(b, newdata = test, type = "response") > 0.5, "Yes", "No"))
round(mean(test$default != test_pred), 5)
## [1] 0.02451

d) Including the dummy variable student does not impact the model error rates by much (~0.01).

set.seed(500)
index <- createDataPartition(y = Default$default, p = 0.7, list = F)
train <- Default[index, ]
test <- Default[-index, ]
d<- glm(default~ income + balance + student, family = "binomial", data = train)
test_pred <- factor(ifelse(predict(b, newdata = test, type = "response") > 0.5, "Yes", "No"))
round(mean(test$default != test_pred), 5)
## [1] 0.02634

6. a)

a<- glm(default ~ income + balance, family = "binomial", data = Default)
summary(a)$coefficients
##                  Estimate   Std. Error    z value      Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income       2.080898e-05 4.985167e-06   4.174178  2.990638e-05
## balance      5.647103e-03 2.273731e-04  24.836280 3.638120e-136

b)

boot.fn <- function(data, index = 1:nrow(data)) {
  coef(glm(default~ income + balance, data = data, subset = index, family = "binomial"))
}

boot.fn(Default)
##   (Intercept)        income       balance 
## -1.154047e+01  2.080898e-05  5.647103e-03

c)

library(boot)
## 
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
## 
##     melanoma
set.seed(555, sample.kind = "Rounding")
## Warning in set.seed(555, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
(output<- boot(data = Default, statistic = boot.fn, R = 500))
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 500)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -7.641167e-02 3.979406e-01
## t2*  2.080898e-05  3.933086e-07 4.690581e-06
## t3*  5.647103e-03  3.755044e-05 2.167909e-04

d) The errors are slightly higher with the bootstrapped version of the data but overall are similar.

9. a)

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data(Boston)
mean(Boston$medv)
## [1] 22.53281

b)

sd(Boston$medv)/sqrt(length(Boston$medv))
## [1] 0.4088611

c)

boot.fn<- function(vector, index) {
  mean(vector[index])
}

set.seed(555, sample.kind = "Rounding")
## Warning in set.seed(555, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
output<- boot(Boston$medv, statistic = boot.fn, R = 500)

The std error from the bootstrap was 0.401, and it was 0.408 in part c. They are not the exact same, but they are within 0.001 of each other.

d)

sd_boot<- sd(output$t)
round(c(mean(Boston$medv) - 2*sd_boot, mean(Boston$medv) + 2*sd_boot), 4)
## [1] 21.7297 23.3359

e)

median(Boston$medv)
## [1] 21.2

f)

boot.fn<- function(vector, index) {
  median(vector[index])
}

set.seed(555, sample.kind = "Rounding")
## Warning in set.seed(555, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
(output<- boot(Boston$medv, statistic = boot.fn, R = 500))
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 500)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 -0.0239    0.372816

g)

quantile(Boston$medv, 0.1)
##   10% 
## 12.75

h)

boot.fn<- function(vector, index) {
  quantile(vector[index], 0.1)
}

set.seed(555, sample.kind = "Rounding")
## Warning in set.seed(555, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
(output<- boot(Boston$medv, statistic = boot.fn, R = 500))
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 500)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75 -0.0432   0.5219373

The standard error is slightly higher, and the bias is slightly lower than the median-based calculation.