Problem 3 - Part A
k-fold cross-validation is implemented by randomly dividing the set of observations into k groups (folds) of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k-1 folds. The mean squared error is then computed on the observations in the held-out fold. This procedure is repeated k times, with a different fold serving as the validation set each time, and the k resulting MSE estimates are averaged to give the k-fold CV estimate.
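To make the procedure concrete, here is a minimal sketch of k-fold CV written by hand. It assumes a hypothetical data frame df with a numeric response y, and uses a linear model purely for illustration.
set.seed(1)
k <- 5
n <- nrow(df)                                # df and y are hypothetical placeholders
folds <- sample(rep(1:k, length.out = n))    # randomly assign each observation to a fold
mse <- numeric(k)
for (i in 1:k) {
  fit <- lm(y ~ ., data = df[folds != i, ])            # fit on the remaining k-1 folds
  pred <- predict(fit, newdata = df[folds == i, ])
  mse[i] <- mean((df$y[folds == i] - pred)^2)          # MSE on the held-out fold
}
mean(mse)                                    # the k-fold CV estimate of the test MSE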
Problem 3 - Part B
An advantage of k-fold cross-validation over the validation set approach is that the validation set approach tends to overestimate the test error rate, because the model is trained on only a subset of the observations; k-fold CV repeats the fit k times and ultimately uses every observation for both training and validation. The disadvantage is computational cost: the model must be fit k times, which can be expensive for large data sets.
An advantage of k-fold cross-validation over leave-one-out cross-validation (LOOCV) is computational expense: LOOCV fits the model n times, once per observation, which can be prohibitive for large data sets. LOOCV has the advantage of always yielding the same result, since there is no randomness in the splits, but its error estimate has higher variance than the k-fold estimate because the n trained models are highly correlated with one another.
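In practice, both variants are available through cv.glm() in the boot package; LOOCV is simply the default K = n case. A brief sketch, assuming a hypothetical data frame df and a fitted glm object named fit:
library(boot)
cv10  <- cv.glm(df, fit, K = 10)$delta[1]   # 10-fold CV: the model is refit 10 times
cvLoo <- cv.glm(df, fit)$delta[1]           # LOOCV: default K = n, so n refits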
Problem 5 - Part A
library(ISLR)   # Default data set
library(boot)   # boot() and cv.glm()
default <- Default
attach(default)
## The following object is masked _by_ .GlobalEnv:
##
## default
In this chunk, I am creating a logistic regression model that uses income and balance to predict default.
set.seed(1)
defaultGLM <- glm(default~income+balance, data=default, family=binomial)
Problem 5 - Part B
set.seed(1)
split <- sample(nrow(default), nrow(default) * 0.5)
defaultTrain <- default[split,]
defaultTest <- default[-split,]
defaultGLM.2 <- glm(default~income+balance, data=defaultTrain, family=binomial)
defaultProb <- predict(defaultGLM.2, defaultTest, type="response")
defaultPred <- ifelse(defaultProb > 0.5, "Yes", "No")
table(defaultPred, defaultTest$default)
##
## defaultPred No Yes
## No 4824 108
## Yes 19 49
2.54% of the observations in the validation set are misclassified.
mean(defaultTest$default != defaultPred)
## [1] 0.0254
Problem 5 - Part C
Repeating the process three times with different random splits of the observations does not substantially change the test error rate: in each case the error rate falls between 2.4% and 2.9%, averaging approximately 2.6%.
set.seed(7)
split.2 <- sample(nrow(default), nrow(default) * 0.5)
defaultTrain.2 <- default[split.2,]
defaultTest.2 <- default[-split.2,]
defaultGLM.3 <- glm(default~income+balance, data=defaultTrain.2, family=binomial)
defaultProb.2 <- predict(defaultGLM.3, defaultTest.2, type="response")
defaultPred.2 <- ifelse(defaultProb.2 > 0.5, "Yes", "No")
table(defaultPred.2, defaultTest.2$default)
##
## defaultPred.2 No Yes
## No 4823 99
## Yes 22 56
mean(defaultTest.2$default != defaultPred.2)
## [1] 0.0242
set.seed(10)
split.3 <- sample(nrow(default), nrow(default) * 0.5)
defaultTrain.3 <- default[split.3,]
defaultTest.3 <- default[-split.3,]
defaultGLM.4 <- glm(default~income+balance, data=defaultTrain.3, family=binomial)
defaultProb.3 <- predict(defaultGLM.4, defaultTest.3, type="response")
defaultPred.3 <- ifelse(defaultProb.3 > 0.5, "Yes", "No")
table(defaultPred.3, defaultTest.3$default)
##
## defaultPred.3 No Yes
## No 4815 125
## Yes 19 41
mean(defaultTest.3$default != defaultPred.3)
## [1] 0.0288
set.seed(4)
split.4 <- sample(nrow(default), nrow(default) * 0.5)
defaultTrain.4 <- default[split.4,]
defaultTest.4 <- default[-split.4,]
defaultGLM.5 <- glm(default~income+balance, data=defaultTrain.4, family=binomial)
defaultProb.4 <- predict(defaultGLM.5, defaultTest.4, type="response")
defaultPred.4 <- ifelse(defaultProb.4 > 0.5, "Yes", "No")
table(defaultPred.4, defaultTest.4$default)
##
## defaultPred.4 No Yes
## No 4815 107
## Yes 21 57
mean(defaultTest.4$default != defaultPred.4)
## [1] 0.0256
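The three repetitions above could equally be condensed into a single loop over the seeds; a sketch, assuming the same split-and-fit logic:
sapply(c(7, 10, 4), function(s) {
  set.seed(s)
  idx <- sample(nrow(default), nrow(default) * 0.5)
  fit <- glm(default ~ income + balance, data = default[idx, ], family = binomial)
  prob <- predict(fit, default[-idx, ], type = "response")
  mean(default$default[-idx] != ifelse(prob > 0.5, "Yes", "No"))  # test error rate
})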
Problem 5 - Part D
set.seed(1)
split.5 <- sample(nrow(default), nrow(default) * 0.5)
defaultTrain.5 <- default[split.5,]
defaultTest.5 <- default[-split.5,]
defaultGLM.6 <- glm(default~income+balance+student, data=defaultTrain.5, family=binomial)
defaultProb.5 <- predict(defaultGLM.6, defaultTest.5, type="response")
defaultPred.5 <- ifelse(defaultProb.5 > 0.5, "Yes", "No")
table(defaultPred.5, defaultTest.5$default)
##
## defaultPred.5 No Yes
## No 4825 112
## Yes 18 45
The inclusion of the student variable does not seem to have a noticeable effect on the test error rate. The test error rate is 2.6%, which is very similar to the rates for the previous models.
mean(defaultTest.5$default != defaultPred.5)
## [1] 0.026
Problem 6 - Part A
In this GLM, the standard errors for the coefficients on income and balance are 4.985e-06 and 2.274e-04, respectively.
defaultGLM.7 <- glm(default~income+balance, data=default, family=binomial)
summary(defaultGLM.7)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Problem 6 - Part B
In this chunk, I am creating the statistic function for boot(), which returns the coefficient estimates for income and balance from the logistic regression model.
defaultBoot.fn <- function(data, index)
return(coef(glm(default~income+balance, data=data, family=binomial, subset=index)))
Problem 6 - Part C
In this chunk, I am using boot() to estimate the standard errors of the logistic regression coefficients for income (4.866284e-06) and balance (2.298949e-04).
set.seed(1)
boot(default, defaultBoot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = default, statistic = defaultBoot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2* 2.080898e-05 1.680317e-07 4.866284e-06
## t3* 5.647103e-03 1.855765e-05 2.298949e-04
Problem 6 - Part D
The bootstrap standard errors for income and balance are 4.866284e-06 and 2.298949e-04, respectively. These values are very close to the standard errors reported by glm() (4.985e-06 and 2.274e-04).
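The two sets of standard errors can also be compared side by side; a small sketch, assuming the boot() call from Part C was saved to a hypothetical object defaultBoot:
seGlm  <- summary(defaultGLM.7)$coefficients[, "Std. Error"]  # formula-based SEs
seBoot <- apply(defaultBoot$t, 2, sd)                         # SEs from the bootstrap replicates
cbind(glm = seGlm, bootstrap = seBoot)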
Problem 9 - Part A
library(MASS)   # Boston data set
boston <- Boston
attach(boston)
In this chunk, I create the variable medvMu, which is the estimate of the population mean of medv in the Boston data set.
medvMu <- mean(medv)
medvMu
## [1] 22.53281
Problem 9 - Part B
In this chunk, I obtain the standard error estimate of medvMu.
medvSE <- sd(medv)/sqrt(length(medv))   # standard error of the sample mean
medvSE
## [1] 0.4088611
Problem 9 - Part C
bostonBoot.fn <- function(data, index)
return(mean(data[index]))
The standard error obtained from the bootstrap is 0.4106622, which is only slightly higher than the estimate of 0.4088611 obtained in Part B.
set.seed(1)
bostonBoot <- boot(medv, bostonBoot.fn, 1000)
bostonBoot
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = bostonBoot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
Problem 9 - Part D
t.test(medv)
##
## One Sample t-test
##
## data: medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
The confidence interval obtained from the bootstrap below is nearly identical to the confidence interval from the one-sample t-test (the bounds differ by less than 0.02).
medvCI <- c(22.53281 - 2 * 0.4106622, 22.53281 + 2 * 0.4106622)
medvCI
## [1] 21.71149 23.35413
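An alternative to the mean ± 2·SE interval above is boot.ci() from the boot package, which can compute, for example, a percentile interval directly from the bostonBoot object saved in Part C:
boot.ci(bostonBoot, type = "perc")   # percentile bootstrap CI for the mean of medv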
Problem 9 - Part E
In this chunk, I obtain the estimate for the median value of medv.
medvMedian <- median(medv)
medvMedian
## [1] 21.2
Problem 9 - Part F
bostonBoot.fn.2 <- function(data, index)
return(median(data[index]))
The estimated standard error of the median obtained from the bootstrap is 0.3778075. The median value of 21.2 is identical to the median obtained in Part E. Relative to the median, the estimated standard error is quite small.
set.seed(1)
boot(medv, bostonBoot.fn.2, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = bostonBoot.fn.2, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.02295 0.3778075
Problem 9 - Part G
In this chunk, I am creating the variable that holds the estimate of the tenth percentile of medv.
medvMu0.1 <- quantile(medv, 0.1)
medvMu0.1
## 10%
## 12.75
Problem 9 - Part H
bostonBoot.fn.3 <- function(data, index)
return(quantile(data[index], 0.1))
The estimated standard error of the tenth percentile of medv obtained from the bootstrap is 0.4767526. Relative to the tenth-percentile value, this standard error is quite small.
set.seed(1)
boot(medv, bostonBoot.fn.3, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = bostonBoot.fn.3, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0339 0.4767526