Exercise 3

Review k-fold cross-validation.

(a) Explain how k-fold cross-validation is implemented.

The k-fold cross-validation is implemented by randomly dividing the observations into k - groups of roughly same size. One set is held out for use as a validation set and the remaining sets are used to train(fit) the model. The validation errors are the calculated on the held-out set. This is repeated k times with different group of observation held-out as a validation set and results in k- estimates of the test error. The k-fold cross validation error is calculated as an average of the k - test error estimates.

(b) What are the advantages and disadvantages of k-fold crossvalidation relative to:

i. The validation set approach?
  • We are using all the data in different folds avoiding the decrease of observations that happens in the validation set approach.
  • this approach leads to better estimates of the test error in contrast to the overestimation of test error rates in the validation set approach, where the model is only fit on a partial set of data(often half).
  • also, the variability from k-fold approach is typically much lower than the variability in the test error estimates that results from the validation set approach.
  • computationally expensive relative to validation set approach as model is fitted k-times.
ii. LOOCV?
  • the k-fold cross-validation(where, k < n) approach is not as computationally as LOOCV and can be used to any statistical learning method and extremely large n. 
  • In the k <n, case k-fold cv have higher bias than LOOCV, but lower variance than that of LOOCV. The k=5 and k=10 CVs are observed have the best bias-variance trade-off leading to test error rate estimates that suffer neither from excessively high bias nor from very high variance.


Exercise 5

Estimate the test error of this logistic regression model using the validation set approach. #### (a) Fit a logistic regression model that uses income and balance to predict default.

default = Default
str(default)
## 'data.frame':    10000 obs. of  4 variables:
##  $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
##  $ balance: num  730 817 1074 529 786 ...
##  $ income : num  44362 12106 31767 35704 38463 ...

50 % training - testing set split

set.seed(1)
train=sample(nrow(default),nrow(default)/2)
train_d = default[train,]
test_d = default[-train,]
Fit logistic regression model on training set
glm.fit = glm(default ~ balance + income, data = default, family = binomial, subset = train )

Predict defaul from the validation set

prob_d = predict(glm.fit, newdata = test_d, type= "response")
pred_d = rep("No",nrow(test_d))
pred_d[prob_d>.5]="Yes"
Comupute validation test error
#table(pred_d,test_d$default)
mean(pred_d!=test_d$default)
## [1] 0.0254

(c) Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

80 % training - testing set split

set.seed(1)
train.8=sample(nrow(default),nrow(default)*0.8)
train_d.8 = default[train.8,]
test_d.8 = default[-train.8,]
Fit logistic regression model on training set
glm.fit.8 = glm(default ~ balance + income, data = default, family = binomial, subset = train.8 )

Predict defaul from the validation set

prob_d.8 = predict(glm.fit.8, newdata = test_d.8, type= "response")
pred_d.8 = rep("No",nrow(test_d.8))
pred_d.8[prob_d.8>.5]="Yes"
Comupute validation test error
#table(pred_d,test_d$default)
mean(pred_d.8!=test_d.8$default)
## [1] 0.026

70 % training - testing set split

set.seed(1)
train.7=sample(nrow(default),nrow(default)*0.7)
train_d.7 = default[train.7,]
test_d.7 = default[-train.7,]
Fit logistic regression model on training set
glm.fit.7 = glm(default ~ balance + income, data = default, family = binomial, subset = train.7 )

Predict defaul from the validation set

prob_d.7 = predict(glm.fit.7, newdata = test_d.7, type= "response")
pred_d.7 = rep("No",nrow(test_d.7))
pred_d.7[prob_d.7>.5]="Yes"
Comupute validation test error
#table(pred_d,test_d$default)
mean(pred_d.7!=test_d.7$default)
## [1] 0.02666667

60 % training - testing set split

set.seed(1)
train.6=sample(nrow(default),nrow(default)*0.6)
train_d.6 = default[train.6,]
test_d.6 = default[-train.6,]
Fit logistic regression model on training set
glm.fit.6 = glm(default ~ balance + income, data = default, family = binomial, subset = train.6 )

Predict defaul from the validation set

prob_d.6 = predict(glm.fit.6, newdata = test_d.6, type= "response")
pred_d.6 = rep("No",nrow(test_d.6))
pred_d.6[prob_d.6>.5]="Yes"
Comupute validation test error
#table(pred_d,test_d$default)
mean(pred_d.6!=test_d.6$default)
## [1] 0.025

- Fitting the model changes the validation test error on different test - train split, but the values are virtually the same in this particular dataset.

(d) Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

Fit logistic regression model on training set using all variables
glm.fit.all = glm(default ~ balance + income + student, data = default, family = binomial, subset = train )
summary(glm.fit.all)
## 
## Call:
## glm(formula = default ~ balance + income + student, family = binomial, 
##     data = default, subset = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5823  -0.1419  -0.0554  -0.0210   3.3961  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.134e+01  6.937e-01 -16.346   <2e-16 ***
## balance      5.767e-03  3.213e-04  17.947   <2e-16 ***
## income       1.686e-05  1.122e-05   1.502   0.1331    
## studentYes  -5.992e-01  3.324e-01  -1.803   0.0715 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1523.77  on 4999  degrees of freedom
## Residual deviance:  800.07  on 4996  degrees of freedom
## AIC: 808.07
## 
## Number of Fisher Scoring iterations: 8

Predict defaul from the validation set

prob_d_all = predict(glm.fit.all, newdata = test_d, type= "response")
pred_d_all = rep("No",nrow(test_d))
pred_d_all[prob_d_all>.5]="Yes"
Comupute validation test error
#table(pred_d,test_d$default)
mean(pred_d_all!=test_d$default)
## [1] 0.026

- Adding the dummy variable “student” didn’t improve the model.


Exercise 6

We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.

(a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

set.seed(1)
glm.fit = glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit)$coef
##                  Estimate   Std. Error    z value      Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income       2.080898e-05 4.985167e-06   4.174178  2.990638e-05
## balance      5.647103e-03 2.273731e-04  24.836280 3.638120e-136

(b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

boot.fn = function(data, index) 
  return (coef(glm(default ~ income + balance, 
    data = data, family = binomial, subset = index)))

(c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

set.seed(1)
boot(default ,boot.fn ,1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2*  2.080898e-05  1.680317e-07 4.866284e-06
## t3*  5.647103e-03  1.855765e-05 2.298949e-04

(d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

- The standard errors obtained using the glm() function and using the bootstrap function are very close.


Exercise 9

We will now consider the Boston housing data set, from the MASS library. #### (a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate ˆμ.

(medv_mean = mean(Boston$medv))
## [1] 22.53281

(b) Provide an estimate of the standard error of ˆμ. Interpret this result.

(medv_stderror = sd(Boston$medv)/sqrt(length(Boston$medv)))
## [1] 0.4088611

- The estimated standard error of the mean is 40.89%.

(c) Now estimate the standard error of ˆμ using the bootstrap. How does this compare to your answer from (b)?

set.seed(1)
boot.fn.mean=function (data ,index){ #initialize function to accept a data object and index object
 return (mean(data[index])) #calculate the statistic
}

(boot_mean = boot(Boston$medv, boot.fn.mean, R=1000))
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn.mean, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 0.007650791   0.4106622

- The standard error from the bootstrap method is marginally smaller than the standard error in (b).

(d) Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv).

c(boot_mean$t0 - 2*0.4106622, boot_mean$t0 + 2*0.4106622)
## [1] 21.71148 23.35413
t.test(Boston$medv)
## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

- The 95% confidence interval for the mean of medv calculated based on the bootstrap method is slightly wider than the interval from the One Sample t-test. However, the numbers are very close.

(e) Based on this data set, provide an estimate, ˆμmed, for themedian value of medv in the population.

median(Boston$medv)
## [1] 21.2

(f) Estimate the standard error of the median using the bootstrap. Comment on your findings.

set.seed(1)
boot.fn.med=function (data ,index){ 
 return (median(data[index])) 
}

(boot_med = boot(Boston$medv, boot.fn.med, R=1000))
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn.med, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 0.02295   0.3778075

- The estimated standard error of the median is 37.78%.

(g) Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity ˆμ0.1.

(medv_tenth = quantile(Boston$medv, c(0.1)))
##   10% 
## 12.75

(h) Use the bootstrap to estimate the standard error of ˆμ0.1. Comment on your findings.

set.seed(1)
boot.fn.tenth=function (data ,index){ 
 return (quantile(data[index], c(0.1))) 
}

(boot_tenth = boot(Boston$medv, boot.fn.tenth, R=1000))
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn.tenth, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75  0.0339   0.4767526

- The standard error for the 10th percentile of medv in Boston suburbs is 47.68%.