Q3

We will now review k-fold cross-validation.

(a)

Q:Explain how k-fold cross-validation is implemented.
A:
The data are divided into k distinct ‘folds’ of roughly equal size. The model is trained on k-1 folds and tested on the one remaining fold. This is repeated k times, so that each fold acts as the test data exactly once. The k test performance measures are then averaged, producing the cross-validation estimate.
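
A minimal sketch of the procedure in R, fitting a logistic regression on the Default data as a stand-in model (the choice of k = 5, the model, and the 0.5 threshold are illustrative assumptions):

library(ISLR)  # Default data set

set.seed(1)
k = 5
n = nrow(Default)

# Randomly assign each observation to one of k folds
folds = sample(rep(1:k, length.out = n))

cv.errs = rep(NA, k)
for (i in 1:k) {
  # Train on the other k-1 folds, test on fold i
  fit = glm(default ~ income + balance, family = "binomial",
            data = Default[folds != i, ])
  probs = predict(fit, Default[folds == i, ], type = "response")
  pred = ifelse(probs > 0.5, "Yes", "No")
  cv.errs[i] = mean(Default$default[folds == i] != pred)
}

mean(cv.errs)  # k-fold CV estimate of the test error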

(b)

Q:What are the advantages and disadvantages of k-fold cross-validation relative to:
i. The validation set approach?
ii. LOOCV?
A:
i. k-fold CV vs Validation Set
Advantages:

  • The k partitions in k-fold CV yield a less variable estimate of the test error than the validation set approach, which partitions the data only once. That single partition can produce skewed results and makes processes like model selection and parameter tuning highly dependent on which observations land in the training and test sets.

  • All the data is used to both train and test the model performance.

  • k-fold CV does not over-estimate the test error the way the validation set approach does. Model performance improves with more training data, and each k-fold training set uses a fraction (k-1)/k of the data rather than withholding a large proportion of it.

Disadvantages:

  • The validation set approach is easier to understand for a less technical audience.

  • The k-fold CV approach is computationally expensive, especially on large data sets with large values of k.

ii. k-fold CV vs LOOCV
Advantages:

  • k-fold CV is less computationally expensive. When k = 10 and n = 100, k-fold CV fits 10 models while LOOCV fits 100 (a comparison sketch follows this list).

  • Bias-variance tradeoff: k-fold CV can give a more accurate estimate of the test error rate than LOOCV because of LOOCV's high variance. The n models fit in LOOCV are trained on nearly identical data sets, so their test errors are highly correlated, and the mean of highly correlated quantities has higher variance than the mean of the less correlated estimates produced by k-fold CV.

Disadvantages:

  • k-fold CV introduces some randomness into the cross-validation metric through how the data are split into k folds; LOOCV has no such randomness, since its n splits are fully determined by the data.
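
As a rough illustration of both points, boot::cv.glm() runs either scheme: K = 10 refits the model 10 times, while the default K = n gives LOOCV with n refits (the misclassification cost function and the data set here are illustrative assumptions):

library(ISLR)
library(boot)

fit = glm(default ~ income + balance, family = "binomial", data = Default)

# Misclassification cost: cv.glm passes the 0/1 response and the
# predicted probabilities to this function
cost = function(y, prob) mean(abs(y - prob) > 0.5)

set.seed(1)
cv.glm(Default, fit, cost, K = 10)$delta[1]  # 10 model fits
# cv.glm(Default, fit, cost)$delta[1]        # LOOCV: 10,000 fits (slow)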

Q5

Applied:In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.

library(ISLR)   # Default data set
library(caret)  # createDataPartition()

summary(Default)
##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554

(a)

Q: Fit a logistic regression model that uses income and balance to predict default
A:

log.fit = glm(default ~ income + balance, family = "binomial", data = Default)

summary(log.fit)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

(b)

Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:

  • Split the sample set into a training set and a validation set.
  • Fit a multiple logistic regression model using only the training observations.
  • Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
  • Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.

(i.) train/test split using caret
Q:Split the sample set into a training set and a validation set.
A:I will use an 80/20 split for training and testing data, respectively.

set.seed(123)
# Stratified split: createDataPartition() samples within each level of
# default, keeping the No/Yes proportions similar in both sets
ind = createDataPartition(Default$default, p = .8, list = F)

train = Default[ind, ]
test = Default[-ind, ]

(ii.) fitting a logit model
Q:Fit a multiple logistic regression model using only the training observations.
A:

log.fit = glm(default ~ income + balance, family = "binomial", data = train)

summary(log.fit)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5331  -0.1411  -0.0561  -0.0203   3.7441  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.186e+01  5.007e-01 -23.691  < 2e-16 ***
## income       2.708e-05  5.670e-06   4.777 1.78e-06 ***
## balance      5.715e-03  2.574e-04  22.201  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2340.6  on 8000  degrees of freedom
## Residual deviance: 1252.2  on 7998  degrees of freedom
## AIC: 1258.2
## 
## Number of Fisher Scoring iterations: 8

(iii.) test predictions
Q:Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
A:
The predictions are stored in test_pred; the first 10 entries are shown below.

log.pred = predict(log.fit, test, type = "response")

test_pred = as.factor(ifelse(log.pred > 0.5, "Yes", "No"))

test_pred[1:10]
##  6 19 23 24 34 37 38 40 42 43 
## No No No No No No No No No No 
## Levels: No Yes

(iv.) test error
Q:Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
A:

val.set.err = round(mean(test$default != test_pred),5)

val.set.err
## [1] 0.03002

(c)

Repeating the train/test splits
Q:Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.
A:
70/30 Split
Using this split, the model had a lower misclassification rate than with the 80/20 split in (b).

set.seed(123)
ind = createDataPartition(Default$default, p = .7, list = F)

train = Default[ind, ]
test = Default[-ind, ]


log.fit1 = glm(default ~ income + balance, family = "binomial", data = train)

summary(log.fit1)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4738  -0.1512  -0.0618  -0.0230   3.6867  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.142e+01  5.131e-01 -22.250  < 2e-16 ***
## income       2.383e-05  5.973e-06   3.989 6.64e-05 ***
## balance      5.533e-03  2.681e-04  20.639  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2050.6  on 7000  degrees of freedom
## Residual deviance: 1139.1  on 6998  degrees of freedom
## AIC: 1145.1
## 
## Number of Fisher Scoring iterations: 8
log.pred = predict(log.fit1, test, type = "response")

test_pred = as.factor(ifelse(log.pred > 0.5, "Yes", "No"))

test_pred[1:10]
##  6 12 19 23 24 26 29 33 34 37 
## No No No No No No No No No No 
## Levels: No Yes
val.set.err = round(mean(test$default != test_pred),5)

val.set.err
## [1] 0.02534

50/50 Split
Using this split, the model had a slightly higher misclassification rate than with the 70/30 split.

set.seed(123)
ind = createDataPartition(Default$default, p = .5, list = F)

train = Default[ind, ]
test = Default[-ind, ]


log.fit2 = glm(default ~ income + balance, family = "binomial", data = train)

summary(log.fit2)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4704  -0.1447  -0.0582  -0.0215   3.7232  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.158e+01  6.209e-01 -18.656  < 2e-16 ***
## income       2.296e-05  7.206e-06   3.187  0.00144 ** 
## balance      5.617e-03  3.200e-04  17.555  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1463.76  on 5000  degrees of freedom
## Residual deviance:  791.05  on 4998  degrees of freedom
## AIC: 797.05
## 
## Number of Fisher Scoring iterations: 8
log.pred = predict(log.fit2, test, type = "response")

test_pred = as.factor(ifelse(log.pred > 0.5, "Yes", "No"))

test_pred[1:10]
##  3  4  6  7  8  9 10 11 12 14 
## No No No No No No No No No No 
## Levels: No Yes
val.set.err = round(mean(test$default != test_pred),5)

val.set.err
## [1] 0.02741

60/40 Split
Using this split, the misclassification rate (0.02576) fell between the 70/30 and 50/50 results. The lowest misclassification rate came from the 70/30 train/test split. The spread across these four estimates illustrates how sensitive the validation set approach is to the particular partition.

set.seed(123)
ind = createDataPartition(Default$default, p = .6, list = F)

train = Default[ind, ]
test = Default[-ind, ]


log.fit3 = glm(default ~ income + balance, family = "binomial", data = train)

summary(log.fit3)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4311  -0.1538  -0.0624  -0.0233   3.6830  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.124e+01  5.486e-01 -20.495  < 2e-16 ***
## income       1.936e-05  6.450e-06   3.002  0.00268 ** 
## balance      5.509e-03  2.881e-04  19.123  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1753.80  on 6000  degrees of freedom
## Residual deviance:  984.23  on 5998  degrees of freedom
## AIC: 990.23
## 
## Number of Fisher Scoring iterations: 8
log.pred = predict(log.fit3, test, type = "response")

test_pred = as.factor(ifelse(log.pred > 0.5, "Yes", "No"))

test_pred[1:10]
##  3  6  7  8 10 12 14 15 19 22 
## No No No No No No No No No No 
## Levels: No Yes
val.set.err = round(mean(test$default != test_pred),5)

val.set.err
## [1] 0.02576
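
The three repetitions above differ only in the split proportion, so the steps can be wrapped in a small helper to avoid the copy-and-paste (val.err is a hypothetical name; the body simply mirrors the code above):

# Hypothetical helper: fit on a p-fraction training split and
# return the validation set error rate
val.err = function(p, seed = 123) {
  set.seed(seed)
  ind = createDataPartition(Default$default, p = p, list = FALSE)
  fit = glm(default ~ income + balance, family = "binomial",
            data = Default[ind, ])
  prob = predict(fit, Default[-ind, ], type = "response")
  pred = ifelse(prob > 0.5, "Yes", "No")
  mean(Default$default[-ind] != pred)
}

sapply(c(.8, .7, .5, .6), val.err)  # the four splits used above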

(d)

Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

set.seed(23)
ind = createDataPartition(Default$default, p = 0.7, list = F)
train = Default[ind, ]
test = Default[-ind, ]

log.fit = glm(default ~ ., family = "binomial", data = train)
summary(log.fit)
## 
## Call:
## glm(formula = default ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5510  -0.1316  -0.0484  -0.0166   3.6428  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.090e+01  5.917e-01 -18.418  < 2e-16 ***
## studentYes  -1.080e+00  2.830e-01  -3.815 0.000136 ***
## balance      6.106e-03  2.956e-04  20.656  < 2e-16 ***
## income      -9.675e-06  9.739e-06  -0.993 0.320533    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2050.6  on 7000  degrees of freedom
## Residual deviance: 1048.6  on 6997  degrees of freedom
## AIC: 1056.6
## 
## Number of Fisher Scoring iterations: 8
log.pred = predict(log.fit, test, type = "response")

test_pred = as.factor(ifelse(log.pred > 0.5, "Yes", "No"))

test_pred[1:10]
##  5  6  9 10 12 13 14 17 19 21 
## No No No No No No No No No No 
## Levels: No Yes
val.set.err = round(mean(test$default != test_pred),5)

val.set.err
## [1] 0.03168

The studentYes coefficient is significant, but the question concerns the test error rate. The estimated test error (0.03168) is higher than that of the smaller model, so including the dummy variable for student does not reduce the test error rate. (Keep in mind that each estimate comes from a single validation split, which, as seen in (c), carries some variability.)

Q6

We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways:

  1. using the bootstrap
  2. using the standard formula for computing the standard errors in the glm() function.

Do not forget to set a random seed before beginning your analysis.
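
A minimal sketch of the bootstrap half of this comparison, using boot::boot() with a statistic that refits the model on each resample (boot.fn is a name of my choosing, and R = 100 is kept small for illustration):

library(ISLR)
library(boot)

# Statistic: refit the model on a bootstrap resample and return the
# income and balance coefficients
boot.fn = function(data, index) {
  coef(glm(default ~ income + balance, family = "binomial",
           data = data[index, ]))[c("income", "balance")]
}

set.seed(42)
boot(Default, boot.fn, R = 100)  # bootstrap SEs appear in the printed output

# Formula-based SEs from glm(), for comparison
summary(glm(default ~ income + balance, family = "binomial",
            data = Default))$coefficients[, "Std. Error"]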

Q9