We will now review k-fold cross-validation.
Q: Explain how k-fold cross-validation is implemented.
A:
The data is randomly divided into k distinct 'folds' of roughly equal size. The model is trained on k-1 folds and tested on the one remaining fold. This is repeated k times, so that each fold serves as the test set exactly once. The k test-set performance measures are then averaged to produce the cross-validation estimate.
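To make this concrete, below is a minimal sketch of manual k-fold CV for the logistic regression used later in this section. The fold assignment via sample() and the 0.5 classification threshold are illustrative choices; packages like boot and caret automate the same procedure.
library(ISLR)  # Default data set
set.seed(1)
k = 10
folds = sample(rep(1:k, length.out = nrow(Default)))  # random fold labels
cv.errors = numeric(k)
for (i in 1:k) {
  # Train on the k-1 folds not equal to i
  fit = glm(default ~ income + balance, family = "binomial",
            data = Default[folds != i, ])
  # Test on the single held-out fold
  probs = predict(fit, Default[folds == i, ], type = "response")
  preds = ifelse(probs > 0.5, "Yes", "No")
  cv.errors[i] = mean(preds != Default$default[folds == i])
}
mean(cv.errors)  # the k-fold CV estimate of the test error rate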
Q: What are the advantages and disadvantages of k-fold cross-validation relative to:
i. The validation set approach?
ii. LOOCV?
A:
i. k-fold CV vs Validation Set
Advantages:
The k partitions in k-fold CV produce a less variable estimate of the test error than the validation set approach, which partitions the data only once. A single split can yield skewed results and makes model selection and parameter tuning highly dependent on which observations land in the training and test sets.
All of the data is eventually used both to train the model and to evaluate its performance.
It does not over-estimate the test error the way the validation set approach does: model performance generally improves with more data, and k-fold CV withholds only a small fraction of the observations from each training fit.
Disadvantages:
The validation set approach is simpler and easier to explain to a less technical audience.
The k-fold CV approach is computationally expensive, especially on large data sets with large values of k.
ii. k-fold CV vs LOOCV
Advantages:
k-fold CV is less computationally expensive: with k = 10 and n = 100, k-fold CV fits 10 models while LOOCV fits 100 (one per observation).
Bias-variance tradeoff: k-fold CV can give a more accurate estimate of the test error rate than LOOCV, because LOOCV has higher variance. The n models fit in LOOCV are trained on nearly identical data, so their errors are highly correlated, and averaging highly correlated quantities reduces variance less than averaging the less correlated fits produced by k-fold CV.
Disadvantages:
k-fold CV gives a more biased estimate of the test error than LOOCV, since each model is trained on roughly (k-1)/k of the observations rather than n-1 of them.
The k-fold estimate also depends on the random assignment of observations to folds, whereas LOOCV is deterministic and always yields the same result.
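For reference, a minimal sketch of both estimators via boot::cv.glm(); the misclassification cost function is my own choice, since cv.glm() defaults to squared error.
library(ISLR)
library(boot)
cost = function(y, p) mean(abs(y - p) > 0.5)  # misclassification cost
fit = glm(default ~ income + balance, family = "binomial", data = Default)
set.seed(1)
cv.glm(Default, fit, cost, K = 10)$delta[1]  # 10-fold: only 10 model fits
# LOOCV (K defaults to n) would fit 10,000 models on this data set:
# cv.glm(Default, fit, cost)$delta[1]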
Applied: In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.
library(ISLR)   # provides the Default data set
library(caret)  # provides createDataPartition() used below
summary(Default)
##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554  
Q: Fit a logistic regression model that uses income and balance to predict default.
A:
log.fit = glm(default ~ income + balance, family = "binomial", data = Default)
summary(log.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:
(i.) Train/test split using caret
Q: Split the sample set into a training set and a validation set.
A: I will use an 80/20 split for training and testing data, respectively.
set.seed(123)
ind = createDataPartition(Default$default, p = .8, list = FALSE)  # stratified split on default
train = Default[ind, ]
test = Default[-ind, ]
(ii.) Fitting a logit model
Q: Fit a multiple logistic regression model using only the training observations.
A:
log.fit = glm(default ~ income + balance, family = "binomial", data = train)
summary(log.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5331 -0.1411 -0.0561 -0.0203 3.7441
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.186e+01 5.007e-01 -23.691 < 2e-16 ***
## income 2.708e-05 5.670e-06 4.777 1.78e-06 ***
## balance 5.715e-03 2.574e-04 22.201 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2340.6 on 8000 degrees of freedom
## Residual deviance: 1252.2 on 7998 degrees of freedom
## AIC: 1258.2
##
## Number of Fisher Scoring iterations: 8
(iii.) Test predictions
Q: Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
A:
The predictions are stored in the variable test_pred; the first 10 entries are shown below.
log.pred = predict(log.fit, test, type = "response")
test_pred = as.factor(ifelse(log.pred > 0.5, "Yes", "No"))
test_pred[1:10]
## 6 19 23 24 34 37 38 40 42 43
## No No No No No No No No No No
## Levels: No Yes
(iv.) Test error
Q: Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
A:
val.set.err = round(mean(test$default != test_pred), 5)  # fraction of test set misclassified
val.set.err
## [1] 0.03002
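Beyond the overall error rate, it can help to see where the misclassifications occur. A quick cross-tabulation (not part of the original output) would look like:
table(predicted = test_pred, actual = test$default)
# or equivalently, with caret:
# confusionMatrix(test_pred, test$default, positive = "Yes")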
Repeating the train/test splits
Q: Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.
A:
70/30 Split
Using this split, the model had a lower misclassification rate than with the 80/20 split.
set.seed(123)
ind = createDataPartition(Default$default, p = .7, list = F)
train = Default[ind, ]
test = Default[-ind, ]
log.fit1 = glm(default ~ income + balance, family = "binomial", data = train)
summary(log.fit1)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4738 -0.1512 -0.0618 -0.0230 3.6867
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.142e+01 5.131e-01 -22.250 < 2e-16 ***
## income 2.383e-05 5.973e-06 3.989 6.64e-05 ***
## balance 5.533e-03 2.681e-04 20.639 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2050.6 on 7000 degrees of freedom
## Residual deviance: 1139.1 on 6998 degrees of freedom
## AIC: 1145.1
##
## Number of Fisher Scoring iterations: 8
log.pred = predict(log.fit1, test, type = "response")
test_pred = as.factor(ifelse(log.pred > 0.5, "Yes", "No"))
test_pred[1:10]
## 6 12 19 23 24 26 29 33 34 37
## No No No No No No No No No No
## Levels: No Yes
val.set.err = round(mean(test$default != test_pred),5)
val.set.err
## [1] 0.02534
50/50 Split
Using this split, the misclassification rate was slightly higher than with the 70/30 split.
set.seed(123)
ind = createDataPartition(Default$default, p = .5, list = F)
train = Default[ind, ]
test = Default[-ind, ]
log.fit2 = glm(default ~ income + balance, family = "binomial", data = train)
summary(log.fit2)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4704 -0.1447 -0.0582 -0.0215 3.7232
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.158e+01 6.209e-01 -18.656 < 2e-16 ***
## income 2.296e-05 7.206e-06 3.187 0.00144 **
## balance 5.617e-03 3.200e-04 17.555 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1463.76 on 5000 degrees of freedom
## Residual deviance: 791.05 on 4998 degrees of freedom
## AIC: 797.05
##
## Number of Fisher Scoring iterations: 8
log.pred = predict(log.fit2, test, type = "response")
test_pred = as.factor(ifelse(log.pred > 0.5, "Yes", "No"))
test_pred[1:10]
## 3 4 6 7 8 9 10 11 12 14
## No No No No No No No No No No
## Levels: No Yes
val.set.err = round(mean(test$default != test_pred),5)
val.set.err
## [1] 0.02741
60/40 Split
Using this split, the misclassification rate (0.02576) fell between the 70/30 and 50/50 results. Of the splits tried, the 70/30 train/test split produced the lowest misclassification rate. The estimates clearly move with the particular split, which illustrates the main weakness of the validation set approach (see the sketch after the 60/40 output below).
set.seed(123)
ind = createDataPartition(Default$default, p = .6, list = F)
train = Default[ind, ]
test = Default[-ind, ]
log.fit3 = glm(default ~ income + balance, family = "binomial", data = train)
summary(log.fit3)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4311 -0.1538 -0.0624 -0.0233 3.6830
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.124e+01 5.486e-01 -20.495 < 2e-16 ***
## income 1.936e-05 6.450e-06 3.002 0.00268 **
## balance 5.509e-03 2.881e-04 19.123 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1753.80 on 6000 degrees of freedom
## Residual deviance: 984.23 on 5998 degrees of freedom
## AIC: 990.23
##
## Number of Fisher Scoring iterations: 8
log.pred = predict(log.fit3, test, type = "response")
test_pred = as.factor(ifelse(log.pred > 0.5, "Yes", "No"))
test_pred[1:10]
## 3 6 7 8 10 12 14 15 19 22
## No No No No No No No No No No
## Levels: No Yes
val.set.err = round(mean(test$default != test_pred),5)
val.set.err
## [1] 0.02576
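To quantify how much the validation estimate moves with the split, one could repeat the 70/30 split over several seeds. A minimal sketch (the ten seeds are an arbitrary choice):
errs = sapply(1:10, function(s) {
  set.seed(s)
  ind = createDataPartition(Default$default, p = .7, list = FALSE)
  fit = glm(default ~ income + balance, family = "binomial", data = Default[ind, ])
  probs = predict(fit, Default[-ind, ], type = "response")
  mean(ifelse(probs > 0.5, "Yes", "No") != Default$default[-ind])
})
summary(errs)  # spread of the validation set error estimates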
Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.
set.seed(23)
ind = createDataPartition(Default$default, p = 0.7, list = F)
train = Default[ind, ]
test = Default[-ind, ]
log.fit = glm(default ~ ., family = "binomial", data = train)
summary(log.fit)
##
## Call:
## glm(formula = default ~ ., family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5510 -0.1316 -0.0484 -0.0166 3.6428
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.090e+01 5.917e-01 -18.418 < 2e-16 ***
## studentYes -1.080e+00 2.830e-01 -3.815 0.000136 ***
## balance 6.106e-03 2.956e-04 20.656 < 2e-16 ***
## income -9.675e-06 9.739e-06 -0.993 0.320533
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2050.6 on 7000 degrees of freedom
## Residual deviance: 1048.6 on 6997 degrees of freedom
## AIC: 1056.6
##
## Number of Fisher Scoring iterations: 8
log.pred = predict(log.fit, test, type = "response")
test_pred = as.factor(ifelse(log.pred > 0.5, "Yes", "No"))
test_pred[1:10]
## 5 6 9 10 12 13 14 17 19 21
## No No No No No No No No No No
## Levels: No Yes
val.set.err = round(mean(test$default != test_pred),5)
val.set.err
## [1] 0.03168
The studentYes coefficient is statistically significant, but our concern here is the test error rate. At 0.03168, the error is no lower than that of the income-and-balance model, so including the dummy variable for student does not appear to reduce the test error rate. Note, however, that this fit used a different random split (seed 23) than the earlier models; a like-for-like check is sketched below.
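Because the two models were evaluated on different random splits, a fair comparison would fit both on the same split. A minimal sketch reusing the train/test objects created just above:
small.fit = glm(default ~ income + balance, family = "binomial", data = train)
full.fit = glm(default ~ ., family = "binomial", data = train)
err = function(fit) {
  preds = ifelse(predict(fit, test, type = "response") > 0.5, "Yes", "No")
  mean(preds != test$default)  # misclassification rate on the shared test set
}
c(without.student = err(small.fit), with.student = err(full.fit))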
We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: using the standard formulas reported by summary() of the glm() fit, and using the bootstrap.
Do not forget to set a random seed before beginning your analysis.
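As a preview of the bootstrap approach, a minimal sketch using the boot package; the statistic function boot.fn and the R = 100 replicates are illustrative choices.
library(boot)
set.seed(42)
# Returns the income and balance coefficients fit on a bootstrap sample of rows
boot.fn = function(data, index) {
  coef(glm(default ~ income + balance, family = "binomial",
           data = data, subset = index))[c("income", "balance")]
}
boot(Default, boot.fn, R = 100)  # bootstrap SEs, to compare with summary(log.fit)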