3.We now review k-fold cross-validation.
The validation set approach? -the approach is done by separating the training sets into two different sets. This method could cause problems depending on which observations are included in each set because it can end up giving a variable test error rate which is not a good thing. The validation set error rate could create issues by overestimating the test error rate for the whole data set.
LOOCV? -LOOCV is a special type of k-fold cross validation. This type of cross-validation is the most complicated because the model must fit n times. The method also has a higher variance with a lower bias than the k-fold CV.
5.) In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.
library(ISLR)
data(Default)
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
logmodel1 = glm(default ~ balance + income, data = Default, family = binomial)
summary(logmodel1)
##
## Call:
## glm(formula = default ~ balance + income, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
trainDefault = sample(dim(Default)[1], dim(Default)[1]*0.50)
testDefault = Default[-trainDefault, ]
ii. Fit a multiple logistic regression model using only the training observations.
LOGmodel2 = glm(default ~ balance + income, data = Default, family = binomial, subset = trainDefault)
iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of
default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
log.prob_def = predict(LOGmodel2, testDefault, type = "response")
log.pred_def = rep("No", dim(Default)[1]*0.50)
log.pred_def[log.prob_def > 0.5] = "Yes"
table(log.pred_def, testDefault$default)
##
## log.pred_def No Yes
## No 4812 109
## Yes 25 54
iv. Compute the validation set error, which is the fraction of
the observations in the validation set that are misclassified.
mean(log.pred_def !=testDefault$default)
## [1] 0.0268
the test error rate comes out to 2.74%
trainDefault = sample(dim(Default)[1], dim(Default)[1]*0.50)
testDefault = Default[-trainDefault, ]
LOGmodel2 = glm(default ~ balance + income, data = Default, family = binomial, subset = trainDefault)
log.prob_def = predict(LOGmodel2, testDefault, type = "response")
log.pred_def = rep("No", dim(Default)[1]*0.50)
log.pred_def[log.prob_def > 0.5] = "Yes"
table(log.pred_def, testDefault$default)
##
## log.pred_def No Yes
## No 4821 124
## Yes 9 46
mean(log.pred_def !=testDefault$default)
## [1] 0.0266
trainDefault = sample(dim(Default)[1], dim(Default)[1]*0.50)
testDefault = Default[-trainDefault, ]
LOGmodel2 = glm(default ~ balance + income, data = Default, family = binomial, subset = trainDefault)
log.prob_def = predict(LOGmodel2, testDefault, type = "response")
log.pred_def = rep("No", dim(Default)[1]*0.50)
log.pred_def[log.prob_def > 0.5] = "Yes"
table(log.pred_def, testDefault$default)
##
## log.pred_def No Yes
## No 4820 102
## Yes 28 50
mean(log.pred_def !=testDefault$default)
## [1] 0.026
trainDefault = sample(dim(Default)[1], dim(Default)[1]*0.50)
testDefault = Default[-trainDefault, ]
LOGmodel2 = glm(default ~ balance + income, data = Default, family = binomial, subset = trainDefault)
log.prob_def = predict(LOGmodel2, testDefault, type = "response")
log.pred_def = rep("No", dim(Default)[1]*0.50)
log.pred_def[log.prob_def > 0.5] = "Yes"
table(log.pred_def, testDefault$default)
##
## log.pred_def No Yes
## No 4822 105
## Yes 15 58
mean(log.pred_def !=testDefault$default)
## [1] 0.024
the above functions are the three different splits to validate info
trainDefault = sample(dim(Default)[1], dim(Default)[1]*0.50)
testDefault = Default[-trainDefault, ]
LOGmodel2 = glm(default ~ balance + income + student, data = Default, family = binomial, subset = trainDefault)
log.prob_def = predict(LOGmodel2, testDefault, type = "response")
log.pred_def = rep("No", dim(Default)[1]*0.50)
log.pred_def[log.prob_def > 0.5] = "Yes"
table(log.pred_def, testDefault$default)
##
## log.pred_def No Yes
## No 4809 122
## Yes 20 49
mean(log.pred_def !=testDefault$default)
## [1] 0.0284
the test error comes out to 2.82% which is not too different from the other splits. the variable added did not cause a big impacct to the error rate
6.We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.
library(ISLR)
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
attach(Default)
set.seed(1)
glm.fit = glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
the standard error coefficients for incime and balance are 2.274e-04 and 4.985e-06
boot.fn = function(data, index) return(coef(glm(default ~ balance + income, data = data, family = binomial, subset = index)))
library(boot)
boot(Default, boot.fn, 100)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 8.556378e-03 4.122015e-01
## t2* 5.647103e-03 -4.116657e-06 2.226242e-04
## t3* 2.080898e-05 -3.993598e-07 4.186088e-06
the standard errors found come out to be 4.122015e-01,2.226242e-04,4.186088e-06 using the bootstrap function
9.9.We will now consider the Boston housing data set, from the MASS library. (a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate ˆμ.
library(MASS)
data(Boston)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
attach(Boston)
mean.medv = mean(medv)
mean.medv
## [1] 22.53281
the output comes out to 22.53281 (b) Provide an estimate of the standard error of ˆμ. Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.
stderr.mean = sd(medv)/sqrt(length(medv))
stderr.mean
## [1] 0.4088611
boot.fn2 = function(data, index) return(mean(data[index]))
boot2 = boot(medv, boot.fn2, 100)
boot2
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn2, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.02923123 0.3997709
there is a standard error of 39.97% which makes it an apporiate comparison based on the formulas used
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
CI.bos = c(22.53 - 2 * 0.4174872, 22.53 + 2 * 0.4174872)
CI.bos
## [1] 21.69503 23.36497
median.medv = median(medv)
median.medv
## [1] 21.2
boot.fn3 = function(data, index) return(median(data[index]))
boot3 = boot(medv, boot.fn3, 1000)
boot3
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn3, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0146 0.3804952
we find a median of 21.2 with a standard error of 38.05%
tenth.medv = quantile(medv, c(0.1))
tenth.medv
## 10%
## 12.75
boot.fn4 = function(data, index) return(quantile(data[index], c(0.1)))
boot4 = boot(medv, boot.fn4, 1000)
boot4
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn4, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.00575 0.5188339
we see that the quantile value same as above of 12.75 but with a diffrent standard error of 51.88%