Review k-fold cross-validation.

k-fold cross-validation is implemented by randomly dividing the observations into k groups (folds) of roughly equal size. One fold is held out for use as a validation set, and the model is trained (fit) on the remaining k - 1 folds. The validation error is then calculated on the held-out fold. This procedure is repeated k times, with a different fold held out as the validation set each time, producing k estimates of the test error. The k-fold cross-validation error is the average of these k estimates.
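As a concrete sketch of the procedure, the loop below performs 5-fold cross-validation by hand; the simulated data and the quadratic fit are illustrative assumptions, not part of the exercise.

set.seed(1)
x = rnorm(100)
y = x - 2*x^2 + rnorm(100)                        # simulated regression data, for illustration only
sim = data.frame(x, y)
k = 5
folds = sample(rep(1:k, length.out = nrow(sim)))  # randomly assign each observation to one of k folds
cv.errors = rep(NA, k)
for (i in 1:k) {
  fit = lm(y ~ poly(x, 2), data = sim[folds != i, ])  # train on the k - 1 remaining folds
  pred = predict(fit, newdata = sim[folds == i, ])    # predict on the held-out fold
  cv.errors[i] = mean((sim$y[folds == i] - pred)^2)   # validation MSE for fold i
}
mean(cv.errors)                                   # the k-fold CV estimate of the test error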
Estimate the test error of this logistic regression model using the validation set approach.

#### (a) Fit a logistic regression model that uses income and balance to predict default.
library(ISLR)   # Default data set
library(boot)   # boot() for the bootstrap
library(MASS)   # Boston data set
default = Default
str(default)
## 'data.frame': 10000 obs. of 4 variables:
## $ default: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ student: Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 2 1 1 ...
## $ balance: num 730 817 1074 529 786 ...
## $ income : num 44362 12106 31767 35704 38463 ...
set.seed(1)
train = sample(nrow(default), nrow(default)/2)   # 50/50 train/test split
train_d = default[train, ]
test_d = default[-train, ]
glm.fit = glm(default ~ balance + income, data = default, family = binomial, subset = train)
prob_d = predict(glm.fit, newdata = test_d, type = "response")
pred_d = rep("No", nrow(test_d))   # predict "Yes" when the posterior probability exceeds 0.5
pred_d[prob_d > .5] = "Yes"
#table(pred_d,test_d$default)
mean(pred_d!=test_d$default)
## [1] 0.0254
set.seed(1)
train.8=sample(nrow(default),nrow(default)*0.8)
train_d.8 = default[train.8,]
test_d.8 = default[-train.8,]
glm.fit.8 = glm(default ~ balance + income, data = default, family = binomial, subset = train.8 )
prob_d.8 = predict(glm.fit.8, newdata = test_d.8, type= "response")
pred_d.8 = rep("No",nrow(test_d.8))
pred_d.8[prob_d.8>.5]="Yes"
#table(pred_d.8, test_d.8$default)
mean(pred_d.8!=test_d.8$default)
## [1] 0.026
set.seed(1)
train.7=sample(nrow(default),nrow(default)*0.7)
train_d.7 = default[train.7,]
test_d.7 = default[-train.7,]
glm.fit.7 = glm(default ~ balance + income, data = default, family = binomial, subset = train.7 )
prob_d.7 = predict(glm.fit.7, newdata = test_d.7, type= "response")
pred_d.7 = rep("No",nrow(test_d.7))
pred_d.7[prob_d.7>.5]="Yes"
#table(pred_d.7, test_d.7$default)
mean(pred_d.7!=test_d.7$default)
## [1] 0.02666667
set.seed(1)
train.6=sample(nrow(default),nrow(default)*0.6)
train_d.6 = default[train.6,]
test_d.6 = default[-train.6,]
glm.fit.6 = glm(default ~ balance + income, data = default, family = binomial, subset = train.6 )
prob_d.6 = predict(glm.fit.6, newdata = test_d.6, type= "response")
pred_d.6 = rep("No",nrow(test_d.6))
pred_d.6[prob_d.6>.5]="Yes"
#table(pred_d.6, test_d.6$default)
mean(pred_d.6!=test_d.6$default)
## [1] 0.025
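The four splits above differ only in the training fraction, so the same computation can be factored into a small helper; val.err is a hypothetical name introduced here for illustration.

val.err = function(frac, seed = 1) {
  set.seed(seed)
  train = sample(nrow(default), nrow(default) * frac)  # split with training fraction `frac`
  fit = glm(default ~ balance + income, data = default, family = binomial, subset = train)
  prob = predict(fit, newdata = default[-train, ], type = "response")
  pred = ifelse(prob > .5, "Yes", "No")
  mean(pred != default$default[-train])                # validation set error rate
}
sapply(c(0.5, 0.6, 0.7, 0.8), val.err)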
- The validation set error varies with the particular train/test split, but for this data set the estimates are virtually identical across splits (roughly 2.5% to 2.7%).

Next, we fit a model that also includes a dummy variable for student and estimate its test error on the same 50/50 split.

glm.fit.all = glm(default ~ balance + income + student, data = default, family = binomial, subset = train)
summary(glm.fit.all)
##
## Call:
## glm(formula = default ~ balance + income + student, family = binomial,
## data = default, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5823 -0.1419 -0.0554 -0.0210 3.3961
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.134e+01 6.937e-01 -16.346 <2e-16 ***
## balance 5.767e-03 3.213e-04 17.947 <2e-16 ***
## income 1.686e-05 1.122e-05 1.502 0.1331
## studentYes -5.992e-01 3.324e-01 -1.803 0.0715 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.77 on 4999 degrees of freedom
## Residual deviance: 800.07 on 4996 degrees of freedom
## AIC: 808.07
##
## Number of Fisher Scoring iterations: 8
prob_d_all = predict(glm.fit.all, newdata = test_d, type= "response")
pred_d_all = rep("No",nrow(test_d))
pred_d_all[prob_d_all>.5]="Yes"
#table(pred_d_all, test_d$default)
mean(pred_d_all!=test_d$default)
## [1] 0.026
- Adding the dummy variable “student” did not reduce the validation set error (2.60%, versus 2.54% without it).
We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis.
set.seed(1)
glm.fit = glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154047e+01 4.347564e-01 -26.544680 2.958355e-155
## income 2.080898e-05 4.985167e-06 4.174178 2.990638e-05
## balance 5.647103e-03 2.273731e-04 24.836280 3.638120e-136
boot.fn = function(data, index) {
  return(coef(glm(default ~ income + balance,
                  data = data, family = binomial, subset = index)))
}
set.seed(1)
boot(default, boot.fn, R = 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2* 2.080898e-05 1.680317e-07 4.866284e-06
## t3* 5.647103e-03 1.855765e-05 2.298949e-04
- The standard errors obtained using the glm() function and using the bootstrap function are very close.
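To put the two sets of standard errors side by side, the boot() result can be saved and summarised directly; boot.out is a new name introduced here, since the call above was not assigned, and with the same seed it reproduces that run.

set.seed(1)
boot.out = boot(default, boot.fn, R = 1000)          # the same bootstrap run, this time saved
cbind(glm = summary(glm.fit)$coef[, "Std. Error"],   # formula-based SEs from glm()
      bootstrap = apply(boot.out$t, 2, sd))          # empirical SDs of the 1000 replicates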
We will now consider the Boston housing data set, from the MASS library.

#### (a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate $\hat{\mu}$.
(medv_mean = mean(Boston$medv))
## [1] 22.53281
(medv_stderror = sd(Boston$medv)/sqrt(length(Boston$medv)))
## [1] 0.4088611
- The estimated standard error of the mean, computed as the sample standard deviation divided by the square root of n, is approximately 0.409 (medv is measured in $1000s, so this is not a percentage).
set.seed(1)
boot.fn.mean = function(data, index) {   # accepts the data vector and an index vector
  return(mean(data[index]))              # the statistic: mean of the bootstrap sample
}
(boot_mean = boot(Boston$medv, boot.fn.mean, R = 1000))
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn.mean, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
- The bootstrap standard error of the mean (0.4107) is marginally larger than the formula-based estimate from (b) (0.4089).
# approximate 95% CI for the mean: estimate +/- 2 standard errors
c(boot_mean$t0 - 2*sd(boot_mean$t), boot_mean$t0 + 2*sd(boot_mean$t))
## [1] 21.71148 23.35413
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
- The 95% confidence interval for the mean of medv calculated based on the bootstrap method is slightly wider than the interval from the One Sample t-test. However, the numbers are very close.
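The boot package can also compute bootstrap confidence intervals directly; the percentile interval below is our choice of interval type, not part of the original analysis.

boot.ci(boot_mean, type = "perc")   # percentile bootstrap 95% CI for the mean of medv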
median(Boston$medv)
## [1] 21.2
set.seed(1)
boot.fn.med = function(data, index) {
  return(median(data[index]))            # the statistic: median of the bootstrap sample
}
(boot_med = boot(Boston$medv, boot.fn.med, R = 1000))
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn.med, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.02295 0.3778075
- The estimated standard error of the median is approximately 0.378 (again in $1000s, not a percentage), slightly smaller than the standard error of the mean.
(medv_tenth = quantile(Boston$medv, c(0.1)))
## 10%
## 12.75
set.seed(1)
boot.fn.tenth = function(data, index) {
  return(quantile(data[index], 0.1))     # the statistic: 10th percentile of the bootstrap sample
}
(boot_tenth = boot(Boston$medv, boot.fn.tenth, R = 1000))
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn.tenth, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0339 0.4767526
- The estimated standard error of the 10th percentile of medv is approximately 0.477, larger than the standard errors of both the mean and the median.
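The three bootstrap statistic functions above differ only in the statistic they compute; a hypothetical generic wrapper (boot.se, a name introduced here) makes the shared pattern explicit.

boot.se = function(stat, R = 1000, seed = 1) {
  set.seed(seed)
  fn = function(data, index) stat(data[index])  # wrap the statistic in boot()'s expected signature
  sd(boot(Boston$medv, fn, R = R)$t)            # bootstrap SE = standard deviation of the R replicates
}
boot.se(mean)                                   # SE of the mean
boot.se(median)                                 # SE of the median
boot.se(function(x) quantile(x, 0.1))           # SE of the 10th percentile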