default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis. Fit a logistic regression model that uses income and balance to predict default.#Exercise 5-a
attach(Default) #Attach data to use variable names
set.seed(1)
log_mod1 = glm(default ~ income + balance, data = Default, family = "binomial") #Fit logistic regression model
summary(log_mod1)##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
# Exercise 5-b Split Sample (50-50)
train_1 = sample(dim(Default)[1], 0.50*dim(Default)[1])
test_1 = Default[-train_1, ]# Exercise 5-b Fit Multiple Regression
log_mod2 = glm(default ~ income + balance, data = Default, family = "binomial", subset = train_1)
summary(log_mod2)##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train_1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5830 -0.1428 -0.0573 -0.0213 3.3395
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.194e+01 6.178e-01 -19.333 < 2e-16 ***
## income 3.262e-05 7.024e-06 4.644 3.41e-06 ***
## balance 5.689e-03 3.158e-04 18.014 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.8 on 4999 degrees of freedom
## Residual deviance: 803.3 on 4997 degrees of freedom
## AIC: 809.3
##
## Number of Fisher Scoring iterations: 8
default category if the posterior probability is greater than 0.5.# Exercise 5-b Posterior Probability
probs_1 = predict(log_mod2, newdata = test_1, type = "response")
pred.log_mod2 = rep("No", length(probs_1))
pred.log_mod2[probs_1 > 0.5] = "Yes"result_1 = mean(pred.log_mod2 != Default[-train_1, ]$default)
percent = "%"
text1 = "Split @ 50%:"
paste(text1, round(result_1*100, digits = 2) , percent)## [1] "Split @ 50%: 2.54 %"
# Exercise 5-c Three Times
train_2 = sample(dim(Default)[1], 0.75*dim(Default)[1])
test_2 = Default[-train_2, ]
log_mod3 = glm(default ~ income + balance, data = Default, family = "binomial", subset = train_2)
probs_2 = predict(log_mod3, newdata = test_2, type = "response")
pred.log_mod3 = rep("No", length(probs_2))
pred.log_mod3[probs_2 > 0.5] = "Yes"
result_2 = mean(pred.log_mod3 != Default[-train_2, ]$default)
train_3 = sample(dim(Default)[1], 0.25*dim(Default)[1])
test_3 = Default[-train_3, ]
log_mod4 = glm(default ~ income + balance, data = Default, family = "binomial", subset = train_3)
probs_3 = predict(log_mod4, newdata = test_3, type = "response")
pred.log_mod4 = rep("No", length(probs_3))
pred.log_mod4[probs_3 > 0.5] = "Yes"
result_3 = mean(pred.log_mod4 != Default[-train_3, ]$default)
train_4 = sample(dim(Default)[1], 0.65*dim(Default)[1])
test_4 = Default[-train_4, ]
log_mod5 = glm(default ~ income + balance, data = Default, family = "binomial", subset = train_4)
probs_4 = predict(log_mod5, newdata = test_4, type = "response")
pred.log_mod5 = rep("No", length(probs_4))
pred.log_mod5[probs_4 > 0.5] = "Yes"
result_4 = mean(pred.log_mod5 != Default[-train_4, ]$default)## [1] "Split 1 @ 75%: 2.56 %"
## [1] "Split 2 @ 25%: 2.53 %"
## [1] "Split 3 @ 65%: 2.66 %"
default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.# Exercise 5-d
train_5 = sample(dim(Default)[1], 0.50*dim(Default)[1])
test_5 = Default[-train_5, ]
log_mod6 = glm(default ~ income + balance + student, data = Default, family = "binomial", subset = train_5)
probs_5 = predict(log_mod6, newdata = test_5, type = "response")
pred.log_mod6 = rep("No", length(probs_5))
pred.log_mod6[probs_5 > 0.5] = "Yes"
result_5 = mean(pred.log_mod6 != Default[-train_5, ]$default)
text_compare = "Original Model @ 50% Split with Income + Balance: "
text5 = "Split @ 50% with Income + Balance + Student:"## [1] "Original Model @ 50% Split with Income + Balance: 2.54 %"
## [1] "Split @ 50% with Income + Balance + Student: 2.42 %"
## [1] "Including the dummy variable resulted in a decreased test error rate."
default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the glm() function. Do not forget to set a random seed before beginning your analysis. (a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.#Exercise 6-a
lr6 = glm(default ~ income + balance, data = Default, family = "binomial")
set.seed(1)
summary(lr6)##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
balance 2.274e-04income: 4.985e-066boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiplelogistic regression model.# Exercise 6-b
set.seed(1)
boot.fn = function(data, index) return(coef(glm(default ~ income + balance, data = data, family = "binomial", subset = index)))boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2* 2.080898e-05 1.680317e-07 4.866284e-06
## t3* 5.647103e-03 1.855765e-05 2.298949e-04
Boston housing data set, from the MASS library. (a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate ˆµ.## [1] 22.53281
## [1] 0.4088611
# Exercise 9-c
set.seed(1)
boot.mu.fn = function(data, index) {
mu = mean(data[index])
return (mu)
}
bstrap = boot(medv, boot.mu.fn, 1000)
bstrap##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.mu.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
##
## One Sample t-test
##
## data: medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
## [1] 21.71148 23.35413
## [1] 21.2
# Exercise 9-f
set.seed(1)
boot.med.fn = function(data, index) return(median(data[index]))
bstrap.med = boot(medv, boot.med.fn, 1000)
bstrap.med##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.med.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.02295 0.3778075
## 10%
## 12.75
# Exercise 9-h
boot.h.fn = function(data, index) return(quantile(data[index], c(0.1)))
boot(medv, boot.h.fn, 1000)##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.h.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.01455 0.4823468