CH.5

We now review k-fold cross-validation.

Explain how k-fold cross-validation is implemented

“It involves randomly dividing the available set of observations into two parts, a training set and a validation set or hold-out set. The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The resulting validation set error rate—typically assessed using MSE in the case of a quantitative response—provides an estimate of the test error rate.

What are the advantages and disadvantages of k-fold cross-validation relative to:

The validation set approach?

Easy to use and implement

1.validation estimate of the test error rate can be highly variable, depending on precisely which observations are included in the training set and which observations are included in the validation set. 2. In the validation approach, only a subset of the observations—those that are included in the training set rather than in the validationset—are used to ft the model. Since statistical methods tend to perform worse when trained on fewer observations, this suggests that thevalidation set error rate may tend to overestimate the test error rate for the model ft on the entire data set.

LOOCV?

“First, it has far less bias. In LOOCV, we repeatedly ft the statistical learning method using training sets that contain n − 1 observations, almost as many as are in the entire data set. This is in contrast to the validation set approach, in which the training set is typically around half the size of the original data set. Consequently, the LOOCV approach tends not to overestimate the test error rate as much as the validation set approach does. Second, in contrast to the validation approach which will yield diferent results when applied repeatedly due to randomness in the training/validation set splits, performing LOOCV multiple times will always yield the same results: there is no randomness in the training/validation set splits”

“LOOCV has the potential to be expensive to implement, since the model has to be ft n times. This can be very time consuming if n is large, and if each individual model is slow to fit. With least squares linear or polynomial regression, an amazing shortcut makes the cost of LOOCV the same as that of a single model fit!”

In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.

Fit a logistic regression model that uses income and balance topredict default.

require(ISLR)

## Loading required package: ISLR

## Warning: package 'ISLR' was built under R version 4.3.2

getwd()

## [1] "C:/Masters/Predictive Modeling/hw3"

setwd("C:/Users/monee/OneDrive/Pictures/Documents")
auto_data<-read.csv("Default.csv",header=TRUE)

glm.fit <- glm(default ~ income + balance, data=Default, family = binomial)
summary(glm.fit)

## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Using the validation set approach, estimate the test error of thismodel. In order to do this, you must perform the following steps:

Split the sample set into a training set and a validation set.

set.seed(33)
train_id <- sample(10000, 5000)

train <- Default[train_id, ]
test <- Default[-train_id, ]
test.Default <- test$default

Fit a multiple logistic regression model using only the training observations.

glm.fit.train <- glm(default ~ income + balance, data=train, family = binomial)
summary(glm.fit.train)

## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.156e+01  6.180e-01 -18.702  < 2e-16 ***
## income       2.319e-05  6.988e-06   3.319 0.000903 ***
## balance      5.660e-03  3.213e-04  17.617  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1483.83  on 4999  degrees of freedom
## Residual deviance:  819.08  on 4997  degrees of freedom
## AIC: 825.08
## 
## Number of Fisher Scoring iterations: 8

Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

glm.fit.prob <- predict(glm.fit.train, test, type = "response")
glm.fit.pred <- ifelse(glm.fit.prob>0.5, "Yes", "No")
table(glm.fit.pred, test.Default)

##             test.Default
## glm.fit.pred   No  Yes
##          No  4814  105
##          Yes   23   58

Compute the validation set error, which is the fraction of the observations in the validation set that are misclassifed.

mean(glm.fit.pred != test.Default)

## [1] 0.0256

Repeat the process in (b) three times, using three diferent splits of the observations into a training set and a validation set. Comment on the results obtained.

set.seed(3)
train_id <- sample(10000, 5000)

train <- Default[train_id, ]
test <- Default[-train_id, ]
test.Default <- test$default
glm.fit.1 <- glm(default ~ income + balance, data=train, family = binomial)
glm.fit.prob <- predict(glm.fit.1, test, type = "response")
glm.fit.pred <- ifelse(glm.fit.prob>0.5, "Yes", "No")
table(glm.fit.pred, test.Default)

##             test.Default
## glm.fit.pred   No  Yes
##          No  4822  109
##          Yes   23   46

mean(glm.fit.pred != test.Default)

## [1] 0.0264

Sample 2:

set.seed(2)
train_id <- sample(10000, 5000)

train <- Default[train_id, ]
test <- Default[-train_id, ]
test.Default <- test$default
glm.fit.2 <- glm(default ~ income + balance, data=train, family = binomial)
glm.fit.prob <- predict(glm.fit.2, test, type = "response")
glm.fit.pred <- ifelse(glm.fit.prob>0.5, "Yes", "No")
table(glm.fit.pred, test.Default)

##             test.Default
## glm.fit.pred   No  Yes
##          No  4819  101
##          Yes   18   62

mean(glm.fit.pred != test.Default)

## [1] 0.0238

Sample 3

set.seed(1)
train_id <- sample(10000, 5000)

train <- Default[train_id, ]
test <- Default[-train_id, ]
test.Default <- test$default
glm.fit.3 <- glm(default ~ income + balance, data=train, family = binomial)
glm.fit.prob <- predict(glm.fit.3, test, type = "response")
glm.fit.pred <- ifelse(glm.fit.prob>0.5, "Yes", "No")
table(glm.fit.pred, test.Default)

##             test.Default
## glm.fit.pred   No  Yes
##          No  4824  108
##          Yes   19   49

mean(glm.fit.pred != test.Default)

## [1] 0.0254

Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

set.seed(6)

train_id <- sample(10000, 5000)

train <- Default[train_id, ]
test <- Default[-train_id, ]
test.Default <- test$default

glm.fit.4 <- glm(default ~ income + balance + student, data = train, family = binomial)
glm.fit.prob <- predict(glm.fit.4, test, type = "response")
glm.fit.pred <- ifelse(glm.fit.prob>0.5, "Yes", "No")
table(glm.fit.pred, test.Default)

##             test.Default
## glm.fit.pred   No  Yes
##          No  4825  109
##          Yes   21   45

mean(glm.fit.pred != test.Default)

## [1] 0.026

We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefcients in two diferent ways: (1) using the bootstrap, and (2) using the standard formula for computing the standard errors in the sm.GLM() function. Do not forget to set a random seed before beginning your analysis.

Using the summarize() and sm.GLM() functions, determine the estimated standard errors for the coefcients associated with income and balance in a multiple logistic regression model that uses both predictors.

glm.fit.5 <- glm(default ~ income + balance, data = Default, family = binomial)
summary(glm.fit.5)

## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Write a function, boot_fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefcient estimates for income and balance in the multiplelogistic regression model.

boot.fn <- function(data, index) {
  return(coef(glm(default ~ income + balance, data = data, family = binomial, 
           subset = index)))
}

Following the bootstrap example in the lab, use your boot_fn() function to estimate the standard errors of the logistic regression coefcients for income and balance.

library(boot)
set.seed(1)
boot(Default, boot.fn, R=1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2*  2.080898e-05  1.680317e-07 4.866284e-06
## t3*  5.647103e-03  1.855765e-05 2.298949e-04

Comment on the estimated standard errors obtained using the sm.GLM() function and using the bootstrap. Standard errors obtained by the glm() function and by using bootstrap function are very similar.

We will now consider the Boston housing data set, from the ISLP library.

Based on this data set, provide an estimate for the population mean of medv. Call this estimate µˆ

require(MASS)

## Loading required package: MASS

## Warning: package 'MASS' was built under R version 4.3.1

getwd()

## [1] "C:/Masters/Predictive Modeling/hw3"

setwd("C:/Users/monee/OneDrive/Pictures/Documents")
Boston<-read.csv("Boston.csv",header=TRUE)

u.hat <- mean(Boston$medv)
u.hat

## [1] 22.53281

Provide an estimate of the standard error of µˆ. Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.

sd(Boston$medv)/sqrt(nrow(Boston))

## [1] 0.4088611

Now estimate the standard error of µˆ using the bootstrap. How does this compare to your answer from (b)?

boot.fn=function(data,index) {
  return(mean(data[index]))
}
set.seed(5)
boot(Boston$medv, boot.fn, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original     bias    std. error
## t1* 22.53281 0.01637984   0.4096144

Based on your bootstrap estimate from (c), provide a 95 % confdence interval for the mean of medv. Compare it to the results obtained by using Boston[‘medv’].std() and the two standard error rule (3.9).

t.test(Boston$medv)

## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

Hint: You can approximate a 95 % confdence interval using the formula [ˆµ − 2SE(ˆµ), µˆ + 2SE(ˆµ)].

u.hat - 2*0.412335 ; u.hat + 2*0.412335

## [1] 21.70814

## [1] 23.35748

Based on this data set, provide an estimate, µˆmed, for the median value of medv in the population.

med.hat <- median(Boston$medv)
med.hat

## [1] 21.2

We now would like to estimate the standard error of µˆmed. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your fndings.

boot.fn.med=function(data,index) {
  return(median(data[index]))
}
set.seed(5)
boot(Boston$medv, boot.fn.med, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn.med, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 -0.0026   0.3847418

Based on this data set, provide an estimate for the tenth percentile of medv in Boston census tracts. Call this quantity µˆ0.1.(You can use the np.percentile() function.) np.percentile()

u.0.1 <- quantile(Boston$medv, 0.1)
u.0.1

##   10% 
## 12.75

Use the bootstrap to estimate the standard error of µˆ0.1. Comment on your fndings.

boot.fn.q=function(data,index) {
  u.0.1 <- quantile(data[index], c(0.1))
  return(u.0.1)
}

set.seed(5)
boot(Boston$medv, boot.fn.q, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn.q, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75 0.01505   0.4911489

at 10% we get 49.11% standard error

CH.5

Moneeb Abu-Esba

2024-03-08