Assignment 4

Question 3

Explain how k-fold cross-validation is implemented.

The set of observations is divided into k groups, or folds, of roughly equal size, at random in k-fold cross validation. The statistical learning approach is applied/fitted on the remaining k − 1 folds, with the initial fold serving as a validation set. Next, the observations in the held-out fold are used to determine the mean squared error (MSE). A different set of observations is used as a validation set each time this process is repeated k times. K estimates of the test errors are generated as a result of this procedure. These MSEs are averaged to provide the k-fold cross validation estimate.

(b) What are the advantages and disadvantages of k-fold cross validation relative to:
i. The validation set approach?
Advantages of k-fold method compared to validation set method

The test error rates (MSE) of the k-fold approach are less variable than those of the validation set approach.less biased than the validation set method. Since all of the data is used for both training and testing, k-fold cross validation does not overestimate the test error. On the other hand, the validation set strategy splits the data, which may produce a more or less favorable outcome by chance.

Disadvantages of the k-fold method compared to the validation set method

One computational benefit of the validation set approach is that a model is only trained and evaluated once. k models will be trained in k-fold CV, and these training datasets will typically be larger than those used in the validation technique for conventional values of k. All of this indicates that for huge data and large values of k, k-fold CV may require significantly more time.

ii. LOOCV?

Advantages of k-fold method when compared to LOOCV

While LOOCV fits the model n times, k-fold CV with k < n has an advantage in computing power since k-fold fits the model K times. Aside from computational difficulties, k-fold CV has a less evident but perhaps more significant benefit in that it frequently provides more accurate test error rate estimates than LOOCV. A trade-off between bias and variance is involved in this. The highly flexible LOOCV has more variance than the k-fold CV. Though LOOCV is fitted on almost same data, it produces a higher variation in test error than k-fold CV because k-fold CV is trained on a less identical collection of observations.

Disadvantages of k-fold method when compared to LOOCV

Similar to the validation set approach, k-fold CV has some unpredictability in the process of selecting k-folds, and k-fold has high bias.

Question 5

In Chapter 4, we used logistic regression to predict the probability of default using income and balance on the Default data set. We will now estimate the test error of this logistic regression model using the validation set approach. Do not forget to set a random seed before beginning your analysis.

library(ISLR)
library(MASS)
attach(Default)
library(caTools)
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

set.seed(1)

a) Fit a logistic regression model that uses income and balance to predict default.

lrm<-glm(default~income+balance,family = binomial,data=Default)
summary(lrm)

## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

pd<-predict(lrm,Default$default,type="response")
pd.class<-ifelse(pd>0.5,"Yes","No")
round(mean(Default$default!=pd.class),4)

## [1] 0.0263

Using the complete data set, the logistic regression model is constructed, and the missclassification rate is determined to be 0.0263.

(b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:
i. Split the sample set into a training set and a validation set.

set.seed(123, sample.kind = "Rounding")

## Warning in set.seed(123, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used

index <- createDataPartition(y = Default$default, p = 0.7, list = F)

train <- Default[index, ]
test <- Default[-index, ]

nrow(train) / nrow(Default)

## [1] 0.7001

ii. Fit a multiple logistic regression model using only the training observations.

lrm2 <- glm(default ~ income + balance, data = train, family = "binomial")

summary(lrm2)

## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.141e+01  5.099e-01 -22.379  < 2e-16 ***
## income       2.256e-05  5.898e-06   3.825 0.000131 ***
## balance      5.531e-03  2.651e-04  20.866  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2050.6  on 7000  degrees of freedom
## Residual deviance: 1119.2  on 6998  degrees of freedom
## AIC: 1125.2
## 
## Number of Fisher Scoring iterations: 8

iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.

predict<-predict(lrm2,test,type="response")
class<-ifelse(predict>0.5,"Yes","No")
table(test$default,class,dnn=c("Actual","Predicted"))

##       Predicted
## Actual   No  Yes
##    No  2890   10
##    Yes   68   31

iv) Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.

round(mean(test$default != class), 5)

## [1] 0.02601

According to 0.02601, the logistic regression model’s misclassification rate on the Default dataset is roughly 2.601%.

(c) Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

lrmerror <- c()
lrmerror[1] <- mean(test$default != class)

set.seed(42)

for (i in 2:4) {
  index <- createDataPartition(y = Default$default, p = 0.7, list = F)
  train <- Default[index, ]
  test <- Default[-index, ]
  
  lrm <- glm(default ~ income + balance, data = train, family = "binomial")
  predtest <- factor(ifelse(predict(lrm, newdata = test, type = "response") > 0.5, "Yes", "No"))
  
  lrmerror[i] <- mean(test$default != predtest)
}


round(lrmerror, 5)

## [1] 0.02601 0.02534 0.02601 0.02534

As we expect from the validation set technique and cross-validation, there is some variation. 0.02568 is the average test error.

(d) Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.

set.seed(43)

index <- createDataPartition(y = Default$default, p = 0.7, list = F)
train <- Default[index, ]
test <- Default[-index, ]

lrm <- glm(default ~ ., data = train, family = "binomial")
summary(lrm)

## 
## Call:
## glm(formula = default ~ ., family = "binomial", data = train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.100e+01  5.957e-01 -18.473   <2e-16 ***
## studentYes  -6.353e-01  2.825e-01  -2.249   0.0245 *  
## balance      5.780e-03  2.789e-04  20.721   <2e-16 ***
## income       4.235e-06  9.788e-06   0.433   0.6653    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2050.6  on 7000  degrees of freedom
## Residual deviance: 1093.3  on 6997  degrees of freedom
## AIC: 1101.3
## 
## Number of Fisher Scoring iterations: 8

predtest <- factor(ifelse(predict(lrm, newdata = test, type = "response") > 0.5, "Yes", "No"))

round(mean(test$default != predtest), 5)

## [1] 0.02634

Adding a dummy variable for the student does not appear to lower the test error rate because the test error for this model seems to be higher than the test error for the smaller model—larger than all four of them, in fact.

Question 6

We continue to consider the use of a logistic regression model to predict the probability of default using income and balance on the Default data set. In particular, we will now compute estimates for the standard errors of the income and balance logistic regression coefficients in two different ways:

Using the bootstrap
Using the standard formula for computing the standard errors in the glm() function

Do not forget to set a random seed before beginning your analysis.

(a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

set.seed(1)
lrm<-glm(default~income+balance,family = binomial,data=Default)
summary(lrm)

## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

(b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

boot.fn<-function(data,index){
  fit<-glm(default~income+balance,data=data,family="binomial",subset=index)
  return(coef(fit))
}

(c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

library(boot)

## 
## Attaching package: 'boot'

## The following object is masked from 'package:lattice':
## 
##     melanoma

boot(Default,boot.fn,1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -8.008379e-03 4.239273e-01
## t2*  2.080898e-05  5.870933e-08 4.582525e-06
## t3*  5.647103e-03  2.299970e-06 2.267955e-04

(d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function. The glm() function and the projected standard errors obtained from the logistic regression summary coefficients (beta0, beta1, and beta2) nearly match, indicating the accuracy of the bootstrap approach in capturing coefficient variation. This consistency, which is achieved by employing a non-parametric resampling strategy, emphasizes the robustness of the bootstrap methodology, especially in cases where traditional methods may not be as successful with small or complex datasets (like Default). Consequently, there is an increase in confidence in coefficient estimations.

Question 9

(a) We will now consider the Boston housing data set, from the ISLR2 library.

library(ISLR)
attach(Boston)

mean(Boston$medv)

## [1] 22.53281

Sample mean : 22.5328063

(b) Provide an estimate of the standard error of . Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.

sd(Boston$medv) / sqrt(length(Boston$medv))

## [1] 0.4088611

0.4088611 is the estimated standard error.

The sample that represents the Boston population makes up the data set. The standard error indicates the estimate’s accuracy, or how much the mean would change if a different sample was used.

(c) Now estimate the standard error of μ̂ using the bootstrap. How does this compare to your answer from (b)?

boot.fn <- function(vector, index) {
  mean(vector[index])
}

set.seed(23)

(boot_results <- boot(data = Boston$medv, statistic = boot.fn, R = 1000))

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original       bias    std. error
## t1* 22.53281 -0.007606917   0.4278912

Both the bootstrap calculated standard error of (mu) of 0.4278912 and the estimate of 0.4088611 found in (b) are incredibly close. proving that the bootstrapped standard error estimation approach and the direct calculation method used in part (b) are consistent.

(d) Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv).

Hint: You can approximate a 95% confidence interval using the formula [μ̂−2SE(μ̂),μ̂+2SE(μ̂)]

boot_results2 <- sd(boot_results$t)
round(c(mean(Boston$medv) - 2*boot_results2, mean(Boston$medv) + 2*boot_results2), 4)

## [1] 21.6770 23.3886

t.test(Boston$medv)

## 
##  One Sample t-test
## 
## data:  Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

The confidence interval derived from the t test is very identical to the one that was obtained.

(e) Based on this data set, provide an estimate, μ̂med, for the median value of medv in the population.

median(Boston$medv)

## [1] 21.2

median is 21.2

(f) We now would like to estimate the standard error of μ̂med. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

library(boot)
set.seed(1)
boot.fn<-function(data,index){
  mu.median<-median(data[index])
  return(mu.median)
}
set.seed(1)
boot(medv,boot.fn,100)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = boot.fn, R = 100)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2  -0.026   0.3683488

From the above median value is 21.2 which is equal to the estimated value obtained in question (e), with a standard error of 0.3683488 where it is very small when relatively compared to median value of 21.2

(g) Based on this data set, provide an estimate for the tenth percentile of medv in Boston suburbs. Call this quantity μ̂0.1. (You can use the quantile() function.)

quantile(Boston$medv, 0.1)

##   10% 
## 12.75

The estimate of 12.75 is obtained for the tenth percentile of medical variables in the Boston suburbs.

(h) Use the bootstrap to estimate the standard error of μ̂0.1. Comment on your findings.

boot.fn <- function(vector, index) {
  quantile(vector[index], 0.1)
}

set.seed(22)

(boot_results <- boot(data = Boston$medv, statistic = boot.fn, R = 1000))

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75 0.02365   0.5000282

Although the standard error has increased significantly from 10th quantile, it is still very small.

Assignment 4

Karthik Munnam

2024-03-08

Question 3

Question 5

Question 6

Question 9