#QUESTION 3 PART A
#K-fold cross-validation is a technique where the dataset is randomly split into k groups, or folds, of approximately equal size. Each fold is used once as the test set while the remaining k-1 folds are used for training, and the k resulting error estimates are averaged to give the cross-validation estimate of the test error.
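#A minimal sketch of 10-fold cross-validation in R using cv.glm() from the boot package (an illustration only, not part of the graded answers; it assumes the Default data from ISLR2 that is used later in this assignment):
library(ISLR2)
library(boot)
set.seed(1)
cv.fit <- glm(default ~ income + balance, data = Default, family = binomial)
cv.err <- cv.glm(Default, cv.fit, K = 10)  # K = 10 folds
cv.err$delta[1]                            # cross-validated estimate of the prediction error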
#QUESTION 3 PART B
#The advantages of k-fold cross-validation relative to the validation set approach include: the estimate of the test MSE is less variable and less biased, and the test error is not overestimated as much, because each model is trained on a larger fraction of the data rather than on only half of it.
#The disadvantages of k-fold cross-validation relative to the validation set approach include: it is more computationally expensive, since the model must be fit k times rather than once, and the cost grows as k increases.
#The advantages of k-fold cross-validation relative to LOOCV include: a computational advantage, since the model is fit k times instead of n times (see the brief timing sketch below), and a lower variance of the test-error estimate, because the n LOOCV fits are trained on nearly identical data and are therefore highly correlated.
#The disadvantages of k-fold cross-validation relative to LOOCV include: a somewhat higher bias, because each training set contains fewer observations than in LOOCV, and some randomness in the estimate arising from how the observations are assigned to the k folds.
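#A brief timing sketch of the computational difference between 10-fold CV and LOOCV (an illustration only, using the same ISLR2 and boot packages as the sketch above; a 1,000-row subset of Default is used so that LOOCV finishes quickly):
small <- Default[1:1000, ]
fit.small <- glm(default ~ income + balance, data = small, family = binomial)
system.time(cv.glm(small, fit.small, K = 10))           # 10-fold CV: 10 model refits
system.time(cv.glm(small, fit.small, K = nrow(small)))  # LOOCV: 1,000 refits, noticeably slower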
#QUESTION 5 PART A)
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.3
attach(Default)
set.seed(1)
fit.glm <- glm(default ~ income + balance, data = Default, family = "binomial")
summary(fit.glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
#QUESTION 5 PART B)
#i)
train = sample(dim(Default)[1], dim(Default)[1] / 2)
#ii)
# equivalent to glm(default ~ income + balance, data = Default[train, ], family = "binomial")
fit.glm = glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
summary(fit.glm)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default, subset = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.194e+01 6.178e-01 -19.333 < 2e-16 ***
## income 3.262e-05 7.024e-06 4.644 3.41e-06 ***
## balance 5.689e-03 3.158e-04 18.014 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1523.8 on 4999 degrees of freedom
## Residual deviance: 803.3 on 4997 degrees of freedom
## AIC: 809.3
##
## Number of Fisher Scoring iterations: 8
#iii)
glm.probs = predict(fit.glm, newdata = Default[-train, ], type="response")
glm.pred=rep("No",5000)
glm.pred[glm.probs>0.5] = "Yes"
#iv)
mean(glm.pred != Default[-train, ]$default)
## [1] 0.0254
#QUESTION 5 PART C) (first split)
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
fit.glm <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
probs <- predict(fit.glm, newdata = Default[-train, ], type = "response")
pred.glm <- rep("No", length(probs))
pred.glm[probs > 0.5] <- "Yes"
mean(pred.glm != Default[-train, ]$default)
## [1] 0.0274
#QUESTION 5 PART C) (second split)
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
fit.glm <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
probs <- predict(fit.glm, newdata = Default[-train, ], type = "response")
pred.glm <- rep("No", length(probs))
pred.glm[probs > 0.5] <- "Yes"
mean(pred.glm != Default[-train, ]$default)
## [1] 0.0244
#QUESTION 5 PART C) (third split)
train <- sample(dim(Default)[1], dim(Default)[1] / 2)
fit.glm <- glm(default ~ income + balance, data = Default, family = "binomial", subset = train)
probs <- predict(fit.glm, newdata = Default[-train, ], type = "response")
pred.glm <- rep("No", length(probs))
pred.glm[probs > 0.5] <- "Yes"
mean(pred.glm != Default[-train, ]$default)
## [1] 0.0244
#FINDINGS FROM QUESTION 5 PART C)
#The validation estimate of the test error varies slightly across the three splits, here between 0.0244 and 0.0274. The variability depends on which observations end up in the training set and which end up in the validation set; the short sketch below repeats the random split several times to illustrate this.
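#A short sketch of how this variability could be examined more systematically by repeating the random split several times (an illustration only, reusing the model and data from above):
val.errors <- replicate(5, {
  train.i <- sample(nrow(Default), nrow(Default) / 2)
  fit.i <- glm(default ~ income + balance, data = Default, family = binomial, subset = train.i)
  probs.i <- predict(fit.i, newdata = Default[-train.i, ], type = "response")
  pred.i <- ifelse(probs.i > 0.5, "Yes", "No")
  mean(pred.i != Default$default[-train.i])
})
val.errors  # five validation-set error estimates, one per random split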
#QUESTION 5 PART D)
set.seed(1)
train = sample(dim(Default)[1], dim(Default)[1] / 2)
fit.glm = glm(default ~ income + balance + student, data = Default, family = binomial, subset = train)
glm.pred = rep("No", dim(Default)[1] / 2)
glm.probs = predict(fit.glm, Default[-train, ], type = "response")
glm.pred[glm.probs > 0.5] = "Yes"
mean(glm.pred != Default[-train, ]$default)
## [1] 0.026
#FINDINGS FROM QUESTION 5 PART D)
#Adding a dummy variable for student does not reduce the validation-set error rate. With the same seed, the error rises slightly, from 0.0254 in part (b) to 0.0260 here.
#QUESTION 6 PART A)
set.seed(1)
glm.fit=glm(default~income+balance, data=Default, family=binomial)
summary(glm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = binomial,
## data = Default)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
#QUESTION 6 PART B)
boot.fn = function(data, index) {
  fit = glm(default ~ income + balance, data = data, family = binomial, subset = index)
  return(coef(fit))  # coefficients from the model refit on the bootstrap sample
}
#QUESTION 6 PART C)
library(boot)
boot(Default, boot.fn, 100)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 8.556378e-03 4.122015e-01
## t2* 2.080898e-05 -3.993598e-07 4.186088e-06
## t3* 5.647103e-03 -4.116657e-06 2.226242e-04
#QUESTION 6 PART D)
#The estimated standard errors obtained with the glm() function and with the bootstrap are very similar. The bootstrap estimates of the standard errors for the coefficients β0, β1 and β2 are approximately 4.122 x 10^(-1), 4.186 x 10^(-6) and 2.226 x 10^(-4), respectively.
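#For a direct side-by-side check, the model-based standard errors can be extracted from the fitted glm object (a quick sketch using the glm.fit object above; the values should match the summary() output, roughly 4.348e-01, 4.985e-06 and 2.274e-04):
summary(glm.fit)$coefficients[, "Std. Error"]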
#QUESTION 9 PART A)
library(MASS)
## Warning: package 'MASS' was built under R version 4.4.2
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
mu.hat = mean(Boston$medv)
mu.hat
## [1] 22.53281
#QUESTION 9 PART B)
se.hat = sd(Boston$medv) /sqrt(dim(Boston)[1])
se.hat
## [1] 0.4088611
#QUESTION 9 PART C)
set.seed(1)
boot.fn <- function(data, index) {
  # bootstrap statistic: the sample mean of the resampled observations
  mu <- mean(data[index])
  return(mu)
}
boot(Boston$medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.007650791 0.4106622
#FINDINGS FROM QUESTION 9 PART C)
#The bootstrap estimate of the standard error of the mean, 0.4106622, is very close to the formula-based estimate of 0.4088611 from part (b).
#QUESTION 9 PART D)
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
#FINDINGS FROM QUESTION 9 PART D)
#An approximate bootstrap confidence interval of mu.hat +/- 2 standard errors (sketched below) is very close to the interval produced by t.test(), which gives a 95 percent confidence interval of (21.72953, 23.33608).
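#A rough sketch of the corresponding bootstrap confidence interval, using the two-standard-error rule of thumb with the bootstrap standard error of 0.4106622 from part (c):
c(mu.hat - 2 * 0.4106622, mu.hat + 2 * 0.4106622)  # approximately (21.71, 23.35)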
#QUESTION 9 PART E)
med.hat <- median(Boston$medv)
med.hat
## [1] 21.2
#QUESTION 9 PART F)
boot.fn <- function(data, index) {
  # bootstrap statistic: the sample median of the resampled observations
  med <- median(data[index])
  return(med)
}
boot(Boston$medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.0386 0.3770241
#FINDINGS FROM QUESTION 9 PART F)
#The bootstrap estimate of the median is 21.2, exactly the same as the sample median from part (e). Its estimated standard error is 0.3770241, which is small relative to the median itself.
#QUESTION 9 PART G)
percent.hat <- quantile(Boston$medv, c(0.1))
percent.hat
## 10%
## 12.75
#QUESTION 9 PART H)
boot.fn <- function(data, index) {
  # bootstrap statistic: the tenth percentile of the resampled observations
  pct <- quantile(data[index], 0.1)
  return(pct)
}
boot(Boston$medv, boot.fn, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Boston$medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0186 0.4925766
#FINDINGS FROM QUESTION 9 PART H)
#The estimated tenth percentile from part (h) is 12.75, exactly the same as the sample estimate from part (g). Its bootstrap standard error is 0.4925766, which is small relative to the percentile value.