#3. a) Explain how k-fold cross-validation is implemented

K-fold CV is implemented by randomly partitioning the data set into k equally sized folds. For each of the k iterations, one fold is held out as the test set and the model is trained on the remaining k-1 folds. We then "cross-validate" by averaging the k held-out error estimates (test MSE for regression, misclassification rate for classification) to get an overall estimate of test error, rather than relying on any single split. This makes good use of limited data, since every observation serves in both a training set and a test set.
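A minimal sketch of these mechanics in R (using the built-in mtcars data purely as a stand-in; the model and variables are illustrative, not part of this assignment):

set.seed(1)
k = 5
df = mtcars                                      # stand-in data; response is mpg
folds = sample(rep(1:k, length.out = nrow(df)))  # randomly assign each row to one of k folds
cv_errors = rep(NA, k)
for (i in 1:k) {
  fit = lm(mpg ~ wt + hp, data = df[folds != i, ])     # train on the other k-1 folds
  preds = predict(fit, newdata = df[folds == i, ])     # predict the held-out fold
  cv_errors[i] = mean((df$mpg[folds == i] - preds)^2)  # test MSE for this fold
}
mean(cv_errors)  # the k-fold CV estimate of test MSE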

  1. Advantages & disadvantages of k-fold CV relative to: Validation set approach: Advantages for k-fold CV would be that it uses all of the observations available in the training and testing set while the validation set approach would only have the model be trained on data seen in the training set. K-fold CV would likely produce better results if the data set is small and there would likely be less variability in the data because we could potentially split our data randomly for the training set to become unbalanced. Validation set approach is also likely to overestimate the data when there are not many observations. K-fold CV begins to have disadvantages compared to validation set approach when we have MANY observations to the point where the training time takes longer and it becomes intense computationally for it to run.

Leave-one-out CV (LOOCV): An advantage of k-fold CV over LOOCV is its much lower computational cost, since we are not doing k = n; we usually do k = 5 or k = 10. With a well-chosen k, k-fold CV comes rather close to the LOOCV estimate at a much lower cost, because we are not treating each observation as its own fold. A disadvantage of k-fold CV compared to LOOCV is that it can be somewhat more biased if we use a sub-optimal k for the specific data set, like k = 2, since each training set is then noticeably smaller than the full data; k = 5 or 10 typically gives a good bias-variance trade-off. A quick cost comparison is sketched below.
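Both procedures are available through boot::cv.glm(); the following sketch is our own illustration on the Default data used in question 5 (the 0/1 recoding and the cost function are additions, since cv.glm() expects a numeric response for a misclassification cost):

library(boot)
d = ISLR2::Default
d$default01 = as.numeric(d$default == 'Yes')      # recode the factor to 0/1
fit = glm(default01 ~ income + balance, data = d, family = 'binomial')
cost = function(y, prob) mean((prob > 0.5) != y)  # misclassification rate at a 0.5 cutoff
set.seed(1)
cv.glm(d, fit, cost, K = 10)$delta[1]  # 10-fold CV: only 10 model refits
# cv.glm(d, fit, cost)$delta[1]        # LOOCV: n = 10,000 refits -- far more expensive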

#5. a) Fit a logistic regression model that uses income and balance to predict default.

Default = ISLR2::Default
summary(Default)
##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554
set.seed(1)
default_glm1 = glm(default ~ income + balance, data = Default, family = 'binomial')
summary(default_glm1)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Both income and balance are significant at an alpha level of 0.05 for predicting whether someone is likely to default.

b) Using the validation set approach, estimate the test error of this model. In order to do this, you must perform the following steps:
i. Split the sample set into a training set and a validation set.
index = sample(1:nrow(Default), 0.7*nrow(Default))
# 70/30 Train/Test Split
def_train = Default[index,]
def_test = Default[-index,]
ii. Fit a multiple logistic regression model using only the training observations.
default_glm2 = glm(default ~ income + balance, data = def_train, family = 'binomial')
summary(default_glm2)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = def_train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.167e+01  5.214e-01 -22.379  < 2e-16 ***
## income       2.560e-05  6.012e-06   4.258 2.06e-05 ***
## balance      5.574e-03  2.678e-04  20.816  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2030.3  on 6999  degrees of freedom
## Residual deviance: 1079.6  on 6997  degrees of freedom
## AIC: 1085.6
## 
## Number of Fisher Scoring iterations: 8
iii. Obtain a prediction of default status for each individual in the validation set by computing the posterior probability of default for that individual, and classifying the individual to the default category if the posterior probability is greater than 0.5.
probabilities = predict(default_glm2,newdata = def_test, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 'Yes', 'No')
iv. Compute the validation set error, which is the fraction of the observations in the validation set that are misclassified.
mse = mean(def_test$default != predicted.classes)
acc = 1 - mse
print(mse)
## [1] 0.02666667
print(acc)
## [1] 0.9733333

We get an accuracy of 97.33%, or an error rate of 2.67%, with a 70/30 split.

c) Repeat the process in (b) three times, using three different splits of the observations into a training set and a validation set. Comment on the results obtained.

80/20 split

index = sample(1:nrow(Default), 0.8*nrow(Default))
def_train = Default[index,]
def_test = Default[-index,]
default_glm3 = glm(default ~ income + balance, data = def_train, family = 'binomial')
summary(default_glm3)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = def_train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.164e+01  4.908e-01 -23.725  < 2e-16 ***
## income       2.006e-05  5.546e-06   3.616 0.000299 ***
## balance      5.731e-03  2.592e-04  22.111  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2340.6  on 7999  degrees of freedom
## Residual deviance: 1252.2  on 7997  degrees of freedom
## AIC: 1258.2
## 
## Number of Fisher Scoring iterations: 8
probabilities = predict(default_glm3,newdata = def_test, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 'Yes', 'No')
mse = mean(def_test$default != predicted.classes)
acc = 1 - mse
print(mse)
## [1] 0.0265
print(acc)
## [1] 0.9735

The 80/20 split does slightly better than the 70/30 split, with an accuracy of 97.35% (0.0002 higher).

90/10 split

index = sample(1:nrow(Default), 0.9*nrow(Default))
def_train = Default[index,]
def_test = Default[-index,]
default_glm4 = glm(default ~ income + balance, data = def_train, family = 'binomial')
summary(default_glm4)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = def_train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.168e+01  4.642e-01 -25.164  < 2e-16 ***
## income       1.960e-05  5.265e-06   3.723 0.000197 ***
## balance      5.764e-03  2.437e-04  23.656  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2650.8  on 8999  degrees of freedom
## Residual deviance: 1410.3  on 8997  degrees of freedom
## AIC: 1416.3
## 
## Number of Fisher Scoring iterations: 8
probabilities = predict(default_glm4,newdata = def_test, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 'Yes', 'No')
mse = mean(def_test$default != predicted.classes)
acc = 1 - mse
print(mse)
## [1] 0.03
print(acc)
## [1] 0.97

The 90/10 split is worse than the previous two, with an accuracy of 97%.

50/50 split

index = sample(1:nrow(Default), 0.5*nrow(Default))
def_train = Default[index,]
def_test = Default[-index,]
default_glm5 = glm(default ~ income + balance, data = def_train, family = 'binomial')
summary(default_glm5)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = def_train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.165e+01  6.311e-01 -18.460  < 2e-16 ***
## income       2.561e-05  7.211e-06   3.551 0.000384 ***
## balance      5.568e-03  3.212e-04  17.335  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1429.88  on 4999  degrees of freedom
## Residual deviance:  781.26  on 4997  degrees of freedom
## AIC: 787.26
## 
## Number of Fisher Scoring iterations: 8
probabilities = predict(default_glm5,newdata = def_test, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 'Yes', 'No')
mse = mean(def_test$default != predicted.classes)
acc = 1 - mse
print(mse)
## [1] 0.0262
print(acc)
## [1] 0.9738

The 50/50 split gives the best model at 97.38% accuracy, which is 0.0003 higher than the 80/20 split.

d) Now consider a logistic regression model that predicts the probability of default using income, balance, and a dummy variable for student. Estimate the test error for this model using the validation set approach. Comment on whether or not including a dummy variable for student leads to a reduction in the test error rate.
default_glm6 = glm(default~., data=def_train, family = 'binomial')
summary(default_glm6)
## 
## Call:
## glm(formula = default ~ ., family = "binomial", data = def_train)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.070e+01  6.975e-01 -15.338  < 2e-16 ***
## studentYes  -9.689e-01  3.327e-01  -2.912  0.00359 ** 
## balance      5.714e-03  3.304e-04  17.295  < 2e-16 ***
## income      -3.199e-07  1.146e-05  -0.028  0.97774    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1429.88  on 4999  degrees of freedom
## Residual deviance:  772.86  on 4996  degrees of freedom
## AIC: 780.86
## 
## Number of Fisher Scoring iterations: 8
probabilities = predict(default_glm6,newdata = def_test, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 'Yes', 'No')
mse = mean(def_test$default != predicted.classes)
acc = 1 - mse
print(mse)
## [1] 0.0266
print(acc)
## [1] 0.9734

Using the 50/50 split, adding student caused income to no longer be significant, while studentYes was significant at an alpha level of 0.05. Accuracy went down by 0.0004 when student was added, but the difference is quite small and could simply be an artifact of this particular split (see the sketch below). Including the student dummy did not reduce the test error rate.
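Because single-split comparisons like these are noisy, one way to gauge how much the estimate depends on the split is to repeat it many times; this loop is an illustrative sketch, not part of the assignment:

set.seed(1)
errs = replicate(100, {
  idx = sample(1:nrow(Default), 0.5 * nrow(Default))   # fresh 50/50 split each time
  fit = glm(default ~ ., data = Default[idx, ], family = 'binomial')
  prob = predict(fit, newdata = Default[-idx, ], type = 'response')
  mean(Default[-idx, ]$default != ifelse(prob > 0.5, 'Yes', 'No'))  # test error for this split
})
c(mean = mean(errs), sd = sd(errs))  # average error and split-to-split spread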

#6. a) Using the summary() and glm() functions, determine the estimated standard errors for the coefficients associated with income and balance in a multiple logistic regression model that uses both predictors.

summary(default_glm1)
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

Estimated standard errors of the coefficients: income 4.985e-06, balance 2.274e-04.

b) Write a function, boot.fn(), that takes as input the Default data set as well as an index of the observations, and that outputs the coefficient estimates for income and balance in the multiple logistic regression model.

boot.fn = function(data,index){
  coef(glm(default~income+balance,data=data,subset=index,family = 'binomial'))
}
boot.fn(Default, 1:nrow(Default))
##   (Intercept)        income       balance 
## -1.154047e+01  2.080898e-05  5.647103e-03

c) Use the boot() function together with your boot.fn() function to estimate the standard errors of the logistic regression coefficients for income and balance.

library(boot)
## Warning: package 'boot' was built under R version 4.3.3
## 
## Attaching package: 'boot'
## The following object is masked from 'package:car':
## 
##     logit
set.seed(1)
boot(Default, boot.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2*  2.080898e-05  1.680317e-07 4.866284e-06
## t3*  5.647103e-03  1.855765e-05 2.298949e-04

Bootstrap standard errors: intercept 4.344722e-01, income 4.866284e-06, balance 2.298949e-04.

d) Comment on the estimated standard errors obtained using the glm() function and using your bootstrap function.

Using boot() and boot.fn() we got estimated standard errors of 4.866284e-06 for income and 2.298949e-04 for balance. Using the glm() function we got 4.985e-06 for income and 2.274e-04 for balance. These estimates are extremely close to each other; since the glm() standard errors rely on the model's distributional assumptions while the bootstrap does not, their agreement suggests the logistic model fits the data well and those assumptions are reasonable.

#9. a) Based on this data set, provide an estimate for the population mean of medv. Call this estimate µ̂.

boston = ISLR2::Boston
mu_hat = mean(boston$medv)
mu_hat
## [1] 22.53281

We get a sample mean of 22.53 for medv (mu_hat in the code, our estimate µ̂).

b) Provide an estimate of the standard error of µ̂. Interpret this result. Hint: We can compute the standard error of the sample mean by dividing the sample standard deviation by the square root of the number of observations.
(sd(boston$medv)/sqrt(length(boston$medv)))
## [1] 0.4088611

We get 0.4089 as our estimated standard error of the sample mean µ̂. This means that, on average, the sample mean of medv deviates from the true population mean by about 0.4089.

c) Now estimate the standard error of µ̂ using the bootstrap. How does this compare to your answer from (b)?
set.seed(1)
boot.fn2 = function(data, index) {
  mean(data[index])
}
boot(boston$medv, boot.fn2, 10000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = boston$medv, statistic = boot.fn2, R = 10000)
## 
## 
## Bootstrap Statistics :
##     original       bias    std. error
## t1* 22.53281 -0.002350771   0.4069546

We estimated the standard error of the sample mean to be about 0.4089 in part (b), and bootstrapping gave a standard error of 0.4070. The answers are within 0.0019 of each other, so they are very similar in estimating how far the sample mean tends to fall from the true mean.

d) Based on your bootstrap estimate from (c), provide a 95% confidence interval for the mean of medv. Compare it to the results obtained using t.test(Boston$medv). Hint: You can approximate a 95% confidence interval using the formula [µ̂ − 2·SE(µ̂), µ̂ + 2·SE(µ̂)].

lower_bound = mu_hat - 2*(0.4069546)
upper_bound = mu_hat + 2*(0.4069546)
print(c(lower_bound, upper_bound))
## [1] 21.71890 23.34672
t.test(boston$medv)
## 
##  One Sample t-test
## 
## data:  boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

Bootstrap confidence interval: (21.71890, 23.34672); t-test interval: (21.72953, 23.33608). The two intervals are almost exactly the same, differing by only about 0.01 at each endpoint.

e) Based on this data set, provide an estimate, µ̂_med, for the median value of medv in the population.

median(boston$medv)
## [1] 21.2

The sample median of medv is 21.2.

f) We now would like to estimate the standard error of µ̂_med. Unfortunately, there is no simple formula for computing the standard error of the median. Instead, estimate the standard error of the median using the bootstrap. Comment on your findings.

set.seed(1)
boot.fn.med = function(data, index) {
  median(data[index])
}
boot(boston$medv, boot.fn.med, 10000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = boston$medv, statistic = boot.fn.med, R = 10000)
## 
## 
## Bootstrap Statistics :
##     original    bias    std. error
## t1*     21.2 -0.014705   0.3780355

When bootstrapping the median, we get a standard error of 0.3780, which estimates how far the sample median typically falls from the true median. We can create a confidence interval for the true median:

lower_bound = 21.2 - 2*(0.3780355)
upper_bound = 21.2 + 2*(0.3780355)
print(c(lower_bound, upper_bound))
## [1] 20.44393 21.95607

So, at 95% confidence, the true median should fall between 20.44393 and 21.95607.

g) Based on this data set, provide an estimate for the tenth percentile of medv in Boston census tracts. Call this quantity µ̂_0.1. (You can use the quantile() function.)
quantile(boston$medv, 0.1)
##   10% 
## 12.75

The 10th percentile of medv is 12.75.

h) Use the bootstrap to estimate the standard error of µ̂_0.1. Comment on your findings.

set.seed(1)
boot.fn3 = function(data, index) {
  quantile(data[index], 0.1)
}
boot(boston$medv, boot.fn3, 10000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = boston$medv, statistic = boot.fn3, R = 10000)
## 
## 
## Bootstrap Statistics :
##     original   bias    std. error
## t1*    12.75 0.007625   0.5014507

Using the bootstrap on the 10th percentile, we get a standard error of 0.5015. This is the largest standard error of the three statistics we bootstrapped (mean, median, and 10th percentile), but it is still rather small relative to the estimate of 12.75. Getting stable standard error estimates even for statistics with no closed-form formula shows the power of the bootstrap for estimating variability.