Assignment 4 - Predictive Modeling

Question 3

K-fold cross validation splits a data set into “k” segments (folds) of roughly equal size. You hold out one fold, fit the model to the remaining k - 1 folds, and compute the mean squared error on the held-out fold. You repeat this for each fold, so the model is fit “k” times with a different test fold each time, and the cross-validation estimate is the average of the “k” mean squared errors.
Relative to the validation set approach, K-fold cross validation gives a less variable estimate of the test error because it is not as sensitive to how the data happen to be divided. The cost is that it is more computationally taxing, since the model is fit “k” times instead of once, which can be slow on large data sets. Relative to LOOCV, K-fold is less computationally taxing because the model is fit “k” times rather than “n” times, and its test error estimate typically has lower variance. However, LOOCV has lower bias in its estimates.
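
The procedure described above can be sketched in base R. This is only an illustration on simulated data: the data frame `sim`, the linear model, and the choice of k = 5 are assumptions made for the example, not part of the assignment.

```r
# Minimal k-fold cross-validation sketch on simulated data (not the assignment data).
set.seed(1)
n <- 100
sim <- data.frame(x = rnorm(n))
sim$y <- 2 + 3 * sim$x + rnorm(n)

k <- 5
# Randomly assign each row to one of k folds of equal size (n is a multiple of k here).
folds <- sample(rep(1:k, length.out = n))

fold_mse <- numeric(k)
for (j in 1:k) {
  train <- sim[folds != j, ]   # fit on the other k - 1 folds
  test  <- sim[folds == j, ]   # evaluate on the held-out fold
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = test)
  fold_mse[j] <- mean((test$y - pred)^2)
}

# The cross-validation estimate is the average of the k held-out MSEs.
cv_mse <- mean(fold_mse)
cv_mse
```

Setting `k <- n` in this sketch would give LOOCV, which is why LOOCV needs “n” model fits.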

Question 5

See logistic regression model below.

## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = defaultdf)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

The validation set approach below splits the data 70/30 into training and validation sets. The value shown is the validation set error rate.

## [1] 0.02633333

The validation set approaches below use, in order, 75/25, 80/20, and 85/15 splits. The results show that the error rate is stable across splits but not identical. It is also slightly lower with more training data (0.0263 for the 70/30 split versus 0.0247 for the 85/15 split), although the difference is small.

## [1] 0.0264

## [1] 0.026

## [1] 0.02466667

I fit the model below using a 70/30 split. Adding a dummy variable for student status makes the validation error rate slightly worse (0.027 versus 0.0263).

## [1] 0.027
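
The splits used in this question could be produced along the following lines. This is a hedged sketch on simulated data: the helper `validation_error()`, the variables `x1`, `x2`, `y`, and the 0.5 classification threshold are all assumptions for illustration, since the original code for this step is not shown.

```r
# Validation set approach for a logistic regression, on simulated data.
# validation_error() is a hypothetical helper, not the code used in the assignment.
validation_error <- function(data, train_frac) {
  n <- nrow(data)
  train_idx <- sample(n, size = floor(train_frac * n))
  fit  <- glm(y ~ x1 + x2, data = data[train_idx, ], family = binomial)
  test <- data[-train_idx, ]
  prob <- predict(fit, newdata = test, type = "response")
  pred <- as.integer(prob > 0.5)   # assumed threshold of 0.5
  mean(pred != test$y)             # share of misclassified validation rows
}

set.seed(1)
n <- 2000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-2 + 1.5 * sim$x1 + 0.5 * sim$x2))

errors <- sapply(c(0.70, 0.75, 0.80, 0.85),
                 function(p) validation_error(sim, p))
names(errors) <- c("70/30", "75/25", "80/20", "85/15")
errors
```

Because each call draws a new random split, the error rates differ a little from run to run, which is the same kind of variation seen in the results above.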

Question 6

The standard errors for both coefficients are shown in the summary below.

## 
## Call:
## glm(formula = default ~ income + balance, family = binomial, 
##     data = defaultdf)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

See function below.

boot.fn <- function(data, index) {
  fit <- glm(default ~ income + balance,
             data = data,
             family = binomial,
             subset = index)
  return(coef(fit)[c("income", "balance")])
}

See bootstrap model below.

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original       bias     std. error
## t1* 2.080898e-05 1.582518e-07 4.729534e-06
## t2* 5.647103e-03 1.296980e-05 2.217214e-04

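The bootstrap standard errors (4.730e-06 for income and 2.217e-04 for balance) are very close to, and slightly smaller than, the standard errors from glm() (4.985e-06 and 2.274e-04). Both sets of standard errors are small relative to the coefficient estimates.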
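
To make the comparison concrete, the sketch below computes both kinds of standard error side by side. It uses simulated data and a hand-rolled resampling loop in base R instead of boot(); the names `sim`, `coef_fn`, `x1`, and `x2` are assumptions for the example.

```r
# Compare glm() standard errors with bootstrap standard errors on simulated data.
set.seed(1)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-1 + 1.0 * sim$x1 + 0.5 * sim$x2))

# Same idea as boot.fn in the text: refit on the resampled rows, return the slopes.
coef_fn <- function(data, index) {
  fit <- glm(y ~ x1 + x2, data = data[index, , drop = FALSE], family = binomial)
  coef(fit)[c("x1", "x2")]
}

R <- 200  # number of bootstrap replicates (kept small so the sketch runs quickly)
boot_coefs <- replicate(R, coef_fn(sim, sample(n, n, replace = TRUE)))

# boot_coefs is a 2 x R matrix; the bootstrap SE is the SD across replicates.
boot_se <- apply(boot_coefs, 1, sd)

full_fit <- glm(y ~ x1 + x2, data = sim, family = binomial)
glm_se <- summary(full_fit)$coefficients[c("x1", "x2"), "Std. Error"]

rbind(glm = glm_se, bootstrap = boot_se)
```

When the logistic model is a reasonable fit, the two rows should be of similar size, as they are in the output above.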

Question 9

See below.

## [1] 22.53281

Given the standard error below, we know that the sample mean for medv typically varies by about $409 from sample to sample. Relative to a mean of about 22.53, I would say that this is a pretty small standard error.

## [1] 0.4088611
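
For reference, the usual formula for the standard error of a sample mean is the sample standard deviation divided by the square root of n. The sketch below assumes that formula was used here; `se_of_mean()` is a hypothetical helper and the vector `x` is a simulated stand-in for medv.

```r
# Standard error of a sample mean: sample SD divided by sqrt(n).
se_of_mean <- function(x) sd(x) / sqrt(length(x))

set.seed(1)
x <- rnorm(506, mean = 22.5, sd = 9)  # simulated stand-in, same n as the data
se_of_mean(x)

# medv is recorded in $1000s, so the reported value converts to dollars as:
se_dollars <- 0.4088611 * 1000
se_dollars
```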

The bootstrap standard error (0.4046) is slightly smaller than the formula-based estimate (0.4089).

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = boston$medv, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 -0.01607372   0.4045557

The 95% confidence intervals from the t-test (21.73 to 23.34) and from the bootstrap standard error (21.74 to 23.33) are almost identical.

## 
##  One Sample t-test
## 
## data:  boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

## [1] 21.73988 23.32574
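
The bootstrap interval above is consistent with the normal approximation, mean plus or minus 1.96 bootstrap standard errors. The sketch below reproduces it from the reported values; the multiplier 1.96 is an assumption inferred from the numbers, since the original code for this step is not shown.

```r
mu_hat  <- 22.53281    # "original" estimate from the bootstrap output
se_boot <- 0.4045557   # bootstrap std. error from the bootstrap output

# Normal-approximation 95% interval: estimate +/- 1.96 * SE.
ci <- c(mu_hat - 1.96 * se_boot, mu_hat + 1.96 * se_boot)
round(ci, 5)
# [1] 21.73988 23.32574
```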

See below.

mu_med_hat <- median(boston$medv)
mu_med_hat

## [1] 21.2

The bootstrap standard error of the median shows that medv’s median varies by about $378 on average.

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = boston$medv, statistic = function(data, index) {
##     median(data[index])
## }, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 -0.0264   0.3783996

## [1] 0.3783996

See below.

##   10% 
## 12.75

The bootstrap standard error shows us that the tenth percentile of medv varies, on average, by about $508.

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Boston$medv, statistic = function(data, index) {
##     quantile(data[index], 0.1)
## }, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75 0.02485   0.5082798

## [1] 0.5082798
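
Neither the median nor a percentile has a simple standard error formula, which is why the bootstrap is used for both. A minimal base R sketch of the same idea follows. The helper `boot_se()` and the simulated vector `x` are assumptions for illustration, and this loop stands in for boot() rather than reproducing it.

```r
# Bootstrap standard error of an arbitrary statistic, in base R.
# boot_se() is a hypothetical helper, not part of the boot package.
boot_se <- function(x, statistic, R = 1000) {
  # Resample x with replacement R times and recompute the statistic each time.
  reps <- replicate(R, statistic(sample(x, replace = TRUE)))
  sd(reps)
}

set.seed(1)
x <- rexp(506, rate = 1 / 22)  # skewed simulated stand-in for medv

se_median <- boot_se(x, median)
se_p10    <- boot_se(x, function(v) quantile(v, 0.1))
c(median = se_median, p10 = se_p10)
```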

Matthew Reyes

2026-06-28
