Algo 2 HW 4

Question 3

A

K-fold cross validation is implemented by splitting the data set into k segments of equal length. K models are developed leaving out each one of the folds. Then, the test error is calculated for each model, and averaged to create a test error for the data set.

B

The advantages of k-fold cross validation compared to the validation approach are that it is less susceptible to randomness caused by the seed and is less susceptible to overestimating the test error. The disadvantage is the extra computation power required to run k number of models.

The advantages of k-fold cross validation compared to LOOCV is the reduced computational power required (k models instead of n models) and lower variance. The disadvantage is the increased bias.

Question 5

A

Default = Default
Default$default = ifelse(Default$default == 'Yes', 1, 0)
logit = glm(default ~ income + balance, data = Default, family = 'binomial')

B

set.seed(1)
train = sample(nrow(Default), .7 * nrow(Default))
cv.logit = glm(default ~ income + balance, data = Default, subset = train, family = 'binomial')
cv.probs = predict(cv.logit, Default[-train, ])
cv.preds = ifelse(cv.probs > .5, 1, 0)
table(Default[-train,]$default, cv.preds)

##    cv.preds
##        0    1
##   0 2897    1
##   1   89   13

val.err = 1 - mean(Default[-train,]$default == cv.preds)
print(val.err)

## [1] 0.03

C

Repeating the process for seed 2-4 results in test errors ranging from .0227 to .03. This is a ~30% variance which does not give me confidence in what the actual test error is.

set.seed(2)
train = sample(nrow(Default), .7 * nrow(Default))
cv.logit = glm(default ~ income + balance, data = Default, subset = train, family = 'binomial')
cv.probs = predict(cv.logit, Default[-train, ])
cv.preds = ifelse(cv.probs > .5, 1, 0)
table(Default[-train,]$default, cv.preds)

##    cv.preds
##        0    1
##   0 2900    5
##   1   67   28

val.err = 1 - mean(Default[-train,]$default == cv.preds)
print(val.err)

## [1] 0.024

set.seed(3)
train = sample(nrow(Default), .7 * nrow(Default))
cv.logit = glm(default ~ income + balance, data = Default, subset = train, family = 'binomial')
cv.probs = predict(cv.logit, Default[-train, ])
cv.preds = ifelse(cv.probs > .5, 1, 0)
table(Default[-train,]$default, cv.preds)

##    cv.preds
##        0    1
##   0 2910    7
##   1   68   15

val.err = 1 - mean(Default[-train,]$default == cv.preds)
print(val.err)

## [1] 0.025

set.seed(4)
train = sample(nrow(Default), .7 * nrow(Default))
cv.logit = glm(default ~ income + balance, data = Default, subset = train, family = 'binomial')
cv.probs = predict(cv.logit, Default[-train, ])
cv.preds = ifelse(cv.probs > .5, 1, 0)
table(Default[-train,]$default, cv.preds)

##    cv.preds
##        0    1
##   0 2906    8
##   1   60   26

val.err = 1 - mean(Default[-train,]$default == cv.preds)
print(val.err)

## [1] 0.02266667

D

Adding the student predictor decreased model performance if you compare to identical seeds! I recommend leaving it out of the model.

set.seed(1)
train = sample(nrow(Default), .7 * nrow(Default))
cv.logit = glm(default ~ income + balance + student, data = Default, subset = train, family = 'binomial')
cv.probs = predict(cv.logit, Default[-train, ])
cv.preds = ifelse(cv.probs > .5, 1, 0)
table(Default[-train,]$default, cv.preds)

##    cv.preds
##        0    1
##   0 2897    1
##   1   91   11

val.err = 1 - mean(Default[-train,]$default == cv.preds)
print(val.err)

## [1] 0.03066667

Question 6

A

The standard errors for income and balance are 4.985 x 10^-6 and 2.274 x 10^-4, respectively.

logit = glm(default ~ income + balance, data = Default, family = 'binomial')
summary(logit)

## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

income.std.err = summary(logit)$coefficients[2,2]
balance.std.err = summary(logit)$coefficients[3,2]

B

set.seed(1)
boot.fn = function(data, index) + return(coef(glm(default ~ income + balance, data = data, subset = index, family = 'binomial')))

C

boot(Default, boot.fn, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -3.945460e-02 4.344722e-01
## t2*  2.080898e-05  1.680317e-07 4.866284e-06
## t3*  5.647103e-03  1.855765e-05 2.298949e-04

D

The standard errors calculated by the glm and the bootstrap function are very similar. With a standard error greater than the coefficient estimate, we can assume that Income does not effect whether or not an individual will Default. But, Balance’s standard error is an order of magnitude less than the coefficent, so we should keep that predictor as for whether or not an individual might Default.

Question 9

A

The mean Median Value of owner-occupied homes is $225,330.

boston = Boston
mu = mean(Boston$medv)
mu

## [1] 22.53281

B

The standard error is .409. This means I am 95% confident the true population mean is between $217,314 and $233,346.

se.mu = sd(Boston[['medv']]) / (nrow(Boston)^.5)
se.mu

## [1] 0.4088611

C

The standard error calculated using bootstrap is the same as the one calculated using the entire sample set.

set.seed(1)
se.fn = function(data,index){
   X = data$medv[index]
   return(sd(X) / (length(X) ^ .5))
}
boot(boston, se.fn, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = boston, statistic = se.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##      original       bias    std. error
## t1* 0.4088611 -0.001898635  0.01735132

D

Using the bootstrap standard error of .409 we calculate the same confidence interval as in Part B. I am 95% confident the true population mean is between $217,314 and $233,346. This is nearly identical to the confidence interval calculated using a t-test on the full sample.

t.test(boston$medv)

## 
##  One Sample t-test
## 
## data:  boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  21.72953 23.33608
## sample estimates:
## mean of x 
##  22.53281

E

The median Median Home Value for Owner-Occupied homes is $212,000.

mu.med = median(boston$medv)
mu.med

## [1] 21.2

F

The bootstrap estimate for standard error is .377. This means we are 95% confident the population median is between $204,611 and $219,389.

set.seed(1)
med.fn = function(data,index){
   X = data$medv[index]
   return(median(X))
}
boot(boston, med.fn, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = boston, statistic = med.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2 0.02295   0.3778075

G

The top ten percentile has a median home value of $348,000.

mu.10 = quantile(boston$medv, probs = .9)
mu.10

##  90% 
## 34.8

H

The bootstrap estimate for the top 10 percentile is 1.148. This means we are 95% confident the population top 10 percentile is between $325,499 and $370,501.

set.seed(1)
se.10.fn = function(data,index){
   X = data$medv[index]
   return(quantile(X, probs = .9))
}
boot(boston, se.10.fn, 1000)

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = boston, statistic = se.10.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original   bias    std. error
## t1*     34.8 -0.40475     1.14822

Algo 2 HW 4

Joe Schulze

3/25/2021

Question 3

A

B

Question 5

A

B

C

D

Question 6

A

B

C

D

Question 9

A

B

C

D

E

F

G

H