With k-fold cross-validation the data is separated into k folds and a model is trained on k-1 of those folds, then tested on the fold that was held out of training. This is repeated k times, once with each fold serving as the test set, and the k error estimates are averaged.
K-fold cross-validation has lower variability than the validation set approach, especially for smaller datasets. Also, because the validation set approach fits the model on only part of the data, it tends to overestimate the test error.
LOOCV is less biased than k-fold cross-validation, but it has higher variance. K-fold is also much better suited to larger datasets because it is far less computationally demanding.
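For reference, k-fold cross-validation can be run directly with cv.glm() from the boot package. A minimal sketch for the logistic model used below, with the cost function and K = 10 taken from the cv.glm() documentation rather than from this write-up:
library(ISLR)
library(boot)
# misclassification rate at a 0.5 probability cutoff
cost = function(r, pi = 0) mean(abs(r - pi) > 0.5)
cv.fit = glm(default ~ income + balance, family = "binomial", data = Default)
# raw 10-fold cross-validation estimate of the test error rate
cv.glm(Default, cv.fit, cost, K = 10)$delta[1]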
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
attach(Default)
glm.fit=glm(default~income+balance, family="binomial", data=Default)
summary(glm.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
set.seed(3)
train = sample(1:1000,500)
length(train)
## [1] 500
glm.fit = glm(default ~ income+balance, family = "binomial", subset = train)
# note: predict.glm() takes newdata=, not data=, so the data= argument here is silently ignored and these are the 500 fitted values on the training set
glm.pred = factor(ifelse(predict(glm.fit, data = Default[-train,], type = "response") > 0.5, "Yes", "No"))
summary(glm.pred)
## No Yes
## 496 4
summary(Default)
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
mean(Default[-train,'default'] != glm.pred)
## [1] 0.04084211
The validation set error estimate is about 0.0408.
set.seed(3)
train = sample(1:2000,1000)
glm.fit = glm(default ~ income+balance, family = "binomial", subset = train)
glm.pred = factor(ifelse(predict(glm.fit, data = Default[-train,], type = "response") > 0.5, "Yes", "No"))
summary(glm.pred)
## No Yes
## 985 15
mean(Default[-train,'default'] != glm.pred)
## [1] 0.04522222
set.seed(3)
train = sample(1:4000,2000)
glm.fit = glm(default ~ income+balance, family = "binomial", subset = train)
glm.pred = factor(ifelse(predict(glm.fit, data = Default[-train,], type = "response") > 0.5, "Yes", "No"))
summary(glm.pred)
## No Yes
## 1962 38
mean(Default[-train,'default'] != glm.pred)
## [1] 0.05125
Across all three splits the validation set error stayed roughly the same, ranging from about 0.041 to 0.051.
ValidationLoop = function(totalSize, subtotal, seed) {
  set.seed(seed)
  train = sample(1:totalSize, subtotal)
  glm.fit = glm(default ~ income+balance+student, family = "binomial", subset = train)
  glm.pred = factor(ifelse(predict(glm.fit, data = Default[-train,], type = "response") > 0.5, "Yes", "No"))
  mean(Default[-train,'default'] != glm.pred)
}
for(i in list(1,2,5)) print(ValidationLoop(i*2000, i*1000, i))
## [1] 0.05333333
## [1] 0.05
## [1] 0.0486
Adding the dummy variable student does not appear to have reduced the test error much.
library(boot)
summary(glm(default~income+balance,data=Default,family="binomial"))
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
boot.fn = function(dataSet, obvIndex){
  coef(glm(default~income+balance, data=dataSet, subset=obvIndex, family = "binomial"))
}
set.seed(1)
boot.fn(Default,sample(2000,1000,replace=T))
## (Intercept) income balance
## -1.370125e+01 3.073263e-05 6.996657e-03
boot(Default, boot.fn, R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -3.920364e-02 0.4344275948
## t2* 2.080898e-05 1.646916e-07 0.0000048674
## t3* 5.647103e-03 1.845423e-05 0.0002298672
The standard errors estimated by the glm() function are almost exactly the same as those produced by my bootstrap function.
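To line the two up directly, the glm() standard errors can be pulled out of the coefficient table and compared against the std. error column of the boot() output above; a quick sketch:
# standard errors from the asymptotic formula used by glm()
summary(glm(default ~ income + balance, data = Default, family = "binomial"))$coefficients[, "Std. Error"]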
library(MASS)
attach(Boston)
mu=mean(medv)
print(mu)
## [1] 22.53281
sd(Boston$medv)/sqrt(length(Boston$medv))
## [1] 0.4088611
set.seed(42)
boot.fn = function(x,index){
  mean(x[index])
}
boot(medv,statistic = boot.fn, R = 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.02671186 0.4009216
Although they are not exactly the same, the two estimates are extremely close: 0.4089 from the formula and 0.4009 from the bootstrap.
print(mu-2*0.4088611)
## [1] 21.71508
print(mu-2*0.4009216)
## [1] 21.73096
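The same arithmetic gives the upper bound as well; a quick sketch of the full mu ± 2*SE interval using the bootstrap standard error:
# approximate 95% confidence interval, mu plus or minus two bootstrap standard errors
c(mu - 2*0.4009216, mu + 2*0.4009216)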
t.test(Boston$medv)
##
## One Sample t-test
##
## data: Boston$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
The estimated confidence intervals are quite close.
median(Boston$medv)
## [1] 21.2
set.seed(42)
boot2.fn = function(x,index){
  median(x[index])
}
boot(medv,statistic = boot2.fn, R = 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot2.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.0106 0.3661785
Just as with the mean in part c, the standard error of the median is small compared to the estimate itself.
quantile(Boston$medv, 0.1)
## 10%
## 12.75
set.seed(42)
boot3.fn = function(x,index){
  quantile(x[index], 0.1)
}
boot(medv,statistic = boot3.fn, R = 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = boot3.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0215 0.4948966
The standard error of the tenth percentile is larger than that of the median, but still small relative to the estimate.
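A confidence interval for the tenth percentile can also be read off the bootstrap replicates with boot.ci(); a minimal sketch, where boot10 is just my name for the stored boot object:
# store the bootstrap output and request a 95% percentile interval
boot10 = boot(medv, statistic = boot3.fn, R = 1000)
boot.ci(boot10, type = "perc")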