K-fold cross-validation works by splitting a sample of \(n\) observations into \(k\) non-overlapping groups of size \(n/k\). Each group in turn serves as the validation/test set, with the model trained on the remaining \(n - n/k\) observations. Finally, the test error is estimated by averaging the \(k\) resulting \(MSE\) values: \(CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i\).
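As a concrete illustration, below is a minimal hand-rolled k-fold CV sketch; the simulated data (simDf), the linear model, and the fold assignment are all invented for the example.
set.seed(1)
n <- 100; k <- 5
simDf <- data.frame(x = rnorm(n))
simDf$y <- simDf$x + rnorm(n)                 # simulated data for the sketch
folds <- sample(rep(1:k, length.out = n))     # assign each observation to one of k folds
cv.mse <- sapply(1:k, function(j) {
  fit <- lm(y ~ x, data = simDf[folds != j, ])                       # train on the other k-1 folds
  mean((simDf$y[folds == j] - predict(fit, simDf[folds == j, ]))^2)  # MSE on held-out fold j
})
mean(cv.mse)                                  # average of the k MSE estimates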
The validation set approach exhibits much higher variability than the k-fold method: the model is fit on only a single subset of the data, so the estimate depends heavily on which observations fall in the training set, and it tends to overestimate the test error. On the other hand, the validation set approach is not as computationally expensive as k-fold, and with a large enough data set the problem of the training subset not being representative diminishes.
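For contrast, the validation set approach on the same simulated data amounts to a single split (continuing the sketch above):
tr <- sample(n, n / 2)                                    # one random 50/50 split
fit.vs <- lm(y ~ x, data = simDf[tr, ])                   # fit on the training half only
mean((simDf$y[-tr] - predict(fit.vs, simDf[-tr, ]))^2)    # single-split estimate of test MSE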
Leave-one-out cross-validation (LOOCV) suffers from two significant shortcomings compared to k-fold cross-validation. First, it is more computationally expensive, as it fits the model once per observation, or \(n\) times, rather than \(k\) times. Second, because each of the \(n\) models is fit on nearly the same data, their outputs are highly correlated, so the averaged estimate can have rather high variance, unlike k-fold, which is fit on \(k\) less-overlapping samples of size \(n - n/k\). Conversely, a benefit is that LOOCV has very little bias in its estimates, whereas k-fold introduces somewhat more bias because each training set is smaller. A good balance between variance and bias for k-fold is \(5 \le k \le 10\).
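Both LOOCV and k-fold CV are also available through boot::cv.glm; a sketch, reusing the simulated simDf from above:
library(boot)
sim.fit <- glm(y ~ x, data = simDf)      # gaussian glm, equivalent to lm here
cv.glm(simDf, sim.fit)$delta[1]          # LOOCV: the model is refit n times
cv.glm(simDf, sim.fit, K = 10)$delta[1]  # 10-fold CV: only 10 refits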
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.5.3
defaultDf<-Default
logit.fit <- glm(default ~ income + balance,family = 'binomial',data=defaultDf)
summary(logit.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = defaultDf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
set.seed(1)
train=sample(nrow(defaultDf),nrow(defaultDf)/2)
logit.fitTrain<-glm(default ~ income + balance, family = 'binomial', data = defaultDf, subset = train)
summary(logit.fitTrain)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = defaultDf, subset = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3583 -0.1268 -0.0475 -0.0165 3.8116
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.208e+01 6.658e-01 -18.148 <2e-16 ***
## income 1.858e-05 7.573e-06 2.454 0.0141 *
## balance 6.053e-03 3.467e-04 17.457 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1457.0 on 4999 degrees of freedom
## Residual deviance: 734.4 on 4997 degrees of freedom
## AIC: 740.4
##
## Number of Fisher Scoring iterations: 8
predLogit <- predict(logit.fitTrain, defaultDf[-train, ], type = "response")  # predict() has no subset argument, so index the validation rows directly
preds.logit <- rep('No', length(predLogit))
preds.logit[predLogit > 0.5] <- 'Yes'
mean(preds.logit != defaultDf[-train, ]$default)
## [1] 0.0484
The model fit on the training sample has a test error rate of \(4.84\%\).
### (c)
set.seed(2)
train2=sample(nrow(defaultDf),nrow(defaultDf)/2)
logit.fitTrain2<-glm(default ~ income + balance, family = 'binomial', data = defaultDf, subset = train2)
summary(logit.fitTrain2)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = defaultDf, subset = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2043 -0.1385 -0.0552 -0.0203 3.7058
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.184e+01 6.403e-01 -18.492 < 2e-16 ***
## income 2.717e-05 7.183e-06 3.783 0.000155 ***
## balance 5.703e-03 3.266e-04 17.460 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1443.44 on 4999 degrees of freedom
## Residual deviance: 776.64 on 4997 degrees of freedom
## AIC: 782.64
##
## Number of Fisher Scoring iterations: 8
predLogit2 <- predict(logit.fitTrain2, defaultDf[-train2, ], type = "response")
preds.logit2 <- rep('No', length(predLogit2))
preds.logit2[predLogit2 > 0.5] <- 'Yes'
mean(preds.logit2 != defaultDf[-train2, ]$default)
## [1] 0.0475
set.seed(3)
train3=sample(nrow(defaultDf),nrow(defaultDf)/2)
logit.fitTrain3<-glm(default ~ income + balance, family = 'binomial', data = defaultDf, subset = train3)
summary(logit.fitTrain3)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = defaultDf, subset = train3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1014 -0.1433 -0.0569 -0.0206 3.7241
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.160e+01 6.055e-01 -19.162 < 2e-16 ***
## income 2.254e-05 6.972e-06 3.233 0.00123 **
## balance 5.660e-03 3.131e-04 18.079 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1530.39 on 4999 degrees of freedom
## Residual deviance: 812.77 on 4997 degrees of freedom
## AIC: 818.77
##
## Number of Fisher Scoring iterations: 8
predLogit3 <- predict(logit.fitTrain3, defaultDf[-train3, ], type = "response")
preds.logit3 <- rep('No', length(predLogit3))
preds.logit3[predLogit3 > 0.5] <- 'Yes'
mean(preds.logit3 != defaultDf[-train3, ]$default)
## [1] 0.0453
set.seed(4)
train4=sample(nrow(defaultDf),nrow(defaultDf)/2)
logit.fitTrain4<-glm(default ~ income + balance, family = 'binomial', data = defaultDf, subset = train4)
summary(logit.fitTrain4)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = defaultDf, subset = train4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4799 -0.1411 -0.0559 -0.0208 3.7223
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.156e+01 6.146e-01 -18.803 < 2e-16 ***
## income 2.004e-05 6.997e-06 2.864 0.00418 **
## balance 5.678e-03 3.244e-04 17.502 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1450.21 on 4999 degrees of freedom
## Residual deviance: 780.49 on 4997 degrees of freedom
## AIC: 786.49
##
## Number of Fisher Scoring iterations: 8
predLogit4 <- predict(logit.fitTrain4, defaultDf[-train4, ], type = "response")
preds.logit4 <- rep('No', length(predLogit4))
preds.logit4[predLogit4 > 0.5] <- 'Yes'
mean(preds.logit4 != defaultDf[-train4, ]$default)
## [1] 0.0472
Splitting the data into training and validation sets three more times yields results similar to the first split, with the new splits having test error rates of \(4.75\%\), \(4.53\%\), and \(4.72\%\), respectively.
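Since the three splits above all follow the same template, the procedure can be condensed into a single loop over the seeds; a sketch (names such as tr and p are introduced here for illustration):
sapply(2:4, function(s) {
  set.seed(s)
  tr  <- sample(nrow(defaultDf), nrow(defaultDf) / 2)
  fit <- glm(default ~ income + balance, family = "binomial",
             data = defaultDf, subset = tr)
  p   <- predict(fit, defaultDf[-tr, ], type = "response")      # validation-set probabilities
  mean(ifelse(p > 0.5, "Yes", "No") != defaultDf$default[-tr])  # test error for this split
})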
set.seed(5)
train5=sample(nrow(defaultDf),nrow(defaultDf)/2)
logit.fitTrain5<-glm(default ~ income + balance + student, family = 'binomial', data = defaultDf, subset = train5)
summary(logit.fitTrain5)
##
## Call:
## glm(formula = default ~ income + balance + student, family = "binomial",
## data = defaultDf, subset = train5)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4304 -0.1472 -0.0598 -0.0220 3.7046
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.033e+01 6.789e-01 -15.221 < 2e-16 ***
## income -4.264e-06 1.176e-05 -0.363 0.71684
## balance 5.624e-03 3.186e-04 17.652 < 2e-16 ***
## studentYes -9.811e-01 3.356e-01 -2.923 0.00346 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1470.4 on 4999 degrees of freedom
## Residual deviance: 805.1 on 4996 degrees of freedom
## AIC: 813.1
##
## Number of Fisher Scoring iterations: 8
predLogit5 <- predict(logit.fitTrain5, defaultDf[-train5, ], type = "response")
preds.logit5 <- rep('No', length(predLogit5))
preds.logit5[predLogit5 > 0.5] <- 'Yes'
mean(preds.logit5 != defaultDf[-train5, ]$default)
## [1] 0.0454
Adding the \(student\) variable does not seem to have a significant effect on the test error rate: at \(4.54\%\), it is similar to that of the other models.
logitFull.fit <- glm(default ~ income + balance,family = 'binomial',data=defaultDf)
summary(logitFull.fit)
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = defaultDf)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
The standard errors for \(income\) and \(balance\) are \(4.985 \times 10^{-6}\) and \(2.274 \times 10^{-4}\), respectively.
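These can also be extracted programmatically from the summary's coefficient table rather than read off by eye:
summary(logitFull.fit)$coefficients[c("income", "balance"), "Std. Error"]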
library(boot)
boot.fn=function(data,index)
return(coef(glm(default ~ income + balance,data=data,subset=index,family = 'binomial')))
set.seed(1)
boot(defaultDf,boot.fn,100)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = defaultDf, statistic = boot.fn, R = 100)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 9.699111e-02 4.101121e-01
## t2* 2.080898e-05 6.715005e-08 4.127740e-06
## t3* 5.647103e-03 -5.733883e-05 2.105660e-04
The bootstrap provides an accurate estimate of the model coefficient standard errors: the bootstrapped SE for \(income\) is \(4.128 \times 10^{-6}\) vs. the model's \(4.985 \times 10^{-6}\), and the bootstrapped SE for \(balance\) is \(2.106 \times 10^{-4}\), close to the model's \(2.274 \times 10^{-4}\).
library(MASS)
## Warning: package 'MASS' was built under R version 3.5.3
bostonDf <- Boston
mu_hat<-mean(bostonDf$medv)
mu_hat
## [1] 22.53281
\(\hat{\mu}=22.53281\)
se_hat<-sd(bostonDf$medv)/sqrt(nrow(bostonDf))
se_hat
## [1] 0.4088611
The SE of \(\hat{\mu}\) is \(0.4088611\).
set.seed(1)
boot.mu <- function(data, index) {
mu <- mean(data[index])
return (mu)
}
boot(bostonDf$medv, boot.mu, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = bostonDf$medv, statistic = boot.mu, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 0.008517589 0.4119374
Using the bootstrap, the SE for \(\hat{\mu}\) is \(0.4119374\), which is quite close to the formula-based estimate of \(0.4088611\) from above.
t.test(bostonDf$medv)
##
## One Sample t-test
##
## data: bostonDf$medv
## t = 55.111, df = 505, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 21.72953 23.33608
## sample estimates:
## mean of x
## 22.53281
mu_hat.CI <- c(22.53281 - 2 * 0.4119374, 22.53281 + 2 * 0.4119374)
mu_hat.CI
## [1] 21.70894 23.35668
The CI created via bootstrapping is very close to the one from the t-test: the bootstrap method gives bounds of \((21.70894, 23.35668)\) vs. the t-test's \((21.72953, 23.33608)\).
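The interval can also be computed without hand-typing the bootstrap SE by saving the boot() result and passing it to boot.ci(); a sketch, where boot.out is a name introduced here:
set.seed(1)
boot.out <- boot(bostonDf$medv, boot.mu, 1000)  # same bootstrap as above, kept as an object
boot.ci(boot.out, type = "norm")                # normal-approximation 95% CI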
med_hat <- median(bostonDf$medv)
med_hat
## [1] 21.2
\(\hat{\mu}_{med} = 21.2\)
set.seed(1)
boot.mu_med <- function(data, index) {
mu_med <- median(data[index])
return (mu_med)
}
boot(bostonDf$medv, boot.mu_med, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = bostonDf$medv, statistic = boot.mu_med, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 -0.01615 0.3801002
The bootstrap yields the same median estimate of \(21.2\), with a standard error of \(0.3801002\).
med_hat_10p <- quantile(bostonDf$medv, c(0.1))
med_hat_10p
## 10%
## 12.75
The 10th percentile of the \(medv\) variable is \(12.75\).
set.seed(1)
boot.mu_10p <- function(data, index) {
mu_10p <- quantile(data[index], c(0.1))
return (mu_10p)
}
boot(bostonDf$medv, boot.mu_10p, 1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = bostonDf$medv, statistic = boot.mu_10p, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.01005 0.505056
As with the median, the 10th percentile calculated with bootstrapping matches the sample value of \(12.75\), with a standard error of \(0.505056\).