Notebook prepared by Everton Lima

Introduction to Statistical Learning Solutions (ISLR)

Ch 5 Exercises

Table of Contents

Conceptual Exercises

Applied Exercises

Conceptual Exercises

1

\(Var(\alpha X + (1-\alpha)Y) = E[(\alpha X + (1-\alpha)Y)^2]-(E[\alpha X + (1-\alpha)Y])^2\)

Where,

\(E[(\alpha X + (1-\alpha)Y)^2] = E[\alpha^2 X^2 + 2\alpha(1-\alpha)XY+(1-\alpha)^2 Y^2]\) \(= \alpha^2 E[X^2] + 2\alpha(1-\alpha)E[XY]+(1-\alpha)^2E[Y^2]\)

and,

\(E[\alpha X + (1-\alpha)Y]^2 = (\alpha E[X] + (1-\alpha)E[Y])^2\) \(=\alpha^2 E[X]^2 + 2\alpha(1-\alpha)E[X]E[Y] + (1-\alpha)^2 E[Y]^2\)

thus,

\(\alpha^2 E[X^2] + 2\alpha(1-\alpha)E[XY]+(1-\alpha)^2E[Y^2] - (\alpha^2 E[X]^2 + 2\alpha(1-\alpha)E[X]E[Y] + (1-\alpha)^2 E[Y]^2)\)

\(\alpha^2(E[X^2] - E[X]^2) + 2\alpha(1-\alpha)(E[XY] - E[X]E[Y]) + (1-\alpha)^2(E[Y^2]-E[Y]^2)\) \(= \alpha^2\sigma_X^2 + 2\alpha(1-\alpha)\sigma_{XY} + (1-\alpha)^2\sigma_Y^2\).

Taking the derivative with respect to \(\alpha\) and setting it to zero yields \(2\alpha\sigma_X^2 + 2(1-2\alpha)\sigma_{XY} - 2(1-\alpha)\sigma_Y^2 = 0\)

\(2\alpha (\sigma_X^2 +\sigma_Y^2-2\sigma_{XY}) = 2\sigma_Y^2 - 2\sigma_{XY}\)

Therefore, \(\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2-2\sigma_{XY}}\)
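A quick numerical check of this result is shown below. It is a minimal sketch with made-up example values for \(\sigma_X^2\), \(\sigma_Y^2\) and \(\sigma_{XY}\), comparing the analytical \(\alpha\) with the minimizer found on a grid.

# Sanity check of the formula for alpha (example values are assumptions, not from the book)
sigma2_X <- 1.0    # Var(X)
sigma2_Y <- 1.25   # Var(Y)
sigma_XY <- 0.5    # Cov(X, Y)

# Variance of alpha*X + (1-alpha)*Y as a function of alpha
port_var <- function(a) a^2*sigma2_X + (1-a)^2*sigma2_Y + 2*a*(1-a)*sigma_XY

alphas <- seq(0, 1, by = 0.0001)
alphas[which.min(port_var(alphas))]                          # numerical minimizer
(sigma2_Y - sigma_XY) / (sigma2_X + sigma2_Y - 2*sigma_XY)   # analytical formula; both give 0.6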

2

2a

The probability that the first bootstrap observation is not the jth observation from the original sample is \(\frac{n-1}{n}\). Each of the n observations has the same probability of being drawn, namely \(\frac{1}{n}\). Thus, by the complement rule, the probability of not drawing the jth observation is \(1-\frac{1}{n} = \frac{n-1}{n}\).

2b

Same as in the previous question, \(1-\frac{1}{n}\). Because the bootstrap performs repeated sampling with replacement, each draw does not depend on the draws already made.

2c

The probability of not choosing the jth observation on any single draw is \(1-\frac{1}{n}\), as stated before. Since the bootstrap samples with replacement, the draws are independent of one another. Therefore, the probability that the jth observation is not chosen in any of the n draws is \((1-\frac{1}{n}) \cdot (1-\frac{1}{n}) \cdots (1-\frac{1}{n}) = (1-\frac{1}{n})^n\).

2d

With \(n=5\), the probability that the jth observation is in the bootstrap sample is \(1-(1-\frac{1}{5})^5 \approx 0.672\).

2e

With \(n=100\), the probability is \(1-(1-\frac{1}{100})^{100} \approx 0.634\).

2f

With \(n=10{,}000\), the probability is \(1-(1-\frac{1}{10000})^{10000} \approx 0.632\).

2g

prob_jth = function(n){ 1-(1-1/n)^n }   # P(jth observation appears in a bootstrap sample of size n)
x = 1:10000
y = sapply(x,prob_jth)
plot(x,y,type='l')

The function drops quickly at first and then levels off near \(1-1/e \approx 0.632\). The probability that any given observation appears in a bootstrap sample decreases with n, but it does not go to zero; it converges to about 63%.

2h

# This code is on page 198 of ISLR
store=rep(NA,10000)
for(i in 1:10000){
  store[i]=sum(sample(1:100,rep=TRUE)==4)>0
}

mean(store)
## [1] 0.6316

In the snippet above we create 10,000 bootstrap samples of the integers from 1 to 100 (inclusive), sampling with replacement, and check whether the number 4 (our jth observation) occurs in each sample. If it occurs at least once, a TRUE is recorded in the store vector. The final statement shows that about 63% of the samples contain the number 4. This result will vary due to the randomness involved in sampling, but the value will be close to 63%.

On any single draw the number 4 has a 1 in 100 chance of being selected, or 0.01. Thus the probability that it occurs at least once in a bootstrap sample of size 100 is \(1-(1-0.01)^{100} \approx 0.634\). Since the repeated experiments are independent and the sample proportion is an unbiased estimate, running more experiments brings the estimate closer to roughly 63%, which confirms the formula derived above.
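As a cross-check (this line is not part of the book's code), the closed-form probability for n = 100 can be computed directly and compared with the simulated proportion above:

1 - (1 - 1/100)^100   # theoretical probability, approximately 0.634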

3

3a

K-fold cross-validation is an approach for estimating the test error of a model by splitting the available data into k folds of roughly equal size. One fold is held out as the test set and the remaining k-1 folds are used for training, which yields the error for that particular fold. The process is repeated once for each fold, and the k resulting errors are averaged.
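As an illustration, k-fold cross-validation can be written by hand; the sketch below is not from the book and uses a simulated data frame and a simple linear model purely as assumptions.

# Minimal hand-rolled k-fold cross-validation on simulated data
# (cv.glm() from the boot package automates this)
set.seed(1)
n  <- 100
df <- data.frame(x = rnorm(n))
df$y <- 2*df$x + rnorm(n)

k <- 5
folds <- sample(rep(1:k, length.out = n))            # randomly assign each row to one of k folds
cv_errors <- sapply(1:k, function(j) {
  fit  <- lm(y ~ x, data = df[folds != j, ])         # train on the other k-1 folds
  pred <- predict(fit, newdata = df[folds == j, ])   # predict the held-out fold
  mean((df$y[folds == j] - pred)^2)                  # MSE on the held-out fold
})
mean(cv_errors)                                      # the k-fold CV estimate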

3b

i

The main disadvantage of an approach such as k-fold cross-validation is the computational cost involved: the model must be fit k times, which may prove too time consuming or costly for some models. The validation set approach has a clear benefit here, since the model only needs to be fit once and tested once. However, a separate validation set may not be affordable (when only a small number of observations is available) or may be too costly to obtain in practice. In these cases k-fold cross-validation is the clear winner, as it lets us tune the model while still using most of the data for training. Furthermore, the validation set approach tends to overestimate the test error, since less data is used for training.

ii

The Leave-One-Out Cross-Validation (LOOCV) approach has a computational cost that is worse than (or, when \(k=n\), equal to) that of k-fold cross-validation, since the model must be trained and tested n times instead of k. Furthermore, the LOOCV error estimate has higher variance, because the n fitted models are highly correlated (each is trained on nearly the same observations). There is no significant advantage to using LOOCV here: 5-fold or 10-fold cross-validation greatly reduces the computational cost and suffers from neither excessively high bias nor high variance.
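With cv.glm() from the boot package the trade-off reduces to the K argument: leaving K unset performs LOOCV (n model fits), while K = 10 fits only ten models. The sketch below uses a small simulated data set as an assumption, only to show the two calls.

library(boot)
set.seed(1)
df   <- data.frame(x = rnorm(100))
df$y <- 2*df$x + rnorm(100)
fit  <- glm(y ~ x, data = df)

cv.glm(df, fit)$delta[1]           # LOOCV: K defaults to n, so 100 fits
cv.glm(df, fit, K = 10)$delta[1]   # 10-fold CV: only 10 fits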

4

To estimate the standard deviation of a prediction one can use the bootstrap. This works by repeatedly sampling, with replacement, pairs of values (X, Y) from our data. For each bootstrap sample we refit the learning method and obtain a prediction. From the resulting distribution (e.g. a histogram) of predictions we can compute the standard deviation and confidence intervals. It is important to keep in mind, however, that the bootstrap resamples with replacement, so the same observation can appear several times in a sample (as opposed to cross-validation, which keeps each fold separate).

This approach works if we assume independence among the observations. However, adjustments are possible even under dependence; for time series data, for example, one can use approaches such as the block bootstrap.
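A sketch of this idea is given below, using boot() on simulated data to estimate the standard deviation of the prediction at a particular point x0; the data, the linear model, and x0 are all illustrative assumptions rather than a specific method from the book.

library(boot)
set.seed(1)
df   <- data.frame(x = rnorm(100))
df$y <- 2*df$x + rnorm(100)
x0   <- 0.5   # the particular value of X at which we want to predict Y

# Statistic: refit the model on each bootstrap sample and predict at x0
pred.fn <- function(data, index) {
  fit <- lm(y ~ x, data = data[index, ])
  predict(fit, newdata = data.frame(x = x0))
}

boot(df, pred.fn, R = 1000)   # the std. error column estimates the SD of the prediction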

Applied Exercises

5

5a

library(ISLR)
glm.fit=glm(default~income+balance,family='binomial',data=Default)

5b

set.seed(3)
subset=sample(1:1000,500)                                  # i. done by the subset parameter.
glm.fit=glm(default~income+balance,family='binomial',      # ii. fitting the model.
            data=Default,subset=subset) 

glm.resp=predict(glm.fit,Default[-subset,],type='response') # iii. computing the posterior.
glm.pred=ifelse(glm.resp>0.5,'Yes','No')

mean(glm.pred!=Default[-subset,'default'])                 # iv. validation set error
## [1] 0.02705263

5c

DefaultValidationFn = function(formula=default~income+balance,n=1000,s=500,seed){
  set.seed(seed)
  subset=sample(n,s)                                           # i. split off a training subset
  glm.fit=glm(formula,family='binomial',                       # ii. fit the model on the training subset
            data=Default,subset=subset)

  glm.resp=predict(glm.fit,Default[-subset,],type='response')  # iii. posterior probabilities on the validation set
  glm.pred=ifelse(glm.resp>0.5,'Yes','No')                     #      classify using a 0.5 threshold

  mean(glm.pred!=Default[-subset,'default'])                   # iv. validation set error
}

for(i in 1:3) print(DefaultValidationFn(seed=i))
## [1] 0.02768421
## [1] 0.02768421
## [1] 0.02705263

5d

for(i in 1:3) print(DefaultValidationFn(formula=default~income+balance+student,seed=i))
## [1] 0.02768421
## [1] 0.02768421
## [1] 0.02705263

No reduction in the validation set error is observed when the student variable is included.

6

6a

summary(glm(default~income+balance,data=Default,family='binomial'))
## 
## Call:
## glm(formula = default ~ income + balance, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8

6b

library(boot)

boot.fn = function(data,index){
  coef(glm(default~income+balance,data=data,subset=index,family='binomial'))
}

set.seed(1)
boot.fn(Default,sample(1000,500,replace = T))
##   (Intercept)        income       balance 
## -1.213765e+01  3.243044e-05  5.627413e-03

6c

boot(Default,boot.fn,R=1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1* -1.154047e+01 -8.113059e-03 4.241721e-01
## t2*  2.080898e-05  6.002512e-08 4.584290e-06
## t3*  5.647103e-03  2.342386e-06 2.268706e-04

The standard errors estimated by the glm function are a good approximation to the bootstrap estimates.

7

7a

library(ISLR)
glm.fit=glm(Direction~Lag1+Lag2,family='binomial',data=Weekly)

7b

glm.fit=glm(Direction~Lag1+Lag2,family='binomial',data=Weekly[-1,])

7c

ifelse(predict(glm.fit,Weekly[1,],type='response')>0.5,'Up','Down') 
##    1 
## "Up"
Weekly[1,]
##   Year  Lag1  Lag2   Lag3   Lag4   Lag5   Volume Today Direction
## 1 1990 0.816 1.572 -3.936 -0.229 -3.484 0.154976 -0.27      Down

The first observation is misclassified: the model predicts "Up" while the true direction is "Down".

7d

cv=function(i){   # fit on all observations except the ith, then predict observation i
  glm.fit=glm(Direction~Lag1+Lag2,family='binomial',data=Weekly[-i,])
  pred=ifelse(predict(glm.fit,Weekly[i,],type='response')>0.5,'Up','Down')
  sum(pred == Weekly[i,'Direction'])   # 1 if classified correctly, 0 otherwise
}

res=unlist(lapply(1:nrow(Weekly),cv))

7e

mean(res)
## [1] 0.5500459

The LOOCV estimate of the test error rate is \(1 - 0.55 \approx 0.45\); the classifier does only slightly better than random guessing.

8

8a

set.seed(1)
y=rnorm(100)  # n = 100 observations
x=rnorm(100)
y=x-2*x^2+rnorm(100)   # p = 2 predictors: x and x^2

In this data set \(n = 100\) and \(p = 2\), and the model used to generate the data is \(Y = X - 2X^2 + \epsilon\).

8b

plot(x,y)

The parabolic shape of the scatter plot makes clear that there is a non-linear (quadratic) relationship between x and y.

8c

library(boot)
data=data.frame(y,x)

set.seed(34)

for(i in 1:4) 
  print(cv.glm(data = data,glm(y~poly(x,i),data = data))$delta[1])
## [1] 5.890979
## [1] 1.086596
## [1] 1.102585
## [1] 1.114772

8d

set.seed(43)

for(i in 1:4) 
  print(cv.glm(data = data,glm(y~poly(x,i),data = data))$delta[1])
## [1] 5.890979
## [1] 1.086596
## [1] 1.102585
## [1] 1.114772

The results are identical. This is because LOOCV involves no random splitting: the same n models, each leaving out exactly one observation, are fit regardless of the seed.

8e

The model with the smallest LOOCV error is the quadratic model. This is to be expected from the equation that defines Y as a second degree polynomial dependent on X.

8f

for(i in 1:4) {
  print(summary(glm(y~poly(x,i),data = data)))
}
## 
## Call:
## glm(formula = y ~ poly(x, i), data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -7.3469  -0.9275   0.8028   1.5608   4.3974  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.8277     0.2362  -7.737 9.18e-12 ***
## poly(x, i)    2.3164     2.3622   0.981    0.329    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 5.580018)
## 
##     Null deviance: 552.21  on 99  degrees of freedom
## Residual deviance: 546.84  on 98  degrees of freedom
## AIC: 459.69
## 
## Number of Fisher Scoring iterations: 2
## 
## 
## Call:
## glm(formula = y ~ poly(x, i), data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.89884  -0.53765   0.04135   0.61490   2.73607  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.8277     0.1032 -17.704   <2e-16 ***
## poly(x, i)1   2.3164     1.0324   2.244   0.0271 *  
## poly(x, i)2 -21.0586     1.0324 -20.399   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.06575)
## 
##     Null deviance: 552.21  on 99  degrees of freedom
## Residual deviance: 103.38  on 97  degrees of freedom
## AIC: 295.11
## 
## Number of Fisher Scoring iterations: 2
## 
## 
## Call:
## glm(formula = y ~ poly(x, i), data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.87250  -0.53881   0.02862   0.59383   2.74350  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.8277     0.1037 -17.621   <2e-16 ***
## poly(x, i)1   2.3164     1.0372   2.233   0.0279 *  
## poly(x, i)2 -21.0586     1.0372 -20.302   <2e-16 ***
## poly(x, i)3  -0.3048     1.0372  -0.294   0.7695    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.075883)
## 
##     Null deviance: 552.21  on 99  degrees of freedom
## Residual deviance: 103.28  on 96  degrees of freedom
## AIC: 297.02
## 
## Number of Fisher Scoring iterations: 2
## 
## 
## Call:
## glm(formula = y ~ poly(x, i), data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8914  -0.5244   0.0749   0.5932   2.7796  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.8277     0.1041 -17.549   <2e-16 ***
## poly(x, i)1   2.3164     1.0415   2.224   0.0285 *  
## poly(x, i)2 -21.0586     1.0415 -20.220   <2e-16 ***
## poly(x, i)3  -0.3048     1.0415  -0.293   0.7704    
## poly(x, i)4  -0.4926     1.0415  -0.473   0.6373    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.084654)
## 
##     Null deviance: 552.21  on 99  degrees of freedom
## Residual deviance: 103.04  on 95  degrees of freedom
## AIC: 298.78
## 
## Number of Fisher Scoring iterations: 2

The conclusions agree with the cross-validation results. In the models of degree two and higher, the linear and quadratic terms are statistically significant at the 5% level, while the cubic and quartic terms are not; this is consistent with LOOCV favouring the quadratic model.

9

9a

library(MASS)
attach(Boston)

mu=mean(medv) # estimate of population mean
print(mu)
## [1] 22.53281

9b

sterr = function(x){ sd(x)/sqrt(length(x)) }  # estimate of the standard error of the sample mean
print(sterr(medv)) 
## [1] 0.4088611

9c

library(boot)

set.seed(456)
boot(medv,function(x,index){mean(x[index])},R=1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = function(x, index) {
##     mean(x[index])
## }, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original      bias    std. error
## t1* 22.53281 -0.04298696    0.404878

The two estimates are close: 0.404878 from the bootstrap versus 0.4088611 from the formula in (b).

9d

mu-2*0.404878  
## [1] 21.72305
mu+2*0.404878  
## [1] 23.34256
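For comparison, the exercise also suggests checking this interval against a one-sample t-test; the interval reported by the call below should be very close to the bootstrap-based one above.

t.test(medv)$conf.int   # 95% confidence interval for the population mean of medv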

9e

print(median(medv))
## [1] 21.2

9f

boot(medv,function(data,index){median(data[index])},1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = function(data, index) {
##     median(data[index])
## }, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*     21.2  0.0061   0.3758178

The bootstrap gives a standard error of about 0.376 for the estimate of the median.

9g

quantile(medv,.1)
##   10% 
## 12.75

9h

boot(medv,function(data,index){quantile(medv[index],.1)},R=1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = medv, statistic = function(data, index) {
##     quantile(medv[index], 0.1)
## }, R = 1000)
## 
## 
## Bootstrap Statistics :
##     original  bias    std. error
## t1*    12.75  0.0144   0.5025537

The bootstrap gives a standard error of about 0.50 for the estimate of the tenth percentile. One can interpret this value as roughly how much the estimated tenth percentile varies from one bootstrap sample to another.