Notebook prepared by Everton Lima
\(Var(\alpha X + (1-\alpha)Y) = E[(\alpha X + (1-\alpha)Y)^2]-(E[\alpha X + (1-\alpha)Y])^2\)
Where,
\(E[(\alpha X + (1-\alpha)Y)^2] = E[\alpha^2 X^2 + 2\alpha(1-\alpha)XY+(1-\alpha)^2 Y^2]\) \(= \alpha^2 E[X^2] + 2\alpha(1-\alpha)E[XY]+(1-\alpha)^2E[Y^2]\)
and,
\(E[\alpha X + (1-\alpha)Y]^2 = (\alpha E[X] + (1-\alpha)E[Y])^2\) \(=\alpha^2 E[X]^2 + 2\alpha(1-\alpha)E[X]E[Y] + (1-\alpha)^2 E[Y]^2\)
thus,
\(\alpha^2 E[X^2] + 2\alpha(1-\alpha)E[XY]+(1-\alpha)^2E[Y^2] - (\alpha^2 E[X]^2 + 2\alpha(1-\alpha)E[X]E[Y] + (1-\alpha)^2 E[Y]^2)\)
\(= \alpha^2(E[X^2] - E[X]^2) + 2\alpha(1-\alpha)(E[XY] - E[X]E[Y]) + (1-\alpha)^2(E[Y^2]-E[Y]^2)\) \(= \alpha^2\sigma_X^2 + 2\alpha(1-\alpha)\sigma_{XY} + (1-\alpha)^2\sigma_Y^2\).
Taking the derivative with respect to \(\alpha\) and setting it to zero yields \(2\alpha\sigma_X^2 + 2(1-2\alpha)\sigma_{XY} - 2(1-\alpha)\sigma_Y^2 = 0\),
which rearranges to \(2\alpha (\sigma_X^2 +\sigma_Y^2-2\sigma_{XY}) = 2\sigma_Y^2 - 2\sigma_{XY}\).
Therefore, \(\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2-2\sigma_{XY}}\)
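As a quick numerical sanity check of this result (not part of the exercise), we can simulate a correlated X and Y and compare the closed-form \(\alpha\) with a brute-force grid search; the distributions and names below are arbitrary choices for the sketch.
set.seed(1)
X = rnorm(1e5, sd = 1.5) # returns of asset X
Y = 0.3*X + rnorm(1e5) # returns of asset Y, correlated with X
alpha.closed = (var(Y)-cov(X,Y)) / (var(X)+var(Y)-2*cov(X,Y)) # the formula derived above
alphas = seq(0, 1, by = 0.001)
port.var = sapply(alphas, function(a) var(a*X + (1-a)*Y)) # variance of each candidate portfolio
c(alpha.closed, alphas[which.min(port.var)]) # the two values should agree closely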
The probability that the first bootstrap observation is not the jth observation from the original sample is \(\frac{n-1}{n}\). Each of the \(n\) observations is equally likely to be drawn, namely with probability \(\frac{1}{n}\), so by the complement rule the probability of not drawing the jth observation is \(1-\frac{1}{n} = \frac{n-1}{n}\).
Same as in the previous question: the probability is again \(1-\frac{1}{n}\). Each draw does not depend on the draws already made, since the bootstrap performs repeated sampling with replacement.
The probability of not drawing the jth observation on any single draw is \(1-\frac{1}{n}\), as stated before, and because the bootstrap samples with replacement the draws are independent of one another. Therefore, the probability of not drawing the jth observation in any of the \(n\) draws is \((1-\frac{1}{n}) \cdot (1-\frac{1}{n}) \cdots (1-\frac{1}{n}) = (1-\frac{1}{n})^n\).
prob_jth = function(n){ (1/n)^n } # probability that the jth observation is drawn on every one of the n draws
x = 1:10000
y = sapply(x,prob_jth)
plot(x,y,type='l')
The function rapidly approaches 0: the probability that a given observation is drawn on every one of the \(n\) draws decreases quickly as \(n\) grows.
# This code is on page 198 of ISLR
store=rep(NA,10000)
for(i in 1:10000){
store[i]=sum(sample(1:100,rep=TRUE)==4)>0
}
mean(store)
## [1] 0.6316
In the snippet above we draw 10,000 bootstrap samples of the integers from 1 to 100 (inclusive), sampling with replacement, and check whether the number 4 (our jth observation) occurs in each sample. If it occurs at least once, a TRUE is recorded in the store vector. The final statement shows that about 63% of the samples contain the number 4. This result will vary slightly due to the randomness involved in sampling, but the value will stay close to 63%.
The number 4 has a 1 in 100 chance of being drawn on any single draw, or 0.01, so the probability of it occurring at least once in a bootstrap sample of size 100 is \(1-(1-\frac{1}{100})^{100} \approx 0.634\). Since the repeated experiments are independent and the sample proportion is an unbiased estimate, running more experiments drives the estimate toward roughly 63%, which confirms the formula derived above.
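As a quick arithmetic check, the closed-form probability for \(n = 100\) can be evaluated directly:
1-(1-1/100)^100 # approximately 0.634, in line with the simulation above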
K-fold cross-validation validates a model by splitting the available data into k roughly equal folds. One fold is held out as the test set and the remaining k-1 folds are used for training, which gives the test error associated with that particular fold. The process is repeated once for each fold, and the k resulting errors are averaged.
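A minimal sketch of these mechanics, written by hand for \(k = 5\); the simulated data, the linear model, and the variable names are illustrative and not part of the exercises.
set.seed(1)
n = 100; k = 5
sim = data.frame(x = rnorm(n))
sim$y = 2*sim$x + rnorm(n)
folds = sample(rep(1:k, length.out = n)) # randomly assign each observation to one of k folds
cv.errors = sapply(1:k, function(j) {
  fit = glm(y ~ x, data = sim[folds != j, ]) # train on the other k-1 folds
  mean((sim$y[folds == j] - predict(fit, sim[folds == j, ]))^2) # test MSE on the held-out fold
})
mean(cv.errors) # the k-fold CV estimate of the test error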
The main disadvantage of k-fold cross-validation is its computational cost, which may prove too time consuming or expensive for some models. The validation set approach has a clear benefit in this respect, since the model only needs to be fit once and tested once. However, a large enough validation set may not be available (when only a small number of observations exist) or may be too costly to obtain in practice. In those cases k-fold cross-validation is the clear winner, since it makes better use of the available data for tuning the model. Furthermore, the validation set approach tends to overestimate the test error, since less data is used for training.
Leave-One-Out Cross-Validation (LOOCV) has a computational cost at least as high as k-fold cross-validation (it is the special case \(k=n\)), since the model must be trained and tested n times instead of k. Furthermore, LOOCV tends to have higher variance, because the n fitted models are highly correlated: each pair of training sets differs by only one observation. There is usually no significant advantage to LOOCV, since 5-fold or 10-fold cross-validation greatly reduces the computational cost while suffering from neither excessive bias nor excessive variance. With the boot package the two approaches differ only in the K argument, as sketched below.
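A sketch of that comparison with cv.glm from the boot package, again on illustrative simulated data.
library(boot)
set.seed(1)
sim2 = data.frame(x = rnorm(100))
sim2$y = 2*sim2$x + rnorm(100)
fit = glm(y ~ x, data = sim2)
cv.glm(sim2, fit)$delta[1] # LOOCV: K defaults to n, so n model fits
cv.glm(sim2, fit, K = 10)$delta[1] # 10-fold CV: only 10 fits, far cheaper for large n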
In order to estimate the standard deviation of the prediction, one can use the bootstrap. This works by repeatedly sampling, with replacement, n observations (pairs of X and Y values) from our data. For each bootstrap sample we refit the learning method and obtain a prediction at the value of X of interest. Finally, the spread of these bootstrapped predictions, for example their standard deviation or a histogram from which confidence intervals can be read off, estimates the variability of the prediction. It is important to keep in mind that the bootstrap samples with replacement, so an observation can appear several times in a given sample (unlike cross-validation, which keeps the folds separate).
This approach works if we assume independence among the observations. However, some adjustments are possible even under dependency, for example the block bootstrap for time-series data. A minimal sketch of the basic procedure is given below.
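A sketch of this bootstrap procedure using the boot package; the linear model, the prediction point x0, and the simulated data are all illustrative assumptions, not part of the exercise.
library(boot)
set.seed(1)
sim3 = data.frame(x = rnorm(100))
sim3$y = 1 + 2*sim3$x + rnorm(100)
x0 = data.frame(x = 0.5) # the point at which we want the prediction
pred.fn = function(data, index) {
  fit = lm(y ~ x, data = data, subset = index) # refit the model on the bootstrap sample
  predict(fit, x0) # prediction at x0 from this bootstrap fit
}
b = boot(sim3, pred.fn, R = 1000)
sd(b$t) # bootstrap estimate of the standard deviation of the prediction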
set.seed(3)
subset=sample(1:1000,500) # i. done by the subset parameter.
glm.fit=glm(default~income+balance,family='binomial', # ii. fitting the model.
data=Default,subset=subset)
glm.resp=predict(glm.fit,Default[-subset,],type='response') # iii. computing the posterior.
glm.pred=ifelse(glm.resp>0.5,'Yes','No')
mean(glm.pred!=Default[-subset,'default']) # iv. validation set error
## [1] 0.02705263
DefaultValidationFn = function(formula=default~income+balance,n=1000,s=500,seed){
  set.seed(seed)
  subset=sample(n,s)
  glm.fit=glm(formula,family='binomial',
              data=Default,subset=subset)
  glm.resp=predict(glm.fit,Default[-subset,],type='response')
  glm.pred=ifelse(glm.resp>0.5,'Yes','No') # classify with a 0.5 threshold
  mean(glm.pred!=Default[-subset,'default']) # validation set error
}
for(i in 1:3) print(DefaultValidationFn(seed=i))
## [1] 0.02768421
## [1] 0.02768421
## [1] 0.02705263
summary(glm(default~income+balance,data=Default,family='binomial'))
##
## Call:
## glm(formula = default ~ income + balance, family = "binomial",
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4725 -0.1444 -0.0574 -0.0211 3.7245
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.154e+01 4.348e-01 -26.545 < 2e-16 ***
## income 2.081e-05 4.985e-06 4.174 2.99e-05 ***
## balance 5.647e-03 2.274e-04 24.836 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1579.0 on 9997 degrees of freedom
## AIC: 1585
##
## Number of Fisher Scoring iterations: 8
library(boot)
boot.fn = function(data,index){
coef(glm(default~income+balance,data=data,subset=index,family='binomial'))
}
set.seed(1)
boot.fn(Default,sample(1000,500,replace = T))
## (Intercept) income balance
## -1.213765e+01 3.243044e-05 5.627413e-03
boot(Default,boot.fn,R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Default, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -1.154047e+01 -8.113059e-03 4.241721e-01
## t2* 2.080898e-05 6.002512e-08 4.584290e-06
## t3* 5.647103e-03 2.342386e-06 2.268706e-04
The standard errors estimated by the glm function are a good approximation to the bootstrap estimates.
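The glm.fit used below has to be the Weekly model for this exercise rather than the Default model fitted above; the chunk that defines it is not shown, but a sketch of the assumed fit (logistic regression of Direction on Lag1 and Lag2, trained on all observations except the first) would be:
library(ISLR)
# Assumed fit: the original chunk defining this model is not shown above.
glm.fit = glm(Direction~Lag1+Lag2, data=Weekly[-1,], family='binomial')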
ifelse(predict(glm.fit,Weekly[1,],type='response')>0.5,'Up','Down')
## 1
## "Up"
Weekly[1,]
## Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
## 1 1990 0.816 1.572 -3.936 -0.229 -3.484 0.154976 -0.27 Down
The observation is misclassified: the model predicts 'Up' while the true Direction is 'Down'.
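The x and y plotted below come from a simulation chunk that is not shown here; a sketch of the assumed data generation (following the usual form of this exercise, with an arbitrary seed) would be:
set.seed(1) # illustrative seed; the original notebook's seed is not shown
x = rnorm(100)
y = x - 2*x^2 + rnorm(100) # n = 100 observations, quadratic in x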
plot(x,y)
The parabolic shape of the scatterplot makes it clear that the relationship between x and y is non-linear.
library(boot)
data=data.frame(y,x)
set.seed(34)
for(i in 1:4)
print(cv.glm(data = data,glm(y~poly(x,i),data = data))$delta[1])
## [1] 5.890979
## [1] 1.086596
## [1] 1.102585
## [1] 1.114772
set.seed(43)
for(i in 1:4)
print(cv.glm(data = data,glm(y~poly(x,i),data = data))$delta[1])
## [1] 5.890979
## [1] 1.086596
## [1] 1.102585
## [1] 1.114772
The results are identical. This is because LOOCV involves no random splitting: the n folds are completely determined by the data, so each model is trained on exactly the same observations regardless of the seed.
The model with the smallest LOOCV error is the quadratic one. This is to be expected, since the data-generating equation defines y as a second-degree polynomial in x.
for(i in 1:4) {
print(summary(glm(y~poly(x,i),data = data)))
}
##
## Call:
## glm(formula = y ~ poly(x, i), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7.3469 -0.9275 0.8028 1.5608 4.3974
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.8277 0.2362 -7.737 9.18e-12 ***
## poly(x, i) 2.3164 2.3622 0.981 0.329
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 5.580018)
##
## Null deviance: 552.21 on 99 degrees of freedom
## Residual deviance: 546.84 on 98 degrees of freedom
## AIC: 459.69
##
## Number of Fisher Scoring iterations: 2
##
##
## Call:
## glm(formula = y ~ poly(x, i), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.89884 -0.53765 0.04135 0.61490 2.73607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.8277 0.1032 -17.704 <2e-16 ***
## poly(x, i)1 2.3164 1.0324 2.244 0.0271 *
## poly(x, i)2 -21.0586 1.0324 -20.399 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.06575)
##
## Null deviance: 552.21 on 99 degrees of freedom
## Residual deviance: 103.38 on 97 degrees of freedom
## AIC: 295.11
##
## Number of Fisher Scoring iterations: 2
##
##
## Call:
## glm(formula = y ~ poly(x, i), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.87250 -0.53881 0.02862 0.59383 2.74350
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.8277 0.1037 -17.621 <2e-16 ***
## poly(x, i)1 2.3164 1.0372 2.233 0.0279 *
## poly(x, i)2 -21.0586 1.0372 -20.302 <2e-16 ***
## poly(x, i)3 -0.3048 1.0372 -0.294 0.7695
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.075883)
##
## Null deviance: 552.21 on 99 degrees of freedom
## Residual deviance: 103.28 on 96 degrees of freedom
## AIC: 297.02
##
## Number of Fisher Scoring iterations: 2
##
##
## Call:
## glm(formula = y ~ poly(x, i), data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8914 -0.5244 0.0749 0.5932 2.7796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.8277 0.1041 -17.549 <2e-16 ***
## poly(x, i)1 2.3164 1.0415 2.224 0.0285 *
## poly(x, i)2 -21.0586 1.0415 -20.220 <2e-16 ***
## poly(x, i)3 -0.3048 1.0415 -0.293 0.7704
## poly(x, i)4 -0.4926 1.0415 -0.473 0.6373
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.084654)
##
## Null deviance: 552.21 on 99 degrees of freedom
## Residual deviance: 103.04 on 95 degrees of freedom
## AIC: 298.78
##
## Number of Fisher Scoring iterations: 2
At the 5% significance level the fitted models broadly agree with the cross-validation results: the quadratic term is highly significant in every model that includes it, the linear term is also significant (p ≈ 0.03) once the quadratic term is present, and the cubic and quartic terms are not significant in any model. This is consistent with the LOOCV errors, which favour the quadratic model.
library(MASS)
attach(Boston)
mu=mean(medv) # estimate of population mean
print(mu)
## [1] 22.53281
sterr = function(x,index=seq_along(x)){ sd(x[index])/sqrt(length(x[index])) } # estimate of the standard error of the sample mean
print(sterr(medv))
## [1] 0.4088611
library(boot)
set.seed(456)
boot(medv,function(x,index){mean(x[index])},R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = function(x, index) {
## mean(x[index])
## }, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 22.53281 -0.04298696 0.404878
The two estimates are close: 0.404878 from the bootstrap versus 0.4088611 from the analytical formula.
boot(medv,function(data,index){median(data[index])},1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = function(data, index) {
## median(data[index])
## }, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 21.2 0.0061 0.3758178
The estimated standard error of the sample median is 0.3758178.
boot(medv,function(data,index){quantile(medv[index],.1)},R=1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = medv, statistic = function(data, index) {
## quantile(medv[index], 0.1)
## }, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 12.75 0.0144 0.5025537
The estimated standard error of the tenth percentile is 0.5025537, which can be interpreted as the typical amount by which the estimate of the tenth percentile varies from one bootstrap sample to another.