Given: \(K = 2\) classes, \(p=1\), and \(\pi_1 = \pi_2 = 0.5\), show that the decision boundary is \(x = \frac{\mu_1 + \mu_2}{2}\).
The decision boundary between class \(k\) and class \(l\) is \(\{x : \delta_k(x) = \delta_l(x)\}\).
At the boundary, the classification of a value \(x\) is ambiguous: the two classes are equally likely to be chosen.
Each class \(k\) is Gaussian with its own mean \(\mu_k\) and a common variance \(\sigma^2\), and the two class means differ. Because the priors are equal, the prior terms in the discriminants cancel, so the boundary depends only on the two means.
The decision boundary between two equally probable classes is therefore the point equidistant from the two centers, their average: \(\frac{\mu_1+\mu_2}{2}\).
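The same result follows from the standard LDA discriminant functions for \(p = 1\) with common variance \(\sigma^2\):
\[ \delta_k(x) = \frac{x\,\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log \pi_k. \]
Setting \(\delta_1(x) = \delta_2(x)\) and cancelling the equal prior terms \(\log \pi_1 = \log \pi_2\) gives
\[ \frac{x\,\mu_1}{\sigma^2} - \frac{\mu_1^2}{2\sigma^2} = \frac{x\,\mu_2}{\sigma^2} - \frac{\mu_2^2}{2\sigma^2} \;\Longrightarrow\; x(\mu_1 - \mu_2) = \frac{\mu_1^2 - \mu_2^2}{2} \;\Longrightarrow\; x = \frac{\mu_1 + \mu_2}{2}, \]
assuming \(\mu_1 \neq \mu_2\).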
The original sample contains \(n\) observations, and bootstrap sampling draws from it with replacement. The pool consists of \(n-1\) items that are not the \(j\)th observation and one item that is. So there is a \(\frac{1}{n}\) chance that the first bootstrap observation is the \(j\)th observation and a \(\frac{n-1}{n}\) chance that it is not.
Because bootstrap sampling is with replacement, the pool is unchanged for the second draw, so the probability that the second bootstrap observation is not the \(j\)th observation is also \(\frac{n-1}{n}\).
From above, the probability that any single bootstrap draw is not the \(j\)th observation is \(\frac{n-1}{n} = 1-\frac{1}{n}\). A bootstrap sample consists of \(n\) independent draws, and for the \(j\)th observation to be absent from the sample, every one of those draws must avoid it. The probability is therefore the product of \(n\) such terms, \(\left(1-\frac{1}{n}\right)^n\).
If n=5, the probability that the jth observation is in the bootstrap sample would be \(1-(1-\frac{1}{5})^5\), which is 0.67232.
1-(1-(1/5))^5
## [1] 0.67232
If n=100, the probability that the jth observation is in the bootstrap sample would be 0.6339677.
1-((1-(1/100))^100)
## [1] 0.6339677
If n=10,000, the probability that the jth observation is in the bootstrap sample would be 0.632139.
1-((1-(1/10000))^10000)
## [1] 0.632139
Based on the plot below, the probability starts at 1 when n=1, drops quickly, and converges to about 0.63 by around n=100, which matches the calculations above. At first it seems counterintuitive that a dataset of 10,000 observations has essentially the same inclusion probability as a dataset of 100, but it follows from the formula: \(\left(1-\frac{1}{n}\right)^n\) converges to \(e^{-1}\) as n grows, so the inclusion probability converges to \(1-e^{-1} \approx 0.632\); the larger pool of observations is exactly offset by the larger number of bootstrap draws.
x=seq(1,100000)
y=sapply(x,function(n){1-((1-(1/n))^n)})
plot(x,y,xlab="n",ylab="Probability", main="Probability jth observation is in the bootstrap sample", log="x", col="pink")
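As a quick numerical check of that limit (using base R only):

# The limiting inclusion probability is 1 - e^(-1), approximately 0.632
1 - exp(-1)
# Essentially the same value is reached for a very large n
1 - (1 - 1/1e6)^1e6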
Inspecting store in the console shows what the code is doing: it is a logical vector of length 10,000, and on each iteration the loop draws a sample of size 100 from the integers 1 to 100 with replacement and records whether the value 4 appears in it. Just as above, the mean of store estimates the probability that observation 4 is in the bootstrap sample, here 0.6312, or approximately 0.63.
store <- rep(NA, 10000)  # holds TRUE/FALSE for each of the 10,000 simulations
for (i in 1:10000) {
  # Draw a bootstrap sample of size 100 from 1:100 and record whether 4 appears
  store[i] <- sum(sample(1:100, replace = TRUE) == 4) > 0
}
mean(store)  # estimated probability that observation 4 is in the bootstrap sample
## [1] 0.6312
k-fold cross-validation is implemented by dividing the observations of the dataset into k groups (folds) of roughly equal size. The model is trained on the other k-1 folds, that is, on \(\frac{k-1}{k}\) of the dataset, and tested on the held-out \(\frac{1}{k}\). This procedure is repeated \(k\) times so that each fold serves as the test set exactly once, and the k test errors are averaged, which gives an estimate of the test error and a sense of its variability.
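A minimal sketch of this procedure done by hand, using hypothetical data (x_demo, y_demo) and a quadratic fit purely for illustration; the actual exercise below uses cv.glm() from the boot package:

# Manual k-fold cross-validation for a quadratic linear model (illustration only)
k <- 5
n_demo <- 100
x_demo <- rnorm(n_demo)
y_demo <- x_demo - 2 * x_demo^2 + rnorm(n_demo)
dat_demo <- data.frame(x = x_demo, y = y_demo)
folds_demo <- sample(rep(1:k, length.out = n_demo))  # randomly assign each row to a fold
cv_mse <- sapply(1:k, function(f) {
  fit <- lm(y ~ x + I(x^2), data = dat_demo[folds_demo != f, ])  # train on the other k-1 folds
  pred <- predict(fit, newdata = dat_demo[folds_demo == f, ])    # predict the held-out fold
  mean((dat_demo$y[folds_demo == f] - pred)^2)                   # test MSE for this fold
})
mean(cv_mse)  # k-fold CV estimate of the test MSE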
set.seed(1)
y=rnorm(100)
x=rnorm(100)
y=x-2*x^2+rnorm (100)
\[ Y = X - 2X^2 + \epsilon \]
plot(x, y, xlab="X Values", ylab="Y values", main="X vs.Y", pch=16, col="purple")
require(boot)
## Loading required package: boot
set.seed(1234)
frame <- data.frame(x,y)
model1_glm <- glm(y~x, data=frame)
model1_cv<-cv.glm(frame,model1_glm, K=nrow(frame))
str(model1_cv)
## List of 4
## $ call : language cv.glm(data = frame, glmfit = model1_glm, K = nrow(frame))
## $ K : num 100
## $ delta: num [1:2] 5.89 5.89
## $ seed : int [1:626] 403 624 -1394370482 -1723143049 2071488076 1659356893 -1081051142 885114163 -614367016 561456377 ...
# delta[1] is the raw cross-validation estimate of prediction error; delta[2] is the bias-adjusted estimate. With LOOCV (K = n) the two are essentially identical.
set.seed(1234)
model2_glm<-glm(y~x+I(x^2), data=frame)
model2_cv <- cv.glm(frame, model2_glm, K=nrow(frame))
str(model2_cv)
## List of 4
## $ call : language cv.glm(data = frame, glmfit = model2_glm, K = nrow(frame))
## $ K : num 100
## $ delta: num [1:2] 1.09 1.09
## $ seed : int [1:626] 403 624 -1394370482 -1723143049 2071488076 1659356893 -1081051142 885114163 -614367016 561456377 ...
set.seed(1234)
model3_glm<-glm(y~x+I(x^2)+I(x^3), data=frame)
model3_cv<-cv.glm(frame, model3_glm, K=nrow(frame))
str(model3_cv)
## List of 4
## $ call : language cv.glm(data = frame, glmfit = model3_glm, K = nrow(frame))
## $ K : num 100
## $ delta: num [1:2] 1.1 1.1
## $ seed : int [1:626] 403 624 -1394370482 -1723143049 2071488076 1659356893 -1081051142 885114163 -614367016 561456377 ...
set.seed(1234)
model4_glm<-glm(y~x+I(x^2)+I(x^3)+I(x^4), data=frame)
model4_cv<-cv.glm(frame, model4_glm, K=nrow(frame))
str(model4_cv)
## List of 4
## $ call : language cv.glm(data = frame, glmfit = model4_glm, K = nrow(frame))
## $ K : num 100
## $ delta: num [1:2] 1.11 1.11
## $ seed : int [1:626] 403 624 -1394370482 -1723143049 2071488076 1659356893 -1081051142 885114163 -614367016 561456377 ...
require(boot)
set.seed(2500)
frame <- data.frame(x,y)
model1_glm <- glm(y~x, data=frame)
model1_cv<-cv.glm(frame,model1_glm, K=nrow(frame))
str(model1_cv)
## List of 4
## $ call : language cv.glm(data = frame, glmfit = model1_glm, K = nrow(frame))
## $ K : num 100
## $ delta: num [1:2] 5.89 5.89
## $ seed : int [1:626] 403 624 1067955936 880202081 -584542290 -1058845609 1295748268 1803575741 -58598054 -1449798893 ...
# delta[1] is the raw cross-validation estimate of prediction error; delta[2] is the bias-adjusted estimate. With LOOCV (K = n) the two are essentially identical.
set.seed(2500)
model2_glm<-glm(y~x+I(x^2), data=frame)
model2_cv <- cv.glm(frame, model2_glm, K=nrow(frame))
str(model2_cv)
## List of 4
## $ call : language cv.glm(data = frame, glmfit = model2_glm, K = nrow(frame))
## $ K : num 100
## $ delta: num [1:2] 1.09 1.09
## $ seed : int [1:626] 403 624 1067955936 880202081 -584542290 -1058845609 1295748268 1803575741 -58598054 -1449798893 ...
set.seed(2500)
model3_glm<-glm(y~x+I(x^2)+I(x^3), data=frame)
model3_cv<-cv.glm(frame, model3_glm, K=nrow(frame))
str(model3_cv)
## List of 4
## $ call : language cv.glm(data = frame, glmfit = model3_glm, K = nrow(frame))
## $ K : num 100
## $ delta: num [1:2] 1.1 1.1
## $ seed : int [1:626] 403 624 1067955936 880202081 -584542290 -1058845609 1295748268 1803575741 -58598054 -1449798893 ...
set.seed(2500)
model4_glm<-glm(y~x+I(x^2)+I(x^3)+I(x^4), data=frame)
model4_cv<-cv.glm(frame, model4_glm, K=nrow(frame))
str(model4_cv)
## List of 4
## $ call : language cv.glm(data = frame, glmfit = model4_glm, K = nrow(frame))
## $ K : num 100
## $ delta: num [1:2] 1.11 1.11
## $ seed : int [1:626] 403 624 1067955936 880202081 -584542290 -1058845609 1295748268 1803575741 -58598054 -1449798893 ...
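Collecting the four LOOCV estimates (the first component of delta) from the fitted *_cv objects above makes the comparison explicit:

# LOOCV error estimates for the linear through quartic fits
c(linear    = model1_cv$delta[1],
  quadratic = model2_cv$delta[1],
  cubic     = model3_cv$delta[1],
  quartic   = model4_cv$delta[1])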
The quadratic model,
\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon, \]
has the smallest LOOCV error (delta of about 1.09, versus 5.89, 1.10, and 1.11 for the other fits). The results are also identical under both seeds, because LOOCV has no randomness: every observation is left out exactly once, so the fold assignment does not depend on the seed. The quadratic fit winning makes a lot of sense because the model used to generate the data was itself quadratic in \(X\), and the coefficient summary below agrees: the \(X\) and \(X^2\) terms are highly significant, while the \(X^3\) and \(X^4\) terms are not.
summary(model4_glm)
##
## Call:
## glm(formula = y ~ x + I(x^2) + I(x^3) + I(x^4), data = frame)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8914 -0.5244 0.0749 0.5932 2.7796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.13897 0.15973 -0.870 0.386455
## x 0.90980 0.24249 3.752 0.000302 ***
## I(x^2) -1.72802 0.28379 -6.089 2.4e-08 ***
## I(x^3) 0.00715 0.10832 0.066 0.947510
## I(x^4) -0.03807 0.08049 -0.473 0.637291
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.084654)
##
## Null deviance: 552.21 on 99 degrees of freedom
## Residual deviance: 103.04 on 95 degrees of freedom
## AIC: 298.78
##
## Number of Fisher Scoring iterations: 2
set.seed(1)
x <- rnorm(100)
epsilon <- rnorm(100)
beta <- sample(1:100, 4, replace=TRUE)
y<- beta[1]+beta[2]*x+beta[3]*x^2+beta[4]*x^3+epsilon
require(leaps)
## Loading required package: leaps
all <- data.frame(y = y, x = x)
all_regfit <- regsubsets(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = all, nvmax = 10)
regression_summary <- summary(all_regfit)
par(mfrow = c(2, 2))
plot(regression_summary$cp, xlab = "Number of Variables", ylab = "C_p", type = "l")
points(which.min(regression_summary$cp), regression_summary$cp[which.min(regression_summary$cp)], col = "red", cex = 2, pch = 20)
plot(regression_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(regression_summary$bic), regression_summary$bic[which.min(regression_summary$bic)], col = "red", cex = 2, pch = 20)
plot(regression_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(regression_summary$adjr2), regression_summary$adjr2[which.max(regression_summary$adjr2)], col = "red", cex = 2, pch = 20)
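The model sizes marked by the red points can also be read off programmatically from the regression_summary object above:

# Model sizes favored by each criterion
which.min(regression_summary$cp)     # size minimizing Mallows' C_p
which.min(regression_summary$bic)    # size minimizing BIC
which.max(regression_summary$adjr2)  # size maximizing adjusted R^2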
coef(all_regfit, 4)
## (Intercept) x I(x^2) I(x^3) I(x^5)
## 66.07200775 19.38745596 95.84575641 89.55797426 0.08072292
coef(all_regfit, 3)
## (Intercept) x I(x^2) I(x^3)
## 66.06151 18.97528 95.87621 90.01764
require(glmnet)
## Loading required package: glmnet
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
xmatrix <- model.matrix(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = all)[, -1]
lasso_crossvalidation <- cv.glmnet(xmatrix, y, alpha = 1)
plot(lasso_crossvalidation)
bestlambda <- lasso_crossvalidation$lambda.min
lasso_refit <- glmnet(xmatrix, y, alpha = 1)
predict(lasso_refit, s = bestlambda, type = "coefficients")[1:11, ]
## (Intercept) x I(x^2) I(x^3) I(x^4) I(x^5)
## 71.33947 13.17485 90.31330 89.36556 0.00000 0.00000
## I(x^6) I(x^7) I(x^8) I(x^9) I(x^10)
## 0.00000 0.00000 0.00000 0.00000 0.00000
lasso_crossvalidation$lambda.min
## [1] 7.31916
lasso_crossvalidation$lambda.1se
## [1] 8.032769
The red curve is the cross-validation error curve. The two vertical dashed lines mark two values of \(\lambda\): the \(\lambda\) giving the minimum CV error and the largest \(\lambda\) whose error is within one standard error of that minimum, with values 7.31916 and 8.032769 respectively.
With the best value of \(\lambda\) (the one giving the minimum CV error), the lasso shrinks most of the coefficients to exactly zero and leaves only the \(X\), \(X^2\), and \(X^3\) terms nonzero.
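For comparison, the coefficients at the more conservative lambda.1se can be extracted the same way, reusing lasso_refit and lasso_crossvalidation from above:

# Coefficients at lambda.1se (a slightly stronger penalty than lambda.min)
predict(lasso_refit, s = lasso_crossvalidation$lambda.1se, type = "coefficients")[1:11, ]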
beta <- sample(1:100, 7, replace=TRUE)
y<- beta[1]+beta[7]*x^7+epsilon
data.full <- data.frame(y = y, x = x)
regfit.full <- regsubsets(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full, nvmax = 10)
reg.summary <- summary(regfit.full)
par(mfrow = c(2, 2))
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "C_p", type = "l")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col = "red", cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col = "red", cex = 2, pch = 20)
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col = "red", cex = 2, pch = 20)
coef(regfit.full, 1)
## (Intercept) I(x^7)
## 47.95894 88.00077
coef(regfit.full, 2)
## (Intercept) I(x^2) I(x^7)
## 48.0704904 -0.1417084 88.0015552
coef(regfit.full, 4)
## (Intercept) x I(x^2) I(x^3) I(x^7)
## 48.0762524 0.2914016 -0.1617671 -0.2526527 88.0091338
xmat <- model.matrix(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full)[, -1]
cv.lasso <- cv.glmnet(xmat, y, alpha = 1)
bestlam <- cv.lasso$lambda.min
bestlam
## [1] 170.6371
fit.lasso <- glmnet(xmat, y, alpha = 1)
predict(fit.lasso, s = bestlam, type = "coefficients")[1:11, ]
## (Intercept) x I(x^2) I(x^3) I(x^4) I(x^5)
## 59.84085 0.00000 0.00000 0.00000 0.00000 0.00000
## I(x^6) I(x^7) I(x^8) I(x^9) I(x^10)
## 0.00000 85.18539 0.00000 0.00000 0.00000
set.seed(1)
require(MASS)
## Loading required package: MASS
library(leaps)
library(glmnet)
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
data(Boston)
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
lasso_cross_validation = cv.glmnet(x, y, type.measure = "mse")
plot(lasso_cross_validation)
coef(lasso_cross_validation)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.0894283
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.2643196
## tax .
## ptratio .
## black .
## lstat .
## medv .
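Note that coef() on a cv.glmnet object reports coefficients at lambda.1se by default, which is why the fit above keeps only rad. The two penalty values selected by cross-validation can be inspected directly:

# Penalty values chosen by cross-validation
lasso_cross_validation$lambda.min  # lambda minimizing the CV error
lasso_cross_validation$lambda.1se  # largest lambda within one standard error of the minimum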
sqrt(lasso_cross_validation$cvm[lasso_cross_validation$lambda == lasso_cross_validation$lambda.1se])
## [1] 7.418552
x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
ridge_cross_validation= cv.glmnet(x, y, type.measure = "mse", alpha = 0)
plot(ridge_cross_validation)
coef(ridge_cross_validation)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.887549636
## zn -0.002684451
## indus 0.035521503
## chas -0.242070224
## nox 2.338520336
## rm -0.166158479
## age 0.007601568
## dis -0.120012848
## rad 0.063692316
## tax 0.002813534
## ptratio 0.090098298
## black -0.003542339
## lstat 0.046779459
## medv -0.030585796
sqrt(ridge_cross_validation$cvm[ridge_cross_validation$lambda == ridge_cross_validation$lambda.1se])
## [1] 7.405204
library(pls)
pcr_boston = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr_boston)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.221 7.219 6.775 6.763 6.768 6.802
## adjCV 8.61 7.217 7.215 6.769 6.756 6.764 6.795
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.800 6.680 6.696 6.688 6.702 6.681 6.607
## adjCV 6.792 6.671 6.686 6.678 6.691 6.667 6.594
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 93.45 95.40 97.04 98.46 99.52 100.0
## crim 42.47 42.55 42.78 43.04 44.13 45.4
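A validation plot makes the choice of the number of components easier to judge; a minimal sketch using the pcr_boston fit above (validationplot() ships with the pls package):

# Cross-validated RMSEP as a function of the number of components
validationplot(pcr_boston, val.type = "RMSEP")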
Comparing the cross-validated RMSEs computed above for the two penalized fits, the ridge regression model has the smaller cross-validation error (about 7.405, versus about 7.419 for the lasso), and it performed rather well on this dataset.
The chosen ridge regression model does involve all 13 predictors: ridge regression shrinks coefficients toward zero but does not set any of them exactly to zero, in contrast to the lasso fit above, which retained only rad.
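For reference, the two cross-validated RMSEs compared above can be collected side by side, reusing the cv.glmnet objects fitted earlier:

# Cross-validated RMSE at lambda.1se for the lasso and ridge fits
c(lasso = sqrt(lasso_cross_validation$cvm[lasso_cross_validation$lambda == lasso_cross_validation$lambda.1se]),
  ridge = sqrt(ridge_cross_validation$cvm[ridge_cross_validation$lambda == ridge_cross_validation$lambda.1se]))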