Problem 1

Given: \(K = 2\) classes, \(p = 1\), \(\pi_1 = \pi_2 = 0.5\). Show that the decision boundary is \(x = \frac{\mu_1 + \mu_2}{2}\).

The decision boundary between class \(k\) and class \(l\) is \(\{x : \delta_k(x) = \delta_l(x)\}\).

On this boundary the classification of a value \(x\) is ambiguous: with equal priors, the two classes are equally likely to be chosen.

Each class \(k\) is Gaussian with a common variance \(\sigma^2\) and its own center \(\mu_k\). Because the two densities have the same shape and the priors are equal, the point at which the two posteriors are equal lies, by symmetry, exactly midway between the two centers.

Thus the decision boundary of two equally probable classes is the midpoint of the class means, \(x = \frac{\mu_1+\mu_2}{2}\).
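Explicitly, using the standard one-dimensional LDA discriminant \(\delta_k(x) = x\frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log \pi_k\) (a short sketch of the algebra):

\[ \delta_1(x) = \delta_2(x) \;\Longrightarrow\; x\frac{\mu_1}{\sigma^2} - \frac{\mu_1^2}{2\sigma^2} = x\frac{\mu_2}{\sigma^2} - \frac{\mu_2^2}{2\sigma^2}, \]

since the \(\log \pi_k\) terms cancel when \(\pi_1 = \pi_2\). Multiplying through by \(\sigma^2\) and rearranging,

\[ x(\mu_1 - \mu_2) = \frac{\mu_1^2 - \mu_2^2}{2} = \frac{(\mu_1 - \mu_2)(\mu_1 + \mu_2)}{2} \;\Longrightarrow\; x = \frac{\mu_1 + \mu_2}{2}, \]

provided \(\mu_1 \neq \mu_2\).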

Problem 3 (Chapter 5, Question 2)

Part A.

In the original sample there are \(n\) observations, and the bootstrap draws items from it with replacement. The pool therefore contains \(n-1\) items that are not observation \(j\) and one item that is \(j\). So there is a \(\frac{1}{n}\) chance that the first bootstrap observation is \(j\) and a \(\frac{n-1}{n}\) chance that it is not \(j\).

Part B.

As explained above, bootstrapping samples with replacement. Therefore, the probability that the second item is not \(j\) is also \(\frac{n-1}{n}\).

Part C.

From above, we know that the probability that a single bootstrap draw is not the \(j\)th observation from the original sample is \(\frac{n-1}{n} = 1-\frac{1}{n}\). A bootstrap sample consists of \(n\) draws, and for the \(j\)th observation to be absent from the bootstrap sample, every one of those \(n\) independent draws must avoid \(j\). The probability is therefore the product of \(n\) such probabilities, \((1-\frac{1}{n})^n\), as written out below.
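Written out, since the \(n\) draws are independent and each avoids \(j\) with probability \(1 - \frac{1}{n}\):

\[ \Pr(j \notin \text{bootstrap sample}) = \prod_{i=1}^{n}\left(1 - \frac{1}{n}\right) = \left(1 - \frac{1}{n}\right)^{n}. \]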

Part D.

If \(n=5\), the probability that the \(j\)th observation is in the bootstrap sample would be \(1-(1-\frac{1}{5})^5\), which is 0.67232.

1-(1-(1/5))^5
## [1] 0.67232

Part E.

If n=100, the probability that the jth observation is in the bootstrap sample would be 0.6339677.

1-((1-(1/100))^100)
## [1] 0.6339677

Part F.

If n=10,000, the probability that the jth observation is in the bootstrap sample would be 0.632139.

1-((1-(1/10000))^10000)
## [1] 0.632139

Part G.

Based on the plot below, we see that the probability starts high, drops quickly, and converges to about 0.63 by around n = 100, which matches the calculations above. At first it seems surprising that a data set of 10,000 observations has essentially the same probability of containing the \(j\)th observation as a data set of 100, but \((1-\frac{1}{n})^n\) converges rapidly to \(e^{-1}\), so the probability settles at \(1-e^{-1} \approx 0.632\).

x=seq(1,100000)
y=sapply(x,function(n){1-((1-(1/n))^n)})
plot(x,y,xlab="n",ylab="Probability", main="Probability jth observation is in the bootstrap sample", log="x", col="pink")
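As a quick check of the limiting value (a small base-R sketch, not part of the original assignment code): since \((1-\frac{1}{n})^n \to e^{-1}\), the probability converges to \(1 - e^{-1} \approx 0.6321\).

1 - exp(-1)                     # limiting probability, approximately 0.6321
1 - ((1 - (1/10000))^10000)     # value at n = 10,000 from Part F, already essentially at the limit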

Part H.

In the console, I inspected “store” to see what the code was doing. It is a vector of length 10,000; on each iteration the code draws a sample of 1–100 with replacement and records whether 4 appears in it. As above, the mean of this vector is the proportion of bootstrap samples containing 4, which is 0.6312, or approximately 0.63.

store=rep(NA, 10000)
for(i in 1:10000){
  store[i] = sum(sample(1:100, replace = TRUE) == 4) > 0
}
mean(store)
## [1] 0.6312

Problem 4 (Chapter 5, Question 3)

Part A.

k-fold cross-validation is implemented by dividing the observations of the data set into \(k\) groups (folds) of roughly equal size. The model is trained on the other \(\frac{k-1}{k}\) of the data set and tested on the held-out \(\frac{1}{k}\). This procedure is repeated \(k\) times, once with each fold held out, and the \(k\) test-error estimates are averaged to estimate the test error; the spread of the estimates also gives a sense of the variability of the model. A minimal sketch is shown below.
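A minimal sketch of the procedure in R (illustrative only; the data frame, model formula, and variable names here are placeholders, not the data used later in this assignment):

set.seed(1)
k <- 5
dat <- data.frame(x = rnorm(100))
dat$y <- dat$x + rnorm(100)                         # placeholder data
folds <- sample(rep(1:k, length.out = nrow(dat)))   # randomly assign each row to one of k folds
cv_errors <- sapply(1:k, function(fold) {
  fit <- lm(y ~ x, data = dat[folds != fold, ])     # train on the other k-1 folds
  pred <- predict(fit, newdata = dat[folds == fold, ])
  mean((dat$y[folds == fold] - pred)^2)             # test MSE on the held-out fold
})
mean(cv_errors)                                     # k-fold CV estimate of the test error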

Part B.

  1. Compared to the validation set approach (VSA), an advantage of k-fold cross-validation is that much more of the data set is used for both fitting and validation. In the VSA, the estimated test error depends heavily on which observations end up in the training and validation sets, and because only part of the data is used to fit the model, the validation-set error tends to overestimate the test error of the model fit. With k-fold c.v. we can also see how the model performs across the \(k\) folds. A disadvantage of k-fold c.v. compared to the VSA is that it is somewhat more complicated and computationally expensive to implement.
  2. LOOCV is the special case of k-fold c.v. with \(k = n\). The estimates from the \(n\) folds in LOOCV are highly correlated, so their average can have high variance; lower variance is an advantage k-fold c.v. has over LOOCV. A disadvantage of k-fold c.v. is that each training set is only \(\frac{k-1}{k}\) as large as the original training set, so the estimates of prediction error are typically biased upward. From class, we know that a happy medium such as \(k = 5\) is a good compromise (see the short example after this list).
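For example (a hedged sketch assuming the ISLR package's Auto data set is available; not one of the assigned problems), cv.glm can produce both estimates directly:

library(ISLR)
library(boot)
auto_fit <- glm(mpg ~ horsepower, data = Auto)
cv.glm(Auto, auto_fit, K = 5)$delta[1]   # 5-fold CV estimate of the test MSE
cv.glm(Auto, auto_fit)$delta[1]          # LOOCV estimate (K defaults to n); deterministic but slower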

Problem 5 (Chapter 5, Question 8)

Part A.

  • In this data set, n= 100 and p=1.
set.seed(1)
y=rnorm(100)
x=rnorm(100)
y=x-2*x^2+rnorm (100) 
  • The model is as described below based on the code above:

\[ Y = X - 2X^2 + \epsilon_i \]

Part B.

  • The scatterplot is below. It is not random; there is a clear quadratic relationship, which makes sense because the data were generated from a quadratic model of the form \(y = \beta_0 + \beta_1 x + \beta_2 x^2\).
plot(x, y, xlab="X Values", ylab="Y values", main="X vs. Y", pch=16, col="purple")
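  • As an optional check (a one-line addition, not part of the original code), the true mean function \(f(x) = x - 2x^2\) from Part A can be overlaid on the scatterplot:

curve(x - 2 * x^2, add = TRUE, lwd = 2)   # true quadratic mean function used to generate the data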

Part C.

  • The LOOCV error estimate is computed for each model below. The values are:
  • i.) 5.89
  • ii.) 1.09
  • iii.) 1.1
  • iv.) 1.11
require(boot)
## Loading required package: boot
set.seed(1234)
frame <- data.frame(x,y)
model1_glm <- glm(y~x, data=frame)
model1_cv<-cv.glm(frame,model1_glm, K=nrow(frame))
str(model1_cv) 
## List of 4
##  $ call : language cv.glm(data = frame, glmfit = model1_glm, K = nrow(frame))
##  $ K    : num 100
##  $ delta: num [1:2] 5.89 5.89
##  $ seed : int [1:626] 403 624 -1394370482 -1723143049 2071488076 1659356893 -1081051142 885114163 -614367016 561456377 ...
# delta: the first component is the raw cross-validation estimate of prediction error and the second is the bias-adjusted estimate; with LOOCV the two are essentially identical
set.seed(1234)
model2_glm<-glm(y~x+I(x^2), data=frame)
model2_cv <- cv.glm(frame, model2_glm, K=nrow(frame))
str(model2_cv)
## List of 4
##  $ call : language cv.glm(data = frame, glmfit = model2_glm, K = nrow(frame))
##  $ K    : num 100
##  $ delta: num [1:2] 1.09 1.09
##  $ seed : int [1:626] 403 624 -1394370482 -1723143049 2071488076 1659356893 -1081051142 885114163 -614367016 561456377 ...
set.seed(1234)
model3_glm<-glm(y~x+I(x^2)+I(x^3), data=frame)
model3_cv<-cv.glm(frame, model3_glm, K=nrow(frame))
str(model3_cv)
## List of 4
##  $ call : language cv.glm(data = frame, glmfit = model3_glm, K = nrow(frame))
##  $ K    : num 100
##  $ delta: num [1:2] 1.1 1.1
##  $ seed : int [1:626] 403 624 -1394370482 -1723143049 2071488076 1659356893 -1081051142 885114163 -614367016 561456377 ...
set.seed(1234)
model4_glm<-glm(y~x+I(x^2)+I(x^3)+I(x^4), data=frame)
model4_cv<-cv.glm(frame, model4_glm, K=nrow(frame))
str(model4_cv)
## List of 4
##  $ call : language cv.glm(data = frame, glmfit = model4_glm, K = nrow(frame))
##  $ K    : num 100
##  $ delta: num [1:2] 1.11 1.11
##  $ seed : int [1:626] 403 624 -1394370482 -1723143049 2071488076 1659356893 -1081051142 885114163 -614367016 561456377 ...

Part D.

  • Here we set the seed to 2500 but run the same code. The LOOCV values stay exactly the same regardless of the seed, because LOOCV involves no random splitting: each observation is left out exactly once, so the procedure is deterministic.
require(boot)
set.seed(2500)
frame <- data.frame(x,y)
model1_glm <- glm(y~x, data=frame)
model1_cv<-cv.glm(frame,model1_glm, K=nrow(frame))
str(model1_cv) 
## List of 4
##  $ call : language cv.glm(data = frame, glmfit = model1_glm, K = nrow(frame))
##  $ K    : num 100
##  $ delta: num [1:2] 5.89 5.89
##  $ seed : int [1:626] 403 624 1067955936 880202081 -584542290 -1058845609 1295748268 1803575741 -58598054 -1449798893 ...
# delta: the first component is the raw cross-validation estimate of prediction error and the second is the bias-adjusted estimate; with LOOCV the two are essentially identical
set.seed(2500)
model2_glm<-glm(y~x+I(x^2), data=frame)
model2_cv <- cv.glm(frame, model2_glm, K=nrow(frame))
str(model2_cv)
## List of 4
##  $ call : language cv.glm(data = frame, glmfit = model2_glm, K = nrow(frame))
##  $ K    : num 100
##  $ delta: num [1:2] 1.09 1.09
##  $ seed : int [1:626] 403 624 1067955936 880202081 -584542290 -1058845609 1295748268 1803575741 -58598054 -1449798893 ...
set.seed(2500)
model3_glm<-glm(y~x+I(x^2)+I(x^3), data=frame)
model3_cv<-cv.glm(frame, model3_glm, K=nrow(frame))
str(model3_cv)
## List of 4
##  $ call : language cv.glm(data = frame, glmfit = model3_glm, K = nrow(frame))
##  $ K    : num 100
##  $ delta: num [1:2] 1.1 1.1
##  $ seed : int [1:626] 403 624 1067955936 880202081 -584542290 -1058845609 1295748268 1803575741 -58598054 -1449798893 ...
set.seed(2500)
model4_glm<-glm(y~x+I(x^2)+I(x^3)+I(x^4), data=frame)
model4_cv<-cv.glm(frame, model4_glm, K=nrow(frame))
str(model4_cv)
## List of 4
##  $ call : language cv.glm(data = frame, glmfit = model4_glm, K = nrow(frame))
##  $ K    : num 100
##  $ delta: num [1:2] 1.11 1.11
##  $ seed : int [1:626] 403 624 1067955936 880202081 -584542290 -1058845609 1295748268 1803575741 -58598054 -1449798893 ...
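  • Because these are ordinary least-squares fits, the LOOCV error can also be computed without refitting, via the leverage-based shortcut \(\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2\). A short check for the linear model (it should reproduce the 5.89 reported above):

h <- hatvalues(model1_glm)                            # leverage values from the linear fit
mean(((frame$y - fitted(model1_glm)) / (1 - h))^2)    # leverage-based LOOCV; should match delta[1] above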

Part E.

  • The best model, with the lowest LOOCV error in Part C, was model2, represented as

\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon_i\]

This makes sense because the data were generated from a model of essentially this quadratic form.

Part F.

  • Looking at the summary of model4, which includes all of the polynomial terms, we can examine the p-value associated with each term. \(X\) and \(X^2\) have p-values well below 0.05, suggesting they are statistically significant. \(X^3\) and \(X^4\) have p-values far above 0.05, so there is no evidence that those terms are needed. These results agree with the cross-validation results, which favored the quadratic model.
summary(model4_glm)
## 
## Call:
## glm(formula = y ~ x + I(x^2) + I(x^3) + I(x^4), data = frame)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8914  -0.5244   0.0749   0.5932   2.7796  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.13897    0.15973  -0.870 0.386455    
## x            0.90980    0.24249   3.752 0.000302 ***
## I(x^2)      -1.72802    0.28379  -6.089  2.4e-08 ***
## I(x^3)       0.00715    0.10832   0.066 0.947510    
## I(x^4)      -0.03807    0.08049  -0.473 0.637291    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.084654)
## 
##     Null deviance: 552.21  on 99  degrees of freedom
## Residual deviance: 103.04  on 95  degrees of freedom
## AIC: 298.78
## 
## Number of Fisher Scoring iterations: 2

Problem 6 (Chapter 6, Question 8, Skip Part D.)

Part A.

set.seed(1)
x <- rnorm(100)
epsilon <- rnorm(100)

Part B.

beta <- sample(1:100, 4, replace=TRUE)
y<- beta[1]+beta[2]*x+beta[3]*x^2+beta[4]*x^3+epsilon

Part C.

  • Based on all three criteria below, the best model according to \(C_p\) has 4 predictors, according to \(BIC\) has 3 predictors, and according to adjusted \(R^2\) has 4 predictors; the selected size is marked by the red dot in each plot below.
require(leaps)
## Loading required package: leaps
all <- data.frame(y = y, x = x)
all_regfit <- regsubsets(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = all, nvmax = 10)
regression_summary <- summary(all_regfit)
par(mfrow = c(2, 2))
plot(regression_summary$cp, xlab = "Number of Variables", ylab = "C_p", type = "l")
points(which.min(regression_summary$cp), regression_summary$cp[which.min(regression_summary$cp)], col = "red", cex = 2, pch = 20)
plot(regression_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(regression_summary$bic), regression_summary$bic[which.min(regression_summary$bic)], col = "red", cex = 2, pch = 20)
plot(regression_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(regression_summary$adjr2), regression_summary$adjr2[which.max(regression_summary$adjr2)], col = "red", cex = 2, pch = 20)
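  • The selected model sizes can also be read off numerically rather than from the plots (a small additional check using the summary object above):

which.min(regression_summary$cp)      # number of predictors minimizing C_p
which.min(regression_summary$bic)     # number of predictors minimizing BIC
which.max(regression_summary$adjr2)   # number of predictors maximizing adjusted R^2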

  • The coefficients are found below.
coef(all_regfit, 4)
## (Intercept)           x      I(x^2)      I(x^3)      I(x^5) 
## 66.07200775 19.38745596 95.84575641 89.55797426  0.08072292
coef(all_regfit, 3)
## (Intercept)           x      I(x^2)      I(x^3) 
##    66.06151    18.97528    95.87621    90.01764

Part E.

require(glmnet)
## Loading required package: glmnet
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
xmatrix <- model.matrix(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = all)[, -1]
lasso_crossvalidation <- cv.glmnet(xmatrix, y, alpha = 1)
plot(lasso_crossvalidation)

bestlambda <- lasso_crossvalidation$lambda.min
  • Now, re-fit the lasso model using the best \(\lambda\) value of 7.31916:
lasso_refit <- glmnet(xmatrix, y, alpha = 1)
predict(lasso_refit, s = bestlambda, type = "coefficients")[1:11, ]
## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5) 
##    71.33947    13.17485    90.31330    89.36556     0.00000     0.00000 
##      I(x^6)      I(x^7)      I(x^8)      I(x^9)     I(x^10) 
##     0.00000     0.00000     0.00000     0.00000     0.00000
lasso_crossvalidation$lambda.min
## [1] 7.31916
lasso_crossvalidation$lambda.1se
## [1] 8.032769
  • Based on this, the lasso picks \(X\), \(X^2\), and \(X^3\) as the predictors for the model.

The red points trace the cross-validation curve. The two \(\lambda\) values, the one giving the minimum CV error and the largest one within one standard error of that minimum, are marked by the vertical dashed lines; their values are 7.31916 and 8.032769, respectively.

With the value of \(\lambda\) giving the minimum CV error, the lasso shrinks most of the predictors to zero and leaves only \(X\), \(X^2\), and \(X^3\) nonzero.
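  • For comparison (not required by the problem; output omitted here), the coefficients at the one-standard-error value of \(\lambda\) can be extracted the same way:

predict(lasso_refit, s = lasso_crossvalidation$lambda.1se, type = "coefficients")[1:11, ]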

Part F.

beta <- sample(1:100, 7, replace=TRUE)
y<- beta[1]+beta[7]*x^7+epsilon
data.full <- data.frame(y = y, x = x)
regfit.full <- regsubsets(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full, nvmax = 10)
reg.summary <- summary(regfit.full)
par(mfrow = c(2, 2))
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "C_p", type = "l")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col = "red", cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col = "red", cex = 2, pch = 20)
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col = "red", cex = 2, pch = 20)

  • Based on all three criteria, the best model according to \(C_p\) has 2 predictors, according to \(BIC\) has 1 predictor, and according to adjusted \(R^2\) has 4 predictors; the selected size is marked by the red dot in each plot above. The corresponding coefficients are below.
coef(regfit.full, 1)
## (Intercept)      I(x^7) 
##    47.95894    88.00077
coef(regfit.full, 2)
## (Intercept)      I(x^2)      I(x^7) 
##  48.0704904  -0.1417084  88.0015552
coef(regfit.full, 4)
## (Intercept)           x      I(x^2)      I(x^3)      I(x^7) 
##  48.0762524   0.2914016  -0.1617671  -0.2526527  88.0091338
xmat <- model.matrix(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full)[, -1]
cv.lasso <- cv.glmnet(xmat, y, alpha = 1)
bestlam <- cv.lasso$lambda.min
bestlam
## [1] 170.6371
fit.lasso <- glmnet(xmat, y, alpha = 1)
predict(fit.lasso, s = bestlam, type = "coefficients")[1:11, ]
## (Intercept)           x      I(x^2)      I(x^3)      I(x^4)      I(x^5) 
##    59.84085     0.00000     0.00000     0.00000     0.00000     0.00000 
##      I(x^6)      I(x^7)      I(x^8)      I(x^9)     I(x^10) 
##     0.00000    85.18539     0.00000     0.00000     0.00000

Problem 7 (Chapter 6, Question 11)

Part A.

set.seed(1)
require(MASS)
## Loading required package: MASS
library(leaps)
library(glmnet)
library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
data(Boston)

Lasso

x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
lasso_cross_validation = cv.glmnet(x, y, type.measure = "mse")  # alpha defaults to 1, i.e., the lasso
plot(lasso_cross_validation)

coef(lasso_cross_validation)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept) 1.0894283
## zn          .        
## indus       .        
## chas        .        
## nox         .        
## rm          .        
## age         .        
## dis         .        
## rad         0.2643196
## tax         .        
## ptratio     .        
## black       .        
## lstat       .        
## medv        .
sqrt(lasso_cross_validation$cvm[lasso_cross_validation$lambda == lasso_cross_validation$lambda.1se])
## [1] 7.418552

Ridge Regression

x = model.matrix(crim ~ . - 1, data = Boston)
y = Boston$crim
ridge_cross_validation= cv.glmnet(x, y, type.measure = "mse", alpha = 0)
plot(ridge_cross_validation)

coef(ridge_cross_validation)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  0.887549636
## zn          -0.002684451
## indus        0.035521503
## chas        -0.242070224
## nox          2.338520336
## rm          -0.166158479
## age          0.007601568
## dis         -0.120012848
## rad          0.063692316
## tax          0.002813534
## ptratio      0.090098298
## black       -0.003542339
## lstat        0.046779459
## medv        -0.030585796
sqrt(ridge_cross_validation$cvm[ridge_cross_validation$lambda == ridge_cross_validation$lambda.1se])
## [1] 7.405204

PCR

library(pls)
pcr_boston = pcr(crim ~ ., data = Boston, scale = TRUE, validation = "CV")
summary(pcr_boston)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.221    7.219    6.775    6.763    6.768    6.802
## adjCV         8.61    7.217    7.215    6.769    6.756    6.764    6.795
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.800    6.680    6.696     6.688     6.702     6.681     6.607
## adjCV    6.792    6.671    6.686     6.678     6.691     6.667     6.594
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## X       93.45    95.40     97.04     98.46     99.52     100.0
## crim    42.47    42.55     42.78     43.04     44.13      45.4
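To visualize how the cross-validated error changes with the number of components (a one-line sketch using a standard pls helper; the plot is not shown here):

validationplot(pcr_boston, val.type = "MSEP")   # CV error versus number of principal components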

Part B.

As computed above, of the lasso and ridge fits (each evaluated at its one-standard-error \(\lambda\)), the ridge regression model has the smaller cross-validated RMSE (7.405 vs. 7.419), and it performed rather well on this data set.

Part C.

Yes; the model chosen by ridge regression involves all 13 features. Ridge regression shrinks coefficients toward zero but does not set any of them exactly to zero, so it performs no variable selection and keeps every predictor.