For parts (a) through (c), indicate which of i. through iv. is correct and justify your answer.
The lasso, relative to least squares, is:
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
When the least squares estimate has very high variance, the lasso can reduce that variance at the cost of an increase in bias. Prediction accuracy improves when the decrease in variance outweighs the increase in bias.
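The reasoning for all three parts follows from the bias-variance decomposition of the expected test MSE, where accuracy improves whenever a method's reduction in one term outweighs its increase in the other:

$$E\left[(y_0 - \hat{f}(x_0))^2\right] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon)$$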
Ridge regression, relative to least squares, is:
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
As λ increases, the flexibility of the ridge regression fit decreases, reducing variance but increasing bias.
Nonlinear methods, relative to least squares, are:
ii. More flexible and hence will give improved accuracy when its increase in variance is less than its decrease in bias.
A nonlinear method is more flexible because it makes fewer assumptions about the functional form of f. It will give higher prediction accuracy when the resulting decrease in bias is greater than the increase in variance.
Predict the number of applications received using the other variables in the College dataset.
Split data into training and test sets.
library(ISLR)   # for the College data
attach(College)
set.seed(1)
index <- sample(1:nrow(College), 0.8 * nrow(College))
train <- College[index, ]
test <- College[-index, ]
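As a sanity check, College has 777 rows, so an 80/20 split should leave 621 training and 156 test observations (the 603 residual degrees of freedom in the summary below are consistent with 621 rows and 18 estimated coefficients):
dim(train)   # expect 621 rows
dim(test)    # expect 156 rows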
Fit a linear model using least squares on the training set and report the test error.
lm.fit <- lm(Apps ~ ., data = train)
summary(lm.fit)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5555.2 -404.6 19.9 310.3 7577.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -630.58238 435.56266 -1.448 0.148209
## PrivateYes -388.97393 148.87623 -2.613 0.009206 **
## Accept 1.69123 0.04433 38.153 < 2e-16 ***
## Enroll -1.21543 0.20873 -5.823 9.41e-09 ***
## Top10perc 50.45622 5.88174 8.578 < 2e-16 ***
## Top25perc -13.62655 4.67321 -2.916 0.003679 **
## F.Undergrad 0.08271 0.03632 2.277 0.023111 *
## P.Undergrad 0.06555 0.03367 1.947 0.052008 .
## Outstate -0.07562 0.01987 -3.805 0.000156 ***
## Room.Board 0.14161 0.05130 2.760 0.005947 **
## Books 0.21161 0.25184 0.840 0.401102
## Personal 0.01873 0.06604 0.284 0.776803
## PhD -9.72551 4.91228 -1.980 0.048176 *
## Terminal -0.48690 5.43302 -0.090 0.928620
## S.F.Ratio 18.26146 13.83984 1.319 0.187508
## perc.alumni 1.39008 4.39572 0.316 0.751934
## Expend 0.05764 0.01254 4.595 5.26e-06 ***
## Grad.Rate 5.89480 3.11185 1.894 0.058662 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 993.8 on 603 degrees of freedom
## Multiple R-squared: 0.9347, Adjusted R-squared: 0.9328
## F-statistic: 507.5 on 17 and 603 DF, p-value: < 2.2e-16
lm.pred <- predict(lm.fit, newdata = test, type = "response")
lm.mse <- mean((lm.pred - test$Apps)^2)
lm.mse
## [1] 1567324
Interpretation
The test MSE for the least squares model is 1567324.
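Taking the square root puts this on the scale of the response:
sqrt(lm.mse)   # about 1252, i.e., a typical test prediction misses by roughly 1252 applications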
Fit a ridge regression model on the training dataset with λ chosen by cross-validation. Report the test error.
library(glmnet)
x.train <- model.matrix(Apps ~ ., train)   # keeps the intercept column; glmnet fits its own intercept
x.test <- model.matrix(Apps ~ ., test)
grid <- 10^seq(10, -2, length = 100)
set.seed(1)
ridge.mod <- glmnet(x.train, train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
cv.out <- cv.glmnet(x.train, train$Apps, alpha = 0, lambda = grid, thresh = 1e-12)
plot(cv.out)
bestlam <- cv.out$lambda.min
bestlam
## [1] 0.01
ridge.pred <- predict(ridge.mod, s = bestlam, newx = x.test)
ridge.mse <- mean((ridge.pred - test$Apps)^2)
ridge.mse
## [1] 1567278
Interpretation
The ridge test error is 1567278, slightly below the least squares test error. Note, however, that the selected λ = 0.01 is the smallest value in the grid, so cross-validation is choosing almost no shrinkage and the ridge fit is nearly identical to least squares.
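Since the selected λ sits at the lower boundary of the grid, it may be worth letting cv.glmnet generate its own λ sequence as a check; a minimal sketch (results not shown above):
set.seed(1)
cv.default <- cv.glmnet(x.train, train$Apps, alpha = 0)   # default lambda sequence
cv.default$lambda.min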
Fit a lasso model on the training dataset with λ chosen by cross-validation. Report the test error and the number of non-zero coefficients.
x.train <- model.matrix(Apps ~ ., train)
x.test <- model.matrix(Apps ~ ., test)
grid <- 10^seq(10, -2, length = 100)
set.seed(1)
lasso.mod <- glmnet(x.train, train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
cv.out <- cv.glmnet(x.train, train$Apps, alpha = 1, lambda = grid, thresh = 1e-12)
bestlam <- cv.out$lambda.min
bestlam
## [1] 0.01
lasso.pred <- predict(lasso.mod, s = bestlam, newx = x.test)   # predictions, not coefficients
lasso.mse <- mean((lasso.pred - test$Apps)^2)
lasso.mse
## [1] 1567261
predict(lasso.mod, s = bestlam, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -630.61389530
## (Intercept) .
## PrivateYes -388.94711918
## Accept 1.69115587
## Enroll -1.21488255
## Top10perc 50.44933775
## Top25perc -13.62126672
## F.Undergrad 0.08264172
## P.Undergrad 0.06554432
## Outstate -0.07560901
## Room.Board 0.14159590
## Books 0.21154724
## Personal 0.01871734
## PhD -9.72479696
## Terminal -0.48590436
## S.F.Ratio 18.25562704
## perc.alumni 1.38634947
## Expend 0.05763637
## Grad.Rate 5.89356487
Interpretation
Test error = 1567261, slightly below both the linear model and ridge regression. Because λ = 0.01 is again the smallest value in the grid, almost no shrinkage is applied, and all 17 predictor coefficients remain non-zero. (The second "(Intercept)" row in the output is the redundant intercept column from model.matrix, which glmnet zeroes out.)
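The count of non-zero coefficients can also be verified programmatically; a small sketch using the objects above:
lasso.full <- predict(lasso.mod, s = bestlam, type = "coefficients")
sum(lasso.full != 0) - 1   # subtract 1 so the intercept is not counted as a predictor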
Fit a PCR model on the training dataset, with M chosen by cross-validation. Report the test error and value of M.
library(pls)
set.seed(1)
pcr.fit <- pcr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
pcr.fit$ncomp   # the maximum number of components fitted, not necessarily the CV minimum
## [1] 17
pcr.pred <- predict(pcr.fit, test, ncomp = 17)
pcr.mse <- mean((pcr.pred - test$Apps)^2)
pcr.mse
## [1] 1567324
Interpretation
The reported M is 17, i.e., all components. With no dimension reduction, PCR reproduces the least squares fit, which is why the test error of 1567324 exactly matches the linear model's. Note that pcr.fit$ncomp gives the maximum number of components fitted, so the cross-validation minimum should be confirmed from the validation plot.
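Rather than relying on pcr.fit$ncomp, the CV-minimizing M can be read off the MSEP object directly; a sketch, assuming the default array layout returned by the pls package:
cv.msep <- MSEP(pcr.fit, estimate = "CV")
which.min(cv.msep$val["CV", "Apps", ]) - 1   # minus 1 for the intercept-only entry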
Fit a PLS model on the training dataset with M chosen by cross-validation. Report the test error and the value of M.
set.seed(1)
pls.fit <- plsr(Apps~., data = train, scale = TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")
pls.fit$ncomp   # again the maximum fitted; confirm the CV choice from the MSEP plot
## [1] 17
pls.pred <- predict(pls.fit, test, ncomp = 17)
pls.mse <- mean((pls.pred - test$Apps)^2)
pls.mse
## [1] 1567324
Interpretation
M = 17 and test error = 1567324. As with PCR, using all 17 components makes PLS equivalent to the least squares fit, which is why the test error matches the linear model's exactly.
Comment on results. How accurately can the number of college applications received be predicted? Is there much difference between the five approaches?
model.names <- c("LM", "RIDGE", "LASSO", "PCR", "PLS")
all.errors <- c(lm.mse, ridge.mse, lasso.mse, pcr.mse, pls.mse)
data.frame(model.names, all.errors)
## model.names all.errors
## 1 LM 1567324
## 2 RIDGE 1567278
## 3 LASSO 1567261
## 4 PCR 1567324
## 5 PLS 1567324
Interpretation
The comparison table makes the test errors easy to read. The lasso produces the lowest test error, but there is very little difference among the five models: three errors are identical and all lie within about 70 of one another, a relative difference of well under 0.01%. This is expected, since cross-validation chose essentially no shrinkage for ridge and the lasso, and PCR and PLS used all 17 components, so all five models are effectively the least squares fit.
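To quantify how accurately applications can be predicted, a test R² expresses the error relative to the variance of Apps; a small sketch (not computed above, though the 0.93 training R² suggests it should be similarly high):
1 - lm.mse / mean((test$Apps - mean(test$Apps))^2)   # fraction of test-set variance explained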
Predict the per capita crime rate in the Boston dataset.
library(MASS)   # for the Boston data
detach(College)
attach(Boston)
Try some regression methods from the chapter. Present and discuss the results.
1. Best Subset Selection
library(leaps)
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])   # recover the formula from the regsubsets call
  mat <- model.matrix(form, newdata)     # design matrix for the new data
  coefi <- coef(object, id = id)         # coefficients of the size-id model
  xvars <- names(coefi)
  mat[, xvars] %*% coefi                 # predict using only the selected variables
}
regfit.full <- regsubsets(crim ~ ., Boston, nvmax = 13)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., Boston, nvmax = 13)
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " " " "*"
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" " " "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
k <- 10
set.seed(1)
folds <- sample(1:k, nrow(Boston), replace = TRUE)   # assign each row to one of k folds
cv.errors <- matrix(NA, k, 13, dimnames = list(NULL, paste(1:13)))
for (j in 1:k) {
  best.fit <- regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = 13)
  for (i in 1:13) {
    pred <- predict.regsubsets(best.fit, Boston[folds == j, ], id = i)
    cv.errors[j, i] <- mean((Boston$crim[folds == j] - pred)^2)
  }
}
mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors
## 1 2 3 4 5 6 7 8
## 45.44573 43.87260 43.94979 44.02424 43.96415 43.96199 42.96268 42.66948
## 9 10 11 12 13
## 42.53822 42.73416 42.52367 42.46014 42.50125
which.min(mean.cv.errors)
## 12
## 12
par(mfrow = c(1,1))
plot(mean.cv.errors, type = "b")
bss.error <- min(mean.cv.errors)
bss.error
## [1] 42.46014
Interpretation
The lowest cross-validated MSE for best subset selection is 42.46, achieved by the 12-variable model.
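The coefficients of that 12-variable model can be pulled from the full-data fit:
coef(regfit.full, id = 12)   # coefficients of the selected 12-variable model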
2. Lasso
x <- model.matrix(crim ~ ., Boston)   # full dataset; no train/test split is used here
y <- Boston$crim
grid <- 10^seq(10, -2, length = 100)
set.seed(1)
lasso.mod <- glmnet(x, y, alpha = 1, lambda = grid, thresh = 1e-12)
cv.out <- cv.glmnet(x, y, alpha = 1, lambda = grid, thresh = 1e-12)
bestlam <- cv.out$lambda.min
bestlam
## [1] 0.05336699
plot(cv.out)
lasso.pred <- predict(lasso.mod, s = bestlam, newx = x)   # predictions, not coefficients
lasso.mse <- mean((lasso.pred - y)^2)
lasso.mse
## [1] 40.47009
Interpretation
The mean squared error for the lasso model is 40.47. Because the model is fit and evaluated on the full dataset, this is a training (in-sample) MSE rather than a true test error.
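To see whether this λ actually drops any variables, the coefficients can be inspected (output not shown; with λ ≈ 0.05, most coefficients likely remain non-zero):
predict(lasso.mod, s = bestlam, type = "coefficients")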
3. Ridge Regression
x <- model.matrix(crim ~ ., Boston)
y <- Boston$crim
grid <- 10^seq(10, -2, length = 100)
set.seed(1)
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid, thresh = 1e-12)
cv.out <- cv.glmnet(x, y, alpha = 0, lambda = grid, thresh = 1e-12)
bestlam <- cv.out$lambda.min
bestlam
## [1] 0.1232847
plot(cv.out)
ridge.pred <- predict(ridge.mod, s = bestlam, newx = x)
ridge.mse <- mean((ridge.pred - y)^2)
ridge.mse
## [1] 40.36313
Interpretation
The MSE from ridge regression is 40.36; as with the lasso, this is a training (in-sample) error because the model is evaluated on the same data used to fit it.
4. Principal Components Regression (PCR)
set.seed(1)
pcr.fit <- pcr(crim~., data = Boston, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
pcr.fit$ncomp   # maximum number of components fitted, not the CV choice
## [1] 13
pcr.pred <- predict(pcr.fit, Boston, ncomp = 13)
pcr.mse <- mean((pcr.pred - y)^2)
pcr.mse
## [1] 40.31607
Interpretation
The in-sample MSE for PCR is 40.32. With M = 13 (all components), PCR is equivalent to a least squares fit on the full data.
Model Comparison
model.names <- c("BSS", "LASSO", "RIDGE","PCR")
errors <- c(bss.error, lasso.mse, ridge.mse, pcr.mse)
data.frame(model.names, errors)
## model.names errors
## 1 BSS 42.46014
## 2 LASSO 40.47009
## 3 RIDGE 40.36313
## 4 PCR 40.31607
Interpretation
PCR shows the lowest error here, but with all 13 components it is simply the least squares fit, and its in-sample MSE is not directly comparable to the cross-validated best subset error, which estimates out-of-sample performance.
Propose a model that seems to perform well on this dataset.
The PCR model gave the lowest error, roughly 40.32, but that is an in-sample figure. To understand how the model actually performs, we would need to split the data into training and test subsets (or compare cross-validated errors throughout). On the full dataset, cross-validation retained all M = 13 components.
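A minimal sketch of such a hold-out evaluation (the split ratio and seed here are illustrative choices, not taken from the analysis above):
set.seed(1)
idx <- sample(1:nrow(Boston), 0.8 * nrow(Boston))
b.train <- Boston[idx, ]
b.test <- Boston[-idx, ]
pcr.b <- pcr(crim ~ ., data = b.train, scale = TRUE, validation = "CV")
validationplot(pcr.b, val.type = "MSEP")           # choose M where the CV curve bottoms out
pcr.b.pred <- predict(pcr.b, b.test, ncomp = 13)   # 13 is a placeholder; use the M from the plot
mean((pcr.b.pred - b.test$crim)^2)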
Does the chosen model involve all the features in the dataset?
The PCR model does involve all the features in the dataset, since every principal component is a linear combination of all 13 predictors. Additional models with interaction effects or polynomial terms could be fit to assess whether the extra flexibility helps or hurts.
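For example, a quadratic term could be added as follows (illustrative only; this model was not fit above):
lm.poly <- lm(crim ~ . + I(medv^2), data = Boston)
summary(lm.poly)   # check whether the squared term improves the fit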