suppressMessages(library(ISLR))
suppressMessages(library(MASS))
suppressMessages(library(glmnet))
suppressMessages(library(Amelia))
suppressMessages(library(pls))
suppressMessages(library(leaps))
When it comes to regression, the standard linear model is used to describe the relationship between a response Y and a set of predictors X1, X2, …, Xp. In the real world most relationships are not exactly linear, but the linear model still has clear advantages for inference and, on many real-world problems, can be surprisingly competitive with non-linear methods. Chapter 6 of ISLR discusses ways in which the simple linear model can be improved by replacing least squares fitting with alternative fitting procedures. The following are a few exercises from this chapter.
#### Question 2
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
RESPONSE: iii. is correct. The lasso is a less flexible model than least squares because it places a budget constraint on the coefficients: we look for the coefficient estimates that give the smallest RSS, subject to the constraint that the coefficients stay within a budget s on their total size. Equivalently, as the tuning parameter λ increases, variance decreases while bias increases, so prediction accuracy improves whenever the increase in bias is less than the decrease in variance.
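To make the shrinkage idea concrete, here is a quick illustrative sketch on simulated data (my own toy example, not part of the exercise): as λ grows the lasso pushes more and more coefficient estimates exactly to zero.
set.seed(1)
n <- 100; p <- 10
x.sim <- matrix(rnorm(n * p), n, p)
y.sim <- 3 * x.sim[, 1] - 2 * x.sim[, 2] + rnorm(n) # only the first two predictors matter
lasso.sim <- glmnet(x.sim, y.sim, alpha = 1)
# count the non-zero coefficients (excluding the intercept) at a small, medium and large lambda
sapply(c(0.01, 0.5, 2), function(l) sum(as.numeric(coef(lasso.sim, s = l))[-1] != 0))
plot(lasso.sim, xvar = "lambda", label = TRUE) # coefficient paths shrinking toward zero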
(b) Repeat (a) for ridge regression relative to least squares.
RESPONSE: iii. is correct here as well, because ridge regression's advantage over least squares also comes from the bias-variance trade-off. As the shrinkage penalty λ increases, ridge regression becomes less flexible, producing a substantial decrease in variance with only a small increase in bias. When the number of predictors p is almost as large as the number of observations n, the least squares estimates can have a lot of variance, and if p > n least squares does not even have a unique solution, whereas ridge regression can still perform well by trading a small increase in bias for a large decrease in variance. Ridge regression is more biased because it shrinks the coefficients, especially those of predictors that do not have a strong relationship with the response, and its prediction accuracy is improved when that increase in bias is less than the decrease in variance.
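To illustrate the p > n point (again a toy simulation of my own, not from the text): least squares cannot produce a unique solution when p > n, while ridge regression still returns stable, shrunken estimates.
set.seed(1)
n <- 50; p <- 100 # more predictors than observations
x.sim <- matrix(rnorm(n * p), n, p)
y.sim <- x.sim[, 1] + rnorm(n)
ols.sim <- lm(y.sim ~ x.sim)
sum(is.na(coef(ols.sim))) # many coefficients cannot be estimated at all (NA)
ridge.sim <- glmnet(x.sim, y.sim, alpha = 0) # ridge is still well defined
max(abs(as.numeric(coef(ridge.sim, s = 1)))) # finite, shrunken estimates at lambda = 1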
(c) Repeat (a) for non-linear methods relative to least squares.
RESPONSE: ii. is correct, because non-linear methods are more flexible and follow the observations more closely than least squares does; they give better prediction accuracy when their increase in variance is less than their decrease in bias.
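A quick sketch of the non-linear case (illustrative only): when the true relationship is non-linear, a more flexible fit such as a degree-5 polynomial cuts enough bias to beat the linear fit on held-out data.
set.seed(1)
x.sim <- runif(200, -2, 2)
y.sim <- sin(2 * x.sim) + rnorm(200, sd = 0.3) # a non-linear truth
train.id <- sample(200, 100)
lin.fit <- lm(y.sim ~ x.sim, subset = train.id)
poly.fit <- lm(y.sim ~ poly(x.sim, 5), subset = train.id)
newdat <- data.frame(x.sim = x.sim[-train.id])
mean((y.sim[-train.id] - predict(lin.fit, newdat))^2) # test MSE, linear fit
mean((y.sim[-train.id] - predict(poly.fit, newdat))^2) # test MSE, flexible fit (lower here)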
#### Question 9
In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.
data(College)
missmap(College) # missmap() is useful for ensuring there are no missing values in the data...from library(Amelia)
set.seed(12)
trainID <- sample(nrow(College), nrow(College)*.5)#Random 50/50 split for test and train
train <- College[trainID,]
test <- College[-trainID,]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
set.seed(12)
fit.lm <- lm(Apps ~ .,data = train)
summary(fit.lm)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2796.0 -388.6 -58.6 298.8 5496.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -340.71990 522.82263 -0.652 0.51500
## PrivateYes -531.76257 181.15740 -2.935 0.00354 **
## Accept 1.40371 0.07011 20.022 < 2e-16 ***
## Enroll -0.52560 0.24701 -2.128 0.03401 *
## Top10perc 44.64763 7.14555 6.248 1.14e-09 ***
## Top25perc -11.97377 5.76813 -2.076 0.03860 *
## F.Undergrad 0.09258 0.04332 2.137 0.03327 *
## P.Undergrad 0.01486 0.05191 0.286 0.77483
## Outstate -0.05728 0.02418 -2.369 0.01836 *
## Room.Board 0.14359 0.05892 2.437 0.01528 *
## Books 0.03323 0.30707 0.108 0.91388
## Personal -0.01028 0.08033 -0.128 0.89828
## PhD -7.32657 6.00842 -1.219 0.22348
## Terminal -5.21514 6.56187 -0.795 0.42726
## S.F.Ratio 6.69220 17.07931 0.392 0.69541
## perc.alumni -5.85709 5.32647 -1.100 0.27221
## Expend 0.08821 0.01929 4.572 6.59e-06 ***
## Grad.Rate 8.09763 3.48744 2.322 0.02078 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 904.1 on 370 degrees of freedom
## Multiple R-squared: 0.9352, Adjusted R-squared: 0.9322
## F-statistic: 313.9 on 17 and 370 DF, p-value: < 2.2e-16
lm.pred <- predict(fit.lm,test)
lm.error <- mean((test$Apps-lm.pred)^2)
lm.error
## [1] 1416217
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
# to perform a ridge regression we must first create a matrix x and vector y for the training and testing data
set.seed(12)
train.x <- model.matrix(Apps ~ ., data = train)
test.x <- model.matrix(Apps ~., data = test)
train.y <- train$Apps
test.y <- test$Apps
grid <- 10 ^ seq(4,-2, length = 100) # the glmnet() function will be implemented over a grid defined by these ranges for λ
fit.ridge <- glmnet(train.x, train.y, alpha = 0, lambda = grid, thresh = 1e-12)
cv.ridge <- cv.glmnet(train.x, train.y, alpha = 0, lambda = grid, thresh = 1e-12)
plot(cv.ridge)
best.lambda1 <- cv.ridge$lambda.min # here I am getting the λ chosen by cross-validation
best.lambda1
## [1] 18.73817
ridge.pred <- predict(fit.ridge, s = best.lambda1, newx = test.x)
ridge.error <- mean((ridge.pred - test.y)^2)
print(ridge.error) #test error obtained
## [1] 1470483
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
set.seed(12)
fit.lasso <- glmnet(train.x, train.y, alpha =1, lambda= grid, thresh = 1e-12)
cv.lasso <- cv.glmnet(train.x, train.y, alpha =1, lambda = grid, thresh = 1e-12)
bestlambda2 <- cv.lasso$lambda.min
print(bestlambda2)
## [1] 16.29751
lasso.pred <- predict(fit.lasso, s = bestlambda2, newx = test.x)
lasso.error <- mean((lasso.pred - test.y)^2)
print(lasso.error)
## [1] 1488722
coeff.lasso <- predict(fit.lasso, type = "coefficients", s = bestlambda2)[1:18,] # coefficient estimates for the lasso fit; four of them (Enroll, Books, Personal, S.F.Ratio) are shrunk exactly to 0
print(coeff.lasso)
## (Intercept) (Intercept) PrivateYes Accept Enroll
## -4.849043e+02 0.000000e+00 -5.417953e+02 1.313305e+00 0.000000e+00
## Top10perc Top25perc F.Undergrad P.Undergrad Outstate
## 3.502755e+01 -4.129574e+00 3.150119e-02 4.604656e-03 -3.515171e-02
## Room.Board Books Personal PhD Terminal
## 1.313325e-01 0.000000e+00 0.000000e+00 -5.279551e+00 -5.038469e+00
## S.F.Ratio perc.alumni Expend
## 0.000000e+00 -6.894089e+00 7.823714e-02
coeff.lasso[coeff.lasso!=0] # these are the non-zero coefficients
## (Intercept) PrivateYes Accept Top10perc Top25perc
## -4.849043e+02 -5.417953e+02 1.313305e+00 3.502755e+01 -4.129574e+00
## F.Undergrad P.Undergrad Outstate Room.Board PhD
## 3.150119e-02 4.604656e-03 -3.515171e-02 1.313325e-01 -5.279551e+00
## Terminal perc.alumni Expend
## -5.038469e+00 -6.894089e+00 7.823714e-02
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(12)
fit.pcr <- pcr(Apps~., data = train, scale = TRUE, validation ="CV")
summary(fit.pcr)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3476 3384 1665 1568 1326 1300 1194
## adjCV 3476 3388 1663 1568 1308 1294 1191
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1166 1154 1120 1111 1115 1112 1116
## adjCV 1163 1152 1116 1108 1113 1109 1113
## 14 comps 15 comps 16 comps 17 comps
## CV 1116 1102 963.2 970.9
## adjCV 1113 1101 959.5 966.0
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.609 59.82 66.17 71.90 77.23 81.95 85.39 88.45
## Apps 7.327 77.33 81.18 86.64 86.72 88.87 89.35 89.65
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 91.20 93.61 95.63 97.22 98.27 99.09 99.52
## Apps 90.38 90.47 90.49 90.66 90.66 90.67 90.96
## 16 comps 17 comps
## X 99.85 100.00
## Apps 93.28 93.52
validationplot(fit.pcr, val.type = "MSEP")
pcr.pred <- predict(fit.pcr,test,ncomp = 16)
pcr.error<- mean((test.y-pcr.pred)^2) # Principal Components Regression test error
pcr.error
## [1] 1483940
RESPONSE: From the summary above, the lowest cross-validated error (adjCV) occurs at M = 16 components, giving a test error of 1483940.
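If we want to pick M programmatically rather than reading it off the table, something along these lines should work (a small sketch using the RMSEP() accessor from the pls package):
cv.rmsep <- RMSEP(fit.pcr, estimate = "adjCV")$val[1, 1, ] # adjCV RMSEP; the first entry is the intercept-only model
which.min(cv.rmsep) - 1 # subtract 1 for the intercept column; gives M = 16 here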
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
fit.pls <- plsr(Apps~., data =train, scale = TRUE, validation = "CV")
validationplot(fit.pls, val.type = "MSEP")
summary(fit.pls)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3476 1532 1265 1110 1064 1018 969.8
## adjCV 3476 1528 1265 1107 1061 1015 965.6
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 963.0 955.7 954.9 951.3 951.8 952.0 952.2
## adjCV 958.7 952.0 951.2 947.6 948.0 948.2 948.3
## 14 comps 15 comps 16 comps 17 comps
## CV 951.5 950.4 950.0 949.8
## adjCV 947.7 946.7 946.3 946.2
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 27.77 46.85 65.42 69.25 72.60 75.30 79.24 81.85
## Apps 81.82 87.66 90.76 91.63 92.49 93.25 93.41 93.46
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 84.51 86.78 90.24 92.49 93.91 95.98 97.11
## Apps 93.47 93.49 93.51 93.51 93.51 93.52 93.52
## 16 comps 17 comps
## X 98.95 100.00
## Apps 93.52 93.52
pls.pred <- predict(fit.pls, test, ncomp = 10) # M = 10 chosen: the CV error is essentially flat beyond 10 components
pls.error <- mean((test.y - pls.pred)^2) # Partial Least Squares test error
pls.error
## [1] 1419159
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
all.error <- c(lm.error, ridge.error, lasso.error, pcr.error, pls.error)
names(all.error) <- c("lm","ridge","lasso","pcr","pls")
testerror.plot <- barplot(all.error, col = 'orange')
RESPONSE: The bar chart above visualizes the test MSE of the different models. They are all similar, with the linear model and PLS appearing to have the lowest test error, as the sorted errors below confirm.
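To put numbers on that comparison, the error vector can simply be sorted:
round(sort(all.error)) # test MSEs from smallest to largest: lm and pls come out lowest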
# R-squared
test.avg <- mean(test.y)
lm.r2 <- 1 - mean((lm.pred - test.y)^2) / mean((test.avg - test.y)^2)
lm.r2
## [1] 0.9207656
ridge.r2 <- 1 - mean((ridge.pred - test.y)^2) / mean((test.avg - test.y)^2)
ridge.r2
## [1] 0.9177295
lasso.r2 <- 1 - mean((lasso.pred - test.y)^2) / mean((test.avg - test.y)^2)
lasso.r2
## [1] 0.9167091
pcr.r2 <- 1 - mean((pcr.pred - test.y)^2) / mean((test.avg - test.y)^2)
pcr.r2
## [1] 0.9169766
pls.r2 <- 1 - mean((pls.pred - test.y)^2) / mean((test.avg - test.y)^2)
pls.r2
## [1] 0.920601
RESPONSE: Above are the R-squared calculations, which show that all of these models explain the variance of the response quite accurately, all above 90%. It is a close call, but there is not much difference between the models, with PLS and the linear model coming in almost identical.
#### Question 11
We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
# I like to begin by exploring the data and ensuring there are no missing values
attach(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
missmap(Boston)
#best subset
set.seed(12)
data(Boston)
# helper: a predict() method for regsubsets objects, since leaps does not supply one
predict.regsubsets <- function(object, newdata, id, ...) {
form <- as.formula(object$call[[2]])
mat <- model.matrix(form, newdata)
coefi <- coef(object, id = id)
xvars <- names(coefi)
mat[, xvars] %*% coefi
}
k <- 10 # number of cross-validation folds
folds <- sample(1:k, nrow(Boston), replace = TRUE)
cv.errors <- matrix(NA, k, 13, dimnames = list(NULL, paste(1:13)))
for (j in 1:k) {
best.fit <- regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = 13)
for (i in 1:13) {
pred <- predict(best.fit, Boston[folds == j, ], id = i)
cv.errors[j, i] <- mean((Boston$crim[folds == j] - pred)^2)
}
}
mean.cv.errors <- apply(cv.errors, 2, mean)
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")
mean.cv.errors[which.min(mean.cv.errors)]
## 12
## 45.33533
RESPONSE: The lowest CV error comes at 12 variables, with an error of about 45.33.
bestfit.full <- regsubsets(crim ~., data = Boston, nvmax = 13)
coef(bestfit.full, 12)
## (Intercept) zn indus chas nox
## 16.985713928 0.044673247 -0.063848469 -0.744367726 -10.202169211
## rm dis rad tax ptratio
## 0.439588002 -0.993556631 0.587660185 -0.003767546 -0.269948860
## black lstat medv
## -0.007518904 0.128120290 -0.198877768
#Lasso
set.seed(1)
train <- sample(506, 253) # 50/50 split of the 506 Boston observations
test <- -train
Boston.train <- Boston[train, ]
Boston.test <- Boston[test,]
x_train <- model.matrix(crim~., data = Boston.train)[,-1]
y_train <- Boston.train$crim
x_test <- model.matrix(crim~., data = Boston.test)[,-1]
y_test <- Boston.test$crim
grid <- 10^seq(10, -2, length = 100)
lasso.fit <- cv.glmnet(x_train, y_train, alpha = 1, lambda = grid, thresh = 1e-12)
plot(lasso.fit)
bestlam <- lasso.fit$lambda.min
bestlam
## [1] 0.07054802
lasso.pred <- predict(lasso.fit, newx = x_test, s = bestlam)
mean((y_test - lasso.pred)^2)
## [1] 40.88312
x <- model.matrix(crim ~., data = Boston)[,-1]
y <- Boston$crim
mod.lasso <- glmnet(x, y, alpha = 1, lambda = grid)
lasso.coeff <- predict(mod.lasso, s = bestlam, type = "coefficients")[1:14,]
lasso.coeff[lasso.coeff != 0]
## (Intercept) zn indus chas nox rm
## 11.377041716 0.034373111 -0.062605969 -0.555023861 -5.721122105 0.151927338
## dis rad ptratio black lstat medv
## -0.715085799 0.506621375 -0.156825685 -0.007550409 0.122010397 -0.145585637
#ridge regression
set.seed(12)
ridge.fit <- cv.glmnet(x_train, y_train, alpha = 0, lambda = grid, thresh = 1e-12)
plot(ridge.fit)
bestlam <- ridge.fit$lambda.min
bestlam
## [1] 0.2848036
ridge.pred <- predict(ridge.fit, newx = x_test, s = bestlam)
mean((y_test - ridge.pred)^2)
## [1] 41.10444
ridgefull.fit <- glmnet(x, y, alpha = 0, lambda = grid)
ridge.coeff <- predict(ridgefull.fit, s = bestlam, type = "coefficients")[1:14,]
ridge.coeff
## (Intercept) zn indus chas nox rm
## 11.805024789 0.036910120 -0.082474528 -0.727909961 -7.185946536 0.380200039
## age dis rad tax ptratio black
## 0.001381276 -0.809468723 0.476201825 0.001352865 -0.185334855 -0.008163851
## lstat medv
## 0.139994168 -0.160079863
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
RESPONSE: Judging by these estimates of test error, the lasso performs best, with a validation-set MSE of about 40.9, although it is not much different than ridge regression at about 41.1; the 12-variable best subset model has a 10-fold CV error of about 45.3. See the comparison below.
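For a side-by-side view, here is a small summary using the error estimates already computed above (note the best subset number is a 10-fold CV error on the full data set, while the lasso and ridge numbers are validation-set MSEs):
boston.errors <- c(best.subset = min(mean.cv.errors),
                   lasso = mean((y_test - lasso.pred)^2),
                   ridge = mean((y_test - ridge.pred)^2))
round(boston.errors, 2)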
(c) Does your chosen model involve all of the features in the data set? Why or why not?
RESPONSE: No, two variables, age and tax, were left out, because at the best value of λ the lasso shrinks their coefficients exactly to 0.
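The dropped predictors can be read directly off the lasso coefficients extracted above:
names(lasso.coeff)[lasso.coeff == 0] # the lasso shrinks age and tax exactly to zero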