More flexible and hence will give improved prediction ac- curacy when its increase in bias is less than its decrease in variance.
More flexible and hence will give improved prediction accu- racy when its increase in variance is less than its decrease in bias.
Less flexible and hence will give improved prediction accu- racy when its increase in bias is less than its decrease in variance.
Less flexible and hence will give improved prediction accu- racy when its increase in variance is less than its decrease in bias.
False, because Lasso can have coefficients that can be exactly zero which makes it not as flexible.
iii is true for ridge regression as well because is not the most flexible but more flexible than lasso, and the bias increase is smaller than the variance reduction.
i, is true for non-linear methods relative to least squares because we want the variance is less than the relative decrease that it receives in terms of bias.
In this exercise, we will predict the number of applications received using the other variables in the College data set. (a) Split the data set into a training set and a test set.
#checking the structure
str(df)
'data.frame': 777 obs. of 17 variables:
$ Apps : int 1660 2186 1428 417 193 587 353 1899 1038 582 ...
$ Accept : int 1232 1924 1097 349 146 479 340 1720 839 498 ...
$ Enroll : int 721 512 336 137 55 158 103 489 227 172 ...
$ Top10perc : int 23 16 22 60 16 38 17 37 30 21 ...
$ Top25perc : int 52 29 50 89 44 62 45 68 63 44 ...
$ F.Undergrad: int 2885 2683 1036 510 249 678 416 1594 973 799 ...
$ P.Undergrad: int 537 1227 99 63 869 41 230 32 306 78 ...
$ Outstate : int 7440 12280 11250 12960 7560 13500 13290 13868 15595 10468 ...
$ Room.Board : int 3300 6450 3750 5450 4120 3335 5720 4826 4400 3380 ...
$ Books : int 450 750 400 450 800 500 500 450 300 660 ...
$ Personal : int 2200 1500 1165 875 1500 675 1500 850 500 1800 ...
$ PhD : int 70 29 53 92 76 67 90 89 79 40 ...
$ Terminal : int 78 30 66 97 72 73 93 100 84 41 ...
$ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
$ perc.alumni: int 12 16 30 37 2 11 26 37 23 15 ...
$ Expend : int 7041 10527 8735 19016 10922 9727 8861 11487 11644 8991 ...
$ Grad.Rate : int 60 56 54 59 15 55 63 73 80 52 ...
# creating training and testing data
index = sample(1:nrow(df), size = 0.7 *nrow(df))
train = df[index,]
test = df[-index,]
x.train = model.matrix(Apps~.,train)[,-1]
y.train = train$Apps
x.test = model.matrix(Apps~.,train)[,-1]
y.test = train$Apps
lm = lm(Apps ~ .,data = train)
summary(lm)
Call:
lm(formula = Apps ~ ., data = train)
Residuals:
Min 1Q Median 3Q Max
-3470.2 -447.5 -39.1 305.6 6861.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.197e+03 4.851e+02 -2.467 0.013931 *
Accept 1.279e+00 6.654e-02 19.215 < 2e-16 ***
Enroll -2.212e-01 2.694e-01 -0.821 0.411890
Top10perc 4.733e+01 7.090e+00 6.676 6.25e-11 ***
Top25perc -1.383e+01 5.552e+00 -2.491 0.013049 *
F.Undergrad 7.968e-02 4.225e-02 1.886 0.059860 .
P.Undergrad 5.029e-02 3.773e-02 1.333 0.183104
Outstate -9.617e-02 2.282e-02 -4.213 2.96e-05 ***
Room.Board 1.460e-01 6.010e-02 2.429 0.015487 *
Books -5.628e-02 2.715e-01 -0.207 0.835869
Personal -1.769e-02 7.358e-02 -0.240 0.810121
PhD -6.853e+00 5.587e+00 -1.227 0.220525
Terminal 5.239e-02 6.294e+00 0.008 0.993362
S.F.Ratio 2.026e+01 1.651e+01 1.227 0.220190
perc.alumni -9.682e+00 5.166e+00 -1.874 0.061487 .
Expend 9.455e-02 1.621e-02 5.832 9.55e-09 ***
Grad.Rate 1.348e+01 3.900e+00 3.455 0.000595 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1065 on 526 degrees of freedom
Multiple R-squared: 0.9127, Adjusted R-squared: 0.9101
F-statistic: 343.7 on 16 and 526 DF, p-value: < 2.2e-16
set.seed(1)
# finding the test error by predicting on the test data
predictions = predict(lm, newdata = test)
actual = test$Apps
mse = mean((predictions - actual)^2)
cat("MSE:", mse)
MSE: 1328973
x = model.matrix(Apps ~.,df)[,-1]
y = df$Apps
library(glmnet)
grid = 10^seq(10,-2, length =100)
ridge.mod = glmnet(x,y, alpha = 0 , lambda = grid)
# cross validation
cv.ridge = cv.glmnet(x,y, alpha = 0, lambda = grid)
best.lambda = cv.ridge$lambda.min
print(paste("Best lambda:", best.lambda))
[1] "Best lambda: 0.01"
cat("MSE:", mse)
MSE: 1154607
print(paste("Best Lasso Lambda:", round(best.lasso.lam,3)))
[1] "Best Lasso Lambda: 0.376"
lasso.model = glmnet(x.train,y.train,alpha = 1,lambda = best.lasso.lam)
lasso.pred = predict(lasso.model,x.test)
mse = mean((lasso.pred-y.test)^2)
cat("Test MSE for Lasso Regression", mse)
Test MSE for Lasso Regression 1099404
summary(pcr.fit)
Data: X dimension: 777 16
Y dimension: 777 1
Fit method: svdpc
Number of components considered: 16
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps
CV 3873 3667 1991 1977 1905
adjCV 3873 3666 1988 1976 1986
5 comps 6 comps 7 comps 8 comps 9 comps
CV 1642 1601 1602 1560 1515
adjCV 1622 1597 1605 1555 1512
10 comps 11 comps 12 comps 13 comps 14 comps
CV 1514 1514 1518 1518 1450
adjCV 1511 1511 1514 1514 1433
15 comps 16 comps
CV 1169 1143
adjCV 1164 1137
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
X 32.77 57.08 64.39 70.22 75.96 81.25
Apps 10.83 74.28 74.53 74.62 83.94 84.06
7 comps 8 comps 9 comps 10 comps 11 comps
X 85.03 88.65 91.93 94.44 96.39
Apps 84.13 85.00 85.80 85.93 85.98
12 comps 13 comps 14 comps 15 comps 16 comps
X 97.77 98.67 99.32 99.83 100.0
Apps 85.99 86.02 90.20 92.36 92.8
All the predictors indicate the CV error is the lowest.
pcr.fit = pcr(Apps ~.,data = df, subset=x.train, scale = T,validation = "CV")
pcr.pred = predict(pcr.fit,x.test, ncomp = 16)
mean((pcr.pred -y.test)^2)
[1] 1270180
pcr.fit = pcr(y~x , scale =T, ncomp = 16)
summary(pcr.fit)
Data: X dimension: 777 16
Y dimension: 777 1
Fit method: svdpc
Number of components considered: 16
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
X 32.77 57.08 64.39 70.22 75.96 81.25
y 10.83 74.28 74.53 74.62 83.94 84.06
7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
X 85.03 88.65 91.93 94.44 96.39 97.77
y 84.13 85.00 85.80 85.93 85.98 85.99
13 comps 14 comps 15 comps 16 comps
X 98.67 99.32 99.83 100.0
y 86.02 90.20 92.36 92.8
set.seed(1)
pls.fit = plsr(Apps ~., data = df, subset = x.train, scale = T, validation = "CV")
summary(pls.fit)
Data: X dimension: 5490 16
Y dimension: 5490 1
Fit method: kernelpls
Number of components considered: 16
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps
CV 3427 1631 1480 1232 1104
adjCV 3427 1631 1479 1232 1103
5 comps 6 comps 7 comps 8 comps 9 comps
CV 1030 980.3 974.8 969.4 965.1
adjCV 1029 979.3 973.6 968.6 964.3
10 comps 11 comps 12 comps 13 comps 14 comps
CV 960.5 958.0 957.7 957.4 957.3
adjCV 959.7 957.2 956.8 956.5 956.4
15 comps 16 comps
CV 957.2 957.2
adjCV 956.4 956.3
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
X 25.76 54.35 62.55 66.30 71.37 74.48
Apps 77.69 81.72 87.39 90.03 91.33 92.10
7 comps 8 comps 9 comps 10 comps 11 comps
X 77.60 81.30 84.37 86.84 90.59
Apps 92.19 92.27 92.35 92.42 92.44
12 comps 13 comps 14 comps 15 comps 16 comps
X 93.41 94.57 96.80 98.70 100.00
Apps 92.45 92.46 92.46 92.46 92.46
validationplot(pls.fit,val.type = "MSEP")
out = which.min(pls.fit$validation$PRESS)
out
[1] 16
The whole model has the lowest CV error rate.
pls.pred = predict(pls.fit,x.test,ncomp = 16)
mean((pls.pred-y.test)^2)
[1] 1270180
summary(pls.fit)
Data: X dimension: 5490 16
Y dimension: 5490 1
Fit method: kernelpls
Number of components considered: 16
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps
CV 3427 1631 1480 1232 1104
adjCV 3427 1631 1479 1232 1103
5 comps 6 comps 7 comps 8 comps 9 comps
CV 1030 980.3 974.8 969.4 965.1
adjCV 1029 979.3 973.6 968.6 964.3
10 comps 11 comps 12 comps 13 comps 14 comps
CV 960.5 958.0 957.7 957.4 957.3
adjCV 959.7 957.2 956.8 956.5 956.4
15 comps 16 comps
CV 957.2 957.2
adjCV 956.4 956.3
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
X 25.76 54.35 62.55 66.30 71.37 74.48
Apps 77.69 81.72 87.39 90.03 91.33 92.10
7 comps 8 comps 9 comps 10 comps 11 comps
X 77.60 81.30 84.37 86.84 90.59
Apps 92.19 92.27 92.35 92.42 92.44
12 comps 13 comps 14 comps 15 comps 16 comps
X 93.41 94.57 96.80 98.70 100.00
Apps 92.45 92.46 92.46 92.46 92.46
I do not think that we can predict all that accurately how many applications colleges are going to receive. When looking at all of the model there seems to be a wide range of values when looking at the following MSE values. Generally, we would want the MSE values to be particularly lower, but they are not so that is not a great indicator of predicting the number of applications that colleges are going to receive.
We will now try to predict per capita crime rate in the Boston data set.
library(MASS)
data(Boston)
boston.index = sample(1:nrow(Boston), size = 0.7 * nrow(Boston))
boston.train = Boston[boston.index,]
boston.test = Boston[-boston.index,]
bx.train = model.matrix(crim ~.,boston.train)
by.train = boston.train$crim
bx.test =model.matrix(crim ~.,boston.test)
by.test = boston.test$crim
library(leaps)
# Best Subsets Method
regfit.full = regsubsets(crim ~., boston.train,nvmax = 13)
reg.summary= summary(regfit.full)
names(reg.summary)
[1] "which" "rsq" "rss" "adjr2" "cp" "bic"
[7] "outmat" "obj"
val.errors = rep(NA,13)
for (i in 1:13) {
coefi = coef(regfit.full, id = i)
pred = bx.test[,names(coefi)] %*% coefi
val.errors[i] = mean((by.test - pred)^2)
}
val.errors
which.min(val.errors)
coef(regfit.full,1)
The model with crim as the dependent variable and rad as the independent variable yields the lowest test MSE with a value of \(13.3\).
set.seed(1)
cv.out = cv.glmnet(bx.train,by.train,alpha = 1)
bos.lam = cv.out$lambda.min
bos.las.pred = predict(blasso.model, s= best.lasso.lam, newx = bx.test)
mean((bos.las.pred -by.test)^2)
[1] 13.63303
MSE is close to that of the Best Subsets Method # Ridge Method
boston.ridge = glmnet(bx.train,by.train, alpha = 0, lambda = 0)
cv.bos.rid = cv.glmnet(bx.train, by.train,alpha = 0)
plot(cv.bos.rid)
best.bos.ridge.lam = cv.bos.rid$lambda.min
bos.ridge.pred = predict(boston.ridge,s =best.bos.ridge.lam,newx = bx.test)
mean((bos.ridge.pred-by.test)^2)
[1] 18.0694
MSE is higher than the other models.
set.seed(2)
bos.pcr = pcr(crim ~., data = Boston, scale = T, validation = "CV")
summary(bos.pcr)
Data: X dimension: 506 13
Y dimension: 506 1
Fit method: svdpc
Number of components considered: 13
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps
CV 8.61 7.229 7.227 6.814 6.799
adjCV 8.61 7.225 7.222 6.807 6.789
5 comps 6 comps 7 comps 8 comps 9 comps
CV 6.795 6.794 6.787 6.654 6.664
adjCV 6.788 6.787 6.780 6.645 6.656
10 comps 11 comps 12 comps 13 comps
CV 6.673 6.676 6.651 6.573
adjCV 6.664 6.666 6.639 6.562
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
X 47.70 60.36 69.67 76.45 82.99 88.00
crim 30.69 30.87 39.27 39.61 39.61 39.86
7 comps 8 comps 9 comps 10 comps 11 comps
X 91.14 93.45 95.40 97.04 98.46
crim 40.14 42.47 42.55 42.78 43.04
12 comps 13 comps
X 99.52 100.0
crim 44.13 45.4
validationplot(bos.pcr, val.type = "MSEP")
# Cross Validating to find the best M value
set.seed(1)
bos.pcr2 = pcr(crim ~., data = Boston,subset = bx.train, scale = T, validation = "CV")
summary(bos.pcr2)
Data: X dimension: 3916 13
Y dimension: 3916 1
Fit method: svdpc
Number of components considered: 13
VALIDATION: RMSEP
Cross-validated using 10 random segments.
(Intercept) 1 comps 2 comps 3 comps 4 comps
CV 3.31 2.036 1.818 1.714 1.714
adjCV 3.31 2.036 1.818 1.714 1.714
5 comps 6 comps 7 comps 8 comps 9 comps
CV 1.615 1.589 1.53 1.528 1.515
adjCV 1.615 1.588 1.53 1.528 1.515
10 comps 11 comps 12 comps 13 comps
CV 1.478 1.438 1.396 1.381
adjCV 1.477 1.438 1.395 1.381
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
X 48.11 59.6 68.81 76.66 83.79 89.08
crim 62.29 70.0 73.40 73.43 76.44 77.22
7 comps 8 comps 9 comps 10 comps 11 comps
X 93.07 95.99 97.37 98.40 99.19
crim 78.86 78.95 79.33 80.36 81.48
12 comps 13 comps
X 99.79 100.00
crim 82.55 82.94
validationplot(bos.pcr2, val.type = "MSEP")
which.min(bos.pcr2$validation$PRESS)
[1] 13
mean((bos.pcr.pred-by.test)^2)
Warning: longer object length is not a multiple of shorter object length
[1] 39.8526
The MSE is the highest between all of the other methods.
The best model that was used was the Best Subsets Method which performed well on the test data being \(13.3\).
This model only had one variable that being rad on crim.