library(ISLR)
library(caret)
library(pls)
library(glmnet)
The lasso, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
This is because when the least squares estimates have excessively high variance, the lasso solution trades a small increase in bias for a larger reduction in variance, leading to more accurate predictions.
Ridge regression, relative to least squares, is likewise less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
This is because a small change in the training data can produce a large change in the least squares estimates (high variance); ridge regression performs well here by trading a small increase in bias for a decrease in variance.
A more flexible method, in turn, will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
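As a quick illustration of this tradeoff, the sketch below runs a small hypothetical simulation (the sample sizes, coefficients, and noise level are made up for illustration): least squares and a cross-validated lasso are fit to the same noisy training set with many weak predictors, and their test MSEs are compared. When the least squares estimates are highly variable, the lasso's small increase in bias is typically outweighed by its reduction in variance.
# Hypothetical simulation: many predictors, few true signals, small n,
# so least squares has high variance and the lasso's shrinkage helps.
set.seed(123)
n <- 60; p <- 50
x <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 5), rep(0, p - 5))              # only the first 5 predictors matter
y <- drop(x %*% beta + rnorm(n, sd = 3))
xnew <- matrix(rnorm(1000 * p), 1000, p)         # large independent test set
ynew <- drop(xnew %*% beta + rnorm(1000, sd = 3))
ols.sim <- lm(y ~ x)
ols.mse <- mean((ynew - cbind(1, xnew) %*% coef(ols.sim))^2)
lasso.sim <- cv.glmnet(x, y, alpha = 1)
lasso.mse <- mean((ynew - predict(lasso.sim, newx = xnew, s = "lambda.min"))^2)
c(OLS = ols.mse, Lasso = lasso.mse)              # the lasso's test MSE is usually the lower of the two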
library(ISLR)
data(College)
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
set.seed(1)
trainingindex = sample(nrow(College), .75*nrow(College))
train = College[trainingindex,]
test = College[-trainingindex,]
dim(train)
## [1] 582 18
dim(test)
## [1] 195 18
lm.fit = lm(Apps ~., data = train)
summary(lm.fit)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5773.1 -425.2 4.5 327.9 7496.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.784e+02 4.707e+02 -1.229 0.21962
## PrivateYes -4.673e+02 1.571e+02 -2.975 0.00305 **
## Accept 1.712e+00 4.567e-02 37.497 < 2e-16 ***
## Enroll -1.197e+00 2.151e-01 -5.564 4.08e-08 ***
## Top10perc 5.298e+01 6.158e+00 8.603 < 2e-16 ***
## Top25perc -1.528e+01 4.866e+00 -3.141 0.00177 **
## F.Undergrad 7.085e-02 3.760e-02 1.884 0.06002 .
## P.Undergrad 5.771e-02 3.530e-02 1.635 0.10266
## Outstate -8.143e-02 2.077e-02 -3.920 9.95e-05 ***
## Room.Board 1.609e-01 5.361e-02 3.002 0.00280 **
## Books 2.338e-01 2.634e-01 0.887 0.37521
## Personal 6.611e-03 6.850e-02 0.097 0.92315
## PhD -1.114e+01 5.149e+00 -2.163 0.03093 *
## Terminal 9.186e-01 5.709e+00 0.161 0.87223
## S.F.Ratio 1.689e+01 1.542e+01 1.096 0.27368
## perc.alumni 2.256e+00 4.635e+00 0.487 0.62667
## Expend 5.567e-02 1.300e-02 4.284 2.16e-05 ***
## Grad.Rate 6.427e+00 3.307e+00 1.944 0.05243 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1009 on 564 degrees of freedom
## Multiple R-squared: 0.9336, Adjusted R-squared: 0.9316
## F-statistic: 466.7 on 17 and 564 DF, p-value: < 2.2e-16
lm.pred = predict(lm.fit, newdata = test)
lm.err = mean((test$Apps-lm.pred)^2)
lm.err
## [1] 1384604
The test MSE obtained from the linear model is 1384604. The R-squared of the fit is 93.36%, meaning the model explains about 93.36% of the variance in Apps.
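To put that error on the scale of the response, we can also look at the test RMSE and the held-out R-squared (a quick check using the objects defined above):
sqrt(lm.err)    # test RMSE, in numbers of applications
1 - sum((test$Apps - lm.pred)^2)/sum((test$Apps - mean(test$Apps))^2)    # test-set R-squared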
xtrain = model.matrix(Apps ~ ., data = train[,-1])   # note: train[,-1] drops the first column (Private) before building the design matrix
ytrain = train$Apps
xtest = model.matrix(Apps ~ ., data = test[,-1])
ytest = test$Apps
set.seed(1)
library(glmnet)
ridge.fit = cv.glmnet(xtrain, ytrain, alpha=0)
plot(ridge.fit)
ridge.lambda = ridge.fit$lambda.min
ridge.lambda
## [1] 364.6228
ridge.pred = predict(ridge.fit, s=ridge.lambda, newx = xtest)
ridge.err = mean((ridge.pred - ytest)^2)
ridge.err
## [1] 1260111
As we can observe above, the ridge regression test error is 1260111, noticeably lower than that of the linear model. This is expected, because ridge regression typically predicts better than OLS when the OLS estimates have high variance: it trades a small increase in bias for a reduction in variance.
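Unlike the lasso fit in the next step, ridge regression keeps every predictor and only shrinks the coefficients toward zero. A quick way to see this with the fit above is to extract the coefficients at the chosen lambda:
ridge.coeff = predict(ridge.fit, type = "coefficients", s = ridge.fit$lambda.min)
ridge.coeff    # every predictor is retained, just with shrunken coefficients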
set.seed(1)
lasso.fit = cv.glmnet(xtrain, ytrain, alpha = 1)
plot(lasso.fit)
lasso.lambda = lasso.fit$lambda.min
lasso.lambda
## [1] 1.945882
lasso.pred = predict(lasso.fit, s=lasso.lambda, newx = xtest)
lasso.err = mean((lasso.pred-ytest)^2)
lasso.err
## [1] 1394834
Reported above is the test error obtained from running the lasso model.
lasso.coeff=predict(lasso.fit, type = "coefficients", s = lasso.lambda)[1:18,]
lasso.coeff
## (Intercept) (Intercept) Accept Enroll Top10perc
## -1.030320e+03 0.000000e+00 1.693638e+00 -1.100616e+00 5.104012e+01
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -1.369969e+01 7.851307e-02 6.036558e-02 -9.861002e-02 1.475536e-01
## Books Personal PhD Terminal S.F.Ratio
## 2.185146e-01 0.000000e+00 -8.390364e+00 1.349648e+00 2.457249e+01
## perc.alumni Expend Grad.Rate
## 7.684156e-01 5.858007e-02 5.324356e+00
Reported above are the lasso coefficient estimates at the chosen lambda; the lasso shrinks some of them exactly to zero (here Personal and the redundant intercept column), and the remaining non-zero terms are the predictors retained in the model.
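Because the lasso performs variable selection, it is worth counting how many coefficients were shrunk exactly to zero at the chosen lambda (a quick check on the vector extracted above):
sum(lasso.coeff == 0)                     # how many coefficients are exactly zero
names(lasso.coeff)[lasso.coeff == 0]      # which terms were dropped from the model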
library(pls)
pcr.fit = pcr(Apps~., data = train, scale=TRUE, validation="CV")
validationplot(pcr.fit, val.type = "MSEP")
summary(pcr.fit)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3862 3773 2100 2106 1812 1678 1674
## adjCV 3862 3772 2097 2104 1800 1666 1668
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1674 1625 1580 1579 1582 1586 1590
## adjCV 1669 1613 1574 1573 1576 1580 1584
## 14 comps 15 comps 16 comps 17 comps
## CV 1585 1480 1189 1123
## adjCV 1580 1463 1179 1115
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.159 57.17 64.41 70.20 75.53 80.48 84.24 87.56
## Apps 5.226 71.83 71.84 80.02 83.01 83.07 83.21 84.46
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.54 92.81 94.92 96.73 97.81 98.69 99.35
## Apps 85.00 85.22 85.22 85.23 85.36 85.45 89.93
## 16 comps 17 comps
## X 99.82 100.00
## Apps 92.84 93.36
From the validation plot and the CV results above, the cross-validated error drops sharply over the first few components and then levels off at around 10 components (it continues to decrease slowly out to the full 17 components, where PCR is equivalent to least squares), so 10 components were chosen. With 10 components, roughly 92.81% of the variance in the predictors and 85.22% of the variance in Apps is explained.
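To check the choice of components against the cross-validation numbers rather than the plot alone, the CV RMSEP values can be pulled directly from the fit (a quick sketch using RMSEP() from the pls package); the strict minimum sits at the full 17 components, while the curve is nearly flat past roughly 10:
pcr.cv = RMSEP(pcr.fit, estimate = "CV")      # cross-validated RMSEP for 0 to 17 components
which.min(pcr.cv$val["CV", 1, ]) - 1          # component count at the strict CV minimum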
pcr.pred=predict(pcr.fit, test, ncomp=10)
pcr.err=mean((pcr.pred-test$Apps)^2)
pcr.err
## [1] 1952693
As seen above, the PCR model does not outperform any of the previous models. This may be related to the number of components used: with this technique, the more principal components we include, the lower the bias but the higher the variance.
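To see that tradeoff on the held-out data directly, a short sketch (using the existing pcr.fit and test split) computes the test MSE for each possible number of components:
pcr.test.err = sapply(1:17, function(m) mean((predict(pcr.fit, test, ncomp = m) - test$Apps)^2))
plot(pcr.test.err, type = "b", xlab = "Number of components", ylab = "Test MSE")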
pls.fit = plsr(Apps~., data=train, scale=TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")
summary(pls.fit)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3862 1916 1708 1496 1415 1261 1181
## adjCV 3862 1912 1706 1489 1396 1237 1170
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1166 1152 1146 1144 1144 1140 1138
## adjCV 1157 1144 1137 1135 1135 1131 1129
## 14 comps 15 comps 16 comps 17 comps
## CV 1137 1137 1137 1137
## adjCV 1129 1129 1129 1129
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.67 47.09 62.54 65.0 67.54 72.28 76.80 80.63
## Apps 76.80 82.71 87.20 90.8 92.79 93.05 93.14 93.22
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.71 85.53 88.01 90.95 93.07 95.18 96.86
## Apps 93.30 93.32 93.34 93.35 93.36 93.36 93.36
## 16 comps 17 comps
## X 98.00 100.00
## Apps 93.36 93.36
Looking at the summary above, the cross-validated error levels off at around 7 components, so 7 components were used; at that point 76.80% of the variance in the predictors and 93.14% of the variance in Apps is explained.
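As with PCR, the component choice can be sanity-checked against the cross-validation results; selectNcomp() from the pls package suggests a parsimonious number of components via the one-sigma rule (a rough check, not part of the original analysis):
selectNcomp(pls.fit, method = "onesigma", plot = TRUE)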
pls.pred = predict(pls.fit, test, ncomp = 7)
pls.err = mean((pls.pred-test$Apps)^2)
pls.err
## [1] 1333314
The PLS model comes close to the ridge model here, but does not outperform it.
avg.apps = mean(test$Apps)
lm.r2=1-mean((lm.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
ridge.r2 =1-mean((ridge.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
lasso.r2=1-mean((lasso.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pcr.r2=1-mean((pcr.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
pls.r2=1-mean((pls.pred-test$Apps)^2)/mean((avg.apps-test$Apps)^2)
barplot(c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2),
        xlab = "Regression Methods", ylab = "Test R-Squared",
        main = "Test R-Squared for all Methods",
        names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"))
As the chart above shows, all of the methods achieve a high test R-squared; PCR is the least accurate of the five.
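To make the comparison in the bar plot concrete, the test R-squared values can also be printed side by side:
round(data.frame(OLS = lm.r2, Ridge = ridge.r2, Lasso = lasso.r2, PCR = pcr.r2, PLS = pls.r2), 4)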
set.seed(1)
library(MASS)
data(Boston)
attach(Boston)
library(leaps)
# Prediction method for regsubsets objects (leaps does not provide one)
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])   # recover the model formula from the original call
  mat = model.matrix(form, newdata)     # build the design matrix for the new data
  coefi = coef(object, id = id)         # coefficients of the id-variable model
  mat[, names(coefi)] %*% coefi         # predictions
}
k = 10
p = ncol(Boston) - 1
folds = sample(rep(1:k, length = nrow(Boston)))   # assign each observation to one of k folds
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
  best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
  for (j in 1:p) {
    pred = predict(best.fit, Boston[folds == i, ], id = j)
    cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
  }
}
rmse.cv = sqrt(apply(cv.errors, 2, mean))
plot(rmse.cv, pch = 19, type = "b")
which.min(rmse.cv)
## [1] 9
bsm.err=(rmse.cv[which.min(rmse.cv)])^2
bsm.err
## [1] 42.81453
boston.x=model.matrix(crim~., data=Boston)[,-1]
boston.y=Boston$crim
boston.lasso=cv.glmnet(boston.x,boston.y,alpha=1, type.measure = "mse")
plot(boston.lasso)
lasso.err=(boston.lasso$cvm[boston.lasso$lambda==boston.lasso$lambda.1se])
lasso.err
## [1] 62.74783
boston.ridge=cv.glmnet(boston.x, boston.y, type.measure = "mse", alpha=0)
plot(boston.ridge)
ridge.err=boston.ridge$cvm[boston.ridge$lambda==boston.ridge$lambda.1se]
ridge.err
## [1] 58.8156
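Both Boston fits above report the CV error at lambda.1se, the largest lambda whose error is within one standard error of the minimum; for reference, the error at lambda.min can be pulled from the same objects (a quick sketch):
boston.lasso$cvm[boston.lasso$lambda == boston.lasso$lambda.min]   # lasso CV MSE at lambda.min
boston.ridge$cvm[boston.ridge$lambda == boston.ridge$lambda.min]   # ridge CV MSE at lambda.min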
boston.pcr=pcr(crim~., data=Boston, scale=TRUE, validation="CV")
summary(boston.pcr)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.175 7.180 6.724 6.731 6.727 6.727
## adjCV 8.61 7.174 7.179 6.721 6.725 6.724 6.724
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.722 6.614 6.618 6.607 6.598 6.553 6.488
## adjCV 6.718 6.609 6.613 6.602 6.592 6.546 6.481
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
As tested above, the model with the lowest cross-validated error is the best subset selection model, with an MSE of 42.81453.
This model does not include all of the variables, since the method drops the predictors that contribute little. The selected subset contains 9 variables: zn, indus, nox, dis, rad, ptratio, black, lstat, and medv.
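To list those variables explicitly, a quick sketch refits best subset selection on the full Boston data and extracts the coefficients of the 9-variable model:
best.full = regsubsets(crim ~ ., data = Boston, nvmax = p)
coef(best.full, id = 9)    # coefficients of the selected 9-variable model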