Assignment 5 Chapter 06 (page 259): 2, 9, 11
Question 2(a): The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
The correct choice is (iii). The lasso shrinks the coefficient estimates towards zero and performs variable selection, so models generated by the lasso are generally much easier to interpret than those from methods like ridge regression. When the least squares estimates have excessively high variance, the lasso solution can yield a reduction in variance at the expense of an increase in bias, which can generate more accurate predictions.
(b) Repeat (a) for ridge regression relative to least squares. Ridge regression, relative to least squares, is less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance (again choice iii). Ridge regression has the effect of “shrinking” large values of the β’s towards zero, which can improve the fit because shrinking the coefficients can significantly reduce their variance. The penalty term makes the ridge regression estimates biased, but it can also substantially reduce their variance.
(c) Repeat (a) for non-linear methods relative to least squares. Non-linear methods, relative to least squares, are more flexible and hence will give improved prediction accuracy when their increase in variance is less than their decrease in bias (choice ii).
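To make the contrast in (a) and (b) concrete, here is a minimal standalone sketch on simulated data (not part of the assignment data; the object names are just for illustration, and it loads glmnet, which is also used below). The lasso sets several coefficients exactly to zero, while ridge keeps all of them, merely shrunk towards zero.
library(glmnet)
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))  # only the first three coefficients are truly non-zero
y <- X %*% beta + rnorm(n)
ridge.sim <- cv.glmnet(X, y, alpha = 0)  # alpha = 0 for ridge
lasso.sim <- cv.glmnet(X, y, alpha = 1)  # alpha = 1 for the lasso
cbind(ridge = as.vector(coef(ridge.sim, s = "lambda.min")),
      lasso = as.vector(coef(lasso.sim, s = "lambda.min")))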
library(ISLR)
data(College)
str(College)
## 'data.frame': 777 obs. of 18 variables:
## $ Private : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Apps : num 1660 2186 1428 417 193 ...
## $ Accept : num 1232 1924 1097 349 146 ...
## $ Enroll : num 721 512 336 137 55 158 103 489 227 172 ...
## $ Top10perc : num 23 16 22 60 16 38 17 37 30 21 ...
## $ Top25perc : num 52 29 50 89 44 62 45 68 63 44 ...
## $ F.Undergrad: num 2885 2683 1036 510 249 ...
## $ P.Undergrad: num 537 1227 99 63 869 ...
## $ Outstate : num 7440 12280 11250 12960 7560 ...
## $ Room.Board : num 3300 6450 3750 5450 4120 ...
## $ Books : num 450 750 400 450 800 500 500 450 300 660 ...
## $ Personal : num 2200 1500 1165 875 1500 ...
## $ PhD : num 70 29 53 92 76 67 90 89 79 40 ...
## $ Terminal : num 78 30 66 97 72 73 93 100 84 41 ...
## $ S.F.Ratio : num 18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
## $ perc.alumni: num 12 16 30 37 2 11 26 37 23 15 ...
## $ Expend : num 7041 10527 8735 19016 10922 ...
## $ Grad.Rate : num 60 56 54 59 15 55 63 73 80 52 ...
set.seed(1)
trainingindex<-sample(nrow(College), 0.75*nrow(College))
train<-College[trainingindex, ]
test<-College[-trainingindex, ]
lm.fit <- lm(Apps~.,data=train)
summary(lm.fit)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5773.1 -425.2 4.5 327.9 7496.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.784e+02 4.707e+02 -1.229 0.21962
## PrivateYes -4.673e+02 1.571e+02 -2.975 0.00305 **
## Accept 1.712e+00 4.567e-02 37.497 < 2e-16 ***
## Enroll -1.197e+00 2.151e-01 -5.564 4.08e-08 ***
## Top10perc 5.298e+01 6.158e+00 8.603 < 2e-16 ***
## Top25perc -1.528e+01 4.866e+00 -3.141 0.00177 **
## F.Undergrad 7.085e-02 3.760e-02 1.884 0.06002 .
## P.Undergrad 5.771e-02 3.530e-02 1.635 0.10266
## Outstate -8.143e-02 2.077e-02 -3.920 9.95e-05 ***
## Room.Board 1.609e-01 5.361e-02 3.002 0.00280 **
## Books 2.338e-01 2.634e-01 0.887 0.37521
## Personal 6.611e-03 6.850e-02 0.097 0.92315
## PhD -1.114e+01 5.149e+00 -2.163 0.03093 *
## Terminal 9.186e-01 5.709e+00 0.161 0.87223
## S.F.Ratio 1.689e+01 1.542e+01 1.096 0.27368
## perc.alumni 2.256e+00 4.635e+00 0.487 0.62667
## Expend 5.567e-02 1.300e-02 4.284 2.16e-05 ***
## Grad.Rate 6.427e+00 3.307e+00 1.944 0.05243 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1009 on 564 degrees of freedom
## Multiple R-squared: 0.9336, Adjusted R-squared: 0.9316
## F-statistic: 466.7 on 17 and 564 DF, p-value: < 2.2e-16
Next, use the fitted model to predict on the test set and check its accuracy. Below is the test mean squared error (MSE).
lm.pred <- predict(lm.fit, newdata=test)
lm.error <- mean((test$Apps-lm.pred)^2)
lm.error
## [1] 1384604
The test error obtained from the linear model fit on all variables is 1384604. The (training) R² for this model is 93.36%, meaning that 93.36% of the variance in Apps is explained by the model on the training data.
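As a complementary check, here is a short sketch (using only objects created above; lm.test.r2 is just an illustrative name) that computes the test-set R² from the same predictions:
lm.test.r2 <- 1 - sum((test$Apps - lm.pred)^2) / sum((test$Apps - mean(test$Apps))^2)  # 1 - RSS/TSS on the test data
lm.test.r2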
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-2
trainX <- model.matrix(Apps~., data = train[,-1])  # note: train[,-1] drops the Private column before model.matrix is applied
trainY <- train$Apps
testX <- model.matrix(Apps~., data=test[,-1])      # the test matrix is built the same way
testY <- test$Apps
set.seed(1)
grid <- 10^seq(10,-2,length=100)  # candidate lambda grid (not passed below, so cv.glmnet uses its default lambda sequence)
ridge.fit <- cv.glmnet(trainX, trainY, alpha=0)  # alpha = 0 for ridge regression
plot(ridge.fit)
Viewing the plot above, we see that the cross-validated mean squared error increases as log λ increases. Let's find the λ that minimizes this error.
best.lambda <- ridge.fit$lambda.min
best.lambda
## [1] 364.6228
After getting the best λ from cross-validation, I will predict with the ridge regression using the best λ value.
ridge.pred <- predict(ridge.fit, s=best.lambda, newx = testX)
ridge.err <- mean((ridge.pred-testY)^2)
ridge.err
## [1] 1260111
The ridge regression for this dataset has a test error of 1260111, which is better than OLS. Ridge regression keeps all variables in the model, but penalizes the coefficients and shrinks them towards zero. Doing so does not hurt prediction accuracy, but it can make the relationship between the variables and the response harder to interpret, since no variables are removed.
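As a quick sketch of this point, the coefficients at the chosen λ can be pulled from the cv.glmnet object; ridge shrinks them but typically does not set any of them exactly to zero (ridge.coefs is just an illustrative name):
ridge.coefs <- coef(ridge.fit, s = best.lambda)  # coefficients at the chosen lambda
sum(as.vector(ridge.coefs) != 0)                 # count how many coefficients remain non-zero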
set.seed(1)
lasso.fit <- cv.glmnet(trainX, trainY, alpha=1)#alpha = 1 for lasso model
plot(lasso.fit)
best.lamb <- lasso.fit$lambda.min
best.lamb
## [1] 1.945882
After getting the best λ from cross-validation, I will predict with the lasso model using the best λ value.
lasso.pred <- predict(lasso.fit, s=best.lamb, newx = testX)
lasso.err <- mean((lasso.pred-testY)^2)
lasso.err
## [1] 1394834
The lasso model produces a test error higher than ridge regression (and slightly higher than OLS), indicating that it predicts less accurately than ridge regression on this data.
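To see the lasso's variable selection in action, here is a small sketch listing the coefficients it kept at the chosen λ (all other coefficients were set exactly to zero; lasso.kept is an illustrative name):
lasso.coefs <- coef(lasso.fit, s = best.lamb)  # sparse coefficient vector at lambda.min
lasso.kept <- setNames(as.vector(lasso.coefs), rownames(lasso.coefs))
lasso.kept[lasso.kept != 0]                    # only the predictors the lasso retained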
set.seed(1)
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr.fit <- pcr(Apps~.,data = train,scale=TRUE,validation="CV")
summary(pcr.fit)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3862 3767 2088 2097 1815 1682 1675
## adjCV 3862 3766 2085 2096 1807 1669 1670
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1671 1620 1574 1568 1575 1579 1580
## adjCV 1666 1610 1568 1562 1570 1574 1574
## 14 comps 15 comps 16 comps 17 comps
## CV 1575 1502 1204 1134
## adjCV 1571 1486 1193 1126
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.159 57.17 64.41 70.20 75.53 80.48 84.24 87.56
## Apps 5.226 71.83 71.84 80.02 83.01 83.07 83.21 84.46
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.54 92.81 94.92 96.73 97.81 98.69 99.35
## Apps 85.00 85.22 85.22 85.23 85.36 85.45 89.93
## 16 comps 17 comps
## X 99.82 100.00
## Apps 92.84 93.36
validationplot(pcr.fit, val.type = "MSEP")
Viewing the plot above we can see the MSEP decrease as the number of components increases.
Reviewing the summary of our PCR model, we see that the smallest CV score, 1134, is produced with all 17 components. With all 17 components no dimension reduction occurs, so PCR is equivalent to the full least squares fit.
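Rather than reading the minimum off the summary, the cross-validated RMSEP can be extracted from the fitted object directly; a small sketch (the first entry of the RMSEP array corresponds to the intercept-only model, hence the minus one):
pcr.cv <- RMSEP(pcr.fit, estimate = "CV")$val  # cross-validated RMSEP for 0 to 17 components
which.min(pcr.cv) - 1                          # number of components minimizing the CV error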
pcr.pred <- predict(pcr.fit, test, ncomp=17)
pcr.err <- mean((pcr.pred-test$Apps)^2)
pcr.err
## [1] 1384604
Our PCR model has a test error of 1384604, identical to the least squares fit (as expected when all 17 components are used), so it does not outperform our ridge regression model. As more principal components are used in the regression, the bias decreases but the variance increases.
pls.fit <- plsr(Apps~., data=train, scale=TRUE, validation="CV")
validationplot(pls.fit, val.type = "MSEP")
summary(pls.fit)
## Data: X dimension: 582 17
## Y dimension: 582 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3862 1922 1704 1510 1425 1215 1158
## adjCV 3862 1918 1701 1503 1405 1196 1148
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1141 1136 1132 1129 1129 1125 1124
## adjCV 1133 1129 1124 1121 1121 1117 1116
## 14 comps 15 comps 16 comps 17 comps
## CV 1123 1123 1123 1123
## adjCV 1115 1116 1115 1115
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.67 47.09 62.54 65.0 67.54 72.28 76.80 80.63
## Apps 76.80 82.71 87.20 90.8 92.79 93.05 93.14 93.22
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.71 85.53 88.01 90.95 93.07 95.18 96.86
## Apps 93.30 93.32 93.34 93.35 93.36 93.36 93.36
## 16 comps 17 comps
## X 98.00 100.00
## Apps 93.36 93.36
The lowest CV occurs around 14 components.
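The same check can be applied to the PLS fit; a quick sketch (ties between component counts resolve to the first minimum):
pls.cv <- RMSEP(pls.fit, estimate = "CV")$val
which.min(pls.cv) - 1  # number of components with the lowest CV RMSEP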
pls.pred <- predict(pls.fit, test, ncomp = 14)
pls.err <- mean((pls.pred-test$Apps)^2)
pls.err
## [1] 1385374
The PLS model produces a test error of 1385374, a little greater than our PCR model. Like PCR, PLS is a dimension reduction method; as more components are used, the bias decreases but the variance increases.
barplot(c(lm.error, ridge.err, lasso.err, pcr.err, pls.err), col="red", xlab="Regression Methods", ylab="Test Error", main="Test Errors for All Methods", names.arg=c("OLS", "Ridge", "Lasso", "PCR", "PLS"))
The best model was the ridge regression model. The barplot above shows that the test errors are not drastically different across methods, but the ridge model produces the smallest error.
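The same comparison can also be printed as a sorted table, a small convenience sketch using the error objects computed above:
sort(c(OLS = lm.error, Ridge = ridge.err, Lasso = lasso.err, PCR = pcr.err, PLS = pls.err))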
library(MASS)
attach(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
set.seed(1)
train<-sample(1:nrow(Boston), nrow(Boston)/2)
train.boston<-Boston[train,]
test<-(-train)
test.boston<-Boston[test,]
crim.test<-Boston$crim[test]
x.mat.train<-model.matrix(crim~., data=train.boston)[,-1]  # drop the intercept column
x.mat.test<-model.matrix(crim~., data=test.boston)[,-1]
Best Subset Selection
library(leaps)
regfit.best<-regsubsets(crim~.,data=Boston[train,],nvmax=13)
test.mat<-model.matrix(crim~.,data=Boston[test,])
val.errors=rep(NA,13)
for(i in 1:13){
coefi=coef(regfit.best,id=i)
pred=test.mat[,names(coefi)]%*%coefi
val.errors[i]=mean((Boston$crim[test]-pred)^2)
}
val.errors
## [1] 40.14557 41.87706 42.40901 42.45745 43.03836 42.13258 41.25016 41.28957
## [9] 41.49271 41.50577 41.48839 41.46692 41.54639
which.min(val.errors)
## [1] 1
sub.mse<-val.errors[1]
sub.mse
## [1] 40.14557
Best subset selection chooses a 1-variable model based on the validation-set test MSE. With 1 variable, the estimated test MSE is 40.14557, the lowest MSE reported.
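As a quick follow-up sketch, coef() on the regsubsets object shows which single predictor that best one-variable model uses:
coef(regfit.best, id = which.min(val.errors))  # intercept plus the single selected predictor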
Ridge Regression:
library(glmnet)
set.seed(1)
x<-model.matrix(crim~.,Boston)[,-1]
y<-Boston$crim
y.test<-y[test]  # reuse the train/test split created above so the test responses line up with x[test,]
grid<-10^seq(10,-2,length=100)
ridge.mod <- glmnet(x,y,lambda=grid, alpha=0)
cv.out<-cv.glmnet(x[train,],y[train],alpha=0)
plot(cv.out)
bestlam<-cv.out$lambda.min
ridge.pred<-predict(ridge.mod,s=bestlam,newx=x[test,])
ridge.mse<-mean((ridge.pred-y.test)^2)
ridge.mse
## [1] 120.8606
The ridge regression test MSE is 120.8606.
Lasso Model:
grid<-10^seq(10,-2,length=100)
lasso.mod<-glmnet(x,y,lambda=grid, alpha=1)
cv.out<-cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)
bestlam<-cv.out$lambda.min
lasso.pred<-predict(lasso.mod,s=bestlam,newx=x[test,])
lasso.mse<-mean((lasso.pred-y.test)^2)
lasso.mse
## [1] 122.0856
The lasso test MSE is 122.0856.
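A short sketch to inspect which Boston predictors the lasso keeps at the chosen λ (coefficients reported as exactly zero were dropped):
predict(lasso.mod, type="coefficients", s=bestlam)  # sparse coefficient vector at the chosen lambda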
PCR
set.seed(1)
library(pls)
pcr.fit<-pcr(crim~., data=train.boston, scale=TRUE, validation='CV')
summary(pcr.fit)
## Data: X dimension: 253 13
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.275 7.305 7.267 7.254 7.234 7.130 7.115
## adjCV 9.275 7.300 7.259 7.246 7.226 7.121 7.107
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.795 6.763 6.743 6.694 6.694 6.696 6.700
## adjCV 6.782 6.753 6.732 6.682 6.682 6.684 6.686
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 81.27 97.18 99.08 99.72 99.89 99.93 99.97 99.99
## crim 39.26 40.61 40.87 41.25 43.24 43.95 48.88 49.34
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 100.00 100.00 100.00 100.00 100.00
## crim 49.94 50.86 50.93 50.98 51.37
validationplot(pcr.fit,val.type="MSEP")
pcr.pred<-predict(pcr.fit,test.boston,ncomp = 11)
pcr.mse<-mean((pcr.pred-y.test)^2)
pcr.mse
## [1] 131.4856
Reviewing the summary of our PCR model, we see that the smallest CV score, 6.694, is produced with 10 or 11 components. Using 11 components, the test MSE is 131.4856.
As computed above, the method with the lowest estimated test error is best subset selection, which achieved a test MSE of 40.14557 with a single predictor.
This model does not include all of the features in the dataset.