Results

library(ISLR)
library(caret)
library(pls)
library(glmnet)

Q2

a. The lasso relative to least squares is:

The correct answer would be: less flexible, and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

This is because when least squares has excessively high variance, the lasso solution trades a small increase in bias for a reduction in variance, leading to more accurate predictions.

b. Ridge regression relative to least squares is:

The correct answer would be: less flexible, and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

This is because a small change in the training data can produce a large change in the least squares coefficient estimates. Ridge regression, by contrast, performs well by trading a small increase in bias for a larger decrease in variance.

c. Non-linear methods, relative to least squares, are:

More flexible, and hence will give improved prediction accuracy when their increase in variance is less than their decrease in bias.
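To make the trade-off concrete, here is a minimal simulation sketch (not part of the assignment; n, p, and the coefficients are illustrative assumptions): with p close to n, least squares has high variance, and ridge's small bias buys a large variance reduction in test MSE.

library(glmnet)
set.seed(1)
n = 60; p = 50
beta = rnorm(p, sd = 0.3)
mse.ols = mse.ridge = numeric(20)
for (r in 1:20) {
  x = matrix(rnorm(n * p), n, p)          # training design, p close to n
  y = drop(x %*% beta + rnorm(n))
  xt = matrix(rnorm(1000 * p), 1000, p)   # large test set
  yt = drop(xt %*% beta + rnorm(1000))
  ols = lm(y ~ x)
  mse.ols[r] = mean((yt - drop(cbind(1, xt) %*% coef(ols)))^2)
  ridge = cv.glmnet(x, y, alpha = 0)      # lambda chosen by CV
  mse.ridge[r] = mean((yt - drop(predict(ridge, newx = xt, s = "lambda.min")))^2)
}
c(OLS = mean(mse.ols), Ridge = mean(mse.ridge))  # ridge should come out lower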

Q9

a. Split the data into training and testing sets

library(ISLR)
data(College)
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...
set.seed(1)
# 75/25 train/test split
trainingindex = sample(nrow(College), 0.75 * nrow(College))
train = College[trainingindex,]
test = College[-trainingindex,]
dim(train)
## [1] 582  18
dim(test)
## [1] 195  18
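
Since caret is loaded at the top, an alternative (not used here) would be a stratified split; a sketch with createDataPartition, which stratifies on the response:

# set.seed(1)
# part.index = createDataPartition(College$Apps, p = 0.75, list = FALSE)
# train = College[part.index, ]; test = College[-part.index, ]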

b. Fit a linear model using least squares on the training set, and report the test error obtained

lm.fit = lm(Apps ~., data = train)
summary(lm.fit)
## 
## Call:
## lm(formula = Apps ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5773.1  -425.2     4.5   327.9  7496.3 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.784e+02  4.707e+02  -1.229  0.21962    
## PrivateYes  -4.673e+02  1.571e+02  -2.975  0.00305 ** 
## Accept       1.712e+00  4.567e-02  37.497  < 2e-16 ***
## Enroll      -1.197e+00  2.151e-01  -5.564 4.08e-08 ***
## Top10perc    5.298e+01  6.158e+00   8.603  < 2e-16 ***
## Top25perc   -1.528e+01  4.866e+00  -3.141  0.00177 ** 
## F.Undergrad  7.085e-02  3.760e-02   1.884  0.06002 .  
## P.Undergrad  5.771e-02  3.530e-02   1.635  0.10266    
## Outstate    -8.143e-02  2.077e-02  -3.920 9.95e-05 ***
## Room.Board   1.609e-01  5.361e-02   3.002  0.00280 ** 
## Books        2.338e-01  2.634e-01   0.887  0.37521    
## Personal     6.611e-03  6.850e-02   0.097  0.92315    
## PhD         -1.114e+01  5.149e+00  -2.163  0.03093 *  
## Terminal     9.186e-01  5.709e+00   0.161  0.87223    
## S.F.Ratio    1.689e+01  1.542e+01   1.096  0.27368    
## perc.alumni  2.256e+00  4.635e+00   0.487  0.62667    
## Expend       5.567e-02  1.300e-02   4.284 2.16e-05 ***
## Grad.Rate    6.427e+00  3.307e+00   1.944  0.05243 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1009 on 564 degrees of freedom
## Multiple R-squared:  0.9336, Adjusted R-squared:  0.9316 
## F-statistic: 466.7 on 17 and 564 DF,  p-value: < 2.2e-16
lm.pred = predict(lm.fit, newdata = test)
lm.err = mean((test$Apps-lm.pred)^2)
lm.err
## [1] 1384604

The test MSE obtained from my model is 1384604. The R-squared for my model is 93.36%, which tells me the model explains 93.36% of the variance in Apps on the training data.
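
To put that error on the scale of the response, take the square root of the test MSE:

sqrt(lm.err)  # roughly 1177, i.e. a typical prediction error of about 1,177 applications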

c. Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

# Design matrices for glmnet. Note train[,-1] drops the Private column, and
# model.matrix's intercept column is retained, which is why a second
# "(Intercept)" row appears among the lasso coefficients in part (d).
xtrain = model.matrix(Apps ~ ., data = train[,-1])
ytrain = train$Apps
xtest = model.matrix(Apps ~ ., data = test[,-1])
ytest = test$Apps
set.seed(1)
library(glmnet)
ridge.fit = cv.glmnet(xtrain, ytrain, alpha=0)
plot(ridge.fit)

ridge.lambda = ridge.fit$lambda.min
ridge.lambda
## [1] 364.6228
ridge.pred = predict(ridge.fit, s = ridge.lambda, newx = xtest)
ridge.err = mean((ridge.pred - ytest)^2)
ridge.err
## [1] 1260111

As we can observe above, the ridge regression test error is 1260111, noticeably lower than the linear model's 1384604. This is because ridge regression typically predicts better when OLS has high variance, i.e. when small changes in the training data produce large changes in the least squares fit.
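
For reference, ridge regression minimizes RSS + λΣβj², which shrinks coefficients toward zero but never exactly to zero, whereas the lasso in part (d) uses the penalty λΣ|βj| and can zero coefficients out entirely.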

d. Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

set.seed(1)
lasso.fit = cv.glmnet(xtrain, ytrain, alpha = 1)
plot(lasso.fit)

lasso.lambda = lasso.fit$lambda.min
lasso.lambda
## [1] 1.945882
lasso.pred = predict(lasso.fit, s=lasso.lambda, newx = xtest)
lasso.err = mean((lasso.pred-ytest)^2)
lasso.err
## [1] 1394834

Reported above is the test error from the lasso model, 1394834. The selected λ is small (1.95), so the lasso fit stays close to least squares, which is why its test error is similar to the OLS error.

lasso.coeff=predict(lasso.fit, type = "coefficients", s = lasso.lambda)[1:18,]
lasso.coeff
##   (Intercept)   (Intercept)        Accept        Enroll     Top10perc 
## -1.030320e+03  0.000000e+00  1.693638e+00 -1.100616e+00  5.104012e+01 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
## -1.369969e+01  7.851307e-02  6.036558e-02 -9.861002e-02  1.475536e-01 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##  2.185146e-01  0.000000e+00 -8.390364e+00  1.349648e+00  2.457249e+01 
##   perc.alumni        Expend     Grad.Rate 
##  7.684156e-01  5.858007e-02  5.324356e+00

Reported above are the coefficient estimates; only Personal (along with the redundant second intercept column) is shrunk exactly to zero, leaving 15 non-zero predictor coefficients.
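
The count can also be checked programmatically (dropping the two intercept rows):

sum(lasso.coeff[-c(1, 2)] != 0)  # 15: every predictor except Personal is retained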

e. Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
pcr.fit = pcr(Apps~., data = train, scale=TRUE, validation="CV")
validationplot(pcr.fit, val.type = "MSEP")

summary(pcr.fit)
## Data:    X dimension: 582 17 
##  Y dimension: 582 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3862     3773     2100     2106     1812     1678     1674
## adjCV         3862     3772     2097     2104     1800     1666     1668
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1674     1625     1580      1579      1582      1586      1590
## adjCV     1669     1613     1574      1573      1576      1580      1584
##        14 comps  15 comps  16 comps  17 comps
## CV         1585      1480      1189      1123
## adjCV      1580      1463      1179      1115
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.159    57.17    64.41    70.20    75.53    80.48    84.24    87.56
## Apps    5.226    71.83    71.84    80.02    83.01    83.07    83.21    84.46
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.54     92.81     94.92     96.73     97.81     98.69     99.35
## Apps    85.00     85.22     85.22     85.23     85.36     85.45     89.93
##       16 comps  17 comps
## X        99.82    100.00
## Apps     92.84     93.36

From the chart above, the CV error reaches a local minimum at 10 components (the global minimum at 17 components is just the full least squares fit), so M = 10 is chosen. With 10 components, 92.81% of the variance in the predictors and 85.22% of the variance in the response is explained.
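
The component count can also be pulled programmatically from the cross-validation results; a sketch (pls also offers selectNcomp() for more principled one-sigma or randomization rules):

# smallest CV RMSEP; subtract 1 because the first entry is the intercept-only model
which.min(RMSEP(pcr.fit)$val["CV", , ]) - 1  # 17 here (full least squares); 10 is the elbow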

pcr.pred=predict(pcr.fit, test, ncomp=10)
pcr.err=mean((pcr.pred-test$Apps)^2)
pcr.err
## [1] 1952693

As seen above, the PCR model does not outperform any of the previous models, even though it uses a fairly large number of components. With this technique, the more principal components used, the lower the bias but the higher the variance.

f. Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

pls.fit = plsr(Apps~., data=train, scale=TRUE, validation = "CV")
validationplot(pls.fit, val.type = "MSEP")

summary(pls.fit)
## Data:    X dimension: 582 17 
##  Y dimension: 582 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3862     1916     1708     1496     1415     1261     1181
## adjCV         3862     1912     1706     1489     1396     1237     1170
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1166     1152     1146      1144      1144      1140      1138
## adjCV     1157     1144     1137      1135      1135      1131      1129
##        14 comps  15 comps  16 comps  17 comps
## CV         1137      1137      1137      1137
## adjCV      1129      1129      1129      1129
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.67    47.09    62.54     65.0    67.54    72.28    76.80    80.63
## Apps    76.80    82.71    87.20     90.8    92.79    93.05    93.14    93.22
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.71     85.53     88.01     90.95     93.07     95.18     96.86
## Apps    93.30     93.32     93.34     93.35     93.36     93.36     93.36
##       16 comps  17 comps
## X        98.00    100.00
## Apps     93.36     93.36

Looking at the summary above, the CV error has essentially flattened out by 7 components, so M = 7 is chosen; at that point the components explain 76.80% of the variance in the predictors and 93.14% of the variance in the response.

pls.pred = predict(pls.fit, test, ncomp = 7)
pls.err = mean((pls.pred-test$Apps)^2)
pls.err
## [1] 1333314

The PLS model comes close to the ridge model's test error but does not beat it.

g. Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

# Test R-squared: 1 - MSE(model) / MSE of simply predicting the test mean
avg.apps = mean(test$Apps)
lm.r2 = 1 - mean((lm.pred - test$Apps)^2) / mean((avg.apps - test$Apps)^2)
ridge.r2 = 1 - mean((ridge.pred - test$Apps)^2) / mean((avg.apps - test$Apps)^2)
lasso.r2 = 1 - mean((lasso.pred - test$Apps)^2) / mean((avg.apps - test$Apps)^2)
pcr.r2 = 1 - mean((pcr.pred - test$Apps)^2) / mean((avg.apps - test$Apps)^2)
pls.r2 = 1 - mean((pls.pred - test$Apps)^2) / mean((avg.apps - test$Apps)^2)
barplot(c(lm.r2, ridge.r2, lasso.r2, pcr.r2, pls.r2),
        xlab = "Regression Methods", ylab = "Test R-Squared",
        main = "Test R-Squared for all Methods",
        names.arg = c("OLS", "Ridge", "Lasso", "PCR", "PLS"))

As seen in the chart above, all five methods achieve a high test R-squared; PCR is noticeably the lowest.
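
The raw test MSEs tell the same story:

round(c(OLS = lm.err, Ridge = ridge.err, Lasso = lasso.err, PCR = pcr.err, PLS = pls.err))
# ridge is lowest (1260111) and PCR highest (1952693)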

Q11

a. Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

set.seed(1)
library(MASS)
data(Boston)
attach(Boston)

Subset selection

library(leaps)
# regsubsets() has no predict() method, so define one: rebuild the design
# matrix from the stored formula and multiply by the size-id coefficients
predict.regsubsets = function(object, newdata, id, ...) {
    form = as.formula(object$call[[2]])
    mat = model.matrix(form, newdata)
    coefi = coef(object, id = id)
    mat[, names(coefi)] %*% coefi
}

k = 10
p = ncol(Boston) - 1
# randomly assign each observation to one of k folds
folds = sample(rep(1:k, length = nrow(Boston)))
cv.errors = matrix(NA, k, p)
for (i in 1:k) {
    best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
    for (j in 1:p) {
        pred = predict(best.fit, Boston[folds == i, ], id = j)
        cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
    }
}
rmse.cv = sqrt(apply(cv.errors, 2, mean))
plot(rmse.cv, pch = 19, type = "b")

which.min(rmse.cv)
## [1] 9
bsm.err=(rmse.cv[which.min(rmse.cv)])^2
bsm.err
## [1] 42.81453

Lasso

boston.x=model.matrix(crim~., data=Boston)[,-1]
boston.y=Boston$crim
boston.lasso=cv.glmnet(boston.x,boston.y,alpha=1, type.measure = "mse")
plot(boston.lasso)

# CV estimate of test MSE at the 1-SE lambda
lasso.err = boston.lasso$cvm[boston.lasso$lambda == boston.lasso$lambda.1se]
lasso.err
## [1] 62.74783

Ridge Regression

boston.ridge=cv.glmnet(boston.x, boston.y, type.measure = "mse", alpha=0)
plot(boston.ridge)

# same criterion for ridge: CV MSE at the 1-SE lambda
ridge.err = boston.ridge$cvm[boston.ridge$lambda == boston.ridge$lambda.1se]
ridge.err
## [1] 58.8156

PCR

boston.pcr=pcr(crim~., data=Boston, scale=TRUE, validation="CV")
summary(boston.pcr)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.175    7.180    6.724    6.731    6.727    6.727
## adjCV         8.61    7.174    7.179    6.721    6.725    6.724    6.724
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.722    6.614    6.618     6.607     6.598     6.553     6.488
## adjCV    6.718    6.609    6.613     6.602     6.592     6.546     6.481
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4

b. Propose a model (or set of models) that seems to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

Of the cross-validated errors computed above, the best subset selection model has the lowest, with an MSE of 42.81.
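
The three cross-validated errors side by side:

round(c(BestSubset = bsm.err, Ridge = ridge.err, Lasso = lasso.err), 2)  # 42.81, 58.82, 62.75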

c. Does your chosen model involve all of the features in the data set? Why or why not?

This model does not include all of the variables, because best subset selection drops predictors that do not improve the cross-validated fit. The chosen model contains only 9 variables: zn, indus, nox, dis, rad, ptratio, black, lstat, and medv.
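
For completeness, a sketch of how to recover that variable list: refit best subset selection on the full data set and extract the size-9 model's coefficients.

best.full = regsubsets(crim ~ ., data = Boston, nvmax = p)
coef(best.full, 9)  # intercept plus the nine variables listed above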