suppressMessages(library(ISLR))
suppressMessages(library(MASS))
suppressMessages(library(glmnet))
suppressMessages(library(Amelia))
suppressMessages(library(pls))
suppressMessages(library(leaps))
When it comes to regression, the standard linear model is used to describe the relationship between a response Y and a set of predictors X1, X2, …, Xp. In the real world most relationships are not exactly linear, but the linear model still has clear advantages for inference and, on many real-world problems, can be surprisingly competitive with non-linear methods. Chapter 6 of ISLR discusses ways in which the simple linear model can be improved by replacing least squares fitting with alternative fitting procedures. The following are a few exercises from this chapter.
#### Question 2
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
RESPONSE: iii. is correct. The lasso is a less flexible model than least squares because it places a budget constraint on the coefficients: we look for the coefficient estimates that give the smallest RSS, subject to the constraint that the coefficients stay within a budget s on their total size. Equivalently, as the tuning parameter λ increases, variance decreases while bias increases, so prediction accuracy improves whenever the increase in bias is less than the decrease in variance.
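To make the shrinkage idea concrete, here is a quick illustrative sketch on simulated data (my own toy example, not part of the exercise): as λ grows the lasso pushes more and more coefficient estimates exactly to zero.
set.seed(1)
n <- 100; p <- 10
x.sim <- matrix(rnorm(n * p), n, p)
y.sim <- 3 * x.sim[, 1] - 2 * x.sim[, 2] + rnorm(n) # only the first two predictors matter
lasso.sim <- glmnet(x.sim, y.sim, alpha = 1)
# count the non-zero coefficients (excluding the intercept) at a small, medium and large lambda
sapply(c(0.01, 0.5, 2), function(l) sum(as.numeric(coef(lasso.sim, s = l))[-1] != 0))
plot(lasso.sim, xvar = "lambda", label = TRUE) # coefficient paths shrinking toward zero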
(b) Repeat (a) for ridge regression relative to least squares.
RESPONSE: iii. is correct here as well, because ridge regression's advantage over least squares also comes from the bias-variance trade-off. As the shrinkage penalty λ increases, ridge regression becomes less flexible, producing a substantial decrease in variance with only a small increase in bias. When the number of predictors p is almost as large as the number of observations n, the least squares estimates can have a lot of variance, and if p > n least squares does not even have a unique solution, whereas ridge regression can still perform well by trading a small increase in bias for a large decrease in variance. Ridge regression is more biased because it shrinks the coefficients, especially those of predictors that do not have a strong relationship with the response, and its prediction accuracy is improved when that increase in bias is less than the decrease in variance.
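To illustrate the p > n point (again a toy simulation of my own, not from the text): least squares cannot produce a unique solution when p > n, while ridge regression still returns stable, shrunken estimates.
set.seed(1)
n <- 50; p <- 100 # more predictors than observations
x.sim <- matrix(rnorm(n * p), n, p)
y.sim <- x.sim[, 1] + rnorm(n)
ols.sim <- lm(y.sim ~ x.sim)
sum(is.na(coef(ols.sim))) # many coefficients cannot be estimated at all (NA)
ridge.sim <- glmnet(x.sim, y.sim, alpha = 0) # ridge is still well defined
max(abs(as.numeric(coef(ridge.sim, s = 1)))) # finite, shrunken estimates at lambda = 1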
(c) Repeat (a) for non-linear methods relative to least squares.
RESPONSE: ii. is correct, because non-linear methods are more flexible and follow the observations more closely than least squares does; they give better prediction accuracy when their increase in variance is less than their decrease in bias.
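A quick sketch of the non-linear case (illustrative only): when the true relationship is non-linear, a more flexible fit such as a degree-5 polynomial cuts enough bias to beat the linear fit on held-out data.
set.seed(1)
x.sim <- runif(200, -2, 2)
y.sim <- sin(2 * x.sim) + rnorm(200, sd = 0.3) # a non-linear truth
train.id <- sample(200, 100)
lin.fit <- lm(y.sim ~ x.sim, subset = train.id)
poly.fit <- lm(y.sim ~ poly(x.sim, 5), subset = train.id)
newdat <- data.frame(x.sim = x.sim[-train.id])
mean((y.sim[-train.id] - predict(lin.fit, newdat))^2) # test MSE, linear fit
mean((y.sim[-train.id] - predict(poly.fit, newdat))^2) # test MSE, flexible fit (lower here)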
#### Question 9
In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.
data(College)
missmap(College) # missmap() is useful for ensuring there are no missing values in the data...from library(Amelia)
set.seed(12)
trainID <- sample(nrow(College), nrow(College)*.5)#Random 50/50 split for test and train
train <- College[trainID,]
test <- College[-trainID,]
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
set.seed(12)
fit.lm <- lm(Apps ~ .,data = train)
summary(fit.lm)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2796.0 -388.6 -58.6 298.8 5496.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -340.71990 522.82263 -0.652 0.51500
## PrivateYes -531.76257 181.15740 -2.935 0.00354 **
## Accept 1.40371 0.07011 20.022 < 2e-16 ***
## Enroll -0.52560 0.24701 -2.128 0.03401 *
## Top10perc 44.64763 7.14555 6.248 1.14e-09 ***
## Top25perc -11.97377 5.76813 -2.076 0.03860 *
## F.Undergrad 0.09258 0.04332 2.137 0.03327 *
## P.Undergrad 0.01486 0.05191 0.286 0.77483
## Outstate -0.05728 0.02418 -2.369 0.01836 *
## Room.Board 0.14359 0.05892 2.437 0.01528 *
## Books 0.03323 0.30707 0.108 0.91388
## Personal -0.01028 0.08033 -0.128 0.89828
## PhD -7.32657 6.00842 -1.219 0.22348
## Terminal -5.21514 6.56187 -0.795 0.42726
## S.F.Ratio 6.69220 17.07931 0.392 0.69541
## perc.alumni -5.85709 5.32647 -1.100 0.27221
## Expend 0.08821 0.01929 4.572 6.59e-06 ***
## Grad.Rate 8.09763 3.48744 2.322 0.02078 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 904.1 on 370 degrees of freedom
## Multiple R-squared: 0.9352, Adjusted R-squared: 0.9322
## F-statistic: 313.9 on 17 and 370 DF, p-value: < 2.2e-16
lm.pred <- predict(fit.lm,test)
lm.error <- mean((test$Apps-lm.pred)^2)
lm.error
## [1] 1416217
(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
# to perform a ridge regression we must first create a matrix x and vector y for the training and testing data
set.seed(12)
train.x <- model.matrix(Apps ~ ., data = train)
test.x <- model.matrix(Apps ~., data = test)
train.y <- train$Apps
test.y <- test$Apps
grid <- 10 ^ seq(4,-2, length = 100) # the glmnet() function will be implemented over a grid defined by these ranges for λ
fit.ridge <- glmnet(train.x, train.y, alpha = 0, lambda = grid, thresh = 1e-12)
cv.ridge <- cv.glmnet(train.x, train.y, alpha = 0, lambda = grid, thresh = 1e-12)
plot(cv.ridge)
best.lambda1 <- cv.ridge$lambda.min # here I am getting the λ chosen by cross-validation
best.lambda1
## [1] 18.73817
ridge.pred <- predict(fit.ridge, s = best.lambda1, newx = test.x)
ridge.error <- mean((ridge.pred - test.y)^2)
print(ridge.error) #test error obtained
## [1] 1470483
(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
set.seed(12)
fit.lasso <- glmnet(train.x, train.y, alpha =1, lambda= grid, thresh = 1e-12)
cv.lasso <- cv.glmnet(train.x, train.y, alpha =1, lambda = grid, thresh = 1e-12)
bestlambda2 <- cv.lasso$lambda.min
print(bestlambda2)
## [1] 16.29751
lasso.pred <- predict(fit.lasso, s = bestlambda2, newx = test.x)
lasso.error <- mean((lasso.pred - test.y)^2)
print(lasso.error)
## [1] 1488722
coeff.lasso <- predict(fit.lasso, type = "coefficients", s = bestlambda2)[1:18,] # coefficient estimates for the lasso fit; four of them (Enroll, Books, Personal, S.F.Ratio) are shrunk exactly to 0
print(coeff.lasso)
## (Intercept) (Intercept) PrivateYes Accept Enroll
## -4.849043e+02 0.000000e+00 -5.417953e+02 1.313305e+00 0.000000e+00
## Top10perc Top25perc F.Undergrad P.Undergrad Outstate
## 3.502755e+01 -4.129574e+00 3.150119e-02 4.604656e-03 -3.515171e-02
## Room.Board Books Personal PhD Terminal
## 1.313325e-01 0.000000e+00 0.000000e+00 -5.279551e+00 -5.038469e+00
## S.F.Ratio perc.alumni Expend
## 0.000000e+00 -6.894089e+00 7.823714e-02
coeff.lasso[coeff.lasso!=0] # these are the non-zero coefficients
## (Intercept) PrivateYes Accept Top10perc Top25perc
## -4.849043e+02 -5.417953e+02 1.313305e+00 3.502755e+01 -4.129574e+00
## F.Undergrad P.Undergrad Outstate Room.Board PhD
## 3.150119e-02 4.604656e-03 -3.515171e-02 1.313325e-01 -5.279551e+00
## Terminal perc.alumni Expend
## -5.038469e+00 -6.894089e+00 7.823714e-02
(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(12)
fit.pcr <- pcr(Apps~., data = train, scale = TRUE, validation ="CV")
summary(fit.pcr)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3476 3384 1665 1568 1326 1300 1194
## adjCV 3476 3388 1663 1568 1308 1294 1191
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1166 1154 1120 1111 1115 1112 1116
## adjCV 1163 1152 1116 1108 1113 1109 1113
## 14 comps 15 comps 16 comps 17 comps
## CV 1116 1102 963.2 970.9
## adjCV 1113 1101 959.5 966.0
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.609 59.82 66.17 71.90 77.23 81.95 85.39 88.45
## Apps 7.327 77.33 81.18 86.64 86.72 88.87 89.35 89.65
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 91.20 93.61 95.63 97.22 98.27 99.09 99.52
## Apps 90.38 90.47 90.49 90.66 90.66 90.67 90.96
## 16 comps 17 comps
## X 99.85 100.00
## Apps 93.28 93.52
validationplot(fit.pcr, val.type = "MSEP")
pcr.pred <- predict(fit.pcr,test,ncomp = 16)
pcr.error<- mean((test.y-pcr.pred)^2) # Principal Components Regression test error
pcr.error
## [1] 1483940
RESPONSE: From the summary above, the lowest cross-validated error (adjCV) occurs at M = 16 components, giving a test error of 1483940.
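If we want to pick M programmatically rather than reading it off the table, something along these lines should work (a small sketch using the RMSEP() accessor from the pls package):
cv.rmsep <- RMSEP(fit.pcr, estimate = "adjCV")$val[1, 1, ] # adjCV RMSEP; the first entry is the intercept-only model
which.min(cv.rmsep) - 1 # subtract 1 for the intercept column; gives M = 16 here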
(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
fit.pls <- plsr(Apps~., data =train, scale = TRUE, validation = "CV")
validationplot(fit.pls, val.type = "MSEP")
summary(fit.pls)
## Data: X dimension: 388 17
## Y dimension: 388 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3476 1532 1265 1110 1064 1018 969.8
## adjCV 3476 1528 1265 1107 1061 1015 965.6
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 963.0 955.7 954.9 951.3 951.8 952.0 952.2
## adjCV 958.7 952.0 951.2 947.6 948.0 948.2 948.3
## 14 comps 15 comps 16 comps 17 comps
## CV 951.5 950.4 950.0 949.8
## adjCV 947.7 946.7 946.3 946.2
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 27.77 46.85 65.42 69.25 72.60 75.30 79.24 81.85
## Apps 81.82 87.66 90.76 91.63 92.49 93.25 93.41 93.46
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 84.51 86.78 90.24 92.49 93.91 95.98 97.11
## Apps 93.47 93.49 93.51 93.51 93.51 93.52 93.52
## 16 comps 17 comps
## X 98.95 100.00
## Apps 93.52 93.52
pls.pred <- predict(fit.pls, test, ncomp = 10) # M = 10 chosen: the CV error is essentially flat beyond 10 components
pls.error <- mean((test.y - pls.pred)^2) # Partial Least Squares test error
pls.error
## [1] 1419159
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
all.error <- c(lm.error, ridge.error, lasso.error, pcr.error, pls.error)
names(all.error) <- c("lm","ridge","lasso","pcr","pls")
testerror.plot <- barplot(all.error, col = 'orange')
RESPONSE: The bar chart above visualizes the test MSE of the different models. They are all similar, with the linear model and PLS appearing to have the lowest test error, as the sorted errors below confirm.
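To put numbers on that comparison, the error vector can simply be sorted:
round(sort(all.error)) # test MSEs from smallest to largest: lm and pls come out lowest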
# R-squared
test.avg <- mean(test.y)
lm.r2 <- 1 - mean((lm.pred - test.y)^2) / mean((test.avg - test.y)^2)
lm.r2
## [1] 0.9207656
ridge.r2 <- 1 - mean((ridge.pred - test.y)^2) / mean((test.avg - test.y)^2)
ridge.r2
## [1] 0.9177295
lasso.r2 <- 1 - mean((lasso.pred - test.y)^2) / mean((test.avg - test.y)^2)
lasso.r2
## [1] 0.9167091
pcr.r2 <- 1 - mean((pcr.pred - test.y)^2) / mean((test.avg - test.y)^2)
pcr.r2
## [1] 0.9169766
pls.r2 <- 1 - mean((pls.pred - test.y)^2) / mean((test.avg - test.y)^2)
pls.r2
## [1] 0.920601
RESPONSE: Above are the R-squared calculations, which show that all of these models explain the variance of the response quite accurately, all above 90%. It is a close call, but there is not much difference between the models, with PLS and the linear model coming in almost identical.
#### Question 11
We will now try to predict per capita crime rate in the Boston data set.
(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
# I like to begin by exploring the data and ensuring there are no missing values
attach(Boston)
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
missmap(Boston)
#best subset
set.seed(12)
data(Boston)
# helper: a predict() method for regsubsets objects, since leaps does not supply one
predict.regsubsets <- function(object, newdata, id, ...) {
form <- as.formula(object$call[[2]])
mat <- model.matrix(form, newdata)
coefi <- coef(object, id = id)
xvars <- names(coefi)
mat[, xvars] %*% coefi
}
k <- 10 # number of cross-validation folds
folds <- sample(1:k, nrow(Boston), replace = TRUE)
cv.errors <- matrix(NA, k, 13, dimnames = list(NULL, paste(1:13)))
for (j in 1:k) {
best.fit <- regsubsets(crim ~ ., data = Boston[folds != j, ], nvmax = 13)
for (i in 1:13) {
pred <- predict(best.fit, Boston[folds == j, ], id = i)
cv.errors[j, i] <- mean((Boston$crim[folds == j] - pred)^2)
}
}
mean.cv.errors <- apply(cv.errors, 2, mean)
plot(mean.cv.errors, type = "b", xlab = "Number of variables", ylab = "CV error")
mean.cv.errors[which.min(mean.cv.errors)]
## 12
## 45.33533
RESPONSE: The lowest CV error comes at 12 variables, with an error of about 45.33.
bestfit.full <- regsubsets(crim ~., data = Boston, nvmax = 13)
coef(bestfit.full, 12)
## (Intercept) zn indus chas nox
## 16.985713928 0.044673247 -0.063848469 -0.744367726 -10.202169211
## rm dis rad tax ptratio
## 0.439588002 -0.993556631 0.587660185 -0.003767546 -0.269948860
## black lstat medv
## -0.007518904 0.128120290 -0.198877768
#Lasso
set.seed(1)
train <- sample(506, 253) # 50/50 split of the 506 Boston observations
test <- -train
Boston.train <- Boston[train, ]
Boston.test <- Boston[test,]
x_train <- model.matrix(crim~., data = Boston.train)[,-1]
y_train <- Boston.train$crim
x_test <- model.matrix(crim~., data = Boston.test)[,-1]
y_test <- Boston.test$crim
grid <- 10^seq(10, -2, length = 100)
lasso.fit <- cv.glmnet(x_train, y_train, alpha = 1, lambda = grid, thresh = 1e-12)
plot(lasso.fit)
bestlam <- lasso.fit$lambda.min
bestlam
## [1] 0.07054802
lasso.pred <- predict(lasso.fit, newx = x_test, s = bestlam)
mean((y_test - lasso.pred)^2)
## [1] 40.88312
x <- model.matrix(crim ~., data = Boston)[,-1]
y <- Boston$crim
mod.lasso <- glmnet(x, y, alpha = 1, lambda = grid)
lasso.coeff <- predict(mod.lasso, s = bestlam, type = "coefficients")[1:14,]
lasso.coeff[lasso.coeff != 0]
## (Intercept) zn indus chas nox rm
## 11.377041716 0.034373111 -0.062605969 -0.555023861 -5.721122105 0.151927338
## dis rad ptratio black lstat medv
## -0.715085799 0.506621375 -0.156825685 -0.007550409 0.122010397 -0.145585637
#ridge regression
set.seed(12)
ridge.fit <- cv.glmnet(x_train, y_train, alpha = 0, lambda = grid, thresh = 1e-12)
plot(ridge.fit)
bestlam <- ridge.fit$lambda.min
bestlam
## [1] 0.2848036
ridge.pred <- predict(ridge.fit, newx = x_test, s = bestlam)
mean((y_test - ridge.pred)^2)
## [1] 41.10444
ridgefull.fit <- glmnet(x, y, alpha = 0, lambda = grid)
ridge.coeff <- predict(ridgefull.fit, s = bestlam, type = "coefficients")[1:14,]
ridge.coeff
## (Intercept) zn indus chas nox rm
## 11.805024789 0.036910120 -0.082474528 -0.727909961 -7.185946536 0.380200039
## age dis rad tax ptratio black
## 0.001381276 -0.809468723 0.476201825 0.001352865 -0.185334855 -0.008163851
## lstat medv
## 0.139994168 -0.160079863
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
RESPONSE: Judging by these estimates of test error, the lasso performs best, with a validation-set MSE of about 40.9, although it is not much different than ridge regression at about 41.1; the 12-variable best subset model has a 10-fold CV error of about 45.3. See the comparison below.
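For a side-by-side view, here is a small summary using the error estimates already computed above (note the best subset number is a 10-fold CV error on the full data set, while the lasso and ridge numbers are validation-set MSEs):
boston.errors <- c(best.subset = min(mean.cv.errors),
                   lasso = mean((y_test - lasso.pred)^2),
                   ridge = mean((y_test - ridge.pred)^2))
round(boston.errors, 2)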
(c) Does your chosen model involve all of the features in the data set? Why or why not?
RESPONSE: No, two variables, age and tax, were left out, because at the best value of λ the lasso shrinks their coefficients exactly to 0.
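The dropped predictors can be read directly off the lasso coefficients extracted above:
names(lasso.coeff)[lasso.coeff == 0] # the lasso shrinks age and tax exactly to zero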