#Kevin Kuipers (Completed by myself)

#3/19/2019

#1. Question 6.8.4 pg 260

The answer choices for each part are: i. Increase initially, and then eventually start decreasing in an inverted U shape. ii. Decrease initially, and then eventually start increasing in a U shape. iii. Steadily increase. iv. Steadily decrease. v. Remain constant.
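
For reference, ridge regression chooses the coefficient estimates that minimize the penalized residual sum of squares; this is the sense in which the \(\beta_{j}\) coefficients are "restricted" in the answers below:

\[
\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = \text{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2
\]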

  1. As we increase \(\lambda\) from 0, the training RSS will (iii.) steadily increase. As \(\lambda\) grows the model becomes less flexible because the \(\beta_{j}\) coefficients are shrunk toward zero, so the fit to the training data worsens and the training RSS steadily increases (a small simulation sketch after this list illustrates this).

  2. As we increase \(\lambda\) from 0, the test RSS will (ii.) decrease initially, and then eventually start increasing in a U shape. As \(\lambda\) grows the model becomes less flexible and the \(\beta_{j}\) coefficients move away from their least squares estimates. At first the reduction in variance outweighs the increase in bias, so the test RSS decreases, but as \(\lambda\) continues to increase the bias dominates and the test RSS turns back up, producing the U shape.

  3. As we increase \(\lambda\) from 0, the variance will (iv.) steadily decrease, because the model becomes less flexible as the \(\beta_{j}\) coefficients are shrunk.

  4. As we increase \(\lambda\) from 0, the bias will (iii.) steadily increase. Since the model becomes less flexible as \(\lambda\) increases, the restriction on the \(\beta_{j}\) coefficients introduces more and more bias.

  5. As we increase \(\lambda\) from 0, the irreducible error will (v.) remain constant, since it does not depend on the model or on \(\lambda\).
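
Below is a minimal simulation sketch (made-up data, not part of the exercise; it assumes the glmnet package is installed) that illustrates the answer to part 1: as \(\lambda\) grows, the ridge coefficients are shrunk and the training RSS rises monotonically.

library(glmnet)
set.seed(1)

#Simulated data for illustration only: 100 observations, 10 predictors
n <- 100
p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% rnorm(p) + rnorm(n))

#Ridge fits over a grid of lambda values
lambdas <- 10^seq(-3, 3, length=50)
ridge.sim <- glmnet(X, y, alpha=0, lambda=lambdas)

#Training RSS at each lambda (prediction columns line up with ridge.sim$lambda)
train.rss <- colSums((y - predict(ridge.sim, newx=X))^2)
plot(log10(ridge.sim$lambda), train.rss, type='b',
     xlab='log10(lambda)', ylab='Training RSS')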

#2. Question 6.8.9 pg 263

In this exercise, we will predict the number of applications received using the other variables in the College data set.

##Question 9 a) Split the data set into a training set and a test set.

First I will load the College data set from the ISLR library. Then I will split the data set into training and testing sets.

#Loading Data set
library(ISLR)
data(College)

#Setting seed
set.seed(123)

#splitting the data set, 70% training, 30% testing
index <- sample(x=nrow(College), size=.70*nrow(College))
train <- College[index,]
test <-  College[-index,]

##Question 9 b) Fit a linear model using least squares on the training set, and report the test error obtained.

lm.fit <- lm(Apps ~ ., data=train)
lm.pred <- predict(lm.fit, test)
lm.mse <- mean((lm.pred - test$Apps)^2)

cat('The linear regression model MSE is :', lm.mse)
## The linear regression model MSE is : 1734841

##Question 9 c) Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.

In order to do this I will have to create a training model matrix and a testing model matrix, and I will need the glmnet library. First I will determine the best \(\lambda\) by cross-validation.

#install.packages('glmnet')
library(glmnet)
set.seed(123)

#Creating the training and testing matrices for the ridge regression method
train.matrix <- model.matrix(Apps ~ ., data=train)
test.matrix <- model.matrix(Apps ~., data=test)

#Implementing a grid of values ranging from lambda=10^10 to lambda=10^-2, covering the full range of scenarios from the null model containing only the intercept to the least squares fit
grid <- 10^seq(10,-2,length=100)

#Fitting the ridge regression model
ridge.fit <- glmnet(train.matrix, train$Apps, alpha=0, lambda=grid, thresh = 1e-12)

#Cross validating the model
cv.ridge <- cv.glmnet(train.matrix, train$Apps, alpha=0, lambda=grid, thresh = 1e-12)
plot(cv.ridge)

#Selecting the best lambda from the cross validation approach
best.lambda <- cv.ridge$lambda.min

#Output the best lambda from the cross validation method
cat('The best lambda from the cross validation is :', best.lambda)
## The best lambda from the cross validation is : 32.74549

##MSE

Now I will determine the MSE of the ridge regression model.

#obtaining the MSE using the best lambda from the cross validation 
ridge.pred <- predict(ridge.fit, s=best.lambda, newx=test.matrix)
ridge.mse <- mean((ridge.pred - test$Apps)^2)
cat('The Ridge Regression model MSE is: ', ridge.mse)
## The Ridge Regression model MSE is:  1936665

The ridge regression model MSE (1,936,665) is actually somewhat higher than the standard linear regression model MSE (1,734,841), a difference of 201,824, so on this split ridge regression does not improve on least squares.

##Question 9 d) Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

#Fitting lasso model 
lasso.fit <- glmnet(train.matrix, train$Apps, alpha=1, lambda=grid, thresh=1e-12)

#Cross validating the model
cv.lasso <- cv.glmnet(train.matrix, train$Apps, alpha=1, lambda=grid, thresh=1e-12)

#Selecting the best lambda from the cross validation
lasso.best.lambda <- cv.lasso$lambda.min
plot(cv.lasso)

cat('The best lambda from the cross validation is :', lasso.best.lambda)
## The best lambda from the cross validation is : 6.135907

##MSE of the lasso fit model

pred.lasso <- predict(lasso.fit, s=lasso.best.lambda, newx = test.matrix)
lasso.mse <- mean((pred.lasso - test$Apps)^2)

cat('The lasso method MSE is :', lasso.mse)
## The lasso method MSE is : 1732241

##Non-zero coefficient estimates

lasso.coef <- predict(lasso.fit, s=lasso.best.lambda, type='coefficients')[1:19,]
lasso.coef[lasso.coef!=0]
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -434.87100500 -661.77070292    1.21825124    0.08527611   45.95904318 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
##  -13.28682982    0.02277674    0.02970480   -0.04827206    0.17831079 
##         Books      Personal           PhD      Terminal   perc.alumni 
##    0.16958630   -0.01898678   -5.33505536   -4.80609827   -7.59328977 
##        Expend     Grad.Rate 
##    0.07542260    9.70968049

The lasso model has a slightly lower test MSE (1,732,241) than the least squares model (a difference of 2,600) and a noticeably lower MSE than the ridge regression model. The number of non-zero coefficient estimates at the chosen \(\lambda\) is 17 including the intercept (16 predictors), so one predictor, S.F.Ratio, has been shrunk exactly to zero.
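
As a quick check (a sketch reusing the lasso.coef vector computed above), the count can also be obtained directly:

#Quick count of non-zero estimates at the chosen lambda
sum(lasso.coef != 0)       #including the intercept
sum(lasso.coef[-1] != 0)   #predictor coefficients only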

##Question 9 e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

In order to fit a PCR model I will have to load the pls library. I will plot the validation results and output a summary of the fit to find which M (number of principal components) produces the lowest cross-validation error.

#install.packages('pls')
library(pls)

set.seed(123)
pcr.fit <- pcr(Apps ~ ., data=train, scale=TRUE, validation='CV')

validationplot(pcr.fit, val.type='R2')

summary(pcr.fit)
## Data:    X dimension: 543 17 
##  Y dimension: 543 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3402     3378     1683     1661     1335     1307     1274
## adjCV         3402     3379     1681     1660     1324     1302     1272
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1238     1221     1177      1181      1182      1185      1187
## adjCV     1230     1218     1175      1178      1180      1182      1184
##        14 comps  15 comps  16 comps  17 comps
## CV         1189      1199      1050      1052
## adjCV      1186      1196      1046      1048
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X      31.797    57.68    64.58     70.2    75.49    80.41    84.00
## Apps    3.037    76.20    77.49     85.1    86.15    86.83    87.92
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X       87.43    90.62     93.05     95.12     96.96     98.04     98.89
## Apps    88.01    88.85     88.87     88.94     88.98     89.01     89.01
##       15 comps  16 comps  17 comps
## X        99.42     99.83    100.00
## Apps     89.03     91.68     91.75

The cross-validation error is essentially flat for M = 16 and M = 17, so I will use M = 17. Now I compute the test MSE of the PCR model.

pcr.pred <- predict(pcr.fit, test, ncomp=17)
pcr.mse <- mean((test$Apps - pcr.pred)^2)

cat('The MSE for PCR model with 17 components is : ', pcr.mse )
## The MSE for PCR model with 17 components is :  1734841

This MSE is identical to the MSE of the regular linear regression model, which makes sense: PCR with all M = 17 components spans the same space as the original predictors and is therefore equivalent to the full least squares fit.
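
A quick sketch of this equivalence, reusing the pcr.pred and lm.pred objects fit above:

#With all 17 components, the PCR predictions should match the least squares
#predictions up to numerical precision
all.equal(as.numeric(pcr.pred), as.numeric(lm.pred), tolerance=1e-6)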

##Question 9 f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(123)

pls.fit <- plsr(Apps ~ ., data=train, scale=TRUE, validation='CV')

validationplot(pls.fit, val.type='MSEP')

summary(pls.fit)
## Data:    X dimension: 543 17 
##  Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3402     1520     1261     1159     1132     1104     1067
## adjCV         3402     1517     1262     1157     1129     1100     1061
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1057     1051     1052      1050      1049      1050      1051
## adjCV     1052     1047     1048      1046      1045      1046      1047
##        14 comps  15 comps  16 comps  17 comps
## CV         1051      1052      1052      1052
## adjCV      1047      1048      1048      1048
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X       26.09    41.97    63.14    67.44    71.36    74.05    77.72
## Apps    80.83    86.94    89.29    90.10    90.94    91.65    91.71
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X       80.98    83.77     86.46     89.83     91.07     93.08     95.14
## Apps    91.73    91.73     91.74     91.74     91.74     91.75     91.75
##       15 comps  16 comps  17 comps
## X        97.06     99.09    100.00
## Apps     91.75     91.75     91.75

Using the PLS method, the cross-validation error drops quickly with the first few components and then plateaus at around 10-13 components. Since M = 13 is hardly any better than M = 10, I will use M = 10 when computing the test MSE.

pls.pred <- predict(pls.fit, test, ncomp=10)
pls.mse <- mean((pls.pred - test$Apps)^2)

cat('The MSE using the pls method with the number of components as 10 is : ', pls.mse)
## The MSE using the pls method with the number of components as 10 is :  1758617

The PLS test MSE (1,758,617) is slightly higher than the least squares, PCR, and lasso MSEs (23,776 above the regular regression model) and lower only than the ridge MSE; all of the methods perform quite similarly on this split. Finally, to compare the models on a common scale, I compute the test R-squared for each one.
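
The quantity computed below is the test-set R-squared; the divisor in the code is the error of a baseline that always predicts the mean of the test responses, so for each model

\[
R^2 = 1 - \frac{\text{test MSE}}{\frac{1}{n}\sum_{i}\left(y_i - \bar{y}_{\text{test}}\right)^2}
\]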

test.avg <- mean(test$Apps)

divisor <-  mean((test.avg - test$Apps)^2)


#Regular Regression model R2
LM.R2 <- 1 - lm.mse / divisor

#Ridge Regression Model R2
Ridge.R2 <- 1 - ridge.mse / divisor

#lasso Regression Model R2
Lasso.R2 <- 1 - lasso.mse / divisor

#PCR Model R2
PCR.R2 <- 1 - pcr.mse / divisor

#PLS Model R2
PLS.R2 <- 1 - pls.mse / divisor


R2 <- data.frame(
  LM.R2 = LM.R2,
  Ridge.R2 = Ridge.R2,
  Lasso.R2 = Lasso.R2,
  PCR.R2 = PCR.R2,
  PLS.R2 = PLS.R2
)

R2

The R-squared values are all very close. Since each test R-squared is one minus the test MSE divided by the same denominator, the ordering mirrors the MSEs: the lasso model, which had the lowest test MSE, has the highest R-squared, and ridge regression has the lowest.

#3. Question 6.8.11 pg 26

We will now try to predict per capita crime rate in the Boston data set.

##Question 11 a)

Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

First I will load the data set from the MASS library. Then I will split the data set into training and testing data sets. 70% will be used for training and the other 30% will be used for testing.

library(MASS)
data(Boston)

set.seed(602)
index <- sample(x=nrow(Boston), size=.70*nrow(Boston))
train <- Boston[index,]
test <-  Boston[-index,]

##Full Model Approach

lm.fit <- lm(crim ~ ., data=train)
lm.pred <- predict(lm.fit, test)
lm.mse <- mean((lm.pred - test$crim)^2)

##Best Subset Selection

To use this approach the leaps library is required.

#install.packages('leaps')
library(leaps)

set.seed(602)

#Prediction helper for regsubsets objects (there is no built-in predict method).
#It rebuilds the model matrix from the formula stored in the regsubsets call and
#multiplies it by the coefficients of the id-variable model.
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id=id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}



#10-fold cross-validation for best subset selection over the full Boston data set
k <- 10
folds <- sample(1:k, nrow(Boston), replace=TRUE)
cv.errors <- matrix(NA, k, 13, dimnames=list(NULL, paste(1:13)))

for (i in 1:k) {
  best.fit = regsubsets(crim ~., data=Boston[folds != i,], nvmax=13)
  for (j in 1:13) {
    pred = predict(best.fit, Boston[folds==i, ], id=j)
    cv.errors[i, j] = mean((Boston$crim[folds==i] - pred)^2)
  }
}

mean.cv.errors=apply(cv.errors,2,mean)
plot(mean.cv.errors, ylab='CV Error', xlab='No. Variables', main='Best Subset Selection Method')

From the plot above, it appears that the lowest CV error occurs with 12 variables. I will confirm this by printing all of mean.cv.errors.

mean.cv.errors
##        1        2        3        4        5        6        7        8 
## 47.45595 45.69634 46.52498 46.36232 46.32926 46.02681 45.79949 45.44469 
##        9       10       11       12       13 
## 44.68523 44.70753 44.68550 44.66832 44.74099

The results confirm the plot: the 12-variable model produces the lowest CV error.
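
As a small follow-up sketch, the best size can be confirmed programmatically, and the corresponding 12-variable model can be refit on the full Boston data to inspect its coefficients (reg.best is a new object introduced just for this check):

#Subset size with the lowest mean CV error
which.min(mean.cv.errors)

#Refitting on the full data and examining the chosen 12-variable model
reg.best <- regsubsets(crim ~ ., data=Boston, nvmax=13)
coef(reg.best, id=12)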

##Lasso

set.seed(602)
#install.packages('glmnet')
library('glmnet')
x <- model.matrix(crim ~., data=Boston)[,-1]
y <- Boston$crim


grid <- 10^seq(10,-2,length=100)

#Note: this draws a new 50/50 split of the Boston data, replacing the earlier 70/30 split
train <- sample(1:nrow(x), nrow(x)/2)

test <- (-train)
y.test <- y[test]

cvlasso <- cv.glmnet(x[train,],y[train],alpha=1)

plot(cvlasso)

bestlambdalasso <- cvlasso$lambda.min
lasso <- glmnet(x[train,], y[train], alpha=1, lambda=grid)
predlasso <- predict(lasso, s=bestlambdalasso, newx=x[test,])
lasso.mse <- mean((predlasso - y.test)^2)

##Ridge Regression

cvridge <- cv.glmnet(x[train,],y[train],alpha=0)

plot(cvridge)

bestlambdaridge <- cvridge$lambda.min
ridge <- glmnet(x[train,], y[train], alpha=0, lambda=grid)
predridge <- predict(ridge, s=bestlambdaridge, newx=x[test,])
ridge.mse <- mean((predridge - y.test)^2)

##PCR Model

library('pls')
pcr.mod <- pcr(crim ~., data=Boston, subset=train, scale=TRUE, validation='CV')
validationplot(pcr.mod, val.type='MSEP')

summary(pcr.mod)
## Data:    X dimension: 253 13 
##  Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           9.648    8.275    8.315    7.928    7.917    7.937    8.076
## adjCV        9.648    8.271    8.313    7.918    7.905    7.926    8.057
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       8.096    7.983    8.053     8.063     8.090     8.069     8.017
## adjCV    8.077    7.960    8.025     8.037     8.067     8.037     7.984
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X       48.51    60.90    70.72    77.76    83.71    88.72    91.85
## crim    27.29    27.36    34.49    35.39    35.48    35.60    35.83
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## X       93.95    95.64     97.28     98.56     99.58    100.00
## crim    37.94    38.17     38.23     38.40     40.10     41.07

Based on the plot and summary above, I am going to go with 5 components (M = 5).

pcr.pred <- predict(pcr.mod, x[test,], ncomp=5)
pcr.mse <- mean((pcr.pred - y.test)^2)

##Question 11 b)

##MSE Model Comparison

#Best subset selection error (hard-coded value)
bss.mse <- 41.03457

mse.models <- data.frame(
  bss.mse = bss.mse,
  lasso.mse = lasso.mse,
  ridge.mse = ridge.mse,
  lm.mse = lm.mse
)

mse.models

Based on the MSE comparison above, the best subset selection model produces the lowest error of the approaches considered, so it is the model I would propose for predicting per capita crime rate.

##Question 11 c)

Since the best subset selection model had the lowest error, that is the model I am choosing. Based on the cross-validation results at the beginning of this problem, it uses only 12 of the 13 available predictors: adding a 13th variable does not reduce the CV error any further, so the chosen model does not involve all of the features in the data set.