#Kevin Kuipers (Completed by myself)
#3/19/2019
#1. Question 6.8.4 pg 260
For each part, the possible answers are: i. Increase initially, and then eventually start decreasing in an inverted U shape. ii. Decrease initially, and then eventually start increasing in a U shape. iii. Steadily increase. iv. Steadily decrease. v. Remain constant.
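For reference, the quantity being minimized in this exercise is the penalized residual sum of squares, where \(\lambda \ge 0\) is the tuning parameter:

\[
\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2
\]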
As we increase \(\lambda\) from 0, the training RSS will (iii.) steadily increase. As \(\lambda\) grows, the \(\beta_{j}\) coefficients are increasingly restricted, so the model becomes less flexible and fits the training data less closely, which steadily increases the training RSS.
As we increase \(\lambda\) from 0, the test RSS will (ii.) decrease initially, and then eventually start increasing in a U shape. As \(\lambda\) grows, the \(\beta_{j}\) coefficients are restricted and move away from their least squares estimates. Initially the reduction in variance outweighs the added bias, so the test RSS decreases, but as \(\lambda\) continues to increase the test RSS changes direction and increases, producing a U shape.
As we increase \(\lambda\) from 0, the variance will (iv.) steadily decrease, because the model becomes less flexible as the \(\beta_{j}\) coefficients are restricted.
As we increase \(\lambda\) from 0, the bias will (iii.) steadily increase. Since the model becomes less flexible as \(\lambda\) increases and the \(\beta_{j}\) coefficients are shrunk toward zero, the bias steadily increases.
As we increase \(\lambda\) from 0, the irreducible error will (v.) remain constant, since it does not depend on the model or on \(\lambda\).
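These answers are consistent with the bias-variance decomposition of the expected test error, in which only the first two terms depend on the model (and therefore on \(\lambda\)):

\[
E\left(y_0 - \hat{f}(x_0)\right)^2 = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\epsilon)
\]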
#2. Question 6.8.9 pg 263
In this exercise, we will predict the number of applications received using the other variables in the College data set.
##Question 9 a) Split the data set into a training set and a test set.
First I will load the College data set from the ISLR library. Then I will split the data set into training and testing sets.
#Loading Data set
library(ISLR)
data(College)
#Setting seed
set.seed(123)
#splitting the data set, 70% training, 30% testing
index <- sample(x=nrow(College), size=.70*nrow(College))
train <- College[index,]
test <- College[-index,]
##Question 9 b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm.fit <- lm(Apps ~ ., data=train)
lm.pred <- predict(lm.fit, test)
lm.mse <- mean((lm.pred - test$Apps)^2)
cat('The linear regression model MSE is :', lm.mse)
## The linear regression model MSE is : 1734841
##Question 9 c) Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.
In order to do this I will have to create a training model matrix and a testing model matrix, and I will use the glmnet library. First I will determine the best \(\lambda\) by cross-validation.
#install.packages('glmnet')
library(glmnet)
set.seed(123)
#Creating the training and testing matrices for the ridge regression method
train.matrix <- model.matrix(Apps ~ ., data=train)
test.matrix <- model.matrix(Apps ~., data=test)
#Implementing a grid of values ranging from lambda=10^10 to lambda=10^-2, covering the full range of scenarios from the null model containing only the intercept to the least squares fit
grid <- 10^seq(10,-2,length=100)
#Fitting the ridge regression model
ridge.fit <- glmnet(train.matrix, train$Apps, alpha=0, lambda=grid, thresh = 1e-12)
#Cross validating the model
cv.ridge <- cv.glmnet(train.matrix, train$Apps, alpha=0, lambda=grid, thresh = 1e-12)
plot(cv.ridge)
#Selecting the best lambda from the cross validation approach
best.lambda <- cv.ridge$lambda.min
#Output the best lambda from the cross validation method
cat('The best lambda from the cross validation is :', best.lambda)
## The best lambda from the cross validation is : 32.74549
##MSE
Now I will determine the MSE of the ridge regression model.
#obtaining the MSE using the best lambda from the cross validation
ridge.pred <- predict(ridge.fit, s=best.lambda, newx=test.matrix)
ridge.mse <- mean((ridge.pred - test$Apps)^2)
cat('The Ridge Regression model MSE is: ', ridge.mse)
## The Ridge Regression model MSE is: 1936665
The ridge regression model MSE is actually higher than the standard linear regression model's (1936665 vs. 1734841), so ridge regression does not improve the test error here.
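As a quick sanity check, the gap between the two test errors can be computed directly (a small sketch using the MSE objects already defined above):
#Difference between the ridge and linear regression test MSEs
cat('Ridge MSE minus linear regression MSE is :', ridge.mse - lm.mse)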
##Question 9 d) Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
#Fitting lasso model
lasso.fit <- glmnet(train.matrix, train$Apps, alpha=1, lambda=grid, thresh=1e-12)
#Cross validating the model
cv.lasso <- cv.glmnet(train.matrix, train$Apps, alpha=1, lambda=grid, thresh=1e-12)
#Selecting the best lambda from the cross validation
lasso.best.lambda <- cv.lasso$lambda.min
plot(cv.lasso)
cat('The best lambda from the cross validation is :', lasso.best.lambda)
## The best lambda from the cross validation is : 6.135907
##MSE of the lasso fit model
pred.lasso <- predict(lasso.fit, s=lasso.best.lambda, newx = test.matrix)
lasso.mse <- mean((pred.lasso - test$Apps)^2)
cat('The lasso method MSE is :', lasso.mse)
## The lasso method MSE is : 1732241
##Non-zero coefficient estimates
lasso.coef <- predict(lasso.fit, s=lasso.best.lambda, type='coefficients')[1:18,]
lasso.coef[lasso.coef!=0]
## (Intercept) PrivateYes Accept Enroll Top10perc
## -434.87100500 -661.77070292 1.21825124 0.08527611 45.95904318
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -13.28682982 0.02277674 0.02970480 -0.04827206 0.17831079
## Books Personal PhD Terminal perc.alumni
## 0.16958630 -0.01898678 -5.33505536 -4.80609827 -7.59328977
## Expend Grad.Rate
## 0.07542260 9.70968049
The lasso model has a slightly lower test MSE than both the ridge regression and the linear regression models. The number of non-zero coefficient estimates (including the intercept) is 17, so only one predictor, S.F.Ratio, was shrunk to zero.
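The count of non-zero estimates can also be obtained programmatically (a small sketch using the lasso.coef vector extracted above):
#Counting the non-zero coefficient estimates, including the intercept
cat('Number of non-zero coefficient estimates is :', sum(lasso.coef != 0))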
##Question 9 e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
In order to fit a PCR model I will have to load the pls library. I will plot the number of components and output a summary of the fit to find which M (number of principal components) produces the lowest cross-validation error.
#install.packages('pls')
library(pls)
set.seed(123)
pcr.fit <- pcr(Apps ~ ., data=train, scale=TRUE, validation='CV')
validationplot(pcr.fit, val.type='R2')
summary(pcr.fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3402 3378 1683 1661 1335 1307 1274
## adjCV 3402 3379 1681 1660 1324 1302 1272
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1238 1221 1177 1181 1182 1185 1187
## adjCV 1230 1218 1175 1178 1180 1182 1184
## 14 comps 15 comps 16 comps 17 comps
## CV 1189 1199 1050 1052
## adjCV 1186 1196 1046 1048
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 31.797 57.68 64.58 70.2 75.49 80.41 84.00
## Apps 3.037 76.20 77.49 85.1 86.15 86.83 87.92
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 87.43 90.62 93.05 95.12 96.96 98.04 98.89
## Apps 88.01 88.85 88.87 88.94 88.98 89.01 89.01
## 15 comps 16 comps 17 comps
## X 99.42 99.83 100.00
## Apps 89.03 91.68 91.75
The cross-validation error is lowest at M = 16 (CV 1050) and essentially identical at M = 17 (CV 1052), so I will use M = 17. Now I will find the test MSE of the PCR model by predicting on the test set.
pcr.pred <- predict(pcr.fit, test, ncomp=17)
pcr.mse <- mean((test$Apps - pcr.pred)^2)
cat('The MSE for PCR model with 17 components is : ', pcr.mse )
## The MSE for PCR model with 17 components is : 1734841
This MSE is identical to the MSE from the ordinary least squares model.
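This is expected: with M = 17 all of the principal components are retained, so PCR spans the same column space as ordinary least squares and produces the same predictions. A quick check of the two prediction vectors confirms this (a small sketch using the pcr.pred and lm.pred objects computed above):
#The maximum absolute difference between the PCR (M = 17) and least squares predictions should be essentially zero
cat('Max absolute difference in predictions is :', max(abs(as.vector(pcr.pred) - lm.pred)))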
##Question 9 f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(123)
pls.fit <- plsr(Apps ~ ., data=train, scale=TRUE, validation='CV')
validationplot(pls.fit, val.type='MSEP')
summary(pls.fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3402 1520 1261 1159 1132 1104 1067
## adjCV 3402 1517 1262 1157 1129 1100 1061
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1057 1051 1052 1050 1049 1050 1051
## adjCV 1052 1047 1048 1046 1045 1046 1047
## 14 comps 15 comps 16 comps 17 comps
## CV 1051 1052 1052 1052
## adjCV 1047 1048 1048 1048
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 26.09 41.97 63.14 67.44 71.36 74.05 77.72
## Apps 80.83 86.94 89.29 90.10 90.94 91.65 91.71
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps
## X 80.98 83.77 86.46 89.83 91.07 93.08 95.14
## Apps 91.73 91.73 91.74 91.74 91.74 91.75 91.75
## 15 comps 16 comps 17 comps
## X 97.06 99.09 100.00
## Apps 91.75 91.75 91.75
Using the PLS method, the cross-validation error drops quickly over the first few components and plateaus at around 10-13 components. Since M = 13 is hardly any better than M = 10, I will use M = 10 for computing the test MSE.
pls.pred <- predict(pls.fit, test, ncomp=10)
pls.mse <- mean((pls.pred - test$Apps)^2)
cat('The MSE using the pls method with the number of components as 10 is : ', pls.mse)
## The MSE using the pls method with the number of components as 10 is : 1758617
Contrary to first impressions, the PLS test MSE (1758617) is slightly higher than the least squares, lasso, and PCR test MSEs, though still lower than the ridge regression MSE.
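For a side-by-side view, the five test MSEs can be collected into one data frame (a small sketch using the MSE objects computed in parts b through f):
#Collecting the test MSEs of all five models
mse.compare <- data.frame(
LM.MSE = lm.mse,
Ridge.MSE = ridge.mse,
Lasso.MSE = lasso.mse,
PCR.MSE = pcr.mse,
PLS.MSE = pls.mse
)
mse.compare
To compare the models on a common scale, I will also compute the test \(R^2\) for each model below.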
test.avg <- mean(test$Apps)
divisor <- mean((test.avg - test$Apps)^2)
#Regular Regression model R2
LM.R2 <- 1 - lm.mse / divisor
#Ridge Regression Model R2
Ridge.R2 <- 1 - ridge.mse / divisor
#lasso Regression Model R2
Lasso.R2 <- 1 - lasso.mse / divisor
#PCR Model R2
PCR.R2 <- 1 - pcr.mse / divisor
#PLS Model R2
PLS.R2 <- 1 - pls.mse / divisor
R2 <- data.frame(
LM.R2 = LM.R2,
Ridge.R2 = Ridge.R2,
Lasso.R2 = Lasso.R2,
PCR.R2 = PCR.R2,
PLS.R2 = PLS.R2
)
R2
The highest R-squared value comes from the lasso model, which also had the lowest test MSE, so the two summaries agree. All five models perform very similarly, with the ridge regression model trailing slightly behind the others.
#3. Question 6.8.11 pg 264
We will now try to predict the per capita crime rate in the Boston data set.
##Question 11 a)
Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
First I will load the data set from the MASS library. Then I will split the data set into training and testing data sets. 70% will be used for training and the other 30% will be used for testing.
library(MASS)
data(Boston)
set.seed(602)
index <- sample(x=nrow(Boston), size=.70*nrow(Boston))
train <- Boston[index,]
test <- Boston[-index,]
##Full Model Approach
lm.fit <- lm(crim ~ ., data=train)
lm.pred <- predict(lm.fit, test)
lm.mse <- mean((lm.pred - test$crim)^2)
##Best Subset Selection
To use this approach the leaps library is required.
#install.packages('leaps')
library(leaps)
set.seed(602)
#Prediction helper for regsubsets objects: rebuilds the model matrix from the stored formula and multiplies it by the coefficients of the id-variable model
predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])
  mat = model.matrix(form, newdata)
  coefi = coef(object, id=id)
  xvars = names(coefi)
  mat[, xvars] %*% coefi
}
k <- 10
folds = sample(1:k, nrow(Boston), replace=TRUE)
cv.errors = matrix(NA, k, 13, dimnames=list(NULL,paste(1:13)))
for (i in 1:k) {
  best.fit = regsubsets(crim ~ ., data=Boston[folds != i,], nvmax=13)
  for (j in 1:13) {
    pred = predict(best.fit, Boston[folds == i,], id=j)
    cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
  }
}
mean.cv.errors=apply(cv.errors,2,mean)
plot(mean.cv.errors, ylab='CV Error', xlab='No. Variables', main='Best Subset Selection Method')
From the scatter plot above, it appears that the lowest CV error occurs with 12 variables. I will confirm this by outputting all of the mean CV errors.
mean.cv.errors
## 1 2 3 4 5 6 7 8
## 47.45595 45.69634 46.52498 46.36232 46.32926 46.02681 45.79949 45.44469
## 9 10 11 12 13
## 44.68523 44.70753 44.68550 44.66832 44.74099
The results confirm the scatter plot: the 12-variable model produces the lowest CV error (44.67).
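To see which predictors the 12-variable model actually uses, the best subset fit can be refit on the full Boston data and the coefficients of the 12-variable model extracted (a small sketch using the leaps package loaded above):
#Refitting best subset selection on the full data set and pulling the coefficients of the 12-variable model
reg.best <- regsubsets(crim ~ ., data=Boston, nvmax=13)
coef(reg.best, 12)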
##Lasso
set.seed(602)
#install.packages('glmnet')
library('glmnet')
x <- model.matrix(crim ~., data=Boston)[,-1]
y <- Boston$crim
grid <- 10^seq(10,-2,length=100)
train <- sample(1:nrow(x), nrow(x)/2)
test <- (-train)
y.test <- y[test]
cvlasso <- cv.glmnet(x[train,],y[train],alpha=1)
plot(cvlasso)
bestlambdalasso <- cvlasso$lambda.min
lasso <- glmnet(x[train,], y[train], alpha=1, lambda=grid)
predlasso <- predict(lasso, s=bestlambdalasso, newx=x[test,])
lasso.mse <- mean((predlasso - y.test)^2)
##Ridge Regression
cvridge <- cv.glmnet(x[train,],y[train],alpha=0)
plot(cvridge)
bestlambdaridge <- cvridge$lambda.min
#Note: alpha=0 fits ridge regression (alpha=1 is the lasso)
ridge <- glmnet(x[train,], y[train], alpha=0, lambda=grid)
predridge <- predict(ridge, s=bestlambdaridge, newx=x[test,])
ridge.mse <- mean((predridge - y.test)^2)
##PCR Model
library('pls')
pcr.mod <- pcr(crim ~., data=Boston, subset=train, scale=TRUE, validation='CV')
validationplot(pcr.mod, val.type='MSEP')
summary(pcr.mod)
## Data: X dimension: 253 13
## Y dimension: 253 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.648 8.275 8.315 7.928 7.917 7.937 8.076
## adjCV 9.648 8.271 8.313 7.918 7.905 7.926 8.057
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 8.096 7.983 8.053 8.063 8.090 8.069 8.017
## adjCV 8.077 7.960 8.025 8.037 8.067 8.037 7.984
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 48.51 60.90 70.72 77.76 83.71 88.72 91.85
## crim 27.29 27.36 34.49 35.39 35.48 35.60 35.83
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 93.95 95.64 97.28 98.56 99.58 100.00
## crim 37.94 38.17 38.23 38.40 40.10 41.07
Based on the plot and summary, I am going to go with 5 components (M = 5).
pcr.pred <- predict(pcr.mod, x[test,], ncomp=5)
pcr.mse <- mean((pcr.pred - y.test)^2)
##Question 11 b)
##MSE Model Comparison
bss.mse <- 41.03457
mse.models <- data.frame(
bss.mse = bss.mse,
lasso.mse = lasso.mse,
ridge.mse = ridge.mse,
lm.mse = lm.mse
)
mse.models
Based on the MSEs, the model that produces the lowest error is the best subset selection model, which also gave the lowest cross-validation error.
##Question 11 c)
Since the best subset selection model had the lowest error, that is the model I would propose. Based on the cross-validation results above, it does not involve all of the features: it uses only 12 of the 13 available predictors, because dropping one variable produced a slightly lower cross-validation error than the full 13-variable model, so including every feature does not improve the prediction of the crim variable.