set.seed(2000)
library(ISLR)
library(tidyverse)
library(glmnet) # For lasso and ridge
library(pls) # PCR
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
Predicting the number of applications received in the College data set
df.r.9 <- College #Dataframe.raw.question 9
partition <- .80 # Fraction of rows used for training
tt <- sample(nrow(df.r.9), floor(nrow(df.r.9) * partition)) # Training row indices
df.r.9.train <- df.r.9[tt,]
df.r.9.test <- df.r.9[-tt,]
lm.9.fit <- lm(Apps ~ ., data = df.r.9.train)
mean((df.r.9.test$Apps - predict(lm.9.fit, df.r.9.test))^2)
## [1] 1871831
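The split, fit, and score steps above recur for every model in this exercise. A minimal sketch of the same pattern as reusable helpers follows. It assumes the built-in `mtcars` data as a stand-in so that it runs without ISLR, and `split_idx` and `test_mse` are hypothetical names that are not part of the original analysis.

```r
# Hypothetical helpers; mtcars stands in for College so the block is self-contained.
set.seed(2000)

# Return row indices for a training set holding `prop` of the rows.
split_idx <- function(n, prop = 0.80) {
  sample(n, floor(n * prop))  # floor() makes the integer sample size explicit
}

# Mean squared error of predictions against observed values.
test_mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

idx   <- split_idx(nrow(mtcars))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Fit on the training rows only, then score on the held-out rows.
fit <- lm(mpg ~ wt + hp, data = train)
mse <- test_mse(test$mpg, predict(fit, test))
print(mse)
```

Keeping the scoring step in one function means each later model only has to supply its own predictions.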
Creating the training and test model matrices for glmnet
#grid =10^seq(10,-2, length =100)
x.9.train = model.matrix(Apps ~., df.r.9.train)
y.9.train = df.r.9.train$Apps
x.9.test = model.matrix(Apps ~., df.r.9.test)
y.9.test = df.r.9.test$Apps
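`model.matrix()` adds an `(Intercept)` column of ones, and `glmnet()` fits its own intercept by default, so that column is redundant in `x.9.train` and `x.9.test`. A common idiom is to drop the first column before passing the matrix on. The sketch below shows this on the built-in `mtcars` data rather than on `College`, so it runs with base R only; it is an illustration, not a change to the results reported here.

```r
# Sketch on mtcars (a stand-in for College) so it runs with base R only.
x_full <- model.matrix(mpg ~ ., data = mtcars)

# The first column is the constant term that model.matrix() adds.
colnames(x_full)[1]

# Drop it; a fitter that estimates its own intercept does not need it.
x <- x_full[, -1]
y <- mtcars$mpg

dim(x_full)
dim(x)
```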
Fitting the ridge regression with the best lambda value
ridge.9.cv = cv.glmnet(x.9.train,y.9.train,alpha =0)
bestlam.ridge = ridge.9.cv$lambda.min
ridge.9.fit = glmnet(x.9.train,y.9.train, alpha = 0, lambda = bestlam.ridge)
ridge.9.pred = predict(ridge.9.fit, newx = x.9.test)
mean((y.9.test - ridge.9.pred )^2)
## [1] 3706089
lasso.9.cv = cv.glmnet(x.9.train,y.9.train,alpha = 1)
bestlam.lasso = lasso.9.cv$lambda.min # Use the lasso CV fit; the results printed below were generated with ridge.9.cv$lambda.min
lasso.9.fit = glmnet(x.9.train,y.9.train, alpha = 1, lambda = bestlam.lasso)
lasso.9.pred = predict(lasso.9.fit, newx = x.9.test)
colSums(coef(lasso.9.fit)[,] != 0) # Number of nonzero coefficients, counting the intercept
## s0
## 5
mean((y.9.test - lasso.9.pred )^2)
## [1] 2216886
pcr.9.fit = pcr(Apps~., data=df.r.9.train, scale=TRUE, validation ="CV")
summary(pcr.9.fit)
## Data: X dimension: 621 17
## Y dimension: 621 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3561     3506     1692     1691     1693     1373     1319
## adjCV         3561     3506     1690     1692     1724     1349     1316
##         7 comps   8 comps   9 comps  10 comps  11 comps  12 comps  13 comps
## CV         1285      1254      1217      1214      1218      1219      1215
## adjCV      1283      1250      1215      1212      1215      1217      1212
##        14 comps  15 comps  16 comps  17 comps
## CV         1213      1213      1080      1064
## adjCV      1211      1211      1077      1060
##
## TRAINING: % variance explained
##        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       31.336    57.04    64.37    70.13    75.61    80.78    84.58    88.06
## Apps     3.688    77.99    77.99    78.00    86.88    87.00    87.72    88.35
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X         91.01     93.23     95.27     97.03     98.06     98.92     99.45
## Apps      89.03     89.20     89.20     89.24     89.35     89.38     89.39
##        16 comps  17 comps
## X         99.86    100.00
## Apps      91.75     92.24
validationplot(pcr.9.fit) # CV error drops sharply at M = 16; M = 17 is marginally lower but is just the full least squares fit
pcr.9.pred = predict(pcr.9.fit, df.r.9.test, ncomp = 16)
mean((pcr.9.pred - df.r.9.test$Apps)^2)
## [1] 2194053
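The number of components can also be read off the cross-validation results directly instead of from the plot. The sketch below uses the CV RMSEP values copied by hand from the PCR summary above; `pcr_cv` and `best_m` are hypothetical names. With `pls` loaded, the same values should be obtainable from `RMSEP(pcr.9.fit)` rather than typed in.

```r
# CV RMSEP values copied from the PCR summary above
# (the first entry is the intercept-only model, i.e. 0 components).
pcr_cv <- c(3561, 3506, 1692, 1691, 1693, 1373, 1319,
            1285, 1254, 1217, 1214, 1218, 1219, 1215,
            1213, 1213, 1080, 1064)
names(pcr_cv) <- 0:17  # number of components

# Number of components with the smallest cross-validated RMSEP.
best_m <- as.integer(names(which.min(pcr_cv)))
best_m

# Change in RMSEP from each additional component (negative = improvement).
diff(pcr_cv)
```

The minimum falls at all 17 components, where PCR coincides with ordinary least squares, and the differences show that most of the late improvement arrives with the 16th component.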
plsr.9.fit=plsr(Apps~., data=df.r.9.train, scale=TRUE, validation ="CV")
summary(plsr.9.fit)
## Data: X dimension: 621 17
## Y dimension: 621 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3561     1533     1302     1184     1171     1154     1111
## adjCV         3561     1531     1304     1182     1164     1132     1102
##         7 comps   8 comps   9 comps  10 comps  11 comps  12 comps  13 comps
## CV         1092      1088      1082      1082      1078      1080      1080
## adjCV      1086      1083      1077      1077      1073      1075      1075
##        14 comps  15 comps  16 comps  17 comps
## CV         1080      1081      1081      1081
## adjCV      1075      1076      1076      1076
##
## TRAINING: % variance explained
##        1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X        25.98    43.13    62.24    65.29    66.95    71.75    76.24    78.90
## Apps     81.88    87.11    89.61    90.59    91.76    91.96    92.05    92.11
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X         82.28     84.93     87.01     90.65     94.26     96.35     97.37
## Apps      92.14     92.18     92.22     92.23     92.23     92.24     92.24
##        16 comps  17 comps
## X         97.99    100.00
## Apps      92.24     92.24
validationplot(plsr.9.fit) # CV error flattens after the first few components; use M = 5 (the CV minimum is at M = 11)
plsr.9.pred = predict(plsr.9.fit, df.r.9.test, ncomp = 5)
mean((plsr.9.pred - df.r.9.test$Apps)^2)
## [1] 2338137
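Because the PLS cross-validation curve is nearly flat past the first few components, the choice of M is a judgment call. One way to make it less subjective is a tolerance rule: take the smallest M whose CV RMSEP is within a fixed fraction of the minimum. The sketch below applies that rule to the values copied from the PLS summary above. The function `smallest_within` and the 2% tolerance are assumptions for illustration, not something the original analysis used.

```r
# CV RMSEP values copied from the PLS summary above
# (the first entry is the intercept-only model, i.e. 0 components).
pls_cv <- c(3561, 1533, 1302, 1184, 1171, 1154, 1111,
            1092, 1088, 1082, 1082, 1078, 1080, 1080,
            1080, 1081, 1081, 1081)
names(pls_cv) <- 0:17  # number of components

# Hypothetical rule: smallest M whose CV RMSEP is within `tol` of the minimum.
smallest_within <- function(cv, tol = 0.02) {
  ok <- cv <= min(cv) * (1 + tol)
  as.integer(names(cv)[which(ok)[1]])
}

m_min <- as.integer(names(which.min(pls_cv)))  # M at the CV minimum
m_tol <- smallest_within(pls_cv)               # M under the 2% rule
c(minimum = m_min, within_2pct = m_tol)
```

A looser tolerance selects fewer components, so the tolerance itself should be stated alongside the chosen M.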
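Comparing test MSEs across the models, ordinary linear regression had the smallest (1,871,831), followed by PCR (2,194,053), the lasso (2,216,886), and PLSR (2,338,137); ridge regression had by far the largest (3,706,089). PCR used 16 of the 17 possible components, so it is nearly equivalent to the full linear regression, which explains why it came closest to the least squares result.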
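The ranking is easier to check when the reported test MSEs sit side by side. The sketch below collects the five values printed in the output above into one table and sorts it. The names `reported_mse` and `comparison` are hypothetical, and the RMSE and ratio columns are derived from the reported numbers only.

```r
# Test MSEs as reported in the output above.
reported_mse <- c(
  "Least squares" = 1871831,
  "Ridge"         = 3706089,
  "Lasso"         = 2216886,
  "PCR (M = 16)"  = 2194053,
  "PLS (M = 5)"   = 2338137
)

comparison <- data.frame(
  model       = names(reported_mse),
  mse         = unname(reported_mse),
  rmse        = round(sqrt(unname(reported_mse))),  # same units as Apps
  ratio_to_ls = round(unname(reported_mse) / reported_mse[["Least squares"]], 2)
)

# Order from best (smallest test MSE) to worst.
comparison <- comparison[order(comparison$mse), ]
rownames(comparison) <- NULL
print(comparison)
```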