Chapter 06 (page 259): 9, 11 ##2.For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
i. More flexable and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexable and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexable and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexable and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
i. More flexable and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexable and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexable and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexable and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
i. More flexable and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexable and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexable and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexable and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
I decided to set the seed because as I ran the probram over and over again i noticed i was getting different numbers then the previous run.
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.5.3
attach(College)
set.seed(11)
faction<-sample(1:dim(College)[1],dim(College)[1]/2)
train<-College[faction, ]
test<-College[-faction, ]
lm.fit<-lm(Apps~.,data = train )
pred<-predict(lm.fit,test)
mean((pred-test$Apps)^2)
## [1] 1538442
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.5.3
## Loading required package: Matrix
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.5.3
## Loaded glmnet 2.0-16
X<-model.matrix(Apps~., data = train )
Y<- model.matrix(Apps~., data = test)
grid = 10^seq(10,-2, length= 100)
ridge.mod = cv.glmnet(X,train$Apps, alpha =0,lambda = grid, thresh = 1e-12)
lamba.best<-ridge.mod$lambda.min
lamba.best
## [1] 18.73817
Reported test error:
lasso.fit<-cv.glmnet(X,train$Apps, alpha=1,lambda = grid, thresh = 1e-12)
lambda.best1<-lasso.fit$lambda.min
lambda.best1
## [1] 18.73817
Now the Test Error
lasso.pred<-predict(lasso.fit, newx = Y, s = lambda.best1)
mean((test$Apps-lasso.pred)^2)
## [1] 1632226
Now for the Coeficients:
lasso.mod<-glmnet(model.matrix(Apps~., data = College),College$Apps, alpha = 1)
predict(lasso.mod,s = lambda.best1, type = "coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -5.813707e+02
## (Intercept) .
## PrivateYes -4.368911e+02
## Accept 1.471991e+00
## Enroll -2.541925e-01
## Top10perc 3.574808e+01
## Top25perc -3.838790e+00
## F.Undergrad .
## P.Undergrad 2.596168e-02
## Outstate -6.179537e-02
## Room.Board 1.284644e-01
## Books .
## Personal 2.520096e-03
## PhD -6.025350e+00
## Terminal -3.256552e+00
## S.F.Ratio 6.026031e+00
## perc.alumni -8.955597e-01
## Expend 7.076914e-02
## Grad.Rate 5.580536e+00
library(pls)
## Warning: package 'pls' was built under R version 3.5.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr.fit<-pcr(Apps~.,data = train, scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")
pcr.pred <- predict(pcr.fit, test, ncomp = 10)
mean((test[,"Apps"]- pcr.pred)^2)
## [1] 3014496
pls.fit<-plsr(Apps~.,data= train, scale = TRUE, validation= "CV")
validationplot(pls.fit,val.type = "MSEP")
pls.predict<-predict(pls.fit,test, ncomp=10)
mean((pls.predict-test$Apps)^2)
## [1] 1508987
test.avg <- mean(test$Apps)
#install.packages("MASS")
#library(MASS)
#library(leaps)
#library(glmnet)
#predict.regsubsets = function(object, newdata, id, ...) {
# form = as.formula(object$call[[2]])
# mat = model.matrix(form, newdata)
# coefi = coef(object, id = id)
#mat[, names(coefi)] %*% coefi
#}
#k = 10
#p = (ncol(Boston) -1)
#folds = sample(rep(1:k, length = nrow(Boston)))
#cv.errors = matrix(NA, k, p)
#for (i in 1:k) {
# best.fit = regsubsets(crim ~ ., data = Boston[folds != i, ], nvmax = p)
# for (j in 1:p) {
# pred = predict(best.fit, Boston[folds == i, ], id = j)
# cv.errors[i, j] = mean((Boston$crim[folds == i] - pred)^2)
# }
#}
#rmse.cv = sqrt(apply(cv.errors, 2, mean))
#plot(rmse.cv, pch = 19, type = "b")
#Boston
#x<-model.matrix(crim~.-1,data = Boston)
#Y = Boston$crim
#cv.lasso = cv.glmnet(x,Y,type.measure = "mse")
#plot(cv.lasso)
#coef(cv.lasso)
#X<-model.matrix(crim~.-1,data = Boston)
#Y<-Boston$crim
#cv.ridge<-cv.glmnet(X,Y,type.measure = "mse", alpha = 0)
#plot(cv.ridge)
#coef(cv.ridge)
#library(pls)
#pcr.fit<-pcr(crim~.,data = Boston,scale =T, Validation = "CV")
#summary(pcr.fit)
Here we see PCR has the lowest cross validation.
The best subset mode will be the PCR, it has the lowest rate.
No the model of PCR only accounts for 13 not all 14 because running str shows many “NAN” and “NA”.