library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.2
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.1.3
## Loading required package: Matrix
## Loaded glmnet 4.1-3
library(pls)
## Warning: package 'pls' was built under R version 4.1.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
library(MASS)
## Warning: package 'MASS' was built under R version 4.1.2
library(leaps)
## Warning: package 'leaps' was built under R version 4.1.3
##2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
#Review: (these are my notes to help me walk through this question and develop an answer) Bias/Variance tradeoff
Variance - how much your estimate f would change by if you had different training data
Bias - error that is introduced by modeling a real life problem
More flexible methods - less bias
Less flexible methods - less variance
measuring quality of fit
measure of accuracy is mean squared error (most common in regression) Training vs Test MSE’s
the more flexible a method is the lower it training MSE, it will fit and explain the training data very well. More flexible - generate more shapes for f Less flexible - easier to interpret
Linear model is not flexible, easier to interpret, less variance, more bias (under fitting is a problem) Complex model is flexible, harder to interpret, less bias, more variance (over fitting is a problem)
#(a) The lasso, relative to least squares, is: Lasso - shrinkage class, as lambda goes up, variance goes down, bias goes up (slightly) shrinks the coefficient estimates towards zero, but also forces some of the coefficient estimates to be exactly equal zero when lambda is large. Lasso is like a hybrid of ridge regression (shrinkage) and subset (variable selection), results in easier models to interpret
This one is true: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. lasso is more flexible than least squares, it shrinks the coefs and it reduces some to zero (variable selection), a slight increase in bias (caused by the shrinkage of the coefs) results in a substantial reduction in the variance of the predictions.
#(b) Repeat (a) for ridge regression relative to least squares.
Ridge: Lambda Up, Flex Down, Variance Down, Bias up (slight) Least Squares: if close to linear: low bias, may have high variance (small change in training data cause large change in least squares coefs) When p (predictors) is close to number of observations n, least square extremely variable. If p > n least square does not have a solution, ridge still performs well. Ridge works best in situations where the least squares has high variance (no close to linear) Ridge includes all of the variables, shrinks the coefs, create challenges for interpretation.
This one is true: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Ridge is more flexible than least squares, it shrinks the coefs and but it does not reduce some to zero (like lasso), a slight increase in bias (caused by the shrinkage of the coefs) results in a substantial reduction in the variance of the predictions.
#(c) Repeat (a) for non-linear methods relative to least squares.
Least square is less flexible, than non-linear, but least square is easier to interpret. Least squares has less variance and more bias compared to complex or more flexible models.
For non-linear methods, they are more flexible than least squares, have more variance and less bias For non-linear variance is the issue which can lead to over fitting, so if you have less variance and more bias than that will help for prediction. This one seems to fit the above logic: ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. (To me this was a confusing way to put it) Basically for non-linear models, you want to bring down the variance and bring up the bias, so if you can bring in bias more than decrease variance, you get to a balance point (give up what you have a lot of (variance) for something you don’t have a lot of (bias))
##9. In this exercise, we will predict the number of applications received using the other variables in the College data set. #Least squares, Ridge Regression, lasso, PCR, PLS examples
Look at the data
View(College)
names(College) ## "Apps" target variable
## [1] "Private" "Apps" "Accept" "Enroll" "Top10perc"
## [6] "Top25perc" "F.Undergrad" "P.Undergrad" "Outstate" "Room.Board"
## [11] "Books" "Personal" "PhD" "Terminal" "S.F.Ratio"
## [16] "perc.alumni" "Expend" "Grad.Rate"
dim(College) ##777 rows, 18 variables
## [1] 777 18
Check for missing values
sum(is.na(College)) ##no missing values
## [1] 0
set.seed(22)
train=sample(c(TRUE,FALSE), nrow(College),rep=TRUE)
test=(!train)
college_train = College[train,]
college_test = College[!train,]
college_lm = lm(Apps~.,data=college_train)
summary(college_lm)
##
## Call:
## lm(formula = Apps ~ ., data = college_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3392.7 -447.0 -27.8 283.6 6365.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -170.26321 524.81191 -0.324 0.745799
## PrivateYes -806.04915 183.24187 -4.399 1.43e-05 ***
## Accept 1.32102 0.06458 20.456 < 2e-16 ***
## Enroll 0.01739 0.25158 0.069 0.944941
## Top10perc 30.63347 6.46730 4.737 3.11e-06 ***
## Top25perc -1.30594 5.41889 -0.241 0.809691
## F.Undergrad -0.02477 0.04870 -0.509 0.611272
## P.Undergrad 0.01377 0.04857 0.283 0.777000
## Outstate -0.03796 0.02346 -1.618 0.106586
## Room.Board 0.09850 0.05838 1.687 0.092395 .
## Books -0.02504 0.29345 -0.085 0.932056
## Personal 0.00407 0.08572 0.047 0.962154
## PhD -4.49625 5.67164 -0.793 0.428429
## Terminal -3.70106 6.29933 -0.588 0.557206
## S.F.Ratio -1.78231 15.16046 -0.118 0.906478
## perc.alumni -11.59185 5.71617 -2.028 0.043290 *
## Expend 0.05372 0.01513 3.550 0.000435 ***
## Grad.Rate 10.68390 4.17454 2.559 0.010887 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 899.8 on 368 degrees of freedom
## Multiple R-squared: 0.9352, Adjusted R-squared: 0.9322
## F-statistic: 312.5 on 17 and 368 DF, p-value: < 2.2e-16
report the test error obtained.
pred_college = predict(college_lm, college_test) ##predict applications
actual_preds = data.frame(cbind(actuals=college_test$Apps,predicts=pred_college)) ##actual predictions from test with predicted predictions
head(actual_preds)
## actuals predicts
## Abilene Christian University 1660 1521.14916
## Adrian College 1428 1192.98773
## Agnes Scott College 417 1670.95214
## Alaska Pacific University 193 -144.65177
## Albertson College 587 926.86015
## Albertus Magnus College 353 60.43043
testerror = mean((actual_preds$actuals-actual_preds$predicts)^2)
testerror
## [1] 1611884
set.seed(22)
college_x=model.matrix(Apps~.,College[,-1])
y=College$Apps
y_test=y[test]
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(college_x[train,],y[train],lambda = grid, alpha = 0) ##0 indicates ridge regression 1 indicates lasso
using cross-validation to choose tuning parameter for lambda, default is 10 folds
set.seed(22)
cv.out = cv.glmnet(college_x[train,],y[train],alpha = 0)
plot(cv.out)
Get the best lambda
bestlam=cv.out$lambda.min
bestlam
## [1] 327.9068
Predict for ridge regression model Test error for ridge regression model
pred_ridge=predict(ridge.mod,s=bestlam,newx=college_x[test,])
mean((pred_ridge-y_test)^2)
## [1] 2396957
lasso.mod=glmnet(college_x[train,],y[train],lambda = grid, alpha = 1) ##1 indicates lasso
plot(lasso.mod)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
using cross-validation to choose tuning parameter for lambda, default is 10 folds
set.seed(22)
cv.outlasso = cv.glmnet(college_x[train,],y[train],alpha = 1)
plot(cv.outlasso)
Get the best lambda
bestlamlasso=cv.outlasso$lambda.min
bestlamlasso
## [1] 21.57407
Predict for lasso model Test error for lasso model
pred_lasso=predict(lasso.mod,s=bestlamlasso,newx=college_x[test,])
mean((pred_lasso-y_test)^2)
## [1] 1642011
outC=glmnet(college_x,y,alpha=1,lambda=grid)
lasso.coef=predict(outC,type="coefficients",s= bestlamlasso) [1:18,]
lasso.coef
## (Intercept) (Intercept) Accept Enroll Top10perc
## -1.012324e+03 0.000000e+00 1.460466e+00 -1.480273e-01 3.343630e+01
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -2.430620e+00 0.000000e+00 2.870927e-02 -7.888266e-02 1.109541e-01
## Books Personal PhD Terminal S.F.Ratio
## 0.000000e+00 1.913308e-03 -3.889824e+00 -1.556056e+00 1.129338e+01
## perc.alumni Expend Grad.Rate
## -2.175181e+00 7.305924e-02 4.573781e+00
lasso.coef[lasso.coef!=0] ##(14 non-zero coefs)
## (Intercept) Accept Enroll Top10perc Top25perc
## -1.012324e+03 1.460466e+00 -1.480273e-01 3.343630e+01 -2.430620e+00
## P.Undergrad Outstate Room.Board Personal PhD
## 2.870927e-02 -7.888266e-02 1.109541e-01 1.913308e-03 -3.889824e+00
## Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## -1.556056e+00 1.129338e+01 -2.175181e+00 7.305924e-02 4.573781e+00
set.seed(22)
college_pcr = pcr(Apps ~.,data = College[train,], scale=TRUE, validation="CV")
summary(college_pcr) ##16, 17, and 9 have the lowest CV errors, 16 and 17 is pretty much same as lm full
## Data: X dimension: 386 17
## Y dimension: 386 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3461 3475 1708 1707 1257 1245 1203
## adjCV 3461 3475 1705 1707 1231 1240 1199
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1199 1153 1113 1117 1121 1126 1132
## adjCV 1196 1142 1110 1114 1118 1123 1128
## 14 comps 15 comps 16 comps 17 comps
## CV 1132 1149 955.9 954.3
## adjCV 1128 1145 952.0 950.3
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 33.4178 58.26 66.04 72.12 77.40 82.27 85.76 88.73
## Apps 0.1637 76.63 76.66 87.99 88.01 88.76 88.86 89.82
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 91.53 93.63 95.57 97.18 98.14 98.94 99.50
## Apps 90.47 90.50 90.51 90.53 90.54 90.61 90.63
## 16 comps 17 comps
## X 99.85 100.00
## Apps 93.47 93.52
validationplot(college_pcr,val.type = "MSEP")
Prediction and mean test error
college_pcr_predict = predict(college_pcr, College[test,],ncomp = 9)
mean((college_pcr_predict - College$Apps[test])^2)
## [1] 3173508
set.seed(22)
college_pls = plsr(Apps ~.,data = College[train,], scale=TRUE, validation="CV")
summary(college_pls) ## 12 has the lowest CV 953.7, the "break" is at 8/9 mark
## Data: X dimension: 386 17
## Y dimension: 386 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3461 1499 1121 1106 1068 1022 985.6
## adjCV 3461 1496 1111 1105 1062 1007 977.6
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 969.7 957.4 954.9 954.1 954.2 953.7 954
## adjCV 963.9 953.0 950.7 950.1 950.2 949.8 950
## 14 comps 15 comps 16 comps 17 comps
## CV 954.3 954.2 954.2 954.3
## adjCV 950.3 950.2 950.2 950.3
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 24.69 32.88 63.85 67.24 70.74 75.12 80.07 82.2
## Apps 82.32 90.16 90.73 91.74 92.73 93.36 93.47 93.5
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 84.62 87.54 91.14 92.91 94.53 96.38 97.27
## Apps 93.51 93.52 93.52 93.52 93.52 93.52 93.52
## 16 comps 17 comps
## X 98.67 100.00
## Apps 93.52 93.52
validationplot(college_pls,val.type = "MSEP")
Prediction and mean test error
college_pls_predict = predict(college_pls, College[test,],ncomp = 12)
mean((college_pls_predict - College$Apps[test])^2)
## [1] 1621018
LM, PLS and Lasso appear to be the best approaches based on MSE. More than 90% of the variation in the Apps variable (target) can be explained by the model coefficients.
##11. We will now try to predict per capita crime rate in the Boston data set.
#(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
Look at the data
View(Boston)
names(Boston) ## "crim" target variable
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
?Boston
## starting httpd help server ... done
dim(Boston) ##506 rows, 14 variables
## [1] 506 14
Check for missing values
sum(is.na(Boston)) ##no missing values
## [1] 0
Split the data set into a training set and a test set.
set.seed(11)
trainB=sample(c(TRUE,FALSE), nrow(Boston),rep=TRUE)
testB=(!trainB)
Boston_train = Boston[trainB,] ##250/14
Boston_test = Boston[!trainB,] ##256/14
ridge regression model
set.seed(11)
boston_x=model.matrix(Boston$crim ~.,Boston[,-1])
yB=Boston$crim
y_testB=yB[testB]
grid=10^seq(10,-2,length=100)
ridge.modB=glmnet(boston_x[trainB,],yB[trainB],lambda = grid, alpha = 0) ##0 indicates ridge regression 1 indicates lasso
using cross-validation to choose tuning parameter for lambda, default is 10 folds
set.seed(11)
cv.outB = cv.glmnet(boston_x[trainB,],yB[trainB],alpha = 0)
plot(cv.outB)
Get the best lambda
bestlamRB=cv.outB$lambda.min
bestlamRB
## [1] 0.8850296
Predict for ridge regression model Test error for ridge regression model
pred_ridgeB=predict(ridge.modB,s=bestlamRB,newx=boston_x[testB,])
mean((pred_ridgeB-y_testB)^2)
## [1] 55.34372
lasso model on the training set
lasso.modB=glmnet(boston_x[trainB,],yB[trainB],lambda = grid, alpha = 1) ##1 indicates lasso
plot(lasso.modB)
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values
using cross-validation to choose tuning parameter for lambda, default is 10 folds
set.seed(11)
cv.outlassoB = cv.glmnet(boston_x[trainB,],yB[trainB],alpha = 1)
plot(cv.outlassoB)
Get the best lambda
bestlamlassoB=cv.outlassoB$lambda.min
bestlamlassoB
## [1] 0.1345166
Predict for lasso model Test error for lasso model
pred_lassoB=predict(lasso.modB,s=bestlamlassoB,newx=boston_x[testB,])
mean((pred_lassoB-y_testB)^2)
## [1] 55.75482
PCR model on the training set
set.seed(11)
boston_pcr = pcr(crim ~.,data = Boston[trainB,], scale=TRUE, validation="CV")
summary(boston_pcr) ##10, 13, and 12 have the lowest CV errors, 12 and 13 is pretty much same as lm full
## Data: X dimension: 250 13
## Y dimension: 250 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.319 6.736 6.750 6.070 6.056 6.059 6.183
## adjCV 8.319 6.733 6.748 6.059 6.049 6.051 6.168
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.189 6.089 5.972 5.880 5.894 5.923 5.882
## adjCV 6.173 6.074 5.954 5.855 5.873 5.902 5.860
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.60 60.85 70.30 77.58 83.82 88.45 91.63 94.04
## crim 35.03 35.10 48.85 49.06 49.30 49.74 49.77 51.66
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.83 97.28 98.52 99.4 100.0
## crim 53.77 55.26 55.48 55.5 56.3
validationplot(boston_pcr,val.type = "MSEP")
Prediction and mean test error
boston_pcr_predict = predict(boston_pcr, Boston[testB,],ncomp = 10)
mean((boston_pcr_predict - Boston$crim[testB])^2)
## [1] 58.70196
PLS model on the training set
set.seed(11)
boston_pls = plsr(crim ~.,data = Boston[trainB,], scale=TRUE, validation="CV")
summary(boston_pls) ##10, has the lowest CV errors, after that it is the same CV error
## Data: X dimension: 250 13
## Y dimension: 250 1
## Fit method: kernelpls
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.319 6.478 6.000 5.976 5.905 5.867 5.867
## adjCV 8.319 6.474 5.987 5.945 5.883 5.849 5.848
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 5.889 5.888 5.883 5.882 5.882 5.882 5.882
## adjCV 5.867 5.866 5.861 5.861 5.861 5.860 5.860
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.99 57.80 61.09 69.83 76.99 81.06 85.16 88.80
## crim 40.72 52.18 55.43 55.87 56.07 56.20 56.26 56.29
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 92.98 95.09 97.7 98.92 100.0
## crim 56.30 56.30 56.3 56.30 56.3
validationplot(boston_pls,val.type = "MSEP")
Prediction and mean test error
boston_pls_predict = predict(boston_pls, Boston[testB,],ncomp = 10)
mean((boston_pls_predict - Boston$crim[testB])^2)
## [1] 57.45414
Subset approach
boston_subset_best = regsubsets(crim ~., data = Boston[trainB,],nvmax=13)
test_boston_mat = model.matrix(crim ~.,data = Boston[testB,])
val.errors =rep(NA ,13)
for(i in 1:13){
coefi=coef(boston_subset_best ,id=i)
pred_boston=test_boston_mat[,names(coefi)]%*%coefi
val.errors[i]=mean(( Boston$crim[testB]-pred_boston)^2)
}
val.errors
## [1] 52.41811 56.28508 57.47860 60.33110 56.91372 57.49680 57.65316 57.49860
## [9] 57.60281 57.41812 56.85848 57.47398 57.44183
which.min(val.errors)
## [1] 1
coef(boston_subset_best,1)
## (Intercept) rad
## -2.4729812 0.6643563
predict method for subset
predict.regsubsets=function(object , newdata ,id ,...){
form=as.formula (object$call [[2]])
mat=model.matrix(form ,newdata )
coefi=coef(object ,id=id)
xvars=names(coefi)
mat[,xvars]%*%coefi
}
boston_subset_best=regsubsets(crim~.,data=Boston,nvmax=13)
coef(boston_subset_best ,1)
## (Intercept) rad
## -2.2871594 0.6179109
k=10
set.seed(11)
folds=sample (1:k,nrow(Boston),replace=TRUE)
cv.errors=matrix(NA,k,13, dimnames=list(NULL,paste(1:13)))
for(j in 1:k){
boston_subset_best=regsubsets (crim ~.,data=Boston[folds!=j,],nvmax=13)
for(i in 1:13){
predB=predict (boston_subset_best ,Boston [folds ==j,],id=i)
cv.errors[j,i]= mean( ( Boston$crim[ folds==j]-predB)^2)
}
}
mean.cv.errors=apply(cv.errors ,2, mean)
mean.cv.errors ## 2 variable or 12 variable model
## 1 2 3 4 5 6 7 8
## 44.53966 42.93213 44.30954 44.15914 44.15057 44.18860 43.72870 43.45615
## 9 10 11 12 13
## 43.35866 43.45879 43.18816 42.93844 43.03094
par(mfrow=c(1,1))
plot(mean.cv.errors ,type='b')
boston_subset_best=regsubsets(crim~.,data=Boston,nvmax=13)
coef(boston_subset_best ,12) ##12 variable model
## (Intercept) zn indus chas nox
## 16.985713928 0.044673247 -0.063848469 -0.744367726 -10.202169211
## rm dis rad tax ptratio
## 0.439588002 -0.993556631 0.587660185 -0.003767546 -0.269948860
## black lstat medv
## -0.007518904 0.128120290 -0.198877768
boston_subset_best=regsubsets(crim~.,data=Boston,nvmax=13)
coef(boston_subset_best ,2) ##2 variable model
## (Intercept) rad lstat
## -4.3814053 0.5228128 0.2372846
#11 (a) Present and discuss results for the approaches that you consider.
MSE: MSE R2 Ridge: 55.343
Lasso: 55.754 PCR: 58.701 97.28 variance x explained, 55.26 crim (per capita crime rate) explained (10 comps) PLS: 57.454 95.09 variance x explained, 53.30 crim (per capita crime rate) explained (10 comps) Subset: 42.932 2 variable model, rad and lstat
Propose a model (or set of models) that seems to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross validation, or some other reasonable alternative, as opposed to using training error. I tried 5 different approachs and the subset approach produced the lowest MSE. Used CV and train/test data sets (validation approach).
Does your chosen model involve all of the features in the data set? Why or why not No it does not involve all of the variables. Subset approach had the lowest MSE of the different model approaches, based on the lab example from class (which I followed) ran through the cross validation approach and decided on a 2 variable model (rad and lstat).
A 10 variable model subset approach produced the loses MSE, however 2 variable model is much easier to work with than a 10 variable model