HW5 - Justin Howard

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.1.2

library(glmnet)

## Warning: package 'glmnet' was built under R version 4.1.3

## Loading required package: Matrix

## Loaded glmnet 4.1-3

library(pls)

## Warning: package 'pls' was built under R version 4.1.3

## 
## Attaching package: 'pls'

## The following object is masked from 'package:stats':
## 
##     loadings

library(MASS)

## Warning: package 'MASS' was built under R version 4.1.2

library(leaps)

## Warning: package 'leaps' was built under R version 4.1.3

##2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

#Review: (these are my notes to help me walk through this question and develop an answer) Bias/Variance tradeoff
Variance - how much your estimate f would change by if you had different training data

Bias - error that is introduced by modeling a real life problem


More flexible methods - less bias
Less flexible methods - less variance

measuring quality of fit
measure of accuracy is mean squared error (most common in regression) Training vs Test MSE’s
the more flexible a method is the lower it training MSE, it will fit and explain the training data very well. More flexible - generate more shapes for f Less flexible - easier to interpret

Linear model is not flexible, easier to interpret, less variance, more bias (under fitting is a problem) Complex model is flexible, harder to interpret, less bias, more variance (over fitting is a problem)

#(a) The lasso, relative to least squares, is: Lasso - shrinkage class, as lambda goes up, variance goes down, bias goes up (slightly) shrinks the coefficient estimates towards zero, but also forces some of the coefficient estimates to be exactly equal zero when lambda is large. Lasso is like a hybrid of ridge regression (shrinkage) and subset (variable selection), results in easier models to interpret

This one is true: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. lasso is more flexible than least squares, it shrinks the coefs and it reduces some to zero (variable selection), a slight increase in bias (caused by the shrinkage of the coefs) results in a substantial reduction in the variance of the predictions.

More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

#(b) Repeat (a) for ridge regression relative to least squares.

Ridge: Lambda Up, Flex Down, Variance Down, Bias up (slight) Least Squares: if close to linear: low bias, may have high variance (small change in training data cause large change in least squares coefs) When p (predictors) is close to number of observations n, least square extremely variable. If p > n least square does not have a solution, ridge still performs well. Ridge works best in situations where the least squares has high variance (no close to linear) Ridge includes all of the variables, shrinks the coefs, create challenges for interpretation.

This one is true: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Ridge is more flexible than least squares, it shrinks the coefs and but it does not reduce some to zero (like lasso), a slight increase in bias (caused by the shrinkage of the coefs) results in a substantial reduction in the variance of the predictions.

#(c) Repeat (a) for non-linear methods relative to least squares.

Least square is less flexible, than non-linear, but least square is easier to interpret. Least squares has less variance and more bias compared to complex or more flexible models.

For non-linear methods, they are more flexible than least squares, have more variance and less bias For non-linear variance is the issue which can lead to over fitting, so if you have less variance and more bias than that will help for prediction. This one seems to fit the above logic: ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. (To me this was a confusing way to put it) Basically for non-linear models, you want to bring down the variance and bring up the bias, so if you can bring in bias more than decrease variance, you get to a balance point (give up what you have a lot of (variance) for something you don’t have a lot of (bias))

##9. In this exercise, we will predict the number of applications received using the other variables in the College data set. #Least squares, Ridge Regression, lasso, PCR, PLS examples

Look at the data

View(College)
names(College) ## "Apps" target variable

##  [1] "Private"     "Apps"        "Accept"      "Enroll"      "Top10perc"  
##  [6] "Top25perc"   "F.Undergrad" "P.Undergrad" "Outstate"    "Room.Board" 
## [11] "Books"       "Personal"    "PhD"         "Terminal"    "S.F.Ratio"  
## [16] "perc.alumni" "Expend"      "Grad.Rate"

dim(College) ##777 rows, 18 variables

## [1] 777  18

Check for missing values

sum(is.na(College)) ##no missing values

## [1] 0

Split the data set into a training set and a test set.

set.seed(22)
train=sample(c(TRUE,FALSE), nrow(College),rep=TRUE)
test=(!train)
college_train = College[train,]
college_test = College[!train,]

Fit a linear model using least squares on the training set, and report the test error obtained.

college_lm = lm(Apps~.,data=college_train)
summary(college_lm)

## 
## Call:
## lm(formula = Apps ~ ., data = college_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3392.7  -447.0   -27.8   283.6  6365.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -170.26321  524.81191  -0.324 0.745799    
## PrivateYes  -806.04915  183.24187  -4.399 1.43e-05 ***
## Accept         1.32102    0.06458  20.456  < 2e-16 ***
## Enroll         0.01739    0.25158   0.069 0.944941    
## Top10perc     30.63347    6.46730   4.737 3.11e-06 ***
## Top25perc     -1.30594    5.41889  -0.241 0.809691    
## F.Undergrad   -0.02477    0.04870  -0.509 0.611272    
## P.Undergrad    0.01377    0.04857   0.283 0.777000    
## Outstate      -0.03796    0.02346  -1.618 0.106586    
## Room.Board     0.09850    0.05838   1.687 0.092395 .  
## Books         -0.02504    0.29345  -0.085 0.932056    
## Personal       0.00407    0.08572   0.047 0.962154    
## PhD           -4.49625    5.67164  -0.793 0.428429    
## Terminal      -3.70106    6.29933  -0.588 0.557206    
## S.F.Ratio     -1.78231   15.16046  -0.118 0.906478    
## perc.alumni  -11.59185    5.71617  -2.028 0.043290 *  
## Expend         0.05372    0.01513   3.550 0.000435 ***
## Grad.Rate     10.68390    4.17454   2.559 0.010887 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 899.8 on 368 degrees of freedom
## Multiple R-squared:  0.9352, Adjusted R-squared:  0.9322 
## F-statistic: 312.5 on 17 and 368 DF,  p-value: < 2.2e-16

report the test error obtained.

pred_college = predict(college_lm, college_test) ##predict applications
actual_preds = data.frame(cbind(actuals=college_test$Apps,predicts=pred_college)) ##actual predictions from test with predicted predictions
head(actual_preds)

##                              actuals   predicts
## Abilene Christian University    1660 1521.14916
## Adrian College                  1428 1192.98773
## Agnes Scott College              417 1670.95214
## Alaska Pacific University        193 -144.65177
## Albertson College                587  926.86015
## Albertus Magnus College          353   60.43043

testerror = mean((actual_preds$actuals-actual_preds$predicts)^2)
testerror

## [1] 1611884

Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

set.seed(22)
college_x=model.matrix(Apps~.,College[,-1])
y=College$Apps
y_test=y[test]
grid=10^seq(10,-2,length=100) 
ridge.mod=glmnet(college_x[train,],y[train],lambda = grid, alpha = 0) ##0 indicates ridge regression 1 indicates lasso

using cross-validation to choose tuning parameter for lambda, default is 10 folds

set.seed(22)
cv.out = cv.glmnet(college_x[train,],y[train],alpha = 0)
plot(cv.out)

Get the best lambda

bestlam=cv.out$lambda.min
bestlam

## [1] 327.9068

Predict for ridge regression model Test error for ridge regression model

pred_ridge=predict(ridge.mod,s=bestlam,newx=college_x[test,])
mean((pred_ridge-y_test)^2)

## [1] 2396957

Fit a lasso model on the training set, with λ chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

lasso.mod=glmnet(college_x[train,],y[train],lambda = grid, alpha = 1) ##1 indicates lasso
plot(lasso.mod)

## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

using cross-validation to choose tuning parameter for lambda, default is 10 folds

set.seed(22)
cv.outlasso = cv.glmnet(college_x[train,],y[train],alpha = 1)
plot(cv.outlasso)

Get the best lambda

bestlamlasso=cv.outlasso$lambda.min
bestlamlasso

## [1] 21.57407

Predict for lasso model Test error for lasso model

pred_lasso=predict(lasso.mod,s=bestlamlasso,newx=college_x[test,])
mean((pred_lasso-y_test)^2)

## [1] 1642011

outC=glmnet(college_x,y,alpha=1,lambda=grid)
lasso.coef=predict(outC,type="coefficients",s= bestlamlasso) [1:18,]
lasso.coef

##   (Intercept)   (Intercept)        Accept        Enroll     Top10perc 
## -1.012324e+03  0.000000e+00  1.460466e+00 -1.480273e-01  3.343630e+01 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
## -2.430620e+00  0.000000e+00  2.870927e-02 -7.888266e-02  1.109541e-01 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##  0.000000e+00  1.913308e-03 -3.889824e+00 -1.556056e+00  1.129338e+01 
##   perc.alumni        Expend     Grad.Rate 
## -2.175181e+00  7.305924e-02  4.573781e+00

lasso.coef[lasso.coef!=0]  ##(14 non-zero coefs)

##   (Intercept)        Accept        Enroll     Top10perc     Top25perc 
## -1.012324e+03  1.460466e+00 -1.480273e-01  3.343630e+01 -2.430620e+00 
##   P.Undergrad      Outstate    Room.Board      Personal           PhD 
##  2.870927e-02 -7.888266e-02  1.109541e-01  1.913308e-03 -3.889824e+00 
##      Terminal     S.F.Ratio   perc.alumni        Expend     Grad.Rate 
## -1.556056e+00  1.129338e+01 -2.175181e+00  7.305924e-02  4.573781e+00

Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(22)
college_pcr = pcr(Apps ~.,data = College[train,], scale=TRUE, validation="CV")
summary(college_pcr) ##16, 17, and 9 have the lowest CV errors, 16 and 17 is pretty much same as lm full

## Data:    X dimension: 386 17 
##  Y dimension: 386 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3461     3475     1708     1707     1257     1245     1203
## adjCV         3461     3475     1705     1707     1231     1240     1199
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1199     1153     1113      1117      1121      1126      1132
## adjCV     1196     1142     1110      1114      1118      1123      1128
##        14 comps  15 comps  16 comps  17 comps
## CV         1132      1149     955.9     954.3
## adjCV      1128      1145     952.0     950.3
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X     33.4178    58.26    66.04    72.12    77.40    82.27    85.76    88.73
## Apps   0.1637    76.63    76.66    87.99    88.01    88.76    88.86    89.82
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       91.53     93.63     95.57     97.18     98.14     98.94     99.50
## Apps    90.47     90.50     90.51     90.53     90.54     90.61     90.63
##       16 comps  17 comps
## X        99.85    100.00
## Apps     93.47     93.52

validationplot(college_pcr,val.type = "MSEP")

Prediction and mean test error

college_pcr_predict = predict(college_pcr, College[test,],ncomp = 9)
mean((college_pcr_predict - College$Apps[test])^2)

## [1] 3173508

Fit a PLS model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(22)
college_pls = plsr(Apps ~.,data = College[train,], scale=TRUE, validation="CV")
summary(college_pls) ## 12 has the lowest CV 953.7, the "break" is at 8/9 mark

## Data:    X dimension: 386 17 
##  Y dimension: 386 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3461     1499     1121     1106     1068     1022    985.6
## adjCV         3461     1496     1111     1105     1062     1007    977.6
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       969.7    957.4    954.9     954.1     954.2     953.7       954
## adjCV    963.9    953.0    950.7     950.1     950.2     949.8       950
##        14 comps  15 comps  16 comps  17 comps
## CV        954.3     954.2     954.2     954.3
## adjCV     950.3     950.2     950.2     950.3
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       24.69    32.88    63.85    67.24    70.74    75.12    80.07     82.2
## Apps    82.32    90.16    90.73    91.74    92.73    93.36    93.47     93.5
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       84.62     87.54     91.14     92.91     94.53     96.38     97.27
## Apps    93.51     93.52     93.52     93.52     93.52     93.52     93.52
##       16 comps  17 comps
## X        98.67    100.00
## Apps     93.52     93.52

validationplot(college_pls,val.type = "MSEP")

Prediction and mean test error

college_pls_predict = predict(college_pls, College[test,],ncomp = 12)
mean((college_pls_predict - College$Apps[test])^2)

## [1] 1621018

Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among test errors resulting from these five approaches? MSE: MSE R2 LM: 1611884 0.9322 (full model) Ridge: 2396957
Lasso: 1642011 14 non-zero coefficients PCR: 3173508 95.57 variance x explained, 90.50 Apps explained (9 comps) PLS: 1621018 92.91 variance x explained, 93.52 Apps explained (12 comps)

LM, PLS and Lasso appear to be the best approaches based on MSE. More than 90% of the variation in the Apps variable (target) can be explained by the model coefficients.

##11. We will now try to predict per capita crime rate in the Boston data set.

#(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Look at the data

View(Boston)
names(Boston) ## "crim" target variable

##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

?Boston

## starting httpd help server ... done

dim(Boston) ##506 rows, 14 variables

## [1] 506  14

Check for missing values

sum(is.na(Boston)) ##no missing values

## [1] 0

Split the data set into a training set and a test set.

set.seed(11)
trainB=sample(c(TRUE,FALSE), nrow(Boston),rep=TRUE)
testB=(!trainB)
Boston_train = Boston[trainB,] ##250/14
Boston_test = Boston[!trainB,] ##256/14

ridge regression model

set.seed(11)
boston_x=model.matrix(Boston$crim ~.,Boston[,-1])
yB=Boston$crim
y_testB=yB[testB]
grid=10^seq(10,-2,length=100) 
ridge.modB=glmnet(boston_x[trainB,],yB[trainB],lambda = grid, alpha = 0) ##0 indicates ridge regression 1 indicates lasso

using cross-validation to choose tuning parameter for lambda, default is 10 folds

set.seed(11)
cv.outB = cv.glmnet(boston_x[trainB,],yB[trainB],alpha = 0)
plot(cv.outB)

Get the best lambda

bestlamRB=cv.outB$lambda.min
bestlamRB

## [1] 0.8850296

Predict for ridge regression model Test error for ridge regression model

pred_ridgeB=predict(ridge.modB,s=bestlamRB,newx=boston_x[testB,])
mean((pred_ridgeB-y_testB)^2)

## [1] 55.34372

lasso model on the training set

lasso.modB=glmnet(boston_x[trainB,],yB[trainB],lambda = grid, alpha = 1) ##1 indicates lasso
plot(lasso.modB)

## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
## collapsing to unique 'x' values

using cross-validation to choose tuning parameter for lambda, default is 10 folds

set.seed(11)
cv.outlassoB = cv.glmnet(boston_x[trainB,],yB[trainB],alpha = 1)
plot(cv.outlassoB)

Get the best lambda

bestlamlassoB=cv.outlassoB$lambda.min
bestlamlassoB

## [1] 0.1345166

Predict for lasso model Test error for lasso model

pred_lassoB=predict(lasso.modB,s=bestlamlassoB,newx=boston_x[testB,])
mean((pred_lassoB-y_testB)^2)

## [1] 55.75482

PCR model on the training set

set.seed(11)
boston_pcr = pcr(crim ~.,data = Boston[trainB,], scale=TRUE, validation="CV")
summary(boston_pcr) ##10, 13, and 12 have the lowest CV errors, 12 and 13 is pretty much same as lm full

## Data:    X dimension: 250 13 
##  Y dimension: 250 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.319    6.736    6.750    6.070    6.056    6.059    6.183
## adjCV        8.319    6.733    6.748    6.059    6.049    6.051    6.168
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.189    6.089    5.972     5.880     5.894     5.923     5.882
## adjCV    6.173    6.074    5.954     5.855     5.873     5.902     5.860
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       48.60    60.85    70.30    77.58    83.82    88.45    91.63    94.04
## crim    35.03    35.10    48.85    49.06    49.30    49.74    49.77    51.66
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.83     97.28     98.52      99.4     100.0
## crim    53.77     55.26     55.48      55.5      56.3

validationplot(boston_pcr,val.type = "MSEP")

Prediction and mean test error

boston_pcr_predict = predict(boston_pcr, Boston[testB,],ncomp = 10)
mean((boston_pcr_predict - Boston$crim[testB])^2)

## [1] 58.70196

PLS model on the training set

set.seed(11)
boston_pls = plsr(crim ~.,data = Boston[trainB,], scale=TRUE, validation="CV")
summary(boston_pls) ##10, has the lowest CV errors, after that it is the same CV error

## Data:    X dimension: 250 13 
##  Y dimension: 250 1
## Fit method: kernelpls
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.319    6.478    6.000    5.976    5.905    5.867    5.867
## adjCV        8.319    6.474    5.987    5.945    5.883    5.849    5.848
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       5.889    5.888    5.883     5.882     5.882     5.882     5.882
## adjCV    5.867    5.866    5.861     5.861     5.861     5.860     5.860
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.99    57.80    61.09    69.83    76.99    81.06    85.16    88.80
## crim    40.72    52.18    55.43    55.87    56.07    56.20    56.26    56.29
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       92.98     95.09      97.7     98.92     100.0
## crim    56.30     56.30      56.3     56.30      56.3

validationplot(boston_pls,val.type = "MSEP")

Prediction and mean test error

boston_pls_predict = predict(boston_pls, Boston[testB,],ncomp = 10)
mean((boston_pls_predict - Boston$crim[testB])^2)

## [1] 57.45414

Subset approach

boston_subset_best = regsubsets(crim ~., data = Boston[trainB,],nvmax=13)

test_boston_mat = model.matrix(crim ~.,data = Boston[testB,])

val.errors =rep(NA ,13)
for(i in 1:13){
 coefi=coef(boston_subset_best ,id=i)
 pred_boston=test_boston_mat[,names(coefi)]%*%coefi
 val.errors[i]=mean(( Boston$crim[testB]-pred_boston)^2)
}

val.errors

##  [1] 52.41811 56.28508 57.47860 60.33110 56.91372 57.49680 57.65316 57.49860
##  [9] 57.60281 57.41812 56.85848 57.47398 57.44183

which.min(val.errors)

## [1] 1

coef(boston_subset_best,1)

## (Intercept)         rad 
##  -2.4729812   0.6643563

predict method for subset

predict.regsubsets=function(object , newdata ,id ,...){
 form=as.formula (object$call [[2]])
 mat=model.matrix(form ,newdata )
 coefi=coef(object ,id=id)
 xvars=names(coefi)
 mat[,xvars]%*%coefi
}

boston_subset_best=regsubsets(crim~.,data=Boston,nvmax=13)
coef(boston_subset_best ,1)

## (Intercept)         rad 
##  -2.2871594   0.6179109

k=10
set.seed(11)
folds=sample (1:k,nrow(Boston),replace=TRUE)
cv.errors=matrix(NA,k,13, dimnames=list(NULL,paste(1:13)))

for(j in 1:k){
 boston_subset_best=regsubsets (crim ~.,data=Boston[folds!=j,],nvmax=13)
 for(i in 1:13){
  predB=predict (boston_subset_best ,Boston [folds ==j,],id=i)
  cv.errors[j,i]= mean( ( Boston$crim[ folds==j]-predB)^2)
 }
}

mean.cv.errors=apply(cv.errors ,2, mean)
mean.cv.errors ## 2 variable or 12 variable model

##        1        2        3        4        5        6        7        8 
## 44.53966 42.93213 44.30954 44.15914 44.15057 44.18860 43.72870 43.45615 
##        9       10       11       12       13 
## 43.35866 43.45879 43.18816 42.93844 43.03094

par(mfrow=c(1,1))
plot(mean.cv.errors ,type='b')

boston_subset_best=regsubsets(crim~.,data=Boston,nvmax=13)
coef(boston_subset_best ,12) ##12 variable model

##   (Intercept)            zn         indus          chas           nox 
##  16.985713928   0.044673247  -0.063848469  -0.744367726 -10.202169211 
##            rm           dis           rad           tax       ptratio 
##   0.439588002  -0.993556631   0.587660185  -0.003767546  -0.269948860 
##         black         lstat          medv 
##  -0.007518904   0.128120290  -0.198877768

boston_subset_best=regsubsets(crim~.,data=Boston,nvmax=13)
coef(boston_subset_best ,2) ##2 variable model

## (Intercept)         rad       lstat 
##  -4.3814053   0.5228128   0.2372846

#11 (a) Present and discuss results for the approaches that you consider.

MSE: MSE R2 Ridge: 55.343
Lasso: 55.754 PCR: 58.701 97.28 variance x explained, 55.26 crim (per capita crime rate) explained (10 comps) PLS: 57.454 95.09 variance x explained, 53.30 crim (per capita crime rate) explained (10 comps) Subset: 42.932 2 variable model, rad and lstat

Propose a model (or set of models) that seems to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross validation, or some other reasonable alternative, as opposed to using training error. I tried 5 different approachs and the subset approach produced the lowest MSE. Used CV and train/test data sets (validation approach).
Does your chosen model involve all of the features in the data set? Why or why not No it does not involve all of the variables. Subset approach had the lowest MSE of the different model approaches, based on the lab example from class (which I followed) ran through the cross validation approach and decided on a 2 variable model (rad and lstat).
A 10 variable model subset approach produced the loses MSE, however 2 variable model is much easier to work with than a 10 variable model