Best Subset Selection

If there are ‘p’ predictors, this algorithm searches all the 2^p subsets to select the best set of predictors which reduces the R-Squared (for Regression) and Deviance (for Classification). This method has a drawback. It is computationally Intensive.

We will use the ‘Boston’ dataset to understand this method.

library(ISLR)
library(MASS)
fix(Boston)
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

We will use ‘regsubsets’ method of ‘leaps’ package to fit a linear regression model to estimate ‘medv’ variable.

library(leaps)

We have 13 predictors and one response. We dont want to use all the 13 predictors for our model. We will use the best subset of 5 predictors.

best_model <- regsubsets(medv ~., data = Boston, nvmax = 5)

The syntax is similar to ‘lm’ except we use another variable ‘nvmax’ which specifies the maximum number of variables to be used for our best subset of predictors. Let us use ‘summary’ function to further understand our model.

summary(best_model)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = Boston, nvmax = 5)
## 13 Variables  (and intercept)
##         Forced in Forced out
## crim        FALSE      FALSE
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
##          crim zn  indus chas nox rm  age dis rad tax ptratio black lstat
## 1  ( 1 ) " "  " " " "   " "  " " " " " " " " " " " " " "     " "   "*"  
## 2  ( 1 ) " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   "*"  
## 3  ( 1 ) " "  " " " "   " "  " " "*" " " " " " " " " "*"     " "   "*"  
## 4  ( 1 ) " "  " " " "   " "  " " "*" " " "*" " " " " "*"     " "   "*"  
## 5  ( 1 ) " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     " "   "*"

So now we want a subset of 5 best predictors. In the last part of summary look at the 5th row. Variables that has asterisk are the predictors that we are looking for. So in this example, they are ‘nox’, ‘rm’, ‘dis’, ‘ptratio’ and ‘lstat’.

Running ‘names’ on summary, gives us more details about our model.

names(summary(best_model))
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

Let us look at adjusted r squared of our model.

summary(best_model)$adjr2
## [1] 0.5432418 0.6371245 0.6767036 0.6878351 0.7051702

We get 5 different Adjusted R Squared values. The first value is adjuted r squared of model that has 1 best predictor. The second is for model with 2 best predictors and so on.

We will find out the model which has maximum adjusted r squared using ‘which’ function in ‘summary’

which.max(summary(best_model)$adjr2)
## [1] 5

This says that model with 5 predictors has maximum adjusted r squared. We can use ‘coef’ function to get the coefficients of the predictors. We also have to specify the number of predictors also.

coef(best_model, 5)
## (Intercept)         nox          rm         dis     ptratio       lstat 
##  37.4991961 -17.9965715   4.1633074  -1.1846623  -1.0457738  -0.5810836

Forward Stepwise Selection

This model has same goal as the best subset method, but the algorithm and variables selected are different. First a model is created with no predictors. Then a predictor is added to the model which increases its adjusted r square. Next another predictor is added to further increase our adjusted R square. This process is continued till the r squared reaches the maximum value.

For code, the syntax and process is similar to ‘Best Subset’ method, except we add another term ‘method’ to define ‘forward selection method’.

library(MASS)
library(ISLR)
library(leaps)
fix(Boston)
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
fwd_model <- regsubsets(medv ~., data = Boston, nvmax = 5, method = 'forward')
summary(fwd_model)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = Boston, nvmax = 5, method = "forward")
## 13 Variables  (and intercept)
##         Forced in Forced out
## crim        FALSE      FALSE
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: forward
##          crim zn  indus chas nox rm  age dis rad tax ptratio black lstat
## 1  ( 1 ) " "  " " " "   " "  " " " " " " " " " " " " " "     " "   "*"  
## 2  ( 1 ) " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   "*"  
## 3  ( 1 ) " "  " " " "   " "  " " "*" " " " " " " " " "*"     " "   "*"  
## 4  ( 1 ) " "  " " " "   " "  " " "*" " " "*" " " " " "*"     " "   "*"  
## 5  ( 1 ) " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     " "   "*"
names(summary(fwd_model))
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

Now let us look at agjusted r squared of our model.

summary(fwd_model)$adjr2
## [1] 0.5432418 0.6371245 0.6767036 0.6878351 0.7051702

We get 5 different Adjusted R Squared values. The first value is adjuted r squared of model that has 1 best predictor. The second is for model with 2 best predictors and so on.

We will find out the model which has maximum adjusted r squared using ‘which’ function in ‘summary’

which.max(summary(fwd_model)$adjr2)
## [1] 5

This says that model with 5 predictors has maximum adjusted r squared. We can use ‘coef’ function to get the coefficients of the predictors. We also have to specify the number of predictors also.

coef(best_model, 5)
## (Intercept)         nox          rm         dis     ptratio       lstat 
##  37.4991961 -17.9965715   4.1633074  -1.1846623  -1.0457738  -0.5810836

Backward Stepwise Selection

This model has same goal as the forward stepwise selection method. First a model is created with all predictors. Then a predictor is removed from the model which increases its adjusted r square. Next another predictor is removed from the model to further increase our adjusted R square. This process is continued till the r squared reaches the maximum value.

For code, the syntax and process is similar to ‘Forward Stepwise Selection’ method, except we replace ‘forward’ with ‘backward’ in the ‘method’ variable of regsubsets.

library(MASS)
library(ISLR)
library(leaps)
fix(Boston)
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
bwd_model <- regsubsets(medv ~., data = Boston, nvmax = 5, method = 'backward')
summary(bwd_model)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = Boston, nvmax = 5, method = "backward")
## 13 Variables  (and intercept)
##         Forced in Forced out
## crim        FALSE      FALSE
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: backward
##          crim zn  indus chas nox rm  age dis rad tax ptratio black lstat
## 1  ( 1 ) " "  " " " "   " "  " " " " " " " " " " " " " "     " "   "*"  
## 2  ( 1 ) " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   "*"  
## 3  ( 1 ) " "  " " " "   " "  " " "*" " " " " " " " " "*"     " "   "*"  
## 4  ( 1 ) " "  " " " "   " "  " " "*" " " "*" " " " " "*"     " "   "*"  
## 5  ( 1 ) " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     " "   "*"
names(summary(bwd_model))
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

Now let us look at agjusted r squared of our model.

summary(bwd_model)$adjr2
## [1] 0.5432418 0.6371245 0.6767036 0.6878351 0.7051702

We get 5 different Adjusted R Squared values. The first value is adjuted r squared of model that has 1 best predictor. The second is for model with 2 best predictors and so on.

We will find out the model which has maximum adjusted r squared using ‘which’ function in ‘summary’

which.max(summary(bwd_model)$adjr2)
## [1] 5

This says that model with 5 predictors has maximum adjusted r squared. We can use ‘coef’ function to get the coefficients of the predictors. We also have to specify the number of predictors also.

coef(bwd_model, 5)
## (Intercept)         nox          rm         dis     ptratio       lstat 
##  37.4991961 -17.9965715   4.1633074  -1.1846623  -1.0457738  -0.5810836

Principal Component Regression

This is a dimension reduction technique. In this technique,new predictors called Princial Components are formed as a linear combination of original predictors. These new predictors are used in linear regression. These lines are nothing but the new vectors that are formed which explain the variance in the predictors.

This regression is done using ‘pcr()’ function in ‘pls’ library. We will do this regression in Boston Dataset to predict ‘medv’ variable using other predictors.

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
library(ISLR)
library(MASS)
fix(Boston)
names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
Boston_x <- head(Boston, 300)

The PCR has similar syntax to that of ‘lm’, but we add another two parameters to it. ‘scale = TRUE’ standardizes the variables and ‘validation = ’CV’‘, uses Cross - Validation to calculate Root Mean Square Error. The model can be further explored by ’summary’ command.

pcr_model <- pcr(medv ~., data = Boston_x, scale = T, validation = "CV")
summary(pcr_model)
## Data:    X dimension: 300 13 
##  Y dimension: 300 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.902    7.837    5.330    4.423    4.303    3.869    3.879
## adjCV        8.902    7.831    5.316    4.366    4.296    3.857    3.874
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       3.774    3.769    3.669     3.634     3.648     3.315     3.295
## adjCV    3.765    3.760    3.660     3.624     3.645     3.304     3.284
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X       37.78    51.39    60.81    69.65    76.93    82.53    87.70
## medv    23.76    65.20    76.53    77.22    81.90    81.91    83.18
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## X       91.26    93.78     96.00     97.73     99.09    100.00
## medv    83.25    84.35     84.68     84.69     87.54     87.76

We can plot Cross Validation errors, using ‘validationplot’ command. We give type = “RMSEP” to plot the CV RMSE.

validationplot(pcr_model, val.type = "RMSEP")

In this figure, we see that CV RMSE is least when all the 13 components are used, which means there is no improvement compared to least squares method.

Predictions

We can make predictions on a test set using the predict function. We will use the last 206 observations of Boston dataset as our test data.

Boston_test <- tail(Boston, 206)

To make our predictions, we use predict function. In ‘predict’ function, we pass the model we are using to make predictions, the dataset on which the predictions are to be made, and the number of components to be used.

pcr_pred <- predict(pcr_model, Boston_test, ncomp = 7)

Now let us calculate the MSE of our test data.

mean((Boston_test$medv - pcr_pred)^2)
## [1] 106.9002

Partial Least Squares

Partial least squares is similar to Principal Component Regression. In PCR, the components explain the variance in Predictors only. In Partial Least Squares, the variance in response variable is also explained. We use the ‘plsr’ function from the ‘pls’ library and Boston data.

library(pls)

plsr_model <- plsr(medv ~., data = Boston_x, scale = T, validation = "CV")
summary(plsr_model)
## Data:    X dimension: 300 13 
##  Y dimension: 300 1
## Fit method: kernelpls
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.902    5.695    3.849    3.589    3.440    3.384    3.373
## adjCV        8.902    5.692    3.835    3.576    3.424    3.369    3.358
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       3.365    3.367    3.366     3.365     3.365     3.365     3.365
## adjCV    3.351    3.353    3.351     3.351     3.351     3.351     3.351
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X       32.55    50.36    58.98    63.63    69.13    73.75    77.64
## medv    61.33    83.27    85.81    87.23    87.65    87.73    87.76
##       8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## X       85.28    87.57     90.34     93.48     98.23    100.00
## medv    87.76    87.76     87.76     87.76     87.76     87.76

We can plot Cross Validation errors, using ‘validationplot’ command. We give type = “RMSEP” to plot the CV RMSE.

validationplot(plsr_model, val.type = "RMSEP")

In this figure, we see that CV RMSE is least when all the 13 components are used, which means there is no improvement compared to least squares method.

Predictions

We can make predictions on a test set using the predict function. We will use the last 206 observations of Boston dataset as our test data.

Boston_test <- tail(Boston, 206)

To make our predictions, we use predict function. In ‘predict’ function, we pass the model we are using to make predictions, the dataset on which the predictions are to be made, and the number of components to be used.

plsr_pred <- predict(plsr_model, Boston_test, ncomp = 2)

Now let us calculate the MSE of our test data.

mean((Boston_test$medv - plsr_pred)^2)
## [1] 76.27682