If there are ‘p’ predictors, this algorithm searches all the 2^p subsets to select the best set of predictors which reduces the R-Squared (for Regression) and Deviance (for Classification). This method has a drawback. It is computationally Intensive.
We will use the ‘Boston’ dataset to understand this method.
library(ISLR)
library(MASS)
fix(Boston)
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
We will use ‘regsubsets’ method of ‘leaps’ package to fit a linear regression model to estimate ‘medv’ variable.
library(leaps)
We have 13 predictors and one response. We dont want to use all the 13 predictors for our model. We will use the best subset of 5 predictors.
best_model <- regsubsets(medv ~., data = Boston, nvmax = 5)
The syntax is similar to ‘lm’ except we use another variable ‘nvmax’ which specifies the maximum number of variables to be used for our best subset of predictors. Let us use ‘summary’ function to further understand our model.
summary(best_model)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = Boston, nvmax = 5)
## 13 Variables (and intercept)
## Forced in Forced out
## crim FALSE FALSE
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: exhaustive
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*" " " "*"
## 4 ( 1 ) " " " " " " " " " " "*" " " "*" " " " " "*" " " "*"
## 5 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*" " " "*"
So now we want a subset of 5 best predictors. In the last part of summary look at the 5th row. Variables that has asterisk are the predictors that we are looking for. So in this example, they are ‘nox’, ‘rm’, ‘dis’, ‘ptratio’ and ‘lstat’.
Running ‘names’ on summary, gives us more details about our model.
names(summary(best_model))
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
Let us look at adjusted r squared of our model.
summary(best_model)$adjr2
## [1] 0.5432418 0.6371245 0.6767036 0.6878351 0.7051702
We get 5 different Adjusted R Squared values. The first value is adjuted r squared of model that has 1 best predictor. The second is for model with 2 best predictors and so on.
We will find out the model which has maximum adjusted r squared using ‘which’ function in ‘summary’
which.max(summary(best_model)$adjr2)
## [1] 5
This says that model with 5 predictors has maximum adjusted r squared. We can use ‘coef’ function to get the coefficients of the predictors. We also have to specify the number of predictors also.
coef(best_model, 5)
## (Intercept) nox rm dis ptratio lstat
## 37.4991961 -17.9965715 4.1633074 -1.1846623 -1.0457738 -0.5810836
This model has same goal as the best subset method, but the algorithm and variables selected are different. First a model is created with no predictors. Then a predictor is added to the model which increases its adjusted r square. Next another predictor is added to further increase our adjusted R square. This process is continued till the r squared reaches the maximum value.
For code, the syntax and process is similar to ‘Best Subset’ method, except we add another term ‘method’ to define ‘forward selection method’.
library(MASS)
library(ISLR)
library(leaps)
fix(Boston)
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
fwd_model <- regsubsets(medv ~., data = Boston, nvmax = 5, method = 'forward')
summary(fwd_model)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = Boston, nvmax = 5, method = "forward")
## 13 Variables (and intercept)
## Forced in Forced out
## crim FALSE FALSE
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: forward
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*" " " "*"
## 4 ( 1 ) " " " " " " " " " " "*" " " "*" " " " " "*" " " "*"
## 5 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*" " " "*"
names(summary(fwd_model))
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
Now let us look at agjusted r squared of our model.
summary(fwd_model)$adjr2
## [1] 0.5432418 0.6371245 0.6767036 0.6878351 0.7051702
We get 5 different Adjusted R Squared values. The first value is adjuted r squared of model that has 1 best predictor. The second is for model with 2 best predictors and so on.
We will find out the model which has maximum adjusted r squared using ‘which’ function in ‘summary’
which.max(summary(fwd_model)$adjr2)
## [1] 5
This says that model with 5 predictors has maximum adjusted r squared. We can use ‘coef’ function to get the coefficients of the predictors. We also have to specify the number of predictors also.
coef(best_model, 5)
## (Intercept) nox rm dis ptratio lstat
## 37.4991961 -17.9965715 4.1633074 -1.1846623 -1.0457738 -0.5810836
This model has same goal as the forward stepwise selection method. First a model is created with all predictors. Then a predictor is removed from the model which increases its adjusted r square. Next another predictor is removed from the model to further increase our adjusted R square. This process is continued till the r squared reaches the maximum value.
For code, the syntax and process is similar to ‘Forward Stepwise Selection’ method, except we replace ‘forward’ with ‘backward’ in the ‘method’ variable of regsubsets.
library(MASS)
library(ISLR)
library(leaps)
fix(Boston)
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
bwd_model <- regsubsets(medv ~., data = Boston, nvmax = 5, method = 'backward')
summary(bwd_model)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = Boston, nvmax = 5, method = "backward")
## 13 Variables (and intercept)
## Forced in Forced out
## crim FALSE FALSE
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: backward
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*" " " "*"
## 4 ( 1 ) " " " " " " " " " " "*" " " "*" " " " " "*" " " "*"
## 5 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*" " " "*"
names(summary(bwd_model))
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
Now let us look at agjusted r squared of our model.
summary(bwd_model)$adjr2
## [1] 0.5432418 0.6371245 0.6767036 0.6878351 0.7051702
We get 5 different Adjusted R Squared values. The first value is adjuted r squared of model that has 1 best predictor. The second is for model with 2 best predictors and so on.
We will find out the model which has maximum adjusted r squared using ‘which’ function in ‘summary’
which.max(summary(bwd_model)$adjr2)
## [1] 5
This says that model with 5 predictors has maximum adjusted r squared. We can use ‘coef’ function to get the coefficients of the predictors. We also have to specify the number of predictors also.
coef(bwd_model, 5)
## (Intercept) nox rm dis ptratio lstat
## 37.4991961 -17.9965715 4.1633074 -1.1846623 -1.0457738 -0.5810836
This is a dimension reduction technique. In this technique,new predictors called Princial Components are formed as a linear combination of original predictors. These new predictors are used in linear regression. These lines are nothing but the new vectors that are formed which explain the variance in the predictors.
This regression is done using ‘pcr()’ function in ‘pls’ library. We will do this regression in Boston Dataset to predict ‘medv’ variable using other predictors.
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
library(ISLR)
library(MASS)
fix(Boston)
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
Boston_x <- head(Boston, 300)
The PCR has similar syntax to that of ‘lm’, but we add another two parameters to it. ‘scale = TRUE’ standardizes the variables and ‘validation = ’CV’‘, uses Cross - Validation to calculate Root Mean Square Error. The model can be further explored by ’summary’ command.
pcr_model <- pcr(medv ~., data = Boston_x, scale = T, validation = "CV")
summary(pcr_model)
## Data: X dimension: 300 13
## Y dimension: 300 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.902 7.837 5.330 4.423 4.303 3.869 3.879
## adjCV 8.902 7.831 5.316 4.366 4.296 3.857 3.874
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 3.774 3.769 3.669 3.634 3.648 3.315 3.295
## adjCV 3.765 3.760 3.660 3.624 3.645 3.304 3.284
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 37.78 51.39 60.81 69.65 76.93 82.53 87.70
## medv 23.76 65.20 76.53 77.22 81.90 81.91 83.18
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 91.26 93.78 96.00 97.73 99.09 100.00
## medv 83.25 84.35 84.68 84.69 87.54 87.76
We can plot Cross Validation errors, using ‘validationplot’ command. We give type = “RMSEP” to plot the CV RMSE.
validationplot(pcr_model, val.type = "RMSEP")
In this figure, we see that CV RMSE is least when all the 13 components are used, which means there is no improvement compared to least squares method.
We can make predictions on a test set using the predict function. We will use the last 206 observations of Boston dataset as our test data.
Boston_test <- tail(Boston, 206)
To make our predictions, we use predict function. In ‘predict’ function, we pass the model we are using to make predictions, the dataset on which the predictions are to be made, and the number of components to be used.
pcr_pred <- predict(pcr_model, Boston_test, ncomp = 7)
Now let us calculate the MSE of our test data.
mean((Boston_test$medv - pcr_pred)^2)
## [1] 106.9002
Partial least squares is similar to Principal Component Regression. In PCR, the components explain the variance in Predictors only. In Partial Least Squares, the variance in response variable is also explained. We use the ‘plsr’ function from the ‘pls’ library and Boston data.
library(pls)
plsr_model <- plsr(medv ~., data = Boston_x, scale = T, validation = "CV")
summary(plsr_model)
## Data: X dimension: 300 13
## Y dimension: 300 1
## Fit method: kernelpls
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.902 5.695 3.849 3.589 3.440 3.384 3.373
## adjCV 8.902 5.692 3.835 3.576 3.424 3.369 3.358
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 3.365 3.367 3.366 3.365 3.365 3.365 3.365
## adjCV 3.351 3.353 3.351 3.351 3.351 3.351 3.351
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 32.55 50.36 58.98 63.63 69.13 73.75 77.64
## medv 61.33 83.27 85.81 87.23 87.65 87.73 87.76
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 85.28 87.57 90.34 93.48 98.23 100.00
## medv 87.76 87.76 87.76 87.76 87.76 87.76
We can plot Cross Validation errors, using ‘validationplot’ command. We give type = “RMSEP” to plot the CV RMSE.
validationplot(plsr_model, val.type = "RMSEP")
In this figure, we see that CV RMSE is least when all the 13 components are used, which means there is no improvement compared to least squares method.
We can make predictions on a test set using the predict function. We will use the last 206 observations of Boston dataset as our test data.
Boston_test <- tail(Boston, 206)
To make our predictions, we use predict function. In ‘predict’ function, we pass the model we are using to make predictions, the dataset on which the predictions are to be made, and the number of components to be used.
plsr_pred <- predict(plsr_model, Boston_test, ncomp = 2)
Now let us calculate the MSE of our test data.
mean((Boston_test$medv - plsr_pred)^2)
## [1] 76.27682