Partial Least Squares

In this blog, I am going to discuss what Partial Least Squares (PLS) is and why it can be a better regression technique in some situations. PLS regression reduces the predictors to a smaller set of uncorrelated components and then regresses the response on those components. This is very beneficial when there is high multicollinearity among the predictors, and it also reduces overfitting when there are more predictors than observations. It is a widely used technique throughout the industry.

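To make the idea concrete, here is a minimal sketch (separate from the analysis below) that fits PLS on simulated collinear data with more predictors than observations, using the pls package directly. The simulated variables (z, X_sim, y_sim) are made up for illustration and are not part of the permeability data.

library(pls)
set.seed(42)
n <- 30; p <- 100                        # more predictors than observations
z <- rnorm(n)                            # shared latent signal
X_sim <- matrix(rnorm(n * p), n, p) + z  # columns are mutually correlated
y_sim <- 2 * z + rnorm(n, sd = 0.5)      # response driven by the same signal
fit <- plsr(y_sim ~ X_sim, ncomp = 5, validation = "CV")
summary(fit)                             # CV error by number of components

Ordinary least squares cannot be fit directly here because p > n, while PLS compresses the 100 correlated columns into a handful of components.
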
In this example, I take the permeability data from the AppliedPredictiveModeling library in R. I first prepared the data using caret's nearZeroVar function, which removed 719 near-zero-variance columns. After that I split the data into training and test datasets so that PLS could be tuned on the training data and then evaluated on the test data. Cross-validation selected ncomp = 6 with an RMSE of 11.42, and the tuned model gave an RMSE of 12.80 on the test dataset. Using caret's varImp function, X6 came out as the most important predictor, followed by X372, X373, and X15, with X244, X253, X157, X245, X240, X246, X239, and X254 tied close behind.

library(AppliedPredictiveModeling)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
data(permeability)
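
Before filtering, a quick look at the raw dimensions helps: fingerprints is the matrix of binary molecular descriptors and permeability is the response we want to predict.

dim(fingerprints)  # 165 compounds x 1107 binary fingerprint predictors
dim(permeability)  # 165 x 1 matrix of permeability values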

# Preparation
low_frequency <- nearZeroVar(fingerprints) # low frequencies using nearZeroVar function
X <- fingerprints[,-low_frequency] # Removing the low frequencies
print(paste0(dim(X)[2], " columns are left after removing 719 columns using nearZeroVar function"))
## [1] "388 columns are left after removing 719 columns using nearZeroVar function"
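
If you want to see why those columns were flagged, nearZeroVar can return its per-column diagnostics instead of just the indices; a short sketch:

# saveMetrics = TRUE returns the frequency ratio and percent of unique
# values per column, plus the zeroVar/nzv flags used for filtering
nzv_metrics <- nearZeroVar(fingerprints, saveMetrics = TRUE)
head(nzv_metrics[nzv_metrics$nzv, ])  # a few of the 719 flagged columns
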
# Splitting the data into training and test
splitt <- createDataPartition(permeability, p=0.8, list=FALSE)

# Training
X_train <- X[splitt, ]
y_train <- permeability[splitt, ]

# Test
X_test <- X[-splitt, ]
y_test <- permeability[-splitt, ]
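
Note that createDataPartition samples at random, so putting set.seed (any fixed value works) before the split makes these results reproducible. A quick check that the split came out as expected:

dim(X_train); length(y_train)  # 133 training rows after the 80% split
dim(X_test);  length(y_test)   # the remaining rows held out for testing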

# PLS Method
# tuneLength=20 tries ncomp = 1 to 20 with 10-fold CV; RMSE picks the model
model_pls <- train(X_train, y_train, method='pls', metric='RMSE',
                   tuneLength=20, trControl = trainControl(method='cv'),
                   preProcess= c('center','scale'))
model_pls
## Partial Least Squares 
## 
## 133 samples
## 388 predictors
## 
## Pre-processing: centered (388), scaled (388) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 119, 119, 121, 120, 121, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     13.10975  0.3665169  10.155169
##    2     11.99313  0.4847670   8.542758
##    3     11.68967  0.4903324   8.767831
##    4     11.71347  0.4763350   8.663431
##    5     11.43567  0.5093898   8.338799
##    6     11.42011  0.5222302   8.461767
##    7     11.54132  0.5094581   8.581878
##    8     11.67392  0.5098131   8.845636
##    9     11.91191  0.5031517   9.059841
##   10     12.04174  0.5098257   8.856724
##   11     12.32517  0.4898875   9.101895
##   12     12.52219  0.4732587   9.435517
##   13     12.77392  0.4557891   9.613748
##   14     12.82544  0.4546280   9.617036
##   15     13.15447  0.4370187   9.937822
##   16     13.32949  0.4238000   9.991967
##   17     13.42058  0.4179085  10.142765
##   18     13.48801  0.4074714  10.138794
##   19     13.27041  0.4222705   9.743078
##   20     13.30562  0.4280119   9.827343
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.
# Plot
plot(model_pls)

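The plot shows the cross-validated RMSE against the number of components, bottoming out at ncomp = 6. The same information can be pulled out of the train object directly; a sketch:

model_pls$bestTune                                  # ncomp chosen by CV
subset(model_pls$results, ncomp == model_pls$bestTune$ncomp)
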
# important variables
varImp(model_pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
## pls variable importance
## 
##   only 20 most important variables shown (out of 388)
## 
##      Overall
## X6    100.00
## X372   71.66
## X373   71.66
## X15    58.87
## X244   55.81
## X253   55.81
## X157   55.81
## X245   55.81
## X240   55.81
## X246   55.81
## X239   55.81
## X254   55.81
## X127   52.77
## X126   52.77
## X129   48.59
## X235   46.55
## X266   43.86
## X262   43.86
## X130   43.44
## X125   43.44
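
caret also provides a plot method for varImp objects, which is often easier to read than the table; a sketch limited to the top twenty scores:

plot(varImp(model_pls), top = 20)  # dotplot of the 20 highest importances
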
# Checking the accuracy on test dataset
postResample(predict(model_pls, X_test), obs=y_test)
##      RMSE  Rsquared       MAE 
## 12.798513  0.261938  9.225175
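
The test RMSE (12.80) is higher than the cross-validated RMSE (11.42) and the test Rsquared is lower, which suggests some optimism in the resampling estimate. A simple diagnostic is to plot predicted against observed permeability on the held-out set; a sketch:

pred <- predict(model_pls, X_test)
plot(y_test, pred, xlab = "Observed permeability", ylab = "Predicted permeability")
abline(0, 1, lty = 2)  # reference line: perfect predictions fall on y = x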