In this blog, I am going to discuss what Partial Least Squares (PLS) regression is and how it can serve as a better regression technique in some situations. PLS regression reduces the predictors to a smaller set of uncorrelated components and regresses the response on those components. The technique is very beneficial when there is high multicollinearity among the predictors, and it also reduces overfitting when there are more predictors than observations. It is a widely used technique throughout the industry.
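To make that concrete before turning to the real dataset, here is a minimal sketch on simulated data (the variables x1, x2, x3 and the seed are invented for illustration and are not part of the permeability example below): two of the predictors are nearly identical, which would make ordinary least-squares coefficients unstable, while PLS compresses them into a couple of uncorrelated components.
library(pls)
set.seed(1)
n <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01) # nearly a copy of x1 -> severe multicollinearity
x3 <- rnorm(n)
y  <- 2 * x1 + x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)
sketch_fit <- plsr(y ~ x1 + x2 + x3, ncomp = 2, data = dat, validation = "CV")
summary(sketch_fit) # variance explained per component
cor(scores(sketch_fit)[, 1], scores(sketch_fit)[, 2]) # the extracted components are uncorrelated
The correlation between the two extracted components is essentially zero, which is exactly what lets PLS cope with predictors that are themselves highly correlated.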
In this example, I take the permeability data from the AppliedPredictiveModeling library in R. I prepared the data using nearZeroVar, which removed 719 low-frequency columns, and then split the data into training and test datasets to fit PLS on the training data and evaluate it on the test data. Cross-validation selected ncomp = 6 with an RMSE of about 11.42, and the tuned model gave an RMSE of about 12.80 on the test dataset. Using caret's varImp function, X6 came out as the most important predictor, followed by X372, X373, and X15, with X244, X253, X157, X245, X240, X246, X239, and X254 tied just behind them.
library(AppliedPredictiveModeling)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
data(permeability)
# Preparation
low_frequency <- nearZeroVar(fingerprints) # indices of near-zero-variance (low-frequency) columns
X <- fingerprints[,-low_frequency] # Removing the low-frequency columns
print(paste0(dim(X)[2], " columns are left after removing ", length(low_frequency), " columns using nearZeroVar function"))
## [1] "388 columns are left after removing 719 columns using nearZeroVar function"
# Splitting the data into training and test
splitt <- createDataPartition(permeability, p=0.8, list=FALSE)
# Training
X_train <- X[splitt, ]
y_train <- permeability[splitt, ]
# Test
X_test <- X[-splitt, ]
y_test <- permeability[-splitt, ]
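One caveat worth flagging: createDataPartition was called above without setting a seed, so the exact split, and therefore the RMSE values reported below, will vary slightly between runs. A reproducible variant of the partition step would look like this (a sketch; the seed value itself is arbitrary):
set.seed(2023) # any fixed seed makes the partition below reproducible
splitt <- createDataPartition(permeability, p = 0.8, list = FALSE)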
# PLS Method
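# Note: caret expects metric = 'Rsquared' for regression; the misspelled
# 'RSquared' below is not recognized, so caret falls back to RMSE for model
# selection (this is what the warning after the call is telling us).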
model_pls <- train(X_train, y_train, method = 'pls', metric = 'RSquared',
                   tuneLength = 20, trControl = trainControl(method = 'cv'),
                   preProcess = c('center', 'scale'))
## Warning in train.default(X_train, y_train, method = "pls", metric =
## "RSquared", : The metric "RSquared" was not in the result set. RMSE will be used
## instead.
model_pls
## Partial Least Squares
##
## 133 samples
## 388 predictors
##
## Pre-processing: centered (388), scaled (388)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 120, 119, 119, 121, 120, 121, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 13.10975 0.3665169 10.155169
## 2 11.99313 0.4847670 8.542758
## 3 11.68967 0.4903324 8.767831
## 4 11.71347 0.4763350 8.663431
## 5 11.43567 0.5093898 8.338799
## 6 11.42011 0.5222302 8.461767
## 7 11.54132 0.5094581 8.581878
## 8 11.67392 0.5098131 8.845636
## 9 11.91191 0.5031517 9.059841
## 10 12.04174 0.5098257 8.856724
## 11 12.32517 0.4898875 9.101895
## 12 12.52219 0.4732587 9.435517
## 13 12.77392 0.4557891 9.613748
## 14 12.82544 0.4546280 9.617036
## 15 13.15447 0.4370187 9.937822
## 16 13.32949 0.4238000 9.991967
## 17 13.42058 0.4179085 10.142765
## 18 13.48801 0.4074714 10.138794
## 19 13.27041 0.4222705 9.743078
## 20 13.30562 0.4280119 9.827343
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 6.
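Instead of reading the optimum off the printed table, the selected number of components and its cross-validated error can be pulled directly from the train object (a small sketch using the model_pls object fitted above):
model_pls$bestTune$ncomp    # number of components chosen by cross-validation (6 in the run shown)
min(model_pls$results$RMSE) # best cross-validated RMSE (about 11.42 in the run shown)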
# Plot cross-validated RMSE against the number of components
plot(model_pls)
# Most important predictors according to caret's varImp
varImp(model_pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
## pls variable importance
##
## only 20 most important variables shown (out of 388)
##
## Overall
## X6 100.00
## X372 71.66
## X373 71.66
## X15 58.87
## X244 55.81
## X253 55.81
## X157 55.81
## X245 55.81
## X240 55.81
## X246 55.81
## X239 55.81
## X254 55.81
## X127 52.77
## X126 52.77
## X129 48.59
## X235 46.55
## X266 43.86
## X262 43.86
## X130 43.44
## X125 43.44
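The same importance scores can also be plotted, which is usually easier to scan than the table (a sketch using caret's plot method for varImp objects):
plot(varImp(model_pls), top = 20) # top 20 predictors ranked by PLS importance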
# Checking the accuracy on the test dataset
postResample(predict(model_pls, X_test), obs=y_test)
## RMSE Rsquared MAE
## 12.798513 0.261938 9.225175
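To see where that test-set error comes from, a predicted-versus-observed plot is a natural follow-up (a sketch; pred is a helper object introduced here, not part of the run above):
pred <- predict(model_pls, X_test)
plot(y_test, pred,
     xlab = "Observed permeability", ylab = "Predicted permeability",
     main = "PLS predictions on the held-out test set")
abline(0, 1, lty = 2) # points on this dashed line would be perfect predictions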