What is Partial Least Squares (PLS) Regression?

Partial Least Squares (PLS) regression is a technique that reduces the predictors to a smaller set of uncorrelated components and performs least squares regression on those components instead of on the original predictors. PLS regression is a solution for:

  • Multicollinearity, i.e., when the predictors are highly correlated.

  • When there are more predictors than observations, which encourages overfitting due to a lack of degrees of freedom.

  • When ordinary least squares regression either produces coefficients with high standard errors or fails completely.
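
For intuition, here is a minimal sketch on simulated data (the variables x1, x2, and y are illustrative and not part of the case study below): two nearly identical predictors make the OLS coefficients unstable, while a one-component PLS fit stays well behaved.

library(pls)

set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01) #nearly a copy of x1, so the predictors are severely collinear
y  <- 2 * x1 + rnorm(n)
dat <- data.frame(y, x1, x2)

coef(lm(y ~ x1 + x2, data = dat)) #OLS: large, offsetting coefficients with huge standard errors

plsfit <- plsr(y ~ x1 + x2, ncomp = 1, data = dat, validation = "CV")
summary(plsfit) #a single PLS component captures the shared signal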

Unlike multiple regression, PLS does not assume that the predictors are fixed. This means the predictors can be measured with error, making PLS more robust to measurement uncertainty.

PLS regression is used primarily in the chemical, drug, food, and plastics industries, so we will work through an example from that setting.

An Example using PLS

This example comes from Applied Predictive Modeling by Kuhn & Johnson, p. 139.

library(AppliedPredictiveModeling) #ChemicalManufacturingProcess data
library(tidyverse)                 #data manipulation and the pipe operator
library(MASS)
library(caret)                     #preprocessing, data splitting, and model training
library(pls)                       #partial least squares
library(Amelia)                    #missmap() for visualizing missing values

A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch:

  1. Start R and use these commands to load the data:
 data("ChemicalManufacturingProcess")

The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs. yield contains the percent yield for each run.

  2. A small percentage of cells in the predictor set contain missing values. Use an imputation function to fill in these missing values (e.g., see Sect. 3.8).
missmap(ChemicalManufacturingProcess, col = c("yellow", "navy"))

We can see that some predictors have missing values.

Below I will preprocess the data. This includes:

  1. Centering and scaling the data
  2. Using the knnImpute method to replace missing values
  3. Using the corr filter to remove highly correlated predictors
  4. Using the nzv filter to remove near-zero-variance predictors that could cause trouble
#preprocess the data, excluding the Yield column (column 1)
preprocessing <- preProcess(ChemicalManufacturingProcess[,-1], method = c("center", "scale", "knnImpute", "corr", "nzv")) 

Xpreprocess <- predict(preprocessing, ChemicalManufacturingProcess[,-1])
missmap(Xpreprocess, col = c("yellow", "navy"))

As seen in this second plot, the missing values were replaced and the data is now complete.
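
As a quick sanity check (a small sketch, not part of the original exercise), we can compare column names before and after preprocessing to see what the corr and nzv filters removed; the model summary below reports 56 predictors, so one of the original 57 was dropped:

setdiff(names(ChemicalManufacturingProcess)[-1], names(Xpreprocess)) #predictor(s) removed by the filters
ncol(Xpreprocess) #56 predictors remain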

  3. Split the data into a training and a test set, pre-process the data, and tune a model of your choice from this chapter. What is the optimal value of the performance metric?
yield <- as.matrix(ChemicalManufacturingProcess$Yield)

set.seed(789)
split <- yield %>%
  createDataPartition(p = 0.8, list = FALSE, times = 1)

Xtrain  <- Xpreprocess[split, ] #chem train
xtest <- Xpreprocess[-split, ] #chem test
Ytrain  <- yield[split, ] #yield train
ytest <- yield[-split, ] #yield test
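
A quick check of the split sizes (createDataPartition rounds the 80% split, so the training set gets 144 of the 176 runs, matching the resampling summary below):

nrow(Xtrain) #144 training runs
nrow(xtest)  #32 test runs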

Build Model

ctrl <- trainControl(method = "cv", number = 10)
plsmod <- train(x = Xtrain, y = Ytrain, method = "pls", tuneLength = 20, trControl = ctrl)

View Model

plsmod
## Partial Least Squares 
## 
## 144 samples
##  56 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 130, 128, 130, 129, 132, 129, ... 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE      Rsquared   MAE      
##    1     1.378920  0.4782463  1.1198697
##    2     1.209939  0.5699965  0.9880707
##    3     1.178335  0.6137609  0.9689162
##    4     1.174866  0.6287347  0.9724853
##    5     1.158603  0.6416229  0.9634906
##    6     1.168447  0.6285696  0.9637299
##    7     1.189409  0.6197823  0.9822850
##    8     1.217483  0.6095880  1.0118483
##    9     1.260720  0.5962066  1.0355060
##   10     1.306164  0.5843910  1.0510701
##   11     1.320139  0.5819132  1.0565354
##   12     1.333271  0.5712727  1.0671789
##   13     1.370425  0.5574550  1.0960167
##   14     1.410802  0.5471344  1.1283985
##   15     1.454649  0.5365969  1.1458422
##   16     1.481589  0.5298125  1.1540927
##   17     1.498174  0.5234310  1.1570644
##   18     1.534203  0.5085435  1.1732633
##   19     1.579129  0.4944106  1.1943942
##   20     1.619288  0.4841343  1.2124802
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 5.

Plot Model

plot(plsmod)

Model Summary

summary(plsmod$finalModel)
## Data:    X dimension: 144 56 
##  Y dimension: 144 1
## Fit method: oscorespls
## Number of components considered: 5
## TRAINING: % variance explained
##           1 comps  2 comps  3 comps  4 comps  5 comps
## X           17.45    26.59    32.23    38.72    45.95
## .outcome    49.72    64.52    70.22    72.24    73.37

Optimal value: The optimal number of PLS components included in the model is 5. These components capture 45.95% of the variation in the predictors and 73.37% of the variation in the outcome variable (yield).

The lowest point on the RMSE curve marks the optimal number of components, i.e., the smallest cross-validated error. We can extract this value with:

plsmod$bestTune
##   ncomp
## 5     5

  4. Predict the response for the test set. What is the value of the performance metric and how does this compare with the resampled performance metric on the training set?
predictions <- plsmod %>% predict(xtest)

#observed vs. predicted on the test set; points on the orange line are perfect predictions
plot(ytest, predictions, col = "darkgreen", main = "Observed vs. Predicted",
     xlab = "Observed", ylab = "Predicted")
abline(0, 1, col = "orange")

cbind(
  RMSE = RMSE(predictions, ytest),
  R_squared = caret::R2(predictions, ytest)
)
##          RMSE R_squared
## [1,] 1.119891 0.5183108

The test-set RMSE (1.12) is slightly lower, hence better, than the resampled training RMSE (1.16), so the model generalizes well; note, however, that the test R² (0.52) is lower than the resampled R² (0.64).
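
For a side-by-side comparison, the resampled metrics at the chosen number of components can be pulled directly from the train object:

#resampled (training) metrics at the selected ncomp
subset(plsmod$results, ncomp == plsmod$bestTune$ncomp)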

  5. Which predictors are most important in the model you have trained? Do either the biological or process predictors dominate the list?
varImp(plsmod)
## pls variable importance
## 
##   only 20 most important variables shown (out of 56)
## 
##                        Overall
## ManufacturingProcess32  100.00
## ManufacturingProcess13   94.04
## ManufacturingProcess17   88.41
## ManufacturingProcess09   82.88
## ManufacturingProcess36   78.05
## BiologicalMaterial02     61.80
## ManufacturingProcess06   60.10
## BiologicalMaterial06     59.29
## ManufacturingProcess12   58.69
## BiologicalMaterial08     57.11
## ManufacturingProcess33   56.93
## BiologicalMaterial12     55.68
## BiologicalMaterial03     55.57
## BiologicalMaterial11     54.49
## ManufacturingProcess11   51.19
## BiologicalMaterial04     49.62
## BiologicalMaterial01     48.58
## ManufacturingProcess04   43.29
## ManufacturingProcess28   42.97
## ManufacturingProcess37   39.43
plot(varImp(plsmod))

Based on the plot and the values displayed, the manufacturing process predictors dominate the list.
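
We can confirm this with a quick tally of the top 20 predictors by group (a small sketch built on the varImp scores):

imp   <- varImp(plsmod)$importance
top20 <- rownames(imp)[order(-imp$Overall)][1:20]
table(ifelse(grepl("^Manufacturing", top20), "Process", "Biological"))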

  6. Explore the relationships between each of the top predictors and the response. How could this information be helpful in improving yield in future runs of the manufacturing process?

For this question, I’ll only look at the top-ranked predictor from each group: the manufacturing processes and the biological materials.

cor(yield, ChemicalManufacturingProcess$ManufacturingProcess32)
##           [,1]
## [1,] 0.6083321
cor(yield, ChemicalManufacturingProcess$BiologicalMaterial02)
##           [,1]
## [1,] 0.4815158

As stated in the introduction to this exercise, biological materials are used to assess the quality of raw materials before processing; if the raw material is good, the yield of the product may increase. Looking at the top biological material, we can see that it is positively but moderately correlated with the response variable (r ≈ 0.48).

On the other hand, manufacturing processes are the adjustable steps taken to create the end product. The top process predictor shows an even stronger positive correlation with yield (r ≈ 0.61), which makes sense: since process predictors can be changed, ManufacturingProcess32 is a natural lever for improving yield in future runs.
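
To visualize these relationships, we can plot each top predictor against yield (a quick sketch):

par(mfrow = c(1, 2))
plot(ChemicalManufacturingProcess$ManufacturingProcess32, ChemicalManufacturingProcess$Yield,
     xlab = "ManufacturingProcess32", ylab = "Yield", col = "darkgreen")
plot(ChemicalManufacturingProcess$BiologicalMaterial02, ChemicalManufacturingProcess$Yield,
     xlab = "BiologicalMaterial02", ylab = "Yield", col = "blue")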



Sources

Kuhn, Max, and Kjell Johnson. Applied Predictive Modeling. Springer, 2013. Springer Link, link.springer.com/book/10.1007/978-1-4614-6849-3.
