Partial Least Squares (PLS) regression reduces the predictors to a smaller set of uncorrelated components and performs least squares regression on these components rather than on the original predictors. PLS regression is useful in the following situations:
Multicollinearity, that is, when the predictors are highly collinear.
More predictors than observations, which encourages overfitting because too few degrees of freedom remain.
Ordinary least-squares regression either produces coefficients with high standard errors or fails completely.
Unlike multiple regression, PLS does not assume that the predictors are fixed. The predictors can therefore be measured with error, which makes PLS more robust to measurement uncertainty.
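To make the idea of "components" concrete, here is a minimal base-R sketch on simulated data (not the manufacturing data used later). It builds two PLS components for a single response by deflation, in the spirit of the NIPALS algorithm. All object names are illustrative, and the `pls` package used later in this post does this work for you.

```r
set.seed(1)
n <- 50
X <- scale(matrix(rnorm(n * 5), n, 5))        # centered and scaled toy predictors
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n, sd = 0.3)
y <- y - mean(y)                              # centered toy response

# First component: the direction in X with maximal covariance with y
w1 <- drop(crossprod(X, y))
w1 <- w1 / sqrt(sum(w1^2))
t1 <- drop(X %*% w1)

# Deflate X (remove what t1 explains), then repeat for the second component
p1 <- drop(crossprod(X, t1)) / sum(t1^2)
X2 <- X - outer(t1, p1)
w2 <- drop(crossprod(X2, y))
w2 <- w2 / sqrt(sum(w2^2))
t2 <- drop(X2 %*% w2)

cor(t1, t2)                  # numerically zero: the components are uncorrelated
fit <- lm(y ~ t1 + t2)       # ordinary least squares on the two components
coef(fit)
```

Because the components are built to covary with the response, a handful of them can stand in for many collinear predictors.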
PLS regression is primarily used in the chemical, drug, food, and plastic industries. The example below comes from that kind of setting.
This example comes from Applied Predictive Modeling by Kuhn & Johnson (p. 139).
library(AppliedPredictiveModeling)
library(tidyverse)
library(MASS)
library(caret)
library(pls)
library(Amelia)
A chemical manufacturing process for a pharmaceutical product was discussed in Sect. 1.4. In this problem, the objective is to understand the relationship between biological measurements of the raw materials (predictors), measurements of the manufacturing process (predictors), and the response of product yield. Biological predictors cannot be changed but can be used to assess the quality of the raw material before processing. On the other hand, manufacturing process predictors can be changed in the manufacturing process. Improving product yield by 1% will boost revenue by approximately one hundred thousand dollars per batch.
The matrix processPredictors contains the 57 predictors (12 describing the input biological material and 45 describing the process predictors) for the 176 manufacturing runs, and yield contains the percent yield for each run.
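The code later in this post indexes a single data frame, ChemicalManufacturingProcess, and drops its first column as the yield. Assuming that layout (yield in column 1, followed by the 57 predictors), a minimal sketch of loading the data and checking for missing values might look like this. The name `na_counts` is illustrative.

```r
library(AppliedPredictiveModeling)
library(Amelia)

data(ChemicalManufacturingProcess)
dim(ChemicalManufacturingProcess)   # runs x (yield + predictors)

# Number of missing values per column, showing only columns that have any
na_counts <- colSums(is.na(ChemicalManufacturingProcess))
na_counts[na_counts > 0]

# Missingness map of the raw predictors (yield column excluded)
missmap(ChemicalManufacturingProcess[, -1], col = c("yellow", "navy"))
```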
We can see some predictors do have missing values.
Below I preprocess the data. This includes:
centering and scaling the data
knn imputation to replace missing values
corr to filter out highly correlated predictors
nzv to filter near zero variance predictors that could cause trouble.
#preprocess data excluding the yield column
preprocessing <- preProcess(ChemicalManufacturingProcess[,-1], method = c("center", "scale", "knnImpute", "corr", "nzv"))
Xpreprocess <- predict(preprocessing, ChemicalManufacturingProcess[,-1])
missmap(Xpreprocess, col = c("yellow", "navy"))
As seen in this second plot, the missing values were replaced and the data is now complete.
yield <- as.matrix(ChemicalManufacturingProcess$Yield)
set.seed(789)
split <- yield %>%
  createDataPartition(p = 0.8, list = FALSE, times = 1)
Xtrain <- Xpreprocess[split, ] #chem train
xtest <- Xpreprocess[-split, ] #chem test
Ytrain <- yield[split, ] #yield train
ytest <- yield[-split, ] #yield test
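The call that produced the two printouts below is not shown. The following is a sketch of one that would give output of this shape, assuming 10-fold cross-validation and a grid of 1 to 20 components (both read off the printout). The object names `ctrl` and `pls_fit` and the seed are hypothetical, so the exact numbers will differ.

```r
library(caret)
library(pls)

# Assumption: 10-fold CV, as reported in the printed resampling summary
set.seed(100)  # arbitrary seed; fold assignments will not match the printout
ctrl <- trainControl(method = "cv", number = 10)

# `pls_fit` is a hypothetical name; Xtrain and Ytrain come from the split above
pls_fit <- train(
  x = Xtrain,
  y = Ytrain,
  method = "pls",
  tuneGrid = expand.grid(ncomp = 1:20),
  trControl = ctrl
)

pls_fit                      # resampled RMSE, Rsquared and MAE for each ncomp
summary(pls_fit$finalModel)  # % variance explained in X and in the outcome
```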
## Partial Least Squares
##
## 144 samples
## 56 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 130, 128, 130, 129, 132, 129, ...
## Resampling results across tuning parameters:
##
## ncomp RMSE Rsquared MAE
## 1 1.378920 0.4782463 1.1198697
## 2 1.209939 0.5699965 0.9880707
## 3 1.178335 0.6137609 0.9689162
## 4 1.174866 0.6287347 0.9724853
## 5 1.158603 0.6416229 0.9634906
## 6 1.168447 0.6285696 0.9637299
## 7 1.189409 0.6197823 0.9822850
## 8 1.217483 0.6095880 1.0118483
## 9 1.260720 0.5962066 1.0355060
## 10 1.306164 0.5843910 1.0510701
## 11 1.320139 0.5819132 1.0565354
## 12 1.333271 0.5712727 1.0671789
## 13 1.370425 0.5574550 1.0960167
## 14 1.410802 0.5471344 1.1283985
## 15 1.454649 0.5365969 1.1458422
## 16 1.481589 0.5298125 1.1540927
## 17 1.498174 0.5234310 1.1570644
## 18 1.534203 0.5085435 1.1732633
## 19 1.579129 0.4944106 1.1943942
## 20 1.619288 0.4841343 1.2124802
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 5.
## Data: X dimension: 144 56
## Y dimension: 144 1
## Fit method: oscorespls
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 17.45 26.59 32.23 38.72 45.95
## .outcome 49.72 64.52 70.22 72.24 73.37
Optimal value: The optimal number of PLS components is 5. On the training data, these five components capture 45.95% of the variation in the predictors and 73.37% of the variation in the outcome variable (yield).
The lowest point on the cross-validation curve marks the optimal value, that is, the number of components with the smallest cross-validated RMSE. We can extract this value as:
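A sketch of calls that would draw the curve and return the selected value, assuming the tuned caret model is stored in an object with the hypothetical name `pls_fit`:

```r
library(caret)

# `pls_fit` is a hypothetical name for the caret::train object tuned earlier
plot(pls_fit)      # cross-validated RMSE against the number of components
pls_fit$bestTune   # one-row data frame holding the selected ncomp
```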
## ncomp
## 5 5
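The plotting code that follows uses an object called predictions. A sketch of how it could be produced from the held-out predictors, again assuming the tuned model lives in a hypothetical object `pls_fit`:

```r
library(caret)

# `pls_fit` is a hypothetical name for the tuned caret::train object;
# xtest is the held-out predictor set created in the split above
predictions <- predict(pls_fit, newdata = xtest)
head(predictions)
```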
# Observed yield on the x-axis, predicted yield on the y-axis
plot(ytest, predictions, col = "darkgreen", main = "Observed vs. Predicted", xlab = "Observed", ylab = "Predictions")
# Reference line: points on it are predicted perfectly
abline(0, 1, col = "orange")
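The test-set metrics reported next can be computed in several ways. The base-R sketch below shows one of them, taking R squared as the squared correlation between observed and predicted values. The helper name `test_metrics` is hypothetical, and the small vectors only demonstrate the calculation.

```r
# Hypothetical helper: RMSE and squared correlation of observed vs. predicted
test_metrics <- function(obs, pred) {
  cbind(
    RMSE = sqrt(mean((obs - pred)^2)),
    R_squared = cor(obs, pred)^2
  )
}

# Toy check on three made-up points
m <- test_metrics(obs = c(1, 2, 3), pred = c(1, 2, 4))
m
```

On the held-out data the same helper would be called as test_metrics(ytest, predictions).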
## RMSE R_squared
## [1,] 1.119891 0.5183108
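The test-set RMSE (1.12) is slightly lower, and therefore better, than the cross-validated RMSE at 5 components (1.16). The test-set R-squared (0.52), however, is lower than the cross-validated value (0.64), which is worse.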
## pls variable importance
##
## only 20 most important variables shown (out of 56)
##
## Overall
## ManufacturingProcess32 100.00
## ManufacturingProcess13 94.04
## ManufacturingProcess17 88.41
## ManufacturingProcess09 82.88
## ManufacturingProcess36 78.05
## BiologicalMaterial02 61.80
## ManufacturingProcess06 60.10
## BiologicalMaterial06 59.29
## ManufacturingProcess12 58.69
## BiologicalMaterial08 57.11
## ManufacturingProcess33 56.93
## BiologicalMaterial12 55.68
## BiologicalMaterial03 55.57
## BiologicalMaterial11 54.49
## ManufacturingProcess11 51.19
## BiologicalMaterial04 49.62
## BiologicalMaterial01 48.58
## ManufacturingProcess04 43.29
## ManufacturingProcess28 42.97
## ManufacturingProcess37 39.43
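A ranking like the one above can be obtained with caret's varImp(). This is a sketch that assumes the tuned model is stored in a hypothetical object `pls_fit`; `pls_imp` is likewise an illustrative name.

```r
library(caret)

# `pls_fit` is a hypothetical name for the tuned caret::train object
pls_imp <- varImp(pls_fit)   # by default scaled so the top predictor scores 100
pls_imp                      # prints the 20 most important predictors
plot(pls_imp, top = 20)      # dot plot of the same ranking
```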
Based on the plot and the values displayed, the manufacturing process predictors dominate the top of the list: the five most important predictors are all process variables.
For this question, I'll only look at the top-ranked predictor from each group, manufacturing processes and biological materials, and its correlation with yield.
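The two correlations printed below could come from calls like the following. This is a sketch under two assumptions: that the correlations are taken between the preprocessed predictors and yield over all runs, and that the process predictor is computed first.

```r
# Xpreprocess and yield are the objects created earlier in this post.
# Assumption: top process predictor first, then top biological predictor.
cor_process <- cor(Xpreprocess$ManufacturingProcess32, yield)
cor_bio <- cor(Xpreprocess$BiologicalMaterial02, yield)

cor_process
cor_bio
```

Because yield is stored as a one-column matrix, each call returns a 1 x 1 matrix, which matches the shape of the printed output.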
## [,1]
## [1,] 0.6083321
## [,1]
## [1,] 0.4815158
As stated in the introduction to this question, biological materials are used to assess the quality of the raw material before processing. If the raw material is of good quality, the yield of the product may increase. The top biological material predictor is positively but moderately correlated with the response variable.
Manufacturing process predictors, on the other hand, presumably describe the steps taken to create the end product. The top process predictor is positively correlated with yield as well, which makes sense: if the process is run well, the yield should be higher.
Kuhn, Max, and Kjell Johnson. Applied Predictive Modeling. Springer, 2013, Springer Link, link.springer.com/book/10.1007/978-1-4614-6849-3.