Partial least squares regression (PLS regression) is a statistical method that bears some relation to principal components regression; instead of finding hyperplanes of minimum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space. Because both the X and Y data are projected to new spaces, the PLS family of methods is known as bilinear factor models. Partial least squares discriminant analysis (PLS-DA) is a variant used when the response Y is categorical.
In many data sets, the predictors we use are correlated both with the response (what we are trying to predict) and with each other. When too many predictor variables are correlated with one another (multicollinearity), the resulting variance in the coefficient estimates renders the regression unstable.
One way to correct for predictor correlation is principal component analysis (PCA), which seeks linear combinations of the predictors that capture the most variance.
PLS regression, like PCA, seeks components that capture the variability of the predictors, but it differs from PCA in that the components are also required to have maximum covariance with the response. PLS is a supervised procedure, whereas PCA is unsupervised.
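The pls package makes this distinction concrete: pcr() fits principal component regression (unsupervised components) and plsr() fits partial least squares (supervised components) through the same formula interface. As a sketch, with mydata and y as placeholder names:
library(pls)
# PCR: components chosen only to capture predictor variance
pcr.fit = pcr(y ~ ., data=mydata, validation="CV")
# PLS: components must also covary with the response
pls.fit = plsr(y ~ ., data=mydata, validation="CV")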
We will use the solubility data from the AppliedPredictiveModeling package.
# load libraries
library(MASS)
library(caret)
library(AppliedPredictiveModeling)
library(lars)
library(pls) # which is the PLS function library
library(elasticnet)
data(solubility)
First we need to create a data.frame that includes both the predictors and the target variable. Note that we use the transformed data set and create a new variable, training$solubility, to which we assign the target solubility vector.
training = solTrainXtrans # transformed using Box-Cox
training$solubility = solTrainY
In the pls library, we use the plsr() function for partial least squares regression. We specify solubility as the response variable, use all the predictor variables with ".", and request cross-validation by setting validation="CV".
set.seed(2167)
plsFit = plsr(solubility ~ ., data=training, validation="CV")
pls.pred = predict(plsFit, solTestXtrans[1:5, ], ncomp=1:2)
With the familiar predict() function, we use the plsFit object, which contains the fitted model for each number of components (computed by default with the Dayal and MacGregor kernel algorithm, "kernelpls"). We predict the first 5 test solubility values with solTestXtrans[1:5, ], using 1 and 2 components (ncomp=1:2). The results are shown below.
pls.pred
## , , 1 comps
##
## solubility
## 20 -1.789335
## 21 -1.427551
## 23 -2.268798
## 25 -2.269782
## 28 -1.867960
##
## , , 2 comps
##
## solubility
## 20 0.2520469
## 21 0.3555028
## 23 -1.8795338
## 25 -0.6848584
## 28 -1.5531552
But how do we know how many components to choose? There are several methods. We can check the output of summary(plsFit), which falls into two parts; we should look for the number of components that adequately explains the variance of both the predictors and the response.
The first part displays the cross-validation estimate of the root mean squared error of prediction (RMSEP), along with adjCV, which is adjusted for bias. Take note of the dimensions of X and Y near the top of the output.
The second part shows the percent of variance explained by the components, for the predictors and for the response. The explained variance rises quickly from 1 component and stabilizes above 10 to 13 components; that is a good component range for the PLS model.
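For reference, the call is simply:
# CV error (RMSEP, adjCV) and percent variance explained per component
summary(plsFit)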
Another method is to plot the root mean squared error of prediction (RMSEP) and locate its lowest point.
There are two ways to produce the RMSEP visualization. The first uses the validationplot() function from the pls package, which gives the RMSEP plot for each number of components in the model; specify the validation type with val.type="RMSEP".
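In its simplest form, the call looks like this:
validationplot(plsFit, val.type="RMSEP") # CV RMSEP vs. number of components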
We can also plot the R2:
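Again with validationplot(), this time requesting R2:
validationplot(plsFit, val.type="R2") # CV R-squared vs. number of components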
We see from the validation plot that the root mean squared error decreases sharply over the first few components and then plateaus after about 10. Keep in mind the summary information, where the increments in explained variance also stabilize around 10 components.
However, we want to find the component number corresponding to the lowest point of the RMSEP curve. Using the RMSEP() function with the estimate parameter set to "CV", we can derive the desired value.
To find the lowest RMSEP, we use the which.min() function, which returns the index of the lowest value; the minimum is found at 11, and we add a point to the plot using the points() function such that the x and y values are the component number and the minimum RMSEP value. (Strictly speaking, the first entry of pls.RMSEP$val is the zero-component, intercept-only model, so index 11 corresponds to 10 components; the curve is essentially flat in this region, so we proceed with 11.)
pls.RMSEP = RMSEP(plsFit, estimate="CV")
plot(pls.RMSEP, main="RMSEP PLS Solubility", xlab="components")
min_comp = which.min(pls.RMSEP$val)
points(min_comp, min(pls.RMSEP$val), pch=1, col="red", cex=1.5)
min_comp
## [1] 11
Using the training data, we can plot the predicted solubilities against the actual solubilities, using another plot() function:
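A minimal sketch of that plot, where pls.fitted is a name we introduce here (predict() with no newdata argument returns predictions for the training data):
pls.fitted = predict(plsFit, ncomp=11)[, 1, 1] # fitted training values at 11 components
plot(solTrainY, pls.fitted, main="Training Dataset", xlab="observed", ylab="PLS Predicted")
abline(0, 1, col="red") # perfect-prediction line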
Now that we have an optimal component number (11), we can go ahead and use the plsr model to predict new solubility values for test predictor data. Again we use the predict() function and specify the test data (solTestXtrans) and the number of components (ncomp=11).
pls.pred2 = predict(plsFit, solTestXtrans, ncomp=11)
plot(solTestY, pls.pred2, ylim=c(-11,2), xlim=c(-11,2), main="Test Dataset", xlab="observed", ylab="PLS Predicted")
abline(0, 1, col="red")
Because we have the observed solubility values for the test data, we can evaluate how well the model performed. Under normal circumstances, however, we would not know the actual values; that is precisely the point of what we are doing here, predicting values from the test predictor data.
Using the observed values in solTestY and the predicted values in pls.pred2[,1,1], we can use the defaultSummary() function to obtain the RMSE and Rsquared, as shown below.
pls.eval=data.frame(obs=solTestY, pred=pls.pred2[,1,1])
defaultSummary(pls.eval)
## RMSE Rsquared
## 0.7374743 0.8744013
From the output, we see that the RMSE is about 0.7375 and the Rsquared approximately 0.8744, meaning the model explains about 87% of the variance in the test data.