PCR combines predictors in an unsupervised way.1010 The response, Y, is not used to generate the principal component Partial Least Squares (PLS) is a supervised alternative.11

We can apply PCR to the Hitters dataset in order to predict Salary. First, we must remove any missing values.

Data preparation

names(Hitters) #column names

##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"    
##  [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"     
## [13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"   
## [19] "Salary"    "NewLeague"

dim(Hitters) #Rows and Columns

## [1] 322  20

#Number of NA values for salary
sum(is.na(Hitters$Salary))

## [1] 59

Hitters=na.omit(Hitters) #Remove NAs
dim(Hitters) #Rows and Columns

## [1] 263  20

sum(is.na(Hitters)) #Check Number of NA values

## [1] 0

Pls model

The function allows us to standardize the data and use 10 fold cross-validation for each possible value of M.

library(pls) #load library

## Warning: package 'pls' was built under R version 4.0.5

## 
## Attaching package: 'pls'

## The following object is masked from 'package:stats':
## 
##     loadings

set.seed(2) #make replicable
#Fit PCR model
#Standardize data with scale=TRUE
#10 fold cross-validation for each value of M
pcr.fit=pcr(Salary~.,data=Hitters, scale=TRUE, 
             validation="CV")

We can view the resulting fit with summary. The CV score is provided for every value of M and represents the root mean squared error. To obtain the MSE we must square it.1414 MSE of 352.82=124,468

The summmary function also provides the percentage of variance explained in the predictors and the response. For example, M=1 captures 38.31% of the variance in the predictors and 40.63% of the variance in the response.

View resulting fit

summary(pcr.fit)

## Data:    X dimension: 263 19 
##  Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV             452    351.9    353.2    355.0    352.8    348.4    343.6
## adjCV          452    351.6    352.7    354.4    352.1    347.6    342.7
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       345.5    347.7    349.6     351.4     352.1     353.5     358.2
## adjCV    344.7    346.7    348.5     350.1     350.7     352.0     356.5
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        349.7     349.4     339.9     341.6     339.2     339.6
## adjCV     348.0     347.7     338.2     339.7     337.2     337.6
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X         38.31    60.16    70.84    79.03    84.29    88.63    92.26    94.96
## Salary    40.63    41.58    42.17    43.22    44.90    46.48    46.69    46.75
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X         96.28     97.26     97.98     98.65     99.15     99.47     99.75
## Salary    46.86     47.76     47.82     47.85     48.10     50.40     50.55
##         16 comps  17 comps  18 comps  19 comps
## X          99.89     99.97     99.99    100.00
## Salary     53.01     53.85     54.61     54.61

We can see from the plot that the smallest root MSE occurs when M=16. However, it is not much smaller then when M=1 and so a simpler model could be used. Also, M=19 is equivalent to M=p and the least squares model.1515 PCR results in only a modest improvement over linear regression

1.Plot Root MSE

validationplot(pcr.fit, val.type="MSEP")

2.Plot Root MSE

We can also perform PCR on a training set and then evaluate the model on a test set.

set.seed(1) # make replicable
#training set = half the data
train=sample(1:nrow(Hitters),nrow(Hitters)/2)
#test set will be remaining rows
test=(-train)
#response variable
y=Hitters$Salary
#response for test set
y.test=y[test]

set.seed(1) #make replicable

Perform PCR

set.seed(1) #make replicable
pcr.fit=pcr(Salary~.,data=Hitters,subset=train,
             scale=TRUE, validation="CV")

View MSE plot

validationplot(pcr.fit,val.type = "MSEP")

3.Plot Root MSE

We can see from the plot that the lowest CV error occurs at M=7.

Using the predict funtion we find that the test MSE is 96556

#Predict on test set with M=7
pcr.pred=predict(pcr.fit,Hitters[test,],ncomp=7)
#MSE value
mean((pcr.pred-y.test)^2)

## [1] 140751.3

#[1] 96556.22

Finally, we can fit the full model using M=7.

PCR dataset with M=7

Finally, we can fit the full model using M=7.

pcr.fit=pcr(Salary~.,data=Hitters,
             scale=TRUE, ncomp=7)

View resulting fit

summary(pcr.fit)

## Data:    X dimension: 263 19 
##  Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 7
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         38.31    60.16    70.84    79.03    84.29    88.63    92.26
## Salary    40.63    41.58    42.17    43.22    44.90    46.48    46.69

PLS Example

We can apply PLS with the plsr() function.

set.seed(1)#make replicable
#Fit PLS model on training set
pls.fit=plsr(Salary~.,data=Hitters,subset=train,
             scale=TRUE, validation="CV")

View model results

summary(pls.fit)

## Data:    X dimension: 131 19 
##  Y dimension: 131 1
## Fit method: kernelpls
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           428.3    325.5    329.9    328.8    339.0    338.9    340.1
## adjCV        428.3    325.0    328.2    327.2    336.6    336.1    336.6
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       339.0    347.1    346.4     343.4     341.5     345.4     356.4
## adjCV    336.2    343.4    342.8     340.2     338.3     341.8     351.1
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        348.4     349.1     350.0     344.2     344.5     345.0
## adjCV     344.2     345.0     345.9     340.4     340.6     341.1
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X         39.13    48.80    60.09    75.07    78.58    81.12    88.21    90.71
## Salary    46.36    50.72    52.23    53.03    54.07    54.77    55.05    55.66
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X         93.17     96.05     97.08     97.61     97.97     98.70     99.12
## Salary    55.95     56.12     56.47     56.68     57.37     57.76     58.08
##         16 comps  17 comps  18 comps  19 comps
## X          99.61     99.70     99.95    100.00
## Salary     58.17     58.49     58.56     58.62

Plot root MSE

validationplot(pls.fit,val.type = "MSEP")

4.Root MSE by PLS

Root MSE by PLS component The lowest CV score occurs when M=2.

The test set MSE of 101417 is higher than it was for PCR

Predictions test M=2

pls.pred=predict(pls.fit,Hitters[test,],ncomp=2)

MSE value

mean((pls.pred-y.test)^2)

## [1] 145367.7

#[1] 101417.5

Finally, we perform PLS using the full dataset with M=2.

PLS model M=2

pls.fit=plsr(Salary~.,data=Hitters,scale=TRUE,
              ncomp=2)

View results

summary(pls.fit)

## Data:    X dimension: 263 19 
##  Y dimension: 263 1
## Fit method: kernelpls
## Number of components considered: 2
## TRAINING: % variance explained
##         1 comps  2 comps
## X         38.08    51.03
## Salary    43.05    46.40

Partial Least Squares

Crafted by Bambangpe

Data preparation

Pls model

View resulting fit

1.Plot Root MSE

2.Plot Root MSE

Perform PCR

View MSE plot

3.Plot Root MSE

PCR dataset with M=7

View resulting fit

PLS Example

View model results

Plot root MSE

4.Root MSE by PLS

Predictions test M=2

MSE value

PLS model M=2

View results