PCR combines predictors in an unsupervised way.1010 The response, Y, is not used to generate the principal component Partial Least Squares (PLS) is a supervised alternative.11

We can apply PCR to the Hitters dataset in order to predict Salary. First, we must remove any missing values.

Data preparation

names(Hitters) #column names
##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"    
##  [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"     
## [13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"   
## [19] "Salary"    "NewLeague"
dim(Hitters) #Rows and Columns
## [1] 322  20
#Number of NA values for salary
sum(is.na(Hitters$Salary)) 
## [1] 59
Hitters=na.omit(Hitters) #Remove NAs
dim(Hitters) #Rows and Columns
## [1] 263  20
sum(is.na(Hitters)) #Check Number of NA values
## [1] 0

Pls model

The function allows us to standardize the data and use 10 fold cross-validation for each possible value of M.

library(pls) #load library
## Warning: package 'pls' was built under R version 4.0.5
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(2) #make replicable
#Fit PCR model
#Standardize data with scale=TRUE
#10 fold cross-validation for each value of M
pcr.fit=pcr(Salary~.,data=Hitters, scale=TRUE, 
             validation="CV")

We can view the resulting fit with summary. The CV score is provided for every value of M and represents the root mean squared error. To obtain the MSE we must square it.1414 MSE of 352.82=124,468

The summmary function also provides the percentage of variance explained in the predictors and the response. For example, M=1 captures 38.31% of the variance in the predictors and 40.63% of the variance in the response.

View resulting fit

summary(pcr.fit)
## Data:    X dimension: 263 19 
##  Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV             452    351.9    353.2    355.0    352.8    348.4    343.6
## adjCV          452    351.6    352.7    354.4    352.1    347.6    342.7
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       345.5    347.7    349.6     351.4     352.1     353.5     358.2
## adjCV    344.7    346.7    348.5     350.1     350.7     352.0     356.5
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        349.7     349.4     339.9     341.6     339.2     339.6
## adjCV     348.0     347.7     338.2     339.7     337.2     337.6
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X         38.31    60.16    70.84    79.03    84.29    88.63    92.26    94.96
## Salary    40.63    41.58    42.17    43.22    44.90    46.48    46.69    46.75
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X         96.28     97.26     97.98     98.65     99.15     99.47     99.75
## Salary    46.86     47.76     47.82     47.85     48.10     50.40     50.55
##         16 comps  17 comps  18 comps  19 comps
## X          99.89     99.97     99.99    100.00
## Salary     53.01     53.85     54.61     54.61

We can see from the plot that the smallest root MSE occurs when M=16. However, it is not much smaller then when M=1 and so a simpler model could be used. Also, M=19 is equivalent to M=p and the least squares model.1515 PCR results in only a modest improvement over linear regression

1.Plot Root MSE

validationplot(pcr.fit, val.type="MSEP")

2.Plot Root MSE

We can also perform PCR on a training set and then evaluate the model on a test set.

set.seed(1) # make replicable
#training set = half the data
train=sample(1:nrow(Hitters),nrow(Hitters)/2)
#test set will be remaining rows
test=(-train)
#response variable
y=Hitters$Salary
#response for test set
y.test=y[test]

set.seed(1) #make replicable

Perform PCR

set.seed(1) #make replicable
pcr.fit=pcr(Salary~.,data=Hitters,subset=train,
             scale=TRUE, validation="CV")

View MSE plot

validationplot(pcr.fit,val.type = "MSEP")

3.Plot Root MSE

We can see from the plot that the lowest CV error occurs at M=7.

Using the predict funtion we find that the test MSE is 96556

#Predict on test set with M=7
pcr.pred=predict(pcr.fit,Hitters[test,],ncomp=7)
#MSE value
mean((pcr.pred-y.test)^2)
## [1] 140751.3
#[1] 96556.22

Finally, we can fit the full model using M=7.

PCR dataset with M=7

Finally, we can fit the full model using M=7.

pcr.fit=pcr(Salary~.,data=Hitters,
             scale=TRUE, ncomp=7)

View resulting fit

summary(pcr.fit)
## Data:    X dimension: 263 19 
##  Y dimension: 263 1
## Fit method: svdpc
## Number of components considered: 7
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         38.31    60.16    70.84    79.03    84.29    88.63    92.26
## Salary    40.63    41.58    42.17    43.22    44.90    46.48    46.69

PLS Example

We can apply PLS with the plsr() function.

set.seed(1)#make replicable
#Fit PLS model on training set
pls.fit=plsr(Salary~.,data=Hitters,subset=train,
             scale=TRUE, validation="CV")

View model results

summary(pls.fit)
## Data:    X dimension: 131 19 
##  Y dimension: 131 1
## Fit method: kernelpls
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           428.3    325.5    329.9    328.8    339.0    338.9    340.1
## adjCV        428.3    325.0    328.2    327.2    336.6    336.1    336.6
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       339.0    347.1    346.4     343.4     341.5     345.4     356.4
## adjCV    336.2    343.4    342.8     340.2     338.3     341.8     351.1
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        348.4     349.1     350.0     344.2     344.5     345.0
## adjCV     344.2     345.0     345.9     340.4     340.6     341.1
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X         39.13    48.80    60.09    75.07    78.58    81.12    88.21    90.71
## Salary    46.36    50.72    52.23    53.03    54.07    54.77    55.05    55.66
##         9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X         93.17     96.05     97.08     97.61     97.97     98.70     99.12
## Salary    55.95     56.12     56.47     56.68     57.37     57.76     58.08
##         16 comps  17 comps  18 comps  19 comps
## X          99.61     99.70     99.95    100.00
## Salary     58.17     58.49     58.56     58.62

Plot root MSE

validationplot(pls.fit,val.type = "MSEP")

4.Root MSE by PLS

Root MSE by PLS component The lowest CV score occurs when M=2.

The test set MSE of 101417 is higher than it was for PCR

Predictions test M=2

pls.pred=predict(pls.fit,Hitters[test,],ncomp=2)

MSE value

mean((pls.pred-y.test)^2)
## [1] 145367.7
#[1] 101417.5

Finally, we perform PLS using the full dataset with M=2.

PLS model M=2

pls.fit=plsr(Salary~.,data=Hitters,scale=TRUE,
              ncomp=2)

View results

summary(pls.fit)
## Data:    X dimension: 263 19 
##  Y dimension: 263 1
## Fit method: kernelpls
## Number of components considered: 2
## TRAINING: % variance explained
##         1 comps  2 comps
## X         38.08    51.03
## Salary    43.05    46.40